
Apache Flume to fetch Twitter data

We are using Apache Flume to fetch Twitter data and store it in HDFS. So let's get started. The Flume version I am using here is apache-flume-1.4.0.


Download Apache Flume
On your Unix terminal, type this command:

 wget http://apache.mirrors.hoobly.com/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz

Create directory "flume-ng"
Create the directory in your /usr/lib folder with this command:
sudo mkdir /usr/lib/flume-ng


Now copy the Flume tar file you downloaded to the /usr/lib/flume-ng directory you have just created. The command is
sudo cp apache-flume-1.4.0-bin.tar.gz /usr/lib/flume-ng/


Check that the tar file was copied to your flume-ng directory with the command
ls /usr/lib/flume-ng/


Untar the tar file in the flume-ng directory, but first change your directory from your home directory to /usr/lib/flume-ng/
cd /usr/lib/flume-ng/
and now untar the file with the command
sudo tar -xvf /usr/lib/flume-ng/apache-flume-1.4.0-bin.tar.gz


Now it's time to check whether you have the required jar file in your apache-flume-1.4.0-bin/lib/ directory. Give the command below:
ls /usr/lib/flume-ng/apache-flume-1.4.0-bin/lib/flume-*

Scroll until you see the file with the name below:
/usr/lib/flume-ng/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar

If it's not there, search for it online and copy it to that directory. It is not part of the standard Flume download, so you may have to add it yourself.
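
The same check can be scripted instead of scrolling through the listing. This is a minimal sketch, assuming the jar belongs in the lib directory of the extracted distribution; the function name check_jar is made up for this example.

```shell
# check_jar is a hypothetical helper: reports whether the given jar file exists
check_jar() {
  if [ -f "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# path assumed from the extraction step earlier in this post
check_jar /usr/lib/flume-ng/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar || true
```

If it prints "missing", copy the jar into that lib directory before going on.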

Now we need to edit some configuration files. First we need a flume-env.sh. A template for this file is provided in the conf/ directory under the name flume-env.sh.template, so copy it:

sudo cp /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-env.sh.template /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-env.sh

Now open flume-env.sh with gedit:
sudo gedit /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-env.sh

In this file you need to set two paths, one for JAVA_HOME and one for FLUME_CLASSPATH (the path to our jar file in the lib folder).
Set JAVA_HOME to your Java path. In my case it was like this:

JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24

Now set FLUME_CLASSPATH, which in our case is

FLUME_CLASSPATH=/usr/lib/flume-ng/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar
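
Put together, the relevant part of flume-env.sh would look roughly like this. It is only a sketch: the Java path is the one from my machine and yours will likely differ, and FLUME_DIR is a helper variable introduced here just to keep the jar path short.

```shell
# flume-env.sh (sketch)
# FLUME_DIR is a hypothetical helper variable, not something Flume itself reads
FLUME_DIR=/usr/lib/flume-ng/apache-flume-1.4.0-bin

# Java installation to use; replace with the path on your machine
JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24

# extra jar to put on Flume's classpath (no spaces anywhere in the path)
FLUME_CLASSPATH="$FLUME_DIR/lib/flume-sources-1.0-SNAPSHOT.jar"
```

Quoting the value keeps the shell from splitting the path if a space slips in by accident.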

We are almost set. Now it's time to open Twitter's website. Go to https://dev.twitter.com/ and log in, then create a new app, fill in the required details, and click on "create your access token".
Now refresh the page and copy a few things:
1. value for your consumer key
2. value for your consumer secret
3. value for your access token
4. value for your access token secret


Finally, we need to create one more configuration file in our conf directory, the flume.conf file. The command is given below.

sudo gedit /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume.conf

This opens a blank file. You can check the complete flume.conf file here

Just replace the access token and consumer keys and secrets with your own keys and secrets. I will explain flume.conf here.

The first line uses the name of our Flume agent, which we called TwitterAgent, and declares its source:
TwitterAgent.sources= Twitter

Our sink is HDFS, hence the line TwitterAgent.sinks=HDFS. Next we need to define a channel, which links the source and the sink. In our case it is MemChannel, as in the following line:
TwitterAgent.channels= MemChannel

The next four things are your consumer key and secret and your access token and secret.
In TwitterAgent.sources.Twitter.keywords you can put any words you want to use as keywords.
Furthermore, our sink is HDFS, so we need to provide a few more things, such as the file system URL, the path in which to create files, and the frequency of file rotation, which are given here:

TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
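
For reference, the pieces described above can be assembled into one file. This is only a sketch and not necessarily the exact file from this post: it assumes the source class is com.cloudera.flume.source.TwitterSource (the class usually packaged in flume-sources-1.0-SNAPSHOT.jar), and the keywords, fileType, writeFormat, batchSize and channel capacity values are illustrative assumptions. CONF_FILE is a made-up variable; by default the file is written to the current directory so you can review it before copying it into the conf directory. Replace the angle-bracket placeholders with your own keys.

```shell
# CONF_FILE is a hypothetical variable: where to write the sketch
CONF_FILE="${CONF_FILE:-flume.conf}"

cat > "$CONF_FILE" <<'EOF'
# agent name, with its source, channel and sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source (class name is an assumption, check your jar)
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data

# HDFS sink, with the path and roll settings from this post
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

# in-memory channel linking source and sink (sizes are illustrative)
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
EOF
```

With these roll settings a new file is started every 10000 events or every 600 seconds, whichever comes first, and rollSize=0 turns off rolling by file size. The batchSize is kept no larger than the channel's transactionCapacity so the sink never asks for more events than one channel transaction can hold.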

When you are done editing flume.conf, save and close it, and change your directory to /usr/lib/flume-ng/apache-flume-1.4.0-bin/bin/
Give the command below to start fetching data from Twitter:
./flume-ng agent -n TwitterAgent -c /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf -f /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume.conf

You can check whether your data has started coming in. Do you remember we set TwitterAgent.sinks.HDFS.hdfs.path? Well, that's the place where the tweets are collected. In our case, go to the /user/flume/tweets directory in your HDFS file system.
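
One way to look at that directory from the terminal is sketched below. It assumes the hadoop command is on your PATH and that HDFS is running; TWEETS_DIR is just a variable name chosen for this example.

```shell
# directory taken from TwitterAgent.sinks.HDFS.hdfs.path
TWEETS_DIR=/user/flume/tweets

if command -v hadoop >/dev/null 2>&1; then
  # list the files Flume has written so far
  hadoop fs -ls "$TWEETS_DIR" || echo "could not list $TWEETS_DIR (is HDFS running?)"
else
  echo "hadoop command not found on PATH"
fi
```

Unless you set a different file prefix, the files written by the HDFS sink should have names starting with FlumeData.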
