We are using Apache Flume to fetch Twitter data and store it in HDFS. So let's get started. The Flume version I am using here is apache-flume-1.4.0.
Download Apache Flume
On your Unix terminal, type this command:
wget http://apache.mirrors.hoobly.com/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz
Create the directory "flume-ng"
Create a directory in your /usr/lib folder with this command:
sudo mkdir /usr/lib/flume-ng
Now copy the Flume tar file you downloaded to the /usr/lib/flume-ng directory you just created. The command is:
sudo cp apache-flume-1.4.0-bin.tar.gz /usr/lib/flume-ng/
Check that the tar file was copied to your flume-ng directory with this command:
ls /usr/lib/flume-ng/
Untar the tar file in the flume-ng directory, but first change your directory from your home directory to /usr/lib/flume-ng/:
cd /usr/lib/flume-ng/
Now untar the file with this command:
sudo tar -xvf /usr/lib/flume-ng/apache-flume-1.4.0-bin.tar.gz
Now it's time to check whether you have the required jar file in your apache-flume-1.4.0-bin/lib/ directory. Give the command below:
ls /usr/lib/flume-ng/apache-flume-1.4.0-bin/lib/flume-*
Scroll until you see the file with the name below:
/usr/lib/flume-ng/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar
If it's not there, search for it online and copy it to that directory.
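If you would rather not scroll through the listing, the check can be scripted. This is only a small sketch: the function name find_twitter_source_jar is made up for this post, and it assumes the jar keeps the exact file name shown above.

```shell
# Hypothetical helper: report whether the Twitter source jar is present
# in the given Flume lib directory.
find_twitter_source_jar() {
    lib_dir="$1"
    jar="$lib_dir/flume-sources-1.0-SNAPSHOT.jar"
    if [ -f "$jar" ]; then
        # Print the full path so it can be reused later
        echo "$jar"
    else
        echo "not found: $jar" >&2
        return 1
    fi
}
```

You would call it as find_twitter_source_jar /usr/lib/flume-ng/apache-flume-1.4.0-bin/lib and it prints the full path when the jar is there.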
Now we need to edit some configuration files. First we need a flume-env.sh. A template for this file is provided in the conf/ directory under the name flume-env.sh.template, so copy it:
sudo cp /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-env.sh.template /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-env.sh
Now open flume-env.sh with gedit:
sudo gedit /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-env.sh
In this file you need to set two paths: one for JAVA_HOME and one for FLUME_CLASSPATH (the path to our jar file in the lib folder).
Set JAVA_HOME to your Java path. In my case it was:
JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24
Now we have to set FLUME_CLASSPATH, which in our case is:
FLUME_CLASSPATH=/usr/lib/flume-ng/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar
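A stray space or a wrong path in flume-env.sh is easy to miss, so it can help to test the file before moving on. The sketch below is one way to do that; the function name check_flume_env is my own, and it assumes FLUME_CLASSPATH holds a single jar path as in this setup.

```shell
# Hypothetical helper: source a flume-env.sh inside a subshell and check
# that both settings are usable. The subshell keeps the variables from
# leaking into the current session.
check_flume_env() (
    env_file="$1"
    # Ignore values inherited from the environment; only the file counts
    unset JAVA_HOME FLUME_CLASSPATH
    . "$env_file" || exit 1
    [ -n "$JAVA_HOME" ] || { echo "JAVA_HOME is not set"; exit 1; }
    [ -f "$FLUME_CLASSPATH" ] || { echo "FLUME_CLASSPATH is not a file: $FLUME_CLASSPATH"; exit 1; }
    echo "flume-env.sh ok"
)
```

Run it as check_flume_env /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-env.sh after saving the file.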
We are almost set. Now it's time to open Twitter's website. Go to https://dev.twitter.com/ and enter your login credentials, then create a new app, fill in the required details, and click on create your access token.
Now refresh the page and copy a few things:
1. the value of your consumer key
2. the value of your consumer secret
3. the value of your access token
4. the value of your access token secret
Finally, we need to create one more configuration file in our Flume conf directory: flume.conf. The command is given below.
sudo gedit /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume.conf
This will create a blank file. You can check the complete flume.conf file here. Just replace the access token and consumer keys and secrets with your own keys and secrets. I will explain flume.conf here.
Every line starts with the name of our Flume agent, which in our case is TwitterAgent. Our source is Twitter, hence this line:
TwitterAgent.sources = Twitter
Our sink is HDFS, hence this line:
TwitterAgent.sinks = HDFS
Next we need to define a channel, which links the source and the sink. In our case it is MemChannel, as in the following line:
TwitterAgent.channels = MemChannel
The next four things are your consumer key and consumer secret, and your access token and access token secret.
In TwitterAgent.sources.Twitter.keywords you can put any words you want to track as keywords.
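Furthermore, our sink is HDFS, so we need to set a few other things too, such as the file system URL, the path in which to create files, and how often files are rotated. In the lines below, rollSize=0 turns off rotation by file size, rollCount=10000 rotates a file after 10000 events, and rollInterval=600 rotates it after 600 seconds: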
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
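To see how these pieces fit together, here is a rough sketch that writes a minimal flume.conf from the shell. Treat it as an illustration, not as the complete file linked above. A few things in it are my assumptions: the function name write_flume_conf, the TWITTER_* environment variable names, the example keywords, the memory channel capacities, and the source class com.cloudera.flume.source.TwitterSource, which is the class I would expect inside flume-sources-1.0-SNAPSHOT.jar.

```shell
# Hypothetical helper: write a minimal flume.conf for the Twitter-to-HDFS agent.
# Credentials come from environment variables (names invented for this sketch);
# placeholders are written when a variable is not set.
write_flume_conf() {
    conf_file="$1"
    cat > "$conf_file" <<EOF
# Name the source, channel and sink of the agent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source (class name is an assumption, see the text above)
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = ${TWITTER_CONSUMER_KEY:-YOUR_CONSUMER_KEY}
TwitterAgent.sources.Twitter.consumerSecret = ${TWITTER_CONSUMER_SECRET:-YOUR_CONSUMER_SECRET}
TwitterAgent.sources.Twitter.accessToken = ${TWITTER_ACCESS_TOKEN:-YOUR_ACCESS_TOKEN}
TwitterAgent.sources.Twitter.accessTokenSecret = ${TWITTER_ACCESS_TOKEN_SECRET:-YOUR_ACCESS_TOKEN_SECRET}
TwitterAgent.sources.Twitter.keywords = hadoop, big data

# HDFS sink
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

# Memory channel (capacities are illustrative values)
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
EOF
}
```

Writing to a scratch path first, for example write_flume_conf /tmp/flume.conf, lets you review the result before copying it into the conf directory with sudo.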
When you are done editing flume.conf, save and close it, and change your directory to /usr/lib/flume-ng/apache-flume-1.4.0-bin/bin/.
Give the command below to start fetching data from Twitter:
./flume-ng agent -n TwitterAgent -c ../conf -f /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume.conf
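While testing, it is handy to see the agent's log messages directly in the terminal. A common way to do that is to pass -Dflume.root.logger=INFO,console to flume-ng. The sketch below only builds and prints such a command with absolute paths so it works from any directory; the function name flume_agent_command is made up, and it assumes the Flume home path contains no spaces.

```shell
# Hypothetical helper: print the flume-ng command line for the Twitter agent,
# using absolute paths and console logging.
flume_agent_command() {
    flume_home="$1"
    echo "$flume_home/bin/flume-ng agent -n TwitterAgent -c $flume_home/conf -f $flume_home/conf/flume.conf -Dflume.root.logger=INFO,console"
}
```

Calling flume_agent_command /usr/lib/flume-ng/apache-flume-1.4.0-bin prints the command so you can check it first, then paste it into the terminal to start the agent.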
You can check whether your data has started coming in. Do you remember the TwitterAgent.sinks.HDFS.hdfs.path we set? That's the place where the tweets are collected. In our case, go to the /user/flume/tweets directory in your HDFS file system.
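From the terminal, the same check can be done with the hadoop command. This is a minimal sketch that assumes the hadoop client is on your PATH and pointed at the same cluster; the function name list_tweet_files is my own.

```shell
# Hypothetical helper: list the files Flume has written so far.
# Defaults to the directory used in hdfs.path above.
list_tweet_files() {
    hadoop fs -ls "${1:-/user/flume/tweets}"
}
```

With the default sink settings, the files should show up with names starting with FlumeData, and a file that is still being written carries a .tmp suffix.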