We have seen Twitter analysis using Hive in many places. Here I am going to present my way of
analyzing the sentiment of tweets using Apache Pig.
What do we want to do?
We want to analyse tweets to check whether they contain positive or negative emotions.
A tweet reflects its author's emotions at the time of posting, for example "got this Job
done...Hurrahh.." or "xyz Movie sucks!!!! worst movie I ever saw....".
What do we need?
We need to fetch tweets from Twitter into HDFS so that we can analyse them with the Hadoop ecosystem (Apache Pig here).
We will use Apache Flume to fetch tweets from Twitter into HDFS; the Flume version I am using here is apache flume-1.4.0.
Then we will do some text analysis on the tweets to check whether they contain positive or negative emotions. We will use Apache Pig for this purpose; the version I have used is apache pig-0.11.0. We will write UDFs in Java to check for sentiment in the tweets, and finally we will categorize the tweets as positive or negative and store them separately with their respective users.
To keep the job simple, we will only analyze tweets posted in English. We also only check whether a tweet contains positive or negative words; we are not doing any full-sentence or grammar analysis here.
To check out the complete project, please click here.
For the tweet analysis I have used the list of emotion words presented by Bing Liu and Minqing Hu.
Please check here at this blog to learn how to fetch data from Twitter and sink it into HDFS.
In order to check the text of tweets for positive or negative emotions we need to write
our own custom UDFs. The logic behind our UDF is that it builds an ArrayList of words
that represent emotions and then checks whether the text of a tweet contains any of
those words. Its return type is Boolean: it returns true if the tweet contains any of
those words and false if it does not.
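The check described above can be tried outside Pig first. The following is only a minimal sketch in plain Java, not the code from the repository: the class name WordMatchSketch and the method containsAny are hypothetical, and splitting the text on non-letter characters is an assumption about how words are matched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not the repository class): shows only the
// "does the tweet contain any listed word" check.
final class WordMatchSketch {

    // Returns true if at least one token of the text appears in the word list.
    static boolean containsAny(String text, List<String> words) {
        if (text == null || words == null) {
            return false;
        }
        Set<String> lookup = new HashSet<>();
        for (String word : words) {
            lookup.add(word.toLowerCase(Locale.ROOT));
        }
        // Assumption: split on anything that is not a letter or an apostrophe,
        // so that "sucks!!!!" matches the list entry "sucks".
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (lookup.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("sucks", "worst");
        // prints true
        System.out.println(containsAny("xyz Movie sucks!!!! worst movie I ever saw....", words));
        // prints false
        System.out.println(containsAny("Flume writes tweets to HDFS", words));
    }
}
```

Matching whole tokens rather than substrings avoids counting "class" as a hit for "ass", but it also means a stretched spelling such as "Hurrahh" will not match "hurrah".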
As Pig does not support boolean as a full-fledged type (i.e. we cannot use a filter
function directly in a foreach statement), we need to write our own filter function.
You can check the complete code of this UDF in my GitHub repository here.
Let's get started with our Java UDF. Make a new class, Sentiments.java.
In order to create a FilterFunc we need to import two classes:
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
Our class must extend the base class FilterFunc and must override the method
exec(Tuple input). This method is public, returns a Boolean, and takes the whole
Tuple as input. Because it declares throws IOException, java.io.IOException must be
imported as well. The method definition is as follows:
public Boolean exec(Tuple input) throws IOException {
}
All our Java logic will go inside this exec() method.
I have also defined a constructor for the Sentiments class. It simply takes the path
of the emotion word list as input and stores it in a field that exec() can read.
The skeleton of my UDF looks like the following:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class Sentiments extends FilterFunc {

    // Path of the word list, supplied through the constructor.
    String path;
    ArrayList<String> sentimentList = new ArrayList<String>();

    public Boolean exec(Tuple input) throws IOException {
        /* All our Java logic will go here. */
        return false; // placeholder so that the skeleton compiles
    }

    /* The user-given path for the word list is "in",
       which is then stored in our field "path". */
    public Sentiments(String in) {
        path = in;
    }
}
I have built an ArrayList from the list of emotion words and checked whether the text
of each tweet contains any of those words. If a tweet contains emotion words, I kept it
for further processing.
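Building that list from the word-list file could look roughly like this. It is a sketch under assumptions, not the repository code: WordListLoaderSketch and load are hypothetical names, the file is assumed to hold one word per line, and blank lines and lines starting with ';' are assumed to be header comments that should be skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical loader: reads one word per line into an ArrayList.
final class WordListLoaderSketch {

    static List<String> load(Reader source) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                // Assumption: blank lines and ';' lines are not words.
                if (word.isEmpty() || word.startsWith(";")) {
                    continue;
                }
                words.add(word);
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the real file in this demo.
        System.out.println(load(new StringReader("; header\n\ngood\nBad \n")));
    }
}
```

For a real file, pass new java.io.FileReader(path) instead of the StringReader. Loading the list once and reusing it, rather than re-reading the file for every tuple, keeps the per-tweet cost low.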
In the same way I have written another FilterFunc UDF to check whether the filtered
tweets contain positive emotions. If it returned true I stored the tweet with the
positive ones, and if it returned false I stored it with the negative ones.
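The two-way split can be illustrated in plain Java as follows. This is a hedged sketch, not the IfPositive UDF itself: PolaritySplitSketch, hasPositiveWord and splitByPolarity are hypothetical names, and the positive word set is assumed to be lower-case already.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration of splitting sentiment tweets by polarity.
final class PolaritySplitSketch {

    // True if any token of the tweet is in the (lower-case) positive word set.
    static boolean hasPositiveWord(String tweet, Set<String> positiveWords) {
        for (String token : tweet.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (positiveWords.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Key true holds the positive tweets, key false holds the rest.
    static Map<Boolean, List<String>> splitByPolarity(List<String> tweets,
                                                      Set<String> positiveWords) {
        return tweets.stream()
                .collect(Collectors.partitioningBy(t -> hasPositiveWord(t, positiveWords)));
    }

    public static void main(String[] args) {
        Set<String> positive = new HashSet<>(Arrays.asList("love", "great"));
        List<String> tweets = Arrays.asList("I love Pig", "worst movie ever");
        Map<Boolean, List<String>> result = splitByPolarity(tweets, positive);
        System.out.println("positive: " + result.get(true));
        System.out.println("negative: " + result.get(false));
    }
}
```

Note what this rule implies: a tweet is treated as negative simply because it has no positive word, so it only makes sense on tweets that already passed the first filter, and a tweet containing both positive and negative words lands in the positive group.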
The Pig script for this purpose is given below.
Type pig to go to the Grunt shell.
As we know, a Pig script must start with a register statement if we are using a custom
UDF; otherwise it starts with a load statement. Here we have two UDFs, so we register
their jar first.
Then we need to define them in order to pass in the paths to our word lists, which are
saved locally on our machine.
register '/PATH--TO--YOUR--UDFS--JAR/Twitter.jar';
define Sentiments Sentiments('/--Path to sentiments wordlist--/Sentiments-list/sentiment.txt');
define IfPositive IfPositive('/--Path to sentiments wordlist--/Sentiments-list/Positive.txt');
--WE ARE USING PIG'S BUILT-IN JSONLOADER TO LOAD TWITTER DATA, AS TWEETS ARE IN JSON FORMAT
tweets = load '/home/cloudera/new tweets/tweet' using JsonLoader('filter_level:chararray, retweeted:bytearray, in_reply_to_screen_name:chararray, possibly_sensitive:bytearray, truncated:chararray, lang:chararray, in_reply_to_status_id_str:chararray, id:chararray, in_reply_to_user_id_str:chararray, timestamp_ms:chararray, in_reply_to_status_id:chararray, created_at:chararray, favorite_count:chararray, place:chararray, coordinates:chararray, text:chararray, contributors:chararray, geo:chararray, entities:map[], source:chararray, favorited:chararray, in_reply_to_user_id:chararray, retweet_count:chararray, id_str:chararray, user:map[]');
--NOW THAT OUR TWEETS ARE LOADED WE WILL QUERY THEM
--FILTER TWEETS TO KEEP ONLY TWEETS IN ENGLISH FOR ANALYSIS
english_tweets = filter tweets by lang=='en' and text is not null and user# 'name' is not null;
--FILTERING TWEETS WHICH CONTAIN EMOTIONS THROUGH OUR UDF
sentiments_tweets = foreach english_tweets generate
    user# 'name' as user_name:chararray, text as tweets:chararray,
    Sentiments(text) as if_sentiments;
--KEEPING ONLY THOSE TWEETS WHICH CONTAIN EMOTIONS AND DISCARDING THE OTHER TWEETS
if_senti_tweets = filter sentiments_tweets by if_sentiments==true;
--ARRANGING FILTERED TWEETS WITH EMOTIONS WITH THEIR USER NAME AND TEXT OF TWEETS
each_senti = foreach if_senti_tweets generate user_name, tweets;
--AGAIN USING OUR SECOND UDF TO CHECK IF TWEETS WITH EMOTIONS CONTAIN POSITIVE EMOTIONS OR NOT
sentiments_sorted = foreach each_senti generate user_name, IfPositive(tweets);
--SORTING TWEETS WITH POSITIVE EMOTIONS FOR FURTHER PROCESSING
positive_senti = filter sentiments_sorted by $1==true;
--GROUPING TWEETS WITH POSITIVE EMOTIONS AND COUNTING THEM TO STORE
grp_pos = group positive_senti by user_name;
count_pos = foreach grp_pos generate group, COUNT(positive_senti);
store count_pos into '/YOUR--PATH/output/positive_tweets';
--SORTING TWEETS WITH NEGATIVE EMOTIONS HERE
negative_senti = filter sentiments_sorted by $1==false;
--GROUPING, COUNTING AND STORING TWEETS WITH NEGATIVE EMOTIONS
grp_neg = group negative_senti by user_name;
count_neg = foreach grp_neg generate group, COUNT(negative_senti);
store count_neg into '/YOUR--PATH/output/negative_tweets';
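To sanity-check the stored counts on a small sample, the group-and-count step at the end of the script can be mirrored in a few lines of plain Java. This is only an illustrative sketch: PerUserCountSketch and countByUser are hypothetical names, and each row is assumed to be a {user_name, tweet text} pair.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical mirror of "group ... by user_name" followed by COUNT.
final class PerUserCountSketch {

    // Each row is {user_name, tweet_text}; the result maps user to tweet count.
    static Map<String, Long> countByUser(List<String[]> rows) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] row : rows) {
            counts.merge(row[0], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> positiveRows = Arrays.asList(
                new String[] {"alice", "I love Pig"},
                new String[] {"bob", "great job"},
                new String[] {"alice", "what a great day"});
        // prints {alice=2, bob=1}
        System.out.println(countByUser(positiveRows));
    }
}
```

Running it once on the positive rows and once on the negative rows gives the same per-user shape as the two stored outputs.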
I got stuck at a place (while writing a script) where I need to match the words of two files.
My exact scenario is that I have a RESULT which contains a huge number of words, and I want to filter it with another file which has the desired words.
Currently my script runs properly and also reads my files successfully, but it shows zero for the number of words written, i.e. my output file is empty. I can't figure out my error. Can you help me with this?
Hmm.. can you provide some more details? BTW, have you tried the distributed cache? How are you matching the two files?