
Twitter sentiment analysis in Hadoop using Apache Pig

We have seen Twitter analysis using Hive in many places. Here I am going to present my way of analyzing the sentiment of tweets using Apache Pig.

What do we want to do?
           We want to analyse tweets to check whether they contain positive or negative emotions.
A tweet reflects the emotions of the person at the moment he or she posted it, for example "got this Job done...Hurrahh.." or "xyz Movie sucks!!!! worst movie I ever saw....".

What do we need?
        We need to fetch tweets from Twitter into HDFS so that we can do our analysis with the Hadoop ecosystem (Apache Pig here).


How will we do it?
            We will use Apache Flume to fetch tweets from Twitter into HDFS; the Flume version I am using here is apache flume-1.4.0. Then we will do some text analysis on the tweets to check whether they contain positive or negative emotions. We will use Apache Pig for this purpose; the version I have used is apache pig-0.11.0. We will write UDFs in Java to check for sentiments in the tweets, and finally we will categorize the tweets as having positive or negative emotions and store them separately with their respective users. To keep the job simple we will only analyze tweets posted in English. We will also only check whether a tweet contains positive or negative words; we are not doing any full-sentence or grammar analysis here.
       To check the complete project, please click here.
  
       For the tweet analysis I have used the word list of emotions published by Bing Liu and Minqing Hu.


Please check this blog post to learn how to fetch data from Twitter and sink it into HDFS.


In order to check the text of tweets for positive or negative emotions we need to write our own custom UDFs. The logic behind our UDF is that it builds an ArrayList of words that represent emotions and then checks whether the text of a tweet contains any of those words. Its return type is Boolean: it returns true if the tweet contains such words and false if it does not.
           As Pig does not support boolean as a full-fledged type (which limits how a filter function can be used in a foreach statement), we need to write our own filter function (FilterFunc). You can check the complete code of this UDF in my GitHub repository here.
Let's get started with our Java UDF: make a new class, Sentiments.java.
 In order to create a FilterFunc we need to import two classes:

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

Our class must extend the base class FilterFunc and override the method exec(Tuple input). This method is public, returns a Boolean, and takes the whole Tuple as input, so the method definition is as follows:

public Boolean exec(Tuple input) throws IOException {

}

All our Java logic will go inside this exec() method.
I have also defined my own constructor for the Sentiments class. It just takes the path of the emotion word list as input and stores it in a field, which the exec() method then reads.


 The skeleton of my UDF looks like the following:


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;


public class Sentiments extends FilterFunc {

    // Path to the emotion word list, given by the user in the Pig script
    String path;
    ArrayList<String> sentimentList = new ArrayList<String>();

    @Override
    public Boolean exec(Tuple input) throws IOException {
        /* all our Java logic will go here */
        return false; // placeholder so that the skeleton compiles
    }

    public Sentiments(String in) {
        /* the user-given path for the word list is "in",
           which is then stored in our field "path" */
        path = in;
    }

}

                               



I have built an ArrayList from the word list of emotions and checked whether the text of each tweet contains any of those words. If a tweet contains such emotion words, I keep it for further processing.
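The word-matching step described above can be sketched in plain Java, without the Pig classes, so that it can be tried on its own. This is only a minimal sketch and not the code from the repository: the class name WordListMatcher, its method names, and the assumption that the word list holds one word per line are all hypothetical. Inside a real FilterFunc, the same check would run in exec() on the tweet text taken from the input Tuple, with the list read via new FileReader(path).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not the author's actual UDF code.
class WordListMatcher {

    // Reads the word list, assuming one word per line, and skips blank lines.
    static List<String> loadWords(Reader source) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    // Returns true if any token of the text appears in the word list.
    // Tokens are split on anything that is not a letter, digit or apostrophe.
    static boolean containsAny(String text, List<String> words) {
        if (text == null) {
            return false;
        }
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (words.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the real word-list file here.
        List<String> words = loadWords(new StringReader("hurrah\nsucks\nworst\n"));
        System.out.println(containsAny("xyz Movie sucks!!!! worst movie I ever saw", words)); // true
        System.out.println(containsAny("going to the office now", words));                    // false
    }
}
```

Matching whole tokens rather than substrings is a design choice of this sketch; a plain String.contains() check would also report a hit for words embedded inside longer words.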

In the same way I have written another FilterFunc UDF to check whether the filtered tweets contain positive emotions. If it returns true I store the tweet among the tweets with positive emotions, and if it returns false I store it among the tweets with negative emotions.
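The two-step decision can be illustrated with a small plain-Java sketch. The class TweetRouter, its Category enum and the tiny word sets are hypothetical stand-ins chosen for this example; the real UDFs read their lists from the sentiment and positive word files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the two checks; the word sets are tiny stand-ins.
class TweetRouter {

    enum Category { POSITIVE, NEGATIVE, DISCARDED }

    private final Set<String> emotionWords;
    private final Set<String> positiveWords;

    TweetRouter(Set<String> emotionWords, Set<String> positiveWords) {
        this.emotionWords = emotionWords;
        this.positiveWords = positiveWords;
    }

    // True if any token of the text is in the given word set.
    private static boolean mentionsAny(String text, Set<String> words) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (words.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Step 1: drop tweets without emotion words.
    // Step 2: positive if a positive word is present, otherwise negative.
    Category route(String text) {
        if (text == null || !mentionsAny(text, emotionWords)) {
            return Category.DISCARDED;
        }
        return mentionsAny(text, positiveWords) ? Category.POSITIVE : Category.NEGATIVE;
    }

    public static void main(String[] args) {
        Set<String> positive = new HashSet<>(Arrays.asList("great", "love"));
        Set<String> emotion = new HashSet<>(Arrays.asList("great", "love", "worst", "sucks"));
        TweetRouter router = new TweetRouter(emotion, positive);
        System.out.println(router.route("I love this phone"));
        System.out.println(router.route("worst movie I ever saw"));
        System.out.println(router.route("not great, the worst"));
    }
}
```

Note that the last tweet in main is routed to POSITIVE: since only the presence of words is checked, a single positive word wins even next to a negation or a negative word. That is the limitation of skipping grammar analysis.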

The Pig script for this purpose is given below.

Type pig to go to the grunt shell.

As we know, a Pig script that uses a custom UDF must start with a register statement; otherwise it starts with a load statement. Here we have two UDFs, so we register them first.

Then we need to define them so that we can pass in the path to our word list, which is saved locally on our machine.



register '/PATH--TO--YOUR--UDFS--JAR/Twitter.jar';


define Sentiments Sentiments('/--Path to sentiments wordlist--/Sentiments-list/sentiment.txt');


define IfPositive IfPositive('/--Path to sentiments wordlist--/Sentiments-list/Positive.txt');


--WE ARE USING PIG'S BUILT-IN JSONLOADER TO LOAD TWITTER DATA, AS TWEETS ARE IN JSON FORMAT

tweets = load '/home/cloudera/new tweets/tweet' using JsonLoader('filter_level:chararray,
retweeted:bytearray, in_reply_to_screen_name:chararray,
possibly_sensitive:bytearray, truncated:chararray, lang:chararray, in_reply_to_status_id_str:chararray,
id:chararray, in_reply_to_user_id_str:chararray,
 timestamp_ms:chararray, in_reply_to_status_id:chararray, created_at:chararray, favorite_count:chararray, place:chararray,
coordinates:chararray, text:chararray,
contributors:chararray, geo:chararray, entities:map[], source:chararray, favorited:chararray, in_reply_to_user_id:chararray,
retweet_count:chararray, id_str:chararray,
 user:map[]');


--NOW THAT OUR TWEETS ARE LOADED WE WILL QUERY IT



--FILTER TWEETS TO KEEP ONLY TWEETS IN ENGLISH FOR ANALYSIS

english_tweets = filter tweets by lang=='en' and text is not null and user# 'name' is not null;




--FLAGGING TWEETS WHICH CONTAIN EMOTIONS THROUGH OUR UDF
sentiments_tweets  = foreach english_tweets generate  user# 'name' as user_name:chararray, text as tweets:chararray, Sentiments(text) as if_sentiments;



--KEEPING ONLY THOSE TWEETS WHICH CONTAIN EMOTIONS AND DISCARDING THE OTHER TWEETS

if_senti_tweets = filter sentiments_tweets by if_sentiments==true;



--ARRANGING FILTERED TWEETS WITH EMOTIONS WITH THEIR USER NAME AND TEXTS OF TWEETS

each_senti = foreach if_senti_tweets generate user_name, tweets;




--AGAIN USING OUR SECOND UDF TO CHECK IF TWEETS WITH EMOTIONS CONTAIN POSITIVE EMOTIONS OR NOT

sentiments_sorted = foreach each_senti generate user_name, IfPositive(tweets);



--SORTING TWEETS WITH POSITIVE EMOTIONS FOR FURTHER PROCESSING

positive_senti = filter sentiments_sorted by $1==true;




--GROUPING TWEETS WITH POSITIVE EMOTIONS AND COUNTING THEM TO STORE

grp_pos = group positive_senti by user_name;

count_pos = foreach grp_pos generate group, COUNT(positive_senti);

store count_pos into '/YOUR--PATH/output/positive_tweets';



--SORTING TWEETS WITH NEGATIVE EMOTIONS HERE

negative_senti = filter sentiments_sorted by $1==false;



--GROUPING, COUNTING AND STORING TWEETS WITH NEGATIVE EMOTIONS

grp_neg = group negative_senti by user_name;

count_neg = foreach grp_neg generate group, COUNT(negative_senti);

store count_neg into '/YOUR--PATH/output/negative_tweets';
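To make the shape of the stored output easier to picture, here is a rough plain-Java equivalent of the filter, group and COUNT steps applied to sentiments_sorted. It is only an illustration that runs in memory, not how Pig executes the script; the Row class and the sample user names are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory imitation of the last steps of the Pig script.
class PerUserCounts {

    // Stand-in for one row of sentiments_sorted: (user_name, IfPositive result).
    static final class Row {
        final String userName;
        final boolean positive;

        Row(String userName, boolean positive) {
            this.userName = userName;
            this.positive = positive;
        }
    }

    // Mirrors: filter ... by $1==flag; group ... by user_name; COUNT(...)
    static Map<String, Long> countByUser(List<Row> rows, boolean positive) {
        return rows.stream()
                .filter(row -> row.positive == positive)
                .collect(Collectors.groupingBy(row -> row.userName, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("alice", true),
                new Row("alice", true),
                new Row("bob", false),
                new Row("alice", false));
        System.out.println("positive: " + countByUser(rows, true));
        System.out.println("negative: " + countByUser(rows, false));
    }
}
```

Each output directory of the Pig script therefore holds one record per user: the user name and the number of that user's tweets in the given category.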



Comments

  1. I got stuck at a place (while writing the script) where I need to match the words of two files.

    My exact scenario is that I have a RESULT which contains a huge number of words, and I want to filter it against another file which has the desired words.

    Currently my script runs properly and also reads my files successfully, but it shows zero as the number of words written; I mean my output file is empty. I can't figure out my error. Can you help me with this?

    Replies
    1. Hmm.. can you provide some more details? BTW, have you tried the distributed cache? How are you matching the two files?
