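Using Big Data Techniques to Assess Movies

Article added: 2014-06-27

By Stéphane Déprès – Sentiment analysis using a naive Bayes classifier, one of the machine learning techniques, can be used to classify tweets as positive or negative within a given domain. This article presents an experiment that applies the technique to tweets about movies.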

What is a naive Bayes classifier?

A naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with strong independence assumptions.
As a reminder, the probability of A and B equals the probability of A given B multiplied by the probability of B: P(A,B) = P(A|B)*P(B). Bayes' theorem follows from writing this product both ways: P(A|B) = P(B|A)*P(A)/P(B).
A naive Bayes classifier assumes that, within a given class, the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, which makes the probability of a class easy to calculate. (In our case the classes are the positive and the negative sentiment, and the features are the words in the tweets that indicate one sentiment or the other.) Despite this simplifying assumption, naive Bayes classifiers work quite well in many complex real-world situations.
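
To make the independence assumption concrete, here is a short derivation sketch. The symbols are illustrative: C stands for a class (positive or negative) and W_1 ... W_n for the feature words found in a tweet.

```latex
\begin{align*}
P(C \mid W_1,\dots,W_n)
  &= \frac{P(W_1,\dots,W_n \mid C)\,P(C)}{P(W_1,\dots,W_n)}
  && \text{(Bayes' theorem)} \\
  &= \frac{P(C)\,\prod_{i=1}^{n} P(W_i \mid C)}{P(W_1,\dots,W_n)}
  && \text{(assuming the } W_i \text{ are independent given } C\text{)}
\end{align*}
```

The denominator is the same for every class, so it cancels out when two classes are compared.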

How to use a naive Bayes classifier for sentiment analysis?

The first thing to do is to extract a training set of tweets and to classify them manually as positive, negative or neutral.

The second step is to identify, for each tweet classified as positive or negative, the words that led to that classification. These words will be our features.

Let's call + the positive sentiment, - the negative sentiment and Wi one word selected as a feature (that is, a word that contributed to the classification).


The third step is to calculate the following probabilities based on our training set: P(Wi|+), P(Wi|-), P(+), P(-). You just have to count!
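
The counting step can be sketched as follows. This is a minimal illustration, not the original implementation: the function name train, the whitespace tokenization and the add-one smoothing (which keeps a word never seen in one class from producing a zero probability) are all assumptions.

```python
from collections import Counter


def train(labelled_tweets, features):
    """Estimate P(+), P(-), P(Wi|+) and P(Wi|-) by counting.

    labelled_tweets: list of (text, label) pairs, label being '+' or '-'
    features: set of lowercase feature words Wi
    """
    class_counts = Counter()                         # tweets per class
    word_counts = {'+': Counter(), '-': Counter()}   # tweets per class containing Wi

    for text, label in labelled_tweets:
        class_counts[label] += 1
        words = set(text.lower().split())            # naive whitespace tokenization
        for word in words & features:
            word_counts[label][word] += 1

    total = sum(class_counts.values())
    priors = {c: class_counts[c] / total for c in ('+', '-')}

    # Add-one smoothing (an assumption, not taken from the article).
    cond = {
        c: {w: (word_counts[c][w] + 1) / (class_counts[c] + 2) for w in features}
        for c in ('+', '-')
    }
    return priors, cond
```

Here priors holds P(+) and P(-), and cond[c][w] holds the estimate of P(w|c).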

[Figure: tweets analysis]


Then, for a new tweet that is not in the training set and contains some of the Wi, you can calculate the likelihood of positive versus negative as follows:

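P(+|tweet) / P(-|tweet) = [P(+) / P(-)] * [P(W1|+) / P(W1|-)] * ... * [P(Wn|+) / P(Wn|-)]

where W1 ... Wn are the feature words found in the tweet.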

If this ratio is much greater than one, then the tweet is analysed as positive. If it is much less than one, then the tweet is analysed as negative.
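
A possible sketch of this decision rule is shown below. It multiplies P(+)/P(-) by P(Wi|+)/P(Wi|-) for each feature word found in the tweet, working in logarithms to avoid numerical underflow. The function names, the margin parameter (how far from one the ratio must be before a tweet is called positive or negative) and the 'neutral' fallback are assumptions made for the illustration.

```python
import math


def likelihood_ratio(text, priors, cond):
    """Ratio of positive to negative likelihood for one tweet.

    priors: {'+': P(+), '-': P(-)}
    cond:   {'+': {Wi: P(Wi|+)}, '-': {Wi: P(Wi|-)}}
    """
    log_ratio = math.log(priors['+'] / priors['-'])
    for word in set(text.lower().split()):
        # Only words selected as features contribute to the ratio.
        if word in cond['+'] and word in cond['-']:
            log_ratio += math.log(cond['+'][word] / cond['-'][word])
    return math.exp(log_ratio)


def classify(text, priors, cond, margin=2.0):
    """Label a tweet; margin is a hypothetical threshold, not from the article."""
    ratio = likelihood_ratio(text, priors, cond)
    if ratio > margin:
        return 'positive'
    if ratio < 1 / margin:
        return 'negative'
    return 'neutral'
```

For example, with equal priors, P(great|+) = 0.6 and P(great|-) = 0.2, a tweet containing "great" as its only feature word gets a ratio of about 3 and is labelled positive.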

How has it been implemented?

The implementation is quite simple:

  • fewer than 100 lines of Python, using Twython as a wrapper for the Twitter API,
  • 1000 tweets classified manually during the supervised learning phase.

What are the main difficulties?

Be aware that the result is not 100% accurate. Out of 1000 tweets classified both automatically and manually, only 74% were classified correctly: 7% were classified as positive instead of negative (or negative instead of positive), and 19% were classified as positive or negative when they should have been classified as neutral.

So why is this error rate so high?

  • First of all, tweets contain many spelling mistakes and abbreviations. Therefore some words can't be recognized,
  • Secondly, negation is only handled statistically,
  • Thirdly, it seems very difficult for an algorithm to correctly interpret a tweet containing humour, as in the following example: "The ending of #ManOfSteel is superb. Why? Because the movie ended there"!
  • Fourthly, some retrieved tweets may relate to the wrong topic. What if ManOfSteel is also the name of a famous wrestler?
  • And lastly, the analysis is performed on a set of words without a real 'understanding' of the tweet's meaning. For instance, the tweet "Waited for this movie, better be good! #manofsteel" is wrongly interpreted as positive because the algorithm does not 'understand' that the person had not watched the movie when she wrote the tweet.
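
As an illustration of the negation problem, one mitigation used in reference (2) is to prefix every word between a negation and the next punctuation mark with NOT_, so that "like" and "NOT_like" become distinct features. The sketch below is not part of the experiment described here; the function name, the short negation list and the tokenizing regular expression are assumptions.

```python
import re

# Hypothetical, deliberately short list of negation words.
NEGATIONS = {"not", "no", "never", "cannot"}
PUNCTUATION = set(".,!?;:")


def tag_negation(text):
    """Return the tweet's tokens, with NOT_ added after a negation."""
    tokens = re.findall(r"[\w'#@]+|[.,!?;:]", text.lower())
    tagged = []
    negating = False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False          # punctuation ends the negated span
            continue                  # punctuation itself is not a feature
        tagged.append("NOT_" + token if negating else token)
        if token in NEGATIONS or token.endswith("n't"):
            negating = True           # tag the words that follow
    return tagged
```

With this tagging, "I did not like the movie, but the music was good!" yields NOT_like, NOT_the and NOT_movie, while the words after the comma are left untouched.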

Anyway, depending on the purpose of the analysis, this error rate can be considered acceptable.

And the result for 8 movies is:

[Figure: results for the 8 movies]

To go further:

I used the movie domain for this experiment, but note that the same analysis can be performed on other domains: products, politicians, companies…
Classes other than positive and negative can also be selected. Note also that sentiment analysis may be performed over a timeline.

Just keep in mind that a supervised learning phase is required for each domain (because some words may be specific to a given domain) and, of course, for each language.

Other classifiers can also be used, for instance 'Support Vector Machine' and 'Maximum Entropy' (2). Note also that the naive Bayes classifier can be improved by adding semantics as additional features into the training set (3).

I performed the experiment on a low volume of data only (900 tweets per movie, movie by movie). For a higher volume, I could have used MapReduce, with Python mincemeat or with Hadoop.
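
To give an idea of how the counting step maps onto MapReduce, here is a sketch in plain Python. It does not use the mincemeat or Hadoop APIs: the map and reduce phases are ordinary functions run in a single process, and all names are assumptions made for the illustration.

```python
from collections import defaultdict
from itertools import chain


def map_tweet(labelled_tweet):
    """Map phase: emit ((label, word), 1) for each distinct word in a tweet."""
    text, label = labelled_tweet
    for word in set(text.lower().split()):
        yield (label, word), 1


def reduce_counts(pairs):
    """Reduce phase: sum the emitted values for each (label, word) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_words(labelled_tweets):
    """Run both phases locally; a real framework would distribute them."""
    emitted = chain.from_iterable(map(map_tweet, labelled_tweets))
    return reduce_counts(emitted)
```

Dividing each count by the number of tweets in the corresponding class then gives estimates of P(Wi|+) and P(Wi|-).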

References:

  1. Wikipedia – "Sentiment Analysis" – http://en.wikipedia.org/wiki/Sentiment_analysis
  2. Bo Pang, Lillian Lee, Shivakumar Vaithyanathan – "Thumbs up? Sentiment Classification using Machine Learning Techniques" – http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
  3. Hassan Saif, Yulan He, Harith Alani – "Semantic Sentiment Analysis of Twitter" – http://iswc2012.semanticweb.org/sites/default/files/76490497.pdf

Stéphane Déprès