By Stéphane Déprès – Sentiment analysis, using naive Bayes classification (one of the machine learning techniques), can be used to classify tweets within a given domain as positive or negative. This article presents an experiment in which this technique is applied to tweets about movies.
A naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with independence assumptions.
As a reminder, the joint probability of A and B equals the probability of A given B multiplied by the probability of B: P(A,B) = P(A|B)*P(B). Since P(A,B) also equals P(B|A)*P(A), Bayes' theorem follows: P(A|B) = P(B|A)*P(A)/P(B).
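The identities above can be checked on a small worked example. The numbers below (two urns and the event "a red ball is drawn") are invented purely for illustration:

```python
from fractions import Fraction

# Toy example (made-up numbers) checking Bayes' theorem:
# two urns B=1 and B=2; A = "drew a red ball".
p_b1, p_b2 = Fraction(1, 3), Fraction(2, 3)        # prior P(B)
p_a_b1, p_a_b2 = Fraction(3, 4), Fraction(1, 2)    # likelihood P(A|B)

# Total probability: P(A) = P(A|B=1)P(B=1) + P(A|B=2)P(B=2)
p_a = p_a_b1 * p_b1 + p_a_b2 * p_b2

# Bayes' theorem: P(B=1|A) = P(A|B=1)*P(B=1) / P(A)
p_b1_given_a = p_a_b1 * p_b1 / p_a

print(p_a)            # 7/12
print(p_b1_given_a)   # 3/7
```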
A naive Bayes classifier assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature when calculating the probability of a given class. (In our case the classes are positive and negative sentiment, and the features are the words in the tweets that indicate one sentiment or the other.) Despite this simplifying assumption, naive Bayes classifiers work quite well in many complex real-world situations.
The first thing to do is to extract a training set of tweets and to classify them manually as positive, negative or neutral.
The second step is to identify, for each tweet classified as positive or negative, the words that led to that classification. These words will be our features.
Let's call + the positive sentiment, - the negative sentiment, and Wi a word selected as a feature (i.e. one that contributed to the classification).
The third step is to calculate the following probabilities from our training set: P(Wi|+), P(Wi|-), P(+) and P(-). You just have to count!
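The counting step can be sketched as follows. The tiny hand-labelled training set is invented for illustration; only the counting logic matters:

```python
from collections import Counter

# Toy hand-labelled training set (invented for illustration);
# each entry is (list of feature words, sentiment label).
training = [
    (["great", "awesome"], "+"),
    (["great"], "+"),
    (["boring"], "-"),
    (["awful", "boring"], "-"),
]

# P(+) and P(-): just count tweets per class.
class_counts = Counter(label for _, label in training)
total = sum(class_counts.values())
p_class = {c: n / total for c, n in class_counts.items()}

# P(Wi|class): how often each word occurs in tweets of that class.
word_counts = {"+": Counter(), "-": Counter()}
for words, label in training:
    word_counts[label].update(words)

p_word = {
    c: {w: n / class_counts[c] for w, n in word_counts[c].items()}
    for c in word_counts
}

print(p_class["+"])          # 0.5
print(p_word["+"]["great"])  # 1.0
```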
Then, for a new tweet not in the training set and containing feature words W1 … Wn, you can compute the likelihood of positive versus negative as the ratio:

likelihood = ( P(+) * P(W1|+) * … * P(Wn|+) ) / ( P(-) * P(W1|-) * … * P(Wn|-) )
If this likelihood is much greater than one, the tweet is analysed as positive. If it is much less than one, the tweet is analysed as negative.
The implementation is quite trivial:
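A minimal Python sketch of such a classifier might look like the following. The toy data and the add-one (Laplace) smoothing are my assumptions, not from the original; smoothing simply keeps an unseen word from zeroing out a whole product:

```python
from collections import Counter

# Toy labelled training set (invented for illustration).
training = [
    (["great", "awesome"], "+"),
    (["great", "fun"], "+"),
    (["boring"], "-"),
    (["awful", "boring"], "-"),
]

classes = ["+", "-"]
n_tweets = Counter(label for _, label in training)
words = {c: Counter() for c in classes}
for feats, label in training:
    words[label].update(feats)
vocab = {w for c in classes for w in words[c]}

def p_word_given_class(word, c):
    # Laplace-smoothed estimate of P(word|class).
    return (words[c][word] + 1) / (n_tweets[c] + len(vocab))

def likelihood_ratio(features):
    # P(+)/P(-) times the product of per-word likelihood ratios.
    ratio = n_tweets["+"] / n_tweets["-"]
    for w in features:
        ratio *= p_word_given_class(w, "+") / p_word_given_class(w, "-")
    return ratio

def classify(features):
    return "+" if likelihood_ratio(features) > 1 else "-"

print(classify(["great", "fun"]))   # +
print(classify(["boring"]))         # -
```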
You should be aware that the result is not 100% accurate. On 1,000 tweets classified both automatically and manually, only 74% were appropriately classified: 7% were labelled positive instead of negative (or vice versa), and 19% were classified as positive or negative when they should have been neutral.
So why is the error rate so high?
Anyway, depending on the purpose of the analysis, this error rate can be considered acceptable.
And the result for 8 movies is:
I used the movie domain for this experiment but note that the same analysis can be performed on other domains: products, politicians, companies…
Classes other than positive and negative can also be selected. Note also that sentiment analysis may be performed along a timeline.
Bear in mind that a supervised learning phase is required for each domain (because some words may be specific to a given domain) and, of course, for each language.
Other classifiers can also be used, for instance Support Vector Machines and Maximum Entropy ^{(2)}. Note also that a naive Bayes classifier can be improved by adding semantics as additional features in the training set ^{(3)}.
I performed the experiment on only a low volume of data (900 tweets per movie, processed movie by movie). For a higher volume, I could have used MapReduce, with Python's mincemeat library or with Hadoop.
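The counting step parallelises naturally under the MapReduce pattern. The sketch below illustrates the map/reduce split in plain Python (it is not the mincemeat or Hadoop API; with those, the same two functions would run distributed across workers), using invented labelled tweets:

```python
from collections import defaultdict

# Invented labelled tweets: (sentiment label, feature words).
tweets = [
    ("+", ["great", "awesome"]),
    ("-", ["boring"]),
    ("+", ["great"]),
]

def map_fn(label, words):
    # Map: emit one ((word, class), 1) pair per feature word.
    for w in words:
        yield (w, label), 1

def reduce_fn(pairs):
    # Reduce: sum the counts for each (word, class) key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

counts = reduce_fn(p for t in tweets for p in map_fn(*t))
print(counts[("great", "+")])   # 2
print(counts[("boring", "-")])  # 1
```

These per-key counts are exactly the quantities needed to estimate P(Wi|+) and P(Wi|-) at scale.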
Stéphane Déprès