Transcript Sentiment Analysis

Some Important Techniques

Discussions: Based on Research Papers

Some Important Techniques

Multinomial Naive Bayes:

It supports the "bag of words" model. For example, an e-mail spam classifier might be based on features that count the occurrences of various tokens in an e-mail: one feature might count the number of exclamation points, another the number of times the word "money" appears, and another the number of times the recipient's name appears.

Process:

For each class C, P(w|C), the probability of observing word w given the class, is estimated from the training data by computing the relative frequency of each word in the collection of training documents of that class. The classifier also requires the prior probability P(C), which is straightforward to estimate. Assuming n_wd is the number of times word w occurs in document d, the probability of class C given a test document d is calculated as follows:

P(C|d) = P(C) * [ prod_w P(w|C)^(n_wd) ] / P(d)

Here P(d) is a normalization factor. To avoid the zero-frequency problem, it is common to use the Laplace correction for all conditional probabilities involved, which means all counts are initialized to one instead of zero.
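The training and classification process described above fits in a few lines of plain Python. This is a minimal illustrative sketch, not the system's actual implementation, and the spam/ham example data used with it is invented:

```python
import math
from collections import Counter

def train_mnb(docs, labels):
    """Estimate P(C) and P(w|C) from tokenized training documents.

    Laplace correction: every word count starts at one instead of zero,
    so no conditional probability is ever exactly zero.
    """
    vocab = {w for d in docs for w in d}
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)            # prior P(C)
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    """Pick the class maximizing log P(C) + sum_w n_wd * log P(w|C).

    P(d) is a normalization factor common to all classes, so it can be
    ignored when only the argmax is needed.
    """
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc:
            if w in vocab:                 # skip words never seen in training
                score += math.log(cond[c][w])
        if score > best_score:
            best, best_score = c, score
    return best
```

Working in log space avoids numerical underflow when multiplying many small probabilities, which is the standard trick for this classifier.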


Stochastic Gradient Descent:

Stochastic gradient descent (SGD) has experienced a revival since it was discovered that it provides an efficient means to learn some classifiers even when they are based on non-differentiable loss functions, such as the hinge loss used in support vector machines. In some cases a vanilla SGD implementation with a fixed learning rate is used to optimize the hinge loss with the L2 penalty commonly applied when learning support vector machines. With a linear machine, frequently applied for document classification, the objective being minimized is:

(lambda/2) * ||W||^2 + sum_i max(0, 1 - Y_i * (X_i . W + b))

where W is the weight vector, b is the bias, lambda is the regularization parameter, and the class labels Y are assumed to be in {+1, -1}.
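As a sketch of the idea rather than the implementation used in any of the cited papers, vanilla SGD with a fixed learning rate on the L2-regularized hinge loss can be written as follows (the parameter names and default values are mine):

```python
import random

def sgd_hinge(samples, lam=0.01, lr=0.1, epochs=100, seed=0):
    """Vanilla SGD with a fixed learning rate lr on the objective
    (lam/2)*||W||^2 + sum_i max(0, 1 - Y_i*(X_i . W + b)).

    samples: list of (x, y) pairs with x a feature list and y in {+1, -1}.
    """
    rng = random.Random(seed)
    samples = list(samples)          # avoid shuffling the caller's list
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The hinge subgradient is nonzero only when the margin is violated.
            for j in range(dim):
                g = lam * w[j] - (y * x[j] if margin < 1 else 0.0)
                w[j] -= lr * g
            if margin < 1:
                b += lr * y          # bias is conventionally left unregularized
    return w, b
```

Because the hinge loss is not differentiable at margin = 1, the update uses a subgradient: the regularization term contributes lam*w on every step, while the loss term contributes -y*x only for examples inside the margin.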


Hoeffding Tree:

The best-known decision tree learner for data streams is the Hoeffding tree algorithm. It employs a pre-pruning strategy based on the Hoeffding bound to incrementally grow a decision tree. A node is expanded by splitting as soon as there is sufficient statistical evidence, based on the data seen so far, to support the split; this decision rests on the distribution-independent Hoeffding bound.
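The Hoeffding bound itself is a one-line computation; a minimal sketch (parameter names are mine) of the quantity the split decision uses:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: after n independent observations of a random
    variable with range value_range, the true mean lies within eps of
    the observed mean with probability at least 1 - delta, where

        eps = sqrt(value_range^2 * ln(1/delta) / (2 * n))

    The bound holds regardless of the underlying distribution, which is
    what makes it suitable for split decisions on a stream.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

In a Hoeffding tree, a leaf is split when the difference in the split criterion (e.g. information gain) between the two best attributes exceeds eps for the examples seen so far; eps shrinks as n grows, so more data allows finer distinctions.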

Merits and Limitations of Applied Techniques

Merits of Proposed Approach:

The merits of this approach, as found after analysis of the applied techniques, are: (a) it is effective and fast for very small documents, and incremental learning can gradually improve the performance of the system; (b) because it uses (1) a bag-of-words model and (2) minimal linguistic features, it can easily be extended to other languages.

Possible Limitations of the System:

The possible limitations of the given system can be summarized as: (a) there may be documents (1) with neutral statements, i.e. neither positive nor negative, or (2) containing positive and negative sentiment in equal amounts; (b) a document may contain multiple sentiments, so associating a single document with only one sentiment may not always give a correct picture of the discussion on Twitter; (c) sometimes positive words do not express positive sentiment, and likewise negative words do not express negative sentiment: it depends on the combination of a positive or negative word with the other words that give it its positive or negative sense, and missing such cases may affect the result.

Discussions: Based on Research Papers

Sentiment Analysis in Multiple Languages:

Feature Selection for Opinion Classification in Web Forums; ACM Transactions on Information Systems, June 2008

Problem Analysis:

This system deals with: (1) the applicability of sentiment analysis to Web forums in multiple languages, (2) the use of stylistic features for further sentiment insight and classification power, and (3) the effect of feature selection on classification accuracy.

Techniques Applied:

To solve the problems discussed above, it exploits both (1) feature selection and (2) opinion classification. The detailed design of the system is given below.

Figure: Sentiment Classification System Design


Merits and Limitations of Applied Techniques

Data used in sentiment analysis generally contains unstructured text from (1) blog posts, (2) user reviews (about any product), (3) chat records, (4) opinion polls, etc. It may contain many noisy symbols, casual language, and emoticons.



Research Paper:

A study of Information Retrieval weighting schemes for sentiment analysis; ACL-2010

Problem analysis:

This paper examines whether term weighting functions adopted from Information Retrieval (IR), based on the standard tf.idf formula and adapted to the particular setting of sentiment analysis, can help classification accuracy. It demonstrates that variants of the original tf.idf weighting scheme provide significant increases in classification performance. The advantages of the approach are that it is intuitive and computationally efficient and that it requires no additional human annotation or external sources.

Experiments conducted on a number of publicly available data sets improve on the previous state of the art.

Techniques Applied:

In this paper, the basic focus is the use of (1) an improved tf.idf-based weighting scheme and (2) an SVM-based technique for sentiment classification. It applies the idea of localizing the estimation of idf values to documents of one class, but employs more sophisticated term weighting functions adapted from the SMART retrieval system and the BM25 probabilistic model.


It extends the original SMART annotation scheme by adding Delta variants of the original idf functions and additionally introduces smoothed Delta variants of the idf and the prob idf factors, for completeness and comparative reasons, denoted by their accented counterparts. For example, the weight of term i in document D under the weighting scheme that employs the BM25 tf weighting function and utilizes the difference of class-based smoothed BM25 idf values is calculated as:

w_i = [ (k1 + 1) * tf_i / (K + tf_i) ] * log( (N1 - df_i,1 + 0.5) / (df_i,1 + 0.5) )
    - [ (k1 + 1) * tf_i / (K + tf_i) ] * log( (N2 - df_i,2 + 0.5) / (df_i,2 + 0.5) )

    = [ (k1 + 1) * tf_i / (K + tf_i) ] * log( ( (N1 - df_i,1 + 0.5) * (df_i,2 + 0.5) ) / ( (N2 - df_i,2 + 0.5) * (df_i,1 + 0.5) ) )

where N1 and N2 are the numbers of training documents in the two classes, df_i,1 and df_i,2 are the document frequencies of term i in each class, and K is defined as:

K = k1 * ( (1 - b) + b * dl / avg_dl )

with dl the document length and avg_dl the average document length in the collection.

The above variation was made for two reasons. Firstly, when the df_i's are larger than 1, the smoothing factor influences the final idf value only in a minor way in the revised formulation, since it is added only after the multiplication of df_i with N_i (or its variation). Secondly, when df_i = 0, the smoothing factor correctly adds only a small mass, avoiding a potential division by zero, where otherwise it would add a much greater mass, because it would be multiplied by N_i.

Finally, it applies the SVM for sentiment classification.
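The weighting scheme described here is easy to compute once the per-class document frequencies are known. The sketch below follows the combined form of the Delta BM25 weight; the function name and the default values for k1 and b are my assumptions, not values taken from the paper:

```python
import math

def delta_bm25_weight(tf, dl, avg_dl, df1, df2, n1, n2, k1=1.2, b=0.95):
    """Delta BM25 term weight: the BM25 tf part multiplied by the
    difference of class-based smoothed BM25 idf values.

    tf: term frequency in the document; dl, avg_dl: document length and
    collection average; df1, df2: document frequencies of the term in
    the two classes; n1, n2: training-document counts per class.
    """
    K = k1 * ((1 - b) + b * dl / avg_dl)
    tf_part = (k1 + 1) * tf / (K + tf)
    # The +0.5 smoothing keeps the ratio finite when a class-specific
    # document frequency is zero.
    idf_delta = math.log(((n1 - df1 + 0.5) * (df2 + 0.5)) /
                         ((n2 - df2 + 0.5) * (df1 + 0.5)))
    return tf_part * idf_delta
```

A term that occurs evenly across both classes gets a weight near zero, while a term concentrated in one class gets a weight whose sign indicates that class; under this sign convention, terms more frequent in class 2 come out positive. These signed weights then feed the document vectors passed to the SVM.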


Merits and Limitations of Applied Techniques:

The possible limitations of this system may be:

1. A tf.idf-based system generally also depends on document size; it would be interesting to see the results of the improved tf.idf-based system discussed above on variable-length documents (the improved version carries no information about document length, i.e. it does not deal with document-length-related issues).

2. Concentrating only on sentiment-related words may be misleading.

3. The level of sentiment may differ, i.e. it may be (1) strongly positive, (2) weakly positive, (3) neutral, (4) weakly negative, or (5) strongly negative. The proposed algorithm deals only with positive and negative sentiments.

Future Research Scope:

The future research work should be:

1. Development of a more generalized tf.idf-based system that can handle issues related to variation in document size.

2. Instead of depending only on keyword features, using the K nearest words around each sentiment word, which may give more effective results.

3. Use of multi-class classification, which can broaden our view of the sentiments, i.e. it can capture more sentiment options.

References

1. Turning conversations into insights: A comparison of Social Media Monitoring Tools; a white paper from FreshMinds Research, 14 May 2010; FreshMinds, 229-231 High Holborn, London WC1V 7DA; www.freshminds.co.uk.

2. Alec Go; Richa Bhayani; Lei Huang; Twitter Sentiment Classification using Distant Supervision; Technical report, Stanford University.

3. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP Proceedings.

4. Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL Proceedings.

5. Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. ACL Proceedings.

6. Chenghua Lin, Yulan He; Joint Sentiment/Topic Model for Sentiment Analysis; CIKM'09, November 2-6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-512-3/09/11.

7. P. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," Proceedings of the Association for Computational Linguistics (ACL), pp. 417-424, 2002.

8. R. Ghani, K. Probst, Y. Liu, M. Krema, and A. Fano, “Text mining for product attribute extraction,” SIGKDD Explorations Newsletter, vol. 8, pp. 41–48, 2006.

9. E. Riloff, S. Patwardhan, and J. Wiebe, “Feature subsumption for opinion analysis,” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006.

10. Prem Melville, Wojciech Gryc, Richard D. Lawrence; Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification; KDD'09, June 28-July 1, 2009, Paris, France. Copyright 2009 ACM 978-1-60558-495-9/09/06.

11. Neil O'Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Páraic Sheridan, Cathal Gurrin, Alan F. Smeaton; Topic-Dependent Sentiment Analysis of Financial Blogs; TSA'09, November 6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-805-6/09/11.