Transcript Document
A Survey of Opinion Mining
Dongjoo Lee
Intelligent Database Systems Lab.
Dept. of Computer Science and Engineering
Seoul National University

Introduction
  The Web contains a wealth of opinions about products, politics, and more, in newsgroup posts, review sites, and other web sites.
  A few problems
    – What is the general opinion on the proposed tax reform?
    – How is popular opinion on the presidential candidates evolving?
    – Which of our customers are unsatisfied? Why?
  Opinion Mining (OM)
    – A recent discipline at the crossroads of information retrieval and computational linguistics, concerned not with the subject of a document but with the opinion it expresses
  Related areas
    – Data Mining (DM), Information Retrieval (IR), Text Classification (TC), Text Summarization (TS)

Center for E-Business Technology    Copyright © 2007 by CEBT    IDS Lab.

Agenda
  Introduction
  Development of Linguistic Resource
    – Conjunction Method
    – PMI Method
    – WordNet Expanding Method
    – Gloss Use Method
  Sentiment Classification
    – PMI Method
    – Machine Learning Method
    – NLP Combined Method
  Extracting and Summarizing Opinion Expression
    – Statistical Approach
    – NLP Based Approach
  Discussion

Development of Linguistic Resource (1)
  Linguistic resources can be used to extract opinions and to classify the sentiment of text.
  Appraisal Theory
    – Sentiment-related properties are well defined
    – A framework of linguistic resources that describes how writers and speakers express inter-subjective and ideological positions
    – The underlying linguistic foundation of OM
  Tasks
    – Determining the subjectivity of a term
    – Determining term orientation
    – Determining the strength of term attitude
  Example
    – Objective: vertical, yellow, liquid
    – Subjective
        Positive: good < excellent
        Negative: bad < terrible

Development of Linguistic Resource (2)
  [Diagram: the four methods (Conjunction Method, PMI Method, WordNet Expansion Method, Gloss Use Method) shown against the tasks they address (orientation, subjectivity), with SentiWordNet as a resulting resource]

Conjunction Method - overview
  Hatzivassiloglou and McKeown, 1997
  Hypothesis
    – Adjectives joined by 'and' usually have similar orientation, while 'but' joins adjectives of opposite orientation.
  Process
    1. All conjunctions of adjectives are extracted from the corpus.
    2. A log-linear regression model combines information from different conjunctions to determine whether each pair of conjoined adjectives has the same or different orientation.
    3. A clustering algorithm separates the adjectives into two subsets of different orientation. It places as many words of the same orientation as possible into the same subset.
    4. The average frequencies in each group are compared, and the group with the higher frequency is labeled as positive.
  Randomly selected adjectives with positive and negative orientation (seed terms) were used to predict orientation.
  [Figure: a corpus of 'and' / 'but' conjunctions and seed terms, separated into positive and negative groups]

Conjunction Method – objective function and constraints
  Select the partition p_min that minimizes Φ(p)
    – |Ci| : the cardinality of cluster i
    – d(x, y) : the dissimilarity between adjectives x and y
    – Dissimilarity between adjectives in the same cluster is minimized, and dissimilarity between adjectives in different clusters is maximized.
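
The objective described above can be illustrated with a small sketch. The slide's formula did not survive in this transcript, so the code assumes Φ sums, over the two clusters, the pairwise dissimilarities inside each cluster scaled by 1/|Ci|. The toy dissimilarity values and the brute-force search (which stands in for the clustering algorithm) are assumptions made purely for illustration.

```python
from itertools import combinations

def phi(partition, d):
    """Assumed objective: sum over clusters of (1/|C_i|) * within-cluster dissimilarity."""
    total = 0.0
    for cluster in partition:
        if not cluster:
            continue
        within = sum(d[frozenset(pair)] for pair in combinations(sorted(cluster), 2))
        total += within / len(cluster)
    return total

def best_two_way_partition(words, d):
    """Try every split into two non-empty clusters (feasible only for toy sizes)."""
    words = sorted(words)
    first, rest = words[0], words[1:]
    best = None
    # r = how many of the remaining words join `first`; at least one must stay out.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            c1 = {first, *extra}
            c2 = set(words) - c1
            score = phi((c1, c2), d)
            if best is None or score < best[0]:
                best = (score, c1, c2)
    return best

# Hypothetical dissimilarities: low for same-orientation pairs, high otherwise.
ORIENT = {"good": 1, "nice": 1, "bad": -1, "poor": -1}
WORDS = list(ORIENT)
D = {
    frozenset(pair): (0.1 if ORIENT[pair[0]] == ORIENT[pair[1]] else 0.9)
    for pair in combinations(WORDS, 2)
}

if __name__ == "__main__":
    score, c1, c2 = best_two_way_partition(WORDS, D)
    print(round(score, 3), sorted(c1), sorted(c2))
```

With these toy values the minimizing partition groups the two positive adjectives together and the two negative adjectives together; which group is then labeled positive is decided separately, by the frequency comparison in step 4.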
  Experiments
    – HM term set: 1,336 adjectives (657 positive, 679 negative)
    – Methods to improve the performance of orientation prediction
        But rule: most conjunctions link adjectives of the same orientation, while conjunctions with 'but' almost always link adjectives of opposite orientation
        Log-linear regression model
        Morphological relationship: adequate-inadequate or thoughtful-thoughtless
    – Log-linear model with morphological relationship: 82.5% accuracy

PMI Method - overview
  Pointwise Mutual Information (PMI)
    – A measure of association used in information theory and statistics
  Orientation
    – Turney and Littman, 2003
    – Terms with similar orientation tend to co-occur in documents
  Subjectivity
    – Baroni and Vegnaduzzo, 2004
    – Subjective adjectives tend to occur near other subjective adjectives

PMI Method – predicting semantic orientation
  A modified PMI was measured using the number of results returned by the AltaVista search engine with the NEAR operator
    – t : target term
    – ti : paradigmatic term
  Predicting the semantic orientation of a term: SO(t)
  Experiments
    – With the HM term set and three corpora

    Corpus    Approx. # of words in corpus    Accuracy
    AV-ENG    1 × 10^11                       87.13%
    AV-CA     2 × 10^9                        80.31%
    TASA      1 × 10^7                        61.83%

    – With a small corpus, accuracy is no higher than that of the conjunction method.
    – With a large corpus, accuracy is higher than that of the conjunction method.

WordNet Expansion Method
  Hu et al., 2004
  Hypothesis
    – Uses the synonym and antonym relationships between words
    – Adjectives usually share the same orientation as their synonyms and the opposite orientation to their antonyms
  Starting from a set of seed adjectives, the orientation of all adjectives in WordNet can be assigned by a procedure that explores the cluster graphs.
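
The seed-propagation idea in the WordNet expansion slide can be sketched as a breadth-first walk over synonym and antonym links. This is a minimal sketch, not the cited authors' procedure: the tiny `SYNONYM_PAIRS` and `ANTONYM_PAIRS` tables are hypothetical stand-ins for WordNet, and the function names are made up for the example.

```python
from collections import deque

def symmetric(pairs):
    """Turn a list of (a, b) pairs into an undirected adjacency map."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

# Hypothetical stand-ins for WordNet synonym and antonym relations.
SYNONYM_PAIRS = [("good", "great"), ("great", "superb"), ("bad", "awful")]
ANTONYM_PAIRS = [("good", "bad")]

def expand_orientation(seeds, synonyms, antonyms):
    """Propagate +1/-1 orientation from seed adjectives through the graph."""
    orientation = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        sign = orientation[word]
        # Synonyms keep the sign, antonyms flip it.
        for graph, flip in ((synonyms, 1), (antonyms, -1)):
            for other in sorted(graph.get(word, ())):
                if other not in orientation:
                    orientation[other] = sign * flip
                    queue.append(other)
    return orientation

if __name__ == "__main__":
    result = expand_orientation(
        {"good": 1}, symmetric(SYNONYM_PAIRS), symmetric(ANTONYM_PAIRS)
    )
    print(result)
```

Words that cannot be reached from any seed (an objective adjective such as "yellow", for instance) simply receive no orientation in this sketch, which mirrors the method's limitation to terms covered by the thesaurus.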

Gloss Use Method - overview
  Esuli et al., 2005, 2006
  Hypothesis
    – Orientation
        Terms with similar orientation have similar glosses
        good: that which is pleasing or valuable or useful; agreeable or pleasing
        beautiful: aesthetically pleasing
        pretty: pleasing by delicacy or grace; not imposing
    – Subjectivity
        Terms with similar orientation have similar glosses
        Terms without orientation have non-oriented glosses
        yellow: similar to the color of an egg yolk
        vertical: at right angles to the plane of the horizon or a base line
  SentiWordNet
    – All words in WordNet have three scores: positivity, negativity, and objectivity
    – Each term sense is positioned in an inverted triangle

Gloss Use Method – classification process
  Process
    1. A seed set (Lp, Ln) is provided as input.
    2. Lexical relations (e.g. synonymy) from a thesaurus or online dictionary are used to extend the seed set. Once added to the original terms, the new terms yield two new, richer sets Trp and Trn; together they form the training set for the learning phase of Step 4.
    3. For each term ti in Trp∪Trn or in the test set, a textual representation of ti is generated by collating all the glosses of ti found in a machine-readable dictionary. Each such representation is converted into vector form by standard text indexing techniques.
    4. A binary text classifier is trained on the terms in Trp∪Trn and then applied to the terms in the test set.
  Experiments
    – Classifiers: NB, SVM, PrTFIDF
    – 87.38% accuracy
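
The four-step process above can be sketched end to end on toy data. The following is a minimal sketch under stated assumptions: a nearest-centroid classifier with cosine similarity stands in for the NB, SVM, and PrTFIDF classifiers named on the slide; the glosses for "good", "beautiful", and "pretty" come from the slide, while the glosses for "bad", "awful", and "nasty", the `SYNONYMS` table, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

GLOSSES = {
    "good": "that which is pleasing or valuable or useful",
    "beautiful": "aesthetically pleasing",
    "pretty": "pleasing by delicacy or grace; not imposing",
    # The three glosses below are invented for illustration.
    "bad": "having undesirable or unpleasant qualities",
    "awful": "extremely unpleasant or disagreeable",
    "nasty": "offensive or unpleasant",
}
SYNONYMS = {"good": ["beautiful"], "bad": ["awful"]}  # hypothetical thesaurus
STOPWORDS = {"that", "which", "is", "or", "by", "having"}

def expand(seeds, synonyms):
    """Step 2: extend a seed set through lexical relations."""
    expanded = set(seeds)
    for seed in seeds:
        expanded.update(synonyms.get(seed, ()))
    return expanded

def gloss_vector(term, glosses):
    """Step 3: represent a term by the bag of words of its gloss."""
    tokens = re.findall(r"[a-z]+", glosses[term].lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

def train(trp, trn, glosses):
    """Step 4 (training): one summed gloss vector per class."""
    pos, neg = Counter(), Counter()
    for term in trp:
        pos.update(gloss_vector(term, glosses))
    for term in trn:
        neg.update(gloss_vector(term, glosses))
    return pos, neg

def classify(term, model, glosses):
    """Step 4 (application): pick the class whose centroid is closer."""
    pos, neg = model
    vec = gloss_vector(term, glosses)
    return "positive" if cosine(vec, pos) >= cosine(vec, neg) else "negative"

TRP = expand({"good"}, SYNONYMS)  # Step 1 seeds: Lp = {good}
TRN = expand({"bad"}, SYNONYMS)   # Step 1 seeds: Ln = {bad}

if __name__ == "__main__":
    model = train(TRP, TRN, GLOSSES)
    for term in ("pretty", "nasty"):
        print(term, classify(term, model, GLOSSES))
```

The sketch makes the dependence on the dictionary visible: "pretty" is labeled only because its gloss shares the word "pleasing" with the training glosses.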

Development of Linguistic Resource - Summary
  Conjunction Method
    Intuition: adjectives in 'and' conjunctions usually have similar orientation, though 'but' is used with opposite orientation
    Accuracy: 78.08%
    Characteristics: the first attempt; test data: 1,336 adjectives
  PMI Method
    Intuition: terms with similar orientation tend to co-occur in documents
    Accuracy: 87.13%
    Characteristics: no limitation; much time required
  WordNet Expansion Method
    Intuition: adjectives usually share the same orientation as their synonyms and the opposite orientation to their antonyms
    Accuracy: N/A
    Characteristics: limited to WordNet
  Gloss Use Method
    Intuition: terms with similar orientation have similar glosses; terms without orientation have non-oriented glosses
    Accuracy: 87.38%
    Characteristics: SentiWordNet (all words in WordNet); accuracy depends on the quality of the thesaurus

Sentiment Classification
  The process of identifying the sentiment, or polarity, of a piece of text or a document.
    – Document-level
    – Sentence-level, phrase-level
    – Feature-level: define the target of the opinion and assign the sentiment of the target
  Document-level Sentiment Classification Methods
    – PMI Method
    – Machine Learning Method
        Default Classifiers
        Enhanced Classifier
    – NLP Combined Method
        A Two-Step Classification
        Combining Appraisal Theory

PMI Method
  Turney et al., 2002
  Process
    – Only two-word phrases containing adjectives or adverbs are extracted
    – Semantic orientation of a phrase: SO(phrase) = PMI(phrase, "excellent") – PMI(phrase, "poor")
    – The semantic orientation of a review is the average semantic orientation of its phrases
  Experiments
    – 410 reviews from Epinions (epinion.com): 170 positive, 240 negative
    – Calculating the PMI of 10,658 phrases from the 410 reviews took about 30 hours

    Domain of review         Accuracy
    Automobiles              84.00%
      - Honda Accord         83.78%
      - Volkswagen Jetta     84.21%
    Banks                    80.00%
      - Bank of America      78.33%
      - Washington Mutual    81.67%
    Movies                   65.83%
      - The Matrix           66.67%
      - Pearl Harbor         65.00%
    Travel Destination       70.53%
      - Cancun               64.41%
      - Puerto Vallarta      80.56%

ML - Default Classifier
  Pang and Lee, 2002
  A special case of text categorization with sentiment-based rather than topic-based categories
  Document modeling
    – Standard bag-of-features framework
  Experiments
    – Data: movie reviews (Internet Movie Database); rating -> negative, neutral, positive
    – Classifiers: Naïve Bayes (NB), Maximum Entropy (ME), Support Vector Machine (SVM)
    – Accuracy (%) by feature set:

    Features             # of features    Frequency or presence?    NB      ME      SVM
    unigrams             16165            freq.                     78.7    N/A     72.8
    unigrams             16165            pres.                     81.0    80.4    82.9
    unigrams+bigrams     32330            pres.                     80.6    80.8    82.7
    bigrams              16165            pres.                     77.3    77.4    77.1
    unigrams+POS         16695            pres.                     81.5    80.4    81.9
    adjectives           2633             pres.                     77.0    77.7    75.1
    top 2633 unigrams    2633             pres.                     80.3    81.0    81.4
    unigrams+position    22430            pres.                     81.0    80.1    81.6

  In terms of relative performance, Naïve Bayes tends to do the worst and SVM tends to do the best, although the differences are not very large.
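
The bag-of-features setup with presence (rather than frequency) features can be sketched with a small Naïve Bayes classifier. This is an illustrative sketch only: the four training reviews are invented, whitespace tokenization and add-one smoothing are assumptions, and none of the numbers in the table above are reproduced by it.

```python
import math
from collections import Counter

def presence_features(text):
    """Unigram presence: each word counts once per document, however often it occurs."""
    return set(text.lower().split())

def train_nb(docs):
    """docs is a list of (text, label) pairs."""
    doc_counts = Counter()
    feat_counts = {}
    vocab = set()
    for text, label in docs:
        feats = presence_features(text)
        doc_counts[label] += 1
        vocab |= feats
        feat_counts.setdefault(label, Counter()).update(feats)
    return doc_counts, feat_counts, vocab

def predict_nb(text, model):
    doc_counts, feat_counts, vocab = model
    total_docs = sum(doc_counts.values())
    feats = presence_features(text) & vocab  # ignore unseen words
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)  # class prior
        denom = sum(feat_counts[label].values()) + len(vocab)
        for feat in feats:
            # Add-one (Laplace) smoothing on the per-class feature counts.
            score += math.log((feat_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy reviews, for illustration only.
TRAIN = [
    ("a great and moving film", "positive"),
    ("great acting and a clever plot", "positive"),
    ("a dull and boring film", "negative"),
    ("boring plot and awful acting", "negative"),
]

if __name__ == "__main__":
    model = train_nb(TRAIN)
    print(predict_nb("a clever and moving story", model))
    print(predict_nb("an awful and dull plot", model))
```

Switching `presence_features` to return a list of tokens instead of a set would give the frequency variant, which is the only change separating the first two rows of the table.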

ML - Using Only Subjective Sentences
  Pang and Lee, 2004
  Polarity classification is improved by removing objective sentences
  A subjectivity detector determines whether each sentence is subjective or not
    – Standard subjectivity classifier
    – Subjectivity classifier using proximity relationships
  Using subjectivity extracts can improve polarity classification, or at least causes no loss of accuracy.

NLP Combined Method – A Two-Step Classification
  Wilson et al., 2005
  A Two-Step Contextual Polarity Classification
    – Employs machine learning and 28 linguistic features
    – Document polarity: the average polarity of phrases
    – Step 1. A neutral-polar classifier classifies each phrase containing a clue as neutral or polar
    – Step 2. A polarity classifier takes all phrases marked as polar in Step 1 and disambiguates their contextual polarity (positive, negative, both, or neutral)
  28 features, extracted using NLP techniques with a dependency parser
    – 4 Word Features, 8 Modification Features, 11 Structure Features, 3 Sentence Features, 1 Document Feature
  Experiments
    – Data: Multi-perspective Question Answering (MPQA) Opinion Corpus

    Neutral-polar classification
    Features          Accuracy (%)
    Word token        73.6
    Word+priorpol     74.2
    28 features       75.9

    Polarity classification
    Features          Accuracy (%)
    Word token        61.7
    Word+priorpol     63.0
    10 features       65.7

NLP Combined Method - Combining Appraisal Theory
  Whitelaw et al., 2005
  Applied appraisal theory to the machine learning methods of Pang and Lee
  Structure of an appraisal
    – An example: "not very happy"
  Experiments
    – A lexicon of 1,329 appraisal entities was produced semi-automatically from 400 seed terms in around twenty man-hours
    – Combining attitude type and orientation: 90.2% accuracy
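
The "not very happy" example can be illustrated with a toy appraisal structure. The slide's structure diagram did not survive in this transcript, so everything here is an assumption made for illustration: the `force` attribute, the one-entry lexicon, and the modifier rules ("very" raises force, "not" flips orientation) are hypothetical, and the attribute set used in the cited work may differ.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Appraisal:
    attitude: str     # attitude type, e.g. "affect"
    orientation: str  # "positive" or "negative"
    force: str        # hypothetical intensity attribute

# Hypothetical one-entry lexicon.
LEXICON = {"happy": Appraisal("affect", "positive", "neutral")}

def intensify(appraisal):
    """Assumed rule: an intensifier such as 'very' raises force."""
    return replace(appraisal, force="high")

def negate(appraisal):
    """Assumed rule: 'not' flips orientation and leaves the rest unchanged."""
    flipped = "negative" if appraisal.orientation == "positive" else "positive"
    return replace(appraisal, orientation=flipped)

MODIFIERS = {"very": intensify, "not": negate}

def analyse(phrase):
    """Apply modifiers to the head adjective, nearest modifier first."""
    *modifiers, head = phrase.lower().split()
    appraisal = LEXICON[head]
    for word in reversed(modifiers):
        appraisal = MODIFIERS[word](appraisal)
    return appraisal

def feature(appraisal):
    """A combined 'attitude type and orientation' feature for a classifier."""
    return f"{appraisal.attitude}:{appraisal.orientation}"

if __name__ == "__main__":
    result = analyse("not very happy")
    print(result, feature(result))
```

Features such as `affect:negative` could then be counted per document and fed to the same classifiers used for bag-of-words features, which is the sense in which the approach builds on Pang and Lee's setup.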

Sentiment Classification - Summary
  PMI Method
    Characteristics: uses phrase PMI
    Pros: simple; no prior polarity dictionary needed
    Cons: loss of contextual meaning; slow (time to obtain PMI)
  Machine Learning Method
    Characteristics: bag of words; unigrams to bigrams or n-grams; SVM, NB, MaxEnt
    Pros: simple; no prior polarity dictionary needed
    Cons: loss of contextual meaning; needs a learning phase
  NLP Combined Method
    Characteristics: based on ML; parsing or syntactic analysis; prior polarity to contextual polarity
    Pros: considers contextual meaning; easily extendible for various purposes
    Cons: needs a prior polarity dictionary; syntactic analysis overhead

Extracting and Summarizing Opinion Expression
  Goal
    – Extract opinion expressions from large collections of reviews and present them in an effective way
  Tasks
    – Feature Extraction
        Sentiment classification at the feature level requires the extraction of the features that are the targets of opinion words
    – Sentiment Assignment
        Each feature is usually classified as being either favorable or unfavorable
    – Visualization
        Extracted opinion expressions are summarized and visualized
  Methods
    – Statistical Approaches
        ReviewSeer (2003)
        Opinion Observer (2004)
        Red Opal (2007)
    – NLP-Based Approaches
        Kanayama System (2004)
        WebFountain (2005)
        OPINE (2005)
  [Figure: pipeline from product reviews through Extract Features, Assign Sentiment, and Summarize]

Opinion Observer - Overview
  Hu and Liu, 2005
  Extracts and summarizes opinion expressions from customer reviews on the Web.
  Mines only the product features on which customers have expressed opinions, and whether those opinions are positive or negative
  Overall process
    1. Review crawling
    2. Feature extraction
    3. Sentiment assignment
       – Opinion word extraction
       – Opinion orientation identification
    4. Summary generation
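
Steps 2 to 4 of the overall process can be sketched on a handful of sentences. This is a minimal sketch with several stated assumptions: a fixed `FEATURES` set stands in for mined product features, the `POLARITY` lexicon is invented, each sentence takes the dominant orientation of its opinion words, and the crawling step is replaced by a hard-coded list of review sentences.

```python
from collections import defaultdict

FEATURES = {"battery", "screen"}  # hypothetical, stands in for mined features
POLARITY = {"great": 1, "sharp": 1, "poor": -1, "dim": -1, "short": -1}  # hypothetical

def tokenize(sentence):
    return sentence.lower().replace(".", " ").replace(",", " ").split()

def sentence_orientation(tokens):
    """Dominant orientation: the sign of the summed opinion-word polarities."""
    return sum(POLARITY.get(token, 0) for token in tokens)

def summarize(sentences):
    """Return (feature, # of positive sentences, # of negative sentences) tuples."""
    counts = defaultdict(lambda: [0, 0])
    for sentence in sentences:
        tokens = tokenize(sentence)
        score = sentence_orientation(tokens)
        if score == 0:
            continue  # no opinion words, or a tie: not counted
        for feat in FEATURES & set(tokens):
            counts[feat][0 if score > 0 else 1] += 1
    return sorted((feat, pos, neg) for feat, (pos, neg) in counts.items())

# Invented review sentences standing in for crawled reviews.
REVIEWS = [
    "The screen is sharp and great.",
    "The battery is poor.",
    "Battery life is short, screen is dim.",
    "The battery is great.",
]

if __name__ == "__main__":
    for row in summarize(REVIEWS):
        print(row)
```

The per-feature counts are exactly what a bar-graph comparison of products needs: one positive and one negative total for each feature.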

Opinion Observer - Tasks
  Feature Extraction
    – Product features are extracted from nouns and noun phrases by the association miner CBA
    – Compactness pruning, redundancy pruning
  Sentiment Assignment
    – Opinion sentence: a sentence that contains one or more product features and one or more opinion words
    – Adjectives are the only opinion words
    – Prior polarity of adjectives was identified by the WordNet expansion method with seed terms
    – Infrequent features are extracted by using frequent opinion words
    – The polarity of a sentence is assigned as its dominant orientation
    – Extracted form: (product feature, # of positive sentences, # of negative sentences)
  Experiments
    – Large collection of reviews of 15 electronic products
    – 86.3% recall, 84.0% precision

Opinion Observer - Visualization
  Features of products are compared in a bar graph
  The numbers of positive and negative sentences for each feature are normalized
  [Figure: bar graph with a positive portion and a negative portion for each feature]

Web Fountain - Overview
  Yi et al., 2005
  Extracts the target features of sentiment from various resources and assigns polarity to the features
  System Architecture
    – Sentiment Miner: analyzes grammatical sentence structures and phrases by using NLP techniques

Web Fountain - Tasks
  Feature Extraction
    – Candidate features
        a part-of relationship with the given topic
        an attribute-of relationship with the given topic
        an attribute-of relationship with a known feature of the given topic
    – The bBNP (Beginning definite Base Noun Phrase) heuristic is used
    – Select the bnp (base noun phrase) that has a high likelihood ratio
    – Experiments
        Precision: digital camera 97%, music reviews 100%
  Sentiment Assignment
    – Parse and traverse with two linguistic resources
        Sentiment lexicon: defines the sentiment polarity of terms
        Sentiment pattern database: contains the sentiment assignment patterns of predicates
    – Experiments
        Product reviews: recall 56%, precision 87%

Web Fountain – Visualization
  Web interface listing sentiment-bearing sentences about a given product

Extracting and Summarizing Opinion Expression - Summary
  Statistical
    ReviewSeer (2003)
      Feature Extraction: N/A
      Sentiment Assignment: probabilistic model, Naïve Bayes; accuracy 85.3%
      Visualization: lists each feature term with its score and shows the sentences that contain the feature term
    Opinion Observer (2004)
      Feature Extraction: CBA miner, infrequent feature selection; recall 86.3%, precision 84.0%
      Sentiment Assignment: WordNet expansion, prior polarity of adjectives
      Visualization: graph
    Red Opal (2007)
      Feature Extraction: frequent nouns and noun phrases; precision 85%
      Sentiment Assignment: uses the user's rating; precision 80%
      Visualization: product list ordered by the score of each feature, with the confidence of the scoring
  NLP-based
    Kanayama's system (2004)
      Sentiment Assignment: sentiment units, modifying the machine translation framework; recall 43%, precision 89%
      Visualization: N/A
    WebFountain (2005)
      Feature Extraction: bBNP heuristics, likelihood ratio; precision 97%
      Sentiment Assignment: sentiment lexicon, sentiment pattern database; recall 56%, precision 87%
      Visualization: listing sentiment-bearing sentences of a product
    OPINE (2005)
      Feature Extraction: Web PMI; recall 76%, precision 79%
      Sentiment Assignment: relaxation labeling; recall 89%, precision 86%
      Visualization: N/A

Discussion
  OM is a growing research discipline related to various research areas, such as IR, computational linguistics, TC, TS, and DM.
  Three topics were surveyed and summarized.
  For Korean OM?
    – There is no published research on Korean OM.
    – Language differences may impose some limits on the methods used in the OM subtasks.
        Structural differences between English and Korean may mean that the same heuristics cannot be applied to extract features from text
        The lack of a Korean thesaurus similar to WordNet limits the methods of obtaining the prior polarity of words for the PMI or conjunction methods
    – Research into Korean OM must be conducted in conjunction with other related areas.

Discussion - Research Map of OM
  [Figure: research map of OM]

Thank you