Transcript Document
Mining the peanut gallery: Opinion extraction and semantic classification of product reviews

Kushal Dave, IBM (work done at NEC Laboratories)
Steve Lawrence, Google
David M. Pennock, Overture

Problem
• Many reviews are spread out across the Web
– Product-specific sites
– Editorial reviews at C|net, magazines, etc.
– User reviews at C|net, Amazon, Epinions, etc.
– Reviews in blogs and author sites
– Reviews in Google Groups (Usenet)
• No easy aggregation, but we would like to gather numerous diverse opinions
• Even at one site, synthesis is difficult

Solution!
• Find reviews on the web and mine them…
– Filtering (find the reviews)
– Classification (positive or negative)
– Separation (identify and rate specific attributes)

Existing work
Classification varies in granularity/purpose:
• Objectivity – Does this text express fact or opinion?
• Words – What connotations does this word have?
• Sentiments – Is this text expressing a certain type of opinion?

Objectivity classification
• Best features: relative frequencies of parts of speech (Finn 02)
• Subjectivity is…subjective (Wiebe 01)

Word classification
• Polarity vs. intensity
• Conjunctions – (Hatzivassiloglou and McKeown 97)
• Collocations – (Turney and Littman 02)
• Linguistic collocations – (Lin 98), (Pereira 93)

Sentiment classification
• Manual lexicons
– Fuzzy logic affect sets (Subasic and Huettner 01)
– Directionality: block/enable (Hearst 92)
– Common sense and emotions (Liu et al 03)
• Recommendations
– Stocks in Yahoo (Das and Chen 01)
– URLs in Usenet (Terveen 97)
– Movie reviews in IMDb (Pang et al 02)

Applied tools
• AvaQuest's GoogleMovies
– http://24.60.188.10:8080/demos/GoogleMovies/GoogleMovies.cgi
• Clipping services
– PlanetFeedback, Webclipping, eWatch, TracerLock, etc.
– Commonly use templates or simple searches
• Feedback analysis
– NEC's Survey Analyzer
– IBM's Eclassifier
• Targeted aggregation
– BlogStreet, onfocus, AllConsuming, etc.
– Rotten Tomatoes

Our approach

Train/test
• Existing tagged corpora!
• C|net – two cases (the first protocol is sketched in code after the Domain obstacles slide)
– Even, mixed: 10-fold test, 4,480 reviews, 4 categories
– Messy, by category: 7-fold test, 32,970 reviews

Other corpora
• Preliminary work with Amazon
– (Potentially) an easier problem
• Pang et al IMDb movie reviews corpus
– Our simple algorithm is insignificantly worse: 80.6% vs. 82.9% (unigram SVM)
– A different sort of problem

Domain obstacles
• Typos, user error, inconsistencies, repeated reviews
• Skewed distributions
– 5x as many positive reviews as negative
– 13,000 reviews of MP3 players, 350 of networking kits
– Fewer than 10 reviews for half of the products
[Chart: review counts per product category (Network, TV, Laser, Laptop, PDA, MP3, Camera), y-axis 0–16,000]
• Variety of language, sparse data
– 1/5 of reviews have fewer than 10 tokens
– More than 2/3 of terms occur in fewer than 3 documents
• Misleading passages (ambivalence/comparison)
– "Battery life is good, but..." "Good idea, but..."
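Concretely, the "even, mixed" case from the Train/test slide can be read as the following protocol. This is a minimal sketch, not the authors' code; the positive and negative review lists are assumed inputs.

```python
# A sketch of the "even, mixed" 10-fold protocol: balance the two classes,
# shuffle them together across product categories, and hold out a tenth at
# a time. The input review lists are assumed; only the protocol itself
# comes from the slides.
import random

def ten_fold_splits(positive, negative, k=10, seed=0):
    rng = random.Random(seed)
    n = min(len(positive), len(negative))          # "even": same count per class
    data = [(r, +1) for r in positive[:n]] + [(r, -1) for r in negative[:n]]
    rng.shuffle(data)                              # "mixed": categories interleaved
    for i in range(k):
        test = data[i::k]                          # every k-th example held out
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test
```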
Base Features
(This and the following steps are sketched in code after the Reweighting slide.)
• Unigrams
– great, camera, support, easy, poor, back, excellent
• Bigrams
– can't beat, simply the, very disappointed, quit working
• Trigrams, substrings
– highly recommend this, i am very happy, not worth it, i sent it back, it stopped working
• Near
– greater would, find negative, refund this, automatic poor, company totally

Generalization
• Replacing product names (using metadata)
– the iPod is an excellent ==> the _productname is an excellent
• Replacing domain-specific words
– excellent [picture|sound] quality ==> excellent _producttypeword quality
• Replacing rare words (statistically determined)
– overshadowed by ==> _unique by

Generalization (2)
• Finding synsets in WordNet
– anxious ==> 2377770
– nervous ==> 2377770
• Stemming
– was ==> be
– compromised ==> compromis

Qualification
• Parsing for collocation
– this stupid piece of garbage ==> (piece(N):mod:stupid(A))...
– this piece is stupid and ugly ==> (piece(N):mod:stupid(A))...
• Negation
– not good or useful ==> NOTgood NOTor NOTuseful

Optimal features
• Confidence/precision tradeoff
• Try to find best-length substrings
– How to find features? Marginal improvement when traversing the suffix tree; information gain worked best
– How to score? [p(C|w) – p(C'|w)] * df; dynamic programming (during construction) or a simple pass (during use)
– Still usually short (n < 5)
– Although… "i have had no problems with"

Clean up
• Thresholding
– Reduces the feature space with minimal impact
– But not too much! (sparse data)
– Keep features occurring in > 3 docs
• Smoothing: allocate weight to unseen features, regularize other values
– Tried Laplace, Simple Good-Turing, Witten-Bell
– Laplace helps naïve Bayes

Scoring
• SVMlight, naïve Bayes
– Neither does better in both tests
– Our scoring is simpler, less reliant on confidence, document length, corpus regularity
• Give each word a score ranging from –1 to 1
– Our metric: score(fi) = (p(fi|C) – p(fi|C')) / (p(fi|C) + p(fi|C'))
– Tried Fisher discriminant, information gain, odds ratio
• If the sum of scores is > 0, it's a positive review
• Other probability models – presence
• Bootstrapping barely helps

Reweighting
• Document frequency
– df, idf, normalized df, ridf
– Some improvement from log df and a Gaussian weighting (arbitrary)
• Analogues for term frequency, product frequency, product type frequency
• Tried bootstrapping and different ways of assigning scores
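The sketches below illustrate the preceding slides. First, the n-gram features from the Base Features slide; the regex tokenizer is an assumption, since the slides do not specify tokenization.

```python
# A minimal sketch of the unigram/bigram/trigram features on the
# "Base Features" slide. The lowercasing and regex tokenizer are
# assumptions.
import re

def ngram_features(text, max_n=3):
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# ngram_features("I sent it back") yields "sent", "it back",
# and "sent it back", among others.
```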
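Next, the substitutions from the Generalization slide. In the real system the product name and type words come from C|net metadata; the simple document-frequency cutoff standing in for the statistical rare-word test is an assumption here.

```python
# A sketch of the token substitutions on the "Generalization" slide:
# product names become _productname, domain words become _producttypeword,
# and rare words become _unique. The rare_cutoff heuristic is an assumption.
import re

def generalize(text, product_name, type_words, doc_freq, rare_cutoff=3):
    text = re.sub(re.escape(product_name), "_productname", text,
                  flags=re.IGNORECASE)
    out = []
    for tok in text.split():
        if tok.startswith("_"):                        # already substituted
            out.append(tok)
        elif tok.lower() in type_words:
            out.append("_producttypeword")
        elif doc_freq.get(tok.lower(), 0) < rare_cutoff:
            out.append("_unique")
        else:
            out.append(tok)
    return " ".join(out)
```

With ordinary words above the cutoff, "the iPod has excellent sound quality" becomes "the _productname has excellent _producttypeword quality", as on the slide.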
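The negation step from the Qualification slide, reproducing the slide's own example; the trigger set and the punctuation boundary are assumptions.

```python
# A sketch of negation marking: after a negating trigger, prefix the
# following tokens with NOT until the next punctuation mark. The trigger
# itself is dropped, matching the slide's example output.
NEGATORS = {"not", "no", "never"}

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if tok in {".", ",", ";", "!", "?"}:
            negating = False
            out.append(tok)
        elif tok in NEGATORS:
            negating = True
        else:
            out.append("NOT" + tok if negating else tok)
    return out

# mark_negation("not good or useful".split())
# ==> ["NOTgood", "NOTor", "NOTuseful"]
```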
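One possible reading of the [p(C|w) – p(C'|w)] * df expression from the Optimal features slide, estimating p(C|w) from per-class document frequencies; this interpretation is an assumption, and the suffix-tree traversal that proposes candidate substrings is omitted.

```python
# A sketch of substring scoring: weight class skew |p(C|w) - p(C'|w)| by
# document frequency, so that discriminative but common substrings win.
# Documents here are plain strings; "w in d" is substring containment.
def substring_score(w, pos_docs, neg_docs):
    df_pos = sum(w in d for d in pos_docs)      # docs of class C containing w
    df_neg = sum(w in d for d in neg_docs)      # docs of class C' containing w
    df = df_pos + df_neg
    if df == 0:
        return 0.0
    return abs(df_pos / df - df_neg / df) * df  # |p(C|w) - p(C'|w)| * df
```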
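Finally, a sketch of the scoring metric and decision rule from the Scoring slide, with the Laplace smoothing mentioned on the Clean up slide; documents are represented as lists of features (e.g. from ngram_features above).

```python
# A minimal sketch of the classifier: each feature f gets
# score(f) = (p(f|C) - p(f|C')) / (p(f|C) + p(f|C')), in [-1, 1],
# and a review is positive if its features' scores sum to > 0.
from collections import Counter

def train_scores(pos_docs, neg_docs):
    pos, neg = Counter(), Counter()
    for d in pos_docs: pos.update(d)
    for d in neg_docs: neg.update(d)
    vocab = set(pos) | set(neg)
    total_pos = sum(pos.values()) + len(vocab)   # Laplace-smoothed totals
    total_neg = sum(neg.values()) + len(vocab)
    scores = {}
    for f in vocab:
        p = (pos[f] + 1) / total_pos             # p(f|C), add-one smoothed
        q = (neg[f] + 1) / total_neg             # p(f|C')
        scores[f] = (p - q) / (p + q)
    return scores

def classify(doc, scores):
    # Unseen features contribute 0; positive sum means positive review.
    return "positive" if sum(scores.get(f, 0.0) for f in doc) > 0 else "negative"
```

Summing per-feature scores rather than multiplying probabilities is what makes this simpler and less sensitive to document length than naïve Bayes, as the slide notes.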
Test 1 (clean): best results
• Unigrams
– Baseline: 85.0%
– Naïve Bayes + Laplace: 87.0%
– Weighting (log df): 85.7%
– Weighting (Gaussian tf): 85.7%
• Trigrams
– Baseline: 88.7%
– Baseline + _productname: 88.9%

Test 2 (messy): best results
• Unigrams
– Baseline: 82.2%
– Prob. after threshold: 83.3%
– Presence prob. model: 83.1%
– + Odds ratio: 83.3%
• Bigrams
– Baseline: 84.6%
– + Odds ratio: 85.4%
– + SVM: 85.8%
• Variable-length substrings
– Baseline: 85.1%
– + _productname: 85.3%

Extraction
Can we use the techniques from classification to mine reviews?

Obstacles
• Words have different implications in general usage
– "main characters" is negatively biased in reviews, but has innocuous uses elsewhere
• Much non-review subjective content
– previews, sales sites, technical support, passing mentions and comparisons, lists

Results
• Manual test set + web interface
• Substrings did better… bigrams too general?
• Initial filtering is bad
– Make "review" part of the actual search
– Threshold sentence length and score
– Still get non-opinions and ambiguous opinions
• On the test corpus, with red herrings removed: 76% accuracy in the top confidence tercile

Attribute heuristics
• Look for _producttypewords
• Or, just look for things starting with "the" (sketched in code after the Future work slide)
– Of course, with some thresholding
– And stopword filtering

Results
• No way of judging accuracy; choices are arbitrary
• For example, Amstel Light
– Ideally: taste, price, image...
– Statistical method: beer, bud, taste, adjunct, other, brew, lager, golden, imported...
– Dumb heuristic: the taste, the flavor, the calories, the best…

Future work
• Methodology
– More precise corpus
– More tests, more precise tests
• Algorithm
– New combinations of techniques
– Decrease overfitting to improve web search
– Computational complexity
– Ambiguity detection?
• New pieces
– Separate review/non-review classifier
– Review context (source, time)
– Segment pages
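A sketch of the "dumb heuristic" from the Attribute heuristics slide: bigrams beginning with "the", with stopword filtering and a count threshold. The stopword set and the threshold value are assumptions.

```python
# A sketch of the attribute heuristic: collect "the X" bigrams across a
# product's reviews, filter stopword continuations, keep frequent ones.
import re
from collections import Counter

STOPWORDS = {"same", "other", "only", "first", "most", "one", "more"}

def candidate_attributes(reviews, min_count=3):
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        for a, b in zip(tokens, tokens[1:]):
            if a == "the" and b not in STOPWORDS:
                counts["the " + b] += 1
    return [phrase for phrase, n in counts.most_common() if n >= min_count]

# On Amstel Light reviews this surfaces phrases like "the taste",
# "the flavor", "the calories", as on the results slide.
```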