Transcript Document

Mining the peanut gallery:
Opinion extraction
and semantic classification
of product reviews
Kushal Dave (IBM)
Steve Lawrence (Google)
David M. Pennock (Overture)
Work done at NEC Laboratories
Problem
• Many reviews spread out across Web
  – Product-specific sites
  – Editorial reviews at C|net, magazines, etc.
  – User reviews at C|net, Amazon, Epinions, etc.
  – Reviews in blogs, author sites
  – Reviews in Google Groups (Usenet)
• No easy aggregation, but would like to get numerous diverse opinions
• Even at one site, synthesis difficult
Solution!
• Find reviews on the web and mine them…
  – Filtering (find the reviews)
  – Classification (positive or negative)
  – Separation (identify and rate specific attributes)
Existing work
Classification varies in granularity/purpose:
• Objectivity
– Does this text express fact or opinion?
• Words
– What connotations does this word have?
• Sentiments
– Is this text expressing a certain type of opinion?
Objectivity classification
• Best features: relative frequencies of parts of speech (Finn 02)
• Subjectivity is…subjective (Wiebe 01)
Word classification
• Polarity vs. intensity
• Conjunctions
– (Hatzivassiloglou and McKeown, 97)
• Collocations
  – (Turney and Littman 02)
• Linguistic collocations
  – (Lin 98), (Pereira 93)
Sentiment classification
• Manual lexicons
– Fuzzy logic affect sets (Subasic and Huettner 01)
– Directionality – block/enable (Hearst 92)
– Common Sense and emotions (Liu et al 03)
• Recommendations
– Stocks in Yahoo (Das and Chen 01)
– URLs in Usenet (Terveen 97)
– Movie reviews in IMDb (Pang et al 02)
Applied tools
• AvaQuest’s GoogleMovies
  – http://24.60.188.10:8080/demos/GoogleMovies/GoogleMovies.cgi
• Clipping services
  – PlanetFeedback, Webclipping, eWatch, TracerLock, etc.
  – Commonly use templates or simple searches
• Feedback analysis
  – NEC’s Survey Analyzer
  – IBM’s Eclassifier
• Targeted aggregation
  – BlogStreet, onfocus, AllConsuming, etc.
  – Rotten Tomatoes
Our approach
Train/test
• Existing tagged corpora!
• C|net – two cases
  – even, mixed: 10-fold test, 4,480 reviews, 4 categories
  – messy, by category: 7-fold test, 32,970 reviews
Other corpora
• Preliminary work with Amazon
  – (Potentially) easier problem
• Pang et al IMDb movie reviews corpus
  – Our simple algorithm insignificantly worse (80.6 vs. 82.9, unigram SVM)
  – Different sort of problem
Domain obstacles
• Typos, user error, inconsistencies, repeated reviews
• Skewed distributions
  – 5x as many positive reviews
  – 13,000 reviews of MP3 players, 350 of networking kits
  – fewer than 10 reviews for ½ of products
  [Bar chart: number of reviews per product category – Network, TV, Laser, Laptop, PDA, MP3, Camera; y-axis 0–16,000]
• Variety of language, sparse data
  – 1/5 of reviews have fewer than 10 tokens
  – More than 2/3 of terms occur in fewer than 3 documents
• Misleading passages (ambivalence/comparison)
  – "Battery life is good, but..." "Good idea, but..."
Base Features
• Unigrams
  – great, camera, support, easy, poor, back, excellent
• Bigrams
  – can't beat, simply the, very disappointed, quit working
• Trigrams, substrings
  – highly recommend this, i am very happy, not worth it, i sent it back, it stopped working
• Near
  – greater would, find negative, refund this, automatic poor, company totally
Generalization
• Replacing product names (metadata)
  – the iPod is an excellent ==> the _productname is an excellent
• domain-specific words
  – excellent [picture|sound] quality ==> excellent _producttypeword quality
• rare words (statistical)
  – overshadowed by ==> _unique by
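A minimal sketch of this substitution step. The placeholder tokens (_productname, _producttypeword, _unique) come from the slide, but the word lists and the rarity cutoff below are illustrative assumptions:

```python
def generalize(tokens, product_names, type_words, doc_freq, rare_cutoff=2):
    # Replace product names, domain-specific words, and rare words with the
    # slide's placeholder tokens. Word lists and cutoff are assumptions.
    out = []
    for t in tokens:
        if t in product_names:
            out.append("_productname")
        elif t in type_words:
            out.append("_producttypeword")
        elif doc_freq.get(t, 0) <= rare_cutoff:
            out.append("_unique")
        else:
            out.append(t)
    return out

tokens = ["the", "ipod", "is", "an", "excellent", "player"]
doc_freq = {"the": 900, "is": 950, "an": 800, "excellent": 400}
print(generalize(tokens, {"ipod"}, {"player", "picture", "sound"}, doc_freq))
# ['the', '_productname', 'is', 'an', 'excellent', '_producttypeword']
```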
Generalization (2)
• Finding synsets in WordNet
– anxious ==> 2377770
– nervous ==> 2377770
• Stemming
– was ==> be
– compromised ==> compromis
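A hedged sketch of these two generalizations using NLTK, which is an assumption (the slide does not name the tooling), and taking only the first WordNet sense of each word, which is a simplification:

```python
# Assumes NLTK with the WordNet corpus downloaded:
#   pip install nltk && python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Map words to a synset identifier so synonyms share one feature.
for word in ("anxious", "nervous"):
    syn = wn.synsets(word)[0]          # first sense only; a simplification
    print(word, "==>", syn.offset())

# Lemmatization and stemming collapse inflected forms.
print(WordNetLemmatizer().lemmatize("was", pos="v"))   # ==> be
print(PorterStemmer().stem("compromised"))             # ==> compromis
```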
Qualification
• Parsing for collocation
  – this stupid piece of garbage ==> (piece(N):mod:stupid(A))...
  – this piece is stupid and ugly ==> (piece(N):mod:stupid(A))...
• Negation
  – not good or useful ==> NOTgood NOTor NOTuseful
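A sketch of the negation marking shown above; the trigger-word list and the rule that negation scope ends at punctuation are assumptions beyond what the slide states:

```python
NEGATORS = {"not", "no", "never", "n't"}   # illustrative trigger list

def mark_negation(tokens):
    # Prefix tokens following a negator with "NOT", as in the slide's example.
    # Here the scope ends at sentence punctuation (an assumption).
    out, negating = [], False
    for t in tokens:
        if t in {".", ",", "!", "?", ";"}:
            negating = False
            out.append(t)
        elif negating:
            out.append("NOT" + t)
        else:
            out.append(t)
            if t in NEGATORS:
                negating = True
    return out

print(mark_negation(["not", "good", "or", "useful"]))
# ['not', 'NOTgood', 'NOTor', 'NOTuseful']
```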
Optimal features
• Confidence/precision tradeoff
• Try to find best-length substrings
– How to find features?
• Marginal improvement when traversing suffix tree
• Information gain worked best
– How to score?
• [p(C|w) – p(C’|w)] * df
• Dynamic programming (during construction) or simple (during use)
– Still usually short ( n < 5 )
• Although… “i have had no problems with”
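One way to compute the slide's substring score, [p(C|w) – p(C'|w)] * df, from per-class document counts; the smoothing constant is an illustrative assumption:

```python
def substring_score(pos_df, neg_df, smooth=1.0):
    # Slide's criterion: [p(C|w) - p(C'|w)] * df, where df is the substring's
    # document frequency and p(C|w) is the fraction of documents containing
    # the substring that are positive. The add-one smoothing is an assumption.
    df = pos_df + neg_df
    p_pos = (pos_df + smooth) / (df + 2 * smooth)
    p_neg = (neg_df + smooth) / (df + 2 * smooth)
    return (p_pos - p_neg) * df

# A substring seen in 40 positive and 5 negative reviews:
print(substring_score(40, 5))   # high score: strongly positive and common
```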
Clean up
• Thresholding
– Reduce space with minimal impact
– But not too much! (sparse data)
  – Keep features occurring in > 3 docs
• Smoothing: allocate weight to unseen features, regularize other values
– Laplace, Simple Good-Turing, Witten-Bell tried
– Laplace helps Naïve Bayes
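A minimal sketch of the Laplace (add-one) smoothing that the slide says helps naïve Bayes; Simple Good-Turing and Witten-Bell are not shown:

```python
from collections import Counter

def laplace_probs(counts, vocab_size, alpha=1.0):
    # Laplace (add-alpha) smoothing: every feature, seen or unseen, gets a
    # nonzero probability. alpha=1 is classic add-one smoothing.
    total = sum(counts.values())
    denom = total + alpha * vocab_size
    return lambda feature: (counts.get(feature, 0) + alpha) / denom

counts = Counter({"great": 50, "poor": 2})
p = laplace_probs(counts, vocab_size=10000)
print(p("great"), p("never_seen_feature"))
```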
Scoring
• SVMlight, naïve Bayes
– Neither does better in both tests
  – Our scoring is simpler, less reliant on confidence, document length, corpus regularity
• Give each word a score ranging from –1 to 1
  – Our metric: score(fi) = [p(fi|C) – p(fi|C')] / [p(fi|C) + p(fi|C')]
– Tried Fisher discriminant, information gain, odds ratio
• If sum > 0, it’s a positive review
• Other probability models – presence
• Bootstrapping barely helps
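Putting the metric and the decision rule together, a compact sketch; the slide gives only the score formula and the sum > 0 rule, so the probability estimates below use ad-hoc smoothing, which is an assumption:

```python
def feature_score(pos_count, neg_count, pos_total, neg_total, alpha=1.0):
    # score(f) = (p(f|C) - p(f|C')) / (p(f|C) + p(f|C')), ranging from -1 to 1.
    # p(f|C) is estimated as a smoothed relative frequency; the smoothing
    # details are an assumption, not taken from the slide.
    p_pos = (pos_count + alpha) / (pos_total + alpha)
    p_neg = (neg_count + alpha) / (neg_total + alpha)
    return (p_pos - p_neg) / (p_pos + p_neg)

def classify(review_features, scores):
    # Slide's decision rule: if the summed score is > 0, the review is positive.
    total = sum(scores.get(f, 0.0) for f in review_features)
    return "positive" if total > 0 else "negative"

scores = {"excellent": feature_score(400, 40, 10000, 4000),
          "poor": feature_score(30, 300, 10000, 4000)}
print(classify(["poor", "excellent", "excellent"], scores))
```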
Reweighting
• Document frequency
– df, idf, normalized df, ridf
  – Some improvement from log df and Gaussian (arbitrary)
• Analogues for term frequency, product frequency, product type frequency
• Tried bootstrapping, different ways of assigning scores
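The exact weighting functions are not given on the slide; as one plausible reading of the log-df variant, a feature's contribution might be scaled like this (log(1 + df) is an assumed form):

```python
import math

def logdf_weight(score, df):
    # Weight a feature's score by log(1 + document frequency), so
    # better-attested features count for more. The exact form behind the
    # slide's "log df" is not specified; this is an assumed variant.
    return score * math.log(1 + df)

print(logdf_weight(0.8, 120), logdf_weight(0.8, 2))
```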
Test 1 (clean) Best Results

Unigrams
  Baseline                 85.0 %
  Naïve Bayes + Laplace    87.0 %
  Weighting (log df)       85.7 %
  Weighting (Gaussian tf)  85.7 %
Trigrams
  Baseline                 88.7 %
  + _productname           88.9 %
Test 2 (messy) Best Results

Unigrams
  Baseline                 82.2 %
  Prob. after threshold    83.3 %
  Presence prob. model     83.1 %
  + Odds ratio             83.3 %
Bigrams
  Baseline                 84.6 %
  + Odds ratio             85.4 %
  + SVM                    85.8 %
Variable
  Baseline                 85.1 %
  + _productname           85.3 %
Extraction
Can we use the techniques from classification to mine reviews?
Obstacles
• Words have different implications in general usage
  – “main characters” is negatively biased in reviews, but has innocuous uses elsewhere
• Much non-review subjective content
  – previews, sales sites, technical support, passing mentions and comparisons, lists
Results
• Manual test set + web interface
• Substrings better…bigrams too general?
• Initial filtering is bad
– Make “review” part of the actual search
– Threshold sentence length, score
– Still get non-opinions and ambiguous opinions
• Test corpus, with red herrings removed, 76% accuracy in top confidence tercile
Attribute heuristics
• Look for _producttypewords
• Or, just look for things starting with “the”
– Of course, some thresholding
– And stopwords
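A sketch of this heuristic under stated assumptions: two-word phrases starting with "the", a small stopword list, and a frequency threshold, none of which are specified exactly on the slide:

```python
from collections import Counter

STOPWORDS = {"a", "an", "is", "it", "of", "and", "my", "this"}  # illustrative

def the_phrases(reviews, min_count=2):
    # Count two-word phrases beginning with "the", skipping phrases whose
    # second word is a stopword, and keep the frequent ones.
    counts = Counter()
    for review in reviews:
        tokens = review.lower().split()
        for i in range(len(tokens) - 1):
            if tokens[i] == "the" and tokens[i + 1] not in STOPWORDS:
                counts["the " + tokens[i + 1]] += 1
    return [p for p, c in counts.most_common() if c >= min_count]

reviews = ["the taste is great",
           "love the taste and the price",
           "the price is right"]
print(the_phrases(reviews))   # ['the taste', 'the price']
```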
Results
• No way of judging accuracy; judgments are arbitrary
• For example, Amstel Light
  – Ideally: taste, price, image...
  – Statistical method: beer, bud, taste, adjunct, other, brew, lager, golden, imported...
  – Dumb heuristic: the taste, the flavor, the calories, the best…
Future work
• Methodology
  – More precise corpus
  – More tests, more precise tests
• Algorithm
  – New combinations of techniques
  – Decrease overfitting to improve web search
  – Computational complexity
  – Ambiguity detection?
• New pieces
  – Separate review/non-review classifier
  – Review context (source, time)
  – Segment pages