A Survey of Opinion Mining
Dongjoo Lee
Intelligent Database Systems Lab.
Dept. of Computer Science and Engineering
Seoul National University
Introduction
The Web contains a wealth of opinions about products, politics, and
more in newsgroup posts, review sites, and other web sites
A few problems
What is the general opinion on the proposed tax reform?
How is popular opinion on the presidential candidates evolving?
Which of our customers are unsatisfied? Why?
Opinion Mining (OM)
a recent discipline at the crossroads of information retrieval and
computational linguistics which is concerned not with the subject of a
document, but with the opinion it expresses
Related Areas
Data Mining (DM), Information Retrieval (IR), Text Classification
(TC), Text Summarization (TS)
Center for E-Business Technology
Copyright © 2007 by CEBT
IDS Lab. - 2
Agenda
Introduction
Development of Linguistic Resource
Conjunction Method
PMI Method
WordNet Expanding Method
Gloss Use Method
Sentiment Classification
PMI Method
Machine Learning Method
NLP Combined Method
Extracting and Summarizing Opinion Expression
Statistical Approach
NLP Based Approach
Discussion
Development of Linguistic Resource (1)
Linguistic resources can be used to extract opinion and to classify the sentiment
of text
Appraisal Theory
Sentiment related properties are well-defined
A framework of linguistic resources which describes how writers and speakers express
inter-subjective and ideological position
underlying linguistic foundation of OM
Tasks
Determining the subjectivity of a term
Determining term orientation
Determining the strength of term attitude
Example
Objective: vertical, yellow, liquid
Subjective
– Positive: good < excellent
– Negative: bad < terrible
Development of Linguistic Resource (2)
Conjunction Method
PMI Method
Orientation
Subjectivity
WordNet Expansion Method
Gloss Use Method
Orientation
Subjectivity
SentiWordNet
Conjunction Method - overview
Hatzivassiloglou and McKeown, 1997
Hypothesis
Adjectives in ‘and’ conjunctions usually have similar orientation, while ‘but’ is used with
opposite orientation.
Process
1. All conjunctions of adjectives are extracted from the corpus.
2. A log-linear regression model combines information from different conjunctions to determine whether each pair of conjoined adjectives has the same or different orientation.
3. A clustering algorithm separates the adjectives into two subsets of different orientation, placing as many words of the same orientation as possible into the same subset.
4. The average frequencies in each group are compared, and the group with the higher frequency is labeled as positive.
[Figure: corpus, ‘and’/‘but’ conjunctions, seed terms, positive and negative groups]
Randomly selected adjectives with known positive or negative orientation were used as seed terms to predict orientation.
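The four steps above can be sketched in a much-simplified form. This is not the original method: the sketch assumes the same/different-orientation links of step 2 are already given (the original predicts them with a log-linear regression model), replaces the clustering of step 3 with a breadth-first two-coloring of the link graph, and applies step 4 separately to each connected component. The function name, links, and frequencies are hypothetical.

```python
from collections import defaultdict, deque

def split_by_orientation(links, freq):
    """links: (adj1, adj2, same) triples; same=True for an 'and'-style link,
    False for a 'but'-style link (assumed to be given, not learned).
    freq: dict adjective -> corpus frequency.
    Returns (positive_set, negative_set)."""
    graph = defaultdict(list)
    for a, b, same in links:
        graph[a].append((b, same))
        graph[b].append((a, same))

    def avg_freq(group):
        return sum(freq.get(w, 0) for w in group) / len(group) if group else 0.0

    positive, negative, seen = set(), set(), set()
    for start in sorted(graph):
        if start in seen:
            continue
        # Two-color one connected component: same-links keep the side,
        # different-links flip it.
        side = {start: 0}
        queue = deque([start])
        while queue:
            word = queue.popleft()
            for other, same in graph[word]:
                if other not in side:
                    side[other] = side[word] if same else 1 - side[word]
                    queue.append(other)
        seen.update(side)
        groups = [[w for w in side if side[w] == k] for k in (0, 1)]
        # Step 4 (simplified): the more frequent group is labeled positive.
        high, low = sorted(groups, key=avg_freq, reverse=True)
        positive.update(high)
        negative.update(low)
    return positive, negative

# Toy data (hypothetical links and frequencies)
links = [("fair", "legitimate", True), ("fair", "brutal", False),
         ("legitimate", "simple", True), ("simple", "simplistic", False),
         ("brutal", "simplistic", True)]
freq = {"fair": 100, "legitimate": 40, "simple": 200, "brutal": 10, "simplistic": 5}
pos, neg = split_by_orientation(links, freq)
```

The sketch ignores conflicting links; handling such conflicts is what the clustering step in the original method is for.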
Conjunction Method – objective function and constraints
Select the partition P_min that minimizes Φ(P)
|Ci| : the cardinality of cluster i
d(x, y) : the dissimilarity between adjectives x and y
Dissimilarity between adjectives in the same cluster is minimized, and dissimilarity between adjectives in different clusters is maximized.
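The formula for the objective function did not survive in this transcript. Using the symbols defined above, and writing P for a partition of the adjectives into two clusters C1 and C2, the objective in Hatzivassiloglou and McKeown (1997) has the following form. It is reconstructed here as an assumption, not copied from the slide:

```latex
\Phi(P) = \sum_{i=1}^{2} \left( \frac{1}{|C_i|}
          \sum_{\substack{x, y \in C_i \\ x \neq y}} d(x, y) \right),
\qquad
P_{\min} = \arg\min_{P} \Phi(P)
```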
Experiments
HM term set : 1,336 adjectives
– 657 positive, 679 negative terms
Methods to improve performance of orientation prediction
– But rule : most conjoined adjectives had the same orientation, while those linked by ‘but’ mostly had opposite orientation
– log-linear regression model
– morphological relationship : adequate-inadequate or thoughtful-thoughtless
log-linear model with morphological relationship : 82.5% accuracy
PMI Method - overview
Pointwise Mutual Information (PMI)
a measure of association used in information theory and statistics
Orientation
– Turney and Littman, 2003
– terms with similar orientation tend to co-occur in documents
Subjectivity
– Baroni and Vegnaduzzo, 2004
– subjective adjectives tend to occur near other subjective adjectives
PMI Method – predicting semantic orientation
Modified PMI was measured using the number of results returned by the
AltaVista search engine with NEAR operator
t : target term
ti : paradigmatic term
Predicting semantic orientation of a term SO(t)
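The formulas for PMI and SO(t) are missing from this transcript. A minimal sketch follows, assuming the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))) estimated from hit counts, and SO(t) as the sum of PMI with positive paradigmatic terms minus the sum of PMI with negative ones. The hit counts, the smoothing constant, and all function names are illustrative assumptions; no search engine is queried.

```python
import math

def pmi(hits_near, hits_a, hits_b, total, smooth=0.01):
    """PMI from hit counts: log2(p(a, b) / (p(a) * p(b))).
    smooth avoids log(0) when the two terms never co-occur (assumed value)."""
    p_ab = (hits_near + smooth) / total
    return math.log2(p_ab / ((hits_a / total) * (hits_b / total)))

def semantic_orientation(t, pos_terms, neg_terms, hits, near_hits, total):
    """SO(t) = sum of PMI(t, ti) over positive paradigmatic terms ti
             - sum of PMI(t, ti) over negative paradigmatic terms ti."""
    so = 0.0
    for ti in pos_terms:
        so += pmi(near_hits.get((t, ti), 0), hits[t], hits[ti], total)
    for ti in neg_terms:
        so -= pmi(near_hits.get((t, ti), 0), hits[t], hits[ti], total)
    return so

# Toy hit counts (hypothetical, standing in for "t NEAR ti" query results)
hits = {"superb": 1000, "awful": 1000, "good": 10000, "bad": 10000}
near_hits = {("superb", "good"): 200, ("superb", "bad"): 20,
             ("awful", "good"): 15, ("awful", "bad"): 300}
total = 1_000_000

so_superb = semantic_orientation("superb", ["good"], ["bad"], hits, near_hits, total)
so_awful = semantic_orientation("awful", ["good"], ["bad"], hits, near_hits, total)
```

A positive SO(t) suggests positive orientation and a negative value suggests negative orientation; the magnitude can be read as strength.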
Experiments
With HM term set and three corpora
Corpus    Approx. # of words in corpus    Accuracy
AV-ENG    1 × 10^11                       87.13%
AV-CA     2 × 10^9                        80.31%
TASA      1 × 10^7                        61.83%
– With a small corpus, accuracy is no higher than that of the conjunction method.
– With a large corpus, accuracy is higher than that of the conjunction method.
WordNet Expansion Method
Hu et al., 2004
Hypothesis
Synonym and antonym relationships between words are used
adjectives usually share the same orientation as their synonyms and the opposite
orientation to their antonyms
Starting from a set of seed adjectives, the orientation of all adjectives in WordNet can be
assigned by a procedure that explores the cluster graphs.
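The propagation idea can be sketched as follows. This is an illustrative version only: it uses small hand-written synonym and antonym dictionaries in place of WordNet, and the function name and word lists are hypothetical.

```python
def expand_orientation(seeds, synonyms, antonyms):
    """seeds: dict adjective -> +1 (positive) or -1 (negative).
    synonyms / antonyms: dict adjective -> list of related adjectives
    (toy stand-ins for WordNet relations).
    Returns a dict with an orientation for every reachable adjective."""
    orientation = dict(seeds)
    frontier = list(seeds)
    while frontier:
        word = frontier.pop()
        related = [(s, 1) for s in synonyms.get(word, ())]
        related += [(a, -1) for a in antonyms.get(word, ())]
        for other, sign in related:
            if other not in orientation:
                # Synonyms inherit the orientation; antonyms get the opposite.
                orientation[other] = sign * orientation[word]
                frontier.append(other)
    return orientation

# Toy relations (hypothetical)
synonyms = {"good": ["fine", "decent"], "bad": ["poor"], "fine": ["nice"]}
antonyms = {"good": ["bad"], "nice": ["nasty"]}
result = expand_orientation({"good": 1}, synonyms, antonyms)
```

Words that cannot be reached from any seed remain unlabeled, which reflects the method's dependence on the coverage of the thesaurus.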
Gloss Use Method - overview
Esuli et al., 2005, 2006
Hypothesis
Orientation
– terms with similar orientation have similar glosses
good: that which is pleasing or valuable or useful; agreeable or pleasing
beautiful: aesthetically pleasing
pretty: pleasing by delicacy or grace; not imposing
Subjectivity
– terms with similar orientation have similar glosses
– terms without orientation have non-oriented glosses
yellow: similar to the color of an egg yolk
vertical: at right angles to the plane of the horizon or a base line
SentiWordNet
All words in WordNet have three scores
– positivity, negativity, and objectivity
Each term sense is positioned in an inverted triangle
Gloss Use Method – classification process
Process
1. A seed set (Lp, Ln) is provided as input.
2. Lexical relations (e.g. synonymy) from a thesaurus, or online dictionary, are used to extend the seed set. Once added to the original ones, the new terms yield two new, richer sets Trp and Trn; together they form the training set for the learning phase of Step 4.
3. For each term ti in Trp∪Trn or in the test set, a textual representation of ti is generated by collating all the glosses of ti as found in a machine-readable dictionary. Each such representation is converted into vectorial form by standard text indexing techniques.
4. A binary text classifier is trained on the terms in Trp∪Trn and then applied to the terms in the test set.
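Steps 3 and 4 can be sketched with a tiny multinomial Naive Bayes classifier over gloss words. This is a simplification, not the setup used in the experiments: it skips the seed expansion of Step 2, uses whitespace tokenization in place of standard text indexing, and the training glosses and function names are illustrative.

```python
import math
from collections import Counter

def train_nb(glosses_by_label):
    """glosses_by_label: dict label -> list of gloss strings, one per training term."""
    n_docs = sum(len(g) for g in glosses_by_label.values())
    priors, counts, vocab = {}, {}, set()
    for label, glosses in glosses_by_label.items():
        counts[label] = Counter(w for g in glosses for w in g.lower().split())
        vocab.update(counts[label])
        priors[label] = math.log(len(glosses) / n_docs)
    return priors, counts, vocab

def classify(gloss, model):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    priors, counts, vocab = model
    best_label, best_score = None, -math.inf
    for label in priors:
        total = sum(counts[label].values())
        score = priors[label]
        for w in gloss.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training glosses (illustrative)
model = train_nb({
    "positive": ["that which is pleasing or valuable or useful",
                 "aesthetically pleasing"],
    "negative": ["having undesirable or negative qualities",
                 "extremely bad or unpleasant"],
})
label_pretty = classify("pleasing by delicacy or grace", model)
label_other = classify("unpleasant or bad", model)
```

With real data, the gloss of each test term would be looked up in a dictionary and classified in the same way.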
Experiments
Classifier : NB, SVM, PrTFIDF
87.38% Accuracy
Development of Linguistic Resource - Summary
Conjunction Method
Intuition: adjectives in ‘and’ conjunctions usually have similar orientation, though ‘but’ is used with opposite orientation
Accuracy: 78.08%
Characteristics: the first try; test data: 1336 adjectives

PMI Method
Intuition: terms with similar orientation tend to co-occur in documents
Accuracy: 87.13%
Characteristics: no limitation; much time required

WordNet Expansion Method
Intuition: adjectives usually share the same orientation as their synonyms and opposite orientation as their antonyms
Accuracy: N/A
Characteristics: limited to WordNet

Gloss Use Method
Intuition: terms with similar orientation have similar glosses; terms without orientation have non-oriented glosses
Accuracy: 87.38%
Characteristics: SentiWordNet (all words in WordNet); accuracy depends on the quality of the thesaurus
Sentiment Classification
The process of identifying the sentiment – or polarity – of a piece of
text or a document.
Document-level
Sentence-level, phrase-level
Feature-level
– Define target of the opinion and assign the sentiment of the target
Document-level Sentiment Classification Method
PMI method
Machine Learning Method
– Default Classifiers
– Enhanced Classifier
NLP Combined Method
– A Two-Step Classification
– Combining Appraisal Theory
PMI Method
Turney, 2002
Process
Only two-word phrases containing adjectives or adverbs are extracted
Semantic orientation of a phrase
– SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”)
The semantic orientation of a review is the average semantic orientation of its phrases
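The process can be sketched as below. Since PMI(phrase, "excellent") and PMI(phrase, "poor") share the term for the phrase itself, their difference reduces to a single log ratio of hit counts. The hit counts, the smoothing constant, and the function names are illustrative assumptions, and phrase extraction is assumed to have been done already.

```python
import math

def phrase_so(near_excellent, near_poor, hits_excellent, hits_poor, smooth=0.01):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor')
                  = log2(hits(phrase NEAR excellent) * hits(poor)
                         / (hits(phrase NEAR poor) * hits(excellent))).
    smooth avoids division by zero (assumed value)."""
    return math.log2(((near_excellent + smooth) * hits_poor) /
                     ((near_poor + smooth) * hits_excellent))

def classify_review(phrases, near_counts, hits_excellent, hits_poor):
    """Average the SO of the extracted phrases; a positive average means a
    positive review. phrases must be non-empty."""
    if not phrases:
        raise ValueError("no phrases extracted from the review")
    scores = [phrase_so(*near_counts[p], hits_excellent, hits_poor) for p in phrases]
    average = sum(scores) / len(scores)
    return ("positive" if average > 0 else "negative"), average

# Toy counts (hypothetical): phrase -> (hits NEAR "excellent", hits NEAR "poor")
near_counts = {"low fees": (120, 30), "unethical practices": (5, 60)}
label_a, avg_a = classify_review(["low fees"], near_counts, 1000, 1000)
label_b, avg_b = classify_review(["low fees", "unethical practices"],
                                 near_counts, 1000, 1000)
```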
Experiments
410 reviews from Epinions (epinion.com): 170 positive, 240 negative
Calculating the PMI of 10,658 phrases from the 410 reviews took about 30 hours
Domain of review        Accuracy
Automobiles             84.00%
- Honda Accord          83.78%
- Volkswagen Jetta      84.21%
Banks                   80.00%
- Bank of America       78.33%
- Washington Mutual     81.67%
Movies                  65.83%
- The Matrix            66.67%
- Pearl Harbor          65.00%
Travel Destination      70.53%
- Cancun                64.41%
- Puerto Vallarta       80.56%
ML - Default Classifier
Pang and Lee, 2002
A special case of text categorization with sentiment- rather than topic-based
categories
Document modeling
standard bag-of-features framework
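As an illustration of the bag-of-features representation, the sketch below builds unigram (and optionally bigram) features from a token list, either as frequencies or as presence flags. The function name and the example tokens are hypothetical; the resulting dictionaries would then be fed to a classifier such as NB, ME, or SVM.

```python
from collections import Counter

def extract_features(tokens, use_bigrams=False, presence=True):
    """Map a tokenized document to a feature dictionary.
    presence=True  -> each feature has value 1 if it occurs at all
    presence=False -> each feature has its occurrence count"""
    feats = Counter(tokens)
    if use_bigrams:
        # Adjacent token pairs, joined by a space, become extra features.
        feats.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    if presence:
        return {f: 1 for f in feats}
    return dict(feats)

tokens = "a great great movie".split()
by_presence = extract_features(tokens)
by_frequency = extract_features(tokens, presence=False)
with_bigrams = extract_features(tokens, use_bigrams=True)
```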
Experiments
Data : movie reviews (Internet Movie Database), rating -> negative, neutral, positive
Naïve Bayes, Maximum Entropy, Support Vector Machine
Features                # of features   Frequency or presence?   NB     ME     SVM
(1) unigrams            16165           freq.                    78.7   N/A    72.8
(2) unigrams            16165           pres.                    81.0   80.4   82.9
(3) unigrams+bigrams    32330           pres.                    80.6   80.8   82.7
(4) bigrams             16165           pres.                    77.3   77.4   77.1
(5) unigrams+POS        16695           pres.                    81.5   80.4   81.9
(6) adjectives          2633            pres.                    77.0   77.7   75.1
(7) top 2633 unigrams   2633            pres.                    80.3   81.0   81.4
(8) unigrams+position   22430           pres.                    81.0   80.1   81.6
In terms of relative performance, Naïve Bayes tends to do the worst and SVM tends to
do the best, although the differences aren’t very large.
ML - Using Only Subjective Sentences
Pang and Lee, 2004
Polarity classification is improved by removing objective sentences
A subjectivity detector determines whether each sentence is subjective or not
Standard subjectivity classifier
Subjectivity classifier using proximity relationship
Using subjectivity extracts can improve polarity classification, or at least causes no loss of accuracy.
NLP Combined Method – A Two-Step Classification
Wilson et al., 2005
A Two-Step Contextual Polarity Classification
employ machine learning and 28 linguistic features
document polarity : the average polarity of phrases
Step 1. Neutral-polar classifier classifies each phrase containing a clue as neutral or polar
Step 2. Polarity classifier takes all phrases marked in step 1 as polar and disambiguates
their contextual polarity (positive, negative, both, or neutral).
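The control flow of the two steps can be sketched as follows. The two classifiers in the original are learned from annotated data with linguistic features; here they are replaced by toy rule-based stand-ins (a small prior-polarity lexicon and a negation check), so the lexicon, rules, and function names are all hypothetical.

```python
PRIOR = {"good": 1, "great": 1, "bad": -1, "trust": 1}  # toy prior-polarity lexicon
NEGATORS = {"not", "never", "no"}

def is_polar(phrase):
    """Step 1 stand-in: a phrase is polar if it contains a lowercase lexicon clue.
    A capitalized 'Trust' inside a name is therefore treated as neutral."""
    return any(tok in PRIOR for tok in phrase.split())

def contextual_polarity(phrase):
    """Step 2 stand-in: sum the prior polarities and flip the sign under negation."""
    tokens = phrase.split()
    score = sum(PRIOR.get(tok, 0) for tok in tokens)
    if any(tok in NEGATORS for tok in tokens):
        score = -score
    return "positive" if score > 0 else "negative" if score < 0 else "both"

def two_step(phrases):
    """Run step 1 on every phrase, then step 2 only on the polar ones."""
    return {p: contextual_polarity(p) if is_polar(p) else "neutral" for p in phrases}

labels = two_step(["not good", "great value", "Environmental Trust"])
```

The point of the split is that the second classifier only sees phrases already judged polar, so it can concentrate on disambiguating polarity in context.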
28 features were extracted using NLP techniques with a dependency parser
4 Word Features, 8 Modification Features, 11 Structure Features, 3 Sentence Features,
1 Document Feature
Experiments
Data : Multi-perspective Question Answering (MPQA) Opinion Corpus
Neutral-polar classification (%)
Features          Accuracy
Word token        73.6
Word+priorpol     74.2
28 features       75.9

Polarity classification (%)
Features          Accuracy
Word token        61.7
Word+priorpol     63.0
10 features       65.7
NLP Combined Method - Combining Appraisal Theory
Whitelaw et al., 2005
applied the appraisal theory to the machine learning methods of Pang and Lee
Structure of an appraisal
An example “not very happy”
Experiments
A lexicon of 1329 appraisal entities was produced semi-automatically from 400 seed
terms in around twenty man-hours
combining attitude type and orientation : accuracy 90.2%.
Sentiment Classification - Summary
PMI Method
Characteristics: uses phrase PMI
Pros: simple; needs no prior polarity dictionary
Cons: loss of contextual meaning; slow (time to get PMI)

Machine Learning Method
Characteristics: bag of words; unigrams to bigrams or n-grams; SVM, NB, MaxEnt
Pros: simple; needs no prior polarity dictionary
Cons: loss of contextual meaning; needs a learning phase

NLP Combined Method
Characteristics: based on ML; parsing or syntactic analysis; prior polarity to contextual polarity
Pros: considers contextual meaning; easily extendible for various purposes
Cons: needs a prior polarity dictionary; syntactic analysis overhead
Extracting and Summarizing Opinion Expression
Goal
Extract opinion expressions from large collections of reviews and present them in an effective way
Tasks
Feature Extraction
– Sentiment classification at the feature-level requires the extraction of features that are the target of opinion words
Sentiment Assignment
– Each feature is usually classified as being either favorable or unfavorable.
Visualization
– Extracted opinion expressions are summarized and visualized.
Methods
Statistical Approaches
– ReviewSeer (2003)
– Opinion Observer (2004)
– Red Opal (2007)
NLP-Based Approaches
– Kanayama System (2004)
– WebFountain (2005)
– OPINE (2005)
[Figure: product reviews, Extract Features, Assign Sentiment, Summarize]
Opinion Observer
- Overview
Hu and Liu, 2005
Extracts and summarizes opinion expressions from customer reviews on the Web.
Only mines the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative
Overall process
1. Review crawling
2. Feature extraction
3. Sentiment assignment
– Opinion word extraction
– Opinion orientation identification
4. Summary generation
Opinion Observer - Tasks
Feature Extraction
Product features are extracted from nouns and noun phrases by the association miner CBA
Compactness pruning, redundancy pruning
Sentiment Assignment
Opinion sentence : a sentence that contains one or more product features and one or more opinion words
Adjectives are the only opinion words
Prior polarity of adjectives was identified by the WordNet expansion method with seed terms
Infrequent features are extracted by using frequent opinion words
The polarity of a sentence is the dominant orientation of its opinion words
Extracted form : (product feature, # of positive sentences, # of negative sentences)
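The sentiment assignment and the extracted form can be sketched as below. The sketch assumes features and adjectives have already been extracted per sentence and that prior polarities are available as a dictionary. Sentences with no dominant orientation are simply skipped here, which is a simplification. All names and data are illustrative.

```python
from collections import defaultdict

def sentence_polarity(adjectives, prior):
    """Dominant orientation of the opinion words: sign of the summed priors."""
    score = sum(prior.get(a, 0) for a in adjectives)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None  # tie or no known opinion word (skipped in this sketch)

def summarize(opinion_sentences, prior):
    """opinion_sentences: list of (features, adjectives) pairs, one per sentence.
    Returns feature -> (# of positive sentences, # of negative sentences)."""
    summary = defaultdict(lambda: [0, 0])
    for features, adjectives in opinion_sentences:
        label = sentence_polarity(adjectives, prior)
        if label is None:
            continue
        for feature in features:
            summary[feature][0 if label == "positive" else 1] += 1
    return {f: tuple(c) for f, c in summary.items()}

# Toy data (hypothetical)
prior = {"great": 1, "sharp": 1, "poor": -1, "blurry": -1}
sentences = [(["picture"], ["great", "sharp"]),
             (["battery"], ["poor"]),
             (["picture"], ["blurry"]),
             (["battery", "screen"], ["great"])]
summary = summarize(sentences, prior)
```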
Experiments
Large collection of reviews of 15 electronic products
86.3% recall, 84.0% precision
Opinion Observer - Visualization
Features of products are compared in a bar graph
The numbers of positive and negative sentences for each feature are normalized
Positive portion
Negative portion
Web Fountain - Overview
Yi et al., 2005
Extracts target features of the sentiment from the various resources and assigns
polarity to the features
System Architecture
Sentiment Miner
Analyzes grammatical sentence structures and phrases by using NLP techniques
Web Fountain – Tasks
Feature Extraction
Candidate features
– a part-of relationship with the given topic
– an attribute-of relationship with the given topic
– an attribute-of relationship with a known feature of the given topic
bBNP (Beginning definite Base Noun Phrase) heuristic is used
Select bnp (base noun phrase) that has high likelihood ratio
Experiments
– Precision - digital camera: 97%, music reviews: 100%
Sentiment Assignment
Parse and traverse with two linguistic resources
– Sentiment lexicon: defines the sentiment polarity of terms
– Sentiment pattern database: contains the sentiment assignment patterns of predicates
Experiments
– Product review
– Recall 56%, Precision 87%
Web Fountain – Visualization
Web interface listing sentiment bearing sentences about a given product
Extracting and Summarizing Opinion Expression - Summary
Statistical

ReviewSeer (2003)
Feature Extraction: N/A
Sentiment Assignment: probabilistic model, Naïve Bayes; Accuracy: 85.3%
Visualization: lists each feature term and its score, and shows sentences containing the feature term

Opinion Observer (2004)
Feature Extraction: CBA miner, infrequent feature selection; Recall: 86.3%, Precision: 84.0%
Sentiment Assignment: WordNet expansion, prior polarity of adjectives
Visualization: graph

Red Opal (2007)
Feature Extraction: frequent nouns and noun phrases; Precision: 85%
Sentiment Assignment: uses user's rating; Precision: 80%
Visualization: product list ordered by the score of each feature, with the confidence of the scoring

NLP-based

Kanayama’s system (2004)
Sentiment Assignment: sentiment unit, modifying the machine translation framework; Recall: 43%, Precision: 89%
Visualization: N/A

WebFountain (2005)
Feature Extraction: bBNP heuristics, likelihood ratio; Precision: 97%
Sentiment Assignment: sentiment lexicon, sentiment pattern database; Recall: 56%, Precision: 87%
Visualization: lists sentiment-bearing sentences of a product

OPINE (2005)
Feature Extraction: Web PMI; Recall: 76%, Precision: 79%
Sentiment Assignment: relaxation labeling; Recall: 89%, Precision: 86%
Visualization: N/A
Discussion
OM is a growing research discipline related to various research areas, such as IR,
computational linguistics, TC, TS, and DM.
Three topics were surveyed and summarized.
For Korean OM?
There is no published research on Korean OM.
Language differences may impose some limits on the methods used in the OM subtasks.
– Structural differences between English and Korean may mean that the same heuristics cannot be applied to extract features from text
– The lack of a Korean thesaurus similar to WordNet limits the methods of obtaining the prior polarity of words for the PMI or conjunction methods.
Research into Korean OM must be conducted in conjunction with other related areas.
Discussion - Research Map of OM
Thank you