A Survey of Opinion Mining
Dongjoo Lee
Intelligent Database Systems Lab.
Dept. of Computer Science and Engineering
Seoul National University
Introduction

The Web contains a wealth of opinions about products, politics, and
more in newsgroup posts, review sites, and other web sites

A few problems


What is the general opinion on the proposed tax reform?

How is popular opinion on the presidential candidates evolving?

Which of our customers are unsatisfied? Why?
Opinion Mining (OM)

a recent discipline at the crossroads of information retrieval and
computational linguistics which is concerned not with the subject of a
document, but with the opinion it expresses
Related Areas

Data Mining (DM), Information Retrieval (IR), Text Classification
(TC), Text Summarization (TS)
Center for E-Business Technology
Copyright © 2007 by CEBT
IDS Lab. - 2
Agenda

Introduction

Development of Linguistic Resource




Conjunction Method

PMI Method

WordNet Expansion Method

Gloss Use Method
Sentiment Classification

PMI Method

Machine Learning Method

NLP Combined Method
Extracting and Summarizing Opinion Expression

Statistical Approach

NLP Based Approach
Discussion
Development of Linguistic Resource (1)

Linguistic resources can be used to extract opinion and to classify the sentiment
of text

Appraisal Theory



Sentiment related properties are well-defined

A framework of linguistic resources which describes how writers and speakers express
inter-subjective and ideological position

underlying linguistic foundation of OM
Tasks

Determining the subjectivity of a term

Determining term orientation

Determining the strength of term attitude
Example

Objective: vertical, yellow, liquid

Subjective
– Positive: good < excellent
– Negative: bad < terrible
Development of Linguistic Resource (2)

Conjunction Method

PMI Method

Orientation

Subjectivity

WordNet Expansion Method

Gloss Use Method

Orientation

Subjectivity

SentiWordNet
Conjunction Method - overview

Hatzivassiloglou and McKeown, 1997

Hypothesis

Adjectives in ‘and’ conjunctions usually have similar orientation, while ‘but’ is used with
opposite orientation.
Process

1. All conjunctions of adjectives are extracted from the corpus.
2. A log-linear regression model combines information from
different conjunctions to determine whether each pair of conjoined
adjectives has the same or different orientation.
3. A clustering algorithm separates the adjectives into two
subsets of different orientation. It places as many words of
the same orientation as possible into the same subset.
4. The average frequencies in each group are compared and
the group with the higher frequency is labeled as positive.
[Figure labels: corpus, and, but, seed terms, positive, negative]
Randomly selected adjectives with positive and negative orientation seed terms were
used to predict orientation.
Conjunction Method – objective function and constraints

Select the partition Pmin that minimizes Φ(P)
|Ci| : the cardinality of cluster i
d(x, y) : the dissimilarity between adjectives x and y

The dissimilarity between adjectives in the same cluster is minimized, and the
dissimilarity between adjectives in different clusters is maximized.
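
The formula for Φ itself did not survive on this slide. The following is a minimal sketch, assuming the objective given by Hatzivassiloglou and McKeown (1997): for each of the two clusters, the pairwise dissimilarities inside the cluster are summed and divided by |Ci|, and the two results are added. The adjectives and dissimilarity values are made up, and the exhaustive search is only a stand-in for the search procedure of the original method.

```python
from itertools import combinations

def phi(partition, d):
    """Sum over clusters of (1/|C_i|) * total pairwise dissimilarity inside C_i.

    Each unordered pair is counted once; counting ordered pairs instead would
    only scale the value by 2 and would not change which partition is best.
    """
    total = 0.0
    for cluster in partition:
        if not cluster:
            continue
        within = sum(d[frozenset(pair)] for pair in combinations(sorted(cluster), 2))
        total += within / len(cluster)
    return total

def best_two_way_partition(words, d):
    """Try every split into two non-empty clusters (feasible for toy sizes only)."""
    words = sorted(words)
    best = None
    for size in range(1, len(words)):
        for left in combinations(words, size):
            c1, c2 = set(left), set(words) - set(left)
            score = phi((c1, c2), d)
            if best is None or score < best[0]:
                best = (score, c1, c2)
    return best

# Hypothetical dissimilarities: low inside an orientation, high across.
words = ["good", "nice", "bad", "awful"]
d = {frozenset(pair): 0.9 for pair in combinations(words, 2)}
d[frozenset(("good", "nice"))] = 0.1
d[frozenset(("bad", "awful"))] = 0.1

score, c1, c2 = best_two_way_partition(words, d)
print(score, sorted(c1), sorted(c2))
```

On this toy input the minimizing partition groups good with nice and bad with awful, which is the behaviour the slide describes.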
Experiments

HM term set : 1,336 adjectives
– 657 positive, 679 negative terms
Methods to improve performance of orientation prediction
– But rule : most conjunctions linked adjectives of the same orientation, while
conjunctions linked by ‘but’ mostly had opposite orientation
– log-linear regression model
– morphological relationship : adequate-inadequate or thoughtful-thoughtless
log-linear model with morphological relationship : 82.5% accuracy
PMI Method - overview

Pointwise Mutual Information (PMI)

a measure of association used in information theory and statistics

Orientation
– Turney and Littman, 2003
– terms with similar orientation tend to co-occur in documents
Subjectivity
– Baroni and Vegnaduzzo, 2004
– subjective adjectives tend to occur near other subjective adjectives
PMI Method – predicting semantic orientation
Modified PMI was measured using the number of results returned by the
AltaVista search engine with NEAR operator
t : target term
ti : paradigmatic term

Predicting semantic orientation of a term SO(t)
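
The formulas on this slide were lost in extraction. The following is a minimal sketch, assuming SO(t) is the sum of PMI(t, ti) over positive paradigmatic terms minus the sum over negative ones, with PMI(t, ti) = log2(p(t & ti) / (p(t) p(ti))) estimated from search hit counts. The paradigm words are a small illustrative subset, every hit count is made up, and the smoothing constant `eps` is an assumption.

```python
import math

POSITIVE = ["good", "excellent"]  # illustrative positive paradigmatic terms
NEGATIVE = ["bad", "poor"]        # illustrative negative paradigmatic terms

def pmi(hits_near, hits_a, hits_b, total, eps=0.01):
    """PMI estimated from hit counts; eps avoids log(0) when nothing co-occurs."""
    return math.log2(((hits_near + eps) * total) / (hits_a * hits_b))

def semantic_orientation(term, hits, near_hits, total):
    """SO(t) = sum of PMI with positive terms - sum of PMI with negative terms."""
    pos = sum(pmi(near_hits.get((term, p), 0), hits[term], hits[p], total)
              for p in POSITIVE)
    neg = sum(pmi(near_hits.get((term, n), 0), hits[term], hits[n], total)
              for n in NEGATIVE)
    return pos - neg

# Hypothetical hit counts, standing in for "t NEAR ti" search results.
total = 1_000_000
hits = {"superb": 1000, "dreadful": 800,
        "good": 50000, "excellent": 20000, "bad": 40000, "poor": 30000}
near_hits = {
    ("superb", "good"): 200, ("superb", "excellent"): 150,
    ("superb", "bad"): 20, ("superb", "poor"): 10,
    ("dreadful", "good"): 8, ("dreadful", "excellent"): 4,
    ("dreadful", "bad"): 120, ("dreadful", "poor"): 90,
}

for term in ("superb", "dreadful"):
    print(term, round(semantic_orientation(term, hits, near_hits, total), 2))
```

A positive SO(t) is read as positive orientation and a negative SO(t) as negative orientation.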

Experiments

With HM term set and three corpora
Corpus    Approx. # of words in corpus    Accuracy
AV-ENG    1 × 10^11                       87.13%
AV-CA     2 × 10^9                        80.31%
TASA      1 × 10^7                        61.83%
– With small corpus, accuracy isn’t higher than conjunction method.
– With large corpus, accuracy is higher than conjunction method.
WordNet Expansion Method

Hu et al., 2004

used synonym and antonym relationships between words

Hypothesis

adjectives usually share the same orientation as their synonyms and the opposite
orientation as their antonyms

By using a set of seed adjectives, the orientation of all adjectives in WordNet can be
assigned through a procedure that explores the cluster graphs.
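
The propagation idea can be sketched as a breadth-first walk: synonyms inherit a word's orientation and antonyms receive the opposite one. This is an illustration under stated assumptions, not the original procedure. The synonym and antonym dictionaries are hand-written stand-ins for WordNet lookups, and the relations are followed only in the direction they are listed.

```python
from collections import deque

def propagate_orientation(seeds, synonyms, antonyms):
    """Spread +1 / -1 labels from seed adjectives over synonym and antonym links."""
    orientation = dict(seeds)
    queue = deque(seeds)  # start from the seed words
    while queue:
        word = queue.popleft()
        for other in synonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = orientation[word]   # same orientation
                queue.append(other)
        for other in antonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = -orientation[word]  # opposite orientation
                queue.append(other)
    return orientation

# Hypothetical toy lexicon (not taken from WordNet).
synonyms = {"good": ["fine", "great"], "great": ["superb"], "bad": ["awful"]}
antonyms = {"good": ["bad"], "superb": ["inferior"]}

result = propagate_orientation({"good": 1}, synonyms, antonyms)
print(result)
```

Adjectives that are not reachable from any seed stay unlabeled, which is one way to see why the method is limited to what the thesaurus covers.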
Gloss Use Method - overview

Esuli et al., 2005, 2006

Hypothesis

Orientation
– terms with similar orientation have similar glosses
good: that which is pleasing or valuable or useful; agreeable or pleasing
beautiful: aesthetically pleasing
pretty: pleasing by delicacy or grace; not imposing
Subjectivity
– terms with similar orientation have similar glosses
– terms without orientation have non-oriented glosses
yellow: similar to the color of an egg yolk
vertical: at right angles to the plane of the horizon or a base line

SentiWordNet

All words in WordNet have three scores
– positivity, negativity, and objectivity
Each term sense is positioned in an inverted triangle
Gloss Use Method – classification process

Process
1. A seed set (Lp, Ln) is provided as input.
2. Lexical relations (e.g. synonymy) from a thesaurus
or online dictionary are used to extend the seed set.
Once added to the original ones, the new terms
yield two new, richer sets Trp and Trn; together
they form the training set for the learning phase of
Step 4.
3. For each term ti in Trp∪Trn or in the test set, a
textual representation of ti is generated by
collating all the glosses of ti as found in a
machine-readable dictionary. Each such representation is
converted into vectorial form by standard text
indexing techniques.
4. A binary text classifier is trained on the terms in
Trp∪Trn and then applied to the terms in the test
set.
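
Steps 3 and 4 can be sketched with a tiny bag-of-words Naive Bayes classifier over glosses. This is a minimal illustration, not the original system: the seed expansion of Step 2 is skipped, the positive glosses are the ones quoted in this deck, the negative glosses are hypothetical, and Laplace smoothing is an assumed detail.

```python
import math
import re
from collections import Counter

def tokenize(gloss):
    """Lowercase a gloss and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", gloss.lower())

def train(glosses_by_label):
    """Count gloss tokens per label and return a small model dictionary."""
    word_counts = {label: Counter() for label in glosses_by_label}
    for label, glosses in glosses_by_label.items():
        for gloss in glosses:
            word_counts[label].update(tokenize(gloss))
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    doc_counts = {label: len(g) for label, g in glosses_by_label.items()}
    return {"word_counts": word_counts, "vocabulary": vocabulary,
            "doc_counts": doc_counts}

def classify(gloss, model):
    """Return the label with the highest Laplace-smoothed log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)
        denominator = sum(counts.values()) + vocab_size
        for token in tokenize(gloss):
            if token in model["vocabulary"]:  # skip words unseen in training
                score += math.log((counts[token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

TRAINING_GLOSSES = {
    "positive": [
        "that which is pleasing or valuable or useful; agreeable or pleasing",  # good
        "aesthetically pleasing",                                               # beautiful
    ],
    "negative": [  # hypothetical glosses, written for this example
        "having undesirable or negative qualities",
        "displeasing to the senses",
    ],
}
PRETTY_GLOSS = "pleasing by delicacy or grace; not imposing"

model = train(TRAINING_GLOSSES)
print(classify(PRETTY_GLOSS, model))
```

The gloss of "pretty" shares the word "pleasing" with the positive training glosses, so the sketch labels it positive.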

Experiments
Classifier : NB, SVM, PrTFIDF
87.38% Accuracy
Development of Linguistic Resource - Summary

Conjunction Method
  Intuition: adjectives in 'and' conjunctions usually have similar orientation,
  though 'but' is used with opposite orientation
  Accuracy: 78.08%
  Characteristics: the first try; test data : 1336 adjectives

PMI Method
  Intuition: terms with similar orientation tend to co-occur in documents
  Accuracy: 87.13%
  Characteristics: no limitation; much time required

WordNet Expansion Method
  Intuition: adjectives usually share the same orientation as their synonyms and
  opposite orientation as their antonyms
  Accuracy: N/A
  Characteristics: limited to WordNet

Gloss Use Method
  Intuition: terms with similar orientation have similar glosses; terms without
  orientation have non-oriented glosses
  Accuracy: 87.38%
  Characteristics: SentiWordNet (all words in WordNet); accuracy depends on
  the quality of the thesaurus
Sentiment Classification

The process of identifying the sentiment – or polarity – of a piece of
text or a document.

Document-level

Sentence-level, phrase-level

Feature-level
– Define the target of the opinion and assign the sentiment of the target

Document-level Sentiment Classification Method

PMI method

Machine Learning Method
– Default Classifiers
– Enhanced Classifier
NLP Combined Method
– A Two-Step Classification
– Combining Appraisal Theory
PMI Method

Turney et al., 2002

Process

Only two-word phrases containing adjectives or adverbs are extracted

Semantic orientation of a phrase
– SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”)
Semantic orientation of a review is the average semantic orientation of its phrases
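
A minimal sketch of the review-level decision. When both PMI terms are estimated from hit counts, the phrase count and the corpus size cancel in the difference, leaving SO(phrase) = log2(hits(phrase NEAR excellent) · hits(poor) / (hits(phrase NEAR poor) · hits(excellent))). The phrases, all hit counts, the smoothing constant `eps`, and the zero threshold are assumptions for illustration.

```python
import math

def phrase_so(near_excellent, near_poor, hits_excellent, hits_poor, eps=0.01):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor') from hit counts."""
    return math.log2(((near_excellent + eps) * hits_poor) /
                     ((near_poor + eps) * hits_excellent))

def classify_review(phrase_counts, hits_excellent, hits_poor):
    """Average the phrase orientations of a non-empty review and threshold at zero."""
    scores = [phrase_so(near_exc, near_poor, hits_excellent, hits_poor)
              for near_exc, near_poor in phrase_counts.values()]
    average = sum(scores) / len(scores)
    return ("positive" if average > 0 else "negative"), average

# Hypothetical hit counts: phrase -> (hits NEAR "excellent", hits NEAR "poor").
HITS_EXCELLENT, HITS_POOR = 1_000_000, 800_000
review_phrases = {
    "low fees": (120, 30),
    "online service": (80, 40),
    "inconveniently located": (5, 60),
}

label, average = classify_review(review_phrases, HITS_EXCELLENT, HITS_POOR)
print(label, round(average, 2))
```

Every phrase needs its own pair of queries, which is consistent with the long running time reported in the experiments.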
Experiments

410 reviews from Epinions (epinion.com): 170 positive, 240 negative

Calculating the PMI of 10,658 phrases from the 410 reviews took about 30 hours
Domain of review       Accuracy
Automobiles            84.00%
- Honda Accord         83.78%
- Volkswagen Jetta     84.21%
Banks                  80.00%
- Bank of America      78.33%
- Washington Mutual    81.67%
Movies                 65.83%
- The Matrix           66.67%
- Pearl Harbor         65.00%
Travel Destination     70.53%
- Cancun               64.41%
- Puerto Vallarta      80.56%
ML - Default Classifier

Pang and Lee, 2002

A special case of text categorization with sentiment- rather than topic-based
categories

Document modeling


standard bag-of-features framework
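
A minimal sketch of the bag-of-features representation, showing the difference between frequency and presence features for unigrams. The vocabulary, the review text, and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter

def unigram_features(text, vocabulary, presence=True):
    """Map a document to a vector over a fixed vocabulary.

    presence=True  -> 1 if the word occurs at all, else 0
    presence=False -> raw occurrence count (frequency)
    """
    counts = Counter(text.lower().split())
    if presence:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

vocabulary = ["great", "boring", "plot"]  # hypothetical feature set
review = "great cast great plot"          # hypothetical review

print(unigram_features(review, vocabulary, presence=True))   # [1, 0, 1]
print(unigram_features(review, vocabulary, presence=False))  # [2, 0, 1]
```

Either vector can then be passed to any standard classifier; word order and context are discarded in both cases.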
Experiments

Data : movie reviews (Internet Movie Database), rating -> negative, neutral, positive

Naïve Bayes, Maximum Entropy, Support Vector Machine
Features             # of features   Frequency or presence?   NB     ME     SVM
unigrams             16165           freq.                    78.7   N/A    72.8
unigrams             16165           pres.                    81.0   80.4   82.9
unigrams+bigrams     32330           pres.                    80.6   80.8   82.7
bigrams              16165           pres.                    77.3   77.4   77.1
unigrams+POS         16695           pres.                    81.5   80.4   81.9
adjectives           2633            pres.                    77.0   77.7   75.1
top 2633 unigrams    2633            pres.                    80.3   81.0   81.4
unigrams+position    22430           pres.                    81.0   80.1   81.6

In terms of relative performance, Naïve Bayes tends to do the worst and SVM tends to
do the best, although the differences aren’t very large.
ML - Using Only Subjective Sentences

Pang and Lee, 2004

improved polarity classification by removing
objective sentences

A subjectivity detector determines whether
each sentence is subjective or not

Standard subjectivity classifier

Subjectivity classifier using proximity
relationship

The use of subjectivity extracts can improve
polarity classification, or at least causes no loss of
accuracy.
NLP Combined Method – A Two-Step Classification

Wilson et al., 2005

A Two-Step Contextual Polarity Classification

employ machine learning and 28 linguistic features

document polarity : the average polarity of phrases
Step 1. Neutral-polar classifier classifies each phrase containing a clue as neutral or polar
Step 2. Polarity classifier takes all phrases marked in step 1 as polar and disambiguates
their contextual polarity (positive, negative, both, or neutral).

28 Features : were extracted using NLP techniques with a dependency parser


4 Word Features, 8 Modification Features, 11 Structure Features, 3 Sentence Features,
1 Document Feature
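
The two-step flow can be sketched as follows. This is only a rule-based stand-in to show how the steps connect: the original method trains classifiers on the linguistic features listed above, whereas here a hypothetical prior-polarity lexicon plays the role of Step 1 and a simple negation flip plays the role of Step 2.

```python
# Hypothetical prior-polarity lexicon and negator list (not from the original work).
PRIOR_POLARITY = {"good": 1, "great": 1, "bad": -1, "terrible": -1}
NEGATORS = {"not", "never", "no"}

def is_polar(phrase):
    """Step 1 stand-in: a phrase is polar if it contains a lexicon clue."""
    return any(token in PRIOR_POLARITY for token in phrase)

def contextual_polarity(phrase):
    """Step 2 stand-in: sum prior polarities, flip on an odd number of negators."""
    score = sum(PRIOR_POLARITY.get(token, 0) for token in phrase)
    negations = sum(token in NEGATORS for token in phrase)
    if negations % 2 == 1:
        score = -score
    return (score > 0) - (score < 0)  # +1, 0, or -1

def document_polarity(phrases):
    """Document polarity as the average polarity of the phrases marked polar."""
    polar = [contextual_polarity(p) for p in phrases if is_polar(p)]
    return sum(polar) / len(polar) if polar else 0.0

phrases = [["not", "good"], ["the", "plot"], ["terrible", "acting"]]
print(document_polarity(phrases))
```

The example shows why contextual polarity matters: "not good" carries a positive prior-polarity word but is negative in context.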
Experiments

Data : Multi-perspective Question Answering (MPQA) Opinion Corpus
neutral-polar classification (%)
Features         Accuracy
Word token       73.6
Word+priorpol    74.2
28 features      75.9

polarity classification (%)
Features         Accuracy
Word token       61.7
Word+priorpol    63.0
10 features      65.7
NLP Combined Method - Combining Appraisal Theory

Whitelaw et al., 2005

applied the appraisal theory to the machine learning methods of Pang and Lee

Structure of an appraisal

An example “not very happy”

Experiments

a lexicon of 1329 appraisal entities was produced semi-automatically from 400 seed
terms in around twenty man-hours

combining attitude type and orientation : accuracy 90.2%.
Sentiment Classification - Summary

PMI Method
  Characteristics: use phrase PMI
  Pros: simple; no prior polarity dictionary needed
  Cons: loss of contextual meaning; slow (time to get PMI)

Machine Learning Method
  Characteristics: bag of words; unigram to bigram or n-gram; SVM, NB, MaxEnt
  Pros: simple; no prior polarity dictionary needed
  Cons: loss of contextual meaning; need learning phase

NLP Combined Method
  Characteristics: based on ML; parsing or syntactic analysis; prior polarity to
  contextual polarity
  Pros: consider contextual meaning; easily extendible for various purposes
  Cons: need prior polarity dictionary; syntactic analysis overhead
Extracting and Summarizing Opinion Expression

Goal

Extract the opinion expressions from large collections of reviews and present them in an effective way

Tasks

Feature Extraction
– Sentiment classification at the feature level requires the extraction of features that are the target of
opinion words
Sentiment Assignment
– Each feature is usually classified as being either favorable or unfavorable.
Visualization
– Extracted opinion expressions are summarized and visualized.

Methods

Statistical Approaches
– ReviewSeer (2003)
– Opinion Observer (2004)
– Red Opal (2007)
NLP-Based Approaches
– Kanayama System (2004)
– WebFountain (2005)
– OPINE (2005)
[Figure labels: product, reviews, Extract Features, Assign Sentiment, Summarize]
Opinion Observer - Overview

Hu and Liu, 2005

Extracts and summarizes opinion expressions
from customer reviews on the Web.

Only mines the features of the product on
which the customers have expressed their
opinions and whether the opinions are
positive or negative

Overall process
1. Review crawling
2. Feature extraction
3. Sentiment assignment
– Opinion word extraction
– Opinion orientation identification
4. Summary generation
Opinion Observer - Tasks

Feature Extraction

Product features are extracted from nouns and noun phrases by the association miner
CBA

Compactness pruning, redundancy pruning

Sentiment Assignment

Opinion sentence : a sentence that contains one or more product features and one or more
opinion words

Adjectives are the only opinion words

Prior polarity of adjectives was identified by the WordNet expansion method with seed
terms

Infrequent features are extracted by using frequent opinion words

Polarity of a sentence is assigned as its dominant orientation

Extracted form : (product feature, # of positive sentences, # of negative sentences)
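
A minimal sketch of the sentiment assignment and the extracted form. It assumes that the dominant orientation of a sentence is the sign of the summed prior polarities of its adjectives; the lexicon, product features, and sentences are hypothetical, and feature extraction is taken as already done.

```python
from collections import defaultdict

# Hypothetical prior polarities of adjectives (e.g. obtained by WordNet expansion).
ADJ_POLARITY = {"great": 1, "sharp": 1, "poor": -1, "blurry": -1}

def sentence_orientation(adjectives):
    """Dominant orientation: sign of the summed adjective polarities."""
    score = sum(ADJ_POLARITY.get(adj, 0) for adj in adjectives)
    return (score > 0) - (score < 0)

def summarize(opinion_sentences):
    """Build (product feature, # of positive sentences, # of negative sentences)."""
    summary = defaultdict(lambda: [0, 0])
    for features, adjectives in opinion_sentences:
        orientation = sentence_orientation(adjectives)
        for feature in features:
            if orientation > 0:
                summary[feature][0] += 1
            elif orientation < 0:
                summary[feature][1] += 1
    return [(feature, pos, neg) for feature, (pos, neg) in sorted(summary.items())]

# Each opinion sentence: (product features mentioned, adjectives found).
sentences = [
    (["picture"], ["great", "sharp"]),
    (["picture", "battery"], ["poor"]),
    (["battery"], ["poor", "great", "blurry"]),
]
print(summarize(sentences))
```

The resulting counts per feature are the numbers that a per-feature summary or bar graph would be built from.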
Experiments

Large collection of reviews of 15 electronic products

86.3% recall, 84.0% precision
Opinion Observer - Visualization

Features of products are compared in a bar graph

The numbers of positive and negative sentences for each feature are normalized
[Figure labels: Positive portion, Negative portion]
WebFountain - Overview

Yi et al., 2005

Extracts target features of the sentiment from the various resources and assigns
polarity to the features

System Architecture

Sentiment Miner

Analyzes grammatical sentence structures and phrases by using NLP techniques
WebFountain – Tasks

Feature Extraction

Candidate features
– a part-of relationship with the given topic
– an attribute-of relationship with the given topic
– an attribute-of relationship with a known feature of the given topic

bBNP (Beginning definite Base Noun Phrase) heuristic is used

Select bnp (base noun phrase) that has high likelihood ratio

Experiments
– Precision - digital camera: 97%, music reviews: 100%

Sentiment Assignment

Parse and traverse with two linguistic resources
– Sentiment lexicon: defines the sentiment polarity of terms
– Sentiment pattern database: contains the sentiment assignment patterns of predicates
Experiments
– Product review
– Recall 56%, Precision 87%
WebFountain – Visualization
Web interface listing sentiment bearing sentences about a given product
Extracting and Summarizing Opinion Expression - Summary

Statistical

ReviewSeer (2003)
  Feature Extraction: N/A
  Sentiment Assignment: probabilistic model; Naïve Bayes; Accuracy: 85.3%
  Visualization: lists each feature term with its score and shows the sentences
  that contain the feature term

Opinion Observer (2004)
  Feature Extraction: CBA miner; infrequent feature selection; Recall: 86.3%;
  Precision: 84.0%
  Sentiment Assignment: WordNet expansion; prior polarity of adjectives
  Visualization: graph

Red Opal (2007)
  Feature Extraction: frequent noun and noun phrase; Precision: 85%
  Sentiment Assignment: use user's rating; Precision: 80%
  Visualization: ordered product list by score of each feature; the confidence of
  the scoring

NLP-based

Kanayama's system (2004)
  Sentiment Assignment: sentiment unit; modifying the machine translation
  framework; Recall: 43%; Precision: 89%
  Visualization: N/A

WebFountain (2005)
  Feature Extraction: bBNP heuristics; likelihood ratio; Precision: 97%
  Sentiment Assignment: sentiment lexicon; sentiment pattern database;
  Recall: 56%; Precision: 87%
  Visualization: listing sentiment bearing sentences of a product

OPINE (2005)
  Feature Extraction: Web PMI; Recall: 76%; Precision: 79%
  Sentiment Assignment: Relaxation Labeling; Recall: 89%; Precision: 86%
  Visualization: N/A
Discussion

OM is a growing research discipline related to various research areas, such as IR,
computational linguistics, TC, TS, and DM.

Surveyed three topics and summarized them.

For Korean OM?

There is no published research on Korean OM.

Language differences may impose some limits on the methods used in the OM
subtasks.
– Structural differences between English and Korean may mean that the same heuristics cannot
be applied to extract features from text
– The lack of a Korean thesaurus similar to WordNet limits the methods of obtaining the prior
polarity of words for the PMI or conjunction methods.

Research into Korean OM must be conducted in conjunction with other related areas.
Discussion - Research Map of OM
Thank you
