Scott Wen-tau Yih
Joint work with
Geoffrey Zweig & John Platt
Microsoft Research
Text objects (e.g., words, phrases, sentences or
documents) are represented as vectors
High-dimensional sparse term-vectors
Concept vectors from topic models or projection methods
Constructed compositionally from word vectors
[Socher et al. 12]
Relations of the text objects are estimated by
functions in the vector space
Relatedness is measured by some similarity or
distance function (e.g., cosine similarity)
[Figure: vectors v_q and v_d separated by an angle θ, with relatedness given by cos(θ)]
Document Level
Information Retrieval [Salton & McGill 83]
Document Clustering [Deerwester et al. 90]
Search Relevance Measurement
[Baeza-Yates & Ribeiro-Neto 99]
Cross-lingual document retrieval [Platt et al. 10; Yih et al. 11]
Word Level
Language modeling [Bellegarda 00]
Word similarity and relatedness
[Deerwester et al. 90; Lin 98; Turney 01; Turney & Littman 05; Agirre
et al. 09; Reisinger & Mooney 10; Yih & Qazvinian 12]
Existing VSMs cannot distinguish finer relations
The “antonym” issue of distributional similarity
The co-occurrence or distributional hypotheses
Apply to near-synonyms, hypernyms and other semantically
related words, including antonyms [Mohammad et al. 08]
e.g., “hot” and “cold” occur in similar contexts
LSA does not solve the issue
Might assign a high degree of similarity to opposites as well as
synonyms [Landauer & Laham 98]
Separate antonyms from distributionally similar
word pairs [Lin et al. 03]
Patterns: “from X to Y”, “either X or Y”
WordNet graph [Harabagiu et al. 06]
Synsets connected by is-a links and exactly one
antonymy link
WordNet + affix rules + heuristics [Mohammad et al. 08]
Distinguishing synonyms and antonyms is still
perceived as a difficult open problem…
[Poon & Domingos 09]
Polarity Inducing Latent Semantic Analysis (PILSA)
A vector space model that encodes polarity information
Synonyms cluster together in this space
Antonyms lie at the opposite ends of a unit sphere
[Figure: unit sphere with “burning” and “hot” on one side and “freezing” and “cold” on the opposite side]
Significantly improved the prediction accuracy on a
benchmark GRE dataset (64% → 80%)
Introduction
Polarity Inducing Latent Semantic Analysis
Basic construction
Extension 1: Improving accuracy
Extension 2: Improving coverage
Experimental evaluation
Task & datasets
Results
Conclusion
Input: A thesaurus (with synonyms & antonyms)
Create a “document”-term matrix
Each group of words (synonyms and antonyms) is
treated as a “document”
Induce polarity by making antonyms have
negative weights
Apply SVD as in regular Latent Semantic Analysis
Acrimony: rancor, conflict, bitterness; goodwill, affection
Affection: goodwill, tenderness, fondness; acrimony, rancor
Document: row-vector; Term: column-vector
TFIDF score:
                        acrimony   rancor   goodwill   affection   …
Group 1: “acrimony”         4.73     6.01       5.81        4.86   …
Group 2: “affection”        3.78     5.23       6.21        5.15   …
…                              …        …          …           …   …
Inducing polarity:
                        acrimony   rancor   goodwill   affection   …
Group 1: “acrimony”         4.73     6.01      -5.81       -4.86   …
Group 2: “affection”       -3.78    -5.23       6.21        5.15   …
…                              …        …          …           …   …
Cosine score: positive (+) for synonyms, negative (−) for antonyms
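The construction above can be sketched in a few lines of Python. This is only an illustration: the `weight` helper is an assumed TF-IDF variant (tf = 1 with a smoothed idf), not the weighting used in the talk, and `ENTRIES` and `build_signed_matrix` are names made up here.

```python
import math

# Thesaurus entries copied from the slide: (headword, synonyms, antonyms).
ENTRIES = [
    ("acrimony", ["rancor", "conflict", "bitterness"], ["goodwill", "affection"]),
    ("affection", ["goodwill", "tenderness", "fondness"], ["acrimony", "rancor"]),
]


def build_signed_matrix(entries):
    """Return (vocab, rows): one signed weight row per thesaurus group."""
    groups = [([head] + syns, ants) for head, syns, ants in entries]
    vocab = sorted({w for pos, neg in groups for w in pos + neg})
    n_groups = len(groups)
    # Document frequency: number of groups that mention the word at all.
    df = {w: sum(1 for pos, neg in groups if w in pos + neg) for w in vocab}

    def weight(word):
        # Assumed TF-IDF variant (tf = 1, smoothed idf); not the talk's formula.
        return 1.0 + math.log(n_groups / df[word])

    rows = []
    for pos, neg in groups:
        row = dict.fromkeys(vocab, 0.0)
        for w in pos:
            row[w] = weight(w)    # headword and synonyms: positive weight
        for w in neg:
            row[w] = -weight(w)   # antonyms: negative weight induces polarity
        rows.append([row[w] for w in vocab])
    return vocab, rows


vocab, rows = build_signed_matrix(ENTRIES)
for (head, _, _), row in zip(ENTRIES, rows):
    print(head, dict(zip(vocab, (round(x, 2) for x in row))))
```

Each row is one “document”; a word's vector is the corresponding column, so flipping the sign of antonym entries is what later pushes opposites apart.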
Simplified example with all weights set to 1 (no polarity):
                        acrimony   rancor   goodwill   affection
Group 1: “acrimony”            1        1          1           1
Group 2: “affection”           1        1          1           1
Cosine similarity = 1
Cannot distinguish antonyms from synonyms!
The same example with polarity induced:
                        acrimony   rancor   goodwill   affection
Group 1: “acrimony”            1        1         -1          -1
Group 2: “affection”          -1       -1          1           1
Cosine similarity = -1
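The two cosine values can be checked directly. The sketch below treats each word as a column of the simplified tables, i.e. a pair (Group 1 weight, Group 2 weight); the dictionary names `unsigned` and `signed` are introduced here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm


# Word vectors are table columns: (Group 1 weight, Group 2 weight).
unsigned = {"acrimony": [1, 1], "rancor": [1, 1],
            "goodwill": [1, 1], "affection": [1, 1]}
signed = {"acrimony": [1, -1], "rancor": [1, -1],
          "goodwill": [-1, 1], "affection": [-1, 1]}

for name, table in (("unsigned", unsigned), ("signed", signed)):
    syn = cosine(table["acrimony"], table["rancor"])    # synonym pair
    ant = cosine(table["acrimony"], table["goodwill"])  # antonym pair
    print(f"{name}: synonyms {syn:+.1f}, antonyms {ant:+.1f}")
```

Without signs, both pairs score +1; with signs, the synonym pair stays at +1 while the antonym pair drops to -1.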
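$\mathbf{W}_{d \times n} \approx \mathbf{U}_{d \times k}\,\mathbf{S}_{k \times k}\,\mathbf{V}^{T}_{k \times n}$ (the $n$ columns of $\mathbf{W}$ correspond to words)
Word similarity: cosine of two columns in $\mathbf{S}\mathbf{V}^{T}$
SVD generalizes and smooths the original data
Uncovers relationships not explicit in the thesaurus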
As $\mathbf{U}^{T}\mathbf{W} = \mathbf{S}\mathbf{V}^{T}$, $\mathbf{U}_{d \times k}$ can be viewed as the projection
matrix that maps a raw $d \times 1$ column-vector to the
$k$-dimensional latent space
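The identity used here follows in one step from the orthonormality of the singular vectors. In the sketch below, the tilde symbols denote the untruncated SVD factors; that notation is introduced here and does not appear in the talk.

```latex
\begin{aligned}
\mathbf{W} &= \tilde{\mathbf{U}}\,\tilde{\mathbf{S}}\,\tilde{\mathbf{V}}^{T}
  && \text{(untruncated SVD; columns of } \tilde{\mathbf{U}} \text{ are orthonormal)} \\
\mathbf{U}^{T}\tilde{\mathbf{U}} &= [\,\mathbf{I}_k \;\; \mathbf{0}\,]
  && \text{(} \mathbf{U} \text{ holds the first } k \text{ columns of } \tilde{\mathbf{U}} \text{)} \\
\mathbf{U}^{T}\mathbf{W} &= [\,\mathbf{I}_k \;\; \mathbf{0}\,]\,\tilde{\mathbf{S}}\,\tilde{\mathbf{V}}^{T}
  = \mathbf{S}\,\mathbf{V}^{T}
  && \text{(top } k \text{ singular values and right singular vectors)} \\
\mathbf{v} &= \mathbf{U}^{T}\mathbf{f}
  && \text{(projection of any raw } d \times 1 \text{ vector } \mathbf{f} \text{)}
\end{aligned}
```

The last line is what makes the projection reusable: any raw vector over the same $d$ groups can be mapped into the latent space, not only the columns of $\mathbf{W}$.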
Refine the projection matrix by discriminative training
S2Net [Yih et al. 11]: very similar to RankNet [Burges et al. 05] but
focuses on learning concept vectors
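[Figure: S2Net maps a raw term vector $\mathbf{f}_p = (t_1, \dots, t_d)$ through the projection matrix $\mathbf{A}_{d \times k}$ to a concept vector $\mathbf{v}_p = (c_1, \dots, c_k)$; the similarity $f_{sim}(\mathbf{v}_p, \mathbf{v}_q)$ is computed between concept vectors]
$\mathbf{v}_p = \mathbf{A}^{T}\mathbf{f}_p$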
Training data: Antonym pairs from thesaurus
Initialize model with the PILSA projection matrix
Learning objective: cosine score of antonyms
should be lower than other word pairs
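$\Delta_{ij} \equiv \cos(\mathbf{A}^{T}\mathbf{f}_{p_i}, \mathbf{A}^{T}\mathbf{f}_{q_j}) - \cos(\mathbf{A}^{T}\mathbf{f}_{p_i}, \mathbf{A}^{T}\mathbf{f}_{q_i})$
First term: other word pair; second term: antonyms
$L(\Delta_{ij}; \mathbf{A}) = \log(1 + \exp(-\gamma \Delta_{ij}))$
[Plot: loss as a function of $\Delta_{ij}$, for $\Delta_{ij}$ from -2 to 2; loss axis from 0 to 20]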
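To make the objective concrete, the sketch below evaluates the loss for a single (antonym pair, other pair) combination. It only computes the loss and does not perform any gradient updates. The value `gamma=10.0`, the toy matrices and the one-hot term vectors are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project(A, f):
    """Compute v = A^T f, where A is a d x k matrix given as d rows."""
    k = len(A[0])
    return [sum(A[t][c] * f[t] for t in range(len(f))) for c in range(k)]


def pair_loss(A, f_p, f_q_antonym, f_q_other, gamma=10.0):
    """log(1 + exp(-gamma * delta)), delta = cos(other pair) - cos(antonym pair)."""
    v_p = project(A, f_p)
    delta = (cosine(v_p, project(A, f_q_other))
             - cosine(v_p, project(A, f_q_antonym)))
    return math.log1p(math.exp(-gamma * delta))


# Toy setting: d = 3 raw dimensions (hot, cold, table), k = 2 latent dimensions.
hot, cold, table = [1, 0, 0], [0, 1, 0], [0, 0, 1]
A_good = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]  # sends "cold" opposite to "hot"
A_bad = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # sends "cold" onto "hot"

loss_good = pair_loss(A_good, hot, cold, table)
loss_bad = pair_loss(A_bad, hot, cold, table)
print(loss_good, loss_bad)
```

A projection that places the antonym below the unrelated word gets a loss near zero, while one that confuses them is penalized roughly linearly in `gamma`.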
What to do with out-of-thesaurus words?
Some lexical variations

Encarta thesaurus contains “corruptible” and
“corruption”, but not “corruptibility”
Morphological analysis and stemming to find alternatives
of an out-of-thesaurus target word
Rare or offensive words

e.g., “froward” and “moronic”
Embedding out-of-thesaurus words by leveraging a
general corpus
Create a context vector space model using a
collection of documents (e.g., Wikipedia)
Context: words within a window of [-10,10]
Embed target word into the PILSA space by 𝑘-NN
Find nearby in-thesaurus words in the context space
Remove words with inconsistent polarity
Use the centroid of the corresponding PILSA vectors to
represent the target word
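The context vectors mentioned above can be sketched as simple co-occurrence counts within the [-10, 10] window. Using raw counts is an assumption, since the talk does not say how the counts are weighted, and `context_vector` is a name made up here.

```python
from collections import Counter


def context_vector(tokens, target, window=10):
    """Count the words seen within +/- `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Tiny made-up sentence; a narrow window keeps the result easy to inspect.
tokens = "the soup was hot and the tea was hot too".split()
cv = context_vector(tokens, "hot", window=2)
print(cv)
```

In practice the counts would be collected over the whole corpus and stored as a sparse vector per word.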
[Figure: the words “hot”, “sweltering”, “burning” and “cold” shown in the Context Vector Space and in the PILSA Space]
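The 𝑘-NN embedding steps can be sketched as follows. The talk does not spell out how inconsistent polarity is detected, so the filter here (keep neighbors whose PILSA vector has a positive cosine with that of the single nearest neighbor) is an assumption, as are the toy vectors and the function name `embed_oov`.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed_oov(target_context, context_vecs, pilsa_vecs, k=3):
    """Place an out-of-thesaurus word in the PILSA space via its context neighbors."""
    # 1. k nearest in-thesaurus words, ranked by cosine in the context space.
    ranked = sorted(pilsa_vecs,
                    key=lambda w: cosine(target_context, context_vecs[w]),
                    reverse=True)
    neighbors = ranked[:k]
    # 2. Assumed polarity filter: keep neighbors whose PILSA vector agrees in
    #    sign (positive cosine) with that of the single nearest neighbor.
    anchor = pilsa_vecs[neighbors[0]]
    kept = [w for w in neighbors if cosine(pilsa_vecs[w], anchor) > 0]
    # 3. Centroid of the surviving PILSA vectors.
    centroid = [sum(pilsa_vecs[w][i] for w in kept) / len(kept)
                for i in range(len(anchor))]
    return kept, centroid


# Toy 2-d vectors, invented for illustration only.
context_vecs = {"hot": [0.9, 0.1], "burning": [0.8, 0.2],
                "cold": [0.85, 0.15], "happy": [0.1, 0.9]}
pilsa_vecs = {"hot": [1.0, 0.0], "burning": [0.9, 0.1],
              "cold": [-1.0, 0.0], "happy": [0.0, 1.0]}

# Hypothetical context vector for an out-of-thesaurus word such as "sweltering".
kept, vec = embed_oov([0.88, 0.12], context_vecs, pilsa_vecs, k=3)
print(kept, vec)
```

In this toy setup “cold” is a close context neighbor but is dropped by the polarity filter, so the centroid ends up near “hot” and “burning”.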
Encarta Thesaurus (for basic PILSA)
47k word categories (i.e., the “documents”)
Vocabulary of 50k words
125,724 pairs of antonyms
Wikipedia (for embedding out-of-thesaurus words)
Sentences from a Nov-2010 snapshot
917M words after preprocessing
Task: GRE closest-opposite questions
Which is the closest opposite of adulterate?
(a) renounce (b) forbid (c) purify (d) criticize (e) correct
Dev / Test: 162 / 950 questions [Mohammad et al. 08]
Dev set is used for tuning the dimensionality of PILSA
Evaluation metric
Accuracy: #correct / #total questions
Questions with unresolved out-of-thesaurus target words
are treated as answered incorrectly
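A minimal sketch of the evaluation, assuming a question is answered by choosing the candidate with the lowest cosine score against the target word. The vectors are invented toy values, and the second question is hypothetical: it only illustrates that an unresolved target counts as an incorrect answer.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_opposite(target, candidates, vectors):
    """Return the candidate with the lowest cosine to the target, or None."""
    if target not in vectors:
        return None  # unresolved out-of-thesaurus target word
    scored = [(cosine(vectors[target], vectors[c]), c)
              for c in candidates if c in vectors]
    return min(scored)[1] if scored else None


def accuracy(questions, vectors):
    """#correct / #total questions; unresolved questions count as incorrect."""
    correct = sum(closest_opposite(target, cands, vectors) == gold
                  for target, cands, gold in questions)
    return correct / len(questions)


# Toy 2-d vectors, invented for illustration only.
vectors = {"adulterate": [1.0, 0.0], "renounce": [0.2, 0.9],
           "forbid": [0.3, 0.8], "purify": [-0.9, 0.1],
           "criticize": [0.4, 0.6], "correct": [-0.2, 0.9]}
questions = [
    ("adulterate",
     ["renounce", "forbid", "purify", "criticize", "correct"], "purify"),
    ("froward", ["compliant", "stubborn"], "compliant"),  # hypothetical question
]

answer = closest_opposite(*questions[0][:2], vectors)
score = accuracy(questions, vectors)
print(answer, score)
```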
[Bar chart: accuracy (axis from 0.5 to 0.85) for Lookup, Raw TFIDF, PILSA, PILSA+S2Net, OOV Embedding, and Mohammad et al. 08; bar values as extracted: 0.8, 0.77, 0.74, 0.64, 0.56, 0.57]
Target word: admirable
No polarity – LSA
Most Similar: commendable, creditable, despicable
Least Similar: uninviting, dessert, seductive
With polarity – PILSA
Most Similar: commendable, creditable, laudable
Least Similar: despicable, shameful, unworthy
Full results on GRE test set are available online
Polarity Inducing LSA
Solves the open problem of antonyms/synonyms by
making a vector space that can distinguish opposites
Vector space designed so that synonyms/antonyms tend
to have positive/negative cosine similarity
Future Work
New methods or representations for other word relations
e.g., Part-Whole, Is-A, Attribute
Applications
e.g., Textual Entailment or Sentence Completion