Transcript: Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy and Yoav Goldberg, Bar-Ilan University, Israel
[Pie chart: "Papers in ACL 2014", divided into "Neural Networks & Word Embeddings" and "Other Topics". Footnote: * Sampling error: +/- 100%]
Neural Embeddings
• Dense vectors
• Each dimension is a latent feature
• Common software package: word2vec
  Italy: (-7.35, 9.42, 0.88, …) ∈ ℝ^100
• "Magic": king − man + woman = queen (analogies)

Representing words as vectors is not new!
Explicit Representations (Distributional)
• Sparse vectors
• Each dimension is an explicit context
• Common association metric: PMI, PPMI
  Italy: (Rome: 17, pasta: 5, Fiat: 2, …) ∈ ℝ^|V|, |V| ≈ 100,000
• Does the same "magic" work for explicit representations too?
• Baroni et al. (2014) showed that embeddings outperform explicit representations, but…
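To make the contrast concrete, here is a minimal Python sketch of the two kinds of representation, with PPMI as the association metric. This is not the authors' code; all counts and sizes below are invented toy values.

    import math
    import numpy as np

    # Dense neural embedding: a low-dimensional, real-valued vector.
    italy_dense = np.array([-7.35, 9.42, 0.88])  # truncated to 3 of ~100 dims

    # Sparse explicit vector: PPMI weights over observed contexts of "Italy".
    cooc = {"Rome": 17, "pasta": 5, "Fiat": 2}   # co-occurrence counts (toy)
    n = 1_000_000                                # hypothetical corpus size
    count_italy = 1_000                          # hypothetical count of "Italy"
    count_ctx = {"Rome": 50, "pasta": 400, "Fiat": 30}  # hypothetical counts

    def ppmi(wc, w, c, n):
        # Positive PMI: max(0, log P(w,c) / (P(w) * P(c)))
        return max(0.0, math.log((wc / n) / ((w / n) * (c / n))))

    italy_sparse = {ctx: ppmi(wc, count_italy, count_ctx[ctx], n)
                    for ctx, wc in cooc.items()}
    # All other ~100,000 dimensions are implicitly zero.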
Questions
• Are analogies unique to neural embeddings?
  → Compare neural embeddings with explicit representations
• Why does vector arithmetic reveal analogies?
  → Unravel the mystery behind neural embeddings and their "magic"
Background
Mikolov et al. (2013a,b,c)
• Neural embeddings have interesting geometries
• These patterns capture "relational similarities"
• Can be used to solve analogies: a is to a* as b is to b*
• With simple vector arithmetic: a − a* = b − b*, i.e. b − a + a* = b*
Examples (vectors in ℝ^n):
  king − man + woman = queen
  Tokyo − Japan + France = Paris
  best − good + strong = strongest
Are analogies unique to neural embeddings?
• Experiment: compare embeddings to explicit representations
• Learn different representations from the same corpus
• Evaluate with the same recovery method (sketched below):
  argmax_{b* ∈ V} cos(b*, b − a + a*)
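A rough Python sketch of this recovery method (the paper calls it 3CosAdd), assuming `vectors` is a {word: numpy array} map of length-normalized word vectors. The names here are hypothetical, not the authors' code.

    import numpy as np

    def analogy_add(a, a_star, b, vectors):
        # argmax over b* of cos(b*, b - a + a*)
        target = vectors[b] - vectors[a] + vectors[a_star]
        target /= np.linalg.norm(target)
        best_word, best_sim = None, float("-inf")
        for word, vec in vectors.items():
            if word in (a, a_star, b):
                continue  # exclude the question words a, a*, b
            sim = float(vec @ target)  # cosine, since all vectors are unit-length
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word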
Analogy Datasets
• 4 words per analogy: a is to a* as b is to b*
• Given 3 words: a is to a* as b is to ?
• Guess the best-suiting b* from the entire vocabulary V, excluding the question words a, a*, b
• MSR: ~8,000 syntactic analogies
• Google: ~19,000 syntactic and semantic analogies
Embedding vs Explicit (Round 1)
[Bar chart: accuracy per dataset]
  MSR: Embedding 54%, Explicit 29%
  Google: Embedding 63%, Explicit 45%
Many analogies are recovered by the explicit representation, but many more by the embedding.
Why does vector arithmetic reveal analogies?
• We wish to find the closest b* to b − a + a*
• This is done with cosine similarity:
  argmax_{b* ∈ V} cos(b*, b − a + a*) = argmax_{b* ∈ V} [cos(b*, b) − cos(b*, a) + cos(b*, a*)]
  vector arithmetic = similarity arithmetic
  (For length-normalized vectors, cos(b*, b − a + a*) equals the bracketed sum divided by ‖b − a + a*‖, which is constant over b*, so the argmax is unchanged.)
• Concretely: we wish to find the closest x to king − man + woman:
  argmax_x cos(x, king − man + woman) = argmax_x [cos(x, king) − cos(x, man) + cos(x, woman)]
  (cos(x, king) asks: royal? cos(x, woman) asks: female?)
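A quick numeric check of this identity, using random unit vectors as stand-ins for real embeddings (illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    vecs = {}
    for w in ["king", "man", "woman", "x1", "x2", "x3"]:
        v = rng.normal(size=50)
        vecs[w] = v / np.linalg.norm(v)  # length-normalize every vector

    target = vecs["king"] - vecs["man"] + vecs["woman"]
    for x_name in ["x1", "x2", "x3"]:
        x = vecs[x_name]
        lhs = (x @ target) / np.linalg.norm(target)  # cos(x, king - man + woman)
        rhs = (x @ vecs["king"] - x @ vecs["man"]
               + x @ vecs["woman"]) / np.linalg.norm(target)  # similarity arithmetic
        assert np.isclose(lhs, rhs)  # identical up to the shared constant norm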
What does each similarity term mean?
• Observe the joint features with explicit representations! (sketch below)
  queen ∩ king: uncrowned, majesty, second, …
  queen ∩ woman: Elizabeth, Katherine, impregnate, …
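With explicit vectors, the shared contexts of two words can be read off directly. One plausible way to list them, sketched below; the PPMI values are invented for illustration.

    def joint_features(u, v, top_k=3):
        # u, v: {context: ppmi} dicts; rank shared contexts by the smaller weight
        shared = {c: min(u[c], v[c]) for c in u.keys() & v.keys()}
        return sorted(shared, key=shared.get, reverse=True)[:top_k]

    queen = {"uncrowned": 3.2, "majesty": 2.9, "Elizabeth": 2.7, "second": 1.1}
    king = {"uncrowned": 2.8, "majesty": 3.0, "throne": 2.2, "second": 1.5}
    print(joint_features(queen, king))  # ['majesty', 'uncrowned', 'second']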
Can we do better?
Let's look at some mistakes…
England − London + Baghdad = ?
Expected: Iraq. Predicted: Mosul?
The Additive Objective
  cos(Iraq, England) − cos(Iraq, London) + cos(Iraq, Baghdad) = 0.15 − 0.13 + 0.63 = 0.65
  cos(Mosul, England) − cos(Mosul, London) + cos(Mosul, Baghdad) = 0.13 − 0.14 + 0.75 = 0.74
• Problem: one similarity might dominate the rest
• This is much more prevalent in the explicit representation
• Might explain why explicit underperformed in Round 1
How can we do better?
• Instead of adding similarities, multiply them! (sketch below)
  argmax_{b* ∈ V} [cos(b*, b) · cos(b*, a*)] / cos(b*, a)
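A minimal sketch of this multiplicative objective (3CosMul in the paper), again assuming length-normalized vectors. Following the paper, cosines are shifted from [-1, 1] to [0, 1] and a small epsilon guards against division by zero.

    import numpy as np

    def analogy_mul(a, a_star, b, vectors, eps=0.001):
        # argmax over b* of cos(b*, b) * cos(b*, a*) / (cos(b*, a) + eps)
        best_word, best_score = None, float("-inf")
        for word, vec in vectors.items():
            if word in (a, a_star, b):
                continue  # exclude the question words
            def c(u):
                return (float(vec @ u) + 1.0) / 2.0  # cosine shifted to [0, 1]
            score = c(vectors[b]) * c(vectors[a_star]) / (c(vectors[a]) + eps)
            if score > best_score:
                best_word, best_score = word, score
        return best_word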
Embedding vs Explicit (Round 2)
Multiplication > Addition
[Bar chart: accuracy per dataset and objective]
  Embedding: MSR Add 54%, Mul 59%; Google Add 63%, Mul 67%
  Explicit: MSR Add 29%, Mul 57%; Google Add 45%, Mul 68%
Explicit is on par with Embedding
[Bar chart: accuracy with the multiplicative objective]
  MSR: Embedding 59%, Explicit 57%
  Google: Embedding 67%, Explicit 68%
• Embeddings are not "magical"
• Embedding-based similarities have a more uniform distribution
• The additive objective performs better on smoother distributions
• The multiplicative objective overcomes this issue
Conclusion
• Are analogies unique to neural embeddings? No! They occur in sparse and explicit representations as well.
• Why does vector arithmetic reveal analogies? Because vector arithmetic is equivalent to similarity arithmetic.
• Can we do better? Yes! The multiplicative objective is significantly better.
More Results and Analyses (in the paper)
• Evaluation on closed-vocabulary analogy questions (SemEval 2012)
• Experiments with a third objective function (PairDirection)
• Do different representations reveal the same analogies?
• Error analysis
• A feature-level interpretation of how word similarity reveals analogies
Thanks − for + listening =)