
Linguistic Regularities in Sparse and Explicit Word Representations

Omer Levy and Yoav Goldberg, Bar-Ilan University, Israel

[Pie chart: Papers in ACL 2014 by topic* : "Neural Networks & Word Embeddings" vs. "Other Topics". *Sampling error: +/- 100%]

Neural Embeddings β€’ Dense vectors β€’ Each dimension is a latent feature β€’ Common software package: word2vec πΌπ‘‘π‘Žπ‘™π‘¦: (βˆ’7.35, 9.42, 0.88, … ) ∈ ℝ 100 β€’

β€œMagic”

king βˆ’ man + woman = queen

(analogies)

Representing words as vectors is not new!

Explicit Representations (Distributional)
• Sparse vectors
• Each dimension is an explicit context
• Common association metric: PMI / PPMI
• Italy: {Rome: 17, pasta: 5, Fiat: 2, …} ∈ ℝ^|Vocab|, |Vocab| ≈ 100,000
• Does the same "magic" work for explicit representations too?
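To make the contrast with dense embeddings concrete, here is a minimal sketch (not the authors' code) of how a PPMI-weighted explicit vector can be built from co-occurrence counts; the toy `pairs` data and the function name are illustrative assumptions.

```python
# Minimal sketch: sparse, explicit PPMI vectors from (word, context) co-occurrence pairs.
from collections import Counter
from math import log

def ppmi_vectors(pairs):
    """pairs: iterable of (word, context) tuples, e.g. from a window-based extractor."""
    wc = Counter(pairs)                      # joint counts #(w, c)
    w_counts = Counter(w for w, _ in pairs)  # marginal counts #(w)
    c_counts = Counter(c for _, c in pairs)  # marginal counts #(c)
    total = sum(wc.values())

    vectors = {}
    for (w, c), n in wc.items():
        pmi = log((n * total) / (w_counts[w] * c_counts[c]))
        if pmi > 0:                          # PPMI: keep only positive associations
            vectors.setdefault(w, {})[c] = pmi
    return vectors

# Toy usage: each word maps to a sparse dict {context: PPMI weight}.
pairs = [("Italy", "Rome"), ("Italy", "pasta"), ("Italy", "Fiat"),
         ("France", "Paris"), ("France", "wine")]
print(ppmi_vectors(pairs)["Italy"])
```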

β€’ Baroni et al. (2014) showed that embeddings outperform explicit, but…

Questions β€’

Are analogies unique to neural embeddings?

Compare neural embeddings with explicit representations β€’

Why does vector arithmetic reveal analogies?

Unravel the mystery behind neural embeddings and their β€œmagic”

Background

Mikolov et al. (2013a,b,c) β€’ Neural embeddings have interesting geometries

Mikolov et al. (2013a,b,c) β€’ Neural embeddings have interesting geometries β€’ These patterns capture β€œrelational similarities” β€’ Can be used to solve analogies: man is to woman as king is to queen

Mikolov et al. (2013a,b,c) β€’ Neural embeddings have interesting geometries β€’ These patterns capture β€œrelational similarities” β€’ Can be used to solve analogies: π‘Ž is to π‘Ž βˆ— as 𝑏 is to 𝑏 βˆ— β€’ Can be recovered by β€œsimple” vector arithmetic: π‘Ž βˆ’ π‘Ž βˆ— = 𝑏 βˆ’ 𝑏 βˆ—

Mikolov et al. (2013a,b,c) β€’ Neural embeddings have interesting geometries β€’ These patterns capture β€œrelational similarities” β€’ Can be used to solve analogies: π‘Ž is to π‘Ž βˆ— as 𝑏 is to 𝑏 βˆ— β€’ With simple vector arithmetic: π‘Ž βˆ’ π‘Ž βˆ— = 𝑏 βˆ’ 𝑏 βˆ—

Mikolov et al. (2013a,b,c) π‘Ž βˆ’ π‘Ž βˆ— = 𝑏 βˆ’ 𝑏 βˆ—

Mikolov et al. (2013a,b,c) 𝑏 βˆ’ π‘Ž + π‘Ž βˆ— = 𝑏 βˆ—

Mikolov et al. (2013a,b,c) 𝑏 king βˆ’ π‘Ž man + π‘Ž βˆ— woman = 𝑏 βˆ— queen

Mikolov et al. (2013a,b,c) 𝑏 Tokyo βˆ’ π‘Ž Japan + π‘Ž βˆ— France = 𝑏 βˆ— Paris

Mikolov et al. (2013a,b,c) 𝑏 best βˆ’ π‘Ž good + π‘Ž βˆ— strong = 𝑏 βˆ— strongest

Mikolov et al. (2013a,b,c) 𝑏 best βˆ’ π‘Ž good + π‘Ž βˆ— strong = 𝑏 βˆ— strongest vectors in ℝ 𝑛

Are analogies unique to neural embeddings?
• Experiment: compare embeddings to explicit representations
• Learn both representations from the same corpus
• Evaluate with the same recovery method: arg max_{b* ∈ V} cos(b*, b − a + a*)

Analogy Datasets β€’ 4 words per analogy: π‘Ž is to π‘Ž βˆ— as 𝑏 is to 𝑏 βˆ— β€’ Given 3 words: π‘Ž is to π‘Ž βˆ— as 𝑏 is to ?

β€’ Guess the best suiting β€’ 𝑏 βˆ— from the entire vocabulary 𝑉 Excluding the question words π‘Ž , π‘Ž βˆ— , 𝑏 β€’ β€’

MSR: Google:

~ 8000 syntactic analogies ~ 19,000 syntactic and semantic analogies

Embedding vs Explicit (Round 1)
[Bar chart: analogy accuracy]
• MSR: Embedding 54%, Explicit 29%
• Google: Embedding 63%, Explicit 45%
Many analogies are recovered by the explicit representation, but many more by the embedding.

Why does vector arithmetic reveal analogies?
• We wish to find the b* closest to b − a + a*
• This is done with cosine similarity:
  arg max_{b* ∈ V} cos(b*, b − a + a*) = arg max_{b* ∈ V} [ cos(b*, b) − cos(b*, a) + cos(b*, a*) ]
  (for length-normalized vectors the two objectives rank candidates identically, since ‖b − a + a*‖ is constant over b*)

vector arithmetic = similarity arithmetic
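A quick numeric sanity check of this equivalence, with random unit vectors standing in for word vectors (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50
X = rng.normal(size=(V, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # unit-length "word" vectors
a, a_star, b = X[0], X[1], X[2]

combined = X @ (b - a + a_star)                         # ∝ cos(x, b − a + a*) for unit-norm rows
decomposed = X @ b - X @ a + X @ a_star                 # cos(x, b) − cos(x, a) + cos(x, a*)

# Same ranking, hence the same argmax (the two differ only by the constant ‖b − a + a*‖ factor).
print(np.argmax(combined) == np.argmax(decomposed))     # True
```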

Why does vector arithmetic reveal analogies?
• We wish to find the x closest to king − man + woman
• With cosine similarity:
  arg max_x cos(x, king − man + woman) = arg max_x [ cos(x, king) − cos(x, man) + cos(x, woman) ]
• Each similarity term asks about a different aspect: cos(x, king) → royal? … cos(x, woman) → female?

vector arithmetic = similarity arithmetic

What does each similarity term mean?

β€’ Observe the joint features with explicit representations!

𝒒𝒖𝒆𝒆𝒏 ∩ π’Œπ’Šπ’π’ˆ uncrowned majesty second … 𝒒𝒖𝒆𝒆𝒏 ∩ π’˜π’π’Žπ’‚π’ Elizabeth Katherine impregnate …

Can we do better?

Let’s look at some mistakes…

Let’s look at some mistakes… England βˆ’ London + Baghdad = ?

Let’s look at some mistakes… England βˆ’ London + Baghdad = Iraq

Let’s look at some mistakes… England βˆ’ London + Baghdad = Mosul?

The Additive Objective

cos(Iraq, England) − cos(Iraq, London) + cos(Iraq, Baghdad) = 0.15 − 0.13 + 0.63 = 0.65
cos(Mosul, England) − cos(Mosul, London) + cos(Mosul, Baghdad) = 0.13 − 0.14 + 0.75 = 0.74

• Problem: one similarity might dominate the rest (here cos(·, Baghdad) drowns out the other terms, so Mosul beats Iraq)
• Much more prevalent in the explicit representation
• Might explain why explicit underperformed

How can we do better?
• Instead of adding similarities, multiply them!

arg max_{b*} [ cos(b*, b) · cos(b*, a*) ] / cos(b*, a)

Embedding vs Explicit (Round 2)
Multiplication > Addition
[Bar chart: analogy accuracy per objective]
• Embedding: MSR Add 54%, Mul 59%; Google Add 63%, Mul 67%
• Explicit: MSR Add 29%, Mul 57%; Google Add 45%, Mul 68%

Explicit is on par with Embedding
[Bar chart: analogy accuracy with the multiplicative objective]
• MSR: Embedding 59%, Explicit 57%
• Google: Embedding 67%, Explicit 68%

Explicit is on-par with Embedding β€’ Embeddings are not β€œmagical” β€’ Embedding-based similarities have a more uniform distribution β€’ The additive objective performs better on smoother distributions β€’ The multiplicative objective overcomes this issue

Conclusion β€’ Are analogies unique to neural embeddings?

No!

They occur in sparse and explicit representations as well.

β€’ Why does vector arithmetic reveal analogies?

Because vector arithmetic is equivalent to

similarity arithmetic

.

β€’ Can we do better?

Yes!

The multiplicative objective is significantly better.

More Results and Analyses (in the paper)
• Evaluation on closed-vocabulary analogy questions (SemEval 2012)
• Experiments with a third objective function (PairDirection)
• Do different representations reveal the same analogies?
• Error analysis
• A feature-level interpretation of how word similarity reveals analogies

Thanks βˆ’ for + listening = )