Word representations: A simple and general method for semi-supervised learning Joseph Turian with Lev Ratinov and Yoshua Bengio Goodies: http://metaoptimize.com/projects/wordreprs/

Semi-sup training?

Sup data → Supervised training → Sup model

2

Semi-sup training?

More feats + Sup data → Supervised training → Sup model

5

sup task 1: More feats + Sup data → Sup model
sup task 2: More feats + Sup data → Sup model

6

Unsup data + Sup data → Joint semi-sup → Semi-sup model

7

Unsup data → unsup pretraining → Unsup model, then + Sup data → semi-sup fine-tuning → Semi-sup model

8

Unsup data → unsup training → Unsup model → unsup feats

9

Unsup data → unsup training → unsup feats, then unsup feats + Sup data → Sup training → Semi-sup model

10

Unsup data → unsup training → unsup feats → sup task 1, sup task 2, sup task 3

11

What unsupervised features are most useful in NLP?

12

Natural language processing • Words, words, words • Words, words, words • Words, words, words 13

How do we handle words?

• Not very well 14

"One-hot" word representation • |V| = |vocabulary|, e.g. 50K for PTB2 • Classifier over a 3-word window: word -1, word 0, word +1 as one-hot vectors (3*|V| dims) → (3*|V|) x m weights → Pr dist over labels 15

One-hot word representation • 85% of vocab words occur as only 10% of corpus tokens • Bad estimate of Pr(label|rare word) • Classifier over word 0: |V|-dim one-hot input → |V| x m weights 16
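To make the sparsity concrete, here is a minimal sketch of the one-hot window input the diagrams above describe; the helper name and the `vocab_index` mapping are illustrative, not from the talk.

```python
import numpy as np

def one_hot_window(words, vocab_index):
    """Concatenate one-hot vectors for word -1, word 0, word +1.

    `vocab_index` maps each of the |V| vocab words to an integer id;
    an unseen word simply contributes an all-zero block, which is one
    face of the rare-word problem the slides describe.
    """
    V = len(vocab_index)
    x = np.zeros(3 * V)
    for slot, w in enumerate(words):      # words = (w_-1, w_0, w_+1)
        idx = vocab_index.get(w)
        if idx is not None:
            x[slot * V + idx] = 1.0
    return x                              # 3*|V| dims, at most 3 nonzeros

# e.g. with |V| = 50K this input vector has 150K entries
```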

Approach 17

Approach • Manual feature engineering 18

Approach • Induce word reprs over large corpus, unsupervised • Use word reprs as word features for supervised task 20

Less sparse word reprs?

• Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs 21

Distributional representations

F is a W x C matrix (W = size of vocab, C = number of context features), e.g. F_w,v = Pr(v follows word w) or F_w,v = Pr(v occurs in same doc as w)

23

Distributional representations

Apply a dimensionality reduction g to the W x C matrix F: g(F) = f, giving each word a d-dimensional repr with C >> d; e.g. g = LSI/LSA, LDA, PCA, ICA, rand trans

24
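A toy illustration of the last two slides, assuming the bigram definition F_w,v = Pr(v follows word w) and taking g to be a truncated SVD (the LSI/LSA option); all function and variable names here are illustrative, not the talk's code.

```python
import numpy as np
from collections import Counter

def distributional_reprs(tokens, vocab, d=50):
    """Build F[w, v] = Pr(v follows word w) from bigram counts,
    then reduce W x C down to W x d with a truncated SVD."""
    index = {w: i for i, w in enumerate(vocab)}
    W = len(vocab)
    F = np.zeros((W, W))                   # contexts = vocab itself, so C = W here
    for (w, v), n in Counter(zip(tokens, tokens[1:])).items():
        if w in index and v in index:
            F[index[w], index[v]] = n
    F /= np.maximum(F.sum(axis=1, keepdims=True), 1)   # rows become Pr(v | w)
    U, S, _ = np.linalg.svd(F, full_matrices=False)     # g = truncated SVD
    return U[:, :d] * S[:d]                # one dense d-dim row per word
```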

Less sparse word reprs?

• Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs 25

Class-based word repr • |C| classes, hard clustering • Classifier over word 0: one-hot word plus one-hot class (|V|+|C| dims) → (|V|+|C|) x m weights 26

Class-based word repr • Hard vs. soft clustering • Hierarchical vs. flat clustering 27

Less sparse word reprs?

• Distributional word reprs • Class-based (clustering) word reprs – Brown (hard, hierarchical) clustering – HMM (soft, flat) clustering • Distributed word reprs 28

Brown clustering • Hard, hierarchical class-based LM • Brown et al. (1992) • Greedy technique for maximizing bigram mutual information • Merge words by contextual similarity 30

Brown clustering • cluster(chairman) = '0010' • prefix_2(cluster(chairman)) = '00' 31 (image from Terry Koo)

Brown clustering • Hard, hierarchical class-based LM • 1000 classes • Use prefix lengths 4, 6, 10, 20 32
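The prefix features this slide calls for can be read straight off a word's cluster bit path; a small illustrative helper (names are mine, not the talk's):

```python
def brown_prefix_features(bitstring, lengths=(4, 6, 10, 20)):
    """Turn a Brown cluster path (e.g. '1010001100') into prefix features;
    a word whose path is shorter than a requested length just contributes
    its whole path."""
    return {f"pre{n}": bitstring[:n] for n in lengths}

# brown_prefix_features("1010001100")
# -> {'pre4': '1010', 'pre6': '101000', 'pre10': '1010001100', 'pre20': '1010001100'}
```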

Less sparse word reprs?

• Distributional word reprs • Class-based (clustering) word reprs – Brown (hard, hierarchical) clustering – HMM (soft, flat) clustering • Distributed word reprs 33

Less sparse word reprs?

• Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs 34

Distributed word repr • k-dimensional (low), dense representation • "word embedding" matrix E of size |V| x k • Classifier over word 0: k-dim embedding → k x m weights 35

Sequence labeling w/ embeddings • "word embedding" matrix E of size |V| x k (tied weights across positions) • word -1, word 0, word +1 embeddings concatenated (3*k dims) → (3*k) x m weights 36
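A minimal sketch of the lookup-and-concatenate input described above, assuming an unknown-word row at index 0; the names are illustrative.

```python
import numpy as np

def window_embedding(words, vocab_index, E):
    """Look up word -1, word 0, word +1 in a shared (tied) embedding
    matrix E of shape (|V|, k) and concatenate to a 3*k input vector."""
    rows = [E[vocab_index.get(w, 0)] for w in words]   # same E for every slot
    return np.concatenate(rows)                        # shape (3*k,)
```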

Less sparse word reprs?

• Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs – Collobert + Weston (2008) – HLBL embeddings (Mnih + Hinton, 2007) 37

Collobert + Weston 2008 • 5-word window, 50-dim embeddings concatenated (50*5 input) → 100 hidden units → 1 score • Training criterion: score(w1 w2 w3 w4 w5) > μ + score of the same window with a corrupted middle word 39
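A hedged sketch of the ranking criterion this slide depicts, following Collobert + Weston (2008): the observed window should outscore a copy with its middle word replaced, by a margin μ. The scorer itself (the 50*5 → 100 → 1 network) is passed in; all names are illustrative.

```python
def cw_ranking_loss(score, window, corrupt_word, mu=1.0):
    """Hinge loss: max(0, mu - score(true window) + score(corrupted window))."""
    corrupted = list(window)
    corrupted[2] = corrupt_word                  # replace the middle word w3
    return max(0.0, mu - score(window) + score(corrupted))
```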

50-dim embeddings: Collobert + Weston (2008) t-SNE vis by van der Maaten + Hinton (2008) 40

Less sparse word reprs?

• Distributional word reprs • Class-based (clustering) word reprs • Distributed word reprs – Collobert + Weston (2008) – HLBL embeddings (Mnih + Hinton, 2007) 41

Log-bilinear Language Model (LBL) • Linear prediction of w5's embedding from w1 w2 w3 w4 • Pr(w5 | w1…w4) ∝ exp(predict · target) / Z

42
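A sketch of the LBL prediction step above, ignoring bias terms; `R` (the |V| x k embedding matrix) and `C` (one k x k matrix per context position) are illustrative names, not the paper's code.

```python
import numpy as np

def lbl_distribution(context_ids, R, C):
    """Linearly predict w5's embedding from the four context embeddings,
    then Pr(w5 = w) ∝ exp(predict · R[w]) / Z."""
    predict = sum(C[i] @ R[w] for i, w in enumerate(context_ids))  # k-dim prediction
    scores = R @ predict                  # predict · target for every candidate word
    scores -= scores.max()                # numerical stability
    p = np.exp(scores)
    return p / p.sum()                    # divide by Z
```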

HLBL • HLBL = hierarchical (fast) training of LBL • Mnih + Hinton (2009) 43

Approach • Induce word reprs over large corpus, unsupervised – Brown: 3 days – HLBL: 1 week, 100 epochs – C&W: 4 weeks, 50 epochs • Use word reprs as word features for supervised task 44

Unsupervised corpus • RCV1 newswire • 40M tokens (vocab = all 270K types) 45

Supervised Tasks • Chunking (CoNLL, 2000) – CRF (Sha + Pereira, 2003) • Named entity recognition (NER) – Averaged perceptron (linear classifier) – Based upon Ratinov + Roth (2009) 46

Unsupervised word reprs as features Word = “the” Embedding = [ 0.2, …, 1.6] Brown cluster = 1010001100 (cluster 4-prefix = 1010, cluster 6-prefix = 101000, …) 47

Unsupervised word reprs as features Orig X = {pos-2="DT": 1, word-2="the": 1, ...} X w/ Brown = {pos-2="DT": 1, word-2="the": 1, class-2-pre4="1010": 1, class-2-pre6="101000": 1, ...} X w/ emb = {pos-2="DT": 1, word-2="the": 1, word-2-dim00: 0.2, …, word-2-dim49: 1.6, ...} 48
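A small sketch of how the feature dictionaries above could be assembled; the feature-name spelling and helper name are illustrative.

```python
def augment_features(feats, position, brown_path=None, embedding=None):
    """Start from the original sparse features and add either Brown-prefix
    indicator features or real-valued embedding dimensions for the word at
    the given relative position (e.g. -2)."""
    out = dict(feats)
    if brown_path is not None:
        out[f"class{position}-pre4={brown_path[:4]}"] = 1
        out[f"class{position}-pre6={brown_path[:6]}"] = 1
    if embedding is not None:
        for dim, value in enumerate(embedding):
            out[f"word{position}-dim{dim:02d}"] = float(value)
    return out

# augment_features({"pos-2=DT": 1, "word-2=the": 1}, position=-2,
#                  brown_path="1010001100")
```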

Embeddings: Normalization E = σ * E / stddev(E) 49
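A one-line reading of this formula, treating stddev(E) as a single scalar over the whole matrix (an assumption of this sketch) and using the σ = 0.1 that the summary slide reports:

```python
import numpy as np

def normalize_embeddings(E, sigma=0.1):
    """Rescale the embedding matrix: E <- sigma * E / stddev(E)."""
    return sigma * E / E.std()
```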

Embeddings: Normalization (Chunking) 50

Embeddings: Normalization (NER) 51

Repr capacity (Chunking) 52

Repr capacity (NER) 53

Test results (Chunking) [bar chart, y-axis 93–95.5]: baseline, HLBL, C&W, Brown, C&W+Brown, Suzuki+Isozaki (08) 15M, Suzuki+Isozaki (08) 1B

54

Test results (NER) [bar chart, y-axis 84–91]: Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All, All+nonlocal

55

MUC7 (OOD) results (NER) [bar chart, y-axis 66–84]: Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All, All+nonlocal

56

Test results (NER) [bar chart, y-axis 88–91]: Lin+Wu (09) 3.4B, Suzuki+Isozaki (08) 37M, Suzuki+Isozaki (08) 1B, All+nonlocal 37M, Lin+Wu (09) 700B

57

Test results • Chunking: C&W = Brown • NER: C&W < Brown • Why?

58

Word freq vs word error (Chunking) 59

Word freq vs word error (NER) 60

Summary • Both Brown + word emb can increase acc of near-SOTA system • Combining can improve accuracy further • On rare words, Brown > word emb • Scale parameter σ = 0.1

• Goodies: http://metaoptimize.com/projects/wordreprs/ Word features! Code!

61

Difficulties with word embeddings • No stopping criterion during unsup training • More active features (slower sup training) • Hyperparameters – Learning rate for model – (optional) Learning rate for embeddings – Normalization constant • vs. Brown clusters, few hyperparams 62

HMM approach • Soft, flat class-based repr • Multinomial distribution over hidden states = word representation • 80 hidden states • Huang and Yates (2009) • No results with HMM approach yet 63