Feature Selection &
Maximum Entropy
Advanced Statistical Methods in NLP
Ling 572
January 26, 2012
Roadmap
- Feature selection and weighting
  - Feature weighting
  - Chi-square feature selection
  - Chi-square feature selection example
  - HW #4
- Maximum Entropy
  - Introduction: Maximum Entropy Principle
  - Maximum Entropy NLP examples
Feature Selection Recap
- Problem: Curse of dimensionality
  - Data sparseness, computational cost, overfitting
- Solution: Dimensionality reduction
  - New feature set r' s.t. |r'| < |r|
- Approaches:
  - Global & local approaches
  - Feature extraction:
    - New features in r' are transformations of features in r
  - Feature selection:
    - Wrapper techniques
    - Feature scoring
Feature Weighting
- For text classification, typical weights include (sketch below):
  - Binary: weights in {0,1}
  - Term frequency (tf):
    - tf_k(d_i) = # of occurrences of t_k in document d_i
  - Inverse document frequency (idf):
    - df_k: # of docs in which t_k appears; N: # of docs
    - idf_k = log(N / (1 + df_k))
  - tf·idf = tf * idf
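As a concrete illustration of these weighting schemes, here is a minimal Python sketch (the toy corpus is mine, not from the slides); it uses the smoothed idf = log(N/(1+df_k)) exactly as defined above.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative data only).
docs = [
    ["budget", "cuts", "tax", "cuts"],
    ["film", "review", "budget"],
    ["tax", "policy", "budget", "vote"],
]

N = len(docs)
# Document frequency df_k: number of documents containing term t_k.
df = Counter()
for d in docs:
    for term in set(d):
        df[term] += 1

def tf(term, doc):
    """Term frequency: # of occurrences of term in doc."""
    return doc.count(term)

def idf(term):
    """Smoothed inverse document frequency, as on the slide: log(N / (1 + df_k))."""
    return math.log(N / (1 + df[term]))

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("cuts", docs[0]))    # tf=2, df=1 -> 2 * log(3/2)
print(tfidf("budget", docs[0]))  # tf=1, df=3 -> log(3/4); negative under this smoothing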
Chi Square
- Tests for presence/absence of a relation between random variables
- Bivariate analysis tests 2 random variables
  - Can test strength of relationship
  - (Strictly speaking) doesn't test direction
Chi Square Example
- Can gender predict shoe choice?
  - A: male/female → Features
  - B: shoe choice → Classes: {sandal, sneaker, ...}

            sandal  sneaker  leather shoe  boot  other
  Male         6       17         13         9     5
  Female      13        5          7        16     9

Due to F. Xia
Comparing Distributions
- Observed distribution (O):

            sandal  sneaker  leather shoe  boot  other
  Male         6       17         13         9     5
  Female      13        5          7        16     9

- Expected distribution (E):

            sandal  sneaker  leather shoe  boot  other  Total
  Male        9.5      11         10       12.5    7      50
  Female      9.5      11         10       12.5    7      50
  Total       19       22         20        25    14     100

Due to F. Xia
Computing Chi Square
- Expected value for a cell = row_total * column_total / table_total
- χ² = Σ_ij (O_ij − E_ij)² / E_ij
- X² = (6−9.5)²/9.5 + (17−11)²/11 + ... = 14.026
Calculating X²
- Tabulate contingency table of observed values: O
- Compute row, column totals
- Compute table of expected values, given row/col totals
  - Assuming no association
- Compute X² (sketch below)
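A minimal Python sketch of these steps, applied to the shoe-choice table from the earlier slides (pure Python, no external libraries; variable names are mine):

```python
# Observed counts from the shoe-choice example (rows: Male, Female).
observed = [
    [6, 17, 13, 9, 5],
    [13, 5, 7, 16, 9],
]

# Steps 1-2: row totals, column totals, grand total.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 3: expected value for each cell = row_total * column_total / table_total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Step 4: X^2 = sum over cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

print(expected[0])           # [9.5, 11.0, 10.0, 12.5, 7.0]
print(round(chi_square, 3))  # about 14.03, matching the slide's 14.026 up to rounding
```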
For 2x2 Table
- O:

           !c_i   c_i
  !t_k      a      b
  t_k       c      d

- E:

           !c_i            c_i            Total
  !t_k   (a+b)(a+c)/N   (a+b)(b+d)/N       a+b
  t_k    (c+d)(a+c)/N   (c+d)(b+d)/N       c+d
  Total      a+c            b+d             N

- χ² = Σ_ij (O_ij − E_ij)² / E_ij = N(ad − bc)² / [(a+b)(a+c)(b+d)(c+d)]  (sketch below)
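A small sketch (mine, not from the slides) checking that the 2x2 shortcut formula agrees with the general Σ(O−E)²/E form; the counts a, b, c, d are hypothetical.

```python
def chi_square_general(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells of the 2x2 table."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row = [a + b, c + d]
    col = [a + c, b + d]
    return sum(
        (observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
        for i in range(2) for j in range(2)
    )

def chi_square_2x2(a, b, c, d):
    """Closed form from the slide: N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# Hypothetical counts, e.g. a = docs with neither t_k nor c_i, ..., d = docs with both.
a, b, c, d = 40, 10, 5, 45
print(chi_square_general(a, b, c, d), chi_square_2x2(a, b, c, d))  # same value
```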
X² Test
- Test whether two random variables are independent
- Null hypothesis: the 2 R.V.s are independent
- Compute the X² statistic
- Compute the degrees of freedom
  - df = (# rows - 1)(# cols - 1)
  - Shoe example: df = (2-1)(5-1) = 4
- Look up the probability of the X² statistic value
  - X² table (or a chi-square distribution function; sketch below)
  - If the probability is low (below some significance level), we can reject the null hypothesis
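For the shoe example (X² ≈ 14.03, df = 4), the lookup step can also be done with a chi-square distribution function; a minimal sketch assuming SciPy is available:

```python
from scipy.stats import chi2

x2_stat = 14.03          # statistic computed from the shoe-choice table
df = (2 - 1) * (5 - 1)   # (# rows - 1)(# cols - 1) = 4

p_value = chi2.sf(x2_stat, df)    # P(X^2 >= 14.03 under independence)
critical_05 = chi2.ppf(0.95, df)  # critical value at the 0.05 significance level (about 9.49)

print(round(p_value, 4), round(critical_05, 2))
if p_value < 0.05:
    print("Reject the null hypothesis of independence at the 0.05 level")
```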
Requirements for X² Test
- Events assumed independent, identically distributed
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient counts per cell: > 5
X² Example
- Shared Task Evaluation:
  - Topic Detection and Tracking (aka TDT)
- Sub-task: Topic Tracking Task
  - Given a small number of exemplar documents (1-4)
    - Define a topic
    - Create a model that allows tracking of the topic
    - I.e., find all subsequent documents on this topic
  - Exemplars: 1-4 newswire articles
    - 300-600 words each
Challenges
- Many news articles look alike
  - Create a profile (feature representation) that
    - Highlights terms strongly associated with the current topic
    - Differentiates it from all other topics
- Not all documents are labeled
  - Only a small subset belong to topics of interest
  - Differentiate from other topics AND 'background'
Approach
- X² feature selection (sketch below):
  - Assume terms have a binary representation
  - Positive class: term occurrences from the exemplar docs
  - Negative class: term occurrences from
    - other classes' exemplars and 'earlier' uncategorized docs
  - Compute X² for terms
    - Retain terms with the highest X² scores
    - Keep top N terms
  - Create one feature set per topic to be tracked
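A minimal sketch of per-term X² selection under these assumptions (binary term occurrence, positive = on-topic exemplars, negative = everything else); the corpus, helper names, and top_n value are mine, not from the paper:

```python
def chi_square_2x2(a, b, c, d):
    """N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) for a 2x2 term/class table."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_terms(positive_docs, negative_docs, top_n=100):
    """positive_docs / negative_docs: lists of token lists; returns top_n terms by chi-square."""
    vocab = {t for doc in positive_docs + negative_docs for t in doc}
    n_pos, n_neg = len(positive_docs), len(negative_docs)
    scores = {}
    for term in vocab:
        pos_with = sum(term in doc for doc in positive_docs)  # on-topic docs containing term
        neg_with = sum(term in doc for doc in negative_docs)  # off-topic docs containing term
        # Cells follow the slide's 2x2 table (rows !t_k / t_k, columns !c_i / c_i).
        a_ = n_neg - neg_with   # !t_k, !c_i : off-topic docs without the term
        b_ = n_pos - pos_with   # !t_k,  c_i : on-topic docs without the term
        c_ = neg_with           #  t_k, !c_i : off-topic docs containing the term
        d_ = pos_with           #  t_k,  c_i : on-topic docs containing the term
        scores[term] = chi_square_2x2(a_, b_, c_, d_)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy usage: one topic's exemplars vs. background documents.
topic_exemplars = [["budget", "cuts", "senate", "vote"], ["senate", "budget", "debate"]]
background = [["film", "review", "oscar"], ["match", "score", "goal"], ["budget", "film"]]
print(select_terms(topic_exemplars, background, top_n=3))
```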
Tracking Approach
- Build a vector space model
- Feature weighting: tf·idf
  - with some modifications
- Distance measure: cosine similarity
- Select documents scoring above a threshold (sketch below)
  - For each topic
- Result: improved retrieval
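A sketch of the scoring step under these choices (tf·idf vectors, cosine similarity, a fixed threshold); the profile, document vectors, and threshold value are illustrative, not from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical tf*idf topic profile (top chi-square terms) and incoming documents.
topic_profile = {"senate": 2.1, "budget": 1.7, "vote": 1.2}
incoming = {
    "doc1": {"budget": 1.5, "vote": 0.9, "film": 0.4},
    "doc2": {"oscar": 1.8, "film": 1.6},
}

THRESHOLD = 0.3  # illustrative value; in practice tuned per topic on held-out data
for doc_id, vec in incoming.items():
    score = cosine(topic_profile, vec)
    if score > THRESHOLD:
        print(f"{doc_id} tracked (score={score:.2f})")
```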
HW #4
- Topic: Feature selection for kNN
- Build a kNN classifier using:
  - Euclidean distance, cosine similarity
- Write a program to compute X² on a data set
- Use X² at different significance levels to filter features
- Compare the effects of the different feature filters on kNN classification
Maximum Entropy
Maximum Entropy
- "MaxEnt":
  - Popular machine learning technique for NLP
  - First uses in NLP circa 1996: Rosenfeld, Berger
  - Applied to a wide range of tasks:
    - Sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
Readings & Comments
- Several readings:
  - (Berger, 1996), (Ratnaparkhi, 1997)
  - (Klein & Manning, 2003): Tutorial
- Note: Some of these are very 'dense'
  - Don't spend huge amounts of time on every detail
  - Take a first pass before class, review after lecture
- Going forward:
  - Techniques are more complex
  - Goal: Understand the basic model and concepts
  - Training is especially complex; we'll discuss it, but not implement it
Notation Note
- Notation is not entirely consistent across the readings:
  - We'll use: input = x; output = y; pair = (x,y)
    - Consistent with Berger, 1996
  - Ratnaparkhi, 1996: input = h; output = t; pair = (h,t)
  - Klein & Manning, 2003: input = d; output = c; pair = (c,d)
Joint vs Conditional Models
- Assuming some training data {(x,y)}, we need to learn a model Θ s.t., given a new x, we can predict its label y.
- Different types of models:
  - Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
    - Most models so far: n-gram, Naïve Bayes, HMM, etc.
    - Conceptually easy to compute weights: relative frequency
  - Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
    - Models going forward: MaxEnt, SVM, CRF, ...
    - Computing weights is more complex
Naïve Bayes Model
[Graphical model: class node c with arrows to feature nodes f1, f2, f3, ..., fk]
- The Naïve Bayes model assumes the features f are independent of each other, given the class c
Naïve Bayes Model
- Makes the assumption of conditional independence of features given the class
- However, this is generally unrealistic
  - P("cuts"|politics) = p_cuts
  - What about P("cuts"|politics,"budget")? Is it still equal to p_cuts?
- We would like a model that doesn't make this assumption
Model Parameters
- Our model:
  - c* = argmax_c P(c) Π_j P(f_j|c)
- Two types of parameters:
  - P(c): class priors
  - P(f_j|c): class-conditional feature probabilities
- Number of parameters in total:
  - |C| + |V||C|, if features are words in vocabulary V
Weights in Naïve Bayes

            c1          c2          c3        ...    ck
  f1     P(f1|c1)    P(f1|c2)    P(f1|c3)     ...  P(f1|ck)
  f2     P(f2|c1)    P(f2|c2)      ...        ...    ...
  ...       ...         ...        ...        ...    ...
  f|V|   P(f|V||c1)     ...        ...        ...    ...
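As a rough size check (the numbers are mine, purely illustrative): with a 20,000-word vocabulary and 5 classes, the table above has |V|·|C| class-conditional entries plus |C| priors.

```python
V = 20_000   # vocabulary size (illustrative)
C = 5        # number of classes (illustrative)

n_priors = C             # one P(c) per class
n_conditionals = V * C   # one P(f_j|c) per feature/class cell in the table above
print(n_priors + n_conditionals)  # 100,005 parameters
```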
Weights in Naïve Bayes and Maximum Entropy
- Naïve Bayes:
  - The P(f|y) are probabilities in [0,1]; these are the weights
  - P(y|x) = P(y) Π_k P(f_k|y) / P(x)
           = e^{ln[P(y) Π_k P(f_k|y)]} / Z
           = e^{ln P(y) + Σ_k ln P(f_k|y)} / Z
- MaxEnt:
  - Weights are real numbers; any magnitude, sign
  - P(y|x) = e^{Σ_j λ_j f_j(x,y)} / Z
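A small numeric check (toy probabilities of my own) that the Naïve Bayes rewrite above really is of exponential form: the direct and log-linear expressions give the same posterior.

```python
import math

# Toy Naive Bayes parameters: 2 classes, 3 binary features all "on" in the input x.
priors = {"pos": 0.6, "neg": 0.4}
cond = {  # P(f_k | y)
    "pos": [0.8, 0.5, 0.9],
    "neg": [0.3, 0.4, 0.2],
}

# Direct form: P(y|x) proportional to P(y) * prod_k P(f_k|y).
direct = {y: priors[y] * math.prod(cond[y]) for y in priors}
Z1 = sum(direct.values())
direct = {y: v / Z1 for y, v in direct.items()}

# Log-linear form: P(y|x) proportional to exp(ln P(y) + sum_k ln P(f_k|y)).
loglinear = {y: math.exp(math.log(priors[y]) + sum(math.log(p) for p in cond[y]))
             for y in priors}
Z2 = sum(loglinear.values())
loglinear = {y: v / Z2 for y, v in loglinear.items()}

print(direct, loglinear)  # identical distributions (up to floating point)
```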
MaxEnt Overview
- Prediction (sketch below):
  - P(y|x) = e^{Σ_j λ_j f_j(x,y)} / Z
  - f_j(x,y): binary feature function, indicating presence of feature j in instance x with class y
  - λ_j: feature weights, learned in training
- Prediction: compute P(y|x) for each y, pick the highest
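A minimal prediction sketch for this formula; the feature functions and λ values below are made up for illustration (in practice the λ's come from training):

```python
import math

CLASSES = ["politics", "sports"]

# Binary feature functions f_j(x, y): fire when a word is present AND the class matches.
def make_feature(word, cls):
    return lambda x, y: 1.0 if (word in x and y == cls) else 0.0

features = [make_feature("budget", "politics"),
            make_feature("vote", "politics"),
            make_feature("goal", "sports")]
lambdas = [1.2, 0.8, 1.5]   # hypothetical trained weights

def p_y_given_x(x):
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
              for y in CLASSES}
    Z = sum(scores.values())          # normalization over classes
    return {y: s / Z for y, s in scores.items()}

x = {"budget", "vote", "deficit"}     # input document as a bag of words
probs = p_y_given_x(x)
print(probs, max(probs, key=probs.get))  # pick the highest-probability class
```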
Weights in MaxEnt

            c1    c2    c3   ...   ck
  f1        λ1    λ8    ...  ...   ...
  f2        λ2    ...   ...  ...   ...
  ...       ...   ...   ...  ...   ...
  f|V|      λ6    ...   ...  ...   ...
Maximum Entropy Principle
- Intuitively: model all that is known, and assume as little as possible about what is unknown
- Maximum entropy = minimum commitment
- Related to concepts like Occam's razor
- Laplace's "Principle of Insufficient Reason":
  - When one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely
Example I: (K&M 2003)
- Consider a coin flip (sketch below)
  - H(X) = -Σ_x p(x) log p(x)
- What values of P(X=H), P(X=T) maximize H(X)?
  - P(X=H) = P(X=T) = 1/2
  - If no prior information, the best guess is a fair coin
- What if you know P(X=H) = 0.3?
  - P(X=T) = 0.7
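A quick numeric check of the coin example (my own snippet): entropy of a few candidate distributions, with and without the P(H) = 0.3 constraint.

```python
import math

def entropy(ps):
    """H(X) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Unconstrained: the uniform coin has the highest entropy.
for p_heads in (0.1, 0.3, 0.5, 0.9):
    print(p_heads, round(entropy([p_heads, 1 - p_heads]), 4))
# P(H) = 0.5 gives H = 1 bit, the maximum for two outcomes.

# With the constraint P(H) = 0.3, P(T) is forced to 0.7; entropy is then fixed.
print(round(entropy([0.3, 0.7]), 4))
```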
Example II: MT (Berger, 1996)
- Task: English → French machine translation
  - Specifically, translating 'in'
- Suppose we've seen 'in' translated as:
  - {dans, en, à, au cours de, pendant}
- Constraint:
  - p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
- If there is no other constraint, what is the maxent model?
  - p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
Example II: MT (Berger, 1996)
- What if we find out that the translator uses dans or en 30% of the time?
  - Constraint: p(dans) + p(en) = 3/10
- Now what is the maxent model?
  - p(dans) = p(en) = 3/20
  - p(à) = p(au cours de) = p(pendant) = 7/30
- What if we also know the translator picks à or dans 50% of the time?
  - Add a new constraint: p(à) + p(dans) = 0.5
- Now what is the maxent model?
  - Not intuitively obvious... (see the numeric sketch below)
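To see what the constrained solution looks like, here is a sketch that maximizes entropy numerically under the three constraints; it assumes SciPy is available, and the variable names are mine. It should give roughly p(dans) ≈ 0.19, p(en) ≈ 0.11, p(à) ≈ 0.31, and p(au cours de) = p(pendant) ≈ 0.19.

```python
import numpy as np
from scipy.optimize import minimize

# Order of the probabilities: dans, en, à, au cours de, pendant.
def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)      # avoid log(0)
    return np.sum(p * np.log(p))    # minimizing -H(p) = sum p log p maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(à) = 1/2
]
bounds = [(0.0, 1.0)] * 5
x0 = np.full(5, 0.2)                # start from the uniform distribution

result = minimize(neg_entropy, x0, bounds=bounds, constraints=constraints,
                  method="SLSQP")
print(np.round(result.x, 3))
```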
Example III: POS (K&M, 2003)
- Problem: Too uniform
- What else do we know?
  - Nouns are more common than verbs
  - So with f_N = {NN, NNS, NNP, NNPS}, E[f_N] = 32/36
  - Also, proper nouns are more frequent than common nouns, so
    - E[f_{NNP,NNPS}] = 24/36
  - Etc.