#### Transcript: Feature Selection & Maximum Entropy — Advanced Statistical Methods in NLP, Ling 572, January 26, 2012

Outline

- Feature selection and weighting
- Feature weighting
- Chi-square feature selection
- Chi-square feature selection example
- HW #4
- Maximum Entropy
- Introduction: Maximum Entropy Principle
- Maximum Entropy NLP examples
Feature Selection Recap

- Problem: curse of dimensionality
  - Data sparseness, computational cost, overfitting
- Solution: dimensionality reduction
  - New feature set r' s.t. |r'| < |r|
- Approaches:
  - Global & local approaches
  - Feature extraction:
    - New features in r' are transformations of features in r
  - Feature selection:
    - Wrapper techniques
    - Feature scoring
Feature Weighting

- For text classification, typical weights include:
  - Binary: weights in {0,1}
  - Term frequency (tf): # occurrences of t_k in document d_i
  - Inverse document frequency (idf): with df_k = # of docs in which t_k appears and N = # docs,
    idf_k = log(N / (1 + df_k))
  - tf·idf = tf × idf
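As a quick sketch of the weights just defined (toy corpus and function name are made up, not from the slides):

```python
import math

def tfidf(term, doc, docs):
    """tf*idf with the slide's idf_k = log(N / (1 + df_k))."""
    tf = doc.count(term)                    # occurrences of t_k in d_i
    df = sum(term in d for d in docs)       # df_k: # docs containing t_k
    idf = math.log(len(docs) / (1 + df))    # N = len(docs)
    return tf * idf

# toy corpus for illustration
docs = [["budget", "cuts", "cuts"], ["budget", "vote"], ["game", "score"]]
w = tfidf("cuts", docs[0], docs)            # tf = 2, df = 1, idf = log(3/2)
```

Note that with the `1 + df_k` smoothing in the denominator, a term appearing in every document gets a slightly negative idf.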
Chi Square

- Tests for presence/absence of a relation between random variables
- Bivariate analysis tests 2 random variables
  - Can test strength of relationship
  - (Strictly speaking) doesn't test direction
Chi Square Example (due to F. Xia)

- Can gender predict shoe choice?
  - A: male/female → features
  - B: shoe choice → classes: {sandal, sneaker, …}

|        | sandal | sneaker | leather shoe | boot | other |
|--------|--------|---------|--------------|------|-------|
| Male   | 6      | 17      | 13           | 9    | 5     |
| Female | 13     | 5       | 7            | 16   | 9     |
Comparing Distributions (due to F. Xia)

- Observed distribution (O):

|        | sandal | sneaker | leather shoe | boot | other |
|--------|--------|---------|--------------|------|-------|
| Male   | 6      | 17      | 13           | 9    | 5     |
| Female | 13     | 5       | 7            | 16   | 9     |

- Expected distribution (E):

|        | sandal | sneaker | leather shoe | boot | other | Total |
|--------|--------|---------|--------------|------|-------|-------|
| Male   | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Female | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Total  | 19     | 22      | 20           | 25   | 14    | 100   |
Computing Chi Square

- Expected value for a cell = row_total × column_total / table_total
- χ² = Σ_ij (O_ij − E_ij)² / E_ij
- X² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … = 14.026
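The statistic can be reproduced directly from the observed table; this sketch recomputes the expected counts under independence and sums the cell contributions:

```python
# Chi-square statistic for the gender/shoe table above.
observed = [[6, 17, 13, 9, 5],    # Male
            [13, 5, 7, 16, 9]]    # Female

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
N = sum(row_tot)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / N    # expected count under independence
        chi2 += (o - e) ** 2 / e
# chi2 ≈ 14.03, matching the slide's 14.026 up to rounding
```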
Calculating X²

- Tabulate contingency table of observed values: O
- Compute row, column totals
- Compute table of expected values, given row/column totals
  - Assuming no association
- Compute X²
For 2x2 Table

- O:

|       | tk  | !tk | Total |
|-------|-----|-----|-------|
| !ci   | a   | c   | a+c   |
| ci    | b   | d   | b+d   |
| Total | a+b | c+d | N     |

- E:

|       | tk           | !tk          | Total |
|-------|--------------|--------------|-------|
| !ci   | (a+b)(a+c)/N | (c+d)(a+c)/N | a+c   |
| ci    | (a+b)(b+d)/N | (c+d)(b+d)/N | b+d   |
| Total | a+b          | c+d          | N     |

- χ² = Σ_ij (O_ij − E_ij)² / E_ij = N(ad − bc)² / ((a+b)(a+c)(b+d)(c+d))
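A small check, with hypothetical counts a=30, b=10, c=15, d=45, that the closed form agrees with the cell-by-cell sum:

```python
def chi2_cells(a, b, c, d):
    """Cell-by-cell form: sum of (O - E)^2 / E over the 2x2 table."""
    N = a + b + c + d
    obs = [[a, c], [b, d]]               # rows !ci, ci; columns tk, !tk
    row = [a + c, b + d]
    col = [a + b, c + d]
    return sum((obs[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
               for i in range(2) for j in range(2))

def chi2_closed(a, b, c, d):
    """Closed form: N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    N = a + b + c + d
    return N * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# hypothetical counts for a term/class pair: the two forms agree
s1 = chi2_cells(30, 10, 15, 45)
s2 = chi2_closed(30, 10, 15, 45)
```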
X² Test

- Test whether two random variables are independent
- Null hypothesis: the 2 R.V.s are independent
- Compute the X² statistic
- Compute degrees of freedom: df = (# rows − 1)(# cols − 1)
  - Shoe example: df = (2 − 1)(5 − 1) = 4
- Test probability of the X² statistic value (X² table)
- If the probability is low (below some significance level), we can reject the null hypothesis
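A sketch of the full test for the shoe example; 9.488 is the standard chi-square table entry for df = 4 at the 0.05 significance level:

```python
# X^2 test for the shoe example, using the statistic computed earlier.
chi2_stat = 14.026
df = (2 - 1) * (5 - 1)        # (# rows - 1)(# cols - 1) = 4

# chi-square table critical value for df = 4 at alpha = 0.05
critical_05 = 9.488

reject_null = chi2_stat > critical_05
# True: at the 0.05 level, we reject independence of gender and shoe choice
```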
Requirements for X² Test

- Events assumed independent, same distribution
- Outcomes must be mutually exclusive
- Raw frequencies, not percentages
- Sufficient values per cell: > 5
X² Example

- Topic Detection and Tracking (aka TDT)
- Given a small number of exemplar documents (1-4):
  - Define a topic
  - Create a model that allows tracking of the topic, i.e., find all subsequent documents on this topic
- Exemplars: 1-4 newswire articles, 300-600 words each
Challenges

- Many news articles look alike
  - Create a profile (feature representation) that:
    - highlights terms strongly associated with the current topic
    - differentiates it from all other topics
- Not all documents are labeled
  - Only a small subset belong to topics of interest
  - Must differentiate from other topics AND 'background'
Approach

- X² feature selection:
  - Assume terms have binary representation
  - Positive class: term occurrences from the topic's exemplar docs
  - Negative class: term occurrences from other classes' exemplars and 'earlier' uncategorized docs
  - Compute X² for terms; retain the terms with the highest X² scores (keep top N)
  - Create one feature set per topic to be tracked
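A minimal sketch of the selection step, assuming binary term presence; the document sets, term names, and the helper `top_n_terms` are made up for illustration, not from the TDT system:

```python
def chi2_2x2(a, b, c, d):
    """2x2 chi-square via the closed form N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    N = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return N * (a * d - b * c) ** 2 / denom if denom else 0.0

def top_n_terms(pos_docs, neg_docs, n):
    """Score each term's binary presence against the positive class; keep top n."""
    vocab = {t for d in pos_docs + neg_docs for t in d}
    def score(t):
        a = sum(t in d for d in pos_docs)      # positive docs containing t
        b = sum(t in d for d in neg_docs)      # negative docs containing t
        return chi2_2x2(a, b, len(pos_docs) - a, len(neg_docs) - b)
    return sorted(vocab, key=score, reverse=True)[:n]

# made-up exemplar docs for one topic vs. 'background'
pos = [{"quake", "rescue"}, {"quake", "aftershock"}]
neg = [{"election", "vote"}, {"election", "poll"}]
selected = top_n_terms(pos, neg, 2)
```

Note that X² is symmetric: terms strongly associated with *either* class score high, which is fine for separating a topic from its background.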
Tracking Approach

- Build vector space model
  - Feature weighting: tf·idf (with some modifications)
  - Distance measure: cosine similarity
- Select documents scoring above a threshold, for each topic
- Result: improved retrieval
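The scoring step can be sketched as follows; the profile weights and the 0.5 threshold are made up for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# made-up tf*idf topic profile and incoming document
topic_profile = {"quake": 2.0, "rescue": 1.0}
incoming_doc  = {"quake": 1.0, "aftershock": 1.0}
track_it = cosine(topic_profile, incoming_doc) > 0.5   # above threshold?
```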
HW #4

- Topic: Feature Selection for kNN
- Build a kNN classifier using Euclidean distance and cosine similarity
- Write a program to compute X² on a data set
- Use X² at different significance levels to filter features
- Compare the effects of different feature filtering on kNN classification
Maximum Entropy

- "MaxEnt": popular machine learning technique for NLP
- First uses in NLP circa 1996: Rosenfeld, Berger
- Applied to a wide range of tasks:
  - Sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
Readings

- (Berger, 1996), (Ratnaparkhi, 1997)
- (Klein & Manning, 2003): tutorial
- Note: some of these are very 'dense'
  - Don't spend huge amounts of time on every detail
  - Take a first pass before class, review after lecture
- Going forward:
  - Techniques are more complex
  - Goal: understand the basic model and concepts
  - Training is especially complex; we'll discuss it, but not implement it
Notation Note

- Notation is not entirely consistent across sources:
  - We'll use: input = x, output = y, pair = (x,y) (consistent with Berger, 1996)
  - Ratnaparkhi, 1996: input = h, output = t, pair = (h,t)
  - Klein & Manning, 2003: input = d, output = c, pair = (c,d)
Joint vs Conditional Models

- Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict label y.
- Different types of models:
  - Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
    - Most models so far: n-gram, Naïve Bayes, HMM, etc.
    - Conceptually easy to compute weights: relative frequency
  - Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
    - Models going forward: MaxEnt, SVM, CRF, …
    - Computing weights is more complex
Naïve Bayes Model

- The Naïve Bayes model assumes features f are independent of each other, given the class
- [Figure: graphical model with class node c pointing to independent feature nodes f1, f2, f3, …, fk]
Naïve Bayes Model

- Makes the assumption of conditional independence of features given the class
- However, this is generally unrealistic:
  - P("cuts"|politics) = p_cuts
  - P("cuts"|politics,"budget") ?= p_cuts
- We would like a model that doesn't make this assumption
Model Parameters

- Our model: c* = argmax_c P(c) Π_j P(f_j|c)
- Two types of parameters:
  - P(c): class priors
  - P(f_j|c): class-conditional feature probabilities
- Parameters in total: |C| + |V||C|, if features are words in vocabulary V
Weights in Naïve Bayes

|        | c1           | c2        | c3        | … | ck        |
|--------|--------------|-----------|-----------|---|-----------|
| f1     | P(f1\|c1)    | P(f1\|c2) | P(f1\|c3) | … | P(f1\|ck) |
| f2     | P(f2\|c1)    | P(f2\|c2) | …         | … | …         |
| …      | …            | …         | …         | … | …         |
| f\|V\| | P(f\|V\|\|c1) | …        | …         | … | …         |
Weights in Naïve Bayes and Maximum Entropy

- Naïve Bayes: P(f|y) are probabilities in [0,1], used as weights:

  P(y|x) = P(y) Π_k P(f_k|y) / P(x)
         = e^{ln [P(y) Π_k P(f_k|y)]} / Z
         = e^{ln P(y) + Σ_k ln P(f_k|y)} / Z

- MaxEnt: weights are real numbers; any magnitude, sign:

  P(y|x) = e^{Σ_j λ_j f_j(x,y)} / Z
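A numeric check, with made-up Naïve Bayes parameters, that the exponentiated log-sum form gives the same posterior as the product form:

```python
import math

# made-up Naive Bayes parameters: two classes, two active features
P_y = {"pol": 0.6, "sport": 0.4}
P_f = {("cuts", "pol"): 0.2, ("cuts", "sport"): 0.05,
       ("game", "pol"): 0.01, ("game", "sport"): 0.3}
feats = ["cuts", "game"]

# log-linear form: exp(ln P(y) + sum_k ln P(f_k|y)), normalized by Z
scores = {y: math.exp(math.log(P_y[y]) +
                      sum(math.log(P_f[(f, y)]) for f in feats))
          for y in P_y}
Z = sum(scores.values())
log_linear = {y: s / Z for y, s in scores.items()}

# product form: P(y) * prod_k P(f_k|y), normalized the same way
prod = {y: P_y[y] * P_f[("cuts", y)] * P_f[("game", y)] for y in P_y}
Zp = sum(prod.values())
product_form = {y: s / Zp for y, s in prod.items()}
# the two posteriors are identical
```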
MaxEnt Overview

- Prediction: P(y|x) = e^{Σ_j λ_j f_j(x,y)} / Z
  - f_j(x,y): binary feature function, indicating presence of feature j in instance x of class y
  - λ_j: feature weights, learned in training
- Prediction: compute P(y|x), pick the highest-scoring y
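A minimal sketch of prediction; the weights and feature names are hypothetical, and training (how the λ's are learned) is not shown:

```python
import math

def maxent_predict(active_feats, classes, weights):
    """P(y|x) = exp(sum_j lambda_j * f_j(x,y)) / Z, with binary f_j
    that fire when (feature, class) matches; missing weights count as 0."""
    scores = {y: math.exp(sum(weights.get((f, y), 0.0) for f in active_feats))
              for y in classes}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

# hypothetical learned weights: any sign or magnitude is allowed
weights = {("cuts", "politics"): 1.2, ("budget", "politics"): 0.8,
           ("cuts", "sports"): -0.4}
p = maxent_predict({"cuts", "budget"}, ["politics", "sports"], weights)
best = max(p, key=p.get)     # prediction: highest-probability class
```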
Weights in MaxEnt

|        | c1 | c2 | c3 | … | ck |
|--------|----|----|----|---|----|
| f1     | λ1 | λ8 | …  | … | …  |
| f2     | λ2 | …  | …  | … | …  |
| …      | …  | …  | …  | … | …  |
| f\|V\| | λ6 | …  | …  | … | …  |
Maximum Entropy Principle

- Intuitively: model all that is known, and assume as little as possible about what is unknown
- Maximum entropy = minimum commitment
- Related to concepts like Occam's razor
- Laplace's "Principle of Insufficient Reason": when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely
Example I (K&M, 2003)

- Consider a coin flip: H(X) = −Σ_x p(x) log p(x)
- What values of P(X=H), P(X=T) maximize H(X)?
  - P(X=H) = P(X=T) = 1/2
  - With no prior information, the best guess is a fair coin
- What if you know P(X=H) = 0.3?
  - Then P(X=T) = 0.7
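A quick numeric check of the coin example: sweeping P(X=H) over a grid shows entropy peaking at the fair coin.

```python
import math

def H(ps):
    """H(X) = -sum_x p(x) log2 p(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# sweep P(X=H) over a grid: entropy is maximized by the fair coin
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: H([p, 1 - p]))
# best_p == 0.5, where H = 1 bit; if P(X=H) = 0.3 is fixed,
# the only remaining choice is P(X=T) = 0.7
```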
Example II: MT (Berger, 1996)

- Task: English → French machine translation; specifically, translating 'in'
- Suppose we've seen 'in' translated as: {dans, en, à, au cours de, pendant}
- Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
- With no other constraint, what is the maxent model?
  - p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
Example II: MT (Berger, 1996)

- What if we find out that the translator uses dans or en 30% of the time?
  - Constraint: p(dans) + p(en) = 3/10
- Now what is the maxent model?
  - p(dans) = p(en) = 3/20
  - p(à) = p(au cours de) = p(pendant) = 7/30
- What if we also know the translator picks à or dans 50% of the time?
  - Now what is the maxent model? Not intuitively obvious…
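The first update can at least be verified numerically: among randomly sampled distributions satisfying p(dans) + p(en) = 3/10, none has higher entropy than the solution above. The sampling scheme here is ad hoc, purely for a sanity check:

```python
import math, random

def H(ps):
    """Entropy -sum p log p (natural log), with 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# order: dans, en, à, au cours de, pendant
proposed = [3/20, 3/20, 7/30, 7/30, 7/30]   # the maxent solution above

random.seed(0)
best = proposed
for _ in range(5000):
    x = random.uniform(0.0, 0.3)            # p(dans); p(en) = 0.3 - x
    r = [random.random() for _ in range(3)]
    s = sum(r)
    cand = [x, 0.3 - x] + [0.7 * v / s for v in r]
    if H(cand) > H(best):
        best = cand
# no sampled feasible distribution beats the proposed one
```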
Example III: POS (K&M, 2003)

- Problem: too uniform
- What else do we know?
  - Nouns are more common than verbs
  - So f_N = {NN, NNS, NNP, NNPS}, and E[f_N] = 32/36
  - Also, proper nouns are more frequent than common nouns, so E[NNP, NNPS] = 24/36
  - Etc.