#### Feature Selection & Maximum Entropy

Advanced Statistical Methods in NLP, Ling 572, January 26, 2012


#### Roadmap

- Feature selection and weighting
  - Feature weighting
  - Chi-square feature selection
  - Chi-square feature selection example
  - HW #4
- Maximum Entropy
  - Introduction: the Maximum Entropy Principle
  - Maximum Entropy NLP examples

#### Feature Selection Recap

- Problem: the curse of dimensionality
  - Data sparseness, computational cost, overfitting
- Solution: dimensionality reduction
  - New feature set r' s.t. |r'| < |r|
- Approaches (global & local):
  - Feature extraction: new features in r' are transformations of the features in r
  - Feature selection: wrapper techniques; feature scoring

#### Feature Weighting

For text classification, typical weights include:

- Binary: weights in {0,1}
- Term frequency (tf): # of occurrences of term t_k in document d_i
- Inverse document frequency (idf): with df_k = # of docs in which t_k appears and N = # of docs, idf = log(N / (1 + df_k))
- tfidf = tf * idf
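To make these weighting schemes concrete, here is a minimal Python sketch; the toy corpus and names are invented for illustration, and it uses the idf variant from the slide, log(N / (1 + df_k)).

```python
# Minimal sketch of the term-weighting schemes above, using the slide's
# idf variant: idf = log(N / (1 + df_k)). Corpus and names are invented.
import math
from collections import Counter

docs = [
    "budget cuts announced",
    "budget vote delayed",
    "team wins game",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: number of docs each term appears in
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)                      # term frequency in this doc
    return {t: tf[t] * math.log(N / (1 + df[t]))  # tf * idf
            for t in tf}

print(tfidf(tokenized[0]))
```

Note how a term appearing in most documents ('budget' here, in 2 of 3 docs) gets an idf of exactly zero under this variant, which is the point of the scheme: ubiquitous terms carry little discriminative weight.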
#### Chi Square

- Tests for the presence/absence of a relation between random variables
- Bivariate analysis: tests 2 random variables
- Can test the strength of the relationship
- (Strictly speaking) doesn't test direction

#### Chi Square Example (due to F. Xia)

Can gender predict shoe choice?

- A: male/female (the feature)
- B: shoe choice (the classes: {sandal, sneaker, ...})

Observed distribution (O):

|        | sandal | sneaker | leather shoe | boot | other |
|--------|--------|---------|--------------|------|-------|
| Male   | 6      | 17      | 13           | 9    | 5     |
| Female | 13     | 5       | 7            | 16   | 9     |

#### Comparing Distributions

Expected distribution (E), assuming no association between gender and shoe choice:

|        | sandal | sneaker | leather shoe | boot | other | Total |
|--------|--------|---------|--------------|------|-------|-------|
| Male   | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Female | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Total  | 19     | 22      | 20           | 25   | 14    | 100   |

#### Computing Chi Square

Expected value for a cell = row_total * column_total / table_total

$$\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

For the table above:

$$\chi^2 = \frac{(6-9.5)^2}{9.5} + \frac{(17-11)^2}{11} + \cdots = 14.026
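$$

This computation is easy to check mechanically. A short sketch that builds the expected table from the observed one and reproduces the slide's value:

```python
# Sketch: compute the expected table and chi-square for the shoe-choice
# example. Reproduces the slide's value of ~14.026.
import numpy as np

O = np.array([[6, 17, 13, 9, 5],      # male
              [13, 5, 7, 16, 9]])     # female

row = O.sum(axis=1, keepdims=True)    # row totals (50, 50)
col = O.sum(axis=0, keepdims=True)    # column totals (19, 22, 20, 25, 14)
E = row * col / O.sum()               # expected = row_total * col_total / table_total

chi2_stat = ((O - E) ** 2 / E).sum()
print(E)
print(chi2_stat)                      # ~14.03
```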
#### Calculating X²

1. Tabulate the contingency table of observed values O
2. Compute the row and column totals
3. Compute the table of expected values given the row/column totals, assuming no association
4. Compute X²

#### For a 2x2 Table

O:

|       | !c_i | c_i | Total |
|-------|------|-----|-------|
| !t_k  | a    | b   | a+b   |
| t_k   | c    | d   | c+d   |
| Total | a+c  | b+d | N     |

E:

|      | !c_i         | c_i          |
|------|--------------|--------------|
| !t_k | (a+b)(a+c)/N | (a+b)(b+d)/N |
| t_k  | (c+d)(a+c)/N | (c+d)(b+d)/N |

In this case the statistic reduces to a closed form:

$$\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(ad - bc)^2 \, N}{(a+b)(a+c)(b+d)(c+d)}$$

#### X² Test

- Tests whether 2 random variables are independent
- Null hypothesis: the 2 R.V.s are independent
- Compute the X² statistic
- Compute the degrees of freedom: df = (# rows - 1)(# cols - 1)
  - Shoe example: df = (2 - 1)(5 - 1) = 4
- Look up the probability of the X² statistic value in an X² table
- If the probability is low (below some significance level), the null hypothesis can be rejected

#### Requirements for the X² Test

- Events are assumed independent and from the same distribution
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient values per cell: > 5

#### X² Example

- Shared Task Evaluation: Topic Detection and Tracking (aka TDT)
- Sub-task: the Topic Tracking Task
  - Given a small number of exemplar documents (1-4) that define a topic, create a model that allows tracking of the topic, i.e. find all subsequent documents on this topic
  - Exemplars: 1-4 newswire articles, 300-600 words each
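Putting the 2x2 shortcut and the significance test together, here is a small sketch; the a, b, c, d layout follows the slide, the counts are invented, and scipy's chi2 distribution stands in for the X² table lookup.

```python
# Sketch: 2x2 chi-square shortcut plus significance test. scipy's chi2
# distribution supplies the p-value an X^2 table would otherwise provide.
from scipy.stats import chi2

def chi2_2x2(a, b, c, d):
    """X^2 = (ad - bc)^2 * N / ((a+b)(a+c)(b+d)(c+d))."""
    N = a + b + c + d
    return (a * d - b * c) ** 2 * N / ((a + b) * (a + c) * (b + d) * (c + d))

stat = chi2_2x2(a=10, b=20, c=30, d=40)   # invented counts
df = (2 - 1) * (2 - 1)                    # (# rows - 1)(# cols - 1) = 1
p = chi2.sf(stat, df)                     # P(X^2 >= stat) under independence
if p < 0.05:
    print(f"reject independence (X^2={stat:.3f}, p={p:.3f})")
else:
    print(f"cannot reject independence (X^2={stat:.3f}, p={p:.3f})")
```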
#### Challenges

- Many news articles look alike
  - Create a profile (feature representation) that highlights terms strongly associated with the current topic and differentiates it from all other topics
- Not all documents are labeled
  - Only a small subset belong to topics of interest
  - Must differentiate from other topics AND the 'background'

#### Approach

X² feature selection:

- Assume terms have a binary representation
- Positive class: term occurrences from the exemplar docs
- Negative class: term occurrences from other classes' exemplars and 'earlier' uncategorized docs
- Compute X² for the terms and retain those with the highest X² scores (keep the top N terms)
- Create one feature set per topic to be tracked

#### Tracking Approach

- Build a vector space model
- Feature weighting: tf*idf with some modifications
- Distance measure: cosine similarity
- For each topic, select documents scoring above a threshold (see the sketch after HW #4 below)
- Result: improved retrieval

#### HW #4

- Topic: Feature Selection for kNN
- Build a kNN classifier using Euclidean distance and cosine similarity
- Write a program to compute X² on a data set
- Use X² at different significance levels to filter features
- Compare the effects of the different feature filters on kNN classification
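Cosine similarity drives both the tracking step above and one of the HW #4 distance measures. Here is a minimal sketch of threshold-based tracking over sparse tf*idf vectors; the dict representation, names, and threshold value are all invented for illustration.

```python
# Sketch: score documents against a topic profile by cosine similarity
# and keep those above a threshold. Vectors are sparse dicts of tf*idf
# weights; the 0.2 threshold is an arbitrary placeholder.
import math

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def track(topic_profile, documents, threshold=0.2):
    """Return (doc_id, score) pairs for documents on the tracked topic."""
    hits = []
    for doc_id, vec in documents.items():
        score = cosine(topic_profile, vec)
        if score >= threshold:
            hits.append((doc_id, score))
    return hits
```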
#### Maximum Entropy

- "MaxEnt": a popular machine learning technique for NLP
- First uses in NLP circa 1996 (Rosenfeld, Berger)
- Applied to a wide range of tasks: sentence boundary detection (MxTerminator; Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.

#### Readings & Comments

- Several readings: (Berger, 1996), (Ratnaparkhi, 1997), and the (Klein & Manning, 2003) tutorial
- Note: some of these are very 'dense'
  - Don't spend huge amounts of time on every detail
  - Take a first pass before class and review after lecture
- Going forward, the techniques get more complex
  - Goal: understand the basic models and concepts
  - Training is especially complex; we'll discuss it, but not implement it

#### Notation Note

The readings are not entirely consistent. We'll use input = x, output = y, pair = (x,y), consistent with Berger, 1996.

- Ratnaparkhi, 1996: input = h, output = t, pair = (h,t)
- Klein & Manning, 2003: input = d, output = c, pair = (c,d)

#### Joint vs. Conditional Models

Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict the label y. Different types of models:

- Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
  - Most models so far: n-gram, Naïve Bayes, HMM, etc.
  - Conceptually easy to compute the weights: relative frequency (see the sketch below)
- Conditional models (aka discriminative models) estimate P(y|x) by maximizing P(Y|X,Θ)
  - Models going forward: MaxEnt, SVM, CRF, ...
  - Computing the weights is more complex
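As a concrete illustration of the "conceptually easy" case, here is a minimal sketch of relative-frequency estimation for a generative model's weights; the toy labeled data is invented.

```python
# Sketch: generative-model weights as relative frequencies. Counts from
# toy labeled data are turned directly into P(c) and P(f|c).
from collections import Counter, defaultdict

data = [("politics", ["budget", "cuts"]),
        ("politics", ["budget", "vote"]),
        ("sports",   ["team", "wins"])]

class_counts = Counter(c for c, _ in data)
feat_counts = defaultdict(Counter)
for c, feats in data:
    feat_counts[c].update(feats)

P_c = {c: n / len(data) for c, n in class_counts.items()}        # class priors
P_f_given_c = {c: {f: n / sum(cnt.values()) for f, n in cnt.items()}
               for c, cnt in feat_counts.items()}                # relative frequencies

print(P_c)                      # {'politics': 0.67, 'sports': 0.33}
print(P_f_given_c["politics"])  # {'budget': 0.5, 'cuts': 0.25, 'vote': 0.25}
```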
#### Naïve Bayes Model

A graphical model with the class c as the parent of the features f_1, f_2, f_3, ..., f_k: the Naïve Bayes model assumes the features f are independent of each other, given the class c.

- This conditional independence assumption is generally unrealistic:
  - P("cuts" | politics) = p_cuts
  - But is P("cuts" | politics, "budget") = p_cuts?
- We would like a model that doesn't make this assumption

#### Model Parameters

Our model: $c^* = \arg\max_c P(c) \prod_j P(f_j \mid c)$

Two types of parameters:

- P(c): class priors
- P(f_j | c): class-conditional feature probabilities

That is |C| + |V||C| parameters in total, if the features are words in a vocabulary V.

#### Weights in Naïve Bayes

|     | c_1 | c_2 | c_3 | ... | c_k |
|-----|-----|-----|-----|-----|-----|
| f_1 | $P(f_1 \mid c_1)$ | $P(f_1 \mid c_2)$ | $P(f_1 \mid c_3)$ | ... | $P(f_1 \mid c_k)$ |
| f_2 | $P(f_2 \mid c_1)$ | $P(f_2 \mid c_2)$ | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| $f_{\lvert V \rvert}$ | $P(f_{\lvert V \rvert} \mid c_1)$ | ... | ... | ... | ... |

#### Weights in Naïve Bayes and Maximum Entropy

In Naïve Bayes, the weights P(f|y) are probabilities in [0,1], and the posterior can be rewritten in exponential form:

$$P(y \mid x) = \frac{P(y) \prod_k P(f_k \mid y)}{P(x)} = \frac{e^{\ln\left(P(y) \prod_k P(f_k \mid y)\right)}}{Z} = \frac{e^{\ln P(y) + \sum_k \ln P(f_k \mid y)}}{Z}$$

In MaxEnt, the weights are real numbers of any magnitude and sign:

$$P(y \mid x) = \frac{e^{\sum_j \lambda_j f_j(x,y)}}{Z}$$

#### MaxEnt Overview

Prediction: $P(y \mid x) = \dfrac{e^{\sum_j \lambda_j f_j(x,y)}}{Z}$

- f_j(x,y): a binary feature function indicating the presence of feature j in an instance x with class y
- λ_j: the feature weights, learned in training
- Prediction: compute P(y|x) for each y and pick the highest

#### Weights in MaxEnt

Each (feature, class) pair gets its own weight:

|     | c_1 | c_2 | c_3 | ... | c_k |
|-----|-----|-----|-----|-----|-----|
| f_1 | λ_1 | λ_8 | ... | ... | ... |
| f_2 | λ_2 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| $f_{\lvert V \rvert}$ | λ_6 | ... | ... | ... | ... |
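A quick numerical check of the rewrite above: the Naïve Bayes posterior computed the usual way equals the exponentiated sum of log-probability "weights", normalized by Z. The toy probabilities are invented.

```python
# Sketch: Naive Bayes written as an exponential model. For each class y,
# score(y) = exp(ln P(y) + sum_k ln P(f_k|y)); dividing by Z = sum of the
# scores recovers exactly the usual NB posterior P(y|x).
import math

P_y = {"politics": 0.6, "sports": 0.4}                     # invented priors
P_f_given_y = {"politics": {"budget": 0.5, "cuts": 0.3},
               "sports":   {"budget": 0.1, "cuts": 0.2}}   # invented likelihoods
x = ["budget", "cuts"]

scores = {y: math.exp(math.log(P_y[y]) +
                      sum(math.log(P_f_given_y[y][f]) for f in x))
          for y in P_y}
Z = sum(scores.values())
posterior = {y: s / Z for y, s in scores.items()}
print(posterior)   # same as P(y) * prod_k P(f_k|y), renormalized
```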
#### Maximum Entropy Principle

- Intuitively: model all that is known, and assume as little as possible about what is unknown
- Maximum entropy = minimum commitment
- Related to concepts like Occam's razor
- Laplace's "Principle of Insufficient Reason": when one has no information to distinguish between the probabilities of two events, the best strategy is to consider them equally likely

#### Example I: Coin Flip (K&M, 2003)

Consider a coin flip, with entropy

$$H(X) = -\sum_x p(x) \log p(x)$$

- What values of P(X=H) and P(X=T) maximize H(X)?
  - P(X=H) = P(X=T) = 1/2: with no prior information, the best guess is a fair coin
- What if you know P(X=H) = 0.3?
  - Then P(X=T) = 0.7

#### Example II: MT (Berger, 1996)

Task: English-to-French machine translation; specifically, translating 'in'. Suppose we've seen 'in' translated as {dans, en, à, au cours de, pendant}.

- Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
  - With no other constraints, the maxent model is p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
- What if we find out that the translator uses dans or en 30% of the time?
  - New constraint: p(dans) + p(en) = 3/10
  - Now the maxent model is p(dans) = p(en) = 3/20, and p(à) = p(au cours de) = p(pendant) = 7/30
- What if we also know the translator picks à or dans 50% of the time?
  - Add the new constraint: p(à) + p(dans) = 0.5
  - Now what is the maxent model? Not intuitively obvious...
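The final model isn't obvious by inspection, but it can be found numerically by maximizing H subject to the three constraints. This sketch uses scipy's general-purpose SLSQP optimizer just to get the numbers; real MaxEnt training instead fits feature weights with methods like GIS/IIS or gradient-based optimization.

```python
# Sketch: numerically solve the Berger MT example by maximizing entropy
# under the three constraints. Illustration only, not MaxEnt training.
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)       # avoid log(0)
    return np.sum(p * np.log(p))     # minimize -H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},  # p(à) + p(dans) = 1/2
]

result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints,
                  method="SLSQP")

for w, p in zip(words, result.x):
    print(f"p({w}) = {p:.4f}")       # p(dans) comes out around 0.186
```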
#### Example III: POS (K&M, 2003)

[The tag-count tables shown on these slides did not survive the transcript.]

- Problem: the uniform model is too uniform. What else do we know?
- Nouns are more common than verbs
  - So for f_N = {NN, NNS, NNP, NNPS}, we require E[f_N] = 32/36
- Also, proper nouns are more frequent than common nouns, so E[f_{NNP,NNPS}] = 24/36
- Etc.
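Under these two constraints alone, the maxent solution can be worked out by spreading mass uniformly within each group of tags. Assuming, as in the K&M toy setting, that the leftover probability belongs to a set of non-noun tags (their exact number isn't recoverable from the transcript):

$$p(\mathrm{NNP}) = p(\mathrm{NNPS}) = \frac{24/36}{2} = \frac{12}{36}, \qquad
p(\mathrm{NN}) = p(\mathrm{NNS}) = \frac{32/36 - 24/36}{2} = \frac{4}{36},$$

leaving $1 - 32/36 = 4/36$ to be divided evenly among the non-noun tags (2/36 each if there are two of them).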