Classifying by Multiple Features: Naive Bayes

• e: the example (instance) to be classified – a word occurrence, a document
• c: a class, among all possible classes C – a word sense, a document category
• f: a feature – a context word, a document word

Bayes Decision Rule

• Among alternative classes, select c such that P(c|e) is maximal
  – Minimizes the probability of error
• Likelihood ratio view: select c for which the ratio P(c|e) / P(~c|e) is maximal
• Binary classification: classify c if the ratio > 1, otherwise ~c
• Multiple (0 or more) classes per example: consider each class as a binary classification (usually with a threshold, due to the imprecise estimate)
• It is easier to estimate P(e|c) (generative model), therefore Bayes rule is used

Conditional Probability and Bayes Rule

• Conditional probability: the probability of event A given that event B has happened:

  \Pr[A \mid B] = \frac{\Pr[A, B]}{\Pr[B]}

  equivalently, \Pr[A, B] = \Pr[A \mid B] \, \Pr[B] = \Pr[B \mid A] \, \Pr[A]

• Bayes rule:

  \Pr[A \mid B] = \frac{\Pr[B \mid A] \, \Pr[A]}{\Pr[B]}
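
A minimal numeric sanity check of the two identities above; the joint distribution over the binary events A and B is invented purely for illustration:

    # Hypothetical joint distribution over two binary events A and B.
    joint = {(True, True): 0.2, (True, False): 0.3,
             (False, True): 0.1, (False, False): 0.4}

    p_a = sum(p for (a, b), p in joint.items() if a)   # Pr[A]
    p_b = sum(p for (a, b), p in joint.items() if b)   # Pr[B]
    p_a_given_b = joint[(True, True)] / p_b            # Pr[A|B] = Pr[A,B] / Pr[B]
    p_b_given_a = joint[(True, True)] / p_a            # Pr[B|A] = Pr[A,B] / Pr[A]

    # Bayes rule: Pr[A|B] = Pr[B|A] * Pr[A] / Pr[B]
    assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12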

Bayes and Independence Assumption

  \frac{P(c \mid e)}{P(\sim c \mid e)}
  = \frac{P(e \mid c)}{P(e \mid \sim c)} \cdot \frac{P(c)}{P(\sim c)}    (Bayes)
  = \prod_{f \in e} \frac{P(f \mid c)}{P(f \mid \sim c)} \cdot \frac{P(c)}{P(\sim c)}    (independence assumption)

• The left-hand side is the posterior (odds); the factor P(c)/P(~c) is the prior (odds)

Log-Likelihood Ratio

• Computational convenience – avoids underflow

  score(c) = \log \frac{P(c \mid e)}{P(\sim c \mid e)}
           = \sum_{f \in e} \log \frac{P(f \mid c)}{P(f \mid \sim c)} + \log \frac{P(c)}{P(\sim c)}

• Estimate the probabilities from a training corpus, by (smoothed) count ratios
• \log \frac{P(f \mid c)}{P(f \mid \sim c)} : the "entailment score" of f for c
• Working with this ratio makes each feature's "weight" within the score explicit, compared to score(c) = \sum_{f \in e} \log P(f \mid c) + \log P(c), where \log P(f \mid c) by itself is large for frequent f's, regardless of discrimination
• Was applied to many classification tasks
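
A minimal sketch of this scoring formula in Python, assuming the (smoothed) conditional probabilities and the class prior have already been estimated; the dictionary layout and names are illustrative, not part of the original slides:

    import math

    def llr_score(example_features, p_f_given_c, p_f_given_not_c, p_c):
        """Log-likelihood-ratio score of class c for one example.

        example_features: iterable of features active in the example
        p_f_given_c, p_f_given_not_c: dicts mapping feature -> smoothed probability
        p_c: prior probability P(c)
        """
        s = math.log(p_c / (1.0 - p_c))                          # log P(c)/P(~c)
        for f in example_features:
            s += math.log(p_f_given_c[f] / p_f_given_not_c[f])   # "entailment score" of f for c
        return s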

Word Sense Disambiguation by Naïve Bayes

• Each ambiguous word w is a classification problem
• e: the example (instance) to be classified – an occurrence of the ambiguous word
• c: a class, among all possible classes C – a word sense, among all listed senses
• f: a feature – a context word or phrase, in near or broad context, possibly within a syntactic relationship

Estimating Probabilities

• Assume a sense-labeled training corpus
• Apply some smoothing to avoid zero counts:

  P(f \mid c) = \frac{\text{freq}(f, c)}{\text{freq}(c)} \qquad P(c) = \frac{\text{freq}(c)}{\text{freq}(w)}

• freq(c) – a sum over all context positions
• Context features (words) that tend to occur mostly with a specific sense, and not with the others, contribute high values to its accumulated score
• Gale, Church and Yarowsky (1992) report 90% accuracy for 6 ambiguous nouns
  – Combining weaker evidence from broad context, vs. the stronger collocations used in decisions based on a single feature
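
A small sketch of this estimation step, using add-k smoothing as one concrete (assumed) choice; the training-data layout, a list of (sense, context_words) instances for the ambiguous word w, is also an assumption for illustration:

    from collections import Counter, defaultdict

    def estimate_sense_probs(training, vocab_size, k=1.0):
        """training: list of (sense, context_words) pairs for one ambiguous word w.
        Returns smoothed P(f|c) per sense and the sense priors P(c)."""
        sense_count = Counter()              # freq(c): labeled occurrences of w per sense
        feat_count = defaultdict(Counter)    # freq(f, c)
        pos_count = Counter()                # context positions observed per sense
        for sense, context in training:
            sense_count[sense] += 1
            pos_count[sense] += len(context)
            feat_count[sense].update(context)

        p_c = {c: sense_count[c] / len(training) for c in sense_count}
        p_f_given_c = {c: {f: (feat_count[c][f] + k) / (pos_count[c] + k * vocab_size)
                           for f in feat_count[c]}
                       for c in sense_count}
        return p_f_given_c, p_c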

Examples for Significant Features

• Senses of drug (Gale et al. 1992):
  – ‘medication’: prices, prescription, patent, increase, consumer, pharmaceutical
  – ‘illegal substance’: abuse, paraphernalia, illicit, alcohol, cocaine, traffickers

Text Categorization

• A document is labeled by its "major" categories (event type, entities, geography) – typically the major topics discussed in the document
• "Controlled vocabulary" – categories taken from a canonical pre-defined list (hierarchy/taxonomy) – to be distinguished from free-text terms
  – Used for consistent searching and browsing
• Document features: terms, weighted by frequency (mostly), position, …
• Category: typically represented by triggering rules or feature weights, applied to test documents

Types of Text Categorization Tasks

• Topical (multiple classes per document)
  – Broad: finance, politics, sports
  – Detailed: investment, sale, merger
• Genre: news, contract, movie review, editorial
• Authorship (style-based): individual, gender
• Attitude: positive, negative
• Relevance
  – Generic: spam, junk mail, pornography
  – Personal interests
• Applications: filtering, routing, search, browsing

Categorization Approaches (with Some Industrial Perspective)

• Manual labeling by authors/editors
  – Still the most common approach in practice
• Supervised automatic classification
  – Manual encoding of classification rules (instead of training) – requires special expertise
  – Supervised machine learning
    • Training documents are labeled manually (cost!)
    • Requires (somewhat) less expertise
  – Combined
    • Users can modify the learned "logic" (rules, weights)
    • "Bootstrapping" of classification "logic" (no labeling)

Features in Text Categorization

• Feature space dimensionality is very high – each word, possibly word combinations
  – Vs. many other learning tasks
• Only a relatively small number of features is usually relevant for each category
• Learning methods need to be robust to very high dimensionality, and avoid overfitting to coincidental feature–category correlations in training
• Apparent advantage to methods that combine evidence from many features, vs. methods that consider relatively few (e.g. decision trees/lists)

Text Categorization with Naïve Bayes

• Consider each category independently as a class c (for the multiple-class setting)
  – Example e – a document
  – Feature f – a word or term

  score(c) = \log \frac{P(c \mid e)}{P(\sim c \mid e)}
           = \sum_{f \in e} \log \frac{P(f \mid c)}{P(f \mid \sim c)} + \log \frac{P(c)}{P(\sim c)}

  – Classify as c if score(c) > θ
• Typically a specifically tuned threshold for each class, due to the inaccuracy of the probabilistic estimate of P(e|c) with the given training statistics and independence assumption…
• …but a biased probability estimate for c may still correlate well with the classification decision

Two Feature Models

• 1st model: multivariate binomial
  – One binary feature f corresponds to each word in the (corpus) lexicon
  – f is true in a document (triggered for the example) if the word appears in it, and false otherwise
  – Parameter estimation:

    P(f \mid c) = \frac{\text{doc\_count}(f, c)}{\text{doc\_count}(c)} \qquad P(\sim f \mid c) = 1 - P(f \mid c)

  – Notice that in classification each word in the document contributes its "weight" once, regardless of its frequency
    • But according to the model, words that do not appear in the document take part in the classification too (~f)
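
A minimal sketch of the binomial (Bernoulli) parameter estimation above, with add-k smoothing as an illustrative choice; the data layout and function name are assumptions:

    from collections import Counter

    def binomial_estimates(docs_of_class, lexicon, k=1.0):
        """docs_of_class: list of documents (each an iterable of words) labeled with class c.
        Returns P(f|c) for every word f in the lexicon; P(~f|c) = 1 - P(f|c)."""
        n_docs = len(docs_of_class)                  # doc_count(c)
        doc_count = Counter()                        # doc_count(f, c)
        for doc in docs_of_class:
            for f in set(doc):                       # each word counted once per document
                doc_count[f] += 1
        # Smoothed P(f|c) = doc_count(f,c) / doc_count(c)
        return {f: (doc_count[f] + k) / (n_docs + 2 * k) for f in lexicon}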

2nd Model: Multinomial

• One multinomial feature f corresponds to each position in the document
• The feature value is the identity of the word at the corresponding position; possible values are all words in the lexicon
  – For brevity, we use f to denote both the feature and its value
• Parameter estimation:

  P(f \mid c) = \frac{\text{position\_count}(f, c)}{\text{position\_count}(c)}

• Notice that in classification each word contributes its "weight" multiplied by its frequency in the document
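
A matching sketch for the multinomial model, counting word positions rather than documents (smoothing constant and data layout are again assumed):

    from collections import Counter

    def multinomial_estimates(docs_of_class, lexicon, k=1.0):
        """docs_of_class: list of documents (each a list of word tokens) labeled with class c.
        Returns P(f|c) estimated from position counts."""
        position_count = Counter()                   # position_count(f, c)
        total_positions = 0                          # position_count(c)
        for doc in docs_of_class:
            position_count.update(doc)               # every token position counts
            total_positions += len(doc)
        return {f: (position_count[f] + k) / (total_positions + k * len(lexicon))
                for f in lexicon}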

Some Observations on the Models

• The multinomial model has the advantage of taking word frequency in the document into account, but:
  – According to experience in Information Retrieval (IR), multiplying a word "weight" by its raw frequency yields an inflated impact of frequent words in a document (multiple occurrences of a word are dependent). E.g., in IR a word weight is multiplied by the log of its frequency in the document
  – Considering frequency also boosts the misleading effect of word ambiguity: a word correlated with the category might appear frequently in a document but under an alternate sense; the binomial model "emphasizes" accumulating weight from multiple words, and it is unlikely that several words correlated with the category will occur together under alternate senses
• Both models do not distinguish well between words that really trigger the category vs. words that refer to other frequently correlated topics
⇒ Relevant language behavior should be analyzed when choosing a model; some aspects are beyond the basic model

Naïve Bayes Properties

• Very simple and efficient
  – Training: one pass over the corpus to count feature–class co-occurrences
  – Classification: linear in the number of "active" features in the example
• Not the best model, but often not much worse than more complex models
  – Often a useful quick solution; a good baseline for advanced models
• Works well when classification is triggered by multiple, roughly equally indicative, features
• Relatively robust to irrelevant features, which typically cancel each other
  – But feature selection often helps (needed)
  – Somewhat sensitive to features that correspond to different but correlated classes (whether such classes are defined or not)

Feature Selection

• Goal: enable the learning method to focus on the most informative features, either globally or per class, reducing the noise introduced by irrelevant features
• Simplest criterion: feature frequency
  – For some categorization results: 10× feature-space reduction with no accuracy loss; 100× reduction with small loss
  – Typical filters: discard features that appear in at most 1–3 docs, or have 1–5 occurrences

Feature Selection (cont.)

• More complicated selection scores are based on feature–category co-occurrence frequency
  – Computed per category, possibly obtaining a global score by sum / weighted average / max
  – Based on the feature–category contingency table:

           C    ~C
      f    a     b
     ~f    c     d

  – The same data as in actual classification by Bayes, but used to decide whether to ignore the feature altogether

Example Selection Score Functions

• Mutual information for (c_i, t_k). Recall MI (for random variables):

  I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)}

• Information gain
• Odds ratio
• Galavotti et al. (2000)
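
As one concrete instance, a sketch computing the mutual information between a feature f and a category C from the a, b, c, d contingency counts of the previous slide, treating both as binary variables (function name and interface are assumptions):

    import math

    def mutual_information(a, b, c, d):
        """a = #(f, C), b = #(f, ~C), c = #(~f, C), d = #(~f, ~C)."""
        n = a + b + c + d
        mi = 0.0
        # For each cell: p(x,y) * log[ p(x,y) / (p(x) p(y)) ]
        for joint, row, col in [(a, a + b, a + c), (b, a + b, b + d),
                                (c, c + d, a + c), (d, c + d, b + d)]:
            if joint > 0:
                mi += (joint / n) * math.log((joint * n) / (row * col))
        return mi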

Linear Classifiers

• Linear classifier:

  score(e, c) = \sum_{f \in e} s(f, e) \cdot w(f, c)

• Classify e to c if score(e, c) > θ
• s(f, e) (the variable): the "strength" of f in e (e.g. some function of f's frequency in e)
• w(f, c) (the coefficient): the weight of f in the vector representing c
• Two-dimensional case: a·x + b·y ≥ θ
• Compare s and w with unsupervised association

(Figure: positive (+) and negative (–) examples separated by a line in the two-dimensional case.)

Naive Bayes as a Linear Classifier

• In Naive Bayes, classify e as c if:

  \log \frac{P(c \mid e)}{P(\sim c \mid e)} = \sum_{f \in e} \log \frac{P(f \mid c)}{P(f \mid \sim c)} + \log \frac{P(c)}{P(\sim c)} > 0

• This is a linear classifier with:

  w(f, c) = \log \frac{P(f \mid c)}{P(f \mid \sim c)} \qquad s(f, e) = \begin{cases} 1 & f \in e \\ 0 & \text{otherwise} \end{cases}

  where the prior term \log \frac{P(c)}{P(\sim c)} plays the role of (minus) the threshold θ
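
A short sketch of this reduction, turning previously estimated Naive Bayes probabilities into linear-classifier weights; the names and dictionary layout are illustrative assumptions:

    import math

    def nb_as_linear(p_f_given_c, p_f_given_not_c, p_c):
        """Return (w, theta): per-feature weights and the decision threshold."""
        w = {f: math.log(p_f_given_c[f] / p_f_given_not_c[f]) for f in p_f_given_c}
        theta = -math.log(p_c / (1.0 - p_c))         # prior term moved to the threshold
        return w, theta

    def classify(example_features, w, theta):
        # s(f, e) = 1 for features active in the example, 0 otherwise
        return sum(w.get(f, 0.0) for f in set(example_features)) > theta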

Perceptron (Winnow): Non-Parametric Mistake-Driven Learning of w(f,c)

For a category c:
  foreach f: initialize w(f,c) (uniformly/randomly)
  do until no_errors or time_limit:
    foreach (e in training):
      compute score(e,c)
      if (score(e,c) <= teta && pos(e))      # false negative
        foreach (f in e): w(f,c) += alpha    (Winnow: w(f,c) *= alpha)
      if (score(e,c) >= teta && neg(e))      # false positive
        foreach (f in e): w(f,c) -= alpha    (Winnow: w(f,c) *= beta)

Notice: defining score(e,c) is part of "feature engineering"
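
A runnable sketch of the mistake-driven loop above, using the additive (perceptron-style) update; the data layout, parameter values, and helper names are assumptions for illustration:

    def train_perceptron(training, features, teta=0.0, alpha=0.1, max_epochs=100):
        """training: list of (feature_set, is_positive) pairs for one category c."""
        w = {f: 0.0 for f in features}               # initialize w(f,c) uniformly
        for _ in range(max_epochs):
            errors = 0
            for feats, is_pos in training:
                score = sum(w.get(f, 0.0) for f in feats)
                if score <= teta and is_pos:         # false negative: promote
                    for f in feats:
                        w[f] = w.get(f, 0.0) + alpha
                    errors += 1
                elif score >= teta and not is_pos:   # false positive: demote
                    for f in feats:
                        w[f] = w.get(f, 0.0) - alpha
                    errors += 1
            if errors == 0:                          # no mistakes on the training set
                break
        return w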

Text Categorization with Winnow

• Much work in NLP using extended versions of Winnow by Dan Roth (SNOW)
• Suitability of Winnow for NLP:
  – High dimensionality, sparse data and target vector
    • Certain theoretical advantages over Perceptron (and some vice versa); irrelevant features diminish faster
  – Robustness to noise
  – Non-parametric and no independence assumptions; the mistake-driven approach is sensitive to dependencies
  – Finds a good approximate separator when a perfect linear separator doesn't exist
  – Can track changes over time
• Categorization: Dagan, Karov, Roth (1997)

Balanced Winnow: Negative Weights

• Maintain a positive weight (w+) and a negative weight (w-) for each feature:
  w(f,c) = w+(f,c) - w-(f,c)
• Modify the algorithm (with alpha > 1 > beta > 0):

  if (score(doc,C) < teta && pos(doc))     # false negative: promote
    foreach (f in doc): w+(f,C) *= alpha; w-(f,C) *= beta
  if (score(doc,C) > teta && neg(doc))     # false positive: demote
    foreach (f in doc): w+(f,C) *= beta;  w-(f,C) *= alpha

• Initialization: on average, the initial score is close to teta
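
A compact sketch of the balanced update, assuming multiplicative parameters alpha > 1 > beta > 0 and the same (assumed) data layout as the earlier perceptron sketch; initial weight values are illustrative:

    from collections import defaultdict

    def train_balanced_winnow(training, teta, alpha=1.5, beta=0.5, max_epochs=100):
        """training: list of (feature_set, is_positive) pairs; w(f,c) = w+(f,c) - w-(f,c)."""
        w_pos = defaultdict(lambda: 2.0)             # illustrative initial positive weights
        w_neg = defaultdict(lambda: 1.0)             # illustrative initial negative weights
        for _ in range(max_epochs):
            errors = 0
            for feats, is_pos in training:
                score = sum(w_pos[f] - w_neg[f] for f in feats)
                if score < teta and is_pos:          # false negative: promote
                    for f in feats:
                        w_pos[f] *= alpha
                        w_neg[f] *= beta
                    errors += 1
                elif score > teta and not is_pos:    # false positive: demote
                    for f in feats:
                        w_pos[f] *= beta
                        w_neg[f] *= alpha
                    errors += 1
            if errors == 0:
                break
        return w_pos, w_neg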

Experimental Results

• Major problem with positive Winnow – variation in document length
• Negative features – mostly small values for irrelevant features, along with small positive values; sometimes significant negative values for features that indicate negative classification in documents that do include positive features (some "disambiguation" effect)

Length Normalization

• Problem: due to example length variation, a "long" example may get a high score when there are many active, low-weight features
  – This affects positive Winnow; in Balanced Winnow and Perceptron small negative weights cancel out
• Length normalization:

  s'(f, e) = \frac{s(f, e)}{\sum_{f' \in e} s(f', e)}

• Initialize w(f,c) to θ
  – "Indifferent" features – w(f,c) remains close to θ
  – "Negative" features – w(f,c) becomes smaller than θ
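
A one-function sketch of this normalization of feature strengths within an example (dict layout assumed):

    def normalize_strengths(s):
        """s: dict mapping active feature -> raw strength s(f, e).
        Returns strengths rescaled to sum to 1 over the example."""
        total = sum(s.values())
        return {f: v / total for f, v in s.items()} if total else s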

Feature Repetition

• "Burstiness" of word and term occurrences • Repetition of a feature often indicates high relevance for the context - suggests higher

s(f,e)

• For multiple classes - a repeated feature may be indicative for only one class, therefore repetition should not inflate the strength too much • Possible alternatives (common in IR): – – –

s(f,e)

= 1

or

0 (active/not-active)

s(f,e)

=

freq(f,e) s(f,e)

= sub-linear function of

freq(f,e)

(sqrt, log+1) 28
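
The three alternatives as tiny Python functions (names are assumptions):

    import math

    def s_binary(freq):      # active / not active
        return 1.0 if freq > 0 else 0.0

    def s_raw(freq):         # raw frequency
        return float(freq)

    def s_sublinear(freq):   # sub-linear: dampens repeated occurrences
        return math.log(freq) + 1.0 if freq > 0 else 0.0   # or math.sqrt(freq)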

Learning a Threshold Range

• Instead of searching for a linear separator that separates positive and negative examples, search for a separating thick hyper-plane, and then set the separating line in the middle of it (cf. support-vector machines)
• Implementation: use teta+ and teta- while training
  – The algorithm classifies a document as positive if: score > teta+
  – The algorithm classifies a document as negative if: score < teta-
  – Otherwise (teta- < score < teta+): always consider it a classification error

Incremental Feature Filtering

• The algorithms can tolerate a large number of features
• However: each class usually depends on a relatively small number of features (sparseness)
• A desired goal: discard non-indicative features
  – Space and time efficiency
  – Comprehensibility of class profiles and classifications
  – May improve results due to noise reduction
• Implementation: during training, filter out features whose weight remains close to the initialization weight

Comparing Results with Other Methods

• Optimal performance: balanced, square-root feature strength, threshold range and feature filtering
• Results for the Reuters-22173 test collection

Winnow Categorization – Conclusions

• Need to adapt the basic model to additional characteristics of textual language data
• Showed the effectiveness of augmented Winnow and its suitability for text
  – High dimensionality, irrelevant features, some feature dependency (positive & negative)
  – No need for feature selection (but it may help sometimes)
• Today, more complex learning methods such as SVM outperform these reported results in text categorization, but Winnow is still a viable option

Other Classification Approaches

• Decision trees
  – Test the "strongest" feature first, then, according to the test result, test the currently "strongest" feature
  – Compare with decision lists – split vs. full data
• (K-)Nearest Neighbor
  – A memory-based approach
  – For a given test example, find the (K) most "similar" examples in training and classify the new example accordingly (weighted majority)

Decisions by Single vs. Multiple Features

• Local vs. global decisions
• Using multiple pieces of evidence in parallel is the method of choice in many more tasks
• It may not be optimal for language processing – how should hard vs. soft decisions be made for definite vs. quantitative phenomena in language?
• Often, problems are quite local
• Conjecture: it might be possible to use a more "symbolic" model for the clear cases that it can cover, and a "softer" model where multiple pieces of weak evidence are required
  – Example problem: text categorization – multiple, weakly correlated context evidence, with no real triggers
• It is easier to analyze errors in the "symbolic" cases