Transcript: Untitled Slides - University of Haifa
Classifying by Multiple Features: Naive Bayes
• e: the example (instance) to be classified – a word occurrence, a document
• c: a class, among all possible classes C – a word sense, a document category
• f: a feature – a context word, a document word
Bayes Decision Rule
• Among alternative classes, select c such that P(c|e) is maximal
• Minimizes the probability of error
• Likelihood ratio view: P(c|e) / P(~c|e) – select c for which the ratio is maximal
• Binary classification: classify c if the ratio > 1, otherwise ~c
• Multiple (0 or more) classes per example: consider each class as a binary classification (usually with a threshold, due to the imprecise estimate)
• Easier to estimate P(e|c) (generative model), therefore use Bayes rule
Conditional Probability and Bayes Rule
Conditional probability: the probability of event A given that event B has happened:
  Pr[A|B] = Pr[A,B] / Pr[B]
Bayes rule:
  Pr[A|B] = Pr[B|A] Pr[A] / Pr[B]
[Figure: diagram of events A and B]
Bayes and Independence Assumption
P(c|e) / P(~c|e) = [P(e|c) / P(e|~c)] · [P(c) / P(~c)]          (Bayes rule)
                 = Π_{f∈e} [P(f|c) / P(f|~c)] · [P(c) / P(~c)]  (independence assumption)
The left-hand side is the posterior odds; P(c) / P(~c) is the prior odds.
Log-Likelihood Ratio
• Computational convenience – avoids underflow:
  score(c) = log [P(c|e) / P(~c|e)] = Σ_{f∈e} log [P(f|c) / P(f|~c)] + log [P(c) / P(~c)]
• Estimate the probabilities from a training corpus, by (smoothed) count ratios
• log [P(f|c) / P(f|~c)]: the “entailment score” of f for c
• Working with this ratio makes each feature’s “weight” within the score explicit, compared to score(c) = Σ_{f∈e} log P(f|c) + log P(c) – log P(f|c) by itself is large for frequent f’s, regardless of their discrimination
• Was applied to many classification tasks
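As an illustration (not from the original slides), here is a minimal Python sketch of the log-likelihood-ratio score above; the per-feature log ratios and the prior log-odds are assumed to have been estimated beforehand, and all names are hypothetical.

import math

def loglikelihood_score(example_features, feature_logratio, prior_logodds):
    """score(c) = sum over active features of log P(f|c)/P(f|~c), plus log P(c)/P(~c).

    feature_logratio: dict mapping feature -> log P(f|c)/P(f|~c) (the "entailment score")
    prior_logodds:    log P(c)/P(~c)
    Features unseen in training are simply skipped here (one possible smoothing choice).
    """
    score = prior_logodds
    for f in example_features:
        score += feature_logratio.get(f, 0.0)
    return score

# hypothetical toy usage
ratios = {"cocaine": 2.3, "prescription": -1.7}
print(loglikelihood_score(["cocaine", "abuse"], ratios, prior_logodds=math.log(0.3 / 0.7)))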
Word Sense Disambiguation by Naïve Bayes
• Each ambiguous word w is a classification problem
• e: the example (instance) to be classified – an occurrence of the ambiguous word
• c: a class, among all possible classes C – a word sense, among all listed senses
• f: a feature – a context word or phrase, in the near or broad context, possibly within a syntactic relationship
Estimating Probabilities
• Assume a sense-labeled training corpus
• Apply some smoothing to avoid zero counts:
  P(f|c) = freq(f,c) / freq(c)
  P(c) = freq(c) / freq(w)
• freq(c) – sum over all context positions
• Context features (words) that tend to occur mostly with a specific sense, and not with the others, contribute high values to its accumulative score
• Gale, Church and Yarowsky (1992) report 90% accuracy for 6 ambiguous nouns
  – Combining weaker evidence from the broad context, vs. the stronger collocations used in decisions based on a single feature
Examples for Significant Features
• Senses of drug (Gale et al. 1992):
  – ‘medication’: prices, prescription, patent, increase, consumer, pharmaceutical
  – ‘illegal substance’: abuse, paraphernalia, illicit, alcohol, cocaine, traffickers
Text Categorization
• A document is labeled by its “major” categories (event type, entities, geography) – typically the major topics discussed in the document
• “Controlled vocabulary” – categories taken from a canonical pre-defined list (hierarchy/taxonomy) – to be distinguished from free-text terms
  – Used for consistent searching and browsing
• Document features: terms, weighted by frequency (mostly), position, …
• Category: typically represented by triggering rules or feature weights, applied to test documents
Types of Text Categorization Tasks
• Topical (multiple classes per document)
  – Broad: finance, politics, sports
  – Detailed: investment, sale, merger
• Genre: news, contract, movie review, editorial
• Authorship (style-based): individual, gender
• Attitude: positive, negative
• Relevance
  – Generic: spam, junk mail, pornography
  – Personal interests
• Applications: filtering, routing, search, browsing
Categorization Approaches (with Some Industrial Perspective)
• Manual labeling by authors/editors
  – Still the most common approach in practice
• Supervised automatic classification
  – Manual encoding of classification rules (instead of training) – requires special expertise
  – Supervised machine learning
    • Training documents are labeled manually (cost!)
    • Requires (somewhat) less expertise
  – Combined
    • Users can modify the learned “logic” (rules, weights)
    • “Bootstrapping” of classification “logic” (no labeling)
Features in Text Categorization
• Feature space dimensionality is very high – each word, possibly word combinations
  – Vs. many other learning tasks
• Only a relatively small number of features is usually relevant for each category
• Learning methods need to be robust to very high dimensionality, and avoid overfitting to coincidental feature–category correlations in training
• Apparent advantage for methods that combine evidence from many features, vs. methods that consider relatively few (e.g. decision trees/lists)
Text Categorization with Naïve Bayes
• Consider each category independently as a class c (for the multiple-class setting)
  – Example e – a document
  – Feature f – a word or term
  score(c) = log [P(c|e) / P(~c|e)] = Σ_{f∈e} log [P(f|c) / P(f|~c)] + log [P(c) / P(~c)]
  – Classify as c if score(c) > θ
• Typically a specifically tuned threshold for each class, due to the inaccuracy of the probabilistic estimate of P(e|c) with the given training statistics and independence assumption
• .. but a biased probability estimate for c may still correlate well with the classification decision
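A small sketch of the multi-label decision rule above: each category is scored independently and compared against its own tuned threshold. The data structures are assumed, not taken from the slides.

def categorize(doc_features, class_models, thresholds):
    """class_models: dict class -> (feature_logratio, prior_logodds);
    thresholds: dict class -> tuned threshold. Each class is an independent binary decision."""
    assigned = []
    for c, (logratio, prior) in class_models.items():
        score = prior + sum(logratio.get(f, 0.0) for f in doc_features)
        if score > thresholds[c]:                 # classify as c if score(c) > threshold
            assigned.append(c)
    return assigned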
Two Feature Models
• 1st model: multivariate binomial
  – One binary feature f corresponds to each word in the (corpus) lexicon
  – f is true in a document (triggered for the example) if the word appears in it, and false otherwise
  – Parameter estimation:
    P(f|c) = doc_count(f,c) / doc_count(c)
    P(~f|c) = 1 − P(f|c)
  – Notice that in classification each word in the document contributes its “weight” once, regardless of its frequency
    • But according to the model, words that do not appear in the document take part in the classification too (~f)
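A sketch of the multivariate binomial estimates from document counts; the add-one style smoothing is my own assumption, and P(~f|c) is obtained as 1 − P(f|c), as on the slide.

def bernoulli_estimates(docs_in_class, lexicon, smoothing=1.0):
    """docs_in_class: documents of class c, each given as a set of words.
    Returns smoothed P(f|c) = doc_count(f,c) / doc_count(c) for every lexicon word."""
    n_docs = len(docs_in_class)
    p_f = {}
    for f in lexicon:
        df = sum(1 for doc in docs_in_class if f in doc)    # doc_count(f,c)
        p_f[f] = (df + smoothing) / (n_docs + 2 * smoothing)
    return p_f                                              # P(~f|c) = 1 - p_f[f]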
2nd Model: Multinomial
• One multinomial feature f corresponds to each position in the document
• The feature value is the identity of the word at the corresponding position; the possible values are all words in the lexicon
  – For brevity, we use f to denote both the feature and its value
• Parameter estimation:
  P(f|c) = position_count(f,c) / position_count(c)
• Notice that in classification each word contributes its “weight” multiplied by its frequency in the document
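A corresponding sketch of the multinomial estimates over token positions, again with an assumed add-one smoothing.

from collections import Counter

def multinomial_estimates(docs_in_class, lexicon, smoothing=1.0):
    """docs_in_class: documents of class c, each given as a list of word tokens (positions).
    Returns smoothed P(f|c) = position_count(f,c) / position_count(c)."""
    counts = Counter(tok for doc in docs_in_class for tok in doc)
    total_positions = sum(counts.values())                  # position_count(c)
    denom = total_positions + smoothing * len(lexicon)
    return {f: (counts[f] + smoothing) / denom for f in lexicon}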
Some Observations on the Models
• The multinomial model has the advantage of taking word frequency in the document into account, but:
  – According to experience in Information Retrieval (IR), multiplying a word “weight” by its raw frequency yields an inflated impact of frequent words in a document (multiple occurrences of a word are dependent). E.g., in IR a word weight is multiplied by the log of its frequency in the document
  – Considering frequency boosts the misleading effect of word ambiguity: a word correlated with the category might appear frequently in a document but under an alternate sense; the binomial model “emphasizes” accumulating weight from multiple words, and it is unlikely that several words that are correlated with the category will occur together under alternate senses.
• Both models do not distinguish well between words that really trigger the category vs. words that refer to other frequently correlated topics
Relevant language behavior should be analyzed when choosing a model; some aspects are beyond the basic model
Naïve Bayes Properties
• Very simple and efficient
  – Training: one pass over the corpus to count feature–class co-occurrences
  – Classification: linear in the number of “active” features in the example
• Not the best model, but often not much worse than more complex models
  – Often a useful quick solution; a good baseline for advanced models
• Works well when classification is triggered by multiple, roughly equally indicative, features
• Relatively robust to irrelevant features, which typically cancel each other out
  – But feature selection often helps (or is needed)
  – Somewhat sensitive to features that correspond to different but correlated classes (whether such classes are defined or not)
Feature Selection
• Goal: enable the learning method to focus on the most informative features, either globally or per class, reducing the noise introduced by irrelevant features
• Simplest criterion: feature frequency
  – For some categorization settings: a 10-fold feature space reduction with no accuracy loss; a 100-fold reduction with a small loss
  – Typical filters: drop features appearing in at most 1–3 docs, or with at most 1–5 occurrences
Feature Selection (cont.)
• More complicated selection scores are based on feature–category co-occurrence frequency
  – Computed per category, possibly obtaining a global score by sum / weighted average / max

         C    ~C
    f    a    b
   ~f    c    d

  – The same data as in actual classification by Bayes, but used to decide whether to ignore the feature altogether
Example Selection Score Functions
• Mutual information for c_i, t_k. Recall MI (for random variables):
  I(X;Y) = Σ_{x} Σ_{y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
• Information gain
• Odds ratio
• Galavotti et al. (2000)
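A sketch of the mutual-information selection score computed directly from the 2x2 contingency counts (a, b, c, d) of the previous slide, following the I(X;Y) definition above; the names are my own.

import math

def mutual_information(a, b, c, d):
    """MI between a binary feature f and a category C from the contingency counts
    a = #(f,C), b = #(f,~C), c = #(~f,C), d = #(~f,~C)."""
    n = a + b + c + d
    mi = 0.0
    # each cell with its row marginal (feature) and column marginal (category)
    for n_cell, n_row, n_col in [(a, a + b, a + c), (b, a + b, b + d),
                                 (c, c + d, a + c), (d, c + d, b + d)]:
        if n_cell > 0:
            mi += (n_cell / n) * math.log(n_cell * n / (n_row * n_col))
    return mi

# per-category feature scores could then be combined by sum / weighted average / max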
Linear Classifiers
• Linear classifier:
  score(e,c) = Σ_{f∈e} s(f,e) · w(f,c)
• Classify e to c if score(e,c) > θ
• s(f,e) (the variable): the “strength” of f in e (e.g. some function of f’s frequency in e)
• w(f,c) (the coefficient): the weight of f in the vector representing c
• Two-dimensional case: a·x + b·y
• Compare s and w with unsupervised association
[Figure: two-dimensional example – points labeled + and – separated by a line]
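A minimal sketch of the generic linear score; the feature-strength and weight dictionaries are assumed inputs.

def linear_score(example, weights):
    """score(e,c) = sum over active features f of s(f,e) * w(f,c).
    example: dict feature -> strength s(f,e); weights: dict feature -> w(f,c)."""
    return sum(s * weights.get(f, 0.0) for f, s in example.items())

def classify(example, weights, theta):
    return linear_score(example, weights) > theta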
Naive Bayes as a Linear Classifier
• In Naive Bayes – classify e as c if:
  log [P(c|e) / P(~c|e)] = Σ_{f∈e} log [P(f|c) / P(f|~c)] + log [P(c) / P(~c)] > 0
• This is a linear classifier with:
  w(f,c) = log [P(f|c) / P(f|~c)]
  s(f,e) = 1 if f ∈ e, 0 otherwise
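A short sketch showing how estimated Naive Bayes probabilities can be packaged as linear-classifier weights plus a bias (the prior log-odds); the inputs are assumed to be smoothed estimates.

import math

def naive_bayes_weights(p_f_given_c, p_f_given_notc, prior_c):
    """w(f,c) = log P(f|c)/P(f|~c); the prior log-odds serves as a bias term,
    so e is classified as c when bias + sum of weights of its active features > 0."""
    weights = {f: math.log(p_f_given_c[f] / p_f_given_notc[f]) for f in p_f_given_c}
    bias = math.log(prior_c / (1.0 - prior_c))
    return weights, bias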
Perceptron (Winnow): Non-Parametric, Mistake-Driven Learning of w(f,c)

For a category c:
  foreach f: initialize w(f,c) (uniformly/randomly)
  do until no_errors or time_limit:
    foreach e in training:
      compute score(e,c)
      if score(e,c) <= teta && pos(e):     # false negative
        foreach f in e: w(f,c) += alpha    (Winnow: w(f,c) *= alpha)
      if score(e,c) >= teta && neg(e):     # false positive
        foreach f in e: w(f,c) -= alpha    (Winnow: w(f,c) *= beta)

Notice: defining score(e,c) is part of “feature engineering”
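A runnable sketch of the multiplicative (Winnow) variant of this loop for binary features; the parameter values and data representation are assumptions, not taken from the slides.

def train_winnow(training, features, theta=1.0, alpha=2.0, beta=0.5, epochs=20):
    """training: list of (active_feature_set, is_positive) pairs for one category c.
    Promote weights on false negatives, demote them on false positives."""
    w = {f: 1.0 for f in features}                          # uniform initialization
    for _ in range(epochs):
        errors = 0
        for active, is_positive in training:
            score = sum(w[f] for f in active if f in w)
            if score <= theta and is_positive:              # false negative: promote
                for f in active:
                    if f in w:
                        w[f] *= alpha
                errors += 1
            elif score >= theta and not is_positive:        # false positive: demote
                for f in active:
                    if f in w:
                        w[f] *= beta
                errors += 1
        if errors == 0:                                     # stop when no mistakes remain
            break
    return w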
Text Categorization with Winnow
• Much work in NLP using extended versions of Winnow by Dan Roth (SNOW)
• Suitability of Winnow for NLP:
  – High dimensionality, sparse data and target vector
    • Certain theoretical advantages over Perceptron (and some vice versa); irrelevant features diminish faster
  – Robustness to noise
  – Non-parametric and no independence assumptions; the mistake-driven approach is sensitive to dependencies
  – Finds a good approximate separator when a perfect linear separator doesn’t exist
  – Can track changes over time
• Categorization: Dagan, Karov, Roth (1997)
Balanced Winnow: Negative Weights
• Maintain a positive weight (w+) and a negative weight (w−) for each feature:
  w(f,c) = w+(f,c) − w−(f,c)
• Modify the algorithm:
  if score(doc,c) < teta && pos(doc):    # false negative
    foreach f in doc: w+(f,c) *= alpha, w−(f,c) *= beta
  if score(doc,c) > teta && neg(doc):    # false positive
    foreach f in doc: w+(f,c) *= beta, w−(f,c) *= alpha
• Initialization: on average, the initial score is close to teta
Experimental Results
• Major problem with positive Winnow – variation in document length
• Negative features – mostly small values for irrelevant features, along with small positive values; sometimes significant negative values for features that indicate negative classification in documents that do include positive features (some “disambiguation” effect)
Length Normalization
• Problem: due to example length variation, a “long” example may get a high score when there are many active, low-weight features
  – For positive Winnow; in Balanced Winnow and Perceptron small negative weights cancel out
• Length normalization:
  s'(f,e) = s(f,e) / Σ_{f∈e} s(f,e)
• Initialize w(f,c) to teta
• “Indifferent” features – w(f,c) remains close to teta
• “Negative” features – w(f,c) becomes smaller than teta
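A one-function sketch of the length normalization above; the example is assumed to be a dict of raw feature strengths.

def normalize_strengths(example):
    """s'(f,e) = s(f,e) / sum of all strengths in e, so every example
    contributes the same total strength regardless of its length."""
    total = sum(example.values())
    return {f: s / total for f, s in example.items()} if total else example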
Feature Repetition
• "Burstiness" of word and term occurrences • Repetition of a feature often indicates high relevance for the context - suggests higher
s(f,e)
• For multiple classes - a repeated feature may be indicative for only one class, therefore repetition should not inflate the strength too much • Possible alternatives (common in IR): – – –
s(f,e)
= 1
or
0 (active/not-active)
s(f,e)
=
freq(f,e) s(f,e)
= sub-linear function of
freq(f,e)
(sqrt, log+1) 28
Learning a Threshold Range
• Instead of searching for a line that separates positive and negative examples, search for a separating thick hyper-plane, and then set the separating line in the middle of that hyper-plane (cf. support-vector machines)
• Implementation:
  – Use teta+ and teta− while training
  – The algorithm classifies a document as positive if: score > teta+
  – The algorithm classifies a document as negative if: score < teta−
  – Otherwise (teta− < score < teta+): always consider it a classification error
Incremental Feature Filtering
• The algorithms can tolerate a large number of features
• However: each class usually depends on a relatively small number of features (sparseness)
• A desired goal: discard non-indicative features
  – Space and time efficiency
  – Comprehensibility of class profiles and classifications
  – May improve results due to noise reduction
• Implementation: during training, filter out features whose weight remains close to the initialization weight
Comparing Results with Other Methods
• Optimal performance: balanced, square-root feature strength, threshold range and feature filtering.
• Results for the Reuters-22173 test collection
Winnow Categorization – Conclusions
• Need to adapt the basic model to additional characteristics of textual language data
• Showed the augmented Winnow’s effectiveness and suitability for texts
  – High dimensionality, irrelevant features, some feature dependency (positive & negative)
  – No need for feature selection (but it may help sometimes)
• Today, more complex learning methods such as SVM outperform these reported results in text categorization, but Winnow is still a viable option
Other Classification Approaches
• Decision trees
  – Test the “strongest” feature first, then according to the test result test the currently “strongest” feature
  – Compare with decision lists – split vs. full data
• (K-)Nearest Neighbor
  – A memory-based approach
  – For a given test example, find the (K) most “similar” examples in training and classify the new example accordingly (weighted majority)
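A small sketch of the (K-)Nearest Neighbor idea, with a simple feature-overlap similarity as a stand-in (any similarity measure could be plugged in); all names are illustrative.

def knn_classify(test_features, training, k=5, similarity=None):
    """Memory-based classification: find the k training examples most similar to the
    test example and take a similarity-weighted majority vote over their labels.
    training: list of (feature_set, label) pairs; test_features: a set of features."""
    if similarity is None:
        similarity = lambda a, b: len(a & b)                # overlap of feature sets
    neighbors = sorted(training, key=lambda ex: similarity(test_features, ex[0]),
                       reverse=True)[:k]
    votes = {}
    for feats, label in neighbors:
        votes[label] = votes.get(label, 0) + similarity(test_features, feats)
    return max(votes, key=votes.get) if votes else None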
Decisions by Single vs. Multiple Features
• Local vs. global decisions
• Using multiple pieces of evidence in parallel is the method of choice in many tasks
• May not be optimal for language processing – how should hard vs. soft decisions be made for definite vs. quantitative phenomena in language?
• Often, problems are quite local
• Conjecture: it might be possible to use a more “symbolic” model for the clear cases that it can cover, and a “softer” model where multiple weak pieces of evidence are required
  – Example problem: text categorization – multiple (weakly correlated) context evidence, with no real triggers
• Easier to analyze errors in the “symbolic” cases