Introduction to Machine Learning
Lecture Slides for
INTRODUCTION TO
Machine Learning
ETHEM ALPAYDIN
© The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
Outline
Discriminant Function
Learning Association Rules
Naïve Bayes Classifier
Example: Play Tennis
Relevant Issues
Conclusions
What Is a Discriminant Function?
For a K-class classification problem, define for each class Ci a function gi(x), i = 1, ..., K, and choose Ci if
gi(x) = max_k gk(x)
Discriminant Functions
Choose Ci if gi(x) = max_k gk(x), where the discriminant functions gi(x), i = 1, ..., K, can be defined as:
gi(x) = −R(αi|x)
gi(x) = P(Ci|x)
gi(x) = p(x|Ci) P(Ci)
The discriminants divide the feature space into K decision regions R1, ..., RK, where
Ri = {x | gi(x) = max_k gk(x)}
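As an illustration, here is a minimal Python sketch of classification by discriminants, assuming gi(x) = p(x|Ci) P(Ci) with one-dimensional Gaussian class-conditional densities; the priors, means, and standard deviations are made-up values, not from the text.

import numpy as np
from scipy.stats import norm

priors = [0.5, 0.3, 0.2]                        # P(Ci), assumed for illustration
means, stds = [0.0, 2.0, 5.0], [1.0, 1.0, 2.0]  # hypothetical class densities

def discriminants(x):
    # gi(x) = p(x|Ci) P(Ci) for each class i
    return [norm.pdf(x, m, s) * p for m, s, p in zip(means, stds, priors)]

def classify(x):
    # choose Ci such that gi(x) = max_k gk(x)
    return int(np.argmax(discriminants(x)))

print(classify(1.2))  # index of the class with the largest discriminant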
K=2 Classes
Dichotomizer (K=2) vs Polychotomizer (K>2)
Define a single discriminant g(x) = g1(x) − g2(x) and
choose C1 if g(x) > 0, C2 otherwise
A convenient choice for g(x) is the log odds:
g(x) = log [ P(C1|x) / P(C2|x) ]
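A two-line sketch of the dichotomizer decision via the log odds, using hypothetical posterior values for a single input x:

import math

p_c1, p_c2 = 0.7, 0.3           # P(C1|x), P(C2|x): assumed values
g = math.log(p_c1 / p_c2)       # log odds
print("C1" if g > 0 else "C2")  # positive log odds -> choose C1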
Problem: Association Rule Mining
INPUT
A set of transactions
Objective:
Given a set of transactions D, generate all
association rules that have support and
confidence greater than the user-specified
minimum support and minimum confidence.
Minimize computation time by pruning.
Constraints:
Items should be in lexicographical order
TID   Transaction
1     {bread, milk, beer, diapers}
2     {beer, apples, diapers}
3     {diapers, milk, beer}
4     {beer, apples, diapers}
5     {milk, bread, chocolate}
Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Real-World Applications
NCR (Teradata) does association rule mining for more than 20 large retail organizations, including Walmart.
Also used for pattern discovery in biological databases.
Association Rules
Association rule: X → Y
Support(X → Y):
P(X, Y) = (# customers who bought X and Y) / (# customers)
Confidence(X → Y):
P(Y|X) = P(X, Y) / P(X) = (# customers who bought X and Y) / (# customers who bought X)
Apriori algorithm (Agrawal et al., 1996)
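To make these definitions concrete, here is a minimal Python sketch that computes support and confidence over the toy transaction database above; the rule being tested is chosen for illustration.

transactions = [
    {"bread", "milk", "beer", "diapers"},
    {"beer", "apples", "diapers"},
    {"diapers", "milk", "beer"},
    {"beer", "apples", "diapers"},
    {"milk", "bread", "chocolate"},
]

def support(itemset):
    # P(itemset): fraction of transactions containing every item
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # P(Y|X) = support(X ∪ Y) / support(X)
    return support(X | Y) / support(X)

print(support({"diapers", "beer"}))       # 0.8
print(confidence({"diapers"}, {"beer"}))  # 1.0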
Apriori Algorithm:
Breadth First Search
[Figure: the itemset lattice searched breadth-first, level by level, from the empty set {} through the 1-itemsets a, b, c, d to larger candidates such as ab, ad, bd, and abd]
Apriori Algorithm Examples
Problem Decomposition
Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt
If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.
Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%
If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
{Shoes} → {Jacket} (confidence: 50%/75% ≈ 66.7%)
{Jacket} → {Shoes} (confidence: 50%/50% = 100%)
The Apriori Algorithm — Example
Min support = 50% (i.e., an itemset must appear in at least 2 of the 4 transactions)

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to obtain C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

Prune {4} (sup. < 2) to obtain L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Generate candidates C2 from L1:
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

Prune to obtain L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Generate C3 from L2: {2 3 5}

Scan D to obtain L3:
itemset   sup
{2 3 5}   2
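A compact Python sketch of this level-wise search, under the simplifying assumption that candidate generation just joins frequent itemsets (a full Apriori implementation would also prune candidates that have an infrequent subset):

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2  # 50% of 4 transactions

def count(itemset):
    # support count: number of transactions containing the itemset
    return sum(itemset <= t for t in D)

# L1: frequent 1-itemsets
items = sorted({i for t in D for i in t})
L = [frozenset([i]) for i in items if count(frozenset([i])) >= min_sup]

k = 2
while L:
    print([(sorted(s), count(s)) for s in L])   # prints L1, L2, L3, ...
    # join step: unions of frequent (k-1)-itemsets that have size k
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    L = [c for c in candidates if count(c) >= min_sup]
    k += 1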
Naive Bayes’ Classifier
Given C, xj are independent:
p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
Background
There are three approaches to building a classifier:
a) Model a classification rule directly
   Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class membership given the input data
   Example: multi-layered perceptron with the cross-entropy cost
c) Build a probabilistic model of the data within each class
   Examples: naive Bayes, model-based classifiers
a) and b) are examples of discriminative classification, while c) is an example of generative classification; b) and c) are both examples of probabilistic classification.
Probability Basics
Prior, conditional, and joint probability:
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1) P(X1) = P(X1|X2) P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1) P(X2)
Bayesian Rule
Posterior = (Likelihood × Prior) / Evidence:
P(C|X) = P(X|C) P(C) / P(X)
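A one-line numeric check of the rule, with made-up values for the likelihood, prior, and evidence:

likelihood, prior, evidence = 0.6, 0.3, 0.45  # hypothetical values
posterior = likelihood * prior / evidence     # P(C|X) = P(X|C) P(C) / P(X)
print(posterior)                              # 0.4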
Probabilistic Classification
Establishing a probabilistic model for classification:
– Discriminative model: P(C|X), where C = c1, ..., cL and X = (X1, ..., Xn)
– Generative model: P(X|C), where C = c1, ..., cL and X = (X1, ..., Xn)
MAP classification rule (MAP: Maximum A Posteriori):
– Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, ..., cL
Generative classification with the MAP rule:
– Apply the Bayesian rule to convert:
P(C|X) = P(X|C) P(C) / P(X) ∝ P(X|C) P(C)
Naïve Bayes
Bayes classification:
P(C|X) ∝ P(X|C) P(C) = P(X1, ..., Xn|C) P(C)
Difficulty: learning the joint probability P(X1, ..., Xn|C)
Naïve Bayes classification:
– Make the assumption that all input attributes are conditionally independent given the class:
P(X1, X2, ..., Xn|C) = P(X1|X2, ..., Xn; C) P(X2, ..., Xn|C)
                     = P(X1|C) P(X2, ..., Xn|C)
                     = P(X1|C) P(X2|C) ··· P(Xn|C)
– MAP classification rule: assign x = (x1, ..., xn) to c* if
[P(x1|c*) ··· P(xn|c*)] P(c*) > [P(x1|c) ··· P(xn|c)] P(c), c ≠ c*, c = c1, ..., cL
Naïve Bayes
Naïve Bayes algorithm (for discrete input attributes):
– Learning Phase: Given a training set S,
For each target value ci (ci = c1, ..., cL):
    P̂(C = ci) ← estimate P(C = ci) with examples in S
For every attribute value ajk of each attribute xj (j = 1, ..., n; k = 1, ..., Nj):
    P̂(Xj = ajk|C = ci) ← estimate P(Xj = ajk|C = ci) with examples in S
Output: conditional probability tables; for each xj, Nj × L elements
– Test Phase: Given an unknown instance X' = (a'1, ..., a'n),
look up the tables to assign the label c* to X' if
[P̂(a'1|c*) ··· P̂(a'n|c*)] P̂(c*) > [P̂(a'1|c) ··· P̂(a'n|c)] P̂(c), c ≠ c*, c = c1, ..., cL
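A minimal Python sketch of both phases for discrete attributes; the count-based data structures are one reasonable choice of representation, not the book's:

from collections import Counter, defaultdict

def train(S):
    # Learning phase. S: list of (attribute_dict, label) pairs.
    labels = [c for _, c in S]
    prior = {c: n / len(S) for c, n in Counter(labels).items()}  # P̂(C = ci)
    counts = defaultdict(Counter)  # counts[(attr, label)][value]
    for x, c in S:
        for attr, val in x.items():
            counts[(attr, c)][val] += 1
    return prior, counts

def predict(x, prior, counts):
    # Test phase, MAP rule: pick c maximizing P̂(c) * prod_j P̂(xj|c).
    def score(c):
        s = prior[c]
        for attr, val in x.items():
            n_c = sum(counts[(attr, c)].values())
            s *= counts[(attr, c)][val] / n_c if n_c else 0.0
        return s
    return max(prior, key=score)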
Example: Play Tennis
[Table: the 14-example Play Tennis training set, with attributes Outlook, Temperature, Humidity, Wind and class label Play]
Example: Learning Phase

Outlook      Play=Yes   Play=No
Sunny        2/9        3/5
Overcast     4/9        0/5
Rain         3/9        2/5

Temperature  Play=Yes   Play=No
Hot          2/9        2/5
Mild         4/9        2/5
Cool         3/9        1/5

Humidity     Play=Yes   Play=No
High         3/9        4/5
Normal       6/9        1/5

Wind         Play=Yes   Play=No
Strong       3/9        3/5
Weak         6/9        2/5

P(Play=Yes) = 9/14      P(Play=No) = 5/14
Example: Test Phase

Given a new instance,
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9       P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9    P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9       P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9         P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                    P(Play=No) = 5/14

MAP rule:
P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Since P(Yes|x') < P(No|x'), we label x' as "No".
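The arithmetic behind the two scores can be checked in a couple of lines (these are unnormalized posteriors, so they need not sum to 1):

p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)  # [P(Sunny|Yes)···] P(Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)  # [P(Sunny|No)···] P(No)
print(round(p_yes, 4), round(p_no, 4))          # 0.0053 0.0206 -> "No"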
Relevant Issues
Violation of the independence assumption:
– For many real-world tasks, P(X1, ..., Xn|C) ≠ P(X1|C) ··· P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
Zero conditional probability problem:
– If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
– In this circumstance, P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 during test
– As a remedy, conditional probabilities can be estimated with the m-estimate:
P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)
where
nc: number of training examples for which Xj = ajk and C = ci
n: number of training examples for which C = ci
p: prior estimate (usually p = 1/t for t possible values of Xj)
m: weight given to the prior (number of "virtual" examples, m ≥ 1)
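A tiny sketch of the m-estimate, with hypothetical counts:

def m_estimate(n_c, n, t, m=1):
    # P̂(Xj = ajk|C = ci) = (n_c + m*p) / (n + m), with p = 1/t
    p = 1 / t
    return (n_c + m * p) / (n + m)

# Even an unseen value (n_c = 0) now gets a small nonzero probability:
print(m_estimate(0, 9, 3))  # ~0.033 instead of 0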
Relevant Issues
Continuous-valued input attributes:
– An attribute can take innumerably many values, so a table of value counts is no longer possible
– Instead, the conditional probability is modeled with the normal distribution:
P̂(Xj|C = ci) = 1 / (√(2π) σji) · exp( −(Xj − μji)² / (2σji²) )
μji: mean (average) of the attribute values Xj of examples for which C = ci
σji: standard deviation of the attribute values Xj of examples for which C = ci
– Learning Phase: for X = (X1, ..., Xn) and C = c1, ..., cL,
Output: n × L normal distributions and P(C = ci), i = 1, ..., L
– Test Phase: for X' = (X'1, ..., X'n),
• Calculate the conditional probabilities with all the normal distributions
• Apply the MAP rule to make a decision
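A minimal sketch of the Gaussian estimate for one attribute within one class; the five sample values are hypothetical:

import math

def fit_gaussian(values):
    # μji and σji from the attribute values Xj of examples in class ci
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    # P̂(Xj = x|C = ci) under the normal model
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

mu, sigma = fit_gaussian([64.0, 68.0, 69.0, 71.0, 72.0])
print(gaussian_pdf(70.0, mu, sigma))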
Conclusions
Naïve Bayes is based on the independence assumption
– Training is very easy and fast: it only requires estimating each attribute's distribution in each class separately
– Testing is straightforward: just look up tables, or calculate conditional probabilities with normal distributions
A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
Apart from classification, naïve Bayes can do more…