Introduction to Machine Learning


Lecture Slides for Introduction to Machine Learning
Ethem Alpaydın
© The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml
Outline
• Discriminant Functions
• Learning Association Rules
• Naïve Bayes Classifier
• Example: Play Tennis
• Relevant Issues
• Conclusions
What is a Discriminant Function?
For a classification problem with K classes, define a function $g_i(\mathbf{x})$, $i = 1, \ldots, K$, for each class, and choose $C_i$ if
$$g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$$
Discriminant Functions
• Define $g_i(\mathbf{x})$, $i = 1, \ldots, K$, and choose $C_i$ if $g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})$.
• The discriminant can be set to
  $g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$, or
  $g_i(\mathbf{x}) = P(C_i \mid \mathbf{x})$, or
  $g_i(\mathbf{x}) = p(\mathbf{x} \mid C_i)\,P(C_i)$
• This defines K decision regions $\mathcal{R}_1, \ldots, \mathcal{R}_K$, where
$$\mathcal{R}_i = \{\mathbf{x} \mid g_i(\mathbf{x}) = \max_k g_k(\mathbf{x})\}$$
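To make the argmax rule concrete, here is a minimal Python sketch (not from the slides), assuming the class likelihoods $p(\mathbf{x} \mid C_i)$ and priors $P(C_i)$ are supplied by the caller:

import numpy as np

def classify(x, likelihoods, priors):
    """Choose the class whose discriminant g_i(x) = p(x | C_i) * P(C_i) is largest.

    likelihoods: list of callables, likelihoods[i](x) estimates p(x | C_i)
    priors:      list of floats,    priors[i] = P(C_i)
    """
    g = np.array([likelihoods[i](x) * priors[i] for i in range(len(priors))])
    return int(np.argmax(g))  # index of the chosen class C_i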
K = 2 Classes
• Dichotomizer (K = 2) vs. polychotomizer (K > 2)
• Define a single discriminant $g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$ and
$$\text{choose } \begin{cases} C_1 & \text{if } g(\mathbf{x}) > 0 \\ C_2 & \text{otherwise} \end{cases}$$
• Log odds: $\log \dfrac{P(C_1 \mid \mathbf{x})}{P(C_2 \mid \mathbf{x})}$
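For the two-class case, the same decision can be phrased through the log odds; a small sketch under the assumption that both posteriors are already available as numbers:

import math

def dichotomize(p_c1_given_x, p_c2_given_x):
    """Choose C1 if the log odds log(P(C1|x) / P(C2|x)) is positive, else C2."""
    g = math.log(p_c1_given_x) - math.log(p_c2_given_x)
    return "C1" if g > 0 else "C2"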
Problem: Association Rule Mining
Input: a set of transactions.
Objective: given a set of transactions D, generate all association rules whose support and confidence exceed a user-specified minimum support and minimum confidence, while minimizing computation time by pruning.
Constraint: items are kept in lexicographical order.

TID   Transaction
1     {bread, milk, beer, diapers}
2     {beer, apples, diapers}
3     {diapers, milk, beer}
4     {beer, apples, diapers}
5     {milk, bread, chocolate}

Example association rules: {Diapers} ⇒ {Beer}, {Milk, Bread} ⇒ {Eggs, Coke}, {Beer, Bread} ⇒ {Milk}

Real-world applications: NCR (Teradata) runs association rule mining for more than 20 large retail organizations, including Walmart; it is also used for pattern discovery in biological databases.
Association Rules
• Association rule: $X \Rightarrow Y$
• Support of $X \Rightarrow Y$:
$$P(X, Y) = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers}\}}$$
• Confidence of $X \Rightarrow Y$:
$$P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{\#\{\text{customers who bought } X \text{ and } Y\}}{\#\{\text{customers who bought } X\}}$$
• Apriori algorithm (Agrawal et al., 1996)
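To make the two measures concrete, here is a small sketch (not part of the slides) that computes them on the five-transaction example above:

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X), i.e. an estimate of P(Y | X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

transactions = [
    {"bread", "milk", "beer", "diapers"},
    {"beer", "apples", "diapers"},
    {"diapers", "milk", "beer"},
    {"beer", "apples", "diapers"},
    {"milk", "bread", "chocolate"},
]
print(support({"diapers", "beer"}, transactions))       # 4/5 = 0.8
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.8 / 0.8 = 1.0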
Apriori Algorithm: Breadth-First Search
[Figure: the itemset lattice is searched level by level, e.g. {} → {a}, {b}, {c}, {d} → {a,b}, {a,d}, {b,d} → {a,b,d}.]
Apriori Algorithm Examples: Problem Decomposition

Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
Shoes ⇒ Jacket (confidence 2/3 ≈ 67%)
Jacket ⇒ Shoes (confidence 2/2 = 100%)
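The same numbers can be reproduced with a short, self-contained sketch (illustrative code, not from the slides):

from itertools import combinations

transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

def sup(items):
    """Fraction of transactions containing all of `items`."""
    return sum(set(items) <= t for t in transactions) / len(transactions)

# Frequent itemsets of size 1 and 2 at minimum support 50%
items = sorted({i for t in transactions for i in t})
for k in (1, 2):
    for c in combinations(items, k):
        if sup(c) >= 0.5:
            print(set(c), sup(c))  # the four itemsets meeting 50% support

# Confidence of the two rules derived from {Shoes, Jacket}
print(sup({"Shoes", "Jacket"}) / sup({"Shoes"}))   # ≈ 0.67
print(sup({"Shoes", "Jacket"}) / sup({"Jacket"}))  # = 1.0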
The Apriori Algorithm — Example
Minimum support = 50% (i.e., at least 2 of the 4 transactions)

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1 (candidate 1-itemsets with support counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidate 2-itemsets from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidate 3-itemsets from L2): {2 3 5}
Scan D → L3 (frequent 3-itemsets): {2 3 5}: 2
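The level-wise generate-and-prune procedure can be sketched in a few lines of Python (an illustrative implementation, not code from the slides):

from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): count} for every itemset meeting min_support."""
    needed = min_support * len(transactions)
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= needed}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation (join + prune): every (k-1)-subset must be frequent
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))]
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= needed}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, 0.5))  # includes frozenset({2, 3, 5}): 2, matching L3 above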
Naïve Bayes' Classifier
Given the class C, the inputs $x_j$ are assumed to be independent:
$$p(\mathbf{x} \mid C) = p(x_1 \mid C)\,p(x_2 \mid C) \cdots p(x_d \mid C)$$
Background
There are three ways to build a classifier:
a) Model a classification rule directly
   Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class membership given the input data
   Example: multi-layer perceptron with the cross-entropy cost
c) Build a probabilistic model of the data within each class
   Examples: naïve Bayes, model-based classifiers
(a) and (b) are examples of discriminative classification, (c) is an example of generative classification, and (b) and (c) are both examples of probabilistic classification.
Probability Basics
• Prior, conditional, and joint probability
  – Prior probability: $P(X)$
  – Conditional probability: $P(X_1 \mid X_2)$, $P(X_2 \mid X_1)$
  – Joint probability: $\mathbf{X} = (X_1, X_2)$, $P(\mathbf{X}) = P(X_1, X_2)$
  – Relationship: $P(X_1, X_2) = P(X_2 \mid X_1)\,P(X_1) = P(X_1 \mid X_2)\,P(X_2)$
  – Independence: $P(X_2 \mid X_1) = P(X_2)$, $P(X_1 \mid X_2) = P(X_1)$, $P(X_1, X_2) = P(X_1)\,P(X_2)$
• Bayes' rule
$$P(C \mid \mathbf{X}) = \frac{P(\mathbf{X} \mid C)\,P(C)}{P(\mathbf{X})}, \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model: $P(C \mid \mathbf{X})$, with $C = c_1, \ldots, c_L$ and $\mathbf{X} = (X_1, \ldots, X_n)$
  – Generative model: $P(\mathbf{X} \mid C)$, with $C = c_1, \ldots, c_L$ and $\mathbf{X} = (X_1, \ldots, X_n)$
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign $\mathbf{x}$ to $c^*$ if $P(C = c^* \mid \mathbf{X} = \mathbf{x}) > P(C = c \mid \mathbf{X} = \mathbf{x})$ for all $c \neq c^*$, $c = c_1, \ldots, c_L$
• Generative classification with the MAP rule
  – Apply Bayes' rule to convert: $P(C \mid \mathbf{X}) = \dfrac{P(\mathbf{X} \mid C)\,P(C)}{P(\mathbf{X})} \propto P(\mathbf{X} \mid C)\,P(C)$
Naïve Bayes
• Bayes classification
$$P(C \mid \mathbf{X}) \propto P(\mathbf{X} \mid C)\,P(C) = P(X_1, \ldots, X_n \mid C)\,P(C)$$
  Difficulty: learning the joint probability $P(X_1, \ldots, X_n \mid C)$.
• Naïve Bayes classification
  – Assume that all input attributes are conditionally independent given the class:
$$\begin{aligned} P(X_1, X_2, \ldots, X_n \mid C) &= P(X_1 \mid X_2, \ldots, X_n; C)\,P(X_2, \ldots, X_n \mid C) \\ &= P(X_1 \mid C)\,P(X_2, \ldots, X_n \mid C) \\ &= P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_n \mid C) \end{aligned}$$
  – MAP classification rule: assign $\mathbf{x} = (x_1, \ldots, x_n)$ to $c^*$ if
$$[P(x_1 \mid c^*) \cdots P(x_n \mid c^*)]\,P(c^*) > [P(x_1 \mid c) \cdots P(x_n \mid c)]\,P(c), \quad c \neq c^*,\ c = c_1, \ldots, c_L$$
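To see why the full joint is hard to learn, a quick parameter count with illustrative numbers (not from the slides): with $n = 10$ binary attributes and $L = 2$ classes, the full joint $P(X_1, \ldots, X_{10} \mid C)$ has $2 \times (2^{10} - 1) = 2046$ free parameters, whereas the naïve Bayes factorization needs only $2 \times 10 = 20$ (one per attribute per class).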
Naïve Bayes
• Naïve Bayes algorithm (for discrete input attributes)
  – Learning phase: given a training set S,
    for each target value $c_i$ ($c_i = c_1, \ldots, c_L$):
      $\hat{P}(C = c_i) \leftarrow$ estimate $P(C = c_i)$ from the examples in S;
    for every attribute value $a_{jk}$ of each attribute $x_j$ ($j = 1, \ldots, n$; $k = 1, \ldots, N_j$):
      $\hat{P}(X_j = a_{jk} \mid C = c_i) \leftarrow$ estimate $P(X_j = a_{jk} \mid C = c_i)$ from the examples in S;
    Output: conditional probability tables; for each $x_j$, a table with $N_j \times L$ elements.
  – Test phase: given an unknown instance $\mathbf{x}' = (a'_1, \ldots, a'_n)$,
    look up the tables and assign the label $c^*$ to $\mathbf{x}'$ if
$$[\hat{P}(a'_1 \mid c^*) \cdots \hat{P}(a'_n \mid c^*)]\,\hat{P}(c^*) > [\hat{P}(a'_1 \mid c) \cdots \hat{P}(a'_n \mid c)]\,\hat{P}(c), \quad c \neq c^*,\ c = c_1, \ldots, c_L$$
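The learning and test phases can be written compactly in Python; this is an illustrative sketch using plain relative-frequency estimates (no smoothing), not code from the slides:

from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Learning phase: estimate P(C=c) and P(X_j=a | C=c) by counting.

    X: list of attribute tuples, y: list of class labels.
    Returns (priors, cond) where cond[(j, a, c)] = P-hat(X_j = a | C = c).
    """
    n = len(y)
    class_counts = Counter(y)
    priors = {c: class_counts[c] / n for c in class_counts}
    pair_counts = defaultdict(int)
    for xs, c in zip(X, y):
        for j, a in enumerate(xs):
            pair_counts[(j, a, c)] += 1
    cond = {key: cnt / class_counts[key[2]] for key, cnt in pair_counts.items()}
    return priors, cond

def predict(x, priors, cond):
    """Test phase: MAP rule with the factored likelihood."""
    def score(c):
        p = priors[c]
        for j, a in enumerate(x):
            p *= cond.get((j, a, c), 0.0)  # unseen value gives zero; see the smoothing remedy below
        return p
    return max(priors, key=score)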
Example: Play Tennis
Example: Learning Phase

Outlook    Play=Yes   Play=No
Sunny      2/9        3/5
Overcast   4/9        0/5
Rain       3/9        2/5

Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5

Humidity   Play=Yes   Play=No
High       3/9        4/5
Normal     6/9        1/5

Wind     Play=Yes   Play=No
Strong   3/9        3/5
Weak     6/9        2/5

P(Play=Yes) = 9/14, P(Play=No) = 5/14
Example: Test Phase
– Given a new instance,
  x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:
  P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
  P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
  P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
  P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
  P(Play=Yes) = 9/14                       P(Play=No) = 5/14
– MAP rule:
  P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) ≈ 0.0053
  P(No | x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) ≈ 0.0206
  Since P(Yes | x') < P(No | x'), we label x' as "No".
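As a quick check (not on the slides), the two scores can be reproduced directly from the table entries:

from fractions import Fraction as F

# Unnormalized posterior scores for x' = (Sunny, Cool, High, Strong)
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(score_yes))  # ≈ 0.0053
print(float(score_no))   # ≈ 0.0206
print("No" if score_no > score_yes else "Yes")  # -> "No"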
Relevant Issues
• Violation of the independence assumption
  – For many real-world tasks, $P(X_1, \ldots, X_n \mid C) \neq P(X_1 \mid C) \cdots P(X_n \mid C)$
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
  – If no training example contains the attribute value $X_j = a_{jk}$, then $\hat{P}(X_j = a_{jk} \mid C = c_i) = 0$.
  – In that case, $\hat{P}(x_1 \mid c_i) \cdots \hat{P}(a_{jk} \mid c_i) \cdots \hat{P}(x_n \mid c_i) = 0$ at test time.
  – As a remedy, estimate the conditional probabilities with the smoothed formula below (a short code sketch follows this list):
$$\hat{P}(X_j = a_{jk} \mid C = c_i) = \frac{n_c + m p}{n + m}$$
    $n_c$: number of training examples for which $X_j = a_{jk}$ and $C = c_i$
    $n$: number of training examples for which $C = c_i$
    $p$: prior estimate (usually $p = 1/t$ for $t$ possible values of $X_j$)
    $m$: weight given to the prior (number of "virtual" examples, $m \geq 1$)
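A minimal sketch of this smoothed estimate, assuming the counts $n_c$ and $n$, the number of possible values $t$, and the weight $m$ are given:

def smoothed_conditional(n_c, n, t, m=1.0):
    """Smoothed estimate of P(X_j = a_jk | C = c_i): (n_c + m*p) / (n + m), with p = 1/t."""
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# An attribute value never seen with class c_i (n_c = 0) no longer yields a hard zero:
print(smoothed_conditional(n_c=0, n=9, t=3, m=1))  # ≈ 0.033 instead of 0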
Relevant Issues
• Continuous-valued input attributes
  – An attribute may take on continuously many values.
  – Model the conditional probability with a normal distribution:
$$\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)$$
    $\mu_{ji}$: mean of the values of attribute $X_j$ over the examples for which $C = c_i$
    $\sigma_{ji}$: standard deviation of the values of attribute $X_j$ over the examples for which $C = c_i$
  – Learning phase: for $\mathbf{X} = (X_1, \ldots, X_n)$ and $C = c_1, \ldots, c_L$,
    output $n \times L$ normal distributions and $P(C = c_i)$, $i = 1, \ldots, L$.
  – Test phase: for an instance $\mathbf{X}' = (X'_1, \ldots, X'_n)$,
    • calculate the conditional probabilities with all the normal distributions;
    • apply the MAP rule to make a decision.
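An illustrative Gaussian naïve Bayes sketch along these lines (an assumed implementation, not code from the slides):

import math

def gaussian_pdf(x, mu, sigma):
    """Normal density used as the class-conditional model for a continuous attribute."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def train_gaussian_nb(X, y):
    """Learning phase: for each class, a prior and a per-attribute (mean, std) pair."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for j in range(len(rows[0])):
            vals = [r[j] for r in rows]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals)
            stats.append((mu, math.sqrt(var) or 1e-9))  # guard against zero std
        model[c] = (len(rows) / len(X), stats)
    return model

def predict_gaussian_nb(x, model):
    """Test phase: MAP rule with the product of Gaussian conditionals."""
    def score(c):
        prior, stats = model[c]
        p = prior
        for xj, (mu, sigma) in zip(x, stats):
            p *= gaussian_pdf(xj, mu, sigma)
        return p
    return max(model, key=score)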
Conclusions
• Naïve Bayes is based on the independence assumption.
  – Training is very easy and fast: it only requires considering each attribute in each class separately.
  – Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions.
• A popular generative model
  – Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated.
  – Many successful applications, e.g., spam mail filtering.
  – Beyond classification, naïve Bayes can do more…