
Classification
Part 1
CSE 439 – Data Mining
Assist. Prof. Dr. Derya BİRANT
Outline
◘ What Is Classification?
◘ Classification Examples
◘ Classification Methods
– Decision Trees
– Bayesian Classification
– K-Nearest Neighbor
– Neural Network
– Genetic Algorithms
– Support Vector Machines (SVM)
– Fuzzy Set Approaches
What Is Classification?
◘ Classification
– Construction of a model that assigns class labels to data
– The model is constructed from a training set whose class labels are known
– After construction, the model is used to classify new data
Classification (A Two-Step Process)
1. Model construction
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, trees, or mathematical formulae
2. Model usage (classifying future or unknown objects)
– Estimate the accuracy of the model
• The accuracy rate is the percentage of test-set samples that are correctly classified by the model
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
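As a minimal sketch of these two steps in Python, assuming scikit-learn; the library, the dataset, and the 70/30 split are illustrative choices, not part of the slides:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)  # any labeled dataset works here
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Step 1: model construction on the training set
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 2a: estimate accuracy on the held-out test set
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"accuracy = {accuracy:.2f}")

    # Step 2b: if the accuracy is acceptable, classify unseen data
    new_sample = [[5.1, 3.5, 1.4, 0.2]]
    print(model.predict(new_sample))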
Classification (A Two-Step Process)
[Figure: Training Data → DM Engine → Mining Model; then Data To Predict → DM Engine (using the Mining Model) → Predicted Data]
Classification Example
Process (1): Model Construction

Training data is fed to the classification algorithm, which produces the classifier (model):

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction

Testing data (used to estimate the model's accuracy):

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured? The rule predicts tenured = 'yes', since rank = 'professor'.
Classification Example
◘ Given old data about customers and their payments, predict a new applicant's loan eligibility.
– Good customers
– Bad customers

[Figure: previous customers' attributes (Age, Salary, Profession, Location, Customer type) are fed to a classifier, which learns rules such as "Salary > 5 L" and "Prof. = Exec"; applying the rules to a new applicant's data labels the applicant Good or Bad]
Classification Techniques
1. Decision Trees
2. Bayesian Classification

   c = \arg\max_{c_j} \frac{P(c_j) \prod_{i=1}^{n} P(a_i \mid c_j)}{P(d)}

3. K-Nearest Neighbor
4. Neural Network
5. Genetic Algorithms
6. Support Vector Machines (SVM)
7. Fuzzy Set Approaches
Classification Techniques
Decision Trees, Bayesian Classification, K-Nearest Neighbor, Neural Network, Genetic Algorithms, Support Vector Machines (SVM), Fuzzy Set Approaches, …
Decision Trees
◘ A decision tree is a tree in which
– internal nodes are simple decision rules on one or more attributes
– leaf nodes are predicted class labels
◘ Decision trees are used for deciding between several courses of action
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

The tree learned from this data (internal nodes test an attribute, branches are attribute values, leaves give the classification):

age?
├── <=30  → student?
│          ├── no  → no
│          └── yes → yes
├── 31…40 → yes
└── >40   → credit_rating?
           ├── excellent → no
           └── fair      → yes
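For illustration, the same tree can be grown programmatically. The sketch below assumes scikit-learn and pandas (tools not prescribed by the slides) and one-hot encodes the categorical attributes, since sklearn trees require numeric input:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # The buys_computer training set from the slide
    rows = [
        ("<=30", "high", "no", "fair", "no"),
        ("<=30", "high", "no", "excellent", "no"),
        ("31…40", "high", "no", "fair", "yes"),
        (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),
        (">40", "low", "yes", "excellent", "no"),
        ("31…40", "low", "yes", "excellent", "yes"),
        ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),
        (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"),
        ("31…40", "medium", "no", "excellent", "yes"),
        ("31…40", "high", "yes", "fair", "yes"),
        (">40", "medium", "no", "excellent", "no"),
    ]
    df = pd.DataFrame(rows, columns=["age", "income", "student",
                                     "credit_rating", "buys_computer"])
    X = pd.get_dummies(df.drop(columns="buys_computer"))  # one-hot encoding
    y = df["buys_computer"]

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # text form of the tree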
Decision Tree Applications
◘ Decision trees are used extensively in data mining.
◘ They have been applied to classify:
– medical patients by disease,
– equipment malfunctions by cause,
– loan applicants by likelihood of payment,
– ...

[Figure: example tree with tests "Salary < 1 M", "Job = teacher", "Age < 30", "House Hiring" and Good/Bad leaves]
Decision Trees (Different Representation)
◘ DT splits area (a different representation of a decision tree)

[Figure: a tree splitting on Age (<30 vs. >=30) and Car Type (Minivan → YES; Sports, Truck → NO), shown beside the equivalent axis-aligned decision regions over the Age axis (0, 30, 60)]
Decision Tree Advantages and Disadvantages
Positives (+)
+ Reasonable training time
+ Fast application
+ Easy to interpret (can be re-represented as if-then-else rules)
+ Easy to implement
+ Can handle a large number of features
+ Does not require any prior knowledge of the data distribution
Negatives (-)
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
- Output attribute must be categorical
- Limited to one output attribute
Rules Indicated by Decision Trees
◘ Write a rule for each path in the decision tree from the root to a leaf.
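For the buys_computer tree above, for example, this yields one rule per root-to-leaf path:

    IF age = "<=30"  AND student = "no"              THEN buys_computer = "no"
    IF age = "<=30"  AND student = "yes"             THEN buys_computer = "yes"
    IF age = "31…40"                                 THEN buys_computer = "yes"
    IF age = ">40"   AND credit_rating = "excellent" THEN buys_computer = "no"
    IF age = ">40"   AND credit_rating = "fair"      THEN buys_computer = "yes"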
Decision Tree Algorithms
◘ ID3
– Quinlan (1981)
– Tries to reduce the expected number of comparisons
◘ C4.5
– Quinlan (1993)
– An extension of ID3
– Just starting to be used in data mining applications
– Also used for rule induction
◘ CART
– Breiman, Friedman, Olshen, and Stone (1984)
– Classification and Regression Trees
◘ CHAID
– Kass (1980)
– Oldest decision tree algorithm
– Well established in the database marketing industry
◘ QUEST
– Loh and Shih (1997)
Decision Tree Construction
◘ Which attribute is the best classifier?
– Calculate the information gain Gain(S, A) for each attribute A.
– The basic idea is that we select the attribute with the highest information gain.

Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i

(for two classes: Entropy(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2)

Gain(S, A) = Entropy(S) - \sum_{i \in Values(A)} \frac{|S_i|}{|S|} \, Entropy(S_i)
Decision Tree Construction
     Outlook    Temperature  Humidity  Wind    PlayTennis
R1   Sunny      Hot          High      Weak    No
R2   Sunny      Hot          High      Strong  No
R3   Overcast   Hot          High      Weak    Yes
R4   Rainy      Mild         High      Weak    Yes
R5   Rainy      Cool         Normal    Weak    Yes
R6   Rainy      Cool         Normal    Strong  No
R7   Overcast   Cool         Normal    Strong  Yes
R8   Sunny      Mild         High      Weak    No
R9   Sunny      Cool         Normal    Weak    Yes
R10  Rainy      Mild         Normal    Weak    Yes
R11  Sunny      Mild         Normal    Strong  Yes
R12  Overcast   Mild         High      Strong  Yes
R13  Overcast   Hot          Normal    Weak    Yes
R14  Rainy      Mild         High      Strong  No
Which attribute first?
Decision Tree Construction
(Same training set R1–R14 as above.)

Entropy(S) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940

Gain(S, Wind) = Entropy(S) - \frac{|S_{Weak}|}{|S|} Entropy(S_{Weak}) - \frac{|S_{Strong}|}{|S|} Entropy(S_{Strong})
              = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Gain(S, Humidity) = Entropy(S) - \frac{|S_{High}|}{|S|} Entropy(S_{High}) - \frac{|S_{Normal}|}{|S|} Entropy(S_{Normal})
                  = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048

Outlook has the highest information gain, so it is chosen as the root.
Decision Tree Construction
◘ Which attribute is next?

Outlook
├── Sunny    → ?
├── Overcast → Yes
└── Rainy    → ?

Gain(S_{Sunny}, Wind)        = 0.970 - (3/5)(0.918) - (2/5)(1.0) = 0.019
Gain(S_{Sunny}, Humidity)    = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
Gain(S_{Sunny}, Temperature) = 0.970 - (2/5)(0) - (2/5)(1) - (1/5)(0) = 0.570

Humidity has the highest gain on the Sunny branch, so it is tested next.
Decision Tree Construction
(Same training set R1–R14.)

The finished tree, with the training rows covered by each leaf:

Outlook
├── Sunny    → Humidity
│             ├── High   → No  [R1, R2, R8]
│             └── Normal → Yes [R9, R11]
├── Overcast → Yes [R3, R7, R12, R13]
└── Rainy    → Wind
              ├── Weak   → Yes [R4, R5, R10]
              └── Strong → No  [R6, R14]
Another Example
At the weekend you can:
– go shopping,
– watch a movie,
– play tennis, or
– just stay in.
What you do depends on three things:
– the weather (windy, rainy, or sunny);
– how much money you have (rich or poor);
– whether your parents are visiting.
Classification Techniques
Decision Trees, Bayesian Classification, K-Nearest Neighbor, Neural Network, Genetic Algorithms, Support Vector Machines (SVM), Fuzzy Set Approaches, …
Classification Techniques
2- Bayesian Classification
◘ A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
◘ Foundation: based on Bayes' theorem.
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H \mid X) = \frac{P(X \mid H) \, P(H)}{P(X)}
Classification Techniques
2- Bayesian Classification
(Training set: the same buys_computer table as in the decision tree example.)

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Classification Techniques
2- Bayesian Classification
◘ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
◘ Compute the priors P(Ci):
P(C1) = P(buys_computer = "yes") = 9/14 = 0.643
P(C2) = P(buys_computer = "no") = 5/14 = 0.357
◘ Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
◘ Multiply:
P(X|C1) = P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|C2) = P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|C1) P(C1) = 0.044 x 0.643 = 0.028
P(X|C2) P(C2) = 0.019 x 0.357 = 0.007
Therefore, X belongs to class buys_computer = "yes".
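The same computation as a hand-rolled Python sketch; the slides do not prescribe an implementation, and the variable names are ours:

    from collections import Counter

    # (age, income, student, credit_rating, buys_computer) - the slide's table
    data = [
        ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
        ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
        (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
        ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
        ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
        ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
        ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
    ]
    x = ("<=30", "medium", "yes", "fair")  # the sample X to classify

    scores = {}
    class_counts = Counter(row[-1] for row in data)
    for c, n_c in class_counts.items():
        rows_c = [row for row in data if row[-1] == c]
        score = n_c / len(data)               # prior P(c)
        for i, value in enumerate(x):         # times each P(a_i | c)
            score *= sum(1 for row in rows_c if row[i] == value) / n_c
        scores[c] = score

    print(scores)                       # {'yes': ~0.028, 'no': ~0.007}
    print(max(scores, key=scores.get))  # -> 'yes'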
Classification Techniques
Decision Trees, Bayesian Classification, K-Nearest Neighbor, Neural Network, Genetic Algorithms, Support Vector Machines (SVM), Fuzzy Set Approaches, …
K-Nearest Neighbor (k-NN)
◘ An object is classified by a majority vote of its neighbors (the k closest training samples).
◘ If k = 1, the object is simply assigned to the class of its nearest neighbor.
◘ The Euclidean distance measure is typically used to determine how close two samples are.
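A minimal sketch of this voting scheme in Python; the function knn_predict, the value of k, and the toy points are illustrative assumptions:

    from collections import Counter
    from math import dist  # Euclidean distance (Python 3.8+)

    def knn_predict(train, query, k=3):
        """train: list of (feature_vector, label) pairs; query: a feature vector."""
        # k closest training samples by Euclidean distance
        neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]  # majority class among the k neighbors

    train = [((1.0, 1.1), "A"), ((1.2, 0.9), "A"),
             ((4.8, 5.1), "B"), ((5.2, 4.9), "B")]
    print(knn_predict(train, (1.1, 1.0), k=3))  # -> 'A'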
Classification Evaluation (Testing)

Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model → apply to Test Set
Classification Accuracy
◘ Which classification model is better?

Confusion matrix:
             Predicted +       Predicted -
Actual +     True Positive     False Negative
Actual -     False Positive    True Negative

accuracy = \frac{TP + TN}{TP + TN + FP + FN}

error = \frac{FN + FP}{TP + TN + FP + FN}
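Both formulas as a small Python sketch; the counts are made-up example values:

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)

    def error_rate(tp, tn, fp, fn):
        return (fn + fp) / (tp + tn + fp + fn)

    print(accuracy(tp=50, tn=40, fp=5, fn=5))    # 0.9
    print(error_rate(tp=50, tn=40, fp=5, fn=5))  # 0.1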
Validation Techniques
◘ Simple Validation: hold out a single test set and train on the remaining data (the training set).
◘ Cross Validation: swap the roles of the training and test sets and average the results.
◘ n-Fold Cross Validation: partition the data into n folds; each fold serves as the test set once, while the remaining folds form the training set.
◘ Bootstrap Method: build the training set by sampling with replacement and evaluate on the left-out samples.
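A sketch of n-fold cross validation, assuming scikit-learn and n = 10; both choices are illustrative, and any classifier could stand in for the decision tree:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)  # n = 10 folds
    print(scores.mean())  # average accuracy over the 10 folds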