Categorical data - Carleton College


Three kinds of learning

Supervised learning: learn some mapping from inputs to outputs.

Unsupervised learning: given “data”, what kinds of patterns can you find?

Reinforcement learning: learn from positive/negative reinforcement.
Categorical data example
Example from Ross Quinlan, Decision Tree Induction; graphics from Tom Mitchell, Machine Learning
Decision Tree Classification
Which feature to split on?
Try to classify as many as possible with each split
(This is a good split)
Which feature to split on?
These are bad splits – no classifications obtained
Improving a good split
Decision Tree Algorithm Framework

Use the splitting criterion to decide on the best attribute to split on.
Each child is a new decision tree: recurse with the parent feature removed.
If all data points in a child node are the same class, classify that node as that class.
If no attributes are left, classify by majority rule.
If no data points are left (no such example was seen), classify as the majority class of the entire dataset.
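Below is a minimal Python sketch of this recursive framework. It assumes examples are stored as (feature-dictionary, label) pairs and that a splitting-criterion function is passed in; the names build_tree, gain, and default_class are mine, not the slides'.

```python
from collections import Counter

def build_tree(examples, attributes, default_class, gain):
    # examples: list of (features_dict, label) pairs; attributes: list of feature names;
    # gain: the splitting criterion (e.g., information gain, defined later in these slides).
    if not examples:
        # No data points left: no such example was seen, so classify as the
        # majority class of the entire dataset (passed in as default_class).
        return default_class
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]                                  # all points are the same class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # no attributes left: majority rule
    best = max(attributes, key=lambda a: gain(examples, a))   # splitting criterion picks the attribute
    children = {}
    for value in set(feats[best] for feats, _ in examples):
        subset = [(feats, label) for feats, label in examples if feats[best] == value]
        remaining = [a for a in attributes if a != best]      # recurse with parent feature removed
        children[value] = build_tree(subset, remaining, default_class, gain)
    return (best, children)                               # internal node: attribute to test + subtree per value
```

The returned tree is either a class label (a leaf) or an (attribute, children) pair; the gain function plugged in here is exactly the splitting criterion discussed next.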
How do we know which splits are good?

Want nodes as “pure” as possible.
How do we quantify the “randomness” of a node? Want:

All elements +: “randomness” = 0
All elements –: “randomness” = 0
Half +, half –: “randomness” = 1

(Plot: what the “randomness” function should look like.)
Typical solution: Entropy


p_P = proportion of + examples
p_N = proportion of – examples

Entropy = -p_P lg(p_P) - p_N lg(p_N)
A collection with low entropy is good.
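As a quick illustration (not code from the slides), here is a small Python helper that computes this two-class entropy and reproduces the three cases above; the name entropy and the 0 · lg 0 = 0 convention are my assumptions.

```python
from math import log2

def entropy(p_pos, p_neg):
    # Two-class entropy: -p_P lg(p_P) - p_N lg(p_N), treating 0 * lg(0) as 0
    # so that a pure node comes out to exactly 0.
    return sum(-p * log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy(1.0, 0.0))              # all +          -> 0.0
print(entropy(0.0, 1.0))              # all -          -> 0.0
print(entropy(0.5, 0.5))              # half +, half - -> 1.0
print(round(entropy(9/14, 5/14), 3))  # the parent node used below -> 0.94
```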
ID3 Criterion


Split on the feature with the most information gain.
Gain = entropy in original node – weighted sum of entropy in child nodes

Gain(split) = Entropy(parent) - Σ over children [ (size of child / size of parent) × Entropy(child) ]
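A possible Python rendering of this criterion, reusing the (feature-dictionary, label) representation assumed earlier; the names node_entropy and information_gain are mine, and this is a sketch rather than Quinlan's exact code.

```python
from math import log2

def node_entropy(labels):
    # Entropy of a collection of class labels (works for two or more classes).
    total = len(labels)
    return sum(-(labels.count(c) / total) * log2(labels.count(c) / total)
               for c in set(labels))

def information_gain(examples, attribute):
    # ID3 gain: parent entropy minus the size-weighted entropy of each child node.
    # examples: list of (features_dict, label) pairs; attribute: feature to split on.
    parent_labels = [label for _, label in examples]
    gain = node_entropy(parent_labels)
    for value in set(feats[attribute] for feats, _ in examples):
        child_labels = [label for feats, label in examples if feats[attribute] == value]
        gain -= (len(child_labels) / len(examples)) * node_entropy(child_labels)
    return gain
```

A function like this can serve as the gain argument in the build_tree sketch above.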
How good is this split?

Parent node: 9+, 5-.
Entropy(parent) = -(9/14) lg(9/14) - (5/14) lg(5/14) = .940

One child: 3+, 4-.
Entropy = -(3/7) lg(3/7) - (4/7) lg(4/7) = .985

Other child: 6+, 1-.
Entropy = -(6/7) lg(6/7) - (1/7) lg(1/7) = .592

Weighted average = (7/14)(.985) + (7/14)(.592) = .789
Gain = .940 - .789 = .151
How good is this split?

Parent node: 9+, 5-.
Entropy(parent) = -(9/14) lg(9/14) - (5/14) lg(5/14) = .940

First child: 2+, 3-.
Entropy = -(2/5) lg(2/5) - (3/5) lg(3/5) = .971

Second child: 4+, 0-.
Entropy = -(4/4) lg(4/4) - (0/4) lg(0/4) = 0 (taking 0 · lg 0 = 0)

Third child: 3+, 2-.
Entropy = -(3/5) lg(3/5) - (2/5) lg(2/5) = .971

Weighted average = (5/14)(.971) + (4/14)(0) + (5/14)(.971) = .694
Gain = .940 - .694 = .246
The big picture



Start with root
Find attribute to split on with most gain
Recurse
Assessment



How do I know how well my decision tree works?
Training set: the data you use to build the decision tree.
Test set: data you did not use for training, used to assess the quality of the decision tree.
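One simple way to set this up in Python (a sketch under my own naming, not a prescribed procedure): shuffle the data, hold out a fraction as the test set, and measure accuracy only on that held-out portion.

```python
import random

def split_train_test(data, test_fraction=0.3, seed=0):
    # Hold out a random portion of the data for testing; returns (training_set, test_set).
    shuffled = data[:]                        # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    # Fraction of held-out examples the classifier labels correctly.
    correct = sum(1 for feats, label in test_set if classifier(feats) == label)
    return correct / len(test_set)
```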
Issues on training and test sets

Do you know the correct classification for the test set?
If you do, why not include it in the training set to get a better classifier?
If you don’t, how can you measure the performance of your classifier?
Cross Validation

Tenfold cross-validation:
Ten iterations.
Pull a different tenth of the dataset out each time to act as the test set.
Train on the remaining training set.
Measure performance on the test set.

Leave-one-out cross-validation:
Similar, but leave only one point out each time, then count correct vs. incorrect.
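A compact sketch of tenfold cross-validation in Python, assuming `train` is any function that maps a training set to a classifier (that name is my placeholder) and that the dataset has at least k points:

```python
def cross_validate(data, train, k=10):
    # Each iteration holds out a different k-th of the dataset as the test set,
    # trains on the rest, and records accuracy on the held-out points.
    scores = []
    for i in range(k):
        test_set  = data[i::k]                              # every k-th point, offset i
        train_set = [x for j, x in enumerate(data) if j % k != i]
        clf = train(train_set)
        correct = sum(1 for feats, label in test_set if clf(feats) == label)
        scores.append(correct / len(test_set))
    return sum(scores) / k                                  # average accuracy over the folds

# Leave-one-out cross-validation is the same idea with k = len(data).
```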
Noise and Overfitting



Can we always obtain a decision tree that is
consistent with the data?
Do we always want a decision tree that is
consistent with the data?
Example: Predict Carleton students who
become CEOs



Features: state/country of origin, GPA letter,
major, age, high school GPA, junior high GPA, ...
What happens with only a few features?
What happens with many features?
Overfitting

Fitting a classifier “too closely” to the
data


finding patterns that aren’t really there
Prevented in decision trees by pruning


When building the tree, stop recursing on irrelevant attributes.
Do statistical tests at each node to determine whether to continue splitting.
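The slides don't say which statistical test to use; one common choice is a chi-squared test on the class counts in the would-be children (roughly the idea behind chi-squared pre-pruning). A sketch using SciPy, with my own function name and a conventional 0.05 threshold:

```python
from scipy.stats import chi2_contingency

def split_is_significant(child_class_counts, alpha=0.05):
    # child_class_counts: one row per child of the proposed split, one column per class,
    # e.g. [[3, 4], [6, 1]] for the 3+/4- and 6+/1- split worked through earlier.
    # If the class distribution across children could plausibly be chance (p > alpha),
    # treat the attribute as irrelevant and stop recursing.
    chi2, p, dof, expected = chi2_contingency(child_class_counts)
    return p <= alpha

print(split_is_significant([[3, 4], [6, 1]]))   # False: with only 14 examples this split is not significant at 0.05
```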
Examples of decision trees using Weka
Preventing overfitting by cross validation

Another technique to prevent overfitting (is this valid?):

Keep recursing on the decision tree as long as you continue to get improved accuracy on the test set.
Ensemble Methods


Many “weak” learners, when combined, can perform more strongly than any one by itself.
Bagging & Boosting: many different learners vote on the classification.

Multiple algorithms, or different features, or both.
Bagging / Boosting

Bagging: vote to determine the answer.
Run one algorithm on random subsets of the data to obtain multiple classifiers.

Boosting: weighted vote to determine the answer.
Each iteration, weight more heavily the data that the learner got wrong.

What does it mean to “weight more heavily” for k-NN? For decision trees?
AdaBoost is recent (1997) and has quickly become popular.
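A toy bagging sketch in Python, using 1-nearest-neighbour as the “weak” learner (my choice, echoing the k-NN question above); the bootstrap sampling and the unweighted majority vote are the essential parts.

```python
import random
from collections import Counter

def nn1(train_pts, x):
    # 1-nearest-neighbour weak learner: predict the label of the closest training point.
    closest = min(train_pts, key=lambda pt: sum((a - b) ** 2 for a, b in zip(pt[0], x)))
    return closest[1]

def bagged_predict(data, x, rounds=25, seed=0):
    # Bagging: train each learner on a bootstrap sample (drawn with replacement),
    # then take an unweighted majority vote over their predictions.
    rng = random.Random(seed)
    votes = []
    for _ in range(rounds):
        sample = [rng.choice(data) for _ in data]
        votes.append(nn1(sample, x))
    return Counter(votes).most_common(1)[0][0]

# Toy usage with made-up 2-D points labelled '+' or '-'.
data = [((0, 0), '-'), ((0, 1), '-'), ((1, 0), '+'), ((1, 1), '+')]
print(bagged_predict(data, (0.9, 0.2)))   # expected to vote '+'
```

Boosting would replace the uniform bootstrap with sampling (or weighting) that favours the examples earlier rounds got wrong, and would replace the plain vote with a weighted one.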
Computational Learning Theory
Chapter 20 up next


Moving on to Chapter 20: statistical learning methods.
Skipping to:
20.5: Neural Networks
20.6: Support vector machines
Will revisit earlier topics (perhaps) near the end of the course.