Categorical data - Carleton College
Three kinds of learning
Supervised learning
Learn some mapping from inputs to outputs
Unsupervised learning
Given “data”, what kinds of patterns can you find?
Reinforcement learning
Learn from positive/negative reinforcement
Categorical data example
Example from Ross Quinlan, Decision Tree Induction; graphics from Tom Mitchell, Machine Learning
Decision Tree Classification
Which feature to split on?
Try to classify as many as possible with each split
(This is a good split)
Which feature to split on?
These are bad splits – no classifications obtained
Improving a good split
Decision Tree Algorithm
Framework
Use the splitting criterion to decide on the best attribute to split on
Each child is a new decision tree; recurse with the parent feature removed
If all data points in a child node are the same class, classify the node as that class
If no attributes are left, classify by majority rule
If no data points are left, no such example was seen: classify as the majority class from the entire dataset
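A minimal sketch of this framework in Python, assuming examples are stored as (feature-dict, label) pairs; the names are illustrative, and `score` is a placeholder for the splitting criterion (for instance, the information gain defined below):

```python
from collections import Counter

def build_tree(examples, attributes, default_class, score):
    """Recursively build a decision tree from (features, label) pairs."""
    if not examples:
        # No data points left: no such example was seen, so fall back
        # to the majority class of the entire dataset.
        return default_class
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]  # all data points are the same class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority rule
    # Splitting criterion picks the best attribute to split on.
    best = max(attributes, key=lambda a: score(examples, a))
    remaining = [a for a in attributes if a != best]  # remove parent feature
    children = {}
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, y) for f, y in examples if f[best] == value]
        children[value] = build_tree(subset, remaining, default_class, score)
    return {"attribute": best, "children": children}
```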
How do we know which splits are good?
Want nodes as “pure” as possible
How do we quantify the “randomness” of a node? We want:
All elements +: “randomness” = 0
All elements –: “randomness” = 0
Half +, half –: “randomness” = 1
Draw a plot: what should the “randomness” function look like?
Typical solution: Entropy
$p_P$ = proportion of + examples
$p_N$ = proportion of − examples
$\text{Entropy} = -p_P \lg p_P - p_N \lg p_N$
A collection with low entropy is good.
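A small helper implementing this definition (a sketch; lg is the base-2 logarithm, with the usual convention that 0 · lg 0 = 0):

```python
import math

def entropy(pos, neg):
    """Entropy of a node with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:  # convention: 0 * lg(0) = 0
            p = count / total
            h -= p * math.log2(p)
    return h

# Matches the desiderata above:
# entropy(7, 0) == 0.0, entropy(0, 7) == 0.0, entropy(7, 7) == 1.0
```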
ID3 Criterion
Split on the feature with the most information gain.
Gain = entropy in the original node − weighted sum of entropy in the child nodes:
$\text{Gain}(\text{split}) = \text{Entropy}(\text{parent}) - \sum_{\text{child}} \frac{|\text{child}|}{|\text{parent}|}\,\text{Entropy}(\text{child})$
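The same criterion in code, reusing the `entropy` helper above (a sketch; each node is summarized by its (positive, negative) counts):

```python
def information_gain(parent_counts, child_counts):
    """Gain(split): parent entropy minus the size-weighted child entropies."""
    parent_size = sum(parent_counts)
    weighted = sum((p + n) / parent_size * entropy(p, n)
                   for p, n in child_counts)
    return entropy(*parent_counts) - weighted
```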
How good is this split?
Parent node: 9+, 5−
$\text{Entropy}(\text{parent}) = -\tfrac{9}{14}\lg\tfrac{9}{14} - \tfrac{5}{14}\lg\tfrac{5}{14} = 0.940$
Child 1 (7 examples: 3+, 4−): $-\tfrac{3}{7}\lg\tfrac{3}{7} - \tfrac{4}{7}\lg\tfrac{4}{7} = 0.985$
Child 2 (7 examples: 6+, 1−): $-\tfrac{6}{7}\lg\tfrac{6}{7} - \tfrac{1}{7}\lg\tfrac{1}{7} = 0.592$
Weighted average: $\tfrac{7}{14}(0.985) + \tfrac{7}{14}(0.592) = 0.789$
$\text{Gain} = 0.940 - 0.789 = 0.151$
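These numbers can be checked with the helpers sketched earlier (full precision gives 0.1518; the 0.151 above comes from subtracting rounded intermediates):

```python
print(round(entropy(9, 5), 3))    # 0.94
print(round(entropy(3, 4), 3))    # 0.985
print(round(entropy(6, 1), 3))    # 0.592
print(round(information_gain((9, 5), [(3, 4), (6, 1)]), 4))  # 0.1518
```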
How good is this split?
Parent node: 9+, 5−
$\text{Entropy}(\text{parent}) = -\tfrac{9}{14}\lg\tfrac{9}{14} - \tfrac{5}{14}\lg\tfrac{5}{14} = 0.940$
Child 1 (5 examples: 2+, 3−): $-\tfrac{2}{5}\lg\tfrac{2}{5} - \tfrac{3}{5}\lg\tfrac{3}{5} = 0.971$
Child 2 (4 examples: 4+, 0−): $-\tfrac{4}{4}\lg\tfrac{4}{4} - \tfrac{0}{4}\lg\tfrac{0}{4} = 0$ (taking $0 \lg 0 = 0$)
Child 3 (5 examples: 3+, 2−): $-\tfrac{3}{5}\lg\tfrac{3}{5} - \tfrac{2}{5}\lg\tfrac{2}{5} = 0.971$
Weighted average: $\tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) = 0.694$
$\text{Gain} = 0.940 - 0.694 = 0.246$
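The three-way split checks out the same way; note that the pure 4+/0− child contributes zero entropy:

```python
print(round(entropy(2, 3), 3))  # 0.971
print(round(entropy(4, 0), 3))  # 0.0
print(round(information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 4))
# 0.2467 (the 0.246 above comes from rounded intermediates)
```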
The big picture
Start with root
Find attribute to split on with most gain
Recurse
Assessment
How do I know how well my decision tree works?
Training set: the data you use to build the decision tree
Test set: data you did not use for training, used to assess the quality of the decision tree
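A sketch of this assessment for the nested-dict trees from the earlier `build_tree` sketch (illustrative only; attribute values never seen in training are not handled):

```python
def classify(tree, features):
    """Walk the tree from the root to a leaf and return its class label."""
    while isinstance(tree, dict):
        tree = tree["children"][features[tree["attribute"]]]
    return tree

def accuracy(tree, test_set):
    """Fraction of held-out (features, label) pairs classified correctly."""
    return sum(classify(tree, f) == y for f, y in test_set) / len(test_set)
```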
Issues on training and test sets
Do you know the correct classification for the test set?
If you do, why not include it in the training set to get a better classifier?
If you don’t, how can you measure the performance of your classifier?
Cross Validation
Tenfold cross-validation
Ten iterations
Pull a different tenth of the dataset out each time to act as the test set
Train on the remaining training set
Measure performance on the test set
Leave-one-out cross-validation
Similar, but leave only one point out each time, then count correct vs. incorrect
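A minimal sketch of both procedures (assumes the data is already shuffled; `train_fn` and `eval_fn` are placeholders for whatever learner and performance metric you use):

```python
def cross_validate(data, train_fn, eval_fn, k=10):
    """Average test-set performance over k folds; each point is held out once."""
    scores = []
    for i in range(k):
        test = data[i::k]  # every k-th point, offset by the fold index
        train = [d for j, d in enumerate(data) if j % k != i]
        scores.append(eval_fn(train_fn(train), test))
    return sum(scores) / k

# Leave-one-out is the k = len(data) special case:
# cross_validate(data, train_fn, eval_fn, k=len(data))
```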
Noise and Overfitting
Can we always obtain a decision tree that is consistent with the data?
Do we always want a decision tree that is consistent with the data?
Example: predict Carleton students who become CEOs
Features: state/country of origin, GPA letter, major, age, high school GPA, junior high GPA, ...
What happens with only a few features?
What happens with many features?
Overfitting
Fitting a classifier “too closely” to the data: finding patterns that aren’t really there
Prevented in decision trees by pruning
When building trees, stop recursion on irrelevant attributes
Do statistical tests at each node to determine whether to continue splitting
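One common version of that statistical test is a chi-squared test on the children’s class counts; a sketch using SciPy (the 0.05 threshold is a conventional choice, not from the slides):

```python
from scipy.stats import chi2_contingency

def split_is_significant(child_counts, alpha=0.05):
    """Chi-squared test on the (child x class) contingency table.
    child_counts is a list of (positives, negatives) pairs, one per
    child; if the class proportions across children could plausibly
    have arisen by chance, stop recursing instead of splitting."""
    _, p_value, _, _ = chi2_contingency(child_counts)
    return p_value < alpha
```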
Examples of decision trees using Weka
Preventing overfitting by cross validation
Another technique to prevent overfitting (is this valid?):
Keep recursing on the decision tree as long as you continue to get improved accuracy on the test set
Ensemble Methods
Many “weak” learners, when combined, can perform more strongly than any one by itself
Bagging & Boosting: many different learners voting on the classification
Multiple algorithms, or different features, or both
Bagging / Boosting
Bagging: vote to determine the answer
Run one algorithm on random subsets of the data to obtain multiple classifiers
Boosting: weighted vote to determine the answer
Each iteration, weight more heavily the data that the learner got wrong
What does it mean to “weight more heavily” for kNN? For decision trees?
AdaBoost is recent (1997) and has quickly become popular
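A sketch of the reweighting step in AdaBoost.M1 (correct examples are scaled down by β = error/(1 − error), which after renormalization is the same as weighting the mistakes more heavily; names are illustrative):

```python
def adaboost_reweight(weights, correct, error):
    """One boosting round: weights is the current example weighting,
    correct a parallel list of booleans from this round's weak learner,
    and error its weighted error rate, assumed to lie in (0, 0.5)."""
    beta = error / (1 - error)  # < 1, so correct examples shrink
    new = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]  # renormalize to sum to 1
```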
Computational Learning Theory
Chapter 20 up next
Moving on to Chapter 20: statistical learning methods
Skipping ahead; we will revisit earlier topics (perhaps) near the end of the course
20.5: Neural Networks
20.6: Support vector machines