Decision Trees - Ohio State Linguistics


Classifiers and Machine Learning
Data Intensive Linguistics
Spring 2008
Ling 684.02
Decision Trees





What does a decision tree do?
How do you prepare training data?
How do you use a decision tree?
The traditional example is a tiny data
set about weather.
Here I use Wagon; many other similar packages exist.
Decision processes







Challenge: Who am I?
Q: Are you alive?  A: Yes
Q: Are you famous?  A: Yes
…
Q: Are you a tennis player?  A: No
Q: Are you a golfer?  A: Yes
Q: Are you Tiger Woods?  A: Yes
Decision trees



Played rationally, this game has the
property that each binary question
partitions the space of possible entities.
Thus, the structure of the search can
be seen as a tree.
Decision trees are encodings of a
similar search process. Usually, a wider
range of questions is allowed.
Decision trees


In a problem solving setup, we’re not
dealing with people, but with a finite
number of classes for the predicted
variable.
But the task is essentially the same: given a set of available questions, narrow down the possibilities until you are confident about the class.
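To make the analogy concrete, here is a minimal sketch in Python of what such a search looks like once it is fixed as a tree: a nested sequence of questions about the attributes, ending in a class. The attribute names and the humidity threshold are invented for illustration; this is not the tree Wagon learns from the weather data.

def classify(instance):
    # A hypothetical hand-written decision tree for a weather-style task.
    # Each nested question narrows down the possible classes.
    if instance["outlook"] == "overcast":
        return "yes"                      # one question was enough here
    elif instance["outlook"] == "sunny":
        return "no" if instance["humidity"] > 75 else "yes"
    else:                                 # rainy
        return "no" if instance["windy"] else "yes"

print(classify({"outlook": "sunny", "humidity": 80, "windy": False}))   # -> no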
How to choose a question?

Look for the question that most
increases your knowledge of the class



We can’t tell ahead of time which answer will
arise.
So take the average over all possible answers,
weighted by how probable each answer seems
to be.
The maths behind this is either information
theory or an approximation to it.
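As a rough sketch of the idea (not Wagon's actual code), the "average over all possible answers" is the expected entropy of the class after asking the question, weighted by how often each answer occurs; the question with the largest drop in entropy (the information gain) is chosen. The toy labels and answers below are invented.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def expected_entropy(answers, labels):
    # Average entropy of the class after the question, weighted by how
    # probable each answer is in the current data.
    total = len(labels)
    by_answer = {}
    for a, lab in zip(answers, labels):
        by_answer.setdefault(a, []).append(lab)
    return sum(len(part) / total * entropy(part) for part in by_answer.values())

labels  = ["yes", "yes", "no", "no", "yes"]
answers = ["sunny", "overcast", "sunny", "rainy", "overcast"]
gain = entropy(labels) - expected_entropy(answers, labels)
print(round(gain, 3))   # -> 0.571 for this toy split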
How to be confident


Be confident if a simple majority classifier would achieve acceptable performance on the data in the current partition.
Obvious generalization (Kohavi): be
confident if some other baseline
classifier would perform well enough.
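A minimal sketch of that stopping test, assuming "acceptable performance" just means a threshold on the accuracy a majority-class guesser would achieve in the current partition (the 0.9 threshold is an arbitrary choice for illustration):

from collections import Counter

def confident_enough(labels, threshold=0.9):
    # Stop splitting if always guessing the most common class in this
    # partition would already be right often enough.
    majority_share = Counter(labels).most_common(1)[0][1] / len(labels)
    return majority_share >= threshold

print(confident_enough(["no", "no", "no", "yes"]))    # False (0.75)
print(confident_enough(["no"] * 19 + ["yes"]))        # True  (0.95)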
Input data
Data format



Each row of the table is an instance
Each column of the table is an attribute
(or feature)
You also have to say which attribute is
the predictee or class variable. In this
case we choose Playable.
Attribute types


We also need to understand the types
of the attributes.
For the weather data:



Windy and Playable look boolean
Temperature and Humidity look as if they can
take any numerical value
Cloudy looks as if it can take any of “sunny”, “overcast”, “rainy”
Wagon description files


Because guessing the range of an
attribute is tricky, Wagon instead
requires you to have a “description file”
Fortunately (especially if you have big
data files), Wagon also provides
make_wagon_desc which makes a
reasonable guess at the desc file.
For the weather data
(
  (outlook overcast rainy sunny)
  (temperature float)
  (humidity float)
  (windy FALSE TRUE)
  (play no yes)
)
(needed a little help: replacing lists of numbers with “float”)
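For reference, the matching data file would presumably look something like this: one instance per line, whitespace-separated fields in the order given by the description file. The rows below are invented for illustration; they are not the actual weather.dat used in the course.

overcast 83 86 FALSE yes
sunny    85 85 FALSE no
rainy    70 96 FALSE yes
rainy    65 70 TRUE  no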
Commands for Wagon


wagon -data weather.dat -desc weather.desc -o weather.tree
This produces unsatisfying results, because we need to tell it that the data set is small by setting -stop 2 (or else it notices that there are < 50 examples in the top-level tree, and doesn't build a tree).
Using the stopping criterion
wagon -data weather.dat \
      -desc weather.desc \
      -o weather.tree \
      -stop 1
This allows the system to learn the
exact circumstances under which Play
takes on particular values.
Using Wagon to classify
wagon_test -data weather.dat \
-tree weather.tree \
-desc weather.desc \
-predict play
Output data
Over-training


-stop 1 is over-confident, because it
might build a leaf for every quirky
example.
There will be other quirky examples
once we move to new data. Unless we
are very lucky, what is learnt from the
training set will be too detailed.
Over-training 2



The bigger -stop is, the more errors the system will commit on the training data.
Conversely, the smaller -stop is, the more likely the tree is to learn irrelevant detail.
The risk of overtraining grows as -stop
shrinks, and as the set of available
attributes increases.
Why over-training hurts




If you have a complex attribute space,
your training data will not cover
everything.
Unless you learn general rules, new
instances will not be correctly classified.
Also, the system's estimates of how well
it is doing will be very optimistic.
This is like doing Linguistics but only on
written, academic English...
Setting -stop automatically


Split training data in two. Use the first half to train, trying several different values for -stop.
Use the second half for cross-validation: measure the performance of the various trees learnt.
(Diagram: the data split into Train, Tune and Test portions.)
Cross-validation
If the performance gain generalizes to the cross-validation half, then it probably also generalizes to unseen data.
Any problems?
Data efficiency
Train/tune split is wasteful.
Reduce the tuning part to 10% of the data. Train on 90%.
Rotate the 10% through the training data.
Cross-validation
(Diagram: the Tune slice rotates through successive 10% portions of the data, with Train on the remainder and Test held out throughout.)
Cross-validation


Because the tuning set was 10%, this is 10-fold cross-validation. 20-fold would be 5%.
In the limit (very expensive or small
training data) we have “leave one out”
cross-validation.
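Independent of Wagon, the rotation itself is easy to sketch in Python: for each candidate stop value, every fold takes one turn as the tuning set while the rest is used for training, and the tuning scores are averaged. train_and_score is a hypothetical stand-in for "train a tree with this stop value and measure its accuracy".

def k_fold_indices(n, k=10):
    # Yield (train_idx, tune_idx) pairs; each fold is the tuning set exactly once.
    fold = n // k
    for i in range(k):
        tune = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in tune]
        yield train, tune

def choose_stop(data, stop_values, train_and_score, k=10):
    # Pick the -stop value with the best average tuning-set score.
    best = None
    for stop in stop_values:
        scores = [train_and_score([data[j] for j in tr], [data[j] for j in tu], stop)
                  for tr, tu in k_fold_indices(len(data), k)]
        avg = sum(scores) / len(scores)
        if best is None or avg > best[1]:
            best = (stop, avg)
    return best[0]

Note that n // k drops any remainder; a real implementation would spread the leftover instances across the folds.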
Clustering with decision trees


The standard stopping criterion is purity of the
classes at the leaves of the trees.
Another criterion uses a distance matrix measuring the dissimilarity of instances. Stop when the groups of instances at the leaves form tight clusters.
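A sketch of that alternative criterion, assuming we are handed a precomputed distance matrix dist indexed by instance number; the 0.5 threshold is an arbitrary illustration:

def tight_cluster(indices, dist, threshold=0.5):
    # True if the instances at a leaf form a tight cluster, i.e. their
    # average pairwise distance is below the threshold.
    pairs = [(i, j) for i in indices for j in indices if i < j]
    if not pairs:
        return True
    avg = sum(dist[i][j] for i, j in pairs) / len(pairs)
    return avg < threshold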
What are decision trees?



A decision tree is a classifier. Given an input
instance it inspects the features and delivers a
selected class.
But it knows slightly more than this. The set of
instances grouped at the leaves may not be a
pure class. This set defines a probability
distribution over the classes. So a decision tree is
a distribution classifier.
There are many other varieties of classifier.
Nearest neighbour(s)




If you have a distance measure, and you have a
labelled training set, you can assign a class by
finding the class of the nearest labelled instance.
Relying on just one labelled data point could be
risky, so an alternative is to consider the classes
of k neighbours.
You need to find a suitable distance measure.
You might use cross-validation to set an
appropriate value of k
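A minimal k-nearest-neighbour sketch in Python, assuming a labelled training set and a distance function; the vote is a simple majority over the k closest labelled instances, and the 2-D points are invented:

from collections import Counter

def knn_classify(x, training, distance, k=3):
    # training is a list of (instance, label) pairs.
    # Return the majority label among the k nearest labelled instances.
    neighbours = sorted(training, key=lambda pair: distance(x, pair[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy usage with Euclidean distance on 2-D points.
dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
train = [((0, 0), "no"), ((0, 1), "no"), ((5, 5), "yes"), ((6, 5), "yes"), ((5, 6), "yes")]
print(knn_classify((4, 4), train, dist, k=3))   # -> yes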
Bellman's curse



Nearest neighbour classifiers make sense if
classes are well-localized in the space defined by
the distance measure.
As you move from lines to planes, volumes and
high-dimensional hyperplanes, the chance that
you will find enough labelled data points “close
enough” decreases.
This is a general problem, not specific to nearest neighbour, and is known as Bellman's curse of dimensionality.
Dimensionality



If we had uniformly spread data and we wanted to catch 10% of the data, we would need 10% of the range of x in a 1-D space, but 31% of the range of each of x and y in a 2-D space, and 46% of the range of x, y and z in a cube. In 10 dimensions you need to cover ~80% of the ranges (see the short check after this slide).
In high dimensional spaces, most data points are
closer to the boundaries of the space than they
are to any other data point
Text problems are very often high-dimensional
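The figures in the first bullet follow from the fact that, for uniformly spread data, catching a fraction p of the data in d dimensions needs a fraction p^(1/d) of the range along each axis. A short check in Python:

# Fraction of each axis needed to capture 10% of uniformly spread data:
# the side of a sub-cube with volume 0.10 in d dimensions is 0.10 ** (1/d).
for d in (1, 2, 3, 10):
    print(d, round(100 * 0.10 ** (1 / d), 1), "%")
# -> 10.0 %, 31.6 %, 46.4 %, 79.4 % of each axis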
Decision trees in high-D space




Decision trees work by picking on an important
dimension and using it to split the instance space
into slices of lower dimensionality.
They typically don't use all the dimensions of the
input space.
Different branches of the tree may select different
dimensions as relevant.
Once the subspace is pure enough, or well
enough clustered, the DT is finished.
Cues to class variables



If we have many features, any one could be a
useful cue to the class variable. (If the token is a
single upper case letter followed by a ., it might
be part of A. Name)
If cues conflict, we need to decide which ones to
trust. “... S. p. A. In other news”
In general, we may need to take account of combinations of features (the “[A-Z]\.” feature is relevant only if we haven't already found abbreviations).
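For instance, the “[A-Z]\.” cue can be computed with a one-line regular expression; the token strings in the example are invented:

import re

def looks_like_initial(token):
    # True if the token is a single upper-case letter followed by a period,
    # a possible cue that it is part of a name like "A. Name".
    return re.fullmatch(r"[A-Z]\.", token) is not None

print([t for t in ["A.", "S.", "In", "news", "p."] if looks_like_initial(t)])   # ['A.', 'S.']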
Compound cues



Unfortunately there are very many potential
compound cues. Training them all separately will
throw us into a very high-D space.
The naive Bayes classifier “deals with” this by adopting very strong assumptions about the relation between the features and the underlying class.
Assumption: each feature is independently affected by the class; nothing else matters.
The naïve Bayes classifier
(Diagram: class node C with an arrow to each of the features F1, F2, ..., Fn.)

P(F1, F2, ..., Fn | C) ≈ P(F1 | C) P(F2 | C) ... P(Fn | C)

Classify by finding the class with the highest score given the features and this (crass) assumption.
Nice property: easy to train, just count the number of times that Fi and the class co-occur.
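A minimal naïve Bayes sketch built on exactly that counting idea. It uses no smoothing and treats the features as a flat set of (feature, value) pairs; both simplifications are mine, not the slide's.

from collections import Counter, defaultdict

def train_nb(instances):
    # instances: list of (feature_dict, class_label).
    # Just count class frequencies and (class, feature, value) co-occurrences.
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    for feats, c in instances:
        class_counts[c] += 1
        for f, v in feats.items():
            feat_counts[c][(f, v)] += 1
    return class_counts, feat_counts

def classify_nb(feats, class_counts, feat_counts):
    # Score each class by P(C) * product of P(Fi | C), per the independence assumption.
    total = sum(class_counts.values())
    best_c, best_score = None, -1.0
    for c, n in class_counts.items():
        score = n / total
        for f, v in feats.items():
            score *= feat_counts[c][(f, v)] / n   # counted estimate of P(Fi | C)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

data = [({"outlook": "sunny", "windy": "FALSE"}, "no"),
        ({"outlook": "overcast", "windy": "FALSE"}, "yes"),
        ({"outlook": "rainy", "windy": "TRUE"}, "no"),
        ({"outlook": "rainy", "windy": "FALSE"}, "yes")]
counts = train_nb(data)
print(classify_nb({"outlook": "rainy", "windy": "FALSE"}, *counts))   # -> yes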
Comments on naïve Bayes




Clearly, the independence assumption is false.
All features, relevant or not, get the same chance
to contribute. If there are many irrelevant
features, they may swamp the real effects we are
after.
But it is very simple and efficient, so can be used
in schemes such as boosting that rely on
combinations of many slightly different classifiers.
In that context, even simpler classifiers (majority
classifier, single rule) can be useful
Decision trees and classifiers






Attributes and instances
Learning from instances
Over-training
Cross-validation
Dimensionality
Independence assumptions