Transcript Keith.ppt

Feature Selection on Time-Series Cab Data
Yingkit (Keith) Chow
Contents
• Introduction
• Features Considered
• FCBF (Filter-type feature selection)
• FCBF-PCA (my variation)
• Conclusion
All Features Considered
• Features
– Each time sample consists of the following features:
• Day of Week, Time of Day (the first two features)
• taxis[t, 6:9], taxis[t-1, 6:9], …, taxis[t-5, 6:9]
– [6:9] indexes the columns of the matrix taxis, which hold the cabs entering with meter off, entering with meter on, exiting with meter off, and exiting with meter on
– Not all of these features will be relevant to classifying whether a game is present.
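The feature layout above can be sketched in Python as follows. This is only an illustration: the function name, the argument names, and the assumption that the four cab columns have already been sliced out of taxis are not from the slides.

```python
import numpy as np

def build_lagged_features(taxis, day_of_week, time_of_day, n_lags=5):
    """Stack the calendar features with current and lagged cab counts.

    taxis: (T, 4) array with the four cab columns (enter off, enter on,
           exit off, exit on). Assumption: this is taxis[:, 6:9] in the
           slides' 1-based, inclusive indexing, sliced out beforehand.
    Returns an array of shape (T - n_lags, 2 + 4 * (n_lags + 1)).
    """
    taxis = np.asarray(taxis, dtype=float)
    n_samples = taxis.shape[0]
    rows = []
    for t in range(n_lags, n_samples):
        # Counts at t, t-1, ..., t-n_lags, flattened in that order.
        window = taxis[t - n_lags:t + 1][::-1].ravel()
        rows.append(np.concatenate(([day_of_week[t], time_of_day[t]], window)))
    return np.array(rows)
```

With n_lags = 5 each sample has 2 + 4 × 6 = 26 features, and the first 5 time steps are dropped because they lack a full history.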
Fast Correlation-Based Filter
Algorithm:
1. Find the features that are relevant (SU(i, C) > threshold for feature i and class C),
• where SU is symmetric uncertainty, described in the next slide
2. Remove redundant features by comparing the features remaining after the first step
• Remove feature j if SU(i, j) >= SU(j, C) for a retained feature i that ranks at least as high (SU(i, C) >= SU(j, C)) [1]
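A minimal sketch of the two steps, assuming the SU values have already been computed. The function name and the inputs su_fc (SU of each feature with the class) and su_ff (pairwise SU between features) are hypothetical names chosen for illustration, not the code used for the results in these slides.

```python
import numpy as np

def fcbf_select(su_fc, su_ff, threshold=0.01):
    """Return the indices of the features kept by the two FCBF steps.

    su_fc[i]    = SU(feature i, class)
    su_ff[i, j] = SU(feature i, feature j)
    """
    su_fc = np.asarray(su_fc, dtype=float)
    su_ff = np.asarray(su_ff, dtype=float)
    # Step 1: keep relevant features, ordered by decreasing SU with the class.
    ranked = [int(i) for i in np.argsort(-su_fc, kind="stable")]
    order = [i for i in ranked if su_fc[i] > threshold]
    selected = []
    removed = set()
    for pos, i in enumerate(order):
        if i in removed:
            continue
        selected.append(i)
        # Step 2: drop lower-ranked features that feature i makes redundant.
        for j in order[pos + 1:]:
            if j not in removed and su_ff[i, j] >= su_fc[j]:
                removed.add(j)
    return selected
```

Processing features in decreasing order of SU(i, C) means a feature can only be removed by one that is at least as relevant to the class.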
Equations
• Information Gain (IG)
– IG(X|Y) = H(X) – H(X|Y)
• Symmetric Uncertainty (SU)
– SU(X,Y) = 2 * IG(X|Y) / [H(X)+H(Y)]
• SU is used instead of IG because IG is biased toward features with more values; SU compensates for this bias and normalizes the result to the range [0, 1] [1]
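The two equations can be evaluated directly for discrete (already binned) variables. The sketch below is one straightforward way to do it; the function names are illustrative, and returning 0 when both variables are constant is a convention chosen here, not something stated in the slides.

```python
import numpy as np

def entropy(x):
    """Shannon entropy H(X) in bits of a discrete sequence."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x, y):
    """H(X|Y) = sum over values v of Y of p(Y=v) * H(X | Y=v)."""
    x = np.asarray(x)
    y = np.asarray(y)
    h = 0.0
    for value in np.unique(y):
        mask = y == value
        h += mask.mean() * entropy(x[mask])
    return h

def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2 * IG(X|Y) / [H(X) + H(Y)], with IG(X|Y) = H(X) - H(X|Y)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables constant: treated as uninformative (assumption)
    ig = hx - conditional_entropy(x, y)
    return 2.0 * ig / (hx + hy)
```

Two identical variables give SU = 1 and two independent variables give SU = 0, which is the normalization that makes SU values comparable across features.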
FCBF
• Classifier: MATLAB classify (linear)
• Number Bins = 96
• Threshold = 0.01
• Accuracy = 91.9%
Choice of Number Bins
• Num Bins = 96
– Results shown in the previous slide (red is the ground truth of the game and blue is my classification)
• Num Bins = 20
– Accuracy = 58.6%
– Here the algorithm breaks down and chooses only feature 2, the “time of day”. The blue is periodic here: the same time segment is classed as a game every day.
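The slides do not say how the features are binned before SU is computed. As an illustration only, the sketch below assumes equal-width bins (the function name is hypothetical) and shows how the number of bins changes what a 96-valued variable such as time of day looks like after discretization.

```python
import numpy as np

def discretize(column, num_bins):
    """Map a numeric column to integer labels 0..num_bins-1.

    Assumption: equal-width bins between the column's min and max.
    """
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:
        return np.zeros(column.shape, dtype=int)
    edges = np.linspace(lo, hi, num_bins + 1)
    # Use only the interior edges so the maximum falls in the last bin.
    return np.digitize(column, edges[1:-1])

time_of_day = np.arange(96)            # 96 discrete values, as in the slides
labels_96 = discretize(time_of_day, 96)
labels_20 = discretize(time_of_day, 20)
```

Under this assumption, 96 bins keep every time-of-day value distinct, while 20 bins merge neighbouring values, so the entropies and SU values of the same feature differ between the two settings.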
FCBF - PCA
• FCBF compares individual features with each other
• We can use PCA to try to capture a group of features. (For example, one eigenvector might capture the shape of the number of cabs entering with meters on before a game, or the increase in the number of cabs entering with meters off before the end of a game.)
– Example shown in the next slide
Cab Traffic Behavior
• Before Start of Game
– Cab On Enter and Cab
Off Exit are high
• Towards End of Game
– Cab Off Enter and Cab
On Exit are high
FCBF-PCA
• Classifier: MATLAB classify (linear)
• Number Bins = 20
• Threshold = 0.01
• Accuracy = 92.9%
• Note: the features here are projections onto the eigenvectors, not the original feature dimensions
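The projection step might look like the sketch below. It assumes PCA is run on the full, mean-centred feature matrix and that the projected columns are then binned and passed to FCBF in place of the original features; the slides do not give these details, and the function name is illustrative.

```python
import numpy as np

def pca_project(X, n_components):
    """Project samples onto the leading eigenvectors of the covariance matrix.

    Returns (projections, components, mean). Each row of `components`
    is one eigenvector, ordered by decreasing explained variance.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # The right singular vectors of the centred data are the eigenvectors
    # of its covariance matrix.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]
    return Xc @ components.T, components, mean
```

Each projected column mixes several original features, which is how a single selected "feature" can represent a joint pattern such as several lagged cab counts rising together.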
Conclusions
• The choice of the number of bins has an enormous impact on the performance of FCBF (possibly due to the 96 discrete values of the time of day variable).
• FCBF-PCA was less susceptible to the choice of numBins (numBins of 10, 20, and 100 all resulted in approximately 91% accuracy).
Future Work
• Currently using labels of game or not game.
– I’ll try to make it work for detecting the first sample of a game, with another classifier to detect the last sample of a game, since the mid-game generally has entirely different characteristics from the beginning and end of a game. However, I might be limited by the number of samples.
Questions
• I’m not currently in NYC so please send
questions or comments to:
– [email protected]
Citations
1. “Feature Selection for High Dimensional Data: A Fast Correlation-Based Filter Solution”, by Lei Yu and Huan Liu, ICML (2003)
2. “Efficient Feature Selection via Analysis of Relevance and Redundancy”, by Lei Yu and Huan Liu, Journal of Machine Learning Research 5 (2004)