Transcript Document

A quick tour of the datasets for
VLDB 2008
(does not include datasets already in
the UCR archive)
Formatting Note
I measured the accuracy of 1NN-ED
on the training set (only).
This was to make sure we do not
have any formatting
misunderstandings
You should test the 1NN-ED on the
training set (only), and see if you
get the same answers. Do this first,
otherwise we may waste time.
Number of training objects
80
Number of testing objects
2320
Number of classes
8
Length of time series
1024
Euclidean Distance accuracy
95.05%
Some Name
The dataset came from blah
blah blah blah
Why is it difficult?
• Blah blah
• Blah blah
• Blah blah
This is the one-nearest-neighbor
Euclidean distance accuracy for
just the training set, measured
using leave-one-out.
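A minimal sketch of this check, assuming the training set is already loaded as a list of equal-length series plus a parallel list of labels (loading is omitted because the file format is not described here). The function names and the toy data are made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def loo_1nn_accuracy(series, labels):
    """Leave-one-out accuracy of 1NN with Euclidean distance."""
    correct = 0
    for i, query in enumerate(series):
        best_dist, best_label = math.inf, None
        for j, candidate in enumerate(series):
            if i == j:
                continue  # hold out the query itself
            d = euclidean(query, candidate)
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        correct += best_label == labels[i]
    return correct / len(series)

# Toy data (made up): two well-separated classes.
toy_series = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [5.0, 5.1, 5.0], [5.1, 5.0, 5.1]]
toy_labels = [1, 1, 2, 2]
print(loo_1nn_accuracy(toy_series, toy_labels))
```

Multiply the result by 100 to compare it with the "Euclidean Distance accuracy" figure quoted for each dataset.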
MALLAT TECHNOMETRICS
Why is it difficult?
This figure is from [a]. The only changes we
made were to flip the data left to right and to
z-normalize it.
Number of training objects
55
Number of testing objects
2345
Number of classes
8
Length of time series
1024
Euclidean Distance accuracy
98.18%
• Many classes
• Some classes are globally similar, and
have only local differences.
• Small training set (In [a], using 1024
instances for training, a decision tree
got 96.87% accuracy. Since this was
too easy, we reduced the size of the
training set significantly).
This dataset is described in Mallat, S. G. (1998), A Wavelet Tour of Signal
Processing, San Diego: Academic Press. However the data we used was donated by
Jeong [a].
The data was obtained by randomly choosing 55 objects for the training set and
choosing the rest for the testing set. Each time series was also reversed.
[a] M. K. Jeong, J. C. Lu, X. Huo, B. Vidakovic, and D. Chen (2006), "Wavelet-based Data Reduction Techniques for Process Fault Detection," Technometrics,
48(1), 26-40. http://web.utk.edu/~mjeong/
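The preparation described above (random split, reversal, z-normalization) could be sketched as follows. The seed, the helper names, and the use of the population standard deviation are assumptions; the slide does not say which were used.

```python
import math
import random

def z_normalize(series):
    """Rescale a series to mean 0 and (population) standard deviation 1."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)  # constant series: nothing to scale
    return [(x - mean) / std for x in series]

def prepare_split(dataset, n_train, seed=0):
    """dataset is a list of (label, series) pairs; returns (train, test)."""
    # Flip each series left to right, then z-normalize it.
    processed = [(label, z_normalize(series[::-1])) for label, series in dataset]
    # Randomly choose n_train objects for training; the rest are for testing.
    random.Random(seed).shuffle(processed)
    return processed[:n_train], processed[n_train:]

# Toy data (made up): 10 short series with alternating labels.
toy = [(i % 2 + 1, [float(i), float(i + 1), float(i + 3)]) for i in range(10)]
train, test = prepare_split(toy, n_train=3)
print(len(train), len(test))
```

For this dataset, n_train would be 55, with the remaining 2345 objects used for testing.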
ItalyPowerDemand (3 years)
Task
Distinguish days in October through March
(inclusive) from days in April through September.
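As a small illustration of the labeling rule, here is a sketch that maps a calendar day to one of the two classes. The class names "Oct-Mar" and "Apr-Sep" and the function name are made up; the class numbering used in the data files is not stated here.

```python
from datetime import date

COLD_MONTHS = {10, 11, 12, 1, 2, 3}  # October through March, inclusive

def season_class(day):
    """Return a (hypothetical) class name for a calendar day."""
    return "Oct-Mar" if day.month in COLD_MONTHS else "Apr-Sep"

# Two adjacent days that fall on opposite sides of the boundary.
print(season_class(date(2000, 9, 30)), season_class(date(2000, 10, 1)))
```

The two days in the example are one day apart yet carry different labels, which is the borderline case that makes this task harder than it looks.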
Why is it difficult?
• Borderline days (late Sep vs early Oct)
• Unusual days (soccer games etc)
• Undersampled data?
• August is radically different to the rest of
the summer months.
From Keogh ICDM06
Number of training objects
67
Number of testing objects
1029
Number of classes
2
Length of time series
24
Euclidean Distance accuracy
95.522%
See Keogh ICDM06
Eamonn Keogh, Li Wei, Xiaopeng Xi, Stefano Lonardi, Jin Shieh, Scott Sirowy
(2006). Intelligent Icons: Integrating Lite-Weight Data Mining and Visualization into
GUI Operating Systems. ICDM 2006.
CinC_ECG_torso
Task
Data is taken from ECG data for multiple
torso-surface sites. There are 4 classes
(4 different people)
Why is it difficult?
• See gray strip on figure. Depending on
location on the body, the peak can be
positive, neutral or negative. Similar
remarks apply to all features.
• The figure shows aligned data, but the
challenge data is slightly out of alignment.
Number of training objects
40
Number of testing objects
1380
Number of classes
4
Length of time series
1639
Euclidean Distance accuracy
85.00%
Haptics
Task
Data is taken from 5 people entering their
“passgraph” on a touchscreen. We only
consider the X axis.
Why is it difficult?
[Figure: 4 sample time series (before normalizing)]
Number of training objects
155
Number of testing objects
308
Number of classes
5
Length of time series
1092
Euclidean Distance accuracy
51.61%
• Small training set
• I think (but have not checked this) that
the high variability at the beginning and
end of the time series is just noise.
• We are looking at only the X-axis for
simplicity; we should also be looking at the Y-axis,
pen pressure, pen acceleration…
Novel Shoulder-Surfing Resistant Haptic-based Graphical Password
Behzad Malek, Mauricio Orozco, Abdulmotaleb El Saddik
Symbols
Task
Thirteen people participated in this
experiment. They were asked to copy the
randomly appearing symbol as best they
could. There were 3 possible symbols, and
each person contributed about 30
attempts.
Why is it difficult?
[Figure: sample time series in two panels, X-axis and Y-axis]
Number of training objects
25
Number of testing objects
995
Number of classes
6
Length of time series
398
Euclidean Distance accuracy
84.0%
• Individuality of the 13 individuals
• Each of the 6 classes looks only at the
X or Y axis; we really should have 3
classes looking at both the X and Y axes
• Two of the symbols are very similar
on the Y-axis
• Small training set
This dataset was created for the contest by Jill Brady, a
grad student at UCR. We gratefully acknowledge her.
MedicalImages
Task
The data are histograms of pixel intensity
of medical images. The classes are
“different human body regions.”
Why is it difficult?
Number of training objects
381
Number of testing objects
760
Number of classes
10
Length of time series
99
Euclidean Distance accuracy
72.178%
• It is not clear that treating the raw data
as time series is the best overall
approach for this problem, but the
original authors do report success with a
“time warping” measure.
• The original time series are of different
lengths, and some are very short; making
them all the same length may have
introduced artifacts.
This dataset was donated by Joaquim C. Felipe, Agma
J. M. Traina and Caetano Traina Jr.
SonyAIBORobotSurface
Task
The robot has roll/pitch/yaw
accelerometers; here we looked at just the X-axis.
The task is to detect the surface being
walked on.
Why is it difficult?
• Noisy data
• Small training set. See figure at left, with
enough data it looks easy.
Red: Cement. Blue: Carpet.
Number of training objects
20
Number of testing objects
601
Number of classes
2
Length of time series
70
Euclidean Distance accuracy
90.0%
This dataset was donated by Manuela Veloso and
Douglas Vail of Carnegie Mellon University
SonyAIBORobotSurfaceII
Task
The robot has roll/pitch/yaw
accelerometers; here we looked at just the Z-axis.
The task is to detect the surface being
walked on.
Why is it difficult?
• Noisy data
• Small training set. See figure at left, with
enough data it looks easier.
Red: Cement. Blue: Carpet or Field.
Number of training objects
27
Number of testing objects
953
Number of classes
2
Length of time series
65
Euclidean Distance accuracy
85.185%
This dataset was donated by Manuela Veloso and
Douglas Vail of Carnegie Mellon University
TwoLeadECG
Task
The time series are taken from the MIT-BIH Long-Term
ECG Database (ltdb), record
ltdb/15814, beginning at time 420 and ending at
1019. The task is to distinguish between
signal 0 and signal 1.
Why is it difficult?
• Subtle distinctions
• Small training set
• Beat extractor does not produce perfect
alignment, but after using EM to align the
signal (figure at left) it is clear that certain
parts of the signal are more informative.
Number of training objects
23
Number of testing objects
1139
Number of classes
2
Length of time series
82
Euclidean Distance accuracy
78.261%
StarLightCurves
Task
Time series are star light curves falling
into three classes.
Why is it difficult?
• Two of the three classes are quite
similar.
• Large dataset (but the real datasets
have billions of these!)
• Phase was aligned using standard
astronomy tricks. However, when we tried
circular shift invariant Euclidean distance
(see [a]), our accuracy improved,
suggesting the alignment is not perfect.
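To make the idea concrete, here is a brute-force sketch of a circular shift invariant Euclidean distance: take the minimum distance over every rotation of one series. This is only the naive definition; it is not the faster lower-bounding approach of [a], and the function names are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def circular_shift_ed(a, b):
    """Minimum Euclidean distance between a and every circular shift of b."""
    # b[k:] + b[:k] rotates the list b to the left by k positions.
    return min(euclidean(a, b[k:] + b[:k]) for k in range(len(b)))

# Toy example (made up): b is a rotated by two positions.
a = [0.0, 1.0, 2.0, 3.0]
b = [2.0, 3.0, 0.0, 1.0]
print(euclidean(a, b), circular_shift_ed(a, b))
```

This costs O(n^2) per pair of series of length n, so for series of length 1024 it is roughly a thousand times slower than plain Euclidean distance.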
Number of training objects
1000
Number of testing objects
8236
Number of classes
3
Length of time series
1024
Euclidean Distance accuracy
86.00%
Legend: 1 - CEPH, 2 - EB, 3 - RRL
[a] Eamonn Keogh, Li Wei, Xiaopeng Xi, Sang-Hee Lee and Michail Vlachos
(2006). LB_Keogh Supports Exact Indexing of Shapes under Rotation
Invariance with Arbitrary Representations and Distance Measures. VLDB 2006.
DiatomSizeReduction
[Figure labels: Gomphonema augur, Eunotia tenella, (many omitted), Fragilariforma bicapitata, Stauroneis smithii [b]]
Task
“Each successive generation of a clonally
reproducing diatom is slightly smaller
than its forebears.” [a]
Why is it difficult?
• Small training set
• Possible errors caused by image
processing step.
• Change in scale of diatoms shows up as
“warping”.
Number of training objects
16
Number of testing objects
306
Number of classes
4
Length of time series
345
Euclidean Distance accuracy
93.75%
[a] http://rbg-web2.rbge.org.uk/DIADIST/index.htm?srseries.htm&main
[b] Xiaopeng Xi, et al (2007). Finding Motifs in Database of Shapes. SDM'07
Motes
Task
Sensor data used in paper [b].
Here the task is to distinguish between
sensor q8calibHumid and sensor
q8calibHumTemp.
The raw data has dropouts, which I left in.
Why is it difficult?
Number of training objects
20
Number of testing objects
1252
Number of classes
2
Length of time series
84
Euclidean Distance accuracy
75.00%
• Small training set.
• Lots of dropouts (however, when noise
is removed, should be very easy).
• Here the dropouts had value zero. But
after z-normalization these values
changed. It would have been easier to do
smart smoothing if the data was not
normalized.
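A small sketch of why z-normalization hides the dropouts: zeros in the raw series all map to the same non-zero value, and that value differs from series to series because it depends on each series' mean and standard deviation. The readings below are made up, and the population standard deviation is an assumption.

```python
import math

def z_normalize(series):
    """Rescale a series to mean 0 and (population) standard deviation 1."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)  # constant series: nothing to scale
    return [(x - mean) / std for x in series]

# Made-up readings; 0.0 marks a dropout.
raw = [20.0, 21.0, 0.0, 22.0, 0.0, 21.0]
normalized = z_normalize(raw)

# Collect the values that the dropouts were mapped to.
dropout_positions = [i for i, x in enumerate(raw) if x == 0.0]
dropout_values = {round(normalized[i], 6) for i in dropout_positions}
print(dropout_values)
```

In the raw data a test for zero finds the dropouts directly; after normalization one would have to look for a repeated extreme value in each series instead.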
[a] Raw data from Carlos Guestrin (CMU), Classification version by Keogh
[b] Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos: Online Latent
Variable Detection in Sensor Networks. ICDE 2005: 1126-1127
ChlorineConcentration
Task
Sensor data used in paper [b].
Multiple sensors have spatial correlation,
which I arbitrarily divided into 3 sets.
Why is it difficult?
Number of training objects
487
Number of testing objects
3840
Number of classes
3
Length of time series
166
Euclidean Distance accuracy
63.383%
• The borderline cases are hard to
classify. However, with more data it would
be easy. For example, when I randomly
sample k items from the labeled test set
and do 1NN-ED classification, I get:
k = 1000 -> 76.5% accuracy
k = 2000 -> 89.85% accuracy
k = 3000 -> 96.8% accuracy
[a] Stacia Thompson and Jeanne M. VanBriesen (CMU) Classification version by Keogh
[b] Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos: Online Latent Variable Detection in
Sensor Networks. ICDE 2005: 1126-1127
ECGFiveDays
Task
[Figure: excerpt of Class 1, showing wandering baseline]
Data is from a 67-year-old male. The two
classes are simply
1) ECG date: 12/11/1990
2) ECG date: 17/11/1990
Why is it difficult?
Number of training objects
23
Number of testing objects
861
Number of classes
2
Length of time series
136
Euclidean Distance accuracy
82.609%
• Wandering baseline was not removed;
this shows up as linear drift.
• Beat extractor does not produce
perfect alignment, but after using EM
to align the signal (figure at left) it is
clear that certain parts of the signal
are more informative.
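Since the drift is described as linear, one option (not something that was applied to the contest data) is to subtract a least-squares line from each series before classification. A sketch, with a made-up function name:

```python
def remove_linear_trend(series):
    """Subtract the least-squares straight line from a series."""
    n = len(series)
    t_mean = (n - 1) / 2            # mean of the time indices 0..n-1
    y_mean = sum(series) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    if denom == 0:
        return [0.0] * n            # a single point has no trend to fit
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series)) / denom
    intercept = y_mean - slope * t_mean
    return [y - (slope * t + intercept) for t, y in enumerate(series)]

# Toy example (made up): a pure linear drift is removed entirely.
drift = [2.0 * t + 1.0 for t in range(5)]
print(remove_linear_trend(drift))
```

Real wandering baseline is only approximately linear over a short window, so this is a first approximation rather than a full baseline-removal method.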
InlineSkate
Task
This data was collected from
experiments with inline speed skaters
on a treadmill.
Each time series represents an angular
measurement of the ankle during one
movement cycle.
Cycles were of different lengths; we made
them all the same length.
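The slide does not say how the cycles were brought to the same length. One common choice, shown here purely as an assumption, is linear interpolation onto a fixed number of evenly spaced points. The function name is made up.

```python
def resample(series, target_length):
    """Linearly interpolate a series onto target_length evenly spaced points."""
    n = len(series)
    if n == 1 or target_length == 1:
        return [series[0]] * target_length
    out = []
    for i in range(target_length):
        # Position of the i-th output point on the original index scale.
        pos = i * (n - 1) / (target_length - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out

# Toy examples (made up): stretch a short cycle, shrink a long one.
print(resample([0.0, 2.0], 3))
print(resample([0.0, 1.0, 2.0, 3.0, 4.0], 3))
```

Whatever method is used, stretching very short cycles and shrinking long ones changes their local shape, which is one source of the "warping" seen in this dataset.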
Why is it difficult?
Number of training objects
100
Number of testing objects
550
Number of classes
7
Length of time series
1882
Euclidean Distance accuracy
30.00%
• Lots of “warping”.
• Long time series (for algorithms that
scale poorly in dimensionality).
• The “cycle” extraction algorithm might
not be perfect (this was done before
we saw the data).
The data was provided by Fabian
Moerchen and Olaf Hoos.
FacesUCR
Task
This data consists of faces of grad students
transformed into “time series”
Why is it difficult?
Number of training objects
200
Number of testing objects
2050
Number of classes
14
Length of time series
131
Euclidean Distance accuracy
75.50%
• Variation of head angle and expression.
• Some have glasses/no glasses versions.
• All grad students look alike (well, some
do).
• The transformation algorithm is a little
brittle (we have since found more robust
techniques).
Photographs by Chotirat "Ann" Ratanamahatana, image
conversion by Xiaopeng Xi and Eamonn Keogh
WordsSynonyms
Task
The time series representation of words is known to be
very competitive with other representations [a].
Here the results might not be competitive because we
are only using one (of four) time series per word, we are
normalizing, and we have small training sets.
[a] Word spotting for historical documents. Toni M. Rath
and R. Manmatha International Journal on Document
Analysis and Recognition. Volume 9, Numbers 2-4 / April,
2007
This dataset consists of word profiles for
George Washington's manuscripts.
This dataset is the “50-words” dataset,
remapped to 25 classes.
The data was flipped left-right so that it
would not be recognized.
Why is it difficult?
Number of training objects
267
Number of testing objects
638
Number of classes
25
Length of time series
270
Euclidean Distance accuracy
58.80%
• There are two ways to be a member of
each class.
• In this case, length normalization
clearly does throw away useful info.
• Errors from the difficult task of OCR on
old documents.
The data was provided by Toni M. Rath
and R. Manmatha.