
Columbia University
Advanced Machine Learning & Perception – Fall 2006
Term Project
Nonlinear Dimensionality Reduction
and K-Nearest Neighbor
Classification
Applied to Global Climate Data
Carlos Henrique Ribeiro Lima
New York – Dec/2006
Outline
1. Goals
2. Motivation and Dataset
3. Methodology
4. Results
1. Low-Dimensional Manifold
2. KNN on Low-Dimensional Manifold
5. Conclusion
1. Goals
1. Use kernel PCA based on Semidefinite Embedding to
identify the low-dimensional, nonlinear manifold of climate
data sets → identification of the main modes of spatial variability;
2. Classification in the feature space → predictions in the
original space (KNN method);
2. Motivation
Dataset of Monthly Sea Surface Temperature (SST)
Huge economic and social impacts of extreme El Niño
events (e.g. 1997) → need for forecasting models!
2. Dataset
Monthly Sea Surface Temperature (SST) Data
from Jan/1856 to Dec/2005
1. Latitudinal band: 25°S to 25°N;
2. Grid with 599 cells;
3. Training set: Jan/1856 to Dec/1975 = 120 years;
4. Testing set: Jan/1976 to Dec/2005 = 30 years;
5. Input matrix:

       | x11 ... x1m |
   X = |  .       .  |
       |  .       .  |
       | xn1 ... xnm |

   n = 1440 points (rows: monthly observations of the training set)
   m = 599 dimensions (columns: grid cells)
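The shape of the input matrix and the train/test split above can be sketched as follows. This is a minimal illustration with synthetic random fields standing in for the real SST maps; the array names and the random data are assumptions, only the dimensions come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the SST dataset: one row per monthly field,
# one column per grid cell (599 tropical cells on the slide).
n_total, m = 150 * 12, 599          # 150 years of monthly maps, Jan/1856-Dec/2005
sst = rng.normal(size=(n_total, m))

# Training: Jan/1856-Dec/1975 (120 years); testing: Jan/1976-Dec/2005 (30 years).
n_train = 120 * 12                   # = 1440 rows, the n on the slide
X_train, X_test = sst[:n_train], sst[n_train:]
print(X_train.shape)  # (1440, 599)
```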
3. Methodology
1) Semidefinite Embedding (code from K. Q. Weinberger)
   Constraints on the learned kernel matrix K:
   - Positive semidefiniteness: all eigenvalues of K are ≥ 0;
   - Centering: the inner products are centered on the origin
     (the entries of K sum to zero);
   - Isometry: local distances of the input space are preserved
     in the feature space.
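The three constraints above can be checked numerically for a candidate kernel matrix. The sketch below is not Weinberger's SDE solver; it is only a verifier, under the standard Maximum Variance Unfolding reading of the constraints (PSD kernel, entries summing to zero, local squared distances preserved). The function name and the toy data are assumptions.

```python
import numpy as np

def check_sde_constraints(K, X, neighbors, tol=1e-6):
    """Check the three SDE constraints on a candidate kernel matrix K.

    K         : n x n Gram matrix of the embedded points
    X         : n x m input data (for the isometry check)
    neighbors : list of (i, j) locally neighboring pairs
    """
    # 1) Positive semidefiniteness: all eigenvalues of K are >= 0.
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    # 2) Centering: entries of K sum to 0, i.e. points centered at the origin.
    centered = abs(K.sum()) < tol
    # 3) Isometry: squared distances of neighboring pairs are preserved,
    #    ||y_i - y_j||^2 = K_ii + K_jj - 2 K_ij = ||x_i - x_j||^2.
    iso = all(
        abs((K[i, i] + K[j, j] - 2 * K[i, j]) - np.sum((X[i] - X[j]) ** 2)) < tol
        for i, j in neighbors
    )
    return bool(psd), bool(centered), bool(iso)

# Tiny example: three collinear points, embedded as themselves after centering.
X = np.array([[0.0], [1.0], [2.0]])
Y = X - X.mean(axis=0)          # centered embedding
K = Y @ Y.T                     # Gram matrix of the embedding
print(check_sde_constraints(K, X, [(0, 1), (1, 2)]))  # (True, True, True)
```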
2) KNN → Euclidean distance
3) Probabilistic forecasting → Skill score (Ranked Probability Score, RPS)
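A distance-weighted KNN step in the feature space could look like the sketch below. The inverse-distance weighting, the function name, and the toy three-class labels are illustrative assumptions; the slides only state that KNN with Euclidean distance is used and that neighbor classes and weights feed the forecast.

```python
import numpy as np

def knn_class_probs(Y_train, labels, y_query, k=5, n_classes=3):
    """Distance-weighted KNN on the low-dimensional feature space.

    Returns class probabilities built from the k nearest training points
    (Euclidean distance), weighting each neighbor by 1/distance.
    """
    d = np.sqrt(((Y_train - y_query) ** 2).sum(axis=1))  # Euclidean distances
    idx = np.argsort(d)[:k]                              # k nearest neighbors
    w = 1.0 / (d[idx] + 1e-12)                           # inverse-distance weights
    probs = np.zeros(n_classes)
    for i, wi in zip(idx, w):
        probs[labels[i]] += wi
    return probs / probs.sum()

# Toy embedded points in 3 classes (e.g. cold / neutral / warm Nino3 states).
Y = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [2.0, 2.0]])
lab = np.array([0, 0, 1, 1, 2])
print(knn_class_probs(Y, lab, np.array([0.05, 0.0]), k=3))
```

The query point sits between the two class-0 points, so nearly all probability mass lands on class 0.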
4. Results
Low-Dimensional Manifold
4. Results
Labeling on the feature space
4. Results
Forecasts – Testing Set
KNN method and skill score
E.g. March 1997:
1) Want to predict the
class of Niño3 in
Dec/1997 → lead time = 9
months.
2) Run KNN in the feature space
(March of 1856 to 1975);
3) Take the classes and
weights of the k
neighbors;
4) Compute the skill score.
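The skill score step above can be sketched with the standard Ranked Probability Score for ordinal categories; whether the project used exactly this definition (versus a climatology-referenced skill score built on it) is an assumption, as are the function name and the example probabilities.

```python
import numpy as np

def rps(forecast_probs, observed_class):
    """Ranked Probability Score for one categorical, ordered forecast.

    RPS = sum over categories of (cumulative forecast - cumulative
    observation)^2; 0 means a perfect forecast. A skill score can then
    be formed as 1 - RPS / RPS_climatology.
    """
    f_cum = np.cumsum(forecast_probs)
    o = np.zeros_like(forecast_probs)
    o[observed_class] = 1.0                 # observation as a one-hot vector
    o_cum = np.cumsum(o)
    return float(np.sum((f_cum - o_cum) ** 2))

# E.g. three Nino3 terciles; forecast leans toward the warm class, which occurs.
print(rps(np.array([0.1, 0.2, 0.7]), observed_class=2))
```

A sharper correct forecast (more mass on the observed class) always lowers the RPS, which is why it suits the tercile-class El Niño forecasts described here.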
4. Results
Forecasts – Testing Set
KNN method and skill score – El Nino of 1982 and 1997
5. Conclusions
1. Semidefinite Embedding performs well on the SST data
(from high-dimensional down to just 3 dimensions with ~90% of the
explained variance);
2. The KNN method provides very good classification and
forecasts;
3. Need to check sensitivity to changes in some
parameters (# of local neighbors, k in KNN);
4. Plan to extend to other climate datasets;
5. Try other metrics, multivariate data, etc.