Columbia University
Advanced Machine Learning & Perception – Fall 2006
Term Project
Nonlinear Dimensionality Reduction
and K-Nearest Neighbor
Classification
Applied to Global Climate Data
Carlos Henrique Ribeiro Lima
New York – Dec/2006
Outline
1. Goals
2. Motivation and Dataset
3. Methodology
4. Results
1. Low-Dimensional Manifold
2. KNN on Low-Dimensional Manifold
5. Conclusion
1. Goals
1. Use kernel PCA based on Semidefinite Embedding to
identify the low-dimensional, nonlinear manifold of climate
data sets → identification of the main modes of spatial variability;
2. Classification in the feature space → predictions in the
original space (KNN method).
2. Motivation
Dataset of monthly Sea Surface Temperature (SST).
Extreme El Niño events (e.g. 1997) have huge economic and
social impacts → need for forecasting models!
2. Dataset
Monthly Sea Surface Temperature (SST) Data
from Jan/1856 to Dec/2005
1. Latitudinal band: 25°S–25°N;
2. Grid with 599 cells;
3. Training data: Jan/1856 to Dec/1975 = 120 years
4. Testing set: Jan/1976 to Dec/2005 = 30 years
5. Input matrix:

        | x_11  ...  x_1m |
    X = |  .     .     .  |
        | x_n1  ...  x_nm |

    n = 1440 points (120 training years × 12 months)
    m = 599 dimensions (grid cells)
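The input matrix above can be sketched as follows. This is a minimal illustration with random numbers standing in for the SST record (the real data would be loaded from the gridded dataset); the variable names are my own.

```python
import numpy as np

# Toy stand-in for the monthly SST fields, Jan/1856-Dec/2005:
# 150 years x 12 months = 1800 monthly maps over 599 grid cells.
rng = np.random.default_rng(0)
sst = rng.normal(size=(1800, 599))

# Training period Jan/1856-Dec/1975 -> first 120 years = 1440 months;
# testing period Jan/1976-Dec/2005 -> last 30 years = 360 months.
X_train = sst[:1440, :]   # n = 1440 points, m = 599 dimensions
X_test = sst[1440:, :]    # 360 points

print(X_train.shape, X_test.shape)  # (1440, 599) (360, 599)
```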
3. Methodology
1) Semidefinite Embedding (code from K. Q. Weinberger)
   - Positive semidefiniteness of the kernel matrix;
   - Inner products centered on the origin;
   - Isometry: local distances of the input space are preserved
     in the feature space.
2) KNN with Euclidean distance
3) Probabilistic forecasting skill score (RPS)
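The constraints in step 1 and the final kernel-PCA projection can be illustrated numerically. This sketch does not solve the semidefinite program itself (that is what Weinberger's SDE code does); instead it uses the centered Gram matrix of some toy points as a stand-in kernel K, checks the two constraints from the slide, and extracts low-dimensional coordinates from the top eigenpairs, which is the kernel-PCA step.

```python
import numpy as np

# Toy stand-in for the kernel matrix K that the SDE solver would return:
# the centered Gram matrix of random 2-D points already satisfies the
# constraints checked below.
rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 2))
pts -= pts.mean(axis=0)          # center on the origin
K = pts @ pts.T                  # matrix of inner products

# Constraint checks from the slide:
eigvals, eigvecs = np.linalg.eigh(K)
assert eigvals.min() > -1e-9     # positive semidefiniteness
assert abs(K.sum()) < 1e-9       # inner products centered on the origin

# Kernel-PCA step: embedding coordinates from the top d eigenpairs.
d = 2
top = np.argsort(eigvals)[::-1][:d]
Y = eigvecs[:, top] * np.sqrt(eigvals[top])   # n x d embedding

# Isometry: with the exact Gram matrix, pairwise distances are preserved.
dist_in = np.linalg.norm(pts[0] - pts[1])
dist_out = np.linalg.norm(Y[0] - Y[1])
assert abs(dist_in - dist_out) < 1e-6
print(Y.shape)
```

In the full SDE program, isometry is enforced only for pairs of neighboring points while the trace of K is maximized; the eigendecomposition step at the end is the same.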
4. Results
Low-Dimensional Manifold
4. Results
Labeling on the feature space
4. Results
Forecasts – Testing Set
KNN method and skill score
E.g. March/1997:
1) Want to predict the
class of Niño3 in
Dec/1997 → lead time = 9
months;
2) KNN in the feature space
(Marches of 1856 to 1975);
3) Take the classes and
weights of the k
neighbors;
4) Compute the skill score.
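The four steps above can be sketched as follows, with toy stand-ins for the manifold coordinates and class labels (the choice of k, the inverse-distance weighting, and the three-category setup are illustrative assumptions, not taken from the slides).

```python
import numpy as np

# Toy stand-in: each historical March (1856-1975) is a point on the
# low-dimensional manifold, labeled with the class (e.g. below/normal/
# above) that Nino3 fell into 9 months later, in December.
rng = np.random.default_rng(2)
feats = rng.normal(size=(120, 3))        # 120 Marches, 3 manifold coords
labels = rng.integers(0, 3, size=120)    # December class for each year
query = rng.normal(size=3)               # March/1997 on the manifold

# Steps 2-3: KNN in the feature space, with inverse-distance weights.
k = 10
dists = np.linalg.norm(feats - query, axis=1)
nn = np.argsort(dists)[:k]
w = 1.0 / (dists[nn] + 1e-12)
probs = np.bincount(labels[nn], weights=w, minlength=3)
probs /= probs.sum()                     # forecast class probabilities

# Step 4: Ranked Probability Score against the observed class
# (squared differences of cumulative probabilities; lower is better).
obs = 2
cum_f = np.cumsum(probs)
cum_o = np.cumsum(np.eye(3)[obs])
rps = np.sum((cum_f - cum_o) ** 2)
print(probs, rps)
```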
4. Results
Forecasts – Testing Set
KNN method and skill score – El Nino of 1982 and 1997
5. Conclusions
1. Semidefinite Embedding performs well on the SST data
(high-dimensional data → just 3 dimensions with ~90% of the
explained variance);
2. The KNN method provides very good classification and
forecasts;
3. Need to check sensitivity to changes in some
parameters (number of local neighbors in SDE, k in KNN);
4. Plan to extend to other climate datasets;
5. Try other metrics, multivariate data, etc.