UNIVERSITA’ DI MILANO-BICOCCA CdL IN INFORMATICA
Università di Milano-Bicocca
Master's Degree in Computer Science
Course:
LEARNING AND APPROXIMATION
Lecture 8 - Instance-based learning
Prof. Giancarlo Mauri
Instance-based Learning
Key idea:
Just store all training examples <xi, f(xi)>
Classify a query instance by retrieving a set of “similar”
instances
Advantages
Training is very fast - just storing examples
Learn complex target functions
Don’t lose information
Disadvantages
Slow at query time - need for efficient indexing of training
examples
Easily fooled by irrelevant attributes
Instance-based Learning
Main approaches:
k-Nearest Neighbor
Locally weighted regression
Radial basis functions
Case-based reasoning
Lazy and eager learning
K-Nearest Neighbor Learning
Instances: points in R^n
Euclidean distance to measure similarity
Target function f: R^n → V
The algorithm:
Given a query instance xq:
first locate the nearest training example xn, then estimate f(xq) = f(xn) (for k = 1)
or
take a vote among its k nearest neighbors (k > 1 and f discrete-valued)
take the mean of the f values of its k nearest neighbors (k > 1 and f real-valued)
\hat{f}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)
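The algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: the function name `knn_predict`, the `discrete` flag, and the toy data are assumptions made for the example, and the search is a brute-force scan with no indexing.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=1, discrete=True):
    """Estimate f(x_q) from the k stored examples nearest to x_q."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query to every stored example
    dists = np.linalg.norm(X_train - np.asarray(x_q, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    values = [y_train[i] for i in nearest]
    if discrete:
        return Counter(values).most_common(1)[0][0]   # majority vote
    return float(np.mean(values))                      # mean of the f values

# Toy data (hypothetical): three "-" points near the origin, two "+" points far away
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["-", "-", "-", "+", "+"]
print(knn_predict(X, labels, [0.5, 0.5], k=3))
```

Training is just storing `X` and `labels`; all the work happens at query time, which is why efficient indexing matters for large training sets.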
K-Nearest Neighbor Learning
When to Consider Nearest Neighbor
Instances map to points in R^n
Fewer than 20 attributes per instance
Lots of training data
An example
Boolean target function in 2D
The query x is classified + for k = 1 and - for k = 5 (left diagram)
Right diagram (Voronoi diagram) shows the decision surface induced by 1-nearest neighbor for a given set of training examples (black dots)
[Figure: positive (+) and negative (-) training examples surrounding the query point x]
Voronoi Diagram
query point qf
nearest neighbor qi
3-Nearest Neighbors
query point qf
3 nearest neighbors
2 x, 1 o (majority vote: x)
7-Nearest Neighbors
query point qf
7 nearest neighbors
3 x, 4 o (majority vote: o)
Nearest Neighbor (continuous)
[Figures: 1-nearest neighbor, 3-nearest neighbor, 5-nearest neighbor]
Behavior in the Limit
Let p(x) be the probability that instance x will be
labeled 1 (positive) versus 0 (negative)
Nearest neighbor
As the number of training examples → ∞, approaches the Gibbs algorithm: with probability p(x) predict 1, else 0
k-Nearest neighbor
As the number of training examples → ∞ and k gets large, approaches Bayes optimal: if p(x) > 0.5 then predict 1, else 0
Note: Gibbs has at most twice the expected error of
Bayes optimal
Distance-Weighted k-NN
We might want to weight nearer neighbors more heavily:
\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}
where w_i = \frac{1}{d(x_q, x_i)^2}, with d(xq,xi) the distance between xq and xi
(if xq = xi, i.e. d(xq,xi) = 0, then \hat{f}(x_q) = f(x_i))
Note:
now it makes sense to use all training examples instead of
just k (Shepard’s method, global, slow)
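A minimal sketch of the distance-weighted estimate for a real-valued target, using inverse squared distance as the weight. The function name `weighted_knn` and the convention that `k=None` means "use all training examples" (Shepard's method) are assumptions of this example.

```python
import numpy as np

def weighted_knn(X_train, y_train, x_q, k=None):
    """Distance-weighted average of f over the k nearest neighbors (all if k is None)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    d = np.linalg.norm(X_train - np.asarray(x_q, dtype=float), axis=1)
    if k is not None:
        idx = np.argsort(d)[:k]                # keep only the k nearest neighbors
        d, y_train = d[idx], y_train[idx]
    exact = d == 0
    if exact.any():                            # query coincides with a stored example
        return float(y_train[exact][0])
    w = 1.0 / d**2                             # w_i = 1 / d(x_q, x_i)^2
    return float(np.sum(w * y_train) / np.sum(w))

X = [[0.0], [1.0], [3.0]]
y = [0.0, 1.0, 3.0]
print(weighted_knn(X, y, [2.0]))        # global: every example contributes
print(weighted_knn(X, y, [2.0], k=2))   # local: only the two nearest contribute
```

With all examples included, distant points still pull the estimate slightly, which is why the global variant is slower and smoother than the k-limited one.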
Curse of Dimensionality
Imagine instances described by 20 attributes, but only 2 are
relevant to target function
Nearest neighbor is easily misled when X is high-dimensional
One approach:
Stretch jth axis by weight zj, where z1,…, zn chosen to
minimize prediction error
Use cross-validation to automatically choose weights z1,…, zn
Note setting zj to zero eliminates this dimension altogether
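The axis-stretching idea can be illustrated with a small search over candidate weights, scored by leave-one-out cross-validation of 1-nearest neighbor. This is only a sketch under stated assumptions: the names `loo_error` and `choose_weights`, the candidate grid {0, 1} per axis, and the toy data are all hypothetical, and an exhaustive grid is practical only for a handful of attributes.

```python
import numpy as np
from itertools import product

def loo_error(X, y, z):
    """Leave-one-out error rate of 1-nearest neighbor after stretching axis j by z[j]."""
    Xs = X * np.asarray(z, dtype=float)              # stretch each axis
    D = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                      # an example may not be its own neighbor
    nearest = np.argmin(D, axis=1)
    return float(np.mean(y[nearest] != y))

def choose_weights(X, y, candidates=(0.0, 1.0)):
    """Pick the weight vector z with the lowest leave-one-out error."""
    best = min(product(candidates, repeat=X.shape[1]),
               key=lambda z: loo_error(X, y, z))
    return np.array(best)

# Toy data: the first attribute determines the class, the second is irrelevant
X = np.array([[0.0, 5.0], [0.1, -3.0], [0.2, 4.0],
              [1.0, -4.0], [1.1, 5.1], [1.2, -3.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(choose_weights(X, y))
```

On this data the search assigns weight zero to the second axis, which removes the irrelevant dimension from the distance computation.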
Locally weighted Regression
Note that kNN forms a local approximation to f for each query point xq: why not form an explicit approximation \hat{f}(x) for the region surrounding xq?
Fit a linear function to the k nearest neighbors
Fit a quadratic…
Produces a “piecewise approximation” to f
Several choices of error to minimize:
Squared error over the k nearest neighbors
E_1(x_q) = \frac{1}{2} \sum_{x \in k \text{ nearest neighbors of } x_q} (f(x) - \hat{f}(x))^2
Distance-weighted squared error over all neighbors
E_2(x_q) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 K(d(x_q, x))
Radial Basis Function Networks
Global approximation to target function, in terms of linear
combination of local approximations
Used, e.g., for image classification
A different kind of neural network
Closely related to distance-weighted regression, but “eager”
instead of “lazy”
Radial Basis Function Networks
Where ai(x) are the attributes describing instance x, and
f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))
One common choice for K_u(d(x_u, x)) is
K_u(d(x_u, x)) = e^{-\frac{1}{2 \sigma_u^2} d^2(x_u, x)}
Training RBF Networks
Q1: what xu to use for each kernel function Ku(d(xu, x))
Scatter uniformly throughout instance space
Or use training instances (reflects instance
distribution)
Q2: how to train weights (assume here Gaussian Ku)
First choose variance (and perhaps mean) for each Ku
- e.g., use EM
Then hold Ku fixed, and train linear output layer
- efficient methods to fit linear function
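A minimal sketch of the two-stage recipe: the kernels are fixed first, then the linear output layer is fitted by least squares. Several simplifications are assumed here: centers are taken directly from training instances, a single shared width `sigma` is chosen by hand instead of being estimated (for example by EM), and the function names are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Matrix with a column of ones (for w_0) followed by one Gaussian kernel column per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances
    K = np.exp(-d2 / (2.0 * sigma**2))
    return np.hstack([np.ones((X.shape[0], 1)), K])

def train_rbf(X, y, centers, sigma):
    """Hold the kernels fixed and fit the linear output weights by least squares."""
    w, *_ = np.linalg.lstsq(rbf_design(X, centers, sigma), y, rcond=None)
    return w

def predict_rbf(X, centers, sigma, w):
    return rbf_design(X, centers, sigma) @ w

X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])
centers = X[::4]                       # every fourth training instance as a center
w = train_rbf(X, y, centers, sigma=0.2)
pred = predict_rbf(X, centers, 0.2, w)
print(np.sqrt(np.mean((pred - y) ** 2)))   # training RMSE
```

Because the kernels are fixed before any query arrives, the fitted network is a single global model, which is what makes the method eager.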
Radial Basis Function Network
Global approximation to target function in terms
of linear combination of local approximations
Used, e.g. for image classification
Similar to back-propagation neural network but
activation function is Gaussian rather than sigmoid
Closely related to distance-weighted regression
but ”eager” instead of ”lazy”
Radial Basis Function Network
[Figure: network with input layer xi, kernel functions Kn, linear parameters wn, and output f(x)]
K_n(d(x_n, x)) = \exp(-\frac{1}{2} d(x_n, x)^2 / \sigma^2)
f(x) = w_0 + \sum_{n=1}^{k} w_n K_n(d(x_n, x))
Training Radial Basis Function Networks
How to choose the center xn for each Kernel
function Kn?
scatter uniformly across instance space
use distribution of training instances (clustering)
How to train the weights?
Choose mean xn and variance σ_n^2 for each Kn
nonlinear optimization or EM
Hold Kn fixed and use local linear regression to
compute the optimal weights wn
Radial Basis Network Example
K_1(d(x_1, x)) = \exp(-\frac{1}{2} d(x_1, x)^2 / \sigma^2)
\hat{f}(x) = K_1 (w_1 x + w_0) + K_2 (w_3 x + w_2)
Case-based reasoning
Can apply instance-based learning even when X ≠ R^n
need different “distance” metric
Case-based reasoning is instance-based learning applied
to instances with symbolic logic descriptions
((user-complaint error53-on-shutdown)
(cpu-model PowerPC)
(operating-system Windows)
(network-connection PCIA)
(memory 48meg)
(installed-applications Excel Netscape VirusScan)
(disk 1gig)
(likely-cause???))
Case-based reasoning in CADET
CADET: 75 stored examples of mechanical devices
each training example: < qualitative function,
mechanical structure>
new query: desired function
target value: mechanical structure for this function
Distance metric: match qualitative function descriptions
Case-based Reasoning in CADET
Instances represented by rich structural descriptions
Multiple cases retrieved (and combined) to form solution to new
problem
Tight coupling between case retrieval and problem solving
Bottom line:
Simple matching of cases useful for tasks such as answering
help-desk queries
Area of ongoing research
Lazy and Eager learning
Lazy: wait for query before generalizing
k-NEAREST NEIGHBOR, Case-based reasoning
Eager: generalize before seeing query
Radial basis function networks, ID3, Backpropagation,
NaiveBayes,…
Does it matter?
Eager learner must create global approximation
Lazy learner can create many local approximations
If they use the same hypothesis space H, lazy can represent more complex functions (e.g., consider H = linear functions)
Machine Learning 2D5362
Instance Based Learning
Distance Weighted k-NN
Give more weight to neighbors closer to the query
point
\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}
where w_i = K(d(x_q, x_i))
and d(xq,xi) is the distance between xq and xi
Instead of only k-nearest neighbors use all training
examples (Shepard’s method)
Distance Weighted Average
Weighting the data:
\hat{f}(x_q) = \frac{\sum_i f(x_i) K(d(x_i, x_q))}{\sum_i K(d(x_i, x_q))}
Relevance of a data point (xi,f(xi)) is measured by
calculating the distance d(xi,xq) between the query
xq and the input vector xi
Weighting the error criterion:
E(x_q) = \sum_i (\hat{f}(x_q) - f(x_i))^2 K(d(x_i, x_q))
the best estimate \hat{f}(x_q) will minimize the cost E(x_q), therefore
\partial E(x_q) / \partial \hat{f}(x_q) = 0
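Setting the derivative of the weighted error criterion to zero recovers the distance-weighted average, so the two views (weighting the data, weighting the error) agree for a locally constant model. A short derivation, treating \hat{f}(x_q) as a single unknown constant:

```latex
\begin{aligned}
\frac{\partial E(x_q)}{\partial \hat{f}(x_q)}
  &= 2 \sum_i \bigl(\hat{f}(x_q) - f(x_i)\bigr)\, K(d(x_i, x_q)) = 0 \\
\hat{f}(x_q) \sum_i K(d(x_i, x_q))
  &= \sum_i f(x_i)\, K(d(x_i, x_q)) \\
\hat{f}(x_q)
  &= \frac{\sum_i f(x_i)\, K(d(x_i, x_q))}{\sum_i K(d(x_i, x_q))}
\end{aligned}
```

The last step assumes that the kernel weights do not all vanish, so the denominator is nonzero.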
Kernel Functions
Distance Weighted NN
K(d(x_q, x_i)) = 1 / d(x_q, x_i)^2
K(d(x_q, x_i)) = 1 / (d_0 + d(x_q, x_i))^2
K(d(x_q, x_i)) = \exp(-(d(x_q, x_i) / \sigma_0)^2)
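The three kernels can be written as small functions and plugged into the same weighted average. This is a sketch: the function names and the default values `d0=1.0` and `sigma0=1.0` are assumptions chosen for illustration.

```python
import numpy as np

def inverse_square(d):
    """K(d) = 1 / d^2; unbounded as d approaches 0."""
    return 1.0 / d**2

def shifted_inverse_square(d, d0=1.0):
    """K(d) = 1 / (d0 + d)^2; bounded by 1 / d0^2 at d = 0."""
    return 1.0 / (d0 + d)**2

def gaussian(d, sigma0=1.0):
    """K(d) = exp(-(d / sigma0)^2); equals 1 at d = 0 and decays quickly."""
    return np.exp(-(d / sigma0)**2)

def weighted_average(X, y, x_q, kernel):
    """Kernel-weighted average of all stored target values."""
    d = np.linalg.norm(X - x_q, axis=1)
    w = kernel(d)
    return float(np.sum(w * y) / np.sum(w))

X = np.array([[0.0], [2.0]])
y = np.array([0.0, 2.0])
x_q = np.array([1.0])
for kernel in (inverse_square, shifted_inverse_square, gaussian):
    print(kernel.__name__, weighted_average(X, y, x_q, kernel))
```

The second and third kernels stay finite when the query coincides with a stored point, so they need no special case for d = 0, unlike the plain inverse square.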
Linear Global Models
The model is linear in the parameters \beta_k, which can be estimated using a least squares algorithm
\hat{f}(x_i) = \sum_{k=1}^{D} \beta_k x_{ki} or \hat{F}(X) = X \beta
Where xi=(x1,…,xD)i, i=1..N, with D the input dimension and N the number of data points.
Estimate the \beta_k by minimizing the error criterion
E = \sum_{i=1}^{N} (\hat{f}(x_i) - y_i)^2
(X^T X) \beta = X^T F(X)
\beta = (X^T X)^{-1} X^T F(X)
\beta_k = \sum_{m=1}^{D} \sum_{n=1}^{N} [(X^T X)^{-1}]_{km} x^T_{mn} f(x_n)
where F(X) is the vector of target values y_n = f(x_n)
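A minimal sketch of the least squares solution via the normal equations. The parameter vector is named `beta` in the code; the function name `fit_linear` and the toy data are assumptions, and the leading column of ones is added here so that the model has an intercept.

```python
import numpy as np

def fit_linear(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for the parameter vector."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated from y = 1 + 2x; the first column of ones carries the intercept
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
beta = fit_linear(X, y)
print(beta)
```

Solving the linear system directly avoids forming the explicit inverse of X^T X; for ill-conditioned problems `np.linalg.lstsq` is usually the safer choice.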
Linear Regression Example
Linear Local Models
Estimate the parameters \beta such that they locally (near the query point xq) match the training data, either by
weighting the data:
w_i = K(d(x_i, x_q))^{1/2} and transforming
z_i = w_i x_i
v_i = w_i y_i
or by weighting the error criterion:
E = \sum_{i=1}^{N} (x_i^T \beta - y_i)^2 K(d(x_i, x_q))
still linear in \beta, with LSQ solution
\beta = ((WX)^T WX)^{-1} (WX)^T W F(X), where W = diag(w_1, …, w_N)
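A sketch of the "weighting the data" route: each row is scaled by the square root of its kernel value, and an ordinary least squares fit on the scaled data yields the local model for one query. The Gaussian kernel, the bandwidth `h`, the added intercept column, and the name `local_linear_predict` are assumptions of this example.

```python
import numpy as np

def local_linear_predict(X, y, x_q, h=0.5):
    """Fit a linear model weighted around x_q and evaluate it at x_q."""
    Xb = np.hstack([np.ones((len(X), 1)), X])      # intercept column plus inputs
    d = np.linalg.norm(X - x_q, axis=1)
    w = np.sqrt(np.exp(-(d / h) ** 2))             # w_i = K(d(x_i, x_q))^(1/2)
    Z = w[:, None] * Xb                            # z_i = w_i x_i
    v = w * y                                      # v_i = w_i y_i
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)   # ordinary LSQ on the transformed data
    return float(np.concatenate([[1.0], x_q]) @ beta)

X = np.linspace(0.0, 1.0, 11)[:, None]
y_lin = 1.0 + 2.0 * X[:, 0]                        # exactly linear target
y_sq = X[:, 0] ** 2                                # curved target
print(local_linear_predict(X, y_lin, np.array([0.35])))
print(local_linear_predict(X, y_sq, np.array([0.5]), h=0.1))
```

A new fit is computed for every query point, which is the lazy counterpart of the single global fit; a smaller `h` tracks curvature more closely at the cost of using fewer effective data points.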
Linear Local Model Example
[Figure: kernel K(x, xq) centered on the query point xq = 0.35, with the local linear model \hat{f}(x) = b_1 x + b_0 giving \hat{f}(x_q) = 0.266]
Design Issues in Local Regression
Local model order (constant, linear, quadratic)
Distance function d
feature scaling: d(x, q) = (\sum_{j=1}^{d} m_j (x_j - q_j)^2)^{1/2}
irrelevant dimensions m_j = 0
kernel function K
smoothing parameter bandwidth h in K(d(x,q)/h)
h=|m| global bandwidth
h= distance to k-th nearest neighbor point
h=h(q) depending on query point
h=hi depending on stored data points
See paper by Atkeson [1996] ”Locally Weighted Learning”
Local Linear Models
Local Linear Model Tree (LOLIMOT)
• incremental tree construction algorithm
• partitions input space by axis-orthogonal splits
• adds one local linear model per iteration
1. start with an initial model (e.g. single LLM)
2. identify LLM with worst model error Ei
3. check all divisions: split the worst LLM's hyper-rectangle in halves along each possible dimension
4. find the best (smallest error) of the possible divisions
5. add new validity function and LLM
6. repeat from step 2 until the termination criterion is met
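The six steps can be sketched as follows. This is a deliberately simplified illustration, not the LOLIMOT algorithm as published: it assumes a hard partition (each point belongs to exactly one hyper-rectangle) in place of smooth validity functions, fits each local linear model by plain least squares on the points in its cell, and stops after a fixed number of models. All names and defaults are hypothetical.

```python
import numpy as np

def fit_llm(X, y):
    """Least squares local linear model; returns (parameters, sum of squared residuals)."""
    A = np.hstack([np.ones((len(X), 1)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, float(np.sum((A @ beta - y) ** 2))

def lolimot(X, y, n_models=4, min_points=5):
    n, d = X.shape
    beta, err = fit_llm(X, y)
    # each cell: (lower corner, upper corner, member mask, parameters, error)
    cells = [(X.min(axis=0), X.max(axis=0), np.ones(n, dtype=bool), beta, err)]  # step 1
    while len(cells) < n_models:                       # step 6: simple termination rule
        worst = max(range(len(cells)), key=lambda i: cells[i][4])                # step 2
        lo, hi, mask, _, _ = cells[worst]
        best = None
        for j in range(d):                                                       # step 3
            mid = 0.5 * (lo[j] + hi[j])
            left, right = mask & (X[:, j] < mid), mask & (X[:, j] >= mid)
            if left.sum() < min_points or right.sum() < min_points:
                continue                               # too few points to fit a model
            bl, el = fit_llm(X[left], y[left])
            br, er = fit_llm(X[right], y[right])
            if best is None or el + er < best[0]:                                # step 4
                hi_l, lo_r = hi.copy(), lo.copy()
                hi_l[j], lo_r[j] = mid, mid
                best = (el + er, [(lo, hi_l, left, bl, el), (lo_r, hi, right, br, er)])
        if best is None:
            break                                      # the worst cell cannot be split
        cells[worst:worst + 1] = best[1]                                         # step 5
    return cells

X = np.linspace(-1.0, 1.0, 41)[:, None]
y = np.abs(X[:, 0])                                    # piecewise linear target
cells = lolimot(X, y, n_models=2)
print(len(cells), sum(c[4] for c in cells))
```

On this target a single split at the midpoint is enough: each half is exactly linear, so the total residual drops to numerical zero after one iteration.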
LOLIMOT
Initial global linear model
Split along x1 or x2
Pick split that minimizes
model error (residual)
LOLIMOT Example
Lazy and Eager Learning
Lazy: wait for query before generalizing
k-nearest neighbors, weighted linear regression
Eager: generalize before seeing query
Radial basis function networks, decision trees, backpropagation, LOLIMOT
Eager learner must create global approximation
Lazy learner can create local approximations
If they use the same hypothesis space, lazy can represent more complex functions (e.g., H = linear functions)