
Introduction to Machine Learning
Course 67577, Fall 2007
Lecturer: Amnon Shashua
Teaching Assistant: Yevgeny Seldin
School of Computer Science and Engineering
Hebrew University
What is Machine Learning?
• An inference engine (computer program) that, when given sufficient data (examples), computes a function that matches as closely as possible the process generating the data.
• Make accurate predictions based on observed data.
• Algorithms to optimize a performance criterion based on observed data.
• Learning to do better in the future based on what was experienced in the past.
• Programming by examples: instead of writing a program to solve a task directly, machine learning seeks methods by which the computer will come up with its own program based on training examples.
Why Machine Learning?
• Data-driven algorithms are able to examine large amounts of data. A human expert, on the other hand, is likely to be guided by subjective impressions or by examining a relatively small number of examples.
• Humans often have trouble expressing what they know but have no difficulty labeling data.
• Machine learning is effective in domains where declarative (rule-based) knowledge is difficult to obtain yet generating training data is easy.
Typical Examples
• Visual recognition (say, detecting faces in an image): the amount of variability in appearance introduces challenges that are beyond the capacity of direct programming.
• Spam filtering: data-driven programming can adapt to changing tactics by spammers.
• Extracting topics from documents: categorize news articles by whether they are about politics, sports, science, etc.
• Natural language understanding: from spoken words to text; categorize the meaning of spoken sentences.
• Optical character recognition (OCR).
• Medical diagnosis: from symptoms to diagnosis.
• Credit card transaction fraud detection.
• Wealth prediction.
Fundamental Issues
• Over-fitting: doing well on a training set does not guarantee accuracy on new examples.
• What is the resource we wish to optimize? For a given accuracy, use the smallest training set possible.
• Examples are drawn from some (fixed) distribution D over X × Y (instance space × output space). Does the learner actually need to recover D during the learning process?
• How does the learning process depend on the complexity of the family of learning functions (the concept class C)? How does one define the complexity of C?
• When the goal is to learn the joint distribution D, the problem is computationally unwieldy because the joint distribution table is exponentially large. What assumptions can be made to simplify the task?
Supervised vs. Un-supervised
Supervised Learning Models:
f : X → Y, where X is the instance (data) space and Y is the output space.
• Y = {1, ..., k}: multiclass classification; k = 2 is normally of most interest.
• Y = ℝ: regression. Examples: predict the price of a used car given brand, year, mileage; kinematics of a robot arm; navigate by determining the steering angle from image input.
Un-supervised Learning Models:
Find regularities in the input data, assuming there is some structure in the input space:
• Density estimation
• Clustering (non-parametric density estimation): divide customers into groups which have similar attributes
• Latent class models: extract topics from documents
• Compression: represent the input space with fewer parameters; projection to lower-dimensional spaces
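To make the compression idea concrete, here is a minimal sketch, under assumptions not in the lecture (random data, target dimension k = 2), of projecting observations onto a lower-dimensional space with PCA via the SVD:

```python
import numpy as np

# Minimal PCA sketch: project n-dimensional observations onto the top-k
# principal directions. The data and k are illustrative assumptions only.
def pca_project(X, k):
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions in Vt
    return Xc @ Vt[:k].T                               # coordinates in the top-k directions

X = np.random.randn(100, 10)   # 100 observations in R^10
Z = pca_project(X, k=2)        # compressed representation in R^2
print(Z.shape)                 # (100, 2)
```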
Notations
X is the instance space: the space from which observations are drawn. Examples:
X = {0,1}^n, X = ℝ^n, X = Σ*
x ∈ X is an input instance, a single observation. Examples:
x = (0, 1, 1, 1, 0, 0), x = (0.5, 2.3, 0, 7.2), x = "text"
Y is the output space: the set of possible outcomes that can be associated with a measurement. Examples:
Y = {−1, 1}, Y = ℝ, or some small finite set of symbols
An example is an instance-label pair (x, y). If |Y| = 2 one typically uses {0, 1} or {−1, 1}. We say that an example (x, y) is positive if y = 1 and otherwise we call it a negative example.
A training set Z consists of m instance-label pairs:
Z = {(x_1, y_1), ..., (x_m, y_m)}
In some cases we refer to the training set without labels:
S = {x_1, ..., x_m}
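To make the notation concrete, a minimal sketch (the particular instances and labels are made up for illustration) of a training set Z over X = ℝ^2 with Y = {−1, +1}:

```python
import numpy as np

# Illustrative training set Z = {(x_i, y_i)}: instances in R^2, labels in {-1, +1}.
# The numbers are invented for the example.
Z = [
    (np.array([0.5, 2.3]), +1),
    (np.array([1.0, -0.7]), -1),
    (np.array([0.1, 0.4]), +1),
]

S = [x for x, _ in Z]   # the unlabeled training set S = {x_1, ..., x_m}
m = len(Z)              # number of examples
```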
Notations
A concept (hypothesis) class C is a set (not necessarily finite) of functions of the form:
C = {h | h : X → Y}
Each h ∈ C is called a concept, hypothesis, or classifier. For example, if X = {0,1}^n and Y = {0,1}, then C might be: C = {h | ∃i : h(x) = x_i}
Other examples:
Decision trees: when X = {0,1}^n and Y = {true, false}, any boolean function can be described by a binary tree. Thus, C consists of decision trees (|C| = ∞).
Conjunction learning: a conjunction is a special case of a Boolean formula. A literal is a variable or its negation, and a term is a conjunction of literals, e.g. (x_1 ∧ x_2 ∧ x_3). A target function is a term consisting of a subset of the literals. In this case |X| = 2^n and |C| = 3^n (each variable may appear positively, appear negated, or be absent).
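As a minimal sketch (the term representation and example values are illustrative assumptions, not from the lecture), a conjunction hypothesis over X = {0,1}^n can be evaluated like this:

```python
# Evaluate a conjunction (term) on a boolean instance x in {0,1}^n.
# pos: indices whose variables must be 1; neg: indices whose variables must be 0.
# The representation and the example term are illustrative choices.
def eval_term(x, pos, neg):
    return all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg)

x = (0, 1, 1, 1, 0, 0)
print(eval_term(x, pos={1, 2}, neg={4}))   # term x_2 AND x_3 AND (NOT x_5) -> True
```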
Separating hyperplanes: X = ℝ^n; a concept h(x) is specified by a vector w ∈ ℝ^n and a scalar b such that:
h(x) = +1 if wᵀx − b ≥ 0, and h(x) = −1 otherwise.
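A minimal sketch of such a hyperplane concept in numpy (the particular w and b are assumed values, not fitted to anything):

```python
import numpy as np

# Separating hyperplane concept: h(x) = +1 if w^T x - b >= 0, else -1.
# w and b are illustrative values only.
def h(x, w, b):
    return 1 if w @ x - b >= 0 else -1

w = np.array([1.0, -2.0])
b = 0.5
print(h(np.array([2.0, 0.0]), w, b))   # 2.0 - 0.0 - 0.5 = 1.5 >= 0  -> +1
print(h(np.array([0.0, 1.0]), w, b))   # 0.0 - 2.0 - 0.5 = -2.5 < 0  -> -1
```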
The Formal Learning Model
Probably Approximately Correct (PAC)
• Distribution invariant: the learner does not need to estimate the joint distribution D over X × Y. The assumptions are that examples arrive i.i.d. and that D exists and is fixed.
• The training sample complexity (the size of the training set Z) depends only on the desired accuracy and confidence parameters; it does not depend on D.
• Not all concept classes C are PAC-learnable, but some interesting classes are.
PAC Model Definitions
S = {x_1, ..., x_m} is sampled randomly and independently (i.i.d.) according to some (unknown) distribution D, i.e., S is distributed according to the product distribution D × ... × D = D^m.
Realizable case: a target concept c_t(x) ∈ C is known to lie inside C. In this case the training set is Z = {(x_i, c_t(x_i))}_{i=1}^m.
Unrealizable case: c_t ∉ C, D is over X × Y, and the training set is Z = {(x_i, y_i)}_{i=1}^m.
Given a concept function h ∈ C,
err(h) = prob_D[x : c_t(x) ≠ h(x)] = ∫_{x∈X} ind(c_t(x) ≠ h(x)) D(x) dx
where ind(x) = 1 if x = true and ind(x) = 0 if x = false.
err(h) is the probability that an instance x sampled according to D will be labeled incorrectly by h(x).
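err(h) can also be estimated empirically by sampling from D; a minimal sketch, where the distribution D, the target c_t, and the hypothesis h are toy assumptions chosen for illustration:

```python
import numpy as np

# Monte Carlo estimate of err(h) = prob_D[c_t(x) != h(x)].
# D, c_t, and h below are toy choices for illustration only.
rng = np.random.default_rng(0)

c_t = lambda x: 1 if x[0] + x[1] >= 0 else -1   # "true" concept
h   = lambda x: 1 if x[0] >= 0 else -1          # candidate hypothesis

m = 100_000
X = rng.standard_normal((m, 2))                 # m samples from D = N(0, I)
err_estimate = np.mean([c_t(x) != h(x) for x in X])
print(err_estimate)                             # about 0.25 for this D
```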
PAC Model Definitions
opt(C) = min_{h ∈ C} err(h)
Note: in the realizable case opt(C) = 0, because err(c_t) = 0.
ε > 0, given to the learner, specifies the desired accuracy, i.e. err(h) ≤ opt(C) + ε.
δ, with 0 < δ < 1, given to the learner, specifies the desired confidence, i.e. prob[err(h) ≤ opt(C) + ε] ≥ 1 − δ.
The learner is allowed to deviate occasionally from the desired accuracy, but only rarely so.
PAC Model Definitions
We will say that an algorithm L learns C if for every target concept c_t and for every distribution D over X × Y, L generates a concept function h ∈ C such that the probability that err(h) ≤ opt(C) + ε is at least 1 − δ.
Formal Definition of PAC Learning
A learning algorithm L is a function
L : ∪_{m ≥ 1} {(x_i, y_i)}_{i=1}^m → C
from the set of all training examples to C, with the following property: given any ε, δ ∈ (0, 1) there is an integer m_0(ε, δ) such that if m ≥ m_0 then, for any probability distribution D on X × Y, if Z is a training set of length m drawn randomly according to D^m, then with probability of at least 1 − δ the hypothesis h = L(Z) ∈ C is such that err(h) ≤ opt(C) + ε.
We say that C is learnable (or PAC-learnable) if there is a learning algorithm for C.
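For intuition, here is a minimal sketch (an illustrative choice, not the course's algorithm) of a learner L : Z → C for the earlier class C = {h | ∃i : h(x) = x_i} over X = {0,1}^n in the realizable case: it simply returns any coordinate consistent with the training set.

```python
# Learner L : Z -> C for the class C = {h | h(x) = x_i} over X = {0,1}^n.
# In the realizable case some coordinate agrees with every label, and the
# learner returns any consistent one. Illustrative sketch, not from the lecture.
def learn_coordinate(Z):
    n = len(Z[0][0])
    for i in range(n):
        if all(x[i] == y for x, y in Z):
            return lambda x, i=i: x[i]   # the hypothesis h(x) = x_i
    return None                          # no consistent hypothesis (unrealizable case)

Z = [((0, 1, 1), 1), ((1, 0, 1), 0), ((0, 0, 1), 0)]   # labels equal the second coordinate
h = learn_coordinate(Z)
print(h((1, 1, 0)))                                     # -> 1
```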
Formal Definition of PAC Learning
Notes:
m_0(ε, δ) does not depend on D, i.e., the PAC model is distribution invariant.
The class C determines the sample complexity: for "simple" classes m_0(ε, δ) would be small compared to more "complex" classes.
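To see how m_0(ε, δ) scales, here is a minimal sketch using the finite-class bound quoted in the syllabus below, m ≥ (1/ε)(ln|C| + ln(1/δ)); the particular values of |C|, ε, and δ are assumptions for illustration:

```python
import math

# Sample complexity bound for a finite concept class (realizable case):
#   m_0(eps, delta) >= (1/eps) * (ln|C| + ln(1/delta))
# |C|, eps, and delta below are illustrative values.
def m0(size_C, eps, delta):
    return math.ceil((math.log(size_C) + math.log(1.0 / delta)) / eps)

print(m0(size_C=2**20, eps=0.1, delta=0.01))   # -> 185 examples
```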
Course Syllabus
3 x PAC:
|C| < ∞: m ≥ (1/ε)(ln|C| + ln(1/δ))
|C| = ∞: m = O((1/ε)(vcd(C) ln(1/ε) + ln(1/δ)))
2 x Separating Hyperplanes:
Support Vector Machine, Kernels, Linear Discriminant Analysis
3 x Unsupervised Learning:
Dimensionality Reduction (PCA), Density Estimation, Non-parametric Clustering (spectral methods)
5 x Statistical Inference:
Maximum Likelihood, Conditional Independence, Latent Class Models, Expectation-Maximization Algorithm, Graphical Models