Gradient Descent Optimization
Classification and Prediction: Regression via Gradient Descent Optimization
Bamshad Mobasher
DePaul University
Linear Regression
Linear regression: involves a response variable y and a
single predictor variable x
y = w0 + w1 x
The weights w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
w0 and w1 are obtained by minimizing the sum of the squared errors (a.k.a.
residuals)
$$\mathrm{SSE} = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \big(y_i - (w_0 + w_1 x_i)\big)^2$$

w1 can be obtained by setting the partial derivative of the SSE to 0 and solving for w1, ultimately resulting in:

$$w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad\qquad w_0 = \bar{y} - w_1 \bar{x}$$
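A minimal NumPy sketch of these closed-form estimates; the data values here are made up for illustration:

```python
import numpy as np

# Illustrative 1-D data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared deviations in x
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: forces the fitted line through (x_bar, y_bar)
w0 = y_bar - w1 * x_bar

y_hat = w0 + w1 * x
sse = np.sum((y - y_hat) ** 2)  # the quantity being minimized
```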
Multiple Linear Regression
Multiple linear regression: involves more than one predictor variable
Features represented as x1, x2, …, xd
Training data is of the form (x1, y1), (x2, y2),…, (xn, yn)
(each xj is a row vector in matrix X, i.e. a row in the data)
For a specific value of a feature xi in data item xj we use: xij
Ex. For 2-D data, the regression function is:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2$$

[Figure: regression plane over features x1 and x2 with response y]
More generally:

$$\hat{y} = f(x_1, \ldots, x_d) = w_0 + \sum_{i=1}^{d} w_i x_i = \mathbf{w}^T \cdot \mathbf{x}$$
Least Squares Generalization
Multiple dimensions
To simplify, add a new feature x0 = 1 to the feature vector x:
[Table: each data row gains a constant column x0 = 1 alongside the original features x1, x2, ..., and the response y]
$$\hat{y} = f(x_0, x_1, \ldots, x_d) = w_0 x_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \cdot \mathbf{x}$$
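A minimal sketch of this augmentation in NumPy; the matrix and weight values are illustrative, not from the slides:

```python
import numpy as np

# Hypothetical data matrix: n rows (samples) by d columns (features)
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 2.0]])

# Prepend a column of ones so the intercept w0 becomes the weight of x0
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.array([0.5, 1.0, -2.0])  # (w0, w1, w2), illustrative values

y_hat = X_aug @ w               # computes w^T . x for every row at once
```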
Least Squares Generalization
$$\hat{y} = f(x_0, x_1, \ldots, x_d) = f(\mathbf{x}) = w_0 x_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \cdot \mathbf{x}$$
Calculate the error function (SSE) and determine w:
$$E(\mathbf{w}) = \sum_{j=1}^{n} \big(y_j - f(\mathbf{x}_j)\big)^2 = \sum_{j=1}^{n} \Big(y_j - \sum_{i=0}^{d} w_i x_{ij}\Big)^2 = (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w})$$

where y is the vector of all training responses y_j and X is the matrix of all training samples x_j.

The closed-form solution to $\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 0$ is:

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y} \qquad\qquad \hat{y}_{test} = \mathbf{w}^T \mathbf{x}_{test} \ \text{ for test sample } \mathbf{x}_{test}$$
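A small NumPy sketch of this closed-form solution, with made-up data; using np.linalg.solve on the normal equations instead of forming the inverse explicitly is a standard numerical-stability choice:

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y.

    X is assumed to already contain the x0 = 1 column.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0], [2.0], [3.0], [4.0]])])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = fit_least_squares(X, y)
y_test = w @ np.array([1.0, 5.0])  # prediction for a test sample (x0=1, x1=5)
```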
Gradient Descent Optimization
Linear regression can also be solved using the Gradient Descent (GD) optimization approach.
GD can be used in a variety of settings to find the minimum value of functions (including non-linear functions) where a closed-form solution is not available or not easily obtained.
Basic idea:
Given an objective function J(w) (e.g., sum of squared errors), with w as a vector
of variables w0, w1, …, wd, iteratively minimize J(w) by finding the gradient of
the function surface in the variable-space and adjusting the weights in the
opposite direction
The gradient is a vector with each element representing the slope of the function
in the direction of one of the variables
Each element is the partial derivative of the function with respect to one of the variables
$$\nabla J(\mathbf{w}) = \nabla J(w_1, w_2, \ldots, w_d) = \left[ \frac{\partial f(\mathbf{w})}{\partial w_1}, \frac{\partial f(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial f(\mathbf{w})}{\partial w_d} \right]$$
Optimization
An example - quadratic function in 2 variables:
$$f(\mathbf{x}) = f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2$$

f(x) is at a minimum where the gradient of f(x) is zero in all directions.
Optimization
The gradient is a vector.
Each element is the slope of the function along the direction of one of the variables.
Each element is the partial derivative of the function with respect to one of the variables.
Example:
$$f(\mathbf{x}) = f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2$$

$$\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \left[ \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2} \right] = \left[ 2 x_1 + x_2, \;\; x_1 + 6 x_2 \right]$$
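A convenient way to sanity-check such a derivation is to compare the analytic gradient against a central finite-difference approximation; a minimal NumPy sketch (the test point and tolerance are arbitrary choices):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**2 + x1*x2 + 3*x2**2

def grad_f(x):
    x1, x2 = x
    return np.array([2*x1 + x2, x1 + 6*x2])  # the partial derivatives above

x = np.array([1.0, -2.0])   # arbitrary test point
eps = 1e-6

# Central difference along each coordinate direction
numeric = np.array([
    (f(x + eps * np.eye(2)[i]) - f(x - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])

assert np.allclose(grad_f(x), numeric, atol=1e-4)
```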
Optimization
The gradient vector points in the direction of steepest ascent of the function.
[Figure: contour plot of f(x1, x2) over x1 and x2 with gradient vectors shown]
Optimization
This two-variable example is still simple enough that we can find the minimum directly.
$$f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2 \qquad\qquad \nabla f(x_1, x_2) = \left[ 2 x_1 + x_2, \;\; x_1 + 6 x_2 \right]$$
Set both elements of gradient to 0
Gives two linear equations in two variables
Solve for x1, x2
$$2 x_1 + x_2 = 0 \qquad x_1 + 6 x_2 = 0 \quad \Longrightarrow \quad x_1 = 0, \; x_2 = 0$$
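Because the gradient is linear in x1 and x2, setting it to zero is just a 2-by-2 linear system; a one-call NumPy check:

```python
import numpy as np

# Gradient = 0 gives the linear system A x = b:
#   2*x1 + 1*x2 = 0
#   1*x1 + 6*x2 = 0
A = np.array([[2.0, 1.0],
              [1.0, 6.0]])
b = np.zeros(2)

x_min = np.linalg.solve(A, b)   # -> array([0., 0.]), the minimum of f
```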
Optimization
Finding the minimum directly by a closed-form analytical solution is often difficult or impossible:
Quadratic functions in many variables
the system of equations for the partial derivatives may be ill-conditioned
example: linear least squares fit where redundancy among features is high
Other convex functions
a global minimum exists, but there is no closed-form solution
example: maximum likelihood solution for logistic regression
Nonlinear functions
the partial derivatives are not linear
example: f(x1, x2) = x1 sin(x1 x2) + x2²
example: sum of transfer functions in neural networks
Gradient descent optimization
Given an objective (e.g., error) function E(w) = E(w0, w1, …, wd)
Process (follow the gradient downhill):
1. Pick an initial set of weights (random): w = (w0, w1, …, wd)
2. Determine the descent direction: -∇E(wᵗ)
3. Choose a learning rate: η
4. Update your position: wᵗ⁺¹ = wᵗ - η∇E(wᵗ) (note: this step involves simultaneous updating of each weight wi)
5. Repeat from 2) until a stopping criterion is satisfied
Typical stopping criteria:
∇E(wᵗ⁺¹) ≈ 0
some validation metric is optimized
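A minimal generic sketch of this loop in NumPy, with the gradient supplied as a function; the learning rate, tolerance, and iteration cap are illustrative defaults, not values from the slides:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Follow the gradient downhill: w <- w - eta * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)                   # step 2: descent direction is -g
        if np.linalg.norm(g) < tol:   # stopping criterion: gradient ~ 0
            break
        w = w - eta * g               # step 4: simultaneous update of all weights
    return w

# Minimizing the earlier quadratic f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
w_star = gradient_descent(
    lambda w: np.array([2*w[0] + w[1], w[0] + 6*w[1]]),
    w0=[1.0, -2.0],
)  # converges to [0, 0]
```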
Gradient descent optimization
In Least Squares Regression:

$$E(\mathbf{w}) = \sum \Big(y - \sum_{i=0}^{d} w_i x_i\Big)^2 = \sum \big(y - \mathbf{w}^T \cdot \mathbf{x}\big)^2$$

Process (follow the gradient downhill):
1. Select initial w = (w0, w1, …, wd)
2. Compute -∇E(w)
3. Set η
4. Update: w := w - η∇E(w), i.e., for each weight:
$$w_j := w_j - \eta \, \frac{1}{2n} \sum_{i=1}^{n} \big(\mathbf{w}^T \cdot \mathbf{x}_i - y_i\big) \, x_{ij} \qquad \text{for } j = 0, 1, \ldots, d$$
5. Repeat until ∇E(wᵗ⁺¹) ≈ 0
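A sketch of this batch update in NumPy, assuming X already contains the x0 = 1 column; eta and the iteration count are illustrative choices:

```python
import numpy as np

def lms_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Batch gradient descent for least squares regression.

    Implements the per-weight update from the slide:
        w_j := w_j - eta * (1/(2n)) * sum_i (w^T . x_i - y_i) * x_ij
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residuals = X @ w - y               # w^T . x_i - y_i for every sample
        grad = (X.T @ residuals) / (2 * n)  # one component per weight w_j
        w = w - eta * grad                  # all weights updated simultaneously
    return w
```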
Illustration of Gradient Descent

[Figure: sequence of plots of the error surface E(w) over the weights w0 and w1; the direction of steepest descent is the direction of the negative gradient, carrying the original point in weight space to a new point in weight space]
Gradient descent optimization
Problems:
Choosing the step size (learning rate):
too small: convergence is slow and inefficient
too large: may not converge
Can get stuck on “flat” areas of the function
Easily trapped in local minima
Stochastic gradient descent
Application to training a machine learning model:
1. Choose one sample from training set: xi
2. Calculate the objective function for that single sample: $(\mathbf{w}^T \cdot \mathbf{x}_i - y_i)^2$
3. Calculate the gradient from the objective function
4. Update the model parameters a single step based on the gradient and the learning rate:
$$w_j := w_j - \eta \, \big(\mathbf{w}^T \cdot \mathbf{x}_i - y_i\big) \, x_{ij} \qquad \text{for } j = 0, \ldots, d$$
5. Repeat from 1) until stopping criterion is satisfied
Typically the entire training set is processed multiple times before stopping.
The order in which samples are processed can be fixed or random.
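A minimal NumPy sketch of this procedure; the learning rate, epoch count, and shuffling policy are illustrative choices, and X is assumed to contain the x0 = 1 column:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, n_epochs=10, shuffle=True, seed=0):
    """Stochastic gradient descent for least squares regression.

    One sample at a time:  w_j := w_j - eta * (w^T . x_i - y_i) * x_ij
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):        # each epoch processes the whole training set
        order = rng.permutation(n) if shuffle else np.arange(n)
        for i in order:
            err = X[i] @ w - y[i]    # error on this single sample
            w = w - eta * err * X[i] # update every weight in one step
    return w
```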