Gradient Descent Optimization
Classification and Prediction: Regression via Gradient Descent Optimization
Bamshad Mobasher
DePaul University
Linear Regression
Linear regression: involves a response variable y and a
single predictor variable x
y = w0 + w1 x
The weights w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
w0 and w1 are obtained by minimizing the sum of the squared errors (a.k.a.
residuals)
$$\mathrm{SSE} = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \big(y_i - (w_0 + w_1 x_i)\big)^2$$

w1 can be obtained by setting the partial derivative of the SSE to 0 and solving for w1, ultimately resulting in:

$$w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad\qquad w_0 = \bar{y} - w_1 \bar{x}$$
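A minimal NumPy sketch of these closed-form estimates; the data values here are made up for illustration:

```python
import numpy as np

# Illustrative 1-D data (not from the slides)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared deviations in x
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: forces the fitted line through (x_bar, y_bar)
w0 = y_bar - w1 * x_bar

y_hat = w0 + w1 * x
sse = np.sum((y - y_hat) ** 2)  # the quantity being minimized
```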
Multiple Linear Regression
Multiple linear regression: involves more than one predictor variable
Features represented as x1, x2, …, xd
Training data is of the form (x1, y1), (x2, y2),…, (xn, yn)
(each xj is a row vector in matrix X, i.e. a row in the data)
For a specific value of a feature xi in data item xj we use: xij
Ex. For 2-D data, the regression function is:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2$$

[Figure: regression plane over features x1 and x2 with response y]
More generally:

$$\hat{y} = f(x_1, \ldots, x_d) = w_0 + \sum_{i=1}^{d} w_i x_i = \mathbf{w}^T \cdot \mathbf{x}$$
Least Squares Generalization
Multiple dimensions
To simplify, add a new feature x0 = 1 to the feature vector x:
[Table: each data row gains a constant column x0 = 1 alongside the original features x1, x2, ..., and the response y]
$$\hat{y} = f(x_0, x_1, \ldots, x_d) = w_0 x_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \cdot \mathbf{x}$$
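A minimal sketch of this augmentation in NumPy; the matrix and weight values are illustrative, not from the slides:

```python
import numpy as np

# Hypothetical data matrix: n rows (samples) by d columns (features)
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 2.0]])

# Prepend a column of ones so the intercept w0 becomes the weight of x0
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.array([0.5, 1.0, -2.0])  # (w0, w1, w2), illustrative values

y_hat = X_aug @ w               # computes w^T . x for every row at once
```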
Least Squares Generalization
$$\hat{y} = f(x_0, x_1, \ldots, x_d) = f(\mathbf{x}) = w_0 x_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \cdot \mathbf{x}$$
Calculate the error function (SSE) and determine w:
$$E(\mathbf{w}) = \sum_{j=1}^{n} \big(y_j - f(\mathbf{x}_j)\big)^2 = \sum_{j=1}^{n} \Big(y_j - \sum_{i=0}^{d} w_i x_{ij}\Big)^2 = (\mathbf{y} - X\mathbf{w})^T (\mathbf{y} - X\mathbf{w})$$

where y is the vector of all training responses y_j and X is the matrix of all training samples x_j.

The closed-form solution to $\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 0$ is:

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y} \qquad\qquad \hat{y}_{test} = \mathbf{w}^T \mathbf{x}_{test} \ \text{ for test sample } \mathbf{x}_{test}$$
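A small NumPy sketch of this closed-form solution, with made-up data; using np.linalg.solve on the normal equations instead of forming the inverse explicitly is a standard numerical-stability choice:

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y.

    X is assumed to already contain the x0 = 1 column.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage
X = np.hstack([np.ones((4, 1)),
               np.array([[1.0], [2.0], [3.0], [4.0]])])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = fit_least_squares(X, y)
y_test = w @ np.array([1.0, 5.0])  # prediction for a test sample (x0=1, x1=5)
```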
Gradient Descent Optimization
Linear regression can also be solved using the Gradient Descent (GD) optimization approach.
GD can be used in a variety of settings to find the minimum value of functions (including non-linear functions) where a closed-form solution is not available or not easily obtained.
Basic idea:
Given an objective function J(w) (e.g., sum of squared errors), with w as a vector
of variables w0, w1, …, wd, iteratively minimize J(w) by finding the gradient of
the function surface in the variable-space and adjusting the weights in the
opposite direction
The gradient is a vector with each element representing the slope of the function
in the direction of one of the variables
Each element is the partial derivative of the function with respect to one of the variables
$$\nabla J(\mathbf{w}) = \nabla J(w_1, w_2, \ldots, w_d) = \left[ \frac{\partial f(\mathbf{w})}{\partial w_1}, \frac{\partial f(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial f(\mathbf{w})}{\partial w_d} \right]$$
Optimization
An example - quadratic function in 2 variables:
$$f(\mathbf{x}) = f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2$$

f(x) is at a minimum where the gradient of f(x) is zero in all directions.
Optimization
The gradient is a vector.
Each element is the slope of the function along the direction of one of the variables.
Each element is the partial derivative of the function with respect to one of the variables.
Example:
$$f(\mathbf{x}) = f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2$$

$$\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \left[ \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2} \right] = \left[ 2 x_1 + x_2, \;\; x_1 + 6 x_2 \right]$$
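A convenient way to sanity-check such a derivation is to compare the analytic gradient against a central finite-difference approximation; a minimal NumPy sketch (the test point and tolerance are arbitrary choices):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**2 + x1*x2 + 3*x2**2

def grad_f(x):
    x1, x2 = x
    return np.array([2*x1 + x2, x1 + 6*x2])  # the partial derivatives above

x = np.array([1.0, -2.0])   # arbitrary test point
eps = 1e-6

# Central difference along each coordinate direction
numeric = np.array([
    (f(x + eps * np.eye(2)[i]) - f(x - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])

assert np.allclose(grad_f(x), numeric, atol=1e-4)
```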
Optimization
The gradient vector points in the direction of steepest ascent of the function.
[Figure: contour plot of f(x1, x2) over x1 and x2 with gradient vectors shown]
Optimization
This two-variable example is still simple enough that we can find the minimum directly.
$$f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2 \qquad\qquad \nabla f(x_1, x_2) = \left[ 2 x_1 + x_2, \;\; x_1 + 6 x_2 \right]$$
Set both elements of gradient to 0
Gives two linear equations in two variables
Solve for x1, x2
$$2 x_1 + x_2 = 0 \qquad x_1 + 6 x_2 = 0 \quad \Longrightarrow \quad x_1 = 0, \; x_2 = 0$$
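Because the gradient is linear in x1 and x2, setting it to zero is just a 2-by-2 linear system; a one-call NumPy check:

```python
import numpy as np

# Gradient = 0 gives the linear system A x = b:
#   2*x1 + 1*x2 = 0
#   1*x1 + 6*x2 = 0
A = np.array([[2.0, 1.0],
              [1.0, 6.0]])
b = np.zeros(2)

x_min = np.linalg.solve(A, b)   # -> array([0., 0.]), the minimum of f
```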
Optimization
Finding the minimum directly by a closed-form analytical solution is often difficult or impossible:
Quadratic functions in many variables
the system of equations for the partial derivatives may be ill-conditioned
example: linear least squares fit where redundancy among features is high
Other convex functions
a global minimum exists, but there is no closed-form solution
example: maximum likelihood solution for logistic regression
Nonlinear functions
the partial derivatives are not linear
example: f(x1, x2) = x1 sin(x1 x2) + x2²
example: sum of transfer functions in neural networks
Gradient descent optimization
Given an objective (e.g., error) function E(w) = E(w0, w1, …, wd)
Process (follow the gradient downhill):
1. Pick an initial set of weights (random): w = (w0, w1, …, wd)
2. Determine the descent direction: -∇E(wᵗ)
3. Choose a learning rate: η
4. Update your position: wᵗ⁺¹ = wᵗ - η∇E(wᵗ) (note: this step involves simultaneous updating of each weight wi)
5. Repeat from 2) until a stopping criterion is satisfied
Typical stopping criteria:
∇E(wᵗ⁺¹) ≈ 0
some validation metric is optimized
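A minimal generic sketch of this loop in NumPy, with the gradient supplied as a function; the learning rate, tolerance, and iteration cap are illustrative defaults, not values from the slides:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Follow the gradient downhill: w <- w - eta * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad(w)                   # step 2: descent direction is -g
        if np.linalg.norm(g) < tol:   # stopping criterion: gradient ~ 0
            break
        w = w - eta * g               # step 4: simultaneous update of all weights
    return w

# Minimizing the earlier quadratic f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
w_star = gradient_descent(
    lambda w: np.array([2*w[0] + w[1], w[0] + 6*w[1]]),
    w0=[1.0, -2.0],
)  # converges to [0, 0]
```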
Gradient descent optimization
In Least Squares Regression:

$$E(\mathbf{w}) = \sum \Big(y - \sum_{i=0}^{d} w_i x_i\Big)^2 = \sum \big(y - \mathbf{w}^T \cdot \mathbf{x}\big)^2$$

Process (follow the gradient downhill):
1. Select initial w = (w0, w1, …, wd)
2. Compute -∇E(w)
3. Set η
4. Update: w := w - η∇E(w), i.e., for each weight:
$$w_j := w_j - \eta \, \frac{1}{2n} \sum_{i=1}^{n} \big(\mathbf{w}^T \cdot \mathbf{x}_i - y_i\big) \, x_{ij} \qquad \text{for } j = 0, 1, \ldots, d$$
5. Repeat until ∇E(wᵗ⁺¹) ≈ 0
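A sketch of this batch update in NumPy, assuming X already contains the x0 = 1 column; eta and the iteration count are illustrative choices:

```python
import numpy as np

def lms_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Batch gradient descent for least squares regression.

    Implements the per-weight update from the slide:
        w_j := w_j - eta * (1/(2n)) * sum_i (w^T . x_i - y_i) * x_ij
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residuals = X @ w - y               # w^T . x_i - y_i for every sample
        grad = (X.T @ residuals) / (2 * n)  # one component per weight w_j
        w = w - eta * grad                  # all weights updated simultaneously
    return w
```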
Illustration of Gradient Descent

[Figure: sequence of plots of the error surface E(w) over the weights w0 and w1; the direction of steepest descent is the direction of the negative gradient, carrying the original point in weight space to a new point in weight space]
Gradient descent optimization
Problems:
Choosing the step size (learning rate):
too small: convergence is slow and inefficient
too large: may not converge
Can get stuck on “flat” areas of the function
Easily trapped in local minima
Stochastic gradient descent
Application to training a machine learning model:
1. Choose one sample from training set: xi
2. Calculate the objective function for that single sample: $(\mathbf{w}^T \cdot \mathbf{x}_i - y_i)^2$
3. Calculate the gradient from the objective function
4. Update the model parameters a single step based on the gradient and the learning rate:
$$w_j := w_j - \eta \, \big(\mathbf{w}^T \cdot \mathbf{x}_i - y_i\big) \, x_{ij} \qquad \text{for } j = 0, \ldots, d$$
5. Repeat from 1) until stopping criterion is satisfied
Typically the entire training set is processed multiple times before stopping.
The order in which samples are processed can be fixed or random.
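A minimal NumPy sketch of this procedure; the learning rate, epoch count, and shuffling policy are illustrative choices, and X is assumed to contain the x0 = 1 column:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, n_epochs=10, shuffle=True, seed=0):
    """Stochastic gradient descent for least squares regression.

    One sample at a time:  w_j := w_j - eta * (w^T . x_i - y_i) * x_ij
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):        # each epoch processes the whole training set
        order = rng.permutation(n) if shuffle else np.arange(n)
        for i in order:
            err = X[i] @ w - y[i]    # error on this single sample
            w = w - eta * err * X[i] # update every weight in one step
    return w
```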