Lecture 4. Linear Models for Regression
Outline
Linear Regression
Least Squares Solution
Subset Least Squares:
subset selection / forward / backward
Penalized Least Squares:
Ridge Regression
LASSO
Elastic Net (LASSO + Ridge)
Linear Methods for Regression
Input (FEATURES) vector (p-dimensional): X = (X1, X2, …, Xp)
Real-valued OUTPUT: Y
Joint distribution of (Y, X)
Regression function: E(Y | X) = f(X)
Training data: (x1, y1), (x2, y2), …, (xN, yN), for estimation of the input-output relation f.
Linear Model
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j
f(X): the regression function or a good approximation to it
LINEAR in the unknown parameters (weights, coefficients) \beta_0, \beta_1, ..., \beta_p
Features
Quantitative inputs
Any arbitrary but known function of measured
attributes
Transformations of quantitative attributes: g(x), e.g., log,
square, square-root etc.
Basis expansions: e.g., a polynomial approximation of f as
a function of X1 (Taylor Series expansion with unknown
coefficients)
X 2  X , X 3  X ,...X k  X ,
2
1
3
1
k
1
Features (Cont.)
Qualitative (categorical) input G
Dummy Codes: For an attribute with k categories,
may use k codes j = 1,2, …, k, as indicators of the
category (level) used.
Together, this collection of inputs represents the effect of G through \sum_{j=1}^{k} X_j \beta_j
This is a set of level-dependent constants, since only one of the X_j equals one and the others are zero.
Features (cont.)
Interactions: 2nd or higher-order Interactions of some
features, e.g.,
X 3  X1 * X 2 , X 4  X1 * X 2 * X 3
Feature vector for the ith case in training set (Example)
xi  (xi1, xi2, ...,xip )T

Generalized Linear Models:
Basis Expansion
Wide variety of flexible models
Model for f is a linear expansion of basis functions
f(x) = \sum_{k=1}^{K} \theta_k h_k(x)
Dictionary: Prescribed basis functions
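As an aside (not part of the original slides), a minimal R sketch of a basis-expansion fit: the dictionary is a cubic polynomial basis built by poly(), and lm() estimates the coefficients theta_k. The simulated data and the degree 3 are illustrative assumptions.
# Illustrative only: simulated data, cubic polynomial dictionary (assumptions)
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
# poly(x, 3) builds basis functions h_1(x), h_2(x), h_3(x);
# lm() fits the linear expansion f(x) = theta_0 + sum_k theta_k h_k(x)
fit <- lm(y ~ poly(x, 3))
summary(fit)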
Other Basis Functions
Polynomial basis of degree s (smooth functions, C^s)
Fourier series (band-limited functions, a compact subspace of C^∞)
Splines: piecewise polynomials of degree K between the knots, joined with continuity of degree K-1 at the knots (Sobolev spaces)
Wavelets (Besov spaces)
Radial basis functions: symmetric p-dimensional kernels located at particular centroids, f(|x - y|), e.g., a Gaussian kernel at each centroid
And more …
-- Curse of Dimensionality: p could be equal to or much larger than n.
Method of Least Squares
\beta = (\beta_0, \beta_1, ..., \beta_p)^T
Find the coefficients \beta that minimize the Residual Sum of Squares:
RSS(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} (y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j)^2
RSS is the empirical risk over the training set. It does not guarantee predictive performance over all inputs of interest.
Min RSS Criterion
Statistically reasonable, provided the examples in the training set are a large number of independent random draws from the population of inputs for which prediction is desired.
Given the inputs (x1, x2, …, xN), the outputs (y1, y2, …, yN) are conditionally independent.
In principle, predictive performance over the set of future input vectors should be examined.
With Gaussian noise, the least squares method is equivalent to maximum likelihood.
Minimize RSS(\beta) over \beta in R^{p+1}; RSS is a quadratic function of \beta.
Optimal solution: take the derivatives with respect to the elements of \beta and set them equal to zero.
Optimal Solution
The Hessian (second derivative) of the criterion function is X^T X.
The optimal solution satisfies the normal equations:
X^T (Y - X\beta) = 0, or (X^T X)\beta = X^T Y
\hat{\beta} = (X^T X)^{-1} X^T Y
\hat{Y} = X (X^T X)^{-1} X^T Y = HY, where H = X (X^T X)^{-1} X^T
For a unique solution, the matrix X^T X must be of full rank.
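A minimal base-R sketch (not from the slides) of the normal-equations solution and the hat matrix above; the simulated design matrix and coefficients are assumptions.
set.seed(1)
N <- 50; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))   # design matrix with an intercept column
beta <- c(1, 2, -1, 0.5)
Y <- drop(X %*% beta + rnorm(N))
# Solve the normal equations (X^T X) beta = X^T Y (assumes X^T X has full rank)
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)
# Hat matrix and fitted values: Y_hat = H Y
H <- X %*% solve(t(X) %*% X) %*% t(X)
Y_hat <- drop(H %*% Y)
# lm() gives the same coefficients (X already contains the intercept, hence "- 1")
coef(lm(Y ~ X - 1))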
Projection
When the matrix X^T X is of full rank, the estimated response for the training set is
\hat{Y} = X (X^T X)^{-1} X^T Y = HY, with H = X (X^T X)^{-1} X^T
H: Projection (Hat) matrix
HY: Orthogonal Projection of Y on the space spanned by the
columns of X
Note: the projection is linear in Y
Geometrical Insight
Simple Univariate Regression
One variable, no intercept: Y = X\beta + \varepsilon
LS estimate:
\hat{\beta} = \sum_{i=1}^{N} x_i y_i / \sum_{i=1}^{N} x_i^2 = \langle x, y \rangle / \langle x, x \rangle
Inner product: \langle x, y \rangle = \sum_{i=1}^{N} x_i y_i = x^T y
(For unit-norm vectors, the inner product is the cosine of the angle between x and y, a measure of the similarity between y and x.)
Residuals: projection onto the orthogonal (normal) space, r = y - x\hat{\beta}
Definition: “Regress b on a”
Simple regression of response b on input a, with no intercept:
Estimate: \hat{\gamma} = \langle a, b \rangle / \langle a, a \rangle
Residual: b - \hat{\gamma} a, called "b adjusted for a" or "b orthogonalized with respect to a"
Multiple Regression
Multiple regression: p > 1
LS estimates differ from the simple univariate regression estimates, unless the columns of the input matrix X are orthogonal.
If \langle x_i, x_j \rangle = 0 for all i \neq j, then \hat{\beta}_j = \langle x_j, y \rangle / \langle x_j, x_j \rangle.
These estimates are uncorrelated, and Var(\hat{\beta}_j) = \sigma^2 / \langle x_j, x_j \rangle.
Orthogonal inputs occur
sometimes in balanced,
designed experiments
(experimental design).
Observational studies will almost never have orthogonal inputs.
We must "orthogonalize" them in order to have a similar interpretation.
Use Gram-Schmidt
procedure to obtain an
orthogonal basis for multiple
regression
Multiple Regression Estimates:
Sequence of Simple Regressions
Regression by Successive
Orthogonalization:
Initialize z0  x0  1
For, j=1, 2, …, p, Regress
x j on z0 , z1, , z j 1 to produce
coefficients
ölj  zl , x j / zl , zl
and residual vectors
z j  x j   k 1 ökj zk 1
Regress y on the residual vector z p
for the estimate
j
Instead of using x1 and x2, take x1 and z as
features
 zp , y 
ö
p 
 zp , zp 
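A base-R sketch (mine, not the lecture's code) of regression by successive orthogonalization; it checks that regressing y on the last residual z_p reproduces the multiple-regression coefficient of x_p. The simulated data are assumptions, and lm() itself uses a QR decomposition rather than this explicit loop.
set.seed(1)
N <- 100; p <- 3
X <- matrix(rnorm(N * p), N, p)
y <- drop(X %*% c(2, -1, 0.5) + rnorm(N))
Xfull <- cbind(1, X)                         # x_0 = 1, x_1, ..., x_p
Z <- matrix(0, N, p + 1)
Z[, 1] <- 1                                  # z_0 = x_0 = 1
for (j in 2:(p + 1)) {
  zj <- Xfull[, j]
  for (l in 1:(j - 1)) {                     # regress x_j on z_0, ..., z_{j-1}
    gam <- sum(Z[, l] * Xfull[, j]) / sum(Z[, l]^2)
    zj <- zj - gam * Z[, l]                  # subtract the fitted part
  }
  Z[, j] <- zj                               # residual vector z_j
}
beta_p <- sum(Z[, p + 1] * y) / sum(Z[, p + 1]^2)
c(beta_p, coef(lm(y ~ X))[p + 1])            # the two values should agree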
Multiple Regression = Gram-Schmidt
Orthogonalization Procedure
The vector z_p is the residual of the multiple regression of x_p on all the other inputs x_0, x_1, ..., x_{p-1}.
Successive z's in the above algorithm are orthogonal and form an orthogonal basis for the column space of X.
The least squares projection onto this subspace is the usual \hat{y}.
By re-arranging the order of the variables, any input can be labeled as the p-th variable.
If x_j is highly correlated with some of the other variables, the residual z_j is quite small, and the coefficient estimate has high variance.
Statistical Properties of LS
Model: Y = X\beta + \varepsilon, where the noise is uncorrelated with mean zero and variance \sigma^2.
Then Var(\hat{\beta}) = (X^T X)^{-1} \sigma^2.
Noise estimation: \hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2.
Model d.f. = p + 1 (dimension of the model space).
To draw inferences on the parameter estimates, we need assumptions on the noise. If we assume Gaussian noise, then
\hat{\beta} \sim N(\beta, (X^T X)^{-1} \sigma^2)
(N - p - 1)\hat{\sigma}^2 \sim \sigma^2 \chi^2_{N-p-1}
Gauss-Markov Theorem
(The Gauss-Markov Theorem) If c^T y is any linear estimator that is unbiased for a^T \beta, that is, E(c^T y) = a^T \beta, then
Var(a^T \hat{\beta}) \le Var(c^T y)
It says that, for inputs in the row space of X, the LS estimate has minimum variance among all linear unbiased estimates.
Bias-Variance Tradeoff
Mean squared error of an estimator = variance + bias^2
The least squares estimator achieves the minimal variance among all linear unbiased estimators.
There are biased estimators that further reduce variance: Stein's estimator, shrinkage/thresholding (LASSO, etc.).
The more complicated a model is, the more variance but the less bias; a trade-off is needed.
Hypothesis Test
• Single parameter test of \beta_j = 0, t-statistic:
z_j = \hat{\beta}_j / (\hat{\sigma} \sqrt{v_j}) \sim t_{N-p-1}
where v_j is the j-th diagonal element of V = (X^T X)^{-1}.
Confidence interval: \hat{\beta}_j \pm z^{(1-\alpha)} v_j^{1/2} \hat{\sigma}, e.g., z^{(1-0.025)} = 1.96.
• Group of parameters: test whether a group of p_1 - p_0 coefficients equals zero; F-statistic for nested models:
F = [(RSS_0 - RSS_1)/(p_1 - p_0)] / [RSS_1/(N - p_1 - 1)]
Example
R command: lm(Y ~ x1 + x2 + … +xp)
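A hedged R illustration of the tests above (everything beyond the slide's lm() call is mine), using simulated data for the single-parameter t-tests and the nested-model F-test:
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
Y <- 1 + 2 * x1 - x2 + rnorm(n)               # x3 is irrelevant by construction
fit <- lm(Y ~ x1 + x2 + x3)
summary(fit)      # t-statistics and p-values for each test of beta_j = 0
confint(fit)      # confidence intervals for the beta_j
fit0 <- lm(Y ~ x1)                            # smaller nested model
anova(fit0, fit)  # F = ((RSS0 - RSS1)/(p1 - p0)) / (RSS1/(N - p1 - 1))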
Rank Deficiency
X rank deficient: the normal equations have infinitely many solutions.
For an input not in the row space of X, the estimate may change with the solution used.
The hat matrix H and the projection \hat{Y} = X\hat{\beta} are unique.
For an input in the row space of X, the LS estimate is unique.
How to generalize to inputs outside the training set?
Penalized methods (!)
Reasons for Alternatives to LS
Estimates
Prediction accuracy
LS estimates have low bias but
high variance when inputs are
highly correlated
Larger ESPE
Prediction accuracy can
sometimes be improved by
shrinking or setting some
coefficients to zero.
Small bias in estimates may yield
a large decrease in variance
Bias/var tradeoff may provide
better predictive ability
Better interpretation
With a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects.
Many tools to achieve these
objectives
Subset selection
Penalized regression (constrained optimization)
Best Subset Selection Method
Algorithm: leaps & bounds
Find the best subset
corresponding to the smallest
RSS for each size
k {0,1, 2, , p}
For each fixed size k, can also
find a specified number of
subsets close to the best
For each fixed subset, obtain LS
estimates
Feasible for p ~ 40.
Choice of optimal k based on
model selection criteria to be
discussed later
Other Subset Selection Procedures
For larger p, classical procedures:
Forward selection (step-up),
Backward elimination (step down)
Hybrid forward-backward (step-wise) methods
Given a model, these methods only provide local controls for
variable selection or deletion
Which current variable is least effective (candidate for deletion)
Which variable not in the model is most effective (candidate for
inclusion)
Do not attempt to find the best subset of a given size
Not too popular in current practice
Forward Stagewise Selection
(Incremental) forward stagewise: standardize the input variables, then repeatedly move the coefficient of the input most correlated with the current residual by a small step (a sketch follows below).
Note: its relationship to the LASSO and LARS is discussed later.
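A minimal base-R sketch of incremental forward stagewise, following the description of stagewise given with LARS later in these slides; the step size eps, the fixed number of iterations, and the simulated data are my assumptions.
set.seed(1)
N <- 100; p <- 5
X <- scale(matrix(rnorm(N * p), N, p))       # standardized inputs
y <- drop(X[, 1] * 3 - X[, 2] * 2 + rnorm(N))
y <- y - mean(y)                             # centered response, no intercept
beta <- rep(0, p)
r <- y                                       # start with residual r = y
eps <- 0.01                                  # small step size (assumption)
for (it in 1:2000) {                         # fixed number of steps (assumption)
  cors <- drop(crossprod(X, r))              # inner product of each input with the residual
  j <- which.max(abs(cors))                  # most correlated input
  beta[j] <- beta[j] + eps * sign(cors[j])   # move its coefficient by a small step
  r <- r - eps * sign(cors[j]) * X[, j]      # update the residual
}
round(beta, 2)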
Penalized Regression
• Instead of directly minimizing the residual sum of squares, penalized regression usually takes the form
\min_f \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda J(f)
where J(f) is the penalization term, usually penalizing the smoothness or complexity of the function f, and \lambda is chosen by cross-validation.
Model Assessment and Selection
If we are in a data-rich situation, split the data into three parts: training, validation, and test.
See Chapter 7.1 for details.
Cross Validation
When the sample size is not sufficiently large, cross-validation is a way to estimate the out-of-sample prediction error (or classification rate).
Randomly split the available data into a training set and a test set to obtain error_1; split many times to get error_2, ..., error_m; then average over all the errors to get an estimate.
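A base-R sketch (mine) of K-fold cross-validation for a linear model; the number of folds and the simulated data are illustrative assumptions.
set.seed(1)
N <- 100
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N)
dat <- data.frame(x = x, y = y)
K <- 5                                        # number of folds (assumption)
fold <- sample(rep(1:K, length.out = N))      # random split into K folds
cv_err <- numeric(K)
for (k in 1:K) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  fit <- lm(y ~ x, data = train)
  pred <- predict(fit, newdata = test)
  cv_err[k] <- mean((test$y - pred)^2)        # error_k on the held-out fold
}
mean(cv_err)                                  # average over folds: the CV estimate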
Ridge Regression (Tikhonov Regularization)
Ridge regression shrinks coefficients
by imposing a penalty on their size
Minimize a penalized RSS:
\hat{\beta}^{ridge} = \arg\min_{\beta} \{ RSS(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \}
Here \lambda \ge 0 is a complexity parameter that controls the amount of shrinkage: the larger its value, the greater the amount of shrinkage.
Coefficients are shrunk towards zero.
The choice of the penalty parameter is based on cross-validation.
Prostate Cancer Example
Ridge Regression (cont)
Equivalent problem: minimize RSS(\beta) subject to \sum_{j=1}^{p} \beta_j^2 \le s
\lambda is the Lagrange multiplier; there is a 1-1 correspondence between s and \lambda.
With many correlated variables,
LS estimates can become
unstable and exhibit high
variance and high correlations
A wildly large positive coefficient on one variable can be cancelled by a large negative coefficient on another; imposing a size constraint on the coefficients prevents this phenomenon from occurring.
Ridge solutions are not invariant
under scaling of inputs
Normally standardize the inputs
before solving the optimization
problem
Since the penalty term does not
include the bias term, estimate
intercept by the mean of
response y.
Ridge Regression (cont)
The ridge criterion:
RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta
\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T Y
Shrinkage: for orthogonal inputs, ridge gives a scaled version of the LS estimates,
\hat{\beta}^{ridge} = \gamma \hat{\beta}, 0 \le \gamma \le 1
Ridge is the mean (or mode) of the posterior distribution of \beta under a normal prior.
Centered input matrix X; SVD of X: X = U D V^T
U and V are orthogonal matrices: the columns of U span the column space of X, and the columns of V span the row space of X.
D: a diagonal matrix of singular values d_1 \ge d_2 \ge ... \ge d_p \ge 0
Eigen decomposition: X^T X = V D^2 V^T
The eigenvectors v_j are the principal component directions of X (Karhunen-Loeve directions).
Ridge Regression and Principal
Components
First PC direction: v_1
Among all normalized linear combinations of the columns of X, z_1 = X v_1 = u_1 d_1 has the largest sample variance.
The derived variable z_1 is the first principal component of X.
Subsequent PCs z_j have maximum variance subject to being orthogonal to the earlier ones.
The last PC has minimum variance.
Effective Degrees of Freedom
df(\lambda) = tr[X (X^T X + \lambda I)^{-1} X^T] = \sum_{j=1}^{p} d_j^2 / (d_j^2 + \lambda)
Ridge Regression (Summary)
Ridge regression penalizes the complexity of a linear model by the sum of squares of the coefficients.
It is equivalent to minimizing RSS subject to the constraint \sum_j \beta_j^2 \le s.
The matrix (X^T X + \lambda I) is always invertible for \lambda > 0.
The penalization parameter \lambda controls how simple "you" want the model to be.
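A minimal base-R sketch (not the lecture's code) of the closed-form ridge solution, with standardized inputs and a centered response; the value of lambda is arbitrary here and would normally be chosen by cross-validation.
set.seed(1)
N <- 100; p <- 5
X <- scale(matrix(rnorm(N * p), N, p))         # standardize the inputs
y <- drop(X %*% c(3, -2, 0, 0, 1) + rnorm(N))
y <- y - mean(y)                               # intercept estimated by the mean of y
lambda <- 1                                    # illustrative penalty value
# (X^T X + lambda I) is always invertible for lambda > 0
beta_ridge <- drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))
beta_ls    <- drop(solve(t(X) %*% X, t(X) %*% y))   # lambda = 0: ordinary least squares
cbind(ridge = beta_ridge, ls = beta_ls)        # ridge coefficients are shrunk towards 0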
Prostate Cancer Example
Ridge Regression (Summary)
Solutions are not sparse in the coefficient space: the \beta_j's are almost never exactly zero.
The computational complexity is O(p^3) when inverting the matrix X^T X + \lambda I.
Prostate Cancer Example
Least Absolute Shrinkage and
Selection Operator (LASSO)
Penalized RSS with an L1-norm penalty, or equivalently subject to the constraint \sum_{j=1}^{p} |\beta_j| \le t
Shrinks like ridge with its L2-norm penalty, but LASSO coefficients hit exactly zero as the penalty increases.
LASSO as Penalized Regression
• Instead of directly minimizing the residual sum of squares, penalized regression takes the form
\min_{\beta} \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda J(f)
where
J(f) = \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|  and  f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j
LASSO(cont)
The computation is a quadratic programming problem.
We can obtain the solution path, which is piecewise linear in the penalty parameter.
The coefficients are non-linear in the response y (they are linear in y for ridge regression).
The regularization parameter is chosen by cross-validation.
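A hedged sketch using the glmnet package (assumed to be installed; not referenced in the slides), which computes the LASSO path and selects lambda by cross-validation; the simulated data are illustrative.
library(glmnet)                         # assumed available: install.packages("glmnet")
set.seed(1)
N <- 100; p <- 10
X <- matrix(rnorm(N * p), N, p)
y <- drop(X %*% c(3, -2, rep(0, p - 2)) + rnorm(N))
fit <- glmnet(X, y, alpha = 1)          # alpha = 1 gives the LASSO penalty
plot(fit, xvar = "lambda")              # piecewise-linear coefficient paths
cvfit <- cv.glmnet(X, y, alpha = 1)     # choose lambda by cross-validation
coef(cvfit, s = "lambda.min")           # sparse coefficients: many are exactly zero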
LASSO and Ridge
Contour of RRS in the space of ’s
Generalize to L-q norm as penalty
Minimize RSS subject to a constraint on the L-q norm; equivalent to minimizing
RSS + \lambda \sum_{j=1}^{p} |\beta_j|^q
Bridge regression, with ridge (q = 2) and LASSO (q = 1) as special cases; q = 1 is the smallest value giving a convex constraint region.
For q = 0: best subset regression.
For 0 < q < 1, the constraint region is not convex!
Contours of constant values of L-q
norms
Why non-convex norms?
The LASSO is biased. For the scalar problem
\min_{\theta} \frac{1}{2}(y - \theta)^2 + \lambda|\theta|
the solution is the soft-thresholding operator
\hat{\theta} = S_{\lambda}(y) = sign(y) \max(|y| - \lambda, 0)
so for |y| > \lambda the estimate is y shifted by \lambda, and E(\hat{\theta}) \ne E(y) even for large signals.
A non-convex penalty is necessary for an unbiased estimator: for
\min_{\theta} \frac{1}{2}(y - \theta)^2 + \lambda J(\theta)
the first-order condition gives J'(\theta) = (y - \theta)/\lambda, which must tend to 0 as y \to \infty for the estimate to be unbiased at large signals.
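A tiny base-R illustration (mine) of the soft-thresholding operator above; the grid of y values and lambda = 1 are arbitrary.
soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)   # scalar LASSO solution
y <- seq(-3, 3, by = 0.5)
rbind(y = y, theta_hat = soft(y, lambda = 1))   # |y| <= 1 is set to 0; larger |y| is shifted by 1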
Elastic Net as a compromise between Ridge and LASSO
(Zou and Hastie 2005)
The penalty is a weighted combination of the LASSO (L1) and ridge (L2) penalties, e.g., \lambda \sum_{j} ( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| ).
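A hedged sketch of an elastic-net fit with glmnet (assumed installed); in glmnet the mixing parameter alpha interpolates between ridge (alpha = 0) and LASSO (alpha = 1), and alpha = 0.5 here is only an illustrative choice.
library(glmnet)
set.seed(1)
N <- 100; p <- 10
X <- matrix(rnorm(N * p), N, p)
y <- drop(X %*% c(3, -2, rep(0, p - 2)) + rnorm(N))
cvfit <- cv.glmnet(X, y, alpha = 0.5)   # elastic net: a mix of L1 and L2 penalties
coef(cvfit, s = "lambda.min")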
The Group LASSO
Group LASSO
Group norm: l1-l2 (also l1-l∞)
Each group of variables is simultaneously selected or dropped.
Methods using Derived Directions
Principal Components Regression
Partial Least Squares
Principal Components Regression
Regress the response on the first M principal components Z_1, ..., Z_M of X (M < p).
Motivation: the leading eigenvectors describe most of the variability in X.
(Figure: derived directions Z_1, Z_2 in the plane of the inputs X_1, X_2.)
Principal Components Regression
Z_i and Z_j are now orthogonal.
The dimension is reduced.
High correlations between the independent variables are eliminated.
Noise in the X's is (hopefully) removed.
Computation: PCA + regression (as sketched below).
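A base-R sketch (mine) of principal components regression via prcomp() plus lm(); keeping M = 2 components is an arbitrary illustration, and M would normally be chosen by cross-validation.
set.seed(1)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), N, p)
X[, 2] <- X[, 1] + rnorm(N, sd = 0.1)          # two highly correlated inputs
y <- drop(X %*% c(2, 2, 0, -1, 0) + rnorm(N))
pc <- prcomp(X, center = TRUE, scale. = TRUE)  # principal components of X
M <- 2                                         # keep the leading M directions (assumption)
Z <- pc$x[, 1:M]                               # derived, mutually orthogonal inputs Z_1, ..., Z_M
fit <- lm(y ~ Z)                               # regress y on the derived inputs
summary(fit)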
Partial Least Square
Partial Least Squares (PLS)
Uses inputs as well as
response y to form the
directions Zm
Seeks directions that have
high variance and have high
correlation with response y
Popular in Chemometrics
If the original inputs are orthogonal, PLS finds the LS solution after one step; subsequent steps have no effect.
 Since the derived inputs use y, the
estimates are non-linear functions of
the response, when the inputs are not
orthogonal
 The coefficients for original variables
tend to shrink as fewer PLS directions
are used
 Choice of M can be made via cross
validation
Partial Least Squares Algorithm
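The algorithm itself did not survive in this transcript; below is a minimal base-R sketch in the spirit of the standard PLS1 algorithm (standardized inputs, fit initialized at the mean of y, inputs orthogonalized with respect to each derived direction). Treat it as an assumption rather than the lecture's exact statement; the simulated data and M = 2 are illustrative.
set.seed(1)
N <- 100; p <- 5; M <- 2                       # M = number of PLS directions (assumption)
X <- scale(matrix(rnorm(N * p), N, p))         # standardized inputs
y <- drop(X %*% c(3, -2, 0, 0, 1) + rnorm(N))
yhat <- rep(mean(y), N)                        # start from the mean of y
Xm <- X
for (m in 1:M) {
  phi <- drop(crossprod(Xm, y))                # weights: inner products <x_j, y>
  z <- drop(Xm %*% phi)                        # derived direction z_m
  theta <- sum(z * y) / sum(z * z)             # regress y on z_m
  yhat <- yhat + theta * z                     # update the fit
  # orthogonalize each remaining input with respect to z_m
  Xm <- Xm - z %*% t(drop(crossprod(Xm, z)) / sum(z * z))
}
head(yhat)                                     # fitted values after M PLS steps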
PCR vs. PLS
Principal components regression chooses directions that maximize Var(X\alpha).
Partial least squares' m-th direction maximizes Corr^2(y, X\alpha) Var(X\alpha), subject to orthogonality with the previous directions.
The variance term tends to dominate, whence PLS behaves much like ridge regression and PCR.
Ridge, PCR and PLS
Solution paths of the different methods in a two-variable regression case (corr(X1, X2) = ρ, β = (4, 2)).
Comparisons on Estimated Prediction
Errors (Prostate Cancer Example)
Test error (sd):
0.574 (0.156), 0.491 (0.152), 0.540 (0.168), 0.527 (0.122), 0.636 (0.172)
Least squares: test error 0.586, sd of error (0.184)
LASSO and Forward Stagewise
Diabetes Data
LASSO and Forward Stagewise
Least Angle Regression (LARS)
Efron, Hastie, Johnstone, and Tibshirani (2003)
Recall: Forward Stagewise Selection
(Incremental) forward stagewise: standardize the input variables, then take small steps in the coefficient of the input most correlated with the current residual.
LAR directions and Example
Relationship between those three
• Lasso and forward stagewise can be thought of as restricted versions of LARS.
• For Lasso: start with LARS; if a coefficient crosses zero, stop, drop that predictor, recompute the best direction, and continue. This gives the LASSO path.
• For stagewise: start with LARS; select the most correlated direction at each stage and move in that direction with a small step.
• There are other related methods:
•Orthogonal Matching Pursuit
•Linearized Bregman Iteration
Homework Project I
Keyword Pricing (regression)
Homework Project II
Click Prediction (classification): two subproblems
click/impression
click/bidding
Data Directory: /data/ipinyou/
Files:
bid.20130301.txt: Bidding log file, 1.2M rows, 470MB
imp.20130301.txt: Impression log, 0.8M rows, 360MB
clk.20130301.txt: Click log file, 796 rows, 330KB
data.zip: compressed files above (Password: ipinyou2013)
dsp_bidding_data_format.pdf: format file
Region&citys.txt: Region and City code
Questions: [email protected]
Homework Project II
Data input in R:
bid <- read.table("/Users/Liaohairen/DSP/bid.20130301.txt", sep='\t', comment.char='')
imp <- read.table("/Users/Liaohairen/DSP/imp.20130301.txt", sep='\t', comment.char='')
By default, R's read.table uses '#' as the comment character (comment.char = '#'), but the user-agent field may contain '#'. To read the files correctly, turn off the interpretation of comments by setting comment.char=''.
Homework Project III
Heart Operation Effect Prediction (classification)
Note: large amount of missing values.