Non/Semiparametric Regression
and Clustered/Longitudinal Data
Raymond J. Carroll
Texas A&M University
http://stat.tamu.edu/~carroll
[email protected]
Postdoctoral Training Program:
http://stat.tamu.edu/B3NC
Where am I From?
[Map of Texas: Wichita Falls (my hometown), Big Bend National Park, I-35, I-45, and College Station, home of Texas A&M.]
Acknowledgments
• Raymond Carroll
• Naisyin Wang
• Oliver Linton
• Enno Mammen
• Alan Welsh
• Xihong Lin
A series of papers is on my web site:
• Lin, Wang and Welsh: longitudinal data (Mammen & Linton for pseudo-observation methods)
• Linton and Mammen: time series data
Outline
• Longitudinal models:
• panel data
• Background:
• splines = kernels for independent data
• Correlated data:
• do splines = kernels?
• Semiparametric case:
• partially linear model:
• does it matter what nonparametric method is used?
Panel Data (for simplicity)
• i = 1,…,n clusters/individuals
• j = 1,…,m observations per cluster
Subject | Wave 1 | Wave 2 | … | Wave m
1       | X      | X      | … | X
2       | X      | X      | … | X
…       |        |        |   |
n       | X      | X      | … | X
Panel Data (for simplicity)
• i = 1,…,n clusters/individuals
• j = 1,…,m observations per cluster
• Important points:
• The cluster size m is meant to be fixed
• This is not a multiple time series problem
where the cluster size increases to infinity
• Some comments on the single time series
problem are given near the end of the talk
The Marginal Nonparametric Model
• Y = Response
• X = time-varying covariate
Y_ij = Θ(X_ij) + ε_ij,  Θ(•) an unknown function
cov(ε_i) = Σ, where ε_i = (ε_i1, …, ε_im)^T
• Question: can we improve efficiency by
accounting for correlation?
The Marginal Nonparametric Model
• Important assumption
• Covariates at other waves are not conditionally
predictive, i.e., they are surrogates
E(Y_ij | X_ij, X_ik for k ≠ j) = Θ(X_ij)
• This assumption is required for any GLS fit,
including parametric GLS
Independent Data
• Splines (smoothing, P-splines, etc.) with penalty parameter λ:
minimize Σ_{i=1}^n [Y_i - Θ(X_i)]^T [Y_i - Θ(X_i)] + λ ∫ {Θ''(t)}² dt
• Ridge regression fit
• Some bias, smaller variance
• λ → 0 is over-parameterized least squares
• λ → ∞ is a polynomial regression
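To make the criterion concrete, here is a minimal numpy sketch of a P-spline fit; the truncated-power basis, the knot placement, and the ridge-style penalty matrix (standing in for the integrated squared second derivative) are illustrative assumptions, not specifics from the talk.

```python
import numpy as np

def pspline_fit(x, y, lam, n_knots=20, degree=2):
    """P-spline fit: ridge regression on a truncated-power basis.

    Minimizes ||y - B c||^2 + lam * c' D c, a discrete stand-in for the
    integrated squared-derivative penalty on Theta.
    """
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    # Design matrix: polynomial part plus truncated-power spline part
    B = np.column_stack([x ** p for p in range(degree + 1)] +
                        [np.maximum(x - k, 0.0) ** degree for k in knots])
    # Penalize only the spline coefficients (a ridge regression fit)
    D = np.diag([0.0] * (degree + 1) + [1.0] * n_knots)
    coef = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
    return B @ coef

# lam -> 0: over-parameterized least squares; lam -> infinity: the fit
# shrinks to a polynomial regression, matching the two limits above.
```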
Independent Data
• Kernels (local averages, local linear, etc.), with kernel density function K and bandwidth h:
Θ̂(t) = [n^{-1} Σ_{i=1}^n K((X_i - t)/h) Y_i] / [n^{-1} Σ_{i=1}^n K((X_i - t)/h)]
• As the bandwidth h → 0, only observations with X near t get any weight in the fit
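A minimal sketch of this local-average (Nadaraya-Watson) estimator; the Gaussian kernel is an illustrative choice of K.

```python
import numpy as np

def nw_kernel_fit(t, x, y, h):
    """Nadaraya-Watson local-average estimate of Theta at the points t."""
    t = np.atleast_1d(t)
    # Kernel weights K((x_i - t)/h) for every (t, x_i) pair
    w = np.exp(-0.5 * ((x[None, :] - t[:, None]) / h) ** 2)
    # Weighted average of the responses; as h -> 0 only x_i near t matter
    return (w @ y) / w.sum(axis=1)
```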
Independent Data
• Major methods
• Splines
• Kernels
• Smoothing parameters required for both
• Fits: similar in many (most?) datasets
• Expectation: some combination of bandwidths and kernel functions looks like splines
Independent Data
• Splines and kernels are linear in the responses
Θ̂(t) = n^{-1} Σ_{i=1}^n G_n(t, X_i) Y_i
• Silverman showed that there is a kernel function
and a bandwidth so that the weight functions
G_n(t, x) are asymptotically equivalent
• In this sense, splines = kernels
• This talk is about the same result for correlated
data
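Because the fit is linear in the responses, the weight function G_n(t, ·) of any smoother can be read off numerically by smoothing indicator vectors; plots like the figure below can be produced this way. A sketch, assuming a hypothetical `smooth(t, x, y)` that returns the fitted value at t:

```python
import numpy as np

def smoother_weights(smooth, t, x):
    """Recover the weights a linear smoother puts on each observation.

    Feeding the smoother the indicator response e_i extracts column i of
    the smoother matrix, i.e. (up to the 1/n factor) G_n(t, x_i).
    """
    n = len(x)
    weights = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        weights[i] = smooth(t, x, e)
    return weights  # plot against x to inspect locality
```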
[Figure: the weight functions G_n(t = 0.25, x) in a specific case for independent data; panels: Kernel, Smoothing Spline.]
Note the similarity of shape and the locality: only X’s near t = 0.25 get any weight.
Working Independence
• Working independence: Ignore all correlations
• Fix up standard errors at the end
• Advantage: the assumption
E(Y_ij | X_ij, X_ik for k ≠ j) = Θ(X_ij)
is not required
• Disadvantage: possible severe loss of efficiency
if carried too far
Working Independence
• Working independence:
• Ignore all correlations
• Should posit some reasonable marginal variances
• Weighting important for efficiency
• Weighted versions: Splines and kernels have
obvious analogues
• Standard methods: Zeger & Diggle; Hoover, Rice, Wu & Yang; Lin & Ying; etc.
Working Independence
• Working independence:
• Weighted splines and weighted kernels are linear
in the responses
• The Silverman result still holds
• In this sense, splines = kernels
Accounting for Correlation
• Splines have an obvious analogue for non-independent data
• Let Σ_w be a working covariance matrix
• Penalized generalized least squares (GLS):
minimize Σ_{i=1}^n [Y_i - Θ(X_i)]^T Σ_w^{-1} [Y_i - Θ(X_i)] + λ ∫ {Θ''(t)}² dt
• GLS ridge regression
• Because splines are based on likelihood ideas,
they generalize quickly to new problems
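A numpy sketch of this penalized GLS criterion; the `basis` function and the roughness matrix `penalty` (standing in for the integral penalty) are assumed supplied by the caller and are not specified in the talk.

```python
import numpy as np

def gls_spline_fit(X, Y, Sigma_w, lam, basis, penalty):
    """Penalized GLS spline for clustered data; X, Y have shape (n, m).

    basis(x) maps a length-m array to an (m x p) design matrix;
    penalty is the (p x p) roughness matrix.
    """
    Sw_inv = np.linalg.inv(Sigma_w)
    A = lam * penalty
    b = np.zeros(penalty.shape[0])
    for Xi, Yi in zip(X, Y):      # accumulate the GLS normal equations
        Bi = basis(Xi)            # basis evaluated at cluster i
        A = A + Bi.T @ Sw_inv @ Bi
        b = b + Bi.T @ Sw_inv @ Yi
    coef = np.linalg.solve(A, b)
    return coef                   # fitted curve: basis(t_grid) @ coef
```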
Accounting for Correlation
• Splines have an obvious analogue for non-independent data
• Kernels are not so obvious
• Local likelihood kernel ideas are standard in
independent data problems
• Most attempts at kernels for correlated data have
tried to use local likelihood kernel methods
Kernels and Correlation
• Problem: how to define locality for kernels?
• Goal: estimate the function at t
• Let K(t, X_i) be a diagonal matrix of standard kernel weights
• Standard kernel method: GLS pretending the inverse covariance matrix is
K^{1/2}(t, X_i) Σ_w^{-1} K^{1/2}(t, X_i)
• The estimate is inherently local
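A sketch of this standard kernel method as a local-constant GLS fit at the point t; the Gaussian kernel is an illustrative choice, and X, Y are (n, m) arrays of clustered data.

```python
import numpy as np

def local_gls_kernel(t, X, Y, Sigma_w, h):
    """Local-constant GLS estimate of Theta(t), pretending the inverse
    covariance is K^{1/2}(t, X_i) Sigma_w^{-1} K^{1/2}(t, X_i)."""
    Sw_inv = np.linalg.inv(Sigma_w)
    num = den = 0.0
    for Xi, Yi in zip(X, Y):
        # Square root of the diagonal Gaussian kernel-weight matrix
        K_half = np.diag(np.exp(-0.25 * ((Xi - t) / h) ** 2))
        W = K_half @ Sw_inv @ K_half
        one = np.ones_like(Yi)
        num += one @ W @ Yi
        den += one @ W @ one
    return num / den  # inherently local: far-away clusters get ~0 weight
```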
Kernels and Correlation
[Figure: the weight functions G_n(t = 0.25, x) in a specific case: m = 3, n = 35, exchangeable correlation structure. Red: r = 0.0; Green: r = 0.4; Blue: r = 0.8.]
Note the locality of the kernel method.
Splines and Correlation
[Figure: the weight functions G_n(t = 0.25, x) in a specific case: m = 3, n = 35, exchangeable correlation structure. Red: r = 0.0; Green: r = 0.4; Blue: r = 0.8.]
Note the lack of locality of the spline method.
Splines and Correlation
[Figure: the weight functions G_n(t = 0.25, x) in a specific case: m = 3, n = 35, complex correlation structure. Red: nearly singular; Green: r = 0.0; Blue: r = AR(0.8).]
Note the lack of locality of the spline method.
Splines and Standard Kernels
• Accounting for correlation:
• Standard kernels remain local
• Splines are not local
• Numerical results can be confirmed theoretically
• Don’t we want our nonparametric regression
estimates to be local?
Results on Kernels and Correlation
• GLS with weights K^{1/2}(t, X_i) Σ_w^{-1} K^{1/2}(t, X_i)
• Optimal working covariance matrix is working
independence!
• Using the correct covariance matrix
• Increases variance
• Increases MSE
• Splines ≠ Kernels (or at least these kernels)
Pseudo-Observation Kernel Methods
• Better kernel methods are possible
• Pseudo-observation: original method
• Construction: specific linear transformation of Y
• Mean = Θ(X)
• Covariance = diagonal matrix
Γ = Σ_w^{-1/2},  Λ = diag(Γ)
Y_i* = Y_i + Λ^{-1} (Γ - Λ) [Y_i - Θ(X_i)]
• This adjusts the original responses without affecting
the mean
Pseudo-Observation Kernel Methods
Y_i* = Y_i + Λ^{-1} (Γ - Λ) [Y_i - Θ(X_i)]
• Construction: specific linear transformation of Y
• Mean = Θ(X)
• Covariance = diagonal
• Iterative: the unknown Θ(X_i) in the transformation is replaced by the current fit
• Efficiency: More efficient than working
independence
• Proof of Principle: kernel methods can be
constructed to take advantage of correlation
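A sketch of one pass of this construction, using the reconstruction Γ = Σ_w^{-1/2}, Λ = diag(Γ) displayed above; in practice Θ(X_i) is the current fit, and the update alternates with a working-independence smoother.

```python
import numpy as np

def pseudo_observations(Y, theta_hat, Sigma_w):
    """One pseudo-observation update: Y* = Y + Lam^{-1}(Gam - Lam)(Y - Theta).

    Leaves the mean Theta(X) unchanged; when Sigma_w is the true covariance,
    the transformed errors Lam^{-1} Gam eps have diagonal covariance Lam^{-2}.
    Y, theta_hat: (n, m) arrays.
    """
    # Symmetric inverse square root of the working covariance
    vals, vecs = np.linalg.eigh(Sigma_w)
    Gam = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Lam = np.diag(np.diag(Gam))
    adjust = np.linalg.inv(Lam) @ (Gam - Lam)
    return Y + (Y - theta_hat) @ adjust.T
```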
Efficiency of Splines and Pseudo-Observation Kernels
• Exchng: exchangeable with correlation 0.6
• AR: autoregressive with correlation 0.6
• Near Sing: a nearly singular matrix
[Bar chart: efficiency of the spline and pseudo-observation kernel (P-kernel) methods under the Exchng, AR, and Near Sing covariance structures; vertical axis from 0 to 3.5.]
Better Kernel Methods: SUR
• Simulations of the original pseudo-observation
method: it is not as efficient as splines
• Suggests room for a better estimate
• Naisyin Wang: her talk will describe an even
better kernel method
• Basis: seemingly unrelated regression ideas
• Generalizable: based on likelihood ideas
SUR Kernel Methods
• It is well known that the GLS spline has an
exact, analytic expression
• We have shown that the Wang SUR kernel
method has an exact, analytic expression
• Both methods are linear in the responses
SUR Kernel Methods
• The two methods differ only in one matrix term
• This turns out to be exactly the same matrix
term considered by Silverman in his work
• Relatively nontrivial calculations show that
Silverman’s result still holds
• Splines = SUR Kernels
Nonlocality
• The lack of locality of GLS splines and SUR
kernels is surprising
• Suppose we want to estimate the function at t
• If any observation has an X near t, then all
observations in the cluster contribute to the fit,
not just those with covariates near t
• Splines, pseudo-kernels and SUR kernels all
borrow strength
Nonlocality
• Wang’s SUR kernels = BLUP-like pseudo-kernels with a clever linear transformation. Let
Σ^{-1} = (σ^{jk})_{j,k}
• SUR kernels are working independence kernels applied to pseudo-observations, with weights σ^{jj}:
pseudo-obs: Y_ij* = Y_ij + Σ_{k ≠ j} (σ^{jk} / σ^{jj}) [Y_ik - Θ(X_ik)]
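A sketch of the SUR pseudo-observation transform in the display above; Y and theta_hat are (n, m) arrays holding the responses and the current fit, and the returned weights σ^{jj} feed a working-independence kernel fit.

```python
import numpy as np

def sur_pseudo_observations(Y, theta_hat, Sigma):
    """SUR pseudo-observations: each component is adjusted by the residuals
    of its cluster-mates, then smoothed by W.I. kernels with weights s^{jj}."""
    S = np.linalg.inv(Sigma)           # (sigma^{jk})
    resid = Y - theta_hat              # Y_ik - Theta(X_ik)
    off = S - np.diag(np.diag(S))      # off-diagonal entries sigma^{jk}
    d = np.diag(S)                     # sigma^{jj}
    # Add sum_{k != j} (sigma^{jk} / sigma^{jj}) * resid_ik to each Y_ij
    Y_star = Y + (resid @ off.T) / d
    return Y_star, d                   # pseudo-obs and W.I. weights
```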
Locality of Kernels
pseudo-obs: Y_ij* = Y_ij + Σ_{k ≠ j} (σ^{jk} / σ^{jj}) [Y_ik - Θ(X_ik)]
• Original pseudo-observation method: pseudo
observations uncorrelated
• SUR kernels: pseudo-observations are
correlated
• SUR kernels are not local
• SUR kernels are local in (the same!) pseudo-observations
Locality of Splines
• Splines = SUR kernels (Silverman-type result)
• GLS spline:
• Iterative
• standard independent spline smoothing
• SUR pseudo-observations at each iteration
• GLS splines are not local
• GLS splines are local in (the same!) pseudo-observations
Time Series Problems
• Time series problems: many of the same issues
arise
• Original pseudo-observation method
• Two stages
• Linear transformation
• Mean Θ(X)
• Independent errors
• Single standard kernel applied
• Potential for great gains in efficiency (even
infinite for AR problems with large correlation)
Time Series: AR(1) Illustration, First
Pseudo Observation Method
• AR(1), correlation ρ:
ε_t - ρ ε_{t-1} = u_t (white noise)
Y_t^0 = Y_t - ρ [Y_{t-1} - Θ(X_{t-1})]
• Regress Y_t^0 on X_t
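A sketch of this first pseudo-observation step; theta_hat holds the current fit at the X's, and the returned series is regressed on X_t with any independent-data smoother.

```python
import numpy as np

def ar1_pseudo_obs(Y, theta_hat, rho):
    """Y_t^0 = Y_t - rho * (Y_{t-1} - Theta(X_{t-1})): mean Theta(X_t),
    white-noise errors u_t. Pair the result with X[1:]."""
    Y = np.asarray(Y)
    theta_hat = np.asarray(theta_hat)
    return Y[1:] - rho * (Y[:-1] - theta_hat[:-1])
```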
Time Series Problems
• More efficient methods can be constructed
• Series of regression problems: for all j,
• Pseudo-observations Y_t^j
• Mean Θ(X_{t-j})
• White noise errors
• Regress for each j: fits are asymptotically independent
• Then weighted average
• Time series version of SUR-kernels for
longitudinal data?
Time Series: AR(1) Illustration, New
Pseudo Observation Method
• AR(1), correlation ρ:
ε_t - ρ ε_{t-1} = u_t
Y_t^0 = Y_t - ρ [Y_{t-1} - Θ(X_{t-1})]
Y_t^1 = Y_{t-1} - ρ^{-1} [Y_t - Θ(X_t)]
• Regress Y_t^0 on X_t and Y_t^1 on X_{t-1}
• Weights: 1 and ρ²
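A sketch extending the previous step with the second pseudo-observation series and the weights 1 and ρ² noted above:

```python
import numpy as np

def ar1_sur_pseudo_obs(Y, theta_hat, rho):
    """Both pseudo-observation series for the improved AR(1) method.

    Y_t^1 has mean Theta(X_{t-1}) and error -u_t / rho, so its fit carries
    weight rho^2 relative to the weight-1 fit from Y_t^0; the two fits are
    asymptotically independent and are combined by a weighted average.
    """
    Y = np.asarray(Y)
    theta_hat = np.asarray(theta_hat)
    Y0 = Y[1:] - rho * (Y[:-1] - theta_hat[:-1])   # regress on X[1:]
    Y1 = Y[:-1] - (Y[1:] - theta_hat[1:]) / rho    # regress on X[:-1]
    return Y0, Y1, (1.0, rho ** 2)
```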
Time Series Problems
• AR(1) errors with correlation ρ
• Efficiency of the original pseudo-observation method relative to working independence:
1 / (1 - ρ²) → ∞ as ρ → 1
• Efficiency of the new (SUR?) pseudo-observation method relative to the original method:
1 + ρ² → 2 as ρ → 1
The Semiparametric Model
• Y = Response
• X,Z = time-varying covariates
Y_ij = Z_ij β + Θ(X_ij) + ε_ij,  cov(ε_i) = Σ
• Question: can we improve efficiency for β by accounting for correlation?
Profile Methods
Y_ij = Z_ij β + Θ(X_ij) + ε_ij,  cov(ε_i) = Σ
• Given β, solve for Θ, say Θ(X_ij, β)
• Basic idea: regress Y_ij* = Y_ij - Z_ij β on X_ij, using any of:
• Working independence
• Standard kernels
• Pseudo-observation kernels
• SUR kernels
Profile Methods
Y_ij = Z_ij β + Θ(X_ij) + ε_ij,  cov(ε_i) = Σ
• Given β, solve for Θ, say Θ(X_ij, β)
• Then fit GLS or W.I. to the model with mean Z_ij β + Θ(X_ij, β)
• Question: does it matter what kernel method is used?
• Question: How bad is using W.I. everywhere?
• Question: are there efficient choices?
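A sketch of the profile iteration; `smooth(X, R)` can be any of the smoothers above fitting Θ to partial residuals, and the working-independence least-squares β-step is a deliberate simplification of the GLS step described here.

```python
import numpy as np

def profile_beta(X, Y, Z, smooth, beta0, n_iter=10):
    """Profile estimation for Y_ij = Z_ij beta + Theta(X_ij) + eps_ij.

    X, Y: (n, m) arrays; Z: (n, m, p); smooth(X, R) returns the fitted
    Theta at every X_ij, i.e. Theta(X, beta) for the current beta.
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        theta = smooth(X, Y - Z @ beta)       # solve for Theta given beta
        resid = (Y - theta).reshape(-1)       # remove the nonparametric part
        Zflat = Z.reshape(-1, Z.shape[-1])
        beta, *_ = np.linalg.lstsq(Zflat, resid, rcond=None)
    return beta, theta
```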
The Semiparametric Model: Special
Case
Y_ij = Z_ij β + Θ(X_i) + ε_ij
• If X does not vary with time, a simple semiparametric efficient method is available
• The basic point is that Y_ij - Z_ij β has common mean Θ(X_i) and covariance matrix Σ
• If Θ(•) were a polynomial, GLS likelihood methods would be natural
The Semiparametric Model: Special
Case
Y_ij = Z_ij β + Θ(X_i) + ε_ij
• Method: replace the polynomial GLS likelihood with a GLS local likelihood with kernel weights K((X_i - t)/h)
• Then do GLS on the derived variable Z_ij β + Θ(X_i, β)
• Semiparametric efficient
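A sketch of this special-case estimator as a local-constant GLS fit at t; the Gaussian kernel is an illustrative choice, and X is the cluster-level covariate.

```python
import numpy as np

def special_case_theta(t, X, Y, Z, beta, Sigma, h):
    """Local GLS estimate of Theta(t) when X does not vary with time.

    Each cluster gets one kernel weight K((X_i - t)/h); the residual
    vector Y_i - Z_i beta has common mean Theta(X_i) and covariance Sigma.
    X: (n,); Y: (n, m); Z: (n, m, p).
    """
    S_inv = np.linalg.inv(Sigma)
    one = np.ones(Y.shape[1])
    w = np.exp(-0.5 * ((X - t) / h) ** 2)   # one kernel weight per cluster
    R = Y - Z @ beta                        # partial residuals
    num = w @ (R @ S_inv @ one)             # sum_i w_i * 1' S^{-1} R_i
    den = w.sum() * (one @ S_inv @ one)
    return num / den
```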
Profile Method: General Case
• Given β, solve for Θ, say Θ(X_ij, β)
• Then fit GLS or W.I. to the model with mean Z_ij β + Θ(X_ij, β)
• In this general case, how you estimate Θ matters:
• Working independence
• Standard kernel
• Pseudo-observation kernel
• SUR kernel
Profile Methods
• In this general case, how you estimate Θ matters:
• Working independence
• Standard kernel
• Pseudo-observation kernel
• SUR kernel
• We have published the asymptotically efficient
score, but not how to implement it
Profile Methods
• Naisyin Wang’s talk will describe
• These phenomena
• Search for an efficient estimator
• Loss of efficiency from using working independence to estimate Θ
• Examples where ignoring the correlation can
change conclusions
Conclusions (1/3): Nonparametric
Regression
• In nonparametric regression
• Kernels = splines for working independence
(W.I.)
• Weighting is important for W.I.
• Working independence is inefficient
• Standard kernels ≠ splines for correlated data
Conclusions (2/3): Nonparametric
Regression
• In nonparametric regression
• Pseudo-observation methods improve upon
working independence
• SUR kernels = splines for correlated data
• Splines and SUR kernels are not local
• Splines and SUR kernels are local in pseudo-observations
Conclusions (3/3): Semiparametric
Regression
• In semiparametric regression
• Profile methods are a general class
• Fully efficient parameter estimates are easily
constructed if X is not time-varying
• When X is time-varying, the method of estimating Θ affects properties of parameter estimates
• Ignoring correlations can change conclusions
(see N. Wang talk)
Conclusions: Splines versus Kernels
• One has to be struck by the fact that all the grief
in this problem has come from trying to define
kernel methods
• At the end of the day, they are no more efficient
than splines, and harder and more subtle to
define
• Showing equivalence as we have done suggests
the good properties of splines