Non/Semiparametric Regression and Clustered/Longitudinal Data
Raymond J. Carroll
Texas A&M University
http://stat.tamu.edu/~carroll | [email protected]
Postdoctoral Training Program: http://stat.tamu.edu/B3NC

Where am I From?
[Map slide: Wichita Falls, my hometown; Big Bend National Park; I-35; I-45; College Station, home of Texas A&M]

Acknowledgments
Raymond Carroll, Naisyin Wang, Oliver Linton, Enno Mammen, Alan Welsh, Xihong Lin
• A series of papers is on my web site
• Lin, Wang and Welsh: longitudinal data (Mammen & Linton for pseudo-observation methods)
• Linton and Mammen: time series data

Outline
• Longitudinal models: panel data
• Background: splines = kernels for independent data
• Correlated data: do splines = kernels?
• Semiparametric case, the partially linear model: does it matter what nonparametric method is used?

Panel Data (for simplicity)
• i = 1,…,n clusters/individuals
• j = 1,…,m observations per cluster

Subject | Wave 1 | Wave 2 | … | Wave m
   1    |   X    |   X    | … |   X
   2    |   X    |   X    | … |   X
   …    |        |        |   |
   n    |   X    |   X    | … |   X

Panel Data (for simplicity)
• Important points:
• The cluster size m is meant to be fixed
• This is not a multiple time series problem where the cluster size increases to infinity
• Some comments on the single time series problem are given near the end of the talk

The Marginal Nonparametric Model
• Y = response, X = time-varying covariate
$$Y_{ij} = \Theta(X_{ij}) + \epsilon_{ij}, \qquad \Theta(\cdot) \text{ an unknown function}, \qquad \mathrm{cov}(\epsilon_i) = \Sigma,$$
where $\epsilon_i = (\epsilon_{i1}, \dots, \epsilon_{im})^T$
• Question: can we improve efficiency by accounting for correlation?

The Marginal Nonparametric Model
• Important assumption: covariates at other waves are not conditionally predictive, i.e., they are surrogates:
$$E(Y_{ij} \mid X_{ij}, X_{ik} \text{ for } k \neq j) = \Theta(X_{ij})$$
• This assumption is required for any GLS fit, including parametric GLS

Independent Data
• Splines (smoothing, P-splines, etc.) with penalty parameter $\lambda$ minimize
$$\sum_{i=1}^n \{Y_i - \Theta(X_i)\}^2 + \lambda \int \{\Theta''(t)\}^2 \, dt$$
• A ridge regression fit: some bias, smaller variance
• $\lambda = 0$ is over-parameterized least squares
• $\lambda = \infty$ is a polynomial regression
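As a concrete illustration, the sketch below fits this penalized criterion with a truncated-line P-spline basis; the basis choice, knot placement, and function names are my assumptions, not from the talk, and ridge shrinkage of the truncated-line coefficients stands in for the integrated second-derivative penalty.

```python
import numpy as np

def pspline_basis(t, knots):
    # Intercept, slope, and one truncated line per knot.
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.column_stack([np.ones_like(t), t] +
                           [np.maximum(t - k, 0.0) for k in knots])

def pspline_fit(x, y, lam, num_knots=10):
    """P-spline sketch: minimize ||y - B c||^2 + lam * c' P c."""
    knots = np.quantile(x, np.linspace(0, 1, num_knots + 2)[1:-1])
    B = pspline_basis(x, knots)
    # Penalize only the truncated-line coefficients (ridge-type shrinkage).
    P = np.diag([0.0, 0.0] + [1.0] * num_knots)
    coef = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
    return knots, coef
```

As $\lambda \to 0$ this is over-parameterized least squares on the full basis; as $\lambda \to \infty$ the truncated terms are shrunk away and a straight-line fit remains, mirroring the two limiting bullets above.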
Independent Data
• Kernels (local averages, local linear, etc.), with kernel density function K and bandwidth h:
$$\hat\Theta(t) = \frac{n^{-1} \sum_{i=1}^n Y_i \, K\{(X_i - t)/h\}}{n^{-1} \sum_{i=1}^n K\{(X_i - t)/h\}}$$
• As the bandwidth h → 0, only observations with X near t get any weight in the fit

Independent Data
• Major methods: splines and kernels
• Smoothing parameters are required for both
• Fits: similar in many (most?) datasets
• Expectation: some combination of bandwidths and kernel functions looks like splines

Independent Data
• Splines and kernels are linear in the responses:
$$\hat\Theta(t) = n^{-1} \sum_{i=1}^n G_n(t, X_i) \, Y_i$$
• Silverman showed that there is a kernel function and a bandwidth so that the weight functions $G_n(t, x)$ are asymptotically equivalent
• In this sense, splines = kernels
• This talk is about the same result for correlated data

[Figure: the weight functions $G_n(t = 0.25, x)$ in a specific case for independent data, for a kernel fit and a smoothing spline. Note the similarity of shape and the locality: only X's near t = 0.25 get any weight.]

Working Independence
• Working independence: ignore all correlations
• Fix up the standard errors at the end
• Advantage: the assumption $E(Y_{ij} \mid X_{ij}, X_{ik} \text{ for } k \neq j) = \Theta(X_{ij})$ is not required
• Disadvantage: possible severe loss of efficiency if carried too far

Working Independence
• Working independence: ignore all correlations
• Should posit some reasonable marginal variances; weighting is important for efficiency
• Weighted versions: splines and kernels have obvious analogues
• Standard methods: Zeger & Diggle; Hoover, Rice, Wu & Yang; Lin & Ying; etc.
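A minimal sketch of the kernel estimator displayed earlier, with an optional per-observation weight for the weighted working-independence version just described; the Gaussian kernel choice and all names are my assumptions.

```python
import numpy as np

def kernel_smooth(t, x, y, h, w=None):
    """Nadaraya-Watson estimate of Theta(t): a locally weighted average.

    w: optional weights (e.g., inverse marginal variances) for the
    weighted working-independence version; default is unweighted.
    """
    w = np.ones_like(y) if w is None else w
    # Gaussian kernel weights: only x's near t contribute as h -> 0.
    k = np.exp(-0.5 * ((x - t) / h) ** 2)
    return np.sum(w * k * y) / np.sum(w * k)

# Toy check: recover a sine curve from noisy independent data.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)
print(kernel_smooth(0.25, x, y, h=0.05))  # roughly sin(pi/2) = 1
```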
Working Independence
• Weighted splines and weighted kernels are linear in the responses
• The Silverman result still holds
• In this sense, splines = kernels

Accounting for Correlation
• Splines have an obvious analogue for non-independent data
• Let $\Sigma_w$ be a working covariance matrix
• Penalized generalized least squares (GLS): minimize
$$\sum_{i=1}^n \{Y_i - \Theta(X_i)\}^T \Sigma_w^{-1} \{Y_i - \Theta(X_i)\} + \lambda \int \{\Theta''(t)\}^2 \, dt,$$
where $Y_i = (Y_{i1}, \dots, Y_{im})^T$ and $\Theta(X_i) = \{\Theta(X_{i1}), \dots, \Theta(X_{im})\}^T$
• A GLS ridge regression
• Because splines are based on likelihood ideas, they generalize quickly to new problems

Accounting for Correlation
• Kernels are not so obvious
• Local likelihood kernel ideas are standard in independent data problems
• Most attempts at kernels for correlated data have tried to use local likelihood kernel methods

Kernels and Correlation
• Problem: how do you define locality for kernels?
• Goal: estimate the function at t
• Let $K(t, X_i)$ be a diagonal matrix of standard kernel weights
• Standard kernel method: GLS pretending the inverse covariance matrix is
$$K^{1/2}(t, X_i) \, \Sigma_w^{-1} \, K^{1/2}(t, X_i)$$
• The estimate is inherently local

[Figure: the weight functions $G_n(t = 0.25, x)$ for the standard kernel method in a specific case: m = 3, n = 35, exchangeable correlation structure. Red: ρ = 0.0; green: ρ = 0.4; blue: ρ = 0.8. Note the locality of the kernel method.]

[Figure: the weight functions $G_n(t = 0.25, x)$ for the spline method in the same case: m = 3, n = 35, exchangeable correlation structure. Red: ρ = 0.0; green: ρ = 0.4; blue: ρ = 0.8. Note the lack of locality of the spline method.]

[Figure: the weight functions $G_n(t = 0.25, x)$ for the spline method with a complex correlation structure: m = 3, n = 35. Red: nearly singular; green: ρ = 0.0; blue: AR(0.8). Note the lack of locality of the spline method.]

Splines and Standard Kernels
• Accounting for correlation: standard kernels remain local; splines are not local
• The numerical results can be confirmed theoretically
• Don't we want our nonparametric regression estimates to be local?

Results on Kernels and Correlation
• GLS with weights $K^{1/2}(t, X_i) \, \Sigma_w^{-1} \, K^{1/2}(t, X_i)$
• The optimal working covariance matrix is working independence!
• Using the correct covariance matrix increases the variance and increases the MSE
• Splines ≠ kernels (or at least these kernels)
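For concreteness, here is a minimal sketch of the penalized GLS criterion from the "Accounting for Correlation" slide, reusing the truncated-line basis of the earlier sketch (again with ridge shrinkage in place of the second-derivative penalty); the basis and all names are my assumptions.

```python
import numpy as np

def gls_pspline_fit(x, y, sigma_w, lam, num_knots=10):
    """Penalized GLS spline sketch for clustered data.

    x, y: (n, m) arrays, row i = cluster i; sigma_w: (m, m) working
    covariance. Solves the normal equations of
      sum_i {y_i - B_i c}' sigma_w^{-1} {y_i - B_i c} + lam * c' P c.
    """
    n, m = x.shape
    knots = np.quantile(x, np.linspace(0, 1, num_knots + 2)[1:-1])
    siginv = np.linalg.inv(sigma_w)
    lhs = lam * np.diag([0.0, 0.0] + [1.0] * num_knots)
    rhs = np.zeros(num_knots + 2)
    for i in range(n):
        # Basis at the cluster's covariates: intercept, slope, truncated lines.
        Bi = np.column_stack([np.ones(m), x[i]] +
                             [np.maximum(x[i] - k, 0.0) for k in knots])
        lhs += Bi.T @ siginv @ Bi
        rhs += Bi.T @ siginv @ y[i]
    return knots, np.linalg.solve(lhs, rhs)
```

Setting $\Sigma_w = I$ recovers the weighted working-independence spline; the point of the slides above is that for the standard kernel analogue, the working-independence choice is actually optimal.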
Pseudo-Observation Kernel Methods
• Better kernel methods are possible
• Pseudo-observations: the original method
• Construction: a specific linear transformation of Y with mean $\Theta(X)$ and diagonal covariance matrix:
$$Y_i^* = \Theta(X_i) + \mathrm{diag}(\Sigma_w)^{1/2} \, \Sigma_w^{-1/2} \, \{Y_i - \Theta(X_i)\}$$
• This adjusts the original responses without affecting the mean

Pseudo-Observation Kernel Methods
• Construction: a specific linear transformation of Y with mean $\Theta(X)$ and diagonal covariance
• Iterative: the transformation involves $\Theta(\cdot)$, so the fit is updated at each step
• Efficiency: more efficient than working independence
• Proof of principle: kernel methods can be constructed to take advantage of correlation

[Figure: bar chart of the efficiencies of splines and pseudo-observation kernels ("P-kernel") across three correlation structures. Exchng: exchangeable with correlation 0.6; AR: autoregressive with correlation 0.6; Near Sing: a nearly singular matrix.]

Better Kernel Methods: SUR
• Simulations of the original pseudo-observation method: it is not as efficient as splines
• This suggests room for a better estimate
• Naisyin Wang: her talk will describe an even better kernel method
• Basis: seemingly unrelated regression (SUR) ideas
• Generalizable: based on likelihood ideas

SUR Kernel Methods
• It is well known that the GLS spline has an exact, analytic expression
• We have shown that the Wang SUR kernel method also has an exact, analytic expression
• Both methods are linear in the responses

SUR Kernel Methods
• The two methods differ in only one matrix term
• This turns out to be exactly the same matrix term considered by Silverman in his work
• Relatively nontrivial calculations show that Silverman's result still holds
• Splines = SUR kernels

Nonlocality
• The lack of locality of GLS splines and SUR kernels is surprising
• Suppose we want to estimate the function at t
• If any observation in a cluster has an X near t, then all observations in that cluster contribute to the fit, not just those with covariates near t
• Splines, pseudo-kernels and SUR kernels all borrow strength

Nonlocality
• Wang's SUR kernels = BLUP-like pseudo-kernels with a clever linear transformation
• Let $\Sigma^{-1} = (\sigma^{jk})_{jk}$
• SUR kernels are working independence kernels with weights $\sigma^{jj}$ applied to the pseudo-observations
$$Y_{ij}^* = Y_{ij} + \sum_{k \neq j} \frac{\sigma^{jk}}{\sigma^{jj}} \{Y_{ik} - \Theta(X_{ik})\}$$

Locality of Kernels
$$Y_{ij}^* = Y_{ij} + \sum_{k \neq j} \frac{\sigma^{jk}}{\sigma^{jj}} \{Y_{ik} - \Theta(X_{ik})\}$$
• Original pseudo-observation method: the pseudo-observations are uncorrelated
• SUR kernels: the pseudo-observations are correlated
• SUR kernels are not local
• SUR kernels are local in (the same!) pseudo-observations

Locality of Splines
• Splines = SUR kernels (a Silverman-type result)
• The GLS spline is iterative: standard independent spline smoothing applied to SUR pseudo-observations at each iteration
• GLS splines are not local
• GLS splines are local in (the same!) pseudo-observations
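A minimal sketch of one such iteration: given a current estimate of Θ, form the SUR pseudo-observations displayed above and re-smooth them under working independence with weights $\sigma^{jj}$. The Gaussian kernel smoother and all names are my assumptions.

```python
import numpy as np

def kernel_smooth(t, x, y, h, w):
    # Weighted Nadaraya-Watson smoother, as in the earlier sketch.
    k = np.exp(-0.5 * ((x - t) / h) ** 2)
    return np.sum(w * k * y) / np.sum(w * k)

def sur_iteration(x, y, sigma, theta_hat, h):
    """One SUR pseudo-observation update for clustered data.

    x, y: (n, m) arrays (row = cluster); sigma: (m, m) covariance;
    theta_hat: current estimate of Theta, a vectorized callable.
    Returns an updated callable estimate of Theta.
    """
    n, m = y.shape
    siginv = np.linalg.inv(sigma)     # entries sigma^{jk}
    resid = y - theta_hat(x)          # Y_ik - Theta(X_ik)
    ystar = y.astype(float).copy()
    for j in range(m):
        for k in range(m):
            if k != j:
                # Y*_ij = Y_ij + sum_{k != j} (sigma^{jk}/sigma^{jj}) resid_ik
                ystar[:, j] += siginv[j, k] / siginv[j, j] * resid[:, k]
    # Working-independence smooth of the pseudo-observations, weights sigma^{jj}.
    xf, yf = x.ravel(), ystar.ravel()
    wf = np.tile(np.diag(siginv), n)
    def new_theta(t):
        t = np.asarray(t, dtype=float)
        fits = [kernel_smooth(tt, xf, yf, h, wf) for tt in t.ravel()]
        return np.array(fits).reshape(t.shape)
    return new_theta
```

Starting from a working-independence fit and repeating this update a few times gives a SUR-type estimate; with a diagonal Σ the pseudo-observations equal the original responses and the update reduces to a weighted working-independence smooth.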
Time Series Problems
• Time series problems: many of the same issues arise
• Original pseudo-observation method: two stages
• First, a linear transformation yields pseudo-observations with mean $\Theta(X)$ and independent errors
• Then a single standard kernel smoother is applied
• Potential for great gains in efficiency (even infinite for AR problems with large correlation)

Time Series: AR(1) Illustration, First Pseudo-Observation Method
• AR(1) errors with correlation ρ: $\epsilon_t - \rho \epsilon_{t-1} = u_t$ (white noise)
$$Y_t^0 = Y_t - \rho \{Y_{t-1} - \Theta(X_{t-1})\}$$
• Regress $Y_t^0$ on $X_t$

Time Series Problems
• More efficient methods can be constructed
• A series of regression problems: for each lag j, pseudo-observations $Y_t^j$ with mean $\Theta(X_{t-j})$ and white noise errors
• Regress for each j: the fits are asymptotically independent
• Then take a weighted average
• A time series version of the SUR kernels for longitudinal data?

Time Series: AR(1) Illustration, New Pseudo-Observation Method
• AR(1), correlation ρ: $\epsilon_t - \rho \epsilon_{t-1} = u_t$
$$Y_t^0 = Y_t - \rho \{Y_{t-1} - \Theta(X_{t-1})\}, \qquad Y_t^1 = Y_{t-1} - \rho^{-1} \{Y_t - \Theta(X_t)\}$$
• Regress $Y_t^0$ on $X_t$ and $Y_t^1$ on $X_{t-1}$
• Weights: 1 and $\rho^2$

Time Series Problems
• AR(1) errors with correlation ρ
• Efficiency of the original pseudo-observation method relative to working independence:
$$\frac{1}{1 - \rho^2} \to \infty \quad \text{as } \rho \to 1$$
• Efficiency of the new (SUR?) pseudo-observation method relative to the original method:
$$1 + \rho^2 \to 2 \quad \text{as } \rho \to 1$$

The Semiparametric Model
• Y = response; X, Z = time-varying covariates
$$Y_{ij} = Z_{ij}^T \beta + \Theta(X_{ij}) + \epsilon_{ij}, \qquad \mathrm{cov}(\epsilon_i) = \Sigma$$
• Question: can we improve efficiency for β by accounting for correlation?

Profile Methods
$$Y_{ij} = Z_{ij}^T \beta + \Theta(X_{ij}) + \epsilon_{ij}, \qquad \mathrm{cov}(\epsilon_i) = \Sigma$$
• Given β, solve for Θ, say $\hat\Theta(X_{ij}, \beta)$
• Basic idea: regress $Y_{ij}^* = Y_{ij} - Z_{ij}^T \beta$ on $X_{ij}$, using any of:
• Working independence
• Standard kernels
• Pseudo-observation kernels
• SUR kernels

Profile Methods
• Given β, solve for Θ, say $\hat\Theta(X_{ij}, \beta)$
• Then fit GLS or W.I. to the model with mean $Z_{ij}^T \beta + \hat\Theta(X_{ij}, \beta)$
• Question: does it matter what kernel method is used?
• Question: how bad is using W.I. everywhere?
• Question: are there efficient choices?

The Semiparametric Model: Special Case
$$Y_{ij} = Z_{ij}^T \beta + \Theta(X_i) + \epsilon_{ij}$$
• If X does not vary with time, a simple semiparametric efficient method is available
• The basic point is that $Y_{ij} - Z_{ij}^T \beta$ has common mean $\Theta(X_i)$ and covariance matrix Σ
• If $\Theta(\cdot)$ were a polynomial, GLS likelihood methods would be natural

The Semiparametric Model: Special Case
• Method: replace the polynomial GLS likelihood with a GLS local likelihood with kernel weights $K\{(X_i - t)/h\}$
• Then do GLS for the model with derived mean $Z_{ij}^T \beta + \hat\Theta(X_{ij}, \beta)$
• Semiparametric efficient

Profile Method: General Case
• Given β, solve for Θ, say $\hat\Theta(X_{ij}, \beta)$
• Then fit GLS or W.I. to the model with mean $Z_{ij}^T \beta + \hat\Theta(X_{ij}, \beta)$
• In this general case, how you estimate Θ matters:
• Working independence
• Standard kernel
• Pseudo-observation kernel
• SUR kernel

Profile Methods
• In this general case, how you estimate Θ matters (working independence, standard kernel, pseudo-observation kernel, SUR kernel)
• We have published the asymptotically efficient score, but not how to implement it

Profile Methods
• Naisyin Wang's talk will describe:
• These phenomena
• The search for an efficient estimator
• The loss of efficiency from using working independence to estimate Θ
• Examples where ignoring the correlation can change conclusions
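A minimal sketch of the profile iteration just described, under working independence: given β, smooth Y − Zβ on X to get $\hat\Theta(\cdot, \beta)$; then refit β with that mean held fixed, and repeat. This backfitting-style approximation, the Gaussian kernel smoother, the fixed iteration count, and all names are my assumptions, not the talk's (or Wang's) estimator.

```python
import numpy as np

def kernel_smooth(t, x, y, h):
    # Nadaraya-Watson smoother, as in the earlier sketch.
    k = np.exp(-0.5 * ((x - t) / h) ** 2)
    return np.sum(k * y) / np.sum(k)

def profile_fit(x, z, y, h, num_iter=20):
    """Backfitting-style approximation to the profile estimator for
    Y = Z beta + Theta(X) + error, under working independence.

    x, y: (N,) pooled observations; z: (N, p) design matrix.
    """
    N, p = z.shape
    beta = np.zeros(p)
    for _ in range(num_iter):
        # Given beta, estimate Theta: smooth Y - Z beta on X.
        theta_x = np.array([kernel_smooth(t, x, y - z @ beta, h) for t in x])
        # Given Theta, update beta: least squares of Y - Theta(X) on Z.
        beta, *_ = np.linalg.lstsq(z, y - theta_x, rcond=None)
    return beta, theta_x
```

Replacing the smoothing step with a standard-kernel, pseudo-observation, or SUR-kernel fit, and the β update with GLS, gives the other variants discussed above.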
Conclusions (1/3): Nonparametric Regression
• In nonparametric regression:
• Kernels = splines for working independence (W.I.)
• Weighting is important for W.I.
• Working independence is inefficient
• Standard kernels ≠ splines for correlated data

Conclusions (2/3): Nonparametric Regression
• In nonparametric regression:
• Pseudo-observation methods improve upon working independence
• SUR kernels = splines for correlated data
• Splines and SUR kernels are not local
• Splines and SUR kernels are local in pseudo-observations

Conclusions (3/3): Semiparametric Regression
• In semiparametric regression:
• Profile methods are a general class
• Fully efficient parameter estimates are easily constructed if X is not time-varying
• When X is time-varying, the method of estimating Θ affects the properties of the parameter estimates
• Ignoring correlations can change conclusions (see N. Wang's talk)

Conclusions: Splines versus Kernels
• One has to be struck by the fact that all the grief in this problem has come from trying to define kernel methods
• At the end of the day, they are no more efficient than splines, and harder and more subtle to define
• Showing equivalence as we have done suggests the good properties of splines