Transcript: Linear Regression and Correlation
Linear Regression and Correlation
• Explanatory and response variables are numeric
• The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line)
• Model:

  Y = β0 + β1 x + ε    ε ~ N(0, σ)

• β1 > 0 ⟹ Positive Association
• β1 < 0 ⟹ Negative Association
• β1 = 0 ⟹ No Association
Least Squares Estimation of β0, β1
• β0 ≡ Mean response when x = 0 (y-intercept)
• β1 ≡ Change in mean response when x increases by 1 unit (slope)
• β0, β1 are unknown parameters (like μ)
• β0 + β1 x ≡ Mean response when the explanatory variable takes on the value x
• Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight line:

  ŷ = b0 + b1 x    SSE = Σ(i=1..n) (yi − ŷi)² = Σ(i=1..n) [yi − (b0 + b1 xi)]²
Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Predictor (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw data and scatterplot of Score vs LSD concentration:

  Score (y):     78.93  58.20  67.47  37.47  45.65  32.92  29.97
  LSD Conc (x):   1.17   2.97   3.26   4.69   5.83   6.00   6.41

[Scatterplot of Score (y) vs LSD_CONC omitted]
Source: Wagner, et al (1968)
Least Squares Computations

  Sxx = Σ(x − x̄)²    Sxy = Σ(x − x̄)(y − ȳ)    Syy = Σ(y − ȳ)²

  b1 = Sxy / Sxx    b0 = ȳ − b1 x̄    s² = Σ(y − ŷ)² / (n − 2) = SSE / (n − 2)
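The least squares computations above can be worked directly in a few lines of plain Python (no libraries required). This is a sketch using the LSD example data from the slides:

```python
# Least-squares computations for the LSD example (data from the slides).
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar = sum(x) / n
ybar = sum(y) / n

# Sums of squares and cross-products
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)

# Least-squares estimates and residual variance
b1 = Sxy / Sxx                 # slope
b0 = ybar - b1 * xbar          # intercept
SSE = Syy - b1 * Sxy           # computational shortcut for the error SS
s2 = SSE / (n - 2)             # estimate of sigma^2

print(round(b0, 2), round(b1, 2))   # fitted line: yhat = 89.12 - 9.01 x
```

The estimates agree with the hand computations on the following slides.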
Example - Pharmacodynamics of LSD

  Score (y)  LSD Conc (x)  x − x̄    y − ȳ      (x − x̄)²    (x − x̄)(y − ȳ)  (y − ȳ)²
  78.93       1.17         -3.163   28.843     10.004569    -91.230409      831.918649
  58.20       2.97         -1.363    8.113      1.857769    -11.058019       65.820769
  67.47       3.26         -1.073   17.383      1.151329    -18.651959      302.168689
  37.47       4.69          0.357  -12.617      0.127449     -4.504269      159.188689
  45.65       5.83          1.497   -4.437      2.241009     -6.642189       19.686969
  32.92       6.00          1.667  -17.167      2.778889    -28.617389      294.705889
  29.97       6.41          2.077  -20.117      4.313929    -41.783009      404.693689
  350.61     30.33         -0.001    0.001     22.474943   -202.487243     2078.183343

(Column totals given in bottom row of table: Sxx = 22.4749, Sxy = -202.4872, Syy = 2078.1833)
  ȳ = 350.61 / 7 = 50.087    x̄ = 30.33 / 7 = 4.333

  b1 = Sxy / Sxx = -202.4872 / 22.4749 = -9.01

  b0 = ȳ − b1 x̄ = 50.09 − (-9.01)(4.33) = 89.10

  ŷ = 89.10 − 9.01 x    s² = 50.72
SPSS Output and Plot of Equation
• Math Score vs LSD Concentration (SPSS), with fitted linear regression line:

  score = 89.12 + -9.01 * lsd_conc    R-Square = 0.88

[SPSS coefficients table and plot omitted; Dependent Variable: SCORE, Predictor: LSD_CONC]
Inference Concerning the Slope (β1)
• Parameter: Slope in the population model (β1)
• Estimator: Least squares estimate: b1
• Estimated standard error: SE(b1) = s / √Sxx
• Methods of making inference regarding population:
  – Hypothesis tests (2-sided or 1-sided)
  – Confidence Intervals
Hypothesis Test for β1
• 2-Sided Test
  – H0: β1 = 0    HA: β1 ≠ 0
  – T.S.: tobs = b1 / SE(b1)
  – R.R.: |tobs| ≥ t(α/2, n−2)
  – P-value: 2P(t ≥ |tobs|)
• 1-Sided Tests
  – H0: β1 = 0    HA+: β1 > 0 or HA−: β1 < 0
  – T.S.: tobs = b1 / SE(b1)
  – HA+:  R.R.: tobs ≥ t(α, n−2)    P-val: P(t ≥ tobs)
  – HA−:  R.R.: tobs ≤ −t(α, n−2)    P-val: P(t ≤ tobs)
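The two-sided test can be carried out numerically. A minimal sketch for the LSD example, assuming scipy is available for the t-distribution:

```python
from scipy import stats

# LSD example data (from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5    # residual standard deviation

# Two-sided test of H0: beta1 = 0
se_b1 = s / Sxx ** 0.5
t_obs = b1 / se_b1
p_val = 2 * stats.t.sf(abs(t_obs), df=n - 2)       # 2P(t >= |t_obs|)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)       # rejection-region cutoff
```

`scipy.stats.linregress(x, y)` returns the same slope, intercept, and two-sided p-value in one call.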
(1 − α)100% Confidence Interval for β1

  b1 ± t(α/2, n−2) SE(b1)  ≡  b1 ± t(α/2, n−2) s / √Sxx

• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is the same as the 2-sided hypothesis test
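A sketch of this interval for the LSD example data, assuming scipy for the t critical value:

```python
from scipy import stats

# 95% CI for the slope, using the LSD example data (from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5

se_b1 = s / Sxx ** 0.5
t_crit = stats.t.ppf(0.975, df=n - 2)          # t(alpha/2, n-2) for 95%
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1

# Entire interval below 0 -> conclude a negative association
print(round(lo, 2), round(hi, 2))
```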
Example - Pharmacodynamics of LSD

  n = 7    b1 = -9.01    s = √50.72 = 7.12    Sxx = 22.475

  SE(b1) = s / √Sxx = 7.12 / √22.475 = 1.50

• Testing H0: β1 = 0 vs HA: β1 ≠ 0

  T.S.: tobs = -9.01 / 1.50 = -6.01    R.R.: |tobs| ≥ t(.025, 5) = 2.571

• 95% Confidence Interval for β1:

  -9.01 ± 2.571(1.50) ≡ -9.01 ± 3.86 ≡ (-12.87, -5.15)
Confidence Interval for Mean When x = x*
• Mean response at a specific level x* is:

  E(y | x*) = μy = β0 + β1 x*

• Estimated mean response and standard error (replacing unknown β0 and β1 with estimates):

  μ̂y = b0 + b1 x*    SE(μ̂y) = s √[ 1/n + (x* − x̄)² / Sxx ]

• Confidence Interval for Mean Response:

  μ̂y ± t(α/2, n−2) SE(μ̂y)
Prediction Interval of Future Response @ x = x*
• Response at a specific level x* is:

  y | x = x*  =  β0 + β1 x* + ε

• Estimated response and standard error (replacing unknown β0 and β1 with estimates):

  ŷ = b0 + b1 x*    SE(ŷ) = s √[ 1 + 1/n + (x* − x̄)² / Sxx ]

• Prediction Interval for Future Response:

  ŷ ± t(α/2, n−2) SE(ŷ)
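The two interval formulas differ only by the extra "1 +" under the square root, so the prediction interval is always wider. A sketch computing both for the LSD data at an illustrative value x* = 4.0 (the choice of x* is ours, not from the slides):

```python
from scipy import stats

# Interval estimates at x* = 4.0 for the LSD example (data from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5

xstar = 4.0
yhat = b0 + b1 * xstar
se_mean = s * (1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5      # CI for mean
se_pred = s * (1 + 1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5  # PI for new obs
t_crit = stats.t.ppf(0.975, df=n - 2)

ci = (yhat - t_crit * se_mean, yhat + t_crit * se_mean)
pi = (yhat - t_crit * se_pred, yhat + t_crit * se_pred)
```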
Correlation Coefficient
• Measures the strength of the linear association between two variables
• Takes on the same sign as the slope estimate from the linear regression
• Not affected by linear transformations of y or x
• Does not distinguish between dependent and independent variables (e.g. height and weight)
• Population parameter: ρ
• Pearson's Correlation Coefficient:

  r = Sxy / √(Sxx Syy)    −1 ≤ r ≤ 1
Correlation Coefficient
• Values close to 1 in absolute value ⟹ strong linear association, positive or negative from sign
• Values close to 0 imply little or no association
• If data contain outliers (are non-normal), Spearman's coefficient of correlation can be computed based on the ranks of the x and y values
• Test of H0: ρ = 0 is equivalent to test of H0: β1 = 0
• Coefficient of Determination (r²) - Proportion of variation in y "explained" by the regression on x:

  r² = (r)² = (Syy − SSE) / Syy    0 ≤ r² ≤ 1
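Both measures are available in scipy; a sketch for the LSD example, checking the formula against `scipy.stats.pearsonr` and adding the rank-based Spearman alternative:

```python
from scipy import stats

# Pearson's r and r^2 for the LSD example (data from the slides),
# plus Spearman's rank-based alternative.
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)

r = Sxy / (Sxx * Syy) ** 0.5   # Pearson's r from the S-quantities
r2 = r ** 2                    # coefficient of determination

r_scipy, p_pearson = stats.pearsonr(x, y)   # same r, plus test of H0: rho = 0
rho_s, p_spearman = stats.spearmanr(x, y)   # rank-based (outlier-resistant)
```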
Example - Pharmacodynamics of LSD

  Sxx = 22.475    Sxy = -202.487    Syy = 2078.183    SSE = 253.89

  r = -202.487 / √[(22.475)(2078.183)] = -0.94

  r² = (-0.94)² = 0.88 = (Syy − SSE) / Syy = (2078.183 − 253.89) / 2078.183
[Scatterplots of score vs lsd_conc omitted: one with a horizontal line at the mean, one with the fitted line score = 89.12 + -9.01 * lsd_conc]
Example - SPSS Output: Pearson's and Spearman's Measures

[SPSS correlation tables omitted; both Pearson's and Spearman's correlations are flagged as significant at the 0.01 level (2-tailed)]
Analysis of Variance in Regression
• Goal: Partition the total variation in y into variation "explained" by x and random variation:

  (yi − ȳ) = (yi − ŷi) + (ŷi − ȳ)

  Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)²

• These three sums of squares and degrees of freedom are:
  • Total (SST)    DFT = n − 1
  • Error (SSE)    DFE = n − 2
  • Model (SSM)    DFM = 1
Analysis of Variance in Regression

  Source of Variation  Sum of Squares  Degrees of Freedom  Mean Square          F
  Model                SSM             1                   MSM = SSM / 1        F = MSM / MSE
  Error                SSE             n − 2               MSE = SSE / (n − 2)
  Total                SST             n − 1

• Analysis of Variance - F-test
• H0: β1 = 0    HA: β1 ≠ 0

  T.S.: Fobs = MSM / MSE    R.R.: Fobs ≥ F(α, 1, n−2)    P-value: P(F ≥ Fobs)
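The partition and the F-test can be checked numerically. A sketch for the LSD example, assuming scipy for the F-distribution:

```python
from scipy import stats

# ANOVA partition and F-test for the LSD example (data from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)               # total,  n-1 df
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error,  n-2 df
SSM = SST - SSE                                       # model,  1 df

F_obs = (SSM / 1) / (SSE / (n - 2))
p_val = stats.f.sf(F_obs, 1, n - 2)                   # P(F >= F_obs)
F_crit = stats.f.ppf(0.95, 1, n - 2)                  # F(.05, 1, 5)
```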
Example - Pharmacodynamics of LSD
• Total sum of squares:  SST = Σ(yi − ȳ)² = 2078.183    DFT = 7 − 1 = 6
• Error sum of squares:  SSE = Σ(yi − ŷi)² = 253.890    DFE = 7 − 2 = 5
• Model sum of squares:  SSM = Σ(ŷi − ȳ)² = 2078.183 − 253.890 = 1824.293    DFM = 1
Example - Pharmacodynamics of LSD

  Source of Variation  Sum of Squares  Degrees of Freedom  Mean Square  F
  Model                1824.293        1                   1824.293     35.93
  Error                 253.890        5                     50.778
  Total                2078.183        6

• Analysis of Variance - F-test
• H0: β1 = 0    HA: β1 ≠ 0

  T.S.: Fobs = MSM / MSE = 35.93    R.R.: Fobs ≥ F(.05, 1, 5) = 6.61    P-val: P(F ≥ 35.93)
Example - SPSS Output

[SPSS ANOVA table omitted; Dependent Variable: SCORE, Predictors: (Constant), LSD_CONC]
Multiple Regression
• Numeric response variable (Y)
• p numeric predictor variables
• Model:

  Y = β0 + β1 x1 + ... + βp xp + ε

• Partial Regression Coefficients: βi ≡ effect (on the mean response) of increasing the i-th predictor variable by 1 unit, holding all other predictors constant
Example - Effect of Birth weight on Body Size in Early Adolescence
• Response: Height at early adolescence (n = 250 cases)
• Predictors (p = 6 explanatory variables):
  • Adolescent age (x1, in years -- 11-14)
  • Tanner stage (x2, units not given)
  • Gender (x3 = 1 if male, 0 if female)
  • Gestational age (x4, in weeks at birth)
  • Birth length (x5, units not given)
  • Birthweight group (x6 = 1,...,6: <1500 g (1), 1500-1999 g (2), 2000-2499 g (3), 2500-2999 g (4), 3000-3499 g (5), >3500 g (6))
Source: Falkner, et al (2004)
Least Squares Estimation
• Population model for mean response:

  E(Y) = β0 + β1 x1 + ... + βp xp

• Least squares fitted (predicted) equation, minimizing SSE:

  Ŷ = b0 + b1 x1 + ... + bp xp    SSE = Σ(Y − Ŷ)²

• All statistical software packages/spreadsheets can compute least squares estimates and their standard errors
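A minimal sketch of how such software computes the fit, using numpy's least-squares solver on synthetic (hypothetical) data generated from a known plane so the answer can be checked exactly:

```python
import numpy as np

# Synthetic, noise-free data from a known plane: y = 10 + 2*x1 - 3*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 10.0 + 2.0 * x1 - 3.0 * x2

# Design matrix with a leading column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])
b, resid, rank, _ = np.linalg.lstsq(X, y, rcond=None)

y_fit = X @ b
SSE = float(np.sum((y - y_fit) ** 2))   # ~0 here, since y is noise-free
```

With noisy real data, SSE would be positive and MSE = SSE/(n − p − 1) would estimate the error variance.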
Analysis of Variance
• Direct extension to ANOVA based on simple linear regression
• Only adjustments are to degrees of freedom:
  – DFM = p    DFE = n − p − 1

  Source of Variation  Sum of Squares  Degrees of Freedom  Mean Square              F
  Model                SSM             p                   MSM = SSM / p            F = MSM / MSE
  Error                SSE             n − p − 1           MSE = SSE / (n − p − 1)
  Total                SST             n − 1

  R² = (SST − SSE) / SST = SSM / SST
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are associated with the response
• H0: β1 = ... = βp = 0  (None of the x's associated with y)
• HA: Not all βi = 0

  T.S.: Fobs = MSM / MSE = (R² / p) / [(1 − R²) / (n − p − 1)]

  R.R.: Fobs ≥ F(α, p, n−p−1)    P-val: P(F ≥ Fobs)
Example - Effect of Birth weight on Body Size in Early Adolescence
• Authors did not print ANOVA, but did provide the following:
• n = 250    p = 6    R² = 0.26
• H0: β1 = ... = β6 = 0    HA: Not all βi = 0

  T.S.: Fobs = (R² / p) / [(1 − R²) / (n − p − 1)] = (0.26 / 6) / [(1 − 0.26) / (250 − 6 − 1)] = .0433 / .0030 = 14.2

  R.R.: Fobs ≥ F(.05, 6, 243) = 2.13    P-val: P(F ≥ 14.2)
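The R²-based form of the test statistic is handy precisely because it needs only the published summary numbers. A sketch, assuming scipy for the F-distribution:

```python
from scipy import stats

# Overall F-test computed from the published summary on the slide
# (n = 250, p = 6, R^2 = 0.26) -- no raw data needed.
def overall_f(r2, p, n):
    """F_obs = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

n, p, r2 = 250, 6, 0.26
F_obs = overall_f(r2, p, n)
F_crit = stats.f.ppf(0.95, p, n - p - 1)   # F(.05, 6, 243)
p_val = stats.f.sf(F_obs, p, n - p - 1)
```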
Testing Individual Partial Coefficients - t-tests
• Wish to determine whether the response is associated with a single explanatory variable, after controlling for the others
• H0: βi = 0    HA: βi ≠ 0 (2-sided alternative)

  T.S.: tobs = bi / SE(bi)    R.R.: |tobs| ≥ t(α/2, n−p−1)    P-val: 2P(t ≥ |tobs|)
Example - Effect of Birth weight on Body Size in Early Adolescence

  Variable          b      SE(b)  t = b/SE(b)  P-val (z)
  Adolescent Age    2.86   0.99    2.89        .0038
  Tanner Stage      3.41   0.89    3.83        <.001
  Male              0.08   1.26    0.06        .9522
  Gestational Age  -0.11   0.21   -0.52        .6030
  Birth Length      0.44   0.19    2.32        .0204
  Birth Wt Grp     -0.78   0.64   -1.22        .2224

Controlling for all other predictors, adolescent age, Tanner stage, and birth length are associated with adolescent height measurement
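The t-ratios in the table can be reproduced from the reported b and SE(b) values. A sketch using a normal approximation for the p-values, as the "P-val (z)" column heading suggests:

```python
from scipy import stats

# Recomputing the t-ratios from the reported b and SE(b) values
coefs = {                      # variable: (b, SE(b))
    "Adolescent Age":  (2.86, 0.99),
    "Tanner Stage":    (3.41, 0.89),
    "Male":            (0.08, 1.26),
    "Gestational Age": (-0.11, 0.21),
    "Birth Length":    (0.44, 0.19),
    "Birth Wt Grp":    (-0.78, 0.64),
}
results = {}
for name, (b, se) in coefs.items():
    t = b / se
    p = 2 * stats.norm.sf(abs(t))      # two-sided z-approximation
    results[name] = (round(t, 2), p)
```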
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are associated with the response
• H0: β1 = ... = βp = 0  (None of the X's associated with Y)
• HA: Not all βi = 0

  T.S.: Fobs = MSM / MSE = (R² / p) / [(1 − R²) / (n − p − 1)]    P-val: P(F ≥ Fobs)

The P-value is based on the F-distribution with p numerator and (n − p − 1) denominator degrees of freedom
Comparing Regression Models
• Conflicting goals: Explaining variation in Y while keeping the model as simple as possible (parsimony)
• We can test whether a subset of p − g predictors (including possibly cross-product terms) can be dropped from a model that contains the remaining g predictors.

  H0: βg+1 = ... = βp = 0

  – Complete Model: Contains all p predictors
  – Reduced Model: Eliminates the predictors from H0
  – Fit both models, obtaining the Error sum of squares for each (or R² from each)
Comparing Regression Models
• H0: βg+1 = ... = βp = 0 (After removing the effects of X1,...,Xg, none of the other predictors are associated with Y)
• Ha: H0 is false

  Test Statistic: Fobs = [(SSEr − SSEc) / (p − g)] / [SSEc / (n − p − 1)]

  P = P(F ≥ Fobs)

P-value based on F-distribution with p − g and n − p − 1 d.f.
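A sketch of the complete-vs-reduced comparison; the numbers below are made up purely to exercise the formula:

```python
# Complete-vs-reduced model F statistic
def nested_f(sse_reduced, sse_complete, n, p, g):
    """F_obs = [(SSEr - SSEc)/(p - g)] / [SSEc/(n - p - 1)]."""
    num = (sse_reduced - sse_complete) / (p - g)
    den = sse_complete / (n - p - 1)
    return num / den

# Hypothetical fits: dropping 2 of 5 predictors raises SSE from 80 to 100
F_obs = nested_f(sse_reduced=100.0, sse_complete=80.0, n=30, p=5, g=3)
```

A large F_obs means the dropped predictors explained a nontrivial share of the variation, so the reduced model loses real explanatory power.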
Models with Dummy Variables
• Some models have both numeric and categorical explanatory variables (recall gender in the example)
• If a categorical variable has k levels, we need to create k − 1 dummy variables that take on the value 1 if the level of interest is present, 0 otherwise.
• The baseline level of the categorical variable is the one for which all k − 1 dummy variables are set to 0
• The regression coefficient corresponding to a dummy variable is the difference between the mean for that level and the mean for the baseline group, controlling for all numeric predictors
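The k − 1 coding scheme can be sketched in plain Python; the level names here are hypothetical:

```python
# k-1 dummy coding for a categorical variable
def dummy_code(values, baseline):
    """Encode a categorical variable with k levels as k-1 0/1 dummies,
    dropping the baseline level (which encodes as an all-zero row)."""
    levels = [lv for lv in sorted(set(values)) if lv != baseline]
    return [[1 if v == lv else 0 for lv in levels] for v in values], levels

rows, levels = dummy_code(["A", "B", "C", "A"], baseline="A")
# levels == ["B", "C"]; the baseline "A" encodes as [0, 0]
```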
Example - Deep Cervical Infections
• Subjects - Patients with deep neck infections
• Response (Y) - Length of stay in hospital
• Predictors: (One numeric, 11 dichotomous)
  – Age (x1)
  – Gender (x2 = 1 if female, 0 if male)
  – Fever (x3 = 1 if body temp > 38C, 0 if not)
  – Neck swelling (x4 = 1 if present, 0 if absent)
  – Neck pain (x5 = 1 if present, 0 if absent)
  – Trismus (x6 = 1 if present, 0 if absent)
  – Underlying disease (x7 = 1 if present, 0 if absent)
  – Respiration difficulty (x8 = 1 if present, 0 if absent)
  – Complication (x9 = 1 if present, 0 if absent)
  – WBC > 15000/mm³ (x10 = 1 if present, 0 if absent)
  – CRP > 100 μg/ml (x11 = 1 if present, 0 if absent)
Source: Wang, et al (2003)
Example - Weather and Spinal Patients
• Subjects - Visitors to National Spinal Network in 23 cities completing SF-36 form
• Response - Physical Function subscale (1 of 10 reported)
• Predictors:
  – Patient's age (x1)
  – Gender (x2 = 1 if female, 0 if male)
  – High temperature on day of visit (x3)
  – Low temperature on day of visit (x4)
  – Dew point (x5)
  – Wet bulb (x6)
  – Total precipitation (x7)
  – Barometric pressure (x8)
  – Length of sunlight (x9)
  – Moon phase (new, wax crescent, 1st Qtr, wax gibbous, full moon, wan gibbous, last Qtr, wan crescent; presumably had 8 − 1 = 7 dummy variables)
Source: Glaser, et al (2004)
Analysis of Covariance
• Combination of 1-Way ANOVA and linear regression
• Goal: Comparing numeric responses among k groups, adjusting for numeric concomitant variable(s), referred to as Covariate(s)
• Clinical trial applications: Response is post-treatment score, covariate is pre-treatment score
• Epidemiological applications: Outcomes compared across exposure conditions, adjusted for other risk factors (age, smoking status, sex, ...)
Nonlinear Regression • Theory often leads to nonlinear relations between variables. Examples: – 1-compartment PK model with 1st-order absorption and elimination – Sigmoid-Emax (S-shaped) PD model
Example - P24 Antigens and AZT
• Goal: Model time course of P24 antigen levels after oral administration of zidovudine
• Model fit individually in 40 HIV+ patients:

  E(t) = E0 (1 − A) e^(−k_out t) + E0 A

where:
• E(t) is the antigen level at time t
• E0 is the initial level
• A is the coefficient of reduction of P24 antigen
• k_out is the rate constant of decrease of P24 antigen
Source: Sasomsin, et al (2002)
Example - P24 Antigens and AZT
• Among the 40 individuals to whom the model was fit, the means and standard deviations of the PK "parameters" are given below:

  Parameter  Mean   Std Dev
  E0         472.1  408.8
  A          0.28   0.21
  k_out      0.27   0.16

• Fitted model for the "mean subject":

  E(t) = 472.1 (1 − 0.28) e^(−0.27 t) + (472.1)(0.28)
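The "mean subject" curve can be coded directly; note the model implies E(0) = E0 and that the level decays toward the floor E0·A as t grows large:

```python
import math

# The fitted "mean subject" curve from the slide
E0, A, k_out = 472.1, 0.28, 0.27

def p24_level(t):
    """E(t) = E0 * (1 - A) * exp(-k_out * t) + E0 * A."""
    return E0 * (1 - A) * math.exp(-k_out * t) + E0 * A

# Sanity checks: starts at E0 = 472.1, decays toward E0*A = 132.2
```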
Example - P24 Antigens and AZT
[Plot of fitted P24 antigen time course omitted]
Example - MK639 in HIV+ Patients
• Response: Y = log10(RNA change)
• Predictor: x = MK639 AUC(0-6h)
• Model: Sigmoid-Emax:

  Y = β0 x^β2 / (x^β2 + β1^β2) + ε

where:
• β0 is the maximum effect (limit as x → ∞)
• β1 is the x level producing 50% of maximum effect
• β2 is a parameter affecting the shape of the function
Source: Stein, et al (1996)
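The curve's defining properties are easy to verify in code. This sketch uses the SPSS estimates reported on the next slide (β0 = 3.52, β1 = 18374.39, β2 = 35.60):

```python
# Sigmoid-Emax curve with the SPSS estimates from the notes
b0_max, b1_ec50, b2_shape = 3.52, 18374.39, 35.60

def emax(x):
    """Y = beta0 * x^beta2 / (x^beta2 + beta1^beta2)."""
    t = x ** b2_shape
    return b0_max * t / (t + b1_ec50 ** b2_shape)

# At x = beta1 the response is exactly half the maximum effect
half = emax(b1_ec50)
```

With β2 this large the curve is nearly a step function: responses at the lower AUC values are close to 0, and those above β1 are close to the maximum 3.52.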
Example - MK639 in HIV+ Patients
• Data on n = 5 subjects in a Phase 1 trial:

  Subject               1        2        3        4        5
  log RNA change (Y)    0.000    0.167    1.524    3.205    3.518
  MK639 AUC(0-6h) (x)  10576.9  13942.3  18235.3  19607.8  22317.1

• Model fit using SPSS (estimates slightly different from notes, which used SAS):

  Ŷ = 3.52 x^35.60 / (x^35.60 + 18374.39^35.60)
Example - MK639 in HIV+ Patients
[Plot of fitted sigmoid-Emax curve omitted]
Data Sources • Wagner, J.G., G.K. Aghajanian, and O.H. Bing (1968). “Correlation of Performance Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects,”
Clinical Pharmacology and Therapeutics
, 9:635-638.