Linear Regression and Correlation


Linear Regression and Correlation

• Explanatory and response variables are numeric
• Relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line)
• Model:  Y = β₀ + β₁x + ε,   ε ~ N(0, σ)
• β₁ > 0 ⇒ Positive Association
• β₁ < 0 ⇒ Negative Association
• β₁ = 0 ⇒ No Association

Least Squares Estimation of β₀, β₁

• β₀ ⇒ Mean response when x = 0 (y-intercept)
• β₁ ⇒ Change in mean response when x increases by 1 unit (slope)
• β₀, β₁ are unknown parameters (like μ)
• β₀ + β₁x ⇒ Mean response when the explanatory variable takes on the value x
• Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight line:

   ŷ = b₀ + b₁x      SSE = Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − (b₀ + b₁xᵢ))²   (sum over i = 1,…,n)

Example - Pharmacodynamics of LSD

• Response (y) - Math score (mean among 5 volunteers)
• Predictor (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw data and scatterplot of Score vs LSD concentration:

   Score (y):    78.93  58.20  67.47  37.47  45.65  32.92  29.97
   LSD Conc (x):  1.17   2.97   3.26   4.69   5.83   6.00   6.41

(Scatterplot of Score, 20-80, vs LSD_CONC, 1-7, omitted)

Source: Wagner, et al (1968)

Least Squares Computations

   Sxx = Σ(x − x̄)²
   Sxy = Σ(x − x̄)(y − ȳ)
   Syy = Σ(y − ȳ)²

   b₁ = Sxy / Sxx
   b₀ = ȳ − b₁x̄

   s² = Σ(y − ŷ)² / (n − 2) = SSE / (n − 2)

Example - Pharmacodynamics of LSD

   Score (y)  LSD Conc (x)   x − x̄     y − ȳ       Sxx         Sxy          Syy
     78.93        1.17      -3.163    28.843   10.004569   -91.230409    831.918649
     58.20        2.97      -1.363     8.113    1.857769   -11.058019     65.820769
     67.47        3.26      -1.073    17.383    1.151329   -18.651959    302.168689
     37.47        4.69       0.357   -12.617    0.127449    -4.504269    159.188689
     45.65        5.83       1.497    -4.437    2.241009    -6.642189     19.686969
     32.92        6.00       1.667   -17.167    2.778889   -28.617389    294.705889
     29.97        6.41       2.077   -20.117    4.313929   -41.783009    404.693689
    350.61       30.33      -0.001     0.001   22.474943  -202.487243   2078.183343

(Column totals given in bottom row of table)

   x̄ = 30.33 / 7 = 4.333      ȳ = 350.61 / 7 = 50.087

   b₁ = Sxy / Sxx = −202.4872 / 22.4749 = −9.01

   b₀ = ȳ − b₁x̄ = 50.09 − (−9.01)(4.33) = 89.10

   ŷ = 89.10 − 9.01x
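These hand computations can be checked directly from the raw data; a minimal Python sketch (not part of the original notes):

```python
# Least squares estimates for the LSD example, computed from the raw data.
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]        # LSD tissue concentration
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]  # math scores

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx          # slope: about -9.01
b0 = ybar - b1 * xbar   # intercept: about 89.12 before rounding
print(round(b1, 2), round(b0, 2))
```

The unrounded intercept is 89.12 (as in the SPSS output below), versus 89.10 when computed from the rounded values 50.09, −9.01, and 4.33.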

SPSS Output and Plot of Equation

(SPSS coefficients table and plot of Math Score vs LSD Concentration with fitted linear regression line)

   score = 89.12 + -9.01 * lsd_conc      R-Square = 0.88

Inference Concerning the Slope (β₁)

• Parameter: Slope in the population model (β₁)
• Estimator: Least squares estimate: b₁
• Estimated standard error:  SE(b₁) = s / √Sxx
• Methods of making inference regarding population:
  – Hypothesis tests (2-sided or 1-sided)
  – Confidence Intervals

Hypothesis Test for β₁

• 2-Sided Test
  – H₀: β₁ = 0
  – Hₐ: β₁ ≠ 0

   T.S.: t_obs = b₁ / SE(b₁)
   R.R.: |t_obs| ≥ t(α/2, n−2)
   P-value: 2P(t ≥ |t_obs|)

• 1-Sided Tests
  – H₀: β₁ = 0
  – Hₐ⁺: β₁ > 0  or  Hₐ⁻: β₁ < 0

   T.S.: t_obs = b₁ / SE(b₁)
   R.R. (Hₐ⁺): t_obs ≥ t(α, n−2)     P-value: P(t ≥ t_obs)
   R.R. (Hₐ⁻): t_obs ≤ −t(α, n−2)    P-value: P(t ≤ t_obs)

(1  )100% Confidence Interval for  1

b

1 

t

 / 2

SE b

1 

b

1 

t

 / 2

s S xx

• Conclude positive association if entire interval above 0 • Conclude negative association if entire interval below 0 • Cannot conclude an association if interval contains 0 • Conclusion based on interval is same as 2-sided hypothesis test

Example - Pharmacodynamics of LSD

   n = 7     b₁ = −9.01     s = 7.12     Sxx = 22.475

   SE(b₁) = 7.12 / √22.475 = 1.50

• Testing H₀: β₁ = 0 vs Hₐ: β₁ ≠ 0

   T.S.: t_obs = −9.01 / 1.50 = −6.01
   R.R.: |t_obs| ≥ t(.025, 5) = 2.571

• 95% Confidence Interval for β₁:

   −9.01 ± 2.571(1.50) = −9.01 ± 3.86 = (−12.87, −5.15)
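The test and interval can be reproduced in plain Python, hard-coding the tabled critical value t(.025, 5) = 2.571; small differences from the slide's −6.01 come from the rounded intermediate values used there:

```python
import math

# Slope inference for the LSD example (n = 7): t statistic and 95% CI for beta1.
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)

b1 = Sxy / Sxx
SSE = Syy - b1 * Sxy            # shortcut: SSE = Syy - Sxy^2/Sxx
s = math.sqrt(SSE / (n - 2))    # residual standard deviation, about 7.13
SE_b1 = s / math.sqrt(Sxx)      # about 1.50

t_obs = b1 / SE_b1              # about -5.99 (slides show -6.01 from rounded inputs)
t_crit = 2.571                  # tabled t(.025, 5)
ci = (b1 - t_crit * SE_b1, b1 + t_crit * SE_b1)   # about (-12.87, -5.15)
print(round(t_obs, 2), [round(v, 2) for v in ci])
```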

Confidence Interval for Mean When x = x*

• Mean response at a specific level x* is  E(y|x*) = μ_y = β₀ + β₁x*
• Estimated mean response and standard error (replacing unknown β₀ and β₁ with estimates):

   μ̂_y = b₀ + b₁x*      SE(μ̂) = s √( 1/n + (x* − x̄)² / Sxx )

• Confidence Interval for Mean Response:

   μ̂_y ± t(α/2, n−2) SE(μ̂)

Prediction Interval of Future Response @ x = x*

• Response at a specific level x* is  y|x* = μ_y + ε = β₀ + β₁x* + ε
• Estimated response and standard error (replacing unknown β₀ and β₁ with estimates):

   ŷ = b₀ + b₁x*      SE(ŷ) = s √( 1 + 1/n + (x* − x̄)² / Sxx )

• Prediction Interval for Future Response:

   ŷ ± t(α/2, n−2) SE(ŷ)
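Both intervals can be sketched at a single concentration; x_star = 4.0 below is a hypothetical value chosen only for illustration, not one from the notes:

```python
import math

# 95% CI for the mean response and 95% PI for a future response at x* (LSD example).
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = math.sqrt((Syy - b1 * Sxy) / (n - 2))

x_star = 4.0                    # hypothetical concentration, for illustration
mu_hat = b0 + b1 * x_star       # estimated mean response at x*, about 53.09
t_crit = 2.571                  # tabled t(.025, n-2) with n-2 = 5 df

se_mean = s * math.sqrt(1 / n + (x_star - xbar) ** 2 / Sxx)      # CI standard error
se_pred = s * math.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / Sxx)  # PI standard error

ci = (mu_hat - t_crit * se_mean, mu_hat + t_crit * se_mean)
pi = (mu_hat - t_crit * se_pred, mu_hat + t_crit * se_pred)
# The prediction interval is always wider than the CI at the same x*.
```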

Correlation Coefficient

• Measures the strength of the linear association between two variables
• Takes on the same sign as the slope estimate from the linear regression
• Not affected by linear transformations of y or x
• Does not distinguish between dependent and independent variable (e.g. height and weight)
• Population parameter: ρ
• Pearson's Correlation Coefficient:

   r = Sxy / √(Sxx · Syy)      −1 ≤ r ≤ 1

Correlation Coefficient

• Values close to 1 in absolute value ⇒ strong linear association, positive or negative from sign
• Values close to 0 imply little or no association
• If data contain outliers (are non-normal), Spearman's coefficient of correlation can be computed based on the ranks of the x and y values
• Test of H₀: ρ = 0 is equivalent to test of H₀: β₁ = 0
• Coefficient of Determination (r²) - Proportion of variation in y "explained" by the regression on x:

   r² = (r)² = (Syy − SSE) / Syy      0 ≤ r² ≤ 1

Example - Pharmacodynamics of LSD

   Sxx = 22.475     Sxy = −202.487     Syy = 2078.183     SSE = 253.89

   r = −202.487 / √((22.475)(2078.183)) = −0.94

   r² = (2078.183 − 253.89) / 2078.183 = 0.88 = (−0.94)²
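Pearson's r and r² for the LSD data can be checked the same way; a minimal sketch:

```python
import math

# Pearson correlation and coefficient of determination for the LSD example.
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)   # about -0.94: strong negative linear association
r2 = r ** 2                      # about 0.88: 88% of variation in y explained by x
print(round(r, 2), round(r2, 2))
```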

(Two plots of score vs lsd_conc: one with the horizontal line at the mean ȳ, whose squared deviations give Syy, and one with the fitted line score = 89.12 + -9.01 * lsd_conc, whose squared deviations give SSE)

Example - SPSS Output

(SPSS table of Pearson's and Spearman's correlation measures; both correlations are significant at the 0.01 level, 2-tailed)

Analysis of Variance in Regression

• Goal: Partition the total variation in y into variation "explained" by x and random variation:

   (yᵢ − ȳ) = (yᵢ − ŷᵢ) + (ŷᵢ − ȳ)

   Σ(yᵢ − ȳ)² = Σ(yᵢ − ŷᵢ)² + Σ(ŷᵢ − ȳ)²

• These three sums of squares and degrees of freedom are:
  • Total (SST)   DFT = n−1
  • Error (SSE)   DFE = n−2
  • Model (SSM)   DFM = 1

Analysis of Variance in Regression

   Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
   Model                 SSM              1                    MSM = SSM/1         F = MSM/MSE
   Error                 SSE              n−2                  MSE = SSE/(n−2)
   Total                 SST              n−1

• Analysis of Variance - F-test
• H₀: β₁ = 0     Hₐ: β₁ ≠ 0

   T.S.: F_obs = MSM / MSE
   R.R.: F_obs ≥ F(α, 1, n−2)
   P-value: P(F ≥ F_obs)

Example - Pharmacodynamics of LSD

• Total sum of squares:   SST = Σ(yᵢ − ȳ)² = 2078.183     DFT = 7 − 1 = 6
• Error sum of squares:   SSE = Σ(yᵢ − ŷᵢ)² = 253.890     DFE = 7 − 2 = 5
• Model sum of squares:   SSM = Σ(ŷᵢ − ȳ)² = 2078.183 − 253.890 = 1824.293     DFM = 1

Example - Pharmacodynamics of LSD

   Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
   Model                 1824.293         1                    1824.293      35.93
   Error                 253.890          5                    50.778
   Total                 2078.183         6

• Analysis of Variance - F-test
• H₀: β₁ = 0     Hₐ: β₁ ≠ 0

   T.S.: F_obs = MSM / MSE = 1824.293 / 50.778 = 35.93
   R.R.: F_obs ≥ F(.05, 1, 5) = 6.61
   P-value: P(F ≥ 35.93)
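The ANOVA table and F test can be rebuilt from SST and SSE alone; a minimal sketch using the totals from the example:

```python
# ANOVA partition and F test for the LSD example, built from SST and SSE.
n = 7
SST = 2078.183    # total sum of squares, df = n - 1 = 6
SSE = 253.890     # error sum of squares, df = n - 2 = 5
SSM = SST - SSE   # model sum of squares 1824.293, df = 1

MSM = SSM / 1
MSE = SSE / (n - 2)        # 50.778, also the estimate s^2
F_obs = MSM / MSE          # about 35.93

F_crit = 6.61              # tabled F(.05, 1, 5)
reject = F_obs >= F_crit   # True: conclude beta1 != 0
print(round(F_obs, 2), reject)
```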

Example - SPSS Output

(SPSS ANOVA table; predictors: (Constant), LSD_CONC; dependent variable: SCORE)

Multiple Regression

• Numeric response variable (Y)
• p numeric predictor variables
• Model:  Y = β₀ + β₁x₁ + ⋯ + β_p x_p + ε
• Partial Regression Coefficients: βᵢ ⇒ effect (on the mean response) of increasing the iᵗʰ predictor variable by 1 unit, holding all other predictors constant

Example - Effect of Birth weight on Body Size in Early Adolescence

• Response: Height at early adolescence (n = 250 cases)
• Predictors (p = 6 explanatory variables):
  • Adolescent age (x₁, in years -- 11-14)
  • Tanner stage (x₂, units not given)
  • Gender (x₃ = 1 if male, 0 if female)
  • Gestational age (x₄, in weeks at birth)
  • Birth length (x₅, units not given)
  • Birthweight group (x₆ = 1,...,6: <1500 g (1), 1500-1999 g (2), 2000-2499 g (3), 2500-2999 g (4), 3000-3499 g (5), >3500 g (6))

Source: Falkner, et al (2004)

Least Squares Estimation

• Population model for mean response:

   E(Y) = β₀ + β₁x₁ + ⋯ + β_p x_p

• Least squares fitted (predicted) equation, minimizing SSE:

   Ŷ = b₀ + b₁x₁ + ⋯ + b_p x_p      SSE = Σ(Y − Ŷ)²

• All statistical software packages/spreadsheets can compute least squares estimates and their standard errors

Analysis of Variance

• Direct extension to ANOVA based on simple linear regression
• Only adjustments are to degrees of freedom: DFM = p, DFE = n−p−1

   Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square           F
   Model                 SSM              p                    MSM = SSM/p           F = MSM/MSE
   Error                 SSE              n−p−1                MSE = SSE/(n−p−1)
   Total                 SST              n−1

   R² = (SST − SSE) / SST = SSM / SST

Testing for the Overall Model - F-test

• Tests whether any of the explanatory variables are associated with the response
• H₀: β₁ = ⋯ = β_p = 0 (None of the x's associated with y)
• Hₐ: Not all βᵢ = 0

   T.S.: F_obs = MSM / MSE = (R²/p) / ((1 − R²)/(n − p − 1))
   R.R.: F_obs ≥ F(α, p, n−p−1)
   P-value: P(F ≥ F_obs)

Example - Effect of Birth weight on Body Size in Early Adolescence

• Authors did not print ANOVA, but did provide the following:
• n = 250     p = 6     R² = 0.26
• H₀: β₁ = ⋯ = β₆ = 0     Hₐ: Not all βᵢ = 0

   T.S.: F_obs = (R²/p) / ((1 − R²)/(n − p − 1)) = (0.26/6) / ((1 − 0.26)/(250 − 6 − 1)) = 14.2
   R.R.: F_obs ≥ F(α, 6, 243)
   P-value: P(F ≥ 14.2)
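The overall F statistic needs only n, p, and R²; a sketch using the published values:

```python
# Overall F statistic computed from R^2 alone (birth weight example).
n, p, R2 = 250, 6, 0.26

F_obs = (R2 / p) / ((1 - R2) / (n - p - 1))   # about 14.2
print(round(F_obs, 1))
# Compare to F(alpha, 6, 243): 14.2 is far beyond any common critical value,
# so at least one predictor is associated with height.
```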

Testing Individual Partial Coefficients - t-tests

• Wish to determine whether the response is associated with a single explanatory variable, after controlling for the others
• H₀: βᵢ = 0     Hₐ: βᵢ ≠ 0 (2-sided alternative)

   T.S.: t_obs = bᵢ / SE(bᵢ)
   R.R.: |t_obs| ≥ t(α/2, n−p−1)
   P-value: 2P(t ≥ |t_obs|)

Example - Effect of Birth weight on Body Size in Early Adolescence

   Variable           b       SE(b)   t = b/SE(b)   P-val (z)
   Adolescent Age     2.86    0.99     2.89          .0038
   Tanner Stage       3.41    0.89     3.83         <.001
   Male               0.08    1.26     0.06          .9522
   Gestational Age   -0.11    0.21    -0.52          .6030
   Birth Length       0.44    0.19     2.32          .0204
   Birth Wt Grp      -0.78    0.64    -1.22          .2224

Controlling for all other predictors, adolescent age, Tanner stage, and birth length are associated with adolescent height measurement

Testing for the Overall Model - F-test

• Tests whether any of the explanatory variables are associated with the response
• H₀: β₁ = ⋯ = β_p = 0 (None of the X's associated with Y)
• Hₐ: Not all βᵢ = 0

   T.S.: F_obs = MSM / MSE = (R²/p) / ((1 − R²)/(n − p − 1))
   P-value: P(F ≥ F_obs)

The P-value is based on the F-distribution with p numerator and (n−p−1) denominator degrees of freedom

Comparing Regression Models

• Conflicting goals: Explaining variation in Y while keeping model as simple as possible (parsimony)
• We can test whether a subset of p−g predictors (including possibly cross-product terms) can be dropped from a model that contains the remaining g predictors.  H₀: β_{g+1} = … = β_p = 0
  – Complete Model: Contains all p predictors
  – Reduced Model: Eliminates the predictors from H₀
  – Fit both models, obtaining the Error sum of squares for each (or R² from each)

Comparing Regression Models

• H₀: β_{g+1} = … = β_p = 0 (After removing the effects of X₁,…,X_g, none of the other predictors are associated with Y)
• Hₐ: H₀ is false

   Test Statistic:  F_obs = [ (SSE_r − SSE_c) / (p − g) ] / [ SSE_c / (n − p − 1) ]
   P-value: P(F ≥ F_obs)

P-value based on F-distribution with p−g and n−p−1 d.f.
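The complete-vs-reduced comparison can be sketched as a small function; the SSE values, n, p, and g below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Partial (complete vs reduced) F test from the two error sums of squares.
def partial_F(SSE_r, SSE_c, n, p, g):
    """F statistic for H0: the p - g extra predictors can be dropped.

    SSE_r: error sum of squares of the reduced model (g predictors)
    SSE_c: error sum of squares of the complete model (p predictors)
    """
    num = (SSE_r - SSE_c) / (p - g)   # improvement per dropped-then-restored predictor
    den = SSE_c / (n - p - 1)         # MSE of the complete model
    return num / den

# Hypothetical example: n = 50, complete model p = 5, reduced model g = 2.
F_obs = partial_F(SSE_r=400.0, SSE_c=300.0, n=50, p=5, g=2)
# Compare F_obs to F(alpha, p - g, n - p - 1) = F(alpha, 3, 44).
print(round(F_obs, 3))
```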

Models with Dummy Variables

• Some models have both numeric and categorical explanatory variables (recall gender in example)
• If a categorical variable has k levels, need to create k−1 dummy variables that take on the values 1 if the level of interest is present, 0 otherwise
• The baseline level of the categorical variable is the one for which all k−1 dummy variables are set to 0
• The regression coefficient corresponding to a dummy variable is the difference between the mean for that level and the mean for the baseline group, controlling for all numeric predictors
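The k−1 dummy coding can be sketched directly; the three-level variable and its baseline below are hypothetical examples, not from the notes:

```python
# Building k-1 dummy variables for a categorical predictor with k levels.
# "low" is chosen (arbitrarily) as the baseline: both dummies are 0 for it.
levels = ["low", "medium", "high"]                      # k = 3 levels
baseline = "low"
dummy_levels = [lv for lv in levels if lv != baseline]  # k - 1 = 2 dummies

def dummies(value):
    """Return the k-1 dummy values (0/1) coding a single observation."""
    return [1 if value == lv else 0 for lv in dummy_levels]

rows = [dummies(v) for v in ["low", "high", "medium"]]
print(rows)   # baseline "low" codes as all zeros
```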

Example - Deep Cervical Infections

• Subjects - Patients with deep neck infections
• Response (Y) - Length of stay in hospital
• Predictors: (one numeric, 11 dichotomous)
  – Age (x₁)
  – Gender (x₂ = 1 if female, 0 if male)
  – Fever (x₃ = 1 if body temp > 38C, 0 if not)
  – Neck swelling (x₄ = 1 if present, 0 if absent)
  – Neck pain (x₅ = 1 if present, 0 if absent)
  – Trismus (x₆ = 1 if present, 0 if absent)
  – Underlying disease (x₇ = 1 if present, 0 if absent)
  – Respiration difficulty (x₈ = 1 if present, 0 if absent)
  – Complication (x₉ = 1 if present, 0 if absent)
  – WBC > 15000/mm³ (x₁₀ = 1 if present, 0 if absent)
  – CRP > 100 μg/ml (x₁₁ = 1 if present, 0 if absent)

Source: Wang, et al (2003)

Example - Weather and Spinal Patients

• Subjects - Visitors to National Spinal Network in 23 cities completing SF-36 form
• Response - Physical Function subscale (1 of 10 reported)
• Predictors:
  – Patient's age (x₁)
  – Gender (x₂ = 1 if female, 0 if male)
  – High temperature on day of visit (x₃)
  – Low temperature on day of visit (x₄)
  – Dew point (x₅)
  – Wet bulb (x₆)
  – Total precipitation (x₇)
  – Barometric pressure (x₈)
  – Length of sunlight (x₉)
  – Moon phase (new, wax crescent, 1st Qtr, wax gibbous, full moon, wan gibbous, last Qtr, wan crescent; presumably had 8−1 = 7 dummy variables)

Source: Glaser, et al (2004)

Analysis of Covariance

• Combination of 1-Way ANOVA and linear regression
• Goal: Comparing numeric responses among k groups, adjusting for numeric concomitant variable(s), referred to as Covariate(s)
• Clinical trial applications: Response is post-treatment score, covariate is pre-treatment score
• Epidemiological applications: Outcomes compared across exposure conditions, adjusted for other risk factors (age, smoking status, sex, ...)

Nonlinear Regression

• Theory often leads to nonlinear relations between variables. Examples:
  – 1-compartment PK model with 1st-order absorption and elimination
  – Sigmoid-Emax S-shaped PD model

Example - P24 Antigens and AZT

• Goal: Model time course of P24 antigen levels after oral administration of zidovudine
• Model fit individually in 40 HIV+ patients:

   E(t) = E₀(1 − A)e^(−k_out·t) + E₀·A

where:
• E(t) is the antigen level at time t
• E₀ is the initial level
• A is the coefficient of reduction of P24 antigen
• k_out is the rate constant of decrease of P24 antigen

Source: Sasomsin, et al (2002)

Example - P24 Antigens and AZT

• Among the 40 individuals to whom the model was fit, the means and standard deviations of the PK "parameters" are given below:

   Parameter   Mean    Std Dev
   E₀          472.1   408.8
   A           0.28    0.21
   k_out       0.27    0.16

• Fitted model for the "mean subject":

   E(t) = 472.1(1 − 0.28)e^(−0.27t) + (472.1)(0.28)
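The mean-subject model can be evaluated directly; a sketch using the parameter values from the table above:

```python
import math

# Mean-subject P24 model: E(t) = E0*(1 - A)*exp(-k_out*t) + E0*A.
E0, A, k_out = 472.1, 0.28, 0.27

def antigen_level(t):
    """P24 antigen level at time t for the mean subject."""
    return E0 * (1 - A) * math.exp(-k_out * t) + E0 * A

# At t = 0 the level is E0; as t grows it decays toward the floor E0*A.
start = antigen_level(0.0)   # 472.1
floor = E0 * A               # about 132.2, the long-run residual level
```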

Example - P24 Antigens and AZT

(figure omitted)

Example - MK639 in HIV+ Patients

• Response: Y = log₁₀(RNA change)
• Predictor: x = MK639 AUC₀₋₆ₕ
• Model: Sigmoid-Emax:

   Y = (β₀ x^β₂) / (x^β₂ + β₁^β₂) + ε

where:
• β₀ is the maximum effect (limit as x → ∞)
• β₁ is the x level producing 50% of maximum effect
• β₂ is a parameter affecting the shape of the function

Source: Stein, et al (1996)

Example - MK639 in HIV+ Patients

• Data on n = 5 subjects in a Phase 1 trial:

   Subject   log RNA change (Y)   MK639 AUC₀₋₆ₕ (x)
   1         0.000                10576.9
   2         0.167                13942.3
   3         1.524                18235.3
   4         3.205                19607.8
   5         3.518                22317.1

• Model fit using SPSS (estimates slightly different from notes, which used SAS):

   Ŷ = (3.52 x^35.60) / (x^35.60 + 18374.39^35.60)
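The fitted sigmoid-Emax equation can be evaluated directly; the ratio form (b₁/x)^b₂ below is an implementation choice to avoid overflowing x^35.60 for large x, not part of the original notes:

```python
# Fitted sigmoid-Emax model for the MK639 example:
#   Y_hat = b0 * x^b2 / (x^b2 + b1^b2)
# Rewritten as b0 / (1 + (b1/x)^b2), which is algebraically identical but
# avoids computing the huge intermediate quantity x^35.60.
b0, b1, b2 = 3.52, 18374.39, 35.60

def y_hat(x):
    """Predicted log10 RNA change at AUC level x."""
    return b0 / (1.0 + (b1 / x) ** b2)

low = y_hat(10576.9)    # near 0: AUC well below the 50%-effect point
mid = y_hat(18374.39)   # exactly half the maximum effect: 1.76
high = y_hat(22317.1)   # near the maximum effect 3.52
```

At x = b₁ the prediction is b₀/2 by construction, which is why b₁ is interpreted as the 50%-effect level.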

Example - MK639 in HIV+ Patients

(figure omitted)

Data Sources • Wagner, J.G., G.K. Aghajanian, and O.H. Bing (1968). “Correlation of Performance Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects,”

Clinical Pharmacology and Therapeutics

, 9:635-638.