Transcript: Linear Regression and Correlation
Linear Regression and Correlation
• Explanatory and response variables are numeric
• The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line)
• Model:

  Y = β0 + β1 x + ε    ε ~ N(0, σ)

• β1 > 0 ⟹ Positive Association
• β1 < 0 ⟹ Negative Association
• β1 = 0 ⟹ No Association
Least Squares Estimation of β0, β1
• β0 ≡ Mean response when x = 0 (y-intercept)
• β1 ≡ Change in mean response when x increases by 1 unit (slope)
• β0, β1 are unknown parameters (like μ)
• β0 + β1 x ≡ Mean response when the explanatory variable takes on the value x
• Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight line:

  ŷ = b0 + b1 x    SSE = Σ(i=1..n) (yi − ŷi)² = Σ(i=1..n) [yi − (b0 + b1 xi)]²
Example - Pharmacodynamics of LSD
• Response (y) - Math score (mean among 5 volunteers)
• Predictor (x) - LSD tissue concentration (mean of 5 volunteers)
• Raw data and scatterplot of Score vs LSD concentration:

  Score (y):     78.93  58.20  67.47  37.47  45.65  32.92  29.97
  LSD Conc (x):   1.17   2.97   3.26   4.69   5.83   6.00   6.41

[Scatterplot of Score (y) vs LSD_CONC omitted]
Source: Wagner, et al (1968)
Least Squares Computations

  Sxx = Σ(x − x̄)²    Sxy = Σ(x − x̄)(y − ȳ)    Syy = Σ(y − ȳ)²

  b1 = Sxy / Sxx    b0 = ȳ − b1 x̄    s² = Σ(y − ŷ)² / (n − 2) = SSE / (n − 2)
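The least squares computations above can be worked directly in a few lines of plain Python (no libraries required). This is a sketch using the LSD example data from the slides:

```python
# Least-squares computations for the LSD example (data from the slides).
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar = sum(x) / n
ybar = sum(y) / n

# Sums of squares and cross-products
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)

# Least-squares estimates and residual variance
b1 = Sxy / Sxx                 # slope
b0 = ybar - b1 * xbar          # intercept
SSE = Syy - b1 * Sxy           # computational shortcut for the error SS
s2 = SSE / (n - 2)             # estimate of sigma^2

print(round(b0, 2), round(b1, 2))   # fitted line: yhat = 89.12 - 9.01 x
```

The estimates agree with the hand computations on the following slides.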
Example - Pharmacodynamics of LSD

  Score (y)  LSD Conc (x)  x − x̄    y − ȳ      (x − x̄)²    (x − x̄)(y − ȳ)  (y − ȳ)²
  78.93       1.17         -3.163   28.843     10.004569    -91.230409      831.918649
  58.20       2.97         -1.363    8.113      1.857769    -11.058019       65.820769
  67.47       3.26         -1.073   17.383      1.151329    -18.651959      302.168689
  37.47       4.69          0.357  -12.617      0.127449     -4.504269      159.188689
  45.65       5.83          1.497   -4.437      2.241009     -6.642189       19.686969
  32.92       6.00          1.667  -17.167      2.778889    -28.617389      294.705889
  29.97       6.41          2.077  -20.117      4.313929    -41.783009      404.693689
  350.61     30.33         -0.001    0.001     22.474943   -202.487243     2078.183343

(Column totals given in bottom row of table: Sxx = 22.4749, Sxy = -202.4872, Syy = 2078.1833)
  ȳ = 350.61 / 7 = 50.087    x̄ = 30.33 / 7 = 4.333

  b1 = Sxy / Sxx = -202.4872 / 22.4749 = -9.01

  b0 = ȳ − b1 x̄ = 50.09 − (-9.01)(4.33) = 89.10

  ŷ = 89.10 − 9.01 x    s² = 50.72
SPSS Output and Plot of Equation
• Math Score vs LSD Concentration (SPSS), with fitted linear regression line:

  score = 89.12 + -9.01 * lsd_conc    R-Square = 0.88

[SPSS coefficients table and plot omitted; Dependent Variable: SCORE, Predictor: LSD_CONC]
Inference Concerning the Slope (β1)
• Parameter: Slope in the population model (β1)
• Estimator: Least squares estimate: b1
• Estimated standard error: SE(b1) = s / √Sxx
• Methods of making inference regarding population:
  – Hypothesis tests (2-sided or 1-sided)
  – Confidence Intervals
Hypothesis Test for β1
• 2-Sided Test
  – H0: β1 = 0    HA: β1 ≠ 0
  – T.S.: tobs = b1 / SE(b1)
  – R.R.: |tobs| ≥ t(α/2, n−2)
  – P-value: 2P(t ≥ |tobs|)
• 1-Sided Tests
  – H0: β1 = 0    HA+: β1 > 0 or HA−: β1 < 0
  – T.S.: tobs = b1 / SE(b1)
  – HA+:  R.R.: tobs ≥ t(α, n−2)    P-val: P(t ≥ tobs)
  – HA−:  R.R.: tobs ≤ −t(α, n−2)    P-val: P(t ≤ tobs)
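The two-sided test can be carried out numerically. A minimal sketch for the LSD example, assuming scipy is available for the t-distribution:

```python
from scipy import stats

# LSD example data (from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5    # residual standard deviation

# Two-sided test of H0: beta1 = 0
se_b1 = s / Sxx ** 0.5
t_obs = b1 / se_b1
p_val = 2 * stats.t.sf(abs(t_obs), df=n - 2)       # 2P(t >= |t_obs|)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)       # rejection-region cutoff
```

`scipy.stats.linregress(x, y)` returns the same slope, intercept, and two-sided p-value in one call.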
(1 − α)100% Confidence Interval for β1

  b1 ± t(α/2, n−2) SE(b1)  ≡  b1 ± t(α/2, n−2) s / √Sxx

• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is the same as the 2-sided hypothesis test
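A sketch of this interval for the LSD example data, assuming scipy for the t critical value:

```python
from scipy import stats

# 95% CI for the slope, using the LSD example data (from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5

se_b1 = s / Sxx ** 0.5
t_crit = stats.t.ppf(0.975, df=n - 2)          # t(alpha/2, n-2) for 95%
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1

# Entire interval below 0 -> conclude a negative association
print(round(lo, 2), round(hi, 2))
```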
Example - Pharmacodynamics of LSD

  n = 7    b1 = -9.01    s = √50.72 = 7.12    Sxx = 22.475

  SE(b1) = s / √Sxx = 7.12 / √22.475 = 1.50

• Testing H0: β1 = 0 vs HA: β1 ≠ 0

  T.S.: tobs = -9.01 / 1.50 = -6.01    R.R.: |tobs| ≥ t(.025, 5) = 2.571

• 95% Confidence Interval for β1:

  -9.01 ± 2.571(1.50) ≡ -9.01 ± 3.86 ≡ (-12.87, -5.15)
Confidence Interval for Mean When x = x*
• Mean response at a specific level x* is:

  E(y | x*) = μy = β0 + β1 x*

• Estimated mean response and standard error (replacing unknown β0 and β1 with estimates):

  μ̂y = b0 + b1 x*    SE(μ̂y) = s √[ 1/n + (x* − x̄)² / Sxx ]

• Confidence Interval for Mean Response:

  μ̂y ± t(α/2, n−2) SE(μ̂y)
Prediction Interval of Future Response @ x = x*
• Response at a specific level x* is:

  y | x = x*  =  β0 + β1 x* + ε

• Estimated response and standard error (replacing unknown β0 and β1 with estimates):

  ŷ = b0 + b1 x*    SE(ŷ) = s √[ 1 + 1/n + (x* − x̄)² / Sxx ]

• Prediction Interval for Future Response:

  ŷ ± t(α/2, n−2) SE(ŷ)
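The two interval formulas differ only by the extra "1 +" under the square root, so the prediction interval is always wider. A sketch computing both for the LSD data at an illustrative value x* = 4.0 (the choice of x* is ours, not from the slides):

```python
from scipy import stats

# Interval estimates at x* = 4.0 for the LSD example (data from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = ((Syy - b1 * Sxy) / (n - 2)) ** 0.5

xstar = 4.0
yhat = b0 + b1 * xstar
se_mean = s * (1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5      # CI for mean
se_pred = s * (1 + 1 / n + (xstar - xbar) ** 2 / Sxx) ** 0.5  # PI for new obs
t_crit = stats.t.ppf(0.975, df=n - 2)

ci = (yhat - t_crit * se_mean, yhat + t_crit * se_mean)
pi = (yhat - t_crit * se_pred, yhat + t_crit * se_pred)
```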
Correlation Coefficient
• Measures the strength of the linear association between two variables
• Takes on the same sign as the slope estimate from the linear regression
• Not affected by linear transformations of y or x
• Does not distinguish between dependent and independent variables (e.g. height and weight)
• Population parameter: ρ
• Pearson's Correlation Coefficient:

  r = Sxy / √(Sxx Syy)    −1 ≤ r ≤ 1
Correlation Coefficient
• Values close to 1 in absolute value ⟹ strong linear association, positive or negative from sign
• Values close to 0 imply little or no association
• If data contain outliers (are non-normal), Spearman's coefficient of correlation can be computed based on the ranks of the x and y values
• Test of H0: ρ = 0 is equivalent to test of H0: β1 = 0
• Coefficient of Determination (r²) - Proportion of variation in y "explained" by the regression on x:

  r² = (r)² = (Syy − SSE) / Syy    0 ≤ r² ≤ 1
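Both measures are available in scipy; a sketch for the LSD example, checking the formula against `scipy.stats.pearsonr` and adding the rank-based Spearman alternative:

```python
from scipy import stats

# Pearson's r and r^2 for the LSD example (data from the slides),
# plus Spearman's rank-based alternative.
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Syy = sum((yi - ybar) ** 2 for yi in y)

r = Sxy / (Sxx * Syy) ** 0.5   # Pearson's r from the S-quantities
r2 = r ** 2                    # coefficient of determination

r_scipy, p_pearson = stats.pearsonr(x, y)   # same r, plus test of H0: rho = 0
rho_s, p_spearman = stats.spearmanr(x, y)   # rank-based (outlier-resistant)
```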
Example - Pharmacodynamics of LSD

  Sxx = 22.475    Sxy = -202.487    Syy = 2078.183    SSE = 253.89

  r = -202.487 / √[(22.475)(2078.183)] = -0.94

  r² = (-0.94)² = 0.88 = (Syy − SSE) / Syy = (2078.183 − 253.89) / 2078.183
[Scatterplots of score vs lsd_conc omitted: one with a horizontal line at the mean, one with the fitted line score = 89.12 + -9.01 * lsd_conc]
Example - SPSS Output: Pearson's and Spearman's Measures

[SPSS correlation tables omitted; both Pearson's and Spearman's correlations are flagged as significant at the 0.01 level (2-tailed)]
Analysis of Variance in Regression
• Goal: Partition the total variation in y into variation "explained" by x and random variation:

  (yi − ȳ) = (yi − ŷi) + (ŷi − ȳ)

  Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)²

• These three sums of squares and degrees of freedom are:
  • Total (SST)    DFT = n − 1
  • Error (SSE)    DFE = n − 2
  • Model (SSM)    DFM = 1
Analysis of Variance in Regression

  Source of Variation  Sum of Squares  Degrees of Freedom  Mean Square          F
  Model                SSM             1                   MSM = SSM / 1        F = MSM / MSE
  Error                SSE             n − 2               MSE = SSE / (n − 2)
  Total                SST             n − 1

• Analysis of Variance - F-test
• H0: β1 = 0    HA: β1 ≠ 0

  T.S.: Fobs = MSM / MSE    R.R.: Fobs ≥ F(α, 1, n−2)    P-value: P(F ≥ Fobs)
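The partition and the F-test can be checked numerically. A sketch for the LSD example, assuming scipy for the F-distribution:

```python
from scipy import stats

# ANOVA partition and F-test for the LSD example (data from the slides)
y = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]
x = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
n = len(y)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)               # total,  n-1 df
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error,  n-2 df
SSM = SST - SSE                                       # model,  1 df

F_obs = (SSM / 1) / (SSE / (n - 2))
p_val = stats.f.sf(F_obs, 1, n - 2)                   # P(F >= F_obs)
F_crit = stats.f.ppf(0.95, 1, n - 2)                  # F(.05, 1, 5)
```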
Example - Pharmacodynamics of LSD
• Total sum of squares:  SST = Σ(yi − ȳ)² = 2078.183    DFT = 7 − 1 = 6
• Error sum of squares:  SSE = Σ(yi − ŷi)² = 253.890    DFE = 7 − 2 = 5
• Model sum of squares:  SSM = Σ(ŷi − ȳ)² = 2078.183 − 253.890 = 1824.293    DFM = 1
Example - Pharmacodynamics of LSD

  Source of Variation  Sum of Squares  Degrees of Freedom  Mean Square  F
  Model                1824.293        1                   1824.293     35.93
  Error                 253.890        5                     50.778
  Total                2078.183        6

• Analysis of Variance - F-test
• H0: β1 = 0    HA: β1 ≠ 0

  T.S.: Fobs = MSM / MSE = 35.93    R.R.: Fobs ≥ F(.05, 1, 5) = 6.61    P-val: P(F ≥ 35.93)
Example - SPSS Output

[SPSS ANOVA table omitted; Dependent Variable: SCORE, Predictors: (Constant), LSD_CONC]
Multiple Regression
• Numeric response variable (Y)
• p numeric predictor variables
• Model:

  Y = β0 + β1 x1 + ... + βp xp + ε

• Partial Regression Coefficients: βi ≡ effect (on the mean response) of increasing the i-th predictor variable by 1 unit, holding all other predictors constant
Example - Effect of Birth weight on Body Size in Early Adolescence
• Response: Height at early adolescence (n = 250 cases)
• Predictors (p = 6 explanatory variables):
  • Adolescent age (x1, in years -- 11-14)
  • Tanner stage (x2, units not given)
  • Gender (x3 = 1 if male, 0 if female)
  • Gestational age (x4, in weeks at birth)
  • Birth length (x5, units not given)
  • Birthweight group (x6 = 1,...,6: <1500 g (1), 1500-1999 g (2), 2000-2499 g (3), 2500-2999 g (4), 3000-3499 g (5), >3500 g (6))
Source: Falkner, et al (2004)
Least Squares Estimation
• Population model for mean response:

  E(Y) = β0 + β1 x1 + ... + βp xp

• Least squares fitted (predicted) equation, minimizing SSE:

  Ŷ = b0 + b1 x1 + ... + bp xp    SSE = Σ(Y − Ŷ)²

• All statistical software packages/spreadsheets can compute least squares estimates and their standard errors
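A minimal sketch of how such software computes the fit, using numpy's least-squares solver on synthetic (hypothetical) data generated from a known plane so the answer can be checked exactly:

```python
import numpy as np

# Synthetic, noise-free data from a known plane: y = 10 + 2*x1 - 3*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 10.0 + 2.0 * x1 - 3.0 * x2

# Design matrix with a leading column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])
b, resid, rank, _ = np.linalg.lstsq(X, y, rcond=None)

y_fit = X @ b
SSE = float(np.sum((y - y_fit) ** 2))   # ~0 here, since y is noise-free
```

With noisy real data, SSE would be positive and MSE = SSE/(n − p − 1) would estimate the error variance.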
Analysis of Variance
• Direct extension to ANOVA based on simple linear regression
• Only adjustments are to degrees of freedom:
  – DFM = p    DFE = n − p − 1

  Source of Variation  Sum of Squares  Degrees of Freedom  Mean Square              F
  Model                SSM             p                   MSM = SSM / p            F = MSM / MSE
  Error                SSE             n − p − 1           MSE = SSE / (n − p − 1)
  Total                SST             n − 1

  R² = (SST − SSE) / SST = SSM / SST
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are associated with the response
• H0: β1 = ... = βp = 0  (None of the x's associated with y)
• HA: Not all βi = 0

  T.S.: Fobs = MSM / MSE = (R² / p) / [(1 − R²) / (n − p − 1)]

  R.R.: Fobs ≥ F(α, p, n−p−1)    P-val: P(F ≥ Fobs)
Example - Effect of Birth weight on Body Size in Early Adolescence
• Authors did not print ANOVA, but did provide the following:
• n = 250    p = 6    R² = 0.26
• H0: β1 = ... = β6 = 0    HA: Not all βi = 0

  T.S.: Fobs = (R² / p) / [(1 − R²) / (n − p − 1)] = (0.26 / 6) / [(1 − 0.26) / (250 − 6 − 1)] = .0433 / .0030 = 14.2

  R.R.: Fobs ≥ F(.05, 6, 243) = 2.13    P-val: P(F ≥ 14.2)
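The R²-based form of the test statistic is handy precisely because it needs only the published summary numbers. A sketch, assuming scipy for the F-distribution:

```python
from scipy import stats

# Overall F-test computed from the published summary on the slide
# (n = 250, p = 6, R^2 = 0.26) -- no raw data needed.
def overall_f(r2, p, n):
    """F_obs = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

n, p, r2 = 250, 6, 0.26
F_obs = overall_f(r2, p, n)
F_crit = stats.f.ppf(0.95, p, n - p - 1)   # F(.05, 6, 243)
p_val = stats.f.sf(F_obs, p, n - p - 1)
```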
Testing Individual Partial Coefficients - t-tests
• Wish to determine whether the response is associated with a single explanatory variable, after controlling for the others
• H0: βi = 0    HA: βi ≠ 0 (2-sided alternative)

  T.S.: tobs = bi / SE(bi)    R.R.: |tobs| ≥ t(α/2, n−p−1)    P-val: 2P(t ≥ |tobs|)
Example - Effect of Birth weight on Body Size in Early Adolescence

  Variable          b      SE(b)  t = b/SE(b)  P-val (z)
  Adolescent Age    2.86   0.99    2.89        .0038
  Tanner Stage      3.41   0.89    3.83        <.001
  Male              0.08   1.26    0.06        .9522
  Gestational Age  -0.11   0.21   -0.52        .6030
  Birth Length      0.44   0.19    2.32        .0204
  Birth Wt Grp     -0.78   0.64   -1.22        .2224

Controlling for all other predictors, adolescent age, Tanner stage, and birth length are associated with adolescent height measurement
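The t-ratios in the table can be reproduced from the reported b and SE(b) values. A sketch using a normal approximation for the p-values, as the "P-val (z)" column heading suggests:

```python
from scipy import stats

# Recomputing the t-ratios from the reported b and SE(b) values
coefs = {                      # variable: (b, SE(b))
    "Adolescent Age":  (2.86, 0.99),
    "Tanner Stage":    (3.41, 0.89),
    "Male":            (0.08, 1.26),
    "Gestational Age": (-0.11, 0.21),
    "Birth Length":    (0.44, 0.19),
    "Birth Wt Grp":    (-0.78, 0.64),
}
results = {}
for name, (b, se) in coefs.items():
    t = b / se
    p = 2 * stats.norm.sf(abs(t))      # two-sided z-approximation
    results[name] = (round(t, 2), p)
```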
Testing for the Overall Model - F-test
• Tests whether any of the explanatory variables are associated with the response
• H0: β1 = ... = βp = 0  (None of the X's associated with Y)
• HA: Not all βi = 0

  T.S.: Fobs = MSM / MSE = (R² / p) / [(1 − R²) / (n − p − 1)]    P-val: P(F ≥ Fobs)

The P-value is based on the F-distribution with p numerator and (n − p − 1) denominator degrees of freedom
Comparing Regression Models
• Conflicting goals: Explaining variation in Y while keeping the model as simple as possible (parsimony)
• We can test whether a subset of p − g predictors (including possibly cross-product terms) can be dropped from a model that contains the remaining g predictors.

  H0: βg+1 = ... = βp = 0

  – Complete Model: Contains all p predictors
  – Reduced Model: Eliminates the predictors from H0
  – Fit both models, obtaining the Error sum of squares for each (or R² from each)
Comparing Regression Models
• H0: βg+1 = ... = βp = 0 (After removing the effects of X1,...,Xg, none of the other predictors are associated with Y)
• Ha: H0 is false

  Test Statistic: Fobs = [(SSEr − SSEc) / (p − g)] / [SSEc / (n − p − 1)]

  P = P(F ≥ Fobs)

P-value based on F-distribution with p − g and n − p − 1 d.f.
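A sketch of the complete-vs-reduced comparison; the numbers below are made up purely to exercise the formula:

```python
# Complete-vs-reduced model F statistic
def nested_f(sse_reduced, sse_complete, n, p, g):
    """F_obs = [(SSEr - SSEc)/(p - g)] / [SSEc/(n - p - 1)]."""
    num = (sse_reduced - sse_complete) / (p - g)
    den = sse_complete / (n - p - 1)
    return num / den

# Hypothetical fits: dropping 2 of 5 predictors raises SSE from 80 to 100
F_obs = nested_f(sse_reduced=100.0, sse_complete=80.0, n=30, p=5, g=3)
```

A large F_obs means the dropped predictors explained a nontrivial share of the variation, so the reduced model loses real explanatory power.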
Models with Dummy Variables
• Some models have both numeric and categorical explanatory variables (recall gender in the example)
• If a categorical variable has k levels, we need to create k − 1 dummy variables that take on the value 1 if the level of interest is present, 0 otherwise.
• The baseline level of the categorical variable is the one for which all k − 1 dummy variables are set to 0
• The regression coefficient corresponding to a dummy variable is the difference between the mean for that level and the mean for the baseline group, controlling for all numeric predictors
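The k − 1 coding scheme can be sketched in plain Python; the level names here are hypothetical:

```python
# k-1 dummy coding for a categorical variable
def dummy_code(values, baseline):
    """Encode a categorical variable with k levels as k-1 0/1 dummies,
    dropping the baseline level (which encodes as an all-zero row)."""
    levels = [lv for lv in sorted(set(values)) if lv != baseline]
    return [[1 if v == lv else 0 for lv in levels] for v in values], levels

rows, levels = dummy_code(["A", "B", "C", "A"], baseline="A")
# levels == ["B", "C"]; the baseline "A" encodes as [0, 0]
```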
Example - Deep Cervical Infections
• Subjects - Patients with deep neck infections
• Response (Y) - Length of stay in hospital
• Predictors: (One numeric, 11 dichotomous)
  – Age (x1)
  – Gender (x2 = 1 if female, 0 if male)
  – Fever (x3 = 1 if body temp > 38C, 0 if not)
  – Neck swelling (x4 = 1 if present, 0 if absent)
  – Neck pain (x5 = 1 if present, 0 if absent)
  – Trismus (x6 = 1 if present, 0 if absent)
  – Underlying disease (x7 = 1 if present, 0 if absent)
  – Respiration difficulty (x8 = 1 if present, 0 if absent)
  – Complication (x9 = 1 if present, 0 if absent)
  – WBC > 15000/mm³ (x10 = 1 if present, 0 if absent)
  – CRP > 100 μg/ml (x11 = 1 if present, 0 if absent)
Source: Wang, et al (2003)
Example - Weather and Spinal Patients
• Subjects - Visitors to National Spinal Network in 23 cities completing SF-36 form
• Response - Physical Function subscale (1 of 10 reported)
• Predictors:
  – Patient's age (x1)
  – Gender (x2 = 1 if female, 0 if male)
  – High temperature on day of visit (x3)
  – Low temperature on day of visit (x4)
  – Dew point (x5)
  – Wet bulb (x6)
  – Total precipitation (x7)
  – Barometric pressure (x8)
  – Length of sunlight (x9)
  – Moon phase (new, wax crescent, 1st Qtr, wax gibbous, full moon, wan gibbous, last Qtr, wan crescent; presumably had 8 − 1 = 7 dummy variables)
Source: Glaser, et al (2004)
Analysis of Covariance
• Combination of 1-Way ANOVA and linear regression
• Goal: Comparing numeric responses among k groups, adjusting for numeric concomitant variable(s), referred to as Covariate(s)
• Clinical trial applications: Response is post-treatment score, covariate is pre-treatment score
• Epidemiological applications: Outcomes compared across exposure conditions, adjusted for other risk factors (age, smoking status, sex, ...)
Nonlinear Regression • Theory often leads to nonlinear relations between variables. Examples: – 1-compartment PK model with 1st-order absorption and elimination – Sigmoid-Emax (S-shaped) PD model
Example - P24 Antigens and AZT
• Goal: Model time course of P24 antigen levels after oral administration of zidovudine
• Model fit individually in 40 HIV+ patients:

  E(t) = E0 (1 − A) e^(−k_out t) + E0 A

where:
• E(t) is the antigen level at time t
• E0 is the initial level
• A is the coefficient of reduction of P24 antigen
• k_out is the rate constant of decrease of P24 antigen
Source: Sasomsin, et al (2002)
Example - P24 Antigens and AZT
• Among the 40 individuals to whom the model was fit, the means and standard deviations of the PK "parameters" are given below:

  Parameter  Mean   Std Dev
  E0         472.1  408.8
  A          0.28   0.21
  k_out      0.27   0.16

• Fitted model for the "mean subject":

  E(t) = 472.1 (1 − 0.28) e^(−0.27 t) + (472.1)(0.28)
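The "mean subject" curve can be coded directly; note the model implies E(0) = E0 and that the level decays toward the floor E0·A as t grows large:

```python
import math

# The fitted "mean subject" curve from the slide
E0, A, k_out = 472.1, 0.28, 0.27

def p24_level(t):
    """E(t) = E0 * (1 - A) * exp(-k_out * t) + E0 * A."""
    return E0 * (1 - A) * math.exp(-k_out * t) + E0 * A

# Sanity checks: starts at E0 = 472.1, decays toward E0*A = 132.2
```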
Example - P24 Antigens and AZT
[Plot of fitted P24 antigen time course omitted]
Example - MK639 in HIV+ Patients
• Response: Y = log10(RNA change)
• Predictor: x = MK639 AUC(0-6h)
• Model: Sigmoid-Emax:

  Y = β0 x^β2 / (x^β2 + β1^β2) + ε

where:
• β0 is the maximum effect (limit as x → ∞)
• β1 is the x level producing 50% of maximum effect
• β2 is a parameter affecting the shape of the function
Source: Stein, et al (1996)
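The curve's defining properties are easy to verify in code. This sketch uses the SPSS estimates reported on the next slide (β0 = 3.52, β1 = 18374.39, β2 = 35.60):

```python
# Sigmoid-Emax curve with the SPSS estimates from the notes
b0_max, b1_ec50, b2_shape = 3.52, 18374.39, 35.60

def emax(x):
    """Y = beta0 * x^beta2 / (x^beta2 + beta1^beta2)."""
    t = x ** b2_shape
    return b0_max * t / (t + b1_ec50 ** b2_shape)

# At x = beta1 the response is exactly half the maximum effect
half = emax(b1_ec50)
```

With β2 this large the curve is nearly a step function: responses at the lower AUC values are close to 0, and those above β1 are close to the maximum 3.52.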
Example - MK639 in HIV+ Patients
• Data on n = 5 subjects in a Phase 1 trial:

  Subject               1        2        3        4        5
  log RNA change (Y)    0.000    0.167    1.524    3.205    3.518
  MK639 AUC(0-6h) (x)  10576.9  13942.3  18235.3  19607.8  22317.1

• Model fit using SPSS (estimates slightly different from notes, which used SAS):

  Ŷ = 3.52 x^35.60 / (x^35.60 + 18374.39^35.60)
Example - MK639 in HIV+ Patients
[Plot of fitted sigmoid-Emax curve omitted]
Data Sources • Wagner, J.G., G.K. Aghajanian, and O.H. Bing (1968). “Correlation of Performance Test Scores with Tissue Concentration of Lysergic Acid Diethylamide in Human Subjects,”
Clinical Pharmacology and Therapeutics
, 9:635-638.