Simple Linear Regression:
An Introduction
Dr. Tuan V. Nguyen
Garvan Institute of Medical Research
Sydney
Give a man three weapons – correlation,
regression and a pen – and he will use all three
(Anon, 1978)
An example
Age and cholesterol levels in 18 individuals

ID   Age   Chol (mg/ml)
 1    46   3.5
 2    20   1.9
 3    52   4.0
 4    30   2.6
 5    57   4.5
 6    25   3.0
 7    28   2.9
 8    36   3.8
 9    22   2.1
10    43   3.8
11    57   4.1
12    33   3.0
13    22   2.5
14    63   4.6
15    40   3.2
16    48   4.2
17    28   2.3
18    49   4.0
Read data into R
id   <- seq(1:18)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
plot(chol ~ age, pch=16)
[Scatter plot of chol (2.0 to 4.5) against age (20 to 60)]
Questions of interest
• Association between age and cholesterol levels
• Strength of association
• Prediction of cholesterol for a given age
Correlation and Regression analysis
Variance and covariance: algebra
• Let x and y be two random variables from a sample of n observations.
• Measure of variability of x and y: the variance
    var(x) = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)
    var(y) = \sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)
• How can we measure the covariation between x and y?
• Algebraically, if x and y are independent:
    var(x + y) = var(x) + var(y)
  In general:
    var(x + y) = var(x) + var(y) + 2 cov(x, y)
  where:
    cov(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
Variance and covariance: geometry
• The independence or dependence between x and y can be represented geometrically, by the law of cosines for a triangle with sides x and y and angle H between them:
    If x and y are orthogonal ("independent"): h^2 = x^2 + y^2
    In general: h^2 = x^2 + y^2 - 2xy cos(H)
[Diagram: triangles illustrating the two cases]
Meaning of variance and covariance
• Variance is always positive.
• If the covariance = 0, x and y are uncorrelated (and, under normality, independent).
• Covariance is a sum of cross-products: it can be positive or negative.
• Negative covariance = deviations in the two distributions are in opposite directions, e.g. genetic covariation.
• Positive covariance = deviations in the two distributions are in the same direction.
• Covariance is a measure of the strength of association.
Covariance and correlation
• Covariance is unit-dependent.
• The coefficient of correlation (r) between x and y is a standardised covariance.
• r is defined by:
    r = cov(x, y) / \sqrt{var(x) var(y)} = cov(x, y) / (SD_x × SD_y)
Positive and negative correlation
[Two scatter plots of y against x: one with r = 0.9 (positive correlation), one with r = -0.9 (negative correlation)]
Test of hypothesis of correlation
• Hypothesis: H0: ρ = 0 versus H1: ρ ≠ 0.
• The standard error of r is:
    SE(r) = \sqrt{(1 - r^2) / (n - 2)}
• The t-statistic:
    t = r \sqrt{(n - 2) / (1 - r^2)}
• This statistic has a t distribution with n - 2 degrees of freedom.
• Fisher's z-transformation:
    z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right)
• Standard error of z:
    SE(z) = 1 / \sqrt{n - 3}
• The 95% CI of z can then be constructed as:
    z ± 1.96 / \sqrt{n - 3}
An illustration of correlation analysis
ID     Age (x)   Cholesterol (y; mg/100ml)
 1        46        3.5
 2        20        1.9
 3        52        4.0
 4        30        2.6
 5        57        4.5
 6        25        3.0
 7        28        2.9
 8        36        3.8
 9        22        2.1
10        43        3.8
11        57        4.1
12        33        3.0
13        22        2.5
14        63        4.6
15        40        3.2
16        48        4.2
17        28        2.3
18        49        4.0
Mean   38.83        3.33
SD     13.60        0.84

Cov(x, y) = 10.68
cov x, y 
10.68
r
SDx  SDy

13.60  0.84
 0.94
1  1  0.94 
z  ln 
  0.56
2  1  0.94 
SE  z  
1
1

 0.26
n3
15
t-statistic = 0.56 / 0.26 = 2.17
Critical t-value with 17 df and alpha = 5% is
2.11
Conclusion: There is a significant association
between age and cholesterol.
Simple linear regression analysis
• Only two variables are of interest: one response variable and one predictor variable
• No adjustment is needed for confounding or covariates
• Assessment:
  – Quantify the relationship between the two variables
• Prediction:
  – Make predictions and validate a test
• Control:
  – Adjust for confounding effects (in the case of multiple variables)
Relationship between age and cholesterol
Linear regression: model
• Y : random variable representing a response (outcome) variable
• X : random variable representing a predictor variable (predictor, risk factor)
  – Both Y and X can be categorical variables (e.g., yes / no) or continuous variables (e.g., age).
  – If Y is categorical, the model is a logistic regression model; if Y is continuous, it is a simple linear regression model.
• Model:
    Y = α + βX + ε
  α : intercept
  β : slope / gradient
  ε : random error (variation between subjects in Y even if X is constant, e.g., variation in cholesterol for patients of the same age)
Linear regression: assumptions
• The relationship is linear in terms of the parameters;
• X is measured without error;
• The values of Y are independent of each other (e.g., Y1 is not correlated with Y2);
• The random error term (ε) is normally distributed with mean 0 and constant variance.
Expected value and variance
• If the assumptions are tenable, then:
• The expected value of Y is: E(Y | x) = α + βx
• The variance of Y is: var(Y) = var(ε) = σ^2
Estimation of model parameters
Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points.

Gradient:
    m = dy/dx = (y2 - y1) / (x2 - x1)
Equation:
    y = mx + a

[Diagram: line through A(x1, y1) and B(x2, y2), showing intercept a, rise dy and run dx]

What happens if we have more than 2 points?
Estimation of a and b
• For a series of pairs: (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
• Let a and b be sample estimates of the parameters α and β.
• We then have a sample equation: Y* = a + bx
• Aim: find the values of a and b so that the differences (Y - Y*) are collectively as small as possible.
• Let SSE = \sum (Y_i - a - bx_i)^2.
• The values of a and b that minimise SSE are called the least squares estimates.
Criteria of estimation
    \hat{y}_i = a + bx_i
    d_i = y_i - \hat{y}_i

[Diagram: scatter plot of Chol against Age with the fitted line; d_i is the vertical distance from the observed y_i to the line]

The goal of the least squares estimator (LSE) is to find a and b such that the sum of the d_i^2 is minimal.
Estimation of a and b
• After some calculus, the results can be shown to be (an R check is sketched below):
    b = S_{xy} / S_{xx}
    a = \bar{y} - b\bar{x}
  where:
    S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
    S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
• When the regression assumptions are valid, the estimators of a and b have the following properties:
  – Unbiased
  – Uniformly minimal variance (i.e., efficient)
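A minimal R sketch of these formulas with the age and cholesterol data (the by-hand estimates should match the coefficients from lm()):

Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))
b   <- Sxy / Sxx
a   <- mean(chol) - b * mean(age)
c(intercept = a, slope = b)
coef(lm(chol ~ age))    # should agree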
Goodness-of-fit
• We now have the fitted equation Y = a + bX + e.
• Question: how well does the regression equation describe the actual data?
• Answer: the coefficient of determination (R^2): the proportion of the variation in Y that is explained by the variation in X.
Partitioning of variations: concept
• SST = sum of squared differences between y_i and the mean of y.
• SSR = sum of squared differences between the predicted values of y and the mean of y.
• SSE = sum of squared differences between the observed and predicted values of y.

SST = SSR + SSE

The coefficient of determination is:

R^2 = SSR / SST
Partitioning of variations: geometry
[Diagram: scatter of Chol (Y) against Age (X) with the fitted line and the mean of Y; SST, SSR and SSE shown as vertical distances]
Partitioning of variations: algebra
• Some statistics (an R check is sketched after this slide):
• Total variation:
    SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
• Attributed to the model:
    SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
• Residual sum of squares:
    SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
• SST = SSR + SSE
• SSR = SST - SSE
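These sums of squares can be computed directly in R; a brief sketch for the cholesterol model (SST should equal SSR + SSE, and SSR/SST should reproduce the R^2 reported later by summary(reg)):

reg  <- lm(chol ~ age)
yhat <- fitted(reg)
SST  <- sum((chol - mean(chol))^2)
SSR  <- sum((yhat - mean(chol))^2)
SSE  <- sum((chol - yhat)^2)
c(SST = SST, SSR = SSR, SSE = SSE)
SSR / SST    # R-squared, about 0.88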
Analysis of variance
• SS increases in proportion to the sample size (n).
• Mean squares (MS): normalise for degrees of freedom (df)
  – MSR = SSR / p (where p = number of predictor terms; here p = 1)
  – MSE = SSE / (n - p - 1)
  – MST = SST / (n - 1)
• Analysis of variance (ANOVA) table:

Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
Regression   p           SSR                   MSR                 MSR/MSE
Residual     n - p - 1   SSE                   MSE
Total        n - 1       SST
Hypothesis tests in regression analysis
• We now have:
    Population:   Y = α + βX + ε
    Sample data:  Y = a + bX + e
• H0: β = 0. There is no linear association between the outcome and the predictor variable.
• In lay language: if there were truly no association, what is the chance of observing sample data at least as inconsistent with the null hypothesis as the data we actually observed?
Inference about slope (parameter b)
• Recall that ε is assumed to be normally distributed with mean 0 and variance σ^2.
• The estimate of σ^2 is the MSE (denoted s^2).
• It can be shown that:
  – The expected value of b is β, i.e. E(b) = β.
  – The standard error of b is: SE(b) = s / \sqrt{S_{xx}}
• The test of whether β = 0 is then: t = b / SE(b), which follows a t-distribution with n - 2 degrees of freedom.
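A sketch of this computation in R (the hand-computed standard error and t-statistic should match the age row of summary(reg)):

reg  <- lm(chol ~ age)
s    <- summary(reg)$sigma            # residual standard error = sqrt(MSE)
Sxx  <- sum((age - mean(age))^2)
se.b <- s / sqrt(Sxx)
b    <- unname(coef(reg)["age"])
c(SE = se.b, t = b / se.b)
2 * pt(-abs(b / se.b), df = length(age) - 2)   # two-sided p-value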
Confidence interval around a predicted value
• The observed value is Y_i.
• The predicted value is \hat{Y}_i = a + bx_i.
• The standard error of the predicted value is:
    SE(\hat{Y}_i) = s \sqrt{1 + \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}}
• The interval estimate for Y_i is:
    \hat{Y}_i ± SE(\hat{Y}_i) × t_{(n - p - 1, 1 - α/2)}
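In practice these intervals are obtained with predict(); a minimal sketch (interval = "confidence" gives the interval for the mean response, interval = "prediction" the wider interval for an individual new observation):

reg <- lm(chol ~ age)
new <- data.frame(age = c(30, 50))
predict(reg, new, interval = "confidence")
predict(reg, new, interval = "prediction")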
Checking assumptions
• Assumption of constant variance
• Assumption of normality
• Correctness of functional form
• Model stability
• All can be conducted with graphical analysis. The
residuals from the model or a function of the residuals
play an important role in all of the model diagnostic
procedures.
Checking assumptions
• Assumption of constant variance
  – Plot the studentized residuals versus the fitted values. Examine whether the variability of the residuals remains relatively constant across the range of fitted values.
• Assumption of normality
  – Plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, they should fall along a 45° line.
• Correct functional form?
  – Plot the residuals versus the fitted values. Examine the plot for evidence of a non-linear trend in the residuals across the range of fitted values.
• Model stability
  – Check whether one or more observations are influential. Use Cook's distance.
Checking assumptions (Cont)
• Cook's distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the ith observation is removed from the data set.
• Leverage is a measure of how extreme the value of x_i is relative to the remaining values of x.
• The studentized residual provides a measure of how extreme the value of y_i is relative to the remaining values of y.
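All three quantities are available directly in R; a brief sketch for the cholesterol model (a common rough rule flags points whose Cook's distance is much larger than the rest):

reg <- lm(chol ~ age)
rstudent(reg)         # studentized residuals
hatvalues(reg)        # leverage
cooks.distance(reg)   # Cook's distance
which(cooks.distance(reg) > 4 / length(age))   # one conventional cut-off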
Remedial measures
• Non-constant variance
  – Transforming the response variable (y) to a new scale (e.g. the logarithm) is often helpful.
  – If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively reweighted least squares.
• Non-normality
  – Non-normality and non-constant variance often go hand-in-hand.
• Outliers
  – Check the data for accuracy.
  – Use a robust estimator.
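As an illustration only (not part of the original analysis), a log-transformed model and a robust fit based on iteratively reweighted least squares could be specified as:

# Log-transformed response
reg.log <- lm(log(chol) ~ age)
summary(reg.log)

# Robust M-estimation (iteratively reweighted least squares) from the MASS package
library(MASS)
reg.rob <- rlm(chol ~ age)
summary(reg.rob)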
Regression analysis using R
id   <- seq(1:18)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
          43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
#Fit linear regression model
reg <- lm(chol ~ age)
summary(reg)
ANOVA result
> anova(reg)
Analysis of Variance Table

Response: chol
          Df  Sum Sq Mean Sq F value    Pr(>F)
age        1 10.4944 10.4944  114.57 1.058e-08 ***
Residuals 16  1.4656  0.0916
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results of R analysis
> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08
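A useful companion to this output is a confidence interval for the estimated coefficients (a brief sketch):

confint(reg)                 # 95% CI for the intercept and the age slope
confint(reg, level = 0.99)   # 99% CI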
Diagnostics: influential data
par(mfrow=c(2,2))
plot(reg)
[Four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 6, 8 and 17 are flagged]
A non-linear illustration: BMI and sexual attractiveness
– Study on 44 university students
– Measure body mass index (BMI)
– Sexual attractiveness (SA) score
id  <- seq(1:44)
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00,
         14.00, 14.00, 14.80, 15.00, 15.00, 15.50, 16.00,
         16.50, 17.00, 17.00, 18.00, 18.00, 19.00, 19.00,
         20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00,
         24.00, 24.50, 25.00, 25.00, 26.00, 26.00, 26.50,
         28.00, 29.00, 31.00, 32.00, 33.00, 34.00, 35.50,
         36.00, 36.00)
sa  <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5,
         3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3,
         6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7,
         3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1,
         2.1, 2.0, 1.8, 1.7)
Linear regression analysis of BMI and SA
reg <- lm(sa ~ bmi)
summary(reg)

Residuals:
     Min       1Q   Median       3Q      Max
-2.54204 -0.97584  0.05082  1.16160  2.70856

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
bmi         -0.05967    0.02862  -2.084   0.0432 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376,    Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF,  p-value: 0.04323
BMI and SA: analysis of residuals
plot(reg)
[Four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage; observations 10, 20 and 21 are flagged]
BMI and SA: a simple plot
par(mfrow=c(1,1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch=16)
abline(reg)
[Scatter plot of sa against bmi with the fitted regression line]
Re-analysis of sexual attractiveness data
# Fit 3 regression models
linear <- lm(sa ~ bmi)
quad   <- lm(sa ~ poly(bmi, 2))
cubic  <- lm(sa ~ poly(bmi, 3))

# Make a new BMI axis for prediction
bmi.new <- 10:40

# Get predicted values
quad.pred  <- predict(quad,  data.frame(bmi = bmi.new))
cubic.pred <- predict(cubic, data.frame(bmi = bmi.new))

# Plot the data and the fitted curves
plot(sa ~ bmi, pch = 16)
abline(linear)
lines(bmi.new, quad.pred,  col = "blue", lwd = 3)
lines(bmi.new, cubic.pred, col = "red",  lwd = 3)
[Scatter plot of sa against bmi with the linear (black), quadratic (blue) and cubic (red) fitted curves]
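One way to compare the three fits formally is with nested-model F-tests; a brief sketch (the adjusted R^2 values give another informal comparison):

anova(linear, quad, cubic)      # sequential F-tests for the added polynomial terms
summary(linear)$adj.r.squared
summary(quad)$adj.r.squared
summary(cubic)$adj.r.squared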
Some comments:
Interpretation of correlation
• Correlation lies between –1 and +1. A very small correlation does not mean that there is no association between the two variables; the relationship may be non-linear.
• For curvilinear relationships, a rank correlation is better than Pearson's correlation.
• A small correlation (e.g. 0.1) may be statistically significant but clinically unimportant.
• R^2 is another measure of the strength of association. An r of 0.7 may sound impressive, but the R^2 is only 0.49!
• Correlation does not mean causation.
Some comments:
Interpretation of correlation
• Be careful with multiple correlations. For p variables, there are p(p – 1)/2 possible pairwise correlations, and false positives are a problem.
• A correlation cannot be inferred from two other correlations (correlation is not transitive).
  – r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not imply that r(age, fat) is near zero.
  – In fact, r(age, fat) = 0.79.
Some comments:
Interpretation of regression
• The fitted (regression) line is only an estimate of the relation between these variables in the population.
• There is uncertainty associated with the estimated parameters.
• The regression line should not be used to make predictions for x values outside the range of values in the observed data.
• A statistical model is an approximation; the "true" relation may be non-linear, but a linear model is often a reasonable approximation.
Some comments:
Reporting results
• Results should be reported in sufficient detail: the nature of the response and predictor variables, any transformation, how the assumptions were checked, etc.
• The regression coefficients (a, b), their associated standard errors, and R^2 are useful summaries.
Some final comments
• Equations are the cornerstone on which the
edifice of science rests.
• Equations are like poems, or even an onion.
• So, be careful with your building of equations!