Experimental design and analysis
Linear regression analysis
Copyright, Gerry Quinn & Mick Keough, 1998. Please do not copy or distribute this file without the authors’ permission.
Peake & Quinn (1993)
Sample of 20 clumps of mussels from a rocky shore at Phillip Island in September 1989:
• What is the relationship between total number
of animals and clump area?
• How much of the variation in number of
animals can be explained by clump area?
• Predict number of animals from clump area
for new clumps.
Scatterplot
[Figure: scatterplot of number of animals (0–2000) against clump area (0–300 cm²)]
Linear regression
• Two continuous variables:
– Dependent (response) variable (DV, y)
– Independent (predictor) variable (IV, x)
– Each recorded for n observations (replicates)
• Usually implies the DV is “dependent” on the IV, i.e. the IV influences the DV
• Does not demonstrate that the IV “causes” the DV
Linear regression
• Determines the functional relationship
between DV and IV
• Calculates how much of the variation in
the DV can be explained by its linear
relationship with the IV
• Predicts new values of the DV from new
values of the IV
– provides precision of those estimates
Peake & Quinn (1993)
• Response (dependent) variable:
– no. animals per clump
• Predictor (independent) variable:
– clump area (cm²)
• Biological rationale:
– clump area could determine no. animals
– no. animals cannot determine clump area?
Regression line
[Figure: scatterplot with the linear regression line, the least squares “line of best fit”; the slope and intercept are labelled; number of animals (0–2000) against clump area (0–300 cm²)]
Regression model
• Mathematical equation for population of
x,y observations
• Explains the DV (y) in terms of IV (x)
• DV (y) = intercept (constant) + slope * IV (x)
• Intercept and slope are parameters we
can estimate
Population regression model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Number of animals = $\beta_0$ + $\beta_1$ * clump area + $\varepsilon_i$
$\beta_0$ is the population intercept
$\beta_1$ is the population slope (change in y per unit change in x)
$\varepsilon_i$ is the variation in y not explained by the regression model
Sample regression equation
$\hat{y}_i = b_0 + b_1 x_i$
$b_0$ is the sample intercept, which estimates $\beta_0$
$b_1$ is the sample slope or regression coefficient, which estimates $\beta_1$
[Figure: fitted line showing the y-intercept ($b_0$) and the regression coefficient (slope $b_1$), the change in y per unit change in x]
Least squares regression
$(y_i - \hat{y}_i)$ = residual
[Figure: observed y ($y_i$) and predicted y ($\hat{y}_i$) at $x_i$; the residual is the vertical gap between the observation and the least squares regression line]
Worked example
Clump   No. (y)   Area (x)
1         216      105.55
2         689       52.12
3          84       42.35
4         404       27.28
5         151       14.56
…           …           …
20         64      697.38
Means    338.95      54.49
Assumptions for linear regression
1. y normally distributed at each value of x:
– Boxplot of y should be symmetrical - watch
out for outliers and skewness
– Transformations often help
Boxplot for number of individuals
[Figure: boxplot of number of individuals; suggests a skewed distribution for number of individuals]
Assumptions (cont.)
2. Variance (spread) of y should be
constant for each value of x
(homogeneity of variance):
– Spread of residuals should be even when
plotted against x or predicted y’s
– Transformations often help
– Transformations which improve normality
of y will also usually make variance of y
more constant.
Residuals
• Difference between observed value and
value predicted or fitted by the model
• Residual for each observation:
– the difference between the observed y and the value of y predicted by the linear regression equation: $(y_i - \hat{y}_i)$
Studentised residuals
• residual / SE of residuals
• follow a t-distribution
• studentised residuals can be compared between different regressions
Observations with a large residual (or studentised residual) are outliers from the fitted model (a sketch of the calculation follows).
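A minimal sketch of this calculation in Python, using the simple definition on this slide (residual divided by the standard error of the residuals); the data arrays are hypothetical stand-ins for a few mussel clumps:

    import numpy as np

    # Hypothetical predictor (clump area) and response (no. animals)
    x = np.array([105.45, 52.12, 42.35, 27.28, 14.56])
    y = np.array([216.0, 689.0, 84.0, 404.0, 151.0])

    # Least squares fit: np.polyfit returns (slope, intercept) for degree 1
    b1, b0 = np.polyfit(x, y, 1)

    # Residuals: observed y minus predicted y
    resid = y - (b0 + b1 * x)

    # Standard error of the residuals, with n - 2 degrees of freedom
    resid_se = np.sqrt(np.sum(resid**2) / (len(y) - 2))

    # Studentised residuals as defined on this slide: residual / SE of residuals
    stud_resid = resid / resid_se
    print(stud_resid)  # large values flag outliers from the fitted model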
Plot residuals against predicted y
[Figure: even spread of y around the line; residuals plotted against predicted y show no pattern, indicating the assumptions are OK]
[Figure: uneven spread of y around the line; increasing, wedge-shaped spread of residuals against predicted y, indicating unequal variance in y and a skewed distribution of y; transformation of y helps]
Assumptions (cont.)
3. Values of y are independent of each other:
– watch out for data which are a time series
on same experimental or sampling units
– should be considered at design stage
Assumptions (cont.)
4. True population relationship between y and x is linear:
– scatterplot of y against x should be
examined
– watch out for asymptotic or exponential
patterns
– transformations of y or y and x often help
Regression diagnostics
• Check fit of model
• Warn about influential observations and
outliers
Influential observations
• Observations with large influence on
estimated slope
• Measured by Cook’s D statistic: observations with D near or greater than 1 should be checked (see the sketch below)
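A sketch of how this check might be run in Python with statsmodels, whose influence diagnostics include Cook’s D; the data arrays are hypothetical:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical predictor (clump area) and response (no. animals)
    x = np.array([105.45, 52.12, 42.35, 27.28, 14.56])
    y = np.array([216.0, 689.0, 84.0, 404.0, 151.0])

    # Ordinary least squares fit with an intercept term
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # Cook's D for each observation
    cooks_d = fit.get_influence().cooks_distance[0]

    # Observations with D near or greater than 1 should be checked
    print(np.where(cooks_d >= 1)[0])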
Outliers
• Observations further from the fitted equation (model) than the rest of the observations
• Large residual
• Can be different from single variable
sample outliers in boxplots
[Figure: scatterplot with fitted line and three labelled observations 1, 2 and 3]
• Observation 1 is an x and y outlier but not influential
• Observation 2 has a large residual - an outlier
• Observation 3 is very influential (large Cook’s D) - also an outlier
Checking model fit and assumptions of mussel clump data
Calculate predicted y-values:
Clump   No.    Pred no.   Resid     Area
1       216    511.76     -295.76   105.45
2       689    331.07      357.93    52.12
3        84    298.04     -214.04    42.35
4       404    247.07      156.93    27.28
etc.    etc.   etc.        etc.      etc.
Influential values
[Figure: no. individuals (0–1200) against area (0–300 cm²); two points marked *]
• Not a good fit to a straight line?
• Clumps 8 and 15 (marked *) had large influence (high Cook’s D)
• These two values may determine the conclusions of the analysis
Residual plot
[Figure: residuals (-400 to 600) against predicted no. individuals (0–1000)]
• Increasing spread with increasing clump area
• Suggests unequal variance in y
• Non-normal y and possibly x?
Log transformed y and x
[Figure: log no. individuals (0–3.5) against log area (0–3, cm²) with fitted line]
• Better fit to a linear relationship (line)
• No outliers or influential values
• Clumps 8 and 15 no longer have large Cook’s D
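As a minimal sketch, only the transformation step changes before refitting (hypothetical raw arrays as in the earlier sketches):

    import numpy as np

    # Hypothetical raw values: clump area and no. individuals
    x = np.array([105.45, 52.12, 42.35, 27.28, 14.56])
    y = np.array([216.0, 689.0, 84.0, 404.0, 151.0])

    # Log10-transform both variables, then refit the least squares line
    log_x, log_y = np.log10(x), np.log10(y)
    b1, b0 = np.polyfit(log_x, log_y, 1)
    print(b1, b0)  # slope and intercept on the log-log scale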
Log transformed y and x
[Figure: residuals (-1 to 1.5) against predicted log no. individuals (1–3)]
• Even spread of residuals
• Suggests assumptions are OK
Calculations for Peake & Quinn data
Slope of least squares line:
$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ = 0.63, which estimates $\beta_1$
Calculations for Peake & Quinn data
Intercept of least squares line:
$b_0 = \bar{y} - b_1 \bar{x}$ = 2.37 - (0.63 * 1.51) = 1.42, which estimates $\beta_0$
[Figure: fitted line with slope = 0.63 and intercept = 1.42; log no. individuals (0–4) against log clump area (0–3)]
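A minimal sketch of the two formulas above; the four log-scale values are taken from the worked table later in this deck, so with only 4 of the 20 clumps the estimates will differ from the full-sample values 0.63 and 1.42:

    import numpy as np

    # Log no. individuals (y) and log clump area (x) for four clumps
    y = np.array([2.334, 2.838, 1.924, 2.606])
    x = np.array([2.023, 1.717, 1.627, 1.436])

    # b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

    # b0 = ybar - b1 * xbar
    b0 = y.mean() - b1 * x.mean()
    print(b1, b0)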
Standard error of $b_1$
Calculate SE of $b_1$:
$SE_{b_1} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2 / (n - 2)}{\sum (x_i - \bar{x})^2}}$ = 0.15
Testing $H_0$: $\beta_1 = 0$
• t = $b_1 / SE_{b_1}$ = 0.63 / 0.15 = 4.32
• Probability of a t-value > +4.32 if $H_0$ is true is < 0.001
• P < 0.05, so reject $H_0$: $\beta_1 = 0$
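A sketch of the same test with scipy, starting from the slope and standard error reported above (scipy.stats.t supplies the tail probability):

    from scipy import stats

    b1, se_b1, n = 0.63, 0.15, 20  # slope, SE and sample size from the slides
    t = b1 / se_b1                 # 4.2 from these rounded values; the slides
                                   # report 4.32 from unrounded estimates

    # Two-tailed P-value for H0: beta1 = 0, with n - 2 = 18 df
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(t, p)  # P < 0.001, so reject H0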
Analysis of variance in y
$\sum (y_i - \bar{y})^2$: total variation (Sum of Squares) in y (SSTotal)
$\sum (\hat{y}_i - \bar{y})^2$: variation in y explained by regression (SSRegression)
$\sum (y_i - \hat{y}_i)^2$: variation in y unexplained by regression (SSResidual)
Unexplained or residual variation
[Figure: two fitted lines, one where the residuals $y_i - \hat{y}_i$ are small and one where they are big]
Explained variation
[Figure: two fitted lines, one where the deviations $\hat{y}_i - \bar{y}$ are big and one where they are small]
Analysis of variance
Source of variation   SS                               df      Variance (= mean square)
Regression            $\sum (\hat{y}_i - \bar{y})^2$   1       SSRegression / 1
Residual              $\sum (y_i - \hat{y}_i)^2$       n - 2   SSResidual / (n - 2)
Null hypothesis
Null hypothesis: $\beta_1 = 0$
F-ratio statistic = MSRegression / MSResidual
If $H_0$ is true, the F-ratio follows the F distribution.
Source                    SS     df   MS
Explained by regression   1.71    1   1.71
Residual                  1.66   18   0.09
Total                     3.37   19
F-ratio = 1.71 / 0.09 = 18.64
Probability of an F-ratio > 18.64 if $H_0$ is true is < 0.001
Reject $H_0$: $\beta_1 = 0$
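The same table can be sketched in a few lines of Python (SS values from this slide; scipy.stats.f gives the upper-tail probability):

    from scipy import stats

    ss_reg, ss_res, n = 1.71, 1.66, 20
    ms_reg = ss_reg / 1        # df = 1
    ms_res = ss_res / (n - 2)  # df = 18

    # F-ratio is about 18.5 from these rounded SS values; the slides report
    # 18.64 from unrounded sums of squares
    f_ratio = ms_reg / ms_res
    p = stats.f.sf(f_ratio, 1, n - 2)
    print(f_ratio, p)  # P < 0.001, so reject H0: beta1 = 0

    # R^2, the coefficient of determination (see the next slide)
    r2 = ss_reg / (ss_reg + ss_res)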
Explained variation
• Proportion of variation in y explained by
linear relationship with x
• Termed $R^2$, the coefficient of determination:
$R^2 = \dfrac{SS_{Regression}}{SS_{Total}}$
Explained variation
• If $R^2$ is close to 1:
– x explains a large proportion of the variation in y
• If $R^2$ is close to 0:
– x explains little of the variation in y
• Measures fit of the linear model to a set of data.
• $R^2$ is simply the square of the correlation coefficient (r) between x and y.
Peake & Quinn (1993)
• $R^2$ = 1.714 / 3.369 = 0.51
• 51% of the variation in log no. individuals is explained by log clump area.
Peake & Quinn (1993)
Parameter   Estimate   SE     t      P
Intercept   1.42       0.23   6.18   <0.001
Slope       0.63       0.15   4.32   <0.001
• Reject $H_0$ that the population slope equals zero
• Also reject $H_0$ of a zero intercept
– less important
Anscombe (1973) data set
[Figure: the four Anscombe scatterplots, with very different patterns of data, each yielding the same fitted line and statistics]
$R^2$ = 0.667, y = 3.0 + 0.5x, t = 4.24, P = 0.002 for each data set
Examples of linear regression analyses
Kobayashi (1993)
• Aust. J. Ecol. 18:231-234
• Individual cladocerans (Daphnia) fed on Anabaena
• Filtering rate (DV)
• Body length (IV)
– log filtering rate = -0.634 + 2.09*log length
– $R^2$ = 0.86, P < 0.001, n = 35
Pickering (1993)
• Aust. J. Ecol. 18:336-334
• Individual Ranunculus (“buttercup”) plants
in Kosciusko N.P.
• No. of floral units per plant (DV)
• No. of leaves (IV)
– log(flowers+1) = -0.877 + 0.649*log(leaves+1)
– $R^2$ = 0.615, P < 0.001
– 95% CI for $b_1$ = 0.531 to 0.767
Correlation vs regression
Use correlation:
• to simply test whether there is a linear
relationship between y1 and y2
Correlation vs regression
Use regression:
• when DV and IV can be clearly
identified
• to quantify linear relationship of y on x
• to determine variation in y explained by
equation
• to predict new values of y
Correlation vs regression
• Correlation between y1 and y2 is the
same as correlation between y2 and y1.
• Slope of regression of y (DV) on x (IV)
is not the same as slope of regression
of x (DV) on y (IV).
Correlation vs regression
• The following tests produce the same P-values (see the sketch below):
– $H_0$: $\rho = 0$
– $H_0$: $\beta_1$ for y on x = 0
– $H_0$: $\beta_1$ for x on y = 0
• But the estimates (r, $b_1$ for y on x, and $b_1$ for x on y) differ
– different hypotheses
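A short sketch illustrating both points with hypothetical paired data: the three tests give the same P-value, but the correlation and the two slopes are different numbers:

    import numpy as np
    from scipy import stats

    # Hypothetical paired measurements
    y1 = np.array([2.3, 2.8, 1.9, 2.6, 2.1, 3.0])
    y2 = np.array([2.0, 1.7, 1.6, 1.4, 1.2, 2.2])

    r, p_corr = stats.pearsonr(y1, y2)  # correlation is symmetric in y1, y2
    fit_12 = stats.linregress(y1, y2)   # regression of y2 on y1
    fit_21 = stats.linregress(y2, y1)   # regression of y1 on y2

    print(p_corr, fit_12.pvalue, fit_21.pvalue)  # identical P-values
    print(r, fit_12.slope, fit_21.slope)         # different estimates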
Model comparisons
ANOVA for regression
Total variation in y (SSTotal) = variation explained by regression with x (SSRegression) + residual variation (SSResidual)
Full model
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
• Unexplained variation in y from full
model = SSResidual
Reduced model ($H_0$ true)
• Reduced model ($H_0$: $\beta_1 = 0$ true):
$y_i = \beta_0 + \varepsilon_i$
• Unexplained variation in y from reduced
model = SSTotal
Model comparison
• Difference in unexplained variation
between full and reduced models:
SSTotal - SSResidual
= SSRegression
• Variation explained by including $b_1$ in the model (see the sketch below)
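A sketch of the comparison with hypothetical data: the reduced model predicts every y by the mean of y, so its unexplained variation is SSTotal, and the difference from the full model’s SSResidual is SSRegression:

    import numpy as np

    # Hypothetical data
    x = np.array([2.02, 1.72, 1.63, 1.44, 1.20, 0.90])
    y = np.array([2.33, 2.84, 1.92, 2.61, 2.10, 1.80])

    # Full model: y = b0 + b1*x + e; unexplained variation is SSResidual
    b1, b0 = np.polyfit(x, y, 1)
    ss_residual = np.sum((y - (b0 + b1 * x)) ** 2)

    # Reduced model (H0: beta1 = 0): y = b0 + e; the fitted value is the
    # mean of y, so unexplained variation is SSTotal
    ss_total = np.sum((y - y.mean()) ** 2)

    # Variation explained by including the slope in the model
    ss_regression = ss_total - ss_residual
    print(ss_regression)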
Spare stuff
Expected mean squares
$MS_{Regression}$ estimates $\sigma^2 + \beta_1^2 \sum (x_i - \bar{x})^2$
$MS_{Residual}$ estimates $\sigma^2$
If $H_0$ is true:
$MS_{Residual}$ and $MS_{Regression}$ both estimate $\sigma^2$ (the variance in y in the population), so their ratio (the F-ratio) should be near 1.
If $H_0$ is false:
$MS_{Regression}$ estimates more than $\sigma^2$ while $MS_{Residual}$ still estimates $\sigma^2$, so their ratio (the F-ratio) should be greater than 1.
Analysis of variance for mussel clump data
Calculate predicted y-values:
Clump   Log no.   Pred log no.   Resid    Log area
1       2.334     2.689          -0.335   2.023
2       2.838     2.497           0.341   1.717
3       1.924     2.440          -0.516   1.627
4       2.606     2.321           0.286   1.436
etc.    etc.      etc.            etc.    etc.
$SS_{Total} = \sum (y_i - \bar{y})^2$
= (2.334 - 2.370)² + (2.838 - 2.370)² + (1.924 - 2.370)² + (2.606 - 2.370)² etc.
= 3.369
$SS_{Residual} = \sum (y_i - \hat{y}_i)^2$
= (-0.335)² + (0.341)² + (-0.516)² + (0.286)² etc. = 1.655
$SS_{Regression} = \sum (\hat{y}_i - \bar{y})^2$
= (2.689 - 2.370)² + (2.497 - 2.370)² + (2.440 - 2.370)² + (2.321 - 2.370)² etc.
= 1.714