GLMs in R - Department of Zoology, UBC


Workshop in R & GLMs: #3
Diane Srivastava
University of British Columbia
[email protected]
Housekeeping
• ls() asks what variables are in the global environment
• rm(list=ls()) gets rid of EVERY variable
• q() quit; you get a prompt to save the workspace or not
hard~dens
[Figure: lm diagnostic plots — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance; observations 31, 32, and 36 flagged]
hard^0.45~dens
log(hard)~dens
[Figure: lm diagnostic plots for the transformed models — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance; observations 2, 3, 13, 22, and 35 flagged]
Janka exercise
Conclusion:
The best y transformation for optimizing model fit (highest log-likelihood)…
…is not the best y transformation for normal residuals.
This workshop
• Linear, general linear, and generalized linear models
• Understand how GLMs work [Excel simulation]
• Definitions: e.g. deviance, link functions
• Poisson GLMs [R exercise]
• Binomial distribution and logistic regression
• Fit GLMs in R! [Exercise]
In the beginning there were…
Linear models: a normally-distributed y fit to a continuous x
But wait…couldn't we just code a categorical variable to be continuous?

  Y     x
  1.2   0
  1.3   0
  1.1   1
  0.9   1
Then there were…
General Linear Models: a normally-distributed y fit to a continuous OR categorical x
But wait…why do we force our data to be normal when often it isn't?
Generalized linear models
No more need for tedious transformations!
All variances are unequal, but some are more unequal than others…
Distribution solution!
Because most things in life aren't normal!
What linear models do:
[Figure: Y vs X, then log Y vs X with a fitted line]
1. Transform y
2. Fit a line to transformed y
3. Back-transform to linear y
What GLMs do:
[Figure: fitted line on the log scale, back-transformed into a curve against Y vs X]
1. Start with an arbitrary fitted line
2. Back-transform the line into linear space
3. Calculate residuals
4. Improve the fitted line to maximize likelihood (many iterations)
Maximum likelihood
• An iterative process is used to find the model equation that has the highest probability (likelihood) of explaining the y values given the x values.
• The equation for the likelihood depends on the error distribution chosen.
• Least squares, by contrast, minimizes variation from the model.
• If the data are normally distributed, maximum likelihood gives the same answer as least squares.
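That last point can be checked numerically: with a fixed normal error variance, the log-likelihood is a constant minus the sum of squares divided by 2σ², so the slope that maximizes one minimizes the other. A sketch of the arithmetic in Python (the data values are invented for illustration; the workshop itself uses R):

```python
import math

# Invented illustrative data (not from the workshop files)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

def sse(slope, intercept):
    """Sum of squared residuals: the least-squares criterion."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

def normal_loglik(slope, intercept, sigma=1.0):
    """Normal log-likelihood of the data given the fitted line."""
    n = len(x)
    return (-n / 2 * math.log(2 * math.pi * sigma ** 2)
            - sse(slope, intercept) / (2 * sigma ** 2))

# Grid-search candidate slopes (intercept fixed at 0 for simplicity)
slopes = [i / 100 for i in range(100, 300)]
best_ls = min(slopes, key=lambda m: sse(m, 0.0))
best_ml = max(slopes, key=lambda m: normal_loglik(m, 0.0))
print(best_ls == best_ml)  # the same slope wins under both criteria
```

Because the log-likelihood is a strictly decreasing function of the sum of squares here, the two searches must agree; this equivalence breaks as soon as a non-normal error distribution is chosen.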
GLM simulation exercise
• Simulates fitting a model with normal errors and a log link to data.
• Your task:
(1) understand how the spreadsheet works
(2) find the best slope through an iterative process
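The spreadsheet's loop can be sketched in a few lines of Python (the data and the slope grid are hypothetical): each candidate slope is back-transformed through exp(), residuals are taken in linear space, and the slope with the best fit is kept. With the normal error variance held fixed, the best fit (lowest sum of squares) is also the maximum-likelihood fit.

```python
import math

# Hypothetical data: y grows roughly exponentially with x
x = [0, 1, 2, 3, 4]
y = [1.1, 2.6, 7.9, 19.5, 56.0]

def fit_score(slope, intercept=0.0):
    """Sum of squared residuals in *linear* space, after
    back-transforming the fitted line through the inverse link (exp)."""
    fitted = [math.exp(slope * xi + intercept) for xi in x]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Iterate over candidate slopes, keeping the best so far
best_slope, best_score = None, float("inf")
for step in range(50, 151):          # slopes 0.50 .. 1.50
    slope = step / 100
    score = fit_score(slope)
    if score < best_score:
        best_slope, best_score = slope, score

print(best_slope)
```

In the Excel exercise the updating is done by hand rather than in a loop, but the logic per iteration is the same.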
Generalized linear models
In least squares, we fit:
y = mx + b + error
In a GLM, the model is fit more indirectly:
y = g(mx + b + error)
where g is a function, the inverse of which is called the "link function":
link fn(expected y) = mx + b
LMs vs GLMs
LMs:
• Use least squares
• Assume normality
• Based on sums of squares
• Fit the model to transformed y
GLMs:
• Use maximum likelihood
• Let you specify one of several distributions
• Based on deviance
• Fit the model to untransformed y by means of a link function
All that really matters…
• By using a log link function, we never need to calculate log(0).
• Be careful! A log-link model predicts log y, not y!
• The error distribution need not be normal: Poisson, binomial, gamma, Gaussian (= normal)
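The middle warning matters whenever you use the model's coefficients: with a log link the linear predictor lives on the log scale, so the expected y must be recovered with exp(). A sketch of the arithmetic in Python, with invented coefficients:

```python
import math

# Invented coefficients from a hypothetical log-link model:
# log(expected y) = b + m * x
b, m = 0.5, 0.3

x = 2.0
linear_predictor = b + m * x             # this is log(expected y), NOT y
expected_y = math.exp(linear_predictor)  # back-transform to the data scale

# Note the model never takes log(y) itself, so y = 0 causes no trouble:
y_observed = 0.0
residual = y_observed - expected_y       # residuals live in linear space
```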
Exercise
1. Open up the file Rlecture.csv:
diane <- read.table(file.choose(), sep=",", header=TRUE)
2. Look at the dataframe. Make treat a factor ("treat")
3. Fit this model:
my.first.glm <- glm(growth ~ size*treat, family=poisson(link=log), data=diane); summary(my.first.glm)
4. Model diagnostics:
par(mfrow=c(2,2)); plot(my.first.glm)
Overdispersion
[Figure: point patterns illustrating underdispersed, overdispersed, and random data]
Overdispersion
Is your residual deviance = residual df (approx.)?
If residual deviance >> residual df, the model is overdispersed.
If residual deviance << residual df, the model is underdispersed.
Solution:
second.glm <- glm(growth ~ size*treat, family=quasipoisson(link=log), data=diane); summary(second.glm)
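The rule of thumb above amounts to one division: residual deviance over residual degrees of freedom (R's summary() reports both), compared to 1. A minimal sketch in Python; the deviance/df numbers and the tolerance of 1.5 are invented for illustration:

```python
def dispersion_check(residual_deviance, residual_df, tolerance=1.5):
    """Classify a Poisson (or binomial) GLM fit by its dispersion ratio.
    Ratio near 1 is fine; >> 1 overdispersed; << 1 underdispersed."""
    ratio = residual_deviance / residual_df
    if ratio > tolerance:
        return "overdispersed"      # consider family = quasipoisson
    if ratio < 1 / tolerance:
        return "underdispersed"
    return "ok"

# Invented example values: deviance far exceeds df
print(dispersion_check(212.8, 46))  # prints "overdispersed"
```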
Options

family     default link   other links
binomial   logit          probit, cloglog
gaussian   identity       --
Gamma      inverse        identity, log
poisson    log            identity, sqrt
Rlecture.csv
[Figure: three scatterplots of the Rlecture.csv data — Growth vs Size, Survival vs Size, and Parasitism (%) vs Size]
Binomial errors
• Variance is constrained near the limits; the binomial distribution accounts for this.
• Type 1: the classic example — a series of trials, each resulting in success (value = 1) or failure (value = 0).
• Type 2: continuous but bounded data (e.g. % mortality, bounded between 0% and 100%).
Logistic regression
• Least squares: arcsine transformations
• GLMs: use a logit (or probit) link with binomial errors
[Figure: sigmoid curve of y (between 0 and 1) against x]
Logit
p = proportion of successes
If p = e^(ax+b) / (1 + e^(ax+b)), calculate:
log_e(p / (1 − p)) = ax + b
Logits continued
Output from a logistic regression with a logit link: predicted log_e(p / (1 − p)) = ax + b
To obtain any expected value of p, substitute a and b into the original equation:
p = e^(ax+b) / (1 + e^(ax+b))
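The two formulas are inverses of each other, which a quick numeric check confirms. A sketch in Python with invented coefficients a and b (a real analysis would take them from the R model summary):

```python
import math

a, b = 1.2, -0.4   # invented logistic-regression coefficients

def inverse_logit(x):
    """Expected proportion p from the linear predictor ax + b."""
    eta = a * x + b
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """The link function: log-odds of p."""
    return math.log(p / (1 - p))

x = 0.7
p = inverse_logit(x)   # a proportion strictly between 0 and 1
round_trip = logit(p)  # recovers the linear predictor ax + b
```

Note also that logit(0.5) = 0, which is the key to the 50%-parasitism question in the homework.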
Binomial GLMs
Type 1 binomial
• Simply set family = binomial(link=logit)
Type 2 binomial
• First create a vector of % not parasitized.
• Then cbind into a matrix: (% parasitized, % not parasitized)
• Then run your binomial glm (link = logit) with the matrix as your y.
Homework
1. Fit the binomial glm: survival = size*treat
2. Fit the binomial glm: parasitism = size*treat
3. Predict what size has 50% parasitism in treatment "0"