Logistic Regression
EPP 245 Statistical Analysis of Laboratory Data
November 2, 2006
Generalized Linear Models
• The type of predictive model one uses
depends on a number of issues; one is the
type of response.
• Measured values, such as the quantity of a protein, age, or weight, can usually be handled with an ordinary linear regression model
• Patient survival, which may be censored,
calls for a different method (next quarter)
• If the response is binary, then we can use logistic regression models
• If the response is a count, we can use
Poisson regression
• Other forms of response can generate
other types of generalized linear models
Generalized Linear Models
• We need a linear predictor of the same form as in linear regression, βx
• In theory, such a linear predictor can generate any type
of number as a prediction, positive, negative, or zero
• We choose a suitable distribution for the type of data we
are predicting (normal for any number, gamma for
positive numbers, binomial for binary responses,
Poisson for counts)
• We create a link function which maps the mean of the
distribution onto the set of all possible linear prediction
results, which is the whole real line (-∞, ∞).
• The inverse of the link function takes the linear predictor
to the actual prediction
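In symbols (a compact restatement of the bullets above, where g is the link function, μ the mean of the response distribution, and βx the linear predictor):

$$g(\mu) = \beta x, \qquad \mu = g^{-1}(\beta x)$$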
• Ordinary linear regression has identity link
(no transformation by the link function) and
uses the normal distribution
• If one is predicting an inherently positive quantity, one may want to use the log link, since e^x is always positive.
• An alternative to using a generalized linear model with a log link is to transform the data using the log, or perhaps the glog, as sketched below. This is a device that works well with measurement data but may not be usable in other cases
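As an illustration of the difference, a minimal Stata sketch (y and x are placeholder variable names, not from the lecture data): the first line models the mean of y on the log scale with a generalized linear model, while the last two lines log-transform the response and use ordinary regression.

glm y x, family(gaussian) link(log)    // GLM: log link, untransformed response
generate logy = log(y)                 // alternative: transform the response
regress logy x                         //   and fit an ordinary linear regression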
[Figure: the log link maps the possible means (0, ∞) onto the linear predictors (-∞, ∞).]
[Figure: the inverse link e^x maps the linear predictors (-∞, ∞) back onto the possible means (0, ∞).]
Logistic Regression
• Suppose we are trying to predict a binary
variable (patient has ovarian cancer or not,
patient is responding to therapy or not)
• We can describe this by a 0/1 variable in which the value 1 is used for one response (patient has ovarian cancer) and 0 for the other (patient does not have ovarian cancer)
• We can then try to predict this response
• For a given patient, a prediction can be
thought of as a kind of probability that the
patient does have ovarian cancer. As
such, the prediction should be between 0
and 1. Thus ordinary linear regression is
not suitable
• The logit transform takes a probability between 0 and 1 and produces a number that can be anything, positive or negative; its inverse takes any such number back to a value between 0 and 1. Thus the logit link is useful for binary data
[Figure: the logit link maps the possible means (0, 1) onto the linear predictors (-∞, ∞).]
[Figure: the inverse logit maps the linear predictors (-∞, ∞) back onto the possible means (0, 1).]
$$\operatorname{logit}(p) = \log\!\left(\frac{p}{1-p}\right); \qquad \text{if } p \to 0 \text{ then } \operatorname{logit}(p) \to -\infty; \qquad \text{if } p \to 1 \text{ then } \operatorname{logit}(p) \to \infty$$

$$\operatorname{logit}^{-1}(x) = \frac{e^x}{1+e^x}; \qquad \text{if } x \to -\infty \text{ then } \operatorname{logit}^{-1}(x) \to 0; \qquad \text{if } x \to \infty \text{ then } \operatorname{logit}^{-1}(x) \to 1$$

$$\operatorname{logit}\!\left(\frac{e^x}{1+e^x}\right)
= \log\!\left(\frac{e^x/(1+e^x)}{1-e^x/(1+e^x)}\right)
= \log\!\left(\frac{e^x/(1+e^x)}{1/(1+e^x)}\right)
= \log\!\left(e^x\right) = x$$
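Stata has these two transformations built in as the functions logit() and invlogit(), so the identities above can be checked directly (a minimal sketch; display simply prints the value of an expression):

display logit(.5)              // log(.5/(1-.5)) = 0
display invlogit(0)            // e^0/(1+e^0) = .5
display logit(invlogit(2))     // the round trip returns 2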
Analyzing Tabular Data with
Logistic Regression
• Response is hypertensive y/n
• Predictors are smoking (y/n), obesity (y/n),
snoring (y/n) [coded as 0/1 for Stata, R
does not care]
• How well can these 3 factors
explain/predict the presence of
hypertension?
• Which are important?
input smoking obesity snoring hyp fw
0 0 0 1 5
1 0 0 1 2
0 1 0 1 1
1 1 0 1 0
0 0 1 1 35
1 0 1 1 13
0 1 1 1 15
1 1 1 1 8
0 0 0 0 55
1 0 0 0 15
0 1 0 0 7
1 1 0 0 2
0 0 1 0 152
1 0 1 0 72
0 1 1 0 36
1 1 1 0 15
end
. do hypertension-in

. input smoking obesity snoring hyp fw

         smoking    obesity    snoring        hyp         fw
  1. 0 0 0 1 5
  2. 1 0 0 1 2
  3. 0 1 0 1 1
  4. 1 1 0 1 0
  5. 0 0 1 1 35
  6. 1 0 1 1 13
  7. 0 1 1 1 15
  8. 1 1 1 1 8
  9. 0 0 0 0 55
 10. 1 0 0 0 15
 11. 0 1 0 0 7
 12. 1 1 0 0 2
 13. 0 0 1 0 152
 14. 1 0 1 0 72
 15. 0 1 1 0 36
 16. 1 1 1 0 15
 17. end

.
end of do-file
. logistic hyp smoking obesity snoring [fweight=fw]

Logistic regression                               Number of obs   =        433
                                                  LR chi2(3)      =      12.51
                                                  Prob > chi2     =     0.0058
Log likelihood = -199.4582                        Pseudo R2       =     0.0304

------------------------------------------------------------------------------
         hyp | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     smoking |   .9344708   .2598989    -0.24   0.807     .5417838    1.611779
     obesity |    2.00433   .5714045     2.44   0.015     1.146316    3.504564
     snoring |   2.391544    .950815     2.19   0.028     1.097143    5.213072
------------------------------------------------------------------------------

. logit

Logistic regression                               Number of obs   =        433
                                                  LR chi2(3)      =      12.51
                                                  Prob > chi2     =     0.0058
Log likelihood = -199.4582                        Pseudo R2       =     0.0304

------------------------------------------------------------------------------
         hyp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     smoking |  -.0677749   .2781242    -0.24   0.807    -.6128882    .4773385
     obesity |   .6953096   .2850851     2.44   0.015      .136553    1.254066
     snoring |   .8719393   .3975737     2.19   0.028     .0927093    1.651169
       _cons |  -2.377661   .3801845    -6.25   0.000    -3.122809   -1.632513
------------------------------------------------------------------------------
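Standard post-estimation commands can be used to see what this model implies (a minimal sketch using commands available after logit or logistic; phat is an illustrative new variable name):

predict phat, pr                            // fitted probability of hypertension for each person
display exp(_b[obesity])                    // exponentiated logit coefficient = 2.00, the odds ratio above
display invlogit(_b[_cons] + _b[snoring])   // fitted risk for a nonsmoking, nonobese snorer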
Juul's IGF data
Description:
The 'juul' data frame has 1339 rows and 6 columns. It contains a
reference sample of the distribution of insulin-like growth factor
(IGF-I), one observation per subject in various ages with the bulk
of the data collected in connection with school physical
examinations.
Variables:
age a numeric vector (years).
menarche a numeric vector. Has menarche occurred (code 1: no, 2: yes)?
sex a numeric vector (1: boy, 2: girl).
igf1 a numeric vector. Insulin-like growth factor (µg/l).
tanner a numeric vector. Codes 1-5: stages of puberty according to Tanner.
testvol a numeric vector. Testicular volume (ml).
. clear

. insheet using "C:\TD\CLASS\K30-2006\juul.csv"

. summarize age menarch sex igf1 tanner testvol

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |      1334    15.09535    11.25288        .17         83
    menarche |       704    1.475852    .4997716          1          2
         sex |      1334    1.534483    .4989966          1          2
        igf1 |      1018     340.168    171.0356         25        915
      tanner |      1099    2.639672     1.76314          1          5
-------------+--------------------------------------------------------
     testvol |       480    7.895833    8.212571          1         30

. keep if age > 8
(237 observations deleted)

. keep if age < 20
(153 observations deleted)
. summarize age menarch sex igf1 tanner testvol

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       949    13.36911    3.238077       8.01      19.87
    menarche |       519    1.506744    .5004369          1          2
         sex |       949    1.553214    .4974224          1          2
        igf1 |       737    397.4627    161.1272         71        915
      tanner |       863     3.01854    1.740637          1          5
-------------+--------------------------------------------------------
     testvol |       401    9.164589    8.331137          1         30

. keep if menarche < 100
(430 observations deleted)

. summarize age menarch sex igf1 tanner testvol

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       519    13.43778    3.227661       8.03      19.75
    menarche |       519    1.506744    .5004369          1          2
         sex |       519           2           0          2          2
        igf1 |       411    414.0803    160.9518         95        914
      tanner |       436    3.307339    1.730601          1          5
-------------+--------------------------------------------------------
     testvol |         0           .
. generate men1 = menarche - 1

. logistic men1 age

Logistic regression                               Number of obs   =        519
                                                  LR chi2(1)      =     518.73
                                                  Prob > chi2     =     0.0000
Log likelihood = -100.33214                       Pseudo R2       =     0.7211

------------------------------------------------------------------------------
        men1 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   4.559845   .7038612     9.83   0.000     3.369442    6.170811
------------------------------------------------------------------------------

. logistic men1 age tanner

Logistic regression                               Number of obs   =        436
                                                  LR chi2(2)      =     493.02
                                                  Prob > chi2     =     0.0000
Log likelihood = -55.587542                       Pseudo R2       =     0.8160

------------------------------------------------------------------------------
        men1 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   2.266544   .4811877     3.85   0.000     1.495044    3.436168
      tanner |   5.616052   1.760354     5.51   0.000     3.038236    10.38104
------------------------------------------------------------------------------

.
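One way to read the age-only model: on the logit scale, the fitted probability of menarche reaches 1/2 where the linear predictor is zero. A sketch using Stata's stored coefficients (run immediately after refitting the model with logit, so that _b[_cons] and _b[age] are available):

quietly logit men1 age
display exp(_b[age])           // odds ratio per year of age, the 4.56 reported above
display -_b[_cons]/_b[age]     // age at which the fitted probability of menarche is .5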
. generate tan1 = tanner == 1

. generate tan2 = tanner == 2

. generate tan3 = tanner == 3

. generate tan4 = tanner == 4

. generate tan5 = tanner == 5

. logistic men1 age tan2 tan3 tan4 tan5

Logistic regression                               Number of obs   =        519
                                                  LR chi2(5)      =     568.74
                                                  Prob > chi2     =     0.0000
Log likelihood = -75.327218                       Pseudo R2       =     0.7906

------------------------------------------------------------------------------
        men1 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.944062   .7162327     7.56   0.000     2.762915    5.630151
        tan2 |   .0444044   .0486937    -2.84   0.005     .0051761    .3809341
        tan3 |   .1369598    .095596    -2.85   0.004     .0348712    .5379227
        tan4 |   .6969611   .3898228    -0.65   0.519     .2328715    2.085935
        tan5 |   9.169558   7.638664     2.66   0.008     1.791671     46.9287
------------------------------------------------------------------------------

.
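The tan2-tan5 indicators can also be created automatically. In the Stata of this era the xi: prefix expands a categorical predictor into dummies, and later releases accept factor-variable notation directly (a sketch only; note that observations with missing tanner are handled differently than with hand-made 0/1 dummies, which code missing as 0):

xi: logistic men1 age i.tanner     // creates _Itanner_2 ... _Itanner_5 on the fly, stage 1 as reference
* in Stata 11 or later, the same idea without xi:
* logistic men1 age i.tanner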
Class prediction from expression
arrays
• One common use of omics data is to try to
develop predictions for classes of patients,
such as
– cancer/normal
– type of tumor
– grading or staging of tumors
– many other disease/healthy distinctions or diagnoses of disease type
Two-class prediction
• Linear regression
• Logistic regression
• Linear or quadratic discriminant analysis
• Partial least squares
• Fuzzy neural nets estimated by genetic algorithms and other buzzwords
• Many such methods require fewer variables than cases, so dimension reduction is needed
Dimension Reduction
• Suppose we have 20,000 variables and
wish to predict whether a patient has
ovarian cancer or not and suppose we
have 50 cases and 50 controls
• We can only use a number of predictors
much smaller than 50
• How do we do this?
• Two distinct ways are selection of genes
and selection of “supergenes” as linear
combinations
• We can choose the genes with the most
significant t-tests or other individual gene
criteria
• We can use forward stepwise logistic
regression, which adds the most
significant gene, then the most significant
addition, and so on, or other ways of
picking the best subset of genes
Supergenes are linear combinations of genes. If g1, g2, g3, …, gp are the expression measurements for the p genes on an array, and a1, a2, a3, …, ap are a set of coefficients, then a1g1 + a2g2 + a3g3 + … + apgp is a supergene. Methods for construction of supergenes include PCA and PLS
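In Stata, principal-component supergenes could be built roughly like this (a hedged sketch: gene1-gene50 stands for a hypothetical block of consecutively stored expression variables, cancer for a 0/1 class label, and in practice the gene list would already have been screened so it is not too large relative to the number of arrays):

pca gene1-gene50, components(5)        // first 5 principal components of the genes
predict pc1 pc2 pc3 pc4 pc5, score     // the component scores are the "supergenes"
logistic cancer pc1 pc2 pc3 pc4 pc5    // use the supergenes as predictors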
Choosing Subsets of Supergenes
• Suppose we have 50 cases and 50
controls and an array of 20,000 gene
expression values for each of the 100
observations
• In general, any arbitrary set of 100 genes
will be able to predict perfectly in the data
if a logistic regression is fit to the 100
genes
• Most of these will predict poorly in future
samples
• This is a mathematical fact
• A statistical fact is that even if there is no
association at all between any gene and
the disease, often a few genes will
produce apparently excellent results that will not generalize at all
• We must somehow account for this, and
cross validation is the usual way
Consequences of many variables
• Even if no variable has any effect on the classification, the number of cases correctly classified in the sample used to derive the classifier still increases as the number of variables increases
• But the statistical significance is usually
not there
• If the variables used are selected from
many, the apparent statistical significance
and the apparent success in classification are greatly inflated, causing end-stage delusionary behavior in the investigator
• This problem can be improved using cross
validation or other resampling methods
Overfitting
• When we fit a statistical model to data, we
adjust the parameters so that the fit is as
good as possible and the errors are as
small as possible
• Once we have done so, the model may fit well, but we don’t have an unbiased estimate of how well it fits if we use the same data to assess the fit that we used to fit the model
Training and Test Data
• One way to approach this problem is to fit the
model on one dataset (say half the data) and
assess the fit on another
• This avoids bias but is inefficient, since we can
only use perhaps half the data for fitting
• We can get more out of the data by doing this twice, with each half serving as the training set once and the test set once
• This is two-fold cross validation
• It may be more efficient to use 5-, 10-, or 20-fold cross-validation, depending on the size of the data set
• Leave-one-out cross-validation is also popular, especially with small data sets
• With 10-fold CV, one can divide the set
into 10 parts, pick random subsets of size
1/10, or repeatedly divide the data
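A manual 10-fold cross-validation might look roughly like the following do-file sketch (illustrative only: y and x1-x3 are placeholder variable names, runiform() requires a reasonably recent Stata, and newer releases also offer helpers such as splitsample):

set seed 12345
generate u    = runiform()        // one uniform random number per observation
generate fold = ceil(10*u)        // assign each observation to one of 10 folds
generate phat = .                 // will hold out-of-sample predicted probabilities
forvalues k = 1/10 {
    quietly logit y x1 x2 x3 if fold != `k'    // fit on the other nine folds
    quietly predict ptmp if fold == `k', pr    // predict only the held-out fold
    quietly replace phat = ptmp if fold == `k'
    drop ptmp
}
generate yhat = (phat > .5) if phat < .        // classify at the 0.5 cutoff
tabulate y yhat                                // cross-validated classification table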
Stepwise Logistic Regression
• Another way to select variables is
stepwise
• This can be better than individual variable selection, which may choose many highly correlated predictors that are redundant
• A generic prefix command, stepwise, can be used with many kinds of estimation commands in Stata (see the sketch below)
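For the hypertension data above, forward selection with stepwise might look like this (a sketch; the 0.05 entry threshold is an illustrative choice, and pe() sets the p-value required for a variable to enter):

stepwise, pe(.05): logistic hyp smoking obesity snoring [fweight=fw]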