Logistic Regression Model


Logistic Regression
Applied Linear Statistical Models, by Neter et al.
Categorical Data Analysis, by Agresti
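Logistic Regression
A nonlinear model for use when the response variable is qualitative
• Two possible outcomes: success or failure, diseased or not diseased, present or absent
• Example: CAD (cardiovascular disease) as a function of age, weight, sex, smoking history, and blood pressure
• Smoker or nonsmoker as a function of family history, peer-group behavior, income, and age
• Buying a car this year as a function of income, age of the current car, and age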

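Response function for a binary outcome
Special problems when the response is binary

• Constraints on the response function: the mean response is a probability, so it must lie between 0 and 1
• Non-normal error terms: the error can take only two values, one when Y = 1 and one when Y = 0
• Nonconstant error variance: the error variance depends on X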
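The formulas for this slide did not survive extraction. As a hedged reconstruction, assuming the simple linear probability model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ with $Y_i \in \{0, 1\}$ and $\pi_i = P(Y_i = 1)$ (the notation is an assumption, following the usual textbook treatment), the three problems can be written as:

```latex
\begin{align*}
\text{Constraint:}\quad & 0 \le E\{Y_i\} = \pi_i = \beta_0 + \beta_1 X_i \le 1 \\
\text{Error terms:}\quad & \varepsilon_i = 1 - \beta_0 - \beta_1 X_i \ \text{ when } Y_i = 1, \qquad
                           \varepsilon_i = -\beta_0 - \beta_1 X_i \ \text{ when } Y_i = 0 \\
\text{Error variance:}\quad & \sigma^2\{\varepsilon_i\} = \pi_i (1 - \pi_i)
                           = (\beta_0 + \beta_1 X_i)(1 - \beta_0 - \beta_1 X_i)
\end{align*}
```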

Logistic response function
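The slide's formula is missing from the transcript. The standard form of the logistic response function for a single predictor, written in assumed notation ($\pi(X)$ for the success probability, $\beta_0$ and $\beta_1$ for the coefficients), is:

```latex
\begin{align*}
E\{Y\} = \pi(X) &= \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)} \\
\operatorname{logit}\,\pi(X) = \log\frac{\pi(X)}{1 - \pi(X)} &= \beta_0 + \beta_1 X
\end{align*}
```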
Example of a logistic response function

[Figure: probability of CAD (y-axis) versus age (x-axis)]
Properties of the logistic response function
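The list of properties was not preserved. A small sketch that checks the usual ones numerically (bounded between 0 and 1, monotone in X, linear on the logit scale, equal to 0.5 at X = −β0/β1). The coefficient values below are made up for illustration and are not estimates from these slides.

```r
# Logistic response function; beta0 and beta1 are illustrative values only
logistic <- function(x, beta0, beta1) 1 / (1 + exp(-(beta0 + beta1 * x)))

beta0 <- -5
beta1 <- 0.1
x <- seq(0, 100, by = 10)
p <- logistic(x, beta0, beta1)

# 1. Bounded: every value lies strictly between 0 and 1
bounded <- all(p > 0 & p < 1)

# 2. Monotone: increasing in x because beta1 > 0
monotone <- all(diff(p) > 0)

# 3. Linear on the logit scale: log(p / (1 - p)) = beta0 + beta1 * x
logit_p <- log(p / (1 - p))
linear_logit <- isTRUE(all.equal(logit_p, beta0 + beta1 * x))

# 4. The curve passes through 0.5 at x = -beta0 / beta1
midpoint <- logistic(-beta0 / beta1, beta0, beta1)

# Base R's plogis() computes the same function
same_as_plogis <- isTRUE(all.equal(p, plogis(beta0 + beta1 * x)))

print(c(bounded = bounded, monotone = monotone,
        linear_logit = linear_logit, same_as_plogis = same_as_plogis))
print(midpoint)
```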
Likelihood function
Likelihood for multiple logistic regression
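The likelihood formulas are absent from the transcript. For independent binary responses with $\pi_i = \exp(\mathbf{x}_i'\boldsymbol\beta) / (1 + \exp(\mathbf{x}_i'\boldsymbol\beta))$ (assumed notation), the standard Bernoulli likelihood and its logarithm are:

```latex
\begin{align*}
L(\boldsymbol\beta) &= \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i} \\
\log L(\boldsymbol\beta) &= \sum_{i=1}^{n} \Bigl[ Y_i\, \mathbf{x}_i'\boldsymbol\beta
      - \log\bigl(1 + \exp(\mathbf{x}_i'\boldsymbol\beta)\bigr) \Bigr]
\end{align*}
```

Setting the derivative with respect to $\boldsymbol\beta$ to zero gives the likelihood equations $\sum_i (Y_i - \pi_i)\,\mathbf{x}_i = \mathbf{0}$.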
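Solving the likelihood equations

There is no closed-form solution; the equations are solved numerically with the Newton-Raphson algorithm, which for this model can be carried out as iteratively reweighted least squares (IRLS).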
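A minimal sketch of IRLS, assuming the kyphosis data from the rpart package (the data set used later in these slides), a starting vector of zeros, and illustrative variable names. It is not glm's exact implementation, only the same update written out by hand.

```r
library(rpart)  # provides the kyphosis data frame

X <- model.matrix(~ Age + Number + Start, data = kyphosis)
y <- as.numeric(kyphosis$Kyphosis == "present")  # 1 = present, 0 = absent

beta <- rep(0, ncol(X))  # starting values (an assumption; glm starts differently)
for (iter in 1:25) {
  eta <- drop(X %*% beta)      # linear predictor
  p   <- plogis(eta)           # fitted probabilities
  w   <- p * (1 - p)           # IRLS weights
  z   <- eta + (y - p) / w     # working response
  # Weighted least squares step: solve (X'WX) beta = X'Wz
  beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}
names(beta) <- colnames(X)

# Compare with glm(), which uses the same kind of iteration internally
fit <- glm(Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)
print(cbind(irls = beta, glm = coef(fit)))
```

Each pass is one Newton-Raphson step written as a weighted regression of the working response z on X.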
Interpretation of logistic regression coefficients
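The slide body is missing. The usual interpretation, in assumed notation, is through the odds: a one-unit increase in $X$ multiplies the odds of success by $\exp(\beta_1)$.

```latex
\begin{align*}
\text{odds}(X) &= \frac{\pi(X)}{1 - \pi(X)} = \exp(\beta_0 + \beta_1 X) \\
\frac{\text{odds}(X + 1)}{\text{odds}(X)} &= \exp(\beta_1)
\end{align*}
```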
kyphosis {rpart}: 81 rows and 4 columns
• Kyphosis: a factor with levels absent and present, indicating whether a kyphosis (a type of deformation) was present after the operation
• Age: in months
• Number: the number of vertebrae involved
• Start: the number of the first (topmost) vertebra operated on

some(kyphosis)
   Kyphosis Age Number Start
12   absent 148      3    16
18   absent 175      5    13
32   absent 125      2    11
40  present  91      5    12
50   absent 177      2    14
51   absent  68      5    10
52   absent   9      2    17
70   absent  15      5    16
79   absent 120      2    13
81   absent  36      4    13


summary(kyphosis)
    Kyphosis       Age             Number           Start
 absent :64   Min.   :  1.00   Min.   : 2.000   Min.   : 1.00
 present:17   1st Qu.: 26.00   1st Qu.: 3.000   1st Qu.: 9.00
              Median : 87.00   Median : 4.000   Median :13.00
              Mean   : 83.65   Mean   : 4.049   Mean   :11.49
              3rd Qu.:130.00   3rd Qu.: 5.000   3rd Qu.:16.00
              Max.   :206.00   Max.   :10.000   Max.   :18.00
plot(kyphosis)
Boxplots of the predictors vs. Kyphosis

[Figure: x-axis is Kyphosis (absent/present); y-axes are Age, Number, and Start]
boxplot(Age~Kyphosis,data=kyphosis)
Generalized linear model (GLM) fit
kyph.glm<-glm(Kyphosis~Age+Number+Start,family=binomial,data=kyphosis)
summary(kyph.glm)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3124  -0.5484  -0.3632  -0.1659   2.1613

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.036934   1.449575  -1.405  0.15996
Age          0.010930   0.006446   1.696  0.08996 .
Number       0.410601   0.224861   1.826  0.06785 .
Start       -0.206510   0.067699  -3.050  0.00229 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83.234  on 80  degrees of freedom
Residual deviance: 61.380  on 77  degrees of freedom
AIC: 69.38

Number of Fisher Scoring iterations: 5
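To read the fitted coefficients on the odds scale, one option is to exponentiate them. The sketch below refits the same model (the object name `fit` is illustrative) and adds Wald intervals via `confint.default()`; the choice of Wald rather than profile-likelihood intervals is an assumption made for simplicity.

```r
library(rpart)  # provides the kyphosis data frame

fit <- glm(Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)

# Odds ratios: multiplicative change in the odds of kyphosis per one-unit increase
odds_ratio <- exp(coef(fit))

# Wald 95% confidence intervals, transformed to the odds-ratio scale
ci <- exp(confint.default(fit))

print(round(cbind(odds_ratio, ci), 3))
```

With the estimates shown above, the Start coefficient of −0.2065 corresponds to an odds ratio of roughly 0.81 per vertebra, and the Number coefficient of 0.4106 to roughly 1.51 per additional vertebra involved.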
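Residuals
Model deviance
The deviance of a fitted model is −2 times the log of the ratio of the fitted model's likelihood to the saturated model's likelihood, that is, −2 × (log-likelihood of the fitted model − log-likelihood of the saturated model).
• For ungrouped binary data the log-likelihood of the saturated model is 0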
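A short check of this definition on the kyphosis model, assuming the rpart data and illustrative object names. Because the saturated log-likelihood is 0 for 0/1 responses, the deviance reduces to −2 times the fitted log-likelihood.

```r
library(rpart)  # provides the kyphosis data frame

fit <- glm(Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)
y <- as.numeric(kyphosis$Kyphosis == "present")
p <- fitted(fit)

# Bernoulli log-likelihood of the fitted model
loglik_fitted <- sum(y * log(p) + (1 - y) * log(1 - p))
loglik_saturated <- 0  # each observation is fitted perfectly

dev_manual <- -2 * (loglik_fitted - loglik_saturated)

# Null deviance: same formula with the intercept-only fit p = mean(y)
p0 <- mean(y)
null_manual <- -2 * sum(y * log(p0) + (1 - y) * log(1 - p0))

print(c(residual = dev_manual, glm_residual = deviance(fit),
        null = null_manual, glm_null = fit$null.deviance))
```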

Covariance matrix
x<-model.matrix(kyph.glm)
fi<-fitted(kyph.glm)
xvx<-t(x)%*%diag(fi*(1-fi))%*%x
xvx
            (Intercept)         Age     Number      Start
(Intercept)     9.62034    907.8886   43.67401   86.49843
Age           907.88858 114049.8138 3904.31285 9013.14288
Number         43.67401   3904.3128  219.95349  378.82840
Start          86.49843   9013.1429  378.82840 1024.07295
xvxi<-solve(xvx)
xvxi
             (Intercept)           Age        Number         Start
(Intercept)  2.101403767 -4.332171e-03 -0.2764671477 -0.0370950478
Age         -0.004332171  4.155738e-05  0.0003368973 -0.0001244667
Number      -0.276467148  3.368973e-04  0.0505664451  0.0016809971
Start       -0.037095048 -1.244667e-04  0.0016809971  0.0045833546
sqrt(diag(xvxi))
(Intercept)         Age      Number       Start
1.449621939 0.006446501 0.224869840 0.067700477
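The hand-computed standard errors agree with the glm summary to about four significant digits. A sketch comparing the inverse information matrix with `vcov()`, using illustrative object names; the small remaining gap is most likely because glm reports the matrix from the weights of its last IRLS step rather than from the final fitted values.

```r
library(rpart)  # provides the kyphosis data frame

fit <- glm(Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)
x <- model.matrix(fit)
p <- fitted(fit)

# (X' V X)^{-1} with V = diag(p * (1 - p)); multiplying rows avoids the n x n matrix
V_manual <- solve(t(x) %*% (p * (1 - p) * x))

se_manual <- sqrt(diag(V_manual))
se_glm    <- sqrt(diag(vcov(fit)))

print(cbind(se_manual, se_glm, difference = se_manual - se_glm))
```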
Change in deviance from adding terms to the model
anova(kyph.glm)
Analysis of Deviance Table

Model: binomial, link: logit

Response: Kyphosis

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev
NULL                      80     83.234
Age     1    1.302        79     81.932
Number  1   10.306        78     71.627
Start   1   10.247        77     61.380
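Each deviance drop can be referred to a chi-square distribution with the listed degrees of freedom. A sketch using the values from the table above; treating the drops as approximately chi-square is the usual large-sample assumption, and the vector names are illustrative.

```r
# Sequential deviance reductions copied from the analysis of deviance table
dev_drop <- c(Age = 1.302, Number = 10.306, Start = 10.247)

# Upper-tail chi-square probabilities, 1 degree of freedom each
p_values <- pchisq(dev_drop, df = 1, lower.tail = FALSE)

print(round(p_values, 4))
```

The same p-values can be requested directly with `anova(kyph.glm, test = "Chisq")`.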
Kyphosis model with an added Age^2 term
kyph.glm2<-glm(Kyphosis~poly(Age,2)+Number+Start,family=binomial,data=kyphosis)
summary(kyph.glm2)

Analysis of deviance
anova(kyph.glm2)
Analysis of Deviance Table

Model: binomial, link: logit

Response: Kyphosis

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev
NULL                            80     83.234
poly(Age, 2)  2  10.4959        78     72.739
Number        1   8.8760        77     63.863
Start         1   9.4348        76     54.428
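The two residual deviances (61.380 on 77 df and 54.428 on 76 df) can be compared to judge the quadratic Age term. A sketch of that nested-model test, assuming the rpart data and illustrative object names:

```r
library(rpart)  # provides the kyphosis data frame

fit_linear <- glm(Kyphosis ~ Age + Number + Start,
                  family = binomial, data = kyphosis)
fit_quad   <- glm(Kyphosis ~ poly(Age, 2) + Number + Start,
                  family = binomial, data = kyphosis)

# Drop in residual deviance from adding the quadratic Age term (1 extra parameter)
dev_diff <- deviance(fit_linear) - deviance(fit_quad)
p_value  <- pchisq(dev_diff, df = 1, lower.tail = FALSE)

print(c(deviance_drop = dev_diff, p_value = p_value))

# Equivalent built-in comparison of the nested models
print(anova(fit_linear, fit_quad, test = "Chisq"))
```

The drop of about 6.95 on 1 degree of freedom exceeds the 1% chi-square critical value of 6.63, which suggests the quadratic term improves the fit.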
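Kyphosis data, 16 cases, with fitted values and residuals
kyphosis$fi<-fi
# as.numeric() on the factor gives 1/2; subtracting 1 gives 0 = absent, 1 = present
y<-as.numeric(kyphosis$Kyphosis)-1
kyphosis$rr<-y-fi                                   # response residuals
kyphosis$rp<-(y-fi)/sqrt(fi*(1-fi))                 # Pearson residuals
kyphosis$rd<-sign(y-fi)*sqrt(-2*log(abs(1-y-fi)))   # deviance residuals (signed)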
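Plot of response residuals vs. fitted values

[Figure: x-axis is the fitted value; y-axis is the response residual]
plot(rr~fi,kyphosis)
Plot of deviance residuals vs. index

[Figure: x-axis is the observation index; y-axis is the deviance residual]
plot(resid(kyph.glm))
yy<-sign(y-fi)*(-2*(y*log(fi)+(1-y)*log(1-fi)))^(1/2)
Plot of deviance residuals vs. fitted values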
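The plotting call for this last figure is not in the transcript. A sketch of one way to draw it, assuming the rpart data and illustrative object names, which also checks that the hand-computed deviance residuals match the ones returned by `resid()`:

```r
library(rpart)  # provides the kyphosis data frame

fit <- glm(Kyphosis ~ Age + Number + Start, family = binomial, data = kyphosis)
fi <- fitted(fit)
y  <- as.numeric(kyphosis$Kyphosis) - 1  # 0 = absent, 1 = present

# Signed deviance residuals computed by hand
yy <- sign(y - fi) * sqrt(-2 * (y * log(fi) + (1 - y) * log(1 - fi)))

# They should agree with glm's own deviance residuals
matches <- isTRUE(all.equal(unname(yy), unname(resid(fit, type = "deviance"))))
print(matches)

# Deviance residuals against fitted probabilities
plot(fi, yy, xlab = "Fitted probability", ylab = "Deviance residual")
abline(h = 0, lty = 2)
```

The squared deviance residuals sum to the residual deviance of the model.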