Linear regression analysis

Wen, shuhui
2010.12
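Example

Postmenopausal women tend to have low bone mineral density (BMD), which may make them prone to fractures.
They might also be older and heavier.
People on a high-fat diet tend to have higher LDL cholesterol, which may increase their risk of cardiovascular disease.
They might also be smokers and overweight.
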
Multi-predictor analysis
Y = α + β1X1 + β2X2 + ... + βkXk + error

Potentially complex relationships in an observational study:
A continuous outcome (Y, e.g. BMD, LDL) is related to a risk factor (X1, e.g. menopause, high-fat diet).
But the risk factor of interest might be related to other factors (X2, e.g. age, BMI, smoking) which also predict the outcome.
Similarly, for experiments (e.g. clinical trials):
If randomization is implemented, confounding might not be an issue.
For multi-center trials, the analysis needs to adjust for clinical center.
Adjustment is also needed when baseline differences are apparent between the case and control groups.

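The role of the extra predictors can be illustrated with a minimal sketch on synthetic data. Everything here is made up for illustration: `x2` plays the part of a confounder such as age, `x1` is the risk factor of interest, and the true coefficient of `x1` is set to 0. The crude model attributes the confounder's effect to `x1`; the adjusted model does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic (made-up) data: x2 is a confounder that drives both x1 and y.
x2 = rng.normal(size=n)                              # e.g. standardized age
x1 = 0.8 * x2 + rng.normal(size=n)                   # risk factor of interest
y = 1.0 + 0.0 * x1 + 2.0 * x2 + rng.normal(size=n)   # true beta1 is 0

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

crude = ols([x1], y)          # Y = a + b1*X1
adjusted = ols([x1, x2], y)   # Y = a + b1*X1 + b2*X2

print(f"crude beta1    = {crude[1]:.2f}")
print(f"adjusted beta1 = {adjusted[1]:.2f}")
```

In this setup the crude estimate is far from 0 while the adjusted one is close to 0, which is the pattern the slides later suspect for smoking and FEV.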
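Example from the literature (Åkesson et al. 2006)

Aim: to investigate the effect of cadmium exposure on bone.
Bone damage (Y): loss of calcium and phosphate, together with inhibition of vitamin D hydroxylation caused by kidney damage, leads to osteoporosis and osteomalacia.
When assessing cadmium exposure (X) and body burden, blood cadmium reflects recent exposure and urinary cadmium reflects body burden.
There may be other influencing factors (X2, X3, ..., Xk).
Method: multiple linear regression.
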
Statistical analyses

Data from two independent groups of subjects
were compared by the Mann-Whitney U-test. We
used Spearman rank correlation (rs) or Kendall’s
tau to assess univariate associations (p ≤ 0.1). In
multiple linear regression models, each bone-related variable was evaluated in relation to
cadmium, potential confounders (factors
associated with both cadmium and bone) and
effect modifiers (factors associated with bone).
We explored possible interactions in the model.
Statistical analyses
Because the season of sampling correlated with blood
and urinary cadmium, BMD, PTH, U-DPD, and
urinary calcium, it was included in the models.
Residual and goodness-of-fit analyses indicated no
deviation from a linear pattern in the regression
models. The final regression model included, apart
from cadmium, only statistically significant variables
(p ≤ 0.05). All tests were two sided, and statistical
evaluation was performed using SPSS (version 12.01;
SPSS Inc., Chicago, IL, USA).
Outline





Correlation
Multiple linear regression
Predictor selection
Interaction
Other extended cases
Example: FEV data

Forced expiratory volume in one second (FEV)
How is FEV related to smoking?
Other related factors, e.g. age, gender

FEV data
Analysis steps





Step 1: Present descriptive statistics of the clinical features: FEV and the other influencing factors.
Step 2: Explore the correlations between FEV and X1, ..., Xk.
Step 3: Build the multiple linear regression model and check for model adequacy.
Step 4: Revise or select the model.
Step 5: Interpret the results (model).

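Step 2: Explore the correlations

SPSS: Analyze → Correlate → Bivariate
Three options are offered under "Correlation Coefficients":
1. Pearson: for continuous Xs.
2. Kendall's tau: for ordinal Xs.
3. Spearman: a rank-based coefficient for ordinal Xs; these slides also use it for nominal Xs coded 0/1, such as gender and smoke.
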
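A minimal sketch of what the three coefficients compute, using only NumPy. The function names are hypothetical helpers, the data are made up, and the Kendall version shown is tau-a, which ignores ties (SPSS reports tau-b, which coincides with tau-a when there are no ties).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(x):
    """Ranks starting at 1, with tied values given their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rs: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / number of pairs."""
    x, y = np.asarray(x), np.asarray(y)
    i, j = np.triu_indices(len(x), k=1)
    s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(s.sum() / len(i))

# Made-up illustration: y increases with x, but not linearly.
x = np.array([1, 2, 3, 4, 5, 6])
y = x ** 3
print(pearson(x, y), spearman(x, y), kendall_tau_a(x, y))
```

Because `y` is a monotone but non-linear function of `x`, the two rank-based coefficients equal 1 while Pearson's r is below 1.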
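Correlation matrix (recall Table 2 in the 2006 paper)

The SPSS output (select it, then double-click on it) can be edited directly into the table below.
Alternatively, put the p-values in the lower-left triangle of the matrix.
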
Add the scatter plot

Graph → Scatter plot → Matrix plot

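Matrix scatter plot

Read the scatter plots together with the correlation matrix.
The correlation coefficients show that the correlations are positive and statistically significant.
The scatter plots show whether the relationships are linear.
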
For nominal variables coded 0/1, such as gender and smoke, Spearman rs is more suitable.

Look at the correlation between FEV and gender (or smoke).

Spearman rs

FEV is associated with gender (0=female, 1=male): males have larger FEV.
FEV is associated with smoking (0=No, 1=Yes): smokers have larger FEV.

FEV vs. smoke

Do smokers really have larger FEV?
A possible reason is that smokers are mostly male or older (and therefore physically larger).
Confounder?

Summary for bivariate correlation




For a continuous outcome (Y):
If the factors (Xs) are continuous, we show the Pearson correlation coefficient.
If the factors (Xs) are categorical, we list the Spearman correlation coefficient.
Also provide plots where possible.

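Summary of correlation analysis

How is FEV related to smoking?
Other related factors, e.g. age, gender
FEV is positively correlated with both height and age, and the correlations are statistically significant (p<0.05).
FEV is associated with gender (p<0.05): males have larger FEV.
FEV is associated with smoking (p<0.05): smokers have larger FEV. However, this may be caused by confounders such as gender, age, and height, which have not yet been taken into account.
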
Step 3: Build the multiple linear regression model
Now, we want to build the model as
FEV = α + β1 age + β2 sex + β3 Hgt + β4 smoke

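The slides fit this model in SPSS. As a rough equivalent, here is a minimal NumPy sketch on a synthetic stand-in for the FEV data; it is not the real data set, and the "true" coefficients are assumptions chosen only for illustration. It estimates α and β1 to β4 by least squares and reports standard errors and t statistics.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for the FEV data (NOT the real data set).
age = rng.integers(3, 20, size=n).astype(float)
sex = rng.integers(0, 2, size=n).astype(float)        # 0=female, 1=male
hgt = 100 + 4.0 * age + rng.normal(0, 8, size=n)      # height in cm
smoke = (rng.random(n) < 0.02 * age).astype(float)    # 0=no, 1=yes; more common when older
# Assumed coefficients, chosen only for illustration.
fev = (-4.5 + 0.06 * age + 0.15 * sex + 0.04 * hgt - 0.1 * smoke
       + rng.normal(0, 0.4, size=n))

# Design matrix: intercept, age, sex, Hgt, smoke.
X = np.column_stack([np.ones(n), age, sex, hgt, smoke])
names = ["alpha", "age", "sex", "Hgt", "smoke"]

beta, *_ = np.linalg.lstsq(X, fev, rcond=None)
resid = fev - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])             # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

for name, b, s in zip(names, beta, se):
    print(f"{name:6s} {b:8.3f}  (SE {s:.3f}, t = {b / s:.2f})")
```

The printed numbers depend on the simulated data and are not comparable with the SPSS output in the slides.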
Multiple linear regression
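Check for model adequacy

In the regression dialog, click "Plots" and select the normal probability plot (to check whether the data meet the normality assumption).
Draw the residual plot (Y axis: residuals, X axis: FEV values) to assess the homogeneity-of-variance assumption.
If these two assumptions are not met, the subsequent tests of the regression coefficients may not be valid.
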
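For readers without SPSS, the normal probability (P-P) plot can be sketched numerically. The helper names below are hypothetical and the residuals are synthetic; the sketch compares the observed cumulative proportions with those expected under normality and reports the largest distance from the 45-degree line.

```python
import numpy as np
from statistics import NormalDist

def pp_points(resid):
    """Observed vs. expected cumulative probabilities for a normal P-P plot."""
    resid = np.asarray(resid, dtype=float)
    z = np.sort((resid - resid.mean()) / resid.std(ddof=1))
    n = len(z)
    observed = (np.arange(1, n + 1) - 0.5) / n
    expected = np.array([NormalDist().cdf(v) for v in z])
    return observed, expected

def max_gap(resid):
    """Largest vertical distance of the P-P points from the 45-degree line."""
    observed, expected = pp_points(resid)
    return float(np.max(np.abs(observed - expected)))

rng = np.random.default_rng(7)
normal_resid = rng.normal(size=500)               # synthetic, normal
skewed_resid = rng.exponential(size=500) - 1.0    # synthetic, right-skewed

gap_normal = max_gap(normal_resid)
gap_skewed = max_gap(skewed_resid)
print(f"normal: {gap_normal:.3f}  skewed: {gap_skewed:.3f}")
```

Plotting `observed` against `expected` gives the picture itself; the gap is only a convenient one-number summary, not a formal test.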
Results-1: Pearson correlation matrix

Besides the correlations between FEV and the factors (Xs), some of the Xs are also significantly correlated with each other, e.g. age vs. Hgt.

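Results-2: Adjusted R-square

The proportion of the variation in FEV jointly explained by all factors in the model is 0.774. In other words, the remaining 22.6% is error; other factors that affect FEV may not have been considered.
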
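As a reminder of how the two quantities are defined, here is a small sketch. The function name is hypothetical and the five data points are made up; `k` is the number of predictors, not counting the intercept.

```python
import numpy as np

def r_squared(y, fitted, k):
    """Return (R^2, adjusted R^2) for a model with k predictors plus an intercept."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = len(y)
    ss_res = np.sum((y - fitted) ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return float(r2), float(adj)

# Made-up observed and fitted values, for illustration only.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
r2, adj = r_squared(y_obs, y_fit, k=1)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```

The adjusted value is never larger than the unadjusted one, because it penalizes the number of predictors.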
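Results-3: Collinearity diagnosis

Collinearity means that the Xs are highly correlated with each other, which affects the estimation of the β values; in that case the model needs to be revised.
The diagnostic is the variance inflation factor (VIF). VIF > 10 indicates that the variable is highly correlated with the other variables, and dropping it can be considered.
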
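The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A minimal NumPy sketch follows; the function name and the synthetic variables `a`, `b`, `c` are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)               # unrelated to a
c = a + 0.1 * rng.normal(size=300)     # nearly a copy of a

vif_ab = vif(np.column_stack([a, b]))       # both close to 1
vif_abc = vif(np.column_stack([a, b, c]))   # a and c far above 10
print(vif_ab.round(2), vif_abc.round(1))
```

Adding the near-duplicate `c` inflates the VIF of both `a` and `c` while leaving `b` almost unaffected, which is the situation the VIF > 10 rule is meant to flag.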
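Results-4-1: Normality

If the points in the plot lie close to the 45-degree line, the normality assumption holds.
When the sample size is large enough, a departure from normality is usually not a major concern.
If normality fails, Y is usually transformed to log(Y) and the regression is redone.
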
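Results-4-2: Homogeneity of variance

A well-behaved residual plot should look random, with no pattern.
The plot on the right looks somewhat fan-shaped, which may indicate a violation of homogeneity of variance.
In addition, points whose standardized residual (Y axis) falls outside (-3, 3) are outliers.
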
Outliers

The output table lists the outliers. They can also be removed and the regression redone. (Do you know how to do it?)

Influential point

A high-leverage point can be an x-outlier. An influential point is one whose removal would change one or more β-hat by a large amount.

Criterion                  Bound
Leverage, h                > 2/n
Studentized residual, r    > 3 in absolute value
DFFIT                      > 2
Cook's distance            > 1

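A minimal sketch of how these four diagnostics can be computed from the hat matrix, using standard textbook formulas. The function name is hypothetical and the data are synthetic with one planted point. Note that cut-offs vary between textbooks; for leverage, for example, 2p/n with p the number of fitted coefficients is a common choice.

```python
import numpy as np

def influence(X, y):
    """Leverage, studentized residuals, DFFITS and Cook's distance.

    X must already contain the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverage
    e = y - H @ y                              # raw residuals
    s2 = e @ e / (n - p)                       # residual variance
    r = e / np.sqrt(s2 * (1 - h))              # internally studentized residuals
    cooks = r ** 2 * h / (p * (1 - h))
    # Externally studentized residuals: variance estimated with case i left out.
    s2_i = (e @ e - e ** 2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1 - h))
    dffits = t * np.sqrt(h / (1 - h))
    return h, r, dffits, cooks

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 6.0, -5.0     # planted point: extreme in x and far off the line

X = np.column_stack([np.ones(50), x])
h, r, dffits, cooks = influence(X, y)
print("most influential case:", int(np.argmax(cooks)))
print("its Cook's distance exceeds 1:", bool(cooks.max() > 1))
```

The planted case combines high leverage with a large residual, so it dominates every diagnostic; a point that is extreme in x but lies on the line would have high leverage and a small Cook's distance.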
Influential point (2)
Reference: Vittinghoff et al. 2005, page 122.

Outliers or influential points?

There are outliers, but no influential points (maximum Cook's distance < 1).

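Step 4: Model revision or selection

Based on the preliminary analysis:
Age, gender, smoke, and height explain 77.4% of the variation in FEV.
Normality holds; homogeneity of variance is not fully satisfied, but n is large enough.
There is no collinearity problem and no influential point, but there are 5 outliers.
Model revision:
Try removing the outliers and fitting the model again.
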
First save the standardized residuals, then use the Select Cases function to remove the outliers.
After running the regression, go to Data → Select Cases.

Delete outliers and do regression again

The selection condition is abs(ZRE_1) <= 3.

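The same filter-and-refit step can be sketched outside SPSS. Here the standardized residual is taken as the residual divided by the residual standard deviation, which is what the saved ZRE_1 variable is assumed to hold; the data and the three planted outliers are synthetic.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients and standardized residuals (residual / s)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
    return beta, resid / s

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[:3] += 3.0                       # three planted outliers (synthetic)

X = np.column_stack([np.ones(n), x])
beta_all, zre = fit(X, y)

keep = np.abs(zre) <= 3            # same rule as abs(ZRE_1) <= 3
beta_clean, _ = fit(X[keep], y[keep])

print("cases kept:", int(keep.sum()), "of", n)
print("before:", beta_all.round(3), "after:", beta_clean.round(3))
```

Comparing the coefficients before and after removal shows whether the flagged cases actually matter for the conclusions.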
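Interpretation of regression analysis

After redoing the regression, review the results by following the same steps as before (slides 23-33 of the original deck, from building the model through the influence diagnostics).
N = 649 (originally 654 records)
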
Adjusted R-square (new)

The adjusted R-square is now 78.7%, slightly larger than the previous value of 77.4%.

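Normality, collinearity, homogeneity of variance

Normality: holds; the normal probability plot is close to the 45-degree line.
Collinearity: all VIFs are below 10, so there is no collinearity.
Homogeneity of variance: the residual plot looks the same as before.
Outliers: a few remain, but they are mild (very close to 3), so no further cases are excluded.
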
Interpretation of regression analysis

Regression model
FEV = -4.521 + 0.057 Age + 0.131 Sex - 0.067 Smoke + 0.042 Hgt

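1. Removing the outliers has little effect on the regression model.
2. The variables significantly associated with FEV are still Age, Sex, and Height.
[Output shown: model fitted with the outliers included]
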
Formatted as a table for a paper (for reference)
Table: Multiple linear regression analysis between FEV and factors.

Factors       Coefficient   95% CI lower bound   95% CI upper bound   p-value
Age (yr)       0.057         0.039                0.075               <0.001*
Sex            0.131         0.069                0.194               <0.001*
Smoke         -0.067        -0.177                0.044                0.236
Height (cm)    0.042         0.038                0.046               <0.001*

Sex: 0=female, 1=male. Smoke: 0=no, 1=yes. *: statistically significant.

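Solutions if normality fails

Transform Y (especially with a small sample), e.g. log(Y).
The model is log(Y) = α + βX.
Interpretation of β: for each one-unit increase in X, Y increases by _____ %.
Drawback: after transformation, the results are harder to interpret.
How to do it?
First use Compute to obtain the transformed Y.
Then carry out the analysis with Steps 2-4 described above.
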
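The interpretation of β in the log model can be worked out directly. Assuming the natural logarithm, a one-unit increase in X multiplies Y by a constant factor:

```latex
\[
\log Y = \alpha + \beta X
\;\Longrightarrow\;
\frac{Y(X+1)}{Y(X)}
  = \frac{e^{\alpha + \beta (X+1)}}{e^{\alpha + \beta X}}
  = e^{\beta},
\qquad
\text{percent change in } Y = 100\,\bigl(e^{\beta} - 1\bigr)\,\%.
\]
```

For small β the percent change is approximately 100β, because e^β − 1 ≈ β.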
Solutions if homogeneity of variance fails
1. A transformation can also be used (especially with a small sample), e.g. log(Y), 1/Y.
2. Use weighted least squares (please consult a statistician).

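To give a sense of what weighted least squares does, here is a minimal sketch on synthetic fan-shaped data. The function name is hypothetical, and the weights assume that the error variance is known to be proportional to x squared, which in practice would have to be justified or estimated (hence the advice to consult a statistician).

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: minimize sum of w_i * (y_i - x_i' beta)^2."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(5)
n = 400
x = rng.uniform(1, 10, size=n)
# Synthetic fan-shaped data: the error SD grows in proportion to x.
y = 1 + 2 * x + rng.normal(scale=0.5 * x, size=n)

X = np.column_stack([np.ones(n), x])
beta_ols = wls(X, y, np.ones(n))     # equal weights reproduce ordinary least squares
beta_wls = wls(X, y, 1 / x ** 2)     # weights proportional to 1 / variance (assumed known)
print(beta_ols.round(2), beta_wls.round(2))
```

Both estimators target the same coefficients; the weighted fit simply gives less influence to the noisier observations, which makes its standard errors more trustworthy when the weights are right.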
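Solutions if collinearity exists

Model selection
Use a model selection procedure to enter the more significant variables, which avoids high correlation among the Xs.
Forward, backward, and stepwise regression are available; stepwise is used most often.
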
Stepwise regression
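SPSS performs the selection through its menus. The idea can be illustrated with a simplified sketch of forward selection only; full stepwise regression also re-tests and removes variables already entered. The function names, the fixed F-to-enter threshold of 4.0, and the synthetic variables are all assumptions made for illustration.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_select(columns, y, f_to_enter=4.0):
    """Repeatedly add the candidate with the largest partial F statistic,
    stopping when no candidate reaches f_to_enter."""
    n = len(y)
    chosen = []
    remaining = list(columns)
    X = np.ones((n, 1))                         # start with the intercept only
    while remaining:
        base = sse(X, y)
        best_name, best_f = None, -np.inf
        for name in remaining:
            Xc = np.column_stack([X, columns[name]])
            new = sse(Xc, y)
            f = (base - new) / (new / (n - Xc.shape[1]))
            if f > best_f:
                best_name, best_f = name, f
        if best_f < f_to_enter:
            break
        chosen.append(best_name)
        remaining.remove(best_name)
        X = np.column_stack([X, columns[best_name]])
    return chosen

rng = np.random.default_rng(8)
n = 200
cols = {name: rng.normal(size=n) for name in ["a", "b", "noise"]}
y = 3 * cols["a"] + 1 * cols["b"] + rng.normal(size=n)   # "noise" has no effect

chosen = forward_select(cols, y)
print(chosen)
```

The strongest predictor enters first, and variables that add little beyond those already in the model tend to stay out.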
Results
Selected model
The model is FEV = -4.449 + 0.041 Hgt + 0.061 Age + 0.161 Sex
(fitted here on all data; please use the data without outliers)

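Interaction

If Z and X interact in their effect on Y, the relationship between X and Y changes with the value of Z.
From a statistical viewpoint, one can plot the mean of Y for each X*Z group.
To include an interaction effect in the model, add the product term X*Z and test whether its regression coefficient is 0; if it is significant, X and Z interact.
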
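A minimal sketch of the product-term approach on synthetic data. The binary variables `x` and `z` and the coefficients used to generate `y` are assumptions for illustration; the last lines compute the t statistic for the coefficient of X*Z.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
x = rng.integers(0, 2, size=n).astype(float)   # e.g. smoke (0/1)
z = rng.integers(0, 2, size=n).astype(float)   # e.g. sex (0/1)
# Synthetic outcome: the effect of x is -0.2 when z=0 and +0.1 when z=1.
y = 3.0 - 0.2 * x + 0.1 * z + 0.3 * x * z + rng.normal(scale=0.5, size=n)

xz = x * z                                      # product (interaction) term
X = np.column_stack([np.ones(n), x, z, xz])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
se = np.sqrt(resid @ resid / (n - 4) * np.diag(np.linalg.inv(X.T @ X)))
t_interaction = beta[3] / se[3]                 # tests H0: coefficient of X*Z is 0

print(f"effect of x when z=0: {beta[1]:.2f}")
print(f"effect of x when z=1: {beta[1] + beta[3]:.2f}")
print(f"t for interaction:    {t_interaction:.2f}")
```

With two binary variables and their product, the model reproduces the four group means exactly, so the coefficient of `x` is the difference in mean Y between x=1 and x=0 within the z=0 group.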
Sex vs. Smoke?
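Check for mean FEV

From the descriptive statistics, the difference in FEV between males and females depends on smoking status, so an interaction may exist (from a statistical viewpoint).
Note that the effects of Age and Height are not yet considered here; the relationship may change once the confounders are added (multiple regression).
[Figure: two plots of mean FEV (vertical axis, 2 to 4), one by smoking status with separate lines for females and males, the other by sex with separate lines for nonsmokers and smokers]
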
Add interaction effects

Test the interaction between smoking and sex.
1. First create the product term (name it "interaction").

Build up the model

Add "interaction" to the list of independent variables.

Results (fitted here on all data; please use the data without outliers)

Regression model
The interaction between smoking and sex is significant, and the main effect of smoke is now significant as well.

Which one is the final model?

Add the interaction (fitted here on all data):
Mean FEV = -4.422 + 0.066 age + 0.135 Sex - 0.183 Smoke + 0.041 Hgt + 0.234 Interaction

Interpretation
Mean FEV = -4.422 + 0.066 age + 0.135 Sex - 0.183 Smoke + 0.041 Hgt + 0.234 Interaction

Sex          Smoke     Interaction   Estimated FEV (adjusted for age, height)
female (0)   No (0)    0             baseline
female (0)   Yes (1)   0             -0.183
male (1)     No (0)    0             0.135
male (1)     Yes (1)   1             0.186

Among females, smokers have an FEV 0.183 L lower than nonsmokers;
among males, smokers have an FEV 0.051 L higher than nonsmokers.
What might be the reason?

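The entries of this table follow from the coefficients by simple arithmetic, which the short sketch below reproduces. The function name is a hypothetical helper; the three coefficients are copied from the model with the interaction term.

```python
# Coefficients of Sex, Smoke and Interaction from the fitted model.
b_sex, b_smoke, b_inter = 0.135, -0.183, 0.234

def adjusted_difference(sex, smoke):
    """Difference in mean FEV from a female nonsmoker of the same age and height."""
    return b_sex * sex + b_smoke * smoke + b_inter * sex * smoke

for sex, sex_label in [(0, "female"), (1, "male")]:
    for smoke, smoke_label in [(0, "nonsmoker"), (1, "smoker")]:
        print(f"{sex_label:6s} {smoke_label:9s} {adjusted_difference(sex, smoke):+.3f}")

# Effect of smoking within each sex.
smoking_effect_female = adjusted_difference(0, 1) - adjusted_difference(0, 0)
smoking_effect_male = adjusted_difference(1, 1) - adjusted_difference(1, 0)
print(f"female: {smoking_effect_female:+.3f}  male: {smoking_effect_male:+.3f}")
```

The smoking effect among males is the sum of the Smoke and Interaction coefficients, -0.183 + 0.234 = 0.051, which is why the sign flips between the sexes.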
Could height be the explanation?

Further issues

What if Y is not continuous?
If Y is binary, say disease vs. healthy, logistic regression is suggested (next class by Prof. Hsieh).
What if Y is a repeated measure, say pre/post Y?
For 2 time points, one might use post-Y as the response variable and adjust for pre-Y and the Xs.
For several time points, repeated-measures ANOVA is suggested (please consult a statistician).

References
1. Pagano M., Gauvreau K. (2000) Principles of Biostatistics (2nd ed). Australia; Pacific Grove, CA: Duxbury. (distributed by 歐亞書局)
2. Rosner B. (2006) Fundamentals of Biostatistics (6th ed). Belmont, CA: Thomson-Brooks/Cole. (distributed by 歐亞書局)
3. Vittinghoff E., Glidden D.V., Shiboski S.C., McCulloch C.E. (2005) Regression Methods in Biostatistics. Springer.
4. 史麗珠 (2005) 進階應用生物統計學 [Advanced Applied Biostatistics]. Taipei: 學富文化.