Transcript Document
STATISTICS
Regression & Correlation
Outline
X, Y & Regression Models
Simple linear regression (SLR)
The logic of SLR: SST=SSR+SSE
SLR: ANOVA table & R-square
Comparison of SLR, ANOVA, and the two-sample t test
Multiple Linear Regression
Pearson’s correlation coefficient (r)
Relationship among R², r, and b
Relationship among Z, t, F, and χ²
X and Y
X: Predictor variables; Predictors; Covariates; Explanatory variables; Independent variables.
Y: Outcome; Response; Dependent variables.
Univariate analysis: 1X1Y
X                    Y                    Comparisons            Methods
Binary               Num._normal          2 indep. means         Two-sample t test*
Categorical          Num._normal          >= 2 indep. means      One-way ANOVA*
Binary               Num._non-normal      2 indep. medians       Wilcoxon rank sum
Categorical          Num._non-normal      >= 2 indep. medians    Kruskal-Wallis
Num._normal          Num._normal          X related to Y         Regression*
-                    Num._normal          2 related means        Paired t
-                    Num._non-normal      2 related medians      Wilcoxon signed rank
Categorical          Categorical          -                      Pearson's Chi-sq
Categorical_Binary   Categorical_Binary   2 related prop.        McNemar Chi-sq
Categorical_Binary   Categorical_Binary   2 indep. prop.         Pearson's Chi-sq
Categorical_Binary   Categorical_Binary   2 indep. prop.         2-Z
Note: methods marked with * require the following assumptions: normality; independence.
Abbreviations: Cat.: categorical; Num.: numerical
Multivariate analysis: Xs1Y
Xs                    Y                Methods
Categorical           Cat.             Log-linear
Cat.+Num.             Cat.(binary)     Logistic regression
Cat.+Num.             Cat.(>=3)        Logistic regression; Discriminant analysis*
Cat.                  Num.             ANOVA*; MANOVA*
Num.                  Num.             Multiple regression*
Cat.+Num.             Num.(censored)   Cox proportional hazards model
Confounding factors   Num.             ANCOVA*
Confounding factors   Cat.             Mantel-Haenszel
Other methods: GEE*; Cluster analysis; Propensity scores; CART; Factor analysis
Note: methods marked with * require the following assumptions: multivariate normality; independence.
Abbreviations:
Cat.: categorical; Num.: numerical
ANOVA: analysis of variance
ANCOVA: analysis of covariance
MANOVA: multivariate analysis of variance
GEE: generalized estimating equations
CART: classification and regression tree
Regression Models
Mathematical models that describe the relationship between Y and X
Uses of a regression model:
Adjustment
Prediction
Finding important factors for Y
Regression Models
Definition: Mathematical models that describe the relationship between Y and X
Purpose: Find important factors for Y and/or predict Y
Simple linear regression (SLR)
Model:
$Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$
$E(Y) = \mu_Y = \beta_0 + \beta_1 X$
SLR Example
Is there a linear relationship between age and cholesterol?
ID    AGE    CHOL
1     34     141.4
2     39     180.5
3     44     178.4
4     46     212
5     48     203.2
6     51     224.1
7     53     186
8     60     350
9     61     286.3
10    65     287.6
11    66     330.3
12    67     311.3
SLR: parameter estimation
The least squares method
$\min \sum_{i=1}^{N} (Y_i - \beta_0 - \beta_1 X_i)^2$
Point estimate:
$\hat\beta_0$: estimated intercept
$\hat\beta_1$: estimated slope
The logic of SLR: SST=SSR+SSE
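[Figure: fitted line $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ with the deviations at $X_1$ marked]
Total amount unexplained at $X_i$: $Y_1 - \bar{Y}$
Amount at $X_i$ explained by regression: $\hat{Y}_1 - \bar{Y}$
Amount at $X_i$ unexplained by regression: $Y_1 - \hat{Y}_1$
$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$
SST = SSE + SSR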
SLR: parameter estimation
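The least squares method
min SSE:
$S = \sum (Y_i - \hat{Y}_i)^2 = \sum \varepsilon_i^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$
Point estimate
Taking the partial derivatives of S with respect to the intercept and the slope and setting them to zero gives the estimates of the intercept and the slope.
Intercept:
$\frac{\partial S}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \Rightarrow b_0 = \bar{Y} - b_1 \bar{X}$
Slope:
$\frac{\partial S}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \Rightarrow b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$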
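The closed-form estimates can be checked numerically. The following is a minimal sketch in plain Python that applies the formulas for b1 and b0 to the age and cholesterol data from the SLR example; the function name `least_squares` is only an illustrative choice, not something from the slides.

```python
# Age and cholesterol data from the SLR example.
age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350,
        286.3, 287.6, 330.3, 311.3]


def least_squares(x, y):
    """Return (b0, b1) from the closed-form least-squares solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


b0, b1 = least_squares(age, chol)
print(f"CHOL = {b0:.4f} + {b1:.4f} * Age")
```

The result should agree with the estimated model reported for this example (intercept about -57.60, slope about 5.65).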
SLR example: Regression line
[Figure: scatter plot of CHOL vs Age with the fitted regression line; CHOL axis from 100 to 350, Age axis from 30 to 70]
Estimated model: CHOL = -57.5965 + 5.6502 × Age
SLR: ANOVA table & R-square
Source       DF    SS          MSS         F          p        Power(5%)
Intercept    1     696538.3    696538.3
Slope        1     42705.43    42705.43    45.4538    0.0001   1.0000
Error        10    9395.352    939.5352
Adj. Total   11    52100.78    4736.435
Total        12    748639.1
R2 = 0.82, p = 0.0001
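As a rough check of the ANOVA table, the sums of squares, F statistic, and R-square can be recomputed from the age and cholesterol data. This is a sketch in plain Python; the variable names are illustrative assumptions.

```python
# Age and cholesterol data from the SLR example.
age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350,
        286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in age]

# Decomposition SST = SSR + SSE.
sst = sum((y - y_bar) ** 2 for y in chol)             # adjusted total
ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by the slope
sse = sum((y - f) ** 2 for y, f in zip(chol, fitted)) # error

mse = sse / (n - 2)          # error mean square, DF = n - 2
f_stat = (ssr / 1) / mse     # slope mean square has DF = 1
r_squared = ssr / sst

print(f"SST={sst:.2f} SSR={ssr:.2f} SSE={sse:.2f}")
print(f"F={f_stat:.4f} R2={r_squared:.2f}")
```

The printed values can be compared row by row with the Slope, Error, and Adj. Total lines of the table.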
SLR: qualitative covariate
Example:
X=treatment, 1 or 0
Y=SBP
Hypothesis
H0: β1 = 0
H1: β1≠0
Comparison with the test of means:
H0: μ1 = μ0
H1: μ1≠μ0
Note: β1 = μ1 - μ0
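The note β1 = μ1 - μ0 can be illustrated with a small sketch. The SBP values below are made up purely for illustration (they are not from the slides); the point is that the least-squares slope on a 0/1 covariate equals the difference between the two group means, and the intercept equals the mean of the group coded 0.

```python
def slope_intercept(x, y):
    """Closed-form least-squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: X = treatment (1 or 0), Y = SBP.
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
sbp = [140, 135, 150, 145, 130, 128, 138, 132]

b0, b1 = slope_intercept(treatment, sbp)

group0 = [y for x, y in zip(treatment, sbp) if x == 0]
group1 = [y for x, y in zip(treatment, sbp) if x == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print("slope b1        :", b1)
print("mean1 - mean0   :", mean1 - mean0)
print("intercept b0    :", b0, "(mean of group 0:", mean0, ")")
```

Testing H0: β1 = 0 in this regression is therefore the same question as testing H0: μ1 = μ0.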
Comparison of SLR, ANOVA, and the two-sample t test
2-s t → ANOVA
2-s t → SLR: H0: μ1 = μ0 → H0: β1 = 0
Dummy variables: K groups require K-1 dummy variables
Original coding       Dummy coding
ID    Y      X        ID    Y      X
1     140    A        1     140    0
2     135    B        2     135    1
-                     -
ANOVA → SLR: H0: μ1 = μ2 = μ3 → H0: β1 = β2 = 0
Original coding       Dummy coding
ID    Y      X        ID    Y      X1    X2
1     140    A        1     140    0     0
2     135    B        2     135    0     1
3     130    C        3     130    1     0
-
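The K-1 rule for dummy variables can be sketched in a few lines of Python. The helper `dummy_code` and its argument names are hypothetical; any level not listed in `indicator_levels` acts as the reference group and is coded as all zeros.

```python
def dummy_code(groups, indicator_levels):
    """Return one 0/1 column per level in indicator_levels.

    Levels that are not listed form the reference group (all zeros),
    so K groups need K - 1 indicator columns.
    """
    return [[1 if g == level else 0 for level in indicator_levels]
            for g in groups]


groups = ["A", "B", "C"]
# X1 flags group C and X2 flags group B; A is the reference group.
coded = dummy_code(groups, ["C", "B"])

for g, (x1, x2) in zip(groups, coded):
    print(g, x1, x2)
```

With this column order the output reproduces the three-group coding table (A: 0 0, B: 0 1, C: 1 0). Choosing a different reference group changes the meaning of each β but not the overall test of H0: β1 = β2 = 0.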
Multiple Linear Regression
Model
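$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$
$E(Y) = \mu_Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_p X_p$
Example: Is Age a predictor of SBP after adjusting for Sex?
$\hat{Y} = \hat\beta_0 + \hat\beta_1 \mathrm{AGE} + \hat\beta_2 \mathrm{SEX}$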
MLR: example
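[Figure: SBP vs Age, with two parallel fitted lines]
Male: $\hat{Y} = \hat\beta_0^* + \hat\beta_1 \mathrm{AGE}$
Female: $\hat{Y} = \hat\beta_0 + \hat\beta_1 \mathrm{AGE}$
Vertical distance between the two lines: $\hat\beta_0^* - \hat\beta_0$, which equals $\hat\beta_2$ when SEX is coded 1 for male and 0 for female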
Pearson’s correlation coefficient (r)
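Relationship between X and Y
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
Properties of Pearson’s r
Range: $-1 \le r \le 1$
Unitless
Good for normally distributed X and Y
The correlation coefficient r can be viewed as the cosine of the angle between two vectors in a multidimensional space
Spearman’s correlation coefficient
Pearson’s r for ranked X and Y
Good for non-normally distributed X and Y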
Spearman’s Rho: rank correlation
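Relationship between X and Y
$r_S = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$
$t = \frac{r_S \sqrt{n - 2}}{\sqrt{1 - r_S^2}}$
Spearman’s correlation coefficient
Pearson’s r for ranked X and Y
Good for non-normally distributed X and Y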
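To make the link between the two coefficients concrete, here is a sketch that computes Pearson's r and then Spearman's coefficient as Pearson's r applied to ranks, using the age and cholesterol data from the SLR example. The helper names and the average-rank handling of ties are assumptions of this sketch; the t statistic is referred to a t distribution with n - 2 degrees of freedom.

```python
import math

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350,
        286.3, 287.6, 330.3, 311.3]


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rs(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = pearson_r(age, chol)
rs = spearman_rs(age, chol)
print(f"Pearson r = {r:.4f}, Spearman rs = {rs:.4f}")
print(f"t for rs  = {t_statistic(rs, len(age)):.3f}")
```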
Assumptions in Regression
Linear
Independent
Normal distribution
Equal Variance
Note: for all values of x, the errors ε are independent and normally distributed, with mean μ = 0 and the same SD σ = σ(ε)
[Figure: regression line y = α + βx; axes labeled Weight and Height]
$Y_i = \alpha + \beta X_i + \varepsilon_i$
α and β are the unknown parameters
ε = random error
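Relationship among R², r, and b
r and b
$r^2 = \frac{SSR}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{SSE}{\sum (Y_i - \bar{Y})^2}$
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
$r = b_1 \sqrt{\frac{\sum (x - \bar{x})^2}{\sum (y - \bar{y})^2}} = b_1 \frac{SD_X}{SD_Y}$
r²: Coefficient of Determination:
The proportion (percentage) of the variability among the observed values of Y that is explained by the linear regression of Y on X.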
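The identities linking r, b1, and r² can be verified numerically. The sketch below uses the age and cholesterol data from the SLR example; `statistics.stdev` is the sample standard deviation (divisor n - 1), and the n - 1 factors cancel in the ratio SD_X / SD_Y.

```python
import math
import statistics

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350,
        286.3, 287.6, 330.3, 311.3]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in chol)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))

r = sxy / math.sqrt(sxx * syy)   # Pearson's r
b1 = sxy / sxx                   # regression slope
sd_x = statistics.stdev(age)
sd_y = statistics.stdev(chol)

# SSR = b1^2 * Sxx, so R-square = SSR / SST should equal r^2.
r_squared = (b1 ** 2 * sxx) / syy

print(f"r = {r:.4f}, b1 = {b1:.4f}")
print(f"b1 * SD_X / SD_Y = {b1 * sd_x / sd_y:.4f}")
print(f"R2 = {r_squared:.4f}, r^2 = {r * r:.4f}")
```

Because SD_X and SD_Y are positive, r and b1 always share the same sign, while their magnitudes can differ greatly depending on the units of X and Y.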
Relationship between r and b: they always have the same sign
Large r with small b
Small r with large b
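Standard deviations around the regression line (1):
(1) SE of estimate (standard error of the estimate)
Term used by 楊志良: standard deviation of the regression line
Meaning: the variation in the vertical distances between any observed Y and the regression line; the standard deviation computed with the regression line in place of the mean
(2) SE of RL (standard error of the regression line; the standard deviation of the sampling distribution of Ŷ)
Term used by 楊志良: standard error of the regression line
Meaning: the standard error of Y computed over repeated samples at the same X value, that is, the standard deviation of the second-level normal distribution of Ŷ; used for the CI of a single E(y)
(3) SE of prediction (standard error of prediction)
Term used by 楊志良: standard error of the estimate (** this term is easily confused)
Meaning: the standard error of predicting Y from a single X, that is, the standard deviation of the first-level normal distribution of Y at a given X value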
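Standard deviations around the regression line (2):
The Standard Error of the Estimate
$S_{Y \cdot X}^2 = \hat{V}(Y) = \frac{\sum (Y - \hat{Y})^2}{n - 2} = \frac{\sum (Y - \bar{Y})^2 - b_1^2 \sum (X - \bar{X})^2}{n - 2} = \frac{\sum (Y - \bar{Y})^2 (1 - r^2)}{n - 2}$
SE of RL
$S_{\hat{Y}}^2 = V(\hat{Y}) = V(b_0 + b_1 X) = V[\bar{Y} + b_1 (X - \bar{X})] = V(\bar{Y}) + V[b_1 (X - \bar{X})] + 2 (X - \bar{X}) COV(\bar{Y}, b_1) = \frac{\sigma^2}{n} + \frac{\sigma^2 (X - \bar{X})^2}{\sum (X - \bar{X})^2}$ .... from Note (a), with $COV(\bar{Y}, b_1) = 0$
SE of prediction
$S_{Y - \hat{Y}}^2 = V(Y - \hat{Y}) = V(Y) + V(\hat{Y}) - 2 COV(Y, \hat{Y}) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2} \right]$ .... from the SE of RL above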
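The three quantities can be compared side by side. The sketch below uses the age and cholesterol data from the SLR example, estimates σ² by SSE/(n - 2), and evaluates both standard errors at a hypothetical age x0 = 50 (chosen only for illustration). The critical value 2.228 is the tabulated t(0.975) for 10 degrees of freedom and is hard-coded as an assumption, since the standard library has no t quantile function.

```python
import math

age = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
chol = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350,
        286.3, 287.6, 330.3, 311.3]

T_CRIT = 2.228   # tabulated t(0.975, df = 10); assumption of this sketch
x0 = 50          # hypothetical age at which to estimate / predict

n = len(age)
x_bar = sum(age) / n
y_bar = sum(chol) / n
sxx = sum((x - x_bar) ** 2 for x in age)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, chol))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, chol))

# (1) Standard error of the estimate.
s = math.sqrt(sse / (n - 2))
# (2) SE of the regression line at x0 (for the CI of E(Y) at x0).
se_line = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
# (3) SE of prediction at x0 (for the interval of one new Y at x0).
se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

y_hat = b0 + b1 * x0
ci_mean = (y_hat - T_CRIT * se_line, y_hat + T_CRIT * se_line)
pi_single = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)

print(f"s = {s:.3f}, SE(line) = {se_line:.3f}, SE(pred) = {se_pred:.3f}")
print(f"95% CI for mean CHOL at age {x0}: {ci_mean[0]:.1f} to {ci_mean[1]:.1f}")
print(f"95% interval for one person at age {x0}: "
      f"{pi_single[0]:.1f} to {pi_single[1]:.1f}")
```

The prediction interval is always wider than the confidence interval for the mean, because the SE of prediction adds the full error variance to the variance of the fitted line.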
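Standard deviations around the regression line (3):
Note (a): variance of b1
$V(b_1) = V\left[ \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \right] = V\left[ \frac{\sum (X - \bar{X}) Y}{\sum (X - \bar{X})^2} \right] = \frac{\sum (X - \bar{X})^2}{\left[ \sum (X - \bar{X})^2 \right]^2} V(Y) = \frac{\sigma^2}{\sum (X - \bar{X})^2}$
Note (b): variance of b0
$V(b_0) = V(\bar{Y} - b_1 \bar{X}) = V(\bar{Y}) + \bar{X}^2 V(b_1) - 2 \bar{X} COV(\bar{Y}, b_1) = \frac{\sigma^2}{n} + \frac{\sigma^2 \bar{X}^2}{\sum (X - \bar{X})^2}$ .... from Note (a), with $COV(\bar{Y}, b_1) = 0$
$= \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2} \right)$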
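Example:
The blood cholesterol of 10 men aged 30-39 was measured at baseline (X) and again 10 years later (Y); the two measurements are compared below (source: 彭游生物統計學, 89年, P374). Questions:
What is the regression coefficient? What is the intercept?
What is the correlation coefficient r?
Is the correlation coefficient statistically significant? Given: F0.05(1,8) = 5.32
What proportion of the variation in cholesterol 10 years later is attributable to the variation in cholesterol 10 years earlier?
Is the sample regression coefficient statistically significant?
A man's current cholesterol is 350; predict his cholesterol 10 years later and its 95% CI.
A group of men has a mean cholesterol of 350; what is their cholesterol 10 years later and its 95% CI?
Partial solution: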
Example: partial solution (continued)
Logistic Regression
Topic: predicting a categorical Y
Predicting nominal or categorical outcome
Disease or no disease; death or survival
Odds ratio
Study designs:
Cross-sectional study
Cohort study (Follow-up study)
Case-control study
Clinical trial
Odds ratio
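X = exposure, Y = disease
                 Diseased (+)   Not diseased (-)   Total
Exposed (+)      A              B                  A+B
Unexposed (-)    C              D                  C+D
Total            A+C            B+D                A+B+C+D
Odds are another way of expressing a probability: $\mathrm{odds} = \frac{p(x)}{1 - p(x)}$
Odds are what betting odds express
Odds ratio
Disease rate in the exposed group: p1 = A / (A+B)
Disease rate in the control (unexposed) group: p0 = C / (C+D)
$OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} = \frac{[A/(A+B)] / [B/(A+B)]}{[C/(C+D)] / [D/(C+D)]} = \frac{AD}{BC}$
In World Cup betting, Brazil is quoted at 1 to 1 and China at 1 to 100.
What is the odds ratio of Brazil to China?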
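A short sketch can confirm that the odds ratio computed from the two disease rates equals the cross-product AD/BC. The cell counts below are hypothetical, chosen only to make the arithmetic easy to follow; the function names are likewise illustrative.

```python
def odds(p):
    """Convert a probability into odds: p / (1 - p)."""
    return p / (1 - p)


def odds_ratio_from_rates(a, b, c, d):
    """OR from the disease rates of the exposed and unexposed groups."""
    p1 = a / (a + b)   # exposed group
    p0 = c / (c + d)   # unexposed (control) group
    return odds(p1) / odds(p0)


def odds_ratio_cross_product(a, b, c, d):
    """OR as the cross-product ratio AD / BC."""
    return (a * d) / (b * c)


# Hypothetical 2x2 table: A, B = exposed with/without disease;
# C, D = unexposed with/without disease.
A, B, C, D = 30, 70, 10, 90

or_rates = odds_ratio_from_rates(A, B, C, D)
or_cross = odds_ratio_cross_product(A, B, C, D)
print(f"OR from rates = {or_rates:.4f}, AD/BC = {or_cross:.4f}")
```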
Study designs in epidemiology:
Cross-sectional study
Cohort study (Follow-up study)
Case-control study
Clinical trial
Bias in epidemiology
Selection bias
Information bias
Misclassification
Confounding
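Cross-sectional study
Purpose:
Prevalence surveys
Health administration needs
Key point:
Study subjects must be representative: random sampling
Limitation:
No temporal sequence, so causality cannot be established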
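Case-control study
[Diagram: exposure (E, not E) and disease (D, not D)]
Purpose:
Causal analysis
Compare exposure rates between the case group and the control group
Key points:
Selection of controls
Controls should represent the exposure experience of the source population from which the cases arose
Limitations:
Temporal sequence
Recall bias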
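Cohort study (follow-up study)
[Diagram: exposure (E, not E) followed to disease (D)]
Purpose:
Causal analysis
Compare disease incidence between the exposed and unexposed groups
Key point:
Follow-up
Limitation:
Loss to follow-up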
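Confounding factors
Definition of a confounder:
It is independently associated with the disease; it is itself a risk factor
It is associated with the risk factor under study
A confounder must not be an intermediate (mediating) variable: X1 → X2 → Y
[Diagram: Obesity, Cholesterol, MI]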
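Clinical trial
Purpose: evaluate the effect of an intervention
Interventions: drug treatment, health education
Key points:
Randomization: controls for confounding factors
Placebo effect
Limitation:
Ethical issues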
Relationships among study designs
Case-control study
Matched case-control study
Cohort study
Matched cohort study
Randomized clinical trial
Complete matched cohort study
Causality and correlation
Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + …
covariate, confounder
Logistic regression:
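Simple linear regression: $E(Y \mid x) = \beta_0 + \beta_1 x$, with range $(-\infty, \infty)$
Logistic regression: Y is a binary categorical variable
How can Y be mapped from (0, 1) to $(-\infty, \infty)$?
Logistic transformation
$p(x) = E(Y \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$, with range (0, 1)
$g(x) = \ln \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x$, with range $(-\infty, \infty)$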
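The two directions of the logistic transformation can be sketched as a pair of small functions. The coefficient values used here are arbitrary placeholders for illustration, not estimates from any data in these slides.

```python
import math


def logistic(eta):
    """p = exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """g = ln(p / (1 - p)); maps (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))


beta0, beta1 = -2.0, 0.5   # hypothetical coefficients

for x in (-10, 0, 4, 10):
    eta = beta0 + beta1 * x      # linear predictor, any real number
    p = logistic(eta)            # always strictly between 0 and 1
    print(f"x={x:>3}  eta={eta:6.2f}  p(x)={p:.4f}  logit(p)={logit(p):6.2f}")
```

Applying `logit` to `p(x)` recovers the linear predictor, which is why the model is linear on the log-odds scale even though p(x) itself is bounded.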
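Logistic regression coefficients and OR
OR: exp(beta)
If X is a categorical variable with three or more groups, exp(beta) is the OR compared with the reference group
If X is a continuous variable, exp(beta) is the OR for each one-unit increase in X
If the model has several X variables, the interpretation is the same, with the added condition "holding the other X variables constant"
Example:
X is sex (male x = 1, female x = 0); Y is suicide (yes or no)
$g(1) = \beta_0 + \beta_1; \quad g(0) = \beta_0 \;\Rightarrow\; g(1) - g(0) = \beta_1 = \ln(OR)$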
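The identity g(1) - g(0) = β1 = ln(OR) can be checked with a sketch for a single binary X. For that case the coefficients follow directly from the two group proportions, so no iterative fitting is needed. The counts and the per-unit coefficient below are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


# Hypothetical counts: (events, non-events) for x = 1 and x = 0.
a, b = 30, 70   # x = 1
c, d = 10, 90   # x = 0

p1 = a / (a + b)
p0 = c / (c + d)

beta0 = logit(p0)               # g(0) = beta0
beta1 = logit(p1) - logit(p0)   # g(1) - g(0) = beta1
odds_ratio = math.exp(beta1)    # OR = exp(beta1)

print(f"beta1 = {beta1:.4f}, exp(beta1) = {odds_ratio:.4f}, "
      f"AD/BC = {(a * d) / (b * c):.4f}")

# For a continuous X, a k-unit increase multiplies the odds by exp(k * beta).
beta_per_unit = 0.05            # hypothetical coefficient per unit of X
or_10_units = math.exp(10 * beta_per_unit)
print(f"OR for a 10-unit increase = {or_10_units:.4f}")
```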
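Textbook example: LR
men with unintentional injury
Soderstrom, 1997, Table 10-5, p. 247
Conclusion:
White patients who came to the emergency room on weekend nights had a higher probability of an elevated blood alcohol concentration (BAC > 50 mg/dL);
age showed no statistically significant difference.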
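Relationship among Z, t, F, and χ²
Z² and chi-square
Population mean known:
Definition: $\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2} \sim \chi^2_{(n)}$
Conclusion: $Z^2 = \frac{(\bar{x} - \mu)^2}{\sigma^2 / n} \sim \chi^2_{(1)}$
or
$\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} \frac{(\bar{x}_i - \mu)^2}{\sigma^2 / n} \sim \chi^2_{(n)}$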
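Relationship among Z, t, F, and χ²
F and chi-square
Population mean unknown:
Definition: $\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2} = \frac{(n - 1) s^2}{\sigma^2} \sim \chi^2_{(n-1)}$, so $s^2 \sim \frac{\sigma^2 \chi^2_{(n-1)}}{n - 1}$
Conclusion: $F_{df1, df2} = \frac{s_1^2}{s_2^2}$, and $F_{df1, \infty} = \frac{s_1^2}{\sigma^2} = \frac{\chi^2_{(n-1)}}{n - 1}$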
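Relationship among Z, t, F, and χ²
$F_{\alpha, (df1, df2)}$
$F_{\alpha, (1, df2)} = t^2_{1 - \alpha/2, (df2)}$
$F_{\alpha, (df1, \infty)} = \frac{\chi^2_{\alpha, (df1)}}{df1}$
$F_{\alpha, (1, \infty)} = z^2_{1 - \alpha/2}$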
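Two of these identities can be spot-checked at α = 0.05 with only the standard library. The normal quantile comes from `statistics.NormalDist`; the t and chi-square critical values are hard-coded from standard tables (an assumption of this sketch, since the standard library has no t or chi-square quantile functions), and F0.05(1,8) = 5.32 is the value quoted in the worked example.

```python
from statistics import NormalDist

# z(0.975) from the standard normal distribution.
z = NormalDist().inv_cdf(0.975)

CHI2_05_DF1 = 3.841   # tabulated upper 5% point of chi-square, df = 1
T_975_DF8 = 2.306     # tabulated t(0.975), df = 8
F_05_1_8 = 5.32       # F0.05(1, 8) as quoted in the worked example

# F(1, infinity) = z^2 = chi-square(1) / 1
print(f"z^2 = {z ** 2:.3f}  vs  chi-square(1) = {CHI2_05_DF1}")

# F(1, df2) = t(df2)^2
print(f"t^2 = {T_975_DF8 ** 2:.3f}  vs  F(1, 8) = {F_05_1_8}")
```

The small differences in the last digit come from rounding in the tabulated values.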