Transcript Document

STATISTICS
Regression & Correlation
Outline
- X, Y & Regression Models
- Simple linear regression (SLR)
- The logic of SLR: SST = SSR + SSE
- SLR: ANOVA table & R-square
- Comparison of SLR, ANOVA, and the two-sample t test
- Multiple Linear Regression
- Pearson's correlation coefficient (r)
- Relationships among R², r, and b
- Relationships among Z, t, F, and χ²
X and Y
X: predictor variables (predictors); covariates; explanatory variables; independent variables.
Y: outcome; response; dependent variable.
Univariate analysis: 1 X → 1 Y
X | Y | Comparison | Method
Binary | Num., normal | 2 indep. means | Two-sample t test*
Categorical | Num., normal | >= 2 indep. means | One-way ANOVA*
Binary | Num., non-normal | 2 indep. medians | Wilcoxon rank sum
Categorical | Num., non-normal | >= 2 indep. medians | Kruskal-Wallis
Num., normal | Num., normal | - | Regression*
Num., normal | Num., normal | 2 related means | Paired t
Num., non-normal | Num., non-normal | 2 related medians | Wilcoxon signed rank
Categorical | Categorical | X related to Y | Pearson's Chi-sq
Categorical, binary | Categorical, binary | 2 related prop. | McNemar Chi-sq
Categorical, binary | Categorical, binary | 2 indep. prop. | Pearson's Chi-sq
Categorical, binary | Categorical, binary | 2 indep. prop. | 2-Z (two-sample Z test)
Note: methods marked with * require the following assumptions: normality and independence.
Abbreviations: Cat.: categorical; Num.: numerical.
Multivariate analysis: Xs → 1 Y
Xs | Y | Methods
Categorical | Cat. | Log-linear
Cat. + Num. | Cat. (binary) | Logistic regression
Cat. + Num. | Cat. (>= 3) | Logistic regression; Discriminant analysis*; Cluster analysis; Propensity scores; CART
Cat. | Num. | ANOVA*; MANOVA*
Num. | Num. | Multiple regression*
Cat. + Num. | Num. (censored) | Cox proportional hazards model
Confounding factors | Num. | ANCOVA*; MANOVA*; GEE*
Confounding factors | Cat. | Mantel-Haenszel
Num. | - | Factor analysis
Note: methods marked with * require the following assumptions: multivariate normality and independence.
Abbreviations: Cat.: categorical; Num.: numerical; CART: classification and regression tree; ANOVA: analysis of variance; ANCOVA: analysis of covariance; MANOVA: multivariate analysis of variance; GEE: generalized estimating equations.
Regression Models
Mathematical models to describe the relationship between Y and X.
Uses of a regression model:
- Adjustment
- Prediction
- Finding important factors for Y
Regression Models
Definition:
- Mathematical models to describe the relationship between Y and X
Purpose of a regression model:
- Find important factors for Y, and/or
- Prediction
Simple linear regression (SLR)
Model:
$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
$$E(Y) = \mu_Y = \beta_0 + \beta_1 X$$
SLR Example
Is there a linear relationship between age and cholesterol?
ID | AGE | CHOL
1 | 34 | 141.4
2 | 39 | 180.5
3 | 44 | 178.4
4 | 46 | 212
5 | 48 | 203.2
6 | 51 | 224.1
7 | 53 | 186
8 | 60 | 350
9 | 61 | 286.3
10 | 65 | 287.6
11 | 66 | 330.3
12 | 67 | 311.3
SLR: parameter estimation
The least squares method:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{N} (Y_i - \beta_0 - \beta_1 X_i)^2$$
Point estimates:
$\hat\beta_0$: estimated intercept
$\hat\beta_1$: estimated slope
The logic of SLR: SST=SSR+SSE
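Fitted line: $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$
[Figure: scatter of Y against X showing the fitted line and the horizontal line at $\bar{Y}$; at each $X_i$ the deviation from the mean is split into two parts]
$Y_i - \bar{Y}$: total amount unexplained at $X_i$ (deviation from the mean)
$Y_i - \hat{Y}_i$: amount at $X_i$ unexplained by regression
$\hat{Y}_i - \bar{Y}$: amount at $X_i$ explained by regression
$$\sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y} + \hat{Y} - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2$$
SST = SSE + SSR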
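The decomposition on this slide silently drops a cross term. A short sketch of why it vanishes, assuming the line is fitted by least squares with an intercept (the residual notation $e_i = Y_i - \hat{Y}_i$ is introduced here and is not on the slide):

```latex
\begin{align*}
\sum (Y_i - \bar{Y})^2
  &= \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2
     + 2 \sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) \\
\sum e_i (\hat{Y}_i - \bar{Y})
  &= (\hat\beta_0 - \bar{Y}) \sum e_i + \hat\beta_1 \sum e_i X_i = 0
\end{align*}
```

The last equality uses the two least-squares normal equations, $\sum e_i = 0$ and $\sum e_i X_i = 0$, so only SSE and SSR remain.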
SLR: parameter estimation
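The least squares method
- Minimize SSE:
$$S = \sum (Y - \hat{Y})^2 = \sum \varepsilon_i^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$$
Point estimates
- Take the partial derivatives of S with respect to the intercept and the slope, set each to zero, and solve for the intercept and slope.
Intercept:
$$\frac{\partial S}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Rightarrow\; b_0 = \bar{Y} - b_1 \bar{X}$$
Slope:
$$\frac{\partial S}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Rightarrow\; b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$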
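The closed-form estimates can be checked against the age/cholesterol example data from the SLR example slide. This is a minimal sketch in plain Python; the function name `fit_slr` is introduced here for illustration and is not part of the slides.

```python
# Age (years) and cholesterol values from the SLR example table in this deck.
AGE = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
CHOL = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350, 286.3, 287.6, 330.3, 311.3]


def fit_slr(x, y):
    """Return (b0, b1) from the closed-form least-squares solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_slr(AGE, CHOL)
print(f"CHOL = {b0:.4f} + {b1:.4f} * Age")
```

The printed coefficients should agree with the estimated model reported for this example (intercept about -57.60, slope about 5.65).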
SLR example: Regression line
[Figure: CHOL vs Age. Scatter plot of CHOL (y-axis, 100 to 350) against Age (x-axis, 30 to 70) with the fitted regression line]
Estimated model: CHOL = (-57.5964988786446) + (5.65024919013205) * (Age)
SLR: ANOVA table & R-square
Source | DF | SS | MS | F | p | Power (5%)
Intercept | 1 | 696538.3 | 696538.3 | | |
Slope | 1 | 42705.43 | 42705.43 | 45.4538 | 0.0001 | 1.0000
Error | 10 | 9395.352 | 939.5352 | | |
Adj. Total | 11 | 52100.78 | 4736.435 | | |
Total | 12 | 748639.1 | | | |
R² = 0.82, p = 0.0001
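The sums of squares, F statistic, and R² in the table can be reproduced from the age/cholesterol example data. A minimal sketch in plain Python follows; the variable names are chosen here for illustration.

```python
# Age and cholesterol values from the SLR example table in this deck.
AGE = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
CHOL = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(CHOL) / n

# Least-squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, CHOL))
      / sum((x - x_bar) ** 2 for x in AGE))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in AGE]

sst = sum((y - y_bar) ** 2 for y in CHOL)               # "Adj. Total" row
ssr = sum((f - y_bar) ** 2 for f in fitted)             # "Slope" row
sse = sum((y - f) ** 2 for y, f in zip(CHOL, fitted))   # "Error" row
ss_intercept = n * y_bar ** 2                           # "Intercept" row

mse = sse / (n - 2)      # error mean square, DF = n - 2
f_stat = ssr / mse       # slope mean square (DF = 1) over error mean square
r_squared = ssr / sst

print(f"SST={sst:.2f} SSR={ssr:.2f} SSE={sse:.3f} F={f_stat:.4f} R2={r_squared:.2f}")
```

The unadjusted "Total" row is the raw sum of squared Y values, which equals the intercept sum of squares plus SST.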
SLR: qualitative covariate
- Example:
  - X = treatment, 1 or 0
  - Y = SBP
- Hypotheses:
  - H0: β1 = 0
  - H1: β1 ≠ 0
- Comparison with the test of means:
  - H0: μ1 = μ0
  - H1: μ1 ≠ μ0
- Note: β1 = μ1 - μ0
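The identity β1 = μ1 - μ0 can be seen numerically: with a 0/1 covariate, the least-squares slope equals the difference between the two group means. The SBP values below are made up for illustration only and do not come from the slides.

```python
# Hypothetical SBP readings (illustrative only, not from the slides).
control = [140, 135, 150, 145]    # X = 0
treated = [130, 128, 136, 134]    # X = 1

x = [0] * len(control) + [1] * len(treated)
y = control + treated

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least-squares slope and intercept with the 0/1 covariate.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

mean_control = sum(control) / len(control)
mean_treated = sum(treated) / len(treated)

print(b0, mean_control)                 # intercept equals the X = 0 group mean
print(b1, mean_treated - mean_control)  # slope equals the difference in means
```

Testing H0: β1 = 0 in this regression is therefore the same question as testing H0: μ1 = μ0.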
Comparison of SLR, ANOVA, and the two-sample t test
- Two-sample t → ANOVA
- Two-sample t → SLR: H0: μ1 = μ0 → H0: β1 = 0
- Dummy variables: K groups need K-1 dummy variables
ID | Y | X (group) | X (dummy)
1 | 140 | A | 0
2 | 135 | B | 1
- ANOVA → SLR: H0: μ1 = μ2 = μ3 → H0: β1 = β2 = 0
ID | Y | X (group) | X1 | X2
1 | 140 | A | 0 | 0
2 | 135 | B | 0 | 1
3 | 130 | C | 1 | 0
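The K-1 rule can be sketched as a small helper that turns group labels into indicator columns, leaving out one reference group. The function name `dummy_code` and its column ordering are assumptions made here; the slide orders its columns differently (X1 marks group C, X2 marks group B), which is an equivalent coding.

```python
def dummy_code(groups, reference):
    """Return (levels, rows): one 0/1 column per non-reference level.

    K distinct groups give K - 1 columns; the reference group is the
    row of all zeros.
    """
    levels = sorted(set(groups) - {reference})
    rows = [[1 if g == level else 0 for level in levels] for g in groups]
    return levels, rows


# Group labels from the three-group table on this slide.
levels, rows = dummy_code(["A", "B", "C"], reference="A")
print(levels)  # columns are indicators for B and C
for row in rows:
    print(row)
```

Group A, the reference, is coded as all zeros, so each regression coefficient compares one group with A.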
Multiple Linear Regression
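- Model:
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$$
$$E(Y) = \mu_Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
$$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_p X_p$$
- Example: Is Age a predictor for SBP adjusting for Sex?
$$\hat{Y} = \hat\beta_0 + \hat\beta_1\,\mathrm{AGE} + \hat\beta_2\,\mathrm{SEX}$$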
MLR: example
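[Figure: SBP (y-axis) against Age (x-axis) with two parallel fitted lines, one for males and one for females]
Male: $\hat{Y} = \hat\beta_0^* + \hat\beta_1\,\mathrm{AGE}$
Female: $\hat{Y} = \hat\beta_0 + \hat\beta_1\,\mathrm{AGE}$
Vertical distance between the lines: $\hat\beta_0^* - \hat\beta_0$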
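The parallel-lines picture can be sketched by fitting SBP on AGE and SEX through the normal equations. Everything below is an illustration under stated assumptions: the data are made up and noise-free, SEX is assumed to be coded 1 = male and 0 = female, and the helpers `solve` and `fit_mlr` are hypothetical names written for this sketch.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mlr(columns, y):
    """Least squares for y = b0 + b1*x1 + ... via the normal equations X'X b = X'y."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Made-up data (not from the slides); SBP is generated without noise.
age = [30, 40, 50, 60, 35, 45, 55, 65]
sex = [0, 0, 0, 0, 1, 1, 1, 1]           # assumed coding: 1 = male, 0 = female
sbp = [110 + 0.6 * a + 8 * s for a, s in zip(age, sex)]

b0, b1, b2 = fit_mlr([age, sex], sbp)
female_intercept = b0        # line for SEX = 0
male_intercept = b0 + b2     # line for SEX = 1, same slope b1
print(b0, b1, b2)
```

Under this coding both groups share the slope b1 and the vertical gap between the two lines is b2, which is the age-adjusted sex difference.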
Pearson’s correlation coefficient (r)
- Relationship between X and Y:
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
- Properties of Pearson's r
  - Range: $-1 \le r \le 1$
  - Unitless
  - Good for normally distributed X and Y
  - r can be viewed as the cosine of the angle between two (mean-centered) vectors in multidimensional space
- Spearman's correlation coefficient
  - Pearson's r for ranked X and Y
  - Good for non-normally distributed X and Y
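Pearson's r and its cosine interpretation can be checked on the age/cholesterol example data from the SLR example slide. A minimal sketch in plain Python; the helper names are introduced here for illustration.

```python
import math

# Age and cholesterol values from the SLR example table in this deck.
AGE = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
CHOL = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350, 286.3, 287.6, 330.3, 311.3]


def center(values):
    """Subtract the mean from every element."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pearson_r(x, y):
    """Pearson's r from the sums of cross-products and squares."""
    xc, yc = center(x), center(y)
    numerator = sum(a * b for a, b in zip(xc, yc))
    denominator = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    return numerator / denominator


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


r = pearson_r(AGE, CHOL)
cos_angle = cosine(center(AGE), center(CHOL))
print(f"r = {r:.4f}, cosine of centered vectors = {cos_angle:.4f}")
```

The two numbers coincide, and r squared matches the R² of about 0.82 reported for this example.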
Spearman’s Rho: rank correlation
- Relationship between X and Y:
$$r_s = \frac{\sum (R_X - \bar{R}_X)(R_Y - \bar{R}_Y)}{\sqrt{\sum (R_X - \bar{R}_X)^2 \sum (R_Y - \bar{R}_Y)^2}}$$
$$t = \frac{r_s \sqrt{n - 2}}{\sqrt{1 - r_s^2}}$$
- Spearman's correlation coefficient
  - Pearson's r for ranked X and Y
  - Good for non-normally distributed X and Y
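Spearman's rho is Pearson's r applied to ranks, which can be sketched on the age/cholesterol example data. The helper `ranks` below assumes there are no tied values, which holds for this data set; handling ties with average ranks is left out of the sketch.

```python
import math

# Age and cholesterol values from the SLR example table in this deck.
AGE = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
CHOL = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350, 286.3, 287.6, 330.3, 311.3]


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def pearson_r(x, y):
    """Pearson's r from centered sums of cross-products and squares."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


n = len(AGE)
r_s = pearson_r(ranks(AGE), ranks(CHOL))
# Test statistic from the slide, referred to a t distribution with n - 2 df.
t_stat = r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s ** 2)
print(f"r_s = {r_s:.4f}, t = {t_stat:.3f}")
```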
Assumptions in Regression
- Linear
- Independent
- Normal distribution
- Equal variance
Note: for all values of x, the errors ε
- are independent,
- are normally distributed,
- have the same SD, σ = σ(ε),
- have mean μ = 0.
[Figure: Weight (y-axis) against Height (x-axis) with the line y = α + βx]
$Y_i = \alpha + \beta X_i + \varepsilon_i$
α and β are the unknown parameters
ε = random error fluctuations
Relationships among R², r, and b
- r and b
$$R^2 = \frac{SSR}{\sum (Y - \bar{Y})^2} = 1 - \frac{SSE}{SST} = r^2$$
$$r^2 = \frac{\left[\sum (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2} = b_1^2\,\frac{\sum (x - \bar{x})^2}{\sum (y - \bar{y})^2}$$
$$r = b_1\,\frac{SD_X}{SD_Y}, \qquad b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
- r²: coefficient of determination
  - The proportion of the variability among the observed values of Y that is explained by the linear regression of Y on X.
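These identities can be verified numerically on the age/cholesterol example data from the SLR example slide. A minimal sketch in plain Python, with variable names chosen here for illustration:

```python
import math

# Age and cholesterol values from the SLR example table in this deck.
AGE = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
CHOL = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350, 286.3, 287.6, 330.3, 311.3]

n = len(AGE)
x_bar = sum(AGE) / n
y_bar = sum(CHOL) / n
sxx = sum((x - x_bar) ** 2 for x in AGE)
syy = sum((y - y_bar) ** 2 for y in CHOL)     # SST
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(AGE, CHOL))

b1 = sxy / sxx
sd_x = math.sqrt(sxx / (n - 1))
sd_y = math.sqrt(syy / (n - 1))

r = b1 * sd_x / sd_y                          # r = b1 * SD_X / SD_Y
sse = sum((y - (y_bar + b1 * (x - x_bar))) ** 2 for x, y in zip(AGE, CHOL))
r_squared = 1 - sse / syy                     # R^2 = 1 - SSE / SST

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```

Because both standard deviations are positive, r and b1 always share the same sign.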
Relationship between r and b: always the same sign
- Large r, small b
- Small r, large b
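Standard deviations associated with the regression line (1):
(1) Standard error of estimate (SE of estimate)
    Term used by 楊志良: standard deviation of the regression line
    Meaning: the spread of the vertical distances between any observed Y and the regression line; the standard deviation computed with the regression line in place of the mean.
(2) Standard error of the regression line (SE of RL; the standard deviation of the sampling distribution of Ŷ)
    Term used by 楊志良: standard error of the regression line
    Meaning: the standard error of Ŷ over repeated samples taken at the same X values, that is, the standard deviation of the second-level normal distribution (of Ŷ); used for the CI of a single E(Y).
(3) Standard error of prediction (SE of prediction)
    Term used by 楊志良: standard error of estimate (**this term is easily confused)
    Meaning: the standard error when predicting Y from one X, that is, the standard deviation of the first-level normal distribution of Y at a given X.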
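Standard deviations associated with the regression line (2):
The standard error of the estimate
$$S^2_{Y \cdot X} = V(Y) = \hat\sigma^2 = \frac{\sum (Y - \hat{Y})^2}{n - 2} = \frac{\sum (Y - \bar{Y})^2 - b_1^2 \sum (X - \bar{X})^2}{n - 2} = \frac{\sum (Y - \bar{Y})^2 (1 - r^2)}{n - 2}$$
SE of RL
$$S^2_{\hat{Y}} = V(\hat{Y}) = V(b_0 + b_1 X) = V[\bar{Y} + b_1 (X - \bar{X})] = V(\bar{Y}) + V[b_1 (X - \bar{X})] + 2 (X - \bar{X})\,\mathrm{Cov}(\bar{Y}, b_1)$$
$$= \frac{\sigma^2}{n} + \frac{\sigma^2 (X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \quad \text{(from Note (a), with } \mathrm{Cov}(\bar{Y}, b_1) = 0\text{)}$$
SE of prediction
$$S^2_{Y - \hat{Y}} = V(Y - \hat{Y}) = V(Y) + V(\hat{Y}) - 2\,\mathrm{Cov}(Y, \hat{Y}) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] \quad \text{(from (2) above; a new } Y \text{ is independent of } \hat{Y}\text{)}$$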
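The three quantities can be computed side by side on the age/cholesterol example data from the SLR example slide. This is a minimal sketch; the function name `regression_ses` and the evaluation point x0 = 50 are choices made here for illustration, and σ² is replaced by its estimate SSE / (n - 2).

```python
import math

# Age and cholesterol values from the SLR example table in this deck.
AGE = [34, 39, 44, 46, 48, 51, 53, 60, 61, 65, 66, 67]
CHOL = [141.4, 180.5, 178.4, 212, 203.2, 224.1, 186, 350, 286.3, 287.6, 330.3, 311.3]


def regression_ses(x, y, x0):
    """Return (SE of estimate, SE of regression line at x0, SE of prediction at x0)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                          # estimate of sigma^2
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx    # 1/n + (X - Xbar)^2 / Sxx
    se_estimate = math.sqrt(s2)
    se_line = math.sqrt(s2 * spread)            # for the CI of E(Y) at x0
    se_prediction = math.sqrt(s2 * (1 + spread))  # for predicting one new Y at x0
    return se_estimate, se_line, se_prediction


se_est, se_line, se_pred = regression_ses(AGE, CHOL, x0=50)
print(f"SE of estimate={se_est:.2f}, SE of RL={se_line:.2f}, SE of prediction={se_pred:.2f}")
```

The prediction variance is the sum of the other two variances, which is why an interval for a single new observation is always wider than the interval for the mean response at the same X.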
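Standard deviations associated with the regression line (3):
Note (a): variance of b1
$$V(b_1) = V\left[ \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \right] = V\left[ \frac{\sum (X - \bar{X})\,Y}{\sum (X - \bar{X})^2} \right] = \frac{\sum (X - \bar{X})^2\, V(Y)}{\left[ \sum (X - \bar{X})^2 \right]^2} = \frac{\sigma^2}{\sum (X - \bar{X})^2}$$
Note (b): variance of b0
$$V(b_0) = V(\bar{Y} - b_1 \bar{X}) = V(\bar{Y}) + \bar{X}^2\, V(b_1) - 2 \bar{X}\,\mathrm{Cov}(\bar{Y}, b_1) = \frac{\sigma^2}{n} + \frac{\bar{X}^2 \sigma^2}{\sum (X - \bar{X})^2} \quad \text{(from Note (a))}$$
$$= \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2} \right)$$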
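Example:
- Blood cholesterol was measured in 10 men aged 30-39 at baseline (X) and again 10 years later (Y). The two measurements are compared below (source: 彭游, 生物統計學, 89年, p. 374). Questions:
  - What is the regression coefficient? What is the intercept?
  - What is the correlation coefficient r?
  - Is the correlation coefficient statistically significant? Given: F0.05(1,8) = 5.32.
  - How much of the variation in cholesterol 10 years later is accounted for by the variation in cholesterol 10 years earlier?
  - Is the sample regression coefficient statistically significant?
  - A man's current cholesterol is 350. Predict his cholesterol 10 years later and its 95% CI.
  - A group of men has a mean cholesterol of 350. What is their cholesterol 10 years later and its 95% CI?
- Partial solution: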
Example: partial solution (continued)
Logistic Regression
Topic: prediction when Y is a categorical variable
- Predicting a nominal or categorical outcome
- Diseased or not; died or not
- Odds ratio (OR)
Study designs:
- Cross-sectional study
- Cohort study (follow-up study)
- Case-control study
- Clinical trial
Odds ratio
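X (exposure) by Y (disease):
 | Diseased (+) | Not diseased (-) | Total
Exposed (+) | A | B | A+B
Unexposed (-) | C | D | C+D
Total | A+C | B+D | A+B+C+D
- Odds are another way of expressing a probability:
$$\text{odds} = \frac{p(x)}{1 - p(x)}$$
- Odds correspond to betting odds.
- Odds ratio (OR)
  - Incidence in the exposed group: p1 = A / (A+B)
  - Incidence in the control (unexposed) group: p0 = C / (C+D)
$$OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)} = \frac{[A/(A+B)] \,/\, [B/(A+B)]}{[C/(C+D)] \,/\, [D/(C+D)]} = \frac{AD}{BC}$$
- In World Cup betting, Brazil pays 1 to 1, while China pays 100 to 1.
  - What is the odds ratio of Brazil to China?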
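The step from the two group probabilities to AD/BC can be sketched for a 2x2 table. The cell counts below are hypothetical and chosen only to illustrate the arithmetic; the function names are introduced here.

```python
def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: exposed row (a, b), unexposed row (c, d)."""
    p1 = a / (a + b)   # proportion diseased among the exposed
    p0 = c / (c + d)   # proportion diseased among the unexposed
    return odds(p1) / odds(p0)


# Hypothetical counts (not from the slides).
a, b, c, d = 30, 70, 10, 90
or_from_probabilities = odds_ratio(a, b, c, d)
or_cross_product = (a * d) / (b * c)
print(or_from_probabilities, or_cross_product)
```

Both routes give the same number because the row totals A+B and C+D cancel in the ratio.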
Epidemiological study designs:
- Cross-sectional study
- Cohort study (follow-up study)
- Case-control study
- Clinical trial
Bias in epidemiology
- Selection bias
- Information bias
  - Misclassification
- Confounding
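Cross-sectional study
Purpose:
- Prevalence surveys
- Health administration needs
Key point:
- Study subjects must be representative: random sampling
Limitation:
- No temporal sequence, so causality cannot be established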
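Case-control study
[Diagram with labels E (exposure) and D (disease)]
- Purpose:
  - Causal analysis
  - Compare exposure rates between the case group and the control group
- Key points:
  - Selection of the control group
  - The control group must represent the exposure experience of the source population from which the cases arise
- Limitations:
  - Temporal sequence
  - Recall bias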
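Cohort study (follow-up study)
[Diagram with labels E (exposure) and D (disease)]
Purpose:
- Causal analysis
- Compare disease incidence between the exposed and unexposed groups
Key point:
- Follow-up
Limitation:
- Loss to follow-up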
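Confounding factors
Definition of a confounder:
- It is independently associated with the disease; it is itself a risk factor
- It is associated with the risk factor under study
- It must not be an intermediate variable: X1 → X2 → Y
[Diagram: Obesity, Cholesterol, MI]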
Clinical trial
Purpose: evaluate the effect of an intervention
- Interventions: drug treatment, health education
Key points:
- Randomization: controls for confounding factors
- Placebo effect
Limitation:
- Ethical issues
Relationships among study designs
- Case-control study
  - Matched case-control study
- Cohort study
  - Matched cohort study
- Randomized clinical trial
  - Complete matched cohort study
- Causality and correlation
  - Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + ...
  - covariate, confounder
Logistic regression:
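Simple linear regression: $E(Y \mid x) = \beta_0 + \beta_1 x \in (-\infty, \infty)$
Logistic regression: Y is a binary categorical variable
- How can Y be mapped from (0, 1) to (-∞, ∞)?
- Logistic transformation:
$$p(x) = E(Y \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \in (0, 1)$$
$$g(x) = \ln\left[ \frac{p(x)}{1 - p(x)} \right] = \beta_0 + \beta_1 x \in (-\infty, \infty)$$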
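Logistic regression coefficients and the OR
- OR: exp(beta)
  - If X is a categorical variable with three or more groups, exp(beta) is the OR relative to the reference group
  - If X is a continuous variable, exp(beta) is the OR for each one-unit increase in X
  - If the model has several X variables, the interpretation is the same, with the added condition "holding the other X variables constant"
- Example:
  - X is sex, with male x = 1 and female x = 0; Y is suicide (yes or no)
$$g(1) = \beta_0 + \beta_1; \quad g(0) = \beta_0 \;\Rightarrow\; g(1) - g(0) = \beta_1 = \ln(OR)$$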
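For a single binary X, the relation g(1) - g(0) = ln(OR) can be checked directly from a 2x2 table, because the fitted log odds in each group are just the logits of the observed proportions. The counts below are hypothetical and the variable names are chosen here for illustration.

```python
import math


def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1 - p))


# Hypothetical counts (not from the slides):
# x = 1 row: a with the outcome, b without; x = 0 row: c with, d without.
a, b = 30, 70
c, d = 10, 90

beta0 = logit(c / (c + d))           # g(0): log odds when x = 0
beta1 = logit(a / (a + b)) - beta0   # g(1) - g(0)

or_from_beta = math.exp(beta1)
or_from_table = (a * d) / (b * c)
print(or_from_beta, or_from_table)
```

Exponentiating the coefficient recovers the cross-product ratio AD/BC from the odds ratio slide.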
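Textbook example: LR
Men with unintentional injury
- Soderstrom, 1997; Table 10-5, p. 247
Conclusions:
- White men who came to the emergency room on weekend nights were more likely to have an elevated blood alcohol concentration (BAC > 50 mg/dL).
- Age showed no statistically significant difference.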
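Relationships among Z, t, F, and χ²
- Z² and chi-square
- Population mean known:
Definition:
$$\sum_{i=1}^{n} Z_i^2 = \chi^2_{(n)} = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}$$
Conclusion:
$$Z^2 = \chi^2_{(1)} = \frac{(\bar{x} - \mu)^2}{\sigma^2 / n}$$
or
$$\sum_{i=1}^{n} Z_i^2 = \chi^2_{(n)} = \sum_{i=1}^{n} \frac{(\bar{x}_i - \mu)^2}{\sigma^2 / n}$$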
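Relationships among Z, t, F, and χ²
- F and chi-square
- Population mean unknown:
Definition:
$$\sum_{i=1}^{n} Z_i^2 = \chi^2_{(n-1)} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2} = \frac{(n - 1) s^2}{\sigma^2} \;\Rightarrow\; \frac{s^2}{\sigma^2} = \frac{\chi^2_{(n-1)}}{n - 1}$$
Conclusion (assuming equal population variances):
$$F_{df_1, df_2} = \frac{s_1^2}{s_2^2} = \frac{\chi^2_{df_1} / df_1}{\chi^2_{df_2} / df_2}, \qquad F_{df_1, \infty} = \frac{\chi^2_{df_1}}{df_1}$$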
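Relationships among Z, t, F, and χ²
Starting from $F_{\alpha,(df_1, df_2)}$:
$$F_{\alpha,(1, df_2)} = t^2_{1-\alpha/2,(df_2)}$$
$$F_{\alpha,(df_1, \infty)} = \frac{\chi^2_{\alpha,(df_1)}}{df_1}$$
$$F_{\alpha,(1, \infty)} = z^2_{1-\alpha/2}$$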
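Two of these relations can be spot-checked with the Python standard library at α = 0.05. The normal quantile is computed; the t and chi-square critical values are hard-coded from standard printed tables (an assumption of this sketch, since the standard library has no t or chi-square quantile function), and F0.05(1,8) = 5.32 is the value quoted in the worked example earlier in this deck.

```python
from statistics import NormalDist

alpha = 0.05

# z_{1 - alpha/2} computed from the standard normal distribution.
z = NormalDist().inv_cdf(1 - alpha / 2)
z_squared = z ** 2

# Critical values taken from standard statistical tables (hard-coded here).
CHI2_CRIT_DF1 = 3.841     # upper 5% point of chi-square with 1 df
T_CRIT_DF8 = 2.306        # t_{0.975} with 8 df
F_CRIT_1_8 = 5.32         # F_{0.05}(1, 8), as quoted in the example slide

t_squared = T_CRIT_DF8 ** 2

print(f"z^2 = {z_squared:.4f}  vs chi-square(1) = {CHI2_CRIT_DF1}")
print(f"t^2 = {t_squared:.4f}  vs F(1, 8)       = {F_CRIT_1_8}")
```

Up to table rounding, the squared z quantile matches the chi-square(1) critical value (which is also F with 1 and infinite degrees of freedom), and the squared t quantile matches F(1, 8).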