Lecture 2: Alternate Correlation Procedures
EPSY 640
Texas A&M University
CORRELATION MEASURES FOR VARIOUS SCALES OF MEASUREMENT (X by Y)

• Nominal (dichotomous) x Nominal (dichotomous): phi coefficient, Yule's Q, Goodman's lambda, tetrachoric
• Nominal (polychotomous) x Nominal (dichotomous): Pearson Association, Tschuprow's T
• Nominal (polychotomous) x Nominal (polychotomous): Pearson Association, Tschuprow's T, Cramer's C
• Ordinal x Nominal (dichotomous): rank-biserial
• Ordinal x Nominal (polychotomous): reduce to dichotomous, or a Kruskal-Wallis based statistic
• Ordinal x Ordinal: Spearman, Kendall's tau
• Interval/Ratio x Nominal (dichotomous): point-biserial, biserial, Nagelkerke R-square (logistic)
• Interval/Ratio x Nominal (polychotomous): R-square
• Interval/Ratio x Ordinal: R-square
• Interval/Ratio x Interval/Ratio: Pearson r
Dichotomous-Dichotomous Case: PHI COEFFICIENT

                MINORITY STATUS
GENDER          a     b
                c     d

The phi coefficient can be written as

rphi = (ad – bc) / [(a+c)(b+d)(a+b)(c+d)]^1/2
Dichotomous-Dichotomous Case: PHI COEFFICIENT

                 Political affiliation
Gender               7        2       Row total:     9
                     2       10                     12
Column total:        9       12       Total:        21

rphi = (7×10 – 2×2) / [(7+2)(2+10)(7+2)(2+10)]^1/2
     = (70 – 4) / [9×12×9×12]^1/2
     = 66/108 = .611

Pearson r = .157/(.5071 × .5071) = .611
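As a check, the phi coefficient for this table can be computed directly from the four cells; a minimal Python sketch (not SPSS):

```python
from math import sqrt

# 2 x 2 table: gender by political affiliation
a, b = 7, 2
c, d = 2, 10

# phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 3))  # .611, matching the hand computation
```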
CHI-SQUARE:

χ² = Σ(i=1..I) Σ(j=1..J) n.j (nij/n – n.j/n)² / (n.j/n)

   = 9[7/21 – 9/21]²/(9/21) + 9[2/21 – 9/21]²/(9/21)
     + 12[2/21 – 12/21]²/(12/21) + 12[10/21 – 12/21]²/(12/21)
   = 4/21 + 49/21 + 100/21 + 4/21
   = 157/21
   = 7.476
PEARSON ASSOCIATION:

P = {χ² / (χ² + n)}^1/2
  = {7.476/28.476}^1/2
  = .512
TSCHUPROW'S T:

T = {χ² / (n[(r–1)(c–1)]^1/2)}^1/2
  = {7.476 / (21 × [1 × 1]^1/2)}^1/2
  = .597
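The chi-square and the two association measures above can be reproduced in a short Python sketch that mirrors the slide's cell-by-cell arithmetic (a check on the hand computation, not SPSS output):

```python
from math import sqrt

n = 21
# cell-by-cell terms exactly as computed on the slide
chi2 = (9 * (7/21 - 9/21)**2 / (9/21)
        + 9 * (2/21 - 9/21)**2 / (9/21)
        + 12 * (2/21 - 12/21)**2 / (12/21)
        + 12 * (10/21 - 12/21)**2 / (12/21))   # = 157/21

# Pearson association (contingency) coefficient
P = sqrt(chi2 / (chi2 + n))

# Tschuprow's T for an r x c table; here r = c = 2
r_rows, c_cols = 2, 2
T = sqrt(chi2 / (n * sqrt((r_rows - 1) * (c_cols - 1))))
print(round(chi2, 3), round(P, 3), round(T, 3))
```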
SPSS Crosstabs procedure

• Select “Analyze / Descriptive Statistics / Crosstabs”
• Select “Row” and “Column” variables for the two nominal variables
• Under “Statistics”, select the options you want, such as “Chi Square” and the various “Nominal” association measures
Case Processing Summary

                        Cases
             Valid           Missing         Total
             N      Percent  N     Percent   N      Percent
race * sex   5299   96.6%    185   3.4%      5484   100.0%
race * sex Crosstabulation (Count)

race                     1 male    2 female    Total
1                        43        42          85
2 hispanic               416       411         827
3 african american       211       162         373
4                        31        40          71
5 white                  1970      1897        3867
6                        37        39          76
Total                    2708      2591        5299
Chi-Square Tests

                               Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square             6.470a   5    .263
Likelihood Ratio               6.489    5    .262
Linear-by-Linear Association   .122     1    .727
N of Valid Cases               5299

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 34.72.
Symmetric Measures

                                             Value   Approx. Sig.
Nominal by Nominal   Phi                     .035    .263
                     Cramer's V              .035    .263
                     Contingency Coefficient .035    .263
N of Valid Cases                             5299

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
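The symmetric measures can be recovered from the chi-square value and n reported above; a Python sketch using the values from the SPSS output:

```python
from math import sqrt

chi2 = 6.470   # Pearson chi-square from the table above
n = 5299       # valid cases
k = 2          # min(number of rows, number of columns) for the 6 x 2 table

phi = sqrt(chi2 / n)
cramers_v = sqrt(chi2 / (n * (k - 1)))
contingency = sqrt(chi2 / (chi2 + n))
print(round(phi, 3), round(cramers_v, 3), round(contingency, 3))  # all .035
```

With only two columns, min(r − 1, c − 1) = 1, so phi and Cramer's V coincide here.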
[Figure: plot of the four possible (x, y) points for two dichotomies: (0,0), (0,1), (1,0), (1,1)]
TETRACHORIC ASSUMPTIONS: underlying normality of the observed dichotomies

n11 = 70    n12 = 20
n21 = 20    n22 = 100

ux = height of normal curve for proportion 90/210 = U(.4286)
The z-score for .4286 is –.18
ux = U(.4286) = .3637 (requires a stat table or the SPSS function PDF.NORMAL)
uy = height of normal curve for proportion 90/210 = U(.4286) = .3637

rtet = [(70 × 100) – (20 × 20)] / [.3637 × .3637 × 210²]
     = 6600 / (.1323 × 210²)
     = 1.13, not a good estimate!
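The rough tetrachoric estimate above can be reproduced in Python; the normal-curve heights u are taken as given from the table lookup on the slide (this is a sketch of the slide's arithmetic, not a full tetrachoric routine):

```python
n11, n12, n21, n22 = 70, 20, 20, 100
n = n11 + n12 + n21 + n22   # 210

u_x = 0.3637  # height of the normal curve for the marginal proportion (table lookup)
u_y = 0.3637

r_tet = (n11 * n22 - n12 * n21) / (u_x * u_y * n**2)
print(round(r_tet, 2))  # 1.13 -- outside [-1, 1], so not a good estimate
```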
Table 3.4: Computation of tetrachoric correlation
(normal density height u at each value y)

  y     u       y     u       y     u       y     u
 0.00  0.3989
 0.02  0.3989  0.52  0.3485  1.02  0.2371  1.52  0.1257
 0.04  0.3986  0.54  0.3448  1.04  0.2323  1.54  0.1219
 0.06  0.3982  0.56  0.3410  1.06  0.2275  1.56  0.1182
 0.08  0.3977  0.58  0.3372  1.08  0.2227  1.58  0.1145
 0.10  0.3970  0.60  0.3332  1.10  0.2179  1.60  0.1109
 0.12  0.3961  0.62  0.3292  1.12  0.2131  1.62  0.1074
 0.14  0.3951  0.64  0.3251  1.14  0.2083  1.64  0.1040
 0.16  0.3939  0.66  0.3209  1.16  0.2036  1.66  0.1006
 0.18  0.3925  0.68  0.3166  1.18  0.1989  1.68  0.0973
 0.20  0.3910  0.70  0.3123  1.20  0.1942  1.70  0.0940
 0.22  0.3894  0.72  0.3079  1.22  0.1895  1.72  0.0909
 0.24  0.3876  0.74  0.3034  1.24  0.1849  1.74  0.0878
 0.26  0.3857  0.76  0.2989  1.26  0.1804  1.76  0.0848
 0.28  0.3836  0.78  0.2943  1.28  0.1758  1.78  0.0818
 0.30  0.3814  0.80  0.2897  1.30  0.1714  1.80  0.0790
 0.32  0.3790  0.82  0.2850  1.32  0.1669  1.82  0.0761
 0.34  0.3765  0.84  0.2803  1.34  0.1626  1.84  0.0734
 0.36  0.3739  0.86  0.2756  1.36  0.1582  1.86  0.0707
 0.38  0.3712  0.88  0.2709  1.38  0.1539  1.88  0.0681
 0.40  0.3683  0.90  0.2661  1.40  0.1497  1.90  0.0656
 0.42  0.3653  0.92  0.2613  1.42  0.1456  1.92  0.0632
 0.44  0.3621  0.94  0.2565  1.44  0.1415  1.94  0.0608
 0.46  0.3589  0.96  0.2516  1.46  0.1374  1.96  0.0584
 0.48  0.3555  0.98  0.2468  1.48  0.1334  1.98  0.0562
 0.50  0.3521  1.00  0.2420  1.50  0.1295  2.00  0.0540

 2.02  0.0519  2.52  0.0167  3.02  0.0042  3.52  0.0008
 2.04  0.0498  2.54  0.0158  3.04  0.0039  3.54  0.0008
 2.06  0.0478  2.56  0.0151  3.06  0.0037  3.56  0.0007
 2.08  0.0459  2.58  0.0143  3.08  0.0035  3.58  0.0007
 2.10  0.0440  2.60  0.0136  3.10  0.0033  3.60  0.0006
 2.12  0.0422  2.62  0.0129  3.12  0.0031  3.62  0.0006
 2.14  0.0404  2.64  0.0122  3.14  0.0029  3.64  0.0005
 2.16  0.0387  2.66  0.0116  3.16  0.0027  3.66  0.0005
 2.18  0.0371  2.68  0.0110  3.18  0.0025  3.68  0.0005
 2.20  0.0355  2.70  0.0104  3.20  0.0024  3.70  0.0004
 2.22  0.0339  2.72  0.0099  3.22  0.0022  3.72  0.0004
 2.24  0.0325  2.74  0.0093  3.24  0.0021  3.74  0.0004
 2.26  0.0310  2.76  0.0088  3.26  0.0020  3.76  0.0003
 2.28  0.0297  2.78  0.0084  3.28  0.0018  3.78  0.0003
 2.30  0.0283  2.80  0.0079  3.30  0.0017  3.80  0.0003
 2.32  0.0270  2.82  0.0075  3.32  0.0016  3.82  0.0003
 2.34  0.0258  2.84  0.0071  3.34  0.0015  3.84  0.0003
 2.36  0.0246  2.86  0.0067  3.36  0.0014  3.86  0.0002
 2.38  0.0235  2.88  0.0063  3.38  0.0013  3.88  0.0002
 2.40  0.0224  2.90  0.0060  3.40  0.0012  3.90  0.0002
 2.42  0.0213  2.92  0.0056  3.42  0.0012  3.92  0.0002
 2.44  0.0203  2.94  0.0053  3.44  0.0011  3.94  0.0002
 2.46  0.0194  2.96  0.0050  3.46  0.0010  3.96  0.0002
 2.48  0.0184  2.98  0.0047  3.48  0.0009  3.98  0.0001
 2.50  0.0175  3.00  0.0044  3.50  0.0009
POINT-BISERIAL CORRELATION

Y = score on interval measure (e.g., test score)
x = 0 or 1 (grouping; e.g., gender)

rpb = [(Y1. – Y0.) / sy] × [n1n0 / (n(n–1))]^1/2
Descriptive Statistics

                           Mean     SD      N    Covariance
GENDER                     .4667    .51     15   1.846
READING COMPREHENSION
  0 (boys)                 68.26    18.39    8
  1 (girls)                75.19    11.11    7
  Total                    71.49    15.32   15

rpb = [(75.19 – 68.26) / 15.32] × [(8 × 7) / (15 × 14)]^1/2
    = .233
Table 3.5: Calculation of point-biserial correlation coefficient for
First Grade reading comprehension of boys and girls
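The point-biserial computation in Table 3.5 can be reproduced directly from the summary statistics; a Python sketch:

```python
from math import sqrt

mean1, mean0 = 75.19, 68.26   # girls, boys
s_y = 15.32                   # total SD of reading comprehension
n1, n0 = 7, 8
n = n1 + n0

r_pb = (mean1 - mean0) / s_y * sqrt(n1 * n0 / (n * (n - 1)))
print(round(r_pb, 3))
```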
POINT BISERIAL CORRELATION
[Figure: scatter of Y scores (X marks) for the two groups F and M, with each group mean m marked]
Dichotomous (Normal)-Interval Case: biserial correlation

rbis = [(Y1. – Y0.) / sy] × [n1n0 / (u n²)],
where u = height of the normal curve for the proportion n1/(n0 + n1)

rbis = [(75.19 – 68.26) / 15.32] × [(8 × 7) / (.3675 × 15²)]
     = .306
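The biserial coefficient differs from the point-biserial only in its correction factor, which uses the normal-curve height u at the split; a Python sketch of the computation above (u taken as given from the table lookup):

```python
mean1, mean0 = 75.19, 68.26
s_y = 15.32
n1, n0 = 7, 8
n = n1 + n0
u = 0.3675   # normal-curve height for the proportion n1/n (table lookup)

r_bis = (mean1 - mean0) / s_y * (n1 * n0 / (u * n**2))
print(round(r_bis, 3))  # .306, larger than the point-biserial .233
```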
BISERIAL CORRELATION
[Figure: scatter of Y scores (X marks) for the two groups F and M, with each group mean m marked]
RANK-RANK DATA

1. DATA ARE INTERVAL OR RATIO, transformed to ranks because of an odd distribution
2. DATA ARE ORDINAL, with no interval information available

USE THE SPEARMAN CORRELATION (the Pearson formula applied to the ranks; no ties assumed)
Rank distribution of real estate price per square foot in Manhattan
[Figure: map of Manhattan from the Battery to Central Park with the price-per-foot ranks (1–27, including ties at 19.5 and 24.5) plotted by location. The relative positions of the ranks are only approximate, due to typeface limitations; all are ordered correctly.]
This results in the following ranks:

location: 17 20 23 18 7 1 8 21 23 14 24 6 11 5 3 12 9 19 4 16 10 22 2 15 27 26 13
price:    8 4 1 6 12 21 22 2 5 9 3 17 24.5 27 19.5 11 19.5 10 13 16 15 7 26 24.5 18 14 23

Computation of the rank correlation coefficient:

rrank = sxy/(sx sy) = –.647
rSpearman = –.640 (from SPSS, Version 13)

Table 3.7: Computation of rank correlation for Real Estate location in Manhattan with Price Per Square Foot
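The Spearman coefficient is just the Pearson formula applied to the ranks; a small self-contained Python sketch with made-up ranks (not the Manhattan data) showing that the Pearson-on-ranks form and the classic 1 – 6Σd²/(n(n²–1)) shortcut agree when there are no ties:

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # ranks on variable 1 (hypothetical)
y = [2, 1, 4, 3, 5]   # ranks on variable 2 (hypothetical)
n = len(x)

# Pearson r computed on the ranks
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r_pearson = sxy / sqrt(sxx * syy)

# Classic Spearman shortcut (valid only with no ties)
d2 = sum((a - b) ** 2 for a, b in zip(x, y))
r_spearman = 1 - 6 * d2 / (n * (n**2 - 1))

print(r_pearson, r_spearman)  # both 0.8
```

With ties (like the 19.5 and 24.5 ranks above), only the Pearson-on-ranks form remains exact, which is why SPSS's value (–.640) differs slightly from the raw sxy/(sx sy) computation.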
Least squares estimation

The best estimate is the one for which the sum of squared differences between each score and the estimate is smallest among all possible linear unbiased estimates (BLUE: best linear unbiased estimate).
Least squares estimation

• The ei are errors or disturbances. They represent, in this case, the part of the y score not predictable from x:

  ei = yi – b1xi – b0 .

• The sum of squares for errors follows:

  SSe = Σ(i=1..n) e²i
[Figure: scatterplot of y against x with the fitted line; the vertical deviations e from the line are the errors, and SSe = Σe²i]
Matrix representation of least squares estimation

• We can represent the regression model in matrix form:

  y = Xβ + e
Matrix representation of least squares estimation

    y          X         β        e

  | y1 |    | 1  x1 |            | e1 |
  | y2 |    | 1  x2 |   | β0 |   | e2 |
  | y3 |  = | 1  x3 |   | β1 | + | e3 |
  | y4 |    | 1  x4 |            | e4 |
  | .  |    | 1  .  |            | .  |
  | .  |    | 1  .  |            | .  |
Matrix representation of least squares estimation

• y = Xb + e
• The least squares criterion is satisfied by the following matrix equation:

  b = (X′X)⁻¹X′y .

• The term X′ is called the transpose of the X matrix: the matrix turned on its side. When X′X is multiplied out, the result is a 2 × 2 matrix:

  X′X = | n     Σxi  |
        | Σxi   Σx²i |

  Note: all the information is here: sample size, mean (sum of scores), variance (squared scores).
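For one predictor, the normal-equation solution b = (X′X)⁻¹X′y can be carried out by hand, since X′X is only 2 × 2; a stdlib-only Python sketch on a made-up exact-fit example:

```python
# hypothetical data: y = 1 + 2x exactly
x = [0, 1, 2]
y = [1, 3, 5]
n = len(x)

# the entries of X'X and X'y for design matrix X = [1 | x]
sx = sum(x)
sxx = sum(v * v for v in x)
sy_ = sum(y)
sxy = sum(a * b for a, b in zip(x, y))

# invert the 2 x 2 matrix [[n, sx], [sx, sxx]] and multiply by X'y
det = n * sxx - sx * sx
b0 = (sxx * sy_ - sx * sxy) / det
b1 = (n * sxy - sx * sy_) / det
print(b0, b1)  # 1.0 2.0
```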
SUMS OF SQUARES: computational equivalents

• SSe = (n – 2)s²e
• SSreg = Σ( ŷi – y. )², with ŷi = b0 + b1xi
• SSy = SSreg + SSe
SUMS OF SQUARES: Venn Diagram
[Figure: Venn diagram in which SSy overlaps SSx; the overlap is SSreg and the remainder of SSy is SSe]
Fig. 8.3: Venn diagram for linear regression with one predictor and one outcome measure
SUMS OF SQUARES: ANOVA Table

SOURCE    df      Sum of Squares    Mean Square     F
x         1       SSreg             SSreg / 1       (SSreg/1) / (SSe/(n–2))
e         n–2     SSe               SSe / (n–2)
Total     n–1     SSy               SSy / (n–1)

• Table 8.1: Regression table for Sums of Squares
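The partition SSy = SSreg + SSe and the F ratio in Table 8.1 can be verified numerically; a Python sketch on made-up data:

```python
# hypothetical data
x = [0, 1, 2, 3]
y = [1, 2, 2, 3]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# least squares slope and intercept
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx)**2 for a in x)
b0 = my - b1 * mx
yhat = [b0 + b1 * a for a in x]

ss_reg = sum((h - my)**2 for h in yhat)
ss_e = sum((b - h)**2 for b, h in zip(y, yhat))
ss_y = sum((b - my)**2 for b in y)
F = (ss_reg / 1) / (ss_e / (n - 2))

print(ss_reg, ss_e, ss_y, F)  # SSy = SSreg + SSe
```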
• Rupley and Willson (1997) studied the relationship between word recognition and reading comprehension for 200 six- and seven-year-olds, using a national sample of students that mirrored the U.S. census of 1980. The mean for Word Recognition was 100, SD = 15, and the mean for Reading Comprehension was 23.16, SD = 14.74. The regression analysis is reported in the table below:

Dep. Var.: Reading Comprehension

SOURCE             df    Sum of Squares   Mean Square   F        Prob.
Word recognition   1     34316.55         34316.55      763.17   .001
error              198   8903.23          44.97
total              199   43219.78

se = 6.71     R² = .794
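The reported F, se, and R² all follow from the sums of squares in the table; a Python check using the published values:

```python
from math import sqrt

ss_reg, ss_e, ss_y = 34316.55, 8903.23, 43219.78
df_reg, df_e = 1, 198

ms_e = ss_e / df_e               # error mean square, about 44.97
F = (ss_reg / df_reg) / ms_e     # about 763.17
se = sqrt(ms_e)                  # about 6.71
R2 = ss_reg / ss_y               # about .794
print(round(F, 2), round(se, 2), round(R2, 3))
```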
Two variable linear regression: Which direction?

• Regression equations:

  y = xb1x + xb0
  x = yb1y + yb0

• Regression coefficients:

  xb1 = rxy sy / sx
  yb1 = rxy sx / sy
Two variable linear regression

y = b1x + b0

If the correlation coefficient has been calculated, then b1 can be computed from it:

  b1 = rxy sy / sx

The intercept, b0, follows by placing the means for x and y into the equation above and solving:

  b0 = y. – [rxy sy/sx] x.
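A quick Python sketch with made-up summary statistics, showing how the slope and intercept come from r, the standard deviations, and the means:

```python
r_xy = 0.5             # hypothetical correlation
s_x, s_y = 5.0, 10.0   # hypothetical standard deviations
mean_x, mean_y = 50.0, 100.0

b1 = r_xy * s_y / s_x        # slope for predicting y from x
b0 = mean_y - b1 * mean_x    # intercept from the means
print(b1, b0)  # 1.0 50.0
```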
Two variable linear regression
[Figure: scatter of y on x with two fitted lines, one with slope xb1 = rxy sy/sx (regressing y on x) and one with slope yb1 = rxy sx/sy (regressing x on y)]
Fig. 8.1: Slopes for two regression representations of Pearson correlation
Three variable linear regression

y = b1x1 + b2x2 + b0

Two predictors: all variables may be correlated with each other.
Exact equations exist to compute b1 and b2, but for more than two predictors the normal equations must be solved in matrix form.
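For two predictors the normal equations (X′X)b = X′y are 3 × 3 and can be solved with a small Gaussian elimination; a stdlib-only Python sketch on made-up data generated from y = 1 + x1 + 2x2 (the solver and data are illustrative, not from the lecture):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# hypothetical data: y = 1 + x1 + 2*x2 exactly
x1 = [1, 2, 3, 4]
x2 = [0, 1, 0, 1]
y = [2, 5, 4, 7]
n = len(y)

# build X'X and X'y for design matrix columns [1, x1, x2]
cols = [[1] * n, x1, x2]
XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
Xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]

b0, b1, b2 = solve(XtX, Xty)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # 1.0 1.0 2.0
```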
Three variable linear regression

• Path model representation: unstandardized
[Figure: path diagram with arrows x1 → y (b1) and x2 → y (b2), a covariance path s12 between x1 and x2, and an error term e pointing to y]
Three variable linear regression

• Path model representation: standardized
[Figure: path diagram with arrows x1 → y (β1) and x2 → y (β2), the correlation r12 between x1 and x2, and an error term e pointing to y]
SUMS OF SQUARES: Venn Diagram
[Figure: Venn diagram in which SSy overlaps SSx1 and SSx2; the combined overlap is SSreg and the remainder of SSy is SSe]
Fig. 8.3: Venn diagram for linear regression with two predictors and one outcome measure