Week 1 Review of basic concepts in Statistics


Week 3
Association and correlation
handout & additional course notes available at
http://homepages.gold.ac.uk/aphome
Trevor Thompson
15-10-2007
Overview
1) What are tests of association and which test do I use?
2) Associations within categorical data
- descriptives (frequency tables)
- the chi-square test
3) Associations within continuous data
- descriptives (scatterplots)
- Spearman’s rho and Pearson’s r
Reading: Howell (2002), ‘Statistical Methods for Psychology’, Chap 6 & 9
What is association/correlation?


To examine whether there is a relationship between variables
Variables are either associated or independent (which is the null hypothesis?)

Causation vs. association
Whether a causal conclusion can be drawn depends on the experimental design, not the test used
Which test to use?

Test selection depends on the type of data:
Categorical data - Chi-square
Ordinal (ranked) data - Spearman’s rho
Interval/ratio data - Pearson’s r

Other less commonly used tests exist (tetrachoric, Kendall’s tau, phi etc.); see Howell
Logistic regression is covered in a later lecture
Which test to use - examples

Is there an association between height and weight?
- Pearson’s r

Is there an association between 50 cities ranked for ‘livability’ 10 years ago and these cities ranked for ‘livability’ today?
- Spearman’s rho

Is there an association between gender (male / female) and yogurt preference (light / dark)?
- Chi-square test
Chi-square test


Pearson’s chi-square test for categorical data
- descriptives
- assumptions
- chi-square significance test
Research question: Is gender associated with
preference for a specifically coloured yogurt?
Chi-square test

Data entry


Each row should represent the responses of one participant

Compute contingency (frequency) table
In an n-way table, n denotes the number of variables
gender & yogurt is a 2-way table

Tables are also described in terms of how many levels each variable has. So a 3*2 table would represent one variable with 3 levels & one variable with 2 levels
gender & yogurt preference is a 2*2 table
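
The data-entry layout and the resulting 2*2 table can be sketched in a few lines of Python. This is only an illustration: the `rows` list is hypothetical raw data (one tuple per participant), built so that its counts match the 20/10/10/20 yogurt example used in this lecture, and the variable names are assumptions rather than anything from SPSS.

```python
from collections import Counter

# Hypothetical raw data: one (gender, yogurt) pair per participant,
# i.e. one "row" of the data file per person.
rows = (
    [("female", "light")] * 20 + [("female", "dark")] * 10
    + [("male", "light")] * 10 + [("male", "dark")] * 20
)

# Count how many participants fall into each combination of levels.
counts = Counter(rows)

genders = ["female", "male"]   # 2 levels
yogurts = ["light", "dark"]    # 2 levels

# 2*2 contingency table: rows = gender, columns = yogurt preference.
table = [[counts[(g, y)] for y in yogurts] for g in genders]

for g, row in zip(genders, table):
    print(g, row, "row total:", sum(row))
print("N =", len(rows))
```

Each participant contributes to exactly one cell, which is what the chi-square test later relies on.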
Chi-square test

Descriptives

Contingency tables:

Gender * Yoghurt Crosstabulation (Count): probable association
                 Yoghurt
Gender      Light    Dark    Total
Female         29       1       30
Male            1      29       30
Total          30      30       60

Gender * Yoghurt Crosstabulation (Count): probable independence (no association)
                 Yoghurt
Gender      Light    Dark    Total
Female         15      15       30
Male           15      15       30
Total          30      30       60

Gender * Yoghurt Crosstabulation (Count): possible association?
                 Yoghurt
Gender      Light    Dark    Total
Female         20      10       30
Male           10      20       30
Total          30      30       60
Chi-square test

Assumptions
1. Observations must be independent
2. Observations must be mutually exclusive

- each response should fall into only one cell, e.g. prefer either dark or light yogurt, not both
3. Inclusion of non-occurrences
- include all responses (e.g. both ‘yes’ and ‘no’), otherwise results can be misleading
4. Cell size
- expected count in each cell should be at least 5
Chi-square test

Significance testing

Are two variables significantly associated?
Run Pearson’s chi-square

Gender * Yoghurt Crosstabulation (Count)
                 Yoghurt
Gender      Light    Dark    Total
Female         20      10       30
Male           10      20       30
Total          30      30       60

Chi-Square Tests
                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             6.667(b)   1    .010
Continuity Correction(a)       5.400      1    .020
Likelihood Ratio               6.796      1    .009
Fisher's Exact Test                                          .019         .010
Linear-by-Linear Association   6.556      1    .010
N of Valid Cases               60
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 15.00.

Mantel-Haenszel Common Odds Ratio Estimate
Estimate                                              4.000
ln(Estimate)                                          1.386
Std. Error of ln(Estimate)                             .548
Asymp. Sig. (2-sided)                                  .011
Asymp. 95% Confidence Interval
  Common Odds Ratio          Lower Bound              1.367
                             Upper Bound             11.703
  ln(Common Odds Ratio)      Lower Bound               .313
                             Upper Bound              2.460
The Mantel-Haenszel common odds ratio estimate is asymptotically normally distributed under the common odds ratio of 1.000 assumption. So is the natural log of the estimate.
Chi-square test
Pearson’s χ² statistic
Gender & yogurt preference are significantly associated (χ² = 6.67, df = 1, p < .05)

Is this in the expected direction?
Our hypothesis was 2-tailed. If 1-tailed (e.g. females will prefer light yogurts), check the contingency table for direction
Can halve the p-value if 1-tailed, but only if both variables have 2 levels
Chi-square test
Degrees of freedom
df = (R-1) * (C-1), where R = number of rows, C = number of columns
For the 2*2 gender & yogurt table (2 rows, 2 columns), df = (2-1) * (2-1) = 1

Yates’ continuity correction
Only applicable to 2*2 tables
Changes (O-E)² in the formula to (|O-E| - 0.5)²
Not really needed
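
As a rough check on the two ideas above, the sketch below computes df from the table shape and applies the continuity correction to the gender & yogurt counts. The function names are illustrative assumptions; the correction follows the (|O-E| - 0.5)² form given on the slide.

```python
def degrees_of_freedom(observed):
    """df = (R - 1) * (C - 1) for an R x C contingency table."""
    n_rows = len(observed)
    n_cols = len(observed[0])
    return (n_rows - 1) * (n_cols - 1)


def yates_chi_square(observed):
    """Chi-square with Yates' continuity correction (intended for 2*2 tables)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            stat += (abs(o - e) - 0.5) ** 2 / e     # corrected squared deviation
    return stat


observed = [[20, 10], [10, 20]]  # rows: female, male; columns: light, dark
df = degrees_of_freedom(observed)
corrected = yates_chi_square(observed)
print(df, round(corrected, 3))
```

For these counts the corrected statistic agrees with the "Continuity Correction" value of 5.400 in the SPSS output.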
Chi-square test

Likelihood ratio
An alternative test for associations within categorical data

For large samples, likelihood ratio ≈ Pearson chi-square

For small samples, chi-square test may be more accurate
Likelihood ratio is useful for multi-dimensional associations, covered in the Logistic regression lecture
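
The slides do not give the likelihood ratio formula. A minimal sketch, assuming the standard form G = 2 ∑ O ln(O/E) and a table with no empty cells, is shown below; the function name is an illustrative assumption.

```python
import math


def likelihood_ratio_chi_square(observed):
    """G = 2 * sum(O * ln(O / E)); assumes every observed count is > 0."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            g += o * math.log(o / e)
    return 2.0 * g


g_stat = likelihood_ratio_chi_square([[20, 10], [10, 20]])
print(round(g_stat, 3))
```

For the gender & yogurt counts this gives about 6.796, matching the "Likelihood Ratio" row of the SPSS output and close to the Pearson value of 6.667.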
Chi-square test
Odds-ratio (OR) estimate
How large is our significant association?

Odds of:
- females choosing light relative to dark? 20/10 = 2/1
- males choosing light relative to dark? 10/20 = 1/2

Odds ratio = (a/b) / (c/d), or equivalently OR = (ad)/(bc)

          Y=1    Y=2
X=1        a      b
X=2        c      d

Odds ratio: what are the odds of choosing a light yogurt for females relative to males? (2/1) / (1/2) = 4/1
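
The odds ratio and its confidence interval can be reproduced with a short sketch. The standard error used here, sqrt(1/a + 1/b + 1/c + 1/d) on the log scale, is the usual large-sample formula and is an assumption about how the interval is formed; for this single 2*2 table it reproduces the figures in the SPSS output. The function name and the z value of 1.96 are likewise illustrative choices.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio (ad)/(bc) with an approximate 95% CI built on the log scale."""
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, log_or, se_log_or, lower, upper


# a = female/light, b = female/dark, c = male/light, d = male/dark
odds_ratio, log_or, se_log_or, lower, upper = odds_ratio_with_ci(20, 10, 10, 20)
print(odds_ratio, round(log_or, 3), round(se_log_or, 3))
print(round(lower, 3), round(upper, 3))
```

Because the interval (roughly 1.37 to 11.70) excludes 1, it tells the same story as the significant chi-square test.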
Chi-square test – underlying logic

Pearson χ² = ∑ (O-E)² / E
O = observed frequency
E = expected frequency
The χ² statistic represents how far the observed data deviate from the frequencies expected by chance

Calculating χ²
Step 1. Calculate expected frequencies
                 Yoghurt
Gender      Light    Dark    Total
Female       E=15    E=15       30
Male         E=15    E=15       30
Total          30      30       60
Prob of choosing light yogurt? ½ (30/60)
Prob of being female? ½ (30/60)
Prob of being female & preferring light yogurt? ¼ [joint prob = p1 x p2, if gender and preference are independent]
So if N=60, expected freq for each cell = 15 (60 x ¼)
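
Step 1 generalises to any table: the expected count for a cell is its row total times its column total, divided by N, which is the joint probability under independence multiplied by N. A minimal sketch follows; the function name is an illustrative assumption.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


observed = [[20, 10], [10, 20]]  # rows: female, male; columns: light, dark
expected = expected_counts(observed)
print(expected)                        # [[15.0, 15.0], [15.0, 15.0]]
print(min(min(row) for row in expected))  # smallest expected count, for the cell-size check
```

The smallest expected count (15 here) is the figure to compare against the cell-size assumption.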
Chi-square test – underlying logic

Step 2. Observed frequencies
Gender      Light          Dark           Total
Female      20 (E=15)      10 (E=15)         30
Male        10 (E=15)      20 (E=15)         30
Total       30             30                60

The bigger the deviations between observed and chance-expected cell counts, the greater the likelihood of a significant association

χ² = ∑ (O-E)² / E
   = (20-15)²/15 + (10-15)²/15 + (10-15)²/15 + (20-15)²/15
   = 25/15 + 25/15 + 25/15 + 25/15
   = 6.67, same as in SPSS output
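
The hand calculation above can be checked with a small function that sums (O-E)²/E over all cells. This is a sketch with an illustrative function name, not SPSS's own routine.

```python
def pearson_chi_square(observed):
    """Pearson chi-square: sum over cells of (O - E)^2 / E."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat


chi2 = pearson_chi_square([[20, 10], [10, 20]])
print(round(chi2, 3))  # 6.667

# The other two tables from the descriptives slide, for comparison:
print(pearson_chi_square([[15, 15], [15, 15]]))            # 0.0, no deviation from chance
print(round(pearson_chi_square([[29, 1], [1, 29]]), 2))    # much larger deviation
```

The three tables show how the statistic grows as the observed counts move away from the chance-expected 15 per cell.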
Chi-square test – underlying logic



Corresponding probability value of χ² = 6.67 is p = .01 (meaning a value of 6.67 or larger occurs about 1 time in 100 by chance when the variables are independent)
The chi-square distribution shows the values of the chi-square statistic that would be obtained by chance in repeated sampling
The distribution of χ² changes according to df
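
For df = 1 the p-value can be obtained without tables, because a chi-square variable with 1 df is the square of a standard normal variable, so the upper-tail probability equals erfc(√(χ²/2)). The sketch below covers only the df = 1 case (other df need the general chi-square distribution), and the function name is an illustrative assumption.

```python
import math


def chi_square_p_value_df1(chi2):
    """Upper-tail probability P(X >= chi2) for a chi-square variable with df = 1."""
    return math.erfc(math.sqrt(chi2 / 2.0))


p_pearson = chi_square_p_value_df1(6.667)  # Pearson chi-square from the example
p_yates = chi_square_p_value_df1(5.400)    # continuity-corrected value
print(round(p_pearson, 3), round(p_yates, 3))
```

These round to .010 and .020, the "Asymp. Sig. (2-sided)" values reported by SPSS for the two statistics.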
Correlation and regression



Detailed coverage of correlation/regression in lectures 8 & 9
When X & Y are continuous variables, we use Pearson’s correlation coefficient ‘r’ (or the equivalent Spearman’s rho for ranked data)
Correlation vs. regression
i. correlation is used to index strength of association; regression is used in prediction
ii. (historically) if X is fixed then regression, if X is random then correlation
Correlation and regression

Descriptives
[Scatterplot: Dopamine Binding Index (y-axis) against Sensation Seeking Score (x-axis), with fitted regression line; Rsq = 0.7703]


Correlation (r) is related to the degree to which the points cluster around the line (ranging from 0 to +1 or -1)
The regression line is the “line of best fit”
Correlation and regression

Significance testing
Pearson’s product-moment correlation
r = 0: no correlation
r = +1 or -1: maximum correlation

Correlations
                                   senseek    dopabind
senseek    Pearson Correlation     1          .880**
           Sig. (2-tailed)                    .000
           N                       40         40
dopabind   Pearson Correlation     .880**     1
           Sig. (2-tailed)         .000
           N                       40         40
**. Correlation is significant at the 0.01 level (2-tailed).




Null hypothesis is that the population r = 0, with r normally distributed
To evaluate the significance of ‘r’, convert to ‘t’:
t = r * √(N - 2) / √(1 - r²), with df = N - 2
Assumptions of normality and homogeneity of variance apply, covered in detail in lecture 6
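
The conversion from r to t can be sketched as follows, using the values from the correlation output (r = .880, N = 40). The function name is an illustrative assumption, and the sketch stops at the t statistic; looking up its p-value needs the t distribution, which is not in Python's standard library.

```python
import math


def t_from_r(r, n):
    """Convert a Pearson correlation to a t statistic with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r, n = 0.880, 40
t_stat = t_from_r(r, n)
df = n - 2
print(round(t_stat, 2), df)
```

A t value this large on 38 df is far beyond conventional critical values, consistent with the Sig. of .000 shown in the output.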
Summary

Selection of appropriate test depends on data

Chi-square test - explanation of output

Chi-square test - underlying logic

Correlation and regression