Lecture 9 Week 14

Research Methods & Design
in Psychology
Lecture 4
Correlation
Lecturer: James Neill
Readings
• Howell (Fundamentals)
– Ch9 (Correlation)
• Howell (Methods)
– Ch6 (Categorical Data and Chi-Square)
– Ch 9 (Correlation and Regression)
Overview
• Correlational analyses
• Types of answers
• Types of correlation
• Interpretation
• Assumptions / Limitations
The World is Made of
Covariation
Covariations are the Building
Block of Complex Models
[Slide figure: 10 x 10 correlation matrix; variable labels not recoverable; values rounded to three decimals]
1.000  .607  .679  .284  .256  .293  .369  .369  .387  .161
 .607 1.000  .688  .223  .255  .285  .368  .378  .434  .170
 .679  .688 1.000  .286  .294  .376  .358  .414  .459  .211
 .284  .223  .286 1.000  .782  .645  .210  .272  .262  .172
 .256  .255  .294  .782 1.000  .632  .226  .314  .279  .198
 .293  .285  .376  .645  .632 1.000  .244  .308  .354  .219
 .369  .368  .358  .210  .226  .244 1.000  .562  .515  .227
 .369  .378  .414  .272  .314  .308  .562 1.000  .666  .241
 .387  .434  .459  .262  .279  .354  .515  .666 1.000  .266
 .161  .170  .211  .172  .198  .219  .227  .241  .266 1.000
Correlational Research Questions
Are two variables related?
Interesting questions tend to:
• test a novel relationship, e.g.
– “is time spent studying for exams associated with
increased incidence of brain cancer?”
• avoid simply showing an expected
relationship, e.g.
– “is time spent studying for exams associated
with higher exam marks?”
Correlational analyses 1
Correlational analyses are used to
examine the extent to which two
variables have a simple linear
relationship.
Correlations provide the building blocks
for:
– Factor analysis
– Reliability
– Regression
– Etc.
Correlational analyses 2
Linear relationship between 2 variables:
– direction and
– strength
– ranges from -1 to +1
• Sign indicates direction
• Size indicates strength
Correlational analyses 3
Measures the extent to which:
• differences in one variable can be
predicted from differences in the other
variable
• one variable varies with another
variable
A correlation is also an effect size.
Types of Answers
• No relationship (independence)
• Linear relationship:
– As one variable increases, so does the other
(+ve)
– As one variable increases, the other decreases
(-ve)
• Non-linear
• Restricted range
• Heterogeneous samples
Types of Correlation
• Phi / Cramer’s V
• Spearman’s rank / Kendall’s Tau b
• Point bi-serial rpb
• Product-moment or Pearson’s r
Graph and statistic by level of measurement:
• Nominal by Nominal: clustered bar-graph; Chi-squared, Phi (φ) or Cramer's V
• Nominal by Ordinal: clustered bar-graph; Chi-squared, Phi (φ) or Cramer's V
• Nominal by Int/Ratio: scatterplot, bar chart or error-bar chart; point bi-serial correlation (rpb)
• Ordinal by Ordinal: scatterplot or clustered bar chart; Spearman's Rho or Kendall's Tau
• Ordinal by Int/Ratio: recode; scatterplot; point bi-serial or Spearman/Kendall
• Int/Ratio by Int/Ratio: scatterplot; product-moment correlation (r)
Tufte: Graphics reveal data.
[Figure: four scatterplot panels, labelled tufte$X1, tufte$X2, tufte$X3 and tufte$X4]
Nominal by Nominal
Contingency Tables 1
• Bivariate frequency tables
• Marginal totals
• Can include %
Clustered Bar Graph
• Bar graph of frequencies or
percentages with the category axis
clustered by coloured bars to indicate
the two variables’ categories
Chi-square 1
Example
Phi (φ) & Cramer’s V
Phi (φ)
• Two nominal variables, at least one of
them dichotomous (2x2, 2x3, 3x2)
• E.g., Gender & Pass/Fail
Cramer’s V
• Two nominal variables, each with three or
more categories (3x3 or greater)
• E.g., Favourite Season x Favourite
Sense
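As a rough illustration of how these coefficients relate to chi-square, the sketch below computes the Pearson chi-square statistic for a table of counts and then Cramer's V, which for a 2x2 table equals the absolute value of phi. The counts and the function names (chi_square, cramers_v) are invented for illustration and are not from the lecture.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and total N for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (N * (k - 1))), k = smaller of rows, columns."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows = gender, columns = pass / fail
gender_by_result = [[30, 10],
                    [20, 20]]
print(round(cramers_v(gender_by_result), 3))
```

Because k - 1 = 1 whenever one variable is dichotomous, V and phi coincide in size for 2x2, 2x3 and 3x2 tables.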
Example
Ordinal by Ordinal
Spearman’s rho (rs)
• For ranked (or recoded to ordinal) data
• Uses product-moment correlation, but
interpretations must be adjusted to
consider the underlying ranked scales
– e.g. Olympic Placing vs. World Ranking
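A minimal sketch of the idea that Spearman's rho is a product-moment correlation applied to ranks. The data are hypothetical placings for six athletes, and the helper names (pearson_r, ranks, spearman_rho) are assumptions made for this example.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: Olympic placing and world ranking for six athletes
olympic_placing = [1, 2, 3, 4, 5, 6]
world_ranking = [2, 1, 4, 3, 6, 5]
print(round(spearman_rho(olympic_placing, world_ranking), 3))
```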
Kendall’s Tau-b
• Kendall’s tau-b
– for ordinal/ranked data
– takes joint ranks into account
– Ranges -1 to +1, but only for square tables
Dichotomous by Interval/Ratio
Point-biserial correlation
• Point-biserial correlation (rpb)
– one dichotomous & one continuous
variable
– calculate as for Pearson’s r, but
interpretations must be adjusted to
consider the underlying dichotomous scale
– e.g., gender and self-esteem
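To illustrate "calculate as for Pearson's r", the sketch below codes a dichotomous variable as 0/1 and passes it to an ordinary product-moment calculation. The scores and the 0/1 coding are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: gender coded 0/1 (coding is arbitrary), self-esteem scores
gender = [0, 0, 0, 0, 1, 1, 1, 1]
self_esteem = [4, 5, 6, 5, 7, 6, 8, 7]

r_pb = pearson_r(gender, self_esteem)
print(round(r_pb, 3))
```

The sign of r_pb only reflects which group was coded 1, so it is the size of the coefficient and the group means that carry the interpretation.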
Example
Product-moment correlation (r)
• For two interval and/or ratio variables
• $r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y}$
Interval/Ratio by Interval/Ratio
Scatterplots
• Plot each pair of observations (X, Y)
• x = predictor variable (independent)
• y = criterion variable (dependent)
• Check for:
– outliers
– linearity
• ‘Line of best fit’
y = a + bx
The correlation between 2 variables is a
measure of the degree to which:
• Pairs of numbers (points) cluster together
around a best-fitting straight line
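A small sketch of a 'line of best fit' y = a + bx, assuming the usual least-squares definition (the slides do not state how the line is fitted). The five data points and the function name are made up for illustration.

```python
def line_of_best_fit(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept: the line passes through the means
    return a, b

# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = line_of_best_fit(x, y)
print(round(a, 2), round(b, 2))
```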
Scatterplot showing relationship
between age & cholesterol
Strong positive (.81)
Figure 9.1
Infant Mortality and Number of Physicians
[Figure: scatterplot of Infant Mortality (y) by Physicians per 100,000 Population (x)]
Weak positive (.14)
Figure 9.2
Life Expectancy and Health Care Costs
[Figure: scatterplot with Health Care Expenditures on the x axis]
Moderately strong negative (-.76)
Figure 9.3
Cancer Rate and Solar Radiation
[Figure: scatterplot of Breast Cancer Rate (y) by Solar Radiation (x)]
Correlation Estimation
Indicate level (high, med., or low) and
sign of the correlation for:
a) number of guns in community and
number of firearm deaths
b) robberies and incidence of drug abuse
c) protected sex and incidence of AIDS
d) community education level and crime rate
e) solar flares and suicide
Covariance
• Variance shared by 2 variables
$$\mathrm{Cov}_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$
(the numerator is the sum of cross products)
• Covariance reflects the direction of the
relationship:
+ve cov indicates + relationship
-ve cov indicates - relationship.
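The covariance formula can be sketched directly as a sum of cross products divided by N - 1. The data below are hypothetical and chosen only so that one pairing gives a positive covariance and the other a negative one.

```python
def covariance(x, y):
    """Sample covariance: sum of cross products of deviations over N - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cross_products = [(xi - mx) * (yi - my) for xi, yi in zip(x, y)]
    return sum(cross_products) / (n - 1)

# Hypothetical scores
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 5, 4, 5]    # tends to increase with x
y_falling = [5, 4, 5, 4, 2]   # tends to decrease with x

print(covariance(x, y_rising))    # positive covariance
print(covariance(x, y_falling))   # negative covariance
```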
Covariance – Cross-products
[Figure: scatterplot of Y1 by X1 split into four quadrants, each labelled as giving +ve or -ve deviation (cross) products]
Covariance
• Covariance is dependent on the scale of
measurement used
• Can’t compare magnitude of cov across
different scales of measurement (e.g.,
age by weight in kilos versus age by
weight in grams).
• Therefore, standardise covariance
-> correlation
Correlation formula
$$r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y}$$
The correlation between 2 variables is:
• an effect size – i.e., standardised measure
of amount of covariation.
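The following sketch illustrates why covariance is standardised: converting weight from kilos to grams multiplies the covariance by 1000 but leaves r unchanged. The ages and weights are invented, and the function names are assumptions for this example.

```python
import math

def covariance(x, y):
    """Sample covariance with an N - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def sd(x):
    """Sample standard deviation (square root of the variance)."""
    return math.sqrt(covariance(x, x))

def correlation(x, y):
    """r = Cov_XY / (s_X * s_Y)."""
    return covariance(x, y) / (sd(x) * sd(y))

# Hypothetical data: age in years, weight in kilos and in grams
age = [20, 30, 40, 50, 60]
weight_kg = [60, 72, 68, 80, 75]
weight_g = [w * 1000 for w in weight_kg]

print(covariance(age, weight_kg), covariance(age, weight_g))  # differ by a factor of 1000
print(round(correlation(age, weight_kg), 3), round(correlation(age, weight_g), 3))  # same r
```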
Correlation - SPSS
Correlations
Cigarette Consumption per Adult per Day by CHD Mortality per 10,000
Pearson Correlation   .713**
Sig. (2-tailed)       .000
N                     21
**. Correlation is significant at the 0.01 level (2-tailed).
Hypothesis testing
• Almost all sample correlations are not
exactly 0, so we ask: “How likely is it
that an observed relationship between
variables is a ‘true’ relationship, rather
than simply a result of random sampling
variability or ‘chance’?”
Significance of Correlation
• Null hypothesis (H0): assumes that there is no
‘true’ relationship
• Alternative hypothesis (H1): assumes that the
relationship is real
• We initially assume the null hypothesis, and
evaluate whether the data support the alternative
hypothesis
• Null hypothesis: H0: r= 0
• Alternative hypothesis H0: r<>0
r (rho)= population product-moment correlation
coefficient
How do we test the null
hypothesis?
• Use statistical tests of probability which
produce a p value
• Convention specifies a criterion for statistical
significance of 0.05 (alpha level)
• We generate a p value and compare it to
0.05.
• If p is less than 0.05, the result is statistically
significant: if there were no true relationship, a
correlation at least this strong would arise from
random sampling variability less than 5% of the time
Imprecision in hypothesis
testing
• Type I error: rejecting the null
hypothesis when it is true
• Type II error: failing to reject the null
hypothesis when it is false
• Statistical significance is a function of
effect size, sample size and alpha level
Significance of Correlation
• A 1- or 2-tailed significance test can be done
in an effort to infer to a population
• Result will depend on the power of the study (i.e.,
the higher N, alpha, and r, the more likely to be sig.)
• Alternatively look up r tables with df = N - 2
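One common way to test H0: ρ = 0 is to convert r to a t statistic with df = N - 2, which is what tables of critical r are derived from. The sketch below applies this to the values in the SPSS output above (r = .713, N = 21); the function name t_from_r is an assumption for this example, and the p value itself would still come from a t table or statistics package.

```python
import math

def t_from_r(r, n):
    """t statistic for testing rho = 0, with df = N - 2."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

t, df = t_from_r(0.713, 21)
print(round(t, 2), df)   # t is about 4.43 with df = 19
```

For a fixed r, t grows with the square root of df, which is why the same correlation can be significant in a large sample and non-significant in a small one.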
Scatterplot showing a confidence
interval for a line of best fit
US States 4th Academic Achievement by SES
Significance of Correlation
df (N-2)    critical r (p = .05)
5           .67
10          .50
15          .41
20          .36
25          .32
30          .30
50          .23
200         .11
500         .07
1000        .05
When we say that the correlation between
Age and test Performance is significant,
we mean that:
a. there is an important relationship between Age
and test Performance
b. the true correlation between Age and
Performance in the population is equal to 0
c. the true correlation between Age and
Performance in the population is not equal to 0
d. getting older causes you to do poorly on tests
Descriptive Assumptions
• Level of measurement (LOM) >= interval
• Linear relationships
• No outliers
Inferential Assumptions
• Homoscedasticity
• Similar, normal underlying
distributions
• Correction for attenuation
(unreliability)
• Minimal and normally distributed
measurement error
Homoscedasticity
Factors that affect correlation
• Restricted range
• Heterogeneous samples
• Scale has no effect
Coefficient of Determination (r²)
• CoD = The proportion of variance or
change in one variable that can be
accounted for by another variable.
• e.g., r = .60, r² = .36
Interpreting Correlation
(Cohen, 1988)
• A correlation is an effect size, so
guidelines re strength can be
suggested.
Strength    r           r²
weak        .1 to .3    (1 to 10%)
moderate    .3 to .5    (10 to 25%)
strong      > .5        (> 25%)
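The guidelines above can be sketched as a small helper that returns a strength label and r². The function name describe_r and the handling of boundary values (.3 counted as moderate, exactly .5 counted as moderate because strong is "> .5", anything under .1 labelled "negligible") are assumptions for this example, since the table's bands share their endpoints.

```python
def describe_r(r):
    """Return (strength label, r squared) using Cohen-style bands on |r|."""
    size = abs(r)              # strength ignores the sign
    if size > 0.5:
        label = "strong"
    elif size >= 0.3:
        label = "moderate"
    elif size >= 0.1:
        label = "weak"
    else:
        label = "negligible"   # assumed label; the table starts at .1
    return label, round(r ** 2, 4)

print(describe_r(0.60))    # ('strong', 0.36)
print(describe_r(-0.35))   # ('moderate', 0.1225)
```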
Size of Correlation (Cohen, 1988)
WEAK (.1 - .3)
MODERATE (.3-.5)
STRONG (>.5)
Interpreting Correlation 1
Strength       r             r²
very weak      0 - .19       (0 to 4%)
weak           .20 - .39     (4 to 16%)
moderate       .40 - .59     (16 to 36%)
strong         .60 - .79     (36% to 64%)
very strong    .80 - 1.00    (64% to 100%)
Interpreting Correlation 2
Interpreting Correlation 3
Correlation of this scatterplot = .9
[Figure: scatterplot of Y1 by X1; X1 axis runs from 10 to 40]
Correlation of this scatterplot = .9
[Figure: scatterplot of Y1 by X1; X1 axis runs from 10 to 100]
What do you estimate the correlation of this
scatterplot of height and weight to be?
a. -.5
b. -1
c. 0
d. .5
e. 1
[Figure: scatterplot of HEIGHT (y) by WEIGHT (x)]
What do you estimate the correlation of this
scatterplot to be?
a. -.5
b. -1
c. 0
d. .5
e. 1
[Figure: scatterplot of Y (axis 2 to 14) by X (axis 4.4 to 5.6)]
What do you estimate the correlation of this
scatterplot to be?
a. -.5
b. -1
c. 0
d. .5
e. 1
[Figure: scatterplot of Y (axis roughly 4 to 6) by X (axis 2 to 14)]
Non-linear Relationships
Check scatterplot
Can a linear relationship ‘capture’ the
lion’s share of the variance?
If so, use r.
[Figure: scatterplot of X2 by Y2]
Non-linear relationships
• If non-linear, consider transforming
one or both variables to ‘create’ a
linear relationship; this is equivalent
to finding a non-linear mathematical
function to describe the relationship
between the original variables
Range restriction
• Range restriction is when sample
contains restricted (or truncated) range
of scores
– e.g., fluid intelligence and age < 18 might
have linear relationship
• If range restriction, be cautious in
generalising beyond the range for which
data is available
– E.g., fluid intelligence does not continue to
increase linearly with age after age 18
Range restriction
Heterogeneous samples
• Sub-samples (e.g., males & females) may
artificially increase or decrease overall r.
• Solution - calculate r separately for
sub-samples & overall, look for differences
[Figure: scatterplot of H1 by W1]
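A sketch of the suggested solution: compute r for each sub-sample and for the pooled sample, then compare. The height and weight values are invented so that pooling two groups with different means inflates the overall r.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sub-samples: weight (kg) and height (cm)
male_weight = [70, 75, 80, 85, 90]
male_height = [170, 176, 174, 180, 178]
female_weight = [50, 55, 60, 65, 70]
female_height = [158, 164, 162, 168, 166]

r_males = pearson_r(male_weight, male_height)
r_females = pearson_r(female_weight, female_height)
r_overall = pearson_r(male_weight + female_weight, male_height + female_height)

print(round(r_males, 2), round(r_females, 2), round(r_overall, 2))
```

Here the overall r is larger than either within-group r because the group means differ on both variables; with other data the pooled r can just as easily be smaller or even opposite in sign.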
Scatterplot of Same-sex Relations &
Opposite-sex Relations by Gender
boys r = .67
girls r = .52
[Figure: scatterplot of Same Sex Relations (y) by Opp Sex Relations (x), points marked by SEX (female, male)]
Scatterplot of Weight and Self-esteem by Gender
[Figure: scatterplot of SE (y) by WEIGHT (x), points marked by SEX (male, female); the Males and Females sub-samples are labelled r = .50 and r = -.48]
Effect of Outliers
• Outliers can disproportionately increase or
decrease r.
• Options
– compute r with & without outliers
– get more data for outlying values
– recode outliers as having more conservative
scores
– transformation
– recode variable into lower level of
measurement
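The first option, computing r with and without outliers, can be sketched as follows. The data are hypothetical, and the last point is deliberately placed far from the rest to act as an outlier.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores; the final pair (20, 20) is an outlier
x = [1, 2, 3, 4, 5, 6, 20]
y = [3, 1, 4, 2, 5, 3, 20]

r_with = pearson_r(x, y)
r_without = pearson_r(x[:-1], y[:-1])   # drop the outlying pair

print(round(r_with, 2), round(r_without, 2))
```

A single extreme point is enough to turn a modest correlation into a very large one, which is why both values are worth reporting.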
Age & self-esteem (r = .63)
[Figure: scatterplot of SE (y) by AGE (x), AGE axis from 10 to 80]
Age & self-esteem (outliers
removed) r = .23
[Figure: scatterplot of SE (y) by AGE (x), AGE axis from 10 to 40]
Checklist
1. Graphs & Scatterplots
– Outliers?
– Linear?
– Does each variable have a reasonable range?
– Are there subsamples to consider?
2. Choose appropriate measure of
Association
3. Conduct inferential test (if needed)
4. Interpret/Discuss
Dealing with several correlations
• Scatterplot matrices and correlation
matrices organise correlations
amongst several variables at once.
• Example (Kliewer et al, 1998)
– 99 young children
– Measured level of
• Witnessed violence, Intrusive thoughts,
Social support, and Internalizing
symptoms
Correlation matrix
            Witness   Intrus   Social Support   Internalizing
Witness     1.00      .37      .08              .20
Intrus      .37       1.00     -.08             .39
SocSup      .08       -.08     1.00             -.17
Internal    .20       .39      -.17             1.00
Reporting
• Relate back to research hypothesis
• Describe & interpret correlation
– direction of relationship
– size/strength
– significance
• If many correlations, report in a table
• Acknowledge limitations e.g.,
– Heterogeneity (sub-samples)
– Range restriction
– Causality?
Writing
“Number of children and marital
satisfaction were inversely related to
each other, r (48) = -.35, p<.05,
indicating that contentment in marriage
declines as couples elect to have more
children. Overall, number of children
explained approximately 12% of the
variance in marital satisfaction, a small-moderate effect.”
Also see end of Howell (Fundamentals)
Ch9 for an example write-up
Key Points
• Covariations are the building blocks of
reliability analysis, factor analysis, multiple
regression
• Correlation does not prove causation – may
be in opposite direction, co-causal, or due
to other variables
• Check scatterplots to see whether a
correlation makes sense
• Choose appropriate measure of association
based on levels of measurement
• Use r, r² and statistical significance
References
• Rank order correlation
http://faculty.vassar.edu/lowry/ch3b.html
• Correlation coefficient
http://www.sportsci.org/resource/stats/correl.html
• Correlation
http://www2.chass.ncsu.edu/garson/pa765/correl.htm
• Correlation
http://www.uwsp.edu/psych/stat/7/correlat.htm