Section 7.2 Correlation


Looking at data: relationships
- Correlation
Lecture Unit 7
Objectives
Correlation

The correlation coefficient “r”

r does not distinguish x and y

r has no units

r ranges from -1 to +1

Influential points
The correlation coefficient "r"
The correlation coefficient is a measure of the direction and strength of
the linear relationship between 2 quantitative variables. It is calculated
using the mean and the standard deviation of both the x and y variables.
Time to swim: $\bar{x} = 35$, $s_x = 0.7$
Pulse rate: $\bar{y} = 140$, $s_y = 9.5$
Correlation can only be used to
describe quantitative variables.
Categorical variables don’t have
means and standard deviations.
Example: calculating correlation


Data: $(x_1, y_1), (x_2, y_2), (x_3, y_3) = (1, 3), (1.5, 6), (2.5, 8)$
$\bar{x} \approx 1.67,\ \bar{y} \approx 5.67,\ s_x \approx 0.76,\ s_y \approx 2.52$
$$r = \frac{(1 - 1.67)(3 - 5.67) + (1.5 - 1.67)(6 - 5.67) + (2.5 - 1.67)(8 - 5.67)}{(3 - 1)(0.76)(2.52)} \approx 0.9538$$
Properties of Correlation



r is a measure of the strength of the linear relationship between x
and y.
No units [like demand elasticity in economics (-infinity, 0)]
-1 ≤ r ≤ 1
Values of r and scatterplots
[Figure: four scatterplots of y against x, labeled "r near +1", "r near -1", "r near 0", and "r near 0"]
Properties (cont.) r has no unit
Changing the units of variables does
not change the correlation coefficient
"r", because we get rid of all our units
when we standardize (get z-scores).
[Figure: two scatterplots of the same data in different units, each with r = -0.75. The z-score plot is the same for both plots.]
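A small Python sketch can illustrate why a change of units leaves r alone. The swim times and pulse rates below are hypothetical values invented for this illustration (they are not the data behind the slide's plots), and `pearson_r` is a helper defined here, not a library function.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical swim times (minutes) and pulse rates (beats per minute).
time_min = [34.1, 34.6, 35.0, 35.4, 36.0]
pulse = [152, 146, 139, 136, 128]

# The same times expressed in seconds.
time_sec = [t * 60 for t in time_min]

r_min = pearson_r(time_min, pulse)
r_sec = pearson_r(time_sec, pulse)
print(r_min, r_sec)  # the two values agree up to floating-point rounding
```

Multiplying every x by 60 multiplies both the deviations from the mean and the standard deviation by 60, so the z-scores, and therefore r, are unchanged.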
Properties (cont.) r ranges from -1 to +1
"r" quantifies the strength
and direction of a linear
relationship between 2
quantitative variables.
Strength: how closely the points
follow a straight line.
Direction: r is positive when
individuals with higher x values
tend to have higher y values, and
negative when higher x values tend
to go with lower y values.
Properties of Correlation (cont.)

r = -1 only if y = a + bx with slope b<0

r = +1 only if y = a + bx with slope b>0
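As a quick check of these two statements, the following Python sketch evaluates r for points that lie exactly on the two example lines, y = 11 - x and y = 1 + 2x. The helper `pearson_r` and the choice of x values 0 through 10 are assumptions made for illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

xs = list(range(0, 11))          # x = 0, 1, ..., 10 (assumed for illustration)
y_down = [11 - x for x in xs]    # every point on y = 11 - x (slope b < 0)
y_up = [1 + 2 * x for x in xs]   # every point on y = 1 + 2x (slope b > 0)

r_down = pearson_r(xs, y_down)
r_up = pearson_r(xs, y_up)
print(r_down, r_up)  # -1 and +1, up to floating-point rounding
```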
[Figure: two scatterplots of y against x with every point on a straight line. One panel shows y = 11 - x with r = -1; the other shows y = 1 + 2x with r = 1.]
Properties (cont.) High correlation
does not imply cause and effect
CARROTS: Hidden terror in the produce department
at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still
alive, has severely wrinkled skin!!!

Everyone who ate carrots in 1865 is now dead!!!

45 of 50 17-year-olds arrested in Raleigh for juvenile
delinquency had eaten carrots in the 2 weeks
prior to their arrest!!!
Properties (cont.) Cause and Effect

There is a strong positive correlation between the monetary damage
caused by structural fires and the number of firemen present at the
fire. (More firemen-more damage)

Improper training? Will no firemen present result in the least amount
of damage?
Properties (cont.) Cause and Effect
(1,2) (24,75) (1,0) (18,59) (9,9) (3,7) (5,35) (20,46) (1,0)
(3,2) (22,57)
x = fouls committed by player;
y = points scored by same player
The correlation is due to a third “lurking”
variable – playing time

r measures the strength of
the linear relationship
between x and y; it does
not indicate cause and
effect
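The correlation reported on this slide for the fouls and points data can be recomputed from the eleven (x, y) pairs listed above. The sketch below is one way to do it in Python; `pearson_r` is a helper written for this illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# (fouls, points) pairs as listed on the slide.
data = [(1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7),
        (5, 35), (20, 46), (1, 0), (3, 2), (22, 57)]
fouls = [x for x, _ in data]
points = [y for _, y in data]

r = pearson_r(fouls, points)
print(round(r, 3))  # 0.935
```

The high value says only that fouls and points rise together in this data set. It says nothing about why, which is the point of the lurking-variable explanation (playing time).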
[Figure: scatterplot of points (y) against fouls (x), (x, y) = (fouls, points); correlation r = .935]
Properties (cont.) r
does not distinguish x & y
The correlation coefficient, r, treats
x and y symmetrically.
[Figure: two scatterplots of the same data with the x and y axes exchanged; r = -0.75 in both]
"Time to swim" is the explanatory variable here, and belongs on the x axis.
However, in either plot r is the same (r=-0.75).
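The symmetry is easy to confirm numerically. In the sketch below the swim times and pulse rates are hypothetical values made up for illustration, and `pearson_r` is a helper defined here; swapping its two arguments leaves the result unchanged.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical swim times (minutes) and pulse rates (beats per minute).
swim_time = [34.2, 34.8, 35.1, 35.5, 35.9, 36.3]
pulse = [150, 148, 141, 143, 132, 130]

r_xy = pearson_r(swim_time, pulse)   # time on the x axis
r_yx = pearson_r(pulse, swim_time)   # axes exchanged
print(r_xy, r_yx)  # equal up to floating-point rounding
```

The formula multiplies each x deviation by the matching y deviation, and multiplication does not care about order, so neither variable plays a special role.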
Outliers
Correlations are calculated using
means and standard deviations,
and thus are NOT resistant to
outliers.
Just moving one point away from the
general trend here weakens the
correlation from -0.91 to -0.75.
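The same effect can be sketched in Python with made-up numbers. The data below are hypothetical (they are not the points in the slide's plot, so the values of r differ from -0.91 and -0.75), and `pearson_r` is a helper defined for this illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical data following a tight downward trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [16.1, 14.0, 12.2, 9.9, 8.1, 6.0, 4.1, 1.9]

# Same data, but the last point is moved well away from the trend.
ys_moved = ys[:-1] + [12.0]

r_original = pearson_r(xs, ys)
r_moved = pearson_r(xs, ys_moved)
print(round(r_original, 2), round(r_moved, 2))
```

Only one of the eight points changed, yet r moves noticeably toward 0, because that point shifts both the mean and the standard deviation of y.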
PROPERTIES (Summary)

r is a measure of the strength of the linear relationship between x and y.

No units [like demand elasticity in economics (-infinity, 0)]

-1 ≤ r ≤ 1

r = -1 only if y = a + bx with slope b<0

r = +1 only if y = a + bx with slope b>0

correlation does not imply causation

r does not distinguish between x and y

r can be affected by outliers
Correlation: Fuel Consumption vs Car Weight
[Figure: scatterplot of fuel consumption (gal/100 miles) against car weight (1000 lbs); r = .9766]
SAT Score vs Proportion of Seniors Taking SAT
[Figure: scatterplot of 88-89 SAT state average against % of seniors that took the SAT; r = -.868. Labeled points: IW, ND, SC, DC, NC.]
Extra Slides
Part of the calculation
involves finding z, the
standardized score we used
when working with the
normal distribution.
You DON'T want to do this by hand.
Make sure you learn how to use
your calculator!
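Written out in terms of z-scores, the calculation takes the following form. This is the standard definition of the sample correlation, and it matches the worked example earlier in the section, where the sum of deviation products is divided by (n - 1) s_x s_y. The notation $z_{x_i}$ and $z_{y_i}$ for the standardized values is introduced here for convenience.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} \, z_{y_i}
```

Each z-score is a pure number, which is why r has no units.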
Standardization:
Allows us to compare
correlations between data
sets where variables are
measured in different units
or when variables are
different.
For instance, we might
want to compare the
correlation between [swim
time and pulse], with the
correlation between [swim
time and breathing rate].
When variability in one
or both variables
decreases, the
correlation coefficient
gets stronger
(closer to +1 or -1).
Correlation only describes linear relationships
No matter how strong the association,
r does not describe curved relationships.
Note: You can sometimes transform a non-linear association to a linear form,
for instance by taking the logarithm. You can then calculate a correlation using
the transformed data.
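The note about transformations can be illustrated with a short Python sketch. The exponential data below are hypothetical, chosen so that the logarithm of y is exactly linear in x; `pearson_r` is a helper defined for this illustration.

```python
import math
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of deviation products over (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical curved relationship: y grows exponentially with x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [math.exp(0.8 * x) for x in xs]

r_raw = pearson_r(xs, ys)                         # curved: r understates the association
r_log = pearson_r(xs, [math.log(y) for y in ys])  # log(y) = 0.8 x is a straight line
print(round(r_raw, 2), round(r_log, 2))
```

Here y is determined exactly by x, yet r for the raw data is well below 1 because the relationship is not linear. After taking logarithms the points fall on a line and r is 1 up to rounding.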
Review examples
1) What is the explanatory variable?
Describe the form, direction, and strength
of the relationship.
Estimate r.
[Figure: scatterplot with an axis scaled in 1000's; r = 0.94]
2) If women always marry men 2 years older
than themselves, what is the correlation of the
ages between husband and wife?
r = 1, because age_man = age_woman + 2 is the
equation of a straight line with positive slope.
Thought quiz on correlation
1. Why is there no distinction between explanatory and response
variable in correlation?
2. Why do both variables have to be quantitative?
3. How does changing the units of one variable affect a correlation?
4. What is the effect of outliers on
correlations?
5. Why doesn’t a tight fit to a horizontal line
imply a strong correlation?