Chapter 7: Correlation


AP Statistics
Chapter 3: Scatterplots, Association, and Correlation

Relationships

“You can observe a lot by watching.” – Yogi Berra

Although this statement is said in jest, it carries much truth.
Many statistical studies look at multiple
variables to try to show a relationship between
one variable and another.
Most of the time, questions ask whether there is
an association between two variables.
It is important to know the definition of a few
words before we continue to explore this
concept.


When one variable affects another, one variable will be referred to as the explanatory variable and the other as the response variable.

Explanatory Variable
A variable that attempts to explain the observed outcomes.

Response Variable
A variable that measures an outcome of a study.

Association
Any relationship between two measured quantities that renders them statistically dependent. Simply put, there is a (direct or indirect) link between the two variables.
Example

Suppose that I randomly select 10 students from a Stats class and record their weights in pounds, getting the following results:
103, 201, 125, 179, 150, 138, 181, 220, 113, 126



Now, let’s say that I was going to pick another
random student, could I come up with a
prediction on how much that student was going
to weigh? (The mean is 153.6 and standard
deviation = 39.58)
How accurate will my prediction be?
Is there a way to improve this prediction?
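A minimal check of the summary statistics quoted above, written as a Python sketch (mine, not part of the lesson); the variable names are just for illustration.

```python
# A minimal check of the summary statistics quoted above (my sketch, not part
# of the lesson), using only Python's standard library.
import statistics

weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]

mean_weight = statistics.mean(weights)   # 153.6
sd_weight = statistics.stdev(weights)    # sample standard deviation, about 39.58

# With no other information, the best single-number prediction for the next
# student's weight is the mean; the SD describes how far off such a
# prediction typically is.
print(f"mean = {mean_weight}, sd = {round(sd_weight, 2)}")
```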
Example

Now let’s say that I have more information:
Weight: 103, 201, 125, 179, 150, 138, 181, 220, 113, 126
Height: 61, 68, 65, 69, 65, 61, 64, 72, 63, 62




The lists above give the weights (in pounds) and heights (in inches) of the same 10 students, paired in order.
Now, let’s say that I was going to pick another
random student, knowing their height is 65
inches, could I come up with a prediction on
how much that student was going to weigh?
How accurate will my prediction be?
Is this a better way to make a prediction?
Roles for Variables




When we have two variables (or bivariate data), it is always a good idea to make a picture!
As we graph the two variables, it is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis.
This determination is made based on the roles played by the variables.
When the roles are clear, the explanatory (or independent) variable goes on the x-axis, and the response (or dependent) variable goes on the y-axis.
Roles for Variables (cont.)



The roles that we choose for variables are more
about how we think about them rather than
about the variables themselves.
Just placing a variable on the x-axis doesn’t
necessarily mean that it explains or predicts
anything. And the variable on the y-axis may not
respond to it in any way.
In a cause and effect relationship, the
explanatory variable is the cause, and the
response variable is the effect. Regression is a
method for predicting the value of a dependent
variable y, based on the value of an independent
variable x.
Examples

A study looks at smoking and lung cancer.
1. Which (if any) is the explanatory variable?
2. Which (if any) is the response variable?
3. Is smoking a quantitative or categorical variable?
4. Is lung cancer a quantitative or categorical variable?
Examples

A study looks at cavities and milk drinking.
1. Which (if any) is the explanatory variable?
2. Which (if any) is the response variable?
3. Is cavities a quantitative or categorical variable?
4. Is milk drinking a quantitative or categorical variable?
Examples

A study looks at rainfall and SAT scores.
1. Which (if any) is the explanatory variable?
2. Which (if any) is the response variable?
3. Is rainfall a quantitative or categorical variable?
4. Are SAT scores a quantitative or categorical variable?
Looking at Scatterplots

Scatterplots may be the most common and most effective display for two-variable data.
In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others.

Scatterplots are the best way to start observing
the relationship and the ideal way to picture
associations between two quantitative
variables.
Example

Now let’s revisit the problem we started with:
Weight: 103, 201, 125, 179, 150, 138, 181, 220, 113, 126
Height: 61, 68, 65, 69, 65, 61, 64, 72, 63, 62


The lists above give the weights (in pounds) and heights (in inches) of the 10 students.
Graph the data and describe the distribution.
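One way to make this picture in software instead of on graph paper is sketched below; matplotlib is my choice of tool here, not something the lesson requires, and it assumes the paired height/weight lists above.

```python
# A hedged sketch of making the picture in software instead of on graph paper
# (matplotlib is my choice here, not something the lesson requires).
import matplotlib.pyplot as plt

heights = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]              # explanatory (x)
weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]    # response (y)

plt.scatter(heights, weights)
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.title("Heights vs. weights of 10 statistics students")
plt.show()
```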
Regression

We take two variables and attempt to use the explanatory variable to estimate (or predict) the value of the response variable; this process is called regression.
As you can imagine, sometimes there is a clear relationship between the two variables and sometimes there is no relationship at all.
Also, even when there is a relationship, some relationships are strong while others are weak.
To best describe a relationship, we should always describe its form, direction, and strength.
Interpreting Association




Form: Is there a pattern? Is the data linear or
curved? Are there clusters of data?
Strength: Is it weak or strong? Does the data
tightly conform or loosely conform?
Direction: If linear, does the data go up
(positively associated) or go down (negatively
associated) or is it a horizontal line (no
association)?
Deviations from pattern: Are there areas where
the data conform less to the pattern? Are there
any outliers?
(Percent of graduates taking SAT vs. the Average
SAT Math Score)

Attributes of a good scatterplot:
Consistent and uniform scale
Labels on both axes
Accurate placement of data
Data spread throughout the axes
Axis break lines if not starting at zero

To achieve these goals, you are required to do your scatterplots on graph paper.
Examples

Try to make a graph of each of the following situations, then describe the association:
1. Points allowed vs. winning percentage
2. World population vs. year
3. Amount of rain vs. crop yield
4. Height vs. weight
5. Height vs. GPA
6. Shoe size vs. probability of winning the National Spelling Bee
Bivariate Data - Review


The very first step in analyzing bivariate data is to graph it. When we graph this data, we use a scatterplot.
After we graph it, we examine three things to describe the association:
Form – linear or curved (we will discuss curved data later in this unit)
Direction – positive, negative, or no association
Strength – strong or weak
Looking at Scatterplots

Form:
If there is a straight line (linear) relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form.
How should we describe the form for these two graphs?
Looking at Scatterplots (cont.)

Direction:
A positive association generally tells us that as one variable increases, the other variable also increases.
A negative association generally tells us that as one variable increases, the other variable decreases.
In this example, there is a negative association between central pressure and maximum wind speed.
As the central pressure increases, the maximum wind speed decreases.
Looking at Scatterplots (cont.)

Direction:
When the points are scattered about randomly with no discernible pattern, we say that there is no association.
No association generally tells us that as one variable increases, we know nothing about the other variable.
The scatterplot above shows no association between the explanatory and response variables.
Looking at Scatterplots (cont.)

Strength:
This describes how “tightly” the points follow the form (or pattern).
A strong association has points that follow the pattern very tightly, whereas a weak association has points that follow the pattern in a much “looser” manner. Note: we will quantify the amount of scatter soon.
[Two example scatterplots: the first shows a strong, positive linear association; the second shows a weak, negative linear association.]
Looking at two-variable data


Let’s look at a real-life example to make this idea
a little more concrete…
Do taller people tend to have heavier weights?



This question is an example of how two variables play different roles in data.
Height is the explanatory (or predictor) variable and weight is the response variable.
Let’s take a look at the Detroit Pistons
2003-04 Detroit Pistons Roster

#  | PLAYER              | POS | HT   | WT  | DOB      | FROM               | YRs
7  | Chucky Atkins       | G   | 5-11 | 160 | 8/14/74  | South Florida '96  | 4
1  | Chauncey Billups    | G   | 6-3  | 202 | 9/25/76  | Colorado '97       | 6
41 | Elden Campbell      | C-F | 7-0  | 279 | 7/23/68  | Clemson '90        | 13
44 | Hubert Davis        | G   | 6-5  | 183 | 5/17/70  | North Carolina '92 | 11
   | Carlos Delfino**    | G   | 6-6  | 230 | 8/29/82  | Argentina          | R
   | Andreas Glyniadakis | C   | 7-1  | 280 | 8/21/81  | Greece             | R
   | Darvin Ham          | F-G | 6-7  | 240 | 7/23/73  | Texas Tech '96     | 6
   | Richard Hamilton    | G-F | 6-7  | 193 | 2/14/78  | Connecticut '00    | 4
   | Lindsey Hunter      | G   | 6-2  | 195 | 12/03/70 | Jackson State '93  | 10
31 | Darko Milicic       | F-C | 7-0  | 245 | 6/20/85  | Serbia-Montenegro  | R
13 | Mehmet Okur         | F   | 6-11 | 249 | 5/26/79  | Turkey             | 1
22 | Tayshaun Prince     | F   | 6-9  | 215 | 2/28/80  | Kentucky '02       | 1
39 | Zeljko Rebraca      | C   | 7-0  | 257 | 4/09/72  | Serbia-Montenegro  | 2
52 | Don Reid            | F   | 6-8  | 250 | 12/30/73 | Georgetown '95     | 8
   | Bob Sura            | G   | 6-5  | 200 | 3/25/73  | Florida State '95  | 8
3  | Ben Wallace         | F-C | 6-9  | 240 | 9/10/74  | Virginia Union '96 | 7
34 | Corliss Williamson  | F   | 6-7  | 245 | 12/04/73 | Arkansas '96       | 8
Going to work with the Pistons

PLAYER              | POS | HT   | WT
Chucky Atkins       | G   | 5-11 | 160
Chauncey Billups    | G   | 6-3  | 202
Elden Campbell      | C-F | 7-0  | 279
Hubert Davis        | G   | 6-5  | 183
Carlos Delfino**    | G   | 6-6  | 230
Andreas Glyniadakis | C   | 7-1  | 280
Darvin Ham          | F-G | 6-7  | 240
Richard Hamilton    | G-F | 6-7  | 193
Lindsey Hunter      | G   | 6-2  | 195
Darko Milicic       | F-C | 7-0  | 245
Mehmet Okur         | F   | 6-11 | 249
Tayshaun Prince     | F   | 6-9  | 215
Zeljko Rebraca      | C   | 7-0  | 257
Don Reid (FA)       | F   | 6-8  | 250
Bob Sura            | G   | 6-5  | 200
Ben Wallace         | F-C | 6-9  | 240
Corliss Williamson  | F   | 6-7  | 245
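Before plotting or computing anything with these data, the HT column has to be converted from feet-and-inches strings to a single quantitative unit. Below is a small helper sketch of one way to do that in Python; the function name and the truncated roster list are mine, shown only for illustration.

```python
# A small helper sketch (mine, not from the slides) for turning the roster's
# feet-inches strings into total inches, so height and weight are both
# quantitative. The truncated roster list is just for illustration.
def height_to_inches(ht: str) -> int:
    """Convert a string like '6-7' (6 feet 7 inches) into total inches."""
    feet, inches = ht.split("-")
    return int(feet) * 12 + int(inches)

roster = [
    ("Chucky Atkins", "5-11", 160),
    ("Chauncey Billups", "6-3", 202),
    ("Elden Campbell", "7-0", 279),
    # ... the remaining players follow the same pattern
]

heights = [height_to_inches(ht) for _, ht, _ in roster]
weights = [wt for _, _, wt in roster]
print(heights)   # [71, 75, 84] for the three players listed
```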
Using the Calculator

[Scatterplot of the Pistons' heights (x-axis, roughly 68–86 inches) vs. weights (y-axis, roughly 125–300 pounds).]
Correlation Coefficient
The correlation coefficient numerically
measures the strength of the linear association
between two quantitative variables.
Correlation = Linear Association = Relationship
There are three conditions that must be met before we can look at the correlation coefficient:
Quantitative Variables Condition: Correlation only applies to quantitative variables. Don't apply correlation to categorical data!
Straight Enough Condition: Correlation measures the strength of linear association, which is useless if the data is not linear!
Correlation Coefficient

There is one more condition:
Outlier Condition:
Outliers can distort the correlation dramatically.
An outlier can make an otherwise small correlation look big or hide a large correlation.
It can even give an otherwise positive association a negative correlation coefficient (and vice versa).
When you see an outlier, it's often a good idea to report the correlations with and without the point.

Note: when asked about correlation, you should memorize this phrase:
“With a correlation of (r), there is a (strong/weak), (positive/negative) linear association between the (explanatory variable) and the (response variable).”
Correlation Coefficient



Is there a “correlation” between a basketball team's heights and weights?
Is the association positive or negative?
Is the association strong or weak?
What do we do with correlation?
Examine pg. 151
Calculating Correlation Coefficient


The calculation of correlation is based on mean and standard deviation.
Remember that both mean and standard deviation are not resistant measures.

r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
Calculating Correlation Coefficient


What do the contents of the parentheses look like?
What happens when the values are both from the lower half of the population? From the upper half?

r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

The contents of the parentheses are the formula for calculating z-scores.
When both values come from the lower half, both z-scores are negative, so their product is positive.
When both values come from the upper half, both z-scores are positive, so their product is again positive.
Calculating Correlation Coefficient

What happens when one value is from the lower half of the population but the other value is from the upper half?

r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

One z-score is positive and the other is negative, so their product is negative.
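To make the formula concrete, here is a sketch (mine, not from the slides) that computes r exactly as written: standardize each x and y, multiply the z-scores pairwise, sum the products, and divide by n − 1. It reuses the ten students' heights and weights from earlier.

```python
# A sketch (mine, not from the slides) that computes r exactly as the formula
# reads: standardize each x and y, multiply the z-scores pairwise, sum the
# products, and divide by n - 1.
import statistics

def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

heights = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]
weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]
print(round(correlation(heights, weights), 3))   # roughly 0.85 for these data
```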
Using the TI-84 to calculate r

If you have a TI-84, you must turn your Diagnostic on;
you need to enter “DiagnosticOn” from the “Catalog”

TI-89 users don’t need
to worry about this
operation since the
Diagnostic is
automatically “on” in
your calculator
Using the TI to calculate r

Run LinReg(a+bx) with the explanatory variable as the
first list, and the response variable as the second list
[Calculator screenshots: TI-84 and TI-89.]
Using the TI to calculate r

The results are the slope and y-intercept of the regression equation (more on that later) and the values of r and r² (more on r² later as well).
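If a TI isn't handy, the same quantities can be obtained in Python; this sketch is mine, assumes SciPy is installed, and reuses the student height/weight lists from earlier.

```python
# A sketch (mine) of getting the same output in Python, assuming SciPy is
# installed and reusing the student height/weight lists from earlier.
from scipy.stats import linregress

heights = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]              # explanatory (x)
weights = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]    # response (y)

result = linregress(heights, weights)
print("b (slope)     =", result.slope)
print("a (intercept) =", result.intercept)
print("r             =", result.rvalue)
print("r^2           =", result.rvalue ** 2)
```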
Facts about correlation


Both variables need to be quantitative.
Because the data values are standardized, it does not matter what units we use for each of the variables.
Also, since r uses standardized values of the observations, r does not change if we change the units of x, y, or both (in other words, we can multiply or divide x, y, or both by a positive constant, or add or subtract a constant, and r will stay the same).
The value of r is unit-less.
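A quick numerical check of this fact (my own sketch, not part of the slides): rescaling one variable by a positive constant, say inches to centimeters, and shifting the other by a constant leaves r untouched.

```python
# A quick check (mine, not from the slides) that r ignores units: multiplying
# by a positive constant and/or adding a constant does not change r.
from scipy.stats import pearsonr

heights_in = [61, 68, 65, 69, 65, 61, 64, 72, 63, 62]
weights_lb = [103, 201, 125, 179, 150, 138, 181, 220, 113, 126]

heights_cm = [h * 2.54 for h in heights_in]      # rescale x (positive constant)
weights_shifted = [w + 100 for w in weights_lb]  # shift y by a constant

r_original, _ = pearsonr(heights_in, weights_lb)
r_transformed, _ = pearsonr(heights_cm, weights_shifted)
print(abs(r_original - r_transformed) < 1e-12)   # True: r is unchanged
```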
Facts about correlation





The value of r will always be between -1 and 1.
Values closer to -1 reflect strong negative linear
association.
Values closer to +1 reflect strong positive linear
association.
Values close to 0 reflect no linear association.
Correlation does not measure the strength of
non-linear relationships
Facts about correlation


Correlation is blind to the roles of the explanatory and response variables: swapping x and y gives the same r.
Even though you may get an r value close to -1 or 1, it does not mean that you can say that the explanatory variable causes changes in the response variable.
Scatterplots and correlation coefficients never prove causation.
A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.
Facts about correlation


The value of r is a measure of the strength of a linear relationship: it measures how closely the data fall to a straight line. An r value near 0, however, does not imply that there is no relationship, only that there is no linear relationship. For example, quadratic or sinusoidal data can have an r close to 0 even though there is a strong relationship present (a quick numerical sketch of the quadratic case follows this slide).
r measures the correlation between two variables in a sample of observations from the population of interest. Thus, r is the sample correlation coefficient, which is used to estimate ρ (rho), the population correlation coefficient.
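As noted above, here is a small sketch (mine, not from the slides) of the quadratic case: y is determined exactly by x, yet r comes out essentially 0 because the pattern is curved, not linear.

```python
# A sketch (mine, not from the slides) of the quadratic case mentioned above:
# y is determined exactly by x, yet r is essentially 0 because the pattern
# is curved, not linear.
from scipy.stats import pearsonr

xs = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]          # a perfect (but nonlinear) relationship

r, _ = pearsonr(xs, ys)
print(r)                           # essentially 0, up to floating-point noise
```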
What Can Go Wrong?

Don’t say “correlation” when you mean
“association.”
 More
often than not, people say correlation when
they mean association.
 The word “correlation” should be reserved for
measuring the strength and direction of the linear
relationship between two quantitative variables.
What Can Go Wrong?

Don’t correlate categorical variables.
 Be
sure to check the Quantitative Variables
Condition.

Don’t confuse “correlation” with “causation.”
 Scatterplots
and correlations never demonstrate
causation.
 These statistical tools can only demonstrate an
association between variables.
What Can Go Wrong? (cont.)

Be sure the association is linear.
There may be a strong association between two variables even when that association is nonlinear; in such cases the correlation can still be near 0 – why do you think that is?
What Can Go Wrong? (cont.)

Don’t assume the relationship is linear just
because the correlation coefficient is high.

Here the correlation is
0.979, but the relationship is
actually bent.
What Can Go Wrong? (cont.)

Beware of outliers.
Even a single outlier can dominate the correlation value.
Make sure to check the Outlier Condition.
Without the outlier, the correlation would be 0; but with the outlier, the correlation, deceptively, is much closer to 1.
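A sketch of this effect (mine, with made-up illustrative numbers): ten points with essentially no linear pattern, then one extreme point that manufactures a large correlation.

```python
# A sketch (mine, with made-up numbers) showing how one outlier can dominate r.
from scipy.stats import pearsonr

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [5, 2, 8, 3, 7, 1, 9, 4, 6, 2]         # scattered values, no real trend

r_without, _ = pearsonr(xs, ys)
print(round(r_without, 2))                  # about -0.05: essentially no correlation

r_with, _ = pearsonr(xs + [50], ys + [50])  # add a single far-away point
print(round(r_with, 2))                     # jumps to about 0.96
```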
What have we learned?


We examine scatterplots for form, direction,
strength, and unusual features.
Although not every relationship is linear, when
the scatterplot is straight enough, the correlation
coefficient is a useful numerical summary.
 The
sign of the correlation tells us the direction of the
association.
 The magnitude of the correlation tells us the strength
of a linear association.
 Correlation has no units, so shifting or scaling the
data, standardizing, or swapping the variables has no
effect on the numerical value.
What have we learned? (cont.)

Doing Statistics right means that we have to
Think about whether our choice of methods is
appropriate.
 Before
finding or talking about a correlation, check
the Straight Enough Condition.
 Watch out for outliers!

Don’t assume that a high correlation or strong
association is evidence of a cause-and-effect
relationship—beware of lurking variables!
What have we learned? (cont.)

Unusual features:
 Look
for the unexpected.
 Often the most interesting thing to see in a scatterplot
is the thing you never thought to look for.
 One example of such a surprise is an outlier standing
away from the overall pattern of the scatterplot.
 Clusters or subgroups should also raise questions.