Transcript Chapter 7

Chapter 7: Scatterplots, Association, and Correlation
Scatterplots


…show patterns, trends, relationships, and even
the occasional extraordinary value.
…are the best way to visualize associations
between two quantitative variables.
Roles for Variables


First we need to determine which variable to put on each
axis, depending on the roles of the variables. The roles
we choose are determined by how we think about them
(height/weight example).
When the roles are defined, the explanatory or predictor
variable goes on the x-axis, and the response variable
goes on the y-axis.
Caution: Placing a variable on the x-axis
doesn’t mean it explains or predicts
anything, and the variable we place on
the y-axis may not respond to the x-axis
variable in any way.
Describing Scatterplots

When examining and describing
scatterplots, look for and talk about four
things:

direction (positive or negative)

form (linear or something else)

strength (strong, moderate, or weak)

unusual features (outliers, groupings, etc.).
Scatterplots (Direction)
Upper left to lower right: negative direction.
Lower left to upper right: positive direction.
Scatterplots (Direction)
Can the NOAA predict where a hurricane will go?

This scatterplot shows a
negative direction
between the year and
the prediction errors
made by NOAA.

What does this mean in the
context of the data?

As the years have
passed, NOAA’s
predictions have
improved.
Scatterplots (Direction)

This example shows a negative
association between central
pressure and maximum wind
speed.

What does this mean in the
context of the data?

As the central
pressure increases,
the maximum wind
speed decreases.
Scatterplots (Form)
Form:
If there is a straight
(linear) relationship,
the data points will
appear as a cloud
stretched out in a
generally consistent
direction.

Scatterplots (Form)

If the relationship isn’t straight, but contains a
(gentle) curve, we can often find ways to make
this non-linear data more nearly straight.
See the
curve??

As we’ve already discussed, this process is
called ‘re-expressing’ the data.
Scatterplots (Strength)

What is the strength of the association?
At one extreme, the
points appear to follow a
single pattern (straight,
curved, etc.)
At the other
extreme, points
appear as a vague
cloud with no
discernable trend
or pattern:
Scatterplots (Unusual Features)

Outliers
Outliers can be obvious
and look completely out of
place like this one.
Or they can be less obvious
like this one…
Scatterplots (Unusual Features)

Groupings
Correlation


Data collected from students in a class; heights
and weights:
Right There
What do you see?
Moderate to
strong positive
linear association
with one possible
high outlier.
Correlation (cont.)

HOW strong?

If we had to put a number on the strength, we would not
want it to depend on the units we used.
After all, the strength of
the association between
height and weight shouldn’t
change if height is measured
in centimeters instead of
inches, or weight in kilograms
instead of pounds…right?
Correlation (cont.)



Since the units shouldn’t
matter, statisticians have
agreed to remove them
altogether and create a
unit-less measure of
association using z-scores.
If we standardize both
variables and write the
coordinates of each point
as (z_x, z_y)...
We can create a
scatterplot of
standardized weights and
heights:
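The standardizing step can be sketched in a few lines of Python. The heights and weights below are made-up illustration values, not the class data, and `z_scores` is a hypothetical helper name.

```python
from statistics import mean, stdev

# Hypothetical heights (inches) and weights (pounds); NOT the class data.
heights = [62, 64, 66, 68, 70, 72]
weights = [115, 130, 128, 150, 160, 175]

def z_scores(values):
    """Standardize: subtract the mean, then divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

z_heights = z_scores(heights)
z_weights = z_scores(weights)

# Each student becomes a unit-less point (z_x, z_y).
standardized_points = list(zip(z_heights, z_weights))
for zx, zy in standardized_points:
    print(f"({zx:+.2f}, {zy:+.2f})")
```

After standardizing, each variable has mean 0 and standard deviation 1, so the units (inches, pounds) have dropped out.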
Correlation (cont.)

The underlying linear pattern seems steeper in
the standardized plot than in the original…WHY?
Because both axes of the standardized plot use the
same scale. Equal scaling gives a neutral way
of drawing the scatterplot and a
fairer impression of the strength of
the association.
Correlation (cont.)

The green points strengthen
the impression of a positive
association between height
and weight.

The red points tend to weaken
the positive association.

Blue points (with z-scores of
zero) don’t “vote” either way.
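One way to read the "voting" idea, assuming the colors follow the sign of the product z_x · z_y (positive product strengthens, negative weakens, zero casts no vote). The data and the `vote` helper are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical data chosen so that some z-scores are exactly zero.
heights = [60, 64, 66, 68, 72]
weights = [120, 150, 140, 145, 170]

def z_scores(values):
    """Standardize: subtract the mean, then divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def vote(zx, zy):
    """Classify a point by the sign of the product of its z-scores."""
    product = zx * zy
    if product > 0:
        return "strengthens"   # assumed to match the green points
    if product < 0:
        return "weakens"       # assumed to match the red points
    return "no vote"           # assumed to match the blue points

votes = [vote(zx, zy) for zx, zy in zip(z_scores(heights), z_scores(weights))]
print(votes)
```

Points above average on both variables, or below average on both, have a positive product; points above average on one and below on the other have a negative product.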
Correlation (cont.)

The correlation coefficient (r) gives us a
numerical measurement of the strength of
the linear relationship between the
explanatory and response variables.
$$r = \frac{\sum z_x z_y}{n - 1}$$
For the students’ heights and weights, the
correlation is 0.644.
What does this mean in terms of strength?
We’ll address this shortly.
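A minimal sketch of the calculation: the sum of the products of z-scores, divided by n − 1. The data are hypothetical (this will not reproduce the class value of 0.644), and `correlation` is an illustrative helper name.

```python
from statistics import mean, stdev

# Hypothetical heights (inches) and weights (pounds); NOT the class data.
heights = [62, 64, 66, 68, 70, 72]
weights = [115, 130, 128, 150, 160, 175]

def z_scores(values):
    """Standardize: subtract the mean, then divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1)."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

r = correlation(heights, weights)
print(f"r = {r:.3f}")
```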
Correlation Conditions


Correlation measures the strength of
the linear association between two
quantitative variables.
Before you use correlation, you must
check several conditions:
• Quantitative Variables Condition
• Straight Enough Condition
• Outlier Condition
Correlation Conditions (cont.)


$$r = \frac{\sum z_x z_y}{n - 1}$$

Straight Enough Condition:
• You can calculate a correlation coefficient for any
  pair of variables.
• But correlation measures the strength only of linear
  associations between quantitative variables; r is
  meaningless when the relationship is not linear.
Outlier Condition:
• Outliers can distort the correlation dramatically by,
  for example, making a small correlation look large.
• Outliers can even give an otherwise positive
  association a negative correlation coefficient (and
  vice versa).
• When you see an outlier, it’s often a good idea to
  report the correlations with and without the point.
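Reporting the correlation with and without a suspect point might look like this. The numbers are invented for illustration, and `correlation` is a hypothetical helper.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data: five points with a weak association...
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
# ...plus one point far from the rest.
outlier = (15, 15)

r_without = correlation(xs, ys)
r_with = correlation(xs + [outlier[0]], ys + [outlier[1]])

print(f"without the outlier: r = {r_without:.3f}")
print(f"with the outlier:    r = {r_with:.3f}")
```

In this made-up data set a single added point turns a weak correlation into one that looks strong.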
Correlation Notes

The sign of a correlation coefficient gives the
direction of the linear association.

Correlation is always between –1 and +1.

Correlation can be exactly equal to –1 or +1,
but these values are unusual in real data
because they mean that all the data points fall
exactly on a straight line.
Correlation Notes (cont.)

Correlation treats x and y symmetrically;
the correlation of x with y is the same
as the correlation of y with x (i.e., it
doesn’t matter which axis the variables
are on in the scatterplot).

Correlation is (like a z-score) unit-less.

Because correlation is based on z-scores,
it is not affected by changes in
the center or scale of either variable.
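These properties can be checked numerically. The sketch below uses hypothetical data and a hypothetical `correlation` helper; it swaps the variables, converts inches to centimeters and pounds to kilograms, and shifts the heights by a constant.

```python
from math import isclose
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical heights (inches) and weights (pounds).
heights_in = [62, 64, 66, 68, 70, 72]
weights_lb = [115, 130, 128, 150, 160, 175]

heights_cm = [h * 2.54 for h in heights_in]         # change of scale
weights_kg = [w * 0.45359237 for w in weights_lb]   # change of scale
heights_shifted = [h + 100 for h in heights_in]     # change of center

r_original = correlation(heights_in, weights_lb)
r_swapped = correlation(weights_lb, heights_in)
r_metric = correlation(heights_cm, weights_kg)
r_shifted = correlation(heights_shifted, weights_lb)

print(r_original, r_swapped, r_metric, r_shifted)
```

All four values agree up to floating-point rounding.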
Correlation Notes (cont.)


Correlation measures the strength of the
linear association between the two
variables; variables can have a strong
association but still have a small
correlation if their association is nonlinear.
Correlation, like mean and standard
deviation, is sensitive to outliers. A
single outlier can make a small
correlation large or make a large one
small.
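A small illustration of the first point, using invented data: y is completely determined by x (y = x²), yet the correlation is essentially zero because the pattern is not linear. `correlation` is a hypothetical helper.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data: a perfect but curved (U-shaped) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = correlation(xs, ys)
print(f"r = {r_curved:.3f}")
```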
Correlation Notes (cont.)

Correlation is a mathematical
calculation…you give me ANY two
quantitative variables and I can tell my
technology to calculate r; it will always
give me a value between -1 and +1. The
mere existence of the number, however,
does not mean the variables have an
association!

r must be looked at in the context of the
data (shark attacks/ice cream vs.
height/weight).
Correlation ≠ Causation
Whenever we have a strong correlation,
it is tempting to conclude that the
predictor variable has caused the
change in the response variable.
• Scatterplots and correlation coefficients
  never prove causation.
• A hidden variable that stands behind a
  relationship and influences it by
  simultaneously affecting the other two
  variables is called a lurking variable.
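A sketch of a lurking variable, built on the shark attacks / ice cream example. All numbers are made up: temperature is assumed to drive both variables, and neither one affects the other. `correlation` is a hypothetical helper.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical lurking variable: daily temperature.
temperature = [60, 65, 70, 75, 80, 85, 90]

# Fixed "noise" values keep the example deterministic.
ice_noise = [3, -2, 1, -3, 2, -1, 0]
shark_noise = [-0.5, 0.5, 0.0, 1.0, -1.0, 0.5, -0.5]

# Both variables depend on temperature, not on each other.
ice_cream_sales = [2 * t + e for t, e in zip(temperature, ice_noise)]
shark_attacks = [0.2 * t + e for t, e in zip(temperature, shark_noise)]

r_ice_shark = correlation(ice_cream_sales, shark_attacks)
print(f"r(ice cream, shark attacks) = {r_ice_shark:.3f}")
```

The correlation comes out high even though, by construction, ice cream sales have no effect on shark attacks.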

Correlation Tables

It is common to compute correlations between
each possible pair of variables and to arrange
these correlations in a correlation table.
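A correlation table can be sketched as a nested dictionary holding r for every pair of variables. The three variables and their values are hypothetical, as is the `correlation` helper.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical quantitative variables measured on the same six students.
variables = {
    "height": [62, 64, 66, 68, 70, 72],
    "weight": [115, 130, 128, 150, 160, 175],
    "shoe_size": [7, 8, 8, 10, 10, 12],
}

names = list(variables)
table = {a: {b: correlation(variables[a], variables[b]) for b in names}
         for a in names}

# Print the table with variable names as row and column headers.
print(" " * 10 + "".join(f"{name:>10}" for name in names))
for a in names:
    print(f"{a:<10}" + "".join(f"{table[a][b]:>10.3f}" for b in names))
```

The diagonal is 1 (each variable correlates perfectly with itself) and the table is symmetric, so only one triangle is usually reported.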
What Can Go Wrong?

Don’t say “correlation” when you mean
“association.”

The word “correlation” should only be
used when discussing ‘r,’ the actual
number that measures the strength
and direction of the linear relationship
between two quantitative variables.
What Can Go Wrong?


Don’t correlate categorical variables.
• Be sure to check the Quantitative
  Variables Condition.
Don’t confuse “correlation” with “causation.”
• Scatterplots and correlations never
  demonstrate causation; they only
  demonstrate an association between
  variables.
What Can Go Wrong? (cont.)

Be sure the association is linear.
• Two variables can have a strong association
  that is nonlinear, and correlation will not
  describe it well.
What Can Go Wrong? (cont.)

Don’t assume the relationship is linear just
because the correlation coefficient is high.
Here the
correlation is
0.979, but the
relationship is
actually bent.
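The same effect shows up in a toy data set (hypothetical values, not the data behind the 0.979 above): y = x² over a range where it only rises gives a very high r, even though every step up is larger than the last. `correlation` is a hypothetical helper.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical data: a curved but steadily increasing relationship.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

r_bent = correlation(xs, ys)

# Increase in y for each unit step in x; a straight line would give equal steps.
steps = [later - earlier for earlier, later in zip(ys, ys[1:])]

print(f"r = {r_bent:.3f}")
print("step sizes:", steps)
```

A high r alone does not show that the form is linear; the growing step sizes reveal the bend, which is why the scatterplot still has to be checked.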

What Can Go Wrong? (cont.)

Beware of outliers.

Even a single outlier
can dominate the
correlation value.

Make sure to check
the Outlier Condition.
What have we learned?


We examine scatterplots for direction, form,
strength, and unusual features.
Although not every relationship is linear,
when the scatterplot is straight enough, the
correlation coefficient is a useful numerical
summary:
• The sign of the correlation tells us the
  direction of the association.
• The magnitude of the correlation tells us
  the strength of a linear association.
• Correlation has no units, so shifting or
  scaling the data, standardizing, or
  swapping the variables has no effect on the
  numerical value.
What have we learned? (cont.)


Doing Statistics right means that we have to
Think about whether our choice of methods is
appropriate.
• Before finding or talking about a correlation,
  check the Straight Enough Condition.
• Watch out for outliers!
Don’t assume that a high correlation or strong
association is evidence of a cause-and-effect
relationship—and beware of lurking variables!