Scatterplots and correlations


Objectives
2.1 Scatterplots

Scatterplots

Explanatory and response variables

Interpreting scatterplots

Outliers
Adapted from authors’ slides © 2012 W.H. Freeman and Company
Relationship of two numerical variables
Most statistical studies involve more than one variable and
the primary questions are about their relationships.
Questions one can ask:

• Which variable(s) are explanatory and which are responses?
    - Do we want to know how one variable affects the value of another?
    - Or do we simply want to measure their association?
• How is the relationship best described?
    - Is the association positive or negative?
    - How can we predict one variable from the value of the other(s)?
    - Can a straight line be used effectively, or is the relationship more complex?
• How well (close) do the data fit the relationship we describe?
    - How strong (or weak) is the relationship?
    - Is the relationship “significant”? (Can we reject H0: no association?)
    - How do the data deviate from the overall pattern?
Examples: variables of interest
Here are two data sets that may interest you:
• The weight of a calf (at a certain week) and its girth. Does the weight of the
calf influence its girth, and what sort of relationship is there? Can we reliably
predict the girth given the weight? How does the relationship change over time?
• Your midterm scores. Is there a relationship among the scores on midterm 1,
midterm 2, and midterm 3? Is this relationship strong or weak? If the relationship
is strong, then your final grade is pretty much clear. However, if the relationship
is weak, then those who did well still need to work hard, and those who did poorly
can still change their grade by working hard.
These data sets are available on my website.
Our objective in the next few lectures is to plot these data in a meaningful way,
look at the plot for a relationship, and describe that relationship (this is
descriptive statistics). Then we will describe how to measure the strength of the
relationship and how to do prediction.
Explanatory and response variables
A response variable measures or records an outcome of a study. An
explanatory variable explains changes in the response variable.
Typically, the explanatory variable is plotted on the x axis, and the
response variable is plotted on the y axis.
Two numerical variables for each of 16 students. We are interested in the
relationship between the two variables: How is one affected by changes in the
other one?
[Figure: scatterplot "Blood Alcohol as a function of Number of Beers". x axis:
Number of Beers (explanatory variable: number of beers), 0 to 10. y axis: Blood
Alcohol Level (mg/ml) (response variable: blood alcohol content), 0.00 to 0.20.]
Looking at relationships: Scatterplots
In a scatterplot, one axis is used to represent each of the variables,
and the data are plotted as points on the graph.
We look for an overall pattern and for deviations from the pattern.
Student  Beers  BAC
1        5      0.1
2        2      0.03
3        9      0.19
4        8      0.12
5        3      0.04
6        7      0.095
7        3      0.07
8        5      0.06
9        3      0.02
10       5      0.05
11       4      0.07
12       6      0.1
13       5      0.085
14       7      0.09
15       1      0.01
16       4      0.05
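To make the idea of "one axis per variable, one point per student" concrete, here is a minimal sketch that draws a crude text scatterplot of the beer data using only the Python standard library. The function name `text_scatterplot`, the grid size, and the rounding of each BAC value to the nearest 0.02 tick are illustrative assumptions, not part of the original slides; a real plot would normally come from plotting software.

```python
# (beers, BAC) pairs for the 16 students, in student order.
data = [
    (5, 0.10), (2, 0.03), (9, 0.19), (8, 0.12),
    (3, 0.04), (7, 0.095), (3, 0.07), (5, 0.06),
    (3, 0.02), (5, 0.05), (4, 0.07), (6, 0.10),
    (5, 0.085), (7, 0.09), (1, 0.01), (4, 0.05),
]

def text_scatterplot(points, x_max=9, y_max=0.20, y_step=0.02):
    """Return a rough character plot: explanatory variable across, response up."""
    n_rows = round(y_max / y_step)
    # grid[row][col]: row 0 is y = 0 (bottom), col is the number of beers.
    grid = [[" "] * (x_max + 1) for _ in range(n_rows + 1)]
    for x, y in points:
        row = round(y / y_step)  # snap y to the nearest tick (loses precision)
        grid[row][x] = "*"       # students that land on the same cell overlap
    lines = []
    for row in range(n_rows, -1, -1):  # top row first
        lines.append(f"{row * y_step:4.2f} |" + " ".join(grid[row]))
    lines.append("     +" + "-" * (2 * x_max + 1))
    lines.append("      " + " ".join(str(x) for x in range(x_max + 1)))
    return "\n".join(lines)

plot = text_scatterplot(data)
print(plot)
```

The points drift from the lower left to the upper right, which is the overall pattern to look for before examining deviations from it.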
Interpreting scatterplots

After plotting two variables on a scatterplot, we describe the
relationship by examining the direction, form, and strength of the
association. We look for an overall pattern …


Direction: positive, negative, no direction.

Form: straight line, curved, clusters, no pattern.

Strength: how closely the points fit the “form”.
… and for deviations from that pattern.

Do the points fit more closely for one part of the form than for
another?

Are there outliers?

Would it be appropriate to extrapolate the relationship we see?
Form and direction of an association
[Figure: example scatterplots. Panel titles: Straight Line Relationship, No
Relationship, Curved Relationship. Direction labels: Negative, Positive,
Positive, Neither.]
Positive or Negative?
Positive association: High values of the response variable tend to
occur together with high values of the explanatory variable.
Negative association: High values of the response variable tend to
occur together with low values of the explanatory variable.
Flat (no) association: The values of the response variable are
similarly distributed for all values of the other variable. There is no
information about the response variable that can be predicted from the
explanatory variable.
Complex association: For some values of the explanatory variable
the variables appear to be positively associated, but for other values of
that variable they appear to be negatively associated (curvature).
Or information other than the general (average) level of the response
variable can be predicted from the explanatory variable.
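As a rough numerical check of direction (a sketch, not a formal method), one can compare the average response among individuals with low and high values of the explanatory variable. The example uses the beer and blood alcohol data from the student table; the cut at the mean number of beers and the names `low`, `high`, and `direction` are illustrative assumptions.

```python
# (beers, BAC) pairs for the 16 students.
data = [
    (5, 0.10), (2, 0.03), (9, 0.19), (8, 0.12),
    (3, 0.04), (7, 0.095), (3, 0.07), (5, 0.06),
    (3, 0.02), (5, 0.05), (4, 0.07), (6, 0.10),
    (5, 0.085), (7, 0.09), (1, 0.01), (4, 0.05),
]

mean_beers = sum(beers for beers, _ in data) / len(data)

# Split the students at the mean number of beers (an arbitrary cut).
low = [bac for beers, bac in data if beers < mean_beers]
high = [bac for beers, bac in data if beers >= mean_beers]

mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)

# High responses going with high explanatory values suggest a positive
# association; the reverse suggests a negative one.
if mean_high > mean_low:
    direction = "positive"
elif mean_high < mean_low:
    direction = "negative"
else:
    direction = "flat"

print(direction, round(mean_low, 3), round(mean_high, 3))
```

A two-group comparison like this cannot detect a complex (curved) association, so it is no substitute for looking at the scatterplot.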
Strength of the association
The strength of the relationship between the two variables can be
seen by how much variation, or scatter, there is around the main form.
This is a weak positive relationship.
For a particular median household
income (X), you cannot predict the
state per capita income (Y) very well.
Y varies widely for a given X.
This is a very strong positive
relationship. The daily amount of gas
consumed can be predicted quite
accurately for a given temperature
value. Y varies very little for a given X.
How to scale a scatterplot
Same data in all four plots. There is a negative relationship between swim time and pulse rate.
Using an inappropriate
scale for a scatterplot will
give an incorrect
impression and
interpretation of the data.
Both variables should be
given a similar amount of
space:
• The plot is roughly square.
• Space cannot be reduced
without removing some
points.
Outliers
An outlier is a data point that is exceptionally unusual or unexpected.
It falls outside of the overall pattern of the relationship.
This point is unusual in its values but it
is not an outlier of the relationship.
This point is not in line with the others.
It is an outlier of the relationship.
Objectives
2.2 Correlation

The correlation coefficient r

Properties of the correlation coefficient
Measuring relationship: correlation


The correlation coefficient is a measure of the direction and
strength of a linear relationship.
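It is calculated using the standardized values (z-scores) of both the x
and y variables.

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

In each term, the first factor is the z-score for x and the second factor is
the z-score for y. Here $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$
and $s_y$ are the sample standard deviations.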

Compute this with your calculator or software!
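As one way to do this in software, the sketch below computes r as the sum of products of z-scores divided by n − 1, using the beer and blood alcohol data for the 16 students. The helper names (`mean`, `sample_sd`, `z_scores`, `correlation`) are illustrative assumptions, not functions from any particular package.

```python
from math import sqrt

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def z_scores(values):
    """Standardize each value: (value - mean) / standard deviation."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """Sum of products of paired z-scores, divided by n - 1."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

r = correlation(beers, bac)
print(round(r, 3))  # roughly 0.89 for these data
```

A value this close to 1 agrees with the scatterplot: a fairly strong, positive, straight-line relationship.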
r is positive if the relationship is positive and negative if the
relationship is negative.
r is always between −1 and 1. The closer it is to −1 or 1, the
stronger the relationship.
But close to 0 does not necessarily mean no relationship.
r has no units of measurement and does not depend on the units for
x and y.
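The claim that r does not depend on units can be checked numerically. In this sketch the beer data are rescaled by two hypothetical unit changes (12 fluid ounces per beer, and BAC multiplied by 1000); both conversion factors and the function name `correlation` are assumptions made only for illustration.

```python
from math import sqrt

def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

# Hypothetical unit changes (positive scale factors only).
ounces = [12 * b for b in beers]
bac_scaled = [1000 * v for v in bac]

r_original = correlation(beers, bac)
r_rescaled = correlation(ounces, bac_scaled)
print(round(r_original, 3), round(r_rescaled, 3))
```

The two values agree because z-scores are unchanged when a variable is multiplied by a positive constant: the mean and the standard deviation scale by the same factor.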
The correlation coefficient r
Time to swim: $\bar{x} = 35$, $s_x = 0.70$
Pulse rate: $\bar{y} = 140$, $s_y = 9.5$
Correlation: $r = -0.75$
This indicates a moderately
strong negative relationship.
The value of r would be the same if, for example, “Time to
Swim” was measured in seconds and “Pulse Rate” was
measured in beats per hour.
"Time to Swim" is the explanatory variable here, and
belongs on the x axis. However, the value of r is the same
regardless of how we label or plot the variables.
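That r is the same whichever variable is called x can also be checked directly. The swim data are not listed in these notes, so this sketch uses the beer and blood alcohol data instead; the function name `correlation` is an illustrative assumption.

```python
from math import sqrt

def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

r_xy = correlation(beers, bac)  # beers treated as explanatory
r_yx = correlation(bac, beers)  # roles swapped
print(round(r_xy, 3), round(r_yx, 3))
```

The formula multiplies the two z-scores in each pair, and a product does not care about the order of its factors, so swapping the roles of the variables leaves r unchanged.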
r ranges from
−1 to +1
The correlation coefficient r
quantifies the strength and
direction of a linear
relationship between two
quantitative variables.
Strength: how closely the
points follow a straight line.
Direction: r is positive when
individuals with higher X
values tend to have higher
values of Y, and negative
when individuals with higher
X values tend to have lower
values of Y.
Direction? Form? Strength?
Automobiles in Albuquerque were randomly selected (at a shopping center) in 1974 and
given an emissions test. Total hydrocarbon emissions level and model year were
observed.
Direction: negative
Form: straight line?
Strength: weak
r = −0.483
Direction? Form? Strength?
Pollutants were observed over a 28 day period. The carbon pollutants and the ozone
level are to be related.
Direction: positive
Form: straight line
Strength: moderate
r = 0.687
Direction? Form? Strength?
The efficiency of an industrial biofilter is tested at different temperature levels.
Direction: positive
Form: straight line
Strength: moderate to strong
r = 0.891
Direction? Form? Strength?
The nickel-to-iron ratio was measured in oat plants and the plant age (in days after
emergence) was also recorded.
Direction: complex (positive until 50 days, then negative)
Form: curved
Strength: strong (if the curve is taken into account)
r = 0.479
The correlation measures
the degree to which the
points fit a straight line,
not a curve.
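A small made-up example shows how far r can understate a curved relationship. The data below are hypothetical (they are not the oat plant measurements): y rises until x = 5 and then falls symmetrically, so y is completely determined by x, yet r over the whole range is essentially zero. The function name `correlation` is also an illustrative assumption.

```python
from math import sqrt

def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - mx) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

# Hypothetical data: a perfect curve that peaks at x = 5.
x = list(range(11))
y = [25 - (v - 5) ** 2 for v in x]

r_all = correlation(x, y)              # whole curve
r_rising = correlation(x[:6], y[:6])   # x from 0 to 5 only
r_falling = correlation(x[5:], y[5:])  # x from 5 to 10 only
print(round(r_all, 3), round(r_rising, 3), round(r_falling, 3))
```

Each half on its own gives a large r (positive on the rising part, negative on the falling part), but over the full range the two halves cancel. This is why the form should be checked on a scatterplot before r is interpreted.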
Example: correlations between midterm scores
            Midterm 1   Midterm 2   Midterm 3
Midterm 1
Midterm 2   0.256
Midterm 3   0.435       0.306
We can see from the correlations above that, as expected, the association
between the midterm scores is positive (all of the correlation
coefficients are greater than zero). However, none of the correlation
coefficients is large, so the association is not strong.
This means that a midterm score cannot be predicted well from the
previous midterm scores.
This is good news: it appears that you can improve!
The correlation is strongest between midterm 1 and midterm 3, which I did
not expect!