Examining Relationships

Scatterplots and Correlation
The “W’s”
• When you examine the relationship between two
or more variables, first ask the familiar key
questions about the data:
– Who (are the individuals described by the data)
– What (are the variables and in what units)
– Why (were the data collected, if possible)
– When, where, how, and by whom (were the data
produced)
• Note: the answers to “who” and “what” are
essential.
Response variable (dependent variable) – measures
an outcome of a study
Explanatory variable (independent variable) – helps
explain or influences changes in a response variable
Lurking variable – a variable in the background
that can strongly influence the relationship
between two variables
Looking at Scatterplots
• Scatterplots may be the most common
and most effective display for data.
– In a scatterplot, you can see patterns, trends,
relationships, and even the occasional
extraordinary value sitting apart from the
others.
• Scatterplots are the best way to start
observing the relationship and the ideal
way to picture associations between two
quantitative variables.
Looking at Scatterplots (cont.)
• When looking at scatterplots, we will look
for direction, form, strength, and unusual
features.
• Direction:
– A pattern that runs from the upper left to the
lower right is said to have a negative
direction.
– A trend running the other way has a positive
direction.
Looking at Scatterplots (direction)
• Figure 3.1 from the
text shows a
negative
association
between the
percent of high
school seniors who
take the SAT and
their mean SAT
math scores.
Looking at Scatterplots (direction)
• This figure shows a
positive association
between the years since
1900 and the % of people
who say they would vote
for a woman president.
• As the years have
passed, the percentage
who would vote for a
woman has increased.
Looking at Scatterplots (form)
• Form:
– Linear - a straight-line
relationship will
appear as a cloud or
swarm of points
stretched out in a
generally consistent,
straight form.
Looking at Scatterplots (form)
• Form:
– Curved - the
relationship
curves gently,
while still
increasing or
decreasing
steadily
Looking at Scatterplots (form)
• Clustered –
when there
are clear
groups
within the
data
Looking at Scatterplots (cont.)
• Strength:
– At one extreme, the points appear to follow a
single stream
(whether straight, curved, or bending all over
the place).
Looking at Scatterplots (cont.)
• Strength:
– At the other extreme, the points appear as a
vague cloud with no discernible trend or
pattern:
– Note: we will quantify the amount of scatter
soon.
Interpreting a Scatterplot
This scatterplot shows the mean SAT
Math score in each state against
the percent of that state’s high
school seniors who took the SAT
in 2005. The dotted lines intersect
at the point (9, 557), the data for
Colorado.
Interpret the scatterplot: look for patterns and any
important deviations from the pattern.
See p. 176 of the text for the solution.
Adding Categorical Variables to
Scatterplots
Dividing the states
into southern and
non-southern
introduces a third
variable into the
scatterplot. This
categorical
variable has only
two values.
Correlation
• Data collected from students in Statistics
classes included their heights (in inches)
and weights (in pounds):
• Here we see a
positive association
and a fairly straight
form, although
there seems to be
a high outlier.
Correlation (cont.)
• How strong is the association between weight
and height of Statistics students?
• If we had to put a number on the strength, we
would not want it to depend on the units we
used.
• A scatterplot of heights
(in centimeters) and
weights (in kilograms)
doesn’t change the
shape of the pattern:
Correlation (cont.)
• Since the units
don’t matter, why
not remove them
altogether?
• We could
standardize both
variables and write
the coordinates of a
point as $(z_x, z_y)$.
• Here is a scatterplot
of the standardized
weights and
heights:
Correlation
• The correlation coefficient (r) gives us a
numerical measurement of the direction
and strength of the linear relationship
between the explanatory and response
variables.
$$ r = \frac{\sum z_x z_y}{n - 1} $$
For the students’ heights and weights, the
correlation is 0.644.
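The calculation can be sketched in a few lines: standardize both variables, multiply the paired z-scores, add the products, and divide by n - 1. This is only an illustrative sketch. The heights and weights below are made-up values, not the class data behind the 0.644 figure, and the function names `z_scores` and `correlation` are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)  # stdev divides by n - 1
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1)."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical heights (inches) and weights (pounds), for illustration only.
heights_in = [62, 64, 66, 68, 70, 72, 74]
weights_lb = [115, 130, 128, 150, 165, 160, 190]

# The same measurements converted to centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = correlation(heights_in, weights_lb)
r_metric = correlation(heights_cm, weights_kg)
print(round(r_imperial, 4), round(r_metric, 4))
```

Because standardizing removes the units, the two printed values should agree: changing inches to centimeters or pounds to kilograms leaves r unchanged.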

How correlation measures the strength of a
linear relationship. Patterns closer to a
straight line have correlations closer to 1 or -1.
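A small numerical check of this idea, using made-up data: points that lie exactly on a rising line should give r = 1, points exactly on a falling line should give r = -1, and points with no clear linear trend should give a value much closer to 0. The data and the helper name `correlation` are assumptions for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r computed from z-scores: sum(z_x * z_y) / (n - 1)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
rising_line = [2 * x + 1 for x in xs]    # exact straight line, positive slope
falling_line = [20 - 3 * x for x in xs]  # exact straight line, negative slope
scattered = [5, 1, 8, 2, 7, 3, 9, 4]     # hypothetical values, no clear linear trend

r_rising = correlation(xs, rising_line)
r_falling = correlation(xs, falling_line)
r_scattered = correlation(xs, scattered)
print(round(r_rising, 3), round(r_falling, 3), round(r_scattered, 3))
```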
Correlation Cautions
• Correlation measures the strength of the
linear association between two
quantitative variables.
• Before you use correlation, you must
check several conditions:
– Quantitative Variables Condition
– Straight Enough Condition
– Outlier Condition
Correlation Conditions (cont.)
• Quantitative Variables Condition:
– Correlation applies only to quantitative
variables.
– Don’t apply correlation to categorical data
masquerading as quantitative.
– Check that you know the variables’ units and
what they measure.
Correlation Conditions (cont.)
• Straight Enough Condition:
– You can calculate a correlation coefficient for
any pair of variables.
– But correlation measures the strength only of
the linear association, and will be misleading
if the relationship is not linear.
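As an illustration of why this condition matters, the sketch below uses made-up data in which y is completely determined by x (y = x squared) but the pattern is a curve, not a line. The correlation comes out at essentially zero even though the relationship is perfect. The data and the helper name `correlation` are assumptions.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r computed from z-scores: sum(z_x * z_y) / (n - 1)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

# Hypothetical data: a perfect but curved (U-shaped) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = correlation(xs, ys)
print(f"r for y = x^2 on symmetric x values: {r_curve:.3f}")
```

A scatterplot of these points would show the curve immediately, which is why the plot should be checked before r is reported.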
Correlation Conditions (cont.)
• Outlier Condition:
– Correlation is not resistant.
– An outlier can make an otherwise small
correlation look big or hide a large correlation.
– It can even give an otherwise positive
association a negative correlation coefficient
(and vice versa).
– When you see an outlier, it’s often a good
idea to report the correlations with and without
the point.
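The advice to report both correlations can be sketched as follows. The eight points and the single outlying point are made up for illustration, and `correlation` is a hypothetical helper name.

```python
from statistics import mean, stdev

def correlation(points):
    """Pearson r for a list of (x, y) pairs, via the z-score formula."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in points)
    return total / (len(points) - 1)

# Hypothetical data: eight points with a strong positive trend...
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (6, 8), (7, 7), (8, 9)]
# ...plus one made-up point far from the rest.
outlier = (15, 1)

r_without = correlation(points)
r_with = correlation(points + [outlier])
print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

In this made-up example the eight points alone give a correlation of about 0.95, while adding the single outlying point pulls r down to roughly zero (slightly negative), hiding a strong association.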
Variability
Correlation is given by r, indicating the strength and
direction of a linear association.
Variability is described by $r^2$: $100r^2$ is the
percent of variability in the dependent variable that can
be explained by its linear relationship with the independent variable.
The remaining percent of variability is not explained by that
relationship; it may reflect lurking variables or other sources of variation.
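As a quick worked example, the arithmetic can be applied to the height and weight correlation of 0.644 reported earlier. The variable names are illustrative only.

```python
r = 0.644  # correlation between heights and weights reported in the text

percent_explained = 100 * r ** 2           # 100 * r^2
percent_remaining = 100 - percent_explained

print(f"{percent_explained:.1f}% explained, {percent_remaining:.1f}% remaining")
# prints "41.5% explained, 58.5% remaining"
```

So for these students, about 41.5% of the variability in one variable is accounted for by its linear relationship with the other.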
