Chapter 3 Examining Relationships
“Get the facts first, and then you can distort them as much as you please.” Mark Twain
3.1 Scatterplots
Many statistical studies involve MORE THAN ONE variable.
A SCATTERPLOT is a graphical display that allows one to observe a possible relationship between two quantitative variables.
Response Variable vs. Explanatory Variable
Response Variable – Measures an outcome of a study
Explanatory Variable – Attempts to explain the observed outcomes
When we think changes in a variable x explain, or even cause, changes in a second variable y, we call x an explanatory variable and y a response variable.
[Figure: scatterplot axes, with the response variable y on the vertical axis and the explanatory variable x on the horizontal axis]
IMPORTANT!
Even if it appears that y can be “predicted” from x, it does not follow that x causes y.
ASSOCIATION DOES NOT IMPLY CAUSATION.
When examining a scatterplot, look for an overall PATTERN.
Consider:
– Direction (positive association or negative association)
– Form
– Strength
– Outliers
Positive vs. Negative Association
Positive Association (between two variables)
– Above-average values of one tend to accompany above-average values of the other
– Below-average values of one tend to accompany below-average values of the other
Negative Association (between two variables)
– Above-average values of one tend to accompany below-average values of the other, and vice versa
3.2 Correlation
Describes the direction and strength of a straight-line relationship between two quantitative variables.
Usually written as r:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
Facts About Correlation
Positive r indicates positive association between the variables, and negative r indicates negative association.
The correlation r always falls between –1 and 1 inclusive.
The correlation between x and y does NOT change when we change the units of measurement of x, y, or both.
Correlation ignores the distinction between explanatory and response variables.
Correlation measures the strength of ONLY straight-line association between two variables.
The correlation is STRONGLY affected by a few outlying observations.
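The facts above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the function name `correlation` and the height and weight data are invented for illustration. It computes r from the standardized values, then confirms that swapping x and y or changing units (inches to centimeters) leaves r unchanged.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: average product of standardized values, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, invented for illustration only.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 170]

r = correlation(heights_in, weights_lb)

# Correlation ignores which variable is explanatory and which is response.
r_swapped = correlation(weights_lb, heights_in)

# Changing the units of x (inches to centimeters) does not change r.
heights_cm = [h * 2.54 for h in heights_in]
r_metric = correlation(heights_cm, weights_lb)

print(round(r, 4), round(r_swapped, 4), round(r_metric, 4))
```

All three printed values should agree, and each lies between –1 and 1.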
3.3 Least-Squares Regression
If a scatterplot shows a linear relationship between two quantitative variables, least-squares regression is a method for finding a line that summarizes the relationship between the two variables, at least within the domain of the explanatory variable x.
The least-squares regression line (LSRL) is a mathematical model for the data.
Regression Line
Straight line
Describes how a response variable y changes as an explanatory variable x changes.
Sometimes it is used to PREDICT the value of y for a given value of x.
Makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Residual
A difference between an OBSERVED y and a PREDICTED y:
$$\text{residual} = y - \hat{y}$$
Some Important Facts About the LSRL
It is a mathematical model for the data.
It is the line that makes the sum of the squares of the residuals AS SMALL AS POSSIBLE.
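The LSRL has the form $\hat{y} = a + bx$, where b is the slope and a is the y-intercept.
The slope is $b = r \dfrac{s_y}{s_x}$.
(On the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.)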
The slope b is the approximate change in y when x increases by 1.
The y-intercept a is the predicted value of y when x = 0.
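As a rough sketch of how these facts fit together, the code below builds the LSRL from summary statistics using b = r·s_y/s_x. The intercept formula a = ȳ − b·x̄ is an assumption added here; it follows from the fact that the LSRL passes through the point of means. The helper names `correlation`, `lsrl`, and `predict`, and the data, are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation from standardized values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)  # slope: b = r * s_y / s_x
    a = mean(ys) - b * mean(xs)                      # assumed: line passes through (x-bar, y-bar)
    return a, b

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = lsrl(xs, ys)

def predict(x):
    """Predicted y for a given x on the fitted line."""
    return a + b * x

# The predicted y changes by b when x increases by 1,
# and the predicted y at x = 0 is the intercept a.
print(round(b, 4), round(predict(3) - predict(2), 4))
print(round(a, 4), round(predict(0), 4))
```

Each printed pair should match, mirroring the interpretations of the slope and the y-intercept.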
Coefficient of Determination
Symbolism: $r^2$
It is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
Measure of HOW SUCCESSFUL the regression is in explaining the response.
Calculation of $r^2$
$$r^2 = \frac{SSM - SSE}{SSM}$$
where
$SSM = \sum (y - \bar{y})^2$, the sum of squares about the mean
$SSE = \sum (y - \hat{y})^2$, the sum of squares of residuals
Example
L1 (x)    L2 (y)    $(y - \bar{y})^2$    $(y - \hat{y})^2$
2         6
4         12
6         15
$\bar{x} = ?$    $\bar{y} = ?$
Example Solution
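L1 (x)    L2 (y)    $(y - \bar{y})^2$    $(y - \hat{y})^2$
2         6         25                   0.25
4         12        1                    1
6         15        16                   0.25
Sum                 42                   1.50
$\bar{x} = 4$    $\bar{y} = 11$
Here the LSRL is $\hat{y} = 2 + 2.25x$, so the predicted values are 6.5, 11, and 15.5. With SSM = 42 and SSE = 1.50, $r^2 = (42 - 1.50)/42 \approx 0.964$.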
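The arithmetic in this example can be reproduced with a short script. This is only a sketch: the variable names are invented here, and the slope is computed with the equivalent least-squares form Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² rather than from r, s_x, and s_y.

```python
from statistics import mean

# Data from the example: L1 holds x, L2 holds y.
xs = [2, 4, 6]
ys = [6, 12, 15]

x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept (equivalent to b = r * s_y / s_x).
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

predicted = [a + b * x for x in xs]

ssm = sum((y - y_bar) ** 2 for y in ys)                  # sum of squares about the mean
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))   # sum of squared residuals
r_squared = (ssm - sse) / ssm

print(ssm, sse, round(r_squared, 4))  # prints: 42 1.5 0.9643
```

About 96% of the variation in y is explained by the regression of y on x for these three points.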
Things to Note:
Sum of deviations from mean = 0.
Sum of residuals = 0.
$r^2 > 0$ does not mean $r > 0$. If x and y are negatively associated, then $r < 0$.
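These three notes can be illustrated with a small check on negatively associated data. The data and variable names below are hypothetical, and the intercept a = ȳ − b·x̄ is assumed from the fact that the LSRL passes through the point of means.

```python
from statistics import mean, stdev

# Hypothetical, negatively associated data (invented for illustration).
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 1]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Correlation from standardized values.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

# LSRL: slope b = r * s_y / s_x, intercept from the point of means.
b = r * s_y / s_x
a = y_bar - b * x_bar

deviations = [y - y_bar for y in ys]                   # deviations from the mean
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # observed minus predicted

print(round(sum(deviations), 10), round(sum(residuals), 10))
print(r < 0, r ** 2 > 0)
```

Both sums come out as zero up to floating-point rounding, while r is negative even though r² is positive.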
Outlier
A point that lies outside the overall pattern of the other points in a scatterplot.
It can be an outlier in the x direction, in the y direction, or in both directions.
Influential Point
A point that, if removed, would considerably change the position of the regression line.
Points that are outliers in the x direction are often influential.
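To see what "influential" means in practice, one can fit the line with and without a suspect point and compare. The sketch below uses invented data: five points that lie close to a line, plus one hypothetical outlier in the x direction at (15, 2). The helper name `lsrl` is an assumption.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [1.2, 1.9, 3.1, 4.0, 5.1]

# One added point far from the others in the x direction.
xs_with = xs + [15]
ys_with = ys + [2.0]

a_without, b_without = lsrl(xs, ys)
a_with, b_with = lsrl(xs_with, ys_with)

print(round(b_without, 3), round(b_with, 3))
```

For this data the slope drops from about 1 to nearly 0 once the extra point is included, so removing that single point would considerably change the position of the line.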
Words of Caution
Do NOT CONFUSE the slope b of the LSRL with the correlation r.
– The relation between the two is given by the formula $b = r \dfrac{s_y}{s_x}$.
– If you are working with normalized data, then b does equal r, since $s_y = s_x = 1$.
When you normalize a data set, the normalized data has mean = 0 and standard deviation = 1.
More Words of Caution
If you are working with normalized data, the regression line has the simple form
$$y_n = r x_n$$
where $x_n$ and $y_n$ are the normalized values of x and y.
Since the regression line contains the mean of x and the mean of y, and since normalized data has a mean of 0, the regression line for normalized x and y values contains (0, 0).