Transcript Chapter 7:

Chapter 2 Lecture Slides
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 2: Summarizing Bivariate Data
Introduction
• Often, scientists and engineers collect data in order to
determine the nature of the relationship between two
quantities.
• An example is the heights and forearm lengths of men.
• Data that consist of ordered pairs are called bivariate
data.
• In many cases, ordered pairs tend to cluster around a
straight line when plotted.
• The summary statistic most often used to measure the
closeness of association between two variables is the
correlation coefficient.
• When two variables are closely related to each other, it is
often of interest to predict the value of one of them when
given the value of the other. This is done with the
equation of a line known as the least squares line.
Section 2.1: The Correlation Coefficient
• Data in which each item consists of a pair of values are called bivariate.
• The graphical summary for bivariate data is a
scatterplot.
• Display of a scatterplot:
[Scatterplot of y versus x.]
Looking at Scatterplots
• If the dots on the scatterplot are spread out
randomly, then the two variables are not well
related to each other.
• If the dots on the scatterplot are spread around a
straight line, then one variable can be used to help
predict the value of the other variable.
Example
• This is a plot of height vs. forearm length for men.
• We say that there is a positive association between height and
forearm length. This is because the plot indicates that taller
men tend to have longer forearms.
• The slope is roughly constant throughout the plot, indicating
that the points are clustered around a straight line.
• The line superimposed on the plot is a special line known as
the least-squares line.
Correlation Coefficient
• The degree to which the points in a scatterplot
tend to cluster around a line reflects the
strength of the linear relationship between x
and y.
• The correlation coefficient is a numerical
measure of the strength of the linear
relationship between two variables.
• The correlation coefficient is usually denoted
by the letter r.
Computing r
• Let (x1, y1),…,(xn, yn) represent n points on a scatterplot.
• Compute the means and the standard deviations of the x’s
and y’s.
• Then convert each x and y to standard units. That is, compute the z-scores $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$.
• The correlation coefficient is the average of the products
of the z-scores, except that we divide by n – 1 instead of
n.
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
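As an illustration (a sketch added here, not part of the original slides), this computation translates directly into Python with NumPy; the function name is ours:

```python
import numpy as np

def correlation(x, y):
    """r = average of the products of the z-scores, dividing by n - 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)  # z-scores of the x's (sample sd)
    zy = (y - y.mean()) / y.std(ddof=1)  # z-scores of the y's
    return np.sum(zx * zy) / (n - 1)

# np.corrcoef(x, y)[0, 1] returns the same value.
```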
How the Correlation Coefficient Works
• The origin of the scatterplot is placed at the point of averages $(\bar{x}, \bar{y})$.
• Points with both coordinates above average, or both below, contribute positive products to r; points in the other two quadrants contribute negative products.
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
Computational Formula
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
• This formula is easier for calculations by hand:
$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}$$
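A sketch of the hand-calculation form, with made-up data (not from the text) to check that it matches NumPy's built-in:

```python
import numpy as np

def correlation_shortcut(x, y):
    """Computational formula for r, built from sums of x*y, x**2, y**2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    num = np.sum(x * y) - n * xbar * ybar
    den = np.sqrt(np.sum(x**2) - n * xbar**2) * np.sqrt(np.sum(y**2) - n * ybar**2)
    return num / den

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative data only
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(correlation_shortcut(x, y))        # matches np.corrcoef(x, y)[0, 1]
```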
Properties of r
• It is a fact that r is always between -1 and 1.
• Positive values of r indicate that the least squares line
has a positive slope. Greater values of one variable are
associated with greater values of the other.
• Negative values of r indicate that the least squares line
has a negative slope. Greater values of one variable are
associated with lesser values of the other.
More Comments
• Values of r close to –1 or 1 indicate a strong linear
relationship.
• Values of r close to 0 indicate a weak linear
relationship.
• When r is equal to –1 or 1, then all the points on the
scatterplot lie exactly on a straight line.
• If the points lie exactly on a horizontal or vertical
line, then r is undefined.
• If r ≠ 0, then x and y are said to be correlated. If r = 0, then x and y are uncorrelated.
More Properties of r
• An important feature of r is that it is unitless. It is a pure number that can be compared between different samples.
• r remains unchanged under each of the following operations (see the sketch after this slide):
– Multiplying each value of a variable by a positive constant.
– Adding a constant to each value of a variable.
– Interchanging the values of x and y.
• If r = 0, this does not imply that there is no relationship between x and y. It just indicates that there is no linear relationship.
• Outliers can greatly distort r, especially in small data sets, and present a serious problem for data analysts.
• Correlation is not causation. For example, vocabulary size is strongly correlated with shoe size, but this is because both increase with age. Learning more words does not cause feet to grow, or vice versa. Age is confounding the results.
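A small demonstration of the invariance properties, with illustrative data of our own:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 1.9, 3.8, 4.1, 5.5])
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(r, np.corrcoef(3.0 * x, y)[0, 1]))   # multiply by a positive constant
print(np.isclose(r, np.corrcoef(x + 10.0, y)[0, 1]))  # add a constant
print(np.isclose(r, np.corrcoef(y, x)[0, 1]))         # interchange x and y
```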
Example 1
An environmental scientist is studying the rate of absorption of a
certain chemical into skin. She places differing volumes of the
chemical on different pieces of skin and allows the skin to remain
in contact with the chemical for varying lengths of time. She
then measures the volume of chemical absorbed into each piece
of skin. The scientist plots the percent absorbed against both
volume and time. She calculates the correlation between volume
and absorption and obtains r = 0.988. She concludes that
increasing the volume of the chemical causes the percentage
absorbed to increase. She then calculates the correlation between
time and absorption, obtaining r = 0.987. She concludes that
increasing the time that the skin is in contact with the chemical
causes the percentage absorbed to increase as well. Are these
conclusions justified?
Example 1
Volume (mL)   Time (h)   Percent Absorbed
0.05          2          48.3
0.05          2          51.0
0.05          2          54.7
2.00          10         63.2
2.00          10         67.8
2.00          10         66.2
5.00          24         83.6
5.00          24         85.1
5.00          24         87.8
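A quick check in Python (our sketch, not the scientist's analysis) reproduces the two correlations and exposes the problem: in this design, volume and time are themselves almost perfectly correlated, so their separate effects cannot be untangled.

```python
import numpy as np

volume  = np.array([0.05, 0.05, 0.05, 2.00, 2.00, 2.00, 5.00, 5.00, 5.00])
time    = np.array([2, 2, 2, 10, 10, 10, 24, 24, 24], dtype=float)
percent = np.array([48.3, 51.0, 54.7, 63.2, 67.8, 66.2, 83.6, 85.1, 87.8])

print(np.corrcoef(volume, percent)[0, 1])  # ~0.988
print(np.corrcoef(time, percent)[0, 1])    # ~0.987
print(np.corrcoef(volume, time)[0, 1])     # ~0.999: volume and time are confounded
```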
Example 2
Volume (mL)   Time (h)   Percent Absorbed
0.05          2          49.2
0.05          10         51.0
0.05          24         84.3
2.00          2          54.1
2.00          10         68.7
2.00          24         87.2
5.00          2          47.7
5.00          10         65.1
5.00          24         88.4
Controlled Experiments and Confounding
• In a controlled experiment the experimenter can choose the values of the factors to reduce confounding.
• In controlled experiments, confounding can often be avoided by choosing values for the factors in such a way that the factors are uncorrelated, as the sketch below illustrates.
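For instance, the design of Example 2 pairs every volume with every time, so the two factors are exactly uncorrelated (a minimal check):

```python
import numpy as np

# Example 2 design: each volume appears with each time.
volume = np.array([0.05, 0.05, 0.05, 2.00, 2.00, 2.00, 5.00, 5.00, 5.00])
time   = np.array([2, 10, 24, 2, 10, 24, 2, 10, 24], dtype=float)

print(np.corrcoef(volume, time)[0, 1])  # 0.0 (up to rounding): no confounding
```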
HW 2.1: 1, 7, 8
Section 2.2: The Least-Squares Line
Can we predict the strength for a nitrogen content not in the table?
Is there any relationship between nitrogen content and strength?
• The line that we are trying to fit is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$
• The variable $y_i$ is the dependent variable, $x_i$ is the independent variable, $\beta_0$ and $\beta_1$ are called the regression coefficients, and $\varepsilon_i$ is called the residual. We only know the values of x and y; we must estimate $\beta_0$ and $\beta_1$.
• This is what we call simple linear regression.
Using the Data
• We write the equation of the least-squares line as
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares coefficients.
• The least-squares line is the line that fits the data "best".
• To find the least-squares line, we must determine estimates of the slope $\beta_1$ and intercept $\beta_0$ that minimize the sum of the squared residuals:
$$S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$
Finding the Equation of the Line
• These quantities are
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.$$
Note: The true values of $\beta_0$ and $\beta_1$ are unknown.
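These formulas translate directly into Python (our sketch; NumPy's built-in fit gives the same result):

```python
import numpy as np

def least_squares(x, y):
    """Least-squares estimates of the intercept and slope."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# np.polyfit(x, y, 1) returns the same coefficients (slope first).
```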
Some Shortcut Formulas
The expressions on the right are equivalent to those on the left, and are often easier to compute:
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$
$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
Example 2
Using the weld data in Table 2.1 in the text, compute the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. Write the equation of the least-squares line.
$$\bar{x} = 0.0319 \qquad \bar{y} = 63.79$$
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = 0.1002$$
$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = 33.23$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{33.23}{0.1002} \approx 331.62$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 63.79 - (331.62)(0.0319) = 53.197$$
$$\hat{y} = 53.197 + 331.62x$$
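A sketch of the same calculation from the quoted summary statistics (the raw Table 2.1 values are not reproduced in these slides):

```python
xbar, ybar = 0.0319, 63.79
sxx = 0.1002  # sum of (x - xbar)**2
sxy = 33.23   # sum of (x - xbar)*(y - ybar)

beta1 = sxy / sxx            # ~331.6
beta0 = ybar - beta1 * xbar  # ~53.2
print(beta0, beta1)
```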
Example 3
Using the equation of the least-squares line for the weld data in Table 2.1 in the text, predict the yield strength for a weld whose nitrogen content is 0.02%.
$$\hat{y} = 53.197 + 331.62x$$
$$\hat{y} = 53.197 + 331.62(0.02) = 59.83 \text{ ksi}$$
In the table the yield strength of that weld was 57.67 ksi, so should we use the table or the equation to predict another weld?
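Continuing the sketch above, the prediction is one line:

```python
predict = lambda x: beta0 + beta1 * x  # continuing the sketch above
print(predict(0.02))                   # ~59.83 ksi
```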
HW 2.2: 4, 12
Section 2.3: Features and Limitations of the Least-Squares Line
• Do not extrapolate the fitted line (such as the least-squares line) outside the range of the data. The linear relationship may not hold there.
• We learned that we should not use the correlation coefficient when the relationship between x and y is not linear. The same holds for the least-squares line. When the scatterplot follows a curved pattern, it does not make sense to summarize it with a straight line.
• If the relationship is curved, then we would want to fit a regression model that contains squared terms.
The relationship between the height of a free-falling object and the
time in free fall is not linear.
Outliers and Influential Points
If we define outliers as points that have unusual residuals,
then a point that is far from the bulk of the data, yet near the
least-squares line, is not an outlier.
Measures of Goodness of Fit
• A goodness of fit statistic is a quantity that measures
how well a model explains a given set of data.
• The quantity r² is the square of the correlation coefficient, and we call it the coefficient of determination.
$$r^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
• r² is interpreted as the proportion of the variance in y that is explained by the regression.
• r² gives the percentage of the variability in y that can be explained by x. Where is x in the equation above? It enters through the fitted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
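A minimal sketch computing r² through the sums of squares (function name ours):

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination via the sums-of-squares identity."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    yhat = beta0 + beta1 * x           # x enters through the fitted values
    sse = np.sum((y - yhat) ** 2)      # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return (sst - sse) / sst           # equals np.corrcoef(x, y)[0, 1] ** 2
```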
Sums of Squares
• $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the error sum of squares and measures the overall spread of the points around the least-squares line.
• $\sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares and measures the overall spread of the points around the line $y = \bar{y}$.
• The difference $\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is called the regression sum of squares and measures the reduction in the spread of the points obtained by using the least-squares line rather than $y = \bar{y}$.
• The coefficient of determination r² expresses the reduction as a proportion of the spread around $y = \bar{y}$.
• Clearly, the following relationship holds:
Total sum of squares = regression sum of squares + error sum of squares
Interpreting Computer Output
Regression Analysis: Strength versus Nitrogen Concentration

The regression equation is
Strength = 53.197 + 331.62 Nitrogen

Predictor   Coef       SE Coef   T        P
Constant    53.19715   1.02044   52.131   0.000
Nitrogen    331.6186   27.4872   12.064   0.000

S = 2.75151   R-Sq = 84.8%   R-Sq(adj) = 84.3%

1. The regression equation shown is the equation of the least-squares line.
2. Coef: the least-squares coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$.
3. R-Sq: this is r², the square of the correlation coefficient r, also called the coefficient of determination.
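Output of this form can be produced in Python with the statsmodels package; a sketch with stand-in data, since Table 2.1 is not reproduced in these slides:

```python
import numpy as np
import statsmodels.api as sm

# Stand-in arrays; substitute the actual Table 2.1 values.
nitrogen = np.array([0.01, 0.02, 0.03, 0.04, 0.05])
strength = np.array([56.0, 60.1, 62.9, 66.8, 70.2])

X = sm.add_constant(nitrogen)   # adds the intercept (Constant) column
fit = sm.OLS(strength, X).fit()
print(fit.summary())            # coefficients, SEs, t, P, R-squared, etc.
```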
HW 2.3: 7, 10
Supplementary: 6, 7
Summary
• Bivariate Data
• Correlation Coefficient
• Least-Squares Line
• Features and Limitations of the Least-Squares Line