Chapter 4: Describing the Relation between Two Variables


Chapter 4: Describing the
Relation between Two Variables
4.1 Scatter Diagrams and Correlation
4.2 Least Squares Regression
4.3 Diagnostics on the Least Squares Regression Line
October 11, 2008
1
Variable Association
Question: In a population, are two or more variables linked? For
example, do Math 127A students with brown eyes have a higher IQ
than students with other eye colors?
2
Bivariate Data
Recall: A variable is any characteristic of the objects in
the population that will be analyzed. Data is the value
(categorical or quantitative) that is measured for a
variable. If we have only one variable that is measured,
then we call this univariate data. If two variables are
measured simultaneously, then we call it bivariate data.
3
Example 1
Consider the population of all cars in the State of Tennessee.
Suppose we collect data on the number of miles on each car and
the age of the car. One variable is the mileage of each car and a
second variable is the age (in years) of each car. This would form
a bivariate dataset.
Question: Is there a relationship between the age of the car
and the number of miles?
4
Example 2
Consider the population of all undergraduates at Vanderbilt during
the present academic year. Suppose that we survey each student
to determine the number of hours that they watch television each
week and their GPA at the end of Spring 2007 semester. The two
variables are: (1) hours of TV watching per week and (2) GPA.
This forms a bivariate dataset.
Question: Is there a relationship for Vanderbilt undergraduates
concerning these two variables?
5
Response & Explanatory Variables
Definition: Suppose we have bivariate data for two variables in a
population or sample. The response ( or dependent) variable is the
variable whose value can be explained by the values of the
explanatory (or independent) variable.
Mathematics: for example, y = (1 + x)/(5 + x²). Here y is the dependent variable and
x is the independent variable.
6
Association between Variables
Definition: Consider two variables associated with a population.
We say that an association exists between the two variables if a
particular value for one of the variables is more likely to occur
with certain values of the other variable.
7
Association between Two
Quantitative Variables
We now consider a sample that contains information about two
quantitative variables. We want to determine if an association
between these two variables exists.
Consider a set of bivariate data. Let S = {x1, x2, …, xn} denote the data for one variable and
T = {y1, y2, …, yn} the data for the second variable. Here xi and yi are numbers (values) for our
two quantitative variables. Our main goal will be to determine if there is a relationship between
the two sets. Technically, it would be desirable to find a function f such that yi = f(xi). However,
for bivariate data that will vary from sample to sample, this is virtually impossible. Therefore, we
look for an association that permits this variability.
8
Association between Sets
(variables)
9
One Approach
One approach is to look at the descriptive characteristics (statistics)
of each set of the bivariate dataset separately.
Example: S = (-2, 3, 7, 8, 9) and T = (0, 1, 4, 5, 10).
Range:  r = 11 (S)   and  r = 10 (T)
Mean:   m = 5 (S)    and  m = 4 (T)
Median: 7 (S)        and  4 (T)
SD:     s = 4.52 (S) and  s = 3.94 (T)
Conclusion: Not much help!
10
A Better Approach: Scatterplots
Suppose that we have bivariate data S = {x1, x2, …, xn} and T = {y1, y2, …, yn} for two variables.
From these two sets we form a third set of ordered pairs: A = {(x1, y1), (x2, y2), …, (xn, yn)}.
Definition: The plot of the points of A as points in the xy-plane is
called a scatterplot.
Remark: Although it technically doesn’t matter, we choose the first set to be the
explanatory variable (horizontal axis) and the second set to be the response
variable (vertical axis).
11
Example
Suppose that we have bivariate data where the one sample data (explanatory
variable ) is (1,3,4,6,9,12) and the other sample data (response variable) is
(2,-1,3,0,1,4).
12
Scatterplots & Excel
It is easy to create scatterplots in Excel. Assuming that your
data is listed in two columns (or two rows), select Chart from
the Insert menu and then choose XY Scatter from the different
types of charts. Then use the Chart Wizard to construct the
scatterplot.
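Outside of Excel, the same kind of plot can be produced with a few lines of code. A minimal sketch in Python (not used on the original slides; matplotlib assumed available), using the small dataset from the next slide:

```python
import matplotlib.pyplot as plt

# Explanatory (x) and response (y) values from the example on the next slide
x = [1, 3, 4, 6, 9, 12]
y = [2, -1, 3, 0, 1, 4]

plt.scatter(x, y)                        # one point per (x_i, y_i) pair
plt.xlabel("Explanatory variable (x)")
plt.ylabel("Response variable (y)")
plt.title("Scatterplot")
plt.show()
```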
13
Example
Explanatory Variable: GDP
Response Variable: Internet Use
14
Positive & Negative Associations
Definition: We say that two numerical variables (x & y) have a
positive association if as x increases, then y also tends to increase.
We say that they have a negative association if as x increases, y
tends to decrease. If there is neither a positive nor a negative
association, we say that there is no association.
15
Positive & Negative Association in
Scatterplot
16
Example
Consider the bivariate data:
• S = (0,1,2,3,4,5,6,7,8,9,10) (explanatory)
• T = (4,4,5,6,4,4,5,9,5,11,6) (response)
Is there an association?
17
Example
Consider the bivariate data:
• S = (0,1,2,3,4,5,6,7,8,9,10) (explanatory)
• T = (4,4,5,6,4,4,5,9,5,11,6) (response)
Is there an association?
There appears to be a
positive association between
the explanatory and
response variables.
18
Example
This example deals with the correlation between the Ross Perot and Pat Buchanan
countywide votes in Florida in the 1996 and 2000 elections. Each dot (x, y) is a
county in Florida, with the first component the Perot vote (1996) and the second
component the Buchanan vote (2000).
19
Generic Scatterplot
Consider a bivariate set of quantitative data and suppose that
we construct the scatterplot for this data.
20
Linear Response
Consider a bivariate set of quantitative data and suppose that
we construct the scatterplot for this data.
There appears to be a linear
relationship between x and y
in a “fuzzy” sense.
21
The Linear Correlation Coefficient
Consider a bivariate set of data. If we believe that there is a linear
response between the two variables, then we can define a number
(which we will denote by r) that is a measure of how much the
scatterplot varies from a linear relationship between the two variables
(x & y):
y = mx + b.
Remark: The correlation
coefficient is sometimes
called the Pearson
Correlation Coefficient.
22
Calculation of r
Consider two sets (samples) of data: S = {x1, x2, …, xn} and T = {y1, y2, …, yn}.
Let sx denote the sample standard deviation of S and sy the sample standard
deviation of T. Then we define the correlation coefficient r as

r = (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy]
  = [Σ xi yi - (Σ xi)(Σ yi)/n] / sqrt( [Σ xi² - (Σ xi)²/n] [Σ yi² - (Σ yi)²/n] ),

where x̄ is the mean of S, ȳ is the mean of T, and all sums run from i = 1 to n.
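To see the z-score form of this formula in action, here is a short Python sketch (not part of the original slides; numpy assumed available) that computes r for the gestation/life-expectancy data appearing later in this chapter:

```python
import numpy as np

# Gestation (days) and life expectancy (years) from the animal example below
x = np.array([63, 22, 63, 28, 151, 108, 18, 115, 31, 44], dtype=float)
y = np.array([11, 7.5, 11, 10, 12, 10, 8, 10, 7, 9])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # z-scores built with the sample SD
zy = (y - y.mean()) / y.std(ddof=1)
r = np.sum(zx * zy) / (n - 1)         # r = (1/(n-1)) * sum of zx * zy

print(r)                               # ~0.7257; np.corrcoef(x, y)[0, 1] agrees
```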
23
Remark
Recall that we have introduced the z-score of a data value: zx = (x - x̄)/sx or zy = (y - ȳ)/sy.
The correlation coefficient is the same as

r = (1/(n-1)) Σ zxi zyi.

Hence, it is a normalized product of the z-scores for both data sets. If x and y have a
positive association, then as x increases, so will y. In this case, we expect zxi and zyi to
have the same sign and hence zxi zyi > 0. If the association is negative, then we expect
zxi and zyi to have opposite signs and hence zxi zyi < 0. The strength of the association
depends on the magnitude of r, i.e., |r|. The magnitude depends on the sizes of the
products zxi zyi. The smaller zxi = (xi - x̄)/sx or zyi = (yi - ȳ)/sy is, the smaller zxi zyi.
This happens when xi ≈ x̄ and/or yi ≈ ȳ, i.e., when the x-points and y-points do not
differ much from their means.
24
What does r tell us?
Suppose we have a bivariate set of data with x arbitrary, but
y = mx + b. That is, the two sets are linearly related. What is the
correlation coefficient for this type of set?
Example: Let m = 2 and b = 1. Then S = (1, 2, 3, 4, …, 12) and
T = (3, 5, 7, 9, …, 25). Using the formula, we find r = 1.
25
What does r tell us?
Example: Let m = -2 and b = 25. Then S = (1,2,3,4,…,12) and
T = (23,21,19,…,1). Using the formula, we find r = -1 .
26
Linear Correlation: r
• If r = 1, then there is a perfect positive linear association between the variables.
• If r = -1, then there is a perfect negative linear association between the variables.
• If r = 0, then there is no linear correlation between the variables.
• If 0 < r < 1, then there is some positive correlation, although the nearer r is to zero, the weaker the correlation.
• If -1 < r < 0, then there is some negative correlation between the variables.
• If r = 0, it does not mean that there is no association, but rather no linear association.
In other words, r measures the strength of the linear association between the
two variables. The relationship between two variables may be nonlinear, yet
you can approximate the nonlinear relationship by a linear relationship.
27
The Bottom Line
If you want to know if there is a linear association between two
quantitative variables in a bivariate set, compute the correlation coefficient
r. Its sign (+ or -) will tell you whether there is a positive or negative association,
and the magnitude of r, |r|, will tell you the strength of the association.
28
Example
Consider bivariate data: S = (0, 1, 2, …, 8, 9) and T = (1.00, 2.00, 2.09, 2.14, …, 2.30, 2.32). The data in set T
was generated by the function f(x) = x^(1/8) + 1.
r = 0.72
29
Example
Consider the function y = f(x) = x^10 and the points (0, 0.1, 0.2, 0.3, …, 0.9, 1.0). We form a
bivariate set with these points: ((0, 0), (0.1, 10^-10), …, (0.9, 0.348678), (1, 1)). The correlation
coefficient for this data is r = 0.669641. This indicates a medium-strength linear correlation.
However, it is a perfect nonlinear correlation with the nonlinear function x^10.
30
Example
Animal      Gestation (days)   Life Expectancy (years)
Cat         63                 11
Chicken     22                 7.5
Dog         63                 11
Duck        28                 10
Goat        151                12
Lion        108                10
Parakeet    18                 8
Pig         115                10
Rabbit      31                 7
Squirrel    44                 9
Is there a linear association between gestation period and life expectancy?
31
Note: The explanatory variable is the gestation period and the response variable is the life
expectancy. Also, dogs and cats have the same data.

r = (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy]
n = 10, x̄ = 64.3, ȳ = 9.55, sx = 45.6704, sy = 1.6403
⇒ r = 0.725657
32
Example
The U.S. Federal Reserve Board provides data on the percentage of disposable
personal income required to meet consumer loan payments and mortgage payments.
The following table summarizes this yearly data over the past several years.
Consumer Debt   Household Debt
7.88            6.22
6.24            5.73
7.91            6.14
6.09            5.95
7.65            5.95
6.32            6.09
7.61            5.83
6.97            6.28
7.48            5.83
7.38            6.08
7.49            5.85
7.52            5.79
7.37            5.81
7.84            5.81
6.57            5.79
Question: Are consumer debt and household debt correlated?
33
Means:
7.22133 (consumer)
5.94333 (household)
Sample SD:
0.623583 (consumer)
0.175526 (household)
Correlation Coefficient: r = 0.117813
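As a check on these numbers, a short sketch (Python/numpy assumed; not part of the original slides) that re-enters the fifteen (consumer, household) pairs from the table and computes r:

```python
import numpy as np

consumer = [7.88, 6.24, 7.91, 6.09, 7.65, 6.32, 7.61, 6.97,
            7.48, 7.38, 7.49, 7.52, 7.37, 7.84, 6.57]
household = [6.22, 5.73, 6.14, 5.95, 5.95, 6.09, 5.83, 6.28,
             5.83, 6.08, 5.85, 5.79, 5.81, 5.81, 5.79]

r = np.corrcoef(consumer, household)[0, 1]   # Pearson correlation coefficient
print(r)                                      # ~0.118: a weak positive association
```

A value this close to zero suggests very little linear association between the two debt measures.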
34
Excel and Correlation
Excel can be used to find the correlation coefficient for a set of bivariate
data. In the Tools menu, select Data Analysis. In the Data
Analysis window, select Correlation and follow the wizard. It
produces what is called a correlation matrix. The number that
occupies the 2nd row and 1st column is the correlation coefficient.
It is possible to calculate the correlation coefficient between
several variables using this tool.
35
Least Squares Regression
Suppose that we have bivariate data, (x1, y1), (x2, y2), …, (xn, yn), and there is a linear association
between the two variables x and y. That is, r ≠ 0. (When r = 0, we say that there is no linear association
between the variables, even though a horizontal line (slope zero) runs through the data.) We then know
there is a line with the equation y = mx + b that in some sense approximates the relationship between x and y.
Note: In general, yi ≠ mxi + b.
Section 4.2
36
Reminder about Lines
The equation of a straight line is y = mx + b. The number m is called the
slope of the line and the number b is called the y-intercept. If m > 0, then y
increases with x, and if m < 0, then y decreases with x. Given two distinct
points in the plane, one can find the numbers m and b. Points (x, y) that
satisfy the same equation y = mx + b are said to be collinear.
37
Remark
Given a value for the explanatory variable, say xk, we can compute an approximation for
yk, call it ŷk, by using the equation of the straight line, i.e., ŷk = mxk + b ≈ yk. In fact, for any
x (not necessarily a data point), we can infer a value for the corresponding y.
Objective: Find the equation of the line that best approximates the association between the two variables.
This line is called a regression line for the bivariate data.
38
Problem
Given a set of points in the xy-plane that are not collinear, there are infinitely
many lines that could be drawn through (or near) the points.
39
Error and Residual
Consider a bivariate set of data, (x1, y1), (x2, y2), …, (xn, yn), and suppose that we construct some straight line
y = mx + b. Let ŷi = mxi + b. The difference εi = yi - ŷi is called the residual or prediction error of using
ŷi as an approximation for yi.
Idea: Given a set of bivariate data, choose the straight line that minimizes the residuals of all points.
40
Least Squares Line
Consider a set of bivariate data, (x1, y1), …, (xn, yn), and consider the line y = mx + b.
Let εi = yi - ŷi, where ŷi = mxi + b. The number εi is called the residual of yi. Consider

R = Σ εi² = Σ (yi - ŷi)².

The values of m and b that make R as small as possible are: (1) m = r (sy / sx), where r is the correlation
coefficient for the bivariate set and sx and sy are the sample standard deviations, and (2) b = ȳ - m x̄,
where x̄ and ȳ are the means of their particular sets. The straight line calculated in this way is called a
least squares line. Alternately, it is called a regression line.
Remark: Your book uses the notation y = b1 x + b0 for the line.
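A short Python sketch of these two formulas (not part of the original slides; numpy assumed available), checked against the by-hand example two slides ahead:

```python
import numpy as np

def least_squares_line(x, y):
    """Return (m, b) for the least squares line yhat = m*x + b,
    using m = r*(sy/sx) and b = ybar - m*xbar as on this slide."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]                 # correlation coefficient
    m = r * (y.std(ddof=1) / x.std(ddof=1))     # slope
    b = y.mean() - m * x.mean()                 # y-intercept
    return m, b

m, b = least_squares_line([-1, 0, 2, 3], [1, 2, -1, 0])
print(m, b)   # approximately -0.5 and 1.0, matching the worked example
```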
41
Lot of things to compute!
To compute the least squares line, one must calculate:
• the sample standard deviations of two sets
• the mean of two sets
• the correlation between two sets.
Fortunately, there is technology available to do this for us:
http://www.shodor.org/unchem/math/lls/leastsq.html .
42
Example (by hand)
Find the least squares line for the data set: ((-1,1),(0,2),(2,-1),(3,0)).
• X = (-1,0,2,3) and Y = (1,2,-1,0).
• Means: 1 and 0.5, respectively
• Sample standard deviations: 1.83 and 1.29, respectively
• Correlation coefficient: r = -0.71
• m = -0.71(1.29/1.83) = -0.50
• b = 0.5 - (-0.50)(1) = 1.0
• y = -0.5x + 1.0
43
Example
Anthropologists use bones to predict the height of individuals.
x = length of femur (thighbone) in cm
y = height in cm
ŷ = 2.4x + 61.4
What is the predicted height of an individual with a 50 cm femur? The
regression equation predicts:
(2.4)(50) + 61.4 = 181.4 cm ≈ 71.4 in
44
Interpreting the y-intercept
y-intercept:
• The predicted value for y when x = 0.
• Helps in plotting the line, because it gives the point where the least
squares regression line crosses the y-axis.
• May not have any interpretative value if no observations had x
values near 0.
45
Interpreting the Slope
Slope:
Measures the change in the predicted variable for every unit change in
the explanatory variable. Hence, it is a rate of change between the
explanatory variable and the predicted (response) variable. Note that
slope has units (units of response variable divided by the units of the
explanatory variable).
46
Slopes and Association
47
Example
The population of the Detroit Metropolitan Area is summarized in the following table from
1950 to 2000:
Year                   1950   1960   1970   1980   1990   2000
Population (millions)  3.0    3.8    4.2    5.2    5.9    7.0
n = 6
x̄ = 1975, sx = 18.7083
ȳ = 4.85, sy = 1.46935
r = (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy] = 0.993121
m = r (sy / sx) = 0.078
b = ȳ - m x̄ = -149.2
ŷ = 0.078x - 149.2
Questions:
(i) Give a prediction of the population in the year 2010.
(ii) Why did the population growth slow during the 1980's?
Answers:
(i) ŷ = m(2010) + b ≈ 7.58
(ii) Recession, poor domestic auto sales.
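A quick check of these numbers (a sketch assuming Python/numpy; np.polyfit with degree 1 fits the same least squares line):

```python
import numpy as np

years = np.array([1950, 1960, 1970, 1980, 1990, 2000], dtype=float)
pop = np.array([3.0, 3.8, 4.2, 5.2, 5.9, 7.0])     # millions

m, b = np.polyfit(years, pop, 1)   # least squares slope and intercept
print(m, b)                        # ~0.078 and ~-149.2
print(m * 2010 + b)                # extrapolated 2010 prediction, ~7.6 million
```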
48
Residuals
• They measure the difference between a data point (observation) and a prediction: y - (mx + b).
• Every data point has a residual.
• A residual with a large absolute value (±) indicates an unusual observation.
• Large residuals can be found by constructing a histogram of the residuals.
49
Example
Research at NASA studied the relationship between the right humerus and right tibia of 11 rats that were sent
into space on the Spacelab. Here is the data collected.
Right Humerus (mm)   Right Tibia (mm)
24.80                36.05
24.59                35.57
24.59                35.57
24.29                34.58
23.81                34.20
24.87                34.73
25.90                37.38
26.11                37.96
26.63                37.46
26.31                37.75
26.84                38.50
Find a least-squares regression line with x being the right humerus and y the right tibia.
50
n = 11
x̄ = 25.34, sx = 1.04127
ȳ = 36.3409, sy = 1.52162
r = (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy] = 0.951316
m = r (sy / sx) = 1.39017
b = ȳ - m x̄ = 1.11395
ŷ = 1.39017x + 1.11395
51
52
Slope & Correlation
• Correlation
  – Describes the strength of a linear association between the two variables.
  – Does not change when the units of measurement of the variables change.
  – It is not necessary to identify which variable is the response variable and which is the explanatory variable.
• Slope
  – Its numerical value depends on the units used to measure the variables.
  – Does not indicate whether the association is strong or weak.
  – One must identify the explanatory and response variables.
  – The regression equation (y = mx + b) can be used to predict values of the response variable.
53
Summing the Residuals
The residual of a data point (xi, yi) is a measure of how well the regression line,
ŷ = mx + b, approximates the data, i.e., εi = yi - ŷi = yi - (mxi + b). The smaller
|εi| is, the better the approximation. Hence, we calculate the sum of the squares
of the residuals and then take the square root of this sum:

sqrt( Σ (yi - ŷi)² ).

This is an overall measure of the approximation.
54
Example
Find the least squares regression line for the bivariate data
((-1,0), (1,2), (2,3), (4,3), (5,4))
and then calculate the square root of the sum of the squared residuals.

X = (-1, 1, 2, 4, 5), n = 5, x̄ = 11/5, sx = sqrt(57/10) ≈ 2.38747
Y = (0, 2, 3, 3, 4), ȳ = 12/5, sy = sqrt(23/10) ≈ 1.51658
r = 0.939026
m = r (sy / sx) = 0.596491
b = ȳ - m x̄ = 1.08772
ŷ = 0.596491x + 1.08772
εi = yi - ŷi, i = 1, 2, …, n: {-0.491228, 0.315789, 0.719298, -0.473684, -0.070175}
sqrt( Σ (yi - ŷi)² ) ≈ 1.0429
55
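The residuals of the preceding example can be reproduced with a short sketch (Python/numpy assumed; not part of the original slides); note that least squares residuals always sum to essentially zero:

```python
import numpy as np

x = np.array([-1, 1, 2, 4, 5], dtype=float)
y = np.array([0, 2, 3, 3, 4], dtype=float)

m, b = np.polyfit(x, y, 1)             # ~0.5965 and ~1.0877
residuals = y - (m * x + b)            # e_i = y_i - yhat_i
print(residuals)                        # sums to ~0
print(np.sqrt(np.sum(residuals**2)))    # root of the sum of squares, ~1.04
```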
Example
Data: {(6, 4), (7, 5), (8, 2), (6, 7), (1, 6), (8, 6), (2, 5), (6, 8), (7, 6), (6, 2), (3, 0), (6, 1), (4, 7), (0, 0), (3, 3), (3, 8), (3, 3),
(0, 7), (8, 2), (7, 3), (7, 3), (6, 0), (0, 5), (2, 8), (1, 3), (9, 5), (7, 0), (4, 4), (8, 5), (4, 2), (2, 7), (8, 9), (0, 3), (7, 6), (7, 8),
(9, 7), (5, 4), (8, 0), (7, 4), (1, 3), (9, 7), (3, 5), (2, 9), (7, 6), (7, 5), (8, 6), (7, 6), (8, 2), (8, 3), (2, 5)}
x̄ = 5.14, ȳ = 4.5, sx = 2.83571, sy = 2.50917
r = -0.00143411, m = -0.00126897, b = 4.50652
56
Residuals: {-0.498909, 0.50236, -2.49637, 2.50109, 1.49475, 1.50363, 0.496015, 3.50109, 1.50236,
2.49891, -4.50272, -3.49891, 2.49855, -4.50652, -1.50272, 3.49728, -1.50272, 2.49348, -2.49637, 1.49764, -1.49764, -4.49891, 0.493477, 3.49602, -1.50525, 0.504898, -4.49764, -0.501447, 0.503629, 2.50145, 2.49602, 4.50363, -1.50652, 1.50236, 3.50236, 2.5049, -0.500178, -4.49637, -0.49764,
1.50525, 2.5049, 0.497284, 4.49602, 1.50236, 0.50236, 1.50363, 1.50236, -2.49637, -1.49637,
0.496015}
57
Excel and Regression Lines
Excel can be used to find the regression line for a set of bivariate
data. In the Tools menu, select Data Analysis. In the Data
Analysis window, select regression and follow the wizard.
58
Be Careful in Analyzing Association
between Two Variables
We have developed the regression line as a way to predict values for
the response variable in terms of the explanatory variable. Can this
lead to bad predictions?
By all means!
59
Extrapolation
Consider the data sets S = {x1, …, xn} (explanatory variable) and T = {y1, …, yn} (response
variable). Let us assume that we order S so that x1 ≤ x2 ≤ … ≤ xn. We compute a regression
line for this data: ŷ = mx + b, where ŷ is a prediction for y. Technically, the equation of the
regression line was calculated under the assumption that x1 ≤ x ≤ xn. If we use the equation of
the regression line to calculate predictions for x < x1 and/or x > xn, then this is called
extrapolation.
60
Example
Consider S = (-1,0,3,4,5) and T = (1,0,2,3,2). The regression line is then
y = 0.351x + 0.828.
Suppose that we want a prediction
for x = -3 (-0.224) or x = 6 (2.932).
Using our equation to obtain these
predictions is called extrapolation.
61
Problems with Extrapolation
• Extrapolation is using a regression line to predict values of the
response variable for values of the explanatory variable that are outside
of the data set.
  – It is riskier the farther we move from the range of the given x-values.
  – There is no guarantee that the same relationship (the regression
line) will have the same trend outside the range of x-values.
62
Diagnostics on the Least
Squares Regression Line
Given a set of bivariate data, we can compute the least
squares regression line that “fits” the data. We now look at
“how well” this line approximates the data.
Section 4.3
63
Review
Given a sample of data from our population, ((x1,y1), …, (xn,yn)), we can construct a
linear regression line y = mx + b. We have formulas for the constants: the slope, m,
and the y-intercept, b. Furthermore, we have the (linear) correlation coefficient, r.
The correlation coefficient r measures the strength of the linear relation between the
two quantitative variables x and y.

r = (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy]
  = [1/((n-1) sx sy)] Σ (xi - x̄)(yi - ȳ)
  = [Σ xi yi - (Σ xi)(Σ yi)/n] / sqrt( [Σ xi² - (Σ xi)²/n] [Σ yi² - (Σ yi)²/n] )

x̄ is the mean and sx the sample standard deviation of {x1, …, xn}
ȳ is the mean and sy the sample standard deviation of {y1, …, yn}
m = r (sy / sx) and b = ȳ - m x̄
64
Deviations
Consider a data set (x1, y1), (x2, y2), …, (xn, yn). Let y = mx + b be the regression line,
ŷi = mxi + b, and ȳ = (1/n) Σ yi. For a point (xi, yi):
(i) yi - ȳ = total deviation at (xi, yi)
(ii) ŷi - ȳ = explained deviation at (xi, yi)
(iii) yi - ŷi = unexplained deviation at (xi, yi)
65
Variation
Consider any point in the data, yi, and the prediction ŷi = mxi + b. Then
total deviation = yi - ȳ = (ŷi - ȳ) + (yi - ŷi) = explained deviation + unexplained deviation.
Next, one can prove:

Σ (yi - ȳ)² = Σ (ŷi - ȳ)² + Σ (yi - ŷi)²,

so that

1 = Σ (ŷi - ȳ)² / Σ (yi - ȳ)² + Σ (yi - ŷi)² / Σ (yi - ȳ)².

We define:
Σ (yi - ȳ)² = total variation
Σ (ŷi - ȳ)² = explained variation
Σ (yi - ŷi)² = unexplained variation.

Hence, 1 - (unexplained variation / total variation) = explained variation / total variation.

Definition: R² = explained variation / total variation. The constant R² is also called the coefficient
of determination.
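The decomposition, and the fact that R² = r², can be checked numerically. A sketch (Python/numpy assumed; not part of the original slides) using the data from the worked example a few slides ahead:

```python
import numpy as np

x = np.array([-1, 0, 2, 3, 3, 4], dtype=float)
y = np.array([1, 2, 4, 2, 1, -1], dtype=float)

m, b = np.polyfit(x, y, 1)
yhat = m * x + b

total = np.sum((y - y.mean()) ** 2)          # total variation
explained = np.sum((yhat - y.mean()) ** 2)   # explained variation
unexplained = np.sum((y - yhat) ** 2)        # unexplained variation

print(total, explained + unexplained)        # the two agree
print(explained / total)                     # R^2, ~0.0796
print(np.corrcoef(x, y)[0, 1] ** 2)          # r^2, the same value
```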
66
Interesting Fact
Theorem: Let R² be the coefficient of determination and r be the correlation coefficient, i.e.,
r = (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy]. Then

R² = r² = { (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy] }².
67
Correlation r and Proportional
Reduction in Error R2
Both r and R² describe the strength of an association between the
quantitative variables x and y.
• -1 ≤ r ≤ 1, and r equals the slope of the regression line when x and y
have been standardized.
• 0 ≤ R² ≤ 1, and R² summarizes the proportional reduction in prediction error
obtained by using the regression line, rather than the mean ȳ, to predict the
y-values in the sample. People often express R² as a percentage.
Bottom Line: Both are measures of strength of an association, but have
different interpretations.
68
Example
Data: ((-1,1),(0,2),(2,4),(3,2),(3,1),(4,-1))
Scatterplot:
Sample means: x̄ = 11/6, ȳ = 3/2
Sample standard deviations: sx = sqrt(113/30) ≈ 1.94, sy = sqrt(27/10) ≈ 1.64
Correlation coefficient: r = -3/sqrt(113) ≈ -0.282
69
 246 219 165 138 138 111 
Predictions ŷi: (246/113, 219/113, 165/113, 138/113, 138/113, 111/113)
Coefficient of determination: R² = r² = (-3/sqrt(113))² = 9/113 ≈ 0.079646
Slope: m = -27/113 ≈ -0.239
y-intercept: b = 219/113 ≈ 1.938
Regression line: ŷ = -(27/113)x + 219/113
70
Residual Analysis
We have seen in the previous section that the residual for each point in
the bivariate set gives us a measure of how well the regression line “fits”
the data. Overall, residual analysis can be used:
• To determine if a linear regression model is appropriate.
• To check the variation of the residuals (the variance or standard deviation of the residuals).
• To help isolate outliers.
71
Is a Linear Model
Appropriate?
Question: For a given set of bivariate data, is a regression line, y = mx + b, an
appropriate model for approximating the behavior of y as x varies?
Approach: Plot the residuals of the data set as a function of the explanatory
variable: (x, residual). If this plot is "flat, but random", i.e., the residuals do not
change much with x and there is no discernible pattern to them, then a
linear model is probably OK.
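A minimal residual-plot sketch (Python with numpy and matplotlib assumed; not part of the original slides), using the small dataset from the earlier regression example:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([-1, 1, 2, 4, 5], dtype=float)
y = np.array([0, 2, 3, 3, 4], dtype=float)

m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

plt.scatter(x, residuals)               # residual vs. explanatory variable
plt.axhline(0, linestyle="--")          # reference line at zero
plt.xlabel("Explanatory variable (x)")
plt.ylabel("Residual")
plt.title("Residual plot")
plt.show()
```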
72
Example
Right Humerus (mm)   Right Tibia (mm)
24.80                36.05
24.59                35.57
24.59                35.57
24.29                34.58
23.81                34.20
24.87                34.73
25.90                37.38
26.11                37.96
26.63                37.46
26.31                37.75
26.84                38.50
n = 11
x̄ = 25.34, sx = 1.04127
ȳ = 36.3409, sy = 1.52162
r = (1/(n-1)) Σ [(xi - x̄)/sx] [(yi - ȳ)/sy] = 0.951316
m = r (sy / sx) = 1.39017
b = ȳ - m x̄ = 1.11395
ŷ = 1.39017x + 1.11395
73
Residuals: {0.459784, 0.27172, 0.27172, -0.301229, -0.0139461, -0.957528, 0.260595, 0.548659, -0.674231,
0.0606241, 0.073833}. The sum of the squares of the residuals is 2.19951.
The mean of the set of residuals is -2.58 × 10^-15 (essentially zero) and the sample standard deviation is 0.468989.
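A sketch (Python/numpy assumed; not part of the original slides) that re-enters the rat data and reproduces these residual diagnostics:

```python
import numpy as np

humerus = np.array([24.80, 24.59, 24.59, 24.29, 23.81, 24.87,
                    25.90, 26.11, 26.63, 26.31, 26.84])
tibia = np.array([36.05, 35.57, 35.57, 34.58, 34.20, 34.73,
                  37.38, 37.96, 37.46, 37.75, 38.50])

m, b = np.polyfit(humerus, tibia, 1)     # ~1.390 and ~1.114
residuals = tibia - (m * humerus + b)

print(residuals)                          # matches the list above
print(np.sum(residuals ** 2))             # sum of squared residuals, ~2.20
print(residuals.std(ddof=1))              # sample SD of the residuals, ~0.469
```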
74
Variance of Residuals
If the spread of the residuals, viewed as a function of the explanatory variable,
increases or decreases as the explanatory variable increases, then the
linear regression line may not be an appropriate model. In this case it
violates the constant error variance criterion.
75
Example
Data: ((-1, -3.85725), (1, -1.89082), (3, -0.183631), (4, 0.806224), (5, 2.05))
x̄ = 2.4, ȳ = -0.615098
sx = 2.40832, sy = 2.3156
r = 0.998419
m = r (sy / sx) = 0.959982
b = ȳ - m x̄ = -2.91905
ŷ = mx + b = 0.959982x - 2.91905
Residuals: 0.0217823, 0.0682502, -0.144522, -0.11465, 0.16914
Sample standard deviation of residuals: 0.130165
Constant Variance of
Residuals
76
Example
Data: ((-1, -2.29705), (1, -2.06739), (3, 0.528931), (4, 4.45035), (5, 10.6457))
x̄ = 2.4, ȳ = 2.25211
sx = 2.40832, sy = 5.42233
r = 0.877213
m = r (sy / sx) = 1.97504
b = ȳ - m x̄ = -2.488
ŷ = mx + b = 1.97504x - 2.488
Residuals: 2.16599, -1.55443, -2.9082, -0.961828, 3.25847
Sample standard deviation of residuals: 2.60327
Non-constant Variance
of Residuals
77
Outliers
Outliers can dramatically affect the regression line parameters: m and b.
If outliers can be identified, then they should be removed from the
calculation of the equation for the regression line.
78
Example
Data: ((-1,-2),(0,0),(1,2),(3,4))
Means: x: 0.75 and y: 1.0
SD: sx = 1.708 and sy = 2.582
Correlation: r = 0.9827
Regression Line: m = 1.486 and b = -0.114
Data: ((-1,-2),(0,0),(1,2),(3,10))
Means: x: 0.75 and y: 2.5
SD: sx = 1.708 and sy = 5.26
Correlation: r = 0.9833
Regression Line: m = 3.029 and b = 0.229
Notice that the correlation coefficient r hasn’t
changed much, but the slope and y-intercept have.
79
Box-whisker Plot for Residuals
and Outliers
Consider the following bivariate data: {(0,1),(1,-3),(2,3),(4,4),(5,6),(6,6),(7,15)}
r = 0.860226
ŷ = 1.81507x - 1.91096
It appears that (7,15) might be an outlier.
80
We examine the set of residuals:
{2.91096, -2.90411, 1.28082, -1.34932, -1.16438, -2.97945, 4.20548}.
We calculate the five-number summary for the residuals without the last residual (4.20548). That is,
(-2.97945, -2.94178, -1.25685, 2.09589, 2.91096). The upper fence is Q3 + 1.5(IQR), where
IQR = 5.03767. That is, UF = 9.6524. Hence, the residual corresponding to x = 7
would not be considered an outlier.
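A quick sketch of the same outlier check (Python/numpy assumed; not part of the original slides). Numpy's default quartile convention differs slightly from the one used on the slide, but the conclusion, that the x = 7 residual does not exceed the upper fence, is the same:

```python
import numpy as np

# Residuals of the first six points (the suspect point at x = 7 is left out)
res = np.array([2.91096, -2.90411, 1.28082, -1.34932, -1.16438, -2.97945])

q1, med, q3 = np.percentile(res, [25, 50, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(q1, med, q3, upper_fence)
print(4.20548 > upper_fence)     # False: not flagged as an outlier
```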
81
82
Influential Observations
An influential observation is a data point that significantly changes the
equation of the linear regression line. Such a point may or may not be
an outlier.
83
Example
Data: {(-2., -0.221996), (-1.75, -0.135721), (-1.5, -0.119657), (-1.25, -0.2215), (-1., -0.122489), (-0.75, -0.0977267),
(-0.5, -0.112159), (-0.25, -0.0423737), (0., -0.0465509), (0.25, 0.121379), (0.5, -0.0336865), (0.75, 0.118555),
(1., 0.080907), (1.25, 0.116514), (1.5, 0.181848), (1.75, 0.186804), (2., 0.170339), (2.25, 0.200667), (2.5, 0.233716),
(2.75, 0.362922), (3., 0.365765), (3.25, 0.349446), (3.5, 0.445323), (3.75, 0.306702), (4., 0.38776), (4.25, 0.510167),
(4.5, 0.41498), (4.75, 0.403201), (5., 0.410249)}
r = 0.967795
m = 0.101504
b = -0.00696613
ŷ = 0.101504x - 0.00696613
84
Add the point (1,3) - outlier
r = 0.334904
m = 0.0904561
b = 0.10627
ŷ = 0.0904561x + 0.10627
The point (1,3) is an outlier, but it is not an influential point since it doesn't significantly
change the linear regression line.
85
Add the point (10,3) - outlier
r = 0.850706
m = 0.184705
b = -0.0889436
ŷ = 0.184705x - 0.0889436
The point (10,3) is an outlier and an influential point (it almost doubles the slope of the regression
line).
86
Rule of Thumb
If a point is an outlier with respect to the
explanatory variable (x), then it will be an influential
point.
87