Transcript Section 9-3

STATISTICS
Chapter 5
Regression
MVS 250: V. Katch
1
Regression
Definition
 Regression Equation
Given a collection of paired data, the regression
equation
$\hat{y} = mx + b$
algebraically describes the relationship between the
two variables
 Regression Line
(line of best fit or least-squares line)
the graph of the regression equation
2
Regression Line Plotted on Scatter Plot
3
Regression Line
4
Two different lines, one to predict X and one to predict Y.
5
The Regression Equation
x is the independent variable
(predictor variable)
$\hat{y}$ is the dependent variable
(response variable)
$\hat{y} = mx + b$
m = slope
b = y-intercept
6
Assumptions
1. We are investigating only linear relationships.
2. For each x value, y is a random variable
having a normal (bell-shaped) distribution.
All of these y distributions have the same
variance. Also, for a given value of x, the
distribution of y-values has a mean that lies
on the regression line. (Results are not
seriously affected if departures from normal
distributions and equal variances are not too
extreme.)
7
Formula for y-intercept and slope
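Formula 1 (y-intercept)
$$b = \frac{(\sum y/n)(\sum x^2/n) - (\sum x/n)(\sum xy/n)}{(\sum x^2/n) - (\sum x/n)^2}$$
Formula 2 (slope)
$$m = \frac{(\sum xy/n) - (\sum x/n)(\sum y/n)}{(\sum x^2/n) - (\sum x/n)^2}$$
In both formulas the denominator is $SD_x^2$.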
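As a minimal sketch, Formulas 1 and 2 can be computed directly from the averaged sums, reading each term as a summation (for example, x/n as the sum of the x-values divided by n). The function name `fit_line_from_sums` is hypothetical, and the four data points are the ones used in the residuals example in this section.

```python
def fit_line_from_sums(xs, ys):
    """Return (m, b) for y-hat = m*x + b using Formulas 1 and 2."""
    n = len(xs)
    mean_x = sum(xs) / n                                   # (sum of x) / n
    mean_y = sum(ys) / n                                   # (sum of y) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n       # (sum of xy) / n
    mean_x2 = sum(x * x for x in xs) / n                   # (sum of x^2) / n
    var_x = mean_x2 - mean_x ** 2                          # shared denominator, SD_x^2
    if var_x == 0:
        raise ValueError("all x-values are identical; the slope is undefined")
    b = (mean_y * mean_x2 - mean_x * mean_xy) / var_x      # Formula 1
    m = (mean_xy - mean_x * mean_y) / var_x                # Formula 2
    return m, b


xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]
m, b = fit_line_from_sums(xs, ys)
print(f"y-hat = {b:.3g} + {m:.3g}x")  # prints: y-hat = 5 + 4x
```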
8
If you find r, then
Formula 3 Slope =
$m = r \, s_y / s_x$
where $s_y$ is the standard deviation of the y-values and $s_x$
is the standard deviation of the x-values
Formula 4 Intercept =
$b = \bar{y} - m\bar{x}$
where $\bar{y}$ is the mean of the y-values, $\bar{x}$ is
the mean of the x-values and m is the slope
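A short sketch of Formulas 3 and 4, assuming that sy and sx denote the standard deviations of the y-values and x-values. The function name `fit_line_from_r` is hypothetical; the data are the four points from the residuals example in this section.

```python
import math


def fit_line_from_r(xs, ys):
    """Return (r, m, b): correlation, slope (Formula 3), intercept (Formula 4)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    # Sample standard deviations (n - 1 divisor). The ratio s_y / s_x
    # is the same if an n divisor is used for both.
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    m = r * s_y / s_x            # Formula 3
    b = mean_y - m * mean_x      # Formula 4
    return r, m, b


r, m, b = fit_line_from_r([1, 2, 4, 5], [4, 24, 8, 32])
print(f"r = {r:.3f}, slope = {m:.3g}, intercept = {b:.3g}")
```

Both routes (the averaged sums or r with the standard deviations) should give the same line for the same data.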
9
Rounding
the y-intercept and the slope
 Round to three significant digits
 If you use Formulas 1, 2, and 3, try not to
round intermediate values.
10
The regression line
fits the sample
points best.
11
Residuals and the
Least-Squares Property
Definitions
Residual
for a sample of paired (x, y) data, the difference $(y - \hat{y})$
between an observed sample y-value and the value of $\hat{y}$,
which is the value of y that is predicted by using the
regression equation.
Least-Squares Property
A straight line satisfies this property if the sum of the
squares of the residuals is the smallest sum possible.
12
Residuals and the
Least-Squares Property
x    1    2    4    5
y    4   24    8   32
$\hat{y} = 5 + 4x$
[Figure: scatter plot of the four points with the regression line; x-axis from 1 to 5, y-axis from 0 to 32. Residuals: -5 at x = 1, 11 at x = 2, -13 at x = 4, 7 at x = 5.]
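The residuals for this example can be checked with a few lines of Python. This is a sketch only: the helper names are hypothetical, and the four alternative lines are arbitrary nearby choices used to illustrate the least-squares property, not part of the original slide.

```python
xs = [1, 2, 4, 5]
ys = [4, 24, 8, 32]


def residuals(m, b, xs, ys):
    """Observed y minus predicted y-hat = m*x + b, for each point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


def sum_squared_residuals(m, b, xs, ys):
    return sum(e ** 2 for e in residuals(m, b, xs, ys))


best = sum_squared_residuals(4, 5, xs, ys)
print(residuals(4, 5, xs, ys))  # [-5, 11, -13, 7]
print(best)                     # 364

# Hypothetical nearby lines: each gives a larger sum of squared residuals.
alternatives = [(4.5, 5), (3.5, 5), (4, 6), (4, 4)]
for m, b in alternatives:
    print(m, b, sum_squared_residuals(m, b, xs, ys))
```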
13
Predictions
In predicting a value of y based on some
given value of x ...
1. If there is not a significant linear
correlation, the best predicted y-value is $\bar{y}$,
the mean of the sample y-values.
2. If there is a significant linear correlation,
the best predicted y-value is found by
substituting the x-value into the
regression equation.
14
Predicting the Value of a Variable
Start
Calculate the value of r and test the hypothesis that $\rho = 0$.
Is there a significant linear correlation?
Yes: Use the regression equation to make predictions. Substitute the given value in the regression equation.
No: Given any value of one variable, the best predicted value of the other variable is its sample mean.
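The decision procedure can be sketched as a small function. Two assumptions are made here: significance is judged by comparing |r| with a critical value supplied by the caller (the parameter `r_critical` is hypothetical and would normally come from a table of critical values of r), and 0.707 is used as the commonly tabulated critical value for n = 8 at the 0.05 level. The data are the Garbage Project values from this section.

```python
import math


def best_prediction(x0, xs, ys, r_critical):
    """Predict y at x0: use the regression line only if r is significant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) > r_critical:
        # Significant linear correlation: substitute x0 into the equation.
        m = sxy / sxx
        b = mean_y - m * mean_x
        return m * x0 + b
    # No significant linear correlation: fall back to the sample mean of y.
    return mean_y


plastic = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]
household = [2, 3, 3, 6, 4, 2, 1, 5]

with_line = best_prediction(0.50, plastic, household, r_critical=0.707)
# A deliberately strict, hypothetical cutoff to show the "No" branch.
with_mean = best_prediction(0.50, plastic, household, r_critical=0.95)
print(round(with_line, 2), round(with_mean, 2))
```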
15
Guidelines for Using The
Regression Equation
1. If there is no significant linear correlation,
don’t use the regression equation to make
predictions.
2. When using the regression equation for
predictions, stay within the scope of the
available sample data.
3. A regression equation based on old data is
not necessarily valid now.
4. Don’t make predictions about a population
that is different from the population from
which the sample data was drawn.
16
Example
X    1   2   3   4   5  10  15  18  20  30
Y   34  36  37  39  41  50  59  64  68  86
Compute r, slope, intercept, and the regression equation.
What is this equation used for?
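One way to carry out this exercise is sketched below in plain Python. The function name `describe_line` is hypothetical; the X and Y values are the ten pairs listed in the example.

```python
import math


def describe_line(xs, ys):
    """Return (r, slope, intercept) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


X = [1, 2, 3, 4, 5, 10, 15, 18, 20, 30]
Y = [34, 36, 37, 39, 41, 50, 59, 64, 68, 86]
r, slope, intercept = describe_line(X, Y)
# Report the slope and intercept to three significant digits.
print(f"r = {r:.4f}, slope = {slope:.3g}, intercept = {intercept:.3g}")
```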
17
What is the best predicted size of a household
that discards 0.50 lb of plastic?
Data from the Garbage Project
x Plastic (lb)     0.27  1.41  2.19  2.83  2.19  1.81  0.85  3.05
y Household size   2     3     3     6     4     2     1     5
18
Using a calculator:
b = 0.549
m = 1.48
$\hat{y} = 0.549 + 1.48(0.50)$
$\hat{y} = 1.3$
A household that discards 0.50 lb of plastic has
approximately one person.
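The calculator values can be reproduced with a short script. This is a sketch: the helper `three_sig` is hypothetical and simply applies the three-significant-digit rounding rule to the final slope and intercept, while the prediction uses the unrounded values.

```python
plastic = [0.27, 1.41, 2.19, 2.83, 2.19, 1.81, 0.85, 3.05]
household = [2, 3, 3, 6, 4, 2, 1, 5]

n = len(plastic)
mean_x = sum(plastic) / n
mean_y = sum(household) / n
sxx = sum((x - mean_x) ** 2 for x in plastic)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(plastic, household))

m = sxy / sxx                # slope
b = mean_y - m * mean_x      # y-intercept


def three_sig(value):
    """Round to three significant digits for reporting."""
    return float(f"{value:.3g}")


prediction = b + m * 0.50    # predicted household size for 0.50 lb
print(three_sig(b), three_sig(m), round(prediction, 1))
```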
19
Definitions
 Marginal Change
the amount a variable changes when the
other variable changes by exactly one unit
 Outlier
a point lying far away from the other data
points
 Influential Points
points which strongly affect the graph of the
regression line
20
Example 5.4 Height and Foot Length (cont)
Three outliers were
data entry errors.
Regression equation
uncorrected data: 15.4 + 0.13 height
corrected data: -3.2 + 0.42 height
Correlation
uncorrected data: r = 0.28
corrected data: r = 0.69
21
Example 5.10 Earthquakes in US
San Francisco
earthquake of 1906.
Correlation
all data: r = 0.73
w/o SF: r = –0.96
22
Example: Predict the quiz score of a student who
spends 30 hours a week watching television.
One more step…….
23
Compute the
Standard Error of the Estimate
$S_{Y \cdot X} = SD_Y \sqrt{1 - r^2}$
$S_{Y \cdot X} = 13.83 \sqrt{1 - (-0.817)^2}$
$S_{Y \cdot X} = \pm 7.978$
The predicted score is 56.56 points ± 7.978 points
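A minimal sketch of this calculation follows. It assumes the correlation on the slide is r = -0.817 (a correlation must lie between -1 and 1, so -8.17 cannot be the intended value), and the function name is hypothetical. With these inputs the result is close to the slide's 7.978; the small difference is presumably due to rounding of r.

```python
import math


def standard_error_of_estimate(sd_y, r):
    """S_{Y.X} = SD_Y * sqrt(1 - r^2)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    return sd_y * math.sqrt(1 - r ** 2)


see = standard_error_of_estimate(13.83, -0.817)  # r is an assumed reading
predicted_score = 56.56                          # value given on the slide
low, high = predicted_score - see, predicted_score + see
print(f"{predicted_score} plus or minus {see:.2f} (from {low:.2f} to {high:.2f})")
```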
24
Multiple Regression
Definition
Multiple Regression Equation
A linear relationship between a dependent
variable y and two or more independent
variables $(x_1, x_2, x_3, \ldots, x_k)$
$\hat{y} = m_0 + m_1 x_1 + m_2 x_2 + \cdots + m_k x_k$
25
Generic Models
Linear:
$y = a + bx$
Quadratic:
$y = ax^2 + bx + c$
Logarithmic:
$y = a + b \ln x$
Exponential:
$y = ab^x$
Power:
$y = ax^b$
Logistic:
$y = \frac{c}{1 + ae^{-bx}}$
26
Development of a Good
Mathematics Model
 Look for a Pattern in the Graph: Examine the graph of the
plotted points and compare the basic pattern to the known
generic graphs.
 Find and Compare Values of $R^2$: Select functions that
result in larger values of $R^2$, because such larger values
correspond to functions that better fit the observed points.
 Think: Use common sense. Don’t use a model that leads to
predicted values known to be totally unrealistic.
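The comparison of R^2 values can be sketched for two of the generic models, using the ten X and Y pairs from the earlier example in this section. Two assumptions: the helper names are hypothetical, and the exponential model is fitted by regressing ln(y) on x, which is a common shortcut rather than a direct least-squares fit of the curve.

```python
import math


def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predicted):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


X = [1, 2, 3, 4, 5, 10, 15, 18, 20, 30]
Y = [34, 36, 37, 39, 41, 50, 59, 64, 68, 86]

# Linear model: y = a + b*x
b_lin, a_lin = linear_fit(X, Y)
r2_linear = r_squared(Y, [a_lin + b_lin * x for x in X])

# Exponential model: y = a * b**x, fitted as ln(y) = ln(a) + k*x
k, ln_a = linear_fit(X, [math.log(y) for y in Y])
r2_exponential = r_squared(Y, [math.exp(ln_a + k * x) for x in X])

print(f"linear R^2 = {r2_linear:.4f}, exponential R^2 = {r2_exponential:.4f}")
```

For these data the linear model gives the larger R^2, which agrees with the straight-line pattern in the points.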
33