
Example
• Let y be the monthly sales revenue for a
company. This might be a function of
several variables:
– x1 = advertising expenditure
– x2 = time of year
– x3 = state of economy
– x4 = size of inventory
• We want to predict y using knowledge of
x1, x2, x3 and x4.
Copyright ©2003 Brooks/Cole
A division of Thomson Learning, Inc.
A Simple Linear Model
• In Chapter 3, we used the equation of
a line to describe the relationship between y
and x for a sample of n pairs, (x, y).
• If we want to describe the relationship
between y and x for the whole population,
there are two models we can choose
• Deterministic Model: y = α + βx
• Probabilistic Model:
  – y = deterministic model + random error
  – y = α + βx + ε
A Simple Linear Model
• Since the bivariate measurements that we observe do not generally fall exactly on a straight line, we choose to use:
• Probabilistic Model:
  – y = α + βx + ε
  – E(y) = α + βx is the line of means.
  – Points deviate from the line of means by an amount ε, where ε has a normal distribution with mean 0 and variance σ².
The Method of Least Squares
• The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized.

Best-fitting line: ŷ = a + bx
Choose a and b to minimize SSE = Σ(y - ŷ)² = Σ(y - a - bx)²
Least Squares Estimators
Calculate the sums of squares:

Sxx = Σx² - (Σx)²/n     Syy = Σy² - (Σy)²/n
Sxy = Σxy - (Σx)(Σy)/n

Best-fitting line: ŷ = a + bx, where
b = Sxy/Sxx  and  a = ȳ - b x̄
Example
The table shows the math achievement test scores for a random sample of n = 10 college freshmen, along with their final calculus grades.

Student            1   2   3   4   5   6   7   8   9   10
Math test, x       39  43  21  64  57  47  28  75  34  52
Calculus grade, y  65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:

Σx = 460    Σy = 760    Σx² = 23634    Σy² = 59816    Σxy = 36854
x̄ = 46     ȳ = 76
Example
Sxx = 23634 - (460)²/10 = 2474
Syy = 59816 - (760)²/10 = 2056
Sxy = 36854 - (460)(760)/10 = 1894

b = 1894/2474 = .76556  and  a = 76 - .76556(46) = 40.78

Best-fitting line: ŷ = 40.78 + .77x
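As a check on the arithmetic above, the least-squares estimates can be computed directly from the ten data pairs. This is a minimal Python sketch; Python is not part of the original slides:

```python
# Math achievement scores (x) and calculus grades (y) for the n = 10 freshmen
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

# Sums and corrected sums of squares
Sx, Sy = sum(x), sum(y)
Sxx = sum(v * v for v in x) - Sx**2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - Sx * Sy / n

# Least-squares slope and intercept
b = Sxy / Sxx            # 1894 / 2474
a = Sy / n - b * Sx / n  # ybar - b * xbar
print(round(b, 5), round(a, 2))
```

The printed values match the slide's b = .76556 and a = 40.78.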
The Analysis of Variance
• The total variation in the experiment is measured by the total sum of squares:

Total SS = Syy = Σ(y - ȳ)²
The Total SS is divided into two parts:
SSR (sum of squares for regression):
measures the variation explained by using x in
the model.
SSE (sum of squares for error): measures the
leftover variation not explained by x.
The Analysis of Variance
We calculate:

SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741

SSE = Total SS - SSR = Syy - (Sxy)²/Sxx
    = 2056 - 1449.9741 = 606.0259
The ANOVA Table
Total df = n - 1
Regression df = 1
Error df = n - 1 - 1 = n - 2

Mean Squares:
MSR = SSR/1
MSE = SSE/(n - 2)

Source      df     SS        MS           F
Regression  1      SSR       SSR/1        MSR/MSE
Error       n - 2  SSE       SSE/(n - 2)
Total       n - 1  Total SS
The Calculus Problem
SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741
SSE = Total SS - SSR = Syy - (Sxy)²/Sxx = 2056 - 1449.9741 = 606.0259

Source      df  SS         MS         F
Regression  1   1449.9741  1449.9741  19.14
Error       8   606.0259   75.7532
Total       9   2056.0000
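The ANOVA decomposition above can be reproduced numerically from the sums of squares; a small Python sketch, added here for illustration:

```python
# Corrected sums of squares from the calculus example
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0
n = 10

SSR = Sxy**2 / Sxx   # variation explained by the regression
SSE = Syy - SSR      # leftover (error) variation
MSE = SSE / (n - 2)  # error mean square, df = n - 2
F = SSR / MSE        # MSR = SSR/1 in simple linear regression
print(round(SSR, 4), round(SSE, 4), round(F, 2))
```

The output reproduces the ANOVA table entries 1449.9741, 606.0259, and F = 19.14.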
Testing the Usefulness
of the Model
• The first question to ask is whether the independent variable x is of any use in predicting y.
• If it is not, then the value of y does not change, regardless of the value of x. This implies that the slope of the line, β, is zero.

H0: β = 0  versus  Ha: β ≠ 0
Testing the
Usefulness of the Model
• The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic.

Test statistic: t = (b - 0)/√(MSE/Sxx), which has a t distribution with df = n - 2,

or a confidence interval: b ± t(α/2) √(MSE/Sxx)
The Calculus Problem
• Is there a significant relationship between the calculus grades and the test scores at the 5% level of significance?

H0: β = 0  versus  Ha: β ≠ 0

t = (b - 0)/√(MSE/Sxx) = (.7656 - 0)/√(75.7532/2474) = 4.38

Reject H0 when |t| > 2.306. Since t = 4.38 falls into the rejection region, H0 is rejected. There is a significant linear relationship between the calculus grades and the test scores for the population of college freshmen.
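The t statistic above is straightforward to verify; a Python sketch using the quantities from the ANOVA table (not part of the original slides):

```python
import math

# Slope estimate and ANOVA quantities from the calculus example
b, MSE, Sxx = 0.7656, 75.7532, 2474.0

# t statistic for H0: beta = 0
t = (b - 0) / math.sqrt(MSE / Sxx)
print(round(t, 2))
```

This reproduces the slide's t = 4.38.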
The F Test
• You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
To test H0: the model is useful in predicting y

Test Statistic: F = MSR/MSE
Reject H0 if F > Fα with 1 and n - 2 df.

This test is exactly equivalent to the t-test, with t² = F.
Minitab Output
Least squares regression line; to test H0: β = 0.

Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x

Predictor   Coef     SE Coef   T      P
Constant    40.784   8.507     4.79   0.001
x           0.7656   0.1750    4.38   0.002

S = 8.704   R-Sq = 70.5%   R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    1450.0   1450.0   19.14   0.002
Residual Error  8    606.0    75.8
Total           9    2056.0

The Coef column holds the regression coefficients a and b; the MS for error is MSE; note that t² = F.
Measuring the Strength
of the Relationship
• If the independent variable x is useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between x and y can be measured using:

Correlation coefficient: r = Sxy/√(Sxx Syy)

Coefficient of determination: r² = (Sxy)²/(Sxx Syy) = SSR/Total SS
Measuring the Strength
of the Relationship
• Since Total SS = SSR + SSE, r² measures
  – the proportion of the total variation in the responses that can be explained by using the independent variable x in the model.
  – the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.

For the calculus problem, r² = SSR/Total SS = .705 or 70.5%. The model is working well!
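The 70.5% figure follows directly from the ANOVA table; a one-step Python check, added for illustration:

```python
# Sums of squares from the calculus ANOVA table
SSR, Total_SS = 1449.9741, 2056.0

# Coefficient of determination: proportion of variation explained by x
r_sq = SSR / Total_SS
print(round(r_sq, 3))
```

The result matches r² = .705 from the slides and the R-Sq = 70.5% line of the Minitab output.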
Checking the
Regression Assumptions
•Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.
1. The relationship between x and y is linear, given by y = α + βx + ε.
2. The random error terms ε are independent and, for any value of x, have a normal distribution with mean 0 and variance σ².
Residuals
•The residual error is the “leftover”
variation in each data point after the
variation explained by the regression model
has been removed.
Residual yi  yˆi or yi  a  bxi
•
•If all assumptions have been met, these
residuals should be normal, with mean 0
and variance s2.
Normal Probability Plot
 If the normality assumption is valid, the
plot should resemble a straight line,
sloping upward to the right.
 If not, you will often see the pattern fail
in the tails of the graph.
Estimation and Prediction
• Once you have
  – determined that the regression line is useful,
  – used the diagnostic plots to check for violations of the regression assumptions,
• you are ready to use the regression line to
  – estimate the average value of y for a given value of x,
  – predict a particular value of y for a given value of x.
Estimation and Prediction
[Figure: two fitted-line plots, one showing the interval for estimating the average value of y when x = x0, the other the wider interval for estimating a particular value of y when x = x0.]
Estimation and Prediction
• The best estimate of either E(y) or y for a given value x = x0 is ŷ = a + bx0.
• Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.
Estimation and Prediction
To estimate the average value of y when x = x0:

ŷ ± t(α/2) √( MSE (1/n + (x0 - x̄)²/Sxx) )

To predict a particular value of y when x = x0:

ŷ ± t(α/2) √( MSE (1 + 1/n + (x0 - x̄)²/Sxx) )
The Calculus Problem
• Estimate the average calculus grade for students whose achievement score is 50, with a 95% confidence interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

ŷ ± 2.306 √( 75.7532 (1/10 + (50 - 46)²/2474) )

79.06 ± 6.55, or 72.51 to 85.61.
The Calculus Problem
• Estimate the calculus grade for a particular student whose achievement score is 50, with a 95% prediction interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06

ŷ ± 2.306 √( 75.7532 (1 + 1/10 + (50 - 46)²/2474) )

79.06 ± 21.11, or 57.95 to 100.17. Notice how much wider this interval is!
Minitab Output
Confidence and prediction intervals when x = 50:

Predicted Values for New Observations
New Obs   Fit     SE Fit   95.0% CI          95.0% PI
1         79.06   2.84     (72.51, 85.61)    (57.95, 100.17)

Values of Predictors for New Observations
New Obs   x
1         50.0

The prediction bands are always wider than the confidence bands. Both intervals are narrowest when x = x̄.
Correlation Analysis
• The strength of the relationship between x and y is measured using the coefficient of correlation:

Correlation coefficient: r = Sxy/√(Sxx Syy)

• Recall from Chapter 3 that
(1) -1 ≤ r ≤ 1
(2) r and b have the same sign
(3) r ≈ 0 means no linear relationship
(4) r near 1 or -1 means a strong (+) or (-) relationship
Example
The table shows the heights and weights of n = 10 randomly selected college football players.

Player     1    2    3    4    5    6    7    8    9    10
Height, x  73   71   75   72   72   75   67   69   71   69
Weight, y  185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of squares:

Sxy = 328    Sxx = 60.4    Syy = 2610

r = 328/√((60.4)(2610)) = .8261
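The correlation coefficient for the football-player data follows in one line from the sums of squares; a Python check, added for illustration:

```python
import math

# Sums of squares for the football-player example
Sxy, Sxx, Syy = 328.0, 60.4, 2610.0

# Coefficient of correlation
r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 4))
```

The output matches the slide's r = .8261.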
Some Correlation Patterns
• Use the Exploring Correlation applet to explore some correlation patterns:
  – r = 0: no correlation
  – r = .931: strong positive correlation
  – r = 1: linear relationship
  – r = -.67: weaker negative correlation
Inference using r
• The population coefficient of correlation is called ρ ("rho"). We can test for a significant correlation between x and y using a t test:

To test H0: ρ = 0 versus Ha: ρ ≠ 0

Test Statistic: t = r √((n - 2)/(1 - r²))
Reject H0 if t > t(α/2) or t < -t(α/2) with n - 2 df.

This test is exactly equivalent to the t-test for the slope β.
r = .8261. Is there a significant positive correlation between weight and height in the population of all college football players?

H0: ρ = 0  versus  Ha: ρ > 0

Test Statistic: t = r √((n - 2)/(1 - r²)) = .8261 √(8/(1 - .8261²)) = 4.15

Use the t-table with n - 2 = 8 df to bound the p-value as p-value < .005. There is a significant positive correlation.
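The t statistic for testing ρ = 0 can be verified directly; a Python sketch (not part of the original slides):

```python
import math

# Correlation and sample size from the football-player example
r, n = 0.8261, 10

# t statistic for H0: rho = 0
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))
```

The output reproduces the slide's t = 4.15.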
Example
• Let y be the monthly sales revenue for a
company. This might be a function of
several variables:
– x1 = advertising expenditure
– x2 = time of year
– x3 = state of economy
– x4 = size of inventory
• We want to predict y using knowledge of
x1, x2, x3 and x4.
The General
Linear Model
– y = β0 + β1x1 + β2x2 + … + βkxk + ε
• where
  y is the response variable you want to predict.
  β0, β1, β2, …, βk are unknown constants.
  x1, x2, …, xk are independent predictor variables, measured without error.
Example
• Consider the model E(y) = β0 + β1x1 + β2x2.
• This is a first-order model (independent variables appear only to the first power).
• β0 = y-intercept = value of E(y) when x1 = x2 = 0.
• β1 and β2 are the partial regression coefficients: the change in y for a one-unit change in xi when the other independent variables are held constant.
The Method of
Least Squares
• The best-fitting prediction equation is calculated using a set of n measurements (y, x1, x2, …, xk) as ŷ = b0 + b1x1 + … + bkxk.
• We choose our estimates b0, b1, …, bk to estimate β0, β1, …, βk to minimize

SSE = Σ(y - ŷ)² = Σ(y - b0 - b1x1 - … - bkxk)²
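To make the least-squares criterion concrete, here is a short Python sketch that fits ŷ = b0 + b1x1 + b2x2 by solving the normal equations (XᵀX)b = Xᵀy, which is the closed-form minimizer of SSE. The tiny data set is invented purely for illustration and does not come from the slides:

```python
# Invented illustrative data: response y with two predictors x1, x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1]

# Design matrix with an intercept column
X = [[1.0, u, v] for u, v in zip(x1, x2)]

# Normal equations: (X'X) b = X'y
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]

# Solve the 3x3 system by Gauss-Jordan elimination
for i in range(3):
    piv = XtX[i][i]
    XtX[i] = [v / piv for v in XtX[i]]
    Xty[i] /= piv
    for k in range(3):
        if k != i:
            f = XtX[k][i]
            XtX[k] = [a - f * c for a, c in zip(XtX[k], XtX[i])]
            Xty[k] -= f * Xty[i]

b0, b1, b2 = Xty
yhat = [b0 + b1 * u + b2 * v for u, v in zip(x1, x2)]
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

With an intercept in the model, the residuals of the fitted equation sum to zero, and no other choice of b0, b1, b2 gives a smaller SSE.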
Example
A computer database in a small community contains the
listed selling price y (in thousands of dollars), the amount
of living area x1 (in hundreds of square feet), and the
number of floors x2, bedrooms x3, and bathrooms x4, for n
= 15 randomly selected residences currently on the
market.
Property   y      x1   x2   x3   x4
1          69.0    6    1    2    1
2          118.5  10    1    2    2
3          116.5  10    1    3    2
…          …      …    …    …    …
15         209.9  21    2    4    3

Fit a first-order model to the data using the method of least squares.
Example
The first-order model
E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4
is fit using Minitab, with the values of y and the four independent variables entered into five columns of the Minitab worksheet.

Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor   Coef      SE Coef   T       P
Constant    18.763    9.207     2.04    0.069
SqFeet      6.2698    0.7252    8.65    0.000
NumFlrs     -16.203   6.212     -2.61   0.026
Bdrms       -2.673    4.494     -0.59   0.565
Baths       30.271    6.849     4.42    0.001

The Coef column holds the partial regression coefficients.
The Analysis of Variance
• The total variation in the experiment is measured by the total sum of squares:

Total SS = Syy = Σ(y - ȳ)²
The Total SS is divided into two parts:
SSR (sum of squares for regression):
measures the variation explained by using the
regression equation.
SSE (sum of squares for error): measures the
leftover variation not explained by the
independent variables.
The ANOVA Table
Total df = n - 1
Regression df = k
Error df = n - 1 - k = n - k - 1

Mean Squares:
MSR = SSR/k
MSE = SSE/(n - k - 1)

Source      df         SS        MS               F
Regression  k          SSR       SSR/k            MSR/MSE
Error       n - k - 1  SSE       SSE/(n - k - 1)
Total       n - 1      Total SS
The Real Estate Problem
Another portion of the Minitab printout shows the ANOVA table, with n = 15 and k = 4.

S = 6.849   R-Sq = 97.1%   R-Sq(adj) = 96.0%

Analysis of Variance
Source          DF   SS        MS       F       P
Regression      4    15913.0   3978.3   84.80   0.000
Residual Error  10   469.1     46.9
Total           14   16382.2

Source   DF   Seq SS
SqFeet   1    14829.3
NumFlrs  1    0.9
Bdrms    1    166.4
Baths    1    916.5

Sequential sums of squares: the conditional contribution of each independent variable to SSR, given the variables already entered into the model.
Testing the Usefulness
of the Model
• The first question to ask is whether the regression model is of any use in predicting y.
• If it is not, then the value of y does not change, regardless of the values of the independent variables x1, x2, …, xk. This implies that the partial regression coefficients β1, β2, …, βk are all zero.

H0: β1 = β2 = … = βk = 0  versus
Ha: at least one βi is not zero
The F Test
• You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
To test H0: the model is useful in predicting y, which is equivalent to
H0: β1 = β2 = … = βk = 0

Test Statistic: F = MSR/MSE
Reject H0 if F > Fα with k and n - k - 1 df.
Measuring the Strength
of the Relationship
• If the independent variables are useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between the x's and y can be measured using the multiple coefficient of determination:

R² = SSR/Total SS
Measuring the Strength
of the Relationship
• Since Total SS = SSR + SSE, R² measures
  – the proportion of the total variation in the responses that can be explained by using the independent variables in the model.
  – the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.

R² = SSR/Total SS  and  F = MSR/MSE = (R²/k) / ((1 - R²)/(n - k - 1))
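The identity between the ANOVA F statistic and R² can be checked with the real estate values; a Python sketch (Total SS is taken as SSR + SSE so both forms use exactly the same inputs, which differs from the printout's rounded 16382.2 by a tenth):

```python
# Sums of squares from the real estate example (n = 15, k = 4)
SSR, SSE = 15913.0, 469.1
n, k = 15, 4

Total_SS = SSR + SSE
R_sq = SSR / Total_SS

# F computed two ways: from mean squares and from R-squared
F_anova = (SSR / k) / (SSE / (n - k - 1))
F_from_r2 = (R_sq / k) / ((1 - R_sq) / (n - k - 1))
print(round(R_sq, 3), round(F_anova, 1), round(F_from_r2, 1))
```

Both forms agree exactly, reproducing R² = .971 and F = 84.8 from the Minitab output.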
Testing the Partial
Regression Coefficients
• Is a particular independent variable useful in the model, in the presence of all the other independent variables? The test statistic is a function of bi, our best estimate of βi.

H0: βi = 0  versus  Ha: βi ≠ 0

Test statistic: t = (bi - 0)/SE(bi),
which has a t distribution with error df = n - k - 1.
The Real Estate Problem
Is the overall model useful in predicting list price? How much of the overall variation in the response is explained by the regression model?

S = 6.849   R-Sq = 97.1%   R-Sq(adj) = 96.0%

Analysis of Variance
Source          DF   SS        MS       F       P
Regression      4    15913.0   3978.3   84.80   0.000
Residual Error  10   469.1     46.9
Total           14   16382.2

Source   DF   Seq SS
SqFeet   1    14829.3
NumFlrs  1    0.9
Bdrms    1    166.4
Baths    1    916.5

R² = .971 indicates that 97.1% of the overall variation is explained by the regression model.
F = MSR/MSE = 84.80 with p-value = .000 is highly significant. The model is very useful in predicting the list price of homes.
The Real Estate Problem
In the presence of the other three independent variables, is the number of bedrooms significant in predicting the list price of homes? Test using α = .05.

To test H0: β3 = 0, the test statistic is t = -0.59 with p-value = .565. The p-value is larger than .05, and H0 is not rejected. We cannot conclude that number of bedrooms is a valuable predictor in the presence of the other variables. Perhaps the model could be refit without x3.

Regression Analysis: ListPrice versus SqFeet, NumFlrs, Bdrms, Baths
The regression equation is
ListPrice = 18.8 + 6.27 SqFeet - 16.2 NumFlrs - 2.67 Bdrms + 30.3 Baths

Predictor   Coef      SE Coef   T       P
Constant    18.763    9.207     2.04    0.069
SqFeet      6.2698    0.7252    8.65    0.000
NumFlrs     -16.203   6.212     -2.61   0.026
Bdrms       -2.673    4.494     -0.59   0.565
Baths       30.271    6.849     4.42    0.001
Comparing
Regression Models
• The strength of a regression model is measured using R² = SSR/Total SS. This value will only increase as variables are added to the model.
• To fairly compare two models, it is better to use a measure that has been adjusted using df:

R²(adj) = (1 - MSE/(Total SS/(n - 1))) × 100%
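Applying the adjusted-R² formula to the real estate ANOVA values gives the R-Sq(adj) figure Minitab reports; a Python check, added for illustration:

```python
# Real estate ANOVA values: n = 15 observations, error df = n - k - 1 = 10
SSE, Total_SS, n = 469.1, 16382.2, 15
MSE = SSE / 10

# Adjusted R-squared, as a percentage
R_sq_adj = (1 - MSE / (Total_SS / (n - 1))) * 100
print(round(R_sq_adj, 1))
```

The output matches the printout's R-Sq(adj) = 96.0%.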
Checking the
Regression Assumptions
•Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.
– The error terms ε are independent,
– have mean 0 and common variance σ² for any set x1, x2, …, xk, and
– have a normal distribution.
Normal Probability Plot
 If the normality assumption is valid, the
plot should resemble a straight line,
sloping upward to the right.
 If not, you will often see the pattern fail
in the tails of the graph.
Estimation and Prediction
• Enter the appropriate values of x1, x2, …, xk in Minitab. Minitab calculates
ŷ = b0 + b1x1 + b2x2 + … + bkxk
and both the confidence interval and the prediction interval.
• Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.
The Real Estate Problem
• Estimate the average list price for a home with 1000 square feet of living area, one floor, 3 bedrooms, and two baths, with a 95% confidence interval.

We estimate that the average list price will be between $110,860 and $124,700 for a home like this.

Predicted Values for New Observations
New Obs   Fit      SE Fit   95.0% CI            95.0% PI
1         117.78   3.11     (110.86, 124.70)    (101.02, 134.54)

Values of Predictors for New Observations
New Obs   SqFeet   NumFlrs   Bdrms   Baths
1         10.0     1.00      3.00    2.00
Using Regression Models
When you perform multiple regression analysis, use a step-by-step approach:
1. Obtain the fitted prediction model.
2. Use the analysis of variance F test and R² to determine how well the model fits the data.
3. Check the t tests for the partial regression coefficients to see which ones are contributing significant information in the presence of the others.
4. If you choose to compare several different models, use R²(adj) to compare their effectiveness.
5. Use diagnostic plots to check for violations of the regression assumptions.
A Polynomial Model
• A response y is related to a single independent variable x, but not in a linear manner. The polynomial model is:
y = β0 + β1x + β2x² + … + βkxᵏ + ε
• When k = 2, the model is quadratic:
y = β0 + β1x + β2x² + ε
• When k = 3, the model is cubic:
y = β0 + β1x + β2x² + β3x³ + ε
Example
A market research firm has observed the sales (y) as a function of mass media advertising expenses (x) for 10 different companies selling a similar product.

Company         1    2    3    4    5    6    7     8     9     10
Expenditure, x  1.0  1.6  2.5  3.0  4.0  4.6  5.0   5.7   6.0   7.0
Sales, y        2.5  2.6  2.7  5.0  5.3  9.1  14.8  17.5  23.0  28.0

Since there is only one independent variable, you could fit a linear, quadratic, or cubic polynomial model. Which would you pick?
Two Possible Choices
A straight line model: y = β0 + β1x + ε
A quadratic model: y = β0 + β1x + β2x² + ε

Here is the Minitab printout for the straight line. The overall F test is highly significant, as is the t-test of the slope. R² = .856 suggests a good fit. Let's check the residual plots…

Regression Analysis: y versus x
The regression equation is
y = - 6.47 + 4.34 x

Predictor   Coef     SE Coef   T       P
Constant    -6.465   2.795     -2.31   0.049
x           4.3355   0.6274    6.91    0.000

S = 3.725   R-Sq = 85.6%   R-Sq(adj) = 83.9%

Analysis of Variance
Source          DF   SS       MS       F       P
Regression      1    662.46   662.46   47.74   0.000
Residual Error  8    111.00   13.88
Total           9    773.46
Example
There is a strong pattern of a "curve" left over in the residual plot. This indicates that there is a curvilinear relationship unaccounted for by your straight line model. You should have used the quadratic model!

Use Minitab to fit the quadratic model:
y = β0 + β1x + β2x² + ε
The Quadratic Model
Regression Analysis: y versus x, x-sq
The regression equation is
y = 4.66 - 3.03 x + 0.939 x-sq

Predictor   Coef     SE Coef   T       P
Constant    4.657    2.443     1.91    0.098
x           -3.030   1.395     -2.17   0.067
x-sq        0.9389   0.1739    5.40    0.001

S = 1.752   R-Sq = 97.2%   R-Sq(adj) = 96.4%

Analysis of Variance
Source          DF   SS       MS       F        P
Regression      2    751.98   375.99   122.49   0.000
Residual Error  7    21.49    3.07
Total           9    773.47

The overall F test is highly significant, as is the t-test of the quadratic term β2. R² = .972 suggests a very good fit. Let's compare the two models, and check the residual plots.
Which Model to Use?
Use R²(adj) to compare the models:
The straight line model: y = β0 + β1x + ε, R²(adj) = 83.9%
The quadratic model: y = β0 + β1x + β2x² + ε, R²(adj) = 96.4%

The quadratic model is better. There are no patterns in its residual plot, indicating that this is the correct model for the data.
Using Qualitative
Variables
• Multiple regression requires that the response y be a quantitative variable.
• Independent variables can be either quantitative or qualitative.
• Qualitative variables involving k categories are entered into the model by using k - 1 dummy variables.
• Example: to enter gender as a variable, use xi = 1 if male; 0 if female.
Example
Data was collected on 6 male and 6 female assistant professors. The researchers recorded their salaries (y) along with years of experience (x1). The professor's gender enters into the model as a dummy variable: x2 = 1 if male; 0 if not.

Professor   Salary, y   Experience, x1   Gender, x2   Interaction, x1x2
1           $50,710     1                1            1
2           49,510      1                0            0
…           …           …                …            …
11          55,590      5                1            5
12          53,200      5                0            0
Example
We want to predict a professor's salary based on years of experience and gender. We think that there may be a difference in salary depending on whether the professor is male or female.
The model we choose includes experience (x1), gender (x2), and an interaction term (x1x2) to allow salaries for males and females to behave differently:

y = β0 + β1x1 + β2x2 + β3x1x2 + ε
We use Minitab to fit the model.

Minitab Output
Regression Analysis: y versus x1, x2, x1x2
The regression equation is
y = 48593 + 969 x1 + 867 x2 + 260 x1x2

Predictor   Coef      SE Coef   T        P
Constant    48593.0   207.9     233.68   0.000
x1          969.00    63.67     15.22    0.000
x2          866.7     305.3     2.84     0.022
x1x2        260.13    87.06     2.99     0.017

S = 201.3   R-Sq = 99.2%   R-Sq(adj) = 98.9%

Analysis of Variance
Source          DF   SS         MS         F        P
Regression      3    42108777   14036259   346.24   0.000
Residual Error  8    324315     40539
Total           11   42433092

Is the overall model useful in predicting y? The overall F test is F = 346.24 with p-value = .000. The value of R² = .992 indicates that the model fits very well.

What is the regression equation for males? For females?
For males, x2 = 1: y = 49459.7 + 1229.13 x1.
For females, x2 = 0: y = 48593.0 + 969.0 x1.
Two different straight line models.

Is there a difference in the relationship between salary and years of experience, depending on the gender of the professor? Yes. The individual t-test for the interaction term is t = 2.99 with p-value = .017. This indicates a significant interaction between gender and years of experience.
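Substituting x2 = 1 and x2 = 0 into the fitted equation gives the two gender-specific lines; a Python sketch checking the arithmetic (not part of the original slides):

```python
# Fitted coefficients from the salary example
b0, b1, b2, b3 = 48593.0, 969.00, 866.7, 260.13

# Males (x2 = 1): the dummy shifts the intercept, the interaction shifts the slope
male_intercept = b0 + b2
male_slope = b1 + b3

# Females (x2 = 0): the dummy and interaction terms drop out
female_intercept, female_slope = b0, b1
print(male_intercept, male_slope, female_intercept, female_slope)
```

This reproduces the male line y = 49459.7 + 1229.13 x1 and the female line y = 48593.0 + 969.0 x1 quoted above.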
Example
Have any of the regression assumptions been violated, or have we fit the wrong model? It does not appear from the diagnostic plots that there are any violations of assumptions. The model is ready to be used for prediction or estimation.
Testing Sets of Parameters
• Suppose the demand y may be related to five independent variables, but the cost of measuring three of them is very high.
• If it could be shown that these three contribute little or no information, they can be eliminated.
• You want to test the null hypothesis
H0: β3 = β4 = β5 = 0 (that is, the independent variables x3, x4, and x5 contribute no information for the prediction of y)
• versus the alternative hypothesis:
Ha: at least one of the parameters β3, β4, or β5 differs from 0 (that is, at least one of the variables x3, x4, or x5 contributes information for the prediction of y).
Testing Sets of Parameters
To explain how to test a hypothesis concerning a set
of model parameters, we define two models:
• Model One (reduced model):
E(y) = β0 + β1x1 + β2x2 + … + βrxr
• Model Two (complete model):
E(y) = β0 + β1x1 + β2x2 + … + βrxr + βr+1xr+1 + βr+2xr+2 + … + βkxk
(the terms in Model One, plus the additional terms in Model Two)
Testing Sets of Parameters
• The test of the hypothesis
H0: β3 = β4 = β5 = 0
Ha: at least one of the βi differs from 0
uses the test statistic
F = [(SSE1 - SSE2)/(k - r)] / MSE2
where F is based on df1 = (k - r) and df2 = n - (k + 1).
• The rejection region for the test is identical to other analysis of variance F tests, namely F > Fα.
Stepwise Regression
 A stepwise regression analysis fits a variety of models to the data, adding variables as they become significant in the presence of the others, and deleting variables as they become nonsignificant.
 Once the program has performed a sufficient number of iterations, and no more variables are significant when added to the model, and none of the variables are nonsignificant when removed, the procedure stops.
 These programs always fit first-order models and are not helpful in detecting curvature or interaction in the data.
Pearson’s Chi-Square
Statistic
• We have some preconceived idea about the
values of the pi and want to use sample
information to see if we are correct.
• The expected number of times that
outcome i will occur is Ei = npi. If the
observed cell counts, Oi, are too far from
what we hypothesize under H0, the more
likely it is that H0 should be rejected.
Pearson’s Chi-Square
Statistic
• We use the Pearson chi-square statistic:

X² = Σ (Oi - Ei)²/Ei

• When H0 is true, the differences O - E will be small, but large when H0 is false.
• Look for large values of X² based on the chi-square distribution with a particular number of degrees of freedom.
Degrees of Freedom
• These will be different depending on the
application.
1. Start with the number of categories or
cells in the experiment.
2. Subtract 1df for each linear restriction on
the cell probabilities. (You always lose 1
df since p1+p2 +…+ pk = 1.)
3. Subtract 1 df for every population parameter you have to estimate in order to calculate or estimate Ei.
The Goodness of Fit Test
•The simplest of the applications.
•A single categorical variable is measured,
and exact numerical values are specified
for each of the pi.
• Expected cell counts are Ei = npi.
• Degrees of freedom: df = k - 1.

Test statistic: X² = Σ (Oi - Ei)²/Ei
Example
•Toss a die 300 times with the following
results. Is the die fair or biased?
Upper Face       1   2   3   4   5   6
Number of times  50  39  45  62  61  43
• A multinomial experiment with k = 6 and
O1 to O6 given in the table.
• We test:
H0: p1= 1/6; p2 = 1/6;…p6 = 1/6 (die is fair)
Ha: at least one pi is different from 1/6 (die is biased)
Example
•Calculate the expected cell counts:
Ei = npi = 300(1/6) = 50

Upper Face   1    2    3    4    5    6
Oi           50   39   45   62   61   43
Ei           50   50   50   50   50   50

•Test statistic and rejection region:
X² = Σ (Oi − Ei)²/Ei = (50 − 50)²/50 + (39 − 50)²/50 + … + (43 − 50)²/50 = 9.2
Reject H0 if X² > χ².05 = 11.07 with k − 1 = 6 − 1 = 5 df.
Do not reject H0. There is insufficient evidence
to indicate that the die is biased.
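The die-toss calculation above can be reproduced in a few lines. A minimal sketch, not from the text, assuming only the counts in the table:

```python
# Goodness-of-fit check for the die example: 300 tosses,
# H0: every face has probability 1/6, so E_i = n p_i = 50.
observed = [50, 39, 45, 62, 61, 43]
n = sum(observed)                  # 300 tosses
expected = [n / 6] * 6             # 50 for every face

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(x2, 1))                # 9.2

# Critical value with k - 1 = 5 df at the .05 level is 11.07,
# so X2 = 9.2 does not fall in the rejection region.
print(x2 > 11.07)                  # False: do not reject H0
```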
Some Notes
• The test statistic X² has only an
approximate chi-square distribution.
• For the approximation to be accurate,
statisticians recommend Ei ≥ 5 for all cells.
• Goodness-of-fit tests are different from
previous tests, since the experimenter uses
H0 for the model he thinks is true:
H0: model is correct (as specified)
Ha: model is not correct
• Be careful not to accept H0 (say the model
is correct) without reporting β.
Contingency Tables: A
Two-Way Classification
• The experimenter measures two
qualitative variables to generate
bivariate data.
– Gender and colorblindness
– Age and opinion
– Professorial rank and type of
university
• Summarize the data by counting the
observed number of outcomes in each
cell of the table.
r x c Contingency Table
• The contingency table has r rows and c
columns, for rc total cells.

       1     2     …     c
 1    O11   O12    …    O1c
 2    O21   O22    …    O2c
 …     …     …     …     …
 r    Or1   Or2    …    Orc

• We study the relationship between the two
variables. Is one method of classification
contingent or dependent on the other?
• Does the distribution of measurements in the
various categories for variable 1 depend on
which category of variable 2 is being
observed? If not, the variables are independent.
Chi-Square Test of Independence
H0: classifications are independent
Ha: classifications are dependent
•Observed cell counts are Oij for row i and column j.
•Expected cell counts are Eij = npij
If H0 is true and the classifications are
independent, then
pij = pi pj = P(falling in row i)P(falling in column j)
Chi-Square Test of Independence
• Estimate pi and pj with ri/n and cj/n. Then
Êij = n (ri/n)(cj/n) = ri cj / n
• Test statistic: X² = Σ (Oij − Êij)² / Êij
•The test statistic has an approximate chi-square
distribution with df = (r-1)(c-1).
Example
Furniture defects are classified according
to type of defect and shift on which it
was made.
        Shift
Type    1    2    3    Total
A       15   26   33     74
B       21   31   17     69
C       45   34   49    128
D       13   5    20     38
Total   94   96   119   309

Do the data present sufficient evidence to indicate
that the type of defect varies with the shift during
which the piece of furniture is produced? Test at
the 1% level of significance.
H0: type of defect is independent of shift
Ha: type of defect depends on the shift
The Furniture Problem
•Calculate the expected cell counts. For example:
Ê12 = r1c2/n = 74(96)/309 = 22.99
•Test statistic and rejection region:
X² = Σ (Oij − Êij)²/Êij = (15 − 22.51)²/22.51 + (26 − 22.99)²/22.99 + … + (20 − 14.63)²/14.63 = 19.18
Reject H0 if X² > χ².01 = 16.81 with (r − 1)(c − 1) = 6 df.

Chi-Square Test: 1, 2, 3
Expected counts are printed below observed counts

          1       2       3    Total
  1      15      26      33       74
      22.51   22.99   28.50
  2      21      31      17       69
      20.99   21.44   26.57
  3      45      34      49      128
      38.94   39.77   49.29
  4      13       5      20       38
      11.56   11.81   14.63
Total    94      96     119      309

Chi-Sq = 2.506 + 0.394 + 0.711 +
         0.000 + 4.266 + 3.449 +
         0.944 + 0.836 + 0.002 +
         0.179 + 3.923 + 1.967 = 19.178
DF = 6, P-Value = 0.004

Reject H0. There is sufficient evidence to
indicate that the proportion of defect types
varies from shift to shift.
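The expected counts and test statistic above can be checked with a short script. A minimal sketch, not from the text, using only the observed counts from the furniture table:

```python
# Chi-square test of independence for the furniture-defect table:
# rows are defect types A-D, columns are shifts 1-3.
observed = [[15, 26, 33],
            [21, 31, 17],
            [45, 34, 49],
            [13,  5, 20]]

row_tot = [sum(row) for row in observed]        # r_i: 74, 69, 128, 38
col_tot = [sum(col) for col in zip(*observed)]  # c_j: 94, 96, 119
n = sum(row_tot)                                # 309

# E_ij = r_i c_j / n, then X2 = sum over cells of (O - E)^2 / E
x2 = sum((observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
         / (row_tot[i] * col_tot[j] / n)
         for i in range(len(row_tot)) for j in range(len(col_tot)))
print(round(x2, 2))                             # 19.18

# df = (r - 1)(c - 1) = 6; the 1%-level critical value is 16.81.
print(x2 > 16.81)                               # True: reject H0
```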
Comparing Multinomial Populations
• Sometimes researchers design an experiment
so that the number of experimental units falling
in one set of categories is fixed in advance.
• Example: An experimenter selects 900 patients who
have been treated for flu prevention. She selects 300
from each of three types: no vaccine, one shot, and
two shots. The column totals have been fixed in
advance!

         No Vaccine   One Shot   Two Shots   Total
Flu                                            r1
No Flu                                         r2
Total       300          300        300      n = 900
Comparing Multinomial Populations
• Each of the c columns (or r rows) whose
totals have been fixed in advance is actually a
single multinomial experiment.
• The chi-square test of independence with
(r − 1)(c − 1) df is equivalent to a test of the
equality of c (or r) multinomial populations.
• Here there are three binomial populations: no
vaccine, one shot, and two shots. Is the
probability of getting the flu independent of the
type of flu prevention used?

         No Vaccine   One Shot   Two Shots   Total
Flu                                            r1
No Flu                                         r2
Total       300          300        300      n = 900
Example
Random samples of 200 voters in each of
four wards were surveyed and asked if
they favor candidate A in a local
election.

Ward             1     2     3     4    Total
Favor A          76    53    59    48     236
Do not favor A   124   147   141   152    564
Total            200   200   200   200    800

Do the data present sufficient evidence to indicate
that the fraction of voters favoring candidate A
differs in the four wards?
H0: p1 = p2 = p3 = p4 (fraction favoring A is
independent of ward), where pi = fraction
favoring A in each of the four wards
Ha: fraction favoring A depends on the ward
The Voter Problem
•Calculate the expected cell counts. For example:
Ê12 = r1c2/n = 236(200)/800 = 59
•Test statistic and rejection region:
X² = Σ (Oij − Êij)²/Êij = (76 − 59)²/59 + (53 − 59)²/59 + … + (152 − 141)²/141 = 10.722
Reject H0 if X² > χ².05 = 7.81 with (r − 1)(c − 1) = 3 df.

Chi-Square Test: 1, 2, 3, 4
Expected counts are printed below observed counts

           1        2        3        4    Total
  1       76       53       59       48      236
       59.00    59.00    59.00    59.00
  2      124      147      141      152      564
      141.00   141.00   141.00   141.00
Total    200      200      200      200      800

Chi-Sq = 4.898 + 0.610 + 0.000 + 2.051 +
         2.050 + 0.255 + 0.000 + 0.858 = 10.722
DF = 3, P-Value = 0.013

Reject H0. There is sufficient evidence to
indicate that the fraction of voters favoring
A varies from ward to ward.
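Because the four column totals were fixed at 200, the same X² can be computed as a test of equal proportions. A minimal sketch, not from the text:

```python
# Voter problem: four wards of 200 voters each; under H0 the pooled
# proportion 236/800 = .295 gives 59 expected "favor" per ward.
favor = [76, 53, 59, 48]
sampled = [200, 200, 200, 200]          # column totals fixed in advance

pooled = sum(favor) / sum(sampled)      # 0.295
x2 = 0.0
for f, m in zip(favor, sampled):
    e_favor = m * pooled                # 59 expected in favor
    e_against = m * (1 - pooled)        # 141 expected against
    x2 += (f - e_favor) ** 2 / e_favor + (m - f - e_against) ** 2 / e_against

print(round(x2, 3))                     # 10.722

# df = (r - 1)(c - 1) = 3; critical value at the .05 level is 7.81.
print(x2 > 7.81)                        # True: reject H0
```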
The Voter Problem
Since we know that there are differences
among the four wards, what is the
nature of the differences?
Look at the proportions in favor of
candidate A in the four wards.
Ward         1             2             3             4
Favor A   76/200 = .38  53/200 = .27  59/200 = .30  48/200 = .24
Candidate A is doing best in the first ward, and worst in the
fourth ward. More importantly, he does not have a majority of
the vote in any of the wards!