Example 8.1 - Rice University

Inference About
Regression Coefficients
BENDRIX.XLS

This is a continuation of the Bendrix manufacturing
example from the previous chapter.

As before, the response variable is Overhead and the
explanatory variables are MachHrs and ProdRuns.

The data are contained in this file.

What inferences can we make about the regression
coefficients?
Multiple Regression Output

We obtain the output from using StatPro’s Multiple
Regression procedure.
Multiple Regression Output -- continued

Regression coefficients estimate the true, but
unobservable, population coefficients.

The standard error of bi indicates the accuracy of
these point estimates.

For example, the effect on Overhead of a one-unit
increase in MachHrs is 43.536.

We are 95% confident that the coefficient is between
36.357 and 50.715. Similar statements can be made
for the coefficient of ProdRuns and the intercept term.
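To make the computation concrete, here is a minimal Python sketch of the same inference, assuming the Bendrix data have been exported from BENDRIX.XLS to a CSV file named bendrix.csv with columns Overhead, MachHrs, and ProdRuns (the file name is an assumption):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("bendrix.csv")                    # assumed export of BENDRIX.XLS
X = sm.add_constant(df[["MachHrs", "ProdRuns"]])   # intercept + explanatory vars
model = sm.OLS(df["Overhead"], X).fit()

# Each estimate b_i comes with a standard error; the 95% CI is
# b_i +/- t * se(b_i), which conf_int() computes directly.
print(model.params)          # point estimates (e.g. ~43.5 for MachHrs)
print(model.bse)             # standard errors of the coefficients
print(model.conf_int(0.05))  # 95% confidence intervals
```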
Multicollinearity
The Problem

We want to explain a person’s height by means of
foot length.

The response variable is Height, and the explanatory
variables are Right and Left, the length of the right
foot and the left foot, respectively.

What can occur when we regress Height on both
Right and Left?
Multicollinearity

The relationship between the explanatory variable X
and the response variable Y is not always accurately
reflected in the coefficient of X; it depends on which
other X’s are included or not included in the equation.

This is especially true when there is a linear
relationship between two or more explanatory
variables, in which case we have multicollinearity.

By definition, multicollinearity is the presence of a
fairly strong linear relationship between two or more
explanatory variables, and it can make estimation
difficult.
Solution

Admittedly, there is no need to include both Right and
Left in an equation for Height - either one would do - but we include both to make a point.

It is likely that there is a large correlation between
height and foot size, so we would expect this
regression equation to do a good job.

The R2 value will probably be large. But what about
the coefficients of Right and Left? Here there is a problem.
Solution -- continued

The coefficient of Right indicates the right foot's
effect on Height in addition to the effect of the left
foot. This additional effect is probably minimal. That
is, after the effect of Left on Height has already been
taken into account, the extra information provided by
Right is probably minimal. But it goes the other way
also: the extra effect of Left, in addition to that
provided by Right, is probably minimal.
HEIGHT.XLS

To show what can happen numerically, we generated
a hypothetical data set of heights and left and right
foot lengths in this file.

We did this so that, except for random error, height is
approximately 32 plus 3.2 times foot length (all
expressed in inches).

As shown in the table to the right, the correlations
between Height and either Right or Left in our data set
are quite large, and the correlation between Right
and Left is very close to 1.
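The experiment is easy to reproduce. Below is a hedged Python sketch that generates hypothetical data of the same flavor - height roughly 32 plus 3.2 times foot length, with Right and Left nearly identical - and shows the coefficient instability. The seed and noise levels are illustrative, not the book's:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
right = rng.normal(11.0, 1.0, n)            # right foot length, inches
left = right + rng.normal(0.0, 0.1, n)      # left ~ right, correlation near 1
height = 32 + 3.2 * right + rng.normal(0.0, 2.0, n)

# Both feet in the equation: decent R^2, but the individual coefficients
# and their t-values are unreliable because Right and Left are collinear.
both = sm.OLS(height, sm.add_constant(np.column_stack([right, left]))).fit()
print(both.rsquared, both.params, both.pvalues)

# Each foot alone: the slope comes out near 3.2 and is highly
# significant, which is far easier to interpret.
for foot in (right, left):
    alone = sm.OLS(height, sm.add_constant(foot)).fit()
    print(alone.params, alone.pvalues)
```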
Solution -- continued

The regression output when both Right and Left are
entered in the equation for Height appears in this
table.
Solution -- continued

This output tells a somewhat confusing story.

The multiple R and the corresponding R2 are about
what we would expect, given the correlations
between Height and either Right or Left.

In particular, the multiple R is close to the correlation
between Height and either Right or Left. Also, the se
value is quite good. It implies that predictions of
height from this regression equation will typically be
off by only about 2 inches.
Solution -- continued

However, the coefficients of Right and Left are not at
all what we might expect, given that we generated
heights as approximately 32 plus 3.2 times foot
length.

In fact, the coefficient of Left has the wrong sign - it is
negative!

Besides this wrong sign, the tip-off that there is a
problem is that the t-value of Left is quite small and
the corresponding p-value is quite large.
Solution -- continued

Judging by this, we might conclude that Height and
Left are either not related or are related negatively.
But we know from the table of correlations that both
of these conclusions are false.

In contrast, the coefficient of Right has the “correct”
sign, and its t-value and associated p-value do imply
statistical significance, at least at the 5% level.

However, this happened mostly by chance; slight
changes in the data could change the results
completely.
Solution -- continued

The problem is that although both Right and Left are
clearly related to Height, it is impossible for the least
squares method to distinguish their separate effects.

Note that the regression equation does estimate the
combined effect fairly well: the sum of the coefficients
is 3.178, which is close to the 3.2 we used to
generate the data.

Therefore, the estimated equation will work well for
predicting heights. It just does not have reliable
estimates of the individual coefficients of Right and
Left.
Solution -- continued

To see what happens when either Right or Left is
excluded from the regression equation, we show the
results of the simple regressions.

When Right is the only variable in the equation, it
becomes
Predicted Height = 31.546 + 3.195Right

The R2 and se values are 81.6% and 2.005, and the
t-value and p-value for the coefficient of Right are now
21.34 and 0.000 - very significant.
Solution -- continued

Similarly, when Left is the only variable in the
equation, it becomes
Predicted Height = 31.526 + 3.197Left

The R2 and se values are 81.1% and 2.033, and the
t-value and p-value for the coefficient of Left are 20.99
and 0.000 - again very significant.

Clearly, both of these equations tell almost identical
stories, and they are much easier to interpret than the
equation with both Right and Left included.
Include/Exclude
Decisions
CATALOGS1.XLS

This file contains data on 1000 customers who
purchased mail-order products from the HyTex
Company in 1998.

Recall from Example 3.11 that HyTex is a direct
marketer of stereo equipment, personal computers,
and other electronic products.

HyTex advertises entirely by mailing catalogs to its
customers, and all of its orders are taken over the
telephone.

We want to estimate and interpret a regression
equation for Spent98 based on all of these variables.
The Data

The company spends a great deal of money on its
catalog mailings, and it wants to be sure that this is
paying off in sales.

For each customer there are data on the following
variables:
– Age in years.
– Gender: coded as 1 for males, 0 for females
– OwnHome: coded as 1 if customer owns a home, 0
otherwise
– Married: coded as 1 if customer is currently married, 0
otherwise
The Data -- continued
– Close: coded as 1 if the customer lives reasonably close to a
shopping area that sells similar merchandise, 0 otherwise
– Salary: combined annual salary of customer and spouse (if
any)
– Children: number of children living with customer
– Customer97: coded as 1 if the customer purchased from
HyTex during 1997, 0 otherwise
– Spent97: total amount of purchases made from HyTex during
1997
– Catalogs: number of catalogs sent to the customer in 1998
– Spent98: total amount of purchases made from HyTex during
1998
The Data -- continued

With this much data, 1000 observations, we can
certainly afford to set aside part of the data set for
validation.

Although any split could be used, let’s base the
regression on the first 250 observations and use the
other 750 for validation.
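A minimal sketch of this fit/validate split in Python, assuming the data have been exported to catalogs.csv with the columns listed above (all numeric codes; the file name is an assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("catalogs.csv")           # assumed export of CATALOGS1.XLS
fit, hold = df.iloc[:250], df.iloc[250:]   # estimate on 250, validate on 750

model = sm.OLS(fit["Spent98"],
               sm.add_constant(fit.drop(columns="Spent98"))).fit()

# Score the held-out customers with the fitted equation.
pred = model.predict(sm.add_constant(hold.drop(columns="Spent98")))
resid = hold["Spent98"] - pred
ss_tot = ((hold["Spent98"] - hold["Spent98"].mean()) ** 2).sum()
r2_holdout = 1 - (resid ** 2).sum() / ss_tot
se_holdout = np.sqrt((resid ** 2).mean())  # rough holdout analogue of se
print(r2_holdout, se_holdout)
```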
The Regression

We begin by entering all of the potential explanatory
variables.

Our goal then is to exclude variables that aren't
necessary, based on their t-values and p-values. To
do this we follow the Guidelines for Including /
Excluding Variables in a Regression Equation.

The regression output with all explanatory variables
included is provided on the following slide.
Analysis

This output indicates a fairly good fit. The R2 value is
79.1% and se is about $424.

From the p-value column, we see that there are three
variables, Age, OwnHome, and Married, that have
p-values well above 0.05.

These are the obvious candidates for exclusion. It is
often best to exclude one variable at a time, starting
with the variable with the highest p-value.

The regression output with all insignificant variables
excluded is seen in the output on the next slide.
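A sketch of that one-at-a-time exclusion rule in Python; the 0.05 cutoff follows the guidelines mentioned above, and the column names are whatever the data file provides:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, response, alpha=0.05):
    """Drop the highest-p-value variable, refit, repeat."""
    predictors = [c for c in df.columns if c != response]
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[response], X).fit()
        pvals = model.pvalues.drop("const")   # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return model                      # all remaining terms significant
        predictors.remove(worst)              # exclude one variable at a time
    return None

# model = backward_eliminate(pd.read_csv("catalogs.csv"), "Spent98")
```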
Interpretation of Final Regression
Equation

The coefficient of Gender implies that an average
male customer spent about $130 less than the
average female customer. Similarly, an average
customer living close to stores with this type of
merchandise spent about $288 less than those
customers living far from stores.

The coefficient of Salary implies that, on average,
about 1.5 cents of every salary dollar was spent on
HyTex merchandise.
Interpretation of Final Regression
Equation -- continued

The coefficient of Children implies that $158 less was
spent for every extra child living at home.

The Customer97 and Spent97 terms are somewhat
more difficult to interpret.
– First, both of these terms are 0 for customers who didn't
purchase from HyTex in 1997.
– For those who did, the terms become -724 + 0.47Spent97.
– The coefficient 0.47 implies that each extra dollar spent in
1997 can be expected to contribute an extra 47 cents in
1998.
Interpretation of Final Regression
Equation -- continued
– The median spender in 1997 spent about $900. So if we
substitute this for Spent97, we obtain -724 + 0.47(900) = -301.
– Therefore, this "median" spender from 1997 can be expected
to spend about $301 less in 1998 than the 1997 nonspender.

The coefficient of Catalogs implies that each extra
catalog can be expected to generate about $43 in
extra spending.
Cautionary Notes

When we validate this final regression equation with
the 750 customers, using the procedure from Section
11.7, we find R2 and se values of 75.7% and $485.

These aren’t bad. They show little deterioration from
the values based on the original 250 customers.

We haven’t tried all possibilities yet. We haven’t tried
nonlinear or interaction variables, nor have we looked
at different coding schemes; we haven’t checked for
nonconstant error variance or looked at potential
effects of outliers.
The Partial F Test
BANK.XLS

Recall from Example 11.3 that the Fifth National Bank
has 208 employees.

The data for these employees are stored in this file.

In the previous chapter we ran several regressions
for Salary to see whether there is convincing
evidence of salary discrimination against females.

We will continue this analysis here.
Analysis Overview

First, we will regress Salary versus the Female
dummy, YrsExper, and the interaction between
Female and YrsExper, labeled Fem_YrsExper. This
will be the reduced equation.

Then we’ll see whether the JobGrade dummies
Job_2 to Job_6 add anything significant to the
reduced equation. If so, we will then see whether the
interactions between the Female dummy and the
JobGrade dummies, labeled Fem_Job2 to
Fem_Job6, add anything significant to what we
already have.
Analysis Overview -- continued

If so, we’ll finally see whether the education dummies
Ed_2 to Ed_5 add anything significant to what we
already have.
Solution

First, note that we created all of the dummies and
interaction variables with StatPro’s Data Utilities
procedures.

Also, note that we have used three sets of dummies,
for gender, job grade, and education level.

When we use these in a regression equation, the
dummy for one category of each should always be
excluded; it is the reference category. The reference
categories we have used are “male”, job grade 1 and
education level 1.
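For readers working outside StatPro, here is a hedged pandas sketch of the same preparation; the raw column names (Gender, JobGrade, EdLev) are assumptions about the file layout, and drop_first=True is what excludes the reference category of each dummy set:

```python
import pandas as pd

df = pd.read_csv("bank.csv")   # assumed export of BANK.XLS
df["Female"] = (df["Gender"] == "Female").astype(int)   # male = reference

# One dummy per category except the reference (job grade 1, ed level 1).
dummies = pd.get_dummies(df[["JobGrade", "EdLev"]].astype("category"),
                         prefix=["Job", "Ed"], drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)

# Interactions: Female x YrsExper and Female x each job grade dummy.
df["Fem_YrsExper"] = df["Female"] * df["YrsExper"]
for col in [c for c in dummies.columns if c.startswith("Job")]:
    df["Fem_" + col.replace("_", "")] = df["Female"] * df[col]  # Fem_Job2, ...
```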
Solution -- continued

The output for the “smallest” equation using Female,
YrsExper, and Fem_YrsExper as explanatory
variables is shown here.
Solution -- continued

We’re off to a good start. These three variables
already explain 63.9% of the variation of Salary.

The output for the next equation, which adds the
explanatory variables Job_2 to Job_6, is on the next
slide.

This equation appears much better. For example, R2
has increased to 81.1%. We check whether it is
significantly better with the partial F test in rows 26-30.
Solution -- continued

The degrees of freedom in cell C28 is the same as
the value in cell C12, the degrees of freedom for
SSE.

Then we calculate the F-ratio in cell C29 with the
formula =((Reduced!D12-D12)/C27)/E12, where
Reduced!D12 refers to SSE for the reduced equation
from the Reduced sheet.

Finally, we calculate the corresponding p-value in cell
C30 with the formula =FDIST(C29,C27,C28). It is
practically 0, so there is no doubt that the job grade
dummies add significantly to the explanatory power
of the equation.
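The same test is easy to compute outside the spreadsheet. A minimal Python sketch, mirroring the cell formulas above (SSE in D12, number of extra variables in C27, MSE in E12):

```python
from scipy import stats

def partial_f(sse_reduced, sse_complete, k_extra, df_complete):
    """Partial F test: do k_extra added variables improve the fit?"""
    mse_complete = sse_complete / df_complete          # E12 on the slide
    f = ((sse_reduced - sse_complete) / k_extra) / mse_complete
    p = stats.f.sf(f, k_extra, df_complete)            # Excel's FDIST
    return f, p

# e.g. the five job grade dummies: partial_f(sse_red, sse_comp, 5, df_comp)
```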
Solution -- continued

Do the interactions between the Female dummy and
the job dummies add anything more?

We again use the partial F test, but now the previous
complete equation becomes the new reduced
equation, and the equation that includes the new
interaction terms becomes the new complete equation.

The output for this new complete equation is shown
on the next slide.

We perform the partial F test in rows 31-35 exactly
as before. The formula in C34 is =((Complete!D12-D12)/C32)/E12.
Solution -- continued

Again the p-value is extremely small, so there is no
doubt that the interaction terms add significantly to
what we already had.

Finally, we add the education dummies.

The resulting output is shown on the next slide. Here
we see how the terms reduced and complete are
relative.

This output now corresponds to the complete
equation, and the previous output corresponds to the
reduced equation.
Solution -- continued

The formula in cell C38 for the F-ratio is now
=((MoreComplete!D12-D12)/C36)/E12. The R2 value
increased from 84.0% to 84.7%. Also, the p-value is
not extremely small.

According to the partial F test, this increase is not quite
enough to qualify for statistical significance at the 5% level.

Based on this evidence, there is not much to gain
from including the education dummies in the
equation, so we would probably elect to exclude
them.
Concluding Comments

First, the partial F test is the formal test of significance
for an extra set of variables. Many users look only at
the R2 and/or se values to check whether the extra
variables are doing a "good job".

Second, if the partial F test shows that a block of
variables is significant, it does not imply that each
variable in this block is significant. Some of these
variables can have low t-values.
Concluding Comments -- continued

Third, producing all of these outputs and doing the
partial F tests is a lot of work. Therefore, we included
a "Block" option in StatPro to make life easier. To run
the analysis in this example, use the StatPro/Regression
analysis/Block menu item. After selecting Salary as
the response variable, we see this dialog box.
Concluding Comments -- continued

We want four blocks of explanatory variables, and we
want a given block to enter only if it passes the partial
F test at the 5% level. In later dialog boxes we
specify the explanatory variables. Once we have
specified all this, the regression calculations are done
in stages. The output appears on the next two slides.
Note that the output for Block 4 has been left off
because it did not pass the F test at the 5% level.
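Conceptually, the Block option does something like the following sketch; this is an illustration of the idea, not StatPro's actual code, and the block contents are placeholders:

```python
import statsmodels.api as sm
from scipy import stats

def enter_blocks(df, response, blocks, alpha=0.05):
    """Add each block in order; keep it only if the partial F test passes."""
    kept, current = [], None
    for block in blocks:
        trial = kept + block
        model = sm.OLS(df[response], sm.add_constant(df[trial])).fit()
        if current is None:                     # first block always enters
            kept, current = trial, model
            continue
        f = ((current.ssr - model.ssr) / len(block)) / (model.ssr / model.df_resid)
        if stats.f.sf(f, len(block), model.df_resid) < alpha:
            kept, current = trial, model        # block passes at the 5% level
    return current

# blocks = [block1_cols, block2_cols, block3_cols, block4_cols]
```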
Concluding Comments -- continued

Finally, we have concentrated on the partial F test
and statistical significance in this example. We don't
want you to lose sight, however, of the bigger picture.
Once we have decided on a "final" regression
equation, we need to analyze its implications for the
problem at hand.

In this case the bank is interested in possible salary
discrimination against females, so we should interpret
this final equation in those terms. Our point is simply
that you shouldn't get so caught up in the details of
statistical significance that you lose sight of the
original purpose of the analysis!
Outliers
Questions

Of the 208 employees at Fifth National Bank, are
there any obvious outliers?

In what sense are they outliers?

Does it matter to the regression results, particularly
those concerning gender discrimination, whether the
outliers are removed?
BANK.XLS

There are several places we could look for outliers.

An obvious place is the Salary variable.

The boxplot shown here indicates that several
employees make substantially more in salary than
most of the other employees.
Solution

We could consider these outliers and remove them,
arguing perhaps that these are senior managers who
shouldn’t be included in the discrimination analysis.

We leave it to you to check whether the regression
results are any different with these high-salary
employees than without them.

Another place to look is at the scatterplot of the
residuals versus the fitted values. This type of plot
shows points with abnormally large residuals.
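A small Python sketch of this residual check, assuming a prepared file bank_prepared.csv (a hypothetical name) that already contains the dummy and interaction columns built earlier:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("bank_prepared.csv")   # hypothetical file with dummies added
cols = ["Female", "YrsExper", "Fem_YrsExper",
        "Job_2", "Job_3", "Job_4", "Job_5", "Job_6"]
model = sm.OLS(df["Salary"], sm.add_constant(df[cols])).fit()

se = np.sqrt(model.ssr / model.df_resid)   # standard error of estimate
flag = model.resid.abs() > 3 * se          # more than 3 standard errors from 0
print(df[flag])                            # candidate outliers to inspect
```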
Solution -- continued

For example, we ran the regression with Female,
YrsExper, Fem_YrsExper, and the five job grade
dummies, and we obtained the output and scatterplot
shown here.
Solution -- continued

This scatterplot has several points that could be
considered outliers, but we focus on the point
identified in the figure.

The residual for this point is approximately -21.

Given that the se for this regression is approximately 5,
this residual is over four standard errors below 0 - quite a lot.

This employee turns out to be unusual, and special
circumstances can explain the discrepancy.
Solution -- continued

If we delete this employee and rerun the regression
with the same variables, we obtain the output shown
here.
Solution -- continued

Now, recalling that gender discrimination is the key
issue in this example, we compare the coefficients of
Female and Fem_YrsExper in the two outputs.

The coefficient of Female has dropped from 6.063 to
4.353. In words, the Y-intercept for the female
regression line used to be about $6000 higher than
for the male line; now it is only about $4350 higher.

More importantly, the coefficient of Fem_YrsExper
has changed from -1.021 to -0.721. This coefficient
indicates how much less steep the female line for
Salary versus YrsExper is than the male line.
Solution -- continued

So a change from -1.021 to -0.721 indicates less
discrimination against females now than before. In
other words, this unusual female employee accounts
for a good bit of the discrimination argument -
although a strong argument still exists even without her.
Prediction
Questions

Consider the following three male employees at Fifth
National:
– Employee 5 makes $29,000, is in job grade 1, and has 3
years of experience at the bank.
– Employee 156 makes $45,000, is in job grade 4, and has 6
years of experience at the bank.
– Employee 198 makes $60,000, is in job grade 6, and has 12
years of experience at the bank.

Using a regression equation for Salary that includes
the explanatory variables Female, YrsExper,
Fem_YrsExper, and the job grade dummies Job_2 to
Job_6, check that the predicted salaries for these
three employees are close to their actual salaries.
Questions -- continued

Then predict the salaries these employees would
obtain if they were females.

How large are the discrepancies?

When estimating the equation, exclude the last
employee, employee 208, whom we diagnosed as an
outlier in Section 14.9.
Solution

The analysis appears on the next slide.

The top part includes the variables we need for this
analysis.

Note how employee 208 has been separated from
the rest of the data, so that she is not included in the
regression analysis.

The usual regression output is not shown, but the
standard error of estimate and estimated coefficients
have been copied to cell B216 and the range
B218:J218.
Solution -- continued

The values for male employees 5, 156, and 198 have
been copied to the range B222:B224.

We can then substitute their values into the
regression equation to obtain their predicted salaries
in column A.

The formula in cell A222, for example, is
=$B$218+SUMPRODUCT($C$218:$J$218,C222:J222)

Clearly the predictions are quite good for these three
employees. The worst prediction is off by less than
$2000.
Solution -- continued

To see what would happen if these employees were
females, we need to adjust the values of the
explanatory variables Female and Fem_YrsExper.

For each employee in rows 227-229, the value of
Female becomes 1 and the value of Fem_YrsExper
becomes the same as YrsExper. Copying the
formula in A222 down to these rows gives the
predicted salaries for the females.

One way to compare females to males is to enter the
formula =(A227-B222)/$B$216 in cell B227 and copy
it down.
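In Python, the same prediction and comparison are one dot product; the function below is a sketch, with the coefficient and value arrays to be filled in from the regression output:

```python
import numpy as np

def predict_salary(intercept, coefs, values):
    """Equivalent of =B218 + SUMPRODUCT(C218:J218, C222:J222)."""
    return intercept + np.dot(coefs, values)

# To ask "what if this employee were female": set Female to 1 and
# Fem_YrsExper equal to YrsExper in `values`, re-predict, and compare
# (predicted_female - actual_male_salary) / se, the gap in standard errors.
```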
Solution -- continued

This is the number of standard errors the predicted
female salary is above (if positive) or below (if
negative) the actual male salary.

As we discussed earlier with this data set, females
with only a few years of experience actually tend to
make more than males. But the opposite occurs for
employees with many years of experience.

For example, male employee 198 is earning just
about what the regression equation predicts he should
earn. But if he were female, we would predict a salary
about $4500 below the male's - almost a full standard
error lower.
Prediction
The Problem

Besides the 50 regions in the data set, Pharmex
does business in five other regions, which have
promotional expense indexes of 114, 98, 84, 122,
and 101.

Find the predicted Sales and a 95% prediction
interval for each of these regions.

Also, find the mean Sales for all regions with each of
these values of Promote, along with a 95% confidence
interval for each of these means.
PHARMEX.XLS

This example cannot be solved with StatPro, but it is
relatively easy with Excel's built-in functions.

We illustrate the procedure in this file, shown on the
next slide.

The original data appear in columns B and C. We use
the range names SalesOld and PromoteOld for the
data in these columns.

The new regions appear in rows 9-13. Their given
values of Promote are in the range G9:G13, which
we name PromoteNew.
Solution

To obtain the predicted sales for these regions, we
use Excel’s TREND function by highlighting the range
H9:H13, typing the formula
=TREND(SalesOld,PromoteOld,PromoteNew)
and pressing Ctrl-Shift-Enter.

This substitutes the new values of the explanatory
variable (in the third argument) into the regression
equation based on the data from the first two
arguments.
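For reference, TREND is ordinary least squares followed by evaluation at the new points; a minimal NumPy sketch of the same computation:

```python
import numpy as np

def trend(sales_old, promote_old, promote_new):
    """Least-squares fit of Sales on Promote, evaluated at new values."""
    slope, intercept = np.polyfit(promote_old, sales_old, 1)
    return intercept + slope * np.asarray(promote_new)

# e.g. trend(sales_old, promote_old, [114, 98, 84, 122, 101])
```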
Solution -- continued

We can then use these same predictions in rows 19-23 for the mean sales values.

For example, we predict the same Sales value of
112.03 for a single region with Promote equal to 114
and for the mean of all regions with this value of Promote.

The approximate standard error of prediction for any
individual value is se, calculated in
cell H6 with the formula
=STEYX(SalesOld,PromoteOld)
Solution -- continued

The more exact standard error of prediction depends
on the value of Promote. To calculate it we enter the
formula
=$H$6*SQRT(1+1/50+(G9-AVERAGE(PromoteOld))^2
/(49*STDEV(PromoteOld)^2))
in cell I9 and copy it down through cell I13.

We then calculate the lower and upper limits of the 95%
prediction intervals in columns J and K. These use
the t-multiple in cell I3, obtained with the formula
=TINV(0.05,48).
Solution -- continued

The formulas in cells J9 and K9 are then =H9-$I$3*I9
and =H9+$I$3*I9, which we copy down to row 13.

The calculations for the mean predictions in rows 19-23
are almost identical. The only difference is that the
approximate standard error is se divided by the
square root of 50, calculated in cell H16.

The more exact standard errors in column I are then
calculated by entering the formula
=$H$6*SQRT(1/50+(G9-AVERAGE(PromoteOld))^2
/(49*STDEV(PromoteOld)^2))
in cell I19 and copying it down.
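The two exact standard-error formulas are worth seeing side by side. A Python sketch, with n = 50 regions and x as the array of Promote values:

```python
import numpy as np

def se_individual(se, x0, x):
    """Exact standard error for predicting one new value at x0."""
    n = len(x)
    spread = (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))
    return se * np.sqrt(1 + 1 / n + spread)

def se_mean(se, x0, x):
    """Exact standard error for estimating the mean response at x0."""
    n = len(x)
    spread = (x0 - x.mean()) ** 2 / ((n - 1) * x.var(ddof=1))
    return se * np.sqrt(1 / n + spread)

# 95% limits: prediction +/- t multiple times the standard error, where
# the t multiple is scipy.stats.t.ppf(0.975, n - 2), Excel's TINV(0.05, n-2).
```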
Conclusions

We have gone through these rather tedious
calculations to make several points.
– First, the approximate standard errors se and se divided by the
square root of n are usually quite accurate. This is fortunate because the exact
standard errors are difficult to calculate and are not always
given in statistical software packages.
– Second, a simple rule of thumb for calculating individual 95%
prediction intervals is to go out an amount 2se on either side
of the predicted value. Again, this is not exactly correct, but
as the calculations in this example indicate, it works quite well.
Conclusions -- continued

Finally, we see from the wide prediction intervals how
much uncertainty remains.

The reason is the relatively large standard error of
estimate, se.

Contrary to what you might believe, this is not a
sample-size problem.

The whole problem is that Promote is not highly
correlated with Sales. The only way to decrease se
and get more accurate predictions is to find other
explanatory variables that are more closely related to
Sales.