Transcript How to DO It - Kellogg School of Management
Regression Analysis
• The Motorpool Example
– Looking just at two-dimensional shadows, we don’t see the true effects of the variables.
– We need a way to look at all the dimensions of a relationship at the same time.
Course website: http://www.kellogg.northwestern.edu/faculty/weber/kg01
The Regression Machine
Relevant sample data (no missing data) → the regression model (terminology, structure, and assumptions) → regression results (today’s focus)
The Regression Model
Costs = α + β1 · Mileage + β2 · Age + β3 · Make + ε
– Costs is the dependent variable.
– α, β1, β2, and β3 are the coefficients.
– Mileage, Age, and Make are the explanatory variables (or independent variables).
– ε is the residual term.
– The right-hand side has a linear mathematical structure.
What we’re talking about is sometimes explicitly called “linear regression analysis,” since it assumes that the underlying relationship is linear (i.e., a straight line in two dimensions, a plane in three, and so on)!
Why Spend All This Time on such a Limited Tool?
• Some interesting relationships are linear.
• All relationships are locally linear!
• Several of the most commonly encountered nonlinear relationships in management can be translated into linear relationships, studied using regression analysis, and the results then translated back to the original problem! (This is part of what we’ll learn in Days 3 through 5.)
A Few Final Assumptions Concerning ε
• The validity of regression analysis depends on several assumptions concerning the residual term.
– E[ε] = 0 . This is purely a cosmetic assumption. The estimate of α will include any on-average residual effects which are different from zero.
– ε varies normally across the population. While a substantive assumption, this is typically true, due to the Central Limit Theorem, since the residual term is the total of a myriad of other, unidentified explanatory variables. If this assumption is not correct, all statements regarding confidence intervals for individual predictions might be invalid.
• The following additional assumptions will be discussed later in the course.
– StdDev[ε] does not vary with the values of the explanatory variables. (This is called the homoskedasticity assumption.) Again, if this assumption is not correct, all statements regarding confidence intervals for individual predictions might be invalid.
– ε is uncorrelated with the explanatory variables of the model. The regression analysis will “attribute” as much of the variation in the dependent variable as it can to the explanatory variables. If some unidentified factor covaries with one of the explanatory variables, the estimate of that explanatory variable’s coefficient (i.e., the estimate of its effect in the relationship) will suffer from “specification bias,” since the explanatory variable will have both its own effect, and some of the effect of the unidentified variable, attributed to it. This is why, when doing a regression for the purpose of estimating the effect of some explanatory variable on the dependent variable, we try to work with the most “complete” model possible.
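The specification-bias point can be seen in a small simulation. This is only a sketch on made-up data (the effects 2.0 and 3.0, the 0.5 link between x1 and x2, and the helper name `simple_slope` are all assumptions for illustration, not the motor-pool data): when a factor that covaries with x1 is left out of the model, the estimated coefficient of x1 picks up part of that factor's effect.

```python
import random

def simple_slope(xs, ys):
    """Least-squares slope from regressing ys onto xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: y truly depends on x1 (effect 2.0) and on a
# second factor x2 (effect 3.0) that covaries with x1.
x1 = [rng.gauss(0, 1) for _ in range(5000)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
y = [2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Leaving x2 out of the model pushes its effect into the residual, which is
# then correlated with x1, so the slope on x1 absorbs part of x2's effect.
biased_slope = simple_slope(x1, y)
print(round(biased_slope, 2))  # lands near 2.0 + 3.0 * 0.5 = 3.5, not near 2.0
```

With x2 included in the model, the coefficient of x1 would be estimated near its own effect of 2.0, which is the argument for working with the most complete model available.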
1. Predictions
• Given an individual, and some information about that individual, predict what the dependent variable will be.
– What annual maintenance and repair cost (Costs) would you predict for a new (Age = 0) Honda (Make = 1) driven 15,000 miles (Mileage = 15)?
• Regress the dependent variable onto the given variables to get the “prediction equation”. Then make the prediction.
Predictions
The prediction equation: Costs pred = 107.34 + 29.65 · Mileage + 73.96 · Age + 47.43 · Make ( + 0 )
1.1 A Prediction for an Individual
Costs pred = 107.34 + 29.65 · 15 + 73.96 · 0 + 47.43 · 1 = $599.49
The margin of error in the prediction (at the 95%-confidence level) is 2.2010 · $55.75 = $122.70, and so a 95%-confidence interval for the prediction is $599.49 ± $122.70.
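The arithmetic behind the individual prediction can be retraced in a few lines. This is a sketch using the rounded coefficients printed in the prediction equation, so the result comes out a few cents away from the $599.49 that the regression output reports from unrounded coefficients; the function name `predict_costs` is just an illustrative choice.

```python
# Rounded coefficients from the prediction equation above.
INTERCEPT, B_MILEAGE, B_AGE, B_MAKE = 107.34, 29.65, 73.96, 47.43

def predict_costs(mileage, age, make):
    """Predicted annual Costs; mileage is in thousands of miles, make = 1 for a Honda."""
    return INTERCEPT + B_MILEAGE * mileage + B_AGE * age + B_MAKE * make

T_CRITICAL = 2.2010     # 95%-confidence multiplier quoted in the text
SE_PREDICTION = 55.75   # standard error of the prediction, from the text

prediction = predict_costs(mileage=15, age=0, make=1)
margin = T_CRITICAL * SE_PREDICTION

print(f"prediction: about {prediction:.2f}")
print(f"margin of error: about {margin:.2f}")
print(f"95% interval: about ({prediction - margin:.2f}, {prediction + margin:.2f})")
```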
1.2 Prediction: The Estimated Mean for a Subgroup of Similar Individuals
• Estimate the mean annual costs for new Hondas driven 15,000 miles.
– The estimate for the group is what we’d predict for any one member of the group. The margin of error in the estimate is computed using the standard error of the estimated mean.
$599.49 ± 2.2010 · $26.67
$599.49 ± $58.69
1.3 Sources of Error
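Y = α + β1 · X1 + … + βk · Xk + ε
Y pred = a + b1 · X1 + … + bk · Xk + 0
– The standard error of the estimated mean covers the error made in using a + b1 · X1 + … + bk · Xk to estimate α + β1 · X1 + … + βk · Xk.
– The standard error of the regression, StdDev(ε), covers the error made in using 0 to predict ε.
– (standard error of the prediction)² = (standard error of the estimated mean)² + (standard error of the regression)²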
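The two sources of error can be tied back to the numbers already quoted. The sketch below assumes the usual relationship in which the squared standard error of the prediction is the sum of the squared standard error of the estimated mean and the squared standard error of the regression. The text does not quote a value for the standard error of the regression, so the one computed here is only an implied figure.

```python
import math

SE_PREDICTION = 55.75      # standard error of the prediction (from the text)
SE_ESTIMATED_MEAN = 26.67  # standard error of the estimated mean (from the text)
T_CRITICAL = 2.2010        # 95%-confidence multiplier (from the text)

# Implied standard error of the regression, assuming
# se_prediction**2 == se_estimated_mean**2 + se_regression**2.
se_regression = math.sqrt(SE_PREDICTION ** 2 - SE_ESTIMATED_MEAN ** 2)
print(f"implied standard error of the regression: about {se_regression:.2f}")

# The residual term affects a prediction for one car but averages out of a
# group mean, which is why the two margins of error differ so much.
margin_individual = T_CRITICAL * SE_PREDICTION
margin_group_mean = T_CRITICAL * SE_ESTIMATED_MEAN
print(f"margin for one car: about {margin_individual:.2f}")
print(f"margin for the group mean: about {margin_group_mean:.2f}")
```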
The Standard Error of the Regression
• Using the prediction equation, we make a prediction for each sample observation.
• The difference between the prediction and the actual value of the dependent variable (i.e., the error) is an estimate of that individual’s residual.
• StdDev(ε) is estimated from these.
Indeed, the regression “process” simply finds the coefficient estimates which minimize the standard error of the regression (or equivalently, which minimize the sum of the squared residuals)!
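To make the "minimize the sum of the squared residuals" idea concrete, here is a sketch with a single explanatory variable and a small made-up sample (these six points are hypothetical, not the motor-pool data). It computes the least-squares estimates from the textbook closed-form formulas, checks that nudging them in any direction only raises the sum of squared residuals, and then estimates the standard error of the regression from the residuals.

```python
import math

# Hypothetical sample: x in thousands of miles, y in dollars.
xs = [5.0, 8.0, 11.0, 14.0, 17.0, 20.0]
ys = [250.0, 340.0, 395.0, 520.0, 575.0, 690.0]

def sum_squared_residuals(a, b):
    """Sum of squared errors when predicting y as a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least-squares estimates for one explanatory variable.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

best = sum_squared_residuals(a, b)

# Moving either estimate away from the least-squares values raises the total.
for da, db in [(5.0, 0.0), (-5.0, 0.0), (0.0, 0.5), (0.0, -0.5), (3.0, -0.2)]:
    assert sum_squared_residuals(a + da, b + db) > best

# Standard error of the regression: residual variation per degree of freedom
# (n observations minus the 2 estimated coefficients, a and b).
se_regression = math.sqrt(best / (n - 2))
print(f"a = {a:.2f}, b = {b:.2f}, standard error of the regression = {se_regression:.2f}")
```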
A Brief Digression
• What annual maintenance and repair cost (Costs) would you predict for a Honda (Make=1) driven 15,000 miles (Mileage = 15)?
– Regress Costs onto just Mileage and Make.
Costs pred = $678.00 .
A Brief Digression
• The prediction made using the reduced model is precisely what we would get if we predicted Age from Mileage and Make, and then Costs from all 3!
Costs pred = $678.00 .
The reason we don’t take this latter approach is that the standard error of the prediction here is based on the assumption that the age of the car is precisely 1.061546 years, instead of actually being unknown.
Still, it’s reassuring to see that the numbers all fit together.
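The way the numbers fit together can be checked directly. The sketch below plugs the predicted Age of 1.061546 years into the full-model prediction equation. Because it uses the rounded coefficients printed earlier, it reproduces the $678.00 from the reduced model only to within a few cents.

```python
# Rounded coefficients from the full-model prediction equation.
INTERCEPT, B_MILEAGE, B_AGE, B_MAKE = 107.34, 29.65, 73.96, 47.43

# Age predicted from Mileage = 15 and Make = 1, as quoted in the text.
PREDICTED_AGE = 1.061546

# Two-step approach: predict Age first, then Costs from all three variables.
two_step_prediction = (
    INTERCEPT + B_MILEAGE * 15 + B_AGE * PREDICTED_AGE + B_MAKE * 1
)

REDUCED_MODEL_PREDICTION = 678.00  # from regressing Costs onto Mileage and Make only

print(f"two-step prediction: about {two_step_prediction:.2f}")
print(f"gap from reduced model: about {abs(two_step_prediction - REDUCED_MODEL_PREDICTION):.2f}")
```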
2. Estimating an Effect
• An additional thousand miles of driving in the course of a year adds, on average, how much to the year’s maintenance and repair costs?
– It is ESSENTIAL to note that the additional driving changes neither the car’s Age, nor its Make. In order to hold them constant while varying Mileage, we need to work with a model including ALL of the explanatory variables.
Estimating an Effect
The coefficient of Mileage in the most-complete model is our estimate of the impact of a one-unit (1,000-mile) change in Mileage. That coefficient is $29.65 per thousand miles. (That is, 29.65 units of the dependent variable per unit of the explanatory variable.) The predictions below examine the impact of an additional thousand miles of driving for a two-year-old Ford and two two-year-old Hondas. In each comparison, the prediction is $29.65 greater for the car driven an additional 1,000 miles.
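A quick way to see that the Mileage effect does not depend on the other variables is to compare predictions for pairs of cars that differ only by 1,000 miles. In this sketch the starting mileages are made up for illustration, and coding a Ford as Make = 0 is an assumption (the text only states that a Honda is Make = 1).

```python
# Rounded coefficients from the full-model prediction equation.
INTERCEPT, B_MILEAGE, B_AGE, B_MAKE = 107.34, 29.65, 73.96, 47.43

def predict_costs(mileage, age, make):
    """Predicted annual Costs; mileage is in thousands of miles."""
    return INTERCEPT + B_MILEAGE * mileage + B_AGE * age + B_MAKE * make

# (label, mileage, age, make); mileages are hypothetical, Ford = 0 is assumed.
cars = [
    ("two-year-old Ford", 10, 2, 0),
    ("two-year-old Honda", 10, 2, 1),
    ("two-year-old Honda", 20, 2, 1),
]

differences = []
for label, mileage, age, make in cars:
    base = predict_costs(mileage, age, make)
    plus_thousand = predict_costs(mileage + 1, age, make)
    differences.append(plus_thousand - base)
    print(f"{label} at {mileage},000 miles: {base:.2f} -> {plus_thousand:.2f}")

# Age and Make are held constant in each pair, so every difference equals
# the Mileage coefficient.
print([round(d, 2) for d in differences])
```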
Estimating an Effect
Each coefficient is an estimate of the “true” coefficient, and is subject to sampling error.
One standard-deviation’s-worth of uncertainty in the estimate is given by the standard error of the coefficient. For example, a 95%-confidence interval for the coefficient of Mileage in the full model is 29.65 ± 2.2010 · 3.92
29.65 ± 8.62
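The interval for the Mileage coefficient is the same "estimate ± multiplier · standard error" arithmetic. The sketch below uses the rounded standard error of 3.92, so the margin may differ from the quoted 8.62 in the last digit.

```python
COEFFICIENT = 29.65   # estimated coefficient of Mileage (dollars per thousand miles)
STANDARD_ERROR = 3.92 # standard error of the coefficient, as rounded in the text
T_CRITICAL = 2.2010   # 95%-confidence multiplier from the text

margin = T_CRITICAL * STANDARD_ERROR
low, high = COEFFICIENT - margin, COEFFICIENT + margin

print(f"margin of error: about {margin:.2f}")
print(f"95% interval: about {low:.2f} to {high:.2f} dollars per thousand miles")
```

The whole interval sits well above 0, which fits with the strong evidence for Mileage's role reported later among the significance levels.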
3. The Explanatory Power of the Model
• Why do maintenance and repair costs vary from car to car across the current fleet?
– A partial answer is, “Because Mileage, Age, and Make vary from car to car across the fleet.”
• Indeed, variations in those three variables can potentially explain 80.78% of the overall variability in Costs across the fleet!
• This is the adjusted coefficient of determination for our model.
The Explanatory Power of the Model
• Names can vary: The {adjusted, corrected, unbiased} {coefficient of determination, r-squared} all refer to the same thing.
– Without an adjective, the {coefficient of determination, r-squared} refers to a number slightly larger than the “correct” number, and is a throwback to pre-computer days.
• When a new variable that actually contributes nothing to the model (i.e., its true coefficient is 0) is added, the adjusted coefficient of determination will, on average, remain unchanged.
– Depending on chance, it might go up or down a bit.
– If the adjusted coefficient of determination comes out negative, interpret it as 0%.
– The thing without the adjective will always go up. That’s obviously not quite “right.”
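The link between the two versions can be sketched with the standard adjustment formula, adjusted = 1 − (1 − r²)(n − 1)/(n − k − 1), where n is the sample size and k the number of explanatory variables. The sample size of 15 cars is an assumption here: it is inferred from the 2.2010 multiplier, which corresponds to 11 degrees of freedom with 4 estimated coefficients. The unadjusted figure computed below is implied, not quoted in the text.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted coefficient of determination for n observations, k explanatory variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

def unadjusted_r_squared(adjusted, n, k):
    """Invert the adjustment to recover the plain coefficient of determination."""
    return 1 - (1 - adjusted) * (n - k - 1) / (n - 1)

N_CARS = 15        # assumption: inferred from the 2.2010 multiplier (11 degrees of freedom)
K_VARIABLES = 3    # Mileage, Age, Make
ADJUSTED = 0.8078  # from the text

implied_plain = unadjusted_r_squared(ADJUSTED, N_CARS, K_VARIABLES)
print(f"implied unadjusted value: about {implied_plain:.4f}")  # slightly larger than 0.8078

# A weak model can produce a negative adjusted value; interpret it as 0%.
weak = adjusted_r_squared(0.05, N_CARS, K_VARIABLES)
print(f"adjusted value for a weak model: {weak:.3f} -> report {max(weak, 0.0):.0%}")
```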
The Explanatory Power of the Model
• Subtracting the adjusted coefficient of determination from 100% yields the fraction of the population-wide variation in the dependent variable which must be explained by terms still lumped together in the residual.
– If your goal is to explain everything, you want the adjusted coefficient of determination to be large.
– If your goal is to explain something, a very small value might be perfectly acceptable.
4. The Relative Explanatory Importance of the Explanatory Variables: The Beta-Weights
• What explains why maintenance and repair costs vary from car to car across the current fleet?
– (This is the same question as before, but now we seek a more detailed answer.)
– Compare the absolute values of the beta-weights.
Variations in Mileage across the population are roughly twice as important as are variations in Age (1.1531 vs. 0.5597), in helping to explain why Costs vary across the population.
In turn, the fact that the cars vary in Age is more than twice as important as is the fact that some are Fords, and others Hondas (0.5597 vs. 0.2193), in helping to explain why Costs vary.
The Beta-Weights
• You can’t compare regression coefficients directly, since they may carry different dimensions.
• The beta-weights are dimensionless, and combine how much each explanatory variable varies, with how much that variability leads to variability in the dependent variable.
– Specifically, they are the product of each explanatory variable’s standard deviation (how much it varies) and its coefficient (how much its variation affects the dependent variable), divided by the standard deviation of the dependent variable (just to remove all dimensionality).
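The definition translates into a one-line function. In the sketch below the standard deviations are hypothetical (the text does not report them), so only the structure matters: re-expressing Mileage in miles rather than thousands of miles changes the coefficient and the standard deviation in opposite directions, and the beta-weight stays put. The last lines recompute the comparisons quoted for the motor-pool model.

```python
def beta_weight(coefficient, sd_explanatory, sd_dependent):
    """Coefficient times the explanatory variable's SD, divided by the dependent variable's SD."""
    return coefficient * sd_explanatory / sd_dependent

# Hypothetical standard deviations, for illustration only.
SD_COSTS = 100.0             # dollars
SD_MILEAGE_THOUSANDS = 4.0   # thousands of miles

in_thousands = beta_weight(29.65, SD_MILEAGE_THOUSANDS, SD_COSTS)
in_miles = beta_weight(29.65 / 1000, SD_MILEAGE_THOUSANDS * 1000, SD_COSTS)
print(in_thousands, in_miles)  # the same, up to floating-point rounding

# Beta-weights quoted in the text; compare their absolute values.
BETA_WEIGHTS = {"Mileage": 1.1531, "Age": 0.5597, "Make": 0.2193}
mileage_vs_age = abs(BETA_WEIGHTS["Mileage"]) / abs(BETA_WEIGHTS["Age"])
age_vs_make = abs(BETA_WEIGHTS["Age"]) / abs(BETA_WEIGHTS["Make"])
print(f"Mileage vs. Age: about {mileage_vs_age:.2f}; Age vs. Make: about {age_vs_make:.2f}")
```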
5. The Significance Levels for the Explanatory Variables (the p-values)
• How strong is the evidence that Mileage does play a role in the relationship involving all three explanatory variables?
– “Strength of evidence” evokes memories of hypothesis testing!
– If we wish to conclude that the evidence supports the inclusion of Mileage in our model, we must take the opposite as our null hypothesis:
• Mileage would not belong if it had no effect on Costs, i.e., if its true coefficient were 0.
The Significance Levels for the Explanatory Variables (the p-values)
• Null hypothesis: “The true coefficient of Mileage is 0.”
– Our estimate is 29.65.
– One standard-deviation’s-worth of uncertainty in the estimate is 3.915.
– Our estimate is 7.5726 standard deviations away from the hypothesized true value.
– If the truth really were 0, we’d see something this far away (or further) only 0.0011% of the time.
– The data is an overwhelmingly strong contradiction to the null hypothesis, and therefore …
The evidence is overwhelmingly strong in support of the statement that the true coefficient of Mileage differs from 0, and Mileage does belong in our model.
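The steps above can be retraced numerically. This sketch assumes the significance level is a two-sided tail probability from Student's t distribution with 11 degrees of freedom (the value consistent with the 2.2010 multiplier used earlier). To stay within the standard library it integrates the t density with Simpson's rule; a statistics package would normally do this step.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_ratio, df, steps=2000):
    """P(|T| >= |t_ratio|), using Simpson's rule on [0, |t_ratio|] (steps must be even)."""
    t = abs(t_ratio)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3  # P(0 <= T <= t)
    return 1 - 2 * central_half

DEGREES_OF_FREEDOM = 11  # assumption: consistent with the 2.2010 multiplier

estimate, standard_error = 29.65, 3.915
t_ratio = (estimate - 0) / standard_error  # distance from the hypothesized value of 0
p_value = two_sided_p_value(t_ratio, DEGREES_OF_FREEDOM)

print(f"t-ratio: about {t_ratio:.2f} standard errors from 0")
print(f"significance level: about {100 * p_value:.4f}%")
```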
Does Make Belong in our Model?
• Null hypothesis: “The true coefficient of Make is 0.”
– Our estimate is 47.43.
– One standard-deviation’s-worth of uncertainty in the estimate is 28.98.
– Our estimate is 1.6366 standard deviations away from the hypothesized true value.
– If the truth really were 0, we’d see something this far away (or further) only 12.9983% of the time.
– The data is a bit of a contradiction to the null hypothesis, and therefore …
There’s only a bit of evidence in support of the statement that the true coefficient of Make differs from 0, and that Make does belong in our model.
So, what should we do? Leave Make in, or take it out?
• It depends: Remember, the belief decision must stand on three legs.
– If the Fords and Hondas came from a joint production facility …
  • I’d lean towards leaving it out.
– If the Fords came from Detroit, and the Hondas from Kyoto …
  • I’d lean towards leaving it in.
– More data might clarify the situation …
  • The standard error of the coefficient would drop.
  – If the coefficient stayed around 47, the significance level would get closer to zero, building stronger evidence for including the variable.
  – If the coefficient shrank towards 0, there would continue to be no real evidence supporting Make’s inclusion, and even if it did belong, its estimated effect would be small.
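The "more data" scenarios can be played out with numbers. Everything about the larger sample in this sketch is hypothetical: it supposes roughly four times as many cars (60 instead of an assumed 15), so that the standard error of the Make coefficient roughly halves to 14.49 and the degrees of freedom rise to 56. The p-value helper integrates Student's t density with Simpson's rule so that the example needs only the standard library.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_ratio, df, steps=2000):
    """P(|T| >= |t_ratio|), using Simpson's rule on [0, |t_ratio|] (steps must be even)."""
    t = abs(t_ratio)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1 - 2 * (total * h / 3)

# Current sample: figures from the text, with 11 degrees of freedom assumed.
p_now = two_sided_p_value(47.43 / 28.98, df=11)

# Hypothetical larger sample: standard error halves, 56 degrees of freedom.
HYPOTHETICAL_SE, HYPOTHETICAL_DF = 14.49, 56

# Scenario 1: the estimate stays where it is.
p_if_stable = two_sided_p_value(47.43 / HYPOTHETICAL_SE, HYPOTHETICAL_DF)

# Scenario 2: the estimate shrinks towards 0 (10.0 is an arbitrary small value).
p_if_shrunk = two_sided_p_value(10.0 / HYPOTHETICAL_SE, HYPOTHETICAL_DF)

print(f"now: {p_now:.3f}; estimate stable: {p_if_stable:.4f}; estimate shrunk: {p_if_shrunk:.3f}")
```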
The Significance Levels for the Explanatory Variables
• Imagine that you have a model.
– You introduce a new variable into that model.
– The adjusted coefficient of determination increases.
• Does this mean that the new variable belongs in your model?
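– Not necessarily! Adding garbage to your model will increase the adjusted coefficient of determination a little bit a fair fraction of the time: it goes up whenever the garbage variable’s estimated coefficient happens to land more than one standard error away from 0, which chance alone produces roughly a third of the time.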
– The significance level (for the new variable) tells you if the adjusted coefficient of determination went up by enough to support keeping the new variable.
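How often garbage nudges the adjusted coefficient of determination upward can be checked by simulation. This sketch uses a deliberately stripped-down setting that is an assumption, not the motor-pool model: the starting model has no explanatory variables at all (so its adjusted coefficient of determination is exactly 0), the sample has 15 observations, and both the dependent variable and the added variable are pure, unrelated noise.

```python
import random

def adjusted_r2_one_variable(xs, ys):
    """Adjusted coefficient of determination from regressing ys onto the single variable xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r_squared = sxy ** 2 / (sxx * syy)
    return 1 - (1 - r_squared) * (n - 1) / (n - 2)

rng = random.Random(1)  # fixed seed so the run is repeatable
N_OBSERVATIONS = 15     # hypothetical sample size
TRIALS = 4000

increases = 0
for _ in range(TRIALS):
    ys = [rng.gauss(0, 1) for _ in range(N_OBSERVATIONS)]
    garbage = [rng.gauss(0, 1) for _ in range(N_OBSERVATIONS)]  # unrelated to ys
    # The intercept-only baseline has an adjusted value of 0, so any positive
    # result means that adding the garbage variable pushed it up.
    if adjusted_r2_one_variable(garbage, ys) > 0:
        increases += 1

fraction_increased = increases / TRIALS
print(f"adjusted value went up in {fraction_increased:.1%} of trials")
```

In this one-variable setting, the adjusted value rises exactly when the garbage variable's estimated coefficient is more than one standard error from 0, so the printed fraction estimates the chance of that happening with pure noise.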
Summary
1. Predictions
• What annual maintenance and repair cost (Costs) would you predict for a new (Age = 0) Honda (Make = 1) driven 15,000 miles (Mileage = 15)?
– Regress the dependent variable onto all known (for this individual) explanatory variables.
– Look at (prediction) ± (~2)·(standard error of prediction).
• Estimate the mean annual costs for new Hondas (note the plural!) driven 15,000 miles.
– Regress the dependent variable onto all known (for the group members) explanatory variables.
– Look at (prediction) ± (~2)·(standard error of estimated mean).
2. Estimating an Effect
• An additional thousand miles of driving in the course of a year adds, on average, how much to the year’s maintenance and repair costs?
– Regress the dependent variable onto all explanatory variables (use the most complete model).
– Look at (estimated coefficient) ± (~2)·(standard error of coefficient).
3. The Explanatory Power of the Model
• Why do maintenance and repair costs vary from car to car across the current fleet?
– Look at the adjusted coefficient of determination to see how much of the variation in the dependent variable can be jointly explained by variations in the included explanatory variables.
4. The Relative Explanatory Importance of the Explanatory Variables
• Variation in which explanatory variable is most important in explaining why maintenance and repair costs vary from car to car across the current fleet?
– Compare the absolute values of the beta-weights.
5. The Significance Levels for the Explanatory Variables
• How strong is the evidence that Mileage does play a role in the relationship involving all three explanatory variables?
– The smaller the significance level, the stronger the evidence that this variable has a non-zero coefficient in this model.