Chapter 9: Regression


Alexander Swan & Rafey Alvi

Residuals Grouping

● No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.

● Residuals often reveal subtleties that were not clear from a plot of the original data.

● Sometimes the subtleties we see are additional details that help confirm or refine our understanding.

● Sometimes they reveal violations of the regression conditions that require our attention.
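As a rough, non-authoritative sketch (the data here are made up for illustration), this is how one might fit a least-squares line in Python and display the residuals to check that a linear model is reasonable:

```python
# Minimal sketch: fit a line and inspect the residuals against the fitted values.
# The x and y values are hypothetical; substitute your own data.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
fitted = intercept + slope * x
residuals = y - fitted

# A residual plot with no visible pattern supports the linear model;
# curves, fans, or clusters are the "subtleties" worth investigating.
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```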

Subsets

For a regression model to make sense, all the data must come from the same group.

When we discover that there is more than one group in a regression, neither modeling the groups together nor modeling them apart is automatically correct; we have to decide which makes sense for the question at hand.
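A minimal sketch of fitting the groups separately, assuming made-up data and hypothetical group labels, just to show the mechanics:

```python
# Sketch: when a scatterplot or residual plot suggests two distinct groups,
# fit each group separately and compare the lines. Data and labels are hypothetical.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4, 5, 6, 8, 10, 12, 14, 16], dtype=float)
group = np.array(["A"] * 5 + ["B"] * 5)

for g in np.unique(group):
    mask = group == g
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    print(f"group {g}: y-hat = {intercept:.2f} + {slope:.2f} x")
```

Very different slopes or intercepts are a sign that a single combined line would misrepresent both groups.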

Extrapolation

Extrapolations are dubious because they require the additional (and very questionable) assumption that nothing about the relationship between x and y changes, even at extreme values of x.

Extrapolation can get you into deep trouble; you're better off not doing it.
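One practical habit, sketched below with hypothetical data and a hypothetical predict helper, is to flag any prediction that falls outside the range of x-values used to fit the model:

```python
# Sketch: warn when a requested prediction lies outside the observed x-range.
# The data and the predict() helper are hypothetical.
import numpy as np

x = np.array([1990, 1992, 1994, 1996, 1998, 2000], dtype=float)
y = np.array([23.9, 24.1, 24.4, 24.8, 25.0, 25.1])
slope, intercept = np.polyfit(x, y, 1)

def predict(x_new):
    if x_new < x.min() or x_new > x.max():
        print(f"Warning: {x_new} is outside [{x.min()}, {x.max()}]; this is an extrapolation.")
    return intercept + slope * x_new

print(predict(1995))   # within the data: an interpolation
print(predict(2020))   # outside the data: treat with suspicion
```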

Outliers

Any point that stands away from the others is called an outlier, and it can strongly influence a regression.

Leverage, influential

A data point can be unusual if its x-value is far from the mean of the x-values. Such points are said to have high leverage.

A data point is influential if omitting it gives a very different model.
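A small numerical sketch of both ideas, using hypothetical data: leverage from the simple-regression formula h_i = 1/n + (x_i - x̄)² / Σ(x - x̄)², and influence checked by refitting the line with each point left out.

```python
# Sketch: compute leverage and check influence by leave-one-out refitting.
# The data are hypothetical; the last point has an extreme x-value.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 15], dtype=float)
y = np.array([2.0, 2.9, 4.1, 5.2, 5.9, 16.0])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
leverage = 1 / n + (x - x.mean()) ** 2 / sxx      # large when x is far from x-bar

full_slope, full_intercept = np.polyfit(x, y, 1)
for i in range(n):
    keep = np.arange(n) != i
    slope_i, intercept_i = np.polyfit(x[keep], y[keep], 1)
    print(f"point {i}: leverage = {leverage[i]:.2f}, "
          f"slope without it = {slope_i:.3f} (full slope = {full_slope:.3f})")
```

A point whose removal changes the slope or intercept substantially is influential.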

Lurking Variable

There is no way to conclude from a regression alone that one variable causes the other. With observational data, as opposed to data from a designed experiment, there is no way to be sure that a lurking variable is not the cause of any apparent association.

Summary values

Scatterplots of summary statistics show less scatter than plots of the underlying data on individuals. Statistics summarized over groups tend to vary less than the same variable measured on individuals, so an association can look stronger than it really is.
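A quick simulated illustration of this point (the numbers are made up): the means of groups vary much less than the individual values they summarize.

```python
# Sketch: compare the spread of individual values with the spread of group means.
# Simulated, hypothetical data: 20 groups of 25 individuals each.
import numpy as np

rng = np.random.default_rng(1)
individuals = rng.normal(loc=70, scale=10, size=(20, 25))
group_means = individuals.mean(axis=1)

print("SD of individual values:", round(individuals.std(), 2))
print("SD of group means:      ", round(group_means.std(), 2))   # roughly 10 / sqrt(25)
```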

Question 3

Suppose you wanted to predict the trend in marriage age for American women into the early part of this century.

a. How could you use the data graphed in Exercise 1 to get a good prediction? Marriage ages in selected years starting in 1900 are listed below. Use all or part of these data to create an appropriate model for predicting the average age at which women will first marry in 2005.

1900-1950 (10 year intervals): 21.9, 21.6, 21.2, 21.3, 21.5, 20.3

1955-1995 (5 year intervals): 20.2, 20.2, 20.6, 20.8, 21.1, 22.0, 23.3, 23.9, 24.5

To predict the average age you would use the most recent ages, from 1975-1995, which are straight enough for a linear regression model. The linear model comes out to Age = -322.430 + 0.174(Year). The residual plot shows no pattern, and according to the model the average age at first marriage in 2005 would be about 26.44 years.
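As a check on the arithmetic (this code is not part of the original answer), the 1975-1995 values can be refit with numpy:

```python
# Sketch: refit the 1975-1995 portion of the marriage-age data and predict 2005.
import numpy as np

years = np.array([1975, 1980, 1985, 1990, 1995], dtype=float)
ages = np.array([21.1, 22.0, 23.3, 23.9, 24.5])

slope, intercept = np.polyfit(years, ages, 1)
print(f"Age-hat = {intercept:.3f} + {slope:.3f}(Year)")          # about -322.430 + 0.174(Year)
print(f"Predicted age in 2005: {intercept + slope * 2005:.2f}")  # about 26.44
```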

Question 3

b. How much faith do you place in this prediction? Explain.

I don't have very much faith in the prediction because it is for a year 10 years beyond the most recent year in the data, so it is an extrapolation.

c. Do you think your model would produce an accurate prediction about your grandchildren, say, 50 years from now? Explain.

No! If a prediction only 10 years beyond the data is already questionable, a prediction 50 years out based on the 1955-1995 trend cannot be trusted at all.

Question 5

In justifying his choice of a model, a student wrote, “I know this is the correct model because R² = 99.4%.”

a. Is this reasoning correct? Explain.

No. A high R² alone does not show that the model is correct; you would need to look at a scatterplot and the residuals to check that a linear model is appropriate.

b. Does this model allow the student to make accurate predictions? Explain.

Not necessarily; the data could be curved, and then a linear model would give poor predictions even with a high R².
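A small illustration (with made-up data) of why a high R² is not enough: fitting a line to clearly curved data can still give R² near 1, while the residuals show an obvious bend.

```python
# Sketch: a straight line fit to quadratic data gives a high R-squared,
# yet the residuals follow a systematic down-then-up pattern.
import numpy as np

x = np.arange(1, 11, dtype=float)
y = x ** 2                                    # a curved relationship

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()

print(f"R-squared = {r_squared:.3f}")         # about 0.95, despite the curvature
print("Residuals:", np.round(residuals, 1))   # pattern: positive, negative, positive
```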

Vocabulary to know

● Outlier Condition: watch for two kinds of unusual points, those with large residuals and those with high leverage. Points with either (especially both) can influence the regression model significantly.

● lurking variable: a variable not included in the model that may be the real cause of an apparent association between x and y.