Chapter 10 Re-expressing the data

math2200
If the relationship is nonlinear
• We may re-express the data to straighten
the bent relationship.
• Common ways to re-express data
– Logarithms of the response variable or the
explanatory variable, or both.
– Square root of the response variable.
– Reciprocals of the response variable.
Straight to the Point (cont.)
• The relationship between
fuel efficiency (in miles
per gallon) and weight (in
pounds) for late model
cars looks fairly linear at
first
• The linear model has
intercept 40.65, slope -0.0057,
and R-squared = 82%.
Straight to the Point (cont.)
• The residual plot shows a problem:
Straight to the Point (cont.)
• We can re-express fuel efficiency as gallons per mile (a
reciprocal):
scatter plots: before and after re-expression
Straight to the Point (cont.)
• The re-expression improves the residual plot.
• R-squared = 88% (was 82%)
• For a 7,200 lb Lincoln Navigator, the re-expressed model
predicts 9.8 mpg (reported: 11.0 mpg), compared with
-0.32 mpg before the re-expression.
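The arithmetic behind this comparison can be sketched in a few lines. This is only an illustration using the rounded intercept and slope quoted for the linear model; the function names are hypothetical, and because the coefficients are rounded the result differs slightly from the -0.32 mpg quoted above. The coefficients of the gallons-per-mile model are not given in the slides, so only the conversion back to mpg is shown.

```python
# Rounded coefficients quoted for the straight-line model of mpg on weight (lbs).
INTERCEPT = 40.65
SLOPE = -0.0057


def predict_mpg_linear(weight_lbs):
    """Predicted fuel efficiency (mpg) from the straight-line model."""
    return INTERCEPT + SLOPE * weight_lbs


def gpm_to_mpg(gallons_per_mile):
    """Invert a gallons-per-mile prediction to get back to mpg."""
    return 1.0 / gallons_per_mile


# The straight line keeps falling, so a heavy enough car gets a negative mpg.
navigator_mpg = predict_mpg_linear(7200)
print(f"Linear model at 7,200 lbs: {navigator_mpg:.2f} mpg")

# 9.8 mpg (the re-expressed model's prediction) written as gallons per mile.
predicted_gpm = 1.0 / 9.8
print(f"{predicted_gpm:.3f} gal/mile -> {gpm_to_mpg(predicted_gpm):.1f} mpg")
```

A reciprocal of a positive gallons-per-mile prediction is always positive, which is why the re-expressed model cannot produce an impossible negative mpg.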
When do we need to re-express data?
1. To make skewed distributions more
symmetric
• Why?
– Easier to summarize symmetric distributions
– We can use mean and sd
– We can use a normal model
Assets of 77 large companies
After log transformation
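The effect of a log re-expression on skewness might be checked as follows. This is a minimal sketch on synthetic, right-skewed values generated with a fixed seed; it does not use the 77-company assets data, and the `skewness` helper is a hypothetical name for the usual moment-based skewness.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by sd cubed."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)


# Synthetic right-skewed "assets" (an assumption, not the data on the slide).
rng = random.Random(0)
assets = [rng.lognormvariate(8, 1.5) for _ in range(500)]
log_assets = [math.log10(a) for a in assets]

skew_raw = skewness(assets)
skew_log = skewness(log_assets)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero after the log suggests that the mean, the sd, and a normal model are reasonable summaries of the re-expressed values.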
When do we need to re-express data?
2. Make the spread of several groups more
alike
• Why?
– Easier to compare groups that share a
common spread
– Some statistical methods require the
assumption that all groups have a common sd
Assets by market sectors
After log transformation
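One way to see why logs equalize spread: when groups differ mainly by scale, the log turns a multiplicative difference into an additive shift, which leaves the sd unchanged. The sketch below uses made-up groups (not the market-sector data), and `sd_ratio` is a hypothetical helper.

```python
import math
import statistics

# Hypothetical groups that differ only by a scale factor.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
sectors = {
    "small": [1 * v for v in base],
    "medium": [10 * v for v in base],
    "large": [100 * v for v in base],
}


def sd_ratio(groups):
    """Largest group sd divided by smallest group sd."""
    sds = [statistics.stdev(values) for values in groups.values()]
    return max(sds) / min(sds)


logged = {name: [math.log10(v) for v in values] for name, values in sectors.items()}

ratio_raw = sd_ratio(sectors)
ratio_log = sd_ratio(logged)
print(f"sd ratio before: {ratio_raw:.1f}, after log: {ratio_log:.3f}")
```

Real groups will not match this neatly, but a ratio much closer to 1 after re-expression is the pattern to look for.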
When do we need to re-express data?
3. Make the relationship more linear
• Why?
– To apply linear regression
4. Make the points in a scatterplot spread out
evenly rather than thicken at one end.
• Why?
– Make the dataset easier to model.
Assets vs. Sales
After log transformation
The Ladder of Powers
• There is a family of simple re-expressions
that move data toward our goals in a
consistent way. We call this collection of
re-expressions the Ladder of Powers.
• The Ladder of Powers orders the effects
that the re-expressions have on data.
The Ladder of Powers
Power | Name | Comment
2 | Square of data values | Try with unimodal distributions that are skewed to the left.
1 | Raw data | Data with positive and negative values and no bounds are less likely to benefit from re-expression.
½ | Square root of data values | Counts often benefit from a square root re-expression. For counted data, start here.
“0” | We’ll use logarithms here | Measurements that cannot be negative often benefit from a log re-expression.
-1/2 | Reciprocal square root | An uncommon re-expression, but sometimes useful.
-1 | The reciprocal of the data | Ratios of two quantities (e.g., mph) often benefit from a reciprocal.
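The rungs of the ladder can be written as one small function. This is a sketch, not a standard library routine: the name `ladder` is hypothetical, it accepts positive values only, and it negates the results for negative powers (as in -1/sqrt(y) and -1/y) so that larger raw values remain larger after re-expression.

```python
import math


def ladder(value, power):
    """Re-express a positive value at one rung of the Ladder of Powers.

    Power 0 stands for the logarithm (base 10 here). Results for negative
    powers are negated so the ordering of the data is preserved.
    """
    if value <= 0:
        raise ValueError("this sketch handles positive data only")
    if power == 0:
        return math.log10(value)
    result = value ** power
    return -result if power < 0 else result


# Walk down the ladder for a few sample values.
for power in (2, 1, 0.5, 0, -0.5, -1):
    print(power, [round(ladder(v, power), 3) for v in (1, 4, 100)])
```

Moving down the ladder pulls in large values more and more strongly, which is what makes the lower rungs useful for right-skewed data.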
Straighten the scatterplot
[Figure: five scatterplots against x: Raw Data (y), Square Root (sqrt(y)), Logarithm (log(y)), -1/square root (-1/sqrt(y)), and -reciprocal (-1/y).]
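Comparing the rungs side by side can be roughed out in code. The sketch below builds synthetic data for which sqrt(y) is exactly linear in x, then uses the correlation with x as a quick screening number for each re-expression. The data, the `RUNGS` table, and `pearson_r` are all illustrative assumptions; a high correlation is not a substitute for looking at the scatterplot and the residuals.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Candidate re-expressions of y, ordered as on the ladder.
RUNGS = {
    "raw": lambda y: y,
    "sqrt": math.sqrt,
    "log": math.log10,
    "-1/sqrt": lambda y: -1 / math.sqrt(y),
    "-1/y": lambda y: -1 / y,
}

# Synthetic data: sqrt(y) = 2 + 0.1 * x, so the square root rung is the right one.
xs = list(range(0, 301, 10))
ys = [(2 + 0.1 * x) ** 2 for x in xs]

straightness = {name: pearson_r(xs, [f(y) for y in ys]) for name, f in RUNGS.items()}
best = max(straightness, key=straightness.get)

for name, r in straightness.items():
    print(f"{name:>8}: r = {r:.4f}")
print("straightest:", best)
```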
Plan B: Attack of the Logarithms
• When none of the data values is zero or
negative, logarithms can be a helpful ally
in the search for a useful model.
• Try taking the logs of both the x- and y-variables.
• Then re-express the data using some
combination of x or log(x) vs. y or log(y).
Plan B: Attack of the Logarithms
(cont.)
More generally
• We can transform both x and y
Model | X-axis | Y-axis
y = exp(a + b x) | x | log(y)
y = a + b log(x) | log(x) | y
y = a x^b | log(x) | log(y)
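Each of the three models becomes a straight line on the axes listed, so an ordinary least-squares line recovers its parameters. The sketch below checks this on noise-free synthetic data; `fit_line` is a hypothetical helper and the parameter values are arbitrary choices for illustration.

```python
import math


def fit_line(u, v):
    """Least-squares (intercept, slope) for v regressed on u."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    slope = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / sum(
        (a - mu) ** 2 for a in u
    )
    return mv - slope * mu, slope


x = [1, 2, 3, 4, 5, 6, 7, 8]
log_x = [math.log(xi) for xi in x]

# Exponential model y = exp(a + b*x): plot log(y) against x.
y_exp = [math.exp(0.5 + 0.3 * xi) for xi in x]
a_exp, b_exp = fit_line(x, [math.log(yi) for yi in y_exp])

# Logarithmic model y = a + b*log(x): plot y against log(x).
y_log = [1.0 + 2.0 * lx for lx in log_x]
a_log, b_log = fit_line(log_x, y_log)

# Power model y = a * x**b: plot log(y) against log(x).
# The fitted intercept is log(a), so exponentiate it to recover a.
y_power = [3.0 * xi ** 2 for xi in x]
log_a, b_power = fit_line(log_x, [math.log(yi) for yi in y_power])
a_power = math.exp(log_a)

print(f"exponential: a = {a_exp:.3f}, b = {b_exp:.3f}")
print(f"logarithmic: a = {a_log:.3f}, b = {b_log:.3f}")
print(f"power:       a = {a_power:.3f}, b = {b_power:.3f}")
```

With real data the fit will not be exact, and the choice among the three is made by seeing which pair of axes gives the straightest plot.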
Plan B: Attack of the Logarithms
(cont.)
Scatterplots: GDP and log(GDP) vs. Year
Logarithms: Fishing Line Length and Strength
[Figure: four scatterplots: length vs. strength; 1/length vs. strength; 1/sqrt(length) vs. strength; log(length) vs. log(strength).]
Multiple Benefits
• We often choose a re-expression for one
reason and then discover that it has
helped other aspects of the analysis.
• For example, a re-expression that makes
a histogram more symmetric might also
straighten a scatterplot or stabilize
variance.
Why Not Just a Curve?
• If there’s a curve in the scatterplot, why not
just fit a curve to the data?
– Fitting a curve is computationally more difficult.
– Straight lines are easy to understand and interpret.
A general principle
• Occam’s Razor: “entia non sunt multiplicanda praeter necessitatem”
(entities should not be multiplied beyond necessity).
• When multiple competing theories are equal in other
respects, the principle recommends selecting the
theory that introduces the fewest assumptions and
postulates the fewest hypothetical entities.
What Can Go Wrong?
• Don’t expect your
model to be perfect.
• Don’t choose a model
based on R-squared alone:
– Example: a large R-squared, but
the residual plot shows
curvature
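A small numerical case shows how a large R-squared can coexist with obvious curvature. This sketch fits a least-squares line to synthetic data that follow y = x squared exactly; the data are an assumption chosen for illustration.

```python
# Synthetic curved data: y is exactly x squared.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Ordinary least-squares line.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Residuals and R-squared.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"R-squared = {r_squared:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

The residuals are positive at both ends and negative in the middle, a systematic bend that the R-squared value alone does not reveal.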
What Can Go Wrong? (cont.)
• Beware of multiple modes.
– Re-expression cannot pull separate modes together.
• Watch out for scatterplots that turn around.
– Re-expression can straighten many bent
relationships, but not those that go up and down.
What Can Go Wrong? (cont.)
• Watch out for negative data values.
– Logarithms and square roots cannot be applied to negative values.
• Watch for data far from 1.
– Data values that are all very far from 1 may not be
much affected by re-expression unless the range is
very large. If all the data values are large (e.g., years),
consider subtracting a constant to bring them back
near 1.
• Don’t stray too far from the ladder
– Too difficult to interpret
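The point about data far from 1 can be checked numerically. In the sketch below (the years and the `pearson_r` helper are illustrative assumptions), the log of a calendar year is almost a straight-line function of the year itself, so taking logs changes nearly nothing; after subtracting a constant, the same log bends the values noticeably.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


years = list(range(1990, 2011))
shifted = [y - 1989 for y in years]  # 1, 2, ..., 21

# How close to a straight line is log(value) as a function of value?
r_years = pearson_r(years, [math.log(y) for y in years])
r_shifted = pearson_r(shifted, [math.log(s) for s in shifted])

print(f"raw years:     r = {r_years:.6f}")
print(f"shifted years: r = {r_shifted:.6f}")
```

A correlation indistinguishable from 1 means the re-expression is effectively a linear rescaling, which cannot straighten anything.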
What have we learned?
• When the conditions for linear regression
are not met, a simple re-expression of the
data may help.
• A re-expression may make the:
– Distribution of a variable more symmetric.
– Spread across different groups more similar.
– Form of a scatterplot straighter.
– Scatter around the line in a scatterplot more
consistent.
What have we learned? (cont.)
• Taking logs is often a good and simple
starting point.
– To search further, the Ladder of Powers or the
log-log approach can help us find a good re-expression.
• Our models won’t be perfect, but re-expression can lead us to a useful model.