Transcript Chapter 5

Chapter 5
Summarizing Bivariate
Data
Suppose we found the age and weight for each person in a sample of 10 adults. Is there any relationship between the age and weight of these adults?
Create a scatterplot of the data below.
Age  24   30   41   28   50   46   49   35   20   39
Wt   256  124  320  185  158  129  103  196  110  130
[Scatterplot: Weight vs. Age]
Do you think there is a relationship? If so, what kind? If not, why not?
There does not appear to be a relationship between age and weight in adults.
Suppose we found the height and weight for each person in a sample of 10 adults. Is there any relationship between the height and weight of these adults?
Create a scatterplot of the data below.
Ht   74   65   77   72   68   60   62   73   61   64
Wt   256  124  320  185  158  129  103  196  110  130
[Scatterplot: Weight vs. Height]
Do you think there is a relationship? If so, what kind? If not, why not?
Is it positive or negative? Weak or strong?
Correlation
• The relationship between bivariate numerical variables
– May be positive or negative
– May be weak or strong
What feature(s) of the graph would indicate a weak or strong relationship?
What does it mean if the relationship is positive? Negative?
Identify the strength and direction of the following data sets.
[Four scatterplots: Set A, Set B, Set C, Set D]
Set A shows a strong, positive linear relationship.
Set B shows little or no relationship.
Set C shows a weaker (moderate), negative linear relationship.
Set D shows a strong, positive curved relationship.
Identify as having a positive relationship, a negative relationship, or no relationship.
1. Heights of mothers and heights of their adult daughters: positive (+)
2. Age of a car in years and its current value: negative (-)
3. Weight of a person and calories consumed: positive (+)
4. Height of a person and the person’s birth month: no relationship
5. Number of hours spent in safety training and the number of accidents that occur: negative (-)
Correlation Coefficient (r)
• A quantitative assessment of the strength and direction of the linear relationship in bivariate, quantitative data
• Pearson’s sample correlation is used most
• Population correlation coefficient – ρ (rho)
• Statistic correlation coefficient – r
• Equation: $r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
What are the two quantities in parentheses called? These are the z-scores for x and y.
Example 5.1
For the six primarily undergraduate universities in California with enrollments between 10,000 and 20,000, six-year graduation rates (y) and student-related expenditures per full-time student (x) for 2003 were reported as follows:
Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9
Create a scatterplot and calculate r.
Example 5.1 Continued
[Scatterplot: Graduation Rates vs. Expenditures]
r = 0.05
In order to interpret what this number tells us, let’s investigate the properties of the correlation coefficient.
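The calculation of r for Example 5.1 can be sketched in a few lines of Python. This is a minimal illustration of the z-score formula for r, not the calculator or software procedure used in class; the function name `correlation` and the variable names are chosen here for illustration only.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's sample correlation: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Product of the z-score for x and the z-score for y, one per observation.
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Data from Example 5.1.
expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
graduation_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r = correlation(expenditures, graduation_rates)
print(round(r, 2))  # 0.05, matching the value on the slide
```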
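Properties of r (correlation coefficient)
1) legitimate values are -1 ≤ r ≤ 1
[Number line from -1 to 1 with marks at -1, -.8, -.5, 0, .5, .8, 1: values near 0 show no correlation, values between 0 and ±.5 a weak correlation, values between ±.5 and ±.8 a moderate correlation, and values between ±.8 and ±1 a strong correlation]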
Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9
Suppose that the graduation rates were changed from percents to decimals (divide by 100). Transform the graduation rates and calculate r.
r = 0.05
It is the same! Why?
Do the following transformations and calculate r:
1) x’ = 5(x + 14)
2) y’ = (y + 30) ÷ 4
2) value of r is not changed by any linear transformation
Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9
Suppose we wanted to estimate the expenditures per student for given graduation rates.
Switch x and y, then calculate r.
r = 0.05
It is the same!
3) value of r does not depend on which of the two variables is labeled x
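Properties 2 and 3 can be checked numerically. The sketch below uses an algebraically equivalent form of the formula for r (the helper `correlation` is an illustrative name, not from the slides) and applies the two transformations from the slide. It also shows one caveat that the slide does not state: multiplying a variable by a negative constant keeps the size of r but reverses its sign.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
graduation_rates = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]

r_original = correlation(expenditures, graduation_rates)

# Property 2: the linear transformations from the slide leave r unchanged.
x_new = [5 * (x + 14) for x in expenditures]
y_new = [(y + 30) / 4 for y in graduation_rates]
r_transformed = correlation(x_new, y_new)

# Property 3: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(graduation_rates, expenditures)

# Caveat (not on the slide): a negative multiplier flips the sign of r.
r_negated = correlation([-x for x in expenditures], graduation_rates)

print(round(r_original, 4), round(r_transformed, 4),
      round(r_swapped, 4), round(r_negated, 4))
```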
Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9
Suppose the 33.9 was REALLY 63.9. What do you think would happen to the value of the correlation coefficient?
Plot a revised scatterplot and find r.
[Two scatterplots: Graduation Rates vs. Expenditures, original and revised]
r = 0.42
Extreme values affect the correlation coefficient.
4) value of r is affected by extreme values.
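The effect of changing a single value can be reproduced with a short sketch. The helper `correlation` is an illustrative name; the only change between the two calls is the last graduation rate (33.9 replaced by 63.9).

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

expenditures = [8011, 7323, 8735, 7548, 7071, 8248]
rates_original = [64.6, 53.0, 46.3, 42.5, 38.5, 33.9]
rates_revised = [64.6, 53.0, 46.3, 42.5, 38.5, 63.9]  # 33.9 replaced by 63.9

r_before = correlation(expenditures, rates_original)
r_after = correlation(expenditures, rates_revised)
print(round(r_before, 2), round(r_after, 2))  # 0.05 0.42
```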
Find the correlation for these points:
x  -3  -1   1   3   5   7   9
y  40  20   8   4   8  20  40
Compute the correlation coefficient.
r = 0
Sketch the scatterplot.
Does this mean that there is NO relationship between these points?
r = 0, but the data set has a definite relationship!
5) value of r is a measure of the extent to which x and y are linearly related
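A quick check of this example: the points follow a clear curve that is symmetric about x = 3, so the positive and negative products of deviations cancel and r comes out as 0. The helper name `correlation` is illustrative.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson's r written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [-3, -1, 1, 3, 5, 7, 9]
ys = [40, 20, 8, 4, 8, 20, 40]

r_curved = correlation(xs, ys)
print(r_curved)  # 0.0: no LINEAR relationship, although the curve is obvious
```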
Recap the Properties of r:
1. legitimate values of r are -1 ≤ r ≤ 1
2. value of r is not changed by any linear
transformation
3. value of r does not depend on which of
the two variables is labeled x
4. value of r is affected by extreme values
5. value of r is a measure of the extent to
which x and y are linearly related
Example 5.1 Continued
Expenditures      8011  7323  8735  7548  7071  8248
Graduation rates  64.6  53.0  46.3  42.5  38.5  33.9
[Scatterplot: Graduation Rates vs. Expenditures]
Interpret r = 0.05
In order to interpret r, recall the definition of the correlation coefficient: a quantitative assessment of the strength and direction of the linear relationship between bivariate, quantitative data.
There is a weak, positive, linear relationship between expenditures and graduation rates.
Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable?
Consider the following examples:
• The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? These variables are both strongly related to the age of the child.
• Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink more hot chocolate to lower the crime rate? Both are responses to cold weather.
Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.
Correlation does not imply causation
What is the objective of regression analysis?
The objective of regression analysis is to use information about one variable, x, to draw some sort of a conclusion about a second variable, y.
Suppose that we have two variables:
x = the amount spent on advertising
y = the amount of sales for the product during a given period
What question might I want to answer using this data?
• x-variable: is the independent or explanatory variable
• y-variable: is the dependent or response variable
• We will use values of x to predict values of y.
Scatterplots frequently exhibit a linear pattern. When this is the case, it makes sense to summarize the relationship between the variables by finding a line that is as close as possible to the points in the plot.
This is done by calculating the line of best fit: the Least Squares Regression Line (LSRL).
The LSRL is the line that minimizes the sum of the squares of the deviations from the line.
The LSRL is $\hat{y} = a + bx$
• $\hat{y}$ (y-hat) means the predicted y. Be sure to put the hat on the y.
• b is the slope
– it is the approximate amount by which y increases when x increases by 1 unit
• a is the y-intercept
– it is the approximate height of the line when x = 0
– in some situations, the y-intercept has no meaning
The slope of the LSRL is $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
The intercept of the LSRL is $a = \bar{y} - b\bar{x}$
Let’s explore what this means . . .
Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).
Let’s just fit a line to the data by drawing a line through what appears to be the middle of the points.
$\hat{y} = .5x + 4$
Now find the vertical distance from each point to the line.
(0,0):  $\hat{y}$ = .5(0) + 4 = 4     deviation: 0 – 4 = -4
(3,10): $\hat{y}$ = .5(3) + 4 = 5.5   deviation: 10 – 5.5 = 4.5
(6,2):  $\hat{y}$ = .5(6) + 4 = 7     deviation: 2 – 7 = -5
Find the sum of the squares of these deviations.
Sum of the squares = 61.25
Use a calculator to find the line of best fit.
$\hat{y} = \frac{1}{3}x + 3$
Find the vertical deviations from the line.
(0,0): -3     (3,10): 6     (6,2): -3
What is the sum of the deviations from the line? Will it always be zero?
Find the sum of the squares of the deviations from the line.
Sum of the squares = 54
The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
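The comparison between the eyeballed line and the LSRL can be sketched as follows. The function names are illustrative; the slope and intercept formulas are the ones given for the LSRL earlier in the chapter.

```python
def least_squares_line(points):
    """Return (a, b) for y-hat = a + b*x using the LSRL formulas."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x, _ in points))
    a = y_bar - b * x_bar
    return a, b

def sum_squared_deviations(points, a, b):
    """Sum of squared vertical deviations (observed y minus line height)."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)

points = [(0, 0), (3, 10), (6, 2)]

ss_guess = sum_squared_deviations(points, a=4, b=0.5)   # eyeballed line
a_lsrl, b_lsrl = least_squares_line(points)              # a = 3, b = 1/3
ss_lsrl = sum_squared_deviations(points, a_lsrl, b_lsrl)
sum_dev_lsrl = sum(y - (a_lsrl + b_lsrl * x) for x, y in points)

print(ss_guess, round(ss_lsrl, 2), round(sum_dev_lsrl, 10))
```

The eyeballed line gives 61.25 and the LSRL gives 54; the deviations from the LSRL add up to zero, while those from the eyeballed line add up to -4.5.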
Researchers are studying the antioxidant properties of pomegranate to see if it might be helpful in the treatment of cancer. In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with .1% pomegranate fruit extract (PFE), and water supplemented with .2% PFE. The average tumor volume for mice in each group was recorded for several points in time. (x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume (in mm³))
x  11   15   19   23   27
y  150  270  450  580  740
Sketch a scatterplot for this data set.
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x  11   15   19   23   27
y  150  270  450  580  740
Calculate the LSRL and the correlation coefficient.
$\hat{y} = -269.75 + 37.25x$
$r = 0.998$
Interpret the slope and the correlation coefficient in context. Remember that an interpretation is stating the definition in context.
The average volume of the tumor increases by approximately 37.25 mm³ for each day increase in the number of days after injection.
There is a strong, positive, linear relationship between the average tumor volume and the number of days since injection.
Does the intercept have meaning in this context? Why or why not?
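The LSRL and r for the plain-water group can be reproduced from the five data points. This is a sketch using the slope, intercept, and correlation formulas from this chapter; the function name `fit_line` is illustrative.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (a, b, r) for the LSRL y-hat = a + b*x and the correlation r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    r = s_xy / sqrt(s_xx * s_yy)
    return a, b, r

days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

a, b, r = fit_line(days, volume)
print(a, b, round(r, 3))  # -269.75 37.25 0.998
```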
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x  11   15   19   23   27
y  150  270  450  580  740
$\hat{y} = -269.75 + 37.25x$
Predict the average volume of the tumor for 20 days after injection.
$\hat{y} = -269.75 + 37.25(20) = 475.25$ mm³
Predict the average volume of the tumor for 5 days after injection.
$\hat{y} = -269.75 + 37.25(5) = -83.5$ mm³
Can volume be negative?
This is the danger of extrapolation. The least-squares line should not be used to make predictions for y using x-values outside the range in the data set.
Why? It is unknown whether the pattern observed in the scatterplot continues outside the range of x values.
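One way to guard against extrapolation in practice is to have the prediction step report whether the requested x-value lies inside the observed range. The sketch below is an illustration only; the function name `predict_volume` and the choice to return a flag rather than refuse the prediction are assumptions made here.

```python
DAYS_OBSERVED = [11, 15, 19, 23, 27]
INTERCEPT = -269.75  # a from the LSRL on the slide
SLOPE = 37.25        # b from the LSRL on the slide

def predict_volume(day):
    """Return (predicted volume in mm^3, True if the day is an extrapolation)."""
    prediction = INTERCEPT + SLOPE * day
    is_extrapolation = not (min(DAYS_OBSERVED) <= day <= max(DAYS_OBSERVED))
    return prediction, is_extrapolation

pred_20, extrap_20 = predict_volume(20)  # inside the range 11 to 27
pred_5, extrap_5 = predict_volume(5)     # outside the range

print(pred_20, extrap_20)  # 475.25 False
print(pred_5, extrap_5)    # -83.5 True: a negative volume is impossible
```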
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x  11   15   19   23   27
y  150  270  450  580  740
$\hat{y} = -269.75 + 37.25x$
Suppose we want to know how many days after injection of cancer cells would the average tumor size be 500 mm³?
Is this the appropriate regression line to answer this question?
No, the slope of the line for predicting x is $r\frac{s_x}{s_y}$, not $r\frac{s_y}{s_x}$, and the intercepts are almost always different.
Here is the appropriate regression line:
$\hat{x} = 7.277 + .027y$
The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.
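The regression of x on y can be obtained with the same formulas by swapping the roles of the two variables. The sketch below (function name `fit_line` is illustrative) fits both lines and predicts the number of days for an average tumor volume of 500 mm³.

```python
def fit_line(predictor, response):
    """Return (intercept, slope) of the least-squares line for response on predictor."""
    n = len(predictor)
    p_bar, q_bar = sum(predictor) / n, sum(response) / n
    slope = (sum((p - p_bar) * (q - q_bar) for p, q in zip(predictor, response))
             / sum((p - p_bar) ** 2 for p in predictor))
    return q_bar - slope * p_bar, slope

days = [11, 15, 19, 23, 27]
volume = [150, 270, 450, 580, 740]

a_yx, b_yx = fit_line(days, volume)   # y on x: y-hat = -269.75 + 37.25x
a_xy, b_xy = fit_line(volume, days)   # x on y: x-hat = 7.277 + .027y

days_for_500 = a_xy + b_xy * 500
print(round(a_xy, 3), round(b_xy, 3), round(days_for_500, 2))  # 7.277 0.027 20.66
```

Because r is so close to 1 for this data set, solving the y-on-x line backwards for x gives nearly the same answer here; with a weaker correlation the two lines can differ noticeably.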
Pomegranate study continued
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x  11   15   19   23   27
y  150  270  450  580  740
Find the mean of the x-values ($\bar{x}$) and the mean of the y-values ($\bar{y}$).
$\bar{x} = 19$ and $\bar{y} = 438$
Plot the point of averages ($\bar{x}$, $\bar{y}$) on the scatterplot.
Will the point of averages always be on the regression line?
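A short derivation suggests why the answer is yes. It uses only the intercept formula for the LSRL, $a = \bar{y} - b\bar{x}$, and substitutes $x = \bar{x}$ into the line; the second line checks the result with the pomegranate values.

```latex
\hat{y}(\bar{x}) = a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}
\qquad
-269.75 + 37.25(19) = 438
```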
Let’s investigate how the LSRL and correlation coefficient change when different points are added to the data set.
Suppose we have the following data set.
x  4  5  6  7  8
y  2  5  4  6  9
Sketch a scatterplot. Calculate the LSRL and the correlation coefficient.
$\hat{y} = -3.8 + 1.5x$
$r = 0.916$
Let’s investigate how the LSRL and correlation coefficient change when different points are added to the data set.
Suppose we have the following data set.
x  4  5  6  7  8  5
y  2  5  4  6  9  8
Suppose we add the point (5,8) to the data set. What happens to the regression line and the correlation coefficient?
Original:  $\hat{y} = -3.8 + 1.5x$     $r = 0.916$
With (5,8): $\hat{y} = -1.15 + 1.17x$   $r = 0.667$
What happened?
Let’s investigate how the LSRL and correlation coefficient change when different points are added to the data set.
Suppose we have the following data set.
x  4  5  6  7  8  12
y  2  5  4  6  9  12
Suppose we add the point (12,12) to the data set. What happens to the regression line and the correlation coefficient?
Original:     $\hat{y} = -3.8 + 1.5x$      $r = 0.916$
With (12,12): $\hat{y} = -2.24 + 1.225x$   $r = 0.959$
What happened?
Let’s investigate how the LSRL and correlation coefficient change when different points are added to the data set.
Suppose we have the following data set.
x  4  5  6  7  8  12
y  2  5  4  6  9  0
Suppose we add the point (12,0) to the data set. What happens to the regression line and the correlation coefficient?
Original:    $\hat{y} = -3.8 + 1.5x$      $r = 0.916$
With (12,0): $\hat{y} = 6.26 - 0.275x$    $r = -0.248$
What happened?
The correlation coefficient and the
LSRL are both measures that are
affected by extreme values.
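The three experiments above can be rerun in one short sketch. The helper `fit_line` and the dictionary `results` are illustrative names; each entry refits the line after adding one point to the original five.

```python
from math import sqrt

def fit_line(points):
    """Return (a, b, r) for the LSRL y-hat = a + b*x and the correlation r."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b, s_xy / sqrt(s_xx * s_yy)

base = [(4, 2), (5, 5), (6, 4), (7, 6), (8, 9)]

results = {"original": fit_line(base)}
for extra in [(5, 8), (12, 12), (12, 0)]:
    results[extra] = fit_line(base + [extra])

for label, (a, b, r) in results.items():
    print(label, round(a, 2), round(b, 3), round(r, 3))
```

A point far from the others in the x direction, such as (12,0), changes the line and r the most; it even reverses the sign of the slope.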
Pomegranate study revisited
x = number of days after injection of cancer cells in mice assigned to plain water and y = average tumor volume
x  11   15   19   23   27
y  150  270  450  580  740
Minitab, a statistical software package, was used to fit the least-squares regression line. Part of the resulting output is shown below.
The regression equation is
Predicted volume = -269.75 + 37.25 days
Predictor   Coef      SE Coef     T           P
Constant    -269.75   23.421412   -11.51724   0.0014
Days        37.25     1.181454    31.52895    0.000
The Coef value for Constant is the intercept and the Coef value for Days is the slope.
We will discuss what the other numbers mean in Chapter 13.
Assessing the fit of the LSRL
Once the LSRL is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y.
Important questions are:
1. Is the line an appropriate way to summarize the relationship between x and y?
2. Are there any unusual aspects of the data set that we need to consider before proceeding to use the line to make predictions?
3. If we decide to use the line as a basis for prediction, how accurate can we expect predictions based on the line to be?
We will look at graphical and numerical methods to answer these questions.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
x  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
Minitab was used to fit the least-squares regression line. From the partial output, identify the regression line.
Predictor            Coef    SE Coef   T       P
Constant             -7.69   13.33     -0.58   0.582
Distance to debris   3.234   1.782     1.82    0.112
S = 8.67071    R-Sq = 32.0%    R-Sq(adj) = 22.3%
$\hat{y} = -7.69 + 3.234x$
Plot the data, including the regression line.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
x  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
[Scatterplot with LSRL: Distance traveled vs. Distance to debris]
The vertical deviation between the point and the LSRL is called the residual.
Residuals are calculated by subtracting the predicted y from the observed y.
residual = $y - \hat{y}$
If the point is above the line, the residual will be positive. If the point is below the line, the residual will be negative.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
Use the LSRL to calculate the predicted distance traveled. Subtract to find the residuals.
Distance from debris (x)   Distance traveled (y)   Predicted distance traveled ($\hat{y}$)   Residual ($y - \hat{y}$)
6.94                       0.00                    14.76                                     -14.76
5.23                       6.13                    9.23                                      -3.10
5.21                       11.29                   9.16                                      2.13
7.10                       14.35                   15.28                                     -0.93
8.16                       12.03                   18.70                                     -6.67
5.50                       22.72                   10.10                                     12.62
9.19                       20.11                   22.04                                     -1.93
9.05                       26.16                   21.58                                     4.58
9.36                       30.65                   22.59                                     8.06
What does this remind you of?
What does the sum of the residuals equal? Will the sum of the residuals always equal zero?
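The residual table can be regenerated from the raw data. The sketch below refits the LSRL with the formulas from this chapter (rather than using the rounded Minitab coefficients), so the last decimal place of a predicted value may occasionally differ from the table; the variable names are illustrative.

```python
x_debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y_travel = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(x_debris)
x_bar = sum(x_debris) / n
y_bar = sum(y_travel) / n

# Slope and intercept of the LSRL.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(x_debris, y_travel))
     / sum((x - x_bar) ** 2 for x in x_debris))
a = y_bar - b * x_bar

predicted = [a + b * x for x in x_debris]
residuals = [y - y_hat for y, y_hat in zip(y_travel, predicted)]
residual_sum = sum(residuals)

for x, y, y_hat, res in zip(x_debris, y_travel, predicted, residuals):
    print(f"{x:5.2f} {y:6.2f} {y_hat:6.2f} {res:7.2f}")
print("sum of residuals:", round(residual_sum, 6))
```

Apart from floating-point rounding, the residuals from a least-squares line always add up to zero.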
Residual plots
• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can also be graphed against the predicted y-values
• The purpose is to determine if a linear model is the best way to describe the relationship between the x & y variables
• If no pattern exists between the points in the residual plot, then the linear model is appropriate.
[Residual plot 1: Residuals vs. x] This residual plot shows no pattern, so it indicates that the linear model is appropriate.
[Residual plot 2: Residuals vs. x] This residual plot shows a curved pattern, so it indicates that the linear model is not appropriate.
In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.
Distance from debris (x)   Distance traveled (y)   Predicted distance traveled ($\hat{y}$)   Residual ($y - \hat{y}$)
6.94                       0.00                    14.76                                     -14.76
5.23                       6.13                    9.23                                      -3.10
5.21                       11.29                   9.16                                      2.13
7.10                       14.35                   15.28                                     -0.93
8.16                       12.03                   18.70                                     -6.67
5.50                       22.72                   10.10                                     12.62
9.19                       20.11                   22.04                                     -1.93
9.05                       26.16                   21.58                                     4.58
9.36                       30.65                   22.59                                     8.06
Use the values in this table to create a residual plot for this data set. Is a linear model appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food?
Plot the residuals against the distance from debris (x).
[Residual plot: Residuals vs. Distance from debris]
Since the residual plot displays no pattern, a linear model is appropriate for describing the relationship between the distance from debris and the distance a deer mouse will travel for food.
Now plot the residuals against the predicted distance traveled.
[Residual plot: Residuals vs. Predicted distance traveled]
What do you notice about the general scatter of points on this residual plot versus the residual plot using the x-values?
Residual plots can be plotted against either the x-values or the predicted y-values.
Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)
x  10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y  54    40   62    51    55   56   62   42   40   59    51   50
Sketch a scatterplot with the fitted regression line.
[Scatterplot with LSRL: Weight vs. Age; the point at age 28.5 is labeled "Influential observation"]
Do you notice anything unusual about this data set?
What would happen to the regression line if this point is removed?
This point is considered an influential point because it affects the placement of the least-squares regression line.
Let’s examine the following data set:
The following data is for 12 black bears from the Boreal Forest.
x = age (in years) and y = weight (in kg)
x  10.5  6.5  28.5  10.5  6.5  7.5  6.5  5.5  7.5  11.5  9.5  5.5
y  54    40   62    51    55   56   62   42   40   59    51   50
[Scatterplot with LSRL: Weight vs. Age]
An observation is an outlier if it has a large residual.
Notice that this observation has a large residual.
Coefficient of determination
• Denoted by r²
• gives the proportion of variation in y that can be attributed to an approximate linear relationship between x & y
Let’s explore the meaning of r² by revisiting the deer mouse data set.
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
x  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
[Scatterplot: Distance traveled vs. Distance to debris]
Suppose you didn’t know any x-values. What distance would you expect deer mice to travel?
$\bar{y} = 15.938$
What is the total amount of variation in the distance traveled (y-values)? Hint: Find the sum of the squared deviations.
$SSTo = \sum (y - \bar{y})^2$
SS stands for “sum of squares”. So this is the total sum of squares.
Why do we square the deviations?
Total amount of variation in the distance traveled is 773.95 m².
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
x  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
[Scatterplot with LSRL: Distance traveled vs. Distance to debris]
Now suppose you DO know the x-values. Your best guess would be the predicted distance traveled (the point on the LSRL).
By how much do the observed points vary from the LSRL? Hint: Find the sum of the residuals squared.
$SSResid = \sum (y - \hat{y})^2$
The points vary from the LSRL by 526.27 m².
x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food
x  6.94  5.23  5.21   7.10   8.16   5.50   9.19   9.05   9.36
y  0     6.13  11.29  14.35  12.03  22.72  20.11  26.16  30.65
Total amount of variation in the distance traveled is 773.95 m².
The points vary from the LSRL by 526.27 m².
Approximately what percent of the variation in distance traveled can be explained by the regression line?
$r^2 = 1 - \frac{SSResid}{SSTo}$
$r^2 = 1 - \frac{526.27}{773.95} = 0.320$
Or approximately 32%
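The two sums of squares and r² can be computed directly from the deer mouse data. This sketch refits the LSRL from the raw values; the variable names are illustrative.

```python
x_debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]
y_travel = [0, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]

n = len(x_debris)
x_bar = sum(x_debris) / n
y_bar = sum(y_travel) / n

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(x_debris, y_travel))
     / sum((x - x_bar) ** 2 for x in x_debris))
a = y_bar - b * x_bar

# Total variation in y (ignoring x) and variation left over around the LSRL.
ss_total = sum((y - y_bar) ** 2 for y in y_travel)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(x_debris, y_travel))
r_squared = 1 - ss_resid / ss_total

print(round(y_bar, 3), round(ss_total, 2), round(ss_resid, 2), round(r_squared, 3))
```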
Partial output from the regression analysis of deer mouse data:
Predictor            Coef    SE Coef   T       P
Constant             -7.69   13.33     -0.58   0.582
Distance to debris   3.234   1.782     1.82    0.112
S = 8.67071    R-sq = 32.0%    R-sq(adj) = 22.3%
Let’s review the values from this output and their meanings.
The y-intercept (a): This value has no meaning in context since it doesn’t make sense to have a negative distance.
The slope (b): The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter to the nearest debris pile.
The coefficient of determination (r²): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.
What does the number S represent?
The standard deviation (s): This is the typical amount by which an observation deviates from the least squares regression line. It’s found by:
$s_e = \sqrt{\frac{SSResid}{n - 2}}$
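As a quick check of the formula for the standard deviation about the line, the sketch below plugs in the SSResid value reported earlier for the deer mouse data (526.27, with n = 9 observations) and recovers the S value in the Minitab output. The variable names are illustrative.

```python
from math import sqrt

SS_RESID = 526.27  # sum of squared residuals reported for the deer mouse data
N = 9              # number of (x, y) observations

# Standard deviation about the least-squares line: divide by n - 2, not n - 1.
s_e = sqrt(SS_RESID / (N - 2))
print(round(s_e, 2))  # 8.67, matching S = 8.67071 in the output
```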
Let’s examine this data set:
x = representative age
y = average marathon finish time
Age   15      25      35      45      55      65
Time  302.38  193.63  185.46  198.49  224.30  288.71
Create a scatterplot for this data set.
[Scatterplot: Average Finish Time vs. Representative Age]
Because of the curved pattern, a straight line would not accurately describe this relationship.
Since this curve resembles a parabola, a quadratic function can be used to describe the relationship between average finish time and age.
The least-squares quadratic regression is $\hat{y} = a + b_1 x + b_2 x^2$
Using Minitab: $\hat{y} = 462 - 14.2x + 0.179x^2$
This curve minimizes the sum of the squares of the residuals (similar to least-squares linear regression).
Let’s examine this data set:
x = representative age
y = average marathon finish time
Age   15      25      35      45      55      65
Time  302.38  193.63  185.46  198.49  224.30  288.71
Here is the residual plot. Notice the residuals from the quadratic regression.
[Scatterplot with fitted quadratic: Average Finish Time vs. Representative Age]
[Residual plot: Residuals vs. Age]
Since there is no pattern in the residual plot, the quadratic regression is an appropriate model for this data set.
Let’s examine this data set:
x = representative age
y = average marathon finish time
Age   15      25      35      45      55      65
Time  302.38  193.63  185.46  198.49  224.30  288.71
[Scatterplot with fitted quadratic: Average Finish Time vs. Representative Age]
The measure R² is useful for assessing the fit of the quadratic regression.
$R^2 = 1 - \frac{SSResid}{SSTo}$
R² = .921
92.1% of the variation in average marathon finish times can be explained by the approximate quadratic relationship between average finish time and age.
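The quadratic fit and its R² can be reproduced without statistical software by solving the least-squares normal equations. The sketch below is one possible approach, not the method Minitab uses internally; the function names `solve_linear_system` and `fit_quadratic` are illustrative.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution

def fit_quadratic(xs, ys):
    """Return [a, b1, b2] minimizing the sum of squared residuals."""
    # Normal equations: sums of powers of x on the left, sums of x^i * y on the right.
    power_sums = [sum(x ** k for x in xs) for k in range(5)]
    matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(3)]
    return solve_linear_system(matrix, rhs)

ages = [15, 25, 35, 45, 55, 65]
times = [302.38, 193.63, 185.46, 198.49, 224.30, 288.71]

a, b1, b2 = fit_quadratic(ages, times)
fitted = [a + b1 * x + b2 * x * x for x in ages]
y_bar = sum(times) / len(times)
ss_resid = sum((y - f) ** 2 for y, f in zip(times, fitted))
ss_total = sum((y - y_bar) ** 2 for y in times)
r_squared = 1 - ss_resid / ss_total

print(round(a, 1), round(b1, 2), round(b2, 3), round(r_squared, 3))
```

The coefficients come out close to 462, -14.2, and 0.179, and R² is about .921, in agreement with the slide.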
Depending on the data set, other regression
models, such as cubic regression, may be used.
Statistical software (like Minitab) is commonly
used to calculate these regression models.
Another method for fitting regression
models to non-linear data sets is to
transform the data, making it linear.
Then a least-squares regression line can
be fit to the transformed data.
Commonly Used Transformations
Transformation                                Equation
No transformation                             $\hat{y} = a + bx$
Square root of x                              $\hat{y} = a + b\sqrt{x}$
Log of x *                                    $\hat{y} = a + b\log_{10}(x)$
Reciprocal of x                               $\hat{y} = a + b\left(\frac{1}{x}\right)$
Log of y * (Exponential growth or decay)      $\log_{10}(\hat{y}) = a + bx$
*Natural log may also be used
Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume
x  11  15  19  23   27   31   35   39
y  40  75  90  210  230  330  450  600
Sketch a scatterplot for this data set.
[Scatterplot: Average tumor volume vs. Number of days]
There appears to be a curve in the data points.
Let’s use a transformation to linearize the data.
Since the data appears to be exponential growth, let’s try the “log of y” transformation.
Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume
x       11    15    19    23    27    31    35    39
Log(y)  1.60  1.88  1.95  2.32  2.36  2.52  2.65  2.78
Sketch a scatterplot of the log(y) and x.
[Scatterplot: Log of Average tumor volume vs. Number of days]
Notice that the relationship now appears linear. Let’s fit an LSRL to the transformed data.
The LSRL is
$\log \hat{y} = 1.226 + 0.041x$
Pomegranate study revisited:
x = number of days after injection of cancer cells in mice assigned to .2% PFE and y = average tumor volume
x       11    15    19    23    27    31    35    39
Log(y)  1.60  1.88  1.95  2.32  2.36  2.52  2.65  2.78
Sketch a scatterplot of the log(y) and x.
[Scatterplot with LSRL: Log of Average tumor volume vs. Number of days]
What would the predicted average tumor size be 30 days after injection of cancer cells?
The LSRL is
$\log \hat{y} = 1.226 + 0.041x$
$\log \hat{y} = 1.226 + 0.041(30)$
$\log \hat{y} = 2.456$
$\hat{y} = 10^{2.456} = 285.76$ mm³
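The transform, fit, and back-transform steps can be sketched as follows. The fit here uses unrounded base-10 logs, so its coefficients differ slightly in the third decimal place from the slide, which worked from logs rounded to two decimals. The prediction step uses the slide's rounded coefficients so that it reproduces 285.76; because the back-transformation is exponential, small rounding differences in the coefficients change the predicted volume by several mm³. All variable names are illustrative.

```python
from math import log10

days = [11, 15, 19, 23, 27, 31, 35, 39]
volume = [40, 75, 90, 210, 230, 330, 450, 600]

# Step 1: transform y with a base-10 log.
log_volume = [log10(y) for y in volume]

# Step 2: fit an LSRL to (x, log y).
n = len(days)
x_bar = sum(days) / n
ly_bar = sum(log_volume) / n
b_fit = (sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(days, log_volume))
         / sum((x - x_bar) ** 2 for x in days))
a_fit = ly_bar - b_fit * x_bar
print(round(a_fit, 3), round(b_fit, 3))  # close to 1.226 and 0.041

# Step 3: predict log(y) at 30 days with the slide's rounded coefficients,
# then undo the log by raising 10 to that power.
log_prediction = 1.226 + 0.041 * 30
prediction_slide = 10 ** log_prediction
print(round(log_prediction, 3), round(prediction_slide, 2))  # 2.456 285.76
```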
Another useful transformation is the power transformation. The power transformation ladder and the scatterplot (both below) can be used to help determine what type of transformation is appropriate.
Power Transformation Ladder
Power   Transformed Value           Name
3       (Original value)³           Cube
2       (Original value)²           Square
1       (Original value)            No transformation
½       $\sqrt{\text{Original value}}$      Square root
1/3     $\sqrt[3]{\text{Original value}}$   Cube root
0       Log(Original value)         Logarithm
-1      1 / (Original value)        Reciprocal
[Scatterplot showing curves labeled 1 and 2]
Suppose that the scatterplot looks like the curve labeled 1. Then we would use a power that is up the ladder from the no transformation row for both the x and y variables.
Suppose that the scatterplot looks like the curve labeled 2. Then we would use a power that is up the ladder from the no transformation row for the x variable and a power down the ladder for the y variable.
Logistic Regression (Optional)
• Can be used if the dependent variable is categorical with just two possible values
• Used to describe how the probability of “success” changes as a numerical predictor variable, x, changes
• With p denoting the probability of success, the logistic regression equation is
$p = \frac{e^{a + bx}}{1 + e^{a + bx}}$
where a and b are constants
The graph of this equation has an “S” shape.
For any value of x, the value of p is always between 0 and 1.
In a study on wolf spiders, researchers were interested in what variables might be related to a female wolf spider’s decision to kill and consume her partner during courtship or mating. Data was collected for 53 pairs of courting wolf spiders. (Data listed on page 287)
x = the difference in body width (female – male)
y = cannibalism; coded 0 for no cannibalism and 1 for cannibalism
Note that the plot was constructed so that if two points fell in the exact same location they would be offset a little bit so that all points would be visible (called jittering).
Minitab was used to construct a scatterplot and to fit a logistic regression to the data.
$p = \frac{e^{-3.08904 + 3.06928x}}{1 + e^{-3.08904 + 3.06928x}}$
This equation can be used to predict the probability of the male spider being cannibalized based on the difference in size.
What is the probability of cannibalism if the male & female spiders are the same width (difference of 0)?
$p = \frac{e^{-3.08904 + 3.06928(0)}}{1 + e^{-3.08904 + 3.06928(0)}} = 0.044$
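The logistic equation can be evaluated with a few lines of Python. The signs of the coefficients were lost in this transcript: the intercept must be negative to give p = 0.044 at x = 0, while the positive sign on the slope is an assumption made here (a larger female relative to the male making cannibalism more likely). The function name `logistic_probability` is illustrative.

```python
from math import exp

A = -3.08904  # intercept; negative sign is consistent with p = 0.044 at x = 0
B = 3.06928   # slope; positive sign is ASSUMED, the transcript lost the sign

def logistic_probability(x, a=A, b=B):
    """Probability of 'success' from the logistic regression equation."""
    linear = a + b * x
    return exp(linear) / (1 + exp(linear))

p_same_width = logistic_probability(0)
print(round(p_same_width, 3))  # 0.044

# The S-shaped curve stays strictly between 0 and 1 for any x.
for diff in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(diff, round(logistic_probability(diff), 3))
```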