Chapter 3 Notes


EXAMINING
RELATIONSHIPS
Chapter 3
LESSON 3-1
VARIABLES
Response variable (dependent)
 Measures the outcome of a study.
Explanatory variable (independent)
 Attempts to explain the observed outcomes
The most common treatment for breast cancer was once the
removal of the breast. It is now usual to remove only the tumor
and nearby lymph nodes, followed by radiation. The change in
policy was due to a large medical experiment that compared
two treatments. Some breast cancer patients, chosen at
random, were given each treatment. What are the explanatory
and response variables? Are they categorical or quantitative?
 Explanatory variable
Treatment old or new
 Categorical
Response Variables
Survival Time
Quantitative
SCATTERPLOT
A scatterplot shows the relationship between two quantitative variables
measured on the same individuals .
Always plot the explanatory variable (if there is one) on the
horizontal axis (x-axis) of a scatterplot.
The response variable on the vertical axis (y-axis).
If there is no explanatory-response distinction, either
variable can go on the horizontal axis.
Manatees are large, gentle sea creatures that live along the Florida coast. Many
manatees are killed or injured by powerboats. Here are data on power boat
registrations (in thousands) and the number of manatees killed by boats in
Florida in the years 1977 to 1990:
A. We want to examine the relationship between number of powerboats and
number of manatees killed by boats. Which is the explanatory variable?
Explanatory Variable = Number of powerboat registrations
MANATEES KILLED BY BOATS IN FLORIDA
FROM 1977 TO 1990
Year    Powerboats (thousands)    Manatees killed
1977    447                       13
1978    460                       21
1979    481                       24
1980    498                       26
1981    513                       24
1982    512                       20
1983    526                       15
1984    559                       34
1985    585                       33
1986    615                       33
1987    645                       39
1988    675                       43
1989    711                       50
Make a scatterplot of these data. (Be sure to label the axes with variable
names, not just x and y.) What does the scatterplot show about the
relationship between these variables?
Calculator: 2nd Y= (STAT PLOT), then ZOOM 9 (ZoomStat)
The plot shows a moderately strong linear
relationship. As registrations increase, the
number of manatee deaths also tends to
increase.
Describing A Scatterplot
Direction (think slope)
Positive Direction
Negative Direction
Form
Linear Relationship
Curved Relationship
Clusters
Scatter
How closely the points follow a clear form.
Outliers
An individual value that falls outside the overall pattern of
the relationship
A .Describe the direction of the relationship.
Are the variables positively or negatively
associated?
The variables are positively associated; that
is, as the number of powerboat registrations increases,
the number of manatees killed also increases.
B. Describe the form of the association. Is it
linear?
The association is linear.
C. Describe the strength of the relationship. Can the number of manatees killed be
predicted accurately from power boat registrations? If powerboat registrations
remained constant at 719,000, about how many manatees would be killed by
boats each year?
The association is relatively strong. The number of manatees killed can be
predicted accurately from the number of powerboat registrations. If the number
of registrations remains constant at 719,000, we would expect between 45 and 50
manatees to be killed per year.
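As a rough check on the answer to part C, the prediction at 719 thousand registrations can be computed from the table above. This is only a sketch: it uses the least-squares line that these notes introduce later in the chapter, it includes only the 13 years listed in the table, and the names `powerboats`, `manatees`, and `fit_line` are made up for illustration.

```python
from statistics import mean

# Data as listed in the table above (registrations in thousands).
powerboats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 615, 645, 675, 711]
manatees = [13, 21, 24, 26, 24, 20, 15, 34, 33, 33, 39, 43, 50]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b0, b1 = fit_line(powerboats, manatees)
predicted_deaths = b0 + b1 * 719  # 719 thousand registrations
print(round(predicted_deaths, 1))
```

The printed value should land inside the 45 to 50 range quoted in the answer.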
Are hot dogs that are high in calories also high in salt? The
following slide is a scatterplot of the calories and salt content (measured as
milligrams of sodium) in 17 brands of meat hot dogs.
A. Roughly what are the lowest and highest calorie counts among these
brands? Roughly what is the sodium level in the brands with the fewest
and with the most calories?
Lowest: about 107 calories with about 145 mg of sodium
Highest: about 195 calories with about 510 mg of sodium
B. Does the scatterplot show a clear positive or negative association? Say
in words what this association means about calories and salt in hot dogs.
There is a positive association: high-calorie hot dogs tend to be high in salt,
and low-calorie hot dogs tend to have low sodium.
C. Are there any outliers? Is the relationship (ignoring outliers) roughly
linear in form? Still ignoring outliers, how strong would you say the
relationship between calories and sodium is?
The lower-left point is an outlier, but ignoring it we have a linear pattern.
LESSON 3-2
CORRELATION
Look at these
scatterplots, what can
you tell me about them?
They are the same graph;
I just changed the scale,
so be careful when
describing things. So we
will use a number,
the CORRELATION, to
describe our plots
accurately.
CORRELATION
 Correlation measures the direction and strength of the linear
relationship between two quantitative variables .
Correlation is usually written as r.
$$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
The lengths of two bones in five fossil specimens of the extinct beast
Archaeopteryx:
Femur   = 56 59 64 74
Humerus = 63 70 72 84
A. Find the correlation r
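The formula for r can be applied directly, as in the sketch below. Two caveats: it uses only the four femur/humerus pairs that actually appear in these notes, so the result describes those four pairs only, and the function name `correlation` is just an illustrative choice.

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = (1 / (n - 1)) * sum of products of standardized x and y values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

# Only the four pairs that appear in these notes.
femur = [56, 59, 64, 74]
humerus = [63, 70, 72, 84]

r = correlation(femur, humerus)
print(round(r, 3))
```

The result is close to 1, which fits a strong positive linear relationship between the two bone lengths. Swapping the two lists gives the same r, since correlation does not distinguish explanatory from response variables.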
FACTS ABOUT CORRELATION
 Correlation makes no distinction between explanatory and
response variable.
 Correlation requires both variables to be quantitative.
 Correlation (r) itself has no unit of measurement; it's just a
number.
 Positive (r) indicates positive association.
 Negative (r) indicates negative association.
 The correlation (r) is always between -1 and 1 .
 The closer (r) is to +1 , the stronger the evidence of positive
association between two variables.
 The closer (r) is to -1 , the stronger the evidence of negative
association between two variables.
 An (r) close to 0 does not rule out a strong relationship
between x and y; there could still be a strong relationship, but
one that is not linear.
 Correlation is strongly affected by outlying observations.
POSITIVE LINEAR CORRELATION
Perfect Positive Linear Correlation: r = 1
Strong Positive Linear Correlation: r ≈
Weak Positive Linear Correlation: r ≈ .4
NEGATIVE LINEAR CORRELATION
Perfect Negative Linear Correlation: r = -1
Strong Negative Linear Correlation: r ≈ -.9
Weak Negative Linear Correlation: r ≈ -.4
NO LINEAR CORRELATION
No Linear Correlation: r is close to 0
DESCRIBING THE STRENGTH OF A LINEAR RELATIONSHIP
Strong: r between -1 and -.8, or between .8 and 1
Moderate: r between -.8 and -.5, or between .5 and .8
Weak: r between -.5 and .5
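The strength scale can be written as a small lookup, sketched below. The cutoffs .5 and .8 come from the scale; how to label a value that falls exactly on a cutoff is an assumption made here, and `describe_strength` is a hypothetical helper name.

```python
def describe_strength(r):
    """Label a correlation using the cutoffs on the scale above (.5 and .8)."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign gives direction
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate"
    return "weak"

for r in (-0.9, 0.4, 0.6):
    print(r, describe_strength(r))
```

These labels are a rule of thumb for describing a scatterplot, not a formal test.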
Do people with larger brains have higher IQ scores? A study looked at 40 volunteer
subjects, 20 men and 20 women. Brain size was measured by magnetic resonance
imaging. Table 3-3 gives the data. The MRI count is the number of “pixels” the
brain covered in the image. IQ was measured by the Wechsler test.
A) Make a scatterplot of IQ score versus MRI count, using distinct symbols for
men and women. In addition, find the correlation between IQ and MRI for all 40
subjects, for the men alone, and for the women alone.
r (all) = .3576
r (men) = .4984
r (women) = .3257
B) Men are larger than women on the average, so they have larger brains. How is this
size effect visible in your graph?
The points for men are generally located on the right side of the plot, while the
women’s points are generally on the left.
C) Your result in (b) suggests separating men and women in looking at the relationship
between brain size and IQ. Use your work in (a) to comment on the nature and
strength of this relationship for women and for men.
The correlations for men and women suggest that there is a moderate positive
association for men and a weak one for women. However, one significant feature of
the data that can be observed in the scatterplot is that the sample group was highly
stratified; that is, there were 10 men and 10 women with high IQs (at least 130),
while the other 10 of each gender had IQs of no more than 103. The men’s higher
correlation can be attributed partly to the two subjects with large brains and 103 IQs
(which are high relative to the low-IQ group). The men’s correlation might not remain so
high with a larger sample.
LESSON 3-3
LEAST-SQUARES REGRESSION
BEST FIT LINE
 Is a straight line (equation) that describes how a response
variable y changes as an explanatory variable x changes.
 A best fit line (equation) is used to predict the value of y for
a given value of x.
 A best fit line, unlike correlation, requires that we have an
explanatory variable and a response variable.
BEST FIT LINE
Is the line that comes closest to all the
points.
y is the actual value.
$\hat{y}$ is the predicted value.
The residual is the difference between the
observed value and the associated
predicted value.
A negative residual shows that the model
overestimates.
A positive residual shows that the model
underestimates.
“BEST FIT” MEANS LEAST-SQUARES
The line of “Best Fit” is the line for which the
sum of the squared residuals is the smallest
The line of “Best Fit” is called a Least-Squares Regression Line (LSRL)
Equation of LSRL
The line must go through the point $(\bar{x}, \bar{y})$.
Equation: $\hat{y} = b_0 + b_1 x$
Slope: $b_1 = r \frac{s_y}{s_x}$
Intercept: $b_0 = \bar{y} - b_1 \bar{x}$
Keeping water supplies clean requires regular measurement of levels of
pollutants. The measurements are indirect: a typical analysis involves forming
a dye by a chemical reaction with the dissolved pollutant, then passing light
through the solution and measuring its “absorbance.” To calibrate such
measurements, the laboratory measures known standard solutions and uses
regression to relate absorbance to concentration. Here is a series of data on
the absorbance for different levels of nitrates. Nitrates are measured in
milligrams per liter of water.
Nitrates (mg/L)   50    50    100   200   400   800   1200   1600   2000   2000
Absorbance        7.0   7.5   12.8  24.0  47.0  93.0  138.0  183.0  230.0  226.0
Summary statistics (N = nitrates, A = absorbance):
$r = .9999$, $\bar{x}_N = 840$, $s_N = 802.704$, $\bar{y}_A = 96.83$, $s_A = 90.953$
How do I get these from the calculator? 2-Var Stats.
A. What is the equation of the least-squares line for predicting
absorbance from concentration? Find the slope and intercept.
$b_1 = r \frac{s_y}{s_x} = .9999 \left( \frac{90.953}{802.704} \right) = .1133$
$b_0 = \bar{y} - b_1 \bar{x} = 96.83 - (.1133)(840) = 1.658$
$\hat{y} = 1.658 + .1133x$
If the lab analyzed a specimen with 500 milligrams of nitrates per
liter, what do you expect the absorbance to be? Based on your
plot and the correlation, do you expect your predicted absorbance
to be very accurate?
$\hat{y} = 1.658 + .1133x = 1.658 + .1133(500) = 58.31$
This prediction should be very accurate since the
relationship is so strong.
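The calibration line and the prediction at 500 mg/L can be reproduced from the raw data, as sketched below. The variable names are illustrative, and the code carries unrounded intermediate values, so the last digit may differ slightly from the hand calculation with rounded summary statistics.

```python
from statistics import mean, stdev

nitrates = [50, 50, 100, 200, 400, 800, 1200, 1600, 2000, 2000]  # mg per liter
absorbance = [7.0, 7.5, 12.8, 24.0, 47.0, 93.0, 138.0, 183.0, 230.0, 226.0]

x_bar, y_bar = mean(nitrates), mean(absorbance)
s_x, s_y = stdev(nitrates), stdev(absorbance)

# Correlation from the standardized-product formula.
n = len(nitrates)
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(nitrates, absorbance)) / (n - 1)

b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept
y_hat_500 = b0 + b1 * 500   # predicted absorbance at 500 mg/L

print(round(b1, 4), round(b0, 3), round(y_hat_500, 2))
```

Since 500 mg/L lies well inside the observed range of 50 to 2000 mg/L, this prediction is not an extrapolation.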
ASSESSING THE ACCURACY OF THE
LINEAR MODEL
 The standard deviation and r² are numerical measures used in the
assessment of how well the model fits.
 r² - Coefficient of Determination
 Measure of the proportion of variability in the y variable that can be
“explained” by the linear relationship between x and y.
 100*r² is the percentage of variation in y that can be attributed to an
approximate linear relationship between x and y.
 r² is a number between 0 and 1. The closer r² is to 1, the better the line
describes how the change in the explanatory variable affects the value of
the response variable, meaning a better fit.
 r² is the square of the linear correlation coefficient for the least-squares
regression model.
A natural measure of variation about the least-squares regression line is
the sum of the squared residuals.
It measures the amount of variation in y that cannot be explained by
the linear relationship between x and y.
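One way to see r² as "variation explained" is to compare the sum of squared residuals with the total variation in y. The sketch below does this for the nitrate calibration data from earlier in these notes; the names `sse` and `sst` are conventional labels chosen here, not notation from the notes.

```python
from statistics import mean

# Nitrate calibration data from earlier in these notes.
x = [50, 50, 100, 200, 400, 800, 1200, 1600, 2000, 2000]
y = [7.0, 7.5, 12.8, 24.0, 47.0, 93.0, 138.0, 183.0, 230.0, 226.0]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Variation left over after fitting the line (sum of squared residuals).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
# Total variation in y about its mean.
sst = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - sse / sst
print(round(100 * r_squared, 2), "percent of the variation in y is explained")
```

For a least-squares line this ratio agrees with the square of the correlation r, which is why the two descriptions of r² are interchangeable.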
A study of class attendance and grades among first year students at a state
university showed that in general students who attended a higher percent of
their classes earned higher grades. Class attendance explained 16% of the
variation in grade index among students. What is the numerical value of the
correlation between percent of classes attended and grade index?
$r^2 = .16$
$r = \sqrt{.16} = .40$
Higher attendance goes with higher grades, so the
correlation must be positive.
Some people think that the behavior of the stock market in January predicts its
behavior for the rest of the year. Take the explanatory variable x to be the percent
change in a stock market index in January and the response variable y to be the
change in the index for the entire year. We expect a positive correlation between x
and y because the change in January contributes to the full year’s change.
Calculation from data for the years 1960 to 1997 gives
$\bar{x} = 1.75\%$, $s_x = 5.36\%$
$\bar{y} = 9.07\%$, $s_y = 15.35\%$
$r = .596$
A. What percent of the observed variation in yearly changes in the index is
explained by a straight-line relationship with the change during January?
$r = .596$, so $r^2 = .355 = 35.5\%$
B. What is the equation of the least-squares line for
predicting full-year change from January change?
$b_1 = r \frac{s_y}{s_x} = .596 \left( \frac{.1535}{.0536} \right) = 1.707$
$b_0 = \bar{y} - b_1 \bar{x} = .0907 - (1.707)(.0175) = 0.06083 = 6.083\%$
$\hat{y} = 6.083\% + 1.707x$
C. The mean change in January is $\bar{x} = 1.75\%$.
Do you have to use the regression line to
predict the change in the index in a year
in which the index rises 1.75% in
January?
No. The predicted change is $\bar{y} = 9.07\%$,
since the regression line must pass
through $(\bar{x}, \bar{y})$.
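The three answers can be checked from the summary statistics alone, as in this sketch. All quantities are kept in percent, and `predict` is a hypothetical helper name introduced for illustration.

```python
# Summary statistics from the notes, all in percent.
x_bar, y_bar = 1.75, 9.07
s_x, s_y = 5.36, 15.35
r = 0.596

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # intercept, in percent

def predict(january_change):
    """Predicted full-year percent change for a given January percent change."""
    return b0 + b1 * january_change

print(round(r ** 2, 3))           # proportion of variation explained
print(round(b1, 3), round(b0, 3)) # slope and intercept
print(round(predict(x_bar), 2))   # prediction at the mean January change
```

Because the intercept is defined as the mean of y minus the slope times the mean of x, plugging in the mean January change always returns the mean full-year change, whatever the slope is.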
ASSUMPTIONS AND CONDITIONS
Check the scatterplot
The shape must be linear or we can’t use regression at all.
Watch out for outliers.
Outlying values have large residuals and squaring makes their
influences that much greater.
Outlying points can dramatically change a regression model.
They can change the sign of the slope, misleading us about the
underlying relationship between variables.
An r² of 100%
You may have accidentally regressed two variables that measure the
same thing.
Don’t extrapolate beyond the data.
A linear model will often do a reasonable job of summarizing a relationship
in the narrow range of observed x-values.
Beware of predicting y-values for x-values that lie outside the range of the
original data.
If you must extrapolate into the future, at least don’t believe that the
prediction will come true!
Don’t infer that x causes y just because there is a good linear model for their
relationship.
Correlation and regression describe only linear relationships.
You can do calculations on any two quantitative variables, but the results are
only useful if the scatterplot shows a linear pattern.
Correlation is not resistant. Look for unusual observations.
RESIDUALS
 A residual is the difference between an observed value of the
response variable and the value predicted by the regression
line.
residual = observed - predicted
$= y - \hat{y}$
Plotting the residuals
A residual plot is a scatterplot of the (x, residual) pairs.
A residual plot is a good place to start when assessing the
appropriateness of the regression line.
Residuals always sum to 0, so the mean of the residuals
also equals 0.
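The claim that least-squares residuals sum to zero can be checked numerically. The sketch below fits a line to the manatee data from Lesson 3-1 (the 13 years listed there) and builds the (x, residual) pairs that a residual plot would display; the variable names are illustrative.

```python
from statistics import mean

# Manatee data from Lesson 3-1 (registrations in thousands, deaths per year).
x = [447, 460, 481, 498, 513, 512, 526, 559, 585, 615, 645, 675, 711]
y = [13, 21, 24, 26, 24, 20, 15, 34, 33, 33, 39, 43, 50]

x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# residual = observed - predicted, paired with x for a residual plot
residual_pairs = [(xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y)]
residual_sum = sum(res for _, res in residual_pairs)

print(abs(residual_sum) < 1e-9)  # zero up to floating-point rounding
```

Plotting `residual_pairs` with any plotting tool gives the residual plot described above.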
CALCULATE RESIDUAL
 We have a study of fat gained due to change in exercise.
 Fat gain = 3.505 - .00344(Change)
 One subject's Change rose by 135 calories. That subject
gained 2.7 kg of fat.
 What is the predicted fat gain for a Change of 135 calories?
 3.505 - .00344(135)
 = 3.04 kg
 So the residual is actual - predicted:
2.7 - 3.04 = -.34 kg
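The same residual calculation, written out as a short sketch. The fitted line and the subject's values come from the slide; the function name `predicted_fat_gain` is made up for illustration.

```python
def predicted_fat_gain(change):
    """Fitted line from the notes: fat gain (kg) = 3.505 - .00344 * Change."""
    return 3.505 - 0.00344 * change

observed = 2.7                        # kg of fat actually gained
predicted = predicted_fat_gain(135)   # Change of 135 calories
residual = observed - predicted       # observed minus predicted

print(round(predicted, 2), round(residual, 2))
```

The residual is negative because the subject gained less fat than the line predicted.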
Residuals
If your residual is positive, what has happened?
The actual value exceeds the prediction, so the model
underestimates.
What about when it is negative?
Then the actual value falls short of the predicted value, so the model
overestimates.
RESIDUAL PLOT
 Determine whether a linear model is appropriate to describe
the relationship between the explanatory and response
variables.
 Residuals are what is “left over” after the model describes the
relationship; they often reveal subtleties that were not clear
from a plot of the original data.
 Determine whether the variance of the residuals is constant .
 Check for outliers
RESIDUAL PLOTS – UNIFORM
The uniform scatter of points indicates that the regression line is
a good model.
RESIDUAL PLOTS –CURVED
The residuals have a curved pattern, so a straight line is an
inappropriate model
RESIDUAL PLOTS – INCREASING/DECREASING
The response variable y has more spread for larger values
of the explanatory variable x, so prediction will be less
accurate when x is large.
INFLUENTIAL OBSERVATIONS
 An outlier is an observation that lies outside the overall
pattern of the other observations.
 An observation is influential for a statistical calculation if
removing it would markedly change the result of the
calculation.
 Points that are outliers in the x direction of a scatterplot are
often influential for the least-squares regression line.
RESIDUAL PLOT –INDIVIDUAL POINTS
Individual points with large residuals (Child
19) are outliers.
Individual points that are extreme in the x
direction (Child 18) are influential
observations.
Lean body mass as a predictor of metabolic rate. Exercise 3.12, page 132,
provides data from a study of dieting for 12 women and 7 men. We
explore the data further.
Type in mass to L1 and rate to L2.
A. Make a scatterplot
B. Perform least-squares regression
on your calculator and
record the equation and the
correlation. Lean body mass
explains what percent of the
variation in metabolic rate
for women?
Lean body mass explains about 76.82% of
the variation in metabolic rate.
C. Does the least-squares line provide an adequate model for the data?
We will make a residual plot on the calculator and look at the pattern.
Graphing Residuals
Turn on Stat Plot 1
Y1 → Vars/Y-Vars/Function
ZOOM 9
C. From the residual plot, the line does appear to provide an
adequate model. The residuals are scattered about the
horizontal axis and no patterns are evident.