STT 200 (Section 102)


STT 315
Acknowledgement: The author is indebted to Dr. Ashok Sinha, Dr. Jennifer Kaplan, and Dr. Parthanil Roy for allowing him to use and edit some of their slides.
What is the goal?
• Until now we have studied one variable at a time. Topics covered: histogram, box plot, pie chart; center, shape, and spread.
• This lecture deals with two or more variables; we will mainly discuss how to display and measure the relationship between two variables.
• Exploring relationships (or associations) between two quantitative variables:
• by drawing a picture (known as a scatterplot), and
• by using a numerical summary (known as the correlation coefficient, or simply the correlation).
Example: Height and Weight
• How is the weight of an individual related to his/her height?
• Typically, one can expect a taller person to be heavier.
• Is this supported by the data? If yes, how do we determine this "association"?
• The price of onions and their demand in the market?
• The supply and price of a commodity in the market? The lower the supply, the greater the price increase.
What is a scatterplot?
• A scatterplot is a diagram used to display the values of two quantitative variables from a dataset.
• The data are displayed as a collection of points, with the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
Example 1: Scatterplot of height and weight
Example 2: Scatterplot of hours watching TV and test scores
Looking at Scatterplots
We look at the following features of a scatterplot:
• Direction (positive or negative)
• Form (linear, curved)
• Strength (of the relationship)
• Unusual features
Compare: when we describe histograms we mention
• Shape
• Center
• Spread
• Outliers
Asking Questions about a Scatterplot
• Are test scores higher or lower when TV watching time is longer? Direction (positive or negative association).
• Does the cloud of points seem to show a linear pattern, a curved pattern, or no pattern at all? Form.
• If there is a pattern, how strong does the relationship look? Strength.
• Are there any unusual features (two or more groups, or outliers)?
Positive and Negative Associations
• Positive association means that for most of the data points, a higher value of one variable corresponds to a higher value of the other variable, and a lower value of one variable corresponds to a lower value of the other.
• Negative association means that for most of the data points, a higher value of one variable corresponds to a lower value of the other variable, and vice versa.
This association is:
A. positive
B. negative.
This association is:
A. positive
B. negative.
Curved Scatterplot
• When the plot shows a clear curved pattern,
we shall call it a curved scatterplot.
Linear Scatterplot
• Unless we see a curve, we shall call the scatterplot
linear.
• We shall soon learn how to quantify the strength of
the linear form of a scatterplot.
Which one has the stronger linear association?
A. left one
B. right one
Answer: B, because in the right graph the points are closer to a straight line.
Which one has the stronger linear association?
A. left one
B. right one
Hard to say; we need a measure of linear association.
Unusual Feature: Presence of Outlier
• This scatterplot clearly has an outlier.
Unusual Feature: Two Subgroups
• This scatterplot clearly has two subgroups.
Explanatory and Response Variables
• The main variable of interest (the one which we would like to predict) is called the response variable.
• The other variable is called the explanatory variable or the predictor variable.
• Typically we plot the explanatory variable along the horizontal axis (x-axis) and the response variable along the vertical axis (y-axis).
Example: Scatterplot of height and weight
In this case, we are trying to predict the weight based on the height
of a person. Therefore
• weight is the response variable, and
• height is the explanatory variable.
How to measure linear association?
• Use the correlation coefficient, or simply the correlation.
• Correlation is a number that describes the strength of the linear association between two quantitative variables.
• Suppose x and y are two variables. Let zx and zy denote the z-scores of x and y respectively. Then the correlation is defined as:
r = (1/(n − 1)) Σ zx·zy.
• We shall use the TI 83/84 Plus to compute the correlation coefficient (to be discussed later).
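The z-score definition of r translates directly into code. Below is a minimal Python sketch (not part of the original slides; the `correlation` helper is ours), applied to the fat and sodium data from the burger example later in these slides:

```python
from statistics import mean, stdev

def correlation(x, y):
    """r = (1/(n-1)) * sum of zx * zy, using sample standard deviations."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - mx) / sx * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# Fat (g) and sodium (mg) for the 7 burgers analyzed later in these slides
fat = [19, 31, 34, 35, 39, 39, 43]
sodium = [920, 1500, 1310, 860, 1180, 940, 1260]
print(round(correlation(fat, sodium), 3))  # → 0.199
```

This reproduces the r = 0.199 value that the TI 83/84 Plus reports for fat vs. sodium.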
Correlation is unit-free
Because correlation is calculated using standardized scores,
• it is unit-free (i.e. it does not have any units);
• it does not change if the data are rescaled.
In particular, this means that correlation does not depend on the units of the two quantitative variables.
For example, if you are computing the correlation between the heights and weights of a group of individuals, it does not matter whether the heights are measured in inches or centimeters, or the weights in pounds or kilograms.
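This invariance is easy to check numerically. A small Python sketch (the height/weight numbers are made up for illustration, and `correlation` is our own helper implementing the z-score definition):

```python
from statistics import mean, stdev

def correlation(x, y):
    """r via z-scores: unit-free by construction."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((xi - mx) / sx * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (len(x) - 1)

heights_in = [60, 64, 66, 70, 72]       # inches (hypothetical data)
weights_lb = [115, 130, 140, 165, 180]  # pounds

r_original = correlation(heights_in, weights_lb)
r_rescaled = correlation([h * 2.54 for h in heights_in],     # inches -> cm
                         [w * 0.4536 for w in weights_lb])   # lb -> kg
print(round(r_original, 10) == round(r_rescaled, 10))  # → True
```

Rescaling both variables leaves r unchanged (up to floating-point noise).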
Properties of Correlation
• Correlation is unit-free.
• Correlation does not change if the data are rescaled.
• It is a number between -1 and 1.
• The sign of the correlation indicates the direction of
the linear association (if the association is positive then
so is the correlation and if the association is negative
then so is the correlation).
• The closer the correlation is to -1 or 1, the stronger is
the linear association.
• Correlations near 0 indicate weak linear association.
Words of Warning about Correlation
• Correlation measures the linear association between two quantitative variables.
• Correlation measures only the strength of the linear association.
• If the correlation between two variables is 0, it only means that they are not linearly associated. They may still be nonlinearly associated.
• To measure the strength of the linear association, only the absolute value of the correlation matters.
A correlation of −0.8 is a stronger linear association than a correlation of 0.7.
The negative and positive signs of the correlation only indicate the direction of the association.
• The presence of outlier(s) may severely influence the correlation.
• A high correlation value does not always imply causation.
Check before calculation of correlation
• Are the variables quantitative?
• Is the form of the scatter plot straight enough
(so that a linear relationship makes sense)?
• Have we removed the outliers? Otherwise, the value of the correlation can be distorted dramatically.
Regression line and prediction
Explanatory and Response Variables
• The main variable of interest (the one which we would like to predict) is called the response variable (denoted by y).
• The other variable is called the explanatory variable or the predictor variable (denoted by x).
[Scatterplot of height vs. weight.]
The scatter plot above indicates a linear relationship between height and weight. Suppose an individual is 68 in tall. How can we predict his weight? Here height is the predictor (or explanatory) variable and weight is the response variable.
What is Linear Regression?
• When the scatter plot looks roughly linear, we may model the relationship between the variables with a "best-fitted" line (known as the regression line): y = b0 + b1x.
• b1 (the coefficient of x) is called the slope of the regression line.
• b0 is called the intercept of the regression line.
• We estimate the slope (b1) and the intercept (b0).
• Next, given a value of x, we plug that value into the regression-line equation to predict y.
This procedure is called linear regression.
Conditions for Linear Regression
• Quantitative Variables Condition: both variables have to be quantitative.
• Straight Enough Condition: the scatter plot must appear to have moderate linear association.
• Outlier Condition: there should not be any outliers.
Example of Linear Regression
Suppose
• x = amount of protein (in gm) in a burger (explanatory variable),
• y = amount of fat (in gm) in the burger (response variable).
Goal: Express the relationship between x and y using a line (the regression line): y = b0 + b1x.
Questions:
1. How do we find b1 (slope) and b0 (intercept)?
2. How will it help in prediction?
Formulae of b0 and b1
• Although there are many lines that can describe the relationship, there is a way to find "the line that fits best".
• For the best-fitted line:
• Slope: b1 = r × (sy / sx), i.e. (correlation) × (std. dev. of y) / (std. dev. of x).
• Intercept: b0 = ȳ − b1·x̄, i.e. (mean of y) − b1 × (mean of x).
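These two formulae can be sketched in a few lines of Python (the helper name is ours; the numbers are the ones used in Example 1 below):

```python
def regression_from_summary(r, sx, sy, xbar, ybar):
    """Best-fit line coefficients from summary statistics."""
    b1 = r * sy / sx       # slope: b1 = r * (sy / sx)
    b0 = ybar - b1 * xbar  # intercept: b0 = (mean of y) - b1 * (mean of x)
    return b0, b1

# Summary statistics from Example 1: r = 0.9, sx = 0.75, sy = 0.25, means 10 and 5
b0, b1 = regression_from_summary(0.9, 0.75, 0.25, 10, 5)
print(round(b1, 2), round(b0, 2))  # → 0.3 2.0
```

This matches the hand calculation in Example 1: slope 0.3 and intercept 2.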
Computation of b0 and b1
• If we are given the summary statistics, i.e. the means and standard deviations of x and y and their correlation, then we plug those values into the formulae to find b0 and b1.
• If we are given the actual data (not the summary), then we need to compute all those summary values.
• However, given the data, the TI 83/84 Plus can find the equation of the regression line.
But be careful, because the TI 83/84 writes the regression equation as y = ax + b.
So a = slope (= b1), and b = intercept (= b0).
Example 1
Suppose that in a linear regression problem the correlation between the variables is 0.9, the standard deviations of the explanatory (x) and response (y) variables are 0.75 and 0.25 respectively, and the means of the explanatory and response variables are 10 and 5 respectively. Calculate the regression line.
• Estimate of slope:
b1 = 0.9 × (0.25/0.75) = 0.3.
• Estimate of intercept:
b0 = 5 − 0.3 × 10 = 2.
• So the estimated regression line is:
y = 0.3x + 2.
Example 2
Fat (g)   Sodium (mg)   Calories
  19          920          410
  31         1500          580
  34         1310          590
  35          860          570
  39         1180          640
  39          940          680
  43         1260          660
Fat (in gm), Sodium (in mg) and Calorie content of 7 burgers are given above.
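The same regression the TI 83/84 Plus performs can be sketched from scratch in Python (the `linreg` helper is ours, computing the least-squares slope and intercept from sums of squares):

```python
from statistics import mean

def linreg(x, y):
    """Least-squares line y = b0 + b1*x."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx         # slope
    b0 = ybar - b1 * xbar  # intercept
    return b0, b1

fat = [19, 31, 34, 35, 39, 39, 43]
sodium = [920, 1500, 1310, 860, 1180, 940, 1260]
calories = [410, 580, 590, 570, 640, 680, 660]

b0, b1 = linreg(fat, sodium)
print(round(b1, 2), round(b0, 2))  # → 6.08 930.02
b0, b1 = linreg(fat, calories)
print(round(b1, 2), round(b0, 2))  # → 11.06 210.95
```

The output agrees with the calculator results quoted on the following slides.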
Using TI 83/84 Plus for regression
• Press [STAT].
• Choose 1: Edit.
• Type the Fat data under L1, Sodium under L2 and Calories under
L3.
Suppose (L1) Fat is the predictor and (L2) Sodium is the response.
• Press [STAT] again and select CALC using right-arrow.
• Select 4: LinReg(ax+b) (LinReg(ax+b) appears on screen).
• Type [2nd] and [1] (that will put L1 on screen).
• Type , and then [2nd] and [2] (that will put ,L2 on screen).
• Press [ENTER].
• This will produce a (slope), b (intercept), r2 and r (correlation
coefficient).
Tips: Using TI 83/84 Plus
• Caution: After LinReg(ax+b) you must first put the predictor
(explanatory) variable, and then the response variable.
• Note that the values of r and r2 will not show up if the
diagnostic is not switched on in the TI 83/84 Plus calculator.
• To get the diagnostic switched on:
1. Press [2nd] and [0] (that will choose CATALOG).
2. Select using arrow keys DiagnosticOn.
3. Press [ENTER] and [ENTER] again.
4. This will switch the diagnostic on.
Tips: Using TI 83/84 Plus
• To delete one particular list variable (say L2):
1. Press [STAT].
2. Choose 1: Edit.
3. Select the variable L2 using the arrow keys.
4. Press [CLEAR] followed by [ENTER].
• To delete all stored data:
a) Press [2nd] and [+] (that will choose MEM).
b) Select 4: ClrAllLists.
c) Press [ENTER] and [ENTER] again.
d) This will clear all the stored data in the lists.
Fat vs. Sodium
[Scatterplot: Fat (g) on the x-axis (15–45), Sodium (mg) on the y-axis (600–1600), with the regression line.]
Using TI 83/84 Plus we get:
• r = 0.199
• r2 = 0.0396
• a = 6.08 (= b1)
• b = 930.02 (= b0)
• The correlation is approximately 0.2, which indicates a very weak (positive) linear association between fat and sodium.
• The scatter plot supports the small value of r.
• Regression line: y = 6.08x + 930.02.
Fat vs. Calories
[Scatterplot: Fat (g) on the x-axis (15–45), Calories on the y-axis (350–700), with the regression line.]
Using TI 83/84 Plus we get:
• r = 0.961
• r2 = 0.923
• a = 11.06 (= b1)
• b = 210.95 (= b0)
• A correlation of 0.96 indicates a very strong positive linear relation between fat and calories.
• The scatter plot supports the high positive value of r.
• Regression line: y = 11.06x + 210.95.
Example 3
Country        Percent with Cell Phone   Life Expectancy (years)
Turkey               85.7%                     71.96
France               92.5%                     80.98
Uzbekistan           46.1%                     71.96
China                47.4%                     73.47
Malawi               11.9%                     50.03
Brazil               75.8%                     71.99
Israel              123.1%                     80.73
Switzerland         115.5%                     80.85
Bolivia              49.4%                     66.89
Georgia              59.7%                     76.72
Cyprus               93.8%                     77.49
Spain               122.6%                     80.05
Indonesia            58.5%                     70.76
Botswana             74.6%                     61.85
U.S.                 87.9%                     78.11
Example 3: Scatter plot with regression line
[Scatterplot: % Cell Phone on the x-axis (0–140), Life Expectancy on the y-axis (45–85), with possible outliers marked.]
y = 0.21x + 56.91
R = 0.7848, R2 = 0.6159
Example 3: Scatter plot with regression line (without outliers)
[Scatterplot: % Cell Phones on the x-axis (0–140), Life Expectancy on the y-axis (45–85), with the regression line.]
y = 0.13x + 64.7
R = 0.802, R2 = 0.6437
Predicted values and residuals
• Let the regression line be y = b0 + b1x.
• Suppose (x0, y0) is an observed data point.
• Then the predicted value of y given x = x0 is ŷ = b0 + b1x0.
• The residual (denoted by e) measures how much the predicted value deviates from the observed value of y.
• So if the observed value of y is y0, then
residual = (observed value − predicted value), i.e. e = y0 − ŷ.
• Residuals are the errors made when predicting using the regression line.
Example 1 revisited
Suppose that in a linear regression problem the correlation between the variables is 0.9, the standard deviations of x and y are 0.75 and 0.25 respectively, and the means of x and y are 10 and 5 respectively. Calculate the residual when x = 20 and y = 7.
• Estimate of slope: b1 = 0.9 × (0.25/0.75) = 0.3.
• Estimate of intercept: b0 = 5 − 0.3 × 10 = 2.
• So the estimated regression line is: y = 0.3x + 2.
• The predicted value of y when x = 20 is 0.3 × 20 + 2 = 8.
• The corresponding residual is (observed y) − (predicted y) = 7 − 8 = −1.
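The residual calculation above can be sketched in Python (the `residual` helper is ours, built directly from the definition e = y0 − ŷ):

```python
def residual(b0, b1, x0, y0):
    """e = observed y minus predicted y-hat, where y-hat = b0 + b1*x0."""
    y_hat = b0 + b1 * x0
    return y0 - y_hat

# Regression line from the example: y = 0.3x + 2, observed point (20, 7)
print(round(residual(2, 0.3, 20, 7), 2))  # → -1.0
```

This agrees with the hand calculation: residual = 7 − 8 = −1.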
Evaluating regression
The fit of a regression line can be evaluated with:
• R2 (the coefficient of determination),
• se (the standard deviation of the residuals).
R2 (the coefficient of determination)
R2 is the fraction of the total sample variation in y explained by the regression model.
Some properties of R2:
• R2 = (correlation)2 = r2.
• 0 ≤ R2 ≤ 1.
• R2 close to 0 implies a weak linear relationship (and a poor fit of the regression line).
• R2 close to 1 implies a strong linear relationship (and a very good fit of the regression line).
R2 (the coefficient of determination)
• For instance, if R2 = 0.54, then 54% of the total sample variation
in y is explained by the regression model.
It indicates a moderate fit of the regression line.
On the scatter plot the points will not be very close to the
regression line.
• If R2 = 0.96, then 96% of the total sample variation in y is
explained by the regression model. It indicates a very good fit of
the regression line. On the scatter plot the points will be very
close to the regression line.
• On the other hand, if R2 = 0.19, then only 19% of the total
sample variation in y is explained by the regression model, which
indicates a very bad fit of the regression line.
The scatter plot will show either a curved pattern, or the points
will be clustered showing no pattern.
se (standard deviation of residuals)
se is the standard deviation of the residuals. When there is no ambiguity, we often write s instead of se.
• The smaller the se, the better the model fit.
• The larger the se, the worse the model fit.
Remember that residuals are the errors made when predicting using the regression line.
So a larger value of se implies more spread in the residuals, and as a result more error in prediction; the observations are not close to the regression line.
On the other hand, a smaller value of se indicates that the observations are closer to the regression line, implying a better fit.
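A minimal sketch of se in Python, using the burger data from Example 2 (the slides do not give a formula for se; this version follows the common convention of dividing the sum of squared residuals by n − 2, and the fitted coefficients are the ones quoted earlier):

```python
import math

def residual_sd(x, y, b0, b1):
    """se: standard deviation of residuals; divides by n - 2 (a common convention)."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e * e for e in residuals) / (len(x) - 2))

fat = [19, 31, 34, 35, 39, 39, 43]
sodium = [920, 1500, 1310, 860, 1180, 940, 1260]
calories = [410, 580, 590, 570, 640, 680, 660]

# The fat-vs-calories line (r = 0.961) fits far better than the
# fat-vs-sodium line (r = 0.199), so its se is much smaller.
se_sodium = residual_sd(fat, sodium, 930.02, 6.08)
se_calories = residual_sd(fat, calories, 210.95, 11.06)
print(se_calories < se_sodium)  # → True
```

The better-fitting calories model has a much smaller se, exactly as the slide predicts.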
Summary: Direction of linear association
• The sign of correlation (r) indicates the direction of
the linear association between x and y.
• Notice that the slope (b1) of regression line has the
same sign as that of correlation (r). Hence the sign of
slope also indicates the direction of the linear
association between x and y.
• Positive sign of correlation (and slope) implies
positive linear association.
• Negative sign of correlation (and slope) implies
negative linear association.
Summary: Strength of linear association
• The value of the correlation (r), ignoring the sign, gives the strength of the linear association.
• The value of R2 also gives the strength of the linear association.
• Values close to 1 imply strong linear association.
• Values close to 0 imply weak or no linear association.
• You cannot determine the strength of the linear association from the value of the slope (b1) of the regression line.
Correlation (r) and R2
• If you know the value of the correlation, you can compute R2 = (correlation)2.
• But knowing the value of R2 alone is not sufficient to compute the correlation, because it gives no information about the sign of the correlation.
• However, if we know the value of R2 and also the sign of the slope (b1), we can compute the correlation as:
r = (sign of b1) × √R2.
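This recovery of r from R2 and the slope sign is a one-liner in code. A small Python sketch (the function name and the example numbers are ours, chosen for illustration):

```python
import math

def correlation_from_r2(r_squared, slope):
    """r = (sign of b1) * sqrt(R^2)."""
    sign = 1.0 if slope >= 0 else -1.0
    return sign * math.sqrt(r_squared)

# Hypothetical values: R^2 = 0.64 with a negative slope gives r = -0.8
print(round(correlation_from_r2(0.64, -1.5), 2))  # → -0.8
```

With a positive slope the same R2 would instead give r = 0.8.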
Lurking variable
• High correlation between x and y may not
always mean causation.
• Sometimes there is a lurking variable which is
highly correlated to both x and y and as a result
we obtain a high correlation between x and y.
Example: Lurking variable
• A study at the University of Pennsylvania Medical Center (published in the May 13, 1999 issue of Nature) concluded that young children who sleep with the light on are much more likely to develop eye problems in later life.
• However, a later study at The Ohio State University did not find that infants sleeping with the light on caused the development of eye problems.
• The second study (done at OSU) did find a strong link between parental myopia and the development of child myopia, also noting that myopic parents were more likely to leave a light on in their children's bedroom. So parental myopia is a lurking variable here.
Reference:
http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
Choose the best description of the scatter plot
A. Moderate, negative, linear association
B. Strong, curved association
C. Moderate, positive, linear association
D. Strong, negative, non-linear association
E. Weak, positive, linear association
Match the following values of the correlation coefficient to the data shown in the scatter plots (Fig. 1, Fig. 2, Fig. 3).
A. r = −0.67
B. r = −0.10
C. r = 0.71
D. r = 0.96
E. r = 1.00