SAMBAND (Association) - umu.se


Module 2
Association and Correlation

Two quantitative variables:
  Scatterplot
  Correlation
  Regression (least squares)

Two categorical variables:
  Crosstab
  Chi-square test
I. Association between categorical variables
Definition:
Two categorical variables are independent if the
population conditional distributions on one of them are
identical at each category of the other.
Example:

Party identification by ethnic group

Ethnic group   Democrat    Independent   Republican   Total
White          440 (44%)   140 (14%)     420 (42%)    1000
Black           44 (44%)    14 (14%)      42 (42%)     100
Hispanic       110 (44%)    35 (14%)     105 (42%)     250

Each row has the same conditional distribution (44%, 14%, 42%), so party identification is independent of ethnic group in this population.
Chi-square test of independence:
A study was done to investigate whether using bicycle helmets is an effective way to protect people from skull damage in bicycle accidents. 793 persons participated in the study, with the following results:

Observed      Skull damage   No skull damage   Total
No helmet     218 (33.7%)    428 (66.3%)       646
Helmet         17 (11.6%)    130 (88.4%)       147
Total         235            558               793
We test the following:
H0: There is no association between skull damage and using helmets. (The proportion of skull damage is the same whether or not a person in an accident is wearing a helmet.) (The variables are statistically independent.)
H1: There is an association between skull damage and using helmets. (The variables are statistically dependent.)
Expected frequency: the count expected in a cell IF the variables were independent, computed as (row total × column total) / grand total.

Obs (Exp)     Skull damage   No skull damage   Total
No helmet     218 (191.4)    428 (454.6)       646
Helmet         17 (43.6)     130 (103.4)       147
Total         235            558               793

(235 × 646) / 793 = 191.4
(646 × 558) / 793 = 454.6
(235 × 147) / 793 = 43.6
(147 × 558) / 793 = 103.4
We compare the observed frequencies with the expected ones. If the frequencies differ, we reject the null hypothesis; then we have empirical evidence of dependence between the variables.

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ,  summing over the n cells

Oᵢ = an observed frequency
Eᵢ = an expected frequency
n = the number of cells in the table

When H0 is true, observed and expected frequencies tend to be close for each cell, and χ² is relatively small.
P-value: right-tail probability above the observed χ² value.
Reject H0 at the α-level if P-value ≤ α.
Helmet & skull damage: SPSS calculates p-value=
0.0001….. Conclusion?
Since the p-value = 0.0001 < 0.05, we can reject
the null hypothesis.
There is a significant difference between the
percentage of people with skull damage in two
groups of bikers.
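
For readers who want to reproduce this chi-square test outside SPSS, here is a minimal sketch in Python using scipy (an assumption; the slides themselves use SPSS). The counts come from the observed table above.

```python
# Sketch: chi-square test of independence for the helmet example.
# scipy is an assumption here; the slides use SPSS.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [218, 428],   # no helmet: skull damage, no skull damage
    [17, 130],    # helmet:    skull damage, no skull damage
])

# correction=False gives the plain (uncorrected) Pearson chi-square
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.1f}, df = {dof}, p-value = {p_value:.6f}")
print(expected.round(1))  # matches the expected counts 191.4, 454.6, 43.6, 103.4
```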
II. Relationships between two quantitative variables

A set of data on two continuous variables.

Goal: predict the value of one variable (Y) when we know the other variable (x).
Y: response, outcome, dependent variable
X: predictor, independent, explanatory variable
Yᵢ = value of the response variable for the i-th observation
Xᵢ = value of the predictor variable for the i-th observation

Three aspects of relationship analyses:
• To investigate whether there is an association
• To study the strength of the association
• To estimate a regression equation that predicts the value of the response variable using the value of the explanatory variable

Note: The analyses are collectively called a regression analysis.
Objectives:
• Description
• Explanation
• Prediction
• Control
Example:
We want to check if there is any relationship between the score on exams and study hours.

Person      A   B   C   D   E   F   G   H   I   J
Hours/day   4   5   6   7   5   3   5   7   8   8
Score      20  25  22  35  15  14  22  30  37  39
Step 1) Is there any association?
Scatter plots are used to describe the relationship between two quantitative variables.
Response variable on the vertical axis.
Explanatory variable on the horizontal axis.

[Scatterplot: result on exam (y-axis, 0-45) vs. hours used for studies (x-axis, 0-9)]
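
As a sketch, the scatter plot can be reproduced with matplotlib (my choice of tool, not the slides'), using the hours/score data from the table above:

```python
# Sketch: scatter plot of exam score (response, vertical axis) against
# study hours (explanatory, horizontal axis). matplotlib is an assumption.
import matplotlib.pyplot as plt

hours = [4, 5, 6, 7, 5, 3, 5, 7, 8, 8]
score = [20, 25, 22, 35, 15, 14, 22, 30, 37, 39]

plt.scatter(hours, score)
plt.xlabel("hours used for studies")  # explanatory variable on the x-axis
plt.ylabel("result on exam")          # response variable on the y-axis
plt.show()
```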
Step 2) Strength of the association
Correlation coefficient
• The correlation coefficient r measures the strength of the linear association between x and y.
• It shows how X and Y vary together; association does not imply causation.
• The correlation coefficient can take values between −1 and +1.
• The value of r does not depend on the variables' units.
• Be aware that r is a measure of linear relationships. Even if r = 0 there can be a nonlinear relationship between x and y.
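
A minimal sketch of computing r for the study-hours example, using scipy's pearsonr (an assumption; the slides report SPSS output):

```python
# Sketch: Pearson correlation for the study-hours data.
from scipy.stats import pearsonr

hours = [4, 5, 6, 7, 5, 3, 5, 7, 8, 8]
score = [20, 25, 22, 35, 15, 14, 22, 30, 37, 39]

r, p_value = pearsonr(hours, score)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")  # r ≈ 0.92: strong positive linear association
```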
Where do we have a linear relationship?

[Figure: example scatterplots of Y vs. x; some panels show linear relationships, while the nonlinear cases include a quadratic curve Y = a + bX + cX², exponential growth y = b·exp(ax), and exponential decay y = b·exp(−ax)]
Question: Is there any linear relationship between newborn babies' height and weight?

[Scatterplot: weight in g (y-axis, 2000-5000) vs. height in cm (x-axis, 47-56)]
We want to test:
H0: There is no linear relationship between x and y.
H1: There is a linear relationship between x and y.
• (x: baby’s height)
• (y: baby’s weight)
Example 1) Correlation between weight and height of newborn babies

Correlations
                               LENGTH    WEIGHT
LENGTH   Pearson Correlation   1         ,765**
         Sig. (2-tailed)                 ,000
         N                     35        35
WEIGHT   Pearson Correlation   ,765**    1
         Sig. (2-tailed)       ,000
         N                     35        35
**. Correlation is significant at the 0.01 level (2-tailed).
Example 2) Is there a linear relationship between "mean income level" and "population size" in 50 American states?

Correlations
                               INCOME    POP
INCOME   Pearson Correlation   1         ,255
         Sig. (2-tailed)                 ,074
         N                     50        50
POP      Pearson Correlation   ,255      1
         Sig. (2-tailed)       ,074
         N                     50        50

[Scatterplot: INCOME (y-axis, 3000-8000) vs. POP (x-axis, 0-30000)]
Example 3) Scatterplot of advertising expenditures (X) and sales (Y)
• Larger sales are associated with larger advertising expenditures (x).
• The points seem to be distributed close to a line with positive slope.
• The points are not exactly on the line.
• The line represents the mean sales in relation to advertising expenditures.

[Scatterplot: Sales (y-axis, 0-140) vs. Advertising (x-axis, 0-50), with a fitted line]
Step 3) Prediction
Regression equation using the least squares method

Y = α + βx + ε

Systematic part: α + βx. Stochastic (random) part: ε.

Y: dependent or response variable; the variable we want to predict.
x: independent or explanatory variable.
ε: error term, the random part of the model.
α: intercept, the point where the line crosses the y-axis.
β: slope of the line.
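
A minimal sketch of estimating α and β by least squares for the study-hours data, using numpy's polyfit (an assumption; the slides use SPSS):

```python
# Sketch: least-squares estimates of intercept (alpha) and slope (beta).
import numpy as np

x = np.array([4, 5, 6, 7, 5, 3, 5, 7, 8, 8])            # hours/day
y = np.array([20, 25, 22, 35, 15, 14, 22, 30, 37, 39])  # exam score

# polyfit returns coefficients from the highest degree down: [beta, alpha]
beta, alpha = np.polyfit(x, y, deg=1)
print(f"y_hat = {alpha:.2f} + {beta:.2f} * x")

# point prediction for a person studying 6 hours/day
print(f"predicted score at x = 6: {alpha + beta * 6:.1f}")
```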
Assumptions in regression
1. The observations are independent of each other (e.g., a simple random sample in a survey).
2. The mean of Y is related to X by a linear equation: E(Y) = α + βx.
3. The conditional standard deviation of Y is identical at each x-value.
4. The conditional distribution of Y at each value of x is normal.
[Figure: the assumptions illustrated; assumption 2 gives the linear mean line, assumptions 3 and 4 give the identical, normal spread around it at each x]
Residuals
• The difference between an observed value and the predicted value of the response variable
• Prediction errors

[Figure: data with fitted line (left); residuals as vertical distances from the line (right)]
Model control

[Four residual plots, residuals on the vertical axis against x, ŷ, or time:]
• Same variance around the line: good.
• Variance around the line depends on x: bad.
• A pattern, a linear trend in time: bad.
• A pattern, nonlinear: bad.
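
These residual checks can be sketched in code as well; here residuals are computed for the study-hours fit and plotted against x (numpy and matplotlib assumed):

```python
# Sketch: residuals e_i = y_i - y_hat_i, plotted against x for model control.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([4, 5, 6, 7, 5, 3, 5, 7, 8, 8])
y = np.array([20, 25, 22, 35, 15, 14, 22, 30, 37, 39])
beta, alpha = np.polyfit(x, y, deg=1)

residuals = y - (alpha + beta * x)  # prediction errors
plt.scatter(x, residuals)
plt.axhline(0)  # a good fit shows an even, patternless scatter around 0
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```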
Step 4) Regression diagnostics (model control)
How well can x predict y?
Coefficient of determination (R²): the proportion of the total variation explained by the model.

(y − ȳ) = (y − ŷ) + (ŷ − ȳ)
Total deviation = Unexplained deviation (error) + Explained deviation (regression)

[Figure: a data point y above the line; its total deviation from ȳ splits into the explained deviation (ŷ − ȳ) and the unexplained deviation (y − ŷ)]

Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
SST = SSE + SSR

R² = SSR/SST = 1 − SSE/SST

R²: the proportion of the total variation that can be explained by the regression model.
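
A minimal sketch of this decomposition, computed for the study-hours fit (numpy assumed):

```python
# Sketch: SST = SSE + SSR and R^2 = SSR/SST = 1 - SSE/SST.
import numpy as np

x = np.array([4, 5, 6, 7, 5, 3, 5, 7, 8, 8])
y = np.array([20, 25, 22, 35, 15, 14, 22, 30, 37, 39])
beta, alpha = np.polyfit(x, y, deg=1)
y_hat = alpha + beta * x

sst = np.sum((y - y.mean()) ** 2)      # total deviation
sse = np.sum((y - y_hat) ** 2)         # unexplained (error)
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained (regression)
print(f"R^2 = {ssr / sst:.3f} = {1 - sse / sst:.3f}")  # the two forms agree
```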
Example: R²

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       ,960a   ,922       ,915                ,9954
a. Predictors: (Constant), X

R² = 92.2%
Properties of R²
• Since −1 ≤ r ≤ +1, R² falls between 0 and 1.
• R² measures the strength of linear association. The closer R² is to 1, the stronger the linear association.
• In the absence of linear association, R² = 0.
• R² does not depend on the units of measurement.
• R² takes the same value when x predicts y as when y predicts x.
Coefficient of determination, R²

[Figure: three scatterplots with fitted lines, for R² = 0, R² = 0.50, and R² = 0.90; alongside each, SST is split into SSE and SSR]
Test of Independence
• Population mean of y is identical at each x-value
• Two quantitative variables are independent
• The slope β = 0 for the linear regression function E(y) = α + βx
• The null hypothesis that the variables are statistically independent is therefore H0: β = 0
• t = β̂ / se(β̂); the P-value is listed as "Sig."
• If P < 0.05, reject the hypothesis, meaning that x and y are dependent
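
A minimal sketch of this t-test with scipy's linregress (an assumption; it reports the same two-sided P-value that SPSS shows under "Sig."):

```python
# Sketch: test of H0: beta = 0 via t = beta_hat / se(beta_hat).
from scipy.stats import linregress

hours = [4, 5, 6, 7, 5, 3, 5, 7, 8, 8]
score = [20, 25, 22, 35, 15, 14, 22, 30, 37, 39]

fit = linregress(hours, score)
t = fit.slope / fit.stderr  # t statistic for the slope
print(f"beta = {fit.slope:.3f}, se(beta) = {fit.stderr:.3f}, "
      f"t = {t:.2f}, p = {fit.pvalue:.4f}")
```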
Example SPSS

Coefficients (a)
                  Unstandardized        Standardized
                  Coefficients          Coefficients
Model             B         Std. Error  Beta           t         Sig.
1   (Constant)    -3,057    ,971                       -3,148    ,009
    X              ,187     ,016        ,960           11,381    ,000
a. Dependent Variable: Y

The B column gives the estimates of α and β; the Std. Error column gives se(α̂) and se(β̂); the Sig. column gives the P-value.
Prediction

Point prediction
A point prediction of Y for a given value of x is given by the regression line.

Interval prediction
An interval prediction of Y for a given value of x. It accounts for both the uncertainty in the estimation of the regression line and the variation around the regression line.

Confidence interval for the prediction
A confidence interval for the mean E[Y|x] at a given value of x. It accounts only for the uncertainty in the estimation of the regression line.
Confidence interval for E[Y|x]
Y
Y
Upper limit for slope
Upper limit for intercept
Regression line
Lower limit for slope
Y
X
X
1) Uncertainty about slope
Regression line
Y
Lower limit for intercept
X
X
2)Uncertainty about intercept
Confidence interval for E[Y|x]
Y
Confidence interval
Regression line
Y
X
X
Confidence interval for E[Y|x]
Prediction interval for E(Y|x)
Y
Regression line
Y
Confidence interval
Regression line
Y
Prediction interval
X
3) Variation around the regression
line + line uncertainty (1 and 2).
X
X
Prediction interval for E(Y|x)
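
To make the difference concrete, here is a sketch using statsmodels (an assumption; the slides show figures only), comparing the confidence interval for E[Y|x] with the wider prediction interval for a new Y at the same x:

```python
# Sketch: confidence interval for E[Y|x] vs. prediction interval for a new Y.
import numpy as np
import statsmodels.api as sm

x = np.array([4, 5, 6, 7, 5, 3, 5, 7, 8, 8])
y = np.array([20, 25, 22, 35, 15, 14, 22, 30, 37, 39])

model = sm.OLS(y, sm.add_constant(x)).fit()

# predict at x = 6; has_constant='add' forces the intercept column
new_x = sm.add_constant(np.array([6.0]), has_constant="add")
frame = model.get_prediction(new_x).summary_frame(alpha=0.05)

print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])  # CI for E[Y|x=6]
print(frame[["obs_ci_lower", "obs_ci_upper"]])            # wider prediction interval
```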