Causality and confounding variables

• Scientists aspire to measure cause and effect.
• Correlation does not imply causality. Hume: contiguity + order (cause then effect) + effect only when cause present.
• Confounding variables (extraneous factors) may intervene and affect both the proposed cause and the proposed effect (see the sketch below).
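To make the confounding point concrete, here is a minimal Python sketch (entirely hypothetical data, using numpy) in which a lurking variable Z drives both X and Y, so the two correlate strongly even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder Z (e.g. age) independently drives both X and Y
z = rng.normal(size=1000)
x = 2.0 * z + rng.normal(size=1000)   # X depends only on Z plus noise
y = 3.0 * z + rng.normal(size=1000)   # Y depends only on Z plus noise

# X and Y correlate strongly despite having no causal link to each other
print(np.corrcoef(x, y)[0, 1])        # roughly 0.85
```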
Correlation and Regression
• Steps for making statistical predictions
– Pearson product moment coefficient of correlation (r) – to measure the strength of any linear relationship between variables, e.g. in bivariate correlation: age and salary level
– Lies in the range −1 ≤ r ≤ +1
– −1: perfect negative linear correlation; +1: perfect positive linear correlation; 0: no correlation
– Measures only the strength of the relationship, not cause and effect (a sketch of computing r follows below)
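As an illustration, a minimal Python sketch of Pearson's r for a small bivariate sample (the age/salary figures are hypothetical, not from the lecture):

```python
import numpy as np

# Hypothetical bivariate sample: age (years) and salary (£000s)
age = np.array([23, 30, 35, 41, 48, 55])
salary = np.array([21, 27, 30, 38, 44, 50])

# Pearson r: the covariance of the two variables divided by the
# product of their standard deviations
r = np.corrcoef(age, salary)[0, 1]
print(round(r, 3))   # close to +1: a strong positive linear relationship
```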
Steps for making statistical predictions continued…
• Having established a correlation (strength)
– Use the ‘coefficient of determination’ (r²) to assess what proportion (%) of the variation is explained by the Pearson r correlation
– Evaluate the statistical significance (t-scores), i.e. set the risk level for accepting the calculated coefficients against the null hypothesis (sketched in code below)
• The selection of scatter diagrams (after the sketch) illustrates linear correlation principles
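Both steps can be sketched in Python; the t-score for a correlation coefficient is t = r√(n − 2) / √(1 − r²), and the figures here are taken from the worked example later in this transcript:

```python
import math
from scipy import stats

r, n = 0.87207, 30   # correlation and sample size from the worked example

r_squared = r ** 2                                    # proportion of variation explained
t = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)   # t-score for r
p = 2 * stats.t.sf(abs(t), df=n - 2)                  # two-tailed p-value

print(round(r_squared, 4))   # about 0.7605
print(round(t, 3))           # about 9.43, matching the SPSS output below
print(p < 0.01)              # True: significant at the 1% (0.01) level
```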
A selection of scatter diagrams and associated correlation coefficients
[Six scatter plots of y values against x values, illustrating: r = +1 (perfect positive), r = −1 (perfect negative), r = +0.871 (strong positive), r = −0.497 (moderate negative), r = +0.0037 (negligible) and r = 0 (no correlation).]
Now move on to prediction
• From assessing the strength and power of a linear correlation between two variables…
• …move on to describing the nature of the relationship to assist in predicting
The equation of a regression line has the form:
Y = a + bX
where Y is the dependent variable (the one we wish to predict /
explain) and X is the independent variable. The value “a” is known
as the intercept of the line and “b” measures the gradient of this line.
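As a sketch of where a and b come from, the least-squares formulas are b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄; a minimal Python version on hypothetical data:

```python
import numpy as np

# Hypothetical (x, y) observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Gradient b and intercept a from the least-squares formulas
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

print(a, b)                   # intercept and gradient
print(np.polyfit(x, y, 1))    # [b, a] - numpy's built-in fit agrees
```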
Worked Example
• LOS (length of service) and age are correlated at r = 0.87207 from a survey of 30 employees in a firm
• r (above) and r² (0.760508) are strong – although this still leaves residuals at 24% (i.e. due to extraneous factors)
• Is this significant?
• Can we predict mean LOS at age 40?
• What is the 95% confidence interval for the additional LOS derived from one extra year of age?
Plotting the data we can see…
[Scatter plot of SERVICE (length of service, 0–30 years) against AGE (10–60 years) with the fitted regression line; Rsq = 0.7605]
The equation of the line linking length of service (Y) and age (X) is:
Y = -8.2194 + 0.45727X, and SPSS reveals these coefficients for us.
This equation can be used to predict LOS at a selected age.
Where do the figures come from to drop into the Y=a+bX equation?
An SPSS regression printout gives us the data needed to solve the problem:
Variables Entered/Removed
Model   Variables Entered   Variables Removed   Method
1       AGE                 .                   Enter
a. All requested variables entered.
b. Dependent Variable: LOS

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .872   .761       .752                2.63
a. Predictors: (Constant), AGE
b. Dependent Variable: LOS

Coefficients (unstandardized B with its Std. Error; standardized Beta)
Model            B        Std. Error   Beta   t        Sig.   95% Confidence Interval for B
1  (Constant)    -8.219   1.657               -4.961   .000   (-11.613, -4.826)
   AGE           .457     .048         .872   9.429    .000   (.358, .557)
a. Dependent Variable: SERVICE

Casewise Diagnostics
Case Number   Std. Residual   SERVICE   Predicted Value   Residual
2             3.385           24        15.10             8.90
a. Dependent Variable: LOS
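The same figures can be reproduced outside SPSS; a sketch using Python's statsmodels (the raw 30-row dataset is not given in this transcript, so the file name and column names here are hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical survey file with columns AGE and SERVICE (LOS in years)
df = pd.read_csv("los_survey.csv")

X = sm.add_constant(df["AGE"])            # adds the intercept term 'a'
model = sm.OLS(df["SERVICE"], X).fit()    # ordinary least squares fit

print(model.params)          # const ~ -8.219, AGE ~ 0.457
print(model.tvalues)         # t-scores, cf. -4.961 and 9.429 above
print(model.conf_int(0.05))  # 95% confidence intervals for a and b
```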
Interpretation of the SPSS output
Variables Entered/Removed
This simply tells us that ‘age’ was the independent variable and ‘service’ the dependent
variable.
Model Summary
The value of the correlation coefficient (r) was 0.872 and the value of r² was 0.761.
Coefficients
The ‘unstandardized coefficients’ give us the values of a and b in the regression equation. Thus the equation here is y = -8.219 + 0.457x.
The ‘Sig.’ column gives values less than 0.01, so we can say that the coefficients of the regression equation are significantly different from zero at the 1% (0.01) level (and thus at the 5% (0.05) level).
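These figures can be cross-checked by hand: t is B divided by its standard error, and the 95% interval is B ± t(0.025, 28) × SE. A quick Python check (small differences from SPSS are rounding in the printed B and SE):

```python
from scipy import stats

b, se, df = 0.457, 0.048, 28       # slope, its std. error, n - 2 degrees of freedom

t = b / se                         # about 9.5 (SPSS: 9.429, from unrounded inputs)
t_crit = stats.t.ppf(0.975, df)    # about 2.048
lower, upper = b - t_crit * se, b + t_crit * se
print(round(t, 2), round(lower, 3), round(upper, 3))   # CI about (.359, .555), cf. (.358, .557)
```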
Casewise diagnostics
During the input dialogue, SPSS was asked to show any standardised residuals outside the range −3 to +3. The output shows that one reading, case number 2, had a large standardised residual. This indicates that this point does not fit the general trend of the straight line and can be regarded as an ‘outlier’ (i.e. an unusual reading).
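The same check can be sketched in Python, continuing the hypothetical statsmodels fit above (a standardised residual is, roughly, the raw residual divided by the standard error of the estimate):

```python
import numpy as np

# Continuing from the fitted 'model' above
std_resid = model.resid / np.sqrt(model.mse_resid)   # residual / std. error of estimate

# Flag any case outside the +/-3 band, as SPSS was asked to do
print(np.where(np.abs(std_resid) > 3)[0])   # e.g. case 2 (std. residual 3.385 above)
```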
The solution…
Y = a + bX
(where Y is LOS; X is age)
Y = -8.2194 + 0.45727X
Y = -8.2194 + 0.45727(40)
Y = -8.2194 + 18.29
Y = 10.07 years’ service predicted at age 40*
And… the 95 per cent confidence interval for the mean additional LOS from each extra year of age is 0.358 to 0.557 (as supplied in the SPSS output).
* Have a glance back at the scattergram to check this visually
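The same arithmetic in a couple of lines of Python (coefficients as printed above):

```python
a, b = -8.2194, 0.45727   # intercept and gradient from the SPSS output

age = 40
los = a + b * age         # predicted mean length of service at age 40
print(round(los, 2))      # 10.07 years, as in the worked solution
```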
Basic Quants: A Summary
• We have introduced the modelling concept
• We have reflected on data types/displays
• We have engaged with probability theory
• We have touched on
– Significance testing of hypotheses using both parametric and non-parametric statistics
– Prediction from what is known to make an informed estimate of the variable of interest
» Work through the assignment with the booklet provided alongside and this will guide solution of every aspect!