SOC 206 - Division of Social Sciences


Statistics, Causation, Simple Regression

Famous distinction by Hans Reichenbach:
- Discovery: how do we come up with ideas?
- Justification: how can we demonstrate that they are true?

I. Statistics is a language

Theoretical ideas can be represented:
- Verbally: "Culture creates and reinforces power relations"
- Visually: Culture → Power
- Mathematically: P = f(C, e)

Any language is a tool of both discovery and justification.
- Statistics is more of a tool of justification → hypothesis testing, prediction
- It is limited as a tool of discovery → data mining, inductive statistics (factor, cluster analysis, etc.); limited by its inflexibility

Statistics allows us to process a huge number of standardized and comparable pieces of information.

Qualitative (Clinical) vs. Quantitative (Statistical) Judgment

More than a hundred studies compare the two (Grove et al. 2001, Dawes et al. 1989), including: college admission, medical and psychiatric diagnostics, credit assessment, criminal recidivism, job performance, etc.

In the overwhelming majority of the cases statistical judgment was better:
- Even when expert judges had more information
- Even when experts were informed of the statistical prediction
- Even when the statistical model was "inappropriate" but the coefficients had the right sign and unit size

Reasons:
- Limited cognitive capacities
- Common cognitive errors (e.g., overemphasis of recent experience, confirmation bias, ignoring base rates, human prejudice, etc.)
- Separation of the judgment and its outcome
- Self-fulfilling prophecy
- Selection bias

All apply to the qualitative vs. quantitative distinction in social science methodology.


Models are simplified and explicit representations of reality.

Example (Lave and March 1993): Friendships on campus
- Observation: students tend to have friends in adjacent quarters
- Question: what could produce (cause) this pattern?
- Hypothesis: students request to be close to their friends
- Implication: we should not find the same pattern for freshmen
- Assumption: freshmen rarely know their college mates
- Finding: same pattern for freshmen – hypothesis is wrong
- New hypothesis: students befriend others close by
- What is the exact process?
- Can you generalize, and broaden the context? To all colleges in the US? Beyond colleges in the US? Beyond the US?
- What are the implications of your theory?

Causal models

Friendship ties are caused by something (e.g., physical proximity).

Causation is an asymmetric relationship between two things: the cause and the effect.

John Stuart Mill's 3 main criteria of causation (System of Logic, Book III, Chapters V–VIII):
- Empirical association – statistics is strong in revealing this
- Appropriate time order – statistics often assumes this
- Non-spuriousness (excluding other forms of causation) – statistics uses multivariate models to establish this

Representations of causality:
- Verbal: narratives
- Visual: Cause (X) → Effect (Y) --- Proximity → Friendship
- Mathematical: Y = f(X, e)

f = function, e.g., Y = 2X or Y = e^(1/ln(34X + .5X²))

The simplest function: linear – the change in Y is constant as X changes.

Price of the chocolate bar = $2

 i | # of Chocolate Bars (X) | Cost Paid $ (Y)
 1 |           0             |        0
 2 |           1             |        2
 3 |           2             |        4
 4 |           3             |        6
 5 |           4             |        8
 6 |           5             |       10

Cost = f(count) → Y = 2X or Yi = 2Xi

Price of the chocolate bar = $1 entry fee + $2 per bar

 i | # of Chocolate Bars (X) | Cost Paid $ (Y)
 1 |           0             |        1
 2 |           1             |        3
 3 |           2             |        5
 4 |           3             |        7
 5 |           4             |        9
 6 |           5             |       11

Price = f(count) → Y = 1 + 2X or Yi = 1 + 2Xi

Yi = a + bXi, where a = intercept (here a = 1) and b = slope (here b = 2)

Y1 = a + bX1 → 1 = 1 + 2*0
Y2 = a + bX2 → 3 = 1 + 2*1
…
Yn = a + bXn

This is a deterministic linear function.
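Not part of the slides, but the deterministic linear function above can be sketched in a few lines of Python; the variable names are ours:

```python
# Deterministic linear function Y = a + b*X for the chocolate-bar example:
# $1 entry fee (intercept a) plus $2 per bar (slope b).
a, b = 1, 2
bars = [0, 1, 2, 3, 4, 5]
costs = [a + b * x for x in bars]
print(costs)  # → [1, 3, 5, 7, 9, 11]; each extra bar raises the cost by b = $2
```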
Case Summaries (limited to first 100 cases):

 i | NAME  | AGE | INCOME
 1 | Tom   |  30 |   19
 2 | Ben   |  30 |   23
 3 | Jane  |  40 |   26
 4 | Steve |  40 |   30
 5 | Cathy |  37 |   27
 6 | Diane |  51 |   31
 N = 6

[Scatterplot: INCOME (0–36) by AGE (0–60), with the six named points plotted]
br
var(y)
sy
r
var(x)
sx


 i | Yi (income) | Xi (age) | Yi-Ȳ       | Xi-X̄       | (Xi-X̄)(Yi-Ȳ) | (Yi-Ȳ)²      | (Xi-X̄)²
 1 |     19      |    30    | 19-26 = -7 | 30-38 = -8 | (-8)*(-7)= 56 | (-7)*(-7)=49 | (-8)*(-8)= 64
 2 |     23      |    30    | 23-26 = -3 | 30-38 = -8 | (-8)*(-3)= 24 | (-3)*(-3)= 9 | (-8)*(-8)= 64
 3 |     26      |    40    | 26-26 =  0 | 40-38 =  2 |    0*2   =  0 |    0*0   = 0 |    2*2   =  4
 4 |     30      |    40    | 30-26 =  4 | 40-38 =  2 |    4*2   =  8 |    4*4   =16 |    2*2   =  4
 5 |     27      |    37    | 27-26 =  1 | 37-38 = -1 |  1*(-1)  = -1 |    1*1   = 1 |  (-1)*(-1)= 1
 6 |     31      |    51    | 31-26 =  5 | 51-38 = 13 |   5*13   = 65 |    5*5   =25 |   13*13  =169
 Σ |    156      |   228    |      0     |      0     |      152      |     100      |     306
 Mean:   26          38

b = Σ(Xi-X̄)(Yi-Ȳ) / Σ(Xi-X̄)² = 152/306 = 0.4967

a = Ȳ - bX̄ = 26 - 0.4967*38 = 7.1242

Incomei = a + 0.4967*Agei + ei
Incomei = 7.1242 + 0.4967*Agei + ei

Yi = 7.1242 + 0.4967*Xi + ei
7.1242 → the value of Y when X = 0 (income at age 0)
+0.4967 → the change in Y for a one-unit change in X (income change for each year increase in age)
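The deviation-score arithmetic above can be checked with a short Python sketch (not from the slides; variable names are ours):

```python
# Least-squares slope and intercept for the six-person age/income data
# (income in $000s), computed from deviation cross-products as in the table.
age = [30, 30, 40, 40, 37, 51]
income = [19, 23, 26, 30, 27, 31]

n = len(age)
x_bar = sum(age) / n           # 38
y_bar = sum(income) / n        # 26

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, income))  # 152
sxx = sum((x - x_bar) ** 2 for x in age)                           # 306

b = sxy / sxx          # slope = 152/306 ≈ 0.4967
a = y_bar - b * x_bar  # intercept ≈ 7.1242
print(round(b, 4), round(a, 4))
```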


How good is our model?

Our measure is the Residual Sum of Squares, also called the Sum of Squared Errors (SSE).

 i | observed Yi | calculated Pred(Yi)=a+bXi | residual/error ei=Yi-Pred(Yi) | squared ei²=ei*ei
 1 |     19      |         22.026            |          -3.0261              |      9.1573
 2 |     23      |         22.026            |            .9739              |      0.9485
 3 |     26      |         26.993            |           -.9935              |      0.9870
 4 |     30      |         26.993            |           3.0065              |      9.0390
 5 |     27      |         25.503            |           1.4967              |      2.2401
 6 |     31      |         32.458            |          -1.4575              |      2.1243
 Σ |             |                           |             0                 |     24.4962

Is the SSE of Σei² = 24.4962 a lot or a little?

Compared to what?
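The residual table can be reproduced with a quick Python sketch (ours, not the slides'; the coefficients are rounded, so sums match only approximately):

```python
# Residuals and SSE for the fitted line Y = 7.1242 + 0.4967*X,
# reproducing the residual table above.
a, b = 7.1242, 0.4967
age = [30, 30, 40, 40, 37, 51]
income = [19, 23, 26, 30, 27, 31]

pred = [a + b * x for x in age]
resid = [y - p for y, p in zip(income, pred)]
sse = sum(e * e for e in resid)

print(round(sum(resid), 3))  # residuals sum to ~0 (up to rounding of a and b)
print(round(sse, 2))         # ≈ 24.50
```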
[Scatterplot: INCOME (0–36) by AGE (0–60) with the regression line and the six named points]

If we add Bob (18 years old and making $30K), the line shifts; now Tom becomes an outlier (like Bob). Keeping Bob but dropping Tom shifts the line again.

In small samples individual cases (or a small set of cases) can influence where the regression line goes.



Can we generalize?

Intercept in the population: α
Slope in the population: β

Do we have a probability (random) sample? If yes, we can proceed.
Are the coefficients significantly different from 0? Is α ≠ 0; β ≠ 0?
Is R-square significantly different from 0? Is R² ≠ 0?

Both a (intercept in the sample) and b (slope in the sample) have a probability distribution, and so does R-square. Suppose we take many random samples of N=6 from this population. Each time we will get an intercept and a slope.

http://lstat.kuleuven.be/java

We get a sampling distribution with the following characteristics:
1. It has a normal (bell) shape.
2. Its expected value is the population or true value (E(a) = α; E(b) = β).
3. The standard deviation of the sampling distribution (the standard error) is:

for b:  var(b) = σ²/Σxi²                    →  s.e.(b) = √var(b)
for a:  var(a) = σ² * (1/N + X̄²/Σxi²)     →  s.e.(a) = √var(a)

where xi = Xi - X̄ (deviations from the mean).

σ² = Σεi²/N → Mean Squared Error (Mean Residual Sum of Squares), where εi is the distance between observation i and the TRUE regression line. Because we don't know the TRUE regression line, we can only estimate εi. Our best guess is ei. So our estimate of σ² is s² = Σei²/(N-2).

s.e.(a) = √(24.5/(6-2)) * √(1/6 + 38²/306) = 2.475 * 2.2103 = 5.470
s.e.(b) = √( (24.5/(6-2)) / 306 ) = .141
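These standard-error formulas can be checked numerically; a minimal Python sketch (ours), assuming s² = SSE/(N-2) as in the text:

```python
import math

# Standard errors of a and b for the age/income example.
age = [30, 30, 40, 40, 37, 51]
n = len(age)
x_bar = sum(age) / n                      # 38
sxx = sum((x - x_bar) ** 2 for x in age)  # 306
sse = 24.4962                             # from the residual table

s2 = sse / (n - 2)                        # estimate of sigma^2, ≈ 6.124
se_b = math.sqrt(s2 / sxx)                               # ≈ 0.141
se_a = math.sqrt(s2 * (1 / n + x_bar ** 2 / sxx))        # ≈ 5.470
print(round(se_b, 3), round(se_a, 3))
```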
Testing if α ≠ 0:  t = (a - α)/s.e.(a) = (7.124 - 0)/5.470 = 1.302   d.f. = n-2 = 4

Testing if β ≠ 0:  t = (b - β)/s.e.(b) = (.497 - 0)/.141 = 3.511    d.f. = n-2 = 4


Income000 |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
----------+---------------------------------------------------------------
      Age |   .496732    .1414697   3.51    0.025     .1039492    .8895148
    _cons |  7.124183    5.469958   1.30    0.263    -8.062854    22.31122





To evaluate this we use the ANalysis Of VAriance (ANOVA) table:

   Source |     SS        df       MS
----------+-------------------------------
    Model | 75.503268      1    75.503268
 Residual | 24.496732      4   6.12418301
----------+-------------------------------
    Total |       100      5           20

Number of obs  = 6
F(1, 4)        = 12.33
Prob > F       = 0.0246
R-squared      = 0.7550
Adj R-squared  = 0.6938
Root MSE       = 2.4747

We calculate the F-statistic:

F(reg d.f., res d.f.) = (RegSS/Reg d.f.) / (SSE/Res d.f.)

Reg d.f. = K (# of independent variables)
Res d.f. = N - K - 1

F = (75.503/1)/(24.497/(6-1-1)) = 12.329   d.f. = 1, 4

In a simple regression F is the squared value of the t for the slope: 3.511² = 12.327 (the discrepancy is due to rounding).

The F distribution is a relative of the t distribution. Both are based on the normal distribution.
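The F computation and the F = t² identity can be verified with a short Python sketch (ours, using the values from the tables above):

```python
# F-statistic from the ANOVA decomposition, and the identity F = t²
# for the slope in a simple regression.
reg_ss, sse = 75.503268, 24.496732
k, n = 1, 6

f = (reg_ss / k) / (sse / (n - k - 1))   # ≈ 12.33
t_slope = 0.496732 / 0.1414697           # ≈ 3.511
print(round(f, 2), round(t_slope ** 2, 2))  # the two agree up to rounding
```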

Verbal: Despite the fact that many see schools as the ultimate vehicle of social mobility, schools reproduce social inequalities by denying high-quality public education to the poor.

Visual: Family Income → School Quality

Statistical: School quality = f(Family income, e)
Academic Performance Index (API) in California Public Schools in 2006 as a Function of the Percent of Students Receiving Subsidized Meals

[Histogram of API13 (density scale, roughly 0–.005) and scatterplot of API13 (0–1000) by MEALS (0–100)]

 Variable |   Obs       Mean    Std. Dev.   Min   Max
----------+-------------------------------------------
    API13 | 10242   784.2502    102.2748    311   999

   Source |       SS         df        MS
----------+-------------------------------------
    Model | 23852172.8        1   23852172.8
 Residual | 83270168.8    10240   8131.85243
----------+-------------------------------------
    Total |  107122342    10241   10460.1447

Number of obs  = 10242
F(1, 10240)    = 2933.18
Prob > F       = 0.0000
R-squared      = 0.2227
Adj R-squared  = 0.2226
Root MSE       = 90.177

    API13 |      Coef.   Std. Err.        t    P>|t|    [95% Conf. Interval]
----------+------------------------------------------------------------------
    MEALS |  -1.730451    .0319514   -54.16    0.000    -1.793082    -1.66782
    _cons |   885.6367    2.073267   427.17    0.000     881.5727    889.7008

Suppose we eliminate the natural metric of the variables and turn them into Z-scores:

Z-score for X:  Z_Xi = (Xi - X̄)/S_X
Z-score for Y:  Z_Yi = (Yi - Ȳ)/S_Y

The slope will be different because now everything is measured in standard deviations. It will tell you that "Y will change that many standard deviations by one standard deviation change in X." It is called the standardized regression coefficient, a.k.a. path coefficient, a.k.a. beta weight or beta coefficient.

There is no intercept in a standardized regression:

Z_Yi = a + β*Z_Xi + ei
a = Z̄_Y - β*Z̄_X
Z̄_Y = Z̄_X = 0, so a = 0

Z_Yi = β*Z_Xi + ei

We multiply both sides of the equation by Z_Xi:

Z_Xi*Z_Yi = β*Z_Xi*Z_Xi + Z_Xi*ei

We do that for each case 1st, 2nd, …, nth:

Z_X1*Z_Y1 = β*Z_X1*Z_X1 + Z_X1*e1
Z_X2*Z_Y2 = β*Z_X2*Z_X2 + Z_X2*e2
…
Z_Xn*Z_Yn = β*Z_Xn*Z_Xn + Z_Xn*en

Summing the equations and dividing by n, we get the average cross-products of Z-scores, which are correlations:

ΣZ_Xi*Z_Yi/n = β*(ΣZ_Xi*Z_Xi/n) + ΣZ_Xi*ei/n
r_XY = β*r_XX + r_Xe

This is the normal equation. On one side there is a correlation; on the other side, path coefficients and correlations.

Since r_XX = 1 and r_Xe = 0, the final normal equation is:

r_XY = β

This is how you get the metric (unstandardized) slope coefficient from the path coefficient:

b = (S_Y/S_X)*β = (S_Y/S_X)*r_XY
. correlate API13 MEALS, means
(obs=10242)

 Variable |      Mean   Std. Dev.   Min   Max
----------+-----------------------------------
    API13 |  784.2502    102.2748   311   999
    MEALS |  58.58963    27.88903     0   100

          |   API13    MEALS
----------+-----------------
    API13 |  1.0000
    MEALS | -0.4719   1.0000

b = [102.2748/27.88903] * (-.4718717) = -1.730451
a = 784.2502 - (-1.730451)*58.58963 = 885.6367
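The conversion from the standardized slope back to the metric slope can be sketched in Python (ours, using the summary statistics from the Stata output above):

```python
# Converting the standardized slope (beta = r_XY) to the metric slope
# b = (S_Y/S_X)*r_XY, and recovering the intercept a = Ȳ - b*X̄.
sd_y, sd_x = 102.2748, 27.88903      # API13, MEALS
r_xy = -0.4718717                    # correlation = standardized slope
mean_y, mean_x = 784.2502, 58.58963

b = (sd_y / sd_x) * r_xy             # ≈ -1.7305
a = mean_y - b * mean_x              # ≈ 885.64
print(round(b, 4), round(a, 2))
```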
. regress API13 MEALS, beta

   Source |       SS         df        MS
----------+-------------------------------------
    Model | 23852172.8        1   23852172.8
 Residual | 83270168.8    10240   8131.85243
----------+-------------------------------------
    Total |  107122342    10241   10460.1447

Number of obs  = 10242
F(1, 10240)    = 2933.18
Prob > F       = 0.0000
R-squared      = 0.2227
Adj R-squared  = 0.2226
Root MSE       = 90.177

    API13 |      Coef.   Std. Err.        t    P>|t|         Beta
----------+--------------------------------------------------------
    MEALS |  -1.730451    .0319514   -54.16    0.000    -.4718717
    _cons |   885.6367    2.073267   427.17    0.000            .
[Residual plot: residuals (-600 to 200) by MEALS (0–100)]

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of API13

chi2(1)     = 120.54
Prob > chi2 = 0.0000

. regress API13 MEALS, vce(hc3) beta

The Standard Error is corrected to make it robust against the violation of the homoscedasticity assumption.

Linear regression

Number of obs  = 10242
F(1, 10240)    = 3091.00
Prob > F       = 0.0000
R-squared      = 0.2227
Root MSE       = 90.177

          |  Robust HC3
    API13 |      Coef.   Std. Err.        t    P>|t|         Beta
----------+--------------------------------------------------------
    MEALS |  -1.730451     .031125   -55.60    0.000    -.4718717
    _cons |   885.6367    2.152182   411.51    0.000            .
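Not part of the slides: a minimal NumPy sketch of the HC3 correction that Stata's vce(hc3) applies, illustrated on the small six-person age/income data from earlier (variable names are ours, and this is an illustration of the formula, not Stata's implementation):

```python
import numpy as np

# HC3 robust standard errors for a simple regression: the OLS covariance
# sandwich with weights e_i^2 / (1 - h_ii)^2, where h_ii is the leverage.
x = np.array([30, 30, 40, 40, 37, 51], dtype=float)
y = np.array([19, 23, 26, 30, 27, 31], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # design matrix [1, X]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                    # OLS estimates (a, b)

e = y - X @ beta                            # residuals
h = np.sum((X @ XtX_inv) * X, axis=1)       # leverages (hat-matrix diagonal)
omega = np.diag(e**2 / (1 - h)**2)          # HC3 weighting
cov_hc3 = XtX_inv @ X.T @ omega @ X @ XtX_inv
se_hc3 = np.sqrt(np.diag(cov_hc3))          # robust s.e.(a), s.e.(b)
print(np.round(beta, 4), np.round(se_hc3, 4))
```

In large samples (like the API data) the HC3 and conventional standard errors are close when homoscedasticity roughly holds; they diverge when it does not.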