#### Transcript Introduction to Statistics: Political Science (Week 1)

```
Introduction to Statistics:
Political Science (Class 2)
Central Limit Theorem,
T-statistics, and using split sample
analysis and multivariate
regression to deal with confounds
Today…
• A review of what standard errors and T-statistics tell us
• Multivariate regression
The goal of statistical analysis?
• We want to know: *true* “population”
mean or relationship
• What we have: sample of the units we are
interested in
• Thus we estimate the mean or
relationship
– What is an estimate?
Actually we estimate 2 things
• Estimate of mean or relationship
– We know how to get this (calculate the mean
or find the best fit line)
• Estimate of uncertainty
– Often (typically?): How confident can we be
that a mean or relationship is not zero
– We can’t measure our uncertainty directly
(we’re uncertain – duh!)
The Central Limit Theorem
• In repeated sampling (if we redrew over
and over and over and recalculated)…
– the average of the estimates will be centered
on the population (“true”) mean
– the distribution of estimates will be
approximately normal…
Like this
This width depends on:
1. Variance in population (more → wider)
2. Number of cases sampled (more → narrower)
[Figure: Coin toss. y-axis: Number of Samples (0 to 9); x-axis: 0% to 100%]
Mean ideology of the American public
• How would you rate yourself on the following scale?
  Scale points 1 to 7; labels: Very Liberal, Liberal, Somewhat Liberal,
  Somewhat Conservative, Conservative, Very Conservative
• If we were omniscient (or could ask every single person) we would know that
  the true average is 5.0
• But we’re not/we can’t… Instead we call 100 people at
  random… and then we do that again and again…
Estimating Mean Ideology
Sample   Mean   SE      LB (Mean-2SEs)   UB (Mean+2SEs)
1        4.8    0.167   4.466            5.134
2        5.1    0.176   4.748            5.452
3        5.3    0.19    4.92             5.68
4        4.9    0.18    4.54             5.26
5        4.7    0.168   4.364            5.036
6        5      0.176   4.648            5.352
7        5.1    0.148   4.804            5.396
8        5.2    0.2     4.8              5.6
9        4.7    0.168   4.364            5.036
10       4.9    0.124   4.652            5.148
In any given sample we would be about 95% confident that the true
population mean was somewhere within this range
About 95% of estimates of the mean will be within about
+/- two standard errors of the population value
[Figure: distribution of estimates centered on 5.0, with a width of One Standard Error marked]
Same idea with regression
coefficient
• If we were able to redraw new samples
over and over and re-estimate β…
• Typically (always for our purposes here)
we’re testing whether a coefficient = 0
So T can be thought of as:
how many SEs from 0 that
the coefficient is
                   Coef    SE Coef   T       P
Democracy Scores   0.259   0.023     11.34   0.000
Constant           23.21   0.253     91.82   0.000
[Figure: distribution centered on 0, with T = -11.34 and T = 11.34 marked]
If the true relationship were 0 (no relationship), getting an estimated coefficient with
a T-value with an absolute value greater than 11.34 by chance would be
extremely unlikely
So we can be confident rejecting the null hypothesis
(What’s the null? Why do we set things up this way?)
1 v. 2-tailed tests
1-tailed: You have a strong prior about the direction of the
relationship (if the relationship turns
out to be in the other direction
you can’t reject the null – even
w/a large t-statistic)
2-tailed: No strong prior about the
direction of relationship – more
conservative test
Causal relationships
• Identifying associations is nice, but usually
we want to identify causality
• Two primary threats
– Reverse causation (we’ll table this for now and talk about it in a few weeks)
– Confounding variables
Need to rule out alternative explanations
Bush was particularly unpopular at
the end of his presidency…
did this help Obama?
Measuring “reverse coattails” effect
• …I'll read the name of a person and I'd like you to rate that person
using something we call the feeling thermometer. Ratings between
50 degrees and 100 degrees mean that you feel favorable and warm
toward the person. Ratings between 0 degrees and 50 degrees
mean that you don't feel favorable toward the person and that you
don't care too much for that person. You would rate the person at
the 50 degree mark if you don't feel particularly warm or cold toward
the person.
• Bivariate regression
Y = β0 + β1X + u SO…
Obama FT = β0 + β1(Bush FT) + u
Obama FT = 80.4 + (-0.43*Bush FT)
           Coef.   SE     T        P-value
Bush FT    -.43    .018   -24.12   0.000
Constant   80.4    .852   94.37    0.000
R-squared = 0.203
[Figure: Obama Feeling Thermometer (y-axis) vs. Bush Feeling Thermometer (x-axis, 0 to 100)]
What else might explain this
(strong!) relationship?
• Other factors that might affect
evaluations of both Obama and
Bush?
Party Identification?
[Diagram: Party Identification, Bush Feeling Thermometer, Obama Feeling Thermometer]
Party Identification
• Generally speaking, do you usually think
of yourself as a Democrat, a Republican,
an Independent, or what?
-3 = Strong Republican
-2 = Weak Republican
-1 = Lean Republican
0 = Independent
1 = Lean Democrat
2 = Weak Democrat
3 = Strong Democrat
Party Identification  FTs
Predict Obama Feeling Thermometer
Coef. SE
T
Party Identification 8.71 .234 37.16
Constant
58.1 .507 114.71
P-value
0.000
0.000
Predict Bush Feeling Thermometer
Coef. SE
T
Party Identification -8.19 .259 -31.58
Constant
43.3 .560 77.38
P-value
0.000
0.000
Accounting for a confound by
splitting the sample…
• Among Democrats:
– Mean evaluation of Bush: 24.7
– Mean evaluation of Obama: 79.2
• Among Republicans:
– Mean evaluation of Bush: 65.9
– Mean evaluation of Obama: 35.5
• Let’s see what happens when we run separate
regressions for Democrats and Republicans…
Model with all respondents: Obama FT = 80.4 + (-0.43*Bush FT)
Democrats: Obama FT = 83.6 + (-0.18*Bush FT)
Republicans: Obama FT = 50.4 + (-0.23*Bush FT)
[Figure: fitted lines for the models above; y-axis: Obama Feeling Thermometer, x-axis: Bush Feeling Thermometer (0 to 100)]
Party ID as Confound
[Diagram: Bush Feeling Thermometer (X), Obama Feeling Thermometer (Y), Party Identification (Z)]
• We only want to give Bush FT explanatory “credit” for this part of
  the relationship
• Not this part
Multivariate Regression
Y = β0 + β1X1 + β2X2 + u
Obama FT = β0 + β1(Bush FT) +
β2(Party Identification) + u
(party identification -3=strong Republican; 3=strong Democrat)
Multivariate Regression
                       Coef.   St.Err   T       P
Bush FT                -.165   .019     -8.72   0.000
Party Identification   7.354   .278     26.44   0.000
Constant               65.28   .962     67.89   0.000
Language: relationship between X1 and Y
controlling for X2 (OR holding X2 constant)
(more precisely: “controlling for the linear relationship
between X2 and Y”)
[Diagram: overlap among Bush Feeling Thermometer, Obama Feeling Thermometer, and Party Affiliation]
• Bivariate regression: Bush FT gets “credit” for all of this overlap
• Bush FT only gets “credit” for this part of the overlap
• Party Affiliation only gets “credit” for this part of the overlap
• No variable gets “credit” for this part (but it does affect the R-squared)
Getting predicted values
                       Coef.   St.Err   T       P
Bush FT                -.165   .019     -8.72   0.000
Party Identification   7.354   .278     26.44   0.000
Constant               65.28   .962     67.89   0.000
Obama FT = β0 + β1(Bush FT) +
β2 (Party Identification) + u
Obama FT = 65.28 + (-.165)(Bush FT) +
7.354(Party Identification) + u
What does the coefficient on the constant mean?
Expected Value for a Strong Democrat who gave Bush a
feeling thermometer rating of 50?
Notes and Next Time
• No Class on Tuesday
• Remember to look at the homework
assignment in time to get TA office hour
help before it’s due next Thursday!
• Next time:
– R-squared
– Non-continuous explanatory variables
– Joint significance of variables (F-tests)
```
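
The repeated-sampling idea on the Central Limit Theorem slides can be tried out in a short simulation. This is a minimal sketch, assuming a fair coin (true share of heads 0.5); the function name `sample_means`, the sample sizes, the number of redraws, and the seed are illustrative choices, not values from the slides.

```python
import random
import statistics

def sample_means(n_tosses, n_samples, p_heads=0.5, seed=0):
    """Redraw n_samples samples of n_tosses coin flips; return each sample's share of heads."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        sum(rng.random() < p_heads for _ in range(n_tosses)) / n_tosses
        for _ in range(n_samples)
    ]

# Hypothetical setup: 2000 redraws, once with small samples and once with large ones.
small = sample_means(n_tosses=25, n_samples=2000)
large = sample_means(n_tosses=400, n_samples=2000)

# The estimates are centered on the population value (0.5) in both cases ...
print("average of estimates:", statistics.mean(small), statistics.mean(large))
# ... and the spread of the estimates narrows as more cases are sampled.
print("spread of estimates: ", statistics.stdev(small), statistics.stdev(large))
```

The spread printed on the second line plays the role of the standard error: it is what the "width" on the slide refers to, and it shrinks as the number of cases per sample grows.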
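
The LB and UB columns of the "Estimating Mean Ideology" table are the sample mean minus and plus two standard errors. The sketch below recomputes them from the Mean and SE columns and checks each interval against the true mean of 5.0 given on the slide. The helper name `interval` is an illustrative choice.

```python
TRUE_MEAN = 5.0  # population mean stated on the slide

# (sample number, sample mean, standard error), copied from the slide's table
samples = [
    (1, 4.8, 0.167), (2, 5.1, 0.176), (3, 5.3, 0.19), (4, 4.9, 0.18),
    (5, 4.7, 0.168), (6, 5.0, 0.176), (7, 5.1, 0.148), (8, 5.2, 0.2),
    (9, 4.7, 0.168), (10, 4.9, 0.124),
]

def interval(mean, se, width=2):
    """Return (lower bound, upper bound) = mean -/+ width standard errors."""
    return (mean - width * se, mean + width * se)

covered = 0
for number, mean, se in samples:
    lb, ub = interval(mean, se)
    hit = lb <= TRUE_MEAN <= ub
    covered += hit
    print(f"{number:>2}  LB={lb:.3f}  UB={ub:.3f}  contains 5.0: {hit}")

print(f"{covered} of {len(samples)} intervals contain the true mean")
```

With only ten samples it is unremarkable for every interval to contain 5.0; the "about 95%" claim is about what happens over many redraws.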
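
The slides describe T as the number of standard errors a coefficient lies from 0. A small sketch of that arithmetic, using the coefficients and standard errors from the multivariate regression table. The function name `t_statistic` is an illustrative choice, and because the slide shows rounded coefficients and standard errors, the recomputed values only approximately match the reported T column.

```python
# name: (coefficient, standard error, T as reported on the slide)
rows = {
    "Bush FT": (-0.165, 0.019, -8.72),
    "Party Identification": (7.354, 0.278, 26.44),
    "Constant": (65.28, 0.962, 67.89),
}

def t_statistic(coef, se, null_value=0.0):
    """How many standard errors the estimate is from the null value (0 by default)."""
    return (coef - null_value) / se

for name, (coef, se, reported) in rows.items():
    t = t_statistic(coef, se)
    # An absolute T of roughly 2 or more corresponds to the +/- two-SE rule of thumb.
    print(f"{name:<22} computed T = {t:7.2f}   slide T = {reported:7.2f}   |T| > 2: {abs(t) > 2}")
```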
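
The "Party ID as Confound" slides argue that a bivariate slope can absorb credit that belongs to a third variable. The sketch below illustrates this on simulated data, not on the survey data from the slides: every number in the data-generating step (a true effect of X on Y of -0.2, the effects of Z, the noise levels, the seed) is a made-up assumption. Controlling for Z is done here by first removing Z's linear relationship with both X and Y and then regressing what is left, which gives the same coefficient on X as a multivariate regression of Y on X and Z.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def residuals(x, y):
    """What is left of y after removing its linear relationship with x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

rng = random.Random(1)
n = 5000
# Simulated confound Z on a -3..3 scale (hypothetical stand-in for party identification).
z = [rng.choice([-3, -2, -1, 0, 1, 2, 3]) for _ in range(n)]
# Z pushes X down and Y up; X itself has a true effect of -0.2 on Y (all assumed values).
x = [50 - 8 * zi + rng.gauss(0, 15) for zi in z]
y = [55 + 8 * zi - 0.2 * xi + rng.gauss(0, 15) for zi, xi in zip(z, x)]

bivariate = slope(x, y)                                  # X gets credit for Z's effect too
controlled = slope(residuals(z, x), residuals(z, y))     # X's coefficient holding Z constant

print("bivariate slope: ", bivariate)
print("controlling for Z:", controlled)
```

In this setup the bivariate slope comes out far more negative than the assumed true effect, while the controlled slope lands near -0.2, the same pattern as the drop from -0.43 to -.165 on the slides.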
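
The closing questions on the "Getting predicted values" slide can be worked through by plugging values into the fitted equation, Obama FT = 65.28 + (-.165)(Bush FT) + 7.354(Party Identification). A minimal sketch follows; the function name `predict_obama_ft` is an illustrative choice, and the error term u is dropped because a prediction uses only the fitted part.

```python
def predict_obama_ft(bush_ft, party_id):
    """Predicted Obama feeling thermometer from the slide's multivariate coefficients.

    bush_ft:  Bush feeling thermometer rating (0 to 100)
    party_id: party identification (-3 = strong Republican ... 3 = strong Democrat)
    """
    return 65.28 + (-0.165) * bush_ft + 7.354 * party_id

# The constant is the prediction when every explanatory variable is 0:
# an Independent (party_id = 0) who rated Bush at 0.
print(f"{predict_obama_ft(0, 0):.2f}")    # 65.28

# Strong Democrat who gave Bush a rating of 50: 65.28 - 8.25 + 22.062
print(f"{predict_obama_ft(50, 3):.2f}")   # 79.09

# For comparison, a strong Republican with the same Bush rating.
print(f"{predict_obama_ft(50, -3):.2f}")  # 34.97
```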