Sampling and Inference
The Quality of Data and Measures
March 23, 2006
Why we talk about sampling
• General citizen education
• Understand data you’ll be using
• Understand how to draw a sample, if you need to
• Make statistical inferences
Why do we sample?
Cost/benefit
[Figure: benefit (precision) and cost (hassle factor) plotted against N]
How do we sample?
• Simple random sample
  – Variant: systematic sample with a random start
• Stratified
• Cluster
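The first three designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 1,000 unit IDs, the four strata defined by `u % 4`, and the function names are all hypothetical, and cluster sampling is left out.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n distinct units; every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start in the first interval."""
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random start in [0, k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of fixed size within each known stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

rng = random.Random(0)
population = list(range(1000))    # hypothetical sampling frame of unit IDs
srs = simple_random_sample(population, 50, rng)
sys_sample = systematic_sample(population, 50, rng)
strat = stratified_sample(population, lambda u: u % 4, 10, rng)
```

Stratifying on a known characteristic guarantees that every stratum appears in the sample, which a simple random sample cannot promise for small groups.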
Stratification
• Divide sample into subsamples, based on known characteristics (race, sex, religiosity, continent, department)
• Benefit: preserve or enhance variability
Cluster sampling
[Figure: cluster sampling stages: Block, HH Unit, Individual]
Effects of samples
• Obvious: influences marginals
• Less obvious
  – Allows effective use of time and effort
  – Effect on multivariate techniques
    • Sampling of independent variable: greater precision in regression estimates
    • Sampling on dependent variable: bias
Sampling on Independent Variable
[Figure: scatterplots of y against x]
Sampling on Dependent Variable
[Figure: scatterplots of y against x]
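A small simulation can illustrate why selecting on the dependent variable biases a regression slope while selecting on the independent variable does not. It is a sketch with made-up data: the true slope of 2, the noise level, and the cutoffs (`x > 5`, `y > 10`) are assumptions chosen for illustration, not values from the slides.

```python
import random

def ols_slope(xs, ys):
    """Bivariate least-squares slope: cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(5000)]
y = [2.0 * xi + rng.gauss(0, 5) for xi in x]   # assumed true slope: 2

slope_full = ols_slope(x, y)

# Keep only cases with large x (selection on the independent variable).
kept_on_x = [(xi, yi) for xi, yi in zip(x, y) if xi > 5]
slope_select_x = ols_slope(*zip(*kept_on_x))

# Keep only cases with large y (selection on the dependent variable).
kept_on_y = [(xi, yi) for xi, yi in zip(x, y) if yi > 10]
slope_select_y = ols_slope(*zip(*kept_on_y))

print(slope_full, slope_select_x, slope_select_y)
```

In this setup the slope estimated after selecting on x stays near 2, while the slope estimated after selecting on y is pulled well below 2.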
Sampling: Consequences for Statistical Inference
Statistical Inference: Learning About the Unknown From the Known
• Reasoning forward: distributions of sample means, when the population mean, s.d., and n are known.
• Reasoning backward: learning about the population mean when only the sample, s.d., and n are known
Reasoning Forward
Exponential Distribution Example
[Figure: histogram of inc, from 0 to 1,000,000]
Mean = 250,000; Median = 125,000; s.d. = 283,474; Min = 0; Max = 1,000,000
Consider 10 random samples, of n = 100 apiece

Sample   Mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3

[Figure: histogram of inc]
Consider 10,000 samples of n = 100
N = 10,000; Mean = 249,993; s.d. = 28,559; Skewness = 0.060; Kurtosis = 2.92
[Figure: histogram of (mean) inc]
Consider 1,000 samples of various sizes

n       Mean      s.d.     Skew    Kurt
10      250,105   90,891   0.38    3.13
100     250,498   28,297   0.02    2.90
1000    249,938   9,376    -0.50   6.80

[Figure: histograms of (mean) inc, one panel per sample size]
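The shrinking spread of the sample means can be reproduced with a short simulation. This sketch assumes a pure exponential population with mean 250,000, which is not identical to the slide's population (that one tops out at 1,000,000), so the numbers will be close to the table above but not equal to it.

```python
import random
import statistics

rng = random.Random(2006)
POP_MEAN = 250_000   # assumed population mean, matching the slide

def sample_mean(n):
    """Mean of one random sample of size n from the assumed exponential."""
    return statistics.fmean([rng.expovariate(1 / POP_MEAN) for _ in range(n)])

results = {}
for n in (10, 100, 1000):
    means = [sample_mean(n) for _ in range(1000)]   # 1,000 samples per size
    results[n] = (statistics.fmean(means), statistics.stdev(means))

for n, (center, spread) in results.items():
    print(f"n={n:5d}  mean of means={center:10.0f}  s.d. of means={spread:9.0f}")
```

The center stays near the population mean at every sample size, while the spread falls roughly in proportion to one over the square root of n.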
Difference of means example
State 1: Mean = 250,000
State 2: Mean = 300,000
[Figure: histograms of inc (State 1) and inc2 (State 2)]
Take 1,000 samples of 10, of each state, and compare them
First 10 samples

Sample   State 1        State 2
1        311,410   <    365,224
2        184,571   <    243,062
3        468,574   >    438,336
4        253,374   <    557,909
5        220,934   >    189,674
6        270,400   <    284,309
7        127,115   <    210,970
8        253,885   <    333,208
9        152,678   <    314,882
10       222,725   >    152,312
1,000 samples of 10
State 2 > State 1: 673 times
[Figure: sample means, (mean) inc, both axes 0 to 1.1e+06]
1,000 samples of 100
State 2 > State 1: 909 times
[Figure: sample means, (mean) inc, both axes 0 to 1.1e+06]
1,000 samples of 1,000
State 2 > State 1: 1,000 times
[Figure: sample means, (mean) inc, both axes 0 to 1.1e+06]
Another way of looking at it: The distribution of $\text{Inc}_2 - \text{Inc}_1$

n       Mean      s.d.
10      51,845    124,815
100     49,704    38,774
1,000   49,816    13,932

[Figure: histograms of diff, one panel per sample size]
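The two views of the simulation can be tied together. If the differences are treated as roughly normal (an assumption, and a loose one at n = 10), the share of samples in which State 2 beats State 1 should be about the normal probability that the difference exceeds zero, using the reported means and standard deviations. The sketch below compares that approximation with the counts reported on the earlier slides.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# (mean, s.d.) of Inc2 - Inc1 as reported on the slide, keyed by sample size
reported = {10: (51_845, 124_815), 100: (49_704, 38_774), 1000: (49_816, 13_932)}
# Times out of 1,000 that State 2 > State 1, as reported on the earlier slides
observed = {10: 673, 100: 909, 1000: 1000}

# P(diff > 0) = P(Z > -mean/s.d.) = normal_cdf(mean / s.d.)
expected = {n: 1000 * normal_cdf(mean / sd) for n, (mean, sd) in reported.items()}

for n in reported:
    print(f"n={n:5d}  expected about {expected[n]:6.1f}  observed {observed[n]}")
```

The approximation gives roughly 661, 900, and 1,000 wins out of 1,000, close to the observed counts.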
Play with some simulations
• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
• http://www.kuleuven.ac.be/ucs/java/index.htm
Reasoning Backward
When you know n, $\bar{X}$, and s, but want to say something about $\mu$
Central Limit Theorem
As the sample size n increases, the distribution of the mean of a sample taken from practically any population approaches a normal distribution, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$
Calculating Standard Errors
In general: std. err. = $s/\sqrt{n}$
Most important standard errors
• Mean: $s/\sqrt{n}$
• Proportion: $\sqrt{p(1-p)/n}$
• Diff. of 2 means: $\sqrt{s_1^2/n_1 + s_2^2/n_2}$
• Diff. of 2 proportions: $\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$
• Diff. of 2 means (paired data): $s_d/\sqrt{n}$
• Regression (slope) coeff.: $\dfrac{\text{s.e.r.}}{\sqrt{n-1}} \times \dfrac{1}{s_x}$
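The six formulas translate directly into code. The sketch below is one possible rendering; the function names are invented for illustration, and the inputs in the final lines are the figures used in the worked examples later in the deck.

```python
import math

def se_mean(s, n):
    """Standard error of a sample mean."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference of two independent means."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference of two independent proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_paired_diff(s_d, n):
    """Standard error of a mean difference in paired data."""
    return s_d / math.sqrt(n)

def se_slope(ser, n, s_x):
    """Standard error of a bivariate regression slope."""
    return ser / math.sqrt(n - 1) / s_x

print(se_mean(2196, 15))                         # tuition example
print(se_proportion(0.37, 1000))                 # approval example
print(se_diff_means(2196, 15, 1894, 15))         # private vs. public tuition
print(se_diff_proportions(0.29, 300, 0.10, 300)) # independents vs. Democrats
print(se_paired_diff(886, 15))                   # 2003 vs. 2004 tuition
print(se_slope(13.8, 14, 8.14))                  # midterm seat loss regression
```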
Using Standard Errors, we can construct “confidence intervals”
• Confidence interval (ci): an interval between two numbers, where there is a certain specified level of confidence that a population parameter lies
• ci = sample parameter ± multiple * sample standard error
Constructing Confidence Intervals
• Let’s say we draw a sample of tuitions from 15 private universities. Can we estimate what the average of all private university tuitions is?
• N = 15
• Average = 29,735
• S.d. = 2,196
• S.e. = $s/\sqrt{n} = 2{,}196/\sqrt{15} = 567$
N = 15; avg. = 29,735; s.d. = 2,196; s.e. = s/√n = 567
The Picture
[Figure: normal curve centered on the mean, 29,735, with the 68%, 95%, and 99% regions marked]
29,735 - 567 = 29,168; 29,735 + 567 = 30,302
29,735 - 2*567 = 28,601; 29,735 + 2*567 = 30,869
Confidence Intervals for Tuition Example
• 68% confidence interval = 29,735 ± 567 = [29,168 to 30,302]
• 95% confidence interval = 29,735 ± 2*567 = [28,601 to 30,869]
• 99% confidence interval = 29,735 ± 3*567 = [28,034 to 31,436]
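The same arithmetic as a small helper. The function name is hypothetical, and the multiples 1, 2, and 3 are the slide's rounded rule of thumb; exact normal multiples for 95% and 99% are closer to 1.96 and 2.58.

```python
def confidence_interval(estimate, std_err, multiple):
    """Return (low, high) = estimate -/+ multiple * standard error."""
    return (estimate - multiple * std_err, estimate + multiple * std_err)

# Tuition example: sample mean 29,735, standard error 567
tuition = {level: confidence_interval(29_735, 567, multiple)
           for level, multiple in ((68, 1), (95, 2), (99, 3))}

for level, (low, high) in tuition.items():
    print(f"{level}% confidence interval: [{low:,} to {high:,}]")
```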
What if someone (ahead of time) had said, “I think the average tuition of major research universities is $25k”?
• Note that $25,000 is well out of the 99% confidence interval, [28,034 to 31,436]
• Q: How far away is the $25k estimate from the sample mean?
  – A: Do it in z-scores: (29,735-25,000)/567 = 8.35
Constructing confidence intervals of proportions
• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president (March 13-16, 2006 Gallup Poll). Can we estimate the % of all American adults who approve?
• N = 1000
• p = .37
• s.e. = $\sqrt{p(1-p)/n} = \sqrt{.37(1-.37)/1000} \approx 0.02$
N = 1,000; p = .37; s.e. = √(p(1-p)/n) = .02
The Picture
[Figure: normal curve centered on the mean, .37, with the 68%, 95%, and 99% regions marked]
.37 - .02 = .35; .37 + .02 = .39
.37 - 2*.02 = .33; .37 + 2*.02 = .41
Confidence Intervals for Bush approval example
• 68% confidence interval = .37 ± .02 = [.35 to .39]
• 95% confidence interval = .37 ± 2*.02 = [.33 to .41]
• 99% confidence interval = .37 ± 3*.02 = [.31 to .43]
What Gallup said about the confidence interval
• Results are based on telephone interviews with 1,000 national adults, aged 18 and older, conducted March 13-16, 2006. For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error is ±3 percentage points [because the actual standard error is 1.5%, not 2%].
What if someone (ahead of time) had said, “I think Americans are equally divided in how they think about Bush.”
• Note that 50% is well out of the 99% confidence interval, [31% to 43%]
• Q: How far away is the 50% estimate from the sample proportion?
  – A: Do it in z-scores: (.37-.5)/.02 = -6.5 [-8.7 if we divide by 0.015]
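The effect of rounding the standard error can be checked directly. The sketch below uses the poll figures from the slides; the variable names are illustrative, and 1.96 is the usual normal multiple for 95% confidence rather than a number taken from the deck. The "maximum" margin assumes the worst case p = .5, which is the standard reading of that phrase.

```python
import math

n, p = 1000, 0.37

se = math.sqrt(p * (1 - p) / n)                  # unrounded standard error
margin_95 = 1.96 * se                            # 95% margin at p = .37
max_margin_95 = 1.96 * math.sqrt(0.25 / n)       # largest margin, at p = .5

z_with_02 = (p - 0.5) / 0.02     # standard error rounded up to 2%
z_with_015 = (p - 0.5) / 0.015   # standard error rounded to 1.5%

print(se, margin_95, max_margin_95, z_with_02, z_with_015)
```

Both margins come out at about 3 percentage points, which matches Gallup's statement, and either z-score puts 50% far outside the interval.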
Constructing confidence intervals of differences of means
• Let’s say we draw a sample of tuitions from 15 private and 15 public universities. Can we estimate what the difference in average tuitions is between the two types of universities?
• N = 15 in both cases
• Average = 29,735 (private); 5,498 (public); diff = 24,238
• s.d. = 2,196 (private); 1,894 (public)
• s.e. = $\sqrt{s_1^2/n_1 + s_2^2/n_2} = \sqrt{4{,}822{,}416/15 + 3{,}587{,}236/15} = 749$
N = 15 twice; diff = 24,238; s.e. = 749
The Picture
[Figure: normal curve centered on the mean, 24,238, with the 68%, 95%, and 99% regions marked]
24,238 - 749 = 23,489; 24,238 + 749 = 24,987
24,238 - 2*749 = 22,740; 24,238 + 2*749 = 25,736
Confidence Intervals for difference of tuition means example
• 68% confidence interval = 24,238 ± 749 = [23,489 to 24,987]
• 95% confidence interval = 24,238 ± 2*749 = [22,740 to 25,736]
• 99% confidence interval = 24,238 ± 3*749 = [21,991 to 26,485]
What if someone (ahead of time) had said, “Private universities are no more expensive than public universities”
• Note that $0 is well out of the 99% confidence interval, [$21,991 to $26,485]
• Q: How far away is the $0 estimate from the sample difference?
  – A: Do it in z-scores: (24,238-0)/749 = 32.4
Constructing confidence intervals of difference of proportions
• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president (March 13-16, 2006 Gallup Poll). We focus on the 600 who are either independents or Democrats. Can we estimate whether independents and Democrats view Bush differently?
• N = 300 ind.; 300 Dem.
• p = .29 (ind.); .10 (Dem.); diff = .19
• s.e. = $\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2} = \sqrt{.29(1-.29)/300 + .10(1-.10)/300} = .03$
diff. p. = .19; s.e. = .03
The Picture
[Figure: normal curve centered on the mean, .19, with the 68%, 95%, and 99% regions marked]
.19 - .03 = .16; .19 + .03 = .22
.19 - 2*.03 = .13; .19 + 2*.03 = .25
Confidence Intervals for Bush Ind/Dem approval example
• 68% confidence interval = .19 ± .03 = [.16 to .22]
• 95% confidence interval = .19 ± 2*.03 = [.13 to .25]
• 99% confidence interval = .19 ± 3*.03 = [.10 to .28]
What if someone (ahead of time) had said, “I think Democrats and Independents are equally unsupportive of Bush”?
• Note that 0% is well out of the 99% confidence interval, [10% to 28%]
• Q: How far away is the 0% estimate from the sample difference in proportions?
  – A: Do it in z-scores: (.19-0)/.03 = 6.33
Constructing confidence intervals of differences of means in a paired sample
• Let’s say we draw a sample of tuitions from 15 private universities, in 2003 and again in 2004. Can we estimate what the difference in average tuitions is between the two years?
• N = 15
• Averages = 28,102 (2003); 29,735 (2004); diff = 1,632
• s.d. = 2,196 (private); 1,894 (public); 886 (diff)
• s.e. = $s_d/\sqrt{n} = 886/\sqrt{15} = 229$
N = 15; diff = 1,632; s.e. = 229
The Picture
[Figure: normal curve centered on the mean, 1,632, with the 68%, 95%, and 99% regions marked]
1,632 - 229 = 1,403; 1,632 + 229 = 1,861
1,632 - 2*229 = 1,174; 1,632 + 2*229 = 2,090
Confidence Intervals for second difference of tuition means example
• 68% confidence interval = 1,632 ± 229 = [1,403 to 1,861]
• 95% confidence interval = 1,632 ± 2*229 = [1,174 to 2,090]
• 99% confidence interval = 1,632 ± 3*229 = [945 to 2,319]
What if someone (ahead of time) had said, “Private university tuitions did not grow from 2003 to 2004”
• Note that $0 is well out of the 99% confidence interval, [$945 to $2,319]
• Q: How far away is the $0 estimate from the sample difference?
  – A: Do it in z-scores: (1,632-0)/229 = 7.13
The Stata output
. gen difftuition=tuition2004-tuition2003
. ttest diff=0 in 1/15

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
difftu~n |      15      1631.6    228.6886     885.707    1141.112    2122.088
------------------------------------------------------------------------------
    mean = mean(difftuition)                                      t =   7.1346
Ho: mean = 0                                     degrees of freedom =       14

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
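Stata's interval is narrower on the low side and wider on the high side than the slide's "± 2 standard errors" version because it uses a t multiple for 14 degrees of freedom. The sketch below recomputes the output from its own summary statistics; the critical value 2.145 is taken from a standard t table (two-sided 95%, 14 d.f.) and is an input here, not something computed.

```python
import math

# Summary statistics copied from the Stata output above
n = 15
mean_diff = 1631.6
sd_diff = 885.707

T_CRIT_95_DF14 = 2.145   # two-sided 95% critical value for 14 d.f., from a t table

std_err = sd_diff / math.sqrt(n)
t_stat = (mean_diff - 0) / std_err               # test of Ho: mean = 0
ci_95 = (mean_diff - T_CRIT_95_DF14 * std_err,
         mean_diff + T_CRIT_95_DF14 * std_err)

print(std_err, t_stat, ci_95)
```

The result agrees with Stata's interval to within rounding of the critical value.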
Constructing confidence intervals of regression coefficients
• Let’s look at the relationship between the seat loss by the President’s party at midterm and the President’s Gallup poll rating
[Figure: scatterplot of loss against Gallup approval rating (Nov.), with fitted values; points labeled by midterm year, 1938 to 2002]
Slope = 1.97
N = 14
s.e.r. = 13.8
$s_x$ = 8.14
s.e. slope = $\dfrac{\text{s.e.r.}}{\sqrt{n-1}} \times \dfrac{1}{s_x} = \dfrac{13.8}{\sqrt{13}} \times \dfrac{1}{8.14} = 0.47$
N = 14; slope = 1.97; s.e. = 0.47
The Picture
[Figure: normal curve centered on the mean, 1.97, with the 68%, 95%, and 99% regions marked]
1.97 - 0.47 = 1.50; 1.97 + 0.47 = 2.44
1.97 - 2*0.47 = 1.03; 1.97 + 2*0.47 = 2.91
Confidence Intervals for regression example
• 68% confidence interval = 1.97 ± 0.47 = [1.50 to 2.44]
• 95% confidence interval = 1.97 ± 2*0.47 = [1.03 to 2.91]
• 99% confidence interval = 1.97 ± 3*0.47 = [0.56 to 3.38]
What if someone (ahead of time) had said, “There is no relationship between the president’s popularity and how his party’s House members do at midterm”?
• Note that 0 is well out of the 99% confidence interval, [0.56 to 3.38]
• Q: How far away is the 0 estimate from the sample slope?
  – A: Do it in z-scores: (1.97-0)/0.47 = 4.19
The Stata output
. reg loss gallup if year>1948

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   17.53
       Model |  3332.58872     1  3332.58872           Prob > F      =  0.0013
    Residual |  2280.83985    12  190.069988           R-squared     =  0.5937
-------------+------------------------------           Adj R-squared =  0.5598
       Total |  5613.42857    13  431.802198           Root MSE      =  13.787

------------------------------------------------------------------------------
        loss |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |    1.96812   .4700211     4.19   0.001     .9440315    2.992208
       _cons |  -127.4281   25.54753    -4.99   0.000    -183.0914   -71.76486
------------------------------------------------------------------------------
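Most of the summary numbers in this output can be recovered from the sums of squares and the coefficient row, which is a useful way to see how the pieces fit together. In the sketch below every input is copied from the output above, except the critical value 2.179, which comes from a standard t table (two-sided 95%, 12 d.f.).

```python
import math

# Values copied from the Stata output above
model_ss, resid_ss = 3332.58872, 2280.83985
model_df, resid_df = 1, 12
coef, std_err = 1.96812, 0.4700211

T_CRIT_95_DF12 = 2.179   # two-sided 95% critical value for 12 d.f., from a t table

r_squared = model_ss / (model_ss + resid_ss)          # share of variation explained
root_mse = math.sqrt(resid_ss / resid_df)             # the slides' "s.e.r."
f_stat = (model_ss / model_df) / (resid_ss / resid_df)
t_stat = coef / std_err
ci_95 = (coef - T_CRIT_95_DF12 * std_err, coef + T_CRIT_95_DF12 * std_err)

print(r_squared, root_mse, f_stat, t_stat, ci_95)
```

The Root MSE of about 13.8 is the s.e.r. that enters the slope's standard error formula.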
z vs. t
If n is sufficiently large, we know the distribution of sample means/coeffs. will obey the normal curve
[Figure: normal curve centered on the mean, with the 68%, 95%, and 99% regions marked]
Therefore….
• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table
$Z = (H_1 - H_0) / \text{s.e.}$
Reading a z table
Therefore….
• When the sample size is small (i.e., <150), convert the difference into t units and consult a t table
$t = (H_1 - H_0) / \text{s.e.}$
[Figure: the t-distribution (when the sample is small) compared with the z (normal) distribution, plotted from -4 to 4]
Reading a t table
A word about standard errors and collinearity
• The problem: if $X_1$ and $X_2$ are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on $Y$
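How does having another collinear independent variable affect standard errors?
$$\text{s.e.}(\hat{\beta}_1) = \sqrt{\frac{1}{N-k-1} \cdot \frac{S_Y^2}{S_{X_1}^2} \cdot \frac{1-R_Y^2}{1-R_{X_1}^2}}$$
Here $k$ is the number of independent variables, and $R_{X_1}^2$ is the $R^2$ of the “auxiliary regression” of $X_1$ on all the other independent variables.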
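The collinearity term can be isolated: holding everything else fixed, the standard error is multiplied by one over the square root of one minus the auxiliary R-squared. The sketch below evaluates that factor; the function name is hypothetical, and the .46 is the Repub./Conserv. correlation from the example that follows. With only one other independent variable, the auxiliary R-squared is just the squared correlation.

```python
import math

def se_inflation(r2_aux):
    """Factor by which collinearity alone multiplies a coefficient's s.e."""
    return 1.0 / math.sqrt(1.0 - r2_aux)

r_repub_conserv = 0.46             # correlation from the example below
r2_aux = r_repub_conserv ** 2      # auxiliary R-squared with one other regressor
factor = se_inflation(r2_aux)

print(factor)                      # modest inflation at this correlation
print(se_inflation(0.0))           # no collinearity: no inflation
print(se_inflation(0.99))          # near-perfect collinearity: about tenfold
```

Adding a variable also tends to raise the overall R-squared of the regression, which works in the opposite direction, so the change observed in a real regression table need not equal this factor.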
Example: Effect of party, ideology, and religiosity on feelings toward Bush

                Bush Feelings   Conserv.   Repub.   Relig.
Bush Feelings   1.0
Conserv.        .39             1.0
Repub.          .57             .46        1.0
Religious       .16             .18        .06      1.0
Regression table
(standard errors in parentheses)

             (1)       (2)       (3)       (4)
Intercept    32.7      32.9      32.6      29.3
             (0.85)    (1.08)    (1.20)    (1.31)
Repub.       6.73      5.86      6.64      5.88
             (0.244)   (0.27)    (0.241)   (0.27)
Conserv.     --        2.11      --        1.87
                       (0.30)              (0.30)
Relig.       --        --        7.92      5.78
                                 (1.18)    (1.19)
N            1575      1575      1575      1575
R2           .32       .35       .35       .36