National Population Health Survey (NPHS)
Population Health Surveys
Bootstrap Hands-on Workshop
Yves Beland, CCHS senior methodologist
Larry MacNabb, CCHS dissemination manager
developed by
François Brisebois CCHS/NPHS senior methodologist
Purpose of the presentation
Justify the use, understand the theory, and
get familiar with the bootstrap technique
Demystify all illusions about using the
bootstrap technique for variance estimation
Outline
Context
NPHS \ CCHS Complex survey design
Variance estimation \ Bootstrap 101
Data support \ using the bootvar program
Why bootstrap?
CV lookup tables
Historical info about variance estimation for NPHS
Variance estimation with other software programs
Future for STC Health Surveys (re. bootstrap)
Context
A data user is interested in producing some
results
1- Compute an estimate (total, ratio, etc.)
2- Compute the precision of the estimate (variance,
coefficient of variation (CV), etc.)
Context
1- Compute an estimate
Is not a problem!
Use the provided survey weight with
NPHS/CCHS files
Context
1- Compute an estimate (cont’d)
Why use the survey weight?
NPHS Estimates for Diabetes - Canada
              # People   % People
Unweighted    620        4.1
Weighted      865,910    3.5
Source: 1998 Master Health file
Conclusion: ALWAYS USE THE WEIGHTS
Context
2- Compute the precision of an estimate
Is a problem!!
NPHS Estimates for Diabetes - Canada
STANDARD DEVIATIONS
                     % People (Estimate)   Std Dev.
Unweighted           4.1                   0.162
Weighted             3.5                   0.151
Bootstrap weights    3.5                   0.177
Source: 1998 Master Health file
Context
2- Compute the precision of the estimate (cont’d)
Scaled weights:
Scaled weight = weight / mean(weight)
Used to overcome problems with the computation of
the variance for some statistics in SAS
Reference: paper by G. Roberts et al.
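The scaled-weight formula above can be sketched in a few lines. This is only an illustration in Python (the workshop itself uses SAS and SPSS); the function name and the weight values are made up for the example.

```python
from statistics import mean

def scale_weights(weights):
    """Scaled weight = weight / mean(weight)."""
    m = mean(weights)
    return [w / m for w in weights]

# Hypothetical survey weights for four respondents.
weights = [100, 200, 300, 400]
scaled = scale_weights(weights)

print(scaled)        # each weight divided by the mean weight (250)
print(sum(scaled))   # scaled weights sum to the sample size, not the population
```

Scaling divides numerator and denominator of a weighted proportion by the same constant, so the point estimate of a percentage is unchanged; only the apparent "sample size" seen by the software changes.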
Context
2- Compute the precision of the estimate (cont’d)
Why such a difference?
Answer: The complex survey design is the main
cause (other factors to be discussed later)
Note: CCHS and NPHS have slightly different frames
but are both considered as complex survey designs
Complex survey design
1- Each province is divided into strata
Province A
Stratum #1
Stratum #2
Complex survey design
2- Selection of clusters within each stratum
Province A
Stratum #1
Stratum #2
Complex survey design
3- Selection of households within each cluster
Province A
Stratum #1
Stratum #2
Complex survey design
How does the sample design affect the
precision of estimates?
Stratification decreases variability (more precise)
Clustering increases variability (less precise)
Overall, the multistage design has the effect of
increasing variability (less precise than SRS)
Complex survey design
So why use a multistage cluster sample design
anyway?
Pros:
Efficient for interviewing (less travel, less costly)
Better coverage of the entire region of interest
Cons:
Problems for variance estimation
Bootstrap Method
Variance estimation with complex multistage
cluster sample design:
Exact formula for variance estimation is too
complex; use of an approximate approach required
NOTE: accounting for the design in variance
estimation is as crucial as using the sampling
weights for the estimation of a statistic
Bootstrap Method
Approximate methods for variance estimation:
Taylor linearization
Re-sampling methods:
Balanced Repeated Replication
Jackknife
Bootstrap
Bootstrap Method
Principle:
You want to know how precise your estimate
of the number of smokers in Canada is
You could draw 500 totally new samples, and
compare the 500 estimates you would get from
these samples. The variance of these 500
estimates would indicate the precision.
Problem: drawing 500 new samples is $$$
Solution: Use your sample as a population, and take
many smaller subsamples from it.
Bootstrap 101
How Bootstrap weights are created
(the secret is finally revealed!!!)
Starting point: Full data file (example presented for a given stratum)
Select n-1 clusters among n within each stratum (with replacement)
Apply the survey weight (Wgt)
Adjust for the fact that we picked n-1 among n (factor = n / (n-1) = 1.33) (*BOOTSTRAP WEIGHTS*)
Repeat the process 500 times (*BOOTSTRAP REPLICATES*)
USING THE BOOTSTRAP WGTS: Estimate the number of smokers

# of times the cluster is selected
Cluster   B1   B2   . . .   B500
1         1    0            3
2         1    1            0
3         0    0            0
4         1    2            0

ID   Wgt   Cluster   Smoke   B1   B2   . . .   B500
A    10    1         X       13   0            40
B    10    1         X       13   0            40
C    10    1                 13   0            40
D    10    2                 13   13           0
E    10    2                 13   13           0
F    10    2                 13   13           0
G    10    3         X       0    0            0
H    10    3                 0    0            0
I    10    4                 13   27           0
J    10    4         X       13   27           0

Estimated number of smokers:
T = 40 (Wgt)   B1 = 39   B2 = 27   . . .   B500 = 80

Var = \sum_{i=1}^{500} (B_i - \bar{B})^2 / 499
where B_i is the estimate from bootstrap replicate i and \bar{B} is the average of the 500 B_i
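The toy stratum above (10 respondents, 4 clusters, survey weight 10 each, smokers A, B, G and J) can be replayed in code. The sketch below is in Python rather than SAS/SPSS and covers only the simplified slide version: select n-1 clusters with replacement, apply the survey weight, multiply by n/(n-1). It assumes that \bar{B} in the variance formula is the mean of the replicate estimates, and all function names are made up for illustration. It is not the production procedure.

```python
import random
from statistics import mean

# Toy stratum from the slide: respondents A..J.
WGT     = [10] * 10
CLUSTER = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
SMOKE   = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]   # 1 = smoker (A, B, G, J)

def weights_from_counts(counts):
    """Bootstrap weight = Wgt * (# of times the cluster is selected) * n/(n-1)."""
    n = len(set(CLUSTER))
    factor = n / (n - 1)
    return [w * counts.get(c, 0) * factor for w, c in zip(WGT, CLUSTER)]

def draw_counts(rng):
    """Select n-1 clusters among n, with replacement; count the selections."""
    ids = sorted(set(CLUSTER))
    picks = [rng.choice(ids) for _ in range(len(ids) - 1)]
    return {c: picks.count(c) for c in ids}

def smokers_total(weights):
    """Weighted total of smokers."""
    return sum(w * s for w, s in zip(weights, SMOKE))

def bootstrap_variance(n_rep=500, seed=1):
    """Variance of the replicate estimates, divided by (n_rep - 1) as on the slide."""
    rng = random.Random(seed)
    est = [smokers_total(weights_from_counts(draw_counts(rng))) for _ in range(n_rep)]
    b_bar = mean(est)
    return sum((b - b_bar) ** 2 for b in est) / (n_rep - 1)

# Replicate B1 of the slide: clusters 1, 2 and 4 selected once, cluster 3 never.
b1 = weights_from_counts({1: 1, 2: 1, 3: 0, 4: 1})
print([round(w) for w in b1])
print(smokers_total(WGT))        # full-sample estimate T
print(bootstrap_variance())      # depends on the random seed
```

The slide rounds each bootstrap weight (13.33 shown as 13) before summing, which is why it shows 39 for B1 while the unrounded weights give 40.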
Bootstrap 101
How Bootstrap replicates are built (cont’d)
The “real” recipe
1- Subsampling of clusters (SRS) within strata
2- Apply (initial design) weight
3- Adjust weight for selection of n-1 among n
4- Apply all standard adjustments (nonresponse, share, etc.)
5- Post-stratification to population counts
Bootstrap 101
How Bootstrap replicates are built (cont’d)
The bootstrap method intends to mimic the same approach
used for the sampling and weighting processes
Be careful: some software programs say they include the
bootstrap technique; what they really do is to skip steps #4 and
#5, and use directly the final weight in step #2
Bootstrap 101
STC Methodologists create the bootstrap weight files.
Can you create your own bootstrap wgt file? No
Why? Because to do so you need to know:
The design information, i.e. strata, clusters (to generate the
bootstrap subsamples)
The definition of all adjustment classes (including post-
stratification)
Bootstrap 101
The bootstrap wgt files are:
Available for all files (except PUMF - confidentiality)
Distributed with the data files in separate files
The bootstrap wgt files contain:
IDs (REALUKEY/SAMPLEID, PERSONID)
Final sampling weight (WTxx)
500 Bootstrap weights (BSW1--BSW500)
Bootstrap - Support
NPHS/CCHS provides data users with SAS & SPSS
macro programs to compute bootstrap variances
Macros simplify the computation of bootstrap variance
estimates for totals, ratios, differences of ratios, regressions
(linear and logistic), and basic generalized linear models
Come with documentation & examples
French and English
Referred to as “bootvar”
Example: Step by Step
Let’s get to work!
Goal: Interested in estimating the number of
diabetics (total)
NPHS 1998-99 Dummy file (see information sheet)
Diabetes (CCC6_1J), some totals and ratios
NPHS 1998-99 Dummy Health File
                          #      % of population
Total cases of diabetes   DIAB   DIAB / TOTAL
Example: Step by Step
STEP #1: Create your « analysis data file »
  Read NPHS\CCHS data file
  Prepare dummy variables necessary for your analysis
  Keep only necessary variables (include geography desired)
  Run the analysis to get point estimates only
  (not necessary but recommended)
STEP #2: Compute your variances with bootvar
  Location of INPUT files:
    Your « analysis data file »
    The bootstrap weights file
  Geography desired
  Number of bootstrap weights to use
  Specify the desired analysis:
    Totals, ratios, diff of ratios
    Regression (linear & logit)
    Generalized linear modeling
Example: Step by Step
Step #1: On your own
(but can use the examples provided as a starting point)
Step #2: Use the provided Bootvar program
STEP #1
Read input file
Create dummy variables
Keep only necessary variables
Run the analysis to get point estimates
For qualitative/categorical variables, we need to identify
which value(s) we are interested in. This is done through
the creation of a dummy variable
Dummy variable
= 1 for characteristic of interest
= 0 otherwise
STEP #1
Create dummy variable: example #1
During the past 12 months, how often did you drink
alcoholic beverages? (ALC8_2)
1=Less than once a month
2=Once a month
3=2 to 3 times a month
4=Once a week
5=2 to 3 times a week
6=4 to 6 times a week
7=Every day
Interested in categories 1 to 4 (once a week or less)
DRINK
= 1 if ALC8_2 is 1,2,3 or 4
= 0 otherwise
STEP #1
Create dummy variable: example #2
Diabetes (CCC8_1J)
1=Yes
2=No
6=Not applicable
7=Don’t know
9=Not stated
Sex (DHC8_SEX)
1=Male
2=Female
Interested in “males having diabetes”
mdiab
= 1 if CCC8_1J = 1 and DHC8_SEX = 1
= 0 otherwise
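The two dummy-variable rules above can be written as small functions. This is a language-neutral sketch in Python, not the workshop's SAS/SPSS code; the function names are made up, and the "0 otherwise" branch is applied literally, so the special codes (6, 7, 9) fall into 0.

```python
def drink(alc8_2):
    """DRINK = 1 if ALC8_2 is 1, 2, 3 or 4 (once a week or less), 0 otherwise."""
    return 1 if alc8_2 in (1, 2, 3, 4) else 0

def mdiab(ccc8_1j, dhc8_sex):
    """MDIAB = 1 for males (sex = 1) having diabetes (CCC8_1J = 1), 0 otherwise."""
    return 1 if ccc8_1j == 1 and dhc8_sex == 1 else 0

# Hypothetical records: (ALC8_2, CCC8_1J, DHC8_SEX)
records = [(3, 1, 1), (7, 2, 1), (1, 1, 2), (5, 9, 1)]
for alc, diab, sex in records:
    print(alc, diab, sex, "->", drink(alc), mdiab(diab, sex))
```

Whether "don't know" and "not stated" respondents should count as 0 or be excluded from the denominator is an analytical choice that the slide does not settle.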
STEP #1
Create dummy variable: example #2
How to use the dummy variable to get an estimate
Total:
MDIAB   WT56   (product)
0       100    0
0       200    0
1       300    300
1       400    400
1       500    500
0       600    0
0       700    0
0       800    0
ESTIMATE = 1200
In SAS:
Proc freq;
tables mdiab;
weight wt56;
run;
STEP #1
Create dummy variable: example #2
How to use the dummy variable to get an estimate
Ratio:
MDIAB   TOTAL   WT56   (num)   (den)
0       1       100    0       100
0       1       200    0       200
1       1       300    300     300
1       1       400    400     400
1       1       500    500     500
0       1       600    0       600
0       1       700    0       700
0       1       800    0       800
Sum                    1200    3600
ESTIMATE = 1200 / 3600 = 33%
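The arithmetic of the total and ratio examples can be checked with a short script. This is a sketch in Python using the eight records shown on the slides; the helper name `weighted_total` is made up for illustration.

```python
# The eight records from the slide.
MDIAB = [0, 0, 1, 1, 1, 0, 0, 0]
TOTAL = [1] * 8
WT56  = [100, 200, 300, 400, 500, 600, 700, 800]

def weighted_total(dummy, weights):
    """Sum of dummy * weight over all records."""
    return sum(d * w for d, w in zip(dummy, weights))

num = weighted_total(MDIAB, WT56)   # estimated number of males with diabetes
den = weighted_total(TOTAL, WT56)   # estimated population
ratio = num / den

print(num, den, f"{100 * ratio:.0f}%")
```

The same function applied to each column of bootstrap weights instead of WT56 gives the replicate estimates used for the variance.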
STEP #1
See example in SPSS
Diabetes (CCC6_1J), some totals and ratios
NPHS 1998-99 Dummy Health File
                            #         % of population
Diabetes (Nfld, Man & BC)   169,700   3.1
STEP #1
Now your turn! (exercise #1)
Add asthma (CCC8_1C) to the table
Use existing program (step1.sas) and add SPSS codes to create
a dummy variable for asthma; and then get the results
Diabetes & Asthma, some totals and ratios
NPHS 1998-99 Dummy Health File
                            #                  % of population
Diabetes (Nfld, Man & BC)   169,700            3.1
Asthma (Nfld, Man & BC)     ASTHMA = 446,800   ASTHMA / TOTAL = 8.1
Step #2: Bootvar Program
Created by methodologists in 1997
(first used with NPHS cycle 2 data)
Version 1.0
one single program (over 1,000 lines of codes)
divided into 4 sections
users have to adapt the program to their requests; changes in
3 sections
SAS: bootvar.sas / bootvarf.sas
SPSS: beta version available only on request (bvr_b.sps)
Step #2: Bootvar Program
Version 2.0
Justifications:
Compatible with SAS 8+
Centralize the codes where modifications have to be done
by the user
Can use with both NPHS and CCHS data files
Now consists of 2 programs
Contains the codes users need to modify for their requests
Contains the codes users do not have to modify (macros)
Step #2: Bootvar Program
Version 2.0
SAS version:
bootvare_v20.sas / bootvarf_v20.sas
macroe_v20.sas / macrof_v20.sas
SPSS version:
bootvare_v21.sps / bootvarf_v21.sps
macroe_v21.sps / macrof_v21.sps
STEP #2: Use of bootvar
Point estimates have already been obtained, let us now
estimate the sampling variability of those estimates
Go through the bootvar program (bootvare_v21.sps)
STEP #2: Use of bootvar
See example in SPSS
Diabetes & Asthma, some totals and ratios
NPHS 1998-99 Dummy Health File
Nfld, Man & B.C. only
           #         95% C.I.              % of pop.   95% C.I.
Diabetes   169,700   (133,400 ; 205,900)   3.1         (2.4 ; 3.8)
Asthma     446,800                         8.1
STEP #2
Now your turn! (exercise #2)
Compute confidence intervals for asthma
Use bootvare_v21.sps and adjust it to obtain desired results
(use the already set up step2.sps program for this exercise)
Diabetes & Asthma, some totals and ratios
NPHS 1998-99 Dummy Health File
Nfld, Man & B.C. only
           #         95% C.I.              % of pop.   95% C.I.
Diabetes   169,700   (133,400 ; 205,900)   3.1         (2.4 ; 3.8)
Asthma     446,800   (381,700 ; 511,900)   8.1         (6.9 ; 9.3)
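The link between the replicate estimates and intervals like the ones above can be sketched as follows. The slides do not say which interval bootvar reports, so this Python sketch assumes a normal-approximation interval (estimate ± 1.96 × bootstrap standard deviation) and the (B - 1) divisor from the variance slide. All function names are made up for illustration.

```python
def bootstrap_sd(replicate_estimates):
    """Standard deviation of the replicate estimates, dividing by (B - 1)."""
    b = len(replicate_estimates)
    b_bar = sum(replicate_estimates) / b
    var = sum((x - b_bar) ** 2 for x in replicate_estimates) / (b - 1)
    return var ** 0.5

def normal_ci(estimate, sd, z=1.96):
    """Normal-approximation confidence interval (assumed form)."""
    return estimate - z * sd, estimate + z * sd

def implied_sd(lower, upper, z=1.96):
    """Back out the standard deviation from a reported symmetric interval."""
    return (upper - lower) / (2 * z)

# Diabetes total from the table: 169,700 with C.I. (133,400 ; 205,900).
sd = implied_sd(133_400, 205_900)
cv = 100 * sd / 169_700
print(round(sd), round(cv, 1))   # implied standard deviation and CV in %
```

Dividing the standard deviation by the estimate gives the CV, which is the quantity compared against the release guidelines later in the deck.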
Bootstrap - More
Why 500 bootstrap weights?
Size of file (for dissemination)
Time of computation (for an average PC)
Accuracy
Use more bootstrap weights?
Faster PC
Accuracy for small domains and more complex analysis
methods
Bootstrap - More
Confidentiality revealed from the bootstrap weights
ID   Wgt   Cluster   B1   B2   . . .   B500
A    10    1 ?       13   0            33
B    10    1 ?       13   0            33
C    10    1 ?       13   0            33
D    10    2 ?       13   14           0
E    10    2 ?       13   14           0
F    10    2 ?       13   14           0
G    10    3 ?       0    0            0
H    10    3 ?       0    0            0
I    10    4 ?       13   29           0
J    10    4 ?       13   29           0
Bootstrap - More
Confidentiality revealed from the bootstrap weights
(cont’d)
How can PUMF users estimate their exact variances?
Remote access
Provide dummy file
(same structure as master files but contain dummy data)
Test programs and send by e-mail
Research Data Centre
Regional Offices
Why Bootstrap?
Other techniques examined: Taylor, Jackknife
Taylor:
Need to define a linear equation for each statistic
examined
Jackknife:
Cannot be disseminated because of confidentiality
Number of replicates depends on the number of
strata (large number of strata in 1996 makes it
impossible to disseminate)
Why Bootstrap?
Bootstrap:
Handles survey designs with many strata more easily
Sets of 500 bootstrap weights can be distributed to
data users
Recommended (over the jackknife) for estimating the
variance of nonsmooth functions like quantiles, LICO
Reference: “Bootstrap Variance Estimation for the
National Population Health Survey”, D.Yeo,
H.Mantel, and T.-P. Liu. 1999, ASA Conference.
Bootvar: exercise #3
Results for diabetes broken down by sex and
province
Diabetes, some totals and ratios
NPHS 1998-99 Dummy Health File
                 #         95% C.I.              % of pop.   95% C.I.
Nfld, Man & BC   169,700   (133,400 ; 205,900)   3.1         (2.4 ; 3.8)
Nfld             24,900    (18,200 ; 31,500)     4.6         (3.4 ; 5.9)
  Males          9,800     (4,600 ; 14,700)      3.7         (1.7 ; 5.6)
  Females        15,100    (10,000 ; 20,100)     5.6         (3.7 ; 7.4)
Manitoba         32,300    (20,400 ; 44,200)     3.0         (1.9 ; 4.1)
  Males          15,800    (7,300 ; 24,400)      3.0         (1.3 ; 4.5)
  Females        16,500    (8,000 ; 25,000)      3.0         (1.5 ; 4.7)
B.C.             112,500   (79,300 ; 145,600)    2.9         (2.0 ; 3.7)
  Males          68,700    (43,500 ; 93,900)     3.5         (2.2 ; 4.8)
  Females        43,700    (22,200 ; 65,300)     2.2         (1.1 ; 3.4)
Variables used: DIAB and DIAB / TOTAL (both sexes), MDIAB and MDIAB / MTOTAL (males),
FDIAB and FDIAB / FTOTAL (females)
Bootvar: Tricks
If you need to create a dummy variable for a
characteristic based on many variables:
Example: Males with diabetes
First, create dummy variables for each individual
variable (males, diabetes)
Then, create the dummy variable for the characteristic
by multiplying the individual dummy variables
Bootvar: Tricks
Example:
Males = 1,0 (MALES)
Diabetes = 1,0 (DIAB)
Males having diabetes (MDIAB) = MALES * DIAB
MALES   *   DIAB   =   MDIAB
1           0          0
1           1          1
0           0          0
0           1          0
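The multiplication trick can be checked on the four cases of the example. This is a minimal Python sketch using the values from the slide.

```python
# The four cases from the slide.
MALES = [1, 1, 0, 0]
DIAB  = [0, 1, 0, 1]

# Product of 0/1 dummies is 1 only when every condition holds.
MDIAB = [m * d for m, d in zip(MALES, DIAB)]
print(MDIAB)
```

Because each dummy is 0 or 1, the product behaves like a logical AND, which makes it easy to combine any number of conditions.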
Bootvar: Tricks
Use the REGION parameter in bootvar to specify
a “stratification” variable (doesn’t have to be a
geographic variable!)
Example: REGION = sex
will produce results by sex
CV look-up tables
What is it?
Approximate sampling variability tables
Produced for Canada, each province, and by age groups
for Canada (also by Health Regions for cycle 2)
Useful only for categorical estimates
Totals & ratios only
CV look-up tables
Approximate Sampling Variability Tables for MANITOBA - Selected Members
NUMERATOR OF         ESTIMATED PERCENTAGE
PERCENTAGE ('000)    0.1%    1.0%    2.0%    5.0%   10.0%   15.0%   20.0%   25.0%   30.0%   35.0%   40.0%
 1                  103.6   103.2   102.6   101.1    98.4    95.6    92.7    89.8    86.7    83.6    80.3
 2                  *****    72.9    72.6    71.5    69.5    67.6    65.6    63.5    61.3    59.1    56.8
 3                  *****    59.6    59.3    58.3    56.8    55.2    53.5    51.8    50.1    48.3    46.4
 4                  *****    51.6    51.3    50.5    49.2    47.8    46.4    44.9    43.4    41.8    40.2
 5                  *****    46.1    45.9    45.2    44.0    42.7    41.5    40.2    38.8    37.4    35.9
 6                  *****    42.1    41.9    41.3    40.2    39.0    37.9    36.7    35.4    34.1    32.8
 7                  *****    39.0    38.8    38.2    37.2    36.1    35.0    33.9    32.8    31.6    30.4
 8                  *****    36.5    36.3    35.7    34.8    33.8    32.8    31.7    30.7    29.6    28.4
 9                  *****    34.4    34.2    33.7    32.8    31.9    30.9    29.9    28.9    27.9    26.8
10                  *****    32.6    32.5    32.0    31.1    30.2    29.3    28.4    27.4    26.4    25.4
11                  *****   *****    30.9    30.5    29.7    28.8    28.0    27.1    26.2    25.2    24.2
12                  *****   *****    29.6    29.2    28.4    27.6    26.8    25.9    25.0    24.1    23.2
13                  *****   *****    28.5    28.0    27.3    26.5    25.7    24.9    24.1    23.2    22.3
14                  *****   *****    27.4    27.0    26.3    25.5    24.8    24.0    23.2    22.3    21.5
15                  *****   *****    26.5    26.1    25.4    24.7    23.9    23.2    22.4    21.6    20.7
16                  *****   *****    25.7    25.3    24.6    23.9    23.2    22.4    21.7    20.9    20.1
17                  *****   *****    24.9    24.5    23.9    23.2    22.5    21.8    21.0    20.3    19.5
18                  *****   *****    24.2    23.8    23.2    22.5    21.9    21.2    20.4    19.7    18.9
19                  *****   *****    23.5    23.2    22.6    21.9    21.3    20.6    19.9    19.2    18.4
20                  *****   *****    22.9    22.6    22.0    21.4    20.7    20.1    19.4    18.7    18.0
21                  *****   *****    22.4    22.1    21.5    20.9    20.2    19.6    18.9    18.2    17.5
22                  *****   *****   *****    21.5    21.0    20.4    19.8    19.1    18.5    17.8    17.1
23                  *****   *****   *****    21.1    20.5    19.9    19.3    18.7    18.1    17.4    16.7
24                  *****   *****   *****    20.6    20.1    19.5    18.9    18.3    17.7    17.1    16.4
25                  *****   *****   *****    20.2    19.7    19.1    18.5    18.0    17.3    16.7    16.1
30                  *****   *****   *****    18.4    18.0    17.5    16.9    16.4    15.8    15.3    14.7
35                  *****   *****   *****    17.1    16.6    16.2    15.7    15.2    14.7    14.1    13.6
40                  *****   *****   *****    16.0    15.6    15.1    14.7    14.2    13.7    13.2    12.7
45                  *****   *****   *****    15.1    14.7    14.2    13.8    13.4    12.9    12.5    12.0
50                  *****   *****   *****    14.3    13.9    13.5    13.1    12.7    12.3    11.8    11.4
………
Sampling Variability Guidelines
Type of estimate   CV (%)      Guidelines
Acceptable         0.0-16.5    General unrestricted release
Marginal           16.6-33.3   General unrestricted release but with warning cautioning users of the high sampling variability. Should be identified by letter M.
Unacceptable       > 33.3      No release. Should be flagged with letter U.
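The release guidelines translate directly into a small classification rule. The sketch below is in Python with a made-up function name. The table leaves values between 16.5 and 16.6 unassigned, so the choice to treat them as Marginal is an assumption of this example.

```python
def release_category(cv_percent):
    """Classify an estimate by its CV (in %) following the guidelines table."""
    if cv_percent <= 16.5:
        return "Acceptable"
    if cv_percent <= 33.3:
        return "Marginal (M)"
    return "Unacceptable (U)"

# A few illustrative CV values (in %).
for cv in (10.9, 18.7, 27.6, 35.0):
    print(cv, release_category(cv))
```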
CV look-up tables
Comparison between bootstrap CV and CV from
lookup table
For number of people having diabetes:
Manitoba total: T=32K Cvtable =18%, BTS = 18.7%
Manitoba Males : T=16K Cvtable=25.7%, BTS=27.6%
Manitoba Females: T=16.5K Cvtable=25.3%, BTS=26.4%
CV look-up tables
Comparison between bootstrap CV and CV from
lookup table
Other examples (from master - general file)
Number of people experiencing food insecurity:
Manitoba total: T=118K Cvtable =6.4%, BTS = 11.2%
Number of people in the lowest income quintile:
Manitoba total: T=40K Cvtable =11.9%, BTS = 19.8%
Bootvar: Regression models
Logistic regression model
log (p / (1 - p)) = intercept + b1*X1 + b2*X2, where p = P(Y = 1)
→Y has to be qualitative (categorical)
(for now assume it is dichotomous, i.e. 0,1)
→Xi can be quantitative or qualitative variables
Bootvar: Regression models
Logistic regression model
Example: Diabetes vs sex and age
→Categorical variables need to be dichotomized
(“dummied”; 1 variable for each category except 1)
→Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
→Age: create a variable for people over 60
(if age > 60 then OVER60=1; else OVER60=0)
→The model is:
logit(P(DIAB = 1)) = intercept + b1*FEMALE + b2*OVER60
Bootvar: Regression models
Logistic regression model
Example: Diabetes vs sex and age
logit(P(DIAB = 1)) = intercept + b1*FEMALE + b2*OVER60
In bootvar, use %logreg macro
%logreg(yvar,xvar);
%logreg(DIAB,FEMALE OVER60);
Bootvar: Regression models
Linear regression model
Y = intercept + b1*X1 + b2*X2
→Y is quantitative
→Xi can be qualitative (categorical) or quantitative
Bootvar: Regression models
Linear regression model
Example: BMI (body mass index) vs sex and age
→Categorical variables need to be dichotomized
(“dummied”; 1 variable for each category except 1)
→Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
→Age: use it as quantitative (single year of age)
→The model is:
BMI = intercept + b1*FEMALE + b2*AGE
Bootvar: Regression models
Linear regression model
Example: BMI vs sex and age
BMI = intercept + b1*FEMALE + b2*AGE
In bootvar, use %regress macro
%regress(yvar,xvar);
%regress(BMI,FEMALE AGE);
Bootvar: testing
For version 2.0/2.1:
Simply set 2 < B < 500
For version 1.0:
See documentation!
Historical info about variance
estimation for NPHS
Cycle 1: Use of Jackknife technique
Could not disseminate with public-use microdata
files; only custom requests
Cycle 2 & +: Use of bootstrap technique
Can not disseminate ….; custom requests or remote
access
All cycles: CV look-up tables
for large domains (provinces, age groups)
only good for totals, ratios, and differences of ...
Variance estimation with other
software programs
WesVar (SPSS)
SAS
SUDAAN
STATA
Future for Stats Can Health
Surveys (vs. bootstrap)
NPHS
Cycle 4 (2000-2001) data processing & weighting
Promote the use of longitudinal data
Bootstrap pgms: finalize version 2.0 (SAS & SPSS)
CCHS
Cycle 1.1 bootstrap weights
Bootstrap also used for variance estimation (same programs
as for NPHS)
Contacts
Population Health Surveys
Health Pgm Surveys Manager:
Lorna Bailie ([email protected])
NPHS Manager:
France Bilocq ([email protected])
CCHS Manager:
Marc Hamel ([email protected])
CCHS Dissemination manager:
Larry MacNabb ([email protected])
Senior Methodologists:
François Brisebois ([email protected])
Mylène Lavigne ([email protected])
Yves Béland ([email protected])
Data Access Services Manager:
Mario Bédard ([email protected])
Custom Services Requests:
Garry Macdonald ([email protected])