National Population Health Survey (NPHS)


Population Health Surveys
Bootstrap Hands-on Workshop
Yves Béland, CCHS senior methodologist
Larry MacNabb, CCHS dissemination manager
developed by
François Brisebois CCHS/NPHS senior methodologist
[email protected]
Purpose of the presentation

Justify the use, understand the theory, and
get familiar with the bootstrap technique

Dispel misconceptions about using the
bootstrap technique for variance estimation
Outline
Context
NPHS / CCHS complex survey design
Variance estimation / Bootstrap 101
Data support / using the bootvar program
Why bootstrap?
CV lookup tables
Historical info about variance estimation for NPHS
Variance estimation with other software programs
Future for STC Health Surveys (re. bootstrap)
Context

A data user is interested in producing some
results
1- Compute an estimate (total, ratio, etc.)
2- Compute the precision of the estimate (variance,
coefficient of variation (CV), etc.)
Context
1- Compute an estimate
 Is not a problem!
 Use the provided survey weight with
NPHS/CCHS files
Context
1- Compute an estimate (cont’d)
 Why use the survey weight?
NPHS Estimates for Diabetes - Canada
              # People    % People
Unweighted         620         4.1
Weighted       865,910         3.5
Source: 1998 Master Health file

Conclusion: ALWAYS USE THE WEIGHTS
Context
2- Compute the precision of an estimate
 Is a problem!!
NPHS Estimates for Diabetes - Canada
STANDARD DEVIATIONS
                           % People
                     Estimate    Std Dev.
Unweighted                4.1       0.162
Weighted                  3.5       0.151
Bootstrap weights         3.5       0.177
Source: 1998 Master Health file
Context
2- Compute the precision of the estimate (cont’d)
 Scaled weights:
Scaled weight = weight / mean(weight)
Used to overcome problems with the computation of
the variance for some statistics in SAS
Reference: paper by G. Roberts et al.
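The scaling formula above can be tried on a few numbers. The following is a minimal Python sketch, not part of bootvar (which is distributed as SAS and SPSS programs); the function name and the weights are made up for illustration.

```python
def scale_weights(weights):
    """Divide each weight by the mean weight.

    The scaled weights average 1, so they sum to the number of records.
    """
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]


# Made-up survey weights for eight respondents.
weights = [100, 200, 300, 400, 500, 600, 700, 800]
scaled = scale_weights(weights)

print([round(s, 3) for s in scaled])
print(round(sum(scaled), 6))  # sums to the number of records
```

Scaling changes every weight by the same factor, so weighted proportions are unchanged; only quantities that depend on the sum of the weights are affected.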
Context
2- Compute the precision of the estimate (cont’d)
 Why such a difference?
Answer: The complex survey design is the main
cause (other factors to be discussed later)
Note: CCHS and NPHS have slightly different frames
but are both considered as complex survey designs
Complex survey design
1- Each province is divided into strata
2- Selection of clusters within each stratum
3- Selection of households within each cluster
[Diagram: Province A divided into Stratum #1 and Stratum #2]
Complex survey design

How does the sample design affect the
precision of estimates?
Stratification decreases variability (more precise)
Clustering increases variability (less precise)
Overall, the multistage design has the effect of
increasing variability (less precise than SRS)
Complex survey design

So why use a multistage cluster sample design
anyway?
Pros:
Efficient for interviewing (less travel, less costly)
Better coverage of the entire region of interest
Cons:
Problems for variance estimation
Bootstrap Method

Variance estimation with complex multistage
cluster sample design:
Exact formula for variance estimation is too
complex; use of an approximate approach required
NOTE: accounting for the design in variance
estimation is as crucial as using the sampling
weights when estimating a statistic
Bootstrap Method

Approximate methods for variance estimation:
Taylor linearization
Re-sampling methods:
Balanced Repeated Replication
Jackknife
Bootstrap
Bootstrap Method

Principle:
You want to know how precise your estimate of the
number of smokers in Canada is
You could draw 500 completely new samples and
compare the 500 estimates obtained from
these samples. The variance of these 500
estimates would indicate the precision.
Problem: drawing 500 new samples is $$$
Solution: Use your sample as a population, and take
many smaller subsamples from it.
Bootstrap 101

How Bootstrap weights are created
(the secret is finally revealed!!!)
Starting point: Full data file (example presented for a given stratum)
1- Select n-1 clusters among n within each stratum (with replacement)
2- Repeat the process 500 times (*BOOTSTRAP REPLICATES*)
3- Apply the survey weight (Wgt): Wgt x # of times the cluster is selected
4- Adjust for the fact that we picked n-1 among n (factor n / (n-1) = 1.33, here n = 4) (*BOOTSTRAP WEIGHTS*)
USING THE BOOTSTRAP WGTS: Estimate number of smokers

                                # of times the
                                cluster is selected      Bootstrap weight
ID   Wgt   Cluster   Smoke      B1   B2  ...  B500       B1   B2  ...  B500
A     10      1        X         1    0          3       13    0         40
B     10      1        X         1    0          3       13    0         40
C     10      1                  1    0          3       13    0         40
D     10      2                  1    1          0       13   13          0
E     10      2                  1    1          0       13   13          0
F     10      2                  1    1          0       13   13          0
G     10      3        X         0    0          0        0    0          0
H     10      3                  0    0          0        0    0          0
I     10      4                  1    2          0       13   27          0
J     10      4        X         1    2          0       13   27          0

Estimated number of smokers: T = 40 (survey weight); B1 = 39, B2 = 27, ..., B500 = 80
Var = \sum_{i=1}^{500} (B_i - \bar{B})^2 / 499
where B_i is the estimate from bootstrap replicate i and \bar{B} is the mean of the 500 replicate estimates
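The worked example above can be imitated in a few lines. This is a Python sketch under stated assumptions, not the Statistics Canada procedure: it covers only the simplified slide version (select n-1 clusters with replacement, apply the weight, adjust by n / (n-1)) for a single stratum and leaves out the nonresponse and post-stratification adjustments. The function names are hypothetical, and because the draws are random they will not reproduce the B1, B2 and B500 columns shown on the slide.

```python
import random


def bootstrap_weights(records, n_reps=500, seed=1):
    """Build bootstrap weights for ONE stratum.

    records: list of (unit_id, survey_weight, cluster).
    Returns n_reps lists of bootstrap weights, one weight per record.
    """
    rng = random.Random(seed)
    clusters = sorted({cluster for _, _, cluster in records})
    n = len(clusters)
    factor = n / (n - 1)  # 4 / 3 = 1.33 in the slide example
    replicates = []
    for _ in range(n_reps):
        # Select n-1 clusters among n, with replacement.
        draws = [rng.choice(clusters) for _ in range(n - 1)]
        times = {c: draws.count(c) for c in clusters}
        # Apply the survey weight, then adjust for picking n-1 among n.
        replicates.append([w * times[c] * factor for _, w, c in records])
    return replicates


def bootstrap_variance(estimates):
    """Sum of squared deviations from the mean, divided by (B - 1)."""
    mean_b = sum(estimates) / len(estimates)
    return sum((b - mean_b) ** 2 for b in estimates) / (len(estimates) - 1)


# Data from the slide: ten units, weight 10 each, four clusters.
records = [("A", 10, 1), ("B", 10, 1), ("C", 10, 1), ("D", 10, 2),
           ("E", 10, 2), ("F", 10, 2), ("G", 10, 3), ("H", 10, 3),
           ("I", 10, 4), ("J", 10, 4)]
smoker = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]  # A, B, G and J smoke

full = sum(w * s for (_, w, _), s in zip(records, smoker))
reps = bootstrap_weights(records)
boot_estimates = [sum(bw * s for bw, s in zip(rep, smoker)) for rep in reps]
var = bootstrap_variance(boot_estimates)

print("T =", full)
print("bootstrap variance =", round(var, 1))
```

The point estimate always comes from the survey weight; the 500 replicate estimates are used only to measure how much that estimate would vary from sample to sample.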
Bootstrap 101

How Bootstrap replicates are built (cont’d)
 The “real” recipe
1- Subsampling of clusters (SRS) within strata
2- Apply (initial design) weight
3- Adjust weight for selection of n-1 among n
4- Apply all standard adjustments (nonresponse, share, etc.)
5- Post-stratification to population counts
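Step 5 of the recipe above can be pictured as a ratio adjustment applied to each replicate separately. The Python sketch below is an illustration only: the post-strata (sex), the population counts and the function name are hypothetical, and the actual adjustment classes used for NPHS/CCHS are not described in this workshop.

```python
def post_stratify(weights, groups, pop_counts):
    """Scale weights so that each group's weights sum to its population count."""
    totals = {}
    for w, g in zip(weights, groups):
        totals[g] = totals.get(g, 0.0) + w
    # A group with no weight left in a replicate cannot be adjusted; keep it at 0.
    return [w * pop_counts[g] / totals[g] if totals[g] > 0 else 0.0
            for w, g in zip(weights, groups)]


# One hypothetical replicate after step 3 (weight 10 x factor 1.33, or 0).
rep_weights = [13.3, 13.3, 13.3, 13.3, 13.3, 13.3, 0.0, 0.0, 13.3, 13.3]
groups = ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"]
pop_counts = {"M": 50, "F": 50}  # assumed known population counts

adjusted = post_stratify(rep_weights, groups, pop_counts)
print([round(a, 1) for a in adjusted])
```

Repeating the adjustment inside every replicate is what lets the bootstrap variance reflect the weighting process as well as the sampling.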
Bootstrap 101

How Bootstrap replicates are built (cont’d)
 The bootstrap method intends to mimic the same approach
used for the sampling and weighting processes
 Be careful: some software programs say they include the
bootstrap technique; in fact they skip steps #4 and
#5 and use the final weight directly in step #2
Bootstrap 101

STC Methodologists create the bootstrap weight files.

Can you create your own bootstrap wgt file? No
Why? Because to do so you need to know:
 The design information, i.e. strata, clusters (to generate the
bootstrap subsamples)
 The definition of all adjustment classes (including post-
stratification)
Bootstrap 101

The bootstrap wgt files are:
Available for all files (except PUMF - confidentiality)
Distributed with the data files in separate files

The bootstrap wgt files contain:
 IDs (REALUKEY/SAMPLEID, PERSONID)
 Final sampling weight (WTxx)
 500 Bootstrap weights (BSW1--BSW500)
Bootstrap - Support

NPHS/CCHS provides data users with SAS & SPSS
macro programs to compute bootstrap variances
 Macros simplifying computation of bootstrap variance
estimates for totals, ratios, differences of ratios, regressions
(linear and logistic), and basic generalized linear models
 Come with documentation & examples
 French and English
 Referred to as “bootvar”
Example: Step by Step

Let’s get to work!

Goal: Interested in estimating the number of
diabetics (total)
NPHS 1998-99 Dummy file (see information sheet)
Diabetes (CCC6_1J), some totals and ratios
NPHS 1998-99 Dummy Health File
                           #       % of population
Total cases of diabetes    DIAB    DIAB / TOTAL
Example: Step by Step
STEP #1: Create your « analysis data file »
 Read NPHS/CCHS data file
 Prepare dummy variables necessary for your analysis
 Keep only necessary variables (include geography desired)
 Run the analysis to get point estimates only (not necessary but recommended)
STEP #2: Compute your variances with bootvar
 Location of INPUT files:
   Your « analysis data file »
   The bootstrap weights file
 Geography desired
 Number of bootstrap weights to use
 Specify the desired analysis:
   Totals, ratios, diff of ratios
   Regression (linear & logit)
   Generalized linear modeling
Example: Step by Step

Step #1: On your own
(but can use the examples provided as a starting point)

Step #2: Use the provided Bootvar program
STEP #1
Read input file
Create dummy variables
Keep only necessary variables
Run the analysis to get point estimates
For qualitative/categorical variables, we need to identify
which value(s) we are interested in. This is done through
the creation of a dummy variable
Dummy variable
= 1 for characteristic of interest
= 0 otherwise
STEP #1

Create dummy variable: example #1
During the past 12 months, how often did you drink
alcoholic beverages? (ALC8_2)
1=Less than once a month
2=Once a month
3=2 to 3 times a month
4=Once a week
5=2 to 3 times a week
6=4 to 6 times a week
7=Every day
Interested in categories 1 to 4 (once a week or less)
 DRINK
= 1 if ALC8_2 is 1,2,3 or 4
= 0 otherwise
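The recoding rule above (1 for the categories of interest, 0 otherwise) can be sketched in Python as follows. The sample values and the helper name are made up; in the workshop itself this step is done in SAS or SPSS.

```python
def make_dummy(value, categories_of_interest):
    """Return 1 if the value is one of the categories of interest, else 0."""
    return 1 if value in categories_of_interest else 0


# Hypothetical ALC8_2 answers for seven respondents.
alc8_2 = [1, 4, 5, 7, 2, 3, 6]

# DRINK = 1 for categories 1 to 4 (once a week or less), 0 otherwise.
drink = [make_dummy(v, {1, 2, 3, 4}) for v in alc8_2]
print(drink)  # [1, 1, 0, 0, 1, 1, 0]
```

With this rule every value outside the chosen categories becomes 0, so how non-response codes should be treated is an analytical decision to make before building the dummy variable.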
STEP #1

Create dummy variable: example #2
Diabetes (CCC8_1J)
1=Yes
2=No
6=Not applicable
7=Don’t know
9=Not stated
Sex (DHC8_SEX)
1=Male
2=Female
Interested in “males having diabetes”
 mdiab
= 1 if CCC8_1J = 1 and DHC8_SEX = 1
= 0 otherwise
STEP #1

Create dummy variable: example #2
How to use the dummy variable to get an estimate
 Total:
MDIAB    WT56    (product)
  0       100         0
  0       200         0
  1       300       300
  1       400       400
  1       500       500
  0       600         0
  0       700         0
  0       800         0
ESTIMATE =         1200
In SAS:
Proc freq;
tables mdiab;
weight wt56;
run;
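For readers who want to check the arithmetic outside SAS, the same weighted total can be written as a short Python sketch using the eight records of the example (plain lists stand in for the data file).

```python
# Dummy variable and survey weight for the eight records of the example.
mdiab = [0, 0, 1, 1, 1, 0, 0, 0]
wt56 = [100, 200, 300, 400, 500, 600, 700, 800]

# Weighted total: sum of dummy x weight over all records.
estimate = sum(d * w for d, w in zip(mdiab, wt56))
print(estimate)  # 1200
```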
STEP #1

Create dummy variable: example #2
How to use the dummy variable to get an estimate
 Ratio:
MDIAB   TOTAL   WT56    (num)    (den)
  0       1      100       0      100
  0       1      200       0      200
  1       1      300     300      300
  1       1      400     400      400
  1       1      500     500      500
  0       1      600       0      600
  0       1      700       0      700
  0       1      800       0      800
Sum                     1200     3600
ESTIMATE = 1200 / 3600 = 33%
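The ratio is simply one weighted total divided by another. A small Python sketch with the same eight records (the function name is hypothetical):

```python
def weighted_ratio(numerator_flags, denominator_flags, weights):
    """Weighted total of the numerator divided by that of the denominator."""
    num = sum(f * w for f, w in zip(numerator_flags, weights))
    den = sum(f * w for f, w in zip(denominator_flags, weights))
    return num / den


mdiab = [0, 0, 1, 1, 1, 0, 0, 0]
total = [1] * 8  # every record counts in the denominator
wt56 = [100, 200, 300, 400, 500, 600, 700, 800]

ratio = weighted_ratio(mdiab, total, wt56)
print(f"{100 * ratio:.0f}%")  # 33%
```

Restricting the denominator flag to a subgroup gives the ratio within that subgroup.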
STEP #1

See example in SPSS
Diabetes (CCC6_1J), some totals and ratios
NPHS 1998-99 Dummy Health File
                             #          % of population
Diabetes (Nfld, Man & BC)    169,700    3.1
STEP #1

Now your turn! (exercise #1)
Add asthma (CCC8_1C) to the table
Use existing program (step1.sas) and add SPSS codes to create
a dummy variable for asthma; and then get the results
Diabetes & Asthma, some totals and ratios
NPHS 1998-99 Dummy Health File
                             #                   % of population
Diabetes (Nfld, Man & BC)    169,700             3.1
Asthma (Nfld, Man & BC)      ASTHMA = 446,800    ASTHMA / TOTAL = 8.1
Step #2: Bootvar Program

Created by methodologists in 1997
(first used with NPHS cycle 2 data)

Version 1.0
one single program (over 1,000 lines of code)
divided into 4 sections
 users have to adapt the program to their requests; changes are
needed in 3 sections
SAS: bootvar.sas / bootvarf.sas
SPSS: beta version available only on request (bvr_b.sps)
Step #2: Bootvar Program

Version 2.0
Justifications:
 Compatible with SAS 8+
 Centralize the codes where modifications have to be done
by the user
 Can use with both NPHS and CCHS data files
Now consists of 2 programs:
 One contains the code users need to modify for their requests
 The other contains the code users do not have to modify (macros)
Step #2: Bootvar Program

Version 2.0
SAS version:
 bootvare_v20.sas / bootvarf_v20.sas
 macroe_v20.sas / macrof_v20.sas
SPSS version:
 bootvare_v21.sps / bootvarf_v21.sps
 macroe_v21.sps / macrof_v21.sps
STEP #2: Use of bootvar

Point estimates have already been obtained, let us now
estimate the sampling variability of those estimates
 Go through the bootvar program (bootvare_v21.sps)
STEP #2: Use of bootvar

See example in SPSS
Diabetes & Asthma, some totals and ratios
NPHS 1998-99 Dummy Health File (Nfld, Man & B.C. only)
            #          95% C.I.               % of pop.   95% C.I.
Diabetes    169,700    (133,400 ; 205,900)    3.1         (2.4 ; 3.8)
Asthma      446,800                           8.1
STEP #2

Now your turn! (exercise #2)
Compute confidence intervals for asthma
Use bootvare_v21.sps and adjust it to obtain desired results
(use the already set up step2.sps program for this exercise)
Diabetes & Asthma, some totals and ratios
NPHS 1998-99 Dummy Health File (Nfld, Man & B.C. only)
            #          95% C.I.               % of pop.   95% C.I.
Diabetes    169,700    (133,400 ; 205,900)    3.1         (2.4 ; 3.8)
Asthma      446,800    (381,700 ; 511,900)    8.1         (6.9 ; 9.3)
(the asthma confidence intervals are the values to be found in the exercise)
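The intervals in the table are consistent with "estimate plus or minus about two standard deviations". The Python sketch below shows that calculation; the 1.96 multiplier is an assumption (the workshop does not state how bootvar builds its intervals), the replicate estimates are made up, and the function name is hypothetical.

```python
import statistics


def bootstrap_summary(point_estimate, replicate_estimates, z=1.96):
    """Return (std_dev, cv_percent, ci_low, ci_high).

    The standard deviation is taken across the replicate estimates
    (divisor B - 1); the interval is point estimate +/- z * std_dev.
    """
    sd = statistics.stdev(replicate_estimates)
    cv = 100 * sd / point_estimate
    return sd, cv, point_estimate - z * sd, point_estimate + z * sd


# Made-up replicate estimates (a real run would use 500 of them).
replicates = [38, 42, 40, 41, 39]
sd, cv, low, high = bootstrap_summary(40, replicates)
print(round(sd, 2), round(cv, 1), round(low, 1), round(high, 1))

# Reading the published diabetes interval backwards, under the same assumption.
implied_sd = (205_900 - 133_400) / (2 * 1.96)
print(round(implied_sd))
```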
Bootstrap - More

Why 500 bootstrap weights?
 Size of file (for dissemination)
 Time of computation (for an average PC)
 Accuracy

Use more bootstrap weights?
 Faster PC
 Accuracy for small domains and more complex analysis
methods
Bootstrap - More

Confidentiality revealed from the bootstrap weights
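ID   Wgt   Cluster    B1    B2   ...   B500
A     10   ? (1)      13     0           33
B     10   ? (1)      13     0           33
C     10   ? (1)      13     0           33
D     10   ? (2)      13    14            0
E     10   ? (2)      13    14            0
F     10   ? (2)      13    14            0
G     10   ? (3)       0     0            0
H     10   ? (3)       0     0            0
I     10   ? (4)      13    29            0
J     10   ? (4)      13    29            0
Units whose bootstrap weights are zero or non-zero together across the replicates belong to the same cluster, so the hidden cluster (shown in parentheses) can be deduced from the bootstrap weights.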
Bootstrap - More

Confidentiality revealed from the bootstrap weights
(cont’d)
How can PUMF users estimate their exact variances?
 Remote access
 Provide dummy file
(same structure as master files but contain dummy data)
 Test programs and send by e-mail
 Research Data Centre
 Regional Offices
Why Bootstrap?

Other techniques examined: Taylor, Jackknife
Taylor:
Need to define a linear equation for each statistic
examined
Jackknife:
Cannot be disseminated because of confidentiality
Number of replicates depends on the number of
strata (the large number of strata in 1996 makes
dissemination impossible)
Why Bootstrap?

Bootstrap:
Handle more easily survey design with many strata
Sets of 500 bootstrap weights can be distributed to
data users
Recommended (over the jackknife) for estimating the
variance of nonsmooth functions like quantiles, LICO
Reference: “Bootstrap Variance Estimation for the
National Population Health Survey”, D.Yeo,
H.Mantel, and T.-P. Liu. 1999, ASA Conference.
Bootvar: exercise #3

Results for diabetes broken down by sex and
province
Diabetes, some totals and ratios
NPHS 1998-99 Dummy Health File
                  #          95% C.I.               % of pop.   95% C.I.
Nfld, Man & BC    169,700    (133,400 ; 205,900)    3.1         (2.4 ; 3.8)
Nfld               24,900    (18,200 ; 31,500)      4.6         (3.4 ; 5.9)
  Males             9,800    (4,600 ; 14,700)       3.7         (1.7 ; 5.6)
  Females          15,100    (10,000 ; 20,100)      5.6         (3.7 ; 7.4)
Manitoba           32,300    (20,400 ; 44,200)      3.0         (1.9 ; 4.1)
  Males            15,800    (7,300 ; 24,400)       3.0         (1.3 ; 4.5)
  Females          16,500    (8,000 ; 25,000)       3.0         (1.5 ; 4.7)
B.C.              112,500    (79,300 ; 145,600)     2.9         (2.0 ; 3.7)
  Males            68,700    (43,500 ; 93,900)      3.5         (2.2 ; 4.8)
  Females          43,700    (22,200 ; 65,300)      2.2         (1.1 ; 3.4)
Variables used: DIAB and DIAB / TOTAL (both sexes), MDIAB and MDIAB / MTOTAL (males), FDIAB and FDIAB / FTOTAL (females)
Bootvar: Tricks

If you need to create a dummy variable for a
characteristic based on many variables:
Example: Males with diabetes
First, create dummy variables for each individual
variable (males, diabetes)
Then, create the dummy variable for the characteristic
by multiplying the individual dummy variables
Bootvar: Tricks

Example:
Males = 1,0 (MALES)
Diabetes = 1,0 (DIAB)
Males having diabetes (MDIAB) = MALES * DIAB
MALES  *  DIAB  =  MDIAB
  1        0         0
  1        1         1
  0        0         0
  0        1         0
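The multiplication trick can be checked with the four rows of the example. This Python sketch also derives a female counterpart as (1 - MALES) * DIAB, which is an assumption that only holds because sex is coded as a 0/1 dummy here.

```python
males = [1, 1, 0, 0]
diab = [0, 1, 0, 1]

# Males having diabetes: both dummy variables must equal 1.
mdiab = [m * d for m, d in zip(males, diab)]
print(mdiab)  # [0, 1, 0, 0]

# Assumed counterpart for females, valid when MALES is a 0/1 dummy.
fdiab = [(1 - m) * d for m, d in zip(males, diab)]
print(fdiab)  # [0, 0, 0, 1]
```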
Bootvar: Tricks

Use the REGION parameter in bootvar to specify
a “stratification” variable (doesn’t have to be a
geographic variable!)
Example: REGION = sex
 will produce results by sex
CV look-up tables

What is it?
Approximate sampling variability tables
Produced for Canada, each province, and by age groups
for Canada (also by Health Regions for cycle 2)

Useful only for categorical estimates
Totals & ratios only
CV look-up tables
Approximate Sampling Variability Tables for MANITOBA - Selected Members
NUMERATOR OF                                 ESTIMATED PERCENTAGE
PERCENTAGE
('000)       0.1%    1.0%    2.0%    5.0%   10.0%   15.0%   20.0%   25.0%   30.0%   35.0%   40.0%
   1        103.6   103.2   102.6   101.1    98.4    95.6    92.7    89.8    86.7    83.6    80.3
   2        *****    72.9    72.6    71.5    69.5    67.6    65.6    63.5    61.3    59.1    56.8
   3        *****    59.6    59.3    58.3    56.8    55.2    53.5    51.8    50.1    48.3    46.4
   4        *****    51.6    51.3    50.5    49.2    47.8    46.4    44.9    43.4    41.8    40.2
   5        *****    46.1    45.9    45.2    44.0    42.7    41.5    40.2    38.8    37.4    35.9
   6        *****    42.1    41.9    41.3    40.2    39.0    37.9    36.7    35.4    34.1    32.8
   7        *****    39.0    38.8    38.2    37.2    36.1    35.0    33.9    32.8    31.6    30.4
   8        *****    36.5    36.3    35.7    34.8    33.8    32.8    31.7    30.7    29.6    28.4
   9        *****    34.4    34.2    33.7    32.8    31.9    30.9    29.9    28.9    27.9    26.8
  10        *****    32.6    32.5    32.0    31.1    30.2    29.3    28.4    27.4    26.4    25.4
  11        *****   *****    30.9    30.5    29.7    28.8    28.0    27.1    26.2    25.2    24.2
  12        *****   *****    29.6    29.2    28.4    27.6    26.8    25.9    25.0    24.1    23.2
  13        *****   *****    28.5    28.0    27.3    26.5    25.7    24.9    24.1    23.2    22.3
  14        *****   *****    27.4    27.0    26.3    25.5    24.8    24.0    23.2    22.3    21.5
  15        *****   *****    26.5    26.1    25.4    24.7    23.9    23.2    22.4    21.6    20.7
  16        *****   *****    25.7    25.3    24.6    23.9    23.2    22.4    21.7    20.9    20.1
  17        *****   *****    24.9    24.5    23.9    23.2    22.5    21.8    21.0    20.3    19.5
  18        *****   *****    24.2    23.8    23.2    22.5    21.9    21.2    20.4    19.7    18.9
  19        *****   *****    23.5    23.2    22.6    21.9    21.3    20.6    19.9    19.2    18.4
  20        *****   *****    22.9    22.6    22.0    21.4    20.7    20.1    19.4    18.7    18.0
  21        *****   *****    22.4    22.1    21.5    20.9    20.2    19.6    18.9    18.2    17.5
  22        *****   *****   *****    21.5    21.0    20.4    19.8    19.1    18.5    17.8    17.1
  23        *****   *****   *****    21.1    20.5    19.9    19.3    18.7    18.1    17.4    16.7
  24        *****   *****   *****    20.6    20.1    19.5    18.9    18.3    17.7    17.1    16.4
  25        *****   *****   *****    20.2    19.7    19.1    18.5    18.0    17.3    16.7    16.1
  30        *****   *****   *****    18.4    18.0    17.5    16.9    16.4    15.8    15.3    14.7
  35        *****   *****   *****    17.1    16.6    16.2    15.7    15.2    14.7    14.1    13.6
  40        *****   *****   *****    16.0    15.6    15.1    14.7    14.2    13.7    13.2    12.7
  45        *****   *****   *****    15.1    14.7    14.2    13.8    13.4    12.9    12.5    12.0
  50        *****   *****   *****    14.3    13.9    13.5    13.1    12.7    12.3    11.8    11.4
………
Sampling Variability Guidelines
Type of estimate   CV (%)       Guidelines
Acceptable         0.0-16.5     General unrestricted release
Marginal           16.6-33.3    General unrestricted release but with
                                warning cautioning users of the high
                                sampling variability.
                                Should be identified by letter M.
Unacceptable       > 33.3       No release.
                                Should be flagged with letter U.
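The guidelines above translate directly into a small lookup. In this Python sketch the function name is hypothetical, and rounding the CV to one decimal before comparing is an assumption made because the table lists its cut-offs to one decimal place.

```python
def release_category(cv_percent):
    """Classify a CV (in %) using the sampling variability guidelines."""
    cv = round(cv_percent, 1)  # cut-offs are given to one decimal
    if cv <= 16.5:
        return "Acceptable"
    if cv <= 33.3:
        return "Marginal (M)"
    return "Unacceptable (U)"


for cv in (11.2, 18.7, 27.6, 40.0):
    print(cv, release_category(cv))
```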
CV look-up tables

Comparison between bootstrap CV and CV from
lookup table
For number of people having diabetes:
Manitoba total: T = 32K, CV from table = 18%, bootstrap CV = 18.7%
Manitoba males: T = 16K, CV from table = 25.7%, bootstrap CV = 27.6%
Manitoba females: T = 16.5K, CV from table = 25.3%, bootstrap CV = 26.4%
CV look-up tables

Comparison between bootstrap CV and CV from
lookup table
Other examples (from master - general file)
 Number of people experiencing food insecurity:
Manitoba total: T = 118K, CV from table = 6.4%, bootstrap CV = 11.2%
 Number of people in the lowest income quintile:
Manitoba total: T = 40K, CV from table = 11.9%, bootstrap CV = 19.8%
Bootvar: Regression models
Logistic regression model
 log( p / (1 - p) ) = intercept + b1*X1 + b2*X2, where p = P(Y = 1)
→Y has to be qualitative (categorical)
(for now assume it is dichotomous, i.e. 0,1)
→Xi can be quantitative or qualitative variables
Bootvar: Regression models
Logistic regression model
 Example: Diabetes vs sex and age
→Categorical variables need to be dichotomized
(“dummied”; 1 variable for each category except 1)
→Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
→Age: create a variable for people over 60
(if age > 60 then OVER60=1; else OVER60=0)
→The model is:
log( p / (1 - p) ) = intercept + b1*FEMALE + b2*OVER60, where p = P(DIAB = 1)
Bootvar: Regression models
Logistic regression model
 Example: Diabetes vs sex and age
log( p / (1 - p) ) = intercept + b1*FEMALE + b2*OVER60, where p = P(DIAB = 1)

In bootvar, use %logreg macro
%logreg(yvar,xvar);
%logreg(DIAB,FEMALE OVER60);
Bootvar: Regression models
Linear regression model
 Y = intercept + b1*X1 + b2*X2
→Y is quantitative
→Xi can be qualitative (categorical) or quantitative
Bootvar: Regression models
Linear regression model
 Example: BMI (body mass index) vs sex and age
→Categorical variables need to be dichotomized
(“dummied”; 1 variable for each category except 1)
→Sex: if sex=2 then FEMALE = 1; else FEMALE = 0;
→Age: use it as quantitative (single year of age)
→The model is:
BMI = intercept + b1*FEMALE + b2*AGE
Bootvar: Regression models
Linear regression model
 Example: BMI vs sex and age
BMI = intercept + b1*FEMALE + b2*AGE

In bootvar, use %regress macro
%regress(yvar,xvar);
%regress(BMI,FEMALE AGE);
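Conceptually, a bootstrap variance for regression coefficients means fitting the same weighted model once per bootstrap weight and looking at the spread of the coefficients. The Python sketch below illustrates that idea only; it is not the %regress macro, whose internals are not shown in this workshop. The data, the three bootstrap weight sets and all function names are made up.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def weighted_ols(y, xs, w):
    """Weighted least squares; xs is a list of predictor columns.

    An intercept column is added here. Returns [intercept, b1, b2, ...].
    """
    cols = [[1.0] * len(y)] + xs
    k = len(cols)
    xtwx = [[sum(wi * p * q for wi, p, q in zip(w, cols[i], cols[j]))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(wi * p * yi for wi, p, yi in zip(w, cols[i], y))
            for i in range(k)]
    return solve(xtwx, xtwy)


def bootstrap_coef_variances(y, xs, boot_weight_sets):
    """Refit the model with each set of bootstrap weights; return variances."""
    fits = [weighted_ols(y, xs, bw) for bw in boot_weight_sets]
    b = len(fits)
    variances = []
    for j in range(len(fits[0])):
        mean_j = sum(f[j] for f in fits) / b
        variances.append(sum((f[j] - mean_j) ** 2 for f in fits) / (b - 1))
    return variances


# Made-up analysis file: BMI = intercept + b1*FEMALE + b2*AGE.
female = [0, 1, 0, 1, 0, 1, 0, 1]
age = [25, 32, 41, 48, 56, 63, 35, 70]
bmi = [22.1, 24.3, 25.0, 26.8, 27.2, 28.9, 23.5, 27.0]
wt = [100, 200, 300, 400, 500, 600, 700, 800]
# Three made-up bootstrap weight sets (a real file has 500).
boot_sets = [[133, 267, 400, 0, 667, 800, 933, 1067],
             [0, 267, 400, 533, 667, 800, 0, 1067],
             [133, 0, 400, 533, 0, 800, 933, 1067]]

coefficients = weighted_ols(bmi, [female, age], wt)
variances = bootstrap_coef_variances(bmi, [female, age], boot_sets)
print([round(c, 3) for c in coefficients])
print(variances)
```

The coefficients themselves come from the final survey weight; the bootstrap weights contribute only the variance estimates.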
Bootvar: testing

For version 2.0/2.1:
To test a program quickly, simply set the number of bootstrap weights B to a smaller value (2 < B < 500)

For version 1.0:
See documentation!
Historical info about variance
estimation for NPHS

Cycle 1: Use of Jackknife technique
Could not disseminate with public-use microdata
files; only custom requests

Cycle 2 & +: Use of bootstrap technique
Can not disseminate ….; custom requests or remote
access

All cycles: CV look-up tables
for large domains (provinces, age groups)
only good for totals, ratios, and differences of ...
Variance estimation with other
software programs

WesVar (SPSS)

SAS

SUDAAN

STATA
Future for Stats Can Health
Surveys (vs. bootstrap)

NPHS
 Cycle 4 (2000-2001) data processing & weighting
 Promote the use of longitudinal data
 Bootstrap pgms: finalize version 2.0 (SAS & SPSS)

CCHS
 Cycle 1.1 bootstrap weights
 Bootstrap also used for variance estimation (same programs
as for NPHS)
Contacts
Population Health Surveys
Health Pgm Surveys Manager:
Lorna Bailie ([email protected])
NPHS Manager:
France Bilocq ([email protected])
CCHS Manager:
Marc Hamel ([email protected])
CCHS Dissemination manager:
Larry MacNabb ([email protected])
Senior Methodologists:
François Brisebois ([email protected])
Mylène Lavigne ([email protected])
Yves Béland ([email protected])
Data Access Services Manager:
Mario Bédard ([email protected])
Custom Services Requests:
Garry Macdonald ([email protected])