Transcript Overview of NLSCY Survey methodology Cycle 5
Overview of the National Longitudinal Survey of Children and Youth
Goal of the Presentation
To answer the following questions: What is NLSCY?
What does a researcher need to know when using NLSCY data (analytical issues)?
What are the tools available for NLSCY users?
What is NLSCY?
Longitudinal survey conducted in partnership by HRDC and Statistics Canada.
Follows the development and well-being of children, from birth to adulthood.
Began in 1994; data collection takes place every 2 years, with the 6th cycle of data collection beginning in September 2004.
The Analytical Issues
• The data: strategy / concepts; issues (partial non-response, inconsistencies)
• Sampling: complex design; impact on estimation, precision, analysis
• Type of analysis: longitudinal, cross-sectional, repeated
NLSCY - Overview
Complex data structure
• the lives of children are complex
• dual child/household structure
• new content in each cycle
• some changes in old content
Other constraints
• limit on quantity of information
• limited resources
NLSCY Content
Child (depending on age)
socio-demographic; health; perinatal information; development (motor, social and physical); temperament; academic performance; education; literacy; extracurricular activities; work experience; socialization; relationship with parents; family history and legal custody of children; child care; behaviour; self-esteem; cigarettes, alcohol, drugs; vocabulary assessment; math test; reading comprehension test; sexual activity and loving relationship
Parents
socio-demographic; education/literacy; labour market; income; health; social support; parental involvement at school; parents’ aspirations for child’s education
Family
demography of members; relationships between members of household; family functioning; household; neighbourhood
School
number of students; discipline problems; school atmosphere; resources; characteristics
Teachers
teaching practices; demography; qualifications
Principal
demography; qualifications
Note: Minor changes are made in the content from one cycle to the next.
Unit of Analysis
The child
Sources of information
• Person most knowledgeable about the child (PMK)
• Teacher
• School principal
• Child himself/herself (cognitive measures, self-administered)
Unit of Analysis
Caution: other types of analysis
• Weights are designed for the child
• Concepts like family are characteristics of the child
• Not a domain for estimation
Statements like . . .
number of families with characteristic …
Classic Trade-off of
In the NLSCY, you will find
Investment for the NLSCY
• Focus on derived variables: scales, cognitive measures, transition measures
• Non-response adjustment: total non-response
• Processing of financial data: family income, personal income
• Dissemination within reasonable time
Partial Non-response (item or component)
Respondent units are those which answered the key questions.
Not necessarily all the questions.
Some variables will include non-response, identified by: not stated, don’t know, refusal.
Note: These are different from Not applicable.
How important is it?
• Maybe non-response is random.
• Maybe it's negligible.
• Maybe it can be explained away.
• Maybe I can get away with it.
What are your options?
• Report missing data as a value
• Ignore missing data (limit your analysis to reported data only)
• Correct for the missing data
    • By re-weighting
    • With imputation
    • Model non-response information
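As a rough illustration of the first two options, the sketch below computes a weighted proportion twice: once keeping non-response as a reported value, once restricted to reported answers. The reserved codes (96 to 99), the weights and the answers are all invented for the example; the actual codes are documented in the NLSCY codebooks.

```python
# Hypothetical reserved codes (assumption, not the real NLSCY codes).
NOT_APPLICABLE = 96
NON_RESPONSE = {97: "don't know", 98: "refusal", 99: "not stated"}

# Invented (weight, answer) pairs for a yes/no item: 1 = yes, 2 = no.
records = [(10.0, 1), (12.0, 2), (8.0, 99), (15.0, 1),
           (9.0, 96), (11.0, 97), (14.0, 2)]

# "Not applicable" is out of scope for the question, so it is dropped first.
in_scope = [(w, a) for w, a in records if a != NOT_APPLICABLE]


def weighted_share_yes(rows, denominator_codes):
    """Weighted share of 'yes' among rows whose answer is in denominator_codes."""
    total = sum(w for w, a in rows if a in denominator_codes)
    yes = sum(w for w, a in rows if a == 1)
    return yes / total


# Option 1: report missing data as a value (it stays in the denominator).
share_with_missing = weighted_share_yes(in_scope, {1, 2} | set(NON_RESPONSE))

# Option 2: ignore missing data (reported answers only).
share_reported_only = weighted_share_yes(in_scope, {1, 2})

# Weighted item non-response rate, the first thing to look at.
nonresponse_rate = (sum(w for w, a in in_scope if a in NON_RESPONSE)
                    / sum(w for w, _ in in_scope))

print(share_with_missing, share_reported_only, nonresponse_rate)
```

The two shares differ whenever non-response is not negligible, which is why the choice of option has to be stated with the results.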
Get to know your non respondents
When you have significant non-response:
• You need to assess non-response
• It becomes your first variable of interest
• It’s an analysis like any other analysis you will do
• Otherwise it casts doubt over every finding
NLSCY Data Collection Strategy
[Figure: ages of the children followed in each collection cycle (1994-95, 1996-97, 1998-99, 2000-01), including the E.Y. cohorts. Labels on the figure include "Released summer 2003" and "Release expected December 2004".]
Issues
CROSS-SECTIONAL ANALYSIS
Issues
CROSS-SECTIONAL ANALYSIS
Limitations due to the age of the sample
• part of the sample was not selected for cross-sectional estimates
• inherent complexity in the sample design to meet divergent needs
• coverage problems
    • no update of the sample to reflect changes in the population (e.g., immigration); only the sampling weights have been adjusted to reflect changes
    • the older the cohort gets, the more difficult it is to adjust the sampling weights properly
Issues
CROSS-SECTIONAL ANALYSIS
Limitations due to the nature of the survey
• Problems with sample erosion
• Conditioning bias
• Changes in the definition of age of the child
Interpretation of the results
• Impact on the effectiveness of estimation methods
• Making inferences
Greater potential with the supplementary samples that have been added
Issues
LONGITUDINAL ANALYSIS
One Survey but actually many datasets
A longitudinal file: 1994-95, 1996-97, 1998-99, 2000-01 (Cycle 4)
Intended for cohort analysis of 2 ages, e.g., 0-1, 2-3, 4-5, 6-7, 8-9, 10-11
[Figure: the two-year age cohorts followed across cycles; figure labels include ages 6 and 17 for Cycle 4.]
Issues
LONGITUDINAL ANALYSIS
Limitations due to sample erosion
• sample shrinkage problems
• representation (coverage) problems
• Swiss cheese problems
Conditioning bias
Interpretation of results
• impact on effectiveness of estimation
• inferences
New definition for the age of the child
Dissecting NLSCY Data
Cross-sectional Data
Repeated Surveys
[Figure: ages covered in 1994-95, 1996-97 and 1998-99, with the callouts below.]
• 3 data cycles for children aged 0 to 11
• The sample size is very different
• 2 data cycles
Dissecting NLSCY Data
Cross-sectional Data
Repeated Surveys
[Figure: ages covered in 1994-95, 1996-97 and 1998-99, with the callouts below pointing at different parts of the sample.]
• Whereas these units are independent
• NOTE: The sample units are not independent of one another.
Issues
CROSS-SECTIONAL ANALYSIS (REPEATED) Same limitations as noted earlier The sample overlaps from one cycle to the next.
Independence or interdependence of samples
• There is sample interdependence when the sample is made up of the same respondents
• Involves a covariance factor
• Sample independence is possible only for certain domains (e.g., children aged 0-1)
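The covariance factor can be written out for the simplest case, the difference between two cycle estimates. The notation is introduced here for illustration only: $\hat{\theta}_1$ and $\hat{\theta}_2$ stand for the estimates of the same quantity from two cycles.

```latex
\mathrm{Var}(\hat{\theta}_2 - \hat{\theta}_1)
  = \mathrm{Var}(\hat{\theta}_1) + \mathrm{Var}(\hat{\theta}_2)
  - 2\,\mathrm{Cov}(\hat{\theta}_1, \hat{\theta}_2)
```

When the two samples are independent the covariance term is zero and the two variances simply add. When the same respondents appear in both cycles the covariance is generally not zero, so treating the cycles as independent gives the wrong variance for the difference.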
The obsession with weight in the modelling world
• The basic idea of sampling
• The reason behind complicating a good idea
• The implication when modelling data
How Sampling Works.
(systematic)
background, the dress, the face, the hands, etc.
How Sampling Works.
(stratified): 1%, 3%, 5%, 10%, 2.5%
How does this affect modelling or analysis?
The sample is no longer simply random. We purposefully biased the sample to gain efficiencies and to meet other goals. This bias is corrected when we apply the design weights.
Framework
If you were to analyse each stratum separately, each part could actually be treated as a survey with a simpler design. The sampling frame or design allows you to keep all these parts together in a cohesive way for analysis.
Still, there would be some difficulty associated with the correction for non-response and the final calibration (post-stratification).
How to interpret sampling
If you looked only at the parts we sampled, you wouldn’t get an accurate picture.
All the parts would be there but not in the right proportions.
The way we sample is reflected and corrected by how we weight the data in the end.
The design weights compensate for the known distortions. The final weights include estimated distortions.
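A small sketch of "right parts, wrong proportions", in Python. It assumes the percentages on the stratified sampling slide (1%, 3%, 5%, 10%, 2.5%) are per-stratum sampling rates. The stratum sizes and stratum means are invented, and every sampled unit is given its stratum mean to keep the arithmetic visible.

```python
# (sampling rate, invented stratum population, invented stratum mean score)
strata = [
    (0.01, 20_000, 50.0),
    (0.03, 10_000, 55.0),
    (0.05, 4_000, 60.0),
    (0.10, 1_000, 70.0),
    (0.025, 8_000, 52.0),
]

# Number of units sampled in each stratum.
sample_sizes = [round(rate * size) for rate, size, _ in strata]

# Unweighted: every sampled unit counts the same, so heavily sampled
# strata are over-represented.
unweighted_mean = (
    sum(n * mean for n, (_, _, mean) in zip(sample_sizes, strata))
    / sum(sample_sizes)
)

# Design weight = 1 / sampling rate: each sampled unit stands for that
# many units in the population.
design_weights = [1 / rate for rate, _, _ in strata]
weighted_mean = (
    sum(w * n * mean for w, n, (_, _, mean) in zip(design_weights, sample_sizes, strata))
    / sum(w * n for w, n in zip(design_weights, sample_sizes))
)

print(round(unweighted_mean, 2), round(weighted_mean, 2))
```

The unweighted mean is pulled toward the strata sampled at 5% and 10%; applying the design weights restores the population proportions.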
On what would you base the fundamental multivariate relationships in your model or analysis?
Steps to calculate the weights – Basic overview
• At the survey design stage, some factors are used to determine the sample size required
• Probability of selection calculated
• First series of adjustments for non-response
• Post-stratification
Factors to determine the sample size
• Characteristics to be estimated (small proportions)
• Required precision of the estimates (targeted CV)
• Variability of the data
• Expected non-response rate
• Size of the population
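How these factors interact can be sketched for the textbook case of a proportion under simple random sampling. This is a simplification and not the NLSCY's actual sample size calculation, which also has to deal with the complex design; the function name and the example values are hypothetical.

```python
import math


def sample_size_for_proportion(p, target_cv, population, nonresponse_rate):
    """Sample size needed to estimate proportion p with the targeted CV,
    assuming simple random sampling (an assumption, not the NLSCY design)."""
    # Under SRS, CV(p_hat)^2 = (1 - p) / (n * p), ignoring the population size.
    n0 = (1 - p) / (p * target_cv ** 2)
    # Finite population correction for the size of the population.
    n = n0 / (1 + n0 / population)
    # Inflate for the expected non-response rate.
    return math.ceil(n / (1 - nonresponse_rate))


# A rarer characteristic needs a larger sample for the same targeted CV.
rare = sample_size_for_proportion(0.05, 0.15, 100_000, 0.15)
common = sample_size_for_proportion(0.30, 0.15, 100_000, 0.15)
print(rare, common)
```

Small proportions, tight CV targets and high expected non-response all push the required sample size up, which is the trade-off made at the design stage.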
Original design weight
Once the sample is selected in each stratum, calculate the original weight: $N_h / n_h$, where $h$ is the stratum.
Since the sample is selected from the LFS, get the original weight from the LFS.
Adjustments for the number of available children.
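The basic $N_h / n_h$ calculation looks like this in Python. The stratum labels and counts are invented; in the NLSCY the starting weight comes from the LFS, so this only illustrates the principle.

```python
# Invented stratum population counts (N_h) and sample counts (n_h).
population_counts = {"A": 12_000, "B": 3_000, "C": 500}
sample_counts = {"A": 300, "B": 150, "C": 100}

# Original design weight in stratum h: N_h / n_h.
design_weights = {
    h: population_counts[h] / sample_counts[h] for h in population_counts
}

# Check: summing the weights over the sampled units gives back the population.
weighted_total = sum(design_weights[h] * sample_counts[h] for h in sample_counts)

print(design_weights, weighted_total)
```

A unit in stratum A stands for 40 units in the population, a unit in stratum C for only 5, which is why the strata cannot simply be pooled without weights.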
Non-response adjustment
Adjustments must be made to take into account the total non-response.
Characteristics of respondents vs. non-respondents are analyzed: province, income, level of education of parents, depression scale of PMK, urban/rural, etc.
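One common way to make such an adjustment is to form classes from the characteristics that distinguish respondents from non-respondents, then inflate the respondents' weights within each class. The sketch below assumes that approach; the classes, weights and response flags are invented, and the actual NLSCY adjustment may be built differently.

```python
from collections import defaultdict

# Invented sample: (adjustment class, design weight, responded?)
sample = [
    ("urban", 40.0, True), ("urban", 40.0, False), ("urban", 20.0, True),
    ("rural", 10.0, True), ("rural", 10.0, True), ("rural", 30.0, False),
]

selected_weight = defaultdict(float)    # weight of all selected units
respondent_weight = defaultdict(float)  # weight of respondents only
for adj_class, weight, responded in sample:
    selected_weight[adj_class] += weight
    if responded:
        respondent_weight[adj_class] += weight

# Adjustment factor per class: selected weight / respondent weight.
nonresponse_adj = {
    c: selected_weight[c] / respondent_weight[c] for c in selected_weight
}

# Respondents carry the weight of the non-respondents in their class.
adjusted = [
    (c, w * nonresponse_adj[c]) for c, w, responded in sample if responded
]

print(nonresponse_adj, adjusted)
```

After the adjustment the respondents' weights add up to the same total as the full selected sample, class by class.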
Post-stratification
Adjustment factor calculated in order to post-stratify the sample to known population counts, by: Province, age, gender
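A minimal sketch of the post-stratification factor, assuming it is the known population count divided by the sum of the current weights in each province-age-gender cell. The cells, weights and counts below are invented.

```python
from collections import defaultdict

# Invented respondents: (post-stratum cell, weight after non-response adjustment).
respondents = [
    (("PEI", "0-5", "F"), 30.0), (("PEI", "0-5", "F"), 50.0),
    (("PEI", "0-5", "M"), 45.0), (("PEI", "0-5", "M"), 45.0),
]

# Invented known population counts for the same cells.
known_counts = {("PEI", "0-5", "F"): 100.0, ("PEI", "0-5", "M"): 81.0}

# Sum of current weights in each cell.
weight_sums = defaultdict(float)
for cell, weight in respondents:
    weight_sums[cell] += weight

# Post-stratification factor: known count / current weighted count.
post_factors = {cell: known_counts[cell] / weight_sums[cell] for cell in known_counts}

post_stratified = [(cell, w * post_factors[cell]) for cell, w in respondents]

print(post_factors)
```

After this step the weights in each cell add up exactly to the known population count for that cell.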
Final weight
$W_f = W_i \times Adj_1 \times Adj_2$
where
$W_f$: final weight
$W_i$: initial weight
$Adj_1$: non-response adjustment
$Adj_2$: post-stratification adjustment
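A worked example of the product, with invented values for the initial weight and the two adjustment factors (the function name is hypothetical):

```python
def final_weight(initial_weight, nonresponse_adj, poststrat_adj):
    """Final weight = initial weight x non-response adj. x post-stratification adj."""
    return initial_weight * nonresponse_adj * poststrat_adj


# Invented values: a child with initial weight 40, in a non-response class
# adjusted by 1.25 and a post-stratum adjusted by 0.9.
w_f = final_weight(40.0, 1.25, 0.9)
print(round(w_f, 2))
```

The non-response adjustment raises the weight from 40 to 50, and the post-stratification factor then scales it to about 45.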
Link between analysis and the sample design (weight)
[Diagram: factors linked in the model, with labels Child’s Intelligence, Ability, School, Materials, Curriculum, Grade level, Subject, Province.]
The proportion of kids in the sample being taught the PEI curriculum is much larger than what’s found in the population.
Province is a stratum.
Link between analysis and the sample design
There are very few things in a child’s life that are not related to where they live.
• In the city versus in a small village
• In a small province versus a large one
• what social/educational programs are offered
• what social support and services are offered
• regional cultural differences
• to name a few…
Weights for cycle 4
• Cross-sectional weights
• Longitudinal weights, including the converted respondents
• Longitudinal weights for children introduced in Cycle 1 who responded to all cycles (NEW)
Not to mention the bootstrap weights, which are used for an entirely different purpose.
Cross-sectional Weights
Available for all cycles, up to Cycle 4.
When are they used?
Cycle 4 cross-sectional weights: to represent the population aged 0-17 in 2000-01.
… Cycle 1 weights: to represent the population aged 0-11 in 1994-95.
Cross-sectional Weights - Cycle 4 Warning
In Cycle 4, children with a cross sectional weight come from 4 different cohorts (introduced in 1994, 1996, 1998 and 2000).
By 2000, the 1994 cohort has been around for 6 years: cross-sectional representativity decreases over time because of sample erosion and population change (immigration).
Cross-sectional Weights - Cycle 5
For Cycle 5 (2002-2003), no children aged 6 and 7.
In addition, the 1994 cohort’s cross sectional representativity has declined even further (erosion and immigration).
As a result, cross-sectional weights will be calculated only for children aged 0-5.
Cross-sectional weights in a nutshell
Cross-sectional weights must be used when the analysis concerns a specific year, when you want a snapshot of the situation at a specific point in time.
Longitudinal Weights
Longitudinal weights represent the population of children at the time they were brought in to the survey.
Children introduced in Cycle 1: longitudinal weights represent the population of children aged 0-11 in 1994-95.
Longitudinal Weights (continued)
Children introduced in Cycle 2: longitudinal weights represent the population of children aged 0-1 in 1996-97.
Children introduced in Cycle 3: longitudinal weights represent the population of children aged 0-1 in 1998-99.
Children introduced in Cycle 4: longitudinal weights represent the population of children aged 0-1 in 2000-01.
When are longitudinal weights used?
When you want to track a cohort of children introduced in a particular cycle and see how they’ve developed over time.
Longitudinal Weights - Cycle 4
Something new in Cycle 4: 2 sets of longitudinal weights
Set 1: Weights for children who responded in their first cycle and in Cycle 4 (possible non-response in Cycle 2 or 3)
Set 2 (NEW): Weights for those introduced in Cycle 1 who responded in every cycle
Longitudinal Weights - Cycle 4
Difference between the 2 sets of longitudinal weights
To avoid total non-response in Cycle 2 or 3, the set of weights for those who responded throughout can be used.
If you’re only interested in the changes between Cycle 1 and Cycle 4 directly, the longitudinal weights including converted respondents can be used.
Examples
Following are real examples taken from the NLSCY data
Weighting - Examples
Average weights in Cycle 4.
Prince Edward Island 7 5-year-old 1-year-olds
Weighting - Examples
Average weights in Cycle 4 (continued)
Ontario, 15-year-olds: 712
Example: Proportion of children aged 0-17, by province, Cycle 4, UNWEIGHTED
Unweighted, 24% of Canada’s children appear to live in the Atlantic provinces (Nfld, PEI, NS, NB) … whereas in reality...
Province    Sample size    Percentage
Nfld              1,826          6.0%
PEI               1,025          3.4%
NS                2,259          7.5%
NB                2,037          6.7%
Que               5,337         17.6%
Ont               7,468         24.6%
Man               2,356          7.8%
Sask              2,353          7.8%
Alberta           2,986          9.9%
BC                2,659          8.8%
Total            30,306
Example: Proportion of children aged 0-17, by province, Cycle 4, WEIGHTED
Whereas in reality… 7.3% of children live in the Atlantic provinces (Nfld, PEI, NS, NB).
Province    Weighted count    Percentage
Nfld               116,080          1.6%
PEI                 33,311          0.5%
NS                 208,160          2.9%
NB                 165,078          2.3%
Que              1,590,325         22.5%
Ont              2,747,236         38.8%
Man                289,265          4.1%
Sask               265,221          3.8%
Alberta            763,858         10.8%
BC                 892,908         12.6%
Total            7,071,442
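The two tables can be checked with a few lines of Python. The counts are copied from the slides; the only assumption is that the second table holds the sums of the Cycle 4 cross-sectional weights, so that dividing by the sample size gives an average weight per province.

```python
# Counts taken from the unweighted and weighted tables above.
sample_size = {
    "Nfld": 1_826, "PEI": 1_025, "NS": 2_259, "NB": 2_037, "Que": 5_337,
    "Ont": 7_468, "Man": 2_356, "Sask": 2_353, "Alberta": 2_986, "BC": 2_659,
}
weighted_count = {
    "Nfld": 116_080, "PEI": 33_311, "NS": 208_160, "NB": 165_078,
    "Que": 1_590_325, "Ont": 2_747_236, "Man": 289_265, "Sask": 265_221,
    "Alberta": 763_858, "BC": 892_908,
}
eastern = ["Nfld", "PEI", "NS", "NB"]  # the four provinces summed on the slides

unweighted_share = sum(sample_size[p] for p in eastern) / sum(sample_size.values())
weighted_share = sum(weighted_count[p] for p in eastern) / sum(weighted_count.values())

# Implied average weight per province (weighted count / sample size).
average_weight = {p: weighted_count[p] / sample_size[p] for p in sample_size}

print(f"unweighted: {unweighted_share:.1%}, weighted: {weighted_share:.1%}")
print(f"PEI: {average_weight['PEI']:.0f}, Ontario: {average_weight['Ont']:.0f}")
```

The unweighted share comes out at about 23.6%. The weighted share computed from the counts is about 7.4%; the 7.3% on the slide is the sum of the four rounded percentages. Either way, the unweighted sample overstates these provinces by roughly a factor of three.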
The weights in brief
To be obsessed with weights is a good thing…where statistical analysis is concerned
Variance
Why is it necessary to compute the variance? How can the variance be computed with NLSCY data?
Why Compute the Variance?
NLSCY data come from a probabilistic survey:
• There is variability associated with estimates produced with data from any probabilistic survey
• To make valid inferences about the population of interest, that variability must be measured
Main Difficulty: Complexity of the NLSCY’s Sampling Plan
Two different sample frames used to select the sample:
• Labour Force Survey (LFS), itself a survey with a complex sample design
• Birth Register
Use of two frames for certain groups (five-year-olds, Cycle 3)
Complexity of the NLSCY’s Sampling Plan (continued)
• Children’s selection probabilities very uneven
• Non-response adjustments that cross strata boundaries
• Empty clusters from the LFS
Effects of the Complexity of the NLSCY’s Sampling Plan
No exact analytical formula for computing the variance because of the complex sample design.
No commercial application can fully take the NLSCY’s complexity into account in computing the variance.
How to Compute the Variance for the NLSCY
3 solutions: 1) Approximate sampling variability tables provided in the user’s guide (in the form of coefficients of variation (CVs)).
Available for the first 4 cycles.
2) Approximate CV tables for a number of specific subject areas, in a single Excel spreadsheet, with a newly developed interface using Visual Basic and Excel.
How to Compute the Variance for the NLSCY (cont’d)
3 solutions: 3) Use the bootstrap weights and the SAS program supplied by Statistics Canada.
NLSCY_VES (specific to NLSCY). A new generalized version of Bootvar will soon be available, which will be usable for all survey data available in the RDC.
SUDAAN can also use the Bootstrap weights.
How to Compute the Variance for the NLSCY
Of these 3 solutions:
• The first two can be used for exploratory analysis. These 2 methods provide an approximation of the variance.
• Only the third solution computes the variance “more exactly”.
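The idea behind the third solution can be sketched in a few lines of Python. This is not the Statistics Canada program: it assumes the usual replicate approach, in which the estimate is recomputed with each set of bootstrap weights and the variance is the average squared deviation from the full-sample estimate. The function names and the tiny data set are invented; the exact formula and the number of replicates should be taken from the documentation supplied with the bootstrap weights.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_variance(values, full_weights, bootstrap_weight_sets):
    """Return (estimate, variance, CV) using replicate (bootstrap) weights.

    Assumed formula: variance = mean over replicates of
    (replicate estimate - full-sample estimate) squared.
    """
    estimate = weighted_mean(values, full_weights)
    replicates = [weighted_mean(values, w) for w in bootstrap_weight_sets]
    variance = sum((r - estimate) ** 2 for r in replicates) / len(replicates)
    cv = math.sqrt(variance) / estimate
    return estimate, variance, cv


# Invented data: a 0/1 characteristic for five children, their final
# weights, and four sets of bootstrap weights (real files hold many more).
values = [1, 0, 1, 1, 0]
full_weights = [40.0, 20.0, 5.0, 30.0, 25.0]
bootstrap_weight_sets = [
    [80.0, 0.0, 5.0, 30.0, 50.0],
    [40.0, 40.0, 10.0, 0.0, 25.0],
    [0.0, 20.0, 5.0, 60.0, 25.0],
    [40.0, 20.0, 0.0, 30.0, 50.0],
]

estimate, variance, cv = bootstrap_variance(values, full_weights, bootstrap_weight_sets)
print(estimate, variance, cv)
```

Because the bootstrap weights are built to reflect the sample design, the spread of the replicate estimates reflects it as well, which a simple random sampling formula cannot do.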
Final words on the variance
• The variance must be computed if we are to make valid inferences.
• The sample design must be taken into account if we want the variance calculation to be valid. Otherwise, we may draw incorrect conclusions.
• A workshop on how to calculate the variance using the bootstrap weights is available.
A few words on cycle 6
Children from original cohort are now 10 to 21 years old.
Children introduced in cycles 2 and 3 are done with the survey.
Children introduced in cycle 4 are 4-5 years old.
Children introduced in cycle 5 are 2-3 years old.
In cycle 6, new children aged 0-1 are introduced to the survey. An additional top-up of children aged 2-5 is added in some provinces.
NLSCY Cycle 6 Sample
[Figure: Cycle 6 sample, 2004-05, by age; age bands labelled on the figure: 0-1, 2-5 (shown twice), 6-9, and 10 to 21.]
Summary
NLSCY offers a lot of possibilities for analysis:
• Many types of analysis possible
• Complex data structure
Data are not perfect:
• Not fully edited
• Partial non-response
• Changes over time
Proper weights must be used, and the variance must be calculated taking the design into account, to make valid inferences.
Some resources available
Series of PowerPoint presentations about NLSCY.
Workshops on non-response analysis and variance calculation for NLSCY.
Codebooks for all cycles.
Users’ Guides for all cycles.
Checklists for proposal and before submitting a paper for review.
Some resources available (cont’d)
Statistics Canada’s Client Services:
Client Services, Special Surveys Division
Telephone: (613) 951-3321 or 1-800-461-9050
Fax: (613) 951-4527
E-mail: [email protected]
My coordinates
Charles Tardif
Room 2500, Main Building, Statistics Canada
Tunney’s Pasture, Ottawa, Ontario K1A 0T6
[email protected]
Tel: (613) 951-4353