
Bayesian graphical models for inference from combinations of data

Nicky Best and Chris Jackson

With Sylvia Richardson
Department of Epidemiology and Public Health, Imperial College, London

http://www.bias-project.org.uk

Example: low birth weight and air pollution

- Does exposure to air pollution during pregnancy increase the risk of low birth weight?
- Example illustrates various biases.
- Combine datasets with different strengths:
  - Survey data (Millennium Cohort Study) → small, great individual detail.
  - Administrative data (national births register) → large, but little individual detail.
- Single underlying model assumed to govern both datasets: elaborate as appropriate to handle biases.

Low birth weight

- Important determinant of future health → population health indicator.
- Established risk factors:
  - Tobacco smoking during pregnancy.
  - Ethnicity (South Asian, issue for UK data).
  - Maternal age, weight, height, number of previous births.
- Role of environmental risk factors, such as air pollution, less clear.
  - Various studies around the world suggest a link.
  - Exposure to urban air pollution correlated with socioeconomic factors → ethnicity, tobacco smoking → confounding.

Data sources (1): Millennium Cohort Study

- About 15,000 births in the UK between Sep 2000 and Aug 2001 (we study only England and Wales, singleton births).
- Postcode made available to us under strict security.
  - Match individuals with annual mean concentration of certain air pollutants (PM10, NO2, CO, SO2) (NETCEN).
- Birth weight, and reasonably complete set of confounder data, available.
- Allows a reasonable analysis, but issues remain:
  - Low power to detect small effect → could be improved by incorporating other data.
  - Selection bias…

Selection of Millennium Cohort

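[Figure: tree diagram. All UK wards are split by country (England, Scotland, Wales, Northern Ireland) and then by stratum (high child poverty / low child poverty, plus high ethnic minority in England only), with a selection probability for each stratum.]

England: high child poverty 0.04; low child poverty 0.02; high ethnic minority 0.11.
Scotland, Wales, Northern Ireland (high / low child poverty strata): selection probabilities 0.07, 0.04, 0.18, 0.06, 0.16, 0.08 (the assignment of these values to strata is not recoverable from the transcript).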

Selection bias in the Millennium Cohort

- Survey disproportionately represents population.
  - If selection scheme (= child poverty / ethnicity) is related to exposure (= pollution) and outcome (= low birth weight), then the estimate of association is biased.
- Accounting for selection bias:
  - Adjust model for all variables affecting selection, or
  - Weight cases by inverse probability of selection.
- Cluster sampling → within-ward correlations → for correct standard errors for inference on population, use a hierarchical (multilevel) model with groups defined by wards.
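The weighting option can be illustrated with a minimal sketch. The selection probabilities below are the three England strata values from the selection slide; the individual outcomes are invented purely for illustration, and the function names are hypothetical. The sketch only shows how inverse-probability weights pull an over-sampled stratum back towards its population share; it is not the analysis used in the study.

```python
# Hypothetical records: (stratum selection probability, low birth weight indicator).
# The probabilities 0.04, 0.02, 0.11 come from the selection slide; outcomes are made up.
sample = [
    (0.04, 1), (0.04, 0), (0.04, 0),   # high child poverty stratum
    (0.02, 0), (0.02, 0),              # low child poverty stratum
    (0.11, 1), (0.11, 1), (0.11, 0),   # high ethnic minority stratum
]


def unweighted_prevalence(records):
    """Proportion of low birth weight in the sample, ignoring the design."""
    return sum(y for _, y in records) / len(records)


def weighted_prevalence(records):
    """Proportion of low birth weight with each case weighted by 1 / P(selection)."""
    weights = [1.0 / p for p, _ in records]
    weighted_cases = sum(w * y for w, (_, y) in zip(weights, records))
    return weighted_cases / sum(weights)


if __name__ == "__main__":
    print("unweighted:", unweighted_prevalence(sample))
    print("weighted:  ", weighted_prevalence(sample))
```

In this toy sample the over-sampled strata carry more low birth weight cases, so the weighted estimate is lower than the unweighted one.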

Data sources (2): National birth register

- Every birth in the population recorded.
- Individual data with postcode (→ pollution exposure) and birth weight available to us under strict security.
- Social class and employment status of parents also available for a 10% sample.
- We study only this 10% sample: 50,000 births between Sep 2000 and Aug 2001.
- Larger dataset, no selection bias, …but no confounder information, especially ethnicity and smoking.

Data sources (3): Aggregate data

- Ethnic composition of the population → 2001 census → for census output areas (~500 individuals).
- Tobacco expenditure → consumer surveys (CACI, who produce ACORN consumer classification data) → for census output areas.
- …linked by postcode to Millennium Cohort and national register data.

Birth weight and pollution (source: MCS)

Birth weight and ethnicity (source: MCS)

Birth weight and smoking (source: MCS)

Pollution and confounders (source: MCS)

Models for formally analysing combined data

Want estimate of the association between low birth weight and pollution, using all data, accounting for:
- Selection bias in MCS
  - Adjust models for all predictors of selection
  - Or weight by inverse probability of selection
- Missing confounders in register
  - Bayesian graphical model…

Graphical model representation

[Figure: graphical model with nodes ETH i, POLL i and LBW i for baby i in the register, nodes ETH j, POLL j and LBW j for baby j in the MCS, and a shared MODEL node. Node styling distinguishes known from unknown quantities.]

- LBW i: low birth weight.
- POLL i: pollution exposure (plus other confounders observed in both datasets).
- ETH i: ethnicity and smoking. Only observed in the MCS.
- Same MODEL assumed to govern both datasets.

Adding in the imputation model

[Figure: the graphical model extended with nodes AGG i and AGG j and a MODEL (imputation) node, alongside ETH i, POLL i, LBW i for baby i in the register, ETH j, POLL j, LBW j for baby j in the MCS, and MODEL (LBW).]

- AGG i: aggregate ethnicity/smoking data for area of residence of baby i.
- MODEL for imputation of ETH i in terms of aggregate data and other variables. Estimate it from observed ETH j in the MCS.

Bayesian model

- Estimate both:
  - Imputation model for missing ethnicity and smoking
  - Outcome model for the association between low birth weight and pollution.
- All beliefs about unknown quantities expressed as probability distributions.
  - Prior beliefs (often ignorance) → posterior distributions, modified in light of data.
- Joint posterior distribution of all unknowns estimated by Markov Chain Monte Carlo (MCMC) simulation (WinBUGS software).
- Graphical representation of the model guides the MCMC simulation.
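To make the idea of estimating a posterior by MCMC concrete, here is a minimal sketch of a random-walk Metropolis sampler for a single unknown: the log-odds of low birth weight, given a hypothetical count of low birth weight babies. The vague normal prior, the counts, the step size and the function names are all assumptions made for illustration. The talk's models were fitted in WinBUGS and are far richer than this one-parameter toy.

```python
import math
import random


def log_posterior(theta, n_low, n_total, prior_sd=10.0):
    """Log posterior (up to a constant) of the log-odds theta of low birth weight.

    Binomial likelihood with a vague Normal(0, prior_sd^2) prior on theta (assumed).
    """
    # log p = -log(1 + exp(-theta)); log(1 - p) = -log(1 + exp(theta))
    log_lik = (-n_low * math.log1p(math.exp(-theta))
               - (n_total - n_low) * math.log1p(math.exp(theta)))
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return log_lik + log_prior


def metropolis(n_low, n_total, n_iter=20000, burn_in=2000, step_sd=0.3, seed=1):
    """Random-walk Metropolis; returns post-burn-in draws of the probability p."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, n_low, n_total)
    draws = []
    for iteration in range(n_iter):
        proposal = theta + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, n_low, n_total)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if iteration >= burn_in:
            draws.append(1.0 / (1.0 + math.exp(-theta)))
    return draws


if __name__ == "__main__":
    draws = metropolis(n_low=30, n_total=400)   # hypothetical counts
    print("posterior mean of p:", sum(draws) / len(draws))
```

The posterior summary is simply an average over the retained draws, which is the same principle used to report posterior means and intervals for the odds ratios.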

Variables in the final models: (1) regression model for low birth weight

Probability baby i has birth weight under 2.5 kg modelled in terms of:
- Pollution (NO2 and SO2)
- Ethnicity (White / South Asian / Black / other)
- Smoking during pregnancy (yes/no)
- Social class of mother
- Survey selection strata (for MCS data)

Other variables not significant in multiple regression, or not confounded with pollution (mother's weight, height, maternal age, number of previous births, hypertension during pregnancy, …).

Variables in the final models: (2) imputation model for missing data

Probability baby i is in one of eight categories:
- ethnicity: 1. White / 2. South Asian / 3. Black / 4. other
- smoking during pregnancy: 1. No / 2. Yes

Modelled in terms of small-area variables for baby i:
- Proportion of population in each of three ethnic minority categories (South Asian / Black / other)
- Tobacco expenditure
- MCS survey selection strata

…and some individual-level variables for baby i:
- Pollution exposure
- Low birth weight
- Social class, employment status of mother.
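One common way to model membership of one of several categories is a multinomial logistic (softmax) regression. The slides do not state the exact functional form, so the following is only a sketch under that assumption; the linear predictor values and names are hypothetical. It shows how seven linear predictors, with the first category as reference, map to eight probabilities that sum to one.

```python
import math

# The eight ethnicity x smoking categories from the slide, in a fixed order.
CATEGORIES = [
    (ethnicity, smoking)
    for ethnicity in ("White", "South Asian", "Black", "Other")
    for smoking in ("No", "Yes")
]


def category_probabilities(linear_predictors):
    """Softmax over 8 categories; the first category is the reference (predictor 0).

    linear_predictors: 7 values, one per non-reference category (assumed to be
    built from small-area and individual-level covariates).
    """
    if len(linear_predictors) != len(CATEGORIES) - 1:
        raise ValueError("expected one linear predictor per non-reference category")
    etas = [0.0] + list(linear_predictors)
    largest = max(etas)                      # subtract the max for numerical stability
    exps = [math.exp(eta - largest) for eta in etas]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    # Hypothetical predictors for one baby.
    probs = category_probabilities([-1.2, -2.0, -3.1, -3.0, -4.2, -3.5, -4.8])
    for category, prob in zip(CATEGORIES, probs):
        print(category, round(prob, 3))
```

In a Bayesian fit, the missing category for each register baby would be drawn from these probabilities at every MCMC iteration, so imputation uncertainty carries through to the outcome model.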

Odds ratios (posterior mean, 95% CI)

Data                              NO2*              SO2*              Smoking           South Asian
Register, ignore confounding      1.20 (1.13,1.27)  1.03 (1.00,1.07)  -                 -
MCS (a)                           1.04 (0.89,1.21)  1.04 (0.96,1.12)  2.00 (1.71,2.34)  2.76 (2.14,3.56)
MCS (b)                           1.08 (0.94,1.23)  1.04 (0.96,1.12)  2.00 (1.71,2.34)  3.01 (2.42,3.74)
Register + MCS                    0.97 (0.91,1.03)  1.01 (0.97,1.05)  1.94 (1.80,2.10)  2.92 (2.61,3.26)
Register, adjust for confounding  0.97 (0.91,1.04)  1.01 (0.97,1.07)  1.94 (1.76,2.12)  2.93 (2.57,3.33)

(a), (b): the two MCS-only analyses, "MCS, ignore selection" and "MCS"; the transcript does not preserve which row is which.

*One unit of pollution concentration = interquartile range of pollution concentration across England and Wales
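Reporting an odds ratio per interquartile range amounts to rescaling the regression coefficient before exponentiating. The sketch below assumes a logistic regression coefficient expressed per raw concentration unit and a hypothetical IQR; the function names are illustrative. It also shows one simple way to summarise posterior draws as a mean and an equal-tailed interval, as in the table's "posterior mean, 95% CI" format.

```python
import math


def odds_ratio_per_iqr(beta_per_unit, iqr):
    """Odds ratio for an increase of one IQR, given a log-odds coefficient per raw unit."""
    return math.exp(beta_per_unit * iqr)


def summarise(or_draws, level=0.95):
    """Posterior mean and equal-tailed interval from draws of an odds ratio.

    Uses simple order statistics (no interpolation), which is adequate for a sketch.
    """
    ordered = sorted(or_draws)
    n = len(ordered)
    lower = ordered[int((1.0 - level) / 2.0 * (n - 1))]
    upper = ordered[int((1.0 + level) / 2.0 * (n - 1))]
    return sum(ordered) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical coefficient per unit of concentration and hypothetical IQR of 8 units.
    print(odds_ratio_per_iqr(beta_per_unit=0.0228, iqr=8.0))
```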

Conclusions so far

- No evidence for association of pollution exposure with low birth weight.
- Combining the datasets can
  - increase statistical power of the survey data
  - alleviate bias due to confounding in the administrative data.
- Must allow for selection mechanism of survey when combining data.

Work in progress

- Sensitivity to different choices for the imputation model
  - External data (e.g. small-area data) on confounders not always available
- More investigation of selection bias, and different ways of accounting for it
- Quantify relative influence of each dataset
- Other biases, expected to be smaller problem
  - Missing data in MCS
  - Exposure measurement error
- Distinguish between preterm birth and low full-term birth weight.

Combining aggregate and individual data

- Aggregate (ecological) data
  - Administrative data usually aggregated to preserve confidentiality.
  - Make inferences on individual-level risk factors and outcomes using aggregate data.
- "Ecological bias" caused by
  - within-area variability of risk factors
  - confounding caused by limited number of variables.
- Needs appropriate models, and often individual data → survey/cohort data, case-control data.
- Combining aggregate and individual data:
  - can reduce ecological bias and increase power
  - distinguish contextual effects from individual.
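A small numerical sketch can show why within-area variability of a risk factor produces ecological bias. It assumes an individual-level logistic model with a binary exposure and hypothetical coefficients. The area-level risk is then the average of the individual risks, which is linear in the proportion exposed. Applying the individual logistic curve directly to that proportion gives a different answer. All names and numbers below are illustrative.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def area_risk_from_individual_model(alpha, beta, proportion_exposed):
    """Area-level risk obtained by averaging individual risks over the area.

    Unexposed individuals have risk expit(alpha); exposed have expit(alpha + beta).
    """
    risk_unexposed = expit(alpha)
    risk_exposed = expit(alpha + beta)
    return (1.0 - proportion_exposed) * risk_unexposed + proportion_exposed * risk_exposed


def naive_ecological_risk(alpha, beta, proportion_exposed):
    """Risk from plugging the area proportion into the individual-level logistic curve."""
    return expit(alpha + beta * proportion_exposed)


if __name__ == "__main__":
    alpha, beta = -3.0, 1.5   # hypothetical log-odds intercept and exposure effect
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(x,
              round(area_risk_from_individual_model(alpha, beta, x), 4),
              round(naive_ecological_risk(alpha, beta, x), 4))
```

The two curves agree only when an area is entirely unexposed or entirely exposed, so a naive ecological regression does not recover the individual-level effect when exposure varies within areas.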

Publications

Our papers, presentations and software available from http://www.bias-project.org.uk

- C. Jackson, N. Best, S. Richardson. Hierarchical related regression for combining aggregate and survey data in studies of socio-economic disease risk factors. Under revision, Journal of the Royal Statistical Society, Series A.
- C. Jackson, N. Best, S. Richardson. Improving ecological inference using individual-level data. Statistics in Medicine (2006) 25(12):2136-2159.
- C. Jackson, S. Richardson, N. Best. Studying place effects on health by synthesising area-level and individual data. Submitted.