Transcript Slide 1
Graphical models for combining multiple sources of information in observational studies
Nicky Best, Sylvia Richardson, Chris Jackson, Virgilio Gomez, Sara Geneletti
ESRC National Centre for Research Methods – BIAS node

Outline
• Overview of graphical modelling
• Case study 1: Water disinfection byproducts and adverse birth outcomes
  – Modelling multiple sources of bias in observational studies
• Bayesian computation and software
• Case study 2: Socioeconomic factors and heart disease (Chris Jackson)
  – Combining individual and aggregate level data
  – Application to Census, Health Survey for England, HES

Graphical modelling
Mathematics, Modelling, Algorithms, Inference

1. Mathematics
• Key idea: conditional independence
• X and W are conditionally independent given Z if, knowing Z, discovering W tells you nothing more about X
  P(X | W, Z) = P(X | Z)

Example: Mendelian inheritance
• Y, Z = genotypes of parents
• W, X = genotypes of 2 children
• If we know the genotypes of the parents, then the children’s genotypes are conditionally independent
  P(X | W, Y, Z) = P(X | Y, Z)
[Figure: graph on nodes Y, Z (parents) and W, X (children)]

Joint distributions and graphical models
Graphical models can be used to:
• represent the structure of a joint probability distribution…
• …by encoding conditional independencies
[Figure: the same graph, with nodes annotated P(Y), P(Z), P(W | Y, Z), P(X | Y, Z)]
  P(W, X, Y, Z) = P(W | Y, Z) P(X | Y, Z) P(Y) P(Z)
Factorization theorem: joint distribution P(V) = ∏_v P(v | parents[v])

Where does the graph come from?
• Genetics – pedigree (family tree)
• Physical, biological, social systems – supposed causal effects (e.g. regression models)
• Conditional independence provides basis for splitting large system into smaller components
[Figure: graph on nodes A, B, C, D, W, X, Y, Z, shown whole and then split into smaller components]

2.
Modelling

Building complex models
Key idea
• understand complex system
• through global model
• built from small pieces
  – comprehensible
  – each with only a few variables
  – modular

Example: Case study 1
• Epidemiological study of low birth weight and mothers’ exposure to water disinfection byproducts
• Background
  – Chlorine added to tap water supply for disinfection
  – Reacts with natural organic matter in water to form unwanted byproducts (including trihalomethanes, THMs)
  – Some evidence of adverse health effects (cancer, birth defects) associated with exposure to high levels of THM
  – SAHSU are carrying out study in Great Britain using routine data, to investigate risk of low birth weight associated with exposure to different THM levels

Data sources
• National postcoded births register
• Routinely monitored THM concentrations in tap water samples for each water supply zone within 14 different water company regions
• Census data – area level socioeconomic factors
• Millennium Cohort Study (MCS) – individual level outcomes and confounder data on sample of mothers
• Literature relating to factors affecting personal exposure (uptake factors, water consumption, etc.)

Model for combining data sources
[Figure: graph on nodes φ, THM[true]_zt, σ², THM[raw]_ztj, THM[mother]_ik, THM[mother]_im, β[T], β[c], y_ik, c_ik, θ_i, y_im, c_im]

Regression sub-model (MCS)
Regression model for MCS data relating risk of low birth weight (y_im) to mother’s THM exposure and other confounders (c_im)
Logistic regression:
  y_im ~ Bernoulli(p_im)
  logit p_im = β[c] c_im + β[T] THM[mother]_im
i indexes small area; m indexes mother
c_ik = potential confounders, e.g.
deprivation, smoking, ethnicity

Regression sub-model (national data)
Regression model for national data relating risk of low birth weight (y_ik) to mother’s THM exposure and other confounders (c_ik)
Logistic regression:
  y_ik ~ Bernoulli(p_ik)
  logit p_ik = β[c] c_ik + β[T] THM[mother]_ik
i indexes small area; k indexes mother

Missing confounders sub-model
Missing data model to estimate confounders (c_ik) for mothers in national data, using information on within area distribution of confounders in MCS
  c_im ~ Bernoulli(θ_i)  (MCS mothers)
  c_ik ~ Bernoulli(θ_i)  (predictions for mothers in national data)

THM measurement error sub-model
Model to estimate true tap water THM concentration from raw data
  THM[raw]_ztj ~ Normal(THM[true]_zt, σ²)
z = water zone; t = season; j = sample
(Actual model used was a more complex mixture of Normal distributions)

THM personal exposure sub-model
Model to predict personal exposure using estimated tap water THM
level and literature on distribution of factors affecting individual uptake of THM
  THM[mother] = ∑_k THM[true]_zt × quantity (φ_1k) × uptake factor (φ_2k)
where k indexes different water use activities, e.g. drinking, showering, bathing

3. Inference
Bayesian … or non Bayesian

Bayesian Full Probability Modelling
• Graphical approach to building complex models lends itself naturally to Bayesian inferential process
• Graph defines joint probability distribution on all the ‘nodes’ in the model
  Recall: joint distribution P(V) = ∏_v P(v | parents[v])
• Condition on parts of graph that are observed (data)
• Calculate posterior probabilities of remaining nodes using Bayes theorem
• Automatically propagates all sources of uncertainty
[Figure: the combined model graph, with nodes marked as Data or Unknowns]

4. Algorithms
• MCMC algorithms are able to exploit graphical structure for efficient inference
• Bayesian graphical models implemented in WinBUGS
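
To illustrate how an MCMC algorithm can exploit graphical structure, here is a minimal Gibbs sampler sketch in Python for a simplified version of the THM measurement error sub-model (raw samples distributed as Normal around the true level with variance sigma2). It is not the WinBUGS code used in the study, and the study's actual model was a mixture of Normals. The priors (Normal on the true level, Inverse-Gamma on the variance), the function name gibbs_measurement_error, all parameter values, and the simulated data are assumptions made for this example only. Each update draws one node from its full conditional, which depends only on that node's neighbours in the graph.

```python
import random
import statistics


def gibbs_measurement_error(raw, m0, s0_sq, a0, b0,
                            n_iter=5000, burn_in=1000, seed=1):
    """Gibbs sampler for raw_j ~ Normal(mu, sigma2), j = 1..n.

    Assumed priors (illustrative, not taken from the slides):
      mu     ~ Normal(m0, s0_sq)        # true THM level
      sigma2 ~ Inverse-Gamma(a0, b0)    # measurement variance
    Returns the post-burn-in draws of mu and sigma2.
    """
    rng = random.Random(seed)
    n = len(raw)
    total = sum(raw)
    # Start the chain at the sample estimates.
    mu = statistics.fmean(raw)
    sigma2 = statistics.variance(raw)
    mu_draws, sigma2_draws = [], []
    for it in range(n_iter):
        # Full conditional of mu: Normal, depending only on its
        # prior parameters, sigma2, and the raw samples.
        precision = 1.0 / s0_sq + n / sigma2
        cond_mean = (m0 / s0_sq + total / sigma2) / precision
        mu = rng.gauss(cond_mean, (1.0 / precision) ** 0.5)

        # Full conditional of sigma2: Inverse-Gamma(shape, rate).
        shape = a0 + n / 2.0
        rate = b0 + 0.5 * sum((y - mu) ** 2 for y in raw)
        # gammavariate takes (shape, scale), so scale = 1 / rate.
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)

        if it >= burn_in:
            mu_draws.append(mu)
            sigma2_draws.append(sigma2)
    return mu_draws, sigma2_draws


# Hypothetical raw THM samples for one zone and season
# (simulated for illustration, not real monitoring data).
sim = random.Random(42)
raw_samples = [sim.gauss(40.0, 5.0) for _ in range(30)]

mu_draws, sigma2_draws = gibbs_measurement_error(
    raw_samples, m0=50.0, s0_sq=100.0 ** 2, a0=0.01, b0=0.01
)
print("posterior mean of true THM level:", statistics.fmean(mu_draws))
print("posterior mean of sigma^2:", statistics.fmean(sigma2_draws))
```

In the full model the same idea would apply node by node: each unknown (true THM level, personal exposure, missing confounders, regression coefficients) would be updated given only its parents, children, and the children's other parents.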