
Missing Data & Measurement Error

Welcome to Rachel Whitaker

Term 3, 2008 Bio753—Advanced Methods in Biostatistics, III 1

Overview

• Missing data are inevitable
• Some missing data are “inherent”
• Prevention is better than statistical “cures”
• Too much missing information invalidates a study
• There are many methods for accommodating missing data
  – Their validity depends on the missing data mechanism and the analytic approach
• Issues can be subtle
• A little data on the missingness process can be helpful

Common types of missing data

• Survey non-response
• Missing dependent variables
• Missing covariates
• Dropouts
• Censoring
  – administrative, due to competing events or due to loss to follow-up
• Non-reporting or delayed reporting
• Noncompliance
• Measurement error

Implications of missing data

Missing data produce/induce

• Unbalanced data
• Loss of information and reduced efficiency
• Extent of information loss depends on
  – Amount of missingness
  – Missingness pattern
  – Association between the missing and observed data
  – Parameters of interest
  – Method of analysis

Care is needed to avoid biased inferences, i.e., inferences that target a reference population other than the one intended

• e.g., those who stay in the study

Inherent missingness

Right-censoring

• We know only that the event has yet to occur
  – Issue: “No news is no news” versus “no news is good news”

Latent disease state

• Disease Free (DF) / Latent Disease (LD) / Clinical Disease (CD)
  – Screen and discover latent disease
  – Only known that the transition DF → LD occurred before the screening time and that LD → CD has yet to occur

Missing Data Mechanisms

Little RJA, Rubin D. Statistical analysis with missing data. Chichester, NY: John Wiley & Sons; 2002

Missing Completely at Random (MCAR)
• Pr(missing) is unrelated to the process under study

Missing at Random (MAR)
• Pr(missing) depends only on observed data

Not Missing at Random (NMAR)
• Pr(missing) depends on both observed and unobserved data

These distinctions are important because the validity of an analysis depends on the missing data mechanism

Notation (for a missing dependent variable in a longitudinal study)

i indexes participant (unit), i = 1,…,n
j indexes measurement (sub unit), j = 1,…,J

• Potential response vector: Y_i = (Y_i1, Y_i2, …, Y_iJ)
• Response indicators: R_i = (R_i1, R_i2, …, R_iJ)
  – R_ij = 1 if Y_ij is observed and R_ij = 0 if Y_ij is missing
• Given R_i, Y_i can be partitioned into two components:
  – Y_i^O: observed responses
  – Y_i^M: missing responses

Schematic representation of the response vector and response indicators (* = missing)

Response vector
Patient   Y_1    Y_2    Y_3    …   Y_J
1         y_11   y_12   y_13   …   y_1J
2         y_21   *      y_23   …   y_2J
3         y_31   y_32   *      …   y_3J
…         …      …      …      …   …
n         y_n1   *      *      …   *

Response indicators
Patient   R_1   R_2   R_3   …   R_J
1         1     1     1     …   1
2         1     0     1     …   1
3         1     1     0     …   1
…         …     …     …     …   …
n         1     0     0     …   0

E.g., for patient 2:
Y_2 = (Y_21, Y_22, Y_23, …, Y_2J)
R_2 = (1, 0, 1, …, 1)
Y_2^O = (Y_21, Y_23, …, Y_2J)
Y_2^M = (Y_22)

More general missing data

• A similar notation can be used for missing regressors (X_ij) and for missing components of an even more general data structure
• Using “Y” to denote all of the potential data (regressors, dependent variable, etc.), the foregoing notation applies in general

Missing Data Mechanisms

• Some mechanisms are relatively benign and do not complicate or bias an analysis
• Others are not benign and can induce bias

Example

• Goal is to predict weight from gender and height
• Use information from Bio656 students
• Possible reasons for missing data
  – Absence from class
  – Gender-associated non-response
  – Weight-associated non-response

How would each of the above reasons affect results?

Missing Completely at Random (MCAR)

• Missingness is a chance mechanism that does not depend on observed or unobserved responses
  – R_i is independent of both Y_i^O and Y_i^M

Pr(R_i | Y_i^O, Y_i^M) = Pr(R_i)

• In the weight survey example, missingness due to absence from class is unlikely to be related to the relation between weight, height and gender
• The dataset can be regarded as a random sample from the target population (the full class, Bio620 over the years, ...)
• A complete-case analysis is appropriate, albeit with a drop in efficiency relative to obtaining more data

Missing Completely at Random (MCAR)

[Figure: scatterplot of weight vs. height by gender, with observed and missing points marked separately for males and females]

• The probability of having a missing value for variable Y is unrelated to the value of Y or to any other variables in the data set
• A complete-case analysis is appropriate

Missing at random (MAR)

• Missingness depends on the observed responses, but does not depend on what would have been measured but was not collected

Pr(R_i | Y_i^O, Y_i^M) = Pr(R_i | Y_i^O)

• The observed data are not a random sample from the full population
  – In the weight survey example, data are MAR if Pr(missing weight) depends on gender or height but not on weight
• Even though the data are not a random sample, the distribution of Y_i^M conditional on Y_i^O is the same as that in the reference population (the full class)
• Therefore, Y_i^M can be validly predicted using Y_i^O
  – Of course, validity depends on having a correct model for the mean and dependency structure of the observed data
• But we don’t need to make these predictions to get valid inferences

Missing at random (MAR)

• The probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis
• Analysis using the wrong model is not valid
  – e.g., uncorrelated regression, when correlation is needed

[Figure: scatterplot of weight vs. height by gender, with observed and missing points marked separately for males and females]

A complete-case analysis gives a valid slope when selection is on the predictors, BUT the correlation will be biased.

When the mechanism is MAR

• Complete-case methods and standard regression methods based on all the available data can produce biased estimates of mean response or trends
• If the statistical model for the observed data is correct, likelihood-based methods using only the observed data are valid
• This requires that the joint distribution of the observed Y_i is correctly specified,
  – when the mean and covariance are correct
  – when using a correct GEE working model
  – when using correct random effects

Ignorability

• With a correct model for the observed data, under MAR the details of the missing data mechanism are not needed; the mechanism is ignorable
  – Ignorability is not an inherent property of the mechanism
  – It depends on the mechanism and on the analytic model

Not missing at random (NMAR)

• Missingness depends on the responses that could have been observed

Pr(R_i | Y_i^O, Y_i^M) does depend on Y_i^M

• The observed data cannot be viewed as a random sample of the complete data
• The distribution of Y_i^M conditional on Y_i^O is not the same as that in the reference population (the full class)
• Y_i^M depends on Y_i^O and on Pr(R_i | Y_i^O, Y_i^M) and on Pr(Y)
• In the weight survey example, data are NMAR if missingness depends on weight

Missing Data Mechanisms: Not missing at random (NMAR)

• Also known as
  – Non-ignorable missing
• The probability of missing data on Y is related to the value of Y even if we control for other variables in the analysis
• A complete-case analysis is NOT valid
• Any analysis that does not take dependence on Y into account is not valid
• Inferences are highly model dependent

[Figure: scatterplot of weight vs. height by gender, with observed and missing points marked separately for males and females]
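The three mechanisms can be compared in a small simulation of the weight survey example. The sketch below is illustrative only: the height distribution, the regression W = 3 + 1.0 H + e with error SD 8, and the logistic missingness probabilities are assumptions chosen for the demonstration, and gender is left out for brevity.

```python
import math
import random


def expit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def complete_case_slope(mechanism, n=20000, seed=1):
    """Simulate weight on height, delete some weights, fit complete cases."""
    rng = random.Random(seed)
    heights, weights = [], []
    for _ in range(n):
        h = rng.gauss(170.0, 10.0)               # assumed height distribution
        w = 3.0 + 1.0 * h + rng.gauss(0.0, 8.0)  # assumed true regression
        if mechanism == "MCAR":
            p_miss = 0.5                         # unrelated to h and w
        elif mechanism == "MAR":
            p_miss = expit((h - 170.0) / 5.0)    # depends on observed height
        elif mechanism == "NMAR":
            p_miss = expit((w - 173.0) / 5.0)    # depends on the weight itself
        else:
            raise ValueError(mechanism)
        if rng.random() >= p_miss:               # weight is observed
            heights.append(h)
            weights.append(w)
    return ols_slope(heights, weights)


if __name__ == "__main__":
    for mech in ("MCAR", "MAR", "NMAR"):
        print(mech, round(complete_case_slope(mech), 3))
```

Under these assumptions the complete-case slope stays close to 1 when missingness is MCAR or depends only on height, and is pulled toward 0 when missingness depends on weight.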

MAR for Y vs X [Y | X]; NMAR for cor(X,Y)

[Figure: scatterplots of weight vs. height (cm) with fitted line; panels: initial analysis, analysis with missing data]

When the mechanism is NMAR

• Almost all standard methods of analysis are invalid
  – Valid inferences require joint modeling of the response and the missing data mechanism Pr(R_i | Y_i^O, Y_i^M)
• Importantly, assumptions about Pr(R_i | Y_i^O, Y_i^M) cannot be empirically verified using the data at hand
• Sensitivity analyses can be conducted (Dan Scharfstein’s research focus)
• Obtaining values for some of the missing Ys can inform on the missing data mechanism

Dropouts (if missing, missing thereafter)

Dropout Completely at Random

• Dropout at each occasion is independent of all past, current, and future outcomes
  – Is assumed for the Kaplan-Meier estimator and the Cox PHM

Dropout at Random

• Dropout depends on the previously observed outcomes up to, but not including, the current occasion
  – i.e., given the observed outcomes, dropout is independent of the current and future unobserved outcomes

Dropout Not at Random, “informative dropout”

• Dropout depends on current and future unobserved outcomes

Probability of a follow-up lung function measurement depends on smoking status and current lung function

Is the mechanism MAR?

We don’t know!

LUNG FUNCTION DECLINE IN ADULTS

Longitudinal dropout example

• Repeated measurements Y_it
  – i indexes people, i = 1,…,n
  – t indexes time, t = 1,…,5

\[ Y_{it} = \mu_{it} + e_{it}, \qquad \mu_{it} = \beta_0 + \beta_1 t \]
\[ \operatorname{cov}(e_{is}, e_{it}) = \sigma^2 \rho^{|s-t|}, \qquad \rho \ge 0 \]

• β_0 = 5, β_1 = 0.25, σ² = 1, ρ = 0.7

Longitudinal dropout example the dropout mechanism

• Dropout indicator D_i: D_i = k if person i drops out between the (k-1)st and kth occasion
• Assume that

\[ \log \frac{\Pr(D_i = k \mid D_i \ge k, Y_{i1}, \ldots, Y_{ik})}{\Pr(D_i > k \mid D_i \ge k, Y_{i1}, \ldots, Y_{ik})} = \theta_1 + \theta_2 Y_{i,k-1} + \theta_3 Y_{ik} \]

• Dropout is MCAR if θ_2 = θ_3 = 0
• Dropout is MAR if θ_3 = 0
• Dropout is NMAR if θ_3 ≠ 0
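The dropout example can be explored by simulation. The following is a minimal sketch, not the computation behind the slides: it uses the stated outcome model (intercept 5, slope 0.25, unit variance, ρ = 0.7) with AR(1) errors, and a logistic dropout hazard in which the outcomes are centered at a hypothetical constant `center` so that dropout rates stay moderate. That centering, the sample size, and the function name are assumptions.

```python
import math
import random


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def observed_means(theta, n=20000, seed=2, center=5.0):
    """Mean of the observed Y at each occasion t = 1..5 under logistic dropout.

    theta = (theta1, theta2, theta3). `center` is a hypothetical centering
    constant (an assumption, not from the slides).
    """
    b0, b1, rho, n_occ = 5.0, 0.25, 0.7, 5
    th1, th2, th3 = theta
    rng = random.Random(seed)
    sums = [0.0] * n_occ
    counts = [0] * n_occ
    for _ in range(n):
        # AR(1) errors with unit variance: cov(e_s, e_t) = rho ** |s - t|
        e = rng.gauss(0.0, 1.0)
        y = []
        for t in range(1, n_occ + 1):
            if t > 1:
                e = rho * e + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            y.append(b0 + b1 * t + e)
        # The first occasion is always observed; dropout can start at occasion 2
        sums[0] += y[0]
        counts[0] += 1
        for k in range(1, n_occ):
            hazard = expit(th1 + th2 * (y[k - 1] - center) + th3 * (y[k] - center))
            if rng.random() < hazard:
                break  # dropped out before this occasion; missing thereafter
            sums[k] += y[k]
            counts[k] += 1
    return [s / c for s, c in zip(sums, counts)]


if __name__ == "__main__":
    print("MCAR", [round(v, 2) for v in observed_means((-0.5, 0.0, 0.0))])
    print("MAR ", [round(v, 2) for v in observed_means((-0.5, 0.5, 0.0))])
    print("NMAR", [round(v, 2) for v in observed_means((-0.5, 0.0, 0.5))])
```

With θ_2 = θ_3 = 0 the observed means track the population line 5 + 0.25 t; with a positive θ_2 or θ_3, people with high outcomes leave preferentially and the observed means at later occasions fall below the line.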

Population Regression Line vs. Observed Data Means

[Figure: three panels plotting Y against time T = 1,…,5, comparing the population regression line with the observed-data means]
• MCAR (θ_1 = -0.5, θ_2 = θ_3 = 0)
• MAR (θ_1 = -0.5, θ_2 = 0.5, θ_3 = 0)
• NMAR (θ_1 = -0.5, θ_2 = 0, θ_3 = 0.5)

Analysis results

The true regression parameters are intercept = 5.0 and slope = 0.25, with ρ = 0.7

Dropout mechanism   Parameter   ML estimate (se)   GEE/OLS estimate (se)
MCAR                Intercept   5.015 (0.031)      5.022 (0.032)
MCAR                Slope       0.257 (0.016)      0.253 (0.018)
MAR                 Intercept   5.003 (0.041)      5.062 (0.043)
MAR                 Slope       0.261 (0.016)      0.182 (0.018)
NMAR                Intercept   5.058 (0.040)      5.071 (0.043)
NMAR                Slope       0.201 (0.016)      0.162 (0.018)

Misspecified GEE (when the truth is random intercepts and slopes)

[Figure: Y vs. time; panels: Complete Data (GEE), Partial Missing Data (GEE)]

Correctly specified Random Effects (when the truth is random intercepts and slopes)

[Figure: Y vs. time; panels: Complete Data (REM), Partial Missing Data (REM)]

The probability of dropping out depends on the observed history

One step at a time

There are 5 different “trajectories” with relative weights 2, 2, 1, 1, 2.
The OLS analysis has regressors 0, 1, 2 and dependent variables 0, ±Δ, ±2Δ.
The independent-increments analysis has a constant regressor “1” and so is just estimating the mean. The dependent variable is either +Δ or −Δ.

If the missing data process is MAR and if we use the correct model for the observed data, the missing data mechanism is “ignorable”

• In the foregoing example, computing first differences (current value – previous value) and averaging these differences gives an unbiased estimate (of 0) no matter how complicated the MAR missing data process
• We don’t have to know the details of the dropout process (it can be very complicated), as long as the probabilities depend only on what has been observed and not on what would have been observed
• Ignorability depends on using the correct model for the observed data (mean and dependency structure)
• If the errors were independent (rather than the first differences), then standard OLS would be unbiased

Analytic Approaches

Complete Case Analysis

• Global complete case analysis
• Individual model complete case analysis
• Augment with missing data indicators
  – primarily for missing Xs
• Weighting
• Imputation
  – Single
  – Multiple
• Likelihood-based (model-based) methods

Analytic Approaches

Global complete-case analysis (use only data for people with fully complete data)

• Biased, unless the dropout is MCAR
• Even if MCAR is true, can be immensely inefficient

Analyze available data (use data for people with complete data on the regressors in the current model)

• More efficient than complete-case methods, because it uses all of the available data
• Biased unless the dropout is MCAR
• Can produce floating datasets, producing “illogical” conclusions
  – R² relations are not monotone

Use missing data indicators (e.g., create new covariates)

Weighting

• Stratify samples into J weighting classes
  – Zip codes
  – Propensity score classes
• Weight the observed data inversely according to the response rate of the stratum
  – Lower response rate → higher weight
• Unbiased if the observed data are a random sample in a weighting class (a special form of the MAR assumption)
• Biased if respondents differ from non-respondents in the class
• Difficult to estimate the appropriate standard error because weights are estimated from the response rates

Simple example of weighting adjustment

• Estimate the average height of villagers in two villages
• Surveys sent to 10% of the population in both villages

                    Village A   Village B
# villagers         1000        1000
# surveys sent      100         100
# providing data    100         50
Avg height          1.7m        1.4m

• Direct, unweighted: 1.7*(2/3) + 1.4*(1/3) = 1.60m
• Weighted: 100*1.7*0.005 + 50*1.4*0.01 = 1.55m (= 1.7*0.5 + 1.4*0.5)
  – Each Village B respondent gets 2 × the weight of a Village A respondent
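The village arithmetic can be checked with a few lines of code. This is a small sketch using the numbers from the example; the dictionary layout and function names are illustrative choices, not part of the original material.

```python
def unweighted_mean(strata):
    """Average over all respondents, ignoring differences in response rate."""
    total = sum(s["respondents"] * s["mean"] for s in strata)
    count = sum(s["respondents"] for s in strata)
    return total / count


def weighted_mean(strata):
    """Weight each respondent by the inverse of the stratum response rate."""
    num = 0.0
    den = 0.0
    for s in strata:
        weight = s["sent"] / s["respondents"]  # 1 / response rate
        num += weight * s["respondents"] * s["mean"]
        den += weight * s["respondents"]
    return num / den


villages = [
    {"name": "A", "sent": 100, "respondents": 100, "mean": 1.7},
    {"name": "B", "sent": 100, "respondents": 50, "mean": 1.4},
]

if __name__ == "__main__":
    print(round(unweighted_mean(villages), 2))
    print(round(weighted_mean(villages), 2))
```

Because both villages have the same size and the same sampling fraction, weighting by the inverse response rate restores the equal contribution of each village.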

Single Imputation

• Fill in missing values with imputed values
• Once a filled-in dataset has been constructed, standard methods for complete data can be applied

Problem

• Fails to account for the uncertainty inherent in the imputation of the missing data
• Don’t use it!

Multiple Imputation

Rubin 1987, Little & Rubin 2002

• Multiply impute “m” pseudo-complete data sets
  – A small number of imputations (e.g., 5 ≤ m ≤ 10) is generally sufficient
• Combine the inferences from each of the m data sets
• Acknowledges the uncertainty inherent in the imputation process
• Equivalently, the uncertainty induced by the missing data mechanism

Rubin DB. Multiple Imputation for Nonresponse in Surveys, Wiley, New York, 1987

Little RJA, Rubin D. Statistical analysis with missing data. Chichester, NY: John Wiley & Sons; 2002


Multiple Imputation: Combining Inferences

• Combine m sets of parameter estimates to provide a single estimate of the parameter of interest
• Combine uncertainties to obtain valid SEs
• In the following, “k” indexes imputation

\[ \bar{\beta} = \frac{1}{m} \sum_{k=1}^{m} \hat{\beta}^{(k)} \]

\[ \operatorname{Var}(\bar{\beta}) = \underbrace{\frac{1}{m} \sum_{k=1}^{m} \operatorname{Var}\bigl(\hat{\beta}^{(k)}\bigr)}_{\text{within-imputation variance}} + \underbrace{\Bigl(1 + \frac{1}{m}\Bigr) \frac{1}{m-1} \sum_{k=1}^{m} \bigl(\hat{\beta}^{(k)} - \bar{\beta}\bigr)^{2}}_{\text{between-imputation variance}} \]

This computation is correct for fully efficient estimators.

Multiple Imputation: Combining Inferences

For a vector of parameters, the same rules apply with covariance matrices (“k” indexes imputation):

\[ \bar{\beta} = \frac{1}{m} \sum_{k=1}^{m} \hat{\beta}^{(k)} \]

\[ \operatorname{Cov}(\bar{\beta}) = \underbrace{\frac{1}{m} \sum_{k=1}^{m} \operatorname{Cov}\bigl(\hat{\beta}^{(k)}\bigr)}_{\text{within-imputation covariance}} + \underbrace{\Bigl(1 + \frac{1}{m}\Bigr) \frac{1}{m-1} \sum_{k=1}^{m} \bigl(\hat{\beta}^{(k)} - \bar{\beta}\bigr)\bigl(\hat{\beta}^{(k)} - \bar{\beta}\bigr)'}_{\text{between-imputation covariance}} \]
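The scalar combining rule can be sketched in a few lines. The function name `pool_rubin` and the example numbers are illustrative assumptions; the arithmetic follows the within-imputation plus (1 + 1/m) times between-imputation variance formula.

```python
def pool_rubin(estimates, variances):
    """Combine m point estimates and their variances across imputations.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")
    beta_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((b - beta_bar) ** 2 for b in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return beta_bar, total


if __name__ == "__main__":
    # Hypothetical slope estimates and squared SEs from m = 3 imputed data sets
    est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(est, var)
```

The pooled standard error is the square root of the returned total variance.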

Producing the Imputed Values

Last value carried forward (LVCF)

• Single imputation (never changes)
• Assumes the responses following dropout remain constant at the last observed value prior to dropout
• Unrealistic unless, say, due to recovery or cure
• Underestimates SEs

Hot deck

• Randomly choose a fill-in from outcomes of “similar” units
• Distorts the distribution less than imputing the mean or LVCF
• Underestimates SEs

Valid Imputation

Build a model relating observed outcomes

• Means and covariances and random effects, ...
• Goal is prediction, so be liberal in including predictors
• Don’t use P-values; don’t use step-wise selection
• Do use multiple R², prediction sums of squares, cross-validation, ...

Producing Imputed Values

Sample values of Y_i^M from pr(Y_i^M | Y_i^O, X_i)

• Can be straightforward or difficult
• Monotone case: draw values of Y_i^M from pr(Y_i^M | Y_i^O, X_i) in a sequential manner
• Valid when dropouts are MAR or MCAR

Propensity Score Method

• Imputed values are obtained from observations on people who are equally likely to drop out as those lost to follow-up at a given occasion
• Requires a model for the propensity (probability) of dropping out, e.g.,

\[ \log \frac{\Pr(D_i = k \mid D_i \ge k, Y_{i1}, \ldots, Y_{ik})}{\Pr(D_i > k \mid D_i \ge k, Y_{i1}, \ldots, Y_{ik})} = \theta_1 + \theta_2 Y_{i,k-1} \]

Producing Imputed Values

Recall that “Y” is all of the data, not just the dependent variable

Predictive Mean Matching (build a regression model!)

A series of regression models for Y_ik, given Y_i1, …, Y_i(k-1), are fit using the observed data on those who have not dropped out by the kth occasion. For example,

\[ E(Y_{ik}) = \gamma_1 + \gamma_2 Y_{i1} + \cdots + \gamma_k Y_{i(k-1)}, \qquad V(Y_{ik}) = \sigma^2 \]

1. Parameters γ* and σ²* are drawn from the distribution of the estimated parameters (to account for the uncertainty in the estimated regression)
2. Missing values can then be predicted from γ_1* + γ_2* Y_i1 + … + γ_k* Y_i(k-1) + σ* e_i, where e_i is simulated from a standard normal distribution
3. Repeat 1 and 2

Missing, presumed at random

Cost-analysis with incomplete data*

• Estimate the difference in cost between transurethral resection (TURP) and contact-laser vaporization of the prostate (Laser)
• 100 patients were randomized to one of the two treatments
  – TURP: n = 53; Laser: n = 47
• 12 categories of medical resource usage were measured
  – e.g., GP visit, transfusion, outpatient consultation, etc.

* Briggs A et al. Health Economics. 2003; 12, 377-392

Missing data

                                           TURP (n = 53)   Laser (n = 47)   Total (n = 100)
Patients with no missing resource counts   34 (59%)        21 (51%)         55 (55%)
Observed resource counts                   570 (90%)       510 (90%)        1080 (90%)

Complete-case analysis uses only about half of the patients in the study even though 90% of the resource usage data were available

Comparison of inferences

Note that mean imputation understates uncertainty.


Multiple Imputation versus likelihood analysis when data are MAR

• Both multiple imputation and use of a valid statistical model for the observed data (likelihood analysis) are valid
  – The model-based analysis will be more efficient, but more complicated
• Validity of each depends on correct modeling to produce/induce ignorability

What if you doubt the MAR assumption (you should always doubt it!)

You can never empirically rule out NMAR

• Methods for NMAR exist, but they require information and assumptions on pr(Missing | observed, unobserved)
• Methods depend on unverifiable assumptions
• Sensitivity analysis can assess the stability of findings under various scenarios
  – Set bounds on the form and strength of the dependence
  – Evaluate conclusions within these bounds

MEASUREMENT ERROR

If a covariate (X) is measured with error, what is the implication for regression of Y on X?

See also “Air” and “Cervix” in volume II of the BUGS examples

Measurement Error Another type of missing data

• Measurement error is a special case of missing data because we do not get to “observe the true value” of the response or covariates
• Depending on the measurement error mechanism and on the analysis, inferences can be
  – inefficient (relative to no measurement error)
  – biased
• Differential attenuation across studies complicates “exporting” and synthesizing

The two “Pure Forms”

relating X_t (true) & X_o (observed)

Classical: X_o = X_t + ε, ε ~ (0, σ_ε²)

What you see is a random deviation from the truth
• Measured & true blood pressure
• Measured & true social attitudes

Berkson: X_t = X_o + ε

The truth is a random deviation from what you see
• Individual SES measured by ZIP-code SES
• Personal air pollution measured by centrally monitored value
• Actual temperature & thermostat setting

Hybrids are possible

X t and X o have a general joint distribution


Measurement error’s effect on a simple regression coefficient

Classical

• The regression coefficient on X_o is attenuated towards 0 relative to the “true” regression coefficient on X_t
• This is because the spread of X_o is greater than that of X_t

Berkson

• No effect on the expected regression coefficient
• Variance inflation

Berkson

X_t = X_o + ε, ε ~ (0, σ_ε²)

true: Y = int + β X_t + resid = int + β (X_o + ε) + resid
observed: Y = int + β* X_o + resid

• Var(X_o) = σ_o²

No attenuation: β* = β because E(X_t | X_o) = X_o

Classical

X_o = X_t + ε, ε ~ (0, σ_ε²)

true: Y = int + β X_t + resid
observed: Y = int + β* X_o + resid = int + β* (X_t + ε) + resid

• Var(X_o) = σ_t² + σ_ε² (X_o is stretched out)

Attenuation (attenuation factor λ):

\[ \beta^{*} = \lambda \beta, \qquad \lambda = \frac{\sigma_t^{2}}{\sigma_t^{2} + \sigma_\varepsilon^{2}} \]

slope = cov(Y, X)/Var(X), but E(X_t | X_o) = λ X_o
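The contrast between the two pure forms can be checked by simulation. This is a minimal sketch under assumed values (true slope 1, unit variances for the covariate, the measurement error and the residual); the function names are illustrative.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x: cov(y, x) / var(x)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(beta=1.0, sd_x=1.0, sd_err=1.0, n=20000, seed=3):
    """Return (classical slope, Berkson slope, attenuation factor)."""
    rng = random.Random(seed)
    # Classical: the observed value is the truth plus noise, X_o = X_t + eps
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    x_obs = [x + rng.gauss(0.0, sd_err) for x in x_true]
    classical = ols_slope(x_obs, y)
    # Berkson: the truth is the observed value plus noise, X_t = X_o + eps
    x_obs_b = [rng.gauss(0.0, sd_x) for _ in range(n)]
    x_true_b = [x + rng.gauss(0.0, sd_err) for x in x_obs_b]
    y_b = [beta * x + rng.gauss(0.0, 1.0) for x in x_true_b]
    berkson = ols_slope(x_obs_b, y_b)
    lam = sd_x ** 2 / (sd_x ** 2 + sd_err ** 2)
    return classical, berkson, lam


if __name__ == "__main__":
    classical, berkson, lam = simulate_slopes()
    print(round(classical, 3), round(berkson, 3), lam)
```

With equal variances the attenuation factor is 0.5, so the classical-error slope should land near half the true slope while the Berkson slope stays near the true value.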

[Figure: Y versus X_t]

[Figure: Y versus X_o]

An illustration

Back to the basic example

• W = Weight (lb)
• H = Height (cm)
• Analysis: simple linear regression
  W_i = β_0 + β_1 H_i + e_i, where e_i ~ N(0, σ²)

Assume the true model to be:
  W_i = 3 + 1.0 H_i + e_i, where e_i ~ N(0, 8²)

Measurement error

1. Error in W: observe W* = W + e_i*, where e_i* ~ N(0, 4²)
2. Error in H: observe H* = H + δ_i*, where δ_i* ~ N(0, 10²)

Scenario 1: Measurement Error in Response

Results:

              Estimate of β_1   SE
No error      1.16              0.15
With error    1.08              0.18

• The standard regression estimate for β_1 is unbiased, but less efficient
• The larger the measurement error, the greater the loss in efficiency

Scenario 2: measurement error in H

Results:

              Estimate of β_1   SE
No error      1.16              0.15
With error    0.69              0.21

• The standard regression estimate for β_1 is biased (attenuated)
• The larger the measurement error, the greater the attenuation

Multivariate Measurement Error

X_o = X_t + ε, ε ~ (0, Σ_ε)

The Multiple Imputation Algorithm in SAS

The MIANALYZE Procedure

– Combines the m different sets of parameter and variance estimates from the m imputations
– Generates valid inferences about the parameters of interest

PROC MIANALYZE <options>;
   BY variables;
   VAR variables;

Multiple Imputation Algorithm in SAS

PROC MI <options>;
   BY variables;
   FREQ variable;
   MULTINORMAL <options>;
   VAR variables;

• Available options in PROC MI include:
   NIMPU=number (default=5)
• Available options in the MULTINORMAL statement:
   METHOD=REGRESSION
   METHOD=PROPENSITY<(NGROUPS=number)>
   METHOD=MCMC<(options)>
• The default is METHOD=MCMC