MISSING VALUES
- What proportion?
- Large or small?
- Random or non-random?
- What pattern do the missing values follow?
Missing values
A simple solution: discard the "subjects" with incomplete responses ("available-case analysis" or "complete-case analysis")
- inefficient use of the information
- complete cases may be very different from the others
- Generalizability?
- Representativeness?
Missing values
Classification
Example: 2 variables, Y = income, X = age
- Missing completely at random (MCAR): the missing data are a representative sample of the complete data set
- The probability that income is recorded is the same for all individuals
Missing values
Classification
Example: 2 variables, Y = income, X = age
- Missing at random (MAR): the probability that a value is missing depends on the values of the measured variables
- The probability that income is recorded depends on the respondents' age but, within age groups, does not vary with the respondents' income
Missing values
Classification
Example: 2 variables, Y = income, X = age
- Missing not at random (MNAR): whether a value of a variable is missing depends on the true but unobserved value of that variable
- This is the case if the probability that income is recorded also varies with income within age groups
Missing values
Classification
- MCAR and MAR = "ignorable" missingness
- MNAR = "non-ignorable" missingness
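The three mechanisms can be illustrated with a small simulation of the income/age example. This is only a sketch: the income model, the missingness probabilities and the function name `simulate` are assumptions made for illustration, not values taken from the slides.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (mean income of everyone, mean income of complete cases)."""
    rng = random.Random(seed)
    incomes, observed = [], []
    for _ in range(n):
        age = rng.uniform(20, 70)
        # Assumed model: income rises with age, plus noise.
        income = 1000 + 40 * age + rng.gauss(0, 500)
        if mechanism == "MCAR":
            p_missing = 0.3                              # same for everyone
        elif mechanism == "MAR":
            p_missing = 0.5 if age > 45 else 0.1         # depends on observed age only
        elif mechanism == "MNAR":
            p_missing = 0.5 if income > 2800 else 0.1    # depends on income itself
        else:
            raise ValueError(mechanism)
        incomes.append(income)
        if rng.random() >= p_missing:
            observed.append(income)
    return statistics.fmean(incomes), statistics.fmean(observed)


for mechanism in ("MCAR", "MAR", "MNAR"):
    full_mean, complete_case_mean = simulate(mechanism)
    print(mechanism, round(full_mean), round(complete_case_mean))
```

Under MCAR the complete-case mean stays close to the full mean. Under MAR and MNAR it is pulled downward in this setup; under MAR the bias can in principle be corrected using age, because missingness does not depend on income within age groups.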
MISSING VALUES
Methods of analysis
Two broad types of approach:
- Imputation
- Likelihood-based methods (Expectation-Maximization algorithm): maximum-likelihood estimation of the parameters from the incomplete data
Methods of analysis
Main difference between the two approaches:
- imputation fills in the missing values
- the likelihood-based approach does not estimate the missing values explicitly, but it requires a model to be specified, and software is less readily available for some analyses
With large samples, the two methods give similar results; with small samples, is MI superior?
Missing values
IMPUTATION:
- Single imputation
- Multiple imputation
MISSING VALUES
Single imputation
- Value based on prior knowledge
- Mean of the observations available for other subjects with identical characteristics
- Values predicted by regression or by stochastic regression (missing values replaced by predicted values plus residuals, to reflect the uncertainty in the predicted value)
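The mean, regression and stochastic-regression variants can be compared on simulated data. The sketch below uses only the Python standard library; the data-generating model, the 40% missingness rate and the helper names `fit_line` and `impute` are illustrative assumptions.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; also returns the residual SD."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd


def impute(xs, ys, method, rng):
    """Return a completed copy of ys (None marks a missing value)."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox, oy = zip(*observed)
    a, b, s = fit_line(ox, oy)
    mean_y = statistics.fmean(oy)
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
        elif method == "mean":
            completed.append(mean_y)
        elif method == "regression":
            completed.append(a + b * x)
        elif method == "stochastic":
            completed.append(a + b * x + rng.gauss(0, s))  # prediction + residual
        else:
            raise ValueError(method)
    return completed


rng = random.Random(1)
ages = [rng.uniform(20, 70) for _ in range(2000)]
incomes = [1000 + 40 * a + rng.gauss(0, 500) for a in ages]
incomes_missing = [y if rng.random() > 0.4 else None for y in incomes]

for method in ("mean", "regression", "stochastic"):
    completed = impute(ages, incomes_missing, method, rng)
    print(method, round(statistics.stdev(completed), 1))
```

In this setup mean imputation shrinks the spread of the completed variable the most, regression imputation less so, and adding a random residual restores most of the original variability.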
Missing values
Single imputation
- Hot deck: the imputed value is drawn from the distribution estimated for each missing value
- Cold deck: replace a missing value with a constant value taken from an external source (e.g., an earlier study)
Single imputation
- Longitudinal study: last observation carried forward (LOCF)
- Substitution: replace selected units with others not selected in the sample (experimental stage)
…
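For the longitudinal case, last observation carried forward can be sketched in a few lines. The subject identifiers and measurements below are hypothetical.

```python
def locf(series):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# One list of repeated measurements per subject (None = missed visit).
visits = {
    "subject_1": [120, 118, None, None],
    "subject_2": [None, 130, None, 127],
}
completed = {subject: locf(values) for subject, values in visits.items()}
print(completed)
```

A value missing before the first observation has nothing to carry forward and is left missing in this sketch.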
MISSING VALUES
Single imputation: problems
- Prior knowledge: acceptable if the number of missing values is small and the researcher is experienced
- Analysing the completed database as if the added values were real measurements ignores the uncertainty due to the imputation process
- Standard errors are generally underestimated
MISSING VALUES
Multiple imputation (MI)
- Does not simply add values
- Analysis of several "complete" data sets
- Simulations: M = 3 repeated imputations is sufficient with 20% missing
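One way to see why a small M can be enough is Rubin's relative-efficiency approximation, (1 + gamma/M)^-1, where gamma is the fraction of missing information. Treating the 20% of missing values as gamma = 0.2 is an assumption made here for illustration; the fraction of missing information is not the same quantity as the percentage of missing values.

```python
def relative_efficiency(m, gamma):
    """Efficiency of m imputations relative to infinitely many (Rubin's approximation)."""
    return 1.0 / (1.0 + gamma / m)


for m in (3, 5, 10):
    print(m, round(relative_efficiency(m, 0.2), 4))
```

With gamma = 0.2, three imputations already reach about 94% efficiency, and going from 5 to 10 imputations gains about two percentage points.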
Missing values
Multiple imputation
- Unless the percentage of missing values is very large, there is little benefit beyond 10 imputations; 5 imputations are recommended
- Adjusts the statistics to account for the uncertainty due to imputation
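The adjustment is usually made with Rubin's combining rules: the pooled estimate is the mean of the M estimates, and its variance adds the between-imputation variance (inflated by 1 + 1/M) to the average within-imputation variance. A minimal sketch, with the function name `pool` and the numbers chosen only for illustration:

```python
import statistics


def pool(estimates, variances):
    """Combine M point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor m - 1)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


# Hypothetical estimates from M = 3 completed data sets.
estimate, variance = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, variance, variance ** 0.5)
```

The pooled standard error is larger than the one from any single completed data set whenever the estimates differ across imputations, which is how the uncertainty due to imputation enters the analysis.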
Missing values
Remark
- The methods chosen to handle missing values in clinical trials affect the sample-size calculations
MISSING VALUES
Some situations
- Analyses with standard models
- Clinical trials
- Longitudinal studies
- …
Missing values
Example 1
Developing a prognostic model in the presence of missing data: an ovarian cancer case study
Taane G. Clark, Douglas G. Altman
Journal of Clinical Epidemiology 56 (2003) 28–37
- Missing values for 8 of the 10 potential prognostic factors: 2–43%
- Survival times known
Missing values
Example 1 - steps of the procedure
1. Investigating the missing data
a. Quantifying the multivariate patterns of the missing data.
b. Plotting the proportion of missing data for each potential prognostic factor against diagnosis year to show time trends in measurement practice.
Missing values
Example 1 - steps of the procedure
1. Investigating the missing data
c. Exploring the relationship between missing data of potential prognostic factors with other prognostic variables, survival information [i.e., (log) survival time and the censoring indicator], and auxiliary variables.
Missing values
Example 1 - steps of the procedure
2. Specifying an imputation model.
3. Using the model to generate (via a random sampling procedure) M sets of imputed values for the missing data points, thus creating M completed datasets.
Missing values
Example 1 - steps of the procedure
4. For each completed dataset, carrying out a Cox regression, obtaining the estimate of interest and its estimated variance.
5. Combining the results from the different datasets to obtain a prognostic model.
Missing values
Example 1 - steps of the procedure
6. Constructing a final "completed data" model (Model 2) by removing the covariate with the highest P-value and repeating steps 4 and 5 until all remaining covariates were significant at the 5% level (backward elimination).
Missing values
Example 1
- Step 1: missing data = MAR
- Steps 2 and 3 = Bayesian simulation
- Step 3: number of repeated imputations = 10
Missing values
Example 1 - Step 1 - Missing-data pattern
Prognostic variable        N (%)
Grade
  Unknown                  139 (11.7)
Ascites
  Presence                 707 (59.5)
  Absence                  417 (35.1)
  Unknown                  65 (5.5)
Alkaline phosphatase       793 (66.7)
Missing values
Example 1 - Step 1 - Missing-data pattern
- The number of patients contributing to a complete case analysis using all the prognostic factors would be 358 (245 deaths).
- Plots of the proportion of missing data by diagnosis year show that the proportions for ascites, alkaline phosphatase, albumin, grade, and residual disease were constant.
Missing values
Example 1 - Step 1 - Missing-data pattern
- The proportion of missing CA125 data decreased linearly in time from 85 to 21% between 1984 and 1999.
- The proportion of missing performance status had an increasing trend in time with a minimum of 18% in 1986 and a maximum of 71% in 1995.
Missing values
Example 1 - Step 1 - Evidence of MAR data
- An analysis of the survival distributions of non-missing and missing strata within each of the factors (log) CA125, grade, FIGO stage, and performance status showed no visual or statistical evidence of significant differences.
Missing values
Example 1 - Step 1 - Evidence of MAR data
- Difference between the survival distributions of patients with and without missing data for ascites (P = .002), albumin (P = .003), alkaline phosphatase (P = .020) and residual disease (P = .020)
Missing values
Example 1 - Step 1 - Evidence of MAR data
- Those patients missing albumin and alkaline phosphatase results had a better prognosis, suggesting that eliminating the patients with missing values would lead to an underestimate of the true survival of the cohort. The opposite effect was seen for ascites and residual disease.
Missing values
Example 1 - Step 1 - Evidence of MAR data
- The univariate logistic models indicated that histology and clinical trial participation were associated with the missingness of all but one prognostic variable.
Missing values
Example 1 - Steps 2 to 5 - Imputation
- We completed 10 data sets by imputing 2,045 values in each. As a consequence, 6,265 additional real data values were incorporated into each dataset.
Missing values
Example 1 - Step 2 - Imputation model
- For binary variables (e.g., the presence or absence of ascites) we used a logistic model.
- For categorical variables with three or more ordered levels (e.g., performance status) we applied a polytomous (> 2 levels) logistic model.
Missing values
Example 1 - Step 2 - Imputation model
- For continuous variables (e.g., log CA125) we used normal linear regression, truncated where appropriate to the credible range of values.
Missing values
Example 1 - Steps 2 to 5 - Imputation
- The prevalences (%) of categorical prognostic factors in the original data (ignoring missing data) were consistent with those from the 10 imputations.
Missing values
Example 1 - Steps 2 to 5 - Imputation
                      Original         Completed (a)
Prognostic factor     #      %         Median   Range     Overall %
Grade
  I                   131    12.5      149      144–153   12.5
  II                  278    26.5      315      310–321   26.5
  III                 641    61.0      724      716–732   60.9
  Unknown             139    0         -        -         -
Ascites
  Presence            707    62.9      750      747–752   63.0
  Absence             417    37.1      440      437–442   37.0
  Unknown             65     0         -        -         -
(a) 10 datasets with original data augmented by imputed missing values.
Missing values
Example 1 - Steps 2 to 5 - Imputation
- The median and range of albumin, log CA125, and alkaline phosphatase in the original data were consistent with the median of the medians of the 10 imputation distributions and the extreme values of these distributions, respectively.
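Missing values
Example 1 - Steps 2 to 5 - Imputation
Original and completed data: median and range (the two sets of values are shown as they appear on the slide, one in parentheses; the slide layout does not show which set is "Original" and which is "Completed")
Prognostic factor    Median    Range          Median    Range
Log CA125            5.16      1.79–10.04     (5.34)    (1.79–10.04)
Albumin              39.0      20.0–50.0      (39.0)    (20.0–50.0)
Log Alk. Phos.       4.54      3.26–7.50      (4.54)    (3.26–7.50)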
Missing values
Example 1 - Steps 2 to 5 - Imputation
- The narrow ranges of imputation values for each potential prognostic variable coincide with the visual impression that the distributions for each of the potential prognostic factors in the 10 imputed datasets were similar.
Missing values
Example 1 - Step 6 - Fitting the Cox models
- Model 1: as four factors, each with missing values, were found not to be prognostic, the analysable dataset was 518 (380 deaths).
- Model 2: pooled analysis using 10 complete datasets with imputed missing values.
- Grade and ascites were statistically significant in Model 2, but not in Model 1.
Missing values
Example 1 - Step 6 - Fitting the Cox models
- A complete case analysis based on Model 2 would include only 449 patients (319 deaths).
- The confidence limits are narrower in the augmented data, especially for those with fewer missing observations in the original dataset.
Example 1 - Step 6 - Fitting the Cox models
- The models applied to completed data (i.e., the 10 datasets with imputed missing values)
  - had better calibration (i.e., greater ability to produce unbiased estimates of outcome)
  - had superior discrimination (i.e., improved ability to provide accurate predictions for individual patients)
- There was little difference between the discrimination measures of Model 1 and Model 2 when applied to the completed data.
Example 1 - Conclusion
- Most data are multivariate in nature, so a small proportion of missing data for several variables can lead to a severely depleted complete case analysis.
- MI seems appropriate in this setting if the original dataset is not too small.
Missing values
Example 1 - Conclusion
- Using imputed data we are incorporating patients that are removed merely because one or more of their prognostic factors are missing and, as a result, increasing power and adding precision to an analysis.
- Our approach may be viewed as a sensitivity analysis, and ultimately we need to use judgement about the plausibility of assumptions in a particular situation to assess which is the primary analysis.
Missing values
Example 2: a longitudinal study
Attrition in longitudinal studies: How to deal with missing data
Jos Twisk, Wieke de Vente
Journal of Clinical Epidemiology 55 (2002) 329–337
Missing values
Example 2 - Conclusion
- When MANOVA for repeated measurements is used to analyze a longitudinal dataset with missing data, imputation methods to replace these missing data are highly recommended (because MANOVA, as implemented in the software used (SPSS), uses listwise deletion of cases with a missing value).
Missing values
Example 2 - Conclusion
- When GEE is used to analyze a longitudinal dataset with missing data, not imputing at all may be better than any of the imputation methods applied.
- If one chooses to impute missing values, longitudinal methods are generally preferred to cross-sectional methods.
Missing values
Example 2 - Conclusion
- Using the more refined multiple imputation method to impute missing values did not lead to different point estimates than the single imputation techniques.
- The estimated standard errors were higher than the ones obtained from the complete dataset, which seems to be theoretically justified, because they reflect uncertainty in estimation caused by missing values.
Missing values
Example 2 - Limitations
- Specific observational longitudinal dataset
- Four missing data scenarios
- Limited number of imputation techniques
- Missingness dependent on the outcome variable
- Two statistical methods
- Less advanced multiple imputation estimation procedures
Missing values
Example 3 - A clinical trial
Excerpt from "Multiple imputation: a primer", J.L. Schafer
Statistical Methods in Medical Research, 1999; 8 (1): 3-15
MISSING VALUES
Software
- Routines for STATA
  http://www.stat.harvard.edu/~barnard/
- S-PLUS
- SAS
- NORM (free on the Internet; Schafer, 1999)
- SOLAS™ for Missing Data Analysis and Multiple Imputation
  http://www.statsol.ie/solas/solas.htm
Missing values
And SPSS?
MVA module
- Pattern of the missing values
- Substitution methods:
  - Regression
  - EM
Univariate Statistics

                                             Missing            No. of Extremes (a)
Variable    N       Mean      Std. Deviation Percent   Count    High    Low
poidbebe    2081    2.9353    .67960         1.7       35       29      46
hemog1      1302    6.8703    1.04820        38.5      814      22      109
agem        1995    25.06     7.115          5.7       121      13      0
perbg       1939    24.486    2.5238         8.4       177      27      41
baude       1916    17.062    1.6391         9.5       200      39      5
parite      2083    2.66      2.950          1.6       33       48      0
gead        1252                             40.8      864
gretum      2080                             1.7       36

a. Number of cases outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
Separate Variance t Tests (a)

Indicator   Statistic        poidbebe   hemog1    agem      perbg     baude     parite
hemog1      t                -1.8       .         -3.9      2.9       -19.3     -4.2
            df               1598.7     .         1301.1    1240.3    949.7     1600.5
            # Present        1302       1302      1301      1297      1300      1302
            # Missing        779        0         694       642       616       781
            Mean(Present)    2.9145     6.8703    24.60     24.605    16.568    2.45
            Mean(Missing)    2.9699     .         25.93     24.246    18.106    3.01
agem        t                -2.5       .         .         2.3       -11.9     -2.7
            df               120.8      .         .         73.0      68.3      123.3
            # Present        1968       1301      1995      1870      1854      1971
            # Missing        113        1         0         69        62        112
            Mean(Present)    2.9246     6.8694    25.06     24.511    17.001    2.62
            Mean(Missing)    3.1212     8.0000    .         23.794    18.903    3.40
perbg       t                -1.9       -2.2      -1.8      .         -6.3      -1.9
            df               173.8      4.1       139.0     .         32.4      180.7
            # Present        1922       1297      1870      1939      1883      1926
            # Missing        159        5         125       0         33        157
            Mean(Present)    2.9252     6.8682    24.98     24.486    17.012    2.63
            Mean(Missing)    3.0569     7.4000    26.22     .         19.909    3.11
baude       t                -2.0       -1.5      -2.6      .2        .         -2.8
            df               206.5      1.0       159.1     55.9      .         212.4
            # Present        1900       1300      1854      1883      1916      1904
            # Missing        181        2         141       56        0         179
            Mean(Present)    2.9247     6.8697    24.94     24.489    17.062    2.61
            Mean(Missing)    3.0457     7.2500    26.67     24.373    .         3.25
gead        t                -2.1       -.5       -4.3      2.1       -18.5     -4.4
            df               1717.9     86.5      1449.5    1387.7    1073.0    1748.4
            # Present        1252       1226      1250      1245      1248      1251
            # Missing        829        76        745       694       668       832
            Mean(Present)    2.9094     6.8667    24.52     24.578    16.560    2.43
            Mean(Missing)    2.9744     6.9276    25.96     24.320    17.999    3.01

For each quantitative variable, pairs of groups are formed by indicator variables (present, missing).
a. Indicator variables with less than 5% missing are not displayed.
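Each cell of the separate-variance table compares one quantitative variable between cases where another variable is present and cases where it is missing. A minimal sketch of that comparison, using the Welch (unequal-variance) t statistic with Satterthwaite degrees of freedom; the helper names and the toy data are hypothetical, and this is not SPSS's own code.

```python
import math
import statistics


def split_by_indicator(values, other):
    """Split `values` by whether `other` is present or missing (None)."""
    present = [v for v, o in zip(values, other) if v is not None and o is not None]
    missing = [v for v, o in zip(values, other) if v is not None and o is None]
    return present, missing


def welch_t(present, missing):
    """Separate-variance t statistic and its approximate degrees of freedom."""
    n1, n2 = len(present), len(missing)
    v1, v2 = statistics.variance(present), statistics.variance(missing)
    se2 = v1 / n1 + v2 / n2
    t = (statistics.fmean(present) - statistics.fmean(missing)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Toy data: birth weight, and hemoglobin with some values missing.
weights = [1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 6.0, 8.0]
hemoglobin = [6.5, 7.0, None, 6.8, 7.2, None, None, None]

present, missing = split_by_indicator(weights, hemoglobin)
t, df = welch_t(present, missing)
print(len(present), len(missing), round(t, 2), round(df, 1))
```

A large absolute t suggests that cases with the other variable missing differ systematically on this variable, which argues against MCAR.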
Crosstabulations of categorical variables by indicator variables

gead
Indicator   Statistic              Total    0        1        Missing (SysMis)
hemog1      Present   Count        1302     821      405      76
                      Percent      61.5     97.0     99.8     8.8
            Missing   % SysMis     38.5     3.0      .2       91.2
agem        Present   Count        1995     844      406      745
                      Percent      94.3     99.8     100.0    86.2
            Missing   % SysMis     5.7      .2       .0       13.8
perbg       Present   Count        1939     840      405      694
                      Percent      91.6     99.3     99.8     80.3
            Missing   % SysMis     8.4      .7       .2       19.7
baude       Present   Count        1916     842      406      668
                      Percent      90.5     99.5     100.0    77.3
            Missing   % SysMis     9.4      .4       .0       22.7
                      % 60.0       .0       .1       .0       .0
Indicator variables with less than 5% missing are not displayed.

gretum
Indicator   Statistic              Total    .00      1.00     Missing (SysMis)
hemog1      Present   Count        1302     629      673      0
                      Percent      61.5     58.1     67.4     .0
            Missing   % SysMis     38.5     41.9     32.6     100.0
agem        Present   Count        1995     1001     977      17
                      Percent      94.3     92.5     97.9     47.2
            Missing   % SysMis     5.7      7.5      2.1      52.8
perbg       Present   Count        1939     974      949      16
                      Percent      91.6     90.0     95.1     44.4
            Missing   % SysMis     8.4      10.0     4.9      55.6
baude       Present   Count        1916     954      950      12
                      Percent      90.5     88.2     95.2     33.3
            Missing   % SysMis     9.4      11.7     4.8      66.7
                      % 60.0       .0       .1       .0       .0
gead        Present   Count        1252     612      639      1
                      Percent      59.2     56.6     64.0     2.8
            Missing   % SysMis     40.8     43.4     36.0     97.2
Indicator variables with less than 5% missing are not displayed.
Percent Mismatch of Indicator Variables (a, b)

            agem     perbg    baude    hemog1   gead
agem        5.72
perbg       9.17     8.36
baude       9.59     4.21     9.45
hemog1      32.84    30.58    29.21    38.47
gead        35.30    33.13    31.76    4.82     40.83

The diagonal elements are the percentages missing, and the off-diagonal elements are the mismatch percentages of indicator variables.
a. Variables are sorted on missing patterns.
b. Indicator variables with less than 5% missing values are not displayed.
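The mismatch percentage for two variables is the share of cases in which exactly one of the two is missing. A small sketch of how such a matrix can be computed; the function name and the toy data are hypothetical, and only the variable names are borrowed from the output above.

```python
def mismatch_matrix(data):
    """data: dict of variable name -> list of values (None = missing).

    Returns percentages: diagonal = % missing, off-diagonal = % of cases
    where exactly one of the two variables is missing.
    """
    names = list(data)
    n = len(data[names[0]])
    matrix = {}
    for a in names:
        for b in names:
            if a == b:
                count = sum(v is None for v in data[a])
            else:
                count = sum((x is None) != (y is None)
                            for x, y in zip(data[a], data[b]))
            matrix[a, b] = 100.0 * count / n
    return matrix


toy = {
    "agem": [25, None, 31, None],
    "perbg": [None, None, 24.5, 23.0],
}
result = mismatch_matrix(toy)
print(result)
```

A low mismatch between two variables with high missingness, such as hemog1 and gead in the output above, means the two tend to be missing for the same cases.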
Tabulated Patterns

Missing Patterns (a): gead, hemog1, baude, perbg, agem, poidbebe, gretum, parite
[The X marks showing which variables are missing in each pattern are not recoverable from the transcript.]

Number of Cases    Complete if ... (b)
1220               1220
22                 1242
46                 1848
483                1800
75                 1295
38                 1839
70                 1937
22                 1826
31                 2031

Patterns with less than 1% cases (21 or fewer) are not displayed.
a. Variables are sorted on missing patterns.
b. Number of complete cases if variables missing in that pattern (marked with X) are not used.
EM Means (a)

poidbebe    hemog1    agem     perbg     baude     parite
2.9351      6.8889    25.12    24.489    17.077    2.65

a. Little's MCAR test: Chi-Square = 622.509, DF = 57, Sig. = .000
Descriptive Statistics

Variable              N       Minimum    Maximum    Mean      Std. Deviation
poidbebe              2081    .97        9.04       2.9353    .67960
hemog1                1302    1.50       9.50       6.8703    1.04820
agem                  1995    7          54         25.06     7.115
perbg                 1939    10.0       40.1       24.486    2.5238
baude                 1916    11.7       30.0       17.062    1.6391
Valid N (listwise)    1295
EM Correlations (a)

            poidbebe   hemog1   agem    perbg   baude   parite
poidbebe    1
hemog1      .050       1
agem        .042       -.039    1
perbg       .118       .014     .114    1
baude       .192       .067     .212    .299    1
parite      .063       -.035    .806    .099    .185    1

a. Little's MCAR test: Chi-Square = 622.509, DF = 57, Sig. = .000
Correlations

                                    agem      perbg     baude     poidbebe   hemog1
agem       Pearson Correlation      1         .118**    .195**    .031       -.046
           Sig. (2-tailed)          .         .000      .000      .165       .095
           N                        1995      1870      1854      1968       1301
perbg      Pearson Correlation      .118**    1         .287**    .111**     .017
           Sig. (2-tailed)          .000      .         .000      .000       .552
           N                        1870      1939      1883      1922       1297
baude      Pearson Correlation      .195**    .287**    1         .204**     .045
           Sig. (2-tailed)          .000      .000      .         .000       .106
           N                        1854      1883      1916      1900       1300
poidbebe   Pearson Correlation      .031      .111**    .204**    1          .076**
           Sig. (2-tailed)          .165      .000      .000      .          .006
           N                        1968      1922      1900      2081       1302
hemog1     Pearson Correlation      -.046     .017      .045      .076**     1
           Sig. (2-tailed)          .095      .552      .106      .006       .
           N                        1301      1297      1300      1302       1302

**. Correlation is significant at the 0.01 level (2-tailed).
Missing values
EM
- Two steps: E = expected values of the missing data; M = estimation of the parameters (correlations) as if the missing values had been filled in
- With SPSS MVA, a multiple imputation can be simulated
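The two steps can be sketched for the simplest case: two variables assumed to be bivariate normal, with X fully observed and Y partly missing. The model, the simulated data and the function name `em_bivariate` are assumptions for illustration; this is not the SPSS MVA implementation.

```python
import random
import statistics


def em_bivariate(xs, ys, n_iter=50):
    """EM estimates of mean(Y) and corr(X, Y) when some ys are None."""
    n = len(xs)
    mu_x = statistics.fmean(xs)
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    observed = [y for y in ys if y is not None]
    mu_y, s_yy, s_xy = statistics.fmean(observed), statistics.pvariance(observed), 0.0
    for _ in range(n_iter):
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                # E step: expected y and y^2 given x and current parameters.
                ey = mu_y + beta * (x - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = y, y * y
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M step: update parameters as if the data were complete.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_y, s_xy / (s_xx * s_yy) ** 0.5


rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys_full = [2 + x + rng.gauss(0, 1) for x in xs]
# MAR: Y is missing far more often when the observed X is positive.
ys = [None if rng.random() < (0.8 if x > 0 else 0.1) else y
      for x, y in zip(xs, ys_full)]

mu_y, corr = em_bivariate(xs, ys)
cc_mean = statistics.fmean([y for y in ys if y is not None])
print(round(cc_mean, 2), round(mu_y, 2), round(corr, 2))
```

In this simulation the true mean of Y is 2 and the true correlation is about 0.71. The complete-case mean is biased downward, while the EM estimate uses the observed X values to recover a mean close to 2.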