Transferring variables between different data-sets


Transfer of variables
between different data-sets
Or: Taking ‘previous research’ seriously
Bojan Todosijević
University of Twente
7th International Conference on Social Science Methodology
RC33 - Logic and Methodology in Sociology
Napoli, September 1-5, 2008
The Problem:

Data scattered in different data sets – surveys, census data, etc.
Typical solutions:
- Collecting more data
  - Not always feasible, e.g., concerning past events
- Data aggregation – geographical, cohort
  - Aggregate-level analysis
  - Imputation of conditional means for individual-level analysis
  - Often used in ‘representation studies’, during the 70s & 80s, for example
  - Methodological problems
    - Ecological Inference (King, 1997)
The task

Present a model for transferring data between datasets - based on the imputation of individual scores.

Present a substantive research illustration

Discuss the approach from the perspective of “taking the
previous research seriously”
The MI approach:

- A question not asked in one survey can be seen as a special case of the missing data problem (Gelman et al., 1998)
- ‘Statistical matching’, or ‘data fusion’ (Rassler, 2003)
- Adopt the Bayesian multiple imputation (MI) approach (Rubin, 1987)
- When data are missing because a question was not asked, the MAR assumption applies:
  P(R | Y_complete) = P(R | Y_observed)
  where R is the missingness indicator
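A minimal sketch of this framing, using two hypothetical toy files (`survey_a`, `survey_b`, with made-up values): once the files are stacked, the question that was not asked becomes a block of missing values, and missingness depends only on which survey a respondent belongs to.

```python
# Hypothetical toy records: survey A asked Y, survey B did not.
survey_a = [{"x1": 1, "y": 4.0}, {"x1": 2, "y": 6.0}, {"x1": 3, "y": 5.0}]
survey_b = [{"x1": 2}, {"x1": 3}]

# Stack the two files; Y is None (missing) for every survey B respondent.
stacked = [dict(row, source="A") for row in survey_a] + [
    dict(row, source="B", y=None) for row in survey_b
]

# R is the missingness indicator for Y (1 = missing, 0 = observed).
for row in stacked:
    row["R"] = int(row["y"] is None)


def missing_rate(source):
    """Share of respondents from `source` whose Y is missing."""
    flags = [row["R"] for row in stacked if row["source"] == source]
    return sum(flags) / len(flags)


# Missingness is determined by the observed source indicator,
# not by the unobserved Y values themselves.
print(missing_rate("A"), missing_rate("B"))
```

Because R is a function of an observed variable only (the file a respondent comes from), the equality P(R | Y_complete) = P(R | Y_observed) holds by design in this toy setup.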
Advantages of the individually imputed scores:

Wider range of applications (e.g., variables of interest may
be unrelated to geographic or cohort units)

Aggregation method tends to neglect variability within
aggregation units

Imputation of individual scores allows the use of the
standard analytic methods
Illustration 1: Comparing imputed and true scores
1. Two data-sets selected: SOCON 2000 and NKO 2002, which contain a number of equivalent variables
2. Target variable: Left-Right self-placement, transferred from SOCON to NKO
3. Tests and comparisons of the ‘true’ and imputed L-R scores
Assessing the feasibility of the approach
[Diagram: availability of variables in the two data files]

Variable    Data file A (SOCON)    Data file B (NKO)
Y           +                      0   (Y true: +)
X1          +                      +
X2          +                      +
X3          +                      +
Z                                  +

+ = observed; 0 = not observed (to be imputed); (Y true) = the original NKO scores on Y, available for comparison with the imputed scores.
Imputation procedure and software
ICE – MICE application for Stata (Royston, 2005)
UVIS – Univariate imputation sampling

ice imputes missing values by using switching regression, an iterative multivariable regression technique (Van Buuren & Oudshoorn, 1999; Stata module written by Patrick Royston, 2005).

uvis imputes missing values in a single variable based on multiple regression on a list of predictors. uvis is called repeatedly by ice in a regression switching mode to perform multivariate imputation.
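The general idea of regression-based imputation with posterior draws can be sketched as follows. This is not the `ice` or `uvis` source code: it is a one-predictor illustration in Python under a flat-prior normal linear model, and the function name `impute_once` and all simulated values are hypothetical.

```python
import random


def impute_once(x_donor, y_donor, x_recipient, rng):
    """One regression imputation of y for the recipient file, with posterior draws."""
    n = len(x_donor)
    x_bar = sum(x_donor) / n
    y_bar = sum(y_donor) / n
    sxx = sum((x - x_bar) ** 2 for x in x_donor)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_donor, y_donor)) / sxx
    rss = sum((y - y_bar - slope * (x - x_bar)) ** 2
              for x, y in zip(x_donor, y_donor))

    # Draw the residual variance, then the coefficients, from their posterior
    # (flat prior; with x centred, intercept and slope are independent).
    sigma2 = rss / rng.gammavariate((n - 2) / 2, 2)  # rss / chi-square(n - 2)
    a_draw = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
    b_draw = rng.gauss(slope, (sigma2 / sxx) ** 0.5)

    # Imputed value = prediction from the drawn coefficients + random residual.
    return [a_draw + b_draw * (x - x_bar) + rng.gauss(0, sigma2 ** 0.5)
            for x in x_recipient]


rng = random.Random(2008)
x_a = [rng.gauss(0, 1) for _ in range(500)]          # file A: predictor
y_a = [5 + 0.8 * x + rng.gauss(0, 1) for x in x_a]   # file A: Y observed
x_b = [rng.gauss(0, 1) for _ in range(300)]          # file B: Y not asked

# Five imputed versions of Y for file B, as in a 5-imputation MI analysis.
imputations = [impute_once(x_a, y_a, x_b, rng) for _ in range(5)]
```

Each call redraws the model parameters before imputing, so the five imputed versions differ both in the regression line and in the residual noise. That variation is what later feeds into the MI standard errors.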
Common NKO and SOCON variables
Name        Variable
urb2        Urbanization
sex         Sex
age         Age
class       Class - self-description
zincome     Household income (standardized)
educatio    Education level
church_a    Religious service attendance
party       Party choice (hypothetical, vote intention, vote recollection)
employed    Employment status
pm          Post-materialism index
pol_int     Political interest
d_proud     Proud being Dutch
L-R         Left-right self-placement, SOCON 2000
L-R1        Left-Right self-placement, NKO 1st wave
L-R2        Left-Right self-placement, NKO 2nd wave
L-R3        Left-Right self-placement, NKO 3rd wave
The Imputation equation – DV: SOCON L-R
      Source |       SS       df       MS              Number of obs =    1008
-------------+------------------------------           F( 12,   995) =   55.22
       Model |  1717.64517    12  143.137097           Prob > F      =  0.0000
    Residual |  2579.00563   995  2.59196546           R-squared     =  0.3998
-------------+------------------------------           Adj R-squared =  0.3925
       Total |  4296.65079  1007  4.26678331           Root MSE      =    1.61
                                                       R             =     .63
------------------------------------------------------------------------------
   SOCON l_r |      Coef.   Std. Err.       t    P>|t|        Beta
-------------+----------------------------------------------------------------
        urb2 |   .0570569   .0379983     1.50   0.134    .0387282
         sex |   -.156749   .1075442    -1.46   0.145   -.0379609
         age |  -.0013591   .0043088    -0.32   0.752   -.0087341
       class |   .3938487   .0834685     4.72   0.000    .1461826
     zincome |   .0254967   .0627695     0.41   0.685    .0123456
    educatio |  -.1585475   .0315719    -5.02   0.000   -.1664914
    church_a |  -.2379957   .0522299    -4.56   0.000   -.1199417
       party |   .3608413   .0198037    18.22   0.000      .48643
    employed |    .030681   .1241836     0.25   0.805    .0068578
          pm |  -.2921879   .1009341    -2.89   0.004   -.0801642
     pol_int |    .173143   .0716642     2.42   0.016    .0685405
     d_proud |   -.195067   .0623608    -3.13   0.002   -.0822013
       _cons |   4.659597   .5426508     8.59   0.000           .
-------------+----------------------------------------------------------------
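The fit statistics reported for the imputation equation follow directly from its sums of squares. The short check below recomputes them in Python from the values printed in the output; the variable names are ours, and only the standard OLS formulas are assumed.

```python
import math

# Values copied from the ANOVA block of the regression output.
ss_model, ss_resid, ss_total = 1717.64517, 2579.00563, 4296.65079
df_model, df_resid = 12, 995
n_obs = 1008

r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
multiple_r = math.sqrt(r_squared)
f_stat = (ss_model / df_model) / (ss_resid / df_resid)
root_mse = math.sqrt(ss_resid / df_resid)

print(round(r_squared, 4), round(adj_r_squared, 4), round(multiple_r, 2))
print(round(f_stat, 2), round(root_mse, 2))
```

The multiple correlation R of about .63 is the correlation between observed and predicted SOCON L-R scores, which sets a ceiling on how well scores imputed from these predictors can track true scores.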
MI approach illustration
Correlation between the original NKO L-R variables

          l_r1    l_r2
l_r2      .760    1
l_r3      .711    .779

Correlation between the imputed and original NKO L-R variables

          l_r_uvis-1   l_r_uvis-2   l_r_uvis-3   l_r_uvis-4   l_r_uvis-5
l_r1      .332         .392         .353         .375         .377
l_r2      .403         .418         .394         .445         .419
l_r3      .402         .408         .395         .449         .416
Correlations with attitudinal variables NOT
included in the imputation model
1st wave
                                    l_r_uvis   L-R uvis corrected   l_r1      l_r2      l_r3
                                               (Reliab.=.51)
v0280 Sympathy: CDA                  0.17 *     0.24                 0.33 *    0.38 *    0.37 *
v0281 Sympathy: PvdA                -0.24 *    -0.33                -0.36 *   -0.40 *   -0.41 *
v0282 Sympathy: VVD                  0.21 *     0.30                 0.39 *    0.38 *    0.39 *
v0283 Sympathy: D66                 -0.23 *    -0.32                -0.26 *   -0.32 *   -0.30 *
v0284 Sympathy: GroenLinks          -0.31 *    -0.44                -0.47 *   -0.49 *   -0.49 *
v0286 Sympathy: Lijst Pim Fortuyn    0.21 *     0.30                 0.40 *    0.41 *    0.40 *
v0287 Sympathy: SGP                  0.11 *     0.15                 0.21 *    0.24 *    0.25 *
v0288 Sympathy: ChristenUnie         0.10 *     0.14                 0.16 *    0.19 *    0.18 *
v0289 Sympathy: SP                  -0.24 *    -0.34                -0.41 *   -0.42 *   -0.41 *
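The slides do not state how the "corrected" column was computed. A plausible reading, assumed here, is the classical correction for attenuation applied to the imputed variable only, that is, the observed correlation divided by the square root of the reliability (.51). The sketch below applies that formula to three observed correlations from the table; the function name is ours.

```python
import math


def correct_for_attenuation(r_observed, reliability):
    """Disattenuate a correlation for unreliability in one of the two variables."""
    return r_observed / math.sqrt(reliability)


# Observed correlations of the imputed L-R score (l_r_uvis), from the table.
observed = {"CDA": 0.17, "SGP": 0.11, "ChristenUnie": 0.10}

# Assumed reliability of the imputed score, as printed in the column header.
corrected = {party: round(correct_for_attenuation(r, 0.51), 2)
             for party, r in observed.items()}
print(corrected)
```

For these three rows the formula reproduces the corrected values in the table (0.24, 0.15, 0.14). Some other rows differ by 0.01, presumably because the table inputs are rounded to two decimals.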
More attitudes…
                                             l_r_uvis   L-R uvis corrected   l_r1      l_r2      l_r3
                                                        (Reliab.=.51)
v0965 Confessional attitude score 2002        0.10 *     0.14 *               0.13 *    0.15 *    0.18 *
x0125 Income differences                     -0.23 *    -0.32 *              -0.34 *   -0.38 *   -0.38 *
x0133 Asylum seekers                          0.29 *     0.41 *               0.42 *    0.45 *    0.48 *
x0141 European unification                    0.13 *     0.18 *               0.14 *    0.17 *    0.19 *
x0149 Ethnic minorities                       0.33 *     0.46 *               0.40 *    0.44 *    0.44 *
x0158 Punishment of crimes                    0.21 *     0.29 *               0.26 *    0.29 *    0.29 *
x0159 Death penalty for certain crimes       -0.21 *    -0.30 *              -0.36 *   -0.36 *   -0.35 *
x0441 Religion is a good guide in politics   -0.17 *    -0.24 *              (entries garbled in the source; one legible value: -0.22 *)
Summary of the comparison between the imputed and original
L-R variables
                                                                        N       %
Total variables                                                         63   100.0
Identical conclusions (direction and significance)                      56    88.9
  Identical significant - significant conclusion                        51    91.1
  Identical insig. - insig. conclusion                                   5     8.9
Different conclusions                                                    7    11.1
  Insignificant (imputed variable) - significant (original variable)     6    85.7
  Significant (imputed variable) - insignificant (original variable)     1    14.3

Percentages in indented rows are relative to the category above them.
Summary of the conclusions that differ between the
original and imputed variables
                                                          L_r_uvis   L-R uvis    l_r1      l_r2      l_r3
                                                                     corrected
I/ Political knowledge 1                                  -.0145     -.020       -.0982*   -.1206*   -.1143*
I/ Political knowledge 2                                  -.0129     -.018       -.0799*   -.1004*   -.0899*
III/1293 Views MP's are good reflection of views voters    .0439*     .061        .0358     .0483     .0245
III/1295 Parties necessary for functioning of democracy    .0114      .016        .0676*    .0297     .0437*
III/1303 Which aspect should politicians emphasize?       -.0299     -.042       -.0765*   -.0854*   -.0942*
II/349 So many people vote, my vote does not matter       -.0357     -.050       -.0890*   -.0649*   -.0482
III/1291 Satisfaction with democracy in the Netherlands    .0099      .014        .0644*    .0595*    .0560*

The highest ‘missed’ correlation: with Political knowledge 1 – average for the three ‘real’ L-R variables: r=-.11.
Summary

- Coefficients associated with the imputed variables are lower in magnitude.
- Correction for attenuation helps.
- In a number of cases even quite low correlations were correctly predicted.
- Using the imputed variable, one is in danger of making a Type II error, and much less so a Type I error.
Problem with MI

MI imposes conditional independence between the imputed variable and any variable not included in the imputation model, given the variables that are included
Conditional independence problem
--------------------------------------------------------
    True L-R |      Coef.   Std. Err.       t    P>|t|
-------------+------------------------------------------
        urb2 |   .0632457   .0293626     2.15   0.031
         sex |  -.2560798   .0768747    -3.33   0.001
         ... |
         CDA |   .0139822   .0022131     6.32   0.000
         VVD |   .0191216   .0022146     8.63   0.000
       _cons |    3.03059   .4192847     7.23   0.000
--------------------------------------------------------

Multiple imputation parameter estimates (5 imputations)
--------------------------------------------------------
 l_r_Imputed |      Coef.   Std. Err.       t    P>|t|
-------------+------------------------------------------
        urb2 |   .0602722   .0362267     1.66   0.096
         sex |  -.2097795   .1849854    -1.13   0.257
         ... |
         CDA |   .0031271   .0040161     0.78   0.436
         VVD |  -.0003598   .0027466    -0.13   0.896
       _cons |   4.402879   .9052197     4.86   0.000
--------------------------------------------------------
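The collapse of the CDA and VVD coefficients can be reproduced in a small simulation. The sketch below is an illustration under invented assumptions, not the SOCON/NKO analysis: one common predictor X, one variable Z that exists only in file B, a true partial effect of Z on Y of 0.5, and a single stochastic regression imputation. All names and parameter values are hypothetical.

```python
import random


def slope_intercept(x, y):
    """Simple OLS of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return b, my - b * mx


def partial_slope(y, z, x):
    """Coefficient of z in the regression y ~ x + z (Frisch-Waugh partialling)."""
    b_zx, a_zx = slope_intercept(x, z)
    z_resid = [zi - a_zx - b_zx * xi for zi, xi in zip(z, x)]
    return slope_intercept(z_resid, y)[0]


rng = random.Random(1)


def simulate(n):
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [0.5 * xi + rng.gauss(0, 1) for xi in x]
    y = [0.5 * xi + 0.5 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, z, y


x_a, _, y_a = simulate(5000)          # file A: Z is not available
x_b, z_b, y_b_true = simulate(5000)   # file B: Y not asked (kept only to compare)

# The imputation model is fitted in file A, so it can use X only.
b, a = slope_intercept(x_a, y_a)
resid_sd = (sum((yi - a - b * xi) ** 2 for xi, yi in zip(x_a, y_a))
            / (len(x_a) - 2)) ** 0.5
y_b_imputed = [a + b * xi + rng.gauss(0, resid_sd) for xi in x_b]

coef_true = partial_slope(y_b_true, z_b, x_b)        # near the true 0.5
coef_imputed = partial_slope(y_b_imputed, z_b, x_b)  # near 0
print(round(coef_true, 2), round(coef_imputed, 2))
```

Because the imputed Y is built from X and independent noise, nothing in it can be related to Z once X is controlled, which is the pattern seen for the party sympathy scores.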
Dealing with the conditional independence
problem

Expand the imputation model

Introduce prior information, or a ‘third’ data set (Rassler
2003)

Simulate the relationships
"Third" data set approach
[Diagram: availability of variables when a "third" data set is added]

Variable    Data file A    "Third" data set    Data file B
Y           +              +                   0
X1          +                                  +
X2          +                                  +
X3          +                                  +
Z                          +                   +

+ = observed; 0 = not observed (to be imputed). The "third" data set is small and contains only Y and Z.
Simulated ‘third’ data file
N=100
             |      L-R      CDA      VVD
-------------+---------------------------
         L-R |   1.0000
         CDA |   0.3750   1.0000
         VVD |   0.4426   0.2635   1.0000
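The slides do not say how the ‘third’ file was simulated. One common way to generate cases with a prescribed correlation structure, shown here only as a sketch, is to mix independent standard normal draws with the Cholesky factor of the target correlation matrix. The function name, the seed, and the use of standardized (rather than original-metric) scores are assumptions.

```python
import random

# Target correlations from the slide (order: L-R, CDA, VVD).
target = [
    [1.0000, 0.3750, 0.4426],
    [0.3750, 1.0000, 0.2635],
    [0.4426, 0.2635, 1.0000],
]


def cholesky(matrix):
    """Lower-triangular L with L * L^T == matrix (matrix must be positive definite)."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = (matrix[i][i] - partial) ** 0.5
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


rng = random.Random(21)
factor = cholesky(target)

# Each simulated respondent: independent standard normals mixed by the factor,
# giving standardized L-R, CDA and VVD scores with the target correlations
# in expectation.
simulated = []
for _ in range(100):
    e = [rng.gauss(0, 1) for _ in range(3)]
    simulated.append([sum(factor[i][k] * e[k] for k in range(i + 1))
                      for i in range(3)])
```

With only 100 cases the sample correlations will deviate somewhat from the targets, and the standardized scores would still need rescaling to the metric of the real variables before being stacked with the survey files.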
Results
In the complete model, the conditional relationships are preserved,
although the target variables did not exist in the data that served to
construct the imputation model.
Multiple imputation parameter estimates (5 imputations)
--------------------------------------------------------
L-R Im. Corr.|      Coef.   Std. Err.       t    P>|t|
-------------+------------------------------------------
        urb2 |   .0660053   .0454674     1.45   0.147
         sex |  -.1209418   .1076277    -1.12   0.261
         ... |
         CDA |    .019413   .0028881     6.72   0.000
         VVD |   .0170953   .0039076     4.37   0.000
       _cons |   3.139328   .4902676     6.40   0.000
--------------------------------------------------------
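"Multiple imputation parameter estimates" are obtained by combining the results from each imputed data set. A minimal sketch of the standard combining rules (Rubin, 1987) for a single coefficient is given below; the per-imputation numbers are hypothetical and are not taken from the analyses reported here.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed data sets (Rubin, 1987)."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean([se ** 2 for se in std_errors])  # mean sampling variance
    between = statistics.variance(estimates)                   # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical results for one coefficient from five imputed data sets.
estimates = [0.020, 0.018, 0.021, 0.019, 0.022]
std_errors = [0.0025, 0.0026, 0.0024, 0.0025, 0.0025]

coef, se = pool_rubin(estimates, std_errors)
print(round(coef, 4), round(se, 4))
```

The pooled standard error exceeds every single-imputation standard error because the between-imputation variance is added to it. This is how MI carries the uncertainty about the transferred variable into the final estimates.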
Conclusions (thus far)

- The imputed variable strongly correlates with the ‘true’ individual responses (r is around .40, without correction for attenuation).
- By using the imputed variable, one is in danger of wrongly supporting the null hypothesis and of underestimating the strength of the relationships.
- Using additional information about conditional relationships may be a worthwhile effort (e.g., through a simulated ‘third’ data file).
- Applicability:
  - pilot studies
  - multiple surveys where particular questions are omitted from some studies
  - problems dealing with past events
Substantive research problem, Part 1
Theory: According to social identity theory (SIT), ethnocentrism depends on the strength of in-group identification
Note: The puzzle of the Pim Fortuyn phenomenon in 2002
According to TAP, personality (authoritarianism) explains individual differences in ethnocentrism.
Integrated model: Authoritarianism is a stronger predictor under conditions of weaker group identification.
Problem: No suitable data
Available data: NKO 2002 & SOCON 2000
Both data-sets are supposed to be representative of the Dutch population
Variables
- In-group identification: Proud to be Dutch (both SOCON & NKO)
- Authoritarianism scale (SOCON -> NKO)
- Ethnocentrism: Ethnic minorities - position of respondent (NKO)
Results
Multiple imputation parameter estimates (3 imputations)
Ethnocentrism
                                                      High identification   Low identification
                                                      (Proud to be Dutch)
Authoritarianism (imputed from SOCON to NKO)   b      0.28                  0.47
                                               s.e.   0.07                  0.10
1833 observations.
Substantive research problem, Part 2
Theory:
Authoritarianism should predict party preference for
ethnocentric parties depending on the degree of in-group
identification (less in case of strong identification)
[Diagram: Authoritarianism -> Party preference (for ethnocentric parties), with the effect depending on In-group identification]
Results: NKO 2002, 1st wave
Regression coefficients:
Authoritarianism predicting Party sympathy under different group identification conditions

                       GL         SP         PvdA      D66
High identification    -5.30**    -4.97***   -1.59     -4.01***
Low identification     -7.77*     -7.75***   -4.15*    -3.27
Multiple imputation parameter estimates (3 imputations)

                       VVD       CDA      LPF       CU       SGP
High identification    0.46      2.63     2.47      0.13     1.93
Low identification     3.84*     3.10     7.12**    1.72     2.64
Multiple imputation parameter estimates (3 imputations)
Results: NKO 2002, 2nd wave
Regression coefficients:
Authoritarianism predicting Party sympathy under different group identification conditions

                       GL          SP         PvdA       D66
High identification    -4.83***    -4.19***   -1.68*     -3.76***
Low identification     -8.77***    -6.48**    -4.37**    -4.36***
Multiple imputation parameter estimates (3 imputations)

                       VVD       CDA        LPF        CU       SGP
High identification    0.89      3.24***    2.96*      1.16     2.53**
Low identification     4.83*     4.85*      7.37***    0.73     2.74
Multiple imputation parameter estimates (3 imputations)
Conclusion of the illustration
- The integrated model is supported: group identification modifies the influence of authoritarianism on ethnocentrism
- The obtained coefficients are estimated minimum associations
  - Based on the previous research, namely the association between authoritarianism and the predictors common to the two data-sets
- The conclusion rests on the assumption that the relationships present in the SOCON 2000 study are also valid in NKO 2002
MI approach from the perspective of using the
previous research

- The MI approach is statistically superior to alternative methods for data transfer or statistical matching
- The MI approach has advantages as a method for “taking the previous research into account”
  - Traditional method: “It has been found…”
  - Advanced approach: Quantitative meta-analysis
    - Uses results of previous analyses
    - May differ from current interests
    - Involves information loss
Hence,

“Previous research” should refer not only to
published analyses, but also to already collected
data

Relationships in the already collected data (even if
present only implicitly) are our best guesses about the
relationships in comparable data sets
Sources of problems

- Comparability of data sources
  - Various details about data collection methods
  - Sampling frames
  - Time frame
- Conditional independence
  - There are possible compensations
  - The problem may be attenuated if we have theoretical reasons not to include all variables in the imputation model (as in the previous example)
“With or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest – not to estimate, predict, or recover missing observations nor to obtain the same results that we would have seen with complete data.”
Schafer & Graham 2002, p. 149.