Powerpoint slides are available for this talk


Lisa M. Lix, PhD P. Stat.
School of Public Health
Joint Seminar: Statistics and Collaborative Graduate
Program in Biostatistics
January 5, 2012

Co-Authors: Tolu Sajobi, Bola Dansu

Funding:
◦ Canadian Institutes of Health Research
◦ Centennial Chair Program, University of
Saskatchewan

Background

Description of Relative Importance Measures

Numeric Example

Monte Carlo Study: Design and Results

Discussion and Conclusions

m ≥ 2 correlated variables for N study participants with
n1 participants in group 1 and n2 participants in group 2
(n1 + n2 = N)

In many studies, the variables are assumed to follow a
normal distribution, $N(\mu_{jk}, \sigma_{jk}^2)$, for k = 1, …, m and
j = 1, 2

We will focus on the case where there are no missing
observations

Do different measures of relative importance result in
the same rankings of a set of correlated variables for
distinguishing between two independent groups?

What factors affect the variable ranking performance of
relative importance measures?

For exploratory analysis and model development

Organizational research:
◦ Relative contribution of various applicant
characteristics in hire–not hire decisions made by
managers

Genetics research:
◦ Relative contribution of individual genes to
distinguishing between patients with and without
chronic health conditions

Quality of life research:
◦ Relative importance of quality of life domains for
distinguishing between patients who do and do not
receive healthcare treatments

Back et al. (2008). Journal of
Biopharmaceutical Statistics
◦ Rankings of variable importance were used to identify a
set of genes to classify life-threatening diseases according
to prognosis or type
◦ Variable importance was assessed using a variety of
techniques, including non-parametric recursive
partitioning techniques

Statistical significance (e.g., t-test)

Practical significance (e.g., effect size)

Descriptive discriminant analysis (DDA): linear combination
of variables that maximizes separation of the groups

Stepwise multivariate analysis of variance (MANOVA):
F-to-remove statistic measures the decrease in the inter-group
Mahalanobis distance caused by removing each of the variables
in sequence

Logistic regression analysis (LRA): contribution of each
variable to the total predicted variance in the dichotomous
outcome



Dominance analysis: Budescu, 1993
◦ General dominance analysis determines relative importance
based on the average ΔR² observed by adding a predictor to all
possible subsets of the remaining predictors

Relative weights analysis: Johnson, 2000
◦ Creates a new set of variables that are orthogonal
representations of the original set of variables
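The averaging behind general dominance analysis can be sketched as follows. This is a minimal illustration, not the procedure used in the talk: it assumes an ordinary least-squares R² (the slides do not say which R² analogue applies to a two-group outcome), it averages ΔR² within each subset size before averaging across sizes, and the data and the names `r_squared` and `general_dominance` are hypothetical.

```python
import numpy as np
from itertools import combinations

def r_squared(x, y, cols):
    """R^2 of an OLS fit of y on the columns in `cols` plus an intercept."""
    if not cols:
        return 0.0
    design = np.column_stack([np.ones(len(y)), x[:, list(cols)]])
    resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    centred = y - y.mean()
    return 1.0 - (resid @ resid) / (centred @ centred)

def general_dominance(x, y):
    """Average gain in R^2 from adding each predictor to subsets of the others."""
    m = x.shape[1]
    weights = np.zeros(m)
    for k in range(m):
        others = [j for j in range(m) if j != k]
        size_means = []
        for size in range(m):  # subsets of the other predictors, sizes 0 .. m-1
            gains = [r_squared(x, y, sub + (k,)) - r_squared(x, y, sub)
                     for sub in combinations(others, size)]
            size_means.append(np.mean(gains))
        weights[k] = np.mean(size_means)
    return weights

# Hypothetical data: three predictors and a 0/1 group indicator.
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 3))
y = (x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=60) > 0).astype(float)
weights = general_dominance(x, y)
```

With this averaging scheme the weights add up to the R² of the full model, which gives a quick sanity check on an implementation.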



Denote $X_{ij}$ as the m × 1 vector of observations for the
ith study participant in the jth group (i = 1, …, $n_j$; j = 1, 2)

$\bar{X}_j$ is the m × 1 vector of means for the jth group

The vector of discriminant function coefficients is
estimated by

$$a = S^{-1}(\bar{X}_1 - \bar{X}_2)$$

where

$$S = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{N - 2}$$

and $S_1$ and $S_2$ are the variance-covariance matrices for
groups 1 and 2, respectively
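The pooled covariance matrix and the discriminant function coefficients defined above can be computed in a few lines. The sketch below is illustrative only: the function names and the toy data are hypothetical, and rows are assumed to be participants with one column per variable.

```python
import numpy as np

def pooled_covariance(x1, x2):
    """S = ((n1 - 1) S1 + (n2 - 1) S2) / (N - 2)."""
    n1, n2 = x1.shape[0], x2.shape[0]
    s1 = np.cov(x1, rowvar=False)
    s2 = np.cov(x2, rowvar=False)
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

def discriminant_coefficients(x1, x2):
    """a = S^{-1} (xbar1 - xbar2), solved without forming the inverse."""
    s = pooled_covariance(x1, x2)
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    return np.linalg.solve(s, diff)

# Hypothetical toy data with m = 3 variables.
rng = np.random.default_rng(0)
group1 = rng.normal(loc=[1.0, 0.5, 0.0], scale=1.0, size=(40, 3))
group2 = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(30, 3))
a = discriminant_coefficients(group1, group2)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical convenience and does not change the estimate.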

The kth standardized discriminant function coefficient
is

$$a_k^* = a_k s_k$$

where $a_k$ and $s_k$ are the kth estimated discriminant
function coefficient and standard deviation,
respectively

By placing a constraint on the discriminant function
coefficients such that $a^T S a = 1$, where T is the transpose
operator, the coefficients will range in value from -1 to
+1
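A small worked example of the standardization and of the constraint may help. It assumes that the standard deviation for each variable is taken from the diagonal of the pooled covariance matrix S, which the slide does not state explicitly; the numbers are made up.

```python
import numpy as np

def normalize(a, s):
    """Rescale a so that a^T S a = 1."""
    return a / np.sqrt(a @ s @ a)

def standardized_coefficients(a, s):
    """a*_k = a_k * s_k, with s_k assumed to be sqrt of the kth diagonal element of S."""
    return a * np.sqrt(np.diag(s))

# Hypothetical covariance matrix and raw coefficients for m = 2 variables.
s = np.array([[4.0, 1.0],
              [1.0, 9.0]])
a = np.array([0.5, -0.2])

a_std = standardized_coefficients(a, s)   # 0.5 * 2 and -0.2 * 3
a_unit = normalize(a, s)                  # satisfies a^T S a = 1
a_unit_std = standardized_coefficients(a_unit, s)
```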

The parallel discriminant ratio coefficient for the kth
variable is

$$q_k = a_k^* f_k$$

where $f_k$ is the kth structure coefficient, the correlation
between the kth variable and the discriminant function

Coefficients can take on positive and negative values
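One way to compute the parallel discriminant ratio coefficients is sketched below. It assumes that the structure coefficients are pooled within-group correlations and that the coefficients are scaled so that a^T S a = 1; both are assumptions made for illustration, and the function name and data are hypothetical.

```python
import numpy as np

def parallel_drc(x1, x2):
    """q_k = a*_k f_k, with f_k taken as a pooled within-group correlation."""
    n1, n2 = len(x1), len(x2)
    # Centre each group at its own mean so that covariances are within-group.
    centred = np.vstack([x1 - x1.mean(axis=0), x2 - x2.mean(axis=0)])
    s = centred.T @ centred / (n1 + n2 - 2)            # pooled covariance S
    a = np.linalg.solve(s, x1.mean(axis=0) - x2.mean(axis=0))
    a = a / np.sqrt(a @ s @ a)                         # impose a^T S a = 1
    a_std = a * np.sqrt(np.diag(s))                    # a*_k = a_k s_k
    scores = centred @ a                               # discriminant scores
    f = np.array([np.corrcoef(centred[:, k], scores)[0, 1]
                  for k in range(x1.shape[1])])        # structure coefficients
    return a_std * f

# Hypothetical toy data with m = 3 variables.
rng = np.random.default_rng(3)
group1 = rng.normal(loc=[1.0, 0.5, 0.2], size=(50, 3))
group2 = rng.normal(loc=0.0, size=(40, 3))
q = parallel_drc(group1, group2)
```

Under these two assumptions the coefficients sum to one, so each can be read as a share of the separation captured by the discriminant function, with negative shares possible.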

The total discriminant ratio coefficient for the kth
variable is

where $S_{T,kk}$ is the (k,k)th element of $S_T$, $S_T = T/(N - 1)$,
T = H + E, and H and E are the hypothesis and error
sum of squares and cross-product matrices, respectively

Coefficients have a lower bound of zero but no upper
bound


For the kth variable, the F-to-remove statistic is

$$F_{(k)} = \frac{k_2 \left(D^2 - D_{(k)}^2\right)}{k_3 + D_{(k)}^2}$$

where $k_2 = N - m$, $k_3 = N^2/(n_1 n_2)$,

$$D^2 = (\bar{X}_1 - \bar{X}_2)^T S^{-1} (\bar{X}_1 - \bar{X}_2)$$

is the squared Mahalanobis distance, and $D_{(k)}^2$
is the value of $D^2$ when the kth variable is omitted

Statistics take on positive values
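The F-to-remove computation can be sketched as a loop that drops one variable at a time. The constants k2 and k3 follow the slide; the plus sign in the denominator is an assumption, chosen because it matches the usual form of the statistic. Function names and data are hypothetical.

```python
import numpy as np

def mahalanobis_d2(x1, x2, keep):
    """Squared Mahalanobis distance between group means, using columns in `keep`."""
    a, b = x1[:, keep], x2[:, keep]
    n1, n2 = len(a), len(b)
    s = ((n1 - 1) * np.cov(a, rowvar=False)
         + (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    s = np.atleast_2d(s)                     # handles the single-variable case
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(diff @ np.linalg.solve(s, diff))

def f_to_remove(x1, x2):
    """F_(k) = k2 (D^2 - D^2_(k)) / (k3 + D^2_(k)) for each variable k."""
    n1, n2 = len(x1), len(x2)
    n, m = n1 + n2, x1.shape[1]
    k2, k3 = n - m, n**2 / (n1 * n2)         # constants as given on the slide
    d2_full = mahalanobis_d2(x1, x2, list(range(m)))
    stats = []
    for k in range(m):
        keep = [j for j in range(m) if j != k]
        d2_k = mahalanobis_d2(x1, x2, keep)
        stats.append(k2 * (d2_full - d2_k) / (k3 + d2_k))
    return np.array(stats)

# Hypothetical toy data with m = 3 variables.
rng = np.random.default_rng(1)
group1 = rng.normal(loc=[1.0, 0.5, 0.0], size=(40, 3))
group2 = rng.normal(loc=0.0, size=(30, 3))
ftr = f_to_remove(group1, group2)
```

Because removing a variable cannot increase the Mahalanobis distance, every statistic is non-negative, which agrees with the remark above.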




The model is

$$\ln\left(\frac{p_l}{1 - p_l}\right) = A_l \beta$$

where $A_l$ is the vector of (m + 1) observations for the lth
study participant (l = 1, …, N) where the first element is
equal to one

$p_l = \Pr(y_l = 1 \mid A_l)$ is the probability the lth study
participant is a member of group 1 conditional on the
explanatory variables

β is the (m + 1) vector of coefficients to be estimated,
with the first element equal to the model intercept, $\beta_0$
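The relationship between the linear predictor and the group-membership probability can be checked numerically. The participant values and coefficients below are made up for illustration only.

```python
import numpy as np

def predicted_probability(a_row, beta):
    """p_l = 1 / (1 + exp(-A_l beta)), the inverse of the logit link."""
    return 1.0 / (1.0 + np.exp(-(a_row @ beta)))

# Hypothetical participant with m = 2 variables; the leading 1 multiplies the intercept.
a_l = np.array([1.0, 0.8, -1.2])
beta = np.array([-0.5, 1.0, 0.3])        # (beta_0, beta_1, beta_2), made-up values

p_l = predicted_probability(a_l, beta)
logit = np.log(p_l / (1.0 - p_l))        # recovers the linear predictor A_l beta
```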

The estimated coefficient for the kth variable, $\hat{\beta}_k$, can be
expressed in terms of three quantities:
◦ $r_{\mathrm{logit}(\hat{p})k}$, the correlation between the kth
variable and the logit of the predicted probabilities
◦ $R_{(-k)}^2$, the R² value for a LRA model in which the
kth variable is excluded
◦ $R_{k|(-k)}^2$, the R² value for a model in which the kth
variable is regressed on the remaining (m – 1) variables

Standardized logistic regression coefficients have also
been used to assess relative importance. The kth
standardized coefficient is

$$\hat{\beta}_k^* = \frac{\hat{\beta}_k s_k R}{s_{\mathrm{logit}(\hat{p})}}$$

where $\hat{\beta}_k$ is the estimated coefficient and
$s_{\mathrm{logit}(\hat{p})}$ is the standard deviation of the logit
of the predicted probabilities

Coefficients can take on positive and negative values
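Given a fitted model, the standardized coefficients follow directly from the formula above. In this sketch the multiplier R is passed in by the caller because the slide does not define it; the function name, the coefficient values and the data are hypothetical.

```python
import numpy as np

def standardized_logistic_coefficients(beta, x, r):
    """beta*_k = beta_k * s_k * R / s_logit(p-hat).

    beta : (m + 1,) fitted coefficients, intercept first
    x    : (N, m) explanatory variables
    r    : the multiplier R from the formula, supplied by the caller
    """
    logit = beta[0] + x @ beta[1:]        # logit of the predicted probabilities
    s_k = x.std(axis=0, ddof=1)           # standard deviation of each variable
    s_logit = logit.std(ddof=1)           # standard deviation of the logits
    return beta[1:] * s_k * r / s_logit

# Hypothetical inputs: N = 50 participants, m = 2 variables, made-up estimates.
rng = np.random.default_rng(4)
x = rng.normal(size=(50, 2))
beta_hat = np.array([0.2, 1.1, -0.4])
beta_std = standardized_logistic_coefficients(beta_hat, x, r=0.6)
```

Each standardized coefficient keeps the sign of its raw coefficient as long as R is positive.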

Pratt’s (1987) index for relative importance was
originally proposed for multiple regression and then
extended to LRA. The index value for the kth variable
is

$$d_k = \frac{\hat{\beta}_k^* \hat{\rho}_k}{R^2}$$

where $\hat{\rho}_k$ is the estimated correlation between the kth
explanatory variable and the logit of the predicted
probabilities

Coefficients can take on positive and negative values
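The index can be computed once the standardized coefficients and the logits are available. The sketch below takes R² as an input rather than committing to a particular definition, which the slide leaves open; names and numbers are hypothetical.

```python
import numpy as np

def logistic_pratt_index(beta_std, x, logit, r2):
    """d_k = beta*_k * rho_k / R^2, with rho_k = corr(x_k, logit)."""
    rho = np.array([np.corrcoef(x[:, k], logit)[0, 1]
                    for k in range(x.shape[1])])
    return np.asarray(beta_std) * rho / r2

# Hypothetical inputs: made-up coefficients for m = 2 variables.
rng = np.random.default_rng(5)
x = rng.normal(size=(50, 2))
beta_hat = np.array([0.2, 1.1, -0.4])              # intercept first
logit = beta_hat[0] + x @ beta_hat[1:]
r = 0.6                                            # assumed value of R
beta_std = beta_hat[1:] * x.std(axis=0, ddof=1) * r / logit.std(ddof=1)
d = logistic_pratt_index(beta_std, x, logit, r2=r**2)
```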




Data are from the Manitoba Inflammatory Bowel Disease
(IBD) Cohort Study

The study started in 2002 and initially enrolled 388 patients who had
recently been diagnosed with Crohn’s disease or ulcerative colitis

Health-related quality of life (HRQOL) data were collected at
regular intervals throughout the study
◦ SF-36: 8 domains
◦ IBD Questionnaire: 4 domains

A central theme of the study is the effect of disease activity
on quality of life, stress, well-being, and coping with illness

                          Active Disease    Inactive Disease
                          (n1 = 244)        (n2 = 105)
IBDQ
  Bowel Symptoms          4.92 (1.03)       6.08 (0.76)
  Emotional Health        4.81 (1.05)       5.85 (0.89)
  Social Function         4.09 (1.18)       5.19 (1.05)
  Systemic Symptoms       5.62 (1.35)       6.65 (0.64)
SF-36
  Bodily Pain             60.78 (24.15)     77.45 (26.11)
  Role Physical           63.48 (29.07)     83.65 (24.08)
  General Health          43.40 (19.52)     59.18 (17.01)
  Mental Health           60.33 (14.11)     66.62 (12.47)
  Physical Functioning    77.49 (21.73)     91.11 (14.41)
  Role Emotional          76.06 (23.98)     85.82 (20.11)
  Social Functioning      63.74 (27.20)     78.85 (27.10)
  Vitality                46.13 (16.39)     57.84 (14.49)

Domain                    t-statistic   SLRC     LPI       ALPI     SDFC      PDRC      FTR
IBDQ
  Bowel Symptoms          10.430*       0.463    0.471     0.376    0.587     0.542     5.034
  Emotional Health        8.840*        0.309    0.28      0.223    0.428     0.347     4.033
  Social Function         7.500*        0.183    0.165     0.132    0.044     -0.031    5.072
  Systemic Symptoms       7.980*        0.145    -0.117    -        0.083     -0.062    14.334
SF-36
  Bodily Pain             5.690*        0.103    0.066     0.053    0.103     0.057     0.504
  Role Physical           6.220*        0.015    -0.010    0.000    0.037     -0.022    6.099
  General Health          6.930*        0.135    0.095     0.076    0.226     0.149     12.334
  Mental Health           3.790*        0.143    -0.059    -        0.1910    -0.072    0.952
  Physical Functioning    5.890*        0.169    0.113     0.090    0.185     0.106     8.329
  Role Emotional          3.640*        0.171    -0.066    -        0.120     -0.043    0.508
  Social Functioning      4.770*        0.026    0.015     0.012    0.027     0.013     0.011
  Vitality                6.080*        0.074    0.049     0.039    0.029     0.017     6.911
Note: * denotes a test statistic that is statistically significant at α = .05/12 = .004

Domain                    SLRC    ALPI    SDFC    PDRC    FTR
IBDQ
  Bowel Symptoms          1       1       1       1       7
  Emotional Health        2       2       2       2       8
  Social Function         3       3       9       9       6
  Systemic Symptoms       6       -       8       -       1
SF-36
  Bodily Pain             9       6       7       5       11
  Role Physical           12      9       10      9       5
  General Health          8       5       3       3       2
  Mental Health           7       -       4       -       9
  Physical Functioning    5       4       5       4       3
  Role Emotional          4       -       6       -       10
  Social Functioning      11      8       12      7       12
  Vitality                10      7       11      6       4

SDFC: standardized discriminant function coefficient
PDRC: parallel discriminant ratio coefficients
TDRC: total discriminant ratio coefficients
FTR: F-to-remove statistic
SLRC: standardized logistic regression coefficient
LPI: Logistic Pratt’s index
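The rankings in the numeric example can be reproduced from the coefficient values by ordering them from largest to smallest. The sketch below does this for the SLRC values of the twelve domains, listed in the order Bowel Symptoms, Emotional Health, Social Function, Systemic Symptoms, Bodily Pain, Role Physical, General Health, Mental Health, Physical Functioning, Role Emotional, Social Functioning, Vitality. Ranking on absolute value is an assumption that makes no difference here because all SLRC values are positive; ties are not handled.

```python
import numpy as np

def rank_by_magnitude(values):
    """Return ranks where 1 marks the largest absolute value."""
    magnitudes = np.abs(np.asarray(values, dtype=float))
    order = np.argsort(-magnitudes, kind="stable")
    ranks = np.empty(len(magnitudes), dtype=int)
    ranks[order] = np.arange(1, len(magnitudes) + 1)
    return ranks

# SLRC values from the numeric example, in the domain order given above.
slrc = [0.463, 0.309, 0.183, 0.145, 0.103, 0.015,
        0.135, 0.143, 0.169, 0.171, 0.026, 0.074]
ranks = rank_by_magnitude(slrc)
# ranks -> [1, 2, 3, 6, 9, 12, 8, 7, 5, 4, 11, 10], matching the SLRC rank column
```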







Number of variables (m = 4, 6, 8)
Total sample size (N = 60, 80, 140, 200)
Equality/inequality of group sizes
Magnitude and pattern of correlation among the
variables
Group covariance homogeneity/heterogeneity
Group means
Shape of the population distribution

Let ρ denote the average correlation between the
variables
◦ ρ = 0, 0.3, 0.6

Pattern of correlation
◦ Compound symmetric
◦ Unstructured
◦ Modified simplex
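As an illustration of one simulation condition, the sketch below builds a compound symmetric correlation matrix and draws two groups from multivariate normal distributions. Unit variances, the group sizes and the seed are assumptions made for the example; the group 1 mean vector is mean pattern I for four variables, and group 2 has a null mean vector.

```python
import numpy as np

def compound_symmetric(m, rho):
    """m x m correlation matrix with every off-diagonal element equal to rho."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

rng = np.random.default_rng(2012)               # arbitrary seed
m, rho, n1, n2 = 4, 0.3, 30, 30                 # N = 60, equal group sizes
sigma = compound_symmetric(m, rho)              # unit variances assumed
mu1 = np.array([2.5, 2.0, 1.5, 1.0])            # mean pattern I for m = 4
group1 = rng.multivariate_normal(mu1, sigma, size=n1)
group2 = rng.multivariate_normal(np.zeros(m), sigma, size=n2)
```

The unstructured and modified simplex patterns would need their own matrices, which are not fully specified in the slides.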
Mean Pattern    μ1                        D²
I               (2.5, 2, 1.5, 1)          13.5
II              (1.5, 1, 0.5, 2)          7.5
III             (1.0, 0.75, 0.5, 0.25)    1.9
IV              (0.75, 0.5, 0.25, 1.0)    1.9
Note: μ2 is the null vector

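The D² values for the four-variable mean patterns can be checked directly. The sketch assumes unit variances and uncorrelated variables (an identity covariance matrix), which is an inference from the tabulated numbers rather than something stated on the slide.

```python
import numpy as np

def squared_distance(mu1, mu2, sigma):
    """D^2 = (mu1 - mu2)^T Sigma^{-1} (mu1 - mu2)."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return float(diff @ np.linalg.solve(sigma, diff))

patterns = {
    "I": [2.5, 2.0, 1.5, 1.0],
    "II": [1.5, 1.0, 0.5, 2.0],
    "III": [1.0, 0.75, 0.5, 0.25],
    "IV": [0.75, 0.5, 0.25, 1.0],
}
sigma = np.eye(4)                      # assumed: unit variances, zero correlations
d2 = {name: squared_distance(mu1, np.zeros(4), sigma)
      for name, mu1 in patterns.items()}
# d2 -> I: 13.5, II: 7.5, III: 1.875, IV: 1.875 (1.9 after rounding)
```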
Mean Pattern    μ1                                           D²
I               (4.5, 4, 3.5, 3, 2.5, 2, 1.5, 1)             71.0
II              (2.5, 2, 1.5, 1, 0.5, 3, 3.5, 4)             47.0
III             (2, 1.75, 1.5, 1.25, 1, 0.75, 0.5, 0.25)     12.8
IV              (1.25, 1, 0.75, 0.5, 0.25, 1.5, 1.75, 2)     12.8
Note: μ2 is the null vector

Normal
◦ γ1 = 0; γ2 = 0

Skewed
◦ γ1 = 1.8; γ2 = 5.9

Heavy-Tailed
◦ γ1 = 0; γ2 = 33
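Reading γ1 as skewness and γ2 as excess kurtosis (the usual meaning of this notation, and consistent with both being zero for the normal case), the sample versions can be computed as below. The function name is hypothetical, and the slides do not say how the non-normal data were generated, so no generator is shown.

```python
import numpy as np

def skewness_and_excess_kurtosis(x):
    """Moment-based sample skewness (gamma_1) and excess kurtosis (gamma_2)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()          # population-style standardization
    return float((z**3).mean()), float((z**4).mean() - 3.0)

# A small symmetric sample: skewness is zero and tails are lighter than normal.
g1, g2 = skewness_and_excess_kurtosis([-1.0, 0.0, 1.0])
```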



All-variable correct ranking percentage: percent
of simulations in which the sample rank was the
same as the corresponding population rank for
every variable

Average per-variable correct ranking percentage:
the percent of simulations in which a variable in
the sample had the same rank as the variable in
the population, averaged across all variables

Kendall’s concordance statistic (not reported in
this presentation)
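The two correct ranking percentages can be computed from a matrix of sample ranks, one row per simulation replication. The sketch takes the all-variable percentage to mean that every variable is ranked correctly within a replication; the function name and the ranks below are made up.

```python
import numpy as np

def correct_ranking_percentages(sample_ranks, population_ranks):
    """Return (all-variable %, average per-variable %) of correct rankings.

    sample_ranks     : (replications, m) ranks observed in each simulated sample
    population_ranks : (m,) ranks implied by the population parameters
    """
    match = np.asarray(sample_ranks) == np.asarray(population_ranks)
    all_variable = 100.0 * match.all(axis=1).mean()    # every variable correct
    per_variable = 100.0 * match.mean(axis=0).mean()   # averaged over variables
    return all_variable, per_variable

# Hypothetical ranks from four replications with m = 4 variables.
population = [1, 2, 3, 4]
samples = [[1, 2, 3, 4],
           [1, 2, 4, 3],
           [2, 1, 3, 4],
           [1, 2, 3, 4]]
all_pct, per_pct = correct_ranking_percentages(samples, population)
# all_pct -> 50.0, per_pct -> 75.0
```

The all-variable percentage can never exceed the per-variable average, which is the pattern seen when comparing the two sets of results.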
Mean Pattern    SDFC    PDRC    TDRC    FTR     SLRC    LPI
I               49.1    59.8    59.0    38.0    41.7    61.1
II              43.7    63.1    56.2    32.1    38.0    64.3
III             34.8    47.0    37.8    26.4    33.2    47.4
IV              37.0    54.3    41.1    28.3    34.8    54.7
Average         41.2    56.0    48.5    31.2    36.9    56.9

Mean Pattern    SDFC    PDRC    TDRC    FTR     SLRC    LPI
I               17.5    28.3    27.1    9.1     13.6    29.4
II              12.2    32.1    23.6    5.7     9.8     33.6
III             7.7     12.7    9.4     2.1     7.3     12.8
IV              8.1     21.1    11.0    3.8     7.6     21.4
Average         11.4    23.5    17.8    5.2     9.6     24.3

Correlation Scenario    SDFC    PDRC    TDRC    FTR     SLRC    LPI
1                       60.3    63.3    63.2    40.2    55.0    66.3
2                       45.9    63.2    51.0    32.6    42.4    63.6
3                       32.2    65.9    42.5    25.8    25.7    65.4
4                       39.7    52.1    45.1    29.7    36.5    53.1
5                       25.8    34.2    38.6    27.0    24.3    33.9
6                       43.0    57.6    50.5    31.8    37.8    58.5
Average                 41.2    56.0    48.5    31.2    36.9    56.9
Note: Scenario 1: ρ = 0, where ρ is the average correlation; Scenario 2: compound symmetric matrix
with ρ = 0.3; Scenario 3: compound symmetric matrix with ρ = 0.6; Scenario 4: unstructured
matrix with ρ = 0.3; Scenario 5: unstructured matrix with ρ = 0.6; Scenario 6: modified simplex
matrix with correlations of 0.3 and 0.6 on alternating diagonals.

The LPI and PDRC measures tended to result in the
highest percentages of correct rankings and values of
the concordance statistic

The FTR measure tended to result in the lowest
percentages of correct rankings and concordance
followed by the SLRC measure

The LPI and PDRC measures were relatively insensitive to
many of the correlation structures

However, they resulted in a substantial drop in correct
ranking percentages when the data exhibited an
unstructured correlation pattern with a high average
correlation (ρ = 0.6)

Differences in correct ranking percentages across the
correlation structures were smaller for the TDRC and SLRC
measures than for other measures and were smallest for the
FTR measure

Violations of the assumption of covariance
homogeneity had a very small effect on the correct
ranking rates

The correct ranking percentages for all measures were
consistently lower for heavy-tailed than for skewed
distributions

The choice of measures of relative importance depends
on the perspective the researcher wants to take on the
data
◦ contribution of a variable to the discriminant function score
◦ contribution of a variable to the grouping variable effect
◦ contribution of a variable to explaining variation in a
regression model

Inference for relative importance measures and ranks

Comparisons with recent developments in relative
importance measures that are more computationally
intensive (e.g., relative weights)

Extensions to more than two groups