Assessing the Construct Validity of Assessment Centers (AC)
CONSTRUCT VALIDITY OF ASSESSMENT CENTRES:
LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS
31st Annual ACSG Conference, March 2011
What is known about construct validity currently:
Over the last 50 years, ACs have been popular in the assessment of individual differences for managerial development purposes
Multi-occupation, multi-company investigation with high face validity
Exercise effects in AC post-exercise dimension ratings (PEDRs) are more pervasive than cross-situational stability in candidate ratings
Bowler, M. C., & Woehr, D. J. (2006). A meta-analytic evaluation of the impact of dimension and exercise factors on assessment center ratings. Journal of Applied Psychology, 91, 1114–1124.
Lance, Lambert, Gewin, Lievens, & Conway (2004) found in a meta-analysis that exercise effects explain almost three times more variance than dimension effects
Problematic for construct validity: PEDRs are a function of exercise design and not of person competencies.
What is known about construct validity currently:
Recently there have been two schools of thought on assessing the construct validity of ACs:
Confirmatory Factor Analysis (CFA) [MTMM]
Generalizability Theory
Four basic models within the CFA tradition:
Correlated dimensions, correlated exercises model (CDCE): MTMM
One dimension, correlated exercises model (1DCE)
Uncorrelated dimensions, correlated exercises, plus g model (UDCE + g)
Correlated dimensions, correlated uniqueness model (CDCU)
Lance, Woehr & Meade (2007). A Monte Carlo Investigation of Assessment Center Construct Validity Models. Organizational Research Methods, 10(3), 430-448.
Advantages of CFA approach
Partitions out error variance; ALSO
Partitions out exercise effects
Thus PEDRs are a function of both exercise and dimension effects
However, the CTCE model is technically difficult to estimate (empirical under-identification)
Construct validity is a prerequisite for partitioning out exercise effects
Thus the critical first step was to assess the construct validity of dimensions with actual DAC data
An Example: Achievement Motivation and Financial Perspective
An Example: Achievement Motivation
AM:
DIMENSION: ACHIEVEMENT MOTIVATION
TRAITS          | ANALYSIS PROBLEM (AP) | SIMULATED IN-BASKET (SIB)
INNOVATION      | IN_AP                 |
ENERGY          | EN_AP                 |
PROCESS SKILLS  | PS_AP                 | PS_SIB
Correlation Matrix
AM: Option 1
The CDCE model would be preferable. Why? It differentiates the sources of variance.
[Figure: Scenario 2 path diagram with the Achievement Motivation dimension factor and the Analysis Problem and Simulated In-Basket exercise factors loading on the indicators]
-Empirical under-identification
-The model has 13 parameters to estimate, yet the covariance matrix holds only 10 pieces of information
-Thus there are too many model parameters to estimate from too little information (-3 df)
-Similar to the equation X + Y = 6
-There are unlimited possible combinations that solve the equation
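The counting argument can be sketched in a few lines of Python. The helper name `degrees_of_freedom` is hypothetical, and the breakdown of the 13 parameters (four dimension loadings, four exercise loadings, four error variances, one factor correlation) is an assumption for illustration; the slide states only the totals.

```python
def degrees_of_freedom(n_indicators: int, n_free_parameters: int) -> int:
    """Unique variances and covariances minus free parameters.

    A negative result means the model asks for more than the data can supply.
    """
    unique_moments = n_indicators * (n_indicators + 1) // 2
    return unique_moments - n_free_parameters


# Assumed breakdown of the 13 parameters (only the total appears on the slide):
# 4 dimension loadings + 4 exercise loadings + 4 error variances + 1 factor correlation.
n_params = 4 + 4 + 4 + 1
print(degrees_of_freedom(n_indicators=4, n_free_parameters=n_params))  # -3
```

A non-negative count is only a necessary condition; a model can still fail to converge for other reasons.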
[Figure: path diagram with factors Achievement Motivation, Analysis Problem and Simulated In-Basket, indicators x1 to x4, error terms δ, factor loadings lx, and factor correlation φ21]
AM: Technical Problems
The Simulated In-Basket measures only one dimension (trait): Process Skills
Whereas Innovation, Energy and Process Skills are all gauged with the Analysis Problem exercise
For basic CFA we need at least three indicators for each dimension.
However, if we have a single dimension and a single exercise effect we need a minimum of five indicators
This has DAC design implications if we want to gauge the measurement effect in addition to the dimension effects
A literature review by Lievens and Conway (2001) suggests a median of three exercises and five dimensions
AM: Option 2
• Still not enough degrees of freedom; at least 5 indicators are needed (10 possible sources of information, yet 12 parameters must be estimated, thus -2 df)
• SOLUTION: Include more exercises per dimension
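A short sketch of why five indicators is the minimum for one dimension plus one global method effect. It assumes each indicator has a dimension loading, a method loading and an error variance (3 parameters per indicator), with factor variances fixed to 1 and the two factors uncorrelated. That specification is an assumption chosen because it reproduces the 12 parameters quoted above; the function name is hypothetical.

```python
def df_one_dimension_one_method(n_indicators: int) -> int:
    """Degrees of freedom for one dimension factor plus one global method factor.

    Assumed specification: per indicator one dimension loading, one method
    loading and one error variance; factor variances fixed, factors uncorrelated.
    """
    unique_moments = n_indicators * (n_indicators + 1) // 2
    free_parameters = 3 * n_indicators
    return unique_moments - free_parameters


for p in range(3, 8):
    print(p, df_one_dimension_one_method(p))
# 3 -3
# 4 -2
# 5 0
# 6 3
# 7 7
```

Under these assumptions five indicators only just balance the count (0 df), so a sixth is needed before the model can be tested against the data.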
[Figure: path diagram with the Achievement Motivation factor and a Global Method Effect factor, both loading on the four indicators]
Financial Perspective (FP)
DIMENSION: FINANCIAL PERSPECTIVE
TRAITS                 | ANALYSIS PROBLEM (AP) | GROUP DISCUSSION (GD) | ONE:ONE (ONE) | SIMULATED IN-BASKET (SIB)
BROKER MARKET (BM)     | BM_AP                 | BM_GD                 | BM_ONE        | BM_SIB
CROSS UP SELLING (CUS) | CUS_AP                | CUS_GD                | CUS_ONE       | CUS_SIB
PROFIT (PROF)          | PROF_AP               | PROF_GD               | PROF_ONE      | PROF_SIB
Correlation Matrix
Large correlations between exercises
FP: CTCE
[Figure: path diagram with dimension factors Broker Market, Cross Up Selling and Profit, exercise factors Analysis Problem, Group Discussion, One:One and Simulated In-Basket, and the twelve indicators BM_AP to PROF_SIB]
FP: CDCE
The model did not converge although there were enough df (78-44 = 34 df)
Singularity problems: chiefly because of multicollinearity
Go back to the dimension level only, without exercise effects
Thus only Broker Market, Cross Up Selling and Profit individually
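One simple way to screen for the singularity problem is to check whether the determinant of the correlation matrix is close to zero. The sketch below uses a hypothetical 3 x 3 correlation matrix, not the DAC data, and a hand-written `determinant` helper so that it needs no external libraries.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # exactly singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


# Hypothetical correlations: the first two ratings are almost redundant.
near_collinear = [
    [1.00, 0.99, 0.50],
    [0.99, 1.00, 0.50],
    [0.50, 0.50, 1.00],
]
uncorrelated = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(determinant(near_collinear))  # close to zero: nearly singular
print(determinant(uncorrelated))    # 1.0
```

A determinant near zero means at least one rating is almost a linear combination of the others, which is the condition under which estimation tends to break down.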
CFA: Broker Market: FIT
CHI-SQUARE = 2.600 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.27247
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 2.428.
FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.942
BENTLER-BONETT NON-NORMED FIT INDEX = 0.954
COMPARATIVE FIT INDEX (CFI) = 0.985
BOLLEN'S (IFI) FIT INDEX = 0.986
MCDONALD'S (MFI) FIT INDEX = 0.997
JORESKOG-SORBOM'S GFI FIT INDEX = 0.988
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.938
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.013
STANDARDIZED RMR = 0.035
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.056
90% CONFIDENCE INTERVAL OF RMSEA (0.000, 0.217)
RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.607
RELIABILITY COEFFICIENT RHO = 0.613
CFA: BM: Parameter estimates
• Thus BM showed good fit and parameter estimates
• Broker Market Simulated in Basket was the best
predictor of Broker Market
• All factor loadings were statistically significant
(p<0.05)
[Figure: path diagram, BROKER MARKET (variance fixed at 1.0) loading on BM_AP, BM_GD, BM_ONE and BM_SIB; loadings shown: 0.41*, 0.53*, 0.64*, 0.61*; error paths E1* to E4*: 0.91, 0.85, 0.77, 0.80]
Figure X: EQS 6 broker market trait only Chi Sq.=2.60 P=0.27 CFI=0.98 RMSEA=0.06
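The probability value reported with the chi-square can be checked by hand. The sketch below uses the closed form of the chi-square upper tail that holds for even degrees of freedom, so it needs only the standard library; the function name is hypothetical, and the rounded chi-square from the output is used as input, so agreement with EQS is only to about three decimals.

```python
import math


def chi_square_p_value_even_df(chi_square: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = chi_square / 2.0
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series


# Broker Market model: chi-square = 2.600 on 2 degrees of freedom.
print(round(chi_square_p_value_even_df(2.600, 2), 2))  # 0.27
```

A non-significant p-value (here above 0.05) means the model-implied covariance matrix does not differ detectably from the observed one, which is what good fit looks like on this statistic.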
CFA: CUS: FIT
• Problems with fit (BBNFI, IFI) and reliability.
• ERROR MESSAGE IN EQS DUE TO SINGULARITY OF
COVARIANCE MATRIX
CHI-SQUARE = 1.456 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.48280
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 1.391.
FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.972
BENTLER-BONETT NON-NORMED FIT INDEX = 1.036
COMPARATIVE FIT INDEX (CFI) = 1.000
BOLLEN'S (IFI) FIT INDEX = 1.011
MCDONALD'S (MFI) FIT INDEX = 1.003
JORESKOG-SORBOM'S GFI FIT INDEX = 0.993
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.964
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.010
STANDARDIZED RMR = 0.026
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.000
90% CONFIDENCE INTERVAL OF RMSEA (0.000, 0.183)
RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.630
RELIABILITY COEFFICIENT RHO = 0.635
MAXIMAL WEIGHTED INTERNAL CONSISTENCY RELIABILITY = 0.684
CFA: CUS: Parameter estimates
• Indicators did not do that well this time.
• Best predictor was Group Discussion
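[Figure: path diagram, CROSS UP SELLING (variance fixed at 1.00) loading on CUS_AP, CUS_GD, CUS_ONE and CUS_SIB; values shown in the diagram, in extraction order: 0.27, 0.36, 0.28, 0.24, 0.10, 0.36, 0.39, 0.29]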
Figure X: EQS 6 cross up selling trait only Chi Sq.=1.46 P=0.48 CFI=1.00 RMSEA=0.00
CFA: PROFIT: FIT
CHI-SQUARE = 0.634 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.72820
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 0.621.
FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.988
BENTLER-BONETT NON-NORMED FIT INDEX = 1.090
COMPARATIVE FIT INDEX (CFI) = 1.000
BOLLEN'S (IFI) FIT INDEX = 1.027
MCDONALD'S (MFI) FIT INDEX = 1.007
JORESKOG-SORBOM'S GFI FIT INDEX = 0.997
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.984
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.007
STANDARDIZED RMR = 0.018
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.000
90% CONFIDENCE INTERVAL OF RMSEA (0.000, 0.143)
RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.633
RELIABILITY COEFFICIENT RHO = 0.642
MAXIMAL WEIGHTED INTERNAL CONSISTENCY RELIABILITY = 0.688
CFA: PROFIT: Parameter estimates
• Group discussion is once again the best predictor
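[Figure: path diagram, PROFIT (variance fixed at 1.00) loading on PROF_AP, PROF_GD, PROF_ONE and PROF_SIB; values shown in the diagram, in extraction order: 0.37, 0.25, 0.26, 0.11, 0.26, 0.42, 0.34, 0.31]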
Figure X: EQS 6 profit trait only Chi Sq.=0.63 P=0.73 CFI=1.00 RMSEA=0.00
CFA: Three dimensions no Exercise effects
• The model did not work, and neither did a single universal dimension
[Figure: path diagram with the Broker Market, Cross Up Selling and Profit factors, each loading on its four exercise indicators (BM_AP to PROF_SIB) with error terms E1 to E12]
Conclusion
The Broker Market sub-dimension worked individually, but the Cross Up Selling and Profit sub-dimensions did not
For this reason we cannot expect the combined CFA model, which incorporates all three dimensions, to work
Problems have to be worked out on the sub-scale level first before moving on to the global level
Because construct validity is lacking at the sub-scale level it does not make sense to look at the exercise effects
Construct validity must be sorted out on the sub-scale level first
G-theory
Generalizability theory (G-theory) extends the framework of classical test
theory in order to take into account the multiple sources of variability
that can have an effect on test scores (Lynch & McNamara, 1999)
In a DAC the following sources of variance are often considered:
Person
Exercise
Dimension
Person*Dimension interaction (cross-situational specificity)
Person*Exercise interaction (low construct validity)
Dimension*Exercise interaction (observability of a particular dimension)
A G-study is then designed to estimate the relative effects of these facets on test performance data.
The overall index of reliability (similar to Cronbach's coefficient alpha) is expressed as the phi (Φ) coefficient and is also referred to generally as “an index of dependability”
Meaning of different sources
of variance in DAC
Dimension effect: variance in ratings attributed to certain dimensions, i.e. certain dimensions receiving higher/lower ratings compared to others
Person effect: general performance factor of persons
Exercise effect: certain exercises overall receive higher/lower ratings in comparison with others
Person*Dimension effect: amount of variance attributed to a person's performance on a dimension across exercises; this is indicative of cross-situational specificity
Person*Exercise effect: amount of variance attributed to a person receiving a high/low rating on certain exercises regardless of the dimension being measured
Dimension*Exercise effect: amount of variance attributed to a specific dimension being measured in a specific exercise; referred to as the observability of a particular dimension
Construct Validity
G-study construct validity: person, dimension and person*dimension variance must collectively exceed the exercise and person*exercise variance
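The criterion is a simple comparison of two sums of variance components. A minimal sketch follows, with a hypothetical function name and made-up component values that are not taken from the DAC data.

```python
def dimensions_outweigh_exercises(components: dict) -> bool:
    """True when person + dimension + person*dimension variance exceeds
    exercise + person*exercise variance."""
    dimension_related = (
        components["person"]
        + components["dimension"]
        + components["person*dimension"]
    )
    exercise_related = components["exercise"] + components["person*exercise"]
    return dimension_related > exercise_related


# Hypothetical variance components, for illustration only.
example = {
    "person": 0.20,
    "dimension": 0.05,
    "person*dimension": 0.10,
    "exercise": 0.15,
    "person*exercise": 0.30,
}
print(dimensions_outweigh_exercises(example))  # False
```

In this made-up case the exercise-related components (0.45) outweigh the dimension-related ones (0.35), so the criterion for construct validity is not met.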
Consider a practical DAC example with G-theory
N=372
Nine dimensions, each measured with mostly two exercises:
Simulated In-Basket
Role Play
A Practical example
[Table: dimensions rated per exercise (SIB, Role Play, Interview)]
Dimensions: Change Orientation; Communication; Customer Service Orientation; Interpersonal Interaction; Planning & Organizing; Problem Analysis & Decision-making; Self-Management; Team Management
A Practical example: Variance Components for entire DAC
A Practical example:
Important note
In SPSS:
For the ANOVA and MINQUE methods, negative variance
component estimates may occur. Some possible reasons for their
occurrence are:
(a) the specified model is not the correct model, or
(b) the true value of the variance equals zero
In light of the foregoing example:
Variance attributed to exercise effects (.108) > variance attributed to person effects (.322)
This finding seems to be in line with Lance et al.'s (2004) contention that method effects are three times larger than trait effects
In the current example, 2.9 times more variance was explained by exercise effects than by dimension effects.
A Practical example: Variance Components for selected dimensions
However, could it be that the G-study on the entire DAC ironed out some of the robust dimension effects on the sub-dimension level?
I.e. are we throwing out the good with the bad?
To investigate the relative contribution of each dimension to the overall G-coefficient, one could conduct forward G-analysis on the individual dimension level
However, when we calculate the Φ coefficient on the subscale level, there will be no variance component for the dimension, dimension*exercise, dimension*person, or dimension*person*exercise effect
The biggest problem with the approach is that it cannot compare person*dimension variance with person*exercise variance, since no person*dimension variance component is generated
However, it is still possible to compare person variance with person*exercise variance
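A minimal sketch of such a subscale-level G-study, assuming a fully crossed persons-by-exercises design with one rating per cell, in which case the person*exercise component is confounded with error. Variance components come from the usual ANOVA expected mean squares; the function name and the ratings are hypothetical.

```python
def g_study_person_by_exercise(scores):
    """scores[p][e]: one rating per person (row) and exercise (column)."""
    n_p, n_e = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_e)
    person_means = [sum(row) / n_e for row in scores]
    exercise_means = [sum(row[e] for row in scores) / n_p for e in range(n_e)]

    ss_person = n_e * sum((m - grand) ** 2 for m in person_means)
    ss_exercise = n_p * sum((m - grand) ** 2 for m in exercise_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_person - ss_exercise

    ms_person = ss_person / (n_p - 1)
    ms_exercise = ss_exercise / (n_e - 1)
    ms_residual = ss_residual / ((n_p - 1) * (n_e - 1))

    var_pe = ms_residual                        # person*exercise (plus error)
    var_p = (ms_person - ms_residual) / n_e     # person
    var_e = (ms_exercise - ms_residual) / n_p   # exercise
    # ANOVA estimates can come out negative; they are left unclipped here.
    phi = var_p / (var_p + (var_e + var_pe) / n_e)
    return {"person": var_p, "exercise": var_e,
            "person*exercise": var_pe, "phi": phi}


# Hypothetical ratings for four persons on two exercises.
ratings = [[2.0, 3.5], [3.0, 3.0], [4.0, 4.5], [2.5, 4.0]]
result = g_study_person_by_exercise(ratings)
print(result["person"], result["person*exercise"], result["phi"])
```

Comparing `result["person"]` with `result["person*exercise"]` is the comparison that remains possible at the subscale level.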
A Practical example: Variance of Communication
A Practical example: Variance of Team Management
Final Verdict: G-study and DAC
Investigate dimensions individually to assess the contribution of different sources
Poorly designed dimensions may inflate the observed variance attributed to exercise, exercise by dimension, and exercise by person effects
The way G-studies are conducted has design implications for DAC: all vs. some approach to design
IRT ANALYSIS
Previously we noted:
Recently there have been two schools of thought on assessing the construct validity of ACs:
Confirmatory Factor Analysis (CFA) [MTMM]
Generalizability Theory
Fairly new area: IRT modeling with interval data
Consider Achievement Motivation discussed earlier
IRT Approach
The logistic model dictates that a respondent's response to an item should depend on two parameters only:
Difficulty of endorsing the item (item location parameter)
Standing of the respondent on the latent trait (person location parameter)
The expectation is that persons with a higher standing on the latent trait should have a higher probability of endorsing a particular item compared to persons with a lower standing on the same trait
This is a key requirement for DAC since the central aim is to discriminate between persons who are low and high on the trait (dimension).
Deviations from these expectations might suggest that the DAC exercises are not operating as expected
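The two-parameter dependence can be written down directly. The sketch below shows the dichotomous one-parameter logistic (Rasch) form with a hypothetical function name; DAC ratings are polytomous, so this only illustrates the principle.

```python
import math


def endorsement_probability(person_location: float, item_location: float) -> float:
    """Probability of endorsing an item under the dichotomous Rasch model.

    Depends only on the difference between person and item location (in logits).
    """
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


# A person located exactly at the item's difficulty has a 50% chance.
print(endorsement_probability(0.0, 0.0))  # 0.5
# A higher standing on the trait gives a higher probability for the same item.
print(endorsement_probability(1.0, 0.0) > endorsement_probability(-1.0, 0.0))  # True
```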
Rating scale
The current DAC was rated on a 5-point response scale with non-integer values (i.e. decimal values)
Common wisdom: more response categories = a more reliable measure that resembles interval data.
However, it remains to be seen whether people actually distinguish between the response categories.
It is expected that the thresholds between the 5 response categories will be sequentially ordered along the latent trait
We can examine the graphed category response functions to see whether each of the 5 response categories becomes the modal category at some point on the latent trait continuum
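The check on modal categories can be sketched with the Andrich rating scale model, which is one common polytomous Rasch model and is assumed here for illustration. The thresholds below are hypothetical, not the estimates from this DAC, and categories are indexed 0 to 4 rather than 1 to 5.

```python
import math


def category_probabilities(theta, item_location, thresholds):
    """Category probabilities under the Andrich rating scale model.

    The exponent for category k is the cumulative sum of
    (theta - item_location - tau_j) over thresholds j <= k; category 0 has 0.
    """
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - item_location - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(item_location, thresholds, lo=-8.0, hi=8.0, steps=321):
    """Categories that are the most probable somewhere on a grid of theta."""
    modal = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        probs = category_probabilities(theta, item_location, thresholds)
        modal.add(probs.index(max(probs)))
    return sorted(modal)


# Hypothetical thresholds: ordered, then with the middle two reversed.
print(modal_categories(0.0, [-2.0, -1.0, 1.0, 2.0]))  # [0, 1, 2, 3, 4]
print(modal_categories(0.0, [-2.0, 1.0, -1.0, 2.0]))  # [0, 1, 3, 4]
```

With disordered thresholds the middle category is never the most likely rating anywhere on the continuum, which is the pattern to look for in the empirical category plots.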
Empirical response categories for IN_AP
Empirical response categories for EN_AP
Empirical response categories for PS_AP
Empirical response categories for PS_SIB
Empirical response categories
ITEM DIFFICULTY MEASURE OF -1.13 ADDED TO MEASURES
-------------------------------------------------------------------
|CATEGORY   OBSERVED|OBSVD SAMPLE|INFIT OUTFIT||STRUCTURE|CATEGORY|
|LABEL SCORE COUNT %|AVRGE EXPECT|  MNSQ  MNSQ||CALIBRATN| MEASURE|
|-------------------+------------+------------++---------+--------|
|  1   1       1   1| -4.95 -4.41|   .29   .10||  NONE   |( -9.04)| 1
|  2   2      27  54|  -.60  -.43|   .55   .62||   -6.82 |  -3.31 | 2
|  3   3      15  30|  1.89  1.67|   .72   .42||    2.46 |   2.28 | 3
|  4   4       7  14|  3.55  3.31|   .68   .60||    4.36 |(  4.43)| 4
|  5   1       1    |            |            ||  NONE   |        | 5
-------------------------------------------------------------------
OBSERVED AVERAGE is mean of measures in category. It is not a parameter estimate.
Response Scales
What we see here is that although there are supposed to be 5 response categories, raters effectively make use of three response categories when rating PEDRs
Furthermore, person reliability is not very good.
Person reliability estimates the confidence we have that people will be allocated to the same rank order when exposed to the Achievement Motivation DAC again
This is similar to the person*dimension effect in G-studies
Fit Statistics
SUMMARY OF 97 MEASURED (NON-EXTREME) PERSON
-------------------------------------------------------------------------------
|          TOTAL                         MODEL         INFIT        OUTFIT    |
|          SCORE     COUNT     MEASURE   ERROR      MNSQ   ZSTD   MNSQ   ZSTD |
|-----------------------------------------------------------------------------|
| MEAN      34.6       8.0         .66     .64      1.01    -.2   1.05    -.1 |
| S.D.       6.0        .0        2.32     .16       .94    1.3   1.29    1.3 |
| MAX.      48.0       8.0        5.09    1.29      5.40    3.4   8.32    4.2 |
| MIN.      19.0       8.0       -8.77     .45       .10   -2.5    .08   -2.7 |
|-----------------------------------------------------------------------------|
| REAL RMSE    .76 TRUE SD   2.19  SEPARATION  2.88  PERSON RELIABILITY  .89  |
|MODEL RMSE    .66 TRUE SD   2.22  SEPARATION  3.39  PERSON RELIABILITY  .92  |
| S.E. OF PERSON MEAN = .24                                                   |
-------------------------------------------------------------------------------
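The separation and reliability figures in the summary are linked by two standard Rasch formulas: separation is the true standard deviation divided by the root mean-square error, and reliability is separation squared over one plus separation squared. A short sketch with hypothetical function names, using the REAL row of the table:

```python
def separation_index(true_sd: float, rmse: float) -> float:
    """Spread of the person measures in units of their measurement error."""
    return true_sd / rmse


def reliability_from_separation(separation: float) -> float:
    """Share of observed variance that is true variance."""
    return separation ** 2 / (1.0 + separation ** 2)


# REAL row of the summary table: RMSE = .76, TRUE SD = 2.19.
print(round(separation_index(2.19, 0.76), 2))         # 2.88
print(round(reliability_from_separation(2.88), 2))    # 0.89
```

The MODEL row follows the same formulas; small differences from the printed values there come from rounding of the inputs shown in the table.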
Fit Statistics: PERSON AND ITEM
PARAMETERS
ITEM STATISTICS: MISFIT ORDER
--------------------------------------------------------------------------------------------------
|ENTRY   TOTAL  TOTAL           MODEL|   INFIT  |  OUTFIT  |PT-MEASURE |EXACT MATCH|             |
|NUMBER  SCORE  COUNT  MEASURE  S.E. |MNSQ  ZSTD|MNSQ  ZSTD|CORR.  EXP.| OBS%  EXP%| ITEM      G |
|------------------------------------+----------+----------+-----------+-----------+-------------|
|     8    590     98     -.24   .17|1.48   2.7|2.45   4.3|A .61   .74| 53.6  64.1| TRANS_SIB 0 |
|     6    632     98     1.63   .14|1.24   1.4|1.14    .6|C .80   .81| 62.9  62.0| TRANS_EN  0 |
|     7    545     98    -1.53   .14| .95   -.3| .95   -.1|D .80   .78| 57.7  56.6| TRANS_PS  0 |
|     5    545     98    -1.53   .14| .92   -.5| .83   -.6|d .81   .78| 58.8  56.6| TRANS_IN  0 |
|------------------------------------+----------+----------+-----------+-----------+-------------|
| MEAN   426.4   98.0      .00   .18|1.00   -.1|1.05   -.1|           | 65.7  64.4|             |
| S.D.   154.5     .0     1.52   .03| .29   1.9| .58   2.0|           |  8.2   5.4|             |
--------------------------------------------------------------------------------------------------
Thus, the high infit ZSTD statistic in this table shows that PS_SIB underestimates the expected item scores
Expected Item Characteristic Curves: PS_SIB
Expected Item Characteristic Curves: EN_AP
Expected Item Characteristic Curves: EN_AP
Expected Item Characteristic Curves: PS_SIB
Validation problems of DAC’s
If the SEM approach is to be preferred:
Empirical considerations
At least 5 exercises per dimension for a unidimensional construct and a single exercise effect
If the 1DCE approach is used with multiple subdimensions, then at least 3 exercises per subdimension are needed
Multiple raters for each dimension
Sample size > 150
Minimum of 5-point rating scale
Validation problems of DAC’s
Substantive considerations:
Theoretical underpinnings of DAC dimensions
Are we really measuring more than fluid intelligence (g) in DACs?
Have we considered discriminant and convergent validity outside the MTMM doctrine: cross-validation with paper & pencil measures?
Rater calibration: higher inter-rater agreement at the expense of restriction of range and construct validity
PEDRs lie at the heart of the problem: What are we rating?
[Diagram: Competency potential, Observable behaviour and Competence, each linked to PEDRs with a question mark]
PEDRs lie at the heart of the problem: What are we rating?
If we are purporting to measure competency potential, would it not be better to use paper & pencil measures with more control (standardisation) and objectivity?
When designing exercises to measure AC dimensions, what is the constitutive meaning of the proposed dimensions, e.g. “Creative thinking & Entrepreneuric Energy”?
Why not cross-validate AC constructs with “known constructs”?
For example: Empowering Leadership (DAC) with Transformational Leadership (Bass & Avolio, 1995).
Rating calibration: Guidelines vs Rules!
More variance in PEDRs when raters are given more discretion (i.e. guidelines, not rules)
PEDRs lie at the heart of the problem: What are we rating?
Exercises: Uni-dimensionality is paramount
Avoid conglomeration of constructs when designing exercises
Be adamant about micro measurement through thoroughly designed scoring reports
Attach a scoring scale to each elicited behaviour
Can raters list all observable behaviours without guidance?
Finally: Is DAC a new science? OR
Can we apply some known psychometric truths to DAC, or is “behaviour too complex to measure”?
Legislative Pitfalls !!
LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS
EEA implications:
The use of psychometric tests in South Africa is monitored and guided by the Employment Equity Act (Republic of South Africa, 1998), which prohibits the use of psychological tests unless it can be shown that the tests are valid and not biased against any employee or group (i.e., without measurement bias)
In paragraph 8 of the Employment Equity Act (Republic of South Africa, 1998, p. 16) this position is reiterated and qualified by stating:
Psychological testing and other similar assessments of an employee are prohibited unless the test or assessment being used:
a) has been scientifically shown to be valid and reliable;
b) can be applied fairly to all employees;
c) is not biased against any employee or group.
Legislative Pitfalls !!
According to the main propositions of the EEA, users of psychometric tests are obliged to provide evidence that their selection processes adhere to the Act.
THUS, whenever an allegation of discrimination is advanced, the burden of proof shifts to the employer to demonstrate the job-relatedness of the selection procedure and that the inferences derived from the predictor scores are fair.
This interpretation is reinforced in Chapter II of the EEA under the heading “burden of proof”, paragraph 11:
Whenever unfair discrimination is alleged in terms of the Act, the employer against whom the allegation is made must establish that it is fair
Is it possible to immunize oneself from EEA legislation by claiming to use DAC for developmental vs. selective purposes?
Ultimately, a developmental DAC can still discriminate unfairly, especially in promotional practices
In an effort to avoid legislative pitfalls: Make sure to get psychometric “INTEL” on DAC
LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS