Unit 3: Objectives


Transcript Unit 3: Objectives

Characteristics of Successful Assessment Measures
• Reliable
• Valid
• Efficient
- Time
- Money
- Resources
• Don’t result in complaints
What Do We Mean by Reliability?
• The extent to which a score from a test is consistent and free from errors of measurement
Methods of Determining Reliability
• Test-retest (temporal stability)
• Alternate forms (form stability)
• Internal reliability (item stability)
• Interrater Agreement
Test-Retest Reliability
• Measures temporal stability
- Stable measures
- Measures expected to vary
• Administration
- Same participants
- Same test
- Two testing periods
Test-Retest Reliability
Scoring
• To obtain the reliability of an
instrument, the scores at time
one are correlated with the
scores at time two
• The higher the correlation, the more reliable the test
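As a concrete sketch of this scoring step (Python with NumPy; the scores are made up for illustration), the test-retest reliability is simply the Pearson correlation between the two administrations:

```python
import numpy as np

# Hypothetical scores for the same 8 participants at two testing periods
time1 = np.array([70, 65, 80, 72, 90, 58, 77, 84])
time2 = np.array([72, 63, 78, 75, 88, 60, 79, 82])

# Test-retest reliability = correlation between time 1 and time 2 scores
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: r = {r:.2f}")
```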
Test-Retest Reliability
Problems
• Sources of measurement errors:
- Characteristic or attribute being measured
may change over time.
- Reactivity
- Carryover effects
• Practical problems:
- Time consuming
- Expensive
- Inappropriate for some types of tests
Standard Error of Measurement
• Provides a confidence range around an observed score
- ±1 SE = 68% confidence
- ±1.98 SE = 95% confidence
• The higher the reliability of a test, the lower the
standard error of measurement
• Formula: SEM = SD × √(1 − reliability)
Example
Mean = 70, SD = 10

Reliability    68% Confidence    95% Confidence
.95            2.24              4.39
.90            3.16              6.19
.80            4.47              8.85
.70            5.47              10.73
.50            7.07              13.86
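The SEM calculation can be scripted; a minimal sketch in Python, using the 1.98 multiplier stated above for the 95% band:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD x sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sd = 10  # from the example above (Mean = 70, SD = 10)
for reliability in (.95, .90, .80, .70, .50):
    se = sem(sd, reliability)
    # 68% band is +/- 1 SEM; the 95% band uses the 1.98 multiplier from the slides
    print(f"reliability = {reliability:.2f}   68% = ±{se:.2f}   95% = ±{1.98 * se:.2f}")
```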
Practice Exercise

Reliability    SD      68% Confidence    95% Confidence
.92            4.55
.79            2.21
.85            7.88
.54            7.88
.72            4.55
Exercise Answers

Reliability    SD      68% Confidence    95% Confidence
.92            4.55    1.29              2.55
.79            2.21    1.01              1.98
.85            7.88    3.05              6.04
.54            7.88    5.34              10.57
.72            4.55    2.41              4.77
Serial Killer IQ Exercise
Mean = 100, SD = 15, Reliability = .90
An IQ of 70 is the cutoff for the death penalty

Killer             IQ     95% Confidence
Ed Kemper          145
Henry Lee Lucas    89
Hubert Geralds     73
Louis Craine       69
Clarence Victor    65
Serial Killer IQ - Answers
Mean = 100, SD = 15, Reliability = .90
An IQ of 70 is the cutoff for the death penalty

Killer             IQ     95% Confidence
Ed Kemper          145    135.6 – 154.4
Henry Lee Lucas    89     79.6 – 98.4
Hubert Geralds     73     63.6 – 82.4
Louis Craine       69     59.6 – 78.4
Clarence Victor    65     55.6 – 74.4
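A quick way to check these intervals, and to see which ones contain the IQ-70 cutoff, is sketched below in Python (using the 1.98 multiplier from the earlier slide):

```python
import math

sd, reliability, cutoff = 15, .90, 70
sem = sd * math.sqrt(1 - reliability)      # about 4.74
margin = 1.98 * sem                        # 95% band, per the slides

iqs = {"Ed Kemper": 145, "Henry Lee Lucas": 89, "Hubert Geralds": 73,
       "Louis Craine": 69, "Clarence Victor": 65}

for killer, iq in iqs.items():
    low, high = iq - margin, iq + margin
    includes_cutoff = low <= cutoff <= high    # does the interval contain 70?
    print(f"{killer:16s} {iq:3d}   {low:.1f}-{high:.1f}   includes 70: {includes_cutoff}")
```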
Alternate Forms Reliability
• Establishes form stability
• Used when there are two or more forms of the same test
- Different questions
- Same questions, but different order
- Different administration or response method (e.g., computer, oral)
• Why have alternate forms?
- Prevent cheating
- Prevent carryover from people who take a test more than once
• GRE or SAT
• Promotion exams
• Employment tests
Alternate Forms Reliability
Administration
• Two forms of the same test are developed, and to
the highest degree possible, are equivalent in
terms of content, response process, and statistical
characteristics
• One form is administered to examinees, and at
some later date, the same examinees take the
second form
Alternate Forms Reliability
Counterbalancing

Subjects    First Administration    Second Administration
1-50        Form A                  Form B
51-100      Form B                  Form A
Alternate Forms Reliability
Scoring
• Scores from the first form of test are
correlated with scores from the
second form
• If the scores are highly correlated,
the test has form stability
Difference Between Parallel and Equivalent Forms

Client           Form A    Form B
Malone           6         8
Crane            9         11
Peterson         11        13
Clavin           12        14
Boyd             12        14
Chambers         17        19
Tortelli         18        20
Howe             19        21
Pantusso         21        23
Sternin          24        26
Average Score    14.9      16.9

Every Form B score is exactly two points higher than the corresponding Form A score: the forms rank clients identically and correlate perfectly, but their means differ, so they are equivalent rather than parallel.
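Running the numbers from this table (a small Python/NumPy sketch) makes the point concrete: the two forms correlate perfectly, yet Form B averages two points higher.

```python
import numpy as np

# Form A and Form B scores from the table above
form_a = np.array([6, 9, 11, 12, 12, 17, 18, 19, 21, 24])
form_b = np.array([8, 11, 13, 14, 14, 19, 20, 21, 23, 26])

r = np.corrcoef(form_a, form_b)[0, 1]
print(f"Alternate forms correlation: r = {r:.2f}")       # 1.00 -> strong form stability
print(f"Mean Form A = {form_a.mean():.1f}, Mean Form B = {form_b.mean():.1f}")  # 14.9 vs. 16.9
```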
Alternate Forms Reliability
Disadvantages
• Difficult to develop
• Content sampling errors
• Time sampling errors
What the Research Shows
• Computer vs. Paper-Pencil
- Few test score differences
- Cognitive ability scores are lower on the computer
for speed tests but not power tests
• Item order
- Few differences
• Video vs. Paper-Pencil
- Little difference in scores
- Video reduces adverse impact
Internal Reliability
• Defines measurement error strictly in terms
of consistency or inconsistency in the
content of the test
• With this form of reliability, the test is administered only once; the estimate reflects item stability
Determining Internal Reliability
Split-Half Method
• Test items are divided into two equal parts
• Scores for the two parts are correlated to get
a measure of internal reliability
• Need to adjust for smaller number of items
• Spearman-Brown prophecy formula:
(2 × split-half reliability) ÷ (1 + split-half reliability)
Spearman-Brown Formula
Corrected reliability = (2 × split-half correlation) ÷ (1 + split-half correlation)

If we have a split-half correlation of .60, the corrected reliability would be:
(2 × .60) ÷ (1 + .60) = 1.2 ÷ 1.6 = .75
Spearman-Brown Formula
Estimating the Reliability of a Longer Test

New reliability = (L × reliability) ÷ [1 + (L − 1) × reliability]
L = the number of times longer the new test will be

Example
Suppose you have a test with 20 items and it has a reliability of .50. You wonder if using a 60-item test would result in acceptable reliability.

(3 × .50) ÷ [1 + (3 − 1)(.50)] = 1.5 ÷ [1 + (2)(.50)] = 1.5 ÷ 2 = .75

Estimated new reliability = .75
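Both uses of the Spearman-Brown formula (correcting a split-half correlation and projecting a longer test) reduce to the same calculation; a minimal sketch in Python:

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability of a test length_factor times as long as the original."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Correcting a split-half correlation: the full test is twice as long as each half
print(spearman_brown(.60, 2))   # 0.75

# Lengthening a 20-item test with reliability .50 to 60 items (3 times longer)
print(spearman_brown(.50, 3))   # 0.75
```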
Practice

Split-Half      # of     # Proposed    Corrected      Future
Reliability     Items    Items         Reliability    Reliability
.80
.40
.80             30       45
.40             10       50
Practice Answers

Split-Half      # of     # Proposed    Corrected      Future
Reliability     Items    Items         Reliability    Reliability
.80                                    .89
.40                                    .57
.80             30       45                           .86
.40             10       50                           .77
Common Methods to Determine
Internal Reliability
• Cronbach’s Coefficient Alpha
- Used with ratio or interval data.
• Kuder-Richardson Formula
- Used for tests with dichotomous items
- yes-no
- true-false
- right-wrong
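As an illustration of how these coefficients are computed, here is a minimal sketch of coefficient alpha in Python/NumPy with made-up item scores; with dichotomous (0/1) items the same calculation reproduces the Kuder-Richardson (KR-20) value:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents x n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5 respondents x 4 right/wrong items
scores = [[1, 1, 1, 0],
          [1, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 0, 1]]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```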
Interrater Reliability
• Used when human judgment of performance is
involved in the selection process
• Refers to the degree of agreement between 2 or
more raters
• 3 common methods used to determine interrater
reliability
- Percent agreement
- Correlation
- Cohen’s Kappa
Interrater Reliability Methods
Percent Agreement
• Determined by dividing the total number of
agreements by the total number of observations
• Problems
- Exact match?
- Very high or very low frequency behaviors can
inflate agreement
Interrater Reliability Methods
Correlation
• Ratings of two judges are correlated
• Pearson for interval or ratio data and Spearman
for ordinal data (ranks)
• Problems
- Shows pattern similarity but not similarity of actual
ratings
Interrater Reliability Methods
Cohen’s Kappa
• Corrects the observed level of agreement for the level of agreement expected by chance
• A Kappa of .70 or higher is considered
acceptable agreement
Example

                              Forensic Examiner A
                              Insane    Malingering    Total
Forensic       Insane         16        0              16
Examiner B     Malingering    1         3              4
               Total          17        3              20

K = (Po − Pc) ÷ (1 − Pc)

Po = (16 + 3) ÷ 20 = 19 ÷ 20 = .95
Pc = [(17 × 16) + (3 × 4)] ÷ (20 × 20) = (272 + 12) ÷ 400 = 284 ÷ 400 = .71
K = (.95 − .71) ÷ (1 − .71) = .24 ÷ .29 = .83
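The same arithmetic can be scripted; a short Python/NumPy sketch that also reports the simple percent agreement (Po) described earlier:

```python
import numpy as np

# Agreement table from the example: rows = Examiner B, columns = Examiner A
#                   A: Insane   A: Malingering
table = np.array([[16,          0],     # B: Insane
                  [ 1,          3]])    # B: Malingering

n = table.sum()                                               # 20 observations
po = np.trace(table) / n                                      # percent agreement = .95
pc = (table.sum(axis=0) * table.sum(axis=1)).sum() / n**2     # chance agreement = .71
kappa = (po - pc) / (1 - pc)                                  # = .83
print(f"Po = {po:.2f}, Pc = {pc:.2f}, kappa = {kappa:.2f}")
```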
Demonstration

Television Show         Like (1) or Dislike (0)
Elimidate
Law and Order
Friends
CSI
X-Files
Walker Texas Ranger
Just Shoot Me
Will and Grace
Dharma and Greg
JAG
Increasing Rater Reliability
• Have clear guidelines regarding
various levels of performance
• Train raters
• Practice rating and provide
feedback
Scorer Reliability
• Allard, Butler, Faust, & Shea (1995)
- 53% of hand-scored personality tests contained at least one error
- 19% contained enough errors to alter a clinical diagnosis
Validity
• The degree to which inferences from scores on tests or assessments are justified by the evidence
Validity refers to the degree to which evidence and theory support
the interpretations of test scores entailed by proposed uses of tests.
... The process of validation involves accumulating evidence to
provide a sound scientific basis for the proposed score
interpretations. It is the interpretations of test scores required by
proposed uses that are evaluated, not the test itself. When test
scores are used or interpreted in more than one way, each intended
interpretation must be validated. Sources of validity evidence
include but are not limited to: evidence based on test content, evidence
based on response processes, evidence based on internal structure,
evidence based on relations to other variables, evidence based on
consequences of testing.
Standards for Educational and Psychological Testing (1999)
Common Methods of Determining Validity
• Content Validity
• Criterion Validity
• Construct Validity
• Known Group Validity
• Face Validity
Content Validity
• The extent to which test items sample the
content that they are supposed to measure
• In industry, the appropriate content of a test or test battery is determined by a job analysis
• Considerations
- The content that is actually in the test
- The content that is not in the test
- The knowledge and skill needed to answer the
question
Test of Logic
• Stag is to deer as ___ is to human
• Butch is to Sundance as ___ is to Sinatra
• Porsche is to cars as Gucci is to ____
• Puck is to hockey as ___ is to soccer
What is the content of this exam?
Messick (1995)
Sources of Invalidity
• Construct underrepresentation
• Construct-irrelevant variance
- Construct-irrelevant difficulty
- Construct-irrelevant easiness
[Diagram: overlap between Domain Content and Test Content]
Criterion Validity
• Criterion validity refers to the extent to which a test score is related to some measure of job performance called a criterion
• Established using one of the following research designs:
- Concurrent Validity
- Predictive Validity
- Validity Generalization
Concurrent Validity
• Uses current employees
• Range restriction can be a problem
Predictive Validity
• Correlates test scores with future
behavior
• Reduces the problem of range
restriction
• May not be practical
Validity Generalization
• Validity Generalization is the extent to which a test
found valid for a job in one location is valid for the
same job in a different location
• The key to establishing validity generalization is
meta-analysis and job analysis
Construct Validity
• The extent to which a test actually measures
the construct that it purports to measure
• Is concerned with inferences about test
scores
• Determined by correlating scores on a test with scores from other tests
Construct Valid Measures
• Correlate highly with measures of similar
constructs
• Don’t correlate highly with irrelevant or competing constructs
• Correlate with other measures of the
construct that are measured in different ways
• Don’t correlate highly with competing or
irrelevant constructs that are measured in
similar ways
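A toy illustration of this logic (Python/NumPy with entirely made-up scores; the measure names are hypothetical): a new conscientiousness test should correlate strongly with an existing conscientiousness measure and weakly with an unrelated construct such as cognitive ability.

```python
import numpy as np

# Made-up scores for 10 people on three measures (illustration only)
new_conscientiousness   = np.array([12, 15, 9, 18, 14, 11, 16, 13, 10, 17])
other_conscientiousness = np.array([11, 16, 8, 19, 13, 12, 15, 14, 9, 18])
cognitive_ability       = np.array([101, 93, 118, 99, 107, 95, 112, 104, 98, 110])

convergent = np.corrcoef(new_conscientiousness, other_conscientiousness)[0, 1]
discriminant = np.corrcoef(new_conscientiousness, cognitive_ability)[0, 1]
print(f"Convergent r = {convergent:.2f} (should be high)")
print(f"Discriminant r = {discriminant:.2f} (should be near zero)")
```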
Messick (1995)
Facets of Validity

                       Test Interpretation     Test Use
Evidential Basis       Construct Validity      Relevance & Utility
Consequential Basis    Value Implications      Social Consequences
Face Validity
• The extent to which a test appears to be job related
• Reduces the chance of legal challenge
• Increases test taker motivation
• Increases acceptance of test results
• Increasing face validity
- Item content
- Professional look of the test
- Explaining the nature of the test
Known-Group Validity
• Compares test scores of groups “known”
to be different
• If there are no differences, the test may not be valid
• If there are differences, firm conclusions are still difficult to draw