
CHAPTER 5
Test Scores as
Composites
This Chapter is about the Quality of
Items in a Test.
1
Test Scores as Composites
What is the Composite Test Score?
A composite test score is a total test score created by summing two or more subtest scores, e.g., the WAIS-IV Full Scale IQ consists of 1-Verbal Comprehension Index, 2-Perceptual Reasoning Index, 3-Working Memory Index, and 4-Processing Speed Index.
Qualifying Examinations and EPPP Exams are also composite test scores.
2
Item Scoring Schemes [skeems]/Systems
We have 2 different scoring systems:
1. Dichotomous Scores
Dichotomous scores are restricted to 0 and 1, such as scores on true-false and multiple-choice questions.
2. Non-dichotomous Scores
Non-dichotomous scores are not restricted to 0 and 1; they can have a range of possible points, such as in essays: 1, 2, 3, 4, 5, ...
3
Dichotomous Scheme Examples
1. The space between nerve cell endings is called the
a. Dendrite
b. Axon
c. Synapse
d. Neutron
(In this item, responses a, b, and d are scored 0; response c is scored 1.)
2. Teachers in public school systems should have the right to strike.
a. Agree
b. Disagree
(In this item, a response of Agree is scored 1; Disagree is scored 0. Alternatively, the response options could be True and False.)
4
Practical Implication for Test
Construction
Variance and covariance measure the quality of items in a test.
Reliability and validity measure the quality of the entire test.
σ² = SS/N (used for one set of data)
Variance is the degree of variability of scores from the mean.
5
Practical Implication for Test
Construction
Correlation is based on a statistic called covariance (Cov xy or S xy).
COVxy = SP/(N-1) (used for 2 sets of data)
Covariance is a number that reflects the degree to which 2 variables vary together.
r = SP/√(SSx·SSy)
6
Variance

Data (X): 1, 2, 4, 5

Population: σ² = SS/N
Sample: s² = SS/(n-1) = SS/df

SS = Σx² - (Σx)²/N (computational formula)
SS = Σ(x - μ)² (definitional formula)
SS = Sum of Squared Deviations from the Mean
7
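As a check on these formulas, here is a minimal Python sketch (an addition, not part of the original slides) that computes SS and both variances for the data above:

def sum_of_squares(scores):
    # SS = sum of squared deviations from the mean
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores)

X = [1, 2, 4, 5]
SS = sum_of_squares(X)          # 46 - (12**2)/4 = 10
pop_var = SS / len(X)           # sigma^2 = 10/4 = 2.5
sample_var = SS / (len(X) - 1)  # s^2 = 10/3 ≈ 3.33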
Covariance
Covariance is a number that reflects the degree to which 2 variables vary together.

Original Data
X  Y
1  3
2  6
4  4
5  7
8
Covariance
COVxy = SP/(N-1)
2 ways to calculate the SP:
SP = Σxy - (Σx·Σy)/N
SP = Σ(x - μx)(y - μy)
SP requires 2 sets of data; SS requires only one set of data.
9
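Putting the two previous slides together, a short Python sketch (added as a check, not from the slides) computes SP, the covariance, and r for the data above:

X = [1, 2, 4, 5]
Y = [3, 6, 4, 7]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
SP = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # 66 - (12*20)/4 = 6
SSx = sum((x - mx) ** 2 for x in X)                  # 10
SSy = sum((y - my) ** 2 for y in Y)                  # 10
cov_xy = SP / (n - 1)                                # 6/3 = 2.0
r = SP / (SSx * SSy) ** 0.5                          # 6/10 = 0.6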
Descriptive Statistics for
Dichotomous Data
10
Descriptive Statistics for
Dichotomous Data
Item Variance & Covariance
11
Descriptive Statistics for
Dichotomous Data
P = Item Difficulty:
P = (# of examinees who answered an item correctly) / (total # of examinees)
or
P = f/N
See handout.
The higher the P value, the easier the item.
12
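For instance, a quick Python sketch of P = f/N on a made-up vector of dichotomous item scores (hypothetical data, added for illustration):

item_scores = [1, 0, 1, 1, 0, 1, 1, 1]   # 1 = correct, 0 = incorrect
P = sum(item_scores) / len(item_scores)  # f/N = 6/8 = 0.75, a fairly easy item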
Relationship between Item Difficulty P and σ²

[Figure: item variance σ² (quality) plotted against item difficulty P. The x-axis runs from P = 0 (difficult) to P = 1 (easy), with 0.5 marked in the middle; the variance peaks at the midpoint.]
13
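The curve in the figure follows from a standard identity (added here for clarity; it is not printed on the slide): for an item scored 0/1, the item variance is σ² = P(1 - P). At P = 0.5 the variance reaches its maximum of 0.5 × 0.5 = 0.25; at P = 0 or P = 1 it drops to zero. This is why medium-difficulty items contribute the most variance.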
Non-dichotomous Scores Examples
1. Write a grammatically correct German sentence using the first person singular form of the verb verstehen. (A maximum of 3 points may be awarded, and partial credit may be given.)
2. An intellectually disabled person is a nonproductive member of society.
5. Strongly agree  4. Agree  3. No opinion  2. Disagree  1. Strongly disagree
(Scores can range from 1 to 5 points, with high scores indicating a positive attitude toward intellectually disabled citizens.)
14
Descriptive Statistics for
Non-dichotomous Variables
15
Descriptive Statistics for
Non-dichotomous Variables
16
Variance of a Composite “σ²C”
σ² = SS/N
σ²a = SSa/Na
σ²b = SSb/Nb
σ²C = σ²a + σ²b
Ex. from the WAIS-III: FSIQ = VIQ + PIQ
If more than 2 subtests: σ²C = σ²a + σ²b + σ²c ...
Calculate the variance for each subtest and add them up.
(Strictly, this simple sum holds when the subtests are uncorrelated; correlated subtests add covariance terms, as in the N(N-1)(COVx) term on slide 20.)
17
Variance of a Composite “σ²C”
What is the Composite Test Score?
Ex. the WAIS-IV Full Scale IQ, which consists of a-Verbal Comprehension Index, b-Perceptual Reasoning Index, c-Working Memory Index, and d-Processing Speed Index.
More than 2 subtests:
σ²C = σ²a + σ²b + σ²c + σ²d
18
*Suggestions to Increase the Total Score Variance of a Test
1-Increase the number of items in a test.
2-Keep item difficulties p in the medium range.
3-Items with similar content have higher correlations & higher covariances.
4-Item score & total score variances alone are not indices (in-də-ˌcēz) of test quality (reliability and validity).
19
*1-Increase the Number of Items in a Test (how to calculate the test variance)
The variance for a test of 25 items is higher than the variance for a test of 20 items.
σ² = N(σ²x) + N(N-1)(COVx)
Ex. If COVx = item covariance = 0.10 and σ²x = item variance = 0.20:
First try N = 20 items: test variance σ² = 20(0.20) + 20(19)(0.10) = 4 + 38 = 42.
Then try N = 25 items: test variance σ² = 25(0.20) + 25(24)(0.10) = 5 + 60 = 65.
20
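The same calculation as a small Python sketch (added; the function name is mine):

def test_variance(n_items, item_var, item_cov):
    # sigma^2 = N * (item variance) + N * (N - 1) * (item covariance)
    return n_items * item_var + n_items * (n_items - 1) * item_cov

test_variance(20, 0.20, 0.10)  # 42.0
test_variance(25, 0.20, 0.10)  # 65.0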
2-Item Difficulties
Item difficulties should be almost equal for all of the items, and difficulty levels should be in the medium range.
21
3-Items with Similar Content have Higher
Correlations & Higher Covariance
22
4- Item Scores & Total Scores Variances Alone are not Indices (in-də-ˌcēz) of Test Quality
Variance and covariance are important and necessary; however, they are not sufficient to determine test quality.
To determine a higher level of test quality we use reliability and validity.
23
UNIT II
RELIABILITY
CHAP 6: RELIABILITY AND THE CLASSICAL
TRUE SCORE MODEL
CHAP 7: PROCEDURES FOR ESTIMATING
RELIABILITY
CHAP 8: INTRODUCTION TO
GENERALIZABILITY THEORY
CHAP 9: RELIABILITY COEFFICIENTS FOR
CRITERION-REFERENCED TESTS
24
CHAPTER 6
Reliability and the Classical True Score Model
Reliability (p) = a measure of consistency/dependability: when a test measures the same thing more than once, it results in the same outcome.
Reliability refers to the consistency of examinees' performance over repeated administrations of the same test or parallel forms of the test (Linda Crocker text).
25
THE
MODERN
MODELS
26
*TYPES OF RELIABILITY
(Columns: type of reliability; what it is; how you do it; what the reliability coefficient looks like)

Test-Retest (2 administrations): A measure of stability. Administer the same test/measure at two different times to the same group of participants. Coefficient: r test1.test2. Ex. IQ test.

Parallel/Alternate (Interitem/Equivalent) Forms (2 administrations): A measure of equivalence. Administer two different forms of the same test to the same group of participants. Coefficient: r testA.testB. Ex. stats test.

Test-Retest with Alternate Forms (2 administrations): A measure of stability and equivalence. On Monday, you administer form A to the 1st half of the group and form B to the second half; on Friday, you administer form B to the 1st half of the group and form A to the 2nd half. Coefficient: r testA.testB.

Inter-Rater (1 administration): A measure of agreement. Have two raters rate behaviors and then determine the amount of agreement between them. Coefficient: percentage of agreement.

Internal Consistency (1 administration): A measure of how consistently each item measures the same underlying construct. Correlate performance on each item with overall performance across participants. Coefficients: Cronbach's Alpha Method, Kuder-Richardson Method, Split-Half Method, Hoyt's Method.

27
Test-Retest
Class IQ Scores

Students   X (1st time on Mon)   Y (2nd time on Fri)
John       125                   120
Jo         110                   112
Mary       130                   128
Kathy      122                   120
David      115                   120
28
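To connect this example back to the correlation formula from Chapter 5, here is a small Python check (added; not on the slide) of the test-retest coefficient for these scores:

X = [125, 110, 130, 122, 115]  # Monday IQ scores
Y = [120, 112, 128, 120, 120]  # Friday IQ scores
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
SP = sum((x - mx) * (y - my) for x, y in zip(X, Y))
SSx = sum((x - mx) ** 2 for x in X)
SSy = sum((y - my) ** 2 for y in Y)
r = SP / (SSx * SSy) ** 0.5  # ≈ 0.89, a high test-retest reliability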
Parallel/Alternate Forms
Scores on 2 forms of stats tests

Students   Form A   Form B
John       95       92
Jo         84       82
Mary       90       88
Kathy      76       80
David      81       78
29
Test-Retest with Alternate Forms
On Monday, you administer form A to the 1st half of the group and form B to the second half. On Friday, you administer form B to the 1st half of the group and form A to the 2nd half.

Students (Form A, 1st group, Mon)   Students (Form B, 2nd group, Mon)
David   85                          Mark    82
Mary    94                          Jane    95
Jo      78                          George  80
John    81                          Mona    80
Kathy   67                          Maria   70

(continued on next slide)
30
Test-Retest with Alternate Forms
On Friday, you administer form B to the 1st half of the group and form A to the second half.

Students (Form B, 1st group, Fri)   Students (Form A, 2nd group, Fri)
David   85                          Mark    82
Mary    94                          Jane    95
Jo      78                          George  80
John    81                          Mona    80
Kathy   67                          Maria   70
31
HOW RELIABILITY IS MEASURED
Reliability is measured by using a correlation coefficient:
r test1.test2 or r x.y
Reliability coefficients:
Indicate how scores on one test change relative to scores on a second test.
Can range from 0.0 to ±1:
• ±1.00 = perfect reliability
• 0.00 = no reliability
32
THE
CLASSICAL
MODEL
33
A CONCEPTUAL DEFINITION OF RELIABILITY
CLASSICAL MODEL
Observed Score = True Score ± Error Score
X = T ± E
(Error Score = Method Error + Trait Error)
34
Classical Test Theory
The Observed Score, X = T + E: X is the score you actually record or observe on a test.
The True Score, T = X - E: the difference between the Observed score and the Error score is the True score. The T score is the reflection of the examinee's true knowledge.
The Error Score, E = X - T: the difference between the Observed score and the True score is the Error score. E represents the factors that cause the True score and the Observed score to differ.
35
A CONCEPTUAL DEFINITION OF RELIABILITY
Observed Score = True Score ± Error Score
(Error Score = Method Error + Trait Error)
Observed Score (X): X = T ± E
The score that is actually observed. It consists of two components:
• True Score
• Error Score
36
A CONCEPTUAL DEFINITION OF RELIABILITY
Observed Score = True Score ± Error Score
(Error Score = Method Error + Trait Error)
True Score: T = X - E
A perfect reflection of the true value for an individual; a theoretical score.
37
A CONCEPTUAL DEFINITION OF RELIABILITY
Observed Score = True Score ± Error Score
(Error Score = Method Error + Trait Error)
Method error is due to characteristics of the test or testing situation.
Trait error is due to individual characteristics.
Conceptually,
Reliability = True Score / (True Score + Error Score) = True Score / Observed Score
Reliability of the observed score becomes higher if error is reduced!!
38
A CONCEPTUAL DEFINITION OF RELIABILITY, OR:
Observed Score = True Score ± Error Score
Error Score: E = X - T
(Error Score = Method Error + Trait Error)
The Error score is the difference between the Observed and True score (±).
X = T ± E
95 = 90 + 5 or 85 = 90 - 5. The difference between T and X is 5 points, so E = ±5.
39
The Classical True Score Model
X = T ± E
X = the observed test score
T = the individual's true knowledge (True score)
E = the random error component
40
Classical Test Theory
What Makes up the Error Score?
E = X - T
The Error Score consists of: 1-Method Error and 2-Trait Error.
1-Method Error: the difference between True & Observed scores resulting from the test or testing situation.
2-Trait Error: the difference between True & Observed scores resulting from the characteristics of the examinees.
See next slide.
41
What Makes up the Error Score?
42
Expected Value of True Score
Definition of the True Score:
The True score is defined as the expected value of the examinee's test scores (the mean of observed scores) over many repeated testings with the same test.
43
Error Score
Definition of the Error Score:
The error scores for an examinee over many repeated testings should average to zero.
e(Ej) = Tj - Tj = 0
e(Ej) = expected value of the Error
Tj = examinee's True score
Ex. next.
44
Error Score
X - E = T, or: the difference between the Observed score and the Error score is the True score (all scores are from the same examinee, whose T = 90).
X ± E = T
98 - 8 = 90
88 + 2 = 90
80 + 10 = 90
100 - 10 = 90
95 - 5 = 90
81 + 9 = 90
88 + 2 = 90
90 - 0 = 90
Sum of the errors: -8 + 2 + 10 - 10 - 5 + 9 + 2 - 0 = 0
45
*INCREASING THE RELIABILITY OF A TEST (Meaning Decreasing Error)
7 Steps
1. Increase Sample Size (n)
2. Eliminate Unclear Questions
3. Standardize Testing Conditions
4. Moderate the Degree of Difficulty of the Tests (P)
5. Minimize the Effects of External Events
6. Standardize Instructions (Directions)
7. Maintain Consistent Scoring Procedures (use a rubric)
46
*Increasing Reliability of your
Items in a Test
47
*Increasing Reliability Cont..
48
How Reliability (p) is Measured for an Item/Score
p = True Score / (True Score + Error Score), or p = T/(T+E)
0 ≤ p ≤ 1
Note: In this formula you always add your Error (the difference between T and X) to the True Score in the denominator, whether it is positive or negative:
p = T/(T + |E|)
49
Which Item has the Highest Reliability?
Maximum points for this question is 10; p = T/(T + |E|)

E = +2, T = 8:   8/10 = 0.80
E = -3, T = 6:   6/9 = 0.667
E = +7, T = 1:   1/8 = 0.125
E = -1, T = 9:   9/10 = 0.90
E = +4, T = 6:   6/10 = 0.60
E = -4, T = 6:   6/10 = 0.60
E = +1, T = 7:   7/8 = 0.875
E = 0, T = 10:   10/10 = 1.0
E = -5, T = 4:   4/9 = 0.444
E = +6, T = 3:   3/9 = 0.333

The more error, the less reliable.
50
How Classical Reliability (p) is Measured for a Test
X = T + E; p = T/X for an essay item/score
Examinees:
1. X1 = t1 + e1   Ex. 10 = 7 + 3
2. X2 = t2 + e2   Ex. 8 = 5 + 3
3. X3 = t3 + e3   Ex. 6 = 4 + 2
Then calculate the σ²X = 4 & σ²T = 2.33.
51
How Classical Reliability (p) is Measured for a Test
Reliability coefficient for all items:
px1x2 = σ²T/σ²X
px1x2 for the previous example = 2.33/4.00 = 0.58
pk = σ²T/σ²X
52
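As an added Python check of these two slides (sample variances, SS/(n-1), are what produce 4 and 2.33):

def sample_variance(scores):
    n = len(scores)
    mean = sum(scores) / n
    return sum((s - mean) ** 2 for s in scores) / (n - 1)

X = [10, 8, 6]  # observed scores
T = [7, 5, 4]   # true scores
var_X = sample_variance(X)   # 4.0
var_T = sample_variance(T)   # ≈ 2.33
reliability = var_T / var_X  # ≈ 0.58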
How Reliability Coefficient (p) is Measured for a Test
T ± E = X
3 + 2 = 5
4 + 3 = 7
8 + 6 = 13
9 + 5 = 14
2 + 1 = 3
1 + 1 = 2
8 + 1 = 9
7 + 3 = 10
p = σ²T/σ²X = 9.643/19.554 = 0.493
53
Reliability Coefficient (p) for Parallel Test Forms
Reliability Coefficient (p) = the correlation between scores on parallel test forms.
Next slide.
54
X ± E = T
Scores on Parallel Test Forms

X (Test A)       Y (Test B)
98 - 2 = 96      95 - 6 = 89
88 + 2 = 90      80 + 6 = 86
80 + 11 = 91     87 - 4 = 83
100 - 8 = 92     75 + 12 = 87
95 - 3 = 92      90 - 5 = 85
81 + 12 = 93     82 - 2 = 80
88 + 1 = 89      86 - 3 = 83
90 - 3 = 87      85 + 6 = 91

r = SP/√(SSx·SSy) = 0.882
55
*Reliability Coefficient and Reliability Index
Reliability Coefficient: px1x2 = σ²T/σ²X
Reliability Index: pxT = σT/σX
Therefore px1x2 = (pxT)², or pxT = √px1x2,
just like the relationship between σ² and σ.
The higher the item-reliability index, the higher the internal consistency of the test.
56
*Reliability Coefficient and Reliability Index
Reliability Coefficient: PX1X2 = σ²T/σ²X
The Reliability Coefficient is the correlation coefficient that expresses the degree of reliability of a test.
Reliability Index: PXT = σT/σX
The Reliability Index is the correlation coefficient that expresses the degree of relationship between the True (T) and Observed (X) scores of a test. It is the √ of the Reliability Coefficient.
57
Reliability of a Composite
C = a + b + ... + k
Two Ways to Determine/Predict the Reliability of Composite Test Scores:
*1-Spearman-Brown Prophecy Formula
Allows us to estimate the reliability of a composite of parallel tests when the reliability of one of these tests is known. (Ex. next)
*2-CRONBACH'S Alpha (α) or Coefficient (α)
58
*Next week: the Split-Half Reliability Method, which is the same as the Spearman-Brown Prophecy Formula when K = 2.
59
*1. Spearman Brown Prophecy Formula
60
*1. Spearman Brown Prophecy Formula
61
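The formula itself did not survive in this transcript. Its standard form (added here for reference) predicts the reliability pkk' of a test lengthened by a factor of K from the reliability p of the original test:

pkk' = K·p / (1 + (K - 1)·p)

For example, doubling a test (K = 2) whose reliability is p = 0.60 predicts pkk' = 2(0.60)/(1 + 0.60) = 0.75.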
If N or K = 2, then we can call it the Split-Half Reliability Method, which is used for measuring internal consistency reliability (see next chapter).
The effect of changing test length can also be estimated by using the Spearman-Brown Prophecy Formula, just like increasing the variance of a test by increasing the # of items in a test (Chapter 5).
62
*The Spearman-Brown Prophecy Formula is used for: a, b, c
a. Correcting for one half of the test by estimating the reliability of the whole test.
b. Determining how many additional items are needed to increase reliability up to a certain level.
c. Determining how many items can be eliminated without reducing reliability below a predetermined level.
63
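A minimal Python sketch of uses (a) and (b) (an added illustration; the function names are mine, not the text's):

def spearman_brown(rel, k):
    # Predicted reliability when test length is multiplied by k.
    return k * rel / (1 + (k - 1) * rel)

def length_factor_needed(rel, target):
    # Factor k by which a test must be lengthened to reach the target
    # reliability (the Spearman-Brown formula solved for k).
    return target * (1 - rel) / (rel * (1 - target))

spearman_brown(0.60, 2)           # 0.75: whole-test reliability from a half-test of .60
length_factor_needed(0.60, 0.90)  # 6.0: the test would have to be 6 times as long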
Reliability of a Composite
C = a + b + ... + k
*2-CRONBACH'S Alpha (α) or Coefficient (α) is a preferred statistic.
It allows us to estimate the reliability of a composite when we know the composite score variance and/or the covariance among all its components.
Next slide.
64
Reliability of a Composite
C = a + b + ... + k
*2-CRONBACH'S Alpha (α) or Coefficient (α)

α = pcc' = [K/(K - 1)] × (1 - Σσ²i/σ²C)

K = # of tests = 3
σ²i = variance of each test: σ²ta = 2, σ²tb = 3, σ²tc = 4
σ²C = composite score variance = 12
So α = (3/2) × (1 - 9/12) = 1.5 × 0.25 = 0.375.
65
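The same computation as a short Python sketch (added check; the function name is mine):

def cronbach_alpha(component_vars, composite_var):
    k = len(component_vars)
    return (k / (k - 1)) * (1 - sum(component_vars) / composite_var)

cronbach_alpha([2, 3, 4], 12)  # (3/2) * (1 - 9/12) = 0.375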
The Standard Error of Measurement (σE or σM)
The Standard Error of Measurement is the mean of the standard deviations (σ) of all the errors (E) made by several examinees.
E = T - X

Examinees   Test 1          Test 2   Test 3   Test 4
1.          E = 95-90 = 5   4        3        4   → σ1
2.          E = 85-86 = 1   1        3        2   → σ2
3.          E = 90-95 = 5   3        1        3   → σ3
4.          E = 95-93 = 2   2        4        1   → σ4
66
*The Standard Error of Measurement (σE)
1. Find the σs of these errors (E) for all of the examinees' tests. The mean/average of these σs is called the Standard Error of Measurement.
2. σE = σx·√(1 - pxx')
pxx' = r = reliability coefficient, or use px1x2 for parallel tests.
σx = Standard Deviation for a set of Observed Scores (X).
67
*The Standard Error of Measurement (σE) is a tool used to estimate or infer how far an observed score (X) deviates from a true score (T).
σE = σx·√(1 - pxx')
pxx' = r = reliability coefficient (use px1x2 for parallel tests) = 0.91
σx = Standard Deviation for a set of Observed Scores = 10
σE = 10·√(1 - 0.91) = 10·√0.09 = 10 × 0.3 = 3
(next slide)
68
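And the same calculation in Python (added; the function name is mine):

import math

def standard_error_of_measurement(sd_x, reliability):
    # sigma_E = sigma_x * sqrt(1 - p_xx')
    return sd_x * math.sqrt(1 - reliability)

standard_error_of_measurement(10, 0.91)  # 3.0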
The Standard Error of Measurement (σE)
This means the average difference between the True scores (T) and the Observed scores (X) is 3 points for all examinees, which is called the Standard Error of Measurement.
69