Item Analysis: Classical and Beyond


Item Analysis:
Classical and Beyond
SCROLLA Symposium
Measurement Theory and Item Analysis
Heriot-Watt University
12th February 2003.
Why is item analysis relevant?
Item analysis provides a way of measuring
the quality of questions - seeing how
appropriate they were for the candidates
and how well they measured their ability.
It also provides a way of re-using items
over and over again in different tests with
prior knowledge of how they are going to
perform.
What kinds of item analysis are there?
Item Analysis
- Classical
- Latent Trait Models
  - Item Response Theory (IRT1, IRT2, IRT3, IRT4)
  - Rasch
Classical Analysis
Classical analysis is the easiest and most widely used
form of analysis. The statistics can be computed by
generic statistical packages (or at a push by hand)
and need no specialist software.
Analysis is performed on the test as a whole rather
than on the item, and although item statistics can be
generated, they apply only to that group of students
on that collection of items.
Classical Analysis
Assumptions
Classical test analysis assumes that any
test score comprises a “true” value plus
random error. Crucially, it assumes that
this error is normally distributed,
uncorrelated with the true score, and has
a mean of zero.
x_obs = x_true + G(0, σ_err)
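As an illustration, here is a minimal simulation of this model (the score range and error spread below are arbitrary illustrative choices, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

n_candidates = 1000
true_scores = rng.uniform(40, 90, n_candidates)   # hypothetical "true" scores
error_sd = 5.0                                    # assumed error spread

# Observed score = true score + normally distributed error with mean zero
observed = true_scores + rng.normal(0.0, error_sd, n_candidates)

# The error averages to zero and is uncorrelated with the true score
errors = observed - true_scores
print(f"mean error: {errors.mean():.3f}")
print(f"corr(true, error): {np.corrcoef(true_scores, errors)[0, 1]:.3f}")
```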
Classical Analysis Statistics
• Difficulty (item-level statistic)
• Discrimination (item-level statistic)
• Reliability (test-level statistic)
Classical Analysis
Difficulty
The difficulty of a (1-mark) question in
classical analysis is simply the proportion
of people who answered the question
incorrectly. For multiple-mark questions,
it is the average mark expressed as a
proportion of the maximum available mark.
Given on a scale of 0-1, the higher the
proportion, the greater the difficulty.
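For 1-mark items the computation is a one-liner; a sketch using a small hypothetical response matrix:

```python
import numpy as np

# Rows = candidates, columns = items; 1 = correct, 0 = incorrect (1-mark items)
responses = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])

# Proportion answering incorrectly: the higher the value, the harder the item
difficulty = 1.0 - responses.mean(axis=0)
print(difficulty)   # [0.25 0.5  0.25]
```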
Classical Analysis
Discrimination
The discrimination of an item is the
(Pearson) correlation between candidates'
marks on the item and their total test
marks.
Being a correlation, it can vary from -1 to
+1, with higher values indicating
(desirable) high discrimination.
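A sketch of the same computation, reusing the hypothetical response matrix from above:

```python
import numpy as np

responses = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])

totals = responses.sum(axis=1)   # each candidate's total test mark

# Pearson correlation between each item's marks and the total marks
for g in range(responses.shape[1]):
    r = np.corrcoef(responses[:, g], totals)[0, 1]
    print(f"item {g}: discrimination = {r:.2f}")
```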
Classical Analysis
Reliability
Reliability is a measure of how well the test “holds
together”. For practical reasons, internal
consistency estimates are the easiest to obtain;
these indicate the extent to which each item
correlates with every other item.
This is measured on a scale of 0-1; the greater
the number, the higher the reliability.
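The slides do not name a particular estimate; Cronbach's alpha is the most commonly used internal-consistency statistic, and a minimal version looks like this:

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha over a candidates-by-items score matrix."""
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1)      # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")   # alpha = 0.44
```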
Classical Analysis
vs
Latent Trait Models
• Classical analysis has the test (not the item) as its
basis. Although the statistics generated are often
generalised to similar students taking a similar test,
they only really apply to those students taking that
test.
• Latent trait models aim to look beyond that at the
underlying traits which are producing the test
performance. They are measured at item level and
provide sample-free measurement.
Latent Trait Models
• Latent trait models have been around since the
1940s, but were not widely used until the 1960s.
Although theoretically possible, it is practically
infeasible to use these without specialist software.
• They aim to measure the underlying ability (or trait)
which is producing the test performance, rather than
measuring performance per se.
• This leads to them being sample-free. As the
statistics are not dependent on the test situation
which generated them, they can be used more
flexibly.
Rasch
vs
Item Response Theory
Mathematically, Rasch is identical to the most basic
IRT model (IRT1); however, there are some
important differences which make it a more viable
proposition for practical testing:
• In Rasch the model is superior: data which does
not fit the model is discarded.
• Rasch does not permit abilities to be estimated for
extreme items and persons.
• Rasch eschews the use of Bayesian priors to assist
parameter setting.
IRT - the generalised model
P_g(θ) = c_g + (1 - c_g) / (1 + e^(-a_g(θ - b_g)))
Where
a_g = gradient of the ICC at the point θ = b_g
(item discrimination)
b_g = the ability level at which the gradient is maximised
(item difficulty)
c_g = probability of low-ability candidates correctly
answering question g
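Transcribed directly into Python (note that some formulations also include a scaling constant D ≈ 1.7 in the exponent; it is omitted here):

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL model: probability that a candidate of ability theta
    answers an item with parameters a, b, c correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A candidate of average ability (theta = 0) on a mid-difficulty
# four-option multiple choice item (b = 0, c = 0.25):
print(p_correct(theta=0.0, a=1.0, b=0.0, c=0.25))   # 0.625
```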
IRT - Item Characteristic Curves
• An ICC is a plot of the probability of a
candidate correctly answering the question
against their ability. The higher the ability,
the higher the chance that they will respond
correctly.
[Figure: an item characteristic curve, annotated with
c (the intercept), b (the ability at maximum gradient)
and a (the gradient).]
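A quick way to draw such a curve, using the model above with hypothetical parameter values (a = 1, b = 0, c = 0.2):

```python
import numpy as np
import matplotlib.pyplot as plt

def p_correct(theta, a=1.0, b=0.0, c=0.2):
    # 3PL item characteristic curve
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 200)
plt.plot(theta, p_correct(theta))
plt.axhline(0.2, linestyle="--", label="c (intercept)")
plt.axvline(0.0, linestyle=":", label="b (ability at max gradient)")
plt.xlabel("ability (theta)")
plt.ylabel("P(correct)")
plt.legend()
plt.show()
```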
IRT - About the Parameters
Difficulty
• Although there is no “correct” difficulty for
any one item, it is clearly desirable that
the difficulty of the test is centred around
the average ability of the candidates.
• The higher the “b” parameter the more
difficult the question - note that difficulty is
inversely related to the probability
of the question being answered correctly.
IRT - About the Parameters
Discrimination
• In IRT (unlike Rasch) maximal
discrimination is sought. Thus the higher
the “a” parameter the more desirable the
question.
• Note however that differences in the
discrimination of questions can lead to their
relative difficulty varying across the ability
range, as the sketch below illustrates.
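A small numerical illustration with two hypothetical items of equal b but different a: the strongly discriminating item is the harder of the two for low-ability candidates, but the easier for high-ability candidates.

```python
import math

def p_correct(theta, a, b, c=0.0):
    # 3PL model with no guessing parameter
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Item 1 discriminates strongly (a = 2.0), item 2 weakly (a = 0.5)
for theta in (-2.0, 2.0):
    p1 = p_correct(theta, a=2.0, b=0.0)
    p2 = p_correct(theta, a=0.5, b=0.0)
    print(f"theta={theta:+.0f}: item1={p1:.2f}, item2={p2:.2f}")
# theta=-2: item1=0.02, item2=0.27  (item 1 harder)
# theta=+2: item1=0.98, item2=0.73  (item 1 easier)
```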
IRT - About the Parameters
Guessing
• A high “c” parameter suggests that
candidates with very little ability may
choose the correct answer.
• This is rarely a valid parameter outwith
multiple choice testing, and the value
should not vary excessively from the
reciprocal of the number of choices
(e.g. around 0.25 for a four-option question).
IRT - Parameter Estimation
• Before being used (in an item bank or for
measurement) items must first be calibrated; that
is, their parameters must be estimated.
• There are two main procedures - Joint Maximum
Likelihood and Marginal Maximum Likelihood. JML
is most common for IRT1 and IRT2, while MML is used
more frequently for IRT3.
• Bayesian estimation and estimated bounds may be
imposed on the data to prevent parameter estimates
degenerating, or high-discrimination items being
over-valued. A sketch of the likelihood machinery follows.
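The sketch below is not a full JML or MML implementation; it shows only the core likelihood machinery, estimating a single item's a and b by maximum likelihood under the simplifying assumption that candidate abilities are already known (JML removes this assumption by estimating abilities jointly). It uses scipy.optimize and simulated data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated calibration data: known abilities and responses to one item
theta = rng.normal(0, 1, 500)
a_true, b_true = 1.2, 0.5
p = 1 / (1 + np.exp(-a_true * (theta - b_true)))
x = (rng.uniform(size=500) < p).astype(float)

def neg_log_likelihood(params):
    a, b = params
    prob = 1 / (1 + np.exp(-a * (theta - b)))
    prob = np.clip(prob, 1e-9, 1 - 1e-9)   # guard against log(0)
    return -np.sum(x * np.log(prob) + (1 - x) * np.log(1 - prob))

result = minimize(neg_log_likelihood, x0=[1.0, 0.0], method="Nelder-Mead")
print(result.x)   # estimates of (a, b), close to (1.2, 0.5)
```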
Resources - Classical Analysis
Software
• Standard statistical packages (Excel; SPSS; SAS)
• ITEMAN (available from www.assess.com)
Reading
• Matlock-Hetzel (1997), Basic Concepts in Item and Test
Analysis, available at www.ericae.net/ft/tamu/Espy.htm
Resources - IRT
Software
• BILOG (available from www.assess.com)
• Xcalibre (available from www.assess.com)
Reading
• Lord (1980), Applications of Item Response Theory
to Practical Testing Problems
• Baker, Frank (2001), The Basics of Item Response
Theory, available at http://ericae.net/irt/baker/