Reliability & validity - University of Warwick



Reliability & Validity
Overview for this lecture
• Ethical considerations in testing
• Reliability of tests
– Split-half reliability
• Validity of tests
• Reliability and validity in designed research
– Internal and external validity
What does this resemble?
Rorschach test
• You look at several images like this, and say
what they resemble
• At the end of the test, the tester says …
– you need therapy
– or you can't work for this company
What assurance would you expect about the test?
Or imagine someone asks your child to draw a human figure
The tester says this shows “signs” that your child is a victim
of sexual abuse.
What questions would you ask?
What questions would you ask?
• Is it valid for the purpose to which you plan to
put it?
• Can it be faked?
• How were the norms constructed?
• Can we see the data on which the norm is
based?
• Are there tester effects?
• Is scoring reliable?
• Is it culture fair – are there separate norms for
my culture?
Ethics – developmental role for a test
Sometimes said: “a good test will let you give the
subject a debrief that they can use to help…”
- personal decisions
- career
- choice of therapy
- personal development targets
e.g. learning styles & study practices
But how reliable / specific is the test, really?
Psychological Testing
• Occurs widely …
– in personnel selection
– in clinical settings
– in education
• Test construction is an industry
– There are many standard tests available
What constitutes a good test?
Working assumption - a test is:
a set of items (questions, pictures, …)
to which an individual responds (rating, comment, yes/no, …)
The responses to these items are added up (combined in some way) to create an
overall score that assesses one psychological construct.
Also called a ‘scale’.
E.g. the Warwick Sweetness Scale (each item rated 1, 2, 3, 4, 5):
How much do you like sugar in coffee?
How much do you like toffee?
How much do you like ice-cream?
How much do you like pudding?
How much do you like chocolate cake?
How much do you like honey?
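Combining the items into a scale score is just addition. A minimal sketch in Python, using made-up ratings for the six items above:

```python
# Hypothetical responses to the six Warwick Sweetness items,
# each rated on a 1-5 scale.
responses = {
    "sugar in coffee": 3,
    "toffee": 4,
    "ice-cream": 2,
    "pudding": 3,
    "chocolate cake": 5,
    "honey": 4,
}

# The scale score is simply the sum of the item ratings.
sweetness_score = sum(responses.values())
print(sweetness_score)  # 21
```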
Specificity & sensitivity
Critical for diagnostic tests (e.g. dyslexia, autism,
diabetes)
Sensitivity: the test picks out people who really do
have the condition
Specificity: the test excludes people who do not
have the condition
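Both quantities fall out of a 2×2 table of test outcomes against true status. A minimal sketch, with invented screening counts (the numbers are illustrative, not from the lecture):

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Of the people who really have the condition, what share does the test pick out?"""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Of the people who do not have the condition, what share does the test exclude?"""
    return true_neg / (true_neg + false_pos)

# Illustrative screening of 1000 people: 100 have the condition, 900 do not.
print(sensitivity(true_pos=90, false_neg=10))   # 0.9
print(specificity(true_neg=855, false_pos=45))  # 0.95
```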
Reliability
consistency
• Test-retest reliability
• Parallel forms reliability
• Split-half reliability
• Intraclass correlation (ICC, Cronbach’s alpha)
• Inter-rater reliability (kappa, ICC)
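Cronbach's alpha, mentioned above, is an internal-consistency estimate that compares the item variances with the variance of the total score. A minimal sketch with invented ratings:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list per test item, each holding one score per person."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per person
    item_var = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

# Three items answered by four people (invented ratings).
items = [
    [4, 3, 5, 2],
    [4, 2, 5, 3],
    [3, 3, 4, 2],
]
print(round(cronbach_alpha(items), 2))  # 0.9 for these ratings
```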
Split-half reliability

Item                             Rating   Odd   Even
1. sugar in coffee?                 3      3
2. toffee?                          4             4
3. ice-cream?                       2      2
4. pudding?                         3             3
5. chocolate cake?                  5      5
6. honey?                           4             4
Total Warwick Sweetness score      21     10     11
Split-half reliability
• Split the test into two halves – do you get similar
scores on the halves?
- separate sub-totals for odd and even items (for
each subject)
- correlate these subtotals (r_half)
• Adjust the reliability estimate with the Spearman-Brown correction
r_test = (2 × r_half) / (1 + r_half)
Reliability v. accuracy
Can be reliable but not accurate. E.g. three
measurement occasions (m1, m2, m3) of the
same five objects:

m1:  1,  2,  3,  4,  5
m2: 11, 12, 13, 14, 15
m3: 21, 22, 23, 24, 25

The occasions agree perfectly with each other
(consistent), yet disagree about the actual values.
Validity
Interpretation; link to reality
The relationship between test scores and the conclusions
we draw from them.
"The degree to which evidence and theory support the
interpretation of test scores entailed by proposed use of
tests." (AERA/APA/NCME, 1999)
IQ tests – “intelligence”
Personality tests – “personality”
Validity
Fast cars
move quickly – the speed test
are powerful – the bhp test
are red – the colour test
Validity
• "Validation is inquiry into the soundness of the
interpretations proposed for scores from a test"
Cronbach (1990, p. 145)
• Face validity
• Content validity
• Construct validity
• Criterion validity
Face validity
• Does a test, on the face of it, seem to be a good
measure of the construct?
E.g., how fast can a particular car go?
– time it over a fixed distance
Direct measurement of speed has good
face validity.
Face validity
The bishop / colonel question
Content validity
Does the test systematically cover all parts of the
construct?
E.g. the examination for a module:

Topics taught    Topics examined
Soup             Soup
Fish
Beetroot         Beetroot
Custard          Custard
Rice
Content validity
Spider phobia
Aspects of the construct vs. aspects assessed:
Strength of fear reaction
Persistence of reaction
Invariability of reaction
Recognition that reaction is unreasonable
Avoidance of spiders
…
Construct validity
Measuring things that are in our theory of a
domain.
e.g. engine power propels car
• A construct is a mechanism that is believed to
account for some aspect of behaviour
– working memory
– trait introversion/extroversion
• E.g., children's spelling ability in native language
is correlated with learning of second language
Construct validity
The construct is sometimes called a latent variable
You can’t directly observe the construct
You can only measure its surface manifestations
E.g. extroversion is the construct (latent variable);
a personality questionnaire and behavioural
observation are its measurements (manifest
variables).
Construct validity
Measuring construct validity
• Convergent validity
– Agrees with other measures of the same thing
• Divergent validity
– Does not agree with measures of different things
(Campbell & Fiske, 1959)
‘Warwick spider phobia questionnaire’
positive correlation with SPQ
no correlation with BDI
Criterion validity
• A test has high criterion validity if it correlates
highly with some external benchmark
– e.g. a spelling test predicts learning a 2nd language
– e.g. the "bishop/colonel" test might predict good cleaners
Concurrent validity
Predictive validity
Criterion / predictive validity
• Graphology for job selection
– Candidate writes something: validity = .18
– But untrained graphologists do just as well …
– Candidate copies something: validity = none
Schmidt & Hunter (1998), Psychological Bulletin, 124, 262-274
Reliability and validity
Reliability limits validity
- without reliability, there is no validity
- Measures of validity cannot exceed measures of
reliability
validity ≤ reliability
Replicability
Can the result be repeated?
Drachnik (1994)
43 abused children; 14 drawings included tongues
194 not abused – only 2 …
d = 1.44
Replicability
Does it replicate?
1. Chase (1987)
34 abused, 26 not abused
d = 0.09
2. Grobstein (1996)
81 abused, 82 not abused
d = 0.08
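The d values above are effect sizes (Cohen's d). The studies derived theirs from diagnostic-sign frequencies, but for two groups of raw scores the usual formula is the mean difference in pooled-standard-deviation units. A minimal sketch with invented data:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Standardised mean difference using the pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    pooled_sd = sqrt(((n1 - 1) * stdev(group1) ** 2 +
                      (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled_sd

# Invented scores for two small groups.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```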
Reliability in designed research
• Use reliable measurement instruments
– standardized questionnaires
– accurate and reliable clocks
• Repeat measurements
– many participants
– many trials
• Eliminate (control) sources of ‘noise’ – irrelevant factors that randomly affect
the outcome variable
– temperature
– time of day
Reliability in designed research
Tip: reduce irrelevant individual differences
e.g.
– test only female participants
– test only a narrow age band
Why? – reduces error variance, makes the test more powerful
Cost? – ability to generalise to other groups or situations is reduced
Validity in designed research
Internal validity
Are there flaws in the design or method?
Can the study generate data that allows
suitable conclusions to be drawn?
External validity
How well do the results carry over from
sample to populations? How well do they
generalise?
Lecture Overview
• Ethical considerations in testing
– Results can be used to make important decisions – is the test
good enough to justify them?
• Reliability
– Test-retest; internal consistency (Split-half)
– Accuracy; specificity & sensitivity
• Validity
– Face, content, construct, criterion
– Divergent & convergent
• Replicability
• Reliability and validity in designed research
– Internal and external validity
http://wilderdom.com/personality/L32EssentialsGoodPsychologicalTest.html
Standardization
Standardized tests are:
• administered under uniform conditions, i.e. no matter where, when, by whom or to
whom it is given, the test is administered in a similar way.
• scored objectively, i.e. the procedures for scoring the test are specified in detail so
that any number of trained scorers will arrive at the same score for the same set of
responses. So, for example, questions that need subjective evaluation (e.g. essay
questions) are generally not included in standardized tests.
• designed to measure relative performance, i.e. they are not designed to measure
ABSOLUTE ability on a task. In order to measure relative performance, standardized
tests are interpreted with reference to a comparable group of people: the
standardization, or normative, sample. E.g. the highest possible grade in a test is 100.
A child scores 60 on a standardized achievement test. You may feel that the child has
not demonstrated mastery of the material covered in the test (absolute ability), BUT if
the average of the standardization sample was 55 the child has done quite well
(RELATIVE performance).
The normative sample should (for hopefully obvious reasons!) be representative of
the target population – however this is not always the case, so the norms and the
structure of the test would need to be interpreted with appropriate caution.
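The relative-performance idea in the 60-vs-55 example can be made concrete with a z-score against the normative sample. A sketch in Python; the normative SD of 10 is an assumed value for illustration, since the text does not give one:

```python
from statistics import NormalDist

norm_mean = 55  # mean of the standardization (normative) sample
norm_sd = 10    # ASSUMED spread of the normative sample - not given in the text
score = 60      # the child's score

z = (score - norm_mean) / norm_sd
percentile = NormalDist().cdf(z) * 100  # share of the norm group scoring lower

print(round(z, 2))        # 0.5
print(round(percentile))  # 69 -> above average relative to the norm group
```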