Reliability & Validity
What is Reliability?
Reliability: Consistency and dependability.
If a measurement device or procedure consistently assigns
the same score to individuals or objects with equal values,
the device is considered reliable.
Researchers must establish the reliability of their
measurement devices in order to be certain that they are
obtaining a systematic and consistent record of the variation
in X and Y.
Types of Reliability
Several types:
Test-retest reliability and alternate forms reliability
Inter-item reliability and internal consistency
Split-half reliability
Inter-rater reliability
Scorer reliability
Test-retest Reliability
Administer the same instrument twice to the same individuals.
A reliable measure should produce very similar scores on both occasions.
Examples:
IQ tests typically show high test-retest reliability. The
reliability of a bathroom scale can be tested by recording
your weight 2-3 times within a minute or two.
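Test-retest reliability is typically quantified as the correlation between
the two administrations. A minimal sketch in Python, using made-up scores
for six hypothetical people:

```python
import numpy as np

# Hypothetical scores for six people measured twice with the same instrument
time1 = np.array([98, 110, 105, 121, 93, 130])
time2 = np.array([100, 108, 107, 119, 95, 128])

# Test-retest reliability: Pearson correlation between the two administrations
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability r = {r:.3f}")  # near 1.0 -> highly reliable
```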
Alternate Forms Reliability
Test-retest procedures are less useful when participants can
recall their previous responses and simply repeat them
upon retesting. In cases where administering the exact same test
will not necessarily be a good test of reliability, we may use
alternate forms reliability. As the name implies, two or more
versions of the test are constructed that are equivalent in content
and level of difficulty. Professors use this technique to create
makeup or replacement exams because students may already
know the questions from the earlier exam.
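Alternate forms reliability is quantified the same way: correlate each
person's score on one form with their score on the other. A minimal sketch
with made-up exam scores:

```python
import numpy as np

# Hypothetical scores of the same six students on two equivalent exam forms
form_a = np.array([78, 85, 92, 64, 88, 71])
form_b = np.array([80, 83, 90, 67, 85, 74])

# Alternate forms reliability: correlation between the two forms
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"alternate forms reliability r = {r:.3f}")
```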
Inter-item reliability
• Inter-item reliability: The degree to which different items
measuring the same variable attain consistent results.
• Scores on different items designed to measure the same
construct should be highly correlated. Inter-item reliability
is also known as internal consistency.
• Example: Math tests often ask you to solve several
examples of the same type of problem. Your scores on
these questions will normally represent your ability to
solve this type of problem, and the test would have high
inter-item reliability.
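A common statistic for internal consistency is Cronbach's alpha, which
compares the variances of the individual items to the variance of the total
score. A minimal sketch, assuming a respondents-by-items matrix of made-up
scores:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 0-10 scores of five students on four items of the same type
scores = np.array([
    [7, 8, 7, 9],
    [4, 5, 4, 4],
    [9, 9, 8, 10],
    [6, 6, 7, 6],
    [3, 4, 3, 4],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.3f}")
```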
Inter-rater reliability
• When observers must use their own judgment to interpret
the events they observe (including live or videotaped
behaviors and written answers to open-ended interview
questions), scorer reliability must be measured.
• Have different observers take measurements of the same
responses; the agreement between their measurements is
called inter-rater reliability.
• The observers’ results can be compared statistically; the
degree of agreement represents the scorer’s reliability.
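One common way to quantify agreement between two raters on categorical
judgments is Cohen's kappa, which corrects raw agreement for the agreement
expected by chance. A minimal sketch with made-up observer codes:

```python
import numpy as np

def cohen_kappa(r1, r2) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    p_observed = np.mean(r1 == r2)
    # Chance agreement: product of each rater's marginal proportions
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical behavior codes assigned by two observers to ten video clips
rater1 = ["aggressive", "play", "play", "rest", "play",
          "aggressive", "rest", "play", "rest", "play"]
rater2 = ["aggressive", "play", "rest", "rest", "play",
          "aggressive", "rest", "play", "play", "play"]
print(f"Cohen's kappa = {cohen_kappa(rater1, rater2):.3f}")
```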
What is Validity?
• A measure is valid if it measures what it is supposed to
measure, and does so cleanly – without accidentally
including other factors.
• Most experiments are designed to measure
hypothetical constructs such as intelligence, learning,
or love. The experimenter must create an operational
definition of the dependent variable because one
cannot measure these hypothetical constructs directly.
• A valid measure is one that measures this hypothetical
construct accurately (such as intelligence) without
being influenced by other factors (such as motivation).
Types of Validity
Validity: actually measuring the variables that we
wish to study. Several types:
• Construct validity
• Face validity
• Content validity
• Criterion validity -- 2 types:
– Predictive validity
– Concurrent validity
Construct Validity
• Do my dependent variables actually measure
the hypothetical construct that I want to test?
• Does my IQ test really measure IQ, and
nothing else?
• Do my procedures actually measure learning
(without being influenced by motivation)?
• Does my personality test really measure
personality traits without including fatigue?
Face Validity
• The consensus (usually by experts in the field) that a measure
represents a particular concept. It is the least stringent type of validity.
Because most psychological variables require indirect measures (like the
intelligence example before), the validity of an operational definition may
not be self-evident.
• Does rate of eating really reflect hunger? In rats, does the rate of lever
pressing actually measure learning?
• Does talking measure extroversion?
• Does GPA or SAT score really reflect intelligence?
Comparing Face Validity with Construct Validity
• Face validity: The consensus that a measure represents
a particular concept – the face value of the measure.
(Would a 130-pound 5’3” college student be a good
football or basketball player?)
• Construct validity: The accuracy with which a measure
represents the particular concept, without influence of
additional factors. Construct validity implies that other
operational definitions of the same construct will yield
correlated results.
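That last point suggests a simple convergent check: scores from several
operational definitions of the same construct should intercorrelate. A
minimal sketch with made-up scores from three hypothetical intelligence
measures:

```python
import numpy as np

# Hypothetical scores of six people on three different measures of intelligence
iq_test = np.array([98, 110, 105, 121, 93, 130])
puzzles = np.array([12, 15, 14, 18, 11, 20])   # puzzles solved
vocab   = np.array([40, 48, 45, 55, 38, 60])   # vocabulary score

# Convergent evidence for construct validity: the measures intercorrelate
corr = np.corrcoef([iq_test, puzzles, vocab])
print(np.round(corr, 2))  # large off-diagonal entries support the construct
```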
Content Validity
• Does the content of our measure fairly reflect the
content of the thing we are measuring?
• Example: Do the questions on an exam accurately
reflect what you have learned in the course, or were
the exam questions sampled from only a sub-section of
the material?
• A test to measure your knowledge of mathematics
should not be limited to addition problems, nor should
it include questions about French literature.
• It should cover the entire range of math problems
appropriate to what you are trying to measure.
Criterion Validity
• A powerful indicator of the validity of a
measure is its ability to accurately predict
performance on other, independent outcome
measures (referred to as criterion measures).
• The extent to which your SAT score predicts
your college GPA is an indication of the SAT’s
criterion validity.
• There are two approaches to criterion validity:
Concurrent validity and Predictive validity.
Concurrent vs. Predictive Validity
• In concurrent validity, the SAT test scores and
criterion measures (high school GPA) are
obtained at roughly the same time (concurrent).
• If the SAT shows high concurrent validity, it will be
highly correlated with GPA obtained at the same
time the SAT is taken.
• Predictive validity, however, would be high if your
SAT score accurately predicted your college GPA,
which is obtained long after taking the SAT.
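Both forms of criterion validity come down to a correlation with the
criterion measure; only the timing of the criterion differs. A minimal
sketch with made-up data:

```python
import numpy as np

# Hypothetical data for six students
sat         = np.array([1100, 1350, 1250, 1480, 1010, 1400])
hs_gpa      = np.array([3.0, 3.7, 3.4, 3.9, 2.8, 3.8])  # measured at test time
college_gpa = np.array([2.9, 3.6, 3.3, 3.8, 2.7, 3.7])  # measured years later

# Concurrent validity: criterion collected at roughly the same time
print(f"concurrent r = {np.corrcoef(sat, hs_gpa)[0, 1]:.3f}")
# Predictive validity: criterion collected long after the test
print(f"predictive r = {np.corrcoef(sat, college_gpa)[0, 1]:.3f}")
```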