Transcript Slide 1

POPULATIONS AND SAMPLING
1
Let’s Look at our Example
Research Question
How do UF COP pharmacy students who
only watch videostreamed lectures differ
from those who attend class lectures
(and also have access to videostreamed
lectures) in terms of learning outcomes?
Population
Who Do You Want These Study Results
to Generalize To??
2
Population
 The group you wish to generalize to is
your population.
 There are two types:
– Theoretical population
In our example, this would be all pharmacy
students in the US
– Accessible population
In our example, this would be all COP
pharmacy students
3
Sampling
 Target population or the Sampling frame: Everyone in the accessible population from whom you can draw your sample.
 Sample: The group of people you select
to be in the study.
A subgroup of the target population
 This is not necessarily the group that is
actually in your study.
4
Sampling
How you select your sample:
Sampling Strategies (see the sketch below)
 Probability sampling
– Simple random sampling
– Stratified sampling
– Multistage cluster sampling
 Nonprobability sampling
– Convenience sampling
– Snowball sampling
5
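The practical difference between these strategies is easy to see in code. Below is a minimal Python sketch over a hypothetical sampling frame (the student names, campuses, and group sizes are invented for illustration):

import random

random.seed(42)  # reproducible illustration

# Hypothetical sampling frame: 300 students across three campuses
frame = [f"student_{i:03d}" for i in range(300)]
campus = {s: ("GNV" if i < 150 else "JAX" if i < 225 else "ORL")
          for i, s in enumerate(frame)}

# Simple random sampling: every student has an equal chance of selection
srs = random.sample(frame, 30)

# Stratified sampling: draw a proportional random sample from each campus
strata = {}
for s in frame:
    strata.setdefault(campus[s], []).append(s)
stratified = [s for members in strata.values()
              for s in random.sample(members, len(members) // 10)]

# Convenience sampling (nonprobability): whoever is easiest to reach,
# e.g. the first 30 names on the roster; prone to selection bias
convenience = frame[:30]

Simple random and stratified sampling give every student a known chance of selection; the convenience sample does not, which is why it is a nonprobability strategy.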
Sample Size
Select as large a sample as possible from your population.
With a large sample, there is less potential for the sample to differ from the population.
Sampling error: The difference between the sample estimate and the true population value (example: exam score).
6
Sample Size
 Sample size formulas/tables can be used. Factors that are considered include:
– Confidence in the statistical test
– Sampling error
 See Appendix B in Creswell (p. 630)
 Sampling error formula – used to determine sample size for a survey (worked example below)
 Power analysis formula – used to determine group size in an experimental study.
7
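As a worked example of the sampling-error approach, a commonly used formula for estimating a proportion is n = z²·p(1−p)/e², where z is the critical value for the desired confidence level, p the anticipated proportion, and e the tolerable sampling error. A short Python sketch (the defaults are illustrative, not values taken from Creswell's table):

import math

def survey_sample_size(z=1.96, p=0.5, error=0.05):
    """Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2."""
    return math.ceil(z ** 2 * p * (1 - p) / error ** 2)

# 95% confidence (z = 1.96), worst-case p = 0.5, +/-5% sampling error
print(survey_sample_size())             # 385
# Halving the tolerable error roughly quadruples the required sample
print(survey_sample_size(error=0.025))  # 1537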
Back to Our Example
 How do UF COP pharmacy students who only
watch videostreamed lectures differ from
those who attend class lectures (and also
have access to videostreamed lectures) in
terms of learning outcomes?
 What is our theoretical population?
 What is our accessible population?
 What sampling strategy should we
use?
8
Important Concept
Random sampling vs random assignment
 We have talked about random sampling in
this session.
 Random sampling is not the same as
random assignment.
 Random sampling is used to select individuals
from the population who will be in the sample.
 Random assignment is used in an
experimental design to assign individuals to
groups.
9
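A minimal Python sketch makes the distinction concrete (all names and sizes are hypothetical):

import random

random.seed(1)
population = [f"student_{i}" for i in range(300)]

# Random SAMPLING decides who enters the study at all
sample = random.sample(population, 60)

# Random ASSIGNMENT decides which group each sampled person lands in
random.shuffle(sample)
video_group, lecture_group = sample[:30], sample[30:]

Random sampling supports generalization to the population; random assignment supports causal comparison between groups.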
VALIDITY
Lou Ann Cooper, PhD
Director of Program Evaluation and Medical
Education Research
University of Florida
College of Medicine
INTRODUCTION
 Both research and evaluation include:
– Design – how the study is conducted
– Instruments – how data is collected
– Analysis of the data to make inferences about the effect of a treatment or intervention
 Each of these components can be affected by bias.
11
INTRODUCTION
 Two types of error in research
– Random error: due to random variation in participants’ responses at measurement. Inferential statistics, i.e. the p-value and 95% confidence interval, measure random error and allow us to draw conclusions based on research data (see the sketch below).
– Systematic error, or bias.
12
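As an illustration of quantifying random error, a 95% confidence interval for a mean can be computed from a sample with SciPy; the exam scores below are made up:

import numpy as np
from scipy import stats

scores = np.array([72, 85, 90, 68, 77, 88, 81, 79, 94, 70])  # made-up exam scores
mean, sem = scores.mean(), stats.sem(scores)

# 95% confidence interval from the t distribution (n - 1 degrees of freedom)
low, high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")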
BIAS: DEFINITION
 Deviations of results (or inferences)
from the truth, or processes leading to
such deviation. Any trend in the
selection of subjects, data collection,
analysis, interpretation, publication or
review of data that can lead to
conclusions that are systematically
different from the truth.
 Systematic deviation from the truth
that distorts the results of research.
13
BIAS
 Bias is a form of systematic error that can affect scientific investigations and distort the measurement process.
 Bias is primarily a function of study design and execution, not of results, and should be addressed early in the study planning stages.
 Not all bias can be controlled or eliminated; attempting to do so may limit usefulness and generalizability.
 Awareness of the presence of bias will allow more meaningful scrutiny of the results and conclusions.
 A biased study loses validity and is a common reason for invalid research.
14
POTENTIAL BIASES IN
RESEARCH AND EVALUATION
 Study Design
– Issues related to Internal validity
– Issues related to External validity
 Instrument Design
– Issues related to Construct validity
 Data Analysis
– Issues related to Statistical Conclusion validity
15
VALIDITY
Validity is discussed and applied based on two complementary conceptualizations in education and psychology:
 Test validity: the degree to which a test
measures what it was designed to measure.
 Experimental validity: the degree to which
a study supports the intended conclusion
drawn from the results.
16
FOUR TYPES OF VALIDITY QUESTIONS
 External: Can we generalize to other persons, places, times?
 Construct: Can we generalize to the constructs?
 Internal: Is the relationship causal?
 Conclusion: Is there a relationship between cause and effect?
17
CONCLUSION VALIDITY
 Conclusion validity is the degree to
which conclusions we reach about
relationships are reasonable, credible
or believable.
 Relevant for both quantitative and
qualitative research studies.
 Is there a relationship in your data or
not?
18
STATISTICAL CONCLUSION
VALIDITY
 Basing conclusions on proper use of
statistics
 Reliability of measures
 Reliability of implementation
 Type I Errors and Statistical
Significance
 Type II Errors and Statistical Power
 Fallacies of Aggregation
19
STATISTICAL CONCLUSION
VALIDITY
 Interaction and non-linearity
 Random irrelevancies in the
experimental setting
 Random heterogeneity of
respondents
20
VIOLATED ASSUMPTIONS OF
STATISTICAL TESTS
 The particular assumptions of a
statistical test must be met if the
results of the analysis are to be
meaningfully interpreted.
 Levels of measurement.
 Example: Analysis of Variance
(ANOVA)
21
LEVELS OF MEASUREMENT
 A hierarchy is implied in the idea of level of measurement.
 At lower levels, assumptions tend to be less restrictive and data analyses tend to be less sensitive.
 In general, it is desirable to have a higher level of measurement (interval or ratio) rather than a lower one (nominal or ordinal).
22
STATISTICAL ANALYSIS AND
LEVEL OF MEASUREMENT
ANALYSIS OF VARIANCE ASSUMPTIONS
 Independence of cases.
 Normality. In each of the groups, the data
are continuous and normally distributed.
 Equal variances or homoscedasticity. The variance of data in groups should be the same.
 The Kruskal-Wallis test is a nonparametric alternative that does not rely on an assumption of normality (see the sketch below).
23
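A sketch of how these assumption checks and both tests might be run with SciPy (the three groups' scores are invented):

from scipy import stats

# Invented exam scores for three instructional groups
g1 = [78, 85, 90, 72, 88, 81]
g2 = [70, 75, 68, 80, 73, 77]
g3 = [82, 79, 91, 86, 84, 88]

# Check assumptions before interpreting the ANOVA
for g in (g1, g2, g3):
    print("Shapiro-Wilk p =", stats.shapiro(g).pvalue)  # normality in each group
print("Levene p =", stats.levene(g1, g2, g3).pvalue)    # equal variances

f_stat, p_anova = stats.f_oneway(g1, g2, g3)  # one-way ANOVA

# Nonparametric fallback when the normality assumption is doubtful
h_stat, p_kw = stats.kruskal(g1, g2, g3)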
RELIABILITY
 Measures (tests and scales) of low
reliability may not register true
changes.
 Reliability of treatment
implementation – when
treatments/procedures are not
administered in a standard fashion,
error variance is increased and the
chance of obtaining true differences
will decrease.
24
TRUE POPULATION STATUS vs. STATISTICAL DECISION

                        H0 is TRUE                  H0 is FALSE
Reject H0               Type I Error (α)            Correct Decision (1 − β = Power)
Retain H0               Correct Decision (1 − α)    Type II Error (β)
25
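The α cell of this table can be demonstrated by simulation: when H0 is true, a test at α = .05 yields a false positive about 5% of the time. A Python sketch (population parameters invented):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials, false_positives = 0.05, 2000, 0

# Simulate studies in which H0 is TRUE: both groups come from one population
for _ in range(trials):
    a = rng.normal(75, 10, size=30)  # group 1 exam scores
    b = rng.normal(75, 10, size=30)  # group 2 exam scores, same distribution
    if stats.ttest_ind(a, b).pvalue <= alpha:
        false_positives += 1         # a Type I error

print(false_positives / trials)  # close to alpha = 0.05, by construction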
TYPE I ERRORS AND
STATISTICAL SIGNIFICANCE
 A Type I error is made when a
researcher concludes that there is a
relationship and there really isn’t (False
positive)
 If the researcher rejects H0 because
p ≤ .05, ask:
– If data are from a random sample, is the significance level appropriate?
– Are significance tests applied to a priori hypotheses?
– Fishing and the error rate problem
26
TYPE II ERRORS AND
STATISTICAL POWER
 A Type II error is made when a
researcher concludes that there is not a
relationship and there really is (False
negative)
 If the researcher fails to reject H0
because p > .05, ask:
– Has the researcher used statistical procedures of adequate power?
– Does failure to reject H0 merely reflect a small sample size?
27
FACTORS THAT INFLUENCE POWER
AND STATISTICAL INFERENCE
 Alpha level
 Effect size
 Directional vs. Non-directional test
 Sample size
 Unreliable measures
 Violating the assumptions of a
statistical test
28
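These factors trade off against one another, which is what a power analysis formalizes. A sketch using statsmodels, with an assumed medium effect size (Cohen's d = 0.5) chosen purely for illustration:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group n to detect a medium effect (d = 0.5) at alpha = .05, power = .80
n_two_sided = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative='two-sided')
print(round(n_two_sided))  # about 64 per group

# A directional (one-sided) test of the same effect needs fewer subjects
n_one_sided = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative='larger')
print(round(n_one_sided))  # about 50 per group (round up in practice)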
RANDOM IRRELEVANCIES
 Features of the experimental setting
other than the treatment affect scores
on the dependent variable
 Controlled by choosing settings free
from extraneous sources of variation
 Measure anticipated sources of
variance to include in the statistical
analysis
29
RANDOM HETEROGENEITY OF
RESPONDENTS
 Participants can differ on factors that
are correlated with the major
dependent variables
 Certain respondents will be more
affected by the treatment than others
 Minimized by
– Blocking variables and covariates
– Within-subjects designs
30
STRATEGIES TO REDUCE
ERROR TERMS
 Subjects as own control
 Homogeneous samples
 Pretest measures on the same scales used for measuring the effect
 Matching on variables correlated with the post-test
 Effects of other variables correlated with the post-test used as covariates
 Increase the reliability of the dependent variable measures
31
STRATEGIES TO REDUCE
ERROR TERMS
 Estimates of the desired magnitude
of a treatment effect should be
elicited before research begins
 Absolute magnitude of the treatment
effect should be presented so
readers can infer whether a
statistically reliable effect is
practically significant.
32
INTERNAL VALIDITY
 Internal validity has to do with defending
against sources of bias arising in a
research design.
 To what degree is the study designed such that we can infer that the educational intervention caused the measured effect?
 An internally valid study will minimize the influence of extraneous variables.
Example: Did participation in a series of Webinars on TB in children change the practice of physicians?
33
[Diagram: THREATS TO INTERNAL VALIDITY: History, Maturation, Testing, Instrumentation, Statistical Regression, Mortality, Selection, Interactions with Selection]
INTERNAL VALIDITY: THREATS IN
SINGLE GROUP REPEATED MEASURES
DESIGNS
 History
 Maturation
 Testing
 Instrumentation
 Mortality
 Regression
35
THREATS TO INTERNAL VALIDITY
HISTORY
 The observed effects may be due to, or be confounded with, nontreatment events occurring between the pretest and the post-test
 History is a threat to conclusions drawn from longitudinal studies
 Greater time period between measurements = more risk of a history effect
 History is not a threat in cross-sectional designs conducted at one point in time
36
THREATS TO INTERNAL VALIDITY
MATURATION
 Invalid inferences may be made when
the maturation of participants
between measurements has an effect
and this maturation is not the
research interest.
 Internal (physical or psychological)
changes in participants unrelated to
the independent variable – older,
wiser, stronger, more experienced.
37
THREATS TO INTERNAL VALIDITY
TESTING
 Reactivity as a result of testing
 The effects of taking a test on the
outcomes of a second test
 Practice
 Learning
 Improved scores on the second
administration of a test can be
expected even in the absence of
intervention due to familiarity
38
THREATS TO INTERNAL VALIDITY
INSTRUMENTATION
 Changes in instruments, observers or
scorers which may produce changes
in outcomes
 Observers/raters, through
experience, become more adept at
their task
 Ceiling and floor effects
 Longitudinal studies
39
THREATS TO INTERNAL VALIDITY
STATISTICAL REGRESSION
 Test-retest scores tend to drift
systematically to the mean rather
than remain stable or become more
extreme
 Regression effects may obscure
treatment effects or developmental
changes
 Most problematic when participants
are selected because they are
extreme on the classification variable
of interest
40
THREATS TO INTERNAL VALIDITY
MORTALITY
 Differences in drop-out rates/attrition
across conditions of the experiment
 Makes “before” and “after” samples
not comparable
 This selection artifact may become
operative in spite of random
assignment
 Major threat in longitudinal studies
41
INTERNAL VALIDITY: MULTIPLE
GROUP THREATS
 Selection
 Interactions with Selection
 Selection-History
 Selection-Maturation
 Selection-Testing
 Selection-Instrumentation
 Selection-Mortality
 Selection-Regression
42
THREATS IN DESIGNS WITH GROUPS:
SOCIAL INTERACTION THREATS
 Compensatory equalization of
treatments
 Compensatory rivalry
 Resentful demoralization
 Treatment imitation or diffusion
 Unintended treatments
43
EXTERNAL VALIDITY
 The extent to which the results of a study can be generalized
– Population validity: generalizations related to other groups of people
– Ecological validity: generalizations related to other settings, times, contexts, etc.
44
THREATS TO EXTERNAL
VALIDITY
 Pre-test treatment interaction
 Multiple treatment interference
 Interaction of selection and treatment
 Interaction of setting and treatment
 Interaction of history and treatment
 Experimenter effects
45
THREATS TO EXTERNAL
VALIDITY
 Reactive arrangements
– Artificial environment
– Hawthorne effect
  ◊ Halo effect
  ◊ John Henry effect
– Placebo effect
 Participant-researcher interaction
 Novelty effect
46
SELECTING A
RESEARCH DESIGN
Lou Ann Cooper, PhD
Director of Program Evaluation and Medical
Education Research
University of Florida
College of Medicine
What If….
We gave 150 pharmacy students (all at distance campuses) access to streaming video
and then measured their performance on a
written exam (measures achievement of
learning outcomes)…..
48
PRE-EXPERIMENTAL DESIGNS
One Group Posttest Design
X   O
X = Implementation of the treatment
O = Measurement of the participants in
the experimental group
Also referred to as ‘One Shot Case Study’
49
What If….
We gave 150 pharmacy students (all at distance campuses) access to streaming video
and then measured their performance on a
written exam (measures achievement of
learning outcomes)…..
For Discussion:
What are the threats to validity?
50
SOURCES OF INVALIDITY
One-Shot Case Study (X O)
(+ = factor controlled; − = definite weakness; ? = possible concern; blank = not relevant)

Internal                        External
History                  −      Interaction of Testing and X
Maturation               −      Interaction of Selection and X   −
Testing                         Reactive Arrangements
Instrumentation                 Multiple-X Interference
Regression
Mortality                −
Selection                −
Selection Interactions
51
What If….
We gave 150 pharmacy students (all at distance campuses) access to streaming video
and then measured their performance on a
written exam (measures achievement of
learning outcomes)…..
52
PRE-EXPERIMENTAL DESIGNS
Comparison Group Posttest Design
X   O
---------
    O

 Static Group Comparison
 Ex post facto research
 No pretest observations
53
What If….
We gave 150 pharmacy students (all at distance campuses) access to streaming video
and then measured their performance on a
written exam (measures achievement of
learning outcomes)…..
For Discussion:
What if we compare test scores for these
students with last year’s scores (assume
last year had no streaming video)?
54
SOURCES OF INVALIDITY
Static Group Comparison (X O / O)

Internal                        External
History                  +      Interaction of Testing and X
Maturation               ?      Interaction of Selection and X   −
Testing                  +      Reactive Arrangements
Instrumentation          +      Multiple-X Interference
Regression               +
Mortality                −
Selection                −
Selection Interactions   −
55
What If….
We gave 150 pharmacy students (all at distance campuses) access to streaming
video and measured their performance on
a written exam both before and after the
intervention (measures achievement of
learning outcomes)…..
56
PRE-EXPERIMENTAL DESIGNS
One Group Pretest/Posttest Design
O   X   O

 Not a true experiment
 Because participants serve as their own control, results may be less biased
57
What If….
We gave 150 pharmacy students (all at distance campuses) access to streaming video
and measured their performance on a
written exam both before and after the
intervention (measures achievement of
learning outcomes)…..
For Discussion:
What are the threats to validity? (What are the plausible hypotheses that could explain any difference?)
58
SOURCES OF INVALIDITY
One Group Pretest/Posttest Design (O X O)

Internal                        External
History                  −      Interaction of Testing and X     −
Maturation               −      Interaction of Selection and X   −
Testing                  −      Reactive Arrangements            ?
Instrumentation          −      Multiple-X Interference
Regression               ?
Mortality                +
Selection                +
Selection Interactions   −
59
What If….
We could randomize all 300 pharmacy students
to the following groups:
Group 1: access only streaming video
Group 2: attend lectures
For each group, administer both a pre-test and
a post-test
60
TRUE EXPERIMENTAL DESIGNS
Pretest/Posttest Design with Control Group
and Random Assignment
R   O   X   O
--------------
R   O       O
 Measurement of pre-existing differences
 Controls most threats to internal validity
61
What If….
We could randomize all 300 pharmacy
students to the following groups:
Group 1: access only streaming video
Group 2: attend lectures
For each group, administer both a pre-test
and a post-test
For Discussion:
What are the threats to validity? (What are the plausible hypotheses that could explain any difference?)
62
SOURCES OF INVALIDITY
Pretest/Posttest Control Group Design (R O X O / R O O)

Internal                        External
History                  +      Interaction of Testing and X     −
Maturation               +      Interaction of Selection and X   ?
Testing                  +      Reactive Arrangements            ?
Instrumentation          +      Multiple-X Interference
Regression               +
Mortality                +
Selection                +
Selection Interactions   +
63
What If….
We could randomize all 300 pharmacy students
to the following groups:
Group 1: access only streaming video and post-test
Group 2: attend lectures and post-test
64
TRUE EXPERIMENTAL DESIGNS
Posttest Only Control Group
R   X   O
----------
R       O
65
What If….
We could randomize all 300 pharmacy students
to the following groups:
Group 1: access only streaming video and post-test
Group 2: attend lectures and post-test
For Discussion:
What have we lost by not using a pre-test? (as
compared to the experimental randomized pre-test
and post-test design)
66
SOURCES OF INVALIDITY
Posttest Only Control Group Design (R X O / R O)

Internal                        External
History                  +      Interaction of Testing and X     +
Maturation               +      Interaction of Selection and X   ?
Testing                  +      Reactive Arrangements            ?
Instrumentation          +      Multiple-X Interference
Regression               +
Mortality                +
Selection                +
Selection Interactions   +
67
What If….
We could randomize all 300 pharmacy students
to the following groups:
Group 1: pre-test, access only streaming video, and
post-test
Group 2: pre-test, attend lectures, and post-test
Group 3: access only streaming video and post-test
only
Group 4: attend lectures and post-test only
68
TRUE EXPERIMENTAL DESIGNS
Solomon Four Group Comparison
R   O   X   O
R   O       O
R       X   O
R           O
69
What If….
We could randomize all 300 pharmacy students to
the following groups:
Group 1: pre-test, access only streaming video, and post-test
Group 2: pre-test, attend lectures, and post-test
Group 3: access only streaming video and post-test only
Group 4: attend lectures and post-test only
For Discussion:
What have we gained by having 4 groups (especially Groups 3 and 4)?
70
What If….
It is NOT feasible to use randomization. What if we
were to have the following groups:
Group 1 (all distance campuses): access only streaming video
Group 2 (GNV campus): attend lectures
For each group, administer both a pre-test and a
post-test
71
QUASI-EXPERIMENTAL DESIGNS
Nonequivalent Control Group
O   X   O
----------
O       O
 Pre-existing differences can be measured
 Controls some threats to validity
72
What If….
It is NOT feasible to use randomization. What if we
were to have the following groups:
Group 1 (all distance campuses): access only streaming video
Group 2 (GNV campus): attend lectures
For each group, administer both a pre-test and a
post-test
For Discussion:
What have we “lost” by not randomizing?
73
SOURCES OF INVALIDITY
Nonequivalent Control Group Design (O X O / O O)

Internal                        External
History                  +      Interaction of Testing and X     −
Maturation               +      Interaction of Selection and X   ?
Testing                  +      Reactive Arrangements            ?
Instrumentation          +      Multiple-X Interference
Regression               ?
Mortality                +
Selection                +
Selection Interactions   −
74
QUASI-EXPERIMENTAL DESIGNS
Time Series
O   O   O   O   X   O   O   O   O
75
QUASI-EXPERIMENTAL DESIGNS
Counterbalanced Design
Group 1:   X1 O   X2 O   X3 O
Group 2:   X3 O   X1 O   X2 O
Group 3:   X2 O   X3 O   X1 O
76
MEASUREMENT VALIDITY:
SOURCES OF EVIDENCE
CLASSIC VIEW OF TEST
VALIDITY
Traditional triarchic view of validity
 Content
 Criterion
– Concurrent
– Predictive
 Construct
 Tests were described as “valid” or “invalid”
 Reliability was considered a separate test trait
78
MODERN VIEW OF VALIDITY
 Scientific evidence needed to support test score interpretation
 Standards for Educational & Psychological Testing (1999)
 Cronbach, Messick, Kane
 Some theory, key concepts, examples
 Reliability as part of validity
79
VALIDITY: DEFINITIONS
“A proposition deserves some degree of trust only
when it has survived serious attempts to falsify it.”
(Cronbach, 1980)
According to the Standards, validity refers to “the
appropriateness, meaningfulness, and usefulness of
the specific inferences made from test scores.”
“Validity is an integrative summary.” (Messick, 1995)
“Validation is the process of building an argument
supporting interpretation of test scores.” (Kane,
1992)
80
WHAT IS A CONSTRUCT?
 Constructs are psychological attributes, hypothetical concepts
 A defensible construct has
– A theoretical basis
– Clear operational definitions involving measurable indicators
– Demonstrated relationships to other constructs or observable phenomena
 A construct should be differentiated from related theoretical constructs as well as from methodological irrelevancies
81
THREATS TO CONSTRUCT
VALIDITY (Cook & Campbell)
 Inadequate preoperational explication
of constructs
 Mono-operation bias
 Mono-method bias
 Interaction of different treatments
 Interaction of testing and treatment
 Restricted generalizability across
constructs
82
THREATS TO CONSTRUCT
VALIDITY (Cook & Campbell)
 Confounding constructs
 Confounding levels of constructs
 Hypothesis guessing within
experimental conditions
 Evaluation apprehension
 Researcher expectancies
83
SOURCES OF VALIDITY EVIDENCE
1. Test Content
Task Representation
Construct Domain
2. Response Process – Item Psychometrics
3. Internal Structure – Test Psychometrics
4. Relationships with Other Variables – Correlations
Test-Criterion Relationships
Convergent and Divergent Data
5. Consequences of Testing – Social context
Standards for Educational and Psychological
Testing, 1999
84
ASPECTS OF VALIDITY: CONTENT
 Content validity refers to how well elements of the test or scale relate to the content domain.
– Content relevance.
– Content representativeness.
– Content coverage.
 Systematic analysis of what the test is intended to measure.
– Technical quality.
– Construct-irrelevant variance.
85
SOURCES OF VALIDITY EVIDENCE:
TEST CONTENT
Detailed understanding of the content sampled by the instrument and its relationship to the content domain
 Content-related evidence is often established during the planning stages of an assessment or scale.
 Content-related validity studies
– Exact sampling plan, table of specifications, blueprint
– Representativeness of items/prompts → Domain
– Appropriate content for instructional objectives
  ◊ Cognitive level of items
  ◊ Match to instructional objectives
– Review by panel of experts
  ◊ Content expertise of item/prompt writers
  ◊ Expertise of content reviewers
– Quality of items/prompts, sensitivity review
86
ASPECTS OF VALIDITY:
RESPONSE PROCESSES
 Emphasis is on the role of theory.
 Tasks sample domain processes as
well as content.
 Accuracy in combining scores from
different item formats or subscales.
 Quality control – scanning,
assignment of grades, score reports.
87
SOURCES OF VALIDITY EVIDENCE:
RESPONSE PROCESSES
Fit of student responses to the hypothesized construct?
 Basic quality control information – accuracy of item responses, recording, data handling, scoring
 Statistical evidence that items/tasks measure the intended construct
– Achievement items measure intended content and not other content
– Ability items predict the targeted achievement outcome
– Ability items fail to predict a non-related ability or achievement outcome
88
SOURCES OF EVIDENCE: RESPONSE
PROCESSES
 Debrief examinees regarding solution
processes.
 “Think-aloud” during pilot testing.
 Subscore/subscale analyses, i.e., correlation patterns among part scores.
 Accurate and understandable
interpretations of scores for
examinees.
89
SOURCES OF VALIDITY EVIDENCE:
INTERNAL STRUCTURE
Statistical evidence of the hypothesized relationship between test item scores and the construct
 Reliability
– Test scale reliability
– Rater reliability
– Generalizability
 Item analysis data
– Item difficulty and discrimination
– MCQ option function analysis
– Inter-item correlations
 Scale factor structure
 Dimensionality studies
 Differential item functioning (DIF) studies
90
ASPECTS OF VALIDITY:
EXTERNAL
Can the test results be evaluated by objective criteria?
 Correlations with other relevant variables
 Test-criterion correlations
– Concurrent or predictive
 MTMM matrix
– Convergent correlations
– Divergent (discriminant) correlations
91
SOURCES OF VALIDITY EVIDENCE:
RELATIONSHIPS TO OTHER VARIABLES
Statistical evidence of the hypothesized relationship between test scores and the construct
 Criterion-related validity studies
– Correlations between test scores/subscores and other measures
 Convergent-Divergent studies
– MTMM
92
RELATIONSHIPS WITH OTHER
VARIABLES
Predictive validity: A variation of criterion-related validity in which the criterion lies in the future.
The classic example is to determine whether students who score high on an admissions test such as the MCAT earn higher preclinical GPAs.
93
RELATIONSHIPS WITH OTHER
VARIABLES
Convergent validity: Assessed by the correlation among items that make up the scale (internal consistency), by the correlation of the given scale with measures of the same construct using instruments proposed by other researchers, and by the correlation of relationships involving the given scale across samples or across methods.
94
RELATIONSHIPS WITH OTHER
VARIABLES
Criterion (concurrent) validity:
correlation between scale or
instrument measurement items and
known accepted standard measures
or criteria.
Do the proposed measures for a given
concept exhibit generally the same
direction and magnitude of
correlation with other variables as
measures of that concept already
accepted in this area of research?
95
RELATIONSHIPS WITH OTHER
VARIABLES
Divergent (discriminant) validity: The indicators of different constructs should not be so highly correlated as to lead us to conclude that they measure the same thing. This would happen if there is definitional overlap between two constructs.
96
MULTI-TRAIT MULTI-METHOD
MTMM MATRIX
 Mono-operation and/or mono-method biases – use of a single data gathering method or a single indicator for a concept may result in bias
 Multi-trait/Multi-method validation
uses multiple indicators per concept
and gathers data for each indicator
by multiple methods or multiple
sources.
97
MULTI-TRAIT MULTI-METHOD
MTMM MATRIX
Validity of Index of Learning Styles scores: multitrait−multimethod comparison with three cognitive learning style instruments. Cook DA, Smith AJ. Medical Education 2006; 40: 900-907.
ILS = Index of Learning Styles; LSTI = Learning Style Type Indicator

                       ILS                              LSTI
             ActRefl SensInt VisVerb SeqGlob  ExtInt SensInt ThinkFeel JudPer
ILS  ActRefl    0.75
     SensInt   -0.15   0.81
     VisVerb    0.03   0.18    0.60
     SeqGlob   -0.32  -0.48   -0.12   0.81
LSTI ExtInt     0.60  -0.11    0.05  -0.43    0.54
     SensInt   -0.22   0.69   -0.02   0.54   -0.18   0.69
     ThinkFeel  0.02  -0.09    0.09   0.10    0.01   0.02    0.19
     JudPer    -0.27   0.46   -0.18   0.39   -0.33   0.51   -0.01   0.50

Key: diagonal = reliability (monotrait-monomethod); within-method triangles = heterotrait-monomethod; same trait across methods = validity diagonal; remaining entries = heterotrait-heteromethod.
98
MULTI-TRAIT MULTI-METHOD
MTMM MATRIX
(Same matrix as the previous slide, with the reliability diagonal, heterotrait-monomethod triangles, validity diagonal, and heterotrait-heteromethod blocks highlighted.)
99
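Once an MTMM matrix is in a labeled data frame, its diagnostic regions are easy to pull out. A toy Python/pandas sketch with two traits and two methods (the correlations are invented, not the Cook & Smith values):

import pandas as pd

# Toy MTMM matrix: two traits (A, B), each measured by two methods (m1, m2)
labels = ["A_m1", "B_m1", "A_m2", "B_m2"]
mtmm = pd.DataFrame(
    [[0.85, 0.30, 0.60, 0.15],
     [0.30, 0.80, 0.10, 0.55],
     [0.60, 0.10, 0.75, 0.25],
     [0.15, 0.55, 0.25, 0.70]],
    index=labels, columns=labels)

# Reliability diagonal (monotrait-monomethod): same trait, same method
reliability = [mtmm.loc[l, l] for l in labels]

# Validity diagonal (monotrait-heteromethod): same trait, different methods;
# convergent evidence, so these should be high
convergent = [mtmm.loc["A_m1", "A_m2"], mtmm.loc["B_m1", "B_m2"]]

# Heterotrait entries should be lower than the validity diagonal
# (discriminant evidence)
discriminant = [mtmm.loc["A_m1", "B_m2"], mtmm.loc["B_m1", "A_m2"]]

print(reliability, convergent, discriminant)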
RELATIONSHIP BETWEEN
RELIABILITY AND VALIDITY
 Neither is a property of a test or scale.
 Reliability is important validity evidence.
 Without reliability, there can be no validity.
 Reliability is necessary, but not sufficient, for validity.
 Purpose of an instrument dictates what type of reliability is important and the sources of validity evidence necessary to support the desired inferences.
100
SOURCES OF VALIDITY EVIDENCE:
CONSEQUENCES
Evidence of the effects of tests on students, instruction, schools, society
 Consequential validity
– Social consequences of assessment
 Effects of passing-failing tests
– Economic costs of failure
– Costs to society of false positive/false negative decisions
 Effects of tests on instruction/learning
– Intended vs. unintended
101
RELIABILITY AND
INSTRUMENTATION
Lou Ann Cooper, PhD
Director of Program Evaluation and Medical
Education Research
University of Florida
College of Medicine
TYPES OF RELIABILITY
Different types of assessments require different kinds of reliability
 Written MCQ/Likert-scale items
– Scale reliability
– Internal consistency
 Written Constructed Response and Essays
– Inter-rater agreement
– Generalizability theory
103
TYPES OF RELIABILITY
 Oral Exams
– Rater reliability
– Generalizability Theory
 Observational Assessments
– Rater reliability
– Inter-rater agreement
– Generalizability Theory
 Performance Exams (OSCEs)
– Rater reliability
– Generalizability Theory
104
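For the rater-based formats above, inter-rater agreement on categorical ratings is commonly summarized with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch using scikit-learn (the ratings are invented):

from sklearn.metrics import cohen_kappa_score

# Two raters scoring the same ten oral-exam performances (invented ratings)
rater_a = ["pass", "pass", "fail", "pass", "fail",
           "pass", "pass", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail",
           "pass", "pass", "pass", "pass", "pass"]

# Kappa corrects raw percent agreement for agreement expected by chance
print(round(cohen_kappa_score(rater_a, rater_b), 2))  # 0.47 for these ratings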
ROUGH GUIDELINES FOR
RELIABILITY
 The higher the better!
 Depends on purpose of test
– Very high-stakes: > 0.90 (Licensure exams)
– Moderate stakes: at least ~0.75 (Classroom test, Medical school OSCE)
– Low stakes: > 0.60 (Quiz, test for feedback only)
105
INCREASING RELIABILITY
 Written tests
– Use objectively scored formats
– At least 35-40 MCQs
– MCQs that differentiate between high and low scorers
 Performance exams
– At least 7-12 cases
– Well trained standardized patients and/or other raters
– Monitoring and quality control
 Observational Exams
– Many independent raters (7-11)
– Standard checklists/rating scales
– Timely ratings
106
SCALE DEVELOPMENT
1. Identify the primary purpose for which
scores will be used.
 Validity is the most important
consideration.
 Validity is not a property of an
instrument.
 Inferences to be made determine the
type of items you will write.
2. Specify the important aspects of the
construct to be measured.
107
SCALE DEVELOPMENT
3. Initial pool of items.
4. Expert review (content
validity)
5. Preliminary item ‘tryout’
6. Statistical properties of
the items
 Item analysis
 Reliability estimate
 Dimensionality
108
ITEM ANALYSIS
 Item ‘difficulty’ – item variance, frequencies
 Inter-item covariances/correlations
 Item discrimination – an item that discriminates well correlates with the total score
 Cronbach’s coefficient alpha (see the sketch below)
 Factor Analysis – Multidimensional Scaling
 IRT
 Structural aspect of validity
109
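Cronbach's coefficient alpha from the list above can be computed directly from a respondents-by-items score matrix as α = (k/(k−1))·(1 − Σ item variances / total-score variance). A minimal NumPy sketch (responses invented):

import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array-like, rows = respondents, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)   # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Five respondents x four Likert items (invented responses)
data = [[4, 5, 4, 5],
        [3, 3, 4, 3],
        [5, 5, 5, 4],
        [2, 3, 2, 3],
        [4, 4, 5, 4]]
print(round(cronbach_alpha(data), 2))  # about 0.91 for this toy matrix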
NEED TO EVALUATE SCALE
 Jarvis & Petty (1996)
 Hypothesis: Individuals differ in the
extent to which they engage in
evaluative responding.
 Subjects were undergraduate
psychology students.
 Comprehensive reliability and
validity studies.
 Concluded the scale was
‘unidimensional’.
110
REFERENCES
Cook, T. D., & Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings.
Downing, S. M. (2002). Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv in Health Sci Educ, 7, 235-241.
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Med Educ, 37, 830-837.
Downing, S. M. (2004). Reliability: On the reproducibility of assessment data. Med Educ, 38, 1006-1012.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.).
http://www.socialresearchmethods.net
111