The Art of Interpreting Test Results


So Much Data, So Little Time:
What Parents and Teachers Need to Know
About Interpreting Test Results
Lee Ann R. Sharman, M.S.
ORBIDA Lecture Series
April 13, 2010
You’re Not Paying Attention!

By the end of our session, you will understand these key terms/concepts:
◦ Different strokes for different folks: all tests are
not equal!
◦ Basic statistics you MUST know
 Reliability and validity
 The Bell Curve – a beautiful thing
◦ Error, and why it matters
◦ Common mistakes that lead to poor decisions
 Entitlement decisions (eligibility)
 Skills assessment (diagnostic)
 Screening and Progress Monitoring (RtI)
 Instructional planning, accommodations and modifications
 Curriculum evaluation – is it working?
 Increased focus on data-based decision making to measure outcomes

Working smarter: ask the right questions!
Understand the differences between types of tests and
what they were designed to measure:
◦ Curriculum-based measures (DIBELS, AIMSweb)
◦ Teacher-made criterion-referenced tests
◦ Published criterion-referenced tests
◦ Norm-referenced tests of Achievement
 OAKS
 Woodcock-Johnson III
◦ Norm-referenced tests of Cognitive Ability
The test you choose depends on what questions you want to answer.

 School records (file reviews)
 Interviews
 Medical and Developmental histories
 Error analyses
 Use of portfolios
 Observations
The Snapshot: Point in Time Performance
Measuring Improvement (Change) and
Growth
…You can use a hammer to push in a screw, but a
screwdriver will be easier and more efficient
What OAKS is:
• OAKS is a “Point in Time” measurement, intended to be used more as a Summative Assessment. It’s a SNAPSHOT.
• Gives stakeholders information on group achievement toward state standards: “Are enough students in our district meeting benchmarks?”

What OAKS is NOT:
 OAKS is not intended to give information (see
OARs) that will inform instruction or
interventions
 A tool designed for Progress Monitoring
 A measure of aptitude or ability
 A comprehensive measure of identified
content

Response to Intervention – RtI
◦ All models involve tiers of interventions, progress
monitoring, and cut scores to determine who is a
“responder” (or not).
◦ DIBELS is a commonly used tool for progress monitoring


 A few different models exist, but in this case we refer to measurement of the cognitive abilities underlying areas of unexpected low academic achievement
 Specific cognitive abilities (processing measures, e.g. Rapid Automatic Naming, Phonemic Awareness, Long-Term Retrieval) predict the acquisition of reading, writing, and math skills
1. WHAT KIND OF TEST IS THIS? (e.g., norm- or criterion-referenced)
2. What is it used for – the purpose?
3. Is it valid for the stated purpose (what it measures or doesn’t measure)?
4. Is the person administering the test a qualified administrator?
5. Are the results valid (test conditions optimal, etc.)?

 Parental permission…true informed consent
 Screening for sensory impairments or physical problems
 File review of school records
 Parent/caregiver interview
 Documented interventions and quality instruction
 Intellectual and academic assessment
 Behavioral assessment or observation
 Summary and recommendations
I. Identifying data – the Who, What, When
II. Background Information
    A. Student history
    B. Reason for Referral
    C. Classroom Observation
    D. Parent Information
    E. Instruction received/strategies implemented
III. Test Results
IV. Test Interpretation
V. Summary and Conclusions
    A. Summary
    B. Recommendations for instruction
    C. Recommendations for further assessment, if needed

Must haves:
◦ Skilled examiner
◦ Optimal test conditions
◦ Cultural bias – be aware
◦ Validity/reliability
◦ Appropriate measures for goal
Kids are more than the scores – the “rule
outs”:
◦ Home/Environmental issues
◦ Sensory acuity problems
◦ Previous educational history
◦ Language factors
 Second language and/or language disorders
◦ Social/Emotional/Behavioral issues

 The Matthew Effect – poor reading skills depress IQ scores (Stanovich)… “The rich get richer”
 The Flynn Effect – IQ is increasing in the population over time; tests are renormed to reflect this phenomenon

The devil is in those details…learn the basic principles of statistics.
Simply stated: statistics are used to measure things and describe relationships between things, using numbers.
1. Standard Scores (SS) and Scaled Scores (ss)
2. Percentile Ranks (% rank)
3. Age and Grade Equivalents (AE/GE)
4. Relative Proficiency Index (RPI)
The Bell Curve, OR, The Normal Frequency Distribution
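As a quick numeric sketch of what the normal frequency distribution implies (these are standard properties of the normal curve, not figures from any particular test), the share of scores falling within k standard deviations of the mean can be computed directly:

import math

def share_within_k_sd(k):
    """Share of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within +/-{k} SD: {share_within_k_sd(k):.1%}")
# within +/-1 SD: 68.3%, +/-2 SD: 95.4%, +/-3 SD: 99.7%

This is why a standard score one SD above the mean (e.g., SS 115) lands near the 84th percentile: the 50% below the mean, plus half of the 68% middle band.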



 The mean and standard deviation of the test used, reported
 Standard scores, percentile ranks, and standard errors of measurement, with explanations of each
 Both composite or broad scores and subtest scores, with an explanation of each

 Information about developmental ceilings, functional levels, skill sequences, and instructional needs, from which assessment/curriculum linkages can be drawn to write the IEP goals

These are raw scores which have been transformed to have a given mean (average) and standard deviation (a set range or unit of scores). The student’s test score is compared to that average. A standard score expresses how far a student’s score lies above or below the average of the total distribution of scores (a conversion sketched after the list below).

 Composite or Cluster scores
 Standard or scaled scores
 Raw scores
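A minimal sketch of that transformation, assuming the conventional scales (standard scores with mean 100/SD 15, scaled scores with mean 10/SD 3); the raw-score norms below are hypothetical:

def to_standard_score(raw, norm_mean, norm_sd, target_mean=100.0, target_sd=15.0):
    """Transform a raw score: express it as a z-score (SD units from the
    norm group's mean), then rescale to the target mean and SD."""
    z = (raw - norm_mean) / norm_sd
    return target_mean + z * target_sd

# Hypothetical norm group: mean raw score 42, SD 8.
# A raw score of 50 is exactly +1 SD above that mean.
print(to_standard_score(50, 42, 8))                               # 115.0 (SS scale)
print(to_standard_score(50, 42, 8, target_mean=10, target_sd=3))  # 13.0 (scaled-score scale)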


 Similar to SS, but in a different form. Allows us to determine a student’s position (relative ranking) compared to the standardized sample.
 Percentile rank is NOT the same as a percent score! PR refers to a percentage of persons; PC refers to a percentage of test items correct (illustrated below).
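A minimal sketch of the distinction, using a hypothetical 20-item test and a hypothetical norm sample; the numbers are invented to show how sharply the two statistics can diverge:

from bisect import bisect_left

def percentile_rank(score, norm_sample):
    """Percent of persons in the norm sample scoring below this score."""
    ordered = sorted(norm_sample)
    return 100.0 * bisect_left(ordered, score) / len(ordered)

# Hypothetical norm sample's raw scores on a 20-item test:
norms = [14, 15, 16, 16, 17, 18, 18, 19, 19, 20]

student = 12                            # items the student answered correctly
print(100.0 * student / 20)             # 60.0 -> percent correct (PC)
print(percentile_rank(student, norms))  # 0.0  -> percentile rank (PR)

Here 60% of items correct still places the student below everyone in the norm group, so the PR is near zero; the two numbers answer different questions.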

 A valuable statistic, found only on the WJ-III
◦ Written as a percentage, or a number out of 90, indicating the student’s expected percent of proficiency on similar tasks on which students in the comparison group would have 90% success. Correlated with Independent, Instructional, and Frustration levels (see sample)




 Making faulty comparisons: compare only data sets measuring the same content, with good content/construct validity, that are NORMED ON THE SAME POPULATION
 Using an AE/GE as a measure of the child’s proficiency/skill mastery of grade-level material
 Error exists! Don’t forget about the confidence intervals (see the sketch after this list)
◦ SEM creates uncertainty around reporting 1 number
 Confusing Percentile RANKS with Percentages:
◦ PR = relative ranking out of 100
◦ Percentage = percentage correct
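A minimal sketch of the third point, using the classical psychometric formula SEM = SD × √(1 − reliability); the reliability, SD, and observed score below are hypothetical:

import math

def sem(sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band (z = 1.96) around one observed score."""
    e = z * sem(sd, reliability)
    return observed - e, observed + e

# Hypothetical: SS scale (SD 15), test reliability .90, observed score 85.
print(round(sem(15, 0.90), 1))      # 4.7
low, high = confidence_band(85, 15, 0.90)
print(round(low), "-", round(high)) # 76 - 94

Reporting the single number 85 hides a band roughly 76–94 wide, which is exactly why confidence intervals belong in the report.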


 Age equivalents are developed by figuring out what the average test score is (the mean) for a group of children of a certain age taking the test; not the same as skills
 Grade equivalents are developed by figuring out what the average test score is (the mean) for a student in each grade (a derivation sketched below)
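A minimal sketch of that derivation, interpolating between hypothetical mean raw scores per grade (real norm tables are built from large samples with smoothing):

def grade_equivalent(raw, norm_means):
    """Interpolate a grade equivalent from {grade: mean raw score} pairs."""
    pts = sorted(norm_means.items())  # ascending by grade
    if raw <= pts[0][1]:
        return pts[0][0]
    if raw >= pts[-1][1]:
        return pts[-1][0]
    for (g0, s0), (g1, s1) in zip(pts, pts[1:]):
        if s0 <= raw <= s1:
            return g0 + (g1 - g0) * (raw - s0) / (s1 - s0)

# Hypothetical mean raw scores by grade:
norms = {2.0: 18, 3.0: 26, 4.0: 33, 5.0: 38}
print(round(grade_equivalent(29, norms), 1))  # 3.4

Note the caution above: a GE of 3.4 means only that the raw score matches the average third-grader’s score; it says nothing about mastery of third-grade skills.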






 Commonly used
 Misleading
 Misunderstood
 Difficult to explain
 May have little relevance
 Avoid in favor of Standard Scores/%Ranks

“When assessed with teacher made tests, Sally
locates information within the text with 60%
accuracy.”
VS.

“Sally’s performance on the OLSAT falls at the
60th %ile rank.”

Are the student’s skills better developed in
one part of a domain than another?
For example:
“While Susan’s Broad Math score was within the low average range, she performed at an average level on a subtest that assesses problem solving, but scored well below average on a subtest that assesses basic math calculation.”

Test scores don’t support the teacher report
of a weakness?
◦ First, look at differences in task demands of the
testing situation, and in the classroom, when
hypothesizing a reason for the difference.
◦ Look at student’s Proficiency score (RPI) vs.
Standard Score (SS)

“Although weaknesses in mathematics were
noted as a concern by Billy’s teacher, Billy
scored in the average range on assessments
of math skills. These tests required Billy to
perform calculations and to solve word
problems that were read aloud to him. It
was noted he often paused for 10 seconds
or more before starting paper and pencil
tasks in mathematics.”

“Billy’s teacher stated that he does well in
spelling. However, he scored well below
average on a subtest of spelling skills. Billy
appeared to be bored while taking the
spelling test, so a lack of vigilance in his
effort may have depressed his score. Also,
the school spelling tests use words he has
been practicing for a week.

The lower score may indicate that he is maintaining the correct spelling of studied words in long-term memory, but is not able to correctly encode new words he has not had time to study.”
1. WHAT KIND OF TEST IS THIS? (e.g., norm- or criterion-referenced)
2. What is it used for – the purpose?
3. Is it valid for the stated purpose (what it measures or doesn’t measure)?
4. Is the person administering the test a qualified administrator?
5. Are the results valid (test conditions optimal, etc.)?



 The good news: we are moving away from the old “Test and Place” mentality.
 The challenge: school teams are using more comprehensive data sets, which require more knowledge to interpret.
 More good news: the best decisions are made using multiple sources of good information.

“The true utility of assessment is the extent
to which it enables us to find the match
between the student and an intervention that
is effective in getting him or her on track to
reach a meaningful and important goal. The
true validity of any assessment tool should
be evaluated by the impact it has on student
outcomes.”
(Cummings/McKenna 2007)

“…one of the problems of writing about
intelligence is how to remind readers often
enough how little an IQ score tells you about
whether or not the human being next to you
is someone whom you will admire or cherish.”
(Herrnstein and Murray)