
Update on MCAS: Is it
Working? Is it Fair?
Ronald K. Hambleton
University of Massachusetts at
Amherst
EPRA Seminar, November 5, 2005. (revised)
Purposes
 Address some of the misconceptions that exist about the MCAS.
 In addressing the misconceptions, provide some insights about MCAS, and answer questions about its level of success and its fairness.
 Share some of my own concerns about MCAS and next steps.
General Goals of State Testing
Programs Like MCAS
 Provide students, their parents, and teachers with feedback on student educational progress in relation to state curricula.
 Compile data for monitoring progress or change in student performance over time.
 Provide data for educational accountability (e.g., NCLB legislation).
Characteristics of MCAS Assessments
 English Language Arts and Mathematics assessments at several grades.
 Science, social studies/history assessments, and second language proficiency are coming in Massachusetts.
 Multiple-choice (60%) and performance tasks (40%).
 Assessments include a core of items for student reporting, and other items (for field-testing, curriculum evaluation, and linking test forms over time).
 Performance standards set by educators.
MCAS is not just about testing!
It is about:
-- substantially increased funding for education
-- curriculum reform
-- integrating curricula, instruction, and assessment
-- improving administrator and teacher training, and educational facilities
-- addressing special needs of students
Next, let’s consider six common criticisms of MCAS.
1. State tests encourage teachers to “teach to the test,” and this narrows the curriculum taught.
 This is potentially a valid concern. It is a real problem with NRTs: only 10 to 20% coverage of the curricula, MC items only, and the same skills and items assessed each year. Here, teaching narrowly to the skills and content of the test improves test scores but not learning of the broader curricula. But MCAS assessments are not NRTs!
1. State tests encourage teachers to “teach to the test,” and this narrows the curriculum taught.
 MCAS assessments are representative of the curricula, and new items are used each year. What does “teaching to the test” mean when the tests are a sample of the curricula? Teach the curricula!
 Consider the next set of displays: 85% or more of MCAS curricula are assessed in every three-year cycle.
[Figure: Comparison of the percent of learning standards assessed in Mathematics at grades 4, 6, 8, and 10 from 2001 to 2004 (about 40 learning standards per grade); y-axis is % of learning standards assessed, 0 to 100.]
[Figure: Percent of learning standards assessed in Mathematics at grades 4, 6, 8, and 10 in the time periods 2001 to 2003 and 2002 to 2004; y-axis is % of learning standards assessed, 0 to 100.]
In sum, there is no justification for narrowing the curricula: the assessments are representative of the curricula, and over three-year periods, over 85% of learning standards are being assessed. (Results at all grades and subjects are equally good; a short sketch of the coverage computation appears below.) Teaching to the test/assessment means teaching the curricula!
 Other states, too (e.g., Minnesota), have found that when tests and curricula are aligned, teachers are considerably more supportive of educational assessments. Teachers need to see these alignment results in Massachusetts!
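To make the three-year coverage claim concrete, here is a minimal Python sketch of how coverage over a cycle can be computed as the union of the learning standards assessed each year. The standard IDs and yearly samples below are made up for illustration; only the rough count of about 40 standards per grade comes from the slides.

    # year -> set of learning-standard IDs assessed that year (hypothetical data)
    assessed = {
        2001: {f"LS{i:02d}" for i in range(1, 18)},
        2002: {f"LS{i:02d}" for i in range(12, 28)},
        2003: {f"LS{i:02d}" for i in range(22, 37)},
    }
    all_standards = {f"LS{i:02d}" for i in range(1, 41)}    # roughly 40 per grade

    covered = set().union(*assessed.values())               # union over the cycle
    pct = 100 * len(covered) / len(all_standards)
    print(f"{pct:.0f}% of learning standards assessed in 2001-2003")

With these toy numbers the union covers 36 of 40 standards, or 90%, which is the kind of cumulative figure the displays above report.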
2. Important decisions about students should not turn on one test.
 AERA, APA, and NCME Test Standards highlight the importance of measurement error and the undesirability of a single test driving an important decision.
 Reality: State tests (e.g., grade 10 ELA and Math) are not the only requirements for students. English, mathematics, science, and history credits at the HS level are required, as well as regular attendance.
2. Important decisions about students
should not turn on one test.
 Students have five chances to pass the grade 10 ELA and math assessments during their high school years.
 Appeals process (for students close to the passing score who are attending school regularly and taking the assessments).
 Math assessment is available in Spanish too (to reduce bias).
2. Important decisions about students
should not turn on one test.
DOE is expecting schools to be doing their own assessments too (using multiple methods, such as projects, portfolios, work samples, classroom tests, etc.). I think one might question high school grading practices at grades 11 and 12 if a grade 10 test is a major block to graduation.
 In sum, the criticism does not have merit.
3. State assessments are full of
flawed and/or biased test items.
Item writing is not a perfect science, and mistakes will be made.
 Massachusetts releases all operational items on the web site shortly after their use. Find the flawed items if you can. (I can’t, and I have looked seriously.) This is a remarkable situation: few states release items, and doing so is excellent for instructional purposes and for critics. If critics think items are flawed, they should report them.
3. State assessments are full of
flawed and/or biased test items.
 The process of preparing items in Massachusetts is state-of-the-art: qualified and culturally diverse item writers; content and bias reviews by committees, the department, and contractors; field testing; study of statistical evidence for bias; and care in item selection (optimal statistically and content valid).
3. State assessments are full of
flawed and/or biased test items.
 UMass has looked at over 1,000 items over years, grades, and tests, and found little statistical evidence for gender and racial bias.
 I just don’t see the merit of this criticism, and I have studied these tests to find flaws and biases and cannot find them (but for a few items in science).
4. Student testing takes up too
much time and money.
 Quality tests are expensive and require student time. (Reliability of scores needs to be high.) Testing takes six hours in some grades (grades 4 and 10).
 In Massachusetts, for example, all students at grades 3, 4, 6, 7, 8, and 10 are tested in some subjects.
4. Student testing takes up too
much time and money. (Cont.)
4 to 6 hours per student, or about 0.5% of instructional time per year (one day out of 180!).
 $7.0 billion is spent on education and $25 million per year on assessment, about $20.00 per student: 0.3% of the education budget, or roughly 1 of every 300 education dollars, goes to MCAS assessments! It seems obvious that the amount of time and cost of the assessments is not out of line with their value (a quick check of these figures follows below).
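As a quick sanity check on these figures, here is a small Python sketch using only the numbers quoted above; the six-hour instructional day is my own assumption.

    education_budget = 7.0e9       # dollars per year on education (slide figure)
    assessment_cost = 25.0e6       # dollars per year on MCAS assessment (slide figure)
    share_of_budget = assessment_cost / education_budget
    print(f"{share_of_budget:.2%} of the education budget "
          f"(about 1 in every {education_budget / assessment_cost:.0f} dollars)")

    testing_hours = 5              # midpoint of the 4 to 6 hours per student
    school_days = 180
    hours_per_day = 6              # assumption: instructional hours per school day
    share_of_time = testing_hours / (school_days * hours_per_day)
    print(f"{share_of_time:.2%} of instructional time (well under one day of {school_days})")

The output, roughly 0.36% of the budget and 0.46% of instructional time, is consistent with the rounded figures on the slide.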
Changes in timing or scheduling of the test (a reaction to criticisms in 1998):
 Administer the test in short periods.
 Administer at a time of day that takes into account the student’s medical needs or learning style.
 Time of testing varies by grade, but it takes less than one day (total) of the 180-day school year; not all grades are assessed; and diagnostic results from students and groups can be used to improve instructional practices!
One Example: Using Item Analysis Results at the School Level (reproduced with permission of MDOE)
[Figure: item analysis display comparing your school with students performing at the Proficient level.]
5. Passing scores are set too high.
Too often, the judgment that passing scores are too high is based simply on failure rates.
 Look at the process used by the states, and look for validity evidence.
 Who is setting the passing scores, and what method are they using? (One common judgmental method is sketched below.)
 What is the evidence that performance standards are set too high in Massachusetts? It doesn’t exist, in my judgment.
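For readers asking what a standard-setting method looks like, here is a minimal Python sketch of one common judgmental approach: Angoff-style item ratings aggregated into a raw cut score. This is illustrative only; the panelist ratings are made up, and it is not necessarily the specific procedure Massachusetts used.

    import numpy as np

    # Each panelist judges, for every item, the probability that a borderline
    # (just-passing) student would answer correctly. Ratings are hypothetical.
    ratings = np.array([
        [0.6, 0.4, 0.7, 0.5, 0.8, 0.3],   # panelist 1
        [0.5, 0.5, 0.6, 0.6, 0.7, 0.4],   # panelist 2
        [0.7, 0.3, 0.8, 0.5, 0.9, 0.4],   # panelist 3
    ])

    per_panelist_cut = ratings.sum(axis=1)   # expected raw score of a borderline student
    cut_score = per_panelist_cut.mean()      # panel's recommended raw cut score
    print(f"recommended raw-score cut: {cut_score:.1f} of {ratings.shape[1]} points")

Whatever the method, the point of the slide stands: judging a cut score requires examining the procedure and its validity evidence, not just the resulting failure rate.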
5. Passing scores are set too high.
 Typically, passing scores are set by educators and school administrators; sometimes parents and local persons are included too. In Massachusetts, teachers dominated (52% of panelists).
 Critics need to study the procedures used in setting passing scores and the validity evidence.
 As an expert on this topic, I can tell you that the state used exemplary procedures.
5. Passing scores are set too high.
 Test scores are placed on a new reporting scale with scores from 200 to 280; 220 is passing.
 In 2005, more than 75% of grade 10 students passed both the ELA and math assessments on their first attempt, and pass rates were over 80% for each assessment (first-time takers).
 I don’t see merit in the criticism.
6. There is little or no evidence that MCAS is producing results.
Internal evidence (sample):
-- At the grade 10 level, pass rates have been steadily increasing.
-- Research evidence from Learning Innovations (2000): 90% of schools indicated changes in curricula, with the changes influenced by test results; over 70% of teachers reported instruction influenced by MCAS results.
6. There is little or no evidence that MCAS is producing results.
External evidence:
-- The state received very positive reviews of the MCAS curricula from Achieve, Inc. (a national review group): among the best in the country.
-- NAEP scores are up since 1992 for white, Black, and Hispanic students.
-- SAT scores are up, and more students are taking the SAT.
NAEP 2005: Massachusetts and National Results, Percentages at NAEP Achievement Levels
[Figure: Mathematics Grade 4, percentage at each NAEP achievement level. Source: Massachusetts Snapshot Report 2005; US DOE, IES, NCES.]
[Figure: Reading Grade 4, percentage at each NAEP achievement level. Source: Massachusetts Snapshot Report 2005; US DOE, IES, NCES.]
[Figure: 1994-2004 Massachusetts mean SAT scores (combined Verbal & Math), Massachusetts versus the nation. Both series start at about 1002-1003 in 1994; by 2004 the Massachusetts mean is 1041 versus 1026 for the nation.]
Personal Concerns
 Dropout rates have increased, especially for inner-city students. But how much? Why? What can be done if true?
 Retention rates at the ninth grade are up. How much? Why? What can be done?
 Consequential validity studies are needed. Intended and unintended outcomes, both positive and negative, need to be identified and addressed.
Personal Concerns
 Funding of schools. Is it sufficient? Are we spending the money on the right items and in the appropriate amounts (teachers, special programs, school facilities, etc.)? (Assessment results provide clues, at least, to problem areas.)
Conclusions
 I am encouraged by educational reform in Massachusetts; there are many positive signs: funding, curricula, assessments, concern for students who need special assistance, etc.
 Internal and external validity evidence is very encouraging.
 Important problems remain, notably the achievement gap and funding issues.
Conclusions
 I am troubled by the misconceptions that are so widely held about the MCAS. They interfere with effective implementation.
 I would like to see everyone get behind educational reform and make it work for more students. Continue with the strengths and address the problems.
-- Compile substantial validity evidence, then make the necessary changes, with the goal of making education in Massachusetts meet the needs of all students.
Follow-up reading:
 R. P. Phelps (Ed.). (2005). Defending standardized testing. Mahwah, NJ: Lawrence Erlbaum Publishers.
Please contact me at
[email protected] for a
copy of the slides, or to
forward your questions and
reactions.
Some extra slides. Not used in the
presentation because of limited time.
Annual Grade 9-12 Dropout Rates: 1997 to 2003
[Figure: annual dropout rate by year, ranging between roughly 3.1% and 3.6% over 1997-2003. Source: Massachusetts Department of Education.]
State approach to minimizing dropouts:
 Provide a clear understanding to students about what is needed.
 Improve students’ classroom curricula and instruction.
 Offer after-school and summer programs.
 Find new roles for community colleges to meet student needs.
 Do research to identify reasons for dropouts, and then react if possible.
7. Testing accommodations are not
provided to students with
disabilities.
 Federal legislation is very clear on the need for states to provide test accommodations to students who need them (ADA, IDEA legislation).
 Validity of scores is threatened.
 The state provides a large set of accommodations.
Long List of Available
Accommodations
 About 20 accommodations organized into four main categories: (a) changes in timing, (b) changes in setting, (c) changes in administration, and (d) changes in responding.
b. Changes in test setting
 Administer to a small group or in a private room
 Administer individually
 Administer in a carrel
 Administer with the student wearing noise buffers
 Administer with the administrator facing the student
c. Changes in test administration
 Using magnifying equipment or enlargement devices
 Clarifying instructions
 Using large-print or Braille editions
 Using tracking items
 Using amplification equipment
 Translating into American Sign Language
d. Changes in how the student
responds to test questions
 Answers dictated
 Answers recorded
8. State tests must be flawed because failure rates are high and better students do go on to jobs and colleges.
 Actually, failure rates at the grade 10 level are not high (80% pass both tests on the first chance).
 NAEP results are not that far out of line with state results in New England. [In fact, results are close.]
 Too many colleges must offer basic reading and math courses.
8. State tests must be flawed because failure rates are high and better students go on to jobs and colleges.
 Internationally, we are about the middle of the pack. In one of the recent studies, we were right there with Latvia and New Zealand, and trailing Korea, Singapore, and many other industrial countries.
9. Test items are biased against minorities.
 Another excellent validity concern, but the evidence does not support the charge in Massachusetts.
 We have analyzed the available 1998, 2000, and 2001 data: grades 4, 8, and 10; ELA, math, science, and history; Male-Female, Black-White, and Hispanic-Black comparisons.
[Figure: Conditional p-value plot of uniform DIF (SDIF = 0.135, UDIF = 0.136); proportion correct for the reference and focal groups plotted against score points on the common test.]
[Figure: Conditional p-value plot of non-uniform DIF (SDIF = 0.060, UDIF = 0.029); proportion correct for the reference and focal groups plotted against score points on the common test.]
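The two conditional p-value plots above compare the proportion of examinees answering an item correctly in the reference and focal groups, after matching on the common-test score. Below is a minimal Python sketch of that computation and of a standardization-type DIF index; the response data are made up, and the exact SDIF/UDIF definitions used in the MCAS analyses may differ in detail.

    import numpy as np

    # Hypothetical data: item score (0/1) and common-test total score for
    # reference-group and focal-group examinees.
    rng = np.random.default_rng(0)
    ref_total = rng.integers(0, 60, 5000)
    foc_total = rng.integers(0, 60, 3000)
    ref_item = (rng.random(5000) < ref_total / 60).astype(int)   # toy response model
    foc_item = (rng.random(3000) < foc_total / 60).astype(int)

    signed, unsigned, weight = 0.0, 0.0, 0
    for s in range(60):                        # condition on each common-test score
        r = ref_item[ref_total == s]
        f = foc_item[foc_total == s]
        if len(r) == 0 or len(f) == 0:
            continue
        diff = f.mean() - r.mean()             # conditional p-value difference
        signed += len(f) * diff                # weight by focal-group count at this score
        unsigned += len(f) * abs(diff)
        weight += len(f)

    print(f"signed DIF index:   {signed / weight:+.3f}")
    print(f"unsigned DIF index: {unsigned / weight:.3f}")

An index of this kind is typically compared against a preset threshold when flagging items for review; the plots simply show the two conditional p-value curves that the index summarizes.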
[Figure: Grade 10 Mathematics DIF item plot for MC item 32; p-value curves for Whites and Blacks plotted against common test score (0 to 59); White-Black DIF = 0.099.]
[Figure: DIF indices for a Mathematics test (Male-Female), plotted by item (1 to 41); positive values indicate items easier for males, negative values items easier for females; indices fall within about ±0.15.]
[Figure: Male-Female DIF indices for Mathematics test items organized by content strand (Number Sense and Operations; Patterns, Relations, and Algebra; Geometry; Measurement; Statistics, Data Analysis, and Probability); positive values favor males, negative values favor females, within about ±0.15.]
Even prior to statistical review,
judgmental reviews take place:
 Assessment development committee
 Bias review committee
 Department staff
 External content area experts
 Item writers themselves
The real discrimination is more likely in the educational system, not the assessments.
 Discrimination does exist in an educational system that has historically moved students forward despite poor educational performance, awarded high school diplomas, and sent them on to minimum wage jobs. If high schools won’t stop the practice, then the state needs to intervene.
10. Gains in achievement are most
likely due to improvements in testing
practices only.
 Many possible reasons for gains, including: students are learning to take tests (not all bad), aligning of instruction to tests (not bad if the tests measure the curriculum frameworks), cheating (little evidence so far), and holding back students/drop-outs.
Consider the retention argument used to explain achievement growth in Massachusetts: the retention rate increased 25%! (see Walt Haney)
 Reality: It went from 4% at ninth grade to 5%. Increase: 25%.
 With 60,000 students, 600 students are affected; pass and fail rates would be affected by only about 1%! This is not the explanation for growth. (The arithmetic is sketched below.)
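The slide’s numbers can be checked directly. Here is a small Python sketch of the arithmetic, using only the cohort size and retention rates quoted above:

    cohort = 60_000                    # approximate size of the grade 9 cohort
    old_rate, new_rate = 0.04, 0.05    # ninth-grade retention rate, before and after

    relative_increase = (new_rate - old_rate) / old_rate
    extra_retained = cohort * (new_rate - old_rate)
    max_shift_in_pass_rate = extra_retained / cohort

    print(f"relative increase in retention: {relative_increase:.0%}")    # 25%
    print(f"additional students retained:   {extra_retained:.0f}")       # 600
    print(f"maximum shift in pass rate:     about {max_shift_in_pass_rate:.0%}")  # ~1%

A 25% relative increase in retention sounds large, but it corresponds to roughly 600 students out of 60,000, which can move pass rates by only about one percentage point.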
10. Gains in achievement are most
likely due to improvements in
testing practices only. (Cont.)
 Also, it is possible that teachers and students are working harder, teachers are focused on the curricula, and they are teaching more effectively.
10. Gains in achievement are most
likely due to improvements in
testing practices only. (Cont.)
 Research evidence from Learning Innovations (2000): 90% of schools indicated changes in curricula, with the changes influenced by test results; over 70% of teachers reported instruction influenced by MCAS results.
11. Not everything that should be
tested is included.
 Definitely true, but over time, what can be tested should be tested. And schools have the responsibility to pick up the rest! They can use work samples, portfolios, more performance tasks, etc. There are course grades too.
12. Special education students
should not be included.
 Federal laws (ADA, IDEA) require that every possible effort be made to include all students in the assessments. The policy is known as “full inclusion.” President Bush’s “No Child Left Behind” is another example of federal initiatives.
Conclusions
 From a technical perspective, many state tests, especially recent ones, are quite sound.
 Technical methods for test development, equating, assessing reliability and validity, setting standards, etc. are very much in place and suitable (one small illustration follows below).
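As one concrete illustration of the routine reliability machinery referred to above, here is a minimal Python sketch of coefficient alpha, a standard internal-consistency reliability estimate. The item scores are made up; this is not an MCAS computation, just an example of the kind of method that is well established.

    import numpy as np

    # rows = students, columns = items (0/1 for MC items; partial credit allowed)
    scores = np.array([
        [1, 1, 0, 2, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 3, 2],
        [0, 0, 0, 1, 0],
        [1, 1, 1, 4, 2],
    ], dtype=float)

    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(f"coefficient alpha = {alpha:.2f}")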
Conclusions (Cont.)
 Major shortcomings of many state testing programs: (1) too little emphasis on diagnostic testing, and (2) too little emphasis on efforts to evaluate program impact.
 Impact on students, such as drop-outs, retentions, attitudes, going to college, etc., needs study.
Conclusions (Cont.)
 Impact on teachers, such as improvements in their qualifications, changes in instructional practices, attitudes about teaching, etc., needs study.
 Impact on school administrators, such as the need for new forms of leadership and new demands on time, needs study.
Conclusions (Cont.)
 Testing methods and curricula frameworks are in place (and can be modified as appropriate).
 My hope would be that educators try to make the current program of educational reform work: evaluate as we go, and revise accordingly. A data-based approach makes sense for effective educational reform.