Issues in Large-scale Assessment


Margaret Wu
Victoria University
International large-scale assessments

- Main problem: interpretations of the results
- Focus on country rankings
- An example: In August 2012, Julia Gillard, then Prime Minister of Australia, declared that Australia would strive to be ranked in the ‘top five’ in international education assessments by 2025.
- So strong is this ambition that it has been inscribed into the Australian Education Act of 2013 as its very first objective, which reads: ‘Australia to be placed, by 2025, in the top 5 highest performing countries based on the performance of school students in reading, mathematics and science’ (Australian Education Act, 2013, p. 3).

Does a high ranking mean a good education system?

- A Vietnamese researcher queried why Vietnam did well in PISA despite the poor state of Vietnam’s education system.
- Gorur & Wu (2014), interview transcript with a former OECD official:

What’s the good of [the rankings]? What is the benefit to the US to be told that it is number seven or number 10? It’s useless, meaningless, except for a media beat-up and political huffing and puffing. It’s very important for the US to know, having defined certain goals like improving participation rates for impoverished students from suburbs in large cities – whether in fact that is happening, and if it is, why it is happening and if not, why not. And it is irrelevant whether Chile or Russia or France is doing better or worse – that doesn’t help one bit – in fact it probably hinders. Makes people feel uncertain, unsure, nervous, and they rush over there and find out why they are doing better.

And rushed there, they did…

- (Australian) Grattan Institute’s Catching Up: Learning from the Best School Systems in East Asia (Jensen et al., 2012):

… researchers from Grattan Institute visited the four education systems [Hong Kong, Shanghai, Korea and Singapore] studied in this report. They met educators, government officials, school principals, teachers and researchers. They collected extensive documentation at central, district and school levels. Grattan Institute has used this field research and the lessons taken from the Roundtable to write this report. (p. 6)

Suggested factors for high ranking (performance)

- One observation made by the Grattan Institute: “Shanghai, for example, has larger class sizes to give teachers more time for school-based research to improve learning and teaching.” (p. 2)
- (Observation also made by OECD PISA, 2010)
- The New Zealand government proposed to increase class sizes to free up money to fund initiatives to raise the quality of teaching (NZ Treasury briefing paper, March 2012).

Discussion points

- These “policies” are often said to be “evidence-based”, where large-scale assessments are frequently quoted as the sources of evidence.
- Why should we be concerned with these policies?
- Consider:
  - Validity issues
  - Reliability issues

Validity issues

- Linking factors to performance:
  - Korea and China perform well, and have large class sizes.
  - Can we conclude that large class size leads to good performance?
- Making inferences:
  - Number of storks positively correlated with number of babies born
  - Crime rate positively correlated with ice cream sales
  - People who take care of their teeth have better general health
- Mediating variables are at play.

Linking PISA to Policies

- PISA tells us about student performance, and the background of students/schools/countries.
- Linking background to performance is done by people, not proven by statistics.
- Any interpretation is an inference.
- PISA cannot substantiate the validity of the inferences.
- Need other in-depth studies.

A common misunderstanding about statistical analysis

- Regression equation: Y = a + bX
  - X is termed the explanatory variable
  - Y is termed the dependent variable
- Does X explain Y?
- Try X = a + bY: exactly the same results (see the sketch that follows).
- Regression does not test for causal inference. Regression only reflects correlation.

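To see the symmetry concretely, here is a minimal sketch using simulated (hypothetical) data; swapping the roles of the two variables changes the unstandardized slope, but the correlation, t value and p value are identical, as the SPSS output that follows also shows:

```python
# Minimal sketch with simulated (hypothetical) data: regressing Y on X
# and X on Y give the same correlation and p value; only the
# unstandardized slope changes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=40)             # stand-in for, e.g., GDP
y = 0.3 * x + rng.normal(size=40)   # stand-in for, e.g., reading score

fwd = stats.linregress(x, y)        # Y = a + bX
rev = stats.linregress(y, x)        # X = a + bY

print(f"slope Y on X: {fwd.slope:.3f}, slope X on Y: {rev.slope:.3f}")
print(f"r both ways:  {fwd.rvalue:.3f} vs {rev.rvalue:.3f}")
print(f"p both ways:  {fwd.pvalue:.4f} vs {rev.pvalue:.4f}")
# In simple regression the standardized coefficient equals r, so the
# "explanatory" label carries no causal information.
```
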
Regress Reading on GDP scores

Coefficients (dependent variable: Reading)
              B         Std. Error   Standardized Beta   t        Sig.
(Constant)    479.828   10.310                           46.542   .000
GDP           .427      .301         .243                1.416    .167

Coefficients (dependent variable: GDP)
              B         Std. Error   Standardized Beta   t        Sig.
(Constant)    -36.464   48.202                           -.756    .455
Reading       .138      .098         .243                1.416    .167

Reliability issues

- How strong is the relationship between the two variables?
- p value = 0.11, not significant at the 95% level (an illustrative test is sketched after the table)

Number of countries that performed higher than the OECD average in reading / total number of countries:

                              Small class size and/or   Large class size and      Total
                              low teachers' salaries    high teachers' salaries
Low cumulative expenditure    3 out of 31               3 out of 12               6/43
on education
High cumulative expenditure   8 out of 20               2 out of 2                10/22
on education
Total                         11/51                     5/14                      16/65

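The slide reports p = 0.11, but the exact test behind that figure is not specified. As an illustration of how such an association can be tested from the counts above, here is a sketch using Fisher's exact test on the class-size/salary margin of the table; since the original test specification is unknown, the value below need not reproduce the slide's 0.11:

```python
# Illustrative sketch: testing the association between a country profile
# (large classes + high teacher salaries vs. all other profiles) and
# performing above the OECD average in reading, using the counts above.
# The slide's reported p = 0.11 comes from an unspecified test, so this
# need not match it exactly.
from scipy.stats import fisher_exact

#                above avg   not above avg
table = [[5,  14 - 5],    # large class size and high teachers' salaries
         [11, 51 - 11]]   # small class size and/or low teachers' salaries

odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, two-sided p = {p:.2f}")
# With only 65 countries the cell counts are small, so an apparent
# association of this size is well within what chance could produce.
```
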
Top five in what?

- Interview transcript with a senior OECD official (Gorur & Wu, 2014):

OECD Official: Well, Australia is doing pretty well!
RG: It’s doing well, right? But you know what we want to do now? Our Prime Minister says we want to be in the top five in PISA!
OECD Official: Top five in what?
RG: In PISA.
OECD Official: Yes, but for which students? The average student in Canada, in Korea, Finland, Shanghai, China – that’s one thing. If you then look at high performing students or how low performing students do, then we may get a completely different picture. And that’s where policy efforts are most interesting for me.

Australian 2009 PISA Reading results, by state

State       Mean score   Confidence interval   Note
ACT         531          520–543               In top 5 already
WA          522          510–534
QLD         519          505–532
NSW         516          505–527
VIC         513          504–523
SA          506          497–516
TAS         483          472–495               Below OECD average
NT          481          469–492               Below OECD average
Australia   515          510–519

Ranking by item content

Top ten countries by average score on two items:

Item M408Q01TR                        Item M420Q01TR
Hong Kong-China    0.60               New Zealand       0.66
Finland            0.56               Australia         0.64
Australia          0.56               Canada            0.64
Chinese Taipei     0.55               Ireland           0.62
United Kingdom     0.55               Shanghai-China    0.62
New Zealand        0.55               United Kingdom    0.60
Macao-China        0.53               United States     0.59
Iceland            0.52               Chinese Taipei    0.58
Ireland            0.51               Singapore         0.57
Singapore          0.50               Denmark           0.57

Differential Item Functioning (DIF)

- Australia performed extremely well on Items M408Q01TR and M420Q01TR, ranking third and second internationally, respectively. For Item M408Q01TR, Shanghai-China ranked 20th, despite the fact that Shanghai took the top spot internationally in mathematics literacy, with a mean score much higher than the second-place country, Singapore. For Item M420Q01TR, Australia outperformed all top-ranking countries.
- In contrast, for Item M462Q01DR, Australia ranked 43rd internationally, with an average score of only 0.1 out of a maximum of two, while Shanghai had an average score of 1.5 out of a maximum of two.

Implications of DIF

- Average score (and ranking) hides DIF.
- The existence of DIF threatens comparisons across countries, as the achievement results depend on which items are in the test.

An Example - Japan

- PISA reading:
  - 2000: 522
  - 2003: 498
  - a 24-point drop, about 6 months of growth!
- Triggered huge reactions in Japan
- Blame was placed on a reform started two years before
- New reforms and policies followed

How PISA trends are established

- Select some items from 2000 as “anchoring items”
- Place them in the 2003 test
- So 2003 results can be placed on the 2000 scale (a sketch of this linking step follows)

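Below is a minimal sketch of that linking step, with hypothetical item difficulties; operational PISA equating uses full IRT calibration with many more safeguards, so this only illustrates the core idea of aligning the two scales through the anchor items:

```python
# Sketch of mean/mean linking on anchor items (hypothetical difficulties,
# in logits). Real PISA equating uses full IRT calibration; this only
# illustrates how 2003 results are placed on the 2000 scale.
import numpy as np

# Difficulty estimates of the same anchor items, calibrated separately
# in each cycle (each cycle's scale has its own arbitrary origin).
b_2000 = np.array([-1.2, -0.5, 0.1, 0.6, 1.3])
b_2003 = np.array([-0.9, -0.3, 0.4, 0.8, 1.5])

# Shift that aligns the 2003 calibration with the 2000 scale.
shift = (b_2000 - b_2003).mean()

# Any 2003 ability estimate can now be expressed on the 2000 scale.
theta_2003 = 0.25                    # a student ability from the 2003 run
theta_on_2000_scale = theta_2003 + shift
print(f"link shift = {shift:+.3f} logits")
print(f"2003 ability on 2000 scale = {theta_on_2000_scale:.3f}")
```

Because the whole 2003 cohort is shifted by one constant estimated from the anchors, any anchor item that behaves differently in a given country feeds directly into that country's trend.
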
Item Bias

- Items don’t work in the same way in all countries.
- One item may be relatively more difficult for one country than for other countries.
- Differential Item Functioning (DIF)

Differential Item Functioning

Hypothetical example (% correct; a simple screening sketch follows the table):

Item   Country A   Country B   Note
1      65          76
2      74          83
3      42          51
4      79          85
5      73          64          Biased against B / favours A
6      72          91          Biased against A / favours B
7      46          54

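Using the hypothetical percentages above, a simple DIF screen can be sketched as follows: convert each percentage to a logit, measure Country B's advantage item by item, and flag items whose advantage deviates markedly from the average. Operational DIF analyses use IRT-based or Mantel-Haenszel procedures; this sketch only illustrates the logic:

```python
# Sketch of a simple DIF screen on the hypothetical table above.
# Operational analyses use IRT- or Mantel-Haenszel-based methods;
# this just illustrates the logic of "relative difficulty".
import numpy as np

pct_a = np.array([65, 74, 42, 79, 73, 72, 46]) / 100
pct_b = np.array([76, 83, 51, 85, 64, 91, 54]) / 100

logit = lambda p: np.log(p / (1 - p))
gap = logit(pct_b) - logit(pct_a)   # B's advantage, item by item
rel = gap - gap.mean()              # deviation from the average gap

for item, d in enumerate(rel, start=1):
    flag = ""
    if d < -0.4:
        flag = "  <- favours A (biased against B)"
    elif d > 0.4:
        flag = "  <- favours B (biased against A)"
    print(f"item {item}: relative gap = {d:+.2f}{flag}")
```

On these numbers only items 5 and 6 are flagged, matching the annotations in the table: Country B outscores Country A by a similar margin everywhere except on item 5 (where the gap reverses) and item 6 (where it is unusually large).
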
Japan vs International Item Parameters

[Scatter plot: comparison of international item parameters and national parameters for Japan. Horizontal axis: international difficulty (logit scale), from -2.0 to 3.0; vertical axis: difficulty for Japan (logit scale), from -2 to 3.]

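A figure of this kind can be recreated as sketched below, with hypothetical difficulty values; items far from the identity line are relatively harder (or easier) in the national calibration, which is the visual signature of DIF:

```python
# Sketch of the national-vs-international difficulty plot.
# All difficulty values below are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

intl = np.linspace(-1.5, 2.5, 20)                  # international difficulties
rng = np.random.default_rng(2)
national = intl + rng.normal(0.0, 0.2, intl.size)  # national estimates
national[5] += 1.0                                 # one item much harder locally

plt.scatter(intl, national)
plt.plot([-2, 3], [-2, 3], linestyle="--")         # identity line: no DIF
plt.xlabel("International difficulty (logit scale)")
plt.ylabel("Difficulty for Japan (logit scale)")
plt.title("National vs international item parameters")
plt.show()
```
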
Anchoring items in Reading 2003

- Many anchoring items were biased against Japan.
- Japan’s mean score would increase by 10 score points if one particular reading unit were removed from the set of eight anchoring units (Monseur & Berezner, 2007). A sensitivity check of this kind is sketched below.

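The sensitivity Monseur and Berezner describe can be sketched as a leave-one-unit-out check; all numbers below are hypothetical stand-ins for the real anchor data, not their actual procedure or results:

```python
# Sketch of a leave-one-unit-out sensitivity check on the link
# (hypothetical numbers; illustrates the idea only).
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical link shifts implied by each of the eight anchor units
# (logits); one unit is made to behave very differently, mimicking an
# anchor unit that is biased against a particular country.
unit_shifts = rng.normal(0.0, 0.1, 8)
unit_shifts[3] += 0.5

full_link = unit_shifts.mean()
for i in range(8):
    loo_link = np.delete(unit_shifts, i).mean()
    print(f"drop unit {i + 1}: link moves {loo_link - full_link:+.3f} logits")
# Dropping the deviant unit moves the link, and hence the country mean,
# far more than dropping any other unit; this is the effect Monseur and
# Berezner (2007) quantify at about 10 PISA score points for Japan.
```
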
Fluctuation of Country Results

- Results fluctuate owing to the items selected for a test, for reasons such as:
  - Cultural differences
  - Language differences
  - Curriculum differences

2000–2009 trends

- It has often been claimed that Australia is slipping in Reading.

[Chart: Australia’s PISA Reading trend, 2000–2009, showing a drop of 13 points.]

What PISA tells us

- The big picture:
  - Australia is doing pretty well
  - Australia and New Zealand lead the English-speaking countries
  - (Confucian culture) Asian countries lead in academic performance
  - Finland does very well among the non-Asian countries
- May suggest something for further investigation

Limitations of large-scale assessments

- Not able to collect data on all factors related to education. For example, private spending on education has not been captured, nor have students’ lives outside schools.
- Look beyond international ranks.
- Focus on within-country comparisons.
- Don’t jump to conclusions on policy implications.