
A fuzzy clustering approach to improve the accuracy of Italian students’ data
An experimental procedure to correct the impact of the outliers on assessment test scores
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
UNIVERSITY OF NAPLES “PARTHENOPE”
OUTLINE
This work considers data on students’ performance assessments collected by the
Italian National Evaluation Institute of the Ministry of Education (INVALSI)
THE INVALSI SURVEY
3 AREAS
reading, mathematics and science
5 SCHOOL LEVELS
– 2nd and 4th year of primary school
– 1st year of lower secondary school
– 1st and 3rd year of upper secondary school
• OUTLIER UNITS, at class level, lead to biased distributions of the average scores by class
• The AIM is to MITIGATE THE PRESENCE of outliers and to correct the overestimation of children’s ability
A fuzzy clustering approach to improve the accuracy of Italian students’ data
An experimental procedure to correct the impact of the outliers on assessment test scores
C.Quintano, R.Castellano, S.Longobardi - University of Naples “Parthenope”
DISTRIBUTIONS OF MEAN SCORES
AT CLASS LEVEL (MATHEMATICS ASSESSMENT)
MATHEMATICS
CLASS MEAN SCORE - S.Y. 2004/05
[Figure: five histograms of class mean scores, one per school level: II class primary school, IV class primary school, I class lower secondary school, I class upper secondary school, III class upper secondary school. x-axis: class mean score (0 to 100); y-axis: frequency.]
Summary statistics printed on the panels (order as extracted; panel assignment unclear):
Mean: 74,71; 59,57; 52,21; 71,65; 51,24
Std. Dev.: 14,133; 10,382; 15,229; 16,15; 14,451
N: 30.097; 27.437; 9.280; 8.454; 29.559
CLASS MEAN SCORE
II CLASS - PRIMARY SCHOOL
[Figure: six histograms of class mean scores (x-axis: class mean score, 0 to 100; y-axis: frequency), for reading, mathematics and science in s.y. 2004/05 and s.y. 2005/06.]
Reading s.y. 2004/05: Mean = 87,81; Std. Dev. = 8,14; N = 30.031
Mathematics s.y. 2004/05: Mean = 74,71; Std. Dev. = 14,133; N = 30.097
Science s.y. 2004/05
Reading s.y. 2005/06: Mean = 77,72; Std. Dev. = 13,029; N = 29.802
Mathematics s.y. 2005/06: Mean = 81,5; Std. Dev. = 11,439; N = 29.816
Science s.y. 2005/06
STEP I
Deletion of the micro units (students) considered as “PSEUDO NON RESPONDENTS”
Students who have not given the minimum number of answers needed to compute a performance score
The presence of these units varies from 9% to 16%
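The deletion step can be illustrated with a minimal sketch. The slides do not state the minimum number of answers, so the threshold `min_answers`, the function name and the use of `NaN` to mark a missing or invalid answer are all assumptions made only for this example.

```python
import numpy as np

def drop_pseudo_non_respondents(answers, min_answers):
    """Keep only students with at least `min_answers` valid answers.

    answers: 2-D array (students x items); np.nan marks a missing or
    invalid answer (an assumed encoding, not taken from the slides).
    Returns the retained rows and the boolean mask of retained students.
    """
    answers = np.asarray(answers, dtype=float)
    n_valid = np.sum(~np.isnan(answers), axis=1)  # valid answers per student
    keep = n_valid >= min_answers
    return answers[keep], keep

# Toy data: 3 students, 4 items (1 = correct, 0 = wrong, nan = no valid answer)
demo = np.array([
    [1.0, 0.0, 1.0, 1.0],
    [np.nan, np.nan, np.nan, 1.0],
    [0.0, np.nan, 1.0, 0.0],
])
kept, mask = drop_pseudo_non_respondents(demo, min_answers=2)
```

With this hypothetical threshold of 2, the second student is treated as a pseudo non respondent and dropped.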
COMPUTATION OF CLASS LEVEL INDICATOR
For each student class the following indexes are computed:
Class mean score: $\bar{p}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} p_{ij}$
Standard deviation of mean score: $\sigma_j = \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (p_{ij} - \bar{p}_j)^2}$
Class non response rate: $MC_j = \frac{\sum_{i=1}^{N_j} M_{ij}}{N_j Q}$
Index of answers’ homogeneity: $\bar{E}_j = \frac{1}{Q} \sum_{s=1}^{Q} E_{sj}$
where
$p_{ij}$: SCORE OF THE I-TH STUDENT OF THE J-TH CLASS
$N_j$: NUMBER OF RESPONDENT STUDENTS OF THE J-TH CLASS
$M_{ij}$: NUMBER BOTH OF ITEM NON RESPONSES AND OF INVALID RESPONSES FOR THE I-TH STUDENT OF THE J-TH CLASS
$Q$: NUMBER OF ITEMS ADMINISTERED TO THE J-TH CLASS
$E_{sj}$: GINI MEASURE OF HETEROGENEITY COMPUTED FOR THE S-TH TEST QUESTION ADMINISTERED TO EACH STUDENT OF THE J-TH CLASS
SUMMARY
At first step the micro units considered as “pseudo-non respondents” have been dropped from the dataset, then the following indexes, at class level, are computed:
Class mean score
Standard deviation of mean score
Class non response rate
Index of answers’ homogeneity
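As a rough illustration of the first three class-level indexes (class mean score, standard deviation of the scores, class non response rate), the sketch below computes them for a single class. The function and argument names are hypothetical, and the standard deviation is assumed to use the number of respondent students as denominator.

```python
import numpy as np

def class_indicators(scores, missing_counts, n_items):
    """Class-level indicators for one class (illustrative names).

    scores:         score of each respondent student of the class
    missing_counts: per student, number of item non responses plus
                    invalid responses
    n_items:        number of items administered to the class (Q)
    """
    scores = np.asarray(scores, dtype=float)
    missing_counts = np.asarray(missing_counts, dtype=float)
    n_j = scores.size                                   # respondent students
    mean_j = scores.mean()                              # class mean score
    std_j = np.sqrt(np.mean((scores - mean_j) ** 2))    # population std (assumed)
    mc_j = missing_counts.sum() / (n_j * n_items)       # class non response rate
    return mean_j, std_j, mc_j

# Toy class: 4 students, 20 administered items
mean_j, std_j, mc_j = class_indicators(
    [60.0, 80.0, 70.0, 90.0], [2, 0, 1, 1], n_items=20
)
```

In this toy class the four students leave 4 of the 80 administered answers missing or invalid, so the non response rate is 0.05.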
PRINCIPAL COMPONENT ANALYSIS (PCA)
By the PCA we are able to describe the answer behaviour of each student class through two variables
FIRST component: OUTLIERS IDENTIFICATION AXIS
SECOND component: INDEX OF CLASS COLLABORATION TO SURVEY (class non response rate)
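A minimal sketch of how two components could be extracted from the class-level indicators is given below. It assumes the indicators are standardized before the analysis and uses a plain singular value decomposition; the slides do not say which software or scaling was used, and the random matrix only stands in for real data.

```python
import numpy as np

def pca_two_components(X):
    """Project the rows of X (classes x indicators) on the first two
    principal components of the standardized indicators."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each indicator
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:2].T                          # factorial scores per class
    explained = (S ** 2) / np.sum(S ** 2)          # share of variance per axis
    return scores, Vt[:2], explained[:2]

# Placeholder data: 50 classes, 4 indicators (random, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
scores, loadings, explained = pca_two_components(X_demo)
```

The rows of `loadings` show how each indicator contributes to an axis, which is how the two axes would be given names such as those on the slide.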
PRINCIPAL COMPONENT ANALYSIS (PCA)
It is possible to detect, graphically, the outlier classes of students
[Figure: projection of the second class primary school students’ classes on the plane of the first two factorial axes, with the OUTLIER CLASSES highlighted]
THE FUZZY K-MEANS APPROACH
On the basis of the two factorial dimensions, the students’ classes are classified into 8 clusters by a FUZZY K-MEANS algorithm
Computation of the fuzzy partition matrix, where for each students’ class (rows of the matrix) the degree of belonging to each cluster (columns of the matrix) is computed
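The fuzzy partition matrix described above can be sketched with a textbook fuzzy k-means (fuzzy c-means) iteration. Only the number of clusters (8) and the two-dimensional input come from the slides; the fuzzifier `m = 2`, the Euclidean distance, the random initialization and the stopping rule are assumptions of this example, and the data are random placeholders.

```python
import numpy as np

def fuzzy_kmeans(X, n_clusters=8, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return the fuzzy partition matrix U (units x clusters) and centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)              # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance of every unit from every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                   # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(U_new - U)) < tol
        U = U_new
        if converged:
            break
    W = U ** m                                     # centroids of the final partition
    centroids = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, centroids

# Placeholder factorial scores: 200 classes on two axes
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
U, centroids = fuzzy_kmeans(X_demo, n_clusters=8)
```

Each row of `U` holds the degrees of belonging of one class to the 8 clusters, and the column of the cluster identified as the outlier cluster would supply the membership used later in the deck.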
DETECTION OF OUTLIERS
[Figure: projection of the centroids computed by fuzzy k-means on the factorial plane, with the OUTLIER CLUSTER highlighted]
The outlier cluster has:
High negative scores on the “outliers identification axis” (x-axis), which indicate a high class average score and minimum within-class variability with respect to scores and test answers
Factorial scores close to zero with respect to the “index of class collaboration to survey”
DETECTION OF OUTLIERS
Indicating with “a” the outlier cluster, the degree of belonging of the jth class to this cluster is:
µja
This measure is considered as the “outlier probability” of the jth class
Alternatively, it can be interpreted as the “outlier level” of each class
CORRECTION PROCEDURE
On the basis of the degree of belonging to the outlier cluster, a weighting factor is developed:
Wj = 1 - µja
where Wj is the weighting factor and µja is the outlier probability
Wj varies from 0 to 1
A students’ class with a high probability of belonging to the outlier cluster will have a low weight, while a class very far from this cluster will have a weight close to 1
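The weighting factor can be illustrated with a small worked example. The slides define Wj = 1 - µja; applying it as the weight of a weighted average of class mean scores is an assumption of this sketch, and the numbers are invented.

```python
import numpy as np

def outlier_weights(mu_outlier):
    """Wj = 1 - mu_ja: weight of each class from its outlier-cluster membership."""
    return 1.0 - np.asarray(mu_outlier, dtype=float)

def weighted_mean(class_means, weights):
    """Average of class mean scores weighted by Wj (assumed use of the weights)."""
    class_means = np.asarray(class_means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * class_means) / np.sum(weights))

# Invented example: the second class has a suspiciously perfect score
class_means = np.array([70.0, 100.0, 60.0])
mu_ja = np.array([0.0, 0.9, 0.2])          # membership in the outlier cluster
w = outlier_weights(mu_ja)                 # weights 1.0, 0.1, 0.8
adjusted = weighted_mean(class_means, w)   # 128 / 1.9
plain = float(class_means.mean())          # unweighted mean of the three classes
```

Here the class with membership 0.9 keeps only a tenth of its weight, so the adjusted average falls below the unweighted one.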
EFFECTS OF THE CORRECTION PROCEDURE
ORIGINAL DISTRIBUTION
ADJUSTED DISTRIBUTION
THE INSPIRATION PRINCIPLE
Go beyond the dichotomous logic OUTLIER / NOT OUTLIER
FUZZY APPROACH
Compute an “OUTLIER LEVEL” measure for each unit to calibrate the correction
RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION
AND THE PRESENCE OF OUTLIER CLASSES
[Figure: box plot of the outlier level µja, the degree of belonging to the outlier cluster (cluster n.2)]
RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION
AND THE PRESENCE OF OUTLIER CLASSES
CLASS AVERAGE
SCORE
DISTRIBUTIONS ONLY
FOR THE NORTHERN
AND CENTRAL
REGIONS
REGIONAL SCORES
[Figure: two charts of regional average scores for the II class of primary school, mathematics, s.y. 2004/05 (“II elem MAT 0405”): NOT WEIGHTED AVERAGE (“Media”) and WEIGHTED AVERAGE (“Media ponderata”). y-axis: average score (50 to 86); x-axis: regions (Sicilia, Sardegna, Puglia, Basilicata, Calabria, Lazio, Abruzzo, Molise, Campania, Valle D'Aosta, Piemonte, Lombardia, Trentino Alto Adige, Friuli Venezia Giulia, Veneto, Liguria, Emilia Romagna, Toscana, Umbria, Marche).]
Index of answers’ homogeneity
$\bar{E}_j = \frac{1}{Q} \sum_{s=1}^{Q} E_{sj}$
The mean of the Q Gini indexes ($E_{sj}$) computed for each sth test question administered to each student of the jth class,
where $E_{sj}$ is a Gini measure of heterogeneity:
$E_{sj} = 1 - \sum_{t=1}^{h} \left( \frac{n_t}{N_j} \right)^2$
$\frac{n_t}{N_j}$ denotes the ratio of students of the jth class that have given the tth answer to the sth question
The Gini measure is equal to zero when all students of the jth class have given the same answer to the sth question. It reaches the maximum value (h-1)/h (h is the number of alternative answers to the sth question) when there is perfect heterogeneity of answers to the sth question in the jth class
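The two limiting cases of the Gini measure can be checked with a short sketch. The function names and the toy answer matrix are hypothetical, and for simplicity every student is assumed to have answered every question.

```python
import numpy as np

def gini_heterogeneity(answers_to_question):
    """E_sj = 1 - sum_t (n_t / N_j)^2 for the answers given to one question."""
    _, counts = np.unique(answers_to_question, return_counts=True)
    shares = counts / counts.sum()          # n_t / N_j for each observed answer
    return float(1.0 - np.sum(shares ** 2))

def answers_homogeneity_index(class_answers):
    """Mean of the Gini measures over the Q questions of one class."""
    class_answers = np.asarray(class_answers)
    n_questions = class_answers.shape[1]
    return float(np.mean(
        [gini_heterogeneity(class_answers[:, s]) for s in range(n_questions)]
    ))

# Toy class: 4 students, 2 questions with h = 4 alternatives each
class_answers = np.array([
    ["A", "A"],
    ["A", "B"],
    ["A", "C"],
    ["A", "D"],
])
e_first = gini_heterogeneity(class_answers[:, 0])   # identical answers
e_second = gini_heterogeneity(class_answers[:, 1])  # all answers different
e_class = answers_homogeneity_index(class_answers)
```

The first question, answered identically by everyone, gives 0, while the second, with four different answers over h = 4 alternatives, gives the maximum (h-1)/h = 0,75; the class index is their mean.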
EFFECTS OF THE CORRECTION PROCEDURE
                Original distribution    Adjusted distribution
MEAN            74,71                    71,67
MODE            100,00                   68,75
I QUARTILE      64,42                    63,12
MEDIAN          73,61                    71,09
III QUARTILE    85,94                    80,69
KURTOSIS
SKEWNESS