LES MALADIES RARES

Correspondence analysis for data mining
Annie Morin, IRISA France
Jean-Hugues Chauchat, Université Lyon2, ERIC
ZAGREB APRIL 2006
Correspondence analysis
• Statistical visualization method for displaying the associations between the levels of a two-way contingency table, and the distances between the categories of each variable => exploratory method
• Usually, a chi-square test for independence is applied to a contingency table

1- EXAMPLE
• Data set crossing 4 words and 4 documents

Example: frequency table
            heart   forest  surgery   animal    Total
D1             11       20        4        2       37
D2              3        9        2        2       16
D3              1        5        2        3       11
D4              3       14        3       16       36
Total          18       48       11       23      100

Questions
• Explore the structure of the categorical variables included in the table
• Find correspondences between the rows and the columns
• Well-known reference situation: independence between the variables defining the rows and the columns

How to check the independence or the relationship?
• Comparison of row profiles
• Comparison of column profiles
• Chi-square statistic
• Other indicators
• CA

Independence situation (expected frequencies)
            heart   forest  surgery   animal    Total
D1            6.7     17.8      4.1      8.5       37
D2            2.9      7.7      1.8      3.7       16
D3            2.0      5.3      1.2      2.5       11
D4            6.5     17.3      4.0      8.3       36
Total          18       48       11       23      100
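
The expected frequencies under independence can be recomputed from the margins alone. The following is a minimal sketch using NumPy (an assumption, the slides do not name any software for this step); the array `N` holds the observed frequencies of the four words in the four documents.

```python
import numpy as np

# Observed frequencies: rows D1..D4, columns heart, forest, surgery, animal
N = np.array([[11, 20, 4, 2],
              [3, 9, 2, 2],
              [1, 5, 2, 3],
              [3, 14, 3, 16]], dtype=float)

n = N.sum()                  # grand total (100)
row_totals = N.sum(axis=1)   # 37, 16, 11, 36
col_totals = N.sum(axis=0)   # 18, 48, 11, 23

# Under independence: expected count = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 1))
```

Any departure of the observed table from `expected` is what the chi-square statistic and CA then summarize.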

Row profiles (% of each row total)
            heart   forest  surgery   animal    Total
D1           29.7     54.1     10.8      5.4      100
D2           18.8     56.3     12.5     12.5      100
D3            9.1     45.5     18.2     27.3      100
D4            8.3     38.9      8.3     44.4      100
Total        18.0     48.0     11.0     23.0      100

[Figure: row profiles]

Column profiles (% of each column total)
            heart   forest  surgery   animal    Total
D1           61.1     41.7     36.4      8.7     37.0
D2           16.7     18.8     18.2      8.7     16.0
D3            5.6     10.4     18.2     13.0     11.0
D4           16.7     29.2     27.3     69.6     36.0
Total         100      100      100      100      100
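
Row and column profiles are simply the table divided by its row totals or its column totals. A short NumPy sketch (the variable names are illustrative, not from the slides):

```python
import numpy as np

# Observed frequencies: rows D1..D4, columns heart, forest, surgery, animal
N = np.array([[11, 20, 4, 2],
              [3, 9, 2, 2],
              [1, 5, 2, 3],
              [3, 14, 3, 16]], dtype=float)

# Row profiles: each row divided by its total, in percent
row_profiles = 100 * N / N.sum(axis=1, keepdims=True)

# Column profiles: each column divided by its total, in percent
col_profiles = 100 * N / N.sum(axis=0, keepdims=True)

print(np.round(row_profiles, 1))
print(np.round(col_profiles, 1))
```

Under independence every row profile would equal the average profile (18, 48, 11, 23), and every column profile would equal (37, 16, 11, 36).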

[Figure: column profiles]

2- Method
• Table with r rows and c columns
• $n_{ij}$ = frequency in cell (i, j)
• $n_{i.} = \sum_j n_{ij}$, $n_{.j} = \sum_i n_{ij}$, $n = \sum_{i,j} n_{ij}$
• Find a lower-dimensional space in which to position the row points in a manner that retains all, or almost all, of the information about the differences between the rows (respectively, the columns)

• The row and column totals of the matrix of relative frequencies are called the row masses and the column masses, respectively.
• The term inertia is used by analogy with the definition in applied mathematics of "moment of inertia", which stands for the integral of mass times the squared distance to the centroid. Inertia is defined as the total Pearson chi-square for the two-way table divided by the total sum n.
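
As an illustration of this definition, the sketch below computes the Pearson chi-square and the inertia for the words-by-documents example. NumPy is an assumption; the result can be compared with the trace of 0.20149 reported in the eigenvalue table of the example.

```python
import numpy as np

# Observed frequencies: rows D1..D4, columns heart, forest, surgery, animal
N = np.array([[11, 20, 4, 2],
              [3, 9, 2, 2],
              [1, 5, 2, 3],
              [3, 14, 3, 16]], dtype=float)

n = N.sum()
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi2 = ((N - expected) ** 2 / expected).sum()

# Inertia = chi-square divided by the grand total
inertia = chi2 / n
print(f"chi-square = {chi2:.3f}, inertia = {inertia:.5f}")
```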

– If the rows and columns in a table are completely independent of each other, the entries in the table can be reproduced from the row and column totals alone, or from the row and column profiles.
– Any deviations from the expected values (expected under the hypothesis of complete independence of the row and column variables) will contribute to the overall chi-square. Thus, another way of looking at CA is to consider it a method for decomposing the overall chi-square statistic (or inertia = chi-square/n) by identifying a small number of dimensions in which the deviations from the expected values can be represented. This is similar to the goal of factor analysis (FA) or principal component analysis (PCA), where the total variance is decomposed so as to arrive at a lower-dimensional representation of the variables that allows one to reconstruct most of the variance/covariance matrix of the variables.

• The dimensions are "extracted" so as to maximize the distances between the row or column points, and successive dimensions (which are orthogonal to each other) "explain" less and less of the overall chi-square value (and thus of the inertia).
• The maximum number of eigenvalues that can be extracted from a two-way table is equal to the minimum of the number of columns minus 1 and the number of rows minus 1.
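
One common way to extract these dimensions, shown here as a sketch rather than as the method used by the authors' software, is a singular value decomposition of the matrix of standardized residuals. The squared singular values are the eigenvalues, and there are at most min(r, c) - 1 non-trivial ones.

```python
import numpy as np

# Observed frequencies: rows D1..D4, columns heart, forest, surgery, animal
N = np.array([[11, 20, 4, 2],
              [3, 9, 2, 2],
              [1, 5, 2, 3],
              [3, 14, 3, 16]], dtype=float)

P = N / N.sum()          # relative frequencies
r = P.sum(axis=1)        # row masses
c = P.sum(axis=0)        # column masses

# Standardized residuals with respect to the independence model
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Singular values are returned in decreasing order
sing = np.linalg.svd(S, compute_uv=False)

k = min(N.shape) - 1                 # maximum number of non-trivial axes
eigenvalues = sing[:k] ** 2
percent = 100 * eigenvalues / eigenvalues.sum()
print(np.round(eigenvalues, 4), np.round(percent, 2))
```

The eigenvalues add up to the total inertia, so the percentages show how much of the chi-square each successive axis accounts for.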

• Plot the coordinates in a two-dimensional scatterplot. Remember that the purpose of correspondence analysis is to reproduce the distances between the row and/or column points of a two-way table in a lower-dimensional display. Note that, as in factor analysis, the actual rotational orientation of the axes is arbitrarily chosen so that successive dimensions "explain" less and less of the overall chi-square value (or inertia).

• It is customary to summarize the row and column coordinates in a single plot. However, it is important to remember that in such plots one can only interpret the distances between row points, and the distances between column points, but not the distances between row points and column points.

Quality of a displayed solution
• The quality of a point is defined as the ratio of the squared distance of the point from the origin in the chosen number of dimensions to the squared distance from the origin in the space defined by the maximum number of dimensions; it is called the squared cosine.

• The relative inertia represents the proportion of the total inertia accounted for by the respective point, and it is independent of the number of dimensions chosen by the user. Note that a particular solution may represent a point very well (high quality) even though the same point does not contribute much to the overall inertia.

• It should be noted at this point that correspondence analysis is an exploratory technique. The method was developed from a philosophical orientation that emphasizes the development of models that fit the data, rather than the rejection of hypotheses based on lack of fit (Benzécri's "second principle" states that "The model must fit the data, not vice versa"; see Greenacre, 1984, p. 10). Therefore, no statistical significance tests are customarily applied to the results of a correspondence analysis; the primary purpose of the technique is to produce a simplified (low-dimensional) representation of the information in a large frequency table (or in tables with similar measures of correspondence).

• Contribution of row i to the inertia of the α-th axis:
– $Cr_\alpha(i) = f_{i.} \psi_{\alpha i}^2 / \lambda_\alpha$
– The contributions of the individuals on an axis sum to 1
• Quality of representation:
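
The two indicators can be sketched numerically as follows, assuming the usual SVD formulation of CA and taking the quality to be the squared cosine described earlier (squared coordinate on an axis divided by the squared distance to the origin). The names `col_contrib` and `col_cos2` are illustrative; the results for the columns can be compared with the contribution and squared-cosine tables of the example.

```python
import numpy as np

# Observed frequencies: rows D1..D4, columns heart, forest, surgery, animal
N = np.array([[11, 20, 4, 2],
              [3, 9, 2, 2],
              [1, 5, 2, 3],
              [3, 14, 3, 16]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

k = min(N.shape) - 1
lam = sing[:k] ** 2                                    # eigenvalues
F = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]        # row coordinates
G = (Vt.T[:, :k] * sing[:k]) / np.sqrt(c)[:, None]     # column coordinates

# Contribution of a point to an axis: mass * coordinate^2 / eigenvalue
row_contrib = r[:, None] * F ** 2 / lam
col_contrib = c[:, None] * G ** 2 / lam

# Quality (squared cosine): coordinate^2 / squared distance to the origin
row_cos2 = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)
col_cos2 = G ** 2 / (G ** 2).sum(axis=1, keepdims=True)

print(np.round(100 * col_contrib, 2))
print(np.round(col_cos2, 2))
```

The signs of the coordinates depend on the SVD routine, but contributions and squared cosines do not, since they only involve squares.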

Supplementary points
• An important aid in the interpretation of the results of a correspondence analysis is to include supplementary row or column points that were not used to perform the original analysis.
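
A supplementary row can be positioned with the usual transition formula: its profile is multiplied by the standard coordinates of the columns. The sketch below assumes this standard formulation; the function `project_rows` and the extra document `[2, 6, 1, 1]` are hypothetical and only serve as an illustration.

```python
import numpy as np

# Active table: rows D1..D4, columns heart, forest, surgery, animal
N = np.array([[11, 20, 4, 2],
              [3, 9, 2, 2],
              [1, 5, 2, 3],
              [3, 14, 3, 16]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

k = min(N.shape) - 1
F = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]   # coordinates of active rows
Gamma = Vt.T[:, :k] / np.sqrt(c)[:, None]          # standard column coordinates

def project_rows(counts):
    """Place rows on the existing axes without changing the axes."""
    counts = np.atleast_2d(np.asarray(counts, dtype=float))
    profiles = counts / counts.sum(axis=1, keepdims=True)
    return profiles @ Gamma

# Hypothetical fifth document, not used to build the axes
supplementary = project_rows([2, 6, 1, 1])
print(np.round(supplementary, 3))
```

Projecting the active rows with the same function returns their own coordinates, which is a convenient sanity check for the formula.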

Notations
• There are two clouds of points:
– the first one, N(I), is the set of rows, whose coordinates are the components of the row profiles and whose masses are the marginal frequencies of the rows
– the second one, N(J), is the set of columns, whose coordinates are the components of the column profiles and whose masses are the marginal frequencies of the columns
Distances
Between two columns
Between two rows
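
The formulas of this slide did not survive extraction. Since the deck states that CA uses the chi-square distance, the standard definitions are given here as a reminder, written with relative frequencies $f_{ij} = n_{ij}/n$ and margins $f_{i.}$, $f_{.j}$ (a notational assumption consistent with the contribution formula):

```latex
% Chi-square distance between two columns j and j'
d^2(j, j') = \sum_{i} \frac{1}{f_{i.}}
             \left( \frac{f_{ij}}{f_{.j}} - \frac{f_{ij'}}{f_{.j'}} \right)^2

% Chi-square distance between two rows i and i'
d^2(i, i') = \sum_{j} \frac{1}{f_{.j}}
             \left( \frac{f_{ij}}{f_{i.}} - \frac{f_{i'j}}{f_{i'.}} \right)^2
```

Each squared difference between profiles is weighted by the inverse of the corresponding mass, so rare categories are not swamped by frequent ones.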

Principle of distributional equivalence
• If two row profiles (say) are identical, then the corresponding two rows of the original matrix may be replaced by their sum (as a single row) without affecting the geometry of the column profiles.
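
This property can be checked numerically. In the sketch below the small table is made up for the illustration: its second row is proportional to the first, so both rows have the same profile. The chi-square distances between columns are the same before and after merging them.

```python
import numpy as np

def column_chi2_distances(table):
    """Matrix of chi-square distances between the column profiles."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)                     # row masses, used as weights
    profiles = P / P.sum(axis=0)          # each column sums to 1
    diff = profiles[:, :, None] - profiles[:, None, :]
    return np.sqrt((diff ** 2 / r[:, None, None]).sum(axis=0))

# Illustrative table: row 2 is half of row 1, so their profiles are identical
N = np.array([[10, 20, 30],
              [5, 10, 15],
              [8, 2, 6]])

# Replace the two equivalent rows by their sum
merged = np.vstack([N[0] + N[1], N[2]])

before = column_chi2_distances(N)
after = column_chi2_distances(merged)
print(np.round(before, 4))
print(np.round(after, 4))
```

This invariance is the reason the chi-square metric is preferred to the ordinary Euclidean distance between profiles.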

CA
• Duality between the rows and the columns
• Use of the row profiles and of the column profiles
• Use of the chi-square distance (distributional equivalence)
• Factorial method (eigenvalues of an ad hoc matrix) and dimensionality reduction

• Diagonalization of a "covariance matrix" to find the eigenvalues and the corresponding eigenvectors
• $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$
• Inertia of the cloud: $\sum_i \lambda_i = \chi^2 / n$
• Distance to the independence model

Simultaneous representation
• Of the row and column profiles on the same factorial plane
• Validity of the representation:
– Inertia: contributions describing the proportion of the variance of an axis explained by each element (row or column profile)
– Quality of representation of each element by the axes

Special cloud "shapes"
• Guttman effect: horseshoe shape
• Two sub-clouds
• Three sub-clouds
[Figure: Guttman effect]
[Figure: two sub-clouds]
[Figure: three sub-clouds]

Similar techniques
• Optimal scaling
• Reciprocal averaging
• Quantification method
• Homogeneity analysis

Example
• Initial example
• Second example

Eigenvalues
Trace of the matrix: 0.20149
Number   Eigenvalue   Percentage   Cumulative percentage
1              0.19        93.58                   93.58
2              0.01         5.71                   99.29
3              0.00         0.71                  100.00

Column coordinates
Label     Relative weight   Distance to origin   Axis 1   Axis 2   Axis 3
heart               18.00              0.28817    -0.51     0.15    -0.04
forest              48.00              0.02389    -0.15    -0.02     0.04
surgery             11.00              0.07113    -0.08    -0.25    -0.06
animal              23.00              0.56662     0.75     0.06    -0.01

Column contributions
Label     Relative weight   Distance to origin   Axis 1   Axis 2   Axis 3
heart               18.00              0.28817    25.30    33.24    23.45
forest              48.00              0.02389     5.59     2.56    43.85
surgery             11.00              0.07113     0.38    58.05    30.57
animal              23.00              0.56662    68.73     6.15     2.13

Column quality (squared cosines)
Label     Relative weight   Distance to origin   Axis 1   Axis 2   Axis 3
heart               18.00              0.28817     0.92     0.07     0.01
forest              48.00              0.02389     0.92     0.03     0.05
surgery             11.00              0.07113     0.09     0.85     0.06
animal              23.00              0.56662     0.99     0.01     0.00

Row results

Second example

Table of alcoholic drinks and their characteristics
Observed counts:
Characteristic                   Pastis   whisky  martini     suze    vodka      gin   malibu     beer    Total
Likes the taste                      49       50       42       18       25       23       25       59      291
With friends                         83       83       76       60       69       68       69       74      582
To relax                             61       61       51       32       38       39       39       72      393
Expensive                            60       88       42       41       75       70       61       19      456
Refreshing, thirst-quenching         78       22       18       19       17       19       14       80      267
Inelegant, undistinguished           26       11       13       17       13       11       13       29      133
Likeable product                     64       64       56       34       45       42       46       68      419
Good before meals                    88       79       85       64       45       46       37       41      485
Good during the day                  24       21       12       10       13       12       13       85      190
Good in the evening                   7       61       12       11       53       50       48       54      296
All year round                       83       87       85       79       83       82       80       90      669
Popular with young people            45       77       36       16       65       69       76       89      473
Gladly with guests                   88       92       87       60       70       67       67       81      612
Old-fashioned, outdated              12        4       13       38        5        6        8        7       93
For men as well as women             50       62       69       43       49       51       61       60      445
Very close                           38       41       27       11       16       18       17       49      217
Out of habit                         36       30       24       16       19       19       17       40      201
Snobbish, showy                       3       35        9        8       28       25       21        4      133
Can be mixed                         43       87       29       32       82       80       43       40      436
Night/bar/disco                      12       91       27       16       84       81       72       67      450
Total                               950     1146      813      625      894      878      827     1108     7241

Expected counts (under independence):
Characteristic                   Pastis   whisky  martini     suze    vodka      gin   malibu     beer    Total
Likes the taste                  38.178   46.055   32.673   25.117   35.928   35.285   33.235   44.528      291
With friends                     76.357   92.110   65.345   50.235   71.856   70.570   66.471   89.056      582
To relax                         51.561   62.198   44.125   33.921   48.521   47.653   44.885   60.136      393
Expensive                        59.826   72.169   51.198   39.359   56.299   55.292   52.080   69.776      456
Refreshing, thirst-quenching     35.030   42.257   29.978   23.046   32.965   32.375   30.494   40.856      267
Inelegant, undistinguished       17.449   21.049   14.933   11.480   16.421   16.127   15.190   20.351      133
Likeable product                 54.972   66.313   47.044   36.166   51.731   50.805   47.854   64.114      419
Good before meals                63.631   76.759   54.454   41.862   59.880   58.808   55.392   74.214      485
Good during the day              24.927   30.070   21.333   16.400   23.458   23.038   21.700   29.073      190
Good in the evening              38.834   46.847   33.234   25.549   36.545   35.891   33.806   45.293      296
All year round                   87.771  105.880   75.114   57.744   82.597   81.119   76.407  102.369      669
Popular with young people        62.056   74.860   53.107   40.827   58.398   57.353   54.022   72.377      473
Gladly with guests               80.293   96.858   68.714   52.824   75.560   74.207   69.897   93.647      612
Old-fashioned, outdated          12.201   14.719   10.442    8.027   11.482   11.277   10.622   14.231       93
For men as well as women         58.383   70.428   49.963   38.410   54.941   53.958   50.824   68.093      445
Very close                       28.470   34.344   24.364   18.730   26.792   26.312   24.784   33.205      217
Out of habit                     26.371   31.811   22.568   17.349   24.816   24.372   22.956   30.757      201
Snobbish, showy                  17.449   21.049   14.933   11.480   16.421   16.127   15.190   20.351      133
Can be mixed                     57.202   69.004   48.953   37.633   53.830   52.867   49.796   66.716      436
Night/bar/disco                  59.039   71.219   50.525   38.841   55.559   54.564   51.395   68.858      450
Total                               950     1146      813      625      894      878      827     1108     7241

Row profiles
Percentages with respect to the row totals:
Characteristic                   Pastis   whisky  martini     suze    vodka      gin   malibu     beer    Total
Likes the taste                   16.84    17.18    14.43     6.19     8.59     7.90     8.59    20.27      100
With friends                      14.26    14.26    13.06    10.31    11.86    11.68    11.86    12.71      100
To relax                          15.52    15.52    12.98     8.14     9.67     9.92     9.92    18.32      100
Expensive                         13.16    19.30     9.21     8.99    16.45    15.35    13.38     4.17      100
Refreshing, thirst-quenching      29.21     8.24     6.74     7.12     6.37     7.12     5.24    29.96      100
Inelegant, undistinguished        19.55     8.27     9.77    12.78     9.77     8.27     9.77    21.80      100
Likeable product                  15.27    15.27    13.37     8.11    10.74    10.02    10.98    16.23      100
Good before meals                 18.14    16.29    17.53    13.20     9.28     9.48     7.63     8.45      100
Good during the day               12.63    11.05     6.32     5.26     6.84     6.32     6.84    44.74      100
Good in the evening                2.36    20.61     4.05     3.72    17.91    16.89    16.22    18.24      100
All year round                    12.41    13.00    12.71    11.81    12.41    12.26    11.96    13.45      100
Popular with young people          9.51    16.28     7.61     3.38    13.74    14.59    16.07    18.82      100
Gladly with guests                14.38    15.03    14.22     9.80    11.44    10.95    10.95    13.24      100
Old-fashioned, outdated           12.90     4.30    13.98    40.86     5.38     6.45     8.60     7.53      100
For men as well as women          11.24    13.93    15.51     9.66    11.01    11.46    13.71    13.48      100
Very close                        17.51    18.89    12.44     5.07     7.37     8.29     7.83    22.58      100
Out of habit                      17.91    14.93    11.94     7.96     9.45     9.45     8.46    19.90      100
Snobbish, showy                    2.26    26.32     6.77     6.02    21.05    18.80    15.79     3.01      100
Can be mixed                       9.86    19.95     6.65     7.34    18.81    18.35     9.86     9.17      100
Night/bar/disco                    2.67    20.22     6.00     3.56    18.67    18.00    16.00    14.89      100
Total                             13.12    15.83    11.23     8.63    12.35    12.13    11.42    15.30      100

Column profiles
Percentages with respect to the column totals:
Characteristic                   Pastis   whisky  martini     suze    vodka      gin   malibu     beer    Total
Likes the taste                    5.16     4.36     5.17     2.88     2.80     2.62     3.02     5.32     4.02
With friends                       8.74     7.24     9.35     9.60     7.72     7.74     8.34     6.68     8.04
To relax                           6.42     5.32     6.27     5.12     4.25     4.44     4.72     6.50     5.43
Expensive                          6.32     7.68     5.17     6.56     8.39     7.97     7.38     1.71     6.30
Refreshing, thirst-quenching       8.21     1.92     2.21     3.04     1.90     2.16     1.69     7.22     3.69
Inelegant, undistinguished         2.74     0.96     1.60     2.72     1.45     1.25     1.57     2.62     1.84
Likeable product                   6.74     5.58     6.89     5.44     5.03     4.78     5.56     6.14     5.79
Good before meals                  9.26     6.89    10.46    10.24     5.03     5.24     4.47     3.70     6.70
Good during the day                2.53     1.83     1.48     1.60     1.45     1.37     1.57     7.67     2.62
Good in the evening                0.74     5.32     1.48     1.76     5.93     5.69     5.80     4.87     4.09
All year round                     8.74     7.59    10.46    12.64     9.28     9.34     9.67     8.12     9.24
Popular with young people          4.74     6.72     4.43     2.56     7.27     7.86     9.19     8.03     6.53
Gladly with guests                 9.26     8.03    10.70     9.60     7.83     7.63     8.10     7.31     8.45
Old-fashioned, outdated            1.26     0.35     1.60     6.08     0.56     0.68     0.97     0.63     1.28
For men as well as women           5.26     5.41     8.49     6.88     5.48     5.81     7.38     5.42     6.15
Very close                         4.00     3.58     3.32     1.76     1.79     2.05     2.06     4.42     3.00
Out of habit                       3.79     2.62     2.95     2.56     2.13     2.16     2.06     3.61     2.78
Snobbish, showy                    0.32     3.05     1.11     1.28     3.13     2.85     2.54     0.36     1.84
Can be mixed                       4.53     7.59     3.57     5.12     9.17     9.11     5.20     3.61     6.02
Night/bar/disco                    1.26     7.94     3.32     2.56     9.40     9.23     8.71     6.05     6.21
Total                               100      100      100      100      100      100      100      100      100

[Figure: factorial plane showing the eight drinks (VODKA, GIN, MALIBU, WHISKY, MARTINI, SUZE, BEER, PASTIS) and the twenty characteristics]

Attraction-repulsion indices
Characteristic                      Pastis      whisky     martini        suze       vodka         gin      malibu        beer
Likes the taste                 1.28344728  1.08565277  1.28547698  0.71663505  0.69583785   0.6518368  0.75221165  1.32500589
With friends                    1.08700127   0.9010918   1.1630506  1.19439175  0.96025623  0.96358484  1.03805208   0.8309359
To relax                        1.18307486  0.98073396  1.15580782  0.94335674  0.78316284  0.81841973  0.86889059  1.19728829
Expensive                       1.00290859  1.21935948  0.82033728  1.04168772  1.33216325  1.26601027  1.17127273  0.27229994
Refreshing, thirst-quenching    2.22668244   0.5206254  0.60043949  0.82444345  0.51570185  0.58687603  0.45910266  1.95811193
Inelegant, undistinguished      1.49003562  0.52258263  0.87056201  1.48086617  0.79168559  0.68209533  0.85582457  1.42496811
Likeable product                1.16423565  0.96511681  1.19037009  0.94012029  0.86988035  0.82668356  0.96125109  1.06060502
Good before meals               1.38297992  1.02919883  1.56093633  1.52882144  0.75150488  0.78220417  0.66796395  0.55246008
Good during the day             0.96279224  0.69836043  0.56251699  0.60976842  0.55417991   0.5208728   0.5990772  2.92364146
Good in the evening             0.18025249  1.30212313   0.3610751  0.43054595  1.45025772   1.3931001  1.41985032  1.19223217
All year round                  0.94564236  0.82168823  1.13162051  1.36810523  1.00487723  1.01086176  1.04702465  0.87917469
Popular with young people       0.72514744  1.02859288   0.6778746   0.3919019  1.11304634    1.203073  1.40684253  1.22966738
Gladly with guests              1.09598899  0.94983974  1.26612281  1.13584314  0.92641941  0.90287455  0.95855364  0.86495275
Old-fashioned, outdated         0.98349745   0.2717634  1.24499729  4.73390108  0.43545982  0.53207436  0.75318225  0.49189667
For men as well as women        0.85641632  0.88033022  1.38101082  1.11950742  0.89186085  0.94517929  1.20022553  0.88115037
Very close                      1.33474654  1.19381781   1.1081844  0.58728848  0.59720203  0.68409561  0.68593383     1.47569
Out of habit                    1.36515318  0.94305957  1.06346496  0.92223682  0.76562935  0.77958159  0.74053553  1.30053703
Snobbish, showy                 0.17192719  1.66276293  0.60269678   0.6968782  1.70516896  1.55021666  1.38248584  0.19654732
Can be mixed                    0.75172139  1.26080143  0.59240608  0.85031927  1.52331035  1.51323901  0.86352518   0.5995595
Night/bar/disco                 0.20325614  1.27774093  0.53439114  0.41193244  1.51191648  1.48448747  1.40091898  0.97301845
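
The tabulated indices agree with the ratio of the observed count to the count expected under independence, $d_{ij} = k_{ij} k / (k_{i.} k_{.j})$. The sketch below checks this on the first cell of the drinks table (first characteristic, Pastis column) and computes the full matrix for the smaller words-by-documents example; the function name `attraction_indices` is illustrative.

```python
import numpy as np

def attraction_indices(table):
    """Observed count divided by the count expected under independence."""
    table = np.asarray(table, dtype=float)
    k = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / k
    return table / expected

# First cell of the drinks table: observed 49, row total 291,
# column total 950, grand total 7241
d_first_cell = 49 * 7241 / (291 * 950)
print(round(d_first_cell, 8))

# Full matrix of indices for the words x documents example
N = np.array([[11, 20, 4, 2],
              [3, 9, 2, 2],
              [1, 5, 2, 3],
              [3, 14, 3, 16]])
d = attraction_indices(N)
print(np.round(d, 2))
```

An index above 1 signals attraction between a row and a column, an index below 1 signals repulsion, and 1 corresponds to independence.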

[Figure: box plots of the d_ij]

Chi-square statistic
• $\chi^2 = \sum_{i,j} \frac{(k_{ij} - k_{i.} k_{.j} / k)^2}{k_{i.} k_{.j} / k}$
• To be compared with the tabulated value of the chi-square distribution with (n-1)(p-1) degrees of freedom
Statistics testing the independence of rows and columns:
Chi-square value: 974.064   Degrees of freedom: 133   p-value: < 0.0001

Contributions to the chi-square:
Characteristic                   Pastis   whisky  martini     suze    vodka      gin   malibu     beer    Total
Likes the taste                   3.067    0.338    2.663    2.017    3.324    4.277    2.041    4.703   22.430
With friends                      0.578    0.901    1.737    1.898    0.114    0.094    0.096    2.545    7.963
To relax                          1.728    0.023    1.071    0.109    2.281    1.571    0.772    2.341    9.896
Expensive                         0.001    3.473    1.653    0.068    6.212    3.913    1.528   36.950   53.796
Refreshing, thirst-quenching     52.711    9.711    4.786    0.710    7.732    5.525    8.922   37.505  127.601
Inelegant, undistinguished        4.190    4.798    0.250    2.654    0.713    1.630    0.316    3.675   18.226
Likeable product                  1.483    0.081    1.705    0.130    0.876    1.526    0.072    0.235    6.107
Good before meals                 9.333    0.065   17.134   11.707    3.698    2.790    6.107   14.864   65.698
Good during the day               0.035    2.736    4.083    2.497    4.662    5.289    3.488  107.583  130.373
Good in the evening              26.096    4.276   13.567    8.285    7.409    5.546    5.959    1.674   72.812
All year round                    0.259    3.366    1.301    7.824    0.002    0.010    0.169    1.494   14.426
Popular with young people         4.688    0.061    5.511   15.097    0.746    2.365    8.942    3.818   41.228
Gladly with guests                0.740    0.244    4.866    0.975    0.409    0.700    0.120    1.708    9.762
Old-fashioned, outdated           0.003    7.806    0.627  111.915    3.659    2.469    0.647    3.674  130.801
For men as well as women          1.204    1.009    7.253    0.549    0.642    0.162    2.038    0.962   13.818
Very close                        3.190    1.290    0.285    3.190    4.347    2.626    2.445    7.514   24.887
Out of habit                      3.516    0.103    0.091    0.105    1.363    1.184    1.545    2.778   10.686
Snobbish, showy                  11.965    9.246    2.357    1.055    8.165    4.882    2.222   13.138   53.030
Can be mixed                      3.526    4.693    8.133    0.843   14.742   13.926    0.927   10.698   57.488
Night/bar/disco                  37.478    5.494   10.953   13.432   14.560   12.808    8.261    0.050  103.036
Total                           165.791   59.714   90.026  185.062   85.655   73.292   56.616  257.909  974.064

Applications
• In medicine
• In text mining

Applications in medicine
• Pharmacology
• Therapeutic trials (to avoid double-blind procedures): CA allows the physician to follow the evolution of the illness and/or of the therapy
• Textual analysis: reports, business intelligence, bibliometrics

Application to mucoviscidosis
• Mucoviscidosis: a rare disease
– No specific keywords
– No specific journals
• Goal: to define a minimum common vocabulary for the researchers working on mucoviscidosis (clinicians, geneticists, etc.)

HYPOTHESIS:
THE TYPICAL WORDS FOR A GIVEN TOPIC ARE INDEPENDENT OF THE TECHNIQUES
[Figure: SURGEON WORDS, GENETICS WORDS, TOPIC WORDS]

Processing
• First step of the study: create a "kernel" base containing the references of the scientific documents used by people working on the disease => 612 publications

• 30 axes, each with a positive side and a negative side
• Each side of each axis is characterized by the words with a high relative contribution to the inertia (greater than a threshold).

DATA
• Two-way table crossing the 612 documents (summaries) and 850 words
• CA on this two-way table

Dimension of a word
• The words of a topic are one-dimensional
• The words of a field are multidimensional
• The dimension of a word is the number of axes on which this word has a high relative contribution to the inertia
• If we want to find the minimum common vocabulary, the dimension of a word must be high

[Figure: MUCOVISCIDOSIS BASE, factorial plane of the words, including EXON, ALLELES, CBAVD, MUTATIONS, SCREENING, PCR, DNA, GENE, VENTRICULAR, HYPERTENSION, TRANSPLANTATIONS, HEART, LIVER, LUNG, REJECTION, TREATMENT, CF, CFTR, CONDUCTANCE, EXPRESSION, PROTEIN, CELLS, MEMBRANE, TRANSPORT, ELASTASE, CHANNELS, INHIBITOR, BILE]

81 words have a dimension greater than 10
ACID, AERUGINOSA, ANTIGENS, AUREUS, CELL, CHANNELS, CONCENTRATIONS, DRUG, EPITHELIUM, FLOW, LEFT, MUCIN, NASAL, PEPTIDE, PRENATAL, PROTEINASE, RECEPTOR, SCREENING, STRAINS, TRANSPORT, WATER,
ADENOSINE, ALPHA, ANTITRYPSIN, BRONCHIAL, CELLS, CHILDREN, CYSTIC, ELASTASE, EXPRESSION, FLUID, LIVER, MUCINS, NEONATAL, PERFORMANCE, PROPERTIES, PSEUDOMONAS, RECEPTORS, SECRETION, THERAPY, TRYPSIN,
ADENOVIRUS, ALVEOLAR, ASPERGILLOSIS, CAMP, CFTR, CHROMOSOME, DIAGNOSIS, ELASTIN, FETAL, HLA, LUNG, MUCUS, NEUTROPHILS, PLASMA, PROTEASE, RAT, REJECTION, SECRETIONS, TRANSFER, VENTRICULAR,
ADHESION, AMILORIDE, ATP, CASES, CHANNEL, CIRRHOSIS, DOUBLE, EMPHYSEMA, FIBROSIS, INHIBITOR, MARKERS, MUTATIONS, PATCHES, PNEUMONIA, PROTEIN, RATS, RIGHT, SPUTUM, TRANSPLANTATION, VIVO

Is a high dimension a sufficient condition to characterize the disease?
To check this, we use other thematic databases and, in each of them, we count the number of documents containing at least two of the previous 81 words.
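
The counting rule can be sketched as follows. The function `count_retrieved`, the three toy documents and the short vocabulary are hypothetical and only illustrate the "at least two words" criterion; the tokenization (lower-casing and splitting on whitespace) is a simplifying assumption.

```python
def count_retrieved(documents, vocabulary, min_hits=2):
    """Count documents containing at least `min_hits` distinct vocabulary words."""
    vocab = {word.lower() for word in vocabulary}
    retrieved = 0
    for text in documents:
        tokens = set(text.lower().split())
        if len(tokens & vocab) >= min_hits:
            retrieved += 1
    return retrieved

# Toy data, made up for the illustration
documents = [
    "CFTR gene mutations in cystic fibrosis",
    "Breast cancer screening outcomes",
    "Polyamines and cell growth",
]
vocabulary = ["CFTR", "MUTATIONS", "CYSTIC", "FIBROSIS", "SCREENING", "CELL"]

n_retrieved = count_retrieved(documents, vocabulary)
rate = 100 * n_retrieved / len(documents)
print(n_retrieved, f"{rate:.0f}%")
```

A vocabulary that characterizes the disease should give a high retrieval rate on the mucoviscidosis base and a low rate on the other bases.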

5 thematic databases
• BREAST CANCER: 9871 documents
• POLYAMINES: 12726 documents
• TUMOR-INFILTRATING LEUCOCYTES: 586 documents
• ACUTE LYMPHOBLASTIC LEUKEMIA: 2063 documents
• MUCOVISCIDOSIS: 612 documents

RETRIEVAL STATISTICS WITH THE 81 WORDS
Database subject                  Retrieved   Database size   Retrieval rate
Mucoviscidosis                          612             612             100%
Acute lymphoblastic leukemia           1990            2063              96%
Polyamines                            11912           12726              94%
Breast cancer                          8728            9871              88%
Tumor-infiltrating leucocytes           546             586              93%
Total                                 23788           25858              92%

CA of the 5 databases and 81 words
[Figure: factorial plane showing the five databases (acute lymphoblastic leukemia, breast cancer, tumor-infiltrating leucocytes, polyamines, mucoviscidosis) and the 81 words]

20 remaining words
adenovirus, aeruginosa, amiloride, antitrypsin, aureus, bronchial, CFTR, cirrhosis, cystic, elastase, elastin, emphysema, fibrosis, mucus, nasal, patches, proteinase, pseudomonas, secretions, sputum

Retrieval statistics with these 20 words
Subject                           Retrieved (rate)   Database size
Mucoviscidosis                    550 (89.9%)                  612
Leukemia                          38 (1.8%)                   2063
Polyamines                        341 (2.7%)                 12726
Breast cancer                     202 (2.1%)                  9878
Tumor-infiltrating leucocytes     9 (1.5%)                     586

Conclusion of this application
• CA is a very powerful method for displaying the associations among variables
• It can be used with large datasets (one of the dimensions must be "tractable")

Text mining

Goals
• Information retrieval
• Bibliometrics
• Responses to open questions
• Mailing and sorting
• Ontology (specification of a concept)
• Content analysis …

• The problem that motivated the study gives rise to an a priori formalization of a statistical model
• Data: observations, databases… and preprocessing
• Processing: uncover the main structural traits
• Interpretation: the beginning of the hard work

Data and preprocessing
• Choosing the units is not simple:
– What about segmentation, lemmatization, disambiguation?
– Coding: words and their frequencies in units (sentences, abstracts, mails, documents, etc.)
– Contingency table (lexical table) crossing units and words (no hapax, no tool words, filtering on the frequencies)

Methods for text mining
• Multivariate descriptive methods: LSA (latent semantic analysis), CA (correspondence analysis) and clustering methods
• Kohonen maps: WEBSOM

Aim
• Find the associations between words
• Find overlapping groups of words and of documents

Underlying ideas
• Words are polysemic → the context (word neighbourhood) determines the meaning
• Documents are generally polythematic

• Metakey: group of words whose contribution to one axis is greater than a threshold (2, 3 or 6 times the average contribution)
• Dimension of a word: number of metakeys to which it belongs
• Minimal common vocabulary: words with a high dimension
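
These definitions can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it builds one metakey per side (positive or negative) of each axis, as described for the mucoviscidosis study, and the toy coordinates and masses are made up.

```python
import numpy as np

def metakeys(coords, masses, factor=2.0):
    """Map (axis, side) to the indices of the words in that metakey."""
    contrib = masses[:, None] * coords ** 2
    contrib = contrib / contrib.sum(axis=0)     # contributions sum to 1 per axis
    threshold = factor / coords.shape[0]        # factor x average contribution
    keys = {}
    for axis in range(coords.shape[1]):
        high = contrib[:, axis] > threshold
        keys[(axis, "+")] = np.flatnonzero(high & (coords[:, axis] > 0))
        keys[(axis, "-")] = np.flatnonzero(high & (coords[:, axis] < 0))
    return keys

def word_dimensions(keys, n_words):
    """Dimension of a word = number of metakeys it belongs to."""
    dims = np.zeros(n_words, dtype=int)
    for members in keys.values():
        dims[members] += 1
    return dims

# Toy example: 5 words, 2 axes, equal masses (all values are hypothetical)
coords = np.array([[2.0, 0.1],
                   [-2.0, 0.1],
                   [0.1, 1.5],
                   [0.1, -0.1],
                   [0.0, -1.5]])
masses = np.full(5, 0.2)

keys = metakeys(coords, masses, factor=2.0)
dims = word_dimensions(keys, n_words=5)
print(dims)
```

With real data the coordinates and masses would come from the CA of the lexical table, and the words with the largest values in `dims` would form the minimal common vocabulary.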

Study
• To get an overview of the scientific production of INRIA through its internal reports
• INRIA: French national institute for research in computer science and control, a research institute at the heart of the information society

• An INRIA research project-team is a team of limited size with relatively focused scientific objectives and themes. A project head is in charge of leading and coordinating the team. The research centers also host other teams that are not recognized as research project-teams. These teams are very often joint teams with partner institutions.

Methods for text mining
• Multivariate descriptive methods: LSA (latent semantic analysis), CA (correspondence analysis) and clustering methods
• Kohonen maps: WEBSOM
• N-grams

Algorithm
• Build the contingency table
• Perform a CA
• Select the metakeys, then
– either build a new documents × metakeys table and perform a CA,
– or select the words of the metakeys and perform a CA,
– or cluster the metakeys,
– or use a secator and perform a CA on each metakey × documents table of the neighbourhood
• Start again with the "residuals" of the previous analysis

Research themes
• Theme 1: networks and systems
• Theme 2: software engineering and symbolic computing
• Theme 3: human-computer interaction, image processing, data management, knowledge systems
• Theme 4: simulation and optimization of complex systems

Theme 1 (44 teams)
• Architectures and Systems
• Networks and Telecommunications
• Distributed and Real-Time Programming

Theme 2 (26 teams)
• Semantics and Programming
• Algorithms and Computational Algebra

Theme 3 (48 teams)
• Databases, Knowledge Bases, Cognitive Systems
• Vision, Image Analysis and Synthesis

Theme 4 (32 teams)
• Control, Robotics, Signal
• Modelling and Scientific Computing

• 3315 English abstracts of INRIA internal reports from 1989 to July 2003, and 892 words
• Results: theme 1 and theme 4 are overrepresented

Interpretation: visualization tools
[Figures: visualization tools]

Themes distribution
[Figures: distribution of the themes]

[Figure: plane 1-2 (centers)]
[Figure: plane 5-6]

Correspondence Factorial Analysis and Optimal Scaling (Dual Scaling)

16 people: eye and hair colors
Cross-tabulation
X \ Y         fair hair   black hair   red hair   Total
blue eyes             3            0          1       4
brown eyes            1            1          2       4
black eyes            0            4          0       4
green eyes            0            1          3       4
Total                 4            6          6      16

16 people: eye and hair colors
Database
Individual   X            Y
1            brown eyes   red hair
2            green eyes   red hair
3            green eyes   black hair
4            black eyes   black hair
5            black eyes   black hair
6            blue eyes    fair hair
7            brown eyes   red hair
8            brown eyes   fair hair
9            brown eyes   black hair
10           black eyes   black hair
11           green eyes   red hair
12           green eyes   red hair
13           blue eyes    fair hair
14           black eyes   black hair
15           blue eyes    red hair
16           blue eyes    fair hair

Looking for X and Y values that maximize R²(X, Y)
Optimal Scaling
Individual   X            Y            X coding   Y coding
1            brown eyes   red hair         0.24       0.33
2            green eyes   red hair        -0.06       0.33
3            green eyes   black hair      -0.06      -1.20
4            black eyes   black hair      -1.50      -1.20
5            black eyes   black hair      -1.50      -1.20
6            blue eyes    fair hair        1.25       1.24
7            brown eyes   red hair         0.24       0.33
8            brown eyes   fair hair        0.24       1.31
9            brown eyes   black hair       0.24      -1.20
10           black eyes   black hair      -1.50      -1.20
11           green eyes   red hair        -0.06       0.33
12           green eyes   red hair        -0.06       0.33
13           blue eyes    fair hair        1.32       1.31
14           black eyes   black hair      -1.50      -1.20
15           blue eyes    red hair         1.32       0.33
16           blue eyes    fair hair        1.32       1.31
Mean =                                     0.00       0.00
Sigma =                                       1          1
rho = 0.80
rho² = 0.64

Looking for 2 sets of numbers X and Y
Xi = value for category "i" (brown eyes => 0.24, ...)
Yj = value for category "j" (red hair => 0.33, ...)
such that R² is maximized, with
X average = 0
Y average = 0
X variance = 1
Y variance = 1
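
The link between optimal scaling and CA can be checked numerically on the eye and hair colors table. In the sketch below (NumPy and the SVD formulation of CA are assumptions), the standard coordinates on the first CA axis play the role of the codings X and Y, and the first singular value is their correlation.

```python
import numpy as np

# Rows: blue, brown, black, green eyes; columns: fair, black, red hair
N = np.array([[3, 0, 1],
              [1, 1, 2],
              [0, 4, 0],
              [0, 1, 3]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

rho = sing[0]                 # correlation between the two optimal codings
x = U[:, 0] / np.sqrt(r)      # coding of the eye colors (mean 0, variance 1)
y = Vt[0] / np.sqrt(c)        # coding of the hair colors (mean 0, variance 1)

# Correlation of the two codings over the 16 individuals
corr = x @ P @ y
print(round(rho, 2), np.round(x, 2), np.round(y, 2))
```

The overall sign of `x` and `y` is arbitrary (both may be flipped together), so only the relative positions of the categories should be compared with the table of codings.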

Optimal Scaling (SPSS) and Correspondence Analysis
[Figure: scatter plot of CODAGE_Y (fair, red and black hair) against CODAGE_X (black, green, brown and blue eyes)]

Optimal Scaling (SPSS) and Optimal Ranking (AMADO)
[Figure: the same scatter plot next to the AMADO graph, with eye colors ordered black, green, brown, blue and hair colors ordered fair, red, black]

Optimal Scaling gives the first factor of Correspondence Analysis
[Figure: eye and hair colors positioned along the first factor]

Second Optimal Scaling gives the second factor of Correspondence Analysis, and so on …
Let X1 and Y1 be the first Optimal Scaling solution.
We now look for a second optimal coding solution, two sets of numbers X2 and Y2, uncorrelated with the first one:
X2i = value for category "i"
Y2j = value for category "j"
such that R² is maximized, with
X2 average = 0
Y2 average = 0
X2 variance = 1
Y2 variance = 1
and
Corr(X1; X2) = 0
Corr(Y1; Y2) = 0

0/1 table, Optimal Scaling, Correspondence Factorial Analysis and Block Models

Block Model
0/1 table
      a6 a17 a12  a2 a18 a22 a24 a19 a13  a8 a11  a3 a23  a9 a20  a1 a15 a25  a5 a16 a14  a7 a21  a4 a10
W7     1   0   0   0   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0   0   1   0   0   1
W13    0   1   1   0   1   1   1   1   1   0   1   0   1   1   1   0   1   1   0   1   1   0   1   0   1
W3     0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   1   0   0   1   0   0   0   0   1   0
W11    0   1   1   0   1   1   1   1   1   0   1   0   1   1   1   0   1   1   0   1   1   0   1   0   1
W9     1   0   0   0   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0   0   1   0   0   1
W1     0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   1   0   0   1   0   0   0   0   1   0
W4     0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   1   0   0   1   0   0   0   0   1   0
W10    1   1   1   0   1   1   1   1   1   1   1   0   1   1   1   0   1   1   0   1   1   1   1   0   1
W12    0   1   1   0   1   1   1   1   1   0   1   0   1   1   1   0   1   1   0   1   1   0   1   0   1
W6     0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   1   0   0   1   0   0   0   0   1   0
W8     1   0   0   0   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0   0   1   0   0   1
W14    0   1   1   0   1   1   1   1   1   0   1   0   1   0   1   0   1   1   0   1   1   0   1   0   0
W5     0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   1   0   0   1   0   0   0   0   1   0
W2     0   0   0   1   0   0   0   0   0   0   0   1   0   0   0   1   0   0   1   0   0   0   0   1   0

Block Model
[Figure: Amado graph of the 0/1 table, with rows and columns in their original order]

Block Model = Amado Graph after ranking rows and columns along the first Correspondence Analysis axis
Column order: a17 a12 a18 a22 a24 a19 a13 a11 a23 a20 a15 a25 a16 a14 a21 a9 a10 a6 a8 a7 a2 a3 a1 a5 a4
Row order: W14 W13 W11 W12 W10 W7 W9 W8 W3 W1 W4 W6 W5 W2
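
The reordering step can be sketched as follows, assuming the SVD formulation of CA. The small 0/1 table is a made-up example with two hidden blocks, and `first_axis_order` is an illustrative name; sorting rows and columns by their coordinate on the first axis brings each block together.

```python
import numpy as np

def first_axis_order(table):
    """Row and column orders obtained by sorting along the first CA axis."""
    T = np.asarray(table, dtype=float)
    P = T / T.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    row_coord = U[:, 0] / np.sqrt(r)
    col_coord = Vt[0] / np.sqrt(c)
    return (np.argsort(row_coord, kind="stable"),
            np.argsort(col_coord, kind="stable"))

# Toy 0/1 table with two shuffled blocks (hypothetical data)
T = np.array([[1, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 1, 0],
              [1, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 1, 0],
              [1, 0, 1, 0, 0, 1]])

row_order, col_order = first_axis_order(T)
reordered = T[row_order][:, col_order]
print(reordered)
```

Which block comes first depends on the arbitrary sign of the axis, but in either case the ones end up in contiguous rectangles along the diagonal.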
Correspondence Analysis, Optimal Scaling (= Dual Scaling) and the search for blocks

Two-Blocks Optimal Scaling
Eyes       Hair
blue       Fair
blue       Fair
blue       Fair
blue       Fair
Green      Red
Green      Red
Green      Black
Chestnut   Black
Chestnut   Red

Two-Blocks Optimal Scaling: R² = 1
Block 1: Blue eyes with Fair hair
Block 2: Green or Chestnut eyes with Red or Black hair

B blocks =>
(B-1) uncorrelated codings with R² = 1
(B-1) eigenvalues equal to 1
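
This property can be verified numerically. The sketch below (NumPy assumed) uses the two-block eye and hair table of nine people, plus a made-up table with three perfect blocks; `ca_eigenvalues` is an illustrative helper name.

```python
import numpy as np

def ca_eigenvalues(table):
    """Non-trivial eigenvalues of the correspondence analysis of a table."""
    T = np.asarray(table, dtype=float)
    P = T / T.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sing = np.linalg.svd(S, compute_uv=False)
    return sing[: min(T.shape) - 1] ** 2

# Rows: blue, green, chestnut eyes; columns: fair, red, black hair
two_blocks = np.array([[4, 0, 0],
                       [0, 2, 1],
                       [0, 1, 1]])

# Hypothetical table with three perfect blocks
three_blocks = np.kron(np.eye(3), np.ones((2, 2)))

eig2 = ca_eigenvalues(two_blocks)
eig3 = ca_eigenvalues(three_blocks)
print(np.round(eig2, 4), np.round(eig3, 4))
```

With two blocks exactly one eigenvalue equals 1, and with three blocks two eigenvalues equal 1, in line with the (B-1) rule.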

Thank you