Subgroup - LIACS Data Mining Group
Subgroup Discovery
Finding Local Patterns in Data
Exploratory Data Analysis
Scan the data without much prior focus
Find unusual parts of the data
Analyse attribute dependencies
interpret this as ‘rule’:
if X=x and Y=y then Z is unusual
Complex data: nominal, numeric, relational?
Exploratory Data Analysis
Classification: model the dependency of the
target on the remaining attributes.
problem: sometimes classifier is a black-box, or uses
only some of the available dependencies.
for example: in decision trees, some attributes may not
appear because of overshadowing.
Exploratory Data Analysis: understanding the
effects of all attributes on the target.
Interactions between Attributes
Single-attribute effects are not enough
XOR problem is extreme example: 2 attributes
with no info gain form a good subgroup
Apart from single conditions
A=a, B=b, C=c, …
consider also conjunctions
A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …
A=a ∧ B=b ∧ C=c, …
…
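As a sketch of how these candidate descriptions can be enumerated, the snippet below generates all conjunctions of attribute=value conditions up to a given depth. The `domains` dictionary and the helper name `candidate_descriptions` are hypothetical, not part of any particular tool.

```python
from itertools import combinations, product

# Hypothetical toy domains for the attributes on this slide.
domains = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}

def candidate_descriptions(domains, max_depth):
    """Enumerate conjunctions of attribute=value conditions up to
    max_depth conditions, using each attribute at most once."""
    attrs = sorted(domains)
    for depth in range(1, max_depth + 1):
        for chosen in combinations(attrs, depth):
            for values in product(*(domains[a] for a in chosen)):
                yield tuple(zip(chosen, values))

cands = list(candidate_descriptions(domains, 2))
# depth 1: 2 + 2 + 1 = 5 single conditions
# depth 2: 2·2 + 2·1 + 2·1 = 8 pairwise conjunctions
print(len(cands))  # 13
```

The count grows combinatorially with depth, which is why the inductive constraints of the next slide are needed to keep the search feasible.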
Subgroup Discovery Task
“Find all subgroups within the inductive constraints
that show a significant deviation in the distribution
of the target attributes”
Inductive constraints:
Minimum support
(Maximum support)
Minimum quality (information gain, χ², WRAcc)
Maximum complexity
…
Subgroup Discovery: the
Binary Target Case
Confusion Matrix
A confusion matrix (or contingency table) describes
the frequency of the four combinations of subgroup
and target:
within subgroup, positive
within subgroup, negative
outside subgroup, positive
outside subgroup, negative
             target
            T      F
subgroup T  .42    .13    .55
         F  .12    .33    .45
            .54    .46    1.0
Confusion Matrix
High numbers along the TT–FF diagonal indicate a positive correlation between subgroup and target
High numbers along the TF–FT diagonal indicate a negative correlation between subgroup and target
The target distribution on the DB is fixed
Hence only two degrees of freedom remain
             target
            T      F
subgroup T  .42    .13    .55
         F  .12    .33    .45
            .54    .46    1.0
Quality Measures
A quality measure for subgroups summarizes the interestingness of a subgroup's confusion matrix in a single number
WRAcc, weighted relative accuracy
Balance between coverage and unexpectedness
WRAcc(S,T) = p(ST) − p(S)p(T)
between −.25 and .25, 0 means uninteresting

             target
            T      F
subgroup T  .42    .13    .55
         F  .12    .33    .45
            .54    .46    1.0

WRAcc(S,T) = p(ST) − p(S)p(T) = .42 − .55 × .54 = .42 − .297 = .123
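The WRAcc value above is easy to reproduce; a minimal sketch using the probabilities read off the slide's confusion matrix:

```python
def wracc(p_st, p_s, p_t):
    """Weighted relative accuracy: p(ST) - p(S)p(T)."""
    return p_st - p_s * p_t

# Probabilities from the confusion matrix on this slide:
# p(ST) = .42, p(S) = .55, p(T) = .54
print(round(wracc(0.42, 0.55, 0.54), 3))  # 0.123
```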
Quality Measures
WRAcc: Weighted Relative Accuracy
Information gain
χ²
Correlation Coefficient
Laplace
Jaccard
Specificity
…
Subgroup Discovery as Search
[Figure: the refinement lattice searched. The root is the empty condition true (the entire database); it is refined into single conditions A=a1, A=a2, B=b1, B=b2, C=c1, …, which are refined into conjunctions A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, A=a1 ∧ B=b1 ∧ C=c1, …. Each node carries its confusion matrix; a branch is cut off once the minimum support level is reached.]
Refinements are (anti-)monotonic
[Figure: the entire database with a target concept and nested subgroups: S3 is a refinement of S2, S2 a refinement of S1.]
Refinements are (anti-)monotonic in their support…
…but not in interestingness. This may go up or down.
Subgroup Discovery and
ROC space
ROC Space
ROC = Receiver Operating Characteristics
Each subgroup forms a
point in ROC space, in
terms of its False Positive
Rate, and True Positive
Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of all positive cases that fall inside the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of all negative cases that fall inside the subgroup)
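Computing a subgroup's ROC point is a two-line affair; a sketch using the entries of the confusion matrix from the earlier slides:

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) for a subgroup, given the four confusion
    matrix entries (counts or relative frequencies both work)."""
    tpr = tp / (tp + fn)  # fraction of all positives inside the subgroup
    fpr = fp / (fp + tn)  # fraction of all negatives inside the subgroup
    return fpr, tpr

# Entries of the confusion matrix used earlier:
fpr, tpr = roc_point(tp=0.42, fp=0.13, fn=0.12, tn=0.33)
print(round(fpr, 3), round(tpr, 3))  # 0.283 0.778
```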
ROC Space Properties
[Figure: ROC space. The empty subgroup lies at (0,0) and the entire database at (1,1); ‘ROC heaven’ (the perfect subgroup) is the top-left corner, ‘ROC hell’ (the perfect negative subgroup) the bottom-right corner; random subgroups lie on the diagonal; a minimum support threshold cuts off the region around the origin.]
source: Flach & Fürnkranz
Measures in ROC Space
[Figure: isometrics (lines of equal quality) of WRAcc and information gain in ROC space.]
Other Measures
Precision
Gini index
Correlation coefficient
Foil gain
Refinements in ROC Space
Refinements of S will reduce the FPR and TPR, so they appear to the left of and below S.
The blue polygon represents the possible refinements of S.
With a convex measure, f is bounded by the measure at the polygon's corners.
If the corners are not above the minimum quality or the current best (top-k), prune the search space below S.
Multi-class problems
Generalising to problems with more than 2 classes is fairly straightforward:

             target
            C1     C2     C3
subgroup T  .27    .06    .22    .55
         F  .03    .19    .23    .45
            .30    .25    .45    1.0

combine values to a quality measure: χ², information gain
source: Nijssen & Kok
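To make the "combine values to a quality measure" step concrete, here is a minimal χ² computation over the multi-class table on this slide, with the relative frequencies scaled to counts out of a hypothetical 100 records:

```python
def chi2(table):
    """Chi-squared statistic of an r x c contingency table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# The slide's table, scaled to counts out of 100 records:
table = [[27, 6, 22],
         [3, 19, 23]]
print(round(chi2(table), 2))  # 25.23
```

A large statistic means the class distribution inside the subgroup deviates strongly from the overall distribution.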
Subgroup Discovery for
Numeric targets
Numeric Subgroup Discovery
Target is numeric: find subgroups with
significantly higher or lower average value
Trade-off between size of subgroup and average
target value
[Figure: example subgroups with average target values h = 3600, h = 3100 and h = 2200.]
Types of SD for Numeric Targets
Regression subgroup discovery
numeric target has order and scale
Ordinal subgroup discovery
numeric target has order
Ranked subgroup discovery
numeric target has order or scale
Vancouver 2010 Winter Olympics
Official IOC ranking of countries (medals > 0)
ordinal target: the rank; regression target: the medal count
Partial ranking: objects share a rank
Rank   Country         Medals  Athletes  Continent   Popul.  Language Family  Repub.  Polar
1      USA             37      214       N. America  309     Germanic         y       y
2      Germany         30      152       Europe      82      Germanic         y       n
3      Canada          26      205       N. America  34      Germanic         n       y
4      Norway          23      100       Europe      4.8     Germanic         n       y
5      Austria         16      79        Europe      8.3     Germanic         y       n
6      Russ. Fed.      15      179       Asia        142     Slavic           y       y
7      Korea           14      46        Asia        73      Altaic           y       n
9      China           11      90        Asia        1338    Sino-Tibetan     y       n
9      Sweden          11      107       Europe      9.3     Germanic         n       y
9      France          11      107       Europe      65      Italic           y       n
11     Switzerland     9       144       Europe      7.8     Germanic         y       n
12     Netherlands     8       34        Europe      16.5    Germanic         n       n
13.5   Czech Rep.      6       92        Europe      10.5    Slavic           y       n
13.5   Poland          6       50        Europe      38      Slavic           y       n
16     Italy           5       110       Europe      60      Italic           y       n
16     Japan           5       94        Asia        127     Japonic          n       n
16     Finland         5       95        Europe      5.3     Finno-Ugric      y       y
20     Australia       3       40        Australia   22      Germanic         y       n
20     Belarus         3       49        Europe      9.6     Slavic           y       n
20     Slovakia        3       73        Europe      5.4     Slavic           y       n
20     Croatia         3       18        Europe      4.5     Slavic           y       n
20     Slovenia        3       49        Europe      2       Slavic           y       n
23     Latvia          2       58        Europe      2.2     Slavic           y       n
25     Great Britain   1       52        Europe      61      Germanic         n       n
25     Estonia         1       30        Europe      1.3     Finno-Ugric      y       n
25     Kazakhstan      1       38        Asia        16      Turkic           y       n

Shared ranks are averaged: fractional ranks. Popul. in millions.
Interesting Subgroups
‘polar = yes’
1. United States
3. Canada
4. Norway
6. Russian Federation
9. Sweden
16. Finland

‘language_family = Germanic & athletes ≥ 60’
1. United States
2. Germany
3. Canada
4. Norway
5. Austria
9. Sweden
11. Switzerland
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example subgroup: language_family = Slavic
Intuitions
Example subgroup: polar = yes, judged on the same four intuitions (size, rank, position, deviation)
Intuitions
Example subgroup: population ≤ 10M, judged on the same four intuitions (size, rank, position, deviation)
Intuitions
Example subgroup: language_family = Slavic & population ≤ 10M, judged on the same four intuitions (size, rank, position, deviation)
Quality Measures
Average
Mean test
z-score
t-statistic
Median χ² statistic
AUC of ROC
Wilcoxon–Mann–Whitney ranks statistic
Median MAD metric
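As an illustration of one of these measures, a common formulation of the z-score quality compares a subgroup's mean target value with the population mean, weighted by subgroup size. The data below is hypothetical:

```python
from math import sqrt

def z_score(subgroup, pop_mean, pop_std):
    """z-score of a subgroup's mean target value against the population
    mean; a larger |z| means a more exceptional subgroup."""
    n = len(subgroup)
    return (sum(subgroup) / n - pop_mean) * sqrt(n) / pop_std

# Hypothetical numeric target values for the whole dataset:
population = [10, 12, 9, 11, 30, 32, 31, 8, 10, 11]
mu = sum(population) / len(population)
sigma = sqrt(sum((x - mu) ** 2 for x in population) / len(population))
# A candidate subgroup covering the three high-valued records:
print(round(z_score([30, 32, 31], mu, sigma), 2))  # 2.63
```

The √n factor implements the size intuition: a deviation of the same magnitude counts for more in a larger subgroup.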
Meet Cortana
the open source Subgroup Discovery tool
Cortana Features
Generic Subgroup Discovery algorithm
quality measure
search strategy
inductive constraints
Flat files: .txt, .arff (DB connection to come)
Support for complex targets
41 quality measures
ROC plots
Statistical validation
Target Concepts
‘Classical’ Subgroup Discovery
nominal targets (classification)
numeric targets (regression)
Exceptional Model Mining
multiple targets
regression, correlation
multi-label classification
(to be discussed in a few slides)
Mixed Data
Data types
binary
nominal
numeric
Numeric data is treated dynamically (no
discretisation as preprocessing)
all: consider all available thresholds
bins: discretise the current candidate subgroup
best: find best threshold, and search from there
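The ‘best’ strategy can be sketched as follows. This is a simplified illustration against a binary target using WRAcc; Cortana's actual implementation may differ:

```python
def best_threshold(values, labels):
    """'best' strategy sketch: try each observed value of a numeric
    attribute as a '>= threshold' condition and keep the threshold
    that maximises WRAcc against a binary (0/1) target."""
    n = len(values)
    p_t = sum(labels) / n
    best = (None, float("-inf"))
    for thr in sorted(set(values)):
        member_labels = [l for v, l in zip(values, labels) if v >= thr]
        p_s = len(member_labels) / n
        p_st = sum(member_labels) / n
        quality = p_st - p_s * p_t  # WRAcc
        if quality > best[1]:
            best = (thr, quality)
    return best

print(best_threshold([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]))  # (4, 0.25)
```

Because the threshold is chosen on the current candidate subgroup rather than fixed in preprocessing, different branches of the search can use different cut points for the same attribute.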
Statistical Validation
Determine distribution of random results
random subsets
random conditions
swap-randomization
Determine minimum quality
Significance of individual results
Validate quality measures
how exceptional?
Open Source
You can
Use Cortana binary
datamining.liacs.nl/cortana.html
Use and modify Cortana sources (Java)
Exceptional Model Mining
Subgroup Discovery with multiple target attributes
Mixture of Distributions
[Figure: scatter plot of the data; both axes run from 0 to 100.]
Mixture of Distributions
[Figure: the same data shown as a mixture of two distributions in separate panels; axes 0–100.]
For each datapoint it is unclear whether it belongs to G or to its complement
What is an intensional description of the exceptional subgroup G?
Model class unknown
Model parameters unknown
Solution: extend Subgroup Discovery
Use other information than X and Y: object descriptions D
Use Subgroup Discovery to scan sub-populations in
terms of D
Subgroup Discovery: find subgroups of the database
where the target attribute shows an unusual distribution.
Model over subgroup becomes target of SD
Exceptional Model Mining
Subgroup Discovery: find subgroups of the database
where the target attributes show an unusual distribution,
by means of modeling over the target attributes.
Exceptional Model Mining
target concept
object description
X
y
modeling
Subgroup Discovery
Define a target concept (X and y)
Choose a model class C
Define a quality measure
φ over C
Use Subgroup Discovery to find exceptional subgroups G and
associated model M
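The four steps above can be sketched as a generic loop. Everything here is a toy instantiation with hypothetical names: the "model class" is just the mean of the target, and φ is the absolute difference in means.

```python
def emm(candidates, data, fit_model, quality, min_support=2):
    """Generic EMM loop sketch: fit the chosen model class on each
    candidate subgroup and rank subgroups by the quality measure phi."""
    global_model = fit_model(data)
    results = []
    for describe in candidates:
        subgroup = [row for row in data if describe(row)]
        if len(subgroup) < min_support:  # inductive constraint
            continue
        results.append((quality(fit_model(subgroup), global_model), describe))
    return sorted(results, key=lambda r: r[0], reverse=True)

# Toy instantiation: rows are (attribute, target).
data = [(0, 1.0), (0, 1.2), (1, 5.0), (1, 5.2), (0, 0.8)]
mean_y = lambda rows: sum(y for _, y in rows) / len(rows)
phi = lambda model, global_model: abs(model - global_model)
candidates = [lambda row: row[0] == 1, lambda row: row[0] == 0]
top_quality, _ = emm(candidates, data, mean_y, phi)[0]
print(round(top_quality, 2))  # 2.46
```

Swapping in a regression or classification model with its own φ gives the concrete instances on the following slides.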
Quality Measure
Specify what defines an exceptional subgroup G, based on properties of its model MG
Absolute measure (absolute quality of MG)
Correlation coefficient
Predictive accuracy
Difference measure (difference between MG and the model fitted on the rest of the data)
Difference in slope
Qualitative properties of the classifier
Reliable results
Minimum support level
Statistical significance of G
Correlation Model
Correlation coefficient
φρ = ρ(G)
Absolute difference in correlation
φabs = |ρ(G) − ρ(¬G)|
Entropy weighted absolute difference
φent = H(p)·|ρ(G) − ρ(¬G)|
Statistical significance of correlation difference φscd
compute z-score from ρ through the Fisher transform z′ = ½ ln((1+ρ)/(1−ρ)):
z = (z′(G) − z′(¬G)) / √(1/(nG − 3) + 1/(n¬G − 3))
compute p-value from z-score
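The φscd computation can be sketched directly from the formula above; the correlations and sample sizes below are hypothetical:

```python
from math import erf, log, sqrt

def fisher_z(r):
    """Fisher transform z' = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * log((1 + r) / (1 - r))

def corr_diff_p_value(r_g, n_g, r_c, n_c):
    """Two-sided p-value for the difference between the correlation in a
    subgroup (r_g, on n_g points) and in its complement (r_c, n_c)."""
    z = (fisher_z(r_g) - fisher_z(r_c)) / sqrt(1 / (n_g - 3) + 1 / (n_c - 3))
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

# Hypothetical numbers: stronger correlation inside the subgroup.
print(round(corr_diff_p_value(0.6, 30, 0.3, 120), 3))  # 0.072
```

A small p-value indicates that the correlation inside the subgroup genuinely differs from the correlation outside it.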
Regression Model
Compare slope b of the model y = a + b·x + e fitted on the subgroup
with slope b′ of the model y = a′ + b′·x + e′ fitted on its complement
Compute significance of slope difference φssd
Example subgroup: drive = 1 ∧ basement = 0 ∧ #baths ≤ 1
y = 41 568 + 3.31·x
y = 30 723 + 8.45·x
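A minimal sketch of the slope comparison, on hypothetical data; the full φssd would additionally compute standard errors and a t-statistic for the difference, which is omitted here:

```python
def slope(xs, ys):
    """Least-squares slope b in y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical data: the target grows faster with x inside the subgroup.
sub_x, sub_y = [1, 2, 3, 4], [10, 18, 26, 34]    # slope 8
rest_x, rest_y = [1, 2, 3, 4], [10, 13, 16, 19]  # slope 3
print(slope(sub_x, sub_y) - slope(rest_x, rest_y))  # 5.0
```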
Gene Expression Data
11_band = ‘no deletion’ ∧ survival time ≤ 1919
XP_498569.1 ≤ 57
y = 3313 - 1.77·x
y = 360 + 0.40·x
Classification Model
Decision Table Majority classifier
BDeu measure (predictiveness)
whole database
RIF1 160.45
Hellinger (unusual distribution)
prognosis = ‘unknown’
General Framework
[Diagram: object description and target concept (X and y); modeling over the target, Subgroup Discovery over the description.]
General Framework
object descriptions: propositional ●, multi-relational ●, …
description-side search: Subgroup Discovery ●, Decision Trees, SVM, …
models over the target concept (X and y): Regression ●, Classification ●, Clustering, Association, Graphical modeling ●, …