Subgroup - LIACS Data Mining Group

Subgroup Discovery
Finding Local Patterns in Data
Exploratory Data Analysis
 Scan the data without much prior focus
 Find unusual parts of the data
 Analyse attribute dependencies
 Interpret this as a ‘rule’:
if X=x and Y=y then Z is unusual
(the condition ‘X=x and Y=y’ is the subgroup)
 Complex data: nominal, numeric, relational?
Exploratory Data Analysis
 Classification: model the dependency of the
target on the remaining attributes.
 problem: sometimes classifier is a black-box, or uses
only some of the available dependencies.
 for example: in decision trees, some attributes may not
appear because of overshadowing.
 Exploratory Data Analysis: understanding the
effects of all attributes on the target.
Interactions between Attributes
 Single-attribute effects are not enough
 XOR problem is extreme example: 2 attributes
with no info gain form a good subgroup
 Apart from
A=a, B=b, C=c, …
 consider also
A=aB=b, A=aC=c, …, B=bC=c, …
A=aB=bC=c, …
…
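A minimal sketch of the XOR case, on made-up data. It scores subgroups with WRAcc (p(ST) − p(S)p(T)) rather than information gain, purely because it is short to write; the attribute names A and B and the 100 toy records are assumptions for illustration.

```python
from itertools import product

# Toy XOR data (hypothetical): the target is positive exactly when A and B differ.
rows = [(a, b, a != b) for a, b in product([0, 1], repeat=2)] * 25

def wracc(rows, condition):
    """WRAcc(S,T) = p(ST) - p(S)p(T) for the subgroup selected by `condition`."""
    n = len(rows)
    p_s = sum(1 for r in rows if condition(r)) / n
    p_t = sum(1 for r in rows if r[2]) / n
    p_st = sum(1 for r in rows if condition(r) and r[2]) / n
    return p_st - p_s * p_t

single = wracc(rows, lambda r: r[0] == 1)               # A=1 on its own
pair = wracc(rows, lambda r: r[0] == 1 and r[1] == 0)   # A=1 ∧ B=0
print(single, pair)
```

On this data the single condition scores 0 while the conjunction scores well above 0, which is why conjunctions have to be considered explicitly.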
Subgroup Discovery Task
“Find all subgroups within the inductive constraints
that show a significant deviation in the distribution
of the target attributes”
 Inductive constraints:
 Minimum support
 (Maximum support)
 Minimum quality (Information gain, χ², WRAcc)
 Maximum complexity
 …
Subgroup Discovery: the
Binary Target Case
Confusion Matrix
 A confusion matrix (or contingency table) describes
the frequency of the four combinations of subgroup
and target:
 within subgroup, positive
 within subgroup, negative
 outside subgroup, positive
 outside subgroup, negative
                 target
                 T     F
subgroup   T    .42   .13   .55
           F    .12   .33
                .54         1.0
Confusion Matrix
 High numbers along the TT-FF diagonal mean a
positive correlation between subgroup and target
 High numbers along the TF-FT diagonal mean a
negative correlation between subgroup and target
 Target distribution on DB is fixed
 Only two degrees of freedom
                 target
                 T     F
subgroup   T    .42   .13   .55
           F    .12   .33   .45
                .54   .46   1.0
Quality Measures
A quality measure for subgroups summarizes the interestingness of
its confusion matrix into a single number
WRAcc, weighted relative accuracy
 Balance between coverage and unexpectedness
 WRAcc(S,T) = p(ST) – p(S)p(T)
 between −.25 and .25, 0 means uninteresting
                 target
                 T     F
subgroup   T    .42   .13   .55
           F    .12   .33
                .54         1.0
WRAcc(S,T) = p(ST)−p(S)p(T)
= .42 − .297 = .123
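The same computation as a small sketch. The function name and the nested-list layout of the matrix are assumptions made for illustration; the numbers are the joint frequencies from the slide.

```python
# Joint frequencies from the slide: first index = row, second index = column.
matrix = [[0.42, 0.13],
          [0.12, 0.33]]

def wracc(matrix):
    """WRAcc = p(ST) - p(S)p(T), with both marginals read off the matrix."""
    p_both = matrix[0][0]                   # in subgroup and positive
    row_total = matrix[0][0] + matrix[0][1]
    col_total = matrix[0][0] + matrix[1][0]
    return p_both - row_total * col_total

quality = wracc(matrix)
print(round(quality, 3))

# A subgroup that coincides with a perfectly balanced target reaches the upper bound.
perfect = wracc([[0.5, 0.0], [0.0, 0.5]])
```

Because the two marginals are multiplied, the result does not depend on which axis is taken to be the subgroup and which the target.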
Quality Measures
 WRAcc: Weighted Relative Accuracy
 Information gain
 χ²
 Correlation Coefficient
 Laplace
 Jaccard
 Specificity
…
Subgroup Discovery as Search
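[Figure: search lattice of candidate subgroups, refined level by level]
true
  A=a1, A=a2, B=b1, B=b2, C=c1, …
    A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, …
      A=a1 ∧ B=b1 ∧ C=c1, …
Each candidate is evaluated on its confusion matrix (e.g. cells .42, .13, .12, .33 with marginals .54 and .55).
A branch is not refined further once the minimum support level is reached.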
Refinements are (anti-)monotonic
Refinements are (anti-)monotonic in their support…
…but not in interestingness. This may go up or down.
[Figure: the entire database containing the target concept and nested subgroups: subgroup S1, S2 a refinement of S1, S3 a refinement of S2]
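A minimal sketch of how a level-wise search can exploit this: a candidate below the minimum support is dropped together with all its refinements, while quality is simply recorded for every surviving candidate. The function names, the dictionary-per-record layout and the use of WRAcc are assumptions for illustration, not a description of any particular tool.

```python
def cover(rows, description):
    """Records that satisfy every attribute=value condition in the description."""
    return [r for r in rows if all(r[a] == v for a, v in description)]

def search(rows, target, attributes, min_support, max_depth):
    n = len(rows)
    p_t = sum(r[target] for r in rows) / n
    results = []
    frontier = [()]  # the empty description 'true' covers the entire database
    for _ in range(max_depth):
        next_frontier = []
        for desc in frontier:
            # Only add attributes that come later in the list, so each
            # conjunction is generated once.
            start = attributes.index(desc[-1][0]) + 1 if desc else 0
            for attr in attributes[start:]:
                for value in sorted({r[attr] for r in rows}):
                    refined = desc + ((attr, value),)
                    covered = cover(rows, refined)
                    # Support can only shrink under refinement, so a candidate
                    # below the threshold is pruned with all its refinements.
                    if len(covered) < min_support:
                        continue
                    p_s = len(covered) / n
                    p_st = sum(r[target] for r in covered) / n
                    results.append((p_st - p_s * p_t, refined))
                    next_frontier.append(refined)
        frontier = next_frontier
    return sorted(results, key=lambda x: x[0], reverse=True)

# Hypothetical XOR-style data: T is 1 exactly when A and B differ.
data = [{"A": a, "B": b, "T": int(a != b)} for a in (0, 1) for b in (0, 1)] * 25
found = search(data, "T", ["A", "B"], min_support=10, max_depth=2)
pruned = search(data, "T", ["A", "B"], min_support=30, max_depth=2)
```

Note that quality is not used for pruning here: a low-scoring candidate may still have a high-scoring refinement.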
Subgroup Discovery and
ROC space
ROC Space
ROC = Receiver Operating Characteristics
Each subgroup forms a
point in ROC space, in
terms of its False Positive
Rate, and True Positive
Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of positive cases in the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of negative cases in the subgroup)
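As a small sketch, the ROC coordinates of a subgroup follow directly from its four confusion-matrix cells. The counts below are hypothetical: they read the earlier example matrix as counts per 100 records and assume that its rows are the subgroup and its columns the target.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) of a subgroup from its confusion-matrix counts."""
    tpr = tp / (tp + fn)   # positives covered by the subgroup
    fpr = fp / (fp + tn)   # negatives covered by the subgroup
    return fpr, tpr

# Hypothetical counts per 100 records (assumed orientation, see above).
fpr, tpr = roc_point(tp=42, fp=13, fn=12, tn=33)
print(round(fpr, 3), round(tpr, 3))
```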
ROC Space Properties
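[Figure: ROC space, FPR on the horizontal axis and TPR on the vertical axis]
 (0,0): empty subgroup
 (1,1): entire database
 (0,1): ‘ROC heaven’, perfect subgroup
 (1,0): ‘ROC hell’, perfect negative subgroup
 diagonal: random subgroup
 minimum support threshold
source: Flach & Fürnkranz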
Measures in ROC Space
[Figure: isometrics of WRAcc and of Information Gain in ROC space, with regions of positive and negative quality separated by the 0 isometric]
Other Measures
Precision
Gini index
Correlation coefficient
Foil gain
Refinements in ROC Space
Refinements of S will reduce the FPR and TPR, so they
will appear to the left of and below S.
The blue polygon represents the possible refinements of S.
With a convex measure f, the quality of any refinement
is bounded by the value of f at the corners of the polygon.
If the corners are not above the minimum quality or the
current best (top k?), prune the search space below S.
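A minimal sketch of this pruning rule. It assumes the measure is WRAcc written in ROC coordinates, WRAcc = p(Pos)·p(Neg)·(TPR − FPR), which follows from p(ST) = TPR·p(Pos) and p(S) = TPR·p(Pos) + FPR·p(Neg). It also bounds over the full rectangle below and to the left of S rather than the polygon cut off by the minimum support line, so the bound is looser than the one on the slide. All function names are hypothetical.

```python
def wracc_roc(fpr, tpr, p_pos):
    """WRAcc in ROC coordinates: p(Pos) * p(Neg) * (TPR - FPR)."""
    return p_pos * (1 - p_pos) * (tpr - fpr)

def optimistic_estimate(fpr, tpr, measure):
    """Upper bound on a convex measure over all refinements of S at (fpr, tpr).

    Refinements lie in the rectangle [0, fpr] x [0, tpr]; a convex function
    attains its maximum over that rectangle at one of the four corners.
    """
    corners = [(0.0, 0.0), (fpr, 0.0), (0.0, tpr), (fpr, tpr)]
    return max(measure(f, t) for f, t in corners)

def can_prune(fpr, tpr, measure, threshold):
    """Prune below S when no corner is above the threshold."""
    return not optimistic_estimate(fpr, tpr, measure) > threshold

measure = lambda f, t: wracc_roc(f, t, p_pos=0.54)
estimate = optimistic_estimate(13 / 46, 42 / 54, measure)
```

For WRAcc the best corner is always (0, TPR): a refinement that keeps all covered positives and drops all covered negatives.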
Multi-class problems
 Generalising to problems with more than 2 classes
is fairly straightforward:

                 target
                 C1    C2    C3
subgroup   T    .27   .06   .22   .55
           F    .03   .19   .23   .45
                .3    .25   .45   1.0

 Combine the values into a quality measure:
 Information gain
 χ²
source: Nijssen & Kok
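A sketch of how the cells of such a table can be combined into a single number, using the joint frequencies from the slide. The function names are assumptions, and the χ² statistic needs a record count n that the slide does not give, so the value 100 below is purely hypothetical.

```python
from math import log2

# Joint frequencies from the slide: subgroup membership -> [C1, C2, C3].
joint = {True: [0.27, 0.06, 0.22], False: [0.03, 0.19, 0.23]}

def entropy(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

def information_gain(joint):
    """Class entropy minus the expected class entropy after the split."""
    class_totals = [sum(col) for col in zip(*joint.values())]
    gain = entropy(class_totals)
    for row in joint.values():
        p_row = sum(row)
        gain -= p_row * entropy([p / p_row for p in row])
    return gain

def chi_squared(joint, n):
    """Pearson chi-squared statistic for n records (n is not on the slide)."""
    class_totals = [sum(col) for col in zip(*joint.values())]
    stat = 0.0
    for row in joint.values():
        p_row = sum(row)
        for p, p_class in zip(row, class_totals):
            expected = p_row * p_class
            stat += (p - expected) ** 2 / expected
    return n * stat

gain = information_gain(joint)
chi2 = chi_squared(joint, n=100)  # hypothetical record count
```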
Subgroup Discovery for
Numeric targets
Numeric Subgroup Discovery
 Target is numeric: find subgroups with
significantly higher or lower average value
 Trade-off between size of subgroup and average
target value
[Figure: example with values h = 3600, h = 3100 and h = 2200]
Types of SD for Numeric Targets
 Regression subgroup discovery
 numeric target has order and scale
 Ordinal subgroup discovery
 numeric target has order
 Ranked subgroup discovery
 numeric target has order or scale
Vancouver 2010 Winter Olympics
Official IOC ranking of countries (medals > 0)
Rank: ordinal target. Medals: regression target.
Partial ranking: objects share a rank
Fractional ranks: shared ranks are averaged

Rank  Country        Medals  Athletes  Continent   Popul. (M)  Language Family  Repub.  Polar
1     USA            37      214       N. America  309         Germanic         y       y
2     Germany        30      152       Europe      82          Germanic         y       n
3     Canada         26      205       N. America  34          Germanic         n       y
4     Norway         23      100       Europe      4.8         Germanic         n       y
5     Austria        16      79        Europe      8.3         Germanic         y       n
6     Russ. Fed.     15      179       Asia        142         Slavic           y       y
7     Korea          14      46        Asia        73          Altaic           y       n
9     China          11      90        Asia        1338        Sino-Tibetan     y       n
9     Sweden         11      107       Europe      9.3         Germanic         n       y
9     France         11      107       Europe      65          Italic           y       n
11    Switzerland    9       144       Europe      7.8         Germanic         y       n
12    Netherlands    8       34        Europe      16.5        Germanic         n       n
13.5  Czech Rep.     6       92        Europe      10.5        Slavic           y       n
13.5  Poland         6       50        Europe      38          Slavic           y       n
16    Italy          5       110       Europe      60          Italic           y       n
16    Japan          5       94        Asia        127         Japonic          n       n
16    Finland        5       95        Europe      5.3         Finno-Ugric      y       y
20    Australia      3       40        Australia   22          Germanic         y       n
20    Belarus        3       49        Europe      9.6         Slavic           y       n
20    Slovakia       3       73        Europe      5.4         Slavic           y       n
20    Croatia        3       18        Europe      4.5         Slavic           y       n
20    Slovenia       3       49        Europe      2           Slavic           y       n
23    Latvia         2       58        Europe      2.2         Slavic           y       n
25    Great Britain  1       52        Europe      61          Germanic         n       n
25    Estonia        1       30        Europe      1.3         Finno-Ugric      y       n
25    Kazakhstan     1       38        Asia        16          Turkic           y       n
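A short sketch of how fractional ranks can be derived from the medal counts: countries with the same number of medals share the average of the positions they occupy. The function name is an assumption; the medal counts are those of the table.

```python
def fractional_ranks(scores):
    """Rank scores in descending order; tied scores share the average position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # average of 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks

medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6,
          5, 5, 5, 3, 3, 3, 3, 3, 2, 1, 1, 1]
ranks = fractional_ranks(medals)
```

For example, the three countries with 11 medals occupy positions 8 to 10 and therefore all receive rank 9.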
Interesting Subgroups
‘polar = yes’
1. United States
3. Canada
4. Norway
6. Russian Federation
9. Sweden
16. Finland
‘language_family = Germanic & athletes ≥ 60’
1. United States
2. Germany
3. Canada
4. Norway
5. Austria
9. Sweden
11. Switzerland
Intuitions
 Size: larger subgroups are more reliable
 Rank: majority of objects appear at the top
 Position: ‘middle’ of subgroup should differ from middle of ranking
 Deviation: objects should have similar rank
[Figure: ranking with the members of ‘language_family = Slavic’ marked by stars]
Intuitions
 Size: larger subgroups are more reliable
 Rank: majority of objects appear at the top
 Position: ‘middle’ of subgroup should differ from middle of ranking
 Deviation: objects should have similar rank
[Figure: ranking with the members of ‘polar = yes’ marked by stars]
Intuitions
 Size: larger subgroups are more reliable
 Rank: majority of objects appear at the top
 Position: ‘middle’ of subgroup should differ from middle of ranking
 Deviation: objects should have similar rank
[Figure: ranking with the members of ‘population ≤ 10M’ marked by stars]
Intuitions
 Size: larger subgroups are more reliable
 Rank: majority of objects appear at the top
 Position: ‘middle’ of subgroup should differ from middle of ranking
 Deviation: objects should have similar rank
[Figure: ranking with the members of ‘language_family = Slavic & population ≤ 10M’ marked by stars]
Quality Measures
 Average
 Mean test
 z-Score
 t-Statistic
 Median χ² statistic
 AUC of ROC
 Wilcoxon-Mann-Whitney Ranks statistic
 Median MAD Metric
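As an illustration of one of these measures, the sketch below computes a z-score for the subgroup ‘polar = yes’ on the medal counts of the Olympics table. The formula used, √n·(μ_subgroup − μ_all)/σ_all with the population standard deviation, is one common formulation and is an assumption here; the slides do not spell out the definition.

```python
from math import sqrt
from statistics import mean, pstdev

medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6,
          5, 5, 5, 3, 3, 3, 3, 3, 2, 1, 1, 1]
polar = ["y", "n", "y", "y", "n", "y", "n", "n", "y", "n", "n", "n",
         "n", "n", "n", "n", "y"] + ["n"] * 9

def z_score(subgroup, population):
    """sqrt(n) * (subgroup mean - overall mean) / overall standard deviation."""
    n = len(subgroup)
    return sqrt(n) * (mean(subgroup) - mean(population)) / pstdev(population)

subgroup = [m for m, p in zip(medals, polar) if p == "y"]
z = z_score(subgroup, medals)
```

The √n factor rewards larger subgroups, which matches the size intuition: the same deviation in the average counts for more when more objects support it.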
Meet Cortana
the open source Subgroup Discovery tool
Cortana Features
 Generic Subgroup Discovery algorithm
 quality measure
 search strategy
 inductive constraints
 Flat file, .txt, .arff, (DB connection to come)
 Support for complex targets
 41 quality measures
 ROC plots
 Statistical validation
Target Concepts
 ‘Classical’ Subgroup Discovery
 nominal targets (classification)
 numeric targets (regression)
 Exceptional Model Mining
 multiple targets
 regression, correlation
 multi-label classification
(to be discussed in a few slides)
Mixed Data
 Data types
 binary
 nominal
 numeric
 Numeric data is treated dynamically (no
discretisation as preprocessing)
 all: consider all available thresholds
 bins: discretise the current candidate subgroup
 best: find best threshold, and search from there
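A rough sketch of what the three options could look like for a single numeric attribute. The function names, the equal-frequency binning, the ‘≥’ operator and the use of WRAcc are all assumptions made for illustration; this is not Cortana's actual implementation, and the ‘best’ variant below only picks the threshold without continuing the search from it.

```python
def all_thresholds(values):
    """'all': every distinct value occurring in the current candidate subgroup."""
    return sorted(set(values))

def bin_thresholds(values, n_bins):
    """'bins': equal-frequency cut points computed on the current candidate."""
    ordered = sorted(values)
    cuts = {ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)}
    return sorted(cuts)

def wracc(members, targets):
    n = len(targets)
    p_s = sum(members) / n
    p_t = sum(targets) / n
    p_st = sum(m and t for m, t in zip(members, targets)) / n
    return p_st - p_s * p_t

def best_threshold(values, targets):
    """'best': the single threshold t whose condition 'value >= t' scores highest."""
    scored = [(wracc([v >= t for v in values], targets), t)
              for t in all_thresholds(values)]
    return max(scored)

# Hypothetical attribute values and binary target.
values = [1, 2, 3, 4, 5, 6, 7, 8]
targets = [False] * 4 + [True] * 4
```

Because the thresholds are computed on the records of the current candidate, they adapt to each subgroup instead of being fixed once during preprocessing.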
Statistical Validation
 Determine distribution of random results
 random subsets
 random conditions
 swap-randomization
 Determine minimum quality
 Significance of individual results
 Validate quality measures
 how exceptional?
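A minimal sketch of the ‘random subsets’ idea: draw random subsets of a given size, score each one, and use a high quantile of the resulting distribution as the minimum quality. The function names, the use of WRAcc, the 5% level and the toy target are assumptions for illustration.

```python
import random

def wracc(members, targets):
    n = len(targets)
    p_s = sum(members) / n
    p_t = sum(targets) / n
    p_st = sum(m and t for m, t in zip(members, targets)) / n
    return p_st - p_s * p_t

def random_subset_qualities(targets, subgroup_size, trials, seed=0):
    """Qualities of randomly drawn subsets, i.e. what chance alone produces."""
    rng = random.Random(seed)
    n = len(targets)
    qualities = []
    for _ in range(trials):
        chosen = set(rng.sample(range(n), subgroup_size))
        members = [i in chosen for i in range(n)]
        qualities.append(wracc(members, targets))
    return qualities

def minimum_quality(qualities, alpha=0.05):
    """Quality exceeded by roughly a fraction alpha of the random results."""
    ordered = sorted(qualities)
    index = min(len(ordered) - 1, int((1 - alpha) * len(ordered)))
    return ordered[index]

# Hypothetical binary target with 50 positives among 100 records.
targets = [True] * 50 + [False] * 50
qualities = random_subset_qualities(targets, subgroup_size=20, trials=200)
threshold = minimum_quality(qualities)
```

A subgroup found by the search would then count as significant only if its quality exceeds `threshold`.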
Open Source
 You can
 Use Cortana binary
datamining.liacs.nl/cortana.html
 Use and modify Cortana sources (Java)
Exceptional Model Mining
Subgroup Discovery with multiple target attributes
Mixture of Distributions
[Figure: scatter plots of the data, both axes running from 0 to 100]
 For each datapoint it is unclear whether it belongs to G or to its complement Ḡ
 Intensional description of exceptional subgroup G?
 Model class unknown
 Model parameters unknown
Solution: extend Subgroup Discovery
 Use other information than X and Y: object descriptions D
 Use Subgroup Discovery to scan sub-populations in
terms of D
Subgroup Discovery: find subgroups of the database
where the target attribute shows an unusual distribution.
Solution: extend Subgroup Discovery
 Use other information than X and Y: object descriptions D
 Use Subgroup Discovery to scan sub-populations in
terms of D
 Model over subgroup becomes target of SD
Exceptional Model Mining
Subgroup Discovery: find subgroups of the database
where the target attributes show an unusual distribution,
by means of modeling over the target attributes.
Exceptional Model Mining
[Figure: data table split into object description and target concept (X, y); modeling is applied to the target concept, Subgroup Discovery to the object description]
 Define a target concept (X and y)
 Choose a model class C
 Define a quality measure φ over C
 Use Subgroup Discovery to find exceptional subgroups G and
associated model M
Quality Measure
 Specify what defines an exceptional subgroup G based on
properties of model M
 Absolute measure (absolute quality of M)
 Correlation coefficient
 Predictive accuracy
 Difference measure (difference between M and M̄, the model on the complement)
 Difference in slope
 qualitative properties of classifier
 Reliable results
 Minimum support level
 Statistical significance of G
Correlation Model
 Correlation coefficient
φρ = ρ(G)
 Absolute difference in correlation
φabs = |ρ(G) − ρ(Ḡ)|, where Ḡ is the complement of G
 Entropy weighted absolute difference
φent = H(p)·|ρ(G) − ρ(Ḡ)|
 Statistical significance of correlation difference φscd
 compute z-score from ρ through Fisher transform
z′ = ½·ln((1 + ρ)/(1 − ρ))
z* = (z′ − z̄′) / √(1/(n − 3) + 1/(n̄ − 3))
(z′ and n refer to G, z̄′ and n̄ to Ḡ)
 compute p-value from z-score
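A sketch of the significance computation: the Fisher transform of each correlation, the z-score of their difference, and a p-value from the standard normal distribution. The function names are assumptions, as is the choice of a two-sided test; the slide does not say how the p-value is turned into the final quality φscd.

```python
from math import atanh, erf, sqrt

def z_star(rho_g, n_g, rho_c, n_c):
    """z-score of the difference between two correlation coefficients.

    atanh(rho) equals the Fisher transform 0.5 * ln((1 + rho) / (1 - rho)).
    rho_g, n_g describe the subgroup; rho_c, n_c describe its complement.
    """
    z_g = atanh(rho_g)
    z_c = atanh(rho_c)
    return (z_g - z_c) / sqrt(1 / (n_g - 3) + 1 / (n_c - 3))

def two_sided_p(z):
    """Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))."""
    return 1 - erf(abs(z) / sqrt(2))

# Hypothetical correlations and sizes for a subgroup and its complement.
z = z_star(rho_g=0.8, n_g=53, rho_c=0.1, n_c=103)
p = two_sided_p(z)
```

Both groups need more than three records, otherwise the variance term 1/(n − 3) is undefined.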
Regression Model
 Compare slope b of
yi = a + b·xi + e (fitted on G), and
yi = ā + b̄·xi + e (fitted on the complement Ḡ)
 Compute significance of slope difference φssd
drive = 1 ∧ basement = 0 ∧ #baths ≤ 1
y = 41 568 + 3.31·x
y = 30 723 + 8.45·x
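A minimal sketch of the slope comparison: fit a least-squares line on the subgroup and on its complement and compare the two slopes. It only computes the raw difference in slope, not its statistical significance, and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def slope_difference(xs, ys, in_subgroup):
    """Absolute difference between the slope on G and on its complement."""
    g = [(x, y) for x, y, m in zip(xs, ys, in_subgroup) if m]
    c = [(x, y) for x, y, m in zip(xs, ys, in_subgroup) if not m]
    _, b_g = fit_line(*zip(*g))
    _, b_c = fit_line(*zip(*c))
    return abs(b_g - b_c)

# Hypothetical data: the subgroup follows y = 1 + 2x, the rest y = 3 + 0.5x.
xs = [1, 2, 3, 4, 1, 2, 3, 4]
in_subgroup = [True] * 4 + [False] * 4
ys = [1 + 2 * x if m else 3 + 0.5 * x for x, m in zip(xs, in_subgroup)]
difference = slope_difference(xs, ys, in_subgroup)
```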
Gene Expression Data
11_band = ‘no deletion’ ∧ survival time ≤ 1919
∧ XP_498569.1 ≤ 57
y = 3313 - 1.77·x
y = 360 + 0.40·x
Classification Model
 Decision Table Majority classifier
 BDeu measure (predictiveness)
whole database
RIF1  160.45
 Hellinger (unusual distribution)
prognosis = ‘unknown’
General Framework
[Figure: data table split into object description and target concept (X, y); Subgroup Discovery works on the object description, modeling on the target concept]
object description: propositional ●, multi-relational ●, …
method on the object description: Subgroup Discovery ●, Decision Trees, SVM, …
modeling of the target concept: Regression ●, Classification ●, Clustering, Association, Graphical modeling ●, …