ECML / PKDD 2004 Discovery Challenge
Mining Strong Associations and
Exceptions in the STULONG Data Set
Eduardo Corrêa Gonçalves and Alexandre Plastino*
Universidade Federal Fluminense
Department of Computer Science
Niterói, Rio de Janeiro, Brazil
{egoncalves,plastino}@ic.uff.br - http://www.ic.uff.br
*work sponsored by CNPq research grant 300879/00-8
Outline of the talk
1. Atherosclerosis Data Set
2. Multidimensional Association Rules
3. Exceptions
4. Data Preparation
5. Results
6. Summary
Atherosclerosis Data Set
STULONG Data Set: risk factors of atherosclerosis
in a population of 1,417 middle-aged men from the Czech
Republic.
Four tables are included in this data set:
Entry: data related to entry examinations
performed on these men (the first step of the
STULONG project).
Control: data related to long-term observations.
Letter: additional information about the health
status of 403 men.
Death: data related to the patients who died.
Basic Groups of Patients
The patients were classified into three basic groups,
according to the results of the entry examinations:
A. Normal Group: men without the presence of any risk factor.
B. Risk Group: men with the presence of one or more risk factors.
C. Pathologic Group: men with either an identified cardiovascular disease or another serious disease.
Contribution
The main contribution of this work is to present
strong association rules and exceptions mined
from the Entry table.
The mining process was directed at discovering
relations among the following characteristics of the
patients in the basic groups:
Social factors.
Physical activities during free time.
Alcohol consumption.
Smoking.
Results of the biochemical examinations
and the physical check-up.
Multidimensional Association Rules
Multidimensional Association Rules (J. Han and M.
Kamber, 2001) represent combinations of attribute
values that often occur together in a database.
They can be mined from relational databases or data
warehouses.
Example:
(DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
meaning: “men who are heavy beer consumers also tend to
be heavy smokers”.
This rule involves two attributes (or dimensions):
DailyBeerCons and Smoking.
Multidimensional Association Rules
Formal Definition
A1 = a1, ..., An = an → B1 = b1, ..., Bm = bm
Ai (1 ≤ i ≤ n) and Bj (1 ≤ j ≤ m): distinct attributes
(dimensions) from a database relation.
ai and bj: values from the domains of Ai and Bj,
respectively.
Generic representation: A → B
A is the antecedent and B is the consequent of the
rule. Several attributes can be involved in both the
antecedent and the consequent.
Interest Measures:
Support and Confidence
Support index (Sup): the probability that a tuple
matches all conditions in A → B.
Confidence index (Conf): the probability that a tuple
matches B, given that it matches A.
Sup(A → B) = P(A,B) and Conf(A → B) = P(B|A).
The support indicates the relevance and the confidence
indicates the validity of an association rule.
Support / Confidence Framework (Agrawal et al.,
1993): finding all rules that satisfy user-provided
minimum support and minimum confidence thresholds.
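The two definitions above can be illustrated with a minimal Python sketch. The toy relation `rows` and the helper names `matches`, `support` and `confidence` are hypothetical and are not taken from the STULONG data or from the authors' programs; the sketch also assumes the antecedent matches at least one tuple.

```python
# Toy relation: each tuple maps attribute -> value (hypothetical data,
# not taken from the STULONG Entry table).
rows = [
    {"DailyBeerCons": ">1l", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": ">1l", "Smoking": "no"},
    {"DailyBeerCons": "no", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": "no", "Smoking": "no"},
    {"DailyBeerCons": ">1l", "Smoking": ">20 cig/day"},
]


def matches(row, conditions):
    """True if the tuple satisfies every attribute = value condition."""
    return all(row.get(attr) == value for attr, value in conditions.items())


def support(rows, conditions):
    """Fraction of tuples that match all conditions."""
    return sum(matches(r, conditions) for r in rows) / len(rows)


def confidence(rows, antecedent, consequent):
    """P(B | A), computed as Sup(A -> B) / Sup(A)."""
    both = {**antecedent, **consequent}
    return support(rows, both) / support(rows, antecedent)


A = {"DailyBeerCons": ">1l"}
B = {"Smoking": ">20 cig/day"}
sup_rule = support(rows, {**A, **B})   # 2 of the 5 tuples match A and B
conf_rule = confidence(rows, A, B)     # 2 of the 3 tuples matching A
```

In the Support / Confidence framework, a rule would be kept only if both values reach the user-provided thresholds.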
Interest Measures:
Support and Confidence
Problems with the Support / Confidence
Framework (Brin et al., 1997):
Generation of a huge number of rules:
most of these rules are often obvious.
In many cases, these rules express
relations that are not true.
Interest Measures:
Support and Confidence
Id  Association Rule                                     SupA    SupB    Sup     Conf
R1  (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)  0.1193  0.2602  0.0448  0.3758
R2  (DailyBeerCons = “>1l”) → (Married = “yes”)          0.1193  0.8487  0.0905  0.7584
The support and confidence values of R2 are higher than
those of R1.
Is R2, in fact, more interesting than R1?
Negative Dependence
Id  Association Rule                              SupA    SupB    Sup     Conf
R2  (DailyBeerCons = “>1l”) → (Married = “yes”)   0.1193  0.8487  0.0905  0.7584
R2 should imply that men who are heavy beer
consumers tend to be married.
84.87% of men are married. However, the probability for
a man to be married, given that he is a heavy beer
consumer is 75.84%.
Heavy beer consumers are, in fact, less likely to be
married. There is a negative dependence between
being married and being a heavy beer consumer.
Positive Dependence
Id  Association Rule                                     SupA    SupB    Sup     Conf
R1  (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)  0.1193  0.2602  0.0448  0.3758
26.02% of men are heavy smokers. The probability for a
man to be a heavy smoker, given that he is a heavy beer
consumer is 37.58%.
Heavy beer consumers are more likely to smoke a lot.
There is a positive dependence between being a heavy
beer consumer and being a heavy smoker.
Strong Association Rule
Id  Association Rule                                     SupA    SupB    Sup     Conf
R1  (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)  0.1193  0.2602  0.0448  0.3758
R2  (DailyBeerCons = “>1l”) → (Married = “yes”)          0.1193  0.8487  0.0905  0.7584
Conclusions:
R1 is a strong association rule, while R2 is misleading:
its antecedent and consequent are negatively dependent.
In order to mine interesting information, we need to
evaluate the type of dependence between the
antecedent and the consequent of a rule.
Lift and RI
Lift: how much more frequent B is when A occurs.
Lift(A → B) = Conf(A → B) ÷ Sup(B)
RI - Rule Interest (G. Piatetsky-Shapiro, 1991):
computes the fraction of tuples matched by an
association rule beyond what is expected when A and B
are independent.
RI(A → B) = Sup(A → B) - Sup(A) × Sup(B)
We believe that using different interest measures
(Sup, Conf, Lift and RI) provides alternative analyses of
the same data, giving a better understanding of the
associations.
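As a quick check of the two measures, the following Python sketch applies them to the SupA, SupB, Sup and Conf values reported for R1 and R2. The function names `lift` and `rule_interest` and the dictionary layout are illustrative choices, not part of the authors' programs.

```python
def lift(sup_b, conf):
    """Lift(A -> B) = Conf(A -> B) / Sup(B)."""
    return conf / sup_b


def rule_interest(sup_a, sup_b, sup_ab):
    """RI(A -> B) = Sup(A -> B) - Sup(A) * Sup(B)."""
    return sup_ab - sup_a * sup_b


# Values copied from the R1 / R2 table in the slides.
rules = {
    "R1": {"sup_a": 0.1193, "sup_b": 0.2602, "sup": 0.0448, "conf": 0.3758},
    "R2": {"sup_a": 0.1193, "sup_b": 0.8487, "sup": 0.0905, "conf": 0.7584},
}

results = {
    name: (
        lift(r["sup_b"], r["conf"]),
        rule_interest(r["sup_a"], r["sup_b"], r["sup"]),
    )
    for name, r in rules.items()
}

for name, (rule_lift, ri) in results.items():
    print(f"{name}: lift = {rule_lift:.3f}, RI = {ri:+.4f}")
```

R1 has a lift above 1 and a positive RI (positive dependence), while R2 has a lift below 1 and a negative RI (negative dependence), even though R2 has the higher support and confidence.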
Exceptions
In our approach, exceptions represent association rules
that become much weaker in some specific subsets of
the database.
Example: Does the rule (DailyBeerCons = “>1l”) →
(Smoking = “>20 cig/day”) become weaker on any
subset of the database?
Mined exception:
(DailyBeerCons = “>1l”) & (Age = “≥ 50”) →
(Smoking = “>20 cig/day”)
meaning: “among the men who are 50 years old or
above, the support value of the association between
being a heavy beer consumer and being a heavy smoker
is surprisingly smaller than what is expected”.
Exceptions
(DailyBeerCons = “>1l”) & (Age = “≥ 50”) →
(Smoking = “>20 cig/day”)

This exception was obtained because the conventional
rule (DailyBeerCons = “>1l”) & (Age = “≥ 50”) →
(Smoking = “>20 cig/day”) did not achieve its
expected support.

This expected support is evaluated from the
support of the original rule (DailyBeerCons = “>1l”) →
(Smoking = “>20 cig/day”) and the support of the
condition (Age = “≥ 50”).
Exceptions: Formal Definition

Let D be a database relation.

Let R: A → B be a multidimensional association rule.

Let Z = {Z1 = z1, ..., Zk = zk} be a set of conditions
defined over D, where Z ∩ (A ∪ B) = ∅. Z is called the
probe set.

An exception related to the positive rule R is an
implication of the form:
A ∧ Z → B
Candidate Exceptions
Exceptions are extracted from candidate exceptions.
A candidate exception is an expression of the form:
A ∧ Z → B
Exceptions are mined only if the candidates do not
achieve an expected support.
This expectation is evaluated based on the support of
the original rule A → B and the support of the
conditions that compose the probe set Z:
ExpSup(A ∧ Z → B) = Sup(A → B) × Sup(Z)
The Interest Measure (IM) Index
We developed two interest measures to evaluate the
degree of interestingness of an exception.
The IM (Interest Measure) index evaluates the
strength (relevance) of an exception.
IM(E) = 1 - (Sup(A ∧ Z → B) ÷ ExpSup(A ∧ Z → B))
An exception E is potentially interesting if the actual
support of A ∧ Z → B is much lower than its
expected support.
This measure captures the type of dependence between
Z and A → B. The closer the value is to 1, the stronger
the negative dependence.
Example of the IM Index
R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”), Sup(R) = 4.48%
Z = {(Age = “≥ 50”)}, Sup(Z) = 22.82%
The expected support for A ∧ Z → B can be computed
as 4.48% × 22.82% = 1.02%.
The actual support of A ∧ Z → B is 0.48%.
The exception E1: A ∧ Z → B is potentially interesting
because IM(E1) = 1 - (0.48 ÷ 1.02) = 0.53.
The actual support value of E1 is 53% lower than what is
expected.
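The arithmetic of this example can be reproduced with a short Python sketch. The function names `expected_support` and `interest_measure` are hypothetical; the numeric inputs are the Sup(R), Sup(Z) and actual support values quoted on the slide.

```python
def expected_support(sup_rule, sup_probe):
    """ExpSup of the candidate = Sup(A -> B) * Sup(Z)."""
    return sup_rule * sup_probe


def interest_measure(actual_sup, exp_sup):
    """IM(E) = 1 - actual support / expected support."""
    return 1 - actual_sup / exp_sup


sup_r = 0.0448        # Sup(R) for DailyBeerCons -> Smoking
sup_z = 0.2282        # Sup(Z) for the Age probe set
actual_sup = 0.0048   # actual support of the candidate exception

exp_sup = expected_support(sup_r, sup_z)
im_e1 = interest_measure(actual_sup, exp_sup)

print(f"ExpSup = {exp_sup:.4f}, IM(E1) = {im_e1:.2f}")
```

Rounded to two decimals, the IM value agrees with the 0.53 reported above.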
Degree of Unexpectedness
A high value for the IM measure does not guarantee
that we have found interesting information.
R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
Sup(R) = 4.48%
Z = {(Alcohol = “no”)}, Sup(Z) = 9.47%

The expected support for A ∧ Z → B can be computed as
4.48% × 9.47% = 0.42%.

The actual support for this candidate rule is 0.00%.

IM(A ∧ Z → B) = 1 - (0.00 ÷ 0.42) = 1.00.

However, this exception represents information that
is obvious. The IM index could not detect the strong
negative dependence between A and Z.
Degree of Unexpectedness
The DU (Degree of Unexpectedness) Index is used
to determine the validity of an exception.
This measure captures how much stronger the negative
dependence between a probe set Z and a rule A → B is
than the negative dependence between Z and
either A or B.
DU(E) = IM(E) - max(1 - Sup(A ∧ Z) ÷ ExpSup(A ∧ Z),
1 - Sup(B ∧ Z) ÷ ExpSup(B ∧ Z))
Here ExpSup(A ∧ Z) = Sup(A) × Sup(Z) and
ExpSup(B ∧ Z) = Sup(B) × Sup(Z).
The further the value is above 0, the more interesting
the exception will be. If DU(E) ≤ 0, the exception is
uninteresting.
Example of the DU Index
R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)
Sup(R) = 4.48%, Sup(A) = 11.93%, Sup(B) = 26.02%
Z = {(Age = “≥ 50”)}
Sup(Z) = 22.82%, Sup(A ∧ Z) = 2.00%, Sup(B ∧ Z) = 6.00%
1) Compute the negative dependence between A and Z:
1 - (2.00% ÷ (11.93% × 22.82%)) = 0.27
2) Compute the negative dependence between B and Z:
1 - (6.00% ÷ (26.02% × 22.82%)) = -0.01
The exception E1: A ∧ Z → B is, in fact, interesting because:
DU(E1) = 0.53 - max(0.27, -0.01) = 0.26
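The two steps and the final subtraction can be sketched in Python as follows. The helper names `negative_dependence` and `degree_of_unexpectedness` are hypothetical, and the IM value is taken as the rounded 0.53 from the previous example, so this is an illustration of the arithmetic rather than the authors' EXCEPMINE code.

```python
def negative_dependence(sup_joint, sup_x, sup_z):
    """1 - Sup(X and Z) / (Sup(X) * Sup(Z)); positive means X and Z
    occur together less often than expected."""
    return 1 - sup_joint / (sup_x * sup_z)


def degree_of_unexpectedness(im, dep_a_z, dep_b_z):
    """DU(E) = IM(E) - max(dependence of A with Z, dependence of B with Z)."""
    return im - max(dep_a_z, dep_b_z)


sup_a, sup_b, sup_z = 0.1193, 0.2602, 0.2282
sup_a_z, sup_b_z = 0.0200, 0.0600
im_e1 = 0.53  # rounded IM value from the IM example

dep_a_z = negative_dependence(sup_a_z, sup_a, sup_z)
dep_b_z = negative_dependence(sup_b_z, sup_b, sup_z)
du_e1 = degree_of_unexpectedness(im_e1, dep_a_z, dep_b_z)

print(f"dep(A,Z) = {dep_a_z:.2f}, dep(B,Z) = {dep_b_z:.2f}, DU = {du_e1:.2f}")
```

Because DU(E1) is well above 0, the weakening of the rule among older men is not explained by a negative dependence between age and either beer consumption or smoking alone.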
Data Preparation
The following relations in the ARFF format (Witten and
Frank, 2000) were generated from the original Entry
table:
ENTRYTOT: 1249 tuples
(men from groups A, B and C).
ENTRYA: 276 tuples (only men from group A).
ENTRYB: 859 tuples (only men from group B).
ENTRYC: 114 tuples (only men from group C).
Data Preparation
Data was enriched with new fields and the continuous
attributes were discretized.
Field                  Possible Values
Cholesterol            “desirable” (< 200), “bordering” (200 – 239), “high” (≥ 240)
Triglycerides          “desirable” (< 150), “bordering” (150 – 200), “high” (201 – 499), “very high” (≥ 500)
BMI (body mass index)  “underweight” (bmi < 20), “normal” (20 ≤ bmi < 25), “overweight” (25 ≤ bmi < 30), “obese” (30 ≤ bmi < 40), “morbidly obese” (bmi ≥ 40)
Blood Pressure         “normal”, “normal / high”, “high”
Skin Folds             “8-20”, “21-30”, “31-40”, “>40”
Age                    “38-39”, “40-44”, “45-49”, “≥ 50”
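A discretization step of this kind can be sketched in Python. The function names `discretize_bmi` and `discretize_cholesterol` are hypothetical, and the sketch assumes cholesterol is recorded as a whole number, so that the “bordering” range 200 – 239 can be written as 200 ≤ value < 240; the cut points themselves are the ones listed in the table.

```python
def discretize_bmi(bmi):
    """Map a body mass index to the categories listed in the table."""
    if bmi < 20:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    if bmi < 40:
        return "obese"
    return "morbidly obese"


def discretize_cholesterol(value):
    """Map a cholesterol measurement to the categories listed in the table.

    Assumption: whole-number measurements, so 200-239 is 200 <= value < 240.
    """
    if value < 200:
        return "desirable"
    if value < 240:
        return "bordering"
    return "high"


# Example: two hypothetical patients (values are made up for illustration).
patients = [{"bmi": 23.4, "chol": 187}, {"bmi": 31.0, "chol": 252}]
labels = [
    (discretize_bmi(p["bmi"]), discretize_cholesterol(p["chol"]))
    for p in patients
]
```

Categorical labels like these are what allow continuous measurements to appear as attribute = value conditions in association rules.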
Results
We developed two programs in C++ (g++ compiler):
MULTMINE: used to mine strong multidimensional
association rules.
EXCEPMINE: used to mine exceptions.
We used the following thresholds in the experiments:
Minimum support = 1% (MULTMINE).
Minimum IM = 0.30 and minimum DU = 0.05
(EXCEPMINE).
Group A - ENTRYTOT
(Group = “A”) → (Education = “university”)

SupA    SupB    Sup     Conf    Lift   RI
0.2210  0.2762  0.0873  0.3949  1.430  0.0262
Group A is the only group in which men with a university
degree form the largest education category (Conf = 0.3949).
(Group = “A”) → (PhysActAfterJob = “great activity”)

SupA    SupB    Sup     Conf    Lift   RI
0.2210  0.0857  0.0320  0.1449  1.692  0.0131
There is a strong positive dependence between
belonging to Group A and practicing physical activities
intensely in free time (Lift = 1.692).
Alcohol Consumption x Smoking
(DailyBeerCons = “>1l”) → (SmokingDuration = “>20 years”)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.0688  0.1667  0.0145  0.2105  1.263  0.0030
B      0.1362  0.5751  0.0908  0.6667  1.159  0.0125
C      0.1140  0.4737  0.0789  0.6923  1.461  0.0249

Drinking a lot and smoking for more than 20 years are
positively dependent in groups A, B, and C (Lift and RI
columns).

However, there are far fewer long-term smokers in Group A (SupB
column). In groups B and C, most of the heavy
beer consumers have smoked cigarettes for more than 20 years
(Conf column).

Men from group B tend to smoke and drink more (SupA, SupB
and Sup columns).
Alcohol Consumption x Cholesterol
(Alcohol = “no”) → (Cholesterol = “desirable”)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.0870  0.3370  0.0507  0.5833  1.731  0.0214
B      0.0861  0.1828  0.0186  0.2162  1.183  0.0029
C      0.1316  0.1316  0.0263  0.2000  1.520  0.0090

Not drinking alcohol and having cholesterol in the
desirable range are positively dependent in groups A, B, and C
(Lift and RI columns).

There are fewer alcohol consumers in Group C (SupA column).

In group A, most of the men who do not drink
alcohol have cholesterol in the desirable range (Conf
column).
Education x Smoking
(Education = “university”) → (Smoking = “no”)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.3949  0.5109  0.2210  0.5596  1.095  0.0193
B      0.2526  0.1793  0.0664  0.2627  1.465  0.0211
C      0.1667  0.2018  0.0877  0.5263  2.608  0.0541

Men with the highest education level are less likely to be
smokers (Lift and RI columns).

In groups A and C, the majority of men with a university degree
do not smoke (Conf column). The support of this rule is very
high in group A (Sup = 0.2210).

In group B, most men with a university degree are smokers
(Conf = 0.2627). However, not smoking and having a university
degree are still strongly positively dependent (Lift and RI columns).
Skin Folds x Body Mass Index
(Skin Folds = “≤ 20”) → (BMI = “normal”)

Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.2319  0.5326  0.1558  0.6719  1.261  0.0323
B      0.2154  0.3586  0.1478  0.6865  1.914  0.0706
C      0.1140  0.2632  0.0789  0.6923  2.631  0.0489

Most of the men classified into the lowest range of the
attribute Skin Folds have a body mass index in the
normal range (Conf column).

Both attributes are strongly positively dependent (Lift and
RI columns).

There are far fewer men with a normal BMI in
Group C (SupB column).
Exceptions
(Education = “apprentice school”) &
(PhysActAfterJob = “great activity”) → (Smoking = “15-20 cig/day”)
IM = 0.4755, DU = 0.2069

Original rule: “men whose education level is
apprentice school tend to smoke a lot”.

Exception: Among the men who practice physical
activities intensely in free time, the support value of the
original rule is 47.55% lower than expected.

The degree of unexpectedness is equal to 20.69%.
Exceptions
(Education = “university”) & (Group = “C”) → (BMI = “normal”)
IM = 0.7018, DU = 0.3052

Original rule: “men with the highest education level
tend to have a body mass index in the normal range”.

Exception: Among the men who belong to Group C, the
support value of the original rule is 70.18% lower than
expected.

The degree of unexpectedness is equal to 30.52%.
Summary
We presented some strong association rules and
exceptions mined from the STULONG Data Set,
concerning the entry examinations.
Strong association rules showed how the correlations
among patient characteristics differ across
the three basic groups.
Exceptions indicated negative patterns associated with
previously known strong positive rules. These exceptions
were mined from candidates that did not achieve their
expected support value.
Future Work
Apply the same approach to the relations: Letter,
Control and Death.
Besides mining rules with large deviation between the
actual and the expected support, we intend to
investigate the interestingness of rules with large
deviation between the actual and the expected
confidence value.
Universidade Federal Fluminense
http://www.uff.br
Niterói, Rio de Janeiro, Brazil
Thank you!