
Rule Learning
Intelligent Systems – Lecture 10
1) Motivation (what problem is solved)
2) Technical solution
• Definitions
• Explanations
• Procedures
• Illustrations by short examples
3) Illustration by a larger example
4) Extensions (Work only sketched)
Motivation
• Data warehouses and other large-scale data collections provide vast amounts of data and hide knowledge that is not obvious to discover.
• Rule learning (association rule learning) is a popular means for discovering interesting relations between items in large data sets.
• Rules enable the inference of knowledge and make hidden facts explicit.
Motivating Example
• Association rule learning is very popular in marketing:
– People buying Coke and beer are very likely to also buy potato chips: {Coke, beer} ⇒ {potato chips}.
– Retail Advertising and Marketing Association: "For example, convenience stores realized that new fathers came in to buy Pampers and they also bought beer. So where are the Pampers now? Near the beer. It's about getting a feel for what other things you can merchandise."
• Other popular application domains include, amongst others, Web usage mining, intrusion detection and bioinformatics.
Definition by Agrawal
• Let I = {i1, i2, ..., in} be a set of binary attributes called items, and let D = {t1, t2, ..., tm} be a set of transactions called the database.
• Each transaction (tuple in the database) contains a subset of the items in I.
• A rule is then defined as an implication of the form A → C, where A, C ⊆ I and A ∩ C = ∅.
• The left-hand side (A) is called the antecedent, the right-hand side (C) is referred to as the consequent. (A toy encoding of these notions is sketched below.)
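To make the definition concrete, here is a minimal Python sketch, not taken from the slides; the item names and the four transactions are invented for illustration:

# Minimal sketch: a transaction database over items I and a rule A -> C,
# both represented as Python sets. All item names and transactions are invented.
transactions = [
    {"Coke", "beer", "potato chips"},
    {"Coke", "beer"},
    {"beer", "potato chips"},
    {"Coke", "potato chips"},
]

antecedent = frozenset({"Coke", "beer"})    # A
consequent = frozenset({"potato chips"})    # C, with A ∩ C = ∅

# The rule "fires" on a transaction t if A ⊆ t; it is fully satisfied if A ∪ C ⊆ t.
fires     = [antecedent <= t for t in transactions]
satisfied = [(antecedent | consequent) <= t for t in transactions]
print(fires)      # [True, True, False, False]
print(satisfied)  # [True, False, False, False]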
Example
• [Simple example from Wikipedia about milk, butter, bread and beer]
Significance of Rules
• Association rule learning only makes sense in the context of
very large data sets.
• In very large data sets there are obviously hundreds, if not thousands, of discoverable implications.
• The significance of, and interest in, a rule is therefore an important selection criterion, and only those rules that represent a sufficiently large share of the whole data set can be considered relevant.
• The support of an itemset A is defined as the proportion of transactions ti ∈ D that contain A (a short code sketch follows below).
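As a hedged illustration, reusing the invented toy database from the earlier sketch rather than any lecture code, support can be computed as:

# Sketch: supp(A) = fraction of transactions that contain the itemset A.
def support(itemset, transactions):
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

transactions = [
    {"Coke", "beer", "potato chips"},
    {"Coke", "beer"},
    {"beer", "potato chips"},
    {"Coke", "potato chips"},
]
print(support({"Coke", "beer"}, transactions))   # 0.5: the itemset occurs in 2 of 4 transactions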
Confidence
• The confidence in a rule depends on the support of the
itemset A and the support of the union of A and C.
• Confidence: conf(A → C) = supp(A ⋃ C) / supp(A).
• The confidence is an estimate of the conditional probability P(C|A): it indicates how likely the consequent is to hold when the antecedent is given (see the numeric sketch below).
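Continuing with the same invented toy values (a sketch, not material from the lecture), confidence is just a ratio of two supports:

# Sketch: conf(A -> C) = supp(A ∪ C) / supp(A), an estimate of P(C|A).
# Invented toy values: supp({Coke, beer}) = 0.5, supp({Coke, beer, potato chips}) = 0.25.
supp_a  = 0.5     # support of the antecedent A
supp_ac = 0.25    # support of A ∪ C
print(supp_ac / supp_a)   # 0.5: half of the {Coke, beer} baskets also contain potato chips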
Finding Rules
• Association rules need to satisfy a given level of confidence and a given degree of support at the same time.
• A two-step process is generally applied to discover rules that satisfy both requirements:
– A minimum support threshold is used to determine the frequent itemsets.
– A minimum confidence threshold is applied to these itemsets to determine the rules (the second step is sketched below).
• The first step is significantly more challenging than the second
one!
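The second step can be outlined as follows. This is a hedged sketch rather than the lecture's exact procedure: the frequent itemsets and their supports are assumed to be given, and the numbers are the invented toy values used throughout:

from itertools import combinations

# Sketch of step 2: from each frequent itemset F, derive all rules A -> C with
# A ∪ C = F and A ∩ C = ∅, keeping those that reach the minimum confidence.
# `frequent` maps itemsets to supports (invented toy values, downward closed).
frequent = {
    frozenset({"Coke"}): 0.75, frozenset({"beer"}): 0.75, frozenset({"potato chips"}): 0.75,
    frozenset({"Coke", "beer"}): 0.5, frozenset({"Coke", "potato chips"}): 0.5,
    frozenset({"beer", "potato chips"}): 0.5,
    frozenset({"Coke", "beer", "potato chips"}): 0.25,
}

def rules(frequent, min_conf):
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, k)):
                confidence = supp / frequent[antecedent]
                if confidence >= min_conf:
                    yield antecedent, itemset - antecedent, confidence

for a, c, conf in rules(frequent, min_conf=0.6):
    print(set(a), "->", set(c), round(conf, 2))   # prints the six pairwise rules, each ≈ 0.67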
Finding Frequent Itemsets
• The number of possible itemsets is given by the powerset of I and is thus equal to 2^n – 1 (excluding the empty set), with n = |I|.
• Consequently, the number of potential itemsets grows exponentially in the size of I.
• Different algorithms nonetheless allow the frequent itemsets to be computed efficiently: Apriori (BFS), Eclat (DFS), FP-Growth (FP-tree).
• These algorithms exploit the downward-closure property (aka anti-monotonicity): all subsets of a frequent itemset are frequent, which in turn means that all supersets of an infrequent itemset are infrequent (an Apriori-style sketch follows below).
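An Apriori-style, level-wise pass that exploits this property might look as follows; this is an illustrative sketch over the invented toy database, not the lecture's exact algorithm:

from itertools import combinations

# Sketch of Apriori-style frequent-itemset mining (BFS over itemset sizes):
# candidates of size k are joined from frequent itemsets of size k-1, and the
# downward-closure property prunes any candidate with an infrequent subset.
transactions = [
    {"Coke", "beer", "potato chips"},
    {"Coke", "beer"},
    {"beer", "potato chips"},
    {"Coke", "potato chips"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori(transactions, min_supp):
    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset({i}) for i in items]
    while level:
        level = [c for c in level if support(c, transactions) >= min_supp]
        frequent.update({c: support(c, transactions) for c in level})
        k = len(level[0]) + 1 if level else 0      # size of the next candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates             # prune via downward closure
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    return frequent

for itemset, supp in apriori(transactions, min_supp=0.5).items():
    print(set(itemset), supp)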
Alternative Indicators
• All-confidence: all rules that can be derived from an itemset A have at least a confidence of
all-confidence(A) = supp(A) / max{supp(a) : a ∈ A}
with the denominator being the support of the item with the highest support in A (sketched numerically below).
• Collective strength is given by
cs(A) = (1 - v(A)) / (1 - E[v(A)]) * E[v(A)] / v(A)
with v(A) the violation rate and E[·] the expected value under item independence; the violation rate is defined as the fraction of transactions which contain some, but not all, of the items in the itemset. Collective strength is 0 for perfectly negatively correlated items, infinity for perfectly positively correlated items, and 1 if the items co-occur as expected under independence.
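All-confidence only needs the supports involved; a minimal numeric sketch with the invented toy values used throughout:

# Sketch: all-confidence(A) = supp(A) / max{supp(a) : a in A}. Toy values are invented.
item_supports = {"Coke": 0.75, "beer": 0.75}   # supports of the single items in A
supp_a = 0.5                                   # support of A = {Coke, beer}
print(supp_a / max(item_supports.values()))    # ≈ 0.67: every rule derived from A has at least this confidence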
Alternative Indicators (2)
• Coverage (aka antecedent support) measures how often a
rule is applicable in a database:
coverage(A → C) = supp(A) = P(A)
• The conviction of a rule compares the expected frequency of A occurring without C (if A and C were independent) with the observed frequency of A occurring without C (see the numeric sketch below):
conv(A → C) = (1 - supp(C)) / (1 - conf(A → C)).
• Leverage measures the difference between how often A and C appear together in the data set and what would be expected if A and C were statistically independent:
leverage(A → C) = P(A and C) - P(A)P(C)
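All three indicators reduce to simple arithmetic over supports and confidence; a hedged numeric sketch with the same invented toy values:

# Sketch: coverage, conviction and leverage from toy support/confidence values.
supp_a, supp_c, supp_ac = 0.5, 0.75, 0.25    # invented: A = {Coke, beer}, C = {potato chips}
conf = supp_ac / supp_a                      # confidence of A -> C = 0.5

coverage   = supp_a                          # how often the rule is applicable at all
conviction = (1 - supp_c) / (1 - conf)       # 0.25 / 0.5 = 0.5
leverage   = supp_ac - supp_a * supp_c       # 0.25 - 0.375 = -0.125
print(coverage, conviction, leverage)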
Alternative Indicators (3)
• The lift of a rule is the ratio between the observed confidence and the confidence expected by pure chance; it thus measures how many times more often A and C occur together than would be expected if they were statistically independent (see the numeric sketch below):
lift(A → C) = conf(A → C) / supp(C)
= supp(A ∪ C) / (supp(A) * supp(C)).
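With the same invented toy values as before, lift works out as:

# Sketch: lift(A -> C) = supp(A ∪ C) / (supp(A) * supp(C)). Toy values are invented.
supp_a, supp_c, supp_ac = 0.5, 0.75, 0.25
print(supp_ac / (supp_a * supp_c))   # ≈ 0.67: A and C co-occur less often than independence would predict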
Decision Trees
• Many inductive knowledge acquisition algorithms generate classifiers in the form of decision trees.
• A decision tree is a simple recursive structure for expressing
a sequential classification process.
– Leaf nodes denote classes
– Intermediate nodes represent tests
Decision Trees and Rules
• Rules can represent a decision tree:
if item1 then subtree1
elseif item2 then subtree2
elseif...
• There are as many rules as there are leaf nodes in the decision tree.
• Advantages of rules over decision trees (a tree-to-rules sketch follows below):
– Rules are a widely used and well-understood representation formalism for knowledge in expert systems;
– Rules are easier to understand, modify and combine; and
– Rules can significantly improve classification performance by eliminating unnecessary tests.
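As a hedged illustration (the tree and its attribute names are invented, not taken from the lecture), a decision tree stored as nested dictionaries can be flattened into one rule per leaf:

# Sketch: a decision tree as nested dicts; each path from the root to a leaf yields a rule.
tree = {"attribute": "outlook",
        "branches": {"sunny": {"attribute": "windy",
                               "branches": {"yes": "stay in", "no": "go out"}},
                     "rainy": "stay in"}}

def rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit one rule per path
        yield conditions, node
        return
    for value, child in node["branches"].items():
        yield from rules(child, conditions + ((node["attribute"], value),))

for conds, conclusion in rules(tree):
    print(" and ".join(f"{a} = {v}" for a, v in conds), "=>", conclusion)

The number of printed rules equals the number of leaf nodes, as stated above.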
Decision-Driven Rules
• The following definitions apply to rules that aim to conclude a fact from a given set of attribute-value assignments.
• The decision tree thus takes the following form
if attribute1 = value1 then subtree1
elseif attribute1 = value2 then subtree2
elseif...
• The critical question is then: which attribute should be evaluated first, i.e. which attribute is the most selective determiner and should therefore come first in the decision tree?
Entropy
• Entropy is a measure of the 'degree of doubt' and is a well-studied concept in information theory.
• Let {c1, c2, ..., cn} be the set C of possible conclusions (consequences) of a rule; let {x1, x2, ..., xm} be the set of possible values of an attribute X.
• The probability that ci is true, given that X has the value xj, is written p(ci|xj).
• The entropy of the conclusions given X = xj is then defined as
entropy(xj) = - Σ p(ci|xj) log2[p(ci|xj)], with the sum running over i ∈ 1...|C|
• The term - log2[p(ci|xj)] indicates the 'amount of information' that xj has to offer about the conclusion ci.
Most Useful Determiner
• The lower the entropy of xj with respect to C, the more
information xj has to offer about C.
• The entropy of an attribute X with respect to C is then given by
entropy(X) = - Σj p(xj) Σi p(ci|xj) log2[p(ci|xj)] (a numeric sketch follows below).
• The attribute with the lowest entropy is the most useful
determiner, as it has the lowest ‚degree of doubt‘.
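A minimal numeric sketch of this attribute-entropy computation; the attribute values, conclusions and counts are invented and this is not code from the lecture:

from collections import Counter
from math import log2

# Sketch: entropy of an attribute X with respect to the conclusions C,
# computed from (attribute value, conclusion) pairs of an invented data set.
examples = [("sunny", "go out"), ("sunny", "go out"), ("sunny", "stay in"),
            ("rainy", "stay in"), ("rainy", "stay in"), ("rainy", "stay in")]

def attribute_entropy(examples):
    total, entropy = len(examples), 0.0
    for value in {x for x, _ in examples}:
        subset = [c for x, c in examples if x == value]
        p_x = len(subset) / total                      # p(x_j)
        for count in Counter(subset).values():
            p_c_given_x = count / len(subset)          # p(c_i | x_j)
            entropy -= p_x * p_c_given_x * log2(p_c_given_x)
    return entropy

print(round(attribute_entropy(examples), 3))   # ≈ 0.459 for this toy data

An attribute whose values split the examples into purer groups gets a lower value and would be preferred as the determiner.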
ID3 Algorithm by Quinlan
1. For each attribute, compute its entropy with respect to the
conclusion;
2. Select the attribute with lowest entropy (say X)
3. Divide the data into separate sets so that within a set, X has a fixed
value (X=x1 in one set, X=x2 in another...)
4. Build a tree with branches:
if X=x1 then ... (subtree1)
if X=x2 then ... (subtree2)
...
5. For each subtree, repeat from step 1.
6. At each iteration, one attribute gets removed from consideration. STOP if no more attributes are left, or if all attribute values lead to the same conclusion (a code sketch of these steps follows below).
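A compact, ID3-style sketch of these steps: self-contained toy code with invented records, which picks the lowest-entropy attribute, splits on its values, and recurses:

from collections import Counter
from math import log2

# Sketch of ID3: choose the lowest-entropy attribute, split the data on its
# values, and recurse until the attributes run out or a split is pure.
data = [{"outlook": "sunny", "windy": "no",  "class": "go out"},
        {"outlook": "sunny", "windy": "yes", "class": "stay in"},
        {"outlook": "rainy", "windy": "no",  "class": "stay in"},
        {"outlook": "rainy", "windy": "yes", "class": "stay in"}]

def attribute_entropy(rows, attr):
    total, entropy = len(rows), 0.0
    for value in {r[attr] for r in rows}:
        subset = [r["class"] for r in rows if r[attr] == value]
        for count in Counter(subset).values():
            p = count / len(subset)
            entropy -= (len(subset) / total) * p * log2(p)
    return entropy

def id3(rows, attributes):
    classes = {r["class"] for r in rows}
    if len(classes) == 1 or not attributes:          # pure subset, or no attributes left
        return Counter(r["class"] for r in rows).most_common(1)[0][0]
    best = min(attributes, key=lambda a: attribute_entropy(rows, a))
    return {"attribute": best,
            "branches": {v: id3([r for r in rows if r[best] == v],
                                [a for a in attributes if a != best])
                         for v in {r[best] for r in rows}}}

print(id3(data, ["outlook", "windy"]))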
Example
• [Robinson Crusoe example: what is good to eat...]
Counter Example
• Consider the following data:
– X=3, Y=3 ⇒ yes
– X=2, Y=1 ⇒ no
– X=3, Y=4 ⇒ no
– X=1, Y=1 ⇒ yes
– X=2, Y=2 ⇒ yes
– X=3, Y=2 ⇒ no
• The entropy-based ID3 algorithm is incapable of spotting the obvious relationship if X = Y then yes else no, as only one attribute is considered at a time!
Accuracy of ID3
• ID3 forms rules by eliminating conditions from a path in the decision
tree, and thus the rules tend to be over-generalized with respect to
the training data.
• Can rules keep up with decision trees?
• Experimental results by Quinlan in 1987 show that rules are not only simpler in the general case, but that they are sometimes even more accurate!
Refinement of Rule Sets
• There is a four-step procedure for the refinement of rule sets:
1. Rules that become incorrect because of new examples
are refined: incorrect rules are replaced by new rules
that cover the positive examples, but not the new
negative ones.
2. Complete the rule set to cover new positive examples.
3. Redundant rules are deleted to correct the rule set.
4. Minimize the rule set.
• Steps 1 and 2 are carried out by the algorithm JoJo, which integrates generalization and specialization via a heuristic search procedure.
Specialization
• Specialization algorithms start from very general descriptions and specialize them until they become correct.
• This is done by adding additional premises to the antecedent of a
rule, or by restricting the range of an attribute which is used in an
antecedent.
• Algorithms relying on specialization generally have the problem of overspecialization: earlier specialization steps may become unnecessary due to subsequent specialization steps.
• This brings along the risk of ending up with results that are not maximally general.
• Some examples of specialization algorithms are the following: AQ,
C4, CN2, CABRO, FOIL, or PRISM.
Generalization
• Generalization starts from very specific descriptions and generalizes them as long as they do not become incorrect, i.e. in every step some unnecessary premises are deleted.
• The generalization procedure stops if no more premises to remove
exist.
• Generalization avoids the maximal-generality issue of specialization; in fact, it guarantees most-general descriptions.
• However, generalization in turn risks deriving final results that are not most-specific.
• RELAX is an example of a generalization-based algorithm.
RELAX
JoJo
• ID3 in fact already makes use of both search directions, by first constructing rules via specialization and then generalizing the rule set.
• However, in general it cannot be determined which search direction
is the better one.
• JoJo is an algorithm that combines both search directions in one
heuristic search procedure.
• JoJo can start at an arbitrary point in the lattice of complexes and generalize and specialize as long as quality and correctness can be improved, i.e. until a local optimum is found or no more search resources are available.
JoJo (2)