
CISC 4631
Data Mining
Lecture 09:
Association Rule Mining
These slides are based on the slides by
• Tan, Steinbach and Kumar (textbook authors)
• Prof. F. Provost (Stern, NYU)
• Prof. B. Liu, UIC
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Mining?
• Examples:
– Rule form: "Body → Head [support, confidence]"
– buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
– buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
– major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
– age(X, 30-45) ∧ income(X, 50K-75K) → buys(X, SUVcar)
– age="30-45", income="50K-75K" → car="SUV"
Market-basket analysis
and finding associations
• Do items occur together? (more than I might expect)
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database and data mining community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how items purchased by customers are related.
Bread → Milk [sup = 5%, conf = 100%]
Association Rule: Basic Concepts
• Given: (1) database of transactions, (2) each transaction is a list
of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items
with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories
also get automotive services done
• Applications
– * → Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
– Home Electronics → * (What other products should the store stock up on?)
– Detecting "ping-pong"ing of patients, faulty "collisions"
Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
An itemset is simply a set of items.
Association Rule Mining
– We are interested in rules that are
• non-trivial (and possibly unexpected)
• actionable
• easily explainable
Examples from a Supermarket
• Can you think of association rules from a
supermarket?
• Let’s say you identify association rules from a
supermarket, how might you exploit them?
– That is, if you are the store manager, how might
you make money?
• Assume you have a rule of the form X → Y
Supermarket examples
• If you have a rule X → Y, you could:
– Run a sale on X if you want to increase sales of Y
– Locate the two items near each other
– Locate the two items far from each other to make
the shopper walk through the store
– Print out a coupon on checkout for Y if shopper
bought X but not Y
Association “rules” – standard format
Rule format (a set can consist of just a single item):
If {set of items} Then {set of items}
Condition → Results

Example:
If {Diapers, Baby Food} Then {Beer, Chips}
Condition implies Results

[Venn diagram: customers who buy diapers, customers who buy beer, and the overlap who buy both]
What is an interesting association?
• Requires domain-knowledge validation
– actionable vs. trivial vs. inexplicable
• Algorithms provide first-pass based on statistics on how
“unexpected” an association is
• Some standard statistics used, for a rule C → R:
– support ≈ p(R & C)
• percent of "baskets" where the rule holds
– confidence ≈ p(R | C)
• percent of times R holds when C holds
Support and Confidence
[Venn diagram: customers who buy diapers, customers who buy beer, and the overlap who buy both]

• Find all the rules X → Y with minimum confidence and support
– Support = probability that a transaction contains {X, Y}
• i.e., ratio of transactions in which X, Y occur together to all transactions in the database
– Confidence = conditional probability that a transaction having X also contains Y
• i.e., ratio of transactions in which X, Y occur together to those in which X occurs

In general, the confidence of a rule LHS → RHS can be computed as the support of the whole itemset divided by the support of the LHS:
Confidence(LHS → RHS) = Support(LHS ∪ RHS) / Support(LHS)
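To make the two measures concrete, here is a minimal Python sketch (ours, not from the slides) that computes support and confidence over the market-basket transactions shown earlier; the helper names are our own choices.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    # fraction of transactions containing every item of the itemset
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # Support(LHS u RHS) / Support(LHS)
    return (support(set(lhs) | set(rhs), transactions)
            / support(lhs, transactions))

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...

Running it reproduces the values derived on the following slides: s = 0.4 and c ≈ 0.67 for {Milk, Diaper} → {Beer}.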
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Support
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Support and Confidence - Example
Transaction ID  Items Bought
1001            A, B, C
1002            A, C
1003            A, D
1004            B, E, F
1005            A, D, F

Itemset {A, C} has a support of 2/5 = 40%
Rule {A} → {C} has a confidence of 50%
Rule {C} → {A} has a confidence of 100%

Support for {A, C, E}?
Support for {A, D, F}?
Confidence for {A, D} → {F}?
Confidence for {A} → {D, F}?
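A quick way to check these questions (a minimal Python sketch of our own, recounting directly from the five transactions above):

T = [{"A","B","C"}, {"A","C"}, {"A","D"}, {"B","E","F"}, {"A","D","F"}]

def sup(X):
    # fraction of transactions containing all of X
    return sum(set(X) <= t for t in T) / len(T)

def conf(lhs, rhs):
    # support(LHS u RHS) / support(LHS)
    return sup(set(lhs) | set(rhs)) / sup(lhs)

print(sup({"A","C","E"}))      # 0.0  -- {A,C,E} never occurs together
print(sup({"A","D","F"}))      # 0.2  -- only transaction 1005
print(conf({"A","D"}, {"F"}))  # 0.5  -- 1 of the 2 transactions with {A,D}
print(conf({"A"}, {"D","F"}))  # 0.25 -- 1 of the 4 transactions with A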
Goal: Find all rules that satisfy the user-specified minimum support
(minsup) and minimum confidence (minconf).
Example
• Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
…
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Drawback of Confidence
         Coffee  ¬Coffee
Tea        15       5     |  20
¬Tea       75       5     |  80
           90      10     | 100

Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75
but P(Coffee) = 90/100 = 0.9
⇒ Although confidence is high, the rule is misleading:
P(Coffee|¬Tea) = 75/80 = 0.9375
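The comparison can be read straight off the contingency table; a small Python sketch (ours, not the slides') makes it explicit:

# counts from the contingency table above
tea_and_coffee, tea_total = 15, 20
coffee_total, n = 90, 100

print(tea_and_coffee / tea_total)                         # P(Coffee|Tea)  = 0.75
print(coffee_total / n)                                   # P(Coffee)      = 0.90
print((coffee_total - tea_and_coffee) / (n - tea_total))  # P(Coffee|~Tea) = 0.9375

Buying tea actually lowers the chance of buying coffee, even though the rule's confidence of 0.75 looks high.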
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
Transaction data representation
• A simplistic view of shopping baskets
• Some important information is not considered, e.g.,
– the quantity of each item purchased, and
– the price paid.
Many mining algorithms
• There are a large number of them!!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same.
– Given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ.
• We study only one: the Apriori algorithm
The Apriori algorithm
• The best known algorithm
• Two steps:
– Find all itemsets that have minimum support
(frequent itemsets, also called large itemsets).
– Use frequent itemsets to generate rules.
• E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Step 1: Mining all frequent itemsets
• A frequent itemset is an itemset whose support
is ≥ minsup.
• Key idea: the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset

[Itemset lattice over the items A, B, C, D: level 1 A, B, C, D; level 2 AB, AC, AD, BC, BD, CD; level 3 ABC, ABD, ACD, BCD]
Steps in Association Rule Discovery
• Find the frequent itemsets
– Frequent itemsets are the sets of items that have minimum support
– Support is "downward closed," so a subset of a frequent itemset must also be a frequent itemset
• if {A, B} is a frequent itemset, both {A} and {B} are frequent itemsets
• this also means that if an itemset doesn't satisfy minimum support, none of its supersets will either (this is essential for pruning the search space)
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules
Frequent Itemset Generation
[Itemset lattice for d = 5 items A–E, from the null set down through all 1-, 2-, 3-, and 4-itemsets to ABCDE]

Given d items, there are 2^d possible candidate itemsets.
Mining Association Rules—An Example
User specifies: Min. support 50%, Min. confidence 50%

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have
minimum support
– A subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets. Why? Make sure you can explain this.
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules
– This step is more straightforward and requires less computation, so we focus on the first step
Illustrating Apriori Principle
[Itemset lattice for items A–E: once an itemset is found to be infrequent, all of its supersets are pruned from the search space]
The Apriori Algorithm
• Terminology:
– Ck is the set of candidate k-itemsets
– Lk is the set of frequent k-itemsets
• Join Step: Ck is generated by joining the set Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
– This is a bit confusing since we want to use it the other way: we prune a candidate k-itemset if any of its (k-1)-itemsets is not in our list of frequent (k-1)-itemsets
• To utilize this you simply start with k = 1, which is single-item itemsets, and then you work your way up from there!
The Algorithm
• Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets; then all 2-item frequent itemsets, and so on.
– In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset.
• Find frequent itemsets of size 1: F1
• From k = 2:
– Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
– Fk = those itemsets that are actually frequent, Fk ⊆ Ck (need to scan the database once).
Apriori candidate generation
• The candidate-gen function takes Lk-1 and returns a superset (called the candidates) of the set of all frequent k-itemsets. It has two steps:
– join step: generate all possible candidate itemsets Ck of length k
– prune step: remove those candidates in Ck that cannot be frequent.
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
– The description below is a bit confusing: all we do is splice two sets together so that only one new item is added (see example)

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
Example of Generating Candidates
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– acde is removed because ade is not in L3
• C4 = {abcd}
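Here is one possible Python rendering of candidate-gen (ours, not the textbook's); itemsets are kept as sorted tuples so the "shared first k-2 items" join condition applies directly:

from itertools import combinations

def candidate_gen(L_prev, k):
    L_prev = set(L_prev)
    # Join step: merge pairs that agree on their first k-2 items
    C = {p + (q[k-2],) for p in L_prev for q in L_prev
         if p[:k-2] == q[:k-2] and p[k-2] < q[k-2]}
    # Prune step: every (k-1)-subset must itself be in L_{k-1}
    return {c for c in C
            if all(s in L_prev for s in combinations(c, k-1))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(candidate_gen(L3, 4))   # {('a','b','c','d')} -- acde is pruned

Running it on the L3 above reproduces the example: abcd survives, while acde is pruned because ade is not in L3.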
Example: finding frequent itemsets (minsup = 0.5)

Dataset T:
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

(notation: itemset:count)
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}
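Putting the pieces together, here is a compact level-wise Apriori sketch (ours; the variable names are our own) that reproduces F1, F2, and F3 for dataset T:

from itertools import combinations

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
minsup = 0.5

def is_frequent(c):
    return sum(set(c) <= t for t in T) / len(T) >= minsup

# F1: frequent 1-itemsets, kept as sorted tuples
F = {(i,) for t in T for i in t if is_frequent((i,))}
k, frequent = 2, [F]
while F:
    # candidate generation: join Fk-1 with itself ...
    C = {p + (q[k-2],) for p in F for q in F
         if p[:k-2] == q[:k-2] and p[k-2] < q[k-2]}
    # ... then prune candidates with an infrequent (k-1)-subset
    C = {c for c in C if all(s in F for s in combinations(c, k-1))}
    # one scan of T keeps the candidates that are actually frequent
    F = {c for c in C if is_frequent(c)}
    if F:
        frequent.append(F)
    k += 1

for i, Fk in enumerate(frequent, 1):
    print(f"F{i}:", sorted(Fk))
# F1: (1,) (2,) (3,) (5,)   F2: (1,3) (2,3) (2,5) (3,5)   F3: (2,3,5)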
The Apriori Algorithm — Example
Database D (minsup = 30%):
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:            L1:
itemset  sup            itemset  sup
{1}      2              {1}      2
{2}      3              {2}      3
{3}      3              {3}      3
{4}      1              {5}      3
{5}      3

C2 (from L1 joined with L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2:            L2:
itemset  sup            itemset  sup
{1 2}    1              {1 3}    2
{1 3}    2              {2 3}    2
{1 5}    1              {2 5}    3
{2 3}    2              {3 5}    2
{2 5}    3
{3 5}    2

C3: {2 3 5} → Scan D → L3: {2 3 5}, sup = 2
Step 2: Generating rules from frequent
itemsets
• Frequent itemsets ≠ association rules
• One more step is needed to generate association rules
• For each frequent itemset X,
  for each proper nonempty subset A of X:
– Let B = X − A
– A → B is an association rule if
• Confidence(A → B) ≥ minconf,
where support(A → B) = support(A ∪ B) = support(X)
and confidence(A → B) = support(A ∪ B) / support(A)
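A Python sketch of this step (ours; it recounts supports from the toy dataset T for self-containment, whereas a real implementation would reuse the counts recorded during itemset generation):

from itertools import combinations

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def sup(itemset):
    return sum(set(itemset) <= t for t in T) / len(T)

def rules_from(X, minconf):
    X = tuple(sorted(X))
    for r in range(1, len(X)):           # every proper nonempty subset A
        for A in combinations(X, r):
            B = tuple(i for i in X if i not in A)
            conf = sup(X) / sup(A)       # confidence(A -> B)
            if conf >= minconf:
                yield A, B, sup(X), conf

for A, B, s, c in rules_from({2, 3, 5}, minconf=0.75):
    print(f"{A} -> {B}  sup={s:.2f} conf={c:.2f}")
# {2,3} -> {5} and {3,5} -> {2} survive; {2,5} -> {3} has conf 0.67

With minconf = 75% this reproduces the "Rules from Previous Example" slide later in the lecture.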
Generating rules: an example
• Suppose {2,3,4} is frequent, with sup = 50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
• 2,3 → 4, confidence = 100%
• 2,4 → 3, confidence = 100%
• 3,4 → 2, confidence = 67%
• 2 → 3,4, confidence = 67%
• 3 → 2,4, confidence = 67%
• 4 → 2,3, confidence = 67%
• All rules have support = 50%
Generating rules: summary
• To recap, in order to obtain A → B, we need to have support(A ∪ B) and support(A)
• All the information required for the confidence computation has already been recorded during itemset generation. There is no need to see the data T any more.
• This step is not as time-consuming as frequent itemset generation.
On Apriori Algorithm
Seems to be very expensive?
• Level-wise search
• K = the size of the largest itemset
• It makes at most K passes over the data
• In practice, K is bounded (around 10).
• The algorithm is very fast. Under some conditions, all rules can be found in linear time.
• It scales up to large data sets.
Granularity of items
• One exception to the “ease” of applying association rules is
selecting the granularity of the items.
• Should you choose:
– diet coke?
– coke product?
– soft drink?
– beverage?
• Should you include more than one level of granularity? Be careful!
• (Some association finding techniques allow you to represent hierarchies explicitly)
Multiple-Level Association Rules
• Items often form a hierarchy
– Items at the lower level are expected to have lower support
– Rules regarding itemsets at appropriate levels could be quite useful
– Transaction database can be encoded based on dimensions and levels
[Item hierarchy: Food → Milk (Skim, 2%) and Bread (Wheat, White)]
Mining Multi-Level Associations
• A top-down, progressive deepening approach
– First find high-level strong rules:
• milk → bread [20%, 60%]
– Then find their lower-level "weaker" rules:
• 2% milk → wheat bread [6%, 50%]
– When one threshold is set for all levels: if support is too high, meaningful associations at the low level may be missed; if support is too low, uninteresting rules may be generated
• different minimum support thresholds across multiple levels lead to different algorithms (e.g., decrease min-support at lower levels)
• Variations in mining multiple-level association rules
– Level-crossed association rules:
• milk → Wonder wheat bread
– Association rules with multiple, alternative hierarchies:
• 2% milk → Wonder bread
Rule Generation
• Now that you have the frequent itemsets, you can generate
association rules
– Split the frequent itemsets in all possible ways and prune a rule if its confidence is below the min_confidence threshold
• Rules that are left are called strong rules
– You may be given a rule template that constrains the rules
• Rules with only one item on the right-hand side
• Rules with two items on the left and one on the right
– Rules of the form {X, Y} → Z
Rules from Previous Example
• What are the strong rules of the form {X, Y} → Z if the confidence threshold is 75%?
– We start with {2,3,5}
• {2,3} → 5 (confidence = 2/2 = 100%): STRONG
• {3,5} → 2 (confidence = 2/2 = 100%): STRONG
• {2,5} → 3 (confidence = 2/3 = 66%): PRUNE!
• Note that in general you don't just look at the frequent itemsets of maximum length. If we wanted strong rules of the form X → Y we would look at C2.
Interestingness Measurements
• Objective measures
– Two popular measurements:
• Support
• Confidence
• Subjective measures (Silberschatz & Tuzhilin, KDD95)
– A rule (pattern) is interesting if
• it is unexpected (surprising to the user); and/or
• it is actionable (the user can do something with it)
Criticism to Support and Confidence
• Lift of A → B = P(B|A) / P(B); a rule is interesting if its lift is not near 1.0
• Example 1:
– Among 5000 students:
• 3000 play basketball
• 3750 eat cereal
• 2000 both play basketball and eat cereal
– play basketball → eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball → not eat cereal [20%, 33.3%] is far more interesting, although it has lower support and confidence.

What is the lift of this rule? (1/3) / (1250/5000) = 1.33

            basketball  not basketball  sum(row)
cereal         2000          1750         3750
not cereal     1000           250         1250
sum(col.)      3000          2000         5000
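Checking the lift numbers from the table (a small sketch of ours):

n = 5000
basketball, cereal, both = 3000, 3750, 2000

conf = both / basketball                       # P(cereal | basketball) = 0.667
lift = conf / (cereal / n)                     # 0.667 / 0.75 = 0.89 (< 1: negative association)

conf_not = (basketball - both) / basketball    # P(not cereal | basketball) = 0.333
lift_not = conf_not / ((n - cereal) / n)       # 0.333 / 0.25 = 1.33

print(round(lift, 2), round(lift_not, 2))      # 0.89 1.33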
Customer Number vs. Transaction ID
• In the homework you may have a problem where
there is a customer id for each transaction
– You may be asked to do association analysis based on the customer id
• If this is so, you need to aggregate the transactions to the customer level
Market-basket analysis
and finding associations
• Do items occur together? (more than I might expect)
• Why might I care?
– merchandising
• e.g., placing products in a retail space (physical or electronic), catalog design
• packaging optional services
– recommendations
• cross-selling and up-selling opportunities
• mining credit-card data
• developing customer loyalty and self-investment
– fraud detection
• e.g., in insurance data, a doctor very often works on the cases of a particular lawyer
– simply understanding my business
• are there “investment profiles” of my clients?
• customer segmentation based on buying behavior
• is anything strange going on?
Virtual items
• If you're interested in including other possible variables, you can create "virtual items":
– gift-wrap, used-coupon, new-store, winter-holidays, bought-nothing, …
Associations: Pros and Cons
• Pros
– can quickly mine patterns describing business/customers/etc. without
major effort in problem formulation
– virtual items allow much flexibility
– unparalleled tool for hypothesis generation
• Cons
– unfocused
• not clear exactly how to apply mined “knowledge”
• only hypothesis generation
– can produce many, many rules!
• may only be a few nuggets among them (or none)
Association Rules
• Association rule types:
– Actionable Rules – contain high-quality, actionable
information
– Trivial Rules – information already well-known by
those familiar with the business
– Inexplicable Rules – no explanation and do not suggest
action
• Trivial and Inexplicable Rules occur most often