the updated slides

ASSOCIATION RULES
(MARKET-BASKET ANALYSIS)
MIS2502
Data Analytics
Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
http://www-users.cs.umn.edu/~kumar/dmbook/
What is Association Mining?
• Discovering interesting relationships between
variables in large databases
(http://en.wikipedia.org/wiki/Association_rule_learning)
• Find out which items predict the occurrence of
other items
• Also known as “affinity analysis” or “market
basket” analysis
Examples of Association Mining
• Market basket analysis/affinity analysis
• What products are bought together?
• Where to place items on grocery store shelves?
• Amazon’s recommendation engine
• “People who bought this product also bought…”
• Telephone calling patterns
• Who does a set of people tend to call most often?
• Social network analysis
• Determine who you “may know”
Market-Basket Analysis
• Large set of items
• e.g., things sold in a supermarket
• Large set of baskets
• e.g., things one customer buys in one visit
• Supermarket chains keep terabytes of this data
• Informs store layout
• Suggests “tie-ins”
• Place spaghetti sauce in the pasta aisle
• Put diapers on sale and raise the price of beer
Market-Basket Transactions
Basket   Items
1        Bread, Milk
2        Bread, Diapers, Beer, Eggs
3        Milk, Diapers, Beer, Coke
4        Bread, Milk, Diapers, Beer
5        Bread, Milk, Diapers, Coke
Association Rules from these transactions
X → Y
(antecedent → consequent)
{Diapers} → {Beer},
{Milk, Bread} → {Diapers},
{Beer, Bread} → {Milk},
{Bread} → {Milk, Diapers}
Core idea: The itemset
• Itemset: A group of items of interest, e.g., {Milk, Coke, Diapers}
• This itemset is a “3-itemset” because it contains…3 items!
• An association rule expresses related itemsets
• X → Y, where X and Y are two itemsets
• {Milk, Diapers} → {Coke} means “when you have milk and diapers, you also have Coke”
Support
• Support count (σ)
• Frequency of occurrence of an itemset
• σ({Milk, Diapers, Coke}) = 2 (i.e., it’s in baskets 3 and 5)
• Support (s)
• Fraction of transactions that contain all the items in the relationship X → Y
• s({Milk, Diapers, Coke}) = 2/5 = 0.4
• In the rule {Milk, Diapers} → {Coke}, X = {Milk, Diapers} and Y = {Coke}
• 2 baskets have milk, Coke, and diapers, out of 5 baskets total
• You can calculate support for both X and Y separately
• Support for X = 3/5 = 0.6; Support for Y = 2/5 = 0.4
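The support arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the names `BASKETS`, `support_count`, and `support` are made up for illustration, and the baskets are the five from the example table.

```python
# The five example baskets from the slides, one set of items per basket.
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support_count(itemset, baskets):
    """Support count: number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)


def support(itemset, baskets):
    """Support: fraction of baskets that contain every item in `itemset`."""
    return support_count(itemset, baskets) / len(baskets)


print(support_count({"Milk", "Diapers", "Coke"}, BASKETS))  # 2
print(support({"Milk", "Diapers", "Coke"}, BASKETS))        # 0.4
print(support({"Milk", "Diapers"}, BASKETS))                # 0.6 (X)
print(support({"Coke"}, BASKETS))                           # 0.4 (Y)
```

Storing each basket as a set makes "the basket contains the whole itemset" a single subset test (`<=`).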
Confidence
• Confidence is the strength of the
association
• Measures how often items in Y appear in
transactions that contain X
• Given we have one thing, what is our
confidence that we will see another thing?
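$$c(X \rightarrow Y) = \frac{s(X \rightarrow Y)}{s(X)} = \frac{s(\{\text{Milk, Diapers, Coke}\})}{s(\{\text{Milk, Diapers}\})} = \frac{2}{3} \approx 0.67$$

$$c(Y \rightarrow X) = \frac{s(Y \rightarrow X)}{s(Y)} = \frac{s(\{\text{Milk, Diapers, Coke}\})}{s(\{\text{Coke}\})} = \frac{2}{2} = 1.00$$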
c must be between 0 and 1
1 is a complete association
0 is no association
The first says: when you have milk and diapers in the basket, 67% of the time you also have Coke!
What does the second say?
Venn diagram representation: {Milk, Diapers} (N=3) and {Coke} (N=2) overlap in 2 baskets.
• c(X → Y) = 2/3 ≈ 0.67: the overlap is 67% of the {Milk, Diapers} total
• c(Y → X) = 2/2 = 1.00: the overlap is 100% of the {Coke} total
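The two confidence values can be reproduced in Python. This is a sketch under the same assumptions as the slides' example; `BASKETS`, `support_count`, and `confidence` are illustrative names, not part of any library.

```python
# The five example baskets from the slides.
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)


def confidence(x, y, baskets):
    """Confidence of X -> Y: share of the baskets containing X that also contain Y."""
    count_x = support_count(x, baskets)
    if count_x == 0:
        raise ValueError("X never occurs, so confidence is undefined")
    return support_count(set(x) | set(y), baskets) / count_x


# X -> Y and Y -> X use the same numerator but different denominators.
print(round(confidence({"Milk", "Diapers"}, {"Coke"}, BASKETS), 2))  # 0.67
print(round(confidence({"Coke"}, {"Milk", "Diapers"}, BASKETS), 2))  # 1.0
```

Dividing the support counts gives the same ratio as dividing the supports, because the number of baskets cancels out.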
Some sample rules
Association Rule            Support (s)   Confidence (c)
{Milk, Diapers} → {Beer}    2/5 = 0.4     2/3 = 0.67
{Milk, Beer} → {Diapers}    2/5 = 0.4     2/2 = 1.0
{Diapers, Beer} → {Milk}    2/5 = 0.4     2/3 = 0.67
{Beer} → {Milk, Diapers}    2/5 = 0.4     2/3 = 0.67
{Diapers} → {Milk, Beer}    2/5 = 0.4     2/4 = 0.5
{Milk} → {Diapers, Beer}    2/5 = 0.4     2/4 = 0.5
All the above rules are binary partitions of the same itemset: {Milk, Diapers, Beer}
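To see where the six sample rules come from, the sketch below splits {Milk, Diapers, Beer} into every possible antecedent/consequent pair and recomputes support and confidence for each. The helper names (`binary_partitions`, `support_count`) are illustrative only.

```python
from itertools import combinations

# The five example baskets from the slides.
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)


def binary_partitions(itemset):
    """Yield every (X, Y) split of `itemset` with both sides non-empty."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for x in combinations(items, size):
            yield frozenset(x), frozenset(items) - frozenset(x)


itemset = {"Milk", "Diapers", "Beer"}
# Support is the same for every partition, because X and Y together
# always make up the same itemset.
s = support_count(itemset, BASKETS) / len(BASKETS)
for x, y in binary_partitions(itemset):
    c = support_count(itemset, BASKETS) / support_count(x, BASKETS)
    print(f"{sorted(x)} -> {sorted(y)}: s = {s:.2f}, c = {c:.2f}")
```

A 3-itemset yields six rules: three with one item on the left and three with two items on the left.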
Amazon Recommendation Agent
• People who bought this also bought this
• May be a bidirectional association, or it may be unidirectional
• Why?
• What does this mean?
But don’t blindly follow the numbers
• Rules originating from the same itemset (X → Y) will have identical support
• Since the total elements (X and Y together) are always the same
• But they can have different confidence
• Depending on what is contained in X and Y
• High confidence suggests a strong association
• But this can be deceptive
• Consider {Bread} → {Diapers}
• Support for the total itemset is 0.6 (3/5)
• And confidence is 0.75 (3/4), which looks pretty high
• But is this just because both are frequently occurring items (s = 0.8 for each)?
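One way to probe the {Bread} → {Diapers} question is to compare the rule's confidence with how often diapers show up in any basket at all. The sketch below does that comparison; the function and variable names are illustrative.

```python
# The five example baskets from the slides.
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)


# Confidence of {Bread} -> {Diapers}: diaper rate among baskets with bread.
conf_bread_diapers = (
    support_count({"Bread", "Diapers"}, BASKETS) / support_count({"Bread"}, BASKETS)
)
# Baseline: diaper rate among all baskets, bread or no bread.
baseline_diapers = support_count({"Diapers"}, BASKETS) / len(BASKETS)

print(conf_bread_diapers)  # 0.75
print(baseline_diapers)    # 0.8
```

Diapers are in 80% of all baskets but only 75% of the baskets with bread, so in this data knowing that bread is in the basket does not make diapers any more likely.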
Real Association or Coincidence?
• The question…
• Is the association real (e.g., causal) or is it due to chance?
• Confidence can help you get at the direction of association…
• c({HouseFire} → {Firemen}) = 1.00
• c({Firemen} → {HouseFire}) = 0.10
• But how can we tell co-occurrence from real association?
• E.g., complementary goods or similar goods?
• Bread & Butter: one increases the probability of the other
• Bread & Milk: both are just high-probability items
Lift
• Takes into account how co-occurrence differs from what is
expected by chance
• i.e., if items were selected independently from one another
• Independent events (no interdependence between A, B)…
$$P(A \cap B) = P(A) \times P(B)$$
• Based on the support metric
$$\text{Lift} = \frac{s(X \rightarrow Y)}{s(X) \times s(Y)}$$
• Numerator: support for the total itemset, X and Y
• Denominator: support for X times support for Y
Lift Example
• What’s the lift of the association rule {Milk, Diapers} → {Beer}?
• So X = {Milk, Diapers} and Y = {Beer}
s({Milk, Diapers, Beer}) = 2/5 = 0.4
s({Milk, Diapers}) = 3/5 = 0.6
s({Beer}) = 3/5 = 0.6
So
$$\text{Lift} = \frac{0.4}{0.6 \times 0.6} = \frac{0.4}{0.36} \approx 1.11$$
When Lift > 1, the occurrence of X and Y together is more likely than what you would expect by chance
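The lift calculation can be written as a small function. This is a sketch over the example baskets; `support` and `lift` are illustrative names rather than a library API.

```python
# The five example baskets from the slides.
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket) / len(baskets)


def lift(x, y, baskets):
    """Support of X and Y together, divided by what independence would predict."""
    expected = support(x, baskets) * support(y, baskets)
    if expected == 0:
        raise ValueError("X or Y never occurs, so lift is undefined")
    return support(set(x) | set(y), baskets) / expected


print(round(lift({"Milk", "Diapers"}, {"Beer"}, BASKETS), 2))  # 1.11
```

Swapping X and Y leaves the result unchanged, so unlike confidence, lift says nothing about the direction of a rule.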
Another example
                     Checking Account
Savings Account      No       Yes      Total
No                   500      3500     4000
Yes                  1000     5000     6000
Total                                  10000

Are people more likely to have a checking account if they have a savings account?
Support ({Savings} → {Checking}) = 5000/10000 = 0.5
Support ({Savings}) = 6000/10000 = 0.6
Support ({Checking}) = 8500/10000 = 0.85
Confidence ({Savings} → {Checking}) = 5000/6000 = 0.83
$$\text{Lift} = \frac{0.5}{0.6 \times 0.85} = \frac{0.5}{0.51} \approx 0.98$$
Answer: No. In fact, it’s slightly less than what you’d expect by chance!
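The same measures can be computed straight from the four cells of the savings/checking table. The sketch below assumes the counts are keyed by (has savings, has checking); the dictionary layout and variable names are illustrative.

```python
# Customer counts from the table, keyed by (has_savings, has_checking).
COUNTS = {
    (False, False): 500,
    (False, True): 3500,
    (True, False): 1000,
    (True, True): 5000,
}

total = sum(COUNTS.values())                                    # 10000 customers
n_savings = sum(n for (sav, _), n in COUNTS.items() if sav)     # 6000
n_checking = sum(n for (_, chk), n in COUNTS.items() if chk)    # 8500
n_both = COUNTS[(True, True)]                                   # 5000

s_both = n_both / total          # support of {Savings} -> {Checking}
s_savings = n_savings / total
s_checking = n_checking / total
confidence = n_both / n_savings
lift = s_both / (s_savings * s_checking)

print(round(confidence, 2))  # 0.83
print(round(lift, 2))        # 0.98
```

Confidence looks high at 0.83, but 85% of all customers have a checking account anyway, which is why the lift comes out just under 1.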
Selecting the rules
• We know how to calculate the
measures for each rule
• Support
• Confidence
• Lift
• Then we set up thresholds
for the minimum rule strength
we want to accept
• For support – called minsup
• For confidence – called minconf
The steps
• List all possible
association rules
• Compute the support
and confidence for
each rule
• Drop rules that don’t
make the minsup and
minconf thresholds
• Use lift to double-check the association
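The steps above can be sketched as a brute-force search over the five example baskets. This is a naive illustration, not an efficient algorithm: the function names are made up, and the thresholds (minsup = 0.4, minconf = 0.7) are arbitrary values chosen for the demo.

```python
from itertools import combinations

# The five example baskets from the slides.
BASKETS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def candidate_rules(baskets):
    """Step 1: yield every rule X -> Y with X and Y non-empty and disjoint."""
    items = sorted(set().union(*baskets))
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            for k in range(1, size):
                for x in combinations(itemset, k):
                    yield frozenset(x), frozenset(itemset) - frozenset(x)


def mine_rules(baskets, minsup, minconf):
    """Steps 2 to 4: score each rule, drop weak ones, attach lift to the rest."""
    n = len(baskets)
    kept = []
    for x, y in candidate_rules(baskets):
        n_xy = count(x | y, baskets)
        if n_xy == 0:
            continue  # X and Y never occur together (also avoids dividing by zero)
        n_x, n_y = count(x, baskets), count(y, baskets)
        s = n_xy / n      # support
        c = n_xy / n_x    # confidence
        if s < minsup or c < minconf:
            continue      # fails the minsup or minconf threshold
        rule_lift = s / ((n_x / n) * (n_y / n))
        kept.append((x, y, s, c, rule_lift))
    return kept


for x, y, s, c, rule_lift in mine_rules(BASKETS, minsup=0.4, minconf=0.7):
    print(f"{sorted(x)} -> {sorted(y)}: s={s:.2f}, c={c:.2f}, lift={rule_lift:.2f}")
```

Rules that survive both thresholds but have a lift near or below 1 are the ones to treat with suspicion.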
But this can be overwhelming
• Imagine all the combinations possible
in your local grocery store
• Every product matched with every
combination of other products
• Tens of thousands of possible rule
combinations
So where do you start?
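To get a feel for the scale, the sketch below counts the possible rules for d distinct items. It uses a standard counting result that is not stated on the slides: with X and Y both non-empty and disjoint, there are 3^d − 2^(d+1) + 1 rules. The function names are illustrative, and the brute-force version is there only to check the formula for small d.

```python
from itertools import product


def count_rules_formula(d):
    """Closed-form number of rules X -> Y over d items (X, Y non-empty, disjoint)."""
    return 3**d - 2 ** (d + 1) + 1


def count_rules_brute_force(d):
    """Count rules by assigning each item to X, to Y, or to neither."""
    total = 0
    for assignment in product("XYN", repeat=d):
        if "X" in assignment and "Y" in assignment:
            total += 1
    return total


# The two counts agree for small d.
for d in range(1, 8):
    assert count_rules_formula(d) == count_rules_brute_force(d)

print(count_rules_formula(6))    # 602 (the six distinct items in the example baskets)
print(count_rules_formula(10))   # 57002
```

Ten products are already enough to reach tens of thousands of candidate rules, and a real store carries far more than ten.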
Once you are confident in a rule, take
action
{Milk, Diapers} → {Beer}
Possible Marketing Actions
• Lower the price of milk and diapers, raise it
on beer
• Put beer next to one of those products in
the store
• Put beer in a special section of the store
away from milk and diapers
• Create “New Parent Coping Kits” of beer,
milk, and diapers
• Target new beers to people who buy milk
and diapers