Data Mining Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6 Introduction to Data Mining
By Tan, Steinbach, Kumar
Lecture 8 Basic Association Analysis
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset
– A collection of one or more items
  Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

(Counts above refer to the market-basket transactions shown earlier.)
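The definitions above can be sketched in a few lines of Python (a minimal illustration; function names like `support_count` are mine, not from the slides):

```python
# The five market-basket transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```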
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
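The two metrics can be computed directly from the transaction table (an illustrative sketch; `rule_metrics` is my own name):

```python
# The five transactions from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y):
    """Return (support, confidence) of the rule X -> Y."""
    s = sigma(X | Y) / len(transactions)  # fraction containing X and Y
    c = sigma(X | Y) / sigma(X)           # of those containing X, how many contain Y
    return s, c

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"})
print(round(s, 2), round(c, 2))  # 0.4 0.67
```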
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules
Using the same market-basket transactions, example rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
(Figure: itemset lattice over five items A–E, from the null set at the top down to ABCDE at the bottom.)

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database: match each of the N transactions (of average width w) against every one of the M candidates
– Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1..d−1} [ C(d, k) × Σ_{j=1..d−k} C(d−k, j) ]
    = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.
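The rule count can be verified by brute-force enumeration against the closed form (an illustrative check; `num_rules` is my own name):

```python
from math import comb

def num_rules(d):
    """Count rules: choose k items for the LHS, then j of the
    remaining d-k items for the RHS, for all valid k and j."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6))        # 602
print(3**6 - 2**7 + 1)     # 602, the closed form 3^d - 2^(d+1) + 1
```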
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
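The anti-monotone property can be checked exhaustively on the small example data set (an illustrative sketch, not part of the slides):

```python
from itertools import combinations

# The five example transactions over six items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def support(X):
    return sum(1 for t in transactions if X <= t) / len(transactions)

# For every itemset Y and every proper subset X of Y, s(X) >= s(Y).
ok = all(
    support(set(X)) >= support(set(Y))
    for r in range(1, len(items) + 1)
    for Y in combinations(items, r)
    for q in range(1, r)
    for X in combinations(Y, q)
)
print(ok)  # True
```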
Illustrating Apriori Principle
(Figure: itemset lattice in which one itemset is found to be infrequent, so all of its supersets are pruned.)
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                 Count
{Bread, Milk, Diaper}   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
Reducing Number of Comparisons
Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure
– Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets

(Figure: N transactions matched against the k buckets of a hash structure holding the candidates.)
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
• A hash function (here, items 1, 4, 7 hash to the first branch; 2, 5, 8 to the second; 3, 6, 9 to the third)
• Max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

(Figure: the resulting hash tree with the 15 candidates distributed over its leaves.)
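The routing step can be sketched with a hash of the form h(i) = (i − 1) mod 3, which is one way (my assumption, not stated on the slide) to send 1, 4, 7 to branch 0, 2, 5, 8 to branch 1, and 3, 6, 9 to branch 2:

```python
def h(item):
    """Route item 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]

# Route every candidate one level down the tree by its first item.
buckets = {0: [], 1: [], 2: []}
for c in candidates:
    buckets[h(c[0])].append(c)

for b in buckets:
    print(b, len(buckets[b]))
```

A real hash tree would recurse on the second and third items whenever a node holds more candidates than the max leaf size; this shows only the first split.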
Association Rule Discovery: Hash tree
(Figures: the candidate hash tree built with hash function 1,4,7 / 2,5,8 / 3,6,9, highlighting in turn the subtrees reached by hashing on 1, 4 or 7; on 2, 5 or 8; and on 3, 6 or 9.)
Subset Operation
Given a transaction t = {1 2 3 5 6}, what are the possible subsets of size 3?

(Figure: enumeration tree for t. Level 1 fixes the first item (1, 2, or 3), Level 2 the second, and Level 3 lists the ten 3-item subsets: {1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}.)
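The same enumeration is a one-liner with the standard library (illustrative sketch):

```python
from itertools import combinations

# All size-3 subsets of the transaction, in the order of the enumeration tree.
t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))
print(len(subsets))  # 10, i.e. C(5, 3)
print(subsets[0])    # (1, 2, 3)
```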
Subset Operation Using Hash Tree
(Figures: the transaction {1 2 3 5 6} is split recursively into prefix-plus-remainder pairs (1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}, then 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, and so on), and each prefix is hashed down the candidate tree with the 1,4,7 / 2,5,8 / 3,6,9 hash function. In the end the transaction is matched against only 9 of the 15 candidates.)
Factors Affecting Complexity
Choice of minimum support threshold
– Lowering the support threshold results in more frequent itemsets
– This may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
– More space is needed to store the support count of each item
– If the number of frequent items also increases, both computation and I/O costs may also increase

Size of database
– Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– Transaction width increases with denser data sets
– This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have identical support as their supersets.

(Figure: a 15-transaction binary data set over 30 items A1–A10, B1–B10, C1–C10, in which transactions 1–5 contain exactly the items A1–A10, transactions 6–10 exactly B1–B10, and transactions 11–15 exactly C1–C10.)

Number of frequent itemsets = 3 × Σ_{k=1..10} C(10, k)

We need a compact representation.
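The count follows because each block of 10 always-co-occurring items contributes every non-empty subset of itself (a quick check of the formula above):

```python
from math import comb

# 3 blocks, each contributing all non-empty subsets of its 10 items.
n = 3 * sum(comb(10, k) for k in range(1, 11))
print(n)  # 3069, i.e. 3 * (2^10 - 1)
```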
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent.

(Figure: itemset lattice over A–E with a border separating frequent from infrequent itemsets; the maximal frequent itemsets lie just inside the border.)
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset   Support      Itemset     Support
{A}       4            {A,B,C}     2
{B}       5            {A,B,D}     3
{C}       3            {A,C,D}     2
{D}       4            {B,C,D}     3
{A,B}     4            {A,B,C,D}   2
{A,C}     2
{A,D}     3
{B,C}     3
{B,D}     4
{C,D}     3
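The definition translates directly into a check over the table above (an illustrative sketch; `is_closed` is my own name):

```python
from itertools import combinations

# The five transactions from the closed-itemset table.
transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = sorted(set().union(*transactions))

def sup(X):
    return sum(1 for t in transactions if X <= t)

def is_closed(X):
    """X is closed if every immediate superset has strictly lower support."""
    return all(sup(X | {i}) < sup(X) for i in items if i not in X)

closed = [set(X) for r in range(1, len(items) + 1)
          for X in map(set, combinations(items, r)) if is_closed(X)]
print(closed)
```

For this data set the closed itemsets come out as {B}, {A,B}, {B,D}, {A,B,D}, {B,C,D}, and {A,B,C,D}; for example {A} is not closed because its superset {A,B} has the same support, 4.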
Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

(Figure: itemset lattice annotated with the IDs of the transactions supporting each itemset; e.g. A is supported by transactions 1, 2, 4 and C by 1, 2, 3, 4. Some itemsets, such as ABE, are not supported by any transaction.)
Maximal vs Closed Frequent Itemsets
Minimum support = 2

(Figure: the same annotated lattice, with each frequent itemset marked as either closed but not maximal or closed and maximal.)

# Closed = 9
# Maximal = 4
Maximal vs Closed Itemsets
(Figure: nested sets showing that maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)
Alternative Methods for Frequent Itemset Generation
Traversal of Itemset Lattice
– General-to-specific vs specific-to-general

(Figure: three sketches of the frequent itemset border in the lattice between null and {a1, a2, ..., an}: (a) general-to-specific, (b) specific-to-general, (c) bidirectional.)
Alternative Methods for Frequent Itemset Generation
Traversal of Itemset Lattice
– Equivalence classes

(Figure: the lattice over {A, B, C, D} partitioned into equivalence classes (a) by prefix (prefix tree) and (b) by suffix (suffix tree).)
Alternative Methods for Frequent Itemset Generation
Traversal of Itemset Lattice
– Breadth-first vs depth-first

(Figure: (a) breadth-first and (b) depth-first traversal of the lattice.)
Alternative Methods for Frequent Itemset Generation
Representation of Database
– Horizontal vs vertical data layout

Horizontal Data Layout          Vertical Data Layout
TID  Items                      A: 1, 4, 5, 6, 7, 8, 9
1    A,B,E                      B: 1, 2, 5, 7, 8, 10
2    B,C,D                      C: 2, 3, 4, 8, 9
3    C,E                        D: 2, 4, 5, 9
4    A,C,D                      E: 1, 3, 6
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B
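Converting the horizontal layout into the vertical (item to TID-list) layout is a single pass over the transactions (illustrative sketch):

```python
# The ten transactions from the horizontal layout above.
horizontal = {1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
              6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B"}

# Invert: for each item, collect the IDs of transactions containing it.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, []).append(tid)

print(vertical["A"])  # [1, 4, 5, 6, 7, 8, 9]
print(vertical["E"])  # [1, 3, 6]
```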
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC → D,  ABD → C,  ACD → B,  BCD → A,
  AB → CD,  AC → BD,  AD → BC,  BC → AD,  BD → AC,  CD → AB,
  A → BCD,  B → ACD,  C → ABD,  D → ABC

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring ∅ → L and L → ∅).
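The 2^k − 2 candidate rules for a k-itemset can be enumerated directly (illustrative sketch):

```python
from itertools import combinations

# Every non-empty proper subset f of L yields the rule f -> L - f.
L = {"A", "B", "C", "D"}
rules = [(set(f), L - set(f))
         for r in range(1, len(L))  # skip the empty set and L itself
         for f in combinations(sorted(L), r)]

print(len(rules))  # 14, i.e. 2**4 - 2
```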
Rule Generation
How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
  e.g., for L = {A,B,C,D}:  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation for Apriori Algorithm
Lattice of rules Low Confidence Rule
Pruned Rules
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Rule Generation for Apriori Algorithm
A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:

  join(CD ⇒ AB, BD ⇒ AC) produces the candidate rule D ⇒ ABC

Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence.
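The level-wise idea can be sketched by growing consequents only from rules that already met minconf, mirroring the lattice pruning above (an illustrative sketch with my own names, not the textbook's procedure):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(X):
    return sum(1 for t in transactions if X <= t)

def gen_rules(L, minconf):
    """Level-wise rule generation from one frequent itemset L."""
    L = frozenset(L)
    rules, consequents = [], [frozenset([i]) for i in L]
    while consequents:
        kept = []
        for Y in consequents:
            X = L - Y
            # c(X -> Y) = sigma(L) / sigma(X)
            if X and sigma(L) / sigma(X) >= minconf:
                rules.append((set(X), set(Y)))
                kept.append(Y)
        # Grow consequents only from the survivors (anti-monotone pruning).
        consequents = list({a | b for a in kept for b in kept
                            if len(a | b) == len(kept[0]) + 1}) if kept else []
    return rules

rules = gen_rules({"Milk", "Diaper", "Beer"}, minconf=0.6)
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs))
```

On this data set, {Diaper} → {Milk, Beer} and {Milk} → {Diaper, Beer} (confidence 0.5) fail minconf = 0.6, so no 3-item consequents are attempted.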
Effect of Support Distribution
Many real data sets have a skewed support distribution.

(Figure: support distribution of a retail data set.)
Effect of Support Distribution
How to set the appropriate minsup threshold?
– If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
– If minsup is set too low, it is computationally expensive and the number of itemsets is very large

Using a single minimum support threshold may not be effective.