
Data Mining Association Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 6, Introduction to Data Mining

By Tan, Steinbach, Kumar

Lecture 8: Basic Association Analysis

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

Itemset
- A collection of one or more items
  Example: {Milk, Bread, Diaper}
- k-itemset: an itemset that contains k items

Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold

(Counts refer to the market-basket transactions above.)

Definition: Association Rule

Association Rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
- Support (s): fraction of transactions that contain both X and Y
- Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
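To make these two metrics concrete, here is a minimal Python sketch (not part of the original slides; all names are illustrative) that computes s and c for the example rule over the five market-basket transactions:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```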

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold

Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Mining Association Rules

Example of Rules (from the market-basket transactions above):

{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:

1. Frequent Itemset Generation
   - Generate all itemsets whose support ≥ minsup

2. Rule Generation
   - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation

[Figure: the itemset lattice over items A-E, from the null set at the top down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database: match each of the N transactions (of width up to w) against every one of the M candidates
- Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
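To see the blow-up concretely, the following Python sketch (illustrative names only) implements the brute-force scheme literally: it enumerates all 2^d - 1 non-empty candidates and scans every transaction for each one:

```python
from itertools import combinations

def brute_force_counts(transactions):
    """Count every non-empty candidate itemset (M = 2^d - 1 of them)
    against every transaction: the O(N*M*w) approach described above."""
    items = sorted({i for t in transactions for i in t})   # d unique items
    counts = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):                # all candidates
            cset = frozenset(cand)
            counts[cset] = sum(1 for t in transactions if cset <= t)
    return counts

txns = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"}]
counts = brute_force_counts(txns)
print(len(counts))                                    # 2^6 - 1 = 63 candidates
print(counts[frozenset({"Milk", "Diaper", "Beer"})])  # 2
```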

Computational Complexity

Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules:

  R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules.
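The closed form can be checked numerically; this small Python sketch (illustrative only) evaluates the double sum and compares it with 3^d - 2^(d+1) + 1:

```python
from math import comb

def num_rules(d):
    """Number of possible association rules over d items, per the formula above."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6))                      # 602
assert num_rules(6) == 3**6 - 2**7 + 1   # closed form 3^d - 2^(d+1) + 1
```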

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
- Complete search: M = 2^d
- Use pruning techniques to reduce M

Reduce the number of transactions (N)
- Reduce the size of N as the size of the itemset increases
- Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
- Use efficient data structures to store the candidates or transactions
- No need to match every candidate against every transaction

Reducing Number of Candidates

Apriori principle:
- If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

- The support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support

Illustrating Apriori Principle

[Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned.]

Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:

Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Triplets (3-itemsets):

Itemset               Count
{Bread,Milk,Diaper}   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm

Method:
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Prune candidate itemsets containing subsets of length k that are infrequent
  - Count the support of each candidate by scanning the DB
  - Eliminate candidates that are infrequent, leaving only those that are frequent
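The method above translates almost line for line into Python. The sketch below is illustrative rather than the textbook's reference implementation: candidate generation simply merges any two frequent k-itemsets whose union has k+1 items, which is correct but less efficient than the usual F(k-1) × F(k-1) join.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset mining; transactions are frozensets."""
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup_count}
    all_frequent, k = set(freq), 1
    while freq:
        # Generate (k+1)-candidates by merging frequent k-itemsets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset (Apriori principle).
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count support by scanning the DB; eliminate infrequent candidates.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= minsup_count}
        all_frequent |= freq
        k += 1
    return all_frequent

txns = [frozenset(t) for t in (
    {"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"})]
for s in sorted(apriori(txns, 3), key=len):
    print(sorted(s))
# 1-itemsets: Beer, Bread, Diaper, Milk
# 2-itemsets: {Beer,Diaper}, {Bread,Diaper}, {Bread,Milk}, {Diaper,Milk}
# 3-itemset:  {Bread,Diaper,Milk}
```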

Reducing Number of Comparisons

Candidate counting:
- Scan the database of transactions to determine the support of each candidate itemset
- To reduce the number of comparisons, store the candidates in a hash structure
- Instead of matching each transaction against every candidate, match it against only the candidates contained in the hashed buckets

[Figure: the N transactions are matched against a hash structure of k buckets holding the candidates.]

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
- A hash function
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

[Figure: the resulting hash tree; the hash function sends items 1, 4, 7 to the first branch, 2, 5, 8 to the second, and 3, 6, 9 to the third.]
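One way such a hash tree could be realized in Python is sketched below, assuming the slide's hash function h(item) = item mod 3 and a max leaf size of 3; the class and method names are invented for illustration:

```python
MAX_LEAF = 3  # assumed max leaf size, matching the slide's example

class HashTreeNode:
    """Sketch of a hash tree node for candidate 3-itemsets. Interior nodes
    hash on the item at their depth; leaf nodes store candidate itemsets."""
    def __init__(self, depth=0):
        self.depth = depth
        self.children = {}    # bucket -> HashTreeNode, when interior
        self.itemsets = []    # stored candidates, when leaf

    def insert(self, itemset):
        if self.children:                          # interior: route downward
            self._child(itemset[self.depth]).insert(itemset)
            return
        self.itemsets.append(itemset)
        # Split an overfull leaf, unless there are no more items to hash on.
        if len(self.itemsets) > MAX_LEAF and self.depth < len(itemset):
            for s in self.itemsets:
                self._child(s[self.depth]).insert(s)
            self.itemsets = []

    def _child(self, item):
        bucket = item % 3      # h(x): 1,4,7 -> 1; 2,5,8 -> 2; 3,6,9 -> 0
        return self.children.setdefault(bucket, HashTreeNode(self.depth + 1))

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7),
              (3,6,8)]
root = HashTreeNode()
for c in candidates:
    root.insert(c)
```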

Association Rule Discovery: Hash tree

[Figure, shown in three steps: building the candidate hash tree. The hash function maps items 1, 4, 7 to the first branch, 2, 5, 8 to the second, and 3, 6, 9 to the third; each candidate is routed by hashing on its first, then second, then third item.]

Subset Operation

Given a transaction t, what are the possible subsets of size 3?

[Figure: enumerating the 3-subsets of transaction t = {1 2 3 5 6} level by level. Level 1 fixes the first item (1, 2, or 3), Level 2 fixes the second, and Level 3 lists the resulting subsets of 3 items: {1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}.]
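For reference, this enumeration is exactly lexicographic combinations; a tiny Python check (illustrative) reproduces the ten subsets in the figure:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)
# combinations() emits 3-subsets in the same lexicographic (prefix) order
# that the level-wise enumeration in the figure produces.
print(list(combinations(t, 3)))
# [(1,2,3), (1,2,5), (1,2,6), (1,3,5), (1,3,6), (1,5,6),
#  (2,3,5), (2,3,6), (2,5,6), (3,5,6)]
```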

Subset Operation Using Hash Tree

[Figure, shown in three steps: matching transaction {1 2 3 5 6} against the hash tree. The root splits the transaction into the prefixes 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}; each prefix is recursively hashed down the tree, so the transaction is matched against only 9 out of the 15 candidates.]

Factors Affecting Complexity

Choice of minimum support threshold
- Lowering the support threshold results in more frequent itemsets
- This may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
- More space is needed to store the support count of each item
- If the number of frequent items also increases, both computation and I/O costs may increase

Size of database
- Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
- Transaction width increases with denser data sets
- This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)

Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support to their supersets.

[Table: 15 transactions over 30 items A1-A10, B1-B10, C1-C10; transactions 1-5 contain exactly the items A1-A10, transactions 6-10 exactly B1-B10, and transactions 11-15 exactly C1-C10.]

Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10,k)

⇒ Need a compact representation.

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice over items A-E with the border between frequent and infrequent itemsets; the maximal itemsets are the frequent itemsets lying immediately inside the border.]

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support
{A}      4
{B}      5
{C}      3
{D}      4
{A,B}    4
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3

Itemset    Support
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
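A direct, hedged translation of this definition into Python (brute force over all itemsets; names illustrative) picks out the closed itemsets of the example database:

```python
from itertools import combinations

transactions = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"},
                {"A","B","D"}, {"A","B","C","D"}]
items = sorted({i for t in transactions for i in t})

# Support count of every non-empty itemset (brute force is fine at this size).
support = {frozenset(c): sum(1 for t in transactions if set(c) <= t)
           for k in range(1, len(items) + 1)
           for c in combinations(items, k)}

# Closed: no immediate superset (one extra item) has the same support.
closed = [s for s, n in support.items()
          if all(support[s | {e}] != n for e in items if e not in s)]
for s in sorted(closed, key=len):
    print(sorted(s), support[s])
# ['B'] 5, ['A','B'] 4, ['B','D'] 4, ['A','B','D'] 3,
# ['B','C','D'] 3, ['A','B','C','D'] 2
```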

Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice annotated with the ids of the supporting transactions, e.g. A: 124, B: 123, C: 1234, D: 245, E: 345; itemsets supported by no transaction (e.g. ABCE, ABDE, ABCDE) are marked.]

Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Figure: the same annotated lattice, with the closed frequent itemsets marked as either "closed but not maximal" or "closed and maximal".]

# Closed = 9
# Maximal = 4

Maximal vs Closed Itemsets

[Figure: nested sets showing that maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.]

Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice
- General-to-specific vs specific-to-general

[Figure: three copies of the lattice between the null set and {a1, a2, ..., an}, showing the frequent itemset border under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional search.]

Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice
- Equivalence classes

[Figure: the lattice over items A-D partitioned into equivalence classes by (a) prefix tree and (b) suffix tree.]

Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice
- Breadth-first vs depth-first

[Figure: (a) breadth-first and (b) depth-first traversal of the itemset lattice.]

Alternative Methods for Frequent Itemset Generation

Representation of Database
- Horizontal vs vertical data layout

Horizontal Data Layout:

TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical Data Layout:

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
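Converting between the two layouts is a simple inversion; this Python sketch (illustrative names) builds the tid-lists and shows how support becomes a tid-list intersection in the vertical layout:

```python
from collections import defaultdict

horizontal = {1: {"A","B","E"}, 2: {"B","C","D"}, 3: {"C","E"},
              4: {"A","C","D"}, 5: {"A","B","C","D"}, 6: {"A","E"},
              7: {"A","B"}, 8: {"A","B","C"}, 9: {"A","C","D"}, 10: {"B"}}

# Invert TID -> items into item -> sorted tid-list (the vertical layout).
vertical = defaultdict(list)
for tid in sorted(horizontal):
    for item in horizontal[tid]:
        vertical[item].append(tid)

print(sorted(vertical.items()))
# A: [1,4,5,6,7,8,9], B: [1,2,5,7,8,10], C: [2,3,4,8,9], D: [2,4,5,9], E: [1,3,6]

# In the vertical layout, the support count of an itemset is the size of the
# intersection of its tid-lists:
print(len(set(vertical["A"]) & set(vertical["B"])))   # sigma({A,B}) = 4
```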

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:

ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L).
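Enumerating these candidates is a straightforward subset loop; a minimal Python sketch (illustrative) follows:

```python
from itertools import combinations

def candidate_rules(L):
    """Yield all 2^k - 2 candidate rules f -> L - f for a frequent itemset L."""
    L = frozenset(L)
    for r in range(1, len(L)):            # skip f = {} and f = L
        for f in combinations(sorted(L), r):
            yield frozenset(f), L - frozenset(f)

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))   # 14 == 2**4 - 2
```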

Rule Generation

How to efficiently generate rules from frequent itemsets?

- In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
- But the confidence of rules generated from the same itemset does have an anti-monotone property
- E.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for Apriori Algorithm

[Figure: lattice of rules generated from a single frequent itemset; one low-confidence rule is identified, and all rules below it in the lattice are pruned.]

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.

- join(CD → AB, BD → AC) would produce the candidate rule D → ABC
- Prune rule D → ABC if its subset AD → BC does not have high confidence
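A sketch of this merge step in Python (illustrative; it assumes both rules come from the same frequent itemset, so the merged LHS is simply everything not in the merged consequent):

```python
def merge_rules(r1, r2):
    """Merge two rules from the SAME frequent itemset whose consequents
    share a common prefix, e.g. CD -> AB and BD -> AC give D -> ABC.
    Each rule is an (lhs, rhs) pair of frozensets."""
    (lhs1, rhs1), (lhs2, rhs2) = r1, r2
    itemset = lhs1 | rhs1            # equals lhs2 | rhs2 for such rules
    rhs = rhs1 | rhs2                # enlarged consequent
    lhs = itemset - rhs
    return (lhs, rhs) if lhs else None   # drop the degenerate {} -> L rule

r = merge_rules((frozenset("CD"), frozenset("AB")),
                (frozenset("BD"), frozenset("AC")))
print(sorted(r[0]), "->", sorted(r[1]))   # ['D'] -> ['A', 'B', 'C']
```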

Effect of Support Distribution

Many real data sets have a skewed support distribution.

[Figure: support distribution of a retail data set.]

Effect of Support Distribution

How to set the appropriate minsup threshold?

- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
- If minsup is set too low, it is computationally expensive and the number of itemsets is very large

Using a single minimum support threshold may not be effective.