Apriori for Mining Association Rules
Fast Algorithms for Mining
Association Rules
Rakesh Agrawal
Ramakrishnan Srikant
Slides from Ofer Pasternak
Data Mining Seminar 2003
Introduction
Bar-Code technology
Mining Association Rules over basket
data (93)
Tires ∧ accessories ⇒ automotive service
Cross-marketing, attached mailing.
Very large databases.
©Ofer Pasternak
Notation
Items – I = {i1,i2,…,im}
Transaction – set of items T ⊆ I
– Items are sorted lexicographically
TID – unique identifier for each
transaction
Notation
Association Rule – X ⇒ Y
X ⊂ I, Y ⊂ I and X ∩ Y = ∅
Confidence and Support
Association rule X ⇒ Y has confidence c if
c% of the transactions in D that contain
X also contain Y.
Association rule X ⇒ Y has support s if
s% of the transactions in D contain both X
and Y (that is, X ∪ Y).
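The two measures can be sketched in a few lines of Python. This is only an illustration: the function names and the basket data are invented here, and the values are returned as fractions rather than percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction (otherwise this divides by zero).
    """
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical basket data, made up for this example.
D = [
    frozenset({"tires", "accessories", "service"}),
    frozenset({"tires", "accessories"}),
    frozenset({"tires", "service"}),
    frozenset({"accessories"}),
]

# Rule: tires ∧ accessories ⇒ service
s = support({"tires", "accessories", "service"}, D)       # 1 of 4 transactions
c = confidence({"tires", "accessories"}, {"service"}, D)  # 1 of the 2 containing X
```

Here the rule has support 0.25 and confidence 0.5.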
Define the Problem
Given a set of transactions D, generate
all association rules that have support
and confidence greater than the
user-specified minimum support and
minimum confidence.
Discovering all Association
Rules
Find all Large itemsets
– itemsets with support above minimum
support.
Use Large itemsets to generate the
rules.
General idea
Say ABCD and AB are large itemsets
Compute
conf = support(ABCD) / support(AB)
If conf ≥ minconf
AB ⇒ CD holds.
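A minimal sketch of this check, assuming the supports of the large itemsets are already available as counts. The numbers, the `minconf` value and the function name are hypothetical.

```python
def rule_confidence(antecedent, itemset, support_counts):
    """conf(antecedent => itemset - antecedent) = support(itemset) / support(antecedent)."""
    return support_counts[frozenset(itemset)] / support_counts[frozenset(antecedent)]


# Hypothetical support counts for two large itemsets.
support_counts = {
    frozenset("ABCD"): 30,
    frozenset("AB"): 40,
}
minconf = 0.7  # assumed user-specified threshold

conf = rule_confidence("AB", "ABCD", support_counts)  # 30 / 40
holds = conf >= minconf
consequent = frozenset("ABCD") - frozenset("AB")      # the CD part of the rule
```

With these numbers conf is 0.75, so AB ⇒ CD would be reported.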
Discovering Large Itemsets
Multiple passes over the data
First pass – count the support of individual
items.
Subsequent passes
– Generate candidates using the previous pass’s large
itemsets.
– Go over the data and check the actual support
of the candidates.
Stop when no new large itemsets are found.
The Trick
Any subset of a large itemset is large.
Therefore
To find large k-itemsets
– Create candidates by combining large (k-1)-itemsets.
– Delete those that contain any subset
that is not large.
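The reason the trick works can be written out in one line. The notation is an assumption of this note: support(X) is the fraction of transactions in D that contain X.

```latex
X \subseteq Y
\;\Longrightarrow\;
\{\, t \in D : Y \subseteq t \,\} \subseteq \{\, t \in D : X \subseteq t \,\}
\;\Longrightarrow\;
\mathrm{support}(X) \ge \mathrm{support}(Y) \ge \mathit{minsup}
\quad \text{whenever } Y \text{ is large.}
```

Every transaction that contains Y also contains X, so a candidate with a subset that is not large cannot itself be large.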
Algorithm Apriori
L1 = {large 1-itemsets};                      // Count item occurrences
for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
    Ck = apriori-gen(Lk-1);                   // Generate new k-itemset candidates
    forall transactions t ∈ D do begin        // Find the support of all the candidates
        Ct = subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = { c ∈ Ck | c.count ≥ minsup }        // Take only those with support over minsup
end
Answer = ∪k Lk;
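The loop above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: itemsets are sorted tuples, `minsup_count` is assumed to be an absolute transaction count rather than a percentage, the subset step is a plain scan over all candidates instead of a hash-tree lookup, and the four-transaction database is purely illustrative.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Join large (k-1)-itemsets sharing their first k-2 items, then prune."""
    prev = set(large_prev)
    candidates = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must itself be large.
                if all(s in prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup_count):
    """Return {itemset: support count} for every large itemset."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: count item occurrences.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)
    # Subsequent passes: stop when no new large itemsets are found.
    while large:
        candidates = apriori_gen(large)
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:          # naive stand-in for subset(Ck, t)
                if t.issuperset(c):
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
    return answer


# Illustrative database, not taken from the slides.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, minsup_count=2)
```

On this data the sketch finds four large 1-itemsets, four large 2-itemsets and the single large 3-itemset (2, 3, 5).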
Candidate generation
Join step
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
– p and q are two large (k-1)-itemsets identical in their first k-2 items.
– Join by adding the last item of q to p.
Prune step
    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck;
– Check all the subsets; remove a candidate with a “small” subset.
Example
L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
After joining
{ {1 2 3 4}, {1 3 4 5} }
After pruning
{ {1 2 3 4} }
{1 3 4 5} is removed because {1 4 5} and {3 4 5} are not in L3.
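The join and prune steps can be sketched separately and checked against the example above. The function names are made up for this illustration, and itemsets are assumed to be sorted tuples.

```python
from itertools import combinations


def join_step(large_prev):
    """Extend p with the last item of q when p and q agree on all but the last item."""
    prev = sorted(large_prev)
    return [
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]  # the '<' prevents duplicates
    ]


def prune_step(candidates, large_prev):
    """Drop every candidate that has a (k-1)-subset outside the large (k-1)-itemsets."""
    prev = set(large_prev)
    return [
        c for c in candidates
        if all(s in prev for s in combinations(c, len(c) - 1))
    ]


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined = join_step(L3)
C4 = prune_step(joined, L3)
```

`joined` contains (1, 2, 3, 4) and (1, 3, 4, 5), and only (1, 2, 3, 4) survives pruning.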
Correctness
Show that Ck ⊇ Lk
Any subset of a large itemset must also be large.
Join is equivalent to extending Lk-1 with all items and removing those whose (k-1)-subsets are not in Lk-1.
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;    // Prevents duplications
    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck;
Subset Function
Candidate itemsets Ck are stored in a hash-tree.
Finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
Total time O(max(k, size(t))).
    L1 = {large 1-itemsets};
    for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
        Ck = apriori-gen(Lk-1);
        forall transactions t ∈ D do begin
            Ct = subset(Ck, t);
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = { c ∈ Ck | c.count ≥ minsup }
    end
    Answer = ∪k Lk;
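A rough sketch of a subset function. It is not the hash-tree from the slide: a plain prefix tree built from nested dictionaries stands in for it, and no claim is made that it meets the time bounds quoted above. The names `build_trie` and `subset`, and the use of `None` as an end-of-itemset marker, are assumptions of this example (items themselves must not be `None`).

```python
def build_trie(candidates):
    """Store sorted candidate tuples in a prefix tree of nested dicts."""
    root = {}
    for c in candidates:
        node = root
        for item in c:
            node = node.setdefault(item, {})
        node[None] = c  # marks the end of a complete candidate
    return root


def subset(trie, transaction):
    """Return all stored candidates that are contained in `transaction`."""
    t = sorted(transaction)
    found = []

    def walk(node, start):
        if None in node:
            found.append(node[None])
        # Try every remaining item of the transaction as the next item.
        for i in range(start, len(t)):
            child = node.get(t[i])
            if child is not None:
                walk(child, i + 1)

    walk(trie, 0)
    return found


# Hypothetical candidate 2-itemsets and one transaction.
C2 = [(1, 2), (1, 3), (2, 3), (2, 5), (3, 5)]
Ct = subset(build_trie(C2), {2, 3, 5})
```

For the transaction {2, 3, 5} this returns the candidates (2, 3), (2, 5) and (3, 5), whose counts would then be incremented.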