A Scalable Association Rules Mining Algorithm
Based on Sorting, Indexing and Trimming
Chuang-Kai Chiou, Judy C. R. Tseng
Proceedings of the Sixth International Conference on Machine Learning and Cybernetics
Hong Kong, 19-22 August 2007
Outline
Introduction
Apriori Algorithm
DHP Algorithm
MPIP Algorithm
SIT Algorithm
Experiment and Evaluation
Conclusion and Future Work
Introduction

Apriori algorithm
  A large number of candidate itemsets is generated.

Several hash-based algorithms use hash functions to filter out unpromising candidate itemsets.
  DHP algorithm
  MPIP algorithm

SIT algorithm
  Uses sorting, indexing, and trimming techniques to reduce the number of itemsets to be considered.
  Combines the advantages of the Apriori and MPIP algorithms.
Apriori Algorithm
Database D
TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D -> C1
itemset    sup.
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

L1
itemset    sup.
{1}        2
{2}        3
{3}        3
{5}        3

C2
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D -> C2 with support
itemset    sup
{1 2}      1
{1 3}      2
{1 5}      1
{2 3}      2
{2 5}      3
{3 5}      2

L2
itemset    sup
{1 3}      2
{2 3}      2
{2 5}      3
{3 5}      2

C3
itemset
{2 3 5}

Scan D -> L3
itemset    sup
{2 3 5}    2
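
The walk-through above can be reproduced with a short sketch. This is a minimal illustration of the level-wise Apriori loop (join, prune, scan), not the authors' implementation. The function name `apriori` and the `min_count` parameter are made up for this example, and a minimum support count of 2 is assumed because that is what the L1 and L2 tables imply.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return [L1, L2, ...], each a dict mapping a frozenset itemset to its support count."""
    # First scan: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}

    levels = []
    k = 1
    while level:
        levels.append(level)
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in level for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Scan the database again to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
    return levels

# Database D from the slide (TIDs 100, 200, 300, 400).
database = [
    frozenset({1, 3, 4}),
    frozenset({2, 3, 5}),
    frozenset({1, 2, 3, 5}),
    frozenset({2, 5}),
]
levels = apriori(database, min_count=2)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", {tuple(sorted(s)): c for s, c in level.items()})
```

Each pass through the loop rescans the whole database for every candidate, which is the cost the hash-based algorithms in the following slides try to reduce.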
DHP Algorithm
MPIP Algorithm(1/2)

MPIP employs a minimal perfect hashing function for mining L1 and L2.
  It avoids the collision problem that occurs in DHP.
  The time needed for scanning and searching data items is reduced.

It employs the Apriori algorithm to find the frequent k-itemsets for k>2.
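
To make the idea of collision-free counting concrete, here is a small sketch. The slides do not give the hashing function MPIP actually uses, so the triangular index below is a stand-in: a well-known mapping that assigns every pair of item indices its own slot with no unused slots. The names `pair_slot` and `count_pairs` are hypothetical, and the items are assumed to be renumbered 0..n-1 beforehand.

```python
from itertools import combinations

def pair_slot(i, j, n):
    """Map item indices 0 <= i < j < n to a unique slot in range(n * (n - 1) // 2)."""
    # Pairs starting with a smaller first index occupy the earlier slots.
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

def count_pairs(transactions, n):
    """Count all 2-itemsets in one scan using a table with exactly one slot per pair."""
    table = [0] * (n * (n - 1) // 2)
    for t in transactions:
        for i, j in combinations(sorted(t), 2):
            table[pair_slot(i, j, n)] += 1
    return table

# Frequent items 1, 2, 3, 5 of the Apriori example, renumbered as 0, 1, 2, 3
# (item 4 is infrequent and dropped).
transactions = [{0, 2}, {1, 2, 3}, {0, 1, 2, 3}, {1, 3}]
table = count_pairs(transactions, n=4)
slots = sorted(pair_slot(i, j, 4) for i, j in combinations(range(4), 2))
print(table)
print(slots)
```

Because no two pairs share a slot, each table entry is an exact support count, so L2 can be read off directly without a second verification scan.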
MPIP Algorithm(2/2)
SIT Algorithm(1/5)

For mining association rules, we propose a revised algorithm, the Sorting-Indexing-Trimming (SIT) approach.

The SIT approach avoids generating unpromising candidate itemsets and improves performance through sorting, indexing, and trimming.
SIT Algorithm(2/5)

Sorting
(1) Start from the original transaction database.
(2) Count the frequency of occurrence of each item.
(3) Sort the items by count in increasing order and build a mapping table.
(4) Translate the items into their mapping numbers.
(5) Re-sort the items within each transaction.
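
The five sorting steps can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function name `sort_and_map` is hypothetical, the database from the Apriori example is reused as input, and ties between equally frequent items are broken by item value purely to keep the output deterministic (the slide does not say how ties are handled).

```python
from collections import Counter

def sort_and_map(transactions):
    """Apply steps (2)-(5): count, sort, build the mapping table, translate, re-sort."""
    # (2) Count how often each item occurs.
    counts = Counter(item for t in transactions for item in t)
    # (3) Sort items by increasing count; ties are broken by item value (assumption).
    ordered = sorted(counts, key=lambda item: (counts[item], item))
    mapping = {item: number for number, item in enumerate(ordered, start=1)}
    # (4) Translate items into mapping numbers, and
    # (5) re-sort the items inside each transaction.
    mapped = [sorted(mapping[item] for item in t) for t in transactions]
    return counts, mapping, mapped

# (1) The original transaction database.
database = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
counts, mapping, mapped = sort_and_map(database)
print(mapping)
print(mapped)
```

After the translation, a smaller mapping number always means a less frequent (or equally frequent) item, so the infrequent items sit at the front of every transaction.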
SIT Algorithm(3/5)

Indexing
[Figure: Apriori compared with Indexing, using an Index Table; comparing count = 69]
SIT Algorithm(4/5)

Trimming


If the minimum support is 3, all items with a frequency below 3 are trimmed.
To preserve the data, physical trimming is avoided.
 Only the starting position is recorded, and the hash table is generated from that position.
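
A minimal sketch of this kind of logical trimming follows. It assumes the items have already been renumbered in increasing order of frequency and that each transaction is sorted by those numbers. The function names and the sample data are illustrative, and the hash-table generation itself is not shown.

```python
from bisect import bisect_left

def first_kept_number(sorted_counts, min_count):
    """sorted_counts[r - 1] is the count of mapping number r, in non-decreasing order.
    Return the smallest mapping number whose count reaches min_count."""
    return bisect_left(sorted_counts, min_count) + 1

def start_positions(mapped_transactions, first_kept):
    """For each transaction (ascending mapping numbers), find where the kept items begin."""
    return [bisect_left(t, first_kept) for t in mapped_transactions]

# Hypothetical input: counts of mapping numbers 1..5 and the renumbered transactions.
sorted_counts = [1, 2, 3, 3, 3]
mapped = [[1, 2, 4], [3, 4, 5], [2, 3, 4, 5], [3, 5]]

first_kept = first_kept_number(sorted_counts, min_count=3)
starts = start_positions(mapped, first_kept)
# Later passes read each transaction from its recorded start; nothing is deleted.
# (The slices below copy data only for display; a real pass would iterate from the offset.)
views = [t[s:] for t, s in zip(mapped, starts)]
print(first_kept, starts)
print(views)
```

Because only offsets are stored, a different minimum support can be applied later by recomputing the start positions, without rebuilding the database.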
SIT Algorithm(5/5)

The process of the SIT algorithm

For finding L1 and L2:
 Apply the sorting, indexing, and trimming techniques to the original database.
 Employ the MPIP algorithm to find L1 and L2.

For finding the frequent k-itemsets for k>2:
 Apply the Apriori algorithm to the database that has been sorted, indexed, and trimmed.
 Find the frequent itemsets.
Experiment and Evaluation(1/2)

The experiments focus on two parts:

Performance of Apriori, SI+Apriori, MPIP, and SIT.
Performance of SIT and MPIP under different transaction quantities and lengths.

Performance of Apriori, SI+Apriori, MPIP, and SIT.
Experiment and Evaluation(2/2)

Performance of SIT and MPIP under different transaction quantities and lengths.

The time for pre-sorting and pre-indexing is included in SIT2.
Conclusion and Future Work

SIT reduces the number of candidate itemsets and avoids generating unpromising candidate itemsets.
SIT performs better than Apriori, DHP, and MPIP.
Some problems still need to be addressed:
  When the data set grows, sorting and indexing must be redone before mining association rules.
  Mapping items to their corresponding index numbers is time-consuming for long transactions.