A Scalable Association Rules Mining Algorithm
Based on Sorting, Indexing and Trimming
Chuang-Kai Chiou, Judy C. R. Tseng
Proceedings of the Sixth International Conference on Machine Learning and Cybernetics
Hong Kong, 19-22 August 2007
Outline
Introduction
Apriori Algorithm
DHP Algorithm
MPIP Algorithm
SIT Algorithm
Experiment and Evaluation
Conclusion and Future works
Introduction
Apriori algorithm
A large number of candidate itemsets are generated.
Several hash-based algorithms use hash functions to
filter out potential-less candidate itemsets:
DHP algorithm
MPIP algorithm
SIT algorithm
Uses sorting, indexing, and trimming techniques to reduce
the number of itemsets to be considered.
Combines the advantages of the Apriori and MPIP algorithms.
Apriori Algorithm
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C1 (after scanning D)
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

C2 (after scanning D)
itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)
itemset
{2 3 5}

L3 (after scanning D)
itemset   sup.
{2 3 5}   2
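The level-wise procedure in this example can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name apriori and the variable names are hypothetical, and the minimum support count of 2 is an assumption inferred from the tables (item {4} with support 1 is dropped, itemsets with support 2 are kept).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {k: {itemset: support}} for all frequent k-itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1: every distinct item as a 1-itemset.
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while candidates:
        counts = count(candidates)
        level = {c: s for c, s in counts.items() if s >= min_support}
        if not level:
            break
        frequent[k] = level
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Database D from the example; min_support=2 is an assumption (see above).
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(database, min_support=2)
```

Under these assumptions, result[1], result[2], and result[3] hold the same itemsets and supports as L1, L2, and L3 in the tables. The prune step is what discards {1 2 3} and {1 3 5}, leaving {2 3 5} as the only candidate in C3.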
DHP Algorithm
[Figure: DHP example on the transaction database]
MPIP Algorithm(1/2)
MPIP employs a minimal perfect hashing function for
mining L1 and L2.
It avoids the collision problem that occurs in DHP.
The time needed for scanning and searching data items is
reduced.
It employs the Apriori algorithm to find the frequent
k-itemsets for k > 2.
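To illustrate what a collision-free hash for 2-itemsets can look like, the sketch below maps every pair of items to its own slot in a table with exactly n(n-1)/2 entries. The slides do not give MPIP's actual hashing function, so the triangular-index formula used here is an assumption chosen for illustration; pair_address and count_pairs are hypothetical names, and items are assumed to be numbered 1..n.

```python
from itertools import combinations

def pair_address(i, j, n):
    """Map a pair 1 <= i < j <= n to a unique slot in range(n * (n - 1) // 2)."""
    return (i - 1) * n - i * (i - 1) // 2 + (j - i) - 1

def count_pairs(transactions, n):
    """Count the support of every 2-itemset in a single scan, with no collisions."""
    table = [0] * (n * (n - 1) // 2)      # one slot per possible 2-itemset
    for t in transactions:
        for i, j in combinations(sorted(t), 2):
            table[pair_address(i, j, n)] += 1
    return table

# Same transactions as the Apriori example (items 1..5).
database = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
table = count_pairs(database, n=5)
support_25 = table[pair_address(2, 5, 5)]   # support of {2 5}
```

Because each pair has its own slot and no slot is unused, the counts in the table are exact supports, which is the property that lets a perfect hash avoid the collisions seen in DHP.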
MPIP Algorithm(2/2)
SIT Algorithm(1/5)
For mining association rules, we propose a revised
algorithm, the Sorting-Indexing-Trimming (SIT) approach.
The SIT approach avoids generating potential-less
candidate itemsets and improves performance through
sorting, indexing, and trimming.
SIT Algorithm(2/5)
Sorting
(1) Start from the original transaction database.
(2) Count the frequency of each item.
(3) Sort the items by count in increasing order and build a mapping table.
(4) Translate the items into their mapping numbers.
(5) Re-sort the items within each transaction.
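Steps (2) to (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name sort_and_map is hypothetical, the input reuses the four transactions from the Apriori example rather than the database shown on this slide, and ties between items with equal counts are broken by the original item value, which the slides do not specify.

```python
from collections import Counter

def sort_and_map(transactions):
    """Count items, build the mapping table, translate, and re-sort."""
    # Step (2): count how often each item occurs.
    counts = Counter(item for t in transactions for item in t)
    # Step (3): increasing frequency; ties broken by the item itself (an assumption).
    ordered = sorted(counts, key=lambda item: (counts[item], item))
    mapping = {item: number for number, item in enumerate(ordered, start=1)}
    # Steps (4) and (5): translate to mapping numbers, then sort each transaction.
    translated = [sorted(mapping[item] for item in t) for t in transactions]
    return counts, mapping, translated

database = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
counts, mapping, translated = sort_and_map(database)
```

With this input, the least frequent item (4) receives mapping number 1 and the most frequent items receive the largest numbers, so infrequent items always sit at the front of each re-sorted transaction.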
SIT Algorithm(3/5)
Indexing
[Figure: Apriori compared with Indexing using an index table; comparison count = 69]
SIT Algorithm(4/5)
Trimming
If the minimum support is 3, all items with a frequency less
than 3 are trimmed.
To preserve the data, physical trimming is avoided: we
only record the starting position and generate the hash table from that
position.
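The idea of trimming by recording a starting position can be sketched as follows. This is a minimal illustration, assuming the items have already been renumbered in increasing order of frequency as in the Sorting step, so that all infrequent items carry the smallest mapping numbers. The names trim_start and visible_items and the sample data are hypothetical, and the construction of the hash table itself is not shown.

```python
from bisect import bisect_left

def trim_start(sorted_counts, min_support):
    """sorted_counts[i] is the count of mapping number i + 1 (ascending order).
    Return the first mapping number whose count reaches min_support."""
    return bisect_left(sorted_counts, min_support) + 1

def visible_items(transaction, start):
    """View of a sorted transaction from the starting position on; nothing is deleted."""
    return transaction[bisect_left(transaction, start):]

sorted_counts = [1, 2, 3, 3, 3]             # counts for mapping numbers 1..5
start = trim_start(sorted_counts, min_support=3)
translated = [[1, 2, 4], [3, 4, 5], [2, 3, 4, 5], [3, 5]]
trimmed_view = [visible_items(t, start) for t in translated]
```

The stored transactions stay intact; only the recorded starting position changes what later steps read, so a different minimum support can be applied without rebuilding the data.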
SIT Algorithm(5/5)
The process of the SIT algorithm
For finding L1 and L2:
Apply the Sorting, Indexing, and Trimming techniques to the original database.
Employ the MPIP algorithm to find L1 and L2.
For finding the frequent k-itemsets for k > 2:
Apply the Apriori algorithm to the database that has been sorted, indexed, and trimmed.
Find the frequent itemsets.
Experiment and Evaluation(1/2)
The experiments focus on two parts:
Performance of Apriori, SI+Apriori, MPIP, and SIT.
Performance of SIT and MPIP under different transaction
quantities and lengths.
[Figure: Performance of Apriori, SI+Apriori, MPIP, and SIT]
Experiment and Evaluation(2/2)
[Figure: Performance of SIT and MPIP under different transaction quantities and lengths]
The time for pre-sorting and pre-indexing is taken into
account in SIT2.
Conclusion and Future works
SIT reduces the number of candidate itemsets and
avoids generating potential-less candidate itemsets.
SIT performs better than Apriori, DHP, and
MPIP.
Some problems still need to be addressed:
When the data set grows, sorting and indexing must be
redone before mining association rules.
Mapping items to their corresponding index numbers is time-consuming for long transactions.