Transcript FPGrowth-Tanasic - start [kondor.etf.rs]
Mining Frequent Patterns Using FP-Growth Method Ivan Tanasić ([email protected]) Department of Computer Engineering and Computer Science, School of Electrical Engineering, University of Belgrade
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
◦ Jiawei Han (UIUC)
◦ Jian Pei (Buffalo)
◦ Yiwen Yin (SFU)
◦ Runying Mao (Microsoft)
Problem Definition
Mining frequent patterns from a DB
◦ Frequent itemsets (milk + bread)
◦ Frequent sequential patterns (computer -> printer -> paper)
◦ Frequent structural patterns (subgraphs, subtrees)
Problem Importance 1/2
Basic DM primitive
Used for mining data relationships
◦ Associations
◦ Correlations
Helps with basic DM tasks
◦ Classification
◦ Clustering
Problem Importance 2/2
Association rules
◦ buys(“laptop”) => buys(“mouse”) [support = 2%, confidence = 30%]
• Support = % of all transactions that contain both items (I1 and I2)
• Confidence = % of transactions containing I1 that also contain I2
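The two measures on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names `support` and `confidence` and the toy `baskets` data are assumptions made up for the example.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy basket data, not taken from the slides.
baskets = [
    {"laptop", "mouse"},
    {"laptop"},
    {"mouse", "paper"},
    {"laptop", "mouse", "paper"},
    {"paper"},
]

rule_support = support(baskets, {"laptop", "mouse"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"laptop"}, {"mouse"})  # 2 of 3 laptop baskets
```

In this toy data the rule buys(laptop) => buys(mouse) holds in 2 of the 5 baskets, and 2 of the 3 laptop baskets also contain a mouse.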
Problem Trend
Speeding up Apriori with additional techniques
New data structures (trees)
Association rule specific algorithms
◦ Specific AR algorithms (OneR, ZeroR)
FP-Growth still widely used
Existing Solutions 1/3 (Apriori)
Agrawal et al. (1994)
Apriori property (AP): all nonempty subsets of a frequent itemset must also be frequent
Starts from 1-itemsets
Join + prune (using AP + min support)
Generates a huge number of candidates
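The join + prune loop can be sketched as follows. This is a simplified level-wise version written for illustration, assuming absolute support counts; the function name `apriori` and the toy transactions are hypothetical and do not come from the slides or from any library.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune (Apriori property): every (k-1)-subset must be frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)
        # One scan over the data to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data, not taken from the slides.
toy = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]
result = apriori(toy, min_support=3)
```

Even on this tiny input the 3-itemset is generated and counted as a candidate before being rejected, which is the candidate-generation cost the slide refers to.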
Existing Solutions 2/3 (ECLAT)
Zaki (2000)
Equivalence CLass Transformation
Vertical format: {item, TID_set} instead of {TID, itemset}
Intersects TID_sets of candidates
TID_sets hold the support info (no extra DB scans)
Still generates candidates
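A rough sketch of the vertical-format idea: each item maps to the set of transaction IDs that contain it, and the support of a larger itemset is the size of the intersected TID sets. The function name `eclat` and the toy data are assumptions for illustration, and this depth-first variant is only one possible way to organise the search.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support count} using TID-set intersections."""
    # Vertical format: item -> set of IDs of the transactions containing it.
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # `candidates` holds (item, tid_set) pairs already frequent with `prefix`.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)  # support read directly from the TID set
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids  # intersection replaces a database scan
                if len(shared) >= min_support:
                    next_candidates.append((other, shared))
            extend(itemset, next_candidates)

    start = sorted(
        ((item, tids) for item, tids in vertical.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Hypothetical toy data, not taken from the slides.
toy = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]
result = eclat(toy, min_support=3)
```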
Existing Solutions 3/3 (TreeProjection)
Agarwal et al. (2001)
Creates a lexicographical tree and projects the DB into sub-DBs based on the patterns mined so far
Recursively mines the sub-databases
Less scalable than FP-Growth
FP-Tree construction 1/6
Items sorted by descending support
• Min support = 2
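The first pass (count supports, drop infrequent items, sort each transaction by descending support) might look like the sketch below. It uses the nine transactions listed on the construction slides, written here in plain item-name order; breaking support ties by item name is an assumption that happens to reproduce the order shown on the slides.

```python
from collections import Counter

# The nine transactions from the construction slides, before sorting.
transactions = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I1", "I2", "I4"],
    ["I1", "I3"],
    ["I2", "I3"],
    ["I1", "I3"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
MIN_SUPPORT = 2

# First database scan: support count of every single item.
counts = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in counts.items() if c >= MIN_SUPPORT}


def sort_transaction(t):
    """Keep frequent items only, ordered by descending support (ties by name)."""
    return sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))


ordered = [sort_transaction(t) for t in transactions]
```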
FP-Tree construction 2/6
Items sorted by descending support
T1 = {I2, I1, I5}
FP-Tree construction 3/6
Items sorted by descending support
T1 = {I2, I1, I5}
T2 = {I2, I4}
FP-Tree construction 4/6
Items sorted by descending support
T1 = {I2, I1, I5}
T2 = {I2, I4}
T3 = {I2, I3}
FP-Tree construction 5/6
Items sorted by descending support
T1 = {I2, I1, I5}
T2 = {I2, I4}
T3 = {I2, I3}
T4 = {I2, I1, I4}
FP-Tree construction 6/6
Items sorted by descending support
T1 = {I2, I1, I5}
T2 = {I2, I4}
T3 = {I2, I3}
T4 = {I2, I1, I4}
T5 = {I1, I3}
T6 = {I2, I3}
T7 = {I1, I3}
T8 = {I2, I1, I3, I5}
T9 = {I2, I1, I3}
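Inserting the sorted transactions one by one, as the six construction slides do, can be sketched like this. The class `FPNode`, the function `build_fp_tree`, and the `header` table layout are illustrative names chosen for this example, not identifiers from the paper or the talk.

```python
class FPNode:
    """One FP-tree node: an item, its count, a parent link and child links."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(ordered_transactions):
    """Insert support-sorted transactions; shared prefixes share nodes."""
    root = FPNode(None, None)
    header = {}  # item -> list of all nodes holding that item (node links)
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1  # one more transaction passes through this node
            node = child
    return root, header


# T1..T9 from the slides, already sorted by descending support.
ordered = [
    ["I2", "I1", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I2", "I1", "I4"],
    ["I1", "I3"],
    ["I2", "I3"],
    ["I1", "I3"],
    ["I2", "I1", "I3", "I5"],
    ["I2", "I1", "I3"],
]
root, header = build_fp_tree(ordered)
```

With this input the root should have two children, I2 (count 7) and I1 (count 2), and the item I3 should appear in three separate nodes, which the header table links together for the mining step.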
Mining of the FP-Tree 1/4
Item: I5
◦ Conditional pattern base: {{I2,I1:1},{I2,I1,I3:1}}
◦ Conditional FP-tree: {I2:2, I1:2}
◦ Frequent patterns generated: {I2,I5:2},{I1,I5:2},{I2,I1,I5:2}
Mining of the FP-Tree 2/4
Item: I5
◦ Conditional pattern base: {{I2,I1:1},{I2,I1,I3:1}}
◦ Conditional FP-tree: {I2:2, I1:2}
◦ Frequent patterns generated: {I2,I5:2},{I1,I5:2},{I2,I1,I5:2}
Item: I4
◦ Conditional pattern base: {{I2,I1:1},{I2:1}}
◦ Conditional FP-tree: {I2:2}
◦ Frequent patterns generated: {I2,I4:2}
Mining of the FP-Tree 3/4
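Item: I5
◦ Conditional pattern base: {{I2,I1:1},{I2,I1,I3:1}}
◦ Conditional FP-tree: {I2:2, I1:2}
◦ Frequent patterns generated: {I2,I5:2},{I1,I5:2},{I2,I1,I5:2}
Item: I4
◦ Conditional pattern base: {{I2,I1:1},{I2:1}}
◦ Conditional FP-tree: {I2:2}
◦ Frequent patterns generated: {I2,I4:2}
Item: I3
◦ Conditional pattern base: {{I2,I1:2},{I2:2},{I1:2}}
◦ Conditional FP-tree: {I2:4, I1:2},{I1:2}
◦ Frequent patterns generated: {I2,I3:4},{I1,I3:4},{I2,I1,I3:2}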
Mining of the FP-Tree 4/4
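Item: I5
◦ Conditional pattern base: {{I2,I1:1},{I2,I1,I3:1}}
◦ Conditional FP-tree: {I2:2, I1:2}
◦ Frequent patterns generated: {I2,I5:2},{I1,I5:2},{I2,I1,I5:2}
Item: I4
◦ Conditional pattern base: {{I2,I1:1},{I2:1}}
◦ Conditional FP-tree: {I2:2}
◦ Frequent patterns generated: {I2,I4:2}
Item: I3
◦ Conditional pattern base: {{I2,I1:2},{I2:2},{I1:2}}
◦ Conditional FP-tree: {I2:4, I1:2},{I1:2}
◦ Frequent patterns generated: {I2,I3:4},{I1,I3:4},{I2,I1,I3:2}
Item: I1
◦ Conditional pattern base: {{I2:4}}
◦ Conditional FP-tree: {I2:4}
◦ Frequent patterns generated: {I2,I1:4}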
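The mining step can be sketched as a recursion over conditional pattern bases. This is a compact illustrative version, not the authors' implementation: `Node`, `fp_growth`, and the weighted-transaction input format are assumptions made for the example. Ties in support are broken by item name, so the shape of a conditional tree may differ from the slides, while the patterns and counts it produces should be the same.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def fp_growth(weighted, min_support, suffix=frozenset()):
    """Return {frozenset(pattern): support}; `weighted` is [(items, count), ...]."""
    # Support of each item within this (conditional) database.
    counts = defaultdict(int)
    for items, n in weighted:
        for item in set(items):
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    by_support = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(by_support)}

    # Build the (conditional) FP-tree plus a header table of node links.
    root, header = Node(None, None), defaultdict(list)
    for items, n in weighted:
        node = root
        for item in sorted((i for i in set(items) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += n
            node = child

    # Mine from the least frequent item upwards; no candidates are generated.
    results = {}
    for item in reversed(by_support):
        pattern = frozenset(suffix) | {item}
        results[pattern] = frequent[item]
        base = []  # conditional pattern base: prefix paths leading to `item`
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        results.update(fp_growth(base, min_support, pattern))
    return results


# T1..T9 from the slides, each with weight 1; min support = 2.
transactions = [
    ["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"],
]
patterns = fp_growth([(t, 1) for t in transactions], min_support=2)
```

On the nine transactions this yields the five frequent single items plus the eight multi-item patterns listed in the table above.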
How much better is it? 1/3
Runtime on sparse data (figure)
How much better is it? 2/3
Runtime on mixed data (figure)
How much better is it? 3/3
Compactness (figure)
Is it Original?
A lot of methods try to improve Apriori
◦ Hashing
◦ Transaction reduction
◦ Partitioning
◦ Sampling
TreeProjection uses a similar structure, but it is still a different method
Importance over time
Basic primitive (a strong foundation for a tall building)
Performance becomes very important as databases grow huge
Scalability as well
FP-Growth has both performance and scalability
Conclusion
An important method for solving important DM tasks
Fast
Compact
Scalable (DB projection / tree on disk)