Efficient Closed Pattern Mining in the
Presence of Tough Block Constraints
Krishna Gade
Computer Science & Engineering
[email protected]
Outline
- Introduction
- Problem Definition and Motivation
- Contributions
- Block Constraints
- Matrix Projection based approach
- Search Space Pruning Techniques
- Experimental Evaluation
- Conclusions
Introduction to Pattern Mining
- What is a frequent pattern?
- Why is frequent pattern mining a fundamental task in data mining?
- Closed, Maximal and Constrained extensions
- State-of-the-art algorithms
- Limitations of the current solutions
What is a frequent pattern?
- A frequent pattern can be a set of items, a sequence, or a graph that occurs frequently in a database.
- It can also be a spatial, geometric or topological pattern, depending on the database chosen.
- Given a transaction database and a support threshold min_sup, an itemset X is frequent if Sup(X) >= min_sup.
- Sup(X), the support of X, is defined as the fraction of the transactions in the database that contain X.
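The definition above can be sketched in a few lines; this is an illustrative toy database (not from the thesis), with support measured as a fraction of transactions:

```python
# Minimal sketch: computing Sup(X) for an itemset X over a toy
# transaction database, and testing X against min_sup.
def support(database, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    x = set(itemset)
    hits = sum(1 for t in database if x <= set(t))
    return hits / len(database)

db = [{'a','b','c','e'}, {'b','c','d','e'}, {'a','d','e'}, {'c','d','e'}]
min_sup = 0.5
print(support(db, {'c','e'}))             # 0.75
print(support(db, {'c','e'}) >= min_sup)  # True -> {c,e} is frequent
```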
Why is frequent pattern mining so fundamental to data mining?
- It is the foundation for several essential data mining tasks:
  - Association, correlation and causality analysis
  - Classification based on association rules
  - Pattern-based and pattern-preserving clustering
- Support is a simple yet (in many cases) effective measure of the significance of a pattern, and it correlates with most other statistical measures.
Closed, Maximal & Constrained Extensions to Frequent Patterns
- A frequent pattern X is said to be closed if no superset of X has the same supporting set.
- It is said to be maximal if no superset of X is frequent.
- It is said to be constrained if it satisfies some constraint defined on the items it contains or on the transactions that support it.
  - E.g., length(X) >= min_l
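A brute-force sketch of the closed and maximal definitions on the toy database used later in the slides (the absolute min_sup value is assumed for illustration):

```python
from itertools import combinations

# Toy database; absolute min_sup assumed for the example.
db = [{'a','b','c','e'}, {'b','c','d','e'}, {'a','d','e'}, {'c','d','e'}]
min_sup = 2

def supp_set(itemset):
    """Supporting set: indices of transactions containing `itemset`."""
    return frozenset(i for i, t in enumerate(db) if set(itemset) <= t)

items = sorted(set().union(*db))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if len(supp_set(c)) >= min_sup]

# Closed: no proper superset with the same supporting set.
closed = [x for x in frequent
          if not any(x < y and supp_set(x) == supp_set(y) for y in frequent)]
# Maximal: no proper superset is frequent.
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent), len(closed), len(maximal))  # 13 6 3
```

Every maximal pattern is closed, but not vice versa, which is why maximal mining loses support information (as noted later under Limitations).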
State-of-the-art Algorithms for Pattern Discovery
- Itemsets
  - Frequent: FP-Growth, Apriori, OP, Inverted Matrix
  - Closed: CLOSET+, CHARM, FPClose, LCM
  - Maximal: MAFIA, FPMax
  - Constrained: LPMiner, Bamboo
- Sequences
  - SPAM, SLPMiner, BIDE, etc.
- Graphs
  - FSG, gFSG, gSpan, etc.
Limitations
- Limitations of support:
  - It may not capture the semantics of user interest.
  - Too many frequent patterns are produced if the support threshold is too low.
    - Closed and maximal frequent patterns alleviate this, but there may be loss of information (in the case of maximal).
  - Support is only one measure of the interestingness of a pattern; there can be others, such as length.
    - E.g., one may be interested in finding patterns whose length decreases with the support.
Definitions
- A block is a 2-tuple B = (I, T) consisting of an itemset I and its supporting set T.
- A weighted block is a block with a weight function w: I x T -> R+.
- B is a closed block iff there exists no block B' = (I', T') with the same supporting set where I' is a proper superset of I. (If such a B' exists, then B' is a super-block of B and B a sub-block of B'.)
- The size of a block B is defined as
  BSize(B) = |I| * |T|
- The sum of a weighted block B is defined as
  BSum(B) = Σ_{t ∈ T, i ∈ I} w(t, i)
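The two block measures above can be sketched directly; the block and the weights here are illustrative, not from the thesis:

```python
# Sketch of the block measures just defined, on a made-up weighted block.
def bsize(I, T):
    """BSize(B) = |I| * |T|"""
    return len(I) * len(T)

def bsum(I, T, w):
    """BSum(B) = sum of w(t, i) over t in T, i in I"""
    return sum(w[(t, i)] for t in T for i in I)

I, T = {'c', 'd'}, {'T2', 'T4'}
w = {('T2', 'c'): 2, ('T2', 'd'): 1, ('T4', 'c'): 1, ('T4', 'd'): 3}
print(bsize(I, T))    # 4
print(bsum(I, T, w))  # 7
```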
Example of a Block

Example database:
T1 = {a,b,c,e}, T2 = {b,c,d,e}, T3 = {a,d,e}, T4 = {c,d,e}

Matrix representation of the database:

     a  b  c  d  e
T1   1  1  1  0  1
T2   0  1  1  1  1
T3   1  0  0  1  1
T4   0  0  1  1  1

B1 = ({a,b},{T1}) and B2 = ({c,d},{T2,T4}) are examples of blocks.
The red sub-matrix (in the original slide) is not a block.
Block Constraints
- Let t be the set of all transactions in the database and m the set of all items.
- A block constraint C is a predicate C: 2^t x 2^m -> {true, false}.
- A block B is a valid block for C if B satisfies C, i.e., C(B) is true.
- C is a tough block constraint if there is no dependency between the satisfaction (or violation) of C by a block B and its satisfaction (or violation) by B's super- or sub-blocks.
- This thesis explores three different tough block constraints:
  - Block-size
  - Block-sum
  - Block-similarity
Monotonicity and Anti-monotonicity of Constraints
- Monotone constraint
  - C is monotone iff C(X) = true implies C(Y) = true for every superset Y of X.
  - E.g., Sup(X) <= v is monotone.
  - Benefit: prune all Y if Sup(X) > v.
- Anti-monotone constraint
  - C is anti-monotone iff C(X) = true implies C(Y) = true for every subset Y of X.
  - E.g., Sup(X) >= v is anti-monotone.
  - Benefit: prune all supersets Y if Sup(X) < v.
Why Block-size is a Tough Constraint – An Illustration

     a  b  c  d  e
T1   1  1  1  0  1
T2   0  1  1  1  1
T3   1  0  0  1  1
T4   0  0  1  1  1

For the constraint BSize >= 4:
- ({b,c},{T1,T2}) is a valid block, but ({b,c,d},{T2}) is invalid, so the block-size constraint is not monotone.
- Neither ({b},{T1,T2}) nor ({c},{T1,T2,T4}) is valid, so the block-size constraint is not anti-monotone.
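The illustration can be checked computationally: under BSize >= 4 the constraint is violated both by growing and by shrinking a valid block, so it is neither monotone nor anti-monotone:

```python
# Verifying the illustration above: block-size is a tough constraint.
def bsize(I, T):
    return len(I) * len(T)

v = 4
assert bsize({'b', 'c'}, {'T1', 'T2'}) >= v    # valid block
assert bsize({'b', 'c', 'd'}, {'T2'}) < v      # larger itemset: invalid
assert bsize({'b'}, {'T1', 'T2'}) < v          # sub-block: invalid
assert bsize({'c'}, {'T1', 'T2', 'T4'}) < v    # sub-block: invalid
print("block-size is neither monotone nor anti-monotone on this example")
```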
Block-size and Block-sum Constraints
- Block-size constraint
  - Motivation: find a set of itemsets, each of which accounts for a certain fraction of the overall number of transactions performed in a period of time.
  - BSize(B) >= v = β * N, where 0 < β <= 1 and N = Σ_{t ∈ t} length(t)
- Block-sum constraint
  - Motivation: identify product groups that account for a certain fraction of the overall sales, profits, etc.
  - BSum(B) >= v = β * W, where 0 < β <= 1 and W = Σ_{t ∈ t, i ∈ m} w(t, i)
Block-similarity Definition
- Motivation: finding groups of thematically related words in large document datasets.
- The importance of a group of words can be measured by its contribution to the overall similarity between the documents in the collection.
- Here t is the set of tf-idf scaled, unit-length (L2-normalized) document vectors, and m is the set of distinct terms in the collection.
- The block-similarity of a weighted block B is defined as the loss in the aggregate pairwise similarity of the documents in t that results from zeroing out the entries corresponding to B:
  - BSim(B) = S - S', where S and S' are the aggregate pairwise similarities before and after removing B.
Block Similarity – An Illustration

Before removing the block:

     a   b   c   d   e
D1  .1  .1  .1   0  .1
D2   0  .1  .1  .3  .4
D3  .1   0   0  .3  .3
D4   0   0  .3  .2  .1

After zeroing out ({b,c},{D1,D2}):

     a   b   c   d   e
D1  .1   0   0   0  .1
D2   0   0   0  .3  .4
D3  .1   0   0  .3  .3
D4   0   0  .3  .2  .1

({b,c},{D1,D2}) is removed here to calculate its block-similarity, by measuring the loss in the aggregate similarity.
Block-similarity contd.
- The similarity of any two documents is measured as the dot product of their unit-length vectors (cosine similarity).
- For the given collection t, we define the composite vector D to be the sum of all document vectors in t:
  D = Σ_{d ∈ t} d
- We define the composite vector B_I of a weighted block B = (I, T) to be the vector formed by adding all the vectors in T along only the dimensions in I. Then:
  S = Σ_{di, dj ∈ t} di · dj = D · D
  S' = (D - B_I) · (D - B_I)
  BSim(B) = S - S' = 2 D · B_I - B_I · B_I
- The block-similarity constraint is then defined as
  BSim(B) >= v = β * S, where 0 < β <= 1
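The composite-vector identity above can be sketched on the illustration matrix (rows D1..D4, columns a..e); the helper names are illustrative:

```python
# Sketch of BSim(B) = 2*D.B_I - B_I.B_I on the illustration matrix.
M = [[.1, .1, .1,  0, .1],
     [ 0, .1, .1, .3, .4],
     [.1,  0,  0, .3, .3],
     [ 0,  0, .3, .2, .1]]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

D = [sum(col) for col in zip(*M)]  # composite vector of the collection

def bsim(rows, cols):
    # Composite vector of the block: sum the chosen rows, but only
    # along the chosen column dimensions.
    BI = [sum(M[r][c] for r in rows) if c in cols else 0 for c in range(5)]
    return 2 * dot(D, BI) - dot(BI, BI)

# Block ({b,c},{D1,D2}) -> rows 0,1 and columns 1,2
print(round(bsim([0, 1], [1, 2]), 3))  # 0.2
```

Because zeroing out the block changes the composite vector from D to D - B_I, this closed form agrees exactly with computing S - S' directly on the zeroed matrix.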
Key Features of the Algorithm
- Follows the widely used projection-based pattern mining paradigm.
- Adopts a depth-first traversal of the lattice of the complete set of itemsets, with the items ordered non-decreasingly by their frequency.
- Represents the transaction/document database as a matrix, with transactions (documents) as rows and items (terms) as columns.
- Employs efficient compressed sparse matrix storage and access schemes to achieve high computational efficiency.
- The matrix-projection based pattern enumeration shares ideas with the recently developed array-projection based method H-Mine.
- Prunes potentially invalid rows and columns at each node during the traversal of the lattice (shown on the next page), as determined by our row-pruning, column-pruning and matrix-pruning tests.
- Adapts various closed itemset mining optimizations, such as column fusing and redundant pattern pruning from CHARM and CLOSET+, to the block constraints.
- The hash table consists of only closed patterns, hashed by the sum of the transaction-ids of the transactions in their supporting sets.
Pattern Enumeration
- Visits each node in the lattice in a depth-first order. Each node represents a distinct pattern p.
- At a node labeled p in the lattice, we report and store p in the hash table as a closed pattern if p is closed and valid under the given block constraint.
- We build a p-projected matrix by pruning any potentially invalid columns and rows determined by our pruning tests.

The itemset lattice over {a, b, c, d}:
Level 1: a, b, c, d
Level 2: ab, ac, ad, bc, bd, cd
Level 3: abc, abd, acd, bcd
Level 4: abcd
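The depth-first enumeration over this lattice in a fixed item order can be sketched as follows (pruning and projection are omitted here):

```python
# Minimal sketch of depth-first enumeration of the itemset lattice:
# each node extends its prefix only with items that come later in the order.
def dfs(prefix, remaining, visit):
    for k, item in enumerate(remaining):
        node = prefix + [item]
        visit(node)
        dfs(node, remaining[k + 1:], visit)

visited = []
dfs([], ['a', 'b', 'c', 'd'], visited.append)
print(len(visited))  # 15 non-empty subsets of {a,b,c,d}
print(visited[:4])   # [['a'], ['a','b'], ['a','b','c'], ['a','b','c','d']]
```

Visiting each node once, in this order, is what lets CBMiner build the p-projected matrix incrementally from the parent's projection.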
Matrix-Projection
- A p-projected matrix is the matrix containing only the rows that contain p, and the columns that appear after p in the predefined order.
- Projecting the matrix is linear in the number of non-zeroes in the projected matrix.

Given matrix:

     a  b  c  d  e
T1   1  1  1  0  1
T2   0  1  1  1  1
T3   1  0  0  1  1
T4   0  0  1  1  1

{b}-projected matrix:

     c  d  e
T1   1  0  1
T2   1  1  1
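The projection step can be sketched directly on the example (a dict-of-sets stand-in for the sparse matrix; the representation is illustrative):

```python
# Sketch of building the {b}-projected matrix: keep only rows containing
# 'b' and only columns ordered after 'b'.
order = ['a', 'b', 'c', 'd', 'e']
rows = {'T1': {'a','b','c','e'}, 'T2': {'b','c','d','e'},
        'T3': {'a','d','e'},     'T4': {'c','d','e'}}

def project(rows, item):
    later = set(order[order.index(item) + 1:])
    return {t: sorted(r & later) for t, r in rows.items() if item in r}

print(project(rows, 'b'))  # {'T1': ['c', 'e'], 'T2': ['c', 'd', 'e']}
```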
Compressed Sparse Representation
- The CSR format uses two one-dimensional arrays:
  - The first stores the actual non-zero elements of the matrix row by row (or column by column).
  - The second stores the indices corresponding to the beginning of each row (or column).
- We maintain both row- and column-based representations for efficient projection and frequency counting.

CSR format for the example matrix:

     a  b  c  d  e
T1   1  1  1  0  1
T2   0  1  1  1  1
T3   1  0  0  1  1
T4   0  0  1  1  1

Row-based CSR:

Pointer array:  T1=0, T2=4, T3=8, T4=11, end=14
Index array (positions 0-13):  a b c e | b c d e | a d e | c d e
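Building the two row-based CSR arrays for the example is a one-pass scan:

```python
# Sketch of the row-based CSR arrays for the example matrix: one array of
# non-zero column ids, one pointer array marking where each row begins.
rows = [['a','b','c','e'], ['b','c','d','e'], ['a','d','e'], ['c','d','e']]

index_array, pointer_array = [], [0]
for r in rows:
    index_array.extend(r)
    pointer_array.append(len(index_array))

print(pointer_array)  # [0, 4, 8, 11, 14] -- matches the slide
print(index_array)    # ['a','b','c','e','b','c','d','e','a','d','e','c','d','e']
```

Row i then occupies `index_array[pointer_array[i]:pointer_array[i+1]]`, which is what makes projection linear in the non-zeroes touched.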
Search Space Pruning
- Column pruning
  - Given a pattern p and its p-projected matrix, derive a necessary condition for the columns that can form a valid block with p.
  - Eliminate all columns in the p-projected matrix that do not satisfy it.
- Block-size
  - Let A be the local supporting set of a candidate column x, and rlen(t) the local row length of t in the p-projected matrix.
  - Prune x if BSize(p, A) + Σ_{t ∈ A} rlen(t) < v

{b}-projected matrix:

     c  d  e
T1   1  0  1
T2   1  1  1

Example: let BSize >= 5 be the constraint. Column 'd' is pruned, as it can never form a block of size >= 5 with its prefix {b}: the maximum block size possible with 'd' is 4.
Search Space Pruning contd.
- Column pruning
  - Block-sum
    - Let rsum(t) be the local row sum of t.
    - Prune x if BSum(p, A) + Σ_{t ∈ A} rsum(t) < v
  - Block-similarity
    - Let e be the maximum value of the composite vector D, g the local maximum row sum, freq(x) the frequency of x, and a = 2 D · B_P.
    - Prune x if freq(x) < (v - a) / (2 e g)
Search Space Pruning contd.
- Row pruning
  - Smallest Valid Extension (SVE)
    - SVE(p) is the length of the smallest possible extension q of p such that the block formed by p and q is valid.
    - Prune rows whose local length is smaller than SVE(p) in the p-projected matrix.
  - The SVE for a generic block constraint BSxxx is SVE(p) = (v - BSxxx(B)) / z, where z is:
    - Block-size: the size of the supporting set of p.
    - Block-sum: the maximum column sum in the p-projected matrix.
    - Block-similarity: the maximum column similarity in the p-projected matrix.
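A hedged sketch of the block-size case of this bound: with z = |supporting set of p|, every additional column adds at most z cells to the block, so at least ceil((v - BSize(B)) / z) more columns are needed (the ceiling is assumed, since extensions are whole columns).

```python
import math

# Smallest Valid Extension for block-size: minimum number of extra
# columns needed before the constraint BSxxx >= v can possibly hold.
def sve(v, current_value, z):
    return max(0, math.ceil((v - current_value) / z))

# Prefix p with |I| = 1, |T| = 2 (so BSize = 2) under BSize >= 7:
print(sve(7, 2, 2))  # 3 -> rows with local length < 3 can be pruned
```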
Search Space Pruning contd.
- Row pruning example
  - {b}-projected matrix:

         c  d  e
    T1   1  0  1
    T2   1  1  1

  - Let BSum >= 7 be the constraint. Since SVE >= 3, T1 (local length 2) gets pruned.
- Matrix pruning
  - Prune the whole p-projected matrix if it cannot form a valid block with p:
    - Block-size: the sum of the row lengths in the projected matrix is insufficient.
    - Block-sum: the sum of the row sums is insufficient.
    - Block-similarity: the sum of the column similarities is insufficient.
Pattern Closure Check and Optimizations
- Closure check
  - The hash table consists of closed patterns; the hash keys are the sums of transaction-ids.
  - At a node p in the lattice (shown before), p is checked against the patterns in its hash bucket.
- Column fusing
  - Fuse the fully dense columns of the p-projected matrix into p.
  - Also fuse columns that have identical supporting sets to one another.
- Redundant pattern pruning
  - If p is a proper subset of an already mined closed pattern with the same support, it can be safely pruned; any pattern extending it need not be explored, as it has already been done. Hence p is a redundant pattern.
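The closure check above can be sketched as follows; the function names and the exact comparison are illustrative, but the keying scheme (sum of transaction-ids) is the one stated on the slide:

```python
from collections import defaultdict

# Closed patterns are hashed by the sum of their supporting transaction-ids,
# so a candidate is only compared against patterns in one bucket.
hash_table = defaultdict(list)

def record_closed(pattern, tids):
    hash_table[sum(tids)].append((pattern, tids))

def is_redundant(pattern, tids):
    """True if an already-mined closed superset has the same supporting set."""
    return any(q_tids == tids and pattern < q
               for q, q_tids in hash_table[sum(tids)])

record_closed(frozenset({'c', 'd', 'e'}), frozenset({2, 4}))
print(is_redundant(frozenset({'c', 'd'}), frozenset({2, 4})))     # True
print(is_redundant(frozenset({'c', 'e'}), frozenset({1, 2, 4})))  # False
```

Because different supporting sets rarely share the same id-sum, most candidates touch a short (often empty) bucket rather than the whole set of mined patterns.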
Experimental Setup

Data        #Trans      #Items   Avg.(Max.) tran. len
Gazelle     59601       498      2.5 (267)
Pumsb*      49046       2089     50.5 (63)
Big-market  838466      38336    3.12 (90)
Sports      8580        126373   258.3 (2344)
T10I4Dx     200k-1000k  10000    10 (31)

Notation:
- CBMiner – the Closed Block Miner algorithm
- CLOSET+ – state-of-the-art closed frequent itemset mining algorithm
- CP – Column Pruning, RP – Row Pruning, MP – Matrix Pruning
Experimental Results
- Comparisons with CLOSET+ on Gazelle (figure)

Experimental Results contd.
- Comparisons with CLOSET+ on Sports (figure)

Experimental Results contd.
- Comparison of pruning techniques on Gazelle (left) and Pumsb* (right)
- Time for no pruning – Gazelle: 1578.48 (BSize >= 0.1); Pumsb*: 1330.03 (BSum >= 6.0)

Experimental Results contd.
- Comparison of closed vs. all valid block mining on Big-Market (figure)

Experimental Results contd.
- Comparison of pruning techniques on Big-Market
- Time for no pruning: 3560 seconds

Scalability Test on T10I4Dx (figure)
Micro Concept Discovery
- Scaled the document vectors using tf-idf and normalized them to unit length (L2-norm).
- Applied the CBMiner algorithm for each of the three constraints.
- Chose the top-1000 patterns ranked by the constraint function value.
- Computed the entropies of the documents that form the supporting set of each block.
- Also ran CLOSET+ to get the top-1000 patterns ranked by frequency.
Micro Concept Discovery contd.
- The average entropies of the four schemes are all fairly low.
- Block-similarity outperforms the rest, as it leads to the lowest entropies, i.e., the purest clusters.
- The block-size and itemset frequency constraints do not account for the weights associated with the terms, and hence are inconsistent.
- Block-sum, however, performs reasonably well, as it accounts for the term weights provided by tf-idf and L2 normalization.

Data     #docs  #terms  #classes
Classic  7089   12009   4
Sports   8580   18324   7
LA1      3204   31472   6
Micro Concept Discovery contd. (figure)
Conclusions
- Proposed a new class of constraints called "tough" block constraints.
- Proposed a matrix-projection based framework, CBMiner, for mining closed block patterns.
- Block constraints discussed: block-size, block-sum, block-similarity.
- Three novel pruning techniques: column pruning, row pruning and matrix pruning.
- CBMiner is order(s) of magnitude faster than traditional closed frequent itemset mining algorithms, and it finds far fewer patterns.

Thank You !!