PrefixSpan: Mining Sequential Patterns
Efficiently by Prefix-Projected Pattern Growth
17th International Conference on Data Engineering (ICDE)
Heidelberg, Germany, April 2001
Mining Sequential Patterns by Pattern-Growth:
The PrefixSpan Approach
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE),
VOL. 16, NO. 10, OCTOBER 2004
Jiawei Han, Jian Pei , Helen Pinto, Behzad Mortazavi-Asl,
Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu
Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory,
Dept. of Computer Science and Info. Engineering, NTU
2005/07/19
Outline
– Abstract
– Introduction
– PrefixSpan
– Performance
– Conclusions

Copyright © Natural Language Processing Lab., NTU, 2005
Abstract
Sequential pattern mining is a difficult problem because one may need to
examine a combinatorially explosive number of possible subsequence patterns.
The general idea of the method is to integrate the mining of frequent
sequences with that of frequent patterns, and to use projected sequence
databases to confine the search and the growth of subsequence fragments.

Introduction

An Apriori-like algorithm generates a huge set of candidate sequences
– Suppose there are 1000 frequent sequences of length 1
– 1000×1000 + (1000×999)/2 = 1,499,500 length-2 candidate sequences

Many database scans are needed during mining
– Sequential pattern {(abc)(abc)(abc)(abc)(abc)} has length 15
– An Apriori-based method must scan the database at least 15 times

Long sequential patterns are difficult to mine
– Suppose there is only a single sequence of length 100 and min_sup = 1
– Length-1 candidate sequences: 100; length-2: 100×100 + (100×99)/2 = 14,950; …
– Total = 2^100 − 1 ≈ 10^30
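
The candidate counts above can be checked with a few lines of Python. This is only an arithmetic sketch; the helper name `length2_candidates` is made up for illustration and does not come from the paper.

```python
def length2_candidates(n):
    """Length-2 candidates an Apriori-like method builds from n frequent items:
    n*n ordered pairs <xy> (including <xx>) plus n*(n-1)/2 itemsets <(xy)>."""
    return n * n + n * (n - 1) // 2

print(length2_candidates(1000))   # 1499500
print(length2_candidates(100))    # 14950
print(f"{2**100 - 1:.2e}")        # 1.27e+30, i.e. on the order of 10^30
```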
Introduction (Cont.)

Sequence, Element, Subsequence and Sequential Pattern
A sequence: <(ef)(ab)(df)cb>
A sequence database:
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
An element may contain several items; items within an element are
unordered and are listed alphabetically
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
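
The two claims above (the subsequence relation and the support of <(ab)c>) can be checked with a minimal Python sketch. It assumes every item is a single character; the function names `parse`, `is_subsequence` and `support` are illustrative, not part of the paper.

```python
def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a tuple of sorted item tuples."""
    elements, group = [], None
    for ch in text.strip("<>").replace(" ", ""):
        if ch == "(":
            group = []
        elif ch == ")":
            elements.append(tuple(sorted(group)))
            group = None
        elif group is not None:
            group.append(ch)
        else:
            elements.append((ch,))
    return tuple(elements)

def is_subsequence(pattern, sequence):
    """Greedy test: each pattern element must be contained in a distinct,
    later element of the sequence, in order."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

DB = [parse(s) for s in ("<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>",
                         "<(ef)(ab)(df)cb>", "<eg(af)cbc>")]

print(is_subsequence(parse("<a(bc)dc>"), DB[0]))  # True
print(support(parse("<(ab)c>"), DB))              # 2
```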
PrefixSpan

PrefixSpan-1
– Single-level projection

PrefixSpan-2
– Bi-level projection
– Uses an S-matrix

PrefixSpan can also use pseudo-projection
PrefixSpan (Cont.)

Definition
– Prefix and postfix (projection)
– <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
– Given sequence <a(abc)(ac)d(cf)>:
Prefix   Postfix / Projection
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
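
The projections in the table can be reproduced with a short Python sketch. It assumes single-character items and a non-empty prefix, matches the prefix at its earliest position, and writes the leftover items of the last matched element as `(_...)`. The names `parse` and `postfix` are illustrative only.

```python
def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a tuple of sorted item tuples."""
    elements, group = [], None
    for ch in text.strip("<>").replace(" ", ""):
        if ch == "(":
            group = []
        elif ch == ")":
            elements.append(tuple(sorted(group)))
            group = None
        elif group is not None:
            group.append(ch)
        else:
            elements.append((ch,))
    return tuple(elements)

def postfix(sequence, prefix):
    """Project sequence on a non-empty prefix (earliest match).
    Returns the postfix in slide notation, or None if the prefix does not occur."""
    pos = -1
    for element in prefix:
        pos += 1
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
    # Items of the last matched element that come after the prefix's last item.
    rest = [x for x in sequence[pos] if x > max(prefix[-1])]
    parts = ["(_" + "".join(rest) + ")"] if rest else []
    for element in sequence[pos + 1:]:
        text = "".join(element)
        parts.append(text if len(element) == 1 else "(" + text + ")")
    return "<" + "".join(parts) + ">"

s = parse("<a(abc)(ac)d(cf)>")
for p in ("<a>", "<aa>", "<ab>"):
    print(p, postfix(s, parse(p)))
# <a> <(abc)(ac)d(cf)>
# <aa> <(_bc)(ac)d(cf)>
# <ab> <(_c)(ac)d(cf)>
```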
PrefixSpan-1
Step 1. Find length-1 sequential patterns
Scan the DB once to find all frequent items in sequences
Step 2. Divide the search space
The complete set of sequential patterns is partitioned into subsets according to their prefixes
Step 3. Find subsets of sequential patterns
The subsets of sequential patterns are mined by constructing the
corresponding projected databases and mining each recursively
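
The three steps can be sketched as a compact recursive Python function. This is an illustrative sketch, not the authors' implementation: it assumes single-character items, and instead of copying postfixes it keeps, for each sequence, the index of the element where the current prefix was matched earliest. The names `parse`, `prefixspan` and `grow` are hypothetical.

```python
from collections import defaultdict

def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a tuple of sorted item tuples."""
    elements, group = [], None
    for ch in text.strip("<>").replace(" ", ""):
        if ch == "(":
            group = []
        elif ch == ")":
            elements.append(tuple(sorted(group)))
            group = None
        elif group is not None:
            group.append(ch)
        else:
            elements.append((ch,))
    return tuple(elements)

def prefixspan(db, min_sup):
    """Return {pattern: support}; a pattern is a tuple of item tuples."""
    patterns = {}

    def grow(prefix, entries):
        # entries: (sid, i) pairs, where i is the element index of db[sid] at
        # which the last element of prefix is matched earliest (-1 if no prefix).
        last = set(prefix[-1]) if prefix else set()
        seq_ext = defaultdict(dict)  # item x -> {sid: index}, extends to <prefix x>
        set_ext = defaultdict(dict)  # item x -> {sid: index}, extends last element
        for sid, i in entries:
            seq = db[sid]
            for j in range(i + 1, len(seq)):
                for x in seq[j]:
                    seq_ext[x].setdefault(sid, j)
            if prefix:
                for j in range(i, len(seq)):
                    if last <= set(seq[j]):
                        for x in seq[j]:
                            if x > prefix[-1][-1]:
                                set_ext[x].setdefault(sid, j)
        for x, hits in sorted(seq_ext.items()):
            if len(hits) >= min_sup:
                pattern = prefix + ((x,),)
                patterns[pattern] = len(hits)
                grow(pattern, list(hits.items()))
        for x, hits in sorted(set_ext.items()):
            if len(hits) >= min_sup:
                pattern = prefix[:-1] + (prefix[-1] + (x,),)
                patterns[pattern] = len(hits)
                grow(pattern, list(hits.items()))

    grow((), [(sid, -1) for sid in range(len(db))])
    return patterns

DB = [parse(s) for s in ("<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>",
                         "<(ef)(ab)(df)cb>", "<eg(af)cbc>")]
patterns = prefixspan(DB, min_sup=2)
print(len(patterns))                 # 53
print(patterns[parse("<a(bc)a>")])   # 2
```

Each recursive call plays the role of one projected database: the frequent local items found in it extend the current prefix, and the search never leaves the sequences that contain that prefix.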
PrefixSpan-1 (Example)
Sequence_id   Sequence
10            <a(abc)(ac)d(cf)>
20            <(ad)c(bc)(ae)>
30            <(ef)(ab)(df)cb>
40            <eg(af)cbc>
min_support = 2
L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
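
The length-1 counts can be reproduced with a small sketch. It assumes single-character items, so the set of characters of a sequence string (minus brackets) is exactly its set of items; `item_supports` is an illustrative name.

```python
from collections import Counter

DB = ["<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>"]

def item_supports(database):
    """Count, for each item, the number of sequences containing it at least once."""
    counts = Counter()
    for text in database:
        counts.update(set(text) - set("()<> "))
    return counts

supports = item_supports(DB)
frequent = {item: n for item, n in sorted(supports.items()) if n >= 2}
print(frequent)  # {'a': 4, 'b': 4, 'c': 4, 'd': 3, 'e': 3, 'f': 3}
```

Item g occurs in only one sequence, so it is dropped at min_support = 2.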
PrefixSpan-1 (Example) (Cont.)
Prefix   Projected (Postfix) Database
<a>      <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
<b>      <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
<c>      <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
<d>      <(cf)>, <c(bc)(ae)>, <(_f)cb>
<e>      <(_f)(ab)(df)cb>, <(af)cbc>
<f>      <(ab)(df)cb>, <cbc>
PrefixSpan-1 (Example) (Cont.)
<a>
<(abc)(ac)d(cf)> , <(_d)c(bc)(ae)>
<(_b)(df)cb> , <(_f)cbc>
Scanning <a>-Projected database once:
a:2 , b:4 , c:4 , d:2 , e:1 , f:2
(_b):2 , (_c):1 , (_d):1 , (_e):1 , (_f):1
L2: <aa>:2 , <ab>:4 , <(ab)>:2
<ac>:4 , <ad>:2 , <af>:2
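
The scan of the <a>-projected database can be sketched as follows. The point to notice is how (_x) is counted: a sequence supports (_x) either when its postfix starts with a partial element containing x, or when a later element contains both the prefix's last element (here a) and x. This is an illustrative sketch assuming single-character items; `parse_postfix` and `scan` are hypothetical names.

```python
from collections import Counter

def parse_postfix(text):
    """Split '<(_d)c(bc)(ae)>' into (partial, elements):
    partial holds the items of a leading (_...) element, elements the rest."""
    partial, elements, group, is_partial = (), [], None, False
    for ch in text.strip("<>"):
        if ch == "(":
            group, is_partial = [], False
        elif ch == "_":
            is_partial = True
        elif ch == ")":
            if is_partial:
                partial = tuple(group)
            else:
                elements.append(tuple(group))
            group = None
        elif group is not None:
            group.append(ch)
        else:
            elements.append((ch,))
    return partial, elements

def scan(projected, last_element):
    """Count each local item once per sequence, separately for sequence
    extensions (x) and itemset extensions (_x) of the prefix's last element."""
    seq_counts, set_counts = Counter(), Counter()
    for text in projected:
        partial, elements = parse_postfix(text)
        seq_items = {x for e in elements for x in e}
        set_items = set(partial)
        for e in elements:
            if last_element <= set(e):
                set_items |= {x for x in e if x > max(last_element)}
        seq_counts.update(seq_items)
        set_counts.update(set_items)
    return seq_counts, set_counts

a_projected = ["<(abc)(ac)d(cf)>", "<(_d)c(bc)(ae)>", "<(_b)(df)cb>", "<(_f)cbc>"]
seq_counts, set_counts = scan(a_projected, {"a"})
print(sorted(seq_counts.items()))
# [('a', 2), ('b', 4), ('c', 4), ('d', 2), ('e', 1), ('f', 2)]
print(sorted(set_counts.items()))
# [('b', 2), ('c', 1), ('d', 1), ('e', 1), ('f', 1)]
```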
PrefixSpan-1 (Example) (Cont.)
Prefix   Projected (Postfix) Database
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>, <(_c)a>, <c>
<(ab)>   <(_c)(ac)d(cf)>, <(df)cb>
<ac>     <(ac)d(cf)>, <(bc)a>, <b>, <bc>
<ad>     <(cf)>, <(_f)cb>
<af>     <cb>
PrefixSpan-1 (Example) (Cont.)
< ab >
<(_c)(ac)d(cf)> , <(_c)a> , <c>
Scanning <ab>-Projected database once:
a:2 , c:2 , d:1 , f:1 , (_c):2
L3: <a(bc)>:2 , <aba>:2 , <abc>:2
PrefixSpan-1 (Example) (Cont.)
Prefix    Projected (Postfix) Database
<a(bc)>   <(ac)d(cf)>, <a>
<aba>     <(_c)d(cf)>
<abc>     <d(cf)>
Scanning the <a(bc)>-projected database once:
a:2 , c:1 , d:1 , f:1
L4: <a(bc)a>:2
PrefixSpan-1 (Example) (Cont.)
Prefix   Sequential Patterns
<a>      <a>, <aa>, <ab>, <a(bc)>, <a(bc)a>, <aba>, <abc>,
         <(ab)>, <(ab)c>, <(ab)d>, <(ab)f>, <(ab)dc>, <ac>,
         <aca>, <acb>, <acc>, <ad>, <adc>, <af>
<b>      <b>, <ba>, <bc>, <(bc)>, <(bc)a>, <bd>, <bdc>, <bf>
<c>      <c>, <ca>, <cb>, <cc>
<d>      <d>, <db>, <dc>, <dcb>
<e>      <e>, <ea>, <eab>, <eac>, <eacb>, <eb>, <ebc>, <ec>,
         <ecb>, <ef>, <efb>, <efc>, <efcb>
<f>      <f>, <fb>, <fbc>, <fc>, <fcb>
Completeness of PrefixSpan-1
SDB
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  Having prefix <a>: <a>-projected database
    <(abc)(ac)d(cf)>
    <(_d)c(bc)(ae)>
    <(_b)(df)cb>
    <(_f)cbc>
    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Having prefix <aa>: <aa>-proj. db
      …
      Having prefix <af>: <af>-proj. db
  Having prefix <b>: <b>-projected database
    …
  Having prefix <c>, …, <f>
    …
Analysis
– No candidate sequence needs to be generated by PrefixSpan
– Projected databases keep shrinking
– The major cost of PrefixSpan is the construction of projected databases

PrefixSpan-2
Step 1. Find length-1 sequential patterns
Scan the DB once to find all frequent items in sequences
Step 2. Construct a triangular matrix M (the S-matrix)
The S-matrix is filled in by scanning the DB a second time
Step 3. Construct α-projected databases
For each length-2 sequential pattern α, construct the α-projected DB
Step 4. Mine each projected DB recursively
PrefixSpan-2 (Example)
Sequence_id   Sequence
10            <a(abc)(ac)d(cf)>
20            <(ad)c(bc)(ae)>
30            <(ef)(ab)(df)cb>
40            <eg(af)cbc>
min_support = 2
L1: <a>:4, <b>:4, <c>:4, <d>:3, <e>:3, <f>:3
PrefixSpan-2 (Example) (Cont.)
S-matrix
a   2
b   (4,2,2)   1
c   (4,2,1)   (3,3,2)   3
d   (2,1,1)   (2,2,0)   (1,3,0)   0
e   (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
f   (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1
    a         b         c         d         e         f
The cell in row y, column x holds (support of <xy>, support of <yx>, support of <(xy)>);
a diagonal cell holds the support of <xx>
– <ab> happens 4 times
– <bb> happens 1 time
– <dc> happens 3 times
– <(ef)> happens 1 time
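
The S-matrix values can be recomputed with a short Python sketch. For clarity it counts the support of each length-2 pattern separately rather than filling the whole matrix in a single scan, so it illustrates the contents of the matrix, not the efficient construction. It assumes single-character items; `parse`, `support` and `s_matrix` are illustrative names.

```python
from itertools import combinations

def parse(text):
    """Parse '<a(abc)(ac)d(cf)>' into a tuple of sorted item tuples."""
    elements, group = [], None
    for ch in text.strip("<>").replace(" ", ""):
        if ch == "(":
            group = []
        elif ch == ")":
            elements.append(tuple(sorted(group)))
            group = None
        elif group is not None:
            group.append(ch)
        else:
            elements.append((ch,))
    return tuple(elements)

def is_subsequence(pattern, sequence):
    """Greedy containment test, one sequence element per pattern element."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    return sum(is_subsequence(pattern, seq) for seq in database)

def s_matrix(database, items):
    """matrix[(x, x)] = support of <xx>;
    matrix[(x, y)] with x < y = (support of <xy>, <yx>, <(xy)>)."""
    matrix = {}
    for x in items:
        matrix[(x, x)] = support(((x,), (x,)), database)
    for x, y in combinations(sorted(items), 2):
        matrix[(x, y)] = (
            support(((x,), (y,)), database),
            support(((y,), (x,)), database),
            support(((x, y),), database),
        )
    return matrix

DB = [parse(s) for s in ("<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>",
                         "<(ef)(ab)(df)cb>", "<eg(af)cbc>")]
M = s_matrix(DB, "abcdef")
print(M[("a", "b")])  # (4, 2, 2)
print(M[("c", "d")])  # (1, 3, 0)
print(M[("e", "f")])  # (2, 0, 1)
print(M[("b", "b")])  # 1
```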
PrefixSpan-2 (Example) (Cont.)
<ab>-projected database
<(_c)(ac)d(cf)>
<(_c)a>
<c>
Local length-1 sequential patterns: <a>, <c>, <(_c)>
S-matrix of the <ab>-projected database
a      0
c      (1,0,1)   1
(_c)   (∅,2,∅)   (∅,1,∅)   ∅
       a         c         (_c)
– Cell ((_c), a) = (∅,2,∅) leads to pattern <a(bc)a>
– There is no hope of forming (_ac), so there is no need to count it
Benefits of Bi-level Projection

Far fewer projections
– In this example there are 53 sequential patterns
– Level-by-level projection: 53 projections
– Bi-level projection: 22 projections
Speed-up by Pseudo-Projection


Major cost of PrefixSpan: projection
– Postfixes of sequences often appear repeatedly in recursive projected databases
When the (projected) database can be held in main memory, use pointers to form projections
– Pointer to the sequence
– Offset of the postfix
s = <a(abc)(ac)d(cf)>
s|<a>:  (pointer to s, 2)  <(abc)(ac)d(cf)>
s|<ab>: (pointer to s, 4)  <(_c)(ac)d(cf)>
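
A minimal sketch of the idea: a pseudo-projected entry is just a (sequence, offset) pair, and the postfix is read from the original sequence on demand instead of being copied. The offset is taken to be the 1-based position of the first postfix item when the sequence is read item by item, which matches the offsets 2 and 4 above. The names `flatten` and `render_postfix` are illustrative.

```python
def flatten(sequence):
    """Return [(element_index, item), ...] in reading order."""
    return [(i, x) for i, element in enumerate(sequence) for x in element]

def render_postfix(sequence, offset):
    """Render the postfix that starts at the given 1-based item offset."""
    flat = flatten(sequence)
    start = offset - 1
    groups = {}
    for index, item in flat[start:]:
        groups.setdefault(index, []).append(item)
    # The first group is partial when the offset falls inside an element.
    cut = 0 < start < len(flat) and flat[start - 1][0] == flat[start][0]
    parts = []
    for n, items in enumerate(groups.values()):
        text = "".join(items)
        if n == 0 and cut:
            parts.append("(_" + text + ")")
        elif len(items) > 1:
            parts.append("(" + text + ")")
        else:
            parts.append(text)
    return "<" + "".join(parts) + ">"

s = (("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f"))

# Pseudo-projected entries: no postfix is copied, only (sequence, offset).
s_given_a = (s, 2)
s_given_ab = (s, 4)
print(render_postfix(*s_given_a))   # <(abc)(ac)d(cf)>
print(render_postfix(*s_given_ab))  # <(_c)(ac)d(cf)>
```

Both entries share the same underlying sequence object, which is where the saving in copying comes from when the database fits in main memory.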
Performance
[Figure: runtime (seconds) vs. support threshold (%), from 0.00 to 3.00, for PrefixSpan-1, PrefixSpan-2, FreeSpan and GSP]
Performance (Cont.)
[Figure: runtime (seconds) vs. support threshold (%), from 0.20 to 0.60, for PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (Pseudo) and PrefixSpan-2 (Pseudo)]
Performance (Cont.)
[Figure: runtime (thousand seconds) vs. number of sequences (thousands), from 0 to 500, for PrefixSpan-1 and PrefixSpan-2]
Conclusions

PrefixSpan is a novel, scalable, and efficient
sequential pattern mining method