Early Profile Pruning on XML-aware Publish/Subscribe Systems
Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras
University of California
VLDB 2007
Presented by Lee Jae-won (SNU)
Introduction
Publish-subscribe applications (pub-sub) are an important class
of asynchronous content-based dissemination systems
Notification websites
–
Users subscribe to events of interest and are notified automatically
when relevant events arrive in the system
Events are announced by messages generated outside the
system by third-party applications, referred to as publishers
These messages are then selectively delivered to interested subscribers that
have announced their interest by submitting profiles
Center for E-Business Technology
Copyright 2006 by CEBT
IDS Lab. Seminar - 2
Introduction
Architecture of a pub-sub system
Matching process
–
Finding (filtering) which messages satisfy which profile subscriptions
Introduction
Three predominant strategies to design the matching process
Standard relational approach
–
Translating profiles and messages to the Relational Model
–
The matching can be expressed as join operations
Index techniques
–
Aggregating the profiles using some indexing techniques
–
The matching process reads the input message and traverses the index
to select the satisfied profiles
Finite State Machine (FSM)
–
Top-down approach
–
This paper proposes a bottom-up approach
XML documents have their more selective elements located at their leaves
Bottom-up Filtering FSM (BUFF)
Changing the order in which the document is evaluated
also requires changing the order in which the query is evaluated
For query 1
–
The NFA executes transitions to states 1, 2 and 3 eleven times before
reaching the final state 4
–
BUFF executes transitions to states 1, 2 and 3 only once, then reaches
the final state 4
BUFF – Automaton Matching Process
Algorithm
Keep a runtime stack (RS) that stores the current
document path being processed
For each opening tag of the XML document, the
respective element e is pushed onto RS
For each closing tag, an element e is popped from RS
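The stack maintenance described above can be sketched as follows. This is a minimal illustration, not the BUFF automaton itself: the function name leaf_paths and the choice to record the stack contents at each leaf are assumptions made only to show what RS holds while the document is parsed.

```python
import io
import xml.etree.ElementTree as ET


def leaf_paths(xml_text):
    """Return the root-to-leaf tag paths of an XML document.

    A runtime stack (rs) is kept in sync with the parser: each opening
    tag pushes the element, each closing tag pops it.
    Recording the path at leaves is an illustrative assumption.
    """
    rs = []      # runtime stack: current document path
    paths = []   # snapshots of rs taken at leaf elements
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            rs.append(elem.tag)          # opening tag: push
        else:
            if len(elem) == 0:           # no child elements: a leaf
                paths.append(list(rs))
            rs.pop()                     # closing tag: pop
    return paths


paths = leaf_paths("<a><b><c/></b><d/></a>")
print(paths)
```

At every closing tag the stack holds exactly the path from the root to the element being closed, which is the information a bottom-up matcher needs when it starts from a leaf.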
BUFF – Automaton Matching Process
Example
When a final state (in this case, 4) is reached, the algorithm terminates
–
Document & query (BUFF) are matched
Bounding-based XML Filtering (BOXFILTER)
BoxFilter
An XML filtering approach based on index-based filtering techniques
BoxFilter Core Module
Prufer Sequence – Background
Example
Initially, vertex 1 is the leaf with the smallest label
–
So it is removed first and “4” is put in the Prufer sequence
Vertices 2 and 3 are removed next, so “4” is appended twice more
Vertex 4 is now a leaf and has the smallest label
–
So it is removed
–
We append “5” to the sequence
–
Only two vertices remain, so the process stops after n-2 removals
The tree’s sequence is 4445
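The construction above can be sketched in a few lines. The edge list below is an assumption reconstructed from the example (vertices 1, 2 and 3 attached to 4, and 4 attached to 5); the label 6 for the last remaining vertex is hypothetical, since the original figure is not reproduced here.

```python
import heapq


def prufer_sequence(edges):
    """Classic Prufer sequence of a labeled tree given as (u, v) edges.

    Repeatedly remove the leaf with the smallest label and append the
    label of its neighbor, until only two vertices remain.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Min-heap of the current leaves (vertices with one neighbor).
    leaves = [v for v, nbrs in adj.items() if len(nbrs) == 1]
    heapq.heapify(leaves)

    seq = []
    for _ in range(len(adj) - 2):        # n - 2 removals
        leaf = heapq.heappop(leaves)
        (parent,) = adj[leaf]            # the leaf's only neighbor
        seq.append(parent)
        adj[parent].remove(leaf)
        del adj[leaf]
        if len(adj[parent]) == 1:        # parent became a leaf
            heapq.heappush(leaves, parent)
    return seq


# Assumed tree for the example; vertex 6 is a hypothetical label.
example_edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
print(prufer_sequence(example_edges))  # [4, 4, 4, 5]
```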
BoxFilter
Sequence Envelope
The BoXFilter index tree is based on the concept of a sequence envelope
Assume that all profile sequences have the same length l
Consider a set of k Prufer sequences of profiles, $S_1, \dots, S_k$
We derive two new sequences (of length l) called the upper and the lower
bound, or U and L respectively
–
$L_i = \min(S_{1,i}, \dots, S_{k,i})$
–
$U_i = \max(S_{1,i}, \dots, S_{k,i})$
Example
–
L = ABCABABABAB
–
U = DEDEEEDEDED
–
$\forall i:\ L_i \le S_{j,i} \le U_i$ for every $j = 1, \dots, k$
$SE \equiv (L, U)$
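The envelope definition can be illustrated with a short sketch. The function name sequence_envelope and the three sample sequences are hypothetical; they are not the profiles behind the L and U shown in the example above. Labels are compared by ordinary character order.

```python
def sequence_envelope(sequences):
    """Return (L, U) for a non-empty list of equal-length sequences.

    L[i] is the smallest and U[i] the largest symbol found at
    position i across all sequences.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    columns = list(zip(*sequences))      # position-wise grouping
    lower = "".join(min(col) for col in columns)
    upper = "".join(max(col) for col in columns)
    return lower, upper


# Hypothetical profile sequences, chosen only for illustration.
profiles = ["ABCA", "DBDE", "BEDC"]
lower, upper = sequence_envelope(profiles)
print(lower, upper)  # ABCA DEDE
```

Every input sequence lies position by position between the two bounds, which is the property the index relies on when it discards a whole group of profiles at once.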
Filtering Algorithms
We assume that there are 8 profiles submitted to the system
These are represented in XML format
The profiles are transformed into Prufer sequences
Then these are inserted into the BoXFilter tree
Filtering Algorithms
For each index node, the figure shows the sequence envelope
that contains all envelopes from its children
Input Document D = ABCFABABABABF
Filtering Algorithm
Sequential Processing
Input Document D = ABCFABABABABF
We start by comparing document D to the sequence envelope at the root
of the query index tree
Since there is a match, we examine the root’s three child nodes
Subsequence matching succeeds only for the first child (node 2)
–
So we ignore the subtrees rooted at the second and third child nodes
We examine the children of node 2, which are now leaf nodes
We do subsequence matching between the document and each of the
strings in these leaf nodes
There is a match between the document and leaf number 1
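The sequential traversal above can be sketched as follows, under stated assumptions. The Node class, the greedy envelope test and the sample profiles are all hypothetical; the envelope test assumed here checks whether the document contains a subsequence whose i-th symbol lies between L[i] and U[i], which is a necessary condition for any profile inside the envelope to be a subsequence of the document. Only the document D is taken from the slides.

```python
from dataclasses import dataclass, field


def matches_envelope(doc, lower, upper):
    """Greedy test: does doc contain a subsequence inside the envelope?"""
    pos = 0
    for lo, hi in zip(lower, upper):
        while pos < len(doc) and not (lo <= doc[pos] <= hi):
            pos += 1
        if pos == len(doc):
            return False
        pos += 1
    return True


def is_subsequence(profile, doc):
    """True if profile appears in doc in order, gaps allowed."""
    it = iter(doc)
    return all(ch in it for ch in profile)


@dataclass
class Node:
    lower: str
    upper: str
    children: list = field(default_factory=list)
    profiles: list = field(default_factory=list)  # only at leaf nodes


def filter_document(node, doc):
    """Return the profiles matched by doc, skipping pruned subtrees."""
    if not matches_envelope(doc, node.lower, node.upper):
        return []                        # whole subtree is pruned
    matched = [p for p in node.profiles if is_subsequence(p, doc)]
    for child in node.children:
        matched.extend(filter_document(child, doc))
    return matched


# Hypothetical two-leaf index; the envelopes were computed by hand.
leaf_a = Node("ABE", "BCF", profiles=["ABF", "BCE"])
leaf_b = Node("GGA", "HHB", profiles=["GHA", "HGB"])
root = Node("ABA", "HHF", children=[leaf_a, leaf_b])

matched = filter_document(root, "ABCFABABABABF")
print(matched)  # ['ABF']
```

In this sample the second leaf is discarded by a single envelope comparison, so neither of its profiles is compared against the document.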
Batch Processing
The matching process is performed by joining the BoXFilter tree with the
document tree (also in the form of a BoXFilter tree)
Experiments
Setup
Datasets with 1,000, 10,000 and 100,000 small documents (up to 8 KB each)
Queries were specified for those datasets considering paths with 3 to 10
elements
Varying the Number of Documents
Experiments
Varying the Number of Queries
NFA and BUFF scale linearly with the number of documents and queries evaluated
BoXFilter has constant performance for a relatively small number of queries
Experiments
Varying the Selectivity
Selectivity is the number of documents that satisfy at least one of the queries
Experiments
Batching BoXFilter
B-BoXFilter is advantageous when the time spent on the extra step (tree
creation) is less than the time it saves during matching
Conclusion
We proposed an FSM-based approach (BUFF)
It evaluates the documents in a bottom-up order
We introduced the idea of early profile pruning and proposed a
sequence-based index (BoXFilter)
It allows queries to be pruned out very efficiently
First, documents and queries are transformed into sequences and grouped
into envelopes
Then, the queries can be pruned out by evaluating the lower and upper
bounds of their envelopes
–
This is the first time that the concept of envelopes is employed for XML
query processing