Early Profile Pruning on XML-aware Publish/Subscribe Systems
Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras
University of California
VLDB 2007
Presented by Lee Jae-won (SNU)
Introduction
 Publish-subscribe applications (pub-sub) are an important class
of asynchronous content-based dissemination systems

Notification websites
–
Users can subscribe to events of interest and get automatic notification
when relevant events arrive in the system
 Events are announced with a message generated outside of the
system by third-party applications referred to as publishers

These messages are then selectively delivered to interested subscribers that
have announced their interest by submitting profiles
Center for E-Business Technology
Copyright © 2006 by CEBT
IDS Lab. Seminar - 2
Introduction
 Architecture of a pub-sub system

Matching process
–
Finding (filtering) which messages satisfy which profile subscriptions
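
As a rough illustration of the matching process, the sketch below checks one incoming XML message against a set of subscriber profiles. It assumes each profile is a simple path expression in the limited XPath subset supported by Python's xml.etree.ElementTree; the profile names, paths and message are hypothetical and do not come from the paper.

```python
import xml.etree.ElementTree as ET

def match_message(message_xml, profiles):
    """Return the ids of the profiles whose path selects something in the message."""
    root = ET.fromstring(message_xml)
    # A profile is satisfied when its path expression finds at least one element.
    return [pid for pid, path in profiles.items() if root.find(path) is not None]

# Hypothetical subscriptions: subscriber id -> path expression (relative to the root).
profiles = {
    "alice": "./book/title",
    "bob": ".//price",
    "carol": "./magazine",
}
message = "<catalog><book><title>XML</title><price>10</price></book></catalog>"

matched = match_message(message, profiles)
print(matched)  # ['alice', 'bob']
```

This naive loop evaluates every profile against every message, which is the cost that the strategies on the following slides try to avoid.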
Introduction
 Three predominant strategies to design the matching process



Standard relational approach
–
Translating profiles and messages to the Relational Model
–
The matching can be expressed as join operations
Index techniques
–
Aggregating the profiles using some indexing techniques
–
The matching reads the input message and traverses the index in order
to select the satisfied profiles
Finite State Machine (FSM)
–
Top-down approach
–
This paper proposes a bottom-up approach

An XML document has its more selective elements located at its leaves
Bottom-up Filtering FSM (BUFF)
 By changing the order in which the document is evaluated

We also need to change the order in which the query is evaluated

For query 1
–
The NFA executes transitions to states 1, 2 and 3 eleven times before
reaching the final state 4
–
BUFF executes transitions to states 1, 2 and 3 only once, then reaches
the final state 4
BUFF – Automaton Matching Process
 Algorithm

Keep a runtime stack (RS) that stores the current
document path being processed

For each opening tag of the XML document, the
respective element e is pushed onto RS

For each closing tag, an element e is popped from RS
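
The stack maintenance described above can be sketched as follows. This is a minimal illustration, assuming a streaming parse with Python's xml.etree.ElementTree.iterparse; the function name leaf_paths and the leaf-detection rule (a closing tag that directly follows its opening tag) are choices made for this sketch, not details taken from the paper.

```python
import io
import xml.etree.ElementTree as ET

def leaf_paths(xml_text):
    """Return the root-to-leaf tag paths seen on the runtime stack."""
    rs = []            # RS: runtime stack holding the current document path
    paths = []
    last_event = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            rs.append(elem.tag)          # opening tag: push element onto RS
        else:
            if last_event == "start":    # closed right after opening: a leaf
                paths.append(list(rs))
            rs.pop()                     # closing tag: pop element from RS
        last_event = event
    return paths

print(leaf_paths("<a><b><c/></b><d/></a>"))  # [['a', 'b', 'c'], ['a', 'd']]
```

Each time a leaf closes, RS holds the complete path from the root to that leaf, which is the point at which a bottom-up evaluation can begin.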
BUFF – Automaton Matching Process
 Example

When a final state (in this case, 4) is reached, the algorithm terminates
–
The document and the query (BUFF) match
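
To give a feel for bottom-up evaluation, the sketch below checks a path query against a root-to-leaf path by starting at the leaf end. It is a simplification and not the BUFF automaton itself: it assumes queries use only the child axis, are anchored at the leaf, and may contain the wildcard "*". The function name is hypothetical.

```python
def matches_bottom_up(query, path):
    """Check a child-axis path query against a root-to-leaf path, leaf first."""
    if len(query) > len(path):
        return False
    # Walk both sequences from their last element (the leaf) toward the root.
    for q_tag, p_tag in zip(reversed(query), reversed(path)):
        if q_tag != "*" and q_tag != p_tag:
            return False   # early rejection at the selective leaf end
    return True            # every query step was consumed: the final state

path = ["a", "b", "c"]                       # contents of the runtime stack
print(matches_bottom_up(["b", "c"], path))   # True
print(matches_bottom_up(["b", "d"], path))   # False
print(matches_bottom_up(["*", "c"], path))   # True
```

A mismatch at the leaf rejects the query after a single comparison, which reflects the intuition that the most selective elements sit at the leaves.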
Bounding-based XML Filtering (BOXFILTER)
 BoxFilter

An XML filtering approach based on an index-based filtering technique
 BoxFilter Core Module
Prufer Sequence – Background
 Example

Initially, vertex 1 is the leaf with the smallest label
–
So it is removed first and “4” is put in the Prufer sequence

Vertices 2 and 3 are removed next, so “4” is added twice more

Vertex 4 is now a leaf and has the smallest label
–
So it is removed
–
We append “5” to the sequence
–
We are left with only two vertices, so we stop (the sequence has n-2 entries)

The tree’s sequence is 4445
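
The construction above can be sketched in a few lines. The figure is not part of this transcript, so the edge list below is an assumption: it is a six-vertex tree chosen to be consistent with the steps described (vertices 1, 2 and 3 attached to 4, then 4 to 5, then 5 to 6).

```python
def prufer_sequence(edges):
    """Compute the Prufer sequence of a labeled tree given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seq = []
    while len(adj) > 2:
        # Remove the leaf with the smallest label and record its neighbor.
        leaf = min(v for v, nbrs in adj.items() if len(nbrs) == 1)
        (neighbor,) = adj.pop(leaf)
        adj[neighbor].remove(leaf)
        seq.append(neighbor)
    return seq

# Assumed tree, reconstructed from the steps in the text.
edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
print(prufer_sequence(edges))  # [4, 4, 4, 5]
```

With n = 6 vertices the loop runs n-2 = 4 times, giving the four-entry sequence 4445.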
BoxFilter
 Sequence Envelope

BoxFilter index tree is based on the concept of a Sequence Envelope

Assume that all profile sequences have the same length l

Consider a set of k Prufer sequences of profiles, S1 … Sk

We derive two new sequences (with length l) called the upper and the lower
bounds, U and L respectively

–
$L_i = \min(S_{1,i}, \ldots, S_{k,i})$
–
$U_i = \max(S_{1,i}, \ldots, S_{k,i})$
Example
–
L = ABCABABABAB
–
U = DEDEEEDEDED
–
$\forall i:\; L_i \le S_{1,i}, \ldots, S_{k,i} \le U_i$

SE ≡ (L, U)
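
The envelope definition translates directly into a position-wise minimum and maximum. The sketch below assumes sequences are plain strings of single-character labels compared alphabetically; the three profile sequences are hypothetical and are not the ones behind the L and U shown on the slide.

```python
def sequence_envelope(sequences):
    """Return (L, U): the position-wise minimum and maximum of equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length l")
    # zip(*sequences) yields, for each position i, the i-th symbol of every sequence.
    lower = "".join(min(column) for column in zip(*sequences))
    upper = "".join(max(column) for column in zip(*sequences))
    return lower, upper

# Hypothetical profile sequences of length l = 4.
profile_sequences = ["ABCD", "BBDA", "ACCC"]
lower, upper = sequence_envelope(profile_sequences)
print(lower, upper)  # ABCA BCDD
```

By construction every sequence in the set lies between L and U at every position, which is what later allows a whole group of profiles to be discarded with a single envelope test.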
BoxFilter
 Sequence Envelope
Filtering Algorithms
 We assume that there are 8 profiles submitted to the system

These are represented in XML format
 The profiles are transformed into Prufer sequences

Then these are inserted into the BoXFilter tree
Filtering Algorithms
 For each index node, the figure shows the sequence envelope
that contains all envelopes from its children

Input Document D = ABCFABABABABF
Filtering Algorithm
 Sequential Processing

Input Document D = ABCFABABABABF

We start by comparing document D to the sequence envelope at the root
of the index tree

Since there is a match we examine the root’s three children nodes

We have a subsequence matching only with the first child (node 2)
–
So we ignore the subtrees at the second and the third children nodes

We examine the children of node 2, which are now leaf nodes

We do subsequence matching between the document and each of the
strings in these leaf nodes

There is a match between the document and leaf number 1
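
The sequential traversal can be sketched as a recursive descent that skips any subtree whose envelope the document cannot satisfy. The envelope test used here is an assumption about how the comparison works: the document passes if it contains a subsequence whose i-th symbol falls within [L_i, U_i]. The Node class, the profile ids and the small two-leaf tree are hypothetical; only the document D comes from the slide.

```python
from dataclasses import dataclass, field

def is_subsequence(seq, doc):
    """True if seq can be obtained from doc by deleting symbols."""
    it = iter(doc)
    return all(ch in it for ch in seq)

def fits_envelope(doc, lower, upper):
    """True if doc has a subsequence whose i-th symbol lies in [lower[i], upper[i]]."""
    it = iter(doc)   # shared iterator: each position continues where the last stopped
    return all(any(lo <= ch <= hi for ch in it) for lo, hi in zip(lower, upper))

@dataclass
class Node:
    lower: str
    upper: str
    children: list = field(default_factory=list)   # inner node: child nodes
    profiles: dict = field(default_factory=dict)   # leaf node: profile id -> sequence

def filter_document(doc, node):
    """Return ids of profiles matched by doc, pruning subtrees by their envelope."""
    if not fits_envelope(doc, node.lower, node.upper):
        return []    # the whole subtree is pruned without visiting its profiles
    matched = [pid for pid, seq in node.profiles.items() if is_subsequence(seq, doc)]
    for child in node.children:
        matched.extend(filter_document(doc, child))
    return matched

# Hypothetical index tree with two leaves under one root.
leaf1 = Node("ABAB", "ABCF", profiles={"p1": "ABCF", "p2": "ABAB"})
leaf2 = Node("GGGG", "HHHH", profiles={"p3": "GHGH", "p4": "HGHG"})
root = Node("ABAB", "HHHH", children=[leaf1, leaf2])

doc = "ABCFABABABABF"
result = filter_document(doc, root)
print(result)  # ['p1', 'p2']
```

The pruning is safe under this reading: if any profile sequence in a subtree were a subsequence of the document, that same subsequence would also fit the subtree's envelope, so a failed envelope test rules out every profile below it.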
 Batch Processing

The matching process is performed by joining the BoXFilter tree and the
document tree (also in BoXFilter form)
Experiments
 Setup

Datasets with 1,000, 10,000 and 100,000 small documents (up to 8 KB)

Queries were specified for those datasets considering paths with 3 to 10
elements
 Varying the Number of Documents
Experiments
 Varying the Number of Queries

NFA and BUFF are linear in the number of documents and queries evaluated

BoXFilter has a constant performance for a relatively small number of queries
Experiments
 Varying the Selectivity

Selectivity means how many documents satisfy any of the queries
Experiments
 Batching BoXFilter

B-BoXFilter is advantageous when the time spent for the extra step (tree
creation) is less than the benefit in the matching time
Conclusion
 We proposed an FSM-based approach (BUFF)

It evaluates the documents in a bottom-up order
 We introduced the idea of early profile pruning and proposed a
sequence-based index (BoXFilter)

It allows queries to be pruned out very efficiently

First, documents and queries are transformed into sequences and grouped
into envelopes

Then, the queries can be pruned out by evaluating the lower and upper
bounds of their envelopes
–
This is the first time that the concept of envelopes is employed for XML
query processing