Transcript Document
Early Profile Pruning on XML-aware Publish-Subscribe Systems
Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras
University of California
VLDB 2007
Presented by Lee Jae-won (SNU)

Introduction
Publish-subscribe (pub-sub) applications are an important class of asynchronous, content-based dissemination systems
Notification websites
– Users can subscribe to events of interest and are notified automatically when relevant events arrive in the system
Events are announced by messages generated outside the system by third-party applications, referred to as publishers
These messages are then selectively delivered to interested subscribers, who have announced their interest by submitting profiles
Center for E-Business Technology Copyright 2006 by CEBT IDS Lab. Seminar - 2

Introduction
Architecture of a pub-sub system
Matching process
– Finding (filtering) which messages satisfy which profile subscriptions

Introduction
Three predominant strategies for designing the matching process
Standard relational approach
– Profiles and messages are translated to the relational model
– The matching can be expressed as join operations
Index techniques
– The profiles are aggregated using an indexing technique
– The matching reads the input message and traverses the index to select the satisfied profiles
Finite State Machine (FSM)
– Top-down approach
– This paper proposes a bottom-up approach
XML documents have their more selective elements located at their leaves
Bottom-up Filtering FSM (BUFF)
Changing the order in which the document is evaluated also requires changing the order in which the query is evaluated
For query 1
– The NFA executes the transitions to states 1, 2 and 3 eleven times before reaching the final state 4
– BUFF executes the transitions to states 1, 2 and 3 only once, then reaches the final state 4

BUFF – Automaton Matching Process
Algorithm
A runtime stack (RS) stores the current document path being processed
For each opening tag of the XML document, the respective element e is pushed onto RS
For each closing tag, an element e is popped from RS

BUFF – Automaton Matching Process
Example
When a final state (in this case, 4) is reached, the algorithm ends
– The document and the query (BUFF) match

Bounding-based XML Filtering (BoXFilter)
BoXFilter
An XML filtering approach based on an index-based filtering technique
BoXFilter Core Module

Prüfer Sequence – Background
Example
Initially, vertex 1 is the leaf with the smallest label
– So it is removed first and “4” is put in the Prüfer sequence
Vertices 2 and 3 are removed next, so “4” is added twice more
Vertex 4 is now a leaf and has the smallest label
– So it is removed
– We append “5” to the sequence
– Only two vertices are left, so we stop (the sequence has n-2 entries)
The tree’s sequence is 4445
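
The Prüfer construction walked through above can be sketched in a few lines of Python. This is only an illustration of the standard textbook procedure, not code from the paper. The edge list is an assumption, since the tree figure is not part of this transcript: a 6-vertex tree with vertices 1, 2 and 3 attached to 4, 4 attached to 5, and 5 attached to 6, which is consistent with the removal order described on the slide. The function name prufer_sequence is hypothetical.

```python
import heapq


def prufer_sequence(edges):
    """Return the Prüfer sequence of a labeled tree given as (u, v) edges.

    Repeatedly removes the leaf with the smallest label and records the
    label of its only neighbour, until two vertices remain.
    """
    # Build an undirected adjacency map.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Min-heap of current leaves, so the smallest label is always removed first.
    leaves = [v for v, neighbours in adj.items() if len(neighbours) == 1]
    heapq.heapify(leaves)

    sequence = []
    for _ in range(len(adj) - 2):  # the sequence has n - 2 entries
        leaf = heapq.heappop(leaves)
        (neighbour,) = adj[leaf]  # a leaf has exactly one neighbour
        sequence.append(neighbour)
        adj[neighbour].remove(leaf)
        if len(adj[neighbour]) == 1:  # the neighbour has become a leaf
            heapq.heappush(leaves, neighbour)
    return sequence


# Assumed tree for the slide's example (the figure itself is not in the transcript).
example_edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
print(prufer_sequence(example_edges))  # [4, 4, 4, 5]
```

With the assumed edge list, the removal order is 1, 2, 3, 4, which reproduces the sequence 4445 stated on the slide.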
BoXFilter Sequence Envelope
The BoXFilter index tree is based on the concept of a sequence envelope
Assume that all profile sequences have the same length l
Consider a set of k Prüfer sequences of profiles, S1 … Sk
We derive two new sequences (of length l), called the lower and upper bounds, L and U respectively
– Li = min (S1i … Ski)
– Ui = max (S1i … Ski)
Example
– L = ABCABABABAB
– U = DEDEEEDEDED
– ∀i: Li ≤ S1i, …, Ski ≤ Ui
SE ≡ (L, U)

BoXFilter Sequence Envelope

Filtering Algorithms
We assume that 8 profiles have been submitted to the system
These are represented in XML format
The profiles are transformed into Prüfer sequences
These sequences are then inserted into the BoXFilter tree

Filtering Algorithms
For each index node, the figure shows the sequence envelope that contains all envelopes of its children
Input document D = ABCFABABABABF

Filtering Algorithm
Sequential Processing
Input document D = ABCFABABABABF
We start by comparing document D to the sequence envelope of the tree root
Since there is a match, we examine the root’s three child nodes
Only the first child (node 2) has a subsequence match
– So we ignore the subtrees under the second and third child nodes
We examine the children of node 2, which are leaf nodes
We do subsequence matching between the document and each of the strings in these leaf nodes
There is a match between the document and leaf number 1
Batch Processing
The matching is performed by joining the BoXFilter tree with the document tree (which has the form of a BoXFilter tree)
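
The sequence envelope definition (Li = min, Ui = max) and the sequential filtering steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses a single level of profile groups instead of a full index tree, the profile strings are made up for the example (the 8 profiles from the figure are not in this transcript), and the envelope test in envelope_may_match is one possible necessary condition chosen here, since the slides do not spell out the exact comparison. All function names are hypothetical.

```python
def envelope(sequences):
    """Return (L, U): position-wise minimum and maximum of equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    lower = "".join(min(column) for column in zip(*sequences))
    upper = "".join(max(column) for column in zip(*sequences))
    return lower, upper


def is_subsequence(profile, document):
    """True if the symbols of profile occur in document in the same order."""
    remaining = iter(document)
    return all(symbol in remaining for symbol in profile)


def envelope_may_match(lower, upper, document):
    """Assumed pruning test: for every position i, the document must contain,
    in order, a symbol inside [L_i, U_i]. If this fails, no sequence inside
    the envelope can be a subsequence of the document."""
    remaining = iter(document)
    return all(
        any(lo <= symbol <= hi for symbol in remaining)
        for lo, hi in zip(lower, upper)
    )


def filter_profiles(groups, document):
    """Skip whole groups whose envelope cannot match, then check survivors."""
    matches = []
    for group in groups:
        lower, upper = envelope(group)
        if not envelope_may_match(lower, upper, document):
            continue  # early pruning: no profile of this group is examined
        matches.extend(p for p in group if is_subsequence(p, document))
    return matches


document = "ABCFABABABABF"  # input document D from the slides
groups = [["ABCB", "ABFC"], ["GHGH", "HGHG"]]  # hypothetical profile sequences
print(filter_profiles(groups, document))
```

In this sketch the second group is discarded after a single envelope comparison, because the document contains no symbol between G and H, while the profiles of the first group are checked one by one.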
Experiments
Setup
Datasets with 1000, 10000 and 100000 small documents (up to 8KB)
Queries were specified for those datasets considering paths with 3 to 10 elements
Varying the Number of Documents

Experiments
Varying the Number of Queries
NFA and BUFF are linear in the number of documents and queries evaluated
BoXFilter has constant performance for a relatively small number of queries

Experiments
Varying the Selectivity
Selectivity means how many documents satisfy any of the queries

Experiments
Batching BoXFilter
B-BoXFilter is advantageous when the time spent on the extra step (tree creation) is less than the time saved in matching

Conclusion
We proposed an FSM-based approach (BUFF)
It evaluates the documents in a bottom-up order
We introduced the idea of early profile pruning and proposed a sequence-based index (BoXFilter)
It allows queries to be pruned very efficiently
First, documents and queries are transformed into sequences and grouped into envelopes
Then, the queries can be pruned by evaluating the lower and upper bounds of their envelopes
– This is the first time that the concept of envelopes is employed for XML query processing