EDBT’2011
GPX-Matcher: A Generic Boolean Predicate-based
XPath Expression Matcher
Mohammad Sadoghi, Ioana Burcea, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
MIDDLEWARE SYSTEMS
RESEARCH GROUP
MSRG.ORG
An X-ToPSS Project
http://msrg.org/tags/x-topss
The Problem in a Nutshell
[Figure: XML filtering viewed as pub/sub matching. XML = event/publication; XPath expressions (XPE, millions of XPE) = subscriptions (Boolean expressions); XML filtering = pub/sub engine; matched XPE = matched subscriptions]
Publish/Subscribe Systems
[Figure: publishers (stock markets: TSX, NASDAQ, NYSE) send publications to a broker; subscribers register subscriptions with the broker and receive notifications]
Subscriptions:
IBM > 85
ORCL < 10
JNJ > 60
X-ToPSS & GPX-Matcher
Pub/Sub Matching Algorithms
• Rete algorithm [Forgy, late 70s]
– A graph structure for correlating events and processing rules (solves a more
general problem)
• SIFT [Yan et al. TODS’94]
– Predicate counting, among other techniques
• Gough algorithm [Gough et al. ACSC‘95]
– Based on a finite state representation of subscriptions
• Gryphon algorithm [Aguilera, et al. PODC‘99]
– Decision tree over predicates
• Clustering algorithm [Fabret et al. SIGMOD‘01]
– Clusters subscriptions based on common predicates
• k-Index [Whang et al. VLDB‘09]
• Hardware-based matching acceleration [Sadoghi et al. VLDB‘10]
• BE-Tree [Sadoghi & Jacobsen, SIGMOD’2011]
The Key Question?
Can XML filtering benefit from the
efficient publish/subscribe matching
algorithms that have been developed over
more than three decades?
XML Filtering Challenges
• Filter XML according to XPEs
• Efficiently, at Internet scale, for
millions of XPEs, and for many
XML documents per unit of
time
[Figure: XML documents are filtered against XPath expressions (XPE, millions of XPE), yielding the matched XPE]
XML Filtering Systems
• Growing need for XML filtering
– Application-level firewalls
– Malware detection and prevention
– Document routing
– RSS aggregators
– XML-based messaging and application integration
• Selected industry players (XML appliances)
– Solace Systems
– IBM DataPower
– Talerian
– Sarvega (Intel)
• XML filtering systems are
publish/subscribe systems
• XPath & XML are subscription
and publication, respectively
The Core Problem
• XML Document Filtering Problem
– Given a set of XPath expressions Q and an XML
document d, find all expressions in Q that are
matched by d
• An expression q is matched by an XML
document d if and only if q selects a non-empty
set of nodes in d
– XPath expressions are used to select entire documents
or fragments of documents
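To make the problem statement concrete, here is a naive baseline that evaluates every expression against the document one by one. It is only a sketch and not the approach of this talk: it relies on the limited XPath subset of Python's `xml.etree.ElementTree`, and the function name `naive_filter`, the wrapper element, and the rule that a relative expression may start at any depth are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def naive_filter(xml_text, xpes):
    """Return the XPEs that select a non-empty set of nodes in the document."""
    # Wrap the root so that the root element itself can be selected as a child.
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(xml_text))
    matched = []
    for xpe in xpes:
        # ElementTree only accepts relative paths, so rewrite each XPE:
        #   absolute "/a/b" -> "./a/b"   (anchored at the document root)
        #   relative "a/b"  -> ".//a/b"  (assumed to start at any depth)
        path = "." + xpe if xpe.startswith("/") else ".//" + xpe
        if wrapper.find(path) is not None:
            matched.append(xpe)
    return matched


DOC = "<section><subsection><figure/></subsection><figure/></section>"
QUERIES = [
    "/section/subsection/figure",
    "section//figure",
    "*/figure",
    "/figure",
    "subsection/section",
]
print(naive_filter(DOC, QUERIES))
```

The cost of this baseline grows with the number of expressions, since each one is evaluated separately; that is the cost a shared matching structure tries to avoid.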
Agenda
• Supported XPath Language
• Mapping XML Filtering to Pub/Sub Matching
– XPath encoding
– XML encoding
• Experimental results
• Outlook
XML and XPath
XML fragment
<section>
<subsection>
<figure> …
</figure>
</subsection>
<figure> …
</figure>
</section>
XML tree
[Figure: tree with root section; children subsection and figure; subsection has child figure]
XML paths
section-subsection-figure
section-figure
XPath queries
Absolute queries
/section/subsection/figure
/section//subsection/figure
/section/*/figure
Relative queries
section/figure
section//figure
*/figure
Terminology: location step (e.g., section), child operator (/), descendant operator (//), wildcard (*)
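The XML paths listed on this slide are the root-to-leaf paths of the XML tree. A minimal sketch of how they could be enumerated follows, assuming Python's standard `xml.etree.ElementTree` parser; the helper name `root_to_leaf_paths` is hypothetical and not taken from the talk.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text):
    """Return every root-to-leaf path of the document as a list of tag names."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(elem, prefix):
        current = prefix + [elem.tag]
        children = list(elem)
        if not children:
            # A leaf closes one document path.
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, [])
    return paths


FRAGMENT = "<section><subsection><figure/></subsection><figure/></section>"
for path in root_to_leaf_paths(FRAGMENT):
    print("-".join(path))
```

For the fragment above, the two paths produced are section-subsection-figure and section-figure, in document order.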
XPath 2.0 Subset Considered
• Absolute path expressions
– /a/b
• Relative path expressions
– a/b/c
• Descendant operators in path expressions
– a/b//a/d
• Wildcards in path expressions
– a/*/*/b
• Not discussed, but shown how to address
– Filter predicates in path expressions
• <path>[@x>1]/<path>
– Nested path filters (the XPE becomes a tree)
• <path>[a/b]/<path>
Our Question(s)
• How can we map XPath expressions onto
subscriptions?
– Conjunctive Boolean formula over predicates
– S = (a1 op v1) ∧ (a2 op v2) ∧ … ∧ (an op vn)
• How can we map XML documents onto
publications?
– Set of attribute-value pairs
– P = {(a1, v1), (a2, v2), …, (am, vm)}
Predicate Calculus
• Single-tag predicate: $p_t \; op \; v$
• Double-tags predicate: $d(p_{t_1}, p_{t_2}) \; op \; v$
• End-tag predicate: $p_t^{\dashv} \ge v$
• Length-constraint predicate: $length \ge v$
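The four predicate types can be read as simple tests on a document path. The sketch below is an illustration under stated assumptions, not the implementation behind the talk: a path is a list of tag names without duplicates, positions are 1-based, the only operators are `=` and `>=`, and all function names are hypothetical.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge}


def single_tag(path, tag, op, v):
    """p_t op v: the 1-based position of tag t satisfies op v."""
    return tag in path and OPS[op](path.index(tag) + 1, v)


def double_tags(path, t1, t2, op, v):
    """d(p_t1, p_t2) op v: t2 lies after t1 at a distance satisfying op v."""
    if t1 not in path or t2 not in path:
        return False
    distance = path.index(t2) - path.index(t1)
    return distance > 0 and OPS[op](distance, v)


def end_tag(path, tag, v):
    """Tag t is at least v location steps away from the end of the path."""
    return tag in path and len(path) - (path.index(tag) + 1) >= v


def length_constraint(path, v):
    """The path has at least v location steps."""
    return len(path) >= v


print(single_tag(["b", "a", "c"], "b", "=", 1))
print(double_tags(["a", "x", "b"], "a", "b", ">=", 1))
print(end_tag(["a", "x", "y"], "a", 2))
print(length_constraint(["x", "y", "z"], 3))
```

The four calls reuse the example paths from the slides that follow (b-a-c, a-x-b, a-x-y, x-y-z).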
Single-tag Predicate Example
• XPath expression: /b/…
• Predicate: $p_b = 1$ (tag b at position 1)
• Example path b-a-c is encoded as (b, 1), (a, 2), (c, 3)
[Figure: XML tree with nodes b, a, d, c, showing the path b-a-c]
Double-tags Predicate Example I
• XPath expression: … a/b …
• Predicate: $d(p_a, p_b) = 1$ (the distance between tag a and tag b is one location step)
• Example path x-a-b is encoded as (x, 1), (a, 2), (b, 3)
[Figure: XML tree with nodes x, a, d, b, showing the path x-a-b]
Double-tags Predicate Example II
• XPath expression: a//b
• Predicate: $d(p_a, p_b) \ge 1$ (the distance between tag a and tag b is at least one location step)
• Example path a-x-b is encoded as (a, 1), (x, 2), (b, 3)
[Figure: XML tree with nodes a, x, d, b, showing the path a-x-b]
End-tag Predicate Example
• XPath expression: /a/*/*
• Predicate: $p_a^{\dashv} \ge 2$ (tag a is at least two location steps away from the path end)
• Example path a-x-y is encoded as (a, 1), (x, 2), (y, 3), (length, 3)
[Figure: XML tree with nodes a, x, d, y, showing the path a-x-y]
Length-constraint Predicate Example
• XPath expression: */*/*
• Predicate: $length \ge 3$ (the length of the path is at least 3)
• Example path x-y-z is encoded as (x, 1), (y, 2), (z, 3), (length, 3)
[Figure: XML tree with nodes x, y, d, z, showing the path x-y-z]
Putting it Together:
XPath Query Encoding Example
XPath queries, with occurrence numbers attached to each tag:
Q1: a/b//a becomes a1/b1//a2
Q2: a//b/d becomes a1//b1/d1
Q3: a/*/*/*//b/d becomes a1/*/*/*//b1/d1
Encoded queries:
Q1 = P1 ∧ P2, with P1: $d(p_{a_1}, p_{b_1}) = 1$ and P2: $d(p_{b_1}, p_{a_2}) \ge 1$
Q2 = P3 ∧ P4, with P3: $d(p_{a_1}, p_{b_1}) \ge 1$ and P4: $d(p_{b_1}, p_{d_1}) = 1$
Q3 = P5 ∧ P4, with P5: $d(p_{a_1}, p_{b_1}) \ge 4$ and P4: $d(p_{b_1}, p_{d_1}) = 1$
Our XPath encoding grows linearly
in the size of the XPath expression
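The encoding of Q1 to Q3 can be reproduced with a small encoder. This is a sketch under assumptions rather than the authors' encoder: predicates are plain tuples, the function name `encode_xpe` and the tuple layout are hypothetical, and the treatment of leading wildcards in relative expressions (a `>=` position predicate) is an assumption that the slides do not spell out.

```python
import re


def encode_xpe(xpe):
    """Encode a path expression as a list of predicate tuples."""
    steps = re.findall(r"(//|/)?([^/]+)", xpe)  # (separator, name) per step
    absolute = xpe.startswith("/") and not xpe.startswith("//")
    preds, occ = [], {}
    prev, gap, loose = None, 0, False
    for sep, name in steps:
        gap += 1                       # location steps since the last named tag
        loose = loose or sep == "//"   # a descendant operator relaxes "=" to ">="
        if name == "*":
            continue
        occ[name] = occ.get(name, 0) + 1
        tag = f"{name}{occ[name]}"     # tag with occurrence number, e.g. a1
        if prev is not None:
            preds.append(("dist", prev, tag, ">=" if loose else "=", gap))
        elif absolute:
            preds.append(("pos", tag, ">=" if loose else "=", gap))
        elif gap > 1:
            preds.append(("pos", tag, ">=", gap))
        prev, gap, loose = tag, 0, False
    if prev is None:                   # wildcards only: constrain the length
        preds.append(("length", ">=", gap))
    elif gap > 0:                      # trailing wildcards: end-tag predicate
        preds.append(("end", prev, ">=", gap))
    elif not preds:                    # lone relative tag: it must occur
        preds.append(("pos", prev, ">=", 1))
    return preds


for q in ["a/b//a", "a//b/d", "a/*/*/*//b/d", "/a/*/*", "*/*/*"]:
    print(q, encode_xpe(q))
```

Each named tag contributes at most one predicate, plus at most one trailing predicate, which is consistent with the linear growth stated on the slide.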
XML Document Path Encoding
Document path: a-b-c-d
Without duplicate tags (i.e., all occurrence numbers are 1): a1-b1-c1-d1
Publication (attribute-value pairs):
(length, 4),
(a1, 1), (b1, 2), (c1, 3), (d1, 4)
(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),
(b1, c1, 1), (b1, d1, 2),
(c1, d1, 1)
The resulting set of attribute-value “pairs” has O(n^2) entries for a path with n tags.
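The publication for a duplicate-free path can be generated mechanically. The following sketch assumes that every tag occurs once, so each occurrence number is 1; the function name `encode_path` and the tuple layout are illustrative choices, not taken from the talk.

```python
def encode_path(tags):
    """Encode a duplicate-free document path as a publication (list of tuples)."""
    names = [f"{t}1" for t in tags]            # every occurrence number is 1
    publication = [("length", len(tags))]
    # One (tag, position) pair per tag, positions are 1-based.
    publication += [(name, i + 1) for i, name in enumerate(names)]
    # One (first tag, second tag, distance) entry per ordered pair of tags.
    publication += [
        (names[i], names[j], j - i)
        for i in range(len(names))
        for j in range(i + 1, len(names))
    ]
    return publication


for entry in encode_path(["a", "b", "c", "d"]):
    print(entry)
```

A path with n tags yields 1 + n + n(n-1)/2 entries, which is where the quadratic size comes from; for a-b-c-d that is 1 + 4 + 6 = 11.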
Mapping XML Filtering to Pub/Sub Matching
[Figure: XML is mapped to an event/publication; XPath expressions (XPE, millions of XPE) are mapped to subscriptions (Boolean expressions); the pub/sub engine returns matched subscriptions, i.e., the matched XPE]
Matching Algorithms
• Pick any pub/sub matching algorithm
• We used
– Counting algorithm [exact origin is unknown]
– Clustering algorithm [Fabret, Jacobsen et al.,
2001]
• Both are two-phased matching algorithms
1. Predicate matching: Match all predicates.
2. Subscriptions matching: Match subscriptions
using the result from step 1.
Predicate Matching:
Single Tag Predicate
Publication:
(length, 4),
(a1, 1), (b1, 2), (c1, 3), (d1, 4)
(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),
(b1, c1, 1), (b1, d1, 2),
(c1, d1, 1)
• Hash on the tag
[Figure: each tag hashes to a table indexed by operator and predicate value (1 2 3 4). The predicate $p_a = 1$ with id i is stored under tag a, and the predicate $p_c = 3$ with id j is stored under tag c. Matched predicates set their bits in the predicate bit vector.]
Predicate forms handled: $p_t \; op \; v$, $p_t^{\dashv} \ge v$, $length \ge v$
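The predicate matching phase for single-tag predicates can be sketched as a hash lookup followed by setting bits. This is a simplified illustration: the class name `SingleTagIndex` is hypothetical, predicate ids are assigned in insertion order, and each tag maps to a short list that is scanned, whereas the slide indexes by operator and value within each tag.

```python
import operator
from collections import defaultdict

OPS = {"=": operator.eq, ">=": operator.ge}


class SingleTagIndex:
    """Hash index from tag to the single-tag predicates defined on it."""

    def __init__(self):
        self.by_tag = defaultdict(list)   # tag -> [(op, value, pid), ...]
        self.size = 0

    def add(self, tag, op, value):
        pid = self.size                   # predicate identifier
        self.by_tag[tag].append((op, value, pid))
        self.size += 1
        return pid

    def match(self, publication):
        """Return the predicate bit vector for one encoded document path."""
        bits = [0] * self.size
        for entry in publication:
            if len(entry) != 2 or entry[0] == "length":
                continue                  # only (tag, position) pairs matter here
            tag, position = entry
            for op, value, pid in self.by_tag.get(tag, ()):
                if OPS[op](position, value):
                    bits[pid] = 1
        return bits


index = SingleTagIndex()
i = index.add("a1", "=", 1)   # p_a = 1
j = index.add("c1", "=", 3)   # p_c = 3
k = index.add("b1", "=", 4)   # not satisfied by the publication below
publication = [("length", 4), ("a1", 1), ("b1", 2), ("c1", 3), ("d1", 4),
               ("a1", "b1", 1), ("a1", "c1", 2), ("a1", "d1", 3),
               ("b1", "c1", 1), ("b1", "d1", 2), ("c1", "d1", 1)]
print(index.match(publication))
```

Only the tags that occur in the publication are looked up, so predicates on other tags are never touched.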
Subscription Matching:
Clustering Algorithm
• Cluster queries based on the access predicates
• Access predicates shared by all queries in cluster
• Only check clusters whose access predicates are matched
• Open Question: how to choose an effective access predicate
[Figure: clusters of queries, each guarded by an access predicate; clusters whose access predicate pi is false are skipped]
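The idea of access predicates can be sketched as follows. This is a minimal illustration, not the clustering algorithm of Fabret et al.: the function names are hypothetical, queries are sets of predicate ids, and the access predicate of each query is supplied by the caller, which leaves the open question of how to choose it untouched.

```python
from collections import defaultdict


def build_clusters(queries, access):
    """Group queries by access predicate.

    queries: {query name: set of predicate ids}
    access:  {query name: predicate id chosen as its access predicate}
    """
    clusters = defaultdict(list)
    for name, pids in queries.items():
        ap = access[name]
        # Store only the remaining predicates; the access predicate is implied.
        clusters[ap].append((name, set(pids) - {ap}))
    return clusters


def match_clusters(clusters, satisfied):
    """Check only the clusters whose access predicate is satisfied."""
    matched = []
    for ap in sorted(satisfied):
        for name, rest in clusters.get(ap, ()):
            if rest <= satisfied:
                matched.append(name)
    return matched


queries = {"Q1": {1, 2}, "Q2": {3, 4}, "Q3": {5, 4}}
access = {"Q1": 2, "Q2": 4, "Q3": 4}   # assumed choice: the last predicate
clusters = build_clusters(queries, access)
print(match_clusters(clusters, {3, 4}))
```

With predicates 3 and 4 satisfied, the cluster guarded by predicate 2 is never inspected, and within the cluster guarded by predicate 4 only Q2 has all of its remaining predicates satisfied.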
Experimental Evaluation
• All algorithms implemented in C
– GPX – the base encoding with counting
– GPX-ap – the base encoding with clustering (access pred.)
– YFilter & BPA
• DTDs used for generating workloads
– NITF DTD (News Industry Data Format)
– PSD DTD (Protein Sequence Database)
• Total filtering time averaged over
500 XML documents
– XML parsing time is negligible in
the overall filtering time
• Intel Quad-Core 2.66 GHz, 4 GB
Scalability in Number of XPEs
All XPEs are distinct
[Plot not reproduced; annotations: 1 ms vs. 18 ms, ap on first, ap on last]
Scalability in Number of XPEs
The XPE workload contains duplicates
Effect of Path Length
Effect of Wildcards
Conclusions
• Novel XML/XPath encoding
• Leverages existing matching techniques
• Differs significantly from predominantly
automata-based related work
• Outperforms related approach by an order of
magnitude under many experimental
conditions
Thank You!
• To learn more about X-ToPSS, please see
– http://msrg.org/tags/x-topss
Agenda
• XML-based Filtering Systems
• Mapping XML Filtering to Pub/Sub Matching
– XPath encoding
– XML encoding
• Experimental results
• Outlook
Content-based Publish/Subscribe
• Subscription: Boolean expression over predicates (each predicate is an
attribute-operator-value triple)
(subject = news) ∧ (topic = travel) ∧ (date > 21.2.2011)
• Publication (a.k.a. event): set of attribute-value
pairs
(subject, news), (topic, travel), (date, 21.2.2011), …
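A conjunctive subscription matches a publication when every predicate holds for the corresponding attribute value. The sketch below illustrates this with the example from the slide; reading 21.2.2011 as day.month.year, the operator table, and the function name `matches` are assumptions for illustration.

```python
import operator
from datetime import date

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def matches(subscription, publication):
    """True if every (attribute, op, value) predicate holds for the publication."""
    attrs = dict(publication)
    return all(
        attr in attrs and OPS[op](attrs[attr], value)
        for attr, op, value in subscription
    )


subscription = [
    ("subject", "=", "news"),
    ("topic", "=", "travel"),
    ("date", ">", date(2011, 2, 21)),
]
same_day = [("subject", "news"), ("topic", "travel"), ("date", date(2011, 2, 21))]
next_day = [("subject", "news"), ("topic", "travel"), ("date", date(2011, 2, 22))]

print(matches(subscription, same_day))
print(matches(subscription, next_day))
```

Because the date predicate is strict, a publication dated exactly 21.2.2011 does not satisfy it, while one dated a day later does.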
The Pub/Sub Matching Problem
• Given an event, e, and a set of subscriptions,
S, determine all subscriptions, s ∈ S, that
match e.
[Figure: an event / publication and the subscriptions go in; the matches come out]
Wide Applicability
• Selective information dissemination
• Location-based services
• Personalization, alerting services
• Application integration
• Service & resource discovery
• Network and distributed system management
• Monitoring, surveillance, and control
• Workforce management
• Workload management & job scheduling
• Business activity monitoring
• Business process management, monitoring, and execution
Matching Algorithm Techniques
• Amortized storage & processing
• Access predicates
• Cost model-driven subscription partitioning
• Cache-conscious data structure layout
• Asynchronous cache-level pre-fetching
• Event queue re-ordering and batch processing
• Parallelization of algorithms for SMP & multi-core
• FPGA-based acceleration (hardware-level)
eXtensible Markup Language
• XML – de facto standard for data exchange
– Web Services, data and application integration,
information dissemination
• XPath – XML query language
– Also used as the basis for other languages (e.g.,
XQuery, XPointer, XSLT)
Our Research Goal
• Solve the XML filtering problem using content-based pub/sub matching algorithms.
• Why?
– Build on and exploit several decades’ worth of
insights, rather than constructing special-purpose
solutions.
In a Nutshell
[Figure: XML tree (section, subsection, figure, figure) with its document paths section-subsection-figure and section-figure, alongside the encoded XPath expressions]
Special-purpose XML/XPath
Filtering Algorithms
• XFilter [Altinel et al. VLDB’00]
• WebFilter [Pereira et al. VLDB’01]
• YFilter [Diao et al. TODS’03]
• XTrie [Chan et al. ICDE’03]
• AFilter [Candan et al. VLDB’06]
• BPA [Huo & Jacobsen, ICDE’06]
• BoXFilter [Moro et al. VLDB’07]
• pFiST [Kwon et al. DKE’08]
From XML Filtering to
Publish/Subscribe Matching
• XPath expressions are encoded in a predicate
calculus
• XML documents are expressed as a set of paths
from the root to a leaf in the document tree
– Each path is translated into sets of attribute-value
pairs (tags and their location in the path)
• Matching algorithm
– The attribute-value pairs are matched against the
predicates with traditional pub/sub matching
algorithms
Possible Extensions
• Extend predicate calculus to encompass other
XPath 2.0 features
• Alternative encodings
• Exploit DTD or schema information
• Exploit information about XPath expressions
processed
X-ToPSS: XML-based Toronto
Publish/Subscribe System
• Distributed, content-based publish/subscribe (cf.
ICDCS’08)
– Exploit DTDs (Document Type Definition) to optimize
subscription routing in distributed pub/sub systems
– Explain covering and merging optimizations for
XML/XPath
• Alternative predicate-based XML/XPath
matching algorithm that cannot exploit
traditional pub/sub schemes (cf. ICDE’06)
• Encoding presented herein, cf. EDBT’2011
(forthcoming)
http://msrg.org/tags/x-topss
Example: XPath Query Encoding
P1 d ( pa , pb )  1
1
1
P2 d ( pb , pa )  1
1
a
1b1
1 2 3 4
1
=
5 
3
b
1a2
=

2
P3 d ( pa , pb )  1
1
1
P4 d ( pb , pd )  1
1
1
1d1
P5 d ( pa , pb )  4
1
2
4
=

1
Predicate identifier (pid)
X-ToPSS & GPX-Matcher
49
That’s Like Database Querying!
[Figure labels: a query is posed over data tuples (about the past) and returns sets of tuples; a publication is matched against subscriptions (about the future) and returns sets of tuples]
Query and subscription are very similar.
Data tuples and publication are very similar.
However, the two problem statements are inverse.
XML Document Path Encoding Example (with Duplicates)
Document path: a-b-c-b-d
With duplicate tags, the slide shows two encodings of the path.
Encoding 1: a1-b1-c1-b2-d1
(length, 5),
(a1, 1), (b1, 2), (c1, 3), (b2, 4), (d1, 5)
(a1, b1, 1), (a1, c1, 2), (a1, b2, 3), (a1, d1, 4),
(b1, c1, 1), (b1, b2, 2), (b1, d1, 3),
(c1, b2, 1), (c1, d1, 2),
(b2, d1, 1)
Encoding 2: a1-c1-b1-d1 (the first b is crossed out, so the remaining b becomes b1; positions are unchanged)
(length, 5),
(a1, 1), (c1, 3), (b1, 4), (d1, 5)
(a1, c1, 2), (a1, b1, 3), (a1, d1, 4),
(c1, b1, 1), (c1, d1, 2),
(b1, d1, 1)
Predicate Matching:
Double Tags Predicate
Publication:
(length, 4),
(a1, 1), (b1, 2), (c1, 3), (d1, 4)
(a1, b1, 1), (a1, c1, 2), (a1, d1, 3),
(b1, c1, 1), (b1, d1, 2),
(c1, d1, 1)
• Hash on the first tag
• Hash on (occ # first tag, second tag, occ # second tag)
• Look up by predicate operator and predicate value (1 2 3 4) for predicates of the form $d(p_{t_1}, p_{t_2}) \; op \; v$
[Figure: index entries for the tag pairs a1b1, b1a2, and b1d1]
Matching Algorithm
1. Match all predicates (predicate matching)
and record results in predicate bit vector
2. Match subscriptions based on predicate bit
vector (subscriptions matching)
From here on, nothing is really new (we
reuse pub/sub matching algorithms, as promised).
Subscription Matching:
Counting Algorithm
Q1 = P1 ∧ P2
Q2 = P3 ∧ P4
Q3 = P5 ∧ P4
• For each predicate, associate the queries that contain it
P1: Q1
P2: Q1
P3: Q2
P4: Q2, Q3
P5: Q3
• For each query, record the number of predicates
Q1: 2, Q2: 2, Q3: 2
• For each query, count the number of satisfied predicates
Example: P3 and P4 match, so Q2 reaches its count of 2 and is matched; Q3 has only one satisfied predicate and Q1 has none.
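The counting step can be sketched in a few lines. This is an illustration of the general counting idea, with hypothetical function names, and it assumes that the set of satisfied predicate ids has already been produced by the predicate matching phase.

```python
from collections import defaultdict


def build_counting_index(queries):
    """queries: {query name: list of predicate ids in its conjunction}."""
    pred_to_queries = defaultdict(list)   # predicate id -> queries containing it
    required = {}                         # query -> number of predicates
    for name, pids in queries.items():
        required[name] = len(pids)
        for pid in pids:
            pred_to_queries[pid].append(name)
    return pred_to_queries, required


def count_match(pred_to_queries, required, satisfied):
    """Return the queries whose predicates are all in the satisfied set."""
    hits = defaultdict(int)
    for pid in satisfied:
        for name in pred_to_queries.get(pid, ()):
            hits[name] += 1
    return sorted(name for name, n in hits.items() if n == required[name])


queries = {"Q1": [1, 2], "Q2": [3, 4], "Q3": [5, 4]}
pred_to_queries, required = build_counting_index(queries)
print(count_match(pred_to_queries, required, {3, 4}))
```

Only queries that contain at least one satisfied predicate are ever touched, which is why Q1 never gets a counter in this run.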
Related Work: YFilter
[Diao et al. TODS’03]
Q1: a/b//a
Q2: a//b/d
Q3: a/*/*/*//b/d
[Figure: YFilter automaton for Q1 to Q3, with transitions labeled a, b, d, *, and ε; one accepting state for Q1 and one for Q2, Q3]
Longer-term Vision
• Map matching problems for different
languages onto an efficient pub/sub matching
kernel
• For example, for:
– Graph-structured query / data (RSS, RQL)
– Tree-structured query / data (XML / XPath)
– Regular expressions / sentences
– Etc.