Adaptive Processing of Top-k Queries in XML Amelie Marian, Sihem Amer-Yahia

Adaptive Processing of Top-k Queries in XML
Amelie Marian, Sihem Amer-Yahia, Nick Koudas, Divesh Srivastava
Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)
XML
<book>
<title>wodehouse</title>
<info>
<publisher>
<name>psmith</name>
<location>london</location>
</publisher>
<isbn>1234</isbn>
</info>
<price>48.95</price>
</book>
<book>
<title>wodehouse</title>
<publish>
<name>psmith</name>
<location>london</location>
</publish>
<info>
<isbn>1234</isbn>
</info>
</book>
XML XPath
pc: parent-child
ad: ancestor-descendant
Scoring Function
The traditional tf*idf function comes from information retrieval (IR).
tf (term frequency): quantifies the relative importance of a keyword in an individual document.
idf (inverse document frequency): quantifies the relative importance of an individual keyword in the collection of documents.
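
The two quantities above can be sketched in a few lines of Python. This is a minimal sketch of one common tf*idf variant (length-normalized tf, logarithmic idf); the slides do not say which normalization is meant, so the function names and the toy corpus are assumptions for illustration only.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Relative frequency of `term` within a single document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: an unseen term contributes nothing
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    """Product of the per-document and per-collection weights."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus: each document is a list of keywords.
corpus = [
    ["wodehouse", "psmith", "london"],
    ["wodehouse", "jeeves"],
    ["london", "london", "guide"],
]
```

A keyword that occurs often in one document but in few documents overall, such as "guide" in the third document, receives a higher weight than one that appears across most of the collection.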
Scoring Function
XML, unlike traditional IR:
An answer to an XPath query need not be an entire document; it can be any node in a document.
An XPath query consists of several predicates linking the returned node to other query nodes, instead of simply "keyword containment in the document" (as in IR).
Scoring Function
XPath Component Predicates
XPath query Q
q0: query answer node
qi, 1 <= i <= l: other query nodes
p(q0, qi): XPath axis between query nodes q0 and qi, i >= 1
PQ (component predicates of Q): the set of predicates {p(q0, qi)}, 1 <= i <= l
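
The notation above can be made concrete with a small sketch. It models each component predicate p(q0, qi) as an axis ("pc" or "ad") plus the tag of qi, and checks which predicates a candidate answer node satisfies. The names `Predicate`, `related_nodes`, `satisfied`, and the example predicate set are assumptions for illustration; the two documents are the book fragments shown earlier.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple

class Predicate(NamedTuple):
    axis: str  # "pc" (parent-child) or "ad" (ancestor-descendant)
    tag: str   # tag of the query node q_i

def related_nodes(node, pred):
    """Nodes with tag pred.tag that stand in pred.axis relation to `node`."""
    if pred.axis == "pc":
        return [child for child in node if child.tag == pred.tag]
    if pred.axis == "ad":
        # iter() also yields `node` itself, so exclude it explicitly
        return [d for d in node.iter(pred.tag) if d is not node]
    raise ValueError(f"unknown axis: {pred.axis}")

def satisfied(node, predicates):
    """Subset of `predicates` that `node` satisfies at least once."""
    return [p for p in predicates if related_nodes(node, p)]

BOOK_A = ET.fromstring("""
<book><title>wodehouse</title>
  <info><publisher><name>psmith</name><location>london</location></publisher>
  <isbn>1234</isbn></info>
  <price>48.95</price></book>""")

BOOK_B = ET.fromstring("""
<book><title>wodehouse</title>
  <publish><name>psmith</name><location>london</location></publish>
  <info><isbn>1234</isbn></info></book>""")

# Hypothetical P_Q for a query whose answer node q0 is `book`.
P_Q = [
    Predicate("pc", "title"),
    Predicate("pc", "info"),
    Predicate("ad", "publisher"),
    Predicate("ad", "name"),
]
```

Under these assumptions the first book satisfies all four predicates, while the second misses only the `publisher` predicate because its element is tagged `publish`. Scoring predicates individually lets such a near match still be ranked rather than discarded.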
Scoring Function
XML idf
Scoring Function
XML tf
Scoring Function
XML tf*idf Score
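
A minimal sketch of how a predicate-level tf*idf score could be computed, assuming definitions modeled on the classic ones: the idf of a predicate is log(1 + N / N_p), where N is the number of candidate answer nodes and N_p the number of them satisfying the predicate; the tf of a predicate for a node is the number of distinct ways the node satisfies it; the score is the sum of idf * tf over the component predicates. These exact formulas, the function names, and the predicate list are assumptions, not taken from the slides.

```python
import math
import xml.etree.ElementTree as ET

def related(node, axis, tag):
    """Children ("pc") or proper descendants ("ad") of `node` with the given tag."""
    if axis == "pc":
        return [c for c in node if c.tag == tag]
    return [d for d in node.iter(tag) if d is not node]

def xml_tf(node, pred):
    """Number of ways `node` satisfies predicate `pred` = (axis, tag)."""
    axis, tag = pred
    return len(related(node, axis, tag))

def xml_idf(candidates, pred):
    """Assumed form: log(1 + |candidates| / |candidates satisfying pred|)."""
    hits = sum(1 for n in candidates if xml_tf(n, pred) > 0)
    if hits == 0:
        return 0.0  # assumption: a predicate nobody satisfies contributes nothing
    return math.log(1 + len(candidates) / hits)

def score(node, candidates, preds):
    """Sum of idf * tf over all component predicates."""
    return sum(xml_idf(candidates, p) * xml_tf(node, p) for p in preds)

DB = ET.fromstring("""
<db>
  <book><title>wodehouse</title>
    <info><publisher><name>psmith</name><location>london</location></publisher>
    <isbn>1234</isbn></info>
    <price>48.95</price></book>
  <book><title>wodehouse</title>
    <publish><name>psmith</name><location>london</location></publish>
    <info><isbn>1234</isbn></info></book>
</db>""")

books = DB.findall("book")
# Hypothetical component predicates for an answer node `book`.
PREDICATES = [("pc", "title"), ("ad", "publisher"), ("pc", "price")]
ranking = sorted(books, key=lambda b: score(b, books, PREDICATES), reverse=True)
```

In this sketch the first book ranks above the second: both satisfy the common `title` predicate, but only the first satisfies the rarer, and therefore more heavily weighted, `publisher` and `price` predicates.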
Whirlpool Architecture
Servers and Server Queues
Top-k Set
Router and Router Queue
Server Predicates Generation
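
As a rough illustration of the top-k set component, the sketch below keeps the k best-scoring answers in a bounded min-heap and exposes the current k-th score as a pruning threshold. It is a generic top-k structure, not Whirlpool's actual implementation; the class name `TopKSet`, its methods, and the pruning rule (discard a partial match whose maximum possible score cannot beat the current k-th score) are assumptions.

```python
import heapq

class TopKSet:
    """Holds at most k (score, answer_id) pairs, lowest score at the heap root."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._heap = []

    def threshold(self):
        """Score of the current k-th best answer, or -inf while fewer than k are held."""
        if len(self._heap) < self.k:
            return float("-inf")
        return self._heap[0][0]

    def offer(self, score, answer_id):
        """Try to add an answer; return True if it entered the top-k set."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, answer_id))
            return True
        if score > self._heap[0][0]:
            heapq.heapreplace(self._heap, (score, answer_id))
            return True
        return False

    def can_prune(self, max_possible_score):
        """True if a partial match with this upper bound can never enter the set."""
        return max_possible_score <= self.threshold()

    def ranked(self):
        """Current answers, best first."""
        return sorted(self._heap, reverse=True)
```

For example, with k = 2 and offers scored 1.0, 3.0, and 2.0, the set ends up holding the answers scored 3.0 and 2.0, and any partial match whose best achievable score is at most 2.0 could be dropped without further processing.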
Whirlpool
Scheduling between components
Single-threaded
Multi-threaded
Experimental
Conclusion
Whirlpool is an adaptive evaluation strategy for computing exact and approximate top-k answers of XPath queries.
We are investigating new directions such as increasing the number of threads per server for maximal parallelism.