Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University Dayton, OH-45435, USA.

Download Report

Transcript Flexible Querying of XML Documents Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University Dayton, OH-45435, USA.

Flexible Querying of XML
Documents
Krishnaprasad Thirunarayan and Trivikram Immaneni
Department of Computer Science and Engineering
Wright State University
Dayton, OH-45435, USA
1
Talk Outline
Goal (What?)
Background and Motivation (Why?)
Query Language and Examples (What?)
Implementation Details (How?)
Evaluation and Applications (Why?)
Conclusions
2
Goal
3
Develop a keyword-based XML Query
Language and its Semantics that is


flexible and sufficiently expressive
easy to use (for query formulation)
Implement, reusing mature software
components, for efficient indexing
and search
4
Background and Motivation
5
XML vs Text Documents
DATA: Exploit metadata/markup and
aggregation structure implicit in XML
documents

For expressiveness and precision
QUERY: Obtain progressively
improved extractions using convenient
keyword-based queries in contrast
with accurate extractions using
complex XML-based queries
6
Relationship to Other Work
Extends XSEarch (Cohen et al)

Expressive power: Incorporates
attributes and their values
 Equivalence (E.g., RDF) :
<T A="s"/>
vs
 <T> <A> s </A> </T>

7

Invariance under Refinement
<T> <A> word_1 and word_2 </A> </T>
vs
 <T> <A>

<B> word_1 </B>

and

<C> word_2 </C>
 </A> </T>

8
Coherence
Interconnectedness : Infer related
pieces of information using
aggregation implicit in XML

Cohen et al : Name equivalence


Li et al: Structural equivalence


XSEarch
Scheme-free XML
Guo et al : Completeness

XRANK
9
Information Retrieval

Explore robust relevance ranking
strategy to deal with high recall
Variation on TFIDF
 Naïve implementation computationally
prohibitive
 Extension beyond “type-delimited-document”
unclear

10
Query Language and Examples (What?)
11
Query Syntax
Entity-Attribute-Keyword
Search Terms
e:a:k
 e:a:, :a:k, e::k
 e::, a::, ::k

Signed/optional Search Terms

+ e:a:k vs e:a:k
12
Single Search Term Satisfaction
e:a:k

The search term e:a:k is satisfied by a
tree containing a subtree with the top
element e that is associated with the
attribute a with value containing k, or a
subelement a with descendant text node
containing k.
13
Example (Mondial)
<country id="f0_149" name="Austria" capital="f0_1467" population="8023244"
datacode="AU" total_area="83850" population_growth="0.41"
infant_mortality="6.2" ... government="federal republic" ...>
...
</country>
:name:Vienna is satisfied by
<province id="f0_17447" name="Vienna" ...>
<city id="f0_1467" country="f0_149" province="f0_17447" ...>
<name>Vienna</name> <population year="94">1583000</population>
</city>
</province> ...
name::Vienna is satisfied by a part of it
<name>Vienna</name>.
14
Example (Heterogeneity)
<author><name>Adam Dingle</name></author>
<author name="A. Dingle" ></author>
<article id="3">
@inproceedings{IMN97,
author="Adam Dingle and Ed MacNair
and Thao Nguyen",
…
</article>
author:name:Dingle misses the last one.
15
Query Answer Candidate
Query Answer Candidate for the
query Q(t_1,t_2,...,t_m), is a

Most preferred satisfying collection of
trees (P_1,P_2,...,P_m)

Precise : smallest enclosing

Adequate : optional search terms satisfied
as much as possible
16
Query Answer
Query Answer for the query
Q(t_1,t_2,...,t_m), is a

Query Answer Candidate
(P_1,P_2,...,P_m) in which

Trees P_i’s are Interconnected

Specifies trees related to the same “realworld” entity
17
Interconnectedness (Cohen et al)
Two subtrees T_a and T_b are said
to be interconnected if the path
from their roots to the lowest
common ancestor does not contain
two distinct nodes with the same
element, or the only distinct nodes
with the same element are these
roots.
18
Interconnectedness (Li et al)
Two subtrees T_a and T_b are said
to be interconnected, if the path
from T_a's root to their lowest
common ancestor in the tree does not
contain another node that is the
lowest common ancestor of T_a and a
distinct subtree T_b‘ , where T_b'
has the same root element label as
T_b.
19
Interconnectedness (Two Approaches)
20
Implementation Details (How?)
21
Tools Used
Apache Lucene 2.0 APIs in Java
A high-performance, text search engine
library with smart indexing strategies.
 Further tuned for memory-centric
operation in contrast with disk-centric
defaults

SAXParser APIs
22
Mapping to Lucene
XML documents to Lucene documents
for indexing
XML keyword-based queries to
Lucene queries for searching
ENCODING

XML fragment of an XML document is
referred to internally using the filename
and the XPath (of the XML fragment's
root from the XML document root)
23
Evaluation and Application (Why?)
24
Experiments
DATASETs: Sigmod, Mondial, and DBLP.
PLATFORMS:

For Sigmod and Mondial datasets: HP xw9300
Workstation with 2 GHz AMD Opteron dual-core
processor (270), 4 GB of main memory, and 250 GB 7200
rpm hard drive, running 32-bit Windows XP.
(java -Xms750M -Xmx1500M).

For DBLP dataset: SUN Ultra-40 Workstation with 2.4
GHz dual AMD Opteron dual-core processor (280), 8GB
of main memory, and 250GB 7500 rpm hard drive, running
64-bit Solaris 10.
(java -Xms1000M -Xmx3600M).
25
Dataset Sizes
DATASET
SIZE
Sigmod
468 KB
Mondial
1743 KB
DBLP
337 MB
26
Dataset Indexing via Lucene
DATASET
INDEXING
TIME
INDEX
SIZE
Sigmod
32 sec
6 MB
Mondial
180 sec
16 MB
DBLP
36 hrs
4 GB
27
Query Answer:
Computation Time vs Display Time
DATASET
SIMPLE
QUERY
Sigmod
35 ms / 1 sec
Mondial
DBLP
COMPLEX
QUERY
400 ms / 3
min
25 ms / 350 1 sec / 2 min
ms
335 ms / 1
--sec
28
More Subtle Example
In Extended Paper:
A Coherent Keyword-Based XML
Query Language
29
Pubs.xml
<publications>
- <book>
<title>Modern Information Retrieval</title>
<author>Ricardo Baeza-Yates</author>
<author>Berthier Ribeiro-Neto</author>
- <chapter>
<title>Digital Libraries</title>
<author>Edward A. Fox</author>
<author>Ohm Sornil</author>
</chapter>
</book>
- <article>
<title>The Anatomy of a Large-Scale Hypertextual Web Search Engine</title>
<author>Sergey Brin</author>
<author>Lawrence Page</author>
</article>
- <article>
<title>An Algorithm for Suffix Stripping</title>
<author>M.F.Porter</author>
</article>
- <article>
<title>Indexing by Latent Semantic Analysis</title>
</article>
</publications>
30
Characteristics of Pubs.xml
Total number of authors = 7
Total number of titles = 5
Title Distribution =
1 book (with 1 chapter) + [3 articles]
Author Distribution =
2
( 2 ) + [2 + 1 + 0]
31
Queries to Pubs.xml (Answer counts)
Arbitrary mix and match of authors
and titles = 7 * 5 = 35
author::, title:: (8 hits)
+author::, +title:: (7 hits)
+author::, +title::, +author:: (4 hits)
32
Completeness ( ::pWord, ::qWord) (Guo et al)
33
Conclusions
34
Developed declarative semantics for
keyword-based XML Query language with
an effective query answering algorithm
Developed a notion of interconnectedness
that provides coherent answers
Implemented using Lucene 2.0 APIs
Indexing: Time and Space Intensive
But
 Query Answering: Quick

35
THANK YOU!
36