Making XML Documents Searchable through the Web Dongwook Shin

XML Developer’s conf.
August 19 1999
Making XML Documents Searchable
through the Web
Dongwook Shin
Lister Hill National Center for Biomedical Informatics
National Library of Medicine
Importance of XML Search Engine

More and more documents are being provided in XML format.

XML documents are expected to have a defined structure.
 Current Web search engines do not provide structural search capability.
Dongwook Shin, National Library of Medicine
Searching Characteristics

Content Searching
 Searching for certain words in the element hierarchy
 Retrieve CHAPTER whose TITLE contains “servlet” and
PARAGRAPH contains “session”.

Structural Searching
 Searching for elements satisfying certain relations
 Retrieve SECTION that has at least two FIGUREs
Searching Characteristics (Cont’d)

Combined Searching
 Content + Structural Searching
 Retrieve SECTION that has a TITLE containing “XML” and contains at least one FIGURE.
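The three kinds of search can be illustrated with a minimal sketch. It uses Python's standard `xml.etree.ElementTree` and a linear scan over a made-up document; it says nothing about how XRS evaluates queries internally. The sample data and the helper names `contains`, `count`, and `titles` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the element names come from the slides.
SAMPLE = """<BOOK>
  <SECTION><TITLE>XML basics</TITLE><FIGURE/><PARAGRAPH>intro</PARAGRAPH></SECTION>
  <SECTION><TITLE>XML tools</TITLE><PARAGRAPH>no figures</PARAGRAPH></SECTION>
  <SECTION><TITLE>Servlets</TITLE><FIGURE/><FIGURE/></SECTION>
</BOOK>"""


def contains(element, tag, word):
    """Content condition: some <tag> in the subtree has text containing word."""
    return any(word.lower() in "".join(d.itertext()).lower()
               for d in element.iter(tag))


def count(element, tag):
    """Structural condition helper: number of <tag> elements in the subtree."""
    return sum(1 for _ in element.iter(tag))


def titles(elements):
    """Return the TITLE text of each element, for display."""
    return [e.findtext("TITLE") for e in elements]


sections = list(ET.fromstring(SAMPLE).iter("SECTION"))

# Content search: SECTION whose TITLE contains "XML"
content_hits = [s for s in sections if contains(s, "TITLE", "XML")]
# Structural search: SECTION that has at least two FIGUREs
structural_hits = [s for s in sections if count(s, "FIGURE") >= 2]
# Combined search: TITLE contains "XML" and at least one FIGURE
combined_hits = [s for s in sections
                 if contains(s, "TITLE", "XML") and count(s, "FIGURE") >= 1]

print(titles(content_hits))
print(titles(structural_hits))
print(titles(combined_hits))
```

Matching here is case-insensitive substring matching, which is a simplification chosen for the sketch.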
Other XML Search Engines on the Web

Most engines provide search in a fixed set of fields.

Users cannot search in arbitrary elements of the document hierarchy.
http://www.goxml.com
http://www.scoobs.com
http://www.xmlTree.com
XRS (XML Retrieval System)

Providing a variety of structural search functions
 Users can search in any elements in the document hierarchy
 Content + Structural Searching

Allowing less index overhead and quick retrieval time
 BUS (Bottom Up Scheme) is used

Applicable to valid documents, but not to documents that are only well-formed
 The DTD is used when composing queries and retrieving results
 Example collections are the Shakespeare and Bible data
Architecture of XRS
[Diagram] Server side: the Search Engine runs in a separate Java process and talks to the Query Mediator Servlet through socket communication. Client side: an Applet provides the User Interface, and a Rendering Component displays results. Labeled flows: query; search result; XML result with XSL; HTML format; initiate Web browser.
User Interface (Initialization)
[Screenshot] Callouts: the DTD can be browsed here; query conditions appear here; search results are shown here with their similarity values.
Query Composition

Principle
 Any element can be a target, that is, the element to be retrieved
 Search conditions can be imposed on any element

EXAMPLE
 Retrieve SPEECH whose SPEAKER contains ‘Hamlet’ and LINE contains ‘Denmark’
 Target: SPEECH
 Search condition: SPEAKER contains ‘Hamlet’ and LINE contains ‘Denmark’
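The split between a target and its search conditions can be sketched as a small data structure. The `Query` class, the `run` function, and the sample play below are hypothetical; the speech lines are invented placeholder text, not quotations. The sketch scans the tree directly, whereas XRS answers such queries from its index.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Query:
    target: str                                      # element type to retrieve
    conditions: dict = field(default_factory=dict)   # element name -> required word


# Invented sample data using element names from the Shakespeare DTD.
PLAY = """<PLAY>
  <SCENE><TITLE>A platform before the Castle</TITLE>
    <SPEECH><SPEAKER>HORATIO</SPEAKER><LINE>placeholder line one</LINE></SPEECH>
    <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>placeholder line about Denmark</LINE></SPEECH>
  </SCENE>
  <SCENE><TITLE>A hall</TITLE>
    <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>placeholder line two</LINE></SPEECH>
  </SCENE>
</PLAY>"""


def matches(element, conditions):
    """True if every condition is met somewhere inside the element."""
    for tag, word in conditions.items():
        texts = ("".join(e.itertext()) for e in element.iter(tag))
        if not any(word.lower() in text.lower() for text in texts):
            return False
    return True


def run(query, root):
    """Return every target element that satisfies all search conditions."""
    return [e for e in root.iter(query.target) if matches(e, query.conditions)]


root = ET.fromstring(PLAY)
speeches = run(Query("SPEECH", {"SPEAKER": "Hamlet", "LINE": "Denmark"}), root)
scenes = run(Query("SCENE", {"TITLE": "Castle", "SPEAKER": "Horatio"}), root)

print(len(speeches), [s.findtext("TITLE") for s in scenes])
```

Changing only `target` from SPEECH to SCENE retrieves a different level of the hierarchy with the same machinery, which is the point of letting any element be a target.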
User Interface (Query Composition)
User Interface (with Search Results)
Browser Side

[Screenshot] Callout: Show XML results
Browsing a List of Elements
XML Result
Query at Another Target Element
 Retrieve SCENE whose TITLE contains ‘Castle’
and SPEAKER contains ‘Horatio’
XML Result
Query Mediator Servlet

Mediate the query and results
 Convey the user query to the backend search engine
 Transmit the retrieved results to the applet or the rendering component
  Send the result sets with brief information to the applet
  Send the XML content with a suitable XSL stylesheet to the rendering component, which transforms it into HTML

Session tracking and result set reclamation
 Track each session so that users can keep using it until they quit
 Detect dead sessions periodically and reclaim the corresponding result sets
Query Language

INIT
 Get the DBs and their DTDs available on the server
 Sent to the server when the applet is initialized

SEARCH db_name search_cond
 db_name is one of the DBs available on the server
 search_cond includes the target and the search conditions

PRES num
 Get the XML result
 num selects the num-th result in the result set
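A minimal sketch of how the three commands might be parsed on the server side. The slides give only the command names and arguments, so the whitespace-separated wire format, the `Command` class, and `parse_command` are assumptions; `search_cond` is kept as an opaque string because its syntax is not specified.

```python
from dataclasses import dataclass


@dataclass
class Command:
    verb: str
    db_name: str = ""
    search_cond: str = ""   # kept opaque; its syntax is not given in the slides
    num: int = 0


def parse_command(line):
    """Parse one INIT, SEARCH, or PRES command line (assumed text format)."""
    parts = line.strip().split(None, 2)
    if not parts:
        raise ValueError("empty command")
    verb = parts[0].upper()
    if verb == "INIT" and len(parts) == 1:
        return Command("INIT")
    if verb == "SEARCH" and len(parts) == 3:
        return Command("SEARCH", db_name=parts[1], search_cond=parts[2])
    if verb == "PRES" and len(parts) == 2:
        return Command("PRES", num=int(parts[1]))
    raise ValueError(f"malformed command: {line!r}")


init_cmd = parse_command("INIT")
# The search condition text below is a made-up placeholder.
search_cmd = parse_command("SEARCH shakespeare SPEECH SPEAKER Hamlet")
pres_cmd = parse_command("PRES 3")

print(init_cmd, search_cmd, pres_cmd, sep="\n")
```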
Result Set

A result set is assigned to each session
 Query Mediator does session tracking

Backend Search engine keeps multiple result sets
 Multi-thread safe code is required

When a session is relinquished, the result set is reclaimed
 Garbage collection for the result set is required
The Content of a Result Set

DB_name
 The name of the database where the search is performed and the
result is obtained

DB_path
 The directory path from the root where the DB resides

ptr_to_result_set
 pointer to the dynamic arrays having the search results

num_result
 number of elements retrieved

ptr_to_K_ary_table
 pointer to the table that keeps the k_ary information for the DB
RS (Result Set) Management
[Diagram] Components: Query Mediator; Session Monitor; backend search engine; Result Set Index Table with entries marked Used or Unused; the i-th RS holding the actual result. Labeled steps: session comes; RS index requested; RS index returned (i); i-th RS accessed; i-th RS returned.
Periodical RS Reclamation
[Diagram] Components: Query Mediator; Session Monitor; backend search engine; Result Set Index Table with entries marked Used, Unused, or "Used but to be reclaimed"; the i-th and j-th RS holding actual results. Labeled steps: session ends; alive sessions sent; RS indices to be reclaimed returned (j, ...); reclamation requested (j, ...); reclaimed; reclamation done.
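The bookkeeping behind result set management can be sketched as a table of Used and Unused entries that is cleaned up from a list of alive sessions. This is a simplified single-process illustration: the class `ResultSetIndexTable`, its methods, and the session names are hypothetical, and the split of responsibilities between the Query Mediator and the backend is not modeled. A lock is included because the slides call for thread-safe code.

```python
import threading


class ResultSetIndexTable:
    """Fixed-size table mapping sessions to result set entries."""

    def __init__(self, size):
        self._lock = threading.Lock()
        self._slots = [None] * size   # None marks an Unused entry
        self._owner = {}              # session id -> index of its result set

    def acquire(self, session_id):
        """Return the session's entry, taking an Unused one for a new session."""
        with self._lock:
            if session_id in self._owner:
                return self._owner[session_id]
            for index, slot in enumerate(self._slots):
                if slot is None:
                    self._slots[index] = []   # stand-in for the actual results
                    self._owner[session_id] = index
                    return index
            raise RuntimeError("no unused result set entry")

    def reclaim(self, alive_sessions):
        """Free every entry whose session is not alive; return freed indices."""
        with self._lock:
            dead = [s for s in self._owner if s not in alive_sessions]
            indices = []
            for session_id in dead:
                index = self._owner.pop(session_id)
                self._slots[index] = None
                indices.append(index)
            return sorted(indices)

    def used(self):
        """Indices of the entries currently marked Used."""
        with self._lock:
            return sorted(self._owner.values())


table = ResultSetIndexTable(size=3)
i = table.acquire("session-A")
j = table.acquire("session-B")
reclaimed = table.reclaim(alive_sessions={"session-A"})   # session-B has ended
k = table.acquire("session-C")                            # reuses the freed entry

print(i, j, reclaimed, k, table.used())
```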
Backend Search Engine

Less Indexing Overhead and Quick Retrieval
 Use BUS (Bottom Up Scheme)
 Most of the code is written in native C

Support Multi-threading
 Thread-safe C code
 Compile the C code into a shared library

Save index information in files
BUS (Bottom Up Scheme)

Main Idea
 Index only at the lowest level of the document structure
 Weight information at higher levels is computed at retrieval time

Benefits
 Minimize the indexing overhead
 Support term weights and full-blown structural search
 Guarantee quick retrieval time
Principle of BUS
Indexing is performed at leaf nodes only. Term frequency at higher levels is computed at run time.
[Figure] Two views of the same tree: a document tree with index terms, and the Bottom Up Scheme, in which only the leaf nodes carry counts and the other counts are derived.
 chapter: hypertext(10), browser(4), internet(5), multimedia(5), java(7) (computed)
  section1: hypertext(2), browser(4) (indexed)
  section2: hypertext(8), internet(5), multimedia(5), java(7) (computed)
   para1: hypertext(3), internet(3), multimedia(5) (indexed)
   para2: hypertext(5), internet(2), java(7) (indexed)
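The bottom-up idea can be checked with the counts from this example tree: only leaf nodes are stored, and the counts of section2 and chapter are summed on demand. The dictionaries `LEAF_INDEX` and `CHILDREN` and the function `term_frequencies` are names chosen for this sketch, and the tree shape is the one the leaf counts imply.

```python
from collections import Counter

# Stored index: term counts exist only for leaf nodes.
LEAF_INDEX = {
    "section1": Counter(hypertext=2, browser=4),
    "para1": Counter(hypertext=3, internet=3, multimedia=5),
    "para2": Counter(hypertext=5, internet=2, java=7),
}

# Tree shape assumed from the example: section1 is a leaf, section2 has two paragraphs.
CHILDREN = {
    "chapter": ["section1", "section2"],
    "section2": ["para1", "para2"],
}


def term_frequencies(node):
    """Compute a node's term counts at retrieval time by summing its leaves."""
    if node in LEAF_INDEX:
        return Counter(LEAF_INDEX[node])
    total = Counter()
    for child in CHILDREN.get(node, []):
        total += term_frequencies(child)
    return total


section2 = dict(term_frequencies("section2"))
chapter = dict(term_frequencies("chapter"))
print(section2)
print(chapter)
```

The computed values agree with the counts shown for section2 and chapter, while the index itself holds only the three leaf entries.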
Key Issues in BUS

How can the ancestor elements of a leaf element be found efficiently?

How can the term frequencies be accumulated effectively?
UID (Unique element IDentifier)
Represent each document as a complete k-ary tree and assign a UID to each node.

parent(i) = floor((i - 2) / k) + 1

[Figure] A 3-ary tree with real nodes a to j; the remaining positions of the complete tree are virtual nodes.

Result of assigning UIDs
element  UID
a        1
b        2
c        3
d        4
e        5
f        8
g        9
h        14
i        15
j        16
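The parent formula can be tried against the UIDs in the table. In this sketch, `parent` follows the formula on the slide; `children` and `ancestors` are helpers derived from it for illustration and do not appear in the slides.

```python
def parent(uid, k):
    """UID of the parent in a complete k-ary tree numbered from 1 at the root."""
    if uid <= 1:
        raise ValueError("the root (UID 1) has no parent")
    return (uid - 2) // k + 1


def children(uid, k):
    """UIDs of the k child positions (real or virtual); inverse of parent()."""
    first = k * (uid - 1) + 2
    return list(range(first, first + k))


def ancestors(uid, k):
    """UIDs on the path from the node's parent up to the root."""
    path = []
    while uid > 1:
        uid = parent(uid, k)
        path.append(uid)
    return path


# UIDs from the table, for the 3-ary example tree.
UIDS = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5,
        "f": 8, "g": 9, "h": 14, "i": 15, "j": 16}
K = 3

parent_of_h = parent(UIDS["h"], K)          # h is a child of e
parent_of_f = parent(UIDS["f"], K)          # f is a child of c
path_from_j = ancestors(UIDS["j"], K)       # j -> e -> b -> a
slots_under_e = children(UIDS["e"], K)      # positions of h, i, j

print(parent_of_h, parent_of_f, path_from_j, slots_under_e)
```

Because ancestors are found by arithmetic alone, no parent pointers need to be stored in the index.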
K-ary table

Each document is assigned a value k, the maximum number of siblings in its document tree.

Each collection has a K-ary table; each entry holds the k of one document.

Each result set has a pointer to the K-ary table.
Level and Element Type Number

Level
 The level of an element in the document tree
 It indicates how many times the parent function must be applied to reach a target element

Element type number
 A unique number is assigned to each element type in the DTD (not to the elements in documents)
 It makes it possible to filter out unnecessary elements and accumulate the correct frequencies
Level and Element Type Number (Cont’d)

User query: Retrieve sections that contain “hypertext”
[Figure] Document tree with levels and index information:
 Level 1: chapter: hypertext(9), browser(1), internet(5), multimedia(5), java(7)
 Level 2 (user level): section1: hypertext(8), internet(5), multimedia(5), java(7)
 Level 3 (text level): para1: hypertext(3), internet(3), multimedia(5); para2: hypertext(5), internet(2), java(7)
 title (index information): hypertext(1), browser(1)
The level difference tells how many times the parent function is applied.
The element type number lets unnecessary index information be filtered out.
Representing a Document Tree
[Figure] Document tree in which each node carries a four-number tuple. The nesting below follows from the UIDs with k = 3.
 <5,1,1,1>
  <5,2,2,2>
  <5,3,2,3>
   <5,8,3,5>
   <5,9,3,5>
  <5,4,2,3>
   <5,11,3,6>
    <5,32,4,7>
    <5,33,4,7>
   <5,12,3,6>
    <5,35,4,7>
    <5,36,4,7>
Index term lists attached to the leaf nodes:
 hypertext(1), model(1), retrieval(1), semantics(1)
 index(3), lexical(1), noun(4), stem(2)
 document(4), index(3), precision(2), term(5)
 document(4), index(3), precision(1), term(5)
 browser(2), hypertext(2), java(5), link(6)
 anchor(2), browser(1), html(3), internet(5)
 basian(3), inquiry(2), link(3), matrix(3)
Query Evaluation

Create accumulators at the user level
 Accumulators correspond to the elements at the user level

Compute the TF (Term Frequency) and DF (Document Frequency) of a term
 Sum the term frequencies of the descendant elements into the corresponding accumulators
 The number of non-zero accumulators is the DF of the term

Calculate the term weight

Compute the similarity of the elements and rank them
Accumulating Term Frequency
query: find sections containing ‘browser’
[Figure] Subtree of the tree shown under Representing a Document Tree, with an accumulator array whose entries for <5,11> and <5,12> are highlighted (values as extracted: 6, 11).
 <5,11,3,6>
  <5,32,4,7>: browser(4), index(3), precision(1), term(5)
  <5,33,4,7>: browser(2), hypertext(2), java(5), link(6)
 <5,12,3,6>
  <5,35,4,7>: anchor(2), browser(1), html(3), internet(5)
  <5,36,4,7>: basian(3), inquiry(2), link(3), matrix(3)
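The evaluation steps can be sketched with the ‘browser’ postings from this subtree. Several things here are assumptions: the tuple fields are read as (document, UID, level, element type number), k = 3 is inferred from the UIDs, the element type filter is modeled as a simple allow-list on the posting's type, and the weight `tf * log(1 + N / df)` is a generic TF-IDF style stand-in because the slides do not give the weighting formula.

```python
import math
from collections import defaultdict

K_ARY_TABLE = {5: 3}   # document id -> k (k = 3 is inferred, not stated)

# Leaf postings for 'browser': (doc, uid, level, element_type) -> term frequency
BROWSER_POSTINGS = {
    (5, 32, 4, 7): 4,
    (5, 33, 4, 7): 2,
    (5, 35, 4, 7): 1,
}


def parent(uid, k):
    """Parent UID in a complete k-ary tree."""
    return (uid - 2) // k + 1


def accumulate(postings, user_level, allowed_types=None):
    """Sum leaf term frequencies into accumulators at the user level."""
    acc = defaultdict(int)
    for (doc, uid, level, etype), tf in postings.items():
        if allowed_types is not None and etype not in allowed_types:
            continue                      # filtered out by element type number
        if level < user_level:
            continue                      # lies above the user level
        k = K_ARY_TABLE[doc]
        for _ in range(level - user_level):   # level difference = parent steps
            uid = parent(uid, k)
        acc[(doc, uid)] += tf
    return dict(acc)


def rank(acc, total_elements):
    """Score with an assumed TF-IDF style weight and sort best first."""
    df = sum(1 for tf in acc.values() if tf > 0)   # non-zero accumulators
    if df == 0:
        return []
    idf = math.log(1 + total_elements / df)
    scored = [(key, tf * idf) for key, tf in acc.items() if tf > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Sections sit at level 3 in this tree, so level 3 is the user level.
accumulators = accumulate(BROWSER_POSTINGS, user_level=3, allowed_types={7})
document_frequency = len(accumulators)
ranking = [key for key, _ in rank(accumulators, total_elements=2)]

print(accumulators, document_frequency, ranking)
```

With these postings the accumulator for <5,11> receives 4 + 2 and the one for <5,12> receives 1, so both sections match and <5,11> ranks first.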
Performance Data (in Ultra Sparc 2)

Index Overhead

Collection      Data size (Mb)  Posting file (Mb)  Index overhead (%)  Index time (hh/mm)
PATENT (SGML)   256             120                46.87               1/30
SHAKE (XML)     7               2.8                40.00               < /02
CLIN (XML)      3               1.36               45.33               < /01

Retrieval time
 Almost all single-term queries are evaluated within one second
Advantages of XRS

Provides a variety of structural search functions

Low indexing overhead and quick retrieval time

Easy to port
 Java + native C code
 The C code is built as shared libraries
Alternative Architecture of XRS
[Diagram] Server side: the Search Engine is a shared library that the Query Mediator Servlet calls through a JNI interface. Client side: an Applet provides the User Interface, and a Rendering Component displays results. Labeled flows: query; search result; XML result with XSL; HTML format; initiate Web browser.
Benefit and Problem

Benefit
 Simpler and easier to port than the current implementation
 Does not need an independent Java process

Problem
 Current Java servlet engines cannot run the shared libraries
  Apache Jserv, Jrun and Jigsaw all fail to run them
Current Status

Development of the content retrieval part is finished
 Available on the Web at the end of August 1999
 http://dlb2.nlm.nih.gov/~dwshin

The structural retrieval part is in development
 It will be finished soon
