Making XML Documents Searchable through the Web Dongwook Shin
XML Developer’s conf.
August 19 1999
Making XML Documents Searchable
through the Web
Dongwook Shin
Lister Hill National Center for Biomedical Informatics
National Library of Medicine
Importance of XML Search Engine
More and more documents are beginning to be provided
in XML formats.
XML documents are supposed to have certain structures
Current Web Search Engines do not provide structural search
capability
Dongwook Shin, National Library of Medicine
Searching Characteristics
Content Searching
Searching for certain words in the element hierarchy
Retrieve CHAPTER whose TITLE contains “servlet” and
PARAGRAPH contains “session”.
Structural Searching
Searching for elements satisfying certain relations
Retrieve SECTION that has at least two FIGUREs
Searching Characteristics (Cont’d)
Combined Searching
Content + Structural Searching
Retrieve SECTION that has a TITLE containing “XML” and
contains at least one FIGURE.
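The three kinds of search can be illustrated with a minimal Python sketch built on the standard xml.etree.ElementTree module. It is not how XRS evaluates queries. The sample document, the helper names (contains, content_search, structural_search, combined_search) and the reading of "contains" as a case-insensitive substring match on any descendant element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names follow the slide examples.
DOC = """
<BOOK>
  <CHAPTER>
    <TITLE>Servlet basics</TITLE>
    <SECTION>
      <TITLE>XML output</TITLE>
      <PARAGRAPH>Each session keeps its own state.</PARAGRAPH>
      <FIGURE/>
      <FIGURE/>
    </SECTION>
    <SECTION>
      <TITLE>Logging</TITLE>
      <PARAGRAPH>No figures here.</PARAGRAPH>
    </SECTION>
  </CHAPTER>
</BOOK>
"""


def contains(element, tag, word):
    """True if some descendant <tag> of element has text containing word."""
    return any(
        word.lower() in "".join(d.itertext()).lower()
        for d in element.iter(tag)
        if d is not element
    )


def content_search(root, target, conditions):
    """Content searching: targets whose descendants contain the given words."""
    return [
        t for t in root.iter(target)
        if all(contains(t, tag, word) for tag, word in conditions)
    ]


def structural_search(root, target, inner_tag, min_count):
    """Structural searching: targets with at least min_count inner_tag descendants."""
    return [
        t for t in root.iter(target)
        if sum(1 for d in t.iter(inner_tag) if d is not t) >= min_count
    ]


def combined_search(root, target, conditions, inner_tag, min_count):
    """Combined searching: both the content and the structural condition must hold."""
    hits = structural_search(root, target, inner_tag, min_count)
    return [t for t in hits if all(contains(t, tag, word) for tag, word in conditions)]


root = ET.fromstring(DOC)
chapters = content_search(root, "CHAPTER", [("TITLE", "servlet"), ("PARAGRAPH", "session")])
sections = structural_search(root, "SECTION", "FIGURE", 2)
combined = combined_search(root, "SECTION", [("TITLE", "XML")], "FIGURE", 1)
print(len(chapters), len(sections), len(combined))
```

This sketch walks the whole tree for every query, which is exactly the cost an index is meant to avoid.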
Other XML Search Engines on the Web
Most engines provide search in a fixed set of fields.
Users cannot search in arbitrary elements of the document hierarchy.
http://www.goxml.com
http://www.scoobs.com
http://www.xmlTree.com
XRS (XML Retrieval System)
Providing a variety of structural search functions
Users can search in any elements in the document hierarchy
Content + Structural Searching
Low indexing overhead and quick retrieval time
BUS (Bottom Up Scheme) is used
Applicable to valid documents, but not to documents that are merely well-formed
The DTD is used when composing queries and retrieving results
Example collections: the Shakespeare and Bible data
Architecture of XRS
[Figure: XRS architecture]
Server Side: Search Engine (Java process), connected by socket communication to the Query Mediator Servlet
Client Side: Applet (User Interface) and Rendering Component
Flows: query (Applet to Query Mediator Servlet); search result (to the Applet); XML result with XSL (to the Rendering Component); HTML format; initiate Web browser
User Interface (Initialization)
DTD can be browsed here
Query conditions appear here
Search results are shown here with similarity value
Query Composition
Principle
Any element can be a target - the element to be retrieved
Search conditions can be imposed on any element
EXAMPLE
Retrieve SPEECH whose SPEAKER contains ‘Hamlet’ and LINE contains ‘Denmark’
Target: SPEECH
Search condition: SPEAKER contains ‘Hamlet’ and LINE contains ‘Denmark’
User Interface (Query Composition)
User Interface (with Search Results)
Browser Side
Show XML results
Browsing a List of Elements
XML Result
Query at Another Target Element
Retrieve SCENE whose TITLE contains ‘Castle’
and SPEAKER contains ‘Horatio’
XML Result
Query Mediator Servlet
Mediates queries and results
Conveys the user query to the backend search engine
Transmits the retrieved results to the applet or the rendering component
Sends the result sets with brief information to the applet
Sends the XML content with a proper XSL to the rendering component, which transforms it into HTML
Session tracking and result set reclamation
Tracks each session so that a user can keep using it until he or she quits
Periodically detects dead sessions and reclaims the corresponding result sets
Query Language
INIT
Get the DBs and their DTDs available in the server
It is sent to the server when the applet is initialized
SEARCH db_name search_cond
db_name is one of DBs available in the server
search_cond includes the target and search conditions
PRES num
Get the XML content of a result
num selects the num-th result in the result set
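As a rough illustration of the three commands, the following Python sketch parses one command line into a small record. The slides give only the command names and their arguments, so the wire format (one whitespace-separated text line), the Command class, the parse_command function and the sample search_cond string are all assumptions. The syntax of search_cond is not specified in the slides and is kept as an opaque string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    name: str                          # "INIT", "SEARCH" or "PRES"
    db_name: Optional[str] = None      # SEARCH only
    search_cond: Optional[str] = None  # SEARCH only; syntax not specified, kept opaque
    num: Optional[int] = None          # PRES only


def parse_command(line: str) -> Command:
    """Parse one command line; raises ValueError on malformed input."""
    parts = line.strip().split(None, 2)
    if not parts:
        raise ValueError("empty command")
    name = parts[0].upper()
    if name == "INIT" and len(parts) == 1:
        return Command("INIT")
    if name == "SEARCH" and len(parts) == 3:
        return Command("SEARCH", db_name=parts[1], search_cond=parts[2])
    if name == "PRES" and len(parts) == 2:
        return Command("PRES", num=int(parts[1]))
    raise ValueError(f"malformed command: {line!r}")


# Hypothetical condition string; the real search_cond syntax is not given.
cmd = parse_command("SEARCH shakespeare SPEECH where SPEAKER contains Hamlet")
print(cmd)
```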
Result Set
A result set is assigned to each session
Query Mediator does session tracking
Backend Search engine keeps multiple result sets
Thread-safe code is required
When a session is relinquished, the result set is reclaimed
Garbage collection for the result set is required
The Content of a Result Set
DB_name
The name of the database where the search is performed and the
result is obtained
DB_path
The directory path from the root where the DB resides
ptr_to_result_set
Pointer to the dynamic arrays holding the search results
num_result
Number of elements retrieved
ptr_to_K_ary_table
Pointer to the table that keeps the k-ary information for the DB
RS (Result Set) Management
[Figure: result set management between the Query Mediator and the backend search engine]
Components: Query Mediator, Session Monitor, backend search engine, Result Set Index Table (entries marked Used or Unused), actual result of the i-th RS
Messages: Session comes; RS Index requested; RS Index returned (i); i-th RS accessed; i-th RS returned
Periodical RS Reclamation
[Figure: periodic result set reclamation between the Query Mediator and the backend search engine]
Components: Query Mediator, Session Monitor, backend search engine, Result Set Index Table (entry i: Used; entry j: Used but to be reclaimed; other entries: Unused), actual results of the i-th and j-th RS
Messages: Session ends; alive sessions sent; Reclamation requested (j, ...); RS Indices to be reclaimed returned (j, ...); j-th RS reclaimed; Reclamation done
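The result set bookkeeping can be sketched in Python as a small thread-safe table. This is only an illustration of the idea, not the XRS code, which is written in C. The class name ResultSetTable, its methods and the use of string session identifiers are assumptions. The sketch follows the slides in giving each session one slot and freeing every slot whose session is missing from the list of alive sessions.

```python
import threading


class ResultSetTable:
    """Index table of result sets; None marks an unused slot."""

    def __init__(self, size):
        self._lock = threading.Lock()   # guards both structures below
        self._slots = [None] * size
        self._owner = {}                # session id -> slot index

    def acquire(self, session_id):
        """Return the slot index of a session, allocating an unused slot if needed."""
        with self._lock:
            if session_id in self._owner:
                return self._owner[session_id]
            for i, slot in enumerate(self._slots):
                if slot is None:
                    self._slots[i] = []          # empty result set, slot is now used
                    self._owner[session_id] = i
                    return i
            raise RuntimeError("no unused result set slot")

    def store(self, index, results):
        """Replace the contents of a used slot with new search results."""
        with self._lock:
            if self._slots[index] is None:
                raise KeyError(f"slot {index} is unused")
            self._slots[index] = list(results)

    def reclaim(self, alive_sessions):
        """Free the slots of sessions that are no longer alive; return their indices."""
        alive = set(alive_sessions)
        with self._lock:
            dead = [s for s in self._owner if s not in alive]
            reclaimed = []
            for s in dead:
                i = self._owner.pop(s)
                self._slots[i] = None
                reclaimed.append(i)
            return sorted(reclaimed)


table = ResultSetTable(4)
i = table.acquire("session-1")
j = table.acquire("session-2")
table.store(i, ["hit 1", "hit 2"])
freed = table.reclaim(["session-1"])   # session-2 has ended
print(i, j, freed)
```

A single lock keeps the sketch simple; finer-grained locking per slot would be a natural refinement when many sessions search concurrently.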
Backend Search Engine
Low indexing overhead and quick retrieval
Uses BUS (Bottom Up Scheme)
Most of the code is written in native C
Multi-thread support
Thread-safe C code
The C code is compiled into a shared library
Index information is saved in files
BUS (Bottom Up Scheme)
Main Idea
Index only at the lowest level of the document structure
Weight information at higher levels is computed at retrieval time
Benefits
Minimize the indexing overhead
Support term weight and full-blown structural search
Guarantee quick retrieval time
Principle of BUS
Term frequency is computed at run time.
[Figure: a document tree with index terms, and the same tree under the Bottom Up Scheme (chapter, section1, section2, para1, para2)]
Indexing is performed at leaf nodes only. Leaf postings:
  hypertext(2), browser(4)
  hypertext(3), internet(3), multimedia(5)
  hypertext(5), internet(2), java(7)
Computed at run time for a section (sum of the last two leaves): hypertext(8), internet(5), multimedia(5), java(7)
Computed at run time for the chapter (sum of all leaves): hypertext(10), browser(4), internet(5), multimedia(5), java(7)
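The principle can be reproduced with a few lines of Python. In this sketch only the leaf postings are stored, and the term frequencies of any higher element are summed on demand. The frequencies are the ones shown in the figure. The path names ("title", "section", "para1", "para2") and the function term_frequencies are assumptions, since the figure's labels cannot be matched to the postings with certainty.

```python
from collections import Counter

# Leaf postings: the only information kept in the index.
# Keys are paths below the chapter; the element names are assumed.
LEAF_INDEX = {
    ("title",): Counter(hypertext=2, browser=4),
    ("section", "para1"): Counter(hypertext=3, internet=3, multimedia=5),
    ("section", "para2"): Counter(hypertext=5, internet=2, java=7),
}


def term_frequencies(path_prefix):
    """Sum the postings of every leaf below the element at path_prefix."""
    total = Counter()
    for path, postings in LEAF_INDEX.items():
        if path[:len(path_prefix)] == path_prefix:
            total += postings
    return total


section = term_frequencies(("section",))
chapter = term_frequencies(())          # empty prefix: the whole chapter
print(dict(section))
print(dict(chapter))
```

Nothing is stored for the section or the chapter, which is where the saving in index size comes from; the price is the summation at query time.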
Key Issues in BUS
How to find the ancestor elements of a leaf element efficiently?
How to accumulate the term frequencies effectively?
UID (Unique element IDentifier)
Represent each document as a k-ary complete tree and
assign a UID to each node
[Figure: a document drawn as a 3-ary complete tree; positions without a real node are filled with virtual nodes]
parent(i) = ⌊(i − 2) / k⌋ + 1
Result of assigning UIDs (k = 3)
element  UID    element  UID
a        1      f        8
b        2      g        9
c        3      h        14
d        4      i        15
e        5      j        16
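The parent formula is easy to check against the UID table. The Python sketch below implements parent as given on the slide and adds a children function obtained by inverting it; children does not appear on the slide and is an assumption, as are the function and variable names.

```python
def parent(uid, k):
    """UID of the parent in a k-ary complete tree rooted at UID 1 (0 means no parent)."""
    return (uid - 2) // k + 1


def children(uid, k):
    """UIDs of the k child positions (real or virtual) of a node."""
    first = k * (uid - 1) + 2
    return list(range(first, first + k))


# UIDs from the slide's table, for the 3-ary tree.
UIDS = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5,
        "f": 8, "g": 9, "h": 14, "i": 15, "j": 16}
NAMES = {uid: name for name, uid in UIDS.items()}

for name, uid in UIDS.items():
    print(name, uid, "parent:", NAMES.get(parent(uid, 3)))
```

Because the parent is pure arithmetic on the UID, no parent pointers need to be stored in the index.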
K-ary table
Each document is assigned k, the maximum number of siblings (children of one element) in its document tree.
Each collection has a K-ary table, each entry of which records the k of one document.
Each result set has a pointer to the K-ary table.
Level and Element Type Number
Level
Level is the depth of an element in the document tree
The level difference tells how many times the parent function must be applied to reach a target element
Element type number
A unique number is assigned to each element type in the DTD (not to the element instances in documents)
It makes it possible to filter out unnecessary elements and accumulate the correct frequencies
Level and Element Type Number (Cont’d)
User query: Retrieve sections that contain “hypertext”
[Figure: example document tree with levels]
Level 1: chapter: hypertext(9), browser(1), internet(5), multimedia(5), java(7)
Level 2 (user level): section1: hypertext(8), internet(5), multimedia(5), java(7)
Level 3 (text level): para1, para2
Index information:
  title: hypertext(1), browser(1)
  para1: hypertext(3), internet(3), multimedia(5)
  para2: hypertext(5), internet(2), java(7)
The level difference tells how many times the parent function is applied.
The element type number lets unnecessary index information be filtered out.
Representing a Document Tree
[Figure: document tree in which every real node is labeled with a tuple such as <5,33,4,7>]
Nodes, grouped by the third component of the tuple:
  1: <5,1,1,1>
  2: <5,2,2,2>, <5,3,2,3>, <5,4,2,3>
  3: <5,8,3,5>, <5,9,3,5>, <5,11,3,6>, <5,12,3,6>
  4: <5,32,4,7>, <5,33,4,7>, <5,35,4,7>, <5,36,4,7>
Index terms attached to the leaf nodes:
  hypertext(1), model(1), retrieval(1), semantics(1)
  index(3), lexical(1), noun(4), stem(2)
  document(4), index(3), precision(2), term(5)
  document(4), index(3), precision(1), term(5)
  browser(2), hypertext(2), java(5), link(6)
  anchor(2), browser(1), html(3), internet(5)
  basian(3), inquiry(2), link(3), matrix(3)
Query Evaluation
Create accumulators at user level
Accumulators correspond to the elements at the user level
Compute the TF (Term Frequency) and DF (Document
Frequency) of a term
Sum the term frequencies of the descendant elements into the
corresponding accumulators.
The number of non-zero accumulators is the DF of the term.
Calculate the term weight
Compute the similarity of the elements and rank
Accumulating Term Frequency
query: find sections containing ‘browser’
[Figure: subtree of the tree from the "Representing a Document Tree" slide, rooted at <5,1>, with accumulators at the section nodes <5,11> and <5,12>; the accumulator at <5,11> shows 6]
Leaf postings:
  <5,32,4,7>: browser(4), index(3), precision(1), term(5)
  <5,33,4,7>: browser(2), hypertext(2), java(5), link(6)
  <5,35,4,7>: anchor(2), browser(1), html(3), internet(5)
  <5,36,4,7>: basian(3), inquiry(2), link(3), matrix(3)
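The accumulation step can be sketched in Python using the ‘browser’ postings from this slide. Several things here are assumptions rather than statements from the slides: the tuple is read as <document number, UID, level, element type number>; k = 3 for document 5 is inferred from the UIDs (32 and 33 map to 11, 35 and 36 map to 12); and the element type filter is modeled as a set of leaf types allowed under the target element, which would come from the DTD. The names K_TABLE, POSTINGS and accumulate are hypothetical.

```python
from collections import defaultdict

K_TABLE = {5: 3}  # document number -> k (k = 3 is inferred, not given on the slide)

# term -> list of (document, UID, level, element type number, frequency)
POSTINGS = {
    "browser": [
        (5, 32, 4, 7, 4),
        (5, 33, 4, 7, 2),
        (5, 35, 4, 7, 1),
    ],
}


def parent(uid, k):
    """UID of the parent in a k-ary complete tree rooted at UID 1."""
    return (uid - 2) // k + 1


def accumulate(term, target_level, leaf_types):
    """Return {(document, UID at target_level): TF}; its size is the DF of the term."""
    acc = defaultdict(int)
    for doc, uid, level, etype, tf in POSTINGS.get(term, []):
        if etype not in leaf_types or level < target_level:
            continue  # filtered out by element type number or by level
        k = K_TABLE[doc]
        for _ in range(level - target_level):  # level difference = parent applications
            uid = parent(uid, k)
        acc[(doc, uid)] += tf
    return dict(acc)


acc = accumulate("browser", target_level=3, leaf_types={7})
df = len(acc)
print(acc, df)
```

Under these assumptions the accumulator for <5,11> collects 4 + 2 = 6 and the one for <5,12> collects 1, so the term occurs in two sections. Term weighting and similarity ranking would start from these values; the slides do not give the weighting formula, so it is left out.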
Performance Data (in Ultra Sparc 2)
Index Overhead
Collection      Data size (Mb)   Posting file (Mb)   Index overhead (%)   Index time (hh/mm)
PATENT (SGML)   256              120                 46.87                1/30
SHAKE (XML)     7                2.8                 40.00                < /02
CLIN (XML)      3                1.36                45.33                < /01
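The overhead column appears to be the posting file size expressed as a percentage of the data size. That definition is an assumption inferred from the figures; the short Python check below recomputes it from the reported sizes, and the results agree with the table up to rounding of the last digit. The names ROWS and index_overhead are hypothetical.

```python
# (collection, data size in Mb, posting file size in Mb), as reported in the table
ROWS = [("PATENT", 256, 120), ("SHAKE", 7, 2.8), ("CLIN", 3, 1.36)]


def index_overhead(data_mb, posting_mb):
    """Posting file size as a percentage of the data size."""
    return 100.0 * posting_mb / data_mb


for name, data_mb, posting_mb in ROWS:
    print(f"{name}: {index_overhead(data_mb, posting_mb):.2f}%")
```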
Retrieval time
Almost all single-term queries are evaluated within one second
Advantages of XRS
Provides a variety of structural search functions
Low indexing overhead and quick retrieval time
Easy to port
Java + native C code
The C code is built as shared libraries
Alternative Architecture of XRS
[Figure: alternative XRS architecture]
Server Side: Search Engine (shared library), called through the JNI interface by the Query Mediator Servlet
Client Side: Applet (User Interface) and Rendering Component
Flows: query (Applet to Query Mediator Servlet); search result (to the Applet); XML result with XSL (to the Rendering Component); HTML format; initiate Web browser
Benefit and Problem
Benefit
Simpler and easier to port than the current implementation
No independent Java process is needed
Problem
Current Java servlet engines cannot run the shared libraries
Apache JServ, JRun and Jigsaw all fail to run them
Current Status
Development of the content retrieval part is finished
It will be available on the Web at the end of August 1999
http://dlb2.nlm.nih.gov/~dwshin
The structural retrieval part is in development and will be finished soon