eXist Indexing
Using the right index for your data
Date: 9/29/2008
Metadata Solutions
Dan McCreary
President
Dan McCreary & Associates
[email protected]
(952) 931-9198
Overview
• Using eXist Indexes
• Types of indexes
• Configuring indexes
• Testing indexes
Copyright 2008 Dan McCreary & Associates
Index Types
• Structural Indexes: These index the nodal structure, elements (tags) and attributes, of the documents in a collection.
• Range Indexes: Ideal for indexing measurements (integers, doubles, floats, currency or discrete value measurements).
• Full Text Indexes: These map specific text nodes and attributes of the documents in a collection to text tokens.
• NGram Indexes: These map specific text nodes and attributes of the documents in a collection to split tokens of n characters (where n = 3 by default). Very efficient for exact substring searches and for queries on software program code, which cannot easily be split into whitespace-separated tokens and is thus a bad match for the full text index.
• Spatial Indexes (Experimental): These map elements of the documents in a collection containing geo-referenced geometries to dedicated data structures that allow efficient spatial queries.
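The n-gram idea can be sketched in a few lines of Python. This is only an illustration of the general technique (overlapping 3-character tokens used to narrow an exact substring search), not eXist's implementation; the names `ngrams`, `build_index` and `contains` are hypothetical.

```python
from collections import defaultdict

def ngrams(text, n=3):
    """Return the overlapping n-character tokens of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(docs, n=3):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index

def contains(index, docs, needle, n=3):
    """Find the documents that contain needle as an exact substring."""
    grams = ngrams(needle, n)
    if not grams:
        # Needle is shorter than n: nothing to look up, so check every document.
        candidates = set(docs)
    else:
        # A matching document must contain every n-gram of the needle.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify the candidates: sharing all n-grams does not guarantee a match.
    return sorted(d for d in candidates if needle in docs[d])

# Program code is hard to split on whitespace, which is where n-grams help.
docs = {1: "for (i = 0; i < n; i++)", 2: "while (i < n) { i += 2; }"}
index = build_index(docs)
print(contains(index, docs, "i++"))
```

The final verification step is a design choice of this sketch: the index narrows the candidate set cheaply, and the exact substring test is run only on the survivors.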
Structural Indexes
• Keeps track of the elements (tags), attributes, and nodal structure for all XML documents in a collection
• Created and maintained automatically by eXist
• Cannot be reconfigured or disabled by the user
• Used by all non-wildcard XPath and XQuery expressions in eXist (not “//*”)
• Stored in the database file elements.dbx
How Do Structural Indexes Work?
• Maps every element and attribute qname (or qualified
name) in a document collection to a list of <documentId,
nodeId> pairs.
• This mapping is used by the query engine to resolve
queries for a given XPath expression.
• Example:
– //book/section
– eXist uses two index lookups: the first for the <book> node, and
the second for the <section> node
– eXist computes the structural join between these node sets to
determine which <section> elements are in fact children of <book>
elements
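The two lookups and the structural join for //book/section can be sketched in Python as follows. This is a toy model under stated assumptions, not eXist's storage format: node ids are hypothetical tuples of child positions, so a node's parent id is simply its id without the last component, and the names `index_document` and `child_join` are invented for the example.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def index_document(index, doc_id, xml_text):
    """Record a (doc_id, node_id) pair under each element name."""
    def walk(elem, node_id):
        index[elem.tag].append((doc_id, node_id))
        for pos, child in enumerate(elem, start=1):
            walk(child, node_id + (pos,))
    walk(ET.fromstring(xml_text), (1,))

def child_join(index, parent_name, child_name):
    """Return the child_name nodes whose parent is a parent_name node."""
    parents = set(index.get(parent_name, []))      # first index lookup
    children = index.get(child_name, [])           # second index lookup
    # Structural join: keep a child only if its parent id is in the parent set.
    return [(doc, node) for doc, node in children
            if (doc, node[:-1]) in parents]

index = defaultdict(list)
index_document(index, 1, "<book><section/><chapter><section/></chapter></book>")
index_document(index, 2, "<article><section/></article>")

# Only the <section> directly under <book> in document 1 qualifies.
print(child_join(index, "book", "section"))
```

Note that neither document is traversed at query time: the join works purely on the id lists taken from the index.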
Range Index
• Range indexes provide a shortcut for the database to directly select nodes based on their typed values.
• Used when matching or comparing nodes by way of standard XPath operators and functions.
• Without a range index, comparison operators like =, > or < will default to a "brute-force" inspection of the DOM. This can be extremely slow if eXist has to search through possibly millions of nodes, because each node has to be loaded and cast to the target type.
Example
• You have a catalog that contains 50,000 items
• You want to find all items that have a price under $100
• XPath: //item[price < 100.0]
• Without a range index, eXist would have to do up to 50,000 comparisons for each search
• With a range index, it can quickly find the subset priced under $100 with a single index lookup
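The difference between the two approaches can be sketched in Python. This is a simplified model, assuming the index is a list of values sorted by type-cast price and searched with binary search; it does not reflect how eXist stores its range index, and `RangeIndex` and `scan_less_than` are hypothetical names.

```python
import bisect

class RangeIndex:
    """Sorted (value, node_id) pairs for one typed field."""

    def __init__(self, items, field):
        # Values are cast once, at indexing time, not on every query.
        self.entries = sorted((float(item[field]), node_id)
                              for node_id, item in items.items())
        self.keys = [value for value, _ in self.entries]

    def less_than(self, limit):
        """Node ids whose value is strictly below limit."""
        end = bisect.bisect_left(self.keys, float(limit))
        return [node_id for _, node_id in self.entries[:end]]

def scan_less_than(items, field, limit):
    """Brute-force equivalent: load and cast every node on every query."""
    return [node_id for node_id, item in items.items()
            if float(item[field]) < limit]

# 50,000 items with prices 5, 10, 15, ... stored as text, as in an XML document.
items = {n: {"price": str(5 * n)} for n in range(1, 50001)}
index = RangeIndex(items, "price")

print(len(index.less_than(100.0)))
```

Both functions return the same node ids; the indexed version finds the cut-off point with one binary search instead of 50,000 casts and comparisons.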
Restriction on Ranges
• All collections that are included in the
search must be indexed
• The data types must match
• There must be no context dependencies
All Collections Must Be Indexed
• The range index must be defined on all
items in the input sequence
– If you search collections A and B but only A is
range indexed, the query will not use the
indexes
[Diagram: an XQuery spanning Collection A (with range index) and Collection B (no range index)]
Fulltext Fallback
• If not all collections have the exact same type of range index, the search will automatically revert to using the default fulltext indexes (slow)
Data Types Must Match
• The index data type (first argument type)
must match the test data type (second
argument type)
• Wrong
– //item[price = '1000.0']
• Right
– //item[price < xs:double($max-price)]
Context Dependencies
• The right-hand argument must not have dependencies on the current context item.
• Wrong:
– //item[price = self]
• Right:
– //item[price > xs:double($max-price)]
Fulltext Index
• Used to query for a sequence of separate "words" or tokens in a longer stream of text.
• While building the index, the text is parsed into single tokens, which are then stored in the index.
• Historically, eXist has created a default full text index on all text nodes and attribute values. This will likely change in the future, as the index is undergoing a major redesign. As the index becomes more configurable, we may drop the current default indexing behaviour.
• As with the other index types, you can configure the full text index in the collection configuration, and we will try to keep the configuration of the new index backwards compatible. We therefore recommend creating a collection configuration file, disabling the default index-all behaviour and defining some explicit full text indexes on your documents. The details of this process will be described below.
• The full text index is only used in combination with eXist's fulltext search extensions. In particular, you can use the following eXist-specific operators and functions that apply a fulltext index:
Fulltext Operators and Functions
• Operators:
– &=
– |=
• Main Functions
– text:match-all()
– text:match-any()
– near()
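The "all keywords" versus "any keyword" distinction behind &= and |= can be sketched in Python over a simple token index. This is an illustration of the matching semantics only: the tokenizer (lower-casing and splitting on non-word characters) is an assumption and differs from eXist's, and `tokenize`, `build_index`, `match_all` and `match_any` are hypothetical helper names, not the eXist functions listed above.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(nodes):
    """Map each token to the set of node ids whose text contains it."""
    index = defaultdict(set)
    for node_id, text in nodes.items():
        for token in tokenize(text):
            index[token].add(node_id)
    return index

def match_all(index, query):
    """Nodes containing every keyword of the query (the idea behind &=)."""
    sets = [index.get(t, set()) for t in tokenize(query)]
    return sorted(set.intersection(*sets)) if sets else []

def match_any(index, query):
    """Nodes containing at least one keyword (the idea behind |=)."""
    sets = [index.get(t, set()) for t in tokenize(query)]
    return sorted(set.union(*sets)) if sets else []

nodes = {
    1: "Native XML databases",
    2: "XML indexing in eXist",
    3: "Relational databases",
}
index = build_index(nodes)

print(match_all(index, "xml databases"))
print(match_any(index, "xml databases"))
```

Because both functions consult only the index, a node whose text was never tokenized into the index cannot be returned by either of them.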
Disabling Indexes
• If you have disabled full text indexing for certain elements, these operators and functions will also be effectively disabled for those elements and will not return matches.
• eXist will not return results for queries that would normally have results if fulltext indexing were enabled.
• This is in direct contrast to range indexing, which does fall back to a full search of the document if no range index applies
Geospatial Indexing (Beta)
• A working proof-of-concept index, which
listens for spatial geometries described
through the Geography Markup Language
(GML)
Sample Geospatial Data
<gml:Polygon xmlns:gml="http://www.opengis.net/gml" srsName="osgb:BNG">
  <gml:outerBoundaryIs>
    <gml:LinearRing>
      <gml:coordinates>
        278515.400,187060.450 278515.150,187057.950 278516.350,187057.150
        278546.700,187054.000 278580.550,187050.900 278609.500,187048.100
        278609.750,187051.250 278574.750,187054.650 278544.950,187057.450
        278515.400,187060.450
      </gml:coordinates>
    </gml:LinearRing>
  </gml:outerBoundaryIs>
</gml:Polygon>
Sample of Geospatial Queries
• What is the distance from point X to point
Y?
• What items are within X miles of this point?
• What items are inside county Y?
Custom Indexing
• eXist version 1.2 and later feature a
modularized indexing architecture
• Allows arbitrary indexes to be plugged into
an indexing pipeline
• Requires Java development skills
• See
– http://exist-db.org/devguide_indexes.html
For the eXist Database Administrator
• For each collection you want to configure (for example /db/foo), create a file collection.xconf and store it as /db/system/config/db/foo/collection.xconf
• Inheritance
– Subcollections which do not have a collection.xconf file of their own are governed by the configuration specified for the closest ancestor collection which does have such a file
Inheritance Example
/db
/db/system/config/db/foo/collection.xconf
/db/foo
/db/foo/bar
If no collection.xconf exists for a collection (here /db/foo/bar), it defaults to the parent collection's configuration.
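The lookup rule can be sketched in Python as a walk from the collection up towards /db. This models only the rule as stated on these slides; the function names and the `stored` set standing in for the database contents are hypothetical.

```python
import posixpath

CONFIG_ROOT = "/db/system/config"

def config_path(collection):
    """Where the collection.xconf for a given collection is stored."""
    return CONFIG_ROOT + collection + "/collection.xconf"

def effective_config(collection, stored_configs):
    """Return the closest collection.xconf at or above the collection, or None."""
    current = collection
    while True:
        candidate = config_path(current)
        if candidate in stored_configs:
            return candidate
        if current in ("/db", "/", ""):
            # Reached the top without finding any configuration file.
            return None
        current = posixpath.dirname(current)

# Only /db/foo has a configuration file, as in the example above.
stored = {"/db/system/config/db/foo/collection.xconf"}

print(effective_config("/db/foo/bar", stored))
```

With this setup, /db/foo/bar resolves to the file stored for /db/foo, while a sibling such as /db/other finds no configuration at all.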
Thank You!
Please contact me for more information:
• Native XML Databases
• Metadata Management
• Metadata Registries
• Service Oriented Architectures
• Business Intelligence and Data Warehouse
• Semantic Web
Dan McCreary, President
Dan McCreary & Associates
Metadata Strategy Development
[email protected]
(952) 931-9198
Index Creation and Updates
• The eXist index system automatically maintains
and updates indexes defined by the user
• You therefore do not need to update an index
when you update a database document or
collection.
• eXist will even update indexes following partial
document updates via XUpdate or XQuery Update
expressions.
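• The only exception to eXist's automatic update occurs when you add a new index definition to an existing database collection: the new definition applies only to data stored afterwards, so existing documents must be reindexed manually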
Sample Collection Index
<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <fulltext default="none" attributes="false">
      <!-- Full text indexes -->
      <create qname="author"/>
      <create qname="title" content="mixed"/>
    </fulltext>
    <!-- Range indexes -->
    <create qname="title" type="xs:string"/>
    <create qname="author" type="xs:string"/>
    <create qname="year" type="xs:int"/>
    <!-- N-gram indexes -->
    <ngram qname="author"/>
    <ngram qname="title"/>
  </index>
</collection>
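As a quick sanity check on a configuration like this one, the following Python sketch parses the file and lists which element names get which kind of index. It relies only on the structure visible in the sample (full text entries nested in the fulltext element, range and n-gram entries as direct children of the index element); the `summarize` helper is hypothetical and is not part of eXist.

```python
import xml.etree.ElementTree as ET

NS = {"c": "http://exist-db.org/collection-config/1.0"}

XCONF = """<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <fulltext default="none" attributes="false">
      <create qname="author"/>
      <create qname="title" content="mixed"/>
    </fulltext>
    <create qname="title" type="xs:string"/>
    <create qname="author" type="xs:string"/>
    <create qname="year" type="xs:int"/>
    <ngram qname="author"/>
    <ngram qname="title"/>
  </index>
</collection>"""

def summarize(xconf_text):
    """Group the configured qnames by index kind."""
    index = ET.fromstring(xconf_text).find("c:index", NS)
    return {
        # <create> elements nested inside <fulltext> define full text indexes.
        "fulltext": [e.get("qname") for e in index.findall("c:fulltext/c:create", NS)],
        # <create> elements directly under <index> define typed range indexes.
        "range": {e.get("qname"): e.get("type") for e in index.findall("c:create", NS)},
        # <ngram> elements define n-gram indexes.
        "ngram": [e.get("qname") for e in index.findall("c:ngram", NS)],
    }

summary = summarize(XCONF)
for kind, entries in summary.items():
    print(kind, entries)
```

A summary like this makes it easy to spot, for example, that year has a range index but no full text or n-gram index.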