A Collections Searching Center
Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
Collections.si.edu
[email protected]
Background Information
- Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge
- 19 museums and 9 research institutes
- 136 million collection objects
- 12 major museum collection information systems (with 30 databases)
- Hundreds of other databases
Issues we faced
Users want information now!
- Google effect and users' mentality: "if it is not online, it does not exist."
- Users want immediate access to digital documents.
- Separate databases are confusing to the public.
We must act now!
Smithsonian’s Collection Searching Center
Overview
- a discovery center for information with a single searching point
- faceted searching and content-sensitive navigation
- positive and negative browse & select options
- relevancy ranking of search results
- automatic stemming for word matching
Smithsonian’s Cross Searching Catalog
Overview (continued)
- integrated searching of data from multiple types of databases
- scalability for large data sets
- a metadata center which interacts with other online applications
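A minimal sketch of how faceted searching with positive and negative selections might be sent to Solr. The endpoint URL, the helper name `build_search_url`, and the sample search terms are assumptions for illustration; `q`, `fq`, `facet`, `facet.field`, and `wt` are standard Solr query parameters, and a leading minus in `fq` excludes matching records.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr endpoint; the real host and core are not given in the slides.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_search_url(keywords, include=None, exclude=None, facet_fields=()):
    """Build a Solr select URL with positive (include) and negative (exclude) filters."""
    params = [("q", keywords), ("wt", "json"), ("facet", "true")]
    for field in facet_fields:
        params.append(("facet.field", field))
    # Positive selection: keep only records whose field matches the value.
    for field, value in (include or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    # Negative selection: a leading minus excludes matching records.
    for field, value in (exclude or {}).items():
        params.append(("fq", f'-{field}:"{value}"'))
    return SOLR_SELECT + "?" + urlencode(params)

url = build_search_url(
    "west point",
    include={"object_type": "Books"},
    exclude={"place": "Illinois"},
    facet_fields=["object_type", "topic", "place", "date"],
)
print(url)
```

Each selection becomes its own `fq` parameter, so a user can add or remove one limit without changing the keyword query.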
Project Team and Resources
- Andrew Gunther – Software development and implementation
- Jim Felley – Data conversion and implementation
- George Bowman – Database management and security configuration
- Randy Arnold – Project support
- Ching-hsien Wang – Program Manager
Since August 2007, we have integrated data from 12 major databases with 2 million records.
Starting from Multiple databases
Transform into a single Search Center
Cross Searching Demo – simple opening screen
Demo – search result screen
Demo – search history
Process Flow Diagram
[Diagram. Sources: Horizon, Digital Library, Digital Archives, Digital Museum. Processing: Data Extract and Transformation, XML documents, Solr, Lucene Index. Output data: in XML, in JSON, in Python. Consumers: Cross Searching Catalog, Online Exhibition, Virtual Museum in 2nd Life, Education Interface, Open Access Applications.]
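A small sketch of how a consuming application might read the JSON output. The sample response is hand-written and its counts are made up; only the overall shape follows Solr's JSON response writer, which by default returns each facet field as a flat list of alternating terms and counts. The helper name `facet_counts` is hypothetical.

```python
import json

# Hand-written sample shaped like a Solr JSON response (wt=json); counts are made up.
SAMPLE_RESPONSE = """
{
  "response": {"numFound": 2, "docs": [
    {"record_ID": "siris_sil_905285", "object_type": ["Books"]},
    {"record_ID": "siris_arc_104765", "object_type": ["Photographs"]}
  ]},
  "facet_counts": {"facet_fields": {
    "object_type": ["Books", 1, "Photographs", 1]
  }}
}
"""

def facet_counts(response, field):
    """Turn Solr's flat [term, count, term, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

data = json.loads(SAMPLE_RESPONSE)
counts = facet_counts(data, "object_type")
print(data["response"]["numFound"], counts)
```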
Automated Process
XML Data Transformation
[Diagram: a trigger on each source (Library, Archives, Art Inventory, Photo Archives, Exhibition Catalogs, Smithsonian History, Research Bibliographies, Airplane Directory, ...) feeds the Solr_Index_Pending DB table. A Perl program converts records based on BIB# into XML documents.]
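The trigger-and-queue pattern can be sketched as follows. This is an illustration in Python with an in-memory SQLite database, not the production Perl program: the `library` table, its columns, the trigger names, and the `siris_sil_` record ID pattern are assumptions made for the example.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE library (bib INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE solr_index_pending (bib INTEGER, source TEXT);
-- The triggers queue the BIB# of every new or changed record.
CREATE TRIGGER library_ins AFTER INSERT ON library
BEGIN INSERT INTO solr_index_pending VALUES (NEW.bib, 'library'); END;
CREATE TRIGGER library_upd AFTER UPDATE ON library
BEGIN INSERT INTO solr_index_pending VALUES (NEW.bib, 'library'); END;
""")

def convert_pending(conn):
    """Convert each queued record to an XML document, then clear the queue."""
    rows = conn.execute(
        "SELECT DISTINCT p.bib, l.title FROM solr_index_pending p "
        "JOIN library l ON l.bib = p.bib ORDER BY p.bib"
    ).fetchall()
    docs = [
        f"<doc><record_ID>siris_sil_{bib}</record_ID>"
        f"<title>{escape(title)}</title></doc>"
        for bib, title in rows
    ]
    conn.execute("DELETE FROM solr_index_pending")
    return docs

conn.execute("INSERT INTO library VALUES (905285, 'Story of West Point')")
conn.execute(
    "UPDATE library SET title = 'Story of West Point: 1802-1943' WHERE bib = 905285"
)
docs = convert_pending(conn)
print(docs)
```

Because the queue holds only BIB#s, a record edited several times between runs is converted once, using its latest values.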
Define an Index Metadata Model:
Free text data fields used for keyword searching & display
Record Link
Title/Object-name
Identifier
Physical Description
Gallery Label
Notes
Publisher
Object Type
Taxonomic Name
Language
Topic
Place
Date
Name
Culture
Set Name
Data Source
Credit Line
Online Media Group
Facet data fields used for browsing and limiting
Record ID
Object Type
Language
Topic
Place
Date
Name
Culture
Data Source
Online Media Type
Rights for Online Media File
Related Record
Usage Flag
Taxon-Kingdom
Taxon-Phylum
Taxon-Division
Taxon-Class
Taxon-Order
Taxon-Family
Taxon-Sub-Family
Scientific_name
Common name
Geo-age-Era
Geo-Age-System
Geo-Age-Series
Geo-Age-Stage
Strat-Group
Strat-Formation
Strat-Member
Getting help from Solr
- Task specific handlers:
  - Request handler
  - Response handler
  - Update handler
[Diagram: Solr, Lucene Index]
- The schema.xml file defines the fields to be indexed, displayed, and searchable.
- The solrconfig.xml file defines cache size, faceted fields, and request handler customization.
Solrconfig.xml Example – facet field definition
<str name="facet.field">object_type</str>
<str name="facet.field">language</str>
<str name="facet.field">topic</str>
<str name="facet.field">place</str>
<str name="facet.field">date</str>
<str name="facet.field">name</str>
<str name="facet.field">culture</str>
<str name="facet.field">online_media_type</str>
<str name="facet.field">set_name</str>
<str name="facet.field">data_source</str>
<str name="facet.field">tax_kingdom</str>
<str name="facet.field">tax_phylum</str>
<str name="facet.field">tax_division</str>
<str name="facet.field">tax_class</str>
<str name="facet.field">tax_order</str>
<str name="facet.field">tax_family</str>
<str name="facet.field">tax_sub-family</str>
<str name="facet.field">common_name</str>
<str name="facet.field">scientific_name</str>
<str name="facet.field">freetext</str>
Data Example (abbreviated) – a Library Book
<doc boost="1">
<descriptiveNonRepeating>
<record_ID>siris_sil_905285</record_ID>
<unit_code>SIL</unit_code>
<data_source>Smithsonian Institution Libraries</data_source>
<title_sort>STORY OF WEST POINT: 18021943 THE WEST POINT TRADITION IN AMERICAN LIFE</title_sort>
<title label="Title">Story of West Point: 1802-1943; the West Point tradition in American life</title>
</descriptiveNonRepeating>
<descriptiveOptional>
<freetext category="dataSource" label="Data Source">Smithsonian Institution Libraries</freetext>
<freetext category="objectType" label="Type">Books</freetext>
<freetext category="date" label="Date">1943</freetext>
</descriptiveOptional>
<indexedStructured>
<object_type>Books</object_type>
<date>1943</date>
</indexedStructured>
</doc>
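A brief sketch of how a record like this one separates into display fields and facet fields. It uses Python's standard `xml.etree.ElementTree` on a shortened copy of the library book record; the function name `split_fields` and the returned layout are assumptions for illustration, not part of the Smithsonian system.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the library book record shown above.
DOC_XML = """
<doc boost="1">
<descriptiveNonRepeating>
<record_ID>siris_sil_905285</record_ID>
<data_source>Smithsonian Institution Libraries</data_source>
</descriptiveNonRepeating>
<descriptiveOptional>
<freetext category="objectType" label="Type">Books</freetext>
<freetext category="date" label="Date">1943</freetext>
</descriptiveOptional>
<indexedStructured>
<object_type>Books</object_type>
<date>1943</date>
</indexedStructured>
</doc>
"""

def split_fields(doc_xml):
    """Return (boost, display label/value pairs, facet field -> list of terms)."""
    root = ET.fromstring(doc_xml)
    # freetext elements carry the labels shown to the user.
    display = [(ft.get("label"), ft.text) for ft in root.iter("freetext")]
    # indexedStructured elements carry the normalized facet terms.
    facets = {}
    for el in root.find("indexedStructured"):
        facets.setdefault(el.tag, []).append(el.text)
    return float(root.get("boost", "1")), display, facets

boost, display, facets = split_fields(DOC_XML)
print(boost, display, facets)
```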
Data Example (abbreviated) – a Photograph
<doc boost="6.4">
<descriptiveNonRepeating>
<record_ID>siris_arc_104765</record_ID>
<unit_code>EEPA</unit_code>
<data_source>Eliot Elisofon Photographic Archives</data_source>
<title_sort>AERIAL VIEW OF DOWNTOWN JOHANNESBURG SOUTH AFRICA SLIDE</title_sort>
<title label="Title">Aerial view of downtown Johannesburg, South Africa, [slide]</title>
<online_media mediaCount="1">
<media thumbnail="http://sirismm.si.edu/eepa/eepthb/eepa_05859thb.jpg" type="Images">http://sirismm.si.edu/eepa/eep/eepa_05859.jpg</media>
</online_media>
</descriptiveNonRepeating>
<descriptiveOptional>
<freetext category="dataSource" label="Data Source">Eliot Elisofon Photographic Archives</freetext>
<freetext category="identifier" label="Local number">EEPA EECL 15973</freetext>
<freetext label="photographer" category="name">Elisofon, Eliot</freetext>
<freetext category="physicalDescription" label="Physical description">slide : col</freetext>
<freetext category="notes" label="Summary">This photograph was taken when Eliot Elisofon was on as
magazine and traveled to Africa from August 18, 1959 to December 20, 1959</freetext>
<freetext category="objectType" label="Type">Photographs</freetext>
<freetext category="topic" label="Topic">Mod. architecture/cityscape</freetext>
<freetext category="place" label="Place">South Africa</freetext>
<freetext category="date" label="Date">1959</freetext>
<freetext category="setName" label="See more items in">Eliot Elisofon Field photographs 1942-1972</freetext>
</descriptiveOptional>
<indexedStructured>
Data Example (abbreviated) – a sculpture
<doc boost="6.4">
<descriptiveNonRepeating>
<record_ID>siris_ari_7985</record_ID>
<unit_code>ARI</unit_code>
<data_source>Art Inventories</data_source>
<title_sort>DREXEL MONUMENT SCULPTURE</title_sort>
<title label="Title">The Drexel Monument, (sculpture)</title>
<record_link>http://sirisartinventories.si.edu/ipac20/ipac.jsp?&profile=all&source=~!siartinventories&uri=full=3100001~!7985
0#focus</record_link>
<online_media mediaCount="7">
<media thumbnail="http://sirismm.si.edu/saam/scan3thb/S75004286_1bthb.jpg" type="Images">http://americanart.si.edu/images/1966/1966.47.36_1b.jpg</media>
</online_media>
</descriptiveNonRepeating>
<descriptiveOptional>
<freetext category="dataSource" label="Data Source">Art Inventories</freetext>
<freetext category="identifier" label="Control number">IAS 75004286</freetext>
<freetext label="sculptor" category="name">Manger, Heinrich b. 1833</freetext>
<freetext label="founder" category="name">Chas. F. Heaton</freetext>
<freetext category="title" label="title">Francis M. Drexel Monument, (sculpture)</freetext>
<freetext category="physicalDescription" label="Physical description">metal: bronze Sculpture: bronze; Base: granite; Fountain basin: concrete</freetext>
<freetext category="notes" label="Description">Index of American Sculpture, University of Delaware, 1985</freetext>
<freetext category="objectType" label="Type">Sculptures-Fountain</freetext>
<freetext category="name" label="Subject">Drexel, Francis M</freetext>
<freetext category="place" label="Place">Illinois</freetext>
<freetext category="date" label="Date">1881. Cast 1882. Dedicated 1883</freetext>
</descriptiveOptional>
<indexedStructured>
<name>Manger, Heinrich</name>
<name>Chas. F. Heaton</name>
<object_type>Sculptures</object_type>
<topic>Portrait male</topic>
<name>Drexel, Francis M</name>
A system is only as good as the data that is in it.
Data mapping for multiple databases (truncated)
Faceted Categories
- Determine the most useful facets; more is not better.
- Number of unique facets will affect system response time.
- Smithsonian has 4.6 million unique terms. Among them:
  - 864,000 names,
  - 126,000 topics,
  - 47,000 places,
  - 139 dates (down from 40,000 before cleanup),
  - 1,000 types (down from 2,000 before cleanup)
Build the facet terms
650 $a Art $z Africa, North $v Periodicals.
<topic>Art</topic>
<place>Africa, North</place>
<object_type>Periodicals</object_type>
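The 650 mapping above can be sketched as a small subfield splitter. The mapping table covers only the three subfield codes shown in the example; the function name and the handling of trailing punctuation are assumptions, and real MARC records carry indicators and other subfields that this sketch ignores.

```python
import re

# Subfield-to-facet mapping taken from the 650 example; other codes are ignored here.
SUBFIELD_TO_FACET = {"a": "topic", "z": "place", "v": "object_type"}

def facets_from_subject(field_text):
    """Split a '$a ... $z ...' string into (facet, term) pairs."""
    pairs = []
    for code, value in re.findall(r"\$(\w)\s*([^$]*)", field_text):
        facet = SUBFIELD_TO_FACET.get(code)
        # Drop surrounding spaces and the trailing period that ends the field.
        term = value.strip().rstrip(".").strip()
        if facet and term:
            pairs.append((facet, term))
    return pairs

pairs = facets_from_subject("$a Art $z Africa, North $v Periodicals.")
print(pairs)
```

Stripping every trailing period is a simplification: a term that legitimately ends in an abbreviation would lose its period too.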
Build the facet terms
655 $a Photographs $y 1840-1860.
<type>Photographs</type>
<date>1840s</date>
<date>1850s</date>
<date>1860s</date>
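Expanding a date range into decade terms, as in the 655 example, might look like the sketch below. The function name is hypothetical, and treating a single year as one decade is an assumption that the slides do not state.

```python
def decades(date_range):
    """Expand 'YYYY-YYYY' (or a single 'YYYY') into decade labels."""
    parts = date_range.strip().rstrip(".").split("-")
    start, end = int(parts[0]), int(parts[-1])
    # Round both ends down to the start of their decade.
    first, last = start - start % 10, end - end % 10
    return [f"{d}s" for d in range(first, last + 10, 10)]

print(decades("1840-1860."))
```

Bucketing dates this way keeps the number of distinct date facet terms small, which matters because the number of unique facet terms affects response time.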
Challenges
- Adapting LCSH and AAT terms in a whole new way
- Still seeking a good way to use See and See Also reference data
- Reduce data inconsistency in our records for better quality facet terms
- Character conversion challenge with MARC-8, Unicode, and UTF-8
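One part of the consistency problem can be illustrated with Unicode normalization: the same facet term may arrive with a precomposed accented letter or with a base letter plus a combining mark, and with stray spaces or punctuation. The sketch below assumes the text has already been decoded to Unicode; MARC-8 decoding needs a dedicated converter and is not shown. The function name and the sample variants are made up for the example.

```python
import unicodedata

def normalize_term(term):
    """Return a canonical form of a facet term so that duplicates group together."""
    text = unicodedata.normalize("NFC", term)  # compose base letter + combining mark
    text = " ".join(text.split())              # trim and collapse whitespace
    return text.rstrip(" .,;:")                # drop trailing cataloging punctuation

# Three spellings of one place name: decomposed, precomposed with period, padded.
variants = ["Que\u0301bec", "Qu\u00e9bec.", "  Qu\u00e9bec "]
unique = {normalize_term(v) for v in variants}
print(unique)
```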
Future plans
- Continue to add data from more digital library databases and museum collection databases
  - Working on National History museum, and American Indian museum.
- Complete the implementation of the capability to interact with external applications
  - Plan to support “American Art and Artist” application
- Add new functionality such as my-list, list-sharing, social tagging.
- Support more visual displays such as Google map and time slider
A Collections Searching Center
Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
www.siris.si.edu
[email protected]