A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
Collections.si.edu
[email protected]

Background Information
Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge.
- 19 museums and 9 research institutes
- 136 million collection objects
- 12 major museum collection information systems (with 30 databases)
- Hundreds of other databases

Issues we faced
- Users want information now!
- Google Effect and user’s mentality: “if it is not online, it does not exist.”
- Users want immediate access to digital documents.
- Separate databases are confusing to the public.
- We must act now!

Smithsonian’s Collection Searching Center Overview
- a discovery center for information with a single searching point
- faceted searching and content-sensitive navigation
- positive and negative browse & select options
- relevancy ranking of search results
- automatic stemming for word matching

Smithsonian’s Cross Searching Catalog Overview (continued)
- integrated searching of data from multiple types of databases
- scalability for large data sets
- a metadata center which interacts with other online applications

Project Team and Resources
- Andrew Gunther – Software development and implementation
- Jim Felley – Data conversion and implementation
- George Bowman – Database management and security configuration
- Randy Arnold – Project support
- Ching-hsien Wang – Program Manager
Since August 2007, we have integrated data from 12 major databases with 2 million records.

Starting from Multiple databases

Transform into a single Search Center
Cross Searching

Demo – simple opening screen

Demo – search result screen

Demo – search history

Process Flow Diagram
[Diagram: source databases (Horizon) pass through Data Extract and Transformation into XML documents, which Solr indexes (Lucene Index). Solr outputs data in XML, JSON, and Python to consumers labeled Digital Library, Digital Archives, Digital Museum, Online Exhibition, Cross Searching Catalog, Education Interface, Open Access Applications, and Virtual Museum in 2nd Life.]

Automated Process
[Diagram: a trigger on each source database (Library, Archives, Art Inventory, Photo Archives, Exhibition Catalogs, Smithsonian History, Research Bibliographies, Airplane Directory, ...; labeled Horizon) writes to the Solr_Index_Pending DB table. A Perl program converts records based on BIB# into XML documents (XML Data Transformation).]

Define an Index Metadata Model
Free text data fields, used for keyword searching & display:
Record Link, Title/Object-name, Identifier, Physical Description, Gallery Label, Notes, Publisher, Object Type, Taxonomic Name, Language, Topic, Place, Date, Name, Culture, Set Name, Data Source, Credit Line, Online Media Group
Facet data fields, used for browsing and limiting:
Record ID, Object Type, Language, Topic, Place, Date, Name, Culture, Data Source, Online Media Type, Rights for Online Media File, Related Record, Usage Flag, Taxon-Kingdom, Taxon-Phylum, Taxon-Division, Taxon-Class, Taxon-Order, Taxon-Family, Taxon-Sub-Family, Scientific_name, Common name, Geo-Age-Era, Geo-Age-System, Geo-Age-Series, Geo-Age-Stage, Strat-Group, Strat-Formation, Strat-Member

Getting help from Solr
- Task-specific handlers: Request handler, Respond handler, Update handler
- Solr Lucene Index
- The Solr Schema.xml file defines the fields to be indexed, displayed, and searchable.
- The Solrconfig.xml file defines cache size, faceted field type, and request handler customization.
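
To make the request-handler and facet settings more concrete, here is a minimal Python sketch of how a client might build a faceted search URL for Solr. The parameters `q`, `fq`, `wt`, `facet`, and `facet.field` are standard Solr query parameters; the endpoint URL, the function name, and the choice of facet fields are assumptions for illustration and do not come from the slides.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not give the Solr URL.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_facet_query(keywords, include=(), exclude=(),
                      facet_fields=("object_type", "topic", "place", "date"),
                      wt="json"):
    """Return a Solr select URL for a keyword search with facet counts.

    include / exclude are sequences of (field, value) pairs. Each pair becomes
    a filter query (fq): positive for include, negated for exclude. Values are
    wrapped in double quotes; values that themselves contain quotes would need
    escaping, which this sketch does not handle.
    """
    params = [("q", keywords), ("wt", wt), ("facet", "true")]
    # One facet.field parameter per facet.
    params += [("facet.field", field) for field in facet_fields]
    for field, value in include:
        params.append(("fq", f'{field}:"{value}"'))
    for field, value in exclude:
        params.append(("fq", f'-{field}:"{value}"'))
    return SOLR_SELECT + "?" + urlencode(params)


url = build_facet_query(
    "drexel monument",
    include=[("object_type", "Sculptures")],
    exclude=[("place", "Illinois")],
)
print(url)
```

The `include` and `exclude` arguments correspond loosely to the positive and negative browse & select options in the overview, and `wt` selects the output format (Solr ships response writers for XML, JSON, and Python).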

Solrconfig.xml Example: facet field definition
<str name="facet.field">object_type</str>
<str name="facet.field">language</str>
<str name="facet.field">topic</str>
<str name="facet.field">place</str>
<str name="facet.field">date</str>
<str name="facet.field">name</str>
<str name="facet.field">culture</str>
<str name="facet.field">online_media_type</str>
<str name="facet.field">set_name</str>
<str name="facet.field">data_source</str>
<str name="facet.field">tax_kingdom</str>
<str name="facet.field">tax_phylum</str>
<str name="facet.field">tax_division</str>
<str name="facet.field">tax_class</str>
<str name="facet.field">tax_order</str>
<str name="facet.field">tax_family</str>
<str name="facet.field">tax_sub-family</str>
<str name="facet.field">common_name</str>
<str name="facet.field">scientific_name</str>
<str name="facet.field">freetext</str>

Data Example (abbreviated) – a Library Book
<doc boost="1">
  <descriptiveNonRepeating>
    <record_ID>siris_sil_905285</record_ID>
    <unit_code>SIL</unit_code>
    <data_source>Smithsonian Institution Libraries</data_source>
    <title_sort>STORY OF WEST POINT: 18021943 THE WEST POINT TRADITION IN AMERICAN LIFE</title_sort>
    <title label="Title">Story of West Point: 1802-1943; the West Point tradition in American life</title>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Smithsonian Institution Libraries</freetext>
    <freetext category="objectType" label="Type">Books</freetext>
    <freetext category="date" label="Date">1943</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <object_type>Books</object_type>
    <date>1943</date>
  </indexedStructured>
</doc>

Data Example (abbreviated) – a Photograph
<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_arc_104765</record_ID>
    <unit_code>EEPA</unit_code>
    <data_source>Eliot Elisofon Photographic Archives</data_source>
    <title_sort>AERIAL VIEW OF DOWNTOWN JOHANNESBURG SOUTH AFRICA SLIDE</title_sort>
    <title label="Title">Aerial view of downtown Johannesburg, South Africa, [slide]</title>
    <online_media mediaCount="1">
      <media thumbnail="http://sirismm.si.edu/eepa/eepthb/eepa_05859thb.jpg" type="Images">http://sirismm.si.edu/eepa/eep/eepa_05859.jpg</media>
    </online_media>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Eliot Elisofon Photographic Archives</freetext>
    <freetext category="identifier" label="Local number">EEPA EECL 15973</freetext>
    <freetext label="photographer" category="name">Elisofon, Eliot</freetext>
    <freetext category="physicalDescription" label="Physical description">slide : col</freetext>
    <freetext category="notes" label="Summary">This photograph was taken when Eliot Elisofon was on as magazine and traveled to Africa from August 18, 1959 to December 20, 1959</freetext>
    <freetext category="objectType" label="Type">Photographs</freetext>
    <freetext category="topic" label="Topic">Mod. architecture/cityscape</freetext>
    <freetext category="place" label="Place">South Africa</freetext>
    <freetext category="date" label="Date">1959</freetext>
    <freetext category="setName" label="See more items in">Eliot Elisofon Field photographs 1942-1972</freetext>
  </descriptiveOptional>
  <indexedStructured>
  ... (truncated)

Data Example (abbreviated) – a Sculpture
<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_ari_7985</record_ID>
    <unit_code>ARI</unit_code>
    <data_source>Art Inventories</data_source>
    <title_sort>DREXEL MONUMENT SCULPTURE</title_sort>
    <title label="Title">The Drexel Monument, (sculpture)</title>
    <record_link>http://sirisartinventories.si.edu/ipac20/ipac.jsp?&amp;profile=all&amp;source=~!siartinventories&amp;uri=full=3100001~!7985 0#focus</record_link>
    <online_media mediaCount="7">
      <media thumbnail="http://sirismm.si.edu/saam/scan3thb/S75004286_1bthb.jpg" type="Images">http://americanart.si.edu/images/1966/1966.47.36_1b.jpg</media>
    </online_media>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Art Inventories</freetext>
    <freetext category="identifier" label="Control number">IAS 75004286</freetext>
    <freetext label="sculptor" category="name">Manger, Heinrich b. 1833</freetext>
    <freetext label="founder" category="name">Chas. F. Heaton</freetext>
    <freetext category="title" label="title">Francis M. Drexel Monument, (sculpture)</freetext>
    <freetext category="physicalDescription" label="Physical description">metal: bronze Sculpture: bronze; Base: granite; Fountain basin: concrete</freetext>
    <freetext category="notes" label="Description">Index of American Sculpture, University of Delaware, 1985</freetext>
    <freetext category="objectType" label="Type">Sculptures-Fountain</freetext>
    <freetext category="name" label="Subject">Drexel, Francis M</freetext>
    <freetext category="place" label="Place">Illinois</freetext>
    <freetext category="date" label="Date">1881. Cast 1882. Dedicated 1883</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <name>Manger, Heinrich</name>
    <name>Chas. F. Heaton</name>
    <object_type>Sculptures</object_type>
    <topic>Portrait male</topic>
    <name>Drexel, Francis M</name>
  ... (truncated)

A system is only as good as the data that is in it.

Data mapping for multiple databases (truncated)

Faceted Categories
- Determine the most useful facets; more is not better.
- The number of unique facet terms affects system response time.
- Smithsonian has 4.6 million unique terms. Among them: 864,000 names, 126,000 topics, 47,000 places, 139 dates (down from 40,000 before cleanup), 1,000 types (down from 2,000 before cleanup).

Build the facet terms
650 $a Art $z Africa, North $v Periodicals.
<topic>Art</topic>
<place>Africa, North</place>
<object_type>Periodicals</object_type>

Build the facet terms
655 $a Photographs $y 1840-1860.
<type>Photographs</type>
<date>1840s</date>
<date>1850s</date>
<date>1860s</date>

Challenges
- Adapting LCSH and AAT terms in a whole new way
- Still seeking a good way to use See and See Also reference data
- Reducing data inconsistency in our records for better-quality facet terms
- Character conversion challenges with MARC-8, Unicode, and UTF-8

Future plans
- Continue to add data from more digital library databases and museum collection databases
  - Working on National History museum, and American Indian museum.
- Complete the implementation of the capability to interact with external applications
  - Plan to support “American Art and Artist” application
- Add new functionality such as my-list, list-sharing, social tagging.
- Support more visual displays such as Google map and time slider

A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
www.siris.si.edu
[email protected]
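
As a rough illustration of the two "Build the facet terms" slides, the following Python sketch splits a MARC subject string into subfields, maps them to facet fields, and expands a date range into decades. The slides say the production conversion is done by a Perl program, so this is not that code. The subfield-to-facet mapping is inferred only from the 650 and 655 examples shown, and every function and variable name here is hypothetical.

```python
import re

# Assumed mapping from MARC tag and subfield code to facet field, inferred
# from the two slide examples only; the real mapping is not in the slides.
SUBFIELD_FACETS = {
    "650": {"a": "topic", "z": "place", "v": "object_type"},
    "655": {"a": "type", "y": "date"},
}


def parse_subfields(text):
    """Split '$a Art $z Africa, North $v Periodicals.' into (code, value) pairs."""
    pairs = re.findall(r"\$(\w)\s*([^$]*)", text)
    # Trim whitespace and the trailing period that ends a MARC field. This is
    # a simplification: it would also strip the period from an abbreviation.
    return [(code, value.strip().rstrip(".").strip()) for code, value in pairs]


def decades(span):
    """Expand a 'YYYY-YYYY' range into decade labels; pass other values through."""
    match = re.fullmatch(r"(\d{4})-(\d{4})", span)
    if not match:
        return [span]
    start, end = (int(year) // 10 * 10 for year in match.groups())
    return [f"{decade}s" for decade in range(start, end + 1, 10)]


def facet_terms(tag, text):
    """Return (facet_field, term) pairs for one MARC subject field."""
    terms = []
    for code, value in parse_subfields(text):
        field = SUBFIELD_FACETS.get(tag, {}).get(code)
        if field is None:
            continue  # subfield not mapped to any facet
        values = decades(value) if field == "date" else [value]
        terms += [(field, v) for v in values]
    return terms


print(facet_terms("650", "$a Art $z Africa, North $v Periodicals."))
print(facet_terms("655", "$a Photographs $y 1840-1860."))
```

Collapsing date ranges into decade labels is one way a cleanup could shrink a facet from tens of thousands of distinct date strings to a small, browsable set.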