What is OAIster?


OAIster: What’s with the Weird Name?
Kat Hagedorn
UM Library Information Technology
November 28, 2005
What is OAIster?



Is/was a means for UM to test the OAI protocol… (hence the name)
A method for sharing metadata among institutions and groups of people
A means of developing a search service for end-users worldwide
Basics of OAI
What does OAIster collect?



Harvests all metadata from all OAI data providers (within reason)
Only keeps metadata that points to digital objects, e.g., articles, photographs, datasets, etc. in digitized form
All available via search service…
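As a rough illustration of the harvest-and-filter step, here is a minimal Python sketch. It assumes the standard OAI-PMH `ListRecords` verb with the `oai_dc` metadata prefix; the sample response, the `example.org` URLs, and the "identifier looks like a URL" filter are illustrative assumptions, not OAIster's actual rules.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for simple Dublin Core."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)

def parse_list_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "titles": [e.text.strip() for e in rec.iterfind(".//dc:title", NS) if e.text],
            "identifiers": [e.text.strip() for e in rec.iterfind(".//dc:identifier", NS) if e.text],
        })
    token_el = root.find(".//oai:resumptionToken", NS)
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return records, token

def points_to_digital_object(record):
    """Assumed heuristic: keep a record if any dc:identifier looks like a URL."""
    return any(i.lower().startswith(("http://", "https://")) for i in record["identifiers"])

# Hypothetical response from a hypothetical data provider.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Gardens at Albury</dc:title>
          <dc:identifier>http://example.org/objects/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Letter, undigitized</dc:title>
          <dc:identifier>Box 12, folder 3</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
kept = [r for r in records if points_to_digital_object(r)]
```

A real harvester would fetch `list_records_url(...)` repeatedly, passing the returned token back in until the provider stops issuing one.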
Searching OAIster


Time to show off OAIster…
http://www.oaister.org/
A little history




Service is now 3.5 years old
Started with 66 data providers and a little over 200K records
Now have 572 data providers and “a little” over 6 million records
37% US, 63% international
Visibility of OAI

Surprising who hasn’t made their metadata shareable through OAI
- Harvard, Yale, Stanford… the big ones

Initially perplexing, but now clearer:
- OAI exposure is always done at the end of a project
- only recently thought of at initiation of projects
- truthfully, many institutions not collaborative…
Examples of data providers

Many data providers are huge, e.g.,
- arXiv: physics preprint and postprint articles
- pubmed: medical articles, although restricted
- pictureaustralia: images from govt and academic institutions in Australia
- lcoa: Library of Congress digital archives
- usc: U Southern California census data
Examples of data providers



Most are small, though
Many around 100 records
Value of making their records available
- increased visibility
- inclusion in bigger search service than theirs
- incorporation in Yahoo! Search
Yahoo! Search

Two years ago, collaborated with team at Yahoo! Search to send our metadata to them for indexing
- e.g., “gardens at albury” in Yahoo! Search
- we know the hit doesn’t come from robots crawling static HTML, because it shows our metadata:
- <dc:relation>IspartOf Victorian Railways collection.</dc:relation>


Many, many more hits
Also send metadata to Google
System design
[Diagram components: OAI-enabled DC records; non-OAI-enabled DC records; UM harvester; record storage; XSL stylesheets (per source type); XSLT transformation tool; BibClass indexes; search interface (XPAT)]
Transformation of metadata

Most metadata needs to be brushed off
- adding an http:// to the front of URLs

Or raked
- removing instances of <![CDATA[

Or wrung out
- instead of “Where’s Waldo,” it’s “Where’s the incorrect UTF-8 character?”

And should be normalized…
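The three clean-up steps above can be sketched in Python roughly as follows. This is only an illustration under assumed rules: the function names are hypothetical, and OAIster's actual transformations are done with XSL stylesheets and may differ.

```python
import re

# Matches a whole CDATA section and captures its content.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Matches a leading URI scheme such as "http://" or "ftp://".
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")

def add_scheme(url):
    """'Brush off': prefix http:// when a URL has no scheme."""
    url = url.strip()
    if not url or SCHEME_RE.match(url):
        return url
    return "http://" + url

def strip_cdata(text):
    """'Rake': unwrap CDATA sections, keeping the text inside them."""
    return CDATA_RE.sub(r"\1", text)

def find_bad_utf8(raw):
    """'Wring out': return byte offsets where raw bytes are not valid UTF-8."""
    offsets = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the rest of the input decodes cleanly
        except UnicodeDecodeError as err:
            offsets.append(pos + err.start)
            pos += err.end  # skip past the offending bytes and keep scanning
    return offsets
```

For instance, `add_scheme("www.oaister.org")` yields a URL with `http://` in front, and `find_bad_utf8` points at the position of a stray Latin-1 byte so it can be inspected by hand.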
Why normalize?

Sample date values
<date>2-12-01</date>
<date>2002-01-01</date>
<date>0000-00-00</date>
<date>1822</date>
<date>between 1827 and 1833</date>
<date>18--?</date>
<date>November 13, 1947</date>
<date>SEP 1958</date>
<date>235 bce</date>
<date>Summer, 1948</date>
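To make the normalization problem concrete, here is a small Python sketch that maps a few of the date shapes above onto ISO 8601-style strings and returns `None` for anything it cannot interpret safely. The rule set and the `normalize_date` name are assumptions for illustration, not OAIster's normalization code.

```python
import re
from datetime import date

_MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
                "august", "september", "october", "november", "december"]
# Accept full month names and their three-letter abbreviations.
MONTHS = {}
for _i, _name in enumerate(_MONTH_NAMES, start=1):
    MONTHS[_name] = _i
    MONTHS[_name[:3]] = _i

ISO_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
YEAR_RE = re.compile(r"^(\d{4})$")
MONTH_DAY_YEAR_RE = re.compile(r"^([A-Za-z]+)\.? (\d{1,2}), (\d{4})$")
MONTH_YEAR_RE = re.compile(r"^([A-Za-z]+)\.? (\d{4})$")

def normalize_date(value):
    """Return YYYY, YYYY-MM or YYYY-MM-DD, or None if the value is unclear."""
    text = value.strip()

    m = ISO_RE.match(text)
    if m:
        year, month, day = (int(g) for g in m.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            return None  # placeholder values such as 0000-00-00

    m = YEAR_RE.match(text)
    if m and int(m.group(1)) > 0:
        return m.group(1)

    m = MONTH_DAY_YEAR_RE.match(text)
    if m and m.group(1).lower() in MONTHS:
        try:
            return date(int(m.group(3)), MONTHS[m.group(1).lower()], int(m.group(2))).isoformat()
        except ValueError:
            return None

    m = MONTH_YEAR_RE.match(text)
    if m and m.group(1).lower() in MONTHS:
        return f"{m.group(2)}-{MONTHS[m.group(1).lower()]:02d}"

    # Ranges, seasons, BCE dates and two-digit years are left for review.
    return None
```

Values such as `2-12-01`, `between 1827 and 1833`, `18--?` and `235 bce` deliberately fall through to `None`: guessing a year for them would put wrong dates into the index.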
Why use a controlled vocabulary (CV)?

Sample subject values
<subject>30,51,52</subject>
<subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta [Judson].</subject>
<subject>Slavery--United States--Controversial literature</subject>
<subject>view of interior with John Henry sculpture</subject>
<subject>Particles (Nuclear physics) -Research.</subject>
Best practices



Fixing more than half of the data providers is cumbersome
Individuals at OAI-enabled institutions started a “Best Practices” group to inform data providers what they ought to do
http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?TableOfContents
2nd phase OAI


“Best Practices” group sponsored by the Digital Library Federation, which also…
Sponsors our latest grant




Better and more easily calculated statistics
Search interface improvements
Clustering / classification techniques
Using richer metadata
Clustering / classification



Using automated means to take a selection of metadata and determine “what it’s about”
Working with Emory University (one of our grant partners) to test their tool
Results will be integrated into search so users can search within a smaller group of OAIster records
Using richer metadata


Data providers must use simple Dublin Core
Very sparse schema for describing objects
- dc:title must contain main title, sorted title and alternative titles
- dc:subject doesn’t distinguish between geographical, hierarchical, temporal…
Using richer metadata


Encouraging use of richer metadata, especially MODS (Metadata Object Description Schema) from LOC
Developed testbed for grant deliverables
- currently only shows MODS work…
- http://www.hti.umich.edu/m/mods/
Other stuff


Well, make it smaller somehow…
Clean up Boolean interface
- squinch fields together
- include more normalization



Make it available through federated search
Proselytize sharing metadata
Test, test, test
Contact me




Kat Hagedorn
UM Library Information Technology
[email protected]
www.oaister.org