What is OAIster?
OAIster: What’s with the
Weird Name?
Kat Hagedorn
UM Library Information Technology
November 28, 2005
What is OAIster?
Is/was a means for UM to test the OAI
protocol… (hence the name)
A method for sharing metadata among
institutions and groups of people
A means of developing a search service for
end-users worldwide
Basics of OAI
What does OAIster collect?
Harvests all metadata from all OAI data
providers (within reason)
Only keeps metadata that points to digital
objects, e.g., articles, photographs,
datasets, etc. in digitized form
All available via search service…
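The collect-then-filter step above can be sketched roughly as follows. ListRecords and the oai_dc metadata prefix come from the OAI protocol itself; the filtering rule (keep a record if any dc:identifier looks like a URL) is only an assumption for illustration, as are the function names and the example.org address. The slides do not say how OAIster actually decides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for one data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def points_to_digital_object(record_xml):
    """Assumed rule: keep the record if any dc:identifier looks like a URL."""
    root = ET.fromstring(record_xml)
    for ident in root.iter("{%s}identifier" % DC_NS):
        text = (ident.text or "").strip().lower()
        if text.startswith(("http://", "https://", "ftp://")):
            return True
    return False


# Hypothetical record, trimmed to the fields that matter here.
sample = (
    '<dc xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Gardens at Albury</dc:title>"
    "<dc:identifier>http://example.org/images/1</dc:identifier>"
    "</dc>"
)
print(list_records_url("http://example.org/oai"))
print(points_to_digital_object(sample))
```

A record whose only identifier is, say, a call number would be dropped under this rule.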
Searching OAIster
Time to show off OAIster…
http://www.oaister.org/
A little history
Service is now 3.5 years old
Started with 66 data providers and a little
over 200K records
Now have 572 data providers and “a little”
over 6 million records
37% US, 63% international
Visibility of OAI
Surprising who hasn’t made their metadata
shareable through OAI
Harvard, Yale, Stanford…the big ones
Initially perplexing, but now clearer:
making metadata shareable is always done at the end of projects
only recently thought of at the initiation of projects
truthfully, many institutions not collaborative…
Examples of data providers
Many data providers are huge, e.g.,
arXiv: physics preprint and postprint articles
pubmed: medical articles, although restricted
pictureaustralia: images from govt and
academic institutions in Australia
lcoa: Library of Congress digital archives
usc: University of Southern California census data
Examples of data providers
Most are small, though
Many around 100 records
Value of making their records available
increased visibility
inclusion in bigger search service than theirs
incorporation in Yahoo! Search
Yahoo! Search
Two years ago, collaborated with team at
Yahoo! Search to send our metadata to
them for indexing
e.g., “gardens at albury” in Yahoo! Search
we know the hit isn’t from crawling (“roboting”) static HTML, because it shows our metadata:
<dc:relation>IspartOf Victorian Railways
collection.</dc:relation>
Many, many more hits
Also send metadata to Google
System design
[Diagram components: OAI-enabled DC records; non-OAI-enabled DC records; UM harvester; record storage; XSLT transformation tool with XSL stylesheets (per source type); BibClass indexes; search interface (XPAT)]
Transformation of metadata
Most metadata needs to be brushed off
adding an http:// to the front of URLs
Or raked
removing instances of <![CDATA[
Or wrung out
instead of “Where’s Waldo,” it’s “Where’s the
incorrect UTF-8 character?”
And should be normalized…
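A minimal sketch of the three cleanup chores above, assuming each can be handled with plain string and byte operations. The function names are hypothetical, and stripping the closing ]]> along with the opening marker is an assumption; the real OAIster cleanup is done in its own transformation tooling, which is not shown in these slides.

```python
def add_scheme(url):
    """Brush off: put http:// in front of a URL that has no scheme."""
    url = url.strip()
    if "://" in url:
        return url
    return "http://" + url


def strip_cdata(text):
    """Rake: remove CDATA markers, keeping the text they wrapped."""
    # Removing the closing "]]>" as well is an assumption.
    return text.replace("<![CDATA[", "").replace("]]>", "")


def first_bad_utf8_offset(raw):
    """Wring out: return the byte offset of the first invalid UTF-8
    sequence in raw, or None if the bytes decode cleanly."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


print(add_scheme("www.oaister.org"))
print(strip_cdata("<![CDATA[A title]]>"))
print(first_bad_utf8_offset(b"caf\xe9 au lait"))  # a Latin-1 byte inside UTF-8 data
```

Knowing the offset at least answers the "where" question; deciding what the character should have been is still a judgment call per data provider.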
Why normalize?
Sample date values
<date>2-12-01</date>
<date>2002-01-01</date>
<date>0000-00-00</date>
<date>1822</date>
<date>between 1827 and 1833</date>
<date>18--?</date>
<date>November 13, 1947</date>
<date>SEP 1958</date>
<date>235 bce</date>
<date>Summer, 1948</date>
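To illustrate why these values need normalizing, here is a rough sketch that maps a few of the easier formats to an ISO-style string and gives up on the rest. The target format, the function name and the choice to return None for ambiguous values are all assumptions for illustration, not the rules OAIster applies.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def normalize_date(value):
    """Return YYYY, YYYY-MM or YYYY-MM-DD, or None if the value is
    too ambiguous to normalize automatically."""
    text = value.strip()

    # Already ISO-like: accept only if the parts are plausible.
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    if m:
        year, month, day = (int(part) for part in m.groups())
        if year > 0 and 1 <= month <= 12 and 1 <= day <= 31:
            return text
        return None

    # Bare four-digit year.
    if re.fullmatch(r"\d{4}", text) and int(text) > 0:
        return text

    # "November 13, 1947"
    m = re.fullmatch(r"([A-Za-z]+)\.? (\d{1,2}), (\d{4})", text)
    if m and m.group(1)[:3].lower() in MONTHS:
        month = MONTHS[m.group(1)[:3].lower()]
        return "%s-%02d-%02d" % (m.group(3), month, int(m.group(2)))

    # "SEP 1958"
    m = re.fullmatch(r"([A-Za-z]+)\.? (\d{4})", text)
    if m and m.group(1)[:3].lower() in MONTHS:
        return "%s-%02d" % (m.group(2), MONTHS[m.group(1)[:3].lower()])

    # Ranges, seasons, "18--?", BCE dates, two-digit years: leave for a human.
    return None


for sample in ["2002-01-01", "0000-00-00", "1822", "November 13, 1947",
               "SEP 1958", "2-12-01", "Summer, 1948"]:
    print(repr(sample), "->", normalize_date(sample))
```

Even this small sketch leaves half of the sample values unresolved, which is the point of the slide.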
Why use a CV (controlled vocabulary)?
Sample subject values
<subject>30,51,52</subject>
<subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta
[Judson].</subject>
<subject>Slavery--United States--Controversial
literature</subject>
<subject>view of interior with John Henry
sculpture</subject>
<subject>Particles (Nuclear physics) -Research.</subject>
Best practices
Fixing the metadata of more than half of the data providers is
cumbersome
Individuals at OAI-enabled institutions
started a “Best Practices” group to inform
data providers what they ought to do
http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?TableOfContents
2nd phase OAI
“Best Practices” group sponsored by the
Digital Library Federation, which also…
Sponsors our latest grant
Better and more easily calculated statistics
Search interface improvements
Clustering / classification techniques
Using richer metadata
Clustering / classification
Using automated means to take a selection
of metadata and determine “what it’s about”
Working with Emory University (one of our
grant partners) to test their tool
Results will be integrated into search so users can
search within a smaller group of OAIster records
Using richer metadata
Data providers must use simple Dublin Core
Very sparse schema for describing objects
dc:title must contain main title, sorted title and
alternative titles
dc:subject doesn’t distinguish between
geographical, hierarchical, temporal…
Using richer metadata
Encouraging use of richer metadata,
especially MODS (Metadata Object
Description Schema) from LOC
Developed testbed for grant deliverables
currently only shows MODS work…
http://www.hti.umich.edu/m/mods/
Other stuff
Well, make it smaller somehow…
Clean up Boolean interface
squinch fields together
include more normalization
Make it available through federated search
Proselytize sharing metadata
Test, test, test
Contact me
Kat Hagedorn
UM Library Information Technology
[email protected]
www.oaister.org