Crosswalk or Cross-stagger? : Using OCLC's “Digital Import


A case study in using the Connexion
Digital Import tool to streamline
metadata creation in a digital state
documents collection, or, . . .
Christy Allen & Amy Rudersdorf
State Library of North Carolina
Southeastern CONTENTdm Users Group
Annual Meeting, Starkville, MS
July 31, 2008
The Good, the Bad, and the Ugly
[Graphic Removed]
What is Connexion Digital Import?
New-ish* feature in Connexion that allows you
to:
1. upload a digital object to a new or existing MARC
record in WorldCat, and
2. automatically “dump” the record (mapped to
Qualified Dublin Core) and object into a hosted
instance of CONTENTdm,
(and to the Digital Archive if you have a
subscription to that, too),
3. using the OCLC number as the connection point.
*See OCLC’s announcement here: http://www.oclc.org/news/announcements/announcement247.htm
What is required?
• Connexion version 2.0 or higher
• Full-level authorization status or higher in
OCLC
• A hosted version of CONTENTdm!
• An OCLC authorization that includes
CONTENTdm authorization
• A WorldCat record to attach digital content
to
Why did we use it?
• State Library of N.C. is the mandated
depository for state government documents
in North Carolina:
– Need to provide access to all state documents
– Source for original cataloging of *most* state
documents in MARC
– Depository Library survey indicated our clients
want us to continue full MARC cataloging of
documents -- let’s re-use that data!
– Pilot project started using already-cataloged
paper docs that have electronic versions
How does it work?
It’s Magic!!!
Completely gratuitous picture of Stonehenge taken by our Cataloging Branch Head
The Good…
• Multiple access points: WorldCat, ILS,
CONTENTdm, and Google
• Reuses already-existing metadata (MARC
records)
• Files are automatically moved into the
Digital Archive for those who subscribe to
it
• Fits into existing cataloging workflow
• CONTENTdm support is responsive
The Good . . . (continued)
• CONTENTdm is ready out of the box
• Built-in functionalities: JPEG2000, full-text
searchability, user-friendly interface
• Compound object functionality:
– Easy-to-use compound object interface
– builds compound objects on-the-fly from PDF
files
• Crosswalking does allow special characters/
diacritics to come through from WorldCat
(special characters/diacritics can’t be easily added to records
created through the Acquisitions Station until the fall release of
CONTENTdm)
OK, maybe it’s not all magic . . .
Stonehenge snow globe. Doesn’t have quite the same effect.
• Likewise, MARC and QDC are not quite the same…
[Graphic of Stonehenge snow globe Removed]
. . . The Bad and the Ugly:
post-crosswalk editing
At first you feel like this guy. . .
[Graphics of sad pig balloon and girl saying “I don’t care what you say. I’m gonna be a horse when I grow up.” Removed]
. . . but after a while it’s not so bad
Why edit, you say?
• Doesn’t the full-text document contain
everything the user needs? Well . . .
– The mapping between MARC and QDC is
defined by OCLC and is “fixed,” so you
don’t get to pick which MARC fields map
into which QDC fields!
– This means that you may have:
1. Data mapping to a field in which you don’t
want it
2. Data you don’t want at all that maps anyway
3. Data you want that doesn’t map anywhere
Data mapping to a field in which
you don’t want it
– Where is this a problem?
• dc.subject - 099/092/096 fields and
non-LCSH subject terms applied by
other institutions
• dc.language – we use ISO 639-2
code as controlled vocabulary, but
free text note field in MARC (546)
maps to dc.language!
• dc.relation – OCLC URL maps to this
field instead of to dc.identifier
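A minimal sketch of the kind of post-crosswalk correction described above, assuming each exported record is held as a plain Python dict of field name to list of values. The field keys, the `LANGUAGE_CODES` lookup, and the example URL are hypothetical and are not part of CONTENTdm or Connexion.

```python
# Hypothetical lookup from language names found in a free-text note
# to ISO 639-2 codes. Extend it for the languages in your collection.
LANGUAGE_CODES = {"english": "eng", "spanish": "spa", "french": "fre"}


def fix_misplaced_values(record):
    """Return a corrected copy of a record (dict of field -> list of values)."""
    fixed = {field: list(values) for field, values in record.items()}

    # Move URLs that landed in "relation" over to "identifier".
    relation = fixed.get("relation", [])
    urls = [value for value in relation if value.startswith("http")]
    fixed["relation"] = [value for value in relation if value not in urls]
    fixed.setdefault("identifier", []).extend(urls)

    # Replace free-text language notes with controlled codes when a
    # known language name appears in the note; otherwise leave as-is.
    codes = []
    for note in fixed.get("language", []):
        for name, code in LANGUAGE_CODES.items():
            if name in note.lower() and code not in codes:
                codes.append(code)
    if codes:
        fixed["language"] = codes
    return fixed


record = {
    "relation": ["http://example.org/item/1"],  # placeholder URL
    "language": ["Text in English and Spanish."],
}
print(fix_misplaced_values(record))
```

The function copies the record rather than editing it in place, so the original export can be kept for comparison.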
Data you don’t want at all that
maps anyway (1/2)
MARC 099/092/096 fields (call & cutter numbers) map to dc.subject field in CONTENTdm
Data you don’t want at all that
maps anyway (2/2)
– Issues:
• CONTENTdm supplies a controlled
vocabulary (TGM) for this field, or you can
implement your own. However, the CV is
difficult to apply because every record
now contains a unique value that does not
exist in the controlled vocabulary!
• If you DO apply a controlled vocabulary to
the dc.subject field and forget to remove
the classification number while editing the
record, the system will not let you save the
record, and you may lose all your other
edits to that record.
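One way to avoid the failed save is to check the subject values against the controlled vocabulary before editing. The sketch below assumes the subject values and the vocabulary are available as plain Python collections; the function name and the sample values are hypothetical.

```python
def split_subjects(subjects, controlled_vocabulary):
    """Separate subject values into those in the vocabulary and those not."""
    accepted = [s for s in subjects if s in controlled_vocabulary]
    rejected = [s for s in subjects if s not in controlled_vocabulary]
    return accepted, rejected


# Hypothetical data: one heading and one call number that crosswalked
# into the subject field.
vocabulary = {"Transportation--North Carolina"}
subjects = ["Transportation--North Carolina", "G1 2:T7"]

accepted, rejected = split_subjects(subjects, vocabulary)
print("keep:", accepted)
print("remove before saving:", rejected)
```

Anything in the rejected list is either a classification number to delete or a legitimate heading that still needs to be added to the vocabulary, so it is worth reviewing rather than discarding automatically.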
Data you want that doesn’t map
to any CONTENTdm field
– 041 (language codes)
– 780/785 (title replaces/replaced by
fields): only certain indicator/subfield
combinations are crosswalked
– 260 $a (place of publication)
– 245 $c (statement of responsibility)
So, we manually add some of this
information . . .
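As an illustration of pulling the unmapped data out of the MARC record for manual entry, here is a small sketch. It assumes the record has already been parsed into a list of `(tag, [(subfield_code, value), ...])` tuples. That structure and the target field names are hypothetical, not a real library's API.

```python
# MARC tag/subfield pairs that do not crosswalk, with hypothetical
# names for where the values should be recorded.
WANTED = {
    ("041", "a"): "language",
    ("260", "a"): "place_of_publication",
    ("245", "c"): "statement_of_responsibility",
}


def extract_unmapped(marc_fields):
    """Collect values from the subfields listed in WANTED."""
    extracted = {}
    for tag, subfields in marc_fields:
        for code, value in subfields:
            target = WANTED.get((tag, code))
            if target:
                # Drop trailing ISBD punctuation such as " :" or " /".
                cleaned = value.rstrip(" :;,/")
                extracted.setdefault(target, []).append(cleaned)
    return extracted


# Hypothetical record content.
fields = [
    ("041", [("a", "eng")]),
    ("245", [("a", "Annual report /"), ("c", "Dept. of Transportation.")]),
    ("260", [("a", "Raleigh, N.C. :"), ("b", "The Dept.,")]),
]
print(extract_unmapped(fields))
```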
Fields that don’t exist in MARC
We repeatedly input the same data directly
into multiple CONTENTdm records
because . . .
1. the data simply doesn’t exist in the
MARC record, and
2. you can’t apply a CONTENTdm
template to a record dumped directly
from Connexion
Examples: “Collection,” “Digital Format,” “Rights,”
etc.
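If the records are edited outside CONTENTdm (for example, in an exported file), the repeated values could be filled in by a script rather than by hand. This is only a sketch: it assumes records are plain dicts, and the constant values shown are placeholders, not the State Library's actual data.

```python
# Placeholder constant data; replace with your collection's values.
CONSTANT_DATA = {
    "Digital Format": "application/pdf",
    "Rights": "(rights statement goes here)",
}


def apply_constant_data(record, constants):
    """Return a copy of the record with empty or missing fields filled in."""
    updated = dict(record)
    for field, value in constants.items():
        if not updated.get(field):  # never overwrite an existing value
            updated[field] = value
    return updated


record = {"Title": "Annual report", "Rights": ""}
print(apply_constant_data(record, CONSTANT_DATA))
```

Existing values are left alone, so the same constants can be applied to a batch that mixes edited and unedited records.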
Controlled vocabulary issues
• We use LCSH and LC name authorities in
various fields
• Terms were loaded into CONTENTdm after
pulling the data from our Voyager system
• If the WorldCat record had authority
headings that were added or changed
before load, those terms aren’t in our CV
• In Admin module: new controlled
vocabulary terms can’t be added to the CV
directly from the record (must be
laboriously added before record is edited)
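Because new terms must be in the vocabulary before a record is edited, it can help to list the missing headings for a whole batch up front. A minimal sketch, assuming records are dicts of field name to list of values and the vocabulary is a set of strings. The field names and sample headings are hypothetical.

```python
def missing_terms(records, vocabulary, fields=("subject", "creator")):
    """Return headings used in the records that are not yet in the vocabulary."""
    used = set()
    for record in records:
        for field in fields:
            used.update(record.get(field, []))
    return sorted(used - set(vocabulary))


vocabulary = {"Roads--North Carolina"}
records = [
    {"subject": ["Roads--North Carolina", "Bridges--North Carolina"]},
    {"subject": ["Bridges--North Carolina"], "creator": ["Example Agency"]},
]
for term in missing_terms(records, vocabulary):
    print(term)
```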
MARC record authorization problems
• Our OCLC authorization = “Enhance level”
• Some of “our” MARC records have been
upgraded to Elvl:[blank]
(i.e., we can’t edit them anymore)
• CDI process replaces record, but we no
longer have authorization to do so!
• OCLC has recommended we create a
duplicate record
• We are brainstorming other alternatives
with OCLC
Workflow for Editing New Items
1. New items added through CDI appear in
the live repository (not in approval queue)
– We don’t insert a collection name into these
records until they are edited/approved, so that
they don’t come up in a collection-specific
search. (The item will still come up in a repository search.)
Workflow for Editing New Items
2. Newly imported records are batch-downloaded
into the Acquisitions Station,
edited, and re-uploaded with the
Collection name
3. They then become accessible through
collection and repository searches
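The effect of withholding the collection name can be modeled in a few lines. This is a toy illustration of the behavior described in the steps above, not how CONTENTdm implements search. The record structure is an assumption.

```python
def search(records, phrase, collection=None):
    """Return records containing the phrase, optionally limited to one collection."""
    phrase = phrase.lower()
    hits = []
    for record in records:
        # A record with no "Collection" value never matches a collection search.
        if collection is not None and record.get("Collection") != collection:
            continue
        if any(phrase in str(value).lower() for value in record.values()):
            hits.append(record)
    return hits


records = [
    {"Title": "Dept. of Transportation annual report", "Collection": "Publications"},
    {"Title": "Dept. of Transportation budget"},  # unedited: no collection yet
]
print(len(search(records, "Dept. of Transportation", "Publications")))
print(len(search(records, "Dept. of Transportation")))
```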
A search within the Publications Collection for “Dept. of Transportation” returns 7 hits (all edited records)
A search across the entire repository for the same phrase returns 12 hits (3 of the first 4 are unedited records)
Other Issues
• Import isn’t always successful (sometimes,
the digital object isn’t “there” when you
index the collection)
• Unspecified time lags may occur during
digital import
• Large bandwidth required for digital import
to work consistently
• Can’t export administrative fields
auto-populated by OCLC (e.g., the OCLC
number). *Not really a CDI issue, but since
we’re here . . .
Potential improvements? (1/2)
• Use templates (or something) to apply
“constant data” to imported records
• Add controlled vocabulary terms directly
from metadata record while working in
Admin module
• Attach digital content to ALL records
(including CONSER/Elvl:[blank] records)
• Suppress individual records in the “live”
collection until ready to make them publicly
available
Potential improvements? (2/2)
• Let CONTENTdm talk to the WorldCat
authority file for controlled vocabularies
• Some kind of visible “required fields”
indicator in the Admin interface
(customizable on a collection basis).
During the creation, editing, and updating process,
required fields would be obvious.
• Export ALL fields (both administrative and
Dublin Core) from CONTENTdm
The Digital Import Process:
Sometimes Weird but Very Useful
The Flowbee cuts and vacuums hair at the same time!
[Graphic of “Flowbee” in use Removed]
State Library of North Carolina
Check out our collections:
http://statelibrary.dcr.state.nc.us/dimp/index.html
Christy Allen
[email protected]
(soon to be [email protected])
Amy Rudersdorf
[email protected]
(soon to be [email protected])