Customized Mapping and Metadata Transfer from DSpace to

ALCTS
Catalog Form and Function Interest Group Meeting
Optimized Metadata Repurposing
in a Library Using MarcEdit
Sai Deng, Wichita State University Libraries
Outline
Goldbarth poems repurposing from an in-house inventory to OCLC/Voyager
Challenges in metadata mapping and transformation
ETDs harvesting and repurposing from DSpace to OCLC/Voyager
Discussion on metadata repurposing and management
Goldbarth Poems in A Special Collection
Inventory
Albert Goldbarth’s contributions to journals,
including over 1800 poems.
Goldbarth Collection:
A: Books authored by Goldbarth
B: Blurbs by Goldbarth (reviews)
C: Contributions to Journals (poems)
L: Library (Goldbarth’s library collection)
R: Research Books
T: Textbooks and Anthologies
X: Miscellaneous
(This in-house inventory structure was developed by Special Collections’
staff and is based on B. C. Bloomfield and Edward Mendelson’s schema
for a bibliography of the poet W. H. Auden.)
Goldbarth Poems Metadata
Excel file, free style
SC number; SC subnumber; vol/number/date; Journal title; Title
of Goldbarth appearance (page number); Notes
Data example: 5575 C 50; Volume 1 Number 1 Spring 1971;
Ark River Review; Blocking a Street for Peace (4); published in
Wichita
Created by Special Collections to track materials.
A local decision was made to add the poem entries
to OCLC and Voyager.
How to reuse these Excel data instead of recataloging every field in OCLC?
MarcEdit Delimited Text Translator
Manipulating and Optimizing Metadata in Excel
Enriching data and batch editing
publication place (260$a), publisher name (260$b),
publication year (260$c), dimensions (300$c), journal OCLC
no. (773$w), additional notes (if needed)
Combine several fields to one to conform to MARC
field definition
Example: =CONCATENATE(G2, " (", I2, ")")
Result: Texas Observer (Austin, Tex.)
(Join “journal title” in column G and “publication place” in column I to form 773$t, host item title)
Manipulating and Optimizing Metadata in Excel
Extract data to form a new field
Example: =Right(F2, 4)
Result: Vol. 76, No. 24, 1984 → 1984
(Extract publication year “1984” from the original value “Vol. 76, No. 24, 1984” in cell F2 for 260$c, publication year)
Formatting data to conform to MARC punctuation
Example: =CONCATENATE(I2, " :")
Result: Austin, Tex. → Austin, Tex. :
(Add “ :” after the publication place for MARC field 260$a, publication place)
Other batch editing
Change title cases;
Make vol., no., places and other terms consistent…
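The same spreadsheet manipulations can be scripted when the inventory is too large to edit by hand. A minimal Python sketch, assuming each cell value has already been read into a string; the function names are hypothetical and none of this is part of MarcEdit:

```python
def host_title(journal: str, place: str) -> str:
    """Join journal title and publication place for 773$t."""
    return f"{journal} ({place})"


def pub_year(enumeration: str) -> str:
    """Take the last four characters (after trimming) as the year for 260$c."""
    return enumeration.strip()[-4:]


def pub_place(place: str) -> str:
    """Append the ' :' that precedes 260$b in MARC punctuation."""
    return f"{place} :"


print(host_title("Texas Observer", "Austin, Tex."))
print(pub_year("Vol. 76, No. 24, 1984"))
print(pub_place("Austin, Tex."))
```

As with the RIGHT() formula, the year extraction only works when the year is the last item in the cell, so inconsistent rows still need review.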
Metadata Mapping and Transformation from
Excel to Marc
Mapping and transformation
Excel field → MARC field
Callno → 099$a
Title → 245$a
Pubplace → 260$a
Pubname → 260$b
Size → 300$c
Page → 300$a
Notes → 500$a (or 590$a)
Vol → 773$a
Journal → 773$t
Transformation from Excel file to Marc text file: (.xls->.mrk)
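To make the column-to-field step concrete, here is a rough Python sketch of what a delimited-text-to-mnemonic conversion does with one spreadsheet row. It is an illustration only, not MarcEdit's Delimited Text Translator; the `MAPPING` dictionary, the chosen indicators and `row_to_mrk` are assumptions that cover a subset of the columns listed above.

```python
# Hypothetical mapping: column name -> (MARC tag, indicators, subfield code).
# A backslash stands for a blank indicator, as in MarcEdit's mnemonic format.
MAPPING = {
    "Callno":   ("099", "\\\\", "a"),
    "Title":    ("245", "10", "a"),
    "Pubplace": ("260", "\\\\", "a"),
    "Pubname":  ("260", "\\\\", "b"),
    "Page":     ("300", "\\\\", "a"),
    "Notes":    ("500", "\\\\", "a"),
    "Vol":      ("773", "0\\", "a"),
    "Journal":  ("773", "0\\", "t"),
}


def row_to_mrk(row: dict) -> list:
    """Turn one spreadsheet row into mnemonic (.mrk) field lines."""
    fields = {}  # tag -> [indicators, subfield text]; dicts keep insertion order
    for column, (tag, indicators, code) in MAPPING.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue  # nothing to map for this column
        entry = fields.setdefault(tag, [indicators, ""])
        entry[1] += f"${code}{value}"
    # MarcEdit writes two spaces between the tag and the indicators.
    return [f"={tag}  {ind}{subs}" for tag, (ind, subs) in fields.items()]


row = {
    "Callno": "5575 C 1550",
    "Title": "Lullabye for skyler /",
    "Pubplace": "Clarksville, Tenn :",
    "Pubname": "Austin Peay State University, Center for the Creative Arts,",
    "Page": "P. 15 ;",
}
for line in row_to_mrk(row):
    print(line)
```

Columns that share a tag (260, 773) are merged into one field, which is the many-to-one case discussed later in the talk.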
Post-transformation Editing in MarcEditor
Add/delete field (if needed)
Example: add general notes, 500 $a Albert
Goldbarth Collection.
Edit indicator data and subfield data (if
needed)
Edit 008 field (adjust year and place code)
A Transformed Marc Text Sample Record
=LDR 00738caa a2200229Ia 4500
=008 081124s1988\\\\tnu\\\\\\\\\\\000\0\eng\d
=099 \\$aSPEC$aCOLL$a5575$aC 1550
=100 1\$aGoldbarth, Albert.
=245 10$aLullabye for skyler /$cAlbert Goldbarth.
=260 \\$aClarksville, Tenn :$bAustin Peay State University, Center for the Creative Arts,$c1988.
=300 \\$aP. 15 ;$c23 cm.
=500 \\$aAlbert Goldbarth Collection.
=500 \\$aGoldbarth's poems printed on blue paper.
=730 0\$aAlbert Goldbarth Collection.
=773 0\$7nnas$tZone 3 (Clarksville, Tenn)$aVol. 3, No. 2 (Spring 1988)$w(OCoLC)13451008
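For readers unfamiliar with the mnemonic layout, the sketch below splits one field line into tag, indicators and subfields. It is a simplified reading of the format, assuming each field sits on a single line and that `$` only ever introduces a subfield; `parse_mrk_line` is a hypothetical helper, not a MarcEdit function.

```python
import re


def parse_mrk_line(line: str):
    """Split one mnemonic line into (tag, indicators, [(code, value), ...])."""
    match = re.match(r"=(\w{3})\s+(.*)$", line)
    if match is None:
        raise ValueError(f"not a mnemonic field line: {line!r}")
    tag, rest = match.groups()
    if tag == "LDR" or tag < "010":
        # Leader and control fields carry no indicators or subfields.
        return tag, "", [("", rest)]
    indicators, data = rest[:2], rest[2:]
    subfields = [(chunk[0], chunk[1:]) for chunk in data.split("$")[1:] if chunk]
    return tag, indicators, subfields


print(parse_mrk_line("=245 10$aLullabye for skyler /$cAlbert Goldbarth."))
print(parse_mrk_line("=099 \\\\$aSPEC$aCOLL$a5575$aC 1550"))
```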
Batch Export Marc to OCLC
The special collection call number needs to be deleted in OCLC after the records are exported to Voyager.
Case 1: Some Reflections
Greatly improves cataloging efficiency;
Some limitations:
The 008 field cannot be generated for varying data (e.g. different years and place codes) and needs editing in MarcEditor;
The data mapping interface shows only field numbers (field 1, field 2…) rather than field names, which makes mapping less intuitive.
Challenges in Metadata Mapping and Transfer
The following apply to both Goldbarth and ETD projects.
One-to-many and many-to-one mapping
One field mapped to several sub fields;
Combine several fields to form a new field.
No data to be mapped to a particular field
Enriching field data (before/after mapping);
Adding indicators in mapping.
Metadata integrity: data loss is not obvious, apart from undefined punctuation…
Case-by-case analysis of homegrown metadata
Data maintenance (different departments, systems)
More Challenges in Metadata Mapping and
Transfer
The following apply more to the ETDs DC-Marc
mapping.
Specificity and granularity
Handling of keywords and controlled vocabulary to be
mapped to a more structured system;
Subfields and indicators need to be added;
DSpace 1.4 does not offer qualified DC harvesting.
Left-out metadata
Decision to discard some DC data.
ETDs at WSU
ETDs in DSpace/SOAR, OCLC and Voyager
One digital copy deposited in SOAR (Shocker Open
Access Repository, DSpace based)
Metadata in Dublin Core format
Re-entered in OCLC and downloaded to Voyager
Metadata in MARC format
ETD workflow dilemma
Double keying in SOAR and OCLC
Metadata management: repurposing needed
Chart created by Institutional Repository Librarian Susan Matveyeva
ETD Workflow in Other Institutions
ETD workflow
University of Virginia (1999), Texas A & M (2004)
Home-grown scripts, site-specific harvesters
Kent State University (2007)
Harvest from OhioLINK ETD Center, ETD-MS to Marc…
XSLT Transformation
LC MARC 21 XML schema with MarcXML toolkit
Dublin Core to MARCXML Stylesheet
OAI community developed tools, mostly for IT staff
MarcEdit (Terry Reese)
Metadata Harvester, MARC Editor
Low-barrier harvester, can be used by catalogers
Sample Record in SOAR (Dublin Core)
DC Field: Value
dc.contributor.author: Niles, Rae
dc.date.accessioned: 2006-12-24T14:56:10Z
dc.date.available: 2006-12-24T14:56:10Z
dc.date.copyright: 2006
dc.date.issued: 2006-05
dc.identifier.other: d06005
dc.identifier.uri: http://hdl.handle.net/10057/373
dc.description: Thesis (Ed.D.)--Wichita State University, College of Education. (en)
dc.description: "May 2006."
dc.description: Includes bibliographic references (leaves 129-145). (en)
dc.description.abstract: The purpose of this study was to describe and identify Sedgwick High School’s teacher and student perceptions of the impact of one-to-one laptop computer access using an appreciative inquiry theoretical research perspective and the theoretical frameworks of change and paradigm shift…
dc.format.extent: xiv, 167 leaves : digital, PDF file.
dc.format.extent: 1174852 bytes
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.rights: Copyright Rae Niles, 2006. All rights reserved.
dc.subject.lcsh: Educational technology
dc.subject.lcsh: Education--Data processing
dc.subject.lcsh: Electronic dissertations
dc.title: A study of the application of emerging technology: teacher and student perceptions of the impact of one-to-one laptop computer access
dc.type: Dissertation
dc.thesis.adviser: Calabrese, Raymond L.
dc.identifier.oclc: 71805797
Appears in Collections:
EL Theses and Dissertations
COE Theses and Dissertations
Dissertations
Dublin Core to MARC Mapping
Fields in DSpace → Transformed MARC fields in OCLC
dc.contributor.author → 100 1 _ Author.
dc.date.accessioned
dc.date.available
dc.date.copyright
dc.date.issued → 260 ǂc year.
dc.identifier.other → 099 ……
dc.identifier.uri → 856 4 0 …
dc.description → 502 Thesis (Ed.D.)--Wichita State University, College of …
dc.description → 500 "Month year."
dc.description → 504 Includes bibliographic references…
dc.description.abstract → 520 3 _ …
dc.format.extent → 300
dc.format.extent
dc.format.mimetype
dc.language.iso → 546 en_US
dc.rights → 540 Access restricted to WSU students, faculty and staff (delete)
dc.subject → 690 (keywords, not controlled vocabulary, delete)
dc.subject.lcsh → 650 _ 0
dc.title → 245 1 _ …
dc.type → 655 _ 7 Dissertation ǂ2 local
dc.thesis.adviser → 700 1 2 … ǂe advisor
dc.identifier.oclc
Appears in Collections:
856 4 1 …
Using MarcEdit
MarcEdit Interface
Metadata transformation in MarcEdit
The wheel and spoke design for metadata transformation (by Reese)
Dublin Core
EAD
MarcXML
MODS
TEI
Data Flow Diagram
[Diagram: the MarcEdit Metadata Harvester sends an OAI request to DSpace and receives an OAI response (raw XML, DC); an XSLT (DC to MarcXML) turns it into MARC in MarcEditor, from which records are exported to OCLC and Voyager. Customization of the XSLT covers resolving data ambiguity (many-to-one mapping with element positioning…), authorized data processing (title, author, subject…) and string processing (data normalization…).]
Selective Harvesting
Define in MarcEdit
by identifier (e.g.
oai:soar.wichita.edu:10057/255 )
by set (e.g. hdl_10057_351)
by date (e.g. from=2007-01-01&until=2008-01-01)
Or, http://soar.wichita.edu/dspaceoai/request?verb=ListRecords&metadataPrefix=oai_dc&from=2007-01-01&until=2008-01-01
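The three selection criteria correspond to standard OAI-PMH request arguments (`identifier` for GetRecord; `set`, `from` and `until` for ListRecords). A small Python sketch that only builds the request URLs and does not contact the server; the base URL is the one shown on this slide, and the helper names are hypothetical:

```python
from urllib.parse import urlencode

BASE_URL = "http://soar.wichita.edu/dspaceoai/request"  # endpoint as shown on the slide


def list_records_url(set_spec=None, from_date=None, until_date=None, prefix="oai_dc"):
    """Build a ListRecords request, optionally limited by set and date range."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{BASE_URL}?{urlencode(params)}"


def get_record_url(identifier, prefix="oai_dc"):
    """Build a GetRecord request for a single OAI identifier."""
    params = {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": prefix}
    return f"{BASE_URL}?{urlencode(params)}"


print(list_records_url(from_date="2007-01-01", until_date="2008-01-01"))
print(list_records_url(set_spec="hdl_10057_351"))
print(get_record_url("oai:soar.wichita.edu:10057/255"))
```

`urlencode` percent-encodes the colons and slash in the identifier, which is the form an OAI-PMH request expects.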
How do we define harvesting theses
only?
Define by set
(http://soar.wichita.edu/dspaceoai/request?verb=ListSets)
Sets by schools and departments
AE Theses and Dissertations
(hdl_10057_313)
ANTH Theses (hdl_10057_233)
BIO Theses (hdl_10057_389)
CE Theses and Dissertations
…
Or sets in two categories
Master’s Theses (hdl_10057_351)
Dissertations (hdl_10057_352)
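Picking the thesis sets out of a ListSets response can also be scripted. The sketch below parses a hand-written response fragment rather than a live one: the three thesis sets reuse names and specs from this slide, while the "Faculty papers" set and its spec are invented for contrast. `find_sets` and its keyword rule are assumptions.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative fragment only; the last set is hypothetical.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>hdl_10057_233</setSpec><setName>ANTH Theses</setName></set>
    <set><setSpec>hdl_10057_351</setSpec><setName>Master's Theses</setName></set>
    <set><setSpec>hdl_10057_352</setSpec><setName>Dissertations</setName></set>
    <set><setSpec>hdl_0000_0</setSpec><setName>Faculty papers</setName></set>
  </ListSets>
</OAI-PMH>"""


def find_sets(xml_text, keywords=("theses", "dissertations")):
    """Return {setSpec: setName} for sets whose name contains a keyword."""
    root = ET.fromstring(xml_text)
    found = {}
    for node in root.iterfind(".//oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        if any(word in name.lower() for word in keywords):
            found[spec] = name
    return found


for spec, name in find_sets(SAMPLE).items():
    print(spec, name)
```

Each returned setSpec can then be used as the `set` argument of a ListRecords request.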
Mapping Problems
Harvested Test Records Exported to OCLC
Error Reports in OCLC
100 occurrence 1, indicator 2 - invalid code
520 occurrence 4, $a occurrence 1, position 76 - invalid character - data must be ALA characters
655 occurrence 1, $2 - invalid relationship - when element is present, then 655 indicator 2 must equal 7 …
Mapping Problems
Four “description” fields of DC all mapped to 520 (summary)
dc.date (e.g. “2006-12-24T14:56:10Z”) mapped to 260 (publication,
distribution)
dc.identifier (e.g. “d06005”) mapped to 500 (general notes)
All keywords and subjects mapped to 690 (local subject).
Need customization to meet our needs.
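Two of the OCLC errors above are indicator problems that could be caught locally before export. A minimal sketch of such a check, assuming fields are already split into tag, two-character indicator string (backslash for blank) and a list of subfield codes; it covers only these two rules and is not OCLC's validation:

```python
def check_field(tag: str, indicators: str, subfield_codes: list) -> list:
    """Return a list of problems found for one field (empty if none)."""
    problems = []
    # The second indicator of 100 is undefined in MARC 21, so it must be blank.
    if tag == "100" and indicators[1] != "\\":
        problems.append("100: second indicator must be blank")
    # A 655 that names its source in $2 must carry second indicator 7.
    if tag == "655" and "2" in subfield_codes and indicators[1] != "7":
        problems.append("655: $2 is present, so second indicator must be 7")
    return problems


print(check_field("100", "10", ["a"]))
print(check_field("655", "\\0", ["a", "2"]))
print(check_field("655", "\\7", ["a", "2"]))
```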
Customized Mapping in XSLT
Customization category (Reese)
Resolving data ambiguity
Authorized data processing
String Processing
How I originally categorized the customization types
Selective data transformation; metadata element positioning; field relationship definition; field indicator correction and validation; partial data extraction…
These issues were then arranged under Reese’s categories
Customized Mapping in XSLT
Resolving data ambiguity
Same DC fields to different MARC fields:
description → 502 (Dissertation)
description → 500 (General Note)
description → 504 (Bibliography)
Qualified DC element:
description.abstract → 520 (abstract)
Solution: element positioning
<xsl:for-each select="dc:description[1]">
  <datafield tag="502" ind1=" " ind2=" ">
    <subfield code="a">
      <xsl:value-of select="normalize-space(.)" />
    </subfield>
  </datafield>
</xsl:for-each>
<xsl:for-each select="dc:description[2]">
  <datafield tag="500" ind1=" " ind2=" ">
    <subfield code="a">
      <xsl:value-of select="normalize-space(.)" />
    </subfield>
  </datafield>
</xsl:for-each>
…
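The same positional logic, sketched in Python for readers who do not work in XSLT. The record is a hand-built oai_dc fragment using the description values from the SOAR sample; the position-to-field list and the function name are assumptions. In unqualified DC the abstract arrives as a fourth description, which is mapped to 520 here.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Assumed order of dc:description occurrences: (MARC tag, indicators).
POSITION_TO_FIELD = [("502", "\\\\"), ("500", "\\\\"), ("504", "\\\\"), ("520", "3\\")]

RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>Thesis (Ed.D.)--Wichita State University, College of Education.</dc:description>
  <dc:description>"May 2006."</dc:description>
  <dc:description>Includes bibliographic references
      (leaves 129-145).</dc:description>
</oai_dc:dc>"""


def map_descriptions(xml_text: str) -> list:
    """Map the 1st, 2nd, 3rd, 4th dc:description to 502, 500, 504, 520."""
    root = ET.fromstring(xml_text)
    lines = []
    for (tag, ind), node in zip(POSITION_TO_FIELD, root.findall(f"{DC}description")):
        text = " ".join((node.text or "").split())  # same effect as normalize-space()
        lines.append(f"={tag}  {ind}$a{text}")
    return lines


for line in map_descriptions(RECORD):
    print(line)
```

Positioning is fragile: if a record lacks one of the notes or enters them in a different order, every later description lands in the wrong field.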
Customized Mapping in XSLT
Authorized data processing
Primary entries vs. added entries: title and personal names processing
Template to deal with personal names (in MarcEdit)
E.g. <dc:creator>Webb, Kyle M.</dc:creator>
<dc:creator>Webb, Kyle M., 1977 -</dc:creator>
transformed to
=100 1\$aWebb, Kyle M.
=100 1\$aWebb, Kyle M., $d1977-
Identify field relationship and correct indicators
100, 245 (author, title) relationship: if 100 exists, 245 1 _; otherwise, 245 0 _
Local element: dc.thesis.adviser transformed to 700 1_
(If more than one dc.thesis element exists, positioning is needed.)
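A rough Python equivalent of the name handling and the 100/245 indicator rule, for illustration only. It is not the MarcEdit personal-name template; the regular expression assumes dates appear as a trailing "yyyy-" or "yyyy-yyyy", and both function names are hypothetical.

```python
import re


def name_subfields(name: str) -> str:
    """Split a creator string into $a (name) and, if present, $d (dates)."""
    name = " ".join(name.split())
    match = re.match(r"^(.*?),\s*(\d{4}\s*-\s*(?:\d{4})?)$", name)
    if match:
        heading, dates = match.groups()
        dates = re.sub(r"\s+", "", dates)  # "1977 -" becomes "1977-"
        return f"$a{heading},$d{dates}"
    return f"$a{name}"


def title_first_indicator(has_100: bool) -> str:
    """245 first indicator: 1 when a 100 main entry exists, otherwise 0."""
    return "1" if has_100 else "0"


print("=100  1\\" + name_subfields("Webb, Kyle M."))
print("=100  1\\" + name_subfields("Webb, Kyle M., 1977 -"))
print(title_first_indicator(True), title_first_indicator(False))
```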
Customized Mapping in XSLT
Processing of non-filing characters in title
245 (title) 2nd indicator: …a, an, the… (0, 2, 3, 4)
<xsl:for-each select="dc:title[1]">
  <xsl:choose>
    <xsl:when test="$exist100!=''">
      <xsl:choose>
        <xsl:when test="substring(., 1, 2)='A '">
          <datafield tag="245" ind1="1" ind2="2">
            <xsl:choose>
              <xsl:when test="contains(.,':')">
                <subfield code="a">
                  <xsl:value-of select="concat(substring-before(.,':'),' : ')" />
                </subfield>
                <subfield code="b">
                  <xsl:value-of select="concat(substring-after(.,':'),' / ')" />
                </subfield>
              </xsl:when>
              …
        <xsl:otherwise>
          <datafield tag="245" ind1="1" ind2="0">
Alternatively, it can be defined in the title template.
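The nested tests are easier to follow as a lookup. A Python sketch of the same idea, assuming English articles only and a title that may contain one colon separating title and subtitle; `ARTICLES`, `nonfiling_indicator` and `title_field` are hypothetical names:

```python
# Leading article -> number of nonfiling characters (article plus one space).
ARTICLES = {"a": 2, "an": 3, "the": 4}


def nonfiling_indicator(title: str) -> str:
    """Return the 245 second indicator for a title."""
    first, _, rest = title.partition(" ")
    if rest and first.lower() in ARTICLES:
        return str(ARTICLES[first.lower()])
    return "0"


def title_field(title: str, has_100: bool) -> str:
    """Build a 245 line, splitting title and subtitle at the first colon."""
    title = " ".join(title.split())
    ind1 = "1" if has_100 else "0"
    ind2 = nonfiling_indicator(title)
    main, sep, sub = title.partition(":")
    if sep:
        return f"=245  {ind1}{ind2}$a{main.strip()} :$b{sub.strip()} /"
    return f"=245  {ind1}{ind2}$a{title} /"


print(title_field("A study of the application of emerging technology: teacher and "
                  "student perceptions of the impact of one-to-one laptop computer access",
                  has_100=True))
```

The trailing " /" assumes a statement of responsibility ($c) will follow, as in the stylesheet fragment above.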
Customized Mapping in XSLT
Subjects vs. Keywords
Only kept common subject in the test (when keywords and subjects mixed inconsistently)
<xsl:for-each select="dc:subject">
  <xsl:if test=".='Electronic dissertations'">
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">
        <xsl:value-of select="." />
      </subfield>
…
Subject template (OSU solution)
<dc:subject>ocean wave energy</dc:subject>
<dc:subject>direct-drive</dc:subject>
<dc:subject>fluid-structure interaction</dc:subject>
<dc:subject>Ocean wave power</dc:subject>
<dc:subject>Fluid-structure interaction</dc:subject>
Transformed to
=650 \0$aOcean wave power.
=650 \0$aFluid-structure interaction.
=690 \\$aocean wave energy.
=690 \\$adirect-drive.
=690 \\$afluid-structure interaction.
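The example output is consistent with a simple rule: terms that start with a capital letter are treated as controlled headings (650, second indicator 0) and the rest as local keywords (690). That rule is inferred from the example, not confirmed as the logic of the OSU template. A Python sketch under that assumption:

```python
def split_subjects(subjects: list) -> list:
    """Sort dc:subject values into 650 (capitalized) and 690 (other) lines."""
    controlled, local = [], []
    for term in subjects:
        term = " ".join(term.split())
        if not term:
            continue
        text = term if term.endswith(".") else term + "."
        if term[0].isupper():
            controlled.append(f"=650  \\0$a{text}")
        else:
            local.append(f"=690  \\\\$a{text}")
    return controlled + local  # 650 fields first, then 690


terms = [
    "ocean wave energy",
    "direct-drive",
    "fluid-structure interaction",
    "Ocean wave power",
    "Fluid-structure interaction",
]
for line in split_subjects(terms):
    print(line)
```

A capitalization test cannot tell whether a term is actually an authorized heading, so the 650 fields still need checking against the authority file.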
Customized Mapping in XSLT
String Processing
Functions
normalize-space()
translate()
substring()…
Example: Extract partial value from DC element
260 (Date): only extract year from the issuing date in DC
<xsl:for-each select="dc:date[4]">
  <xsl:if test=".!=''">
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="c">
        <xsl:value-of select="concat(substring(.,1,4), '.')" />
      </subfield>
    </datafield>
  </xsl:if>
</xsl:for-each>
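For comparison, the same string operations in Python: `normalize_space` mirrors XPath normalize-space(), and slicing takes the place of substring(., 1, 4). The function names are hypothetical, and the digit check is an addition that the stylesheet fragment does not perform.

```python
def normalize_space(value: str) -> str:
    """Trim and collapse runs of whitespace, like XPath normalize-space()."""
    return " ".join(value.split())


def issued_year(date_value: str):
    """Return the first four characters if they form a year, else None."""
    year = normalize_space(date_value)[:4]
    return year if len(year) == 4 and year.isdigit() else None


def field_260c(date_value: str):
    """Build a 260 $c line from a DC date such as 2006-05, or None if no year."""
    year = issued_year(date_value)
    return f"=260  \\\\$c{year}." if year else None


print(field_260c("2006-05"))
print(issued_year("2006-12-24T14:56:10Z"))
print(field_260c(""))
```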
Customized Mapping in XSLT
Leaders: fixed fields that comprise the first 24 character positions (00-23) of each MARC
record. They provide information for the processing of the record.
008 field (Fixed-Length Data Elements)
Type (t, manuscript language material)
BLvl (m, bibliographic level is monograph)
Desc (a)
ELvl (I, encoding level is full level)
Form (s, form of item is electronic)
Cont (b, m, content is theses with bibliographies)
Ills (a, illustration included)
Srce (d, cataloging source)
Conf (0, not a conference publication)
Fest (0, not a festschrift)
LitF (0, not fiction)
DtSt (s, single date)
Indx (0, no index)
Lang (eng, language is English)
Ctry (xx)
Ways to handle:
Scripting and adding all fixed fields (leader and 008 fields) in OAIDCtoMARCXML.xsl;
Or, adding the 008 field in MarcEditor after record export;
Or, applying a fixed field template after the records are exported to OCLC.
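Whichever option is chosen, the 008 itself is a 40-character string with fixed positions. Of the elements listed above, Type, BLvl, Desc and ELvl live in the leader, so the sketch below covers only the 008 positions of the MARC 21 books layout. `build_008` is a hypothetical helper; its defaults follow the ETD values on this slide, and the date entered in the second call is invented.

```python
def build_008(entered, year, place="xx", lang="eng", ills="a", form="s",
              contents="bm", source="d"):
    """Assemble a books-format 008 and return it as a mnemonic line."""
    def pad(value, width):
        return value.ljust(width)[:width]

    parts = [
        pad(entered, 6),   # 00-05 date entered on file (yymmdd)
        "s",               # 06    DtSt: single date
        pad(year, 4),      # 07-10 Date 1
        pad("", 4),        # 11-14 Date 2 (blank)
        pad(place, 3),     # 15-17 Ctry
        pad(ills, 4),      # 18-21 Ills
        " ",               # 22    Audn
        pad(form, 1),      # 23    Form
        pad(contents, 4),  # 24-27 Cont
        " ",               # 28    GPub
        "0",               # 29    Conf
        "0",               # 30    Fest
        "0",               # 31    Indx
        " ",               # 32    undefined
        "0",               # 33    LitF
        " ",               # 34    Biog
        pad(lang, 3),      # 35-37 Lang
        " ",               # 38    MRec
        pad(source, 1),    # 39    Srce
    ]
    field = "".join(parts)
    assert len(field) == 40
    # Blanks are written as backslashes in the mnemonic format.
    return "=008  " + field.replace(" ", "\\")


# Values from the Goldbarth sample record (no illustrations, print, no contents codes).
print(build_008("081124", "1988", place="tnu", ills="", form="", contents=""))
# An ETD with the defaults above; the date entered is hypothetical.
print(build_008("080101", "2006"))
```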
Harvesting Using the Customized XSLT: Records Are Loaded into MarcEdit's MarcEditor
MarcEditor
Edit harvested theses in MarcEditor
Batch edit fields, subfields, indicators (if needed)
E.g.: add 008 field for all records
.mrk (MARC text file) → Compile to .mrc (MARC)
Or
Save as .mrk8 (MARC UTF8 text file) → Compile to .mrc (MARC)
Batch Import Records to OCLC
Click “File-Import Records…”
Select “Import to Local Save File”
Review and edit as needed, attach holdings and apply the ETD fixed field template (if needed).
Case 2: Some Reflections
The customized mapping and metadata transfer can eliminate the need for double entry in DSpace and OCLC/Voyager and significantly improve our ETD workflow.
Data mapping, manipulation and transformation
Using qualified DC instead of element positioning in XSLT;
DSpace 1.5 enables qualified DC crosswalk for OAI-PMH;
Handling of MARC fixed fields and 008 field.
Other technical issues
Using other tools for harvesting besides MarcEdit;
Using DSpace Item Importer and Exporter instead of Metadata Harvester.
Discussion on Metadata Repurposing and
Management
Metadata Repurposing
The tool
MarcEdit: a low-barrier metadata harvesting, mapping and transfer tool.
The crosswalk
One single crosswalk and style sheet will not meet all needs;
Needs to be based on standard practice but add local variations;
Application-specific mapping is needed for special projects.
Dealing with common mapping challenges
One-to-many/many-to-one mapping, specificity and granularity, no data to be
mapped to a field, left-out data, data integrity, data loss, home-grown data, case
analysis…
Metadata management
Coordination in metadata repurposing is important;
Synchronizing metadata in different systems?
Updating data in both systems (e.g. DSpace, OCLC/Voyager);
Re-harvesting, re-transfer and data overlay?
Project Team and Acknowledgements
Goldbarth Cataloging
Sai Deng, Nancy Deyoe, Laurie Allen, Technical Services
Mary Nelson, Josh Yearout, Lorraine Madway, Special
Collections, Wichita State University
ETDs Project
Susan Matveyeva, Sai Deng, Tse-Min Wang, Sandy Oswald,
Manoj Gogoi, Technical Services, Wichita State University
Terry Reese, Consultant, Oregon State University
Thank you!