Batch-conversion of Non-standard
Multiscript Records by XSLT
Lucas Mak
Metadata and Catalog Librarian
Michigan State University
Catalog Management Interest Group, ALA Midwinter 2011, Jan. 8, 2011, San Diego CA
Agenda
• Background
– Structure of multiscript records
• Model A vs. Model B
– Using z39.50 for cataloging
• Multiscript records retrieved through z39.50
– Coding issues
– Problems caused by non-standard multiscript records
• Solutions
– Design of XSLT
• Processing logic
• Factors affecting the design
• Limitations & unintended consequences
Structure of Multiscript Records
• Multiscript records
– For recording data in multiple scripts in MARC
records
– One script may be considered the primary script
of the data content of the record, even though
other scripts are also used for data content
– Two models
• Model A: Vernacular & Transliteration
• Model B: Simple Multiscript Records
Structure of Multiscript Records
• Model A: Vernacular & Transliteration
– The regular fields may contain data in different scripts
and in the vernacular or transliteration of the data.
Fields 880 are used when data needs to be duplicated
to express it in both the original vernacular script and
transliterated into one or more scripts
– Model A data in the regular fields is linked to the data
in 880 fields by a subfield $6 that occurs in both of the
associated fields
• $6 [linking tag]-[occurrence number]/[script identification
code]/[field orientation code]
* MARC21 Bibliographic Appx. D
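The $6 pattern above can be illustrated with a short sketch. This is Python rather than the XSLT used later in the talk, and `Linkage` and `parse_subfield_6` are hypothetical names introduced only for this example. It assumes the parts are separated exactly as the pattern shows: a hyphen after the linking tag and slashes before the two optional codes.

```python
from typing import NamedTuple, Optional


class Linkage(NamedTuple):
    """The four parts of a $6 value (hypothetical container for this sketch)."""
    linking_tag: str
    occurrence: str
    script_code: Optional[str]
    orientation: Optional[str]


def parse_subfield_6(value: str) -> Linkage:
    """Split a $6 value such as '245-01/(3/r' into its parts."""
    # Everything before the first slash is "[linking tag]-[occurrence number]".
    link, _, rest = value.partition("/")
    linking_tag, _, occurrence = link.partition("-")
    # What remains is "[script identification code]/[field orientation code]";
    # either part may be absent.
    script_code, _, orientation = rest.partition("/")
    return Linkage(linking_tag, occurrence, script_code or None, orientation or None)


print(parse_subfield_6("245-01/(3/r"))
print(parse_subfield_6("880-01"))
```

A value with no script identification code but with an orientation code (for example `245-01//r`) parses with `script_code` set to `None` and `orientation` set to `r`.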
Structure of Multiscript Records
• Model A: Vernacular & Transliteration
[Figure: paired $6 subfields with callouts for Linking Tag, Occurrence Number, Script Identification Code, and Field Orientation Code]
Structure of Multiscript Records
• Model A: Vernacular & Transliteration
[Figure: CJK record according to Model A specifications, with callouts for Linking Tag and Occurrence Number]
Structure of Multiscript Records
• Model B: Simple Multiscript Records
– All data is contained in regular fields and script
varies depending on the requirements of the data
– Repeatability specifications of all fields should be
followed
– Although the Model B record may contain
transliterated data, Model A is preferred if the
same data is recorded in both the original
vernacular script and transliteration
– Field 880 is not used
* MARC21 Bibliographic Appx. D
CJK Record according to Model B Specifications
Item in Chinese. Cataloging language in English
Structure of Multiscript Records
• Field 066 (Character Sets Present)
– To indicate the MARC-8 character sets other than
the default sets that are invoked in the record
• MARC-8 vs. Unicode Environment
Element                       MARC-8      Unicode
MARC Field 066                Required    N/A
Script Identification Code    Required    N/A
Field Orientation Code        Required    Required
z39.50 for Cataloging
• SkyRiver
– MSU switched to SkyRiver in Oct 2009
– Ways to expand the pool of re-usable
bibliographic records
• z39.50 function in Innovative Millennium (day-to-day
cataloging)
• MarcEdit z39.50 client (HathiTrust record load)
z39.50 search in Millennium
z39.50 search in Millennium (Record retrieved for Editing)
HathiTrust Data Availability
MarcEdit z39.50 Client (HathiTrust)
Batch search against Univ. of Michigan Catalog
using UM record identifier
HathiTrust Record Load Workflow
[Diagram labels: Request, U of M Catalog, Retrieve, Record Dump, MSU Catalog]
Non-standard Multiscript Records from z39.50
Sample Non-standard CJK Record Retrieved by
MSU Millennium z39.50 Client
Same Record in Source Library Catalog (Staff View)
Non-standard Multiscript Records from z39.50
HathiTrust Record Retrieved by MarcEdit z39.50 Client*
* As of Dec. 10, 2010, Univ. of Michigan has rebuilt 880 fields on their z39.50 serving records
Same HathiTrust Record in Univ. of Michigan Catalog (Staff View)
Coding Issues
Non-standard Coding
• Field-pairing
– Vernacular data in regular field
• No linking tag in subfield $6
• No script identification code in subfield $6 (may be due to Unicode environment)
Standard Model A Coding
• Field-pairing
– Transliteration in regular field
– Vernacular data in 880 field
• Linking tag
– Tag number of an associated field
• Script identification code*
– $1 => CJK script
* Applicable to MARC-8 encoded records
Coding Issues
Non-standard Coding
• No field orientation code in subfield $6
Standard Model A Coding
• Field orientation code
– /r
Coding Issues
Non-standard Coding Practice
• Repeat non-repeatable fields
(245, 250)
• Duplication of data in both
vernacular and transliteration
Model B Guidelines
• Repeatability specifications of all
fields should be followed
• Model A is preferred if the same
data is recorded in both the
original vernacular script and
transliteration
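A check for the first practice above (repeating non-repeatable fields) might look like the following Python sketch. It is an illustration only: `repeated_fields` is a hypothetical helper, the set of tags is limited to the two named on the slide, and the sample record is made-up data in the MARCXML namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MARC_NS = "{http://www.loc.gov/MARC21/slim}"
# Only the two tags named on the slide; extend to match local policy.
NON_REPEATABLE = {"245", "250"}


def repeated_fields(record):
    """Return the non-repeatable tags that occur more than once in a record."""
    counts = Counter(df.get("tag") for df in record.iter(MARC_NS + "datafield"))
    return sorted(t for t, n in counts.items() if t in NON_REPEATABLE and n > 1)


# Made-up sample: the title is given twice in field 245, once per script.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">Hong lou meng</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">紅樓夢</subfield></datafield>
  <datafield tag="250" ind1=" " ind2=" "><subfield code="a">Di 1 ban</subfield></datafield>
</record>"""

print(repeated_fields(ET.fromstring(SAMPLE)))
```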
Problems Caused by
Non-standard Multiscript Records
• Irregular/Incorrect field orientation in Arabic and Hebrew records in
OPAC display
– Left-to-right display of subfields in “Title” due to the lack of “Field
Orientation code” while scripts within subfields are from right to left
“Field Orientation code” added back
Problems Caused by
Non-standard Multiscript Records
• Irregularity in result display
– Inconsistent sequencing of vernacular and
transliteration fields
Problems Caused by
Non-standard Multiscript Records
• Database maintenance
– Data structure inconsistency
• Same kind of data resides in two different places
• Extra steps needed to accommodate inconsistencies
– Heading validation issues
• NACO records with headings in vernacular in 4xx since
mid 2008
• Vernacular headings (4xx) in regular fields
Problems Caused by
Non-standard Multiscript Records
• Expectation in retrieval of vernacular data
– MSU only indexes CJK and Cyrillic data in 880
fields
– Arabic, Hebrew, Greek, and other vernacular data
in regular fields of non-standard multiscript
records are indexed and searchable
• Creates a false impression that patrons can search in
scripts other than CJK and Cyrillic
Solutions
• MSU uses Model A for multiscript records
• Tasks
– To change field tag of vernacular data to 880
– Subfield $6 in both regular & 880 fields
• To insert linking tag
– Subfield $6 in 880 fields
• To insert script identification code*
• To insert field orientation code for Arabic & Hebrew
records
– To insert field 066 if it does not already exist*
*No longer applicable since MSU has moved to Unicode environment
Solutions
• Necessary steps
– Determine which fields contain vernacular data
• Replace regular field tag with 880
– Determine which script(s) are contained in a record
• Insert field 066*
• Insert “Script Identification code*” and “Field
Orientation code” when appropriate
*No longer applicable since MSU has moved to Unicode environment
Solutions
• XSLT (Extensible Stylesheet Language Transformations)
– Within the family of XML
• Current version: 2.0
• Case sensitive
– “Transformation” means:
• Manipulation of XML documents by creating a new document
based on the original document
– Common usages in library context
• Web display
– e.g. converting EAD into HTML for display
• Metadata crosswalking
– Data selection and manipulation
– Conditional processing
• Specify matching criteria and corresponding action(s)
Database Maintenance Workflow
[Diagram: MSU Catalog → Uncorrected MARC File → Format Conversion → Uncorrected MARCXML → XSLT Processor → Corrected MARCXML → Format Conversion → Corrected MARC File → MSU Catalog]
Alternative HathiTrust Pre-load Data Cleanup Workflow
[Diagram labels: Request, U of M Catalog, Retrieve, XSLT Processor, Corrected records, MSU Catalog]
Design of XSLT
• Processing logic
– Regular field to 880 and insert linking tag
• Remove all roman data from a field
• Determine length of a field
– 0 => no vernacular data
– ≠0 => contains vernacular data
– Field 066, Script identification & Field orientation
codes
• Match vernacular data field against vernacular
characters
Design of XSLT
• Remove all roman data
– Roman data (ASCII, special characters & diacritics used
in transliteration)
– replace() and translate() functions
• Find “pattern A” and replace it with “pattern B”
– Replace roman data with nothing
<xsl:value-of
  select="replace(replace(replace(translate(translate(translate(translate(
    normalize-space(.),$ascii,$spaces),$specialCharacters,' '),$diacritics,' '),
    $extendedLatin,' '),$apos,' '),'[A-Za-z]',' '),' ','')"/>
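The same idea (strip everything roman, then see whether anything is left) can be sketched in Python. This is not the stylesheet itself: the character ranges standing in for `$ascii`, `$specialCharacters`, `$diacritics`, and `$extendedLatin` are assumptions, since the slides do not show how those variables are defined, and `strip_roman` and `has_vernacular` are hypothetical names.

```python
import re

# Assumed definition of "roman data": Basic Latin through the combining
# diacritics block, Latin Extended Additional, General Punctuation, and the
# combining half marks sometimes used in romanization.
_ROMAN = re.compile(r"[\u0000-\u036F\u1E00-\u1EFF\u2000-\u206F\uFE20-\uFE2F]")


def strip_roman(text: str) -> str:
    """Remove roman letters, digits, punctuation, diacritics, and spaces."""
    return _ROMAN.sub("", text)


def has_vernacular(text: str) -> bool:
    """True when something other than roman data remains in the field."""
    return len(strip_roman(text)) != 0


print(has_vernacular("Hong lou meng /"))   # romanized title only
print(has_vernacular("紅樓夢"))             # vernacular title
print(strip_roman("Voĭna i mir = Война и мир"))
```

A range-based pattern is used here instead of nested `translate()` calls only because it is shorter in Python; the outcome being tested (zero length versus non-zero length) is the one described on the slide.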
Design of XSLT
• Test the length of the field after removing all non-vernacular data
– XSLT elements: <xsl:choose> in combination with
<xsl:when> & <xsl:otherwise>
– XSLT functions: string-length()
<xsl:choose>
  <xsl:when test="string-length($subfieldString)=0">
    …… [series of actions when string-length equals 0]
  </xsl:when>
  <xsl:otherwise>
    …… [series of actions when string-length does not equal 0]
  </xsl:otherwise>
</xsl:choose>
Design of XSLT
• Field with no vernacular data
<!-- Test length of the field -->
<xsl:when test="string-length($subfieldString)=0">
  <xsl:element name="marc:datafield">
    <!-- Insert original values -->
    <xsl:attribute name="tag">
      <xsl:value-of select="$tag"/>
    </xsl:attribute>
    <xsl:attribute name="ind1">
      <xsl:value-of select="$ind1"/>
    </xsl:attribute>
    <xsl:attribute name="ind2">
      <xsl:value-of select="$ind2"/>
    </xsl:attribute>
    <xsl:element name="marc:subfield">
      <xsl:attribute name="code">
        <xsl:text>6</xsl:text>
      </xsl:attribute>
      <!-- Insert linking tag (880) and original occurrence number -->
      <xsl:text>880-</xsl:text>
      <xsl:value-of select="$subfield6"/>
    </xsl:element>
    <!-- Copy subfields other than $6 -->
    <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
  </xsl:element>
</xsl:when>
Design of XSLT
• Field with vernacular data
<xsl:otherwise>
  <xsl:element name="marc:datafield">
    <!-- Insert “880” as tag no. -->
    <xsl:attribute name="tag">
      <xsl:text>880</xsl:text>
    </xsl:attribute>
    <!-- Insert original values -->
    <xsl:attribute name="ind1">
      <xsl:value-of select="$ind1"/>
    </xsl:attribute>
    <xsl:attribute name="ind2">
      <xsl:value-of select="$ind2"/>
    </xsl:attribute>
    <xsl:element name="marc:subfield">
      <xsl:attribute name="code">
        <xsl:text>6</xsl:text>
      </xsl:attribute>
      <!-- Insert original tag no. as linking tag -->
      <xsl:value-of select="$tag"/>
      <xsl:text>-</xsl:text>
      <!-- Insert original occurrence number -->
      <xsl:value-of select="$subfield6"/>
      …… [Insert “Script Identification Code” & “Field Orientation Code”]
    </xsl:element>
    <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
  </xsl:element>
</xsl:otherwise>
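The two branches (field without vernacular data, field with vernacular data) can be seen side by side in a Python sketch built on `xml.etree.ElementTree`. It is an approximation of the logic, not the stylesheet: `relink` and `make_field` are hypothetical helpers, the occurrence number is passed in rather than read from an existing $6, the roman-data ranges are assumptions, and the sample titles are made-up data.

```python
import copy
import re
import xml.etree.ElementTree as ET

NS = "http://www.loc.gov/MARC21/slim"
# Assumed "roman data" ranges: Basic Latin through combining diacritics,
# Latin Extended Additional, and General Punctuation.
_ROMAN = re.compile(r"[\u0000-\u036F\u1E00-\u1EFF\u2000-\u206F]")


def relink(field, occurrence):
    """Return a recoded copy of a MARCXML datafield (hypothetical helper)."""
    tag = field.get("tag")
    data = [sf for sf in field if sf.get("code") != "6"]
    text = "".join(sf.text or "" for sf in data)
    vernacular = len(_ROMAN.sub("", text)) > 0

    new = ET.Element(
        f"{{{NS}}}datafield",
        {
            # Vernacular data moves to 880; roman data keeps its tag.
            "tag": "880" if vernacular else tag,
            "ind1": field.get("ind1", " "),
            "ind2": field.get("ind2", " "),
        },
    )
    link = ET.SubElement(new, f"{{{NS}}}subfield", {"code": "6"})
    # The 880 field links back to the original tag; the regular field links to 880.
    link.text = f"{tag}-{occurrence}" if vernacular else f"880-{occurrence}"
    # Copy every subfield other than $6.
    new.extend([copy.deepcopy(sf) for sf in data])
    return new


def make_field(tag, value):
    """Build a one-subfield sample datafield (made-up test data)."""
    field = ET.Element(f"{{{NS}}}datafield", {"tag": tag, "ind1": "1", "ind2": "0"})
    ET.SubElement(field, f"{{{NS}}}subfield", {"code": "a"}).text = value
    return field


for value in ("Hong lou meng", "紅樓夢"):
    out = relink(make_field("245", value), "01")
    print(out.get("tag"), out[0].text, out[1].text)
```

The script identification and field orientation codes are left out here, matching the elided step in the stylesheet excerpt.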
Design of XSLT
• Insert “Script Identification Code” (MARC-8 environment)
<xsl:choose>
  <!-- Insert code for Arabic -->
  <xsl:when test="matches($basicArabic,substring($subfieldString,1,1)) or
                  matches($extendedArabic,substring($subfieldString,1,1))">
    <xsl:text>/(3</xsl:text>
  </xsl:when>
  <!-- Insert code for Greek -->
  <xsl:when test="matches($greek,substring($subfieldString,1,1))">
    <xsl:text>/(S</xsl:text>
  </xsl:when>
  <!-- Insert code for Hebrew -->
  <xsl:when test="matches($basicHebrew,substring($subfieldString,1,1))">
    <xsl:text>/(2</xsl:text>
  </xsl:when>
  <!-- Insert code for Cyrillic -->
  <xsl:when test="matches($basicCyrillic,substring($subfieldString,1,1)) or
                  matches($extendedCyrillic,substring($subfieldString,1,1))">
    <xsl:text>/(N</xsl:text>
  </xsl:when>
  <!-- Bengali, Tamil, Thai, Devanagari: empty branch, no code inserted -->
  <xsl:when test="matches($bengali,substring($subfieldString,1,1)) or
                  matches($tamil,substring($subfieldString,1,1)) or
                  matches($thai,substring($subfieldString,1,1)) or
                  matches($devanagar,substring($subfieldString,1,1))"/>
  <xsl:otherwise>
    <!-- Insert code for CJK -->
    <xsl:text>/$1</xsl:text>
  </xsl:otherwise>
</xsl:choose>
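A Python sketch of the same decision follows. It looks only at the first vernacular character, as the stylesheet does, but it uses Unicode block ranges in place of the `$basicArabic`, `$greek`, and similar variables, whose contents are not shown on the slides. The ranges and the name `script_code_suffix` are assumptions for illustration.

```python
# (low, high, code) triples; None means "recognized, but no code is inserted",
# mirroring the empty branch for Bengali, Tamil, Thai, and Devanagari.
SCRIPT_RANGES = [
    (0x0370, 0x03FF, "(S"),  # Greek and Coptic
    (0x0400, 0x052F, "(N"),  # Cyrillic and Cyrillic Supplement
    (0x0590, 0x05FF, "(2"),  # Hebrew
    (0x0600, 0x06FF, "(3"),  # Arabic
    (0x0900, 0x097F, None),  # Devanagari
    (0x0980, 0x09FF, None),  # Bengali
    (0x0B80, 0x0BFF, None),  # Tamil
    (0x0E00, 0x0E7F, None),  # Thai
]


def script_code_suffix(vernacular: str) -> str:
    """Return the '/[script identification code]' part of $6, or ''."""
    if not vernacular:
        return ""
    code_point = ord(vernacular[0])
    for low, high, code in SCRIPT_RANGES:
        if low <= code_point <= high:
            return f"/{code}" if code else ""
    # Fallback, as in the stylesheet: anything unmatched is treated as CJK.
    return "/$1"


for sample in ("Война и мир", "紅樓夢", "ไทย"):
    print(repr(script_code_suffix(sample)))
```

As in the stylesheet, the CJK code is a fallback rather than a positive match, so any script missing from the table would also receive `/$1`.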
Design of XSLT
• Insert “Field Orientation Code”
<xsl:choose>
  <!-- Test if the subfield contains Arabic or Hebrew script -->
  <xsl:when test="contains($subfieldString,'[Arabic script]') or
                  contains($subfieldString,'[Hebrew script]')">
    <!-- Insert Field Orientation Code -->
    <xsl:text>//r</xsl:text>
  </xsl:when>
</xsl:choose>
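In Python the test might be sketched as below. The Hebrew and Arabic Unicode block ranges stand in for the `[Arabic script]` and `[Hebrew script]` placeholders on the slide. The `has_script_code` parameter is a hypothetical addition: it assumes that `//r` is what results when the script identification code slot in $6 is left empty, and that `/r` applies when a script code has already been written.

```python
import re

# Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF) Unicode blocks.
_RIGHT_TO_LEFT = re.compile(r"[\u0590-\u05FF\u0600-\u06FF]")


def orientation_suffix(vernacular: str, has_script_code: bool = False) -> str:
    """Return the field orientation part of $6 for right-to-left data."""
    if not _RIGHT_TO_LEFT.search(vernacular):
        return ""
    # Assumption: with no script identification code, its slot stays empty,
    # which leaves two consecutive slashes before the "r".
    return "/r" if has_script_code else "//r"


print(repr(orientation_suffix("שלום")))
print(repr(orientation_suffix("كتاب", has_script_code=True)))
print(repr(orientation_suffix("紅樓夢")))
```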
Design of XSLT
• Field 066 (MARC-8 environment)
– Insert character set code in subfield $c
– A single record may have more than one vernacular
script => multiple subfield $c
• XSLT element: <xsl:if>
– Allows multiple matches
• XSLT function: matches()
– Processing logic
• Turn the whole record into a text string
• Remove all Latin data
• Match vernacular script against normalized text string
Design of XSLT
• After removing all Latin data from the record
<!-- Replace Arabic characters with “3” (and likewise for the other scripts) -->
<xsl:value-of
  select="translate(translate(translate(translate(translate(translate(translate(translate(translate(translate(
    .,$basicArabic,'3'),$extendedArabic,'4'),$basicCyrillic,'N'),$extendedCyrillic,'Q'),
    $Greek,'S'),$basicHebrew,'2'),$bengali,'b'),$tamil,'ta'),$thai,'th'),$devanagar,'d')"/>
…
<!-- Test if the normalized data contains “3” -->
<xsl:if test="matches($normalizedWholeRecord,'3')">
  <xsl:element name="marc:subfield">
    <xsl:attribute name="code">c</xsl:attribute>
    <!-- Insert “(3” as the character set code in $c -->
    <xsl:text>(3</xsl:text>
  </xsl:element>
</xsl:if>
……
<!-- Test if any non-alphanumeric characters exist -->
<xsl:if test="matches($normalizedWholeRecord,'[^A-Za-z0-9]')">
  <xsl:element name="marc:subfield">
    <xsl:attribute name="code">c</xsl:attribute>
    <!-- Insert code for CJK -->
    <xsl:text>$1</xsl:text>
  </xsl:element>
</xsl:if>
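The effect of the independent `<xsl:if>` tests (one subfield $c per script found) can be sketched in Python. The sketch departs from the stylesheet in two labeled ways: it matches Unicode block ranges directly instead of first translating characters to marker letters, and it does not separate basic from extended Arabic or Cyrillic, because that split depends on MARC-8 repertoire tables not shown here. `charset_codes` is a hypothetical name, and the ranges are assumptions.

```python
import re

# (066 $c value, pattern) pairs; each pattern is tested independently,
# so one record can yield several codes.
CHARSET_PATTERNS = [
    ("(3", re.compile(r"[\u0600-\u06FF]")),  # Arabic
    ("(N", re.compile(r"[\u0400-\u052F]")),  # Cyrillic
    ("(S", re.compile(r"[\u0370-\u03FF]")),  # Greek
    ("(2", re.compile(r"[\u0590-\u05FF]")),  # Hebrew
    # CJK: radicals through unified ideographs, Hangul syllables,
    # and compatibility ideographs.
    ("$1", re.compile(r"[\u2E80-\u9FFF\uAC00-\uD7AF\uF900-\uFAFF]")),
]


def charset_codes(record_text: str) -> list:
    """Return the 066 $c values for every script present in the record text."""
    return [code for code, pattern in CHARSET_PATTERNS if pattern.search(record_text)]


print(charset_codes("Voĭna i mir = Война и мир"))
print(charset_codes("紅樓夢 / كتاب"))
```

The stylesheet treats CJK as whatever non-alphanumeric data remains after the other scripts are marked; the explicit CJK ranges used here are a stricter substitute chosen for the sketch.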
Design of XSLT
• Factors affecting the design
– Pre-load vs. post-load data clean up (HathiTrust
workflow)
• Mechanism to filter out non-multiscript records needed for
pre-load data clean up
• Construction of 949 overlay command*
– MARC-8 vs. Unicode
• Field 066 and Script identification code not allowed in
Unicode environment
– 2 separate XSLTs made
– OCLC vs. MARC21 Standard
• Representation of Bengali, Devanagari, Tamil, and Thai in
field 066
* Innovative Millennium specific
Limitations & Unintended
Consequences
• Processing of data represented by UTF-8
character number
– \U+0e33\\U+0e43\\U+0e2b\\U+0e49\
• Vernacular scripts processed (MARC-8 environment)
• Handling of unlinked vernacular data
– Implications on OPAC display
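For the first limitation, one possible pre-processing step is to turn the escaped code points back into characters before the record reaches the stylesheet. The Python sketch below is an assumption, not part of the workflow described in the talk: it treats each unit as a backslash, `U+`, four to six hexadecimal digits, and a closing backslash, based only on the example string on the slide, and `decode_code_points` is a hypothetical name.

```python
import re

# One escaped unit: backslash, "U+", 4-6 hex digits, backslash.
_ESCAPED = re.compile(r"\\U\+([0-9A-Fa-f]{4,6})\\")


def decode_code_points(text: str) -> str:
    """Replace each escaped unit with the character it names."""
    return _ESCAPED.sub(lambda match: chr(int(match.group(1), 16)), text)


# The example string from the slide, written with doubled backslashes
# because this is a Python string literal.
sample = "\\U+0e33\\\\U+0e43\\\\U+0e2b\\\\U+0e49\\"
print(decode_code_points(sample))
```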
Questions?
Lucas Mak
[email protected]
Michigan State University Libraries