CJK Character Validation –
Impact from EACC to Unicode Migration
2006 CEAL Conference
Committee on Technical Processing
Ai-lin Yang
East Asian Library, UC Berkeley
April 5, 2006
EACC/MARC21 and Unicode
East Asian Character Code (EACC) is MARC-8 CJK in MARC21
Migration to Unicode
  Library of Congress database
  RLG’s Union catalog database
  OCLC’s WorldCat database
CJK bibliographic records are restricted to “EACC characters”
Microsoft IME Variants
Non-MARC21 characters
  Duplicate CJK characters (e.g. 路, U+F937, and 路, U+8DEF)
  Close variants (e.g. 步, U+6B65, and 歩, U+6B69)
  Typically one of these variants is a MARC21 character
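The difference between the two kinds of variant can be checked with Unicode normalization. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is hypothetical. It relies on the fact that a duplicate (compatibility) ideograph such as U+F937 has a canonical mapping to its unified form, while a close variant such as U+6B69 is a separate unified ideograph that normalization leaves alone. Normalization says nothing about MARC21 validity; it only shows which variants Unicode itself treats as duplicates.

```python
import unicodedata


def describe(ch):
    """Report a character's code point and what NFC normalization does to it."""
    normalized = unicodedata.normalize("NFC", ch)
    return {
        "code_point": f"U+{ord(ch):04X}",
        "nfc": f"U+{ord(normalized):04X}",
        "changed_by_nfc": normalized != ch,
    }


# Duplicate character from the slide: compatibility ideograph U+F937.
print(describe("\uf937"))
# Close variant from the slide: U+6B69 is a distinct unified ideograph.
print(describe("\u6b69"))
```

Escapes are used instead of literal characters because many editors and pipelines silently normalize U+F937 to U+8DEF, which would hide the very difference being examined.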
CJK character validation errors in OCLC
  OCLC XWC (Extended WorldCat) in Oracle database is built on Unicode
  OCLC online cataloging follows MARC21 standards
  CJK scripts are input by using Microsoft Global Input Method Editors (IMEs)
  Non-MARC21 characters cause CJK character validation errors
OCLC Connexion / IME
Online Cataloging Examples
Title: 汉宫秋月 (simplified 宫)
  245 (non-Latin) occurrence 1, $a occurrence 1, position 2 - invalid character - data must be valid non-Latin characters
  Valid when changed to: 汉宮秋月 (traditional 宮)
Title: 瑶族长鼓舞曲 (simplified 瑶)
  245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
  Valid when changed to: 瑤族长鼓舞曲 (traditional 瑤)
Title: 説故事的人 (traditional 説)
  245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
  Valid when changed to: 說故事的人 (traditional 說)
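The validation messages in these examples identify the offending character by field, subfield and position. As a rough sketch, the message can be parsed to pull out that character and its code point for lookup. The regular expression and the function name `offending_character` are assumptions based only on the message wording shown in these slides, and the position is assumed to be 1-based, which is consistent with the examples (position 2 for 宫 in 汉宫秋月).

```python
import re

# Pattern inferred from the messages quoted in the slides (an assumption,
# not a documented OCLC message format).
PATTERN = re.compile(
    r"(?P<tag>\d{3}) \(non-Latin\) occurrence (?P<field_occ>\d+), "
    r"\$(?P<subfield>\w) occurrence (?P<sub_occ>\d+), "
    r"position (?P<position>\d+)"
)


def offending_character(message, subfield_text):
    """Return (tag, subfield, position, character, code point) for one message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not match the assumed format")
    position = int(match.group("position"))
    ch = subfield_text[position - 1]  # assumed 1-based position
    return (match.group("tag"), match.group("subfield"),
            position, ch, f"U+{ord(ch):04X}")


message = ("245 (non-Latin) occurrence 1, $a occurrence 1, position 2 - "
           "invalid character - data must be valid non-Latin characters")
print(offending_character(message, "汉宫秋月"))
```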
OCLC Connexion / IME
Online Cataloging Examples
Title: 户外环境敎育 (simplified 户)
  245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
  Changed to: 戸外环境敎育 (traditional 戸)
  245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
  Valid when changed to: 戶外环境敎育 (traditional 戶)
Title: 吴大澂手批本地子箴言
  澂 can be found only in the traditional list; this character does not exist in the simplified list
Solutions
  Unihan Database
  CJK Compatibility Database
  OCLC CJK E-dictionary
Unihan Database
http://www.unicode.org/charts/unihan.html
Unihan database index
  Unihan grid index
  Unihan radical-stroke index
Unihan database information
  (I) Several different glyphs for the character
  (N) Different representations of the character's scalar value
  (N) Mappings to the IRG sources for the character
  (I) Mappings to major industrial and national standards and other character collections
  (N) Positions in the four dictionaries used by the IRG
  (I) Positions in other commonly-used dictionaries
  (I) Radical-stroke counts as derived from different sources
  (I) Phonetic data derived from various sources
  (I) Other dictionary data
  (I) Variants (with links to the variant forms)
  Compounds containing the character
  (I) Other information contained in the Unihan database
Unihan Database Search (U+6237)
Unihan Database Search (U+6236)
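The Unihan searches above are keyed by code point, so the first step is finding the code point of a character copied from a record. A minimal sketch follows; the function name `code_points` is hypothetical, and the characters are written as escapes so that the values being looked up are unambiguous.

```python
import unicodedata


def code_points(text):
    """Return the U+XXXX notation for every character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The two characters searched in the slides: U+6237 and U+6236.
for ch in "\u6237\u6236":
    print(code_points(ch)[0], unicodedata.name(ch, "(unnamed)"))
```

The same helper can be run over a whole title to list every code point before searching for the one flagged by the validation message.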
CJK Compatibility Database
http://www.loc.gov/ils/cjk_search/cjk_cpso.html

Replace a non-MARC21 character with its MARC21 equivalent
Steps for using the CJK compatibility database
  1) Copy the invalid character from your bibliographic record
  2) Open the CJK Compatibility Page
  3) Paste the invalid character in the white box and use the index "Invalid character"
  4) Click "Submit"
  5) Copy & Paste the valid alternative into your bibliographic record
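Once valid alternatives have been looked up, the copy-and-paste replacement in the steps above can be sketched as a simple character mapping. The table below is illustrative only: it contains just the pairs shown in the cataloging examples in these slides, not the contents of the CJK Compatibility Database, and the names `REPLACEMENTS` and `replace_invalid` are hypothetical.

```python
# Invalid character -> valid alternative, taken only from the slide examples.
REPLACEMENTS = {
    "宫": "宮",
    "瑶": "瑤",
    "説": "說",
    "户": "戶",
    "戸": "戶",
}

# str.maketrans accepts a dict whose keys are single characters.
TABLE = str.maketrans(REPLACEMENTS)


def replace_invalid(text):
    """Replace every character listed in REPLACEMENTS; leave the rest unchanged."""
    return text.translate(TABLE)


for title in ("汉宫秋月", "瑶族长鼓舞曲", "説故事的人", "户外环境敎育"):
    print(title, "->", replace_invalid(title))
```

Characters that are not in the table pass through untouched, so a title can still fail validation afterwards; each remaining error needs its own lookup.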
CJK Compatibility Database Search
OCLC CJK E-Dictionary
OCLC CJK E-Dictionary Search
OCLC CJK E-Dictionary Search
CJK Character Validation
Thank you!