CJK Character Validation – Impact from EACC to Unicode Migration 2006 CEAL Conference Committee on Technical Processing Ai-lin Yang East Asian Library, UC Berkeley April 5,
CJK Character Validation –
Impact from EACC to Unicode Migration
2006 CEAL Conference
Committee on Technical Processing
Ai-lin Yang
East Asian Library, UC Berkeley
April 5, 2006
EACC/MARC21 and Unicode
East Asian Character Code (EACC) is the MARC-8 CJK character set in MARC21
Migration to Unicode
Library of Congress database
RLG’s Union catalog database
OCLC’s WorldCat database
CJK bibliographic records are restricted to “EACC characters”
Microsoft IME Variants
Non-MARC21 characters
Duplicate CJK characters (e.g. 路, U+F937, and 路, U+8DEF)
Close variants (e.g. 步, U+6B65, and 歩, U+6B69)
Typically one of these variants is a MARC21 character
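The difference between the two kinds of variants can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch, not part of the OCLC or LC tooling; the helper name `to_unified` is made up for the example. Duplicate (compatibility) ideographs such as U+F937 fold to their unified counterpart under Unicode normalization, while close variants such as U+6B65 and U+6B69 remain distinct characters and need a lookup table instead.

```python
import unicodedata

def to_unified(text):
    """Fold CJK compatibility ideographs to their unified forms (NFC)."""
    return unicodedata.normalize("NFC", text)

# Duplicate characters: U+F937 is a compatibility copy of U+8DEF.
compat_lu = "\uF937"
unified_lu = "\u8DEF"
print(compat_lu == unified_lu)              # the raw code points differ
print(to_unified(compat_lu) == unified_lu)  # normalization folds the duplicate

# Close variants: U+6B65 and U+6B69 are separate unified ideographs,
# so normalization leaves both unchanged.
bu_a, bu_b = "\u6B65", "\u6B69"
print(to_unified(bu_a) == to_unified(bu_b))
```

Normalization therefore helps only with the duplicate characters; close variants still have to be resolved by hand or through a mapping.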
CJK character validation errors in OCLC
OCLC XWC (Extended WorldCat) in Oracle database is built on Unicode
OCLC online cataloging follows MARC21 standards
CJK scripts are input by using Microsoft Global Input Method Editors (IMEs)
Non-MARC21 characters cause CJK character validation errors
OCLC Connexion / IME
Online Cataloging Examples
Title: 汉宫秋月 (simplified 宫)
245 (non-Latin) occurrence 1, $a occurrence 1, position 2 - invalid character - data must be valid non-Latin characters
Valid when changed to: 汉宮秋月 (traditional 宮)
Title: 瑶族长鼓舞曲 (simplified 瑶)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
Valid when changed to: 瑤族长鼓舞曲 (traditional 瑤)
Title: 説故事的人 (traditional 説)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
Valid when changed to: 說故事的人 (traditional 說)
OCLC Connexion / IME
Online Cataloging Examples
Title: 户外环境敎育 (simplified 户)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
Still invalid when changed to: 戸外环境敎育 (traditional 戸)
245 (non-Latin) occurrence 1, $a occurrence 1, position 1 - invalid character - data must be valid non-Latin characters
Valid when changed to: 戶外环境敎育 (traditional 戶)
Title: 吴大澂手批本地子箴言
澂 can be found only in the traditional list; it does not exist in the simplified list
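The validation messages in these examples report the 1-based position of the offending character in the subfield. A minimal sketch of that kind of check is shown below. It is not OCLC's implementation: the `ALLOWED` set is a hypothetical stand-in for the full MARC21 (EACC) repertoire, and the function names are invented for illustration.

```python
# Hypothetical stand-in for the MARC21/EACC repertoire: only the four
# characters of the valid title from the first example.
ALLOWED = set("\u6C49\u5BAE\u79CB\u6708")

def invalid_positions(text, allowed=ALLOWED):
    """Return the 1-based positions of characters outside the allowed set."""
    return [i for i, ch in enumerate(text, start=1) if ch not in allowed]

def format_error(tag, position):
    """Build a message in the style of the examples above."""
    return (f"{tag} (non-Latin) occurrence 1, $a occurrence 1, "
            f"position {position} - invalid character - "
            "data must be valid non-Latin characters")

title = "\u6C49\u5BAB\u79CB\u6708"  # second character is simplified U+5BAB
for pos in invalid_positions(title):
    print(format_error("245", pos))
```

With the simplified U+5BAB in second place, the sketch reports position 2, matching the first example.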
Solutions
Unihan Database
CJK Compatibility Database
OCLC CJK E-dictionary
Unihan Database
http://www.unicode.org/charts/unihan.html
Unihan database index
Unihan grid index
Unihan radical-stroke index
Unihan database information
(I) Several different glyphs for the character
(N) Different representations of the character's scalar value
(N) Mappings to the IRG sources for the character
(I) Mappings to major industrial and national standards and other character collections
(N) Positions in the four dictionaries used by the IRG
(I) Positions in other commonly-used dictionaries
(I) Radical-stroke counts as derived from different sources
(I) Phonetic data derived from various sources
(I) Other dictionary data
(I) Variants (with links to the variant forms)
Compounds containing the character
(I) Other information contained in the Unihan database
Unihan Database Search (U+6237)
Unihan Database Search (U+6236)
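Unihan lookups are keyed by code point, so it helps to convert between a character and its U+ notation before searching. The following sketch uses only the Python standard library; the helper names `u_plus` and `from_u_plus` are invented for illustration.

```python
import unicodedata

def u_plus(ch):
    """Return the U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

def from_u_plus(notation):
    """Return the character for a string such as 'U+6236'."""
    return chr(int(notation.removeprefix("U+"), 16))

# The two code points searched above, plus the third variant from the
# cataloging example (U+6238).
for notation in ("U+6237", "U+6236", "U+6238"):
    ch = from_u_plus(notation)
    print(notation, ch, unicodedata.name(ch))
```

`str.removeprefix` requires Python 3.9 or later; on older versions, slicing off the first two characters does the same job.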
CJK Compatibility Database
http://www.loc.gov/ils/cjk_search/cjk_cpso.html
Replace a non-MARC21 character with its MARC21 equivalent
Steps for using the CJK compatibility database
1) Copy the invalid character from your bibliographic record
2) Open the CJK Compatibility Page
3) Paste the invalid character in the white box and use the index "Invalid character"
4) Click "Submit"
5) Copy & paste the valid alternative into your bibliographic record
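The manual lookup above amounts to a character-for-character substitution. As a rough sketch, the same replacement can be done in bulk with a small table. The `REPLACEMENTS` table here is hypothetical and contains only the pairs shown in the cataloging examples in this talk; it is not an export of the LC CJK Compatibility Database, and `replace_invalid` is an invented helper name.

```python
# Invalid character -> valid MARC21 alternative, taken from the
# examples in this presentation only.
REPLACEMENTS = {
    "\u5BAB": "\u5BAE",  # simplified form in the first title -> traditional
    "\u7476": "\u7464",  # simplified form in the second title -> traditional
    "\u8AAC": "\u8AAA",  # variant in the third title -> valid form
    "\u6237": "\u6236",  # simplified U+6237 -> valid U+6236
    "\u6238": "\u6236",  # variant U+6238 -> valid U+6236
}

_TABLE = str.maketrans(REPLACEMENTS)

def replace_invalid(text):
    """Swap each known invalid character for its valid alternative."""
    return text.translate(_TABLE)

print(replace_invalid("\u6C49\u5BAB\u79CB\u6708"))
```

Characters not in the table pass through unchanged, so anything the table does not cover still needs the manual lookup.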
CJK Compatibility Database Search
OCLC CJK E-Dictionary
OCLC CJK E-Dictionary Search
OCLC CJK E-Dictionary Search
CJK Character Validation
Thank you!