ISO/IEC 10646 and Unicode
• It is a coded character set (codeset)
– Designed for text processing and exchange
• Features:
– Universal: covers the characters of almost all national standards
– Framework: the coding architecture is fixed, and code points can be filled in later
– Uniform and efficient: fixed-width encoding, so there is no
need to detect the code length of each character (as with ASCII mixed with Big5 or GB)
– Unambiguous: any given 16-bit (32-bit) value always
represents the same character
Lecture 3
UCS-4
(Canonical form of ISO 10646)
• Fixed 32-bit (actually 31-bit) code assignment
• 00 00 00 00 to 7F FF FF FF
Group No.      Plane No.      High Byte      Low Byte
(total: 128)   (total: 256)   (total: 256)   (total: 256)
• Each plane: 2^16 = 65,536 code points
• BMP (the Basic Multilingual Plane)
– Both Group No. and Plane No. are 00 (the first two bytes
are zero)
• Before ISO 10646 Part 2 came out (end of 2001),
only the BMP contained characters
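The four-byte layout above can be illustrated with a minimal Python sketch. The function names `split_ucs4` and `in_bmp` are hypothetical helpers chosen for this illustration, not part of any standard library.

```python
def split_ucs4(code):
    """Split a UCS-4 value into (group, plane, high byte, low byte)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values occupy 31 bits: 0x00000000..0x7FFFFFFF")
    group = (code >> 24) & 0x7F   # 128 groups
    plane = (code >> 16) & 0xFF   # 256 planes per group
    high = (code >> 8) & 0xFF     # 256 rows per plane
    low = code & 0xFF             # 256 cells per row
    return group, plane, high, low


def in_bmp(code):
    """A code is in the BMP when both Group No. and Plane No. are 00."""
    group, plane, _, _ = split_ucs4(code)
    return group == 0 and plane == 0


print(split_ucs4(0x00004E2D))  # (0, 0, 78, 45)
print(in_bmp(0x00004E2D))      # True
print(in_bmp(0x00020000))      # False: Plane 2 of Group 0
```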
Code Architecture of UCS-4
[Figure: 128 groups (Group 0 to Group 127), 256 planes per group; Plane 00 of Group 0 is the BMP]
• UCS-2: 2-byte representation of UCS-4
– Basic Multilingual Plane (BMP)
– Switching mechanism that uses a code range of the BMP to access
another 16 planes (surrogate pairs)
• BMP
A-Zone   Alphabets, Symbols, CJK Misc
I-Zone   CJK ideographs
O-Zone   Hangul
S-Zone   Surrogates
R-Zone   Private Use, Compatibility, Arabic Presentation Forms
• Compatibility Zone:
Unicode
• Unicode is the implementation of ISO 10646 with a
16-bit representation using UCS-2
• Defines actions associated with certain
characters
– Control character behavior
– Rendering behavior: combining characters
• Examples
– The control character bell <BEL> should cause a
sound in the system
– Typing U+0061 (a) followed by U+0300 (combining grave accent ̀ )
renders as one symbol, à
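The combining-character example can be checked with Python's standard `unicodedata` module. Rendering itself is done by the font engine, so this sketch uses NFC normalization instead (normalization is not mentioned in the slide and is used here only as an illustration) to show that the two-code sequence is equivalent to the single precomposed character U+00E0.

```python
import unicodedata

# "a" followed by the combining grave accent: two code points.
decomposed = "\u0061\u0300"

# NFC normalization composes the pair into one precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e0")            # True
print(unicodedata.name("\u0300"))      # COMBINING GRAVE ACCENT
```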
Extension of ISO 10646
– Extension A (BMP) has 6,582 characters,
published in ISO/IEC 10646-1 Second
Edition (2000).
– Extension B:
• All characters in the Kangxi Dictionary (康熙字典) and Hanyu Da Zidian (漢語大字典), plus
other characters such as those in the HK Supplementary
Character Set
• ISO/IEC 10646-2 (2001), total of 43,253 characters
• In Plane 2 of UCS-4
– How would Extension B be supported in UCS-2? => By using an encoding scheme
Surrogate Pairs
• Two UCS-2 codes, H followed by L, written <H,L>, where
– H is in the range 0xD800 - 0xDBFF
– L is in the range 0xDC00 - 0xDFFF
• For a given UCS-2 code (or code pair) U, the corresponding
UCS-4 code point value N (scalar value) is
– N = U if U is a single, non-surrogate value
– N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000 if U is a
surrogate pair <H,L>
– Undefined for any other U in UCS-2
• N is in the range 0 to 0x10FFFF
• <D800, DC00> => N = 0x10000
• <DBFF,DFFF> => ?
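The surrogate-pair formula can be sketched in Python as follows. The function names `surrogate_to_scalar` and `scalar_to_surrogate` are hypothetical; the second function is simply the inverse of the slide's formula, added here for illustration.

```python
def surrogate_to_scalar(h, l):
    """Apply N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000."""
    if not (0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF):
        raise ValueError("not a valid <H,L> surrogate pair")
    return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000


def scalar_to_surrogate(n):
    """Inverse mapping: split a scalar value above the BMP into <H,L>."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("only 0x10000..0x10FFFF needs a surrogate pair")
    v = n - 0x10000                       # 20-bit offset
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


print(hex(surrogate_to_scalar(0xD800, 0xDC00)))  # 0x10000
print(hex(surrogate_to_scalar(0xDBFF, 0xDFFF)))  # 0x10ffff

# First code point of Plane 2, where Extension B is located.
h, l = scalar_to_surrogate(0x20000)
print(hex(h), hex(l))                            # 0xd840 0xdc00
```

The two range checks mirror the "undefined for any other U" case: an unpaired or out-of-order surrogate raises an error rather than producing a value.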
• UTF: UCS Transformation Format
– Allows the code values in UCS that correspond to some
other coding standard (e.g. ASCII) to be transmitted
exactly as they would be in that coding standard, a property
known as transparency, while other code values
are represented through an escape mechanism
– Variable-length encoding to achieve greater
efficiency
• UTF-8: 8-bit encoding for the 8-bit UNIX environment
– ASCII transparent
– The first byte indicates the number of bytes in the encoded character
– Shortest-encoding principle, so that encoding/decoding is
invertible (bijective)
– Saves storage space for ASCII and other non-ideographic characters
– Example: Unicode A324 0430 0023 8A43
=> UTF-8:
– Example: UTF-8 24 38 58 CE 82
=> UCS-4:
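The two exercises can be worked through with a small Python sketch. `utf8_encode` is a hypothetical hand-written encoder that follows the shortest-encoding rule. It assumes the modern limit of 4 bytes (code points up to 0x10FFFF) and does not reject surrogate values. Decoding uses Python's built-in UTF-8 codec.

```python
def utf8_encode(cp):
    """Encode one code point in the shortest UTF-8 form (up to 4 bytes)."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")


# First exercise: Unicode A324 0430 0023 8A43 => UTF-8
encoded = b"".join(utf8_encode(cp) for cp in [0xA324, 0x0430, 0x0023, 0x8A43])
print(" ".join(f"{b:02X}" for b in encoded))  # EA 8C A4 D0 B0 23 E8 A9 83

# Second exercise: UTF-8 24 38 58 CE 82 => code points
decoded = bytes.fromhex("24 38 58 CE 82").decode("utf-8")
print(" ".join(f"{ord(ch):04X}" for ch in decoded))  # 0024 0038 0058 0382
```

As UCS-4 values, the decoded code points are simply zero-extended to 32 bits, for example 00 00 03 82.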
Character vs. glyph
• Character: the smallest component of written language that
has semantic value
• Glyph: represents a shape that a character can have when
it is rendered or displayed
• Example: A, A (the same letter in different typefaces) are the same character and have the
same code. The concrete shapes can be very different, yet they
are given one code point.
• Coding of variants
ISO 10646/Unicode Features for Chinese
– Han Unification (Chinese, Japanese and Korean)
– Unification Problems:
• Different sources, non-cognate
– Three-dimensional Conceptual Model:
semantics(x), abstract shape(y), actual shape(z)
Examples
Unification Rules (認同規則)
• R1: Source Separation Rule: If two ideographs are distinct in a
primary source standard, then they are not unified. Why?
• R2: Non-cognate (非同源) Rule: In general, if two ideographs
are unrelated in historical derivation (non-cognate characters),
then they are not unified.
• R3: By means of two-level classification, the abstract shape of
each ideograph is determined. Any two ideographs that possess
the same abstract shape are unified unless disallowed by R1 or
R2.
• Example:
• Component structure analysis
Sources of Unified Han Characters
Primary Source
Category  Standard                                    Number of Source Characters
G0        GB2312-80                                   6763
G1        GB12345-90                                  2352
G3        GB7589-87                                   4835
G5        GB7590-87                                   2842
G7        General Use Characters for Modern Chinese   42
G8        GB8565-88                                   290
T1        CNS11643-1986/plane 1                       5401
T2        CNS11643-1986/plane 2                       7650
Te        CNS11643-1986/plane 14                      4198
J0        JIS X 0208-1990                             6356
J1        JIS X 0212-1990                             5801
K0        KS C 5601-1987                              4620
K1        KS C 5657-1991                              2856
Secondary Source
No.  Standard                      Number of Source Characters
1    ANSI Z39.64-1989              13053
2    Big-5 (Taiwan)                13481
3    CCCII, level 1                4808
4    GB 12052-89 (Korean)          94
5    JEF (Fujitsu)                 3149
6    PRC Telegraph Code            ~8000
7    Taiwan Telegraph Code (CCDC)  9040
8    Xerox Chinese                 9776
Wide character vs. Multi-byte characters
• Text information needs to be represented by the
right data types.
– Multi-byte characters: data are processed on a
per-byte basis, e.g. Big5, GB, EUC, even UTF-8
– Wide characters: fixed-width encoding, so no
testing of the high bit is needed
• Processing representation for wide characters:
– Big Endian vs. Little Endian
• Data type dependent
• System architecture dependent
• Distinction: the byte order mark U+FEFF is stored as bytes FE FF in
Big Endian and as FF FE in Little Endian
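The byte-order distinction can be illustrated with Python's standard `codecs` module, using 16-bit wide characters. `detect_byte_order` is a hypothetical helper written for this sketch; it only inspects the first two bytes and assumes the data begins with a byte order mark.

```python
import codecs


def detect_byte_order(data):
    """Return 'big' or 'little' from a leading 16-bit byte order mark."""
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE
        return "little"
    return None                           # no byte order mark found


text = "\u4e2d"  # one BMP character, code 4E2D
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(" ".join(f"{b:02X}" for b in be))  # FE FF 4E 2D
print(" ".join(f"{b:02X}" for b in le))  # FF FE 2D 4E
print(detect_byte_order(be), detect_byte_order(le))  # big little
```

The same 16-bit value 4E2D appears with its two bytes swapped, which is why a reader needs the mark (or prior agreement) to interpret wide-character data correctly.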