Unicode Normalization

Mark Davis
www.macchiato.com
Normalization
Uniqueness: two equivalent strings have precisely the
same normalized form
Fast binary comparison,
accurate digital signatures
Recommended for XML, JavaScript
and other standards
Canonical Equivalence
Fundamental equivalence
Indistinguishable to users, when correctly rendered
Includes
 Combining sequences: Ç ↔ C ◌̧
 Hangul: 가 ↔ ㄱ ㅏ
 Singletons: Ω ↔ Ω
Compatibility Equivalence
Formatting differences
 Font variants (ℌ)
 Breaking differences (-)
 Cursive forms (ﻥ ﻧ ﻨ ﻦ)
 Circled (⑪)
 Width, size, rotated (ｶ ﹠ ︷)
 Super/subscripts (₉ ⁹)
 Squared characters (㌀)
 Fractions (⅚)
 Others (dž)
ｶ ↔ カ
㎏ ↔ k g
ﬁ ↔ f i
UTR #15: Unicode Normalization Forms
Form D: Canonical Decomposition
Form C: Form D + Canonical Composition
Form KD: Compatibility Decomposition
Form KC: Form KD + Canonical Composition
Normalization Requirement
Uniqueness: two equivalent strings will have
precisely the same normalized form

If two strings x and y are canonical equivalents, then
C(x) = C(y)
D(x) = D(y)

If two strings are compatibility equivalents, then
KC(x) = KC(y)
KD(x) = KD(y)
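
The requirement above can be checked with Python's standard `unicodedata` module. This is only a sketch: the sample strings and the helper name `nf` are chosen here for illustration and are not taken from the slides.

```python
import unicodedata

def nf(form, s):
    """Thin wrapper: form is 'NFC', 'NFD', 'NFKC' or 'NFKD'."""
    return unicodedata.normalize(form, s)

# Canonical equivalents: precomposed U+00C7 versus C + combining cedilla (U+0327).
x, y = "\u00C7", "C\u0327"
canonical_ok = nf("NFC", x) == nf("NFC", y) and nf("NFD", x) == nf("NFD", y)

# Compatibility equivalents: the ligature U+FB01 versus the two letters "fi".
p, q = "\uFB01", "fi"
compat_ok = nf("NFKC", p) == nf("NFKC", q) and nf("NFKD", p) == nf("NFKD", q)

# Compatibility equivalents need not be canonical equivalents:
# Form C leaves the ligature alone, so the two strings stay distinct.
still_distinct_under_c = nf("NFC", p) != nf("NFC", q)
```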
Affected Characters
None of the forms affect text with only
ASCII characters (U+0000 to U+007F)
None of the forms generate compatibility
characters that were not in the source text.
Both KD and KC replace compatibility
characters.
Both D and C maintain compatibility
characters.
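
A small sketch of these four statements, again using Python's `unicodedata`. The test character U+338F (SQUARE KG) is an example picked for this sketch.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Every ASCII character, U+0000 to U+007F, in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)

# U+338F SQUARE KG is a compatibility character.
kg = "\u338F"
kept_by_c_and_d = all(
    unicodedata.normalize(form, kg) == kg for form in ("NFC", "NFD")
)
replaced_by_kc_and_kd = all(
    unicodedata.normalize(form, kg) == "kg" for form in ("NFKC", "NFKD")
)
```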
Cautions: Decomposition
Requires decomposition mappings from the
Unicode Character Database
Those decomposition mappings must be
applied recursively
The string must be put into canonical order
Either Canonical or Compatibility
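
The three cautions above (use the database mappings, apply them recursively, then put the result into canonical order) can be sketched as follows. This is an illustration, not a reference implementation: the function names are hypothetical, the mappings come from Python's `unicodedata`, and Hangul syllables are deliberately left out because their decomposition is algorithmic rather than table-driven.

```python
import unicodedata

def decompose_char(ch, compatibility=False):
    """Recursively apply the decomposition mapping of one character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return ch  # no mapping (note: Hangul syllables also land here)
    fields = mapping.split()
    if fields[0].startswith("<"):
        # A tag such as <compat> or <square> marks a compatibility mapping.
        if not compatibility:
            return ch
        fields = fields[1:]
    return "".join(
        decompose_char(chr(int(field, 16)), compatibility) for field in fields
    )

def canonical_order(s):
    """Stable-sort each run of non-starters by canonical combining class."""
    chars = list(s)
    i = 0
    while i < len(chars):
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        if j > i:
            chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
            i = j
        else:
            i += 1
    return "".join(chars)

def decompose(s, compatibility=False):
    """Sketch of Form D (or Form KD when compatibility=True), minus Hangul."""
    pieces = "".join(decompose_char(ch, compatibility) for ch in s)
    return canonical_order(pieces)
```

For example, U+1E69 needs two rounds of mapping (U+1E69 to U+1E63 U+0307, then U+1E63 to s U+0323), which is why a single table lookup is not enough.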
Cautions: Composition
Decomposition required first!
Then canonical composition
Composition data: fixed at Unicode 3.0.0

Some characters are excluded from
composition
Form C and Form KC can still have
combining characters!

Required for Indic, Arabic, Hebrew, &c.
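
A short check that Form C output can still contain combining characters. The sample strings (Devanagari KA + vowel sign I, Arabic BEH + FATHA, Hebrew BET + QAMATS, and Latin x + acute) are chosen for this sketch; none of them has a precomposed form that composition would produce.

```python
import unicodedata

def has_combining_mark(s):
    """True if any character has a general category starting with 'M'."""
    return any(unicodedata.category(ch).startswith("M") for ch in s)

samples = {
    "devanagari": "\u0915\u093F",  # KA + VOWEL SIGN I
    "arabic": "\u0628\u064E",      # BEH + FATHA
    "hebrew": "\u05D1\u05B8",      # BET + QAMATS
    "latin": "x\u0301",            # x + COMBINING ACUTE ACCENT
}

# Normalize each sample to Form C and record whether marks survive.
marks_survive = {
    name: has_combining_mark(unicodedata.normalize("NFC", text))
    for name, text in samples.items()
}
```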
Caution: Both C & D
No normalization form is closed under string
concatenation. Example:
 NFC/D: "…a◌̰" + "◌̀…"
 Not Norm.: "…a◌̰◌̀…"
 NFC: "…à◌̰…"
 NFD: "…a◌̀◌̰…"
Exceptions are easy to test for
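
A sketch of the concatenation problem with explicit code points. The NFC case reads the marks in the example as U+0330 (tilde below) and U+0300 (grave), which is an assumption about the slide; the NFD case uses the same marks in the opposite order, a pair picked here for illustration.

```python
import unicodedata

def is_normalized(form, s):
    """True if s is already in the given normalization form."""
    return unicodedata.normalize(form, s) == s

# Form C: both pieces are normalized, the concatenation is not.
left_c, right_c = "a\u0330", "\u0300"
nfc_parts_ok = is_normalized("NFC", left_c) and is_normalized("NFC", right_c)
nfc_concat_ok = is_normalized("NFC", left_c + right_c)
nfc_result = unicodedata.normalize("NFC", left_c + right_c)  # U+00E0 U+0330

# Form D: the marks end up out of canonical order after concatenation.
left_d, right_d = "a\u0300", "\u0330"
nfd_parts_ok = is_normalized("NFD", left_d) and is_normalized("NFD", right_d)
nfd_concat_ok = is_normalized("NFD", left_d + right_d)
nfd_result = unicodedata.normalize("NFD", left_d + right_d)  # a U+0330 U+0300
```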
Composition Process
1. Decompose (D or KD)
2. Combine unblocked characters with the
previous starter, if possible*
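
The two steps can be sketched in Python as below. Two loud assumptions: step 1 is delegated to `unicodedata.normalize("NFD", ...)`, and the pair lookup `compose_pair` is a stand-in that asks the library to NFC-normalize a two-character string instead of consulting a real composition table. The function names are hypothetical; the point is only to show the "unblocked, previous starter" logic.

```python
import unicodedata

def compose_pair(starter, ch):
    """Stand-in for a composition table: composite character or None."""
    result = unicodedata.normalize("NFC", starter + ch)
    return result if len(result) == 1 else None

def compose(decomposed):
    """Step 2: combine unblocked characters with the previous starter.

    Assumes the input is already decomposed and in canonical order.
    """
    out = []
    starter = None  # index in `out` of the most recent starter
    last_cc = 0     # combining class of the last character appended
    for ch in decomposed:
        cc = unicodedata.combining(ch)
        if starter is not None:
            adjacent = starter == len(out) - 1
            # Not adjacent: the last kept mark blocks ch unless its
            # combining class is lower than that of ch.
            unblocked = adjacent or last_cc < cc
            composite = compose_pair(out[starter], ch) if unblocked else None
            if composite is not None:
                out[starter] = composite
                continue
        if cc == 0:
            starter = len(out)
        last_cc = cc
        out.append(ch)
    return "".join(out)

def form_c_sketch(s):
    """Step 1 (decompose) followed by step 2 (compose)."""
    return compose(unicodedata.normalize("NFD", s))
```

With the input a, U+0330, U+0300, the tilde below stays in place and does not block the grave, so the result is U+00E0 followed by U+0330.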
Composition Exclusions
 Script Specifics: क + ◌़ ⇏ क़
 Futures: G + ◌̣ ⇏ G̣
 Singletons*: Ω ⇏ Ω
 Non-starter sequences*: ◌̈ + ◌́ ⇏ ◌̈́
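
The exclusion categories can be observed through Python's `unicodedata`. The code points below are a reading of the examples on this slide (U+0958 for the Devanagari letter, U+2126 for the ohm sign, U+0344 for the double mark), so treat them as assumptions of this sketch.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

# Script specific: KA + NUKTA is not recombined into U+0958,
# and U+0958 itself decomposes under Form C.
ka_nukta = "\u0915\u093C"
script_specific = nfc(ka_nukta) == ka_nukta and nfc("\u0958") == ka_nukta

# No precomposed G with dot below exists, so the sequence is left as is.
g_dot_below = "G\u0323"
no_composite = nfc(g_dot_below) == g_dot_below

# Singleton: OHM SIGN maps to GREEK CAPITAL OMEGA and never back.
singleton = nfc("\u2126") == "\u03A9" and nfc("\u03A9") == "\u03A9"

# Non-starter sequence: diaeresis + acute is not recombined into U+0344.
marks = "\u0308\u0301"
non_starter = nfc(marks) == marks and nfc("\u0344") == marks
```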
Legacy Encoding
Legacy text is ‘normalized’ if it maps 1:1 to
normalized Unicode text
Legacy sets:
 Prenormalized: e.g. ISO 8859-1
 Normalizable: e.g. ISO 2022
(ISO 5426/ISO 8859-1/…)
 Unnormalizable: e.g. ISO 5426
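
A sketch of what "prenormalized" means for ISO 8859-1, assuming Form C is the target form (the slide does not say which form is meant): every byte value maps 1:1 to a character that Form C leaves unchanged.

```python
import unicodedata

# All 256 byte values, decoded with Python's built-in ISO 8859-1 codec.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")

# Decoded text is already in Form C, and the mapping round-trips 1:1.
already_form_c = unicodedata.normalize("NFC", text) == text
round_trips = text.encode("iso-8859-1") == all_bytes

# The same text is not in Form D: accented letters such as U+00E9 decompose.
already_form_d = unicodedata.normalize("NFD", text) == text
```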
Programming Identifiers
Closed under all Normalization Forms, if minor
changes are incorporated
Modified syntax:
 identifier := start ( start | extend )*
 start := [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}] - irregulars - combining_like
 extend := [{Mn}{Mc}{Nd}{Pc}{Cf}] - irregulars + combining_like + mid_dot
(Almost) closed under Case Mappings
 see SpecialCasing.txt
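
The modified syntax can be sketched with general categories from Python's `unicodedata`. The slide does not list the members of `irregulars` or `combining_like`, so they are empty placeholders here, and `mid_dot` is assumed to be U+00B7 MIDDLE DOT; all function names are hypothetical.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}

IRREGULARS = frozenset()        # placeholder: members not given on the slide
COMBINING_LIKE = frozenset()    # placeholder: members not given on the slide
MID_DOT = frozenset({"\u00B7"})  # assumption: MIDDLE DOT

def is_start(ch):
    """start := start categories - irregulars - combining_like"""
    return (unicodedata.category(ch) in START_CATEGORIES
            and ch not in IRREGULARS
            and ch not in COMBINING_LIKE)

def is_extend(ch):
    """extend := extend categories - irregulars + combining_like + mid_dot"""
    if ch in COMBINING_LIKE or ch in MID_DOT:
        return True
    return unicodedata.category(ch) in EXTEND_CATEGORIES and ch not in IRREGULARS

def is_identifier(s):
    """identifier := start ( start | extend )*"""
    return (bool(s) and is_start(s[0])
            and all(is_start(ch) or is_extend(ch) for ch in s[1:]))
```

As a spot check of closure, an identifier such as "café" remains an identifier after conversion to Form D, because the combining acute accent has category Mn.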
Resources
Reference version on Unicode Site
Production Version
 http://oss.software.ibm.com/icu
 ICU: C/C++ and Java Versions
 Open Source, with IBM Public License
 Free commercial use and distribution: Not Viral!
Panel Later today
Other companies also providing: ask!
Normalization
Uniqueness: two equivalent strings
have precisely the same normalized
form
Fast binary comparison, accurate
digital signatures
Recommended for XML, JavaScript
and other standards
Q&A
Backup Slides
Definition: Starter
S is a starter =
 Canonical class of zero in the Unicode Character Database
 Can start a composition
Examples:
 Starters: Spacing marks, some non-spacing
  ‘a’ ‘ق’ ‘Θ’ ‘क’ ‘ी’ ‘◌ी’
 Non-starters: most non-spacing marks
  ‘◌̀’ ‘◌̊’ ‘◌̽’ ‘◌̥’
Definition: Blocked
C is blocked from S
There is some character B between S and C, and either
 B is a starter or
 B has the same canonical class as C
Examples
 “ABC” – B blocks C from A
 “A◌̀◌̊” – ◌̀ blocks ◌̊ from A
 “A◌̥◌̊” – ◌̥ doesn’t block ◌̊ from A
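
The starter and blocked definitions translate almost directly into code. A minimal sketch, assuming the canonical class is what Python's `unicodedata.combining` returns; the function names and the index-based interface are choices made here, and the blocking test follows the wording of this slide (same canonical class).

```python
import unicodedata

def is_starter(ch):
    """A starter has canonical class zero."""
    return unicodedata.combining(ch) == 0

def is_blocked(s, i, j):
    """True if s[j] is blocked from s[i], with i < j.

    Some character B strictly between them is a starter, or has the
    same canonical class as s[j].
    """
    cc = unicodedata.combining(s[j])
    return any(
        is_starter(b) or unicodedata.combining(b) == cc
        for b in s[i + 1:j]
    )

# The three examples, with U+0300 (grave), U+030A (ring above)
# and U+0325 (ring below).
abc_blocked = is_blocked("ABC", 0, 2)
same_class_blocked = is_blocked("A\u0300\u030A", 0, 2)
different_class_blocked = is_blocked("A\u0325\u030A", 0, 2)
```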
Testing Conformance: Canonical
For all Unicode characters X
 C(X) = C(D(X))
 D(X), C(X) in canonical order
No CDM
 X = D(X)
 X = C(X)
CDM
 X ≠ D(X)
 No characters in D(X) have CDM
 X ∈ Exclusions: X ≠ C(D(X))
 X ∉ Exclusions: X = C(D(X))
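
The two universal checks (C(X) = C(D(X)), and both forms in canonical order) can be run against an existing implementation. The sketch below tests Python's own `unicodedata` as a stand-in for the implementation under test; the function names are hypothetical, and the exclusion-dependent rows are not covered because the standard library does not expose the exclusion list.

```python
import unicodedata

def in_canonical_order(s):
    """No non-starter follows a mark with a higher combining class."""
    previous = 0
    for ch in s:
        cc = unicodedata.combining(ch)
        if cc != 0 and previous > cc:
            return False
        previous = cc
    return True

def conformance_failures(code_points):
    """Return the code points that violate either universal check."""
    failures = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters
        x = chr(cp)
        d = unicodedata.normalize("NFD", x)
        c = unicodedata.normalize("NFC", x)
        if c != unicodedata.normalize("NFC", d):
            failures.append(cp)
        elif not (in_canonical_order(d) and in_canonical_order(c)):
            failures.append(cp)
    return failures
```

Calling `conformance_failures(range(0x110000))` covers every code point; a smaller range is enough for a quick smoke test.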
Unicode Normalization
Introduction
Normalization forms
Design goals
 Specification
 Excluded characters
 Versions

Legacy encodings
Applications
Characters and Encoding Forms
Abstract   Encoded   Serialized (UTF-16BE)   Serialized (UTF-8)
Å          C5        00 C5                   C3 85
           212B      21 2B                   E2 84 AB
           F0000     DB 80 DC 00             F3 B0 80 80
A °        61 30A    00 61 03 0A             61 CC 8A
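
The serialized byte values on this slide can be reproduced with Python's built-in codecs. A small sketch; the helper name `hex_bytes` is chosen here, and the last row is taken to be U+0061 followed by U+030A, as the encoded values indicate.

```python
def hex_bytes(text, encoding):
    """Serialize text and show the bytes as space-separated hex pairs."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))

code_point_sequences = {
    "C5": "\u00C5",
    "212B": "\u212B",
    "F0000": "\U000F0000",
    "61 30A": "\u0061\u030A",
}

# Map each encoded sequence to its (UTF-16BE, UTF-8) serializations.
serialized = {
    label: (hex_bytes(text, "utf-16-be"), hex_bytes(text, "utf-8"))
    for label, text in code_point_sequences.items()
}
```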