Globalization Gotchas
Mark Davis
Unicode Basics
Unicode encodes characters, not glyphs:
U+0067 → g, rendered with many different glyphs depending on font and style.
Unicode does not encode characters by language:
The letter j in French, German, and English has the same code point even though
each language pronounces it differently.
Chinese 大 (da) has the same code point as Japanese 大 (dai).
UTF-8, UTF-16, and UTF-32 are all Unicode.
The word "character" means different things to different people:
glyphs, code points, bytes, code units, user-perceived characters
(grapheme clusters),…
Make clear which one you mean.
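As a rough illustration of how far these meanings can diverge, the Python sketch below counts one short string in three different units. The sample string and the helper name `unit_counts` are illustrative only and do not come from the slides.

```python
def unit_counts(text: str) -> dict:
    """Count the same text in three different units."""
    return {
        # A Python str is a sequence of code points
        "code_points": len(text),
        # UTF-8 code units are single bytes
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit occupies two bytes
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }

# x + U+0308 (what a user sees as one character), then U+1D160
sample = "x\u0308\U0001D160"
counts = unit_counts(sample)
print(counts)
```

A user would probably call this two characters, yet none of the three counts is 2.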
Unicode in APIs
U+0000 to U+10FFFF: Be prepared to handle (at least not corrupt!) any
incoming code points
A back-level system may get unassigned code points from later versions.
Watch for "UCS-2" implementations. They use UTF-16 text but don't support
characters above U+FFFF; they may also accidentally produce isolated surrogates.
Some APIs/protocols will count lengths in code points, and others in bytes
(or other code units).
Make sure you don't mix them up.
Don't limit API parameters to a single character (and definitely not to a single
code unit!).
What users think of as a single character (e.g. ẍ, ch) may be a sequence in Unicode.
Use the latest version of Unicode: supports new characters, corrections, more
stability guarantees.
Choice of Characters
Character and block names may be misleading, eg,
U+034F COMBINING GRAPHEME JOINER doesn't join graphemes.
► http://www.unicode.org/faq/
Use U+2060 (word joiner) instead of U+FEFF (zero-width nobreak space) for
everything but the BOM function.
Never use unassigned code points; those will be used in future
versions of Unicode.
For internal purposes, use only private-use (PUA) code points or noncharacters (and only if necessary).
If you do, minimize the opportunity for collision by picking an unusual
range.
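The distinctions above can be checked programmatically. The following is a minimal sketch using Python's `unicodedata` module; the function names `is_noncharacter` and `classify` are hypothetical, and the "unassigned" answer depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF and every U+nFFFE / U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def classify(cp: int) -> str:
    """Rough classification of a code point before using it for internal purposes."""
    if is_noncharacter(cp):
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private use"
    if category == "Cs":
        return "surrogate"
    if category == "Cn":
        # Reserved for a future version of Unicode: never use these
        return "unassigned in this Unicode version"
    return "assigned"

print(classify(0xE000), classify(0xFFFF), classify(0x0067))
```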
Character Conversion
Always use "shortest form" UTF-8.
It's the Law.
And if that isn’t enough, consider security attacks.
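As an illustration of why shortest form matters, the sketch below relies on the fact that Python's built-in UTF-8 decoder rejects non-shortest-form sequences. The helper name `decode_utf8_strict` is hypothetical.

```python
from typing import Optional

def decode_utf8_strict(data: bytes) -> Optional[str]:
    """Decode UTF-8, returning None for ill-formed input such as overlong forms."""
    try:
        return data.decode("utf-8")  # the built-in decoder rejects non-shortest forms
    except UnicodeDecodeError:
        return None

# Non-shortest-form encodings of "/" (U+002F). A filter that searches the raw
# bytes for 0x2F would miss them, which is the classic attack.
overlong_2_bytes = b"\xc0\xaf"
overlong_3_bytes = b"\xe0\x80\xaf"

print(decode_utf8_strict(b"/"), decode_utf8_strict(overlong_2_bytes))
```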
If a protocol allows a choice of charsets, always tag correctly
Not all text is correctly tagged: character detection may be necessary. But
remember, it's always a guess!
Converting a database of mixed, untagged data is extremely painful.
Bad assumptions:
Length [bytes] = N * length [code points]
1 character [charset X] = 1 character [Unicode]
The ordering may also be different.
Character Conversion II
IANA / MIME charset names are ill-defined: vendors often
convert the same charset in different ways.
Shift-JIS: 0x5C → U+005C (\) or U+00A5 (¥)
Don’t simply omit unconvertible data; to reduce security
problems, at least substitute:
U+FFFD (when converting to Unicode) or
0x1A (when converting to bytes).
► http://www.w3.org/TR/japanese-xml/
► http://icu.sourceforge.net/charts/charset/
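The substitution advice above can be sketched in Python as follows. Decoding with `errors="replace"` yields U+FFFD. For the other direction Python's built-in "replace" handler emits "?", so this sketch substitutes 0x1A by hand. Both function names are hypothetical, and `to_bytes` assumes a stateless charset (it encodes one character at a time, which would not work for stateful encodings such as ISO-2022-JP).

```python
def to_unicode(data: bytes, charset: str) -> str:
    """Decode bytes, substituting U+FFFD for each unconvertible sequence."""
    return data.decode(charset, errors="replace")

def to_bytes(text: str, charset: str) -> bytes:
    """Encode text, substituting 0x1A (SUB) for each unconvertible character."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            out.append(0x1A)
    return bytes(out)

print(repr(to_unicode(b"a\xffb", "utf-8")), to_bytes("a\u20acb", "latin-1"))
```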
Properties
Use properties such as Alphabetic, not hard-coded lists:
isAlphabetic(x)
regex: \p{Alphabetic} or [:Alphabetic:]
Not (“A” ≤ x ≤ “Z” OR “a” ≤ x ≤ “z”)
Some properties aren't what you think; use:
White_Space
not General_Category=Zs
Alphabetic
not General_Category=L
Lowercase
not General_Category=Ll
Script=Greek
not Block=Greek
Characters may change property values between versions of
Unicode
► http://unicode.org/standard/stability_policy.html
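A small Python sketch of the difference between a hard-coded range and a property lookup follows. One caveat, stated as such: Python's `str.isalpha()` is defined in terms of the General_Category letter classes rather than the Alphabetic property, so it only approximates the slide's recommendation; a library with full property support (ICU, for example) would be the closer match. The helper name `is_ascii_letter` is hypothetical.

```python
import unicodedata

def is_ascii_letter(ch: str) -> bool:
    """The hard-coded test the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"

for ch in ["g", "\u00e9", "\u044f", "\u5927"]:   # g, é, я, 大
    print(ch, is_ascii_letter(ch), ch.isalpha())

# White_Space is broader than General_Category=Zs: TAB is white space,
# but its General_Category is Cc (control), not Zs.
tab_is_space = "\t".isspace()
tab_category = unicodedata.category("\t")
space_category = unicodedata.category(" ")
```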
Identifiers & Tokens
When designing syntax, use as a base:
Pattern_Syntax for operators / relations
Pattern_Whitespace for gaps
XID_Start and XID_Continue for identifiers.
All backwards compatible across versions
Profiles may expand or narrow from the base
Watch out for security attacks:
“paypal.com” with a Cyrillic “a”
► See Unicode Security at this conference
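To make the spoofing example concrete, the sketch below flags strings that mix scripts. It is a crude heuristic, not a real confusable-detection algorithm: it guesses the script from the first word of each letter's Unicode character name, and the function name `scripts_by_name` is hypothetical. Python's `str.isidentifier()` is also shown, since Python's own identifier syntax is built on XID_Start and XID_Continue.

```python
import unicodedata

def scripts_by_name(text: str) -> set:
    """Crude script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

genuine = "paypal.com"
spoofed = "p\u0430ypal.com"   # second letter is U+0430 CYRILLIC SMALL LETTER A

print(genuine == spoofed)          # the strings look alike but differ
print(scripts_by_name(genuine))
print(scripts_by_name(spoofed))

# Identifier syntax based on XID_Start / XID_Continue
print("na\u00efve".isidentifier(), "2x".isidentifier())
```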
Comparison (Collation):
Searching, Sorting, Matching
There are two binary orders:
code point order = UTF-8 order = UTF-32 order
≠ UTF-16 order
Don’t present users with binary order!
No users expect A < Z < a < z < Ç < ä.
Apply normalization to get a unique form, so Å = Å.
Security Issues: Protocols must precisely define the comparison
operations:
Eg, LDAP doesn't, so lookup may fail (or falsely succeed!)
Aside from producing wrong results, this opens the door to security attacks.
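The two binary orders can be demonstrated with one BMP code point above the surrogate range and one supplementary code point. This is only a sketch: the two sample code points are arbitrary choices, and sorting by encoded bytes stands in for a system that compares raw code units.

```python
import unicodedata

a = "\uff61"       # U+FF61, a BMP code point above the surrogate range
b = "\U00010000"   # U+10000, the first supplementary code point

by_code_point = sorted([b, a])                                  # str compares by code point
by_utf8 = sorted([b, a], key=lambda s: s.encode("utf-8"))       # UTF-8 byte order
by_utf16 = sorted([b, a], key=lambda s: s.encode("utf-16-be"))  # UTF-16 code unit order

print(by_code_point == by_utf8, by_code_point == by_utf16)

# Normalization gives a unique form: A + combining ring, ANGSTROM SIGN,
# and the precomposed letter all normalize to U+00C5 in NFC.
forms = {unicodedata.normalize("NFC", s) for s in ("A\u030a", "\u212b", "\u00c5")}
```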
Language-Sensitive Comparison
Use UCA Order as a base to meet user-expectations:
a < A < ä < Ç = C◌̧ < z < Z
Real language-sensitive order requires tailoring on top of UCA;
ordering depends on context and language:
china < China < chinas < danish
ae < æ < af
z < æ (Danish)
c < d < ... h < ch < i (Slovak)
Follow UCA for substring match offsets – some gotchas here.
Don't mix up "stable" and "deterministic" sorting: they are very
different.
► http://unicode.org/reports/tr10/
► http://unicode.org/cldr
Normalization (NFC,…)
Standardized normalized forms defined by Unicode.
The ordering of accents in a normalization form may not be the
typical type-in order.
Fonts should handle both orders.
Normalization is context independent
Don't assume NFC(x + y) = NFC(x) + NFC(y)
People assume that NFC always composes, but some characters
decompose in NFC.
Trivia: In Unicode 4.1 there are exactly 3 characters that are
different in all 4 normalization forms: ϓ, ϔ, ẛ
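The normalization gotchas above can be checked with Python's `unicodedata` module. This is a sketch; the helper names `nfc` and `all_forms` are hypothetical, and the results reflect the Unicode version shipped with the Python runtime rather than Unicode 4.1 specifically.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# Concatenation: NFC(x + y) composes to U+00E4, NFC(x) + NFC(y) stays two code points
x, y = "a", "\u0308"
concat_differs = nfc(x + y) != nfc(x) + nfc(y)

# NFC does not always compose: U+0958 is a composition exclusion,
# so NFC leaves it decomposed as U+0915 U+093C
decomposes_in_nfc = nfc("\u0958") == "\u0915\u093c"

def all_forms(ch: str) -> set:
    """The set of distinct results across the four normalization forms."""
    return {unicodedata.normalize(f, ch) for f in ("NFC", "NFD", "NFKC", "NFKD")}

# The three trivia characters: each has four distinct normalized forms
trivia = {ch: len(all_forms(ch)) for ch in "\u03d3\u03d4\u1e9b"}
print(concat_differs, decomposes_in_nfc, trivia)
```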
Maximum Expansion (U4.1)
Operation      UTF      Factor   Sample
NFC            8        3X       𝅘𝅥𝅮 U+1D160
NFC            16, 32   3X       שּׁ U+FB2C
NFD            8        3X       ΐ U+0390
NFD            16, 32   4X       ᾂ U+1F82
NFKC / NFKD    8        11X      ﷺ U+FDFA
NFKC / NFKD    16, 32   18X      ﷺ U+FDFA
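The expansion factors can be reproduced for the sample characters with a few lines of Python. This is a sketch for sizing buffers, with a hypothetical helper name `expansion`; it measures individual characters only and does not prove that these are the maxima.

```python
import unicodedata

def expansion(form: str, ch: str, encoding: str) -> float:
    """Ratio of encoded length after normalization to encoded length before."""
    before = len(ch.encode(encoding))
    after = len(unicodedata.normalize(form, ch).encode(encoding))
    return after / before

print(expansion("NFC", "\U0001d160", "utf-8"))    # U+1D160
print(expansion("NFD", "\u1f82", "utf-16-le"))    # U+1F82
print(expansion("NFKC", "\ufdfa", "utf-8"))       # U+FDFA
print(expansion("NFKC", "\ufdfa", "utf-16-le"))   # U+FDFA
```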
Case Conversion
Not a simple 1:1 mapping
Title case:
dz↔DZ↔Dz
Expansion:
heiß → HEISS → heiss
Context-dependent:
ΌΣΟΣ → όσος
Language-dependent:
istanbul ↔ İSTANBUL
Warning: never use language-dependent casing for
language-independent structures, like file-system BTrees.
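Most of the cases above can be observed with Python's built-in string methods, which implement the default (language-independent) Unicode case mappings. The sketch below does not cover Turkic casing: Python has no locale-sensitive casing, so a library such as ICU would be needed for that.

```python
# Expansion: one character becomes two, and the round trip is lossy
upper_expands = "hei\u00df".upper()      # heiß -> HEISS
round_trip = upper_expands.lower()       # HEISS -> heiss

# Context-dependent: the final capital sigma lowercases to final sigma (U+03C2)
final_sigma = "\u038c\u03a3\u039f\u03a3".lower()   # ΌΣΟΣ

# Title case is a third case: U+01C6 (dz digraph) titlecases to U+01C5
title_dz = "\u01c6".title()

# Default lowercasing of U+0130 keeps the dot as a combining mark
dotted = "\u0130".lower()

# Case folding is the right tool for caseless matching
folded_equal = "hei\u00df".casefold() == "HEISS".casefold()

print(upper_expands, round_trip, final_sigma, folded_equal)
```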
Casing: Maximum Expansion
Operation              UTF          Factor   Sample
Lower                  8            1.5X     Ⱥ U+023A
Lower                  16, 32       1X       A U+0041
Upper / Title / Fold   8, 16, 32    3X       ΐ U+0390
Case Conversion II
Case folding was not stable.
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 5.0
Don't use the Lowercase_Letter (Ll) or Uppercase_Letter
(Lu) values of General_Category
These were constrained to be in a partition.
Use the separate binary properties Lowercase and Uppercase
instead.
Lowercase / Uppercase:
Form vs Function
Lowercase, the binary property:
The character is lowercase in form,
but not necessarily in function.
Functionally Lowercase:
isCased(x) & isLowercase(x).
See Section 3.13 of TUS.
Lowercase: Form vs Function
LC   F. LC   Ll   Count (U4.1)   Examples
Y    N       N    114            ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y    N       Y    705            ª U+00AA FEMININE ORDINAL INDICATOR
Y    Y       N    43             ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y    Y       Y    903            a U+0061 LATIN SMALL LETTER A
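The "functionally lowercase" test, isCased(x) & isLowercase(x), can be sketched in Python from the Section 3.13 definitions: a string is lowercase if lowercasing its NFD form changes nothing, and it is cased if at least one of the lowercase, uppercase, or titlecase tests fails. The function names are hypothetical, and `str.title()` is used here as a stand-in for toTitlecase, which is only a reasonable approximation for single characters.

```python
import unicodedata

def _nfd(x: str) -> str:
    return unicodedata.normalize("NFD", x)

def is_lowercase(x: str) -> bool:
    y = _nfd(x)
    return y.lower() == y

def is_uppercase(x: str) -> bool:
    y = _nfd(x)
    return y.upper() == y

def is_titlecase(x: str) -> bool:
    y = _nfd(x)
    return y.title() == y

def is_cased(x: str) -> bool:
    """Cased means at least one case conversion would change the string."""
    return not (is_lowercase(x) and is_uppercase(x) and is_titlecase(x))

def functionally_lowercase(x: str) -> bool:
    return is_cased(x) and is_lowercase(x)

for ch in ["\u02e0", "\u00aa", "\u2170", "a"]:
    print(ch, unicodedata.category(ch), functionally_lowercase(ch))
```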
Segmentation
What a user thinks of as a character is often a
sequence.
Words are not just sequences of letters.
Lines don’t just break at spaces
All may be language-dependent
► http://www.unicode.org/reports/tr14/
► http://www.unicode.org/reports/tr29/
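To show why code points are the wrong unit for "characters", here is a deliberately simplified Python sketch that attaches combining marks to the preceding base character. It is an approximation, not an implementation of the grapheme cluster rules in TR29: it ignores Hangul jamo sequences, CR LF, joiners, and language-specific cases such as "ch". The function name `rough_clusters` is hypothetical.

```python
import unicodedata

def rough_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch      # a mark extends the current cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

text = "x\u0308y"                   # x + combining diaeresis, then y
print(len(text), len(rough_clusters(text)))
```

For production use, a segmentation library that implements TR29 (ICU's BreakIterator, for example) is the safer choice.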
Transliteration
Transliteration ≠ Translation
Ελληνικά ↔ Ellēniká
Ελληνικά ↔ Greek
Transliteration may vary by language:
Путин ↔ Putin, Poutine, ...
Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv,
Gorbachov, Gorbatsov, Gorbatschow, ...
Watch for terminology: “lossy” vs “lossless”
Lossy transliteration:
Ελληνικά → Ellinika → Ελλινικα
In ISO terms:
“transliteration” = lossless transliteration
“transcription” = lossy transliteration.
► http://unicode.org/draft/reports/tr35/tr35.html
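The lossy round trip on this slide can be reproduced with two toy mapping tables. These tables are hypothetical and cover only the letters of the example word; they are not a real transliteration scheme. The loss happens because both η and ι map to "i", and the accent on ά is dropped.

```python
# Greek letters of the example word, written as escapes to avoid confusion
# with look-alike Latin letters: Ε λ η ν ι κ ά
GREEK = "\u0395\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"
LATIN = "Elinika"
TO_LATIN = dict(zip(GREEK, LATIN))   # η and ι both become "i"

TO_GREEK = {
    "E": "\u0395", "l": "\u03bb", "i": "\u03b9",   # "i" can only map back to ι
    "n": "\u03bd", "k": "\u03ba", "a": "\u03b1",   # the accent is gone: plain α
}

def map_chars(text: str, table: dict) -> str:
    """Replace each character using the table, leaving unknown ones as-is."""
    return "".join(table.get(ch, ch) for ch in text)

word = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"   # Ελληνικά
latin = map_chars(word, TO_LATIN)
back = map_chars(latin, TO_GREEK)
print(latin, back, back == word)
```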
Rendering is Contextual
Processing character-by-character gives the wrong results!
Glyphs may change shape
Multiple characters → 1 glyph
One character → multiple glyphs
Rendering II
Good rendering systems will handle customary type-in order for text plus canonical
order.
Excellent ones will do any canonically-equivalent order, but those are rare.
There may be differences in the customary glyphs for different languages; specify the
font or the language where they have to be distinguished
Security Issues:
Never render a missing glyph as "?".
Don't simply overlay diacritics: it can cause security problems.
► http://www.unicode.org/notes/tn2/
► http://unicode.org/reports/tr14/
Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there's more work to be done...
Use globalization APIs: Formatting and parsing of dates, times, numbers, currencies;
comparison of text; calendar systems; ... are locale-dependent.
Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (C,
C++, Java)
Don't put any translatable strings into your code; separate into resource files.
Provide context to translators: is Mark a noun, a verb, or a name…
Don’t use the same string in different contexts unless the meaning is identical (including
references).
Note: User-Interface language (menus, dialog, help-system,...) ≠
Data language (body text, spreadsheet cells).
Programs need to handle, as data, more languages than in localized UI
Common Globalization Mistakes
Never compile Windows apps as “ANSI” (the default!).
Don't simply concatenate strings to make messages:
Order of components differs by language: use Java MessageFormat, or structure
UI as separate fields.
Don't assume icons and symbols mean the same around the world. Don't
assume everyone can read the Latin alphabet.
Allocate space flexibly: “OK” in English → “Aceptar” in Spanish
English is a relatively compact language; others may require more characters (eg in
database fields) and more screen real estate (in UIs).
Beware of discrepancies in “fallback” behavior:
Java ResourceBundle (J2SE), JavaServer Pages Standard Tag Library (JSTL),
JavaServer Faces (JSF), Apache HTTP Server,...
► http://unicode.org/cldr/
► http://ibm.com/software/globalization/icu/
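The point about not concatenating strings can be sketched with per-language templates that use named placeholders, so that each translation controls its own word order. The slide recommends Java MessageFormat; the Python below only illustrates the same idea. The template table, the message, and the function name `render` are hypothetical, and plural handling is deliberately left out.

```python
# Hypothetical resource table: in a real product these strings would live in
# separate resource files, one per language, not in the code.
TEMPLATES = {
    "en": "There are {count} files in {folder}.",
    "de": "In {folder} befinden sich {count} Dateien.",
}

def render(language: str, **values) -> str:
    """Fill the template for the given language; placeholder order is per-language."""
    return TEMPLATES[language].format(**values)

print(render("en", count=3, folder="Inbox"))
print(render("de", count=3, folder="Inbox"))
```

Building the German sentence by concatenating pieces in the English order would put the folder name and the count in the wrong positions.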
Neutral Formats
Store and transmit neutral-format data wherever possible. Convert that data to
the user's preferred formats as "close" to the user as possible.
Type               Example               Rec. Standard
Language/Locale*   en-US (en_US)         RFC 3066 bis / CLDR
Territory          AU                    RFC 3066 bis
Currency           EUR                   ISO 4217
Timezone           Australia/Melbourne   TZDB
Calendar           islamic-civil         CLDR Calendar ID
Custom Date        yyyy-MMM-dd           CLDR Pattern Format
Binary Time        8C80E9E3967A4B0       Windows File Time
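As one example of converting a neutral format close to the user, the sketch below turns a Windows File Time value (a count of 100-nanosecond intervals since 1601-01-01 UTC) into a timezone-aware Python datetime, which can then be formatted for the user's locale. The function name `filetime_to_datetime` is hypothetical, and sub-microsecond precision is discarded.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(filetime: int) -> datetime:
    """Convert 100-nanosecond ticks since 1601-01-01 UTC to an aware datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)

# 116444736000000000 ticks separate the File Time epoch from the Unix epoch
unix_epoch = filetime_to_datetime(116444736000000000)
print(unix_epoch.isoformat())
```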
Identification
Locale IDs are extensions of language IDs; use CLDR.
► http://unicode.org/cldr/
Don't assume that everyone in a country always uses that country’s currency.
Always use an explicit currency ID (ISO 4217).
<RUR, 1.23457×10³> ↔ 1 234,57р. in Russian,
but Rub 1,234.57 in English.
Don't assume the timezone ID is implied by the user's locale. For the best
timezone information, use the TZ database; use CLDR for timezone names.
► http://www.twinsun.com/tz/tz-link.htm
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc.
(eg, from browser settings) make sure the user can override that and pick an
explicit value.
Unicode Guide
Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other Resources
Unicode Site:
http://unicode.org
An Overview of ICU:
http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
Globalizing Software:
http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
W3C Internationalization:
http://www.w3.org/International/
Microsoft Global Software Development
http://www.microsoft.com/globaldev/default.asp
Q&A
Backup Slides
User Input
If you develop your own text editor, use the OS APIs
to handle IMEs (Input Method Engines) for Chinese,
Japanese, Korean,...
If you are using "type-ahead" to get to a position in a
list (eg typing "Jo" gets to the first element starting with
those characters), allow arbitrary input. This is often
easiest with visible fields.
If your password field can contain characters that
require an IME, a screen pop-up box may reveal the
password to onlookers.
Dotted and Dotless I
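Normal (default) casing:
Uppercase              Lowercase
I (0049)          ⇄    i (0069)
İ (0130)          →    i+˙ (0069 0307)
I+˙ (0049 0307)   ⇄    i+˙ (0069 0307)
I (0049)          ←    ı (0131)
Turkic casing:
Uppercase              Lowercase
I+˙ (0049 0307)   →    i (0069)
İ (0130)          ⇄    i (0069)
I (0049)          ⇄    ı (0131)
İ+˙ (0130 0307)   ⇄    i+˙ (0069 0307)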
Java
In MessageFormat, watch for words like can't, since ASCII ' has syntactic
meaning. Use a real apostrophe (U+2019) where possible: can’t.
In Date and Calendar, the months are numbered from 0 (February is month
number 1!). However, weeks and days are numbered from 1.
Java serialized text isn't UTF-8, though it's close. U+0000 and supplementary
code points are encoded differently.
Java globalization support is pretty outdated: use ICU to supplement it.
Java ResourceBundle (J2SE), JavaServer Pages Standard Tag Library (JSTL),
JavaServer Faces (JSF), Apache HTTP Server, etc. all provide some
locale-determination mechanism, but they all differ in the details.
JavaScript
Always encode characters above U+007F with
escapes (\uxxxx).
There is an HTML mechanism to specify the
charset of the JavaScript source, but it is not
widely implemented.
The JDK tool native2ascii can be used to
convert the files to use escapes.
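For readers without the JDK at hand, the escaping step can be sketched in a few lines of Python. This only approximates what a tool like native2ascii does: it replaces every character above U+007F with \uxxxx escapes, one per UTF-16 code unit, so supplementary characters become a surrogate pair. The function name `escape_non_ascii` is hypothetical.

```python
def escape_non_ascii(source: str) -> str:
    """Replace each character above U+007F with \\uxxxx escapes (UTF-16 code units)."""
    out = []
    for ch in source:
        if ord(ch) <= 0x7F:
            out.append(ch)
        else:
            units = ch.encode("utf-16-be")          # 2 bytes per code unit
            for i in range(0, len(units), 2):
                out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)

print(escape_non_ascii('var s = "caf\u00e9";'))
```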