What’s New in Globalization

What’s New in Globalization?
Mark Davis
President & Cofounder
The Unicode Consortium
The Unicode Standard, Version 5.0
“Hard copy versions of the Unicode
Standard have been among the most crucial
and most heavily used reference books in
my personal library for years.”
— Donald E. Knuth
“For more than a decade, Unicode has been
a foundation for many Microsoft products
and technologies; Unicode Standard Version
5.0 will help us deliver important new
benefits to users.”
— Bill Gates
“The path W3C follows to making text on
the Web truly global is Unicode.”
— Sir Tim Berners-Lee, KBE
“Without Unicode, Java wouldn't be Java,
and the Internet would have a harder time
connecting the people of the world.”
— James Gosling
The Unicode Standard, Version 5.0
Obsoletes previous versions
Basis for Microsoft's Vista; in upgrade plans for
Google, Yahoo!, and ICU, to name but a few.
Hundreds of pages of new information; thousands of
revised pages; all Unicode Standard Annexes
Systematic framework for improved text processing
Improvements to the Unicode Encoding Model for UTF-8,
…
Rigorous stability of case folding and identifiers
Improved interoperability and backward compatibility
Enabling additional new ways to optimize code
U5.0 Unicode Character Database
Unicode: far more than a list of characters
Properties: key to how characters function
Changes in 5.0
Scripts: Unassigned code points → Zzzz
Casing Stability: Upper → folded
BIDI: Consistent Bidi_Mirrored
Now Normative: kIICore
Line Break: SE Asian → Complex_Context
New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
[Chart: code point allocation]
General: 99,089
Private Use: 137,468
Surrogate: 2,048
Noncharacter: 66
Reserved: 875,441
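The five allocation figures on this slide can be cross-checked with a little arithmetic. The sketch below derives the surrogate, private-use, and noncharacter counts from their well-known code point ranges; the "General" count is simply copied from the slide, and "Reserved" is whatever remains of the 0x110000 code points.

```python
# Code point budget for Unicode 5.0, checked against the figures on the slide.
TOTAL = 0x110000  # U+0000..U+10FFFF

# Surrogates: U+D800..U+DFFF
surrogates = 0xDFFF - 0xD800 + 1

# Private use: U+E000..U+F8FF, plus planes 15 and 16
# (each plane has 0x10000 code points, minus its two trailing noncharacters).
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0x10000 - 2)

# Noncharacters: U+FDD0..U+FDEF, plus the last two code points of all 17 planes.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17

general = 99_089  # copied from the slide, not computed here

reserved = TOTAL - general - private_use - surrogates - noncharacters

print(surrogates, private_use, noncharacters, reserved)
# prints: 2048 137468 66 875441
```

The five categories add up to exactly 1,114,112 code points, which is why the chart's labels and numbers can be paired with confidence.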
U5.0 Conformance
Stable Case-Folded
≈ Upper → Lower
Much clearer encoding / property model
Stable Approved Named Character Sequences
Bengali, Gurmukhi, Tamil changes
Combining grapheme joiner clarified
Disunification of Diacritics
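As a small illustration of why case folding is treated separately from lowercasing, here is a sketch using Python's built-in `str.casefold()`. It relies on whatever Unicode version the interpreter ships with (not necessarily 5.0), and the helper name `caseless_equal` is made up for this example.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings by their case-folded forms (simple caseless match)."""
    return a.casefold() == b.casefold()

# Lowercasing keeps the sharp s; case folding maps it to "ss".
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse

print(caseless_equal("STRASSE", "Straße"))  # True
print("STRASSE".lower() == "Straße".lower())  # False
```

A stable case-folded form matters for identifiers and stored keys: once a string has been folded, later versions of the standard should not change its folded form.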
5.0 Annexes: Core
UAX #9: Bidirectional Algorithm
Tightened conformance requirements
UAX #15: Unicode Normalization Forms
New Stream-Safe Text Format
Appendix of characters requiring special handling
Expanded info on stability guarantees
Additional detailed figures, guidelines
UAX #31: Identifier and Pattern Syntax
Added profiles & information on usage
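To make the normalization and identifier annexes concrete, the sketch below uses Python's standard `unicodedata` module and `str.isidentifier()`. Python's identifier rules follow the UAX #31 approach, but the interpreter's Unicode version will generally be newer than 5.0, so treat this as an illustration rather than a 5.0 conformance test.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# The two strings look identical but differ code point by code point.
print(decomposed == composed)  # False

# NFC composes, NFD decomposes; after normalization the strings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True

# Identifier syntax: letters may start an identifier, digits may not.
print("caf\u00e9".isidentifier())  # True
print("2cafe".isidentifier())      # False
```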
U5.0 Annexes: Boundaries
UAX #14: Line Breaking Properties
Rules modified to improve behavior
Now Normative (conformance clauses reorganized)
UAX #29: Text Boundaries
Edge cases improved
Tailorings for text boundaries now in Unicode CLDR
Format of the rules changed to ease implementation
Additional guidelines on regex, identifiers,…
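To show what a text boundary is, here is a deliberately rough sketch that groups each base character with the combining marks that follow it. It is not the UAX #29 algorithm: it ignores Hangul jamo, CR LF, and tailoring, and the function name `rough_clusters` is invented for this example.

```python
import unicodedata

def rough_clusters(text: str) -> list[str]:
    """Attach each combining mark (general category M*) to the preceding character."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # mark joins the cluster in progress
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters

word = "re\u0301sume\u0301"  # "résumé" written with combining accents
print(len(word))                  # 8 code points
print(len(rough_clusters(word)))  # 6 user-perceived characters
```

A production implementation would use a library that follows the published boundary rules, such as ICU's break iterators.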
U5.0 Characters by Script
[Chart: characters by script. Labeled scripts: Inherited, Phoenician, Cyrillic, Devanagari, Kannada, Greek, Hebrew, Latin, Common, Phags Pa, Nko, Cuneiform, Balinese]
Unicode Character Timeline
[Chart: character counts by type (Letter, Symbol, Mark, Number, Punctuation, Control/Format, Separator) on a logarithmic scale from 1 to 1,000,000, for Unicode versions 2.0.0, 2.1.2, 3.0.0, 3.1.0, 3.2.0, 4.0.0, 4.1.0, and 5.0.0]
Unicode Guide for Programmers
Adjunct to Standard
Concise Guide for Software Globalization
Crucial Concepts
Key “Gotchas”: Recognize and Avoid
Details on
Encoding & conversions: UTF-8, 16, 32 & BOM
Using character properties
Text Operations
Unicode Common Locale Data Repository: CLDR
Key locale data for world languages
Most extensive standard repository of locale data
XML format
Sample data:
Dates: Δευτέρα, 05 Σεπτεμβρίου 2005 / Montag, 5. September 2005
Language names (in Polish): Arabic – arabski, Bulgarian – bułgarski, Czech – czeski, …
Collation: Z < Å
Currency formats: ¥1,234.57 / 1 234,57руб.
Currency symbols:
AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…
Territory names (in Chinese): Africa – 非洲, Central America – 中美洲, Eastern Africa – 东非, Northern Africa – 北非, …
Unicode CLDR 1.4
121 languages and 142 territories – 360 locales in all
25% more locale data; over 17,000 new/modified items
Repository separated into language vs locale data
Language-specific segmentation (word/line breaks…)
Transliterations (e.g. Ελληνικά ↔ Ellēniká)
Data for lenient date/time formatting and parsing
Programmer asks for “numeric day” + “abbreviated month”
Best format pattern returned, e.g. “dd.MMM”
+ Quarters in dates (e.g. 2006Q1)
BCP 47 compatibility + extensions
BCP 47 Language Tags
Usage: HTTP, HTML, XML; CLDR Locale IDs…
RFC 4646; Obsoletes RFCs 1766, 3066
Addresses problems in RFC3066
ISO standards: stability / accessibility / ambiguity
Parseability, Extensibility; Registration speed
Identification of script (where necessary):
Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn),
Azerbaijani (Cyrillic) az-Cyrl, etc.
Unicode Security
Examples:
Visual Confusables: “paypal.com” with Cyrillic ‘a’…
Non visual problems: buffer overflows, non-shortest form,…
UTR #36: Unicode Security Considerations
Guidelines & Recommendations
UTS #39: Unicode Security Mechanisms
Algorithms & Data
Limitations on Repertoire
Testing for Confusables
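Both kinds of problem named on this slide can be demonstrated in a few lines. The sketch below is a heuristic, not the UTS #39 mechanism: Python's standard library has no Script property, so the first word of each character name stands in for it, and the function names are invented for this example.

```python
import unicodedata

def scripts_by_name(text: str) -> set[str]:
    """Heuristic script guess: first word of the Unicode name of each letter."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

def is_well_formed_utf8(data: bytes) -> bool:
    """Python's strict UTF-8 decoder rejects non-shortest (overlong) forms."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

genuine = "paypal.com"
spoof = "p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A

print(sorted(scripts_by_name(genuine)))  # ['LATIN']
print(sorted(scripts_by_name(spoof)))    # ['CYRILLIC', 'LATIN']

print(is_well_formed_utf8(b"/"))         # True
print(is_well_formed_utf8(b"\xc0\xaf"))  # False: overlong encoding of "/"
```

Mixed-script detection flags the spoofed name without needing a table of look-alike pairs, though it cannot catch whole-script confusables; those need the confusables data.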
Internationalized Domain Names
One instance of broad problem
Many RFCs use Nameprep – limited to Unicode 3.2
Unicode recommendations
Narrow the repertoire: exclude symbols, punctuation
Expand the coverage: currently only Unicode 3.2.
IETF idn-nextsteps published
Some positive developments, but misreads Unicode,
needs more work
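For a feel of what the existing IDN mechanism does, Python's standard library includes `punycode` and `idna` codecs; the latter implements the original IDNA (RFC 3490) with Nameprep. The domain below is an arbitrary example, not one from the talk.

```python
label = "b\u00fccher"  # "bücher"

# Punycode encodes the non-ASCII part of a single label.
print(label.encode("punycode"))  # b'bcher-kva'

# The idna codec applies Nameprep, then Punycode, then adds the "xn--" prefix per label.
encoded = (label + ".example").encode("idna")
print(encoded)                 # b'xn--bcher-kva.example'
print(encoded.decode("idna"))  # bücher.example
```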
URL → IRI
International Resource Identifier (IRI)
UTF-8, %-escaped
Example:
http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html
http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D... %E8%B1%86.html
See http://ietf.org/rfc/rfc3987.txt
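The IRI-to-URI mapping for the path in the example can be reproduced with Python's `urllib.parse.quote`, which encodes to UTF-8 and %-escapes the resulting bytes. This sketch handles only the path; converting the host part of an IRI is a separate (IDN) step.

```python
from urllib.parse import quote, unquote

iri_path = "/International/articles/idn-and-iri/JP納豆/引き割り納豆.html"

# quote() encodes as UTF-8 and %-escapes non-ASCII bytes; "/" is left alone by default.
uri_path = quote(iri_path)

print(quote("納豆"))  # %E7%B4%8D%E8%B1%86
print(uri_path.startswith("/International/articles/idn-and-iri/JP%E7%B4%8D%E8%B1%86/"))  # True

# The mapping is reversible.
print(unquote(uri_path) == iri_path)  # True
```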
Ideographic Variation Database
U+82A6 ashi: multiple forms
The first occurrence – any glyph
Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
Registration for variants
Ideographic Variation Database
Variation Selector
Identifies a restriction on the appearance of a character
Character + Variation Selector = Variation Sequence
Han ideographs
Impossible to build a single collection for everyone:
requirements from scholars, governments and publishers…
Instead, registration of multiple independent collections
Unicode Ideographic Variation Database
A given variation sequence is used in at most one collection
Makes interchange of variation sequences reliable.
Registration, not Assessment
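A variation sequence and the "at most one collection" rule can be sketched as follows. The selector chosen here (U+E0100) and the collection names are hypothetical; which selector corresponds to "form #4" of U+82A6 depends on an actual registration, and the `VariationRegistry` class is invented purely to illustrate the uniqueness rule.

```python
import unicodedata

ASHI = "\u82a6"      # the Han ideograph from the earlier slide
VS17 = "\U000e0100"  # first of the supplementary variation selectors

sequence = ASHI + VS17  # character + variation selector = variation sequence
print(len(sequence))            # 2 code points
print(unicodedata.name(VS17))   # VARIATION SELECTOR-17

class VariationRegistry:
    """Toy registry: a given variation sequence may belong to only one collection."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}

    def register(self, collection: str, seq: str) -> None:
        owner = self._owner.setdefault(seq, collection)
        if owner != collection:
            raise ValueError(f"sequence already registered in {owner!r}")

registry = VariationRegistry()
registry.register("Collection-A", sequence)      # hypothetical collection name
try:
    registry.register("Collection-B", sequence)  # second collection is rejected
except ValueError as err:
    print(err)  # sequence already registered in 'Collection-A'
```

The registry checks only uniqueness and never judges whether a glyph variant is worthwhile, which matches the slide's "Registration, not Assessment".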
ICU 3.6
Mature, portable C/C++/Java int’l libraries
Unicode 5.0, UCA 5.0, CLDR 1.4
ICU4C
Charset Detection
Improved: Time Zones, Thai word break, UText (64 bit),
Performance, Data Management,…
ICU4J
Globalization Preferences
Flexible date/time formats*, Charset conversion*
Near-Term Issues
Unicode 5.0.1, Unicode 5.1
CLDR / BCP 47bis
LDAP
Collation Registry
IANA Charset Registry
Unicode 5.1 - possibilities
Characters
CJK Unified Ideographs Extension C
Minority Scripts: Cham and Lanna
Malayalam chillu
…
Properties/Behavior
Normalization process for stable strings
…
CLDR 1.5 / BCP 47bis
CLDR 1.5
Data Submission Starting November
New structures / data
BCP 47
Adding ~7,000 (!) new language subtags
Possibly other changes…
LDAP
Now has definitive comparison (good)
Stuck at Unicode 3.2 (bad)
http://www.ietf.org/rfc/rfc4518.txt
Collation Registry
Nearing approval
Adds ability to register comparisons
Workable for basic cases
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt
IANA Charset registry
Currently limited usefulness
Ill-defined
Missing mapping tables
Incomplete
Inaccurate
Regime Change
Hope for future improvements!