Transcript Slide 1
Internationalization
An Introduction
Part I: Unicode and Character Encodings
License
This presentation and its associated materials are licensed under a
Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 License.
You may use these materials without obtaining permission from the author. Any materials used or redistributed must contain this notice.
[Derivative works may be permitted with permission of the author.] This work is copyright © 2008-2012 by Addison P. Phillips
Who is this guy?
• Globalization Architect, Lab126
  We make the technology behind the Kindle
• Chair, W3C Internationalization WG
The basics of text processing in software.
CHARACTER ENCODINGS
The Biggest Source of Woe “Character encodings consume more than 80% of my work day. They are the source of more mis-information and confusion than any other single thing. And developers aren’t getting any better educated.” Glen Perkins Globalization Architect
A Lot of Jargon
Multibyte Variable width Wide character Character encoding Coded character set Bidi or bidirectional Glyph, character, code unit Unicode kanji double-byte language extended ASCII ANSI, OEM encoding agnostic
011010010101 001010010101 011101010101 010110100010 101011110101 010111011011 011010010101 001010010101 011101010101 010110100010 101011110101 010111011011
010000010101101101101000
Code Unit
01000001 0101101101101000
byte (0x41)
A unit of physical storage and information interchange.
Other code units exist (16-bit, 32-bit, etc.)
À
Glyph
à à à à à à à à à àà à à à
नि
じ
A خ
A single shape (in text)
Grapheme
À नि
じ
A خ
A single visual unit of text: the smallest abstract unit of meaning in a writing system.
À
Character
ि न नि
じ
A خ
A single logical unit of text
À
Character Set
Abstract Character Repertoire A set of characters
À U+00C0
Coded Character Set
Code Point A set of characters in which each character is assigned a numeric identifier.
À U+00C0
Character Encoding Form
11000011 10000000 0xC3 0x80 UTF-8 Maps code points to code units
À U+00C0
11000011 10000000 0xC3 0x80 UTF-8
All text has a character encoding*
In memory, on disk, on the network, etc.
*(the most important slide in this presentation)
When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding.
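A minimal Python sketch of that debugging question, using the À example from the earlier slides. The helper name matches_encoding is made up for illustration; it only checks whether the bytes are valid in a given encoding.

```python
# Illustrative only: the sample text and helper name are assumptions.
text = "\u00c0"                  # À
data = text.encode("utf-8")      # the two bytes 0xC3 0x80


def matches_encoding(raw: bytes, encoding: str) -> bool:
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_ok = matches_encoding(data, "utf-8")      # the bytes are valid UTF-8
ascii_ok = matches_encoding(data, "ascii")     # both bytes are >= 0x80, so not ASCII
latin1_view = data.decode("iso-8859-1")        # decodes, but to the wrong characters
```

Note that ISO 8859-1 accepts every byte value, so a clean decode alone does not prove the encoding guess was right.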
Counting Things
Be aware of whether you need to count graphemes, characters, or bytes (code units):
– Is the limit “screen positions”, “characters”, or “bytes of storage”?
– Should you be using a different limit? Which one are you actually counting?
varchar(110)
यूनिकोड = य ◌ू न ◌ि क ◌ो ड
E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1
(4 glyphs) (7 characters) (21 bytes)
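The three counts for this word can be checked with a short Python sketch. The glyph count here is only a rough approximation (it counts code points that are not combining marks); real grapheme segmentation follows UAX #29, which Python's standard library does not implement.

```python
import unicodedata

word = "\u092f\u0942\u0928\u093f\u0915\u094b\u0921"  # यूनिकोड

byte_count = len(word.encode("utf-8"))   # bytes of storage in UTF-8
code_point_count = len(word)             # Python str length counts code points

# Assumption: treat every non-mark code point as the start of a visual unit.
# This works for this word but is not a general grapheme algorithm.
approx_glyph_count = sum(
    1 for ch in word if not unicodedata.category(ch).startswith("M")
)
```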
Common Encoding Problems
Tofu
hollow boxes
Mojibake
garbage characters
Question Marks
(conversion not supported)
It can happen to anyone…
• Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example)
• Not usually a bug: it’s a display problem
• Can mask or masquerade as character corruption.
Tofu
Mojibake When Good Characters Go Bad
Sources of Mojibake
• View text using the wrong encoding
• Apply a transfer encoding and forget to remove it
• Convert to an encoding twice
• Convert to or from the wrong encoding
• Overzealous escaping
• Conversion to entities (“entitization”)
• Multiple conversions
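Two of these sources, viewing text in the wrong encoding and converting twice, can be reproduced in a few lines of Python. This is a sketch with a made-up sample character; windows-1252 stands in for "the wrong encoding".

```python
original = "\u00c0"                               # À
utf8_bytes = original.encode("utf-8")             # 0xC3 0x80

# Wrong view: UTF-8 bytes interpreted as windows-1252 give two characters.
wrong_view = utf8_bytes.decode("windows-1252")    # 'Ã€'

# Double conversion: the mojibake is encoded to UTF-8 a second time.
double_encoded = wrong_view.encode("utf-8")

# Undoing it means reversing each step in the opposite order.
repaired = double_encoded.decode("utf-8").encode("windows-1252").decode("utf-8")
```

The repair only works because nothing was lost along the way; once a conversion substitutes question marks, the original bytes are gone.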
Character Encoding Forms
Their theory, structure, and use
EBCDIC
ASCII
• 7 bits = 2^7 = 128 characters
• Enough for “U.S. English”
Latin-1 (ISO 8859-1)
ASCII for characters 0x00 through 0x7F
Accented letters and other symbols 0x80 through 0xFF
Code Page
Originally an IBM character encoding term. IBM numbered their character sets with “CCSIDs” (coded character set ids) and numbered the corresponding character encoding forms as “code pages”.
Microsoft borrowed code pages to create PC-DOS.
Microsoft defines two kinds of code pages: “ANSI” code pages are the ones used by Windows GUI programs.
“OEM” code pages are the ones used by command shell/command line programs.
Neither “ANSI” nor “OEM” refers to a particular encoding standard or standards body in this context. Avoid the use of ANSI and OEM when referring to encodings.
windows-1252
Windows’s encodings (called “code pages”) are generally based on standard encodings, plus some additional characters.
Beyond Single Byte Encodings
• So far we’ve been looking at single byte encodings: one byte per character
  – 1 byte = 1 character (= 1 glyph?)
  – 256 character maximum
  – Good enough for most alphabetic languages
À
• Some languages need more characters.
What about the “double-byte” languages?
Don’t those take two bytes per character?
丏丣並
Beyond Single-Byte
• Escape sequences to select another character set
  – Example: ISO 2022 uses escape sequences to select various encodings
• Use a larger code unit (“wide” character encoding)
  – Example: IBM DBCS code pages or Unicode UTF-16
  – 2^16 = 64K characters
  – 2^32 = 4.3 billion characters
• Use a variable-width encoding
Variable width encodings use different numbers of code units to represent different types of characters within the same encoding form.
Multibyte Encodings
• One or more bytes per character
  – 1 byte != 1 character
  – May use 1, 2, 3, or 4 bytes per character: the maximum number of bytes per character varies by encoding form.
  – May use shift or escape sequences
  – May encode more than one character set
• Single-byte encodings are a special case of multibyte!
Multibyte Encoding: “variable-width” encoding that uses the byte as its code unit.
Simple Multibyte Encoding Forms
Specific byte ranges encode characters that take more than one byte.
– A “lead byte”
– One or more “trailing bytes”
Code point != code unit
A = 0x41 (single byte)
あ = 0x82 0xA0 (lead byte, trail byte)
Shift-JIS: A Multibyte Encoding
• In order to reach more characters, Shift-JIS characters start with a limited range of “lead bytes”
• These can be followed by a larger range of byte values (“trail byte”)
Shift-JIS
• Lead bytes can be trail byte values
• Trail bytes include ASCII values
• Trail bytes include special values such as 0x5C (“\”)
Shift-JIS
char *pos = strchr(mybuf, '@');  /* may match a trail byte (0x40), not only a real '@' */
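The same pitfall can be sketched in Python by searching the raw Shift-JIS bytes instead of the decoded text. The sample characters are chosen for illustration: ァ encodes with trail byte 0x40 ("@") and ソ with trail byte 0x5C ("\").

```python
text = "\u30a1"                            # ァ, which contains no '@' at all
encoded = text.encode("shift_jis")         # lead byte 0x83, trail byte 0x40

# Naive byte search, comparable to strchr on a char buffer: false positive.
naive_hit = encoded.find(b"@")

# Searching the decoded characters gives the right answer (not found).
correct_hit = encoded.decode("shift_jis").find("@")

# The backslash case: ソ ends in 0x5C.
so_bytes = "\u30bd".encode("shift_jis")
```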
More Complex Multibyte Systems
Example: IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters]
Example: ISO 2022 [escape sequence changes character set being encoded]
Ad hoc and Font Encodings
Encoding Conversion
[Diagram: Templates (ISO 8859-1), Content (UTF-8), and Data (Shift_JIS) feed a Process step that produces Output (HTML, XML, etc.)]
Document formats often require a single character encoding be used for all parts of the document.
Common Encoding Conversion Tools and Libraries
• iconv (Unix)
• ICU (C, C++, Java)
• perl Encode
• Java (native2ascii, IO/NIO)
• (etc.)
Encoding Conversion as Filter
ISO 8859-1: ÀàС£
UTF-8: детски »èçسني 文字
ISO 8859-1: ÀàС£ ?????? »èç ????? ????
UTF-8: ÀàС£ ?????? »èç ????? ????
Shift_JIS: 文字化け
? (0x3F) is the replacement character for ISO 8859-1.
Encoding conversion acts as a “filter”:
– Replacement characters (“question marks”) replace characters from the source character set that are not present in the target character set.
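The filter effect is easy to reproduce. A small Python sketch, with a sample string assumed for illustration: converting to ISO 8859-1 with replacement turns every character outside that character set into 0x3F, and converting back cannot restore them.

```python
source = "\u00c0\u00e0 детски 文字"     # Latin-1 letters, Cyrillic, Han

# errors="replace" substitutes '?' (0x3F) for anything ISO 8859-1 lacks.
as_latin1 = source.encode("iso-8859-1", errors="replace")

# The round trip keeps À and à but the other characters are gone for good.
round_trip = as_latin1.decode("iso-8859-1")
```

With the default errors="strict", the same conversion raises UnicodeEncodeError instead of silently filtering.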
Too Many Fish in the Sea
• Need for more converters and conversion maps
• Difficulty of passing, storing, and processing data in multiple encodings
• Too many character sets…
Unicode / ISO-10646
The Idea Behind Unicode
• A universal character set to encode the world’s scripts.
• Encodes characters (not glyphs).
• Consistent interchange and interpretation: one set of rules for all text, everywhere
Unicode: the Universal Character Set
An organized collection of characters.
A “coded character set”: each character has a code point, aka Unicode Scalar Value (USV)
U+0041 <= hex notation
Unicode or ISO 10646?
Unicode and ISO 10646 are maintained in sync.
– Unicode is maintained by an industry consortium.
– ISO 10646 is maintained by the ISO.
Unicode
• Code space of up to 0x10FFFF (about 1.1 million) characters
• Currently encodes 110,116 characters
The Unicode Standard
• Core Standard (TUS)
• Reports (http://www.unicode.org/reports)
  – Unicode Standard Annexes (UAX)
  – Unicode Technical Standards (UTS)
  – Unicode Technical Reports (UTR)
• Unicode Character Database (UCD)
• Unicode Technical Notes (UTN) [not part of standard]
Encodes the World ’s Scripts
• Modern scripts
• Historical scripts
• Ancient and extinct scripts
• Minority languages
• Some fun stuff too!
Characters, Not Glyphs
[Figure: “Aa” repeated many times across the slide]
Characters, Not Glyphs: Han Unification
• “Unihan” unifies abstract Han ideographs, even if specific writing traditions (such as Japanese kanji vs. Simplified Chinese) appear different.
Encoding Work Continues
Unicode 6.1 added 732 characters, including several new scripts.
… but the pace of change has slowed and most living scripts are encoded.
Planes
Unicode is divided into “planes” of code points
• 17 planes (0 through 0x10)
• 64K (65,536) code points per plane
• Plane 0 is called the Basic Multilingual Plane (BMP).
• Planes 1 through 0x10 are called supplementary planes:
  – Plane 1: supplementary multilingual plane (SMP)
  – Plane 2: supplementary ideographic plane (SIP)
  – Plane 14: supplementary special-purpose plane (SSP)
  – Planes 15, 16: private use
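Because each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A small Python sketch; the function name plane_of is made up for illustration.

```python
CODE_POINTS_PER_PLANE = 0x10000                 # 65,536
PLANE_COUNT = 17                                # planes 0 through 0x10
CODE_SPACE_SIZE = PLANE_COUNT * CODE_POINTS_PER_PLANE


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16
```

For example, plane_of(0x0041) is 0 (the BMP) and plane_of(0x10338) is 1 (the SMP).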
Scripts and Blocks
• Most characters belong to a script, a distinct writing system.
  – Some characters, such as many of the punctuation characters, are used by multiple scripts
• Characters are assigned in Unicode to blocks.
  – Most blocks are used to encode (and named for) a specific script.
  – Some scripts have multiple blocks (Latin, Han ideographs)
Unicode Blocks
• Ranges of code points allocated together for assignment.
• Not all code points in a block are assigned (some reserved for future assignment)
Unicode Blocks
• See: http://www.unicode.org/charts
• Block names are stabilized and not always fully indicative of block usage
  Example: Phags-pa block
Various Character Types
• Unicode Controls
• Compatibility Characters
• Byte Order Mark
• Replacement Character
• Combining Marks
• Variation Selectors
• Private Use
• Surrogates
Combining Marks
Composition can create “new” characters
Base + non-spacing (“combining”) characters
A + ˚ = Å    U+0041 + U+030A = U+00C5
a + ˆ + . = ậ    U+0061 + U+0302 + U+0323 = U+1EAD
a + . + ˆ = ậ    U+0061 + U+0323 + U+0302 = U+1EAD
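These compositions can be checked with Python's standard unicodedata module. A brief sketch: Normalization Form C composes a base plus combining marks into the precomposed character where one exists, and the two mark orders for ậ compose to the same result because the marks have different combining classes.

```python
import unicodedata

# A + COMBINING RING ABOVE composes to Å (U+00C5).
a_ring = unicodedata.normalize("NFC", "A\u030a")

# a + circumflex + dot below, in either order, composes to ậ (U+1EAD).
circumflex_first = unicodedata.normalize("NFC", "a\u0302\u0323")
dot_first = unicodedata.normalize("NFC", "a\u0323\u0302")
```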
Combining Marks: Thai
Unicode คืออะไร?
คื = ค + ◌ื
glyph = base + vowel modifier
Combining Marks: Devanagari
यूनिकोड क्या है?
यू नि को ड
य ◌ू न ◌ि क ◌ो ड
न + ◌ि = नि
Combining Marks: Tamil
யூனிக்கோடு என்றால் என்ன?
யூ னி க் கோ டு
கோ = க + ◌ோ (U+0B95 U+0BCB)
Byte Order Mark (BOM)
U+FEFF
• Used to indicate the “byte-order” of UTF-16 code units
  – 0xFE FF; 0xFF FE
• Also used as a Unicode signature by some software (Windows’s Notepad editor, for example) for UTF-8
  – 0xEF BB BF
• Appears as a character or renders as junk in some formats or on some systems.
• Has an annoying secondary meaning: “zero width non-breaking space”
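A short Python sketch of the BOM byte patterns and of one way to strip a UTF-8 signature. The codec name "utf-8-sig" is Python-specific; other environments handle the signature differently.

```python
import codecs

# U+FEFF serialized in each byte order, plus the UTF-8 signature.
bom_utf16_be = "\ufeff".encode("utf-16-be")
bom_utf16_le = "\ufeff".encode("utf-16-le")
bom_utf8 = codecs.BOM_UTF8

with_signature = bom_utf8 + "text".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as a leading U+FEFF character.
kept = with_signature.decode("utf-8")

# The "utf-8-sig" codec removes a leading signature if one is present.
stripped = with_signature.decode("utf-8-sig")
```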
The Replacement Character
U+FFFD
Indicates a bad byte sequence or a character that could not be converted.
Equivalent to “question marks” in legacy encoding conversions
�: there was a character here, but it is gone now
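In Python, decoding with errors="replace" produces U+FFFD for a bad byte sequence. A minimal sketch with a made-up input (0xFF can never appear in well-formed UTF-8):

```python
bad_bytes = b"abc\xffdef"

# The invalid byte becomes U+FFFD; the surrounding text is preserved.
decoded = bad_bytes.decode("utf-8", errors="replace")
replacement_count = decoded.count("\ufffd")
```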
Compatibility Characters
Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings: ①②③45Ⅵ ¾ Lj ¼ Nj ½ dž ︴︷︻︽﹁﹄ ヲィゥォェュ゙ ﷺ ﵬ ﺽ ﻫ ﺳ ﺲ fi fl ffi ffl ſt ﬔ
Compatibility Characters
• Includes presentation forms
Legacy encoding: a term for non-Unicode character encodings.
Half and Full Width Forms
• Compatibility characters for East Asian legacy encodings that vary in character “width”
Half width forms
ヲァィゥェォャ Abcdefh
Full width forms
ァアィイゥ ぁあぃいぅ ABCDefg
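The width property and the compatibility mapping of these forms are available through Python's unicodedata module. A sketch with two sample characters chosen for illustration:

```python
import unicodedata

half_wo = "\uff66"     # HALFWIDTH KATAKANA LETTER WO
full_a = "\uff21"      # FULLWIDTH LATIN CAPITAL LETTER A

# East Asian Width property (UAX #11): 'H' = halfwidth, 'F' = fullwidth.
half_width_class = unicodedata.east_asian_width(half_wo)
full_width_class = unicodedata.east_asian_width(full_a)

# Compatibility normalization folds both onto their ordinary counterparts.
half_folded = unicodedata.normalize("NFKC", half_wo)   # ヲ (U+30F2)
full_folded = unicodedata.normalize("NFKC", full_a)    # A (U+0041)
```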
Variation Selectors
UTS#37 defines the Ideographic Variation Database (IVD)
Unicode Controls
Private Use
Surrogate Code Points
• Reserved code points in two blocks needed for the UTF-16 character encoding.
  – Don’t encode characters
  – Never to be used as characters on their own
Unicode Properties
• code point
• name
• character class
• combining level
• bidi class
• case mappings
• canonical decomposition
• mirroring
• default grapheme clustering
The Unicode Character Database (UCD)
ӑ (U+04D1)
• CYRILLIC SMALL LETTER A WITH BREVE
• letter
• non-combining
• left-to-right
• decomposes to U+0430 U+0306
• Ӑ U+04D0 is uppercase (and titlecase)
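Python ships a copy of the UCD in its unicodedata module, so the properties listed for U+04D1 can be looked up directly. A small sketch (the exact UCD version depends on the Python release):

```python
import unicodedata

ch = "\u04d1"                                        # ӑ

name = unicodedata.name(ch)
general_category = unicodedata.category(ch)          # 'Ll' = lowercase letter
combining_class = unicodedata.combining(ch)          # 0 = non-combining
bidi_class = unicodedata.bidirectional(ch)           # 'L' = left-to-right
decomposition = unicodedata.decomposition(ch)        # code points as hex text
uppercase = ch.upper()
```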
UNICODE'S ENCODING FORMS
Unicode Encoding Forms
• UTF-32
  – Uses 32-bit code units.
  – All characters are the same width.
• UTF-16
  – Uses 16-bit code units.
  – BMP characters use one 16-bit code unit.
  – Supplementary characters use two special 16-bit code units: a “surrogate pair”.
• UTF-8
  – Uses 8-bit code units (bytes!)
  – It’s a multi-byte encoding!
  – Characters use between 1 and 4 bytes.
  – ASCII is ASCII in UTF-8
Unicode Encodings Compared
Character      UTF-32       UTF-16          UTF-8
A (U+0041)     0x00000041   0x0041          0x41
À (U+00C0)     0x000000C0   0x00C0          0xC3 0x80
ቑ (U+1251)     0x00001251   0x1251          0xE1 0x89 0x91
𐌸 (U+10338)    0x00010338   0xD800 0xDF38   0xF0 0x90 0x8C 0xB8
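The comparison can be regenerated with Python's built-in codecs. This sketch uses the big-endian variants so that no byte order mark is added; the helper name hex_bytes is made up for illustration.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of ch as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))


samples = ["A", "\u00c0", "\u1251", "\U00010338"]

# One row per character: (UTF-32, UTF-16, UTF-8) byte sequences.
table = {
    ch: (
        hex_bytes(ch, "utf-32-be"),
        hex_bytes(ch, "utf-16-be"),
        hex_bytes(ch, "utf-8"),
    )
    for ch in samples
}
```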
UTF-32
• Uses 32-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)
• Each character takes exactly one code unit.
U+1251 ቑ 0x00001251
U+10338 𐌸 0x00010338
Advantages and Disadvantages of UTF-32
• Easy to process
  – each logical character takes one code unit
  – can use pointer arithmetic
• Not as commonly used
  – Not efficient for storage
    • 11 bits are never used
    • BMP characters are the most common: 16 bits wasted for each of these
  – Affected by processor architecture (Big-Endian vs. Little-Endian)
  – Disallowed for HTML5
UTF-16
• Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)
  – BMP characters use one unit
  – Supplementary characters use a “surrogate pair”, special code points that don’t do anything else.
U+1251 ቑ 0x1251
U+10338 𐌸 0xD800 0xDF38
High Surrogate: 0xD800-DBFF
Low Surrogate: 0xDC00-DFFF
Unique Ranges!
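The surrogate pair for a supplementary code point follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A Python sketch with made-up function names:

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Because the high and low ranges do not overlap each other or any other character, a program can tell from a single code unit whether it is looking at the first or second half of a pair.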
Advantages and Disadvantages of UTF-16
• Most common languages and scripts are encoded in the BMP.
  – Less wasteful than UTF-32
  – Simpler to process (excepting surrogates)
  – Commonly supported in major operating environments, programming languages, and libraries
• May not be suitable for all applications
  – Affected by processor architecture (Big-Endian vs. Little-Endian)
  – Requires more storage, on average, for Western European scripts, ASCII, HTML/XML markup.
UTF-8
• 7-bit ASCII is itself
• All other characters take 2, 3, or 4 bytes each
  – lead bytes have a special pattern
  – trailing bytes range from 0x80 -> 0xBF
Lead Byte   Trail Bytes                  Corresponding Code Point
0xxxxxxx                                 < 0x80
110xxxxx    10xxxxxx                     < 0x800
1110xxxx    10xxxxxx 10xxxxxx            < 0x10000
11110xxx    10xxxxxx 10xxxxxx 10xxxxxx   Supplementary
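The bit patterns translate directly into code. The following Python sketch is a hand-written encoder for illustration only (real code should use the built-in codec); the function name utf8_encode is an assumption.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the UTF-8 lead/trail byte patterns."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                                   # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110xxxx + 2 trail bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                                  # 11110xxx + 3 trail bytes
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```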
Advantages and Disadvantages of UTF-8
Advantages:
• ASCII-compatible
• Default or recommended encoding for many Internet standards
• Bit pattern highly detectable (over longer runs)
• Non-endian
• Streaming
• C char* friendly
• Easy to navigate
Disadvantages:
• Multibyte encoding requires additional processing awareness
• Non-shortest form checking needed
• Less efficient than UTF-16 for large runs of Asian text
HTML
• Set Web server to declare UTF-8 in HTTP Content-Type header
• Declare UTF-8 in META tag header
• Actually use UTF-8 as the encoding!!
Вибір і застосування кодування
WORKING WITH UNICODE
It’s more than just a character set and some encodings…
Unicode Properties, Annexes, and Standards
Unicode provides additional information:
• Character name
• Character class
• “ctype” information, such as if it’s a digit, number, alphabetic, etc.
• Directionality (LTR, RTL, etc.) and the Bidi Algorithm
• Case mappings (UPPER, lower, and Titlecase)
• Default Collation and the Unicode Collation Algorithm (UCA)
• Identifier names
• Regular Expression syntaxes
• Normalization
• Compatibility information
Many of these items are in the form of Unicode Technical Reports: http://www.unicode.org/reports
Unicode 6.1 Annexes
9   Unicode Bidirectional Algorithm
11  East Asian Width
14  Unicode Line Breaking Algorithm
15  Unicode Normalization Forms
24  Unicode Script Property
29  Unicode Text Segmentation
31  Unicode Identifier and Pattern Syntax
34  Unicode Named Character Sequences
38  Unicode Han Database (Unihan)
41  Common References for Unicode Standard Annexes
42  Unicode Character Database in XML
44  Unicode Character Database
Abc ABC abc abC aBc
UAX#15: Normalization
Unicode Normalization has to deal with more issues:
• single or multiple combining marks
• compatibility characters
• presentation forms
abc
Ǻ
U+01FA
U+00C5 U+0301
U+00C1 U+030A
U+212B U+0301
U+0041 U+0301 U+030A
U+0041 U+030A U+0301
Four Normalization Forms
Ways to represent Ǻ: U+01FA; U+00C5 U+0301; U+00C1 U+030A; U+212B U+0301; U+0041 U+0301 U+030A; U+0041 U+030A U+0301
• Form D: canonical decomposition
• Form C: canonical decomposition followed by composition
• Form KD: kompatibility decomposition
• Form KC: kompatibility decomposition followed by composition
Ǻ Original U+01FA U+00C5 U+0301 U+00C1 U+030A U+212B U+0301 U+0041 U+0301 U+030A U+0041 U+030A U+0301
Normalization in Action
Form C U+01FA U+01FA U+01FA U+212B U+0301 U+0041 U+0301 U+030A U+0041 U+0301 U+030A U+0041 U+0301 U+030A U+212B U+0301 U+01FA U+01FA U+01FA U+01FA U+01FA U+01FA Form D Form KC U+0041 U+0301 U+030A U+0041 U+0301 U+030A U+01FA U+01FA Form KD U+0041 U+0301 U+030A U+0041 U+0301 U+030A U+0041 U+0301 U+030A U+0041 U+0301 U+030A U+0041 U+0301 U+030A U+0041 U+0301 U+030A
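The forms can be tried out with Python's unicodedata.normalize. This is a sketch using some of the Ǻ sequences from the slides plus the ﬁ ligature as a compatibility example; the helper name code_points is made up. One thing it shows: the ring and the acute accent have the same combining class, so their relative order is significant and the acute-first sequence does not normalize to U+01FA.

```python
import unicodedata


def code_points(s: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


ring_then_acute = [
    "\u01fa",            # precomposed Ǻ
    "\u00c5\u0301",      # Å + combining acute
    "\u212b\u0301",      # ANGSTROM SIGN + combining acute
    "A\u030a\u0301",     # A + combining ring + combining acute
]
acute_then_ring = "A\u0301\u030a"   # same marks, opposite order

nfc = [unicodedata.normalize("NFC", s) for s in ring_then_acute]
nfd = [unicodedata.normalize("NFD", s) for s in ring_then_acute]
nfc_other_order = unicodedata.normalize("NFC", acute_then_ring)

# Compatibility forms also fold compatibility characters such as ﬁ (U+FB01).
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```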
Normalization: Not a Panacea
Not all compatibility characters have a compatibility decomposition.
Not all characters that look alike or have similar semantics have a compatibility decomposition.
– For example, there are many ‘ dots ’ used as a period.
Not all character variations are handled by normalization.
– For example, upper, title, and lowercase variations.
Normalization can remove meaning
A Bit of Bidi
UAX#9: Unicode Bidirectional Algorithm
Bi-directional Scripts
• Some scripts are written predominantly from left-to-right (LTR).
• Some scripts are written predominantly from right-to-left (RTL).
Other Writing Directions
Writing direction is a separate consideration from text direction. Both of the texts shown here are “left-to-right”.
Character Direction
• Unicode defines a character’s direction
  – Left-to-right
  – Right-to-left
  – Neutral
• Characters can be “weakly” or “strongly” directional
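Each character's bidi class is a UCD property and can be read with unicodedata.bidirectional in Python. A sketch with sample characters chosen for illustration; L, R, and AL are the strong classes, EN is a weak class, and WS and ON are neutral.

```python
import unicodedata

samples = {
    "latin_a": "A",          # strong left-to-right
    "hebrew_alef": "\u05d0", # strong right-to-left
    "arabic_khah": "\u062e", # strong right-to-left (Arabic letter)
    "digit_one": "1",        # weak: European number
    "space": " ",            # neutral: whitespace
    "exclamation": "!",      # neutral: other
}

bidi_classes = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}
```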
Unicode Bidi Algorithm
• Depends on “base direction”
• Breaks text into “runs”
Embedding and “Logical Order”
Characters are encoded in logical order.
Visual order is determined by the layout.
– Override and bidi control characters
– “Indeterminate” characters
Bidirectional Embedding
Paste in Arabic
Unicode Controls and Markup
Natural Language Processing
Unicode Collation Algorithm
• Defines a collation algorithm (UTS#10)
• Defines “DUCET” (Default Unicode Collation Element Table)
  – Must be tailored by language and “locale” (culture) and other variations (maintained by CLDR):
Language         Swedish: z < ö               German: ö < z
Usage            German Dictionary: of < öf   German Telephone: öf < of
Customizations   Upper-first: A < a           Lower-first: a < A
Line Breaking (UAX#14)
Defines rules for general-purpose non-dictionary line-breaking.
– Tailored by language
– Doesn’t work for languages such as Thai that require morphological analysis (aka “a dictionary”)
Text Segmentation: Thai
ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท์
ญัตติ ที่ เสนอ ได้ ผ่าน ที่ประชุม ด้วย มติ เอกฉันท์
(word boundaries)
Text Segmentation (UAX#29)
Find grapheme, word, and sentence boundaries in text.
• Tailored by language
• Provides good basic default handling
Unicode Consortium
• Does some other things
  – CLDR: Common Locale Data Repository
  – ULI: Localization Interoperability
  – ISO 15924: Script registry
“That’s great: I’ll just use Unicode”
Remember “all text has an encoding”?
• user input via forms
• email
• data feeds
• existing, legacy data
• database instances
• uploads
Use UTF-8 for HTML and Web forms
Use UTF-8 in your APIs
Check that data really is UTF-8
Control encoding via code; avoid hard-coding the encoding
Watch out for legacy encodings
Convert to Unicode as soon as practical.
Convert from Unicode as late as possible.
Wrap Unicode-unfriendly technologies
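One way to read this advice in code: decode at the point where bytes enter the system, work with Unicode text internally, and encode only when writing out. A minimal Python sketch; the function names to_text and from_text are hypothetical, and strict error handling is an assumption chosen so that data that is not really in the declared encoding fails loudly instead of turning into mojibake.

```python
def to_text(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Convert incoming bytes to Unicode text as early as possible.

    Raises UnicodeDecodeError if the bytes do not match the declared encoding.
    """
    return raw.decode(declared_encoding, errors="strict")


def from_text(text: str, target_encoding: str = "utf-8") -> bytes:
    """Convert Unicode text to bytes as late as possible (at output time)."""
    return text.encode(target_encoding, errors="strict")


# Legacy input is converted on the way in; everything downstream sees str.
legacy_input = "文字".encode("shift_jis")
internal = to_text(legacy_input, "shift_jis")
output = from_text(internal)                 # UTF-8 on the way out
```

The encoding is passed in as a parameter so that it can be captured from the data source rather than hard-coded.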
Map Your System
• APIs: use Unicode encoding; hide internal storage encoding
• Data Stores, Local I/O: use Unicode encoding; consider an encoding conversion plan
• Front Ends: use Unicode encoding
• Back Ends, External Data: Uses Unicode? If not, what encoding? Store the encoding!
[Diagram: Your System with a Unicode interface; labels: Input, Detect / Convert, Capture Encoding, Legacy Encoding, API, Unicode Cloud, Convert to Legacy]
SUMMARY
Character Encodings
• Code unit
• Code point
• Character
• Glyph/grapheme
• “All text has an encoding”
• Multibyte encoding
  – Tofu
  – Mojibake
  – Question Marks
Unicode
• 17 planes of goodness
  – 1.1 million potential code points
  – 150,000 assigned code points
• 3 encodings
  – UTF-32
  – UTF-16
  – UTF-8
• Unicode Standard, Annexes, and Reports
  – CLDR for language specific tailoring
• Unicode Character Database
Q&A
Would you write the code for I18N on the whiteboard before you go?
#define UNICODE
#import I18N.h