
Internationalization

An Introduction

Part I: Unicode and Character Encodings

License

This presentation and its associated materials are licensed under a

Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 License.

You may use these materials without obtaining permission from the author. Any materials used or redistributed must contain this notice.

[Derivative works may be permitted with permission of the author.] This work is copyright © 2008-2012 by Addison P. Phillips

Who is this guy?

• Globalization Architect, Lab126
  – We make the technology behind the Kindle
• Chair, W3C Internationalization WG

The basics of text processing in software.

CHARACTER ENCODINGS

The Biggest Source of Woe

“Character encodings consume more than 80% of my work day. They are the source of more mis-information and confusion than any other single thing. And developers aren’t getting any better educated.”
Glen Perkins, Globalization Architect

A Lot of Jargon

• Multibyte
• Variable width
• Wide character
• Character encoding
• Coded character set
• Bidi or bidirectional
• Glyph, character, code unit
• Unicode
• kanji
• double-byte language
• extended ASCII
• ANSI, OEM
• encoding agnostic

011010010101 001010010101 011101010101 010110100010 101011110101 010111011011 011010010101 001010010101 011101010101 010110100010 101011110101 010111011011

010000010101101101101000

Code Unit

01000001 0101101101101000
byte (0x41)

A unit of physical storage and information interchange.
Other code units exist (16-bit, 32-bit, etc.)

À

Glyph

à à à à à à à à à àà à à à

नि

A خ

A single shape (in text)

Grapheme

À नि

A خ

A single visual unit of text: the smallest abstract unit of meaning in a writing system.

À

Character

ि न नि

A خ

A single logical unit of text

À

Character Set

Abstract Character Repertoire A set of characters

À U+00C0

Coded Character Set

Code Point A set of characters in which each character is assigned a numeric identifier.

À U+00C0

Character Encoding Form

11000011 10000000 0xC3 0x80 UTF-8 Maps code points to code units

À U+00C0

11000011 10000000 0xC3 0x80 UTF-8

All text has a character encoding*

In memory, on disk, on the network, etc.

*(the most important slide in this presentation)

When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding.

Counting Things

Be aware of whether you need to count graphemes, characters, or bytes (code units):
– Is the limit “screen positions”, “characters”, or “bytes of storage”?
– Should you be using a different limit? Which one are you actually counting?

varchar(110)
यूनिकोड
यू नि को ड (4 glyphs)
य ू न ि क ो ड (7 characters)
E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1 (21 bytes)
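The three counts can be checked with a short sketch. Python is an assumption here (the slides do not name a language), and the grapheme count is only a rough stand-in: it counts characters that are not combining marks rather than applying full text segmentation rules.

```python
import unicodedata

text = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"  # यूनिकोड

# Characters (code points): what len() counts for a Python str.
code_points = len(text)

# Bytes of storage once the text is encoded as UTF-8.
utf8_bytes = len(text.encode("utf-8"))

# Rough grapheme count: each character whose general category is not a
# mark (Mn, Mc, Me) starts a new visual unit. This works for this word
# but is not a general-purpose segmentation algorithm.
graphemes = sum(
    1 for ch in text if not unicodedata.category(ch).startswith("M")
)

print(graphemes, code_points, utf8_bytes)
```

A column limit such as varchar(110) could mean any of these three numbers, which is why it matters to know which one the storage layer actually counts.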

Common Encoding Problems

Tofu

hollow boxes

Mojibake

garbage characters

Question Marks

(conversion not supported)

It can happen to anyone…

Tofu

• Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example)
• Not usually a bug: it’s a display problem
• Can mask or masquerade as character corruption.

Mojibake When Good Characters Go Bad

Sources of Mojibake
• View text using the wrong encoding
• Apply a transfer encoding and forget to remove it
• Convert to an encoding twice
• Convert to or from the wrong encoding
• Overzealous escaping
• Conversion to entities (“entitization”)
• Multiple conversions
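Two of these sources, viewing text in the wrong encoding and converting twice, can be reproduced in a few lines. This is an illustrative sketch in Python (my choice of language, not the presentation's), using a single accented letter as the sample.

```python
original = "\u00E9"                      # é
utf8_bytes = original.encode("utf-8")    # two bytes: 0xC3 0xA9

# Source: view UTF-8 bytes using the wrong encoding (ISO 8859-1).
viewed_wrong = utf8_bytes.decode("iso-8859-1")   # two characters: Ã©

# Source: convert to an encoding twice. The mis-decoded text is
# encoded as UTF-8 again, doubling the damage.
double_encoded = viewed_wrong.encode("utf-8")    # four bytes

# This particular damage can be undone because ISO 8859-1 assigns a
# character to every byte value, so no information was lost.
repaired = (
    double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
)
```

Repair of this kind only works when every intermediate step was lossless; once a converter has substituted question marks, the original bytes are gone.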

Character Encoding Forms

Their theory, structure, and use

EBCDIC

ASCII

• 7 bits = 2^7 = 128 characters
• Enough for “U.S. English”

Latin-1 (ISO 8859-1)

• ASCII for characters 0x00 through 0x7F
• Accented letters and other symbols 0x80 through 0xFF

Code Page

• Originally an IBM character encoding term. IBM numbered their character sets with “CCSIDs” (coded character set ids) and numbered the corresponding character encoding forms as “code pages”.
• Microsoft borrowed code pages to create PC-DOS.
• Microsoft defines two kinds of code pages:
  – “ANSI” code pages are the ones used by Windows GUI programs.
  – “OEM” code pages are the ones used by command shell/command line programs.
• Neither “ANSI” nor “OEM” refers to a particular encoding standard or standards body in this context.
• Avoid the use of ANSI and OEM when referring to encodings.

windows-1252

Windows’s encodings (called “code pages”) are generally based on standard encodings, plus some additional characters.

Beyond Single Byte Encodings

• So far we’ve been looking at single-byte encodings:
  – one byte per character
  – 1 byte = 1 character (= 1 glyph?)
  – 256 character maximum
  – Good enough for most alphabetic languages
• Some languages need more characters.

What about the “double-byte” languages? Don’t those take two bytes per character?

丏丣並

Beyond Single-Byte

• Escape sequences to select another character set
  – Example: ISO 2022 uses escape sequences to select various encodings
• Use a larger code unit (“wide” character encoding)
  – Example: IBM DBCS code pages or Unicode UTF-16
  – 2^16 = 64K characters
  – 2^32 = 4.3 billion characters
• Use a variable-width encoding

Variable-width encodings use different numbers of code units to represent different types of characters within the same encoding form.

Multibyte Encodings

• One or more bytes per character
  – 1 byte != 1 character
  – May use 1, 2, 3, or 4 bytes per character; the maximum number of bytes per character varies by encoding form.
  – May use shift or escape sequences
  – May encode more than one character set

Single-byte encodings are a special case of multibyte!

Multibyte Encoding: a “variable-width” encoding that uses the byte as its code unit.

Simple Multibyte Encoding Forms

Specific byte ranges encode characters that take more than one byte.

– A “lead byte”
– One or more “trailing bytes”

Code point != code unit
A   0x41        (single byte)
あ  0x82 0xA0   (lead byte, trail byte)

Shift-JIS: A Multibyte Encoding

• In order to reach more characters, Shift-JIS characters start with a limited range of “lead bytes”
• These can be followed by a larger range of byte values (“trail byte”)

Shift-JIS

• Lead bytes can be trail byte values
• Trail bytes include ASCII values
• Trail bytes include special values such as 0x5C (“\”)

Shift-JIS

char *pos = strchr(mybuf, '@'); /* byte-wise search: can match a Shift-JIS trail byte instead of a real '@' */
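The hazard behind that byte-wise search can be shown concretely. The following is a sketch in Python rather than C (an assumption made for brevity); it relies on the Shift-JIS encoding of the katakana letter ァ, whose trail byte is 0x40, the same value as ASCII “@”.

```python
text = "\u30A1"                     # ァ KATAKANA LETTER SMALL A
sjis = text.encode("shift_jis")     # lead byte 0x83, trail byte 0x40

# Byte-wise search, like strchr(): 0x40 is also ASCII '@', so this
# reports a match inside a two-byte character.
false_hit = sjis.find(b"@")

# Character-wise search: decode first, then search the decoded text.
real_hit = sjis.decode("shift_jis").find("@")

print(false_hit, real_hit)
```

The same pattern applies to 0x5C: a byte-wise scan for a backslash can land on the second half of a character.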

More Complex Multibyte Systems
• Example: IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters]
• Example: ISO 2022 [escape sequence changes character set being encoded]

Ad hoc and Font Encodings

Encoding Conversion

[Diagram: Templates (ISO 8859-1), Content (UTF-8), and Data (Shift_JIS) feed a Process that produces Output (HTML, XML, etc.)]

Document formats often require a single character encoding be used for all parts of the document.

Common Encoding Conversion Tools and Libraries

• iconv (Unix)
• ICU (C, C++, Java)
• Perl Encode
• Java (native2ascii, IO/NIO)
• (etc.)

Encoding Conversion as Filter

ISO 8859-1

ÀàС£

UTF-8

детски »èçسني 文字

ISO 8859-1

ÀàС£ ??????

»èç ?????

????

UTF-8

ÀàС£ ??????

»èç ?????

????

Shift_JIS

文字化け

? (0x3F) is the replacement character for ISO 8859-1

Encoding conversion acts as a “filter”
– Replacement characters (“question marks”) replace characters from the source character set that are not present in the target character set.
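The filter effect is easy to observe. This sketch uses Python's built-in codecs (an assumed toolchain; the tools listed earlier behave similarly in spirit) and converts mixed-script text to ISO 8859-1 with replacement enabled.

```python
source = "\u00C0\u00E0 детски 文字"

# errors="replace" substitutes '?' (0x3F) for every character that the
# target character set cannot represent.
latin1 = source.encode("iso-8859-1", errors="replace")

# Decoding the result shows what survived the filter.
survivors = latin1.decode("iso-8859-1")
print(survivors)
```

Only the two Latin letters and the spaces make it through; the Cyrillic and Han characters come out as one question mark each, and nothing downstream can recover them.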

Too Many Fish in the Sea

• Need for more converters and conversion maps
• Difficulty of passing, storing, and processing data in multiple encodings
• Too many character sets …

Unicode / ISO-10646

The Idea Behind Unicode

• A universal character set to encode the world’s scripts.
• Encodes characters (not glyphs).
• Consistent interchange and interpretation: one set of rules for all text, everywhere

Unicode: the Universal Character Set

An organized collection of characters.

A “coded character set”: each character has a code point, aka Unicode Scalar Value (USV), written in hex notation (for example, U+0041).

Unicode or ISO 10646?

Unicode and ISO 10646 are maintained in sync.
– Unicode is maintained by an industry consortium.
– ISO 10646 is maintained by the ISO.

Unicode

• Code space of up to 0x10FFFF (about 1.1 million) characters
• Currently encodes 110,116 characters

The Unicode Standard

• Core Standard (TUS)
• Reports (http://www.unicode.org/reports)
  – Unicode Standard Annexes (UAX)
  – Unicode Technical Standards (UTS)
  – Unicode Technical Reports (UTR)
• Unicode Character Database (UCD)
• Unicode Technical Notes (UTN) [not part of standard]

Encodes the World’s Scripts

• Modern scripts
• Historical scripts
• Ancient and extinct scripts
• Minority languages
• Some fun stuff too!

Characters, Not Glyphs

[Figure: “Aa” shown in many different typefaces]

Characters, Not Glyphs: Han Unification

• “Unihan” unifies abstract Han ideographs, even if specific writing traditions (such as Japanese kanji vs. Simplified Chinese) appear different.

Encoding Work Continues

Unicode 6.1 added 732 characters, including several new scripts.

… but the pace of change has slowed and most living scripts are encoded.

Planes

• Unicode is divided into “planes” of code points
  – 17 planes (0 through 0x10)
  – 64K (65,536) code points per plane
• Plane 0 is called the Basic Multilingual Plane (BMP).
• Planes 1 through 0x10 are called supplementary planes:
  – Plane 1: supplementary multilingual plane (SMP)
  – Plane 2: supplementary ideographic plane (SIP)
  – Plane 14: supplementary special-purpose plane (SSP)
  – Planes 15, 16: private use
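Because each plane holds 0x10000 code points, the plane number is just the bits above the low 16. A minimal sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 through 16) for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Dropping the low 16 bits leaves the plane number.
    return code_point >> 16


print(plane_of(0x0041))    # a BMP character
print(plane_of(0x10338))   # a supplementary character
print(plane_of(0x10FFFF))  # the last code point
```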

Scripts and Blocks

• Most characters belong to a script, a distinct writing system.
  – Some characters, such as many of the punctuation characters, are used by multiple scripts
• Characters are assigned in Unicode to blocks.
  – Most blocks are used to encode (and named for) a specific script.
  – Some scripts have multiple blocks (Latin, Han ideographs)

Unicode Blocks

• Ranges of code points allocated together for assignment. Not all code points in a block are assigned (some reserved for future assignment)

Unicode Blocks

• See: http://www.unicode.org/charts
• Block names are stabilized and not always fully indicative of block usage
  – Example: Phags-pa block

Various Character Types

• Unicode Controls
• Compatibility Characters
• Byte Order Mark
• Replacement Character
• Combining Marks
• Variation Selectors
• Private Use
• Surrogates

Combining Marks

• Composition can create “new” characters
• Base + non-spacing (“combining”) characters

A + ˚ = Å        U+0041 + U+030A = U+00C5
a + ˆ + . = ậ    U+0061 + U+0302 + U+0323 = U+1EAD
a + . + ˆ = ậ    U+0061 + U+0323 + U+0302 = U+1EAD
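These compositions can be verified with Python's standard `unicodedata` module (Python is an assumption; any library implementing Unicode normalization should agree). The sketch composes each sequence with Form C and compares the results.

```python
import unicodedata

# A + combining ring above
a_ring = unicodedata.normalize("NFC", "A\u030A")

# Lowercase a with circumflex and dot below, marks typed in either order.
circ_then_dot = unicodedata.normalize("NFC", "a\u0302\u0323")
dot_then_circ = unicodedata.normalize("NFC", "a\u0323\u0302")

print(f"U+{ord(a_ring):04X}", f"U+{ord(circ_then_dot):04X}")
```

The two mark orders compose to the same character because one mark sits above the base and the other below it, so they do not interact. Without normalization the two input strings still compare as unequal.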

Combining Marks: Thai

Unicode คืออะไร?

คื = ค + ื

glyph = base + vowel modifier

Combining Marks: Devanagari

यूनिकोड क्या है?

यू नि को ड

य ू  न ि  क ो  ड

नि = न + ि

Combining Marks: Tamil

யூனிக்கோடு என்றால் என்ன?

யூ னி க் கோ டு

கோ = க + ோ (U+0B95 U+0BCB)

Byte Order Mark (BOM)

U+FEFF

• Used to indicate the “byte-order” of UTF-16 code units
  – 0xFE FF; 0xFF FE
• Also used as a Unicode signature by some software (Windows’s Notepad editor, for example) for UTF-8
  – 0xEF BB BF
• Appears as a character or renders as junk in some formats or on some systems.
• Has an annoying secondary meaning: “zero width no-break space”
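The byte patterns above, and the way a UTF-8 signature leaks into text when it is not stripped, can be seen with Python's standard codecs. This is a sketch; the `utf-8-sig` codec name is Python-specific.

```python
import codecs

# The UTF-8 signature is simply U+FEFF encoded as UTF-8.
utf8_signature = "\uFEFF".encode("utf-8")

data = codecs.BOM_UTF8 + "hello".encode("utf-8")
kept = data.decode("utf-8")          # signature survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")  # this codec drops a leading signature

# In UTF-16 the same character reveals the byte order.
big_endian = "\uFEFF".encode("utf-16-be")
little_endian = "\uFEFF".encode("utf-16-le")
```

A string that unexpectedly starts with U+FEFF is one common way the “renders as junk” symptom shows up, for example as a stray character at the top of a page.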

The Replacement Character

U+FFFD

• Indicates a bad byte sequence or a character that could not be converted.
• Equivalent to “question marks” in legacy encoding conversions

� there was a character here, but it is gone now
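A small sketch of where U+FFFD comes from in practice, assuming Python's decoder with replacement enabled: bytes that were written as ISO 8859-1 are read as UTF-8.

```python
# "café" as ISO 8859-1 bytes; 0xE9 on its own is not valid UTF-8.
bad = b"caf\xe9"

decoded = bad.decode("utf-8", errors="replace")
print(decoded)
```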

Compatibility Characters

Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings: ①②③45Ⅵ ¾ Lj ¼ Nj ½ dž ︴︷︻︽﹁﹄ ヲィゥォェュ゙ ﷺ ﵬ ﺽ ﻫ ﺳ ﺲ fi fl ffi ffl ſt ﬔ

Compatibility Characters

includes presentation forms

legacy encoding: a term for non-Unicode character encodings.

Half and Full Width Forms

• Compatibility characters for East Asian legacy encodings that vary in character “ width ”

Half width forms

ヲァィゥェォャ Abcdefh

Full width forms

ァアィイゥ ぁあぃいぅ ABCDefg

Variation Selectors

UTS#37 defines the Ideographic Variation Database (IVD)

Unicode Controls

Private Use

Surrogate Code Points

• Reserved code points in two blocks needed for the UTF-16 character encoding.

– Don’t encode characters
– Never to be used as characters on their own

Unicode Properties

• code point
• name
• character class
• combining level
• bidi class
• case mappings
• canonical decomposition
• mirroring
• default grapheme clustering

The Unicode Character Database (UCD)

ӑ (U+04D1) CYRILLIC SMALL LETTER A WITH BREVE
• letter
• non-combining
• left-to-right
• decomposes to U+0430 U+0306
• Ӑ U+04D0 is uppercase (and titlecase)
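Most of these properties are exposed by standard libraries. As an illustration (Python's `unicodedata` module is my choice here, not something the presentation prescribes), the same character can be queried like this:

```python
import unicodedata

ch = "\u04D1"  # ӑ

name = unicodedata.name(ch)             # character name
category = unicodedata.category(ch)     # general category ("Ll" = lowercase letter)
combining = unicodedata.combining(ch)   # combining class; 0 means non-combining
bidi = unicodedata.bidirectional(ch)    # bidi class; "L" means left-to-right
decomp = unicodedata.decomposition(ch)  # canonical decomposition, as hex code points
upper = ch.upper()                      # case mapping

print(name, category, combining, bidi, decomp, f"U+{ord(upper):04X}")
```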

UNICODE'S ENCODING FORMS

Unicode Encoding Forms

• UTF-32
  – Uses 32-bit code units.
  – All characters are the same width.
• UTF-16
  – Uses 16-bit code units.
  – BMP characters use one 16-bit code unit.
  – Supplementary characters use two special 16-bit code units: a “surrogate pair”.
• UTF-8
  – Uses 8-bit code units (bytes!)
  – It’s a multi-byte encoding!
  – Characters use between 1 and 4 bytes.
  – ASCII is ASCII in UTF-8

Unicode Encodings Compared

Character      UTF-32       UTF-16          UTF-8
A (U+0041)     0x00000041   0x0041          0x41
À (U+00C0)     0x000000C0   0x00C0          0xC3 0x80
ቑ (U+1251)     0x00001251   0x1251          0xE1 0x89 0x91
𐌸 (U+10338)    0x00010338   0xD800 0xDF38   0xF0 0x90 0x8C 0xB8
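The comparison can be regenerated for any character. A sketch using Python's built-in codecs, with `code_units` as a hypothetical helper name and big-endian byte order assumed for the 16-bit and 32-bit forms:

```python
def code_units(ch: str) -> dict:
    """Hex dump of one character in each Unicode encoding form."""
    return {
        "UTF-32": ch.encode("utf-32-be").hex(),
        "UTF-16": ch.encode("utf-16-be").hex(),
        "UTF-8": ch.encode("utf-8").hex(),
    }


for ch in ("A", "\u00C0", "\u1251", "\U00010338"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```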

UTF-32

• Uses 32-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)
• Each character takes exactly one code unit.

U+1251   ቑ  0x00001251
U+10338  𐌸  0x00010338

Advantages and Disadvantages of UTF-32

• Easy to process
  – each logical character takes one code unit
  – can use pointer arithmetic
• Not as commonly used
  – Not efficient for storage
    • 11 bits are never used
    • BMP characters are the most common: 16 bits wasted for each of these
  – Affected by processor architecture (Big-Endian vs. Little-Endian)
  – Disallowed for HTML5

UTF-16

• Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)
  – BMP characters use one unit
  – Supplementary characters use a “surrogate pair”, special code points that don’t do anything else.

U+1251   ቑ  0x1251
U+10338  𐌸  0xD800 0xDF38

High Surrogate: 0xD800-DBFF
Low Surrogate: 0xDC00-DFFF
Unique Ranges!
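The surrogate pair for a supplementary code point follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits in half. A sketch, with `to_surrogate_pair` as a hypothetical helper name:

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10338)
print(hex(high), hex(low))
```

Because the high and low ranges do not overlap each other or any other character, a program can tell from a single code unit whether it is looking at the first half, the second half, or a complete BMP character.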

Advantages and Disadvantages of UTF-16

• Most common languages and scripts are encoded in the BMP.
  – Less wasteful than UTF-32
  – Simpler to process (excepting surrogates)
  – Commonly supported in major operating environments, programming languages, and libraries
• May not be suitable for all applications
  – Affected by processor architecture (Big-Endian vs. Little-Endian)
  – Requires more storage, on average, for Western European scripts, ASCII, HTML/XML markup.

UTF-8

• 7-bit ASCII is itself
• All other characters take 2, 3, or 4 bytes each
  – lead bytes have a special pattern
  – trailing bytes range from 0x80 -> 0xBF

Lead Byte   Trail Bytes                  Corresponding Code Point
0xxxxxxx                                 < 0x80
110xxxxx    10xxxxxx                     < 0x800
1110xxxx    10xxxxxx 10xxxxxx            < 0x10000
11110xxx    10xxxxxx 10xxxxxx 10xxxxxx   Supplementary
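The bit patterns translate directly into code. The following hand-rolled encoder is a teaching sketch only (`utf8_encode` is a hypothetical name; real programs should use their platform's built-in conversion), but it shows how the x bits of each pattern are filled from the code point.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by filling in the bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x00C0).hex(), utf8_encode(0x10338).hex())
```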

Advantages and Disadvantages of UTF-8

Advantages:
• ASCII-compatible
• Default or recommended encoding for many Internet standards
• Bit pattern highly detectable (over longer runs)
• Non-endian
• Streaming
• C char* friendly
• Easy to navigate

Disadvantages:
• Multibyte encoding requires additional processing awareness
• Non-shortest form checking needed
• Less efficient than UTF-16 for large runs of Asian text
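“Non-shortest form” refers to byte sequences that spell a character with more bytes than necessary, such as a two-byte form of “/”. A conforming decoder must reject them. The sketch below checks that Python's built-in decoder does so; it is an illustration, not a full validator.

```python
# 0xC0 0xAF is an overlong (two-byte) spelling of U+002F '/',
# which must only ever be encoded as the single byte 0x2F.
overlong_slash = b"\xc0\xaf"

try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(rejected)
```

Accepting such sequences is a security concern: a filter that looks for the byte 0x2F would miss the overlong form.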

HTML

• Set Web server to declare UTF-8 in HTTP Content-Type header
• Declare UTF-8 in META tag header
• Actually use UTF-8 as the encoding!!

</p> <p><b>Вибір і застосування кодування</b></p> <p>

WORKING WITH UNICODE

It’s more than just a character set and some encodings…

Unicode Properties, Annexes, and Standards

Unicode provides additional information:
• Character name
• Character class
• “ctype” information, such as if it’s a digit, number, alphabetic, etc.
• Directionality (LTR, RTL, etc.) and the Bidi Algorithm
• Case mappings (UPPER, lower, and Titlecase)
• Default Collation and the Unicode Collation Algorithm (UCA)
• Identifier names
• Regular Expression syntaxes
• Normalization
• Compatibility information

Many of these items are in the form of Unicode Technical Reports
• http://www.unicode.org/reports

Unicode 6.1 Annexes

9   Unicode Bidirectional Algorithm
11  East Asian Width
14  Unicode Line Breaking Algorithm
15  Unicode Normalization Forms
24  Unicode Script Property
29  Unicode Text Segmentation
31  Unicode Identifier and Pattern Syntax
34  Unicode Named Character Sequences
38  Unicode Han Database (Unihan)
41  Common References for Unicode Standard Annexes
42  Unicode Character Database in XML
44  Unicode Character Database

Abc ABC abc abC aBc

UAX#15: Normalization

Unicode Normalization has to deal with more issues:
• single or multiple combining marks
• compatibility characters
• presentation forms

abc

Ǻ
U+01FA
U+00C5 U+0301
U+00C1 U+030A
U+212B U+0301
U+0041 U+0301 U+030A
U+0041 U+030A U+0301

Four Normalization Forms

Ǻ ways to represent:
U+01FA
U+00C5 U+0301
U+00C1 U+030A
U+212B U+0301
U+0041 U+0301 U+030A
U+0041 U+030A U+0301

• Form D: canonical decomposition
• Form C: canonical decomposition followed by composition
• Form KD: kompatibility decomposition
• Form KC: kompatibility decomposition followed by composition

Normalization in Action

[Table: the six original sequences for Ǻ (U+01FA; U+00C5 U+0301; U+00C1 U+030A; U+212B U+0301; U+0041 U+0301 U+030A; U+0041 U+030A U+0301), each shown after conversion to Form C, Form D, Form KC, and Form KD]
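The effect of the four forms on these six sequences can be computed rather than read off a table. This sketch uses Python's `unicodedata.normalize` (an assumed implementation; its form names NFC, NFD, NFKC, and NFKD correspond to Forms C, D, KC, and KD), with `show` as a hypothetical formatting helper.

```python
import unicodedata

sequences = [
    "\u01FA",
    "\u00C5\u0301",
    "\u00C1\u030A",
    "\u212B\u0301",
    "\u0041\u0301\u030A",
    "\u0041\u030A\u0301",
]


def show(s: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


table = {
    show(s): {
        form: show(unicodedata.normalize(form, s))
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }
    for s in sequences
}

for original, forms in table.items():
    print(original, forms)
```

Running this, I would expect the sequences where the ring comes before the acute to converge on U+01FA in Form C, while those where the acute comes first stay separate: both marks attach above the letter, so their relative order is treated as significant.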

Normalization: Not a Panacea

Not all compatibility characters have a compatibility decomposition.

Not all characters that look alike or have similar semantics have a compatibility decomposition.

– For example, there are many ‘ dots ’ used as a period.

Not all character variations are handled by normalization.

– For example, upper, title, and lowercase variations.

Normalization can remove meaning

A Bit of Bidi

UAX#9: Unicode Bidirectional Algorithm

Bi-directional Scripts
• Some scripts are written predominantly from left-to-right (LTR).
• Some scripts are written predominantly from right-to-left (RTL).

Other Writing Directions

Writing direction is a separate consideration from text direction. Both of the texts shown here are “left-to-right”.

Character Direction

• Unicode defines a character’s direction
  – Left-to-right
  – Right-to-left
  – Neutral
• Characters can be “weakly” or “strongly” directional

Unicode Bidi Algorithm

• Depends on “base direction”
• Breaks text into “runs”

Embedding and “Logical Order”

Characters are encoded in logical order.

Visual order is determined by the layout.

– Override and bidi control characters
– “Indeterminate” characters

Bidirectional Embedding

Paste in Arabic

Unicode Controls and Markup

Natural Language Processing

Unicode Collation Algorithm

• Defines a collation algorithm (UTS#10)
• Defines “DUCET” (Default Unicode Collation Element Table)
  – Must be tailored by language and “locale” (culture) and other variations (maintained by CLDR):

Language:        Swedish: z < ö               German: ö < z
Usage:           German Dictionary: of < öf   German Telephone: öf < of
Customizations:  Upper-first: A < a           Lower-first: a < A

Line Breaking (UAX#14)

Defines rules for general-purpose non-dictionary line-breaking.

– Tailored by language
– Doesn’t work for languages such as Thai that require morphological analysis (aka “a dictionary”)

Text Segmentation: Thai

ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท์

ญัตติ ที่ เสนอ ได้ ผ่าน ที่ประชุม ด้วย มติ เอกฉันท์ (word boundaries)

Text Segmentation (UAX#29)

Find grapheme, word, and sentence boundaries in text.

• Tailored by language
• Provides good basic default handling

Unicode Consortium

• Does some other things
  – CLDR: Common Locale Data Repository
  – ULI: Unicode Localization Interoperability
  – ISO 15924: Script registry

“That’s great: I’ll just use Unicode”

• Remember “all text has an encoding”?
  – user input via forms
  – email
  – data feeds
  – existing, legacy data
  – database instances
  – uploads
• Use UTF-8 for HTML and Web forms
• Use UTF-8 in your APIs
• Check that data really is UTF-8
• Control encoding via code; avoid hard-coding the encoding
• Watch out for legacy encodings
• Convert to Unicode as soon as practical.
• Convert from Unicode as late as possible.
• Wrap Unicode-unfriendly technologies
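The “convert early, convert late” advice amounts to decoding at the input boundary, working only with Unicode text inside, and encoding at the output boundary. A minimal sketch under that reading; `handle_input` and `is_valid_utf8` are hypothetical names, and the uppercase step merely stands in for real processing.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Check that data really is UTF-8 before trusting a label."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def handle_input(raw: bytes, declared_encoding: str) -> bytes:
    # Convert to Unicode as soon as practical: decode at the boundary,
    # using the encoding that was captured along with the data.
    text = raw.decode(declared_encoding)

    # All internal processing works on Unicode text, never on raw bytes.
    processed = text.strip().upper()

    # Convert from Unicode as late as possible: encode only on output.
    return processed.encode("utf-8")


legacy = "  d\u00E9tail ".encode("iso-8859-1")
print(is_valid_utf8(legacy), handle_input(legacy, "iso-8859-1"))
```

Passing the encoding in as a parameter, rather than writing it into the function body, reflects the point about controlling encoding via code instead of hard-coding it.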

Map Your System
• APIs
  – use Unicode encoding
  – hide internal storage encoding
• Data Stores, Local I/O
  – use Unicode encoding
  – consider an encoding conversion plan
• Front Ends
  – use Unicode encoding
• Back Ends, External Data
  – Uses Unicode?
  – If not, what encoding?

Store the encoding!

[Diagram labels: Your System (Unicode); Unicode Interface; API; Unicode Cloud; Input; Detect / Convert; Capture Encoding; Legacy Encoding; Convert to Legacy]

SUMMARY

Character Encodings

• Code unit
• Code point
• Character
• Glyph/grapheme
• “All text has an encoding”
• Multibyte encoding
  – Tofu
  – Mojibake
  – Question Marks

Unicode

• 17 planes of goodness
  – 1.1 million potential code points
  – 150,000 assigned code points
• 3 encodings
  – UTF-32
  – UTF-16
  – UTF-8
• Unicode Standard, Annexes, and Reports
  – CLDR for language specific tailoring
• Unicode Character Database

Q&A

Would you write the code for I18N on the whiteboard before you go?

#define UNICODE #import I18N.h