Transcript Slide 1

Character Encodings
& Unicode
Internationalization: An
Introduction
Tutorial from the Internationalization and Unicode Conference
License
This presentation and its associated materials are licensed under a
Creative Commons Attribution-Noncommercial-No Derivative
Works 2.5 License.
You may use these materials without obtaining permission from the
author. Any materials used or redistributed must contain this notice.
[Derivative works may be permitted with permission of the author.]
This work is copyright © 2008-2011 by Addison P. Phillips
Presenter and Presentation
• Addison Phillips
– Globalization Architect, Lab126
• This Presentation
– Part I of the Internationalization and Unicode Conference
tutorial :
“Internationalization: An Introduction”
Character Encodings and Unicode
Who is this guy?
• Globalization Architect, Lab126
We make the technology behind the Kindle
• Chair,
W3C Internationalization WG
Internationalization is:
• the design and development of a product that
is enabled for target audiences that vary in
culture, region, or language. [W3C]
• a fundamental architectural approach to
software development
Opinions differ on
capitalization (C12N);
choose from:
 i18n
 i18N
 I18n
 I18N
Very geeky; not very
internationalized
(I19G?)
Mystic Numbering (M4C N7G)
I (N1 T2 E3 R4 N5 A6 T7 I8 O9 N10 A11 L12 I13 Z14 A15 T16 I17 O18) N
18 letters between the first "I" and the last "N" → I18N
Localization = L10N
Globalization = G11N
Canonicalization = C14N
Accessibility = A11Y
The basics of text processing in software.
CHARACTER ENCODINGS
The Biggest Source of Woe
“Character encodings consume more than 80% of
my work day. They are the source of more misinformation and confusion than any other
single thing. And developers aren’t getting any
better educated.”
~Glen Perkins
Globalization Architect
A lot of jargon
Real Jargon: Multibyte; Variable width; Wide character; Character encoding; Coded character set; Bidi or bidirectional; Glyph, character, code unit; Unicode
Potentially Bogus Jargon: kanji; double-byte language; extended ASCII; ANSI, OEM; encoding agnostic
How the computer sees the world
“bits”: 010000010101101101101000
“byte” or “octet”: 01000001 (0x41)
code unit: a unit of physical storage and information interchange
• represent numbers
• come in various sizes (e.g. 7, 8, 16, 32, 64 bits)
how do we map text to the numbers used by computers?
From text to bits
À
Glyphs
– A “glyph” is a screen unit of text: it’s a picture
of what users think of as a character.
– A “grapheme” is a single visual unit of text.
U+00C0
Characters
– A “character” is a single logical unit of text.
– A “character set” is a set of characters.
– A “code point” is a number assigned to a
character in a character set.
– A “coded character set” is a character set
where each character has a code point.
Bytes
– A “character encoding form” maps a
sequence of code points (“characters”) to a
sequence of code units (such as bytes).
– A “code unit” is a single logical unit of
storage.
… 0xC3 0x80 …
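A minimal sketch (Python 3 assumed) of the three layers for À: one glyph, one code point, two UTF-8 code units:

# One user-visible glyph, one Unicode code point, two bytes in the UTF-8 encoding form.
ch = "À"
print(f"U+{ord(ch):04X}")            # U+00C0  -- the code point
print(ch.encode("utf-8").hex(" "))   # c3 80   -- the code units (bytes)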
Coded Character Set
• Collection (repertoire) of characters, that is: a set.
• Organized so that each character has a unique numeric
(typically integer) value (code point).
• Examples:
– Unicode
– ASCII (ANSI X3.4)
– ISO 646
– JIS X 0208
– Latin-1 (ISO 8859-1)
Character sets are often
associated with a particular
language or writing system.
Character Encoding Form
• Maps a sequence of code points (characters)
to a sequence of code units (e.g. bytes).
– Some encoding forms use another code unit
instead of the byte. For example, some encoding
forms use a 16-bit, 32-bit, or 64-bit code unit.
U+00C0 → 0xC3 0x80
Often shortened as “character encoding”,
“encoding form”, or, confusingly, “charset”
*(the most important slide in this presentation)
In memory, on disk, on the network, etc.
All text has a character
encoding
When things go wrong, start by asking what the
encoding is, what encoding you expected it to be,
and whether the bytes match the encoding.
Common Encoding Problems
Tofu
hollow boxes
Mojibake
Question Marks
garbage characters
(conversion not supported)
It can happen to anyone…
Tofu
• Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example)
• Not usually a bug: it’s a display problem
• Can mask or masquerade as character corruption.
Mojibake
When Good Characters Go Bad
Sources of Mojibake
• View text using the wrong
encoding
• Apply a transfer encoding
and forget to remove it
• Convert to an encoding
twice
• Convert to or from the
wrong encoding
• Overzealous escaping
• Conversion to entities
(“entitization”)
• Multiple conversions
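A minimal sketch (Python 3 assumed) of two of the failure modes listed above: viewing UTF-8 bytes through the wrong encoding, and then converting the damaged text again:

# Viewing text with the wrong encoding produces mojibake:
utf8_bytes = "À propos".encode("utf-8")
print(utf8_bytes.decode("cp1252"))            # Ã€ propos -- UTF-8 bytes read as Windows code page 1252

# A second conversion bakes the damage into the data:
double = utf8_bytes.decode("cp1252").encode("utf-8")
print(double.decode("utf-8"))                 # Ã€ propos -- now even the "correct" decoding shows mojibake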
Character Encoding Forms
Their theory, structure, and use
EBCDIC
ASCII
• 7 bits = 2^7 = 128 characters
• Enough for “U.S. English”
Latin-1 (ISO 8859-1)
ASCII for
characters 0x00
through 0x7F
Accented letters
and other
symbols 0x80
through 0xFF
One character—many character sets and many
character encodings!
char È: 0xC8 in cp1252, 0xD4 in cp850
Windows Code Pages
Windows’s encodings
(called “code pages”) are
generally based on
standard encodings—
plus some additional
characters.
Example:
 CP 1252 is based on ISO 8859-1, but includes 27 “extra” characters in the C1 control range (0x80-0x9F)
Code Page
 Originally an IBM character
encoding term.
 IBM numbered their
character sets with “CCSIDs”
(coded character set ids) and
numbered the
corresponding character
encoding forms as “code
pages”.
 Microsoft borrowed code
pages to create PC-DOS.
 Microsoft defines two kinds of
code pages:
 “ANSI” code pages are the
ones used by Windows GUI
programs.
 “OEM” code pages are the
ones used by command
shell/command line programs.
 Neither “ANSI” nor “OEM”
refer to a particular encoding
standard or standards body in
this context.
 Avoid the use of ANSI and OEM
when referring to encodings.
Beyond Single Byte Encodings
• So far we’ve been looking at single-byte encodings:
 one byte per character
 1 byte = 1 character
(= 1 glyph?)
 256 character maximum
 Good enough for most
alphabetic languages
À
Some languages need more
characters.
What about the “double-byte”
languages?
Don’t those take two bytes per
character?
丏丣並
Methods of reaching beyond single-byte
• Escape sequences to select
another character set
– Example: ISO 2022 uses escape
sequences to select various
encodings
• Use a larger code unit (“wide”
character encoding)
– Example: IBM DBCS code pages
or Unicode UTF-16
– 2^16 = 64K characters
– 2^32 = 4.2 billion characters
• Use a variable-width encoding
Variable width encodings use
different numbers of code
units to represent different
types of characters within the
same encoding form.
Multibyte Encodings
One or more bytes per character
– 1 byte != 1 character
– May use 1, 2, 3, or 4 bytes per
character
-> maximum number of bytes per
character varies by encoding form.
– May use shift or escape
sequences
– May encode more than one
character set
• Single-byte encodings are a
special case of multibyte!
Multibyte Encoding: Any
“variable-width” encoding that
uses the byte as its code unit.
JIS X 0213
 11,233 characters
 two 94×94 character planes
JIS X 0213: A Coded Character Set whose common encoding forms are multibyte
あ = 1-4-2 (code point)
6 = 1-3-22 (code point)
Simple Multibyte Encoding Forms
• Specific byte ranges encode characters that take more than one byte:
– A “lead byte”
– One or more “trailing bytes”
あ = 1-4-2 (code point) → 0x82 0xA0 (lead byte + trail byte)
A = 1-3-33 (code point) → 0x41 (single byte)
• Code point != code unit
Shift_JIS: A Multibyte Encoding
• In order to reach more characters, Shift_JIS characters start with a limited range of “lead bytes”.
• These can be followed by a larger range of byte values (“trail bytes”).
Shift_JIS
• Lead bytes can be trail byte values
• Trail bytes include ASCII values
• Trail bytes include special values such as 0x5C (“\”)
char *pos = strchr(mybuf, '@');
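A minimal sketch (Python 3 assumed) of the pitfall behind this code: a byte-wise search can match a trail byte inside a Shift_JIS character:

# ソ encodes as 0x83 0x5C in Shift_JIS; its trail byte is the backslash value.
text = "ソース"                          # "source" in katakana; contains no backslash
sjis = text.encode("shift_jis")
print(sjis.find(b"\\"))                  # 1 -- a false hit inside the first character
print("\\" in text)                      # False -- searching characters, not bytes, is safe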
More Complex Multibyte Systems
• Stateful Encodings
– ex. IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters]
– ISO 2022 [escape sequence changes character set being encoded]
• Ad hoc Encodings
Transfer Encodings
• A transfer encoding syntax is a reversible transform of encoded data
which may (or may not) include textual data represented in one or
more character encoding schemes.
• Email headers
• URIs
• IDN (domain names)
Abcソース → =?UTF-8?B?QWJj44K944O844K5?= → Abcソース
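A minimal sketch (Python 3 assumed) of the MIME encoded-word shown above: the text is first encoded as UTF-8, then wrapped in a Base64 transfer encoding:

import base64

subject = "Abcソース"
b64 = base64.b64encode(subject.encode("utf-8")).decode("ascii")
encoded_word = "=?UTF-8?B?" + b64 + "?="
print(encoded_word)                            # =?UTF-8?B?QWJj44K944O844K5?=
# Decoding reverses the transfer encoding first, then the character encoding:
print(base64.b64decode(b64).decode("utf-8"))   # Abcソース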
Encoding Conversion
Common encoding conversion tools and libraries:
• iconv (Unix)
• ICU (C, C++, Java)
• perl Encode
• Java (native2ascii, IO/NIO)
• (etc.)
[Diagram: Templates (ISO 8859-1) + Content (UTF-8) + Data (Shift_JIS) → Process → Output (HTML, XML, etc.)]
• Document formats often require a single character encoding be used for all parts of the document.
 When data is merged, the same encoding form must be used or some of the data will be “mojibake”.
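A minimal merge-and-convert sketch (Python 3 assumed; iconv, ICU, or Encode play the same role in other stacks): decode each source with its own declared encoding, then emit everything in a single target encoding:

latin1_data = "Àà".encode("latin-1")          # template arriving as ISO 8859-1
sjis_data   = "文字".encode("shift_jis")       # data arriving as Shift_JIS

# Decode each piece with its own encoding, then re-encode the merged document as UTF-8:
merged = (latin1_data.decode("latin-1") + sjis_data.decode("shift_jis")).encode("utf-8")
print(merged.decode("utf-8"))                 # Àà文字 -- one encoding for the whole document, no mojibake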
Encoding Conversion as Filter
[Diagram: sample strings in ISO 8859-1, UTF-8, and Shift_JIS (ÀàС£, детски, »èç文字, 文字化け) converted between encodings; characters not present in the target encoding come out as "?"]
? (0x3F) is the replacement
character for ISO 8859-1
Encoding conversion acts as a “filter”
– Replacement characters (“question marks”) replace
characters from the source character set that are not
present in the target character set.
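A minimal sketch (Python 3 assumed) of the "filter" effect: encoding into ISO 8859-1 replaces everything the target cannot represent:

text = "Àà детски 文字"
print(text.encode("iso-8859-1", errors="replace"))   # b'\xc0\xe0 ?????? ??' -- Cyrillic and kanji become '?'
print(text.encode("utf-8"))                          # every character survives in UTF-8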
Too Many Fish in the Sea
• Need for more converters and
conversion maps
• Difficulty of passing, storing, and
processing data in multiple
encodings
• Too many character sets…
…leads to what we call “code
page hell”
Unicode / ISO-10646
The Idea Behind Unicode
• Basic Principles
– Universal repertoire
– Logical order
– Efficiency
– Unification
– Characters, not glyphs
– Dynamic composition
– Semantics
– Stability
– Plain Text
– Convertibility
• Fights mojibake
because:
– characters are from the
common repertoire;
– characters are encoded
according to one of the
encoding forms;
– characters are
interpreted with
Unicode semantics;
– unknown characters are
not corrupted
Unicode (ISO 10646)
Unicode is a character set that supports all of
the world’s languages and writing systems.
 Code space of code points U+0000 through U+10FFFF (about 1.1 million)
 Unicode and ISO 10646 are maintained in sync.
 Unicode is maintained by an industry consortium.
 ISO 10646 is maintained by the ISO.
What are “planes”?
 Divide Unicode into equal-sized regions of code points.
 17 planes (0 through 0x10), each with 65,536 code points.
 Plane 0 is called the Basic Multilingual Plane (BMP).
 > 99% of text in the wild lives in the BMP.
 Planes 1 through 0x10 are called supplementary planes.
Unicode as the Universal Character Set
• An organized collection
of characters.
• Each character has a
code point
aka Unicode Scalar Value
(USV)
• U+0041 ← hex notation
Unicode Character Database
• code point
• name
• character class
• combining level
• bidi class
• case mappings
• canonical decomposition
• mirroring
• default grapheme clustering
Example: ӑ (U+04D1) CYRILLIC SMALL LETTER A WITH BREVE
 letter
 non-combining
 left-to-right
 decomposes to U+0430 U+0306
 Ӑ U+04D0 is uppercase (and titlecase)
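A minimal sketch (Python 3 assumed) querying some of these database properties for ӑ with the standard unicodedata module:

import unicodedata

ch = "\u04D1"                                   # ӑ
print(unicodedata.name(ch))                     # CYRILLIC SMALL LETTER A WITH BREVE
print(unicodedata.category(ch))                 # Ll -- a lowercase letter
print(unicodedata.combining(ch))                # 0  -- not a combining mark
print(unicodedata.bidirectional(ch))            # L  -- left-to-right
print([hex(ord(c)) for c in unicodedata.normalize("NFD", ch)])   # ['0x430', '0x306'] -- canonical decomposition
print(ch.upper() == "\u04D0")                   # True -- uppercase mapping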
Compatibility Characters
Many characters were included in
Unicode for round-trip conversion
compatibility with legacy encodings:
①②③45Ⅵ
¾Lj¼Nj½dž
︴︷︻︽﹁﹄
ヲィゥォェュ゙
‫ﺲﺳﻫﺽﵬﷺ‬
fiflffifflſtﬔ
Compatibility characters include presentation forms.
“Legacy encoding”: a term for non-Unicode character encodings.
Byte Order Mark (BOM)
U+FEFF
• Used to indicate the “byte-order” of UTF-16 code units
– 0xFE FF; 0xFF FE
• Also used as a Unicode signature by some software (Windows’s
Notepad editor, for example) for UTF-8
– 0xEF BB BF
Appears as a character or renders as
junk in some formats or on some
systems. For example, older browsers
render it as three bytes of mojibake.
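A minimal sketch (Python 3 assumed) of the BOM at work, both as a byte-order mark for UTF-16 and as a UTF-8 signature:

print("A".encode("utf-8-sig"))                 # b'\xef\xbb\xbfA' -- the 3-byte UTF-8 signature
print(b"\xff\xfe\x41\x00".decode("utf-16"))    # 'A' -- BOM 0xFF 0xFE marks little-endian
print(b"\xfe\xff\x00\x41".decode("utf-16"))    # 'A' -- BOM 0xFE 0xFF marks big-endian
print(b"\xef\xbb\xbfA".decode("utf-8-sig"))    # 'A' -- the signature is stripped on decode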
The Replacement Character
U+FFFD
 Indicates a bad byte
sequence or a
character that could
not be converted.
 Equivalent to
“question marks” in
legacy encoding
conversions
�
there was a character here,
but it is gone now
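A minimal sketch (Python 3 assumed) of U+FFFD appearing where a bad byte sequence was decoded:

bad = b"\x41\xff\x42"                           # 0xFF is not valid in UTF-8
print(bad.decode("utf-8", errors="replace"))    # A�B -- the lost byte becomes U+FFFD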
Combining Marks
 Composition can create “new” characters
 Base + non-spacing (“combining”) characters
A+˚ = Å
U+0041 + U+030A = U+00C5
a+ˆ+.=ậ
U+0061 + U+0302 + U+0323 = U+1EAD
a+.+ˆ=ậ
U+0061 + U+0323 + U+0302 = U+1EAD
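A minimal sketch (Python 3 assumed) showing that both combining-mark orders above normalize to the same precomposed character:

import unicodedata

a = "\u0061\u0302\u0323"    # a + circumflex + dot below
b = "\u0061\u0323\u0302"    # a + dot below + circumflex
print(unicodedata.normalize("NFC", a) == "\u1EAD")                          # True
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # True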
Complex Scripts
ญัตติที่เสนอได้ผา่ นที่ประชุมด้วยมติเอกฉันท
ญั = ญ + ั
glyph = consonant + vowel
ญัตติที่เสนอได้ผา่ นที่ประชุมด้วยมติเอกฉันท (word boundaries)
Hindi
What is Unicode?
यूनिकोड क्या है? (What is Unicode?)
Syllables: यू नि को ड
Characters: य ू न ि क ो ड
नि = न + ि (the vowel sign ि is rendered to the left of the consonant)
Tamil Example
கொ (‘ko’) = க (U+0B95) + ொ (U+0BCA)
Combining mark drawn to the “left” of the base character
UNICODE'S ENCODING FORMS
Unicode Encoding Forms
• UTF-32
– Uses 32-bit code units.
– All characters are the same width.
• UTF-16
– Uses 16-bit code units.
– BMP characters use one 16-bit code unit.
– Supplementary characters use two special 16-bit code units: a “surrogate
pair”.
• UTF-8
– Uses 8-bit code units (bytes!)
– It’s a multi-byte encoding!
– Characters use between 1 and 4 bytes.
– ASCII is ASCII in UTF-8
Unicode Encodings Compared
A (U+0041):    UTF-32: 0x00000041   UTF-16: 0x0041         UTF-8: 0x41
À (U+00C0):    UTF-32: 0x000000C0   UTF-16: 0x00C0         UTF-8: 0xC3 0x80
ቑ (U+1251):    UTF-32: 0x00001251   UTF-16: 0x1251         UTF-8: 0xE1 0x89 0x91
𐌸 (U+10338):   UTF-32: 0x00010338   UTF-16: 0xD800 0xDF38  UTF-8: 0xF0 0x90 0x8C 0xB8
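The same comparison as a sketch (Python 3 assumed), using the big-endian forms without a BOM:

for ch in "AÀቑ\U00010338":
    print(f"U+{ord(ch):04X}",
          ch.encode("utf-32-be").hex(" "),
          ch.encode("utf-16-be").hex(" "),
          ch.encode("utf-8").hex(" "))
# e.g. U+10338 → 00 01 03 38 (UTF-32), d8 00 df 38 (UTF-16), f0 90 8c b8 (UTF-8)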
UTF-32
• Uses 32-bit code units (instead of the more-familiar 8-bit
code unit, aka the “byte”)
• Each character takes exactly one code unit.
U+1251
ቑ
0x00001251
U+10338
𐌸
0x00010338
Advantages and Disadvantages of UTF-32
• Easy to process
– each logical character
takes one code unit
– can use pointer arithmetic
• Not commonly used
– Not efficient for storage
• 11 bits are never used
• BMP characters are the
most common—16 bits
wasted for each of
these
– Affected by processor
architecture (Big-Endian vs.
Little-Endian)
UTF-16
• Uses 16-bit code units (instead of the more-familiar 8-bit
code unit, aka the “byte”)
– BMP characters use one unit
– Supplementary characters use a “surrogate pair”, special code
points that don’t do anything else.
0x1251
U+1251 ቑ
0xD800 0xDF38
U+10338 𐌸
High Surrogate
Low Surrogate
0xD800-DBFF
0xDC00-DFFF
Unique Ranges!
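A minimal sketch (Python 3 assumed) of the surrogate-pair arithmetic for U+10338:

cp = 0x10338
v = cp - 0x10000                      # 20 bits remain
high = 0xD800 + (v >> 10)             # high surrogate (0xD800-0xDBFF)
low  = 0xDC00 + (v & 0x3FF)           # low surrogate  (0xDC00-0xDFFF)
print(hex(high), hex(low))            # 0xd800 0xdf38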
Advantages and Disadvantages of UTF-16
• Most common languages
and scripts are encoded in
the BMP.
– Less wasteful than UTF-32
– Simpler to process
(excepting surrogates)
– Commonly supported in
major operating
environments,
programming languages,
and libraries
• May not be suitable for all
applications
– Affected by processor
architecture (Big-Endian vs.
Little-Endian)
– Requires more storage, on
average, for Western
European scripts, ASCII,
HTML/XML markup.
UTF-8
• 7-bit ASCII is itself
• All other characters take 2, 3, or 4 bytes each
– lead bytes have a special pattern
– trailing bytes range from 0x80->0xBF
Byte pattern (lead byte + trail bytes)         Corresponding code point
0xxxxxxx                                       < 0x80
110xxxxx 10xxxxxx                              < 0x800
1110xxxx 10xxxxxx 10xxxxxx                     < 0x10000
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx            Supplementary
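A minimal sketch (Python 3 assumed) of the two-byte pattern above, hand-rolled for U+00C0:

cp = 0x00C0
lead  = 0b11000000 | (cp >> 6)            # 110xxxxx
trail = 0b10000000 | (cp & 0b00111111)    # 10xxxxxx
print(hex(lead), hex(trail))              # 0xc3 0x80
print(bytes([lead, trail]).decode("utf-8"))   # À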
Advantages and Disadvantages of UTF-8
• ASCII-compatible
• Default or recommended
encoding for many Internet
standards
• Bit pattern highly
detectable (over longer
runs)
• Non-endian
• Streaming
• C char* friendly
• Easy to navigate
• Multibyte encoding
requires additional
processing awareness
• Non-shortest form checking
needed
• Less efficient than UTF-16
for large runs of Asian text
HTML
 Set Web server to declare UTF-8 in HTTP Content-Type header
 Declare UTF-8 in a META tag in the document header
 Actually use UTF-8 as the encoding!!
<html lang="en" dir="ltr">
<head>
<meta charset="utf-8">
<title>Вибір і застосування кодування</title>
It’s more than just a character set and some encodings…
WORKING WITH UNICODE
Unicode Properties, Annexes, and Standards
Unicode provides additional information:
 Character name
 Character class
 “ctype” information, such as if it’s a digit, number, alphabetic, etc.
 Directionality (LTR, RTL, etc.) and the Bidi Algorithm
 Case mappings (UPPER, lower, and Titlecase)
 Default Collation and the Unicode Collation Algorithm (UCA)
 Identifier names
 Regular Expression syntaxes
 Normalization
 Compatibility information
Many of these items are in the form of Unicode Technical Reports
 http://www.unicode.org/reports
Normalization
Unicode Normalization has to deal with more issues:
• single or multiple combining marks
• compatibility characters
• presentation forms
Abc ABC abc abC aBc abc
Ǻ: U+01FA, U+00C5 U+0301, U+00C1 U+030A, U+212B U+0301, U+0041 U+0301 U+030A, U+0041 U+030A U+0301
Four Normalization Forms
Ways to represent Ǻ: U+01FA; U+00C5 U+0301; U+00C1 U+030A; U+212B U+0301; U+0041 U+0301 U+030A; U+0041 U+030A U+0301
• Form D: canonical decomposition
• Form C: canonical decomposition followed by composition
• Form KD: kompatibility decomposition
• Form KC: kompatibility decomposition followed by composition
Normalization in Action
Original               Form C          Form D                 Form KC         Form KD
U+01FA                 U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+00C5 U+0301          U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+00C1 U+030A          U+00C1 U+030A   U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A
U+212B U+0301          U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A
U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
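A minimal sketch (Python 3 assumed) of the four forms applied to two of the representations above:

import unicodedata

def cps(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)

for original in ("\u01FA", "\u212B\u0301"):
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, cps(unicodedata.normalize(form, original)))
# For both inputs: NFC/NFKC give U+01FA, NFD/NFKD give U+0041 U+030A U+0301.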
Normalization: Not a Panacea
 Not all compatibility characters have a
compatibility decomposition.
 Not all characters that look alike or have similar
semantics have a compatibility decomposition.
 For example, there are many ‘dots’ used as a period.
 Not all character variations are handled by
normalization.
 For example, upper, title, and lowercase variations.
 Normalization can remove meaning
A Bit of Bidi
Bi-directional Scripts
• Some languages are written predominantly from left-to-right (LTR).
• Some languages are written predominantly from right-to-left (RTL).
• (A few can be written top-to-bottom or using other schemes)
Unicode defines character
“directionality” and a “Bidi”
algorithm for rendering text.
 Uses logical, not visual, order.
 Uses levels of “embedding”.
 Requires markup changes (as in
HTML) or special controls for
certain cases.
Embedding and “Logical Order”
Characters are encoded in logical order.
Visual order is determined by the layout.
– Override and bidi control characters
– “Indeterminate” characters
Bidirectional Embedding
Paste in Arabic
Unicode Controls and Markup
Natural Language Processing
Unicode Collation Algorithm
• Defines default collation algorithm and
sequences (UTS#10)
– Must be tailored by language and “locale”
(culture) and other variations.
Language
– Swedish: z < ö
– German: ö < z
Usage
– German Dictionary: öf < of
– German Telephone: of < öf
Customizations
– Upper-first: A < a
– Lower-first: a < A
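A minimal sketch (Python 3 assumed) of why tailoring matters: plain code-point sorting matches none of the orders above (a real application would use ICU/CLDR collation):

words = ["öf", "of", "zebra"]
print(sorted(words))       # ['of', 'zebra', 'öf'] -- ö sorts after z by code point
# Swedish expects z < ö; German dictionary order expects öf < of: both need tailored collation.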
Text Segmentation (UAX#29)
Find grapheme, word, and line-break boundaries in
text.
• Tailored by language
• Provides good basic default handling
CLDR and Language Specific
Processing…
… is in the next section
SUMMARY
“That’s great: I’ll just use Unicode”
 Remember “all text
has an encoding”?
 user input via forms
 email
 data feeds
 existing, legacy data
 database instances
 uploads
 Use UTF-8 for HTML and Web
forms
 Use UTF-8 in your APIs
 Check that data really is UTF-8
 Control encoding via code;
avoid hard-coding the encoding
 Watch out for legacy encodings
 Convert to Unicode as soon
as practical.
 Convert from Unicode as
late as possible.
 Wrap Unicode-unfriendly
technologies
Map Your System
• APIs
 use Unicode encoding
 hide internal storage encoding
• Data Stores, Local I/O
 use Unicode encoding
 consider an encoding conversion plan
• Front Ends
 use Unicode encoding
 detect / convert incoming data
• Back Ends, External Data
 Uses Unicode?
 If not, what encoding?
 Store the encoding!
[Diagram: "Your System" as a Unicode cloud; legacy interfaces convert to Unicode at the API boundary; input is detected/converted and its encoding is captured.]
Counting Things
Be aware of whether you need to count
glyphs, characters, or bytes:
– Is the limit “screen positions”,
“characters”, or “bytes of storage”?
– Should you be using a different limit?
Which one are you actually counting?
varchar(110)
यूनिकोड (4 glyphs)
य ू न ि क ो ड (7 characters)
E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1 (21 bytes)
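A minimal sketch (Python 3 assumed) of the same string counted two of these ways (glyph/grapheme counting needs a segmentation library such as ICU):

s = "यूनिकोड"
print(len(s))                    # 7  -- code points ("characters")
print(len(s.encode("utf-8")))    # 21 -- bytes of storage
# The rendered width is 4 glyphs (यू नि को ड); neither count above tells you that.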
Character Encodings
• Code unit
• Code point
• Character
• Glyph
• Multibyte encoding
– Tofu
– Mojibake
– Question Marks
• “All text has an
encoding”
Unicode
• 17 planes of goodness
– 1.1 million potential
code points
– 150,000 assigned code
points
• 3 encodings
– UTF-32
– UTF-16
– UTF-8
• Normalize
• Bidi
• Collation
• Case folding
• … and so much more
Q&A
Would you write the code for I18N on the
whiteboard before you go?
#define UNICODE
#import I18N.h