Character Encodings
& Unicode
Unicode: A Grand Tour
This presentation and its associated materials licensed under a
Creative Commons Attribution-Noncommercial-No Derivative
Works 2.5 License.
You may use these materials without obtaining permission from the
author. Any materials used or redistributed must contain this notice.
[Derivative works may be permitted with permission of the author.]
This work is copyright © 2008 Addison P. Phillips
Addison Phillips
Globalization Architect, Lab126
This Presentation
“Internationalization and Unicode Conference”
Tutorial
Globalization Architect,
Lab126
(Yes, you can touch my Kindle)
Chair,
W3C Internationalization WG
Editor,
IETF LTRU-WG (BCP 47)
Internationalization
the design and development of a product that is
enabled for target audiences that vary in culture,
region, or language. [W3C]
a fundamental architectural approach to software
development
Opinions differ on
capitalization (C12N);
choose from:
i18n
i18N
I18n
I18N
Very geeky; not very
internationalized
(I19G?)
Mystic Numbering (M4C N7G)
I  N  T  E  R  N  A  T  I  O  N  A  L  I  Z  A  T  I  O  N
   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
I18N = “I” + 18 letters + “N”
Localization = L10N
Globalization = G11N
Canonicalization = C14N
The basics of text processing in software.
“Character encodings consume more than 80% of my work day.
They are the source of more misinformation and confusion
than any other single thing. And developers aren’t getting
any better educated.”
~Glen Perkins
Globalization Architect
Real Jargon:
Multibyte; Variable width; Wide character; Character encoding;
Coded character set; Bidi or bidirectional; Glyph, character, code unit
Potentially Bogus Jargon:
kanji; double-byte language; extended ASCII; ANSI, OEM; encoding agnostic
Unicode
“bits”: 010000010101101101101000
“byte” or “octet”: 01000001 (0x41)
code unit: a unit of physical storage and information interchange
• represent numbers
• come in various sizes (e.g. 7, 8, 16, 32, 64 bits)
how do we map text to the numbers used by computers?
Glyphs
A “glyph” is a screen unit of text: a picture
of what users think of as a character.
A “grapheme” is a single visual unit of text.
À
U+00C0
Characters
A “character” is a single logical unit of text.
A “character set” is a set of characters.
A “code point” is a number assigned to a
character in a character set.
A “coded character set” is a character set
where each character has a code point.
Bytes
A “character encoding” maps a sequence
of code points (“characters”) to a sequence
of code units (such as bytes).
A “code unit” is a single logical unit of
storage.
… 0xC3 0x80 …
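A quick Python sketch (my example, not from the slides) makes the code point vs. code unit split concrete:

```python
# One character (code point U+00C0) vs. its UTF-8 code units (two bytes).
ch = "\u00c0"                 # À
print(f"U+{ord(ch):04X}")     # the code point: U+00C0
print(ch.encode("utf-8"))     # the code units: b'\xc3\x80'
```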
Collection (repertoire) of characters, that is: a set.
Organized so that each character has a unique numeric
(typically integer) value (code point).
Examples:
Unicode
ASCII (ANSI X3.4)
ISO 646
JIS X 0208
Latin-1 (ISO 8859-1)
Character sets are often
associated with a particular
language or writing system.
Maps a sequence of code points (characters) to a
sequence of code units (e.g. bytes).
Some encodings use another unit instead of the byte.
For example, some encodings use a 16-bit, 32-bit, or 64-bit code unit.
U+00C0
0xC3 0x80
In memory, on disk, on the network, etc.
All text has a character
encoding
When things go wrong, start by asking what the
encoding is, what encoding you expected it to be,
and whether the bytes match the encoding.
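That debugging advice can be demonstrated with a small Python sketch (my example): the same bytes decoded with the wrong encoding produce classic mojibake.

```python
data = "À".encode("utf-8")    # the bytes 0xC3 0x80
print(data.decode("utf-8"))   # "À"  : the bytes match the encoding
print(data.decode("cp1252"))  # "Ã€" : wrong encoding assumed, mojibake
```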
Tofu: hollow boxes
Mojibake: garbage characters
Question Marks: (conversion not supported)
Can appear as either
hollow boxes (empty
glyph) or as question
marks (Firefox, for
example)
Not usually a bug: it’s a
display problem
Can mask or
masquerade as
character corruption.
When Good Characters Go Bad
View text using the wrong encoding
Convert to or from the wrong encoding
Apply a transfer encoding and forget to remove it
Convert to an encoding twice
Overzealous escaping
Conversion to entities (“entitization”)
Multiple conversions
7 bits = 2^7 = 128 characters
Enough for “U.S. English”
ASCII for
characters 0x00
through 0x7F
Accented letters
and other symbols
0x80 through 0xFF
char È:
Cp1252 → 0xC8
Cp437 → ? (not available)
Cp850 → 0xD4
Windows’s encodings
(called “code pages”)
are generally based on
standard encodings—
plus some additional
characters.
Example:
CP 1252 is based on ISO
8859-1, but includes 27
“extra” characters in the
C1 control range (0x80-0x9F)
Originally an IBM
character encoding term.
IBM numbered their
character sets with
“CCSIDs” (coded character
set ids) and numbered the
corresponding character
encodings as “code pages”.
Microsoft borrowed code
pages to create PC-DOS.
Microsoft defines two kinds
of code pages:
“ANSI” code pages are the
ones used by Windows GUI
programs.
“OEM” code pages are the
ones used by command
shell/command line
programs.
Neither “ANSI” nor “OEM”
refer to a particular encoding
standard or standards body in
this context.
Avoid the use of ANSI and
OEM when referring to
encodings.
So far we’ve been
looking at single-byte
encodings:
one byte per character
1 byte = 1 character (= 1
glyph?)
256 character maximum
Good enough for most
alphabetic languages
À
Some languages need more
characters.
What about the “double-byte”
languages?
Don’t those take two bytes per
character?
丏丣並
Escape sequences to select
another character set
Example: ISO 2022 uses escape
sequences to select various
encodings
Use a larger code unit (“wide”
character encoding)
Example: IBM DBCS code
pages or Unicode UTF-16
2^16 = 64K characters
2^32 = 4.2 billion characters
Use a variable-width encoding
Variable width encodings
use different numbers of
code units to represent
different types of
characters within the same
encoding
One or more bytes per character
1 byte != 1 character
May use 1, 2, 3, or 4 bytes per
character
May use shift or escape
sequences
May encode more than one
character set
In fact, single-byte encodings are
a special case of multibyte!
Multibyte Encoding: Any
“variable-width” encoding that
uses the byte as its code unit.
JIS X 0213: A “Multibyte” Character Set
11,233 characters
two 94×94 character planes
Specific byte ranges
encoding characters that
take more than one byte.
A “lead byte”
One or more “trailing bytes”
あ (code point 1-4-1) → 0x82 0xA0
A (code point 1-3-33) → 0x41
Code point != code unit
[Diagram: a two-byte character is a lead byte followed by a trail byte; an ASCII character is a single byte.]
In order to reach
more characters,
Shift_JIS
characters start
with a limited
range of “lead
bytes”
These can be
followed by a
larger range of
byte values
(“trail byte”)
Lead bytes can be
trail byte values
Trail bytes include
ASCII values
Trail bytes include
special values such
as 0x5C (“\”)
char *pos = strchr(mybuf, '@'); /* bug: '@' (0x40) can be a Shift_JIS trail byte */
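The C line above scans bytes for a character that can also occur as a trail byte. The same pitfall, sketched in Python (my example, using the 0x5C trail byte mentioned above):

```python
# In Shift_JIS, katakana ソ encodes as 0x83 0x5C; the trail byte 0x5C is ASCII '\'.
data = "ソース".encode("shift_jis")
print(data)             # raw bytes, including the embedded 0x5C
print(b"\\" in data)    # True  : a byte-level scan finds a bogus backslash
print("\\" in "ソース")  # False : the text contains no backslash at all
```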
Stateful Encodings
ex. IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters]
ISO 2022 [escape sequence changes character set being
encoded]
A transfer encoding syntax is a reversible transform of encoded
data which may (or may not) include textual data represented in
one or more character encoding schemes.
Email headers
URIs
IDN (domain names)
Abcソース
=?UTF-8?B?QWJj44K944O844K5?=
Abcソース
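The round trip above can be reproduced with Python's stdlib MIME encoded-word support (an illustrative sketch, not from the slides):

```python
from email.header import decode_header

# The transfer encoding is reversible: decode the Base64 payload,
# then decode the resulting bytes with the declared charset.
raw, charset = decode_header("=?UTF-8?B?QWJj44K944O844K5?=")[0]
print(raw.decode(charset))  # Abcソース
```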
Common Encoding Conversion Tools and Libraries
• iconv (Unix)
• ICU (C, C++, Java)
• perl Encode
• Java (native2ascii, IO/NIO)
• (etc.)

Merging content from multiple sources:
Templates (ISO 8859-1) + Content (UTF-8) + Data (Shift_JIS) → Process → Output (HTML, XML, etc.)
Document formats often require a single character encoding be used for all parts of the document.
When data is merged, the encodings must be merged also (or some of the data will be “mojibake”).
Converting everything to ISO 8859-1:
ÀàС£ (ISO 8859-1) → ÀàС£
детски (UTF-8) → ??????
»èçينس文字 (UTF-8) → »èç?????
文字化け (Shift_JIS) → ????
? (0x3F) is the replacement character for ISO 8859-1
Encoding conversion acts as a “filter”
Replacement characters (“question marks”) replace
characters from the source character set that are not
present in the target character set.
Need for more converters and
conversion maps
Difficulty of passing, storing,
and processing data in multiple
encodings
Too many character sets…
…leads to what we call “code
page hell”
Basic Principles
Universal repertoire
Logical order
Efficiency
Unification
Characters, not glyphs
Dynamic composition
Semantics
Stability
Plain Text
Convertibility
Fights mojibake
because:
characters are from the
common repertoire;
characters are encoded
according to one of the
encoding forms;
characters are
interpreted with
Unicode semantics;
unknown characters are
not corrupted
Unicode is a character set that supports all of the world’s
languages and writing systems.
Code space of 0x110000 code points, U+0000 through
U+10FFFF (about 1.1 million)
Unicode and ISO 10646 are maintained in sync.
Unicode is maintained by an industry consortium.
ISO 10646 is maintained by the ISO.
Divide Unicode into equal-sized
regions of code points.
17 planes (0 through 0x10),
each with 65,536 code points.
Plane 0 is called the Basic
Multilingual Plane (BMP).
> 99% of text in the wild lives
in the BMP
Planes 1 through 0x10 are
called supplementary planes.
An organized
collection of
characters.
Each character has a
code point
aka Unicode Scalar Value
(USV)
U+0041 <= hex
notation
code point
name
character class
combining level
bidi class
case mappings
canonical decomposition
mirroring
default grapheme clustering
ӑ (U+04D1)
CYRILLIC SMALL LETTER A WITH BREVE
letter
non-combining
left-to-right
decomposes to U+0430 U+0306
Ӑ U+04D0 is uppercase (and titlecase)
Many characters were included in
Unicode for round-trip conversion
compatibility with legacy encodings:
①②③45Ⅵ
¾Lj¼Nj½dž
︴︷︻︽﹁﹄
ヲィゥォェュ゙
ﺲﺳﻫﺽﵬﷺ
fiflffifflſtﬔ
Compatibility Characters
includes presentation forms
legacy encoding: a term for non-Unicode character encodings.
U+FEFF
Used to indicate the “byte-order” of UTF-16 code units
0xFE FF; 0xFF FE
Also used as a Unicode signature by some software (Windows’s
Notepad editor, for example) for UTF-8
0xEF BB BF
Appears as a character or renders as
junk in some formats or on some
systems. For example, older browsers
render it as three bytes of mojibake.
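The signature byte sequences above are available as constants in Python's stdlib (an illustrative sketch):

```python
import codecs

print(codecs.BOM_UTF16_BE)      # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)      # b'\xff\xfe'
print(codecs.BOM_UTF8)          # b'\xef\xbb\xbf'
# The explicit -be/-le codecs never write a BOM; the generic "utf-16" does.
print("A".encode("utf-16-be"))  # b'\x00A', no BOM
```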
U+FFFD
Indicates a bad byte
sequence or a
character that could
not be converted.
Equivalent to
“question marks” in
legacy encoding
conversions
�
there was a character here,
but it is gone now
Composition can create “new” characters
Base + non-spacing (“combining”) characters
A+˚ = Å
U+0041 + U+030A = U+00C5
a+ˆ+.=ậ
U+0061 + U+0302 + U+0323 = U+1EAD
a+.+ˆ=ậ
U+0061 + U+0323 + U+0302 = U+1EAD
ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท์
ญั = ญ + ั
glyph = consonant + vowel
ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท์ (word boundaries)
What is Unicode?
यू नि को ड क्या है ?
यू नि को ड
य ू न ि क ो ड (the seven characters)
न + ि = नि
கொ (“ko”) is stored as க (U+0B95 Ka) + U+0BC6 (E) + U+0BBE (Aa).
Combining mark drawn to the left of the base character
UTF-32
Uses 32-bit code units.
All characters are the same width.
UTF-16
Uses 16-bit code units.
BMP characters use one 16-bit code unit.
Supplementary characters use two special 16-bit code units: a
“surrogate pair”.
UTF-8
Uses 8-bit code units (bytes!)
It’s a multi-byte encoding!
Characters use between 1 and 4 bytes.
ASCII is ASCII in UTF-8
A (U+0041):    UTF-32 0x00000041   UTF-16 0x0041         UTF-8 0x41
À (U+00C0):    UTF-32 0x000000C0   UTF-16 0x00C0         UTF-8 0xC3 0x80
ቑ (U+1251):    UTF-32 0x00001251   UTF-16 0x1251         UTF-8 0xE1 0x89 0x91
𐌸 (U+10338):   UTF-32 0x00010338   UTF-16 0xD800 0xDF38   UTF-8 0xF0 0x90 0x8C 0xB8
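The same table can be generated in Python (a sketch of mine, using the big-endian codecs so the bytes read in code-unit order):

```python
for ch in ["A", "\u00c0", "\u1251", "\U00010338"]:
    print(f"U+{ord(ch):04X}",
          "UTF-32:", ch.encode("utf-32-be").hex(" "),
          "UTF-16:", ch.encode("utf-16-be").hex(" "),
          "UTF-8:", ch.encode("utf-8").hex(" "))
```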
Uses 32-bit code units (instead of the more-familiar 8-
bit code unit, aka the “byte”)
Each character takes exactly one code unit.
U+1251 ቑ → 0x00001251
U+10338 𐌸 → 0x00010338
Easy to process
each logical character
takes one code unit
can use pointer arithmetic
Not commonly used
Not efficient for storage
11 bits are never used
BMP characters are the
most common—16 bits
wasted for each of these
Affected by processor
architecture (Big-Endian
vs. Little-Endian)
Uses 16-bit code units (instead of the more-familiar 8-
bit code unit, aka the “byte”)
BMP characters use one unit
Supplementary characters use a “surrogate pair”, special code
points that don’t do anything else.
U+1251 ቑ → 0x1251
U+10338 𐌸 → 0xD800 0xDF38
High Surrogate
Low Surrogate
0xD800-DBFF
0xDC00-DFFF
Unique Ranges!
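The surrogate-pair arithmetic can be written out directly; `to_surrogates` below is my own illustrative helper, not an API from the slides.

```python
def to_surrogates(cp):
    """Split a supplementary code point (>= 0x10000) into a UTF-16 surrogate pair."""
    v = cp - 0x10000                            # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

high, low = to_surrogates(0x10338)
print(hex(high), hex(low))                      # 0xd800 0xdf38
print("\U00010338".encode("utf-16-be").hex())   # d800df38, matches the codec
```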
Most common languages
and scripts are encoded in
the BMP.
Less wasteful than UTF-32
Simpler to process
(excepting surrogates)
Commonly supported in
major operating
environments,
programming languages,
and libraries
May not be suitable for all
applications
Affected by processor
architecture (Big-Endian
vs. Little-Endian)
Requires more storage, on
average, for Western
European scripts, ASCII,
HTML/XML markup.
7-bit ASCII encodes as itself
All other characters take 2, 3, or 4 bytes each
lead bytes have a special bit pattern
trailing bytes range from 0x80 through 0xBF
Lead Byte    Trail Bytes                     Code Points
0xxxxxxx     (none)                          < 0x80
110xxxxx     10xxxxxx                        < 0x800
1110xxxx     10xxxxxx 10xxxxxx               < 0x10000
11110xxx     10xxxxxx 10xxxxxx 10xxxxxx      Supplementary
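Those bit patterns translate directly into code; `utf8_bytes` is my own hand-rolled sketch (real code should just use a codec), checked against Python's encoder.

```python
def utf8_bytes(cp):
    """Hand-rolled UTF-8 encoder following the bit-pattern table (sketch)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_bytes(0x1251).hex())  # e18991, same as "\u1251".encode("utf-8")
```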
ASCII-compatible
Default or recommended
encoding for many
Internet standards
Bit pattern highly
detectable (over longer
runs)
Non-endian
Streaming
C char* friendly
Easy to navigate
Multibyte encoding
requires additional
processing awareness
Non-shortest form
checking needed
Less efficient than UTF-16
for large runs of Asian text
Set Web server to declare UTF-8 in HTTP Content-Type header
Declare UTF-8 in META tag header
Actually use UTF-8 as the encoding!!
<?php
header("Content-type: text/html; charset=UTF-8");
?>
<html>
<head>
<meta
http-equiv="Content-Type"
content="text/html; charset=UTF-8" />
<title>Fight 文字化け!</title>
</head>
It’s more than just a character set and some encodings…
Unicode provides additional information:
Character name
Character class
“ctype” information, such as if it’s a digit, number, alphabetic, etc.
Directionality (LTR, RTL, etc.) and the Bidi Algorithm
Case mappings (UPPER, lower, and Titlecase)
Default Collation and the Unicode Collation Algorithm (UCA)
Identifier names
Regular Expression syntaxes
Normalization
Compatibility information
Many of these items are in the form of Unicode Technical Reports
http://www.unicode.org/reports
Unicode Normalization has to deal
with more issues:
• single or multiple combining marks
• compatibility characters
• presentation forms
(Case variants such as Abc, ABC, abc, abC, aBc are not unified by normalization.)
Ways to represent Ǻ:
U+01FA
U+00C5 U+0301
U+00C1 U+030A
U+212B U+0301
U+0041 U+0301 U+030A
U+0041 U+030A U+0301
Form D
canonical decomposition
Form C
canonical decomposition
followed by composition
Form KD
kompatibility
decomposition
Form KC
kompatibility
decomposition followed by
composition
Ǻ: the four normalization forms of each representation

Original               Form C          Form D                 Form KC         Form KD
U+01FA                 U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+00C5 U+0301          U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+00C1 U+030A          U+00C1 U+030A   U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A
U+212B U+0301          U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301
U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A   U+00C1 U+030A   U+0041 U+0301 U+030A
U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301   U+01FA          U+0041 U+030A U+0301

(Note: U+0301 and U+030A have the same combining class, so normalization does not
reorder them; Á + combining ring is not canonically equivalent to Å + combining acute.)
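The table can be verified with Python's unicodedata module (my example, not from the slides):

```python
import unicodedata

def cps(s):
    # Render a string as a list of code points, e.g. "U+0041 U+030A".
    return " ".join(f"U+{ord(c):04X}" for c in s)

for src in ("\u01fa", "\u00c5\u0301", "\u212b\u0301", "\u0041\u030a\u0301"):
    print(cps(src), "|",
          "NFC:", cps(unicodedata.normalize("NFC", src)),
          "NFD:", cps(unicodedata.normalize("NFD", src)))
```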
Not all compatibility characters have a compatibility
decomposition.
Not all characters that look alike or have similar
semantics have a compatibility decomposition.
For example, there are many ‘dots’ used as a period.
Not all character variations are handled by
normalization.
For example, upper, title, and lowercase variations.
Normalization can remove meaning
Some languages are written
predominantly from left-to-right (LTR).
Some languages are written
predominantly from right-to-left (RTL).
(A few can be written top-to-bottom
or using other schemes.)
Unicode defines character
“directionality” and a “Bidi”
algorithm for rendering text.
Uses logical, not visual,
order.
Uses levels of
“embedding”.
Requires markup changes
in some HTML for full
support.
Characters are encoded in logical order.
Visual order is determined by the layout.
Override and bidi control characters
“Indeterminate” characters
Paste in Arabic
Defines default collation algorithm and sequences
(UTS#10)
Must be tailored by language and “locale” (culture) and
other variations.
Language/usage customizations:
Swedish: z < ö
German: ö < z
German dictionary: öf < of
German telephone: of < öf
Upper-first: A < a
Lower-first: a < A
Find grapheme, word, and line-break boundaries in
text.
• Tailored by language
• Provides good basic default handling
Remember “all text
has an encoding”?
user input via forms
email
data feeds
existing, legacy data
database instances
uploads
Use UTF-8 for HTML and
Web forms
Use UTF-8 in your APIs
Check that data really is UTF-8
Control encoding via code;
avoid hard-coding the
encoding
Watch out for legacy
encodings
Convert to Unicode as soon
as practical.
Convert from Unicode as
late as possible.
Wrap Unicode-unfriendly
technologies
Map Your System
APIs: use a Unicode encoding; hide the internal storage encoding.
Data Stores, Local I/O: use a Unicode encoding; consider an encoding conversion plan.
Front Ends: detect/convert at the interface so only Unicode enters the “Unicode cloud”.
Back Ends, External Data: Does it use Unicode? If not, what encoding? Store the encoding!
Input: capture the encoding; detect/convert legacy encodings to Unicode.
Legacy interfaces: convert to/from the legacy encoding at the edge of your system.
Be aware of whether you need to
count glyphs, characters, or bytes:
Is the limit “screen positions”,
“characters”, or “bytes of storage”?
Should you be using a different limit?
Which one are you actually counting?
varchar(110)
यूनिकोड
(4 glyphs)
य ू न ि क ो ड
(7 characters)
E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1
(21 bytes)
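Two of the three counts fall straight out of Python (my sketch); grapheme counting is the one that needs extra machinery.

```python
text = "यूनिकोड"
print(len(text))                  # 7  : characters (code points)
print(len(text.encode("utf-8")))  # 21 : bytes of storage
# Counting glyphs (grapheme clusters) needs a segmentation library,
# e.g. the third-party regex module's \X; len() alone cannot give 4.
```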
Code unit
Code point
Character
Glyph
Multibyte encoding
Tofu
Mojibake
Question Marks
“All text has an
encoding”
17 planes of goodness
1.1 million potential code
points
150,000 assigned code
points
3 encodings
UTF-32
UTF-16
UTF-8
Normalize
Bidi
Collation
Case folding
… and so much more
Q&A
“Would you
please write the
code for I18N on
the whiteboard
before you go?”
#import i18n.h
#define
UNICODE