Internationalization


INTERNATIONALIZATION:
AN INTRODUCTION
Presenter and Presentation
 Addison Phillips
 Globalization Architect
 This Presentation
 “Internationalization and Unicode Conference” Tutorial
 Covers Internationalization and basic concepts, such as
character encodings
Who is this guy?
 Globalization Architect,
Lab126
(you know us as “Amazon Kindle”)
 Chair,
W3C
Internationalization Core WG
 Editor
IETF LTRU-WG
Internationalization is:
 the design and development of a product that is
enabled for target audiences that vary in culture,
region, or language. [W3C]
 a fundamental architectural approach to software
development
Related Concepts
 Localization: creation of a product tailored to a
particular target market
 Translation: process of converting text from one
language to another
 Globalization: unified approach to creating global
products, especially those that support multiple
geographies simultaneously
Opinions differ on
capitalization (C12N);
choose from:

i18n
I18n
i18N
I18N
Very geeky; not very
internationalized
(I19G?)
Mystic Numbering (M4C N7G)
I + 18 letters (nternationalizatio) + N = I18N
Localization = L10N
Globalization = G11N
Canonicalization = C14N
A Global Approach
 Internationalization turns technical problems into
business decisions
 Balance priorities based on real user
distribution/requirements



Consider global user population as a whole
Consider specific market requirements on an equal footing
Potential markets for the product
Buy In: The Key to Success
 For internationalization to be a success over time,
there must be commitment:



Management
Product Team
Development Team
•
All developers, not a splinter group
Addressable Market: Why Do It?
Name                       North America          Western Europe         Asia Pacific
                           Revenue    % of WW     Revenue    % of WW     Revenue    % of WW
Computer Associates        $320M      58%         $165M      30%         $36M       7%
IBM                        $4,030M    42%         $3,516M    37%         $1,406M    15%
Microsoft                  $1,894M    49%         $1,195M    31%         $530M      14%
Oracle                     $2,469M    41%         $2,001M    33%         $1,010M    17%
Sun                        $127M      54%         $60M       26%         $37M       16%
Sybase                     $370M      59%         $162M      26%         $65M       10%
All IDC tracked companies  $19,653M   48%         $13,294M   32%         $6,055M    15%
Globalized Product Development
Internationalization turns technical problems
into business decisions.




Localization: Choose which markets to translate user
interface or documentation for with no engineering.
Deployment: Choose whether to serve applications from a
single site, cluster of sites, or in each target market.
Development: Add content and features to products as
necessary in each target market.
Integration and Interoperability: Servers and products can
work together around the world, so customers can truly
create “Enterprise” solutions.
Aspects of Internationalization
 Enabling—the same code supports multiple regions or
cultures. Sometimes called a “global binary”.
 Externalization—plan for localizability by separating
“content” from code. This makes localization for specific
languages, regions, or cultures easy, fast, and cheap.
 Customization—add culturally specific functionality,
presentation, or content to an application.
What, me worry?
 We (wrote it in Java/C#, used Unicode, etc.), so it is internationalized.
We made the assumption that the product
would only ever have English screens: all
our users understand it anyway.
A localized product is internationalized.
An internationalized product is slow/slower.
It takes longer to write internationalized
code.
We can’t read the screens/it is too hard to
test.
We have no intention of localizing, so no
need to internationalize.
We don’t have any customers there.
The users in (some country) never
complained, so it must work.
This product is 100% fully internationalized.
Development Methodologies
 Independent of
development methodology

Agile? Waterfall? You make
the choice.
 Encompasses the full
development cycle:





Design
Development
QC
Release
Support
(Cycle diagram labels: Develop Requirements (all customers); Develop Roadmap (global deployment); Develop Requirements & Architecture; Design (internationalized); Code (enable, externalize, customizable); Test (non-English/non-ASCII); RTM/GA (by market))
The Customization Approach
 “Internationalization is something remedial”
 “Didn’t we do internationalization in the last release?!?”
 Internationalization involves a lot of arcane knowledge (“we don’t
know what to do”)
 “It will interrupt or slow down development.”
 “International features are not important to our U.S. customers—
and they represent our largest market.”
 “The guys in-country have always figured it out before.”
 “Let’s outsource it”
 “We’ll get to it next time”
How That Model Really Looks
bug fixes
1.0
sexy new features
1.0a
2.0 Main Line
International Branch
functionality
gaps: intl
users waiting
for 2.0i now
Merges and Fixes
Lots more people
and cost
1.0i
Lost $ and opportunity
lots of cost to get there
Time
International
Release 1.0
The Problem with Customization
 Code forks. (double, triple coding)
 Lag time for international releases.
 Non-adoption of localized release.
 Full regression of every language.
 Quality or commitment perception.
 Lack of data exchange between language versions.
 Difficult to repeat (every version is a repeat)
 Proliferation of bugs and of support problems.
 International features are cancelled.
 Core product still doesn’t work/can’t address similar
markets.
 Loss of market share.
The Internationalization Approach
 Gather requirements globally
 Enable
 Externalize
 Customize
 Test and support globally
 Localize
Analyzing and Developing a Design
LARGE ANIM AL PICTURES
Large Animal Pictures
Resources
Input
Global Code
Software
Component
I/O
Output
Enterprise Animal Pictures
clients
ogic
API
data feed
partner
or provider
Business Logic
API
Front End
Data Store
Business Logic
Operating Env.
Data Store
Operating Env.
API
Internationalization Issues
 Text Processing

Character encodings, including Unicode, spelling, word breaks, collation,
and so on
 Language


Of the software (localization)
Of solutions built using the software (localizability, data)
 Locale-affected formats

dates, numbers and the like
 Regionally-affected formats

names, addresses, currency, and the like
 Time-related issues

time zone, calendar, holidays, work rules and the like
 Cultural adaptation

presentation, style, position, color use, and the like
 Legal requirements

accessibility, SOX, DRM, moderation, security, content, and the like
“Well, it depends…”
Making Good Design Decisions
 Generalize designs
 Locale independent data structures
 Locale sensitive display
 Externalize cultural or linguistic variations
 Customize as a last resort
Levels of Enablement
 Not Enabled
 Single-Language-at-a-Time (SLAAT)
All components run in the same language and encoding environment
correctly.
 Multi-Locale
Unicode support; components run in different locales, languages,
encodings, and time zones
Test Your Assumptions
Gender:
 Male
× Female
Choose Your Language
How is this company doing?
Enabling
MAKING CODE AWARE OF CULTURE
What is “enabling”?
 Enabled software:
adapts the display, processing, validation, storage, and transmission
of data according to the cultural, linguistic, and regional needs of the
users
 Text, Characters, and Encodings
 Locale Awareness
 Times and Time Zones
A “global binary” is a single
object-code version that is
used in all markets, regardless
of localization.
Text and Encodings
The Biggest Source of Woe
“Character encodings consume more than 80% of my work
day. They are the source of more mis-information and
confusion than any other single thing. And developers
aren’t getting any better educated.”
~Glen Perkins
Globalization Architect
A lot of jargon
Real and bogus jargon you might encounter:
Real Jargon
Potentially Bogus Jargon
Multibyte
kanji
Variable width
double-byte language
Wide character
extended ASCII
Character encoding
ANSI
Coded character set
encoding agnostic
Bidi or bidirectional
Glyph, character, code unit
Unicode
How the computer sees the world
“bits”: 010000010101101101101000
“byte” or “octet”: 01000001 (0x41)
code unit: a unit of physical storage and information interchange
• represent numbers
• come in various sizes (e.g. 7, 8, 16, 32, 64 bits)
how do we map text to the numbers used by computers?
From text to bits
Glyphs


A “glyph” is a screen unit of text: it’s a picture of
what users think of as a character.
A “grapheme” is a single visual unit of text.
Characters




A “character” is a single logical unit of text.
A “character set” is a set of characters.
A “code point” is a number assigned to a
character in a character set.
A “coded character set” is a character set
where each character has a code point.
Bytes


A “character encoding” maps a sequence of
code points (“characters”) to a sequence of code
units (such as bytes).
A “code unit” is a single logical unit of storage.
… 0xC3 0x80 …
À
U+00C0
Coded Character Set
 Collection (repertoire) of characters, that is: a set.
 Organized so that each character has a unique numeric
(typically integer) value (code point).
 Examples:
 Unicode
 ASCII (ANSI X3.4)
 ISO 646
 JIS X 0208
 Latin-1 (ISO 8859-1)
Character sets are often
associated with a particular
language or writing system.
Character Encoding
 Maps a sequence of code points (characters) to a
sequence of code units (e.g. bytes).

Some encodings use another unit instead of the byte. For example,
some encodings use a 16-bit, 32-bit, or 64-bit code unit.
U+00C0
0xC3 0x80
(usually the most important slide in this entire presentation)
In memory, on disk, on the network, etc.
All text has a character
encoding
When things go wrong, start by asking what the
encoding is, what encoding you expected it to be,
and whether the bytes match the encoding.
Common Encoding Problems
Tofu
hollow boxes
Mojibake
garbage characters
Question Marks
(conversion not supported)
Tofu
Can appear as either
hollow boxes (empty
glyph) or as question
marks (Firefox, for
example)
 Not usually a bug: it’s a
display problem
 Can mask or
masquerade as
character corruption.

Mojibake
When Good Characters Go Bad
Sources of Mojibake
 View text using the wrong encoding
 Convert to or from the wrong encoding
 Apply a transfer encoding and forget to remove it
 Convert to an encoding twice
 Overzealous escaping
 Conversion to entities (“entitization”)
 Multiple conversions
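Two of these failure modes (viewing text in the wrong encoding, and converting to an encoding twice) can be reproduced in a few lines. The following is a minimal Java sketch; the class and method names (`MojibakeDemo`, `misread`, `doubleEncode`) are illustrative and not from the presentation.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Decode UTF-8 bytes as if they were ISO 8859-1 (the "wrong encoding" case).
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encode to UTF-8, misread as ISO 8859-1, then encode to UTF-8 again
    // (the "converted twice" case).
    static byte[] doubleEncode(String text) {
        return misread(text).getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as "0xNN 0xNN ..." for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00C0"; // À
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8))); // 0xC3 0x80
        System.out.println(hex(doubleEncode(original)));                    // 0xC3 0x83 0xC2 0x80
    }
}
```

The misread string is the two characters U+00C3 U+0080, which is why a single À turns into "Ã" plus an invisible control character on screen.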
Encoding Structure
EBCDIC
ASCII
 7 bits = 2^7 = 128 characters
 Enough for “U.S. English”
Latin-1(ISO 8859-1)
ASCII for
characters 0x00
through 0x7F
Accented letters
and other
symbols 0x80
through 0xFF
One character—many encodings!
char   Cp1252   Cp437   Cp850
È      0xC8     ?       0xD4
Windows Code Pages
Windows’s encodings
(called “code pages”)
are generally based on
standard encodings—
plus some additional
characters.
Example:

CP 1252 is based on ISO
8859-1, but includes 27
“extra” characters in the
C1 control range (0x80-0x9F)
Code Page
 Originally an IBM
character encoding term.
 IBM numbered their
character sets with
“CCSIDs” (coded
character set ids) and
numbered the
corresponding character
encodings as “code
pages”.
 Microsoft borrowed code
pages to create PC-DOS.
 Microsoft defines two kinds
of code pages:




“ANSI” code pages are the ones
used by Windows GUI programs.
“OEM” code pages are the ones
used by command
shell/command line programs.
Neither “ANSI” nor “OEM”
refer to a particular encoding
standard or standards body in
this context.
Avoid the use of ANSI and
OEM when referring to
encodings.
Beyond Single Byte Encodings
 So far we’ve been
looking at single-byte
encodings:




one byte per character
1 byte = 1 character (= 1
glyph?)
256 character maximum
Good enough for most
alphabetic languages
À
Some languages need more
characters.
What about the “double-byte”
languages?
Don’t those take two bytes per
character?
丏丣並
Methods of reaching beyond single-byte
 Escape sequences to select
another character set

Example: ISO 2022 uses escape
sequences to select various
encodings
 Use a larger code unit (“wide”
character encoding)



Example: IBM DBCS code pages
or Unicode UTF-16
2^16 = 64K characters
2^32 = 4.2 billion characters
 Use a variable-width encoding
Variable width encodings
use different numbers of
code units to represent
different types of
characters within the same
encoding
Multibyte Encodings
One or more bytes per character




1 byte != 1 character
May use 1, 2, 3, or 4 bytes per
character
May use shift or escape sequences
May encode more than one
character set
 In fact, single-byte encodings
are a special case of multibyte!
Multibyte Encoding: Any
“variable-width” encoding that
uses the byte as its code unit.
Simple Multibyte Encodings
 Specific byte ranges
encoding characters that
take more than one byte.


A “lead byte”
One or more “trailing bytes”
あ
A
1-4-1
1-3-33
(code point)
(code point)
0x82 0xA0
0x61
 Code point != code unit
(Diagram labels: lead byte, trail byte, single byte)
JIS X 0213
 11,233
characters
 (2) 94x94
character
planes
JIS X 0213: A “Multibyte” Character Set
Shift_JIS: A Multibyte Encoding
 In order to reach
more
characters,
Shift_JIS
characters start
with a limited
range of “lead
bytes”
 These can be
followed by a
larger range of
byte values
(“trail byte”)
Shift-JIS
Shift-JIS
 Lead bytes can be
trail byte values
 Trail bytes include
ASCII values
 Trail bytes include
special values
such as 0x5C (“\”)
char *pos = strchr(mybuf, '@');  /* '@' (0x40) can also be a trail byte */
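The hazard can be shown without any charset library by scanning raw bytes. The sketch below assumes the standard Shift_JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC) and uses the well-known case of ソ, which encodes as 0x83 0x5C; the class and method names are hypothetical.

```java
final class ShiftJisScan {
    // Shift_JIS lead bytes occupy 0x81-0x9F and 0xE0-0xFC.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Naive search: treats every byte as a character, like strchr().
    static int naiveIndexOf(byte[] text, int target) {
        for (int i = 0; i < text.length; i++) {
            if ((text[i] & 0xFF) == target) return i;
        }
        return -1;
    }

    // Lead-byte-aware search: skips the trail byte of each two-byte character.
    static int indexOf(byte[] text, int target) {
        for (int i = 0; i < text.length; i++) {
            int b = text[i] & 0xFF;
            if (isLeadByte(b)) {
                i++; // skip the trail byte that follows
            } else if (b == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // ソ (0x83 0x5C) followed by a genuine single-byte 0x5C.
        byte[] text = { (byte) 0x83, 0x5C, 0x5C };
        System.out.println(naiveIndexOf(text, 0x5C)); // 1 (false hit inside ソ)
        System.out.println(indexOf(text, 0x5C));      // 2 (the real 0x5C)
    }
}
```

The aware scan only works when it starts at a known character boundary, which is one reason such encodings are awkward to process.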
More Complex Multibyte Systems
 Stateful Encodings
 ex. IBM “MBCS” code pages [SI/SO shift between 1-byte
and 2-byte characters]
 ISO 2022 [escape sequence changes character set being
encoded]
Ad hoc Encodings
Encoding Conversion
Common Encoding
Conversion Tools
and Libraries
Templates
ISO 8859-1
• iconv (Unix)
Content
UTF-8
Process
Output
(HTML, XML, etc.)
• ICU (C, C++,
Java)
• perl Encode
• Java
(native2ascii,
IO/NIO)
Data
Shift_JIS
 Document formats
often require a single
character encoding
be used for all parts
of the document.
 When data is merged,
the encodings must be
merged also (or some
of the data will be
“mojibake”).
• (etc.)
Encoding Conversion as Filter
ISO 8859-1
ÀàÐ¡£
ISO 8859-1
UTF-8
детски
»èç‫ينس‬文字
ÀàÐ¡£
??????
»èç?????
????
UTF-8
ÀàÐ¡£
??????
»èç?????
????
Shift_JIS
文字化け
? (0x3F) is the replacement
character for ISO 8859-1
Encoding conversion acts as a “filter”

Replacement characters (“question marks”) replace characters
from the source character set that are not present in the
target character set.
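The filter effect is easy to observe in Java, where `String.getBytes(Charset)` substitutes the charset's replacement byte for anything it cannot map. This is a small sketch; `ConversionFilter` and `throughLatin1` are illustrative names.

```java
import java.nio.charset.StandardCharsets;

final class ConversionFilter {
    // Round-trip text through ISO 8859-1; anything outside Latin-1 is replaced by '?'.
    static String throughLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "ÀàÐ¡£" is in Latin-1; the Cyrillic word "детски" is not.
        String latin = "\u00C0\u00E0\u00D0\u00A1\u00A3";
        String cyrillic = "\u0434\u0435\u0442\u0441\u043A\u0438";
        System.out.println(throughLatin1(latin).equals(latin)); // true
        System.out.println(throughLatin1(cyrillic));            // ??????
    }
}
```

Once the question marks are written, the original characters cannot be recovered from the converted data.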
Too Many Fish in the Sea
 Need for more converters and
conversion maps
 Difficulty of passing, storing,
and processing data in multiple
encodings
 Too many character sets…
…leads to what we call “code
page hell”
Unicode / ISO-10646
The Idea Behind Unicode
 Basic Principles










Universal repertoire
Logical order
Efficiency
Unification
Characters, not glyphs
Dynamic composition
Semantics
Stability
Plain Text
Convertibility
 Fights mojibake
because:




characters are from the
common repertoire;
characters are encoded
according to one of the
encoding forms;
characters are interpreted
with Unicode semantics;
unknown characters are
not corrupted
Unicode (ISO 10646)
Unicode is a character set that supports all of the
world’s languages and writing systems.


Code space of up to 0x10FFFF characters (about 1.1 million)
Unicode and ISO 10646 are maintained in sync.


Unicode is maintained by an industry consortium.
ISO 10646 is maintained by the ISO.
What are “planes”?
 Divide Unicode in equal
sized regions of code
points.
 17 planes (0 through 0x10),
each with 65,536
code points.
 Plane 0 is called the Basic
Multilingual Plane (BMP).
 > 99% of text in the wild lives in
the BMP
 Planes 1 through 0x10 are
called supplementary
Unicode as the Universal Character Set
 An organized
collection of
characters.
 Each character has a
code point
aka Unicode Scalar Value
(USV)
 U+0041 <= hex
notation
Compatibility Characters
Many characters were included in
Unicode for round-trip conversion
compatibility with legacy encodings:
①②③４５Ⅵ
¾ǈ¼ǋ½ǆ
︴︷︻︽﹁﹄
ｦｨｩｫｪｭﾞ
Compatibility Characters
‫ﺲﺳﻫﺽﵬﷺ‬
fiflffifflﬅﬔ
includes presentation forms
legacy encoding: a term for non-Unicode character encodings.
Unicode Encodings
 UTF-32


Uses 32-bit code units.
All characters are the same width.
 UTF-16



Uses 16-bit code units.
BMP characters use one 16-bit code unit.
Supplementary characters use two special 16-bit code units: a “surrogate pair”.
 UTF-8




Uses 8-bit code units (bytes!)
It’s a multi-byte encoding!
Characters use between 1 and 4 bytes.
ASCII is ASCII in UTF-8
Unicode Encodings Compared
Character      UTF-32        UTF-16           UTF-8
A (U+0041)     0x00000041    0x0041           0x41
À (U+00C0)     0x000000C0    0x00C0           0xC3 0x80
ቑ (U+1251)     0x00001251    0x1251           0xE1 0x89 0x91
𐌸 (U+10338)    0x00010338    0xD800 0xDF38    0xF0 0x90 0x8C 0xB8
UTF-32
 Uses 32-bit code units (instead of the more-familiar
8-bit code unit, aka the “byte”)
 Each character takes exactly one code unit.
U+1251
U+10338
ቑ
0x00001251
𐌸
0x00010338
Advantages and Disadvantages of UTF-32
 Easy to process


each logical character takes
one code unit
can use pointer arithmetic
 Not commonly used

Not efficient for storage



11 bits are never used
BMP characters are the
most common—16 bits
wasted for each of these
Affected by processor
architecture (Big-Endian vs.
Little-Endian)
UTF-16
 Uses 16-bit code units (instead of the more-familiar
8-bit code unit, aka the “byte”)


BMP characters use one unit
Supplementary characters use a “surrogate pair”, special code
points that don’t do anything else.
0x1251
U+1251 ቑ
0xD800 0xDF38
U+10338 𐌸
High Surrogate
0xD800-DBFF
Low Surrogate
0xDC00-DFFF
Unique Ranges!
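The surrogate arithmetic is short enough to write out. The sketch below splits a supplementary code point into its high and low surrogates and cross-checks the result against the JDK's `Character.toChars`; the class name `SurrogatePairs` is illustrative.

```java
import java.util.Arrays;

final class SurrogatePairs {
    // Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10338);
        System.out.printf("0x%04X 0x%04X%n", (int) pair[0], (int) pair[1]); // 0xD800 0xDF38
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(0x10338))); // true
    }
}
```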
Advantages and Disadvantages of UTF-16
 Most common languages
and scripts are encoded in
the BMP.



Less wasteful than UTF-32
Simpler to process
(excepting surrogates)
Commonly supported in
major operating
environments, programming
languages, and libraries
 May not be suitable for all
applications


Affected by processor
architecture (Big-Endian vs.
Little-Endian)
Requires more storage, on
average, for Western
European scripts, ASCII,
HTML/XML markup.
UTF-8
 7-bit ASCII is itself
 All other characters take 2, 3, or 4 bytes each
 lead bytes have a special pattern
 trailing bytes range from 0x80->0xBF
Lead Byte   Trail Bytes                   Code Points
0xxxxxxx                                  < 0x80
110xxxxx    10xxxxxx                      < 0x800
1110xxxx    10xxxxxx 10xxxxxx             < 0x10000
11110xxx    10xxxxxx 10xxxxxx 10xxxxxx    Supplementary
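The bit patterns in the table translate directly into shifts and masks. The following hand-rolled encoder is a sketch for illustration only (production code would normally use the platform's converter); `Utf8Encoder` is a hypothetical name, and the result is checked against the JDK's UTF-8 encoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Encoder {
    // Encode one Unicode scalar value using the lead/trail patterns shown above.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (cp < 0x80) {                        // 0xxxxxxx
            out.write(cp);
        } else if (cp < 0x800) {                // 110xxxxx 10xxxxxx
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {              // 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xC0, 0x1251, 0x10338 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X matches JDK: %b%n", cp, Arrays.equals(mine, jdk));
        }
    }
}
```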
Advantages and Disadvantages of UTF-8
 ASCII-compatible
 Default or recommended encoding for many Internet standards
    Bit pattern highly detectable (over longer runs)
    Non-endian
    Streaming
    C char* friendly
    Easy to navigate
 Multibyte encoding
requires additional
processing awareness
 Non-shortest form
checking needed
 Less efficient than UTF-16
for large runs of Asian text
Byte Order Mark (BOM)
U+FEFF
 Used to indicate the “byte-order” of UTF-16 code units

0xFE FF; 0xFF FE
 Also used as a Unicode signature by some software (Windows’s
Notepad editor, for example) for UTF-8

0xEF BB BF
Appears as a character or renders as
junk in some formats or on some
systems. For example, older browsers
render it as three characters of mojibake (ï»¿).
The Replacement Character
U+FFFD
 Indicates a bad byte
sequence or a
character that could
not be converted.
 Equivalent to
“question marks” in
legacy encoding
conversions
�
there was a character here,
but it is gone now
Composing Characters Using Combining
Marks
 Composition can create “new” characters
 Base + non-spacing (“combining”) characters
A+˚ = Å
U+0041 + U+030A = U+00C5
a + ˆ + .= ậ
U+0061 + U+0302 + U+0323 = U+1EAD
a + .+ ˆ = ậ
U+0061 + U+0323 + U+0302 = U+1EAD
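These compositions can be checked with the JDK's `java.text.Normalizer`. The sketch below composes each sequence with Form C and compares it to the precomposed character; `ComposeDemo` and `nfc` are illustrative names.

```java
import java.text.Normalizer;

final class ComposeDemo {
    // Compose base + combining marks into precomposed characters where possible.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // A + combining ring above
        System.out.println(nfc("A\u030A").equals("\u00C5"));       // true
        // a + circumflex + dot below, in either order
        System.out.println(nfc("a\u0302\u0323").equals("\u1EAD")); // true
        System.out.println(nfc("a\u0323\u0302").equals("\u1EAD")); // true
    }
}
```

The two orders give the same result because the circumflex (above) and the dot (below) belong to different combining classes, so normalization is free to reorder them.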
Complex Scripts
ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท์
ญั = ญ + ั
glyph = consonant + vowel
ญัตติที่เสนอได้ผ่านที่ประชุมด้วยมติเอกฉันท์ (word boundaries)
Hindi
What is Unicode?
यू नि को ड क्या है ?
यू नि को ड
य + ू + न + ि + क + ो + ड
न + ि = नि
Tamil
க ொ
‘ko’
U+0B95
U+0BBE
U+0BC6
Combining mark drawn to the left of the base character
Normalization
Unicode Normalization has to deal
with more issues:
• single or multiple combining marks
Abc
ABC
abc
abC
aBc
• compatibility characters
• presentation forms
Ǻ
U+01FA
U+00C5 U+0301
abc
U+00C1 U+030A
U+212B U+0301
U+0041 U+0301 U+030A
U+0041 U+030A U+0301
Four Normalization Forms
Ǻ
ways to represent:
U+01FA
U+00C5 U+0301
U+00C1 U+030A
U+212B U+0301
U+0041 U+0301 U+030A
U+0041 U+030A U+0301
 Form D
canonical decomposition
 Form C
canonical decomposition
followed by composition
 Form KC
kompatibility decomposition
followed by composition
 Form KD
kompatibility decomposition
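The difference between the canonical and compatibility forms shows up clearly on a precomposed letter and on a compatibility ligature. This is a minimal sketch using `java.text.Normalizer`; the helper names are hypothetical.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

final class NormalizationForms {
    // Render a string as a list of code points, e.g. "U+0041 U+030A".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    static String normalize(String s, Form form) {
        return codePoints(Normalizer.normalize(s, form));
    }

    public static void main(String[] args) {
        String aRing = "\u00C5";    // Å, precomposed
        String ligature = "\uFB01"; // ﬁ, a compatibility character
        for (Form form : Form.values()) {
            System.out.println(form + ": " + normalize(aRing, form)
                    + " | " + normalize(ligature, form));
        }
    }
}
```

Forms C and D leave the ligature alone, because its decomposition is a compatibility mapping; only the K forms replace it with the letters f and i.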
Normalization in Action
Ǻ
Original                Form C           Form D                  Form KC          Form KD
U+01FA                  U+01FA           U+0041 U+030A U+0301    U+01FA           U+0041 U+030A U+0301
U+00C5 U+0301           U+01FA           U+0041 U+030A U+0301    U+01FA           U+0041 U+030A U+0301
U+212B U+0301           U+01FA           U+0041 U+030A U+0301    U+01FA           U+0041 U+030A U+0301
U+0041 U+030A U+0301    U+01FA           U+0041 U+030A U+0301    U+01FA           U+0041 U+030A U+0301
U+00C1 U+030A           U+00C1 U+030A    U+0041 U+0301 U+030A    U+00C1 U+030A    U+0041 U+0301 U+030A
U+0041 U+0301 U+030A    U+00C1 U+030A    U+0041 U+0301 U+030A    U+00C1 U+030A    U+0041 U+0301 U+030A
U+0301 and U+030A share the same combining class (230), so normalization
never reorders them: the two sequences with the acute before the ring look
similar but are not canonically equivalent to U+01FA.
Unicode Defines Character Properties
Unicode provides additional information:










Character name
Character class
“ctype” information, such as if it’s a digit, number, alphabetic, etc.
Directionality (LTR, RTL, etc.) and the Bidi Algorithm
Case mappings (UPPER, lower, and Titlecase)
Default Collation and the Unicode Collation Algorithm (UCA)
Identifier names
Regular Expression syntaxes
Normalization
Compatibility information
Many of these items are in the form of Unicode Technical Reports
Unicode Character Database









code point
name
character class
combining level
bidi class
case mappings
canonical decomposition
mirroring
default grapheme clustering
ӑ (U+04D1)
CYRILLIC SMALL LETTER A WITH BREVE





letter
non-combining
left-to-right
decomposes to U+0430 U+0306
Ӑ U+04D0 is uppercase (and titlecase)
Bi-directional Scripts
 Some languages are
written predominantly
from left-to-right (LTR).
 Some languages are
written predominantly
from right-to-left (RTL).
 (A few can be written top-to-bottom or using other
schemes)
Unicode defines character
“directionality” and a “Bidi”
algorithm for rendering text.



Uses logical, not visual,
order.
Uses levels of “embedding”.
Requires markup changes in
some HTML for full support.
Embedding and “Logical Order”
Characters are encoded in logical order.
Visual order is determined by the layout.


Override and bidi control characters
“Indeterminate” characters
Bidirectional Embedding
Paste in Arabic
Transfer Encodings
 A transfer encoding syntax is a reversible transform of encoded
data which may (or may not) include textual data represented in
one or more character encoding schemes.
 Email headers
 URIs
 IDN (domain names)
Abcソース
=?UTF-8?B?QWJj44K944O844K5?=
Abcソース
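The email-header example above is a Base64 "encoded-word" wrapped around UTF-8 bytes. The sketch below reproduces it with `java.util.Base64`; it ignores the line-length limits that real mail software must respect, and `EncodedWord` is a hypothetical name rather than a mail library API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class EncodedWord {
    private static final String PREFIX = "=?UTF-8?B?";
    private static final String SUFFIX = "?=";

    // Wrap text as a Base64 ("B") encoded-word carrying UTF-8 bytes.
    static String encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getEncoder().encodeToString(utf8) + SUFFIX;
    }

    // Reverse the transform; assumes exactly the shape produced by encode().
    static String decode(String word) {
        String b64 = word.substring(PREFIX.length(), word.length() - SUFFIX.length());
        return new String(Base64.getDecoder().decode(b64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String subject = "Abc\u30BD\u30FC\u30B9"; // Abcソース
        String wire = encode(subject);
        System.out.println(wire);                         // =?UTF-8?B?QWJj44K944O844K5?=
        System.out.println(decode(wire).equals(subject)); // true
    }
}
```

Forgetting the `decode` step on the receiving side is the "apply a transfer encoding and forget to remove it" source of mojibake.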
“That’s great: I’ll just use Unicode”
 Remember “all text
has an encoding”?






user input via forms
email
data feeds
existing, legacy data
database instances
uploads
 Use UTF-8 for HTML and Web forms
    Use UTF-8 in your APIs
    Check that data really is UTF-8
    Control encoding via code; avoid hardcoding the encoding
    Watch out for legacy encodings
        Convert to Unicode as soon as practical.
        Convert from Unicode as late as possible.
        Wrap Unicode-unfriendly technologies
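"Check that data really is UTF-8" can be done with a strict decoder that reports errors instead of silently substituting replacement characters. The following is one possible sketch using `java.nio.charset`; `Utf8Check` and `isValidUtf8` are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "\u00C0".getBytes(StandardCharsets.UTF_8);       // 0xC3 0x80
        byte[] latin1 = "\u00C0".getBytes(StandardCharsets.ISO_8859_1); // 0xC0 alone
        byte[] overlong = { (byte) 0xC0, (byte) 0x80 };                 // non-shortest form
        System.out.println(isValidUtf8(good));     // true
        System.out.println(isValidUtf8(latin1));   // false
        System.out.println(isValidUtf8(overlong)); // false
    }
}
```

A check like this belongs at the boundary where data enters the system, before it is stored or merged with other text.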
Map Your System
APIs


use Unicode encoding
hide internal storage
encoding
Data Stores, Local I/O




Unicode
Interface
Unicode Cloud
API
Detect / Convert
use Unicode encoding
Back Ends, External
Data

Convert to
Legacy
use Unicode encoding
consider an encoding
conversion plan
Front Ends

Your System
Uses Unicode?
If not, what encoding?
Store the encoding!
Legacy
Encoding
Unicode
Capture
Encoding
Detect / Convert
Input
HTML
 Set Web server to declare UTF-8 in HTTP Content-Type header
 Declare UTF-8 in META tag header
 Actually use UTF-8 as the encoding!!
<?php
header("Content-type: text/html; charset=UTF-8");
?>
<html>
<head>
<meta
http-equiv="Content-Type"
content="text/html; charset=UTF-8" />
<title>Fight 文字化け!</title>
</head>
Counting Things
Be aware of whether you need to
count glyphs, characters, or bytes:


Is the limit “screen positions”, “characters”,
or “bytes of storage”?
Should you be using a different limit?
Which one are you actually counting?
varchar(110)
यूनिकोड
य + ू + न + ि + क + ो + ड
(4 glyphs)
(7 characters)
E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1
(21 bytes)
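The three counts can be taken separately in Java. In this sketch the byte and code point counts are exact; the grapheme count uses a character `BreakIterator`, whose segmentation rules vary between JDK versions, so treat that figure as an approximation. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

final class CountingThings {
    // Bytes of storage when the text is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Logical characters (Unicode code points).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate user-perceived characters via a character break iterator.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String word = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"; // यूनिकोड
        System.out.println("bytes:       " + utf8Bytes(word));  // 21
        System.out.println("code points: " + codePoints(word)); // 7
        System.out.println("graphemes:   " + graphemes(word));  // JDK-dependent
    }
}
```

A `varchar(110)` column may be limited by any of these three measures depending on the database, which is why the question "which one are you actually counting?" matters.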
Enabling Code
for formats and presentation
ADAPTING CODE TO LANGUAGE, REGIONAL,
AND CULTURAL VARIATION
Don’t Code What You Think You Know
 5/2/7
 sometime in February?
 1.234
sometime in May?
sometime in 2005?
 more than 1000? less than 2?
 number, time, currency?
morning or afternoon?
 4.32.MD
Date Formats
Culture   Format        Example
U.S.A.    mdy, /        2/16/05
France    dmy, .        16.2.05
France    dmy, -        16-2-05
CJKT      ymd, /        2005/2/16
CJKT      ymd, 年月日    2005年2月16日
Japan     e¥md,         平成17年2月16日
Japan     ¥md, /        17/2/16
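Rather than coding any of these patterns by hand, the same date can be handed to a locale-aware formatter. This sketch uses `java.time`; the exact strings it prints depend on the locale data shipped with the JDK in use, so no output is shown. `LocalizedDates` is an illustrative name.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

final class LocalizedDates {
    // Format a date in the short style that is conventional for the given locale.
    static String shortDate(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2005, 2, 16);
        Locale[] locales = { Locale.US, Locale.FRANCE, Locale.JAPAN };
        for (Locale locale : locales) {
            System.out.println(locale + ": " + shortDate(date, locale));
        }
    }
}
```

The stored value (`LocalDate`) stays locale-neutral; only the presentation changes with the locale.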
Time Formats
 U.S.A.: 4:00 p.m.
 France: 16.00
 Japan: 1600
 Japan: ごご4:00
 Korea: 오후 4:32
 Thai: 16:32 น.
 Albanian: 4.32.MD
 Arabic: 04:32 م
5:00 AM, 5:00 PM, 10:00 PM
Don’t forget to
identify time zone!
More Examples
Assumptions about date tokens:
USA: Sun, Mon, Tue (3 positions, titlecase)
French: lun. mar. mer. (four positions, lowercase)
Russian: Пн Вт Ср (two positions, Cyrillic)

USA: Jan, Feb, Mar (3 positions, titlecase)
French: janv. févr. mars avr. (variable (4 or 5) positions, lowercase)
Spanish (Spain): ene, feb, mar (not titlecase)
Spanish (Americas): Ene, Feb, Mar
Calendars: What Year Is It?
 Legal, ceremonial, or popular requirement
 Gregorian
2007
 Japan Emperor:
19 Heisei (平成19年 )
 Thailand (Buddhist):
2550 (Gregorian + 543)
 Chinese (traditional):
4704 (lunar)
 Hebrew
5767 (תשס״ז) (lunar)
 Hijri (Islamic)
1428 (lunar)
 Armenian
1456 (ԹՎ ՌՆԾԶ )
 etc. etc. etc.
Weekends and Holidays
 When is the weekend?

Friday is part of the weekend in some countries.
 Both official and unofficial holidays vary widely in number. Here are
a few to watch for:

USA: July 4, MLK, Presidents’ Day, Veterans Day, Flag Day, Columbus Day, Thanksgiving…
Japan: Golden Week
China: New Year’s
Britain: Guy Fawkes Day, Boxing Day
France: Bastille Day
Spain: Reyes Magos
Calendar Display
Number and List Formats
Grouping and decimal
separators:
England: 12,345.67
Germany: 12.345,67
Switzerland: 12’345,67
Swiss money: 12’345.67
France: 12 345,67
India: 12,34,567.89
France uses a non-breaking space!
India: number of digits in groupings changes!
List delimiters &
separators can conflict
French example:
2 345,67, 1 012,34, 45,67 (hard to read)
2 345,67; 1 012,34; 45,67 (easier to read)
Collation
(A FANCY WORD FOR “SORTING”)
English:
ABC...RSTUVWXYZ
German:
AÄB...NOÖ...SßTUÜV…YZ
Swedish/Finnish: AB...STUVWXYZÅÄÖ
Norwegian:
AB...VWXYÜZÆØÅ
Organizing Information
 “Alphabet” differences
 Additional information
 for example: yomi
 ASCII vs. the world
 Mixed information sets
“Should I be writing all of this down…”
 Wide range of
variation
 Obscure formats
 Difficult to obtain
reliable information
on formats
 Lots of work to
implement and
maintain
Enabling means not
having to know
(m)any of the
details
Supporting International Formats
 Use neutral data
structures


Makes code independent
of locale
Most data types are
locale-neutral:
 Boolean
 String, char
 Number classes
 Date, Calendar
 Encapsulate
formatting/validation
in a function



Format style chosen
dynamically at runtime
Format details don’t have
to be specified or
researched
APIs know the gory
details
Essence of Enabling
 Object to Presentation, Presentation to Object
 Integers
 Floats
 Percents
 Currencies
 Dates
user
java.lang.Locale
 Times
presentation
 Durations
 Collation (lists)
 Weights/measures/sizes
 Resources (user interface strings)
Locale
 an identifier or data structure that allows
programmers to access culturally and linguistically
affected functionality in a system.
Supporting International Formats:
Numbers
French vs. Suisse
NumberFormat
Demo Code
// Requires: java.text.NumberFormat, java.util.Currency, java.util.Locale
public String formatNumber(int column, Number n, Locale l) {
    NumberFormat format;
    Currency c;
    switch (column) {
    case 0:
        // column 0 shows the raw, locale-neutral value
        return n.toString();
    case 2:
        format = NumberFormat.getIntegerInstance(l);
        break;
    case 3:
        format = NumberFormat.getPercentInstance(l);
        break;
    case 4:
        format = NumberFormat.getCurrencyInstance(l);
        try {
            c = Currency.getInstance(l);
        } catch (IllegalArgumentException e) {
            // can get here if you specify a locale with no
            // country or one whose country isn't supported
            c = null;
        }
        if (c == null) {
            // getInstance returns null for a territory with no
            // currency (like my favorite territory 'AQ'),
            // in which case we use the Almighty Buck
            c = Currency.getInstance("USD");
        }
        format.setCurrency(c);
        break;
    case 1:
    default:
        format = NumberFormat.getInstance(l);
        break;
    }
    return format.format(n);
}
Collation
Example
Break Iterator
Break iterators allow
you to break text into
characters, words,
lines, and sentences.
In the demo, we use a
word break iterator to
find word-breaks. We
also use a character
break iterator to find
approximate glyph
breaks.
BreakIterator iter = BreakIterator.getWordInstance(b_locale);
iter.setText(str);
int last = iter.first(); // points to the start of the string
int pos = iter.next();   // so move to next break
int longest = 0;
while (pos != BreakIterator.DONE) {
    String sub = str.substring(last, pos).trim();
    // …
    last = pos;
    pos = iter.next();
}
Collator
Collator is the class that
does linguistically
correct sorting. In the
demo, it’s really easy to
use: Java Collections
can take a comparator
and do all the work
internally. All we have to
do is provide the right
one.
Collator nativeCol = Collator.getInstance(b_locale);
bMap = new TreeSet(nativeCol);
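To see what the comparator changes, compare a plain binary sort with a collated one. The sketch below is self-contained and uses only an English collator, since tailored orderings for other languages depend on the locale data available in a given JDK; `CollatorDemo` and its methods are illustrative names, not the demo's code.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

final class CollatorDemo {
    // Sort using the linguistic rules for the given locale.
    static List<String> sorted(List<String> words, Locale locale) {
        TreeSet<String> set = new TreeSet<>(Collator.getInstance(locale));
        set.addAll(words);
        return new ArrayList<>(set);
    }

    // Sort by raw UTF-16 code unit values (String's natural order).
    static List<String> binarySorted(List<String> words) {
        return new ArrayList<>(new TreeSet<>(words));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "Banana", "apple");
        System.out.println(binarySorted(words));           // [Banana, apple, cherry]
        System.out.println(sorted(words, Locale.ENGLISH)); // [apple, Banana, cherry]
    }
}
```

One design caveat: a `TreeSet` drops any string that the collator considers equal to one already present, so a list sorted with `List.sort(collator)` is the safer choice when duplicates or near-duplicates must be kept.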
Complex Types
 Data structures, APIs, or classes built from basic types
must include similar capabilities.



Store data in a locale-neutral or independent format.
Display in a language/regional/culturally sensitive manner
Convert from locale format to locale-neutral or locale-independent
storage format.
Design Time and Data Structures
 Identify your own “locale bias”
 Field names matter!
•
•

“Postal Code”, not “ZIP code”.
Family Name/Given Name, not First Name/Last Name
Avoid problematic fields
•
Postal address parsing? Area code? Etc.
Currency
 Currency formatting is
usually similar to number
formatting. But things can
vary widely here, too:





$1,100.00 [USA]
€1 100,00 [France-Euro]
¥1,100 [Japan]
1.100$00 Esc. [Portugal,
obsolete]
SFr. 1’000.00 [Switzerland]
 Currency associated with the
locale doesn’t always apply.
Store the currency type with
value.

Use ISO 4217 std. codes (USD,
JPY, EUR, RUB)
 Not always one symbol.
 Not always two decimal
places.
 $100 + ¥100 = $101
 Consider neutral displays!
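Storing the currency with the value can be as simple as a small value type. The `Money` class below is a hypothetical sketch, not part of the presentation or of the JDK: it pairs an amount with an ISO 4217 currency, refuses to add mismatched currencies, and prints a neutral display.

```java
import java.math.BigDecimal;
import java.util.Currency;

final class Money {
    final BigDecimal amount;
    final Currency currency;

    // isoCode is an ISO 4217 code such as "USD" or "JPY".
    Money(String amount, String isoCode) {
        this(new BigDecimal(amount), Currency.getInstance(isoCode));
    }

    private Money(BigDecimal amount, Currency currency) {
        this.amount = amount;
        this.currency = currency;
    }

    // $100 + ¥100 is not $200: mixing currencies needs an explicit conversion.
    Money plus(Money other) {
        if (!currency.equals(other.currency)) {
            throw new IllegalArgumentException(
                    "currency mismatch: " + currency + " vs " + other.currency);
        }
        return new Money(amount.add(other.amount), currency);
    }

    // Not always two decimal places: JPY has 0, USD has 2.
    int fractionDigits() {
        return currency.getDefaultFractionDigits();
    }

    // Neutral display: plain number plus the ISO code.
    @Override
    public String toString() {
        return amount.toPlainString() + " " + currency.getCurrencyCode();
    }

    public static void main(String[] args) {
        Money a = new Money("100.00", "USD");
        System.out.println(a.plus(new Money("1.50", "USD")));    // 101.50 USD
        System.out.println(new Money("100", "JPY").fractionDigits()); // 0
    }
}
```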
Being Locale Neutral
 Avoid or reduce locale-affected display to
increase portability

Use unambiguous formats, such as ISO 8601-like dates,
especially in log files and the like
•

2005-04-01 14:17:00 UTC
Use consistent formats (‘user locale’), especially in
columns or collections of data
Amount        Currency
351,234.56    USD
102,556.78    EUR
65,336.00     JPY
212,345.00    INR

Amount        Currency
351,234.56    USD
102 556,78    EUR
65336         JPY
2,12,345.00   INR
“String is the Thing”
 Text doesn’t get translated on the fly.
 Don’t use text as an identifier or foreign key.


Use ID Numbers or not-human-readable values instead of requiring text fields
to match.
“Intrinsic” data value versus “display” data value.
 Enumerated values displayed as strings.
 Use display strings.
Enumerated
ACCOUNTS_PAYABLE
Displayed
“Accounts Payable”
“pagável de clientes”
English-like Construction
 Concatenation
• String1 + string2
 Pluralization
• Dog + “s” = “dogs” (sheeps??)
 Lists
• 1.23, 2.23, 3.36
• 1,23, 2,23, 3,36?
This topic will be covered in greater depth in the section on localization.
Databases
 Most databases can only handle one collation sequence per instance or one collation per index.
• Remove reliance on alphalists.
• Self-collate short lists.
• Pre-collate long lists?
 Example: NLS_SORT controls the way Oracle returns data (collation sequence).
• Global environment variable.
• Not necessarily under your control.
• Indices are built on a predetermined or binary sort.
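"Self-collate short lists" can be sketched in Java with java.text.Collator: the application re-sorts what the database returned, using the user's locale rather than the database's binary order. The SelfCollate class and method names are assumptions for this example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper contrasting binary order with locale-aware collation.
final class SelfCollate {

    // What a binary (code point) sort returns: all uppercase letters precede lowercase.
    static List<String> binarySorted(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        Collections.sort(copy);
        return copy;
    }

    // Re-sort in the application using the collation rules of the user's locale.
    static List<String> collated(List<String> items, Locale userLocale) {
        List<String> copy = new ArrayList<>(items);
        copy.sort(Collator.getInstance(userLocale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> fromDatabase = Arrays.asList("cherry", "Banana", "apple");
        System.out.println(binarySorted(fromDatabase));
        System.out.println(collated(fromDatabase, Locale.ENGLISH));
    }
}
```

This is practical for short lists only; for long lists the sorting cost is why the slide asks whether to pre-collate.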
Enabling Summary
 Understand Encodings and Unicode
 All text has an encoding!
 Be Locale-Aware
 Create locale-neutral data structures
 Separate display from storage
It’s About Time
D AT E S , T I M E S , D U R AT I O N S , C A L E N D A R S A N D
TIME ZONES
Computer vs. Wall Time
 Incremental Time
• Clock ticks since epoch (the ticks and epoch vary)
• Usually UTC-based
 Field-based Time
• Zone independent: birth date, start date, end date
• Zone dependent: recurring meeting schedule
Time Zone
 a geographical region that has common rules for determining local time.
 These include:
• Offset from UTC
• Daylight Saving (Summer Time) behavior
• Historic changes in offset or DST behavior
• Political control
Time Zone Affected Scenarios
 Zone independent
• only “incremental” times are necessary
 Local time, past only
• future changes to time zone rules not applicable
• example: logging system
 Local time, both past and future
• time zone rule changes may affect some time values
• example: calendar program
 Floating times
• events not tied to a specific time zone
• example: birthdate, start date, definition of “night” for phone usage
 Recurring events
• events that recur, sometimes during and sometimes outside daylight saving time
• example: weekly status meeting
Time Zone Identifiers
 Often based on the time zone information database (tzinfo). These identifiers are sometimes called the Olson ids.
Offset: Etc/UTC, Etc/GMT+1
Continent/City: America/Los_Angeles, Europe/Paris, Asia/Tokyo, Antarctica/DumontDUrville
Continent/Region/City: America/Indiana/Indianapolis
Ocean/Island(City): Atlantic/Canary, Pacific/Auckland, Pacific/Pago_Pago
Locale-Neutral Formats
 Use locale-neutral formats for interchange:
• ISO 8601
• Incremental time values (e.g. time_t)
 Distinguish time zone if necessary for interpretation
• Offset is not the same as time zone
SQL data types and XML formats are often field-based, while programming languages are usually incremental.
At any given time, in UTC, it is the same time everywhere that time is measured.
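"Offset is not the same as time zone" can be shown with java.time: one zone identifier yields different UTC offsets depending on the instant. The OffsetVsZone class name is an assumption for this sketch; the zone ids come from the identifier list earlier in the deck.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical helper: asks a time zone's rules for the offset in force at a given instant.
final class OffsetVsZone {

    static ZoneOffset offsetAt(String zoneId, Instant instant) {
        return ZoneId.of(zoneId).getRules().getOffset(instant);
    }

    public static void main(String[] args) {
        Instant winter = Instant.parse("2006-01-20T12:00:00Z");
        Instant summer = Instant.parse("2006-07-20T12:00:00Z");
        // Same zone, two offsets: storing only the offset loses the DST rules.
        System.out.println(offsetAt("America/Los_Angeles", winter));
        System.out.println(offsetAt("America/Los_Angeles", summer));
        // A zone without DST keeps one offset all year.
        System.out.println(offsetAt("Asia/Tokyo", winter));
        System.out.println(offsetAt("Asia/Tokyo", summer));
    }
}
```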
Durations and Repeating Events
Wall-time: this meeting is at 2 PM Pacific time every Tuesday
• interval between meetings may vary in number of seconds
   Daylight time transitions
   Changes in DST rules
Fixed-duration: run the virus scanner every 57 minutes
• interval is always 3,420,000 milliseconds
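The difference between the two kinds of recurrence can be sketched with java.time. The RecurrenceDemo class and its method names are assumptions; the dates are chosen so that the week between two Tuesday meetings spans the 2006 US spring DST transition.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical demo of wall-time recurrence versus fixed-duration recurrence.
final class RecurrenceDemo {
    static final ZoneId PACIFIC = ZoneId.of("America/Los_Angeles");

    // Wall-time rule: same local time next week, whatever the elapsed seconds turn out to be.
    static ZonedDateTime nextWeeklyMeeting(ZonedDateTime meeting) {
        return meeting.plusWeeks(1);
    }

    // Fixed-duration rule: exactly 57 minutes of elapsed time later.
    static ZonedDateTime nextScan(ZonedDateTime lastScan) {
        return lastScan.plus(Duration.ofMinutes(57));
    }

    public static void main(String[] args) {
        ZonedDateTime meeting = ZonedDateTime.of(2006, 3, 28, 14, 0, 0, 0, PACIFIC); // a Tuesday
        ZonedDateTime next = nextWeeklyMeeting(meeting);
        System.out.println(next);                                      // still 2 PM local time
        System.out.println(Duration.between(meeting, next).toHours()); // one hour short of 168
        System.out.println(Duration.ofMinutes(57).toMillis());         // fixed interval in ms
    }
}
```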
Calendars
 Gregorian
 Japanese Imperial
 Hijri
 Thai Buddhist
 Chinese Traditional
 Jewish
 Astronomy
Friday, January 20, 2006
الجمعة، 20 ذو الحجة، 1426
2006年1月20日星期五
二○○六年一月二十日星期五
平成18年1月20日
平成十八年一月二十日
วันศุกร์ที่ 20 มกราคม พ.ศ. 2549
วันศุกร์ที่ ๒๐ มกราคม พ.ศ. ๒๕๔๙
Calendars affect the field values calculated for a given event. The “roll” of values such as month, week, and day depends on such relationships. Calendar code then converts to incremental times.
Formatting Dates and Times
October 10, 14H 6:05:45 AM JST
Requires more than just a locale!
• date: the value being formatted (1034197545321L)
• time zone: defines relation to “wall time” (Asia/Tokyo)
• calendar: defines rules for calculating field values (Japanese Imperial)
Example: Java Date Formatting
Computer Time (Data Structure)
java.util.Date: long integer, milliseconds since “epoch” of January 1,
1970, 00:00 UTC
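The three inputs from the formatting slide (the value 1034197545321L, the zone Asia/Tokyo, and the Japanese Imperial calendar) can be combined with the newer java.time API roughly as follows. This is a sketch, not the deck's own code: the ImperialDateDemo class and method names are assumptions, and java.util.Date appears only to show that the same millisecond count is the underlying value.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.chrono.JapaneseDate;
import java.time.temporal.ChronoField;
import java.util.Date;

// Hypothetical demo: incremental time + time zone + calendar = field values.
final class ImperialDateDemo {

    // The time zone turns the incremental value into wall-time fields.
    static ZonedDateTime wallTime(long epochMillis, String zoneId) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId));
    }

    // The calendar decides how the date fields (era, year, month, day) are calculated.
    static JapaneseDate imperialDate(ZonedDateTime wallTime) {
        return JapaneseDate.from(wallTime);
    }

    public static void main(String[] args) {
        long millis = new Date(1034197545321L).getTime(); // java.util.Date wraps the same long
        ZonedDateTime tokyo = wallTime(millis, "Asia/Tokyo");
        JapaneseDate imperial = imperialDate(tokyo);
        System.out.println(tokyo);
        System.out.println(imperial.getEra() + " " + imperial.get(ChronoField.YEAR_OF_ERA));
    }
}
```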
Externalization
M O V I N G L A N G U A G E A N D C U LT U R A L LY
A F F E C T E D D ATA A N D C O M P O N E N T S O U T O F
CODE.
What is Localization?
 The process of tailoring a product to a specific target
market.



Translation of messages
Adaptation to local preferences
Addition (or subtraction) of content or features
Localization is obvious…
 “Localization” is not “Internationalization”!
 Localizability is internationalization.
 Externalize text
 Externalize presentation
 Dynamic composition
 Distribution of language content
 “Plug-in” features
Avoiding Forks
English Version
version française
Deutsche Version
日本語版
Global Binary
Resources
Resources
Resources
Resources
Forked Code Woes
 Hard to fix and maintain
 Different versions in the field
 Delays in releasing localized product
 Different functionality by region
 Confusing for customers/users
 Versions are not interoperable and might not be able
to exchange data!
Other Benefits
 Rename or re-brand product
 Fix spelling or grammar mistakes
 Fix usability
 Make terminology consistent
… all without a rebuild!
What is a ‘Resource’?
 any application component loaded dynamically at runtime, rather than compiled into the application
 in Localization: source code files containing language, region, or culturally-affected materials
• Text
• Error messages
• Icons
• Pictures
• Fonts
• Colors
• Graphics
• Sizes
• Positions
• Magic Numbers
• Mnemonics (“Alt+G”, “F4”, etc.)
• File Locations
• Dictionaries
• Glossaries
• Grammar Rules
• Code

A gencat message catalog file:
$set 1 Prompts
1 ENTER FIRST NAME
2 ENTER LAST NAME
$
$set 2 Error Messages
1 NAME NOT ON DATA BASE
2 ILLEGAL INPUT
Non-Translatable Resources
 Some content should be externalized but not translated
 Sometimes referred to as “DNT” for “do not translate”
 Externalize? Yes…
 Segregate DNT material from translated material if possible (by using
separate resource files or separate resource blocks within a file).
 Developers can’t always tell when something should or should not be
DNT… and neither can translators (context is missing)
The “Locale” in “Localization”
 Resources “fall back” to
find the best match
Global Binary
Resources
Falling back
zh-Hans-SG (Chinese, Simplified script, Singapore)
zh-Hans (Chinese, Simplified script)
zh (Chinese)
(root)
Sparse Population
 A given language resource may not contain a complete set of resources.
• Lookup falls back through the language chain separately for each sub-resource (such as a particular value)
French resources:
“appName” “Démo”
“dialogTitle” “Bonjour monde”
Root resources:
“appName” “Demo”
“maxRows” 57
“dialogTitle” “Hello World”
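Fallback and sparse population can be sketched together: each key is looked up along the chain zh-Hans-SG, zh-Hans, zh, root until some resource set supplies it. The SparseResources class is hypothetical, and truncating the tag at the last hyphen is a simplification of real language-tag lookup rules; Java's own ResourceBundle uses a similar but more elaborate candidate list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory resource store with per-key locale fallback.
final class SparseResources {
    // language tag ("" is the root) -> (key -> value)
    private final Map<String, Map<String, Object>> bundles = new HashMap<>();

    void put(String languageTag, String key, Object value) {
        bundles.computeIfAbsent(languageTag, tag -> new HashMap<>()).put(key, value);
    }

    // "zh-Hans-SG" -> [zh-Hans-SG, zh-Hans, zh, ""] by dropping the last subtag each step.
    static List<String> fallbackChain(String languageTag) {
        List<String> chain = new ArrayList<>();
        String tag = languageTag;
        while (!tag.isEmpty()) {
            chain.add(tag);
            int cut = tag.lastIndexOf('-');
            tag = cut < 0 ? "" : tag.substring(0, cut);
        }
        chain.add(""); // the root always terminates the chain
        return chain;
    }

    // Each key falls back on its own, so a sparse bundle only overrides what it defines.
    Object lookup(String languageTag, String key) {
        for (String tag : fallbackChain(languageTag)) {
            Map<String, Object> bundle = bundles.get(tag);
            if (bundle != null && bundle.containsKey(key)) {
                return bundle.get(key);
            }
        }
        return null; // not defined anywhere, not even in the root
    }

    public static void main(String[] args) {
        SparseResources resources = new SparseResources();
        resources.put("", "dialogTitle", "Hello World");
        resources.put("", "maxRows", 57);
        resources.put("fr", "dialogTitle", "Bonjour monde");
        System.out.println(fallbackChain("zh-Hans-SG"));
        System.out.println(resources.lookup("fr-CA", "dialogTitle")); // found in "fr"
        System.out.println(resources.lookup("fr-CA", "maxRows"));     // found in the root
    }
}
```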
Getting the Right Locale
Client Locale
Server Locale
API Request Locale
client
System Mgmt Locale
Front End
Business Logic
API
Business Logic
Data Store
Data Store
Operating Env.
Operating Env.
One request might serve
multiple purposes or be
seen in multiple
contexts
Resources and Translation
Source:
“key”, “display string”
“dialogTitle”, “Dialog Title”
“aMessage”, “This is a message.”
Pseudo-Translation:
“key”, “ðìsplàÿ stríñg”
“dialogTitle”, “Ðîálòg Tïtlè”
“aMessage”, “Thìß ís â Ｍésßãgê.”
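A pseudo-translation like the one above can be generated mechanically. The sketch below is one possible approach, not the tool used for the slide: the PseudoTranslator class, its character table, and the bracket markers are all assumptions. Accented letters are written as Unicode escapes so the source file itself stays ASCII.

```java
// Hypothetical pseudo-translator: swaps plain letters for accented look-alikes.
final class PseudoTranslator {
    private static final String PLAIN = "aeiouAEIOUnc";
    // Accented counterparts, position by position, for the letters in PLAIN.
    private static final String ACCENTED =
            "\u00e0\u00e9\u00ee\u00f2\u00fc\u00c0\u00c9\u00ce\u00d2\u00dc\u00f1\u00e7";

    // Brackets make truncation visible; {placeholders} are left untouched.
    static String pseudo(String source) {
        StringBuilder out = new StringBuilder("[");
        boolean inPlaceholder = false;
        for (char c : source.toCharArray()) {
            if (c == '{') {
                inPlaceholder = true;
            }
            int index = inPlaceholder ? -1 : PLAIN.indexOf(c);
            out.append(index >= 0 ? ACCENTED.charAt(index) : c);
            if (c == '}') {
                inPlaceholder = false;
            }
        }
        return out.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(pseudo("Dialog Title"));
        System.out.println(pseudo("This is a message."));
    }
}
```

Running a build with pseudo-translated resources exposes hard-coded strings (they stay unaccented) and encoding problems before any real translation is paid for.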
Don’t Build From Text Fragments
 Text fragments are hard to translate
 Fragments may not follow grammar rules
• Cannot know which parts go together
• Parts can be reused in incompatible ways
String1 = There are
String2 = no
String3 = tables in
String4 = files.
There are files.
There are no files.
There are 50 files.
There are tables in files.
There are no tables in files.
[] files out of [] were deleted.
An error occurred at [] on [].
Page [] of []
Processing: []% complete.
Issues With Text Composition
 Count:
• There were one errors found.
• You have earned your 22th set of bonus points.
 Gender:
• “Documenti del Chris”
• “Documenti della Chris”
• “Documenti - Chris”
 Case
 Grammatical Structure
• SOV, SVO, etc.
 Word Order and Inter-word Dependency
Sentence Parts Must Agree
 Endings, Gender, Plurality, Case
• e.g. Japanese counting uses different words for different kinds of objects
• e.g. Slavic languages use different endings for singular, few, many…
Message Format APIs
There were {0} tables on {1}.
There were {0,number,integer} tables on {1,date,short}.
{1,date}に{0,number,integer}のテーブルがあった。
 Number replacement variables.
 Provide typing and formatting information where possible.
 Externalize as a single unitary string.
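The patterns above use the syntax of java.text.MessageFormat. A minimal sketch follows; the TableMessage wrapper and the second, reordered pattern are assumptions made for illustration (the reordered pattern stands in for a translation that needs its arguments in a different order, as the Japanese example does).

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical wrapper: one externalized pattern, arguments substituted by number.
final class TableMessage {

    static String format(String pattern, Locale locale, Object... args) {
        // The locale drives how typed arguments (numbers, dates) are rendered.
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        String english = "There were {0,number,integer} tables on {1}.";
        String reordered = "{1}: {0,number,integer} tables"; // arguments appear in another order
        System.out.println(format(english, Locale.US, 1234, "disk"));
        System.out.println(format(reordered, Locale.US, 1234, "disk"));
    }
}
```

Because the arguments are numbered, a translator can move them freely inside the single string without any code change.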
Complex Message Formatting
There were no errors.
There was 1 error.
There were 2 errors.
0:There were no errors.
1:There was {0} error.
2:There were {0} errors.
0:не было ошибок
1:была {0} ошибка
2:были {0} ошибки
5:были {0} ошибок
“Choice format” APIs allow different resources to be used based on runtime values.
Examples:
 ordinal numbers (1st, 2nd, 3rd, 4th, etc.)
 complex messages, such as “27 seconds ago” vs. “10 minutes ago”
The number of resources may need to vary by locale or language.
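In Java, the English variants above can be expressed as a single MessageFormat pattern with a choice sub-format. The ErrorCountMessage class is an assumption for this sketch; the pattern text is built from the three English strings on the slide.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical example: the message text is selected by the runtime value of {0}.
final class ErrorCountMessage {
    // Limits 0, 1, 2 select the variant; values of 2 and above use the last one.
    static final String PATTERN =
            "{0,choice,0#There were no errors.|1#There was {0} error.|2#There were {0} errors.}";

    static String format(int count) {
        return new MessageFormat(PATTERN, Locale.US).format(new Object[] { count });
    }

    public static void main(String[] args) {
        for (int count : new int[] { 0, 1, 2, 5 }) {
            System.out.println(format(count));
        }
    }
}
```

A range-based choice is enough for English but not for the Russian resources shown above, where for example 21 takes the same form as 1; plural-rule-aware APIs such as ICU's PluralFormat are designed for that case.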
Images and Icons
 Avoid metaphors
 Avoid cultural sensitivities
 Avoid body parts
 Replace as necessary
 Avoid putting text into graphics
Graphic: $20
Text: $0.06
Images and Culture
 Beware your
biases—even “good”
ones.
Check out our new website
for India!
Isn’t it Swell?
English is very succinct.
• Words in other languages are often longer
• Sentences may be longer
• Characters may be larger (taller, wider, or require a bigger point size)
More Swollen Text
 30% in length (alphabetics, abjads, etc.)
 30% in height (ideographics)
 But… a rule of thumb, not a “fact”
 Measure your results with care.
GUI Layout
Managing English Text
String
Building??
Abbrev.
Eng.
String is the Thing
String is the
Thing?
Dereferencing
 Minimize sentence building
 Minimize arguments per string
 Use subject:predicate wherever possible
Don’t do this:
Your balance is $100.00.
When you can do this:
Balance: $100.00
Dynamic vs. Static Layout
 Magic numbers
 Externalized layouts
 Mnemonics
 Colors
Localizing Styles
 Bolding is not universal for emphasis
 Italicization, Capitalization, etc. are also not universal (some scripts
don’t have these attributes)
 Use Logical not Presentational names
 Describe the function not the appearance. For example, use
“emphasis” instead of “italics”.
中国
Amikake
Wakiten
Use of Color
“Going Down”
“Going Up”
Keyboards
Input Method Editors
Some languages require software to assemble keystrokes into characters
• Asian languages with very large character sets
• Complex scripts with vowel-killers and other contextual editing requirements
Applications that interact directly with key-pressed events can disable or disrupt IME input.
• On- and over-the-spot editing
Customization
When is it okay?
 Content should be highly localized
or have locale-specific
requirements:

customization lets you address this
requirement in the most localized possible
manner
Externalization Redux
[Figure: culturally affected items (text, images, colors, sounds, titles, dates, times, numbers, currencies, address formats, legal rules, accounting rules, large animal pictures) move out of the code into Resources, leaving Language and Locale Independent Code. Input and Output pass through the I/O layer of the Global Code software component.]
Customization Examples
 Postal address
validation
 Postal code validation
 Telephone number
formatter
 “Personality”
questions

 Personal name formatter

first/last position, space,
highlighting, formality, etc.
 Tax codes and shipping
schedules
Generic API
blood type vs. sun sign
Generic
Implementation
US
Implementation
DE
Implementation
Impl
Example: Postal Addresses
US-centric schema:
address1  varchar(32)
address2  varchar(32)
city      varchar(16)
state     char(2)
zip       char(5)
Internationalized (i18n) schema:
country   char(2)
address1  varchar(64)
address2  varchar(64)
city      varchar(64)
province  varchar(64)
postcode  varchar(64)
public interface Address {
public class genericAddress implements Address {
public class USAddress extends genericAddress {
public class UKAddress extends genericAddress {
country=US, postcode=‘WC2 1GH’ // error
country=UK, postcode=‘95111’ // error
country=DE, postcode=‘1A4喪’ // okay?
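The class outline on the slide could be fleshed out along these lines. Everything beyond the four type names is an assumption: the method names, the constructor shapes, and especially the two regular expressions, which are simplified format checks rather than authoritative postal rules. The generic class is capitalized here to follow Java naming conventions.

```java
import java.util.regex.Pattern;

// Generic API: callers work against Address and never see the country-specific rules.
interface Address {
    String country();
    String postcode();
    boolean isValid();
}

// Generic implementation: accepts anything that fits the varchar(64) column.
class GenericAddress implements Address {
    private final String country;
    private final String postcode;

    GenericAddress(String country, String postcode) {
        this.country = country;
        this.postcode = postcode;
    }

    public String country() { return country; }
    public String postcode() { return postcode; }

    public boolean isValid() {
        return !postcode.isEmpty() && postcode.length() <= 64;
    }
}

// US customization: five digits, optionally followed by a four-digit extension (simplified).
class USAddress extends GenericAddress {
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    USAddress(String postcode) { super("US", postcode); }

    @Override
    public boolean isValid() { return ZIP.matcher(postcode()).matches(); }
}

// UK customization: simplified outward/inward pattern, not the full official rule set.
class UKAddress extends GenericAddress {
    private static final Pattern POSTCODE =
            Pattern.compile("[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}");

    UKAddress(String postcode) { super("UK", postcode); }

    @Override
    public boolean isValid() { return POSTCODE.matcher(postcode()).matches(); }
}
```

A country without a specific implementation, such as the DE row above, falls through to the generic length check, which is why the slide marks that case "okay?".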
Building Global Software
BEYOND JUST CODING:
L O C A L I Z AT I O N , Q A , A N D A L L T H AT
The internationalization cycle
 Encompasses the full development cycle:
• Requirements
• Design
• Development
• QC
• Release
• Support
[Figure: development cycle with stages Develop Roadmap (where is the product going?), Develop Requirements & Architecture, Design (internationalized), Code (enable, externalize, modularize), Test (non-English/non-ASCII), RTM/GA (by market), and Support Issues and Requests (all customers)]
What is “internationalization QA”?
 Does the enabled product work correctly?
 Non-English configurations
 Non-ASCII data and encoding support
 Cross time zone support
 Market specific features or customizations
 Does localization appear correctly?
 Is the product localizable?
What makes this different
from “regular” QA?
Growing (and Pruning) the Matrix
Include non-English configurations in your test
matrix; include non-ASCII data in your tests.
Be prepared to prune the test
matrix.
What to Test With
 Test Non-English configurations
• Non-English locales (lying to your machine)
• Native configurations (when does it make sense?)
 Test Non-ASCII data
• Encodings, encodings, everywhere
• Non-ASCII character values
 Test Across Time Zones
• Two or more time zones; consider international date line (“it’s tomorrow in Japan”) and DST issues
Planning Testing
Initially
 Get tools that are enabled!
• Automation allows greater coverage, but only if it works.
 Plan encodings and locales as part of the test matrix.
 Acquire third-party products as necessary.
Increasing Maturity
 Use test driven development practices.
 Get developers to write unit tests that are internationalized.
 Put the ‘i18n’ bugs into the regression suite.
Configuring Machines
Create both native and simulated environments:
• Native operating systems may have minor but sometimes critical differences (folder names, keywords, localized registry entries)
• Most features don’t run into native differences (easier to work with English-localized machines)
• Don’t buy physical keyboards (use software keyboards) unless your application relies on scan codes from keys
Incorporate Localization
Localization is part of the release process too.
• Changes to the user interface cost the localization team time and money.
• (Changes to the product cost the documentation and QA folks too)
 May need to institute change control or a UI freeze
Simultaneous Shipment (Simship)
Ideally, to maximize opportunity, ship the target
languages the same day as the source language.


It might not make sense for your product.
But it might not be as difficult as you think it is. It might even be
good for you.
Distribution of Content
 How does the localized text get into the running product?
• Satellite assemblies, DLLs, shared libraries
• Message catalogs
• Special directory
• Database
• Etc.
More Distribution
 “Specific Language”
(per-language)
 “Language Included”
(one or more languages)
 “Language Pack”
(product plus something)
English
English
German
German
French
French
English
Global Binary
+
German
French
Completing the Product
 Static content is often under source control and can be localized “normally”
 Dynamic content may include the initial set of data or other items which need to be localized beyond software.
• Demos and Demo Data
• Dictionary, Language add-ons
• Local offers, links to Web store, etc.
• Packaging
• Regulatory
Quality Checking and Development Methodologies
 Translation is a human-oriented task.
• Translation time lines are linear with volume.
 Development cycle has to include time for translators and quality assurance to catch up.
• This does not mean “no agile” or “no changes”
• Do pilot language(s) or moving-target translation; do better UI design and usability reviews; etc.
 Localized product should be tested for functionality
• translation can break things
• usually the first language finds most of the bugs
 Translations should be checked for quality
Summary
Internationalization
… is a fundamental architectural approach: it is how
software is built.






Design
Enabling
Externalization
Customization
Testing and Support
Lifecycle
“Would you please
write the code for
I18N on the
whiteboard before
you go?”
#import i18n.h
#define UNICODE
Q&A
