Internationalization

An Introduction
Internationalization and Unicode Conference #29
Presenter and Presentation
 Addison Phillips
 Globalization Architect
 Yahoo!
 This Presentation
 Tutorial
 About 3 hours duration
 Internationalization and basic concepts
Internationalization is:
 the design and development of a
product that is enabled for target
audiences that vary in culture,
region, or language. [W3C]
 a fundamental architectural
approach to software development
Related Concepts
 Localization: creation of a product
tailored to a particular target
market
 Translation: process of converting
text from one language to another
 Globalization*: unified approach
to creating global products,
especially those that support
multiple geographies simultaneously
Mystic Numbering (M4C N7G)
 Internationalization: I18N
 Localization: L10N
 Globalization: G11N
 Opinions differ on the handling of
capitalization
 Very geeky; not very
internationalized (I19G?)
A Global Approach
• Internationalization turns technical decisions into business decisions
• Balance priorities based on real user distribution/requirements
  • Consider global user population as a whole
  • Consider specific market requirements on an equal footing
  • Potential markets for the product
Addressable Market: Why Do It?

Name                       North America        Western Europe       Asia Pacific
                           Revenue   % of WW    Revenue   % of WW    Revenue   % of WW
Computer Associates        $320M     58%        $165M     30%        $36M      7%
IBM                        $4,030M   42%        $3,516M   37%        $1,406M   15%
Microsoft                  $1,894M   49%        $1,195M   31%        $530M     14%
Oracle                     $2,469M   41%        $2,001M   33%        $1,010M   17%
Sun                        $127M     54%        $60M      26%        $37M      16%
Sybase                     $370M     59%        $162M     26%        $65M      10%
All IDC tracked companies  $19,653M  48%        $13,294M  32%        $6,055M   15%
Globalized Product Development
With proper internationalization, localization is a
business decision, not a technical decision.
Flexibility:
• Deployment: Choose whether to serve content from a single site, cluster of sites, or in each target market.
• Development: Add content and features to products as necessary in each target market.
• Integration: Servers and products can work together around the world, so customers can truly create “Enterprise” solutions.
Aspects of Internationalization
 Enabling—the same code supports
multiple regions or cultures
 Externalization—separate content
from code to make localization for
specific languages, regions, or
cultures easy, fast, and cheap.
 Customization—add culturally
specific content or functionality
Mythology/Favorite Whines
• We (wrote it in Java/C#, used Unicode, etc.), so it is internationalized.
• We made the assumption that the product would only ever have English screens: all our users understand it anyway.
• A localized product is internationalized.
• An internationalized product is slow/slower.
• It takes longer to write internationalized code.
• We can’t read the screens/it is too hard to test.
• We have no intention of localizing, so no need to internationalize.
• We don’t have any customers there.
• The users in Khalikistan never complained, so it must work.
• This product is 100% fully internationalized.
The internationalization cycle
• Encompasses the full development cycle:
  • Design
  • Development
  • QC
  • Release
  • Support
[Diagram labels: Develop Requirements (all customers); Develop Roadmap (global deployment); Develop Requirements & Architecture; Design (internationalized); Code (enable, externalize, modularize); Test (non-English/non-ASCII); RTM/GA (by market)]
The Customization Approach
 “Internationalization is something remedial”
 “Didn’t we do internationalization in the last
release?!?”
 Internationalization involves a lot of arcane
knowledge (“we don’t know what to do”)
 “It will interrupt or slow down development.”
 “International features are not important to our
U.S. customers—and they represent our largest
market.”
 “Let’s outsource it”
 “We’ll get to it next time”
How That Model Really Looks
Main Line
I18N
Lots more people
Engineer/Test
Global FCS
Can’t ship now: too close
to next source FCS…
Time
The Problem with Customization
• Code forks. (double, triple coding)
• Lag time for international releases.
• Non-adoption of localized release.
• Full regression of every language.
• Quality or commitment perception.
• Lack of data exchange between language versions.
• Difficult to repeat (every version is a repeat)
• Proliferation of bugs and of support problems.
• International features are cancelled.
• Core product still doesn’t work/can’t address similar markets.
• Loss of market share.
The Internationalization Approach
 Gather requirements globally
 Enable
 Externalize
 Customize
 Test and support globally
 Localize
Analyzing and Developing a Design
Large Animal Pictures
Resources
Input
Global Code
Software
Component
I/O
Output
The Three Locale Problem
Client Tier
Server Tier
Resource Tier
More Layers are Possible
Internationalization Issues
• Text Processing
  • Character encodings, including Unicode, spelling, word breaks, collation, and so on
• Language
  • Of the software (localization)
  • Of solutions built using the software (localizability, data)
• Locale-affected formats
  • dates, numbers and the like
• Regionally-affected formats
  • names, addresses, currency, and the like
• Time-related issues
  • time zone, calendar, holidays, work rules and the like
• Cultural adaptation
  • presentation, style, position, color use, and the like
• Legal requirements
  • accessibility, SOX, security, content, and the like
“Well, it depends…”
Making Good Design Decisions
 Generalize designs
 Locale independent data structures
 Locale sensitive display
 Externalize cultural or linguistic
variations
 Customize as a last resort
Levels of Enablement
 Not Enabled
 Single-Language-at-a-Time (SLAAT)
 All components run in the same
language and encoding environment
correctly.
 Multi-Locale
 Unicode support; components run in
different locales, languages, encodings,
and time zones
Enabling
What is “enabling”?
 Enabled software:
adapts the display, processing,
validation, storage, and transmission of
data according to the cultural, linguistic,
and regional needs of the users
 Text, Characters, and Encodings
 Locale Awareness
 Time Zones
Text and Encodings
The Biggest Source of Woe
“Character encodings consume more
than 80% of my work day. They are
the source of more mis-information
and confusion than any other single
thing. And developers aren’t getting
any better educated.”
~Glen Perkins
Globalization Architect
Macromedia
A lot of jargon
Some are real and some are bogus:
Single-byte
Character encoding
Double-byte
Code point
Multibyte
Extended ASCII
Wide character
Latin-1
Charset
Unicode
Coded character set
EBCDIC
Kanji
Lead byte, trail byte
Bidi or bidirectional
DBCS, MBCS, SBCS
Mojibake
Stateful encoding
Glyph
Legacy encoding
How the computer sees the world
“bits”: 010000010101101101101000
“byte” or “octet”: 01000001 (0x41)
From bits to text
 A “character” is a single logical
unit of text.
 A “character set” is an organized
collection of characters, each with
an integer value (a “code point”).
 A “character encoding” is a
mapping of characters in a
character set to bytes (“coded
character sequence”).
 A “glyph” is a single visual unit of
text.
À
U+00C0
UTF-16: \x00\xC0
UTF-8: \xC3\x80
fi
Character Set
• A collection (repertoire) of characters, organized (coded character set), so that each has a unique integer value (code point).
• Examples:
  • ASCII (ANSI X3.4), ISO 646
  • JIS X 0208
  • Latin-1 (ISO 8859-1)
Character Encoding
 Maps code points to a sequence of
bytes (code units).
U+00C0
0xC3 0x80
All text (in memory, on disk, on
the network, etc.) has a
character encoding.
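As a rough illustration of the À example, here is a small sketch using Python's built-in codecs (the variable names are only for illustration). It shows one code point producing different byte sequences depending on the encoding chosen:

```python
# One character, one code point, several byte sequences.
ch = "\u00C0"  # À, LATIN CAPITAL LETTER A WITH GRAVE

code_point = ord(ch)                    # integer value in the character set
utf8_bytes = ch.encode("utf-8")         # two bytes
utf16_bytes = ch.encode("utf-16-be")    # one 16-bit code unit, big-endian
latin1_bytes = ch.encode("latin-1")     # a single byte

print(f"U+{code_point:04X}", utf8_bytes, utf16_bytes, latin1_bytes)
```

The bytes alone do not say which encoding produced them; that information has to travel with the text.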
Life of an encoding
Template
User Input
Some
Process
Text File
Display
Process
Build
Message
Process
Process
DB
Receive
Message
ASCII
 ASCII was originally a 7-bit standard
 7 bits = 2^7 = 128 characters
 Enough for “U.S. English”
Latin-1 (ISO 8859-1)
• ASCII for characters 0x00 through 0x7F
• Accented letters and other symbols 0x80 through 0xFF
One character—many encodings!
EBCDIC
Beyond Single Byte Encodings
• 8-bit or “single-byte” character encodings
  • one byte per character
  • 1 byte = 1 character (= 1 glyph?)
  • 256 character maximum
• “Double-byte” encodings must use 2 bytes per character??
Multibyte Encodings
 One or more bytes per character
 1 byte != 1 character
 May use 1, 2, 3, or 4 bytes per
character (depending on encoding)
Simple Multibyte Encodings
• Specific byte ranges encode characters that take more than one byte.
  • A “lead byte”
  • One or more “trail bytes”
• Code point != code unit?
Shift-JIS: A Multibyte Encoding
• In order to reach more characters, Shift-JIS characters start with a limited range of “lead bytes”
• These can be followed by a larger range of byte values (“trail bytes”)
Shift-JIS
Shift-JIS
• Lead byte values can also occur as trail bytes
• Trail bytes include ASCII values
• Trail bytes include special values such as 0x5C (“\”)
Working with Multibyte
• Byte Strings
• A sequence of bytes: the programmer directly
accesses the byte sequence.
• wrong: while (*++lpStr != 'a') {
• wrong: myChar = strchr(lpStr, 0x5c);
• Code written specifically for one character set won’t
work with any other character set.
• Character Strings
• A sequence of logical characters: the programmer
doesn’t know about and can’t directly access the
byte representation.
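To make the 0x5C trap concrete, here is a minimal sketch using Python's `shift_jis` codec. The katakana letter ソ is chosen because its trail byte is 0x5C; the variable names are illustrative only. It contrasts a byte-string search with a character-string search:

```python
# A byte-wise search for backslash (0x5C) in Shift-JIS data.
text = "\u30BD"                      # ソ, KATAKANA LETTER SO
raw = text.encode("shift_jis")       # lead byte 0x83, trail byte 0x5C

# Byte string view: the trail byte looks like a backslash (false positive).
byte_hit = b"\x5c" in raw

# Character string view: decode first, then search logical characters.
char_hit = "\\" in raw.decode("shift_jis")

print(raw.hex(), byte_hit, char_hit)
```

The same mistake is what breaks path handling and escaping code that scans legacy multibyte data one byte at a time.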
Working with Multiple Encodings
• Tofu: hollow boxes
• Mojibake: “screen garbage”
• Question marks: conversion not supported
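Two of these failure modes are easy to reproduce. The sketch below (plain Python, sample text chosen arbitrarily) decodes UTF-8 bytes with the wrong encoding to get mojibake, and encodes into a character set with no mapping to get question marks:

```python
original = "日本語"
utf8 = original.encode("utf-8")

# Mojibake: bytes written as UTF-8 but interpreted as Latin-1.
mojibake = utf8.decode("latin-1")

# Question marks: the target encoding cannot represent these characters.
question_marks = original.encode("ascii", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = utf8.decode("utf-8")

print(repr(mojibake), question_marks, recovered)
```

Mojibake is often reversible if the wrong decoding is known; question marks are not, because the original data is gone.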
Mojibake
UTF-8
SJIS
EUC-JP
EBCDIC (Cp037)
ISO 2022-JP
GB18030
“Code Page Hell”
 Need for more converters and conversion
maps
 Difficulty of passing, storing, and
processing data in multiple encodings
 Too many character sets…
Unicode / ISO-10646
The Idea Behind Unicode
 Solve the mojibake problem
 Assign a unique bit pattern to every character
in all of the other standards
 Assign unique bit patterns to all the characters
from languages that haven’t yet been
computerized
 Nail down more tightly just what each bit
pattern means and how it is to be used
 This means
 Using more than one byte for each character
 Deciding among various practices for
representing various languages
Unicode (ISO 10646)
• Unicode is a character set that supports all of the world’s languages and writing systems.*
• Unicode was originally designed as a “wide character set”: every character represented by 16 bits. This allowed for 65,536 potential characters.
• Now has 21 bits of code space (code points U+0000 through U+10FFFF, 1,114,112 in all)
• Unicode is maintained by an industry consortium. ISO 10646 is maintained by ISO. The two are kept synchronized: 10646 has the same characters and code points as Unicode (and vice versa).
More than a character encoding
• Unicode provides additional information:
  • Code point
  • Character encoding (“code units”)
  • Character name
  • Character information, such as whether it is a digit, number, alphabetic, etc.
  • Directionality (LTR, RTL, indeterminate)
  • Normalization
  • Casing and compatibility information
Basic Multilingual Plane
 The original 16-bit encoding is now called the
“Basic Multilingual Plane” (BMP).
 Most of the data in the world lives in this plane.
Unicode Properties
• Each character’s properties can help you process and render text.
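As one way to look at these properties, here is a short sketch using Python's standard `unicodedata` module. The helper name `describe` and the sample characters are assumptions made for illustration:

```python
import unicodedata

def describe(ch):
    """Return a few Unicode properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),      # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),     # e.g. L, R, AL, AN
        "digit": unicodedata.digit(ch, None),      # digit value, or None
    }

# Latin letter, Hebrew letter, Arabic-Indic digit three
for ch in ("\u00C0", "\u05D0", "\u0663"):
    print(describe(ch))
```

Code that asks "is this a digit?" or "which direction does this run?" through properties works for every script, not just ASCII.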
Unicode Can Be Composed
 Composition can create “new” characters
 Base + non-spacing (combining)
character(s)
A
+
˚
= Å
U+0041 + U+030A = U+00C5
a+ˆ+.=ậ
U+0061 + U+0302 + U+0323 = U+1EAD
a+.+ˆ=ậ
U+0061 + U+0323 + U+0302 = U+1EAD
 Note: Unicode notation is U+hhhh
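The composition examples above can be checked with Unicode normalization. This is a minimal sketch using Python's `unicodedata.normalize` with form NFC (the variable names are illustrative); it shows that both orderings of the combining marks end up as the same precomposed character:

```python
import unicodedata

# A + COMBINING RING ABOVE composes to U+00C5.
a_ring = unicodedata.normalize("NFC", "A\u030A")

# a + circumflex + dot below, and a + dot below + circumflex.
order1 = unicodedata.normalize("NFC", "a\u0302\u0323")
order2 = unicodedata.normalize("NFC", "a\u0323\u0302")

print(f"U+{ord(a_ring):04X}", f"U+{ord(order1):04X}", f"U+{ord(order2):04X}")
```

Without normalization the two input sequences compare as different strings even though they display identically, which matters for searching and for keys.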
Why Unicode
 Unicode solves a number of the problems
we’ve just encountered.
 You can search for any Unicode character:
 ok: myChar = wstrchr(lpUnicodeStr, 0x0040);
 Unicode encodings are not stateful.
 Mix content in any language.
Unicode Encodings
• UTF-32
  • 32-bit encoding. All characters are the same width.
• UTF-16
  • Grew out of UCS-2, the original 16-bit Unicode encoding.
  • Each code unit is 16 bits wide; characters in the BMP take one code unit.
  • Characters beyond the BMP are represented by “surrogate pairs” (two code units).
  • Lead and trail surrogates occupy reserved ranges, entirely separate from all other characters.
• UTF-8
  • A multibyte, 8-bit encoding using 1 to 4 bytes per character.
  • ASCII is ASCII in UTF-8.
UTF-8
• How it looks:

Scalar Value                 UTF-16                                1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx            00000000 0xxxxxxx                     0xxxxxxx
00000yyy yyxxxxxx            00000yyy yyxxxxxx                     110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx            zzzzyyyy yyxxxxxx                     1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx   110110ww wwzzzzyy 110111yy yyxxxxxx   11110uuu  10uuzzzz  10yyyyyy  10xxxxxx

(where wwww = uuuuu - 1)
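The bit layout can be turned into code fairly directly. The following is a sketch, not a production encoder: the function names `utf8_bytes` and `utf16_units` are made up for this example, and real code would normally just call the platform's converters.

```python
def _check_scalar(cp):
    # Surrogate code points and values above U+10FFFF are not scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")

def utf8_bytes(cp):
    """Encode one scalar value as UTF-8 using the 1 to 4 byte bit patterns."""
    _check_scalar(cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf16_units(cp):
    """Return the UTF-16 code units (one, or a surrogate pair) for a scalar value."""
    _check_scalar(cp)
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                      # 20 bits split across two surrogates
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

for cp in (0x41, 0xC0, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_bytes(cp).hex(), [f"{u:04X}" for u in utf16_units(cp)])
```

Comparing the output with the built-in `str.encode("utf-8")` is a quick sanity check on the bit shuffling.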
Which Encoding is Right for Me?
 char, wchar_t, and ICU
 Java and C#
 HTML, XML, email, etc.
 Other encodings: ACEs, Punycode,
CESU-8, SCSU, and so on…
“That’s great: I’ll just use Unicode”
• Remember “all text has an encoding”?
  • Control encoding via code; avoid hardcoding the encoding
  • Watch out for legacy encodings
    • Transcode to Unicode early
    • Convert to legacy encoding late
    • Some technologies are not Unicode-ready
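One way to read "transcode early, convert late" is as a boundary rule: bytes are decoded the moment they enter the program and encoded only when they leave. The sketch below assumes a hypothetical function `process_legacy_text` whose encodings are passed in as parameters rather than hardcoded; the uppercase step merely stands in for real processing.

```python
def process_legacy_text(raw, source_encoding, target_encoding):
    """Decode at the input boundary, work in Unicode, encode at the output boundary."""
    text = raw.decode(source_encoding)      # transcode to Unicode early
    text = text.upper()                     # stand-in for real processing on str
    return text.encode(target_encoding)     # convert to the output encoding late

latin1_in = "déjà vu".encode("latin-1")
same_legacy_out = process_legacy_text(latin1_in, "latin-1", "latin-1")
utf8_out = process_legacy_text(latin1_in, "latin-1", "utf-8")

print(same_legacy_out, utf8_out)
```

Everything between the two boundaries sees only Unicode strings, so switching the output encoding does not touch the processing code.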
Unicode to the Rescue
Template
User Input
Some
Process
Text File
Display
Process
Build
Message
Process
Process
DB
Receive
Message
Locales and Formats
Adapting code to language, regional,
and cultural variation
Locales
• A “locale” is a data structure that allows programmers to access culturally and linguistically affected functionality in a system.
  • Most commonly: convert objects to strings and strings to objects.
Coding for Culture
 Object Presentation
 Object->String, String->Object
 Integers
 Floats
 Percents
 Currencies
 Dates
 Times
 Durations
 Collation (lists)
 Weights/measures/sizes of all types
 Resources (user interface strings)
Internal Value vs. Display Value
• Internal Value
  • The data structure: the actual bits used to represent the object.
  • Doesn’t vary by locale, that is, it is “locale independent”
• Display Value or Display Name
  • How the data is presented to the end user.
  • May vary by locale, that is, it is “locale sensitive”
Internal Value vs. Display Value
0000 0100 0000 0000
0x400
1,024
1 024
1’024
1.024
۱۰۲۴
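A small sketch of the same idea: one internal integer, several display values. The separator table and locale labels below are hypothetical and hard-coded purely for illustration; real code would take them from the platform's locale data.

```python
# Hypothetical grouping separators, keyed by an illustrative locale label.
GROUP_SEPARATORS = {
    "en-US": ",",
    "fr-FR": " ",
    "de-CH": "\u2019",
    "de-DE": ".",
}

def display_integer(value, locale_id):
    """Object -> string: render the internal value with a locale's grouping."""
    return format(value, ",").replace(",", GROUP_SEPARATORS[locale_id])

internal = 0x400   # the internal value never changes

for locale_id in GROUP_SEPARATORS:
    print(locale_id, display_integer(internal, locale_id))

# String -> object: Python's int() accepts any Unicode decimal digits.
parsed = int("\u06F1\u06F0\u06F2\u06F4")
```

Storage, comparison, and arithmetic all use `internal`; only the last step before the user sees it is locale sensitive.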
Dates and Display Value
 Computer Time (Data Structure)
 java.util.Date: long integer, milliseconds since
“epoch” of 1 January 1970, 00:00 UTC

1034197545321L
 Display Values:
 Wed Oct 09 14:08:01 PDT 2002 (POSIX)
 9 Spalio 2002 14.08.01 PDT (lt-LT)
 9/10/2545, 14 นาฬิกา 8 นาทีPDT (th-TH)
Dates and Calendars
 Dates are formatted differently and may use
different calendars altogether.
 System short dates:
 01-10-00 10/01/00 1月10日 12/01/2543 [Thai]
 Long dates:
Wednesday, January 12, 2000 [USA]
mercredi 12 janvier 2000 [French]
Quarta-feira, 12 de Janeiro de 2000 [Portuguese]
12. január 2000 [Slovak]
trešdiena, 2000. gada 12. janvāris [Latvian]
More Dates
• Abbreviations can be a problem:
  • USA: Sun, Mon, Tue
  • French: lun. mar. mer. [four positions]
  • USA: Jan, Feb, Mar
  • French: janv. févr. mars avr. [4-5 positions]
  • Spanish: ene, feb, mar [Spain]
  • Spanish: Ene, Feb, Mar [Latin America]
  • Russian: Пн Вт Ср [two positions]
Calendars Look Different
Ante meridiem?
Weekends and Holidays
• When is the weekend?
  • Friday is part of the weekend in some countries.
• Both official and unofficial holidays vary widely in number. Here are a few to watch for:
  • USA: July 4, MLK Day, Presidents’ Day, Veterans Day, Flag Day, Columbus Day, Thanksgiving…
  • Japan: Golden Week
  • China: New Year’s
  • Britain: Guy Fawkes Day, Boxing Day
  • France: Bastille Day
  • Spain: Reyes Magos
Range of Variation
Locale | SHORT | MEDIUM | LONG | FULL
Korean | 04. 3. 2. | 2004. 3. 2. | 2004년 3월 2일 (화) | 2004년 3월 2일 화요일
Thai | 2/3/2004 | 2 มี.ค. 2004 | 2 มีนาคม 2004 | วันอังคารที่ 2 มีนาคม ค.ศ. 2004
Thai (Thailand) | 2/3/2547 | 2 มี.ค. 2547 | 2 มีนาคม 2547 | วันอังคารที่ 2 มีนาคม พ.ศ. 2547
Thai (Thailand, TH) | ๒/๓/๒๕๔๗ | ๒ มี.ค. ๒๕๔๗ | ๒ มีนาคม ๒๕๔๗ | วันอังคารที่ ๒ มีนาคม พ.ศ. ๒๕๔๗
Chinese (China) | 04-3-2 | 2004-3-2 | 2004年3月2日 | 2004年3月2日 星期二
Byelorussian | 2.3.04 | 2.3.2004 | аўторак, 2, сакавіка 2004 | аўторак, 2, сакавіка 2004
Catalan (Spain) | 02/03/04 | 02/03/2004 | 2 / març / 2004 | dimarts, 2 / març / 2004
Czech | 2.3.04 | 2.3.2004 | 2. březen 2004 | Úterý, 2. březen 2004
Danish (Denmark) | 02-03-04 | 02-03-2004 | 2. marts 2004 | 2. marts 2004
German (Switzerland) | 02.03.04 | 02.03.2004 | 2. März 2004 | Dienstag, 2. März 2004
Spanish | 2/03/04 | 02-mar-2004 | 2 de marzo de 2004 | martes 2 de marzo de 2004
Spanish (Argentina) | 02/03/04 | 02/03/2004 | 2 de marzo de 2004 | martes 2 de marzo de 2004
French (France) | 02/03/04 | 2 mars 2004 | 2 mars 2004 | mardi 2 mars 2004
Lithuanian | 04.3.2 | 2004.3.2 | Antradienis, 2004, Kovo 2 | Antradienis, 2004, Kovo 2
Portuguese (Brazil) | 02/03/04 | 02/03/2004 | 2 de Março de 2004 | Terça-feira, 2 de Março de 2004
Albanian | 04-03-02 | 2004-03-02 | 2004-03-02 | 2004-03-02
Turkish | 02.03.2004 | 02.Mar.2004 | 02 Mart 2004 Salı | 02 Mart 2004 Salı
Arabic | 02/03/04 | 02/03/2004 | 02 مارس, 2004 | 02 مارس, 2004
Don’t Despair
 Lots of variation… but…
 Most programming languages
and/or operating environments
have support or APIs for getting
display value for an object (and
vice-versa)!
Complex Types
• Data structures, APIs, or classes built from basic types must include similar capabilities.
  • Store data in a locale-neutral or locale-independent format.
  • Display in a linguistically, regionally, and culturally sensitive manner.
  • Convert from locale format to locale-neutral or locale-independent storage format.
Data Structuring
• Identify your own “locale bias”
• Field names matter!
  • “Postal Code”, not “ZIP code”.
  • Family Name/Given Name, not First Name/Last Name
  • Avoid problematic fields
    • Postal address parsing? Area code? Etc.
Data Validation
Currency
• Currency formatting is usually similar to number formatting. But things can vary widely here, too:
  • $1,100.00 [USA]
  • €1 100,00 [France, Euro]
  • ¥1,100 [Japan]
  • 1.100$00 Esc. [Portugal, obsolete]
  • SFr. 1’000.00 [Switzerland]
• Currency associated with the locale doesn’t always apply. Store the currency type with the value.
  • Use ISO 4217 standard codes (USD, JPY, EUR, RUB)
• Not always one symbol.
• Not always two decimal places.
• $100 + ¥100 = $101
• Consider neutral displays!
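Storing the currency with the value might look something like the sketch below. The `Money` class and the `MINOR_UNITS` table are hypothetical and cover only three codes; a real system would take the digits from the full ISO 4217 list and handle conversion explicitly.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical subset of ISO 4217 minor-unit digits, for illustration only.
MINOR_UNITS = {"USD": 2, "EUR": 2, "JPY": 0}

@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str   # ISO 4217 code stored alongside the value

    def __add__(self, other):
        # Adding across currencies needs an exchange rate, so refuse to guess.
        if self.currency != other.currency:
            raise ValueError("cannot add different currencies without a rate")
        return Money(self.amount + other.amount, self.currency)

    def neutral_display(self):
        # Neutral display: number plus code, with the currency's own precision.
        digits = MINOR_UNITS[self.currency]
        return f"{self.amount:,.{digits}f} {self.currency}"

print(Money(Decimal("1100"), "USD").neutral_display())
print(Money(Decimal("1100"), "JPY").neutral_display())
```

Because the code travels with the amount, the user's locale can change the formatting without silently changing what the number means.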
Being Locale Neutral
• Avoid or reduce locale-affected display to increase portability
  • Use unambiguous formats, such as ISO 8601-like dates, especially in log files and the like
    • 2005-04-01 14:17:00 UTC
  • Use consistent formats (‘user locale’), especially in columns:
    • 1,234.56 USD
    • 2,556.78 EUR
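For log files, a locale-neutral timestamp can be produced without touching any locale settings. A minimal sketch with Python's `datetime`, using the date from the slide (the variable names are illustrative):

```python
from datetime import datetime, timezone

stamp = datetime(2005, 4, 1, 14, 17, 0, tzinfo=timezone.utc)

# ISO 8601-like form for log lines: numeric fields only, largest unit first.
log_line = stamp.strftime("%Y-%m-%d %H:%M:%S UTC")

# Strict ISO 8601 form, which can be parsed back without guessing.
iso_line = stamp.isoformat()
round_trip = datetime.fromisoformat(iso_line)

print(log_line)
print(iso_line)
```

A side benefit of the year-first layout is that sorting the strings also sorts the timestamps.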
Strings are Data, Too
 Textual data doesn’t get translated
on the fly.
String is the Thing
• Don’t use text as an identifier or foreign key.
  • Use ID numbers or not-human-readable values instead of requiring text fields to match.
  • “Intrinsic” data value versus “display” data value.
• Enumerated values displayed as strings.
• Use display strings.
  Enumerated: ACCOUNTS_PAYABLE
  Display: “Accounts Payable”, “pagável de clientes”
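Separating the intrinsic value from its display string could be sketched like this. The `Ledger` enum, the `DISPLAY_STRINGS` tables, and the `display_name` helper are hypothetical; the Portuguese string is simply the one from the slide.

```python
from enum import Enum

class Ledger(Enum):
    # Intrinsic values: stored and compared, never shown to users.
    ACCOUNTS_PAYABLE = 1
    ACCOUNTS_RECEIVABLE = 2

# Hypothetical display-string tables keyed by language.
DISPLAY_STRINGS = {
    "en": {
        "ACCOUNTS_PAYABLE": "Accounts Payable",
        "ACCOUNTS_RECEIVABLE": "Accounts Receivable",
    },
    "pt": {
        "ACCOUNTS_PAYABLE": "pagável de clientes",
    },
}

def display_name(value, language):
    """Look up the display string, falling back to English if it is missing."""
    table = DISPLAY_STRINGS.get(language, {})
    return table.get(value.name, DISPLAY_STRINGS["en"][value.name])

print(display_name(Ledger.ACCOUNTS_PAYABLE, "pt"))
```

Joins and lookups use `Ledger.ACCOUNTS_PAYABLE`; retranslating the label never changes a key.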
English-like Construction
 Concatenation
 String1 + string2
 Pluralization
 Dog + “s” = “dogs” (sheeps??)
 Lists
 1.23, 2.23, 3.36
 1,23, 2,23, 3,36?
Organizing Information
 “Alphabet” differences
 Additional information
 for example: yomi
 ASCII vs. the world
 Mixed information sets
Collation Continued
English: ABC...RSTUVWXYZ
German: AÄB...NOÖ...SßTUÜV…YZ
Swedish/Finnish: AB...STUVWXYZÅÄÖ
Norwegian: AB...VWXYÜZÆØÅ (note Y ~ Ü)
Spanish (traditional sort): “ch” sorts between “c” & “d”
  • Color, Charlar, Dar
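The Spanish example can be reproduced with a custom sort key. This is a deliberately simplified sketch of the traditional rule only: the `spanish_traditional_key` function and its alphabet are assumptions for illustration, and it ignores accents, "ll", and "ñ". Real code would use a collation library.

```python
import string

# Traditional Spanish alphabet fragment: "ch" is one letter between c and d.
LETTERS = list(string.ascii_lowercase)
LETTERS.insert(LETTERS.index("c") + 1, "ch")
RANK = {letter: i for i, letter in enumerate(LETTERS)}

def spanish_traditional_key(word):
    """Split a word into collation units, treating 'ch' as a single unit."""
    w = word.lower()
    key, i = [], 0
    while i < len(w):
        unit = "ch" if w.startswith("ch", i) else w[i]
        key.append(RANK.get(unit, len(RANK)))   # unknown units sort last
        i += len(unit)
    return key

words = ["Dar", "Charlar", "Color"]
binary_order = sorted(words)                                # code point order
spanish_order = sorted(words, key=spanish_traditional_key)

print(binary_order, spanish_order)
```

The same list comes out in two different orders, which is why sorted lists are a display concern rather than a storage one.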
Databases
• Most databases can only handle one collation sequence per instance or one collation per index.
  • Remove reliance on alphalists.
  • Self-collate short lists.
  • Pre-collate long lists?
• Example: NLS_SORT controls the way Oracle returns data (collation sequence).
  • Global environment variable.
  • Not necessarily under your control.
  • Indices are built on a predetermined or binary sort.
Enabling Summary
 Understand Encodings
 All text has an encoding!
 Understand Unicode
 Be Locale-Aware
 Create locale-neutral data structures
 Separate display from storage
Writing Systems
What’s a “Writing System”?
Affected by Writing System
 Set of characters used
 Writing Direction
 Word delimiters
 Hyphenation, line breaks
 Punctuation
Types of Writing Systems
 Alphabet
 Abjad
 Syllabary
 Ideograph
 Bidirectional
 “bidi” or “right-to-left”
“Kanji”
• Japanese writing consists of four different writing systems.
  • Romaji: Latin script (ABC…)
  • Kana: syllabic writing systems
    • Hiragana: mostly used for native Japanese words (あべせでご)
    • Katakana: mostly for words of foreign origin (コンプタア).
  • Kanji: ideographic writing system based on Chinese hanzi, or Han ideographs (日本語)
Bi-directional text
 Visual vs. Logical Order
 Unicode bidi
It’s About Time
Dates, Times, Durations, Calendars
and Time Zones
Computer vs. Wall Time
 Computer Time
 Clock ticks since epoch (the ticks and
epoch vary)
 Usually UTC
 Wall (or Field-based) Time
 Linked to culture and calendar
 Can be either time zone dependent or
independent
Durations
• Intervals, often used for repeating events (like a weekly status meeting)
  • Wall-time: this meeting is at 2 PM Pacific time every Tuesday (the interval between occurrences may vary in number of seconds)
  • Fixed-duration: run the virus scanner every 57 minutes (interval is always 3,420,000 milliseconds)
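The difference between the two kinds of interval shows up around a daylight-saving change. In the sketch below the Pacific offsets are hard-coded as an assumption to keep the example self-contained (real code would use a time zone database), and the dates straddle the US change on 2 April 2006:

```python
from datetime import datetime, timedelta, timezone

# Offsets hard-coded for illustration only.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")

# Wall time: 2 PM Pacific on two consecutive Tuesdays.
before = datetime(2006, 3, 28, 14, 0, tzinfo=PST)
after = datetime(2006, 4, 4, 14, 0, tzinfo=PDT)
wall_interval = after - before          # elapsed time, computed via UTC

# Fixed duration: the same number of milliseconds every time.
fixed_interval = timedelta(minutes=57)
fixed_ms = fixed_interval // timedelta(milliseconds=1)

print(wall_interval, fixed_ms)
```

The weekly meeting is one hour short of seven days that week, while the 57-minute job never changes length.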
More Durations
 Can get tricky…
Locale-Neutral Formats
 ISO 8601
Calendars
 Gregorian
 Japanese Imperial
 Hijri
 Thai Buddhist
 Chinese Traditional
 Jewish
 Astronomy
Friday, January 20, 2006
الجمعة، 20 ذو الحجة، 1426
2006年1月20日星期五
二○○六年一月二十日星期五
平成18年1月20日
平成十八年一月二十日
วันศุกร์ที่ 20 มกราคม พ.ศ. 2549
วันศุกร์ที่ ๒๐ มกราคม พ.ศ. ๒๕๔๙
Time Zones
 Zone-dependent vs. zone-independent time
 Summer (Daylight) Time
Externalization
Moving language and culturally
affected data and components out of
code.
Localization is obvious…
• “Localization” is not “Internationalization”!
• Localizability is internationalization.
  • Externalize text
  • Externalize presentation
  • Dynamic composition
  • Distribution of language content
  • “Plug-in” features
What is Localization?
• The process of tailoring a product to a specific target market.
  • Translation of messages
  • Adaptation to local preferences
  • Addition (or subtraction) of content or features
Avoiding Forks
English Version
version française
Deutsche Version
日本語版
Global Binary
Resources
Resources
Resources
Resources
Forked Code Woes
 Hard to fix and maintain
 Different versions in the field
 Delays in releasing localized product
 Different functionality by region
 Confusing for customers/users
Other Benefits
 Rename or re-brand product
 Fix spelling or grammar mistakes
 Fix usability
 Make terminology consistent
 … all without a rebuild!
What are Resources?
 “Resources” are source code files
that contain language and locale
affected materials.
Not Just Text
• Anything that can be localized in a product needs to be in resources
  • Text
  • Error messages
  • Fonts
  • Colors
  • Graphics
  • Sizes
  • Positions
  • Magic Numbers
  • Mnemonic Keys (“Alt+G”, “F4”, etc.)
  • File Locations
  • Dictionaries, Glossaries, Grammar Checkers
  • Code
“Do Not Translate”
• Some content should be externalized but not translated
  • “DNT” = “do not translate”
• Externalize? Yes…
• Segregate DNT material from translated material if possible.
• Developers can’t always tell… and neither can translators.
The “Locale” in “Localization”
 Resources “fall
back” to find the
best match
Global Binary
Resources
Falling back
zh-Hans-SG (Chinese, Simplified script, Singapore)
zh-Hans (Chinese, Simplified script)
zh (Chinese)
(root)
Sparse Population
 A given language resource may not
contain a complete set of resources.
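Fallback over sparsely populated resources can be sketched as a simple chain walk. The `RESOURCES` tables and the function names are hypothetical; real resource loaders (resource bundles, message catalogs) implement their own variants of this lookup.

```python
def fallback_chain(tag):
    """'zh-Hans-SG' -> ['zh-Hans-SG', 'zh-Hans', 'zh', ''] ('' is the root)."""
    parts = tag.split("-")
    chain = ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return chain + [""]

# Hypothetical, sparsely populated resource tables.
RESOURCES = {
    "": {"dialogTitle": "Dialog Title", "aMessage": "This is a message."},
    "zh": {"dialogTitle": "对话框标题"},
    "zh-Hans-SG": {},
}

def lookup(key, tag):
    """Return the most specific string available for the requested locale."""
    for candidate in fallback_chain(tag):
        table = RESOURCES.get(candidate, {})
        if key in table:
            return table[key]
    raise KeyError(key)

print(fallback_chain("zh-Hans-SG"))
print(lookup("dialogTitle", "zh-Hans-SG"), "|", lookup("aMessage", "zh-Hans-SG"))
```

Because the lookup is per key, a partly translated product shows a mix of languages instead of failing outright.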
The “Locale” in “Localization” (2)
 User Locale
 Server Locale
 Multiple
levels
 Resource
Locale

Multiple
levels
Resources and Translation
“key”, “display string”
“dialogTitle”, “Dialog Title”
“aMessage”, “This is a message.”
“key”, “ðìsplàÿ stríñg”
“dialogTitle”, “Ðîálòg Tïtlè”
“aMessage”, “Thìß ís â Mésßãgê.”
Pseudo-Translation
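A pseudo-translator along these lines takes only a few lines. The accent map and bracket markers below are assumptions for illustration (any visually similar non-ASCII letters would do); the brackets make truncated strings easy to spot on screen.

```python
# Hypothetical accent map: ASCII vowels to accented look-alikes.
ACCENTS = str.maketrans("aeiouAEIOU", "àéîõüÀÉÎÕÜ")

def pseudo_translate(text):
    """Return a readable but obviously non-English version of a UI string."""
    return "[" + text.translate(ACCENTS) + "]"

SOURCE = {"dialogTitle": "Dialog Title", "aMessage": "This is a message."}
PSEUDO = {key: pseudo_translate(value) for key, value in SOURCE.items()}

for key, value in PSEUDO.items():
    print(key, value)
```

Any string that still appears in plain English when running with the pseudo resources was probably hard-coded rather than externalized.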
Deadly String Cats
“There were 14 errors found.”
[] files out of [] were deleted.
An error occurred at [] on [].
Page [] of []
Processing: []% complete.
Issues with composition
• Count:
  • There were one errors found.
  • You have earned your 22th set of bonus points.
• Gender:
  • “Documenti del Chris”
  • “Documenti della Chris”
  • “Documenti - Chris”
• Case
• Grammatical Structure
  • SOV, SVO, etc.
• Word Order and Inter-word Dependency
Sentence Parts Must Agree
 Endings, Gender, Plurality, Case
 e.g. Japanese counting uses different
words for different kinds of objects
 e.g. Slavic languages use different
endings for singular, few, many…
English                     French
Printer: Enabled            Imprimante: Activée
Stacker: Enabled            Module de réception: Activé
Stapler options: Enabled    Options d’agrafage: Activées
Word Order Variation
The cat ate the fish (English)
猫は魚を食べた (Japanese)
 Use “Message Formatting” APIs
 Avoid the “+” operator
 Number replacements
 Use choice structures
 Avoid inter-word dependency!

Dereference
Complex message formatting
Segmentation fault…
<p>The cat <emph>ate</emph> the
fish.</p>
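The advice about message formatting and choice structures might be sketched as follows. The message tables, keys, and the simplified `plural_category` rule are all hypothetical, and the Japanese string is only a sample translation; real products would use a message-format library with full plural rules.

```python
# Hypothetical message templates; in a product these live in resource files.
MESSAGES = {
    "en": {
        "deleted.one": "{done} file out of {total} was deleted.",
        "deleted.other": "{done} files out of {total} were deleted.",
    },
    "ja": {
        # Same placeholders, different order: the translator controls position.
        "deleted.other": "{total}個のファイルのうち{done}個を削除しました。",
    },
}

def plural_category(language, n):
    # Simplified choice: English has one/other, Japanese a single form.
    return "one" if language == "en" and n == 1 else "other"

def format_deleted(language, done, total):
    """Pick a whole-sentence pattern, then substitute named arguments."""
    key = "deleted." + plural_category(language, done)
    return MESSAGES[language][key].format(done=done, total=total)

print(format_deleted("en", 1, 5))
print(format_deleted("en", 3, 5))
print(format_deleted("ja", 3, 5))
```

No sentence is assembled with "+", so each language gets a complete pattern that its grammar can rearrange.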
Images and Icons
 Avoid metaphors
 Avoid cultural sensitivities
 Avoid body parts
 Replace as necessary
Images and Culture
 Beware your
biases—even
good ones.
Getting the Right Resource
 Use the right locale
 User, server, or data locale
Isn’t it Swell?
• English is very succinct.
  • Words in other languages are often longer
  • Sentences may be longer
  • Characters may be larger
More Swollen Text
• 30% in length (alphabetics, abjads, etc.)
• 30% in height (ideographics)
• But… a rule of thumb, not a “fact”
  • Measure your results with care.
GUI Layout
Managing English Text
String
Building
Abbrev.
Eng.
String is the Thing
String is the
Thing?
Dereferencing
 Minimize sentence building
 Minimize arguments per string
 Use subject:predicate wherever
possible
Don’t do this:
Your balance is $100.00.
When you can do this:
Balance: $100.00
Dynamic vs. Static Layout
 Magic numbers
 Externalized layouts
 Mnemonics
 Colors
Spanish Keyboard
Keys added, deleted, moved.
(Changes application fingering.)
Orange keys enable the input of accented characters
Localizing Styles
• Bolding is not universal for emphasis
  • Italicization, capitalization, etc. are also not universal
• Use logical, not presentational, names
  • Describe the function, not the appearance
中国
Amikake
Wakiten
Use of Color
“Going Down”
“Going Up”
Customization
When is it okay?
• Content must be highly localized or has local-specific requirements:
  • customization lets you address this requirement in the most localized possible manner
Externalization Redux
[Diagram labels: currencies, dates, numbers, times, address formats, titles, images, colors, sounds, text, legal rules, accounting rules; Language and Locale Independent Code]
Customization: Examples
 Dictionaries, glossaries, grammar
checkers, style sheets
 Custom content, templates, data,
sample text, tutorials, training
materials
 Postal address validator, personal
name validator, etc.
Building Global Software
Beyond Just Coding:
Localization, GA, and all that
The internationalization cycle
• Encompasses the full development cycle:
  • Requirements
  • Design
  • Development
  • QC
  • Release
  • Support
[Diagram labels: Support Issues and Requests (all customers); Develop Roadmap (where is the product going?); Develop Requirements & Architecture; Design (internationalized); Code (enable, externalize, modularize); Test (non-English/non-ASCII); RTM/GA (by market)]
What is “internationalized QA”?
• Does the enabled product work correctly?
  • Non-English configurations
  • Non-ASCII data and encoding support
  • Cross time zone support
• Does localization appear correctly?
• Is the product localizable?
Growing (and Pruning) the Matrix
 Include non-English configurations
in your test matrix
Don’t lose control of the matrix.
What to Test With

• Test non-English configurations
  • Non-English locales (lying to your machine)
  • Native configurations
• Test non-ASCII data
  • Encodings, encodings, everywhere
  • Non-ASCII character values
• Test across time zones
  • Two or more, and consider the international date line and DST issues
Planning Testing
 Get tools that are enabled!
 Automation allows greater
coverage, but only if it works.
 Plan encodings and locales as part
of the test matrix.
 Acquire third-party products as
necessary.
 Cheat.
Configuring Machines
 Create both native and simulated
environments

Don’t buy physical keyboards (use
software keyboards)
Localization
The UI Freeze
• Localization is part of the release process too.
  • Changes to the user interface cost the localization team time and money.
  • (Changes to the product cost the documentation and QA folks too)
Simultaneous Shipment (Simship)
 Ideally, to maximize opportunity,
ship the target languages the same
day as the source language.
Distribution of Content
• How does the localized text get into the running product?
  • Satellite assemblies, DLLs, shared libraries
  • Message catalogs
  • Special directory
  • Database
  • Etc.
More Distribution
 “Specific Language”
(per-language)
 “Language Included”
(one or more languages)
 “Language Pack”
(product plus something)
Static vs. Dynamic Content
 Dynamic content may include the
initial set of data or other items
which need to be localized too.
Summary
Internationalization
• … is a fundamental architectural approach: it is how software is built.
  • Design
  • Enabling
  • Externalization
  • Customization
  • Testing and Support
  • Lifecycle
Q&A
“Would you please write the
code for I18N on the
whiteboard before you go?”
#import i18n.h