Transcript Document
ICU Overview: The Open-Source Unicode Library, v3.2
Markus Scherer
ICU Manager
IBM Globalization Center of Competency
27th Internationalization and Unicode Conference
Berlin, Germany, April 2005
Agenda
Background
What is ICU?
Architecture Overview
ICU Features and recent additions
References
Q and A
Why Globalization?
Unicode
All world languages
Efficient and effective processing
Lossless data exchange
Enables single-binary global software
But… all languages ⇒ large, complex standard
– 1,400 pages + Annexes + additional standards
– 90,000+ characters
– Major update every 3 years
– 70 character properties, many multi-valued
– Affects many processes: display, line-break, regex, …
Locales
Features vary widely across languages & countries
– Sorting, line breaks, date/time/number/currency formatting,
codepage conversion, …
– Performance is key: easy to do the right thing; hard to do it
fast
What is ICU?
Globalization / Unicode / Locales
Mature, widely used set of C/C++ and Java libraries
– Basis for Java 1.1 internationalization – but goes far beyond
– “ICU4C”: C/C++ libraries; “ICU4J”: Java library
Very portable – identical results on all platforms / programming
languages
– C/C++: 30+ platforms/compilers
– Java: IBM & Sun JDK
Full threading model; customizable; modular
Open source – but not viral
ICU 3.2: 78 languages; 118 countries; 870 codepages
Who uses ICU? (Examples)
Products Within IBM
– DB2, COBOL, InfoPrint Manager, Lotus Notes, Lotus
Workplace, Tivoli Presentation Services, WebSphere,
XML Parser, …
Other Companies and Organizations
– Adobe, Apple (Mac OS X), BEA, CERN, Cognos,
Debian, HP, Inktomi, JD Edwards, Macromedia,
Mathworks, Mozilla, NCR, OpenOffice, PayPal, SAP,
Siebel, SIL, Software AG, Sun Microsystems (Solaris,
Java), SuSE, Sybase, webMethods, …
ICU Features
Unicode text handling
Unicode Regular Expressions
Charset conversions (870+)
Breaks: word, line, …
Collation & Searching
Locales (170+)
Resource Bundles
Calendar & Time zones
Complex-text layout engine
Formatting
– Date & time
– Messages
– Numbers & currencies
Transforms
– Normalization
– Casing
– Transliterations
Architecture Overview 1
Locale Based Services
– Locale is an identifier, not a container
– Keywords for variants: de@collation=phonebook
– Recent addition: accept-language support
Resource inheritance: shared resources
– Levels: root → language → script → country
– root → en → US, IE
– root → de → DE, CH
– root → zh → Hant → CN, TW
– root → zh → Hans → CN, TW
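The inheritance chain on this slide can be sketched as a lookup order: a resource missing at the most specific level is looked up one level up, ending at root. The sketch below is a simplified illustration, not ICU's implementation; the class and method names are hypothetical, and keywords such as @collation=phonebook are not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class LocaleFallbackSketch {
    /** Returns the lookup order for a locale ID such as "zh_Hant_TW". */
    static List<String> fallbackChain(String localeId) {
        List<String> chain = new ArrayList<>();
        String current = localeId;
        while (!current.isEmpty()) {
            chain.add(current);
            // Drop the last subtag: country first, then script, then language.
            int cut = current.lastIndexOf('_');
            current = (cut < 0) ? "" : current.substring(0, cut);
        }
        chain.add("root"); // every level ultimately inherits from root
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("zh_Hant_TW")); // [zh_Hant_TW, zh_Hant, zh, root]
        System.out.println(fallbackChain("de_CH"));      // [de_CH, de, root]
    }
}
```

Because shared resources live at the higher levels, a country-level bundle only needs to store what differs from its parent.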
Architecture Overview 2
Open and Close Service Model
– Open a service object, use it many times, close it when done
– Better performance by avoiding setup costs per operation
– Warning: use properly for maximum performance
ICU Threading Model
– Multiple service objects in use simultaneously, with same or
different attributes
– Large resources shared in read-only cache
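A minimal sketch of the open-once, use-many-times pattern, using the JDK's java.text.Collator as a stand-in (ICU4J provides a Collator class of the same shape in com.ibm.icu.text). The class and method names are hypothetical. Java has no explicit close step; in ICU4C the matching call would release the service object.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class ReuseServiceSketch {
    static List<String> sortAll(List<String> input) {
        // "Open" the service once; the setup cost is paid here.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> sorted = new ArrayList<>(input);
        // The same collator is reused for every comparison the sort performs.
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        input.add("cherry");
        input.add("banana");
        input.add("Apple");
        System.out.println(sortAll(input)); // [Apple, banana, cherry]
    }
}
```

Creating a new collator inside the comparison loop would give the same order but repeat the setup work for each pair of strings.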
Architecture Overview 3
Data Driven Services
– Customize at build-time or run-time
– Interchange with other platforms;
• same results on each
– Rule-based
• Collation, Word-breaks, Transforms
– Pattern-based
• Formats, UnicodeSet
– Table-based
• Character Conversion
Architecture Overview – ICU4C
Simple Error Handling
– C++ subset for portability
– Support for multi-threaded environment
Version Management
– Multiple versions at the same time
– Data and library versioning
String Buffer Management
– Preflighting and overflow protection
Misc: Load/Unload ICU
Recent Additions:
– Runtime-settable memory allocation and mutex functions
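Preflighting means calling a function once only to learn how large the output will be, then calling it again with a buffer of that size. The sketch below imitates the C-style convention in Java for illustration. The function toUpper here is hypothetical and simplified: ICU4C functions additionally report an overflow status through an error code parameter.

```java
import java.util.Locale;

final class PreflightSketch {
    /**
     * Hypothetical C-style function: writes the upper-cased text into dest
     * only if it fits, and always returns the length the full result needs.
     */
    static int toUpper(String src, char[] dest) {
        String result = src.toUpperCase(Locale.ROOT);
        int needed = result.length();
        if (dest != null && dest.length >= needed) {
            result.getChars(0, needed, dest, 0);
        }
        return needed; // callers compare this with dest.length to detect overflow
    }

    static String toUpperWithPreflight(String src) {
        int needed = toUpper(src, null);   // preflight: ask for the size only
        char[] buffer = new char[needed];  // allocate exactly enough
        toUpper(src, buffer);              // second call fills the buffer
        return new String(buffer);
    }

    public static void main(String[] args) {
        // The result is longer than the input, so the size cannot be guessed.
        System.out.println(toUpperWithPreflight("stra\u00DFe")); // STRASSE
    }
}
```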
Architecture Overview – ICU4J
Supplement for Java
Core globalization (no character conversion or
regular expressions, no GUI components)
– We do supply complex text support for Sun
Modularized: products may add just needed
functionality
ICU4J vs. JDK
CLDR 1.2 (Common Locale Data Repository)
Up-to-date globalization: standards-compliant; latest Unicode
– Supplementary characters (GB 18030, JIS X 0213, HKSCS)
• Java 5 adds handling of supplementary characters
– Full properties – JDK has only a fraction
– Unicode Collation Algorithm
– Local calendars (Thailand, Japan,…); ISO dates
– Currencies, String Search, Int’l Domain Names
– Transforms: Case, Scripts, Normalization
Much faster turn-around on bug fixes, enhancements
Unicode Text Handling
C
– UChar*: null-terminated or with length
C++
– UnicodeString: full featured string class
Java
– Uses normal JDK String, adds utilities
All handle supplementary characters
– Required for GB 18030/JIS X 0213/HKSCS repertoires
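A short sketch of what handling supplementary characters means for a UTF-16 string: one character beyond U+FFFF occupies two code units, so code has to iterate by code point. The example uses only JDK String methods (Java 5 and later); the class name is hypothetical.

```java
final class SupplementarySketch {
    /** Counts characters as Unicode code points rather than UTF-16 code units. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Formats each code point as U+XXXX, keeping surrogate pairs together. */
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a\uD840\uDC00"; // 'a' followed by U+20000, a CJK Extension B ideograph
        System.out.println(text.length());         // 3
        System.out.println(codePointLength(text)); // 2
        System.out.println(describe(text));        // U+0061 U+20000
    }
}
```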
Unicode Text Handling 2
All Unicode 4.0.1 properties
– Direct API
• Values, names, enumerations
– UnicodeSet
• Fast, compact set operations
• Pattern-based (both Perl & POSIX syntax for properties)
– \p{greek} vs. [:greek:]
• All properties:
– [\p{lowercase}-[a-z]]
– [\p{greek} & \p{uppercase}]
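The two set expressions above can be approximated with the JDK's java.util.regex character classes, used here as a stand-in for UnicodeSet. Note the syntax differences, which are specific to java.util.regex: properties take an Is prefix, intersection is written &&, and set difference is expressed as intersection with a negated class. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PropertySetSketch {
    // Slide pattern [\p{greek} & \p{uppercase}], in java.util.regex syntax.
    static final Pattern GREEK_UPPER =
            Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    // Slide pattern [\p{lowercase}-[a-z]]: lowercase characters outside ASCII a-z.
    static final Pattern NON_ASCII_LOWER =
            Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    static boolean isGreekUpper(String s) {
        return GREEK_UPPER.matcher(s).matches();
    }

    static boolean isNonAsciiLower(String s) {
        return NON_ASCII_LOWER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGreekUpper("\u03A9"));    // true: GREEK CAPITAL LETTER OMEGA
        System.out.println(isGreekUpper("\u03C9"));    // false: lowercase omega
        System.out.println(isNonAsciiLower("\u03C9")); // true
        System.out.println(isNonAsciiLower("a"));      // false: excluded by [^a-z]
    }
}
```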
Data: Recent Additions
Conforms to CLDR 1.2
– 50% more data than CLDR 1.0: adding many translated terms for
languages, scripts, countries, currencies, and time zones.
– Added data for new languages: Malayalam, Oriya, Welsh
Reduced multiplatform install image size
Improved XLIFF-ICU conversion tools
Locale canonicalization spec defined and implemented (C+J)
– Provides interoperability with POSIX and .NET locale IDs, more
RFC 3066 support
Character Set Conversion
Precise alias information:
– When you ask for “SJIS”, you can request the precise
definition by platform:
• Windows, IBM, Solaris,…
Buffer management
– automatically handles characters that cross buffers
Customizations allowed for:
– illegal sequences
– undefined characters
Unicode Text Compression – SCSU, BOCU-1
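Customizing the treatment of illegal sequences can be illustrated with the JDK's java.nio.charset decoder, used here as a stand-in for ICU's converter callbacks. The caller chooses whether bad input is replaced, skipped, or reported. The class and method names are hypothetical, and wrapping the checked exception is a choice made for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ConversionErrorSketch {
    /** Decodes UTF-8 bytes, applying the given policy to bad input. */
    static String decodeUtf8(byte[] bytes, CodingErrorAction onError) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(onError)       // illegal byte sequences
                    .onUnmappableCharacter(onError)  // undefined characters
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Only reached when the policy is REPORT.
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x61, (byte) 0xFF, 0x62}; // 'a', an illegal byte, 'b'
        // REPLACE substitutes U+FFFD for the illegal byte.
        System.out.println(decodeUtf8(bad, CodingErrorAction.REPLACE).length()); // 3
        // IGNORE drops the illegal byte.
        System.out.println(decodeUtf8(bad, CodingErrorAction.IGNORE)); // ab
    }
}
```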
Collation and Searching
Fast international comparison and string search;
fully UCA compliant
– Compressed sort keys, optimized string comparison,
sublinear string search
– incremental sortkeys for radix-sort
Precise binary sortkey stability over time
Fully data driven
API / rule customizations
– strength, normalization, upper vs. lowercase first, ignore
punctuation, sort digits as numbers, …
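The strength attribute and sort keys can be sketched with the JDK's java.text.Collator, again as a stand-in for the ICU collator. At primary strength, case differences are ignored; a sort key is a precomputed form of a string whose comparison agrees with the collator's own. The class and method names are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

final class CollationStrengthSketch {
    /** Compares two strings at the given strength, e.g. Collator.PRIMARY. */
    static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    /** Checks that comparing sort keys gives the same order as the collator. */
    static boolean sortKeysAgree(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        CollationKey keyA = collator.getCollationKey(a);
        CollationKey keyB = collator.getCollationKey(b);
        return Integer.signum(keyA.compareTo(keyB))
                == Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        System.out.println(compare("abc", "ABC", Collator.PRIMARY));       // 0: case ignored
        System.out.println(compare("abc", "ABC", Collator.TERTIARY) != 0); // true: case matters
        System.out.println(sortKeysAgree("apple", "Banana"));              // true
    }
}
```

Sort keys pay off when the same strings are compared many times, for example in a database index, because each later comparison is a plain binary comparison.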
Collation and Searching: Recent Additions
Numeric sorting: sequences of digits can be sorted
numerically instead of alphabetically
– e.g., filenames would sort "ab-2" < "ab-10"
– without material performance cost
– with reduced sortkey length.
Significantly improved sorting orders for many other
languages
Data in separate tree, for easier modularization and
maintenance
getFunctionalEquivalent API allows for better caching and UI
support.
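The effect of numeric sorting on the "ab-2" < "ab-10" example can be sketched with a small comparator. This is not how ICU implements it (ICU builds the numeric value into the collation sort key); it is a simplified illustration that handles ASCII digits only and compares other characters by code unit. All names are hypothetical.

```java
import java.math.BigInteger;

final class NumericOrderSketch {
    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the index just past the run of digits starting at start. */
    private static int digitRunEnd(String s, int start) {
        int end = start;
        while (end < s.length() && isAsciiDigit(s.charAt(end))) {
            end++;
        }
        return end;
    }

    /** Compares strings, treating each run of digits as one number. */
    static int compareNumeric(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i);
            char cb = b.charAt(j);
            if (isAsciiDigit(ca) && isAsciiDigit(cb)) {
                int iEnd = digitRunEnd(a, i);
                int jEnd = digitRunEnd(b, j);
                BigInteger na = new BigInteger(a.substring(i, iEnd));
                BigInteger nb = new BigInteger(b.substring(j, jEnd));
                int byValue = na.compareTo(nb);
                if (byValue != 0) {
                    return byValue;
                }
                i = iEnd;
                j = jEnd;
            } else {
                if (ca != cb) {
                    return Character.compare(ca, cb);
                }
                i++;
                j++;
            }
        }
        // The string with characters left over sorts after the other one.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        System.out.println(compareNumeric("ab-2", "ab-10") < 0); // true: 2 < 10
        System.out.println("ab-2".compareTo("ab-10") < 0);       // false: '2' > '1'
    }
}
```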
Calendar & Time Zones
International Calendars – Arabic, Buddhist, Hebrew, Japanese
– Required for correct presentation of dates in some countries
Olson timezone support, with localizations
Recent Additions:
– RFC822 time zone format support in DateFormat (C+J) for
compatibility.
– “Universal Time” conversions for high-precision date/time
computations
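A brief sketch of a non-Gregorian calendar, an Olson time zone ID, and an RFC 822 style offset, using the JDK's java.time package (Java 8 and later) as a stand-in for ICU's Calendar, TimeZone and DateFormat. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.chrono.ThaiBuddhistDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

final class CalendarZoneSketch {
    /** Returns the Buddhist-era year for an ISO (Gregorian) date. */
    static int buddhistYear(LocalDate isoDate) {
        return ThaiBuddhistDate.from(isoDate).get(ChronoField.YEAR);
    }

    /** Formats the zone offset in RFC 822 style, e.g. +0200. */
    static String rfc822Offset(ZonedDateTime when) {
        return DateTimeFormatter.ofPattern("Z").format(when);
    }

    public static void main(String[] args) {
        System.out.println(buddhistYear(LocalDate.of(2005, 4, 6))); // 2548

        // "Europe/Berlin" is an Olson ID; daylight saving time applies in April.
        ZonedDateTime berlin =
                ZonedDateTime.of(2005, 4, 6, 12, 0, 0, 0, ZoneId.of("Europe/Berlin"));
        System.out.println(rfc822Offset(berlin)); // +0200
    }
}
```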
Formatting
Date & time: 8 formats per locale
Messages
– Completely localizable, Plural support
Numbers & currencies
– Scientific Notation, Spelled-out (checks, etc.)
– Full Orthogonal Currency support
• INR in Hindi: (sample not preserved)
• INR in English: Rs. 1,234.57
• INR in German: Rs. 1.234,57
Recent Additions
– POSIX migration library
– Allows parsing multiple currencies with one formatter
– Short and stand-alone month/day names
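Orthogonal currency support means the currency and the locale are chosen independently: the locale decides separators and layout, the currency decides the symbol. A minimal sketch with the JDK's java.text.NumberFormat follows (ICU4J has a NumberFormat class of the same shape). The class and method names are hypothetical. The currency symbol that gets printed depends on the locale data shipped with the runtime, so no output is shown for it.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

final class CurrencyFormatSketch {
    /** Formats a plain number with the locale's separators. */
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    /** Formats an amount in Indian rupees using the given locale's conventions. */
    static String formatRupees(double amount, Locale locale) {
        NumberFormat format = NumberFormat.getCurrencyInstance(locale);
        format.setCurrency(Currency.getInstance("INR")); // currency chosen independently
        return format.format(amount);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234.57, Locale.US));      // 1,234.57
        System.out.println(formatNumber(1234.57, Locale.GERMANY)); // 1.234,57

        // Same currency, two locales: only separators and layout differ.
        System.out.println(formatRupees(1234.57, Locale.US));
        System.out.println(formatRupees(1234.57, Locale.GERMANY));
    }
}
```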
Transforms
Unicode Normalization
– Highly optimized for performance
– performance utilities: concatenation, detection, comparison
Casing (upper, lower, title, folding)
General Transforms
– Script transliterations
– Half-width/Full-width, Hex, etc.
– Chain transforms together, filter source characters
– Rule-based, customizable at runtime.
IDNA: Internationalized Domain Names
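Three of the transforms listed above have JDK counterparts that make short, runnable illustrations: java.text.Normalizer for normalization, String.toUpperCase(Locale) for locale-sensitive casing, and java.net.IDN for domain names. They are stand-ins for the ICU services, and the class and method names below are hypothetical.

```java
import java.net.IDN;
import java.text.Normalizer;
import java.util.Locale;

final class TransformSketch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Two strings are canonically equivalent if their NFD forms are identical. */
    static boolean canonicallyEqual(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String idnToAscii(String domain) {
        return IDN.toASCII(domain);
    }

    public static void main(String[] args) {
        // 'a' + COMBINING ACUTE ACCENT composes to the single character U+00E1.
        System.out.println(nfc("a\u0301").equals("\u00E1"));       // true
        System.out.println(canonicallyEqual("a\u0301", "\u00E1")); // true

        // Turkish maps 'i' to the dotted capital U+0130; English maps it to 'I'.
        System.out.println(upper("i", Locale.ENGLISH)); // I
        System.out.println(upper("i", Locale.forLanguageTag("tr")).equals("\u0130")); // true

        System.out.println(idnToAscii("b\u00FCcher.example")); // xn--bcher-kva.example
    }
}
```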
Segmentation: word, line & sentence
Fast state-table implementation
Customizable
– Rule-based – customizable at runtime
– Special customizations, e.g. Thai
Recent Additions:
– Greatly improved performance when going backwards
(common case when doing line break)
– Java
• The rules syntax has been extended. Rules can now return
information about the types of characters they encountered.
• Common compiled (binary) rule format with ICU4C
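Word segmentation can be sketched with the JDK's java.text.BreakIterator, which has the same basic shape as the ICU class of that name: set the text, then walk from boundary to boundary. The filter that keeps only segments starting with a letter or digit is a simplification made for this sketch, and the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {
    /** Returns the words of the text, skipping spaces and punctuation. */
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
                end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Segments between boundaries also include spaces and punctuation.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```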
Unicode Regular Expressions
Full Regex Implementation
– ICU4C only: Java 1.4 has its own package (though not as powerful)
All Unicode 4.0.1 Properties
– supported through UnicodeSet
Good performance
– competitive with non-Unicode regex
Recent Additions
– Now features a C API, instead of just C++.
Complex-text layout engine
Glyph processing, positioning & adjustment
– ligature substitution, contextual forms, kerning, accent placement,
Bidi scripts, etc.
Support for:
– Drawing
– Caret Display
– Hit Testing
– Selection Highlighting
– Caret Movement
– Layout Metrics
– Line Break
Recent addition: Canonical Equivalence: a + ´ or á
References
ICU main site:
– http://ibm.com/software/globalization/icu
• New URL
– Links to
• Download ICU
• User Guide, Technical FAQ, Support, Bug Reports
Unicode Consortium
– http://www.unicode.org
• Unicode glossary, Unicode character database
Questions and Answers