ICU Overview
ICU Overview:
The Open Source Unicode Library
George Rhoten
IBM Globalization Center of Competency
28th Internationalization and Unicode Conference
© 2005 IBM Corporation
ICU Overview: The Open Source Unicode Library
Agenda
Background Information
What is ICU?
Architecture Overview
– Significant New ICU Features
References
Q and A
Orlando, Florida, September, 2005
Why Globalization?
Unicode
Handles all modern world languages
Efficient and effective processing
Lossless data exchange
Enables single-binary global software
But… all languages ⇒ large, complex standard
– 1,400 pages + Annexes + additional standards
– 96,000+ characters
– Major update every 3 years
– Minor update about once a year
– 70 character properties, many multi-valued
– Affects many processes: display, line-break, regular expressions, …
Internationalization, Localization & Locales
Requirements vary widely across languages & countries
– Sorting
– Text searching
– Line breaks
– Date/time/number/currency formatting
– Codepage conversion
– …and so on
Performance is key
– It is easy to do the right thing
– It is hard to do it fast
What is ICU?
International Components for Unicode
Globalization / Unicode / Locales
Mature, widely used set of C/C++ and Java libraries
– Basis for Java 1.1 internationalization, but goes far beyond Java 1.1
Very portable – identical results on all platforms / programming languages
– C/C++: 30+ platforms/compilers
– Java: IBM & Sun JDK
– You can use: C/C++ (ICU4C), Java (ICU4J), C/C++ with Java (ICU4JNI)
Full threading model
Customizable
Modular
Open source – but non-restrictive
Who uses ICU?
Products Within IBM
– All 5 major software brands
– Many other related software applications
– Used on all IBM operating systems
Other Companies and Organizations
– Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business
Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP,
Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks,
MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX,
Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems
(Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine,
Leica Geosystems GIS & Mapping LLC., Xerox, Yahoo!
...and many more
ICU Features
Unicode text handling
Charset conversions (700+)
Collation & Searching
Locales from CLDR (250+)
Resource Bundles
Calendar & Time zones
Complex-text layout engine
Unicode Regular Expressions
Breaks: word, line, …
Formatting
– Date & time
– Messages
– Numbers & currencies
Transforms
– Normalization
– Casing
– Transliterations
Architecture Overview 1
Locale Based Services
– Locale is an identifier, not a container
– Keywords for variants: de@collation=phonebook
Resource inheritance: shared resources
– root
– Language: en, de, zh
– Script: Hant, Hans
– Region: US, IE, DE, CH, TW, CN
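The inheritance levels above (root, language, script, region) can be illustrated with a small lookup-order sketch. This is not ICU code: `fallbackChain` is a hypothetical helper written for this example, and treating keywords such as `@collation=phonebook` as outside the inheritance chain is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class LocaleFallbackSketch {
    // Hypothetical helper: returns the order in which resources would be
    // consulted for a locale ID, from most specific to "root".
    static List<String> fallbackChain(String localeId) {
        // Assumption: keywords (the part after '@') select a service variant
        // and are dropped before walking the inheritance chain.
        int at = localeId.indexOf('@');
        String id = at >= 0 ? localeId.substring(0, at) : localeId;

        List<String> chain = new ArrayList<>();
        while (!id.isEmpty()) {
            chain.add(id);
            // Trim the last subtag: region first, then script, then language.
            int cut = id.lastIndexOf('_');
            id = cut >= 0 ? id.substring(0, cut) : "";
        }
        chain.add("root");
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(fallbackChain("de_CH"));
        System.out.println(fallbackChain("zh_Hant_TW"));
        System.out.println(fallbackChain("de@collation=phonebook"));
    }
}
```

Because the locale is only an identifier, the chain is computed from the ID string alone; the shared data lives in the resources found along that chain.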
Architecture Overview 2
Open and Close Service Model
– Open a service object, use it many times, close it when done
– Better performance by avoiding setup costs per operation
ICU Threading Model
– Multiple service objects in use simultaneously
with same or different attributes
– Large resources shared in read-only cache
– Compatible with Java threading model
Architecture Overview 3
Data Driven Services
– Customize at build-time or run-time
– Interchange with other platforms: same results on each
– Rule-based
• Collation, Word-breaks, Transforms
– Pattern-based
• Date/Time/Number/Message formatting
– Table-based
• Character Conversion
Architecture Overview – ICU4C
Simple Error Handling
– Thread safe
– Works in C and C++
C/C++ subset for portability
Version Management
– Multiple versions of ICU4C in the same process memory space
– Data and library versioning
String Buffer Management
– Preflighting and overflow protection
Flexible
– Allows Loading and Unloading ICU4C libraries
– Runtime settable memory allocation and mutex functions
Architecture Overview – ICU4J
Supplement for Java
Core globalization (no character conversion or regular expressions)
– We do supply complex text support for Sun
Modularized: products may add only the functionality they need
Usually a drop-in replacement for JDK functionality
– Changing the import statements is usually all that is needed
ICU4J: Supplement for Java
CLDR (Common Locale Data Repository)
– More fully supported locales than Java
Up-to-date globalization: standards-compliant; latest Unicode
– Supplementary characters (GB 18030, JIS X 0213, HKSCS)
• Java 5 adds handling of supplementary characters
– Full properties – JDK has only a fraction
– Unicode Collation Algorithm
– Local calendars (Islamic, Japan,…); more time zone localizations
– Currencies, String Search, Internationalized Domain Names
– Transforms: Case, Scripts, Normalization
Much shorter release cycle and quicker support for Unicode standard
Unicode Text Handling
C (UTF-16)
– UChar*: null-terminated or with length
C++ (UTF-16)
– UnicodeString: full featured string class
Java (UTF-16BE)
– Uses java.lang.String and adds utilities
All handle supplementary characters
– Required for the GB 18030 and JIS X 0213 repertoires
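A minimal sketch of why supplementary characters need care in UTF-16 text, using only `java.lang.String`. The class and method names are made up for this example; U+20000 is chosen as a sample CJK ideograph outside the Basic Multilingual Plane.

```java
final class SupplementaryCharSketch {
    // Counts Unicode code points, not UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+20000, which UTF-16 stores as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x20000));

        System.out.println(s.length());    // code units: 3
        System.out.println(codePoints(s)); // code points: 2

        // Iterating by code point keeps the surrogate pair together.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Code that indexes by `char` sees three units here, so per-character logic has to step by code point.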
Unicode Text Handling 2
All Unicode 4.1 properties
– direct API
• values, names, enumerations
– UnicodeSet
• Fast, compact set operations (union, intersection, …)
• Pattern-based (both Perl & POSIX syntax for properties)
– \p{greek} vs. [:greek:]
• All properties:
– [\p{lowercase}-[a-z]]
– [\p{greek} & \p{uppercase}]
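The two set expressions above can be approximated with the JDK's `java.util.regex` character classes. This is an analog, not the UnicodeSet API: the JDK spells intersection as `&&`, expresses difference as intersection with a negated class, and uses an `Is` prefix for scripts and binary properties. The class and field names are illustrative only.

```java
import java.util.regex.Pattern;

final class PropertySetSketch {
    // JDK spelling of [\p{greek} & \p{uppercase}]
    static final Pattern GREEK_UPPER =
            Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    // JDK spelling of [\p{lowercase}-[a-z]]: lowercase, minus ASCII a-z
    static final Pattern LOWER_NOT_ASCII =
            Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // True when the whole string is a single member of the set.
    static boolean contains(Pattern set, String ch) {
        return set.matcher(ch).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(GREEK_UPPER, "\u03A9"));     // capital omega
        System.out.println(contains(GREEK_UPPER, "\u03C9"));     // small omega
        System.out.println(contains(LOWER_NOT_ASCII, "\u00E9")); // e with acute
        System.out.println(contains(LOWER_NOT_ASCII, "a"));
    }
}
```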
Recent Additions
Conforms to CLDR 1.3
– Adds many translated terms for languages, scripts, regions,
currencies, and time zones.
– Access to more CLDR items
Support for Unicode interpretation of POSIX properties
Charset detection API (ICU4J only)
Better modularization for memory-constrained environments (ICU4C only)
Character Set Conversion
Precise alias information:
– When you ask for “Shift-JIS”, you can request the precise definition by
platform (e.g. Windows, IBM, Java, … )
Buffer management
– API automatically handles characters that cross buffers
– Can provide offset mappings between byte buffer and UChar buffer
Runtime customizations allowed for:
– illegal sequences
– undefined characters
Unicode Text Compression – SCSU, BOCU-1
Consistent conversion results across platforms
You can use more character sets at runtime or build time
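The idea of choosing at run time what happens to illegal sequences can be sketched with the JDK's `java.nio.charset` decoder, which offers a comparable (though not identical) policy switch. This is not the ICU converter API; `decodeUtf8` is a hypothetical helper, and wrapping the failure in an unchecked exception is a choice made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecodePolicySketch {
    // Decodes UTF-8 bytes, applying the given policy to malformed input
    // and unmappable characters.
    static String decodeUtf8(byte[] bytes, CodingErrorAction onError) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(onError)
                    .onUnmappableCharacter(onError)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Only reached with CodingErrorAction.REPORT.
            throw new IllegalArgumentException("illegal byte sequence", e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {0x41, (byte) 0xFF, 0x42};
        System.out.println(decodeUtf8(bad, CodingErrorAction.REPLACE)); // U+FFFD inserted
        System.out.println(decodeUtf8(bad, CodingErrorAction.IGNORE));  // bad byte dropped
    }
}
```

With `CodingErrorAction.REPORT` the same input is rejected, which suits callers that must not silently alter data.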
Collation: Sorting, Searching and Matching
Fast international comparison for string search; fully UCA
compliant
– Compressed sort keys, optimized string comparison, sublinear
string search
– Incremental sortkeys used for radix sorting
Precise binary sortkey stability over time (library versioning)
Fully data driven
– Many common rules provided
Runtime and build time rule customizations
– strength, normalization, upper vs. lowercase first, ignore
punctuation, numeric, …
– Only delta from UCA is needed for rule customization
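A small sketch of the strength setting and of sort keys, written against `java.text.Collator`. The slides describe ICU4J as usually a drop-in replacement for such JDK classes, but that swap is not shown or tested here, and the JDK collator does not offer every option listed above. The class name is illustrative.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

final class CollationStrengthSketch {
    // Opens a collator once; callers are expected to reuse it for many comparisons.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(strength);
        return c;
    }

    public static void main(String[] args) {
        Collator primary = collator(Collator.PRIMARY);
        Collator tertiary = collator(Collator.TERTIARY);

        // Primary strength looks at base letters only, so case is ignored.
        System.out.println(primary.compare("abc", "ABC") == 0);
        // Tertiary strength also distinguishes case.
        System.out.println(tertiary.compare("abc", "ABC") != 0);

        // Sort keys can be computed once and then compared as binary data.
        CollationKey k1 = tertiary.getCollationKey("abc");
        CollationKey k2 = tertiary.getCollationKey("abd");
        System.out.println(k1.compareTo(k2) < 0);
    }
}
```

Sort keys pay off when the same strings are compared repeatedly, for example when they are stored in an index.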
Calendar & Time Zones
International Calendars – Islamic, Buddhist, Hebrew, Japanese
– Required for correct presentation of dates in some countries
Olson timezone support with localizations
Recent Additions:
– Many more time zone localizations
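To make the point about non-Gregorian calendars concrete, here is a sketch using the JDK's `java.time.chrono` classes rather than the ICU Calendar API. It covers only the Buddhist and Japanese calendars; the helper names are made up for this example.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

final class CalendarSketch {
    // Year in the Thai Buddhist calendar for an ISO (Gregorian) date.
    static int buddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    // Year within the current Japanese imperial era for an ISO date.
    static int japaneseEraYear(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        LocalDate conference = LocalDate.of(2005, 9, 1);
        System.out.println(buddhistYear(conference));    // 2548
        System.out.println(japaneseEraYear(conference)); // 17 (Heisei 17)
        System.out.println(JapaneseDate.from(conference).getEra());
    }
}
```

The same instant therefore has different year numbers depending on the calendar a locale expects.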
Formatting
Date & time: 8 formats per locale by default
Messages
– Completely localizable, plural support
Numbers & currencies
– Scientific Notation, Spelled-out (checks, etc.)
– Full Orthogonal Currency support
• The same currency (INR) formatted in Hindi, in English (Rs. 1,234.57), and in German (Rs. 1.234,57)
Recent Additions
– List available currencies API
– Short and stand-alone month/day names
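A minimal sketch of locale-dependent number formatting and of setting the currency independently of the locale, using `java.text.NumberFormat` as a stand-in for the ICU formatters. The class name is illustrative. Currency symbols and their placement depend on the locale data shipped with the runtime, so the currency lines are printed without an expected result.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

final class NumberFormatSketch {
    // Formats a plain number with the grouping and decimal marks of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // The locale decides the pattern; the currency is chosen separately.
    static String formatCurrency(double value, Locale locale, String isoCode) {
        NumberFormat nf = NumberFormat.getCurrencyInstance(locale);
        nf.setCurrency(Currency.getInstance(isoCode));
        return nf.format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234.57, Locale.US));      // 1,234.57
        System.out.println(formatNumber(1234.57, Locale.GERMANY)); // 1.234,57

        // Same currency (INR), three locales; output varies with locale data.
        System.out.println(formatCurrency(1234.57, Locale.forLanguageTag("hi-IN"), "INR"));
        System.out.println(formatCurrency(1234.57, Locale.US, "INR"));
        System.out.println(formatCurrency(1234.57, Locale.GERMANY, "INR"));
    }
}
```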
Transforms
Unicode Normalization
– Highly optimized for performance
– performance utilities: concatenation, detection, comparison
Casing (upper, lower, title, folding)
General Transforms
– Script transliterations
– Half-width/Full-width, Hex, etc.
– Chain transforms together, filter source characters
– Rule-based, customizable at runtime.
String Prep: NFS, Internationalized Domain Names (IDN)
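Normalization and casing can be sketched with `java.text.Normalizer` and `String` case mapping from the JDK. The general transforms (script transliteration, chaining, filters) have no JDK counterpart and are not shown. The class and method names are illustrative.

```java
import java.text.Normalizer;
import java.util.Locale;

final class TransformSketch {
    // Composes base letters and combining marks where a composed form exists.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Two strings are canonically equivalent when their normalized forms match.
    static boolean canonicallyEqual(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301"; // a + combining acute accent
        String composed = "\u00E1";    // precomposed a with acute

        System.out.println(decomposed.equals(composed));           // false
        System.out.println(canonicallyEqual(decomposed, composed)); // true
        // Detection is cheaper than normalizing when text is already in form.
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC)); // true

        // Case mapping can change length and can depend on the locale.
        System.out.println("stra\u00DFe".toUpperCase(Locale.ROOT));         // STRASSE
        System.out.println("i".toUpperCase(Locale.forLanguageTag("tr")));  // dotted capital I
    }
}
```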
Segmentation: word, line & sentence
Fast state-table implementation
Customizable
– Rule-based – customizable at runtime
– Special customizations, e.g. Thai
Recent Additions:
– Uses new UText API
• Discontinuous text
• Buffering
• Usable with UTF-8, UTF-16 or UTF-32
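Word segmentation can be sketched with the JDK's `java.text.BreakIterator`, used here as a stand-in for the ICU break iterators; it works on UTF-16 text only and has nothing like UText. The `words` helper and its rule of keeping only segments that start with a letter or digit are choices made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {
    // Returns the segments between word boundaries that look like words,
    // skipping whitespace and punctuation segments.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```

Line and sentence segmentation follow the same pattern with `getLineInstance` and `getSentenceInstance`.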
Unicode Regular Expressions
Full Regex Implementation
– C/C++ only: Java 1.4 has its own package (though not as powerful)
All Unicode 4.1 Properties
– Supported through UnicodeSet
Good performance
– Competitive with non-Unicode regex
Complex-text layout engine
Glyph processing, positioning & adjustment
– Ligature substitution, contextual forms, kerning, accent placement, bidi scripts, etc.
Support for:
– Information for drawing
– Caret Display
– Hit Testing
– Selection Highlighting
– Caret Movement
– Layout Metrics
– Line Break
– Canonical Equivalence: a + ´ or á
Recent Additions:
– Support for more complex scripts
References
ICU main site:
– http://www.ibm.com/software/globalization/icu/
– Links to
• Download ICU
• User Guide, Technical FAQ, Support, Bug Reports, Demonstrations
ICU support site:
– http://icu.sourceforge.net/
Unicode Consortium
– http://www.unicode.org/
• Unicode glossary, Unicode character database
Questions and Answers