Language / Locale IDs - International Components for Unicode
Language / Locale IDs
M. Davis, IBM
A. Phillips, webMethods
Language
"A shprakh iz a diyalekt mit an armey un a flot" ("A language is a dialect with an army and a navy"). Max Weinreich (Joshua Fishman), 1945.
The written form is the most important for computers
Does include “culturally-specific” formatting (as we’ll see later)
Does not include currency, time-zone, seat-assignment, etc.
Language Tags: Two Needs
Identification: announce that this text is American, Northern Californian, casual, PG-13 English
Filtering/Matching: accept any English, any French, Swiss German, …
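The filtering/matching need can be illustrated with the prefix matching that RFC 3066 describes for language ranges: a range matches a tag if the two are equal, or if the range is a prefix of the tag followed by "-". The following is a minimal sketch only; the function names `matches` and `filter_tags` are hypothetical, and representing "Swiss German" as the range `de-CH` is an assumption made for illustration.

```python
def matches(language_range: str, tag: str) -> bool:
    """Prefix matching in the style of RFC 3066 language ranges (case-insensitive)."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":                      # wildcard range accepts every tag
        return True
    # Equal, or the range is a prefix that ends exactly at a subtag boundary.
    return t == r or t.startswith(r + "-")


def filter_tags(ranges, tags):
    """Keep the tags accepted by at least one range, preserving input order."""
    return [t for t in tags if any(matches(r, t) for r in ranges)]


# "Any English, any French, Swiss German" (de-CH is an assumed encoding).
accepted = filter_tags(["en", "fr", "de-CH"],
                       ["en-US", "en", "fr-CA", "de-DE", "de-CH", "eng"])
# accepted == ["en-US", "en", "fr-CA", "de-CH"]
```

Note that "eng" is not accepted by the range "en": the prefix has to end at a hyphen, so matching works on whole subtags rather than on raw characters.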
Background
RFC 1766
RFC 3066
Used in XML, HTML, …
Used both as language ID (narrow sense) and locale ID
RFC 3066bis
Successor to 3066
For use in XML, HTML, Java, …
Addresses limitations of 3066
First Draft: 2003/10
Latest Draft: 2004/2
http://www.ietf.org/internet-drafts/draft-phillips-langtags-01.txt
Final Draft: 2004/5??
Main Goals
Maintain backward compatibility (so that all previous codes would remain valid)
Reduce the need for large numbers of registrations
Provide a more formal structure to allow parsing into subtags even where software does not have the latest registrations
Provide stability in the face of potential instability in ISO 639, 3166, and 15924 codes (demonstrated instability in the case of ISO 3166)
Allow for external extension mechanisms
Expressiveness
Allows ISO 15924 script code subtags and allows them to be used generatively
Adds the concept of a variant subtag and allows variants to be used generatively
Allows use of UN M.49 codes: es-419 = "Spanish, Latin America"
Changes the IANA language tag registry to a language subtag registry
Stability
Allows backward/forward compatible parsing
Defines a process for handling reuse of values by ISO 639, ISO 15924, and ISO 3166 in the event that they register a previously used value for a new purpose
Private Use & Extensions
Adds an extension mechanism which does not require registration to use.
Defines the private use tags in ISO 639, ISO 15924, and ISO 3166 as the mechanism for creating private use language, script, and region subtags respectively.
Defines a syntax for private use variant subtags which can be used without registration.
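As a rough illustration of the private use mechanism, the sketch below tests whether a subtag falls in the ranges the three ISO standards set aside for private use: qaa to qtz in ISO 639-2, Qaaa to Qabx in ISO 15924, and AA, QM to QZ, XA to XZ, ZZ in ISO 3166-1. The function names are hypothetical. The rule used for private use variants (a leading "x") is an assumption inferred from the "xsouthern" example in this deck, not taken from the draft text.

```python
def _letters(s: str, n: int) -> bool:
    """True if s consists of exactly n ASCII letters."""
    return len(s) == n and s.isascii() and s.isalpha()


def is_private_language(subtag: str) -> bool:
    s = subtag.lower()
    return _letters(s, 3) and "qaa" <= s <= "qtz"        # ISO 639-2 local-use range


def is_private_script(subtag: str) -> bool:
    s = subtag.lower()
    return _letters(s, 4) and "qaaa" <= s <= "qabx"      # ISO 15924 private-use range


def is_private_region(subtag: str) -> bool:
    s = subtag.upper()
    return _letters(s, 2) and (
        s in ("AA", "ZZ") or "QM" <= s <= "QZ" or "XA" <= s <= "XZ"
    )                                                    # ISO 3166-1 user-assigned codes


def is_private_variant(subtag: str) -> bool:
    # ASSUMPTION: private use variants are ordinary 5-15 character variants
    # that begin with "x", as the deck's "xsouthern" example suggests.
    s = subtag.lower()
    return s.isascii() and s.isalnum() and 5 <= len(s) <= 15 and s.startswith("x")
```

Because the comparisons run on strings of a fixed length made only of ASCII letters, plain lexicographic ordering is enough to express each range.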
Structure (Bizarro BNF)
tag             = lang *["-s-" extlang] ["-" script] ["-" region] *["-" variant] ["-x" extensions]
                =/ "x" extensions               ; private use
                =/ grandfathered-registrations
lang            = 2*3 ALPHA                     ; shortest ISO 639
                =/ registered-lang
registered-lang = 5*15 alphanum
Structure II
script          = 4 ALPHA                       ; ISO 15924
region          = 2 ALPHA                       ; ISO 3166
                =/ 3 DIGIT                      ; UN country #
variant         = 5*15 alphanum
extensions      = 1* ("-" value)
value           = 1*31 alphanum
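Because each kind of subtag has a distinct length and shape, a tag can be split into its parts without consulting any registry. The following is a minimal sketch of such a parser, written against the grammar on these two slides rather than against the draft itself. The names `LangTag` and `parse_tag` are hypothetical. The slides do not define extlang, so the sketch assumes three letters (as in "zh-s-min"), and it does not handle grandfathered registrations.

```python
from typing import NamedTuple, Optional, Tuple


class LangTag(NamedTuple):
    lang: Optional[str]              # None for a whole-tag private use ("x-...")
    extlangs: Tuple[str, ...]
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]
    extensions: Tuple[str, ...]


def parse_tag(tag: str) -> LangTag:
    """Split a tag into subtags using only the length/shape rules of the grammar."""
    parts = tag.split("-")
    if not all(p.isascii() and p.isalnum() for p in parts):
        raise ValueError(f"empty or non-alphanumeric subtag in {tag!r}")

    def values(rest):
        # extensions = 1*("-" value), value = 1*31 alphanum
        if not rest or any(len(v) > 31 for v in rest):
            raise ValueError(f"bad extension values in {tag!r}")
        return tuple(rest)

    if parts[0].lower() == "x":      # "x" extensions ; private use
        return LangTag(None, (), None, None, (), values(parts[1:]))

    lang = parts[0]                  # 2*3 ALPHA, or registered-lang = 5*15 alphanum
    if not ((lang.isalpha() and 2 <= len(lang) <= 3) or 5 <= len(lang) <= 15):
        raise ValueError(f"bad language subtag {lang!r}")
    i, n = 1, len(parts)

    extlangs = []
    # ASSUMPTION: extlang is three letters, as in the example "zh-s-min".
    while i + 1 < n and parts[i].lower() == "s":
        if not (parts[i + 1].isalpha() and len(parts[i + 1]) == 3):
            raise ValueError(f"bad extended language subtag {parts[i + 1]!r}")
        extlangs.append(parts[i + 1])
        i += 2

    script = None
    if i < n and len(parts[i]) == 4 and parts[i].isalpha():
        script, i = parts[i], i + 1

    region = None
    if i < n and ((len(parts[i]) == 2 and parts[i].isalpha())
                  or (len(parts[i]) == 3 and parts[i].isdigit())):
        region, i = parts[i], i + 1

    variants = []
    while i < n and 5 <= len(parts[i]) <= 15:
        variants.append(parts[i])
        i += 1

    extensions = ()
    if i < n and parts[i].lower() == "x":
        extensions, i = values(parts[i + 1:]), n

    if i < n:                        # anything left over is out of order or malformed
        raise ValueError(f"unexpected subtag {parts[i]!r} in {tag!r}")
    return LangTag(lang, tuple(extlangs), script, region, tuple(variants), extensions)
```

For instance, `parse_tag("zh-s-min-s-nan-Hant-CN")` yields language "zh", extended languages "min" and "nan", script "Hant", and region "CN", while `parse_tag("de-891-DE")` raises `ValueError` because a second region-shaped subtag has no place left in the grammar.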
Examples I
Simple language code:
  de (German)
  fr (French)
  ja (Japanese)
Language code plus script code:
  zh-Hant (Traditional Chinese)
  en-Latn (English written in Latin script)
  sr-Cyrl (Serbian written with Cyrillic script)
Language-Region:
  de-DE (German for Germany)
  zh-SG (Chinese for Singapore)
  cs-CS (Czech for Czechoslovakia)
  sr-891 (Serbian for Serbia and Montenegro)
Examples II
Language-Script-Region:
  zh-Hans-CN (Simplified Chinese for the PRC)
  sr-Latn-891 (Serbian, Latin script, Serbia & Monte.)
Language-Script-Region-Variant:
  en-Latn-US-boont (Boontling dialect of English)
Other Mixtures:
  zh-CN (Chinese for the PRC)
  en-boont (Boontling dialect of English)
Examples III
Extension mechanism:
  x-valley-girl
  de-CH-x-phonebook
  az-Arab-x-AZE-derbend
Extended language subtags:
  zh-s-min
  zh-s-min-s-nan-Hant-CN
Private Use tags:
  qaa-Qaaa-QM-xsouthern (all private tags)
  de-Qaaa (German, with a private script)
  de-Latn-QM (German, Latin-script, private region)
  de-Qaaa-DE (German, private script, for Germany)
Examples IV
Some Invalid Tags:
  de-891-DE (two region tags)
  a-DE (use of a single character tag)
  zh-xsouthern-DE (private use variant followed by another tag)
Locale
Different interpretations:
  narrow = language
  broad = any user preferences
[Diagram labels: user preferences, language]
Language vs Locale
Which are English?

"Theatre Center News: The date of the last version of this document was 2003 年 3 月 20 日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

"Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."

"Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
Summary
Improved version of 3066
Used for language and locale (in narrow sense)
Addresses issues:
  Script distinctions
  Parseability
  Extensions
  …
References
Latest Public Draft
http://www.ietf.org/internet-drafts/draft-phillips-langtags-01.txt
Working Draft (HTML version)
http://www.inter-locale.com/ID/draft-phillips-langtags-02.html
Language Code Issues (+ Locales)
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/language_code_issues.html