Language / Locale IDs - International Components for Unicode
M. Davis, IBM; A. Phillips, webMethods
"A shprakh iz a diyalekt mit an armey un a flot" ("A language is a dialect with an army and a navy") Max Weinreich (Joshua Fishman), 1945.
The written form is the most important for computers
Does include “culturally-specific” formatting (as we’ll see later)
Does not include currency, time-zone, seat-assignment, etc.
Language Tags: Two Needs
Identification: announce that this text is American, Northern Californian, casual, PG-13 English
Filtering/Matching: accept any English, any French, Swiss German, …
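The filtering/matching need can be illustrated with a minimal sketch of prefix matching in the style of RFC 3066, where a requested range matches a tag if it equals the tag or is a prefix of it ending at a "-" boundary. The function names `matches` and `filter_tags` are hypothetical and chosen only for this illustration.

```python
def matches(language_range: str, tag: str) -> bool:
    """Return True if `tag` falls under `language_range` (prefix matching)."""
    r = language_range.lower()
    t = tag.lower()
    if r == "*":  # wildcard range accepts every tag
        return True
    # Exact match, or the range is a prefix that ends on a subtag boundary.
    return t == r or t.startswith(r + "-")


def filter_tags(accepted_ranges, tags):
    """Keep only the tags that match at least one accepted range."""
    return [t for t in tags if any(matches(r, t) for r in accepted_ranges)]


if __name__ == "__main__":
    # "Any English, any French, Swiss German"
    accepted = ["en", "fr", "de-CH"]
    print(filter_tags(accepted, ["en-US", "fr", "de-DE", "de-CH-x-phonebook"]))
```

Matching on subtag boundaries rather than on raw string prefixes keeps a range such as "en" from accidentally accepting an unrelated tag that merely begins with the same letters.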
RFC 1766, RFC 3066
Used in XML, HTML, …
Used both as language ID (narrow sense)
Successor to 3066
For use in XML, HTML, Java, …
Addresses limitations of 3066
First Draft: 2003/10
Latest Draft: 2004/2
http://www.ietf.org/internet-drafts/draft-phillips-langtags-01.txt
Final Draft: 2004/5??
Maintain backward compatibility (so that all previous codes would remain valid)
Reduce the need for large numbers of registrations
Provide a more formal structure to allow parsing into subtags even where software does not have the latest registrations
Provide stability in the face of potential instability in ISO 639, 3166, and 15924 codes (demonstrated instability in the case of ISO 3166)
Allow for external extension mechanisms
Allows ISO 15924 script code subtags and allows them to be used generatively.
Adds the concept of a variant subtag and allows variants to be used generatively.
Allows use of UN M49 codes: es-419 = “Spanish, Latin America”
Changes the IANA language tag registry to a language
Allows backward/forward compatible parsing
Defines a process for handling reuse of values by ISO 639, ISO 15924, and ISO 3166 in the event that they register a previously used value for a new purpose.
Private Use & Extensions
Adds an extension mechanism which does not require registration to use.
Defines the private use tags in ISO 639, ISO 15924, and ISO 3166 as the mechanism for creating private use language, script, and region subtags respectively.
Defines a syntax for private use variant subtags which can be used without registration.
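As a rough illustration of the private use mechanism, the sketch below checks whether a subtag falls in the ranges that the ISO standards reserve for private use. The ranges used here (qaa to qtz in ISO 639-2; Qaaa to Qabx in ISO 15924; AA, QM to QZ, XA to XZ, and ZZ in ISO 3166-1) are stated from those standards rather than from the slides, and the function names are hypothetical.

```python
def is_private_language(subtag: str) -> bool:
    """ISO 639-2 reserves qaa..qtz for private use."""
    s = subtag.lower()
    return len(s) == 3 and s.isascii() and s.isalpha() and "qaa" <= s <= "qtz"


def is_private_script(subtag: str) -> bool:
    """ISO 15924 reserves Qaaa..Qabx for private use."""
    s = subtag.lower()
    return len(s) == 4 and s.isascii() and s.isalpha() and "qaaa" <= s <= "qabx"


def is_private_region(subtag: str) -> bool:
    """ISO 3166-1 user-assigned alpha-2 codes: AA, QM..QZ, XA..XZ, ZZ."""
    s = subtag.upper()
    if not (len(s) == 2 and s.isascii() and s.isalpha()):
        return False
    return s in ("AA", "ZZ") or "QM" <= s <= "QZ" or "XA" <= s <= "XZ"


if __name__ == "__main__":
    # The subtags of a fully private tag such as qaa-Qaaa-QM
    print(is_private_language("qaa"), is_private_script("Qaaa"), is_private_region("QM"))
```

Because these ranges are reserved by the ISO standards themselves, software can recognize a private use subtag without consulting any registry.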
Structure (Bizarro BNF)
tag             = lang *["-s-" extlang] ["-" script] ["-" region] *["-" variant] ["-x" extensions]
                =/ "x" extensions              ; private use
                =/ grandfathered-registrations
lang            = 2*3 ALPHA                    ; shortest ISO 639
                =/ registered-lang
registered-lang = 5*15 alphanum
script          = 4 ALPHA                      ; ISO 15924
region          = 2 ALPHA                      ; ISO 3166
                =/ 3 DIGIT                     ; UN country #
variant         = 5*15 alphanum
extensions      = 1* ("-" value)
value           = 1*31 alphanum
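The grammar above is regular enough that a tag can be split into subtags with a single regular expression and no registry lookup. The following is a minimal sketch under stated assumptions, not a reference implementation: `extlang` is not defined on the slide and is assumed here to be three letters, grandfathered registrations are not handled, and the name `parse_tag` is hypothetical.

```python
import re
from typing import Optional

_LANGTAG = re.compile(
    r"""
    (?P<lang>[A-Za-z]{2,3}|[A-Za-z0-9]{5,15})   # ISO 639 code or registered-lang
    (?P<extlang>(?:-s-[A-Za-z]{3})*)            # assumption: extlang is 3 letters
    (?:-(?P<script>[A-Za-z]{4}))?               # ISO 15924 script
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?      # ISO 3166 code or UN country number
    (?P<variants>(?:-[A-Za-z0-9]{5,15})*)       # zero or more variants
    (?:-x(?P<ext>(?:-[A-Za-z0-9]{1,31})+))?     # optional -x extensions
    """,
    re.VERBOSE,
)
# Wholly private tags: "x" followed by one or more values.
_PRIVATE = re.compile(r"[xX](?P<ext>(?:-[A-Za-z0-9]{1,31})+)")


def parse_tag(tag: str) -> Optional[dict]:
    """Split a tag into subtags, or return None if it does not fit the grammar."""
    m = _PRIVATE.fullmatch(tag)
    if m:
        return {"private_use": True, "extensions": m.group("ext").split("-")[1:]}
    m = _LANGTAG.fullmatch(tag)
    if m is None:
        return None
    return {
        "private_use": False,
        "lang": m.group("lang"),
        "extlang": re.findall(r"-s-([A-Za-z]{3})", m.group("extlang")),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": m.group("variants").split("-")[1:],
        "extensions": (m.group("ext") or "").split("-")[1:],
    }


if __name__ == "__main__":
    for example in ["zh-s-min-s-nan-Hant-CN", "de-CH-x-phonebook", "de-891-DE"]:
        print(example, "->", parse_tag(example))
```

Because each subtag type has a distinct length or character class, the position and shape of a subtag are enough to classify it, which is what allows parsing without the latest registrations.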
Simple language code:
  de (German)
  fr (French)
  ja (Japanese)
Language code plus script code:
  zh-Hant (Traditional Chinese)
  en-Latn (English written in Latin script)
  sr-Cyrl (Serbian written with Cyrillic script)
Language-Region:
  de-DE (German for Germany)
  zh-SG (Chinese for Singapore)
  cs-CS (Czech for Czechoslovakia)
  sr-891 (Serbian for Serbia and Montenegro)
Language-Script-Region:
  zh-Hans-CN (Simplified Chinese for the PRC)
  sr-Latn-891 (Serbian, Latin script, Serbia and Montenegro)
Language-Script-Region-Variant:
  en-Latn-US-boont (Boontling dialect of English)
Other mixtures:
  zh-CN (Chinese for the PRC)
  en-boont (Boontling dialect of English)
Extension mechanism:
  x-valley-girl
  de-CH-x-phonebook
  az-Arab-x-AZE-derbend
Extended language subtags:
  zh-s-min
  zh-s-min-s-nan-Hant-CN
Private use tags:
  qaa-Qaaa-QM-xsouthern (all private tags)
  de-Qaaa (German, with a private script)
  de-Latn-QM (German, Latin script, private region)
  de-Qaaa-DE (German, private script, for Germany)
Some invalid tags:
  de-891-DE (two region tags)
  a-DE (use of a single character tag)
  zh-xsouthern-DE (private use variant followed by another tag)
Different interpretations:
  narrow = language
  broad = any user preferences
Language vs Locale
Which are English?
"Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."
"Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
"Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."
Improved version of 3066
Used for language, locale (in narrow sense)
Addresses issues:
  Script distinctions
  Parseability
  Extensions
  …
Latest Public Draft: http://www.ietf.org/internet-drafts/draft-phillips-langtags-01.txt
Working Draft (HTML version): http://www.inter-locale.com/ID/draft-phillips-langtags-02.html
Language Code Issues (+ Locales): http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/language_code_issues.html