Language / Locale IDs - International Components for Unicode


Language / Locale IDs

M. Davis, IBM
A. Phillips, webMethods

Language

"A shprakh iz a diyalekt mit an armey un a flot" ("A language is a dialect with an army and a navy") Max Weinreich (Joshua Fishman), 1945.

The written form is the most important for computers
Does include “culturally-specific” formatting (as we’ll see later)
Does not include currency, time-zone, seat-assignment, etc.

Language Tags: Two Needs

Identification
  Announce that this text is American, Northern Californian, Casual, PG-13 English
Filtering/Matching
  Accept Any English, Any French, Swiss German, …
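The filtering/matching need can be illustrated with a minimal sketch. It assumes a simple prefix rule in the style of RFC 3066 language ranges: a range matches a tag if the two are equal, or if the range is a prefix of the tag that ends on a subtag boundary. The names `matches` and `filter_tags` are hypothetical and exist only for this illustration.

```python
def matches(language_range: str, tag: str) -> bool:
    """Return True if `tag` falls within `language_range`.

    Assumed rule: case-insensitive equality, or the range is a prefix
    of the tag that is followed immediately by "-". "*" matches anything.
    """
    rng = language_range.lower()
    t = tag.lower()
    if rng == "*":
        return True
    return t == rng or t.startswith(rng + "-")


def filter_tags(language_range, tags):
    """Keep only the tags that fall within the requested range."""
    return [t for t in tags if matches(language_range, t)]


content = ["en", "en-US", "en-GB", "fr-CA", "de-CH", "de-DE", "eng"]
print(filter_tags("en", content))     # any English: ['en', 'en-US', 'en-GB']
print(filter_tags("fr", content))     # any French: ['fr-CA']
print(filter_tags("de-CH", content))  # Swiss German: ['de-CH']
```

Matching on whole subtags rather than raw characters is what keeps "eng" from being accepted by the range "en".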

Background

RFC 1766
RFC 3066
Used in XML, HTML, …
Used both as language ID (narrow sense) and locale ID

RFC 3066bis

Successor to 3066
For use in XML, HTML, Java, …
Addresses limitations of 3066
First Draft: 2003/10
Latest Draft: 2004/2
  http://www.ietf.org/internet-drafts/draft-phillips-langtags-01.txt
Final Draft: 2004/5??

Main Goals

Maintain backward compatibility (so that all previous codes would remain valid)
Reduce the need for large numbers of registrations
Provide a more formal structure to allow parsing into subtags even where software does not have the latest registrations
Provide stability in the face of potential instability in ISO 639, 3166, and 15924 codes (demonstrated instability in the case of ISO 3166)
Allow for external extension mechanisms

Expressiveness

Allows ISO 15924 script code subtags and allows them to be used generatively.
Adds the concept of a variant subtag and allows variants to be used generatively.
Allows use of UN M49 codes:
  es-419 = "Spanish, Latin America"
Changes the IANA language tag registry to a language subtag registry

Stability

Allows backward/forward compatible parsing
Defines a process for handling reuse of values by ISO 639, ISO 15924, and ISO 3166 in the event that they register a previously used value for a new purpose.
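One way to read "backward/forward compatible parsing" is that the role of a subtag can be guessed from its shape alone, without consulting a registry. The sketch below assumes the subtag shapes given on the Structure slides (4 letters for a script, 2 letters or 3 digits for a region, 5 to 15 alphanumerics for a variant) and applies only to subtags after the initial language subtag. The function name `classify_subtag` is hypothetical.

```python
def classify_subtag(subtag: str) -> str:
    """Guess the role of a non-initial subtag from its length and character class."""
    if not (subtag.isascii() and subtag.isalnum()):
        return "invalid"
    if len(subtag) == 4 and subtag.isalpha():
        return "script"    # assumed shape: ISO 15924 code
    if (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit()):
        return "region"    # assumed shape: ISO 3166 code or UN numeric code
    if 5 <= len(subtag) <= 15:
        return "variant"
    return "unknown"


# Classify everything after the language subtag, with no registry lookup.
for tag in ["sr-Latn-891", "en-Latn-US-boont", "es-419"]:
    rest = tag.split("-")[1:]
    print(tag, [classify_subtag(s) for s in rest])
```

Because the decision depends only on length and character class, software written today could still split a tag that uses a code registered later.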

Private Use & Extensions

Adds an extension mechanism which does not require registration to use.
Defines the private use tags in ISO 639, ISO 15924, and ISO 3166 as the mechanism for creating private use language, script, and region subtags respectively
Defines a syntax for private use variant subtags which can be used without registration.
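As a rough illustration of private use subtags, the sketch below checks whether a subtag falls in a private use range. The ranges are an assumption based on the ISO standards themselves, not on these slides: qaa to qtz for ISO 639-2 languages, Qaaa to Qabx for ISO 15924 scripts, and AA, QM to QZ, XA to XZ, and ZZ for ISO 3166 regions. All function names are hypothetical.

```python
def is_private_language(subtag: str) -> bool:
    """True for the assumed ISO 639-2 private use range qaa..qtz."""
    s = subtag.lower()
    return len(s) == 3 and s.isascii() and s.isalpha() and "qaa" <= s <= "qtz"


def is_private_script(subtag: str) -> bool:
    """True for the assumed ISO 15924 private use range Qaaa..Qabx."""
    s = subtag.lower()
    return len(s) == 4 and s.isascii() and s.isalpha() and "qaaa" <= s <= "qabx"


def is_private_region(subtag: str) -> bool:
    """True for the assumed ISO 3166 user-assigned codes AA, QM..QZ, XA..XZ, ZZ."""
    s = subtag.upper()
    if not (len(s) == 2 and s.isascii() and s.isalpha()):
        return False
    return s in ("AA", "ZZ") or "QM" <= s <= "QZ" or "XA" <= s <= "XZ"


print(is_private_language("qaa"), is_private_language("de"))  # True False
print(is_private_script("Qaaa"), is_private_script("Latn"))   # True False
print(is_private_region("QM"), is_private_region("DE"))       # True False
```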

Structure (Bizarro BNF)

tag             = lang *["-s-" extlang] ["-" script] ["-" region] *["-" variant] ["-x" extensions]
                =/ "x" extensions                ; private use
                =/ grandfathered-registrations
lang            = 2*3 ALPHA                      ; shortest ISO 639
                =/ registered-lang
registered-lang = 5*15 alphanum

Structure II

script          = 4 ALPHA                        ; ISO 15924
region          = 2 ALPHA                        ; ISO 3166
                =/ 3 DIGIT                       ; UN country #
variant         = 5*15 alphanum
extensions      = 1* ("-" value)
value           = 1*31 alphanum
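The grammar above can be turned into a small parser. What follows is a minimal sketch, not the draft's reference implementation. It rests on a few assumptions: `extlang` is not defined on the slides, so it is treated here as 2 to 3 letters; grandfathered registrations are left out because they need a lookup table; and the names `TAG_RE`, `PRIVATE_RE`, and `parse_tag` are hypothetical.

```python
import re

FLAGS = re.IGNORECASE | re.ASCII | re.VERBOSE

# One regular expression per top-level alternative of `tag`.
TAG_RE = re.compile(r"""
    (?P<lang>[a-z]{2,3}|[a-z0-9]{5,15})         # lang / registered-lang
    (?P<extlangs>(?:-s-[a-z]{2,3})*)            # *["-s-" extlang]  (shape assumed)
    (?:-(?P<script>[a-z]{4}))?                  # ["-" script]
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?         # ["-" region]
    (?P<variants>(?:-[a-z0-9]{5,15})*)          # *["-" variant]
    (?:-x(?P<extensions>(?:-[a-z0-9]{1,31})+))? # ["-x" extensions]
""", FLAGS)

PRIVATE_RE = re.compile(r"x(?P<extensions>(?:-[a-z0-9]{1,31})+)", FLAGS)


def parse_tag(tag: str):
    """Split a tag into its subtags, or return None if it does not fit the grammar."""
    m = PRIVATE_RE.fullmatch(tag)
    if m:
        return {"private_use": m.group("extensions").split("-")[1:]}
    m = TAG_RE.fullmatch(tag)
    if m is None:
        return None
    ext = m.group("extensions")
    return {
        "lang": m.group("lang"),
        "extlangs": [s for s in m.group("extlangs").split("-s-") if s],
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": [v for v in m.group("variants").split("-") if v],
        "extensions": ext.split("-")[1:] if ext else [],
    }


for tag in ["zh-s-min-s-nan-Hant-CN", "az-Arab-x-AZE-derbend", "x-valley-girl", "de-891-DE"]:
    print(tag, "->", parse_tag(tag))
```

Because the subtags have distinct shapes and a fixed order, a single anchored match is enough: a tag such as de-891-DE is rejected because nothing in the grammar can follow a region except a variant or an extension.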

Examples I

Simple language code:
  de (German)
  fr (French)
  ja (Japanese)
Language code plus Script code:
  zh-Hant (Traditional Chinese)
  en-Latn (English written in Latin script)
  sr-Cyrl (Serbian written with Cyrillic script)
Language-Region:
  de-DE (German for Germany)
  zh-SG (Chinese for Singapore)
  cs-CS (Czech for Czechoslovakia)
  sr-891 (Serbian for Serbia and Montenegro)

Examples II

Language-Script-Region:
  zh-Hans-CN (Simplified Chinese for the PRC)
  sr-Latn-891 (Serbian, Latin script, Serbia & Montenegro)
Language-Script-Region-Variant:
  en-Latn-US-boont (Boontling dialect of English)
Other Mixtures:
  zh-CN (Chinese for the PRC)
  en-boont (Boontling dialect of English)

Examples III

Extension mechanism:
  x-valley-girl
  de-CH-x-phonebook
  az-Arab-x-AZE-derbend
Extended language subtags:
  zh-s-min
  zh-s-min-s-nan-Hant-CN
Private Use tags:
  qaa-Qaaa-QM-xsouthern (all private tags)
  de-Qaaa (German, with a private script)
  de-Latn-QM (German, Latin-script, private region)
  de-Qaaa-DE (German, private script, for Germany)

Examples IV

Some Invalid Tags:
  de-891-DE (two region tags)
  a-DE (use of a single character tag)
  zh-xsouthern-DE (private use variant followed by another tag)

Locale

Different interpretations
  narrow = language
  broad = any user-preferences
[Figure labels: user preferences, language]

Language vs Locale

  

Which are English?

"Theatre Center News: The date of the last version of this document was 2003 年 3 月 20 日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."

"Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."

"Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric Mader."

Summary

Improved version of 3066
Used for language and locale (in narrow sense)
Addresses Issues
  Script Distinctions
  Parseability
  Extensions
  …

References

Latest Public Draft
  http://www.ietf.org/internet-drafts/draft-phillips-langtags-01.txt
Working Draft (HTML version)
  http://www.inter-locale.com/ID/draft-phillips-langtags-02.html
Language Code Issues (+ Locales)
  http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/language_code_issues.html

Q&A