International Components for Unicode
StringPrep: Unicode in Network Protocols
Ram Viswanadha, Globalization Center of Competency, IBM, San José
29th Unicode Conference March, 2006 © 2006 IBM Corporation
Agenda
Problem
StringPrep
Profiles of StringPrep
IDNA
StringPrep in ICU
Demo
29th Unicode Conference, San Francisco, CA March, 2006 © 2005 IBM Corporation
Terminology
Domain Name
DNS: Domain Name System
URL: Uniform Resource Locator
NFKC: Normalization Form KC, compatibility composition, e.g.: ﬃ → ffi. The ffi ligature (U+FB03) is decomposed in NFKC, whereas it is not in NFC.
BiDi: Bi-Directional code points
Why Internationalize?
Users like to use their language/script in
– domain names
– URLs
– e-mail
Not everyone can read/write English
How to internationalize?
– Use Unicode
Domain Name: Examples
www.日本平.jp
www.ハンドボールサムズ.com
www.färgbolaget.nu
www.bücher.de
www.brændendekærlighed.com
理容ナカムラ.com
あーるいん.com
Domain Name: Parts
WWW.IBM.COM
Domain Label: WWW, IBM, COM
Label Separator: .
DNS Protocol Requirements
Minimum impact on DNS protocol's interoperability.
Minimum number of changes
Maximum backwards compatibility
Deterministic resolution of domain names
Single global namespace
Problems
Unicode contains a large number of
– Visually identical characters, e.g.: і → i
– Confusable characters, e.g.: O → 0
– Control codes, e.g.: U+0080-U+009F
– Non-breaking spaces, e.g.: U+00A0
– Invisible characters, e.g.: U+200B
– Private Use Characters, e.g.: U+E000-U+F8FF
– Punctuation, e.g.: U+002E
– Symbols, e.g.: U+2097
Example
www.arnaudléhors.com
www.arnaudléhors.com
Example: Contd.
www.arnaudl\u00e9hors.com
www.arnaudle\u0301hors.com
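The two spellings above render identically but compare unequal until they are normalized. The following is a minimal sketch using the standard `java.text.Normalizer`; the class name `NormalizationDemo` and the helper `nfkc` are illustrative names, not part of ICU or of the original slides. Note that the JDK normalizer follows the Unicode version shipped with the runtime, not Unicode 3.2 as StringPrep requires.

```java
import java.text.Normalizer;

public class NormalizationDemo {
    // Hypothetical helper: returns the NFKC form of a string.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Hypothetical helper: returns the NFC form, for comparison.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "www.arnaudl\u00e9hors.com"; // e-acute as one code point
        String decomposed = "www.arnaudle\u0301hors.com"; // e + combining acute

        // The raw strings differ code point by code point.
        System.out.println(precomposed.equals(decomposed)); // false

        // After normalization they are the same string.
        System.out.println(nfkc(precomposed).equals(nfkc(decomposed))); // true

        // The ffi ligature (U+FB03) is decomposed by NFKC but kept by NFC.
        System.out.println(nfkc("\uFB03").equals("ffi")); // true
        System.out.println(nfc("\uFB03").equals("\uFB03")); // true
    }
}
```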
StringPrep
Defined by RFC 3454
Framework for preparing Unicode strings
Based on Unicode Version 3.2
Specifies rules for handling
– Unassigned code points
– Visually similar sequences
– Prohibited code points
– BiDi code points
StringPrep Tables
Unassigned Table
Mapping Tables
– Case mapping, e.g.: \u0041 → \u0061
– Deletion, e.g.: \u00AD, \u200C, \u200D
Prohibited Tables
e.g.: LRM, RLM, etc.
StringPrep Algorithm
1. Map
2. Normalize
3. Prohibit
4. Check BiDi
NamePrep
Defined by RFC 3491
Profile
1. Map: Include all code point mappings specified in the StringPrep.
2. Normalize: Normalize the output of step 1 according to NFKC.
3. Prohibit: Prohibit all code points specified as prohibited in StringPrep except for the space (U+0020) code point from the output of step 2.
4. Check BiDi: Check for bidirectional code points and process according to the rules specified in StringPrep.
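The four steps can be sketched as a small pipeline. This is a toy illustration only, not the ICU implementation and not a conforming NamePrep: the class `ToyStringPrep` and its methods are hypothetical, the mapping and prohibited sets contain just a few of the examples named on the slides, plain lowercasing stands in for the RFC 3454 case-folding table, and the JDK normalizer is not pinned to Unicode 3.2.

```java
import java.text.Normalizer;

public class ToyStringPrep {
    // Step 1 (Map): drop the "deletion" examples and lowercase everything else.
    // Assumption: Character.toLowerCase stands in for the real case-folding table.
    static String map(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp == 0x00AD || cp == 0x200C || cp == 0x200D) {
                return; // mapped to nothing
            }
            sb.appendCodePoint(Character.toLowerCase(cp));
        });
        return sb.toString();
    }

    // Step 3 (Prohibit): a tiny subset of the prohibited tables.
    static boolean isProhibited(int cp) {
        return cp == 0x200E || cp == 0x200F       // LRM, RLM
            || (cp >= 0x0080 && cp <= 0x009F)     // C1 control codes
            || (cp >= 0xE000 && cp <= 0xF8FF);    // BMP private use
    }

    // True for right-to-left (R or AL) code points.
    static boolean isRandAL(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    static String prepare(String s) {
        // Steps 1 and 2: map, then normalize with NFKC.
        String out = Normalizer.normalize(map(s), Normalizer.Form.NFKC);
        int[] cps = out.codePoints().toArray();

        boolean hasRandAL = false;
        boolean hasL = false;
        for (int cp : cps) {
            // Step 3: reject prohibited code points.
            if (isProhibited(cp)) {
                throw new IllegalArgumentException(
                    String.format("prohibited code point U+%04X", cp));
            }
            hasRandAL |= isRandAL(cp);
            hasL |= Character.getDirectionality(cp)
                    == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
        }

        // Step 4 (Check BiDi): a string with R/AL characters must not contain
        // L characters and must start and end with an R/AL character.
        if (hasRandAL
                && (hasL || !isRandAL(cps[0]) || !isRandAL(cps[cps.length - 1]))) {
            throw new IllegalArgumentException("BiDi check failed");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prepare("www.B\u00DCCHER.de")); // lowercased label
        System.out.println(prepare("so\u00ADft"));          // soft hyphen removed
    }
}
```

Keeping the steps in this fixed order matters: normalization can introduce code points that the prohibit step must still see.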
Punycode
Defined by RFC 3492
Algorithm to convert prepared Unicode Strings to ASCII Compatible Encoding (ACE)
Complete
Unique
Reversible
Preserves case information
Internationalized Domain Names in Applications
Defined by RFC 3490
Prescribes algorithm for using Unicode in DNS
NamePrep : Profile of StringPrep for use in DNS
Punycode : Algorithm for converting prepared Unicode strings to ACE
IDNA: ToASCII
www.यहलोगहिन्दीक्योंनहींबोलसकतेहैं.com
ToASCII
www.xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd.com
IDNA: ToUnicode
www.xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd.com
ToUnicode
www.यहलोगहिन्दीक्योंनहींबोलसकतेहैं.com
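The ToASCII and ToUnicode operations can be tried with the JDK's own `java.net.IDN` class, which implements RFC 3490 independently of ICU. The sketch below is an illustration, not the demo from the talk; the class name `IdnRoundTrip` and its two wrapper methods are hypothetical, and it uses the shorter bücher.de example from the earlier slide.

```java
import java.net.IDN;

public class IdnRoundTrip {
    // Hypothetical wrapper: NamePrep + Punycode, label by label.
    static String toAscii(String domain) {
        return IDN.toASCII(domain);
    }

    // Hypothetical wrapper: decodes any "xn--" labels back to Unicode.
    static String toUnicode(String domain) {
        return IDN.toUnicode(domain);
    }

    public static void main(String[] args) {
        String unicodeName = "www.b\u00fccher.de";

        // Only the label containing non-ASCII characters is converted.
        String ace = toAscii(unicodeName);
        System.out.println(ace); // www.xn--bcher-kva.de

        // The conversion is reversible.
        System.out.println(toUnicode(ace).equals(unicodeName)); // true
    }
}
```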
IDNA: Details
Label separators treated as ASCII Full Stop (U+002E)
– Ideographic Full Stop (U+3002)
– Full Width Full Stop (U+FF0E)
– Half Width Ideographic Full Stop (U+FF61)
Unassigned code points
Letter-Digit-Hyphen (LDH) code points
STD 3 ASCII Rules
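Two of these details can be observed with the JDK's `java.net.IDN`, used here only as a convenient RFC 3490 implementation. The sketch is an illustration under that assumption; the class `IdnaDetails` and the helper `acceptedWithStd3` are hypothetical names. `IDN.USE_STD3_ASCII_RULES` is the JDK flag that enforces the letter-digit-hyphen restriction; `IDN.ALLOW_UNASSIGNED` is the corresponding flag for unassigned code points and is not exercised here.

```java
import java.net.IDN;

public class IdnaDetails {
    // Hypothetical helper: true if the name passes ToASCII with STD 3 rules on.
    static boolean acceptedWithStd3(String domain) {
        try {
            IDN.toASCII(domain, IDN.USE_STD3_ASCII_RULES);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // e.g. a non-LDH ASCII character such as '_'
        }
    }

    public static void main(String[] args) {
        // Ideographic and full-width full stops act as label separators.
        System.out.println(IDN.toASCII("www\u3002example\uFF0Ecom")); // www.example.com

        // Without the STD 3 flag an underscore passes through unchanged.
        System.out.println(IDN.toASCII("my_host.com")); // my_host.com

        // With the flag, only letters, digits, and hyphens are accepted.
        System.out.println(acceptedWithStd3("my_host.com")); // false
        System.out.println(acceptedWithStd3("my-host.com")); // true
    }
}
```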
News article
Network File System (NFS) Version 4 Profiles
Defined by RFC 3530
nfs4_cs_prep Profile
– Profile for file and path name strings
nfs4_cis_prep Profile
– Profile for NFS server names
nfs4_mixed_prep Profile
– Profile for strings in the Access Control Entries
XMPP Profiles
Defined by RFC 3920
ResourcePrep
– Profile for resource identifiers within XMPP, e.g.: node@domain/resource
NodePrep
– Profile for node identifiers within XMPP, e.g.: node@domain/resource
Other Profiles
SASLPrep
– RFC 4013
– Profile for usernames and passwords
MIB Profile
– RFC 4011
– Profile for Management Information Base
iSCSI Names
– RFC 3722
– Profile for internationalized iSCSI names
StringPrep Service in ICU
Data driven
Customizable
Portable
C & Java
Procedure for producing a StringPrep profile data file
1. Run filterRFC3454.pl
2. Run gensprep
3. Open the profile
Design Considerations
StringPrep profile characteristics:
– Prescribe a fixed set of tables
– Normalization On/Off
– Check BiDi On/Off
– StringPrep algorithm fixed.
– Profiles once defined are fixed.
Performance critical
Implementation
Accurate Unicode 3.2 Normalization algorithm
Access to Unicode 3.2 Character Properties
StringPrep algorithm
Sizes
Name    Size
IDNA    21 K
NFSCIS  21 K
NFSCSI  20 K
NFSMXP  14 K
NFSMXS  21 K
NFSCSS  13 K
C
#include "unicode/usprep.h"

UErrorCode status = U_ZERO_ERROR;
UParseError parseError;
/* open the StringPrep profile */
UStringPrepProfile *nameprep = usprep_open("/usr/joe/mydata", "nfscsi", &status);
if (U_FAILURE(status)) {
    /* handle the error */
}
/* prepare the string for use according to the rules specified in the profile */
int32_t retLen = usprep_prepare(nameprep, src, srcLength, dest, destCapacity,
                                USPREP_ALLOW_UNASSIGNED, &parseError, &status);
/* close the profile */
usprep_close(nameprep);
Java
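import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import com.ibm.icu.text.StringPrep;
import com.ibm.icu.text.StringPrepParseException;
import com.ibm.icu.text.UCharacterIterator;

public final class NFSCSIStringPrep {
    // not final: the profile is loaded in the constructor
    private static StringPrep nfscsi = null;

    public NFSCSIStringPrep() {
        try {
            // load the binary profile data produced by gensprep
            InputStream nfscsiFile = NFSCSIStringPrep.class.getResourceAsStream("nfscsi.spp");
            nfscsi = new StringPrep(nfscsiFile);
            nfscsiFile.close();
        } catch (IOException e) {
            // handle the exception
        }
    }

    // decode UTF-8 bytes, prepare the string with the given profile, re-encode
    private static byte[] prepare(byte[] src, StringPrep prep)
            throws StringPrepParseException, UnsupportedEncodingException {
        String s = new String(src, "UTF-8");
        UCharacterIterator iter = UCharacterIterator.getInstance(s);
        StringBuffer out = prep.prepare(iter, StringPrep.DEFAULT);
        return out.toString().getBytes("UTF-8");
    }
}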
UTrie – BMP Access Diagram
[Diagram: the upper bits (UPPER_WIDTH) of a BMP code point select an entry in the index array, which points to a block in the data array; the lower bits (LOWER_WIDTH, extracted with LOWER_MASK) select the data within that block.]
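The BMP lookup described by the diagram is a two-stage table: upper bits pick a block, lower bits pick an entry inside it. The sketch below shows that general technique only. It is not ICU's UTrie code; the class `TwoStageTable`, the 5-bit lower width, and the block allocation strategy are all assumptions made for illustration.

```java
import java.util.Arrays;

public class TwoStageTable {
    static final int LOWER_WIDTH = 5;                      // assumed width
    static final int LOWER_MASK = (1 << LOWER_WIDTH) - 1;  // low 5 bits
    static final int BLOCK_SIZE = 1 << LOWER_WIDTH;

    // Index array: one entry per block of BMP code points, holding the
    // offset of that block inside the data array.
    final int[] index = new int[0x10000 >> LOWER_WIDTH];

    // Data array: block 0 is the shared all-zero default block.
    int[] data = new int[BLOCK_SIZE];

    // Stores a value for a BMP code point, allocating a block on first use.
    void set(int codePoint, int value) {
        int upper = codePoint >> LOWER_WIDTH;
        if (index[upper] == 0) {
            index[upper] = data.length;
            data = Arrays.copyOf(data, data.length + BLOCK_SIZE);
        }
        data[index[upper] + (codePoint & LOWER_MASK)] = value;
    }

    // Two array reads per lookup, regardless of how sparse the data is.
    int get(int codePoint) {
        return data[index[codePoint >> LOWER_WIDTH] + (codePoint & LOWER_MASK)];
    }

    public static void main(String[] args) {
        TwoStageTable table = new TwoStageTable();
        table.set(0x00AD, 7);
        System.out.println(table.get(0x00AD)); // 7
        System.out.println(table.get(0x4E00)); // 0, served by the shared block
    }
}
```

Because every untouched range shares one default block, a table with values for only a few code points stays small, which fits the profile sizes listed earlier.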
UTrie – Supplementary Access Diagram
[Diagram: supplementary code point lookup via the lead surrogate, the folded trie, and the trail surrogate, ending at the final data.]
BMP code points access same as with single-index
StringPrep Data Structure
Indexes: Array that contains size info of trie & mapping table, options, version numbers etc.
UTrie: 16 Bit Trie word
– Bit 0: ON: The code point is prohibited
– Bit 1: ON: The value in the next 14 bits is an index into the mapping table; OFF: The value in the next 14 bits is a delta value from the code point
– Bits 2..15: The value in these bits is an index into the mapping table or the delta value from the code point
– Values greater than 0xFFF0 specify the state of the code point.
Mapping Table: Contains the code point(s) that a single code point maps to
Demo
http://www.ibm.com/software/globalization/icu/demo/domain
Conclusion
Unicode can be used in Network protocols
ASCII compatibility can be achieved
StringPrep applicable for all network protocols
ICU provides StringPrep services
References
Moving Towards Internationalized Domain Names
– Paul E. Hoffman
A Tangled Web: Issues of I18N, Domain Names, and the Other Internet protocols
– RFC 2825
Multilingual Domain Name Race
– Suzanne Topping
Q & A