StringPrep: Unicode in Network Protocols


International Components for Unicode


Ram Viswanadha

Globalization Center of Competency, San José

IBM

29th Unicode Conference, March 2006 © 2006 IBM Corporation


Agenda

Problem

StringPrep

Profiles of StringPrep

IDNA

StringPrep in ICU

Demo

29th Unicode Conference, San Francisco, CA, March 2006 © 2005 IBM Corporation


Terminology

Domain Name

DNS: Domain Name System

URL: Uniform Resource Locator

NFKC: Normalization Form KC (compatibility decomposition followed by canonical composition), e.g.: ﬃ → ffi. The ffi ligature (U+FB03) is decomposed in NFKC, whereas it is left unchanged in NFC.

BiDi: Bi-Directional code points
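The NFKC/NFC difference described above can be checked with the JDK's own normalizer. This is a minimal sketch that uses java.text.Normalizer rather than ICU; the class name NfkcLigature and its helper methods are made up for illustration.

```java
import java.text.Normalizer;

final class NfkcLigature {
    // Compatibility decomposition followed by canonical composition.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Canonical decomposition followed by canonical composition.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB03"; // LATIN SMALL LIGATURE FFI
        System.out.println(nfkc(ligature));                  // prints "ffi"
        System.out.println(nfc(ligature).equals(ligature));  // prints "true"
    }
}
```

NFKC expands the single ligature code point into three ASCII letters, while NFC leaves it alone. This is why the StringPrep profiles shown later in the deck use NFKC.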


Why Internationalize?

Users like to use their language/script in

– domain names
– URLs
– e-mail

Not everyone can read/write English

How to internationalize?

– Use Unicode

Domain Name: Examples

www.日本平.jp

www.ハンドボールサムズ.com

www.färgbolaget.nu

www.bücher.de

www.brændendekærlighed.com

理容ナカムラ.com

あーるいん.com


Domain Name: Parts

WWW.IBM.COM

– Domain Label: WWW, IBM, COM
– Label Separator: .

DNS Protocol Requirements

Minimum impact on DNS protocol's interoperability.

Minimum number of changes

Maximum backwards compatibility

Deterministic resolution of domain names

Single global namespace


Problems

Unicode contains a large number of

– Visually identical characters, e.g.: і → i
– Confusable characters, e.g.: O → 0
– Control codes, e.g.: U+0080-U+009F
– Non-breaking spaces, e.g.: U+00A0
– Invisible characters, e.g.: U+200B
– Private Use characters, e.g.: U+E000-U+F8FF
– Punctuation, e.g.: U+002E
– Symbols, e.g.: U+2097
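A few of the problem classes listed above can be observed with the JDK's java.lang.Character properties. The sketch below is illustrative only: the class name ProblemCharacters and the describe helper are invented for this example, and the JDK's property data tracks a newer Unicode version than the 3.2 tables StringPrep is based on.

```java
final class ProblemCharacters {
    // Returns a coarse label for the Unicode general category of a code point.
    static String describe(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONTROL:
                return "control";
            case Character.PRIVATE_USE:
                return "private use";
            case Character.SPACE_SEPARATOR:
                return "space";
            case Character.UNASSIGNED:
                return "unassigned";
            default:
                return Character.isLetter(cp) ? "letter" : "other";
        }
    }

    public static void main(String[] args) {
        String cyrillic = "\u0456"; // CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
        String latin = "i";         // LATIN SMALL LETTER I
        // The two strings look the same but are different code points.
        System.out.println(cyrillic.equals(latin)); // prints "false"
        System.out.println(describe(0x0085));       // prints "control"
        System.out.println(describe(0x00A0));       // prints "space"
        System.out.println(describe(0xE000));       // prints "private use"
    }
}
```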


Example

www.arnaudléhors.com

www.arnaudléhors.com


Example: Contd.

www.arnaudl\u00e9hors.com

www.arnaudle\u0301hors.com
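The two spellings above render identically but compare as different strings until they are normalized. A minimal sketch using the JDK's java.text.Normalizer; the class name EquivalentNames and its methods are hypothetical.

```java
import java.text.Normalizer;

final class EquivalentNames {
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Two names are treated as the same if their normalized forms match.
    static boolean sameName(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String precomposed = "www.arnaudl\u00e9hors.com";  // e with acute as one code point
        String decomposed = "www.arnaudle\u0301hors.com";  // e followed by a combining acute
        System.out.println(precomposed.equals(decomposed));   // prints "false"
        System.out.println(sameName(precomposed, decomposed)); // prints "true"
    }
}
```

Without a normalization step, a resolver doing a plain comparison would treat these as two different domain names.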


StringPrep

Defined by RFC 3454

Framework for preparing Unicode strings

Based on Unicode Version 3.2

Specifies rules for handling

– Unassigned code points
– Visually similar sequences
– Prohibited code points
– BiDi code points

StringPrep Tables

Unassigned Table

Mapping Tables

– Case mapping, e.g.: \u0041 → \u0061
– Deletion, e.g.: \u00AD, \u200C, \u200D

Prohibited Tables

e.g.: LRM, RLM, etc.


StringPrep Algorithm

1. Map

2. Normalize

3. Prohibit

4. Check BiDi
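The four steps can be sketched as a small pipeline. This is a toy illustration, not an implementation of the RFC 3454 tables: the class ToyStringPrep is hypothetical, the mapping step only deletes the three code points named on the tables slide and lower-cases, the prohibited set is a small sample (controls, private use, LRM, RLM), and the BiDi check is a simplified reading of the RFC 3454 rule based on the JDK's directionality property.

```java
import java.text.Normalizer;

final class ToyStringPrep {
    // Step 1: Map. Delete a few code points and lower-case the rest
    // (a stand-in for the real case-folding table).
    static String map(String in) {
        StringBuilder out = new StringBuilder();
        in.codePoints().forEach(cp -> {
            if (cp == 0x00AD || cp == 0x200C || cp == 0x200D) {
                return; // mapped to nothing
            }
            out.appendCodePoint(Character.toLowerCase(cp));
        });
        return out.toString();
    }

    // Step 2: Normalize with NFKC.
    static String normalize(String in) {
        return Normalizer.normalize(in, Normalizer.Form.NFKC);
    }

    // Step 3: Prohibit. Reject a small sample set of code points.
    static void prohibit(String in) {
        in.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            if (type == Character.CONTROL || type == Character.PRIVATE_USE
                    || cp == 0x200E || cp == 0x200F) {
                throw new IllegalArgumentException(
                        String.format("prohibited code point U+%04X", cp));
            }
        });
    }

    static boolean isRightToLeft(int cp) {
        byte d = Character.getDirectionality(cp);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Step 4: Check BiDi. If any right-to-left character is present, no
    // left-to-right character may appear and both ends must be right-to-left.
    static void checkBidi(String in) {
        if (in.codePoints().noneMatch(ToyStringPrep::isRightToLeft)) {
            return;
        }
        boolean hasLtr = in.codePoints().anyMatch(
                cp -> Character.getDirectionality(cp) == Character.DIRECTIONALITY_LEFT_TO_RIGHT);
        boolean endsOk = isRightToLeft(in.codePointAt(0))
                && isRightToLeft(in.codePointBefore(in.length()));
        if (hasLtr || !endsOk) {
            throw new IllegalArgumentException("BiDi check failed");
        }
    }

    static String prepare(String in) {
        String s = normalize(map(in));
        prohibit(s);
        checkBidi(s);
        return s;
    }

    public static void main(String[] args) {
        // Soft hyphen is deleted, case is folded, u + diaeresis is composed.
        System.out.println(prepare("B\u00ADu\u0308cher").equals("b\u00FCcher")); // prints "true"
    }
}
```

The order matters: prohibition and the BiDi check run on the mapped and normalized output, so a prohibited code point produced by normalization would still be caught.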


NamePrep

 

Defined by RFC 3491

Profile:

1. Map: Include all code point mappings specified in StringPrep.

2. Normalize: Normalize the output of step 1 according to NFKC.

3. Prohibit: Prohibit all code points specified as prohibited in StringPrep, except for the space (U+0020) code point, in the output of step 2.

4. Check BiDi: Check for bidirectional code points and process according to the rules specified in StringPrep.


Punycode

Defined by RFC 3492

Algorithm to convert prepared Unicode Strings to ASCII Compatible Encoding (ACE)

Complete

Unique

Reversible

Preserves case information


Internationalized Domain Names in Applications

Defined by RFC 3490

Prescribes algorithm for using Unicode in DNS

NamePrep : Profile of StringPrep for use in DNS

Punycode : Algorithm for converting prepared Unicode strings to ACE


IDNA: ToASCII

www.यहलोगहिन्दीक्योंनहींबोलसकतेहैं.com

ToASCII

www.xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd.com

IDNA: ToUnicode

www.xn--i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd.com

ToUnicode

www.यहलोगहिन्दीक्योंनहींबोलसकतेहैं.com
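The ToASCII and ToUnicode operations can be tried without ICU through the JDK's java.net.IDN class, which implements RFC 3490 (including NamePrep and Punycode). A minimal sketch; the wrapper class IdnaRoundTrip is hypothetical, and the domain is one of the examples from earlier in the deck.

```java
import java.net.IDN;

final class IdnaRoundTrip {
    // Converts each label to its ASCII Compatible Encoding where needed.
    static String toAce(String domain) {
        return IDN.toASCII(domain);
    }

    // Converts ACE labels (those starting with "xn--") back to Unicode.
    static String fromAce(String domain) {
        return IDN.toUnicode(domain);
    }

    public static void main(String[] args) {
        String ace = toAce("www.b\u00FCcher.de");
        System.out.println(ace);          // prints "www.xn--bcher-kva.de"
        System.out.println(fromAce(ace)); // prints "www.bücher.de"
    }
}
```

Labels that are already plain ASCII, such as "www" and "de", pass through unchanged; only the internationalized label gains the "xn--" prefix.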


IDNA: Details

Label separators: ASCII Full Stop (U+002E) and its equivalents

– Ideographic Full Stop (U+3002)
– Full Width Full Stop (U+FF0E)
– Half Width Ideographic Full Stop (U+FF61)

Unassigned code points

Letter-Digit-Hyphen (LDH) code points

STD 3 ASCII Rules
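Two of these details, the alternative label separators and the STD 3 ASCII rules, can be observed with the JDK's java.net.IDN. This is a sketch under the assumption that the JDK's behavior is a fair stand-in for the ICU service discussed in this deck; the class IdnaDetails and its methods are hypothetical.

```java
import java.net.IDN;

final class IdnaDetails {
    // Runs ToASCII, optionally enforcing the STD 3 ASCII rules (letters, digits, hyphen only).
    static String toAce(String domain, boolean useStd3Rules) {
        return IDN.toASCII(domain, useStd3Rules ? IDN.USE_STD3_ASCII_RULES : 0);
    }

    // Returns true if the domain survives ToASCII with STD 3 rules enabled.
    static boolean acceptedWithStd3(String domain) {
        try {
            toAce(domain, true);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Ideographic and full-width full stops are treated as label separators.
        System.out.println(toAce("www\u3002ibm\uFF0Ecom", false)); // prints "www.ibm.com"
        // An underscore is not a letter, digit, or hyphen.
        System.out.println(acceptedWithStd3("a_b.com"));  // prints "false"
        System.out.println(acceptedWithStd3("a-b.com"));  // prints "true"
    }
}
```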


News article


Network File System (NFS) Version 4 Profiles

Defined by RFC 3530

nfs4_cs_prep Profile

– Profile for file and path name strings 

nfs4_cis_prep Profile

– Profile for NFS server names 

nfs4_mixed_prep Profile

– Profile for strings in the Access Control Entries

XMPP Profiles

Defined by RFC 3920

ResourcePrep

– Profile for resource identifiers within XMPP, e.g.: node@domain/resource

NodePrep

– Profile for node identifiers within XMPP, e.g.: node@domain/resource

Other Profiles

SASLPrep

– RFC 4013
– Profile for usernames and passwords

MIB Profile

– RFC 4011
– Profile for the Management Information Base

iSCSI Names

– RFC 3722
– Profile for internationalized iSCSI names

StringPrep Service in ICU

    

Data driven

Customizable

Portable

C & Java

Procedure for producing a StringPrep profile data file:

1. Run filterRFC3454.pl

2. Run gensprep

3. Open the profile

Design Considerations

StringPrep profile characteristics:

– Prescribe a fixed set of tables
– Normalization On/Off
– Check BiDi On/Off
– StringPrep algorithm is fixed
– Profiles, once defined, are fixed

Performance critical


Implementation

Accurate Unicode 3.2 Normalization algorithm

Access to Unicode 3.2 Character Properties

StringPrep algorithm


Sizes

Name      Size
IDNA      21 K
NFSCIS    21 K
NFSCSI    20 K
NFSMXP    14 K
NFSMXS    21 K
NFSCSS    13 K


C

#include <unicode/usprep.h>

UErrorCode status = U_ZERO_ERROR;
UParseError parseError;

/* open the StringPrep profile */
UStringPrepProfile* nameprep = usprep_open("/usr/joe/mydata", "nfscsi", &status);
if (U_FAILURE(status)) {
    /* handle the error */
}

/* prepare the string for use according to the rules specified in the profile;
   src, srcLength, dest and destCapacity are supplied by the caller */
int32_t retLen = usprep_prepare(nameprep, src, srcLength, dest, destCapacity,
                                USPREP_ALLOW_UNASSIGNED, &parseError, &status);

/* close the profile */
usprep_close(nameprep);


Java

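import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import com.ibm.icu.text.StringPrep;
import com.ibm.icu.text.StringPrepParseException;
import com.ibm.icu.text.UCharacterIterator;

public class NFSCSIStringPrep {
    // not final: the profile is loaded in the constructor
    private static StringPrep nfscsi = null;

    public NFSCSIStringPrep() {
        try {
            // load the binary profile data from the class path
            InputStream nfscsiFile = NFSCSIStringPrep.class.getResourceAsStream("nfscsi.spp");
            nfscsi = new StringPrep(nfscsiFile);
            nfscsiFile.close();
        } catch (IOException e) {
            // handle the exception
        }
    }

    // decode UTF-8, prepare the string with the given profile, encode back to UTF-8
    private static byte[] prepare(byte[] src, StringPrep prep)
            throws StringPrepParseException, UnsupportedEncodingException {
        String s = new String(src, "UTF-8");
        UCharacterIterator iter = UCharacterIterator.getInstance(s);
        StringBuffer out = prep.prepare(iter, StringPrep.DEFAULT);
        return out.toString().getBytes("UTF-8");
    }
}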

UTrie – BMP Access Diagram

[Diagram: bits 15..0 of a BMP code point are split into an upper part (UPPER_WIDTH bits) and a lower part (LOWER_WIDTH bits). The upper part selects an entry in the index array, which points to a block in the data array; the lower part, extracted with LOWER_MASK, selects the data item within that block.]
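The index-then-data lookup shown in the BMP diagram can be sketched as a generic two-stage table. This is not ICU's UTrie code: the class TwoStageTrie is hypothetical, and the LOWER_WIDTH of 5 bits is an arbitrary value chosen for the example because the slides do not give the actual widths.

```java
import java.util.Arrays;

final class TwoStageTrie {
    static final int LOWER_WIDTH = 5;                      // assumed; bits used inside a block
    static final int LOWER_MASK = (1 << LOWER_WIDTH) - 1;  // extracts the lower bits
    static final int BLOCK_SIZE = 1 << LOWER_WIDTH;

    // One index entry per block of BMP code points; 0 points at the shared all-zero block.
    private final int[] index = new int[0x10000 >> LOWER_WIDTH];
    private char[] data = new char[BLOCK_SIZE];            // block 0: default values
    private int used = BLOCK_SIZE;

    // Stores a 16-bit value for a BMP code point, allocating a data block on first use.
    void set(int bmpCodePoint, char value) {
        int upper = bmpCodePoint >> LOWER_WIDTH;
        if (index[upper] == 0) {
            data = Arrays.copyOf(data, used + BLOCK_SIZE);
            index[upper] = used;
            used += BLOCK_SIZE;
        }
        data[index[upper] + (bmpCodePoint & LOWER_MASK)] = value;
    }

    // Lookup: upper bits pick the block through the index, lower bits pick the item.
    char get(int bmpCodePoint) {
        return data[index[bmpCodePoint >> LOWER_WIDTH] + (bmpCodePoint & LOWER_MASK)];
    }

    public static void main(String[] args) {
        TwoStageTrie trie = new TwoStageTrie();
        trie.set(0x00AD, (char) 7);
        System.out.println((int) trie.get(0x00AD)); // prints "7"
        System.out.println((int) trie.get(0x0041)); // prints "0"
    }
}
```

Because every block without data shares the same default block, sparse tables such as the StringPrep profiles stay small while a lookup costs only two array reads.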

UTrie – Supplementary Access Diagram

[Diagram: access path for supplementary code points. Labels: lead surrogate (110110..), "Has data for surrogate block?" (yes/no), folded trie, lead surrogate data, trail surrogate (110111..), pseudo code point, index + data, final data. Note on the diagram: BMP code points are accessed the same way as with a single index.]


StringPrep Data Structure

Indexes: Array that contains size info of the trie & mapping table, options, version numbers, etc.

UTrie: 16-bit trie word

– Bit 0 ON: The code point is prohibited
– Bit 1 ON: The value in the next 14 bits is an index into the mapping table
– Bit 1 OFF: The value in the next 14 bits is a delta value from the code point
– Bits 2..15: The value in these bits is an index into the mapping table or the delta value from the code point
– Values greater than 0xFFF0 specify the state of the code point

Mapping Table: Contains the code point(s) that a single code point maps to

Demo

http://www.ibm.com/software/globalization/icu/demo/domain


Conclusion

Unicode can be used in Network protocols

ASCII compatibility can be achieved

StringPrep applicable for all network protocols

ICU provides StringPrep services


Q & A
