Surrogate Support in Microsoft Products
Michael S. Kaplan
Software Design Engineer
Trigeminal Software, Inc.
26 April 2001
Surrogate Support in Microsoft
Products, IUC 18 (Hong Kong)
What are surrogates?
"a coded character representation for a
single abstract character that consists of
a sequence of two code units, where the
first unit of the pair is a high surrogate
and the second is a low surrogate"
High/low surrogate?
High: U+D800 - U+DBFF
Low: U+DC00 - U+DFFF
Terminology:
– "surrogate pair" preferred over "surrogate
character"
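As a minimal sketch of the two ranges above, the following Python helpers classify a 16-bit code unit. The function names are illustrative only and are not part of any Microsoft API.

```python
# Surrogate ranges as listed on the slide.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def is_surrogate_pair(first: int, second: int) -> bool:
    """A well-formed pair is a high surrogate followed by a low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```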
Conversion example #1
Example #1:
– The first character in the Surrogate range (D800, DC00) as
UTF-32:
1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
3. Concatenate 0000000000+0000000000 = x0000
4. Add x10000
Result: U+10000. This makes sense, since the first character in the
Surrogate range follows immediately after the last character in
the 16-bit Unicode range (U+FFFF)
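The steps of this example can be sketched in Python as follows. The function name is hypothetical; the arithmetic is the one described above (take the lower ten bits of each unit, concatenate them, add x10000).

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point (UTF-32 value)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    high_bits = high & 0x3FF  # lower ten bits of the high surrogate
    low_bits = low & 0x3FF    # lower ten bits of the low surrogate
    # Concatenate the two ten-bit pieces, then add x10000.
    return ((high_bits << 10) | low_bits) + 0x10000


print(hex(surrogate_pair_to_code_point(0xD800, 0xDC00)))  # 0x10000
```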
Conversion example #2
Example #2.
– You have a Unicode character such as U+2040A (a CJK
character in Plane2) and wish to encode it in UTF-16
1. Subtract x10000 - Result: 1040A
2. Split into two ten-bit pieces: 0001000001 0000001010
3. Add 1101100000000000 (D800) to the high 10 bits
piece (0001000001) - Result: 1101100001000001 (D841)
4. Add 1101110000000000 (DC00) to the low 10 bits
piece (0000001010) - Result: 1101110000001010
(DC0A)
Your surrogate pair: D841, DC0A
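The reverse direction can be sketched the same way. The function name is again hypothetical; the steps mirror the four steps of Example #2.

```python
from typing import Tuple


def code_point_to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000         # step 1: subtract x10000
    high_bits = offset >> 10              # step 2: upper ten bits
    low_bits = offset & 0x3FF             #         lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # steps 3 and 4


high, low = code_point_to_surrogate_pair(0x2040A)
print(f"{high:04X}, {low:04X}")  # D841, DC0A
```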
UTF-8 conversions
Illegal conversions: six-byte UTF-8 (two
surrogate code points of UTF-16, converted
separately)
Legal conversions: four-byte UTF-8 (one
UTF-32 code point)
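To make the difference concrete, here is a small Python sketch using U+2040A (the character from Example #2). It relies on Python's standard codecs: the strict "utf-8" codec produces the four-byte form, while the "surrogatepass" error handler is used here only to reproduce the illegal six-byte form for comparison. The helper name strict_utf8_accepts is an illustration, not an existing API.

```python
# One supplementary character, encoded the legal way: four bytes.
legal = "\U0002040A".encode("utf-8")

# The same character's two surrogates converted separately: six bytes.
# "surrogatepass" deliberately bypasses the check that normally forbids this.
illegal = "\ud841\udc0a".encode("utf-8", errors="surrogatepass")


def strict_utf8_accepts(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(legal.hex(), len(legal))      # f0a0908a 4
print(illegal.hex(), len(illegal))  # eda181edb08a 6
print(strict_utf8_accepts(legal), strict_utf8_accepts(illegal))  # True False
```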
UTF-8 example
Unicode surrogate pair:
aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
becomes incorrect UTF-8 total 6 bytes:
1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
Instead, you should take a Unicode surrogate pair:
110110wwwwzzzzyy, 110111yyyyxxxxxx
and convert it to UTF-8 totaling 4 bytes (below,
uuuuu is defined as wwww+1):
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
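The bit layout above can be followed literally in code. This is a sketch with a hypothetical function name; the variable names match the letters used in the bit patterns (w, z, y, x, u).

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Convert a surrogate pair directly to the four-byte UTF-8 form."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    wwww = (high >> 6) & 0xF     # 110110wwww......
    zzzz = (high >> 2) & 0xF     # ..........zzzz..
    y_hi = high & 0x3            # top two bits of yyyyyy
    y_lo = (low >> 6) & 0xF      # bottom four bits of yyyyyy
    xxxxxx = low & 0x3F
    uuuuu = wwww + 1             # as defined on the slide
    return bytes([
        0xF0 | (uuuuu >> 2),                # 11110uuu
        0x80 | ((uuuuu & 0x3) << 4) | zzzz,  # 10uuzzzz
        0x80 | (y_hi << 4) | y_lo,           # 10yyyyyy
        0x80 | xxxxxx,                       # 10xxxxxx
    ])


print(surrogate_pair_to_utf8(0xD841, 0xDC0A).hex())  # f0a0908a
```

The result can be cross-checked against a standard encoder, for example `"\U0002040A".encode("utf-8")` in Python.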
Encoding choices for MS
UTF-16, mostly
Occasionally UTF-8
Even more occasionally, UTF-32
REASONS:
There was already an existing, well-tested set of APIs that
supports UCS-2, which is a strict subset of UTF-16.
A completely new API set was not required.
A move to UTF-32 would require twice as much space for all
characters.
A move to UTF-8 would require more space in many cases
(three bytes instead of two for most East Asian characters).
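The space trade-off can be checked with a short Python sketch that counts bytes per character in each encoding form. The sample characters and the helper name encoded_sizes are chosen here for illustration only.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII letter A": "A",
    "Greek U+03A9": "\u03A9",
    "CJK (BMP) U+4E2D": "\u4E2D",
    "CJK (Plane 2) U+2040A": "\U0002040A",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```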
The products...
Mostly the new generation of products:
– Windows 2000/XP
– Office XP (some support in Office 2000)
Most of these products supported Unicode
already
– a little bit of extra work needed for surrogate
pairs
– usually just UTF-8 support needed
Windows 2000/XP
Uniscribe/GDI+ support for rendering
Each surrogate pair is a single grapheme
APIs like CharPrev/CharNext not changed
Extensions to fallback fonts in XP
Font CMAP extensions in XP
Lots of UTF-8 issues fixed in XP
No specific surrogate font/IME (yet)
Collation for Supplementary characters
All Plane 1 (non-ideographic) characters sort after all the
other non-ideographic scripts but before the ideographs.
All Plane 2 (ideographic) characters will be sorted after all
the ideographs on the BMP.
All Plane 3-14 characters (currently not assigned) will be
treated like any other unassigned characters (this includes
the Plane 14 language tags).
All characters encoded in Plane 15-16 (private use) will be
sorted after all other characters.
Other system components
MLang
Internet Explorer
IIS 5.0/6.0
The downlevel story
No good support for Unicode, let alone
supplementary characters
Uniscribe/RichEdit does improve the
downlevel story for display purposes, at
least
Officially, no surrogate support on Win9x
The Office suite
Word
FrontPage
Excel/Access
Outlook
RichEdit 4.0
Specific Features
Insertion/Deletion of text - All
Cursor movement - All
Font linking/fallback - All (Word's is best)
UTF-8 issues fixed - All
Enhanced word breaking - All (Word/RichEdit)
Vertical text - Word/PowerPoint/Publisher/RichEdit
Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit
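As a rough illustration of what surrogate-aware cursor movement involves, the sketch below steps a caret through a buffer of UTF-16 code units without ever stopping between the two halves of a pair. This is not how Word or RichEdit implement it; all function names are hypothetical.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def _is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def _is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def next_caret_position(units: list, index: int) -> int:
    """Move right by one character, skipping a whole surrogate pair."""
    if index >= len(units):
        return len(units)
    if _is_high(units[index]) and index + 1 < len(units) and _is_low(units[index + 1]):
        return index + 2
    return index + 1


def prev_caret_position(units: list, index: int) -> int:
    """Move left by one character, skipping a whole surrogate pair."""
    if index <= 0:
        return 0
    if index >= 2 and _is_low(units[index - 1]) and _is_high(units[index - 2]):
        return index - 2
    return index - 1


units = utf16_units("a\U0002040Ab")  # [0x61, 0xD841, 0xDC0A, 0x62]
print(next_caret_position(units, 1))  # 3
print(prev_caret_position(units, 3))  # 1
```

Deleting a character can reuse the same boundaries: remove the code units between the caret and the position returned by one of these functions.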
CHS/CHT/CHP Office
The product and the langpacks support an
extended Unicode IME that handles
supplementary characters
An Extension B font is also included
Visual Studio[.NET]
String class and globalization namespace
StringInfo
GetTextElementEnumerator
– Handles supplementary characters
– Also handles composite characters
GDI+
IDE support
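The idea behind GetTextElementEnumerator (one element per user-perceived character, covering both supplementary and composite characters) can be approximated in Python. This is only an analogy, not the .NET algorithm: it assumes that a text element is a base character followed by any characters in the Unicode mark categories (Mn, Mc, Me). Python strings are indexed by code point, so a supplementary character is already a single item; the UTF-16 length is computed separately to show that it still occupies two code units.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")  # assumption: marks attach to the preceding base


def text_elements(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        if elements and unicodedata.category(ch) in MARK_CATEGORIES:
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed for the text."""
    return len(text.encode("utf-16-le")) // 2


sample = "e\u0301\U0002040Ax"  # e + combining acute, a Plane 2 character, x
for element in text_elements(sample):
    print([f"U+{ord(c):04X}" for c in element], utf16_length(element))
```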
SQL Server
Past - no support
Present - surrogate "safe" (neutral)
Future - surrogate aware
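One way to picture the difference, under the assumption that "safe" means pairs are stored and returned intact but counted as two units, while "aware" means string operations treat a pair as one character. The sketch below is not SQL Server code and the function names are hypothetical; it only contrasts the two ways of measuring and truncating a string.

```python
def length_in_code_units(text: str) -> int:
    """Length as a surrogate-neutral system sees it: UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2


def length_in_characters(text: str) -> int:
    """Length as a surrogate-aware system would report it: code points."""
    return len(text)


def truncate_without_splitting(text: str, max_units: int) -> str:
    """Truncate to at most max_units UTF-16 code units, never cutting a pair in half."""
    kept = []
    used = 0
    for ch in text:
        needed = 2 if ord(ch) > 0xFFFF else 1
        if used + needed > max_units:
            break
        kept.append(ch)
        used += needed
    return "".join(kept)


value = "ab\U0002040A"
print(length_in_code_units(value))  # 4
print(length_in_characters(value))  # 3
print(truncate_without_splitting(value, 3))  # ab
```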
Items not supported
Character Map
Graph 10
Outlook 10 mail headers
Collations for supplementary characters
Fonts/IMEs
Questions?