Unicode and Java

Unicode (and Java)
Brice Giesbrecht
Objective of Presentation
The need for Unicode
How it works
Differentiate between encodings
How to get your browser to work…
See how Java consumes and produces data

Overview of Presentation
Character Sets
Unicode
Encodings
Unicode Support in Java
Unicode Support in Databases (?)
Demonstration (web app)
Resources
Door Prizes (for those still awake…)

Character Sets

What is a character set?
  Code page: a mapping in which a sequence of bits, usually a single octet representing integer values 0 through 255, is associated with a specific character (Wikipedia)
  Most character sets map each character directly to a number (7-bit or 8-bit)
  Character sets are NOT fonts!
  Encoding is usually a table lookup
  Most IBM and Microsoft code pages use ASCII as their base set of characters
  The English bias (compare to Indic languages)
Character Sets

Issues within a single language
  Selectors to overcome 8-bit limitations (especially for CJK sets)
  Historical importance of platforms and hardware
  Compatibility (or more likely, lack thereof)
  ISCII as an example

Issues outside a single language
  How do you produce content using multiple languages? (Or the characters from those languages?)
  http://en.wikipedia.org/wiki/Code_page_437
Character Sets

Enter the standards
  ISO-646 (ASCII, still 7-bit)
    12 whole code points to play with!
    C0 Control Set (0x00 – 0x1F)
  ISO-8859-n
    0x00 – 0x7F ISO-646 IRV
    0x80 – 0xFF Different for each set (or part)
    ISO 8859-1 (Latin1)
    C1 Control Set (0x80 – 0x9F)
  ISO-2022
    Designed for transmission
    Non-Latin bases & multi-byte sets
Character Sets

Enter Microsoft!
  Windows code pages
    http://www.microsoft.com/globaldev/reference/wincp.mspx
  Cp1252
    Based on ISO 8859-1
    C1 code points used for printable characters
    Often mislabeled as ISO-8859-1 due to their similarities
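
The Cp1252 / ISO-8859-1 difference can be seen by decoding the same bytes with both charsets. This is a minimal sketch; the class name C1Demo and its decode helper are made up for illustration. It assumes the JVM ships the windows-1252 charset, which standard JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class C1Demo {
    /** Decodes one byte with the given charset and returns the resulting code point. */
    static int decode(int b, Charset cs) {
        String s = new String(new byte[] {(byte) b}, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // 0x41 and 0xE9 agree in both sets; 0x80 and 0x93 fall in the C1 range and differ.
        for (int b : new int[] {0x41, 0x80, 0x93, 0xE9}) {
            System.out.printf("byte 0x%02X  ISO-8859-1 -> U+%04X  windows-1252 -> U+%04X%n",
                    b, decode(b, StandardCharsets.ISO_8859_1), decode(b, cp1252));
        }
    }
}
```

Bytes outside 0x80 – 0x9F decode identically, which is why the mislabeling often goes unnoticed until a euro sign or a curly quote shows up.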
Unicode
What is Unicode?
Unicode provides a unique number for
every character,
no matter what the platform,
no matter what the program,
no matter what the language.
Unicode
ISO 10646 1990
Merged with the Unicode Consortium
Ties a character, name, and a code point together
BMP – Basic Multilingual Plane (the first 65,536 code points)
ISO and UC character repertoires are synchronized
UCS (Universal Character Set)
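
The link between a code point and its name can be looked up from Java. A small sketch, assuming Java 7 or later (where Character.getName exists); the class name NameDemo and the describe helper are hypothetical.

```java
final class NameDemo {
    /** Formats a code point as "U+XXXX NAME" using the JDK's built-in character data. */
    static String describe(int cp) {
        return String.format("U+%04X %s", cp, Character.getName(cp));
    }

    public static void main(String[] args) {
        // One BMP letter, one BMP symbol, and one code point outside the BMP.
        for (int cp : new int[] {0x0041, 0x20AC, 0x10302}) {
            System.out.println(describe(cp));
        }
    }
}
```

The names returned depend on the Unicode version bundled with the running JDK, so newer characters may come back as null on older releases.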

Unicode

Q: So are they the same thing?
A: No. Although the character codes and
encoding forms are synchronized between
Unicode and ISO/IEC 10646, the Unicode
Standard imposes additional constraints on
implementations to ensure that they treat
characters uniformly across platforms and
applications. To this end, it supplies an
extensive set of functional character
specifications, character data, algorithms
and substantial background material that is
not in ISO/IEC 10646.
(http://unicode.org/faq/unicode_iso.html)
Unicode

The Unicode Standard includes a set of
characters, names, and coded
representations that are identical with
those in ISO/IEC 10646:2003. It
additionally provides details of character
properties, processing algorithms, and
definitions that are useful to implementers.
[It] strengthens Unicode support for
worldwide communication, software
availability, and publishing.
(http://www.iso.org)
Unicode
UCS code space (0x0 – 0x7FFFFFFF)
  128 x 256 x 256 x 256 (GPRC: groups x planes x rows x cells)
  2,147,483,648 possible code points
The Unicode Character Database
  Available online
    http://unicode.org/Public/UNIDATA/UCD.html
  Main definition (UnicodeData.txt)
    http://www.unicode.org/Public/UNIDATA/
Unicode code space (0x0 – 0x10FFFF)
  17 x 256 x 256 = 1,114,112 code points
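
The two code-space sizes are plain arithmetic and can be checked against the constants Java exposes. A minimal sketch; CodeSpace and its method names are invented for this example.

```java
final class CodeSpace {
    /** UCS: 128 groups x 256 planes x 256 rows x 256 cells. */
    static long ucsSize() {
        return 128L * 256 * 256 * 256;
    }

    /** Unicode: 17 planes x 256 rows x 256 cells. */
    static int unicodeSize() {
        return 17 * 256 * 256;
    }

    public static void main(String[] args) {
        System.out.println("UCS code points:     " + ucsSize());
        System.out.println("Unicode code points: " + unicodeSize());
        // Java's own upper bound for a code point is the last cell of plane 16.
        System.out.printf("Character.MAX_CODE_POINT = 0x%X%n", Character.MAX_CODE_POINT);
        System.out.println("0x110000 valid? " + Character.isValidCodePoint(0x110000));
    }
}
```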
Unicode

As of Unicode 5.0.0, 101,063 (9.1%) of
these codepoints are assigned, with
another 137,468 (12.3%) reserved for
private use, leaving 875,441 (78.6%)
unassigned. The number of assigned code
points is made up as follows:
98,884 graphemes
140 formatting characters
65 control characters
2,048 surrogate characters
Unicode
Plane 0 (0000-FFFF)
Basic Multilingual Plane (BMP)
Used for most of the alphabets
Not all code points are used
Allocated in areas/blocks

Unicode
Plane 1 (10000-1FFFF)
Supplementary Multilingual Plane (SMP)
Holds historic scripts such as Linear B, as well as musical and mathematical symbols

Unicode
Plane 2 (20000-2FFFF)
Supplementary Ideographic Plane (SIP)
Used for about 40,000 rare Chinese characters that are mostly historic

Unicode
Planes 3 to 13 (30000-DFFFF)
Unassigned

Unicode
Plane 14 (E0000-EFFFF)
Supplementary Special-purpose Plane (SSP)
Glyph (font) selection
code point + variation selector = variation sequence
  http://www.unicode.org/reports/tr37/tr37-3.html (Ideographic Variation Database)
Unicode
Plane 15 (F0000-FFFFF)
Plane 16 (100000-10FFFF)
Plane 0 (E000-F8FF)
Private Use Area (PUA)
  The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways.
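
Since every plane holds 65,536 code points, the plane number is just the code point shifted right by 16 bits. The sketch below also asks the JDK whether a code point is in a private use area. PlaneDemo and its helpers are hypothetical names.

```java
final class PlaneDemo {
    /** Plane number 0..16 for a valid Unicode code point. */
    static int plane(int cp) {
        return cp >>> 16;
    }

    /** True if the JDK classifies the code point as private use (general category Co). */
    static boolean isPrivateUse(int cp) {
        return Character.getType(cp) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x4E8C, 0xE000, 0x10302, 0x20000, 0xE0100, 0xF0000, 0x10FFFD};
        for (int cp : samples) {
            System.out.printf("U+%06X  plane %2d  private use: %b%n",
                    cp, plane(cp), isPrivateUse(cp));
        }
    }
}
```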
Unicode
ConScript Unicode Registry
  The purpose of the ConScript Unicode Registry (CSUR) is to coordinate the assignment of blocks out of the Unicode Private Use Area (E000-F8FF and 000F0000-0010FFFF) to constructed/artificial scripts, including scripts for constructed/artificial languages.
  Cirth, Klingon, Tengwar, etc.
Encodings
The purpose of the following encodings is to get the Unicode value to you. Different storage or transmission protocols call for different encodings. These are not different character sets; they are different ways of representing the characters in Unicode.
Encodings

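Endianness
  0x1234
  LE 34 12
  BE 12 34
Byte Order Mark - 0xFEFF
  Helps determine endianness
  Unicode 3.2 added 0x2060 (word joiner) to take over the zero-width no-break space role of 0xFEFF
  0xFFFE reserved (never a character, so a byte-swapped BOM is recognizable)
  0xFEFF set aside for BOM
  Also used to declare encoding (UTF-8)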
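
The byte orders above can be reproduced by encoding the single character 0x1234 with Java's UTF-16 charsets. A minimal sketch; EndianDemo and hex are illustrative names. Java's plain "UTF-16" encoder is used here to show a BOM being written.

```java
import java.nio.charset.StandardCharsets;

final class EndianDemo {
    /** Renders bytes as space-separated two-digit hex, e.g. "FE FF 12 34". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u1234";
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        // The unmarked UTF-16 charset prepends a big-endian BOM when encoding.
        System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```

The BE and LE variants write no BOM, because the byte order is already stated in the charset name.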
Encodings
UTF-8
  Variable-length character encoding
  Can address all characters in the UCS but was limited by RFC 3629 to just address the Unicode code space.
  BOM – EF BB BF
  Format

  Code point range   Byte sequence
  000000-00007F      0zzzzzzz
  000080-0007FF      110yyyyy 10zzzzzz
  000800-00FFFF      1110xxxx 10yyyyyy 10zzzzzz
  010000-10FFFF      11110www 10xxxxxx 10yyyyyy 10zzzzzz
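
The format table translates almost line for line into bit operations. The hand-rolled encoder below is only a sketch to show the bit layout (Utf8Sketch is an invented name); real code would use String.getBytes with StandardCharsets.UTF_8, which main uses as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    /** Encodes one code point following the four rows of the UTF-8 format table. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0xE9, 0x20AC, 0x10302}) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```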
Encodings
UTF-32/UCS-4
  Fixed-length character encoding
  Uses 31 bits
  UCS-4 capable of addressing entire UCS, but was restricted to only cover the Unicode code space
  UTF-32 only covers the Unicode code space
  4E8C, 10302 = 00004E8C, 00010302
  BE BOM – 00 00 FE FF
  LE BOM – FF FE 00 00
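
In UTF-32 each code unit is simply the code point padded to 32 bits. The sketch below reproduces the 4E8C / 10302 example from a Java string. It assumes Java 8 or later for String.codePoints; Utf32Sketch and utf32Hex are hypothetical names.

```java
final class Utf32Sketch {
    /** Lists the code points of a string as eight-digit hex values, as UTF-32 would store them. */
    static String utf32Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int cp : s.codePoints().toArray()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(String.format("%08X", cp));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One BMP character followed by one supplementary character.
        String s = "\u4E8C" + new String(Character.toChars(0x10302));
        System.out.println(utf32Hex(s));
    }
}
```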
Encodings
UCS-2
  Fixed-length encoding
  Two-octet
  It is NOT UTF-16!
  Only addresses BMP
  UCS-2BE, UCS-2LE
  Obsoleted by UTF-16
Encodings
UTF-16
  Variable-length encoding
  UTF-16BE, UTF-16LE
  BE BOM – FE FF
  LE BOM – FF FE
  Surrogates are used to address code points outside the BMP. (We will cover this later)
Encodings
UTF-16 Surrogate Pairs
  Needed for code points > 0xFFFF
  High surrogate 0xD800 – 0xDBFF (first 16-bit code unit)
  Low surrogate 0xDC00 – 0xDFFF (second 16-bit code unit)
  Algorithm:
    ((cp - 0x10000) high 10 bits) | 0xD800
    ((cp - 0x10000) low 10 bits) | 0xDC00
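
The two-line algorithm can be written out directly and compared with the JDK's Character.toChars. This is a sketch for illustration; SurrogateSketch and toSurrogates are invented names.

```java
import java.util.Arrays;

final class SurrogateSketch {
    /** Splits a supplementary code point into its high and low surrogate. */
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                       // 20 significant bits remain
        char high = (char) (0xD800 | (v >> 10));    // top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x10302, 0x1D11E, 0x10FFFF}) {
            char[] pair = toSurrogates(cp);
            System.out.printf("U+%06X -> %04X %04X, matches JDK: %b%n",
                    cp, (int) pair[0], (int) pair[1],
                    Arrays.equals(pair, Character.toChars(cp)));
        }
    }
}
```

Character.toCodePoint(high, low) performs the reverse step.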
Encodings
Which encoding should you use?
  For CJK or Hindi (code points from 0x0800 up), UTF-8 requires 3 bytes per character whereas UTF-16 needs only 2
  UTF-8 is great for ASCII (1 byte per character), whereas UTF-16 needs 2 bytes
  Java uses UTF-16
  Windows uses UTF-16LE internally
  UTF-32 is not used much
  UTF-8 and UTF-16 are the most common
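
The size trade-off is easy to measure. The sketch below counts encoded bytes for an ASCII, a CJK, and a Hindi sample; the sample strings and the SizeCompare name are chosen only for this example. UTF-16BE is used so that no BOM is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SizeCompare {
    /** Number of bytes the text occupies in the given encoding. */
    static int size(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Hello",                                   // ASCII
            "\u4E8C",                                  // a CJK ideograph
            "\u0939\u093F\u0928\u094D\u0926\u0940"     // the word "Hindi" in Devanagari
        };
        for (String s : samples) {
            System.out.printf("%d chars: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.length(), size(s, StandardCharsets.UTF_8), size(s, StandardCharsets.UTF_16BE));
        }
    }
}
```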
Java
Unicode version supported by each Java release:
  J2SE 1.5: Unicode 4.0
  J2SE 1.4: Unicode 3.0
  J2SE 1.3: Unicode 2.1
Supplementary characters were part of Unicode 3.1
Addressed in JSR 204 (http://jcp.org/en/jsr/detail?id=204)
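
The practical effect of the supplementary-character support added in J2SE 1.5 is that a String's length in chars can differ from its length in code points. A minimal sketch using the code point API introduced with that release; SupplementaryDemo and fromCodePoint are illustrative names.

```java
final class SupplementaryDemo {
    /** Builds a one-character string from any valid code point. */
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x10302);
        System.out.println("length():         " + s.length());
        System.out.println("codePointCount(): " + s.codePointCount(0, s.length()));
        // charAt sees only one surrogate; codePointAt reassembles the pair.
        System.out.printf("charAt(0):        %04X%n", (int) s.charAt(0));
        System.out.printf("codePointAt(0):   %X%n", s.codePointAt(0));
    }
}
```

Code that loops with charAt and length() therefore works on UTF-16 code units, not on characters.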
Java
Unicode characters are specified using \u followed by four hex digits, such as \u0039
Unicode can be used in source files
file.encoding=Cp1252 on my machine
You can change this, but beware…
Java reads and writes using this encoding by default
You can specify the character set to use for reading or writing
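
Passing the charset explicitly removes the dependence on file.encoding. The sketch below writes and reads through in-memory streams so it runs anywhere; with real files the same constructors wrap a FileInputStream or FileOutputStream. ExplicitCharsetDemo and its helpers are hypothetical names, and the windows-1252 charset is assumed to be available.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    /** Encodes text with an explicitly chosen charset instead of the platform default. */
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decodes bytes with an explicitly chosen charset. */
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        String text = "caf\u00E9 \u0039";
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println("Round trip ok:   " + read(utf8, StandardCharsets.UTF_8).equals(text));
        // Decoding with the wrong charset does not fail; it silently produces different text.
        String wrong = read(utf8, Charset.forName("windows-1252"));
        System.out.println("Wrong charset changes the text: " + !wrong.equals(text));
    }
}
```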

Java
Big5
Big5-HKSCS
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM-Thai
IBM00858
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
IBM037
IBM1026
IBM1047
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM297
IBM420
IBM424
IBM437
IBM500
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM868
IBM869
IBM870
IBM871
IBM918
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-13
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
JIS_X0212-1990
KOI8-R
Shift_JIS
TIS-620
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-8
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-Big5-Solaris
x-euc-jp-linux
x-EUC-TW
x-eucJP-Open
x-IBM1006
x-IBM1025
x-IBM1046
x-IBM1097
x-IBM1098
x-IBM1112
x-IBM1122
x-IBM1123
x-IBM1124
x-IBM1381
x-IBM1383
x-IBM33722
x-IBM737
x-IBM856
x-IBM874
x-IBM875
x-IBM921
x-IBM922
x-IBM930
x-IBM933
x-IBM935
x-IBM937
x-IBM939
x-IBM942
x-IBM942C
x-IBM943
x-IBM943C
x-IBM948
x-IBM949
x-IBM949C
x-IBM950
x-IBM964
x-IBM970
x-ISCII91
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-iso-8859-11
x-JIS0208
x-JISAutoDetect
x-Johab
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacDingbat
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRoman
x-MacRomania
x-MacSymbol
x-MacThai
x-MacTurkish
x-MacUkraine
x-MS950-HKSCS
x-mswin-936
x-PCK
x-windows-874
x-windows-949
x-windows-950
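
A list like the one above can be produced on any JVM with Charset.availableCharsets; the exact contents vary by JDK version and installed modules. A short sketch, with ListCharsets as an invented class name.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

final class ListCharsets {
    /** Canonical charset names mapped to their Charset objects, sorted by name. */
    static SortedMap<String, Charset> available() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        for (Charset cs : available().values()) {
            // Aliases are the alternative names accepted by Charset.forName.
            System.out.println(cs.name() + " " + cs.aliases());
        }
        System.out.println("Total: " + available().size());
    }
}
```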
Databases (Maybe)

SQL 92 NATIONAL CHARACTER
  The <key word>s NATIONAL CHARACTER are used to specify a character string data type with a particular implementation-defined character repertoire. Special syntax (N'string') is provided for representing literals in that character repertoire.
Collation
Database Support
  MySQL
  Oracle
  SQL Server
  Postgres

Demonstration






Read/write/examine UTF-8/UTF-16/UTF-16LE encoded text (with a hex editor)
Show encoding settings in Eclipse and Java
Show how Windows (and the Eclipse console) can/can't display some characters
Web browser settings
Chinese article on cracking of SHA-1
Martin Fowler article on dependency injection
Resources

The big ones:




The rest:







http://www.unicode.org/Public/UNIDATA/
http://en.wikipedia.org/wiki/Unicode
http://www.evertype.com/standards/csur
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
http://en.wikibooks.org/wiki/Unicode/Character_reference
http://www.joelonsoftware.com/articles/Unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://czyborra.com/charsets/iso646.html
http://www.fileformat.info/ (GREAT resource)
For fun:



http://www.omniglot.com/
http://en.wikipedia.org/wiki/Constructed_language
http://talideon.com/concultures/wiki/