The Taxonomic Database
Working Group
International & Special
Characters in Scientific Data
Adrian Rissoné
Information Systems Manager
Department of Palaeontology
The Natural History Museum
[email protected]
• Introduction
• ISO 10646 and the UCS
• Unicode and UTF
• Support in common products
• Sorting & Searching
• Data management products
©The Natural History Museum, London, SW7 5BD, October 2002
1
Introduction
Until a few years ago the only text characters that could be used widely
were the 128 (including control characters) contained within the 7-bit
ANSI/ASCII character set – in practice, only a limited range of
characters from North American English. Later, use of the 8th bit
extended the range to 256 characters, so as to include most Western
European characters and some graphical characters. Non-Western
characters could only be displayed using Windows Code Pages – they
could not be displayed together
More recently, support for multi-byte characters has been gradually
introduced into operating systems (MacOS since version 8.5, Windows
NT/2000), but restricted to certain fonts (e.g. Arial Unicode MS)
Inclusion into application products has been slow
2
ISO 10646 and
UCS
The international standard ISO 10646 defines the Universal Character
Set (UCS)
UCS is a superset of all other character set standards. It guarantees
round-trip compatibility to other character sets
UCS contains the characters required to represent practically all known
languages
This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic,
Armenian, and Georgian scripts, but also Chinese, Japanese and
Korean Han ideographs as well as scripts such as Hiragana, Katakana,
Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic,
Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar,
Sinhala, Thaana, Yi, and others
3
ISO 10646 and
UCS
ISO 10646 formally defines a 31-bit character set
The most commonly used characters, including all those found in older
encoding standards, have been placed in one of the first 65534 positions
(0x0000 to 0xFFFD)
This 16-bit subset of UCS is called the Basic Multilingual Plane
(BMP)
The characters that were later added outside the 16-bit BMP are mostly
for specialist applications such as historic scripts and scientific notation.
Current plans are that there will never be characters assigned outside the
21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over
one million potential future characters
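As a rough illustration of the numbers above, the following Python sketch computes the size of the 21-bit code space and the plane a code point falls in. The names `plane` and `in_bmp` are illustrative only and do not come from either standard.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of the code space quoted above


def plane(code_point: int) -> int:
    """Return the 65,536-position plane that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the UCS code space: {code_point:#x}")
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane (plane 0)."""
    return plane(code_point) == 0


print(MAX_CODE_POINT + 1)  # 1114112 positions, "a bit over one million"
print(in_bmp(0x0041))      # True
print(plane(0x10FFFD))     # 16
```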
4
ISO 10646 and
UCS
UCS assigns to each character not only a code number but also an
official name
A hexadecimal number that represents a UCS or Unicode value is
commonly preceded by "U+" as in U+0041 for the character "Latin
capital letter A"
The UCS characters U+0000 to U+007F are identical to those in US
ASCII and the range U+0000 to U+00FF is identical to ISO 8859-1
(Latin-1)
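The code number, official name and ASCII/Latin-1 overlap described above can be checked with Python's standard `unicodedata` module. This is only a sketch; the helper name `describe` is made up for the example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX OFFICIAL NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00e9"))  # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# Decoding every possible Latin-1 byte yields code points 0..255 unchanged,
# which is what "identical to ISO 8859-1" means in practice.
all_bytes = bytes(range(256))
assert [ord(c) for c in all_bytes.decode("latin-1")] == list(range(256))

# The same holds for the 128 ASCII values.
assert [ord(c) for c in bytes(range(128)).decode("ascii")] == list(range(128))
```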
5
ISO 10646 and
UCS
• Combining characters
These are similar to the non-spacing accent keys on a typewriter. A
combining character is not a full character by itself. It is an accent or
other diacritical mark that is added to the previous character. This way, it
is possible to place any accent on any character
Combining characters follow the character which they modify
• Precomposed characters
Accented characters that have their own code position, but could also be
represented as a pair of another character followed by a combining
character
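The difference between the two representations can be seen in a short Python sketch. It relies on the Unicode normalization forms NFC and NFD, which are not mentioned on this slide but are the usual way of converting between precomposed and combining sequences.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single precomposed code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# The two strings look the same on screen but are different code sequences.
print(precomposed == combining)          # False
print(len(precomposed), len(combining))  # 1 2

# NFC composes where a precomposed character exists; NFD decomposes.
composed = unicodedata.normalize("NFC", combining)
decomposed = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed)           # True
print(decomposed == combining)           # True

# A non-zero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301") != 0)  # True
```

Comparing or searching text reliably therefore usually means normalizing both sides to the same form first.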
6
UCS Implementation Levels
• Level 1
Combining characters and Hangul Jamo characters are not supported
• Level 2
Like level 1, however in some scripts, a fixed list of combining
characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali,
Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai
and Lao). These scripts cannot be represented adequately in UCS
without support for at least certain combining characters
• Level 3
All UCS characters are supported, such that for example mathematicians
can place a tilde or an arrow (or both) on any arbitrary character
7
UCS as a national
standard?
A number of countries have published national adoptions of ISO 10646-1:1993, sometimes after adding additional annexes with cross-references
to older national standards and specifications of various national
implementation subsets
• China: GB 13000.1-93
• Japan: JIS X 0221-1:2001
• Korea: KS X 1005-1:1995 (includes ISO 10646-1:1993 amendments
1-7)
• Vietnam: TCVN 6909:2001
8
What is
Unicode?
The ISO 10646 standard was a project of the International Organization
for Standardization (ISO)
The Unicode Project was organized by a consortium of (initially mostly
US) manufacturers of multi-lingual software
Fortunately, the participants of both projects realized in around 1991 that
two different unified character sets is not what the world needs. They
joined their efforts and worked together on creating a single code table
Both projects still exist and publish their respective standards
independently, but they have agreed to keep the code tables of the
Unicode and ISO 10646 standards compatible
9
What is
Unicode?
The Unicode Standard published by the Unicode Consortium
corresponds to ISO 10646 at implementation level 3. All characters are
at the same positions and have the same names in both standards
The Unicode Standard defines in addition much more semantics
associated with some of the characters and is in general a better
reference for implementers of high-quality typographic publishing
systems
Unicode specifies algorithms for rendering presentation forms of some
scripts (say Arabic), handling of bi-directional texts that mix for instance
Latin and Hebrew, algorithms for sorting and string comparison, and
much more
10
What is
Unicode?
The ISO 10646 standard on the other hand is not much more than a
simple character set table
However, a nice feature of the ISO 10646-1 standard is that it provides
CJK example glyphs in five different style variants, while the Unicode
standard shows the CJK ideographs only in a Chinese variant
11
What does Unicode look like?
Characters are denoted in the Unicode Standard as an optional U+
followed by their hexadecimal number, using at least 4 digits, such as
"U+1234" or "U+10FFFD"
In XML or HTML this could be expressed as "&#x1234;" or
"&#x10FFFD;"
12
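A small Python sketch of the two notations on the previous slide: the "U+" form and the hexadecimal numeric character reference used in XML and HTML. The helper names `u_plus` and `char_ref` are invented for the example; `html.unescape` is the standard-library function that turns a reference back into a character.

```python
import html


def u_plus(ch: str) -> str:
    """'U+' notation with at least four hexadecimal digits."""
    return f"U+{ord(ch):04X}"


def char_ref(ch: str) -> str:
    """Hexadecimal numeric character reference for XML or HTML."""
    return f"&#x{ord(ch):X};"


for ch in ("\u1234", "\U0010FFFD"):
    print(u_plus(ch), char_ref(ch))

# A parser reverses the process and yields the original character.
assert html.unescape(char_ref("\u1234")) == "\u1234"
```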
UTF-8
(UCS Transformation Format)
UCS and Unicode are just code tables that assign integer numbers to
characters
There exist several alternatives for how a sequence of such characters or
their respective integer values can be represented as a sequence of bytes
The two most obvious encodings store Unicode text as sequences of
either 2-byte or 4-byte units. The official terms for these encodings are
UCS-2 and UCS-4 respectively
An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply
inserting a 0x00 byte in front of every ASCII byte. If we want to have a
UCS-4 file, we have to insert three 0x00 bytes instead before every
ASCII byte
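The zero-byte insertion can be sketched in Python as follows. Two assumptions are made here: the byte order is big-endian (the zero bytes come first, as described above), and Python's `utf-16-be` and `utf-32-be` codecs are used as stand-ins for UCS-2 and UCS-4, which they match for the characters in this example. The function name `widen` is illustrative.

```python
def widen(data: bytes, width: int) -> bytes:
    """Prefix every byte with (width - 1) zero bytes (big-endian widening)."""
    padding = b"\x00" * (width - 1)
    return b"".join(padding + bytes([b]) for b in data)


latin1 = "Vis\u00e9an".encode("latin-1")  # one byte per character

ucs2 = widen(latin1, 2)  # two bytes per character
ucs4 = widen(latin1, 4)  # four bytes per character

# For Latin-1 input the result agrees with the big-endian codecs.
assert ucs2 == "Vis\u00e9an".encode("utf-16-be")
assert ucs4 == "Vis\u00e9an".encode("utf-32-be")
print(len(latin1), len(ucs2), len(ucs4))  # 6 12 24
```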
13
UTF-8
Using UCS-2 (or UCS-4) under some operating systems (e.g. Unix)
would lead to very severe problems. Some bytes and byte sequences
have a special meaning in filenames and other C library function
parameters
The UTF-8 encoding defined in ISO 10646-1:2000 Annex D (and also
described in section 3.8 of the Unicode 3.0 standard) does not have these
problems
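To make the problem concrete, the Python sketch below looks for two bytes that Unix treats specially in filenames: 0x00 (string terminator in C) and 0x2F ("/", the path separator). The choice of these two bytes and the helper name `special_bytes` are assumptions made for the example; `utf-16-be` stands in for big-endian UCS-2.

```python
SPECIAL = {0x00, 0x2F}  # NUL terminator and '/' path separator


def special_bytes(text: str, encoding: str) -> list:
    """Return the NUL or '/' bytes that appear in the encoded text."""
    return [b for b in text.encode(encoding) if b in SPECIAL]


ch = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK

# In UCS-2 the second byte of U+012F is 0x2F, indistinguishable from '/'.
print(special_bytes(ch, "utf-16-be"))   # [47]

# Even plain ASCII text acquires NUL bytes in UCS-2.
print(special_bytes("A", "utf-16-be"))  # [0]

# In UTF-8 every byte of a non-ASCII character has the high bit set.
print(special_bytes(ch, "utf-8"))       # []
```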
14
UTF-8
UTF-8 has the following properties:
• UCS characters U+0000 to U+007F (ASCII) are encoded simply as
bytes 0x00 to 0x7F (ASCII compatibility). This means that files and
strings which contain only 7-bit ASCII characters have the same
encoding under both ASCII and UTF-8
• All UCS characters >U+007F are encoded as a sequence of several
bytes, each of which has the most significant bit set. Therefore, no
ASCII byte (0x00-0x7F) can appear as part of any other character
15
UTF-8
• The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD and it indicates how
many bytes follow for this character. All further bytes in a multibyte
sequence are in the range 0x80 to 0xBF
• All possible 2^31 UCS codes can be encoded
• UTF-8 encoded characters may theoretically be up to six bytes long
• The sorting order of big-endian UCS-4 byte strings is preserved
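The properties listed above can be spot-checked with a short Python sketch. The lead-byte ranges follow the original one-to-six-byte scheme described on these slides; note as a caveat that Python's own `utf-8` codec follows the later definition, which stops at four bytes and U+10FFFF, so the five- and six-byte branches are never exercised by it. The function names are illustrative.

```python
def sequence_length(lead_byte: int) -> int:
    """Total length of the sequence announced by its first byte."""
    if lead_byte <= 0x7F:
        return 1  # plain ASCII
    for low, high, length in ((0xC0, 0xDF, 2), (0xE0, 0xEF, 3),
                              (0xF0, 0xF7, 4), (0xF8, 0xFB, 5),
                              (0xFC, 0xFD, 6)):
        if low <= lead_byte <= high:
            return length
    raise ValueError(f"not a valid first byte: {lead_byte:#04x}")


def check_properties(text: str) -> None:
    """Assert the first-byte and following-byte ranges for encoded text."""
    data = text.encode("utf-8")
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        # Every byte after the first lies in the range 0x80 to 0xBF.
        assert all(0x80 <= b <= 0xBF for b in data[i + 1:i + n])
        i += n
    assert i == len(data)


check_properties("Vis\u00e9an \u041c\u043e\u0441\u043a\u0432\u0430 \u1234 \U0010FFFD")

# Sorting by UTF-8 bytes gives the same order as sorting by code point.
words = ["R\u00f6mer", "Roemer", "\u041c\u043e\u0441\u043a\u0432\u0430", "\u1234"]
assert sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))
```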
16
Storing Unicode/UTF-8
Most compliant applications store characters 0 – 255 (0xFF) as a single
character
A few store Unicode as Unicode text strings (&#nnnn;)
UTF-8 characters outside of 0x00 – 0xFF are stored as a multibyte
sequence, the first byte indicating the number of following bytes
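The two storage styles mentioned above, numeric text strings of the form &#nnnn; and raw UTF-8 bytes, can be compared in Python. This is a sketch of the general idea rather than of any particular product's behaviour; `xmlcharrefreplace` is a standard error handler of `str.encode`.

```python
import html

text = "\u041c\u043e\u0441\u043a\u0432\u0430"  # "Moscow" in Cyrillic, 6 characters

# Style 1: ASCII-only storage, each character written as &#nnnn;
as_refs = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(as_refs)  # &#1052;&#1086;&#1089;&#1082;&#1074;&#1072;

# Style 2: UTF-8 storage, two bytes for each of these Cyrillic letters
as_utf8 = text.encode("utf-8")
print(len(as_utf8))  # 12

# Both forms recover the original text.
assert html.unescape(as_refs) == text
assert as_utf8.decode("utf-8") == text
```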
17
HTML & XML
For document and data interchange, the Internet and the World Wide
Web are more and more making use of marked-up text such as HTML
and XML
Because Unicode and UTF-8 “text” characters are interpreted by the
browser they may be included naturally in HTML or XML documents
However, in many instances, markup provides the same, or essentially
similar features to those provided by format characters in the Unicode
Standard for use in plain text and there may be conflict
Another special character category provided by Unicode are
compatibility characters. While there may be valid reasons to support
these characters and their specifications in plain text, their use in
marked-up text can conflict with the rules of the markup language
18
Support for Unicode/UTF-8
Support for Unicode/UTF-8 is very variable, even within product
families. For example, Microsoft Office 2000 supports most UTF-8
characters in data but Frontpage 2000 does not (it does support Unicode)
Some major players do not offer, or have only recently introduced,
compatibility
Database-level support is slowly becoming commonplace but interfaces,
programming-level support and clients are lagging behind. For example,
the latest version of the PHP scripting language does not formally
support UTF-8 (but encoding/decoding functions are available)
Application software depends on the underlying operating system.
Unicode/UTF-8 versions of products are therefore only available on later
versions of the operating system (e.g. Windows NT/2000 onwards,
MacOS 8.5 onwards). This is a real problem for application developers
19
Support for Unicode/UTF-8
However clever an application is, it can’t display Unicode/UTF-8 if a
capable font is not available!
The Arial Unicode MS font shipped with Microsoft Office 2000 can
display 51,180 characters. Arial Unicode MS is 23 MB and will have a
significant impact on the performance of your computer
Code 2000 contains many characters that are difficult to find elsewhere.
Apple computers need a compatible font installed, but it should handle
most Microsoft fonts
Bitstream Font Fusion is a new technology which promises to be able to
construct scalable Unicode fonts with much less impact – perhaps only
one sixteenth the size of current fonts.
20
Support for Unicode/UTF-8
An HTML document should include a metatag defining the character set
as UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
A compatible application, such as a browser with the default font set to
Arial Unicode MS, should then display Unicode correctly
Using only the <span> construct to display a UTF-8 character or
Unicode string is not enough:
<span style='font-family:"Arial Unicode MS"'>
For example, Internet Explorer and Netscape 7 will display the character
correctly but Word will not
21
Support for Unicode/UTF-8
Can one trust the products? Maybe not
The next slide shows a Microsoft Internet Explorer 6 representation of
an HTML document with the document encoding set to UTF-8 and the
browser default font set to Arial Unicode MS
Some of my sample characters are not displayed at all where the font is
defined as a non-Unicode font (rectangular boxes are displayed instead).
Curiously, Microsoft Word displays some of those that IE6 does not
Netscape 7 (the latest version) displays all characters regardless of what
the default font is set to. This relies on automatic font substitution, which
can be seen in the example, and works only if a suitable substitution font is available
22
Support for Unicode/UTF-8
Microsoft Internet Explorer 6
23
Support for Unicode/UTF-8
Netscape 7.0
24
Support for Unicode/UTF-8
More importantly, note that
setting the font to be italicised
can result in the wrong
character being displayed!
Look at the characters (should
be д), highlighted in red
The results are the same
with IE6 and Netscape 7.0
so the problem looks likely
to be in Windows font
management
25
Sorting Unicode/UTF-8
Sorting is not quite so straightforward as one might hope!
A few products have taken the approach that each character set “locale”
should be sorted independently. The effect of this is to separate the
“Latin” sets
Microsoft products (and others) sort all the Latin sets as a whole,
followed by each other set (Greek, then Cyrillic, etc.), one after another
There is no known way to sort Unicode “phonetically”
26
Sorting Unicode/UTF-8
Examples of data sorted by Microsoft Word 2000. The six different
representations of the city “Moscow” were found in less than 1 minute
using Google
Moscou
Moscov
Moscow
Moskow
Москва
Москов
Visean
Viséan
Roemer
Römer
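For comparison, the Python sketch below contrasts a plain code-point sort with a simple accent-insensitive sort. The second approach is a deliberate simplification (it is not the Unicode collation algorithm and would also strip marks that are significant in some languages), the helper names are invented, and "Rugby" is a hypothetical extra entry added only to make the difference visible.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def accent_insensitive_key(word: str):
    """Sort on base letters first, then on the original spelling."""
    return (strip_accents(word).casefold(), word)


names = ["R\u00f6mer", "Rugby", "Roemer"]

# Code-point order puts o-umlaut (U+00F6) after every unaccented letter.
by_code_point = sorted(names)
print(by_code_point)   # ['Roemer', 'Rugby', 'Römer']

# Ignoring the accent keeps the two spellings of the name together.
by_base_letter = sorted(names, key=accent_insensitive_key)
print(by_base_letter)  # ['Roemer', 'Römer', 'Rugby']
```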
27
Support in Data
Management Systems
A survey conducted in September 2002 of the thirteen Collection
Management Systems listed on the United Kingdom Museums
Documentation Association (MDA) web site revealed only two (ADLIB
and MUSIMS) which claimed full Unicode or UTF-8 compatibility
A further three (CALM, KE Emu and Questor ARGUS) are actively
working on multibyte solutions, one (Specify) did not reply but, given
the core database (SQLServer) in use, should be able to handle UTF-8
28
Support in Data
Management Systems
This has important implications for portal developers
Unmodified, a query may return data in a variety of formats: plain
ASCII, non-Western Windows Code Pages, bespoke fonts, Unicode text
strings, UTF-8/UTF-16, etc.
The onus will be on the provider to supply data mapped to UTF but there
are still likely to be inconsistencies and mapping errors
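One way a portal might cope with mixed input is sketched below in Python: try UTF-8 first, then fall back to a legacy code page. This is a heuristic under stated assumptions, not a reliable detector. The fallback list (Windows code page 1252 only) and the function name `decode_best_effort` are choices made for the example, and a wrong guess can still produce the mapping errors mentioned above.

```python
from typing import Sequence, Tuple


def decode_best_effort(raw: bytes,
                       fallbacks: Sequence[str] = ("cp1252",)) -> Tuple[str, str]:
    """Return (text, encoding used). UTF-8 is tried before any fallback."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be read and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacements)"


print(decode_best_effort("Vis\u00e9an".encode("utf-8")))
print(decode_best_effort("Vis\u00e9an".encode("cp1252")))
```

Recording which encoding was assumed for each record makes later correction of mapping errors much easier.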
29
Links & Acknowledgements
Much of the explanation of Unicode & UTF-8 originated in UTF-8 and Unicode FAQ for
Unix/Linux by Markus Kuhn (http://www.cl.cam.ac.uk/~mgk25/unicode.html) and is
reproduced by permission. The original document (and the whole site) also contains
many useful links
ISO 10646 can be ordered from
http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819
The Unicode Project is at http://www.unicode.org
Unicode in XML and other Markup Languages
http://www.unicode.org/unicode/reports/tr20/
There is a useful list of the capabilities of various Windows and Apple (OSX) fonts at
http://www.alanwood.net/unicode/index.html
There is a lighthearted, but informative, UTF-8 sampler at
http://www.columbia.edu/kermit/utf8.html#glass
Bitstream Font Fusion http://www.bitstream.com/categories/developer/fontfusion/index.html
30
In Conclusion …
31
We’re getting there! - but it’s a slow process
It will be some time before the majority of the applications have full
Unicode/UTF-8 compatibility, especially at the client interface
The development of web-based products (including portals enabling
searching over multiple datasets) is more promising. There are certainly
still problems, mainly at the database interface and in programming
languages, but delivery to a browser client (with the correct fonts
available) is very nearly a reality
Proper handling of multiple human languages, rather than characters, is
another story …
Adrian Rissoné
Information Systems Manager
Department of Palaeontology
The Natural History Museum
[email protected]
32