Transcript Unicode 3.0

Unicode 3.0.1
Mark Davis
www.macchiato.com
New 3.0 Characters
Category                   V 2.1     V 3.0
Alphabetics, Symbols       6,511    10,236
CJK Ideographs            21,204    27,786
Hangul Syllables          11,172    11,172
Assigned characters       38,887    49,194
Unassigned code values    18,134     7,827
Sync’ed with ISO/IEC 10646, 2nd edition
Unicode 3.0
New 3.0 Blocks
80 Syriac
176 Mongolian
192 Thaana
256 Braille
128 Sinhala
128 CJK Radicals Supplement
160 Myanmar
224 Kangxi Radicals
384 Ethiopic
16 Ideographic Description Characters
96 Cherokee
640 Unified Canadian Aboriginal Syllabics
32 Ogham
32 Bopomofo Extended
6,582 CJK Unified Ideographs Extension A
1,168 Yi Syllables
96 Runic
64 Yi Radicals
128 Khmer
Property Updates (1)
• Bidirectional properties
• Byte order mark
• Capital letters with iota adscript
• Case
• Combining classes
• Decompositions
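Several of the properties listed above can be inspected programmatically. The sketch below uses Python's standard unicodedata module, which tracks a much later Unicode version than 3.0; the sample characters are chosen for illustration and do not come from the slides.

```python
import unicodedata

# Sample characters chosen for illustration (not taken from the slides).
e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
acute = "\u0301"     # COMBINING ACUTE ACCENT
alef = "\u05d0"      # HEBREW LETTER ALEF

properties = {
    "category": unicodedata.category(e_acute),
    "bidi_class_latin": unicodedata.bidirectional(e_acute),
    "bidi_class_hebrew": unicodedata.bidirectional(alef),
    "combining_class": unicodedata.combining(acute),
    "decomposition": unicodedata.decomposition(e_acute),
}

for name, value in properties.items():
    print(f"{name}: {value}")
```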
Property Updates (2)
• Identifier Syntax
• Layout controls
• Linebreak properties
• East-Asian width properties
• Misc. Characters: Figure Space, Tilde,…
• Ligature Control
• Unassigned Code Points
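As one illustration of the East Asian width property, the sketch below estimates how many fixed-width columns a string occupies. The helper name display_columns is hypothetical, and counting every character that is not Wide or Fullwidth as one column (including ambiguous-width characters and combining marks) is a simplifying assumption, not a rule stated on the slide.

```python
import unicodedata

def display_columns(text: str) -> int:
    """Estimate columns: Wide (W) and Fullwidth (F) count as two, all else as one."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

# Sample characters chosen for illustration.
widths = {
    "A": unicodedata.east_asian_width("A"),            # narrow Latin letter
    "\u4e00": unicodedata.east_asian_width("\u4e00"),  # CJK ideograph
    "\uff21": unicodedata.east_asian_width("\uff21"),  # FULLWIDTH LATIN CAPITAL LETTER A
    "\uff76": unicodedata.east_asian_width("\uff76"),  # HALFWIDTH KATAKANA LETTER KA
}

for ch, width in widths.items():
    print(f"U+{ord(ch):04X} {width}")

print(display_columns("A\u4e00\uff21"))
```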
Conformance
• Unicode Transformation Formats
• UTF-16BE, UTF-16LE, UTF-16, UTF-8
• Unicode Bidirectional Behavior
• Other normative character property values
Clause numbering maintained!
• Stability Policies
• Clarification of noncharacters
• Normalization Conformance Test
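To make the transformation formats named above concrete, the sketch below encodes a single character with Python's built-in codecs. The sample character (U+20AC EURO SIGN) is an arbitrary choice, not from the slides. Note that Python's "utf-16" codec writes a byte order mark followed by the platform's native byte order, so its output can differ between machines.

```python
# Encode one BMP character in each encoding scheme listed on the slide.
ch = "\u20ac"  # EURO SIGN, an arbitrary sample character

utf8 = ch.encode("utf-8")         # three bytes, no byte-order issue
utf16be = ch.encode("utf-16-be")  # big-endian, no byte order mark
utf16le = ch.encode("utf-16-le")  # little-endian, no byte order mark
utf16 = ch.encode("utf-16")       # byte order mark, then native byte order

for name, data in [
    ("UTF-8", utf8),
    ("UTF-16BE", utf16be),
    ("UTF-16LE", utf16le),
    ("UTF-16", utf16),
]:
    print(f"{name:9} {data.hex(' ')}")
```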
Unicode Standard Annexes (UAX)
• Integral part of 3.0.1 Standard
• UAX #09: BIDI
• UAX #11: East Asian Width
• UAX #13: Newline Guidelines
• UAX #14: Line Breaking
• UAX #15: Normalization
• Included in any reference to version 3.0 or later
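As a small illustration of what UAX #15 normalization does, the sketch below converts between the precomposed and decomposed spellings of one character using Python's standard unicodedata module. The sample character is chosen for illustration, and the module implements a later Unicode version than 3.0.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The two spellings look identical but compare unequal as raw strings.
raw_equal = composed == decomposed

# NFC composes where possible; NFD decomposes fully.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(raw_equal, nfc == composed, nfd == decomposed)
```

Normalizing both sides to the same form before comparing is the usual way to treat such strings as equal.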
Unicode Technical Standards (UTS)
• UTS #06: Compression
– IANA name: SCSU
• UTS #10: Collation
– Note: defined over all Unicode code points
– Values will be updated soon for better ordering
Technical Reports
• UTR #07: Language Tags
• UTR #16: UTF-EBCDIC
• UTR #17: Character Encoding Model
• UTR #18: Regular Expressions
• UTR #19: UTF-32
• UTR #21: Case Mappings
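Case mappings are not always one character to one character, which is part of what UTR #21 addresses. The sketch below shows this with Python's built-in string methods; the sample word is chosen for illustration and is not from the slides.

```python
word = "Stra\u00dfe"  # "Straße": contains LATIN SMALL LETTER SHARP S

upper = word.upper()      # sharp s uppercases to two letters, "SS"
folded = word.casefold()  # case folding for caseless comparison

print(upper, folded, len(word), len(upper))

# Caseless matching: compare the folded forms rather than the raw strings.
matches = upper.casefold() == folded
print(matches)
```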
Draft Technical Reports
• UTR #20: Unicode in XML…
• UTR #22: Character Mapping Tables
• UTR #24: Script Names
• Open for public comment
Unicode Character Database
• More Documentation, More Data
– UnicodeData
– Blocks
– ArabicShaping
– Jamo
– CompositionExclusions
– SpecialCasing
– EastAsianWidth
– LineBreak
– Unihan
– BidiMirroring
– CaseFolding
– NormalizationTest
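The UnicodeData file is a semicolon-delimited text file with one record per line. A minimal sketch of reading one record follows; the names UnicodeDataRecord and parse_unicode_data_line are hypothetical, only a subset of the 15 fields is kept, and the sample line is the entry for U+0041.

```python
from typing import NamedTuple

class UnicodeDataRecord(NamedTuple):
    """A subset of the fields in one UnicodeData record (hypothetical type)."""
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str
    decomposition: str
    simple_uppercase: str
    simple_lowercase: str

def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    """Split one semicolon-delimited record and convert the numeric fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataRecord(
        code_point=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_class=fields[4],
        decomposition=fields[5],
        simple_uppercase=fields[12],
        simple_lowercase=fields[13],
    )

sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_unicode_data_line(sample)
print(record)
```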
Website changes
• New Look & Feel
• New Navigation
• Enhanced FAQ
• Glossary
• What is Unicode?
• Where is my character?
Beyond 3.0
• Characters
– CJK characters, symbols, music systems, ancient scripts, extra characters, etc.
– First allocated surrogate pairs
• Properties
– essential for Unicode enablement
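Characters allocated beyond U+FFFF are represented in UTF-16 as a surrogate pair. The sketch below shows the arithmetic; the function name to_surrogate_pair is hypothetical, and the sample code point U+1D11E (a musical symbol from a later Unicode version) is chosen only for illustration.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x1D11E).encode("utf-16-be")
print(encoded.hex(" "))
```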
Unicode 3.0
• Major new version
• Over 10,000 new characters
• Enhanced character data for implementations
• Reorganized text for better reference
• The version for normalization
• Unicode Character Database 3.0.0
• Available now!
Q&A
Backup Slides
ICU: Paid Advertisement
• Open Source Unicode Enablement Library
– ICU: C/C++ and Java Versions
– IBM Public License
– Friday, 10:00 Helena Shih
• http://oss.software.ibm.com/icu
Enumerated Versions
• Unicode 1.0.0, Unicode 1.0.1
• Unicode 1.1.0, Unicode 1.1.5
• Unicode 2.0.0
• Unicode 2.1.2, Unicode 2.1.5, Unicode 2.1.8, Unicode 2.1.9
• Unicode 3.0.0
– www.unicode.org
Editorial Committee
• Joan Aliprand
• John Jenkins
• Julie Allen (editor)
• Mike Ksar
• Joe Becker
• Rick McGowan
• Mark Davis
• Lisa Moore
• Asmus Freytag
• Ken Whistler
New Characters (2)
Category                   V 2.1     V 3.0
Private Use                6,400     6,400
Surrogates                 2,048     2,048
Controls                      65        65
Not Characters                 2         2
Assigned code values      47,402    57,709
Unassigned code values    18,134     7,827
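The totals in the two character-count tables can be checked with simple arithmetic: assigned characters plus private use, surrogates, controls, and noncharacters give the assigned code values, and adding the unassigned code values fills the 65,536-value 16-bit code space. The sketch below performs that check; pairing the counts 65 and 2 with Controls and Not Characters is inferred from the slide layout, and the dictionary keys are names introduced here for illustration.

```python
CODE_SPACE = 0x10000  # 65,536 code values in the 16-bit code space

counts = {
    "V 2.1": {
        "alphabetics_symbols": 6511, "cjk_ideographs": 21204,
        "hangul_syllables": 11172, "private_use": 6400, "surrogates": 2048,
        "controls": 65, "not_characters": 2, "unassigned": 18134,
    },
    "V 3.0": {
        "alphabetics_symbols": 10236, "cjk_ideographs": 27786,
        "hangul_syllables": 11172, "private_use": 6400, "surrogates": 2048,
        "controls": 65, "not_characters": 2, "unassigned": 7827,
    },
}

def assigned_characters(row: dict) -> int:
    """Sum of the three character categories from the first table."""
    return row["alphabetics_symbols"] + row["cjk_ideographs"] + row["hangul_syllables"]

def assigned_code_values(row: dict) -> int:
    """Assigned characters plus the special-purpose code values."""
    return (assigned_characters(row) + row["private_use"] + row["surrogates"]
            + row["controls"] + row["not_characters"])

for version, row in counts.items():
    total = assigned_code_values(row) + row["unassigned"]
    print(version, assigned_characters(row), assigned_code_values(row), total)
```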
Reference to Versions
• Open repertoire, but backwards compatible
• Characters only added, not removed
– Two early exceptions: ISO sync. & Korean
• Don’t overspecify the version:
– “Version 2.1.0” vs. “Version 2.1” vs. “Version 2 or later”
• Includes Technical Reports!!
Versions of the Standard
• major - significant additions
– published as a book
• minor - character additions or more significant normative changes
– published as a Technical Report
• update - any other changes
– on the website in /standard/versions/
• Example: 2.1.9
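The major.minor.update scheme above lends itself to a simple comparison when software needs to check that it runs against a given version "or later". The sketch below is one way to do that; the names UnicodeVersion, parse_version, and is_at_least are hypothetical, and treating omitted fields as zero is an assumption made here rather than a rule from the slides.

```python
from typing import NamedTuple

class UnicodeVersion(NamedTuple):
    major: int
    minor: int
    update: int

def parse_version(text: str) -> UnicodeVersion:
    """Parse 'major[.minor[.update]]'; omitted fields default to 0 (assumption)."""
    parts = [int(part) for part in text.strip().split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 numeric fields, got {text!r}")
    parts += [0] * (3 - len(parts))
    return UnicodeVersion(*parts)

def is_at_least(actual: str, minimum: str) -> bool:
    """True when 'actual' is the same as or later than 'minimum'."""
    return parse_version(actual) >= parse_version(minimum)

example = parse_version("2.1.9")
print(example)
print(is_at_least("3.0.1", "2"), is_at_least("2.1.9", "3.0"))
```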
Unicode 3.0
• Versioning
• Technical Reports
• Characters
• Properties
• Unicode Character Database
• Conformance
• Future
Reorganized Text
• 6: Punctuation
• 7: European Alphabetics
• 8: Middle Eastern
• 9: South Asian
• 10: East Asian
• 11: Other (Mongolian, etc.)
• 12: Symbols
• 13: Formatting, Controls, Specials
Additionally
• Shift-JIS Index
• Full Radical Stroke Index
– CJK split in several blocks
• Improved Charts
– Especially for CJK Ideographs
• Improved Implementation Guidelines
• General Clarifications