Unicode 4.0
Mark Davis
President, The Unicode Consortium
Note: slides differ from proceedings

Overview
• New Characters
• Conformance
• UAX: Unicode Standard Annexes
• UCD: Unicode Character Database
• UTS: Unicode Technical Standards
   • Not part of the Standard, but can claim conformance

Properties and Behavior
• Unicode is not just a list of characters
• Properties and behavior are crucial
• With them, new characters can work “out of the box”
• Some are part of the standard (BIDI, Normalization), others are associated (Collation, Regular Expressions)
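
To illustrate how properties travel with each character, here is a minimal Python sketch that reads a few of them from the standard-library unicodedata module. The helper name describe is hypothetical, and the module reflects whichever Unicode version ships with the interpreter, not necessarily 4.0.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few Unicode properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
    }

# A Latin letter, a Hebrew letter, and a combining accent.
for ch in ("A", "\u05D0", "\u0301"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

Code that consults properties like these, rather than hard-coded character lists, is what lets newly encoded characters work without changes.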

New Characters: 1,228
• Modern Scripts
   • (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac
   • (minority scripts) Limbu, Tai Le, Osmanya
• Historic Scripts
   • Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers
• Symbols
   • Monograms, digrams, tetragrams, other symbols
   • modifier & combining characters

New Characters (cont.)
• Special Characters
   • additional variation selectors (for future CJK variants), double-diacritics for dictionary use
• For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts.
• Character repertoire corresponds to ISO/IEC 10646:2003.

Conformance
• Substantially improved specification of conformance requirements
   • Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes
   • Tightened definitions of UTF-8, UTF-16, UTF-32
   • Separate definition of Unicode String
• Clarified conformance status of Unicode Standard Annexes
• Formal definitions of properties & algorithms
   • Provisional properties
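
The distinction between encoding forms (code units) and encoding schemes (byte serializations) can be sketched with Python's built-in codecs. This is only an illustration; the variable names are arbitrary, and the example character U+10400 was picked because it needs a surrogate pair in UTF-16.

```python
ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I, a supplementary character

# One code point, three encoding forms (shown here as big-endian bytes).
utf8 = ch.encode("utf-8")          # four 8-bit code units
utf16_be = ch.encode("utf-16-be")  # two 16-bit code units: D801 DC00
utf32_be = ch.encode("utf-32-be")  # one 32-bit code unit: 00010400

# Encoding schemes add byte order: same UTF-16 code units, different bytes.
utf16_le = ch.encode("utf-16-le")
utf16_bom = ch.encode("utf-16")    # BOM followed by platform byte order

for label, data in [("UTF-8", utf8), ("UTF-16BE", utf16_be),
                    ("UTF-16LE", utf16_le), ("UTF-32BE", utf32_be),
                    ("UTF-16", utf16_bom)]:
    print(f"{label:9} {data.hex()}")
```

The UTF-16BE and UTF-16LE lines carry the same two code units; only the byte serialization differs, which is the point of separating form from scheme.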

UTF vs. Unicode String
• Important Distinction
• UTF
   • Unique representation for Code Point
   • All else illegal, e.g. C0 80 (UTF-8), D800 0061 (UTF-16)
• Unicode String
   • Sequence of code units
   • Internal Processing, not interchange
   • Not necessarily valid UTF, e.g. C0 A0, D800 0061
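
The slide's byte sequences can be checked with a strict decoder. The sketch below uses Python's built-in codecs; is_well_formed is a hypothetical helper name. A Python str is a sequence of code points rather than code units, so the last part is only an analogue of a "Unicode String": an in-memory string may hold a lone surrogate that no UTF can carry in interchange.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """True if `data` decodes under strict rules for the given UTF."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed(b"\xC0\x80", "utf-8"))              # False: non-shortest form
print(is_well_formed(b"\xC0\xA0", "utf-8"))              # False: non-shortest form
print(is_well_formed(b"\xD8\x00\x00\x61", "utf-16-be"))  # False: unpaired surrogate
print(is_well_formed(b"\x00\x61", "utf-16-be"))          # True: plain 'a'

# Internal processing can tolerate what interchange cannot.
internal = "\ud800a"  # lone high surrogate followed by 'a'
try:
    internal.encode("utf-8")
    interchangeable = True
except UnicodeEncodeError:
    interchangeable = False
print(len(internal), interchangeable)  # 2 False
```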

Conformance (cont.)
• Formalized policies for stability of the standard
• Clarification of semantics of important characters, including BOM
• Revised scope of enclosing combining marks
• Revised semantics of ZWJ for cursive scripts
• Normalization Corrections
   • U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF
   • All corrections subject to strict stability constraints:
      • For 3.2 repertoire, NFC3.2(X) = NFC4.0(X)

Textual Clarifications
• Major changes to Chapters 2, 3, 6, 14 and 15
• Definitive terminology for code points:
   • graphic, format, control, private-use
      • = assigned characters
   • surrogate, noncharacter, reserved
      • not characters
• Substantial improvements to many character block descriptions, especially Indic
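
The seven code point types listed above can be approximated from General_Category. This is a rough sketch: classify is a hypothetical helper, the mapping from category to type is an assumption based on the terminology in the slide, and the result depends on the Unicode version bundled with Python, so a code point reported as reserved here may be assigned in a later version.

```python
import unicodedata

def classify(cp: int) -> str:
    """Rough code point type, derived from General_Category."""
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    gc = unicodedata.category(chr(cp))
    special = {
        "Cc": "control",
        "Cf": "format", "Zl": "format", "Zp": "format",
        "Co": "private-use",
        "Cs": "surrogate",
        "Cn": "reserved",   # unassigned and not a noncharacter
    }
    return special.get(gc, "graphic")

for cp in (0x41, 0x200D, 0x0A, 0xE000, 0xD800, 0xFFFE, 0x40000):
    print(f"U+{cp:04X}", classify(cp))
```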

Programming language identifiers
• Now backwards-compatible
   • Once a Unicode identifier, Always a Unicode identifier
• Alternate definition for complete stability
   • Fix set of allowed characters
   • Allow all reserved code points
   • + Complete stability
   • - “Odd” characters
• Also see new UTR on Syntax Characters
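
As one concrete case of property-based identifiers, Python defines its own identifier syntax in terms of the XID_Start and XID_Continue properties (plus the underscore), and exposes the check as str.isidentifier(). The candidate strings below are arbitrary examples.

```python
candidates = ["x2", "2x", "na\u00EFve", "gr\u00F6\u00DFe", "a-b", "\u5909\u6570"]

# Map each candidate to whether Python accepts it as an identifier.
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

Digits may continue an identifier but not start one, and the hyphen is a syntax character, so "2x" and "a-b" are rejected while the accented and CJK names are accepted.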

Case mappings now normative (but tailorable)
• Clearer definition of string functions:
   • isUpper(), isLower(), isTitle(), isFold()
   • toUpper(), toLower(), toTitle(), toFold()
• Definition of titlecase uses word boundaries
• Note that the Turkic mappings do not maintain canonical equivalence, without additional processing.
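
The relationship between the to...() and is...() functions can be sketched with Python's built-in default (untailored) case operations. The function names below are hypothetical, and defining is_upper(s) as "uppercasing leaves s unchanged" is one plausible reading of the slide, not a quotation from the standard. Titlecasing is left out because it depends on word boundaries.

```python
def to_upper(s: str) -> str:
    return s.upper()        # full default uppercase mapping

def to_lower(s: str) -> str:
    return s.lower()        # full default lowercase mapping

def to_fold(s: str) -> str:
    return s.casefold()     # full case folding, for caseless matching

def is_upper(s: str) -> bool:
    return to_upper(s) == s

def is_lower(s: str) -> bool:
    return to_lower(s) == s

def is_fold(s: str) -> bool:
    return to_fold(s) == s

print(to_upper("stra\u00DFe"))   # STRASSE (one character maps to two)
print(to_fold("Stra\u00DFe"))    # strasse
print(is_upper("ABC"), is_upper("Abc"))
print(to_upper("i"))             # I: no Turkic tailoring is applied by default
```

Under this reading a string with no cased characters, such as "123", counts as both upper and lower, which differs from Python's own str.isupper().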

UAX #9: BIDI
• BIDI: Arabic/Hebrew Display
   • HTML, all modern word processors, OSs,…
• New:
   • canonical equivalence now preserved
      • data change, not algorithm
   • shaping is done after reordering
      • but not across directional boundaries
   • clarifications of:
      • ZWJ, ZWNJ
      • intermediate level processing

UAX #15: Normalization
• Unique form for text comparison
   • W3C Character Model, Internationalized Domain Names, Network File System,…
• New:
   • Description of Stable Code Points.
   • Notation NFC(x) and isNFC(x), in Notation.
   • Added pointer to UTN #5 Canonical Equivalences in Applications
   • Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections.
   • Added Annex 13: Canonical Equivalence.
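
The NFC(x) and isNFC(x) notation maps directly onto a small sketch using Python's unicodedata.normalize. The function names nfc and is_nfc are chosen here to mirror the notation and are not part of any library.

```python
import unicodedata

def nfc(x: str) -> str:
    """NFC(x): canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", x)

def is_nfc(x: str) -> bool:
    """isNFC(x): true when x is already in NFC."""
    return nfc(x) == x

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE

print(nfc(decomposed) == composed)            # True
print(is_nfc(composed), is_nfc(decomposed))   # True False
print(nfc("\u212B") == "\u00C5")              # True: ANGSTROM SIGN maps to A WITH RING
```

Comparing nfc(a) == nfc(b) is the "unique form for text comparison" use named in the slide.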

UAX #14: Line Breaking
• Line-Break (word-wrap) all Unicode text
   • Customizable for different languages
• New:
   • Negative numbers and dates with hyphens will not break across lines
   • Word-Joiner will link any characters (except hard line breaks)
   • Behavior of soft hyphen clarified
      • marks opportunity for breaking, not specific graphic appearance.
   • Rules for GL relaxed: SP and ZW override
   • New Property Values: NL, WJ

UAX #29: Text Boundaries
• Default “User Character”, Word, Sentence boundaries
   • Customizable for different languages
   • Word, sentence: tailoring expected
• New:
   • Extracted from 3.0, but significantly revised
   • Grapheme cluster (“user character”)
      • Hangul Syllable or other Base
      • plus (optionally) any number of NSMs
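
The "base plus any number of NSMs" rule can be sketched as follows. This is a deliberate simplification and not the full UAX #29 algorithm: it treats General_Category Mn and Me as the non-spacing marks, treats every other character (including a precomposed Hangul syllable) as a base, and ignores conjoining jamo sequences and CR LF. The function name simple_clusters is hypothetical.

```python
import unicodedata

def simple_clusters(text: str) -> list:
    """Split text into base + non-spacing-mark clusters (simplified)."""
    clusters = []
    for ch in text:
        is_nsm = unicodedata.category(ch) in ("Mn", "Me")
        if is_nsm and clusters:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

print(simple_clusters("e\u0301a"))          # two clusters: e+acute, a
print(simple_clusters("a\u0308\u0323b"))    # two clusters: a+two marks, b
print(simple_clusters("\uD55C\uAE00"))      # two precomposed Hangul syllables
```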

No Substantive Changes
• UAX #11: East Asian Width
   • Guidelines for choosing character width
• UAX #24: Script Names
   • Default script assignment
   • Used in regular expressions
   • Now UAX
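
One common use of the East Asian Width property is estimating how many terminal columns a string occupies. The sketch below reads the property through Python's unicodedata.east_asian_width. The helper display_width is hypothetical, and counting Wide and Fullwidth as two columns and everything else (including Ambiguous characters and combining marks) as one is an assumption made for brevity.

```python
import unicodedata

def display_width(text: str) -> int:
    """Estimate column width: Wide/Fullwidth count 2, all else 1."""
    width = 0
    for ch in text:
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

for ch in ("A", "\u6F22", "\uFF21", "\uFF71"):
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch))

print(display_width("A\u6F22"))   # 3: one narrow column plus two wide columns
```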

Superseded UAXes
• Incorporated into and thus superseded by Unicode Version 4.0:
   • UAX #13: Unicode Newline Guidelines
   • UAX #19: UTF-32
   • UAX #21: Case Mappings
   • UAX #27: Unicode 3.1
   • UAX #28: Unicode 3.2

Unicode Character Database
• Crucial Component of Unicode
• Documentation coalesced into UCD.html.
• New properties and values
   • Hangul_Syllable_Type, Unicode_Radical_Stroke
   • CJK numeric values added.
   • PropertyValueAliases adds block names
• UCD fallback props more precisely defined.
   • for code points not explicitly in data files
• New Characters
   • Appropriate properties assigned

UCD4.0 (cont.)
• Modifier letters
   • The general category of 02B9..02BA, 02C6..02CF changed to general category Lm.
• Khmer
   • Two Khmer characters are deprecated; four others strongly discouraged.
• Decimal Digits
   • Numeric_Type=decimal digit now aligned with General_Category=Nd
• Braille
   • Added script value
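
Two of these data changes can be observed in any later Unicode version through Python's unicodedata module. The sample code points are chosen for illustration: two modifier letters from the ranges named above, one decimal digit, and one superscript digit that has a digit value but is not a decimal digit.

```python
import unicodedata

samples = ["\u02B9", "\u02C6", "\u0663", "\u00B2"]

# (code point, General_Category, decimal value or None)
rows = [(f"U+{ord(ch):04X}",
         unicodedata.category(ch),
         unicodedata.decimal(ch, None)) for ch in samples]

for row in rows:
    print(*row)
```

The modifier letters report Lm. ARABIC-INDIC DIGIT THREE is Nd with decimal value 3, while SUPERSCRIPT TWO is No and has no decimal value.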

UCD4.0 (cont. 2)
• Case Mapping
   • Fixed for Turkish, Lithuanian
• Default Ignorables
   • Hangul Filler characters
   • Soft-Hyphen, CGJ, ZWS
   • Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed.
• Grapheme_Extend
   • removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)

Unicode Technical Standard
• UTS: separate standard
   • independent conformance requirements
• UTR: information and guidelines
• Documents may move from UTR status to UTS

UTS #10: Unicode Collation
• Significance:
   • String comparison, matching, searching
   • Compares all Unicode characters
   • Handles linguistic features
      • Accents, Case, Punctuation,…
      • Contextual weighting,…
   • Tailor for different languages
• Version 4.0.0 due Sept. 2003
   • From now on, to be sync'ed in repertoire and version with the Unicode Standard.

UTS #18: Regular Exp.
• Significance:
   • Crucial to many applications: web, XML,…
   • Unicode adds significant requirements
• Level 1: Basic Support
   • Perl
• Level 2: Extended Support
• Level 3: Tailored Support
• New:
   • Recently approved as UTS (was UTR)
   • Adds clearer conformance requirements
      • Flexible list of features
      • Partial conformance claims
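
A few of the requirements Unicode places on regular expressions can be seen in Python's standard re module, shown here only as an example of partial support. On str patterns, \w and \d match by Unicode properties and IGNORECASE uses Unicode case mappings. The re module does not offer property syntax such as \p{...}, so this sketch makes no conformance claim against any UTS #18 level.

```python
import re

# "naive" with diaeresis, "cafe" with acute, two CJK ideographs,
# and two Arabic-Indic digits.
text = "na\u00EFve caf\u00E9 \u6771\u4EAC \u0663\u0664"

words = re.findall(r"\w+", text)    # letters and digits from any script
digits = re.findall(r"\d+", text)   # decimal digits from any script
caseless = re.fullmatch("\u00E9", "\u00C9", re.IGNORECASE) is not None

print(words)
print(digits)
print(caseless)
```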

UTS #6: SCSU
• Simple Unicode Compression
• Added suitability for XML
• See also Technical Note on BOCU
   • Main difference: preserves binary order
   • x < y => BOCU(x) < BOCU(y)

New UTRs
• Draft UTR #23: Character Properties
   • Draft Character Property Model
• Character Folding
   • Hiragana-Katakana, Case, …
• Programming Language IDs, Syntax characters

Q & A
• Other talks here:
   • Common Locale Data
      • interchange of language-specific data for sorting, dates, times, currencies
   • ICU
      • premier Unicode enablement library
      • full-featured, cross-platform
      • C, C++, Java
Background Slides

Unicode 3.2 (March, 2002)
• New Characters: 1,016
   • Symbols
      • Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets.
   • Special Characters
      • combining grapheme joiner, word joiner, invisible operators for math, variation selectors
   • Modern Scripts
      • minority scripts of the Philippines

Conformance
• Eliminates irregular UTF-8
• Defines variation sequences
• Replaces ZWNBSP with Word Joiner
• Clarifies scope of combining marks (further revised in 4.0)
• Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables,

Textual Clarifications
• Combined vowels in Khmer, characters discouraged in Khmer
• Use of dingbats

Unicode Standard Annexes
• UAX #21: Case Mappings (was UTR)

Unicode Character Database
• New properties:
   • IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph,
   • Default_Ignorable_Code_Point, Deprecated
   • Soft_Dotted, Logical_Order_Exception
   • Grapheme_Base, Grapheme_Extend, Grapheme_Link
   • DerivedAge
• Normalization Corrections
• Added Property & Property Value Aliases
• Adds StandardizedVariants.html

Related Items
• UTS #10: Unicode Collation Algorithm
   • Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, noncharacters
   • Note: base version still U3.1
• UTR #26: CESU-8
• Unicode Technical Notes
• Updated Character Encoding Stability Policy
• Added Public Review process
• Updated Glossary

Unicode 3.1 (March, 2001)
• New Characters: 44,946
   • First supplementaries encoded!
• Modern scripts
   • CJK Ideographs (now totaling 71,039)
• Historic scripts
   • Old Italic, Gothic, Deseret, Byzantine Musical Symbols
• Symbols
   • Mathematical Alphanumeric Symbols, (Western) Musical Symbols

Conformance
• Non-shortest-form UTF-8 excluded
• Clarification of the stability of the standard, code units vs. code points, non-characters, normative properties, informative properties, normative references
• Revisions of guidelines:
   • wchar_t, unassigned code points, identifiers
• Major revision of Georgian
• Use of ZWNJ and ZWJ for ligatures
• Language tag characters encoded
   • but discouraged

Unicode Standard Annexes
• UAX #19: UTF-32

Unicode Character Database
• Major revision of PropList properties:
   • White_Space, Bidi_Control, Join_Control, Hex_Digit
   • Alphabetic, Ideographic, Lowercase, Uppercase
   • ID_Start, ID_Continue, XID_Start, XID_Continue
   • Noncharacter_Code_Point
   • Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender
• New properties: Case folding, Scripts
• Added DerivedProperties, NormalizationTest

Related Items
• Documented Character Encoding Stability Policy
• UTS #10: Unicode Collation Algorithm
   • Merged data files; updated to base version 3.1
• UTR #18: Unicode Regular Expression Guidelines
• UTR #20: Unicode in XML and other Markup Languages
• UTR #22: Character Mapping Tables
• UTR #24: Script Names

Schedule
• 2003, April: UCD/UAXes
   • Final data files available
   • Implementation can proceed
• 2003, September:
   • Book Available