1,228 new characters:
Unicode 4.0
Mark Davis
President, The Unicode Consortium
Schedule
2003, April: UCD/UAXes
Final data files available
Implementation can proceed
2003, September: Book available
New Characters: 1,228
Modern Scripts
(additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac
(minority scripts) Limbu, Tai Le, Osmanya
Historic Scripts
Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers
Symbols
Monograms, digrams, tetragrams, other symbols
modifier & combining characters
New Characters (cont.)
Special Characters
additional variation selectors (for future CJK variants), double-diacritics for dictionary use
For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts.
Character repertoire corresponds to ISO/IEC 10646:2003.
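The Derived Age listing mentioned above can be filtered mechanically. The following is a minimal sketch, assuming the `range ; age # comment` layout used by DerivedAge.txt. The three sample lines are a hand-typed excerpt for illustration only, and `count_by_age` is a hypothetical helper name, not part of the UCD.

```python
from collections import Counter

# Hand-typed excerpt in the DerivedAge.txt layout (illustrative, not the full file).
SAMPLE = """\
0221          ; 4.0 #       LATIN SMALL LETTER D WITH CURL
10450..1047F  ; 4.0 #  [48] SHAVIAN LETTER PEEP..SHAVIAN LETTER YEW
1D300..1D356  ; 4.0 #  [87] MONOGRAM FOR EARTH..TETRAGRAM FOR FOSTERING
"""

def count_by_age(text):
    """Return a Counter mapping an age string (e.g. '4.0') to a code point count."""
    counts = Counter()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        rng, age = (field.strip() for field in data.split(";"))
        first, _, last = rng.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point or range
        counts[age] += end - start + 1
    return counts

print(count_by_age(SAMPLE)["4.0"])  # code points in the excerpt tagged 4.0
```

Run against the complete data file instead of the excerpt, the same loop would give the per-version totals.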
Conformance
Substantially improved specification of conformance requirements
Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes
Tightened definitions of UTF-8, UTF-16, UTF-32
Separate definition of Unicode String
Clarified conformance status of Unicode Standard Annexes
Formal definitions of properties & algorithms
Provisional properties: draft, NRFPT
UTF vs Unicode String
UTF
Unique representation for Code Point
All else illegal
C0 80
D800 0061
Unicode String
Sequence of code units
Internal Processing, not interchange
Not necessarily valid UTF
C0 A0
D800 0061
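The distinction on this slide can be checked with a strict decoder. This is a sketch using Python's built-in codecs. `is_valid` is a hypothetical helper name. Python's `str` is a sequence of code points rather than 16-bit code units, so it is only an analogy for the slide's "Unicode String".

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under the strict error handler."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# C0 80 and C0 A0 are non-shortest-form byte pairs, so neither is valid UTF-8.
print(is_valid(bytes([0xC0, 0x80]), "utf-8"))
print(is_valid(bytes([0xC0, 0xA0]), "utf-8"))

# D800 0061 is an unpaired high surrogate followed by 'a': not valid UTF-16.
print(is_valid(bytes([0xD8, 0x00, 0x00, 0x61]), "utf-16-be"))

# A string used for internal processing may still hold the lone surrogate...
internal = "\ud800a"
# ...but it cannot be serialized as UTF-8 for interchange.
try:
    internal.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected on encode")
```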
Conformance (cont.)
Formalized policies for stability of the standard
Clarification of semantics of important characters, including BOM
Revised scope of enclosing combining marks
Revised semantics of ZWJ for cursive scripts
Normalization Corrections
U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF
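To see what the five listed code points currently decompose to, one can ask a normalization library. This sketch uses Python's standard `unicodedata` module. The output depends on the Unicode data version bundled with the interpreter, so no specific mapping is claimed here. `nfd_code_points` is a hypothetical helper name.

```python
import unicodedata

# The five code points named under Normalization Corrections.
CORRECTED = [0x2F868, 0x2F874, 0x2F91F, 0x2F95F, 0x2F9BF]

def nfd_code_points(cp):
    """Return the code points of the NFD form of the character `cp`."""
    return [ord(c) for c in unicodedata.normalize("NFD", chr(cp))]

print("Unicode data version:", unicodedata.unidata_version)
for cp in CORRECTED:
    targets = " ".join(f"U+{t:04X}" for t in nfd_code_points(cp))
    print(f"U+{cp:04X} -> {targets}")
```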
Textual Clarifications
Major changes to Chapters 2, 3, 6, 14 and 15
Definitive terminology for code points:
graphic, format, control, private-use = assigned characters
surrogate, noncharacter, reserved = not characters
Substantial improvements to many character block descriptions, especially Indic
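The seven code point types named above can be approximated from the General_Category property. This is a sketch under stated assumptions: the mapping from categories to types (Cc control; Cf, Zl, Zp format; Co private-use; Cs surrogate; Cn noncharacter or reserved; everything else graphic) is this example's reading of the terminology, and `code_point_type` is a hypothetical function name.

```python
import unicodedata

def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven types on the slide."""
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "control"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    if gc == "Co":
        return "private-use"
    if gc == "Cs":
        return "surrogate"
    if gc == "Cn":
        return "reserved"  # unassigned in the interpreter's Unicode data
    return "graphic"       # letters, marks, numbers, punctuation, symbols, spaces

for cp in (0x0041, 0x000A, 0x200D, 0xE000, 0xD800, 0xFFFE, 0x0378):
    print(f"U+{cp:04X}", code_point_type(cp))
```

Which code points count as reserved depends on the Unicode version of the data in use.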
Programming language identifiers
Now backwards-compatible
Once a Unicode identifier, always a Unicode identifier
Alternate definition for complete stability
Fix set of allowed characters
Allow all reserved code points
+ Complete stability
- “Odd” characters
Case mappings now normative (but tailorable)
Clearer definition of string functions:
isUpper(), isLower(), isTitle(), isFold()
toUpper(), toLower(), toTitle(), toFold()
Definition of titlecase uses word boundaries
Note that the Turkic mappings do not maintain canonical equivalence, without additional processing.
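The string functions named above map naturally onto library calls. The sketch below uses Python's built-in `str` methods, which apply the default (untailored) full case mappings. Modeling each isX(s) as "toX(s) leaves s unchanged" is an assumption of this example, and the function names are hypothetical. Titlecasing is left out because Python's `str.title()` uses its own simple word detection, not the word boundaries referred to on the slide.

```python
def to_upper(s: str) -> str:
    return s.upper()

def to_lower(s: str) -> str:
    return s.lower()

def to_fold(s: str) -> str:
    return s.casefold()

# Assumption: a string "is X" when applying the X mapping does not change it.
def is_upper(s: str) -> bool:
    return to_upper(s) == s

def is_lower(s: str) -> bool:
    return to_lower(s) == s

def is_fold(s: str) -> bool:
    return to_fold(s) == s

word = "Stra\u00dfe"                      # "Straße"
print(to_upper(word), to_fold(word))      # full mappings may change the length
print(is_fold(word), is_fold(to_fold(word)))

# Default lowercasing of U+0130 (capital I with dot above) yields two code
# points, i + U+0307; Turkic text needs a tailored mapping instead.
print([f"U+{ord(c):04X}" for c in to_lower("\u0130")])
```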
UAX #9: The Bidirectional Algorithm
canonical equivalence now preserved
data change, not algorithm
shaping is done after reordering
but not across directional boundaries
clarifications of:
ZWJ, ZWNJ
intermediate level processing
UAX #14: Line Breaking Properties
Negative numbers and dates with hyphens will not break across lines
Word-Joiner will link any characters (except hard line breaks)
Behavior of soft hyphen clarified
marks opportunity for breaking, not specific graphic appearance.
Rules for GL relaxed
SP and ZW override GL
New Property Values: NL, WJ
UAX #15: Unicode Normalization Forms
Description of Stable Code Points.
Notation NFC(x) and isNFC(x), in Notation.
Added pointer to UTN #5 Canonical Equivalences in Applications
Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections.
Added Annex 13: Canonical Equivalence.
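The NFC(x) and isNFC(x) notation can be mirrored directly in code. This is a minimal sketch on top of Python's standard `unicodedata.normalize`. Defining isNFC(x) as NFC(x) == x is the straightforward reading of the notation, written out here for illustration.

```python
import unicodedata

def NFC(x: str) -> str:
    """Return the Normalization Form C of x."""
    return unicodedata.normalize("NFC", x)

def isNFC(x: str) -> bool:
    """True when x is already in Normalization Form C."""
    return NFC(x) == x

decomposed = "e\u0301"        # e + combining acute accent
composed = NFC(decomposed)    # single precomposed code point U+00E9

print(len(decomposed), len(composed))
print(isNFC(decomposed), isNFC(composed))
```

On Python 3.8 and later, `unicodedata.is_normalized("NFC", x)` answers the same question without building the normalized copy.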
UAX #29: Text Boundaries
New: extracted from 3.0, but significantly revised
Default definitions
Word, sentence: tailoring expected
Grapheme cluster (“user character”)
Hangul Syllable or other Base
plus (optionally) any number of NSMs
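The grapheme cluster definition above (a base followed by any number of nonspacing marks) can be sketched in a few lines. This is a deliberate simplification, not the full UAX #29 rule set: it treats every code point whose General_Category is Mn or Me as a nonspacing mark, treats everything else as a base, and ignores conjoining jamo sequences and CR LF. Precomposed Hangul syllables are single code points, so they fall out as bases. `simple_clusters` is a hypothetical name.

```python
import unicodedata

def simple_clusters(text: str) -> list:
    """Split text into base + nonspacing-mark groups (simplified)."""
    clusters = []
    for ch in text:
        is_nsm = unicodedata.category(ch) in ("Mn", "Me")
        if is_nsm and clusters:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster at each base
    return clusters

sample = "e\u0301a\u0308\u0323\ud55c"   # e+acute, a+diaeresis+dot below, Hangul syllable
for cluster in simple_clusters(sample):
    print([f"U+{ord(c):04X}" for c in cluster])
```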
No Substantive Changes
UAX #11: East Asian Width
UAX #24: Script Names (except now UAX!)
Superseded UAXes
Incorporated into and thus superseded by Unicode Version 4.0:
UAX #13: Unicode Newline Guidelines
UAX #19: UTF-32
UAX #21: Case Mappings
UAX #27: Unicode 3.1
UAX #28: Unicode 3.2
Unicode Character Database
Documentation coalesced into UCD.html.
UCD fallback props more precisely defined for code points not explicitly in data files
New properties and values
Hangul_Syllable_Type, Unicode_Radical_Stroke
CJK numeric values added.
PropertyValueAliases adds block names
New Characters
Appropriate properties assigned
UCD4.0 (cont.)
Modifier letters
The general category of 02B9..02BA, 02C6..02CF changed to Lm.
Khmer
Two Khmer characters are deprecated; four others strongly discouraged.
Decimal Digits
Numeric_Type=decimal digit now aligned with General_Category=Nd
Braille
Added script value
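Two of the property changes above can be spot-checked with a property lookup. This sketch uses Python's standard `unicodedata` module, which reflects whatever Unicode version ships with the interpreter (not necessarily 4.0). The sample characters are chosen for illustration, and `is_decimal_digit` is a hypothetical helper.

```python
import unicodedata

def is_decimal_digit(ch: str) -> bool:
    """True when the character has a decimal digit value in the UCD."""
    return unicodedata.decimal(ch, None) is not None

# Decimal digits versus other numeric characters.
samples = ["7", "\u0663", "\u00b2", "\u2460"]  # 7, Arabic-Indic 3, superscript 2, circled 1
for ch in samples:
    gc = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} gc={gc} decimal={is_decimal_digit(ch)}")

# Modifier letters from the ranges listed on the slide.
for cp in (0x02B9, 0x02BA, 0x02C6, 0x02CF):
    print(f"U+{cp:04X} gc={unicodedata.category(chr(cp))}")
```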
UCD4.0 (cont. 2)
Case Mapping
Fixed for Turkish, Lithuanian
Default Ignorables
Hangul Filler characters
Soft-Hyphen, CGJ, ZWS
Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed.
Grapheme_Extend
removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)
Related Items
Not part of Unicode 4.0, but closely related
UTS #10: Unicode Collation Algorithm
From 4.0 on, to be sync'ed in repertoire and version with the Unicode Standard.
UTS #6: SCSU
Added suitability for XML
Draft UTS #18: Unicode Regular Expressions
Draft as UTS with conformance requirements
Draft UTR #23: Character Properties
Draft Character Property Model
Q&A
Background Slides
Unicode 3.2 (March, 2002)
New Characters: 1,016
Symbols
Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets.
Special Characters
combining grapheme joiner, word joiner, invisible operators for math, variation selectors
Modern Scripts
minority scripts of the Philippines
Conformance
Eliminates irregular UTF-8
Defines variation sequences
Replaces ZWNBSP with Word Joiner
Clarifies scope of combining marks (further revised in 4.0)
Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables
Textual Clarifications
Combined vowels in Khmer, characters discouraged in Khmer
Use of dingbats
Unicode Standard Annexes
UAX #21: Case Mappings (was UTR)
Unicode Character Database
New properties:
IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph,
Default_Ignorable_Code_Point, Deprecated, Soft_Dotted, Logical_Order_Exception,
Grapheme_Base, Grapheme_Extend, Grapheme_Link
DerivedAge
Normalization Corrections
Added Property & Property Value Aliases
Adds StandardizedVariants.html
Related Items
UTS #10: Unicode Collation Algorithm
Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, noncharacters
Note: base version still U3.1
UTR #26: CESU-8
Unicode Technical Notes
Updated Character Encoding Stability Policy
Added Public Review process
Updated Glossary
Unicode 3.1 (March, 2001)
New Characters: 44,946
Modern scripts
CJK Ideographs (now totaling 71,039)
Historic scripts
First supplementaries encoded!
Old Italic, Gothic, Deseret, Byzantine Musical Symbols
Symbols
Mathematical Alphanumeric Symbols, (Western) Musical Symbols
Conformance
Non-shortest-form UTF-8 excluded
Clarification of the stability of the standard, code units vs. code points, non-characters, normative properties, informative properties, normative references
Revisions of guidelines: wchar_t, unassigned code points, identifiers
Major revision of Georgian
Use of ZWNJ and ZWJ for ligatures
Language tag characters encoded but discouraged
Unicode Standard Annexes
UAX #19: UTF-32
Unicode Character Database
Major revision of PropList properties:
White_Space, Bidi_Control, Join_Control, Hex_Digit
Alphabetic, Ideographic, Lowercase, Uppercase
ID_Start, ID_Continue, XID_Start, XID_Continue
Noncharacter_Code_Point
Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender
New properties: Case folding, Scripts
Added DerivedProperties, NormalizationTest
Related Items
Documented Character Encoding Stability Policy
UTS #10: Unicode Collation Algorithm
Merged data files; updated to base version 3.1
UTR #18: Unicode Regular Expression Guidelines
UTR #20: Unicode in XML and other Markup Languages
UTR #22: Character Mapping Tables
UTR #24: Script Names