Folded Trie: efficient data structure for all of Unicode

Download Report

Transcript Folded Trie: efficient data structure for all of Unicode

Folded Trie: Efficient Data
Structure for All of Unicode
Vladimir Weinstein
[email protected]
Globalization Center of Competency, San Jose, CA
21st International Unicode Conference
Dublin, Ireland, May 2002
1
Introduction
• A lot of data for each code point
• Need appropriate data structures
• Unicode version 3.1 introduced code points
into supplementary space – addressable range
grew to more than a million
• Repetitive data
• Sparsely populated range, especially the
supplementary space
21st International Unicode Conference
Dublin, Ireland, May 2002
2
Data Structures
• Arrays
– Advantages: very fast access time, fast write time
– Disadvantage: Unacceptable memory consumption
• Hash tables
– Advantages: Easy to use, Reasonably fast, General
– Disadvantages: High overhead, complicated sequential
access, slower than array lookup, data within ranges is
not shared
21st International Unicode Conference
Dublin, Ireland, May 2002
3
Data Structures (continued)
• Inversion Maps
– Advantages: simple, very compact, fast boolean
operations
– Disadvantages: worse access time than arrays and
possibly hash tables
• For more details see “Bits of Unicode” at
http://www.macchiato.com/slides/Bits_of_Unicode.ppt
21st International Unicode Conference
Dublin, Ireland, May 2002
4
Tries
• A trie is a structure with one or more indexes
and one data storage.
• Name comes from “Information Retrieval”
• Shares repetitive data
• Good compaction
• Not appropriate for frequently changing data
21st International Unicode Conference
Dublin, Ireland, May 2002
5
Single-Index Trie
• A trie structure with an index array and a data
array.
• Advantages
– Excellent size
– Very good access performance (two array accesses,
shift, mask and addition)
• Disadvantages
– Not appropriate for frequently changing data
– Index array gets too big when dealing with
supplementary code points
21st International Unicode Conference
Dublin, Ireland, May 2002
6
Single-Index Trie Diagram
UPPER_WIDTH
LOWER_WIDTH
LOWER_MASK
BMP code point Upper
15
Index
Lower
0
Data Array
Data
0
0
21st International Unicode Conference
Block
Block
Dublin, Ireland, May 2002
7
Double-Index Trie
•
•
Two index arrays and a data block
Compared to single-index trie:
1. Provides better compression of the index array
2. Worse performance, but still very fast
3. Feasible for supplementary code points
21st International Unicode Conference
Dublin, Ireland, May 2002
8
Double-Index Trie Diagram
UPPER_WIDTH MIDDLE_WIDTH LOWER_WIDTH
MIDDLE_MASK LOWER_MASK
Code point Upper
20
Index 1
Lower
Middle
0
Data
Index 2
0
Index1
Index2
0
Data
21st International Unicode Conference
Block
Dublin, Ireland, May 2002
9
Folded Trie
• Fast access for BMP code points
• Slower access for supplementary code points,
but far less frequent
• Compacts supplementary index
• Needs additional build time processing
• Fast address with UTF-16 code units
– no need to construct code point
21st International Unicode Conference
Dublin, Ireland, May 2002
10
Folded Trie – Supplementary Access Diagram
1
Lead Surrogate
15
110110..
Folded Trie
0
Has data for
surrogate block?
Same for the
surrogate block
3
No
2
Data
Yes
Lead Surrogate Data
Trail Surrogate
15
9
110111..
0
4
4
5
Pseudo Code Point
6
Index + Data
Final Data
• BMP code points access same as with single-index
21st International Unicode Conference
Dublin, Ireland, May 2002
11
ICU Implementation: UTrie
• ICU implementation is called UTrie
• Stores either 16 bit or 32 bit wide data
(extensible in the future)
• Up to 256K different data elements
• Can be frozen and reused as memory mapped
image for fast startup
• Using UTrie requires custom code
More about ICU at the end of presentation
21st International Unicode Conference
Dublin, Ireland, May 2002
12
Range Enumeration
• Allows enumerating over a set of
contiguous maximal ranges of
same data elements
• Elements can be preprocessed by
additional callback
• Saves time when processing the
whole Unicode range by
efficiently walking the trie
structure
21st International Unicode Conference
start-1
Element 1
start
Element 2
Element 2
Element 2
Element 2
Element 2
limit-1
Element 2
limit
Element 3
Dublin, Ireland, May 2002
13
Latin-1 Fast Path
• Build time option
• Allows direct array access for the Latin-1
range (0x00-0xFF)
• Latin-1 range is not compressed if this option
is used
• Appropriate when access for Latin-1 range is
critical
– collation
21st International Unicode Conference
Dublin, Ireland, May 2002
14
Example: Normalization Data
• Normalization data is stored using UTries
• For example, main data has the following
format
31
15
7
6
Extra data index
Combining class BCK FWD
Can be either:
-index to variable length data
- first part of supplementary
lookup value
-Special handling indicator
(Hangul, Jamo)
Combines back
5
3
QC_MAYBE
0
QC_NO
Values for normalization quick
check
Combines forward
• Variable-length data contains composition and
decomposition info
21st International Unicode Conference
Dublin, Ireland, May 2002
15
Example: Character Properties Data
• The result of UTrie lookup is an index
• Double indexing allows for even better compression,
since many code points have the same property value
• UTrie data width is 16 bit (thousands of data entries),
while the property data width is 32 bits (few hundred
unique data words).
Folded Trie
Index
Data
Property data
32 bits
16 bits
21st International Unicode Conference
Dublin, Ireland, May 2002
16
International Components for Unicode
• International Components for Unicode(ICU) is
a library that provides robust and full-featured
Unicode support
• Several library services use the common UTrie
implementation
• Wide variety of supported platforms
• open source (X license – non-viral)
• C/C++ and Java versions
• http://oss.software.ibm.com/icu/
21st International Unicode Conference
Dublin, Ireland, May 2002
17
Conclusion
• UTrie data structure provides good
compression with fast access
• The main constraint for usage is the nature of
the data that needs to be stored
• Designed for repetitive and sparse data
21st International Unicode Conference
Dublin, Ireland, May 2002
18
Q&A
21st International Unicode Conference
Dublin, Ireland, May 2002
19
Folding and Surrogate Access
• Folding process compacts the index for
supplementaries and moves it right above the
BMP index
• Access in ICU4C:
– Define a C callback, invoked when special lead
surrogate is detected
– Manually detect special lead surrogates
• In ICU4J, provide a subclass with a method
that detects special lead surrogates
21st International Unicode Conference
Dublin, Ireland, May 2002
20
Summary
•
•
•
•
•
•
•
•
Introduction: Storing Unicode data
Types of data structures
Tries
Single-index trie
Double-index trie
Folded trie
Usage of folded trie in normalization
Usage of folded trie for character properties
21st International Unicode Conference
Dublin, Ireland, May 2002
21