Migration of a 4GL and Relational Database to Unicode

Tex Texin
International Product Manager
© 1998, Progress Software Corporation
Presentation Goals
• Outline Migration Steps
• Describe Design Considerations
• Leverage Existing Double-byte Implementation
• Describe Impact on 4GL and Report Formats
PROGRESS Application Development Suite
• Powerful tools for the rapid creation of distributed business applications
• Creates character, GUI, or web-based clients with common source
• Host-based, client-server, or n-tier distribution on a variety of platforms
• Scalable, robust RDBMS and open
• International, double-byte enabled
Possible Configuration Options
[Diagram. Labels: GUI Client, Web-based Client, Character Client; Host-based, Client-Server, Optional n-tier; Application Server; Database Server; Progress Database; Other Database]
Why do our customers need Unicode?
• Many do not... However,
• Multinationals deploy across regions with incompatible character sets, yet they must share data between them.
• Programs are distributed worldwide with one container of text in many languages.
• Certain applications require multilingual databases, e.g. translation systems and web-based applications.
The Existing Architecture
• 1.5M lines of C code
• 0.3M lines of 4GL code
• Double-byte enabled
– CJK, 9 double-byte charsets supported
– 2-byte only, no 3 or 4-byte
– No shift-sequenced charsets
– DBE changes earmarked, easy to find
– 4 years, 3 developers, 2 QA
Estimated cost of implementing UCS-2 was very big!
• Changing to 16-bit text units affects almost every source module
– Largest cost is separating char variables based on usage for text or binary data
– Use 16-bit null terminators, ignore 8-bit zero bytes: “A” is 0041, 0000 and “Ā” is 0100, 0000
– Pointer arithmetic (advance 2 bytes)
– Sizing (bytes or characters)
– New API to use new WIDE TEXT datatype
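The null-terminator point can be illustrated with a minimal Python sketch. It is not the Progress code: the helper names `strlen_8bit`, `strlen_16bit`, and `to_ucs2z` are hypothetical, and big-endian UTF-16 stands in for UCS-2 (the two coincide for the characters used here).

```python
import struct

def to_ucs2z(text: str) -> bytes:
    """Encode text as big-endian 16-bit units plus a 16-bit terminator (assumed layout)."""
    return text.encode("utf-16-be") + b"\x00\x00"

def strlen_8bit(buf: bytes) -> int:
    """Legacy byte-oriented scan: stops at the first zero byte."""
    return buf.index(0)

def strlen_16bit(buf: bytes) -> int:
    """16-bit scan: stops at the first 0x0000 unit and returns the unit count."""
    for offset in range(0, len(buf) - 1, 2):
        (unit,) = struct.unpack_from(">H", buf, offset)
        if unit == 0:
            return offset // 2
    raise ValueError("no 16-bit terminator found")

buf = to_ucs2z("A\u0100")   # bytes: 00 41 01 00 00 00
print(strlen_8bit(buf))      # 0: the byte scan stops inside "A"
print(strlen_16bit(buf))     # 2: two 16-bit text units
```

Every routine that scans for a zero byte or advances a pointer by one byte needs the same kind of change, which is consistent with the slide's claim that almost every source module is affected.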
Product requirements for a multilingual version
• Minimize cost for application migration
• Minimize cost for application upgrade
• Minimize support cost
– One executable!
• Maintain user-definable character sets
• Add UTF-8 as just another character set
– UTF-8 algorithms are compatible with other charsets
Scaled down multilingual proposal: UTF-8 implementation
• Implement UTF-8 as 3-byte character set
– Leverage & extend double-byte enabling
– Places to change are already earmarked
– Restrict to composed characters for now
– Restrict to no surrogates
– Supports all the markets we are in
• UTF-8-enable 4GL and RDBMS first
– Provides multilingual logic and storage
– Java+other client technologies coming
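A short Python sketch of why the "no surrogates" restriction makes UTF-8 a character set of at most 3 bytes per character. The function name `utf8_len` is hypothetical; the thresholds are the standard UTF-8 ranges. The last lines illustrate what "composed characters" means, using Python's own normalizer rather than anything from the product.

```python
import unicodedata

def utf8_len(code_point: int) -> int:
    """Number of UTF-8 bytes needed for one code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:   # everything reachable without surrogates
        return 3
    return 4                   # needs a surrogate pair in UTF-16

for ch in ("A", "\u00e9", "\u65e5", "\U00010400"):
    # The arithmetic agrees with Python's own encoder.
    assert utf8_len(ord(ch)) == len(ch.encode("utf-8"))

# Without surrogates, no character needs more than 3 bytes.
longest_without_surrogates = max(utf8_len(cp) for cp in range(0x10000))

# "Composed" form: one code point for e-acute instead of "e" plus a combining accent.
composed = unicodedata.normalize("NFC", "e\u0301")
```

Here `longest_without_surrogates` is 3 and `composed` is the single code point U+00E9, so double-byte logic extended by one byte covers the restricted repertoire.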
Architecture changes
UTF-8-enabling the string library
• N-byte-enable character and string functions
– GetNextChar, GetPreviousChar
– GetCharacterSize (table-based)
– Modified IsFirstByte
• New GetColumnLength
• New datatype normalized “BIG” char
• Minor algorithm changes for efficiency
– Find Character
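A rough Python sketch of what table-based, N-byte-enabled stepping functions could look like for UTF-8. The function names echo the slide, but their signatures and behavior here are assumptions, not the Progress library's API. The table deliberately stops at 3-byte characters, matching the "no surrogates" restriction.

```python
# 256-entry table: size of the character a byte introduces,
# or 0 if the byte cannot start a character in this sketch.
CHAR_SIZE = [0] * 256
for b in range(0x00, 0x80):
    CHAR_SIZE[b] = 1          # ASCII
for b in range(0xC2, 0xE0):
    CHAR_SIZE[b] = 2          # 2-byte lead bytes
for b in range(0xE0, 0xF0):
    CHAR_SIZE[b] = 3          # 3-byte lead bytes

def is_first_byte(b: int) -> bool:
    """True if b can start a character (tail bytes 0x80-0xBF cannot)."""
    return CHAR_SIZE[b] != 0

def get_character_size(buf: bytes, pos: int) -> int:
    """Table lookup on the lead byte at pos."""
    size = CHAR_SIZE[buf[pos]]
    if size == 0:
        raise ValueError(f"byte at offset {pos} is not a lead byte")
    return size

def get_next_char(buf: bytes, pos: int) -> int:
    """Offset of the character after the one starting at pos."""
    return pos + get_character_size(buf, pos)

def get_previous_char(buf: bytes, pos: int) -> int:
    """Offset of the character before pos: back up over tail bytes."""
    if pos <= 0:
        raise ValueError("already at start of buffer")
    pos -= 1
    while pos > 0 and not is_first_byte(buf[pos]):
        pos -= 1
    return pos

def char_starts(buf: bytes) -> list:
    """Walk forward and collect the offset of every character."""
    starts, pos = [], 0
    while pos < len(buf):
        starts.append(pos)
        pos = get_next_char(buf, pos)
    return starts
```

For `"a\u00e9\u65e5".encode("utf-8")` (6 bytes), `char_starts` yields offsets 0, 1, and 3, and `get_previous_char(buf, 6)` returns 3. Because tail bytes never look like lead bytes in UTF-8, stepping backward needs no context, unlike many legacy double-byte encodings.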
Architecture changes
UTF-8-enabling character tables
• String libraries use character tables
– Alphanumeric, Lead-byte, Tail-byte
– Upper, lower case (700+ characters)
• New property ColumnCount
• New table formats
– Old architecture presumed a 256-byte table
– Now organized by range lists and a trie
• Update table compiler & allow hex entry
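A minimal Python sketch of a range-list lookup for a per-character property such as ColumnCount. The ranges below are a small illustrative subset chosen for this example, not the product's table and not a complete width table; the names `WIDE_RANGES`, `column_count`, and `column_length` are hypothetical. The trie mentioned on the slide is not shown.

```python
from bisect import bisect_right

# Sorted, non-overlapping (first, last) code point ranges that occupy
# two display columns. Illustrative subset only.
WIDE_RANGES = [
    (0x3041, 0x3096),   # Hiragana
    (0x30A1, 0x30FA),   # Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0xAC00, 0xD7A3),   # Hangul syllables
    (0xFF01, 0xFF60),   # Fullwidth forms
]
_RANGE_STARTS = [first for first, _last in WIDE_RANGES]

def column_count(ch: str) -> int:
    """Binary-search the range list; characters outside every range take 1 column."""
    cp = ord(ch)
    i = bisect_right(_RANGE_STARTS, cp) - 1
    if i >= 0 and cp <= WIDE_RANGES[i][1]:
        return 2
    return 1

def column_length(text: str) -> int:
    """Total display columns for a string."""
    return sum(column_count(ch) for ch in text)
```

A flat 256-entry table cannot index tens of thousands of code points without wasting memory, whereas a range list stores one entry per run of characters sharing a property value. With these ranges, `column_length("A\u65e5\u672c")` is 5.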
Architecture changes
UTF-8-enabling sorting
• How to sort multilingual data?
• Binary sort used for double-byte data
• With UTF-8, Europe is 2-byte, CJK 3-byte
• Solution
– Binary sort on server
– Client uses native sort
• Bump key length limit for UTF-8
• Next phase will be enhanced sort
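A small Python sketch of what a binary sort on UTF-8 data produces. The word list is invented for illustration. It relies on a known property of UTF-8: comparing the encoded bytes gives the same order as comparing code points.

```python
words = ["zebra", "\u00c4pfel", "apple", "\u65e5\u672c", "Z\u00fcrich"]

# Server-style binary sort: compare the raw UTF-8 bytes.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Python compares str values by code point, which matches UTF-8 byte order.
code_point_order = sorted(words)

# Key sizes in bytes: accented Latin letters take 2 bytes, CJK takes 3.
key_bytes = {w: len(w.encode("utf-8")) for w in words}

print(binary_order)
# ['Zürich', 'apple', 'zebra', 'Äpfel', '日本']
```

ASCII sorts first, then 2-byte European letters, then 3-byte CJK, so "Äpfel" lands after "zebra". That order is stable and cheap for index keys but is not what a German user expects, which is presumably why the slide leaves native ordering to the client. The byte counts in `key_bytes` show why a key limit sized for single-byte or double-byte data has to grow.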
Architecture changes
Character conversion algorithms
• Existing, user-definable, conversions
– Single-byte character set table maps
– Double-byte Shift-JIS - EUCJIS algorithm
• New table-driven automated conversions
– Single-byte to UTF-8, and back
– Double-byte to UCS-2 and back
– UTF-8 - UCS-2
– Trie for speed and memory optimization
• Requires significant QA for data integrity
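A minimal Python sketch of a table-driven single-byte to UTF-8 conversion and its inverse, with the kind of round-trip check that data-integrity QA would run. ISO 8859-15 is chosen only as an example character set, and the table is built from Python's built-in codec for convenience; a real table would presumably be compiled from a charset definition. All names are hypothetical.

```python
# Forward table: one UTF-8 byte sequence per single-byte code (0..255).
TO_UTF8 = [bytes([b]).decode("iso8859-15").encode("utf-8") for b in range(256)]

# Reverse table: UTF-8 byte sequence back to the single-byte code.
FROM_UTF8 = {seq: b for b, seq in enumerate(TO_UTF8)}

def single_byte_to_utf8(data: bytes) -> bytes:
    """Convert by pure table lookup, one input byte at a time."""
    return b"".join(TO_UTF8[b] for b in data)

def utf8_to_single_byte(data: bytes) -> bytes:
    """Split the input into UTF-8 characters and look each one up."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        lead = data[pos]
        size = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3
        seq = data[pos:pos + size]
        if seq not in FROM_UTF8:
            raise ValueError(f"no mapping for bytes at offset {pos}")
        out.append(FROM_UTF8[seq])
        pos += size
    return bytes(out)

# Round-trip every possible byte value: nothing may be lost or changed.
every_byte = bytes(range(256))
round_trip_ok = utf8_to_single_byte(single_byte_to_utf8(every_byte)) == every_byte
```

In this character set byte 0xA4 is the euro sign, which becomes the 3-byte sequence E2 82 AC in UTF-8, so even a "European" conversion can produce 3-byte output. Characters with no entry in the reverse table raise an error here; a production converter would need a defined policy for them.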
Architecture changes
Impact on the 4GL user
• 4GL is character set independent
• Almost all functions are character-based
• 3 functions require optional byte-basing
– Length, Substring, Overlay
– Options: Byte, Character
• Add new option: Column
• Format (Picture) Phrase
– “XXXX” has a different meaning for UTF-8
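The three measurement options can be illustrated with a Python stand-in. This is not 4GL syntax: the `length` function and its `unit` argument are hypothetical, and the column rule (2 columns for East Asian wide or fullwidth characters, 1 otherwise) is an assumption made for the example.

```python
import unicodedata

def length(text: str, unit: str = "character") -> int:
    """Measure text in characters, UTF-8 bytes, or display columns."""
    if unit == "character":
        return len(text)
    if unit == "byte":
        return len(text.encode("utf-8"))
    if unit == "column":
        return sum(
            2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
            for ch in text
        )
    raise ValueError(f"unknown unit: {unit}")

sample = "A\u00e9\u65e5"             # "A", e-acute, and one CJK ideograph
print(length(sample, "character"))   # 3
print(length(sample, "byte"))        # 6  (1 + 2 + 3)
print(length(sample, "column"))      # 4  (1 + 1 + 2)
```

The three answers differ for the same string, which is why character-based defaults keep most 4GL code unchanged while storage and report layout need the byte and column options. One reading of the format remark is that four "X" positions no longer imply a fixed number of bytes or screen columns.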
Status
• Functioning Well
• Going to second beta
• Implemented with very low cost
• Performance is OK
– Metrics not yet available
• Testing is most significant cost
– Reviewing all character set properties
– Evaluating all conversions
Futures
• For the Progress International Team
– Multilingual Clients
– Enhanced Character Folding
– Enhanced Sorting
• For Progress Customers
– Deployment of multilingual databases
– Worldwide access to these databases
– Worldwide deployment of multi-language applications
Conclusions
• Migration can be achieved in phases
• Migration through UTF-8 can be low cost
• Double-byte applications can migrate easily to UTF-8
• Asian users can integrate with other languages now
• Non-English users can integrate with Asian languages now