Transcript ICU4J

IBM Software Group

Unicode Text and Regular Expressions

Andy Heninger, 9/9/2004. © 2003 IBM Corporation

Overview

- Regular Expressions have long been used for
  - Searching text data
  - Parsing, extracting fields
  - Text manipulation, find & replace
- Regular Expressions and Unicode text data are a good match.
- Regular Expression languages have evolved new features to work more conveniently and powerfully with Unicode data.
- The focus of this talk is on these Unicode-related features.

What Are Regular Expressions

- Think of Wildcards
- Select or match text
- Available in editors, languages, tools, databases
- Not the topic today

  Literal text    Matches itself
  *               Match 0 or more times
  +               Match one or more times
  [a-z]           Character Range. Match any one
  (whatever)      Grouping
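
The constructs listed above can be tried out directly. The following is a minimal sketch using Java's java.util.regex; the choice of Java and the class name BasicRegexDemo are assumptions made for illustration and are not part of the talk.

```java
import java.util.regex.Pattern;

class BasicRegexDemo {
    // True when the entire input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("cat", "cat"));       // literal text matches itself
        System.out.println(matches("ab*c", "ac"));       // * accepts zero occurrences of b
        System.out.println(matches("ab+c", "ac"));       // + needs at least one b, so no match
        System.out.println(matches("[a-z]", "q"));       // any one character in the range
        System.out.println(matches("(ab)+", "ababab"));  // grouping lets + apply to "ab"
    }
}
```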

Character Ranges

- [a-z]
  - Match any one character falling in the specified range
  - Relies on the existence of some ordering of characters, to determine what falls between a and z. Typically charset order.
- Only works for English
  - No accented characters
  - No letters from other alphabets (Greek, Arabic, etc.)
- Still widely used.
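
To make the limitation concrete, here is a small sketch in Java (java.util.regex) showing [a-z] rejecting accented and non-Latin letters. The class and method names are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Pattern;

class CharacterRangeDemo {
    // True when every character of the input falls in the range a..z.
    static boolean isAsciiLower(String s) {
        return Pattern.matches("[a-z]+", s);
    }

    public static void main(String[] args) {
        String resumeAccented = "r\u00e9sum\u00e9";   // "resume" written with e-acute
        String greek = "\u03b1\u03b2\u03b3";          // Greek alpha, beta, gamma
        System.out.println(isAsciiLower("resume"));
        System.out.println(isAsciiLower(resumeAccented));
        System.out.println(isAsciiLower(greek));
    }
}
```

Only the first call succeeds; the accented and Greek letters lie outside the a to z code range even though they are lowercase letters.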

POSIX Character Classes

- Remove dependency on charset ordering
- Convenient, more likely to be correct than [a-z]

  [:alnum:]  [:cntrl:]  [:lower:]   [:space:]
  [:alpha:]  [:digit:]  [:xdigit:]  [:blank:]
  [:print:]  [:graph:]  [:upper:]   [:punct:]

- Implementers must provide definitions for different charsets

POSIX -> Unicode

- Unicode has a very rich character property system
- Unicode TR 18 defines POSIX classes in terms of properties

  [:alpha:]    Alphabetic = TRUE
  [:digit:]    General Category = Decimal Number
  [:space:]    White Space = TRUE
  [:upper:]    Uppercase = TRUE

- Direct access to Unicode properties in Character Set expressions is a key feature for Unicode Regular Expressions.
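
As one illustration of POSIX classes being redefined through Unicode properties, the sketch below uses Java's java.util.regex, where \p{Alpha} and \p{Digit} are ASCII-only by default and follow the Unicode property definitions when the UNICODE_CHARACTER_CLASS flag (Java 7 and later) is set. The class name and helper are hypothetical.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // Matches the whole input against a POSIX-style class, with or without
    // the UNICODE_CHARACTER_CLASS flag.
    static boolean matches(String posixClass, String input, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile(posixClass + "+", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";       // LATIN SMALL LETTER E WITH ACUTE
        String arabicThree = "\u0663";  // ARABIC-INDIC DIGIT THREE
        System.out.println(matches("\\p{Alpha}", eAcute, false));      // ASCII definition
        System.out.println(matches("\\p{Alpha}", eAcute, true));       // Alphabetic = TRUE
        System.out.println(matches("\\p{Digit}", arabicThree, false)); // ASCII definition
        System.out.println(matches("\\p{Digit}", arabicThree, true));  // Decimal Number
    }
}
```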

A Quick Look at the Unicode General Category

- Central to Regular Expressions with Unicode Text
- Categorize every character as one of
  - Letter
  - Number
  - Separator
  - Punctuation
  - Marks
  - Symbols
  - Others
- Subcategories within each. Examples
  - Letter: Uppercase, lowercase, Other, …
  - Symbols: Math, Currency, Modifiers, …
  - Mark: spacing, non-spacing, enclosing

Unicode Property Based Character Classes

- TR 18 Recommended Properties for Basic Unicode support include
  - General Category
  - Script
  - Alphabetic
  - Uppercase
  - Lowercase
  - White Space
- Examples:

  [:Script=Greek:]      POSIX syntax
  [\p{Script=Greek}]    Perl syntax
  [\p{Alphabetic}]
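
A short sketch of property-based classes in practice, using Java's java.util.regex (Java 7 or later for the script= keyword). Java spells the Alphabetic property \p{IsAlphabetic}; the class and method names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyClassDemo {
    // Returns the first run of characters matching the class, or "" if none.
    static String firstRun(String charClass, String input) {
        Matcher m = Pattern.compile(charClass + "+").matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Digits, a Latin word, then a Greek word, all lowercase.
        String text = "42 alpha \u03b1\u03bb\u03c6\u03b1";
        System.out.println(firstRun("\\p{script=Greek}", text));  // the Greek word
        System.out.println(firstRun("\\p{IsAlphabetic}", text));  // the Latin word comes first
        System.out.println(firstRun("\\p{Lu}", text).isEmpty());  // no uppercase letters present
    }
}
```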

Set Operations

  [^\p{Letter}]                       Negation
  [\p{Letter}\p{Number}]              Union
  [\p{Letter}&\p{script=Cyrillic}]    Intersection
  [\p{Letter}-\p{Latin}]              Difference

- Important for a character set the size of Unicode.
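
The four set operations can be sketched in Java's java.util.regex, with the caveat that Java's syntax differs from the notation above: intersection is written && and difference is expressed as intersection with a negated set. The class name and constants are hypothetical.

```java
import java.util.regex.Pattern;

class SetOperationsDemo {
    static final String NEGATION     = "[^\\p{L}]";                   // anything but a letter
    static final String UNION        = "[\\p{L}\\p{N}]";              // letter or number
    static final String INTERSECTION = "[\\p{L}&&[\\p{IsCyrillic}]]"; // Cyrillic letters
    static final String DIFFERENCE   = "[\\p{L}&&[^\\p{IsLatin}]]";   // letters that are not Latin

    // True when the single-character input is a member of the set.
    static boolean contains(String set, String oneChar) {
        return Pattern.matches(set, oneChar);
    }

    public static void main(String[] args) {
        String zhe = "\u0436";  // CYRILLIC SMALL LETTER ZHE
        System.out.println(contains(NEGATION, "5"));
        System.out.println(contains(UNION, "5"));
        System.out.println(contains(INTERSECTION, zhe));
        System.out.println(contains(DIFFERENCE, "z"));
    }
}
```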

Script and Block Properties

- [\p{script=Thai}] [\p{block=Thai}]
- Unicode Script Property
  - Categorizes each character by script: Latin, Cyrillic, Arabic, etc.
  - Shared characters classified as “Common”. Numbers, punctuation, etc.
  - Not the same as Language.
- Unicode Block Property
  - Categorizes by block: a contiguous range of characters.
  - Basic Latin, Latin-1 Supplement, Latin Extended A, Latin Extended B
  - Greek, Hebrew, and more.
  - Has Limitations
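
The difference between the two properties shows up on characters that sit inside a block without belonging to its script. A minimal sketch in Java's java.util.regex, where \p{IsThai} tests the script and \p{InThai} tests the block; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ScriptBlockDemo {
    // True when the single character has Script = Thai.
    static boolean inThaiScript(String oneChar) {
        return Pattern.matches("\\p{IsThai}", oneChar);
    }

    // True when the single character lies in the Thai block.
    static boolean inThaiBlock(String oneChar) {
        return Pattern.matches("\\p{InThai}", oneChar);
    }

    public static void main(String[] args) {
        String koKai = "\u0e01";  // THAI CHARACTER KO KAI, a Thai letter
        String baht = "\u0e3f";   // THAI CURRENCY SYMBOL BAHT, script is Common
        System.out.println(inThaiScript(koKai) + " " + inThaiBlock(koKai));
        System.out.println(inThaiScript(baht) + " " + inThaiBlock(baht));
    }
}
```

The baht sign is in the Thai block, yet its script is Common, so a block test and a script test give different answers for it.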

Code Points, Code Units, UTF 8/16/32

- Matching happens on Code Points (0 to 10FFFF)
- UTF-8 bytes or UTF-16 Surrogate Halves not visible
- Match results independent of encoding form.
- Glitches
  - Implementations without surrogate support
  - Perl’s \x
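
Code point matching can be observed in Java, whose strings are UTF-16 but whose java.util.regex engine matches by code point. This is a sketch with hypothetical names; the \x{...} hex notation assumes Java 7 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodePointDemo {
    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // 'a', then U+1D538 (stored as a surrogate pair), then 'b'.
        String s = "a\uD835\uDD38b";
        System.out.println(s.length());            // UTF-16 code units
        System.out.println(countMatches(".", s));  // code points seen by the matcher
        System.out.println(Pattern.matches("\\x{1D538}", "\uD835\uDD38"));
    }
}
```

The string holds four UTF-16 code units, but the dot matches three times: the surrogate pair is seen as one character.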

Normalization

- \p{Alphabetic} and \p{Non Spacing Mark}: the same text can be encoded as ñ (one precomposed character) or as n followed by a combining tilde ˜ (a non-spacing mark).

[Figure: the word "niña" shown in both encodings.]

Normalization

- Approaches to the Problem
  - Data may be pre-normalized, nothing extra needed.
  - Use Normalization option, if available.
  - Application Normalizes the data first
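
The third approach, where the application normalizes first, might look like the following sketch in Java, using java.text.Normalizer before matching. The class and method names are hypothetical. Java's Pattern.CANON_EQ flag is one example of the "normalization option" route, though it is not exercised here.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class NormalizationDemo {
    // The pattern is written with the precomposed n-tilde (U+00F1).
    static final Pattern NINA = Pattern.compile("ni\u00f1a");

    // Matches the input exactly as given.
    static boolean matchesRaw(String input) {
        return NINA.matcher(input).matches();
    }

    // Normalizes the input to NFC first, then matches.
    static boolean matchesNormalized(String input) {
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        return NINA.matcher(nfc).matches();
    }

    public static void main(String[] args) {
        String decomposed = "nin\u0303a";  // n followed by COMBINING TILDE
        System.out.println(matchesRaw(decomposed));         // encodings differ
        System.out.println(matchesNormalized(decomposed));  // same text after NFC
        // In decomposed form the letter and the mark are separate characters.
        System.out.println(Pattern.matches("\\p{IsAlphabetic}\\p{Mn}", "n\u0303"));
    }
}
```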

Line Endings

- Unicode has More

  \u000A           Line Feed
  \u000C           Form Feed
  \u000D           Carriage Return
  \u0085           Next Line (NEL)
  \u2028           Line Separator
  \u2029           Paragraph Separator
  \u000D \u000A    CR/LF sequence

- Matches normally stop at line ends, but overridable.
- Line endings always match as a single character, including the CR/LF sequence
- There is no escape sequence, analogous to \n, that matches any line ending
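
A sketch of this behavior in Java's java.util.regex, which recognizes several of the Unicode line endings listed above: the dot stops at a line end unless DOTALL is set, and in MULTILINE mode the CR/LF pair counts as one line ending. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineEndingDemo {
    // Returns the first match of the regex under the given flags, or "" if none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : "";
    }

    // Counts non-empty lines by matching ^.+$ in MULTILINE mode.
    static int countLines(String input) {
        Matcher m = Pattern.compile("^.+$", Pattern.MULTILINE).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String twoLines = "one\u2028two";  // separated by LINE SEPARATOR
        System.out.println(firstMatch(".+", 0, twoLines));                      // stops at the line end
        System.out.println(firstMatch(".+", Pattern.DOTALL, twoLines).length()); // override: spans it
        // CR/LF, PARAGRAPH SEPARATOR and NEL each end a line.
        System.out.println(countLines("a\r\nb\u2029c\u0085d"));
    }
}
```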

Caseless Matching

- Simple: one to one character relation between pattern and text being matched.
- Full: one to many
  - German Sharp-S ß uppercases to ‘SS’
  - Expensive in complexity of implementation, speed.
- Existing implementations provide simple form only.
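
Simple caseless matching, and its one-to-one limit, can be seen in Java's java.util.regex, taken here as one example implementation. CASE_INSENSITIVE alone covers ASCII only; adding UNICODE_CASE extends it to other scripts, still character for character. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CaselessDemo {
    // Whole-input caseless match, optionally with Unicode-aware case folding.
    static boolean caseless(String pattern, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeCase) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(pattern, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u03c3\u03bf\u03c6\u03b9\u03b1";  // Greek "sofia" in lowercase
        String upper = "\u03a3\u039f\u03a6\u0399\u0391";  // the same word in uppercase
        System.out.println(caseless(lower, upper, false));  // ASCII-only folding
        System.out.println(caseless(lower, upper, true));   // simple Unicode folding
        // Sharp-s against "SS" needs a one-to-many mapping.
        System.out.println(caseless("stra\u00dfe", "STRASSE", true));
    }
}
```

The last call does not match, because relating ß to "SS" requires full, one-to-many case folding.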

Grapheme Clusters

- Definition: what a user would consider a character, or what would display as a single character.
- Multi-codepoint Clusters
  - Base char + combining marks
    - Example: decomposed form of Ň
  - Hangul (Korean) syllables
- Unicode-enabled regular expressions should provide
  - Match a grapheme cluster
  - Test whether match position is on a boundary.
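
As a sketch of "match a grapheme cluster", the example below counts clusters with the \X construct of Java's java.util.regex. This assumes Java 9 or later, where \X matches an extended grapheme cluster; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GraphemeClusterDemo {
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Counts grapheme clusters by matching \X repeatedly (Java 9+ assumed).
    static int countClusters(String input) {
        Matcher m = CLUSTER.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String nCaronThenA = "N\u030Ca";             // N + COMBINING CARON, then a
        String hangulJamo = "\u1112\u1161\u11ab";    // one syllable as three conjoining jamo
        System.out.println(nCaronThenA.length() + " code units, "
                + countClusters(nCaronThenA) + " clusters");
        System.out.println(hangulJamo.length() + " code units, "
                + countClusters(hangulJamo) + " clusters");
    }
}
```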

Word Boundaries, \b

- Classic RE Feature
  - Boundaries between “word” and “non-word” characters
  - “Word” characters include all Alphabetic.
  - Non-spacing marks never separated from base, otherwise ignored.
- UAX 29 Boundaries
  - Better, but different, results.

[Figure: the sample text "Hello There. G’day 123.456" shown twice with word boundaries marked, once for Classic RE and once for Unicode Word Boundaries.]
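
The classic behavior on the sample text can be reproduced with a sketch in Java's java.util.regex, using \b around runs of \w. The class and method names are hypothetical. Only the classic boundaries are shown; under the default UAX 29 rules, "G’day" and "123.456" would each be expected to stay together as one word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Collects every run of word characters delimited by classic \b boundaries.
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\b\\w+\\b").matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The apostrophe is RIGHT SINGLE QUOTATION MARK, as on the slide.
        String sample = "Hello There. G\u2019day 123.456";
        // The apostrophe and the decimal point both act as word breaks here.
        System.out.println(String.join("|", words(sample)));
    }
}
```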

Unicode TR 18

- Unicode Technical Standard #18, Regular Expressions
- Guidelines for how to adapt RE implementations to Unicode
- Three Levels of Support: Basic, Extended, Tailored
- Basic Support requires
  - Access to common Unicode Character Properties
  - Set (character class) Operations: Union, Intersection, Subtraction
  - Simple Unicode Loose (caseless) matching
  - Unicode Line separator characters
  - Supplementary Character support
  - Hex notation for Unicode code points

Unicode TR 18

- Extended Unicode Support
  - More properties, characters by name. (GREEK CAPITAL LETTER EPSILON)
  - Canonical Equivalents (normalization)
  - Unicode style word boundaries
  - Full case insensitive matching
  - Matching default grapheme clusters and boundaries
- Tailored Support. Language or Locale specific behavior for a number of matching constructs.
  - No implementations available yet.

Implementations

- Implementations providing significant Unicode support
  - Perl.
    - Major innovations to regular expressions
    - Early adopter of Unicode
    - Perl features and syntax widely adopted.
  - Java JDK 1.4
  - Microsoft .NET
  - IBM ICU4C

Conclusion

- Regular expressions provide a great way to analyze and manipulate Unicode data.
- Mainstream implementations are readily available.

Questions