Unicode for Under Resourced Languages

Daniel Yacob
The Ge’ez Frontier Foundation
SALTMIL 5: Genoa, Italy 2006
Overview
• What is “Unicode”?
– More than Just Encoded Letters!
• Working with Unicode
– How Unicode can help you.
– Resources and how to apply them.
• Working for Unicode
– How you can help Unicode.
– How Unicode can help your U-RL.
My Background
• Started Ethiopic software work in 1993
– transliterator, keyboard, fonts
• Amharic Computational Linguistics in 1994
• “Extended Ethiopic” Unicode
Standardization 1995-2004
• Corpus Collection 1997 – Present
• Began Using Unicode in 1995 for Ethiopic
– but no Unicode standard existed until 2000!
My Background
• Few or no Unicode-based resources existed in
1993-1997
– Today there is almost always an open-source
project that you can start with and extend.
– Minimize the time and labour you put into
developing basic resources.
– Avoid the maintenance trap.
• We will assume the worst case scenario
– You work on a language, using a script, with
no pre-existing software resources at all.
What Unicode is
Unicode …
– is a consortium
– is a process
– is a community
– is a conference
– is a database
– is a standard
– is a collection of standards
What Unicode is not
Unicode …
– is not a font
– is not a keyboard system
– is not a transliteration system
– is not the ISO
– is not perfect
– is not complete
Over 80 Scripts not Encoded!
India, Nepal, Bangladesh:
• Chakma
• Meithei / Manipuri
• Newari
• Sorang Sompeng
• Varang Kshiti
Southeast Asia (excluding China):
• Batak
• Cham
• Javanese
• Pahawh Hmong
• Viet Thai
China:
• Lanna
• Naxi Geba
• Naxi Tomba
• Pollard
Africa:
• Bamum
• Bassa
• Mende
Courtesy of Michael Everson: http://evertype.com
Over 80 Scripts not Encoded!
•Ahom
•Alpine
•Aramaic
•Avestan
•Aztec Pictograms
•Balti
•Brahmi
•Büthakukye
•Byblos
•Chalukya
•Chola
•Cypro-Minoan
•Egyptian Hieroglyphs
•Elbasan
•Elymaic
•Grantha
•Hatran
•Iberian
•Indus Valley
•Jurchin
•Kaithi
•Kawi
•Khotanese
•Kitan Large Script
•Kitan Small Script
•Landa
•Linear A
•Luwian
•Mandaic
•Manichaean
•Mayan Hieroglyphs
•Meroitic
•Modi
•Nabataean
•North Arabic
•Numidian
•Old Hungarian
•Old Permic
•Orkhon
•Pahlavi
•Palmyrene
•Proto-Elamite
•Pyu
•Rongorongo
•Samaritan
•Satavahana
•Sharada
•Siddham
•South Arabian
•Soyombo
•Takri
•Tangut Ideograms
•Uighur
•Vedic accents
Courtesy of Michael Everson: http://evertype.com
Current State of the Unicode Standard: New Script Additions
For Unicode 5.0 (2006):
• N’Ko (West Africa)
• Balinese (Indonesia)
• Phags-pa (historical)
• Phoenician (historical)
• Cuneiform (historical)
For Unicode 5.1 (2008):
• Lepcha (India)
• Ol Chiki (India)
• Vai (Liberia)
• Saurashtra (India)
• Myanmar minorities (Myanmar)
• Kayah Li (Myanmar)
• Rejang (Indonesia)
• Sundanese (Indonesia)
• Carian, Lycian, Lydian (historical)
Courtesy of Michael Everson: http://evertype.com
Working with Unicode
Unicode is all About Text
• Most applicable to problems where
language is represented by text.
• Unicode addresses some vocabulary, but only
within the scope of localization (CLDR).
• May not be the solution if you are not
working with text represented in written
form
– Although, Unicode can be used for symbol
processing
Working with Unicode
Operating Systems
• Almost anything from this millennium.
• Apple MacOS Version ≥ 9.2
• Microsoft Windows CE, NT, XP, 2000
• Solaris ≥ 2.8
• Any GNU/Linux (for console use)
– GNOME 2.0 or KDE 2.0 and Later
Working with Unicode
The International Phonetic Alphabet (IPA)
• SIL Charis, Doulos, Gentium
– free and most complete
– matches “Times New Roman” style
– http://scripts.sil.org/IPAhome
Working with Unicode
If you need more letters…
• Create Your own Fonts!
• Use the Unicode Private Use Area (PUA)
– this is Unicode’s extension mechanism.
– does not break compatibility with Unicode
software.
– you must send your fonts with your work.
– encode non-letter symbols (tokens, tags), no
need for fonts.
Working with Unicode
The PUA
• 6,400 code points in the range E000-F8FF
• 131,068 additional code points available in “planes” 15 & 16
• Work in Plane 0 first (0000 – FFFF)
• Intended for company logos, ligatures
used by typesetting software, etc.
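A minimal Python sketch of testing whether a character falls in the Private Use Area, using only the standard library. The three ranges are those defined by the Unicode Standard; `MY_LETTER` is a hypothetical assignment made up for illustration, not part of any published encoding.

```python
import unicodedata

# Private Use ranges defined by the Unicode Standard.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Plane 0 (the range named on the slide)
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]

def is_private_use(ch):
    """Return True if the character lies in one of the Private Use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

# Hypothetical: a custom letter placed at the first Plane 0 PUA code point.
MY_LETTER = "\uE000"

print(is_private_use(MY_LETTER))        # True
print(is_private_use("A"))              # False
print(unicodedata.category(MY_LETTER))  # "Co" is the Private Use category
```

Because the Unicode Character Database assigns no letter properties to these code points, software treats them generically; any meaning has to travel with your fonts and documentation.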
Working with Unicode
Creating Your Own Fonts
• Bitmap (BDF)
– Faster to create
– One size per font, not so scalable
– Works best with the X Window System (Unix)
• Outline (TrueType, PostScript, OpenType)
– Takes more time
– Scalable
– MS Windows, Mac, Modern Unixes
Working with Unicode
Bitmap Editors
• Each letter is a matrix of pixels, like tiles
• You toggle them on or off to shape your
letters
• GBDFED for recent GNOME/Linux
• XBDFED for general Unix
• Or search for “BDF Editor”
Working with Unicode
Bitmap Editors
Zoom View Within Edit Window
Working with Unicode
Outline Editors
• Create Bezier
curves to outline
scalable shapes
• Here traced
around a scanned
image
• FontForge
http://fontforge.sf.net
Working with Unicode
Creating Your Own Keyboards
• No standard formats
• Different on every operating system
• May require some painful programming
– transliteration may be a better alternative.
• For small amounts of typing, try entering the code point directly:
Ctrl+Shift+X1X2X3X4 (the four hexadecimal digits of the code point), e.g.
Ctrl+Shift+1234
Working with Unicode
Creating Your Own Keyboards
Linux
• Migration Toward Smart Common Input
Method (SCIM)
– simple table based
– more complex as needed
– http://scim.sf.net
– or Yudit or Emacs on older Unixes, but then you can
only type within these applications.
Working with Unicode
Creating Your Own Keyboards
Windows
• Keyman, most mature & robust
• Keyboards created with Keyman Developer
– $59 academic and developing world license
– worth every cent
– compiled keyboards also run under Linux with
a SCIM module
– http://tavultesoft.com
Working with Unicode
Text Processing
• International Components for
Unicode (ICU)
– http://icu.sf.net
– Java, C/C++
– Bindings in: Python, Ruby, C#,
Perl 6 (some Perl 5)
– started by IBM, now open source
– managed by the Unicode president
– check with ICU before
• 700+ Encoding Conversions
– convert legacy systems to and from Unicode
– migrate corpora to Unicode
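As a rough illustration of migrating legacy data, the sketch below uses Python's built-in codecs as a stand-in for ICU's converters (it does not call ICU). The sample bytes and the choice of ISO 8859-7 are assumptions made for the example; a custom font encoding for an under-resourced script would need its own mapping table.

```python
def to_unicode(raw, legacy_encoding):
    """Decode bytes stored in a legacy encoding into a Unicode string."""
    return raw.decode(legacy_encoding)

def migrate(raw, legacy_encoding):
    """Convert legacy-encoded bytes to UTF-8 bytes for a Unicode corpus."""
    return to_unicode(raw, legacy_encoding).encode("utf-8")

# Example data: the Greek letters alpha and beta in ISO 8859-7.
legacy_bytes = b"\xe1\xe2"

text = to_unicode(legacy_bytes, "iso8859_7")
print(text)                                 # αβ
print(migrate(legacy_bytes, "iso8859_7"))   # the same text as UTF-8 bytes

# Converting back to the legacy encoding is the reverse call.
assert text.encode("iso8859_7") == legacy_bytes
```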
Working with Unicode
Text Processing
ICU: Normalization
• Equate letters and diacritical symbols
– n (006E) + ˜ (0303) = ñ (00F1)
– u (0075) + ¨ (0308) = ü (00FC)
– A (0041) + ° (030A) = Å (212B)
– e (0065) + ^ (0302) + . (0323) = e (0065) + . (0323) + ^ (0302) = ê (00EA) + . (0323) = ệ (1EC7)
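The same equivalences can be checked with Python's standard `unicodedata` module, used here as a stand-in for ICU's normalizer. This is a sketch of normalization to NFC (the composed form); the helper name `nfc` is introduced only for this example.

```python
import unicodedata

def nfc(s):
    """Normalize a string to Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", s)

# Base letter plus combining mark composes to the single precomposed letter.
print(nfc("n\u0303") == "\u00F1")   # n + combining tilde     -> ñ
print(nfc("u\u0308") == "\u00FC")   # u + combining diaeresis -> ü

# The order of the two marks on "e" does not matter after normalization.
print(nfc("e\u0302\u0323") == nfc("e\u0323\u0302") == nfc("\u00EA\u0323") == "\u1EC7")

# The Angstrom sign (212B) and A + combining ring both normalize to Å (00C5).
print(nfc("\u212B") == nfc("A\u030A") == "\u00C5")
```

Normalizing both sides before comparing or searching is what lets differently typed versions of the same word match.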
Working with Unicode
Text Processing
ICU: Regular Expressions
• Applies the Unicode Character Database
• Categorizes every character as one of
– Letter
– Number
– Separator
– Punctuation
– Mark
– Symbol
– Other
• Subcategories within each. Examples:
– Letter: uppercase, lowercase, other, …
– Symbol: math, currency, modifier, …
– Mark: spacing, non-spacing, enclosing
• Defines 80 character property types
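The general categories from the Unicode Character Database can be inspected with Python's standard `unicodedata` module. The sketch below maps the two-letter category code to the major classes listed on the slide; the `MAJOR` table and the `describe` helper are names introduced for this example.

```python
import unicodedata

# First letter of the two-letter general category code -> major class.
MAJOR = {
    "L": "Letter",
    "N": "Number",
    "Z": "Separator",
    "P": "Punctuation",
    "M": "Mark",
    "S": "Symbol",
    "C": "Other",
}

def describe(ch):
    """Return (major class, two-letter category code) for one character."""
    code = unicodedata.category(ch)
    return MAJOR[code[0]], code

for ch in ["A", "5", " ", "\u1362", "\u0303", "$"]:
    print(repr(ch), describe(ch))
```

For instance, `describe("A")` gives `("Letter", "Lu")` (uppercase letter) and `describe("\u0303")` gives `("Mark", "Mn")` (non-spacing mark).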
Working with Unicode
Text Processing
ICU: Regular Expressions
Set Operations
• Negation: [^\p{Letter}]
• Union: [\p{Letter}\p{Number}]
• Intersection: [\p{Letter}&\p{script=Cyrillic}]
• Difference: [\p{Letter}-\p{Latin}]
• Important for a character set the size of Unicode.
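To make the four set operations concrete, here is a sketch that builds the sets explicitly with Python sets rather than with ICU. Python's standard library does not expose the Script property, so `in_script` approximates it from the character name prefix; that shortcut, the helper names, and the limited code point range are all assumptions of this example.

```python
import unicodedata

def chars_where(predicate, start=0x0000, end=0x05FF):
    """Collect the characters in a code point range that satisfy a predicate."""
    return {chr(cp) for cp in range(start, end + 1) if predicate(chr(cp))}

def is_letter(ch):
    return unicodedata.category(ch).startswith("L")

def is_number(ch):
    return unicodedata.category(ch).startswith("N")

def in_script(ch, script):
    # Approximation: guess the script from the character's name prefix.
    return unicodedata.name(ch, "").startswith(script.upper() + " ")

universe = chars_where(lambda ch: True)
letters = chars_where(is_letter)
numbers = chars_where(is_number)
cyrillic = chars_where(lambda ch: in_script(ch, "Cyrillic"))
latin = chars_where(lambda ch: in_script(ch, "Latin"))

not_letters = universe - letters        # negation
letters_or_numbers = letters | numbers  # union
cyrillic_letters = letters & cyrillic   # intersection
non_latin_letters = letters - latin     # difference

print(len(cyrillic_letters), "Cyrillic letters found in the sampled range")
```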
Working with Unicode
Text Processing
ICU: Regular Expressions
• Enhanced Word Boundaries:
– Classic RE: Hello There. G’day 123.456
– Unicode Word Boundaries: Hello There. G’day 123.456
Working with Unicode
Text Processing
ICU: Regular Expressions
• Equivalence Classes
– [=e=] matches all “e” [eèéêëēĕėęě]
– not yet implemented
– use Perl instead
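An equivalence class such as [=e=] can be approximated without special regex support by decomposing each candidate character and comparing its base letter. The sketch below does this in Python with the standard library; the scanned range (Latin-1 Supplement through Latin Extended-B) and the helper names are assumptions chosen for illustration.

```python
import re
import unicodedata

def base_letter(ch):
    """Return the first code point of the character's canonical decomposition."""
    return unicodedata.normalize("NFD", ch)[0]

def equivalence_class(base, start=0x00C0, end=0x024F):
    """Build a regex character class of letters that decompose to `base`."""
    members = {base}
    for cp in range(start, end + 1):
        ch = chr(cp)
        if base_letter(ch) == base:
            members.add(ch)
    return "[" + "".join(sorted(members)) + "]"

e_class = equivalence_class("e")
print(e_class)

# Every accented "e" on the slide matches the generated class.
print(all(re.fullmatch(e_class, ch) for ch in "eèéêëēĕėęě"))
```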
Working with Unicode
Overloading Perl Regex with Regexp::Ethiopic
Simple Plurals:
[#7#]ች
vs
[ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች
Working with Unicode
Overloading Perl Regex with Regexp::Ethiopic
• /[#3#]ያ/
– አንባቢያን
– ሚያዚያ
– ኢትዮጵያዊያን
• /[#3,6#]ያ/
– አንባቢያን, አንባብያን
– ሚያዚያ, ሚያዝያ
– ኢትዮጵያዊያን, ኢትዮጵያውያን
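The idea behind the [#N#] shorthand can be sketched in Python by exploiting the layout of the Ethiopic block, where most syllable rows occupy eight consecutive code points ordered by vowel form. This is not how Regexp::Ethiopic is implemented; `order_class` is a hypothetical helper, and the arithmetic is only an approximation (labialized rows, for example, do not follow the same pattern, so the generated class is broader than the list on the slide).

```python
import re
import unicodedata

ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x135A  # basic syllable range

def order_class(*orders):
    """Build a regex class of Ethiopic syllables in the given orders (1-8)."""
    chars = []
    for cp in range(ETHIOPIC_START, ETHIOPIC_END + 1):
        ch = chr(cp)
        # Skip unassigned code points inside the range.
        if not unicodedata.name(ch, "").startswith("ETHIOPIC SYLLABLE"):
            continue
        if (cp - ETHIOPIC_START) % 8 + 1 in orders:
            chars.append(ch)
    return "[" + "".join(chars) + "]"

YA = "\u12EB"    # ያ
CHI = "\u127D"   # ች

third_then_ya = re.compile(order_class(3) + YA)
third_or_sixth_then_ya = re.compile(order_class(3, 6) + YA)
seventh_then_chi = re.compile(order_class(7) + CHI)

print(bool(third_then_ya.search("\u1262\u12EB")))           # ቢያ: 3rd order
print(bool(third_then_ya.search("\u1265\u12EB")))           # ብያ: 6th order, no match
print(bool(third_or_sixth_then_ya.search("\u1265\u12EB")))  # matches with [3,6]
```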
Working with Unicode
Text Processing
ICU: Transliteration
• Defined by “transform rules”
– One to one mappings:
• α <> a;
• β <> b;
– Context Rules:
• β } [aeiou] > b;
• β } [^aeiou] > v;
Working with Unicode
Text Processing
ICU: Transliteration
• Defined by “transform rules”
– Applying UCD Properties
• Θ } [:LowercaseLetter:] <> Th;
• Θ <> TH;
– Reverse Transliteration Context Rules
• σ < [:^Letter:] { s } [:^Letter:] ;
• ς < s } [:^Letter:] ;
• σ<s;
Working with Unicode
Text Processing
• ICU: Transliteration
– Gets much more sophisticated
• See also Perl’s Text::Transliterate
Working for Unicode
Taking Your Work a Step Further
• You’ve helped create an orthography
– now make it official.
• You’ve worked with a pre-existing un-encoded script using the PUA
– now formalize it.
• You’ve created a transliteration system
– make it an ISO standard.
• You’ve identified a dialect
– encode it in ISO 639.
• You’ve developed a keyboard
– make it a national standard.
• etc.
Working for Unicode
Why go the extra mile (or kilometer)?
• Ethnic pride and identity are promoted.
• Literacy efforts can be encouraged.
• The study of historic scripts is kept alive.
• Communication between and amongst members
of the community is promoted.
• Government communication in times of
emergency (disease, war, natural disaster).
• Leads to localization, greater access to ICT.
• …and you become the expert!
Working for Unicode
What to Consider
• The work will be more social than technical.
• The work will take years (at least two).
• Review Encoding History
– Has this been attempted before and failed? Why?
– Are there any non-Unicode encodings?
• Determine the Stakeholders
– The Government: will they support you, oppose you, jail you?
– Political Parties, Religious, Education, Cultural Groups
• does anyone have something to lose by the encoding?
• Communicate, Communicate, Communicate…
– and be transparent.
– the perception of being closed breeds suspicion and opposition.
• …even 11 years after the fact, trust me on this.
Working for Unicode
New Keyboard?
• No international standardization working
groups
• Contribute Keyboard back to main project
• Contact Local ICT Professionals
Organization
• Contact Local University CS Department
• Contact Local Standards Body
Working for Unicode
New Language or Dialect?
• Contact the ISO/DIS 639-3 Registration
Authority
– http://sil.org/iso639-3/
– [email protected]
• Contact Language or Cultural Authority
• Contact Local University Linguistics
Department
Working for Unicode
New Orthography? Or Un-encoded?
• Contact the ISO 15924 Registration Authority
– http://unicode.org/iso15924/
• Contact Language or Cultural Authority
• Contact Local ICT Professionals Organization
• Contact Local University CS Department
• Contact Local University Linguistics Department
• Contact Local Standards Body
• Contact the Script Encoding Initiative
Working for Unicode
The Script Encoding Initiative
• http://linguistics.berkeley.edu/sei
• Works with users on script proposals.
• Helps raise money for script proposals to be
written and free fonts to be created.
• Works collaboratively with other groups (e.g.
SIL) to avoid duplication of effort.
• Helps seek experts to review proposals.
• Participates at standards meetings on behalf of
minority groups and scholars.
~fini~
• Conclusion
– Use Unicode Now!
– You can do it!
– Yes you can do it!
– There are no excuses anymore…
– …it’s 2006 already, I’m telling you, you can do this!
– and when you do (remember I have faith in you!) consider feeding back into the system via standardization.
– Be a good citizen of earth, always ☺.
Thank You for Listening.
Are There Any Questions?
This presentation: http://yacob.org/papers/
