Urdu on Linux
International Support
Tafseer Ahmed
Department of Computer Science
University of Karachi
Note:
All the issues and support discussed for
Urdu are also applicable for other
Pakistani Languages like Sindhi, Pashto,
Punjabi, Balochi etc.
Character Encoding
Font
Text Display Engine
Character, Script, Glyph and Font
Character
A character is an abstract entity identified
by name, such as "LATIN CAPITAL LETTER A" or
"ARABIC LETTER HAH".
Every character has exactly one position
(code point) in a character representation
scheme such as Unicode.
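As a small illustration of the name and code point of a character, the following sketch uses Python's standard `unicodedata` module. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch):
    """Return the code point (as U+XXXX) and the Unicode name of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

# One Latin and one Arabic-script character.
for ch in ("A", "\u062D"):
    print(describe(ch))
```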
Glyph
The visual representation of the character
made on screen or paper is called a Glyph.
A character can have more than one glyph.
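One way to see several glyphs for a single character is through the Arabic presentation forms that Unicode keeps for compatibility. The sketch below assumes the letter BEH as the example and reads the isolated, final, initial and medial forms back to their base character; the function name `glyph_forms` is hypothetical.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract character

# Compatibility code points, each standing for one visual form of BEH.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def glyph_forms(forms):
    """Map each form tag (isolated, final, ...) to the base character it represents."""
    result = {}
    for form in forms:
        # The decomposition looks like "<initial> 0628".
        tag, base = unicodedata.decomposition(form).split()
        result[tag.strip("<>")] = chr(int(base, 16))
    return result

print(glyph_forms(PRESENTATION_FORMS))
```

All four forms resolve to the same character, which is why text is normally stored as the base character while the font and display engine pick the form.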
Script
A script is the writing system used for a language.
For example, English and French are written
in the Roman (Latin) script, while Urdu and Farsi are
written in the Arabic script.
Writing Styles of Urdu
Naskh
Nastaleeq
Character Encoding
• Data, and hence text, is stored in a computer as
binary numbers.
• A character encoding scheme such as ASCII or EBCDIC
maps (English) characters to binary numbers
for storage and processing.
• The characters of any language can be given a character
encoding. This is the basis of code pages.
• Each language has a code page that holds the
encoding of that language's characters.
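The mapping from characters to numbers can be tried out with Python's built-in codecs. This is only a sketch: `cp037` is used here as one EBCDIC variant and `cp1256` as one Arabic code page, both chosen for illustration, and `encode_hex` is a hypothetical helper.

```python
def encode_hex(text, codec):
    """Encode text with the given codec and return the bytes as a hex string."""
    return text.encode(codec).hex()

# The same character gets a different number in each encoding scheme.
print("A in ASCII :", encode_hex("A", "ascii"))
print("A in EBCDIC:", encode_hex("A", "cp037"))

# An Arabic-script letter needs a code page that contains it.
ALEF = "\u0627"
print("ALEF in cp1256:", encode_hex(ALEF, "cp1256"))

try:
    ALEF.encode("ascii")
except UnicodeEncodeError:
    print("ALEF has no ASCII encoding")
```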
Character Encoding of Urdu
• Proprietary standards (the biggest problem in Urdu
software development)
• Urdu Zabta Takhti (the national standard code page for
Urdu)
• Unicode (the international standard for multilingual
characters)
Unicode
• Unicode is a repository of characters for
almost all of the world's languages.
• Unicode has more than 65,000 code points for characters.
• Major software vendors now support, or are
switching to, Unicode.
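As a sketch of what Unicode code points look like for Urdu, the snippet below lists the code points of the word "Pakistan". The spelling with KEHEH (U+06A9) is an assumption made for this example, and the word is built from escapes so the code points are unambiguous.

```python
import unicodedata

# PEH ALEF KEHEH SEEN TEH ALEF NOON (assumed spelling of "Pakistan" in Urdu)
WORD = "\u067E\u0627\u06A9\u0633\u062A\u0627\u0646"

def code_points(text):
    """Return the code point of every character as a U+XXXX string."""
    return [f"U+{ord(ch):04X}" for ch in text]

for ch in WORD:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Every letter of this word fits in a single 16-bit unit.
print(all(ord(ch) <= 0xFFFF for ch in WORD))
```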
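[Figure: Unicode / ISO 10646 as a 16-bit international character encoding. The code space from 0x0000 to 0xFFFF is laid out, from bottom to top, as ASCII, Latin, Greek, Arabic and Hebrew, Indian, Thai, Punctuation, Symbols, Kana, Hangul, Ideographs (Hanzi, Kanji, Hanja), Future use, Private use, Compatibility. Example: "A" is stored as 0041; the sample sequence 9662 FF96 4F85 0000 ends with 0000 (null).]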
Font for Text Display
OpenType Font
• OpenType is a new cross-platform font
file format developed jointly by Adobe and
Microsoft.
• It is an extension of the TrueType font format.
• An OpenType font may contain more than
65,000 glyphs.
• One character may correspond to several
glyphs.
• It provides a rich mapping between characters and
glyphs, which supports ligatures,
positional forms, alternates, and other
substitutions.
• It carries information to support features for
two-dimensional positioning and glyph
attachment.
• It carries explicit script and language information,
so a text-processing application can
adjust its behavior accordingly.
Tables in an OTF Font
• CMAP (Character to Glyph Mapping)
• GDEF (Glyph Definition Data)
• GPOS (Glyph Position Data)
• GSUB (Glyph Substitution Data)
• BASE (Baseline Data)
• JSTF (Justification Data)
GSUB
An Example of OTF Tables
• GSUB contains information for substituting glyphs to
render the scripts and language systems
supported in a font.
• Types of Substitution
  - A Single Substitution replaces a single glyph
    with another single glyph.
  - An Alternate Substitution identifies
    functionally equivalent but different-looking
    forms of a glyph.
  - A Multiple Substitution replaces a single
    glyph with more than one glyph. This is used
    to specify actions such as ligature
    decomposition.
  - A Ligature Substitution replaces several
    glyph indices with a single glyph index.
• A Contextual Substitution describes glyph
substitutions in context, that is, a substitution of
one or more glyphs within a certain pattern of
glyphs.
Each substitution describes one or more input
glyph sequences and one or more substitutions
to be performed on that sequence.
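The ligature and multiple substitutions described above can be sketched with plain Python dictionaries. This is a toy model, not the binary GSUB format: the glyph names, the tables `CMAP`, `LIGATURES` and `DECOMPOSE`, and the function names are all hypothetical, and a real font works with glyph indices.

```python
# Character-to-glyph mapping (the role played by the CMAP table).
CMAP = {"\u0644": "lam", "\u0627": "alef", "\u0628": "beh"}

# Ligature substitution: several input glyphs become one glyph.
LIGATURES = {("lam", "alef"): "lam_alef"}

# Multiple substitution: one glyph becomes several (ligature decomposition).
DECOMPOSE = {"lam_alef": ["lam", "alef"]}

def to_glyphs(text):
    """Map each character to its default glyph name."""
    return [CMAP[ch] for ch in text]

def apply_ligatures(glyphs):
    """Replace every matching pair of glyphs with its ligature glyph."""
    out, i = [], 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out

def apply_multiple(glyphs):
    """Expand every glyph that has a multiple-substitution entry."""
    out = []
    for glyph in glyphs:
        out.extend(DECOMPOSE.get(glyph, [glyph]))
    return out

glyphs = to_glyphs("\u0628\u0644\u0627")  # BEH LAM ALEF
print(apply_ligatures(glyphs))
```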
Text Display Engine
The Alphabet Soup
• GNOME is a desktop environment for the user,
as well as a powerful application framework for
the software developer.
• GTK+ is a multi-platform toolkit for creating
graphical user interfaces, offering a complete set
of widgets.
• GTK+ is based on three libraries:
  - GLib
  - Pango
  - ATK
• GNOME uses GTK+ for its graphical user interface.
• GNOME and GTK+ are open source software
and part of the GNU Project.
Pango
• The word "Pango" consists of:
  - Greek "Pan" (Παν, U+03A0 U+03B1 U+03BD), meaning "all"
  - Japanese "Go" (語, U+8A9E), meaning "language"
• The Pango project is an open-source framework for
the layout and rendering of internationalized text.
• Pango uses Unicode (UTF-8 encoded strings) for all
of its encoding, and will eventually support output
in all the world's major languages.
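To make "UTF-8 encoded strings" concrete, this sketch encodes the two parts of the name given above and shows how many bytes each character takes. It uses only Python's built-in string methods and does not call Pango itself; the variable and function names are illustrative.

```python
PAN = "\u03A0\u03B1\u03BD"  # Greek "Pan"
GO = "\u8A9E"               # Japanese "Go"

def utf8_lengths(text):
    """Return the number of UTF-8 bytes used by each character."""
    return [len(ch.encode("utf-8")) for ch in text]

name = PAN + GO
encoded = name.encode("utf-8")

print(utf8_lengths(name))       # bytes per character
print(len(encoded), "bytes in total")
print(encoded.decode("utf-8") == name)
```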
Pango Fonts
Pango supports the following font types:
• Bitmap fonts (under the X Window System)
• Type 1 fonts (Adobe standard)
• TrueType fonts (Apple and Microsoft standard)
• OpenType fonts (Adobe and Microsoft standard)
The Layout and Rendering Pipeline
abc PAY ALIF KAF SEEN TAY ALIF NOON
Itemization
The input string is broken into portions rendered with a
consistent font, with a consistent language tag, and with
a specific bidirectional embedding level.
{abc} {PAY ALIF KAF SEEN TAY ALIF NOON}
Reordering
The items are reordered from logical order into visual
order according to their bidirectional embedding levels.
{abc} {NOON ALIF TAY SEEN KAF ALIF PAY}
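The two steps above can be approximated in a few lines of Python. This is a deliberately simplified sketch: it splits the text only by direction (not by font or language tag), assumes a left-to-right paragraph, and does not implement the full Unicode bidirectional algorithm. The function names are hypothetical.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidirectional classes

def direction(ch):
    """Return 'rtl', 'ltr', or None for characters with no strong direction."""
    cls = unicodedata.bidirectional(ch)
    if cls in RTL_CLASSES:
        return "rtl"
    if cls == "L":
        return "ltr"
    return None

def itemize(text):
    """Split text into (direction, run) items; neutral characters join the current run."""
    items = []
    for ch in text:
        d = direction(ch)
        if items and (d is None or d == items[-1][0]):
            items[-1] = (items[-1][0], items[-1][1] + ch)
        else:
            items.append((d or "ltr", ch))
    return items

def reorder(items):
    """Convert each item from logical to visual order (left-to-right paragraph)."""
    return [(d, run[::-1] if d == "rtl" else run) for d, run in items]

# "abc" followed by PEH ALEF KEHEH SEEN TEH ALEF NOON in logical order.
URDU = "\u067E\u0627\u06A9\u0633\u062A\u0627\u0646"
items = itemize("abc " + URDU)
print(reorder(items))
```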
The Layout and Rendering Pipeline
(contd.)
Glyph Selection (Shaping)
The characters in each item are turned into glyphs.
Justification
The glyph strings created in the previous step are
adjusted to fit the line-justification policies that are in
place.
Rendering
The justified glyph strings are rendered in their final
order onto the output device.
abc ‫پاکستان‬
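For Arabic-script text, much of the glyph selection step comes down to choosing a positional form for each letter. The sketch below is a toy version under strong assumptions: it knows only two letters (BEH, which joins on both sides, and ALEF, which joins only to the preceding letter), and the tables and function names are invented for this example. A real engine reads this behavior from Unicode data and the font's OpenType tables.

```python
BEH, ALEF = "\u0628", "\u0627"

DUAL_JOINING = {BEH}    # connects to both neighbours
RIGHT_JOINING = {ALEF}  # connects only to the preceding letter

def joins_to_next(ch):
    return ch in DUAL_JOINING

def joins_to_previous(ch):
    return ch in DUAL_JOINING or ch in RIGHT_JOINING

def positional_forms(word):
    """Return the positional form chosen for each letter, in logical order."""
    forms = []
    for i, ch in enumerate(word):
        prev_joined = (i > 0 and joins_to_next(word[i - 1])
                       and joins_to_previous(ch))
        next_joined = (i + 1 < len(word) and joins_to_next(ch)
                       and joins_to_previous(word[i + 1]))
        if prev_joined and next_joined:
            forms.append("medial")
        elif prev_joined:
            forms.append("final")
        elif next_joined:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

print(positional_forms(BEH + ALEF + BEH))
print(positional_forms(BEH + BEH + BEH))
```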
Sample Screenshots
[Screenshot: The GTK+ color selector localized to Farsi]
[Screenshot: GTK+ labels rendering various languages]
Web Resources
• www.unicode.org
• www.adobe.com/type/opentype/
• www.microsoft.com/typography/developers/opentype/
• communities.msn.com/MicrosoftVOLTuserscommunity/
• www.gtk.org
• www.pango.org
• i18n.kde.org
• tremu.gov.pk/tremu/workingroups/url.htm