Character - Uni Konstanz


Tafseer Ahmed
Department of Computer Science
University of Karachi
Types of Urdu Software Development

Word Processing: Word Processors, ActiveX Controls (Text Box, List, Buttons, Menus)

Information Processing: Character Encoding, Database Operations (Sort, Search, Comparison)

Language Processing: Grammar Checkers, Translators, Speech Synthesizers and Recognizers, OCRs
Character, Script, Glyph and Font
Character
A character is an abstract entity, such as "LATIN CAPITAL LETTER A" or "ARABIC LETTER HAH".
Every character has exactly one position (code point) in a character representation scheme such as Unicode.
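
As a small illustration of a character being an abstract, named entity with a single code point, the sketch below looks up names and code points with Python's standard `unicodedata` module. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch):
    """Return (code point as U+XXXX, formal Unicode character name)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

latin_a = describe("A")
arabic_hah = describe("\u062d")

print(latin_a)
print(arabic_hah)
```

The name and the code point identify the character; neither says anything about how it is drawn.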
Script
A script is the writing system used to write one or more languages.
For example, English and French are written in the Roman script, and Urdu and Farsi are written in the Arabic script.
Glyph
The visual representation of the character
made on screen or paper is called a Glyph.
A character can have more than one glyph.
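
One way to see several glyph shapes for one character is through the Arabic Presentation Forms-B block of Unicode, where the positional shapes of a letter are recorded as compatibility characters. The sketch below collects them for the letter BEH using the standard `unicodedata` module. The function name `presentation_forms` is an assumption of this example, and modern fonts normally select these shapes through font tables rather than through these code points.

```python
import unicodedata

def presentation_forms(base_char):
    """Collect Presentation Forms-B entries that decompose to base_char."""
    target = f"{ord(base_char):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like "<initial> 0628": a form tag, then one base code point.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = f"U+{cp:04X}"
    return forms

beh_forms = presentation_forms("\u0628")  # ARABIC LETTER BEH
for form, code_point in beh_forms.items():
    print(form, code_point)
```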
Character Encoding




Data, and hence text, is stored in a computer as binary numbers.
A character encoding scheme such as ASCII or EBCDIC maps (English) characters to binary numbers for storage and processing.
The characters of any language can be given a character encoding. This is the basis of code pages.
Each language has a code page that holds the encoding of that language's characters.
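
A minimal sketch of what a code page means in practice, using codecs that ship with Python. The choice of `cp037` as the EBCDIC variant and of `cp1252` and `cp1256` as the two code pages is an assumption made only for illustration.

```python
# One character, different numbers under different encodings.
ascii_a = "A".encode("ascii")    # ASCII
ebcdic_a = "A".encode("cp037")   # one EBCDIC code page

# One number, different characters under different code pages.
raw = bytes([0xC7])
as_western = raw.decode("cp1252")  # Windows Western European code page
as_arabic = raw.decode("cp1256")   # Windows Arabic code page

print(ascii_a.hex(), ebcdic_a.hex())
print(repr(as_western), repr(as_arabic))
```

The same stored byte is only meaningful once the reader knows which code page was used to write it.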
Character Encoding of Urdu

Proprietary standards (the biggest problem in Urdu software development)

Urdu Zabta Takhti (national standard code
page of Urdu)

Unicode (International Standard for Multilingual
Characters)
Urdu Zabta Takhti
Unicode



Unicode is a repository of the characters of almost all languages of the world.
Unicode has more than 65,000 code points for characters (65,536 in the 16-bit range 0x0000 to 0xFFFF).
Software vendors are now supporting or switching to Unicode.
Unicode / ISO 10646
16-bit international character encoding
Windows 2000 uses Unicode version 2.0

[Figure: layout of the 16-bit code space from 0x0000 up to 0xFFFF: ASCII, Latin, Greek, Arabic and Hebrew, Indian scripts, Thai, punctuation, symbols, Kana, Hangul, ideographs (Hanzi, Kanji, Hanja), future use, private use, compatibility. Example string stored as 16-bit code units: 0041 (A), 9662, FF96, 4F85, 0000 (null).]
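
The 16-bit code units 0041 9662 FF96 4F85 0000 can be reproduced with a short sketch. It assumes big-endian UTF-16 so that the bytes print in the same order as the code points. The variable names are made up for this example.

```python
# The five code points from the example, ending with the null terminator.
text = "A\u9662\uff96\u4f85\x00"

encoded = text.encode("utf-16-be")  # two bytes per character in this range
code_units = [encoded[i:i + 2].hex().upper() for i in range(0, len(encoded), 2)]

print(" ".join(code_units))
```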
Approaches to Urdu Fonts

Naskh (Character Based)

Nastaleeq (Character Based)

Nastaleeq (Ligature Based)
Major Problems in Developing Urdu Fonts

Many glyphs correspond to one character.

Only 256 positions are available in a font file, so all the ligatures cannot be stored in a single file.

A special-purpose Urdu word processor is required to implement the glyph joining and substitution logic.
Open Type Font (OTF)
True Type Font

The TrueType font technology consists of two components:
the TrueType font and the TrueType rasterizer.

One glyph corresponds to one code point (position) in the font file.
Open Type Font


OpenType is a new cross-platform font file format developed jointly by Adobe and Microsoft.
It is an extension of the TrueType font format.

An OpenType font may contain more than 65,000 glyphs.

One character may correspond to several glyphs.

A rich mapping between characters and glyphs supports ligatures, positional forms, alternates, and other substitutions.

It includes information to support features for two-dimensional positioning and glyph attachment.

It carries explicit script and language information, so a text-processing application can adjust its behavior accordingly.
Tables in OTF Font






CMAP (Character to Glyph Mapping)
GDEF (Glyph Definition Data)
GPOS (Glyph Position Data)
GSUB (Glyph Substitution Data)
BASE (Baseline Data)
JSTF (Justification Data)
GDEF Table

Glyph Class Definition: Simple, Ligature, Combining Mark, Component

Attachment Point List: glyph attachment points defined in GPOS

Ligature Caret List Table

Traditional Urdu Word Processor

Open Type enabled Word Processor
CMAP

Maps character code points to the glyph IDs (GIDs) of the glyphs in the font.
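
Conceptually, the character-to-glyph mapping behaves like a lookup table. The sketch below uses a hypothetical three-entry table. The GID values and the names `TOY_CMAP` and `glyph_ids` are invented for illustration and do not come from any real font. Glyph ID 0 is conventionally the "missing glyph".

```python
# Hypothetical mapping: code point -> glyph ID (GID).
TOY_CMAP = {
    0x0627: 3,  # ARABIC LETTER ALEF
    0x0628: 4,  # ARABIC LETTER BEH
    0x067E: 5,  # ARABIC LETTER PEH
}

def glyph_ids(text, cmap):
    """Map each character to its GID, falling back to GID 0 when unmapped."""
    return [cmap.get(ord(ch), 0) for ch in text]

gids = glyph_ids("\u0628\u0627A", TOY_CMAP)
print(gids)
```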
GSUB

Contains information for substituting glyphs to render the scripts and language systems supported in a font.

Types of Substitution

A Single Substitution replaces a single
glyph with another single glyph.

An Alternate Substitution identifies
functionally equivalent but different looking
forms of a glyph.

A Multiple Substitution replaces a single
glyph with more than one glyph. This is used
to specify actions such as ligature
decomposition.

A Ligature Substitution replaces several
glyph indices with a single glyph index.

A Contextual Substitution describes glyph substitutions in context, that is, a substitution of one or more glyphs within a certain pattern of glyphs.

Each substitution describes one or more input glyph sequences and one or more substitutions to be performed on that sequence.
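
Three of the substitution types above can be sketched on plain lists of glyph names. This is a toy model only: the glyph names (`beh`, `beh.init`, `lam_alef`) and the rule tables are hypothetical, and a real layout engine reads its rules from the font's GSUB table instead.

```python
def single_substitution(glyphs, mapping):
    """Replace a single glyph with another single glyph."""
    return [mapping.get(g, g) for g in glyphs]

def multiple_substitution(glyphs, mapping):
    """Replace a single glyph with several glyphs (e.g. ligature decomposition)."""
    out = []
    for g in glyphs:
        out.extend(mapping.get(g, [g]))
    return out

def ligature_substitution(glyphs, ligatures):
    """Replace a run of glyphs with one ligature glyph, longest rule first."""
    rules = sorted(ligatures.items(), key=lambda kv: len(kv[0]), reverse=True)
    out, i = [], 0
    while i < len(glyphs):
        for components, ligature in rules:
            n = len(components)
            if tuple(glyphs[i:i + n]) == components:
                out.append(ligature)
                i += n
                break
        else:
            # No rule matched at this position: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out

run = ["lam", "alef", "beh"]
joined = ligature_substitution(run, {("lam", "alef"): "lam_alef"})
split_again = multiple_substitution(joined, {"lam_alef": ["lam", "alef"]})
shaped = single_substitution(["beh"], {"beh": "beh.init"})
print(joined, split_again, shaped)
```

Trying longer rules first is a design choice of this sketch, so that a three-glyph ligature is not pre-empted by a two-glyph one that shares its prefix.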
GPOS

Provides precise control over glyph placement for sophisticated text layout and rendering in different scripts and language systems.

To render Urdu glyphs properly, a text-processing client must modify both the horizontal and the vertical position of each glyph.

Entry and Exit Points
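
The idea behind entry and exit points can be sketched as follows: each glyph is placed so that its entry point lands on the exit point of the glyph before it, which shifts glyphs both horizontally and vertically. The anchor coordinates and glyph names below are invented, and real shaping engines differ in details such as which glyph of a pair is moved, so this is a simplified model, not the OpenType algorithm itself.

```python
def cursive_attach(glyphs, anchors):
    """Return an (x, y) origin per glyph so each entry meets the previous exit.

    anchors maps glyph name -> ((entry_x, entry_y), (exit_x, exit_y)),
    both given relative to the glyph's own origin.
    """
    positions = []
    join_point = (0, 0)  # where the next glyph's entry point must land
    for index, glyph in enumerate(glyphs):
        entry, exit_ = anchors[glyph]
        if index == 0:
            origin = (0, 0)
        else:
            origin = (join_point[0] - entry[0], join_point[1] - entry[1])
        positions.append(origin)
        join_point = (origin[0] + exit_[0], origin[1] + exit_[1])
    return positions

# Hypothetical anchors in font units; x decreases because the text runs right to left.
TOY_ANCHORS = {
    "a": ((0, 0), (-500, -100)),
    "b": ((0, 50), (-400, -50)),
}

origins = cursive_attach(["a", "b", "a"], TOY_ANCHORS)
print(origins)
```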
BASE
Contains information about baseline
offsets on a script-by-script basis.
JSTF
Contains justification information,
including whitespace and Kashida
adjustments.
VOLT
Visual OpenType Layout Tool
Developed by Microsoft
Glyph Grid





Glyph Name
Glyph type
Glyph ID
Unicode
Components
Glyph Group




Glyph Name
Glyph Group
Glyph Range
Glyph Enumeration
Substitution Tool





Lookup Name
Lookup Type
Process Marks
Process Base Glyph
Text Flow
Positioning Tool




Lookup Header
Lookup Type
Glyph Positioner
Glyph Adjustment (Single, Pair, Anchor)
Cursive Attachment
Caret Positioning
Urdu Support in Software







Windows XP
Windows 2000
Office 2000 and XP
Internet Explorer 5.5
Visual Studio
Java
Urdu Support in Windows 2000




Input locale (Currency, Date)
Keyboard
Write Urdu anywhere (Notepad, Windows Explorer)
RTL (Right to Left) controls, including windows and text boxes
Issues in Urdu Databases

Urdu characters are not in alphabetical sequence in Unicode.
A collating sequence is needed for sorting.

Diacritics (Aarab)
Sorting and comparison are problematic when text contains diacritics.
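
Both issues can be illustrated with a small sketch. It assumes a hand-written collating sequence covering only the first ten letters of the Urdu alphabet, and it treats every Unicode combining mark as a diacritic to ignore. The names `URDU_ORDER`, `strip_diacritics` and `collation_key` are made up for this example, and a production system would more likely use the database's own collation support.

```python
import unicodedata

# First ten letters of the Urdu alphabet in dictionary order:
# alef, beh, peh, teh, tteh, theh, jeem, tcheh, hah, khah.
URDU_ORDER = "\u0627\u0628\u067e\u062a\u0679\u062b\u062c\u0686\u062d\u062e"
RANK = {ch: i for i, ch in enumerate(URDU_ORDER)}

def strip_diacritics(text):
    """Drop combining marks such as zabar, zer and pesh."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def collation_key(word):
    """Sort key: alphabet rank per letter; unlisted characters sort after."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in strip_diacritics(word)]

teh_word = "\u062a\u0627\u0628"   # starts with teh (U+062A)
peh_word = "\u067e\u0627\u0628"   # starts with peh (U+067E)
beh_word = "\u0628\u0627\u0628"   # starts with beh (U+0628)
words = [teh_word, peh_word, beh_word]

by_code_point = sorted(words)                  # teh sorts before peh
by_alphabet = sorted(words, key=collation_key) # peh sorts before teh

# The same word with and without a zabar (U+064E) compares equal under the key.
with_mark = "\u0628\u064e\u0627\u0628"
same_word = collation_key(with_mark) == collation_key(beh_word)
print(by_code_point != by_alphabet, same_word)
```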
Web Resources

http://www.microsoft.com/typography/developers/opentype/

http://microsoft.com/globaldev/

http://communities.msn.com/MicrosoftVOLTuserscommunity/

http://www.adobe.com/type/opentype/

www.unicode.org