Digital Text Primer

Download Report

Transcript Digital Text Primer

Digital Text Primer
Prepared for: AIEA Roundtable on Digitization of Armenian Documents
Saturday 7 October 2006, University of Geneva, Switzerland
Roland Telfeyan <[email protected]>
Robert Coffin <[email protected]>
October 2, 2006 • Charlotte, North Carolina
Contents
• Text encoding
 ASCII Problem
 Unicode Solution
• OCR
 ABBYY FineReader
 Sample scans
2
1963: ASCII
• Telegraph machines
• American Standard Code for Information
Interchange (ASCII)
• 128 numbers representing
 Printed characters, like ‘A’, ‘B’, ‘+’, ‘=’, etc.
 Commands to control the print head of the
teletype, like “carriage return”, “line feed”, “tab”,
“back space”, etc.
3
ASCII: Cont’d
• No indication of type appearance
• Only numbers representing letters
0
8
16
24
32
40
48
56
64
72
80
88
96
104
112
120
nul
bs
dle
can
sp
(
0
8
@
H
P
X
`
h
p
x
1
9
17
25
33
41
49
57
65
73
81
89
97
105
113
121
soh
ht
dc1
em
!
)
1
9
A
I
Q
Y
a
i
q
y
2
10
18
26
34
42
50
58
66
74
82
90
98
106
114
122
stx
nl
dc2
sub
"
*
2
:
B
J
R
Z
b
j
r
z
3
11
19
27
35
43
51
59
67
75
83
91
99
107
115
123
etx
vt
dc3
esc
#
+
3
;
C
K
S
[
c
k
s
{
4
12
20
28
36
44
52
60
68
76
84
92
100
108
116
124
eot
np
dc4
fs
$
,
4
<
D
L
T
\
d
l
t
|
5
13
21
29
37
45
53
61
69
77
85
93
101
109
117
125
enq
cr
nak
gs
%
5
=
E
M
U
]
e
m
u
}
6
14
22
30
38
46
54
62
70
78
86
94
102
110
118
126
ack
so
syn
rs
&
.
6
>
F
N
V
^
f
n
v
~
7
15
23
31
39
47
55
63
71
79
87
95
103
111
119
127
bel
si
etb
us
'
/
7
?
G
O
W
_
g
o
w
del
4
Early Keyboards
• Keyboards were “hard-wired.”
• To get a lowercase ‘b’, you press
the [B] key, making the keyboard
emit code 98.
5
Mid 1970’s: Computer Fonts
• An array of glyphs, one per
ASCII code
• Character code 97 (‘a’) can be
rendered variously:
a, a, a, ա, ...
6
Significance of Fonts
• Fonts were the first flexible
mapping interposed between the
hard-wired keyboard and the printed
glyphs.
• This technology made the Macintosh
famous.
7
Font Design Dilemma
• Now 97 can mean not only
‘a’ but ‘ա’.
• However, should 98 mean
‘բ’ or ‘պ’?
8
Font Design Dilemma
• Font designers assigned glyphs to specific
character codes that satisfied their own
personal keyboard layout preferences.
• An Armenian text file could not be
viewed reliably in absence of the
font used to create it.
9
1986 to 2006: NeXT to Mac
• Steve Jobs (whose mother is a Hagopian)
invented the NeXT computer
• It had user-definable Keyboard Layouts
• Today’s Mac OS X is 90% NeXT
• Today, the placement of letters on a
keyboard is a user preference, like the
location of windows on a screen.
10
Unicode
• The character set has been extended to allow
for more than 95,000 characters.
• The goal is a set of standard character codes
for every known language.
• For the first time, Armenian (and
other) characters have their own codes,
defined by a de-facto international
standard.
11
Unicode (Cont'd)
• The Unicode Character Set is a
standard definition of character codes for
the glyphs of most known languages.
• Armenian codes range from 1328 to 1423
(95 codes).
12
But I like my old system
• If you want Armenian, Georgian, Greek, Hebrew,
Arabic, Chinese, and more all on the same page using
one font with with a consistent look, …
• If you want to type using your own key layout, …
• If you want others to be able to read your text
in absence of the font or keyboard layout or
computer system you used, …
• … use Unicode.
13
But I have a lot of ASCII
• Unicode conversion tools at: http://www.telf.com/
14
95,000 Glyphs?
• With more than 95,000 potential glyphs in a
Unicode font, any one font can represent
multiple language scripts.
• How can a computer keyboard address all
these characters?
• User-defined keyboard layouts map
selected characters in the Unicode font
to the physical keyboard.
15
Review: Two Main Points
• Keyboard layouts are user preferences
that have nothing to do with legibility of
text on another system.
• Unicode text is legible in absence of the
fonts or keyboard mappings or possibly
the application used to compile it.
16
1985
Physical Keyboard
Different fonts had
different glyphs for
the same character.
Kevork
font
“K”
Tigran
font
“G”
ASCII 67
Code saved in file
17
1995
Virtual Keyboard
(User Selected)
Physical
Keyboard
Armenian
Key Code 67
Unicode characters are saved in text file—
the same Unicode character code for the
same glyph, regardless of font.
Armenian
Letter “Gim”
Unicode 0533
Keyboard
Preference
Hebrew
Hebrew
Letter “Gimel”
Unicode 05D2
Georgian
Georgian
Letter “Gan”
Unicode 10A2
Any
Unicode
Standard
Font
“Գ ”
(Multilingual)
“‫”ג‬
ABCD…
ΑΒΓΔ …
… ‫ܐܒܓܕ‬
… ‫אבגד‬
ԱԲԳԴ …
ႠႡႢႣ …
“Ⴂ ”
18
OCR
• ABBYY FineReader is a commercial
multilingual OCR software that recognizes
Armenian and many other languages.
• Built-in dictionaries assist in checking
accuracy, and all text is handled through
Unicode.
19
FineReader
• The program is simple yet powerful.
• The program links each letter of text with
its location in the scanned image, for fast
proofreading.
20
FineReader (Cont’d)
• Ample control over page layout
• Tools to automate large batches
• Outputs Word, PDF, HTML, XML, …
21
OCR: Results
• Armenian accuracy depends on typeface and
richness of the internal dictionary.
• Arial Armenian: ~99.9%
• Times, Aramian, Nork: ~96%
• Երկաթագիր, Գրաբար manuscripts: not too
good ~70%
22
FineReader: Conclusion
• Tuned for modern, Arial-like letters.
• We are working with ABBYY to improve
recognition rates on old manuscripts and
books.
23
Screen Shots
• On the next slides are:
A screenshot of FineReader
A scanned image
MS Word output
24
FineReader Screen
Recognized
Text
Scan
25
Example Original Scan
26
MS Word Text Output
27
Further Information
• Questions, suggestions, and
corrections are welcome.
• Updates will be posted to
www.telf.com
28