Chapter Four
Documents: The raw material
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Documents
  Building blocks of digital libraries
  Many different standards for documents
  Internationalization
  Fixed versus fluid
  Permanent versus transient
  Indexing

Standards Organizations
  American National Standards Institute (ANSI)
  International Organization for Standardization (ISO)

Representing Characters
  EBCDIC
    Extended Binary Coded Decimal Interchange Code
    Represented in 8 bits
  ASCII (1968)
    American Standard Code for Information Interchange
    Represented with 7 bits
    Does not support many foreign languages
    Many expansions made to the basic ASCII character set
  ISCII (1983)
    Indian Script Code for Information Interchange
    Hindi and related languages
  GB and Big-5 for Chinese

Unicode
  Successor of ASCII
  ISO 10646 (1993)
  Universal
    Aims to represent all the world's languages
    Default encoding for HTML and XML
    Development began in 1988 as a joint effort between Apple and Xerox
    Unicode standard continues to evolve
  Round-trip compatibility – text in an established character set can be converted to Unicode and back without loss

Unicode Character Set
  Unicode standard is massive
    Two subsets of standard: ISO 10646-1/2
    94,000 characters defined
  Represents scripts
    Scripts versus languages
    Punctuation shared among scripts
  Universal character set – characters at the core of Unicode

Five Zones of Unicode
  Alphabetic scripts (Western languages, Latin, Greek, Cyrillic, Hebrew and Arabic)
  Ideographic scripts (Chinese, Japanese, Korean)
  Other characters (Braille, mathematical symbols)
  Surrogates
  Reserved codes

Composite and Combining Characters

Unicode Terms
  Character: abstract form of a letter
  Glyph: a particular rendition of a character on a page
    Different fonts produce different glyphs
    Unicode does not distinguish between different glyphs
    Characters are abstract members of linguistic scripts, not graphic entities
  Code Point: a Unicode value, written as U+ followed by hexadecimal digits (e.g. U+00FC)
    Includes ligatures and combining diaeresis
    Canonical and compatibility equivalence
    Deprecated characters
  Code Range: range of values that characters span
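
The terms on this slide can be tried out with Python's standard `unicodedata` module. This is a minimal sketch and not part of the original slides: the sample strings and the helper name `code_points` are illustrative choices.

```python
import unicodedata

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# One abstract character, two code point sequences:
# a precomposed letter versus a base letter plus a combining diaeresis.
precomposed = "\u00FC"     # u with diaeresis as a single code point
combining = "u\u0308"      # "u" followed by COMBINING DIAERESIS

print(code_points(precomposed))   # ['U+00FC']
print(code_points(combining))     # ['U+0075', 'U+0308']

# Canonical equivalence: the raw sequences differ, but normalizing
# to NFC (composed) or NFD (decomposed) makes them identical.
print(precomposed == combining)                                 # False
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combining)   # True

# Compatibility equivalence: the ligature U+FB01 is only replaced by
# the two letters "fi" under the compatibility forms (NFKC/NFKD).
ligature = "\uFB01"
print(unicodedata.normalize("NFC", ligature) == ligature)   # True
print(unicodedata.normalize("NFKC", ligature))              # fi
```

For a digital library, the practical consequence is that text should be normalized to one form before indexing, otherwise a query typed with a precomposed character may miss documents that store the combining sequence.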

Unicode Character Encoding
  UTF: Unicode character set Transformation Format
  UTF-32
    ISO standard uses a 32 bit (4 byte) value
    Unicode Consortium uses 21 bits
      17 planes of 65,536 code points each (a 5-bit plane number plus a 16-bit offset; only planes 0 to 16 are defined)
        Basic multilingual plane (living languages)
        Supplementary multilingual plane (historic scripts, other alphabets)
        Supplementary ideographic plane (ancient Chinese ideographs)
        Supplementary special-purpose plane (tags for languages)
  UTF-16, UTF-8
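
The relationship between planes and the three transformation formats can be checked with Python's built-in codecs. This is an illustrative sketch: the function names `describe` and `surrogate_pair` and the sample characters are assumptions made for the example, not material from the slides.

```python
def describe(ch):
    """Show how one character is stored in UTF-8, UTF-16 and UTF-32."""
    cp = ord(ch)
    plane, offset = divmod(cp, 0x10000)   # plane number and 16-bit offset
    return {
        "code_point": f"U+{cp:04X}",
        "plane": plane,
        "offset": f"{offset:04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-be")),   # -be: no byte order mark
        "utf32_bytes": len(ch.encode("utf-32-be")),
    }

def surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need surrogates")
    v = cp - 0x10000                       # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

# Latin letter, Devanagari letter, Chinese ideograph, plane-1 character.
for ch in ["A", "\u0939", "\u4E2D", "\U00010400"]:
    print(describe(ch))

high, low = surrogate_pair(0x10400)
print(f"{high:04X} {low:04X}")   # D801 DC00
```

UTF-32 always uses four bytes, UTF-16 uses two bytes inside the basic multilingual plane and a surrogate pair (four bytes) elsewhere, and UTF-8 uses one to four bytes depending on the code point.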
Hindi and related scripts (ISCII)

Representing Documents
  Plain text
  Full-text indexing
    Bag of words
    Inversion of the text
    Inverted files
    Granularity of document
    Granularity of index
  Word segmentation
    Chinese and Japanese are written without spaces
    Where the word boundaries are placed in a Chinese sentence can completely change its meaning
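
Full-text indexing by inversion can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the implementation used by any particular digital library system: the function names, the whitespace-and-punctuation tokenizer, and the three sample documents are all hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())

def build_inverted_file(documents):
    """Map each word to {document id: [word positions in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index

def search_all(index, query):
    """Return ids of documents containing every query word (bag of words)."""
    postings = [set(index.get(word, {})) for word in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "Digital libraries store documents",
    2: "Documents are the raw material of digital libraries",
    3: "An inverted file lists the documents that contain each word",
}
index = build_inverted_file(docs)

print(dict(index["documents"]))                 # {1: [3], 2: [0], 3: [5]}
print(search_all(index, "digital documents"))   # [1, 2]
print(search_all(index, "inverted file"))       # [3]
```

Here the document granularity is the whole document and the index granularity is the word position. Dropping the position lists would give a coarser, smaller index that can still answer bag-of-words queries but not phrase queries. The tokenizer relies on spaces, so it would not work for Chinese or Japanese text without a separate segmentation step.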

Page Description Languages
  Device independence
  PostScript (also a programming language)
    First commercially developed page description language (1985)
    Fonts: Type 1, TrueType, OpenType
    Text extraction
    Using PostScript in a digital library

Page Description Languages
  Portable Document Format (PDF)
  Successor to PostScript
  PDF versus PostScript
    Not a full-scale programming language
    New features for interactive display
      Random access to pages
      Hierarchically structured content
      Navigation within a document
      Hyperlinks
  File format: header, objects, cross-references, trailer
  Searchable image option

Word-Processor Documents
  Rich Text Format
    240-page specification
    Document-level metadata
    Software available for conversion to HTML
  Native Word formats
    Binary
    Proprietary
  LaTeX format
    Typed formatting commands
    Non-proprietary

Representing Images
  Lossless image compression
    GIF
    PNG
    JPEG-Lossless
    JPEG-2000
  Lossy image compression
    JPEG
    Progressive refinement
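
The difference between lossless and lossy compression can be shown on a toy example. The sketch below uses DEFLATE from Python's `zlib` module, which is the compression method used inside PNG, for the lossless case. The lossy step is a deliberately crude bit-depth reduction chosen for illustration only; it is not how JPEG works (JPEG quantizes frequency coefficients), and the pixel values are made up.

```python
import zlib

# A tiny synthetic "image": 256 eight-bit grey levels (hypothetical data).
pixels = bytes([10, 10, 10, 12, 12, 200, 201, 203] * 32)

# Lossless: decompression restores every byte exactly.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
print(len(pixels), "->", len(packed), "bytes")
print(restored == pixels)   # True

def quantize(data, keep_bits=4):
    """Lossy step: keep only the top keep_bits bits of each pixel value."""
    shift = 8 - keep_bits
    return bytes((value >> shift) << shift for value in data)

# Lossy: nearby grey levels collapse to the same value and the
# original values cannot be recovered afterwards.
coarse = quantize(pixels)
print(list(coarse[:8]))     # [0, 0, 0, 0, 0, 192, 192, 192]
print(coarse == pixels)     # False
```

A lossy format trades this irreversible loss of detail for smaller files, which is why archival masters are usually kept in a lossless format.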

Representing Audio and Video
  Evolution of signals over time
  Sample rate
    Samples per second
  Multimedia compression
    Codec
    Asymmetry
    Redundancy
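
A short calculation shows why sample rate matters and why compression is needed. The CD audio parameters (44,100 samples per second, 16 bits, two channels) are standard; the video frame size of 352 x 240 pixels at 24 bits per pixel is an assumed example of low-resolution video, and the function names are illustrative.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def raw_video_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Bit rate of uncompressed video in bits per second."""
    return width * height * bits_per_pixel * frames_per_second

cd_audio = pcm_bit_rate(44_100, 16, 2)
print(cd_audio)        # 1411200 bits/s, about 1.4 Mbit/s

low_res_video = raw_video_bit_rate(352, 240, 24, 30)
print(low_res_video)   # 60825600 bits/s, about 61 Mbit/s

# Factor by which a codec must shrink the combined signal
# to fit a 1.5 Mbit/s channel.
print(round((cd_audio + low_res_video) / 1_500_000, 1))
```

Uncompressed CD audio alone nearly fills a 1.5 Mbit/s channel, so delivering audio and video together at that rate depends on the codec removing redundancy from both signals.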

MPEG
  ISO Moving Picture Experts Group (1988)
  Audio and Video at 1.5 Mbit/second
  Family of standards
  MPEG-1
    Low resolution video, 30 fps, near CD quality
    Layer 3 – MP3
  MPEG-2
    Higher quality video (DVD)
    Supports interlaced images (Broadcast TV)
    Multichannel audio

MPEG
  MPEG-3 abandoned
  MPEG-4
    Low bandwidth networks – mobile and WWW
    Object based (vs. signal based)
    Interactive
    Strategies for identifying and managing intellectual property
  MPEG-7
    Metadata description for content delivered via MPEG-1, 2, 4
  MPEG-21
    Multimedia lifecycle
    Interoperability

MPEG
  Television standards: NTSC, PAL, SECAM
  Digital television and video standard: CCIR 601
  MPEG video
    Frames: intra (I), predicted (P), bidirectional (B)
  MPEG audio
    Acoustic masking
    Three compression layers
  Mixed media
    Time-stamped packets are multiplexed into a single stream
    Typically, video takes 1.2 Mbit/s and audio takes 0.3 Mbit/s
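
The idea of multiplexing time-stamped packets can be sketched with a timestamp-ordered merge. This is a simplified illustration, not the actual MPEG systems layer: the packet tuples, the millisecond timestamps, and the function name `multiplex` are hypothetical.

```python
import heapq

def multiplex(*streams):
    """Merge packet streams, each already sorted by timestamp, into one stream."""
    return list(heapq.merge(*streams, key=lambda packet: packet[0]))

# (timestamp in milliseconds, stream name, payload label)
video = [(0, "video", "I"), (33, "video", "B"), (67, "video", "B"), (100, "video", "P")]
audio = [(0, "audio", "a0"), (26, "audio", "a1"), (52, "audio", "a2"), (78, "audio", "a3")]

stream = multiplex(video, audio)
print([timestamp for timestamp, _, _ in stream])
# [0, 0, 26, 33, 52, 67, 78, 100]

# Bit budget from the slide: 1.2 Mbit/s video plus 0.3 Mbit/s audio.
total_bits_per_second = 1_200_000 + 300_000
megabytes_per_hour = total_bits_per_second * 3600 / 8 / 1_000_000
print(total_bits_per_second)   # 1500000
print(megabytes_per_hour)      # 675.0
```

A player reads the single stream, routes each packet to the audio or video decoder by its stream name, and uses the timestamps to keep sound and picture synchronized.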

Other Multimedia Formats
  Audio and Video
    AVI (Microsoft)
    QuickTime (Apple)
  Streaming
    RealAudio, RealVideo, RealOne (RealNetworks)
    ASF (Microsoft)
  Audio only
    WAV (Microsoft, IBM)
    AIFF (Apple)
    AU (Sun)

Multimedia in a Digital Library
  Indexing and browsing structures
    Text-based
    Content-based
  Summarizing audio and video
  Digitizing media
    Linear resolution, color depth, frame rate, sample rate
    Preservation issues
