Transcript CHAPTER 3
Chapter Four
Documents: The raw material
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Documents
  Building blocks of digital libraries
  Many different standards for documents
  Internationalization
  Fixed versus fluid
  Permanent versus transient
  Indexing

Standards Organizations
  American National Standards Institute (ANSI)
  International Organization for Standardization (ISO)

Representing Characters
  EBCDIC: Extended Binary Coded Decimal Interchange Code
    Represented in 8 bits
  ASCII (1968): American Standard Code for Information Interchange
    Represented with 7 bits
    Does not support many foreign languages
    Many expansions made to the basic ASCII character set
  ISCII (1983): Indian Script Code for Information Interchange
    Hindi and related languages
  GB and Big-5 for Chinese

Unicode
  Successor of ASCII
  ISO-10646 (1993)
  Universal: aims to represent ALL the world’s languages
  Default encoding for HTML and XML
  Development began in 1988 as a joint effort between Apple and Xerox
  Unicode standard continues to evolve
  Round-trip compatibility: Unicode can be mapped to/from any character set without loss

Unicode Character Set
  Unicode standard is massive
  Two subsets of standard: ISO 10646-1/2
  94,000 characters defined
  Represents scripts
    Scripts versus languages
    Punctuation shared among scripts
  Universal character set: characters at the core of Unicode

Five Zones of Unicode
  Alphabetic scripts (Western languages, Latin, Greek, Cyrillic, Hebrew and Arabic)
  Ideographic scripts (Chinese, Japanese, Korean)
  Other characters (Braille, mathematical symbols)
  Surrogates
  Reserved codes

Composite and Combining Characters

Unicode Terms
  Character: abstract form of a letter
  Glyph: a particular rendition of a character on a page
  Code Point: a Unicode value, specified by prefixing U+
  Different fonts, different glyphs
  Unicode does not distinguish between different glyphs
  Characters are abstract members of linguistic scripts, not graphic entities
  Includes ligatures and combining diaeresis
  Canonical and compatibility equivalence
  Deprecated characters
  Code Range: range of values that characters span

Unicode Character Encoding
  UTF: Unicode character set Transformation Format
  UTF-32
    ISO standard uses a 32 bit (4 byte) value
    Unicode consortium uses 21 bits
    32 planes (5 bits) of 65,536 characters (16 bits)
      Basic multilingual plane (living languages)
      Supplementary multilingual plane (historic scripts, other alphabets)
      Supplementary ideographic plane (ancient Chinese ideographs)
      Supplementary special-purpose plane (tags for languages)
  UTF-16, UTF-8
  Hindi and related scripts (ISCII)

Representing Documents
  Plain text
  Full-text indexing
    Bag of words
    Inversion of the text
    Inverted files
    Granularity of document
    Granularity of index
  Word segmentation
    Chinese and Japanese are written without spaces
    Spacing in Chinese sentences can completely change the meaning

Page Description Languages
  Device independence
  PostScript (also a programming language)
    First commercially developed page description language (1985)
    Fonts: Type 1, TrueType, OpenType
    Text extraction
    Using PostScript in a digital library

Page Description Languages
  Portable Document Format (PDF)
    Successor to PostScript
    PDF versus PostScript
      Not a full-scale programming language
      New features for interactive display
      Random access to pages
      Hierarchically structured content
      Navigation within a document
      Hyperlinks
    File format: header, objects, cross-references, trailer
    Searchable image option

Word-Processor Documents
  Rich Text Format
    240 page specification
    Document-level metadata
    Conversion to HTML software available
  Native Word formats
    Binary
    Proprietary
  LaTeX format
    Typed formatting commands
    Non-proprietary

Representing Images
  Lossless image compression
    GIF
    PNG
    JPEG-Lossless
    JPEG-2000
  Lossy image compression
    JPEG
    Progressive refinement

Representing Audio and Video
  Evolution of signals over time
  Sample rate
    Samples per second
  Multimedia compression
    Codec
    Asymmetry
    Redundancy

MPEG
  ISO Moving Picture Experts Group (1988)
  Audio and video at 1.5 Mbit/second
  Family of standards
  MPEG-1
    Low resolution video, 30 fps, near CD quality
    Layer 3: MP3
  MPEG-2
    Higher quality video (DVD)
    Supports interlaced images (broadcast TV)
    Multichannel audio

MPEG
  MPEG-3 abandoned
  MPEG-4
    Low bandwidth networks: mobile and WWW
    Object based (vs. signal based)
    Interactive
    Strategies for identifying and managing intellectual property
  MPEG-7
    Metadata description for content delivered via MPEG-1, 2, 4
  MPEG-21
    Multimedia lifecycle
    Interoperability

MPEG
  Television standards: NTSC, PAL, SECAM
  Digital television and video standard: CCIR601
  MPEG video
    Frames: intra (I), predicted (P), bidirectional (B)
  MPEG audio
    Acoustic masking
    Three compression layers
  Mixed media
    Time-stamped packets are multiplexed into a single stream
    Typically, video takes 1.2 Mbit/s and audio takes 0.3 Mbit/s

Other Multimedia Formats
  Audio and video
    AVI (Microsoft)
    QuickTime (Apple)
  Streaming
    RealAudio, RealVideo, RealOne (Realsystems)
    ASF (Microsoft)
  Audio only
    WAV (Microsoft, IBM)
    AIFF (Apple)
    AU (Sun)

Multimedia in a Digital Library
  Indexing and browsing structures
    Text-based
    Content-based
  Summarizing audio and video
  Digitizing media
    Linear resolution, color depth, frame rate, sample rate
  Preservation issues
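
Worked example for the Unicode Terms and Unicode Character Encoding slides

The points about code points, planes, and the UTF-32, UTF-16 and UTF-8 encodings can be checked with a few lines of Python. This is only a rough sketch and not part of the original slides: the helper names `describe` and `equivalent` are made up for illustration, and the sample characters are arbitrary choices. It assumes that a plane is simply the code point shifted right by 16 bits, which matches the "planes of 65,536 characters" description. Note that although 21 bits could in principle address 32 planes, Python's `sys.maxunicode` is 0x10FFFF, which lies in plane 16.

```python
import sys
import unicodedata


def describe(ch: str) -> dict:
    """Summarize how one character is numbered and encoded (illustrative helper)."""
    cp = ord(ch)  # the code point as an integer
    return {
        "code_point": f"U+{cp:04X}",                 # conventional U+ notation
        "plane": cp >> 16,                           # each plane holds 65,536 code points
        "bits_needed": cp.bit_length(),              # never more than 21 for valid code points
        "utf8_bytes": len(ch.encode("utf-8")),       # 1 to 4 bytes
        "utf16_bytes": len(ch.encode("utf-16-be")),  # 2 bytes, or 4 with a surrogate pair
        "utf32_bytes": len(ch.encode("utf-32-be")),  # always 4 bytes
    }


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after Unicode normalization (illustrative helper).

    NFC applies canonical equivalence only; NFKC also applies
    compatibility equivalence (for example, ligatures).
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # Latin letter, accented letter, CJK ideograph, musical symbol (plane 1)
    for ch in ["A", "\u00e9", "\u4e2d", "\U0001D11E"]:
        print(describe(ch))

    # u + combining diaeresis versus the precomposed character
    print(equivalent("u\u0308", "\u00fc"))
    # the "fi" ligature versus the two letters, canonical then compatibility
    print(equivalent("\ufb01", "fi", "NFC"), equivalent("\ufb01", "fi", "NFKC"))
    print(hex(sys.maxunicode), sys.maxunicode.bit_length())
```

The ligature comparison is the interesting case: under canonical normalization the ligature and the two separate letters stay distinct, while compatibility normalization treats them as the same text, which matters when deciding how a full-text index should fold characters.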