WMES3304/WXGB5009 : INFORMATION RETRIEVAL


WMES3103 : INFORMATION
RETRIEVAL
TEXT AND MULTIMEDIA
LANGUAGES AND PROPERTIES
INTRODUCTION




Text is the main form of communicating data
and information
Text is often supplemented with multimedia
elements to make the contents of an IRS
(information retrieval system) more attractive
and interactive
A website that combines text and
multimedia tends to attract more visitors
than one which is text-based only
In an IRS, text and multimedia are represented via
special languages
Metadata
A newer concept in information handling – metadata
 Information about the organization of the data, the
data domains and the relationships between them
 Data about data
 2 types – descriptive and semantic



Descriptive metadata – metadata that describes
a document or one unit of information
Commonly used descriptive metadata:





Authors
Date of publication
Source of publication
Length of document
Type of document
Metadata
Semantic metadata – describes the subject matter that
can be obtained from the contents of the
document
 Subject headings
 Keywords
 LC (Library of Congress) classification codes
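The two metadata types above can be kept as separate parts of one document record. The following is only a minimal sketch: the class names, field names and sample values (including the LC code) are illustrative assumptions, not part of the course material.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DescriptiveMetadata:
    """Facts about the document itself (hypothetical field names)."""
    authors: List[str]
    publication_date: str
    source: str
    length_pages: int
    document_type: str


@dataclass
class SemanticMetadata:
    """Subject information derived from the document's contents."""
    subject_headings: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    lc_code: str = ""


@dataclass
class DocumentRecord:
    descriptive: DescriptiveMetadata
    semantic: SemanticMetadata


def matches_keyword(record: DocumentRecord, term: str) -> bool:
    """Case-insensitive exact match of a query term against the keywords."""
    term = term.lower()
    return any(term == keyword.lower() for keyword in record.semantic.keywords)


# Sample values are made up for illustration only.
record = DocumentRecord(
    descriptive=DescriptiveMetadata(
        authors=["A. Author"],
        publication_date="1999-01-01",
        source="Example Press",
        length_pages=12,
        document_type="report",
    ),
    semantic=SemanticMetadata(
        subject_headings=["Information retrieval"],
        keywords=["metadata", "indexing"],
        lc_code="Z699",  # illustrative placeholder value
    ),
)

print(matches_keyword(record, "Metadata"))
```

Keeping the two groups apart means a search by subject only has to look at the semantic part of the record.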

TEXT




With computers, we need to code text into
binary digits
First coding schemes – EBCDIC (8 bits per
symbol) and ASCII (7 bits per symbol)
Then, ASCII was extended to 8 bits
(e.g. ISO-Latin) to
accommodate other languages, accents and
diacritical marks
Languages with large character sets (e.g. Chinese,
Japanese, Korean) – Unicode – originally 16 bits per symbol
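The coding schemes above can be tried directly with Python's built-in codecs. This is a small sketch; the helper names to_bits and fits_in_7_bits are made up for illustration.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


def fits_in_7_bits(text: str) -> bool:
    """True if every character has a code below 128 (plain ASCII)."""
    return all(ord(ch) < 128 for ch in text)


# "A" is code 65 in ASCII and needs only 7 bits.
ascii_bytes = "A".encode("ascii")

# U+00E9 (e with acute accent) is outside 7-bit ASCII
# but fits in one byte of ISO Latin-1.
latin1_bytes = "\u00e9".encode("latin-1")

# U+4E2D (a Chinese character) takes one 16-bit unit in UTF-16.
utf16_bytes = "\u4e2d".encode("utf-16-be")

print(to_bits(ascii_bytes))    # 01000001
print(to_bits(latin1_bytes))   # 11101001
print(to_bits(utf16_bytes))    # 01001110 00101101
print(fits_in_7_bits("A"), fits_in_7_bits("\u00e9"))  # True False
```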
TEXT
Formats
 No single format exists for text documents
 A good IRS should be able to retrieve
information from any format
 Initially, an IRS would convert each document to an
internal format, but this had many disadvantages
 Now, many formats have been developed for
document interchange
TEXT






RTF – Rich Text Format for word processing
PDF – Portable Document Format for displaying
and printing documents
PostScript – powerful programming language for
drawing
MIME – Multipurpose Internet Mail Extensions to
encode e-mail
Files are often compressed – Compress (Unix), ARJ
(PCs), ZIP
Binary files can be converted to ASCII text –
uuencode/uudecode, binhex
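The last two points, compression and binary-to-ASCII conversion, can be sketched with the Python standard library. Assumptions: zlib (DEFLATE, the method commonly used inside ZIP files) stands in for the compressors named above, and base64 is shown as the encoding typically used by MIME. The payload and text values are arbitrary test data.

```python
import base64
import binascii
import zlib

# Arbitrary binary data; one uuencoded line holds at most 45 bytes.
payload = bytes(range(30))

# uuencode-style conversion of one line of binary data to ASCII text.
uu_line = binascii.b2a_uu(payload)
restored_from_uu = binascii.a2b_uu(uu_line)

# Base64, the binary-to-text encoding commonly used in MIME e-mail.
b64_text = base64.b64encode(payload)
restored_from_b64 = base64.b64decode(b64_text)

# Repetitive text compresses well with DEFLATE.
text = ("information retrieval " * 50).encode("ascii")
compressed = zlib.compress(text)

print(uu_line.decode("ascii").rstrip("\n"))
print(b64_text.decode("ascii"))
print(len(text), "bytes ->", len(compressed), "bytes")
```

Both encodings make the data larger, while compression makes it smaller, which is why files are usually compressed first and encoded afterwards.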
MARKUP LANGUAGES






Markup = extra textual syntax that can be used to
describe formatting actions, structure
information, text semantics, attributes, etc.
Formal markup languages are more structured
Marks = tags – an initial and an ending tag surround
the marked text
Standard metalanguage = SGML (Standard Generalized
Markup Language)
Newer metalanguage for the Web = XML (eXtensible
Markup Language) = subset of SGML
Most popular markup language used for the Web
= HTML (HyperText Markup Language)
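As a small illustration of tags surrounding marked text, the sketch below parses an XML fragment with Python's xml.etree.ElementTree. The element and attribute names (document, title, author, body, em, type) are invented for this example.

```python
import xml.etree.ElementTree as ET

# Each piece of marked text sits between an initial tag and an ending tag.
doc = """<document type="report">
  <title>Text and Multimedia</title>
  <author>A. Author</author>
  <body>Markup adds <em>structure</em> to plain text.</body>
</document>"""

root = ET.fromstring(doc)

# Structure: the tags in document order.
tags = [element.tag for element in root.iter()]

# Attributes are stored on the initial tag.
doc_type = root.get("type")

# Stripping the markup recovers the plain text of the body.
plain_body = "".join(root.find("body").itertext())

print(tags)        # ['document', 'title', 'author', 'body', 'em']
print(doc_type)    # report
print(plain_body)  # Markup adds structure to plain text.
```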
MULTIMEDIA




Applications that handle different types of digital
data originating from distinct types of media
Text, sound, images, video
Digital data distinct and different in volume,
format, and processing requirements
Different types of formats necessary for storing
each type of media
MULTIMEDIA

Different formats used commonly on the
Web and in digital libraries





Images
Audio
Moving Images
Textual Images
Graphics and Virtual Reality
IMAGES






XBM, BMP, PCX – direct representation of a bitmapped (or pixel-based)
image
GIF (Graphics Interchange Format) – includes
compression and is good for black-and-white images or
images with a small number of colours or gray levels (up to 256)
JPEG (Joint Photographic Experts Group) – includes
compression
TIFF (Tagged Image File Format) – used to exchange
documents between different applications
and different computer platforms
TGA (Truevision Targa image file) – associated with
video game boards
Various other image formats
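An IRS that accepts several of the image formats above usually has to recognise which one it has been given. One common approach, sketched here as an assumption rather than anything prescribed by the course, is to check the first bytes of the file. The function name sniff_image_format is hypothetical.

```python
def sniff_image_format(header: bytes) -> str:
    """Guess an image format from the first bytes of a file."""
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "GIF"
    if header.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if header.startswith(b"BM"):
        return "BMP"
    # TIFF files start with a byte-order mark: "II" (little-endian)
    # or "MM" (big-endian), followed by the number 42.
    if header.startswith((b"II*\x00", b"MM\x00*")):
        return "TIFF"
    return "unknown"


print(sniff_image_format(b"GIF89a" + bytes(10)))        # GIF
print(sniff_image_format(b"\xff\xd8\xff\xe0" + bytes(10)))  # JPEG
print(sniff_image_format(b"plain text"))                # unknown
```

In practice the header would be read from the file, for example with open(path, "rb").read(16), where path is whatever file is being indexed.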
AUDIO




Must be digitized before storage
AU, MIDI (standard format to interchange music
between electronic instruments and computers),
WAVE – for small pieces of digital audio
Audio libraries – RealAudio or CD formats
Animation or moving pictures


MPEG (Moving Picture Experts Group) – related to
JPEG
Others – AVI, FLI, QuickTime
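The requirement that audio must be digitized before storage can be illustrated as follows. This is a rough sketch under assumed parameters (a 440 Hz tone, 8000 samples per second, 16-bit samples, mono); the function names are made up. It samples a sine wave, quantizes each sample to a 16-bit integer and stores the result in WAVE format in memory.

```python
import io
import math
import struct
import wave


def digitize_tone(freq_hz: float = 440.0, seconds: float = 0.5,
                  sample_rate: int = 8000) -> bytes:
    """Sample a sine wave and quantize each sample to a signed 16-bit value."""
    frames = bytearray()
    for i in range(int(seconds * sample_rate)):
        amplitude = math.sin(2 * math.pi * freq_hz * i / sample_rate)
        frames += struct.pack("<h", int(amplitude * 32767))
    return bytes(frames)


def to_wav_bytes(pcm: bytes, sample_rate: int = 8000) -> bytes:
    """Wrap raw 16-bit mono samples in a WAVE container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


pcm_data = digitize_tone()
wav_data = to_wav_bytes(pcm_data)
print(len(pcm_data), "bytes of samples,", len(wav_data), "bytes as WAVE")
```

Half a second of sound already needs 8000 bytes at this modest quality, which shows why audio volume and format differ so much from text.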
TEXTUAL IMAGES





Images that contain mainly typed or typeset text
Obtained by scanning the documents
For archival purposes
Saved as images but with further compression
Textual and non-textual parts are stored and compressed
separately and, when needed, can be combined and
displayed together
GRAPHICS AND VIRTUAL REALITY





3-dimensional graphics are found on the Web
CGM (Computer Graphics Metafile) standard
Metafile = collection of elements
CGM standard specifies which elements are
allowed to occur in which positions in a metafile
VRML (Virtual Reality Modeling Language) –
file format for describing interactive 3D objects
and worlds - universal interchange format for 3D
graphics and multimedia - can be used for various
applications
MULTIMEDIA DOCUMENTS
MARKUP


HyTime = Hypermedia/Time-based Structuring
Language – standard defined for
multimedia document markup
An SGML architecture that specifies the
generic hypermedia structure of documents