Transcript Chapter 1

Document Computing
Technologies for Managing
Electronic Document Collections
Ross Wilkinson ... [et al.]
Circulation Counter [RES3H] ZA4080 .D63 1998
Chapter 1
Document Lifecycle
What is a document?
A document records a message
from people to people.
Characteristics of a document
• Content
• Structure
• Metadata
Metadata
• A message has a context, which is
important for understanding the message.
• A document contains not only the contents
of a message, but also some information
about the document, e.g. author, date,
recipients.
• We call such information the metadata
about the document.
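The three characteristics can be pictured as fields of one record. The following is only a minimal sketch: the class name `Document`, its field names, and the sample values (borrowed from the memo example used later in these slides) are illustrative assumptions, not part of any particular document management system.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record with the three characteristics of a document."""
    content: str                                   # the message itself
    structure: list = field(default_factory=list)  # names of the logical parts
    metadata: dict = field(default_factory=dict)   # information about the document


memo = Document(
    content="I have finished Stage B of the design. Could you take a look at it?",
    structure=["head", "body"],
    metadata={"author": "Kate M.", "date": "7/8/98", "recipients": ["John D."]},
)

# The metadata can be read without touching the content.
print(memo.metadata["author"])
```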
Why Document Management?
• It is hard to find documents.
• It is hard to organize documents.
• It is hard to control documents.
• Metadata helps document management.
Benefits of Document Management
• Location-independent delivery of
documents upon demand
• Controlled access to documents
• A record of the life of a document
• Better re-use of documents
Chapter 2
Electronic Document Description
Document Content
• Simplest type of content – unformatted
text
• Text retrieval system based on search by
keywords
• E.g. Windows Desktop Search (video)
• Optical character recognition (OCR)
system
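Keyword-based retrieval over unformatted text can be sketched with a small inverted index. This is an assumption-laden illustration, not how Windows Desktop Search or any other product works: the function names `tokenize`, `build_index` and `search`, and the two sample documents, are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical two-document collection.
docs = {
    "memo1": "I have finished Stage B of the design.",
    "memo2": "Could you take a look at the budget?",
}
index = build_index(docs)
print(search(index, "design stage"))
```

The search treats a query as "all keywords must appear"; ranking, stemming and phrase search are left out on purpose.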
Document Structure
• Even unformatted text has some
structures, e.g. lines, words, images, etc.
• A document may have elaborate
structures.
• Two levels of structures:
– Logical structure
– Presentational structure
Logical structures
• Example:
TO: John D.
FROM: Kate M.
DATE: 7/8/98
I have finished Stage B of the design. Could you take a look at it?
• Simple logical structure: lines of text
• A logical structure of a memo: (see next
slide)
A logical structure for a memo
Memo
  Head
    Sender
    Receiver
    Date
  Body
    Paragraph
Presentational Structure
• A different presentational structure for the
same memo
John D., 7/8/98
I have finished Stage B of the design. Could you take a look at it?
Kate M.
Presentation medium
• The content of the same document can be
presented in different media with different
presentational structures:
• E.g. a PDF file vs. an online Web page
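The separation between logical and presentational structure can be sketched by keeping the memo as one logical record and rendering it in two ways. The dictionary keys and the two function names below are assumptions chosen for illustration; the two layouts are the ones shown on the earlier slides.

```python
# One logical memo (keys are hypothetical names for the logical parts).
memo = {
    "sender": "Kate M.",
    "receiver": "John D.",
    "date": "7/8/98",
    "paragraphs": [
        "I have finished Stage B of the design. Could you take a look at it?"
    ],
}


def render_header_style(m):
    """First presentation: TO/FROM/DATE header lines, then the body."""
    head = [f"TO: {m['receiver']}", f"FROM: {m['sender']}", f"DATE: {m['date']}"]
    return "\n".join(head + m["paragraphs"])


def render_letter_style(m):
    """Second presentation: receiver and date first, sender at the end."""
    lines = [f"{m['receiver']}, {m['date']}"] + m["paragraphs"] + [m["sender"]]
    return "\n".join(lines)


print(render_header_style(memo))
print()
print(render_letter_style(memo))
```

Both functions read the same record, so changing the presentation never requires changing the content.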
Metadata
• Generally, we need metadata to capture:
– Registration information
– Usage information
– Structural properties
– Contextual information
– Content description
– Historical information
The Dublin Core metadata set
• Title
• Creator
• Subject
• Description
• Publisher
• Contributors
• Date
• Type
• Format: e.g. HTML, PDF
• Identifier: e.g. URI
• Source
• Language
• Relation
• Coverage: duration
• Rights: e.g. copyright
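A Dublin Core description can be held as simple name/value pairs and written out as HTML META tags. The sketch below assumes the `DC.` name prefix that the HTML memo later in these slides uses; the function name `to_meta_tags` and the sample record are hypothetical.

```python
from html import escape


def to_meta_tags(record):
    """Turn Dublin Core name/value pairs into HTML META tags."""
    return [
        f'<META NAME="DC.{escape(name)}" CONTENT="{escape(value)}">'
        for name, value in record.items()
    ]


# Hypothetical description of the memo, using a few of the elements above.
dublin_core = {
    "Title": "Memo",
    "Creator": "Kate M",
    "Date": "7/8/1998",
    "Format": "HTML",
}

for tag in to_meta_tags(dublin_core):
    print(tag)
```

`html.escape` is applied so that quotes or angle brackets in a value cannot break the generated tag.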
Document Description Language
(DDL)
• For use by document management system
• E.g. RTF, PostScript, SGML
• DDL support:
– Language support, media support, transparency,
structure, link support, metadata support
• Other DDL characteristics:
– Document creation, import conversion, export
transformation, update, presentation quality,
presentation flexibility, etc.
Examples of DDLs
• ASCII (American Standard Code for
Information Interchange)
• Unicode
• ASCII and Unicode offer very limited
support
• Rich Text Format
• TeX and LaTeX
• SGML, HTML, XML
• PostScript, PDF
Rich Text Format (RTF)
• Developed by Microsoft
• For interchange between Microsoft Word
and other software
• Main purposes:
– Preserve information in Word (blocks of text)
• Example: next slide
{\rtf1\adeflang1025\ansi\ansicpg1252\uc2\adeff0\deff0\stshfdbch13\stshfloch0
\stshfhich0\stshfbi0\deflang2057\deflangfe1028{\fonttbl{\f0\froman\fcharset0\fprq2
{\*\panose 02020603050405020304}Times New Roman
…
{\title John D}{\author Dr. Yeung}{\operator Dr. Yeung}
{\creatim\yr2008\mo3\dy18\hr15\min24}{\revtim\yr2008\mo3\dy18\hr15\min25}
{\version1}{\edmins1}{\nofpages1}{\nofwords14}{\nofchars81}
{\*\company Lingnan University}{\nofcharsws94}
…
\ltrch\fcs0 \insrsid1782868\charrsid1782868 \hich\af0\dbch\af13\loch\f0 John D.,
7/8/98
\par \hich\af0\dbch\af13\loch\f0 I have finished Stage B of the design. Could you
take a look at it?
\par
\par \hich\af0\dbch\af13\loch\f0 Kate M\hich\af0\dbch\af13\loch\f0 .
\par }\pard \ltrpar\ql
\li0\ri0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0
{\rtlch\fcs1 \af0 \ltrch\fcs0 \insrsid4811147
\par }}
TeX and LaTeX
• TeX created by Donald Knuth
• TeX is a typesetting software.
• LaTeX created based on TeX by Leslie
Lamport
• LaTeX uses markup constructs to separate
logical description from presentation.
• LaTeX example: see next slide
\documentclass{article}
\usepackage{times}
\pagestyle{empty}
\begin{document}
\title{Sample Document}
\author{
W. L. Yeung\\Department of Computing and Decision Sciences\\
Lingnan University, Hong Kong\\[email protected]}
\maketitle
\section{Introduction}
…
\section{Conclusion}
…
\end{document}
SGML
• Standard Generalized Markup Language
• To describe a document in SGML, we
need:
– An SGML declaration
– A document type definition (DTD)
– A document instance
• An SGML declaration specifies which
characters are used in the DTD. Normally
a default is used.
SGML (cont.)
• A document type definition (DTD) defines
the rules for forming a class of documents,
i.e. the grammar of a document class.
• The building blocks of SGML documents
are elements.
• A DTD for the memo document: next slide.
<!-- DTD for office memo -->
<!--     ELEMENT  CONTENT -->
<!ELEMENT memo  - - (head, body, close?) >
<!ELEMENT head  O O (to & from & date) >
<!ELEMENT to    - - (#PCDATA) >
<!ELEMENT from  - - (#PCDATA) >
<!ELEMENT date  - - (#PCDATA) >
<!ELEMENT body  - - (par+) >
<!ELEMENT par   - - (#PCDATA) >
<!ELEMENT close - - (#PCDATA) >
<!--     ELEMENT  NAME    VALUE      DEFAULT -->
<!ATTLIST memo    status  (con|pub)  pub >
<!ATTLIST par     id      ID         #IMPLIED >
DTD
• An element definition gives the name of
the element, then the rules for building that
element.
• Elements can contain other elements.
• Terminal (basic) elements often consist of
parsed character data “#PCDATA” or
unparsed character data “CDATA”.
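The idea that a DTD is a grammar can be sketched by hand-translating one content model into a regular expression over child element names. This is only an illustration of the principle, not an SGML parser: the table `CONTENT_MODELS` and the function `children_match` are hypothetical, and only the memo rule `(head, body, close?)` is covered.

```python
import re

# The content model (head, body, close?) written as a regular expression
# over a sequence of child names, each followed by one space.
CONTENT_MODELS = {
    "memo": r"(head )(body )(close )?",
}


def children_match(element, children):
    """Check a list of child element names against the element's content model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[element], sequence) is not None


print(children_match("memo", ["head", "body"]))           # True
print(children_match("memo", ["head", "body", "close"]))  # True
print(children_match("memo", ["body", "head"]))           # False
```

The comma in a content model means "in this order" and `?` means "optional", which is why the regular expression is a plain sequence with one optional group.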
The memo in SGML
<MEMO>
<TO> John D </TO>
<FROM> Kate M </FROM>
<DATE> 7/8/1998 </DATE>
<BODY>
<PAR>
I have finished Stage B of the design.
</PAR>
</BODY>
</MEMO>
HTML
• Hypertext Markup Language
• For World Wide Web (WWW) documents
• Conforms to an SGML DTD
• HTML is presentation oriented:
instructions (tags) are inserted into a
document for presentation effects
• The DTD for HTML is available on
http://www.w3.org/TR/html401/sgml/dtd.html
The memo in HTML
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML>
<HEAD>
<TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD>
<BODY>
<H1>Memo</H1>
<P>I have finished Stage B of the <A
HREF="/team3/design2">design</A>.
</P>
</BODY>
</HTML>
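Metadata embedded in an HTML page can be read back by a program. The sketch below uses `html.parser.HTMLParser` from the Python standard library; the class name `MetaCollector` and the shortened copy of the memo page are assumptions made for the example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name is not None:
                self.meta[name] = attributes.get("content")


# Shortened copy of the memo page, for illustration only.
page = """<HTML>
<HEAD>
<TITLE>Memo</TITLE>
<META NAME="DC.AUTHOR" CONTENT="Kate M">
<META NAME="DC.DATE" CONTENT="7/8/1998">
</HEAD>
<BODY><P>I have finished Stage B of the design.</P></BODY>
</HTML>"""

collector = MetaCollector()
collector.feed(page)
print(collector.meta)
```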
XML
• Extensible Markup Language
• Three basic definitions:
– XML for representing data and documents
– XLink and XPointer for representing interdocument linking
– XSL for representing presentation
• XML is a near-subset of SGML
XML (Cont.)
• Two classes of XML documents:
– Valid XML documents: documents that conform to a
specific supplied DTD
– Well-formed documents: only satisfy a simple default
grammar, without conforming to a specific DTD
• XML has become the cornerstone of electronic
commerce as it allows businesses to exchange
electronic documents according to some
standard formats based on XML.
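The difference between the two classes of XML documents can be sketched with the Python standard library. `xml.etree.ElementTree` checks well-formedness only; it does not validate against a DTD, so checking validity would need a validating parser, which is not shown here. The function name `is_well_formed` and the two sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good = "<memo><to>John D</to><from>Kate M</from></memo>"
bad = "<memo><to>John D</memo>"   # the <to> element is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```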
PostScript
• Developed by Adobe
• For representing documents that are to be
printed (mainly on laser printers)
• A page description language optimized for
printing text, images, graphics.
Portable Document Format (PDF)
• Developed by Adobe
• A page description language for representing
text, graphics and images
• A PDF file contains presentation information on
pages, annotations, links, fonts, etc.
• Supports delivery of electronic documents exactly
as they would appear in printed form.
• Not designed for editing or document format
exchange.