IN350 Class 2: Document Properties and Markup Languages

August 30, 2001
Judith A. Molka-Danielsen
Reference: Parts of Chapter 6 handout,
Chapter 1: XML A Primer
Overview

Review Properties of Documents

Introduce the concept of Markup Languages.

Begin to talk about XML

Visit the Lab (room 076) and match student
groups with accounts on Oracle.

Create a table and populate it.

(In the BIG picture, this course IN350 is about
theories and issues for document processing
and management. Some of the practical
exercises will be to create indices and do
searches, and have nothing to do with XML.)
Classes of document processing

Text Processing: Initially computers were used to
do tedious repetitive calculations (billing
transactions) on information.

Often the calculations required preprocessing or
typesetting of text.

Other issues include information storage (with
compression algorithms to store data compactly),
storage methods (indexing), and approaches to
information retrieval.

Finally there was the preparation and processing
of text for presentation purposes.
Classes of document processing

Document Processing: In the 1980s, technologies
like the PC, Ethernet, laser printers, graphical
user interfaces with bitmap displays, and
object-based text processing allowed
individuals to process documents. A text processing
system called Scribe (by Brian Reid at CMU)
represented a new kind of processing.

In text processors like IBM's Script, the user
marked up text in terms of syntax characteristics,
such as "12 point bold courier".

Scribe, in contrast, formatted in terms of structural
characteristics, such as "heading". This was a
transition to document processing.
Classes of document processing



Hypertext Processing: In the 1990s we saw the
development of internetworks and ubiquitous
interfaces (windows).
Tim Berners-Lee at CERN, the European particle
physics laboratory, created HTML and the URL (Uniform
Resource Locator) so that a simple
standardized form of markup, based on SGML,
could be used to describe documents and a
naming scheme would allow for the universal
identification of documents.
So documents could be viewed in graphical
format and large collections linked across
multiple internets. This is hypertext processing.
Properties of Documents
Syntax - can express structure, presentation style,
semantics, and external actions. It can be implicit in the
contents of a document or expressed in a language.
Structure - tells how the elements relate to each other
within the document. A structural element like a section
can have a formatting style associated with it.
Presentation Style - is how the document is displayed
or printed. It can be embedded in the document, as
in TeX, or expressed through macros, as in LaTeX. Or it
can be defined separately, as CSS is for HTML documents.
Presentation style can be determined by the author (in
applications or languages) or the reader (Web browser).
Semantics - the meaning within a language; it can be
associated with use.
Characteristics continued...
Metadata - information about the organization of the
data. Data about the data. Examples are author,
publication date, and subject codes.
XML (Chapter 1: Structured Label Information)
There is a difference between Data and Documents.
• Documents are formatted.
WYSIWYG word processors have problems:
• They make documents that are for one output
medium (printer, online).
• Proprietary codes are used for both style and format.
• It is hard to convert old document collections
(for example, to merge LaTeX and Word).
• Formats like "headline" only mean a BIG font size
and carry no meaning within the document.
• People use too many options within a document
(30 fonts on a page).
Text and formats
File formats
Word processing formats that are binary formats
include Word and WordPerfect.
text - ASCII (American Standard Code for Information
Interchange), ANSI X3.4. Alternatively there is 16-bit
Unicode (ISO/IEC 10646).
raster graphics
• TIFF - Tagged Image File Format
• GIF - Graphics Interchange Format
• JPEG - Joint Photographic Experts Group
• An example of a vector graphics standard is
CGM (Computer Graphics Metafile).
printing - PostScript, PDF, EPS, PCL, LCDS, XML
printing formats, ISO/IEC 10180 Standard Page
Description Language (SPDL), ISO 8613 Open Document
Architecture (ODA)
Text and formats
File formats continued
multimedia
• MPEG (Moving Picture Experts Group)
• AVI (Audio Video Interleave)
email
• email header - RFC 822
• SMTP - Simple Mail Transfer Protocol,
RFC 821
• POP - Post Office Protocol
• IMAP - Internet Message Access Protocol (more
advanced than POP)
• MIME - Multipurpose Internet Mail Extensions
(attachments)
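As a rough illustration of how the RFC 822 header and MIME fit together, the sketch below builds a message with an attachment using Python's standard email package and then reads it back. The addresses, subject, and file name are invented for the example; nothing is actually sent over SMTP.

```python
from email import message_from_string
from email.message import EmailMessage

# Build a message. Addresses and file name are made up for illustration.
msg = EmailMessage()
msg["From"] = "teacher@example.org"
msg["To"] = "student@example.org"
msg["Subject"] = "IN350 notes"
msg.set_content("Slides for class 2 are attached.")

# Adding an attachment turns the message into a MIME multipart message.
msg.add_attachment(b"\x00\x01binary data", maintype="application",
                   subtype="octet-stream", filename="slides.bin")

raw = msg.as_string()              # plain-text form: header lines, then body parts
parsed = message_from_string(raw)  # read it back as a receiving program would

print(parsed["Subject"])
part_types = [part.get_content_type() for part in parsed.walk()]
print(part_types)
```

The header fields (From, To, Subject) follow the RFC 822 style, while the content types listed at the end come from MIME.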
Text and formats
File formats continued
For document interchange between
applications there is RTF (Rich Text Format).
Compression and archive formats include ARJ and ZIP.
uuencode/uudecode is an encoding that turns binary
files into text so they can be sent by email.
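To see the difference between compressing data and encoding it as text, here is a small sketch using Python's standard zlib and binascii modules. zlib is used as a stand-in for a ZIP-style compressor (both rely on DEFLATE); the sample text is invented.

```python
import binascii
import zlib

text = b"heading heading heading heading heading " * 20

# Compression: repetitive input shrinks and can be restored exactly.
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# uuencoding: binary bytes become printable text, which grows in size.
chunk = text[:45]                  # one uuencoded line holds at most 45 bytes
encoded = binascii.b2a_uu(chunk)
decoded = binascii.a2b_uu(encoded)

print(len(text), "->", len(compressed), "bytes after compression")
print(len(chunk), "->", len(encoded), "bytes after uuencoding")
```

Compression makes the data smaller; uuencoding makes it larger but safe to carry in a text-only channel.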
What is Markup?
•Markup is everything in a document that is not content.
Typesetters used procedural markup to give instructions for
how a document should look (16 pt bold Helvetica).
•Word processing software like Microsoft Word uses procedural
markup, with a specific set of markup codes. The codes
apply to a single physical way of presenting information, such as
on a printed page. They do not define the appearance on other media
like CD-ROM or the Internet.
•Descriptive markup, or generic markup, describes the
structure of the document rather than its appearance. Content is
separate from style. You can publish on all media using the same
set of structural instructions.
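The idea of descriptive markup can be sketched in a few lines of Python. The document below records only what each part is; two separate, hypothetical style tables decide how it looks on paper and on the Web. The names document, PRINT_STYLE, WEB_STYLE, and render are invented for this illustration.

```python
# Descriptive markup: the document records only what each part IS.
document = [
    ("heading", "Document Properties"),
    ("paragraph", "Markup is everything that is not content."),
]

# Style is kept separately, one hypothetical rule set per output medium.
PRINT_STYLE = {
    "heading": "[16pt bold Helvetica] {}",
    "paragraph": "[10pt Times] {}",
}
WEB_STYLE = {
    "heading": "<h1>{}</h1>",
    "paragraph": "<p>{}</p>",
}

def render(doc, style):
    """Apply a style table to a structurally tagged document."""
    return [style[kind].format(text) for kind, text in doc]

for line in render(document, PRINT_STYLE):
    print(line)
for line in render(document, WEB_STYLE):
    print(line)
```

The same structured content is published twice without being edited; only the style table changes.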
SGML
SGML (Standard Generalized Markup Language, ISO
8879, 1986) specifies a standard method for describing
the structure of a document. Structural elements are,
for example: title, chapter, paragraph. It is an extensible
meta language. It can support an unlimited variety of
document structures, such as information bulletins, technical
manuals, parts catalogs, design specifications, reports,
letters, and memos.
The Document Type Definition (DTD) describes the
structure of the document (like a schema in a
database). The DTD provides a framework of elements
(chapters, headers). The DTD specifies rules for the
relationship between elements, for example a chapter header
must come after the start of a chapter. A document instance is a
document whose content is tagged in conformance with
a DTD. A DTD can be applied throughout the whole
organization.
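A real DTD is written in its own declaration syntax and checked by a validating parser. As a simplified stand-in, the Python sketch below expresses one rule of that kind (a chapter must start with a title, followed by paragraphs) as a hypothetical RULES table and checks two document instances against it. The element names and the conforms function are assumptions made for the example, and XML syntax is used for the instances because Python's standard library parses XML rather than SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules in the spirit of a DTD: for each element, the child
# elements it must contain. Each listed child must appear at least once,
# in this order, and may repeat. Elements not listed may hold only text.
RULES = {
    "book": ["chapter"],
    "chapter": ["title", "paragraph"],
}

def conforms(element):
    """Return True if the element and all its descendants follow RULES."""
    expected = RULES.get(element.tag, [])
    children = [child.tag for child in element]
    # Collapse runs such as paragraph, paragraph into a single entry.
    collapsed = [tag for i, tag in enumerate(children)
                 if i == 0 or tag != children[i - 1]]
    return collapsed == expected and all(conforms(child) for child in element)

good = ET.fromstring(
    "<book><chapter><title>One</title>"
    "<paragraph>a</paragraph><paragraph>b</paragraph></chapter></book>")
bad = ET.fromstring(
    "<book><chapter><paragraph>a</paragraph><title>One</title></chapter></book>")

print(conforms(good))
print(conforms(bad))
```

The second instance fails because its title comes after a paragraph, which the rule for chapter does not allow.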
SGML continued
SGML uses tagging to identify the content's position
within a DTD structure. So we insert tags around the
content. You can nest elements. A parser program
verifies that a document follows the rules of a DTD.
The parser checks whether the document is structurally
correct.
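A small sketch of what a parser's structural check looks like, using Python's standard XML parser as a stand-in for an SGML parser. It only tests that tags are properly nested (well-formedness); it does not check the rules of a DTD. The sample elements are invented.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if every opened tag is closed in the right order."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = "<chapter><title>Markup</title><para>Tags nest.</para></chapter>"
overlapping = "<chapter><title>Markup</chapter></title>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```

The second string is rejected because the chapter element is closed while the title element inside it is still open.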
Documents can be ported to different formats for
different output media (printer, screen, CD-ROM,
speaker, TV).
Style is usually handled separately by style sheets,
like Cascading Style Sheets (CSS).
HTML
HTML (first version in 1992, version 4.01 in Dec. 1999) is a
tagging language that can be used on the World
Wide Web for text formatting and linking documents. It
adopts the syntax of SGML and is an application of
SGML described by a particular DTD. HTML is not an
extensible language. Authors cannot add their own
tags. HTML supports style sheets written in the CSS
language (color, font, layout for web pages) and
Frameset to partition the browser window.
XHTML is a modular approach that allows support for
markup tags in smaller client devices like cell phones,
TVs, cars, kiosks, etc.
Chapter 1: positive comments on HTML
HTML uses tags to separate content (text)
from format (structure, appearance).
It lets amateurs control markup (good and
bad)
HTML tags were used for appearance
formatting, but little attention was paid
to content structuring.
Chapter 1: negative comments on HTML
HTML did not offer enough custom control over the
WYSIWYG environment.
Things looked different in different browsers (reader
interpreted, not author interpreted).
Navigating through hypertext requires user memory.
Designing hypertext (document collections) for easy
searching is hard to do. Spiders, crawlers, robots, and
the AltaVista index all try to index the web.
Chapter 1: comments on CSS
Cascading Style Sheets helped HTML by
freeing tags like <font> and <b> from carrying
format information. The format information goes
in the style sheet instead.
It lets tags like <header> carry structure
information.
CSS is a styling tool that can work with other
markup languages like XML.
Chapter 1: comments on CSS
The Document
Formatting
• Structure
• Appearance
Content
• Information
• Data
Structure - HTML does this a little bit.
Appearance - or presentation. HTML used to do this
with tags like <b>, but now all appearance
control should be taken out of HTML
documents and put in CSS or XSL files.
Chapter 1: a migration to XML was needed
Binary files (in native formats) compress tightly for
efficient transmission, but they are complex and
proprietary. (XML files are larger: with markup there is
more to store and transfer, which is a disadvantage.)
Moving documents between applications is hard.
You must save data in text formats and move it. Conversions
were not always good. (XML writers define open
formats and standards for loading, saving, and transfer.)
Lock-in let MS sell new versions of Word that could read
the old format and save in the new format, while old versions
could not read files in the new format.
XML will handle document description and data
description. Structure and labels will not be lost in a move.
XML
XML (Extensible Markup Language, XML 1.0, 1998)
is also a meta language in that it describes other
languages. There is no predefined list of elements.
Elements are specified using a DTD or Schema. Also,
style sheets can be used to specify the output format
of each element (XSL).
XML is based on SGML, but it is a subset and is
considered easier to program. Most current browser
versions can also display XML.
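Because XML has no predefined elements, an author can invent tag names that fit the data. The sketch below does this for a made-up course description and reads it with Python's standard xml.etree.ElementTree module. The element and attribute names (course, class, topic, code, number) are invented for the example.

```python
import xml.etree.ElementTree as ET

# The element names below are chosen by the author; XML predefines none.
source = """<course code="IN350">
  <class number="2">
    <topic>Document Properties</topic>
    <topic>Markup Languages</topic>
  </class>
</course>"""

root = ET.fromstring(source)
topics = [topic.text for topic in root.iter("topic")]

print(root.tag, root.attrib["code"])
print(topics)
```

The labels travel with the data, so a program can pick out the topics without knowing anything about how the document is displayed.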
XML related standards
XPath - specifies the data model and grammar for
navigating an XML document.
XSL - eXtensible Stylesheet Language. Includes a
language for transforming XML documents (XSLT)
and a formatting vocabulary (XSL-FO).
XSLT - eXtensible Stylesheet Language
Transformations. Defines a transformation language
to convert XML documents into other formats.
XLL - eXtensible Linking Language. Allows logic to be
placed on linking.
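As a small illustration of XPath-style navigation, the sketch below selects elements from an invented parts catalog. It uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath, so this shows the flavor of path expressions rather than the full language.

```python
import xml.etree.ElementTree as ET

# A made-up catalog used only for this example.
catalog = ET.fromstring("""<catalog>
  <part id="p1"><name>Bolt</name><price>2</price></part>
  <part id="p2"><name>Nut</name><price>1</price></part>
</catalog>""")

# Path steps walk down the tree: every <name> inside every <part>.
names = [name.text for name in catalog.findall("./part/name")]

# A predicate in square brackets filters on an attribute value.
nut = catalog.find("./part[@id='p2']/name").text

print(names)
print(nut)
```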
XML related standards & groups
OAGIS - The Open Applications Group's
(www.openapplications.org) Integration Specification for
interoperability between ERP packages.
OASIS-ebXML - Organization for the Advancement of Structured
Information Standards (OASIS) Electronic Business XML
(www.ebxml.org).
FinXML - Financial Markup Language (www.finxml.com) supports
a universal standard for data interchange within the capital
market.
FpML - Financial Products Markup Language (www.fpml.org)
enables e-commerce activities in the financial derivatives field.
OFX - Open Financial Exchange (www.ofx.net) for the electronic
exchange of financial data.
Other languages
MathML - tags for presenting formulas
SMIL - language for scheduling multimedia (Synchronized
Multimedia Integration Language). It uses XML markup to identify
and manage the presentation of files containing text, images,
sound and video in multi-media presentations.
RDF - Resource Description Framework, a format for holding
metadata about resources, written in XML.
HyTime - an SGML architecture that specifies the generic
hypermedia structure of documents. Allows for the design of
metaDTDs, for complex multimedia presentations, such as
providing music with other media presentation.
For more information on markup languages, see
http://www.w3.org/
XML Technologies for Oracle
XML parser. Used to parse, construct, and validate XML
documents.
XPath engine. A utility that searches in-memory XML
documents, using the declarative syntax of XPath, another
element of the XML standard.
XSLT processor. Supports XSLT in the Oracle database,
allowing you to transform XML documents into different formats.
XML SQL utility. A utility that helps produce XML documents
from SQL and lets you easily insert XML-based data into Oracle
database tables.
XSQL pages. A technology that lets you assemble XML data
declaratively and then publish that data with XSLT.
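The idea behind producing XML from SQL and loading XML back into rows can be sketched without Oracle. The example below uses Python's built-in sqlite3 module as a stand-in database and does not use or imitate Oracle's actual API. The table, its contents, the ROWSET and ROW element names, and the functions rows_to_xml and xml_to_rows are all assumptions made for the illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")   # in-memory stand-in, not an Oracle connection
conn.execute("CREATE TABLE document (id INTEGER, title TEXT)")
conn.executemany("INSERT INTO document VALUES (?, ?)",
                 [(1, "Class notes"), (2, "Lab handout")])

def rows_to_xml(cursor):
    """Wrap each result row in a <ROW> element, one child per column."""
    rowset = ET.Element("ROWSET")
    columns = [col[0] for col in cursor.description]
    for values in cursor:
        row = ET.SubElement(rowset, "ROW")
        for name, value in zip(columns, values):
            ET.SubElement(row, name.upper()).text = str(value)
    return rowset

def xml_to_rows(rowset):
    """Turn the XML back into tuples that could be inserted into a table."""
    return [(int(row.findtext("ID")), row.findtext("TITLE")) for row in rowset]

xml_doc = rows_to_xml(conn.execute("SELECT id, title FROM document ORDER BY id"))
print(ET.tostring(xml_doc, encoding="unicode"))
print(xml_to_rows(xml_doc))
```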