Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.4

Advanced Information Retrieval
Chap. 06: Text and Multimedia Languages
and Properties
(Introduction, Metadata and Text)
6.4
Text markup languages
• SGML
• HTML
• XML
Document Markup languages
• Text Markup: additional information that is added to
the text (but is not part of the text) to provide such
things as
– Formatting instructions
– Structural information
– Semantics
• Markup languages have evolved from providing
instructions on the printing style of each document
part, to providing information on the function of
each document part.
– Example: Instead of marking a section heading to be
printed with large font size and boldface font, it is marked
up as “section heading”.
– Allows formatting to be interpreted differently in different
situations.
– Allows documents originating from different sources to
appear more uniform.
Document Markup languages (cont.)
• Formatting markup languages
– TeX, Troff and similar typesetting languages interleave
actual document text and formatting instructions. A
processor reads such files and produces a formatted
output suitable for printing.
• Web markup Languages:
– SGML: A metalanguage for structuring large
documents. It defines a document structure and its
associated markup conventions.
– HTML: A hypertext language used for linking and
displaying Web documents. A browser reads such files
and produces output suitable for display.
– XML: A subset of SGML, used for semantic markup.
An increasingly popular language for data exchange.
Standard Generalized Markup
Language (SGML)
• SGML evolved from an earlier markup
language called Generalized Markup
Language (GML) created by IBM in the
1960s.
• Strictly speaking, SGML is not a markup
language, but a metalanguage: A language
for defining markup languages.
• It is instantiated to specific languages by
defining individual document types and their
corresponding markup languages.
• Well known instance of SGML: HTML
SGML(cont.)
• An SGML document consists of three major parts:
– 1. The SGML declaration describes the document
character set, the codes used to identify and delimit
markup sequences, and so on.
– 2. The Document Type Definition (DTD) defines the
model for documents: the various elements of the
document, how these elements relate to one another,
what are their possible attributes, and so on.
– 3. The document instance contains the marked-up
contents of the document. The instance contains a
reference to the DTD to be used in interpreting it.
• SGML markup does not describe the semantics of
the markup.
• The semantics of elements and attributes are
described in a separate document (or in
comments embedded in the DTD).
SGML(cont.)
• Basic DTD syntax:
– ELEMENT: defines a tag.
– ATTLIST: defines the possible attributes of a tag.
– PCDATA, NDATA: indicate ASCII text or binary data,
respectively.
– "- -" indicates that both the begin and end tags are
required, "- O" indicates that the begin tag is required
and the end tag is optional, etc.
– The symbols | , ? * + are regular expression syntax,
indicating (respectively) disjunction, concatenation,
zero or one occurrences, any number of occurrences,
one or more occurrences.
– Much of the syntax of SGML is similar to the syntax of
XML (to be discussed later in more detail).
SGML (cont.)
• Example: SGML specification for electronic mail messages.
– DTD:
<!--SGML DTD for electronic messages-->
<!ELEMENT e-mail - - (prolog, contents)>
<!ELEMENT prolog - - (sender, address+, subject?, Cc*)>
<!ELEMENT (sender | address | subject | Cc) - O (#PCDATA)>
<!ELEMENT contents - - (par | image | audio)+>
<!ELEMENT par - O (ref | #PCDATA)+>
<!ELEMENT ref - O EMPTY>
<!ELEMENT (image | audio) - - (#NDATA)>
<!ATTLIST e-mail
id ID #REQUIRED
date_sent DATE #REQUIRED
status (secret | public) public>
<!ATTLIST ref
id ID #REQUIRED>
<!ATTLIST (image | data)
id ID #REQUIRED>
SGML (cont.)
• An example document with previous DTD:
<!--Example of use of previous DTD-->
<!DOCTYPE e-mail SYSTEM "email DTD">
<e-mail id=94108rby date_sent=02101998>
<prolog>
<sender> Pablo Neruda </sender>
<address> Federico Garcia Lorca </address>
<address> Ernest Hemingway </address>
<subject> Pictures of my house in Isla Negra
<Cc> Gabriel Garcia Marquez </Cc>
</prolog>
<contents>
<par>
As promised in my previous letter, I am sending two digital
pictures to show you my house and the splendid view of the
Pacific Ocean from my bedroom (photo <ref idref=F2>).
</par>
<image id=F1> "photo1.gif" </image>
<image id=F2> "photo2.jpg" </image>
<par>
Regards from the south, Pablo.
</contents>
</e-mail>
Hypertext Markup language
• HTML is an instance of SGML, created in
1989.
• HTML is the standard language for storing
documents on the World Wide Web.
• HTML follows the SGML conventions, and
has a DTD, but documents do not make
explicit reference to this DTD.
• HTML is under continuous evolution
(currently, version 4).
HTML (cont.)
• Some notable features of HTML:
– Tags that determine the way certain text, such as titles,
is rendered on the screen.
– Tags that are links to other documents, letting users
navigate from document to document.
– Markup for forms that let the user fill out information
and electronically send or e-mail the data to the document
author, initiate sophisticated searches, or order goods or services.
– Tags for embedding other types of media such as
pictures or audio.
– Tags for embedding programs (using Java applets or
JavaScript).
– Tags for storing metadata.
HTML (cont.)
• Example: HTML document and how it is seen in the browser.
<html>
<head>
<title>HTML Example</title>
<meta name=rby content="Just an example">
</head>
<body>
<h1>HTML Example</h1>
<p>
<hr>
<p>
HTML has many <i>tags</i>, among them:
<ul>
<li> links to other <a href=http://www.w3c.org/>
pages</a>
<li> paragraphs (p), headings (h1, h2, etc.)
<li> font types (b,i),
<li> horizontal rules (hr),
<li> indented lists and items (ul, li),
<li> images (img), tables, forms etc.
</ul>
<p>
<hr>
<p>
<img align=left src="at_work.gif">
</body>
</html>
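The features listed earlier (metadata tags, links, embedded media) can be read back out of a page by a program. The following is only a minimal sketch using Python's standard html.parser module; the class name TagCollector is hypothetical, and the sample page is a shortened copy of the example above with the link target quoted.

```python
from html.parser import HTMLParser

# Shortened copy of the slide's example page (the href is quoted here).
HTML_DOC = """<html>
<head>
<title>HTML Example</title>
<meta name=rby content="Just an example">
</head>
<body>
<h1>HTML Example</h1>
<ul>
<li> links to other <a href="http://www.w3c.org/">pages</a>
</ul>
<img align=left src="at_work.gif">
</body>
</html>"""


class TagCollector(HTMLParser):
    """Hypothetical helper: collects metadata, link targets and image sources."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])


collector = TagCollector()
collector.feed(HTML_DOC)
collector.close()
print(collector.meta, collector.links, collector.images)
```

The parser only sees tag names and attributes; what a browser does with them (rendering, following links) is outside this sketch.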
HTML (cont.)
• Cascading Style Sheets (CSS): Definition of style rules
that tell a browser how to present a document.
• CSS provides Web authors a powerful tool for
improving the aesthetics of Web pages.
• Example: The following code defines the color and
font-size properties for H1 and H2 elements. It tells
the browser to show level-one headings in an extra-large,
red font, and to show level-two headings in a
large, blue font.
<head>
<title>CSS Example</title>
<style type="text/css">
h1 { font-size: x-large; color: red }
h2 { font-size: large; color: blue }
</style>
</head>
Extensible Markup Language (XML)
• Defined by the WWW Consortium (W3C).
• Originally intended as a document markup language
to replace HTML as the language for publishing
documents on the Web.
• Derived from SGML (Standard Generalized Markup
Language), but simpler to use than SGML.
• Extensible (i.e., a meta-language): Users can add
new tags, and separately specify how a tag should
be handled for display.
• The ability to specify new tags and to create nested
tag structures makes XML useful for exchange of data
(not just documents).
• Much of the use of XML has been in data exchange
applications, not as a replacement for HTML.
XML (cont.)
• Data interchange is critical in today’s networked
world. Paper flow of information between
organizations is being replaced by electronic flow of
information.
– Banking: funds transfer.
– Order processing; especially inter-company orders.
– Scientific data: chemistry, genetics.
• With XML, each application area sets its own
standards for representing information.
• Each XML-based standard defines the valid
elements, using
– XML-type specification (DTD or XML schema) to specify
the syntax.
– Textual descriptions of the semantics.
• A wide variety of tools is available for parsing,
browsing and querying XML documents or data.
Basic Structure
• Basic Syntax:
– ELEMENT: Section of data beginning with <tagname> and
ending with matching </tagname>.
– Mixing text with elements is allowed.
• Example:
<university>
<student>
<student-id> 123-456 </student-id>
<student-name> Johnson </student-name>
<major> Information Systems </major>
</student>
<enrollment>
This enrollment is still incomplete.
<student-id> 123-456 </student-id>
<course-id> INFS-623 </course-id>
</enrollment>
</university>
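As a rough illustration of how a program sees this structure, the sketch below parses a copy of the example with Python's standard xml.etree.ElementTree module. The variable names are illustrative only. Note how the mixed text inside enrollment is exposed as the text that precedes the first subelement.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<university>
  <student>
    <student-id> 123-456 </student-id>
    <student-name> Johnson </student-name>
    <major> Information Systems </major>
  </student>
  <enrollment>
    This enrollment is still incomplete.
    <student-id> 123-456 </student-id>
    <course-id> INFS-623 </course-id>
  </enrollment>
</university>"""

root = ET.fromstring(XML_DOC)

# Elements nest: <student> is a child of <university>.
student = root.find("student")
student_name = student.findtext("student-name").strip()

# Mixed content: text before the first subelement is stored in .text.
enrollment = root.find("enrollment")
note = enrollment.text.strip()
child_tags = [child.tag for child in enrollment]

print(student_name, "|", note, "|", child_tags)
```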
Attributes
• Elements can have attributes.
– Attributes are specified by name=value pairs inside the
starting tag of an element.
– An element may have several attributes, but each
attribute name may occur only once.
• Example:
<student degree-level="graduate">
<student-id> 123-456 </student-id>
<student-name> Johnson </student-name>
<major> Information Systems </major>
</student>
– Note that the same information could also be specified
with a subelement:
<degree-level> graduate </degree-level>
– Suggestion: Use attributes for identifiers of elements,
and use subelements for contents.
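A small sketch of the two rules above, assuming Python's standard xml.etree.ElementTree module (which uses the expat parser): attributes are read by name, and a start tag that repeats an attribute name is rejected as not well-formed. The attribute name "advisor" is made up for the example.

```python
import xml.etree.ElementTree as ET

student = ET.fromstring(
    '<student degree-level="graduate">'
    "<student-id> 123-456 </student-id>"
    "<student-name> Johnson </student-name>"
    "<major> Information Systems </major>"
    "</student>"
)

# Attributes are name=value pairs attached to the start tag.
degree_level = student.get("degree-level")
# "advisor" is a hypothetical attribute that is absent, so the default is used.
advisor = student.get("advisor", "unknown")

# Each attribute name may occur only once per element.
try:
    ET.fromstring('<student degree-level="graduate" degree-level="phd"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

print(degree_level, advisor, duplicate_rejected)
```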
Namespaces
• XML data has to be exchanged between organizations.
– The same tag name may have different meaning in different
organizations, causing confusion on exchanged documents.
– Specifying a unique string as an element name avoids such
confusion.
– Better solution: use unique-name:element-name.
– Avoid using long unique names all over the document by
using XML namespaces.
• Example:
<university xmlns:GMU="http://www.GMU.edu">
…
<GMU:student>
<GMU:student-name> Johnson </GMU:student-name>
<GMU:major> Information Systems </GMU:major>
</GMU:student>
…
</university>
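The following sketch, assuming Python's standard xml.etree.ElementTree module, shows what the namespace declaration buys: a qualified name is expanded to the namespace URI plus the local name, so it cannot collide with an unqualified element of the same name. The second, unqualified student element and the prefix "gmu" used in the query are additions made for this illustration.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<university xmlns:GMU="http://www.GMU.edu">
  <GMU:student>
    <GMU:student-name> Johnson </GMU:student-name>
    <GMU:major> Information Systems </GMU:major>
  </GMU:student>
  <student><student-name> Smith </student-name></student>
</university>"""

root = ET.fromstring(XML_DOC)

# The prefix used when querying is local to this program;
# only the namespace URI has to match the document.
ns = {"gmu": "http://www.GMU.edu"}

gmu_students = root.findall("gmu:student", ns)
plain_students = root.findall("student")

# Internally the tag is stored as {namespace-URI}local-name.
expanded_tag = gmu_students[0].tag
name = gmu_students[0].findtext("gmu:student-name", namespaces=ns).strip()

print(expanded_tag, name, len(plain_students))
```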
Schemas
• Database schemas constrain what information can be
stored, and the data types of stored values.
• XML documents are not required to have an associated
schema.
• However, schemas are very important for XML data
exchange, to allow a site to automatically interpret data
received from another site.
• Two mechanisms for specifying XML schema:
– Document Type Definition (DTD): Widely used.
– XML Schema: Newer, more complex.
• We shall discuss only DTD schemas.
Document Type Definition (DTD)
• A DTD specifies the type of an XML document. It
constrains the structure of XML data by declaring
– The elements that can occur.
– The subelements that can/must occur inside an
element, and how many times.
– The attributes that an element can/must have.
• A DTD does not constrain data types:
– All values are represented as strings in XML.
• DTD syntax:
<!ELEMENT element (subelements-specification) >
<!ATTLIST element (attributes) >
DTD (cont.)
• Subelements are either
– Names of other elements.
– #PCDATA (parsed character data – character strings).
– EMPTY (no subelements) or ANY (anything can be a subelement).
• Example:
– <!ELEMENT student (student-id, student-name)>
– <!ELEMENT student-id (#PCDATA)>
– <!ELEMENT student-name (#PCDATA)>
• Subelement specification may have regular expressions.
Notation:
– | – alternatives
– + – 1 or more occurrences
– * – 0 or more occurrences
• Example:
<!ELEMENT university (( student | course | enrollment)+)>
DTD (cont.)
• Example: University DTD with information on
students, courses and enrollments.
<!DOCTYPE university [
<!ELEMENT university ((student | course | enrollment)+)>
<!ELEMENT student (student-id, student-name, major)>
<!ELEMENT course (course-id, course-title, credits)>
<!ELEMENT enrollment (student-id, course-id)>
<!ELEMENT student-id (#PCDATA)>
<!ELEMENT student-name (#PCDATA)>
<!ELEMENT major (#PCDATA)>
<!ELEMENT course-id (#PCDATA)>
<!ELEMENT course-title (#PCDATA)>
<!ELEMENT credits (#PCDATA)>
]>
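The subelement specifications behave like regular expressions over the sequence of child tags. The sketch below makes that analogy concrete in Python. It is not a DTD validator: the content models are translated to regular expressions by hand, the convention of writing each child tag followed by one space is an assumption of this sketch, and the names CONTENT_MODELS, children_match and is_valid are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models from the university DTD.
# Assumed convention: each child tag is followed by one space,
# so the sequence "a, b" becomes the string "a b ".
CONTENT_MODELS = {
    "university": re.compile(r"(?:student |course |enrollment )+"),
    "student": re.compile(r"student-id student-name major "),
    "course": re.compile(r"course-id course-title credits "),
    "enrollment": re.compile(r"student-id course-id "),
}


def children_match(elem):
    """Return True if the child-tag sequence of elem fits its content model."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        # Remaining elements are (#PCDATA): no subelements allowed.
        return len(elem) == 0
    sequence = "".join(child.tag + " " for child in elem)
    return model.fullmatch(sequence) is not None


def is_valid(root):
    """Check every element of the tree against its content model."""
    return all(children_match(elem) for elem in root.iter())


good = ET.fromstring(
    "<university><student><student-id>1</student-id>"
    "<student-name>J</student-name><major>IS</major></student></university>"
)
# Same children, but student-name and student-id are swapped.
bad = ET.fromstring(
    "<university><student><student-name>J</student-name>"
    "<student-id>1</student-id><major>IS</major></student></university>"
)
print(is_valid(good), is_valid(bad))
```

The second document fails because a comma in a content model fixes the order of the subelements.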
DTD (cont.)
• Attribute specifications include three components:
– Attribute name
– Attribute type:
• CDATA (character data).
• ID (identifier) or IDREF (ID reference) or IDREFS (multiple
IDREFs) .
– Attribute value information:
• The value must be specified in each element (#REQUIRED).
• There is a default value (value).
• Examples:
– <!ATTLIST student degree-level CDATA "undergraduate">
– <!ATTLIST student
student-id ID #REQUIRED
enrollments IDREFS #REQUIRED>
DTD (cont.)
• Attributes of type ID, IDREF and IDREFS:
– An element can have at most one attribute of
type ID.
– The ID attribute value of each element in an
XML document must be distinct (hence, the ID
attribute value is an identifier).
– An attribute of type IDREF must contain the ID
value of an element in the same document.
– An attribute of type IDREFS contains a set of (0
or more) ID values.
– Each of these values must be the ID value of an
element in the same document.
DTD (cont.)
• Example: A University DTD with ID and IDREF
attribute types.
<!DOCTYPE university [
<!ELEMENT enrollment (course-title, grade)>
<!ATTLIST enrollment
enrollment-no ID #REQUIRED
enrolled-student IDREF #REQUIRED>
<!ELEMENT student (student-name, major)>
<!ATTLIST student
student-id ID #REQUIRED
enrollments IDREFS #REQUIRED>
…
declarations for course-title, grade, student-name, and major
…
]>
DTD (cont.)
• Example: XML data for the previous DTD.
<university>
<student student-id="123-456" enrollments="E703 E812">
<student-name> Johnson </student-name>
<major> Information Systems </major>
</student>
<enrollment enrollment-no="E703" enrolled-student="123-456">
<course-title> Information Retrieval </course-title>
<grade> B+ </grade>
</enrollment>
<enrollment enrollment-no="E812" enrolled-student="123-456">
<course-title> Database Systems </course-title>
<grade> A </grade>
</enrollment>
</university>
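The ID/IDREF rules can be checked mechanically. The sketch below is one possible way to do it in Python, under these assumptions: the attribute types are copied by hand into dictionaries (xml.etree.ElementTree does not read DTDs), the names ID_ATTRS, IDREF_ATTRS, IDREFS_ATTRS and check_references are hypothetical, and the data is a trimmed version of the university example with the subelements left out.

```python
import xml.etree.ElementTree as ET

# Attribute types copied by hand from the DTD.
ID_ATTRS = {"student": "student-id", "enrollment": "enrollment-no"}
IDREF_ATTRS = {"enrollment": ["enrolled-student"]}
IDREFS_ATTRS = {"student": ["enrollments"]}

TEMPLATE = """<university>
  <student student-id="123-456" enrollments="E703 E812"/>
  <enrollment enrollment-no="E703" enrolled-student="{ref}"/>
  <enrollment enrollment-no="E812" enrolled-student="{ref}"/>
</university>"""


def check_references(root):
    """Return messages for duplicate IDs and for references to unknown IDs."""
    problems, ids = [], set()
    # First pass: collect ID values, which must be distinct in the document.
    for elem in root.iter():
        attr = ID_ATTRS.get(elem.tag)
        if attr is not None and attr in elem.attrib:
            if elem.get(attr) in ids:
                problems.append(f"duplicate ID {elem.get(attr)!r}")
            ids.add(elem.get(attr))
    # Second pass: every IDREF / IDREFS value must name an existing ID.
    for elem in root.iter():
        refs = [elem.get(a) for a in IDREF_ATTRS.get(elem.tag, [])
                if a in elem.attrib]
        for a in IDREFS_ATTRS.get(elem.tag, []):
            refs.extend(elem.get(a, "").split())  # space-separated list
        problems.extend(f"dangling reference {r!r}" for r in refs
                        if r not in ids)
    return problems


consistent = check_references(ET.fromstring(TEMPLATE.format(ref="123-456")))
broken = check_references(ET.fromstring(TEMPLATE.format(ref="123456")))
print(consistent)
print(broken)
```

A reference written as 123456 does not resolve to the ID 123-456: the values are compared as plain strings.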
DTD (cont.)
• DTD has several limitations:
– No typing of text elements and attributes (all values are
strings).
– Difficult to specify unordered sets of subelements
(order is often irrelevant in databases).
– IDs and IDREFs are untyped; for example, the
enrollments attribute of a student may contain a
reference to another student, which is meaningless
(enrollments should ideally be constrained to refer to
enrollment elements).
• XML Schema is a more sophisticated schema
language which addresses these drawbacks of
DTDs, and offers many more features.
• XML Schema is significantly more complicated
than DTDs, and is not yet widely used.
Querying and transforming XML data
• Querying XML data and translation of
information from one XML schema to another
are closely related, and are handled by the
same tools.
• Standard XML querying/translation languages
– XPath: Simple language consisting of path
expressions.
– XSLT: Simple language designed for translation
from XML to XML and XML to HTML.
– XQuery: An XML query language with a rich set of
features.
Tree model of XML data
• Query and transformation languages are based on
a tree model of XML data.
• An XML document is modeled as a tree, with
nodes corresponding to elements and attributes.
– Element nodes have children nodes, which can be
attributes or subelements.
– Text in an element is modeled as a text node child of
the element.
– Children of a node are ordered according to their order
in the XML document.
– Element and attribute nodes (except for the root node)
have a single parent, which is an element node.
– The root node has a single child, which is the root
element of the document.
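To make the tree model tangible, the sketch below walks a small document and lists its nodes in document order. It assumes Python's standard xml.etree.ElementTree module, whose in-memory model differs slightly from the one described above (attributes are kept in a dictionary and text is a property of the element rather than a separate node). The function name outline is hypothetical.

```python
import xml.etree.ElementTree as ET


def outline(elem, depth=0):
    """Return one line per node: the element, its attributes, text, children."""
    indent = "  " * depth
    lines = [f"{indent}element {elem.tag}"]
    for name, value in elem.attrib.items():
        lines.append(f"{indent}  attribute {name}={value}")
    if elem.text and elem.text.strip():
        lines.append(f"{indent}  text {elem.text.strip()!r}")
    for child in elem:  # children are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(
    '<student degree-level="graduate">'
    "<student-name> Johnson </student-name>"
    "<major> Information Systems </major>"
    "</student>"
)
tree_lines = outline(root)
print("\n".join(tree_lines))
```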
Tree model of XML data (cont.)
• The XML tree for the university example
XSLT
• XSLT stands for Extensible Stylesheet
Language Transformations
• It is used to transform XML documents
into other kinds of documents, e.g. HTML,
PDF, XML, …
• XSLT uses two input files:
– The XML document containing the actual data
– The XSL document containing both the
“framework” in which to insert the data, and
XSLT commands to do so
XSLT Architecture
[Figure: Source XML doc and XSL stylesheet -> XSL processor -> Target Document]
Some special transforms
• XML to HTML: for old browsers
• XML to LaTeX: for TeX layout
• XML to SVG: graphs, charts, trees
• XML to tab-delimited: for db/stat packages
• XML to plain-text: occasionally useful
• XML to XSL-FO formatting objects
XSLT Data Model
• XSLT reads an XML document as a source tree
• Transforms the document into a result tree
• Transformations are specified in a stylesheet
• To navigate the tree, XSLT uses XPath
Introduction to XPath
• XPath is a syntax for addressing parts of
an XML document by
– describing paths through the document
hierarchy
– specifying constraints to match against the
document's structure
• XSL uses XPath expressions to
– determine which elements match a template
– select nodes upon which to perform
operations
XPath Basics
• XPath expressions superficially resemble UNIX
pathnames, e.g.
poem/stanza/line
refers to "all line elements which are children
of stanza elements which are children of poem
elements"
• XPath expressions are evaluated relative to a
"context node", which is analogous to the
"current working directory" in UNIX or DOS.
The XPath expression for this is "."
XPath Basics: a Simple Example
• Consider the following XML document:
<poem>
<title>Roses</title>
<author>Ima Poet</author>
<stanza>
<line>Roses are red</line>
<line>violets are blue</line>
</stanza>
<stanza>
<line>I'm a poet</line>
<punch>and you're not!</punch>
</stanza>
</poem>
XPath Basics: a Simple Example
(cont.)
• The XPath "poem/stanza/line" selects the three line elements:
<line>Roses are red</line>
<line>violets are blue</line>
<line>I'm a poet</line>
XPath Basics: wildcards
• The XPath "poem/stanza/*" selects every child element of each stanza:
<line>Roses are red</line>
<line>violets are blue</line>
<line>I'm a poet</line>
<punch>and you're not!</punch>
XPath Basics: descendants
• The XPath "poem//punch" selects every punch descendant of poem, at any depth:
<punch>and you're not!</punch>
XPath Basics: sequencing
• "poem/stanza/line[1]" selects the first line child of each stanza:
<line>Roses are red</line>
<line>I'm a poet</line>
XPath Basics: sequencing (cont.)
• "poem/stanza/line[position() = last()]" selects the last line child of each stanza:
<line>violets are blue</line>
<line>I'm a poet</line>
XPath Basics: selecting text
nodes
• "poem/author/text()" selects the text node inside the author element:
Ima Poet
XPath Basics: conditionals
• "poem/stanza[punch]" selects each stanza that has a punch child:
<stanza>
<line>I'm a poet</line>
<punch>and you're not!</punch>
</stanza>
XPath Basics: conditionals:
equality
• //line[text()="I'm a poet"] selects every line element whose text is exactly "I'm a poet":
<line>I'm a poet</line>
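The path expressions from the preceding slides can be tried out with Python's standard xml.etree.ElementTree module, with some caveats. ElementTree implements only a subset of XPath: it has no text() step and no position() function, so this sketch uses [last()] for the last-position test, [.="..."] for the text comparison (Python 3.7 or later), and findtext() to read the author's text. The wrapper element "doc" is a hypothetical stand-in for the document root, needed because ElementTree paths are always relative to an element.

```python
import xml.etree.ElementTree as ET

POEM = """<poem>
<title>Roses</title>
<author>Ima Poet</author>
<stanza>
<line>Roses are red</line>
<line>violets are blue</line>
</stanza>
<stanza>
<line>I'm a poet</line>
<punch>and you're not!</punch>
</stanza>
</poem>"""

# Stand-in for the document root, so that paths can start with "poem".
doc = ET.Element("doc")
doc.append(ET.fromstring(POEM))


def texts(path):
    """Text content of every element selected by the path."""
    return [elem.text for elem in doc.findall(path)]


all_lines = texts("poem/stanza/line")
stanza_children = texts("poem/stanza/*")             # wildcard
punches = texts("poem//punch")                       # descendants
first_lines = texts("poem/stanza/line[1]")           # first line per stanza
last_lines = texts("poem/stanza/line[last()]")       # last line per stanza
stanzas_with_punch = doc.findall("poem/stanza[punch]")
exact = texts(".//line[.=\"I'm a poet\"]")           # equality test
author_text = doc.findtext("poem/author")            # instead of text()

print(first_lines, last_lines, author_text)
```

A full XPath 1.0 implementation accepts the expressions exactly as written on the slides; the substitutions here only work around the subset that ElementTree supports.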
A simple XSL example
• File data.xml:
<?xml version="1.0" ?>
<?xml-stylesheet type="text/xsl" href="render.xsl"?>
<message>Hello World!</message>
• File render.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- one rule, to transform the input root (/) -->
<xsl:template match="/">
<html><body>
<h1><xsl:value-of select="message"/></h1>
</body></html>
</xsl:template>
</xsl:stylesheet>
Stylesheet (.xsl file)
• It is a well-formed XML document
• It is a collection of template rules
• A template rule consists of a pattern and a
template
• The pattern is specified in XPath and locates
a node of the XML tree.
• The located node is replaced by the
template in the result tree
The .xsl file
• An XSLT document has the .xsl extension
• The XSLT document begins with:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
• Contains one or more templates, such as:
<xsl:template match="/"> ... </xsl:template>
• And ends with:
</xsl:stylesheet>
Finding the message text
• The template <xsl:template match="/"> says
to select the entire file
– You can think of this as selecting the root node of
the XML tree
• Inside this template,
– <xsl:value-of select="message"/> selects the
message child
– Alternative XPath expressions that would also
work:
• ./message
• /message/text()
• ./message/text()
Putting it together
• The XSL was:
<xsl:template match="/">
<html><body>
<h1><xsl:value-of select="message"/></h1>
</body></html>
</xsl:template>
• The <xsl:template match="/"> chooses the root
• The <html><body> <h1> is written to the output file
• The contents of message is written to the output file
• The </h1> </body></html> is written to the output file
• The resultant file looks like:
<html><body>
<h1>Hello World!</h1>
</body></html>
How XSLT works
• The XML text document is read in and stored as
a tree of nodes
• The <xsl:template match="/"> template is used
to select the entire tree
• The rules within the template are applied to the
matching nodes, thus changing the structure of
the XML tree
– If there are other templates, they must be called
explicitly from the main template
• Unmatched parts of the XML tree are not
changed
• After the template is applied, the tree is written
out again as a text document
Where XSLT can be used
• A server can use XSLT to change XML files
into HTML files before sending them to the
client
• A modern browser can use XSLT to
change XML into HTML on the client side
– This is what we will mostly be doing in this
class
• Most users seldom update their browsers
– If you want “everyone” to see your pages, do
any XSL processing on the server side
Modern browsers
• Internet Explorer 6 best supports XML
• Netscape 6 supports some of XML
• Internet Explorer 5.x supports an obsolete
version of XML
– If you must use IE5, the initial PI is different
(you can look it up if you ever need it)
xsl:value-of
• <xsl:value-of select="XPath expression"/>
selects the contents of an element and
adds it to the output stream
– The select attribute is required
– Notice that xsl:value-of is not a container,
hence it needs to end with a slash
• Example (from an earlier slide):
<h1> <xsl:value-of select="message"/> </h1>
xsl:for-each
• xsl:for-each is a kind of loop statement
• The syntax is
<xsl:for-each select="XPath expression">
Text to insert and rules to apply
</xsl:for-each>
• Example: to select every book (//book) and
make an unordered list (<ul>) of their titles
(title), use:
<ul>
<xsl:for-each select="//book">
<li> <xsl:value-of select="title"/> </li>
</xsl:for-each>
</ul>
Filtering output
• You can filter (restrict) output by adding a
criterion to the select attribute’s value:
<ul>
<xsl:for-each select="//book">
<li>
<xsl:value-of
select="title[../author='Terry Pratchett']"/>
</li>
</xsl:for-each>
</ul>
• This will select book titles by Terry Pratchett
Filter details
• Here is the filter we just used:
<xsl:value-of
select="title[../author='Terry Pratchett']"/>
• author is a sibling of title, so from title we have
to go up to its parent, book, then back down to
author
• This filter requires a quote within a quote, so we
need both single quotes and double quotes
• Legal filter operators are:
=   !=   &lt;   &gt;
– Numbers should be quoted, but apparently don’t have
to be
But it doesn’t work right!
• Here’s what we did:
<xsl:for-each select="//book">
<li>
<xsl:value-of
select="title[../author='Terry Pratchett']"/>
</li>
</xsl:for-each>
• This will output <li> and </li> for every book, so
we will get empty bullets for authors other than
Terry Pratchett
• There is no obvious way to solve this with just
xsl:value-of
xsl:if
• xsl:if allows us to include content if a given
condition (in the test attribute) is true
• Example:
<xsl:for-each select="//book">
<xsl:if test="author='Terry Pratchett'">
<li>
<xsl:value-of select="title"/>
</li>
</xsl:if>
</xsl:for-each>
• This does work correctly!
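The difference between filtering inside xsl:value-of and wrapping the list item in xsl:if can be reproduced outside XSLT. The sketch below is not XSLT and does not run a stylesheet (Python's standard library has no XSLT processor); it re-expresses the same two control flows with xml.etree.ElementTree. The library document and its book titles are sample data made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
BOOKS = """<library>
<book><title>Mort</title><author>Terry Pratchett</author></book>
<book><title>Emma</title><author>Jane Austen</author></book>
<book><title>Jingo</title><author>Terry Pratchett</author></book>
</library>"""

root = ET.fromstring(BOOKS)

# Filtering only the value: an <li> is created for every book,
# so books by other authors leave an empty bullet.
naive = ET.Element("ul")
for book in root.iter("book"):
    li = ET.SubElement(naive, "li")
    if book.findtext("author") == "Terry Pratchett":
        li.text = book.findtext("title")

# Equivalent of wrapping the <li> in xsl:if: the test comes first.
filtered = ET.Element("ul")
for book in root.iter("book"):
    if book.findtext("author") == "Terry Pratchett":
        ET.SubElement(filtered, "li").text = book.findtext("title")

html = ET.tostring(filtered, encoding="unicode")
print(len(naive), len(filtered), html)
```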
xsl:choose
• The xsl:choose ... xsl:when ...
xsl:otherwise construct is XSL’s equivalent
of Java’s switch ... case ... default
statement
• The syntax is:
<xsl:choose>
<xsl:when test="some condition">
... some code ...
</xsl:when>
<xsl:otherwise>
... some code ...
</xsl:otherwise>
</xsl:choose>
• xsl:choose is often used within an
xsl:for-each loop
xsl:sort
• You can place an xsl:sort inside an xsl:for-each
• The attribute of the sort tells what field to
sort on
• Example:
<ul>
<xsl:for-each select="//book">
<xsl:sort select="author"/>
<li> <xsl:value-of select="title"/> by
<xsl:value-of select="author"/> </li>
</xsl:for-each>
</ul>
– This example creates a list of titles and
authors, sorted by author
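For comparison, the same "loop over the books in sorted order" idea looks as follows in Python with xml.etree.ElementTree. This is only an analogy to xsl:sort inside xsl:for-each, not an XSLT run, and the book data is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
BOOKS = """<library>
<book><title>Mort</title><author>Terry Pratchett</author></book>
<book><title>Emma</title><author>Jane Austen</author></book>
<book><title>Dune</title><author>Frank Herbert</author></book>
</library>"""

root = ET.fromstring(BOOKS)
ul = ET.Element("ul")

# The sort key plays the role of the select attribute of xsl:sort.
for book in sorted(root.iter("book"), key=lambda b: b.findtext("author")):
    li = ET.SubElement(ul, "li")
    li.text = f"{book.findtext('title')} by {book.findtext('author')}"

items = [li.text for li in ul]
print(items)
```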
xsl:text
• <xsl:text>...</xsl:text> helps deal with two
common problems:
– XSL isn’t very careful with whitespace in the
document
• This doesn’t matter much for HTML, which collapses all
whitespace anyway (though the HTML source may look
ugly)
• <xsl:text> gives you much better control over
whitespace; it acts like the <pre> element in HTML
– Since XML defines only five entities, you cannot
readily put other entities (such as &nbsp;) in your XSL
• &amp;nbsp; almost works, but &nbsp; is visible on the
page
• Here’s the secret formula for entities:
<xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>
Creating tags from XML data
• Suppose the XML contains
<name>Dr. Abolhassani's Home Page</name>
<url>http://sharif.edu/~abolhassani</url>
• And you want to turn this into
<a href="http://sharif.edu/~abolhassani">
Dr. Abolhassani's Home Page</a>
• We need additional tools to do this!
Creating tags--solution 1
• Suppose the XML contains
<name>Dr. Abolhassani's Home Page</name>
<url>http://sharif.edu/~abolhassani</url>
• <xsl:attribute name="..."> adds the named
attribute to the enclosing tag
• The value of the attribute is the content of
this tag
• Example:
<a>
<xsl:attribute name="href">
<xsl:value-of select="url"/>
</xsl:attribute>
<xsl:value-of select="name"/>
</a>
Result: <a href="http://sharif.edu/~abolhassani">
Dr. Abolhassani's Home Page</a>
Creating tags--solution 2
• Suppose the XML contains
<name>Dr. Abolhassani's Home Page</name>
<url>http://sharif.edu/~abolhassani</url>
• An attribute value template (AVT) consists of
braces { } inside the attribute value
• The content of the braces is replaced by its
value
• Example:
<a href="{url}">
<xsl:value-of select="name"/>
</a>
Result:
<a href="http://sharif.edu/~abolhassani">
Dr. Abolhassani's Home Page</a>
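Both solutions do the same thing: they take a value from the source tree and place it in an attribute of a new element. As a cross-check of the expected result, the sketch below builds the same anchor element in Python with xml.etree.ElementTree. It is not XSLT, and the wrapper element "link" around name and url is an assumption made so that the fragment is a well-formed document.

```python
import xml.etree.ElementTree as ET

# "link" is a hypothetical wrapper element around the two source elements.
src = ET.fromstring(
    "<link>"
    "<name>Dr. Abolhassani's Home Page</name>"
    "<url>http://sharif.edu/~abolhassani</url>"
    "</link>"
)

# The url element's text becomes the href attribute value,
# the name element's text becomes the content of <a>.
a = ET.Element("a", {"href": src.findtext("url")})
a.text = src.findtext("name")

result = ET.tostring(a, encoding="unicode")
print(result)
```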
Modularization
• Modularization--breaking up a complex program
into simpler parts--is an important programming
tool
– In programming languages modularization is often
done with functions or methods
– In XSL we can do something similar with
xsl:apply-templates
• For example, suppose we have a DTD for book
with parts titlePage, tableOfContents, chapter,
and index
– We can create separate templates for each of these
parts
Book example
• <xsl:template match="/">
<html> <body>
<xsl:apply-templates/>
</body> </html>
</xsl:template>
• <xsl:template match="tableOfContents">
<h1>Table of Contents</h1>
<xsl:apply-templates
select="chapterNumber"/>
<xsl:apply-templates
select="chapterName"/>
<xsl:apply-templates
select="pageNumber"/>
</xsl:template>
• Etc.
xsl:apply-templates
• The <xsl:apply-templates> element
applies a template rule to the current
element or to the current element’s child
nodes
• If we add a select attribute, it applies the
template rule only to the child that
matches
• If we have multiple <xsl:apply-templates>
elements with select attributes, the child
nodes are processed in the same order as
the <xsl:apply-templates> elements
Applying templates to children
• <book>
<title>XML</title>
<author>Gregory Brill</author>
</book>
• <xsl:template match="/">
<html> <head></head> <body>
<b><xsl:value-of select="/book/title"/></b>
<xsl:apply-templates
select="/book/author"/>
</body> </html>
</xsl:template>
<xsl:template match="/book/author">
by <i><xsl:value-of select="."/></i>
</xsl:template>
• With the xsl:apply-templates line, the output is:
XML by Gregory Brill
• Without that line, the output is:
XML
Tools for XSL Development
• There are a number of free and
commercial XSL tools available
– XSLT processors:
• MSXML, which currently supports the latest XSLT
specification (native Win32)
• Xalan from Apache (C++, Java)
– Editors and browsers
• Internet Explorer 6.0
• XML Spy (commercial)
Cocoon
• Cocoon is Apache’s dynamic XML
Publishing Framework.
• Cocoon uses XSLT.
• Cocoon allows separation of content, logic
and presentation, so that people can
interact and collaborate on a project
without stepping on each other's toes, and
supports component-based web development.
• Cocoon is a web-application that runs
using Apache Tomcat (Cocoon.war).
What Cocoon can do
[Figure: request flow. Labels: browser, request, HTML/PDF/..., WML, XML document (static or dynamic), New Device ?]
Cocoon Pipeline
Cocoon introduced the idea of a pipeline to
handle a request. A pipeline is a series of steps
for processing a particular kind of content.
[Figure: XML Document -> File Generator -> (SAX) -> XSLT Transformer -> (SAX) -> HTML serializer -> HTML File]
Sitemap
In Cocoon, configuration information for the pipelines
that an application requires is defined in a file named
sitemap.
References
• Specifications:
– http://www.w3.org/Style/XSL
– http://www.w3.org/TR/xslt
– http://www.w3.org/TR/xpath
– http://www.w3.org/TR/xsl
• An excellent XSLT tutorial:
– http://www.cafeconleche.org/books/bible2/chapters/ch17.html
• Another tutorial:
– http://www.w3schools.com/xsl
• Microsoft (MSXML3):
– http://msdn.microsoft.com/xml
• Saxon:
– http://saxon.sourceforge.net/
• Xalan:
– http://xml.apache.org./xalan/overview.html
Extended document standards
• You can define your own XML tag sets, but here
are some already available:
– XHTML: HTML redefined in XML
– SMIL: Synchronized Multimedia Integration Language
– MathML: Mathematical Markup Language
– SVG: Scalable Vector Graphics
– DrawML: Drawing MetaLanguage
– ICE: Information and Content Exchange
– ebXML: Electronic Business with XML
– cXML: Commerce XML
– CBL: Common Business Library