XML and Beyond - Technion – Israel Institute of Technology

Document Type Definitions (DTDs)
Imposing Structure on XML Documents
1
Document Type Definitions
• Document Type Definitions (DTDs) impose
structure on an XML document
• Using DTDs, we can specify what a “valid”
document should contain.
• These specifications require more than just being
well-formed, e.g., which elements are legal and which
nesting is allowed
• DTDs have limited expressive power,
e.g., they cannot specify data types
2
What is it good for?
• DTDs can be used to define special languages of
XML, i.e., restricted XML for special needs
• Examples:
– FOAF
– SVG (scalable vector graphics)
– WML (a kind of HTML for wireless devices)
– SOAP (for web services)
– XHTML (well-formed version of HTML)
• Standards for data exchange can be defined using DTDs,
and special applications can be written for them
3
Address Book DTD
• Suppose we want to create a DTD that describes
legal address book entries
• This DTD will be used to exchange address book
information between programs
• How should it be written? (What is a legal
address?)
• We discuss both element definitions and
attribute definitions
4
Element Definitions
5
Example: An Address Book
<person>
<name> Homer Simpson </name>            <!-- Exactly one name -->
<greet> Dr. H. Simpson </greet>         <!-- At most one greeting -->
<addr>1234 Springwater Road </addr>     <!-- As many address lines as needed -->
<addr> Springfield USA, 98765 </addr>
<tel> (321) 786 2543 </tel>             <!-- Mixed telephones and faxes -->
<fax> (321) 786 2544 </fax>
<tel> (321) 786 2544 </tel>
<email> [email protected] </email>       <!-- At least one email -->
</person>
6
Specifying the Structure
• How do we specify exactly what must appear in a
person element?
• In a DTD, we can specify the permitted content for
each element.
• The permitted content is specified as a regular
expression
• We show the general syntax, and then an example
7
a                     Element a
e1?                   0 or 1 occurrences of expression e1
e1*                   0 or more occurrences of expression e1
e1+                   1 or more occurrences of expression e1
e1,e2                 Expression e1 followed by expression e2
e1|e2                 Either expression e1 or expression e2 (but not both!)
(e)                   Grouping
#PCDATA               Parsed character data (i.e., text)
EMPTY                 No content
ANY                   Any content
(#PCDATA|a1|..|an)*   Mixed content
8
What’s in a person Element?
• The expression is:
– name, greet?, addr*, (tel | fax)*, email+
• We discuss what each part of this means
– name = there must be a name element
– greet? = there is an optional greet element (i.e.,
0 or 1 greet elements)
– name, greet? = the name element is followed by
an optional greet element
9
What’s in a person Element? (cont.)
name, greet?, addr*, (tel | fax)*, email+
• addr* = there are 0 or more address elements
• tel | fax = there is a tel or a fax element
• (tel | fax)* = there are 0 or more repeats of tel or
fax
• email+ = there are 1 or more email elements
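The reading of the person expression can also be checked mechanically. The following Python sketch is only an illustration, not part of any DTD tool: the helper names content_model_to_regex and is_legal are made up here, and it assumes element names consist of letters, digits, '_', '.' and '-'. It translates a content model into a regular expression over the sequence of child element names.

```python
import re

def content_model_to_regex(model):
    """Translate a DTD-style content model into a regex over child names.

    Every element name becomes a group matching "name " (the name plus one
    space); ',' is plain concatenation; ? * + | ( ) keep their meaning.
    """
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model)
    parts = []
    for tok in tokens:
        if tok == ",":
            continue                      # sequence = concatenation
        elif tok == "(":
            parts.append("(?:")           # non-capturing group
        elif tok in ")|?*+":
            parts.append(tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def is_legal(model, children):
    """True if the list of child element names matches the content model."""
    word = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(word) is not None

PERSON = "name, greet?, addr*, (tel | fax)*, email+"

# The Homer Simpson entry: name, greet, two addr, mixed tel/fax, one email.
print(is_legal(PERSON, ["name", "greet", "addr", "addr", "tel", "fax", "tel", "email"]))
print(is_legal(PERSON, ["name", "email", "email"]))    # minimal entry with two emails
print(is_legal(PERSON, ["name", "tel"]))               # no email element
print(is_legal(PERSON, ["greet", "name", "email"]))    # greet before name
```

The first two calls report a legal person element and the last two an illegal one.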
10
What’s in a person Element? (cont.)
name, greet?, addr*, (tel | fax)*, email+
• Does this expression differ from:
– name, greet?, addr*, tel*, fax*, email+
– name, greet?, addr*, (fax | tel)*, email+
– name, greet?, addr*, (fax | tel)*, email, email*
– name, greet?, addr*, (fax | tel)*, email*, email
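One way to explore the question is to compare the expressions by brute force. The Python sketch below is an illustration under stated assumptions: each element is abbreviated to one letter (t = tel, f = fax, e = email), the common prefix name, greet?, addr* is dropped because it is identical in all variants, and the helper first_difference is a made-up name. Only words up to a fixed length are tried, so a result of None is evidence of equivalence, not a proof.

```python
import re
from itertools import product

ORIGINAL = "(t|f)*e+"
VARIANTS = {
    "tel*, fax*, email+": "t*f*e+",
    "(fax | tel)*, email+": "(f|t)*e+",
    "(fax | tel)*, email, email*": "(f|t)*ee*",
    "(fax | tel)*, email*, email": "(f|t)*e*e",
}

def first_difference(rx1, rx2, alphabet="tfe", max_len=5):
    """Return the shortest word accepted by exactly one regex, or None."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if bool(re.fullmatch(rx1, word)) != bool(re.fullmatch(rx2, word)):
                return word
    return None

for label, rx in VARIANTS.items():
    print(label, "->", first_difference(ORIGINAL, rx))
```

Only the first variant yields a distinguishing word: a fax followed by a tel is allowed by (tel | fax)* but not by tel*, fax*. Accepting the same words is not the whole story, though: in the last variant a parser that reads an email element cannot tell whether it belongs to email* or is the final email.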
11
DTD For the Address Book
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE addressbook [
<!ELEMENT addressbook (person*)>
<!ELEMENT person
(name, greet?, addr*, (fax | tel)*, email+)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT greet (#PCDATA)>
<!ELEMENT addr (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
12
Example
• Requirements:
– Every country must have a name as the first node.
– Every country must have a capital city as the following
node.
– A country may have a king.
– A country may have a queen.
• What is wrong with the following:
– <!ELEMENT country (name,capital?,king*,queen)>
13
Unambiguity
• A DTD must be 1-unambiguous, i.e., it must
be clear at any moment when parsing a
document which point we are at in the
regular expression
• Which of the following is 1-unambiguous?
– (a,b)|(a,c)
– a,(b|c)
<a> </a>
<b> </b>
• We now formalize these ideas…
14
Languages
• An element definition defines a language,
i.e., the set of all legal series of children
• Example: Which of the following are in the
language defined by a*,(b|c),a+
– aba
– abca
– aab
– aaacaaa
15
Automata
• Languages can also be defined using an automaton
• An automaton consists of:
– a set of states Q
– an alphabet Σ
– a transition function δ, which associates a pair (q,a)
with a state q’
– an initial state q0
– a set of accepting states F
• A word a1…an is in the language defined by an
automaton if there is a path from q0 to a state in F
with edges labeled a1,…,an
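The definition above can be turned into a few lines of Python. This is a minimal sketch, not a reference implementation: the function name accepts and the sample automaton are made up for illustration (the automaton is not one of the figures in these slides), and the transition function maps a pair (q, a) to a set of states so that the same code also covers automata with several possible next states.

```python
def accepts(delta, start, accepting, word):
    """Simulate a (possibly non-deterministic) automaton on a word.

    delta maps (state, letter) to the set of possible next states.
    The word is accepted if some path ends in an accepting state.
    """
    current = {start}
    for letter in word:
        current = {q2 for q in current for q2 in delta.get((q, letter), set())}
    return bool(current & accepting)

# Hypothetical automaton for the words of a*,(b|c),a+
delta = {
    ("q0", "a"): {"q0"},
    ("q0", "b"): {"q1"},
    ("q0", "c"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2"},
}
F = {"q2"}

for word in ["aba", "abca", "aab", "aaacaaa"]:
    print(word, accepts(delta, "q0", F, word))
```

Keeping a set of current states means the simulation follows every possible path at once, which is how membership is decided when several transitions share a label.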
16
Automata Example: What Language
Does this Define?
[Figure: automaton with states q0, q1, q2 and edges labeled a, a, b]
17
Automata Example: What Language
Does this Define?
[Figure: automaton with states q0, q1, q2, q3 and edges labeled a, a, b, b, c]
18
Automata Example: What Language
Does this Define?
[Figure: automaton with states q0, q1, q2, q3 and edges labeled a, b, b, b, c]
Note that this automaton is non-deterministic!
19
Non-Deterministic Automata
• An automaton is non-deterministic if there is
a state q and a letter a such that there are at
least two transitions from q via edges
labeled with a
– What words are in the language of a non-deterministic automaton?
• We now show how to create a Glushkov
automaton from a regular expression
20
Creating an automaton from an
element definition
a*,(b|c),a+
Step 1: Normalize the expression
by replacing any occurrence of an
expression e+ with e,e*
a*,(b|c),a,a*
Step 2: Use subscripts to number
each occurrence of each letter
a1*,(b1|c1),a2,a3*
21
Creating an automaton from an
element definition
a1*,(b1|c1),a2,a3*
Step 3: Create a state for each
subscripted letter, and a state q0
Step 4: Choose as accepting
states all subscripted letters with
which it is possible to end a word
[Figure: the states q0, a1, b1, c1, a2, a3]
22
Creating an automaton from an
element definition
a1*,(b1|c1),a2,a3*
Step 5: Create a transition from a
state li to a state kj if there is a word
in which kj follows li. Label the
transition with k. Likewise, create a
transition from q0 to kj if a word can
begin with kj
[Figure: the states q0, a1, b1, c1, a2, a3. You fill in the transitions!]
23
1-unambiguous
• An expression is 1-unambiguous if its Glushkov
automaton is deterministic.
– otherwise it is 1-ambiguous
– element definitions in a DTD must be 1-unambiguous!
• Examples: Create a Glushkov automaton for each of the
following and check whether the
expression is 1-unambiguous
– (a,b)|(a,c)
– a,(b|c)
– a?, d+, b*, d*, (c|b)+
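The construction and the determinism test can be sketched in Python. The sketch rests on a few assumptions: expressions are written by hand as nested tuples that are already normalized and subscripted (as in Steps 1 and 2), all helper names (sym, seq, alt, star, opt, glushkov, is_deterministic) are made up for this illustration, and the transitions are computed with the usual first/last/follow sets, which is one standard way to carry out Steps 3 to 5 and not necessarily the way they are done by hand.

```python
from collections import defaultdict

# Expression constructors; leaves are numbered letters such as "a1".
def sym(p): return ("sym", p)
def alt(l, r): return ("alt", l, r)
def star(e): return ("star", e)
def opt(e): return ("opt", e)
def seq(*es):
    e = es[-1]
    for left in reversed(es[:-1]):
        e = ("seq", left, e)
    return e

def nullable(e):
    """Can the expression match the empty word?"""
    kind = e[0]
    if kind == "sym": return False
    if kind in ("star", "opt"): return True
    if kind == "seq": return nullable(e[1]) and nullable(e[2])
    return nullable(e[1]) or nullable(e[2])          # alt

def first(e):
    """Positions with which a word can begin."""
    kind = e[0]
    if kind == "sym": return {e[1]}
    if kind in ("star", "opt"): return first(e[1])
    if kind == "alt": return first(e[1]) | first(e[2])
    return (first(e[1]) | first(e[2])) if nullable(e[1]) else first(e[1])

def last(e):
    """Positions with which a word can end."""
    kind = e[0]
    if kind == "sym": return {e[1]}
    if kind in ("star", "opt"): return last(e[1])
    if kind == "alt": return last(e[1]) | last(e[2])
    return (last(e[1]) | last(e[2])) if nullable(e[2]) else last(e[2])

def fill_follow(e, table):
    """table[p] collects the positions that can directly follow p."""
    kind = e[0]
    if kind == "sym":
        return
    for child in e[1:]:
        fill_follow(child, table)
    if kind == "star":
        for p in last(e[1]):
            table[p] |= first(e[1])
    elif kind == "seq":
        for p in last(e[1]):
            table[p] |= first(e[2])

def glushkov(e):
    """Return (transitions, accepting); states are "q0" and the positions."""
    table = defaultdict(set)
    fill_follow(e, table)
    transitions = {"q0": first(e)}
    transitions.update(table)
    accepting = last(e) | ({"q0"} if nullable(e) else set())
    return transitions, accepting

def is_deterministic(transitions):
    """No state may have two outgoing edges with the same letter."""
    for targets in transitions.values():
        letters = [p.rstrip("0123456789") for p in targets]
        if len(letters) != len(set(letters)):
            return False
    return True

RUNNING = seq(star(sym("a1")), alt(sym("b1"), sym("c1")), sym("a2"), star(sym("a3")))
AMBIGUOUS = alt(seq(sym("a1"), sym("b1")), seq(sym("a2"), sym("c1")))   # (a,b)|(a,c)
FACTORED = seq(sym("a1"), alt(sym("b1"), sym("c1")))                    # a,(b|c)

for name, e in [("a*,(b|c),a+", RUNNING), ("(a,b)|(a,c)", AMBIGUOUS), ("a,(b|c)", FACTORED)]:
    transitions, accepting = glushkov(e)
    print(name, "deterministic:", is_deterministic(transitions))
```

A transition into position kj is implicitly labeled with the letter k, so the determinism test only has to strip the subscripts from the targets of each state and look for duplicates.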
24
Ambiguous Example
• Replace the following with a 1-unambiguous
equivalent expression
<!ELEMENT country
(president | king | (king,queen) | queen)>
<!ELEMENT president (#PCDATA)>
<!ELEMENT king (#PCDATA)>
<!ELEMENT queen (#PCDATA)>
25
Another Example
• Customers may pay with a combination of credit
cards and cash.
• If cards and cash are both used, the cards must
come first.
• There may be more than one card.
• There may be no more than one cash element.
• At least one method of payment must be used.
• Find a 1-unambiguous definition for the element
payment, using the elements card and cash
26
Attribute Definitions
27
More DTD Syntax
• XML documents can have elements, which can
have attributes. How are they defined?
• General Syntax:
<!ATTLIST element-name
attribute-name1 type1 default-value1
attribute-name2 type2 default-value2
….
attribute-nameN typeN default-valueN>
• Example: <!ATTLIST height dim CDATA "cm">
28
<!ATTLIST element-name
attribute-name1 type1 default-value1
attribute-name2 type2 default-value2
….
attribute-nameN typeN default-valueN>
• type is one of the following (there are additional
possibilities that we don’t discuss)
CDATA          character data
(en1|en2|..)   value must be one from the given list
ID             value is a unique id
IDREF          value is the id of another element
IDREFS         value is a list of other ids
29
<!ATTLIST element-name
attribute-name1 type1 default-value1
attribute-name2 type2 default-value2
….
attribute-nameN typeN default-valueN>
• default-value is one of the following
value       The default value of the attribute
#REQUIRED   The attribute value must be included in the element
#IMPLIED    The attribute does not have to be included
30
Examples
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
dimension (cm | in) #REQUIRED
accuracy CDATA #IMPLIED
resizable CDATA "yes"
>
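To see what these three declarations demand of a height element, here is a small Python sketch. It is an illustration only: the ATTLIST is transcribed by hand into a dictionary (HEIGHT_ATTRS and check_attributes are made-up names), and the standard-library parser used here reads XML but does not itself process DTDs.

```python
import xml.etree.ElementTree as ET

# attribute name -> (type, default); an enumerated type is a tuple of values
HEIGHT_ATTRS = {
    "dimension": (("cm", "in"), "#REQUIRED"),
    "accuracy": ("CDATA", "#IMPLIED"),
    "resizable": ("CDATA", "yes"),
}

def check_attributes(element, declarations):
    """Return (attributes with defaults applied, list of problems)."""
    attrs = dict(element.attrib)
    problems = []
    for name in attrs:
        if name not in declarations:
            problems.append(f"undeclared attribute {name!r}")
    for name, (attr_type, default) in declarations.items():
        if name not in attrs:
            if default == "#REQUIRED":
                problems.append(f"missing required attribute {name!r}")
            elif default != "#IMPLIED":
                attrs[name] = default          # literal default value
        elif isinstance(attr_type, tuple) and attrs[name] not in attr_type:
            problems.append(f"{name}={attrs[name]!r} is not one of {attr_type}")
    return attrs, problems

print(check_attributes(ET.fromstring('<height dimension="cm">172</height>'), HEIGHT_ATTRS))
print(check_attributes(ET.fromstring('<height dimension="m"/>'), HEIGHT_ATTRS))
print(check_attributes(ET.fromstring('<height accuracy="1"/>'), HEIGHT_ATTRS))
```

The first element passes and picks up resizable="yes" from the default; the second uses a value outside the enumerated list; the third omits the required dimension attribute.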
31
Specifying ID and IDREF Attributes
<!DOCTYPE family [
<!ELEMENT family (person)*>
<!ELEMENT person (name)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST person
id       ID     #REQUIRED
mother   IDREF  #IMPLIED
father   IDREF  #IMPLIED
children IDREFS #IMPLIED>
]>
32
Specifying ID and IDREF
Attributes (cont.)
• The attributes mother and father are references to
IDs of other elements
• However, those are not necessarily person
elements!
• The mother attribute is not necessarily a reference
to a female person
References to IDs have no type
33
Some Conforming Data
<family>
<person id="lisa" mother="marge" father="homer">
<name> Lisa Simpson </name>
</person>
<person id="bart" mother="marge" father="homer">
<name> Bart Simpson </name>
</person>
<person id="marge" children="bart lisa">
<name> Marge Simpson </name>
</person>
<person id="homer" children="bart lisa">
<name> Homer Simpson </name>
</person>
</family>
34
Consistency of ID and IDREF
Attribute Values
• If an attribute is declared as ID
– the associated values must all be distinct (no confusion)
– In other words, no two ID attributes can have the same
value
• If an attribute is declared as IDREF
– the associated value must exist as the value of some ID
attribute (no dangling “pointers”)
• Similarly for all the values of an IDREFS attribute
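These two consistency rules are easy to check by hand or with a short program. The Python sketch below is a simplified illustration: the attribute types of the family DTD are copied by hand into ATTR_TYPES, they are looked up by attribute name only (a real DTD declares them per element), and check_references is a made-up name. Element declarations are not checked at all.

```python
import xml.etree.ElementTree as ET

ATTR_TYPES = {"id": "ID", "mother": "IDREF", "father": "IDREF", "children": "IDREFS"}

def check_references(xml_text, attr_types=ATTR_TYPES):
    """Return a list of ID/IDREF problems found in the document."""
    root = ET.fromstring(xml_text)
    ids, refs, problems = set(), [], []
    for element in root.iter():
        for name, value in element.attrib.items():
            kind = attr_types.get(name)
            if kind == "ID":
                if value in ids:
                    problems.append(f"duplicate ID {value!r}")
                ids.add(value)
            elif kind == "IDREF":
                refs.append(value)
            elif kind == "IDREFS":
                refs.extend(value.split())     # whitespace-separated list of ids
    for ref in refs:
        if ref not in ids:
            problems.append(f"dangling reference {ref!r}")
    return problems

SIMPSONS = """<family>
<person id="lisa" mother="marge" father="homer"><name> Lisa Simpson </name></person>
<person id="bart" mother="marge" father="homer"><name> Bart Simpson </name></person>
<person id="marge" children="bart lisa"><name> Marge Simpson </name></person>
<person id="homer" children="bart lisa"><name> Homer Simpson </name></person>
</family>"""

KRYPTON = """<family>
<person id="superman" mother="lara" father="jor-el"><name> Clark Kent </name></person>
<person id="kara" children="laura"><name> Linda Lee </name></person>
</family>"""

print(check_references(SIMPSONS))
print(check_references(KRYPTON))
```

References are collected first and resolved only after the whole document has been read, because an IDREF may point to an element that appears later (lisa refers to marge before marge is defined).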
35
Is This Legal?
<family>
<person id="superman" mother="lara" father="jor-el" >
<name> Clark Kent </name>
</person>
<person id="kara" children="laura" >
<name> Linda Lee </name>
</person>
</family>
36
Is This Legal?
<family>
<person id="superman" mother="lara" father="jor-el" >
<name> Clark Kent </name>
</person>
<person id="kara" children="laura" >
<name> Linda Lee </name>
</person>
<fruit id="jor-el" >
<name> Banana </name>
</fruit>
</family>
37
Adding a DTD to the Document
• A DTD can be internal
– The DTD is part of the document file
• or external
– The DTD and the document are in separate
files
• An external DTD may reside
– In the local file system (where the document is)
– In a remote file system (by using a URL)
38
Connecting a Document with its DTD
• An internal DTD:
<?xml version="1.0"?>
<!DOCTYPE db [<!ELEMENT ...> … ]>
<db> ... </db>
• A DTD from the local file system:
<!DOCTYPE db SYSTEM "schema.dtd">
• A DTD from a remote file system:
<!DOCTYPE db SYSTEM
"http://www.schemaauthority.com/schema.dtd">
39
Valid Documents
• A document with a DTD is valid if it
conforms to the DTD, i.e.,
– the document conforms to the regular-expression grammar,
– types of attributes are correct, and
– constraints on references are satisfied
40
DTD Issues
41
DTD Problems (1)
• DTDs are rather weak specifications by DB &
programming-language standards
– Only one base type – PCDATA
– No useful “abstractions”, e.g., sets
– IDREFs are untyped
– No constraints, e.g., child is inverse of parent
– Tag definitions are global
– Not easily parsed (since they are not XML)
• Some extensions of XML impose a schema or
types on an XML document, e.g., XML Schema
42
DTD Problems (2)
• How would you say that element a has
exactly the children c, d, e in any order?
• In general, can such definitions be written
efficiently?
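As a rough illustration of the cost, the following Python sketch generates a content model that accepts each of the given children exactly once, in any order, by branching on the first child. The function names any_order and size are made up for this example. Branching on the first child means every alternative starts with a different element name, but the size of the result grows factorially with the number of children.

```python
import re

def any_order(names):
    """Content model in which every name occurs exactly once, in any order."""
    if len(names) == 1:
        return names[0]
    branches = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        branches.append(f"({first}, {any_order(rest)})")
    return "(" + " | ".join(branches) + ")"

def size(model):
    """Number of element-name occurrences in a content model."""
    return len(re.findall(r"\w+", model))

print(any_order(["c", "d", "e"]))
for n in range(1, 7):
    print(n, size(any_order([f"x{i}" for i in range(n)])))
```

For c, d, e the model already mentions 15 element names, and for four children it mentions 64.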
43
Be Careful (1)
<!DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
name,
dateOfBirth,
person,    -- mother
person )>  -- father
...
]>
What is the problem with this?
44
Be Careful (2)
<!DOCTYPE genealogy [
<!ELEMENT genealogy (person*)>
<!ELEMENT person (
name,
dateOfBirth,
person?,    -- mother
person? )>  -- father
...
]>
What is now the problem with this?
45