DTD++ 2.0: Adding support for
co-constraints
Davide Fiorello
Nicola Gessa
Paolo Marinelli
Fabio Vitali
University of Bologna
Two sales pitches here

DTDs aren’t dead yet and should not be

Co-constraints are important, and the very
next step in validation
The war of schema languages
DTD?
XML Schema?
Relax NG?
Schematron?
ISO/IEC 19757 DSDL (especially Part 9:
“Data type- and namespace-aware DTDs”)
My own story




The project NormeInRete (http://www.normeinrete.it):
XML-ization of national and regional laws and basically
any kind of normative document in Italy
Supported by the Italian Office of the Prime Minister, the
Ministry of Justice and the department for Informatics in
Public Administration. All national laws and regional laws
from 3 (soon 7) of the 20 regions are now available in XML
and locatable through URNs.
Yours truly is the main author of the DTDs and
documentation manuals providing guidance for
conversion.
The document type contains 150+ elements and 50+
attributes, dealing with content, meta-content, evolution in
time and space, non-ASCII characters. By the end of the
year we will deal with judicial documents.
NormeInRete: DTD or XML Schema?



The project started in 1999, and the first version of the rules
was ready in 2000: necessarily DTD!
The syntax is clear, easy to look up and use, and well known to users and tool implementers.
The birth of XML Schema created many discussions on
whether to switch:
“All my friends use XML Schema”
“XML Spy creates very nice drawings of an XML Schema”
“XML Schema is the future”
“Admit you don’t know the first thing about XML Schema”
In truth, there is very little real reason to switch: DTDs
are fine for our purposes.
So far, the two sides are balanced. European integration
may provide the necessary pressure.
But…
… is the switch inevitable?
Are DTDs dead?

The need for an XML-based syntax


For automatic processing and generation
The presence of strong competition


XML Schema
Relax NG
The absence of many important features
Yes, but …
DTDs are easier to learn,
DTDs are easier to read,
DTDs are easier to use
Many people still think in terms of DTDs

So: DTD++ 1.0 (Extreme Markup 2003)


The idea: create a DTD-like language that is as
powerful as the most powerful validation language: XML
Schema.
Syntax from DTD, structures and concepts from XML
Schema:




Namespace support
Complex types for managing markup structures
Simple types for managing constraints on data containers
Use as much as possible of DTD syntax, invent as little
as possible, recycle concepts with new meanings.
What about XML-based syntax?

Semantic equivalence to another XML-based
schema language means this is no longer a
problem.
Just convert it!

All human tasks use the original DTD++ form;
all computer tasks use the corresponding XSD
version. Conversion is easy and fast.
A taste of DTD++ (1)

Anonymous complex types in XSD are content models
<!ELEMENT X (A?, (B | C)[2-5], D*) >
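A minimal sketch of what the content model above accepts, assuming that `[2-5]` means "between two and five occurrences" (the counterpart of minOccurs="2" maxOccurs="5" in XSD). The function name is hypothetical, and a regular expression over the child element names stands in for a real validator.

```python
import re
import xml.etree.ElementTree as ET

# (A?, (B | C)[2-5], D*) rewritten as a regular expression over child names.
# Each name is followed by one space so that names cannot run together.
CONTENT_MODEL = re.compile(r"(A )?((B|C) ){2,5}(D )*")


def matches_content_model(element):
    """Return True if the children of `element` satisfy the content model."""
    names = "".join(child.tag + " " for child in element)
    return CONTENT_MODEL.fullmatch(names) is not None


print(matches_content_model(ET.fromstring("<X><A/><B/><C/><D/></X>")))  # True
print(matches_content_model(ET.fromstring("<X><A/><B/></X>")))          # False
```

The second fragment fails because only one of B or C is present, while the model asks for at least two.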

Predefined simple types are predefined keywords
<!ELEMENT A (#PCDATA)> or <!ELEMENT A (#STRING)>
<!ELEMENT B (#INTEGER)>
<!ELEMENT C (#DATE)>

Anonymous simple types add facets to predefined
simple types. Syntax for facets uses well-known
mathematical constructs: for instance {} for lengths and []
for ranges.
<!ELEMENT D (#INTEGER[,100])>
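The range facet can be read as a mathematical interval. The sketch below assumes that an omitted bound means "unbounded" and that a reversed bracket (as in `]0,]`, used in the sample question later in this deck) marks an exclusive bound. `parse_range` and `in_range` are hypothetical names, not part of DTD++.

```python
import re
from decimal import Decimal

# One bracket, an optional lower bound, a comma, an optional upper bound, one bracket.
_RANGE = re.compile(r"^([\[\]])\s*([^,\s]*)\s*,\s*([^\[\]\s]*)\s*([\[\]])$")


def parse_range(facet):
    """Parse '[lo,hi]' style facets into (lower, lower_inclusive, upper, upper_inclusive)."""
    match = _RANGE.match(facet.strip())
    if match is None:
        raise ValueError(f"not a range facet: {facet!r}")
    left, lo, hi, right = match.groups()
    lower = Decimal(lo) if lo else None   # None means no lower bound
    upper = Decimal(hi) if hi else None   # None means no upper bound
    return lower, left == "[", upper, right == "]"


def in_range(value, facet):
    """Return True if the numeric string `value` lies inside the interval."""
    lower, lower_inclusive, upper, upper_inclusive = parse_range(facet)
    number = Decimal(value)
    if lower is not None and (number < lower or (number == lower and not lower_inclusive)):
        return False
    if upper is not None and (number > upper or (number == upper and not upper_inclusive)):
        return False
    return True


print(in_range("100", "[,100]"))  # True
print(in_range("101", "[,100]"))  # False
print(in_range("0", "]0,]"))      # False
```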
A taste of DTD++ (2)

Named types are named entities using different
characters to differentiate themselves
<!ENTITY # myInt "(#INTEGER[0,100])">
<!ELEMENT D #myInt; >
<!ENTITY @ myType "(A?, (B | C)[2-5], D*)" >
<!ELEMENT X @myType; >

Complex types that specify attributes have an additional
block of quotes:
<!ENTITY @ myType "(A?, (B | C)[2-5], D*)"
"anAttr #STRING{10} #IMPLIED">
<!ELEMENT X @myType; >
A taste of DTD++ (3)

Mixed content models extend the DTD syntax to allow any
structure allowable with XSD:
<!ENTITY @ myType "#PCDATA (A?, (B | C)[2-5], D*)" >
<!ELEMENT X @myType; >

The ANY structure is extended
<!ELEMENT comment ANY[0,3]{http://www.foo.org}>

Target namespaces use the newly introduced TARGETNS
structure
<!TARGETNS "http://www.foo.org">
<!TARGETNS ns "http://www.bar.org">
<!ELEMENT name (ns:firstname)>
<!ELEMENT ns:firstname (#PCDATA)>
Limits




No support (yet) for keys, keyrefs, uniques.
No local elements
No support for refs
Only two design styles supported:



Salami slices
Garden of Eden.
No redefine or include (but no need for them)
Co-constraints and what are they for
Better constraints
Real-life constraints
Constraints difficult to formalize
Is DTD++ 1.0 enough, then?


No, since XML Schema is not enough
XML Schema cannot express all the structure and data constraints
that document designers may need:





Mutual exclusion (“element x may have either the a attribute or the b
attribute, but not both”)
Deep exclusions (“element x cannot contain, at any level of its subtree,
element y”)
Structure-dependent structures (“if the item is gratis, i.e., the attribute
gratis is present, then no price should be specified, i.e., the element
price should be absent”)
Data-dependent structures (“if the address is a PO box, then the
address must include a PO box number, otherwise it must include a
street name and a street number”)
These kinds of constraints are known as co-constraints, or co-occurrence constraints. Most real-life XML document types have
one or more of them.
For example…

XHTML
“a elements cannot contain other a elements” (appendix B)
Neither the normative DTD nor the non-normative XML Schema can
fully express this requirement (they only express a weaker form: “a
elements cannot directly contain other a elements”)

XSLT
“In a template element at least one of the match and name attributes
must be present”
Again, the DTD and the XML Schema cannot express this requirement, and
specify both attributes as optional.

XML Schema itself
“An element definition must either contain a ref or a name attribute,
but not both. Furthermore, if the name attribute is present, then the type
attribute or one of the simpleType or complexType elements must
be present, but not two.”
The normative XML Schema can only specify all these elements and
attributes as optional.

… and plenty more…
Who cares?


Documents that violate these rules are still
considered valid by the XML Schema validator.
Three solutions:



Hope for the best (“It won’t happen”) - subject to Murphy’s Law
Provide a default behavior (“If both attributes are present, consider
the first only”)
Provide validation code within the downstream application
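[Diagram: XML doc → DOM parser → DOM tree → Schema validator (with its rules) → DOM tree + PSVI → downstream application. Error exits are labelled “Not well-formed” and “invalid”; three points in the flow are marked with “?”.]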
SchemaPath and DTD++ 2.0




At the WWW2004 conference, we presented
SchemaPath, our proposal to minimally extend XML
Schema to handle co-constraints.
The idea is to find a way to conditionally assign types to
elements and attributes. Furthermore, a non-satisfiable
type is added for specifying error conditions to avoid.
SchemaPath maintains the XML Schema syntax, adds
only ONE construct and ONE pre-defined simple type,
maintains important XML Schema properties (the
validation theorem and round-tripping and reverse
round-tripping properties), and does not impact the
PSVI for valid documents.
DTD++ 2.0 is the DTD-like syntax for SchemaPath
DTD++ 2.0

Conditional assignment of types





Multiple definitions of the same element, each conditioned by an
XPath expression. Implicit and explicit priorities are used.
Each condition is tested on the instance element, and the one that
holds with the highest priority is selected.
The type specified by the selected definition is assigned to the
element.
This is NOT a way to provide conditional types: types are just plain
old DTD++ 1.0 (XML Schema) types.
The #ERROR simple type


When we want to specify the non-validity of a condition, we assign
the element the #ERROR type.
The #ERROR type is a non-satisfiable type, whose presence in
the instance document always and automatically signals a
validation error.
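The selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: plain Python predicates stand in for the XPath conditions, only explicit priorities are modelled, ties are resolved by declaration order, and `assign_type` and `ERROR` are hypothetical names rather than part of any DTD++ or SchemaPath tool.

```python
import xml.etree.ElementTree as ET

ERROR = "#ERROR"  # stand-in for the non-satisfiable type


def assign_type(element, alternatives):
    """Pick the type of the highest-priority alternative whose condition holds.

    `alternatives` is a list of (condition, priority, type_name) tuples, where
    `condition` is a function from an element to a boolean.
    """
    holding = [(priority, type_name)
               for condition, priority, type_name in alternatives
               if condition(element)]
    if not holding:
        return None  # no alternative applies
    # max() keeps the first maximal item, so ties fall back to declaration order.
    return max(holding, key=lambda pair: pair[0])[1]


# Mutual exclusion: x may carry attribute a or attribute b, but not both.
alternatives_for_x = [
    (lambda e: "a" in e.attrib and "b" in e.attrib, 1, ERROR),
    (lambda e: True, 0, "myType"),  # the empty condition always holds
]

print(assign_type(ET.fromstring('<x a="1"/>'), alternatives_for_x))        # myType
print(assign_type(ET.fromstring('<x a="1" b="2"/>'), alternatives_for_x))  # #ERROR
```

Once a type has been selected, validation proceeds against it as usual; selecting `#ERROR` simply means that no content can ever satisfy it.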
Examples

Mutual exclusion

“Element x may have either the a attribute or the b attribute
but not both”. Suppose we have defined a type myType with both a
and b attributes as optional
<xsd:element name="x">
  <xsd:alt cond="(@a and @b)" type="xsd:error"/>
  <xsd:alt type="myType"/>
</xsd:element>
<!ELEMENT x "(@a and @b)" #ERROR>
<!ELEMENT x "" @myType;>

Data-dependent structures

“The element quantity must be an integer if the unit element
is ‘items’, and it must be a decimal value if the unit element is
‘meters’”. Suppose we have already defined the data type for the
unit element to only contain the values “meters” or “items”.
<xsd:element name="quantity">
  <xsd:alt cond="../unit='items'" type="xsd:integer"/>
  <xsd:alt cond="../unit='meters'" type="xsd:decimal"/>
</xsd:element>
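The effect of the quantity example can be mimicked with a hand-written check. This sketch assumes that unit and quantity are siblings under a common parent, uses regular expressions for the lexical forms of xsd:integer and xsd:decimal, and treats any other unit value as invalid (an assumption; the slide relies on the type of unit to rule such values out). The function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

INTEGER = re.compile(r"[+-]?[0-9]+")                           # lexical form of xsd:integer
DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")     # lexical form of xsd:decimal


def quantity_is_valid(parent):
    """Check the quantity child against the type selected by the unit child."""
    unit = parent.findtext("unit")
    quantity = parent.findtext("quantity", "").strip()
    if unit == "items":    # cond="../unit='items'"  -> xsd:integer
        return INTEGER.fullmatch(quantity) is not None
    if unit == "meters":   # cond="../unit='meters'" -> xsd:decimal
        return DECIMAL.fullmatch(quantity) is not None
    return False           # assumption: no alternative applies, so reject


print(quantity_is_valid(ET.fromstring(
    "<line><unit>items</unit><quantity>2.5</quantity></line>")))   # False
print(quantity_is_valid(ET.fromstring(
    "<line><unit>meters</unit><quantity>2.5</quantity></line>")))  # True
```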
One possible solution to the W3C problems (1)

XHTML
“a elements cannot contain other a elements” (appendix B)
<!ELEMENT a ".//a" (#ERROR)>
<!ELEMENT a "" (@inlineType;)>


XSLT
“In a template element at least one of the match and name
attributes must be present”
<!ELEMENT template "not(@match) and not(@name)" (#ERROR) >
<!ELEMENT template "" (@templateType;) >
<!ENTITY @ templateType "%templateContent;"
"match (#patternType;) name (#NCName;)">

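The two error conditions on this slide, `.//a` and `not(@match) and not(@name)`, are ordinary XPath tests. A minimal sketch of what they check, written with plain ElementTree calls; namespaces are ignored for brevity and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_nested_a(a_element):
    """True if an `a` element contains another `a` at any depth (XPath: .//a)."""
    # iter() also yields the element itself, so it is excluded by identity.
    return any(descendant is not a_element for descendant in a_element.iter("a"))


def template_is_error(template):
    """True if neither attribute is present: not(@match) and not(@name)."""
    return "match" not in template.attrib and "name" not in template.attrib


nested = ET.fromstring('<a href="#top">see <b><a href="#x">this</a></b></a>')
print(has_nested_a(nested))                                  # True
print(template_is_error(ET.fromstring('<template/>')))       # True
print(template_is_error(ET.fromstring('<template name="t"/>')))  # False
```

The first check is the one a plain content model cannot express: the inner `a` is not a direct child, so forbidding `a` among the children of `a` is not enough.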
One possible solution to the W3C problems (2)

XML Schema

“An element definition must either contain a ref or a name attribute, but
not both. Furthermore, if the name attribute is present, then the type
attribute or one of the simpleType or complexType elements must be
present, but not two.”
<!ELEMENT simpleType (@localSimpleType;)>
<!ELEMENT complexType (@localComplexType;)>
<!ENTITY @ element "(simpleType|complexType)"
"name (#NCName;) #IMPLIED
ref (#QName;) #IMPLIED
type (#QName;) #IMPLIED">
<!ELEMENT element "@name and @ref":4 (#ERROR)>
<!ELEMENT element "(@type or @ref) and (xsd:simpleType or
xsd:complexType)":3 (#ERROR)>
<!ELEMENT element "../xsd:schema and @ref":2 (#ERROR)>
<!ELEMENT element "not(@ref) and not(@name)":1 (#ERROR)>
<!ELEMENT element "":0 (@element;)>
The “Trojan Milestones” requirements
“1. the element must be empty exactly when its sID or eID
attribute is set.
2. when eID is present, no other attributes are permitted.
3. each sID/eID value should occur only twice (once on
sID and once on eID)
4. empty elements with matching sID and eID values
should match up in proper pairs and in order.
Note that because of the second rule above, no
attributes may be required for milestoneable elements.
Schema languages that can make attributes optional
or required depending on the presence of other
attributes (in this case eID) do not suffer this problem.”
[DeRose, Extreme Markup 2004]
A DTD++ 2.0 solution to the Trojan
Milestones requirements
<!ENTITY @ startMarker "EMPTY"
"sID ID #REQUIRED %regularAtts;">
<!ENTITY @ endMarker "EMPTY"
"eID IDREF #REQUIRED">
<!ELEMENT X "":0 %regularCM; >
<!ATTLIST X "":0 %regularAtts;>
<!ELEMENT X "@sID":2 @startMarker;>
<!ELEMENT X "@sID = preceding::*/@sID":3 #ERROR>
<!ELEMENT X "@eID=preceding::X/@sID":4 @endMarker;>
<!ELEMENT X "@eID = preceding::*/@eID":3 #ERROR>
<!ELEMENT X "@eID":2 #ERROR>
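For comparison, the milestone requirements can also be checked procedurally. The sketch below walks the elements named X in document order and reports violations; it is a simplified reading, not a translation of the DTD++ declarations. It checks emptiness only for markers, and for rule 4 it only verifies that each eID follows its sID. The function name and the message texts are hypothetical.

```python
import xml.etree.ElementTree as ET


def milestone_errors(root, tag="X"):
    """Return a list of violations of the Trojan Milestones rules (simplified)."""
    errors = []
    started, ended = set(), set()
    for element in root.iter(tag):  # iter() walks the tree in document order
        sid, eid = element.get("sID"), element.get("eID")
        if sid is None and eid is None:
            continue  # a regular, non-milestone element
        if len(element) or (element.text or "").strip():
            errors.append("rule 1: marker is not empty")
        if eid is not None:
            if len(element.attrib) > 1:
                errors.append(f"rule 2: eID={eid} has other attributes")
            if eid in ended:
                errors.append(f"rule 3: eID={eid} occurs twice")
            elif eid not in started:
                errors.append(f"rule 4: eID={eid} has no preceding sID")
            ended.add(eid)
        else:
            if sid in started:
                errors.append(f"rule 3: sID={sid} occurs twice")
            started.add(sid)
    for sid in sorted(started - ended):
        errors.append(f"rule 3: sID={sid} has no matching eID")
    return errors


good = ET.fromstring('<doc><X sID="v1"/>some text<X eID="v1"/></doc>')
bad = ET.fromstring('<doc><X eID="v1"/>some text<X sID="v1"/></doc>')
print(milestone_errors(good))  # []
print(milestone_errors(bad))   # ['rule 4: eID=v1 has no preceding sID']
```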
Implementation of the DTD++ 2.0 parser

A DTD++ 2.0 validator exists and can be tested online at
http://tesi.fabio.web.cs.unibo.it/dpp

It is a Java application and a plain XML Schema validating engine
(tested with Xalan and MS XML parsers)

The application is a pre-processor to any XML Schema validator.
Given an XML document X and a DTD++ document D:
it converts D into one or more equivalent SchemaPath files SP;
it converts SP into a plain XML Schema file XS;
it converts X into a different XML file X’, so that
XS validates X’ if and only if SP validates X, and thus if and only if D
validates X.
… but who cares for DTD anyway?
This part is not in the published paper
On July 21st, 2004 we ran a test on the relative speed
and precision of DTD++ and XML Schema.
14 volunteers (10 M, 4 F) were recruited, all 3rd- and
4th-year computer science students, versed in both
DTD and XML Schema (they had all passed with good
marks both the Web Technologies exam and specifically
the questions on DTDs and XML Schema).
The volunteers were divided into two groups and given 15
questions. Half had to solve them using XML Schema,
half using DTD++.
The test

The 15 questions were identical in both tests,
and covered:
A. Write XML: apply the rules from a schema to
write valid XML fragments (5 questions)
B. Validate XML: apply the rules from a schema
to find errors in XML fragments (5 questions)
C. Write Schemas: write a fragment of schema given
a plain text description of the problem (5 questions)
A sample question

Verify whether the fragment:
<order>
<to id="125">John Smith</to>
<lines><line>
<art>130</art>
<description>Some nice stuff</description>
<col>Red</col>
<price>0,65</price>
<quant>130</quant>
</line></lines>
</order>
is valid with respect to the following DTD++ fragment:
<!ELEMENT order (to, lines) >
<!ELEMENT to (#STRING)>
<!ATTLIST to id ID #REQUIRED>
<!ELEMENT lines (line+) >
<!ELEMENT line (art, col, price, quant)>
<!ELEMENT art (#PCDATA{,20}) >
<!ENTITY # colors "(red | blue | green | yellow)" >
<!ELEMENT col (#colors;) >
<!ELEMENT quant (#INTEGER]0,]) >
<!ELEMENT price (#DECIMAL]0,]) >
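One way to work through the sample question is to hand-code each declaration as a check. This sketch is not the answer key used in the test; it assumes that `{,20}` is a maximum length, that `]0,]` means strictly positive, that enumeration values are case-sensitive, and that an ID must be an XML name (approximated here with an ASCII-only pattern). All names in the code are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

FRAGMENT = """<order>
  <to id="125">John Smith</to>
  <lines><line>
    <art>130</art>
    <description>Some nice stuff</description>
    <col>Red</col>
    <price>0,65</price>
    <quant>130</quant>
  </line></lines>
</order>"""

NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")              # ASCII approximation of an ID
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
COLORS = {"red", "blue", "green", "yellow"}


def problems(order):
    """Return the labels of the declarations that the fragment violates."""
    found = []
    if not NAME.fullmatch(order.find("to").get("id", "")):
        found.append("to/@id")            # an ID cannot start with a digit
    for line in order.iter("line"):
        if [child.tag for child in line] != ["art", "col", "price", "quant"]:
            found.append("line content")  # (art, col, price, quant) only
        if len(line.findtext("art", "")) > 20:
            found.append("art")
        if line.findtext("col") not in COLORS:
            found.append("col")
        price = line.findtext("price", "")
        if not DECIMAL.fullmatch(price) or float(price) <= 0:
            found.append("price")
        quant = line.findtext("quant", "")
        if not INTEGER.fullmatch(quant) or int(quant) <= 0:
            found.append("quant")
    return found


print(problems(ET.fromstring(FRAGMENT)))
# ['to/@id', 'line content', 'col', 'price']
```

Under these assumptions the fragment is not valid: the id starts with a digit, description is not allowed inside line, "Red" is not one of the listed colours, and "0,65" uses a comma as decimal separator.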
The results

DTD++ was clearly faster in all three categories:
36% faster on group A (Write XML)
53% faster on group B (Validate XML)
Twice as fast (99%) on group C (Write Schemas)
The question on the previous slide was answered in 0:01:33 on average
with DTD++, and in 0:03:03 on average with XML Schema.
Errors were slightly more frequent with DTD++ than with XML Schema (123%),
but this might be due to the fact that the language was brand new.
Of course the volunteers were very few, and the test might
be considered non-significant, but it gives at least an initial
approximate measure of the relative value of the two
languages.
An interesting note is that one of the volunteers converted
the XML Schema into DTD fragments with textual
annotations before answering each question.
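The slides do not say how "% faster" was computed. As a worked example, here are two common readings applied to the two average times reported for the sample question; which one the authors used for the per-group figures is not stated, so this is an illustration only.

```python
def to_seconds(clock):
    """Convert an h:mm:ss string into seconds."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


dtdpp = to_seconds("0:01:33")  # 93 seconds with DTD++
xsd = to_seconds("0:03:03")    # 183 seconds with XML Schema

speed_ratio = xsd / dtdpp      # how many times longer XML Schema took
time_saved = 1 - dtdpp / xsd   # share of the XML Schema time saved with DTD++

print(f"{speed_ratio:.2f}x")   # 1.97x
print(f"{time_saved:.0%}")     # 49%
```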
Demo

A demo of the validating engine and the full
results of the tests can be found at
http://tesi.fabio.web.cs.unibo.it/dpp

Time for a demo?
Conclusions




DTDs are faster to learn and use
XML Schema is powerful and expressive
Schematron-like co-constraints are even more
expressive
Why learn three languages?

DTD++ 1.0 is semantically equivalent to a relevant subset of
XML Schema
SchemaPath provides co-constraints with a very limited syntax
and the new idea of conditional assignment of types (rather
than conditional typing)
DTD++ 2.0 uses the same principle with a DTD-like syntax
What now? Maybe ISO/IEC 19757 - DSDL:
Part 5: Data types
Part 9: Data type- and namespace-aware DTDs