Typing semistructured data
Serge Abiteboul
Web Data Management
Master Informatique
Typing semistructured data
10/9/2007
1
Organization
• Motivations
• Automata
– Automata on words
– Ranked tree automata
– Unranked tree automata
– Automata and monadic second-order logic
– Automata – to compute
• XML typing: DTD, XML schema
• Graphs and bisimulation
Motivation
XML typing
• Not compulsory
• Simplify writing software for XML
– Improve interoperability between programs
• Improve storage and performance
• Ease querying: data guide
• Simplify data protection
– Reject illegal update – like relational dependencies
Improve storage
[Figure: a schema graph with nodes Root, Company and Employee and edges company, person, works-for, managed-by and c.e.o.]
Lower-bound schema
• Company (oid, name, address, c.e.o.)
• Employee (oid, name, managed-by, works-for)
Store rest in overflow graph
Improve performance
[Figure: schema of Bib with paper, book and journal entries; fields such as year (int), title (string), author (first, last) and address (streetname, city, zip)]
Without the schema:
select X.title
from Bib._ X
where X.*.zip = "12345"
With the schema, the query becomes:
select X.title
from Bib.book X
where X.address.zip = "12345"
Type checking
• Who checks
– XML editor: check that the data conforms to its type
– XML exchange, e.g., with Web service
• Server when delivering the data
• Client/application: when receiving it
• Dynamic verification: after the data is produced
• Static verification: verification of the program that
generates the data
Static verification
• Input: input type T and code of function f
– f is Xquery, Xpath, XSLT, etc.
• Verification of T’
– Is it true that for all d╞T, f(d)╞T’ ?
• Type inference
– Find the smallest T’ such that for all d╞T, f(d)╞T’
• Rapidly undecidable because of “joins”
Example
for $p in doc("parts.xml")//part[color="red"]
return <part>
<name>{$p/name/text()}</name>
<desc>{$p/desc/node()}</desc>
</part>
Result type
(part (name (string) desc (any)))*
If the type of parts.xml//part/desc is string
(part (name (string) desc (string)))*
Difficulty
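for $X in Input, $Y in Input do { print ( <b/> ) }
Input: <a/> <a/>
Result: <b/> <b/> <b/> <b/>
Problem: \( \{ b^i \mid i = n^2 \text{ for } n \geq 0 \} \) cannot be described in XML schema
There is no « best » result, only better and better approximations:
– \( b^* \)
– \( \varepsilon + b\,b^* \)
– \( \varepsilon + b + b^4 b^* \)
– \( \varepsilon + b + b^4 + b^9 b^* \)
– …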
Why tree automata?
• XML = unranked trees
• No theory for XML
• Rich theory for strings: automata
• Extends to a rich theory for ranked trees: tree automata
– Nice algorithms
– Nice theorems
– Can this carry over to unranked trees and XML?
• Yes!
From strings to trees
[Figure: three structures side by side]
• Word (a a b b a): finite state automata
• Binary tree: ranked tree automata
• Tree with no bound on the number of children: unranked tree automata
Why not then use
unranked tree automata?
• Missing practical gadgets
• Complexity of verification
– Goal: typing at reasonable cost
Automata
Automata on words
Finite state automata on words
\( (\Sigma, Q, q_0, F, \delta) \)
• Alphabet: \( \Sigma \)
• States: \( Q \)
• Initial state: \( q_0 \in Q \)
• Accepting states: \( F \subseteq Q \)
• Transitions: \( \delta : \Sigma \times Q \to \mathcal{P}(Q) \)
Nondeterministic automaton:
Example
\( \delta(a, q_0) = \{q_0, q_1\} \)
\( \delta(b, q_1) = \{q_0\} \)
\( \delta(\varepsilon, q_1) = \{q_2\} \)
\( \delta(\varepsilon, q_2) = \{q_3\} \)
\( \delta(\varepsilon, q_3) = \{q_3\} \)
\( \Sigma = \{a, b\} \)
\( Q = \{q_0, q_1, q_2, q_3\} \)
\( F = \{q_2\} \)
[Figure: several runs of the automaton on the word a b a; a run that gets stuck is marked KO, a run that ends in q2 is marked OK]
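The example automaton can be simulated by tracking the set of all states reachable after each letter. The following is a minimal Python sketch under stated assumptions: the names `DELTA`, `eps_closure` and `accepts` are hypothetical, and the empty string is used here as a marker for ε-transitions.

```python
EPS = ""  # assumption: the empty string stands for an epsilon-transition

# Transition relation of the example: (symbol, state) -> set of states.
DELTA = {
    ("a", "q0"): {"q0", "q1"},
    ("b", "q1"): {"q0"},
    (EPS, "q1"): {"q2"},
    (EPS, "q2"): {"q3"},
    (EPS, "q3"): {"q3"},
}
START, FINAL = "q0", {"q2"}


def eps_closure(states):
    """All states reachable from `states` using only epsilon-transitions."""
    closure, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for r in DELTA.get((EPS, q), set()):
            if r not in closure:
                closure.add(r)
                todo.append(r)
    return closure


def accepts(word):
    """Accept if some run on `word` ends in an accepting state."""
    current = eps_closure({START})
    for letter in word:
        step = set()
        for q in current:
            step |= DELTA.get((letter, q), set())
        current = eps_closure(step)
    return bool(current & FINAL)
```

Following all runs at once avoids backtracking: a word is accepted as soon as one of the tracked states is in F.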
Reminder
• Deterministic
– No \( \varepsilon \)-transition such as \( \delta(\varepsilon, q) = \{q_0\} \)
– No alternative transitions such as \( \delta(a, q_0) = \{q_0, q_1\} \)
• Determinization
– It is possible to obtain an equivalent deterministic automaton
– State of new automaton = set of states of the original one
– Possible exponential blow-up
• Minimization
• Limitations – cannot do
– Context-free languages such as \( \{ a^n b^n \mid n \in \mathbb{N} \} \)
• Essential tool – e.g., lexical analysis
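The determinization step recalled above (a state of the new automaton is a set of states of the original one) can be sketched as follows. This is an illustrative Python sketch: the function names and the sample automaton (words over {a, b} whose second-to-last letter is a) are assumptions, not taken from the slides, and ε-transitions are assumed to have been eliminated beforehand.

```python
def determinize(alphabet, delta, start, final):
    """Subset construction for an NFA without epsilon-transitions."""
    start_set = frozenset({start})
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        subset = todo.pop()
        for letter in alphabet:
            # Union of the successors of every NFA state in the subset.
            target = frozenset(
                r for q in subset for r in delta.get((letter, q), ())
            )
            dfa_delta[(letter, subset)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfa_final = {s for s in seen if s & final}
    return seen, dfa_delta, start_set, dfa_final


def dfa_accepts(word, dfa_delta, dfa_start, dfa_final):
    state = dfa_start
    for letter in word:
        state = dfa_delta[(letter, state)]
    return state in dfa_final


# Hypothetical NFA: state 0 loops on a and b, guesses the second-to-last a.
NFA_DELTA = {
    ("a", 0): {0, 1},
    ("b", 0): {0},
    ("a", 1): {2},
    ("b", 1): {2},
}
STATES, DFA_DELTA, DFA_START, DFA_FINAL = determinize("ab", NFA_DELTA, 0, {2})
```

Only the subsets reachable from the start set are built, which keeps the result small when the worst-case exponential blow-up does not occur.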
Reminder (2)
• L(A) = set of words accepted by automaton A
• Regular languages
• Can be described by regular expressions, e.g. a(b+c)*d
• Closed under complement: \( \Sigma^* - L(A) \)
• Closed under union, intersection: \( L(A) \cup L(B) \), \( L(A) \cap L(B) \)
– Product automaton with states (s,s’) where s is from A and s’ is from B
Automata on words versus trees
[Figure: a word over {a, b} read left to right or right to left, and a tree over {a, b} evaluated bottom-up or top-down]
• On words, left to right versus right to left: no difference
• On trees, bottom-up versus top-down: differences
Automata
Automata on ranked trees
Binary tree automata
• Parallel evaluation
\( (\Sigma, Q, F, \delta) \)
• For leaves: \( \delta : \Sigma \to \mathcal{P}(Q) \)
• For other nodes: \( \delta : \Sigma \times Q \times Q \to \mathcal{P}(Q) \)
[Figure: bottom-up evaluation of a binary tree with nodes labeled a and b; states q, q’, q”, q1, q2 are assigned from the leaves to the root]
Bottom-up tree automata
\( \delta(a, q, q') = \{r, r'\} \)
• Bottom-up: if a node labeled a has its children in states q, q’ then the node moves nondeterministically to state r or r’
• Accepts if the root is in some state in F
• Not deterministic if there are alternatives or \( \varepsilon \)-transitions:
\( \delta(a, q, q') = \{r, r'\} \), \( \delta(\varepsilon, r) = \{r'\} \)
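Example: deterministic bottom-up
\( \Sigma = \{0, 1, \wedge, \vee\} \)
\( Q = \{q_0, q_1\} \)
\( F = \{q_1\} \)
\( \delta(0) = \{q_0\} \)
\( \delta(1) = \{q_1\} \)
\( \delta(\wedge, q_1, q_1) = \{q_1\} \)
\( \delta(\wedge, q_0, q_0) = \delta(\wedge, q_1, q_0) = \delta(\wedge, q_0, q_1) = \{q_0\} \)
\( \delta(\vee, q_0, q_0) = \{q_0\} \)
\( \delta(\vee, q_1, q_1) = \delta(\vee, q_1, q_0) = \delta(\vee, q_0, q_1) = \{q_1\} \)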
Boolean circuit evaluation
[Figure: bottom-up run of the automaton on a Boolean circuit; leaves labeled 0 and 1 get states q0 and q1, inner gates get the state of their value, and the root ends in q1: OK]
\( \delta(0) = \{q_0\} \)
\( \delta(1) = \{q_1\} \)
\( \delta(\wedge, q_1, q_1) = \{q_1\} \)
\( \delta(\wedge, q_0, q_0) = \{q_0\} \)
\( \delta(\wedge, q_1, q_0) = \{q_0\} \)
\( \delta(\wedge, q_0, q_1) = \{q_0\} \)
\( \delta(\vee, q_0, q_0) = \{q_0\} \)
\( \delta(\vee, q_1, q_1) = \{q_1\} \)
\( \delta(\vee, q_1, q_0) = \{q_1\} \)
\( \delta(\vee, q_0, q_1) = \{q_1\} \)
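The bottom-up evaluation of a Boolean circuit can be written as a short recursive function. This is a sketch under assumptions: trees are encoded as nested tuples `(label, left, right)` with leaves `"0"` and `"1"`, and the labels `"and"` and `"or"` stand for the two gate symbols; these encodings and names are illustrative only.

```python
# Leaf rules: label -> state.
LEAF = {"0": "q0", "1": "q1"}

# Rules for inner nodes: (label, left state, right state) -> state.
NODE = {
    ("and", "q1", "q1"): "q1",
    ("and", "q0", "q0"): "q0",
    ("and", "q1", "q0"): "q0",
    ("and", "q0", "q1"): "q0",
    ("or", "q0", "q0"): "q0",
    ("or", "q1", "q1"): "q1",
    ("or", "q1", "q0"): "q1",
    ("or", "q0", "q1"): "q1",
}
FINAL = {"q1"}


def run(tree):
    """State reached at the root of `tree` by the deterministic automaton."""
    if isinstance(tree, str):          # a leaf
        return LEAF[tree]
    label, left, right = tree          # an inner node with two children
    return NODE[(label, run(left), run(right))]


def accepts(tree):
    return run(tree) in FINAL
```

Because the automaton is deterministic, each node gets exactly one state, and the two subtrees of a node can be evaluated independently (the "parallel evaluation" of the earlier slide).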
Regular tree language = set of trees
accepted by a bottom-up tree automaton
Regular tree languages
The following are equivalent
– L is a regular tree language
– L is accepted by a nondeterministic bottom-up
automaton
– L is accepted by a deterministic bottom-up
automaton
– L is accepted by a nondeterministic top-down
automaton
Deterministic top-down is weaker
Top-down tree automata
\( \delta(a, q'') = (q, q') \)
• Top-down: if a node labeled a is in state q”, then its left child moves to state q (right to q’)
• Accepts if all leaves are in states in F
• Not deterministic if
\( \delta(a, q'') = \{(q, q'), (r, r')\} \)
Why deterministic
top-down is weaker?
• Consider the language
– L = { f(a,b), f(b,a) }
• It can be accepted by a bottom-up TA
– Exercise: write a BUTA A such that L = L(A)
• Suppose that B is a deterministic top-down TA with L = L(B)
– Exercise: Show that B also accepts f(a,a)
– A contradiction
Fact: No deterministic top-down tree automaton accepts L
Ranked trees automata: Properties
• Like for words
• Determinization
• Minimization
• Closed under
– Complement
– Intersection
– Union
But…
• XML documents are unranked:
book (intro,section*,conclusion)
Automata
Automata on unranked tree
Unranked tree automata
\( \delta(\wedge, t) = t \quad \delta(\wedge, t, t) = t \quad \delta(\wedge, t, t, t) = t \; \ldots \)
\( \delta(\wedge, f) = f \quad \delta(\wedge, t, f) = f \quad \delta(\wedge, f, t) = f \; \ldots \)
\( \delta(\vee, t) = t \quad \delta(\vee, t, f) = t \quad \delta(\vee, f, t) = t \; \ldots \)
\( \delta(\vee, f) = f \quad \delta(\vee, f, f) = f \quad \delta(\vee, f, f, f) = f \; \ldots \)
Issue: represent an infinite set of transitions
Solution: a regular language
Unranked tree automata (2)
• Rule: \( \delta(a, L(Q)) = \{r_1, \ldots, r_m\} \)
• Meaning: if the states of the children of some node labeled a form a word in L(Q), this node moves to some state in {r1,…,rm}
\( \delta(\wedge, And_1) = t \) where \( And_1 = t^* \)
\( \delta(\wedge, And_0) = f \) where \( And_0 = (t + f)^* f\, (t + f)^* \)
\( \delta(\vee, Or_1) = t \) where \( Or_1 = (t + f)^* t\, (t + f)^* \)
\( \delta(\vee, Or_0) = f \) where \( Or_0 = f^* \)
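The rules with regular languages can be tried out directly with Python's `re` module. This is a sketch under assumptions: trees are tuples `(label, child, child, ...)`, the gate symbols are written `"and"` and `"or"`, and leaves labeled `"1"` and `"0"` are given rules over the empty word of children states; none of these encodings come from the slides.

```python
import re

# For each label: a list of (regular language over children states, result).
RULES = {
    "and": [(re.compile(r"t*"), "t"), (re.compile(r"[tf]*f[tf]*"), "f")],
    "or": [(re.compile(r"[tf]*t[tf]*"), "t"), (re.compile(r"f*"), "f")],
    "1": [(re.compile(r""), "t")],   # leaf: empty word of children states
    "0": [(re.compile(r""), "f")],
}
FINAL = {"t"}


def state(tree):
    """Bottom-up state of the root of `tree`, with any number of children."""
    label, children = tree[0], tree[1:]
    word = "".join(state(child) for child in children)
    for language, result in RULES[label]:
        if language.fullmatch(word):
            return result
    raise ValueError(f"no rule applies at node {label!r}")


def accepts(tree):
    return state(tree) in FINAL
```

Each label needs only a finite number of rules, although a node may have arbitrarily many children; the regular expressions stand for the infinite families of transitions.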
Building on ranked trees
[Figure: an unranked tree with nodes labeled a and b, and the binary tree that encodes it]
Ranked tree: FirstChild-NextSibling
F: encoding into a ranked tree
• F is a bijection
\( F^{-1} \): decoding
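The FirstChild-NextSibling encoding F and its inverse can be sketched in a few lines. The representations are assumptions chosen for illustration: an unranked tree is a pair `(label, [children])`, and a binary node is a triple `(label, first_child, next_sibling)` with `None` for a missing child.

```python
def encode(forest):
    """Encode a list of sibling trees as one binary tree (or None)."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    # Left child = first child, right child = next sibling.
    return (label, encode(children), encode(rest))


def decode(node):
    """Inverse of `encode`: rebuild the list of sibling trees."""
    if node is None:
        return []
    label, first_child, next_sibling = node
    return [(label, decode(first_child))] + decode(next_sibling)


# A small unranked tree: a(b, a(b, b), b).
TREE = ("a", [("b", []), ("a", [("b", []), ("b", [])]), ("b", [])])
ENCODED = encode([TREE])
```

Decoding the encoding gives back the original tree, which is the bijection property stated above.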
Building on
bottom-up ranked trees (2)
• For each Unranked TA A, there is a Ranked TA
accepting F(L(A))
• For each Ranked TA A, there is an unranked TA accepting \( F^{-1}(L(A)) \)
• Both are easy to construct
Consequence: Unranked TA are closed under
union, intersection, complement
Determinization
• Determinization always possible for bottom-up
• Deterministic: \( \forall a \in \Sigma, \forall w \in Q^* \), there exists a unique rule \( \delta(a, L) \) such that \( w \in L \)
• Can we use the FirstChild-NextSibling encoding?
– No: it does not preserve determinism
Top-down?
• This is more delicate
• Transition \( \delta(a, q) = A(a, q) \)
– The state of the automaton A(a,q) when reading the labels of the children of a node labeled a determines the states of the children of that node
– Accepts if all the leaves are in an accepting state
Boolean circuit evaluation
[Figure: top-down runs on a Boolean circuit with leaves labeled 0 and 1; states q0 and q1 are propagated from the root to the leaves]
A tree is accepted if, for some possible run, the states of all leaves are final
Automata
Automata and
monadic second-order logic
Monadic second-order logic
• Representation of a tree as a logical structure
[Figure: a tree with nodes numbered 1 to 9; nodes 1, 4 and 8 are labeled a, nodes 2, 3, 5, 6, 7 and 9 are labeled b]
E(1,2), E(1,3)… E(3,9)
S(2,3), S(3,4), S(4,5)…S(8,9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)
Monadic second-order logic
E(1,2), E(1,3)… E(3,9)
S(2,3), S(3,4), S(4,5)…S(8,9)
a(1), a(4), a(8)
b(2), b(3), b(5), b(6), b(7), b(9)
MSO syntax
\( \varphi ::= x = y \mid E(x, y) \mid S(x, y) \mid a(x) \mid \ldots \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid \neg \varphi \mid \exists x\, \varphi \mid \exists X\, \varphi \mid X(x) \)
• X is a set variable; \( X(x) \) means \( x \in X \)
• \( \exists X\, \varphi \): quantification over a set variable
Example of MSO
• Each a node has a b-descendant
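• This corresponds to the formula
\( \forall x\, \big( a(x) \rightarrow \forall X\, (\alpha \wedge \beta \rightarrow \gamma) \big) \) where
\( \alpha = X(x) \)
\( \beta = \forall y \forall z\, (E(y, z) \wedge X(y) \rightarrow X(z)) \)
\( \gamma = \exists y\, (X(y) \wedge b(y)) \)
For each node x labeled a and each set X that (\( \alpha \)) contains x and that (\( \beta \)) is closed under descendant, X contains some y labeled b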
Bridge
Theorem: for a set L of trees, the following are equivalent
1. L = L(A) for some bottom-up tree automaton A
i.e. L is definable with bottom-up tree automata
2. L = {T | T satisfies \( \varphi \)} for some MSO formula \( \varphi \)
i.e. L is definable in MSO
XML typing
DTDs
DTD
• Describes the children of a node labeled a by a regular expression
• Bizarre syntax
<!ELEMENT populationdata (continent*) >
<!ELEMENT continent (name, country*) >
<!ELEMENT country (name, province*) >
<!ELEMENT province (name, city*) >
<!ELEMENT city (name, pop) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT pop (#PCDATA) >
DTD and determinism
• Regular expressions in DTD should be
deterministic
– Complicated definition
• Intuition: the corresponding automata should
be deterministic
– (a+b)*a is not
– When reading <a>, one cannot tell whether it is
an a from (a+b) or if it is the a of the end
– (b*a)(b*a)* is an equivalent expression that is
deterministic
Very efficient validation
• It suffices to verify for each node labeled a that the word formed by the labels of its children is accepted by the finite state automaton Aa
• Possible to type check the document while scanning it, e.g. with a SAX parser
Very efficient validation (2)
<!ELEMENT a ( b c ) >
<!ELEMENT b ( d+ ) >
<a><b><d/><d/></b><c/></a>
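[Figure: the document tree a(b(d, d), c) is scanned; automaton Aa, with states s, t, u, reads the children b c of a, and automaton Ab, with states s’, t’, reads the children d d of b; the states of the open elements are kept on a stack until the document is accepted]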
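The validation idea (one word automaton per label, run over the labels of the children) can be sketched with regular expressions from Python's `re` module. Several points are assumptions: the dictionary `DTD` is a hypothetical encoding of the two declarations above, the elements c and d (not declared on the slide) are taken to be empty, and the document is walked as a tree with `xml.etree.ElementTree` rather than streamed with SAX.

```python
import re
import xml.etree.ElementTree as ET

# Each label maps to a regular expression over the space-terminated
# labels of its children (the trailing space separates labels).
DTD = {
    "a": re.compile(r"b c "),      # <!ELEMENT a ( b c ) >
    "b": re.compile(r"(d )+"),     # <!ELEMENT b ( d+ ) >
    "c": re.compile(r""),          # assumption: c is empty
    "d": re.compile(r""),          # assumption: d is empty
}


def valid(elem):
    """Check `elem` and, recursively, all of its descendants."""
    rule = DTD.get(elem.tag)
    if rule is None:               # undeclared element
        return False
    word = "".join(child.tag + " " for child in elem)
    return bool(rule.fullmatch(word)) and all(valid(c) for c in elem)
```

For instance, `valid(ET.fromstring("<a><b><d/><d/></b><c/></a>"))` checks the document from the slide. Each node is visited once, and the check at a node only looks at the labels of its children.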
Warning
• The previous example can be checked with a simple automaton on words
• But not the following one
<!ELEMENT part ( part* ) >
• The stack is needed for accepting
<a>…<a></a>…</a>
with n opening tags <a> followed by n closing tags </a>
Some bad news for DTD
• Not closed under union
DTD1
…
<!ELEMENT used (ad*) >
<!ELEMENT ad (year, brand) >
DTD2
…
<!ELEMENT new (ad*) >
<!ELEMENT ad (brand) >
• \( L(DTD1) \cup L(DTD2) \) cannot be described by a DTD but can be described easily by a tree automaton
– Problem with the type of ad, which depends on its parent
• Also not closed under complement
• Limited expressive power
Car example continued
[Figure: a Car tree with a Used branch whose ad has Brand "Renault" and Year "2008", and a New branch whose ad has Brand "BMW"]
• The best DTD we can choose does not distinguish
between ads for used and new cars
– <!ELEMENT ad (year?, brand) >
Decoupled types in XML schema
• Each type corresponds to a label, not conversely
car: [car] ( used + new )*
used: [used] (ad1*)
new: [new] (ad2*)
ad1: [ad] (year, brand)
ad2: [ad] (brand)
• The tags are in brackets; type names come before the colon
• Nice closure properties
• Many other « gadgets » in XML schemas
XML typing
XML Schemas
XML Schema
• Often criticized & unnecessarily complicated
• Boosted by Web services
• Richer than DTD – decoupled types
• Close to deterministic top-down tree automata
• XML schemas are extensible
• Many other useful functionalities
– Namespaces
– Atomic types
– Integrity constraints, etc.
An XML schema is an XML document
• Since it is an XML syntax, it can use XML tools
– Editor
– Type checker
– Etc.
• The type of all XML schemas can be described with
an XML schema
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.net-language.com">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <xs:element name="character"
                    minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="friend-of" type="xs:string"
                          minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="since" type="xs:date"/>
              <xs:element name="qualification" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
Simple elements and atomic types
Definition:
<xs:element name="xxx" type="yyy"/>
with common types:
xs:string; xs:decimal; xs:integer; xs:boolean; xs:date; xs:time
Examples
<xs:element name="lastname" type="xs:string"/>
<xs:element name="age" type="xs:integer"/>
<xs:element name="dateborn" type="xs:date"/>
Instances of such elements
<lastname>Refsnes</lastname>
<age>34</age>
<dateborn>1968-03-27</dateborn>
Attributes
Definition: <xs:attribute name="xxx" type="yyy"/>
Example
<xs:attribute name="lang" type="xs:string"/>
Instance of such attribute
<lastname lang="EN">Smith</lastname>
Complex elements
• Empty element
<product pid="1345"/>
• Contains only other elements
<employee> <firstname>John</firstname>
<lastname>Smith</lastname> </employee>
• Contains only text
<food type="dessert">Ice cream</food>
• Contains both elements and text
<description> It happened on <date lang="norwegian">
03.03.99</date> .... </description>
Restriction of simple elements
<xs:element name="age">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="100"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
Other restrictions: enumerated types, patterns, etc.
Restriction on complex elements
<xs:element name="person">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
Possible to name a type
<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
Only the "employee" element can use the specified complex type (<sequence> indicates an order on child elements)
Alternative
<xs:element name="employee" type="personinfo"/>
<xs:complexType name="personinfo">
  <xs:sequence>
    <xs:element name="firstname" type="xs:string"/>
    <xs:element name="lastname" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
Other gadgets
• Import of types associated to a namespace
– <import namespace="http:// ..." schemaLocation="http:// ..."/>
• Possible to include an existing schema
– <include schemaLocation="http:// ..."/>
• Possible to extend/redefine an existing schema
– <redefine schemaLocation="http:// ...">
.... Extensions ...
</redefine>
Example: a DTD
<!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
<!ATTLIST EMAIL
LANGUAGE (Western|Greek|Latin|Universal) "Western"
ENCRYPTED CDATA #IMPLIED
PRIORITY (NORMAL|LOW|HIGH) "NORMAL">
<!ELEMENT TO (#PCDATA)>
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT CC (#PCDATA)>
<!ELEMENT BCC (#PCDATA)>
<!ATTLIST BCC
HIDDEN CDATA #FIXED "TRUE">
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!ENTITY SIGNATURE "Bill">
The same in a variant of XML schema
(more verbose)
<?xml version="1.0" ?>
<Schema name="email" xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<AttributeType name="language"
dt:type="enumeration" dt:values="Western Greek Latin Universal" />
<AttributeType name="encrypted" />
<AttributeType name="priority" dt:type="enumeration" dt:values="NORMAL LOW HIGH" />
<AttributeType name="hidden" default="true" />
<ElementType name="to" content="textOnly" />
<ElementType name="from" content="textOnly" />
<ElementType name="cc" content="textOnly" />
<ElementType name="bcc" content="mixed">
<attribute type="hidden" required="yes" />
</ElementType>
<ElementType name="subject" content="textOnly" />
<ElementType name="body" content="textOnly" />
<ElementType name="email" content="eltOnly">
<attribute type="language" default="Western" />
<attribute type="encrypted" />
<attribute type="priority" default="NORMAL" />
<element type="to" minOccurs="1" maxOccurs="*" />
<element type="from" minOccurs="1" maxOccurs="1" />
<element type="cc" minOccurs="0" maxOccurs="*" />
<element type="bcc" minOccurs="0" maxOccurs="*" />
<element type="subject" minOccurs="0" maxOccurs="1" />
<element type="body" minOccurs="0" maxOccurs="1" />
</ElementType>
</Schema>
Where to place XML schemas
[Figure: diagram positioning DTD, XML schema, deterministic top-down tree automata and tree automata relative to one another]
• Some bizarre restriction
– Inside an element, no two types with the same tag
• Closer to DTDs than to tree automata
• Efficient type validation
Exercise: coupled vs decoupled
• Write a realistic DTD1 for new cars
– With make, model, engine…
• Write a realistic DTD2 for used cars
– Also year, miles, zipcode
• Write an XML schema for \( L(DTD1) \cup L(DTD2) \)
– Using decoupled schema
Automata
Automata to compute
Another use of automata: XPATH
$x in //a/b
[Figure, animated over several slides under the title "Example: //a/b": an NFA for //a/b and the equivalent DFA with states (0), (01) and (02). While a document tree with nodes labeled a and b is traversed, the DFA state of each open node is kept on a stack: a state is pushed when a node is entered and popped when it is left. $x is bound to each b node reached in state (02).]
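The evaluation of //a/b with a DFA and a stack can be sketched as a single pass over parser events. This is an illustrative sketch, not the slides' implementation: the functions `step` and `select` are hypothetical names, the transition table is one reading of the DFA with states (0), (01), (02), and matches are reported as root-to-node paths for readability.

```python
import io
import xml.etree.ElementTree as ET


def step(state, label):
    """DFA for //a/b; the state names list the NFA states they contain."""
    if label == "a":
        return "01"                 # an a was just read
    if label == "b" and state == "01":
        return "02"                 # a b whose parent is an a: match
    return "0"


def select(xml_bytes):
    stack = ["0"]                   # DFA state of every open element
    path = []                       # tags of the open elements (reporting)
    matches = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            path.append(elem.tag)
            stack.append(step(stack[-1], elem.tag))
            if stack[-1] == "02":
                matches.append("/" + "/".join(path))
        else:                       # closing tag: restore the parent state
            path.pop()
            stack.pop()
    return matches
```

The stack is what lets the automaton resume from the parent's state after a subtree has been read, so siblings are evaluated from the same starting state.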
Determinization: exponential blow up
//a/*/*/b
Proposal : k-pebble transducers
stack
[Milo, Suciu, Vianu]
k-pebble transducers: result
[Figure: a tree with a root and nodes labeled a, b and c]
Capture a core aspect of XQuery but not the data management part
Graphs and bisimulation
Graph
• Graph semistructured data
• Graph simulation
• Graph bisimulation
• Data guides
Semistructured data = Labeled graph
[Figure: a labeled graph with root &r, employee edges from &r to the nodes &p1 to &p8, a company edge from &r to &c, manages and managedby edges between employees, and worksfor edges from the employees to &c]
• Possibly a root (&r in the figure)
Rooted graph
• OEM = Object Exchange Model
• With ID-IDREF, XML is a graph model as well
• Labeled (rooted) graph (E,r)
– Set N of nodes
– A finite ternary relation \( E \subseteq N \times N \times \mathit{Label} \)
– E(s,t,l) = there is an edge from s to t labeled l
– r is a node in the graph
Equality revisited
• {1,2,2,1,5} = {1,2,5}
– Ignores the order
• For trees, if we ignore the order of siblings and
use a “set” semantics
[Figure: two trees rooted at a that are equal under this semantics; the first has a b child and a c child, each with a d child; the second is the same tree with the b subtree duplicated]
Simulation
A simulation \( \rho \) of (E,r) with (E’,r’) is a relation between the nodes of E and E’ such that
1. \( \rho(r, r') \)
2. if \( \rho(s, s') \) and E(s,t,l) for some l then there exists t’ with \( \rho(t, t') \) and E’(s’,t’,l)
(we simulate a move in E by a move in E’ with the same label)
Bisimulation
Given \( \rho \), E, E’,
\( \rho \) is a bisimulation if
\( \rho \) is a simulation of E with E’ and
\( \rho^{-1} \) is a simulation of E’ with E
Examples
[Figure: three graphs G, G’ and G” with edges labeled a and d; one pair of graphs is marked "bisimulation", the other "not bisimulation"]
They all have the same paths from the root
A more complex example of
graph bisimulation
[Figure, animated over several slides under the title "Graph bisimulation": a data graph with a root, nodes c1 and c2 reached by programmer and statistician edges, employees e1 to e4, and projects p1 to p9 carrying string values such as "exercise", "lecture" and "finance", connected by employee, project, workson, leads and consults edges. It is related step by step to a smaller type graph with nodes R, t1 and t2 and edges labeled programmer | statistician, employee, projects, _ and STRING.]
Computing bisimulation in ptime
• Start with \( \rho = N \times N' \) (for N, N’ the sets of nodes)
• While there exists (x,x’) in \( \rho \) that violates the definition of bisimulation, remove (x,x’) from \( \rho \)
• This computes the maximal bisimulation in ptime
(Note: this maximal bisimulation exists because \( \emptyset \) is a bisimulation, and if \( \rho_1 \), \( \rho_2 \) are bisimulations, \( \rho_1 \cup \rho_2 \) is also one)
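The refinement loop described above can be sketched directly. This is a minimal, unoptimized Python sketch under assumptions: a graph is a set of triples `(source, target, label)`, the helper names are hypothetical, and the root condition is checked separately at the end by looking up the pair of roots in the result.

```python
def nodes(edges):
    return {s for (s, _, _) in edges} | {t for (_, t, _) in edges}


def moves(edges, node):
    """Outgoing moves of `node` as (target, label) pairs."""
    return [(t, l) for (s, t, l) in edges if s == node]


def maximal_bisimulation(edges1, edges2):
    rel = {(x, y) for x in nodes(edges1) for y in nodes(edges2)}
    changed = True
    while changed:
        changed = False
        for (x, y) in list(rel):
            # Every move of x must be matched by a move of y ...
            forth = all(
                any(l2 == l and (t, t2) in rel for (t2, l2) in moves(edges2, y))
                for (t, l) in moves(edges1, x)
            )
            # ... and every move of y by a move of x.
            back = all(
                any(l == l2 and (t, t2) in rel for (t, l) in moves(edges1, x))
                for (t2, l2) in moves(edges2, y)
            )
            if not (forth and back):
                rel.discard((x, y))
                changed = True
    return rel


def bisimilar(edges1, root1, edges2, root2):
    return (root1, root2) in maximal_bisimulation(edges1, edges2)


# Hypothetical graphs with the same paths from the root (a and a.d).
G = {("r", "n1", "a"), ("r", "n2", "a"), ("n1", "n3", "d")}
G1 = {("r", "m1", "a"), ("m1", "m2", "d")}
G2 = {("r", "k1", "a"), ("r", "k2", "a"), ("k1", "k3", "d"), ("k2", "k4", "d")}
```

Each pass removes at least one pair or stops, so the loop runs at most |N| × |N’| times, which gives the polynomial bound.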
What does this have to do
with typing?
• Take a very complex graph E
• How do you describe it?
• By a “smaller” graph T that is bisimilar to E
• There may be several bisimulations, with more and more details
Rough bisimulation
[Figure: a type graph with four types and the nodes they group]
• Root: &r
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c
• Edges between types: employee, company, manages, managedby, worksfor
More precise one
[Figure: a type graph with five types and the nodes they group]
• Root: &r
• Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c
• Edges between types: employee, company, manages, managedby, worksfor
Other “typing”: data guide
• See the graph as an automaton with the root as the start state and all states accepting
• This automaton accepts all the paths from the root
• Obtain an equivalent, minimal, deterministic automaton
– This is the data guide for the graph
– It can be used for describing the data
– It can be used to support graphical query interfaces
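The determinization step behind a data guide can be sketched as a subset construction over the data graph. This is a sketch under assumptions: the graph is a set of triples `(source, target, label)`, the sample graph is a hypothetical fragment, and the minimization step mentioned above is not included.

```python
def data_guide(edges, root):
    """Deterministic automaton whose states are sets of graph nodes."""
    start = frozenset({root})
    guide, seen, todo = {}, {start}, [start]
    while todo:
        subset = todo.pop()
        # Group the targets of all outgoing edges by label.
        targets = {}
        for (s, t, l) in edges:
            if s in subset:
                targets.setdefault(l, set()).add(t)
        for label, reached in targets.items():
            reached = frozenset(reached)
            guide[(subset, label)] = reached
            if reached not in seen:
                seen.add(reached)
                todo.append(reached)
    return guide


# Hypothetical fragment of an employee/project graph.
EDGES = {
    ("root", "e1", "employee"),
    ("root", "e2", "employee"),
    ("e1", "p1", "workson"),
    ("e2", "p2", "workson"),
    ("e2", "p1", "leads"),
}
GUIDE = data_guide(EDGES, "root")
```

In the result every label path from the root appears exactly once, and the state it reaches is the set of data nodes found at the end of that path.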
Data guide
• Gives all the paths from the root
• Automata minimization
[Figure: the data graph (root, c1, c2, employees e1 to e4, projects p1 to p9) and its data guide, whose states are sets of data nodes such as {root}, {c1}, {c2}, {e1,e2}, {e1,e2,e3,e4} and {p1,p2,p3,p4,p5,p6,p7,p8,p9}, linked by programmer, statistician, employee, project, workson, leads and consults edges]
What you should remember
• Tree automata = theoretical foundation for XML
• Bottom-up tree automata are nice
• Top-down and determinism together lead to limitations
• XML documents do not have to be typed
• Typing may be very useful for XML
– In particular for software managing XML data
• DTD: simple but limited
• XML Schema: more expressive but still limited
• Graph data: bisimulation is the answer
Thank you
Bibliography
• TATA: the book, Tree Automata Techniques
and Applications, tata.gforge.inria.fr/
– The book on the topic and it is free
• XML schema, see http://w3.org
http://www.w3schools.com/schema/