Keys for XML
Peter Buneman, Susan Davidson, Wenfei Fan
Carmem Hara, Wang-Chiew Tan
University of Pennsylvania
Temple University
Universidade Federal do Parana, Brazil
Jonathan Mamou
Keys in DB design
Essential part of DB design
 Invariant connection between the tuple and the
real-world entity
 Important in update
– Guarantee that an update will affect precisely one
tuple
…
Keys in XML
 XML documents are to do (at least) double duty as databases
 Examination of existing DTDs reveals a number
of cases in which some element or attribute is
specified as a “unique identifier” in comments
 Various key specifications in XML Standard,
XML Data, XML Schema
Components: XML vs. relational DB
Name    course  grade
Smith   Math    B
Jones   Math    A+
Smith   CS      A-
<db>
<student>
<name> Smith </name>
<course> Math </course>
<grade> B </grade>
</student>
<student>
<name> Jones </name>
<course> Math </course>
<grade> A+ </grade>
</student>
<student>
<name> Smith </name>
<course> CS </course>
<grade> A- </grade>
</student>
</db>
Components: XML vs. relational DB
(cont’d)
DB
 If 2 tuples agree on their
name and course
attributes they agree
everywhere
XML
 If 2 elements agree on
the name and course
subelements then they
are the same element
 Node identification?
 Equality?
Nodes - Value Equality
 name key for person nodes
 name may have a complex structure: first-
name, last-name
[Figure: tree rooted at db with company, dept, government and university nodes containing employee nodes with @id attributes. A name node appears either as the text “Bill Clinton” or as a structure with firstName “Bill” and lastName “Clinton”.]
Hierarchical structure
 Hierarchically structured databases, e.g.
scientific data formats
 Top-level key to identify components of a
document
 Secondary key to identify sub-components
– Book/chapter/section
– Bible/book/chapter/verse
Absolute and relative keys
In an XML document, how to identify
 A book?
 A chapter?
 A section?
[Figure: tree rooted at db with several book nodes. Each book has a title (“XML”, “SGML”) and chapter children; each chapter has a number and section children; each section has a number and text.]
XML standard - ID attribute
<!ATTLIST book    title  ID #REQUIRED>
<!ATTLIST chapter number ID #REQUIRED>
<!ATTLIST section number ID #REQUIRED>
 Internal “pointers” rather than keys
 Scoping: an ID attribute is unique within the entire document rather than among a designated set of elements
– cannot express relative keys, e.g., for chapters/sections
 Limited to attributes rather than elements
 Unary: at most one “key” can be defined, in terms of a single attribute
 Value equality: on text (strings) only
 Defined as an attribute type: keys must come with a DTD
XML Data
 Introduces a notion of keys explicitly
<elementType id="booktable">
<element id="titleID" type="#title"/>
<element type="#author"/>
<element type="#pages"/>
<key id="bookkey">
<keyPart href="#titleID"/>
</key>
</elementType>
 BUT
– Can only be defined for element types rather than for certain
collections of elements e.g. book, articles, …
XPath
 Possible to specify interesting fragments of a
document
 Syntax similar to navigating directories in a
file system
//  arbitrary path
.   empty path
/   document root; also the path concatenator
*   any single node name
XPath example
 Select BBB elements which have any attribute
<AAA>
<BBB id = "b1"/>
<BBB id = "b2"/>
<BBB name = "bbb"/>
<BBB/>
</AAA>
 //BBB[@*]
Xpath example (cont’d)
 Select all ancestors of GGG elements
<AAA>
<BBB>
</BBB>
<XXX>
<DDD>
<FFF>
<GGG>
</GGG>
</FFF>
</DDD>
</XXX>
<CCC>
</CCC>
</AAA>
 //GGG/ancestor::*
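Both XPath examples can be reproduced with Python's standard library. The sketch below is an illustration and not part of the original slides: `xml.etree.ElementTree` implements only a subset of XPath (it has no ancestor axis), so both selections are emulated with explicit Python. The names `doc1`, `doc2`, `parent` and `ancestors` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# First example: BBB elements that carry at least one attribute (//BBB[@*]).
doc1 = ET.fromstring('<AAA><BBB id="b1"/><BBB id="b2"/><BBB name="bbb"/><BBB/></AAA>')
with_attrs = [e for e in doc1.iter("BBB") if e.attrib]

# Second example: all ancestors of GGG elements (//GGG/ancestor::*).
doc2 = ET.fromstring("<AAA><BBB/><XXX><DDD><FFF><GGG/></FFF></DDD></XXX><CCC/></AAA>")
parent = {child: p for p in doc2.iter() for child in p}   # child -> parent map


def ancestors(node):
    """Return the tags of all proper ancestors of node, nearest first."""
    tags = []
    while node in parent:
        node = parent[node]
        tags.append(node.tag)
    return tags


ggg_ancestors = {tag for g in doc2.iter("GGG") for tag in ancestors(g)}

print([e.attrib for e in with_attrs])
print(sorted(ggg_ancestors))
```

The first selection keeps the three BBB elements that have an attribute and drops the bare `<BBB/>`; the second collects FFF, DDD, XXX and AAA.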
XML-Schema
<element name="book">
<complexType>
<sequence>
<element name="title" type="string"/>
<element name="chapters" maxOccurs="unbounded">
<complexType>
...
</complexType>
</element>
</sequence>
</complexType>
<key name="k">
<selector xpath="."/>
<field xpath="title"/>
</key>
</element>
XML Schema (cont’d)
 Allows keys to be specified in terms of XPath expressions
 BUT
– XPath is a relatively complex language (it can move down, sideways and upwards, and predicates and functions can be embedded)
– Equivalence/containment of XPath expressions is unresolved
 No efficient way to tell whether two keys are equivalent
– Value equality: restricted to text
– Relative keys are not addressed
– Structural requirement: key paths must exist and be unique
A new key constraint language
for XML
 Powerful enough to express absolute and relative
keys
 Simple enough to be reasoned about efficiently
– Equivalence/containment
– consistency (satisfiability)
– implication (keys derived from others)
 Capturing the semistructured nature of XML data:
– independent of any types/schema
– no structural requirements: tolerating missing/multiple
key paths
Outline
 Node addresses – testing whether 2 nodes are the same
node
 Value equality – testing whether 2 nodes have the same
value
 Path expression language
 Absolute key
 Key Inference
 Relative key
 Strong key
 Some issues
Tree representation
 DOM (Document Object Model)
 Document is a hierarchical structure of nodes
– Element nodes
– Attribute nodes
– Text nodes
Tree representation (cont’d)
<db>
<composer>
<name> J.S. Bach </name>
<born> 1685 </born>
<work num="BWV82">
<title> Ich habe genug </title>
</work>
<work num="BWV552">
</work>
</composer>
<composer period="baroque">
<name> G.F. Handel </name>
<work num="HWV19">
<title> Art Thou Troubled? </title>
</work>
</composer>
</db>
Tree representation (cont’d)
[Figure: tree for the composer document; each edge is labelled with the child's index or, for attributes, with @name.]
db
  1: composer
    1: name
      1: “J.S. Bach”
    2: born
      1: “1685”
    3: work
      @num: “BWV82”
      1: title
        1: “Ich habe genug”
    4: work
      @num: “BWV552”
  2: composer
    @period: “baroque”
    1: name
      1: “G.F. Handel”
    2: work
      @num: “HWV19”
      1: title
        1: “Art Thou Troubled?”
Tree representation (cont’d)
 Attribute node: name+text, terminal
 Text node: text, terminal
 Element node:
– name, may have children
– Text and element children held in an array
• Index in the array determined by the order of the subelement in
the document
– Attribute children held in a dictionary
• Name of the attribute used as the index
 Edge labels uniquely identify children
Node Address
 A path of edge labels from the root uniquely identifies
a node <l1#…#ln>
– <1#2#1>, <1#3#@num>
 An attribute node can only occur at the end of a node
address
 Order of attributes is unimportant
 Order of subelements specified by their indexes
 Address of a subnode relative to a node
– Any subnode of a node with address <a> will have a node
address of the form <a#b> where <b> is the address of the
subnode relative to <a>.
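The addressing scheme can be sketched in Python on the composer document. This is an illustration under stated assumptions, not the authors' implementation: the helper names `walk` and `fmt` are invented here, whitespace-only text is ignored, text nodes are represented as plain strings and attribute nodes as `(name, text)` tuples.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
<composer><name>J.S. Bach</name><born>1685</born>
<work num="BWV82"><title>Ich habe genug</title></work>
<work num="BWV552"/></composer>
<composer period="baroque"><name>G.F. Handel</name>
<work num="HWV19"><title>Art Thou Troubled?</title></work></composer>
</db>"""


def walk(node, addr=()):
    """Yield (address, node) pairs for node and everything below it."""
    yield addr, node
    if not isinstance(node, ET.Element):
        return                                    # attribute and text nodes are terminal
    for name, text in node.attrib.items():        # attribute children: indexed by name
        yield addr + ("@" + name,), (name, text)
    index = 0
    for item in [node.text] + [x for c in node for x in (c, c.tail)]:
        if isinstance(item, ET.Element) or (item and item.strip()):
            index += 1                            # text/element children: indexed by position
            child = item.strip() if isinstance(item, str) else item
            yield from walk(child, addr + (index,))


def fmt(addr):
    """Render an address tuple in the slide notation, e.g. <1#3#@num>."""
    return "<" + "#".join(map(str, addr)) + ">"


addresses = {fmt(a): n for a, n in walk(ET.fromstring(DOC))}
print(addresses["<1#2#1>"])
print(addresses["<1#3#@num>"])
```

The two lookups correspond to the addresses <1#2#1> and <1#3#@num> used as examples on the slide.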
Value Equality
 Value of a node
1. A set S of relative addresses of its subnodes
2. A partial function from S to names
3. A partial function from S to texts
 2 nodes are value-equal if they agree on 1, 2, 3
 Notation: a =v b
Value Equality (example)
S = {ε, <1>, <2>, <1#1>, <2#1>}
[Figure: db with person nodes. Two of them have @phone “123-4567” and “234-5678” respectively, and each has a name child with firstName “George” and lastName “Bush”. S is the set of relative addresses below each of these name nodes.]
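Value equality as defined above can be sketched by computing, for each node, a dictionary from relative addresses to (name, text) pairs and comparing the dictionaries. The code is a hedged illustration: `walk`, `value` and `value_equal` are names assumed for this example, `None` stands for "undefined" in the two partial functions, and whitespace-only text is ignored.

```python
import xml.etree.ElementTree as ET


def walk(node, addr=()):
    """Yield (relative address, node); text nodes are str, attributes are (name, text)."""
    yield addr, node
    if not isinstance(node, ET.Element):
        return
    for name, text in node.attrib.items():
        yield addr + ("@" + name,), (name, text)
    index = 0
    for item in [node.text] + [x for c in node for x in (c, c.tail)]:
        if isinstance(item, ET.Element) or (item and item.strip()):
            index += 1
            child = item.strip() if isinstance(item, str) else item
            yield from walk(child, addr + (index,))


def value(node):
    """Map each relative address in S to a (name, text) pair."""
    out = {}
    for addr, n in walk(node):
        if isinstance(n, ET.Element):
            out[addr] = (n.tag, None)       # element: has a name, no text
        elif isinstance(n, tuple):
            out[addr] = n                   # attribute: name and text
        else:
            out[addr] = (None, n)           # text node: text, no name
    return out


def value_equal(a, b):
    """Two nodes are value-equal iff they agree on S and on both partial functions."""
    return value(a) == value(b)


db = ET.fromstring("""<db>
<person phone="123-4567"><name><firstName>George</firstName><lastName>Bush</lastName></name></person>
<person phone="234-5678"><name><firstName>George</firstName><lastName>Bush</lastName></name></person>
</db>""")
p1, p2 = db.findall("person")
print(value_equal(p1.find("name"), p2.find("name")))
print(value_equal(p1, p2))
```

In this example the two name nodes are value-equal, while the two person nodes are not because their @phone values differ.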
Path expressions
 How to identify nodes in a tree?
 Expression involving node names (tags +
attributes) that describes a set of paths in the
document tree
– XPath (XML-Schema)
– Regular expressions (semistructured data)
Regular Path Expressions
In the normal syntax of regular expressions:
– db.emps.emp
– db.(depts.dept.mgr | emps.emp)
– db._*.name
[Figure: tree rooted at db with depts/dept/mgr and emps/emp branches; name nodes hold “Mary”, “John” and “Bill”.]
Language for path expression
 2 necessary properties
– A concatenation operation; XPath does not present one uniformly
• Concatenating a/b with /c/d gives a/b//c/d
– A path should only move down the tree
• XPath navigation axes can also move upwards and sideways
Language for path expression
 Empty path “ε” (XPath: “.”)
 Node name (tag/attribute name)
 Wild card “_”, a single node name (XPath: “*”)
 Arbitrary path “_*” (XPath: “//”)
 Concatenation of paths P, Q is P.Q (XPath: “/”)
 Notation
– n[P]: set of nodes (node addresses) reached by starting at node n and following a path that conforms to P
– [P] := root[P]
Examples
 Simple path
– <2#2>[title] = {<2#2#1>}
– [composer.work] = {<1#3>, <1#4>, <2#2>}
 Complex path
– <2#2>[_*] = {<2#2>, <2#2#1>, <2#2#1#1>, <2#2#@num>}
– [composer._] = {<1#1>, <1#2>, <1#3>, <1#4>, <2#1>, <2#2>}
– [_*.num] = {<1#3#@num>, <1#4#@num>, <2#2#@num>}
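The path language can be sketched as a small evaluator over the composer document. This is an illustrative sketch, not the authors' code: `kids`, `name_of`, `step` and `evaluate` are names invented here, attribute nodes are matched by their bare name (so `num` reaches `@num`), and `_` is assumed to match any single child node, which makes `_*` reach the node itself and all of its descendants.

```python
import xml.etree.ElementTree as ET

DOC = """<db>
<composer><name>J.S. Bach</name><born>1685</born>
<work num="BWV82"><title>Ich habe genug</title></work>
<work num="BWV552"/></composer>
<composer period="baroque"><name>G.F. Handel</name>
<work num="HWV19"><title>Art Thou Troubled?</title></work></composer>
</db>"""


def kids(node):
    """Yield (edge label, child); text nodes are str, attributes are (name, text)."""
    if not isinstance(node, ET.Element):
        return                                     # terminal node
    for name, text in node.attrib.items():
        yield "@" + name, (name, text)
    index = 0
    for item in [node.text] + [x for c in node for x in (c, c.tail)]:
        if isinstance(item, ET.Element) or (item and item.strip()):
            index += 1
            yield index, item


def name_of(node):
    if isinstance(node, ET.Element):
        return node.tag
    return node[0] if isinstance(node, tuple) else None   # text nodes have no name


def step(nodes, label):
    """Apply one path step to a dict {address: node}."""
    out = {}
    for addr, node in nodes.items():
        if label == "_*":                          # the node itself plus all descendants
            stack = [(addr, node)]
            while stack:
                a, n = stack.pop()
                out[a] = n
                stack.extend((a + (edge,), c) for edge, c in kids(n))
        else:
            for edge, c in kids(node):
                if label == "_" or name_of(c) == label:
                    out[addr + (edge,)] = c
    return out


def evaluate(root, path, start=()):
    """Return start[path] as a set of formatted node addresses."""
    node = root
    for edge in start:                             # locate the start node by its address
        node = dict(kids(node))[edge]
    nodes = {start: node}
    for label in (path.split(".") if path else []):
        nodes = step(nodes, label)
    return {"<" + "#".join(map(str, a)) + ">" for a in nodes}


root = ET.fromstring(DOC)
print(sorted(evaluate(root, "composer.work")))
print(sorted(evaluate(root, "_*.num")))
```

Only the concatenation of names, `_` and `_*` is handled; alternation and Kleene star over sub-expressions are left out to keep the sketch short.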
Absolute key
Key specification
Necessary to specify
– Set on which we are defining the key (relation)
– “Attributes” (set of column names)
 Pair (Q, {P1, …, Pn})
– Target path Q path expression: target set on which
the key constraint is to hold
– Key path {P1, …, Pn} set of simple path expressions
Key specification (cont’d)
– Target path Q
– Key path {P1, …, Pn}
 For any node n in [Q], there is a set of nodes n[Pi] found by following Pi from n (may be empty)
 Examples
1. (person.employees, {name.firstname, name.lastname})
2. (composer, {name})
3. (composer, {born})
Formal Definition
A node n satisfies a key specification (Q,{P1,... , Pk}) iff for
any n1, n2 in n[Q],
if for all i, 1<= i <= k , there exist z1 in n1[Pi] and z2
in n2[Pi] such that z1 =v z2
then n1 = n2.
 Value equality z1 =v z2
 Node equality : 2 nodes are equal if they have the same
node address n1 = n2
 The values associated with key paths uniquely identify a
node in the target set
 Not part of the schema, data
Remarks
 For any n1, n2 in [Q], if Pi is missing at either n1 or n2
then n1[Pi] and n2[Pi] are by definition disjoint
 Multiple nodes
<db>
<A> <B> 1 </B> </A>
<A> <B> 1 </B> <B> 2 </B> </A>
</db>
Key (A, {B}) with respect to the root.
The document does not satisfy the key.
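The formal definition and the A/B example can be checked with a short Python sketch. It is a simplification under stated assumptions: target and key paths are restricted to simple element paths (written with `/` as `ElementTree.findall` expects), attributes cannot be used as key paths, and value equality is approximated by comparing a canonical form of each subtree. The names `canon`, `values` and `satisfies` are invented for this example.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def canon(e):
    """Canonical form of a subtree, standing in for value equality (mixed content ignored)."""
    return (e.tag, tuple(sorted(e.attrib.items())), (e.text or "").strip(),
            tuple(canon(c) for c in e))


def values(node, path):
    """Set of values reached from node by following a simple path."""
    return {canon(z) for z in node.findall(path)}


def satisfies(context, target, key_paths):
    """True iff no two distinct target nodes share some value on every key path."""
    for n1, n2 in combinations(context.findall(target), 2):
        if all(values(n1, p) & values(n2, p) for p in key_paths):
            return False
    return True


doc = ET.fromstring("<db><A><B>1</B></A><A><B>1</B><B>2</B></A></db>")
print(satisfies(doc, "A", ["B"]))
```

The two A nodes share the B value 1, so the key (A, {B}) is violated even though the second A has an additional B. With an empty list of key paths the check reduces to "at most one target node", which matches the empty-key reading of (employee, {}).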
Example of keys
 (_*.person, {id})
– Any 2 person elements are disjoint on their id fields
 (person, {ε})
– Any 2 person nodes immediately under the root have different
values
 (employee, {})
– Empty key. There is at most one employee under the root
 (_*, {id})
– Any 2 nodes are disjoint on their id fields up to value-equality
– Semantics of ID attribute in the XML standard
XML vs. relational
XML, paths that define keys
– Need not exist (null-valued keys)
– Do not have to be unique
– Key paths specify a set of addresses within a document
Relational DB
– Key values cannot be null, must exist
– Have to be unique
– 1NF requires each component of every tuple to be an atomic value, not a set
Remarks
 Equivalence of 2 path expressions is decidable
 Given a definition of equality on tree, do we need to have
more than one key path in a key specification?
– All key attributes must be represented as subnodes of some node
– Constrain this node to contain only those subnodes
– Too restrictive, unnecessary interference between key
specifications and data models
 Allow a (possibly empty) set of nodes at the end of each key path
– How to require each of the key paths to exist and to be unique?
Remarks (cont’d)
 Language of path expressions
– Something more powerful is needed to express a target path Q such as
(person.(mother | father)*, {id})
A person element followed by zero or more father or mother elements
 The present language of path expressions is provisional
 Extending it changes little in the theory
Key inference
 In relational DB
– Infer some keys from the presence of others
 If (Q, S) is a key and S ⊆ S’, then so is (Q, S’)
– Counterpart of the relational inference rule
 If (Q.Q’, {P}) is a key, then so is (Q, {Q’.P})
– Tree-like structure: if a node is identified in a tree, then its ancestors are also determined, i.e. if a key path P uniquely identifies a node n in [Q.Q’], then Q’.P is a key path for the ancestor of n in [Q].
Key Inference (cont’d)
 If (Q, S) is a key and Q’ ⊆ Q, then (Q’, S) is also a key
– Any key of the set [Q] is also a key for any subset of [Q]
 For any finite set Σ of keys, there exists a (finite) XML document satisfying Σ
– Key paths may be missing, e.g. (_*, {id})
• If the key path were required to exist at all nodes specified by the target path, the XML document would have to be infinite to satisfy the key
– Only holds in the absence of DTDs
Key Inference
 Key K = (X, {})
 DTD D: <!ELEMENT foo (X, X)>
[Figure: foo nodes with one and with two X children.]
 No XML document both conforms to D and satisfies K
 DTDs interact with XML key constraints
Relative Key
Relative key - Motivation
 Motivated by scientific data format, hierarchical structure,
large set of entries at the top-level
 Protein sequence database Swiss-prot
– Accession number (key) for each entry
– Within each entry, sequence of citations each identified by a
number 1, 2, 3, …
 Linguistic database – recording of speech
– Data sets held in files
– Metadata provided by directory structure
– /timit/train/dr1/fcjf0/sa1.wav
– TIMIT corpus, training set, dialect region 1, female speaker, speaker-ID "cjf0", sentence text "sa1", speech waveform file
An absolute key for books
An absolute key to identify a book: (book, {title} )
 target path: book, starting from the root and identifying a
collection of books
 key path: title; its value uniquely identifies a book
absolute: defined on the entire document
[Figure: the db/book/chapter/section tree from the “Absolute and relative keys” slide.]
Relative key - definition
 Like the key of a weak entity set in DB
Studios(name, address)
Crews(number)
A document satisfies a relative key
specification (Q, (Q’,S)) iff for all nodes n
in [Q], n satisfies the key (Q’,S).
 Absolute keys are a special case of relative keys
– (Q’,S) equivalent to (ε, (Q’,S))
A relative key for chapters
A relative key: (book, (chapter, {number} ) )
A chapter number uniquely identifies a chapter within a book!
 Context path: book
 target path: chapter, starting at a book
 key path: number
relative: defined on sub-documents, relative to the context
[Figure: the db/book/chapter/section tree from the “Absolute and relative keys” slide.]
Absolute/Relative Key
 What is the difference between
– Absolute key (book.chapter, {number})
– Relative key (book, (chapter, {number} ) )
[Figure: the db/book/chapter/section tree from the “Absolute and relative keys” slide.]
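The difference between the two keys can be made concrete with a small Python sketch. It is an illustration under assumptions: the document is a hypothetical two-book example in the spirit of the figure, paths are simple element paths in `ElementTree.findall` syntax, and the helper names `canon`, `satisfies` and `satisfies_relative` are invented here.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def canon(e):
    """Canonical form of a subtree, standing in for value equality."""
    return (e.tag, tuple(sorted(e.attrib.items())), (e.text or "").strip(),
            tuple(canon(c) for c in e))


def satisfies(context, target, key_paths):
    """Key (target, key_paths) checked at one context node."""
    for n1, n2 in combinations(context.findall(target), 2):
        if all({canon(z) for z in n1.findall(p)} & {canon(z) for z in n2.findall(p)}
               for p in key_paths):
            return False
    return True


def satisfies_relative(root, context, target, key_paths):
    """Relative key (context, (target, key_paths)): the key must hold at every context node."""
    return all(satisfies(n, target, key_paths) for n in root.findall(context))


db = ET.fromstring("""<db>
<book><title>XML</title>
  <chapter><number>1</number></chapter><chapter><number>10</number></chapter></book>
<book><title>SGML</title>
  <chapter><number>1</number></chapter><chapter><number>10</number></chapter></book>
</db>""")

print(satisfies(db, "book/chapter", ["number"]))            # absolute key
print(satisfies_relative(db, "book", "chapter", ["number"]))  # relative key
```

The absolute key (book.chapter, {number}) requires chapter numbers to be unique across the whole document, which fails here because both books have a chapter 1. The relative key (book, (chapter, {number})) only requires uniqueness within each book, which holds.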
A relative key for sections
Key: (book.chapter, (section, {number} ) )
A section number uniquely identifies a section within a particular chapter
of a particular book!
relative to the chapter containing the section, and to the book containing
the chapter
[Figure: the db/book/chapter/section tree from the “Absolute and relative keys” slide.]
Transitivity of relative keys
 A relative key such as
(bible.book.chapter, (verse, {number}))
does not uniquely identify a particular verse in the bible
 Book name, chapter number, verse number → verse
“immediately precedes” relation
(Q1, (Q’1,S1)) immediately precedes
(Q2, (Q’2,S2)) if Q2 = Q1.Q’1
– (bible, (book,{name}))
immediately precedes
(bible.book, (chapter,{number}))
– Any absolute key immediately precedes itself
“precede” relation
Precede is the transitive closure of the
immediately precedes relation
– Qn = Q1.Q’1…Q’n-1
(bible, (book, {name})),
(bible.book,(chapter, {number})),
(bible.book.chapter,(verse, {number}))
Transitivity of relative keys
 A set Σ of relative keys is transitive if for any
relative key K1 = (Q1,(Q’1,S1)) in Σ there is a
key K2 = (ε,(Q’2,S2)) in Σ which precedes K1
 Any transitive set of relative keys must contain some absolute key
Transitivity of relative keys - example
TRANSITIVE SET
(ε,(bible.book, {name}))
(bible.book,(chapter, {number}))
(bible.book.chapter,(verse, {number}))
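The "immediately precedes", "precedes" and transitivity definitions can be sketched as follows. The representation is an assumption made for this example: a relative key (Q, (Q’, S)) is a triple of tuples of node names, with the empty tuple standing for ε, and the closure is computed only through keys that belong to the given set. The function names are invented here.

```python
def immediately_precedes(k1, k2):
    """(Q1,(Q'1,S1)) immediately precedes (Q2,(Q'2,S2)) if Q2 = Q1.Q'1."""
    (q1, t1, _), (q2, _, _) = k1, k2
    if k1 == k2 and q1 == ():
        return True                       # stipulated: an absolute key precedes itself
    return q2 == q1 + t1


def precedes(k1, k2, keys):
    """Transitive closure of immediately_precedes, following keys in the set."""
    frontier, seen = [k1], set()
    while frontier:
        k = frontier.pop()
        for nxt in keys:
            if immediately_precedes(k, nxt):
                if nxt == k2:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return False


def is_transitive(keys):
    """Every key must be preceded by some absolute key (context ε) of the set."""
    absolute = [k for k in keys if k[0] == ()]
    return all(any(precedes(a, k, keys) for a in absolute) for k in keys)


bible = [
    ((), ("bible", "book"), ("name",)),
    (("bible", "book"), ("chapter",), ("number",)),
    (("bible", "book", "chapter"), ("verse",), ("number",)),
]
print(is_transitive(bible))
print(is_transitive(bible[2:]))
```

The full set is transitive; the verse key on its own is not, because no absolute key precedes it.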
Insertion-friendly relative keys
 Transitive key specification
(ε, (university, {name}))
(university, (dept.employee, {emp-id}))
 Identify an employee: university name + emp-id
 Add an employee: specify a dept for the employee
 No way to identify a dept
– Many ways to add an employee!!!
Insertion-friendly relative keys (cont’d)
 Insert an element in the “keyed” part of the document
unambiguously by specifying where to insert the element
using keys.
 A set Σ of relative keys is insertion-friendly if it is transitive and, whenever (Q1,(Q’1.n,S1)) ∈ Σ, there is a relative key (Q2,(Q’2,S2)) ∈ Σ where |Q’2| > 0 and Q1.Q’1 = Q2.Q’2.
– n is a node name
 Every element with a prefix along the path Q1.Q’1 can be identified through some keys
Insertion-friendly relative keys (cont’d)
(ε, (university, {name}))
(university, (dept, {dept-name}))
(university, (dept.employee, {emp-id}))
n = employee
Insertion-friendly relative keys (cont’d)
(ε, (university, {name}))
(university, (dept, {dept-name}))
(university, (dept.employee, {emp-id}))
 Nothing about the dept is necessary to identify employees!
 The same anomaly occurs in relational databases that are not in second normal form
 Employees should not be children of department nodes, but only of university nodes
 Linkage between employees and departments should be expressed through a foreign key
Notation for relative key
 If a system of relative keys is transitive, it forms a hierarchical structure → create a compressed syntax for such systems
 Basic syntactic form
Q1{P1,...,Pk1}.Q2{P1,...,Pk2}. ... .Qn{P1,...,Pkn}
Notation for relative key (cont’d)
 bible{}.book{name}.chapter{number}.verse{number}
(ε, (bible, {}))
(bible, (book, {name}))
(bible.book, (chapter, {number}))
(bible.book.chapter, (verse, {number}))
 company{name}[.employee{id}, .department{name}]
company{name}.employee{id}
company{name}.department{name}
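Expanding the basic compressed form into its list of relative keys is mechanical, as the following sketch shows. It is an assumption-laden illustration: `expand` is a name invented here, each key is returned as a (context, target, key paths) triple with "ε" for the empty context, and the bracketed branching form `[... , ...]` is not handled.

```python
import re


def expand(compact):
    """Turn Q1{...}.Q2{...}. ... into [(context, target, [key paths]), ...]."""
    keys, context = [], []
    for target, paths in re.findall(r"([^{}]+)\{([^{}]*)\}", compact):
        target = target.strip(". ")                       # drop the separating dot
        key_paths = [p.strip() for p in paths.split(",") if p.strip()]
        keys.append((".".join(context) or "ε", target, key_paths))
        context.append(target)                            # next key is relative to this one
    return keys


for key in expand("bible{}.book{name}.chapter{number}.verse{number}"):
    print(key)
```

Each segment becomes a key whose context path is the concatenation of all earlier target paths, which is exactly what makes the resulting set transitive.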
Notation for relative key
 Compact and understandable
 Ensures the internal consistency of the document
 Tells others how to cite a component of our document
 Our document has a structured “core”
Strong keys
Stronger definitions of keys
 Requirements imposed by a key in relational DB:
– Uniqueness of a key
– Existence of a key
 Key paths exist and are unique (for 1 ≤ i ≤ n, n[Pi] contains exactly one node)
– name is unique at <1>
– work and num are not unique at this node
Stronger definitions of keys (cont’d)
A node n satisfies a strong key specification (Q, {P1, …, Pk}) if
– For all n’ in n[Q] and for all Pi, Pi exists and is unique at n’.
– For any n1, n2 in n[Q], if for all i, n1[Pi] =v n2[Pi], then n1 = n2
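The contrast between the earlier (weak) definition and the strong one can be sketched on the composer document. The code is illustrative only: `canon`, `weak` and `strong` are names assumed for this example, paths are simple element paths in `ElementTree.findall` syntax, and value equality is approximated by a canonical form of each subtree.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

DOC = """<db>
<composer><name>J.S. Bach</name><born>1685</born>
<work num="BWV82"><title>Ich habe genug</title></work>
<work num="BWV552"/></composer>
<composer period="baroque"><name>G.F. Handel</name>
<work num="HWV19"><title>Art Thou Troubled?</title></work></composer>
</db>"""


def canon(e):
    """Canonical form of a subtree, standing in for value equality."""
    return (e.tag, tuple(sorted(e.attrib.items())), (e.text or "").strip(),
            tuple(canon(c) for c in e))


def weak(context, target, key_paths):
    """Weak key: key paths may be missing or reach several nodes."""
    return not any(
        all({canon(z) for z in a.findall(p)} & {canon(z) for z in b.findall(p)}
            for p in key_paths)
        for a, b in combinations(context.findall(target), 2))


def strong(context, target, key_paths):
    """Strong key: every key path reaches exactly one node, and the values differ."""
    nodes = context.findall(target)
    if any(len(n.findall(p)) != 1 for n in nodes for p in key_paths):
        return False                                   # existence or uniqueness fails
    keys = [tuple(canon(n.find(p)) for p in key_paths) for n in nodes]
    return len(set(keys)) == len(keys)


db = ET.fromstring(DOC)
for path in ("name", "born", "work"):
    print(path, weak(db, "composer", [path]), strong(db, "composer", [path]))
```

All three are weak keys for composers here, but only (composer, {name}) is a strong key: born is missing for the second composer and work is not unique at the first.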
Stronger definitions of keys (cont’d)
 (_*.person, {id})
– Any 2 person elements each have a unique id and differ on it
 (person, {ε})
– Unchanged
 (employee, {})
– Unchanged
Stronger definitions of keys (cont’d)
 (_*, {k})
– Every element has a key k, including element whose
name is k
 Finite satisfiability?
 Impose an infinite chain of k nodes
– No finite document satisfies it
 Because of the requirement of existence of key
paths
– Structural constraint
Relative Strong Key
A document satisfies a strong relative key
specification (Q, (Q’,S)) iff for all nodes n in
[Q], n satisfies the strong key (Q’,S)
“Unconstrained” XML :
Node names as key values
Node names as key values
 Key specification must cover the practical cases
without using definitions that are too complex to
allow any kind of reasoning about keys
 Issue in “unconstrained” XML: interchanging
structure (the names) with data (their values)
“unconstrained” XML
<db>
<parts>
<widget>
<id> 123 </id>
<w> 1.5 </w>
</widget>
<widget>
<id> 234 </id>
<w> 2.5 </w>
</widget>
<gadget>
<id> 123 </id>
<w> 3.2 </w>
</gadget>
</parts>
</db>
<db>
<parts>
<part>
<type> widget </type>
<id> 123 </id>
<w> 1.5 </w>
</part>
<part>
<type> widget </type>
<id> 234 </id>
<w> 2.5 </w>
</part>
<part>
<type> gadget </type>
<id> 123 </id>
<w> 3.2 </w>
</part>
</parts>
</db>
Node names as key values (cont’d)
 “Unconstrained” XML
– Type of a part is expressed in the tag
– Key constraint: parts{}[.widget{id},.gadget{id}]
 Alternative XML representation
– type expressed as an attribute or subelement of a
part element
– Key constraint: parts{}[.part{type,id}]
Introducing a new part type
 Introduce a thingy
 “unconstrained”
– Change key specification
– parts{}[.widget{id},.gadget{id},.thingy{id}]
 Alternative
– No change parts{}[.part{type,id}]
 Ability to interchange structure and data is
supposed to be one of the strong points of
semistructured data and XML
Solution
 Adding a “virtual” subelement node-name to
each named node, whose value consists of
the node name
 Key: parts{}._{node-name, id}
 Does not alter any of the properties
expected to hold for keys
 Account for any practical use of tag names
in keys
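The effect of the virtual node-name subelement can be sketched on the "unconstrained" parts document. This is a simplified illustration, not the authors' mechanism: instead of materialising a node-name child, the tag is paired with the id when the key value is computed, each part is assumed to have exactly one id, and the names `key_values` and `is_key` are invented here.

```python
import xml.etree.ElementTree as ET

parts = ET.fromstring("""<parts>
<widget><id>123</id><w>1.5</w></widget>
<widget><id>234</id><w>2.5</w></widget>
<gadget><id>123</id><w>3.2</w></gadget>
</parts>""")


def key_values(parent, use_node_name):
    """Key value of each child: its id text, optionally paired with its node name."""
    return [((child.tag,) if use_node_name else ()) + (child.findtext("id").strip(),)
            for child in parent]


def is_key(values):
    """The values form a key iff no two children share the same value."""
    return len(set(values)) == len(values)


print(is_key(key_values(parts, use_node_name=True)))    # parts{}._{node-name, id}
print(is_key(key_values(parts, use_node_name=False)))   # parts{}._{id}
```

With node-name included, the widget and the gadget that share id 123 are told apart, and adding a new part type such as thingy needs no change to the key.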
Conclusion
 A new key constraint language for XML:
– independent of any schema specifications for XML
– powerful enough to express absolute and relative keys
– simple enough to be reasoned about efficiently
 In contrast to their relational counterparts:
– XML keys are more complex
– the analyses of XML keys are far more intricate
References
 Peter Buneman, Susan Davidson, Wenfei Fan,
Carmem Hara, and Wang-Chiew Tan. Keys for XML.
WWW10 (2001)
http://db.cis.upenn.edu/DL/xmlkeys.ps
 Peter Buneman, Susan Davidson, Wenfei Fan,
Carmem Hara, and Wang-Chiew Tan. Reasoning
about keys for XML. University of Pennsylvania.
Technical Report MS-CIS-00-26, 2000
http://db.cis.upenn.edu/DL/absolute-full.ps