3.2 Document Object Model

3.2 Document Object Model (DOM)


How to access structured documents uniformly
in parsers, browsers, editors, databases, ...?
Overview of the W3C DOM Spec
» Level 1, W3C Rec, Oct. 1998
» Level 2, W3C Rec, Nov. 2000
» Level 3 Validation, Core, and Load and Save:
W3C Recs (Spring 2004)
W3C DOM Activity has been closed
SDPL 2011
3.2: Document Object Model
1
DOM: What is it?

An object-based, language-neutral API for
XML and HTML documents
– Allows programs and scripts to build, access, and modify
documents
– Supports designing of
querying, filtering,
transformation, formatting etc.
applications on top of DOM implementations

Just as SAX could be read as “Serial Access XML”,
DOM could be read as “Directly Obtainable in Memory”
DOM structure model

Based on O-O concepts:
– objects (encapsulation of data and methods)
– methods (to access or change object’s state)
– interfaces (declaration of a set of methods)

Somewhat similar to the XPath data model (to be
discussed with XSLT and XQuery), i.e. roughly a syntax tree
– Tree structure implied by abstract relationships defined
by the API; Data structures of an implementation may
differ
DOM structure model
<invoice form="00" type="estimated">
  <addressdata>
    <name>John Doe</name>
    <address>
      <streetaddress>Pyynpolku 1</streetaddress>
      <postoffice>70460 KUOPIO</postoffice>
    </address>
  </addressdata>
  ...
[Figure: DOM tree of the document above. The Document node has the Element invoice as its child; the attributes form="00" and type="estimated" are attached to invoice through a NamedNodeMap. Below invoice is addressdata, with children name (Text "John Doe") and address; the children of address, streetaddress and postoffice, hold the Text nodes "Pyynpolku 1" and "70460 KUOPIO". Legend: Document, Element, Text, NamedNodeMap.]
Structure of DOM Level 1
I: DOM Core Interfaces
– Fundamental interfaces
» basic interfaces: Document, Element, Attr, Text, ...
– "Extended" (XML specific) interfaces
» CDATASection, DocumentType, Notation, Entity,
EntityReference, ProcessingInstruction
II: DOM HTML Interfaces
– more convenient access to HTML documents
– we'll ignore these
DOM Level 2
– Level 1: basic representation and manipulation of
document structure and content
(No access to the contents of a DTD)

DOM Level 2 adds
– support for namespaces
– Document.getElementById("id_val"),
to access elements by ID attr values
– optional features (we’ll skip these)
» interfaces to document views and style sheets
» an event model (for user actions on elements)
» methods for traversing the document tree and manipulating
regions of document (e.g., selected in an editor)
DOM Language Bindings

Language-independence:
– DOM interfaces are defined using the OMG Interface
Definition Language (IDL, defined in the CORBA
Specification)

Language bindings (implementations of
interfaces) defined in the Recommendation for
– Java (See the Java API doc) and
– ECMAScript (standardised JavaScript)
Core Interfaces: Node & its variants
Node
  – Document
  – DocumentFragment
  – Element
  – Attr
  – CharacterData
      » Comment
      » Text
          » CDATASection (*)
  – DocumentType (*)
  – Notation (*)
  – EntityReference (*)
  – Entity (*)
  – ProcessingInstruction (*)
(*) “Extended interfaces”
DOM interfaces: Node
Node
  getNodeType, getNodeName, getNodeValue
  getOwnerDocument
  getParentNode
  hasChildNodes, getChildNodes
  getFirstChild, getLastChild
  getPreviousSibling, getNextSibling
  hasAttributes, getAttributes
  appendChild(newChild)
  insertBefore(newChild,refChild)
  replaceChild(newChild,oldChild)
  removeChild(oldChild)
[Figure: DOM tree of the invoice example, here with form="00" and type="estimatedbill": invoice, addressdata, name (Text "John Doe"), address, streetaddress (Text "Pyynpolku 1"), postoffice (Text "70460 KUOPIO"). Legend: Document, Element, Text, NamedNodeMap.]
Type and Name of a Node

node.getNodeType():
short int constants 1, 2, …, 12 for
Node.ELEMENT_NODE,
Node.ATTRIBUTE_NODE,
Node.TEXT_NODE, …

node.getNodeName()
– for an Element = element.getTagName()
– for an Attr: the name of the attribute
– for anonymous nodes:
"#text", "#document", "#comment" etc
The Value of a Node

node.getNodeValue()
– content of a text node,
value of attribute, …;
null for an Element (Notice !)
– (C.f. XPath, where node’s value is its full textual
content)
– DOM 3 provides full text content with method
node.getTextContent()
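
The accessors above can be tried out on a tiny tree. The following is a minimal sketch, assuming the JAXP DocumentBuilderFactory bundled with the JDK is used to obtain an empty Document; the class NodeInfoDemo and its helper methods are made up for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeInfoDemo {
    /** Builds <name>John Doe</name> in a fresh, empty Document (hypothetical helper). */
    static Element buildName() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode("John Doe"));
            doc.appendChild(name);
            return name;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Formats the type code, name and value of a node on one line. */
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    public static void main(String[] args) {
        Element name = buildName();
        System.out.println(describe(name));                    // 1 name null
        System.out.println(describe(name.getFirstChild()));    // 3 #text John Doe
        System.out.println(describe(name.getOwnerDocument())); // 9 #document null
        System.out.println(name.getTextContent());             // John Doe
    }
}
```

Note how the Element reports null as its value, while its Text child and getTextContent() give the character data.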
Object Creation in DOM



Each DOM Node n belongs to a Document:
n.getOwnerDocument()
Objects that implement interface X are
created by factory methods
Document.createX(…)
E.g: when doc is a Document object
doc.createElement("A"),
doc.createAttribute("href"),
doc.createTextNode("Hello!")
Loading & saving specified in DOM3 (or
implementation-specific, or via JAXP)
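
As a rough sketch of the factory methods listed above: the example below creates an element, an attribute and a text node, and shows that a freshly created node already has an owner document but no parent. The class CreateDemo, its helpers and the URL are placeholders chosen for this illustration; the empty Document is assumed to come from JAXP.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CreateDemo {
    /** Obtains an empty Document via JAXP (hypothetical helper). */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Creates <A href="...">Hello!</A> and makes it the document element. */
    static Element buildLink(Document doc) {
        Element a = doc.createElement("A");
        Attr href = doc.createAttribute("href");
        href.setValue("http://example.org/");   // placeholder URL
        a.setAttributeNode(href);
        a.appendChild(doc.createTextNode("Hello!"));
        // Up to here "a" is owned by doc but not yet part of its tree.
        doc.appendChild(a);
        return a;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element orphan = doc.createElement("B");
        System.out.println(orphan.getOwnerDocument() == doc); // true
        System.out.println(orphan.getParentNode() == null);   // true
        Element a = buildLink(doc);
        System.out.println(a.getAttribute("href"));           // http://example.org/
        System.out.println(doc.getDocumentElement() == a);    // true
    }
}
```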
DOM interfaces: Document
Node
  Document
    getDocumentElement
    getElementById(IdVal)
    getElementsByTagName(tagName)
    createElement(tagName)
    createTextNode(data)
    ...
[Figure: DOM tree of the invoice example (form="00", type="estimated"): invoice, addressdata, name (Text "John Doe"), address, streetaddress (Text "Pyynpolku 1"), postoffice (Text "70460 KUOPIO"). Legend: Document, Element, Text, NamedNodeMap.]
DOM interfaces: Element
Node
  Element
    getTagName()
    hasAttribute(name)
    getAttribute(name)
    setAttribute(attrName, value)
    removeAttribute(name)
    getElementsByTagName(name)
[Figure: DOM tree of the invoice example (form="00", type="estimatedbill") with nodes labelled invoice, invoicepage, addressee, addressdata, name (Text "John Doe"), address, streetaddress (Text "Pyynpolku 1"), postoffice (Text "70460 KUOPIO"). Legend: Document, Element, Text, NamedNodeMap.]
Text Content Manipulation in DOM

For objects c that implement the
CharacterData interface
(Text, Comment, CDATASection):
– c.substringData(offset, count)
– c.appendData(string)
– c.insertData(offset, string)
– c.deleteData(offset, count)
– c.replaceData(offset, count, string)
  ( = c.deleteData(offset, count);
      c.insertData(offset, string) )
DOM CharacterData

DOM strings are 0-based sequences of
16-bit characters:
C:       Hello world, nice to see you!
tens:    0         1         2
units:   01234567890123456789012345678
(the last index, 28, is C.getLength()-1)
C.substringData(6, 5) = ?
C.substringData(0, C.getLength()) = ?
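
One way to check the two questions above, and to try the editing methods of CharacterData, is a small program like the following sketch. The class CharDataDemo and its helpers are hypothetical; the Text node is assumed to come from an empty Document created via JAXP.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Text;

class CharDataDemo {
    static final String GREETING = "Hello world, nice to see you!";

    /** Wraps a string in a DOM Text node (hypothetical helper). */
    static Text newText(String data) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument().createTextNode(data);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Applies a few CharacterData edits and returns the resulting data. */
    static String edited() {
        Text c = newText(GREETING);
        c.replaceData(6, 5, "there");   // "world" -> "there"
        c.insertData(0, "> ");          // add a two-character prefix
        c.appendData(" Bye.");
        c.deleteData(0, 2);             // drop the prefix again
        return c.getData();
    }

    public static void main(String[] args) {
        Text c = newText(GREETING);
        System.out.println(c.getLength());                      // 29
        System.out.println(c.substringData(6, 5));              // world
        System.out.println(c.substringData(0, c.getLength()));  // the whole string
        System.out.println(edited());  // Hello there, nice to see you! Bye.
    }
}
```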
Interfaces to node collections (1)

NodeList for ordered lists of nodes
<- Node.getChildNodes() and
Element/Document
.getElementsByTagName("name")
» (proper) descendant elements of type "name" in
document order ("*" ~ any element type)
[Figure: example tree with nodes numbered 1 to 6 and element types E and A, illustrating which nodes E.getElementsByTagName("E") returns]
Typical child-node access pattern

Accessing specific nodes, or iterating over a
NodeList:
– to process all children of node:
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++)
    process(children.item(i));
Interfaces to node collections (2)

NamedNodeMap for unordered sets of nodes
accessed by their name:
<- Node.getAttributes(),
DocumentType.getEntities()

DocumentFragment
– Temporary container of child nodes
– Disappears when inserted in tree

NodeLists and NamedNodeMaps are "live":
– reflect updates of the doc tree immediately
– See next
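
The DocumentFragment behaviour noted above can be sketched as follows: when a fragment is appended, its children move into the tree and the fragment itself is left empty. The class FragmentDemo, the element names and the IDs are assumptions made for this illustration only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

class FragmentDemo {
    /**
     * Returns the child counts {fragment before insertion,
     * fragment after insertion, target element after insertion}.
     */
    static int[] counts() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element list = doc.createElement("reglist");
        doc.appendChild(list);

        // Collect three new elements in a temporary container.
        DocumentFragment frag = doc.createDocumentFragment();
        for (String id : new String[] {"RDK1", "RDK2", "RDK3"}) {
            Element student = doc.createElement("student");
            student.setAttribute("id", id);
            frag.appendChild(student);
        }
        int before = frag.getChildNodes().getLength();

        // Appending the fragment moves its children, not the fragment itself.
        list.appendChild(frag);
        return new int[] {before,
                          frag.getChildNodes().getLength(),
                          list.getChildNodes().getLength()};
    }

    public static void main(String[] args) {
        int[] c = counts();
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // 3 0 3
    }
}
```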
NodeLists are “live”

E.g., this would delete every other child of n:
NodeList cList = n.getChildNodes();
for (int i=0; i<cList.getLength(); i++)
    n.removeChild(cList.item(i));
– What happens?
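[Figure: node n with children A, B, C, D and the live list cList, showing the positions reached at i=0, i=1 and i=2 as children are removed]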
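
The effect can be reproduced with a short program. The sketch below runs the loop from this slide on a node with children A, B, C and D, and contrasts it with one common way to remove all children. The class LiveListDemo and its helper methods are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class LiveListDemo {
    /** Builds an element n with the children A, B, C, D. */
    static Element sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element n = doc.createElement("n");
            for (String name : new String[] {"A", "B", "C", "D"})
                n.appendChild(doc.createElement(name));
            return n;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Concatenates the names of the remaining children. */
    static String childNames(Node n) {
        StringBuilder sb = new StringBuilder();
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling())
            sb.append(c.getNodeName());
        return sb.toString();
    }

    /** The loop from the slide: the list shrinks while i grows. */
    static String naiveRemoval() {
        Element n = sample();
        NodeList cList = n.getChildNodes();
        for (int i = 0; i < cList.getLength(); i++)
            n.removeChild(cList.item(i));
        return childNames(n);
    }

    /** Removes the first child until none are left. */
    static String fullRemoval() {
        Element n = sample();
        while (n.hasChildNodes())
            n.removeChild(n.getFirstChild());
        return childNames(n);
    }

    public static void main(String[] args) {
        System.out.println("[" + naiveRemoval() + "]"); // [BD]
        System.out.println("[" + fullRemoval() + "]");  // []
    }
}
```

In the naive loop, removing A at i=0 shifts B to index 0, so i=1 removes C, and the loop stops at i=2 because only two children remain.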
DOM: XML Implementations

Java-based parsers
e.g. Apache Xerces, Apache Crimson, …
In MS IE browser: COM programming interfaces for
C/C++ and Visual Basic; ActiveX object
programming interfaces for script languages

Perl: XML::DOM (Implements DOM Level 1)

Others, say, database APIs?

– Vendors of different kinds of systems participated in the
W3C DOM WG
A Java-DOM Example

Command-line tool RegListMgr for
maintaining a course registration list
– with single-letter commands for listing, adding,
updating and deleting student records

Example:
$ java RegListMgr reglist.xml
Document loaded succesfully
> l          (list the contents)
…
40: Tero Ulvinen, TKM1, [email protected], 2
41: heli viinikainen, tkt5, [email protected], 1
Registration list: the XML file
<?xml version="1.0" ?>
<!DOCTYPE reglist SYSTEM "reglist.dtd">
<reglist lastID="41">
  <student id="RDK1">
    <name><given>Juho</given>
      <family>Ahopelto</family></name>
    <branchAndYear>TKT4</branchAndYear>
    <email>[email protected]</email>
    <group>2</group>
  </student>
  <!-- … and the other students … -->
</reglist>
Registration List: the DTD
<!ELEMENT reglist (student*)>
<!ATTLIST reglist
    lastID CDATA #REQUIRED >
<!ELEMENT student
    (name, branchAndYear, email, group)>
<!ATTLIST student
    id ID #REQUIRED >
<!ELEMENT name (given, family)>
<!ELEMENT given (#PCDATA)>
<!-- … and the same for family,
     branchAndYear, email, and group -->
Loading and Saving the RegList

Loading of the registration list into DOM
Document doc implemented with a JAXP
DocumentBuilder
– (to be discussed later)
– doc is a handle to the Document

Saving implemented with a
JAXP Transformer
– to be discussed later
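
Until JAXP is covered, the following sketch gives a rough idea of what such loading and saving could look like. It is not the RegListMgr code: the class LoadSaveDemo and its methods are hypothetical, and it works on strings instead of the reglist.xml file (without the DOCTYPE) so that the example is self-contained. With a file, builder.parse(new File(...)) and new StreamResult(new File(...)) could be used instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class LoadSaveDemo {
    /** Parses XML text into a DOM Document with a JAXP DocumentBuilder. */
    static Document load(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a Document with an identity JAXP Transformer. */
    static String save(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Loads, increments lastID, saves, reloads; returns the reloaded lastID. */
    static String bumpLastId(String xml) {
        try {
            Document doc = load(xml);
            Element root = doc.getDocumentElement();
            int last = Integer.parseInt(root.getAttribute("lastID"));
            root.setAttribute("lastID", Integer.toString(last + 1));
            return load(save(doc)).getDocumentElement().getAttribute("lastID");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bumpLastId("<reglist lastID=\"41\"/>")); // 42
    }
}
```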
Listing student records (1)
NodeList students =
    doc.getElementsByTagName("student");
for (int i=0; i<students.getLength(); i++)
    showStudent((Element) students.item(i));

private void showStudent(Element student) {
    // Collect relevant sub-elements:
    Node given =
        student.getElementsByTagName("given").item(0);
    Node family = given.getNextSibling();
    Node bAndY = student.
        getElementsByTagName("branchAndYear").item(0);
    Node email = bAndY.getNextSibling();
    Node group = email.getNextSibling();
Listing student records (2)
    // Method showStudent continues:
    System.out.print(
        student.getAttribute("id").substring(3));
    System.out.print(": " +
        given.getFirstChild().getNodeValue() );
    // or given.getTextContent() with DOM3
    // .. similarly access and display the
    // value of family, bAndY, email, and group
    // …
} // showStudent
Lessons of accessing DOM

Access methods for relevant nodes
– getElementsByTagName("tagName")
» robust wrt structure modifications
– Also others, if structure known (validated)
» getFirstChild(), getLastChild(),
getPreviousSibling(),
getNextSibling()

Element nodes have no value!
– Get the value from child Text nodes,
or use getTextContent()
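
Sibling-based access depends on the exact child list. If the parser keeps whitespace between elements (for example when it is not configured to drop ignorable whitespace), getNextSibling() may return a Text node instead of the next element. The sketch below shows this and a small helper that skips non-element siblings; the class SiblingDemo and the method nextElement are hypothetical names, not part of DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SiblingDemo {
    /** Returns the next sibling that is an Element, or null if there is none. */
    static Element nextElement(Node n) {
        for (Node s = n.getNextSibling(); s != null; s = s.getNextSibling())
            if (s.getNodeType() == Node.ELEMENT_NODE)
                return (Element) s;
        return null;
    }

    /** Parses a name element with a line break between given and family. */
    static Element parseName() {
        String xml = "<name><given>Juho</given>\n"
                   + "  <family>Ahopelto</family></name>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node given = parseName().getFirstChild();
        System.out.println(given.getNextSibling().getNodeName()); // #text
        System.out.println(nextElement(given).getNodeName());     // family
    }
}
```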
Adding New Records

Example:
> a          (add students)
First name (or <return> to finish): Antti
Last name: Ahkera
Branch&year: tkt3
email: [email protected]
group: 2
First name (or <return> to finish):
Finished adding records
> l
…
41: heli viinikainen, tkt5, [email protected], 1
42: Antti Ahkera, tkt3, [email protected], 2
Implementing addition of records (1)
Element rootElem = doc.getDocumentElement();
String lastID = rootElem.getAttribute("lastID");
int lastIDnum = Integer.parseInt(lastID);
System.out.print(
    "First name (or <return> to finish): ");
String firstName =
    terminalReader.readLine().trim();
while (firstName.length() > 0) {
    // Get the next unused ID:
    String ID = "RDK" + (++lastIDnum);
    // … Read values lastName, bAndY, email,
    // and group from the terminal, and then ...
Implementing addition of records (2)
    Element newStudent =
        newStudent(doc, ID, firstName, lastName,
                   bAndY, email, group);
    rootElem.appendChild(newStudent);
    System.out.print(
        "First name (or <return> to finish): ");
    firstName = terminalReader.readLine().trim();
} // while firstName.length() > 0
// Update the last ID used:
String newLastID =
    java.lang.Integer.toString(lastIDnum);
rootElem.setAttribute("lastID", newLastID);
System.out.println("Finished adding records");
Creating new student records (1)
private Element
newStudent(Document doc, String ID,
           String fName, String lName, String bAndY,
           String email, String grp) {
    Element stu = doc.createElement("student");
    stu.setAttribute("id", ID);
    Element newName = doc.createElement("name");
    Element newGiven = doc.createElement("given");
    newGiven.appendChild(doc.createTextNode(fName));
    Element newFamily = doc.createElement("family");
    newFamily.appendChild(doc.createTextNode(lName));
    newName.appendChild(newGiven);
    newName.appendChild(newFamily);
    stu.appendChild(newName);
Creating new student records (2)
    // method newStudent(…) continues:
    Element newBr =
        doc.createElement("branchAndYear");
    newBr.appendChild(doc.createTextNode(bAndY));
    stu.appendChild(newBr);
    Element newEmail = doc.createElement("email");
    newEmail.appendChild(doc.createTextNode(email));
    stu.appendChild(newEmail);
    Element newGrp = doc.createElement("group");
    // the parameter is named grp, not group
    newGrp.appendChild(doc.createTextNode(grp));
    stu.appendChild(newGrp);
    return stu;
} // newStudent
Lessons of modifying DOM

Each node must be created with
– Document.create...("nameOrValue")
– Attributes of an element more easily with
  setAttribute("name", "value")
... and connected to the structure
– Normally with parent.appendChild(newChild)


Updates and deletions in the RegListMgr
similarly, by manipulating the DOM structures
-> exercises
Efficiency of SAX vs DOM?

DOM has a reputation for requiring more
resources than streaming interfaces like SAX

A small experiment to test this hypothesis:
Test task: Retrieve the title of the last section
that mentions "XML Schema definition
language"

– Target docs: repeats of fragments from W3C XML
Schema Recommendation (Part 1)
– Environment: JDK 1.6, Red Hat Linux 6, 3 GHz
Pentium with 1 GB RAM
The speed of DOM vs SAX

On small documents, up to ~ 2 MB, the SAX &
DOM based solutions are roughly equal:
[Figure: "SAX vs DOM processing times". x-axis: document size (KB), 500 to 4000; y-axis: time (ms), 400 to 1400. Series: SAX and DOM, annotated with throughputs of ~ 3.0 MB/s and ~ 3.9 MB/s.]
Resource needs of DOM vs SAX

On larger documents, up to ~ 60 MB, the
DOM application becomes faster than SAX(!)
– DOM throughput ~ 8 MB/s
– SAX throughput ~ 4 MB/s

But DOM uses a relatively large amount of RAM
– here ~ 6 x the size of the input XML document

The SAX application runs in fixed space of ~ 6 MB
Summary of XML APIs so far


Give applications access to the structure and
contents of XML documents
Event-based APIs (e.g. SAX)
– notify application through parsing events
– efficient

Object-model (or tree) based APIs (e.g. DOM)
– provide a full parse tree
– more convenient, but require a lot of resources with
large documents

Major parsers support both SAX and DOM
– used through proprietary methods
– used through JAXP
(-> next)