3.2 Document Object Model (DOM)
How can structured documents be accessed uniformly
in parsers, browsers, editors, databases, ...?
Overview of the W3C DOM Spec
» Level 1, W3C Rec, Oct. 1998
» Level 2, W3C Rec, Nov. 2000
» Level 3 Validation, Core, and Load and Save
W3C Recs (Spring 2004)
W3C DOM Activity has been closed
SDPL 2011
3.2: Document Object Model
1
DOM: What is it?
An object-based, language-neutral API for
XML and HTML documents
– Allows programs and scripts to build, access, and modify
documents
– Supports designing of
querying, filtering,
transformation, formatting etc.
applications on top of DOM implementations
If SAX is read as “Serial Access XML”, DOM could be read as
“Directly Obtainable in Memory”
DOM structure model
Based on O-O concepts:
– objects (encapsulation of data and methods)
– methods (to access or change object’s state)
– interfaces (declaration of a set of methods)
Somewhat similar to the XPath data model (to be
discussed with XSLT and XQuery): a syntax tree
– Tree structure implied by abstract relationships defined
by the API; Data structures of an implementation may
differ
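DOM structure model
<invoice form="00" type="estimated">
  <addressdata>
    <name>John Doe</name>
    <address>
      <streetaddress>Pyynpolku 1</streetaddress>
      <postoffice>70460 KUOPIO</postoffice>
    </address>
  </addressdata>
  ...
[Figure: the DOM tree of the invoice fragment. The Document node has the Element child invoice, whose attributes form="00" and type="estimated" are held in a NamedNodeMap. invoice contains the Element addressdata, which contains the Elements name (Text child "John Doe") and address; address contains the Elements streetaddress (Text child "Pyynpolku 1") and postoffice (Text child "70460 KUOPIO"). Legend: Document, Element, Text, NamedNodeMap.]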
Structure of DOM Level 1
I: DOM Core Interfaces
– Fundamental interfaces
» basic interfaces: Document, Element, Attr, Text, ...
– "Extended" (XML specific) interfaces
» CDATASection, DocumentType, Notation, Entity,
EntityReference, ProcessingInstruction
II: DOM HTML Interfaces
– more convenient access to HTML documents
– we'll ignore these
DOM Level 2
– Level 1: basic representation and manipulation of
document structure and content
(No access to the contents of a DTD)
DOM Level 2 adds
– support for namespaces
– Document.getElementById("id_val"),
to access elements by ID attr values
– optional features (we’ll skip these)
» interfaces to document views and style sheets
» an event model (for user actions on elements)
» methods for traversing the document tree and manipulating
regions of document (e.g., selected in an editor)
DOM Language Bindings
Language-independence:
– DOM interfaces are defined using the OMG Interface
Definition Language (IDL, defined in the CORBA
specification)
Language bindings (implementations of
interfaces) defined in the Recommendation for
– Java (See the Java API doc) and
– ECMAScript (standardised JavaScript)
Core Interfaces: Node & its variants
Node
  Document
  DocumentFragment
  Element
  Attr
  CharacterData
    Text
      CDATASection
    Comment
  DocumentType
  Notation
  Entity
  EntityReference
  ProcessingInstruction
“Extended interfaces”: CDATASection, DocumentType, Notation, Entity,
EntityReference, ProcessingInstruction
DOM interfaces: Node
Node
– getNodeType, getNodeName, getNodeValue
– getOwnerDocument
– getParentNode
– hasChildNodes, getChildNodes
– getFirstChild, getLastChild
– getPreviousSibling, getNextSibling
– hasAttributes, getAttributes
– appendChild(newChild)
– insertBefore(newChild,refChild)
– replaceChild(newChild,oldChild)
– removeChild(oldChild)
[Figure: the DOM tree of the invoice example (Document, Element and Text nodes, with the attributes in a NamedNodeMap), here drawn with type="estimatedbill".]
Type and Name of a Node
node.getNodeType():
short int constants 1, 2, …, 12 for
Node.ELEMENT_NODE,
Node.ATTRIBUTE_NODE,
Node.TEXT_NODE, …
node.getNodeName()
– for an Element = element.getTagName()
– for an Attr: the name of the attribute
– for anonymous nodes:
"#text", "#document", "#comment" etc
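As a rough illustration of the two accessors above, the following sketch parses a small document and prints the numeric type and the name of a few nodes. The class name NodeTypeDemo, the helpers parse and describe, and the sample XML are assumptions made for this example; JAXP is used here only as one convenient way to obtain a Document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeTypeDemo {
    // Hypothetical helper: parses an XML string into a DOM Document via JAXP.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Hypothetical helper: returns "<numeric node type>:<node name>".
    public static String describe(Node n) {
        return n.getNodeType() + ":" + n.getNodeName();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<name id=\"n1\">John Doe<!-- c --></name>");
        Node elem = doc.getDocumentElement();
        System.out.println(describe(doc));                          // 9:#document
        System.out.println(describe(elem));                         // 1:name
        System.out.println(describe(elem.getAttributes().item(0))); // 2:id
        System.out.println(describe(elem.getFirstChild()));         // 3:#text
        System.out.println(describe(elem.getLastChild()));          // 8:#comment
    }
}
```

In application code the named constants (Node.ELEMENT_NODE, Node.TEXT_NODE, ...) are preferable to the bare numbers printed here.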
The Value of a Node
node.getNodeValue()
– content of a text node,
value of attribute, …;
null for an Element (note!)
– (Cf. XPath, where a node’s value is its full textual
content)
– DOM 3 provides full text content with method
node.getTextContent()
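The difference between getNodeValue() and getTextContent() can be seen in a small sketch like the one below. The class name NodeValueDemo, the helper parse, and the sample XML (including the lang attribute) are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeValueDemo {
    // Hypothetical helper: parses an XML string into a DOM Document via JAXP.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<name lang=\"fi\"><given>Juho</given><family>Ahopelto</family></name>");
        Element name = doc.getDocumentElement();
        Node given = name.getFirstChild();

        // An Element has no node value of its own.
        System.out.println(name.getNodeValue());                          // null
        // The text lives in a child Text node.
        System.out.println(given.getFirstChild().getNodeValue());         // Juho
        // An Attr node's value is the attribute value.
        System.out.println(name.getAttributeNode("lang").getNodeValue()); // fi
        // DOM Level 3: concatenated text of all descendants.
        System.out.println(name.getTextContent());                        // JuhoAhopelto
    }
}
```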
Object Creation in DOM
Each DOM Node n belongs to a Document:
n.getOwnerDocument()
Objects that implement interface X are
created by factory methods
Document.createX(…)
E.g: when doc is a Document object
doc.createElement("A"),
doc.createAttribute("href"),
doc.createTextNode("Hello!")
Loading & saving are specified in DOM Level 3 (or are
implementation-specific, or done via JAXP)
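The factory methods above might be combined as in the following sketch, which builds a single A element with an href attribute and a text child. The class name FactoryDemo, the helper buildLink, and the URL are assumptions for this example; the empty Document is obtained through JAXP, which is only one of the options mentioned above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FactoryDemo {
    // Hypothetical helper: creates <A href="url">label</A> owned by doc.
    // The new element is not yet attached to the document tree.
    public static Element buildLink(Document doc, String url, String label) {
        Element a = doc.createElement("A");
        Attr href = doc.createAttribute("href");
        href.setValue(url);
        a.setAttributeNode(href);
        a.appendChild(doc.createTextNode(label));
        return a;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element a = buildLink(doc, "http://www.w3.org/DOM/", "Hello!");
        doc.appendChild(a); // attach as the document element

        System.out.println(a.getOwnerDocument() == doc); // true
        System.out.println(a.getAttribute("href"));      // http://www.w3.org/DOM/
        System.out.println(a.getTextContent());          // Hello!
    }
}
```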
DOM interfaces: Document
Document (extends Node)
– getDocumentElement
– getElementById(IdVal)
– getElementsByTagName(tagName)
– createElement(tagName)
– createTextNode(data)
– ...
[Figure: the DOM tree of the invoice example (Document, Element and Text nodes, with the attributes form="00" and type="estimated" in a NamedNodeMap).]
DOM interfaces: Element
Element (extends Node)
– getTagName()
– hasAttribute(name)
– getAttribute(name)
– setAttribute(attrName, value)
– removeAttribute(name)
– getElementsByTagName(name)
[Figure: the DOM tree of the invoice example, here drawn with type="estimatedbill" and the additional elements invoicepage and addressee.]
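A minimal sketch of the Element methods listed on this slide, applied to a cut-down version of the invoice example. The class name ElementDemo, the helper parse, and the new attribute value "final" are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ElementDemo {
    // Hypothetical helper: parses an XML string into a DOM Document via JAXP.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<invoice form=\"00\" type=\"estimated\">"
          + "<addressdata><name>John Doe</name></addressdata></invoice>");
        Element invoice = doc.getDocumentElement();

        System.out.println(invoice.getTagName());          // invoice
        System.out.println(invoice.hasAttribute("form"));  // true
        System.out.println(invoice.getAttribute("type"));  // estimated

        invoice.setAttribute("type", "final"); // overwrite an existing attribute
        invoice.removeAttribute("form");       // drop an attribute
        System.out.println(invoice.getAttribute("type"));  // final
        // getAttribute returns the empty string for a missing attribute.
        System.out.println(invoice.getAttribute("form").isEmpty()); // true

        // Descendant elements named "name", at any depth below invoice.
        System.out.println(invoice.getElementsByTagName("name").getLength()); // 1
    }
}
```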
Text Content Manipulation in DOM
for objects c that implement the
CharacterData interface
(Text, Comments, CDATASections):
– c.substringData(offset, count)
– c.appendData(string)
– c.insertData(offset, string)
– c.deleteData(offset, count)
– c.replaceData(offset, count, string)
( = c.deleteData(offset, count);
c.insertData(offset, string) )
DOM CharacterData
DOM strings are 0-based sequences of
16-bit characters:
C: Hello world, nice to see you!
   0         1         2
   01234567890123456789012345678
                               ^ C.getLength()-1
C.substringData(6, 5) = ?
C.substringData(0, C.getLength()) = ?
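The CharacterData operations can be tried out on the string from this slide with a sketch such as the following. The class name CharacterDataDemo, the helper newText, and the inserted strings are assumptions for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class CharacterDataDemo {
    // Hypothetical helper: creates a free-standing Text node in a fresh Document.
    public static Text newText(String data) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        return doc.createTextNode(data);
    }

    public static void main(String[] args) throws Exception {
        Text c = newText("Hello world, nice to see you!");

        System.out.println(c.getLength());                     // 29
        System.out.println(c.substringData(6, 5));             // world
        System.out.println(c.substringData(0, c.getLength())); // the whole string

        c.replaceData(6, 5, "DOM");   // delete 5 chars at offset 6, insert "DOM" there
        System.out.println(c.getData()); // Hello DOM, nice to see you!
        c.insertData(0, ">> ");
        System.out.println(c.getData()); // >> Hello DOM, nice to see you!
        c.deleteData(0, 3);
        c.appendData(" Bye.");
        System.out.println(c.getData()); // Hello DOM, nice to see you! Bye.
    }
}
```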
Interfaces to node collections (1)
NodeList for ordered lists of nodes
– returned by Node.getChildNodes() and
Element/Document.getElementsByTagName("name")
» (proper) descendant elements of type "name" in
document order ("*" ~ any element type)
[Figure: an example tree of E and A elements with nodes numbered 1 to 6, asking what E.getElementsByTagName("E") returns.]
Typical child-node access pattern
Accessing specific nodes, or iterating over a
NodeList:
– to process all children of node:
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++)
    process(children.item(i));
Interfaces to node collections (2)
NamedNodeMap for unordered sets of nodes
accessed by their name:
– returned by Node.getAttributes() and
DocumentType.getEntities()
DocumentFragment
– Temporary container of child nodes
– Disappears when inserted in the tree; only its children are inserted
NodeLists and NamedNodeMaps are "live":
– reflect updates of the doc tree immediately
– See the next slide
NodeLists are “live”
E.g., this would delete every other child of n:
NodeList cList = n.getChildNodes();
for (int i=0; i<cList.getLength(); i++)
n.removeChild(cList.item(i));
– What happens?
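[Figure: node n with the children A, B, C, D; cList is the live list of these children, shown at the loop steps i=0, i=1 and i=2.]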
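The effect of the loop above can be reproduced with the following sketch, which also shows one way to really remove all children. The class name LiveNodeListDemo and the helpers makeParent, childNames, buggyRemoveAll and removeAll are assumptions for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class LiveNodeListDemo {
    // Hypothetical helper: builds <n> with one empty child element per given name.
    public static Element makeParent(String... names) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element n = doc.createElement("n");
        doc.appendChild(n);
        for (String name : names) n.appendChild(doc.createElement(name));
        return n;
    }

    // Hypothetical helper: concatenates the names of the children of n.
    public static String childNames(Node n) {
        StringBuilder sb = new StringBuilder();
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling())
            sb.append(c.getNodeName());
        return sb.toString();
    }

    // The loop from the slide: the live list shrinks while i grows,
    // so every other child is skipped.
    public static void buggyRemoveAll(Node n) {
        NodeList cList = n.getChildNodes();
        for (int i = 0; i < cList.getLength(); i++)
            n.removeChild(cList.item(i));
    }

    // One possible fix: always remove the current first child.
    public static void removeAll(Node n) {
        while (n.hasChildNodes())
            n.removeChild(n.getFirstChild());
    }

    public static void main(String[] args) throws Exception {
        Element n = makeParent("A", "B", "C", "D");
        buggyRemoveAll(n);
        System.out.println(childNames(n)); // BD

        n = makeParent("A", "B", "C", "D");
        removeAll(n);
        System.out.println(n.hasChildNodes()); // false
    }
}
```

Iterating the live list backwards (from getLength()-1 down to 0) would also work, since removing the last item does not shift the indices still to be visited.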
DOM: XML Implementations
Java-based parsers
e.g. Apache Xerces, Apache Crimson, …
In MS IE browser: COM programming interfaces for
C/C++ and Visual Basic; ActiveX object
programming interfaces for script languages
Perl: XML::DOM (Implements DOM Level 1)
Others, say, database APIs?
– Vendors of different kinds of systems participated in the
W3C DOM WG
A Java-DOM Example
Command-line tool RegListMgr for
maintaining a course registration list
– with single-letter commands for listing, adding,
updating and deleting student records
Example:
$ java RegListMgr reglist.xml
Document loaded succesfully
> l          (list the contents)
…
40: Tero Ulvinen, TKM1, [email protected], 2
41: heli viinikainen, tkt5, [email protected], 1
Registration list: the XML file
<?xml version="1.0" ?>
<!DOCTYPE reglist SYSTEM "reglist.dtd">
<reglist lastID="41">
<student id="RDK1">
<name><given>Juho</given>
<family>Ahopelto</family></name>
<branchAndYear>TKT4</branchAndYear>
<email>[email protected]</email>
<group>2</group>
</student>
<!-- … and the other students … -->
</reglist>
Registration List: the DTD
<!ELEMENT reglist (student*)>
<!ATTLIST reglist
lastID CDATA #REQUIRED >
<!ELEMENT student
(name, branchAndYear, email, group)>
<!ATTLIST student
id ID #REQUIRED >
<!ELEMENT name (given, family)>
<!ELEMENT given (#PCDATA)>
<!-- … and the same for family,
branchAndYear, email, and group -->
Loading and Saving the RegList
Loading of the registration list into a DOM
Document doc is implemented with a JAXP
DocumentBuilder
– (to be discussed later)
– doc is a handle to the Document
Saving is implemented with a
JAXP Transformer
– (to be discussed later)
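Ahead of the later JAXP discussion, loading and saving might look roughly like the sketch below. It is not the RegListMgr code: the class name LoadSaveSketch and the methods load and save are assumptions, and the example works on strings instead of the file reglist.xml (and without its DTD) so that it is self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LoadSaveSketch {
    // Loading: a JAXP DocumentBuilder parses XML text into a DOM Document.
    public static Document load(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Saving: a Transformer created without a stylesheet performs an
    // identity transformation, here from a DOM tree to serialized XML text.
    public static String save(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = load("<reglist lastID=\"41\"></reglist>");
        doc.getDocumentElement().setAttribute("lastID", "42");
        System.out.println(save(doc));
    }
}
```

For files, DocumentBuilder.parse(File) and new StreamResult(File) can be used in place of the string-based source and result.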
Listing student records (1)
NodeList students =
doc.getElementsByTagName("student");
for (int i=0; i<students.getLength(); i++)
showStudent((Element) students.item(i));
private void showStudent(Element student) {
    // Collect relevant sub-elements:
    Node given =
        student.getElementsByTagName("given").item(0);
    // Sibling navigation assumes that no whitespace Text nodes
    // separate the elements (e.g. after validated parsing that
    // drops element-content whitespace):
    Node family = given.getNextSibling();
    Node bAndY = student.
        getElementsByTagName("branchAndYear").item(0);
    Node email = bAndY.getNextSibling();
    Node group = email.getNextSibling();
Listing student records (2)
// Method showStudent continues:
// Print the ID without its three-character "RDK" prefix:
System.out.print(
student.getAttribute("id").substring(3));
System.out.print(": " +
given.getFirstChild().getNodeValue() );
// or given.getTextContent() with DOM3
// .. similarly access and display the
// value of family, bAndY, email, and group
// …
} // showStudent
Lessons of accessing DOM
Access methods for relevant nodes
– getElementsByTagName(“tagName”)
» robust wrt structure modifications
– Also others, if structure known (validated)
» getFirstChild(), getLastChild(),
getPreviousSibling(),
getNextSibling()
Element nodes have no value!
– Get the value from child Text nodes,
or use getTextContent()
Adding New Records
Example:
> a          (add students)
First name (or <return> to finish): Antti
Last name: Ahkera
Branch&year: tkt3
email: [email protected]
group: 2
First name (or <return> to finish):
Finished adding records
> l
…
41: heli viinikainen, tkt5, [email protected], 1
42: Antti Ahkera, tkt3, [email protected], 2
Implementing addition of records (1)
Element rootElem = doc.getDocumentElement();
String lastID = rootElem.getAttribute("lastID");
int lastIDnum = java.lang.Integer.parseInt(lastID);
System.out.print(
"First name (or <return> to finish): ");
String firstName =
terminalReader.readLine().trim();
while (firstName.length() > 0) {
// Get the next unused ID:
ID = "RDK" + Integer.toString(++lastIDnum);
// … Read values lastName, bAndY, email,
// and group from the terminal, and then ...
Implementing addition of records (2)
Element newStudent =
newStudent(doc, ID, firstName, lastName,
bAndY, email, group);
rootElem.appendChild(newStudent);
System.out.print(
"First name (or <return> to finish): ");
firstName = terminalReader.readLine().trim();
} // while firstName.length() > 0
// Update the last ID used:
String newLastID =
java.lang.Integer.toString(lastIDnum);
rootElem.setAttribute("lastID", newLastID);
System.out.println("Finished adding records");
Creating new student records (1)
private Element
newStudent(Document doc, String ID,
String fName, String lName, String bAndY,
String email, String grp) {
Element stu = doc.createElement("student");
stu.setAttribute("id", ID);
Element newName = doc.createElement("name");
Element newGiven = doc.createElement("given");
newGiven.appendChild(doc.createTextNode(fName));
Element newFamily = doc.createElement("family");
newFamily.appendChild(doc.createTextNode(lName));
newName.appendChild(newGiven);
newName.appendChild(newFamily);
stu.appendChild(newName);
Creating new student records (2)
// method newStudent(…) continues:
Element newBr =
doc.createElement("branchAndYear");
newBr.appendChild(doc.createTextNode(bAndY));
stu.appendChild(newBr);
Element newEmail = doc.createElement("email");
newEmail.appendChild(doc.createTextNode(email));
stu.appendChild(newEmail);
Element newGrp = doc.createElement("group");
newGrp.appendChild(doc.createTextNode(grp));
stu.appendChild(newGrp);
return stu;
} // newStudent
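The create-fill-append pattern repeats for every sub-element, so it could be factored into a small helper. The following is a sketch of that idea, not the RegListMgr code: the class name NewStudentSketch and the helper appendTextElement are assumptions, and the e-mail address in main is a made-up placeholder.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NewStudentSketch {
    // Hypothetical helper: creates <tag>text</tag> and appends it to parent.
    public static Element appendTextElement(Document doc, Element parent,
                                            String tag, String text) {
        Element child = doc.createElement(tag);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
        return child;
    }

    // Builds a student record with the structure required by the reglist DTD:
    // (name(given, family), branchAndYear, email, group).
    public static Element newStudent(Document doc, String id,
            String fName, String lName, String bAndY,
            String email, String grp) {
        Element stu = doc.createElement("student");
        stu.setAttribute("id", id);
        Element name = doc.createElement("name");
        stu.appendChild(name);
        appendTextElement(doc, name, "given", fName);
        appendTextElement(doc, name, "family", lName);
        appendTextElement(doc, stu, "branchAndYear", bAndY);
        appendTextElement(doc, stu, "email", email);
        appendTextElement(doc, stu, "group", grp);
        return stu;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("reglist");
        doc.appendChild(root);
        // "antti@example.org" is a placeholder address for this example.
        root.appendChild(newStudent(doc, "RDK42", "Antti", "Ahkera",
                "tkt3", "antti@example.org", "2"));
        System.out.println(root.getElementsByTagName("student").getLength()); // 1
    }
}
```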
Lessons of modifying DOM
Each node must be created with
– Document.create...(“nameOrValue”)
– Attributes of an element more easily with
setAttribute(“name”, “value”)
... and connected to the structure
– Normally with parent.appendChild(newChild)
Updates and deletions in the RegListMgr are done
similarly, by manipulating the DOM structures
(-> exercises)
Efficiency of SAX vs DOM?
DOM has a reputation of requiring more
resources than streaming interfaces like SAX
A small experiment to test this hypothesis:
Test task: Retrieve the title of the last section
that mentions "XML Schema definition
language"
– Target docs: repeats of fragments from W3C XML
Schema Recommendation (Part 1)
– Environment: JDK 1.6, Red Hat Linux 6, 3 GHz
Pentium with 1 GB RAM
The speed of DOM vs SAX
On small documents, up to ~ 2 MB, the SAX &
DOM based solutions are roughly equal:
[Figure: "SAX vs DOM processing times": time (ms, axis from 400 to 1400) plotted against document size (KB, axis from 500 to 4000) for SAX and DOM, with throughput annotations ~ 3.0 MB/s and ~ 3.9 MB/s.]
Resource needs of DOM vs SAX
On larger documents, up to ~ 60 MB, the
DOM application becomes faster than SAX(!)
– DOM throughput ~ 8 MB/s
– SAX throughput ~ 4 MB/s
But DOM takes a relatively large amount of RAM
– here ~ 6 x the size of the input XML document
The SAX application runs in a fixed space of ~ 6 MB
Summary of XML APIs so far
Give applications access to the structure and
contents of XML documents
Event-based APIs (e.g. SAX)
– notify application through parsing events
– efficient
Object-model (or tree) based APIs (e.g. DOM)
– provide a full parse tree
– more convenient, but require a lot of resources with
large documents
Major parsers support both SAX and DOM
– used through proprietary methods
– or through JAXP (-> next)