WIS class: Web Engineering

TU/e
technische universiteit eindhoven
Web Data and Metadata
Geert-Jan Houben
Contents
• Evolution in Web data
• Techniques and Languages for Web data:
– XML
– XML Querying: XQuery
– RDF (& RQL)
– OWL
Note: here the context, not the details!
Evolution
Future of the Web
1. common syntax: XML
• HTML: a fixed set of tags complicates the identification of information elements
• XML allows data structures to be defined:
– Tags with freely chosen names: no predefined tags enables definition, transmission, validation and interpretation of data between applications (and organizations)
– Freely chosen attributes
– Simple definition: DTD
– Extended definition: XML Schema
<skills>
<people>
<person>
<name>Bob</name>
<know-how>Quilt</know-how>
</person>
<person>
<name>Peter</name>
<know-how>Quilt</know-how>
<know-how>XML-GL</know-how>
</person>
</people>
<seminars>
<seminar>
<topic>Quilt</topic>
<participant>
<name>Karin</name>
<name>Alice</name>
</participant>
</seminar>
</seminars>
</skills>
//person/name[../know-how="Quilt"]
|
//seminar[topic="Quilt"]/participant/name
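The two paths above can be tried on the skills document with a short script. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which supports only a subset of XPath: the parent step in the first predicate is rewritten as an equivalent child test, and the union is emulated by list concatenation. The names SKILLS_XML and quilt_names are illustrative and not part of the slides.

```python
import xml.etree.ElementTree as ET

# The skills document from the slide, condensed.
SKILLS_XML = """
<skills>
  <people>
    <person><name>Bob</name><know-how>Quilt</know-how></person>
    <person><name>Peter</name><know-how>Quilt</know-how><know-how>XML-GL</know-how></person>
  </people>
  <seminars>
    <seminar>
      <topic>Quilt</topic>
      <participant><name>Karin</name><name>Alice</name></participant>
    </seminar>
  </seminars>
</skills>
"""


def quilt_names(xml_text):
    """Names of people who know Quilt or attend a seminar on Quilt."""
    root = ET.fromstring(xml_text)
    # Equivalent of //person/name[../know-how="Quilt"]
    experts = root.findall(".//person[know-how='Quilt']/name")
    # Equivalent of //seminar[topic="Quilt"]/participant/name
    attendees = root.findall(".//seminar[topic='Quilt']/participant/name")
    # No union operator in ElementTree; the two node sets are disjoint here,
    # so concatenation gives the same result in document order.
    return [node.text for node in experts + attendees]


print(quilt_names(SKILLS_XML))
```

A full XPath engine would evaluate the original expression directly; the rewrite is only needed because of the restricted XPath subset in the standard library.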
Future of the Web
2. Specification of meaning: RDF
• Resource: denotes an information item, e.g. via a URL
• Property type: name of a property of a resource
• Value: value for that property
Example:
Resource = URL of web page
Property type = “author”
Value = “John Smith”
<?xml:namespace ns = "http://www.w3.org/RDF/RDF/" prefix = "RDF" ?>
<?xml:namespace ns = "http://purl.oclc.org/DC/" prefix = "DC" ?>
<?xml:namespace ns = "http://person.org/BusinessCard/" prefix = "CARD" ?>
<RDF:RDF>
<RDF:Description RDF:HREF = "http://uri-of-Document-1">
<DC:Creator RDF:HREF = "#Creator_001"/>
</RDF:Description>
<RDF:Description ID="Creator_001">
<CARD:Name>John Smith</CARD:Name>
<CARD:Email>[email protected]</CARD:Email>
<CARD:Affiliation>Home, Inc.</CARD:Affiliation>
</RDF:Description>
</RDF:RDF>
Future of the Web
3. Meaning: ontologies
• Ontology = a vocabulary with associated meaning
• Possibility to define synonyms, specializations and other relationships
• Use of same ontology = contract on meaning of words (tags, attributes)
• Often, industry or domain dependent
Future of the Web
4. Logic to derive conclusions
• Necessary in electronic commerce: what do the messages exchanged between supplier and customer mean?
5. Goal: trust in the meaning of communication
between Web systems, and hence the possibility
to automate using agents
Ref: www.w3.org
Web Data Integration
• WIS repository (back-end) typically
assembled from different heterogeneous
sources, e.g. databases, files, WWW
• To manage (coordinate) data from different
sources, metadata helps to structure the
data
Metadata
• Describing the data and its availability
• Sometimes provided by sources
• Needed by IS
• Engineering metadata:
– Meaning
– Validity
– Quality
• Specifying “logistics” of data
XML
Semistructured data
XML: Complex data
• Structure is irregular (missing/extra data)
• Schema does not exist or is unknown
• Schema is rapidly evolving
• Relational and ODB models are too rigid
• Standard is a document/hypertext language HTML
• Solution: semistructured data model XML
– data model consists of a type definition language, a
query/update language and more
XML Environment
• Follow-up of SGML, markup language for
documents, and OO databases
• XML eXtensible Mark-up Language
– W3C and most industrial companies [B2B]
– Main idea: separate content and presentation
– Use tags to represent structure and semantics
Ref: www-rocq.inria.fr/~abitebou/pub/lics01.ppt
HTML = Hypertext Language
HTML (text + presentation):
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
HTML to Information System: hard
Where is the data?
XML = Semistructured Data
XML (data + structure):
<product-table>
<product reference="X23">
<designation> camera </designation>
<price unit="Dollars"> 359.99 </price>
<description> … </description>
...
</product>
<product reference="R2D2">
<designation> Robot </designation>
<price unit="Dollars"> 19350 </price>
<description> … </description>
...
</product-table>
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
XML to Information System: easy
Semistructured: more flexible
XML Flexibility
• no fixed set of tags
• no fixed interpretation/rendering of tags
• no fixed structure
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
<shipTo country="US">
<name>Alice Smith</name>
<street>123 Maple Street</street>
<city>Mill Valley</city>
<state>CA</state>
<zip>90952</zip>
</shipTo>
<billTo country="US">
<name>Robert Smith</name>
<street>8 Oak Avenue</street>
<city>Old Town</city>
<state>PA</state>
<zip>95819</zip>
</billTo>
<comment>Hurry, my lawn is going wild!</comment>
<items>
<item partNum="872-AA">
<productName>Lawnmower</productName>
<quantity>1</quantity>
<USPrice>148.95</USPrice>
<comment>Confirm this is electric</comment>
</item>
<item partNum="926-AA">
<productName>Baby Monitor</productName>
<quantity>1</quantity>
<USPrice>39.98</USPrice>
<shipDate>1999-05-21</shipDate>
</item>
</items>
</purchaseOrder>
XML Documents
• elements and attributes
• elements are ordered
• attribute values are strings
• well-formed documents (e.g. proper nesting)
• namespaces: vocabularies for tags
• valid documents: DTD, Schema
DTD: a grammar
Catalog → Product*
Product → Name Price? Cat (Part Quantity)*
Part → BasicPart + ComposedPart
BasicPart → Name
ComposedPart → Name (Part Quantity)*
XML Schema
• to define a class of documents: conforming
to a schema
• in XML syntax
• built-in types
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">
Purchase order schema for Example.com.
Copyright 2000 Example.com. All rights reserved.
</xsd:documentation>
</xsd:annotation>
<xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
<xsd:element name="comment" type="xsd:string"/>
<xsd:complexType name="PurchaseOrderType">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress"/>
<xsd:element name="billTo" type="USAddress"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="items" type="Items"/>
</xsd:sequence>
<xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
<xsd:complexType name="USAddress">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="street" type="xsd:string"/>
<xsd:element name="city" type="xsd:string"/>
<xsd:element name="state" type="xsd:string"/>
<xsd:element name="zip" type="xsd:decimal"/>
</xsd:sequence>
<xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>
...
</xsd:schema>
...
<xsd:complexType name="Items">
<xsd:sequence>
<xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="productName" type="xsd:string"/>
<xsd:element name="quantity">
<xsd:simpleType>
<xsd:restriction base="xsd:positiveInteger">
<xsd:maxExclusive value="100"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="USPrice" type="xsd:decimal"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="partNum" type="SKU" use="required"/>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
<!-- Stock Keeping Unit, a code for identifying products -->
<xsd:simpleType name="SKU">
<xsd:restriction base="xsd:string">
<xsd:pattern value="\d{3}-[A-Z]{2}"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
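The two simple-type restrictions in this schema (the SKU pattern and the bounded quantity) can be illustrated outside a schema processor. The following is a minimal sketch in Python using only the standard re module; it mimics the two restrictions by hand and is not a schema validator. The function names is_valid_sku and is_valid_quantity are illustrative.

```python
import re

# Same pattern as the SKU simple type in the schema.
SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")

# Lexical form of xsd:positiveInteger: optional "+" followed by digits.
POSITIVE_INTEGER = re.compile(r"\+?[0-9]+")


def is_valid_sku(part_num):
    """XML Schema patterns match the whole value, hence fullmatch."""
    return SKU_PATTERN.fullmatch(part_num) is not None


def is_valid_quantity(text):
    """positiveInteger restricted by maxExclusive="100", i.e. 1 to 99."""
    if POSITIVE_INTEGER.fullmatch(text) is None:
        return False
    return 1 <= int(text) < 100


print(is_valid_sku("872-AA"), is_valid_quantity("1"))
```

Both part numbers in the purchase order shown earlier ("872-AA" and "926-AA") satisfy the pattern, and both quantities lie within the allowed range.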
Typing XML
• Not really in the true spirit of the Web, but essential
for data management: query optimization, user
interfaces, applications
• Differences with standard database typing
– Collections are sequences instead of sets
– Types may be very large (e.g., from integration)
– Data is more irregular so types should be more
permissive
– Sometimes new issues arise: given only the data, extract
its type (an approximate type)
More on XML
• The Database Models course in BIS, given
by De Bra and Paredaens, will pay much
more attention to the XML data model.
• Also, look at the W3C site: w3c.org
XML Querying
XQuery
XML query language
• XML is used for data exchange on the Web
• W3C develops standard: XML Query Working
Group
• XML Query Data Model
• XPath and XQuery
Ref: www.w3.org/XML/Query
XPath
• Path expressions in OO databases
/Students/Student/Status
• Semistructured:
– missing parts
/Students//Status
– conditions
/Students/Student[Status="U4"]
• Indexing, wildcards
• Selection, string manipulation, aggregation,
attribute existence, union
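The three path expressions on this slide can be tried on a small document. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (a subset of XPath, with paths written relative to the root element). The sample data, including the Alumni element and the student names, is hypothetical and only serves to show the difference between "/" and "//".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; one Student is nested deeper than the others.
STUDENTS_XML = """
<Students>
  <Student><Name>Ann</Name><Status>U4</Status></Student>
  <Student><Name>Ben</Name><Status>U2</Status></Student>
  <Alumni>
    <Student><Name>Cy</Name><Status>G1</Status></Student>
  </Alumni>
</Students>
"""


def direct_statuses(root):
    """/Students/Student/Status: only Students directly under the root."""
    return [s.text for s in root.findall("./Student/Status")]


def all_statuses(root):
    """/Students//Status: Status elements at any depth."""
    return [s.text for s in root.findall(".//Status")]


def names_with_status(root, status):
    """/Students/Student[Status="..."]: selection by a condition."""
    return [s.findtext("Name") for s in root.findall(f"./Student[Status='{status}']")]


root = ET.fromstring(STUDENTS_XML)
print(direct_statuses(root), all_statuses(root), names_with_status(root, "U4"))
```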
XSLT
• XSL: Extensible Stylesheet Language
– (XSLT: XSL Transformations)
• declarative language for transforming XML
documents using an XSLT processor
XQuery
• http://www.w3.org/XML/Query
• “the” standard for XML querying
• Goal WG: “data model for XML documents, a set
of query operators on that data model, and a query
language based on these query operators”
• General query language (next to XPath + XSLT)
XQuery Path Expressions
Based on XPath
In the second chapter of the document named “zoo.xml”, find the figure(s) with caption “Tree Frogs”.
document("zoo.xml")/chapter[2]//figure[caption="Tree Frogs"]
Find captions of figures that are referenced by <figref> elements in the chapter of “zoo.xml” with title “Frogs”.
document("zoo.xml")/chapter[title="Frogs"]//figref/@refid->fig/caption
XQuery Element Constructor
Generate an <emp> element that has an “empid” attribute. The value of
the attribute and the content of the subelements are specified by
variables that are bound in other parts of the query.
<emp empid={$id}>
{$name}
{$job}
</emp>
XQuery FLWR Expression
FOR var IN expr (binding-clause)
LET var := expr (binding-clause)
WHERE expr (select-predicate)
RETURN expr (output-generation)
List the titles of books published by Morgan Kaufmann in 1998.
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998"
RETURN $b/title
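For readers more familiar with general-purpose languages, the FOR / WHERE / RETURN structure of this query maps closely onto a list comprehension. The following is a minimal sketch in Python with the standard xml.etree.ElementTree module; the contents of BIB_XML are hypothetical, since the slides do not show bib.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB_XML = """
<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>1999</year></book>
  <book><title>Book C</title><publisher>Addison-Wesley</publisher><year>1998</year></book>
</bib>
"""


def titles_by(xml_text, publisher, year):
    """Titles of books with the given publisher and year."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")                       # RETURN $b/title
        for book in root.iter("book")                # FOR $b IN ...//book
        if book.findtext("publisher") == publisher   # WHERE ... AND ...
        and book.findtext("year") == year
    ]


print(titles_by(BIB_XML, "Morgan Kaufmann", "1998"))
```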
FLWR Expression
List each publisher and the average price of its books.
FOR $p IN distinct(document("bib.xml")//publisher)
LET $a := avg(document("bib.xml")//book[publisher=$p]/price)
RETURN
<publisher>
<name>{$p/text()}</name>
<avgprice>{$a}</avgprice>
</publisher>
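The combination of FOR over distinct values and LET with an aggregate is a grouping operation. It can be sketched in Python as follows, using only the standard library; the contents of BIB_XML and the function name are hypothetical, and the result is returned as a dictionary rather than as constructed XML elements.

```python
from statistics import mean
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB_XML = """
<bib>
  <book><publisher>Morgan Kaufmann</publisher><price>40.00</price></book>
  <book><publisher>Morgan Kaufmann</publisher><price>60.00</price></book>
  <book><publisher>Addison-Wesley</publisher><price>30.00</price></book>
</bib>
"""


def average_price_by_publisher(xml_text):
    """Map each distinct publisher to the average price of its books."""
    books = list(ET.fromstring(xml_text).iter("book"))
    # distinct(...//publisher), keeping first-seen order
    publishers = []
    for book in books:
        name = book.findtext("publisher")
        if name not in publishers:
            publishers.append(name)
    # LET $a := avg(...//book[publisher=$p]/price) for each $p
    return {
        name: mean(
            float(book.findtext("price"))
            for book in books
            if book.findtext("publisher") == name
        )
        for name in publishers
    }


print(average_price_by_publisher(BIB_XML))
```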
Operators and Functions
Find the maximum depth of the document named “partlist.xml”.
NAMESPACE xsd=http://www.w3.org/2001/XMLSchema-datatypes
FUNCTION depth(ELEMENT $e) RETURNS xsd:integer
{
-- An empty element has depth 1
-- Otherwise, add 1 to max depth of children
IF empty($e/*) THEN 1
ELSE max(depth($e/*)) + 1
}
depth(document("partlist.xml"))
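The same recursion can be written in a few lines of Python. This is a minimal sketch using the standard xml.etree.ElementTree module; the two inline documents are hypothetical test inputs, since the slides do not show partlist.xml.

```python
import xml.etree.ElementTree as ET


def depth(elem):
    """An element without child elements has depth 1;
    otherwise add 1 to the maximum depth of its children."""
    children = list(elem)
    if not children:
        return 1
    return max(depth(child) for child in children) + 1


print(depth(ET.fromstring("<a><b><c/></b><d/></a>")))
```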
Conditional Expression
Make a list of holdings, ordered by title. For journals, include the editor,
and for all other holdings, include the author.
FOR $h IN //holding
RETURN
<holding>
{$h/title,
IF $h/@type="Journal"
THEN $h/editor
ELSE $h/author
}
</holding> SORTBY (title)
Quantified Expressions
Find titles of books in which both sailing and windsurfing are mentioned
in the same paragraph.
FOR $b IN //book
WHERE SOME $p IN $b//para SATISFIES
contains($p,"sailing") AND contains($p,"windsurfing")
RETURN $b/title
Find titles of books in which sailing is mentioned in every paragraph.
FOR $b IN //book
WHERE EVERY $p IN $b//para SATISFIES
contains($p,"sailing")
RETURN $b/title
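SOME and EVERY correspond to existential and universal quantification, which Python expresses with any() and all(). The following is a minimal sketch with the standard xml.etree.ElementTree module; the library document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
LIBRARY_XML = """
<library>
  <book>
    <title>Wind and Water</title>
    <para>On sailing and windsurfing.</para>
    <para>More on sailing.</para>
  </book>
  <book>
    <title>Two Sports</title>
    <para>About sailing.</para>
    <para>About windsurfing.</para>
  </book>
</library>
"""


def para_texts(book):
    """Text content of every <para> below the given book."""
    return ["".join(p.itertext()) for p in book.iter("para")]


def some_para_mentions_both(root):
    """SOME $p IN $b//para SATISFIES contains(...) AND contains(...)"""
    return [
        b.findtext("title")
        for b in root.iter("book")
        if any("sailing" in t and "windsurfing" in t for t in para_texts(b))
    ]


def every_para_mentions_sailing(root):
    """EVERY $p IN $b//para SATISFIES contains($p, "sailing")"""
    return [
        b.findtext("title")
        for b in root.iter("book")
        if all("sailing" in t for t in para_texts(b))
    ]


root = ET.fromstring(LIBRARY_XML)
print(some_para_mentions_both(root), every_para_mentions_sailing(root))
```

Note that all() over an empty sequence is true, so a book without any paragraphs would satisfy the EVERY condition in this sketch.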
Other expressions
• Sequence-related expressions
– Example: ($x,$y,$z)
– PRECEDES, FOLLOWS
• Operators on data types
– INSTANCEOF
– CAST
– TREAT
More on XQuery
• The Database Models course in BIS, given
by De Bra and Paredaens, will pay much
more attention to XML query languages.
• Also, look at the W3C site: w3c.org
RDF
RQL
Resource Description Framework
• W3C standard for metadata description
• Describes the “meaning” of data like Web sites, parts
of HTML pages, etc.
• Makes data “machine-understandable”, which allows
automated data processing
• Framework that allows you to make simple assertions
about anything: distributed and extensible (as is the
Web)
• “meaning” expressed via “subclass of”
Ref: www.w3.org/RDF, www.w3.org/TR/rdf-primer
Basic RDF Model
• Recognizes 3 object types:
– Resources – always named by URI, e.g. web
site, part of web page, others
– Properties – an attribute of a Resource, its
characteristics
– Statements – Resource + Property + Property
Value
Basic RDF Model Example
• RDF representation of the sentence:
“Ora Lassila is the creator of the resource www.w3.org/Home/Lassila.”
Statement:
Subject (Resource): www.w3.org/Home/Lassila
Predicate (Property): Creator
Object (Literal): “Ora Lassila”
Basic RDF Model Example
• Diagram of the statement:
www.w3.org/Home/Lassila --Creator--> Ora Lassila
• In general:
<subject> HAS <predicate> <object>
here
www.w3.org/Home/Lassila HAS Creator Ora Lassila
RDF and XML
• RDF can be implemented using XML
• The complete XML for the previous example is:
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:s="http://description.org/schema/">
<rdf:Description about="www.w3.org/Home/Lassila">
<s:Creator>Ora Lassila</s:Creator>
</rdf:Description>
</rdf:RDF>
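To see how the XML form maps back to subject, predicate and object, the document can be read with an ordinary XML parser. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; it handles only the simple shape shown on this slide (literal property values directly under rdf:Description) and is not a general RDF parser. The names RDF_XML and extract_statements are illustrative.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The example from the slide, with attribute values quoted.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:s="http://description.org/schema/">
  <rdf:Description about="www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>"""


def extract_statements(rdf_xml):
    """Return (subject, predicate, object) tuples for literal-valued properties."""
    root = ET.fromstring(rdf_xml)
    statements = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get("about")
        for prop in description:
            # ElementTree expands prefixes to {namespace-uri}local-name
            statements.append((subject, prop.tag, (prop.text or "").strip()))
    return statements


print(extract_statements(RDF_XML))
```

The predicate comes out as the namespace URI plus the local name, which shows how namespaces make property names globally unambiguous.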
Structured Value Example
• “The employee with ID 85740, Ora Lassila, with
Email [email protected], is the creator of the
resource www.w3.org/Home/Lassila”
Diagram:
www.w3.org/Home/Lassila --Creator--> www.w3.org/staffid/85740
www.w3.org/staffid/85740 --Name--> Ora Lassila
www.w3.org/staffid/85740 --Email--> [email protected]
In XML it is:
<rdf:RDF>
<rdf:Description about="www.w3.org/Home/Lassila">
<s:Creator>
<rdf:Description about="www.w3.org/staffid/85740">
<v:Name>Ora Lassila</v:Name>
<v:Email>[email protected]</v:Email>
</rdf:Description>
</s:Creator>
</rdf:Description>
</rdf:RDF>
RDF - more
• Property value can be literal or resource
• One subject can have more than one property
• It is possible to make statements about statements
• It is possible to refer to a collection of resources (containers) of 3 types:
– Bag – a property has multiple values, order has no significance
– Sequence – a property has multiple values, order is significant
– Alternative – list of literals/resources representing alternatives for a single property
RDF Schemas and Namespaces
• Meaning of terms used in statements like “Creator”, “Name”, “Email”
is expressed by referring to RDF Schemas (“domain-definition”)
• RDF Schema provides information about the interpretation of the
statement in a given RDF model
• RDF Schema is usually a separate document
• To avoid confusion between different definitions of the same term,
RDF Schemas use the XML Namespace facility.
xmlns:s="http://description.org/schema"
xmlns:v="http://description.org/differentschema"
<s:Creator>Ora Lassila</s:Creator>
<v:Creator>Ora Lassila</v:Creator>
RDF Query Language
• Querying RDF metadata
– SQL/XQL style approach, viewing RDF metadata as
relational or XML database [RDF Query Specification
(IBM)]
– viewing Web descriptions by RDF metadata as knowledge
base, applying knowledge representation and reasoning
techniques [W3C related]
• RQL
Ref: 139.91.183.30:9090/RDF/publications/bda01.PDF
139.91.183.30:8999/RQLdemo/
RQL
subClassOf(Artist)
subClassOf^(Artist)
SELECT $C1, $C2
FROM {$C1}creates{$C2}
SELECT X, Y
FROM {X}last_modified{Y}
WHERE Y >= 2000-01-01
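The first two RQL expressions query the schema itself. As we understand the RQL documentation, subClassOf(Artist) returns all subclasses (transitively) and subClassOf^(Artist) only the direct ones. The difference can be sketched in Python as follows; the class hierarchy below (Painter, Sculptor, Cubist) is hypothetical and chosen only for illustration.

```python
# Hypothetical schema: each class mapped to its direct superclass.
SUBCLASS_OF = {
    "Painter": "Artist",
    "Sculptor": "Artist",
    "Cubist": "Painter",
}


def direct_subclasses(cls):
    """Analogue of subClassOf^(cls): one step down the hierarchy."""
    return sorted(child for child, parent in SUBCLASS_OF.items() if parent == cls)


def all_subclasses(cls):
    """Analogue of subClassOf(cls): transitive closure of the hierarchy."""
    found = []
    for child in direct_subclasses(cls):
        found.append(child)
        found.extend(all_subclasses(child))
    return sorted(found)


print(direct_subclasses("Artist"), all_subclasses("Artist"))
```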
OWL
OWL
• Web Ontology Language
• used to explicitly represent meaning of
terms in vocabularies and relationships
between terms: ontology
– ontology engineering
• beyond XML and RDF(S)
• revision of DAML+OIL
Stack
• XML: surface syntax for structured documents (no
semantic constraints on meaning)
• XML Schema: restricting structure of XML documents
• RDF: datamodel for objects (resources) and relationships,
provides simple semantics for this datamodel
• RDF Schema: vocabulary for describing properties and
classes of RDF resources, with semantics for
generalization-hierarchies
• OWL: adds vocabulary for describing properties and
classes, e.g. relations between classes (disjoint),
cardinality (exactly one), equality, richer typing of
properties, characteristics of properties (symmetry),
enumerated classes
OWL Sublanguages
• OWL Lite: classification hierarchy and
simple constraints
• OWL DL: maximum expressiveness while
retaining computational completeness and
decidability (description logics)
• OWL Full: maximum expressiveness and
syntactic freedom of RDF with no
computational guarantees
OWL Lite
• RDF Schema features: Class, rdf:Property, rdfs:subClassOf,
rdfs:subPropertyOf, rdfs:domain, rdfs:range, Individual
• (In)Equality: equivalentClass, equivalentProperty,
sameIndividualAs, differentFrom, allDifferent
• Property characteristics: inverseOf, TransitiveProperty,
SymmetricProperty, FunctionalProperty,
InverseFunctionalProperty
• Property type restrictions: allValuesFrom, someValuesFrom
• Restricted cardinality: minCardinality (0/1), maxCardinality
(0/1), cardinality (0/1)
• Class intersection: intersectionOf
OWL DL and Full
• Class axioms: oneOf, disjointWith,
equivalentClass, rdfs:subClassOf (both
applied to class expressions)
• Boolean combinations of class expressions:
unionOf, intersectionOf, complementOf
• Arbitrary cardinality: minCardinality,
maxCardinality, cardinality
References
• There is a lot of information available
through the W3C site.
• Depending on your background, have a
close look at some of the languages and the
ideas behind them.