
Web Data Management
WebOQL
1
OVERVIEW
• The data model supports abstractions for modeling
record-based data, structured documents and
hypertexts.
• Supports querying small databases represented
as documents (such as catalogs), restructuring
single pages (converting a large page into
smaller pages), and restructuring sets of pages, for
example, creating an index page containing a
hyperlink to each of them and adding to each
page a hyperlink to the index page.
• Supports restructuring the content of a web site in order to
show the same content in another view.
2
Data Model
The WebOQL data model introduces the hypertree: a tree-based
data model representing structured documents
containing hyperlinks.
Hypertrees are ordered arc-labeled trees with two kinds of
arcs: internal and external.
Internal arc:
represent structured objects
External arc:
represent references (links),
cannot have descendants and
their records must contain a
‘URL’ field.
3
Data Model
Example:
[Label: arik home page.
URL: www…/index.html]
[Label: moshe home page.
URL: www…/index.html]
[Label: seminar in www.
URL: www…/s.html]
[Label: databases.
URL: www…/index.html]
4
Data Model
Hypertrees are a useful data structure because they
support three important abstractions:
•Collections
•Nesting
•Ordering
The notion of reference, which is very important to the structure
of the web, is captured through the distinction between internal
and external arcs.
Because the nodes are untyped, a tree can hold
heterogeneous records on its arcs.
5
Data Abstractions
Web: a pair (t, F) where t is a hypertree (the schema) and
F is the browsing function, F : URLs → Hypertrees
Page: F(u), where u is a URL
6
Tree operators
Definitions:
Tails: the tails of a tree t are the trees obtained by chopping
prefixes of t.
Simple trees: the simple trees of a tree t are the trees
composed of an arc that stems from the root of t
and its subtree.
Subtrees: the subtrees of t are the trees at the end of the arcs that
stem from the root of t.
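The three definitions above can be sketched in Python. This is only an illustration, not part of WebOQL: the representation (a tree as an ordered list of (label, subtree) pairs) and the function names are assumptions made for the example. The sample tree t mirrors the one drawn on the next slide.

```python
# Assumed representation: a tree is an ordered list of the arcs that stem
# from its root; each arc is a (label, subtree) pair and a label is a record.
t = [
    ({"Label": 1}, [({"A": 1}, []), ({"A": 2}, [])]),
    ({"Label": 2}, [({"B": 1}, [])]),
    ({"Label": 3}, []),
]

def simple_trees(tree):
    """One tree per root arc: the arc together with its subtree."""
    return [[arc] for arc in tree]

def subtrees(tree):
    """The trees found at the end of each arc stemming from the root."""
    return [subtree for _label, subtree in tree]

def tails(tree):
    """The trees left after chopping a prefix of 0, 1, 2, ... root arcs."""
    return [tree[i:] for i in range(len(tree))]
```

With this reading, t has three simple trees, three subtrees (the last one empty) and three tails, which matches the drawings that follow.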
7
[Label:3]
[Label:1]
Tree t
[Label:2]
[A:1] [A:2]
[B:1]
[Label:1]
[Label:2]
[A:1]
[A:2]
[Label:3]
[B:1]
Simple trees of t
[A:1]
[A:2]
[B:1]
null
Sub trees of t
8
Tails of T ! (prefixes)
[Label:1]
[A:1] [A:2]
[Label:3]
[Label:3]
[Label:2]
[B:1]
[Label:3]
[Label:2]
[B:1]
9
Tree operators
Concatenate :
Tree1 + Tree2
Connects two trees by their roots:
t1:
t2:
t1 + t2:
[label1: b]
[label1: a]
[label1: b]
[label1: c1]
[label1: c1]
[label1: a1]
[label1: c]
[label1: c]
[label1: a1]
[label1: c2]
[label1: a2]
[label1: c2]
[label1: a2]
10
Tree operators
Hang :
[ Arc1 / Tree1 ]
Hangs the tree from a new arc.
t1:
[ label1: a / t1 ]
[label1: a]
[label1: a1]
[label1: a1]
[label1: a2]
[label1: a2]
11
Tree operators
Prime :
Tree’
The first subtree of the argument.
t1’ :
t1:
[label1: a]
[label1: b]
[label1: a1]
[label1: a1]
[label1: a2]
[label1: a2]
12
Tree operators
Head :
Tree & [x]
The first x simple trees of the argument. If x is not specified
then only the first simple tree.
t1:
t1& :
[label1: a]
[label1: a]
[label1: b]
[label1: a1]
[label1: a1]
[label1: a2]
[label1: a2]
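The four operators introduced so far (concatenate, hang, prime and head) can be sketched as ordinary functions. The list-of-(label, subtree) representation and the function names are assumptions chosen for illustration; only the behavior described on the slides is modeled.

```python
# Assumed representation: a tree is an ordered list of (label, subtree) arcs.

def concatenate(t1, t2):
    """t1 + t2: connect two trees by their roots, keeping the arc order."""
    return t1 + t2

def hang(label, tree):
    """[label / tree]: hang the tree from a new arc labeled with the record."""
    return [(label, tree)]

def prime(tree):
    """tree': the first subtree of the argument (empty if there are no arcs)."""
    return tree[0][1] if tree else []

def head(tree, x=1):
    """tree & x: the first x simple trees of the argument, as one tree."""
    return tree[:x]

t1 = [
    ({"label1": "a"}, [({"label1": "a1"}, []), ({"label1": "a2"}, [])]),
    ({"label1": "b"}, []),
]
```

For the sample tree t1, prime(t1) is the tree holding the a1 and a2 arcs, and head(t1) keeps only the arc labeled a together with its subtree.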
13
q4
q4’
q5
q5&
q6
q5!
q7
q5&2
14
HANG
[Label: “papers from smith”, Format: “ps.Z”/q1]
[Label:Papers from smith
Format:ps.Z]
[Title:Recent………..
Url:http://………..]
[Title : Are………..
Url:http://www……….]
HANG + concatenate
[Tag: “UL”/[Tag: “LI”, Text: “First Child”]+
[Tag: “LI”, Text: “Second Child”]+
[Tag: “LI”, Text: “Third Child”]]+
[Url: “http://a.b.c.”, Label: “Click Here”]
[Url: “http://a.b.c.”,
Label: “Click Here”]
[Tag:UL]
[Tag:LI
Text:FirstChild]
[
]
[ ]
15
Tree operators
Peek :
Arc.field
Extracts a field from an arc’s label, e.g. Example.Group
can have a value of ‘students’. If the field does not
exist, ‘null’ is returned.
IsField :
Arc?field
Tests for the presence of a field in an arc’s label,
e.g. Example?Group evaluates to true, while
Example?Name evaluates to false.
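As a rough illustration, Peek and IsField behave like dictionary lookups on the record that labels an arc. The dict representation and the function names below are assumptions; None stands in for WebOQL's null.

```python
def peek(label, field):
    """Arc.field: the value of the field, or None (null) if it is absent."""
    return label.get(field)

def is_field(label, field):
    """Arc?field: True when the label record contains the field."""
    return field in label

# Label record of the arc called Example on the slide.
example = {"Group": "students"}
```

Here peek(example, "Group") gives 'students', peek(example, "Name") gives None, and is_field distinguishes the two cases explicitly.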
16
Definitions
• Page – a hypertree that has an associated URL
identifying it.
• Web – a collection of interrelated pages.
• Each external arc of a page is a link in the web.
• Schema – a web can optionally have a
distinguished page that provides an entry point to the
web.
17
• No schema: one must know the URL of one
or more pages
http://a.b.c./one.html
http://a.b.c./three.html
http://a.b.c./two.html
18
Web
WebOQL query
Web
New page
Schema
http://a.b.c./three.html
http://a.b.c./one.html
http://a.b.c./four.html
http://a.b.c./two.html
19
[Tag: “UL”/[Tag: “LI”, Text: “First Child”]+
[Tag: “LI”, Text: “Second Child”]+
[Tag: “LI”, Text: “Third Child”]]+
[Url: “http://a.b.c.”, Label: “Click Here”]
[Url: “http://a.b.c.”,
[Tag:UL]
[Tag:LI
Text:FirstChild]
Label: “Click Here”]
[ ]
[ ]
<UL>
<LI> First Child
<LI> Second Child
<LI> Third Child
</UL>
<A HREF=“http://a.b.c.”> Click Here </A>
20
[Url:http://a.b.c.
Label: Click here]
[Tag: LI
Text:First Child]
[Tag: LI
Text:Third Child]
[Tag: LI
Text:Second Child]
Tree representing an HTML document consisting of a
list and a hyperlink
•Trees are ordered
•Arcs are labeled with records rather than
atomic values
21
[group:Card]
[group:DBMS]
[group:ProgLang]
[Label:Abstract
Url: www…]
Paper Database CS papers
22
SELECT - FROM - WHERE
WebOQL uses this familiar query language construct as
the main construct of its queries.
Select [y.Label, y.URL]        (query to evaluate)
From x in example, y in x!     (definition of variables)
Where x.Seniority = 8          (a boolean condition)
23
SELECT - FROM - WHERE
For each instantiation of the variables in the from clause, check
the condition in the where clause; if it is true, evaluate the
query in the select clause and append the result to the output.
[Label: seminar in www.
URL: www…/s.html]
[Label: databases.
URL: www…/index.html]
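The evaluation rule above amounts to nested loops over the variable instantiations. The sketch below is hedged: the list-of-(label, subtree) representation is an assumption, the Seniority values and the nesting of the sample tree are hypothetical, and y is simply taken to range over the arcs found below x.

```python
# Hypothetical sample data: which links hang below which home page, and the
# Seniority values, are assumptions made for this example.
example = [
    ({"Label": "arik home page.", "URL": "www…/index.html", "Seniority": 8},
     [({"Label": "seminar in www.", "URL": "www…/s.html"}, []),
      ({"Label": "databases.", "URL": "www…/index.html"}, [])]),
    ({"Label": "moshe home page.", "URL": "www…/index.html", "Seniority": 3}, []),
]

def run_query(tree):
    result = []
    for x_label, x_subtree in tree:              # from x in example
        for y_label, _ in x_subtree:             # y ranges over the arcs below x
            if x_label.get("Seniority") == 8:    # where x.Seniority = 8
                record = {"Label": y_label.get("Label"),
                          "URL": y_label.get("URL")}
                result.append((record, []))      # select [y.Label, y.URL]
    return result
```

On this sample data the result is a tree with two arcs, one for the seminar page and one for the databases page.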
24
Select [y.title, y.publication]
From x in csPapers, y in x’
Missing data: where Publication is undefined, the field evaluates to null
25
• Compute a listing of the papers’ publication
data grouped by title.
Select [x.Title /
Select [z.Publication] from y in csPapers, z in y’
Where x.title = z.title ]
From w in csPapers , x in w’
26
• Schema – a distinguished hypertree
• Browsing function – maps strings (URLs)
to hypertree, it defines a graph where the
nodes are pages and there is an arc between
node a and b if the content of the page at
node a contains an external arc whose url
attribute is the url of the page at node b.
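The graph induced by the browsing function can be sketched as follows. This is an illustration under stated assumptions: the browsing function is approximated by a finite dict from URL to hypertree, a tree is a list of (label, subtree) pairs, and an arc is treated as external whenever its label has a Url field. The link structure of the sample web is hypothetical.

```python
def external_urls(tree):
    """URLs of all external arcs anywhere in a hypertree, in document order."""
    urls = []
    for label, subtree in tree:
        if "Url" in label:          # simplification: a Url field marks an external arc
            urls.append(label["Url"])
        urls.extend(external_urls(subtree))
    return urls

def link_graph(browse):
    """Adjacency lists: page a points to page b if a has an external arc to b."""
    return {a: [b for b in external_urls(tree) if b in browse]
            for a, tree in browse.items()}

# Hypothetical web of three pages.
web = {
    "http://a.b.c./one.html": [
        ({"Tag": "UL"}, [({"Label": "two", "Url": "http://a.b.c./two.html"}, [])]),
        ({"Label": "three", "Url": "http://a.b.c./three.html"}, []),
    ],
    "http://a.b.c./two.html": [({"Label": "one", "Url": "http://a.b.c./one.html"}, [])],
    "http://a.b.c./three.html": [],
}
```

Calling link_graph(web) gives, for each page, the pages reachable through one of its external arcs.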
27
Analogy with relational databases
• Hypertrees ↔ relations
• Webs ↔ databases
• Schema of a web ↔ catalog of a database
28
• Select [x.Tag]
From x in
browse(“http://www.cs.toronto.edu”)
[Tag:head]
[Tag : body]
29
• A select-from-where (SFW) query creates a web
• Select the titles and URLs of papers authored
by Smith.
Select [y.Title, y’.URL] as schema
From x in csPapers, y in x’
Where y.authors ~ “smith”
30
Queries
• Create a web page with URL “Group
Names” whose content is the list of group
names (assume that there is no such page in
the current web)
• Select [x.Group] as “Group Names” from x
in csPapers
31
Queries
• Create several pages, one for each research
group (using the group name as URL). Each
page contains the publications of the
corresponding group
• Select x’ as x.Group from x in csPapers
32
Data Model
• Records as Labels on Arcs
• Internal and External Arcs
[Tag: UL
Text: one of the…]
[Tag: H1,
Text: City Overview…]
[Tag: LI,
Text: If you are interested…]
[Tag: LI,
Text: One of the…]
[Tag: LI,
Text: All the hotels…]
[Label: Theatres Online,
Url: http://www…,
Base: http://www…,
Text: This page contains...]
[Tag: XYZ,
Text: One of the…]
[Tag: XYZ,
Text: If you are…]
[Tag: XYZ,
Text: Contains…]
[Label: All the Hotels,
Url: http://www…,
Base: http://www…,
Text: These are all…]
[Tag: XYZ,
Text: …]
[Label: Sports Zone,
Url: http://www…,
Base: http://www…,
Text: Sports Zone…]
33
Query: list elements containing “ticket”
doc := “http://www.citynet.com/overview.html”;
[Tag: “UL” /
Select y
from y in doc !’
where y’.text ~ “ticket”]
[Tag: UL]
[Tag: LI]
[Tag: LI]
[Label: Theatres Online,
Url: http://www…,
Base: http://www…,
Text: This page contains...]
[Tag: XYZ,
Text: One of the…]
[Tag: XYZ,
Text: If you are…]
[Tag: XYZ,
Text: …]
[Label: Sports Zone,
Url: http://www…,
Base: http://www…,
Text: Sports Zone…]
34
Web restructuring
Using these tree operators we have shown how a tree can
be restructured.
To restructure a web we need a function that maps
one web to another. The new web has some hypertree
as its schema, while its browsing function is an extension
of the old web’s browsing function: it also maps URLs that were
not previously mapped.
In WebOQL this is done with the as clause.
35
Web restructuring
In general, the select clause of WebOQL has the form:
Select q1 as s1, q2 as s2, …, qn as sn
Each si is either the keyword schema or a string-valued query.
An as clause whose si is the keyword schema defines the
schema of the web.
[Title: y.Group] as schema
Title: students
Title: professors
36
Web restructuring
In general, the select clause of WebOQL has the form:
Select q1 as s1, q2 as s2, …, qn as sn
Each si is either the keyword schema or a string-valued query.
An as clause whose si evaluates to a string defines a page,
and the string is used as the URL of that page.
[x.Name] as y.Group
students
[Name: moshe]
[Name: arik]
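The effect of the two as clauses on the last two slides can be imitated in Python. This is a sketch, not WebOQL's implementation: the input rows (name, group) stand in for the variable instantiations of a from clause the slides do not show, the professor's name is hypothetical, and collapsing repeated schema arcs is an assumption that corresponds to using select unique.

```python
def build_web(rows):
    """Return (schema, pages): the schema hypertree plus a dict URL -> page."""
    schema, pages = [], {}
    for name, group in rows:
        title_arc = ({"Title": group}, [])       # [Title: y.Group] as schema
        if title_arc not in schema:              # assumed 'unique' behavior
            schema.append(title_arc)
        # [x.Name] as y.Group: the group name is used as the page URL
        pages.setdefault(group, []).append(({"Name": name}, []))
    return schema, pages

rows = [
    ("moshe", "students"),
    ("arik", "students"),
    ("hypothetical_prof", "professors"),   # made-up name, for illustration only
]
schema, pages = build_web(rows)
```

The page stored under the URL "students" then holds the arcs [Name: moshe] and [Name: arik], and the schema holds one Title arc per group.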
37
Web restructuring
After a web is created there are two possibilities: either query
it further (restructure it) or return it to the host application.
To return the web to the host application so that it can be
displayed in a browser, we must format the
pages as valid HTML. This is done by
restructuring the web using HTML tags as labels.
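Once the labels carry HTML tags, turning a hypertree into markup is a simple recursive walk. The following is a minimal sketch under assumptions: a tree is a list of (label, subtree) pairs, an arc with a Url field is rendered as a hyperlink, and every tag is written as a paired tag (so non-paired tags such as BR would need special handling that is omitted here).

```python
from html import escape

def render(tree):
    """Render a hypertree whose labels use Tag/Text or Url/Label fields."""
    parts = []
    for label, subtree in tree:
        if "Url" in label:                       # external arc -> hyperlink
            url = escape(label["Url"], quote=True)
            text = escape(label.get("Label", ""))
            parts.append('<A HREF="' + url + '">' + text + "</A>")
        else:                                    # internal arc -> paired tag
            tag = label["Tag"]
            text = escape(label.get("Text", ""))
            parts.append("<" + tag + ">" + text + render(subtree) + "</" + tag + ">")
    return "".join(parts)

# The list-plus-hyperlink tree shown earlier in the slides.
page = [
    ({"Tag": "UL"}, [
        ({"Tag": "LI", "Text": "First Child"}, []),
        ({"Tag": "LI", "Text": "Second Child"}, []),
        ({"Tag": "LI", "Text": "Third Child"}, []),
    ]),
    ({"Url": "http://a.b.c.", "Label": "Click Here"}, []),
]
```

render(page) produces the UL list followed by the "Click Here" anchor, apart from the explicit closing LI tags that this sketch always emits.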
38
Document restructuring
Web documents are a typical example of semi-structured data,
since they do not have a fixed schema and can have various
irregularities. In an HTML document most of the tags may
appear any number of times or not at all.
WebOQL uses a wrapper that creates abstract syntax trees
(ASTs) from arbitrary HTML documents. This works
because the markup tags of HTML reflect the logical relationships
between the various information items.
Example:
<UL>
<LI> item 1. </LI>
<LI> item 2. </LI>
<LI> item 2. </LI>
</UL>
39
• Generate a web consisting of a page for
each research group, containing the title and
authors of all its publications, and an index
web page that lists all the groups and
provides links to their pages
newWeb := Select unique [Name: x.Group,
Url: x.Group] as schema,
[y.Title, y.Authors] as x.Group
From x in csPapers, y in x’
40
[Name: Card Punching
Url: Card Punching]
[Name:…
Url:..]
“As Schema”
[Name: Prog. Lang
Url: Prog.Lang..]
Card Punching
[Titles: Recent…
Authors: Smith]
Prog. Lang.
[Titles: Assembly Lan
[Titles: Cobol…
Authors: John,..]
Authors: James J]
[Titles: Arc…
Authors: Smith]
“As x. group”
41
NewerWeb := newWeb |
select [ Tag: “H3”, Text: y.Title ] +
[ Tag: “BR”, Text: y.Publication ] +
[ Tag: “BR”, Text: y.Authors ] +
[ Tag: “P” ]
as x.Name
from x in schema, y in x.Name
|
select [ Tag: “H2”, Text: “Publications of the ” *
x.Name * “ Group” ] + x.Name +
[ Tag: “A”, Label: “To Index”, Url:
“http://a.b.c/Index of Projects.html” ]
as “http://a.b.c/” * x.Name * “.html”
from x in schema
42
|
select [ Url: “http://a.b.c/Index of Projects.html” ]
as schema,
[ Tag: “H2”, Text: “Index of Projects” ] +
[ Tag: “UL” /
select [ Tag: “LI” /
[ Tag: “A”, Label: x.Name,
Url: “http://a.b.c/” * x.Name * “.html”
]]
from x in schema
] as “http://a.b.c/Index of Projects.html”
43
<H2> Index of Projects </H2>
<UL>
<LI> <A HREF = “http://a.b.c./cardpunching.html”>
Card Punching
</A>
</LI>
<LI> <A HREF = “http://a.b.c./programminglanguages.html”>
Programming Languages
</A>
</LI>
<LI> …..
</UL>
Index Page
44
<H2>Publications of the Card Punching group </H2>
<H3> Recent Discoveries in Card Punching </H3>
<BR> Technical Report TR015
<BR> Peter Smith, John Brown
<P>
<H3> Are Magnetic Media Better ? </H3>
<BR> ACM TOCP Vol 3 No. (1942) pp. 23-37
<BR> Peter Smith, John Brown
<P>
<A HREF=“http://a.b.c./IndexnProject.html”>
To index
</A>
Group Pages
45
Document restructuring
Navigation patterns:
In the examples so far, the variables used in the queries
ranged over the simple trees of the tree being queried. On the
WWW, however, variables may range over several linked subtrees whose
structure is not fully known to us.
select [x.text]
from x in “someone’s.html”
via ^*[Tag = “H2”]
^ - record predicate that is true for every internal arc.
[Tag = “H2”] - record predicate that is true for every
arc that has an ‘H2’ tag.
46
Document restructuring
Navigation patterns:
In the examples so far, the variables used in the queries
ranged over the simple trees of the tree being queried. On the
WWW, however, variables may range over several linked subtrees whose
structure is not fully known to us.
select [x.text]
from x in “someone’s.html”
via >*[not(Tag = “H2”)]
> - record predicate that is true for every external arc.
[not(Tag = “H2”)] - record predicate that is true for every
arc that does not have an ‘H2’ tag.
47
Document restructuring
Navigation patterns:
When the navigation pattern is omitted, the query is treated
as if it had a navigation pattern that always evaluates to
true.
Variables are instantiated left to right in either depth-first or
breadth-first order. Depth-first is the default; to use
breadth-first, the keyword viabfs is used instead of via.
48
Navigation Pattern
[not (Tag = “A”)]* - paths of any length composed of arcs that do not
have a Tag attribute with value “A”.
[Tag = “LI”] [Tag = “A”] - paths of length 2
^*> - all paths in a tree that lead from the root to an external arc
Select [x.url]
from x in “http://a.b.c./index.html”
via [not (Tag = “Table”)]*>
Returns all the external arcs in the document at the given
URL that do not occur within a table
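The last pattern can be read as a depth-first, left-to-right walk that refuses to descend through Table arcs and stops at external arcs. The sketch below hard-codes that one pattern rather than implementing a general pattern matcher. It assumes a tree is a list of (label, subtree) pairs, that an arc with a Url field is external, and that navigation stays inside the one document; the sample document is hypothetical.

```python
def links_outside_tables(tree):
    """Match [not (Tag = "Table")]* > starting at the root of the tree."""
    found = []
    for label, subtree in tree:
        if "Url" in label:                    # '>': an external arc ends the path
            found.append(label["Url"])
        elif label.get("Tag") != "Table":     # extend the path through a non-Table arc
            found.extend(links_outside_tables(subtree))
    return found

# Hypothetical document: one link in a list, one in a table, one at top level.
doc = [
    ({"Tag": "UL"}, [
        ({"Tag": "LI"}, [({"Label": "one", "Url": "http://a.b.c./one.html"}, [])]),
    ]),
    ({"Tag": "Table"}, [({"Label": "two", "Url": "http://a.b.c./two.html"}, [])]),
    ({"Label": "three", "Url": "http://a.b.c./three.html"}, []),
]
```

Only the links to one.html and three.html are returned, since the link to two.html sits below a Table arc.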
49
Select [x.url,x.text]
From x in “http://a.b.c./root.html”
Via (^*[Labled “Next’’]>)*
What will this query produce?
50
[Tag: H3,
Text: Price…]
[Tag: H3,
Text: Price…]
[Tag: UL]
[Tag: LI]
[Tag: LI]
[Tag: UL]
[Tag: LI]
Select X ! &
From X in “http://a.b.c./large.html”
via ^* [Tag = “H3”]
Where X!.Tag=“UL” and X.Text ~ “Price”
51
[Tag: H2,
Text: Publications of the]
[Tag: H3,
Text:]
[Tag: BR,
Text:]
[Tag: BR, [Tag: P,
Text: y] Text: ]
[Tag: H3,
Text:]
[Tag: P,
Text: ]
[Tag: BR,
Text: y]
[Tag: BR,
Text:]
[Label: To index,
Url:
Base:
http://a.b.c./cardpunching.html,
Text: indexofprojects]
Tree generated by Query
[Tag: “OL” / Select [Tag: “LI” / X&3]
from X in “http://a.b.c./cardpunching.html”!
where X.Tag = “H3”]
[Tag: OL]
[Tag: LI]
[Tag: H3]
[Tag: LI]
52
[Tag: “OL” / Select [Tag: “LI” /
Select y
from y in X while not y.Tag = “P”]
From X in “http://a.b.c./IrregularDoc.html”!
where X.Tag = “H3”
]
53
select [x.proj name, x.proj descr] as “projects”
[x.emp name, x.emp phone] as “people”
[x.proj name] as “x.proj name”
[x.emp name] as “x.emp name”
From x in “SQLDb. Select proj name, emp name,
emp phone, proj descr from proj, emp, worksin
where Emp.id = worksIn.empid and
proj.id = worksIn.projId;”
Project web
Generate a web containing a page for each project, a page for each
person and two index pages listing all the projects and all the
people. A person’s page contains pointers to the projects in which
he or she is involved, and a project page contains pointers to the
pages of the people involved in it.
54
[Tag: UL,
Text: …]
….
[Tag: H2,
Text: Card Punching…]
[Tag: UL,
Text: Recent…]
[Tag: LI,
Text: Recent…]
[Tag: H2,
Text: Programming…]
[Tag: LI,
Text: Are Magnetic…]
….
[Tag: CITE,
Text: Are Magnetic…]
[Tag: XYZ,
Text: Are Magnetic]
[Tag: BR,
Text: ]
[Tag: B,
Text: Peter Smith…]
[Tag: H2,
Text: Databases…]
[Tag: UL,
Text: Cobol in AI Sam James…]
[Tag: LI,
Text: Cobol in…]
….
[Tag: BR,
Text: ]
[Tag: LI,
Text: Assembly for…]
….
[Tag: BR,
Text: ]
[Label: Full Version,
Url: http://www…/paper2.ps.z,
Base: http://www…/cspapers.html,
[Label: Abstract,
Url: http://www…/abstr2.html, Text: 1k098k79…]
Base: http://www…/cspapers.html,
Text: Are Magnetic Media…]
[Tag: BR,
Text: ACM TOCP Vol. 3 No. (1942) pp 23-37]
55
Select [Title: y”.Text, Authors: y”!!.Text]
From x in “http://www.a.b.c./paper.html”, y in x’
Where x.Tag = “UL”
Retrieves the title and authors of each paper;
x ranges over simple trees and y over the elements
under UL
56
Select [title: y”.Text,
authors: y”!!.Text,
Publications: y”!3.Text,
ps-url: y’!4.Url,
abstract-url: y’!!.Url]
as “pubsdb: insert”
From X in “http://www.a.b.c./paper.html”,
y in X!’
Where X.Tag = “H2”
57
[Tag: H1,
Text: Reports in …]
[Tag: HR,
Text:]
[Tag: H2,
Text: David Rice]
[Tag: H2,
Text: John Smith]
[Tag: CITE,
Text: Indexing]
[Tag: BR,
Text:]
[Tag: BR,
Text: ]
[Tag: CITE,
Text:Efficient]
[Tag: P,
Text: ]
[Tag: XYZ,
Text:CS-TR-0327..]
[Label: Indexing Sound,
Url: http://www…/pl.ps.gz,
Base: http://www…./trs.html,
Text: ;sd..sGhj&9870….]
[Label:Abstract Available Online,
Url: http://www…/pl.html,
Base: http://www…./trs.html,
Text: Indexing Sound….]
…
[Tag: HR,
Text:]
[Tag: XYZ,
Text:CS-TR-0120..]
[Tag: P,
Text: ]
[Tag: BR,
Text:]
[Tag: XYZ,
Text:CS-TR-0029..]
[Label: Efficient Clustering….,
Url: http://www…/p2.ps.gz,
Base: http://www…./trs.html,
Text: .fHjs*9))fujs…….]
[Label:Temporal Constraints,
Url: http://www…/p3.ps.gz,
Base: http://www…./trs.html,
Text: ;+-9ivm27&813nd….]
58
Select [title: Y.Text,
author: X.Text,
publications: Y!!.Text,
PS-Url: Y’.Url,
abstract-url: Y!4.Url
] as “PubsDb: insert”
From X in “http://www.x.y.z./papers.html”,
Y in X! while not (Y.Tag = “HR”)
where X.Tag = “H2”
and Y.Tag = “CITE”
59
[Tag: UL,
Text: …]
….
X
[Tag: H2,
Text: Card Punching…]
[Tag: UL,
[Tag: H2,
Text: Recent…] Text: Programming…]
y
y’
y”
[Tag: XYZ,
Text: Recent….]
[Tag: LI,
[Tag: LI,
Text: Are Magnetic…]
Text: Recent…]
….
[Tag: CITE,
Text: Recent…]
[Tag: BR,
Text: ]
[Tag: BR,
Text: ]
[Tag: BR,
[Tag: BR,
Text: Technical…….]
Text: ] [Tag: B,
Text: Peter Brown…]
[Tag: H2,
Text: Databases…]
[Tag: UL,
Text: Cobol in AI Sam James…]
[Tag: LI,
[Tag: LI,
Text: Assembly for…]
Text: Cobol in…]
….
….
[Label: Full Version,
Url: http://www…/paperl.ps.z,
Base: http://www…/cspapers.html,
Text: #hH6YiaP….]
[Label: Abstract,
Url: http://www…/abstrl.html,
Base: http://www…/cspapers.html,
Text: It is company…]
Figure 5.6 Instantiation of Variables in Query 4
60
Query 4:
csPapers
select[Group: X.Text /
select[Title: y”.Text ,
Authors: y”!!.Text,
Publication:y”!3.Text/
[Label: “Abstract”,Url:y’!!.Url]+
[Label: “Full Version”,Url:y’!4.Url]
]
from y in X!’
]
from X in “http://www.a.b.c./papers.html”
where X.Tag = “H2”
61
Architecture
query
web
API
Query Engine
URL
tree
Wrapper Manager
Wrapper
DBMS
Wrapper
File
System
Wrapper
Web1
Wrapper
...
Web k
62
• Each node corresponds either to a
subdocument enclosed in an occurrence of
a paired tag (for example, the root node
corresponds to the subdocument enclosed
between <html> and </html>) or to a
subdocument enclosed between an occurrence of
a non-paired tag and the tag that follows it.
• Arcs leading to nodes that correspond to an
<a> tag whose associated URL uses the http
protocol are external. All
other arcs are internal.
63
• The incoming arc to a node contains the attributes
of the subdocument represented by this node.
• Internal arcs are labeled with a record containing
two fields: Tag and Text.
• Tag is the HTML tag corresponding to the subtree
that is the destination of the arc.
• The value of Text depends on whether Tag is
paired or non-paired.
• If Tag is paired, the value of Text is the text
enclosed between <Tag> and </Tag>, excluding
markup.
• If Tag is non-paired, the value of Text is the text
between <Tag> and the tag that comes after it in the
document.
64
• External arcs are labeled with a record containing
four fields: Label, Url, Base and Text.
• Label is the label of the hyperlink, i.e. the text
enclosed between the <a href=…> and </a> tags;
Url is the value of the href attribute; Base is the URL
of the document being processed; and Text is the
text of the referred document, excluding markup.
• A dummy tag named <xyz> is used to enclose
pieces of text that are not explicitly tagged.
• Rules are applied recursively to the text inside
occurrences of paired tags.
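A simplified version of these wrapper rules can be sketched with Python's standard html.parser module. This is not the WebOQL wrapper: it assumes well-formed input in which every paired tag is closed, treats only BR and HR as non-paired, omits the xyz rule and the Text field of external arcs (which would require fetching the referred document), and lets tags nested inside a link hang below it. The class name, the wrap helper and the tree representation (lists of (label, subtree) pairs) are all assumptions.

```python
from html.parser import HTMLParser

NON_PAIRED = {"br", "hr"}   # assumption: a small fixed set of non-paired tags

class Wrapper(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.root = []               # hypertree: list of (label, subtree) arcs
        self.stack = [self.root]     # subtree lists of the currently open tags
        self.open_labels = []        # label records of the currently open tags
        self.pending = None          # label of the last non-paired tag, if any

    def handle_starttag(self, tag, attrs):
        self.pending = None
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("http:"):
            label = {"Label": "", "Url": href, "Base": self.base}  # external arc
        else:
            label = {"Tag": tag.upper(), "Text": ""}               # internal arc
        subtree = []
        self.stack[-1].append((label, subtree))
        if tag in NON_PAIRED:
            self.pending = label     # collects the text up to the next tag
        else:
            self.stack.append(subtree)
            self.open_labels.append(label)

    def handle_endtag(self, tag):
        self.pending = None
        if tag not in NON_PAIRED and self.open_labels:
            self.stack.pop()
            self.open_labels.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        targets = list(self.open_labels)     # text counts for every enclosing tag
        if self.pending is not None:
            targets.append(self.pending)
        for label in targets:
            key = "Label" if "Url" in label else "Text"
            label[key] = (label[key] + " " + text).strip()

def wrap(html, base):
    """Parse an HTML string into a hypertree, with base as the document URL."""
    wrapper = Wrapper(base)
    wrapper.feed(html)
    wrapper.close()
    return wrapper.root

tree = wrap('<ul><li>item 1.</li><li>see <a href="http://a.b.c./one.html">one</a></li></ul>',
            "http://a.b.c./index.html")
```

In the resulting tree the UL arc carries the text of the whole list, each LI arc carries its own text, and the link becomes an arc with Label, Url and Base fields.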
65
<HTML>
<H1> Publications of Research Groups at CS
Dept</H1>
<H2> Card Punching </H2>
<UL>
<LI>
<CITE> Recent Advances in Card
Punching <BR>
<B> Peter Smith, John Brown</B><BR>
Technical Report TR015</CITE><BR>
<A HREF=“http://../abstract.html”> Abstract
</A><BR>
66
<A HREF=“http://../paper.ps.Z”> Full version</A>
</LI>
<LI>
<CITE> Are Magnetic Media Better?<BR>
<B> Peter Smith, John Brown, Tom</B><BR>
ACM TOCP Vol. 3, No. , pp</CITE><BR>
<A HREF=“HTTP://../abst2.html”>
Abstract</A><BR>
<A HREF=“http://../paper2.ps.Z”> Full version</A>
</LI>
</UL>
<H2> Programming lang</H2>
67