Web Data Management
COSC 4806
Introduction
The ‘world wide web’
a vast, widely distributed collection of
semi-structured multimedia documents
heterogeneous collection of documents
documents in the form of web pages
documents connected via hyperlinks
World Wide Web
The web is growing rapidly
Business organizations increasingly
presenting information on the Web
‘Business on the highway’
Myriad of raw data to be processed
for information
World Wide Web
The web is a fast growing, distributed &
non-administered global information
resource
WWW allows access to text, images, video,
sound and graphical data
Ever-increasing number of businesses
building web servers
A chaotic environment to locate information
of interest
Lost in hyperspace syndrome
World Wide Web
Characteristics of the WWW:
it’s a set of directed graphs
data is heterogeneous, self-describing &
schema less
unstructured, deeply nested information
no central authority for information
management
dynamic information vs. static information
web information discovery – search engines
World Wide Web
Rapid growth of web:
In 1994, the WWW grew by 1,758%
June 1993 - 130
June 1994 - 1265
Dec. 1994 - 11,576
April 1995 - 15,768
July 1995 - 23,000+
January 2005 – 11.5 billion publicly indexed web pages
World Wide Web
.com domains on the rise, as of July
2006:
76,683,115 hosts for ‘com’ domains
10,232,188 hosts for ‘edu’ domains
185,919,955 hosts for ‘net’ domains
727,773 hosts for ‘gov’ domains
1,933,551 hosts for ‘mil’ domains
1,660,470 hosts for ‘org’ domains
World Wide Web
The exponential growth of the Internet is
reflected in the number of hosts on the net
1,000 in 1984
10,000 in 1987
100,000 in 1989
1,000,000 in 1992
10,000,000 in 1996
100,000,000 in 2000
171,638,297 in 2003
489,774,269 in July 2007
Net Timeline (http://www.pbs.org/internet/timeline/)
Internet Domain Survey (http://www.isc.org/ds/)
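As a rough check on the "exponential growth" claim, the host counts above can be turned into an implied doubling time. This is a minimal sketch, assuming smooth exponential growth between two data points; the function name is illustrative and not part of the original slides.

```python
import math

def doubling_time_years(count_start, year_start, count_end, year_end):
    """Doubling time implied by two data points, assuming pure exponential growth."""
    growth_factor = count_end / count_start
    years = year_end - year_start
    return years * math.log(2) / math.log(growth_factor)

# Host counts taken from the list above.
early = doubling_time_years(1_000, 1984, 100_000_000, 2000)
late = doubling_time_years(100_000_000, 2000, 489_774_269, 2007)
print(f"1984-2000: hosts doubled roughly every {early:.2f} years")
print(f"2000-2007: hosts doubled roughly every {late:.2f} years")
```

Under this assumption the doubling time is close to one year up to 2000 and about three years afterwards, which suggests the growth rate slowed even though absolute numbers kept rising.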
World Wide Web
Distribution of hosts (worldwide)
US - 195,138,696
European Union - 22,000,414
Japan - 21,304,292
Germany - 7,657,162
Netherlands - 6,781,729
South Korea - 5,433,591
Australia - 5,351,622
UK - 4,688,307
Brazil - 4,392,693
Taiwan - 3,838,383
World Wide Web
Popular search methods
email - 77%
Search engine - 63%
Get news - 46%
Job related search - 29%
Instant messaging - 18%
Online banking - 18%
Chat room - 8%
Travel reservation - 5%
Read blogs - 3%
Online auction - 3%
World Wide Web
Key limitations of search engines:
do not exploit hyperlinks
search limited to string matching
queries evaluated on archived data
rather than up-to-date data; no indexing
on current data
low accuracy; replicated results
no further manipulation possible
World Wide Web
Key limitations of search engines
(contd.):
ERROR 404!
No efficient document management
Query results cannot be further
manipulated
No efficient means for knowledge
discovery
World Wide Web
more issues..
specifying/understanding what information is
wanted
the high degree of variability of accessible
information
the variability in conceptual vocabulary or
“ontology” used to describe information
complexity of querying unstructured data
World Wide Web
contd.
complexity of querying structured data
uncontrolled nature of web-based
information content
determining which information sources
to search/query
World Wide Web
Search Engines capabilities:
Selection of language
Keywords with disjunction, adjacency, presence,
absence, ...
Word stemming (Hotbot)
Similarity search (Excite)
Natural language (LycosPro)
Restrict by modification date (Hotbot) or range of dates
(AltaVista)
Restrict result types (e.g., must include images) (Hotbot)
Restrict by geographical source (content or domain)
(Hotbot)
Restrict within various structured regions of a document
(titles or URLs) (LycosPro); (summary, first heading, title,
URL) (Opentext)
World Wide Web
Search & Retrieval..
Search engine - % web covered
Hotbot - 34
AltaVista - 28
Northern Light - 20
Excite - 14
Infoseek - 10
Lycos - 3
Using several search engines is better
than using only one
World Wide Web
Schemes to locate information:
Supervised links between sites
ask at the reference desk
Gopher (Univ. of Minnesota): menu format with links
both to sites and content
Classification of documents
search in the catalog
Archie (McGill Univ.): system to automatically gather,
index and serve information from all anonymous FTP
sites
Automated searching
wander around the library
Use META tags to gather metadata
Spiders (robots, web-crawlers)
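The spider idea can be sketched as a breadth-first traversal of the link graph. The example below is a sketch only: it runs on a hypothetical in-memory "web" (TOY_WEB) instead of fetching real pages, and the function and page names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of URLs it links to.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first traversal from seed; returns pages in visiting order."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in fetch_links(page):
            if target not in seen:  # never enqueue a page twice
                seen.add(target)
                queue.append(target)
    return visited

order = crawl("a.html", lambda url: TOY_WEB.get(url, []))
print(order)
```

A real crawler would replace the lambda with a function that downloads a page and extracts its links, and would also respect robots exclusion rules and rate limits.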
World Wide Web
Popular search engines..
Year 2000
Year 2001
AltaVista
Yahoo
HotBot
Google
NorthernLight
AltaVista
World Wide Web
Boolean search in AltaVista..
World Wide Web
Specifying field content in HotBot..
World Wide Web
Natural language interface in AskJeeves
World Wide Web
Examples of search strategies:
Rank web pages based on popularity
Rank web pages based on word
frequency
Match query to an expert database
The major search engines use a
mixed strategy
World Wide Web
Frequency based ranking:
Library analogue: Keyword search
Basic factors in HotBot ranking of pages:
- words in the title
- keyword meta tags
- word frequency in the document
- document length
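The four factors listed above can be combined into a toy scoring function. The slides do not give HotBot's actual formula, so the weights and the length normalization below are assumptions chosen only to show how such factors might interact.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def frequency_score(query, title, meta_keywords, body,
                    title_weight=3.0, meta_weight=2.0):
    """Toy score: weighted title/meta hits plus length-normalized body frequency."""
    title_words = set(tokenize(title))
    meta_words = set(tokenize(meta_keywords))
    body_words = tokenize(body)
    score = 0.0
    for term in tokenize(query):
        if term in title_words:
            score += title_weight
        if term in meta_words:
            score += meta_weight
        if body_words:
            # Dividing by document length keeps long pages from winning by size alone.
            score += body_words.count(term) / len(body_words)
    return score

# Hypothetical pages: name -> (title, meta keywords, body text).
pages = {
    "wine-guide": ("French wine guide", "wine, france",
                   "Wine regions of France and the wine they produce"),
    "linux-faq": ("Linux FAQ", "linux",
                  "Installing Linux; nothing about wine here"),
}
ranked = sorted(pages, key=lambda name: frequency_score("wine", *pages[name]),
                reverse=True)
print(ranked)
```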
World Wide Web
Alternative word frequency measures:
Excite uses a thesaurus to search for what you
want, rather than what you ask for
AltaVista allows you to look for words that
occur within a set distance of each other
NorthernLight weighs results by search term
sequence, from left to right
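The proximity condition described for AltaVista (words within a set distance of each other) can be sketched as a check on word positions. This is an illustrative sketch, not AltaVista's implementation; the function name and the definition of distance as a difference of word positions are assumptions.

```python
import re

def within_distance(text, word_a, word_b, max_gap):
    """True if word_a and word_b occur at most max_gap word positions apart."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= max_gap for i in positions_a for j in positions_b)

sample = "the best wines come from the south of France"
print(within_distance(sample, "wines", "France", 10))
print(within_distance(sample, "wines", "France", 3))
```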
World Wide Web
Popularity based ranking:
Library analogue: citation index
The Google strategy for ranking pages:
- Rank is based on the number of links to a
page
- Pages with a high rank have a lot of other
web pages that link to it
- The formula is on the Google help page
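Link-based popularity can be sketched in two steps: counting in-links, as the slide describes, and a simplified textbook PageRank iteration in which a link from a highly ranked page counts for more. The slides do not reproduce Google's formula, so the damping factor, iteration count, and toy graph below are assumptions.

```python
def inlink_counts(links):
    """links: page -> list of pages it links to. Returns page -> number of in-links."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] = counts.get(target, 0) + 1
    return counts

def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site.
toy_links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],
}
counts = inlink_counts(toy_links)
rank = pagerank(toy_links)
print(counts)
print(max(rank, key=rank.get))
```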
World Wide Web
More on popularity ranking:
The Google philosophy is also applied by
others, such as NorthernLight
HotBot measures popularity of a page by
how frequently users have clicked on it
in past search results
World Wide Web
Expert Databases, Yahoo
An expert database contains predefined
responses to common queries
A simple approach is subject directory, e.g. in
Yahoo!, which contains a selection of links for
each topic
The selection is small, but can be useful
Library analogue: Trustworthy references
World Wide Web
Expert Databases, AskJeeves
AskJeeves has predefined responses to
various types of common queries
These prepared answers are augmented
by a meta-search, which searches other
SEs
Library analogue: Reference desk
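The combination of prepared answers and meta-search can be sketched as follows. This is not AskJeeves' actual mechanism: the reciprocal-rank merging rule, the stub engines, and all names below are assumptions made for illustration.

```python
def meta_search(query, prepared_answers, engines):
    """Return prepared answers first, then merged results from other engines."""
    results = list(prepared_answers.get(query.lower().strip(), []))
    scores = {}
    for engine in engines:
        for position, url in enumerate(engine(query)):
            # A result ranked higher by an engine contributes a larger score.
            scores[url] = scores.get(url, 0.0) + 1.0 / (position + 1)
    merged = sorted(scores, key=lambda url: (-scores[url], url))
    return results + [url for url in merged if url not in results]

# Hypothetical stub engines returning ranked result lists.
def engine_a(query):
    return ["x.example", "y.example"]

def engine_b(query):
    return ["y.example", "z.example"]

prepared = {"best wines in france": ["wine-faq.example"]}
answer = meta_search("Best wines in France", prepared, [engine_a, engine_b])
print(answer)
```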
World Wide Web
Example, best wines in France; AskJeeves
World Wide Web
Best wines in France; HotBot
World Wide Web
Best wines in France; Google
World Wide Web
Linux in Iceland; Google
World Wide Web
Linux in Iceland; HotBot
World Wide Web
Linux in Iceland; AskJeeves
Web Data Management
Web Data Management; key objectives
Design a suitable data model to represent web
information
Development of web algebra and query
language, query optimization
Maintenance of Web data - view maintenance
Development of knowledge discovery and web
mining tools
Web warehouse
Data integration, secondary storages, indexes
Web Data Management
Limitations of the web..
Applications cannot consume HTML
HTML wrapper technology is brittle
Companies merge, need interoperability
Web Data Management
Paradigm Shift
New Web standards – XML
XML generated by applications and
consumed by applications
Data exchange
- Across platforms: enterprise interoperability
- Across enterprises
Web : from documents to data
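The "generated by applications and consumed by applications" idea can be sketched as a round trip: one function serializes records to XML and another parses them back. The element and attribute names (movies, movie, title, year) are hypothetical.

```python
import xml.etree.ElementTree as ET

def to_xml(movies):
    """Producer side: serialize a list of dicts into an XML string."""
    root = ET.Element("movies")
    for movie in movies:
        element = ET.SubElement(root, "movie", year=str(movie["year"]))
        ET.SubElement(element, "title").text = movie["title"]
    return ET.tostring(root, encoding="unicode")

def from_xml(document):
    """Consumer side: parse the XML string back into a list of dicts."""
    root = ET.fromstring(document)
    return [{"title": element.findtext("title"), "year": int(element.get("year"))}
            for element in root.findall("movie")]

records = [{"title": "Movie A", "year": 2001}]
document = to_xml(records)
print(document)
print(from_xml(document))
```

Because the structure travels with the data as tags, the consumer does not need an HTML wrapper to recover the records.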
Web Data Management
Database challenges:
Query optimization and processing
Views and transformations
Data warehousing and data integration
Mediators and query rewriting
Secondary storages
Indexes
Web Data Management
DBMS needs paradigm shift too
Web data differs from database data
- self-describing, schemaless
- structure changes without notice
- heterogeneous, deeply nested, irregular
- documents and data mixed
- designed by document experts, not DB experts
- need Web Data Management
Web Data Management
Web data representation
HTML - Hypertext Markup Language
- fixed grammar, no regular expressions
- Simple representation of data
- good for simple data and intended for human
consumption
- difficult to extract information
SGML - Standard Generalized Markup Language
- good for publishing deeply structured document
XML - Extensible Markup Language
- a subset of SGML
Web Data Management
Terminology
HTML - Hypertext Mark-up Language
HTTP - Hypertext Transfer Protocol
URL - Uniform Resource Locator
example: <URL> := <protocol>://<host>/<path>/<filename>[<#location>] where
- <protocol> is http, ftp, gopher
- host is internet address …
- #location is a textual label in the file
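The URL components named above can be separated with Python's standard library. The URL is a made-up example; `urlsplit` calls the protocol the "scheme" and the #location part the "fragment".

```python
import posixpath
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.org/docs/intro/index.html#section2")
directory, filename = posixpath.split(parts.path)

print("protocol:", parts.scheme)
print("host:    ", parts.netloc)
print("path:    ", directory)
print("filename:", filename)
print("location:", parts.fragment)
```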
Web Data Management
Prevalent, persistent and informative
HTML documents (now XML) created by
humans or applications
Accessed day in and day out by Humans
and Applications
Persistent HTML documents
Can database technology help?
Web Data Management
Some recent research projects
Web Query System
- W3QS, WebSQL, AKIRA, NetQL, RAW,
WebLog, Araneus
Semi structured Data Management
- LOREL, UnQL, WebOQL, Florid
Website Management System
- STRUDEL, Araneus
Web Warehouse
- WHOWEDA
Web Data Management
Main tasks..
Modeling and Querying the Web
- view web as directed graph
- content and link based queries
- example - find the pages that contain the
word “Clinton” and are linked from a page
containing the word “Monica”
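A query that mixes content and link conditions can be sketched over a small directed graph. The page texts, link table, and function name below are hypothetical; real systems such as WebSQL express this declaratively, not as a loop.

```python
import re

def words_of(text):
    """Set of lowercase alphanumeric words in a page's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def pages_linked_from(pages, links, source_word, target_word):
    """Pages containing target_word that are linked from a page containing source_word."""
    hits = set()
    for source, targets in links.items():
        if source_word.lower() in words_of(pages.get(source, "")):
            for target in targets:
                if target_word.lower() in words_of(pages.get(target, "")):
                    hits.add(target)
    return sorted(hits)

# Hypothetical page contents and hyperlinks (source -> targets).
toy_pages = {
    "p1": "Monica gave an interview",
    "p2": "President Clinton spoke today",
    "p3": "Clinton library opening hours",
    "p4": "weather report",
}
toy_links = {"p1": ["p2", "p4"], "p4": ["p3"]}

result = pages_linked_from(toy_pages, toy_links, "Monica", "Clinton")
print(result)
```

Page p3 also mentions the target word, but it is excluded because no page containing the source word links to it. That is the part of the query a plain keyword search cannot express.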
Web Data Management
Main tasks contd.
Information Extraction and integration
- wrapper - program to extract a structured
representation
of the data; a set of tuples from HTML pages.
- mediator: integration of data - software that accesses
multiple sources from a uniform interface
Web Site Construction and Restructuring
- creating sites
- modeling the structure of web sites
- restructuring data
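The wrapper described above, a program that turns an HTML page into a set of tuples, can be sketched with the standard `html.parser` module. The table layout and movie data are hypothetical; the sketch also hints at why wrappers are brittle, since any change in the page's markup changes what is extracted.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collects the cell text of each <tr> as one tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

def extract_tuples(html):
    """Run the wrapper over an HTML string and return the extracted tuples."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows

page = ("<table><tr><th>Title</th><th>Time</th></tr>"
        "<tr><td>Movie A</td><td>20:30</td></tr></table>")
tuples = extract_tuples(page)
print(tuples)
```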
Web Data Management
What to model?
Structure of Web sites
Internal structure of web pages
Contents of web sites in finer granularities
Web Data Management
Data representation of Web data
Graph Data Models
Semi structured Data Models (also graph
based)
Web Data Management
Graph data model
Labeled graph data model where nodes
represent web pages & arcs represent
links between pages
Labels on arcs can be viewed as
attribute names
Regular path expression queries
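A regular path expression query over a labeled graph can be sketched with a small evaluator. It supports only a sequence of labels, each optionally starred (zero or more repetitions), which is a deliberate simplification of full regular expressions; the graph and labels are hypothetical.

```python
def eval_path(graph, start, pattern):
    """graph: node -> list of (label, target) arcs.
    pattern: list of (label, starred) steps.
    Returns the nodes reachable from start along a path matching pattern."""
    seen = set()
    stack = [(start, 0)]
    results = set()
    while stack:
        node, step = stack.pop()
        if (node, step) in seen:
            continue
        seen.add((node, step))
        if step == len(pattern):
            results.add(node)
            continue
        label, starred = pattern[step]
        if starred:
            stack.append((node, step + 1))  # zero repetitions of this label
        for arc_label, target in graph.get(node, []):
            if arc_label == label:
                # A starred step may repeat, so the step index stays the same.
                stack.append((target, step if starred else step + 1))
    return results

# Hypothetical site graph: arcs are labeled like attribute names.
toy_graph = {
    "home": [("dept", "cs")],
    "cs": [("course", "db2")],
    "db2": [("prereq", "db1")],
    "db1": [("prereq", "intro")],
}

# dept.course.prereq* : a course and all of its direct and indirect prerequisites
reachable = eval_path(toy_graph, "home",
                      [("dept", False), ("course", False), ("prereq", True)])
print(sorted(reachable))
```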
Web Data Management
Semi structured data models
Irregular data structure, no fixed schema
known and may be implicit in the data
Schema may be large and may change
frequently
Schema is descriptive rather than prescriptive;
describes current state of data, but violations of
schema still tolerated
Web Data Management
Semi structured data models
Data is not strongly typed; for different objects
the values of the same attributes may be of
differing types. (heterogeneous sources)
No restriction on the set of arcs that emanate
from a given node in a graph or on the types of
the values of attributes
Ability to query the schemas; arc variables
which get bound to labels on arcs, rather than
nodes in the graph
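These properties can be illustrated with nested Python dictionaries standing in for a labeled graph. The data and function name are hypothetical. The sketch shows an attribute with differing types, an attribute present on only some objects, and queries where the variable ranges over arc labels instead of node values.

```python
def arcs(obj, path=()):
    """Yield (path_of_labels, value) for every arc in a nested dict/list structure."""
    if isinstance(obj, dict):
        for label, child in obj.items():
            yield path + (label,), child
            yield from arcs(child, path + (label,))
    elif isinstance(obj, list):
        for child in obj:
            yield from arcs(child, path)

# Hypothetical irregular data: "phone" has two types, "email" appears only once.
people = [
    {"name": "Ann", "phone": 5551234},
    {"name": "Bob", "phone": "555-9876", "email": "bob@example.org"},
]

# Query the schema: which labels occur at all?
labels = {path[-1] for path, _ in arcs(people)}

# Which types does the "phone" attribute take?
phone_types = {type(value).__name__ for path, value in arcs(people)
               if path[-1] == "phone"}

# Arc variable: which label leads to the value "Ann"?
labels_to_ann = {path[-1] for path, value in arcs(people) if value == "Ann"}

print(labels, phone_types, labels_to_ann)
```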
Web Data Management
Graph based Query Languages
Use graph to model databases
Support regular path expressions and
graph construction in queries.
Examples
- GraphLog for hypertext queries
- graph query language for OO
Web Data Management
Query languages for semi structured
data:
Use labeled graphs
Query the schema of data
Ability to accommodate irregularities in the
data, such as missing links etc.
Examples : Lorel (Stanford) , UnQL (AT&T),
STRUQL (AT&T)
Web Data Management
Comparing Query Systems
Web Data Management
Types of Query Languages
First Generation
Second Generation
Web Data Management
First Generation Query languages
Combine the content-based queries of search
engines with structure-based queries
Combine conditions on text pattern in
documents with graph pattern describing link
structures
Examples –
- W3QL (TECHNION, Israel), WebSQL
(Toronto), WebLOG (Concordia)
Web Data Management
Second Generation Query languages
Called web data manipulation languages
Go beyond first-generation languages, which treat web
pages as atomic objects that either contain or do not
contain certain text patterns and point to other objects
Useful for data wrapping, transformation, and
restructuring
Useful for web site transformation and
restructuring
Web Data Management
How do they differ?
Provide access to the structure of web objects they
manipulate - return structure
Model internal structures of web documents as well
as the external links that connect them
Support references to model hyperlinks and some
support to ordered collections of records for more
natural data representation
Ability to create new complex structures as a result
of a query
Web Data Management
Examples..
WebOQL
STRUQL
Florid
Web Data Management
Information Integration
To answer queries that may require extracting
and combining data from multiple web sources
Example - Movie database; data about movies,
their star casts, directors, schedules etc.
Give me the playing times and reviews of
movies starring Frank Sinatra that are playing
tonight in Paris
Web Data Management
Approaches
Web warehouse – Data from multiple web sources is
loaded into a warehouse, all queries are applied to
warehouse data
- Disadvantage - Warehouse needs to be updated when
data sources change
- Advantage - Performance Improvement
Virtual warehouse – Data remain in the web sources,
queries are decomposed at run time into queries to
sources
- Data is not replicated and is fresh
- Due to autonomy of web sources query optimization
and execution methodology may differ and
performance may be affected
- Good when the number of sources is large, data
changes frequently, little control over web sources
Web Data Management
Virtual approach vs. DBMS
In the virtual approach, the data integration system does
not communicate directly with a storage manager; instead
it communicates with wrappers
Second, the user does not pose queries directly in
the schema in which data is stored; the user is free
from knowing the structure
The user poses queries against a mediated schema:
virtual relations (not stored anywhere) designed
for a particular application
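The virtual approach can be sketched as a mediator function that answers a query on a virtual relation by calling wrappers at run time. Both sources, their contents, and every function name below are hypothetical; a real mediator would generate such a plan from the mediated schema instead of hard-coding it.

```python
def showtimes_wrapper(city):
    """Hypothetical source 1: (title, city, time) tuples for a given city."""
    data = [
        ("Movie A", "Paris", "20:30"),
        ("Movie B", "Paris", "21:00"),
        ("Movie A", "Lyon", "19:00"),
    ]
    return [row for row in data if row[1] == city]

def reviews_wrapper(title):
    """Hypothetical source 2: review text for a title, or None if unknown."""
    data = {"Movie A": "A classic.", "Movie C": "Forgettable."}
    return data.get(title)

def playing_with_review(city):
    """Virtual relation of the mediated schema: (title, time, review).
    Nothing is stored; each call is decomposed into queries to the sources."""
    answers = []
    for title, _city, time in showtimes_wrapper(city):
        review = reviews_wrapper(title)
        if review is not None:
            answers.append((title, time, review))
    return answers

print(playing_with_review("Paris"))
```

The user sees only `playing_with_review`; which sources exist and how their data is laid out stays hidden behind the wrappers.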
Web Data Management
Data Integration Steps
Specification of mediated schema and reformulation –
Mediated schema is the set of collection and attribute
names needed to formulate queries
- Data integration system translates the query on the
mediated schema into a query to data source
Completeness of data in web sources
Differing query processing capabilities
Query Optimization – selecting a set of minimal sources
and minimal queries
Wrapper construction
Matching objects across sources
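The last step above, matching objects across sources, can be sketched as a join on a normalized key. The normalization rules (lowercasing, dropping punctuation and a leading article) are assumptions; real systems typically need fuzzier matching than this.

```python
import re

def normalize(title):
    """Lowercase, drop punctuation and a leading article, collapse whitespace."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    if words and words[0] in ("the", "a", "an"):
        words = words[1:]
    return " ".join(words)

def match_objects(source_a, source_b):
    """Pair up titles from two sources that share the same normalized form."""
    index = {}
    for title in source_b:
        index.setdefault(normalize(title), []).append(title)
    return [(a, b) for a in source_a for b in index.get(normalize(a), [])]

# Hypothetical title lists from two web sources.
matches = match_objects(["The Big Film", "Other"], ["big film!", "Another"])
print(matches)
```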