XML: An Overview


Some of My XML/Internet Research Projects
CSCI 6530
October 5, 2005
Kwok-Bun Yue
University of Houston-Clear Lake
Content
• Areas of My Research Interest
• Some Current Projects
• Storage of XML in Relational Database
• Example Internet Computing Projects
• Conclusions
10/5/2005
Bun Yue: [email protected], http://dcm.uhcl.edu/yue
slide 2
Areas of My Research Interest
• Internet Computing
• XML
• Databases
• Concurrent Programming
Some Current Projects
• Storage of XML in relational database
• Measuring Web bias using authorities and hubs
• Measuring information quality of Web pages
• Distributed computer security laboratory
• Collaborative Open Community for developing educational resources
• Generalized exchanges within organizations
Some Recent Student Work
• McDowell, A., Schmidt, C. & Yue, K., Analysis and Metrics of XML Schema, Proceedings of the 2004 International Conference on Software Engineering Research and Practice, pp. 538-544, Las Vegas, June 2004.
• Yang A., Yue K., Liaw K., Collins G., Venkatraman J., Achar S., Sadasivam K., and Chen P., Distributed Computer Security Lab and Projects, Journal of Computing Sciences in Colleges, Volume 20, Issue 1, October 2004.
• Yue, K., Alakappan, S. and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=techreports.
Storing XML in RDB
• Advantages:
– Mature database technologies.
– May be queried by
• XML technology: e.g. XPath, XQuery.
• RDB technology: e.g. SQL.
• Disadvantages:
– Impedance mismatch: XML and relations are different data models.
Related Issues
• Effective mapping of XML DTDs (approximately an ordered tree model) to relational schemas.
• Mapping of XML queries (e.g. XQuery) to RDB queries (e.g. SQL).
• Mapping of RDB query results back to XML format.
Related Work and Context
• Mapping
– With or without schemas for XML.
– With or without user input.
• Schemas for XML:
– Document Type Definition (DTD)
– XML Schema
• We consider mapping with DTD and
without user input.
Naïve Mapping
• Each XML element is mapped to its own relation.
Example 1a:
XML:
<a><b><c><d>hello</d></c></b></a>
-> Relations: a, b, c and d.
Problems of Naïve Mapping
• Many relations.
• Inefficient queries: multiple joins are needed.
Example 1b:
XPath Query: //a
SQL Query: returning the full content of a requires joining the relations a, b, c and d.
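A minimal sketch of the naive mapping in Python, using the standard sqlite3 and xml.etree modules. The table layout (id, parent_id, text) and the function name naive_store are assumptions made for illustration; they are not taken from the talk.

```python
import sqlite3
import xml.etree.ElementTree as ET


def naive_store(xml_text):
    """Store every element in a relation named after its tag (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    next_id = 0

    def store(element, parent_id):
        nonlocal next_id
        next_id += 1
        row_id = next_id
        # One relation per element name: the naive mapping.
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{element.tag}" '
            "(id INTEGER PRIMARY KEY, parent_id INTEGER, text TEXT)"
        )
        text = (element.text or "").strip() or None
        conn.execute(
            f'INSERT INTO "{element.tag}" VALUES (?, ?, ?)',
            (row_id, parent_id, text),
        )
        for child in element:
            store(child, row_id)

    store(ET.fromstring(xml_text), None)
    return conn


conn = naive_store("<a><b><c><d>hello</d></c></b></a>")
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
# Reaching the text under a takes one join per nesting level.
row = conn.execute(
    "SELECT d.text FROM a "
    "JOIN b ON b.parent_id = a.id "
    "JOIN c ON c.parent_id = b.id "
    "JOIN d ON d.parent_id = c.id"
).fetchone()
```

Four relations are created for a four-element document, and the query needs three joins to reach a single string.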
Inlining Algorithms
• First proposed by Shanmugasundaram et al.
• Expanded by Lu, Lee, Chu and others.
• Extended in various directions by various researchers, e.g.,
– Preserving XML element orders.
– Preserving XML constraints.
• These extensions are not considered here.
Basic Idea of Inlining Algorithms
• Inline a child element into the relation for its parent element when appropriate.
• Inlining algorithms differ in their inlining criteria.
Example 1c: XML:
<a><b><c><d>hello</d></c></b></a>
Inlined Relation: a.
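A rough sketch of the inlining idea for Example 1c, in Python. It assumes every child occurs at most once and decides from the document instance; real inlining algorithms decide from the DTD. The function name inline_columns and the path-style column names (such as b_c_d) are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def inline_columns(element, prefix=""):
    """Flatten descendants into path-named columns of the parent relation.

    Assumption: each child element occurs at most once under its parent.
    """
    columns = {}
    for child in element:
        name = f"{prefix}{child.tag}"
        if len(child) == 0:
            # Leaf element: its text becomes a column value.
            columns[name] = (child.text or "").strip()
        else:
            # Inner element: keep descending, extending the column prefix.
            columns.update(inline_columns(child, name + "_"))
    return columns


root = ET.fromstring("<a><b><c><d>hello</d></c></b></a>")
columns = inline_columns(root)

conn = sqlite3.connect(":memory:")
column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
conn.execute(f'CREATE TABLE "{root.tag}" (id INTEGER PRIMARY KEY, {column_defs})')
placeholders = ", ".join("?" for _ in columns)
conn.execute(
    f'INSERT INTO "{root.tag}" VALUES (NULL, {placeholders})',
    list(columns.values()),
)
# A single relation, so no join is needed.
row = conn.execute("SELECT b_c_d FROM a").fetchone()
```

Compared with the naive mapping, the same document now lives in one relation and the query has no joins.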
Inlining Algorithms
• Child elements and attributes may be inlined.
• Inlined child elements need not have their own relations.
• Results in fewer relations.
• In general, more inlining -> fewer joins.
Inlining Algorithm Structure
1. Simplification of DTD.
2. Generation of DTD graphs.
3. Generation of relational schemas.
Our Preliminary Results
1. A more complete and optimal DTD simplification algorithm.
2. A generic DTD graph that can be used by inlining algorithms.
3. Inlining considerations: a framework for analyzing inlining algorithms.
4. A new and aggressive inlining algorithm.
Examples of Our Work
• Use DTD Simplification as an example of the
flavor of our work.
• Show the new Inlining Algorithm.
Brief Introduction to DTD
• DTD: a simple language to describe
XML vocabulary:
– Element declarations: contents of
elements.
– Attribute declarations: types and properties
of attributes.
• DTD is still very popular.
DTD Element Declarations
• Define element contents:
– #PCDATA: string
– ANY: anything goes
– EMPTY: no content (attributes only)
– Content models: child elements.
– Mixed contents: child elements and strings.
DTD Example
Example 2: A complete DTD
<!ELEMENT addressBook (person+)>
<!ELEMENT person (name,email*)>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ATTLIST person id ID #REQUIRED>
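As an illustration of how declarations like those in Example 2 can feed a graph structure, here is a small Python sketch that reads the element declarations into an adjacency list. It only handles flat content models such as the ones above (no nested groups), and the function name dtd_graph and the (child, operator) representation are assumptions, not the generic DTD graph of the technical report.

```python
import re

DTD = """
<!ELEMENT addressBook (person+)>
<!ELEMENT person (name,email*)>
<!ELEMENT name (last,first)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ATTLIST person id ID #REQUIRED>
"""


def dtd_graph(dtd_text):
    """Map each element name to a list of (child, occurrence operator) pairs."""
    graph = {}
    declarations = re.findall(r"<!ELEMENT\s+(\S+)\s+\(([^>]*)\)\s*>", dtd_text)
    for name, model in declarations:
        if model.strip() == "#PCDATA":
            # Text-only element: a leaf in the graph.
            graph[name] = []
        else:
            graph[name] = re.findall(r"([A-Za-z_][\w.\-]*)([?*+]?)", model)
    return graph


graph = dtd_graph(DTD)
```

Here graph["person"] holds name with no operator and email with *, which is the kind of information an inlining criterion needs: name can be inlined into person, while a repeated email cannot be a single column.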
Operators for Element Declaration
• ,: sequence
• +: 1 or more
• *: 0 or more
• ?: optional; 0 or 1
• |: choice
• (): grouping (parentheses)
Simplification of DTD
• Mapping of DTD to Relational Schemas:
– Input: DTDs
– Output: Relational Schemas
• DTD can be complicated =>
simplification.
Example 3:
<!ELEMENT a (b,((b+,c)|(d,b*,c?)),(e*,f)?)>
Simplification Principles
• The relational schema needs to store all
possible scenarios.
• Some relations/columns may not be
populated in some instances.
Example 3:
<!ELEMENT a (b|c)> and <!ELEMENT a (b,c)> may be the same from the RDB’s point of view: either way, the schema needs a place for both b and c.
Simplification Details
• Result: comma-separated clauses in which the only remaining operators are (), the comma and *.
– + -> *, e.g. a+ -> a*.
– Removal of | and ?, e.g. (a|b?) -> (a,b).
– Removal of (), e.g. (a, (b)) -> (a,b).
– Removal of repetition, e.g. (a, b, a) -> (a*, b).
• Note that element orders are not preserved.
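One way to reproduce the transformations listed above is to compute, for each child element, whether it can occur more than once in any valid instance. The Python sketch below does that with a small recursive-descent parser. It is an illustrative stand-in, not the 7-rule algorithm of the technical report; the names simplify, parse_item and parse_group are hypothetical, and only element content models (no #PCDATA or mixed content) are handled.

```python
import re

MANY = 2  # any count above 1 means "repeatable"


def tokenize(model):
    return re.findall(r"[A-Za-z_][\w.\-]*|[(),|?*+]", model)


def parse_item(tokens, pos):
    """Parse a name or a parenthesized group plus an optional ?, * or +."""
    if tokens[pos] == "(":
        counts, pos = parse_group(tokens, pos + 1)
    else:
        counts, pos = {tokens[pos]: 1}, pos + 1
    if pos < len(tokens) and tokens[pos] in ("?", "*", "+"):
        if tokens[pos] in ("*", "+"):
            # Everything under * or + may repeat.
            counts = {name: MANY for name in counts}
        pos += 1  # ? leaves the maximum count unchanged
    return counts, pos


def parse_group(tokens, pos):
    """Parse items separated by , or | up to the closing parenthesis."""
    counts, pos = parse_item(tokens, pos)
    while tokens[pos] != ")":
        operator = tokens[pos]
        other, pos = parse_item(tokens, pos + 1)
        for name, n in other.items():
            if operator == ",":
                # Sequence: occurrences add up.
                counts[name] = min(MANY, counts.get(name, 0) + n)
            else:
                # Choice: only one branch is taken, so keep the larger count.
                counts[name] = max(counts.get(name, 0), n)
    return counts, pos + 1


def simplify(model):
    counts, _ = parse_item(tokenize(model), 0)
    parts = [name + ("*" if n >= MANY else "") for name, n in counts.items()]
    return "(" + ",".join(parts) + ")"
```

For instance, simplify("(a,b,a)") gives (a*,b) and simplify("(a|b?)") gives (a,b), matching the examples in the list. Elements are emitted in order of first appearance, so the original element order is not preserved.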
Previous Simplification Results
• Not complete, e.g.
– Shanmugasundaram: does not specify how to handle |.
– Lu: does not specify how to remove ().
• Not optimal (may generate * when it is not needed).
Example 4a: For Lu and Lee, 2 steps:
(b|(b,c)) -> (b,b,c) -> (b*,c)
Our Simplification Algorithm
• A set of definitions.
• A set of 7 simplification rules.
• An algorithm specifying how and when to apply them.
Example 4b: With our algorithm, 1 step:
(b|(b,c)) -> (b,c)
Simplification Rules
Simplification Algorithm
Complexity
Time complexity = $O(N_{op})$, where $N_{op}$ is the total number of operators (including parentheses) in the element declarations of the DTD.
Advantages
• Complete: handles all DTDs.
• Optimal: in the sense that * will not be generated if not needed.
Example 5:
<!ELEMENT a (b,((b+,c)|(d,b*,c?)),(e*,f)?)>
=> <!ELEMENT a (b*,c,d,e*,f)>
A New Inlining Algorithm (1)
• Aggressive in inlining.
• More complete.
• Elaborated algorithms.
• Handle more details: e.g. element types of ANY, EMPTY and mixed contents.
A New Inlining Algorithm (2)
A New Inlining Algorithm (3)
A New Inlining Algorithm (4)
Main Results
• Yue, K., Alakappan, S. and Cheung, W., A Framework of Inlining Algorithms for Mapping DTDs to Relational Schemas, Technical Report COMP-05-005, Computer Science Department, the Hong Kong Baptist University, 2005, http://www.comp.hkbu.edu.hk/en/research/?content=tech-reports.
Future Work
• Done so far: the algorithms have been implemented and tested with many DTDs.
• Next: implement the XQuery/SQL bridge for a performance study.
Measuring Web Bias
• Search engines dominate how information is accessed.
• Search results have major social, political and commercial consequences.
• Are search engines biased?
• How biased are they?
Previous Work
• To measure bias, results should be compared to a norm.
• The norm may come from human experts.
• Mowshowitz and Kawaguchi: use the average search result of a collection of popular search engines as the norm.
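A small Python sketch of a norm-based bias measure along these lines. The norm vector is built by summing the URL vectors of all engines, and bias is taken as one minus the cosine similarity to the norm; that particular distance, the function names and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter


def url_vector(urls, vocabulary):
    """Count how often each URL of the vocabulary appears in one result list."""
    counts = Counter(urls)
    return [counts[url] for url in vocabulary]


def cosine_similarity(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0


def bias_scores(results):
    """results: dict mapping a search engine name to its list of result URLs."""
    vocabulary = sorted({url for urls in results.values() for url in urls})
    vectors = {se: url_vector(urls, vocabulary) for se, urls in results.items()}
    # The norm aggregates the results of all engines (assumed: a plain sum).
    norm_vector = [sum(column) for column in zip(*vectors.values())]
    return {se: 1.0 - cosine_similarity(v, norm_vector) for se, v in vectors.items()}


# Hypothetical result lists for three engines.
scores = bias_scores({
    "se1": ["u1", "u2"],
    "se2": ["u1", "u2"],
    "se3": ["u3", "u4"],
})
```

In this toy data se3 shares no URL with the other two engines, so it ends up furthest from the norm and gets the largest bias score.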
Mowshowitz and Kawaguchi
[Figure: search engines SE 1 ... SE n return URLs 1 ... URLs n; their union gives the NORM URLs and the NORM URL Vector; URL Vector 1 ... URL Vector n are each compared with the norm to give Bias 1 ... Bias n.]
Limitations
• Based on URL vectors only, so the quality of the bias cannot be measured.
Our Approach
• Use Kleinberg’s HITS algorithm to create
clusters, authorities and hubs of the result
norm URLs.
• Use them as norm clusters, authorities and
hubs.
• Measure distances between norms and
individual results as bias.
HITS
• Obtain a directed graph G where
– Node: page
– Edge: hyperlink between pages.
• Two indices for page p at iteration i: $x_{p,i}$ (authority) and $y_{p,i}$ (hub)
• Iterate until steady state:
– $x_{p,i+1} \leftarrow \sum_{q : q \to p} y_{q,i}$
– $y_{p,i+1} \leftarrow \sum_{q : p \to q} x_{q,i}$
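The iteration above can be sketched in a few lines of Python. The per-round normalization is an assumption added here so that the scores stay bounded (the slide does not show it), and the fixed iteration count stands in for a real steady-state test. The function names and the toy graph are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (assumed normalization step)."""
    length = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / length for node, v in scores.items()}


def hits(edges, iterations=50):
    """edges: list of (source, target) pairs, one per link source -> target."""
    nodes = sorted({node for edge in edges for node in edge})
    authority = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages q that link to p.
        new_authority = {
            p: sum(hub[q] for q, target in edges if target == p) for p in nodes
        }
        # Hub of p: sum of authority scores of the pages q that p links to.
        new_hub = {
            p: sum(authority[q] for source, q in edges if source == p) for p in nodes
        }
        authority, hub = normalize(new_authority), normalize(new_hub)
    return authority, hub


# Toy graph: h1 links to a and b, h2 links to a.
authority, hub = hits([("h1", "a"), ("h2", "a"), ("h1", "b")])
best_authority = max(authority, key=authority.get)
best_hub = max(hub, key=hub.get)
```

Both updates use the scores of the previous round, as in the two formulas. In the toy graph, a is linked from both hubs and comes out as the top authority, while h1 links to both authorities and comes out as the top hub.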
Our Approach
[Figure: search engines SE 1 ... SE n return URLs 1 ... URLs n; their union gives the NORM URLs, from which the NORM Cluster and NORM Cluster Vector are derived; each engine's URL Vector and Cluster Vector (1 ... n) are compared with the norm to give Bias 1 ... Bias n.]
Current Progress
• Implemented previous results.
• Implemented vector analysis
• Implemented HITS algorithm, but it is not
accurate enough:
– ‘Conglomerate’ effect.
Measuring Page’s Information Quality
• People find information from Web pages.
• How good is the content of a given page?
Previous Work
• Measuring different kinds of quality:
– Web site design quality
– Navigational quality
• Many frameworks for measuring information quality:
– Most result in surveys that let users rank information quality.
– Very few automated or semi-automated tools.
Our Objectives
• Build automated or semi-automated tools that measure, or help users measure, the information quality of a Web page.
Approach
• Hypothesis, measure, usage guidelines.
• Example:
– Hypothesis: a Web page with many spelling mistakes is likely to have low information quality.
– Measures:
• Show frequencies of word occurrences.
• Show percentage of spelling ‘mistakes’.
– Usage guideline:
• Spelling ‘mistakes’ may not be actual mistakes (e.g. UHCL).
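The two measures in this example could be computed along the following lines. This is a sketch only: the function name spelling_report, the dictionary and allowlist arguments, and the sample sentence are assumptions, and it is not the code of the prototype tool.

```python
import re
from collections import Counter


def spelling_report(text, dictionary, allowlist=frozenset()):
    """Return word frequencies, unknown words, and the percentage of unknown occurrences.

    The allowlist holds terms such as acronyms that are not real mistakes.
    """
    words = re.findall(r"[A-Za-z']+", text)
    frequencies = Counter(word.lower() for word in words)
    known = {w.lower() for w in dictionary} | {w.lower() for w in allowlist}
    unknown = sorted(word for word in frequencies if word not in known)
    unknown_occurrences = sum(frequencies[word] for word in unknown)
    percent = 100.0 * unknown_occurrences / len(words) if words else 0.0
    return frequencies, unknown, percent


# Hypothetical page text and a tiny dictionary.
frequencies, unknown, percent = spelling_report(
    "UHCL offers a grate program a",
    dictionary={"offers", "a", "great", "program"},
    allowlist={"UHCL"},
)
```

With the allowlist in place, UHCL is not counted as a mistake, which follows the usage guideline; only "grate" is flagged.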
Metrics
• Many potential metrics. Some examples:
– Broken links
– HTML quality
– Domain names
– Page ranking and popularity
– Appearance in directory structure
– History (e.g. Wayback Machine)
– Currency (e.g. last modified)
– Author (e.g. meta tag)
Current Progress
• ‘Pre-alpha’ prototype:
http://dcm.cl.uh.edu/yue/util/pageInfo.pl
• A capstone project
Conclusions
• Good time to do applied computing
research in the Web and XML areas.
• Style: hands-on supervision +
publications.
• Don't forget to donate a scholarship to
the School if your future research leads
to a windfall.
Questions?
• Any Questions?
• Thanks!