XML, RDF and
Advanced Search
(Semantic Web – Web3.0)
Thanks to Jim Hendler, Carl Lagoze, Jayavel
Shanmugasundaram, Sara Cohen, Jonathan Mamou,
Yaron Kanza, Mark Sapossnek, Yehoshua Sagiv,
Frank van Harmelen
What we have covered
• What is IR
• Evaluation
• Tokenization and properties of text
• Web crawling
• Query models
• Vector methods
• Measures of similarity
• Indexing
• Inverted files
• Basics of internet and web
• Spam and SEO
• Search engine design
• Google and Link Analysis
– This week: metadata, XML, RDF; advanced search, Semantic Web
The importance of data and their rules
• Tim Berners-Lee
– inventor of the world wide web
– Founder of the W3C
• Presentation at Ted
Metadata and Markup Languages
“Metadata is data about data”
• Why is metadata important?
• Makes data easier to search
• It’s the foundation of the semantic web (Web 3.0)
Metadata often is written in XML
Metadata is semi-structured data conforming to commonly
agreed upon models, providing operational interoperability
in a heterogeneous environment
What is metadata?
Some simple definitions
• ‘Structured data about data’.
• Dublin Core Metadata Initiative FAQ, 2005
– http://dublincore.org/resources/faq/
• Machine-understandable information about
Web resources or other things.
• Tim Berners-Lee, W3C, 1997
– http://www.w3.org/DesignIssues/Metadata
"Web resources or other things"
• Metadata might be "about"… anything!
– HTML documents
– digital images
– databases
– books
– museum objects
– archival records
– metadata records
– Web sites
– collections
– services
– physical places
– people
– organizations
– “works”
– formats
– concepts
– events
What might metadata "say"?
What is this called?
What is this about?
Who made this?
When was this made?
Where do I get (a copy of) this?
When does this expire?
What format does this use?
Who is this intended for?
What does this cost?
Can I copy this? Can I modify this?
What are the component parts of this?
What else refers to this?
What did "users" think of this?
(etc!)
What operations/functions?
• resource disclosure & discovery
• resource retrieval, use
• resource management, including preservation
• verification of authenticity
• intellectual property rights management
• commerce
• content-rating
• authentication and authorization
• personalization and localization of services
• (etc!)
What operations/functions?
• Different functions for different metadata
• Metadata (and metadata standards) sometimes
classified according to function
– Descriptive: primarily for discovery, retrieval
– Administrative: primarily for management
– Structural: relationships between component parts of
resources
– Contextual: relationships between resources
• No “one size fits all” solution!
Metadata of a report?
• What metadata would you associate with a
report or memo?
Types of Metadata
• Descriptive
– Discovery / description of objects
• Title, author, abstract, etc.
• Structural
– Storage & presentation of objects
• 1 pdf file, 1 ppt file, 1 LaTeX file, etc.
• Administrative
– Managing and preservation of objects
• Access control lists, terms and conditions, format descriptions,
“meta-metadata”
LOC - Library of Congress
Which View is Correct?
figure 1 from: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
Approaches to Metadata
• from Ng, Park and Burnett, 1997 (also JASIS, 50(13))
http://www.scils.rutgers.edu/~sypark/asis.html
– library science: bibliographic control
• “organizing the physical containers of information, by means
of bibliographical description, subject analysis, and
classification notation construction, so that the container can be
efficiently described, identified, located and retrieved”
– computer and information science: data management
• “not only to store, access and utilize data effectively, but also
to provide data security, data sharing, and data integrity”
– Domains/areas define their own
Metadata Formats and Implementation
• Use markup languages
– Interoperable
– Extensible
– Robust
• Permits advanced search features
When online, the beginning of a semantic
web!
What is a markup language?
• Textual (i.e. person readable) language
where significant elements are indicated by
markers
– <TITLE>XML</TITLE>
• Examples are RTF, HTML, XML, TeX, etc.
• Easy to process and can be manipulated by
a variety of application programs
Standard Generalized Markup Language (SGML)
• Based on GML (generalized markup language), developed
by IBM in the 1960s
• An international standard (ISO 8879:1986) defines how
descriptive markup should be embedded in a document
• Can define any document format of any complexity
• Enables extensibility, structure and validation
• Too many optional features for the Web
• Gave birth to the extensible markup language (XML),
W3C recommendation in 1998
The Purpose of SGML
•
SGML is designed to make your information last longer
than the systems that created it. Such longevity also
implies immunity to short-term changes -- such as a
change from one application program to another -- so
SGML is also inherently designed for re-purposing and
portability.
What is SGML?
• SGML (and its derivatives, HTML and XML) are ASCII
character based representations of electronic data
• Remember, it's all bits--meaning is derived from how they
are organized…
• Think of SGML docs as strings that must be parsed--A
web browser parses an HTML doc and uses the markup
codes to display the data contained
• Since it's all ASCII, these docs can also be handled by non
parsing tools (such as vi, emacs, perl, etc.)
SGMLXMLHTML
SGML is the “mother tongue” – but is overkill for most
common applications.
XML is an abbreviated version of SGML
• easier to define own document types
• easier for programmers to write programs to handle
documents (and data)
• omits all the options (and most of the more complex and less-used parts) of SGML
• HTML is just one of many SGML or XML “applications”,
the one most frequently used on the Web
SGML Components
SGML documents have three parts:
• Declaration: specifies which characters and delimiters may
appear in the application
• DTD (document type definition) / style sheet: defines the
syntax of markup constructs
• Document instance: actual text (with the tags) of the
documents
• More information can be found at:
http://www.W3.Org/markup/SGML
World Wide Web (W3C) Consortium
What is XML?
• XML – eXtensible Markup Language
• designed to improve the functionality of the Web by
providing more flexible and adaptable information
and identification
• “extensible” because not a fixed format like HTML
• a language for describing other languages (a metalanguage)
• design your own customised markup language
The HTML World
<body>
<h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
<p> The workshop was held on 28 July 2000. The editors of the workshop
were David Carmel, Yoelle Maarek, and Aya Soffer </p>
<h2> XQL and Proximal Nodes </h2>
<p> The paper was authored by Ricardo Baeza-Yates and
Gonzalo Navarro. The abstract of this paper is given below. </p>
<p> We consider the recently proposed language … </p>
<p> The paper references the following papers:
<a href="http://www.acm.org/www8/paper/xmlql"> … </a>
…
</p>
…
The XML World
<workshop date="28 July 2000">
<title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
<editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
<proceedings>
<paper id="1">
<title> XQL and Proximal Nodes </title>
<author> Ricardo Baeza-Yates </author>
<author> Gonzalo Navarro </author>
<abstract> We consider the recently proposed language … </abstract>
<section name="Introduction">
Searching on structured text is becoming more important with XML …
<subsection name="Related Work">
The XQL language …
</subsection>
</section>
…
<cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
</paper>
…
</proceedings>
</workshop>
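A minimal sketch, assuming Python's standard xml.etree.ElementTree module, of why the XML version is easier for a program to use than the HTML version: titles and authors can be read directly from named elements instead of being guessed from headings and paragraphs. The string below is a shortened, well-formed variant of the workshop example (the elided parts are simply left out), and the function name describe_papers is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the workshop example.
# Assumption: the elided parts of the slide are dropped here.
WORKSHOP_XML = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
    </paper>
  </proceedings>
</workshop>
"""


def describe_papers(xml_text):
    """Return (paper id, title, list of authors) for every <paper> element."""
    root = ET.fromstring(xml_text)
    papers = []
    for paper in root.iter("paper"):
        title = paper.findtext("title").strip()
        authors = [author.text.strip() for author in paper.findall("author")]
        papers.append((paper.get("id"), title, authors))
    return papers


papers = describe_papers(WORKSHOP_XML)
# The workshop date is an attribute of the root element.
workshop_date = ET.fromstring(WORKSHOP_XML).get("date")
```

In the HTML version the same facts sit inside h1, h2 and p elements, so a program would have to guess which paragraph holds the authors.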
XML
• XML is derived from SGML – the Standard
Generalized Markup Language, an international
standard (ISO 8879)
• XML = very simple dialect of SGML
• goal = enable generic SGML to be served,
received and processed on the Web in ways
not possible with HTML
Why use XML?
• XML is not just for Web pages
• Data management:
– store any kind of structured document
– enclose/encapsulate information in order to pass it
between different computing systems that are
otherwise unable to communicate
Key feature of XML
An application is free to use XML tagged data in
many different ways, e.g.
• produce an image
• generate a formatted text listing
• display the XML document’s markup in pretty
colors
• restructure the data into a format for storing in a
database, transmission over a network, input to
another program.
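The uses listed above can be sketched in a few lines. This is only an illustration, assuming Python's standard xml.etree.ElementTree and json modules; the book data and the function names to_listing and to_rows are hypothetical and not taken from the slides.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, made up for this illustration.
BOOKS_XML = """
<books>
  <book year="1998"><title>XML Basics</title><price>20</price></book>
  <book year="2001"><title>RDF Primer</title><price>35</price></book>
</books>
"""


def to_listing(root):
    """Use 1: generate a formatted text listing, one line per book."""
    return "\n".join(
        f"{book.findtext('title')} ({book.get('year')})"
        for book in root.findall("book")
    )


def to_rows(root):
    """Use 2: restructure the data into rows suitable for a database table."""
    return [
        {
            "title": book.findtext("title"),
            "year": int(book.get("year")),
            "price": float(book.findtext("price")),
        }
        for book in root.findall("book")
    ]


root = ET.fromstring(BOOKS_XML)
listing = to_listing(root)
rows = to_rows(root)
# Use 3: serialize the rows for transmission or as input to another program.
payload = json.dumps(rows)
```

The same tagged data feeds all three outputs; nothing in the XML itself says which one is intended.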
XML Software?
• many programs are “XML ready” already
today.
• xml.coverpages.org covers news of new
additions to XML
• Find Penn State pages with XML
How do I run or execute an XML file?
• You can’t and you don’t!
• XML is not a programming language
• XML is a markup specification language
• XML files are just data (Unicode), waiting for
a program to do something with them
• XML files can be viewed with an XML editor
or XML-compatible browser
Things to Remember
• XML does not replace HTML – it provides an
alternative which allows you to define your own
set of markup elements to a published standard:
<?xml version="1.0" standalone="yes"?>
<conversation>
<greeting>Hello, world!</greeting>
<response>Stop the planet, I want to get off!</response>
</conversation>
Things to Remember
• All parts of an XML document are case
sEnSiTiVe
• Element type names are case sensitive, so
<BODY> … </body> is out.
• Attribute names are case sensitive:
<PIC width="7cm"/> and <PIC WIDTH="6cm"/>
describe different attributes, not just
different values for the same “width” attribute of PIC.
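The case-sensitivity rules above can be checked with any XML parser. The sketch below assumes Python's standard xml.etree.ElementTree; the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Start tag and end tag differ only in case: not well-formed XML.
mismatched = is_well_formed("<BODY>text</body>")
matched = is_well_formed("<body>text</body>")

# width and WIDTH are two different attribute names.
lower = ET.fromstring('<PIC width="7cm"/>')
upper = ET.fromstring('<PIC WIDTH="6cm"/>')
lower_width = lower.get("width")
upper_width = upper.get("width")  # no attribute called "width" on this element
```

An HTML parser would typically accept the mismatched tags; an XML parser must not.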
What is XQuery?
– XQuery is the language for querying XML data
• The best way to explain XQuery is to say that XQuery is to XML
what SQL is to database tables.
– XQuery uses XPath expressions to extract XML data.
• XPath is a language for finding information in an XML document.
• XPath is used to navigate through elements and attributes in an XML
document.
– XQuery is defined by the W3C.
– XQuery is supported by all the major database engines (IBM,
Oracle, Microsoft, etc.)
• XQuery 1.0 W3C Recommendation
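To give a feel for path-based querying, here is a small sketch. Python's standard library has no XQuery engine, so the sketch uses the limited XPath subset supported by xml.etree.ElementTree; it illustrates navigation by elements and attributes, not XQuery itself. The second paper and its author are hypothetical.

```python
import xml.etree.ElementTree as ET

PROCEEDINGS_XML = """
<proceedings>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
  </paper>
  <paper id="2">
    <title>A Hypothetical Second Paper</title>
    <author>Jane Doe</author>
  </paper>
</proceedings>
"""

root = ET.fromstring(PROCEEDINGS_XML)

# Path step: every <title> that is a child of a <paper>, anywhere below the root.
all_titles = [title.text for title in root.findall(".//paper/title")]

# Predicate on a child element: papers with an <author> whose text matches exactly.
navarro_titles = [
    paper.findtext("title")
    for paper in root.findall(".//paper[author='Gonzalo Navarro']")
]

# Predicate on an attribute: the paper whose id attribute equals "2".
paper_two = root.find(".//paper[@id='2']")
```

Note that each expression only works because the query author already knows the element names and nesting of this particular document.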
Motivation for XML Search
• It is becoming increasingly popular to publish data on the
Web in the form of XML documents.
• Current search engines, which are an indispensable tool for
finding HTML documents, have two main drawbacks
when it comes to searching for XML documents.
– It is not possible to pose queries that explicitly refer to XML tags.
– Search engines return references (i.e. links) to documents and not
specific fragments thereof. This is problematic, since large XML
documents may contain thousands of elements storing many pieces
of information that are not necessarily related to each other.
Problems with XQuery
• A query language for XML, such as XQuery, can be used
to extract data from XML documents.
• However, such a query language is not an alternative to an
XML search engine for several reasons.
– The syntax of XQuery is more complicated than the syntax of a
standard search query. Hence, it is not appropriate for a naive user.
– Extensive knowledge of the document structure is required in order
to correctly formulate a query. Thus, queries must be formulated
on a per document basis.
– XQuery lacks any mechanism for ranking answers.
• Solution - XML Search engine
XML Search Tool Design Features?
• A simple syntax that can be used by naive users
• Search results should include XML fragments and not
necessarily full documents
• The XML fragments in an answer should be semantically
related
– For example, a paper and an author should be in an answer only
if the paper was written by this author
• Search results should be ranked
• Search results should be returned in “reasonable” time
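A toy sketch of these design features, not a real XML search engine: the query is plain keywords, answers are fragments rather than whole documents, and results are ranked. It assumes that a paper element is the fragment unit, which keeps a title and an author together only when they belong to the same paper, and it scores by raw term counts. The collection contents, the function names, and the scoring rule are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical collection; abstracts are invented for the example.
COLLECTION_XML = """
<proceedings>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <abstract>We consider the XQL query language for XML search.</abstract>
  </paper>
  <paper id="2">
    <title>Ranking Web Pages</title>
    <author>Jane Doe</author>
    <abstract>Link analysis for web search.</abstract>
  </paper>
</proceedings>
"""


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(root, query, unit="paper"):
    """Return (score, fragment id) pairs, best first, for matching fragments."""
    query_terms = set(tokens(query))
    results = []
    for fragment in root.iter(unit):
        words = tokens(" ".join(fragment.itertext()))
        # Score: number of token occurrences that match any query term.
        score = sum(1 for word in words if word in query_terms)
        if score > 0:
            results.append((score, fragment.get("id")))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


root = ET.fromstring(COLLECTION_XML)
hits = search(root, "xml search")
```

A real engine would use an inverted index and a better ranking function (for example tf-idf), and would choose the fragment unit from the document structure rather than fixing it in advance.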
XML Search Engines
• Summary of XML engines
– Open source ones starting to emerge
• Or just use web search engine with filetype:xml
– Try Google
• Many for commercial use and some in design
– Active research area
• Web XML is a step in the direction of the semantic web!
Open Source XML Search Engine
What is Web 2.0 ?
• Term popularized by Tim O’Reilly and MediaLive
International, following a brainstorming session about the
future of the web; O’Reilly’s article defining it appeared in 2005
• Also may be called the Live Web or Living Web
• Refers to more interactive technologies that engage,
facilitate and empower users
• Companies utilizing interactive technologies are the hot
investments
• Companies are just starting to embrace these technologies
for business value
• Tim’s Def (Video); Schmidt’s (Video)
• The Machine (Video)
Web 1.0 vs 2.0 (Some Examples)
Web 1.0 --> Web 2.0
DoubleClick --> Google AdSense
Ofoto --> Flickr
Akamai --> BitTorrent
mp3.com --> Napster
Britannica Online --> Wikipedia
personal websites --> blogging
domain name speculation --> search engine optimization
page views --> cost per click
screen scraping --> web services
publishing --> participation
content management systems --> wikis
directories (taxonomy) --> tagging ("folksonomy")
stickiness --> syndication
Source: www.oreilly.com, “What is web 2.0: Design Patterns and Business Models for the next Generation of Software”, 9/30/2005
Web 3.0
This will be the INTELLIGENT
Web!
The Semantic Web!
How will we get the semantic web?
Now... that should clear up a few things around here
Web 2.0 vs Web 3.0
• The Web and Web 2.0 were designed with humans in
mind.
(Human Understanding)
• Web 3.0 will anticipate our needs, whether that means State
Department information when traveling, foreign embassy
contacts, airline schedules, hotel reservations, area taxis,
or famous restaurants. The new Web
will be designed for computers.
(Machine Understanding)
• Web 3.0 will be designed to anticipate the meaning
of the search.
How do we get to the semantic web, really
• The next stage for the Web will be making data
accessible to artificial intelligence agents.
• Web 3.0 uses languages beyond HTML
and XML, such as RDF (Resource
Description Framework).
• Web 3.0 will need data delivered in
computer-readable form (RDF).
General idea of Semantic Web
Make current web more machine accessible and intelligent!
(currently all the intelligence is in the user)
Motivating use-cases
• Search engines
• concepts, not keywords
• semantic narrowing/widening of queries
• Shopbots
• semantic interchange, not screenscraping
• E-commerce
– Negotiation, catalogue mapping, personalisation
• Web Services
– Need semantic characterisations to find them
• Navigation
• by semantic proximity, not hardwired links
• .....
Example
• Try these queries with Google:
– Distance between Paris and Madrid
– Distance between Paris and New York
– (The) Largest city of France
– (The) Largest city of Spain
• Now, try these with Google:
– Distance between largest city of France and largest city of
Spain
– Distance between “largest city of France” and “largest city of
Spain”
– And worse, Distance between “the largest city of France and
that of Spain”
– What state is south of Texas
Examples
• What other queries does Google not understand?
– START YOUR ENGINES
Web Search Semantics
• So, what’s wrong with Google?
– Nothing. The problem is with the World Wide Web:
• The Web contains unstructured information
– and Google is a keyword- and phrase-based search
engine
• Initiative to make the contents on the Web
structured information/represented knowledge
– the Semantic Web
• Another approach – let Google do it.
General idea of Semantic Web
Do this by:
• Somehow making data and metadata available
on the Web in machine-understandable form
(formalized)
• Structure the data and meta-data in ontologies
These are non-trivial
design decisions.
Alternative would be:
Semantic Search Engines
Do they exist?
• Some claim that they do
• Try these out (some no longer around):
– Lexxe
– iGlue
– Hakia
– Exalead
– Kosmix
– Swoogle
– WolframAlpha
– Bing
Expressed using the W3C stack
What it’s like to be a machine
on the Web
Required are:
• Explicit meta-data
• Shared domain descriptions
Machine-processable content
Machine-support for interoperability
machine accessible meaning
(What it’s like to be a machine)
[Figure: a CV web page with regions labeled name, education, CV, work, private]
XML ≠ machine accessible meaning
[Figure: the same CV page with the regions wrapped in XML tags: <name>, <education>, <CV>, <work>, <private>]
So why not just use XML?
• No agreement on:
– structure
• is country a:
– object?
– class?
– attribute?
– relation?
– something else?
• what does nesting mean?
– vocabulary
• is country the same as
nation?
<country name="Netherlands">
<capital name="Amsterdam">
<areacode>020</areacode>
</capital>
</country>
<nation>
<name>Netherlands</name>
<capital>Amsterdam</capital>
<capital_areacode>
020
</capital_areacode>
</nation>
Are the above XML documents the same?
● Do they convey the same information?
● Is that information machine-accessible?
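One way to explore these questions is to parse both documents. The sketch below, assuming Python's standard xml.etree.ElementTree, shows that the two documents differ as markup, and that they yield the same facts only after a person writes a separate extraction function for each vocabulary. The function names and the dictionary keys are made up for illustration.

```python
import xml.etree.ElementTree as ET

COUNTRY_XML = """
<country name="Netherlands">
  <capital name="Amsterdam">
    <areacode>020</areacode>
  </capital>
</country>
"""

NATION_XML = """
<nation>
  <name>Netherlands</name>
  <capital>Amsterdam</capital>
  <capital_areacode>
    020
  </capital_areacode>
</nation>
"""


def facts_from_country(root):
    """Hand-written mapping for the attribute-and-nesting style."""
    capital = root.find("capital")
    return {
        "country": root.get("name"),
        "capital": capital.get("name"),
        "areacode": capital.findtext("areacode").strip(),
    }


def facts_from_nation(root):
    """Hand-written mapping for the flat, element-only style."""
    return {
        "country": root.findtext("name").strip(),
        "capital": root.findtext("capital").strip(),
        "areacode": root.findtext("capital_areacode").strip(),
    }


country = ET.fromstring(COUNTRY_XML)
nation = ET.fromstring(NATION_XML)

# As markup the documents differ: different root tag, different structure.
same_markup = country.tag == nation.tag
# After human-written mappings they turn out to state the same facts.
same_facts = facts_from_country(country) == facts_from_nation(nation)
```

The knowledge that country and nation mean the same thing lives in the two mapping functions, not in the XML, which is the point of the slide.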
“2nd aim of Semantic Web”:
Data integration
– Unstructured and semi-structured sources (document collections, message
traffic, web pages, ...)
– Structured data without an explicit data schema
(non-local databases, data tables, charts and reports,
...)
– Non-text collections (image, video, sound, ...)
– Streams of data from sensors, programs, services
Must specify the structure of data resources...
2nd aim of Semantic Web:
Data integration
... so a processor can tell how the "attributes" and
"values" are related
– What is required vs. optional?
– How many values for a particular attribute?
– What attributes are keys for other attributes?
– Which attributes are necessarily related to other
attributes and in what way?
– How do the attributes (and values) in one data
source map to attributes and values describing
another source?
Stack of languages
• XML:
– Surface syntax, no semantics
• XML Schema:
– Describes structure of XML documents
• RDF:
– Datamodel for “relations” between “things”
• RDF Schema (RDFS):
– RDF Vocabulary Definition Language
• OWL:
– A more expressive
Vocabulary Definition Language
Semantic web languages today
• Today there are three semantic web languages
– RDF – Resource Description Framework and variations
http://www.w3.org/RDF/
– DAML+OIL – DARPA Agent Markup Language + Ontology Inference Layer http://www.daml.org/
(deprecated)
– OWL – Web Ontology Language
http://www.w3.org/2001/sw/
• OWL Lite
• OWL DL
• OWL Full
RDF is the first Semantic Web language
RDF is a simple language for building graph based representations
The RDF data model can be written in three forms:
• Graph: good for human viewing
• XML encoding: good for machine processing
<rdf:RDF ……..>
<….>
<….>
</rdf:RDF>
• Triples: good for reasoning
stmt(docInst, rdf_type, Document)
stmt(personInst, rdf_type, Person)
stmt(inroomInst, rdf_type, InRoom)
stmt(personInst, holding, docInst)
stmt(inroomInst, person, personInst)
The RDF Data Model
• An RDF document is an unordered collection of statements, each with a
subject, predicate and object (aka triples)
• A triple can be thought of as a labelled arc in a graph
• Statements describe properties of web resources
• A resource is any object that can be pointed to by a URI:
– a document, a picture, a paragraph on the Web, …
– E.g., http://umbc.edu/~ypeng/F07671.html
– a book in the library, a real person (?)
– isbn://5031-4444-3333
– …
• Properties themselves are also resources (URIs)
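The data model above can be sketched without any RDF library by storing each statement as a (subject, predicate, object) tuple. This is only an illustration: the short names stand in for full URIs, and the helper names match and as_graph are made up.

```python
from collections import defaultdict

# Statements written as (subject, predicate, object) tuples.
# Assumption: short names are placeholders for the URIs real RDF would use.
TRIPLES = {
    ("docInst", "rdf_type", "Document"),
    ("personInst", "rdf_type", "Person"),
    ("inroomInst", "rdf_type", "InRoom"),
    ("personInst", "holding", "docInst"),
    ("inroomInst", "person", "personInst"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


def as_graph(triples):
    """View the same statements as labelled arcs: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


about_person = match(TRIPLES, subject="personInst")
typed_things = match(TRIPLES, predicate="rdf_type")
graph = as_graph(TRIPLES)
```

Because the collection is a set, the order of statements does not matter, which mirrors the "unordered collection" in the definition.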
RDF without a Schema
• Object -> Attribute -> Value triples
pers05 --Author-of--> ISBN...
• objects are web-resources
• Value is again an Object:
• triples can be linked
• data-model = graph
pers05 --Author-of--> ISBN... --Publby--> MIT
What does RDF Schema add?
• Defines vocabulary for RDF
• Organizes this vocabulary in a
typed hierarchy
• Class, subClassOf, type
• Property, subPropertyOf
• domain, range
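[Figure: example RDF Schema graph. Author and Reader are each subClassOf Person; communicatesTo has domain Author and range Reader; Frank has type Author, Lynda has type Reader, and Frank communicatesTo Lynda]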
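What a schema adds can be sketched as a few inference rules over triples. The example below is a simplified illustration in plain Python, not a full RDFS reasoner: it applies only the domain, range and subclass rules. The schema is one reading of the example diagram (the arrow directions are an assumption, since only the labels survive), and the function name rdfs_closure is made up.

```python
# Schema and data as read from the example diagram.
# Assumption: arrow directions are inferred from the labels.
SCHEMA = {
    ("Author", "subClassOf", "Person"),
    ("Reader", "subClassOf", "Person"),
    ("communicatesTo", "domain", "Author"),
    ("communicatesTo", "range", "Reader"),
}
DATA = {("Frank", "communicatesTo", "Lynda")}


def rdfs_closure(schema, data):
    """Apply three RDFS-style rules repeatedly until no new triple appears."""
    triples = set(schema) | set(data)
    while True:
        new = set()
        for s, p, o in triples:
            for s2, p2, o2 in triples:
                # Domain rule: (p domain C) and (s p o) give (s type C).
                if p2 == "domain" and s2 == p:
                    new.add((s, "type", o2))
                # Range rule: (p range C) and (s p o) give (o type C).
                if p2 == "range" and s2 == p:
                    new.add((o, "type", o2))
                # Subclass rule: (s type C) and (C subClassOf D) give (s type D).
                if p == "type" and p2 == "subClassOf" and s2 == o:
                    new.add((s, "type", o2))
        if new <= triples:
            return triples
        triples |= new


inferred = rdfs_closure(SCHEMA, DATA)
```

From the single data statement, the schema lets a program conclude that Frank is an Author, Lynda is a Reader, and both are Persons, none of which was stated directly.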
Which Semantic Web?
• Version 1:
"Semantic Web as Web of Data" (TBL)
• recipe:
expose databases on the web,
use XML, RDF, integrate
• metadata from:
– expressing DB schema semantics
in machine interpretable ways
• enable integration and unexpected re-use
Which Semantic Web?
• Version 2:
“Enrichment of the current Web”
• recipe:
Annotate, classify, index
• metadata from:
– automatically producing markup:
named-entity recognition,
concept extraction, tagging, etc.
• enable personalization, search, browse,..
Which Semantic Web?
• Version 1:
“Semantic Web as Web of Data”
• Version 2:
“Enrichment of the current Web”
→ Different use-cases
→ Different techniques
→ Different users
Semantic Web research
Four popular fallacies
about the Semantic Web
First: clear up some popular misunderstandings
False statement No. 1:
“Semantic Web people try to
enforce meaning from the top”
They only “enforce” a language.
They don’t enforce what is said in that language
Compare: HTML “enforced” from the top,
But content is entirely free.
First: clear up some popular misunderstandings
False statement No. 2:
“The Semantic Web people will require everybody to subscribe to a
single predefined "meaning" for the terms we use.”
Of course, meaning is fluid, contextual, etc.
Lots of work on (semi)-automatically
bridging between different vocabularies.
First: clear up some popular misunderstandings
False statement No. 3:
“The Semantic Web will require users to understand
the complicated details of formalized knowledge
representation.”
All of this is “under the hood”.
First: clear up some popular misunderstandings
False statement No. 4:
“The Semantic Web people will require us to manually markup all the
existing web-pages.”
Lots of work on automatically producing
semantic markup:
named-entity recognition,
concept extraction, etc.
Semantic Web research
The current state of Semantic Web
Advanced Search
Metadata and semantic web will make advanced search
much easier
Growth of web metadata.
Folksonomies!
Tools that automatically generate metadata
Crowdsourcing
TREC
Search for Web 3.0
• Natural language queries
• Search agent (avatar) understands and
anticipates your needs
• Personal life search with avatar
The Evolving Web
[Figure: timeline of the evolving Web, moving from DOCUMENTS to DATA/PROGRAMS, toward a Web of Knowledge]
1990: Foundation of the Current Web (HyperText Markup Language, HyperText Transfer Protocol)
2000: Self-Describing Documents (eXtensible Markup Language, Resource Description Framework)
2010: Shared terms/terminology, Machine-Machine communication (Proof, Logic and Ontology Languages)
Berners-Lee, Hendler; Nature, 2001
Semantic Web 2008 - ?
(Jim Hendler - internal talk, Microsoft Labs, July 2008)
Building DBPedia – starting the
semantic web
Semantic Web Companies
List of companies
Wikipedia
How would I check to see how much RDF is out there?
25th Anniversary of the WWW – 12 March 2014
A Bill of Rights for the World Wide Web
Web 4.0 :-?)
The next 5000 days of the web
• Kevin Kelly
– Founder of WIRED magazine
– Video
Web 4.0 Evolution
Web 4.0
• Machines talk back!
Search for Web 4.0
• We get real help
when we search!
Terminator: the
Sarah Connor
Chronicles
Cameron’s on our
side!
Web 2.0 vs Web 3.0
Web 3.0 applications
Everything on the web will be different – same impact as
natural language processing.
Web 4.0 will be the intelligent web with agents doing a lot of the
work.
What we covered
• Metadata
– XML
• The web of data
– XML, RDF, others
• Web 2.0
– The social web
• Web 3.0
– The semantic web
• Future of the web and web search