Making the Web Searchable
Peter Mika
Researcher, Data Architect
Yahoo! Research
Yahoo! Research (research.yahoo.com)
-2-
Yahoo! Developer Network (developer.yahoo.com)
-3-
Yahoo! Research Barcelona
• Established January, 2006
• Led by Ricardo Baeza-Yates
• Research areas
– Web Mining
• content, structure, usage
– Distributed Web retrieval
– Multimedia retrieval
– NLP and Semantics
-4-
Yahoo! by numbers (April, 2007)
• There are approximately 500 million users of Yahoo! branded services, meaning we reach 50 percent, or 1 out of every 2 users, online, the largest audience on the Internet (Yahoo! Internal Data).
• Yahoo! is the most visited site online with nearly 4 billion visits and an average of 30 visits per user per month in the U.S. and leads all competitors in audience reach, frequency and engagement (comScore Media Metrix, US, Feb. 2007).
• Yahoo! accounts for the largest share of time Americans spend on the Internet with 12 percent (comScore Media Metrix, US, Feb. 2007) and approximately 8 percent of the world’s online time (comScore WorldMetrix, Feb. 2007).
• Yahoo! is the #1 home page with 85 million average daily visitors on Yahoo! homepages around the world, an increase of nearly 5 million visitors in a month (comScore WorldMetrix, Feb. 2007).
• Yahoo!’s social media properties (Flickr, delicious, Answers, 360, Video, MyBlogLog, Jumpcut and Bix) have 115 million unique visitors worldwide (comScore WorldMetrix, Feb. 2007).
• Yahoo! Answers is the largest collection of human knowledge on the Web with more than 90 million unique users and 250 million answers worldwide (Yahoo! Internal Data).
• There are more than 450 million photos in Flickr in total and 1 million photos are uploaded daily. 80 percent of the photos are public (Yahoo! Internal Data).
• Yahoo! Mail is the #1 Web mail provider in the world with 243 million users (comScore WorldMetrix, Feb. 2007) and nearly 80 million users in the U.S. (comScore Media Metrix, US, Feb. 2007).
• Interoperability between Yahoo! Messenger and Windows Live Messenger has formed the largest IM community, approaching 350 million user accounts (Yahoo! Internal Data).
• Yahoo! Messenger is the most popular in time spent with an average of 50 minutes per user, per day (comScore WorldMetrix, Feb. 2007).
• Nearly 1 in 10 Internet users is a member of a Yahoo! Group (Yahoo! Internal Data).
• Yahoo! is one of only 26 companies to be on both the Fortune 500 list and Fortune’s “Best Place to Work” list (2006).
-5-
Agenda
• Metadata on the Web
– The annotated web
– Microsearch
– SearchMonkey
– Yahoo Open Strategy
• Toward Semantic Search
– And some of the research we do
• Hands-on
-6-
Metadata on the Web
Which Semantic Web?
• Semantic Web is a set of standards for encoding the
meaning of information in a machine-processible form
– To facilitate reasoning with information, in particular
aggregation
• Two approaches to the Semantic Web
– Linked Data
• Bringing the content of databases to the Web (linkeddata.org)
• Data linked to data, separate from content
– Annotated Web
• Annotating the content of Web resources (documents, multimedia)
• Data inside content
• This presentation is about the Annotated Web.
-8-
Brief history of the Annotated Web
• 1995: HTML meta tags
• 1996: Simple HTML Ontology Extensions (SHOE)
• 1998: RDF/XML
– RDF/XML in HTML
– RDF linked from HTML
• 2003: Web 2.0
– Tagging
– Microformats
– Metadata in Wikipedia
– Machine tags in Flickr
• 2005: eRDF
• 2008: RDFa
-9-
HTML meta tags
<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright"
href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF"
href= "http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>
- 10 -
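To make the meta-tag approach concrete, here is a minimal sketch of how a consumer might read such head metadata using only Python's standard library. The class and function names are illustrative and not part of any Yahoo! tool; the sample markup is the slide's example, lightly reformatted.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect name/content pairs from <meta> and rel/href pairs from <link>."""

    def __init__(self):
        super().__init__()
        self.meta = {}    # meta name -> content
        self.links = []   # (rel value, href) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "link" and "rel" in attrs:
            # rel may hold several space-separated values.
            for rel in attrs["rel"].split():
                self.links.append((rel, attrs.get("href", "")))


def extract_head_metadata(html):
    parser = HeadMetadataParser()
    parser.feed(html)
    return parser.meta, parser.links


SAMPLE = """<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
<META name="DC.author" content="Peter Mika">
<LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
<LINK rel="meta" type="application/rdf+xml" title="FOAF"
 href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
</HTML>"""

meta, links = extract_head_metadata(SAMPLE)
print(meta)
print(links)
```

The sketch treats every rel value as a separate link, which is why the "DC.rights copyright" element yields two pairs.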
SHOE example
(Heflin & Hendler, 1996)
<ONTOLOGY "our-ontology" VERSION="1.0">
<ONTOLOGY-EXTENDS "organization-ontology" VERSION="2.1" PREFIX="org"
URL="http://www.ont.org/orgont.html">
<ONTDEF CATEGORY="Person" ISA="org.Thing">
<ONTDEF RELATION="lastName" ARGS="Person STRING">
<ONTDEF RELATION="firstName" ARGS="Person STRING">
<ONTDEF RELATION="marriedTo" ARGS="Person Person">
<ONTDEF RELATION="employee" ARGS="org.Organization Person">
</ONTOLOGY>
<HEAD>
<META HTTP-EQUIV="Instance-Key" CONTENT="http://www.cs.umd.edu/~george">
<USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="http://ont.org/our-ont.html">
</HEAD>
<BODY>
<CATEGORY "our.Person">
<RELATION "our.marriedTo" TO="http://www.cs.umd.edu/~helena">
<RELATION "our.employee" FROM="http://www.cs.umd.edu">
My name is
<ATTRIBUTE "our.firstName"> George </ATTRIBUTE>
<ATTRIBUTE "our.lastName"> Cook </ATTRIBUTE> and I live at...
- 11 -
SHOE system
- 12 -
SHOE Text-based query interface
- 13 -
SHOE Graphical Query Interface
- 14 -
Example: Creative Commons
Embedding CC license in HTML (now deprecated):
<HTML>
<HEAD>… </HEAD>
<BODY>
…
<!-- <rdf:RDF xmlns="http://creativecommons.org/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<Work rdf:about="http://www.yergler.net/averages/">
<dc:title>The Law of Averages</dc:title>
<dc:description>...because eventually i&apos;ll be right...</dc:description>
<license rdf:resource="http://creativecommons.org/licenses/by-nc/1.0/" />
</Work>
<License rdf:about="http://creativecommons.org/licenses/by-nc/1.0/">
<requires rdf:resource="http://web.resource.org/cc/Notice" />
<permits rdf:resource="http://web.resource.org/cc/Reproduction" />
<permits rdf:resource="http://web.resource.org/cc/Distribution" />
<prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
</License>
</rdf:RDF>
-->
- 15 -
Example: Creative Commons
• Current: rel attribute (HTML4)
This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons
Attribution 3.0 United States License</a>.
• Use of the “rel” attribute for semantic annotation is the birth of
the microformat…
- 16 -
Example: microformats
<div class="vcard">
<a class="email fn" href="mailto:[email protected]">Joe Friday</a>
<div class="tel">+1-919-555-7878</div>
<div class="title">Area Administrator, Assistant</div>
</div>
<cite class="vcard">
<a class="fn url" rel="friend colleague met"
href="http://meyerweb.com/">Eric Meyer</a>
</cite> wrote a post (<cite>
<a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">
Tax Relief</a></cite>) about an unintentionally humorous letter
he received from the <span class="vcard">
<a class="fn org url" href="http://irs.gov/">
Internal Revenue Service</a> </span>.
- 17 -
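As a rough illustration of how class-based markup like the hCard examples can be read back out, the sketch below walks the HTML with Python's standard parser. It is a simplification and not a conforming microformats parser: it assumes well-formed markup without void elements such as <br>, it handles only the handful of properties listed in the two constants, and the e-mail address in the sample is a hypothetical placeholder.

```python
from html.parser import HTMLParser

TEXT_PROPERTIES = {"fn", "tel", "title", "org"}   # value is the element text
HREF_PROPERTIES = {"url", "email"}                # value is the href attribute


class HCardParser(HTMLParser):
    """Very small hCard reader: one dict per class="vcard" element."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._open = []   # one (is_vcard_root, text_properties) entry per open element

    def _inside_card(self):
        return any(is_root for is_root, _ in self._open)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        is_root = "vcard" in classes
        if is_root:
            self.cards.append({})
        props = set()
        if is_root or self._inside_card():
            props = classes & TEXT_PROPERTIES
            card = self.cards[-1]
            for name in classes & HREF_PROPERTIES:
                if attrs.get("href"):
                    card[name] = attrs["href"]
        self._open.append((is_root, props))

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        if not self._inside_card():
            return
        card = self.cards[-1]
        for _, props in self._open:
            for name in props:
                # Append the text and normalize whitespace.
                card[name] = " ".join((card.get(name, "") + " " + data).split())


SAMPLE = """
<div class="vcard">
<a class="email fn" href="mailto:joe@example.org">Joe Friday</a>
<div class="tel">+1-919-555-7878</div>
<div class="title">Area Administrator, Assistant</div>
</div>
<cite class="vcard">
<a class="fn url" rel="friend colleague met"
href="http://meyerweb.com/">Eric Meyer</a>
</cite> wrote a post.
"""

parser = HCardParser()
parser.feed(SAMPLE)
cards = parser.cards
print(cards)
```

Note how much of the logic is specific to hCard: each microformat needs its own rules of this kind, which is the "no shared syntax" point made on the next slide.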
microformats
• microformats.org
• Originated by Tantek Celik and others
• Agreements on the way to encode certain kinds of metadata in HTML
– Reuse of semantic-bearing HTML elements
– Based on existing standards
– Community process
– Persons, events, listings etc. but also syntactic metadata: licenses, tags
• Microformats have no shared syntax
– Each microformat has a separate syntax tailored to the vocabulary
• Microformats are not ontologies
– No formal descriptions of schema, only text
– Limited reuse, extensibility of schemas
– No datatypes
– No namespaces, unique identifiers (URIs)
• no interlinking
• mapping between instances is required
– Relationship to page context is unclear
• Widely used in millions of documents
– User-generated as well as automatically generated
- 18 -
RDF-based annotation #1: eRDF
• eRDF
– Ian Davis (Talis)
– Embedding RDF in HTML
• Straightforward mapping to RDF triples (XSLT available)
• HTML4 compatible
– More complex than microformats
• Use any RDF/OWL vocabulary
• Reuse of semantic-bearing HTML elements is limited
– More limited than RDF
• No blank nodes
• No data types
• No statements about subjects other than the current document
– Limited usage
- 19 -
RDF-based annotation #2: RDFa
• RDFa
– World Wide Web Consortium (W3C) last call document
– Similar intent as eRDF, but full RDF support
• Requires XHTML
– Big question: user complexity (→ data quality)
<p typeof="contact:Info" about="http://example.org/staff/jo">
<span property="contact:fn">Jo Smith</span>.
<span property="contact:title">Web hacker</span> at
<a rel="contact:org" href="http://example.org"> Example.org </a>.
You can contact me <a rel="contact:email"
href="mailto:[email protected]">
via email </a>.
</p> ...
- 20 -
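The following sketch shows, under strong simplifying assumptions, how the attributes in the RDFa example map to triples. It is not a conforming RDFa processor: it only understands about, typeof, property, rel, href and resource, it does not expand prefixes such as contact:, and it assumes well-formed markup without void elements. The class name and the base URL argument are illustrative.

```python
from html.parser import HTMLParser


class SimpleRDFaParser(HTMLParser):
    """Collect (subject, predicate, object) triples from a subset of RDFa."""

    def __init__(self, base):
        super().__init__()
        self.triples = []
        self._subjects = [base]   # subjects in scope; the page itself by default
        self._open = []           # per open element: [pushed_subject, property, subject, text]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        pushed = "about" in attrs
        if pushed:
            self._subjects.append(attrs["about"])   # about sets a new subject
        subject = self._subjects[-1]
        if "typeof" in attrs:
            self.triples.append((subject, "rdf:type", attrs["typeof"]))
        target = attrs.get("resource") or attrs.get("href")
        if "rel" in attrs and target:
            self.triples.append((subject, attrs["rel"], target))   # resource object
        self._open.append([pushed, attrs.get("property"), subject, ""])

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for entry in self._open:
            if entry[1]:
                entry[3] += data

    def handle_endtag(self, tag):
        if not self._open:
            return
        pushed, prop, subject, text = self._open.pop()
        if prop:
            self.triples.append((subject, prop, " ".join(text.split())))   # literal object
        if pushed:
            self._subjects.pop()


SAMPLE = """
<p typeof="contact:Info" about="http://example.org/staff/jo">
<span property="contact:fn">Jo Smith</span>.
<span property="contact:title">Web hacker</span> at
<a rel="contact:org" href="http://example.org"> Example.org </a>.
</p>
"""

parser = SimpleRDFaParser(base="http://example.org/page")
parser.feed(SAMPLE)
triples = parser.triples
for triple in triples:
    print(triple)
```

Unlike the microformat case, the same generic rules apply whatever vocabulary is used, at the price of more attributes for the page author to get right.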
From Microsearch to SearchMonkey
Microsearch
• Metadata is out there
– Just how much data is out there?
– What is the quality?
• Idea: bring metadata to the surface of search
• How does it work?
– User enters query
– Metadata is extracted dynamically
– Entity reconciliation
– Metadata is used to display
• rich abstracts,
• related pages
• spatial, temporal visualization
• Microsearch prototype
- 22 -
Example: ivan herman
Geolocation
Rich abstract
Related pages
based on
metadata
Events from
personal calendar,
Conferences, and
bio from LinkedIn
- 23 -
Example: peter site:flickr.com
Flickr users
named “Peter” by
geography
- 24 -
Example: san francisco conference
Conferences in
San Francisco by
date
- 25 -
Example: greater st. peter
Call phone
number
Save to
address book
(other actions)
- 26 -
Lessons
• More metadata than we expected
– 53% of unique queries have at least one metadata-enabled
page in top 10 (n=7848)
• Performance is poor
– Metadata needs to come from the index for performance
• ‘Metacrap’ does exist
– Users have to see metadata to spot mistakes in their markup,
warn others
• RDF templating (Fresnel) adds complexity
– Abstract needs to be customized to the particular site, query
- 27 -
SearchMonkey
• Creating an ecosystem of publishers, developers and end users
– Motivating and helping publishers to implement semantic
annotation
– Providing tools for developers to create compelling applications
– Focusing on end-user experience
• Rich abstracts as a first application
• Addressing the long tail of query and content production
• Standard Semantic Web technology
– dataRSS = Atom + RDFa
– Industry standard vocabularies
• http://developer.yahoo.com/searchmonkey/
- 28 -
What is SearchMonkey?
an open platform for using structured data to build more
useful and relevant search results
Before
After
- 29 -
Enhanced Result
deep links
image
name/value
pairs or
abstract
- 30 -
Infobar
- 31 -
SearchMonkey
1
site owners/publishers share structured data with Yahoo!.
2
site owners & third-party developers build SearchMonkey apps.
3
consumers customize their search experience with Enhanced Results or Infobars
Page Extraction
RDF/Microformat Markup
Acme.com’s
Web Pages
Index
DataRSS feed
Web Services
Acme.com’s
database
- 32 -
DataRSS
• An Atom extension for structured data
• Why a new format?
– A feed format is required by publishers
• Exclusive content (e.g. partnerships, paid inclusion)
• No changes necessary to the web page
• No standard named graph format for the Semantic Web
– Needed to capture meta-metadata such as source and timestamp of
information
– Not really a new format
• An Atom extension
• Use any RDFa parser to get the triples out
• cf. Google Base feeds
- 33 -
DataRSS
<?profile http://search.yahoo.com/searchmonkey-profile ?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2005/Atom ../latest/xsd/datarss.xsd">
  <id>http://www.linkedin.com/datarss/</id>
  <author>
    <name>Peter Mika ([email protected])</name>
  </author>
  <title>Example data feed for social</title>
  <updated>2007-11-14T04:05:06+07:00</updated>
  <entry>
    <!-- title field of entry is not used for anything -->
    <title>Peter Mika</title>
    <!-- URL of the webpage extracted from -->
    <id>http://www.linkedin.com/ppl/webprofile?id=5054019</id>
    <updated>2007-11-14T04:05:06+07:00</updated>
    <content type="application/xml">
      <y:adjunct version="1.0" name="social-simple" xmlns:y="http://search.yahoo.com/datarss/">
        <y:item rel="dc:subject">
          <y:type typeof="foaf:Person">
            <y:meta property="foaf:name">John Doe</y:meta>
            <y:meta property="foaf:gender">male</y:meta>
            <y:item rel="foaf:homepage" resource="http://www.joeisageek.com"/>
            <y:item rel="foaf:mbox" resource="mailto:[email protected]"/>
            <y:item rel="foaf:weblog" resource="http://johnblog.example.org"/>
            <y:item rel="foaf:knows">
              <y:type typeof="foaf:Person">
                <y:meta property="foaf:name">Jane Doe</y:meta>
                <y:meta property="foaf:gender">female</y:meta>
                <y:item rel="foaf:mbox" resource="mailto:[email protected]"/>
              </y:type>
            </y:item>
          </y:type>
        </y:item>
      </y:adjunct>
    </content>
  </entry>
</feed>
- 34 -
Atom 1.0
XML + RDFa
The data part
<adjunct version="1.0" id="com.yahoo.page.rdfa" xmlns="http://search.yahoo.com/datarss/"
         updated="2007-11-14T04:05:06+07:00">
  <item rel="dc:subject">
    <type typeof="foaf:Person">
      <meta property="foaf:name">John Doe</meta>
      <meta property="foaf:gender">male</meta>
      <item rel="foaf:homepage" resource="http://www.joeisageek.com"/>
      <item rel="foaf:mbox" resource="mailto:[email protected]"/>
      <item rel="foaf:weblog" resource="http://johnblog.example.org"/>
      <item rel="foaf:knows">
        <type typeof="foaf:Person">
          <meta property="foaf:name">Jane Doe</meta>
          <meta property="foaf:gender">female</meta>
          <item rel="foaf:mbox" resource="mailto:[email protected]"/>
        </type>
      </item>
    </type>
  </item>
</adjunct>
- 35 -
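One plausible way to read the data part is as nested triples: each item is a relation, each type introduces a new (blank) node, and each meta is a literal property of that node. The sketch below flattens an adjunct along those lines with Python's standard XML library. This interpretation, the blank-node naming (_:b1, _:b2) and the function name are assumptions made for illustration, not the official DataRSS processing model; the sample is a shortened version of the slide's adjunct.

```python
import itertools
import xml.etree.ElementTree as ET

NS = "{http://search.yahoo.com/datarss/}"   # DataRSS namespace from the slide


def adjunct_to_triples(xml_text, page_url):
    """Flatten a DataRSS adjunct into (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text)
    triples = []
    counter = itertools.count(1)

    def walk_item(item, subject):
        rel = item.get("rel")
        resource = item.get("resource")
        if resource is not None:
            # Relation pointing at an identified resource.
            triples.append((subject, rel, resource))
            return
        for type_el in item.findall(NS + "type"):
            node = "_:b%d" % next(counter)   # hypothetical blank-node label
            triples.append((subject, rel, node))
            triples.append((node, "rdf:type", type_el.get("typeof")))
            for meta in type_el.findall(NS + "meta"):
                triples.append((node, meta.get("property"), (meta.text or "").strip()))
            for child in type_el.findall(NS + "item"):
                walk_item(child, node)

    for item in root.findall(NS + "item"):
        walk_item(item, page_url)
    return triples


SAMPLE = """
<adjunct version="1.0" id="com.yahoo.page.rdfa" xmlns="http://search.yahoo.com/datarss/">
  <item rel="dc:subject">
    <type typeof="foaf:Person">
      <meta property="foaf:name">John Doe</meta>
      <item rel="foaf:homepage" resource="http://www.joeisageek.com"/>
      <item rel="foaf:knows">
        <type typeof="foaf:Person">
          <meta property="foaf:name">Jane Doe</meta>
        </type>
      </item>
    </type>
  </item>
</adjunct>
"""

page = "http://www.linkedin.com/ppl/webprofile?id=5054019"
triples = adjunct_to_triples(SAMPLE, page)
for triple in triples:
    print(triple)
```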
Developer tool
- 36 -
Developer tool
- 37 -
Developer tool
- 38 -
Developer tool
- 39 -
Developer tool
- 40 -
Gallery
- 41 -
Example apps
• LinkedIn
– hCard plus feed data
• Creative Commons by Ben Adida
– CC in RDFa
- 42 -
Example apps. II.
• Other me by Dan Brickley
– Google Social Graph API wrapped using a Web Service
- 43 -
What happened since the launch?
• It’s starting to work!
– Click rates improve → Publishers are willing to invest → More
structured data → More applications → More users → Click
rates improve
• Increasing excitement all around
– Standardization of RDFa is bringing new energy
– Good market for companies that help publishers to ‘semantify’
or support developers in extracting structured data from web pages
• OpenCalais, Dapper, AdaptiveBlue, Intel MashMaker, Zemanta…
• There have been some lessons learned…
- 44 -
Lessons learned: data quality
• Publishers/developers want the quick and dirty answer, not the
long and clean one
• Resource or literal?
– <meta property="vcard:url">http://www.example.org</meta>
– <meta property="vcard:tel">0034691792522</meta>
• Webpage or resource?
– Should we allow a resource to have the same URI as an existing webpage?
– This is the default in eRDF/RDFa!
• <div class="foaf-name">Peter Mika</div>
Complexity of the formalism = Data quality down
• Types vs. datatypes
– <meta property="use:email">[email protected]</meta>
• Extensibility
– rdfs:movies
- 45 -
Lessons learned: vocabularies
• Coverage is small
– Books, movies, stuff people care about…
• Competing proposals
– Versions floating around
• Not maintained
– I cannot maintain your vocabulary for you
Distributed ontology development = Mess
• Vocabularies for microformats
– A must
• The role of the W3C
– Ontologies as member submission….
• Vocabularies not designed for the annotated Web
- 46 -
Lessons learned: eRDF
• Difficult for complex pages and dangerous in non-expert
hands
– Serious limitations
• No datatypes
• No subjects other than identifiers within the current page
• Reuse of the id attribute
<div id="blue_panel">
<span class="foaf-name">Peter Mika</span>
<div id="white_clickable_thingy_on_the_right">
<span class="foaf-mbox">[email protected]</span>
….
- 47 -
Lessons learned: RDFa
• A huge improvement
– E.g. no repurposing of HTML attributes
• Still, not everything is intuitive to the uninitiated:
<div about="#id">
  <span property="foaf:name">Peter Mika</span>
  <span rel="foaf:img" typeof="foaf:Image">
    <span property="dc:format">jpg</span>
    …
  </span>
</div>
<div about="#id">
  <span property="foaf:name">Peter Mika</span>
  <span rel="foaf:img" resource="http://www.example.org/photo.jpg">
    <span typeof="foaf:Image">
      <span property="dc:format">jpg</span>
    </span>
  </span>
</div>
- 48 -
The Yahoo Open Strategy
Y!OS 1.0
• Yahoo! Open Strategy
– Build Your Own Search Service (BOSS)
– Yahoo! Query Language (YQL)
• Access (other) web services using a SQL-like language
– Yahoo! Social Platform
• Profiles, Connections, Updates, Contacts and Status
– Yahoo! Application Platform (YAP)
• Developer hosted execution of applications with access to Yahoo's
Social APIs and YQL;
• Support for OpenSocial's JavaScript API; and
• Support for server-side YML tags.
• Future: run applications on Y! sites
• OpenID, OAuth
- 50 -
BOSS: Build your Own Search Service
• Becoming a serious player in search requires enormous
investments in basic infrastructure
• At the same time many companies have tremendous assets
to improve search
– User profiles and social graphs, natural language analytics,
visualization frameworks, bookmarks & tags etc.
• Bring back innovation to search by allowing anyone to build upon
the basic Yahoo infrastructure
– Ability to re-order results and blend in additional content
– No restrictions on presentation
– No branding or attribution
– Access to multiple verticals (web search, image, news)
– 40+ supported language and region pairs
- 51 -
BOSS: Build your Own Search Service
• http://developer.yahoo.com/search/boss/
• Three flavors
– BOSS API
– BOSS Custom (e.g. TechCrunch)
– BOSS Academic
• Pricing
– Usage based
– 10000 queries a day free of charge
– Monetization is completely open
– http://developer.yahoo.com/search/boss/fees.html
- 52 -
Accessing structured data via BOSS
• Query for all pages with hResume microformat and the word
‘php’ on them
• http://boss.yahooapis.com/ysearch/web/v1/searchmonkeyid:com.yahoo.page.uf.hresume%20php?view=searchmonkey_rdf&appid={APPID}&format=xml
<ysearchresponse xmlns="http://www.inktomi.com/" responsecode="200">
<nextpage><![CDATA[/ysearch/web/v1/searchmonkeyid:com.yahoo.page.uf.hresume%20php?..
<resultset_web count="10" start="0" totalhits="776" deephits="593000">
<result>
<abstract><![CDATA[View <b>PHP</b> Programmer's professional profile on LinkedIn.
<clickurl>http://lrd.yahooapis.com/_ylc=X3oDMTV
<date>2008/11/11</date>
<dispurl><![CDATA[www.<b>linkedin.com</b>/in/alexct]]></dispurl>
<searchmonkey_rdf><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="http://www.linkedin.com/in/alexct"><dc:subject
xmlns:dc="http://purl.org/dc/terms/" rdf:nodeID="id1563413635"/>…
- 53 -
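The request URL on this slide can be assembled programmatically. The sketch below only builds the string and makes no network call, since the BOSS v1 endpoint shown here may no longer be available. The function name is illustrative, and "APPID" stands in for a real application ID.

```python
from urllib.parse import quote, urlencode

BOSS_WEB_SEARCH = "http://boss.yahooapis.com/ysearch/web/v1/"   # endpoint as shown on the slide


def boss_structured_query_url(searchmonkey_id, terms, appid, fmt="xml"):
    """Build a BOSS web-search URL restricted to pages carrying a given kind of structured data."""
    # The query goes into the URL path; spaces become %20, the colon is kept as-is.
    query = "searchmonkeyid:%s %s" % (searchmonkey_id, terms)
    params = urlencode({
        "view": "searchmonkey_rdf",   # ask for the extracted RDF with each result
        "appid": appid,
        "format": fmt,
    })
    return BOSS_WEB_SEARCH + quote(query, safe=":") + "?" + params


url = boss_structured_query_url("com.yahoo.page.uf.hresume", "php", appid="APPID")
print(url)
```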
Yahoo Query Language (YQL)
• http://developer.yahoo.com/yql/
• Get microformat/RDF data for a single URL
– select * from microformats where url='http://wait-till-i.com'
• Mash up with other Yahoo! data
– Pizza places near Sunnyvale:
• select Url from local.search where zip='94085' and query='pizza'
– Metadata about pizza places in Sunnyvale
• select * from microformats where url in (select Url from
local.search where zip='94085' and query='pizza')
• PHP code samples
• YQL console
- 54 -
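The mash-up above is simply one YQL statement nested inside another. A small sketch of composing the statements from the slide as strings follows; the helper names are illustrative, and because YQL's quoting rules are not covered here, the sketch rejects values containing single quotes instead of guessing at an escape syntax.

```python
def yql_literal(value):
    """Wrap a value in single quotes; refuse embedded quotes rather than guess at escaping."""
    if "'" in value:
        raise ValueError("single quotes are not supported in this sketch")
    return "'%s'" % value


def local_search_urls(zip_code, query):
    """Statement returning the URLs of local businesses matching a query."""
    return "select Url from local.search where zip=%s and query=%s" % (
        yql_literal(zip_code), yql_literal(query))


def microformats_for_url(url):
    """Statement returning the microformat/RDF data found on a single page."""
    return "select * from microformats where url=%s" % yql_literal(url)


def microformats_for_subquery(subquery):
    """Statement returning metadata for every URL produced by another statement."""
    return "select * from microformats where url in (%s)" % subquery


single = microformats_for_url("http://wait-till-i.com")
nested = microformats_for_subquery(local_search_urls("94085", "pizza"))
print(single)
print(nested)
```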
Future plans
• Making it easier for developers to annotate their pages
– Integration with Intel MashMaker is a first step
• More default applications
– SearchMonkey as customization
• Exposing more and more data publicly
– Adding new microformats
– Exposing custom data services (with publishers' permission)
• Other types of applications
– Applications that act on multiple results or adapt to the query
– Different placements for applications
- 55 -
Yahoo and You
• Help us build the Semantic Web
– Create, share and reuse ontologies
• Participate in (or organize!) events like VoCamp
– Implement RDFa and help others to do so
– Give us feedback on our own efforts
• SearchMonkey, BOSS, markup in our own sites etc.
– Tell us what you do and how we can help you
- 56 -
the monkey is out!
- 57 -