
Free your Data: Instant Gratification with the Semantic Web
David Karger

Why everyone should be their own database administrator, UI designer, application developer, and web site builder, and how they can
A Semantic Web Vision
• Autonomous computational agents perform sophisticated information tasks on behalf of their human users
• Use data that is annotated with rich semantics
– Ontologies that explain precisely what the data means
– Schema annotations that explain how to align multiple ontologies
– Rules that explain how new data can be formally derived from existing data
– Inference systems that put it all together
– Lots of logicians and AI researchers are developing tools
• This vision is frightening
– Involves solving problems that have bedeviled AI for decades
– This fear is often used to attack the semantic web
– Or to argue for slowing deployment
* “we can’t put up that data until we have an ontology!”
Aim Lower: the Semimantic Web
• Not “make computers help” but “make them not hinder”
– “First, do no harm”
• Create a tiny bit of structure:
– Name objects (with URLs)
– Record named relations between them
– No semantics on relations
– No schemas
– No inference
• This is both
– Technically simple
– Immediately useful
• You should do it
– And you can right now
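The “tiny bit of structure” above fits in a few lines of code. A minimal sketch in Python, assuming nothing beyond the slide’s own rules; the URLs and relation names are invented for illustration:

```python
# A minimal "semimantic" data model: no schemas, no inference --
# just named objects (URLs) and named relations between them.
# All URLs and property names below are made up for illustration.

triples = set()

def add(subject, relation, obj):
    """Record a named relation between two objects."""
    triples.add((subject, relation, obj))

def links_from(subject):
    """All (relation, object) pairs recorded for a subject."""
    return {(r, o) for (s, r, o) in triples if s == subject}

add("http://example.org/song/42", "artist", "http://example.org/person/ella")
add("http://example.org/song/42", "tempo", "120 bpm")  # a field no app planned for

print(links_from("http://example.org/song/42"))
```

Nothing here requires agreeing on what “artist” or “tempo” mean; that is exactly the point.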
Why Applications?
• Typical user tasks require interaction with multiple pieces of information
– Display
– Explore
– Query
– Manipulate
• Applications bring together the data, specialized views, and
operations necessary to perform tasks
[Example: annotations on a music-library screenshot]
• “Artist”
– Of dance, not music
– ID3v2 added “Composer”
– Shown in the wrong place
• Menu of genre choices
– My genre (of dance, not music) is missing
– ID3v2 lets the user add one
• No “difficulty” field
• Irrelevant info
– Distracting
– Covers up more important info
• Workaround: place it in the comment field
– Uses the field up
– Where to put “tempo”?
Summary of Problems
• Application has fixed idea of “right” data
– Both properties and values for them
• And right way to display that data
• User wants to “stretch” the app to their needs
– Cannot hide irrelevant data
– Cannot incorporate new kinds of data
– Cannot change how data is presented
• Perhaps just use generic comment field?
– Add what you want
– Format how you want
• Properties have structure
– Used for layout
– And for browsing
Sometimes, one application isn’t enough
• Applications inappropriately partition task
– Because task wasn’t planned for in application design
• No application has all the necessary data, operations
– Need to launch several to do task
• Each includes unneeded data, operations
– Clutter distracts from what you need to see
• Can’t work with data “across” application boundaries
– Can’t record or view data connections
– Have to find it again in second application
– Or enter it manually a second time
* Type budget numbers onto post-its to move them to the other application
Why?
• Building applications is hard
– Done by expert few for the many
– They determine which data, views, operations are useful
• Applications are “mass produced”
– Everybody gets the same one
– And only build for large markets
– Word processor, email, photo album, …
• Problem: different people want different applications
– Basket weaving, UFO sightings, junkyard management
– Want to work with unusual information
– Want to see, navigate, manipulate it “their way”
• Developers can’t afford to build these boutique applications
What about the Web?
• Anything can get a URL
• Anything can go in a page, linked to anything
– Common to “schematize on the fly”, making lists of interesting properties/values
• Support for orienteering
– Scan the list of choices
– Pick the one that seems to lead in the right direction
– Fact: people orienteer even when there’s an easy query that is faster
– On the web, you never bounce off an application boundary
Downside
• Hard to author
– Especially if I want to record lots of complex data
• Hard to manipulate, do complex queries
– HTML loses meaning of data
– Can’t “switch to tabular view”
• That’s why web sites are backed by databases
– Data is kept structured to support complex queries
– Templating engines convert to human readable presentation
• End users aren’t going to manage this kind of web site
• Gives powerful operations, but only “inside” web site
– User may discover need to cross site boundaries
– Like applications, web sites create (possibly wrong) data partitions
– So all the problems with applications apply here too
Not just music
• Scientific research generates masses of data
– E.g. Bioinformatics
• Others want to access that data
• Big standards bodies meet to decide on community
standard formats and systems under which everyone will
distribute data
• When a scientist wants to try or report something new, or needs data from outside the community, they are stuck
Information Wants to be Free
• Applications and Web Sites make assumptions about how
their data will be used
• Those assumptions are hard-coded into the interaction with
the data
• But no developer can predict all uses of the data
• Fixed interfaces prevent data repurposing
• Solution: give direct access to the data
• Just set up a SQL server?
– (A long-running screed of the DB community)
But it Can’t be Just about the Data
• People need to look at the data
– (unless we figure out those autonomous agents…)
• And need to create it in the first place
• Apps and template-driven web sites give us nice interfaces for interacting with the data they manage
• But if we use them we can’t repurpose the data
• And what interface can we use for the repurposed data?
• Web needed a server (of data) and a client (to show it)
• How to make viewing, authoring, and repurposing arbitrary data as easy as viewing and authoring web pages?
– Without knowing precisely what data people will want to view, or how they will want to view it?
Example: Piggy Bank
• I need data from more than one web site
• And I need to look at it differently than any web site
• What is the minimum necessary support?
• Piggy Bank: a Firefox plugin for navigating structured data
• Find some movies
• Free that data
• Show it a different way
• Combine it with other sources
Mash Ups?
• Developer decides to integrate data from multiple sites
• Writes programmatic “scrapers”
– reverse the web site’s templating process to recover data
• Combines resulting data structures
• Presents using their own template driven web site
– Thus guilty of same sin as the one they are fighting
– I only get the mash-ups a programmer decides to create
• Piggy Bank lets end users do their own mashing
Data Model: RDF
• W3C standard
• Minimum data model
– URL for arbitrary objects
– Arbitrary named links between two objects
– No schemas
• Much like the web, except
– URLs need not be web pages
– Machine-readable “anchor text” in links
• Yet powerful
– Relations are natural/universal
– Represent a semantic network
[Diagram: a Movie node (“Superman”, 8PM) linked to a Theater node (Loew’s, Kendall Sq.)]
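The movie diagram on this slide can be written down as plain triples. A sketch (the URLs and property names are invented; RDF itself does not prescribe them):

```python
# The slide's diagram as RDF-style triples: a Movie linked to a Theater.
# The prefixes, URLs, and property names here are invented for illustration.
graph = [
    ("ex:movie1",   "rdf:type",    "ex:Movie"),
    ("ex:movie1",   "ex:title",    "Superman"),
    ("ex:movie1",   "ex:showtime", "8PM"),
    ("ex:movie1",   "ex:playsAt",  "ex:theater1"),
    ("ex:theater1", "rdf:type",    "ex:Theater"),
    ("ex:theater1", "ex:name",     "Loew's"),
    ("ex:theater1", "ex:location", "Kendall Sq."),
]

# The machine-readable "anchor text" is the predicate on each link:
for s, p, o in graph:
    if p == "ex:playsAt":
        print(s, "plays at", o)
```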
Are we done?
• Is RDF the only answer?
– SQL/tuples, XML can represent the same info
– So any would do
– And the user shouldn’t have to know which we’ve chosen
– But RDF is easiest to create sloppily, incrementally
* So best suited to let enthusiasts create some
– And imposes the fewest requirements to be “compatible”
• Is RDF the whole answer?
– Still unclear how to interact with it
Visualization
Lenses
• If data is amorphous, monolithic UI won’t do
– Can’t know in advance what kind of data we’ll need to display
– Or what user will want to do with that data
• Let each type come with “view prescription”
– “To display a document, show its title, author, and abstract”
– “To display a person, show his name and affiliation”
– Specifies properties to show, and “decoration” (fonts, layouts)
• After you get the data, assemble lenses to show it
– (recursively)
• Lenses are described in RDF
– So they can be collected, repurposed like any other data
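A lens “prescription” and its recursive assembly can be sketched as plain data plus a small interpreter. Everything below (the types, property lists, and sample objects) is illustrative, not Haystack’s actual lens format:

```python
# A lens is data, not code: for each type it lists which properties to show.
# Types, property names, and sample data are invented for illustration.
lenses = {
    "Document": ["title", "author", "abstract"],
    "Person":   ["name", "affiliation"],
}

def render(obj, objects):
    """Assemble lenses recursively: if a shown value is itself an
    object with a known type, render it with its own lens."""
    lines = []
    for prop in lenses[obj["type"]]:
        value = obj.get(prop)
        if isinstance(value, str) and value in objects:
            lines.append(prop + ":")
            lines.extend("  " + l for l in render(objects[value], objects))
        elif value is not None:
            lines.append(f"{prop}: {value}")
    return lines

objects = {
    "doc1":    {"type": "Document", "title": "Free your Data",
                "author": "person1", "abstract": "Aim lower."},
    "person1": {"type": "Person", "name": "David Karger",
                "affiliation": "MIT"},
}
print("\n".join(render(objects["doc1"], objects)))
```

Because the lens table is just data, it could itself be stored and exchanged as triples, which is the point of describing lenses in RDF.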
Fresnel
dsp:publicationLens rdf:type :Lens ;
    :classLensDomain ow:Publication ;
    :group gr:group ;
    :purpose :defaultLens ;
    :showProperties (
        dc:description
        dc:identifier
        dc:creator
        dc:contributor
        dc:date
        dc:subject
        dc:type
        dc:publisher
        dc:rights ) .

dsp:rightsFormat rdf:type :Format ;
    :group gr:group ;
    :propertyFormatDomain dc:rights ;
    :propertyStyle "dspace-rights" .
Benefits
• Data collected from anywhere can be viewed together
– Each piece of data with its own lens
• Lenses are described, not programmed
– Enthusiasts can write their own
– (especially if we give them wysiwyg tools)
– No need to build a template-driven web site
– Just edit, publish some lenses
Manipulation
Application Development by End Users
• People want applications to manipulate their data
• But applications only manipulate the developer’s data
• So let end users build their own
• Use lenses, but refract in both directions
– Lenses describe how to map data to presentation
– Invert: interpret manipulation of the presentation as manipulation of the data
* (extend lenses to talk about click, drag, drop)
• Operations represented as web services
– Internal and remote operations
– Receive RDF data and act on it
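As a sketch of that last point: an operation is anything that receives RDF and acts on it. A local stand-in for such a service, with invented names throughout:

```python
# Sketch: an "operation" as a service that receives RDF and acts on it.
# Here a plain function stands in for the web service; the types and
# property names are invented for illustration.

def mark_as_read(triples):
    """Example operation: for every message in the input graph,
    derive a (message, ex:status, "read") triple."""
    messages = {s for (s, p, o) in triples
                if p == "rdf:type" and o == "ex:Message"}
    return [(m, "ex:status", "read") for m in sorted(messages)]

inbox = [
    ("ex:msg1", "rdf:type",   "ex:Message"),
    ("ex:msg2", "rdf:type",   "ex:Message"),
    ("ex:msg1", "ex:subject", "hello"),
]
print(mark_as_read(inbox))
```

Because the input and output are both triples, the same operation works on data gathered from any source, internal or remote.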
The Big Picture
Sufficient for Nice Applications?
• Application design is impoverished
– Divide up the screen
– Put an object in each piece
– Show properties of each object
– With pretty formatting
– Put operations in menus
– And add some toolbars to save time
• This application “vocabulary” is limited enough
– to be described instead of programmed
– so it can be edited by end users
Workspace Designer
• Editing mode for applications
• Define regions of screen
– By splitting existing regions
• Resize Regions
• Specify content of each region
– Object to be shown (drag and drop object)
– Lens to use to show the object (menu of relevant lenses)
– Operations to make available on object (drag operations)
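Because the workspace is specified rather than programmed, it can be represented as plain data. A hypothetical encoding (the field names are invented, not Haystack’s format):

```python
# A workspace as pure description: regions, each with an object to show,
# a lens to show it with, and the operations dragged onto it.
# Field names and object IDs are invented for illustration.
workspace = {
    "split": "horizontal",
    "regions": [
        {"object": "ex:paper1", "lens": "Document", "operations": ["print"]},
        {"object": "ex:todo",   "lens": "List",     "operations": ["add-item"]},
    ],
}

def region_for(ws, obj):
    """Find which region of the workspace shows a given object."""
    for i, region in enumerate(ws["regions"]):
        if region["object"] == obj:
            return i
    return None

print(region_for(workspace, "ex:todo"))
```

Editing the application then means editing this description: splitting a region adds an entry, dragging an operation appends to an `operations` list.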
Writing a Brain Research Paper
Adding “Things to Do” Region
Revised Application
Lens Designer
• Specify how a particular object can be shown
• Similar to workspace designer
– Lens is “workspace” for viewed object
• Subdivide canvas
• Specify property to show in each region
• Specify lens for value of each property
Drug Discovery Dashboard
http://www.w3.org/2005/04/swls/BioDash
[Screenshot: dashboard panels. Topic: GSK3beta; Disease: DiabetesT2; Alt Disease: Alzheimer’s; Target: GSK3beta; Compound: SB44121; CE: DBP; Team: GSK3 Team; Person: John; Related Set; Pathway: WNT]
Bridging Chemistry and Molecular Biology
• Lenses can aggregate, accentuate, or even analyze new result sets
• Behind the lens, the data can be persistently stored as RDF-OWL
• Correspondence does not need to mean “same descriptive object”, but may mean objects with identical references
Pathway Polymorphisms
• Merge directly onto the pathway graph
• Identify targets with the lowest chance of genetic variance
• Predict parts of pathways with the highest functional variability
• Map genetic influence to potential pathway elements
• Select mechanisms of action that are minimally impacted by polymorphisms
[Figure: non-synonymous polymorphisms from dbSNP overlaid on the pathway graph]
Clinical Dashboard
• Gene Expression Data
• Additional relations and aspects can be defined, e.g., Mendelian Inheritance in Man (OMIM)
[Screenshot: diseased tissue, with links to OMIM (RDF)]
Bar View Lens for Gene Expression
ClinDash: Clinical Trials Browser
• Values can be normalized across all measurables (rows)
• Samples can be aligned to their subjects using RDF rules
• Clustering can now be done over all measurables (rows) and types
[Screenshot: subjects, clinical observations, and expression data shown side by side]
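The row normalization mentioned above (“values can be normalized across all measurables”) can be sketched as a per-row z-score; the data here is invented:

```python
# Normalize each measurable (row) to mean 0, standard deviation 1,
# so differently-scaled assays become comparable for clustering.
# The table contents are invented for illustration.
def normalize_rows(table):
    out = {}
    for measurable, values in table.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        sd = var ** 0.5 or 1.0  # guard against constant rows
        out[measurable] = [(v - mean) / sd for v in values]
    return out

table = {"gene_A": [1.0, 2.0, 3.0], "glucose": [90.0, 110.0, 100.0]}
print(normalize_rows(table))
```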
Shattering Applications
• Specific lenses may be too complex for end users to create
• But end users can
– Assemble these lenses into “applications”
– Decide at which data these lenses point
• Current application developers can build those views
– Much more modular
– Instead of building whole application, just build a lens and add to pool
– Repurposable lenses for repurposable data
• Simpler views can be built by non programmers
– Embedding the complex lenses as subparts
Sharing
Semantic Bank
• Tools directly collect and manipulate RDF
– So sharing just requires publishing the RDF back
• Semantic Bank is just a big RDF repository
– GET a resource to fetch the (XML encoding of) RDF about it
– Similarly, upload an XML encoding of the RDF:
* POST /semantic-bank/foo?command=upload&format=rdfxml HTTP/1.1
  Host: bank.example.org
  Content-Length: 317

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <rdf:Description rdf:about="http://www.example.org/ns#item12345">
      <rdfs:label>An Example</rdfs:label>
      <rdf:type rdf:resource="http://www.example.org/ns#Thing"/>
    </rdf:Description>
  </rdf:RDF>
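The same upload can be constructed with Python’s standard library. The host below is the slide’s example host, and nothing is actually sent on the wire here:

```python
# Build the Semantic Bank upload request from the slide with urllib.
# bank.example.org is the slide's example host; this only constructs
# the request object, it does not send it.
import urllib.request

rdfxml = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://www.example.org/ns#item12345">
    <rdfs:label>An Example</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""

req = urllib.request.Request(
    "http://bank.example.org/semantic-bank/foo?command=upload&format=rdfxml",
    data=rdfxml,  # supplying a body makes this a POST
    headers={"Content-Type": "application/rdf+xml"},
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would perform the actual upload
```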
Getting There
What’s wrong?
• It seems obvious: RDF lets anyone
– Ignore web site and application boundaries
– Gather the data they need
– Define their own new attributes and relationships
– Look at it the way that they need
– Manipulate it
– Publish it back for others to use, without having to manage a web site
• So why don’t we already have it?
Cost of Getting Started?
• Web:
– Download/run a web server (hardest part, happens only once)
– Download a web browser
– Write a web page
• Semantic Web
– Install database, define schemas
– Add middleware layer
– Create templating engines
– Develop ontologies, data import protocols
– …
• Semimantic web
– Post some RDF (written in N3) to a semantic bank
– Install Piggy Bank
Absence of Schemas?
• What good is it to put up RDF without explaining all the
properties?
• What happens when different people put up “mismatched”
data with different (explicit or implicit) schemas?
• What if there are multiple URLs for the same thing, with
inconsistent statements about them?
• How can I use data I collected from somewhere else, if it
doesn’t have the same schema as mine?
• But designing schemas is hard
– Requires big committees, lots of meetings, deliberation, buy-in
Data First, Schema later (if ever)
• The need for schemas is a fallacy that blocks progress
• Each site is likely consistent with itself
• And will likely “go with the crowd” and be consistent with others
• If not, let users (not machines) translate
– Mapping properties to properties
– As needed, from site to site
* (or site to personal repository)
– Typically only need to blend a few sites
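The user-driven translation described above is mechanical once the property mapping is written down. A sketch, with property names invented for illustration:

```python
# "Let users translate": a hand-written mapping from one site's
# vocabulary to mine, applied mechanically. No schema, no inference --
# just the user's own alignment. Property names are invented.
mapping = {"dc:creator": "my:author", "foaf:name": "my:name"}

def translate(triples, mapping):
    """Rewrite properties the mapping recognizes; pass the rest through."""
    return [(s, mapping.get(p, p), o) for (s, p, o) in triples]

theirs = [
    ("ex:book1", "dc:creator", "ex:p1"),
    ("ex:p1",    "foaf:name",  "Ada"),
    ("ex:book1", "ex:weird",   "x"),   # unknown property, kept as-is
]
print(translate(theirs, mapping))
```

Because only the handful of properties the user actually cares about need mapping, this scales with the task, not with the schemas involved.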
There’s no RDF?
• Database-backed servers can easily expose RDF, if they want to
– E.g., citeseer.csail.mit.edu
– Import into Piggy Bank
– Browse, query, search in interesting ways
– Maintain collections of references
• If the server won’t cooperate, scrape
– Piggy Bank has a scraper repository
– One person writes a scraper, everyone uses it
– Or, one scrapes and publishes to a semantic bank, others get it from the bank
– Also unsupervised machine-learning approaches
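A toy version of such a scraper, using only Python’s standard `html.parser`. This is not Piggy Bank’s actual scraper API, just the idea of reversing a template back into triples; the page, class names, and item URL are invented:

```python
# Toy scraper: recover structured triples from a templated HTML page
# by mapping <span class="..."> fragments back to properties.
from html.parser import HTMLParser

class CiteScraper(HTMLParser):
    """Turn <span class="title">...</span> fragments into triples."""
    def __init__(self):
        super().__init__()
        self.triples, self._prop, self._item = [], None, "ex:paper1"

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._prop = dict(attrs).get("class")

    def handle_data(self, data):
        if self._prop:  # text inside a classed span becomes a triple
            self.triples.append((self._item, "ex:" + self._prop, data))
            self._prop = None

page = ('<p><span class="title">Free your Data</span> by '
        '<span class="author">Karger</span></p>')
scraper = CiteScraper()
scraper.feed(page)
print(scraper.triples)
```

One person writes this mapping once; afterwards everyone gets triples instead of HTML.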
Clogs and Plogs
• Much blogging is about recycling content
• Clogs (Content Blogs) can manually merge data
– Blogger locates sources of data that ought to be in their schema
– Invests work to align properties and instances
– Publishes the resulting single (schema-unified) blob of data
– No front end
• Plogs (Presentation Blogs) display data
– Develop interesting lenses
– Point them at clogger content
– Someone else’s back end
• Separate front and back ends into different web sites
Chicken and Egg
• RDF-aware clients useless without data, and vice versa
• What can prime the pump?
Research Projects
• Many of our projects generate interesting data
• Then present through one interface
– E.g., NLP, speech
• Instead, post it to the semantic bank
– Others will find new uses for the data
• Other projects consume data
– Get it from the bank
• Let’s talk…
Conclusion
• We have the tools to separate data from presentation
– RDF repositories
– Lenses to display arbitrary data in arbitrary combination
• Doing so would offer substantial benefits
– Application barriers go away
– Anyone can create interesting content
– People can repurpose it to their own specific needs
• Semantic Web can be lightweight
– Low cost of deployment
– Immediate benefit
– All we need do is ignore semantics
• haystack.csail.mit.edu
• simile.mit.edu