SKG2006
Introduction
http://www.culturegrid.net/SKG2006/
Guilin China
November 2 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
http://www.infomall.org
SKG2006

Last year saw the first conference of this series in Beijing
covering
• Knowledge sharing
• Semantic networking
• Grid computing

These areas underlie
• Electronic Science (eScience)
• Scholarship and
• Communities (the real world)




This year we are pleased to present the second conference, which had an 18% acceptance rate for regular papers
We look forward to the meeting next year in Xi’an
Listen and ask lots of questions!
Let’s thank Hai Zhuge and CAS for their wonderful vision and implementation
Web 2.0, Knowledge
and the Semantic Grid
Motivation

Build Cyberinfrastructure (Grids) that
• Supports science from beginning (planning, instruments) through middle (analysis) to end (refereed publications, follow-on work)
• Integrates with the popular Web 2.0 (community) tools, whose successes point to interesting ways of working together
• Integrates with Digital Library technology
• Does not redo previous work but rather augments it
• Assumes a heterogeneous, fragmented world with multiple platforms
• Allows one to specify and manage all the services and data that a project needs, with a mix of synchronous, asynchronous, close (classic workflow) and loose (including zero) coupling
Application Drivers




Semantic analysis of scientific documents, as in the case of chemistry, which has very precise naming rules for compounds that allow accurate searches in documents
• Suggesting how to tag scientific documents, either while writing them or after the fact
Journal web site of the future, as illustrated by Nature building the social bookmarking tool Connotea
Conference support tools, which can benefit from features needed by journals
This gives Digital Library (document) enhanced Cyberinfrastructure (CI)
The Science Drivers


From the Workshop on Challenges of Scientific Workflows http://vtcpc.isi.edu/wiki/index.php/Main_Page
Workflow is the underlying support for the current science model
• Distributed, interdisciplinary, data-deluged scientific methodology as an end (instrument, conjecture) to end (paper, Nobel prize) process is a transformative approach
Reproducibility is core to the scientific method and requires rich provenance and interoperable persistent repositories, with linkage of open data and publication as well as distributed simulations, data analysis and new algorithms
Distributed Science Methodology publishes all steps in a new electronic logbook capturing the scientific process (data analysis) as a rich cloud of resources including emails, PPT, Wikis as well as databases, compiler options, build time/runtime configuration…
Community Tools




E-mail and listservs are the oldest and most widely used
Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P collaboration: text, audio-video conferencing, files
del.icio.us, Connotea, Citeulike, Bibsonomy, Biolicious manage shared bookmarks
MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks
• http://en.wikipedia.org/wiki/List_of_social_networking_websites
Writely, Wikis and Blogs are powerful specialized shared document systems
ConferenceXP and WebEx share general applications
Google Scholar tells you who has cited your papers, while publisher sites tell you about co-authors
• Windows Live Academic Search has similar goals
Note that sharing resources creates (implicit) communities
• Social network tools study graphs to both define communities and extract their properties
Mashups link resources together (federation/workflow)
How to use Web2.0 Community tools in CI

Nearly all of them have “profiles”, “users”, “groups”, “friends” etc.
• Need to integrate these
P2P File Sharing: may be useful for sharing files in research groups (virtual organizations)
• Will modify Maze http://maze.pku.edu.cn, a popular Chinese social P2P system with 2.5 million users
BitTorrent: more popular than FTP, so why not use it for higher performance, fault tolerant, cached file sharing?
MySpace etc.: could consider MyGridSpace or MyScienceSpace that supports a similar document sharing model, with users uploading pictures, papers and even data/services of interest
• Could include uploaded material in workflows
Social Bookmarking and linking: discussed later
• http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/
Mashups and Grids






http://www.programmableweb.com
There are 303 “commodity” service Web 2.0 APIs as of October 30 2006
Mashups are composed from JavaScript, AJAX and REST, and not usually BPEL, WSDL and SOAP
Architecture of Mashups and Grids is “identical”
See Amazon S3 Storage and EC2 Elastic Computing services
Mashups enable everybody to contribute
Mashup APIs with use indicated by size
Note most Mashups are implemented client side inside the Browser
Most Grid workflows are executed server side
Mashup Matrix
Document-enhanced Cyberinfrastructure
[Architecture diagram; only its box labels survive, listed here in extraction order]
Export: RSS, Bibtex, Endnote etc.
Traditional Cyberinfrastructure
Windows Live Academic Search, Del.icio.us, CiteULike, Google Scholar, Connotea, Citeseer, Science.gov, Biolicious, PubChem, PubMed, Bibsonomy
Bibliographic Database; MyResearch Database
Generic Document Tools; Community Tools
Read Journals etc.; Submit Journals; CMT Conference Management
Integration/Enhancement User Interface; Existing User Interface
New Document-enhanced Research Tools; Existing Document based Research Tools
Web service Wrappers
Digital Library-enhanced Cyberinfrastructure
aka Semantic Scholar Grid I



Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
• Title, author and institution of documents
• Citations with their own metadata, allowing one to match them to other documents
Science.gov extracts traditional library metadata from many US Government databases
These capabilities are sure to become more powerful and to be extended
• Give “Citation Index” in real time
• Tell you all authors of all papers that cite a paper that cites you etc. (Note it’s a small world, so don’t go too far in link analysis)
• Tell you all citations of all papers in a workshop
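The "authors of all papers that cite a paper that cites you" query is a bounded walk over the citation graph. The following is a minimal sketch, not how Citeseer or Google Scholar implement it; the function names, the `cited_by` mapping and the toy paper IDs and author names are all hypothetical. The `max_hops` limit reflects the small-world caveat: without it the walk soon reaches most of the literature.

```python
from collections import deque

def citers_within(cited_by, paper, max_hops):
    """Breadth-first walk: map each paper that reaches `paper` through at
    most `max_hops` citation links to its hop count."""
    hops = {paper: 0}
    queue = deque([paper])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # small world: stop expanding at the hop limit
        for citer in cited_by.get(current, ()):
            if citer not in hops:
                hops[citer] = hops[current] + 1
                queue.append(citer)
    del hops[paper]  # the starting paper is not its own citer
    return hops

def authors_of(papers, authors):
    """Sorted, de-duplicated author list for a collection of paper IDs."""
    return sorted({name for p in papers for name in authors.get(p, ())})

# Hypothetical toy data: cited_by[X] lists the papers that cite X.
cited_by = {"P0": ["P1", "P2"], "P1": ["P3"], "P2": ["P3", "P4"], "P3": ["P5"]}
authors = {"P1": ["Ada"], "P2": ["Bo"], "P3": ["Cy", "Ada"],
           "P4": ["Di"], "P5": ["Eve"]}

two_hops = citers_within(cited_by, "P0", max_hops=2)
print(two_hops)                       # papers one or two links from P0
print(authors_of(two_hops, authors))  # their authors
```

Raising `max_hops` by one pulls in P5 as well, which illustrates how quickly the result grows with each extra link.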
Digital Library-enhanced Cyberinfrastructure
aka Semantic Scholar Grid II

It is natural to develop knowledge extraction document Services such as those used in Citeseer/Google Scholar but applied to “your” documents of interest that may not have been processed yet
• Such as a paper just submitted to a conference
These tools can help form useful lists, such as the authors of all papers cited by or submitted to a journal
OSCAR3 (from Peter Murray-Rust’s group at Cambridge) augments the application independent “core” metadata (title, authors, institutions, citations) with a list of all chemical terms
• This tool is a Service that can be applied to “your” document or to a set of documents harvested in some fashion
• Other fields have natural application specific metadata and OSCAR-like tools can be developed for them
Such high value tools could appear on “publisher” sites of the future
OSCAR Chemistry
Document analysis

It detects “magic” chemical strings in text and then
• Stores them as metadata associated with the document
• Queries ChemInformatics repositories to tell you lots of information about identified compounds
• Tells you which other documents have this compound
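The detect, store and cross-reference steps can be illustrated with a deliberately naive sketch. This is not OSCAR's method (OSCAR3 uses chemistry-specific language parsing); here a regular expression over a tiny, assumed subset of element symbols stands in for the detector, and the document IDs and texts are invented for illustration. The repository query step is omitted.

```python
import re

# Assumed, tiny subset of element symbols; two-letter symbols come first so
# that "Cl" is not read as "C" followed by an unknown letter.
ELEMENTS = "Cl|Na|Br|C|H|O|N|S|P"
# A "formula" here is two or more (element, optional count) units in one token.
# Naive on purpose: capitalised words such as "NO" or "HOPS" would also match.
FORMULA = re.compile(rf"\b(?:(?:{ELEMENTS})\d*){{2,}}\b")

def extract_formulas(text):
    """Return the sorted, de-duplicated formula-like strings found in text."""
    return sorted(set(FORMULA.findall(text)))

def annotate(doc_id, text, index):
    """Store detected strings as document metadata and update the
    compound -> documents index used for 'which other documents' queries."""
    found = extract_formulas(text)
    for formula in found:
        index.setdefault(formula, set()).add(doc_id)
    return {"doc": doc_id, "chemicals": found}

# Hypothetical documents
docs = {
    "doc1": "Glucose (C6H12O6) was dissolved in H2O.",
    "doc2": "The sample was washed with H2O and NaCl solution.",
}
index = {}
metadata = [annotate(doc_id, text, index) for doc_id, text in docs.items()]

print(metadata)
print(sorted(index["H2O"]))  # documents that mention this compound
```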
Scholar Grid III



Search and annotation provide unstructured and structured Semantic Web/Grid for documents
Other Web 2.0 tools address linkage of people together and of people to information
Information is metadata as in profiles, or personal publication as in Blogs, Wikis, YouTube, MySpace
• All of these involve some sort of collaboration
• Ranging from comments on Blogs and uploads to collaborative editing in a Wiki
Our projects usually use Wikis as central control (group logbook) and each researcher (including students) can use Blogs to define progress (an experimental Web 2.0 electronic notebook)
• I can comment on student progress with a Blog comment
• Other students can keep abreast of group progress
• Security model not clear
There is also P2P file transfer with BitTorrent
Delicious Semantic Web/Grid









http://del.icio.us purchased by Yahoo for ~$30M
http://www.CiteULike.org
http://www.connotea.org (Nature)
Associate metadata with Bookmarks specified by URLs or DOIs (Digital Object Identifiers)
Users add comments and keywords (called tags)
Users are linked together into groups (communities)
Information such as title and authors is extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
BibTeX-like additional information in CiteULike
This is perhaps the de facto Semantic Web, remarkable for its simplicity
Connotea
Connotea queried by SERVOGrid
Biolicious automatically produces (interesting) scientific lists
Advertising!
Chemical Informatics as a Grid Application



Chemical Informatics is the application of information technology to problems in chemistry.
• Example problems: managing data in large scale drug discovery and molecular modeling
Building Blocks: Chemical Informatics Resources:
• Chemical databases maintained by various groups
  ◦ NIH PubChem, NIH DTP, http://nihroadmap.nih.gov/
• Application codes (both commercial and open source)
  ◦ Data mining such as clustering
  ◦ Quantum chemistry and molecular modeling
• Screening centers (with HTS, High Throughput Screening, devices) measuring interaction of chemicals with biological samples
• Visualization tools
• Web resources: journal articles, etc.
Chemical Informatics Grid http://www.chembiogrid.org needs to integrate these into a common, loosely coupled, distributed computing environment.
OSCAR3 Service from Cambridge UK
OSCAR3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles).
It identifies (or attempts to identify):
• Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
• Chemical data: spectra, melting/boiling point, yield etc. in experimental sections.
• Other entities: things like N(5)-C(3) and so on.
Uses SMILES, InChI and CML
There is a larger effort, SciBorg, in this area
http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
Workflows Using Chemical Literature
[Workflow diagram; recovered box labels: Bulk download of PubMed abstracts; OSCAR3 program; OSCAR3 Service; Extract chemical structures; Find similar documents; Find similar molecules; PDBBind; PubChem; Local DTP database]
All of PubMed “just” takes about a day to run through OSCAR3 on the 2048 node Big Red

SMILES   NAME      PubMed ID
CCC      propane   1425356
CC       ethane    3546453
.....    .....     .....

Clustering of documents linked to clustering of chemicals
Searchable (structure/similarity) Grid database
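One way to link documents through their chemicals is to group the (SMILES, name, PubMed ID) rows by document and compare the resulting compound sets. The sketch below uses Jaccard overlap as the similarity measure; that choice is an assumption for illustration and not necessarily what the workflow uses. The first two rows are the ones shown on the slide, and the last two are hypothetical additions so that the documents share a compound.

```python
from collections import defaultdict

rows = [
    ("CCC", "propane", "1425356"),  # row shown on the slide
    ("CC", "ethane", "3546453"),    # row shown on the slide
    ("CC", "ethane", "1425356"),    # hypothetical extra row
    ("CCO", "ethanol", "9999999"),  # hypothetical extra row and PubMed ID
]

def compounds_by_document(rows):
    """Group SMILES strings by the PubMed ID of the document they came from."""
    docs = defaultdict(set)
    for smiles, _name, pubmed_id in rows:
        docs[pubmed_id].add(smiles)
    return dict(docs)

def jaccard(a, b):
    """Shared compounds divided by all compounds; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

docs = compounds_by_document(rows)
print(jaccard(docs["1425356"], docs["3546453"]))  # the two share ethane
print(jaccard(docs["1425356"], docs["9999999"]))  # no compound in common
```

A pairwise similarity matrix built this way could feed any standard clustering routine; exact SMILES matching would need to be replaced by structure similarity to capture "similar molecules".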
Initial Results

We have a small sample (100) of full text Chemistry papers selected at random from 15 years of PubMed with over 5 million abstracts
• OSCAR3 generates 4.17 compound names per abstract
• and 36.7 compound names per full text
Illustrates how much knowledge journal publishers are hiding from us
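The size of the gap follows from the two averages. The short calculation below assumes, for illustration only, that the names found in an abstract are a subset of those found in the full text, so the ratio of the averages approximates the share visible to an abstract-only index.

```python
names_per_abstract = 4.17   # from the slide
names_per_full_text = 36.7  # from the slide

# How many times more compound names the full text yields.
ratio = names_per_full_text / names_per_abstract
# Share of names missing from abstracts, under the subset assumption above.
hidden_fraction = 1 - names_per_abstract / names_per_full_text

print(f"full text yields {ratio:.1f}x the names of an abstract")
print(f"roughly {hidden_fraction:.0%} of names appear only in the full text")
```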
Clustering Documents from chemical properties
Provenance and Delicious CI

We can use a del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data, or a keyword relating different data, etc.)
• All data should be labeled by a URI to enable this
• One has in addition Citeseer/OSCAR metadata
Current major tagging systems support a flat list of tags without name=value (RDF triple) or schema organization
• Tradeoff between features and pervasive deployment
Some extra features are easy to add as a custom service
Features not supported by del.icio.us can be uploaded as comments
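One way to carry name=value provenance through a flat tag list is to encode each pair as a single prefixed tag and decode it in a custom service. The sketch below shows that round trip. The `prov:` prefix, the function names and the example values are hypothetical conventions, not features of del.icio.us, and it assumes that neither names nor values contain spaces or "=".

```python
def encode_tags(plain_tags, properties, prefix="prov:"):
    """Append each name=value pair as one flat tag, e.g. 'prov:quality=checked'."""
    encoded = [f"{prefix}{name}={value}"
               for name, value in sorted(properties.items())]
    return list(plain_tags) + encoded

def decode_tags(tags, prefix="prov:"):
    """Split a flat tag list back into plain tags and a name -> value dict."""
    plain, properties = [], {}
    for tag in tags:
        if tag.startswith(prefix) and "=" in tag:
            name, value = tag[len(prefix):].split("=", 1)
            properties[name] = value
        else:
            plain.append(tag)
    return plain, properties

# Hypothetical annotation of one data set, keyed by its URI
data_uri = "http://example.org/data/run-42"
annotations = {
    data_uri: encode_tags(["earthquake", "gps"],
                          {"quality": "checked", "source": "run-42"}),
}

print(annotations[data_uri])
print(decode_tags(annotations[data_uri]))
```

The same encoded strings could equally be placed in the comment field when a service restricts tag syntax.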
Implementation Strategy


It doesn’t seem useful to build the 251st community tool
In fact a major barrier to use of existing tools is
• What happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from Web)?
So we assume use of existing tools but wrap them all as web services, so that information can be transferred to new tools and integrated between tools
• Need some “glue” logic, a “unification” database and a minimal user interface
Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry, Science.gov (later)
Journals: Manuscript Central
Conferences: CMT from Microsoft or ?
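The wrap-and-unify strategy can be sketched as a common wrapper interface plus a merge step. Everything here is hypothetical: `Bookmark`, `InMemoryService` and `unify` are illustrative names, and the in-memory class stands in for real web service wrappers, whose actual APIs are not shown or assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bookmark:
    """Common record every wrapper converts its tool's data into."""
    url: str
    title: str
    tags: frozenset

class InMemoryService:
    """Stand-in for a wrapped tool; a real wrapper would call the tool's
    web service inside fetch() and map the response to Bookmark records."""
    def __init__(self, name, bookmarks):
        self.name = name
        self._bookmarks = list(bookmarks)

    def fetch(self):
        return list(self._bookmarks)

def unify(services):
    """'Glue' logic: merge records from all wrappers into one table keyed by
    URL, so information survives if any single tool disappears."""
    merged = {}
    for service in services:
        for bm in service.fetch():
            entry = merged.setdefault(
                bm.url, {"title": bm.title, "tags": set(), "sources": set()})
            entry["tags"] |= bm.tags
            entry["sources"].add(service.name)
    return merged

# Hypothetical data held in two different tools
service_a = InMemoryService("service_a", [
    Bookmark("http://example.org/paper1", "Paper 1", frozenset({"grid"})),
])
service_b = InMemoryService("service_b", [
    Bookmark("http://example.org/paper1", "Paper 1", frozenset({"workflow"})),
    Bookmark("http://example.org/paper2", "Paper 2", frozenset({"chemistry"})),
])

database = unify([service_a, service_b])
print(database["http://example.org/paper1"])
```

Moving to a new tool then only requires one more wrapper with a `fetch` method (and, for export, a matching upload method), while the unification table stays unchanged.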
Current Status



Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are Web Services
Debugging on 500 presentations and papers from my CGL research group
Experiment with GGF Presentations, a broad collection of Chemical Informatics resources (explore science document CI link) and the Concurrency&Computation: Practice&Experience Web site (business model for journals?)
http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/
Knowledge Model for Scientific Journals

There are classes of scientific journals
• Large circulation society journals, effectively subsidized by fees of professional society membership; circulations can be more than 10,000
• “Popular” magazine style journals
• A few prestigious journals
• Many specialized journals publishing archival refereed papers, with circulations from one hundred to a few thousand
The specialized journals largely sell a mix of paper and (a growing number of) electronic subscriptions to libraries; very few individuals subscribe
• Access is limited and expensive
• Even if one subscribes, one is often restricted in the number of full text papers one can access
• Collections like PubMed only include abstracts
Systems like Google Scholar, Microsoft Academic Live and Citeseer cannot fully analyze the knowledge in papers unless they get access to full text
The current publishing model is hindering and not helping science
A similar discussion applies to journal papers and research data
Internet Business Models




How to make money on the Internet has been debated for many years
One can offer content (data on web) and/or services (user customizable transformations of web data)
Advertising is the dominant model on large sites.
Content and Services can be free or paid for by Transactions or Subscriptions.
• Often there is a mixed model with basic content/services free and one pays for premium features
One can charge the reader or the publisher.
• Advertising charges the publisher of the Advert
• In the past, journals were funded by page charges, i.e. one charged the authors (institution) that produced the paper
Examples of Internet Information and
Knowledge Content and Business Model


iTunes and other music sources; at the right price, people will pay for convenience
News web sites supported by a mix of advertising and premium content.
• Not clear the latter is successful except in specialized areas
Sites like http://www.chessbase.com/ with collections of Chess Games with occasional annotation
Several Financial Service sites
• Yahoo, Google etc. Financial Services with a premium for real-time stock quotes
• Other sites feature commentary that is either free (supported by advertising) or premium content (such as Wall Street Journal and many stock picker sites) to which you subscribe
Examples of Internet Information and
Knowledge Services and Business Model




Google etc. online Office versus the more sophisticated paid Microsoft Office, which also has a "history" advantage as it owned the field before the Internet
WebEx collaboration services paid by transaction or subscription; not obviously a viable long term model
ICC Chess Site http://www.chessclub.com/ supports the community of chess players with free basic access but valuable premium features including better game playing, rating and real-time commentary. Other gaming sites are similar
Amazon S3 and Computing Cloud paid services could be successful, as the alternative (buying your own computers) costs real money and is perhaps less reliable
Publishing Business Model in the Internet Age




Journal publishing currently has a business model where the price reflects neither the cost nor the value added
Publishers currently do not have significant internal expertise in new approaches/technologies to drive new business models
However, much is outsourced already, and so one can outsource to organizations with new expertise, e.g. to those that know Web 2.0 rather than putting ink on paper
There is no clear new business model, but it is plausible that the current model will not survive for long
• So need to change even if less lucrative or success unclear
Note libraries provide funds to publishers and libraries will continue
• Not clear how fast libraries will change, as they also don’t obviously have expertise to support new models
• Some think that one role of university libraries will be curation of data produced by university faculty
Strengths of Current Publishing Model


Permanent “guaranteed” archival storage, but there are other approaches to this such as Amazon S3
Uniform look and feel and copyediting to remove language errors.
• Useful but not so valuable that we can trade access for this.
• In particular, copyediting can only correct some language errors, as only a subject expert can really rewrite in good grammar and expression
Refereeing of a quality implied by the journal and the editorial board
• Most important strength, but the business model does not directly reflect this as only a small part of the subscription price goes to the editorial function
• For most papers the cost of refereeing is much less than other costs of producing the paper
• Not clear why the viewer should pay for refereeing
Large amount of pre-existing papers from old issues of journals
Pressures on Current Publishing Model

Mandated open access to scholarly work funded by government
• Cornyn-Lieberman bill in the US
• NIH PubMed Central requires deposit of full text of articles after a length of time
Electronic access to publisher sites is not especially good
Division of articles into journals and publishers is not very helpful today, when technology does not care about location of information
• Location is just a rather simple annotation (metadata) specifying aspects of provenance of an article
• Note a special issue of SKG2006 is just an annotation roughly characterizing nature and quality of work
Publishing on the Internet is not a valuable service and has been addressed by Web servers in general and by Web 2.0 in attractive ways
Essentially nobody reads or even has access to paper copies of journals
• Not clear it is useful to print specialized journals on paper
Scholarly Research Community Site



Best product should allow one to make best use of knowledge in scholarly publications and data
Should integrate journal and conference publications and services
Should contain integrated services, or support outside services, for curation, annotation, analysis and search
• Looking at Web 2.0 successes, one needs to conveniently share data and set up communities
Content is scholarly journals and data
Services include
• Annotation as in Connotea, CiteULike, Del.icio.us
• Semantic analysis for citations, authors, chemical compounds etc.
• Biolicious style custom classifications including added value contacts
• Search as in Google Scholar, Microsoft Academic Live
• MySpace/Facebook/LinkedIn style services for existing or new contacts
• Support of conference and journal refereeing
• Other conference/journal services such as registration, advertising
• Integration with research such as electronic log books
• Internal integration, e.g. authors in citations are linked to community
• Links to more general document services such as:
  ◦ Online Office style Tools
  ◦ WebEx type collaboration
Business Model for Scholarly
Journal/Research Community Site


One can charge for advertising, better content, better services or better implementation
It is natural to start with basic free content and services with advertising.
• Content must be free eventually “by law”
• Services will have open source versions anyway, so counter this with free basic services
One could use the page charge model for charging for refereeing.
One charges the user for features that add value. These include:
• Better or better implemented community/digital library services
• Premium Content possibly contracted by site owner
Problem with the Advertising Business model: audience is specialized (i.e. small) but upscale
Problem with charging for Community Tools: competing with free software, but likely can offer much better service than free software, just as WebEx does fine in spite of free VNC