Introduction to e-Science and Semantic Web Professors Deborah McGuinness and Joanne Luciano (With Li Ding and Peter Fox) CSCI-6962-01 Week 1, August 30, 2010


Introduction to e-Science and
Semantic Web
Professors Deborah McGuinness and Joanne Luciano
(With Li Ding and Peter Fox)
CSCI-6962-01
Week 1, August 30, 2010
1
Admin info (keep/ print this slide)
• Class: CSCI-6962-01
• Hours: 1pm-3:50pm Mondays (except one)
• Location: Winslow 1140
• Instructors: Deborah McGuinness, Joanne Luciano, with Peter Fox and Li Ding
• Instructor contacts: [email protected], [email protected]
• Contact locations: Winslow 2104 (DLM), 2143 (JSL)
• Wiki: http://tw.rpi.edu/wiki/Semantic_eScience_%282010_Fall%29
2
Introductions
• Who are we?
• Who are you?
• Why are you here?
• What do you want to get out of the class?
• Will you make the class (on time) each week and do you have any other conflicts or issues we should know about?
3
“Knowledge is the common
wealth of humanity”*
In the Earth and space sciences and elsewhere,
ready and open access to the vast and growing
collections of cross-disciplinary digital information
is the key to understanding and responding to
complex Earth system phenomena that influence
human survival.
We have a shared responsibility to create and
implement strategies to realise the full potential of
digital information and services for present and
future generations.
*Adama Samassekou, Convener of the UN World Summit on the Information Society
What do we need to achieve Semantic eScience?
(in-class brainstorming exercise)
organization, leadership, management strategies, roles and
assignment of roles
dissemination strategy
communication of ideas
- machine level
- human level
conflict resolution
cross-disciplinary
collaboration
flexible
adaptable, feedback
extensible
ability to filter information
usage/application of resources, optimization
facts, knowledge (domain knowledge)
context, domain, scope
goals, use cases
metadata - data to describe data
ability to link information
ability to understand information
ability to capture and represent conflicting ideas
provenance - where data come from
trust - reliable
ability to capture intent (humanitarian aspect / responsibility)
credibility of information
interesting and appealing
standardization
education and outreach
methods and metrics
criteria for evaluation
Outline of the course
• Topics for Semantic e-Science/ Foundations:
– Semantic Methodologies
– Knowledge Representation for e-Science
– Ontology Engineering and Re-Use for e-Science
– Knowledge Integration for e-Science
– Semantic Data Integration
– Semantic Web Languages, Tools and Services
– Semantic Infrastructure and Architecture for e-Science
– Semantic Grid Middleware
– Ontology Evolution for e-Science
– Knowledge Management for e-Science
– e-Science Workflow Management
– Data life-cycle for e-Science
– Data Mining and Knowledge Discovery
6
Contents
• Outline of the course
• Background
• e-Science
• Examples
• Informatics
• Semantics
• Elements of Semantic e-Science (SeS)
• What we expect
• Logistics summary
7
The Information Era: Interoperability
Modern information and communications
technologies are creating an
“interoperable” information era in which
ready access to data and information can
be truly universal. Open access to data
and services enables us to meet the new
challenges of understanding the Earth and
its space environment as a complex
system:
• managing and accessing large data sets
• higher space/time resolution capabilities
• rapid response requirements
• data assimilation into models
• crossing disciplinary boundaries.
8
One Real World Example
Influenza Ontology Development
to Support Research, Surveillance
and Monitoring
Ontology Support for
Influenza Research and
Surveillance
Joanne Luciano, PhD,
Lynette Hirschman, PhD, Marc Colosimo, PhD
Approved for Public Release; Distribution Unlimited.
28 April 2008 Case Number 08-0738
© 2006 The MITRE Corporation. All rights reserved
Case Study: Indonesia
• Possible human-to-human transmission of H5N1 (May 2006)
• Samples were collected and epidemiological data obtained
– Know who got sick and their relationship to each other
– Know when they got sick and if they died
– Have some public sequence data from that time
• It is not known if these samples are from these people!
Figure: timeline of released records. Metadata: 23 May 2006 and 12 Jul 2006. Public sequence data: 30 Aug 2006. Isolates shown: A/Indonesia/CDC595/2006 (2006-05-09), A/Indonesia/CDC594/2006 (2006-05-10), A/Indonesia/CDC625L/2006 (2006-05-22), A/Indonesia/CDC644/2006 (2006-05-30). Example field: isolation_source="gender:M; age:32; Lung Aspirate". Sources: GenBank, WHO, Nature. Same person?
© 2008 The MITRE Corporation. All rights reserved
Case Study: UK
• Outbreak of H5N1 in the UK at a turkey farm, Feb 1, 2007
• What is the source of the outbreak?
– Contact with infected wild birds?
• But turkeys were in an enclosed “biosecure” unit
• No H5N1 detected in the region in the 2 previous months
– Govt. veterinarian suggested turkey meat from Hungary might be the source of infection
• Turkey farm is adjacent to a poultry packing plant that had processed poultry products from Hungary
• Hungary had reported an H5N1 outbreak 2 weeks earlier
• Sequence data showed that the strain infecting the turkeys was 99.96% identical to the strain that had infected Hungarian birds
• Conclusion: Infected Hungarian poultry was the source of H5N1 infection
– Open question (relevant to food defense): how did H5N1 spread from processing plant to live turkeys?
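The "99.96% identical" figure in the UK case is a percent-identity measure between two sequences. A minimal sketch of that calculation follows, assuming the two sequences are already aligned to the same length; the function name and the toy sequences are made up for illustration and are not real H5N1 data.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two aligned, equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# Toy sequences (hypothetical): they differ only in the last position.
uk_strain = "ATGGAGAAAATAGTGCTTCT"
hungarian_strain = "ATGGAGAAAATAGTGCTTCA"
print(f"{percent_identity(uk_strain, hungarian_strain):.2f}% identical")
```

Real comparisons would first align the sequences with a dedicated tool; this sketch only covers the final counting step.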
Research Agenda
Create “Reference” Database to hold
Influenza virus sequences and “Metadata”
– What metadata to collect?
– Where to find data and how to connect different sources (bridging
the gap)?
Research Question: Bridging the Gap, Connecting Genomics and Epidemiology
Figure: the Influenza Ontology bridges Genomics (genes of pathogen) and Epidemiology (occurrence of disease in host). Data types shown: genomic sequence data, systems biology, demographic data, clinical data, geospatial data, temporal data, pathogenicity, host.
Influenza Ontology: Development
• Identify the right collaborators
• Collect metadata terms
• Identify resources that include these terms
• Regularize metadata
– Generate a controlled vocabulary (terms)
• Validate subset with BioHealthBase CEIRS data
• Iterate, review with community, publish
• Integrate Influenza ontology into workflow
Influenza Ontology First Draft: Community
Collaboration with BioHealthBase and Gemina
• BioHealthBase: NIAID Influenza Database. Point of contact for
– Centers of Excellence for Influenza Research and Surveillance (CEIRS)
– Research: Emory, Mt Sinai, St. Jude, Univ. of Rochester
– Surveillance: St. Jude, UCLA, Univ. of Minnesota
– Los Alamos National Laboratory (LANL)
• Gemina: Category A-C Pathogen Database. Point of contact for
– Children’s Hospital Boston
– Johns Hopkins University
• MITRE
Influenza Ontology First Draft:
Identify metadata
~200 controlled vocabulary terms
covering several fields
Influenza Ontology:
Metadata resources
Reuse of existing ontologies & metadata standards
OBI – Ontology of Biomedical Investigations
EnvO – Environmental Ontology (habitat of pathogen)
GAZ – Gazetteer (geographic locations)
FMA – Foundational Model of Anatomy
DC – Dublin Core (publication metadata)
PATO – Phenotype
SO – Sequence Ontology (sequence features)
Cell – Cell Ontology (types of cells)
DO – Disease Ontology
IDO – Infectious Disease Ontology
Influenza Ontology First Draft
Initial steps:
• Collect metadata terms
• Map and align terms
• Group related information
• Identify and define relationships
• Identify external ontologies
Formalize:
• Normalize terms into a CV
• Issue unique identifiers
• Instantiate class hierarchy
• Define properties and values
• Link to external ontology terms
Figure labels: Excel Spreadsheet; Formalize; OBO-Edit: Ontology Editing Tool
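The first two "Formalize" steps (normalize terms into a CV, issue unique identifiers) can be sketched as follows. This is only an illustration under stated assumptions: the normalization rule, the `FLU` identifier prefix, and the sample terms are hypothetical and are not taken from the actual Influenza Ontology.

```python
import re


def normalize(term: str) -> str:
    """Lowercase, trim, and collapse whitespace/underscores so spelling variants compare equal."""
    return re.sub(r"[\s_]+", " ", term.strip().lower())


def build_cv(raw_terms, prefix="FLU"):
    """Map each distinct normalized term to an identifier such as FLU:0000001 (hypothetical prefix)."""
    cv = {}
    for raw in raw_terms:
        term = normalize(raw)
        if term not in cv:
            cv[term] = f"{prefix}:{len(cv) + 1:07d}"
    return cv


# Hypothetical metadata terms collected from different submitters.
raw = ["Lung Aspirate", "lung_aspirate", " Host  Age ", "host age", "Gender"]
cv = build_cv(raw)
for term, identifier in cv.items():
    print(identifier, term)
```

Identifiers are issued in first-seen order, so the five raw strings collapse to three controlled terms.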
Subsequent Work
Ontology development
• Complete formalization process
• Validate subset with data from BioHealthBase
– Circulate for review and comments
• Use ontology to annotate influenza data
Team
• BioHealthBase (UT Southwestern Medical Center)
– Burke Squires
– Richard Scheuermann
• Institute of Genome Sciences/Gemina (U. Maryland Baltimore)
– Lynn Schriml
• MITRE
– Joanne Luciano
– Lynette Hirschman
– Marc Colosimo
• British Columbia Cancer Agency (Vancouver, Canada)
– Ryan Brinkman
– Mélanie Courtot
Note! (not mentioned in class): This collaboration became international in its 3rd year, when the Canadian Government decided that they too needed an ontology to support integration of influenza data, since the spread of influenza does not stop at international borders. When they did their research, they found us and joined the collaboration, and we were grateful to have their help and expertise.
Background
Scientists should be able to access a global, distributed
knowledge base of scientific data that:
• appears to be integrated
• appears to be locally available
But… data is obtained by multiple means, using
various protocols, in differing vocabularies, using
(sometimes unstated) assumptions, with
inconsistent (or non-existent) meta-data. It may be
inconsistent, incomplete, evolving, and distributed.
And… there often exist significant levels of semantic
heterogeneity, large-scale data, complex data
types, legacy systems, inflexible and unsustainable
implementation technology…
But data has Lots of Audiences
Information products have Lots of Audiences: SCIENTISTS TOO
Figure: audiences ranked from More Strategic to Less Strategic
From “Why EPO (Education and Public Outreach)?”, a NASA internal
report on science education, 2005
Shifting the Burden from the User
to the Provider
24
Fox CI and X-informatics - CSIG 2008, Aug 11
e-Science
• Emphasis is on Science
• Original narrative: One of the key drivers behind the search for such new
scientific tools is the imminent deluge of data from new generations of
scientific experiments and surveys (*). In order to exploit and explore the
petabytes of scientific data that will arise from these high-throughput
experiments, supercomputer simulations, sensor networks, and satellite
surveys, scientists will need assistance from specialized search engines,
data mining tools, and data visualization tools that make it easy to ask
questions and understand answers. To create such tools, the data will
need to be annotated with relevant "metadata" giving information as to
provenance, content, conditions, and so on; and, in many instances, the
sheer volume of data will dictate that this process be automated.
Scientists will create vast distributed digital repositories of scientific data
requiring management services similar to those of more conventional
digital libraries, as well as other data-specific services. The ability to
search, access, move, manipulate, and mine such data will be a central
requirement for this new generation of collaborative science software
applications. Hey and Trefethen, 2005
25
Evolving Science
• Thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: theoretical branch, using models, generalizations
• Last few decades: a computational branch, simulating complex phenomena
• Today: data exploration (eScience), synthesizing theory, experiment and computation with advanced data management and statistics
→ new algorithms!
Figure: example of a theoretical model, $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
Living in an Exponential World
• Scientific data doubles every year
– caused by successive generations of inexpensive sensors + exponentially faster computing
• Changes the nature of scientific computing
• Cuts across disciplines (eScience)
• It becomes increasingly harder to extract knowledge
• 20% of the world’s servers go into huge data centers by the “Big 5”
– Google, Microsoft, Yahoo, Amazon, eBay
• So it is not only the scientific data!
Figure: log-scale plot (0.1 to 1000) over 1970-2000, with series labeled CCDs and Glass
Collecting Data
• Very extended distribution of data sets:
data on all scales!
• Most datasets are small, and manually
maintained (Excel spreadsheets)
• Total amount of data dominated by the other
end
(large multi-TB archive facilities)
• Most bytes today are collected via electronic
sensors
Making Discoveries
• Where are discoveries made?
– At the edges and boundaries
– Going deeper, collecting more data, using more colors….
• Metcalfe’s law
– Utility of computer networks grows as the number of possible connections: O(N²)
• Federating data (the connections!!)
– Federation of N archives has utility O(N²)
– Possibilities for new discoveries grow as O(N²)
• Many examples
– Sky surveys – galaxy zoo… Very early discoveries from SDSS, 2MASS, DPOSS
– Genomics+proteomics
– Alzheimer’s article in the reading
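The quadratic growth quoted for Metcalfe's law and for federated archives follows from counting pairwise connections. As a short worked step (the values N = 10 and N = 20 are chosen here only for illustration):

```latex
\[
  \text{connections}(N) \;=\; \binom{N}{2} \;=\; \frac{N(N-1)}{2} \;=\; O(N^2),
  \qquad
  \text{connections}(10) = 45, \quad \text{connections}(20) = 190 .
\]
```

Doubling the number of federated archives from 10 to 20 roughly quadruples the number of archive pairs that can be cross-matched.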
Data Delivery: Hitting a Wall
FTP and GREP are not adequate
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh!, and 1PB ~4,000 disks
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~1 $/GB)
• … 2 days and 1K$
• … 3 years and 1M$
• At some point you need indices to limit search, parallel data search and analysis
• This is where databases can help
• Take the analysis to the data!!
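The GREP figures above are back-of-envelope scan times. The sketch below reproduces that style of arithmetic under one stated assumption, a constant sequential throughput of 10 MB/s (a value chosen here, not given on the slide), so the results match the slide only in order of magnitude.

```python
def scan_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time to read a dataset once, front to back, at a constant throughput."""
    return size_bytes / bytes_per_second


MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
RATE = 10 * MB  # assumed sequential throughput: 10 MB/s

for label, size in [("1 MB", MB), ("1 GB", GB), ("1 TB", TB), ("1 PB", PB)]:
    secs = scan_seconds(size, RATE)
    print(f"{label}: {secs:,.1f} s = {secs / 86400:,.2f} days")
```

At this assumed rate a full pass over 1 PB takes about 1,157 days, a little over three years, which is why indices and taking the analysis to the data matter at that scale.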
Mind the Gap!
• As a result of finding out who is doing what, sharing experience/ expertise, and substantial coordination:
• There is/ was still a gap between science and the underlying infrastructure and technology that is available
• Cyberinfrastructure is the new research environment(s) that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet.
• Informatics - information science includes the science of (data and) information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, behavior, and interactions of natural and artificial systems that store, process and communicate (data and) information. It also develops its own conceptual and theoretical foundations. Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects, including study of the social impact of information technologies. Wikipedia.
31
Progression after progression
Figure labels, in order: IT; Cyberinfrastructure; Cyberinformatics; Core Informatics; Science Informatics, aka Xinformatics; Science, Societal Benefit Areas
32
World-Wide Emerging Technology
Trends
• Innovation will come from other parts of the world
other than the U.S.
• The Chinese have skipped the Internet first
generation.
• Growth will occur in Asia, and continue to
decrease in Western Europe.
• U.S. Industry is compulsively outsourcing abroad.
• Software is moving from forms-based applications
to business processes.
• Networks are migrating to IP and optical
networking technologies.
Cyberinfrastructure
• Data curation and storage
• Federated access
• Collaboration
• New uses in High Performance Computing
• Databases
• Web servers, services (software as service)
• Wiki
• Visualization
• All discipline neutral
Semantic Web Methodology and
Technology Development Process
• Establish and improve a well-defined methodology vision for Semantic Technology based application development
• Leverage controlled vocabularies, etc.
Figure: iterative development cycle with the stages Use Case; Small Team, mixed skills; Analysis; Develop model/ ontology; Use Tools; Science/Expert Review & Iteration; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Evaluation
35
Ex. 1: Virtual Observatories
Make data and tools quickly and easily accessible
to a wide audience.
Operationally, virtual observatories need to find the
right balance of data/model holdings, portals and
client software that researchers can use without
effort or interference as if all the materials were
available on his/her local computer using the
user’s preferred language: i.e. appear to be
local and integrated
Likely to provide controlled vocabularies that may
be used for interoperation in appropriate
domains along with database interfaces for
access and storage -> thus part IT, part CI, part
Informatics and all about doing new science
36
Mediation Layer
• Ontology - capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data
• Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, et c.
Figure labels: Added value (Education, clearinghouses, other services, disciplines, et c.); Semantic mediation layer - mid-upper-level; VO Portal; Web Serv.; VO API; Semantic interoperability; Semantic query, hypothesis and inference; Semantic mediation layer - VSTO - low level; Metadata, schema, data; DB1, DB2, DB3, …, DBn; Query, access and use of data
37
Science and technical use cases
Find data which represents the state of the neutral
atmosphere anywhere above 100km and toward the
arctic circle (above 45N) at any time of high
geomagnetic activity.
– Extract information from the use-case - encode knowledge
– Translate this into a complete query for data - inference and
integration of data from instruments, indices and models
Provide semantically-enabled, smart data query services
via a SOAP web service for the Virtual Ionosphere-Thermosphere-Mesosphere
Observatory that retrieve data, filtered by constraints on Instrument, Date-Time,
and Parameter in any order and with constraints
included in any combination.
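The technical use case asks for filtering by Instrument, Date-Time and Parameter "in any order" and "in any combination". One way to picture that contract is a query function whose constraints are all optional keyword arguments. This is a sketch only: the `Record` type, the `query` signature and the sample holdings are hypothetical and do not describe the actual observatory service.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    instrument: str
    parameter: str
    time: datetime


def query(records: Iterable[Record],
          instrument: Optional[str] = None,
          parameter: Optional[str] = None,
          start: Optional[datetime] = None,
          end: Optional[datetime] = None) -> List[Record]:
    """Keep records that satisfy every constraint supplied; omitted constraints match anything."""
    return [
        r for r in records
        if (instrument is None or r.instrument == instrument)
        and (parameter is None or r.parameter == parameter)
        and (start is None or r.time >= start)
        and (end is None or r.time <= end)
    ]


# Hypothetical holdings, for illustration only.
HOLDINGS = [
    Record("ISR", "NeutralTemperature", datetime(2003, 10, 31)),
    Record("FPI", "NeutralTemperature", datetime(2003, 11, 20)),
    Record("FPI", "NeutralWind", datetime(2004, 1, 5)),
]

by_parameter = query(HOLDINGS, parameter="NeutralTemperature")
by_both = query(HOLDINGS, start=datetime(2003, 11, 1), instrument="FPI")
print(len(by_parameter), len(by_both))
```

Because constraints are keyword arguments, the caller can supply them in any order and leave out any subset, which mirrors the wording of the use case.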
Inferred plot type
and return required
axes data
39
Semantic Web Benefits
• Unified/ abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the
number of selections from eight to three
• Generates only syntactically correct queries: which could not always be
ensured in previous implementations without semantics
• Semantic query support: by using background ontologies and a
reasoner, our application has the opportunity to expose only coherent
queries (portal and services)
• Semantic integration: in the past users had to remember (and maintain
codes) to account for numerous different ways to combine and plot the
data whereas now semantic mediation provides the level of sensible data
integration required, and exposed as smart web services
– understanding of coordinate systems, relationships, data synthesis,
transformations, etc.
– returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional
research associates and those from outside the fields)
40
But data has Lots of Audiences
More Strategic
Less Strategic
From “Why EPO?”, a NASA internal
report on science education, 2005
41
What is a Non-Specialist Use Case?
Teacher accesses the internet, goes to an Educational Virtual Observatory and enters a search for “Aurora”.
Someone should be able to query a virtual observatory without having specialist knowledge
42
What should the User Receive?
Teacher receives four groupings of search
results:
1) Educational materials:
http://www.meted.ucar.edu/topics_spacewx.php and http://www.meted.ucar.edu/hao/aurora/
2) Research, data and tools: via research VOs
but the search for brightness, or green/red line
emission is mediated for them
3) Did you know?: Aurora is a phenomenon of
the upper terrestrial atmosphere (ionosphere)
also known as Northern Lights
4) Did you mean?: Aurora Borealis or Aurora
Australis, etc.
43
Semantic Information Integration:
Concept map for educational use of
science data in a lesson plan
44
Semantic Web Basics
• The triple: {subject-predicate-object}
Interferometer is-a optical instrument
Optical instrument has focal length
An ontology is a representation of this knowledge
• W3C is the primary (but not sole) governing organization for
languages, specifications, best practices, et c.
– RDF - Resource Description Framework
– OWL 1.0 - Web Ontology Language (OWL 2.0 on the way)
• Encode the knowledge in triples and hold them in a triple-store; software
built to traverse the resulting semantic network can then query it or
reason over it
• Put semantics between/ in your interfaces, i.e. between layers
and components in your architecture, i.e. between ‘users’ and
‘information’ to mediate the exchange
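The triple model above can be sketched without any Semantic Web library, using plain tuples as a toy triple-store. The first two triples come from the slide; the third is an extra fact assumed here so the traversal has more than one step. A real system would use RDF with URIs and a proper triple-store rather than strings in a set.

```python
# Each triple is (subject, predicate, object).
triples = {
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("OpticalInstrument", "has", "FocalLength"),
    ("OpticalInstrument", "is-a", "Instrument"),  # assumed extra fact, for illustration
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def ancestors(store, cls):
    """Traverse is-a edges transitively to collect every superclass of cls."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(store, s=current, p="is-a"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


print(match(triples, p="has"))
print(ancestors(triples, "Interferometer"))
```

`match` plays the role of a query, and `ancestors` is a very small example of reasoning: it derives that an Interferometer is an Instrument although no single triple says so.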
46
Terminology
• Semantic Web
– An extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation, www.semanticweb.org
– Primer: http://www.ics.forth.gr/isl/swprimer/
• Semantic Grid
– Semantic services to use the resources of many computers connected by a network to solve large scale computational/ data problems
• Provenance
– origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility.
• Service-oriented architecture
– Provision of a capability over the internet via a ‘remote-procedure-call’ using prescribed input, output and pre-conditions
• Ontology (n.d.). The Free On-line Dictionary of Computing. http://dictionary.reference.com/browse/ontology
– An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
Terminology
• Closed World - where complete knowledge is known (encoded), AI relied on this
• Open World - where knowledge is incomplete/ evolving, SW promotes this
• Languages
– OWL - Web Ontology Language (W3C)
– RDF - Resource Description Framework (W3C)
– OWL-S/SWSL - Web Services (W3C)
– WSMO/WSML - Web Services (EC/W3C)
– SWRL - Semantic Web Rule Language, RIF - Rule Interchange Format
– PML - Proof Markup Language
– Editors: Protégé, SWOOP, Medius, SWeDE, …
• Reasoners
– Pellet, Racer, Medius KBS, FACT++, fuzzyDL, KAON2, MSPASS, QuOnto
• Query Languages
– SPARQL, XQUERY, SeRQL, OWL-QL, RDFQuery
• Other Tools for Semantic Web
– Search: SWOOGLE swoogle.umbc.edu
– Collaboration: www.planetont.org
– Other: Jena, SeSAME/SAIL, Mulgara, Eclipse, KOWARI
– Semantic wiki: OntoWiki, SemanticMediaWiki
• Emerging Semantic Standards for Earth Science
– SWEET, VSTO, MMI, GeoSciML
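The closed-world and open-world entries in the terminology above differ in how they treat a statement that is not in the knowledge base. A minimal sketch, with a hypothetical fact base and function names: under the closed-world assumption a missing fact is false, while under the open-world assumption it is merely unknown (represented here as `None`).

```python
# Hypothetical knowledge base with a single asserted fact.
facts = {("MillstoneHill", "operates", "ISR")}


def holds_closed_world(fact) -> bool:
    """Closed world: anything not asserted is taken to be false."""
    return fact in facts


def holds_open_world(fact):
    """Open world: anything not asserted is unknown (None), not false."""
    return True if fact in facts else None


unasserted = ("MillstoneHill", "operates", "FPI")
print(holds_closed_world(unasserted))  # treated as false
print(holds_open_world(unasserted))    # treated as unknown
```

This sketch ignores explicit negation; an open-world reasoner can still conclude that something is false when the ontology says so.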
48
Semantic Web Layers
49
http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/
Application Areas for Semantics
• Smart search
• Annotation (even simple forms), smart tagging
• Geospatial
• Implementing logic (rules), e.g. in workflows
• Data integration
• Verification …. and the list goes on
• Web services
• Web content mining with natural language parsing
• User interface development (portals)
• Semantic desktop
• Wikis - OntoWiki, SemanticMediaWiki
• Sensor Web
• Software engineering
• Explanation
50
2007-2008 Hype Cycle for Emerging Semantic Web Technologies v0.6
Produced for NASA TIWG semantic web subgroup, April 2008
Figure: hype cycle plot. Vertical axis: Visibility. Horizontal axis: Time, through the phases Technology trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity.
Legend: estimated years to mainstream adoption in Earth science (< 2 years, 2-5 years, 5-10 years, > 10 years, obsolete before plateau).
Technologies plotted:
• Semantic Web Services
• Triple stores, e.g. Jena, Sesame, Mulgara, Oracle Spatial
• Semantic Wiki
• Smart search, e.g. NOESIS
• Rules/Logic, SWRL
• Query Lang, SPARQL
• Query Lang, OWL-QL
• Query Lang, Commercial and embedded QL
• Ontology editor, SWOOP
• Protégé
• Concept map, Cmap
• RDF
• OWL 1.0
• OWL 1.1
• XML
• SKOS
• FOAF
• Tagging / annotation
• DL Reasoners, e.g. Pellet, Racer
• Species Validators
• Natural Language Ontologies
• Mid-level ES domain ontologies, e.g. GEON
• Mid-level ES domain ontologies, e.g. SWEET
• Upper level ontologies, e.g. ABC, DOLCE, SUMO
• Managing modular ontologies (ES and general)
Semantic Web Roadmap
Results
• Outcome
– Current: Improved Information Sharing
– Near Term (0-2 yrs): Increased Collaboration & Interdisciplinary Science
– Mid Term (2-5 yrs): Acceleration of Knowledge Production
– Long Term (5+ yrs): Revolutionizing how science is done
• Output
– Current: Geospatial semantic services established
– Near Term (0-2 yrs): Geospatial semantic services proliferate
– Mid Term (2-5 yrs): Scientific semantic assisted services
– Long Term (5+ yrs): Autonomous inference of science results
Capability
• Assisted Discovery & Mediation
– Current: Some common vocabulary based product search and access
– Near Term (0-2 yrs): Semantic geospatial search & inference, access
– Mid Term (2-5 yrs): Semantic agent-based searches
– Long Term (5+ yrs): Semantic agent-based integration
• Interoperable Information Infrastructure
– Current: Local processing + data exchange
– Near Term (0-2 yrs): Basic data tailoring services (data as service), verification/ validation
– Mid Term (2-5 yrs): Interoperable geospatial services (analysis as service), results explanation service
– Long Term (5+ yrs): Metadata-driven data fusion (semantic service chaining), trust
Technology
• Vocabulary
– Current: SWEET core 1.0 based on GCMD/CF
– Near Term (0-2 yrs): SWEET core 2.0 based on best practices decided from community
– Mid Term (2-5 yrs): SWEET 3.0 with semantic callable interfaces via standard programming languages
– Long Term (5+ yrs): Reasoners able to utilize SWEET 4.0
• Languages/ Reasoning
– Current: RDF, OWL, OWL-S
– Near Term (0-2 yrs): Geospatial reasoning, OWL-Time
– Mid Term (2-5 yrs): Numerical reasoning
– Long Term (5+ yrs): Scientific reasoning
Semantic Web Roadmap (capability)
April 2008
• Assisted Discovery & Mediation
– Current: Some common vocabulary based product search and access
– Near Term (0-2 yrs): Semantic geospatial search & inference, access
– Mid Term (2-5 yrs): Semantic agent-based searches
– Long Term (5+ yrs): Semantic agent-based integration
• Assisted Knowledge Building
– Current: Some metadata and limited provenance available
– Near Term (0-2 yrs): Ontologies for data mining, visualization and analysis emerging/ maturing
– Mid Term (2-5 yrs): Common terminology captured in ontologies, crossing domains
– Long Term (5+ yrs): Provenance/ annotation with ontologies in user tools
• Verifiable Information Quality
– Current: Verification is manual with minimal tool support
– Near Term (0-2 yrs): Ontologies for information quality developed
– Mid Term (2-5 yrs): Domain and range properties in ontologies used in tools
– Long Term (5+ yrs): Service ontologies carry quality provenance
• Responsive Information Delivery
– Current: Services must be hardwired and service agreements established
– Near Term (0-2 yrs): Services annotated with resource descriptions
– Mid Term (2-5 yrs): Dynamic service discovery and mediation, and data scheduling
– Long Term (5+ yrs): Semantic markup of data latency (time lags) which adapt dynamically
• Interoperable Information services
– Current: Local processing + data exchange
– Near Term (0-2 yrs): Basic data tailoring services (data as service), verification/ validation
– Mid Term (2-5 yrs): Interoperable geospatial services (analysis as service), results explanation service
– Long Term (5+ yrs): Metadata-driven data fusion (semantic service chaining), trust
• Interactive Data Analysis
– Current: Limited metadata passed to analysis applications
– Near Term (0-2 yrs): Tag properties, non-jargon vocabulary for non-specialist use
– Mid Term (2-5 yrs): Shared terminology for the visual properties of interface objects and graph types...
– Long Term (5+ yrs): Semantic fields to describe tag key modal functions.
• Seamless Data Access
– Current: Access mediated by agreed standard vocabularies, hard-wired connections
– Near Term (0-2 yrs): Access mediated by common ontologies
– Mid Term (2-5 yrs): Mediation aided by services with domain/ range properties
– Long Term (5+ yrs): Key data access services are semantically mediated
Roadmap - from near-term to mid-term
• Assisted Discovery & Mediation
– Near Term (0-2 yrs): Semantic geospatial search & inference, access
– -> requires agent development and vocabulary for agent characterization
– Mid Term (2-5 yrs): Semantic agent-based searches
• Assisted Knowledge Building
– Near Term (0-2 yrs): Ontologies for data mining, visualization and analysis emerging/ maturing
– -> requires mature (domain and data-type) ontologies with community endorsement and governance and a robust integration framework
– Mid Term (2-5 yrs): Common terminology captured in ontologies, crossing domains
• Verifiable Information Quality
– Near Term (0-2 yrs): Ontologies for information quality developed
– -> requires mature quality and uncertainty ontologies with domain and range properties added and populated
– Mid Term (2-5 yrs): Domain and range properties in ontologies used in tools
• Responsive Information Delivery
– Near Term (0-2 yrs): Services annotated with resource descriptions
– -> requires semantic service (ontology) registry
– Mid Term (2-5 yrs): Dynamic service discovery and mediation, and data scheduling
• Interoperable Information services
– Near Term (0-2 yrs): Basic data tailoring services (data as service), verification/ validation
– -> requires service to implement v/v, new descriptions of analyses, developing explanation
– Mid Term (2-5 yrs): Interoperable geospatial services (analysis as service), results explanation service
• Interactive Data Analysis
– Near Term (0-2 yrs): Tag properties, non-jargon vocabulary for non-specialist use
– -> requires development of portal modal function vocabulary and ontology, link to domain context and data structure
– Mid Term (2-5 yrs): Shared terminology for the visual properties of interface objects and graph types...
• Seamless Data Access
– Near Term (0-2 yrs): Access mediated by common ontologies
– -> requires adding properties to classes in ontologies and populating instances with expert agreement
– Mid Term (2-5 yrs): Mediation aided by services with domain/ range properties
54
Selected Technical Benefits
1. Integrating Multiple Data Sources
2. Semantic Drill Down / Focused Perusal
3. Statements about Statements
4. Inference
5. Translation
6. Smart (Focused) Search
7. Smarter Search … Configuration
8. Proof and Trust
Updated material reused from “The Substance of the Web”. McGuinness and Dean. Semantic Web Applications for National
Security. May, 2005. http://www.schafertmd.com/swans/agenda.html
55
1: Integrating Multiple Data
Sources
• The Semantic Web lets us merge statements from different sources
• The RDF Graph Model allows programs to use data uniformly regardless of the source
• Figuring out where to find such data is a motivator for Semantic Web Services
Figure: RDF graph for #Ionosphere: hasCoordinates #magnetic; name “Terrestrial Ionosphere”; hasLowerBoundaryValue “100”; hasLowerBoundaryUnit “km”. Different line & text colors represent different data sources
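Merging statements from different sources amounts to taking the union of their triple sets. The sketch below uses the property names from the figure; which source contributed which statement is an assumption made here for illustration, since the color coding is lost in this transcript.

```python
# Two hypothetical sources describing the same resource.
source_a = {
    ("#Ionosphere", "name", "Terrestrial Ionosphere"),
    ("#Ionosphere", "hasCoordinates", "#magnetic"),
}
source_b = {
    ("#Ionosphere", "hasLowerBoundaryValue", "100"),
    ("#Ionosphere", "hasLowerBoundaryUnit", "km"),
}

# Because every statement has the same shape, merging is a plain set union.
merged = source_a | source_b


def describe(graph, subject):
    """Collect everything said about one subject, regardless of where it came from."""
    return {p: o for s, p, o in graph if s == subject}


print(describe(merged, "#Ionosphere"))
```

A program reading `merged` does not need to know that two sources were involved, which is the uniformity the RDF graph model provides.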
2: Drill Down /Focused
Perusal
• The Semantic Web uses Uniform Resource Identifiers (URIs) to name things
• These can typically be resolved to get more information about the resource
• This essentially creates a web of data analogous to the web of text created by the World Wide Web
• Ontologies are represented using the same structure as content
– We can resolve class and property URIs to learn about the ontology
Figure labels: …#NeutralTemperature; measuredby; ...#FPI; ...#ISR; type; operatedby; locatedIn; ...#MillstoneHill; …#EISCAT; …#Norway; Internet
57
3: Statements about Statements
• The Semantic Web allows us to make statements about statements
– Timestamps
– Provenance / Lineage
– Authoritativeness / Probability / Uncertainty
– Security classification
– …
• This is an unsung virtue of the Semantic Web
Figure labels: #Danny’s; #Aurora; hasSource; hasDateTime; hascolor; 20031031; Red
Ontologies Workshop, APL May 26, 2006
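A statement about a statement can be pictured by using the triple itself as the thing being described. The sketch below is one possible reading of the figure labels (an aurora observation with a source and a date); the exact subject and source identifiers are assumptions. RDF expresses the same idea with reification or named graphs rather than Python dictionaries.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The base statement (assumed reading of the figure): the aurora is red.
observation = Triple("#Aurora", "hasColor", "Red")

# Statements about that statement: who reported it and when.
annotations = {
    observation: {"hasSource": "#Danny", "hasDateTime": "20031031"},
}


def provenance(statement: Triple, key: str):
    """Look up one piece of metadata attached to a statement, or None if absent."""
    return annotations.get(statement, {}).get(key)


print(provenance(observation, "hasSource"))
print(provenance(observation, "hasDateTime"))
```

Timestamps, lineage, uncertainty and security labels from the list above would all be further keys attached to the same statement.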
58
4: Inference
• The formal foundations of the Semantic Web allow us to infer additional (implicit) statements that are not explicitly made
• Unambiguous semantics allow question answerers to infer that objects are the same, objects are related, objects have certain restrictions, …
• SWRL allows us to make additional inferences beyond those provided by the ontology
Figure labels: #Millstone Hill; #Interferometer; #VerticalMeans; OperatesInstrument; hasInstrument; isOperatedBy; Measures; hasTypeofData; hasOperatingMo; hasMeasuredData
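One simple kind of inference is deriving the inverse of an asserted relationship. The sketch below assumes, for illustration, that `operatesInstrument` and `isOperatedBy` (names adapted from the figure labels) are declared inverses of each other; the slide does not state this, and a real ontology would declare it in OWL and leave the derivation to a reasoner.

```python
# Assumed ontology declaration: property -> its inverse.
INVERSE_OF = {"operatesInstrument": "isOperatedBy"}

asserted = {("#MillstoneHill", "operatesInstrument", "#Interferometer")}


def infer_inverses(triples, inverse_of):
    """Return the asserted triples plus the implicit inverse statements."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in inverse_of:
            inferred.add((o, inverse_of[p], s))
    return inferred


closure = infer_inverses(asserted, INVERSE_OF)
for triple in sorted(closure):
    print(triple)
```

The second triple in the output was never stated explicitly; it follows from the assumed inverse declaration.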
59
5: Translation
• While encouraging sharing,
the Semantic Web allows
multiple URIs to refer to the
same thing
• There are multiple levels of
mapping
– Classes
– Properties
– Instances
– Ontologies
• OWL supports equivalence
and specialization; SWRL
allows more complex
mappings
[Figure: two descriptions of #precipitation, one with labels name, ont1:Precipitation, ont1:EduLevel, VO:Scientist and one with labels name, ont2:Rain, ont2:EduLevel, EduVO:K-12]
60
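A minimal sketch of translation between two vocabularies, assuming the mappings are stored as simple pairs. The identifiers echo the diagram, but the particular mappings (Rain as a specialization of Precipitation, the two EduLevel terms as equivalent) and the `#obs1`/`#obs2` instances are hypothetical.

```python
# Hypothetical mappings between two ontologies.
equivalent = {("ont1:EduLevel", "ont2:EduLevel")}     # same meaning
sub_class_of = {("ont2:Rain", "ont1:Precipitation")}  # Rain specializes Precipitation

# Hypothetical instances, each typed with a class from one ontology.
types = {
    "#obs1": "ont1:Precipitation",
    "#obs2": "ont2:Rain",
}


def matches(cls, target):
    """True if `cls` is `target`, equivalent to it, or a direct specialization."""
    if cls == target:
        return True
    if (cls, target) in equivalent or (target, cls) in equivalent:
        return True
    return (cls, target) in sub_class_of


def instances_of(target):
    """Return the instances whose class matches `target` under the mappings."""
    return sorted(item for item, cls in types.items() if matches(cls, target))
```

Asking for `ont1:Precipitation` also returns data typed as `ont2:Rain`, while asking for `ont2:Rain` does not return general precipitation, since specialization runs in one direction only.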
6: Smart (Focused) Search
• The Semantic Web
associates 1 or more
classes with each
object
• We can use ontologies
to enhance search by:
– Query expansion
– Sense disambiguation
– Type with restrictions
– ….
61
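A minimal sketch of query expansion, assuming the ontology is reduced to a child-to-parent class table. The class names and documents are hypothetical; the point is only that a search for a general class can be widened to its subclasses.

```python
# Hypothetical class hierarchy: child -> parent.
parent_of = {
    "Rain": "Precipitation",
    "Snow": "Precipitation",
    "Precipitation": "WeatherPhenomenon",
}

# Hypothetical documents, each tagged with one or more classes.
documents = {
    "doc1": {"Rain"},
    "doc2": {"Snow"},
    "doc3": {"Aurora"},
}


def expand(term):
    """Return `term` plus every class below it in the hierarchy."""
    expanded = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in parent_of.items():
            if parent in expanded and child not in expanded:
                expanded.add(child)
                changed = True
    return expanded


def search(term):
    """Return documents tagged with `term` or any of its subclasses."""
    wanted = expand(term)
    return sorted(doc for doc, tags in documents.items() if tags & wanted)
```

A plain keyword match on "Precipitation" would miss both tagged documents here; the expanded query finds them because the hierarchy says rain and snow are kinds of precipitation.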
7: Smarter Search / Configuration
62
GEONGRID Ontology Search
and Data Integration Example
Uses emerging web standards to enable smart web
applications
Given an upper-level domain choice
•Ecology
Illustrate or list contained concepts/hierarchy
•VegetationCover, TreeRings, etc.
Retrieve some specific options from web
•Maps, tree-ring data
Info: https://portal.geongrid.org:8443/gridsphere/gridsphere
63
8: Proof
• The logical foundations of the Semantic Web
allow us to construct proofs that can be used
to improve transparency, understanding, and trust
• Proof and Trust are ongoing research areas for
the Semantic Web: see PML and Inference Web
[Figure: graph with labels #Critical Dataset, #FlatField, #Solar Physics Paper, hasCalibration, hasPeerReview; e.g., “Critical Dataset has been calibrated with a flat field program that is published in the peer reviewed literature.”]
66
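A minimal sketch of a proof as a conclusion plus the statements that support it, assuming the diagram's two links (hasCalibration, hasPeerReview) chain as in the quoted example sentence. The derived property name `hasPeerReviewedCalibration` is hypothetical, and the returned structure is only an illustration; it is not the PML format.

```python
# Statements assumed from the diagram labels.
facts = {
    ("#CriticalDataset", "hasCalibration", "#FlatField"),
    ("#FlatField", "hasPeerReview", "#SolarPhysicsPaper"),
}


def prove_peer_reviewed_calibration(dataset, graph):
    """Return (conclusion, supporting statements), or None if no proof exists."""
    for s, p, program in sorted(graph):
        if s == dataset and p == "hasCalibration":
            for s2, p2, paper in sorted(graph):
                if s2 == program and p2 == "hasPeerReview":
                    conclusion = (dataset, "hasPeerReviewedCalibration", program)
                    support = [(s, p, program), (s2, p2, paper)]
                    return conclusion, support
    return None


proof = prove_peer_reviewed_calibration("#CriticalDataset", facts)
```

Keeping the supporting statements alongside the conclusion is what lets a user ask why an answer holds and trace it back to its sources.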
Inference Web
Framework for explaining reasoning tasks by storing,
exchanging, combining, annotating, filtering, segmenting,
comparing, and rendering proofs and proof fragments
provided by multiple distributed reasoners.
• OWL-based Proof Markup Language (PML) specification as
an interlingua for proof interchange
• IWExplainer for generating and presenting interactive
explanations from PML proofs providing multiple dialogues
and abstraction options
• IWBrowser for displaying (distributed) PML proofs
• IWBase distributed repository of proof-related meta-data such
as inference engines/rules/languages/sources
• Integrated with theorem provers, text analyzers, web
services, …
http://iw.rpi.edu
67
Inference Web Infrastructure
(McGuinness et al., 2004, http://www.ksl.stanford.edu/KSL_Abstracts/KSL-04-03.html )
[Figure: Inference Web infrastructure, with these labels. Files/WWW. Reasoners and services: Semantic Discovery Service (DAML/SNRC), CWM (NSF TAMI), JTP (DAML/NIMD), SPARK (DARPA CALO), UIMA (DTO NIMD Text Analytics Exp Aggregation). Languages: OWL-S/BPEL, N3, KIF, SPARK-L. Proof Markup Language (PML): Trust, Justification, Provenance. Toolkit: IWTrust (trust computation), IW Explainer/Abstractor (end-user friendly visualization), IWBrowser (expert friendly visualization), IWSearch (search engine based publishing), IWBase (provenance registration)]
Framework for explaining question answering tasks by
• abstracting, storing, exchanging,
• combining, annotating, filtering, segmenting,
• comparing, and rendering proofs and proof fragments
provided by question answerers.
68
SW Questions & Answers
Users can explore extracted entities and relationships, create new
hypotheses, ask questions, browse answers and get explanations for
answers.
[Screenshot callouts: a question; an answer; a context for explaining the answer; an abstracted explanation]
(this graphical interface was done by Battelle, supported by Stanford KSL)
69
Summary
• Semantics are a key ingredient for progress in
informatics and e-Science
• Sustained involvement of key interdisciplinary
team members is very important: it leads to
incentives, rewards, etc., and a balance of research
and production
• This is what we will be teaching you in this class
70
Semantic Web Methodology and
Technology Development Process
• Establish and improve a well-defined methodology vision for
Semantic Technology based application development
• Leverage controlled vocabularies, etc.
[Figure: development cycle with stages labeled Use Case; Small Team, mixed skills; Analysis; Develop model/ontology; Use Tools; Science/Expert Review & Iteration; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Evaluation]
71
Outline of the course
• Topics for Semantic e-Science/ Foundations:
– Semantic Methodologies
– Knowledge Representation for e-Science
– Ontology Engineering and Re-Use for e-Science
– Knowledge Integration for e-Science
– Semantic Data Integration
– Semantic Web Languages, Tools and Services
– Semantic Infrastructure and Architecture for e-Science
– Semantic Grid Middleware
– Ontology Evolution for e-Science
– Knowledge Management for e-Science
– e-Science Workflow Management
– Data life-cycle for e-Science
– Data Mining and Knowledge Discovery
72
SeS Applications and Ontologies
• Semantic Web for Health Care and Life Science
• Semantic Web for Bio-Med-informatics
• Semantic Web for System and Integrated Biology
• Semantic Web for Sun, Earth, Environment and
Climate
• Semantic Web for Chemistry, Physics and
Astronomy
• Semantic Web for Engineering
• Semantic Web and Digital Libraries and Scientific
Publications
73
SeS Project options
• Configuration and Deployment of Semantic Virtual
Observatories
– Oceanography, astronomy, geology
• Ontology Merging and Validation Test-bed
• Semantic Language and Tool Use and Evaluation
• Semantic eScience Implementation Evaluation
• Semantic Collaboration Case Studies
• Semantic Application Development and
Demonstration
74
Schedule - wiki
• Reading assignments
• Assignments
– Individual
– Group
• Written assessments
• Presentation assessments
• Group assessments
75
What we expect
• Attend class, complete assignments
• Participate
• Ask questions – be honest with yourself and
others about what you do and do not know
• Work both individually and in a group
• Work constructively in group and class
sessions
76
Logistics summary
• Class - Monday 1-3:50pm
• Office hours – By Appointment along with a
regular time to be determined
• This week's assignment:
– Reading - Ontologies 101, Semantic Web, eScience, RDFS
– Turn in a one page description of one of your
favorite papers from the reading list
• Next class (week 2 – September 13*****):
– Foundations I: Methodologies, Knowledge
Representation
• Questions?
77
Extra
78
Digital natives expect services to
accommodate their preferences.
• Information online, not “in line”
• Information on-demand, free of place or time
• Blended classroom and online experience
• Flexible schedule for working students
• Relevant and timely content
• More team collaboration
• More content from multiple sources
• Interactive content from voice, video and data
• Ability to contribute, as well as consume,
content/knowledge
• Leads to virtual access…
Progression after progression
[Figure: progression with labels Informatics; IT Cyberinfrastructure; Cyber Informatics; Core Informatics; Science Informatics, aka Xinformatics; Science, Societal Benefit Areas]
80
Summary
• Data and information challenges are (almost)
being recognized as increasingly common
• Data and information science is becoming the
‘fourth’ column (along with theory, experiment
and computation)
• Informatics is playing a key role in filling the gap
between science (and the spectrum of non-expert)
use and generation and the underlying
cyberinfrastructure, as is evident from the
emergence of Xinformatics (world-wide)
• Informatics is a profession and a community
activity; it requires synergistic efforts in all 3
sub-areas (science, core, cyber)
81