Data integration, web services
and workflow management
Paolo Romano
National Cancer Research Institute, Genova
([email protected])
P. Romano, Tutorial BITS2005
Summary
Information and data integration
Web Services
CABRI and TP53 databases
Implementation of Web Services (Soaplab)
Workflow management
Demo: execution of workflows with Taverna
Information in biology

Biomedical research produces an ever-increasing quantity of new information
Some domains, like genomics and proteomics, contribute to huge databases
Emerging domains, like mutation and variation analysis, polymorphisms and metabolism, and technologies such as microarrays, will contribute even larger amounts of data
Information in biology

EMBL Data Library 74 (Mar 2003):
o Sequences: 23,234,788, Bases: 30,356,786,718

EMBL Data Library 81 (Dec 2004):
o Sequences: 40,696,839, Bases: 44,285,259,441
o WGS sequences: 5,408,558, Bases: 34,986,041,399

EMBL Data Library 82 (Mar 2005):
o Sequences: 43,246,005, Bases: 46,927,070,905
o WGS sequences: 6,228,397, Bases: 38,207,643,477
o Size: 7.3% more vs 81 (3 months), 112.9% more vs 74 (24 months)
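The two growth percentages seem to refer to the number of entries with WGS sequences included. That reading is inferred from the figures and is not stated on the slide. A minimal Python check under that assumption, treating the missing WGS count of release 74 as zero:

```python
# Entry counts copied from the release figures above: (sequences, WGS sequences).
releases = {
    74: (23_234_788, 0),          # assumption: no WGS count is given for release 74
    81: (40_696_839, 5_408_558),
    82: (43_246_005, 6_228_397),
}

def total_entries(release: int) -> int:
    """Sum of standard and WGS entries for a release."""
    sequences, wgs = releases[release]
    return sequences + wgs

def growth_percent(new: int, old: int) -> float:
    """Percentage increase in total entries from release `old` to release `new`."""
    return (total_entries(new) / total_entries(old) - 1) * 100

print(f"82 vs 81: {growth_percent(82, 81):.1f}%")
print(f"82 vs 74: {growth_percent(82, 74):.1f}%")
```

Computed on bases instead of entries, the percentages come out differently, so the entry-count reading is the one that matches the slide.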
Heterogeneity of databanks
Only a few databanks (sequence data) are managed in an almost homogeneous way by EBI, NCBI and DDBJ
Many databanks are created by small groups or single researchers
Secondary databases are of high quality (good and extended annotation, quality control)
Many databases are highly specialized, e.g. by gene, organism, disease, mutation, etc.
Databanks are distributed: they differ in DBMS, data structures, information, semantics and distribution methods
Software
Specialist software is essential for almost all analyses in molecular biology:
o Sequence analysis, secondary and tertiary protein structure prediction, gene prediction, molecular evolution, etc.
Software must interoperate with databases:
o Databases as input for software
o Results as new data to record and analyze
Goals of the integration

Integration is needed in order to:
o Achieve a better and wider view of all available information
o Carry out analyses and/or searches involving multiple databases and software tools automatically
o Perform analyses involving large data sets
o Carry out real data mining
Integration longevity

Integration needs stability:
o Standardization
o Good domain knowledge
o Well defined data
o Well defined goals
Integration fears:
o Heterogeneity of data and systems
o Uncertain domain knowledge
o Fast evolution of data
o Highly specialized data
o Lack of predefined, clear goals
o Originality, experimentalism (“let me see if this works”)
Integration of biological information
In biology:
o Goals and needs of researchers evolve very quickly according to new theories and discoveries
o Pre-analysis and reorganization of the data is very difficult, because data and related knowledge vary continuously
o The complexity of information makes it difficult to design data models that remain valid across different domains and over time
Integration methods
Explicit (reciprocal) links (xrefs)
Implicit links (e.g., names)
Common contents (vocabularies)
Shared data models and schemas
Ontologies
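The difference between the first two methods can be illustrated with a small Python sketch. All records, identifiers and field names below are hypothetical and only serve to contrast following a stored cross-reference with matching on a shared name.

```python
# Hypothetical records; identifiers and field names are illustrative only.
catalogue = [
    {"id": "CL-001", "name": "HeLa", "xrefs": {"PUBMED": ["1234567"]}},
    {"id": "CL-002", "name": "Jurkat", "xrefs": {}},
]
literature = {"1234567": {"title": "Example paper"}}
mutations = [
    {"id": "MUT-9", "cell_line": "hela"},
    {"id": "MUT-10", "cell_line": "K-562"},
]

def explicit_links(record, target_db, target):
    """Follow the cross-references (xrefs) stored in the record itself."""
    return [target[i] for i in record["xrefs"].get(target_db, []) if i in target]

def implicit_links(record, others, field):
    """Match on a shared name, ignoring case; other spelling variants are missed."""
    key = record["name"].casefold()
    return [other for other in others if other[field].casefold() == key]

hela = catalogue[0]
print(explicit_links(hela, "PUBMED", literature))
print(implicit_links(hela, mutations, "cell_line"))
```

Explicit links are exact but must be curated by the database maintainers. Implicit links need no curation but depend on consistent naming, which is where common vocabularies and ontologies help.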

Web Services

XML-based network services
Use standard protocols (SOAP, HTTP)
Standards are available for their discovery and identification (UDDI), description (WSDL) and composition (WSFL)
Allow software applications to access data “intelligently”: identification of contents, interpretation of semantic information
Metadata needed
Web Services are implemented by many institutes and service nodes (EBI, NCBI, ...)
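As an illustration of what an XML-based SOAP call looks like, here is a minimal Python sketch that builds a SOAP 1.1 request envelope. The service namespace, operation name and parameter are assumptions made for the example; they do not describe any real EBI or NCBI service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:cellline-search"  # hypothetical service namespace

def build_request(operation: str, params: dict) -> bytes:
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for key, value in params.items():
        ET.SubElement(call, key).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")

# Hypothetical operation and parameter, for illustration only.
request = build_request("getCellLineIdsByName", {"name": "HeLa"})
print(request.decode("utf-8"))
```

In practice a client toolkit generates this envelope from the WSDL description and sends it over HTTP; the sketch only shows the message structure.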
WSDL: the description
Web Services Description Language (WSDL)

Standard for the description of Web Services
Defines location, access methods and detailed description
Describes both abstract functionality and concrete details
WSDL bindings: implementations for SOAP, HTTP, MIME
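A minimal Python sketch of reading the abstract part of a WSDL 1.1 description follows. The embedded document is a hypothetical, heavily trimmed example (a real WSDL file also declares messages, bindings and a service address); only the WSDL namespace is standard.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical, heavily trimmed WSDL document: abstract part only.
WSDL_TEXT = """\
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="CellLineSearch">
  <portType name="CellLineSearchPort">
    <operation name="getCellLineIdsByName"/>
    <operation name="getCellLinesById"/>
  </portType>
</definitions>
"""

def list_operations(wsdl_text: str) -> dict:
    """Map each portType name to the names of the operations it declares."""
    root = ET.fromstring(wsdl_text)
    return {
        port.get("name"): [op.get("name") for op in port.findall("wsdl:operation", WSDL_NS)]
        for port in root.findall("wsdl:portType", WSDL_NS)
    }

print(list_operations(WSDL_TEXT))
```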
CABRI: Objectives
Common Access to Biological Resources and Information
Setting Quality Management Guidelines
Distributing biological resources of the highest quality
Integrating searches and access to catalogues
Ad hoc search (CABRI Simple Search)
Shopping cart (pre-ordering facility)
CABRI: Partners and resources
Partners:
BCCM, CABI, CBS, CIP, DSMZ, ECACC, ICLC, NCCB, NCIMB (culture collections)
IST, CERDIC (ICT)
Resources:
Microorganisms (bacteria, yeasts, fungi strains)
Animal cells (animal and human cell lines, hybridomas, HLA typed B lines)
Plasmids, phages, viruses, DNA probes
Overall, more than 110,000 biological resources
CABRI: SRS

Reasons why:
o Manages heterogeneous databases
o Flat file format
o Simple and effective interface
o Internal and external links
o Link operator
o Easily expandable (new databases)
o Flexibility in creation of indexes
CABRI: data structure
For each material, three data sets are identified:
Minimum Data Set (MDS): essential data, needed to identify individual resources
Recommended Data Set (RDS): all data that are useful to describe individual resources
Full Data Set (FDS): all data available on the resources
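One way to picture the nesting of the data sets is as growing sets of required fields. The Python sketch below is only an illustration: "Name" and "Reference paper" are fields described in this tutorial as required for MDS, while the extra RDS fields are made up for the example and are not the actual CABRI field lists.

```python
# Only "Name" and "Reference paper" come from the guidelines quoted in this
# tutorial; the additional RDS fields are hypothetical placeholders.
MDS = {"Name", "Reference paper"}
RDS = MDS | {"History", "Growth conditions"}

def data_set_level(record: dict) -> str:
    """Return the richest data set (RDS, then MDS) whose fields are all filled in."""
    filled = {field for field, value in record.items() if value}
    if RDS <= filled:
        return "RDS"
    if MDS <= filled:
        return "MDS"
    return "incomplete"

print(data_set_level({"Name": "Example name", "Reference paper": "Example ref"}))
```

The FDS is not modelled because it is defined as all available data rather than as a fixed list of fields.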
CABRI: data structure
For each item of information, data input and authentication guidelines are defined, including:
Detailed textual description of the information
In-house reference lists of terms and controlled vocabularies
Predefined syntaxes (e.g., Literature, scientific names)
CABRI: Name field
Field: Name
Description: Full scientific and most recent name of the strain. It includes:
o Genus name and species epithet
o Subspecies
o Pathovar
o Authors of the name
o Year of valid publication or validation
o Approbation of the name
Input process: Enter the full scientific name as given by the depositor and confirmed (or changed) by the collection. Names of authors of the name, year of valid publication or validation and approbation are included after a comma.
Values for approbation:
o AL = approved list, cf. IJSB 1980
o VL = validation list, in IJSB after 1980
o VP = validly published, paper in IJSB after 1980
Reference list: DSMZ list of bacterial names
Required for: MDS
CABRI: Reference paper field
Field: Reference paper
Description: Original paper [if available]
Input process: New entries:
JournalTitle Year; Volume(issue): beginning page#-ending page#
The title is abbreviated following international standard rules (ISSN). Abbreviations are written without dots. Authors and title of the article are not mentioned.
The reference can be followed by the PubMed ID enclosed within square brackets as follows:
[PMID: 1234567], where '1234567' is the PubMed ID of the paper
Required for: MDS
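A predefined syntax like this one can be checked automatically. The Python sketch below uses a regular expression written for this example; it treats the issue and the PMID as optional, which is an assumption, and the journal abbreviation in the sample string is invented.

```python
import re

# Pattern for: JournalTitle Year; Volume(issue): first page-last page [PMID: n]
# Assumption: the "(issue)" part and the trailing PMID are both optional.
REFERENCE = re.compile(
    r"^(?P<journal>.+?) (?P<year>\d{4}); "
    r"(?P<volume>\d+)(?:\((?P<issue>[^)]+)\))?: "
    r"(?P<first_page>\d+)-(?P<last_page>\d+)"
    r"(?: \[PMID: (?P<pmid>\d+)\])?$"
)

def parse_reference(text: str):
    """Return the reference fields as a dict, or None if the syntax does not match."""
    match = REFERENCE.match(text.strip())
    return match.groupdict() if match else None

# Invented sample entry; "1234567" is the placeholder PMID used above.
print(parse_reference("J Example Res 1999; 12(3): 45-67 [PMID: 1234567]"))
```

A check of this kind can be run at data input time, so that entries from different catalogues stay comparable.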
CABRI: integration
For each material:
Common data structure and syntax
Integrated searches/results through SRS
For each catalogue:
SRS and HTML links to reference databases (media, synonyms, hazard, etc.)
For many catalogues:
Explicit links to Medline, EMBL, plasmid maps
IARC TP53 database
IARC TP53 Mutation Database
http://www.iarc.fr/p53/
Release 9: 19,809 somatic mutations, 1,769 papers
Information: mutation, source, patient's lifestyle
Vocabularies and standardized annotations
On-line queries require human interaction
SRS implementation of the TP53 Database
http://srs.o2i.it/srs71/
SRS-based service
Definition of an ad hoc DTD
XML-based data interchange
Improved automated accessibility
CABRI and TP53 Web Services
Implementing web services that allow:
 The retrieval of information from CABRI and TP53 databases by
using remote calls to SRS
 The possibility of including such services in complex workflows
Reproducing current behaviour:
 Search by name, identifier and free text (CABRI)
 Search by interesting properties (TP53)
 Combine results
 Integrate data with other sources by using IDs/common terms
Two types of services:
 Search for a specific feature and return ID
 Search for an ID and return full record (or predefined sections)
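The two service types are meant to be chained: the first returns identifiers, the second turns identifiers into records. A minimal Python sketch of that pattern, using an in-memory stand-in for a catalogue (the identifiers, records and function names are hypothetical, and no remote SRS call is made):

```python
# In-memory stand-in for a catalogue; identifiers and records are hypothetical.
CATALOGUE = {
    "CL-001": {"name": "HeLa", "species": "Homo sapiens"},
    "CL-002": {"name": "HeLa S3", "species": "Homo sapiens"},
    "CL-003": {"name": "Vero", "species": "Chlorocebus aethiops"},
}

def search_ids_by_name(name: str) -> list:
    """Service type 1: search for a feature and return the matching IDs only."""
    needle = name.casefold()
    return sorted(i for i, rec in CATALOGUE.items() if needle in rec["name"].casefold())

def get_record_by_id(record_id: str, sections=None) -> dict:
    """Service type 2: return the full record, or only the requested sections."""
    record = CATALOGUE[record_id]
    return dict(record) if sections is None else {s: record[s] for s in sections}

# Chaining the two calls is what a workflow does with the two services.
ids = search_ids_by_name("hela")
records = [get_record_by_id(i, sections=["name"]) for i in ids]
print(ids, records)
```

Keeping the search step limited to IDs makes it easy to combine result lists from several searches before fetching any full record.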
Soaplab: SOAP-based Analysis Web Service
“Soaplab is a set of Web Services providing a programmatic access to some applications on remote computers. It is often referred to as an Analysis (Web) Service” (Martin Senger, EBI).
It allows for the implementation of Web Services offering access to:
local command-line applications
EMBOSS
contents of ordinary web pages (GowLab)
Requirements:
Apache Tomcat servlet engine and Axis SOAP toolkit, Java
Perl, MySQL
Soaplab
appl: getCellLineIdsByName [
  documentation: "Get cell lines by name from CABRI human and animal cell lines catalogues (see www.cabri.org)"
  groups: "CABRI"
  nonemboss: "Y"
  comment: "launcher get"
  supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
  comment: "method [{$libs}-nam:'$name'] -ascii"
]
string: libs [ parameter: "Y" ]
string: name [ parameter: "Y" ]
outfile: result [ ]
Soaplab
appl: getCellLineIdsByProperty [
  documentation: "Get cell lines by properties (all text) from CABRI human and animal cell lines catalogues (see www.cabri.org)"
  groups: "CABRI"
  nonemboss: "Y"
  comment: "launcher get"
  supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
  comment: "method [{$libs}-all:'$text'] -ascii"
]
string: libs [ parameter: "Y" ]
string: text [ parameter: "Y" ]
outfile: ids [ ]
Soaplab
appl: getCellLinesById [
  documentation: "Get cell lines by Id from CABRI human and animal cell lines catalogues (see www.cabri.org)"
  groups: "CABRI"
  nonemboss: "Y"
  comment: "launcher get"
  supplier: "http://www.cabri.org/CABRI/srs-bin/wgetz"
  comment: "method -e [{$libs}:'$id'] -ascii"
]
string: libs [ parameter: "Y" ]
string: id [ parameter: "Y" ]
outfile: result [ ]
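To see what the "method" templates in these definitions amount to, the Python sketch below fills in the $-placeholders and appends the result to the supplier URL. How Soaplab actually assembles the request is not described here, so joining the parts with "+" is an assumption, and the library name and identifier are placeholders rather than real CABRI values. No request is sent.

```python
from string import Template
from urllib.parse import quote

SUPPLIER = "http://www.cabri.org/CABRI/srs-bin/wgetz"  # "supplier" value above

# "method" templates copied from the three definitions above.
METHODS = {
    "getCellLineIdsByName": "[{$libs}-nam:'$name'] -ascii",
    "getCellLineIdsByProperty": "[{$libs}-all:'$text'] -ascii",
    "getCellLinesById": "-e [{$libs}:'$id'] -ascii",
}

def build_url(service: str, **params: str) -> str:
    """Fill in the $-placeholders and append the result to the supplier URL."""
    query = Template(METHODS[service]).substitute(params)
    # Assumption: the space-separated parts are joined with "+" in the query string.
    return SUPPLIER + "?" + quote(query.replace(" ", "+"), safe="+[]{}:'")

# "CELL_LIB" and "ID0001" are placeholders, not real CABRI names.
print(build_url("getCellLineIdsByName", libs="CELL_LIB", name="HeLa"))
print(build_url("getCellLinesById", libs="CELL_LIB", id="ID0001"))
```

The three services differ only in the query template, which is why one launcher ("get") can serve all of them.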
Workflow management
“A computerized facilitation or automation of a business process, in whole or part” (Workflow Management Coalition).
The main goal is:
the implementation of data analysis processes in standardized environments
The main advantages relate to:
effectiveness: being an automatic procedure, a workflow frees bioscientists from repetitive interactions with the web and supports good practice,
reproducibility: analyses can be replicated over time,
reusability: intermediate results can be reused,
traceability: the workflow is carried out in a transparent analysis environment where data provenance can be checked and/or controlled.
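These properties can be illustrated with a toy workflow runner in Python. It is a sketch written for this tutorial text, not how any of the workflow systems work internally: each run keeps the intermediate results (reusability) and a trail of inputs and outputs per step (traceability), and running it again on the same input gives the same result (reproducibility).

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """Ordered list of named steps; each run returns a provenance trail."""
    steps: list = field(default_factory=list)

    def add(self, name, func):
        """Append a step and return the workflow, so calls can be chained."""
        self.steps.append((name, func))
        return self

    def run(self, data):
        """Apply the steps in order, recording input and output of each one."""
        trail = []
        for name, func in self.steps:
            result = func(data)
            trail.append({"step": name, "input": data, "output": result})
            data = result
        return data, trail

# Two illustrative steps on a list of cell line names.
workflow = (
    Workflow()
    .add("normalize", lambda names: [n.strip().casefold() for n in names])
    .add("deduplicate", lambda names: sorted(set(names)))
)
result, trail = workflow.run([" HeLa", "hela ", "Vero"])
print(result)
for entry in trail:
    print(entry["step"], "->", entry["output"])
```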
Workflow management
Workflow management software:
Biopipe, an add-on to BioPerl
GPipe, an extension of the Pise interface
Taverna (EBI), a component of the myGrid platform
Wildfire (Bioinformatics Institute, Singapore)
Pipeline Pilot (SciTegic)
Workflow management
Taverna Workbench
constructs complex analysis workflows
accesses both remote and local processors
defines alternative processors
runs workflows
visualizes the results
includes a bioinformatics data ontology
Requirements: Java, Windows or Linux
Workflow management
WSDL services
Web Service Description Language (WSDL) file: adds WSDL-based service nodes
Soaplab servers
Soaplab server: adds a list of Soaplab-provided services
BioMoby registries
Moby Central repository: determines hosts and their services
Workflows
XScufl definition file: adds the workflow as a node and its processors as child nodes
BioMart databases
BioMart data warehouse: adds all available data sets
Local processors
Simple list/string processors, constant values, BeanShell scripts
Demo: workflows for CABRI databases
Demo: workflows for TP53 databases
Some acknowledgements...
This work has partially been supported by the Italian
Ministry for Education, University and Research
(MIUR), project “Oncology over Internet” (2002 – 2005)
I wish to thank my colleagues:
Domenico Marra (TP53 databases and Soaplab),
Federico Malusa (CABRI databases),
Francesca Piersigilli (CABRI databases)
…and an announcement!
Workshop NETTAB 2005
http://www.nettab.org/2005/
Workflows management:
new abilities for the biological information overflow
October 5-7, 2005
University of Naples, Naples, Italy
Take a brochure!