Research Dataspaces: Pay-as-you-go Integration and Analysis. Bill Howe, PhD, University of Washington.


Research Dataspaces:
Pay-as-you-go Integration and Analysis
Bill Howe, PhD
University of Washington
Data acquisition is no longer the
bottleneck to scientific discovery
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)

Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)

Oceanography: high-resolution models, cheap sensors, satellites

Biology: lab automation, high-throughput sequencing,
3/12/09
Bill Howe, eScience Institute
2
Two dimensions
[Chart; axes: # of bytes and # of data types]
Astronomy: LSST, PanSTARRS, SDSS
Oceanography: OOI, IOOS
Biology: LANL HIV, Galaxy, Pathway Commons, BioMart, GEO
Building a Research Data Management System:
Status Quo
1. Establish (scientific) consensus
Scope, vision, requirements, terminology
2. Derive and encode a domain model (schema)
Encode shared knowledge in a machine readable manner
a. Relational schema, ontology, metadata standards,
conventions, controlled vocabularies, object model, API
b. Mappings between existing models
3. Retrofit new domain model to existing data
Populate the schema, attach semantics, clean data
4. Build applications
Use domain model to inform design
5. Analyze data
Do science
The Value of a Data Repository
V_R = B*D^2 + U*D + C
D = # of datasets in the repository
B = # of binary operations facilitated
U = # of unary operations facilitated
C = intrinsic value of the schema (for communication, etc.)
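The value formula can be evaluated directly. The sketch below assumes the D in the first term is squared (binary operations apply to pairs of datasets); the function name `repository_value` and all numbers are made up for illustration and do not come from the talk.

```python
def repository_value(d: int, b: int, u: int, c: float) -> float:
    """V_R = B*D^2 + U*D + C, with D squared in the binary term (assumed reading)."""
    return b * d**2 + u * d + c

# Hypothetical inputs: doubling the number of datasets D
# roughly quadruples the contribution of binary operations.
small = repository_value(d=10, b=2, u=5, c=100)
large = repository_value(d=20, b=2, u=5, c=100)
print(small, large)
```

Under these assumed numbers the binary term dominates as D grows, which is the usual argument for making it cheap to add datasets.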
Quote
A typical biological data management system involves accessing
or gathering data from multiple sources, followed by data
correlation, classification, review, and curation using domain
specific tools (e.g., functional clusters, ontologies) and expertise.
In practice, biological data management is less daunting when it is
considered in the context of an iterative strategy based on gradual
data integration while accumulating domain specific knowledge
throughout the integration process.
Victor Markowitz, LBNL
Outline




Challenges
Dataspaces
Dataspace Support Platforms
Next Steps
Dataspaces
Franklin, Halevy, Maier 2005
slide source: Alon Halevy
Data Management Solutions
Databases vs. Dataspaces
Databases | Dataspaces
Single Schema | Data “Coexistence”
Centralized Administration | Autonomous Sources
Structured Query | Search, Browse, Approximate Answers
Strict Integrity Constraints | Patterns and trends; few global properties
Dataspaces vs. Databases (2)
Databases are Exclusive
  Reject data that violates types, schema, integrity constraints, rules + triggers
  In return:
    structured query, logical and physical data independence, transactions
    …over the clean subset of your data
Dataspaces are Inclusive
  Few restrictions; all data is welcome
  In return, best effort services at first:
    Cataloging, keywords, attribute-value
    …over (almost) everything
Dataspace Services
[Diagram with a Time axis; labels only]
Catalog
Keyword search
Structured Query
Semantic Query
Domain-specific Analysis
Example: The Internet
Example: Ocean Circulation Forecasting System
[System diagram; labels only]
forcings (i.e., inputs): Atmospheric models, Tides, River discharge
FORTRAN, cluster, perl and cron
filesystem: Simulation results, Config and log files, Intermediate files
RDBMS: Annotations, Data Products, Relations
products via the web: salinity isolines, station extractions, model-data comparisons
Example: Environmental Metagenomics
[Workflow diagram; labels only]
Data and steps: environment, SAMPLING, metadata, sequencing, metagenome 1 to 4, CAMERA annotation, ANNOTATION TABLES (Pfams, TIGRfams, COGs, FIGfams), SQLShare, HMMer search, seed alignment (precomputed), alignment of meta*ome fragments, aligned meta*ome fragments, reference tree (precomputed), PPLACER, taxonomic info, STATs
Questions: correlate diversity w/environment; correlate diversity and nutrients; find new genes; find new taxa and their distributions; compare meta*omes
src: Robin Kodner
Example: CHAVI
[Diagram; labels only]
NHP Database, B Cell Control, T Cell Control, NK Cell Control, Genetics Databases, Virus Seq. Data
Relational Dataspace Interface and Analysis
src: Bart Haynes
Outline




Challenges
Dataspaces
Dataspace Support Platforms
Next Steps
Example Systems cast as DSSPs

Atlas (LabKey)



“Data Warehouse” prototype (SCHARP)


data model: triples
iTrails [Salles et al. 2007]


data model: relations
Quarry [Howe, et al. 2006]


data model: relations
SQLShare (UW eScience)


data model: tables and files
Mark Igra will present
data model: triples
Google Fusion Tables [Halevy 2010]

data model: relations
Environmental
Sampling
Sequencing
Pfams, TIGRfams,
COGs, FIGfams
Public annotation DBs
Phylogeny
metadata
correlate diversity
w/environment?
search hits
taxonomic info
correlate diversity
w/nutrients?
find new taxa and
their distributions?
find new genes?
compare meta*omes?
“90% of my time spent manipulating data rather than doing science”
Simple Example

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome (descriptions are truncated on the slide)
###query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
chr_4[480001-580000].287 | 4500
chr_4[560001-660000].1 | 3556
chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis prote
chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SP
chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis prote
chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Pu
chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Pu
chr_24[160001-260000].65 | 3542
chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Pu
chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal h
chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase a
chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase a
chr_11[1-100000].70 | 2886
chr_11[80001-180000].100 | 1523

COGAnnotation_coastal_sample.txt
id | query | hit | e_value | identity | score | query_start | query_end | hit_start | hit_end | hit_length
1 | FHJ7DRN01A0TND.1 | COG0414 | 1.00E-08 | 28 | 51 | 1 | 74 | 180 | 257 | 285
2 | FHJ7DRN01A1AD2.2 | COG0092 | 3.00E-20 | 47 | 89.9 | 6 | 85 | 41 | 120 | 233
3 | FHJ7DRN01A2HWZ.4 | COG3889 | 0.0006 | 26 | 35.8 | 9 | 94 | 758 | 845 | 872
…
2853 | FHJ7DRN02HXTBY.5 | COG5077 | 7.00E-09 | 37 | 52.3 | 3 | 77 | 313 | 388 | 1089
2854 | FHJ7DRN02HZO4J.2 | COG0444 | 2.00E-31 | 67 | 127 | 1 | 73 | 135 | 207 | 316
…
3566 | FHJ7DRN02FUJW3.1 | COG5032 | 1.00E-09 | 32 | 54.7 | 1 | 75 | 1965 | 2038 | 2105
…
select *
from annotationsummary_combinedorfannotation16_phaeo_genome,
COGAnnotation_surface
where phaeo_gene = surf_hit
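To make the join concrete, here is a minimal sketch using Python's built-in `sqlite3`. The shortened table names (`phaeo_genome`, `cog_surface`), the reduced column set, and the sample rows are assumptions for illustration; only the join condition `phaeo_gene = surf_hit` comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, shortened stand-ins for the two uploaded datasets.
cur.execute("CREATE TABLE phaeo_genome (query TEXT, phaeo_gene TEXT)")
cur.execute("CREATE TABLE cog_surface (id INTEGER, surf_hit TEXT, e_value REAL)")

# Illustrative rows only; not the real data.
cur.executemany("INSERT INTO phaeo_genome VALUES (?, ?)", [
    ("chr_9_a", "COG4547"), ("chr_9_b", "COG5077"), ("chr_12_a", "COG5032"),
])
cur.executemany("INSERT INTO cog_surface VALUES (?, ?, ?)", [
    (1, "COG0414", 1e-08), (2853, "COG5077", 7e-09), (3566, "COG5032", 1e-09),
])

# Same join as on the slide, written with an explicit JOIN ... ON.
rows = cur.execute(
    "SELECT g.query, s.id, s.surf_hit "
    "FROM phaeo_genome AS g JOIN cog_surface AS s ON g.phaeo_gene = s.surf_hit "
    "ORDER BY s.id"
).fetchall()
print(rows)
```

Only genes whose COG id also appears among the surface-sample hits survive the join; the other rows drop out.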
Environmental
Sampling
Sequencing
Pfams, TIGRfams,
COGs, FIGfams
Public annotation DBs
Phylogeny
metadata
SQL
correlate diversity
w/environment?
search hits
SQLShare
taxonomic info
correlate diversity
w/nutrients?
find new taxa and their distributions?
find new genes?
compare meta*omes?
“That took me a week with Excel”
“I can do science again”
SQLShare Motivation
Conventional wisdom says “Scientists won’t write SQL”
We don’t believe it
Instead, we implicate difficulty in
  installation
  configuration
  schema design
  performance tuning
  data ingest
  over-reliance on GUIs
We ask “What kind of technology would make SQL a natural fit for hypothesis testing?”
SQLShare Features
Collaborative SQL authoring and sharing
Views for incremental abstraction and integration
Semi-automatic integration
  Identify “natural” unions and joins
SQL Autocomplete
  User starts typing, system uses query logs to make suggestions [Khoussainova 10]
English Query
  Bootstrap a SQL query from an English question
Simple Visualization
  via integration with Google Fusion Tables
Outline




Challenges
Dataspaces
Dataspace Support Platforms
Next Steps
Next Steps

Define scope


Define HIV Dataspace team
Build a minimal technical team





Identify and catalog dataspace “participants” (i.e., sources)
Review data access rights and security requirements
Gather “spanning basis” of questions to answer


“Data Wrangler”
“Application Wrangler”
Jim Gray’s “20 questions” methodology
Gather “spanning basis” of existing data


use exemplars if necessary
load data “as is” into a database
Next Steps (2)

Answer initial questions (Data wrangler)



RDBMS example: create views
Visualize initial answers (Application wrangler)
Demonstrate early progress


Check breadth (what’s missing?)
Check depth (Did “hard” questions get answered?)
Summary
Conventional “schema-first” approaches break down in research contexts
The dataspace abstraction and DSSPs offer a way forward
Systems and best practices are emerging in the literature and from production deployments
BACKUP SLIDES
Feature: Sharing SQL
Feature: SQL Autocomplete
User requests suggestions on the fly while typing a query
Recommends snippets:
  predicates in the WHERE clause
  tables in the FROM clause
  attributes in the SELECT clause
Recommendations are context-aware
Leverages past queries by user and collaborators
Src: Nodira Khoussainova
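A toy illustration of log-driven suggestions for the FROM clause follows. It is not the method from [Khoussainova 10]: a plain frequency count stands in for it, context-awareness is ignored, and the function `suggest_tables`, the regular expression, and the sample query log are all assumptions made for this sketch.

```python
import re
from collections import Counter

def suggest_tables(query_log, prefix, k=3):
    """Rank table names from past FROM clauses that start with `prefix`."""
    counts = Counter()
    for q in query_log:
        # Capture the text between FROM and WHERE (or end of query).
        m = re.search(r"\bfrom\s+(.+?)(?:\bwhere\b|$)", q, flags=re.I | re.S)
        if not m:
            continue
        for name in m.group(1).split(","):
            name = name.strip()
            if name:
                counts[name.lower()] += 1
    matches = [(n, c) for n, c in counts.items() if n.startswith(prefix.lower())]
    # Most frequent first; ties broken alphabetically.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [n for n, _ in matches[:k]]

# Hypothetical query log from a user and collaborators.
log = [
    "select * from cog_surface where e_value < 1e-5",
    "select hit from cog_surface, phaeo_genome where phaeo_gene = surf_hit",
    "select * from cog_deep",
]
print(suggest_tables(log, "cog"))
```

A real system would also need to parse joins, aliases, and subqueries, and would condition suggestions on the partial query typed so far.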
Feature: English Query
Lots of research on Natural Language Interfaces to Databases
  cf. [Etzioni 2008, Zettlemoyer 2009]
Very hard problem, in general
Significant simplification: user can inspect and “fix” the generated SQL prior to execution
Feature: Simple Visualization
For each phaeo gene, count the number of matches in the COGAnnotation_surface
dataset, joining on COG id. Return the top 10 most commonly found genes.
Implementation: Export to Google Fusion Tables
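The English question on this slide might translate to SQL along the following lines. This is a sketch only: the shortened table names, the columns `query`, `phaeo_gene`, and `surf_hit`, and the sample rows are assumptions, and the export to Google Fusion Tables is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature versions of the two datasets.
conn.executescript("""
CREATE TABLE phaeo_genome (query TEXT, phaeo_gene TEXT);
CREATE TABLE cog_surface (id INTEGER, surf_hit TEXT);
INSERT INTO phaeo_genome VALUES
    ('gene_a', 'COG5099'), ('gene_b', 'COG5032'), ('gene_c', 'COG9999');
INSERT INTO cog_surface VALUES
    (1, 'COG5099'), (2, 'COG5099'), (3, 'COG5032');
""")

# For each gene, count matching surface hits (join on COG id); keep the top 10.
top = conn.execute("""
    SELECT g.query, COUNT(*) AS n_matches
    FROM phaeo_genome AS g
    JOIN cog_surface AS s ON g.phaeo_gene = s.surf_hit
    GROUP BY g.query
    ORDER BY n_matches DESC, g.query
    LIMIT 10
""").fetchall()
print(top)
```

Genes with no match in the surface dataset do not appear at all, because an inner join is used; a LEFT JOIN would be needed to report them with a count of zero.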
Dataspaces: Summary
A “Dataspace Support Platform” should
  use a “lowest common denominator” data model
  not rely crucially on upfront global consensus
  not rely crucially on “perfect” metadata
  embrace exceptions, but exploit patterns
  support task-specific, “top down” integration
    …but seek and exploit cross-cutting patterns where possible
  deliver incremental return for incremental investment
    …in data quality enhancement
    …in metadata normalization
    …in usage standardization
    …in application “convergence”
Timeline
[Chart; axes: value for users vs. time, scope, effort]
Labels: Insular Data Sources; Federated Databases; Data Integration Tools; Dataspaces; Dataspace support platforms; RDF/OWL Ontologies; Semantic Web
Example: Metagenomics
Study microbial populations sampled from the environment instead of individual organisms
(metagenomics, metatranscriptomics, metaproteomics)
1. Who is there?
Which organisms make up the population?
2. What are they doing?
Which metabolic pathways are present and active?
(and who is doing what?)
3. Compare datasets
- across a transect (nearshore vs. deep ocean)
- before/after some event (e.g., Spring freshet)
- across salinity/temperature gradients
- diurnal cycles (day/night)
Source: Robin Kodner, Armbrust Lab
Complex Example
[Slide overlays several tables and sources; only the legible content is reproduced.]
Sources shown: ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome, coastal sample, Browser Cross-Reference, COG database, SwissProt web service, TIGRFAM to GO Mapping

ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
(same columns and rows as in the Simple Example)

coastal sample
id | query | hit | e_value | query_start | query_end | hit_start | hit_end | hit_length
6409 | FHJ7DRN01BYA61.1 | TIGR00149 | 2.20E-21 | 1 | 84 | 43 | 125 | 134
6410 | FHJ7DRN01BDTEA.1 | TIGR00149 | 3.40E-09 | 3 | 42 | 30 | 69 | 134
6411 | FHJ7DRN02HEUGQ.1 | TIGR00149 | 1.70E-05 | 4 | 46 | 1 | 46 | 134
6412 | FHJ7DRN01CA4BO.1 | TIGR00149 | 5.30E-05 | 4 | 45 | 1 | 45 | 134
6413 | FHJ7DRN01DM2FK.3 | TIGR01651 | 5.70E-64 | 1 | 76 | 511 | 586 | 606
6414 | FHJ7DRN01B8BPS.1 | TIGR01651 | 1.20E-36 | 1 | 52 | 500 | 551 | 606
6415 | FHJ7DRN02JM54P.1 | TIGR01651 | 2.20E-24 | 15 | 80 | 301 | 366 | 606
6416 | FHJ7DRN02FK6C5.2 | TIGR00039 | 2.70E-16 | 1 | 45 | 37 | 85 | 153
6417 | FHJ7DRN01D019A.1 | TIGR00039 | 8.90E-12 | 5 | 65 | 48 | 118 | 153
6418 | FHJ7DRN02FYAFO.1 | TIGR00039 | 1.60E-11 | 1 | 76 | 67 | 153 | 153
…

TIGRFAM to GO Mapping
TIGR01650 | GO:0051116 | contributes_to
TIGR01651 | GO:0009236 | NULL
TIGR01651 | GO:0051116 | NULL
TIGR01660 | GO:0008940 | NULL
TIGR01660 | GO:0009061 | NULL
TIGR01660 | GO:0009325 | NULL
TIGR01663 | GO:0000012 | NULL
TIGR01663 | GO:0046403 | NULL

COG database
[H] COG4547 Cobalamin biosynthesis protein CobT (nicotinate-mononucleotide:5,6-dimethylbenzimidazole phosphoribosyltransferase)
[J] COG5099 RNA-binding protein of the Puf family, translational repressor
Sce: YGL014w YGL178w YJR091c YLL013c YPR042c
Spo: SPAC1687.22c SPAC4G8.03c SPAC4G9.05 SPAC6G9.14 SPBC56F2.08c SPBP35G2.14 SPCC1682
Ecu: ECU11g1730
…
Background: Relational Databases
Pre-relational brittleness: if your data changed, your
application often broke.
Early RDBMS were buggy and slow (and often reviled),
but required only 5% of the application code.
“Activities of users at terminals and most application programs
should remain unaffected when the internal representation of data
is changed and even when some aspects of the external
representation are changed.” -- E.F. Codd 1979
Key Ideas: Programs that manipulate tabular data exhibit an
algebraic structure allowing reasoning and manipulation
independently of physical data representation
Relational Databases: Summary
A General Data Model: “just tables”
Logical and Physical Data Independence
Declarative Query Language
  via the “Relational Algebra”
Good Scalability
  “SQL is the most successful parallel language in the world”
Results
  $15B industry
  Nearly every (non-search engine) website you visit is backed by an RDBMS
  One of the all-time best examples of CS research impact
So what went wrong?

DBAs!


“Schema design” became paramount
“Applications write queries, not users”



Applications became tightly coupled to schema
Ad hoc queries, ad hoc views, ad hoc data
confounded predictable performance, centralized
management, and strong global guarantees
Result: Other tools enlisted to fill the gap

Java, etc.; XML, RDF, etc.; Web Services
Key Idea: Data Independence
views
  SELECT seq
  FROM all_sequences
  WHERE seq = 'GATTACGATATTA';
logical data independence
relations
  SELECT dna
  FROM ncbi_sequences
  WHERE dna = 'GATTACGATATTA';
physical data independence
files and pointers
  FILE *f = fopen("table_file", "rb");
  fseek(f, 10030440, SEEK_SET);
  while (1) {
      fread(buf, 1, 8192, f);
      if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
          . . .
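Logical data independence can be sketched with a view in SQLite: the query against `all_sequences` stays the same while the relations underneath it change. The table and column names `all_sequences`, `ncbi_sequences`, `seq`, and `dna` come from the slide; the second source `local_sequences` and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ncbi_sequences (dna TEXT);
INSERT INTO ncbi_sequences VALUES ('GATTACGATATTA'), ('ACGT');
CREATE VIEW all_sequences AS SELECT dna AS seq FROM ncbi_sequences;
""")

# The application only ever issues this query against the view.
query = "SELECT seq FROM all_sequences WHERE seq = 'GATTACGATATTA'"
before = conn.execute(query).fetchall()

# Change the underlying relations: add a second (hypothetical) source
# and redefine the view. The application query is untouched.
conn.executescript("""
CREATE TABLE local_sequences (bases TEXT);
INSERT INTO local_sequences VALUES ('GATTACGATATTA');
DROP VIEW all_sequences;
CREATE VIEW all_sequences AS
    SELECT dna AS seq FROM ncbi_sequences
    UNION ALL
    SELECT bases AS seq FROM local_sequences;
""")
after = conn.execute(query).fetchall()
print(before, after)
```

Physical data independence is the analogous property one level down: the engine may change file layout or indexes without the SQL against `ncbi_sequences` changing.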
Key Idea: An Algebra of Tables
select
project
join
join
Other operators: aggregate, union, difference, cross product
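The three operators named on this slide can be sketched over lists of dictionaries. This is a teaching toy under stated assumptions, not how a database engine implements them; the function signatures and the sample rows are invented for illustration.

```python
def select(rows, pred):
    """Keep the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    """Keep only the named attributes of each row."""
    return [{a: r[a] for a in attrs} for r in rows]

def join(left, right, left_key, right_key):
    """Pair up rows whose key values are equal (nested-loop equijoin)."""
    return [{**a, **b} for a in left for b in right if a[left_key] == b[right_key]]

# Hypothetical sample tables.
genes = [{"gene": "g1", "cog": "COG5099"}, {"gene": "g2", "cog": "COG5032"}]
hits = [{"hit": "COG5099", "e_value": 4e-09}, {"hit": "COG0414", "e_value": 1e-08}]

# Operators compose: every input and output is again a table.
result = project(
    select(join(genes, hits, "cog", "hit"), lambda r: r["e_value"] < 1e-5),
    ["gene", "e_value"],
)
print(result)
```

Because each operator maps tables to tables, expressions can be rearranged freely, which is what makes algebraic optimization possible.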
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
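The rewrite can be checked mechanically. The sketch below encodes the original and the rewritten expression as two functions (the names `original` and `optimized` are assumptions) and confirms they agree on a range of integers; the comments trace the rule applications in the order given on the slide.

```python
def original(z):
    # Five operations: two multiplications, two additions, one division.
    return ((z * 2) + ((z * 3) + 0)) / 1

def optimized(z):
    # Rule 1: ((z*2) + (z*3)) / 1
    # Rule 3: (z * (2+3)) / 1
    # Rule 4: ((2+3) * z) / 1
    # Rule 2: (2+3) * z
    return (2 + 3) * z

# Spot-check equivalence over a range of integer inputs.
assert all(original(z) == optimized(z) for z in range(-100, 101))
print(original(7), optimized(7))
```

A query optimizer does the same kind of thing with relational operators, for example by applying a selection before a join when the laws allow it.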
My Interests
Computer Science
Databases
Data-Intensive Scalable Computing
Scientific Data
Management
Research Data Integration
Cloud Computing
Visual Data Analytics
Research Cycle
Observe
Synthesis
Publish/Share
Experiment
Analyze
Data Management
[Diagram; labels only]
complexity-hiding interfaces
Storage; Access Methods; Query Languages; Web Services
Cloud Computing; Information Integration; Information Extraction; Visualization; Workflow; Data Mining; Distributed Programming Models