“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
11/6/2015
Bill Howe, UW
Science is about querying databases
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
– Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
– Oceanography: high-resolution models, cheap sensors, satellites
– Biology: lab automation, high-throughput sequencing, imaging
src: Lincoln Stein
Problem
How much time do you spend “handling data” as opposed to “doing science”?
Mode answer: “90%”
Data management in the life sciences
90% of all business data is maintained in spreadsheets
– Enrique Godreau, Voyager Capital
2010 Pilot - Outreach and Education-based sampling: Schooner Adventuress
Robin Kodner
[Figure: numbered sampling stations]
5/18/10
Bill Howe, UW
Garret Cole, eScience Institute
[Diagram labels: metadata; sequence data; search results]
SQL
plankton tows
taxa counts
whole water
DNA/RNA sequencing
nutrient analysis + toxin analysis
Solving the Petascale Challenge:
What is the rate-limiting step in data understanding?
(1 PB = 1,000,000,000,000,000 B = 10^15 bytes = one quadrillion bytes)
[Figure: curves plotted against time: amount of data in the world; processing power (Moore’s Law); effective processing power (Amdahl’s Law); human cognitive capacity]
Idea adapted from “Less is More” by Bill Buxton (2001)
src: Cecilia Aragon
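The slide names Amdahl’s Law without stating it. As a rough illustration of why effective processing power trails raw processing power, the sketch below uses the standard form of the law, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of a workload and n is the number of workers. The function name and the 95% figure are assumptions chosen for illustration; they do not come from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law for a fixed-size workload."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical workload in which 95% of the work can run in parallel.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} workers -> speedup {amdahl_speedup(0.95, n):.2f}x")
```

However many workers are added, the speedup for this hypothetical workload stays below 1 / 0.05 = 20, because the serial 5% never shrinks.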
A wealth of information
creates a poverty of attention.
-- Herbert Simon, 1978
NatureMapping Program
Wildlife Observations (1902- )
Karen Dvornich
Data collection and submission options:
1. Download/upload spreadsheet
2. Online data entry
3. NatureTracker on handheld/GPS
4. Android ODK (Open Data Kit)
Water Quality Monitoring Sites (2003 - )
Karen Dvornich
Andrew White, UW Chemistry
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
-- Andrew D White
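The talk does not show the script or the query behind this quote, so the following is only a sketch of the kind of one-line SQL summary that can stand in for a longer file-walking script. The table name, column names, and rows are all made up for illustration, and Python's built-in sqlite3 module serves as a stand-in database.

```python
import sqlite3

# In-memory database with a hypothetical table of per-residue surface measurements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE surface_residues (protein TEXT, residue TEXT, area REAL)")
conn.executemany(
    "INSERT INTO surface_residues VALUES (?, ?, ?)",
    [
        ("protein_A", "LYS", 120.5),
        ("protein_A", "GLU", 98.0),
        ("protein_B", "LYS", 110.0),
        ("protein_B", "LYS", 130.0),
    ],
)

# The "1 line of SQL": count and average exposed area per residue type.
query = "SELECT residue, COUNT(*), AVG(area) FROM surface_residues GROUP BY residue ORDER BY residue"
rows = conn.execute(query).fetchall()
for residue, n, mean_area in rows:
    print(residue, n, mean_area)
```

The grouping, counting, and averaging that a hand-written script would spread over loops and dictionaries is expressed declaratively in the single SELECT statement.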
Isabelle Phan, Core Director, Seattle Biomed
“Just before the holidays, my team obtained a small fund from the institute to give our labs 100hrs of free tutorials on bioinfo tools and techniques, including "Data transfer from spreadsheets into an on-line relational database for query" (sic). This is what folks have specifically requested.”
SkyScraper: Scalable Image Registration and Query in the Cloud with MapReduce
Andy Connolly
[Diagram: MapReduce dataflow with nodes labeled M1 to M4 and R1, R2]
Horizon: Where the Ocean meets the Cloud
• Need interactive “climatologies”: Decade-scale averages under different assumptions
• Must manipulate 40 terabytes the same way you manipulate 40 megabytes: efficiently, interactively, visually
• Client + Cloud: VisTrails, GridFields, 400-node Hadoop Cluster (NSF CluE program)
Bill Howe
Claudio Silva
Juliana Freire
http://clue.cs.washington.edu/
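To make the term “climatology” concrete, here is a minimal sketch that averages observations by calendar month across several years. The function name and sample values are hypothetical, and the code runs in memory on a handful of points; it illustrates the averaging only, not the Hadoop-scale system described on the slide.

```python
from collections import defaultdict
from datetime import date


def monthly_climatology(observations):
    """Average values by calendar month across all years in the input."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, value in observations:
        totals[day.month] += value
        counts[day.month] += 1
    return {month: totals[month] / counts[month] for month in sorted(totals)}


# Made-up sample: (observation date, measured value) pairs spanning two years.
observations = [
    (date(2000, 1, 15), 8.0),
    (date(2001, 1, 15), 9.0),
    (date(2000, 7, 15), 15.0),
    (date(2001, 7, 15), 17.0),
]
climatology = monthly_climatology(observations)
print(climatology)
```

Recomputing the averages “under different assumptions” would amount to filtering or reweighting the observations before calling the function, which is where interactive response times become hard at terabyte scale.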
Deployment on R/V Barnes
3/12/09
Bill Howe, eScience Institute
Ship-to-Ship and Ship-to-Shore Telemetry
Wecoma
Forerunner
Barnes
SWAP Network; collaboration of:
- OSU
- OHSU
- UNOLS
Event Detection: Red Water
Myrionecta rubra