“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera

11/6/2015 Bill Howe, UW

Science is about querying databases
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
– Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
– Oceanography: high-resolution models, cheap sensors, satellites
– Biology: lab automation, high-throughput sequencing, imaging

[Figure; src: Lincoln Stein]

Problem
How much time do you spend “handling data” as opposed to “doing science”?
Mode answer: “90%”

Data management in the life sciences
90% of all business data is maintained in spreadsheets
-- Enrique Godreau, Voyager Capital

2010 Pilot: Outreach and Education-based sampling: Schooner Adventuress
Robin Kodner
[Figure with numbered labels 4 through 16; image not preserved]

[Image slides; footer: 5/18/10 Bill Howe, UW; Garret Cole, eScience Institute]

[Figure labels: metadata, sequence data, search results]

SQL
[Figure labels: plankton tows, taxa counts, whole water, DNA/RNA sequencing, nutrient analysis + toxin analysis]

Solving the Petascale Challenge: What is the rate-limiting step in data understanding?
(1 PB = 1,000,000,000,000,000 B = 10^15 bytes = one quadrillion bytes)
[Plot, built up over several slides. Horizontal axis: Time. Curves:]
– Amount of data in the world
– Processing power: Moore’s Law
– Effective Processing Power: Amdahl’s Law
– Human cognitive capacity
Idea adapted from “Less is More” by Bill Buxton (2001)
src: Cecilia Aragon

A wealth of information creates a poverty of attention.
-- Herbert Simon, 1978

NatureMapping Program
Karen Dvornich
Wildlife Observations (1902- )
Water Quality Monitoring Sites (2003 - )
Data collection and submission options:
1. Download/upload spreadsheet
2. Online data entry
3. NatureTracker on handheld/GPS
4. Android ODK (Open Data Kit)

[Image slide; Karen Dvornich]

Andrew White, UW Chemistry
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
-- Andrew D White

Isabelle Phan, Core Director, Seattle Biomed
“Just before the holidays, my team obtained a small fund from the institute to give our labs 100hrs of free tutorials on bioinfo tools and techniques, including "Data transfer from spreadsheets into an on-line relational database for query" (sic). This is what folks have specifically requested.”

SkyScraper: Scalable Image Registration and Query in the Cloud with MapReduce
Andy Connolly
[Diagram with nodes labeled M1 through M4 and R1, R2; image not preserved]

Horizon: Where the Ocean meets the Cloud
• Need interactive “climatologies”: Decade-scale averages under different assumptions
• Must manipulate 40 terabytes the same way you manipulate 40 megabytes: efficiently, interactively, visually
• Client + Cloud: VisTrails, GridFields, 400-node Hadoop Cluster (NSF CluE program)
Bill Howe, Claudio Silva, Juliana Freire
http://clue.cs.washington.edu/

Deployment on R/V Barnes
3/12/09 Bill Howe, eScience Institute

Ship-to-Ship and Ship-to-Shore Telemetry
Vessels: Wecoma, Forerunner, Barnes
SWAP Network; collaboration of:
- OSU
- OHSU
- UNOLS

Event Detection: Red Water
Myrionecta rubra