Transcript: VLDB 2013 Keynote
Sam Madden
With a cast of many….
BIG Data
MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY
Example: Medical Costs
MGH Cancer Center “Super-Database”
• Largest cancer database in the world (173,301 patients)
• Based on national tumor registry
• Cross-linked with death registry
• Includes billing, reports, labs, imagery, genome SNPs
Question: What are the factors driving costs for lung cancer patients?
Some results:
No correlation of cost with
• Stage of presentation
• Survival
Strong correlation of cost with oncologist!
- Dr. James Michaelson, PhD, MGH, Harvard Medical School
Challenge: Making Data Accessible
What does the data look like?
How do I correlate it with other data sets?
How do I present it to users/execs?
Where are these anomalies and outliers coming from?
Introducing Datahub
[Figure: Octocat, the GitHub mascot + DB Technology = Datahub]
Data Commons
• Secure, Hosted Data Storage (“Database Service”)
• Easy to Find, Combine, Clean Data Sets
• Selective Sharing and Access Control
• Ability to Browse, Visualize, and Query Data in situ
Lots of other places to find data!
Aka … just a bunch of zip files
Versus open, linked data (Tim Berners-Lee taxonomy):
★ make your stuff available on the Web under an open license
★★ make it available as structured data
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
Datahub: “five-star” integrated, browse-able, & query-able repository of linked data
Datahub Interface
Anant Bhardwaj
“Wrangling” Features
Wrangler: Interactive Visual Specification of Data Transformation Scripts
Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer
Post-Wrangling
More Datahub Interface Versions
Browsing and Visualization
MIT Living Lab
• Goal: allow MIT community to access, selectively share, and use data about itself, using DataHub.
A Dogfood Eating Exercise
MIT Data Hub
Organizational Data
• MIT data: ID card swipes, network packets, expense reports, medical data, payroll, parking garages, buses and cars, course catalogs, registrar, benefits, on-campus events/seminars
• Infrastructure: energy, HVAC, maintenance, etc.
• Academic/Research: publications, presentations, research data…
Public Data
• Relevant Linked Data: local transit / transport data, crime data, nearby restaurants, events, etc.
Personal Data
• Personal Data: location/GPS, calendar, video/pictures, exercise/physio data, application usage, meetings…
What Will Data Hub Enable at MIT?
• Campus “Quantification”
– is going to class correlated with better grades?
– which dining facilities are most popular amongst different groups?
• Transportation planning:
– bus utilization and on demand routing
– parking lot utilization
– carpool finding, etc.
• Health + Medical:
– campus-wide public health, e.g., flu tracking
– observing who is missing class, depressed
– health signals: exercise and eating habits; partners
– outpatient care
• Research:
– expert finding
– data sharing between groups
Challenges: It’s Not All Fuzzy Stuff
We also don’t want our research to be like this guy
Platform Challenges:
• How to efficiently store thousands or millions of databases?
• How to anonymize data, control access, etc.?
• How to keep data private while allowing querying over it? (Monomi)
Data Cleaning and Integration
Interactive Data Presentation
Understanding Why Results are the Way They Are (Scorpion)
How to Leverage Experts in an Organization
Private Data Problem
Confidential data leaks
2012: hackers extracted 6.5 million hashed passwords from the DB of LinkedIn
Threat: passive DB server attacks
[Diagram: User 1, User 2, User 3 → Application → SQL → DB Server (Datahub) holding sensitive content; attackers: hackers, system administrator]
How to protect data confidentiality?
[Diagram: Client sends [request] to DB Server and receives [result]; sensitive content]
Encrypt data
server may not be able to process queries!
Compute on encrypted data!
Without giving server encryption key!
General approach has been proposed several times…
Monomi / CryptDB
Threat 1: passive DB server attacks
[Diagram: User 1, User 2, User 3 → Application → SQL → DB Server holding sensitive content]
• Process SQL queries on encrypted data
• Hide DB from sys. admins., outsource DB to the cloud
• Modest overhead
• No changes to DBMS (e.g., Postgres, MySQL) and no changes to applications
w/ Raluca Popa, Stephen Tu, Hari Balakrishnan, Frans Kaashoek, Nickolai Zeldovich
SQL Queries on Encrypted Data Example
Application: SELECT * FROM emp WHERE salary = 100
Proxy: SELECT * FROM table1 WHERE col3 = x5a8c34
[Table: table1 (emp) with columns col1 (rank), col2 (name), col3 (salary); plaintext salaries 60, 100, 800, 100]

Application: SELECT * FROM emp WHERE salary ≥ 100
Proxy: SELECT * FROM table1 WHERE col3 ≥ x638e54
[Table: table1 (emp) with columns col1 (rank), col2 (name), col3 (salary); plaintext salaries 60, 100, 800, 100; ciphertexts shown: x638e54, x922eb4, x638e54]
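The equality example above can be sketched in a few lines of Python. This is a minimal illustration, not CryptDB's implementation: it assumes a keyed HMAC as a stand-in for deterministic encryption (so tokens can be compared but not decrypted), uses SQLite as the "untrusted" server, and invents the employee names and the helper names `det_token` and `proxy_select_by_salary`. The range query (≥) would need an order-preserving scheme, which is not shown here.

```python
import hashlib
import hmac
import sqlite3

SECRET_KEY = b"proxy-only-key"  # hypothetical key; it never reaches the DB server


def det_token(value: int) -> str:
    """Deterministic token: equal plaintexts always map to the same token."""
    return hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256).hexdigest()


# "Server" side: anonymized table/column names, tokens instead of salaries.
# Names are left in the clear only to keep the sketch short.
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE table1 (col1 INTEGER, col2 TEXT, col3 TEXT)")
plaintext_rows = [(1, "alice", 60), (2, "bob", 100), (3, "carol", 800), (4, "dave", 100)]
server.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(rank, name, det_token(salary)) for rank, name, salary in plaintext_rows],
)


def proxy_select_by_salary(salary: int) -> list:
    """Rewrite `SELECT ... FROM emp WHERE salary = ?` against the token column."""
    cursor = server.execute(
        "SELECT col1, col2 FROM table1 WHERE col3 = ? ORDER BY col1",
        (det_token(salary),),
    )
    return cursor.fetchall()


matches = proxy_select_by_salary(100)
print(matches)
```

The server only ever sees tokens, yet it can still evaluate the equality predicate because the proxy tokenizes the constant with the same key.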
Monomi: Protecting Data in Datahub
• Extensions to CryptDB to efficiently support OLAP queries
• Show how to run all of TPC-H, rather than just 4 of 22 queries
– Key insight: split queries, run as much as possible on untrusted DBMS, compute remainder on trusted client
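The split-execution idea can be sketched as follows. This is a toy under stated assumptions, not Monomi's planner: the filter runs on an untrusted SQLite "server" over deterministic tokens, while the SUM is computed on the trusted client after decryption. The cipher (`encrypt_int` / `decrypt_int`, an HMAC-derived pad XORed with the value) is a hypothetical stand-in and not suitable for real use; the table layout and department names are invented.

```python
import hashlib
import hmac
import secrets
import sqlite3

KEY = secrets.token_bytes(32)  # stays on the trusted client


def det_token(text: str) -> str:
    """Deterministic token so the server can evaluate equality filters."""
    return hmac.new(KEY, text.encode(), hashlib.sha256).hexdigest()


def encrypt_int(value: int) -> bytes:
    """Toy randomized encryption: fresh nonce plus value XOR keyed pad."""
    nonce = secrets.token_bytes(8)
    pad = hmac.new(KEY, nonce, hashlib.sha256).digest()[:8]
    return nonce + bytes(a ^ b for a, b in zip(value.to_bytes(8, "big"), pad))


def decrypt_int(blob: bytes) -> int:
    nonce, body = blob[:8], blob[8:]
    pad = hmac.new(KEY, nonce, hashlib.sha256).digest()[:8]
    return int.from_bytes(bytes(a ^ b for a, b in zip(body, pad)), "big")


# Untrusted server: sees only tokens and ciphertexts.
server = sqlite3.connect(":memory:")
server.execute("CREATE TABLE t (dept_tok TEXT, salary_enc BLOB)")
rows = [("eng", 100), ("eng", 800), ("ops", 60), ("eng", 100)]
server.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(det_token(dept), encrypt_int(salary)) for dept, salary in rows],
)


def total_salary(dept: str) -> int:
    """SELECT SUM(salary) FROM emp WHERE dept = ?, split across server and client."""
    # Server part: the WHERE clause runs on the untrusted DBMS.
    encrypted = server.execute(
        "SELECT salary_enc FROM t WHERE dept_tok = ?", (det_token(dept),)
    ).fetchall()
    # Client part: decrypt and compute the remainder (the aggregate).
    return sum(decrypt_int(blob) for (blob,) in encrypted)


eng_total = total_salary("eng")
print(eng_total)  # 1000
```

Pushing the filter to the server keeps the data shipped to the client small; only the part the server cannot compute over ciphertexts is left for the client.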
Monomi vs Plaintext
TPC-H SF10, Postgres
See Stephen Explain How it Really Works Right after this Talk!
Takeaway: median overhead 1.24x
Many Open Problems
• Understanding performance more broadly
• How to reason about security of non-randomized schemes?
• Auditing, information flow, etc.
Interactive Large-Scale Visualization using a GPU Database
Todd Mostak
The Need for Interactive Analytics
• DataHub needs to support browsing massive data sets
• Browsing is best supported through visualization and ad-hoc analytics, with millisecond response times
WHAT IS MAPD?
MapD is:
• A GPU (Graphics Processing Unit) accelerated SQL column store database
– Scales to any number of Nvidia GPUs
• A real-time map generator
– Uses GPUs to render point maps and heatmaps of query results in milliseconds
• A WMS web-server
– Can serve out of the box as the backend for a web mapping client, allowing for querying and visualization of billions of features
• Fast and cost-effective
– 4 Nvidia commodity GPUs provide over 12 Teraflops of compute power and nearly 1 TB/sec of memory bandwidth
[Figure: 147,201,658 tweets from Oct 1, 2012 to Nov 6, 2012]
[Figure: Relative intensity of “tornado” on Twitter (with point overlay) from February 29, 2012 to March 1, 2012]
MapD: GPU Accelerated SQL Database
• … memory that a cluster of them can store substantial amounts of data
• Not an accelerator, but a full-blown query processor!
• Massive parallelism enables interactive browsing interfaces
– 4x GPUs can provide > 1 TB/sec of bandwidth
– 12 Tflops compute
– Order of magnitude speedups over CPUs, when data is on GPU
• “Shared nothing” arrangement
Demo
Search for “flu” showing outbreak over Southeastern U.S.
ANATOMY OF A QUERY
SQL Query: SELECT sender, text FROM tweets WHERE lat < 0.0 ORDER BY DIST(lon, lat, 31.3, 3.0)
Components: Parser, Optimizer, Executor

Gpu 1
Id   lat     lon     text
0    31.2    -87.1   I lol
4    -17.1   46.3    I sing
8    43.1    -93.7   boston

Gpu 2
Id   lat     lon     text
1    -41.3   16.4    @mit
5    53.1    14.3    haha
9    58.4    2.35    happy

Gpu N
Id   lat     lon     text
3    37.9    -97.8   bieber
7    12.3    11.1    je ne
11   28.4    -81.7   pepsi

Join Row Ids: 4, 1

Hybrid On-Disk/In-Memory Column Store
Id   lat     lon     date    sender    text
0    31.2    -87.1   10-11   bobama    I lol
1    -41.3   16.4    10-14   smadden   @mit

Result
sender    text
bspears   I sing
smadden   @mit
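The flow on this slide can be mimicked on the CPU with a short Python sketch: each partition filters and sorts its own rows, and the sorted streams are merged before the row ids are joined back to the column store. This is only an illustration of the shared-nothing pattern, not MapD code. The meaning of `DIST` is an assumption (planar Euclidean distance to the point lon 31.3, lat 3.0), and `scan_partition`, `run_query`, and the dictionary layout are hypothetical names; the data values are the ones shown on the slide.

```python
import heapq
import math

# Partitions as shown on the slide: (id, lat, lon, text), one list per GPU.
PARTITIONS = {
    "gpu1": [(0, 31.2, -87.1, "I lol"), (4, -17.1, 46.3, "I sing"), (8, 43.1, -93.7, "boston")],
    "gpu2": [(1, -41.3, 16.4, "@mit"), (5, 53.1, 14.3, "haha"), (9, 58.4, 2.35, "happy")],
    "gpuN": [(3, 37.9, -97.8, "bieber"), (7, 12.3, 11.1, "je ne"), (11, 28.4, -81.7, "pepsi")],
}

# Rows of the full column store needed for the final projection (from the slide's result).
COLUMN_STORE = {1: ("smadden", "@mit"), 4: ("bspears", "I sing")}


def dist(lon, lat, ref_lon, ref_lat):
    """Assumed meaning of DIST: planar Euclidean distance to a reference point."""
    return math.hypot(lon - ref_lon, lat - ref_lat)


def scan_partition(rows, ref_lon, ref_lat):
    """Per-GPU work: apply WHERE lat < 0.0, then sort locally by distance."""
    hits = [
        (dist(lon, lat, ref_lon, ref_lat), row_id)
        for row_id, lat, lon, _text in rows
        if lat < 0.0
    ]
    return sorted(hits)


def run_query(ref_lon=31.3, ref_lat=3.0):
    local_results = [scan_partition(rows, ref_lon, ref_lat) for rows in PARTITIONS.values()]
    # Merge the already-sorted per-partition streams into one global order.
    return [row_id for _distance, row_id in heapq.merge(*local_results)]


row_ids = run_query()
result = [COLUMN_STORE[row_id] for row_id in row_ids]  # join row ids to sender/text
print(row_ids, result)
```

Because no partition needs data from another, the per-partition scans could run in parallel; only the small sorted hit lists are combined at the end.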
Next Steps
• Scale out to many nodes, automate layout algorithms
• Add various advanced analytics (e.g., machine learning algorithms)
• Generalize visualization beyond maps
Visual Provenance: Scorpion
Eugene Wu
• Visualization of data is most common form of big data analysis
• Common problem: outliers
• … why outliers exist
Definition of Why
Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.
[Figure: Input Data and Output Visualization, a bar chart over Italy, France, Spain, US with y-axis from 0 to 5; p = predicate]
[Figure: same chart, with US marked as the Outlier Group]
Removing the records that match the predicate makes US no longer an outlier
What are common properties of those records?
{Bill Gates, Steve Ballmer}
p: Company = MSFT
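The definition can be made concrete with a naive search. The sketch below is not Scorpion's algorithm; it is a brute-force baseline that tries every single-attribute equality predicate over the outlier group's rows, removes the matching records, re-runs the AVG, and keeps the predicate that brings the group closest to the other groups. All records and values are hypothetical apart from the two names and the company on the slide, and `explain_outlier` is an invented helper name.

```python
from statistics import mean

# Hypothetical input records; only Gates/Ballmer/MSFT come from the slide.
RECORDS = [
    {"country": "Italy", "company": "Fiat", "name": "I1", "value": 1.5},
    {"country": "Italy", "company": "Eni", "name": "I2", "value": 2.0},
    {"country": "France", "company": "Total", "name": "F1", "value": 2.0},
    {"country": "France", "company": "Renault", "name": "F2", "value": 1.0},
    {"country": "Spain", "company": "Seat", "name": "S1", "value": 1.0},
    {"country": "Spain", "company": "Zara", "name": "S2", "value": 2.0},
    {"country": "US", "company": "MSFT", "name": "Bill Gates", "value": 9.0},
    {"country": "US", "company": "MSFT", "name": "Steve Ballmer", "value": 8.0},
    {"country": "US", "company": "Acme", "name": "U1", "value": 1.5},
    {"country": "US", "company": "Initech", "name": "U2", "value": 2.0},
]


def group_avg(records, country):
    """The visualized aggregate: AVG(value) for one country."""
    return mean(r["value"] for r in records if r["country"] == country)


def explain_outlier(records, outlier, attributes=("company", "name")):
    """Return the (attribute, value) predicate whose removal best normalizes the group."""
    others = {r["country"] for r in records} - {outlier}
    baseline = mean(group_avg(records, c) for c in others)
    candidates = {(a, r[a]) for r in records if r["country"] == outlier for a in attributes}
    best = None
    for attr, val in sorted(candidates):
        kept = [r for r in records if not (r["country"] == outlier and r[attr] == val)]
        if not any(r["country"] == outlier for r in kept):
            continue  # predicate would delete the whole group
        gap = abs(group_avg(kept, outlier) - baseline)  # re-runs the aggregation
        if best is None or gap < best[0]:
            best = (gap, attr, val)
    return best[1], best[2]


predicate = explain_outlier(RECORDS, "US")
print(predicate)  # ('company', 'MSFT')
```

Even this toy re-runs the aggregate once per candidate; with conjunctions of predicates over many attributes the candidate set grows exponentially.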
Why is this hard?
• Exponential search space over records, attributes
• In general, each candidate predicate requires re-running aggregation
[Figure: A B C D E F G]
[Figure builds: successive candidate row subsets, with AVG(rows) = 2.7, 2.9, 2.2, 3.3, 3.1, …]
• Desire for simple, understandable predicates and a general purpose visualization framework
See Eugene Explain How it Really Works this Afternoon!
Next Steps
• A general purpose visualization language for expressing visualizations with provenance support
– References to underlying data set
Conclusion
• Big Data is a cry for help from non-DB people
• Lots of exciting work on scalable systems
• DB community should be doing a much better job of helping users use data
• We risk losing mindshare
• Datahub aims to make data easy to find, visualize, and query, securely and efficiently
• Many fascinating, hard problems! (Monomi, MapD, Scorpion)