Vldb2013-Keynote


Sam Madden


With a cast of many….

BIG Data

MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LABORATORY

Example: Medical Costs

MGH Cancer Center “Super-Database”

– Largest cancer database in the world (173,301 patients)
– Based on national tumor registry
– Cross linked with death registry
– Includes billing, reports, labs, imagery, genome SNPs

Question: What are the factors driving costs for lung cancer patients?

Some results:
– No correlation of cost with:
  – Stage of presentation
  – Survival
– Strong correlation of cost with oncologist!

- Dr. James Michaelson, PhD, MGH, Harvard Medical School


Challenge: Making Data Accessible

What does the data look like?

How do I correlate it with other data sets?

How do I present it to users/execs?

Where are these anomalies and outliers coming from?

Introducing Datahub

Octocat, the GitHub mascot + DB Technology = Datahub

Data Commons

– Secure, Hosted Data Storage (“Database Service”)
– Easy to Find, Combine, Clean Data Sets
– Selective Sharing and Access Control
– Ability to Browse, Visualize, and Query Data in situ

Lots of other places to find data!

Aka … just a bunch of zip files

Versus open, linked data (Tim Berners-Lee taxonomy):

★ make your stuff available on the Web under an open license
★★ make it available as structured data
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context

Datahub: “five-star” integrated, browse-able, & query-able repository of linked data

Datahub Interface

Anant Bhardwaj


“Wrangling” Features

Wrangler: Interactive Visual Specification of Data Transformation Scripts

Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer

Post-Wrangling

More Datahub Interface Versions

Browsing and Visualization

MIT Living Lab

Goal: allow MIT community to access, selectively share, and use data about itself, using DataHub.

A Dogfood Eating Exercise


MIT Data Hub

Organizational Data
– MIT data: ID card swipes, network packets, expense reports, medical data, payroll, parking garages, buses and cars, course catalogs, registrar, benefits, on-campus events/seminars
– Infrastructure: energy, HVAC, maintenance, etc.
– Academic/Research: publications, presentations, research data…

Public Data
– Relevant Linked Data: local transit / transport data, crime data, nearby restaurants, events etc.

Personal Data
– Personal Data: location/GPS, calendar, video/pictures, exercise/physio data, application usage, meetings…

What Will Data Hub Enable at MIT?

• Campus “Quantification”
  – is going to class correlated with better grades?
  – which dining facilities are most popular amongst different groups?
• Transportation planning:
  – bus utilization and on demand routing
  – parking lot utilization
  – carpool finding, etc.
• Health + Medical:
  – campus wide public health, e.g., flu tracking
  – observing who is missing class, depressed
  – health signals: exercise and eating habits; partners
  – outpatient care
• Research:
  – expert finding
  – data sharing between groups

Challenges: It’s Not All Fuzzy Stuff

We also don’t want our research to be like this guy

Platform Challenges:
– How to efficiently store thousands or millions of databases?
– How to anonymize data, control access, etc.?
– How to keep data private while allowing querying over it? (Monomi)

Data Cleaning and Integration

Interactive Data Presentation

Understanding Why Results are the Way They Are (Scorpion)

How to Leverage Experts in an Organization

Private Data Problem

– Confidential data leaks
– 2012: hackers extracted 6.5 million hashed passwords from the DB of LinkedIn

[Figure: User 1, User 2, User 3 use an Application that sends SQL to the DB Server (Datahub) holding sensitive content; Hackers and the System administrator are shown at the DB Server]

Threat: passive DB server attacks

How to protect data confidentiality?

[Figure: Client sends [request] to the DB Server and receives [result]; sensitive content is held on the DB Server]

– Encrypt data
  – server may not be able to process queries!
– Compute on encrypted data!
  – Without giving server encryption key!

General approach has been proposed several times…

Monomi / CryptDB

Threat 1: passive DB server attacks

[Figure: User 1, User 2, User 3 use an Application that sends SQL to the DB Server holding sensitive content]

1. Process SQL queries on encrypted data
   – Hide DB from sys. admins., outsource DB to the cloud
2. Modest overhead
3. No changes to DBMS (e.g., Postgres, MySQL) and no changes to applications

w/ Raluca Popa, Stephen Tu, Hari Balakrishnan, Frans Kaashoek, Nickolai Zeldovich

SQL Queries on Encrypted Data Example

Application: SELECT * FROM emp WHERE salary = 100

Proxy: SELECT * FROM table1 WHERE col3 = x638e54

table1 (emp), columns: col1/rank, col2/name, col3/salary

col3/salary plaintext values: 60, 100, 800, 100

col3 encrypted values shown on the slide: x638e54, x922eb4, x638e54
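The rewriting step in this example can be illustrated with a small Python sketch. It is not CryptDB's implementation: a keyed HMAC stands in for a deterministic encryption scheme (it supports equality matching but, unlike a real cipher, cannot be decrypted), and `SECRET_KEY`, `det_token`, `rewrite_equality_query`, and `server_filter` are hypothetical names. The point is only that equal plaintexts map to equal tokens, so the server can evaluate `col3 = <token>` without ever seeing 100.

```python
import hashlib
import hmac

# Hypothetical key held only by the proxy; the DB server never sees it.
SECRET_KEY = b"proxy-only-key"


def det_token(value: int) -> str:
    """Deterministic token: the same plaintext always yields the same token."""
    digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256).hexdigest()
    return "x" + digest[:16]


def rewrite_equality_query(constant: int) -> str:
    """Proxy side: anonymize table/column names and replace the constant."""
    return f"SELECT * FROM table1 WHERE col3 = {det_token(constant)}"


def server_filter(column, token):
    """Server side: plain equality over opaque tokens, no key required."""
    return [row_id for row_id, cell in enumerate(column) if cell == token]


# The salary column from the slide, stored on the server as tokens only.
plaintext_salaries = [60, 100, 800, 100]
server_col3 = [det_token(s) for s in plaintext_salaries]

matches = server_filter(server_col3, det_token(100))
print(rewrite_equality_query(100))
print(matches)  # row positions whose salary token equals the token for 100
```

Determinism is also what leaks information here (the server learns which rows share a salary), which is why reasoning about non-randomized schemes comes up as an open problem.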

Monomi: Protecting Data in Datahub

Extensions to CryptDB to efficiently support OLAP queries

Show how to run all of TPC-H, rather than just 4 of 22 queries

– Key insight: split queries, run as much as possible on untrusted DBMS, compute remainder on trusted client
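A minimal Python sketch of the split-execution idea, under loud assumptions: the XOR pad below is a toy cipher for illustration only (not secure, and not what Monomi uses), and the table, key, and function names are hypothetical. The untrusted server evaluates the part of the query it can handle over tokens (an equality filter on `dept`), and the trusted client decrypts the returned rows and computes the remainder (a range filter and `AVG`).

```python
import hashlib
import hmac

KEY = b"client-only-key"  # hypothetical key, known only to the trusted client


def det_token(value: str) -> str:
    """Deterministic token so the server can test equality on dept."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()


def _pad(row_id: int) -> bytes:
    return hashlib.sha256(KEY + row_id.to_bytes(8, "big")).digest()[:8]


def toy_encrypt(row_id: int, value: int) -> bytes:
    """Toy per-row XOR pad; NOT secure, illustration only."""
    return bytes(a ^ b for a, b in zip(value.to_bytes(8, "big"), _pad(row_id)))


def toy_decrypt(row_id: int, blob: bytes) -> int:
    return int.from_bytes(bytes(a ^ b for a, b in zip(blob, _pad(row_id))), "big")


# Hypothetical plaintext rows (row_id, dept, salary), encrypted before upload.
plain_rows = [(0, "sales", 60), (1, "sales", 100), (2, "eng", 800), (3, "sales", 100)]
server_table = [(rid, det_token(dept), toy_encrypt(rid, sal)) for rid, dept, sal in plain_rows]


def server_part(table, dept_token):
    """Untrusted DBMS: equality filter over tokens, returns encrypted salaries."""
    return [(rid, blob) for rid, tok, blob in table if tok == dept_token]


def client_part(encrypted_rows, min_salary):
    """Trusted client: decrypt, apply the remaining predicate, aggregate."""
    salaries = [toy_decrypt(rid, blob) for rid, blob in encrypted_rows]
    kept = [s for s in salaries if s > min_salary]
    return sum(kept) / len(kept) if kept else None


# SELECT AVG(salary) FROM emp WHERE dept = 'sales' AND salary > 90
shipped = server_part(server_table, det_token("sales"))
result = client_part(shipped, 90)
print(len(shipped), result)
```

The design trade-off is the size of `shipped`: the more of the query the server can evaluate, the fewer encrypted rows have to travel to the client.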

Monomi vs Plaintext

TPC-H SF10, Postgres

See Stephen Explain How it Really Works Right after this Talk!

Takeaway: median overhead 1.24x

Many Open Problems

– Understanding performance more broadly
– How to reason about security of non-randomized schemes?
– Auditing, information flow, etc.


Interactive Large-Scale Visualization using a GPU Database

Todd Mostak

The Need for Interactive Analytics

• DataHub needs to support browsing massive data sets
• Browsing is best supported through visualization
  – ad-hoc analytics, with millisecond response times

WHAT IS MAPD?

MapD is:
– A GPU (Graphics Processing Unit) accelerated SQL column store database
  – Scales to any number of Nvidia GPUs
– A real-time map generator
  – Uses GPUs to render point and heatmaps of query results in milliseconds
– A WMS web-server
  – Can serve out of the box as the backend for a web mapping client, allowing for querying and visualization of billions of features
– Fast and cost-effective
  – 4 Nvidia commodity GPUs provide over 12 Teraflops of compute power and nearly 1 TB/sec of memory bandwidth

[Figure: 147,201,658 tweets from Oct 1, 2012 to Nov 6, 2012]

[Figure: Relative intensity of “tornado” on Twitter (with point overlay) from February 29, 2012 to March 1, 2012]

MapD: GPU Accelerated SQL Database

• … memory that a cluster of them can store substantial amounts of data
• Not an accelerator, but a full blown query processor!
• Massive parallelism enables interactive browsing interfaces
  – 4x GPUs can provide > 1 TB/sec of bandwidth
  – 12 Tflops compute
  – Order of magnitude speedups over CPUs, when data is on GPU
• “Shared nothing” arrangement

Demo

Search for “flu” showing outbreak over Southeastern U.S.

ANATOMY OF A QUERY

SQL Query: SELECT sender, text FROM tweets WHERE lat < 0.0 ORDER BY DIST(lon, lat, 31.3, 3.0)

Parser, Optimizer, Executor

GPU 1
Id   lat    lon    text
0    31.2   -87.1  I lol
4    -17.1  46.3   I sing
8    43.1   -93.7  boston

GPU 2
Id   lat    lon    text
1    -41.3  16.4   @mit
5    53.1   14.3   haha
9    58.4   2.35   happy

GPU N
Id   lat    lon    text
3    37.9   -97.8  bieber
7    12.3   11.1   je ne
11   28.4   -81.7  pepsi

Join Row Ids: 4, 1

Hybrid On-Disk/In-Memory Column Store
Id   lat    lon    date   sender   text
0    31.2   -87.1  10-11  bobama   I lol
1    -41.3  16.4   10-14  smadden  @mit

Result
sender   text
bspears  I sing
smadden  @mit
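The flow on this slide can be traced in plain Python. This is a CPU-only sketch, not MapD code: each "GPU" is just a list, `DIST` is assumed to be Euclidean distance with `(31.3, 3.0)` read as a reference `(lon, lat)` (the slide does not define it), and the column store holds only the rows whose sender is visible on the slide. Each partition filters its own rows, the matching row ids are merged and ordered, and the ids are then joined back to the column store for the projected columns.

```python
import math

# Per-GPU partitions as (id, lat, lon, text), values taken from the slide.
gpu_partitions = [
    [(0, 31.2, -87.1, "I lol"), (4, -17.1, 46.3, "I sing"), (8, 43.1, -93.7, "boston")],
    [(1, -41.3, 16.4, "@mit"), (5, 53.1, 14.3, "haha"), (9, 58.4, 2.35, "happy")],
    [(3, 37.9, -97.8, "bieber"), (7, 12.3, 11.1, "je ne"), (11, 28.4, -81.7, "pepsi")],
]

# Column store lookup (id -> sender, text); only rows whose sender is shown.
column_store = {0: ("bobama", "I lol"), 1: ("smadden", "@mit"), 4: ("bspears", "I sing")}


def dist(lon, lat, ref_lon, ref_lat):
    """Assumed meaning of DIST: Euclidean distance to a reference point."""
    return math.hypot(lon - ref_lon, lat - ref_lat)


def scan_partition(partition):
    """Work done independently per partition: filter, compute the sort key."""
    return [(rid, dist(lon, lat, 31.3, 3.0)) for rid, lat, lon, _ in partition if lat < 0.0]


def run_query():
    hits = [hit for part in gpu_partitions for hit in scan_partition(part)]
    hits.sort(key=lambda hit: hit[1])          # ORDER BY DIST(...)
    row_ids = [rid for rid, _ in hits]         # the "Join Row Ids" step
    return row_ids, [column_store[rid] for rid in row_ids]


join_row_ids, result = run_query()
print(join_row_ids)
print(result)
```

Because no partition needs another partition's rows during the scan, this stage fits the “shared nothing” arrangement; only the small list of row ids is merged afterwards.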

Next Steps

Scale out to many nodes, automate layout algorithms

Add various advanced analytics (e.g., machine learning algorithms)

Generalize visualization beyond maps


Visual Provenance: Scorpion

Eugene Wu

• Visualization of data is most common form of big data analysis
• Common problem: outliers
  – why outliers exist

Definition of Why

Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.

i = Input Data

p = predicate

[Figure: Input Data and Output Visualization; bar chart over Italy, France, Spain, US with a y-axis from 0 to 5, with the Outlier Group marked]


Removing the records that match the predicate makes US no longer an outlier.

What are common properties of those records?

{Bill Gates, Steve Ballmer}

p: Company = MSFT
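The definition can be made concrete with a brute-force Python sketch. This is not Scorpion's algorithm: the records, amounts, and function names are hypothetical, the aggregate is assumed to be AVG per country, and "how much of an outlier" is measured simply as the gap between the outlier group's average and the mean of the other groups' averages. Each candidate predicate `attr = value` is scored by removing the matching input records and re-running the aggregation.

```python
from statistics import mean

# Hypothetical input records; only the two names and "MSFT" come from the slide.
records = [
    {"country": "Italy", "name": "A", "company": "Fiat", "amount": 2.0},
    {"country": "France", "name": "B", "company": "Renault", "amount": 2.5},
    {"country": "Spain", "name": "C", "company": "Zara", "amount": 2.0},
    {"country": "US", "name": "Bill Gates", "company": "MSFT", "amount": 9.0},
    {"country": "US", "name": "Steve Ballmer", "company": "MSFT", "amount": 8.0},
    {"country": "US", "name": "D", "company": "Acme", "amount": 2.0},
    {"country": "US", "name": "E", "company": "Initech", "amount": 2.5},
]


def avg_by_country(rows):
    groups = {}
    for row in rows:
        groups.setdefault(row["country"], []).append(row["amount"])
    return {country: mean(values) for country, values in groups.items()}


def outlier_gap(rows, outlier):
    """Distance of the outlier group's AVG from the mean of the other groups."""
    avgs = avg_by_country(rows)
    if outlier not in avgs:
        return float("inf")  # predicate removed the whole group: not useful
    return abs(avgs[outlier] - mean(a for c, a in avgs.items() if c != outlier))


def explain(rows, outlier, attrs):
    """Try every attr = value predicate seen in the outlier group."""
    candidates = sorted({(a, r[a]) for r in rows if r["country"] == outlier for a in attrs})

    def gap_after_removal(pred):
        attr, value = pred
        return outlier_gap([r for r in rows if r[attr] != value], outlier)

    return min(candidates, key=gap_after_removal)


best = explain(records, "US", ["company", "name"])
print(best)
```

With this data, the single-name predicates each remove only one large record, while `company = MSFT` removes both, which is why a shared property of the records is the more useful explanation.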

Why is this hard?

– Exponential search space over records, attributes
– In general, each candidate predicate requires re-running aggregation
– Desire for simple, understandable predicates and a general purpose visualization framework

[Figure: records A, B, C, D, E, F, G; successive builds show AVG(rows) = 2.7, 2.9, 2.2, 3.3, 3.1]

See Eugene Explain How it Really Works this Afternoon!
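A small Python sketch of why the naive search is expensive, using seven records named A to G as on the slide. The values, the target average, and the function names are hypothetical. Every non-empty proper subset of the records is a candidate for removal, and each candidate forces a fresh AVG over the remaining rows, so the work grows as 2^n.

```python
from itertools import combinations

# Hypothetical values for the seven records; only the labels come from the slide.
rows = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 2.0, "F": 5.0, "G": 3.0}


def avg_without(removed):
    """Re-run the aggregation from scratch with some records removed."""
    kept = [value for key, value in rows.items() if key not in removed]
    return sum(kept) / len(kept)


def naive_search(target):
    """Score every non-empty proper subset of records; count the re-aggregations."""
    evaluated = 0
    best_gap, best_removed = None, None
    for size in range(1, len(rows)):
        for removed in combinations(rows, size):
            evaluated += 1
            gap = abs(avg_without(removed) - target)
            if best_gap is None or gap < best_gap:
                best_gap, best_removed = gap, removed
    return evaluated, best_removed


evaluated, best_removed = naive_search(target=2.0)
print(evaluated, best_removed, avg_without(best_removed))
```

For 7 records that is already 2^7 - 2 = 126 re-aggregations; at 30 records it would be over a billion, before even considering combinations of attributes.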

Next Steps

A general purpose visualization language for expressing visualizations with provenance support

– References to underlying data set

Conclusion

– Big Data is a cry for help from non DB people
– Lots of exciting work on scalable systems
– DB community should be doing a much better job of helping users use data
– We risk losing mindshare
– Datahub aims to make data easy to find, visualize, and query, securely and efficiently
– Many fascinating, hard problems! (Monomi, MapD, Scorpion)