Slides - MIT Database Group
Transforming Big Data with D4M
Jeremy Kepner
MIT Lincoln Laboratory
3 October 2012
This work is sponsored by the Department of the Air Force under Air Force Contract
#FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are
those of the authors and are not necessarily endorsed by the United States Government.
D4M-1
Acknowledgements
• Nicholas Arcolano
• David Bestor
• Michelle Beard
• Chansup Byun
• Bob Bond
• Matt Hubbell
• Josh Haines
• Pete Michaleas
• Matthew Schmidt
• Julie Mullen
• Ben Miller
• Andy Prout
• Benjamin O’Gwynn
• Albert Reuther
• Tamara Yu
• Tony Rosa
• Bill Arcand
• Charles Yee
• Bill Bergeron
• Dylan Hutchinson
D4M-2
Outline
• Introduction
• Theory
• Results
• Summary
D4M-3
Example Applications of Graph Analytics
ISR
• Graphs represent entities and relationships detected through multi-INT sources
• 1,000s – 1,000,000s tracks and locations
• GOAL: Identify anomalous patterns of life
Social
• Graphs represent relationships between individuals or documents
• 10,000s – 10,000,000s individuals and interactions
• GOAL: Identify hidden social networks
Cyber
• Graphs represent communication patterns of computers on a network
• 1,000,000s – 1,000,000,000s network events
• GOAL: Detect cyber attacks or malicious software
• Cross-Mission Challenge: Detection of subtle patterns in massive multi-source noisy datasets
D4M-4
Four Ecosystems Dominate Cloud Computing
Enterprise
- Interactive
- On-demand
- Elastic
Big Compute
- High performance
- Parallel Languages
- Scientific computing
Big Data
- Java
- Map/Reduce
- Easy admin
DBMS
- Indexing
- Search
- Security
• Each ecosystem is at the center of a multi-$B market
• Pros/cons of each are numerous; diverging hardware/software
• Some missions can exist wholly in one ecosystem; some can’t
D4M-5
Four Ecosystems Dominate Cloud Computing
Enterprise
- Interactive
- On-demand
- Elastic
Big Compute
- High performance
- Parallel Languages
- Scientific computing
Big Data
- Java
- Map/Reduce
- Easy admin
DBMS
- Indexing
- Search
- Security
Figure label: LLGrid MapReduce
• LLGrid MapReduce provides a map/reduce interface in a big compute environment
• D4M provides an interactive parallel scientific computing environment to databases
D4M-6
Big Data + Big Compute Challenge
Database Worldview
“It’s the data!”
Delivering data is the end
Shared Data
Separate Compute
Supercomputing Worldview
“It’s the computer!”
Delivering data is the start
Shared Compute
Separate Data
• Database and supercomputing views are fundamentally different
• Have never coexisted; do not know how to coexist
• Big Data “Analytics” are forcing them together
• Current standard practice duplicates hardware and data
D4M-7
Big Data + Big Compute Stack
Novel Analytics for: Text, Cyber, Bio
Weak Signatures, Noisy Data, Dynamics
High Level Composable API: D4M (“Databases for Matlab”)
Array Algebra
Distributed Database: Accumulo (triple store)
High Performance Computing: LLGrid + Hadoop
Distributed Database/Distributed File System
Interactive Supercomputing
• Combining Big Compute and Big Data enables entirely new domains
D4M-8
High Level Language: D4M
http://www.mit.edu/~kepner/D4M
D4M: Dynamic Distributed Dimensional Data Model
Figure labels: Distributed Database; Associative Arrays; Numerical Computing Environment
Query: Alice, Bob, Cathy, David, Earl
A D4M query returns a sparse matrix or a graph…
…for statistical signal processing or graph analysis in MATLAB
D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
D4M-9
Outline
• Introduction
• Theory
– Associative Arrays
– Incidence Matrix
• Results
• Summary
D4M-10
What are Spreadsheets and Big Tables?
Big Tables
Spreadsheets
• Spreadsheets are the most commonly used analytical structure on Earth
(100M users/day?)
• Big Tables (Google, Amazon, …) store most of the analyzed data in the world
(Exabytes?)
• Simultaneous diverse data: strings, dates, integers, reals, …
• Simultaneous diverse uses: matrices, functions, hash tables, databases, …
• No formal mathematical basis; Zero papers in AMA or SIAM
D4M-11
D4M Key Concept:
Associative Arrays Unify Four Abstractions
• Extends associative arrays to 2D and mixed data types
A('alice ','bob ') = 'cited '
or
A('alice ','bob ') = 47.0
• Key innovation: 2D is 1-to-1 with triple store
('alice ','bob ','cited ')
or
('alice ','bob ',47.0)
Figure: graph with vertices alice, bob, and carl and edges labeled “cited”, shown alongside the matrix-vector product A^T x
D4M-12
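The 1-to-1 mapping between a 2D associative array and a triple store can be sketched in a few lines. This is a minimal Python illustration, not D4M code: the dict-of-tuples layout, the function names `to_triples` and `from_triples`, and the extra `carl` entry are assumptions made for the example.

```python
# Hypothetical sketch: a 2D associative array stored as {(row, col): value}.
# Values may be strings or numbers, mirroring the mixed types on the slide.

def to_triples(assoc):
    """Flatten the array into (row, col, value) triples ordered by key."""
    triples = [(row, col, val) for (row, col), val in assoc.items()]
    return sorted(triples, key=lambda t: t[:2])  # sort on keys only, never on mixed values


def from_triples(triples):
    """Rebuild the 2D associative array from (row, col, value) triples."""
    return {(row, col): val for row, col, val in triples}


A = {("alice", "bob"): "cited", ("alice", "carl"): 47.0}
triples = to_triples(A)
round_trip = from_triples(triples)
```

Because every entry becomes exactly one triple and every triple becomes exactly one entry, the round trip returns the original array.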
Composable Associative Arrays
• Key innovation: mathematical closure
–
All associative array operations return associative arrays
• Enables composable mathematical operations
A + B
A - B
A & B
A|B
A*B
• Enables composable query operations via array indexing
A('alice bob ',:)
A('alice ',:)
A('al* ',:)
A('alice : bob ',:)
A(1:2,:)
A == 47.0
• Simple to implement in a library (~2000 lines) in programming environments with: 1st class support of 2D arrays, operator overloading, sparse linear algebra
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
D4M-13
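Closure is what makes the operations chain: every operation and every query hands back another associative array. The toy Python class below is a sketch under stated assumptions, not the D4M library. The class name `Assoc` is reused only for readability, the methods `rows` and `equals` are hypothetical, and `min` is just one possible choice for the `&` collision rule.

```python
class Assoc:
    """Toy 2D associative array over {(row, col): number}. Not the D4M API."""

    def __init__(self, entries=None):
        self.data = dict(entries or {})

    def __add__(self, other):
        # Union of keys; values that collide are summed.
        out = dict(self.data)
        for key, val in other.data.items():
            out[key] = out.get(key, 0) + val
        return Assoc(out)

    def __and__(self, other):
        # Intersection of keys; keep the smaller value on a collision (assumed rule).
        return Assoc({k: min(v, other.data[k])
                      for k, v in self.data.items() if k in other.data})

    def rows(self, pattern):
        # Row query: exact key, or prefix match when the pattern ends in '*'.
        prefix = pattern[:-1] if pattern.endswith("*") else None

        def keep(row):
            return row.startswith(prefix) if prefix is not None else row == pattern

        return Assoc({k: v for k, v in self.data.items() if keep(k[0])})

    def equals(self, value):
        # Value query: keep only entries whose value matches.
        return Assoc({k: v for k, v in self.data.items() if v == value})


A = Assoc({("alice", "bob"): 40.0, ("bob", "carl"): 1.0})
B = Assoc({("alice", "bob"): 7.0, ("alfred", "carl"): 47.0})

# Each step returns an Assoc, so algebra and queries compose freely.
C = (A + B).rows("al*").equals(47.0)
D = A & B
```

Here `C` combines an addition, a wildcard row query, and a value filter in one expression, which is the composability the slide describes.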
Associative Array Definitions
• Keys and values are from the infinite strict totally ordered set $\mathbb{S}$
• Associative array $A(k) : \mathbb{S}^d \to \mathbb{S}$, $k = (k_1, \ldots, k_d)$, is a partial function from $d$ keys (typically 2) to 1 value, where
$A(k_i) = v_i$ and $\emptyset$ otherwise
• Binary operations on associative arrays $A_3 = A_1 \oplus A_2$, where $\oplus = \cup_{f()}$ or $\cap_{f()}$, have the properties
– If $A_1(k_i) = v_1$ and $A_2(k_i) = v_2$, then $A_3(k_i)$ is
$v_1 \cup_{f()} v_2 = f(v_1, v_2)$ or $v_1 \cap_{f()} v_2 = f(v_1, v_2)$
– If $A_1(k_i) = v$ or $\emptyset$ and $A_2(k_i) = \emptyset$ or $v$, then $A_3(k_i)$ is
$v \cup_{f()} \emptyset = v$ or $v \cap_{f()} \emptyset = \emptyset$
• High level usage dictated by these definitions
• Deeper algebraic properties set by the collision function f()
• Frequent switching between “algebras” (how spreadsheets are used)
D4M-14
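The two binary operations defined above differ only in what happens when a key is missing from one side; the collision function f() is applied only when both arrays hold a value. A minimal Python sketch, assuming dict-based arrays and hypothetical function names `union` and `intersection`:

```python
import operator


def union(a1, a2, f):
    """Keep every key; apply the collision function f only where both arrays have a value."""
    out = dict(a1)
    for key, v2 in a2.items():
        out[key] = f(out[key], v2) if key in out else v2
    return out


def intersection(a1, a2, f):
    """Keep only keys present in both arrays; combine their values with f."""
    return {key: f(v1, a2[key]) for key, v1 in a1.items() if key in a2}


A1 = {("alice", "bob"): 1, ("alice", "carl"): 2}
A2 = {("alice", "carl"): 5, ("bob", "carl"): 3}

summed = union(A1, A2, operator.add)   # f = addition
largest = intersection(A1, A2, max)    # f = max
```

Swapping `operator.add` for `max`, `min`, or string concatenation changes the algebra without changing the key handling, which is the "frequent switching between algebras" the slide mentions.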
Theory Questions
• Associative arrays can be constructed from a few definitions
• Similar to linear algebra, but applicable to a wider range of data
• Key questions
– Which linear algebra properties do apply to associative arrays (intuitive)
– Which linear algebra properties do not apply to associative arrays (watch out)
– Which associative array properties do not apply to linear algebra (new)
Figure: Venn diagram of Associative Arrays and Linear Algebra, with regions labeled new, intuitive, and watch out
D4M-15
References
• Book: “Graph Algorithms in the Language of Linear Algebra”
• Editors: Kepner (MIT-LL) and Gilbert (UCSB)
• Contributors:
– Bader (Ga Tech)
– Bliss (MIT-LL)
– Bond (MIT-LL)
– Dunlavy (Sandia)
– Faloutsos (CMU)
– Fineman (CMU)
– Gilbert (UCSB)
– Heitsch (Ga Tech)
– Hendrickson (Sandia)
– Kegelmeyer (Sandia)
– Kepner (MIT-LL)
– Kolda (Sandia)
– Leskovec (CMU)
– Madduri (Ga Tech)
– Mohindra (MIT-LL)
– Nguyen (MIT)
– Radar (MIT-LL)
– Reinhardt (Microsoft)
– Robinson (MIT-LL)
– Shah (UCSB)
D4M-16
Outline
• Introduction
• Theory
– Associative Arrays
– Incidence Matrix
• Results
• Summary
D4M-17
Digraphs are Black & White
D4M-18
The World is Color
Artist: Ann Pibal; Painting: “XCRS”
D4M-19
5 Edge Colors
Blue, Silver, Green, Orange, Pink
D4M-20
20 Vertices
Vertices labeled V1 through V20
D4M-21
1 Isolated Standard Edge
P4
D4M-22
12 Multi Edges
D4M-23
18 Hyper Edges
Figure labels: P5, P8, O5, P7, P3, P6
D4M-24
27 Edge Orderings
O5 < P3,P6,P7,P8
O5 < B1,S1,G1,O1,O2,P1
O5 < B2,S2,G2,O3,O4,P2 < P7,P8
Figure labels: P5, P8, O5, P7, P3, P6
D4M-25
52 Standard Multi Edges
P5x2, P8x2, O5x5, P7x2, P3x3, P6x2
D4M-26
Summary Observations
• Standard edge representation fragments hyper edges
– Information is lost
• Digraph representation compresses multi-edges
– Information is lost
• Standard graph representation drops edge order
– Information is lost
• Matrix representation drops edge labels
– Information is lost
• Need edge representation that preserves information
D4M-27
Solution: Incidence Matrix
Columns: Edge, Color, Order, V01 through V20. Each row holds a 1 under every vertex the edge touches.
Edge  Color   Order
B1    Blue    2
S1    Silver  2
G1    Green   2
O1    Orange  2
O2    Orange  2
P1    Pink    2
B2    Blue    2
S2    Silver  2
G2    Green   2
O3    Orange  2
O4    Orange  2
P2    Pink    2
O5    Orange  1
P3    Pink    2
P4    Pink    2
P5    Pink    2
P6    Pink    2
P7    Pink    3
P8    Pink    3
D4M-28
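An incidence matrix keeps one row per edge, so hyper edges, multi edges, edge order, and edge labels all survive. The Python sketch below illustrates this and shows what an adjacency projection throws away. The edge names and colors follow the slides, but the vertex assignments are invented for illustration, and the `adjacency` helper is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

# Toy incidence rows: edge -> (color, order, incident vertices).
# Vertex sets are made up; the real ones are in the slide's matrix.
E = {
    "B1": ("Blue", 2, {"V01", "V02", "V03"}),    # hyper edge: three vertices, one row
    "S1": ("Silver", 2, {"V01", "V02", "V03"}),  # multi edge: same vertices as B1
    "O5": ("Orange", 1, {"V03", "V04"}),         # ordinary edge with a lower order
}


def adjacency(incidence):
    """Project to vertex-pair counts (off-diagonal part of E' * E)."""
    counts = Counter()
    for _color, _order, vertices in incidence.values():
        for u, v in combinations(sorted(vertices), 2):
            counts[(u, v)] += 1
    return counts


adj = adjacency(E)
```

After the projection, the pair (V01, V02) carries only a count of 2: the colors, the ordering, and the fact that both edges were three-vertex hyper edges are no longer recoverable, whereas `E` still holds all of it.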
Outline
• Introduction
• Theory
• Results
– Network monitoring example
– Bioinformatics example
• Summary
D4M-29
Graph Construction Using D4M:
Explode Schema
Raw Data -> CSV Files -> Assoc. Arrays -> Distributed Database
Dense Table
log_id  src_ip       server_ip
001     128.0.0.1    208.29.69.138
002     192.168.1.2  157.166.255.18
003     128.0.0.1    74.125.224.72
• Use log_id as row indices
• Create columns for each unique type/value pair
Exploded Table
            src_ip|128.0.0.1  src_ip|192.168.1.2  server_ip|157.166.255.18  server_ip|208.29.69.138  server_ip|74.125.224.72
log_id|001  1                 0                   0                         1                        0
log_id|002  0                 1                   1                         0                        0
log_id|003  1                 0                   0                         0                        1
D4M-30
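The explode step can be sketched as a small transformation from dense records to sparse type/value columns. This Python version is an illustration only: the `explode` function and the dict layout are assumptions, while the field names and values are the ones on the slide.

```python
def explode(records, key_field):
    """Turn dense rows into sparse {(row_key, 'field|value'): 1} entries."""
    exploded = {}
    for rec in records:
        row = f"{key_field}|{rec[key_field]}"
        for field, value in rec.items():
            if field != key_field:
                # One column per unique type/value pair; only the 1s are stored.
                exploded[(row, f"{field}|{value}")] = 1
    return exploded


dense = [
    {"log_id": "001", "src_ip": "128.0.0.1", "server_ip": "208.29.69.138"},
    {"log_id": "002", "src_ip": "192.168.1.2", "server_ip": "157.166.255.18"},
    {"log_id": "003", "src_ip": "128.0.0.1", "server_ip": "74.125.224.72"},
]

table = explode(dense, "log_id")
columns = sorted({col for _row, col in table})
```

The zeros of the exploded table are never materialized, so the stored size grows with the number of records rather than with the number of distinct values.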
Graph Construction Using D4M:
Storing Exploded Data as Triples
Raw Data -> CSV Files -> Assoc. Arrays -> Distributed Database
Exploded Table
            src_ip|128.0.0.1  src_ip|192.168.1.2  server_ip|157.166.255.18  server_ip|208.29.69.138  server_ip|74.125.224.72
log_id|001  1                 0                   0                         1                        0
log_id|002  0                 1                   1                         0                        0
log_id|003  1                 0                   0                         0                        1
D4M stores the triple data representing both the exploded table and its transpose
Table Triples
Row         Column                    Value
log_id|001  src_ip|128.0.0.1          1
log_id|001  server_ip|208.29.69.138   1
log_id|002  src_ip|192.168.1.2        1
log_id|002  server_ip|157.166.255.18  1
log_id|003  src_ip|128.0.0.1          1
log_id|003  server_ip|74.125.224.72   1
Table Transpose Triples
Row                       Column      Value
server_ip|157.166.255.18  log_id|002  1
server_ip|208.29.69.138   log_id|001  1
server_ip|74.125.224.72   log_id|003  1
src_ip|128.0.0.1          log_id|001  1
src_ip|128.0.0.1          log_id|003  1
src_ip|192.168.1.2        log_id|002  1
D4M-31
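Storing the transpose alongside the table means a lookup by column value becomes a lookup by row key, which is the access pattern a row-sorted store handles well. A minimal Python sketch of that idea, with hypothetical helper names `transpose` and `index_by_row` and plain lists standing in for the database:

```python
from collections import defaultdict

# Table triples from the slide: (row, column, value).
triples = [
    ("log_id|001", "src_ip|128.0.0.1", 1),
    ("log_id|001", "server_ip|208.29.69.138", 1),
    ("log_id|002", "src_ip|192.168.1.2", 1),
    ("log_id|002", "server_ip|157.166.255.18", 1),
    ("log_id|003", "src_ip|128.0.0.1", 1),
    ("log_id|003", "server_ip|74.125.224.72", 1),
]


def transpose(table):
    """Swap row and column keys, then re-sort by the new row key."""
    return sorted((col, row, val) for row, col, val in table)


def index_by_row(table):
    """Group column keys under their row key, like a row-keyed store would."""
    index = defaultdict(list)
    for row, col, _val in sorted(table):
        index[row].append(col)
    return index


triples_t = transpose(triples)
by_value = index_by_row(triples_t)
logs_from_source = by_value["src_ip|128.0.0.1"]
```

Finding every log entry for one source address is now a single row lookup in the transposed table instead of a scan over all rows of the original.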
Graph Construction Using D4M:
Construct Associative Arrays
Raw Data -> CSV Files -> Distributed Database -> Assoc. Arrays
D4M Query #1
% Select the column keys that fall inside the time range
keys = T(:, ['time_stamp|10/May/2011:00:00:00,:,' ...
             'time_stamp|13/May/2011:23:59:59,']);
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
D4M Query #2
% Fetch every column of the rows found by Query #1
data = T(Row(keys), :);
('log_id|001','server_ip|208.29.69.138',1)
('log_id|001','src_ip|128.0.0.1',1)
('log_id|001','time_stamp|11/May/2011:09:52:53',1)
...
('log_id|002','server_ip|157.166.255.18',1)
('log_id|002','src_ip|192.168.1.2',1)
('log_id|002','time_stamp|12/May/2011:13:24:11',1)
...
('log_id|003','server_ip|74.125.224.72',1)
('log_id|003','src_ip|128.0.0.1',1)
('log_id|003','time_stamp|13/May/2011:11:05:12',1)
...
Associative Array Algebra
% Correlate source and server columns through their shared log_id rows
G = data(:,'src_ip|*').' * data(:,'server_ip|*');
('src_ip|128.0.0.1','server_ip|208.29.69.138',1)
('src_ip|128.0.0.1','server_ip|74.125.224.72',1)
('src_ip|192.168.1.2','server_ip|157.166.255.18',1)
...
Adj(G);
• Graphs can be constructed with minimal effort using D4M queries and associative array algebra
D4M-35
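The algebra step multiplies the transposed source-address columns by the server-address columns, so two keys become linked whenever they share a log_id row. The Python sketch below reproduces that product on dict-based arrays. It is an illustration of the technique, not D4M code, and the helper names `select_cols` and `transpose_times` are assumptions.

```python
from collections import defaultdict


def select_cols(assoc, prefix):
    """Keep entries whose column key starts with the given prefix."""
    return {(r, c): v for (r, c), v in assoc.items() if c.startswith(prefix)}


def transpose_times(a, b):
    """Compute a' * b for {(row, col): value} arrays that share row keys."""
    b_by_row = defaultdict(list)
    for (row, col), val in b.items():
        b_by_row[row].append((col, val))
    out = defaultdict(int)
    for (row, col_a), val_a in a.items():
        for col_b, val_b in b_by_row.get(row, []):
            out[(col_a, col_b)] += val_a * val_b
    return dict(out)


data = {
    ("log_id|001", "src_ip|128.0.0.1"): 1,
    ("log_id|001", "server_ip|208.29.69.138"): 1,
    ("log_id|002", "src_ip|192.168.1.2"): 1,
    ("log_id|002", "server_ip|157.166.255.18"): 1,
    ("log_id|003", "src_ip|128.0.0.1"): 1,
    ("log_id|003", "server_ip|74.125.224.72"): 1,
}

G = transpose_times(select_cols(data, "src_ip|"), select_cols(data, "server_ip|"))
```

Each value in `G` counts how many log entries connect a source to a server, so repeated connections would show up as edge weights greater than 1.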
Accumulo Ingestion Scalability Study
LLGrid MapReduce With A Python Application
Accumulo Database: 1 Master + 7 Tablet servers
Figure annotation: 4 million entries/s
Data #1: 5 GB in 200 files
Data #2: 30 GB in 1000 files
D4M-36
Outline
• Introduction
• Theory
• Results
– Network monitoring example
– Bioinformatics example
• Summary
D4M-37
Relative Cost per DNA Sequence
Figure annotations: Big Data; Energy Efficient; High Volume Sequencer; Portable Sequencer
Source: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts. Accessed 03/08/2012
D4M-38
Example Disease Outbreak
May-July 2011 - Virulent E. coli Outbreak in Germany
Timeline events: outbreak identified; Spanish cucumbers implicated; DNA sequence released; sprouts identified
Plotted series: diarrhea, kidney, deaths
Source: www.rki.de EHEC final report
Conclusions:
• Identification of the E. coli source came too late to have substantial impact on illnesses
• Publishing sequence data allowed the broad community to fully characterize the pathogen
• Sequencing and crowdsourced analysis showed promising potential -> Still too slow
D4M-39
Sequence Matching Graph
Sparse Matrix Multiply in D4M
Collected Sample: unknown bacteria
RNA Reference Set: reference bacteria
A1: unknown sequence ID x sequence word (10mer)
A2: reference sequence ID x sequence word (10mer)
A1 A2': unknown sequence ID x reference sequence ID
• Associative arrays provide a natural framework for sequence matching
D4M-40
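Sequence matching as a sparse product can be sketched as follows: each sequence is broken into overlapping 10-character words, and the product A1 A2' counts the words an unknown sequence shares with each reference. This Python version is a hedged illustration. The sequences and IDs are invented, and `kmers`, `word_array`, and `match` are hypothetical names rather than D4M functions.

```python
from collections import defaultdict


def kmers(seq, k=10):
    """Return the set of overlapping length-k words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def word_array(seqs, k=10):
    """Build {(sequence_id, word): 1} from a dict of id -> sequence."""
    return {(sid, w): 1 for sid, seq in seqs.items() for w in kmers(seq, k)}


def match(a1, a2):
    """Compute A1 * A2': shared-word counts per (unknown_id, reference_id)."""
    refs_by_word = defaultdict(list)
    for (ref_id, word), _ in a2.items():
        refs_by_word[word].append(ref_id)
    out = defaultdict(int)
    for (unk_id, word), _ in a1.items():
        for ref_id in refs_by_word.get(word, []):
            out[(unk_id, ref_id)] += 1
    return dict(out)


# Made-up sequences for illustration only.
unknown = {"u1": "TTACGTACGTTAGCAA"}
reference = {"ref1": "ACGTACGTTAGCCGATAG", "ref2": "GGGGCCCCAAAATTTTGG"}

hits = match(word_array(unknown), word_array(reference))
```

Only reference sequences that share at least one word with the sample appear in `hits`, so unrelated references cost nothing beyond the word lookup.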
Database Automatically Computes
Reference 10mer Distribution
Figure: reference 10mer distribution with levels marked at 0.5%, 5%, and 50%
• Using 10mer distribution can quickly select reference 10mers that
maximally differentiate sample sequences and eliminate most 10mers
D4M-41
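One way to read the selection step is as a frequency filter: count how many reference sequences contain each word, then keep only the words rare enough to tell sequences apart. The Python sketch below follows that reading, which is an assumption about the slide rather than a description of the actual D4M analytic. The function names, the cutoff, and the short four-letter stand-in words are all hypothetical.

```python
from collections import Counter


def word_fractions(ref_words):
    """Map each word to the fraction of reference sequences that contain it."""
    counts = Counter(w for words in ref_words.values() for w in words)
    total = len(ref_words)
    return {w: c / total for w, c in counts.items()}


def select_rare(ref_words, max_fraction):
    """Keep words found in at most max_fraction of the reference sequences."""
    fractions = word_fractions(ref_words)
    return {w for w, frac in fractions.items() if frac <= max_fraction}


# Short stand-in words instead of real 10mers, for readability.
refs = {
    "r1": {"AAAA", "CCCC"},
    "r2": {"AAAA", "GGGG"},
    "r3": {"AAAA", "TTTT"},
    "r4": {"AAAA", "CCCC"},
}

fractions = word_fractions(refs)
selected = select_rare(refs, max_fraction=0.25)
```

The word that appears in every reference is dropped because it cannot discriminate between sequences, which matches the slide's point about eliminating most 10mers.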
Leveraging “Big Data” Technologies for High
Speed Sequence Matching
Figure: run time (seconds) versus code volume (lines) on logarithmic axes, comparing BLAST, D4M, and D4M + Triple Store; annotations: 100x faster, 100x smaller
• High performance triple store database trades computations for lookups
• Used Apache Accumulo database to accelerate comparison by 100x
• Used Lincoln D4M software to reduce code size by 100x
D4M-42
Summary
• Big data is found across a wide range of areas
– Document analysis
– Computer network analysis
– DNA Sequencing
• Currently there is a gap in big data analysis tools for algorithm
developers
• D4M fills this gap by providing algorithm developers composable
associative arrays that admit linear algebraic manipulation
D4M-43