An Update on Data Science


Application of Machine Learning
Patterns and Behaviors in Complex Systems
James M. Brase
Deputy Associate Director, Computation
Lawrence Livermore National Laboratory
LLNL-PRES-671957
This work was performed under the auspices of the U.S. Department
of Energy by Lawrence Livermore National Laboratory under Contract
DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
9-8-10-SSCI
Machine learning is applied to a broad set of applications at LLNL
• Document analysis – Is this document relevant to topic Y? Topics are defined as distributions of terms, phrases, phrase graphs, ...
• Cybersecurity – How many network connections do we expect node A to make in the next minute?
• Materials science – Discovery of patterns in component material attributes and critical reaction parameters to produce custom-designed properties
• Adaptive mesh simulation – Will this simulation parameter set cause the mesh to tangle?
• Image and multimedia analysis – Can we label the objects in this image? Can we find other, similar videos?
Machine learning – statistical inference of patterns in data
Supervised learning – mapping feature vectors to labels
Training: training data → feature vectors and labels (training set, labels 1 or -1) → learned function $\hat{f}(fv)$
• Logistic regression
• Random forests
• Neural networks
Applying: new data → feature vector → $\hat{l} = \hat{f}(fv)$
• Discrete labels – classifiers
• Continuous labels – regression
• Function mapping
Unsupervised learning – finding structure in data
• Association rules
• Clustering
• Density estimation
• Autoencoders
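The supervised training and applying steps can be sketched with logistic regression, the first method listed on the slide. This is a minimal illustration, not LLNL's implementation: the feature vectors, learning rate, and function names (`train_logistic`, `predict`) are assumptions chosen for the example, and labels follow the slide's 1 / -1 convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.1, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the logistic loss.

    Labels are +1 / -1, as in the slide's training set.
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for fv, y in zip(features, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, fv)) + b)
            # d/d(score) of log(1 + exp(-margin)) is -y * sigmoid(-margin)
            g = -y * sigmoid(-margin)
            w = [wi - lr * g * xi for wi, xi in zip(w, fv)]
            b -= lr * g
    return w, b

def predict(w, b, fv):
    """Apply the learned mapping to a new feature vector."""
    score = sum(wi * xi for wi, xi in zip(w, fv)) + b
    return 1 if score >= 0 else -1

# Toy training set (illustrative values only)
train_fv = [[2.0, 1.0], [1.5, 2.0], [3.0, 2.5],
            [-1.0, -2.0], [-2.0, -1.5], [-3.0, -1.0]]
train_labels = [1, 1, 1, -1, -1, -1]

w, b = train_logistic(train_fv, train_labels)
new_label = predict(w, b, [2.5, 1.5])
```

Replacing the thresholded output with the raw score would turn the same mapping into a continuous-label (regression-style) predictor.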
Learning language models for estimating document relevance
Figure (pipeline diagram) with stages: new documents; weak filtering; entity extractor; collocation filter; keyphrase extractor; new document graph; graph classifier (relevant graphs vs background graphs); relevance score. Training side: forced migration reference documents; training graph models.
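A much simpler stand-in for the graph classifier shows what a relevance score can mean. The sketch below treats a topic as a distribution over single terms and scores a document by its average log-likelihood ratio against a background distribution. It is an assumption for illustration only: the slide's pipeline uses phrase graphs, and the toy documents, the add-one smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter

def term_distribution(docs, vocab, alpha=1.0):
    """Smoothed unigram distribution over vocab (add-alpha smoothing)."""
    counts = Counter(t for d in docs for t in d.split() if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def relevance_score(doc, topic, background):
    """Average log-likelihood ratio per in-vocabulary term; > 0 leans relevant."""
    terms = [t for t in doc.split() if t in topic]
    if not terms:
        return 0.0
    return sum(math.log(topic[t] / background[t]) for t in terms) / len(terms)

# Toy corpora (invented for the example)
reference_docs = ["refugees crossed border camp",
                  "displaced families fled conflict border"]
background_docs = ["stock market rose today",
                   "team won match today",
                   "border trade market"]

vocab = set(" ".join(reference_docs + background_docs).split())
topic = term_distribution(reference_docs, vocab)
background = term_distribution(background_docs, vocab)

score_a = relevance_score("families fled across border", topic, background)
score_b = relevance_score("market rose after match", topic, background)
```

Here `score_a` is positive and `score_b` is negative, so the first document leans toward the reference set and the second toward the background.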
Document relevance for the NYT corpus
Figure: relevance to the forced migration reference document set
Cybersecurity uses machine learning and graph analysis to model network behavior
Collect packets, flow and process data from the full physical network
Stream processing for feature and signature extraction
Build a dynamic graph representation of activity
Machine learning on the dynamic graph
• Node and group classification algorithms
• Temporal activity models – dynamic Bayesian networks
• Anomaly detection algorithms
Applications
• Inferring node and group roles
• Prediction of activity distributions
• Cueing analysts to anomalous behaviors
• Functional network discovery and characterization
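To make "prediction of activity distributions" and "cueing analysts to anomalous behaviors" concrete, the sketch below models one node's connections per minute as a Poisson variable and flags counts that are very unlikely under the fitted rate. The Poisson model is a deliberately simple stand-in for the dynamic Bayesian networks named on the slide; the count history, the threshold, and the function name are assumptions.

```python
import math

def poisson_tail(k, rate):
    """P(X >= k) for X ~ Poisson(rate)."""
    cdf = sum(math.exp(-rate) * rate**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

# Toy connections-per-minute history for one node (illustrative values)
history = [4, 6, 5, 3, 7, 5, 4, 6]

# Maximum-likelihood Poisson rate is the sample mean:
# this is the expected number of connections in the next minute.
rate = sum(history) / len(history)

observed = 15
p_value = poisson_tail(observed, rate)

# Cue an analyst when the observed count is very unlikely under the model
is_anomalous = p_value < 1e-3
```

A per-node rate that changes with time of day or with the node's role would be a natural next refinement.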
Host role learning
• Dynamic IP-IP graph
• Host roles are local characteristics of the IP-IP graph structure, e.g. “center of star”, end node, …
• Learning Markov models for behavior forecasting
• Reduced prediction error using host roles
• Anomaly detection in host role distribution
Ryan Rossi, Brian Gallagher, Jennifer Neville, Keith Henderson. Modeling Dynamic
Behavior in Large Evolving Graphs. ACM International Conference on Web Search
and Data Mining (WSDM), 2013.
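The idea of learning a Markov model over host roles can be sketched as follows: count role-to-role transitions in observed sequences, normalize them into probabilities, and read off the forecast for the next time step. This is a first-order toy version under stated assumptions, not the method of the cited paper; the role names come from the slide's examples, and the sequences and function name are hypothetical.

```python
from collections import defaultdict

def fit_transitions(role_sequences):
    """Estimate first-order Markov transition probabilities from role sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in role_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }

# Toy per-host role sequences over successive time windows (invented)
sequences = [
    ["end_node", "end_node", "star_center", "star_center"],
    ["end_node", "star_center", "star_center", "end_node"],
]

transitions = fit_transitions(sequences)

# Forecast: distribution over the next role for a host currently at a star center
forecast = transitions["star_center"]
```

With these sequences the forecast assigns probability 2/3 to staying a star center and 1/3 to becoming an end node. An observed role distribution that departs strongly from such forecasts is one way to frame anomaly detection in host roles.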
Some R&D directions in machine learning
Training: training data → feature vectors and labels (training set) → learned function $\hat{f}(fv)$
The training set holds N feature vectors of dimension D, each with a label of 1 or -1.
Features have traditionally been hand-engineered. Is there a principled approach to finding a good set of features?
→ Deep learning
We usually deal with N >> D. In emerging applications we can have N << D (e.g. genomics, ...). Can we regularize (constrain the solutions) with mechanistic models?
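One way to picture such a constraint, offered as an assumption rather than the approach the slide has in mind, is penalized least squares in which the penalty pulls the weights toward a value $w_{\text{mech}}$ suggested by a mechanistic model. Here $X$ is the hypothetical $N \times D$ matrix of feature vectors, $l$ the vector of labels, and $\lambda > 0$ a tuning parameter:

```latex
\hat{w} = \arg\min_{w} \; \lVert l - Xw \rVert_2^2 + \lambda \lVert w - w_{\text{mech}} \rVert_2^2

% Setting the gradient to zero:
% -2 X^\top (l - Xw) + 2\lambda (w - w_{\text{mech}}) = 0

\hat{w} = \left( X^\top X + \lambda I \right)^{-1} \left( X^\top l + \lambda\, w_{\text{mech}} \right)
```

When N << D the matrix $X^\top X$ is singular, but $X^\top X + \lambda I$ is invertible for any $\lambda > 0$, so the solution is unique. Setting $w_{\text{mech}} = 0$ recovers ordinary ridge regression.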
Deep learning provides an unsupervised approach to learning feature sets from data
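A minimal sketch of unsupervised feature learning is a tiny linear autoencoder: it compresses each 2-D input to one learned feature and is trained only to reconstruct its input, with no labels. Real deep networks stack many nonlinear layers; the architecture, data, learning rate, and function name here are assumptions chosen to keep the example short.

```python
def train_autoencoder(data, lr=0.01, epochs=500):
    """Train a 2 -> 1 -> 2 linear autoencoder by batch gradient descent.

    Returns encoder weights w, decoder weights v, and the loss per epoch.
    """
    w = [0.1, 0.1]   # encoder: feature h = w . x
    v = [0.1, 0.1]   # decoder: reconstruction = v * h
    n = len(data)
    losses = []
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        loss = 0.0
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]
            err = [x[0] - v[0] * h, x[1] - v[1] * h]
            loss += err[0] ** 2 + err[1] ** 2
            ev = err[0] * v[0] + err[1] * v[1]
            for j in range(2):
                grad_v[j] += -2.0 * err[j] * h     # d(loss)/d(v_j)
                grad_w[j] += -2.0 * ev * x[j]      # d(loss)/d(w_j)
        losses.append(loss / n)
        w = [w[j] - lr * grad_w[j] / n for j in range(2)]
        v = [v[j] - lr * grad_v[j] / n for j in range(2)]
    return w, v, losses

# Unlabeled toy data lying along one direction in 2-D (illustrative values)
data = [[0.5, 1.0], [1.0, 2.0], [-0.5, -1.0], [-1.0, -2.0]]
w, v, losses = train_autoencoder(data)
```

Because the toy data vary along a single direction, one learned feature is enough to reconstruct them, and the reconstruction loss falls close to zero during training.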
Deep machine learning research is extending pattern recognition and discovery beyond human capabilities
Figure: 100B-synapse deep learning networks learning patterns in 100M random images from Flickr. Panels: airplanes neuron, “fireworks” neuron, images-with-text neuron.
• Discovering complex patterns in massive multisource intelligence data sets guided by science-based models – not exact keywords
• Image recognition performance now surpasses human accuracy
• Partnership with Stanford and UC Berkeley on algorithms, NVIDIA on large GPU implementations, and IBM on neurosynaptic architectures
Data movement is the limiting factor for analytics – supplementing the memory hierarchy
• Partnership with Intel and Cray to develop a 150 TF/s data analytics computer
• Technical focus on NVRAM layers in the memory hierarchy supporting 24-core nodes – prototyping analytics in a new environment
• Over 5 GB DRAM and 36 GB NVRAM per core
Initial applications will focus on
• Prototyping exascale simulation analysis architectures
• Bioinformatics algorithms
• Graph analytics