High Performance Scientific Data Analytics


High Performance Scientific Data Analytics
Nagiza F. Samatova, PhD
Department of Computer Science, NCSU
Computer Science and Mathematics Division, ORNL
Core Team: Paul Breimyer, Srikanth, David, Nagiza, Jiangtian, Xiaosong, Guru, Chongle, Hoony, George, Heshan, Faisal, Chandra
Collaborators: co-authors on papers, Scott Klasky, Roselyne, Mladen, Arie and Alex, Marcia, Bill Nevins, Bob Hettich, John Drake, Tony Mezzacappa, etc.
Publications
[CRAN] Samatova NF, Yoginath S, Kora G, Bauer D. http://cran.r-project.org/mirrors.html.
[SciDAC-06] Samatova NF, Branstetter M, Ganguly AR, Hettich R, Khan S, Kora G, Li J, Ma X, Pan C, Shoshani A, Yoginath S. "High performance statistical computing with parallel R: Applications to biology and climate." Journal of Physics: Conference Series 46 (2006) 505-509.
[PDCS-05] Yoginath S, Samatova NF, Bauer D, Kora G, Fann G, Geist A. "RScaLAPACK: High-performance parallel statistical computing with R and ScaLAPACK." In Proceedings of the 18th International Conference on Parallel and Distributed Computing Systems (PDCS-2005), September 12-14, 2005, Las Vegas, Nevada.
[AnalChem-06.a] Pan C, Kora G, McDonald WH, Tabb DL, VerBerkmoes NC, Hurst GB, Pelletier DA, Samatova NF, Hettich RL. Anal Chem. 2006 Oct 15;78(20):7121-31.
[AnalChem-06.b] Pan C, Kora G, Tabb DL, Pelletier DA, McDonald WH, Hurst GB, Hettich RL, Samatova NF. Anal Chem. 2006 Oct 15;78(20):7110-20.
[TPAMI-05] Ostrouchov G, Samatova NF. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1340-1343, 2005.
[JCGS-07] Qu YM, Ostrouchov G, Yoginath S, Samatova NF. Journal of Computational and Graphical Statistics, 2007.
[MCP-08] Pan C, Oda Y, Lankford PK, Zhang B, Samatova NF, Pelletier DA, Harwood CS, Hettich RL. "Characterization of anaerobic catabolism of p-coumarate in Rhodopseudomonas palustris by integrating transcriptomics and quantitative proteomics." Mol Cell Proteomics, vol. 7, no. 5, pp. 938-48, 2008.
[CSDA-07] Park BH, Ostrouchov G, Samatova NF. "Sampling streaming data with replacement." Comput. Stat. Data Anal., vol. 52, no. 2, pp. 750-762, 2007.
[TVCG-07] Sisneros R, Jones C, Huang J, Gao J, Park BH, Samatova NF. "A multi-level cache model for run-time optimization of remote visualization." IEEE Trans Vis Computer Graph, vol. 13, no. 5, pp. 991-1003, Sep-Oct 2007.
[DPD-02] Samatova NF, Ostrouchov G, Geist A, Melechko AV. "RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets." Distrib. Parallel Databases, vol. 11, no. 2, pp. 157-180, Mar 2002.
[BIBM-08] Breimyer P, Green N, Kumar V, Samatova NF. "BioDEAL: Biological data-evidence-annotation linkage system." Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2008), Philadelphia, PA, USA, Nov. 7-9, 2008.
Ma X, Li J, Samatova NF. "Automatic Parallelization of Scripting Languages: Toward Transparent Desktop Parallel Computing." Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1-6, 26-30 March 2007.
Publications (cont.)
Lin H, Ma X, Chandramohan P, Geist A, Samatova NF. "Efficient Data Access for Parallel BLAST." Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), p. 72, 04-08 April 2005.
Yoginath S, Samatova NF, Bauer D, Kora G, Fann G, Geist A. "RScaLAPACK: High-performance parallel statistical computing with R and ScaLAPACK." Proceedings of the 18th International Conference on Parallel and Distributed Computing Systems (PDCS-2005), Sep 12-14, 2005, Las Vegas, Nevada.
Park BH, Ostrouchov G, Samatova NF. "Reservoir-based random sampling from data stream." Proceedings of the Fourth SIAM International Conference on Data Mining, Orlando, FL, April 2004.
Ostrouchov G, Samatova NF. "Embedding methods and robust statistics for dimension reduction." COMPSTAT 2004 Proceedings in Computational Statistics, Physica-Verlag, A Springer Company, 2004.
Park BH, Samatova NF, Ostrouchov G, Geist A. "XMap: Fast dimension reduction algorithms for multivariate streamline data." Proceedings of the 6th International Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (in conjunction with the Third International SIAM Conference on Data Mining), San Francisco, CA, May 1-3, 2003.
Abu-Khzam FN, Samatova NF, Ostrouchov G, Langston MA, Geist GA. "Distributed dimension reduction algorithms for widely dispersed data." Proceedings of the Fourteenth IASTED International Conference on Parallel and Distributed Computing and Systems (IASTED PDCS 2002), pp. 167-174, 2002, ACTA Press.
Qu Y, Ostrouchov G, Samatova NF, Geist A. "Principal component analysis for dimension reduction in massive distributed data sets." Proceedings of the Second SIAM International Conference on Data Mining, pp. 4-9, April 2002.
Samatova NF, Ostrouchov G, Geist A, Melechko AV. "RACHET: A new algorithm for mining multi-dimensional distributed datasets." Proceedings of the SIAM Third Workshop on Mining Scientific Datasets, Chicago, IL, April 2001.
Samatova NF, Breimyer P, Kora G, Pan C, Yoginath S. "Parallel R for High Performance Analytics: Applications to Biology." In Scientific Data Management, A. Shoshani and D. Rotem (editors), C. Kamath (co-editor), CRC Press/Taylor and Francis, 2008 (forthcoming).
Samatova NF, Branstetter M, Ganguly AR, Hettich R, Khan S, Kora G, Li J, Ma X, Pan C, Shoshani A, Yoginath S. "High performance statistical computing with parallel R: Applications to biology and climate." Journal of Physics: Conference Series, SciDAC 2006, v. 46, pp. 505-509, 2006.
Bethel W, Abram G, Sharf J, Frank R, Ahrens J, Samatova NF, Miller M. "Interoperability of visualization software and data models is not an achievable goal." In Proceedings of IEEE Visualization, Seattle, Washington, October 19-24, 2003, pp. 607-610.
Tony’s Frustrations
Scientific computing is not only COMPUTE-INTENSIVE but also DATA-INTENSIVE.
• Visualization:
  • TSB, ParaView, EnSight, VisBench… – Which one to choose? What if I want the best part of each of them? Will they ever interoperate?
  • Will they support HDF directly? What about parallel I/O?
  • Will I have viz pipelines/features customized for TSI?
  • Multi-resolution, remote, collaborative, interactive, parallel, scalable…
• Data analysis:
  • Will I have data analysis pipelines customized for TSI?
  • What features to extract?
  • Move from qualitative to quantitative validation and verification of models
  • Can I have a compact representation of an entire simulation? How to compare simulations? Will data analysis be coupled with data archives?
  • Will data analysis ever be coupled with visualization?
More Frustrations…
Tony wants to remain a “Domain Expert,” NOT to become a “Jack of All Trades.”
• Data Management & Networking:
  • A 1024³ hydro run produces terabytes per run
  • How to efficiently stream directly to/from HPSS?
  • PVFS, SRM, HRM… – How to utilize them?
  • Simultaneous transfer of data from the simulation computer to the data analysis/viz cluster
  • File I/O and data transfer take as much time and effort as the simulation, if not more, while limiting the data size often forces reruns due to overly coarse sampling
  • What about data reduction/compression techniques? How aggressive can I be? Will it be enough? What about viz and data analysis running on reduced data? Will I still preserve the desired features?
  • How to efficiently utilize network resources, including data staging, cataloging, and scheduling of preprocessing, data analysis, and viz tasks?
How to Make Tony Happy? – Internet “Plug-ins” for Ultrascale Computing?
(Diagram: ParaView [IEEE Viz-2003] and ASPECT as plug-in examples.)
End-to-End Data Analytics
• Domain Application Layer: Climate, Biology, Fusion
• Interface Layer: Web Service, Dashboard, Workflow
• Middleware Layer: Automatic Parallelization, Scheduling, Plug-in
• Analytics Core Library Layer: Parallel, Distributed, Streamline
• Data Movement, Storage, Access Layer: Data Mover Light, Parallel I/O, Indexing
Programmer’s Dilemma
(Chart: language levels plotted on productivity vs. performance axes; moving from low-level to high-level languages trades performance for productivity.)
• Domain-specific (?)
• Scripting (R, Matlab, IDL) – high-level languages
• Object Oriented (C++, Java)
• Procedural languages (C, Fortran)
• Assembly – low-level language
Towards High-Performance High-Level Languages
How do we get there? ― Parallelization
(Chart: the same productivity-vs-performance language spectrum, with parallelization proposed to lift high-level scripting languages toward low-level performance.)
One Hat Does NOT Fit All
Parallel R for Data Intensive Statistical Computing
(Diagram: data-intensive statistical computing at the intersection of several tool families: technical computing with matrix and vector formulations; data visualization and analysis platforms; image processing and vector computing; and statistical computing and graphics with R.)
About R (http://www.r-project.org):
• Developed by R. Gentleman & R. Ihaka
• Expanded by the community as open source
• Extensible via dynamically loadable libs
Statistical Computing with R
About R (http://www.r-project.org/):
• Open source; the most widely used environment for statistical analysis and graphics; similar to S.
• Extensible via dynamically loadable add-on packages.
• Originally developed by R. Gentleman and R. Ihaka.

> library(mva)
> pca <- prcomp(data)
> …
> summary(pca)

> dyn.load("foo.so")
> .C("foobar")
> dyn.unload("foo.so")
Towards Enabling Parallel Computing in R:
• snow (Luke Tierney): a general API on top of message-passing routines that provides high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications (see the snow sketch below).
• Rmpi (Hao Yu): R interface to LAM-MPI.
• rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()
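
What the snow API looks like in practice; a minimal sketch, assuming a local 4-worker socket cluster (the worker count and transport are arbitrary choices here):

library(snow)
cl <- makeCluster(4, type = "SOCK")              # spawn 4 local worker processes
res <- clusterApply(cl, 1:100, function(i) i^2)  # one independent task per element
stopCluster(cl)                                  # shut the workers down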
Lessons Learned from R/Matlab Parallelization
Interactivity and High-Level: Curse & Blessing
(Chart: approaches arranged by abstraction/interactivity/productivity versus parallel performance; pR targets the high end of both axes.)
• Back-end approach: data parallelism; C/C++/Fortran with MPI; RScaLAPACK (Samatova et al., 2005)
• Automatic parallelization: task parallelism; task-pR (Samatova et al., 2004)
• Embarrassing parallelism: data parallelism; snow (Tierney, Rossini, Li, Sevcikova, 2006)
• Manual parallelization: message passing; Rmpi (Hao Yu, 2006); rpvm (Na Li & Tony Rossini, 2006)
• Compiled approach: MatlabC automatic parallelization
Packages: http://cran.r-project.org/
Task and Data Parallelism in pR
Goal: Parallel R (pR) aims:
(1) to automatically detect and execute task-parallel analyses;
(2) to easily plug in data-parallel MPI-based C/Fortran codes;
(3) to retain a high level of interactivity, productivity, and abstraction.
Task-parallel analyses:
• Likelihood maximization
• Re-sampling schemes: bootstrap, jackknife (see the sketch after this list)
• Markov Chain Monte Carlo (MCMC)
• Animations
Data-parallel analyses:
• k-means clustering
• Principal Component Analysis
• Hierarchical clustering
• Distance matrix, histogram, etc.
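
A hedged sketch of a task-parallel bootstrap; it is written against the snow API shown earlier rather than pR's own syntax, which the slides do not spell out for this case:

library(snow)
cl <- makeCluster(4, type = "SOCK")
x <- rnorm(1000)
# Each replicate resamples the data and recomputes the statistic;
# replicates are independent tasks, so they parallelize trivially.
boot_means <- clusterApply(cl, 1:200,
                           function(i, data) mean(sample(data, replace = TRUE)), x)
stopCluster(cl)
quantile(unlist(boot_means), c(0.025, 0.975))    # 95% bootstrap interval for the mean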
pR Multi-tiered Architecture
(Diagram: an interactive R client sits on top; tightly coupled MPI servers run data-parallel jobs, loosely coupled R servers run task/embarrassingly parallel jobs, and Data Bank Server(s) handle memory & I/O management. A plain R script calling svd(A) maps to a pR script calling sla.svd(A).)

Example pR script (assignment arrows reconstructed as <-):
> library(pR)
> A <- matrix(1:10000, 100, 100)
> PE(
>   S <- sla.svd(A)             # data parallel
>   b <- list()
>   for (k in 1:dim(A)[1]) {    # embarrassingly parallel
>     b[k] <- sum(A[k, ])
>   }
>   m <- mean(A)                # task parallel
>   d <- sum(A)
> )
pR in Use
Key Features of pR – Users’ Perspective:
• Be able to use existing high-level R code
• Require minimal extra effort for parallelizing
• Have an identical/similar (presumably easy-to-use) interface to R's
• Be able to test codes in sequential settings
• Provide efficient and scalable (in terms of problem size and number of processors) performance
• Integrate with Kepler as a front-end interface
Scalability of pR: RScaLAPACK
R>  solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)
A, B are input matrices; NPROWS and NPCOLS are process grid specs; MB is block size. A usage sketch follows this slide.
(Chart: speedup S(p) = T_serial / T_parallel(p) for matrix sizes 1024x1024, 2048x2048, 4096x4096, and 8192x8192; measured speedups include 59, 83, 99, 106, 111, and 116.)
Architecture: SGI Altix at CCS of ORNL with 256 Intel Itanium2 processors at 1.5 GHz; 8 GB of memory per processor (2 TB system memory); 64-bit Linux OS; 1.5 TeraFLOPs theoretical total peak performance.
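
A hedged usage sketch, assuming the CRAN RScaLAPACK package and the parameter meanings above; the grid and block sizes are illustrative, not prescribed by the slide:

library(RScaLAPACK)
A <- matrix(rnorm(1024 * 1024), 1024, 1024)
B <- matrix(rnorm(1024), 1024, 1)
# Solve A x = B on a 2x2 process grid with 64x64 blocks.
x <- sla.solve(A, B, NPROWS = 2, NPCOLS = 2, MB = 64)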
Overhead due to R & pR
(Chart: relative overhead O(p) = (T_RScaLAPACK(p) - T_ScaLAPACK(p)) / T_RScaLAPACK(p) x 100% for matrix sizes 2048x2048, 4096x4096, and 8192x8192.)
C/C++/Fortran Plug-in to pR
R script:
> dyn.load("SharedLibrary.so")
> nums <- as.numeric(1:1000000)
> Result <- .External("median", nums)

C++ plug-in:
SEXP median(SEXP args)
{
  pR::pRParameters prpArgs(args);
  pR::pRVector<double> vec(prpArgs(0));
  vector<double>* myVec = vec.getNativePointer();
  // ... calculate myMedian for myVec ...
  pR::pRVector<double> ret(1);
  ret[0] = myMedian;
  return ret.getRObject();
}
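
A hedged round-trip check for the plug-in above, assuming the shared library builds against the pR headers; the comparison against base R's median is just a sanity test, not part of the original slide:

> dyn.load("SharedLibrary.so")
> nums <- as.numeric(1:1000000)
> Result <- .External("median", nums)
> stopifnot(all.equal(Result, median(nums)))  # must agree with base R
> dyn.unload("SharedLibrary.so")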
Serial pR Performance over Python and R
(Table: comparing method performance in seconds, with pR improvement factors over Python and over R.)
RedHat and CRAN Distribution
• CRAN R-Project: available for download from R's CRAN web site (www.R-Project.org) with 37 mirror sites in 20 countries: http://cran.r-project.org/web/packages/RScaLAPACK/index.html
• RedHat Linux RPM: http://rpmfind.net/linux/RPM/RByName.html
End-to-End Data Analytics
• Domain Application Layer: Climate, Biology, Fusion
• Interface Layer: Web Service, Dashboard, Workflow
• Middleware Layer: Automatic Parallelization, Scheduling, Plug-in
• Analytics Core Library Layer: Parallel, Distributed, Streamline
• Data Movement, Storage, Access Layer: Data Mover Light, Parallel I/O, Indexing
Outreach: Applications & Publications
Across Science Applications:
• Biology: Quantitative Proteomics (B. Hettich, G. Hurst, C. Harwood, C. Pan)
• Climate: Analysis of Extreme Events (M. Branstetter, A. Ganguly, S. Khan)
• GIS: GRASS+pR (G. Fann, B. Budhend)
• Fusion: Scott Klasky, Bill Nevins
ProRata analysis pipeline:
• Subtract background noise from data
• Generate covariance chromatogram
• Apply Savitzky-Golay smoother (see the sketch after this list)
• Calculate cut-off for search
• Find window with max. S/N ratio
• …
ProRata
http://www.MSProRata.org
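
The smoothing step could be prototyped in R with the 'signal' package; this is a generic sketch, not ProRata's actual implementation:

library(signal)
chrom <- rnorm(500) + 50 * dnorm(seq(-5, 5, length.out = 500))  # toy chromatogram
smoothed <- sgolayfilt(chrom, p = 3, n = 11)  # cubic fit over 11-point windows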
ProRata – Bringing pR to Biologists
DOE OBER Projects Using ProRata:
• J. Banfield, Bob Hettich: AMD [Nature-09]
• M. Buchanan: CMCS Center [Bioinformatics08]
• J. Mielenz: BESC BioEnergy [in submission]
• C. Harwood, Bob Hettich: R. palustris [MCP-08]
>1,000 downloads
J. of Proteome Research, Vol. 5, No. 11, 2006 [AnalChem-06.a, 06.b]
ProRata
http://www.MSProRata.org
About GRASS (grass.itc.it)
• GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.
• GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and site data; process multi-spectral image data; and create, manage, and store spatial data.
• It is Free (Libre) Software/Open Source, released under the GNU GPL.
End-to-End Data Analytics
• Domain Application Layer: Climate, Biology, Fusion
• Interface Layer: Web Service, Dashboard, Workflow
• Middleware Layer: Automatic Parallelization, Scheduling, Plug-in
• Analytics Core Library Layer: Parallel, Distributed, Streamline
• Data Movement, Storage, Access Layer: Data Mover Light, Parallel I/O, Indexing
Programmatic Backend Access Via Web Services: Integration to Kepler
(Screenshots: a Kepler workflow and a dashboard interface to pR; credits: Scott Klasky, Roselyne, Norbert.)
End-to-End Data Analytics
• Domain Application Layer: Climate, Biology, Fusion
• Interface Layer: Web Service, Dashboard, Workflow
• Middleware Layer: Automatic Parallelization, Scheduling, Plug-in
• Analytics Core Library Layer: Parallel, Distributed, Streamline
• Data Movement, Storage, Access Layer: Data Mover Light, Parallel I/O, Indexing
Parallel, Distributed and Streamline Algorithms
• Clustering:
  • RACHET: [REF, REF]
  • Faisal’s
• Dimension Reduction and Data Compression:
  • Distributed PCA: [REF]
  • Streamline XMap:
  • RobustMap: [REF]
• Outlier/Extreme Event Detection:
  • RobustMap: [REF]
  • Modeling the Usual to Find the Unusual: [REF]
  • Climate Extreme Events: [SciDAC-06]
• Streamline Sampling:
  • With replacement: [REF, REF] (see the reservoir sketch after this list)
• Parallel Graph Mining:
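
For context, a minimal sketch of classic reservoir sampling; the cited work ([CSDA-07] and the reservoir-based sampling paper) extends this to sampling WITH replacement, which this plain version does not implement:

# Keep a uniform k-subset of a stream seen so far, using O(k) memory.
reservoir_sample <- function(stream, k) {
  res <- stream[1:k]
  for (i in (k + 1):length(stream)) {
    j <- sample.int(i, 1)        # position among the first i items
    if (j <= k) res[j] <- stream[i]
  }
  res
}
reservoir_sample(rnorm(1e5), 10)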
RACHET: Distributed Hierarchical Clustering
Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic (RACHET): send the code, NOT the data.
1. Generate a local dendrogram at each site.
2. Exchange centroid descriptive statistics: DS(c) = (N_c, NORMSQ_c, R_c, SUM_c, MIN_c, MAX_c).
3. Merge local dendrograms into a global dendrogram, using the merging theorem for updating DS; incremental update via fusion.
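
A hedged reading of the DS(c) tuple in R; the exact field definitions (e.g., whether NORMSQ is the sum of squared point norms) are assumptions here, not taken from the RACHET paper:

# Descriptive statistics for a cluster of points (rows of X).
ds <- function(X) {
  centroid <- colMeans(X)
  radius <- max(sqrt(rowSums(sweep(X, 2, centroid)^2)))
  list(N = nrow(X),           # point count
       NORMSQ = sum(X^2),     # assumed: sum of squared point norms
       R = radius,            # max distance to the centroid
       SUM = colSums(X),      # componentwise sum (centroid = SUM / N)
       MIN = apply(X, 2, min),
       MAX = apply(X, 2, max))
}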
Distributed & Streaming Dimension Reduction: Merging Information Rather Than Raw Data
(Charts: distributed vs. monolithic PCA, showing the ratio of # of PCs and the ratio of transmission cost vs. # of data sets, at 80% global variability required and 90% local variability; a stream of simulation data arrives at t = t0, t1, t2, with new chunks at each step.)
Streaming:
• Merge pivotal points only
• Linear time for each chunk
• ~5% deviation from monolithic
Distributed PCA (see the local-summary sketch below):
• Merge a few PCs and local means
• One-time communication
• Controlled variability preserved
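
A hedged sketch of the local step of distributed PCA: each site ships only the principal components covering 90% of its local variability, plus its mean and count, instead of raw data. Function and field names are illustrative:

local_pca_summary <- function(X, var_frac = 0.90) {
  p <- prcomp(X, center = TRUE)
  v <- p$sdev^2
  k <- which(cumsum(v) / sum(v) >= var_frac)[1]   # PCs for 90% local variability
  list(n = nrow(X),
       mean = colMeans(X),
       rotation = p$rotation[, 1:k, drop = FALSE],
       sdev = p$sdev[1:k])
}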
Model the Usual to Find the Unusual
Goal: to reduce the data and to detect extreme/specific events in a global context.
1. Segment the series (~100 observations per segment)
2. Fit simple local models to the series: (c0, c1, c2, ||e||, ||e||²)
3. Reduce the data to the model parameters
4. Select extremes for global analysis
5. Cluster the extremes (4)
6. Map back to the series
(A sketch of steps 1-3 follows.)
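
A hedged R sketch of steps 1-3, reading (c0, c1, c2) as local quadratic coefficients and ||e||, ||e||² as residual norms; these readings are assumptions:

# Segment a series, fit a quadratic per segment, and keep only
# the model parameters as the reduced representation.
reduce_series <- function(x, seg_len = 100) {
  n_seg <- floor(length(x) / seg_len)
  t(sapply(seq_len(n_seg), function(i) {
    seg <- x[((i - 1) * seg_len + 1):(i * seg_len)]
    tt <- seq_len(seg_len)
    fit <- lm(seg ~ tt + I(tt^2))   # local model c0 + c1*t + c2*t^2
    e <- residuals(fit)
    c(coef(fit), norm_e = sqrt(sum(e^2)), norm_e_sq = sum(e^2))
  }))
}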
End-to-End Data Analytics
• Domain Application Layer: Climate, Biology, Fusion
• Interface Layer: Web Service, Dashboard, Workflow
• Middleware Layer: Automatic Parallelization, Scheduling, Plug-in
• Analytics Core Library Layer: Parallel, Distributed, Streamline
• Data Movement, Storage, Access Layer: Data Mover Light, Parallel I/O, Indexing
Climate Data Movement: ESG+SDM
(Diagram: ESG+SDM climate data movement; participants: Dr. David Bernholdt; Dr. Luca Cinquini, et al.; Dr. Nagiza Samatova; Paul Breimyer; Dr. Arie Shoshani; Dr. Alex Sim; Dr. Marcia Branstetter; components: EDP, DML.)
mpiBLAST-pio: Exploiting Parallel I/O
• Publications: IPDPS-05, SSDBM-08
• Download: http://mpiblast.lanl.gov or http://www.mpiblast.org
• Collaborators: Xiaosong Ma, Heshan Lin, Wu Feng
(Chart: execution time in seconds, split into Search and Other components, versus program-output size; mpi vs. pio variants compared at output sizes of 11M, 47M, 96M, and 153M.)
End-to-End Data Analytics: Summary
• Domain Application Layer: Climate, Biology, Fusion
• Interface Layer: Web Service, Dashboard, Workflow
• Middleware Layer: Automatic Parallelization, Scheduling, Plug-in
• Analytics Core Library Layer: Parallel, Distributed, Streamline
• Data Movement, Storage, Access Layer: Data Mover Light, Parallel I/O, Indexing
Looking into the Future… NSF Expedition…
Nagiza Samatova, Mladen Vouk, Scott Klasky, Alok Choudhary, Bertram Ludaescher

Concept-Driven Analytics
Generating Knowledge Hierarchies via In-X Analytics
Climate Use Case
In-X devices/applications (white spheres) produce knowledge layers (pyramid) for annotation and further discussion by scientific social sub-nets (smileys).
L1: A supercomputer runs a simulation and produces raw data (bottom pyramid layer).
L2: As the simulation proceeds, the in-X cloud is informed of the pending analytics. While streaming time series to their destination, the “cyberinfrastructure cloud” segments them on the fly (into ~100 time points), fits polynomials to each segment, and reduces each segment to a few polynomial coefficients. The in-network-reduced data reaches its remote destination: active disks.
L3: The disks, while storing the data, perform in-disk clustering to find similar points in the low-dimensional coefficient space (the usual) and detect outliers to find local extremes (the unusual).
L4: The disks fit statistical models to the clusters of similar points (e.g., cluster centroids, density).
L5: Local/global extremes for different variables are analyzed in memory for cause-effect linkages.
L6: Manually and/or automatically generated hypotheses are recorded in community knowledgebases.
L8: The databases, while recording the predicted relationships and hypotheses, compare, contrast, and link them to prior knowledge. In-database comparative analysis results are recorded.
Semantic Knowledge Annotation with BioDEAL