Scientific Data Management Center (ISIC)
http://sdmcenter.lbl.gov (contains an extensive publication list)
Scientific Data Management Center: Participating Institutions

Center PI: Arie Shoshani (LBNL)

DOE laboratory co-PIs:
• Bill Gropp, Rob Ross (ANL)
• Arie Shoshani, Doron Rotem (LBNL)
• Terence Critchlow, Chandrika Kamath (LLNL)
• Nagiza Samatova, Andy White (ORNL)

University co-PIs:
• Mladen Vouk (North Carolina State)
• Alok Choudhary (Northwestern)
• Reagan Moore, Bertram Ludaescher (UC San Diego / SDSC)
• Calton Pu (Georgia Tech)
Phases of Scientific Exploration: Data Generation

Data comes from large-scale simulations or experiments, and data volume grows quickly with computational power. Examples:
• HENP: 100 teraops and 10 petabytes by 2006
• Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); a T42 run produces about 1 TB per 100-year run, and the higher resolutions imply a factor of ~10-20 more data

Problems:
• Can't dump the data to storage fast enough - a waste of compute resources
• Can't move terabytes of data over a WAN robustly - a waste of the scientist's time
• Can't steer the simulation - a waste of time and resources
• Need to reorganize and transform data - large data-intensive tasks slow progress
Phases of Scientific Exploration: Data Analysis

Analysis of large data volumes; the data cannot all fit in memory.

Problems:
• Finding the relevant data - needs efficient indexing
• Cluster analysis - needs linear scaling
• Feature selection - needs efficient high-dimensional analysis
• Data heterogeneity - need to combine data from diverse sources
• Streamlining analysis steps - the output of one step needs to match the input of the next
Example Data Flow in TSI

[Diagram, courtesy of John Blondin: example data flow in the Terascale Supernova Initiative. A highly parallel compute platform produces input data aggregated into ~500 files (from under 2 to 10+ GB each) on local mass storage (14+ TB), then aggregated into single 1+ TB files and moved over the Logistical Network (L-Bone) to a data depot archive. Output of ~500x500 files flows to a local 44-processor data cluster, where data sits on local nodes for weeks, and on to viz software, a viz wall, and viz clients.]
Goal: Reduce the Data Management Overhead

• Efficiency
  Example: parallel I/O, indexing, matching storage structures to the application
• Effectiveness
  Example: access data by attributes, not files; facilitate massive data movement
• New algorithms
  Example: specialized PCA techniques to separate signals or to achieve better spatial data compression
• Enabling ad-hoc exploration of data
  Example: an exploratory "run and render" capability to analyze and visualize simulation output while the code is running
Approach

Use an integrated SDM framework that:
• Provides a scientific workflow capability (Scientific Process Automation layer)
• Supports data mining and analysis tools (Data Mining & Analysis layer)
• Accelerates storage of and access to data (Storage Efficient Access layer)

Simplify data management tasks for the scientist:
• Hide details of the underlying parallel and indexing technology
• Permit assembly of modules using a simple graphical workflow description tool

[Diagram: the three-layer SDM framework mediates between the scientific application and scientific understanding.]
Technology Details by Layer

Scientific Process Automation (SPA) layer:
• Workflow management tools
• Web wrapping tools

Data Mining & Analysis (DMA) layer:
• ASPECT integration framework
• Data analysis tools (PCA, ICA)
• Efficient parallel visualization (pVTK)

Storage Efficient Access (SEA) layer:
• Storage Resource Manager (to HPSS)
• Parallel NetCDF software layer
• Efficient indexing (bitmap index)
• ROMIO MPI-IO system
• Parallel Virtual File System

All layers sit on top of the hardware, OS, and MSS (HPSS).
Accomplishments: Storage Efficient Access (SEA)

Parallel Virtual File System (PVFS):
• Shared memory communication
• Enhancements and deployment
• Developed PVFS2: adds Myrinet GM and InfiniBand support, improved fault tolerance, and asynchronous I/O; offered by Dell and HP for clusters

Enhanced ROMIO:
• Provides MPI access to PVFS
• Advanced parallel file system interfaces for more efficient access

Developed Parallel netCDF:
• Enables high-performance parallel I/O to netCDF datasets
• Achieves up to a 10-fold performance improvement over HDF5 (FLASH I/O benchmark performance, 8x8x8 block sizes)
• [Diagram: before, processes P0-P3 write through a serial netCDF layer; after, they write in parallel through Parallel netCDF to the parallel file system. A minimal MPI-IO sketch follows below.]

Deployed an HPSS Storage Resource Manager (SRM) with PVFS:
• Automatic access of HPSS files to PVFS through the MPI-IO library
• SRM is a middleware component
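For illustration, here is a minimal sketch (not the center's code) of the collective MPI-IO pattern that ROMIO implements and Parallel netCDF builds on, written with the mpi4py Python bindings; the file name, data sizes, and layout are assumptions:

```python
# A minimal sketch of collective MPI-IO, the interface ROMIO implements
# and Parallel netCDF builds on. File name and sizes are illustrative.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_local = 1024                                  # values owned by this rank
data = np.full(n_local, rank, dtype='f8')

fh = MPI.File.Open(comm, "output.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
offset = rank * n_local * data.itemsize         # disjoint region per rank
fh.Write_at_all(offset, data)                   # collective write
fh.Close()
```

Run with, e.g., mpiexec -n 4 python sketch.py. Every rank participates in the single collective write, which is what lets the I/O layer coordinate and optimize access, rather than funneling all data through one process.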
Robust Multi-file Replication

Problem: move thousands of files robustly from anywhere.
• Takes many hours
• Needs error recovery: mass storage system failures, network failures

Solution: use Storage Resource Managers (SRMs) via DataMover.
• SRM-COPY requests the whole set (thousands of files) at once
• SRM-GET then moves one file at a time: in the diagrammed example, the SRM at LBNL gets the list of files and performs reads, staging files from MSS into its disk cache, while the SRM at NCAR performs writes, archiving files from its disk cache to MSS
• GridFTP GET (pull mode) carries out the network transfer between the disk caches

Problem: too slow.
• Use parallel streams
• Use concurrent transfers
• Use large FTP windows
• Pre-stage files from MSS

(A small concurrency-and-retry sketch follows below.)
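The recovery and concurrency ideas are easy to sketch. Below is a hypothetical Python illustration of DataMover-style multi-file copying with bounded concurrency and per-file retry; transfer_one() uses a local copy as a stand-in for a real SRM/GridFTP transfer call, and none of these names are actual SRM APIs:

```python
# Hypothetical sketch of DataMover-style multi-file copying: bounded
# concurrency plus per-file retry with backoff.
from concurrent.futures import ThreadPoolExecutor, as_completed
import shutil
import time

def transfer_one(src: str, dst: str, retries: int = 3) -> str:
    for attempt in range(1, retries + 1):
        try:
            shutil.copy(src, dst)            # stand-in for the transfer
            return f"ok: {src}"
        except OSError:
            time.sleep(2 ** attempt)         # back off, then retry
    return f"failed: {src}"

def move_files(pairs, concurrency: int = 4) -> None:
    # pairs: list of (source, destination) paths
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(transfer_one, s, d) for s, d in pairs]
        for done in as_completed(futures):
            print(done.result())             # report per-file outcome
```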
Accomplishments: Data Mining and Analysis (DMA)

Developed Parallel-VTK (pVTK):
• Efficient 2D/3D parallel scientific visualization for NetCDF and HDF files
• Built on top of PnetCDF
• [Chart: PVTK serial vs. parallel writer on an 80 MB dataset; time (seconds) vs. number of processors]

Developed a "region tracking" tool:
• For exploring 2D/3D scientific databases
• Uses bitmap technology to identify regions based on multi-attribute conditions
• [Image: combustion region tracking]

Implemented an Independent Component Analysis (ICA) module:
• Used for accurate signal separation
• Used for discovering key parameters that correlate with observed data
• [Plot: El Nino signal (red) and its estimation (blue) closely match]

Developed highly effective data reduction:
• Achieves a 15-fold reduction with a high level of accuracy
• Uses parallel Principal Component Analysis (PCA) technology
• (A serial PCA/ICA sketch follows below.)

Developed ASPECT:
• A framework that supports a rich set of pluggable data analysis tools, including all the tools above
• A rich suite of statistical tools based on the R package
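As a rough, serial illustration of the two techniques named above (the center's implementations are parallel), scikit-learn's PCA and FastICA show the data-reduction and signal-separation patterns; the array shapes and component counts are assumptions:

```python
# Illustrative serial PCA/ICA, standing in for the center's parallel
# implementations. Shapes and the reduction target are assumptions.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))     # e.g. 1000 time steps x 300 grid points

pca = PCA(n_components=20)           # 300 -> 20 dims: a ~15-fold reduction
Z = pca.fit_transform(X)             # compressed representation
X_hat = pca.inverse_transform(Z)     # approximate reconstruction

ica = FastICA(n_components=5, random_state=0)
S = ica.fit_transform(X)             # estimated independent source signals
```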
ASPECT Analysis Environment

Pipeline: Data Select -> Data Access -> Correlate -> Render -> Display

Example query:
  Select (temp, pressure)
  From astro-data
  Where (step=101) and (entropy>1000);

Data Mining & Analysis layer: select data, take a sample of (temp, pressure), run an R analysis (R Analysis Tool), run a pVTK filter (pVTK Tool), and visualize a scatter plot in Qt.

Storage Efficient Access layer: use the bitmap index to evaluate the condition (Bitmap Index Selection), get variables by var-names and ranges through Parallel NetCDF over PVFS, and pass intermediate results between tools by reading and writing named buffers.

Everything runs on top of the hardware, OS, and MSS (HPSS). A toy sketch of the bitmap selection step follows below.
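The sketch below shows the bitmap selection idea on synthetic data with plain boolean bitmaps; production bitmap indexes use compressed encodings, so this only illustrates the query pattern:

```python
# Toy bitmap-index selection: one boolean bitmap per attribute value or
# bin, combined with AND/OR to answer a multi-attribute condition.
# All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
step = rng.integers(100, 103, size=10_000)        # hypothetical attribute
entropy = rng.random(10_000) * 2000.0             # hypothetical attribute

step_bitmaps = {v: (step == v) for v in np.unique(step)}
edges = [0, 500, 1000, 1500, 2000]                # bins for a range index
ent_bitmaps = [(entropy >= lo) & (entropy < hi)
               for lo, hi in zip(edges, edges[1:])]

# WHERE (step=101) and (entropy>1000):
# OR the qualifying entropy bins, AND with the step bitmap.
candidates = step_bitmaps[101] & (ent_bitmaps[2] | ent_bitmaps[3])
rows = np.nonzero(candidates)[0]
rows = rows[entropy[rows] > 1000]                 # exact check at bin edge
print(len(rows), "matching records")
```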
Accomplishments: Scientific Process Automation (SPA)

Unique requirements of scientific workflows:
• Moving large volumes between modules: tightly-coupled, efficient data movement
• Specification of granularity-based iteration: e.g. in spatio-temporal simulations, a time step is a "granule" (see the sketch after this list)
• Support for data transformation: complex data types, including file formats such as netCDF and HDF
• Dynamic steering of the workflow by the user: dynamic user examination of results

Developed a working scientific workflow system:
• Automatic microarray analysis
• Uses web-wrapping tools developed by the center
• Uses the Kepler workflow engine; Kepler is an adaptation of the UC Berkeley tool, Ptolemy
• Workflow steps are defined graphically and workflow results are presented to the user
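To make the "granule" idea concrete, here is a hypothetical Python dataflow in which one time step at a time flows through transformation stages; Kepler expresses such pipelines graphically, so this only illustrates the iteration pattern, and all names and data are made up:

```python
# Hypothetical granularity-based iteration: each time step ("granule")
# flows through the pipeline one at a time.
from typing import Dict, Iterator

def read_steps(n_steps: int) -> Iterator[Dict]:
    # stand-in for reading one time step of simulation output
    for t in range(n_steps):
        yield {"step": t, "field": [t * 0.1] * 4}

def transform(granules: Iterator[Dict]) -> Iterator[Dict]:
    # e.g. a per-granule unit conversion or format translation
    for g in granules:
        g["field"] = [v * 2.0 for v in g["field"]]
        yield g

def render(granules: Iterator[Dict]) -> None:
    # downstream module consumes granules as they arrive
    for g in granules:
        print(f"step {g['step']}: {g['field']}")

render(transform(read_steps(3)))
```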
[Screenshot: GUI for setting up and running workflows.]
Re-applying Technology

SDM technology, developed for one application, can be effectively targeted at many other applications:

Technology                  Initial Application   New Applications
Parallel NetCDF             Astrophysics          Climate
Parallel VTK                Astrophysics          Climate
Compressed bitmaps          HENP                  Combustion, Astrophysics
Storage Resource Managers   HENP                  Astrophysics
Feature Selection           Climate               Fusion
Scientific Workflow         Biology               Astrophysics (planned)
Broad Impact of the SDM Center…

Astrophysics: high-speed storage technology, parallel NetCDF, parallel VTK, and ASPECT integration software used for Terascale Supernova Initiative (TSI) and FLASH simulations; ASCI FLASH uses parallel NetCDF.
• Tony Mezzacappa (ORNL), John Blondin (NCSU), Mike Zingale (U of Chicago), Mike Papka (ANL)

Climate: high-speed storage technology, Parallel NetCDF, and ICA technology (dimensionality reduction) used for climate modeling projects.
• Ben Santer (LLNL), John Drake (ORNL), John Michalakes (NCAR)

Combustion: compressed bitmap indexing (region growing) used for fast generation of flame regions and tracking of their progress over time.
• Wendy Koegler, Jacqueline Chen (Sandia Lab)
Broad Impact (cont.)

Biology: Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data (building a scientific workflow).
• Matt Coleman (LLNL)

High Energy Physics: compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving data from HPSS, with dynamic monitoring of HPSS file transfers.
• Doug Olson (LBNL), Eric Hjort (LBNL), Jerome Lauret (BNL)

Fusion: a combination of PCA and ICA technology used to identify the key parameters relevant to the presence of edge harmonic oscillations in a tokamak (identifying key parameters for the DIII-D tokamak).
• Keith Burrell (General Atomics)
Goals for Years 4-5

Fully develop the integrated SDM framework
• Implement the three-layer framework on the SDM center facility
• Provide a way to select only the components needed
• Develop self-guiding web pages on the use of SDM components, using existing successful examples as guides

Generalize components for reuse
• Develop general interfaces between the components in the layers
• Support loosely-coupled WSDL interfaces
• Support tightly-coupled components for efficient dataflow

Integrate operation of components in the framework
• Hide details from the user: automate parallel access and indexing
• Develop a reusable library of components that can be selected for use in the workflow system