LANL Briefing - May 2004
IBM Systems & Technology Group
Data Intensive World
Bob Curran and Kevin Gildea
© 2006 IBM Corporation
Introduction
Data size is growing both in aggregate and in individual elements
Rate of data storage/access is growing both in aggregate and in
individual elements.
Price of storage is making the storage of more data and more
elaborate forms of old data affordable.
Better networking is making it possible to deliver some of this data
to users away from the installation which created the data.
New application classes build upon all of this
All of this leads to questions about data integrity, data validity, data
search, backup, retention …
Data is growing
Individual files are growing
In weather research, 1MB of text observations has become 1TB of observational data over the last two decades
A 4KB letter in ASCII becomes a 30KB Word document with no added pictures, signatures or logos.
Media applications generating more and higher resolution data
Aggregate Data is growing
Science data is growing. CERN starting a new set of particle
acceleration experiments expected to collect 10 PB/year of filtered
retained data
GPFS file system at LLNL reached 2PB this year
Library of Congress RFP in 1/2006 called for the ingest of 8PB/year
of video data retained indefinitely
Storage is growing
Anthology Terabyte NAS on sale at Fry’s - $699 or $599 w/ rebate
SATA drives of 500GB are available today. FC drives at 300GB.
IBM DS8300 can store 180TB of data in one controller. A DS4800
mid-range controller can store 89TB.
One IBM TS3500 tape library can store over 3PB of data on 3592
cartridges which can contain 1TB each with typical compression.
Disk density improvements appear to be slowing down.
PLUS: you can aggregate them to create file systems of hundreds
of TB to a few PB today
Data Access Rates are growing
Disks attached by 4Gb FC, 320 MB/sec SCSI … replacing the 10
MB/sec SCSI of a few years ago.
Drive speeds of 10K or 15K rpm are common, replacing the 5400 rpm of a few years ago
Data transfer rates vary heavily based on usage patterns and fragmentation, but 50MB/sec or more is not uncommon in our experience.
RAID arrays can multiply this transfer by 4-8X for a single object
PLUS: you can stripe files or sets of files across arrays, controllers
and links to produce even larger data rates for a file or collection of
files. GPFS measured 122GB/sec for one file at LLNL in 3/06.
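As a rough illustration of how striping multiplies single-object bandwidth, the sketch below computes an upper bound; the per-disk rate and array sizes are hypothetical, chosen to match the figures quoted above, and real rates depend heavily on usage patterns and fragmentation.

```python
# Rough upper-bound estimate of striped transfer rate: a file striped
# across several RAID arrays can be read from all of them at once.
# All numbers are illustrative, not measured.
def striped_rate_mb_s(per_disk_mb_s, disks_per_array, arrays):
    """Aggregate MB/sec when a single object is striped across
    `arrays` arrays of `disks_per_array` data disks each."""
    return per_disk_mb_s * disks_per_array * arrays

print(striped_rate_mb_s(50, 8, 1))    # one 8-disk array: 400 MB/s
print(striped_rate_mb_s(50, 8, 16))   # 16 arrays: 6400 MB/s
```

Scaling the same idea across thousands of disks, controllers and links is how aggregate rates in the 100+ GB/sec range become reachable.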
Network data rates are going up
My office has gone from 15Mb (token ring) to 1Gb over the past 5
years. My home has gone from 9kbps modems to 300Mbps cable
modems
Highest performing clusters have multiple 2GB/sec proprietary
links and 4X IB or 10Gb ethernet
Multiple Gbps links are common between sites and some GRID
environments aggregate enough to provide 30-40Gb/sec.
I’d like any data from anywhere and I’d like it now, please.
U Penn Digital Mammography Project
Potential for 28 Petabytes/year over 2000 Hospitals, in full
production
7 Regional Archives @ 4,000 TB/yr
20 Area @ 100 TB/yr
15 Hospitals @ 7 TB/yr
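The tiered figures above are roughly self-consistent, as a quick back-of-envelope check (in TB/year) shows:

```python
# Back-of-envelope check on the NDMA tier figures (TB/year).
hospital_rate = 7                  # per hospital
area_rate = 15 * hospital_rate     # 15 hospitals per area: 105, quoted as ~100
regional_rate = 4000               # per regional archive
total = 7 * regional_rate          # 7 regional archives
print(area_rate)                   # 105
print(total)                       # 28000 TB/yr = 28 PB/yr, the headline figure
```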
Proposed Hierarchical Layout
[Diagram: six regional archives, each aggregating many area/hospital archives (A)]
Goal: Distribute Storage Load and Balance Network and Query Loads
U Penn Digital Mammography Project
Current NDMA Configuration
Testbed to demonstrate
feasibility
Storage and retrieval
Infrastructure for access
Instant consultation with
experts
Innovative teaching tools
Ensure privacy and
confidentiality
Current NDMA Portal Locations
Regional Archive map
[Map: portal sites linked via CAnet*3, the Chicago GigaPop, NSCP, Abilene, the N.C. GigaPop, the Atlanta GigaPop, and ESNET]
http://nscp.upenn.edu/NDMA
HPCS (A petaflop project)
As stated:
1.5-5 TB/sec aggregate for big science applications; less for other applications
32K file creates/sec
Single stream at 30 GB/sec
1 trillion files in a file system
Inferred:
Probably requires 50-100 PB for a balanced system at the high end of this project; multi-PB even at the lower end.
It would be good to manage these components effectively; there are a lot of components here that need to work in concert.
Access beyond the main cluster is desired, beyond basic NFS
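One hedged way to see why the inferred capacity is plausible: at the stated aggregate bandwidth, a balanced file system should take hours, not minutes or days, to write end to end. The arithmetic below uses the high end of the stated range.

```python
# Time to fill the inferred capacity at the stated peak bandwidth.
bandwidth_tb_s = 5        # TB/sec, high end of the stated range
capacity_tb = 100_000     # 100 PB expressed in TB
seconds = capacity_tb / bandwidth_tb_s
hours = seconds / 3600
print(f"{hours:.1f} hours to write 100 PB at 5 TB/s")  # about 5.6 hours
```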
Data increasingly created by events/sensors
Images collected from NEAT (Near- Earth Asteroid Tracking)
telescopes
First year: processed 250,000 images, archived
– 6 TB of compressed data
– Randomly accessed by community of users
Life sciences
– Mass spectrometer creates 200 GB/day, times 100 Mass Spectrometers ->
20 TB/day
– Mammograms, X-rays and other medical images
Ability to create/store data is increasing exponentially
Problem is to extract meaningful info from all the data
Often information is found only in hindsight
Sometimes some analysis can be done at data collection time, which requires the ability to access the data as it is collected. High data rates are required.
Collaborative computing – no boundaries
Emergence of large, multidisciplinary computational science
teams
Across geographies
Across disciplines/sciences
Biological systems
Computer science, biochemistry, statistics, computational
chemistry, fluids, …
Automotive design
Change sheet metal design – interactive integration
Collaboration within enterprise/university
Collaboration anywhere/anytime
Ability to share/characterize data is critical
Data accessible forever
Size – too large to back up/restore unconditionally
Growth rate – constantly changing/increasing
Value – always needs to be available for analysis
Search - Need to be able to find the data
Performance – appropriate for the data
Collaboration – no way to anticipate users
New storage paradigm
GPFS: parallel file access
Parallel Cluster File System Based on
Shared Disk (SAN) Model
Cluster – fabric-interconnected
nodes (IP, SAN, …)
Shared disk - all data and metadata
on fabric-attached disk
Parallel - data and metadata flow from all of the nodes to all of
the disks in parallel under control of a distributed lock
manager.
[Diagram: GPFS file system nodes connected via a switching fabric (system or storage area network) to shared disks (SAN-attached or network block device)]
IBM General Parallel File System
Adding Information Lifecycle Management to GPFS
GPFS adds support for ILM
abstractions: filesets,
storage pools, policy
– Fileset: subtree of a file system
– Storage pool: group of LUNs
– Policy: rules for placing files into storage pools
Examples
– Place new files on fast, reliable storage; move files as they age to slower storage, then to tape
– Place media files on video-friendly storage (fast, smooth), other files on cheaper storage
– Place related files together, e.g. for failure containment
[Diagram: applications on native GPFS clients (Linux, AIX) and a POSIX interface apply placement policies; a GPFS manager node runs the cluster, lock, quota, allocation and policy managers over the GPFS RPC protocol; a storage network connects the system pool and data pools (gold, silver, pewter) that make up one GPFS file system (volume group)]
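For illustration, placement and migration rules of the kind described above might be written roughly as follows in GPFS's SQL-like policy language; the rule names, pool names, file pattern and threshold here are invented for this sketch, not taken from the briefing.

```
RULE 'media'   SET POOL 'gold'   WHERE UPPER(NAME) LIKE '%.MPG'
RULE 'default' SET POOL 'silver'
RULE 'age-out' MIGRATE FROM POOL 'gold' TO POOL 'pewter'
               WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) > 90
```

Placement rules are applied when a file is created; migration rules are evaluated when the policy engine is run against the file system.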
The Largest GPFS systems
System     Year  TF   GB/s  Nodes  Disk size  Storage   Disks
Blue P     1998  3    3     1464   9GB        43 TB     5040
White      2000  12   9     512    19GB       147 TB    8064
Purple/C   2005  100  122   1536   250GB      2000 TB   11000
ASCI Purple Supercomputer
1536-node, 100 TF pSeries cluster at Lawrence Livermore Laboratory
2 PB GPFS file system
122 GB/s to a single file from all nodes in parallel
Multi-cluster GPFS and Grid Storage
Multi-cluster supported in GPFS 2.3
– Remote mounts secured with OpenSSL
– User ID’s mapped across clusters
• Server and remote client clusters can have different userid spaces
• File userids on disk may not match credentials on remote cluster
• Pluggable infrastructure allows userids to be mapped across clusters
Multi-cluster works within a site or across a WAN
– Lawrence Berkeley Labs (NERSC)
• multiple supercomputer clusters share large GPFS file systems
– DEISA (European computing grid)
• RZG, CINECA, IDRIS, CSC, CSCS, UPC, IBM
• pSeries and other clusters interconnected with multi-gigabit WAN
• Multi-cluster GPFS in “friendly-user” production 4/2005
– Teragrid
• SDSC, NCSA, ANL, PSC, CalTech, IU, UT, …
• Sites linked via 30 Gb/sec dedicated WAN
• 500TB GPFS file system at SDSC shared across sites, 1500 nodes
• Multi-cluster GPFS in production 10/2005
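The cross-cluster userid mapping described on this chart can be sketched conceptually as follows. This is a toy illustration: the cluster names, uid values and static table are invented, and GPFS actually uses a pluggable mapping infrastructure rather than a fixed table.

```python
# Toy sketch of cross-cluster userid mapping. File userids on disk
# belong to the server cluster's uid space; a remote cluster maps its
# local credentials into that space before access checks are made.
UID_MAP = {
    # (remote_cluster, remote_uid) -> server-cluster uid
    ("clusterB", 1001): 5001,
    ("clusterB", 1002): 5002,
}

def map_uid(remote_cluster, remote_uid):
    """Translate a remote credential into the server cluster's uid space."""
    try:
        return UID_MAP[(remote_cluster, remote_uid)]
    except KeyError:
        raise PermissionError(
            f"no mapping for uid {remote_uid} from {remote_cluster}")

print(map_uid("clusterB", 1001))  # 5001
```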
Grid Computing
Dispersed resources connected via a network
– Compute
– Visualization
– Data
– Instruments (telescopes, microscopes, etc.)
Sample Workflow:
– Ingest data at Site A
– Compute at Site B
– Visualize output data at Site C
Logistical nightmare!
– Conventional approach: copy data via ftp. Space, time, bookkeeping.
– Possible approach: on-demand parallel file access over grid
• … but the scale of the problems demands high performance!
SDSC-IBM StorCloud Challenge Demo – Grid Storage
[Diagram: SDSC DataStar compute nodes and NCSA compute nodes connected over the 30 Gb/s Teragrid WAN, with Enzo visualization across the grid sites]
Data Star (183-node IBM pSeries cluster) at SDSC
Mercury IBM IA64 cluster at NCSA
GPFS 2.3 (Multi-cluster) 120-terabyte file system on
IBM DS4300 (FAStT 600 Turbo) storage at SC04
Teragrid 30 Gb/s Teragrid backbone WAN
Workflow
Enzo simulates the evolution of the universe from
big bang to present. Enzo runs best on the pSeries
Data Star nodes at SDSC. Enzo writes its output as
it is produced over the Teragrid to the GPFS
Storcloud file system at SC04.
VISTA reads the Enzo data from the GPFS
StorCloud file system and renders it into images that
can be compressed into a QuickTime video. Vista
takes advantage of the cheaper IA64 nodes at
NCSA. VISTA reads the Enzo results from the
StorCloud file system and writes its output images
there as well.
The resulting QuickTime movie is read from GPFS
and displayed in the SDSC booth at the conference
[Diagram: GPFS clients in the IBM booth and a visualization station in the SDSC booth access the /Gpfs/storcloud 120 TB GPFS file system (60 DS4300 RAID controllers in the StorCloud booth) through GPFS NSD servers over the StorCloud SAN]
DEISA – European Science Grid
http://www.deisa.org/applications/deci.php
DEISA File System - Logical View
Global name space
Transparent access across sites
"Symmetric" access: equal performance everywhere via 1 Gb/s GEANT network
DEISA File System - Physical View
FZJ-Jülich (Germany): P690 (32-processor nodes) architecture, incorporating 1312 processors. Peak performance is 8.9 Teraflops.
IDRIS-CNRS (France): Mixed P690 and P655+ (4-processor nodes) architecture, incorporating 1024 processors. Peak performance is 6.7 Teraflops.
RZG–Garching (Germany): P690 architecture incorporating 896 processors. Peak performance is 4.6 Teraflops.
CINECA (Italy): P690 architecture incorporating 512 processors. Peak performance is 2.6 Teraflops.
DEISA – Management
Each "core" partner initially contributes 10-15% of its computing capacity to a common resource pool. This pool benefits from the DEISA global file system.
The sharing model is based on simple exchanges: on average, each partner recovers as much as it contributes. This leaves the different business models of the partners' organizations unchanged.
The pool is dynamic: in 2005, computing nodes will be able to join or leave the pool in real time, without disrupting the national services. The pool can therefore be reconfigured to match users' requirements and application profiles.
Each DEISA site is a fully independent administration domain, with its own AAA policies. The "dedicated" network connects computing nodes, not sites.
A network of trust will be established to operate the pool.
Issues
Data Integrity
Systems must ensure data is valid at creation and integrity is
maintained
Data locality
Difficult to predict where data will be needed/used
Data Movement
GRID like infrastructure to move data as needed
Move close to biggest users?
Analyze access patterns
Data manipulation
Data annotation
Data availability
Solution Elements
Storage/Storage controller
Redundancy
Caching
Management
A file system
Scalable
Provides appropriate performance
Issues
Data manipulation
What can you do with a 100 GB file?
Common tools to analytically view/understand data
Extract information/correlations
Data annotation
What is the data? What is its quality? What is its source?
Mars rover – feet vs meters
Climate study – temperature bias
Data availability
Anywhere/anytime
Solutions
Global file systems
First step to name mapping and distributed access
Embedded analysis via controller
Storage controllers are basically specialized computers
Analysis locality
Data annotation
XML self describing, meta-data, RDBMS
Storage GRIDs
Access to collaborators
Global data dictionary
The 2010 HPC File system
Wide striping for data rates scaling to 1TB/sec is basic.
Metadata requires heavy write cache or solid state disk
Access beyond the compute cluster using the speed delivered by
network vendors. GPFS multi-cluster begins this process for
IBM. pNFS is also working in this direction.
The file system will automatically adapt to degradation in the
network and the storage.
The file system will provide improved facilities for HSM, backup
and other utility vendors to selectively move files.
Better search algorithms will be implemented to find the data that
you need. This will be a joint effort between the file system and
external search capabilities or databases
pNFS – Parallel NFS Version 4
Extension to NFS4 to support parallel access
Allows transparent load balancing across multiple servers
Metadata server handles namespace operations
Data servers handle read/write operations
Layer pNFS metadata server and data servers on top of GPFS cluster
Working with University of Michigan and others on Linux pNFS on top of GPFS
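The division of labor above can be sketched conceptually as follows. This is a toy simulation, not real pNFS code: the class names, the layout function, and the 4-byte stripe size are invented for illustration. What it shows is the pattern: the metadata server hands out a layout, and the client then moves bulk data directly to and from the data servers in parallel.

```python
# Toy sketch of the pNFS split: metadata server owns the namespace
# and hands out layouts; data servers handle the actual read/write.
STRIPE = 4  # bytes per stripe in this toy example

class DataServer:
    """Holds stripes keyed by (filename, stripe index)."""
    def __init__(self):
        self.blocks = {}
    def write(self, key, data):
        self.blocks[key] = data
    def read(self, key):
        return self.blocks[key]

class MetadataServer:
    """Namespace owner: maps each stripe of a file to a data server."""
    def __init__(self, data_servers):
        self.data_servers = data_servers
    def get_layout(self, filename):
        # Round-robin striping across the data servers.
        return lambda stripe: self.data_servers[stripe % len(self.data_servers)]

def write_file(mds, name, data):
    layout = mds.get_layout(name)          # one metadata round trip
    for i in range(0, len(data), STRIPE):  # then direct I/O to data servers
        layout(i // STRIPE).write((name, i // STRIPE), data[i:i + STRIPE])

def read_file(mds, name, size):
    layout = mds.get_layout(name)
    n_stripes = (size + STRIPE - 1) // STRIPE
    return b"".join(layout(s).read((name, s)) for s in range(n_stripes))

servers = [DataServer() for _ in range(3)]
mds = MetadataServer(servers)
write_file(mds, "f", b"parallel access!")
print(read_file(mds, "f", 16))  # b'parallel access!'
```

Because the layout is just a mapping, adding data servers spreads the stripes wider and raises aggregate bandwidth without the metadata server touching the data path.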
[Diagram: Linux NFSv4 clients speaking NFSv4 + pNFS obtain a layout from the NFSv4 metadata server, then use a storage-protocol device driver to read/write directly against storage devices or NFSv4 data servers; a management protocol connects the metadata server and data servers]
In 2010
Disk drives in excess of 1TB. Drive transfer rates increase
somewhat because of greater density. Disk connection networks of
10Gb/sec or more.
General networks of 30Gb/sec or more (12x IB) and expand over
greater distances
The network is becoming less of a limitation. Storage will centralize
for management reasons.
Processor trends continue
==========================================
Enables larger and faster file systems.
Requires better sharing of data. Standard NFS over TCP will only be
part of the answer. Data center sharing of data through high speed
networks becomes common
Requires better management of the components and better
robustness.
New applications involve collection and search of large amounts of
data
Data Analytics
Analytics is the intersection of:
Visualization, analysis, scientific data management, human-computer interfaces, cognitive science, statistical analysis, reasoning, …
All sciences need to find, access, store, and understand
information
In some sciences, the data management (and analysis) challenge
already exceeds the compute-power challenge
The ability to tame a tidal wave of information will distinguish the
most successful scientific, commercial, and national security
endeavors
It is the limiting or the enabling factor for a wide range of sciences
Analyzing data to find meaningful information requires substantially
more computing power and more intelligent data handling
Bioinformatics, Financial, Climate, Materials
Distributed Data generators
Realtime data creation
Cameras, telescopes, satellites, sensors, weather stations,
simulations
Old way – capture all data for later analysis
New way – analysis at data origin/creation
Embed intelligent systems into sensors
Analyze data early in life cycle
Summary
Data is going to keep increasing
Smarter methods to extract information from data
Full blown collaborative infrastructures needed