LANL Briefing - May 2004


IBM Systems & Technology Group
Data Intensive World
Bob Curran and Kevin Gildea
© 2006 IBM Corporation
Introduction
 Data size is growing both in aggregate and in individual elements
 Rate of data storage/access is growing both in aggregate and in
individual elements.
 Price of storage is making the storage of more data and more
elaborate forms of old data affordable.
 Better networking is making it possible to deliver some of this data
to users away from the installation which created the data.
 New application classes build upon all of this
 All of this leads to questions about data integrity, data validity, data
search, backup, retention …
Data is growing
 Individual files are growing
 In weather, 1MB of text observations has become 1TB of observational data over the last two decades
 A 4KB letter in ASCII becomes a 30KB Word document with no added pictures, signatures or logos.
 Media applications generating more and higher resolution data
 Aggregate Data is growing
 Science data is growing. CERN is starting a new set of particle acceleration experiments expected to collect 10 PB/year of filtered, retained data
 GPFS file system at LLNL reached 2PB this year
 Library of Congress RFP in 1/2006 called for the ingest of 8PB/year
of video data retained indefinitely
Storage is growing
 Anthology Terabyte NAS on sale at Fry’s - $699 or $599 w/ rebate
 SATA drives of 500GB are available today. FC drives at 300GB.
 IBM DS8300 can store 180TB of data in one controller. A DS4800
mid-range controller can store 89TB.
 One IBM TS3500 tape library can store over 3PB of data on 3592
cartridges which can contain 1TB each with typical compression.
 Disk density improvements appear to be slowing down.
PLUS: you can aggregate them to create file systems of hundreds
of TB to a few PB today
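A back-of-the-envelope sketch (Python) of what these capacities imply when aggregated; the 2 PB target echoes the LLNL figure above, while the RAID/spare overhead factor is an assumption for illustration.

# Back-of-the-envelope sketch: how many building blocks does a petabyte take?
TB = 1          # work in terabytes for readability
PB = 1000 * TB

drive_tb      = 0.5 * TB    # 500 GB SATA drive, as quoted above
ds8300_tb     = 180 * TB    # one DS8300 controller, as quoted above
raid_overhead = 0.8         # assume ~20% lost to RAID parity/spares (illustrative)

target = 2 * PB             # e.g. the 2 PB GPFS file system at LLNL

drives_needed      = target / (drive_tb * raid_overhead)
controllers_needed = target / ds8300_tb

print(f"{target/PB:.0f} PB usable needs ~{drives_needed:,.0f} x 500 GB drives "
      f"(assuming {1-raid_overhead:.0%} RAID/spare overhead)")
print(f"or ~{controllers_needed:.0f} DS8300-class controllers at raw capacity")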
Data Access Rates are growing
 Disks attached by 4Gb FC, 320 MB/sec SCSI … replacing the 10
MB/sec SCSI of a few years ago.
 Drive speeds of 10K rpm or 15K rpm are common replacing the
5400 rpm of a few years ago
 Data transfer rates vary heavily based on usage patterns and fragmentation, but 50MB/sec or more is not uncommon in our experience.
 RAID arrays can multiply this transfer by 4-8X for a single object
PLUS: you can stripe files or sets of files across arrays, controllers
and links to produce even larger data rates for a file or collection of
files. GPFS measured 122GB/sec for one file at LLNL in 3/06.
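A rough aggregation sketch showing how the per-drive and RAID figures above compose into the LLNL single-file measurement; the array count is simply solved for, not a description of the actual LLNL configuration.

per_drive_mb_s  = 50          # "50 MB/sec or more is not uncommon"
raid_multiplier = 8           # "4-8X for a single object"; take the high end
per_array_mb_s  = per_drive_mb_s * raid_multiplier   # ~400 MB/s per array

target_gb_s   = 122           # GPFS single-file measurement at LLNL, 3/06
arrays_needed = target_gb_s * 1000 / per_array_mb_s

print(f"one array streams ~{per_array_mb_s} MB/s")
print(f"~{arrays_needed:.0f} arrays striped in parallel to sustain {target_gb_s} GB/s")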
Network data rates are going up
 My office has gone from 15Mb (token ring) to 1Gb over the past 5
years. My home has gone from 9kbps modems to 300Mbps cable
modems
 Highest performing clusters have multiple 2GB/sec proprietary
links and 4X IB or 10Gb ethernet
 Multiple Gbps links are common between sites and some GRID
environments aggregate enough to provide 30-40Gb/sec.
I’d like any data from anywhere and I’d like it now, please.
U Penn Digital Mammography Project
Potential for 28 Petabytes/year over 2000 Hospitals, in full production
7 Regional Archives @ 4,000 TB/yr
20 Area Archives @ 100 TB/yr
15 Hospitals @ 7 TB/yr
Proposed Hierarchical Layout
[Diagram: proposed hierarchical layout – archive nodes (A) grouped under regional archives]
Goal: Distribute Storage Load and Balance Network and Query Loads
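For scale, a quick Python conversion of the quoted per-year ingest figures into sustained bandwidth; steady, evenly spread ingest is assumed (real traffic would be burstier).

SECONDS_PER_YEAR = 365 * 24 * 3600

def sustained(tb_per_year):
    mb_s = tb_per_year * 1e6 / SECONDS_PER_YEAR      # TB/yr -> MB/s
    gb_s = mb_s * 8 / 1000                           # MB/s -> Gb/s
    return mb_s, gb_s

for label, tb_yr in [("whole system (28 PB/yr)", 28_000),
                     ("one regional archive (4,000 TB/yr)", 4_000),
                     ("one hospital (7 TB/yr)", 7)]:
    mb_s, gb_s = sustained(tb_yr)
    print(f"{label:36s} ~{mb_s:8.1f} MB/s  ~{gb_s:6.2f} Gb/s")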
U Penn Digital Mammography Project
Current NDMA Configuration
Testbed to demonstrate feasibility
Storage and retrieval
Infrastructure for access
Instant consultation with experts
Innovative teaching tools
Ensure privacy and confidentiality
Current NDMA Portal Locations (regional archive map): CAnet*3, Chicago GigaPop, NSCP, Abilene, N.C. GigaPop, Atlanta GigaPop, ESNET
http://nscp.upenn.edu/NDMA
HPCS (A petaflop project)
 As stated:
   1.5-5 TB/sec aggregate for big science applications; less for other applications
   32K file creates/sec
   Single stream at 30 GB/sec
   1 trillion files in a file system
 Inferred:
   Probably requires 50-100 PB for a balanced system at the high end of this project; multi-PB even at the lower end (see the sizing sketch below)
   It would be good to manage these components effectively; there are a lot of components here that need to work in concert
   Access beyond the main cluster is desired, beyond basic NFS
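One way to see where the inferred 50-100 PB could come from: size the file system to absorb full-rate output for some window of time. The window lengths below are assumptions for illustration, not HPCS requirements.

def capacity_pb(rate_tb_s, hours):
    return rate_tb_s * hours * 3600 / 1000   # TB written -> PB

for rate in (1.5, 5.0):                      # the stated aggregate range, TB/s
    for hours in (6, 12):                    # assumed sustained-write window
        print(f"{rate:3.1f} TB/s for {hours:2d} h -> {capacity_pb(rate, hours):6.1f} PB")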
Data increasingly created by events/sensors
 Images collected from NEAT (Near-Earth Asteroid Tracking) telescopes
 – First year: processed 250,000 images, archived 6 TB of compressed data
 – Randomly accessed by a community of users
 Life sciences
– Mass spectrometer creates 200 GB/day, times 100 Mass Spectrometers ->
20 TB/day
– Mammograms, X-rays and other medical images
 Ability to create/store data is exponentially increasing
 Problem is to extract meaningful info from all the data
 Often the meaningful information can only be found in hindsight
 Sometimes some analysis can be done at data collection time, which requires the ability to access the data as it is collected. High data rates are required (see the sketch below).
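The mass-spectrometer example above converted to a sustained rate, as a small arithmetic sketch of what "analyze as it is collected" has to keep up with.

gb_per_day_per_instrument = 200
instruments = 100

tb_per_day = gb_per_day_per_instrument * instruments / 1000
mb_per_sec = tb_per_day * 1e6 / (24 * 3600)

print(f"{tb_per_day:.0f} TB/day aggregate  ->  ~{mb_per_sec:.0f} MB/s sustained")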
Collaborative computing – no boundaries
 Emergence of large, multidisciplinary computational science
teams
 Across geographies
 Across disciplines/sciences
 Biological systems
 Computer science, biochemistry, statistics, computational chemistry, fluids, …
 Automotive design
 Change sheet metal design – interactive integration
 Collaboration within enterprise/university
 Collaboration anywhere/anytime
 Ability to share/characterize data is critical
Data accessible forever
 Size – too large to backup/restore unconditionally
 Growth rate – constantly changing/increasing
 Value – always needs to be available for analysis
 Search – need to be able to find the data
 Performance – appropriate for the data
 Collaboration – no way to anticipate users
 New storage paradigm
GPFS: parallel file access
Parallel Cluster File System Based on Shared Disk (SAN) Model
 Cluster – fabric-interconnected nodes (IP, SAN, …)
 Shared disk – all data and metadata on fabric-attached disk
 Parallel – data and metadata flow from all of the nodes to all of the disks in parallel under control of a distributed lock manager (striping sketched below)
[Diagram: GPFS file system nodes connected by a switching fabric (system or storage area network) to shared disks (SAN-attached or network block device)]
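A minimal sketch of the striping idea, assuming a fixed block size and round-robin placement; it illustrates why many nodes can drive many disks at once and is not GPFS's actual allocation code.

# Conceptual only: a file striped in fixed-size blocks round-robin across the
# shared disks, so different nodes can work on different disks at the same time.
BLOCK_SIZE = 4 * 1024 * 1024        # illustrative 4 MB block size
DISKS = ["disk0", "disk1", "disk2", "disk3"]

def block_location(file_offset):
    """Map a byte offset to (disk, block number on that disk)."""
    block = file_offset // BLOCK_SIZE
    return DISKS[block % len(DISKS)], block // len(DISKS)

# Nodes touching different parts of the same file land on different disks:
for offset in (0, 4 * 1024 * 1024, 8 * 1024 * 1024, 12 * 1024 * 1024):
    disk, blk = block_location(offset)
    print(f"offset {offset:>10d} -> {disk}, block {blk}")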
Adding Information Lifecycle Management to GPFS
 GPFS adds support for ILM abstractions: filesets, storage pools, policy
 – Fileset: subtree of a file system
 – Storage pool: group of LUNs
 – Policy: rules for placing files into storage pools
 Examples (a conceptual policy sketch follows below)
 – Place new files on fast, reliable storage, move files as they age to slower storage, then to tape
 – Place media files on video-friendly storage (fast, smooth), other files on cheaper storage
 – Place related files together, e.g. for failure containment
[Diagram: Fusion native GPFS clients (Linux, AIX) running applications over POSIX with per-node placement policy; a GPFS manager node providing cluster, lock, quota, allocation and policy managers over the GPFS RPC protocol; and a storage network with a system pool plus gold, silver and pewter data pools forming the GPFS file system (volume group)]
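A conceptual sketch of placement and migration policy. GPFS expresses such rules in its own policy language; the Python below, with made-up extension and age thresholds, only illustrates the idea.

def placement_pool(filename):
    """Pick an initial pool for a new file (hypothetical rules)."""
    if filename.endswith((".mpg", ".avi")):   # media -> video-friendly storage
        return "gold"
    return "silver"

def migration_pool(current_pool, age_days):
    """Age files toward cheaper storage (hypothetical thresholds)."""
    if age_days > 365:
        return "pewter"                       # or out to tape via HSM
    if age_days > 30 and current_pool == "gold":
        return "silver"
    return current_pool

print(placement_pool("run42.mpg"))            # -> gold
print(migration_pool("gold", age_days=90))    # -> silver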
The Largest GPFS systems
System     Year   TF   GB/s   Nodes   Disk size   Storage    Disks
Blue P     1998    3      3    1464   9 GB        43 TB       5040
White      2000   12      9     512   19 GB       147 TB      8064
Purple/C   2005  100    122    1536   250 GB      2000 TB    11000
ASCI Purple Supercomputer
 1536-node, 100 TF pSeries cluster at Lawrence Livermore National Laboratory
 2 PB GPFS file system
 122 GB/s to a single file from all nodes in parallel
Multi-cluster GPFS and Grid Storage
 Multi-cluster supported in GPFS 2.3
 – Remote mounts secured with OpenSSL
 – User IDs mapped across clusters
   • Server and remote client clusters can have different userid spaces
   • File userids on disk may not match credentials on the remote cluster
   • Pluggable infrastructure allows userids to be mapped across clusters (a mapping sketch follows below)
 Multi-cluster works within a site or across a WAN
 – Lawrence Berkeley Labs (NERSC)
   • Multiple supercomputer clusters share large GPFS file systems
 – DEISA (European computing grid)
   • RZG, CINECA, IDRIS, CSC, CSCS, UPC, IBM
   • pSeries and other clusters interconnected with multi-gigabit WAN
   • Multi-cluster GPFS in “friendly-user” production 4/2005
 – Teragrid
   • SDSC, NCSA, ANL, PSC, CalTech, IU, UT, …
   • Sites linked via 30 Gb/sec dedicated WAN
   • 500TB GPFS file system at SDSC shared across sites, 1500 nodes
   • Multi-cluster GPFS in production 10/2005
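A sketch of the user-ID mapping idea, assuming a simple lookup table keyed by owning cluster and on-disk UID. The cluster name and numeric IDs are invented for illustration; GPFS's actual mechanism is the pluggable mapping infrastructure mentioned above.

UID_MAP = {
    # (owning_cluster, on_disk_uid) -> uid on the remote cluster
    ("sdsc", 1042): 53017,
    ("sdsc", 1099): 53088,
}

def map_uid(owning_cluster, on_disk_uid, default_uid=99):   # 99 ~ "nobody"
    return UID_MAP.get((owning_cluster, on_disk_uid), default_uid)

print(map_uid("sdsc", 1042))   # mapped credential on the remote cluster
print(map_uid("sdsc", 2000))   # unmapped users fall back to a nobody ID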
Grid Computing
 Dispersed resources connected via a network
 – Compute
 – Visualization
 – Data
 – Instruments (telescopes, microscopes, etc.)
 Sample Workflow:
 – Ingest data at Site A
 – Compute at Site B
 – Visualize output data at Site C
 Logistic nightmare!
 – Conventional approach: copy data via ftp. Space, time, bookkeeping.
 – Possible approach: on-demand parallel file access over grid
   • … but the scale of problems demands high performance!
SDSC-IBM StorCloud Challenge Demo – Grid Storage
[Diagram: SDSC DataStar compute nodes and NCSA compute nodes linked by the 30 Gb/s Teragrid WAN, with Enzo output feeding visualization]
Grid Sites
 Data Star (183-node IBM pSeries cluster) at SDSC
 Mercury IBM IA64 cluster at NCSA
 GPFS 2.3 (Multi-cluster) 120-terabyte file system on
IBM DS4300 (FAStT 600 Turbo) storage at SC04
 30 Gb/s Teragrid backbone WAN
Workflow
 Enzo simulates the evolution of the universe from
big bang to present. Enzo runs best on the pSeries
Data Star nodes at SDSC. Enzo writes its output as
it is produced over the Teragrid to the GPFS
StorCloud file system at SC04.
 VISTA reads the Enzo data from the GPFS
StorCloud file system and renders it into images that
can be compressed into a QuickTime video. Vista
takes advantage of the cheaper IA64 nodes at
NCSA. VISTA reads the Enzo results from the
StorCloud file system and writes its output images
there as well.
 The resulting QuickTime movie is read from GPFS and displayed in the SDSC booth at the conference.
[Diagram: GPFS clients in the IBM booth and visualization in the SDSC booth access the 120 TB /Gpfs/storcloud file system, served by GPFS NSD servers over the StorCloud SAN (60 DS4300 RAID controllers) in the StorCloud booth]
DEISA – European Science Grid
http://www.deisa.org/applications/deci.php
DEISA File System – Logical View
 Global name space
 Transparent access across sites
 “Symmetric” access – equal performance everywhere via the 1 Gb/s GEANT network
DEISA File System – Physical View
 FZJ-Jülich (Germany): P690 (32-processor nodes) architecture, incorporating 1312 processors. Peak performance is 8.9 Teraflops.
 IDRIS-CNRS (France): Mixed P690 and P655+ (4-processor nodes) architecture, incorporating 1024 processors. Peak performance is 6.7 Teraflops.
 RZG–Garching (Germany): P690 architecture incorporating 896 processors. Peak performance is 4.6 Teraflops.
 CINECA (Italy): P690 architecture incorporating 512 processors. Peak performance is 2.6 Teraflops.
DEISA – Management
Each “core” partner initially contributes 10-15% of its computing capacity to a common resource pool. This pool benefits from the DEISA global file system.
The sharing model is based on simple exchanges: on average, each partner recovers as much as it contributes. This leaves the different business models of the partner organizations unchanged.
The pool is dynamic: in 2005, computing nodes will be able to join or leave the pool in real time, without disrupting the national services. The pool can therefore be reconfigured to match user requirements and application profiles.
Each DEISA site is a fully independent administration domain, with its own AAA policies. The “dedicated” network connects computing nodes – not sites.
A network of trust will be established to operate the pool.
Issues
 Data Integrity
 Systems must ensure data is valid at creation and integrity is
maintained
 Data locality
 Difficult to predict where data will be needed/used
 Data Movement
 GRID-like infrastructure to move data as needed
 Move close to biggest users?
 Analyze access patterns
 Data manipulation
 Data annotation
 Data availability
Solution Elements
 Storage/Storage controller
 Redundancy
 Caching
 Management
 A file system
 Scalable
 Provides appropriate performance
Issues
 Data manipulation
 What can you do with a 100 GB file?
 Common tools to analytically view/understand data
 Extract information/correlations
 Data annotation
 What is the data? What is its quality? What is its source?
 Mars rover – feet vs meters
 Climate study – temperature bias
 Data availability
 Anywhere/anytime
Solutions
 Global file systems
 First step to name mapping and distributed access
 Embedded analysis via controller
 Storage controllers are basically specialized computers
 Analysis locality
 Data annotation
 XML self describing, meta-data, RDBMS
 Storage GRIDs
 Access to collaborators
 Global data dictionary
The 2010 HPC File system
 Wide striping for data rates scaling to 1TB/sec is basic (see the striping sketch below).
 Metadata requires heavy write cache or solid state disk
 Access beyond the compute cluster using the speed delivered by
network vendors. GPFS multi-cluster begins this process for
IBM. pNFS is also working in this direction.
 The file system will automatically adapt to degradation in the
network and the storage.
 The file system will provide improved facilities for HSM, backup
and other utility vendors to selectively move files.
 Better search algorithms will be implemented to find the data that you need. This will be a joint effort between the file system and external search capabilities or databases.
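A rough count of how wide the striping has to be at 1 TB/sec; the per-building-block rate below is an assumption, not a product specification.

target_gb_s   = 1000               # 1 TB/s aggregate
per_unit_gb_s = 2.0                # assume ~2 GB/s per controller/array building block

units = target_gb_s / per_unit_gb_s
print(f"~{units:.0f} storage building blocks striped together for {target_gb_s} GB/s")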
pNFS – Parallel NFS Version 4
 Extension to NFS4 to support parallel access
 Allows transparent load balancing across multiple servers
 Metadata server handles namespace operations
 Data servers handle read/write operations
 Layer pNFS metadata server and data servers on top of GPFS cluster
 Working with University of Michigan and others on Linux pNFS on top of GPFS (client flow sketched below)
[Diagram: Linux NFSv4 + pNFS clients obtain a layout from the NFSv4 metadata server, then read/write either directly to the storage devices via a storage-protocol device driver or through NFSv4 data servers; a management protocol links the metadata server, data servers and storage]
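A schematic of the pNFS client flow implied by the bullets and diagram above: get a layout from the metadata server, then fetch stripes from the data servers in parallel. Server names and the 3-way striping are invented for illustration; this is not an NFS implementation.

from concurrent.futures import ThreadPoolExecutor

DATA_SERVERS = ["ds1", "ds2", "ds3"]

def get_layout(path):
    """Metadata server returns which data server holds each stripe."""
    return [(i, DATA_SERVERS[i % len(DATA_SERVERS)]) for i in range(6)]

def read_stripe(stripe_index, server):
    # stand-in for an NFSv4 READ issued to that data server
    return f"<stripe {stripe_index} from {server}>"

def pnfs_read(path):
    layout = get_layout(path)                       # namespace op -> metadata server
    with ThreadPoolExecutor(len(DATA_SERVERS)) as pool:
        return list(pool.map(lambda s: read_stripe(*s), layout))   # data ops in parallel

print(pnfs_read("/gpfs/bigfile"))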
In 2010
 Disk drives in excess of 1TB. Drive transfer rates increase
somewhat because of greater density. Disk connection networks of
10Gb/sec or more.
 General networks of 30Gb/sec or more (12x IB), extending over greater distances
 The network is becoming less of a limitation. Storage will centralize
for management reasons.
 Processor trends continue
==========================================
 Enables larger and faster file systems.
 Requires better sharing of data. Standard NFS over TCP will only be
part of the answer. Data center sharing of data through high speed
networks becomes common
 Requires better management of the components and better
robustness.
 New applications involve collection and search of large amounts of
data
Data Analytics
 Analytics is the intersection of:
 Visualization, analysis, scientific data management, human-computer interfaces, cognitive science, statistical analysis, reasoning, …
 All sciences need to find, access, store, and understand information
 In some sciences, the data management (and analysis) challenge
already exceeds the compute-power challenge
 The ability to tame a tidal wave of information will distinguish the
most successful scientific, commercial, and national security
endeavors
 It is the limiting or the enabling factor for a wide range of sciences
 Analyzing data to find meaningful information requires substantially
more computing power and more intelligent data handling
 Bioinformatics, Financial, Climate, Materials
Distributed Data generators
 Realtime data creation
 Cameras, telescopes, satellites, sensors, weather stations,
simulations
 Old way – capture all data for later analysis
 New way – analysis at data origin/creation
 Embed intelligent systems into sensors
 Analyze data early in life cycle
Summary
 Data is going to keep increasing
 Smarter methods are needed to extract information from data
 Full-blown collaborative infrastructures are needed