Pursuit of a Scalable High Performance Multi-Petabyte Database
16th IEEE Symposium on Mass Storage Systems
Andrew Hanushevsky
SLAC Computing Services
Marcin Nowak
CERN
Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy
High Energy Experiments
BaBar at SLAC
High precision investigation of B-meson decays
Explore the asymmetry between matter and antimatter
Where did all the antimatter go?
ATLAS at CERN
Probe the Higgs boson energy range
Explore the more exotic reaches of physics
High Energy Physics Quantitative Challenge
                      BaBar/SLAC           ATLAS/CERN
Starts                May 1999             May 2005
Data volume           0.2 petabytes/yr     5.0 petabytes/yr
Total amount          2.0 petabytes        100 petabytes
Aggregate xfr rate    200 MB/sec disk      100 GB/sec disk
                      60 MB/sec tape       1 GB/sec tape
Processing power      5,000 SPECint95      250,000 SPECint95
SPARC Ultra 10's      526                  27,000
Physicists            800                  3,000
Locations             87                   250
Countries             9                    50
Common Elements
Data will be stored in an Object Oriented database
Objectivity/DB
Most data will be kept offline
HPSS
Has theoretical ability to scale to size of experiments
Heavy duty, industrial strength mass storage system
BaBar will be blazing the path
First large scale experiment to use this combination
The year of the hare will be a very interesting time
Objectivity/DB
Client/Server Application
Primary access is through the Advanced Multithreaded Server (AMS)
Can have any number of AMSes
Similar to other remote filesystem interfaces (e.g., NFS)
Objectivity client can read and write database "pages" via AMS
Pages range from 512 bytes to 64K in powers of 2 (e.g., 1K, 2K, 4K, etc.)
[Diagram: the Objectivity client reaches local database files through the ufs protocol and remote database files through the ams protocol.]
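To make the page rule concrete, here is a minimal client-side sketch in C++; amsReadPage() and its signature are illustrative assumptions, not the actual ams wire protocol:

    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Objectivity pages are powers of 2 between 512 bytes and 64K.
    bool isValidPageSize(std::uint32_t n) {
        return n >= 512 && n <= 65536 && (n & (n - 1)) == 0;
    }

    // Hypothetical client-side wrapper: fetch one database page from an AMS.
    // The real ams-protocol request is elided; only the shape is shown.
    std::vector<char> amsReadPage(const std::string& db,
                                  std::uint64_t pageNo,
                                  std::uint32_t pageSize) {
        if (!isValidPageSize(pageSize))
            throw std::invalid_argument("page size must be a power of 2 in [512, 64K]");
        std::vector<char> page(pageSize);
        // ... issue an ams-protocol read of (db, pageNo) into 'page' ...
        (void)db; (void)pageNo;
        return page;
    }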
High Performance Storage System
[Diagram: the HPSS servers are joined by a control network and a separate data network.]
Components:
Bitfile Server
Name Server
Storage Servers
Physical Volume Library
Physical Volume Repositories
Storage System Manager
Migration/Purge Server
Metadata Manager
Log Daemon
Log Client
Startup Daemon
Encina/SFS
DCE
The Obvious Solution
[Diagram: a compute farm, database servers, and a mass storage system joined by a network switch, with external collaborators reaching in over the network.]
But… the devil is in the details
Capacity and Transfer Rate
[Chart: tape cartridge capacity and disk system capacity (GB, log scale) and disk and tape transfer rates (MB/sec, log scale) plotted by year, 1988 through 2006.]
The Capacity Transfer Rate Gap
Density growing faster than ability to transfer data
We can store the data just fine, but do we have the time to look at it? (At BaBar's planned 200 MB/sec aggregate disk rate, a single pass over the eventual 2 PB store would take roughly four months.)
There are solutions short of poverty
Striped tape?
Intelligent staging
Primary access on RAID devices
Cost/Performance is still a problem
Need to address UFS scaling problem
Replication - a fatter pipe?
Only if you want a lot of headaches
Data synchronization problem
Load balancing issues
Whatever the solution is, you'll need a lot of them
Part of the solution: Together Alone
HPSS
Highly scalable, excellent I/O performance for large files, but…
High latency for small block transfers (i.e., Objectivity/DB)
AMS
Efficient database protocol and highly flexible, but…
Limited security, tied to local filesystem
Need to synergistically mate these systems
Opening up new vistas: The Extensible AMS
[Diagram: the Extensible AMS sits on the oofs interface, which binds to a system-specific interface underneath.]
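The slides name the oofs interface but not its methods. A hypothetical sketch of the layering idea, with every method name assumed:

    #include <cstddef>
    #include <string>

    // Hypothetical shape of the oofs layer: the AMS calls this abstract
    // interface, and a system-specific implementation is plugged in beneath it.
    class Oofs {
    public:
        virtual ~Oofs() = default;
        virtual int  open (const std::string& path, int flags) = 0;
        virtual long read (int fd, void* buf, std::size_t len, long offset) = 0;
        virtual long write(int fd, const void* buf, std::size_t len, long offset) = 0;
        virtual int  close(int fd) = 0;
    };

    // One system-specific binding: plain UNIX filesystem calls.
    class UfsOofs : public Oofs { /* would override with open/pread/pwrite/close */ };

    // Another: an HPSS-backed binding that stages files in, then serves locally.
    class HpssOofs : public Oofs { /* would override with stage-then-serve logic */ };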
As big as it gets: Scaling The File System
Veritas Volume Manager
Catenates disk devices to form very large capacity logical devices
Veritas File System
High performance (60+ MB/sec) journaled file system for fast recovery
Combination used as HPSS staging target
Allows for fast streaming I/O and efficient small block transfers
Not out of the woods yet: Other Issues
Access Patterns
Random vs sequential
Staging latency
Scalability
Security
No prophets here: Supplying Performance Hints
Need additional information for optimum performance
Different from Objectivity clustering hints
Information is Objectivity independent
Database clustering
Processing mode (sequential/random)
Desired service levels
Need a mechanism to tunnel opaque information
Client supplies hints via oofs_set_info() call
Information relayed to AMS in a transparent way
AMS relays information to underlying file system via oofs()
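As illustration only, a minimal sketch of hint tunneling; oofs_set_info() is named above, but its signature and the hint format used here are assumptions:

    #include <iostream>
    #include <string>

    // Hypothetical stand-in for the client-side call named in the slides.
    // The AMS does not interpret the hint bytes; it forwards them opaquely
    // to the oofs layer, which hands them to the underlying filesystem.
    int oofs_set_info(int dbHandle, const std::string& hints) {
        // ... would transmit 'hints' to the AMS over the ams protocol ...
        std::cout << "hints for db " << dbHandle << ": " << hints << "\n";
        return 0;
    }

    int main() {
        // Objectivity-independent hints: clustering, access mode, service level.
        oofs_set_info(42, "cluster=run1999a;mode=sequential;service=batch");
    }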
Where’s the data? Dealing With Latency...
Hierarchical filesystems may have high latency bursts
Mounting a tape file
Need mechanism to notify client of expected delay
Prevents request timeout
Prevents retransmission storms
Also allows server to degrade gracefully
Can delay clients when overloaded
Defer Request Protocol
Certain oofs() requests can tell client of expected delay
For example, open()
Client waits indicated amount of time and tries again
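A sketch of how a client might honor the Defer Request Protocol; the reply structure and amsOpen() are assumed names, and only the wait-and-retry behavior comes from the slides:

    #include <chrono>
    #include <stdexcept>
    #include <string>
    #include <thread>

    // Hypothetical server reply: a valid fd, or fd < 0 plus a suggested delay.
    struct OpenReply { int fd; int retryAfterSec; };

    // Stub standing in for the real ams-protocol open request.
    OpenReply amsOpen(const std::string& path) { (void)path; return {3, 0}; }

    int openWithDefer(const std::string& path, int maxTries = 20) {
        for (int i = 0; i < maxTries; ++i) {
            OpenReply r = amsOpen(path);
            if (r.fd >= 0) return r.fd;                // file is online
            // The server said the file needs staging (e.g., a tape mount).
            // Waiting the indicated time avoids request timeouts and
            // retransmission storms, and lets an overloaded server shed load.
            std::this_thread::sleep_for(std::chrono::seconds(r.retryAfterSec));
        }
        throw std::runtime_error("open deferred too many times: " + path);
    }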
Many out of one: Dynamically Replicated Databases
Dynamically distributed databases
Single machine can’t manage over a terabyte of disk cache
No good way to statically partition the database
Dynamically varying database access paths
As load increases, add more copies
Copies accessed in parallel
As load decreases, remove copies to free up disk space
Objectivity catalog independence
Copies managed outside of Objectivity
Minimizes impact on administration
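One plausible way to drive the add/remove decision, sketched with assumed names and thresholds (the slides specify the policy, not the mechanism):

    #include <vector>

    // Hypothetical idle-server picker; returns a server id.
    int pickIdleServer() { return 0; }

    // Replica controller with hysteresis: add a copy when per-replica load
    // is high, retire one (freeing disk space) when load falls off.
    struct ReplicaSet {
        std::vector<int> servers;                      // AMS hosts holding a copy
        double loadPerReplica(double totalLoad) const {
            return totalLoad / static_cast<double>(servers.size());
        }
    };

    void rebalance(ReplicaSet& rs, double totalLoad) {
        const double highWater = 0.8, lowWater = 0.3;  // assumed thresholds
        if (rs.loadPerReplica(totalLoad) > highWater)
            rs.servers.push_back(pickIdleServer());    // copy managed outside Objectivity
        else if (rs.servers.size() > 1 &&
                 rs.loadPerReplica(totalLoad) < lowWater)
            rs.servers.pop_back();                     // drop a copy, free disk space
    }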
If There Are Many, Which One Do I Go To?
Request Redirect Protocol
oofs() routines supply alternate AMS location
oofs routines responsible for update synchronization
Typically, read/only access provided on copies
Only one read/write copy conveniently supported
Client must declare intention to update prior to access
Lazy synchronization possible
Good mechanism for largely read/only databases
Load balancing provided by an AMS collective
Has one distinguished member recorded in the catalogue
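A client-side sketch of the Request Redirect Protocol under stated assumptions (Reply, sendRequest(), and the loop structure are illustrative):

    #include <string>

    // Hypothetical reply: either handled locally or "go to this AMS instead".
    struct Reply { bool redirected; std::string altHost; };

    // Stub standing in for one request/response exchange with an AMS.
    Reply sendRequest(const std::string& host, const std::string& req) {
        (void)host; (void)req; return {false, ""};
    }

    // Clients start at the distinguished member recorded in the catalogue;
    // its oofs() routines may bounce them to any interchangeable member.
    Reply requestViaCollective(std::string host, const std::string& req) {
        Reply r = sendRequest(host, req);
        while (r.redirected) {
            host = r.altHost;            // alternate location chosen by oofs()
            r = sendRequest(host, req);
        }
        return r;
    }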
The AMS Collective
Collective members are effectively interchangeable
[Diagram: the distinguished member redirects clients to the other, interchangeable members of the collective.]
Keeping the hackers at bay: Object Oriented Security
No performance level is sufficient if you always have to recompute
Need mechanism to provide security to thwart hackers
Protocol Independent Authentication Model
Public or private key
PGP, RSA, Kerberos, etc.
Can be negotiated at run-time
Automatically called by client and server kernels
Client Objectivity Kernel creates security objects as needed
Supplied via replaceable shared libraries
Security objects supply context-sensitive authentication credentials
Works only with Extensible AMS via oofs interface
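A hypothetical sketch of the replaceable security object idea, pairing an abstract interface with a shared-library factory; the class and symbol names are assumptions:

    #include <dlfcn.h>   // POSIX dynamic loading
    #include <memory>
    #include <stdexcept>
    #include <string>

    // Hypothetical protocol-independent authentication object. Concrete
    // versions (PGP, RSA, Kerberos, ...) live in replaceable shared libraries.
    class SecurityObject {
    public:
        virtual ~SecurityObject() = default;
        // Produce context-sensitive credentials for one request.
        virtual std::string credentials(const std::string& context) = 0;
    };

    // Factory symbol each library must export (assumed convention).
    using Factory = SecurityObject* (*)();

    std::unique_ptr<SecurityObject> loadSecurity(const std::string& libPath) {
        void* lib = dlopen(libPath.c_str(), RTLD_NOW);
        if (!lib) throw std::runtime_error("cannot load " + libPath);
        auto make = reinterpret_cast<Factory>(dlsym(lib, "makeSecurityObject"));
        if (!make) throw std::runtime_error("no factory in " + libPath);
        return std::unique_ptr<SecurityObject>(make());
    }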
Overall Effects
Extensible AMS
Allows use of any type of filesystem via oofs layer
Generic Authentication Protocol
Allows proper client identification
Opaque Information Protocol
Allows passing of hints to improve filesystem performance
Defer Request Protocol
Accommodates hierarchical filesystems
Redirection Protocol
Accommodates terabyte+ filesystems
Provides for dynamic load balancing
Dynamic Load Balancing Hierarchical Secure AMS
[Diagram: clients are dynamically steered (dynamic selection) to one of several AMS servers in the collective.]
Summary
AMS is capable of high performance
Ultimate performance limited by disk speeds
The oofs interface + other protocols greatly enhance performance, scalability, usability, and security
5+TB of SLAC data has been processed using AMS+HPSS
Should be able to deliver an average of 20 MB/sec per disk
Some AMS problems
No HPSS problems
SLAC will be using this combination to store physics data
BaBar experiment will produce a 2+ petabyte database over 10 years
2,000,000,000,000,000 = 2×10^15 bytes ≈ 200,000 3590 tape cartridges (10 GB each)
Now for the reality
Full AMS features not yet implemented
  SLAC/Objectivity design has been completed
    oofs OO interface, OO security, protocols (i.e., DRP, RRP, and GAP)
  oofs and ooss layers are completely functional
  HPSS integration is full-featured and complete
  Protocol development has been fully funded at SLAC
    DRP, RRP, and GAP
Initial feature set to be deployed late summer
  DRP, GAP, and limited RRP
  Full asynchronous replication within 2 years
CERN & SLAC approaches similar
  But quite different in detail…
CERN staging approach: RFIO/RFCP + HPSS
[Diagram: the AMS (file & catalog management) issues stage-in requests; RFCP (RFIO copy) moves files from the HPSS server into the disk pool through an RFIO daemon on the disk server; an HPSS Mover and a migration daemon (Solaris) connect the disk pool to the tape robot; the AMS serves DB pages from the disk server via UNIX FS I/O and RFIO calls.]
SLAC staging approach: PFTP + HPSS
[Diagram: the AMS (file & catalog management) sends gateway requests to a gateway daemon, which issues stage-in requests to the HPSS server over PFTP (control); file data flows from the HPSS Mover (Solaris) into the disk pool over PFTP (data); a migration daemon connects HPSS to the tape robot; the AMS serves DB pages from the disk server via UNIX FS I/O.]
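For flavor, a heavily simplified sketch of the gateway daemon's stage-in loop; every name here, including runPftpGet(), is hypothetical:

    #include <queue>
    #include <string>

    // Hypothetical stage-in request from the AMS: bring an HPSS file into
    // the local staging pool so DB pages can be served with plain UNIX I/O.
    struct StageIn { std::string hpssPath, stagePath; };

    // Stand-in for driving a PFTP transfer (control + data connections).
    bool runPftpGet(const std::string& hpssPath, const std::string& stagePath) {
        // ... connect to the HPSS server, request the file, write it locally ...
        (void)hpssPath; (void)stagePath;
        return true;
    }

    void gatewayLoop(std::queue<StageIn>& requests) {
        while (!requests.empty()) {
            StageIn r = requests.front(); requests.pop();
            if (runPftpGet(r.hpssPath, r.stagePath)) {
                // Notify the AMS that the file is online; Defer Request
                // Protocol retries against this file will now succeed.
            }
        }
    }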
SLAC ultimate approach: Direct Tape Access
[Diagram: the AMS (file & catalog management) issues stage-in requests to the HPSS server through its native API (rpc); the HPSS Mover (Solaris) and a migration daemon transfer data directly between the tape robot and the disk pool; the AMS serves DB pages from the disk server via UNIX FS I/O.]
CERN 1TB Test Bed
[Diagram (current approximation): IBM RS6000 machines run the HPSS server and an HPSS data mover attached to the IBM tape silo; SUN Sparc 5s host the RFIO daemon, the AMS/HPSS interface, a second HPSS data mover, and the staging pool; a DEC Alpha client also participates. Links are FDDI, HIPPI, and Fast Ethernet today; the future plan is 1Gb switched Ethernet in a star topology.]
SLAC Configuration
[Diagram (approximate): several Sun 4500s, each running an AMS server and an HPSS mover with roughly 900 GB of disk, connect over Gigabit Ethernet to an IBM RS6000 F50 running the HPSS server.]
SLAC Detailed Configuration
[Diagram: detailed version of the SLAC hardware configuration above.]