Pursuit of a Scalable High Performance Multi-Petabyte Database
16th IEEE Symposium on Mass Storage Systems
Andrew Hanushevsky
SLAC Computing Services
Marcin Nowak
CERN
Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy
High Energy Experiments
BaBar at SLAC
High precision investigation of B-meson decays
Explore the asymmetry between matter and antimatter
Where did all the antimatter go?
ATLAS at CERN
Probe the Higgs boson energy range
Explore the more exotic reaches of physics
High Energy Physics Quantitative Challenge
                      BaBar/SLAC           ATLAS/CERN
Starts                May 1999             May 2005
Data volume           0.2 petabytes/yr     5.0 petabytes/yr
Total amount          2.0 petabytes        100 petabytes
Aggregate xfr rate    200 MB/sec disk      100 GB/sec disk
                      60 MB/sec tape       1 GB/sec tape
Processing power      5,000 SPECint95      250,000 SPECint95
SPARC Ultra 10's      526                  27,000
Physicists            800                  3,000
Locations             87                   250
Countries             9                    50
Common Elements
Data will be stored in an Object Oriented database
Objectivity/DB
Most data will be kept offline
HPSS
Has theoretical ability to scale to size of experiments
Heavy duty, industrial strength mass storage system
BaBar will be blazing the path
First large scale experiment to use this combination
The year of the hare will be a very interesting time
Objectivity/DB
Client/Server Application
Primary access is through the Advanced Multithreaded Server (AMS)
Can have any number of AMSes
Similar to other remote filesystem interfaces (e.g., NFS)
Objectivity client can read and write database "pages" via AMS
Pages range from 512 bytes to 64K in powers of 2 (e.g., 1K, 2K, 4K, etc.)
[Diagram: the Objectivity client reaches local database files through the ufs protocol and remote database files through the ams protocol.]
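To make the page rule concrete, here is a minimal client-side sketch in C++; amsReadPage() and its signature are illustrative assumptions, not the actual ams wire protocol:

    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Objectivity pages are powers of 2 between 512 bytes and 64K.
    bool isValidPageSize(std::uint32_t n) {
        return n >= 512 && n <= 65536 && (n & (n - 1)) == 0;
    }

    // Hypothetical client-side wrapper: fetch one database page from an AMS.
    // The real ams-protocol request is elided; only the shape is shown.
    std::vector<char> amsReadPage(const std::string& db,
                                  std::uint64_t pageNo,
                                  std::uint32_t pageSize) {
        if (!isValidPageSize(pageSize))
            throw std::invalid_argument("page size must be a power of 2 in [512, 64K]");
        std::vector<char> page(pageSize);
        // ... issue an ams-protocol read of (db, pageNo) into 'page' ...
        (void)db; (void)pageNo;
        return page;
    }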
High Performance Storage System
[Diagram: the HPSS servers are joined by a control network and a separate data network.]
Components:
Bitfile Server
Name Server
Storage Servers
Physical Volume Library
Physical Volume Repositories
Storage System Manager
Migration/Purge Server
Metadata Manager
Log Daemon
Log Client
Startup Daemon
Encina/SFS
DCE
The Obvious Solution
[Diagram: a compute farm, database servers, and a mass storage system joined by a network switch, with external collaborators reaching in over the network.]
But… the devil is in the details
Capacity and Transfer Rate
[Chart: tape cartridge capacity and disk system capacity (GB, log scale) and disk and tape transfer rates (MB/sec, log scale) plotted by year, 1988 through 2006.]
The Capacity Transfer Rate Gap
Density growing faster than ability to transfer data
We can store the data just fine, but do we have the time to look at it? (At BaBar's planned 200 MB/sec aggregate disk rate, a single pass over the eventual 2 PB store would take roughly four months.)
There are solutions short of poverty
Striped tape?
Intelligent staging
Primary access on RAID devices
Cost/Performance is still a problem
Need to address UFS scaling problem
Replication - a fatter pipe?
Only if you want a lot of headaches
Data synchronization problem
Load balancing issues
Whatever the solution is, you'll need a lot of them
Part of the solution: Together Alone
HPSS
Highly scalable, excellent I/O performance for large files, but…
High latency for small block transfers (i.e., Objectivity/DB)
AMS
Efficient database protocol and highly flexible, but…
Limited security, tied to local filesystem
Need to synergistically mate these systems
Opening up new vistas: The Extensible AMS
[Diagram: the Extensible AMS sits on the oofs interface, which binds to a system-specific interface underneath.]
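The slides name the oofs interface but not its methods. A hypothetical sketch of the layering idea, with every method name assumed:

    #include <cstddef>
    #include <string>

    // Hypothetical shape of the oofs layer: the AMS calls this abstract
    // interface, and a system-specific implementation is plugged in beneath it.
    class Oofs {
    public:
        virtual ~Oofs() = default;
        virtual int  open (const std::string& path, int flags) = 0;
        virtual long read (int fd, void* buf, std::size_t len, long offset) = 0;
        virtual long write(int fd, const void* buf, std::size_t len, long offset) = 0;
        virtual int  close(int fd) = 0;
    };

    // One system-specific binding: plain UNIX filesystem calls.
    class UfsOofs : public Oofs { /* would override with open/pread/pwrite/close */ };

    // Another: an HPSS-backed binding that stages files in, then serves locally.
    class HpssOofs : public Oofs { /* would override with stage-then-serve logic */ };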
As big as it gets: Scaling The File System
Veritas Volume Manager
Catenates disk devices to form very large capacity logical devices
Veritas File System
High performance (60+ MB/sec) journaled file system for fast recovery
Combination used as HPSS staging target
Allows for fast streaming I/O and efficient small block transfers
Not out of the woods yet: Other Issues
Access Patterns
Random vs sequential
Staging latency
Scalability
Security
No prophets here: Supplying Performance Hints
Need additional information for optimum performance
Different from Objectivity clustering hints
Information is Objectivity independent
Database clustering
Processing mode (sequential/random)
Desired service levels
Need a mechanism to tunnel opaque information
Client supplies hints via oofs_set_info() call
Information relayed to AMS in a transparent way
AMS relays information to underlying file system via oofs()
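As illustration only, a minimal sketch of hint tunneling; oofs_set_info() is named above, but its signature and the hint format used here are assumptions:

    #include <iostream>
    #include <string>

    // Hypothetical stand-in for the client-side call named in the slides.
    // The AMS does not interpret the hint bytes; it forwards them opaquely
    // to the oofs layer, which hands them to the underlying filesystem.
    int oofs_set_info(int dbHandle, const std::string& hints) {
        // ... would transmit 'hints' to the AMS over the ams protocol ...
        std::cout << "hints for db " << dbHandle << ": " << hints << "\n";
        return 0;
    }

    int main() {
        // Objectivity-independent hints: clustering, access mode, service level.
        oofs_set_info(42, "cluster=run1999a;mode=sequential;service=batch");
    }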
Where’s the data? Dealing With Latency...
Hierarchical filesystems may have high latency bursts
Mounting a tape file
Need mechanism to notify client of expected delay
Prevents request timeout
Prevents retransmission storms
Also allows server to degrade gracefully
Can delay clients when overloaded
Defer Request Protocol
Certain oofs() requests can tell client of expected delay
For example, open()
Client waits indicated amount of time and tries again
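A sketch of how a client might honor the Defer Request Protocol; the reply structure and amsOpen() are assumed names, and only the wait-and-retry behavior comes from the slides:

    #include <chrono>
    #include <stdexcept>
    #include <string>
    #include <thread>

    // Hypothetical server reply: a valid fd, or fd < 0 plus a suggested delay.
    struct OpenReply { int fd; int retryAfterSec; };

    // Stub standing in for the real ams-protocol open request.
    OpenReply amsOpen(const std::string& path) { (void)path; return {3, 0}; }

    int openWithDefer(const std::string& path, int maxTries = 20) {
        for (int i = 0; i < maxTries; ++i) {
            OpenReply r = amsOpen(path);
            if (r.fd >= 0) return r.fd;                // file is online
            // The server said the file needs staging (e.g., a tape mount).
            // Waiting the indicated time avoids request timeouts and
            // retransmission storms, and lets an overloaded server shed load.
            std::this_thread::sleep_for(std::chrono::seconds(r.retryAfterSec));
        }
        throw std::runtime_error("open deferred too many times: " + path);
    }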
Many out of one: Dynamically Replicated Databases
Dynamically distributed databases
Single machine can’t manage over a terabyte of disk cache
No good way to statically partition the database
Dynamically varying database access paths
As load increases, add more copies
Copies accessed in parallel
As load decreases, remove copies to free up disk space
Objectivity catalog independence
Copies managed outside of Objectivity
Minimizes impact on administration
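One plausible way to drive the add/remove decision, sketched with assumed names and thresholds (the slides specify the policy, not the mechanism):

    #include <vector>

    // Hypothetical idle-server picker; returns a server id.
    int pickIdleServer() { return 0; }

    // Replica controller with hysteresis: add a copy when per-replica load
    // is high, retire one (freeing disk space) when load falls off.
    struct ReplicaSet {
        std::vector<int> servers;                      // AMS hosts holding a copy
        double loadPerReplica(double totalLoad) const {
            return totalLoad / static_cast<double>(servers.size());
        }
    };

    void rebalance(ReplicaSet& rs, double totalLoad) {
        const double highWater = 0.8, lowWater = 0.3;  // assumed thresholds
        if (rs.loadPerReplica(totalLoad) > highWater)
            rs.servers.push_back(pickIdleServer());    // copy managed outside Objectivity
        else if (rs.servers.size() > 1 &&
                 rs.loadPerReplica(totalLoad) < lowWater)
            rs.servers.pop_back();                     // drop a copy, free disk space
    }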
If There Are Many, Which One Do I Go To?
Request Redirect Protocol
oofs() routines supply alternate AMS location
oofs routines responsible for update synchronization
Typically, read/only access provided on copies
Only one read/write copy conveniently supported
Client must declare intention to update prior to access
Lazy synchronization possible
Good mechanism for largely read/only databases
Load balancing provided by an AMS collective
Has one distinguished member recorded in the catalogue
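A client-side sketch of the Request Redirect Protocol under stated assumptions (Reply, sendRequest(), and the loop structure are illustrative):

    #include <string>

    // Hypothetical reply: either handled locally or "go to this AMS instead".
    struct Reply { bool redirected; std::string altHost; };

    // Stub standing in for one request/response exchange with an AMS.
    Reply sendRequest(const std::string& host, const std::string& req) {
        (void)host; (void)req; return {false, ""};
    }

    // Clients start at the distinguished member recorded in the catalogue;
    // its oofs() routines may bounce them to any interchangeable member.
    Reply requestViaCollective(std::string host, const std::string& req) {
        Reply r = sendRequest(host, req);
        while (r.redirected) {
            host = r.altHost;            // alternate location chosen by oofs()
            r = sendRequest(host, req);
        }
        return r;
    }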
The AMS Collective
Collective members are effectively interchangeable
[Diagram: the distinguished member redirects clients to the other, interchangeable members of the collective.]
Keeping the hackers at bay: Object Oriented Security
No performance level is sufficient if you always have to recompute
Need mechanism to provide security to thwart hackers
Protocol Independent Authentication Model
Public or private key
PGP, RSA, Kerberos, etc.
Can be negotiated at run-time
Automatically called by client and server kernels
Client Objectivity Kernel creates security objects as needed
Supplied via replaceable shared libraries
Security objects supply context-sensitive authentication credentials
Works only with Extensible AMS via oofs interface
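A hypothetical sketch of the replaceable security object idea, pairing an abstract interface with a shared-library factory; the class and symbol names are assumptions:

    #include <dlfcn.h>   // POSIX dynamic loading
    #include <memory>
    #include <stdexcept>
    #include <string>

    // Hypothetical protocol-independent authentication object. Concrete
    // versions (PGP, RSA, Kerberos, ...) live in replaceable shared libraries.
    class SecurityObject {
    public:
        virtual ~SecurityObject() = default;
        // Produce context-sensitive credentials for one request.
        virtual std::string credentials(const std::string& context) = 0;
    };

    // Factory symbol each library must export (assumed convention).
    using Factory = SecurityObject* (*)();

    std::unique_ptr<SecurityObject> loadSecurity(const std::string& libPath) {
        void* lib = dlopen(libPath.c_str(), RTLD_NOW);
        if (!lib) throw std::runtime_error("cannot load " + libPath);
        auto make = reinterpret_cast<Factory>(dlsym(lib, "makeSecurityObject"));
        if (!make) throw std::runtime_error("no factory in " + libPath);
        return std::unique_ptr<SecurityObject>(make());
    }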
Overall Effects
Extensible AMS
Allows use of any type of filesystem via oofs layer
Generic Authentication Protocol
Allows proper client identification
Opaque Information Protocol
Allows passing of hints to improve filesystem performance
Defer Request Protocol
Accommodates hierarchical filesystems
Redirection Protocol
Accommodates terabyte+ filesystems
Provides for dynamic load balancing
Dynamic Load Balancing Hierarchical Secure AMS
[Diagram: clients are dynamically steered (dynamic selection) to one of several AMS servers in the collective.]
Summary
AMS is capable of high performance
Ultimate performance limited by disk speeds
The oofs interface + other protocols greatly enhance performance, scalability, usability, and security
5+TB of SLAC data has been processed using AMS+HPSS
Should be able to deliver an average of 20 MB/sec per disk
Some AMS problems
No HPSS problems
SLAC will be using this combination to store physics data
BaBar experiment will produce a 2+ petabyte database over 10 years
2,000,000,000,000,000 = 2×10^15 bytes ≈ 200,000 3590 tape cartridges (10 GB each)
Now for the reality
Full AMS features not yet implemented
  SLAC/Objectivity design has been completed
    oofs OO interface, OO security, protocols (i.e., DRP, RRP, and GAP)
  oofs and ooss layers are completely functional
  HPSS integration is full-featured and complete
  Protocol development has been fully funded at SLAC
    DRP, RRP, and GAP
Initial feature set to be deployed late summer
  DRP, GAP, and limited RRP
  Full asynchronous replication within 2 years
CERN & SLAC approaches similar
  But quite different in detail…
CERN staging approach: RFIO/RFCP + HPSS
[Diagram: the AMS (file & catalog management) issues stage-in requests; RFCP (RFIO copy) moves files from the HPSS server into the disk pool through an RFIO daemon on the disk server; an HPSS Mover and a migration daemon (Solaris) connect the disk pool to the tape robot; the AMS serves DB pages from the disk server via UNIX FS I/O and RFIO calls.]
SLAC staging approach: PFTP + HPSS
[Diagram: the AMS (file & catalog management) sends gateway requests to a gateway daemon, which issues stage-in requests to the HPSS server over PFTP (control); file data flows from the HPSS Mover (Solaris) into the disk pool over PFTP (data); a migration daemon connects HPSS to the tape robot; the AMS serves DB pages from the disk server via UNIX FS I/O.]
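For flavor, a heavily simplified sketch of the gateway daemon's stage-in loop; every name here, including runPftpGet(), is hypothetical:

    #include <queue>
    #include <string>

    // Hypothetical stage-in request from the AMS: bring an HPSS file into
    // the local staging pool so DB pages can be served with plain UNIX I/O.
    struct StageIn { std::string hpssPath, stagePath; };

    // Stand-in for driving a PFTP transfer (control + data connections).
    bool runPftpGet(const std::string& hpssPath, const std::string& stagePath) {
        // ... connect to the HPSS server, request the file, write it locally ...
        (void)hpssPath; (void)stagePath;
        return true;
    }

    void gatewayLoop(std::queue<StageIn>& requests) {
        while (!requests.empty()) {
            StageIn r = requests.front(); requests.pop();
            if (runPftpGet(r.hpssPath, r.stagePath)) {
                // Notify the AMS that the file is online; Defer Request
                // Protocol retries against this file will now succeed.
            }
        }
    }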
SLAC ultimate approach: Direct Tape Access
[Diagram: the AMS (file & catalog management) issues stage-in requests to the HPSS server through its native API (rpc); the HPSS Mover (Solaris) and a migration daemon transfer data directly between the tape robot and the disk pool; the AMS serves DB pages from the disk server via UNIX FS I/O.]
CERN 1TB Test Bed
[Diagram (current approximation): IBM RS6000 machines run the HPSS server and an HPSS data mover attached to the IBM tape silo; SUN Sparc 5s host the RFIO daemon, the AMS/HPSS interface, a second HPSS data mover, and the staging pool; a DEC Alpha client also participates. Links are FDDI, HIPPI, and Fast Ethernet today; the future plan is 1Gb switched Ethernet in a star topology.]
SLAC Configuration
[Diagram (approximate): several Sun 4500s, each running an AMS server and an HPSS mover with roughly 900 GB of disk, connect over Gigabit Ethernet to an IBM RS6000 F50 running the HPSS server.]
SLAC Detailed Configuration
[Diagram: detailed version of the SLAC hardware configuration above.]