Panasas Parallel File System
Brent Welch
Director of Software Architecture, Panasas Inc
October 16, 2008
HPC User Forum
Go Faster. Go Parallel.
www.panasas.com
Outline
Panasas background
Storage cluster background, hardware and software
Technical Topics
High Availability
Scalable RAID
pNFS
Brent’s garden, California
Panasas Company Overview
Founded: 1999, by Prof. Garth Gibson, co-inventor of RAID
Technology: Parallel file system and parallel storage appliance
Locations: US (HQ in Fremont, CA; R&D centers in Pittsburgh and Minneapolis); EMEA (UK, DE, FR, IT, ES, BE, Russia); APAC (China, Japan, Korea, India, Australia)
Customers: FCS October 2003, deployed at 200+ customers
Market focus: Energy, Government, Manufacturing, Academia, Life Sciences, Finance
Alliances: ISVs and resellers (shown as logos on the slide)
Primary investors: (shown as logos on the slide)
The New ActiveStor Product Line
Target workloads: design, modeling and visualization applications; simulation and analysis applications; tiered-QOS storage; backup/secondary storage
ActiveStor 6000: 20, 15 or 10 TB shelves; 20 GB cache/shelf; integrated 10GigE; 600 MB/sec/shelf; Tiered Parity; ActiveImage; ActiveGuard; ActiveMirror
ActiveStor 4000: 20 or 15 TB shelves; 5 GB cache/shelf; integrated 10GigE; 600 MB/sec/shelf; Tiered Parity; options: ActiveImage, ActiveGuard, ActiveMirror
ActiveStor 200: 104 TB (3 DBs x 52 SBs); 5 shelves (20U, 35"); single 1GigE port/shelf; 350 MB/sec aggregate; 20 GB aggregate cache; Tiered Parity
All models run ActiveScale 3.2
Leaders in HPC choose Panasas
Panasas Leadership Role in HPC
US DOE: Panasas Selected for Roadrunner – Top of the Top 500
LANL $130M system will deliver 2x performance over current top BG/L
SciDAC: Panasas CTO selected to lead Petascale Data Storage Inst
CTO Gibson leads PDSI, launched Sep '06, leveraging experience from PDSI members: LBNL/NERSC, LANL, ORNL, PNNL, Sandia NL, and UCSC
Aerospace: Airframes and engines, both commercial and defense
Boeing HPC file system; major engine mfg; top 3 U.S. defense contractors
Formula-1: HPC file system for Top 2 clusters – 3 teams in total
Top clusters at Renault F-1 and BMW Sauber, Ferrari also on Panasas
Intel: Certifies Panasas storage for broad range of HPC applications, now ICR
Intel uses Panasas storage for EDA design, and in HPC benchmark center
SC07: Six Panasas customers won awards at SC07 (Reno) conference
Validation: Extensive recognition and awards for HPC breakthroughs
Panasas Joint Alliance Investments with ISVs
Vertical focus and applications (Panasas ISV alliances shown as logos on the slide):
Energy: Seismic Processing; Interpretation; Reservoir Modeling
Manufacturing, Government, Higher Ed and Research, Energy: Computational Fluid Dynamics (CFD); Computational Structural Mechanics (CSM)
Semiconductor: Electronic Design Automation (EDA)
Financial: Trading; Derivatives Pricing; Risk Analysis
Intel Use of Panasas in the IC Design Flow
Panasas critical to the tape-out stage
Description of Darwin at U of Cambridge
University of Cambridge
HPC Service, Darwin Supercomputer
Darwin Supercomputer Computational Units
Nine repeating units, each consisting of 64 nodes (2 racks) providing 256 cores, 2,340 cores total
All nodes within a CU are connected to a full-bisection-bandwidth InfiniBand fabric (~900 MB/s, MPI latency of ~2 µs)
Source: http://www.hpc.cam.ac.uk
Details of the FLUENT 111M Cell Model
Unsteady external aero for a 111 million cell truck; 5 time steps with 100 iterations, and a single .dat file write
Number of cells: 111,091,452
Solver: PBNS, DES, unsteady
Iterations: 5 time steps, 100 total iterations; data save after last iteration
Output size (Truck, 111M cells): FLUENT v6.3, serial I/O, .dat file: 14,808 MB; FLUENT v12, serial I/O, .dat file: 16,145 MB; FLUENT v12, parallel I/O, .pdat file: 19,683 MB
University of Cambridge DARWIN cluster (http://www.hpc.cam.ac.uk)
Vendor: Dell; 585 nodes; 2,340 cores; 8 GB per node; 4.6 TB total memory
CPU: Intel Woodcrest DC, 3.0 GHz / 4 MB L2 cache
Interconnect: InfiniPath QLE7140 SDR HCAs; SilverStorm 9080 and 9240 switches
File system: Panasas PanFS, 4 shelves, 20 TB capacity
Operating system: Scientific Linux CERN SLC release 4.6
Scalability of Solver + Data File Write
[Chart: FLUENT comparison of PanFS vs. NFS on the University of Cambridge cluster; time in seconds for solver plus data file write (lower is better) on the 111M-cell truck aero case at 64, 128, and 256 cores, NFS with FLUENT 6.3 versus PanFS with FLUENT 12. PanFS is roughly 1.5x to 1.9x faster at each core count. Note: read times are not included in these results.]
Performance of Data File Write in MB/s
[Chart: effective I/O rates in MB/s for the data file write (higher is better) on the same case and cluster at 64, 128, and 256 cores. PanFS with FLUENT 12 reaches roughly 273-345 MB/s, about 20x to 39x the NFS/FLUENT 6.3 rates. Note: data file write only.]
Panasas Architecture
Cluster technology provides scalable capacity and performance: capacity scales symmetrically with processor, caching, and network bandwidth (disk, CPU, memory, and network grow together)
Scalable performance with commodity parts provides excellent price/performance
Object-based storage provides additional scalability and security advantages over block-based SAN file systems
Automatic management of storage resources to balance load across the cluster
Shared file system (POSIX) with the advantages of NAS, plus the direct-to-storage performance advantages of DAS and SAN
Panasas bladeserver building block
[Photos: the 4U shelf building block, with 11 slots for blades, a midplane, rails, an embedded switch, battery, and power supplies; DirectorBlade and StorageBlade modules slide into the shelf]
Panasas Blade Hardware
[Photos: shelf front with 1 DirectorBlade and 10 StorageBlades; shelf rear with the integrated GE switch, battery module, and two power units; the midplane routes GE and power]
Panasas Product Advantages
Proven implementation with appliance-like ease of use/deployment
Running mission-critical workloads at global F500 companies
Scalable performance with Object-based RAID
No degradation as the storage system scales in size
Unmatched RAID rebuild rates – parallel reconstruction
Unique data integrity features
Vertical parity on drives to mitigate media errors and silent corruptions
Per-file RAID provides scalable rebuild and per-file fault isolation
Network verified parity for end-to-end data verification at the client
Scalable system size with integrated cluster management
Storage clusters scaling to 1000+ storage nodes, 100+ metadata managers
Simultaneous access from over 12000 servers
Linear Performance Scaling
Breakthrough data throughput AND random I/O
Performance and scalability for all workloads
Scaling Clients
IOzone Multi-Shelf Performance Test (4GB sequential write/read)
[Chart: aggregate throughput in MB/sec (scale up to 3,000) versus number of clients (10 to 200) for 1-, 2-, 4-, and 8-shelf configurations, with separate write and read series for each]
Proven Panasas Scalability
Storage Cluster Sizes Today (e.g.)
Boeing: 50 DirectorBlades, 500 StorageBlades in one system (plus 25 DirectorBlades and 250 StorageBlades each in two other smaller systems)
LANL RoadRunner: 100 DirectorBlades, 1000 StorageBlades in one system today, planning to increase to 144 shelves next year
Intel: 5,000 active DF clients against 10-shelf systems, with even more clients mounting DirectorBlades via NFS; they have qualified a 12,000-client version of 2.3, and will deploy “lots” of compute nodes against 3.2 later this year
BP: uses 200-StorageBlade storage pools as their building block
LLNL: two realms, each with 60 DirectorBlades (NFS) and 160 StorageBlades
Most customers run systems in the 100 to 200 blade size range
Emphasis on Data Integrity
Horizontal Parity
Per-file, Object-based RAID across OSD
Scalable on-line performance
Scalable parallel RAID rebuild
Vertical Parity
Detect and eliminate unreadable sectors and silent data corruption
RAID at the sector level within a drive / OSD
Network Parity
Client verifies per-file parity equation during reads
Provides the only truly end-to-end data integrity solution available today
Many other reliability features…
Media scanning, metadata fail over, network multi-pathing, active hardware monitors, robust cluster management
High Availability
Quorum based cluster management
3 or 5 cluster managers to avoid split brain
Replicated system state
Cluster manager controls the blades and all other services
High performance file system metadata fail over
Primary-backup relationship controlled by cluster manager
Low latency log replication to protect journals
Client-aware fail over for application-transparency
NFS level fail over via IP takeover
Virtual NFS servers migrate among DirectorBlade modules
Lock services (lockd/statd) fully integrated with fail over system
Technology Review
Turn-key deployment and automatic resource configuration
Storage clusters scaling to ~1000 nodes today
Scalable object RAID
Very fast RAID rebuild
Compute clusters scaling to 12,000 nodes today
Vertical parity to trap silent corruptions
Blade-based hardware with 1 Gb/sec building block
Network parity for end-to-end data verification
Bigger building block going forward
Distributed system platform with quorum-based fault tolerance
Coarse-grain metadata clustering
Metadata fail over
Automatic capacity load leveling
The pNFS Standard
The pNFS standard defines the NFSv4.1 protocol extensions between the server and client
The I/O protocol between the client and storage is specified elsewhere, for example:
SCSI Block Commands (SBC) over Fibre Channel (FC)
SCSI Object-based Storage Device (OSD) over iSCSI
Network File System (NFS)
The control protocol between the server and storage devices is also specified elsewhere, for example:
SCSI Object-based Storage Device (OSD) over iSCSI
[Diagram: client, NFSv4.1 server, and storage, connected by the pNFS, I/O, and control protocols]
Key pNFS Participants
Panasas (Objects)
Network Appliance (Files over NFSv4)
IBM (Files, based on GPFS)
EMC (Blocks, HighRoad MPFSi)
Sun (Files over NFSv4)
U of Michigan/CITI (Files over PVFS2)
pNFS Status
pNFS is part of the IETF NFSv4 minor version 1 standard draft
Working group is passing draft up to IETF area directors, expect RFC later in ’08
Prototype interoperability continues
San Jose Connect-a-thon March ’06, February ’07, May ‘08
Ann Arbor NFS Bake-a-thon September ’06, October ’07
Dallas pNFS inter-op, June ’07, Austin February ’08, (Sept ’08)
Availability
TBD – gated behind NFSv4 adoption and working implementations of pNFS
Patch sets to be submitted to Linux NFS maintainer starting “soon”
Vendor announcements in 2008
Early adopters in 2009
Production ready in 2010
Questions?
Thank you for your time!
Go Faster. Go Parallel.
www.panasas.com
Deep Dive: Reliability
High Availability
Cluster Management
Data Integrity
Vertical Parity
“RAID” within an individual drive
Seamless recovery from media errors by applying RAID schemes across disk sectors
Repairs media defects by writing through to spare sectors
Detects silent corruptions and prevents reading wrong data
Independent of horizontal array-based parity schemes
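A minimal sketch of the idea, assuming a 512-byte sector, an 8-sector parity group, and CRC32 checksums (all illustrative choices, not the PanFS on-disk format): one parity sector protects a group of data sectors on the same drive, so a sector that fails its checksum can be rebuilt from the rest of the group and rewritten.

```python
# Sketch of sector-level ("vertical") parity within one drive.
# SECTOR, GROUP, and the CRC32 checksum are assumptions of this sketch.
import zlib
from functools import reduce

SECTOR = 512          # bytes per sector
GROUP = 8             # data sectors protected by one parity sector

def xor_sectors(sectors):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), sectors)

class VerticalParityGroup:
    def __init__(self, data_sectors):
        assert len(data_sectors) == GROUP and all(len(s) == SECTOR for s in data_sectors)
        self.data = list(data_sectors)
        self.parity = xor_sectors(self.data)            # parity sector on the same drive
        self.crcs = [zlib.crc32(s) for s in self.data]  # per-sector checksums

    def read_sector(self, i):
        sector = self.data[i]
        if zlib.crc32(sector) == self.crcs[i]:
            return sector                               # sector reads back clean
        # Media error or silent corruption: rebuild from the surviving sectors
        # plus parity, then write the repaired sector back (here, in memory).
        repaired = xor_sectors([self.parity] + [self.data[j] for j in range(GROUP) if j != i])
        self.data[i] = repaired
        return repaired

# Corrupt one sector and read it back repaired.
group = VerticalParityGroup([bytes([i]) * SECTOR for i in range(GROUP)])
group.data[3] = b"\xff" * SECTOR                        # simulate silent corruption
assert group.read_sector(3) == bytes([3]) * SECTOR
```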
Network Parity
Extends parity capability across the data path to the client or server node
Enables end-to-end data integrity validation
Protects from errors introduced by disks, firmware, server hardware, server software, network components, and transmission
Client either receives valid data or an error notification
[Diagram: network parity layered over vertical parity and horizontal parity]
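A hedged sketch of the read-path check, with a simple XOR parity equation and an in-memory stand-in for the storage nodes (the real client verifies the per-file RAID equation over the wire, which is not shown here):

```python
# Illustrative end-to-end ("network") parity check on the client read path.
# Stripe geometry and the XOR parity equation are assumptions of this sketch.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

class NetworkParityError(Exception):
    """Raised when the parity equation does not hold for a stripe."""

def read_stripe_verified(read_unit, stripe, width):
    """Read `width` data units plus parity for `stripe` and verify before returning.

    `read_unit(stripe, idx)` is a placeholder for fetching one stripe unit from
    a storage node; idx == width is the parity unit.
    """
    data = [read_unit(stripe, i) for i in range(width)]
    parity = read_unit(stripe, width)
    if xor_blocks(data) != parity:
        # Errors introduced anywhere on the path (disk, firmware, network, ...)
        # surface here instead of being silently returned to the application.
        raise NetworkParityError(f"stripe {stripe}: parity mismatch")
    return b"".join(data)

# Usage with an in-memory stand-in for the storage nodes.
units = {(0, 0): b"AAAA", (0, 1): b"BBBB", (0, 2): xor_blocks([b"AAAA", b"BBBB"])}
print(read_stripe_verified(lambda s, i: units[(s, i)], stripe=0, width=2))
```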
Panasas Scalability
Two Layer Architecture
Division between these two layers is an important separation of concerns
Platform maintains a robust system model and provides overall control
File system is an application layered over the distributed system platform
Automation in the distributed system platform helps the system adapt to failures without a lot of hand-holding by administrators
The file system uses protocols optimized for performance, and relies on the platform to provide robust protocols for failure handling
[Layer stack, top to bottom: Applications, Parallel File System, Distributed System Platform, Hardware]
Model-based Platform
The distributed system platform maintains a model of the system:
what the basic configuration settings are (networking, etc.) (manual)
which storage and manager nodes are in the system
where services are running
what errors and faults are present in the system (automatic)
what recovery actions are in progress
Model-based approach mandatory to reduce administrator overhead
Automatic discovery of resources
Automatic reaction to faults
Automatic capacity balancing
Proactive hardware monitoring
Quorum-based Cluster Management
The system model is replicated on 3 or 5 (or more) nodes
Maintained via Lamport’s PTP (Paxos) quorum voting protocol
PTP handles quorum membership change, brings partitioned members back up to date, and provides a basic transaction model
Each member keeps the model in a local database that is updated within a PTP transaction
7 or 14 msec update cost, which is dominated by 1 or 2 synchronous disk IOs
This robust mechanism is not on the critical path of any file system operation
Yes: change admin password (involves cluster manager)
Yes: configure quota tree
Yes: initiate service fail over
No: open, close, read, write, etc. (does not involve cluster manager)
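A minimal sketch of the majority rule, assuming a toy ClusterManager object (membership change, catch-up, and the full Paxos/PTP message exchange are omitted): an update to the replicated system model commits only when more than half of the configured cluster managers can durably log it.

```python
# Sketch of majority-quorum commit for the replicated system model.
# Only the "majority or nothing" rule is shown; everything else is omitted.

class ClusterManager:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.model = {}                       # local copy of the system model

def quorum_update(managers, key, value):
    """Commit an update only if a majority of managers can durably log it."""
    voters = [m for m in managers if m.reachable]
    if len(voters) <= len(managers) // 2:
        raise RuntimeError("no quorum: update refused to avoid split brain")
    for m in voters:
        m.model[key] = value                  # stands in for a synchronous disk write
    return len(voters)

# With one of three managers partitioned away, updates still commit;
# with two unreachable, they are refused.
cms = [ClusterManager("cm1"), ClusterManager("cm2"), ClusterManager("cm3", reachable=False)]
print(quorum_update(cms, "volume:/home", "created"))     # 2 of 3 acknowledge
cms[1].reachable = False
try:
    quorum_update(cms, "volume:/scratch", "created")
except RuntimeError as err:
    print(err)                                           # only 1 of 3: no quorum
```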
File Metadata
The PanFS metadata manager stores metadata in object attributes
All component objects have simple attributes like their capacity, length, and security tag
Two component objects store a replica of the file-level attributes, e.g. file length, owner, ACL, parent pointer
Directories contain hints about where the components of a file are stored
There is no database on the metadata manager, just transaction logs that are replicated to a backup via a low-latency network protocol
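A toy model of that layout (field names and helper functions are illustrative, not the PanFS schema): per-component attributes live on every object, the first two component objects carry a replica of the file-level attributes, and the directory entry stores only a hint of where the components are.

```python
# Toy model of metadata kept in object attributes rather than a database.
# The "first two components carry file-level attributes" rule and the
# directory hint follow the slide; names and fields are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ComponentObject:                    # lives on one StorageBlade (OSD)
    osd: str
    capacity: int = 0
    length: int = 0
    security_tag: str = ""
    file_attrs: Optional[dict] = None     # replica of file-level attributes (first two only)

@dataclass
class DirectoryEntry:
    name: str
    component_hint: list = field(default_factory=list)   # hint: which OSDs hold components

def create_file(name, osds, owner, parent):
    attrs = {"file_length": 0, "owner": owner, "acl": [], "parent": parent}
    components = [ComponentObject(osd=o) for o in osds]
    for c in components[:2]:              # two components hold the attribute replica
        c.file_attrs = dict(attrs)
    return DirectoryEntry(name, [c.osd for c in components]), components

entry, components = create_file("results.dat", ["osd-3", "osd-7", "osd-9"],
                                owner="welch", parent="/volumes/sim")
print(entry.component_hint, components[0].file_attrs["owner"])
```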
File Metadata
File ownership is divided along file subtree boundaries (“volumes”)
Multiple metadata managers, each own one or more volumes
Match with the quota-tree abstraction used by the administrator
Creating volumes creates more units of meta-data work
Primary-backup failover model for MDS
90us remote log update over 1GE, vs. 2us local in-memory log update
Some customers introduce lots of volumes
One volume per DirectorBlade module is ideal
Some customers are stubborn and stick with just one
E.g., 75 TB and 144 StorageBlade modules and a single MDS
Scalability over time
Software baseline
Single software product, but with branches corresponding to major releases that come out every year (or so)
Today most customers are on 3.0.x, which had a two-year lifetime
We are just introducing 3.2, which has been in QA and beta for 7 months
There is forward development on newer features
Our goal is a major release each year, with a small number of maintenance releases in between major releases
Compatibility is key
New versions upgrade cleanly over old versions
Old clients communicate with new servers, and vice versa
Old hardware is compatible with new software
Integrated data migration to newer hardware platforms
Deep Dive: Networking
Scalable networking infrastructure
Integration with compute cluster fabrics
LANL Systems
Turquoise (unclassified):
Compute: Pink 10TF, TLC 1TF, Coyote 20TF
Network: 1-lane PaScalBB, ~100 GB/sec
Storage: 68 Panasas shelves, 412 TB, 24 GBytes/sec
Yellow (unclassified):
Compute: Flash/Gordon 11TF, Yellow Roadrunner Base 5TF (4 RRp3 CUs, 300 TF)
Network: 2-lane PaScalBB, ~200 GB/sec
Storage: 24 Panasas shelves, 144 TB, 10 GBytes/sec
Red (classified):
Compute: Lightning/Bolt 35TF, Roadrunner Base 70TF (Accelerated: 1.37 PF)
Network: 6-lane PaScalBB, ~600 GB/sec
Storage: 144 Panasas shelves, 936 TB, 50 GBytes/sec
IB and other network fabrics
Panasas is a TCP/IP, GE-based storage product
Universal deployment, Universal routability
Commodity price curve
Panasas customers use IB, Myrinet, Quadrics, …
Cluster interconnect du jour for performance, not necessarily cost
IO routers connect cluster fabric to GE backbone
Analogous to an “IO node”, but just does TCP/IP routing (no storage)
Robust connectivity through IP multipath routing
Scalable throughput at approx 650 MB/sec per IO router (PCI-e class)
Working on a 1 GB/sec IO router
IB-GE switching platforms
QLogic or Voltaire switch provides wire-speed bridging
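As a quick sizing rule from the per-router figure above (the target bandwidths below are illustrative examples, not customer configurations):

```python
# Back-of-the-envelope sizing: IO routers bridge the cluster fabric (IB,
# Myrinet, ...) to the GE backbone at roughly 650 MB/sec each (PCI-e class).
import math

def io_routers_needed(target_gbytes_per_sec, per_router_mbytes_per_sec=650):
    return math.ceil(target_gbytes_per_sec * 1000 / per_router_mbytes_per_sec)

for target in (10, 25, 50):                       # GB/sec targets, purely illustrative
    print(f"{target} GB/sec -> {io_routers_needed(target)} IO routers")
```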
Petascale Red Infrastructure Diagram with Roadrunner Accelerated FY08
[Diagram: secure core switches tie NFS and other network services, the WAN, and the archive (FTAs) into the site-wide shared global parallel file system over Nx10GE and NxGE links. Roadrunner Phase 3 (1.026 PF) compute units sit on a 4X IB fat tree with 10GE I/O nodes; Roadrunner Phase 1 (70 TF) and Lightning/Bolt (35 TF) compute units connect through Myrinet I/O units with 1GE I/O nodes. Storage is provisioned at 4 GE per 5-8 TB, and the design is scalable to 600 GB/sec before adding lanes.]
LANL Petascale (Red) FY08
[Diagram: compute clusters reach the site-wide shared global parallel file system (650-1500 TB, 50-200 GB/s, spanning lanes) through lane switches (6 x 105 = 630 GB/s); the NFS complex, other network services, the WAN, and the archive (FTAs) hang off the same backbone. Storage is provisioned at 4 Gb/s per 5-8 TB.]
Roadrunner Accelerated (planned, 1 PF sustained): 156 I/O nodes with one 10 Gbit link each, 195 GB/sec
Roadrunner Base: 70 TF; 144-node units, 12 I/O nodes per unit; 4-socket dual-core AMD nodes with 32 GB memory per node; full fat tree; 14 units; 96 I/O nodes with two 1 Gbit links each, 24 GB/sec; acceleration to 1 PF sustained
Bolt: 20 TF; 2-socket single/dual-core AMD; 256-node units; reduced fat tree; 1,920 nodes
Lightning: 14 TF; 2-socket single-core AMD; 256-node units; full fat tree; 1,608 nodes
Legacy Lightning/Bolt connect through lane pass-through switches, with 64 I/O nodes at two 1 Gbit links each, 16 GB/sec
If more bandwidth is needed, we just add more lanes and more storage; this scales nearly linearly, not N^2 like a fat-tree storage area network.
Multi-Cluster sharing: scalable BW with fail over
[Diagram: Panasas storage is shared by Clusters A, B, and C over the site network; each cluster's compute nodes reach the storage through its I/O nodes and layer-2 switches, and colors in the diagram depict subnets. Archive, KRB, DNS, and NFS services sit on the same network.]
Deep Dive: Scalable RAID
Per-file RAID
Scalable RAID rebuild
Automatic per-file RAID
System assigns RAID level based on file size:
<= 64 KB: RAID 1 (mirroring) for efficient space allocation
> 64 KB: RAID 5 (striping) for optimum system performance
> 1 GB: two-level RAID 5 striping for scalable performance
RAID 1 and RAID 10 for optimized small writes
Automatic transition from RAID 1 to RAID 5 without restriping
Programmatic control for application-specific layout optimizations:
Create with a layout hint
Inherit layout from the parent directory
Clients are responsible for writing data and its parity
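A small sketch of the size-based policy as stated above; the thresholds come from the slide, while the layout-hint override is an illustrative stand-in for "create with layout hint / inherit layout from parent directory".

```python
# Size-based RAID level selection, per file, following the thresholds above.
# The hint mechanism is an assumption of this sketch.

KB, GB = 1024, 1024**3

def choose_layout(file_size, layout_hint=None):
    if layout_hint is not None:          # application- or directory-supplied hint wins
        return layout_hint
    if file_size <= 64 * KB:
        return "RAID-1"                  # mirroring: efficient for small files
    if file_size <= 1 * GB:
        return "RAID-5"                  # striping with parity
    return "two-level RAID-5"            # stripes of RAID-5 stripes for very large files

assert choose_layout(4 * KB) == "RAID-1"
assert choose_layout(100 * 1024 * KB) == "RAID-5"
assert choose_layout(10 * GB) == "two-level RAID-5"
assert choose_layout(10 * GB, layout_hint="RAID-10") == "RAID-10"
```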
Declustered RAID
Files are striped across component objects on different StorageBlades
Component objects include file data and file parity for reconstruction
File attributes are replicated with two component objects
Declustered, randomized placement distributes RAID workload
[Diagram: component objects (e.g. mirrored or 9-OSD parity stripes) are placed with declustered, randomized layout across a 2-shelf BladeSet. To rebuild a failed OSD, the system reads about half of each surviving OSD and writes a little to each OSD, so rebuild scales linearly.]
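A rough sketch of declustered placement under simplifying assumptions (pseudo-random per-file placement seeded by a file id; the real placement algorithm and spare handling are not shown): because each file picks its own small set of OSDs, a failed OSD's rebuild work is spread almost evenly over every surviving OSD in the BladeSet.

```python
# Sketch of declustered, per-file placement across a BladeSet.
# Seeding the choice with the file id is an assumption of this sketch,
# not the PanFS algorithm.
import random
from collections import Counter

def place_file(file_id, osds, width):
    """Choose `width` distinct OSDs for one file's component objects."""
    rng = random.Random(file_id)
    return rng.sample(osds, width)

osds = [f"osd-{i}" for i in range(22)]        # e.g. a 2-shelf BladeSet
layouts = {f: place_file(f, osds, width=9) for f in range(10_000)}

# If osd-5 fails, count how much rebuild work lands on each surviving OSD:
# every survivor carries roughly the same share, so rebuild throughput grows
# with the size of the pool.
work = Counter(o for comps in layouts.values() if "osd-5" in comps
                 for o in comps if o != "osd-5")
print(work.most_common(3), min(work.values()))
```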
Scalable RAID Rebuild
Rebuild bandwidth is the rate at which data is regenerated (writes)
Overall system throughput is N times higher because of the necessary reads
Use multiple “RAID engines” (DirectorBlades) to rebuild files in parallel
Declustering spreads disk I/O over more disk arms (StorageBlades)
Shorter repair time in larger storage pools
Customers report 30-minute rebuilds for 800 GB in a 40+ shelf blade set
[Chart: rebuild MB/sec versus number of shelves (0 to 14), for one volume vs. N volumes and 1 GB vs. 100 MB files. Variability at 12 shelves is due to uneven utilization of DirectorBlade modules; larger numbers of smaller files rebuilt faster; rebuild dips at 8 and 10 shelves because of a wider parity stripe.]
Scalable rebuild is mandatory
Having more drives increases risk, just like having more light bulbs increases the odds one will be burnt out at any given time
Larger storage pools must mitigate their risk by decreasing repair times
The math says: if (e.g.) 100 drives are in 10 RAID sets of 10 drives each, and each RAID set has a rebuild time of N hours, then the risk is the same if you have a single RAID set of 100 drives and the rebuild time is N/10
Block-based RAID scales the wrong direction for this to work: bigger RAID sets repair more slowly because more data must be read
Only declustering provides scalable rebuild rates
[Diagram: risk as a function of the total number of drives, drives per RAID set, and repair time]
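A hedged, worked version of that comparison using a simplified double-failure risk proxy (not a full MTTDL model): risk is taken as proportional to drives x (drives - 1) x repair time, summed over RAID sets, with the per-drive failure rate factored out.

```python
# Simplified double-failure risk proxy (not a full MTTDL calculation):
# per RAID set, risk ~ drives * (drives - 1) * repair_hours; sum over sets.

def relative_risk(raid_sets):
    """raid_sets: list of (drives_in_set, repair_hours) tuples."""
    return sum(d * (d - 1) * t for d, t in raid_sets)

N = 10.0                                     # repair time of one 10-drive set, hours
ten_sets_of_ten = [(10, N)] * 10             # 100 drives as 10 block-RAID sets
one_fast_set = [(100, N / 10)]               # same 100 drives, repaired 10x faster
one_slow_set = [(100, 10 * N)]               # wider set that repairs slower instead

print(relative_risk(ten_sets_of_ten))        # 9000.0
print(relative_risk(one_fast_set))           # 9900.0 -- roughly the same risk
print(relative_risk(one_slow_set))           # 990000.0 -- ~100x worse: the wrong direction
```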
Deep Dive: pNFS
Standards-based parallel file systems: NFSv4.1
pNFS: Standard Storage Clusters
pNFS is an extension to the Network File System v4 protocol standard
Allows for parallel and direct access
From Parallel Network File System clients
To Storage Devices over multiple storage protocols
Moves the Network File System server out of the data path
[Diagram: pNFS clients exchange metadata with the NFSv4.1 server and move data directly to block (FC), object (OSD), or file (NFS) storage]
pNFS Layouts
Client gets a layout from the NFS Server
The layout maps the file onto storage devices and addresses
The client uses the layout to perform direct I/O to storage
At any time the server can recall the layout
Client commits changes and returns the layout when it’s done
pNFS is optional, the client can always use regular NFSv4 I/O
[Diagram: clients get layouts from the NFSv4.1 server and perform direct I/O to storage]
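A sketch of the client-side flow against stub objects; the operation names (LAYOUTGET, LAYOUTCOMMIT, LAYOUTRETURN) are the NFSv4.1 ones, while everything else here is a placeholder rather than a real NFS implementation.

```python
# Sketch of the pNFS client I/O path: get a layout, do direct I/O to storage,
# commit and return the layout, and fall back to plain NFSv4 I/O otherwise.

class StubLayout:
    def __init__(self, device):
        self.device = device
    def map(self, offset, data):                 # map a byte range -> (device, offset, chunk)
        yield self.device, offset, data          # trivial one-device mapping for the sketch

class StubServer:
    def LAYOUTGET(self, fh, offset, length):    return StubLayout("osd-1")
    def LAYOUTCOMMIT(self, fh, offset, length): print("commit", fh, offset, length)
    def LAYOUTRETURN(self, fh, layout):         print("return layout for", fh)
    def WRITE(self, fh, offset, data):          print("fallback NFSv4 WRITE")

class StubStorage:
    def write(self, device, offset, chunk):     print("direct write", device, offset, len(chunk))

def pnfs_write(server, storage, fh, offset, data):
    layout = server.LAYOUTGET(fh, offset, len(data))
    if layout is None:                           # pNFS is optional: fall back to NFSv4 I/O
        return server.WRITE(fh, offset, data)
    try:
        for device, dev_offset, chunk in layout.map(offset, data):
            storage.write(device, dev_offset, chunk)   # data path bypasses the NFS server
        server.LAYOUTCOMMIT(fh, offset, len(data))     # publish size/mtime changes
    finally:
        server.LAYOUTRETURN(fh, layout)                # the server may also recall layouts

pnfs_write(StubServer(), StubStorage(), fh="0xfh1", offset=0, data=b"hello")
```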
pNFS Client
Common client for different storage back ends
Wider availability across operating systems
Fewer support issues for storage vendors
Layout driver back ends: 1. SBC (blocks); 2. OSD (objects); 3. NFS (files); 4. PVFS2 (files); 5. future back ends…
[Diagram: client apps sit on a common pNFS client whose layout driver speaks to the chosen back end; NFSv4.1 carries layout metadata grants and revokes between the pNFS client and a pNFS server backed by a cluster file system]
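A sketch of why the common client helps: one generic pNFS code path dispatches to whichever layout driver matches the layout type the server hands back. The registry and class names here are illustrative; the layout types are the ones listed above.

```python
# Illustrative layout-driver dispatch in a generic pNFS client.
# The driver registry and class names are assumptions of this sketch.

class ObjectLayoutDriver:
    layout_type = "OSD"                      # objects over iSCSI/OSD
    def do_io(self, layout, op, data=None):
        return f"object I/O via {layout['devices']}"

class FileLayoutDriver:
    layout_type = "NFS"                      # files over NFSv4 data servers
    def do_io(self, layout, op, data=None):
        return f"file I/O via {layout['devices']}"

class BlockLayoutDriver:
    layout_type = "SBC"                      # blocks over FC/iSCSI
    def do_io(self, layout, op, data=None):
        return f"block I/O via {layout['devices']}"

DRIVERS = {d.layout_type: d() for d in (ObjectLayoutDriver, FileLayoutDriver, BlockLayoutDriver)}

def pnfs_io(layout, op, data=None):
    """One client code path; only the back-end driver differs per layout type."""
    driver = DRIVERS.get(layout["type"])
    if driver is None:
        raise ValueError(f"no layout driver for {layout['type']}; fall back to NFSv4")
    return driver.do_io(layout, op, data)

print(pnfs_io({"type": "OSD", "devices": ["osd-1", "osd-2"]}, "read"))
```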
pNFS is not…
Improved cache consistency
NFS has open-to-close consistency enforced by client polling of attributes
NFSv4.1 directory delegations can reduce polling overhead
Perfect POSIX semantics in a distributed file system
NFS semantics are good enough (or, all we’ll give you)
But note also the POSIX High End Computing Extensions Working Group
http://www.opengroup.org/platform/hecewg/
Clustered metadata
Not a server-to-server protocol for scaling metadata
But, it doesn’t preclude such a mechanism
Prototype pNFS Performance
pNFS IOzone throughput: 8 clients, 1+10 SB1000 system, 5 GB files
[Chart: write, re-write, read, and re-read throughput in MB/s (scale up to 450) versus number of clients, 1 through 8]
Prototype pNFS Performance
pNFS IOzone throughput: 8 clients, 4+18 system, 5 GB files
[Chart: write, re-write, read, and re-read throughput in MB/s (scale up to 700) versus number of clients, 1 through 8]
Is pNFS Enough?
Standard for out-of-band metadata
Great start to avoid classic server bottleneck
NFS has already relaxed some semantics to favor performance
But there are certainly some workloads that will still hurt
Standard framework for clients of different storage backends
Files (Sun, NetApp, IBM/GPFS, GFS2)
Objects (Panasas, Sun?)
Blocks (EMC, LSI)
PVFS2
Your project… (e.g., dcache.org)
Storage Cluster Components
StorageBlade: processor, memory, 2 NICs, 2 spindles; object storage system; block management
DirectorBlade: processor, memory, 2 NICs; distributed file system; file and object management; cluster management; NFS/CIFS re-export
Object-based clustered file system on smart, commodity hardware
Integrated hardware/software solution: 11 blades per 4U shelf
Today: 1 to 100-shelf systems; tomorrow: 1 to 500-shelf systems
[Photo: Panasas ActiveScale storage cluster]
[Diagram: out-of-band architecture with direct, parallel paths from clients to storage nodes. Up to 12,000 compute nodes run the PanFS client, speaking iSCSI/OSD directly to 1,000+ storage nodes (OSDFS) and RPC/NFS/CIFS to 100+ manager nodes (SysMgr). Internal cluster management makes a large collection of blades work as a single system.]
Panasas Global Storage Model
[Diagram: client nodes reach Panasas System A and System B over a TCP/IP network. Each system contains one or more BladeSets (physical storage pools, e.g. BladeSet 1-3); each BladeSet holds volumes (logical quota trees, e.g. VolX, home, VolN, VolY, delta, VolM, VolL, VolZ). The system's DNS name and the volume name form a global name, e.g. /panfs/sysa/delta/file2 and /panfs/sysb/volm/proj38.]
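A small sketch of how a global name resolves under this model; the system and volume names are the ones from the diagram, and the mapping tables and function are illustrative.

```python
# Resolve a /panfs global path to (Panasas system, BladeSet, volume, relative path).
# The systems/volumes below are the example names from the diagram; the
# resolution logic itself is an illustrative sketch.

SYSTEMS = {                         # DNS name -> {volume -> BladeSet}
    "sysa": {"home": "BladeSet 1", "volx": "BladeSet 1", "delta": "BladeSet 2"},
    "sysb": {"volm": "BladeSet 3", "voll": "BladeSet 3", "volz": "BladeSet 3"},
}

def resolve(path):
    parts = path.strip("/").split("/")
    if parts[0] != "panfs" or len(parts) < 3:
        raise ValueError("expected /panfs/<system>/<volume>/...")
    system, volume, rest = parts[1], parts[2], "/".join(parts[3:])
    bladeset = SYSTEMS[system][volume]      # volume = logical quota tree in one BladeSet
    return system, bladeset, volume, rest

print(resolve("/panfs/sysa/delta/file2"))
print(resolve("/panfs/sysb/volm/proj38"))
```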