Cplant I/O
Pang Chen
Lee Ward
Sandia National Laboratories
Scalable Computing Systems
Fifth NASA/DOE Joint PC Cluster
Computing Conference
October 6-8, 1999
1
Conceptual Partition Model
[Diagram: conceptual partition layout showing Service, Compute, File I/O, and Net I/O partitions, plus Users and /home]
2
File I/O Model
• Support large-scale unstructured grid applications.
– Manipulate single file per application, not per processor.
• Support collective I/O libraries.
– Require fast concurrent writes to a single file.
3
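To make the collective-write requirement above concrete, here is a minimal sketch of the pattern it targets. The slides do not name a particular library or file; MPI-IO and the file name "field.dat" are used here only as stand-ins for "a collective I/O library" writing one shared file.

    /* Illustrative sketch only: every rank writes a disjoint 1 MB block of a
     * single shared file; error checking is omitted for brevity. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        const int block = 1 << 20;            /* 1 MB per process */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(block, 1);

        /* One file per application, not one file per processor. */
        MPI_File_open(MPI_COMM_WORLD, "field.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Disjoint offsets: fast concurrent writes to the same file. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, block,
                              MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }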
Problems
• Need a file system NOW!
• Need scalable, parallel I/O.
• Need file management infrastructure.
• Need to present the I/O subsystem as a single parallel file system both internally and externally.
• Need production-quality code.
4
Approaches
• Provide independent access to file systems on each I/O node.
– Can’t stripe across multiple I/O nodes to get better performance.
• Add a file management layer to “glue” the independent file systems so
as to present a single file view.
– Require users (both on and off Cplant) to differentiate between this
“special” file system and other “normal” file systems.
– Lots of special utilities are required.
• Build our own parallel file system from scratch.
– A lot of work just to reinvent the wheel, let alone the right wheel.
• Port other parallel file systems into Cplant.
– Also a lot of work with no immediate payoff.
5
Current Approach
• Build our I/O partition as a scalable nexus between Cplant and external
file systems.
+ Leverage off existing and future parallel file systems.
+ Allow immediate payoff with Cplant accessing existing file systems.
+ Reduce data storage, copies, and management.
– Expect lower performance with non-local file systems.
– Waste external bandwidth when accessing scratch files.
6
Building the Nexus
• Semantics
– How can and should the compute partition use this service?
• Architecture
– What are the components and protocols between them?
• Implementation
– What do we have now, and what do we hope to achieve in the future?
7
Compute Partition Semantics
• POSIX-like.
– Allow users to be in a familiar environment.
• No support for ordered operations (e.g., no O_APPEND).
• No support for data locking.
– Enable fast non-overlapping concurrent writes to a single file.
– Prevent a job from slowing down the entire system for others.
• Additional call to invalidate buffer cache.
– Allow file views to synchronize when required.
8
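A minimal sketch of what these semantics look like from a compute process, assuming ordinary POSIX calls; the cplant_invalidate_cache() wrapper below is hypothetical, standing in for the additional cache-invalidation call, whose real interface is not shown in the slides.

    /* Sketch only: each process writes its own disjoint region of one shared
     * file with plain POSIX calls; no O_APPEND, no fcntl() locks, and error
     * checking is omitted. */
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical wrapper for the extra call that invalidates this node's
     * buffer cache so the file view can be synchronized on demand. */
    extern int cplant_invalidate_cache(int fd);

    void write_my_block(const char *path, int rank, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);

        /* Non-overlapping offsets, so no ordering or locking support is needed. */
        pwrite(fd, buf, len, (off_t)rank * (off_t)len);

        /* Drop cached pages before reading data written by other processes. */
        cplant_invalidate_cache(fd);

        close(fd);
    }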
Cplant I/O
[Diagram: a bank of I/O nodes bridging Cplant to Enterprise Storage Services]
9
Architecture
• I/O nodes present a symmetric view.
– Every I/O node behaves the same (except for the cache).
– Without any control, a compute node may open a file with one I/O node,
and write that file via another I/O node.
• I/O partition is fault-tolerant and scalable.
– Any I/O node can go down without the system losing jobs.
– Appropriate number of I/O nodes can be added to scale with the compute
partition.
• I/O partition is the nexus for all file I/O.
– It provides our POSIX-like semantics to the compute nodes and accomplishes tasks on their behalf outside the compute partition.
• Links/protocols to external storage servers are server dependent.
– External implementation hidden from the compute partition.
10
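The slides do not give the compute-node routing policy (and a later slide lists dynamic compute-node-to-I/O-node mapping as future work), so the toy hash below only illustrates the symmetry property itself: with stateless, interchangeable I/O nodes, any request for any file could be sent to any I/O node, e.g. open via one node and write via another.

    /* Toy illustration of the symmetric view; not the actual Cplant policy. */
    typedef struct { unsigned int fsid, fileid; } file_handle_t;

    static int pick_io_node(const file_handle_t *fh, unsigned int request_seq,
                            unsigned int n_io_nodes)
    {
        /* No per-client state lives on the I/O node, so the choice is free to
         * change from one request to the next. */
        return (int)((fh->fileid ^ request_seq) % n_io_nodes);
    }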
Compute -- I/O node protocol
• Base protocol is NFS version 2.
– Stateless protocols allow us to repair faulty I/O nodes without aborting
applications.
– Inefficiency/latency between the two partitions is currently moot; the bottleneck is not here.
• Extension/modifications:
– Larger I/O requests.
– Propagation of a call to invalidate cache on the I/O node.
11
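For reference, standard NFS version 2 limits a single READ/WRITE transfer to 8 KB and uses 32-byte file handles and procedure numbers 0 through 17; the sketch below is only a guess at how the two extensions listed above might look, with the enlarged size and the extra procedure number chosen purely for illustration.

    /* Illustrative message shapes only; not the actual Cplant wire format. */
    #define NFS2_MAXDATA        8192         /* standard NFS v2 transfer limit  */
    #define CPLANT_MAXDATA      (256 * 1024) /* assumed larger I/O request size */
    #define NFSPROC_INVALCACHE  18           /* assumed new procedure number;   */
                                             /* NFS v2 itself defines 0..17     */

    struct cplant_write_args {               /* WRITE-like call, larger payload */
        unsigned char fh[32];                /* NFS v2 opaque file handle       */
        unsigned int  offset;
        unsigned int  count;                 /* up to CPLANT_MAXDATA bytes      */
        /* data follows */
    };

    struct cplant_invalcache_args {          /* propagated buffer-cache flush   */
        unsigned char fh[32];                /* invalidate cached data for fh   */
    };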
Current Implementation
• Basic implementation of the I/O nodes.
• Have straight NFS inside Linux with the ability to invalidate cache.
• I/O nodes have no cache.
• I/O nodes are dumb proxies knowing only about one server.
• Credentials rewritten by the I/O nodes and sent to the server as if the requests came from the I/O nodes.
• I/O nodes are attached via 100BaseT links to a Gb Ethernet, with an SGI O2K as the (XFS) file server on the other end.
• Don’t have jumbo packets.
• Bandwidth is about 30MB/s with 18 clients driving 3 I/O nodes, each
using about 15% of CPU.
12
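For scale, 30 MB/s aggregate across 3 I/O nodes works out to roughly 10 MB/s per node, not far below the ~12.5 MB/s theoretical payload rate of a single 100BaseT link; read together with the ~15% CPU load, this suggests the per-node network links, rather than the I/O node CPUs, were the limiting factor.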
Current Improvements
• Put a VFS infrastructure into I/O node daemon.
– Allow access to multiple servers.
– Allow a Linux /proc interface to tune individual I/O nodes quickly and
easily.
– Allow vnode identification to associate buffer cache with files.
• Experiment with a multi-node server (SGI/CXFS).
13
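The daemon's interfaces are not shown in the slides, so the following is only a rough sketch, under assumed names, of what a VFS-style layer inside the I/O node daemon could look like: a per-file vnode that identifies the backing file (and thus its cache entries) plus an operations table so the same daemon can talk to more than one server type.

    /* Rough sketch only; names and fields are assumptions, not the real daemon. */
    struct vnode;

    struct vnode_ops {                  /* one table per back-end file system    */
        int (*read)(struct vnode *vp, void *buf,
                    unsigned long len, unsigned long off);
        int (*write)(struct vnode *vp, const void *buf,
                     unsigned long len, unsigned long off);
        int (*invalidate)(struct vnode *vp);  /* drop this file's cached blocks  */
    };

    struct vnode {
        unsigned long           fsid;   /* which back-end server                 */
        unsigned long           fileid; /* identifies the file for the cache     */
        const struct vnode_ops *ops;    /* dispatch to the right server code     */
    };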
Future Improvements
• Stop retries from going out of the network.
• Put in jumbo packets.
• Put in read cache.
• Put in write cache.
• Port over Portals 3.0.
• Put in bulk data services.
• Allow dynamic compute-node-to-I/O-node mapping.
14
Looking for Collaborations
Lee Ward
505-844-9545
[email protected]
Pang Chen
510-796-9605
[email protected]
15