A LOW-BANDWIDTH NETWORK FILE SYSTEM
A. Muthitacharoen, MIT B. Chen, MIT D. Mazieres, New York U
Highlights
• A file system for slow or wide-area networks • Exploits similarities between files or versions of the same file – Avoids sending data that can be found in the server’s file system or the client’s cache • Also uses conventional compression and caching • Requires 90% less bandwidth than traditional network file systems
Working on slow networks
• Can work with local copies – Must then worry about update conflicts • Can use remote login – Only for text-based applications • Should use instead a low-bandwidth file system – Better than remote login – Must then deal with issues like big autosaves blocking the editor for the duration of transfer
LBFS (I)
• Client keeps all recently accessed files in its cache • LBFS exploits cross-file similarities to reduce data transfers between client and server – File server divides the files it stores into variable-size chunks – Indexes these chunks by their hash values
LBFS (II)
• When transferring a file between the client and the server – LBFS identifies the chunks the receiving side already has – Only transmits the other chunks • Provides close-to-open consistency – Same as Coda (and newer versions of NFS)
Related work (I)
• AFS used callbacks to reduce network traffic • Leases are callbacks with an expiration date • Coda supports slow networks and disconnected operation through optimistic replication • Bayou and OceanStore investigate conflict resolution for optimistic updates • Lee et al. have extended Coda to support operation-based updates
Related Work (II)
• Spring and Wetherall use large client and server caches to eliminate redundant network traffic: – Can send address of data already in cache of receiver rather than data themselves • Rsync exploits similarities between directory trees containing similar subtrees
LBFS Design
• Key ideas: – Close-to-open consistency – Have a large persistent file cache at client • IDE disks are now large enough for that – Exploits similarities between files (and file versions) • Only transmits data chunks containing new data
Identifying Similar Data Chunks
• LBFS relies on the collision-resistance property of the SHA-1 hash function – Assumes no hash collisions • Central challenges are – Keeping the index a reasonable size – Dealing with shifting offsets
The Case against Fixed-Size Blocks
[Figure: file F, and file F after an insertion] The two files do not have a single block in common
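The effect shown on this slide can be reproduced with a small sketch. It assumes 64-byte blocks and random test data (both illustrative choices, not values from LBFS): inserting a single byte at the front shifts every later byte, so every fixed-size block of the edited file hashes differently.

```python
import hashlib
import random

BLOCK_SIZE = 64  # illustrative; any fixed size shows the same effect


def block_hashes(data, size=BLOCK_SIZE):
    """Hash each fixed-size block of data (the last block may be shorter)."""
    return [hashlib.sha1(data[i:i + size]).digest()
            for i in range(0, len(data), size)]


rng = random.Random(0)  # fixed seed so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(1024))
edited = b"!" + original  # one byte inserted at the very beginning

# Blocks whose hashes appear in both versions of the file.
shared = set(block_hashes(original)) & set(block_hashes(edited))
print("blocks in common:", len(shared))
```

With fixed-size blocks, the whole edited file would have to be sent again even though only one byte is new.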
The Case against “Diffs”
• “Diffs” are used by several UNIX utilities – Computed by comparing contents of file with another file – Very efficient • Must know which file(s) to compare to • Difficult in a file system – Obscure naming of editor buffer files and other temp files
Dividing Files into Chunks
• LBFS – Only looks for non-overlapping chunks in files – Sets chunk boundaries based on file contents • To divide a file into chunks, LBFS – Examines every (overlapping) 48-byte region of the file – Uses Rabin’s fingerprints to select boundary regions or breakpoints
Using Rabin’s Fingerprints
• A fingerprint is the polynomial representation of the data in a 48-byte region, modulo an irreducible polynomial • Boundary regions have the 13 least significant bits of their fingerprint equal to an arbitrary predefined value – Assuming random data, each region is a boundary with probability 2^-13, so the expected chunk size is 2^13 = 8K • Method is reasonably fast
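As a rough illustration of the fingerprint test, the sketch below treats the bytes of a region as a polynomial over GF(2) and reduces it modulo x^64 + x^4 + x^3 + x + 1. That modulus, the magic value 0x78, and the function names are assumptions made for this example; the slides do not say which polynomial or value LBFS uses. The code is bit-by-bit and not incremental, so it is far slower than a real implementation.

```python
DEGREE = 64
# x^64 + x^4 + x^3 + x + 1, a well-known irreducible polynomial over GF(2).
# Chosen for illustration only; not necessarily the one LBFS uses.
POLY = (1 << DEGREE) | 0x1B
WINDOW = 48              # region size in bytes, from the slides
MASK = (1 << 13) - 1     # keep the 13 least significant bits
MAGIC = 0x78             # arbitrary predefined value (hypothetical)


def rabin_fingerprint(data):
    """Remainder of data, read as a polynomial over GF(2), modulo POLY."""
    f = 0
    for byte in data:
        for shift in range(7, -1, -1):
            f = (f << 1) | ((byte >> shift) & 1)  # multiply by x, add next bit
            if f >> DEGREE:                       # degree reached 64: reduce
                f ^= POLY
    return f


def is_breakpoint(window):
    """True if this 48-byte region would end a chunk."""
    return (rabin_fingerprint(window) & MASK) == MAGIC


# x^64 reduces to x^4 + x^3 + x + 1, that is 0x1B.
print(hex(rabin_fingerprint(b"\x01" + b"\x00" * 8)))
```

Because 13 bits are compared, a region of random data passes the test with probability 1/8192, which is where the 8K expected chunk size comes from.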
How it works
[Figure: a file X partitioned into three chunks, and the same file X after one insertion inside the middle chunk; only the middle chunk becomes a new chunk]
Chunk boundaries are arbitrary and identified by the content of their boundary regions
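The behaviour in the figure can be sketched as follows. To stay fast in pure Python, this example swaps in a simple polynomial rolling hash for Rabin fingerprints, and it uses 10 mask bits (about 1K chunks) on random data so the demo stays small. The constants `BASE`, `MOD`, `MAGIC` and the function `chunk` are assumptions for illustration, not LBFS code.

```python
import random

WINDOW = 48            # bytes per examined region, from the slides
BASE = 257             # rolling-hash base (assumption)
MOD = (1 << 61) - 1    # rolling-hash modulus (assumption)
MAGIC = 0x78           # arbitrary predefined value (hypothetical)


def chunk(data, mask_bits=13):
    """Split data at positions where the hash of the last 48 bytes matches."""
    mask = (1 << mask_bits) - 1
    magic = MAGIC & mask
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i + 1 >= WINDOW and (h & mask) == magic:
            chunks.append(data[start:i + 1])        # boundary after this byte
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks


rng = random.Random(1)
old = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
new = old[:30000] + b"upon this continent, " + old[30000:]

old_chunks = chunk(old, mask_bits=10)
new_chunks = chunk(new, mask_bits=10)
known = set(old_chunks)
to_send = [c for c in new_chunks if c not in known]
print(len(new_chunks), "chunks,", len(to_send), "to send")
```

Since each boundary depends only on the 48 bytes that precede it, boundaries after the insertion reappear at shifted offsets, and only the chunk or chunks touching the insertion are new.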
Another way to look at it (I)
• Old File: Four score and seven years ago our fathers brought forth, a new country, conceived in liberty, and dedicated to the proposition that "all men are created equal."
Another way to look at it (II)
• New File: Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal."
Another way to look at it (III)
• Identify Chunks: Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal."
Another way to look at it (IV)
• Send back to the server the modified chunk, in compressed form: upon this continent, a new nation, conceived in liberty,
Pathological cases
• Having too many chunks requires too much aggregate bandwidth • Very large chunks would be too difficult to send in a single RPC • Chunk sizes must be between 2K and 64K – May have to artificially insert chunk boundaries when files are full of repeated sequences
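One way to picture the size limits is as a filter applied to the candidate breakpoints found by the fingerprint test. The sketch below is an assumption about how such a filter could look, not LBFS code: `apply_size_limits` is a hypothetical helper that ignores candidates closer than 2K to the previous boundary and forces a boundary whenever a chunk would exceed 64K.

```python
def apply_size_limits(candidates, length, min_size=2048, max_size=65536):
    """Return chunk end offsets after enforcing min and max chunk sizes.

    candidates: offsets (1..length) where the fingerprint test matched.
    length:     total file length; the file always ends a chunk.
    """
    ends, start = [], 0
    for end in sorted(candidates) + [length]:
        # No natural breakpoint for max_size bytes: insert artificial ones.
        while end - start > max_size:
            start += max_size
            ends.append(start)
        # Skip candidates that would create a chunk smaller than min_size,
        # except at end of file, where the remainder must be emitted.
        if (end - start >= min_size or end == length) and end > start:
            ends.append(end)
            start = end
    return ends


# A file with no breakpoints at all is cut every 64K.
print(apply_size_limits([], 200000))
# A file where every offset is a candidate is cut every 2K.
print(apply_size_limits(range(1, 10000), 10000))
```

Both extremes correspond to files full of repeated sequences, where the fingerprint either never matches or matches everywhere.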
The chunk database (I)
• The chunk database – Indexes chunks by first 64 bits of SHA-1 hash – Maps keys to (file, offset, count) triples • How to keep this database up to date? – Must update it whenever file is updated – Can still have problems with local updates at server site – Crashes can corrupt database contents
The chunk database (II)
• Best solution is to tolerate inconsistencies: – LBFS recomputes hash of any data chunk before using it – Recomputed value is also used to detect collisions • Very improbable but still possible
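The two chunk database slides can be summarized in a minimal in-memory sketch. The `ChunkIndex` class and the dictionary standing in for the file system are hypothetical; the point is only that entries are treated as hints and every chunk is re-hashed before it is used.

```python
import hashlib


class ChunkIndex:
    """Toy chunk database: 64-bit key -> (file, offset, count)."""

    def __init__(self, files):
        self.files = files   # stand-in for the file system: name -> bytes
        self.table = {}

    @staticmethod
    def key(digest):
        return digest[:8]    # first 64 bits of the SHA-1 hash

    def add(self, name, offset, count):
        data = self.files[name][offset:offset + count]
        digest = hashlib.sha1(data).digest()
        self.table[self.key(digest)] = (name, offset, count)

    def lookup(self, digest):
        """Return the chunk with this full SHA-1 digest, or None."""
        entry = self.table.get(self.key(digest))
        if entry is None:
            return None
        name, offset, count = entry
        data = self.files.get(name, b"")[offset:offset + count]
        # Recompute the hash: catches stale entries and 64-bit key collisions.
        if hashlib.sha1(data).digest() != digest:
            return None
        return data


files = {"speech.txt": b"Four score and seven years ago"}
index = ChunkIndex(files)
index.add("speech.txt", 0, 10)
wanted = hashlib.sha1(b"Four score").digest()
print(index.lookup(wanted))

# The file changes without the index being updated: the entry is now stale.
files["speech.txt"] = b"Five score and seven years ago"
print(index.lookup(wanted))
```

A stale or corrupted entry costs only a wasted lookup, because the data is never trusted without the recomputed hash matching.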
Protocol
• NFS with some changes: – Uses leases to implement close-to-open consistency (callbacks with limited lifetime) – Practices aggressive pipelining of RPC calls – Compresses all RPC traffic
Leases
• Leases are callbacks with – A limited lifetime (a few seconds) – A guarantee that server will not accept updates during lease lifetime without first notifying client • Advantages: – No problems with lost callbacks – Automatically expire when server crashes
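A server-side lease table along these lines might look like the sketch below. The class name, the 5-second duration, and the explicit `now` argument (used instead of a real clock so the example is repeatable) are all assumptions for illustration.

```python
class LeaseTable:
    """Toy server-side lease bookkeeping."""

    def __init__(self, duration=5.0):   # seconds; illustrative value
        self.duration = duration
        self.expiry = {}                 # (client, path) -> expiry time

    def grant(self, client, path, now):
        """Grant or renew a lease and return its expiry time."""
        self.expiry[(client, path)] = now + self.duration
        return now + self.duration

    def holders(self, path, now):
        """Clients whose lease on path has not expired yet."""
        return sorted(c for (c, p), t in self.expiry.items()
                      if p == path and t > now)

    def must_notify(self, writer, path, now):
        """Clients to notify before accepting an update to path."""
        return [c for c in self.holders(path, now) if c != writer]


leases = LeaseTable(duration=5.0)
leases.grant("alice", "/doc", now=0.0)
print(leases.must_notify("bob", "/doc", now=3.0))   # lease still valid
print(leases.must_notify("bob", "/doc", now=6.0))   # lease has expired
```

Expired entries need no cleanup message: once the time has passed, the server owes that client nothing, which is why lost callbacks and server crashes are not a problem.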
An example (I)
[Timeline figure: Alice requests a lease from the server. During the duration of the lease, Alice controls the file. When the lease expires, she must renew it.]
An example (II)
[Timeline figure: Alice got a lease from the server and controls the file during the duration of the lease. Bob also requests a lease.]
An example (III)
• When server receives Bob's request, – It will try to contact Alice and break the lease • Alice will then flush all the blocks she had updated and invalidate the contents of her cache – If Alice does not answer, server must wait until Alice's lease expires
File Consistency
• LBFS – Caches entire files – Implements close-to-open consistency • Client – Gets a lease first time a file is opened for read – Renews expired leases by requesting file attributes – Will then check if cached copy is still current
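The client-side steps above can be sketched as follows. `ToyServer`, `open_file`, the use of a modification time as the "is it current" check, and the explicit `now` argument are assumptions for this example; the sketch also leaves out the server notifying clients when a file changes during a lease.

```python
class ToyServer:
    """Stand-in server holding whole files with a modification time."""
    LEASE = 5.0  # seconds; illustrative value

    def __init__(self):
        self.files = {}   # path -> (data, mtime)
        self.reads = 0    # number of whole-file fetches served

    def write(self, path, data, now):
        self.files[path] = (data, now)

    def getattr(self, path, now):
        """Return attributes; fetching them also renews the lease."""
        _, mtime = self.files[path]
        return {"mtime": mtime, "lease_expiry": now + self.LEASE}

    def read(self, path):
        self.reads += 1
        return self.files[path][0]


def open_file(path, cache, server, now):
    """Return file contents, fetching from the server only when needed."""
    entry = cache.get(path)
    if entry is not None and entry["lease_expiry"] > now:
        return entry["data"]                 # lease valid: use cached copy
    attrs = server.getattr(path, now)        # renew lease via attributes
    if entry is not None and entry["mtime"] == attrs["mtime"]:
        entry["lease_expiry"] = attrs["lease_expiry"]
        return entry["data"]                 # cached copy is still current
    data = server.read(path)                 # stale or missing: refetch
    cache[path] = {"data": data, "mtime": attrs["mtime"],
                   "lease_expiry": attrs["lease_expiry"]}
    return data


server = ToyServer()
server.write("/doc", b"v1", now=0.0)
cache = {}
print(open_file("/doc", cache, server, now=1.0), server.reads)
print(open_file("/doc", cache, server, now=2.0), server.reads)
```

The second open is served entirely from the cache because the lease obtained by the first open has not expired.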
Reads and writes
• Use additional calls not in NFS – GETHASH for reads – MKTMPFILE and three others for writes • Server ensures atomicity of updates by writing them first into a temporary file
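The read path can be sketched as an exchange of hashes followed by reads of only the missing chunks. Everything here is a simplified stand-in: `ToyServer.gethash` imitates the role of GETHASH but not its real arguments or wire format, and the write path with its temporary file is not shown.

```python
import hashlib


def sha1(data):
    return hashlib.sha1(data).digest()


class ToyServer:
    """Stand-in server that stores one file as a list of chunks."""

    def __init__(self, chunks):
        self.chunks = chunks

    def gethash(self):
        """Return (hash, size) for every chunk of the file, in order."""
        return [(sha1(c), len(c)) for c in self.chunks]

    def read(self, index):
        return self.chunks[index]


def fetch_file(server, cache):
    """Rebuild the file, reading only chunks missing from the cache.

    cache maps SHA-1 digest -> chunk bytes.
    Returns (file contents, number of chunk bytes transferred).
    """
    parts, transferred = [], 0
    for i, (digest, _size) in enumerate(server.gethash()):
        data = cache.get(digest)
        if data is None:                 # not in the cache: ask the server
            data = server.read(i)
            transferred += len(data)
            cache[digest] = data
        parts.append(data)
    return b"".join(parts), transferred


old = [b"Four score and seven years ago our fathers brought forth, ",
       b"a new country, conceived in liberty, ",
       b"and dedicated to the proposition"]
new = [old[0],
       b"upon this continent, a new nation, conceived in liberty, ",
       old[2]]

cache = {sha1(c): c for c in old}        # client already holds the old file
contents, sent = fetch_file(ToyServer(new), cache)
print(sent, "of", len(contents), "bytes transferred")
```

Only the modified middle chunk crosses the network; the first and last chunks are served from the client's cache.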
Security
• More of an issue than in a well-controlled LAN • Uses SFS security infrastructure – Servers have public keys and authenticate themselves to clients • New Problem: – All LBFS users can check whether file system contains a specific chunk of data – Requires observing subtle timing differences
Implementation
• Some problems with the way NFS allocates i-node numbers
Evaluation (I)
• Compared upstream and downstream bandwidth of LBFS with those of – CIFS (Common Internet File System) – NFS – AFS – LBFS with leases and gzip but w/o chunking • Downstream traffic benefits most from chunking
Evaluation (II)
First four bars of each workload show upstream bandwidth, second four downstream bandwidth
Conclusions
• LBFS bandwidth usage is about one order of magnitude lower than that of conventional network file systems