Transcript Document
The Google File System by S. Ghemawat, H. Gobioff, and S.-T. Leung
CSCI 485 lecture by Shahram Ghandeharizadeh, Computer Science Department, University of Southern California

Primary Functionality of Google
Search content on the web in browsing mode.
Open world assumption: if a search with Google does not return results, it does not mean that the referenced content is nonexistent. It only means that Google did not know about it when the search was issued. Google may retrieve results if the search is issued again.
Do not index/find me: Google provides tags that enable an information provider to prevent Google from indexing its pages.
No one gets angry if Google does not retrieve information known to exist on the Internet.
How is this different from financial applications?

Functionality
IR (Google): Search content on the web in browsing mode under the open world assumption described above; honors "do not index/find me" tags; no one gets angry if content known to exist on the Internet is not retrieved.
DB: Query based: looking for a needle in a haystack. Closed world assumption: a data item that is not known does not exist. A query must retrieve correct results 100% of the time! If a customer insists the bank cannot find his or her account because the customer has used Google's "do not find me" tags, the customer is kicked out! Customers become angry if the system retrieves incorrect data.

Key Observation
IR: Okay to return either no results or incorrect results. Acceptable for a user search to observe stale data.
DB: Not okay to return incorrect results. A transaction must observe consistent data. SQL front end.

Big Picture
A shared-nothing architecture consisting of thousands of nodes! A node is an off-the-shelf, commodity PC.
The software stack, from top to bottom: Yahoo's Pig Latin, Google's Map/Reduce framework, Google's Bigtable data model, Google File System, ...
Divide & conquer. Source code for Pig and Hadoop is available for free download.

Data Shipping
The client retrieves data from the node and performs the computation locally.
Limitation: dumb servers; the approach uses up the limited network bandwidth.

Function Shipping
The client ships the function f(x) to the node for processing; only the relevant output of f(x) is sent back to the client.
The function f(x) should produce less data than the original data stored in the database, which minimizes demand for the network bandwidth (see the sketch below).

A Google application (configured with the GFS client) may run on the same PC as the one hosting a chunkserver, provided that machine resources are not overwhelmed and the lower reliability is acceptable.
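As a minimal illustration of the data-shipping versus function-shipping trade-off, consider the sketch below. The Node class and its methods are invented for illustration; a real deployment would reach the node over the network, so the volume of data crossing the wire is what matters.

```python
# Hypothetical sketch: contrasts data shipping with function shipping.
class Node:
    """Stands in for a remote storage server holding a large dataset."""
    def __init__(self, records):
        self.records = records

    def read_all(self):
        # Data shipping: the entire dataset is transmitted to the client.
        return list(self.records)

    def run(self, func):
        # Function shipping: f(x) runs where the data lives; only its
        # (presumably much smaller) output is transmitted.
        return func(self.records)

node = Node(range(1_000_000))

# Data shipping: ship ~1M records, then filter locally at the client.
shipped = node.read_all()
result_data_shipping = [x for x in shipped if x % 99991 == 0]

# Function shipping: ship the filter, receive only the matching records.
result_function_shipping = node.run(lambda recs: [x for x in recs if x % 99991 == 0])

assert result_data_shipping == result_function_shipping
```

Both paths compute the same answer; function shipping simply moves far fewer bytes when f(x) is selective.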
References
Pig Latin: Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
MapReduce: Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.
Bigtable: Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
GFS: Ghemawat et al. The Google File System. SOSP 2003.

Overview: GFS
A highly available, distributed file system for inexpensive commodity PCs.
Treats node failures as the norm rather than the exception.
Stores and retrieves multi-GB files.
Assumes files are append-only (instead of updates that modify existing data in place).
An atomic append operation enables multiple clients to append to a file with minimal synchronization.
A relaxed consistency model simplifies the file system and enhances performance.

Google File System: Assumptions

GFS: Interfaces
Create, delete, open, close, read, and write files.
Snapshot a file: create a copy of the file.
Record append: allows multiple clients to append data to the same file concurrently, while guaranteeing the atomicity of each individual client's append.

GFS: Architecture
One master and multiple chunkservers.
A file is partitioned into fixed-size (64 MB) chunks. Each chunk has a globally unique 64-bit chunk handle. Each chunk is replicated on several chunkservers; the degree of replication is application specific, with a default of 3.
Unix allocates space lazily (so large chunks do not waste disk space), and many small logical files are stored in one file.
Software: the master maintains all file system metadata: the namespace, access control information, the mapping from files to chunks, and the current location of chunks. The GFS client caches metadata about the file system. The client chooses one of the replicas and receives data from that chunkserver directly. Neither the client nor the chunkserver caches file data.

GFS Master
One master simplifies the software design, but it is a single point of failure.
The master monitors the availability of chunkservers using heartbeat messages.
The master maintains the file and chunk namespaces, the mapping from files to chunks, and the location of each chunk's replicas.
The master does not store chunk location information persistently: when the master starts (and whenever a chunkserver joins), it asks each chunkserver about its chunks.

Mutation = Update
A mutation is an operation that changes the contents of either metadata (delete or create a file) or a chunk (append a record).
A content mutation is performed on all of a chunk's replicas. The master grants a chunk lease to one of the replicas, the primary. The primary picks a serial order for all mutations to the chunk.
Lease: granted by the master, typically for 60 seconds. The primary may request extensions. If the master loses communication with a primary, it can safely grant a new lease to another replica after the current lease expires.
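A minimal sketch of the lease bookkeeping just described, using hypothetical class and method names (in GFS, lease grants and extension requests are piggybacked on the heartbeat messages exchanged between the master and chunkservers):

```python
import time

LEASE_SECONDS = 60  # default lease duration mentioned in the lecture

class ChunkLease:
    """Hypothetical sketch of the master's per-chunk lease bookkeeping."""
    def __init__(self):
        self.primary = None      # chunkserver currently acting as primary
        self.expires_at = 0.0

    def grant(self, replica):
        now = time.time()
        if self.primary is not None and now < self.expires_at:
            # An unexpired lease may still be honored by an unreachable
            # primary, so the master must wait for it to expire before
            # naming a new primary.
            raise RuntimeError("lease still outstanding")
        self.primary = replica
        self.expires_at = now + LEASE_SECONDS
        return replica

    def extend(self, replica):
        # Only the current primary may extend an unexpired lease.
        if replica == self.primary and time.time() < self.expires_at:
            self.expires_at = time.time() + LEASE_SECONDS
            return True
        return False
```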
Master & Logging
The master stores three types of metadata: (1) the file and chunk namespaces, (2) the mapping from files to chunks, and (3) the locations of each chunk's replicas.
The first two types are kept persistent by logging mutations (updates) to an operation log stored on the master's local disk, and by replicating that operation log on multiple machines.
What is required to support logging? Uniquely identifying transactions and data items, and checkpointing.
Files and chunks, as well as their versions, are uniquely identified by the logical times at which they were created.
GFS responds to a client operation only after flushing the corresponding log record to disk both locally and remotely.
After a failure, during the recovery phase, the master recovers its file system state by replaying the operation log.
Checkpoints are fuzzy. The master keeps a few older checkpoints and log files and deletes the ones prior to those.

Master & Locking
Namespace management: GFS represents its namespace as a lookup table mapping full pathnames to metadata. The path /d1/d2/…/dn/fileA involves the following pathnames: /d1, /d1/d2, …, /d1/d2/…/dn, and /d1/d2/…/dn/fileA.
Each node in the namespace tree has an associated read-write lock, and each master operation acquires a set of locks before it performs its read or mutation. Typically, an operation involving /d1/d2/…/dn/fileA acquires read locks on /d1, /d1/d2, …, /d1/d2/…/dn and either a read or a write lock on /d1/d2/…/dn/fileA.
A read lock is the same as a shared (S) lock; a write lock is the same as an exclusive (X) lock.

Example
Operation 1: copy directory /home/user to /save/user.
Operation 2: create /home/user/foo.
Could they have used IS and IX locks? (The sketch below shows the lock sets the two operations acquire.)
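A minimal sketch of the locking rule above, with hypothetical helper names. Operation 1 write-locks /home/user and /save/user (and read-locks their parents); operation 2 read-locks /home and /home/user and write-locks /home/user/foo. The two lock sets conflict on /home/user, so the master serializes the operations.

```python
def path_prefixes(full_path):
    """Every prefix of /d1/d2/.../leaf, e.g. ['/d1', '/d1/d2', ..., full_path]."""
    parts = full_path.strip("/").split("/")
    return ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]

def lock_set(full_path, leaf_mode):
    """Read locks on every proper prefix, plus leaf_mode ('read' or 'write')
    on the full pathname itself."""
    prefixes = path_prefixes(full_path)
    locks = {p: "read" for p in prefixes[:-1]}
    locks[prefixes[-1]] = leaf_mode
    return locks

# Operation 1: copy /home/user to /save/user (write locks on both directory leaves).
op1 = {**lock_set("/home/user", "write"), **lock_set("/save/user", "write")}
# Operation 2: create /home/user/foo.
op2 = lock_set("/home/user/foo", "write")

# op1 holds a write lock on /home/user while op2 needs a read lock on it,
# so the master serializes the two operations.
print(op1)  # {'/home': 'read', '/home/user': 'write', '/save': 'read', '/save/user': 'write'}
print(op2)  # {'/home': 'read', '/home/user': 'read', '/home/user/foo': 'write'}
```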
Atomic Record Appends
Background: With traditional writes, a client specifies the offset at which data is to be written; GFS cannot serialize concurrent writes to the same region. With record append, the client specifies only the data. GFS appends the record to the file at least once atomically, at an offset of GFS's choosing, and returns that offset to the client.
What does "atomically" mean? The record is written as one contiguous sequence of bytes.
Does GFS write the record partially? Yes, a record might be written partially to a file replica.
How? First consider how regular chunk mutations (updates) are supported.

Atomic Record Appends: How?
The client pushes the data to all replicas of the last chunk of the file and then sends its write request to the primary. The primary appends the data to its replica and tells the secondaries to write the data at the exact offset where it has written. If all secondaries succeed, the primary replies success to the client. If the record append fails at any replica, the primary reports an error and the client retries the operation. Because one or more of the replicas may have succeeded fully (or written the record partially), replicas of the same chunk may contain different data, including duplicates of the same record. GFS does not guarantee that all replicas are byte-wise identical; it guarantees that the record is written at the same offset, at least once, in its entirety (as an atomic unit). A sketch of this retry behavior appears at the end of these notes.

Summary
File namespace mutations are managed by requiring the master to implement the ACID properties: locking guarantees atomicity, consistency, and isolation, while the operation log provides durability.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations:
A file region is consistent if all clients always see the same data, regardless of which replicas they read from.
A file region is defined after a mutation if it is consistent and clients see what the mutation wrote in its entirety.
When a mutation succeeds without interference from concurrent writers, the affected region is defined (and, by implication, consistent).
Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may consist of mingled fragments from the multiple mutations.
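A minimal sketch of the at-least-once append behavior described above. All class and method names here are hypothetical; the point is only that a client retry after a partial failure can leave duplicate or partial copies of the record on some replicas, while every replica eventually holds the record in full at the offset the primary finally reports.

```python
def record_append_once(primary, secondaries, record):
    """One attempt, driven by the primary: it picks the offset, writes locally,
    and directs every secondary to write the record at that same offset.
    Returns the offset on success, or None if any replica fails."""
    offset = primary.choose_offset(len(record))   # hypothetical call
    primary.write_at(offset, record)
    for s in secondaries:
        if not s.write_at(offset, record):        # a failed or partial write leaves
            return None                           # inconsistent data behind
    return offset

def append_at_least_once(client, file_handle, record, max_retries=5):
    """Client-side retry loop: the record ends up in the file at least once,
    in its entirety, but earlier failed attempts may leave duplicates or
    partial fragments on some replicas."""
    for _ in range(max_retries):
        primary, secondaries = client.locate_last_chunk(file_handle)  # hypothetical call
        client.push_data([primary] + secondaries, record)             # data flows to all replicas first
        offset = record_append_once(primary, secondaries, record)
        if offset is not None:
            return offset                         # the offset GFS chose for the record
    raise IOError("record append failed after retries")
```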