Large Scale File Systems
by Dale Denis
June 2013
• The need for Large Scale File Systems.
• Big Data.
• The Network File System (NFS).
• The use of inexpensive commodity hardware.
• Large Scale File Systems.
• The Google File System (GFS).
• The Hadoop Distributed File System (HDFS).
Outline
• International Data Corporation (IDC):
• The digital universe will grow to 35 zettabytes
globally by 2020.
• The New York Stock Exchange generates
over one terabyte of new trade data per day.
• The Large Hadron Collider near Geneva, Switzerland, will produce approximately 15 petabytes of data per year.
• Facebook hosts approximately 10 billion
photos, taking up 2 petabytes of data.
Big Data
• 85% of the data being stored is unstructured data.
• The data does not require frequent updating
once it is written, but it is read often.
• This scenario is complementary to data that is more suitable for an RDBMS.
• Relational Database Management Systems are
good at storing structured data:
• Microsoft SQL Server
• Oracle
• MySQL
Big Data
• NFS: The ubiquitous distributed file system.
• Developed by Sun Microsystems in the early 1980s.
• While its design is straightforward, it is also
very constrained:
• The files in an NFS volume must all reside on a
single machine.
• All clients must go to this machine to retrieve their
data.
NFS
• The bottleneck of reading data from a drive:
• Transfer speeds have not kept up with storage
capacity.
• 1990: A typical drive with 1370 MB capacity had a transfer speed of 4.4 MB/s, so it would take about 5 minutes to read all of the drive's data.
• 2010: A terabyte drive with a typical transfer speed of 100 MB/s takes close to three hours to read all of the data (worked out below).
NFS
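A quick back-of-the-envelope check of the two read times quoted above (a minimal sketch; the capacities and transfer speeds are the slide's figures):

```java
// Full-drive read time = capacity / transfer speed.
public class DriveReadTime {
    public static void main(String[] args) {
        double secs1990 = 1370 / 4.4;         // 1990: 1370 MB at 4.4 MB/s -> ~311 s
        double secs2010 = 1_000_000 / 100.0;  // 2010: 1 TB (1,000,000 MB) at 100 MB/s -> 10,000 s
        System.out.printf("1990: %.0f s (~%.0f min)%n", secs1990, secs1990 / 60);
        System.out.printf("2010: %.0f s (~%.1f h)%n", secs2010, secs2010 / 3600);
    }
}
```

The 1990 drive can be read end to end in about 5 minutes, while the 2010 drive needs roughly 2.8 hours: capacity grew about 700x while transfer speed grew only about 20x.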
• Cost-effective.
• One Google server rack:
  • 176 processors
  • 176 GB of memory
  • $278,000
• Commercial grade server:
  • 8 processors
  • 1/3 the memory
  • Comparable amount of disk space
  • $758,000
• Scalable.
• Failure is to be expected.
Inexpensive Commodity Hardware
• Apache Nutch:
  • Doug Cutting created Apache Lucene.
  • Apache Nutch was a spin-off of Lucene.
  • Nutch was an open source web search engine.
  • Development started in 2002.
  • The architecture had scalability issues due to the very large files generated as part of the web crawl and indexing process.
A Solution is Born
• In 2003 the paper “The Google File System”
was published.
• In 2004 work began on an open source implementation of the Google File System (GFS) for the Nutch web search engine.
• The project was called the Nutch Distributed File
System (NDFS).
• In 2004 Google published a paper introducing
MapReduce.
• By early 2005 the Nutch developers had a working implementation of MapReduce, and by the end of the year all of the major algorithms in Nutch had been ported to run MapReduce on NDFS.
A Solution is Born
• In early 2006 they realized that the MapReduce implementation and NDFS had potential beyond web search.
  • The project was moved out of Nutch and was renamed Hadoop.
• In 2006 Doug Cutting was hired by Yahoo!.
• Hadoop became an open source project at Yahoo!.
• In 2008 Yahoo! announced that its production search index was running on a 10,000-core Hadoop cluster.
A Solution is Born
• A scalable distributed file system for large
distributed data-intensive applications.
• Provides fault tolerance while running on
inexpensive commodity hardware.
• Delivers high aggregate performance to a
large number of clients.
• The design was driven by observations at
Google of their application workloads and
technological environment.
• The file system API and the applications were co-designed.
The Google File System (GFS)
• The system is built from many inexpensive
commodity components that often fail.
• The system must tolerate, detect, and recover from component failures.
• The system stores a modest number of large
files.
• Small files must also be supported, but the system doesn't need to be optimized for them.
• The workloads primarily consist of large
streaming reads and small random reads.
GFS Design Goals and Assumptions
• The workloads also have many large,
sequential writes that append data to files.
• Small writes at arbitrary positions in a file are to be supported but do not have to be efficient.
• Multiple clients must be able to concurrently
append to the same file.
• High sustained bandwidth is more important
than low latency.
• The system must provide a familiar file
system interface.
GFS Design Goals and Assumptions
• Supports operations to create, delete, open,
close, read, and write files.
• Has snapshot and record append operations.
• Record append operations allow multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each client's append (a hypothetical interface sketch follows below).
GFS Interface
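Since GFS is proprietary, its client API is not public; the hypothetical Java interface below merely restates the operations listed on this slide (all names are illustrative assumptions):

```java
// Hypothetical sketch of a GFS client interface; not the real API.
public interface GfsClient {
    void create(String path);
    void delete(String path);
    long open(String path);                                // returns a file handle
    void close(long handle);
    int read(long handle, long offset, byte[] buffer);
    void write(long handle, long offset, byte[] data);
    void snapshot(String sourcePath, String targetPath);   // low-cost copy of a file or tree
    long recordAppend(long handle, byte[] record);         // returns the offset GFS chose
}
```

Note that recordAppend returns an offset rather than taking one: the client supplies only the data, and the system decides where it lands, which is what makes concurrent appends atomic per record.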
• A GFS cluster consists of:
  • A single master.
  • Multiple chunk servers.
• The GFS cluster is accessed by multiple clients.
• Files are divided into fixed-size chunks (a chunk-count sketch follows below).
  • The chunks are 64 MB by default; this is configurable.
• Chunk servers store the chunks on local disks.
• Each chunk is replicated on multiple servers.
  • The standard replication factor is 3.
GFS Architecture
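To make the chunking concrete, a small sketch using the slide's defaults (the 1 GB file size is an arbitrary example):

```java
// How many chunks and replicas a file occupies under the GFS defaults above.
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB chunks
    static final int REPLICATION = 3;                 // standard replication factor

    public static void main(String[] args) {
        long fileSize = 1L * 1024 * 1024 * 1024;      // example: a 1 GB file
        long chunks = (fileSize + CHUNK_SIZE - 1) / CHUNK_SIZE; // ceiling division
        System.out.println("Chunks:   " + chunks);               // 16
        System.out.println("Replicas: " + chunks * REPLICATION); // 48
    }
}
```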
• Figure: Two files being stored on three chunk servers with a replication factor of 2.
GFS Architecture
• Maintains all file system metadata.
  • Namespace information.
  • Access control information.
  • Mapping from files to chunks.
  • The current locations of the chunks.
• Controls system-wide activities.
  • Executes all namespace operations.
  • Chunk lease management.
  • Garbage collection.
  • Chunk migration between the servers.
• Communicates with each chunk server via heartbeat messages.
GFS Single Master
• Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunk servers (sketched below).
• A client sends the master a request for a file and the master responds with the locations of all of the chunks.
• The client then sends a request to one of the chunk servers for a replica.
GFS Single Master
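A sketch of that read path (GFS's client library is not public, so all types and method names here are illustrative assumptions):

```java
// Hypothetical types; metadata comes from the master, data from chunk servers.
interface Master { ChunkInfo lookup(String path, int chunkIndex); }
interface ChunkServer { byte[] read(long chunkHandle); }

class ChunkInfo {
    final long handle;
    final java.util.List<ChunkServer> replicas;
    ChunkInfo(long handle, java.util.List<ChunkServer> replicas) {
        this.handle = handle;
        this.replicas = replicas;
    }
}

class GfsReadPath {
    static byte[] readChunk(Master master, String path, int chunkIndex) {
        // One metadata round trip: chunk handle plus replica locations.
        ChunkInfo info = master.lookup(path, chunkIndex);
        // Data is fetched directly from a replica (e.g., the closest one);
        // no file data ever flows through the master.
        ChunkServer replica = info.replicas.get(0);
        return replica.read(info.handle);
    }
}
```

In the real system the client also caches chunk locations, so most reads skip the master entirely.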
• The chunk size is large.
• Advantages:
  • Reduces the client's need to interact with the master.
    • Helps to keep the master from being a bottleneck.
  • Reduces the size of the metadata stored on the master.
    • The master is able to keep all of the metadata in memory.
  • Reduces network overhead by keeping persistent TCP connections to the chunk servers over an extended period of time.
• Disadvantages:
  • Hotspots with small files if too many clients are accessing the same file.
• Chunks are stored on local disks as Linux files.
GFS Chunks
• The master stores three types of metadata:
  • File and chunk namespaces.
  • The mapping from files to chunks.
  • The location of each chunk's replicas.
• The master doesn't store the locations of the replicas persistently.
  • The master asks each chunk server about its chunks:
    • At master startup.
    • When a chunk server joins the cluster.
• The master also includes rudimentary support
for permissions and quotas.
GFS Metadata
• The operation log is central to GFS!
  • The operation log contains a historical record of critical metadata changes.
  • Files and chunks are uniquely identified by the logical times at which they were created.
  • The log is replicated on multiple remote machines.
  • The master recovers its state by replaying the operation log.
  • Monitoring infrastructure outside of GFS starts a new master process if the old master fails.
  • “Shadow Masters” provide read-only access when the primary master is down.
GFS Metadata
• A mutation is an operation that changes the
contents or metadata of a chunk.
• Leases are used to maintain a consistent
mutation order across replicas.
  • The master grants a lease to one of the replicas, which is called the primary.
  • The primary picks a serial order for all mutations to the chunk.
• The lease mechanism is designed to minimize
the management overhead at the master.
GFS Leases and Mutations
1. The client asks the master which chunk server
holds the current lease for a chunk and the
locations of the other replicas.
2. The client pushes the data to all of the replicas.
3. When all replicas acknowledge receiving the
data the client sends a write request to the
primary.
4. The primary serializes the mutations and
applies the changes to its own state.
GFS The anatomy of a mutation
5. The primary forwards the write request to all of the secondary replicas.
6. The secondaries apply the mutations in the same serial order assigned by the primary.
7. The secondaries reply to the primary that they have completed.
8. The primary replies to the client. (The full flow is sketched in code below.)
GFS The anatomy of a mutation
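The eight steps compress into a short control-flow sketch (again hypothetical: the real GFS server API is not public):

```java
// Hypothetical replica interfaces for the write protocol above.
interface Replica { void pushData(byte[] data); void apply(long serial); }
interface PrimaryReplica extends Replica { long assignSerial(); }

class GfsWritePath {
    static void write(PrimaryReplica primary, java.util.List<Replica> secondaries, byte[] data) {
        // Steps 1-2: the master has already told us who holds the lease;
        // push the data to every replica (primary and secondaries).
        primary.pushData(data);
        for (Replica s : secondaries) s.pushData(data);

        // Steps 3-4: send the write request to the primary, which picks a
        // serial number for the mutation and applies it to its own state.
        long serial = primary.assignSerial();
        primary.apply(serial);

        // Steps 5-7: the primary forwards the request; each secondary applies
        // the mutation in the same serial order and replies when done.
        for (Replica s : secondaries) s.apply(serial);
        // Step 8: the primary replies to the client.
    }
}
```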
• The data flow and the control flow have been decoupled.
• The data is pushed linearly along a carefully picked chain of chunk servers in a pipelined fashion.
• Each chunk server forwards the data to the next nearest chunk server in the chain.
• The goal is to fully utilize each machine's network bandwidth and avoid bottlenecks.
GFS The anatomy of a mutation
• Figure: Write Control and Data Flow.
GFS The anatomy of a mutation
• Record appends are atomic.
  • The client specifies the data, and GFS appends it to the file atomically at an offset of GFS's choosing.
  • In a traditional write, the client specifies the offset at which data is to be written.
• The primary replica checks whether appending to the current chunk would exceed the maximum chunk size. If so, the primary pads the current chunk and replies to the client that the operation should be retried on the next chunk (sketched below).
GFS Record Append
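The primary's decision can be sketched as follows (the 64 MB limit comes from the architecture slides; the method and return convention are illustrative assumptions):

```java
// Hypothetical primary-side check for an atomic record append.
class RecordAppendPrimary {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    /** Returns the offset chosen for the record, or -1 if the client
     *  should retry the append on the next chunk. */
    static long tryAppend(long bytesUsedInChunk, byte[] record) {
        if (bytesUsedInChunk + record.length > CHUNK_SIZE) {
            // Appending would overflow this chunk: pad the remainder
            // and tell the client to retry on a fresh chunk.
            return -1;
        }
        // Append at an offset of the system's choosing: the current end.
        return bytesUsedInChunk;
    }
}
```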
• Namespace Management and Locking.
  • Locks allow multiple operations to be active at the same time.
  • Locks over regions of the namespace ensure proper serialization.
  • Each master operation acquires a set of locks before it runs.
  • The centralized server approach was chosen in order to simplify the design.
  • Note: GFS does not have a per-directory data structure that lists all the files in that directory.
GFS Master Operations
• Replica Placement.
  • The dual goals of the replica placement policy:
    • Maximize data reliability and availability.
    • Maximize network bandwidth utilization.
  • Chunks must not only be spread across machines, they must also be spread across racks:
    • Fault tolerance.
    • To exploit the aggregate bandwidth of multiple racks.
GFS Master Operations
• The master rebalances replicas periodically.
  • Replicas are removed from chunk servers with below-average free space.
  • Through this process the master gradually fills up a new chunk server.
• Chunks are re-replicated as the number of replicas falls below a user-specified goal:
  • Due to failure.
  • Due to data corruption.
• Garbage collection is done lazily at regular intervals.
GFS Master Operations
• When a file is deleted, it is renamed to a hidden name and given a deletion timestamp.
• After three days the file is removed from the namespace.
  • The time interval is configurable.
  • Hidden files can be undeleted.
• In the regular heartbeat message the chunk server reports a subset of the chunks that it has.
• The master replies with the IDs of the chunks that are no longer in the namespace.
• The chunk server is free to delete chunks that are not in the namespace.
GFS Garbage Collection
• Each chunk server uses checksumming to detect the corruption of stored data (a sketch follows below).
  • Each chunk is broken into 64 KB blocks, and each block has a 32-bit checksum.
  • During idle periods the chunk servers are scanned to verify the contents of inactive chunks.
• GFS servers generate diagnostic logs that record many significant events:
  • Chunk servers going online and offline.
  • All RPC requests and replies.
GFS Data Integrity
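The paper does not spell out the checksum algorithm, but a 32-bit checksum over each 64 KB block can be sketched with the JDK's built-in CRC32 (the block and checksum sizes are the slide's; the choice of CRC32 is an assumption for illustration):

```java
import java.util.zip.CRC32;

// One 32-bit checksum per 64 KB block of a chunk.
public class BlockChecksums {
    static final int BLOCK_SIZE = 64 * 1024;

    static long[] checksum(byte[] chunk) {
        int blocks = (chunk.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        long[] sums = new long[blocks];
        for (int i = 0; i < blocks; i++) {
            int from = i * BLOCK_SIZE;
            int len = Math.min(BLOCK_SIZE, chunk.length - from);
            CRC32 crc = new CRC32();
            crc.update(chunk, from, len);
            sums[i] = crc.getValue(); // compared against the stored value on every read
        }
        return sums;
    }
}
```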
• Experiment 1:
  • One chunk server with approx. 15,000 chunks containing 600 GB of data was taken offline.
  • The number of concurrent clonings was restricted to 40% of the total number of chunk servers.
  • All chunks were restored in 23.3 minutes, at an effective replication rate of 440 MB/s.
GFS Recovery Time
• Experiment 2:
  • Two chunk servers with approx. 16,000 chunks and 660 GB of data were taken offline.
  • Cloning was set to a high priority.
  • All chunks were restored to 2x replication within 2 minutes.
  • The cluster was back in a state where it could tolerate another chunk server failure without data loss.
GFS Recovery Time
• Test Environment:
  • 16 client machines.
  • 19 GFS servers.
    • 16 chunk servers.
    • 1 master, 2 master replicas.
  • All machines had the same configuration.
  • Each machine had a 100 Mbps full-duplex Ethernet connection.
  • 2 HP 2524 10/100 switches.
    • All 19 servers were connected to one switch and all 16 clients were connected to the other.
    • A 1 Gbps link connected the two switches.
GFS Measurements
• N clients read simultaneously from the file system.
• The theoretical limit peaks at an aggregate of 125 MB/s (1 Gbps ÷ 8) when the 1 Gbps link is saturated.
• The theoretical per-client limit is 12.5 MB/s when the 100 Mbps network interface is saturated.
• The observed read rate is 10 MB/s when one client is reading.
GFS Measurements - Reads
• N clients write simultaneously to N distinct files.
• The theoretical limit peaks at an aggregate of 67 MB/s because each byte has to be written to 3 of the 16 chunk servers (16 × 12.5 MB/s ÷ 3 ≈ 67 MB/s).
• The observed write rate was 6.3 MB/s. The slow rate was attributed to issues with the network stack that did not interact well with the GFS pipelining scheme.
• In practice this has not been a problem.
GFS Measurements - Writes
• N clients append simultaneously to a single file.
• The performance is limited by the network bandwidth of the chunk servers that store the last chunk of the file.
• As the number of clients increases, the congestion on the chunk servers also increases.
GFS Measurements - Appends
• The Hadoop Distributed File System: the open source distributed file system for large data sets that is based upon the Google File System.
• As with GFS, HDFS is a distributed file system that is designed to run on commodity hardware.
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
• HDFS is not a general-purpose file system.
The Hadoop Distributed File System (HDFS)
• Hardware failure is the norm.
  • The detection of faults and quick, automatic recovery is a core architectural goal.
• Large Data Sets.
  • Applications that run on HDFS have large data sets.
    • A typical file is gigabytes to terabytes in size.
    • It should support tens of millions of files.
• Streaming Data Access.
  • Applications that run on HDFS need streaming access to their data sets.
  • The emphasis is on high throughput of data access rather than low latency.
HDFS Design Goals and Assumptions
• Simple Coherency Model.
  • Once a file has been written it cannot be changed.
  • There is a plan to support appending writes in the future.
• Portability across heterogeneous hardware and software platforms.
  • In order to facilitate adoption of HDFS.
  • Hadoop is written in Java.
• Provide interfaces for applications to move themselves closer to where the data is.
  • “Moving computation is cheaper than moving data.”
HDFS Design Goals and Assumptions
• An HDFS cluster consists of:
  • A NameNode.
  • Multiple DataNodes.
• Files are divided into fixed-size blocks.
  • The blocks are 64 MB by default; this is configurable.
  • The goal is to minimize the cost of seeks (worked out below).
    • Seek time should be about 1% of transfer time.
    • As transfer speeds increase, the block size can be increased.
  • Block sizes that are too big will cause MapReduce jobs to run slowly.
HDFS Architecture
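The 1% rule turns into a quick calculation (the 10 ms seek time is an assumed typical value, not from the slide; the 100 MB/s transfer speed echoes the earlier drive slide):

```java
// If seek time should be ~1% of transfer time, a block should take
// about 100 seek-times to transfer.
public class BlockSizeRule {
    public static void main(String[] args) {
        double seekMs = 10;                                 // assumed typical seek time
        double transferMBps = 100;                          // typical 2010 transfer speed
        double transferMs = seekMs / 0.01;                  // 1,000 ms of transfer per seek
        double blockMB = transferMBps * transferMs / 1000;  // = 100 MB
        System.out.println("Suggested block size: ~" + blockMB + " MB");
    }
}
```

With these numbers the rule suggests blocks of roughly 100 MB, which is why 64 MB was a reasonable default at the time and why later Hadoop releases raised it to 128 MB.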
• Figure: Writing data to the HDFS.
  • No control messages to or from the data nodes.
  • No concern for serialization.
HDFS Architecture
• Ideally, bandwidth between nodes should be used to determine distance.
• In practice, measuring bandwidth between nodes is difficult.
• HDFS assumes that bandwidth becomes progressively smaller in each of the following scenarios (a distance sketch follows below):
  • Processes on the same node.
  • Different nodes on the same rack.
  • Nodes on different racks in the same data center.
  • Nodes in different data centers.
HDFS Network Topology
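Hadoop models the cluster as a tree (data center / rack / node) and scores distance as the number of hops from each node up to their closest common ancestor; the sketch below mirrors that idea, with the 0/2/4/6 values matching the convention described in Hadoop: The Definitive Guide:

```java
// Distance = steps from each location up to the closest common ancestor.
public class TopologyDistance {
    // A location string looks like "/datacenter/rack/node".
    static int distance(String a, String b) {
        String[] pa = a.split("/"), pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // same node: 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // same rack: 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // same data center: 4
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // different data centers: 6
    }
}
```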
• By default HDFS assumes that all nodes are on the same rack in the same data center.
• An XML configuration script is used to map nodes to locations.
HDFS Network Topology
• Trade-off between reliability, write bandwidth, and read bandwidth.
  • Placing all replicas on nodes in different data centers provides high redundancy at the cost of high write bandwidth.
HDFS Replica Placement
• The first replica goes on the same node as the client.
• The second replica goes on a different rack, selected at random.
• The third replica is placed on the same rack as the second, but a different node is chosen.
• Further replicas are placed on nodes selected at random from the cluster (the policy is sketched below).
  • Nodes are selected that are not too busy or full.
  • The system avoids placing too many replicas on the same rack.
HDFS Replica Placement
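A condensed sketch of that default policy (the Node type and selection loops are illustrative; Hadoop's real logic lives in its block placement policy classes and also rejects nodes that are too busy or too full):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class Node {
    final String name, rack;
    Node(String name, String rack) { this.name = name; this.rack = rack; }
}

class ReplicaPlacement {
    // Targets for the first three replicas, per the policy above.
    // Sketch only: assumes the cluster has enough nodes and racks.
    static List<Node> choose(Node client, List<Node> cluster, Random rnd) {
        List<Node> targets = new ArrayList<>();
        targets.add(client);                       // 1st: same node as the client
        Node second;
        do { second = cluster.get(rnd.nextInt(cluster.size())); }
        while (second.rack.equals(client.rack));   // 2nd: a different rack, at random
        targets.add(second);
        Node third;
        do { third = cluster.get(rnd.nextInt(cluster.size())); }
        while (third == second || !third.rack.equals(second.rack)); // 3rd: same rack as 2nd, different node
        targets.add(third);
        return targets;
    }
}
```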
• Based upon the POSIX model, but does not provide strong security for HDFS files.
• Is designed to prevent accidental corruption or misuse of information.
• Each file and directory is associated with an owner and a group.
• For files there are separate permissions to read, write, or append to the file.
• For directories there are separate permissions to create or delete files or directories.
• Permissions are new to HDFS; adding methods such as Kerberos authentication in order to establish user identity is planned for the future.
HDFS Permissions
• HDFS provides a Java API (an example follows below).
• A JNI-based wrapper, libhdfs, has been developed that allows you to work with the Java API from C/C++.
• Work is underway to expose HDFS through the WebDAV protocol.
HDFS Accessibility
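A minimal example of that Java API, reading a file from HDFS and printing it to standard output (the URI is a placeholder; the classes are the standard org.apache.hadoop.fs API, essentially the canonical "cat" example from Hadoop: The Definitive Guide):

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode/user/example/file.txt"; // placeholder location
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri)); // streams the file's blocks from the DataNodes
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```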
• The main web interface is exposed on the NameNode at port 50070.
  • Contains an overview of the health, capacity, and usage of the cluster.
• Each DataNode also has a web interface at port 50075.
  • Logfiles generated by the Hadoop daemons can be accessed through this interface.
  • Very useful for distributed debugging and troubleshooting.
HDFS Web Interface
• The scripts to manage a Hadoop cluster were written in the UNIX shell scripting language.
• The HDFS configuration is located in a set of XML files (a sample follows below).
• Hadoop can run under Windows.
  • Requires Cygwin.
  • Not recommended for production environments.
• Hadoop can be installed on a single machine.
  • Standalone: the HDFS is not used.
  • Pseudo-distributed: a functioning NameNode/DataNode is installed.
  • Memory intensive.
HDFS Configuration
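For flavor, a minimal pseudo-distributed setup of the era might look like this (the property names are from the Hadoop 0.18-1.x generation; the host and port are placeholders):

```xml
<!-- core-site.xml: where the NameNode lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single machine, so keep one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```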
• Who uses Hadoop?
  • Amazon
  • Adobe
  • Cloudspace
  • eBay
  • Facebook
  • IBM
  • LinkedIn
  • The New York Times
  • Twitter
  • Yahoo!
  • …
Hadoop
• Microsoft Data Management Solutions:
  • Relational Data:
    • Microsoft SQL Server 2012
  • Non-Relational Data:
    • Hadoop for Windows Server and Windows Azure
      • An Apache Hadoop-based distribution for Windows developed in partnership with Hortonworks Inc.
Microsoft & Hadoop
• The current technological environment has presented the need for new Large Scale File Systems:
  • Big data.
  • Unstructured data.
  • The use of inexpensive commodity hardware.
• Large Scale File Systems:
  • The proprietary Google File System.
  • The open source Hadoop Distributed File System.
Conclusion
• Barroso, Luiz André, Jeffrey Dean, and Urs Hölzle. “Web Search for a Planet: The Google Cluster Architecture.” IEEE Micro (Volume 23, Issue 2), March-April 2003.
• Callaghan, Brent. NFS Illustrated. Reading: Addison-Wesley Professional, 2000.
• Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google File System.” SOSP ’03, October 19-22, 2003.
• Stross, Randall. Planet Google: One Company’s Audacious Plan to Organize Everything We Know. New York: Free Press, 2008.
• White, Tom. Hadoop: The Definitive Guide. Sebastopol: O’Reilly Media, Inc., 2009.
References
• May 2013. <http://developer.yahoo.com/hadoop/tutorial/module2.html>
• May 2013. <http://hadoop.apache.org/docs/r0.18.3/hdfs_design.html#Introduction>
• May 2012. <http://strata.oreilly.com/2012/01/microsoft-big-data.html>
• May 2013. <http://www.microsoft.com> Microsoft_Big_Data_Booklet.pdf
References