Hadoop File Systems
Lei Xu
HADOOP DISTRIBUTED FILE SYSTEM
Brief Introduction
Hadoop
An Apache project for data-intensive applications
Typical application: MapReduce (OSDI’04), a distributed programming model for massive-data computation
Crawling and indexing web pages (Yahoo!)
Analyzing popular topics and trends (Twitter)
Led by Yahoo!/Facebook/Cloudera
Brief Introduction (cont’d)
Hadoop Distributed File System (HDFS)
A scalable distributed file system to serve Hadoop MapReduce applications
Borrows its essential ideas from the Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles (SOSP’03)
Shares the same design assumptions
Google File System
A scalable distributed file system designed for:
Data-intensive applications (mainly MapReduce)
Web page indexing
It has since spread to other applications
e.g., Gmail, BigTable, App Engine
Fault-tolerant
Low-cost hardware
High throughput
Google File System (cont’d)
Departs from traditional file system assumptions
Runs on top of commodity hardware
Component failures are common
Files are huge
Basic block size of 64~128 MB
vs. 1~64 KB in traditional file systems (ext3, NTFS, etc.)
Massive-data/data-intensive processing
Large streaming reads and small random reads
Large, sequential writes
No (or rare) random writes
Hadoop DFS Assumptions
In addition to the assumptions of the Google File System, HDFS assumes:
Simple Coherency Model
Write-once-read-many
Once a file is created, written, and closed, it cannot be changed.
Moving Computation Is Cheaper than Moving Data
“Semi-location-aware” computation
HDFS tries its best to schedule computation close to the related data
Portability Across Heterogeneous Hardware and Software Platforms
HDFS is written in Java, with multi-platform support
The Google File System is written in C++ and runs on Linux
Data is stored on top of existing local file systems (NTFS/ext4/Btrfs…)
HDFS Architecture
Master/Slave Architecture
NameNode
Metadata Server
Block locations (file name -> blocks -> DataNodes)
File attributes (atime/ctime/mtime, size, number of replicas, etc.)
DataNode
Manages the storage attached to the node it runs on
Client
Producers and consumers of data
HDFS Architecture (cont’d)
NameNode
Metadata Server
Only one NameNode per cluster
Single point of failure
Potential performance bottleneck
Manages the file system namespace
Traditional hierarchical namespace
Keeps all file metadata in memory for fast access
The NameNode’s memory size determines how many files can be supported
Executes file system namespace operations:
open/close/rename/create/unlink… (see the sketch below)
Returns the locations of data blocks
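To make the list above concrete, here is a minimal sketch (not part of the original slides) of issuing these namespace operations from a client through the Hadoop FileSystem API; the paths are hypothetical and the default configuration is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.IOException;

    public class NamespaceOpsSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // Each call becomes an RPC to the NameNode, which updates its
        // in-memory namespace (and the EditLog, described later).
        fs.mkdirs(new Path("/user/alice/logs"));                                    // create
        fs.rename(new Path("/user/alice/logs"), new Path("/user/alice/archive"));   // rename
        fs.delete(new Path("/user/alice/archive"), true /* recursive */);           // unlink
        fs.close();
      }
    }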
NameNode (cont’d)
Maintains system-wide activities
e.g., creating new replicas of file data, garbage collection, load balancing, etc.
Periodically communicates with the DataNodes to collect their status
Is the DataNode alive?
Is the DataNode overloaded?
DataNode
Storage server
Stores fixed-size data blocks on local file systems (ext4/ZFS/Btrfs)
Serves read/write operations from clients
Creates, deletes, and replicates data blocks upon instruction from the NameNode
Block size = 64 MB
Client
Application-level implementation
Does not provide a POSIX API
Hadoop has a FUSE interface
FUSE: Filesystem in Userspace
Has limited functionality (e.g., no random write support)
Queries the NameNode for file locations and metadata
Contacts the corresponding DataNodes for file I/O (see the sketch below)
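As an illustration of this flow, the following is a minimal sketch, assuming a hypothetical NameNode address and file path, of reading a file through the Hadoop FileSystem client API; the library performs the NameNode lookup and the DataNode transfers described above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import java.io.IOException;

    public class HdfsReadSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000"); // hypothetical address
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/sample.txt"); // hypothetical file

        // Ask the NameNode (indirectly) where the blocks of this file live.
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("Number of blocks: " + blocks.length);

        // Stream the data; the client library contacts the DataNodes directly.
        FSDataInputStream in = fs.open(path);
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) > 0) {
          // process buffer[0..n) ...
        }
        in.close();
        fs.close();
      }
    }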
Data Replication
Files are stored as a sequence of blocks
The blocks (typically 64 MB) are replicated for fault tolerance
The replication factor is configurable per file
It can be specified at creation time and changed later (see the sketch below)
The NameNode decides how to replicate blocks. It periodically receives:
Heartbeats, which imply that the DataNode is alive
Blockreports, each containing a list of all blocks on a DataNode
When a DataNode goes down, the NameNode re-replicates all blocks that were on it to other active DataNodes to restore the desired replication factor
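The per-file replication factor can be set through the Hadoop FileSystem API. Below is a minimal sketch; the path and the values are illustrative, not from the slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import java.io.IOException;

    public class ReplicationSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/important.log");   // hypothetical file

        // Specify the replication factor (and block size) at creation time.
        FSDataOutputStream out =
            fs.create(path, true /* overwrite */, 4096 /* buffer */,
                      (short) 3 /* replicas */, 64L * 1024 * 1024 /* 64 MB blocks */);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Change the replication factor later; the NameNode schedules the
        // extra copies (or deletions) asynchronously.
        fs.setReplication(path, (short) 5);
        fs.close();
      }
    }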
Data Replication (cont’d)
Data Replication (cont’d)
Rack Awareness
A Hadoop instance runs on a cluster of computers spread across many racks:
Nodes in the same rack are connected by one switch
Communication between two nodes in different racks goes through multiple switches
Slower than between nodes in the same rack
A whole rack may fail due to network/power issues
Rack-aware placement improves data reliability, availability, and network bandwidth utilization
Data Replication (cont’d)
Rack Awareness (cont’d)
In the common case, the replication factor is three
Two replicas are placed on two different nodes in the same rack
The third replica is placed on a node in a remote rack (see the toy sketch after this list)
Improves write performance
2/3 of the writes stay within one rack, which is faster
Without compromising data reliability
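The following toy sketch illustrates the placement rule described on this slide; it is not the real HDFS placement policy implementation, and the node/rack model is entirely hypothetical (it assumes the local rack has at least two nodes).

    import java.util.*;

    public class PlacementSketch {
      // rack name -> nodes in that rack (hypothetical cluster model)
      static List<String> chooseTargets(String localRack,
                                        Map<String, List<String>> racks) {
        List<String> targets = new ArrayList<>();
        List<String> localNodes = racks.get(localRack);
        targets.add(localNodes.get(0));          // 1st replica: a node in the local rack
        targets.add(localNodes.get(1));          // 2nd replica: another node, same rack
        for (Map.Entry<String, List<String>> e : racks.entrySet()) {
          if (!e.getKey().equals(localRack)) {   // 3rd replica: any node in a remote rack
            targets.add(e.getValue().get(0));
            break;
          }
        }
        return targets;
      }

      public static void main(String[] args) {
        Map<String, List<String>> racks = new HashMap<>();
        racks.put("rack1", Arrays.asList("dn1", "dn2", "dn3"));
        racks.put("rack2", Arrays.asList("dn4", "dn5"));
        System.out.println(chooseTargets("rack1", racks)); // e.g. [dn1, dn2, dn4]
      }
    }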
Replica Selection
For READ operations:
Minimize bandwidth consumption and latency
Prefer a nearer node:
If there is a replica on the same node as the reader, it is preferred
If the cluster spans multiple data centers, a replica in the local data center is preferred
Filesystem Metadata
HDFS stores all file metadata on the NameNode
An EditLog
Records every change that occurs to the filesystem metadata
For failure recovery
Similar to journaling file systems (ext3/NTFS)
An FSImage
Stores the mapping of blocks to files and the file attributes
The EditLog and FSImage are stored locally on the NameNode
Filesystem Metadata (cont’d)
A DataNode has no knowledge of HDFS files
It only stores data blocks as regular files on local file systems
With a checksum for data integrity
It periodically sends the NameNode a Blockreport listing all blocks stored on the DataNode
Only the DataNode knows whether a given block replica is actually available.
Filesystem Metadata (cont’d)
When the NameNode starts up, it:
Loads the FSImage and EditLog from the local file system
Applies the latest EditLog entries to the FSImage
Writes a new FSImage as the latest checkpoint and stores it permanently on the local file system (see the sketch below)
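To make the startup sequence concrete, here is a toy sketch (not the actual NameNode code) that loads a checkpoint, replays an edit log, and writes a new checkpoint; the file names and record formats are made up.

    import java.io.*;
    import java.util.*;

    public class CheckpointSketch {
      public static void main(String[] args) throws IOException {
        // In-memory namespace: path -> attribute string (greatly simplified).
        Map<String, String> namespace = new HashMap<>();

        // 1. Load the previous checkpoint (one "path=attrs" line per file).
        try (BufferedReader img = new BufferedReader(new FileReader("fsimage"))) {
          String line;
          while ((line = img.readLine()) != null) {
            String[] kv = line.split("=", 2);
            namespace.put(kv[0], kv[1]);
          }
        }

        // 2. Replay the EditLog ("CREATE path attrs" / "DELETE path" records).
        try (BufferedReader log = new BufferedReader(new FileReader("editlog"))) {
          String line;
          while ((line = log.readLine()) != null) {
            String[] op = line.split(" ", 3);
            if (op[0].equals("CREATE")) namespace.put(op[1], op[2]);
            else if (op[0].equals("DELETE")) namespace.remove(op[1]);
          }
        }

        // 3. Write a new FSImage; the EditLog can now be truncated.
        try (PrintWriter out = new PrintWriter(new FileWriter("fsimage.new"))) {
          for (Map.Entry<String, String> e : namespace.entrySet())
            out.println(e.getKey() + "=" + e.getValue());
        }
      }
    }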
Communication Protocol
A Hadoop-specific RPC on top of TCP/IP
The NameNode is purely a server: it only responds to requests issued by DataNodes or clients
ClientProtocol.java – the client protocol
DatanodeProtocol.java – the DataNode protocol (both illustrated below)
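As a rough illustration of what these protocols contain, the sketch below defines simplified interfaces; the method names approximate the real ones, but the signatures are reduced to simple types and should not be taken as the actual Hadoop API.

    interface ClientProtocolSketch {
      // Client asks the NameNode where the blocks of a file are stored.
      String[] getBlockLocations(String path, long offset, long length);
      // Client asks the NameNode to create a file / allocate a new block.
      void create(String path, short replication, long blockSize);
      String addBlock(String path);
    }

    interface DatanodeProtocolSketch {
      // DataNode tells the NameNode it is alive (and reports its load).
      void sendHeartbeat(String datanodeId, long capacity, long remaining);
      // DataNode periodically reports every block it stores.
      void blockReport(String datanodeId, long[] blockIds);
    }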
Robustness
A primary objective of HDFS:
Remain reliable in the presence of component failures
In a typical large cluster (>1K nodes), component failures are common
Three common types of failures:
NameNode failures
DataNode failures
Network failures
Robustness (cont’d)
Heartbeats
Each DataNode periodically sends heartbeats to the NameNode
System status and block reports
The NameNode marks DataNodes without recent heartbeats as dead
It stops forwarding I/O to them
It marks all data blocks on those DataNodes as unavailable
It re-replicates those blocks if necessary (according to the replication factor)
This detects both network failures and DataNode deaths (see the sketch below)
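A toy sketch of heartbeat-based failure detection follows (not the actual NameNode code); the timeout value and data structures are illustrative.

    import java.util.*;

    public class HeartbeatMonitorSketch {
      static final long TIMEOUT_MS = 10 * 60 * 1000;   // e.g. 10 minutes (illustrative)

      // datanodeId -> timestamp of last heartbeat
      final Map<String, Long> lastHeartbeat = new HashMap<>();

      // Called whenever a heartbeat arrives.
      void onHeartbeat(String datanodeId) {
        lastHeartbeat.put(datanodeId, System.currentTimeMillis());
      }

      // Called periodically by a background thread.
      List<String> findDeadNodes() {
        long now = System.currentTimeMillis();
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
          if (now - e.getValue() > TIMEOUT_MS) {
            dead.add(e.getKey());   // blocks on this node must be re-replicated
          }
        }
        return dead;
      }
    }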
Robustness (cont’d)
Re-Balancing
Automatically moves data from one DataNode to another
If a DataNode’s free space falls below a threshold
Data Integrity
A block of data may be corrupted
Disk faults, network faults, buggy software
The client computes checksums for each block and stores them in a separate hidden file in the HDFS namespace (see the sketch below)
The data is verified against these checksums before it is read
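Here is a minimal sketch of the checksum idea using Java’s built-in CRC32; the 512-byte chunk size is an illustrative choice, and the code is not taken from HDFS.

    import java.util.zip.CRC32;

    public class ChecksumSketch {
      static final int CHUNK = 512;   // bytes per checksum (illustrative value)

      // Compute one CRC32 per CHUNK-sized slice of the block.
      static long[] checksums(byte[] block) {
        int n = (block.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
          CRC32 crc = new CRC32();
          int len = Math.min(CHUNK, block.length - i * CHUNK);
          crc.update(block, i * CHUNK, len);
          sums[i] = crc.getValue();
        }
        return sums;
      }

      // On read: recompute and compare; a mismatch means this replica is
      // corrupt and the reader should fall back to another replica.
      static boolean verify(byte[] block, long[] expected) {
        return java.util.Arrays.equals(checksums(block), expected);
      }
    }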
Robustness (cont’d)
Metadata failures
The FSImage and EditLog are the central data structures
If they are corrupted, HDFS cannot rebuild the namespace or access the data
The NameNode can be configured to keep multiple copies of the FSImage and EditLog
e.g., one FSImage/EditLog on the local machine, another on a mounted remote NFS server
This reduces update performance
If the NameNode goes down, the cluster must be restarted manually
Data Organization
Data Blocks
HDFS is designed to support very large files and streaming I/O
A file is chopped up into 64 MB blocks (e.g., a 1 GB file becomes sixteen blocks)
Large blocks reduce the number of connection establishments and make TCP transfers more efficient
If possible, each block of a file resides on a different DataNode
Enables parallel I/O and computation later (MapReduce)
Data Organization (cont’d)
Staging
When writing a new file:
The client first caches the file data in a temporary local file until it accumulates a full HDFS block
Then the client contacts the NameNode, which assigns a DataNode
The client flushes the cached data to the chosen DataNode
This keeps the network bandwidth fully utilized
Data Organization (cont’d)
Replication Pipeline
The client obtains a list of DataNodes that should receive one block
The client first streams the data to the first DataNode
The first DataNode receives the data in small portions (4 KB), writes each portion to local storage, and immediately forwards it to the next DataNode in the list
The second DataNode behaves the same way toward the third
The total transfer time for one block (64 MB) with three replicas is roughly:
T(64 MB) + 2 * T(4 KB) with the pipeline
3 * T(64 MB) without the pipeline (worked out below)
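A quick worked example, assuming a link speed of 100 MB/s per hop (this figure is an assumption, not from the slides):

    T(64 MB) = 64 MB / 100 MB/s = 0.64 s
    T(4 KB)  = 4 KB / 100 MB/s ≈ 0.00004 s

    Pipelined:     T(64 MB) + 2 * T(4 KB) ≈ 0.64 s
    Non-pipelined: 3 * T(64 MB)           = 1.92 s

So pipelining delivers all three replicas in roughly the time of a single transfer, because each downstream DataNode lags the previous one by only a 4 KB portion.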
Replication Pipeline
The client asks the NameNode where to put the data
The client pushes the data to the DataNodes linearly to fully utilize the network bandwidth
The secondary replicas reply to the primary; the primary then reports success back to the client
* This figure is from “The Google File System” paper
See also
HBase – a BigTable implementation on Hadoop
Key-value storage
Pig – a high-level language for data analysis on Hadoop
ZooKeeper
“ZooKeeper: Wait-free Coordination for Internet-scale Systems”, ATC’10, Best Paper
CloudStore (KFS, previously Kosmosfs)
A C++ implementation of the Google File System design
Parallels the Hadoop project
Google vs. Yahoo!/Facebook/Amazon...
Google                   Hadoop
• Google File System     • Hadoop DFS
• MapReduce              • Hadoop MapReduce
• BigTable               • HBase
Known Issues and Research Interests
The NameNode is a single point of failure
It also limits the total number of files HDFS can support
RAM limitation
Google has changed the one-master architecture into a multi-master cluster
However, the details have not been revealed
Known Issues and Research Interests (cont’d)
Replication is used to provide data reliability
Same problems as RAID-1 (e.g., storage overhead)?
Apply RAID technologies to HDFS?
“DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW’09
Known Issues and Research Interests (cont’d)
Energy Efficiency
DataNodes are kept alive for data availability
However, there may be no MapReduce computations running on them
Waste of energy
Conclusion
The Hadoop Distributed File System is designed to serve MapReduce computations
It provides highly reliable storage
It supports massive amounts of data
Its data placement policies are optimized for the topology of data centers
Large companies build their core businesses on top of these infrastructures
Google: GFS/MapReduce/BigTable
Yahoo!/Facebook/Amazon/Twitter/NY Times: Hadoop/HBase/Pig
Reference
HDFS Architecture Guide:
http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html
Questions?
Thank you!