Survey : Cloud Storage Systems
Presented By :
Nitya Shankaran
Ritika Sharma
Overview
Motivation behind the project
Understanding how data can be stored efficiently in the
cloud, i.e., choosing the right kind of storage for a given workload.
What has been done so far
Types of Data Storage
Object storage
Relational Database storage
Distributed File systems etc.
Object Storage
Uses data objects instead of files to store and retrieve data
Maintains an index of Object ID (OID) numbers
Ideal for storing large files.
Amazon S3
“Provides a simple web interface that can be used to store and retrieve any
amount of data, at any time, from anywhere on the web”.
As of July 2011, S3 stored over 449 billion user objects and
handled 900 million user requests a day.
Amazon claims that S3 provides infinite storage capacity, infinite data
durability, 99.99% availability and good data access performance.
How Amazon S3 stores data
Makes use of buckets
Objects are identified by a unique key which is assigned by the user
S3 stores objects of up to 5 TB in size, each accompanied by 2 KB of
metadata (content type, date last modified, etc.).
ACLs
- Read for objects and buckets
- Write for buckets only
- Read and write for objects
Buckets and objects are created, listed and retrieved using a REST-style
HTTP interface or a SOAP interface.
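The bucket/key model described above can be illustrated with a small in-memory sketch. This is a toy stand-in, not the real S3 API; the bucket name, key, and metadata fields here are made up for illustration:

```python
# Toy in-memory model of an object store's bucket/key semantics
# (illustrative only; real S3 is accessed over REST/SOAP or via an SDK).

class ObjectStore:
    def __init__(self):
        self.buckets = {}  # bucket name -> {key: (data, metadata)}

    def create_bucket(self, bucket):
        self.buckets.setdefault(bucket, {})

    def put_object(self, bucket, key, data, metadata=None):
        # Each object is identified by a user-assigned key and carries
        # a small amount of metadata (content type, last modified, ...).
        self.buckets[bucket][key] = (data, metadata or {})

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

store = ObjectStore()
store.create_bucket("photos")
store.put_object("photos", "2011/cat.jpg", b"...bytes...",
                 {"Content-Type": "image/jpeg"})
data, meta = store.get_object("photos", "2011/cat.jpg")
```

Note how the flat key namespace replaces a directory tree: "2011/cat.jpg" is a single opaque key, not a path.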
Evaluation of S3
Experimental Setup
Features and Findings:
- Data Durability
- Replica Placement
- Data Reliability
- Availability
- Versioning of objects
- Data Access Performance
- Security
- Easy to use
Swift
Used for creating redundant, scalable object storage using clusters of
standardized servers to store petabytes of accessible data.
Provides greater scalability, redundancy and permanence because there is no
central point of control.
Objects are written to multiple hardware devices in the data center, with the
OpenStack software responsible for ensuring data replication and integrity
across the cluster.
Storage clusters can scale horizontally by adding new nodes. Should a node
fail, OpenStack works to replicate its content from other active nodes.
Used mainly for virtual machine images, photo storage, email storage and
backup archiving.
Swift has a RESTful API.
Architecture of Swift
- Proxy server processes API requests and routes
requests to storage nodes
- Auth server authenticates and authorizes requests
- Ring represents mapping between the names of
entities stored on disk and their physical location
- Replicator provides redundancy for accounts,
containers, objects
- Updater processes failed or queued updates
- Auditor verifies integrity of objects, containers,
and accounts
- Account Server handles listing of containers,
stores as SQLite DB
- Container Server handles listing of objects, stores
as SQLite DB
- Object Server stores, retrieves, and deletes objects
stored on local devices
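The ring's mapping from entity names to physical locations can be sketched as consistent hashing: a hash of the object path selects a partition, and each partition maps to a set of devices. The sketch below is a simplification; the partition power, device list, and placement rule are made-up values, and real Swift uses a salted MD5 hash and a far more careful device assignment:

```python
import hashlib

PART_POWER = 8                                 # 2**8 = 256 partitions (illustrative)
DEVICES = ["dev%d" % i for i in range(4)]      # hypothetical storage devices
REPLICAS = 3

def partition(account, container, obj):
    # Hash the object path and keep the top PART_POWER bits,
    # roughly as Swift's ring does with a (salted) MD5 hash.
    path = "/%s/%s/%s" % (account, container, obj)
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return h >> (128 - PART_POWER)

def replicas(part):
    # Toy placement: consecutive devices starting at the partition's slot.
    return [DEVICES[(part + i) % len(DEVICES)] for i in range(REPLICAS)]

part = partition("AUTH_demo", "photos", "cat.jpg")
devs = replicas(part)
```

The key property is that any proxy server can compute the same partition for the same name, so no central lookup service is needed.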
Evaluation of Swift
Strengths:
- Data Durability
- Replica Placement
- Data Reliability
- Availability
- Data Scalability
- Security
Limitations:
- Objects must be < 5 GB
- Not a filesystem
- No user quotas
- No directory hierarchies
- No writing to a byte offset in a file
- No ACLs
Swift is mainly used for:
o Storing media libraries (photos, music, videos, etc.)
o Archiving video surveillance files
o Archiving phone call audio recordings
o Archiving compressed log files
o Archiving backups (< 5GB each object)
o Storing and loading of OS Images, etc.
o Storing file populations that can grow practically without bound.
o Storing small files (<50 KB). OpenStack Object Storage is great at this.
o Storing billions of files.
o Storing Petabytes (millions of Gigabytes) of data.
Relational Database Storage (RDS)
Aims to move much of the operational burden of provisioning,
configuration, scaling, performance tuning, backup, privacy, and access
control from the database users to the service operator, offering lower
overall costs to users.
Advantages:
- Hardware costs are lower
- Operational costs are lower
Disadvantages:
- Inability to scale well
- Labor intensive (managing relational databases)
- Error prone
- Increased complexity, since each database package comes with its own
configuration options, tools, performance sensitivities and bugs.
Microsoft SQL Azure
Cloud-based relational database service built on SQL Server
Uses T-SQL as the query language and Tabular Data Stream (TDS) as
the protocol
Unlike S3, it does not provide a REST-based API to access the service over
HTTP; instead, SQL Azure is accessed via the Tabular Data Stream (TDS)
protocol.
Allows relational queries to be made against stored data
Enables querying data, search, data analysis and data synchronization.
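Because SQL Azure speaks standard SQL (T-SQL) over TDS, the relational querying it enables looks like any SQL session. Here Python's built-in sqlite3 module stands in for the service; the table and data are made up for illustration:

```python
import sqlite3

# In-memory SQLite database as a stand-in for a cloud relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "east", 10.0), (2, "west", 25.0), (3, "east", 5.0)])

# Relational queries against stored data: filtering plus aggregation,
# which object stores like S3 or Swift cannot do server-side.
total_east = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'").fetchone()[0]
```

The contrast with object storage is the point: the service evaluates predicates and aggregates for you, instead of returning whole opaque objects.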
Network Topology – Part 1
[Figure: the client layer (at customer premises or on the Windows Azure
Platform) accessing SQL Azure over HTTP/REST]
Network Topology – Part 2
Evaluation of SQL Azure
- Replica Placement
- Data Reliability
- Availability
- Data Access Performance
- Security
- Scalability: any amount of data can be stored, from kilobytes to terabytes,
but an individual database is limited to 10 GB in size.
Sharding
Data sharding is a technique used by many applications to improve performance, scalability and cost by
partitioning the data.
For example, applications that store and process sales data often query it
using date or time predicates; such applications can benefit from processing
a single partition instead of the entire data set.
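A date-based sharding scheme like the sales example can be sketched as routing each row to a partition keyed by month. The shard layout and data below are made up for illustration:

```python
from collections import defaultdict
from datetime import date

shards = defaultdict(list)   # shard key -> rows stored in that partition

def shard_key(sale_date):
    # Partition by (year, month) so date-predicate queries touch one shard.
    return (sale_date.year, sale_date.month)

def insert(sale_date, amount):
    shards[shard_key(sale_date)].append((sale_date, amount))

insert(date(2011, 7, 1), 100.0)
insert(date(2011, 7, 15), 50.0)
insert(date(2011, 8, 2), 75.0)

# A query restricted to July 2011 reads one shard, not the whole data set.
july = shards[(2011, 7)]
```

In SQL Azure this pattern also works around the 10 GB per-database limit, since each shard can live in its own database.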
Amazon S3 v/s OpenStack Swift v/s SQL Azure

                 | Amazon S3                    | OpenStack Swift            | SQL Azure
Type of Storage  | Object storage               | Object storage             | RDS storage
Data Replication | Stores multiple redundant    | Consists of a Replicator;  | Ensures data availability by
                 | copies; 95.89% availability  | ensures integrity of data  | replication (SQL Azure fabric)
                 | rate                         |                            | and provides load balancing
Data Scalability | Scalable                     | High scalability           | Individual databases limited
                 |                              |                            | to 10 GB
Security         | Clients authenticated using  | ----                       | Provides a set of security
                 | a public/private key scheme  |                            | principles
Usage            | Suitable for large data      | Media libraries, OS        | Querying the data and
                 | objects and parallel         | images, backups and log    | performing analysis
                 | applications                 | files                      |
Distributed File Systems
Google File System
Architecture
Single Master
Multiple Chunkservers
HeartBeat Messages
Chunk Size = 64MB
Metadata
File & Chunk namespace
Mapping from files to
chunks
Locations of each chunk’s
replica
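The master's metadata above boils down to two maps plus fixed-size chunking. A sketch with made-up chunk handles and chunkserver names (not Google's actual data structures):

```python
CHUNK_SIZE = 64 * 2**20   # GFS uses fixed 64 MB chunks

# Master metadata: file -> ordered chunk handles, and
# chunk handle -> replica locations (hypothetical names).
file_to_chunks = {"/logs/web.log": ["c001", "c002", "c003"]}
chunk_replicas = {"c001": ["cs1", "cs4", "cs7"],
                  "c002": ["cs2", "cs5", "cs8"],
                  "c003": ["cs3", "cs6", "cs9"]}

def locate(path, offset):
    # Translate (file, byte offset) into (chunk handle, replica locations);
    # this is the lookup a client asks the master to perform.
    index = offset // CHUNK_SIZE
    handle = file_to_chunks[path][index]
    return handle, chunk_replicas[handle]

handle, reps = locate("/logs/web.log", 70 * 2**20)  # 70 MB falls in chunk 2
```

Because the maps are small relative to the data, a single master can hold all metadata in memory while clients read and write chunks directly from chunkservers.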
GFS – Micro-benchmarks
GFS Cluster:
- 1 master with 2 master replicas
- 16 chunkservers, 16 clients
- Dual 1.4 GHz PIII processors
- 100 Mbps full-duplex Ethernet connection to a switch
- The 19 server machines are connected to switch S1 and the 16 clients to
switch S2
- S1 and S2 are connected by a 1 Gbps link
Reads
- Limit: 125 MB/s for the 1 Gbps inter-switch link; 12.5 MB/s per client for
the 100 Mbps link
- When 1 client is reading: read rate = 10 MB/s = 80% of the 12.5 MB/s
per-client limit
- When 16 clients are reading: aggregate read rate = 94 MB/s, i.e. 6 MB/s per
client = 75% of the 125 MB/s limit
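The efficiency figures above follow directly from the link limits; checking the arithmetic:

```python
# Link limits from the GFS micro-benchmark setup.
link_1gbps = 125.0    # MB/s, inter-switch link shared by all clients
link_100mbps = 12.5   # MB/s, each client's own NIC limit

# One client reading at 10 MB/s reaches 80% of its NIC limit.
one_client_eff = 10.0 / link_100mbps

# Sixteen clients together read 94 MB/s, about 6 MB/s each,
# which is 75% of the shared 1 Gbps link limit.
sixteen_client_total = 94.0
per_client = sixteen_client_total / 16
aggregate_eff = sixteen_client_total / link_1gbps
```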
GFS – Micro-benchmarks (cont’d)
Writes
- Limit = 67 MB/s, since each byte is written to 3 of the 16 chunkservers,
each with a 12.5 MB/s input connection
- When 1 client is writing: 6.3 MB/s (delays in propagating data between
servers)
- When 16 clients are writing: 35 MB/s aggregate, i.e. 2.2 MB/s per client
Record Append
- Performance is limited by the network bandwidth of the chunkserver holding
the last chunk of the file, independent of the number of clients
- When 1 client is appending: 6 MB/s (the limit)
- When 16 clients are appending: 4.8 MB/s per client
Features
Data Integrity : Checksum
Replica Placement
Data reliability
Availability
Fast recovery
Chunk and Master replication
Rebalancing Replicas
Better disk space utilization
Load balancing
Garbage Collection
Storage is not reclaimed immediately
A deleted file is first renamed to a hidden name
It is removed after 3 days
Stale Replica Detection
Each chunk has a version number
Incremented on each update
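Stale replica detection can be sketched as follows: the master bumps a chunk's version on each mutation, and any replica still reporting an older version must have missed an update. The chunk handle and server names below are illustrative, not GFS's actual identifiers:

```python
# Master's record of the latest version per chunk (made-up handle).
master_version = {"c001": 3}
# Versions last reported by each replica (made-up chunkserver names).
replica_version = {"cs1": 3, "cs4": 3, "cs7": 2}

def mutate(chunk, live_replicas):
    # On each update the master increments the chunk's version number,
    # and the replicas that participated record the new version.
    master_version[chunk] += 1
    for r in live_replicas:
        replica_version[r] = master_version[chunk]

def stale(chunk):
    # Any replica behind the master's version is stale.
    return [r for r, v in replica_version.items()
            if v < master_version[chunk]]

mutate("c001", ["cs1", "cs4"])   # cs7 was down during the mutation
stale_replicas = stale("c001")   # cs7 fell behind
```

Stale replicas found this way are simply garbage-collected rather than repaired, since up-to-date copies exist elsewhere.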
Hadoop Distributed File system
Architecture
NameNode - metadata
Hierarchy of files & directories
Attributes
Block size = 128MB
Primary and Secondary NameNode
DataNode - Application data
Each block replica – 2 files
Data
Block’s metadata
Checksum
Handshake between NameNode & DataNode
Verifies namespace ID
Verifies software version
Communication via TCP
Heartbeat Messages
HDFS Client
Interface between user application and HDFS
Reading a file
Writing a file
Single-writer, multiple-reader model
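The single-writer, multiple-reader model can be sketched as a per-file write lease: one writer holds the lease at a time while any number of readers proceed. This is a simplification of HDFS's actual lease mechanism, with made-up paths and client names:

```python
leases = {}   # path -> current lease holder (enforces single writer)
readers = {}  # path -> count of concurrent readers

def open_for_write(path, client):
    # A second writer is refused while the lease is held.
    if path in leases:
        raise IOError("lease on %s already held by %s" % (path, leases[path]))
    leases[path] = client

def open_for_read(path, client):
    # Readers are never blocked by each other or by the writer.
    readers[path] = readers.get(path, 0) + 1

open_for_write("/data/f1", "clientA")
open_for_read("/data/f1", "clientB")
open_for_read("/data/f1", "clientC")
try:
    open_for_write("/data/f1", "clientD")
    second_writer_allowed = True
except IOError:
    second_writer_allowed = False
```

In real HDFS the lease also expires if the writer dies, so the NameNode can recover the file; that renewal logic is omitted here.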
HDFS – Benchmarks
HDFS clusters at Yahoo!:
- 3500 nodes
- 2 quad-core Xeon processors @ 2.5 GHz
- Linux
- 16 GB RAM
- 1 Gbps Ethernet
DFSIO benchmark
- Read: 66 MB/s per node
- Write: 40 MB/s per node
- Busy cluster read: 1.02 MB/s per node
- Busy cluster write: 1.09 MB/s per node
NNThroughput benchmark
- Starts the NameNode and multiple client threads on the same node

Operation               | Throughput (ops/s)
Open file for read      | 126100
Create file             | 5600
Rename file             | 8300
Delete file             | 207000
DataNode heartbeat      | 300000
Block report (blocks/s) | 639700
Features
Good Placement Policy
At most 1 replica per DataNode
No more than 2 replicas per rack
Data reliability
Availability
Network bandwidth utilization
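The placement constraints above (at most one replica per DataNode, no more than two per rack) can be checked with a small sketch; the node and rack names are made up:

```python
# Hypothetical topology: which rack each DataNode lives in.
rack_of = {"n1": "r1", "n2": "r1", "n6": "r1",
           "n3": "r2", "n4": "r2", "n5": "r3"}

def valid_placement(nodes):
    # At most one replica per DataNode...
    if len(nodes) != len(set(nodes)):
        return False
    # ...and no more than two replicas on any single rack.
    racks = [rack_of[n] for n in nodes]
    return all(racks.count(r) <= 2 for r in set(racks))

ok = valid_placement(["n1", "n2", "n3"])    # 2 on r1, 1 on r2: valid
bad = valid_placement(["n1", "n2", "n6"])   # 3 on r1: violates the rack rule
```

Spreading the third replica to a second rack is what buys rack-failure tolerance without paying full cross-rack bandwidth for every copy.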
Replication Management
Priority queue
Load balancing
No built-in strategy
Balancer: a tool run as an application program
Balances disk usage: a node is balanced when the difference between its
utilization and the cluster utilization is within a threshold in the
range (0, 1)
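The balancer's criterion can be sketched directly: a node is balanced when its utilization is within a chosen threshold of the cluster-wide utilization. The utilization numbers below are made up:

```python
THRESHOLD = 0.10   # allowed deviation, a value in (0, 1)

# Hypothetical per-node disk utilization (fraction of capacity used).
node_util = {"n1": 0.55, "n2": 0.80, "n3": 0.30}

cluster_util = sum(node_util.values()) / len(node_util)   # average: 0.55

def unbalanced(node):
    # The balancer moves replicas from over- to under-utilized nodes
    # until every node is within THRESHOLD of the cluster average.
    return abs(node_util[node] - cluster_util) > THRESHOLD

over_or_under = sorted(n for n in node_util if unbalanced(n))
```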
Data Integrity
Block scanner on each DataNode
Verifies checksums
Inter-Cluster data Copy
GFS v/s HDFS

                        | Google File System  | Hadoop Distributed File System
Read Rate               | 6 MB/s per client   | 1.02 MB/s per node (busy cluster)
Write Rate              | 2.2 MB/s per client | 1.09 MB/s per node (busy cluster)
Records Append          | 4.8 MB/s per client | -
Availability            | High                | Single point of failure (NameNode)
Data Integrity          | Checksum            | Checksum
Replica Placement       | Yes                 | Yes
Rebalancing             | Yes                 | Yes
Garbage Collection      | Yes                 | No
Stale Replica Detection | Yes                 | No
Inter-cluster Data Copy | No                  | Yes