Cloud Computing and Scalable
Data Management
Jiaheng Lu and Sai Wu
Renmin University of China, National University of Singapore
APWeb’2011 Tutorial
Outline
 Cloud computing
 Map/Reduce, Bigtable and PNUTS
 CAP Theorem and Datalog
 Data indexing in the clouds
 Conclusion and open issues
Cloud computing
Why do we use cloud computing?
Case 1:
• Write a file and save it locally: if the computer goes down, the file is lost.
• Files stored in the cloud are never lost.
Why do we use cloud computing?
Case 2:
• Use IE: download, install, use
• Use QQ: download, install, use
• Use C++: download, install, use
• ……
• Instead, get the service from the cloud.
What is cloud and cloud computing?
Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.
Characteristics of cloud computing
• Virtual: software, databases, Web servers, operating systems, storage and networking are provided as virtual servers.
• On demand: add and subtract processors, memory, network bandwidth and storage as needed.
Types of cloud service
SaaS
Software as a Service
PaaS
Platform as a Service
IaaS
Infrastructure as a Service
Outline
 Cloud computing
 Map/Reduce, Bigtable and PNUTS
 CAP Theorem and Datalog
 Data indexing in the clouds
 Conclusion and open issues
Introduction to
MapReduce
MapReduce Programming Model
• Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
• Users implement an interface of two primary methods:
  – 1. Map: (key1, val1) → (key2, val2)
  – 2. Reduce: (key2, [val2]) → [val3]
Map operation
• Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs.
  – e.g. (doc-id, doc-content)
• Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.
Reduce operation
• On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer.
• Can be visualized as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute.
Pseudo-code
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
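The pseudo-code can be exercised locally. The following is a minimal Python sketch (not Hadoop code) that simulates what the framework does: apply the map function to every input pair, group the intermediate pairs by key (the shuffle), then feed each group to the reduce function.

from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Emit (word, 1) for every word in the document.
    for word in doc_contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all partial counts for the word.
    return word, sum(counts)

def run_mapreduce(documents):
    # Map phase: apply map_fn to every (key, value) input pair.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)   # shuffle: group by key
    # Reduce phase: apply reduce_fn to each key and its list of values.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

if __name__ == "__main__":
    docs = {"d1": "cloud data cloud", "d2": "data management"}
    print(run_mapreduce(docs))   # {'cloud': 2, 'data': 2, 'management': 1}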
MapReduce: Execution overview
MapReduce: Example
Research works
• MapReduce is slower than parallel databases by a factor of 3.1 to 6.5 [1][2]
• By adopting a hybrid architecture, the performance of MapReduce can approach that of parallel databases [3]
• MapReduce is an efficient tool [4]
• Numerous discussions in the MapReduce community …
[1] A comparison of approaches to large-scale data analysis. SIGMOD 2009
[2] MapReduce and parallel DBMSs: friends or foes? CACM 2010
[3] HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. VLDB 2009
[4] MapReduce: a flexible data processing tool. CACM 2010
An in-depth study (1)
A performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. [5]
• Scheduling
• I/O modes
• Record parsing
• Indexing
[5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The Performance of MapReduce: An In-depth Study. PVLDB 3(1): 472-483 (2010)
An in-depth study (2)
• By carefully tuning these factors, the overall
performance of Hadoop can be comparable
to that of parallel database systems.
• Possible to build a cloud data processing
system that is both elastically scalable and
efficient
[5] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu: The
Performance of MapReduce: An In-depth Study. PVLDB 3(1):
472-483 (2010)
Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database
Christopher Yang, Christine Yen, Ceryen Tan, Samuel Madden: Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. ICDE 2010: 657-668
Problem proposed
• Problem: node failures in a distributed database system
  – Faults may be common on large clusters of machines
• Existing solution: aborting and (possibly) restarting the query
  – A reasonable approach for short OLTP-style queries
  – But wasteful for analytical (OLAP) warehouse queries
MapReduce-style fault tolerance
for a SQL database
• Break up a SQL query (or program) into
smaller, parallelizable subqueries
• Adapt the load balancing strategy of greedy
assignment of work
Osprey
• A middleware implementation of MapReduce-style fault tolerance for a SQL database
(Figure: a SQL query is passed to a Query Transformer, which breaks it into many subqueries; the subqueries are placed on partition work queues (PWQ A, PWQ B, PWQ C), a Scheduler dispatches them for execution, and a Result Merger combines the partial results into the final result.)
MapReduce Online Evaluation Platform Construction
(Figure: modules of the online evaluation platform: user registration and login, problem browsing, theory tests, help and FAQs, solution submission and results, a Hadoop quick start, profile updates, and submission history.)
School of Information, Renmin University of China
Cloud data management
Four new principles in Cloud-based data
management
New principle in cloud data management (1)
• Partition everything and use key-value storage
• "Divide everything to conquer it" (切分万物以治之)
• The first normal form cannot be satisfied
New principle in cloud data management (2)
• Embrace inconsistency
• "Accommodating differences achieves greater harmony" (容不同乃成大同)
• ACID properties are not satisfied
New principle in cloud data management (3)
• Back up everything with three copies
• "A crafty rabbit keeps three burrows and sleeps soundly" (狡兔三窟方高枕)
• Guarantee 99.999999% safety
New principle in cloud data management (4)
• Scalable and high performance
• "Marshal an ocean of resources with ample capacity" (运筹沧海量兼容)
Cloud data management
• Partition everything (切分万物以治之)
• Embrace inconsistency (容不同乃成大同)
• Back up data with three copies (狡兔三窟方高枕)
• Scalable and high performance (运筹沧海量兼容)
BigTable: A Distributed Storage
System for Structured Data
Introduction
• BigTable is a distributed storage system for managing
structured data.
• Designed to scale to a very large size
– Petabytes of data across thousands of servers
• Used for many Google projects
– Web indexing, Personalized Search, Google Earth, Google
Analytics, Google Finance, …
• Flexible, high-performance solution for all of
Google’s products
Why not just use commercial DB?
• Scale is too large for most commercial databases
• Even if it weren’t, cost would be very high
– Building internally means system can be applied across
many projects for low incremental cost
• Low-level storage optimizations help performance
significantly
– Much harder to do when running on top of a database
layer
Goals
• Want asynchronous processes to be continuously
updating different pieces of data
– Want access to most current data at any time
• Need to support:
– Very high read/write rates (millions of ops per second)
– Efficient scans over all or interesting subsets of data
– Efficient joins of large one-to-one and one-to-many
datasets
• Often want to examine data changes over time
– E.g. Contents of a web page over multiple crawls
BigTable
• Distributed multi-level map
• Fault-tolerant, persistent
• Scalable
– Thousands of servers
– Terabytes of in-memory data
– Petabytes of disk-based data
– Millions of reads/writes per second, efficient scans
• Self-managing
– Servers can be added/removed dynamically
– Servers adjust to load imbalance
Basic Data Model
• A BigTable is a sparse, distributed, persistent multidimensional sorted map:
  (row, column, timestamp) -> cell contents
• Good match for most Google applications
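As an illustration only (not Bigtable's actual API), the sorted multidimensional map can be pictured in Python as nested dictionaries keyed by row, then column, then timestamp:

class TinyTable:
    """Toy model of Bigtable's map: (row, column, timestamp) -> cell contents."""
    def __init__(self):
        self.rows = {}   # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, value, timestamp):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:               # default: most recent version
            timestamp = max(versions)
        return versions.get(timestamp)

t = TinyTable()
t.put("com.cnn.www", "contents:", "<html>v1</html>", timestamp=3)
t.put("com.cnn.www", "contents:", "<html>v2</html>", timestamp=5)
print(t.get("com.cnn.www", "contents:"))   # latest version: <html>v2</html>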
WebTable Example
• Want to keep copy of a large collection of web pages and
related information
• Use URLs as row keys
• Various aspects of web page as column names
• Store contents of web pages in the contents: column
under the timestamps when they were fetched.
Rows
• Name is an arbitrary string
– Access to data in a row is atomic
– Row creation is implicit upon storing data
• Rows ordered lexicographically
– Rows close together lexicographically usually on
one or a small number of machines
Rows (cont.)
Reads of short row ranges are efficient and
typically require communication with a small
number of machines.
• Can exploit this property by selecting row keys
so they get good locality for data access.
• Example:
math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu
VS
edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
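A minimal sketch of the row-key trick above: reversing the host name groups pages from the same domain under adjacent row keys, so they tend to land on the same tablet.

def locality_row_key(url_host):
    # "math.gatech.edu" -> "edu.gatech.math": same-domain rows sort together.
    return ".".join(reversed(url_host.split(".")))

hosts = ["math.gatech.edu", "phys.uga.edu", "phys.gatech.edu", "math.uga.edu"]
print(sorted(locality_row_key(h) for h in hosts))
# ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']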
Columns
• Columns have two-level name structure:
• family:optional_qualifier
• Column family
– Unit of access control
– Has associated type information
• Qualifier gives unbounded columns
– Additional levels of indexing, if desired
Timestamps
• Used to store different versions of data in a cell
– New writes default to current time, but timestamps for writes can also be set
explicitly by clients
• Lookup options:
– “Return most recent K values”
– “Return all values in timestamp range (or all values)”
• Column families can be marked with attributes:
– “Only retain most recent K values in a cell”
– “Keep values until they are older than K seconds”
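A sketch of how those two garbage-collection attributes could be applied to one cell's versions (an illustration, not Bigtable code); versions is assumed to be a dict of timestamp -> value with timestamps in epoch seconds.

import time

def gc_versions(versions, keep_last_k=None, max_age_seconds=None, now=None):
    """Drop cell versions per the column-family GC attributes."""
    now = time.time() if now is None else now
    timestamps = sorted(versions, reverse=True)          # newest first
    if keep_last_k is not None:
        timestamps = timestamps[:keep_last_k]            # retain most recent K
    if max_age_seconds is not None:
        timestamps = [t for t in timestamps if now - t <= max_age_seconds]
    return {t: versions[t] for t in timestamps}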
Implementation – Three Major
Components
• Library linked into every client
• One master server
– Responsible for:
  • Assigning tablets to tablet servers
  • Detecting the addition and expiration of tablet servers
  • Balancing tablet-server load
  • Garbage collection
• Many tablet servers
  – Handle read and write requests to the tablets they serve
  – Split tablets that have grown too large
Tablets
• Large tables broken into tablets at row boundaries
– Tablet holds contiguous range of rows
• Clients can often choose row keys to achieve locality
– Aim for ~100MB to 200MB of data per tablet
• Serving machine responsible for ~100 tablets
– Fast recovery:
• 100 machines each pick up 1 tablet for failed machine
– Fine-grained load balancing:
• Migrate tablets away from overloaded machine
• Master makes load-balancing decisions
Refinements: Locality Groups
• Can group multiple column families into a
locality group
– Separate SSTable is created for each locality group
in each tablet.
• Segregating column families that are not typically accessed together enables more efficient reads.
– In WebTable, page metadata can be in one group
and contents of the page in another group.
Refinements: Compression
• Many opportunities for compression
  – Similar values in the same row/column at different timestamps
  – Similar values in different columns
  – Similar values across adjacent rows
• Two-pass custom compression scheme
  – First pass: compress long common strings across a large window
  – Second pass: look for repetitions in a small window
• Speed emphasized, but good space reduction (10-to-1)
Refinements: Bloom Filters
• A read operation has to go to disk when the desired SSTable isn't in memory
• Reduce the number of accesses by specifying a Bloom filter.
  – Allows us to ask if an SSTable might contain data for a specified row/column pair.
  – A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations
  – In practice, most lookups for non-existent rows or columns do not need to touch disk
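A compact Bloom-filter sketch in Python (for illustration; Bigtable's real filters are maintained per SSTable and per locality group). Membership tests can report false positives but never false negatives, so a negative answer lets a read skip the disk seek entirely.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits)      # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False -> definitely absent; True -> possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))   # True
print(bf.might_contain("com.example/missing:"))    # almost certainly False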
PNUTS /
SHERPA
To Help You Scale Your Mountains of Data
Yahoo! Serving Storage Problem
– Small records – 100KB or less
– Structured records – lots of fields, evolving
– Extreme data scale - Tens of TB
– Extreme request scale - Tens of thousands of requests/sec
– Low latency globally - 20+ datacenters worldwide
– High Availability - outages cost $millions
– Variable usage patterns - as applications and users change
What is PNUTS/Sherpa?
(Figure: a table with rows A–F replicated across multiple regions.)
• Parallel database
• Structured, flexible schema
  CREATE TABLE Parts (
    ID VARCHAR,
    StockNumber INT,
    Status VARCHAR
    …
  )
• Geographic replication
• Hosted, managed infrastructure
What Will It Become?
(Figure: the same replicated table, extended with indexes and views.)
• Indexes and views
Technology Elements
• Applications
• Tabular API: PNUTS API, YCA (authorization)
• PNUTS: query planning and execution, index maintenance
• Distributed infrastructure for tabular data: data partitioning, update consistency, replication
• YDOT FS (ordered tables), YDHT FS (hash tables), Tribble (pub/sub messaging), Zookeeper (consistency service)
Data Manipulation
• Per-record operations
– Get
– Set
– Delete
• Multi-record operations
– Multiget
– Scan
– Getrange
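The slides list the operations but not a client API; the sketch below is a hypothetical in-memory stand-in (the class MiniPnuts and its method names are invented for illustration) that mirrors the per-record and multi-record operations above.

class MiniPnuts:
    """Hypothetical in-memory stand-in for the per-record / multi-record calls."""
    def __init__(self):
        self.records = {}                 # key -> record (a dict of fields)

    # Per-record operations
    def get(self, key):
        return self.records.get(key)
    def set(self, key, record):
        self.records[key] = record
    def delete(self, key):
        self.records.pop(key, None)

    # Multi-record operations
    def multiget(self, keys):
        return {k: self.records.get(k) for k in keys}
    def scan(self):
        return iter(sorted(self.records.items()))
    def getrange(self, lo, hi):
        # Ordered-table range scan, inclusive bounds.
        return [(k, v) for k, v in sorted(self.records.items()) if lo <= k <= hi]

db = MiniPnuts()
db.set("apple", {"price": "$1"})
db.set("lime", {"price": "$9"})
print(db.getrange("a", "m"))      # both records fall inside the range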
Tablets—Hash Table
(Hash-space boundaries: 0x0000, 0x2AF3, 0x911F, 0xFFFF)
Name        Description                               Price
Grape       Grapes are good to eat                    $12
Lime        Limes are green                           $9
Apple       Apple is wisdom                           $1
Strawberry  Strawberry shortcake                      $900
Orange      Arrgh! Don't get scurvy!                  $2
Avocado     But at what price?                        $3
Lemon       How much did you pay for this lemon?      $1
Tomato      Is this a vegetable?                      $14
Banana      The perfect fruit                         $2
Kiwi        New Zealand                               $8
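A sketch of how a record name could be routed to a hash-table tablet: hash the name and pick the tablet whose boundary range covers the hash. The boundaries come from the figure; the 16-bit hash function here is illustrative, not the one the real system uses.

import bisect
import hashlib

# Upper boundaries of the hash-range tablets from the figure.
TABLET_BOUNDARIES = [0x2AF3, 0x911F, 0xFFFF]

def hash16(name):
    # Illustrative 16-bit hash; the real system's hash function differs.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) & 0xFFFF

def tablet_for(name):
    # Index of the first tablet whose upper boundary is >= the key's hash.
    return bisect.bisect_left(TABLET_BOUNDARIES, hash16(name))

for fruit in ["Grape", "Lime", "Apple", "Kiwi"]:
    print(fruit, hex(hash16(fruit)), "-> tablet", tablet_for(fruit))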
Tablets—Ordered Table
(Tablet boundaries: A, H, Q, Z)
Name        Description                               Price
Apple       Apple is wisdom                           $1
Avocado     But at what price?                        $3
Banana      The perfect fruit                         $2
Grape       Grapes are good to eat                    $12
Kiwi        New Zealand                               $8
Lemon       How much did you pay for this lemon?      $1
Lime        Limes are green                           $9
Orange      Arrgh! Don't get scurvy!                  $2
Strawberry  Strawberry shortcake                      $900
Tomato      Is this a vegetable?                      $14
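For the ordered table, routing is by key range rather than by hash. A minimal sketch using the boundaries A, H, Q, Z from the figure (the exact boundary handling is an assumption):

import bisect

# Ordered-table split points from the figure: tablets [A, H), [H, Q), [Q, Z].
SPLIT_POINTS = ["H", "Q"]

def ordered_tablet_for(name):
    # Number of split points <= the key's first letter gives the tablet index.
    return bisect.bisect_right(SPLIT_POINTS, name[0].upper())

for fruit in ["Apple", "Grape", "Kiwi", "Orange", "Tomato"]:
    print(fruit, "-> tablet", ordered_tablet_for(fruit))
# Apple/Grape -> tablet 0, Kiwi/Orange -> tablet 1, Tomato -> tablet 2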
Flexible Schema
Posted date   Listing id   Item    Price
6/1/07        424252       Couch   $570
6/1/07        763245       Bike    $86
6/3/07        211242       Car     $1123
6/5/07        421133       Lamp    $15
(Some records also carry optional Color and Condition fields, e.g. "Red", "Good", "Fair"; the schema need not be identical across records.)
Detailed Architecture
(Figure: within the local region, clients call a REST API; routers direct requests to the storage units, a tablet controller manages tablet placement, and Tribble carries updates between the local region and the remote regions.)
Tablet Splitting and Balancing
• Each storage unit has many tablets (horizontal partitions of the table)
• A storage unit may become a hotspot; shed load by moving tablets to other servers
• Tablets may grow over time; overfull tablets split
QUERY
PROCESSING
Accessing Data
(Figure: (1) the client sends "Get key k" to a router; (2) the router forwards the request to the storage unit (SU) that holds the key; (3) the SU returns the record for key k; (4) the router passes the record back to the client.)
Bulk Read
(Figure: (1) the client sends the key set {k1, k2, … kn} to a scatter/gather server; (2) the server issues Get k1, Get k2, Get k3, … in parallel to the storage units that own those keys.)
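A Python sketch of the scatter/gather step: group the requested keys by the storage unit that owns them, fetch each group, and merge the answers. FakeStorageUnit and the routing function are stand-ins for illustration, not PNUTS code.

from collections import defaultdict

class FakeStorageUnit:
    """Stand-in for a storage unit; holds a slice of the key space in memory."""
    def __init__(self, records):
        self.records = records
    def multiget(self, keys):
        return {k: self.records.get(k) for k in keys}

def bulk_read(keys, storage_unit_for, storage_units):
    """Scatter keys to their owning storage units, then gather the records."""
    per_su = defaultdict(list)
    for key in keys:                                  # scatter: group by owner
        per_su[storage_unit_for(key)].append(key)
    results = {}
    for su_id, su_keys in per_su.items():             # gather: fetch and merge
        results.update(storage_units[su_id].multiget(su_keys))
    return results

sus = {0: FakeStorageUnit({"k1": "v1"}), 1: FakeStorageUnit({"k2": "v2", "k3": "v3"})}
owner = lambda k: 0 if k == "k1" else 1               # toy routing function
print(bulk_read(["k1", "k2", "k3"], owner, sus))      # {'k1': 'v1', 'k2': 'v2', 'k3': 'v3'}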
Range Queries in YDOT
• Clustered, ordered retrieval of records
(Figure: a range query such as "Grapefruit…Pear?" goes to the router, which splits it into sub-ranges, e.g. "Grapefruit…Lime?" and "Lime…Pear?", and forwards each sub-range to the storage unit holding that ordered slice of the key space.)
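A sketch of how the router could split an ordered range query into per-tablet sub-ranges; the split points below are chosen to be consistent with the figure but are otherwise illustrative.

def split_range(lo, hi, split_points):
    """Split [lo, hi] at every tablet boundary that falls inside the range."""
    inside = [p for p in split_points if lo < p < hi]
    bounds = [lo] + inside + [hi]
    return list(zip(bounds, bounds[1:]))

# Illustrative tablet boundaries for the figure's ordered key space.
print(split_range("Grapefruit", "Pear", ["Canteloupe", "Lime", "Strawberry"]))
# [('Grapefruit', 'Lime'), ('Lime', 'Pear')]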
Updates
(Figure, steps 1–8: a "Write key k" request flows from the client through a router (1) to the storage unit holding the record (2); the write is published to the message brokers (3–5), which commit it and acknowledge SUCCESS (6); a sequence number for key k is then returned through the router (7) to the client (8).)
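The sequence numbers in the figure are what give per-record timeline consistency: every replica applies the writes for a given key in the same order. A minimal Python sketch of that idea (an illustration, not the actual replication protocol):

class Replica:
    """Applies updates for each key strictly in sequence-number order."""
    def __init__(self):
        self.values = {}       # key -> latest value
        self.last_seq = {}     # key -> last applied sequence number

    def apply(self, key, seq, value):
        expected = self.last_seq.get(key, 0) + 1
        if seq != expected:
            return False       # out of order: a real system would buffer or re-request
        self.values[key] = value
        self.last_seq[key] = seq
        return True

r = Replica()
print(r.apply("k", 1, "v1"), r.apply("k", 3, "v3"), r.apply("k", 2, "v2"))
# True False True  -> the replica refuses to skip sequence number 2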
SHERPA
IN CONTEXT
Types of Record Stores
• Query expressiveness (from simple to feature rich):
  – S3: object retrieval
  – PNUTS: retrieval from a single table of objects/records
  – Oracle: SQL
Types of Record Stores
• Consistency model (from best effort to strong guarantees):
  – S3: eventual consistency
  – PNUTS: timeline consistency (object-centric)
  – Oracle: ACID (program-centric)
Types of Record Stores
• Data model:
  – PNUTS, CouchDB: flexibility and schema evolution; object-centric consistency
  – Oracle: optimized for fixed schemas; consistency spans objects
Types of Record Stores
• Elasticity (ability to add resources on demand), from inelastic to elastic:
  – Oracle: limited (via data distribution)
  – PNUTS, S3: VLSD (very large scale distribution/replication)
Application Design Space
(Figure: a design space with one axis from "Get a few things" to "Scan everything" and another from Records to Files; systems such as Sherpa, MySQL, Oracle, BigTable, Everest, MObStor, YMDB, Filer, and Hadoop occupy different points in this space.)
Alternatives Matrix
(Figure: a matrix comparing Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, and Cassandra along the dimensions SQL/ACID, consistency model, updates, structured access, global low latency, availability, operability, and elasticity.)
Outline
 Cloud computing
 Map/Reduce, Bigtable and PNUTS
 CAP Theorem and Datalog
 Data indexing in the clouds
 Conclusion and open issues
The CAP Theorem
• Consistency: once a writer has written, all readers will see that write.
• Availability: the system remains available during software and hardware upgrades and node failures.
• Partition tolerance: the system can continue to operate in the presence of network partitions.
Theorem: you can have at most two of these three properties for any shared-data system.
Consistency
• Two kinds of consistency:
  – Strong consistency: ACID (Atomicity, Consistency, Isolation, Durability)
  – Weak consistency: BASE (Basically Available, Soft-state, Eventual consistency)
(Figure: the traditional RDBMS world: locks, 3NF, ACID.)
Datalog
• Main expressive advantage: recursive
queries.
• More convenient for analysis: papers look
better.
• Without recursion but with negation it is
equivalent in power to relational algebra
• Has affected real practice (e.g., recursion in SQL3, magic-sets transformations).
Datalog
• Example Datalog program:
  parent(bill, mary).
  parent(mary, john).
  ancestor(X,Y) :- parent(X,Y).
  ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
  ?- ancestor(bill, X).
• The query returns X = mary and X = john.
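Naive bottom-up evaluation of the recursive ancestor rule can be mimicked in a few lines of Python: keep applying the rules until no new facts are derived (a fixpoint), which is how the query above finds mary and john. This is only a sketch of the evaluation idea, not a Datalog engine.

def ancestors(parent_facts):
    """Compute ancestor/2 as the transitive closure of parent/2."""
    ancestor = set(parent_facts)                  # ancestor(X,Y) :- parent(X,Y).
    while True:
        new = {(x, y)                             # ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).
               for (x, z) in parent_facts
               for (z2, y) in ancestor if z == z2} - ancestor
        if not new:
            return ancestor                       # fixpoint reached
        ancestor |= new

facts = {("bill", "mary"), ("mary", "john")}
print(sorted(y for (x, y) in ancestors(facts) if x == "bill"))   # ['john', 'mary']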
Joseph’s Conjecture (1)
• CONJECTURE 1. Consistency And Logical
Monotonicity (CALM).
• A program has an eventually consistent,
coordination-free execution strategy if and
only if it is expressible in (monotonic) Datalog.
Joseph’s Conjecture (2)
• CONJECTURE 2. Causality Required Only for
Non-monotonicity (CRON).
• Program semantics require causal message
ordering if and only if the messages participate
in non-monotonic derivations.
Joseph’s Conjecture (3)
• CONJECTURE 3. The minimum number of
Dedalus timesteps required to evaluate a
program on a given input data set is equivalent
to the program’s Coordination Complexity.
Joseph’s Conjecture (4)
• CONJECTURE 4. Any Dedalus program P can be rewritten into an equivalent temporally minimized program P’ such that each inductive or asynchronous rule of P’ is necessary: converting that rule to a deductive rule would result in a program with no unique minimal model.
Circumstance has presented a rare
opportunity—call it an imperative—for the
database community to take its place in the sun,
and help create a new environment for parallel
and distributed computation to flourish.
------Joseph M. Hellerstein (UC Berkeley)
Open questions and conclusion
Open Questions
• What is the right consistency model?
• What is the right programming model?
• Whether and how to make use of caching?
• How to balance functionality and scale?
• What are the right cloud abstractions?
• Cloud interoperability
VLDB 2010 Tutorial
Concluding
• Data management for cloud computing poses a fundamental challenge to database researchers:
  – Scalability
  – Reliability
  – Data consistency
  – Elasticity
• The database community needs to be involved – maintaining the status quo will only marginalize our role.
VLDB 2010 Tutorial
New Textbook "Distributed Systems and Cloud Computing" (《分布式系统与云计算》)
 Introduction to Distributed Systems (分布式系统概述)
 Overview of Distributed Cloud Computing (分布式云计算技术综述)
 Cloud-based Platforms (分布式云计算平台)
 Cloud-based Programming (分布式云计算程序开发)
Further Reading
F. Chang et al.
Bigtable: A distributed storage system for structured data. In OSDI, 2006.
J. Dean and S. Ghemawat.
MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
G. DeCandia et al.
Dynamo: Amazon’s highly available key-value store. In SOSP, 2007.
S. Ghemawat, H. Gobioff, and S.-T. Leung.
The Google File System. In Proc. SOSP, 2003.
D. Kossmann.
The state of the art in distributed query processing.
ACM Computing Surveys, 32(4):422–469, 2000.
Further Reading
Efficient Bulk Insertion into a Distributed Ordered Table (SIGMOD 2008)
Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee,
Ramana Yerneni, Raghu Ramakrishnan
PNUTS: Yahoo!'s Hosted Data Serving Platform (VLDB 2008)
Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava,
Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen,
Nick Puz, Daniel Weaver, Ramana Yerneni
Asynchronous View Maintenance for VLSD Databases,
Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava and
Raghu Ramakrishnan
SIGMOD 2009
Cloud Storage Design in a PNUTShell
Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava
Beautiful Data, O’Reilly Media, 2009
Thanks
Thank you! (谢谢!)