Dumbo in the Cloud Report Series, Part 2
Cloud Group
Hadoop in SIGMOD 2011
Outline
Introduction
Nova: Continuous Pig/Hadoop Workflows
Apache Hadoop Goes Realtime at Facebook
Emerging Trends in the Enterprise Data Analytics
A Hadoop Based Distributed Loading Approach to
Parallel Data Warehouses
Industrial Sessions in SIGMOD 2011
Data Management for Feeds and Streams (2)
Applying Hadoop (4)
Dynamic Optimization and Unstructured Content (4)
Support for Business Analytics and Warehousing (4)
Business Analytics (2)
Nova: Continuous Pig/Hadoop Workflows
By Yahoo!
Nova Overview
Scenarios
Ingesting and analyzing user behavior logs
Building and updating a search index from a stream of crawled web
pages
Processing semi-structured data
Two-layer programming model (Nova over Pig)
Continuous processing
Independent scheduling
Cross-module optimization
Manageability features
Workflow Model
Workflow
Two kinds of vertices: tasks (processing
steps) and channels (data containers)
Edges connect tasks to channels and channels
to tasks
Four common patterns of processing
Non-incremental (template detection)
Stateless incremental (shingling)
Stateless incremental with lookup table
(template tagging)
Stateful incremental (de-duping)
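The two vertex kinds and the edge rule above can be sketched in a few lines of Python. This is only an illustration of the graph shape, not Nova's actual API: the class `Workflow`, its methods, and the channel names are all hypothetical, and the task names are borrowed from the pattern examples on this slide.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical model: a bipartite graph of tasks and channels."""
    tasks: set = field(default_factory=set)
    channels: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, target) name pairs

    def add_task(self, name):
        self.tasks.add(name)

    def add_channel(self, name):
        self.channels.add(name)

    def connect(self, src, dst):
        # Edges may only go task -> channel or channel -> task.
        task_to_channel = src in self.tasks and dst in self.channels
        channel_to_task = src in self.channels and dst in self.tasks
        if not (task_to_channel or channel_to_task):
            raise ValueError(f"illegal edge {src} -> {dst}")
        self.edges.add((src, dst))


# Illustrative workflow: crawled pages are shingled, then de-duplicated.
wf = Workflow()
for channel in ("crawled_pages", "shingles", "unique_pages"):
    wf.add_channel(channel)
for task in ("shingling", "de_duping"):
    wf.add_task(task)

wf.connect("crawled_pages", "shingling")
wf.connect("shingling", "shingles")
wf.connect("shingles", "de_duping")
wf.connect("de_duping", "unique_pages")
```

Connecting two tasks directly, for example `wf.connect("shingling", "de_duping")`, raises `ValueError` in this sketch, which mirrors the rule that data always passes through a channel.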
Workflow Model (Cont.)
Data and Update Model
Blocks: A channel’s data is divided into blocks
Base block
Contains a complete snapshot of data on a
channel as of some point in time
Base blocks are assigned increasing sequence
numbers (B0, B1, B2, ..., Bn)
Delta block
Used in conjunction with incremental
processing
Contains instructions for transforming a base
block into a new base block: Δ(i→j) applied to Bi yields Bj (i < j)
Workflow Model (Cont.)
Task/Data Interface
Consumption mode: all or new
Production mode: B or Δ
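A minimal Python sketch of the two consumption modes, under a strong simplifying assumption: the channel is append-only, so every delta only adds records. The class `Channel` and its method names are hypothetical and are not taken from Nova. A task that consumes "new" keeps a cursor (the last sequence number it has seen) and reads only what arrived after it. A task that consumes "all" reads the full contents.

```python
class Channel:
    """Hypothetical append-only channel.

    Batch 0 stands in for base block B0; every later batch stands in
    for a delta block that only appends records.
    """

    def __init__(self, base_records):
        self.batches = [list(base_records)]

    def append_delta(self, records):
        # Production mode "Δ": add an incremental batch.
        self.batches.append(list(records))

    @property
    def latest_seq(self):
        return len(self.batches) - 1

    def consume_all(self):
        # Consumption mode "all": every record currently on the channel.
        return [r for batch in self.batches for r in batch]

    def consume_new(self, cursor):
        # Consumption mode "new": records added after `cursor`,
        # together with the advanced cursor.
        new = [r for batch in self.batches[cursor + 1:] for r in batch]
        return new, self.latest_seq


logs = Channel(["a", "b"])
cursor = logs.latest_seq            # the task has already seen B0
logs.append_delta(["c"])
logs.append_delta(["d", "e"])

new_records, cursor = logs.consume_new(cursor)
all_records = logs.consume_all()
```

Here `new_records` holds only the three appended records, while `all_records` holds all five. A second `consume_new(cursor)` call returns an empty list until another delta arrives.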
Workflow Model (Cont.)
Workflow Programming and Scheduling
Data-based trigger.
Time-based trigger
Cascade trigger.
Data Compaction and Garbage Collection
If a channel has blocks B0, Δ(0→1), Δ(1→2), Δ(2→3), the
compaction operation computes and adds B3 to the channel
After compaction is used to add B3 to the channel, and the current
cursor is at sequence number 2, then B0, Δ(0→1), Δ(1→2)
can be garbage-collected.
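The compaction and garbage-collection example can be replayed with a small Python sketch. Everything here is an assumption made for illustration: base blocks are modeled as sets of records, a delta block as an `(added, removed)` pair of sets keyed by `(i, j)`, and the garbage-collection rule is a simplified one chosen to reproduce the example above, not Nova's actual policy.

```python
def compact(bases, deltas):
    """Fold the chain of deltas into the newest base block.

    Assumes at most one delta starts at each sequence number.
    Returns the sequence number of the new base block.
    """
    seq = max(bases)
    records = set(bases[seq])
    while True:
        nxt = next((j for (i, j) in deltas if i == seq), None)
        if nxt is None:
            break
        added, removed = deltas[(seq, nxt)]
        records = (records - removed) | added
        seq = nxt
    bases[seq] = records
    return seq


def garbage_collect(bases, deltas, min_cursor):
    """Drop blocks that no consumer needs (simplified rule).

    Keeps the newest base block, and every delta that starts at or
    after both the smallest task cursor and the newest base block.
    """
    newest = max(bases)
    for seq in [s for s in bases if s < newest]:
        del bases[seq]
    limit = min(min_cursor, newest)
    for key in [k for k in deltas if k[0] < limit]:
        del deltas[key]


# Channel with B0 and three deltas, as in the example above.
bases = {0: {"a", "b"}}
deltas = {
    (0, 1): ({"c"}, set()),   # add "c"
    (1, 2): (set(), {"a"}),   # remove "a"
    (2, 3): ({"d"}, set()),   # add "d"
}

new_seq = compact(bases, deltas)              # adds B3
garbage_collect(bases, deltas, min_cursor=2)  # cursor at sequence number 2
```

After these two calls only B3 and the delta from 2 to 3 remain, which matches the example: a task whose cursor is at 2 can still read the delta from 2 to 3, and a task that consumes everything reads B3.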
Nova System Architecture
Apache Hadoop Goes Realtime at Facebook
By Facebook
Workload Types
Facebook Messaging
High Write Throughput
Large Tables
Data Migration
Facebook Insights
Realtime Analytics
High Throughput Increments
Facebook Metrics System (ODS)
Automatic Sharding
Fast Reads of Recent Data and Table Scans
Why Hadoop & HBase
Elasticity
High write throughput
Efficient and low-latency strong consistency semantics within
a data center
Efficient random reads from disk
High Availability and Disaster Recovery
Fault Isolation
Atomic read-modify-write primitives
Range Scans
Tolerance of network partitions within a single data center
Zero Downtime in case of individual data center failure
Active-active serving capability across different data centers
Realtime HDFS
High Availability - AvatarNode
Realtime HDFS (Cont.)
Hadoop RPC compatibility
Block Availability: Placement Policy
a pluggable block placement policy
Realtime HDFS (Cont.)
Performance Improvements for a Realtime Workload
RPC Timeout
Reads from Local Replicas
New Features
HDFS sync
Concurrent Readers
Production HBase
ACID Compliance (RWCC: Read Write Consistency Control)
Atomicity (WALEdit)
Consistency
Availability Improvements
HBase Master rewrite: region assignment state moved from memory to ZooKeeper
Online Upgrades
Distributed Log Splitting
Performance Improvements
Compaction(minor and major)
Read Optimizations
Emerging Trends in the Enterprise Data Analytics:
Connecting Hadoop and DB2 Warehouse
By IBM
Motivation
1. Increasing volumes of data
2. Hadoop-based solutions in conjunction with
data warehouses
A Hadoop Based Distributed Loading Approach to
Parallel Data Warehouses
By Teradata
Motivation
ETL (Extraction, Transformation, Loading) is a critical
part of a data warehouse
While data is partitioned and replicated across all
nodes in a parallel data warehouse, load utilities reside
on a single node (a bottleneck)
Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
More disk space can be easily added
Use as intermediate storage
MapReduce for transformation
Load data in parallel
Block Assignment Problem
HDFS file F on a cluster of P nodes (each node is uniquely
identified by an integer i where 1 ≤ i ≤ P)
The problem is defined by: assignment(X, Y, n, m, k, r)
X is the set of n blocks (X = {1, . . . , n}) of F
Y is the set of m nodes running PDBMS (called PDBMS nodes)
(Y ⊆ {1, . . . , P})
k is the number of copies of each block; m is the number of PDBMS nodes
r is the mapping that records the replica locations of
each block: r(i) returns the set of nodes that hold a copy of
block i.
Block Assignment Problem (Cont.)
An assignment g from the blocks in X to the nodes in Y is
denoted by a mapping from X = {1, . . . , n} to Y where g(i)
= j (i ∈ X, j ∈ Y) means that block i is assigned to
node j.
An even assignment g is an assignment such that ∀ i ∈ Y ∀
j ∈ Y: | |{x | 1 ≤ x ≤ n and g(x) = i}| - |{y | 1 ≤ y ≤ n and g(y)
= j}| | ≤ 1.
The cost of an assignment g is defined to be cost(g) = |{i |
1 ≤ i ≤ n and g(i) ∉ r(i)}|, which is the number of blocks
assigned to remote nodes.
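The definitions of even assignment and cost translate directly into Python. The sketch below is for illustration only: the function names and the sample data are hypothetical, and the exhaustive search in `best_even_assignment` is a brute-force stand-in that only works for tiny inputs. It is not the algorithm proposed by the paper's authors.

```python
from collections import Counter
from itertools import product


def is_even(g, nodes):
    """True if the per-node block counts differ by at most 1."""
    counts = Counter(g.values())
    loads = [counts.get(node, 0) for node in nodes]
    return max(loads) - min(loads) <= 1


def cost(g, r):
    """Number of blocks assigned to a node holding no replica of them."""
    return sum(1 for block, node in g.items() if node not in r[block])


def best_even_assignment(blocks, nodes, r):
    """Try every assignment and keep the cheapest even one (tiny inputs only)."""
    best = None
    for choice in product(nodes, repeat=len(blocks)):
        g = dict(zip(blocks, choice))
        if is_even(g, nodes) and (best is None or cost(g, r) < cost(best, r)):
            best = g
    return best


# Hypothetical instance: P = 4 cluster nodes, PDBMS on nodes 1 and 2,
# n = 4 blocks, k = 2 copies of each block.
blocks = [1, 2, 3, 4]
pdbms_nodes = [1, 2]
r = {1: {1, 3}, 2: {1, 3}, 3: {1, 2}, 4: {3, 4}}

g = best_even_assignment(blocks, pdbms_nodes, r)
remote_blocks = cost(g, r)
```

In this instance block 4 has no replica on a PDBMS node, so every assignment pays for at least one remote block. The search finds an even assignment (two blocks per node) in which block 4 is the only remote one.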
Thank You!