Data Freeway: Near-Realtime Message Bus using HDFS


Data Freeway: Scaling Out to Realtime
Authors: Eric Hwang, Sam Rash {ehwang,rash}@fb.com
Speaker: Haiping Wang [email protected]
Agenda
» Data at Facebook
» Realtime Requirements
» Data Freeway System Overview
» Realtime Components
  › Calligraphus/Scribe
  › HDFS use case and modifications
  › Calligraphus: a ZooKeeper use case
  › ptail
  › Puma
» Future Work
Big Data, Big Applications / Data at Facebook
» Lots of data
  › More than 500 million active users
  › 50 million users update their statuses at least once each day
  › More than 1 billion photos uploaded each month
  › More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  › Data rate: over 7 GB/second
» Numerous products can leverage the data
  › Revenue related: Ads Targeting
  › Product/User Growth related: AYML, PYMK, etc.
  › Engineering/Operation related: Automatic Debugging
  › Puma: streaming queries
Example: User-related Application
» Major challenges: Scalability, Latency
Realtime Requirements
› Scalability: 10-15 GBytes/second
› Reliability: No single point of failure
› Data loss SLA: 0.01%
  • Loss due to hardware: means at most 1 out of 10,000 machines can lose data
› Delay of less than 10 sec for 99% of data
  • Typically we see 2s
› Easy to use: as simple as ‘tail -f /var/log/my-log-file’
Data Freeway System Diagram
» Scribe & Calligraphus get data into the system
» HDFS at the core
» Ptail provides data out
» Puma is an emerging streaming analytics platform
Scribe
• Scalable distributed logging framework
• Very easy to use:
  • scribe_log(string category, string message)
• Mechanics:
  • Built on top of Thrift
  • Runs on every machine at Facebook, collecting log data into a set of destinations
  • Buffers data on local disk if the network is down
• History:
  • 2007: Started at Facebook
  • 2008 Oct: Open-sourced
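Purely as an illustration (not from the slides): a minimal sketch of calling Scribe from Java through its open-source Thrift interface. It assumes the Thrift-generated classes scribe.Client and LogEntry from Scribe's published IDL, libthrift on the classpath, and a local Scribe agent listening on the conventional port 1463.

```java
import java.util.Collections;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

// Sketch only: "scribe" and "LogEntry" are the classes generated from the
// open-source Scribe Thrift IDL; 1463 is Scribe's conventional local port.
public class ScribeLogExample {
    public static void main(String[] args) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        transport.open();
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));

        // The equivalent of scribe_log(category, message) from the slide.
        LogEntry entry = new LogEntry("my_category", "hello from an app server");
        client.Log(Collections.singletonList(entry));

        transport.close();
    }
}
```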
Calligraphus
» What
  › Scribe-compatible server written in Java
  › Emphasis on a modular, testable code base, and performance
» Why?
  › Extract a simpler design from the existing Scribe architecture
  › Cleaner integration with the Hadoop ecosystem
    • HDFS, ZooKeeper, HBase, Hive
» History
  › In production since November 2010
  › ZooKeeper integration since March 2011
HDFS: a different use case
» Message hub
  › Add concurrent reader support and sync
  › Writers + concurrent readers form a pub/sub model
HDFS: add Sync
» Sync
  › Implemented in 0.20 (HDFS-200)
    • Partial chunks are flushed
    • Blocks are persisted
  › Provides durability
  › Lowers write-to-read latency
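Not in the slides, but a rough sketch of what the write side looks like with sync support: a writer flushes after each batch so the partial block becomes visible (and durable on the datanodes) for concurrent readers. The path and batching interval are invented; on 0.20-append the call was sync(), on later Hadoop releases hflush().

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: write log records and periodically flush them so concurrent
// readers can see data in the partial (still open) block.
public class SyncingWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/staging/category1/current"));

        for (int i = 0; i < 1000; i++) {
            out.write(("record-" + i + "\n").getBytes("UTF-8"));
            if (i % 100 == 99) {
                // hflush() pushes the partial chunk to the datanodes;
                // on Hadoop 0.20-append the equivalent call was sync().
                out.hflush();
            }
        }
        out.close();
    }
}
```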
HDFS: Concurrent Reads Overview
» Without changes, stock Hadoop 0.20 does not allow access to the block being written
» Need to read the block being written for realtime apps in order to achieve < 10s latency
HDFS: Concurrent Reads Implementation
1. DFSClient asks the Namenode for blocks and locations
2. DFSClient asks the Datanode for the length of the block being written
3. DFSClient opens the last block
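The three steps above happen inside the DFSClient; from an application's perspective, tailing a file that is still being written looks roughly like the polling loop below. This is only a sketch of the idea (not ptail itself), with an invented path and poll interval.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: tail a file that is still being written, relying on the
// concurrent-read support described above to see the last (open) block.
public class HdfsTailer {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/staging/category1/current");

        long offset = 0;
        byte[] buf = new byte[64 * 1024];
        while (true) {
            FSDataInputStream in = fs.open(path);   // re-open to pick up the new length
            in.seek(offset);
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
                offset += n;
            }
            in.close();
            Thread.sleep(1000);                     // poll for newly flushed data
        }
    }
}
```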
Calligraphus: Log Writer
[Diagram: Scribe categories (Category 1–3) fan in to Calligraphus servers. How to persist to HDFS?]
Calligraphus (Simple)
[Diagram: each Calligraphus server writes every category it receives directly to HDFS.]
Number of categories × Number of servers = Total number of directories
Calligraphus (Stream Consolidation)
[Diagram: Calligraphus servers are split into routers and writers; routers forward each category to the writer that owns it (ownership coordinated via ZooKeeper), and the writers persist to HDFS.]
Number of categories = Total number of directories
ZooKeeper: Distributed Map
» Design
  › ZooKeeper paths as tasks (e.g. /root/<category>/<bucket>)
  › Canonical ZooKeeper leader elections under each bucket for bucket ownership (a sketch follows the diagram below)
  › Independent load management – leaders can release tasks
  › Reader-side caches
  › Frequent sync with the policy db
[Diagram: ZooKeeper task tree: a Root node with categories A–D, each containing buckets 1–5.]
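As a rough sketch of the bucket-ownership idea (the standard ZooKeeper leader-election recipe, not Calligraphus's actual code): each writer creates an ephemeral sequential znode under the bucket path and owns the bucket if its znode sorts first. The path layout and node name are illustrative, and the failover watch on the predecessor node is omitted for brevity.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the canonical ZooKeeper leader-election recipe applied to a
// bucket path such as /root/<category>/<bucket>. Not Calligraphus source code.
public class BucketElection {
    public static boolean tryOwnBucket(ZooKeeper zk, String bucketPath) throws Exception {
        // Each candidate writer registers an ephemeral sequential node;
        // it disappears automatically if the writer's session dies.
        String me = zk.create(bucketPath + "/candidate-",
                new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate with the lowest sequence number owns the bucket.
        List<String> candidates = zk.getChildren(bucketPath, false);
        Collections.sort(candidates);
        return me.endsWith(candidates.get(0));
    }
}
```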
Canonical Realtime ptail Application
» Hides the fact that we have many HDFS instances: the user can specify a category and get a stream
» Checkpointing
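The checkpointing mechanism is not detailed in the talk; conceptually it amounts to persisting a (file, offset) position so that a restarted consumer resumes exactly where it left off. A minimal sketch with an invented one-line checkpoint file format:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: save/restore an (hdfsFile, offset) checkpoint on local disk so a
// tailer can resume after a crash. The tab-separated format is invented.
public class Checkpoint {
    private final Path checkpointFile;

    public Checkpoint(String localPath) {
        this.checkpointFile = Paths.get(localPath);
    }

    public void save(String hdfsFile, long offset) throws IOException {
        Files.write(checkpointFile, (hdfsFile + "\t" + offset).getBytes(StandardCharsets.UTF_8));
    }

    public String[] load() throws IOException {
        if (!Files.exists(checkpointFile)) {
            return null;               // no checkpoint yet: start from the beginning
        }
        return new String(Files.readAllBytes(checkpointFile), StandardCharsets.UTF_8).split("\t");
    }
}
```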
Puma
Puma Overview
» Realtime analytics platform
» Metrics
  › count, sum, unique count, average, percentile
» Uses ptail checkpointing for accurate calculations in the case of failure
» Puma nodes are sharded by keys in the input stream
» HBase for persistence
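As a hedged illustration of the shape of such a node (not Puma's real code): consume the shard's slice of the ptail stream, aggregate counts in memory per key, and periodically flush them to HBase as increments. The table and column names are invented, and the modern HBase client API is assumed.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of a Puma-like "count" metric: buffer counts per key in memory,
// then flush the window to HBase. Table/column names are invented.
public class CountAggregator {
    private final Map<String, Long> counts = new HashMap<>();

    // Called for every event in this shard's slice of the input stream.
    public void onEvent(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Flush the in-memory window to HBase for persistence.
    public void flush(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("puma_counts"))) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                Increment inc = new Increment(Bytes.toBytes(e.getKey()));
                inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("count"), e.getValue());
                table.increment(inc);
            }
        }
        counts.clear();
    }
}
```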
Puma Write Path
Puma Read Path
» Performance
  › Elapsed time typically 200-300 ms for 30-day queries
  › 99th percentile, cross-country, < 500 ms for 30-day queries
Future Work
» Puma
  › Enhance functionality: add application-level transactions on HBase
  › Streaming SQL interface
» Compression