An overview of Hulu’s metrics platform Tristan Reid [email protected] Prasan Samtani [email protected] What we do • Streaming video service • > 5.5 million subscribers • > 20 million.

An overview of Hulu’s metrics platform
Tristan Reid
[email protected]
Prasan Samtani
[email protected]
What we do
• Streaming video service
• > 5.5 million subscribers
• > 20 million unique visitors/month
• > 1 billion ads/month
It all begins with beacons
Living room device (Roku, Xbox, etc.)
Mobile device (Android, iPhone, etc.)
Web (hulu.com)
Beacon collection service
What’s in a beacon
80 2013-04-01 00:00:00
/v3/playback/start?
bitrate=650
&cdn=Akamai
&channel=Anime
&client=Explorer
&computerguid=EA8FA1000232B8F6986C3E0BE55E9333
&contentid=5003673
…
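A beacon like the one above is essentially a request path with a query string, so it can be taken apart with standard URL tooling. The following is a minimal sketch in Python, not Hulu's actual parser: the function name `parse_beacon`, the reading of the path as version/type/event, and the shortened sample line are assumptions for illustration. The leading `80` is kept as an opaque string because the slides do not say what it means.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_beacon(line):
    """Split a raw beacon line into its leading field, timestamp, path parts and parameters."""
    # Assumed layout: <leading field> <date> <time> <request path with query string>
    lead, date, time, request = line.split(" ", 3)
    url = urlsplit(request)
    # Assumed path convention: /<version>/<beacon type>/<event>
    version, beacon_type, event = url.path.strip("/").split("/")
    return {
        "lead": lead,
        "timestamp": f"{date} {time}",
        "version": version,
        "type": beacon_type,
        "event": event,
        "params": dict(parse_qsl(url.query)),
    }


# Shortened version of the sample beacon, joined onto one line.
sample = ("80 2013-04-01 00:00:00 /v3/playback/start?bitrate=650&cdn=Akamai"
          "&channel=Anime&contentid=5003673")
beacon = parse_beacon(sample)
print(beacon["type"], beacon["event"], beacon["params"]["bitrate"])
```

Keeping the parameters as a plain dictionary of strings leaves type conversion to whichever downstream job knows what each field means.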
Reporting platform (RP2)
Find Metrics & Dimensions
Design and execute
reports
The pipeline
Devices
Devices
Devices
Beacon collection service
LogCollector/Flume
HDFS
Monitoring (metstat)
MapReduce jobs/JobScheduler
Developers
Hive
Reporting (RP2)
Business
Harpy – continuous aggregation
RDBMS
Log Collection
Devices
Devices
Devices
Load balancer
Log Collection machine #1
…
Log Collection machine #11
HDFS
Files bucketed by beacon type and partitioned by hour
Directory hierarchy on HDFS
/user/hadoop/t2/
  201401010000/
    playback/
    revenue/
    …
  201401010100/
    playback/
      201401010100_playback_1.seq
      201401010100_playback_2.seq
    revenue/
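The naming scheme on this slide (an hour bucket such as `201401010100`, a beacon type, and a part number) can be sketched as a small path builder. This is an illustrative sketch only: the function names, the root directory `/data/beacons`, and the exact nesting of hour and type directories are assumptions, and only the file name pattern `<hour>_<type>_<n>.seq` is taken from the slide.

```python
from datetime import datetime


def hour_bucket(ts):
    """Format a timestamp as its hour bucket, e.g. 201401010100 for 01:00-01:59."""
    return ts.strftime("%Y%m%d%H") + "00"


def sequence_file_path(root, beacon_type, ts, part):
    """Build an assumed path: <root>/<hour>/<type>/<hour>_<type>_<part>.seq"""
    bucket = hour_bucket(ts)
    return f"{root}/{bucket}/{beacon_type}/{bucket}_{beacon_type}_{part}.seq"


ts = datetime(2014, 1, 1, 1, 37, 12)
path = sequence_file_path("/data/beacons", "playback", ts, 1)  # hypothetical root
print(path)
```

Partitioning by hour in this way lets a downstream job pick up exactly one directory per beacon type per hour.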
MapReduce - going from beacons to basefacts
computerguid             EA8FA1000232B8F6986C3E0BE55E9333
userid                   5238518
video_id                 289696
content_partner_id       398
distribution_partner_id  602
distro_platform_id       14
is_on_hulu               0
…
hourid                   383149
watched                  76426
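Going from beacons to a basefact row like the one above amounts to grouping by the dimension columns and summing the fact column. The following in-memory sketch shows that shape with a map step and a reduce step. It is not the generated Java MapReduce code: the reduced set of dimensions, the function names, and the sample beacons are made up for illustration, with `watched` values chosen so that one group sums to the 76426 shown on the slide.

```python
from collections import defaultdict

# Assumed subset of the dimension columns from the slide.
DIMENSIONS = ("userid", "video_id", "hourid")


def map_beacon(beacon):
    """Map phase: emit (dimension tuple, watched amount) for one beacon."""
    key = tuple(beacon[d] for d in DIMENSIONS)
    return key, beacon["watched"]


def reduce_watched(pairs):
    """Reduce phase: sum the watched amounts per dimension tuple."""
    totals = defaultdict(int)
    for key, watched in pairs:
        totals[key] += watched
    return dict(totals)


# Invented sample beacons; the units of "watched" are not stated in the slides.
beacons = [
    {"userid": 5238518, "video_id": 289696, "hourid": 383149, "watched": 30000},
    {"userid": 5238518, "video_id": 289696, "hourid": 383149, "watched": 46426},
    {"userid": 1, "video_id": 289696, "hourid": 383149, "watched": 1000},
]
basefacts = reduce_watched(map_beacon(b) for b in beacons)
print(basefacts)
```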
If a program manipulates a large
amount of data, it does so in a small
number of ways
- Alan Perlis
The BeaconSpec compiler
Definitions of
beacons and
base-facts
Beaconspec
compiler
Java
MapReduce
code that can
run on the
cluster
What does our language look like?
basefact playback_watched_uniques from playback/(position|end) {
    dimension harpyhour.id as hourid;
    dimension computerguid as computerguid;
    dimension userid as userid;
    required dimension video.id as video_id;
    required dimension contentPartner.id as content_partner_id;
    …
    dimension siteSessionId.chosen as site_session_id;
    dimension facebook.isfacebookconnected as is_facebook_connected;
    fact sum(watched.out) as watched;
}
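To make the idea of a declarative basefact definition more concrete, here is a small Python sketch that turns a spec into a mapper function. It is an interpreter-style illustration, not the BeaconSpec compiler (which emits Java MapReduce code). The `BasefactSpec` class, the flat field names, and the assumption that a beacon missing a `required` dimension is dropped are all hypothetical. Only the basefact name and the source pattern `playback/(position|end)` come from the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class BasefactSpec:
    name: str
    source_pattern: str  # regex matched against "<type>/<event>"
    dimensions: list     # (beacon field, required?) pairs
    fact_field: str      # field that is summed into the fact


def make_mapper(spec):
    """Turn a declarative spec into a mapper: beacon -> (key, value) or None."""
    pattern = re.compile(spec.source_pattern)

    def mapper(beacon):
        if not pattern.fullmatch(f"{beacon['type']}/{beacon['event']}"):
            return None  # this beacon is not a source for the basefact
        key = []
        for field, required in spec.dimensions:
            value = beacon["params"].get(field)
            if value is None and required:
                return None  # assumed semantics: drop beacons missing a required dimension
            key.append(value)
        return tuple(key), int(beacon["params"].get(spec.fact_field, 0))

    return mapper


spec = BasefactSpec(
    name="playback_watched_uniques",
    source_pattern=r"playback/(position|end)",
    dimensions=[("userid", False), ("video_id", True)],
    fact_field="watched",
)
mapper = make_mapper(spec)
kept = mapper({"type": "playback", "event": "end",
               "params": {"userid": "5238518", "video_id": "289696", "watched": "120"}})
skipped = mapper({"type": "playback", "event": "start", "params": {}})
print(kept, skipped)
```

The point of the sketch is that the job author writes only the spec; the matching, key construction and summation logic are produced mechanically.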
FAQ: Why didn’t we just use Pig?
The superior [program] cultivates itself
so as to give rest to [programmers]
- Confucius, the Way of the Superior
Man
Scheduling jobs
Outside
world
JobScheduler
Interface
JobMonitor
MapReduce
job
JobMonitor
MapReduce
job
JobMonitor
MapReduce
job
JobScheduler
Logmanager
databases
Checks databases for jobs that are ready to run and whether their dependencies are met
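The readiness check described here can be sketched as a filter over a dependency map. This is an assumed simplification: the real JobScheduler reads job state from databases, whereas the sketch uses an in-memory dictionary, and the function name and job names are hypothetical.

```python
def ready_jobs(jobs, completed):
    """Return names of pending jobs whose dependencies have all completed."""
    return sorted(
        name
        for name, deps in jobs.items()
        if name not in completed and all(dep in completed for dep in deps)
    )


# Hypothetical job graph: job name -> list of jobs it depends on.
jobs = {
    "playback_basefacts": [],
    "revenue_basefacts": [],
    "daily_rollup": ["playback_basefacts", "revenue_basefacts"],
}
runnable = ready_jobs(jobs, completed={"playback_basefacts"})
print(runnable)
```

Here `daily_rollup` stays blocked until both of its upstream jobs appear in the completed set.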
JobScheduler technology
• The actor model of concurrency
– Communication through async messaging
– Completely encapsulated state
Message
passing
Actor
creation
Central idea: Treat local objects as if they are distributed, as opposed to treating distributed objects as if they are local
Fault-tolerance – let it crash!
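The three properties on this slide (async messaging, encapsulated state, and "let it crash") can be illustrated with a toy actor built on a thread and a queue. This is a sketch under stated assumptions, not the JobScheduler's implementation: the slides do not name the actor library in use, and here the supervisor is collapsed into the actor's own loop, whereas actor frameworks typically make it a separate parent actor. The `Actor` class and the handler are hypothetical.

```python
import queue
import threading


class Actor:
    """A tiny actor: private state, a mailbox, and one thread draining it."""

    def __init__(self, handler):
        self._handler = handler
        self._mailbox = queue.Queue()
        self.restarts = 0
        self._thread = threading.Thread(target=self._supervise, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish after draining earlier messages."""
        self._mailbox.put(None)  # sentinel
        self._thread.join()

    def _supervise(self):
        # "Let it crash": on any handler error, discard the state and restart.
        while True:
            state = {}  # fresh state, visible only to the handler
            try:
                self._run(state)
                return
            except Exception:
                self.restarts += 1

    def _run(self, state):
        while True:
            message = self._mailbox.get()
            if message is None:
                return
            self._handler(state, message)


results = []  # shared only so the demo can observe what the actor did


def handler(state, message):
    if message == "boom":
        raise ValueError("simulated failure")
    state["count"] = state.get("count", 0) + 1
    results.append((message, state["count"]))


actor = Actor(handler)
for m in ["a", "b", "boom", "c"]:
    actor.send(m)
actor.stop()
print(results, actor.restarts)
```

After the simulated failure the counter starts again from 1, which shows the restart with clean state rather than an attempt to repair the broken one.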
Harpy – continuous aggregations
Harpy
Metadata
Queue
Processor
DataSync
Hive
HDFS
NFS
Publishing
Holding
Sweeper
Agg
Scheduler
HoldingDB
Output DBs
RP2
• Reporting Portal for pulling Metrics +
Dimensions
• Quick ‘Demo’
Let’s Reexamine the pipeline:
Devices
Devices
Devices
Beacon collection service
LogCollector/Flume
HDFS
Monitoring (metstat)
MapReduce jobs/JobScheduler
Developers
Hive
Reporting (RP2)
Business
Harpy – continuous aggregation
RDBMS
Metstat
• Python Django App
• Tasks on Celery + RabbitMQ
• jQuery
• Tracks status, status changes and statistics
• Gets data directly from various sources (databases, HDFS)
FAQ: Why didn’t we just use Pig?
• Dataflow language – runs on Hadoop
• Pig philosophy
– (Taken from the Apache website)
– Pigs eat anything
– Pigs live anywhere
– Pigs are domestic animals
– Pigs fly
Beaconspec
REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
Beware of the Turing tar-pit where
everything is possible but nothing of
interest is easy
- Alan Perlis
Beaconspec
FAQ: What is open sourced?
• Slickint – database interface generation for
Scala
– github.com/zenbowman/slickint
• Local filesystem caching for Hadoop
– github.com/ZenBowman/luna