An overview of Hulu’s metrics platform
Tristan Reid
Prasan Samtani

What we do
• Streaming video service
• > 5.5 million subscribers
• > 20 million unique visitors/month
• > 1 billion ads/month

It all begins with beacons
[Diagram: Living room device (Roku, Xbox, etc), Mobile device (Android, iPhone, etc), Web (hulu.com), Beacon collection service]

What’s in a beacon
    80 2013-04-01 00:00:00
    /v3/playback/start?
    bitrate=650
    &cdn=Akamai
    &channel=Anime
    &client=Explorer
    &computerguid=EA8FA1000232B8F6986C3E0BE55E9333
    &contentid=5003673
    …

Reporting platform (RP2)
• Find Metrics & Dimensions
• Design and execute reports

The pipeline
[Diagram: Devices, Devices, Devices, Beacon collection service, LogCollector/Flume, HDFS, Monitoring (metstat), MapReduce jobs/JobScheduler, Developers, Hive, Reporting (RP2), Business, Harpy – continuous aggregation, RDBMS]

Log Collection
[Diagram: Devices, Devices, Devices, Load balancer, Log Collection machine #1 … Log Collection machine #11, HDFS]
Files bucketed by beacon type and partitioned by hour

Directory hierarchy on HDFS
[Diagram: directory tree rooted at /user/hadoop/t 2, with hour directories 201401010000/ and 201401010100, subdirectories playback/ and revenue/ (…), and files 201401010100_playback_1.seq and 201401010100_playback_2.seq]

MapReduce - going from beacons to basefacts
    computerguid              EA8FA1000232B8F6986C3E0BE55E9333
    userid                    5238518
    video_id                  289696
    content_partner_id        398
    distribution_partner_id   602
    distro_platform_id        14
    is_on_hulu                0
    …
    hourid                    383149
    watched                   76426

If a program manipulates a large amount of data, it does so in a small number of ways
- Alan Perlis

The BeaconSpec compiler
[Diagram: Definitions of beacons and base-facts, Beaconspec compiler, Java MapReduce code that can run on the cluster]

What does our language look like?
    basefact playback_watched_uniques from playback/(position|end) {
        dimension harpyhour.id as hourid;
        dimension computerguid as computerguid;
        dimension userid as userid;
        required dimension video.id as video_id;
        required dimension contentPartner.id as content_partner_id;
        …
        dimension siteSessionId.chosen as site_session_id;
        dimension facebook.isfacebookconnected as is_facebook_connected;
        fact sum(watched.out) as watched;
    }

FAQ: Why didn’t we just use Pig?

The superior [program] cultivates itself so as to give rest to [programmers]
- Confucius, the Way of the Superior Man

Scheduling jobs
[Diagram labels: Outside world JobScheduler Interface JobMonitor MapReduce job JobMonitor MapReduce job JobMonitor MapReduce job JobScheduler Logmanager databases]
Checks databases for jobs that are ready to run and whether dependencies are met

JobScheduler technology
• The actor model of concurrency
  – Communication through async messaging
  – Completely encapsulated state
[Diagram labels: Message passing, Actor creation]
Central idea: Treat local objects as if they are distributed, as opposed to treating distributed objects as if they are local
Fault-tolerance – let it crash!

Harpy – continuous aggregations
[Diagram labels: Harpy Metadata Queue Processor DataSync Hive HDFS NFS Publishing Holding Sweeper Agg Scheduler HoldingDB Output DBs]

RP2
• Reporting Portal for pulling Metrics + Dimensions
• Quick ‘Demo’

Let’s reexamine the pipeline:
[Diagram: Devices, Devices, Devices, Beacon collection service, LogCollector/Flume, HDFS, Monitoring (metstat), MapReduce jobs/JobScheduler, Developers, Hive, Reporting (RP2), Business, Harpy – continuous aggregation, RDBMS]

Metstat
• Python Django App
• Tasks on Celery + RabbitMQ
• JQuery
• Tracks status, status changes and statistics
• Gets data directly from various sources (databases, HDFS)

FAQ: Why didn’t we just use Pig?
• Dataflow language
  – runs on Hadoop
• Pig philosophy
  – (Taken from the Apache website)
  – Pigs eat anything
  – Pigs live anywhere
  – Pigs are domestic animals
  – Pigs fly

[Diagram label: Beaconspec]

    REGISTER ./tutorial.jar;
    raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
    clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
    clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;

Beware of the Turing tar-pit where everything is possible but nothing of interest is easy
- Alan Perlis

[Diagram label: Beaconspec]

FAQ: What is open sourced?
• Slickint – database interface generation for Scala
  – github.com/zenbowman/slickint
• Local filesystem caching for hadoop
  – github.com/ZenBowman/luna
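
Illustrative sketch (not from the talk): the beacon-to-basefact step described in the slides can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not Hulu's MapReduce code or the output of the BeaconSpec compiler. The function names (`parse_beacon`, `hour_bucket`, `sum_watched`), the `watched` query parameter, and the sample lines are hypothetical. The hour label follows the YYYYMMDDHHMM pattern seen in the HDFS directory names rather than the numeric `hourid` shown in the basefact.

```python
import re
from collections import defaultdict
from datetime import datetime
from urllib.parse import parse_qsl, urlsplit


def parse_beacon(line):
    """Split one beacon log line into (timestamp, beacon_type, params).

    Assumed layout, inferred from the sample beacon slide: a leading numeric
    field, a 'YYYY-MM-DD HH:MM:SS' timestamp, then a path with a query string.
    """
    _, date, time, request = line.split(maxsplit=3)
    timestamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    url = urlsplit(request)
    # Drop the version prefix: '/v3/playback/start' -> 'playback/start'
    beacon_type = "/".join(url.path.strip("/").split("/")[1:])
    return timestamp, beacon_type, dict(parse_qsl(url.query))


def hour_bucket(timestamp):
    """Label the hour partition a beacon falls into, e.g. '201304010000'."""
    return timestamp.strftime("%Y%m%d%H00")


def sum_watched(lines, source=r"playback/(position|end)"):
    """Sum the (hypothetical) 'watched' parameter per hour, device and content.

    Mirrors the shape of the basefact definition: select beacons whose type
    matches `source`, group by a few dimensions, and sum one fact.
    """
    totals = defaultdict(int)
    for line in lines:
        timestamp, beacon_type, params = parse_beacon(line)
        if not re.fullmatch(source, beacon_type):
            continue  # e.g. playback/start beacons do not feed this basefact
        key = (
            hour_bucket(timestamp),
            params.get("computerguid"),
            params.get("contentid"),
        )
        totals[key] += int(params.get("watched", 0))
    return dict(totals)


# Hypothetical sample input, modelled on the beacon shown in the slides.
SAMPLE_LINES = [
    "80 2013-04-01 00:00:00 /v3/playback/start?bitrate=650&cdn=Akamai&contentid=5003673",
    "80 2013-04-01 00:05:00 /v3/playback/position?computerguid=A&contentid=5003673&watched=30",
    "80 2013-04-01 00:59:59 /v3/playback/end?computerguid=A&contentid=5003673&watched=12",
    "80 2013-04-01 01:00:00 /v3/playback/end?computerguid=A&contentid=5003673&watched=5",
]

if __name__ == "__main__":
    for key, watched in sorted(sum_watched(SAMPLE_LINES).items()):
        print(key, watched)
```

The grouping key plays the role of the `dimension` lines in the basefact definition and the running total plays the role of `fact sum(...)`; a real job would distribute this grouping across mappers and reducers rather than hold it in one dictionary.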