Hadoop @ LinkedIn
… and other stuff
Who Am I?
I'm this guy!
Hadoop… what is it good for?
Directly influenced by Hadoop
Indirectly influenced by Hadoop
50% for business analytics
A long long time ago (or 2009)
• 40 Million members
• Apache Hadoop 0.19
• 20 node cluster
• Machines built from Frys (pizza boxes)
• PYMK in 3 days!
Present
• Over 5000 nodes
• 6 clusters (1 production, 1 dev, 2 etl, 2 test)
• Apache Hadoop 1.0.4 (Hadoop 2.0 soon-ish)
• Security turned on
• About 900 users
• 15-20K Hadoop job submissions a day
• PYMK < 12 hours!
Current Setup
• Use Avro (mostly)
• Dev/Adhoc cluster
o Used for development and testing of workflows
o For analytic queries
• Prod clusters
o Data that will appear on our website
o Only reviewed workflows
• ETL clusters
o Walled off
Three Common Problems
Data In
Hadoop Cluster
Data Out
Data In
Databases (c. 2009-2010)
• Originally pulled directly through JDBC on backup DB
o Pulled deltas when available and merged
• Data arrives extra late (waits for replication of replicas)
o Large data pulls affected by daily locks
• Very manual: schema, connections, repairs
• No deltas meant no Sqoop
• Costly (Oracle)
[Diagram: Live Site; latency labels: 24 hr, 5-12 hr]
Databases (Present)
• Commit logs/deltas from Production
• Copied directly to HDFS
• Converted/merged to Avro
• Schema is inferred
[Diagram: Live Site; latency label: < 12 hr]
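The commit-log approach above boils down to folding deltas into the previous snapshot by primary key. A minimal sketch, assuming each delta carries a key, a commit sequence number, and either a new row or a delete marker; the `Delta` layout and the `merge_deltas` name are hypothetical and not taken from the actual pipeline:

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Delta(NamedTuple):
    key: str             # primary key of the row (hypothetical layout)
    seq: int             # commit sequence number; higher means newer
    row: Optional[dict]  # new row contents, or None for a delete


def merge_deltas(snapshot: Dict[str, dict], deltas: Iterable[Delta]) -> Dict[str, dict]:
    """Return a new snapshot with the deltas applied in commit order."""
    merged = dict(snapshot)  # leave the previous snapshot untouched
    for delta in sorted(deltas, key=lambda d: d.seq):
        if delta.row is None:
            merged.pop(delta.key, None)
        else:
            merged[delta.key] = delta.row
    return merged


snapshot = {"m1": {"name": "Ann"}, "m2": {"name": "Bob"}}
deltas = [
    Delta("m2", 2, None),              # delete
    Delta("m1", 1, {"name": "Anne"}),  # update
    Delta("m3", 3, {"name": "Cy"}),    # insert
]
merged = merge_deltas(snapshot, deltas)
```

On a cluster the same fold would run as a join keyed on the primary key; the in-memory version only illustrates the merge rule.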
Databases (Future 2014?)
• Diffs sent directly to Hadoop
• Avro format
• Lazily merge
• Explicit schema
Databus ( < 15 min )
Webtrack (c. 2009-2011)
• Flat files (xml)
• Pulled from every server periodically, grouped and gzipped
• Uploaded into Hadoop
• Failures nearly untraceable
• I seriously don’t know how many hops and copies
Webtrack (Present)
• Apache Kafka!! Yay!
• Avro in, Avro out
• Automatic pulls into Hadoop
• Auditing
5-10 mins end to end
Apache Kafka
• LinkedIn Events
• Service metrics
• Use schema registry
o Compact data (md5)
o Auto register
o Validate schema
o Get latest schema
• Migrating to Kafka 0.8
o Replication
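The schema registry idea can be sketched as follows: each message carries a short MD5 id instead of the full schema, producers auto-register schemas, and consumers look up a schema by id or ask for the latest one per topic. This is a toy in-memory version under those assumptions; the class and method names are hypothetical, and schema validation is left out:

```python
import hashlib
import json
from typing import Dict, List


class SchemaRegistry:
    """Toy in-memory registry; a real one would be a shared service."""

    def __init__(self) -> None:
        self._by_id: Dict[str, str] = {}
        self._by_topic: Dict[str, List[str]] = {}

    def register(self, topic: str, schema: dict) -> str:
        # Canonicalize so equivalent schemas hash to the same id.
        text = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        schema_id = hashlib.md5(text.encode("utf-8")).hexdigest()
        self._by_id.setdefault(schema_id, text)
        ids = self._by_topic.setdefault(topic, [])
        if schema_id not in ids:
            ids.append(schema_id)
        return schema_id

    def get(self, schema_id: str) -> dict:
        return json.loads(self._by_id[schema_id])

    def latest(self, topic: str) -> dict:
        return self.get(self._by_topic[topic][-1])


registry = SchemaRegistry()
v1 = {"type": "record", "name": "PageView",
      "fields": [{"name": "memberId", "type": "long"}]}
v2 = {"type": "record", "name": "PageView",
      "fields": [{"name": "memberId", "type": "long"},
                 {"name": "pageKey", "type": "string"}]}
id1 = registry.register("PageViewEvent", v1)
id2 = registry.register("PageViewEvent", v2)
```

The id is a fixed-size digest, so a message carries only that id ahead of its Avro payload no matter how large the schema grows.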
Apache Kafka + Hadoop = Camus
• Avro only
• Uses zookeeper
o Discover new topics
o Find all brokers
o Find all partitions
• Mappers pull from Kafka
• Keeps offsets in HDFS
• Partitions data by hour
• Counts incoming events
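The hourly partitioning and event counting can be illustrated roughly like this. It is a sketch only, assuming each event exposes a topic and an epoch-millisecond timestamp; the directory layout and function names are hypothetical, not Camus's actual output format:

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int, bytes]  # (topic, timestamp in epoch millis, payload)


def hourly_path(topic: str, ts_millis: int) -> str:
    """Map an event to an hourly directory (hypothetical layout)."""
    hour = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return f"{topic}/hourly/{hour:%Y/%m/%d/%H}"


def partition_events(events: Iterable[Event]) -> Tuple[Dict[str, List[bytes]], Counter]:
    """Group payloads by hourly path and count events per path."""
    buckets: Dict[str, List[bytes]] = defaultdict(list)
    counts: Counter = Counter()
    for topic, ts_millis, payload in events:
        path = hourly_path(topic, ts_millis)
        buckets[path].append(payload)
        counts[path] += 1
    return dict(buckets), counts


def millis(*parts: int) -> int:
    """Helper for the example: UTC date parts to epoch milliseconds."""
    return int(datetime(*parts, tzinfo=timezone.utc).timestamp() * 1000)


events = [
    ("PageViewEvent", millis(2013, 10, 1, 0, 30), b"a"),
    ("PageViewEvent", millis(2013, 10, 1, 0, 59), b"b"),
    ("PageViewEvent", millis(2013, 10, 1, 1, 0), b"c"),
]
buckets, counts = partition_events(events)
```

The per-path counts are what an audit step can later compare against the counts reported upstream.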
Kafka Auditing
• Use Kafka to Audit itself
• Tool to audit and alert
• Compare counts
• Kafka 0.8?
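Comparing counts is the core of the audit: for every topic and time window, the number of events produced should match the number that landed in Hadoop. A minimal sketch, assuming counts are already aggregated per (topic, window); the `audit` function, the window labels, and the tolerance parameter are hypothetical:

```python
from typing import Dict, List, Tuple

Window = Tuple[str, str]  # (topic, time window label)


def audit(produced: Dict[Window, int], loaded: Dict[Window, int],
          tolerance: float = 0.0) -> List[str]:
    """Return one alert line per window whose counts disagree."""
    alerts = []
    for window in sorted(set(produced) | set(loaded)):
        sent = produced.get(window, 0)
        seen = loaded.get(window, 0)
        # Allow a fractional mismatch relative to the produced count.
        if abs(sent - seen) > sent * tolerance:
            topic, label = window
            alerts.append(f"{topic} {label}: produced={sent} loaded={seen}")
    return alerts


produced = {("PageViewEvent", "10:00"): 1000, ("PageViewEvent", "10:10"): 1000}
loaded = {("PageViewEvent", "10:00"): 1000, ("PageViewEvent", "10:10"): 990}
alerts = audit(produced, loaded)
```

With `tolerance=0.01` the 10-event gap in the second window would pass; with the default of zero it raises an alert.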
Lessons We Learned
• Avoid lots of small files
• Automation with Auditing = sleep for me
• Group similar data = smaller/faster
• Spend time writing to spend less time reading
o Convert to binary, partition, compress
• Future:
o adaptive replication (higher for new, lower for old)
o Metadata store (HCatalog)
o Columnar store (ORC? Parquet?)
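One way to act on "avoid lots of small files" is to plan a compaction pass that groups small files into batches close to a target size. The sketch below is a simple greedy grouping, assuming file sizes are known up front; the function name and the choice of target size are assumptions, not a description of any existing tool:

```python
from typing import List, Tuple


def plan_compaction(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily group files, largest first, so no group of two or more
    files exceeds target_bytes. A single oversized file gets its own group."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Sizes in MB for readability; 128 stands in for a block-sized target.
files = [("a", 100), ("b", 60), ("c", 50), ("d", 30), ("e", 10)]
plan = plan_compaction(files, target_bytes=128)
```

Each group would then be rewritten as one output file, turning five inputs into three here.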
Processing Data
Pure Java
• Time consuming writing jobs
• Little code re-use
• Shoot yourself in the face
• Only used when necessary
o Performance
o Memory
• Lots of libraries to help (boilerplate stuff)
Little Piggy (Apache Pig)
• Mainly a pigsty (Pig 0.11)
• Used by data products
• Transparent
• Good performance, tunable
• UDFs, DataFu
• Tuples and bags? WTF
Apache Hive
• Hive 0.11
• Only for Adhoc queries
o Biz ops, PMs, analysts
• Hard to tune
• Easy to use
• Lots of adoption
• ETL data in external tables :/
• HiveServer2 for JDBC
Future in Processing
• Giraph
• Impala, Shark/Spark… etc
• Tez
• Crunch
• Other?
• Say no to streaming
Run hadoop jobs in order
Run regular schedules
Be notified on failures
Understand how flows are executed
View execution history
Easy to use
Azkaban @ LinkedIn
Used at LinkedIn since early 2009
Powers all our Hadoop data products
Been using 2.0+ since late 2012
2.0 and 2.1 quietly released early 2013
Azkaban @ LinkedIn
• One Azkaban instance per cluster
• 6 clusters total
• 900 Users
• 1500 projects
• 10,000 flows
• 2500 flows executing per day
• 6500 jobs executing per day
Azkaban (before)
Engineer designed UI...
Azkaban 2.0
Azkaban Features
Schedules DAGs for execution
Web UI
Simple job files to create dependencies
Project Isolation
Extensible through plugins (works with any version of Hadoop)
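To show how simple job files turn into a DAG, here is a sketch that reads key=value job definitions and derives a valid execution order. It assumes each job file is a plain properties file with a comma-separated `dependencies` key; the job names and commands are made up, and the ordering uses Python's standard `graphlib` (3.9+) rather than anything from Azkaban itself:

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def parse_job(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def execution_order(jobs: Dict[str, str]) -> List[str]:
    """Return job names so that every job runs after its dependencies."""
    graph = {}
    for name, text in jobs.items():
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-job flow, keyed by job name.
jobs = {
    "publish": "type=command\ncommand=echo publish\ndependencies=load,score",
    "score": "type=command\ncommand=echo score\ndependencies=load",
    "load": "type=command\ncommand=echo load",
}
order = execution_order(jobs)
```

A scheduler would run independent jobs in parallel; the flat order is only the simplest way to see that dependencies are respected.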
Azkaban - Upload
Zip Job files, jars, project files
Azkaban - Execute
Azkaban - Schedule
Azkaban - Viewer Plugins
HDFS Browser
Future Azkaban Work
Higher availability
Generic Triggering/Actions
Embedded graphs
Conditional branching
Admin client
Data Out
Voldemort
• Distributed Key-Value Store
• Based on Amazon Dynamo
• Pluggable
• Open-source
Voldemort Read-Only
• Filesystem store for RO
• Create data files and index on Hadoop
• Copy data to Voldemort
• Swap
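The build, copy, and swap cycle can be pictured with a small model: an offline step produces a sorted index plus one data blob, the serving side keeps versions, and a swap just points reads at the newest version (with the old one kept for rollback). This is an illustrative sketch, not Voldemort's actual file format; the MD5-keyed index, class name, and method names are assumptions, and hash collisions are ignored:

```python
import bisect
import hashlib
from typing import Dict, List, Optional, Tuple

Index = List[Tuple[bytes, int, int]]  # (key hash, offset, length), sorted


def build_store(records: Dict[str, bytes]) -> Tuple[Index, bytes]:
    """Offline build step: one data blob plus a sorted index into it."""
    index: Index = []
    data = bytearray()
    for key, value in records.items():
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index.append((digest, len(data), len(value)))
        data.extend(value)
    index.sort()
    return index, bytes(data)


class ReadOnlyStore:
    """Serves the newest version; older versions stay for rollback."""

    def __init__(self) -> None:
        self._versions: List[Tuple[Index, bytes]] = []

    def swap(self, index: Index, data: bytes) -> None:
        self._versions.append((index, data))

    def rollback(self) -> None:
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()

    def get(self, key: str) -> Optional[bytes]:
        if not self._versions:
            return None
        index, data = self._versions[-1]
        digest = hashlib.md5(key.encode("utf-8")).digest()
        # Binary search for the first entry with this key hash.
        pos = bisect.bisect_left(index, (digest, 0, 0))
        if pos < len(index) and index[pos][0] == digest:
            _, offset, length = index[pos]
            return data[offset:offset + length]
        return None


store = ReadOnlyStore()
store.swap(*build_store({"m1": b"pymk-v1"}))
store.swap(*build_store({"m1": b"pymk-v2", "m2": b"pymk-x"}))
```

Because the files are immutable once built, reads never contend with writes, and rolling back is just dropping the newest version.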
Voldemort + Hadoop
• Transfers are parallel
• Transfer records in bulk
• Ability to Roll back
• Simple, operationally low maintenance
• Why not HBase, Cassandra?
o Legacy, and no compelling reason to change
o Simplicity is nice
o Real answer: I don’t know. It works, we’re happy.
Apache Kafka
• Reverse the flow
• Messages produced by Hadoop
• Consumer upstream takes action
• Used for emails, r/w store updates, and other cases where Voldemort doesn’t make sense
Nearing the End
Misc Hadoop at LinkedIn
• Believe in optimization
o File size, task count and utilization
o Reviews, culture
• Strict limits
o Quotas on size/file count
o 10K task limit
• Use capacity scheduler
o Default queue with 15m limit
o marathon for others
We do a lot with little…
• 50-60% cluster utilization
o Or about 5x more work than some other companies
• Every job is reviewed for production
o Teaches good practices
o Schedule to optimize utilization
o Prevents future headaches
• These keep our team size small
o Since 2009, Hadoop users grew 90x, clusters grew 25x, LinkedIn employees grew 15x
o Hadoop team grew 5x (to 5 people)
The End