Apache Kafka A high-throughput distributed messaging system Johan Lundahl Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and.
Download ReportTranscript Apache Kafka A high-throughput distributed messaging system Johan Lundahl Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and.
Apache Kafka A high-throughput distributed messaging system Johan Lundahl Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts – Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo 2 What is Apache Kafka? • Distributed, high-throughput, pub-sub messaging system – • Main use cases: – • • • Fast, Scalable, Durable log aggregation, real-time processing, monitoring, queueing Originally developed by LinkedIn Implemented in Scala/Java Top level Apache project since 2012: http://kafka.apache.org/ Comparison to other messaging systems – – Traditional: JMS, xxxMQ/AMQP New gen: Kestrel, Scribe, Flume, Kafka Message queues Low throughput, low latency Log aggregators High throughput, high latency RabbitMQ JMS ActiveMQ Flume Hedwig Kafka Scribe Batch jobs Qpid 4 Kestrel Kafka concepts Producers Frontend Frontend Topic1 Topic1 Service Topic3 Topic2 Push Broker Kafka Pull Topic3 Topic3 Topic1 Topic2 Topic3 Topic2 Topic1 Consumers 5 Monitoring Stream processing Batch processing Data warehouse Distributed model Producer Producer Producer Producer persistence Partitioned Data Publication Broker Broker Broker Zookeeper Ordered subscription Topic1 consumer group 6 Topic2 consumer group Intra cluster replication Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts – Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo 7 Performance factors • • • • Broker doesn’t track consumer state Everything is distributed Zero-copy (sendfile) reads/writes Usage of page cache backed by sequential disk allocation • • • • • Like a distributed commit log Low overhead protocol Message batching (Producer & Consumer) Compression (End to end) Configurable ack levels From: http://queue.acm.org/detail.cfm?id=1563874 8 Kafka features and strengths • • • • • • • • 9 Simple model, focused on high throughput and durability O(1) time persistence on disk Horizontally scalable by design (broker and consumers) Push - pull => consumer burst tolerance Replay messages Multiple independent subscribes per topic Configurable batching, compression, serialization Online upgrades Tradeoffs • • • • Not optimized for millisecond latencies Have not beaten CAP Simple messaging system, no processing Zookeeper becomes a bottleneck when using too many topics/partitions (>>10000) • Not designed for very large payloads (full HD movie etc.) • Helps to know your data in advance 10 Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts – Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo 11 Message/Log Format Message Log based queue (Simplified model) Broker Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender Producer1 Producer2 Topic1 Topic2 Message1 Message1 Message2 Message2 Message3 Message3 Message4 Message4 Message5 Message5 Message6 Message6 Message7 Message7 Message8 • Batching • Compression • Serialization Message9 Message10 Consumer1 Consumer2 Consumer3 Consumer3 Consumer3 ConsumerGroup1 Partitioning Broker Partitions Group1 Consumer Topic1 Producer Consumer Producer Producer Consumer Group2 Topic2 Producer Group3 Producer Consumer Consumer Consumer No partition for this guy Consumer Keyed messages #partitions=3 hash(key) % #partitions BrokerId=1 BrokerId=2 Topic1 Topic1 Topic1 Message1 Message2 Message3 Message5 Message4 Message7 Message9 Message6 Message11 Message13 Message8 Message15 Message17 Message10 Message12 Message14 Producer BrokerId=3 Message16 Message18 Intra cluster replication Replication factor = 3 Follower fails: • • Follower dropped from ISR When follower comes online again: fetch data from leader, then ISR gets updated Leader fails: • • Detected via Zookeeper from ISR New leader gets elected Producer ack Broker1 InSyncReplicas Broker2 Broker3 Topic1 leader Topic1 follower Topic1 follower Message1 Message1 Message1 Message2 Message2 Message2 Message3 Message3 Message3 Message4 Message4 Message4 Message5 Message5 Message5 Message6 Message6 Message6 Message7 Message7 Message7 Message8 Message8 Message8 Message9 Message10 3 commit modes: Commit mode Latency Durability Fire & Forget “none” Weak Leader ack 1 roundtrip Medium Full replication 2 roundtrips Strong ack Message9 Message10 ack Message9 Message10 Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts – Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo 17 Producer API …or for log aggregation: Configuration parameters: ProducerType (sync/async) CompressionCodec (none/snappy/gzip) BatchSize EnqueueSize/Time Encoder/Serializer Partitioner #Retries MaxMessageSize … Consumer API(s) • • High-level (consumer group, auto-commit) Low-level (simple consumer, manual commit) Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts – Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo 20 Broker Protips • • • • • • • • • Reasonable number of partitions – will affect performance Reasonable number of topics – will affect performance Performance decrease with larger Zookeeper ensembles Disk flush rate settings message.max.bytes – max accept size, should be smaller than the heap socket.request.max.bytes – max fetch size, should be smaller than the heap log.retention.bytes – don’t want to run out of disk space… Keep Zookeeper logs under control for same reason as above Kafka brokers have been tested on Linux and Solaris Operating Kafka • Zookeeper usage – Producer loadbalancing – Broker ISR – Consumer tracking • Monitoring – JMX – Audit trail/console in the making Distribution Tools: • • • • • • • Controlled shutdown tool Preferred replica leader election tool List topic tool Create topic tool Add partition tool Reassign partitions tool MirrorMaker Multi-datacenter replication 23 Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts – Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo 24 Ecosystem Producers: • Java (in standard dist) • Scala (in standard dist) • Log4j (in standard dist) • Logback: logback-kafka • Udp-kafka-bridge • Python: kafka-python • Python: pykafka • Python: samsa • Python: pykafkap • Python: brod • Go: Sarama • Go: kafka.go • C: librdkafka • C/C++: libkafka • Clojure: clj-kafka • Clojure: kafka-clj • Ruby: Poseidon • Ruby: kafka-rb • Ruby: em-kafka • PHP: kafka-php(1) • PHP: kafka-php(2) • PHP: log4php • Node.js: Prozess • Node.js: node-kafka • Node.js: franz-kafka • Erlang: erlkafka Consumers: • Java (in standard dist) • Scala (in standard dist) • Python: kafka-python • Python: samsa • Python: brod • Go: Sarama • Go: nuance • Go: kafka.go • C/C++: libkafka • Clojure: clj-kafka • Clojure: kafka-clj • Ruby: Poseidon • Ruby: kafka-rb • Ruby: Kafkaesque • Jruby::Kafka • PHP: kafka-php(1) • PHP: kafka-php(2) • Node.js: Prozess • Node.js: node-kafka • Node.js: franz-kafka • Erlang: erlkafka • Erlang: kafka-erlang Common integration points: Stream Processing Storm - A stream-processing framework. Samza - A YARN-based stream processing framework. Hadoop Integration Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great. Kafka Hadoop Loader A different take on Hadoop loading functionality from what is included in the main distribution. AWS Integration Automated AWS deployment Kafka->S3 Mirroring Logging klogd - A python syslog publisher klogd2 - A java syslog publisher Tail2Kafka - A simple log tailing utility Fluentd plugin - Integration with Fluentd Flume Kafka Plugin - Integration with Flume Remote log viewer LogStash integration - Integration with LogStash and Fluentd Official logstash integration Metrics Mozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging system Ganglia Integration Packing and Deployment RPM packaging Debian packaginghttps://github.com/tomdz/kafka-deb-packaging Puppet integration Dropwizard packaging Misc. Kafka Mirror - An alternative to the built-in mirroring tool Ruby Demo App Apache Camel Integration Infobright integration What’s in the future? • • • • • • • • • • Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559) Producer side persistence (KAFKA-156/KAFKA-789) Exact mirroring (KAFKA-658) Quotas (KAFKA-656) YARN integration (KAFKA-949) RESTful proxy (KAFKA-639) New build system? (KAFKA-855) More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260) Client API rewrite (Proposal) Application level security (Proposal) Agenda • Kafka overview – Main concepts and comparisons to other messaging systems • Features, strengths and tradeoffs • Message format and broker concepts – Partitioning, Keyed messages, Replication • Producer / Consumer APIs • Operation considerations • Kafka ecosystem If time permits: • Kafka as a real-time processing backbone • Brief intro to Storm • Kafka-Storm wordcount demo 27 Stream processing Kafka as a processing pipeline backbone Producer Producer Producer Process1 Kafka topic1 Process1 Process2 Kafka topic2 Process1 Process2 Process2 System1 System2 What is Storm? • Distributed real-time computation system with design goals: – – – – – • • • • 29 Guaranteed processing No orphaned tasks Horizontally scalable Fault tolerant Fast Use cases: Stream processing, DRPC, Continuous computation 4 basic concepts: streams, spouts, bolts, topologies In Apache incubator Implemented in Clojure Streams an [infinite] sequence (of tuples) (timestamp,sessionid,exception stacktrace) (t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1) Spouts a source of streams (t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1) Connects to queues, logs, API calls, event data. Some features like transactional topologies (which gives exactly-once messaging semantics) is only possible using the Kafka-TransactionalSpout-consumer 30 Bolts (t5,s4) 31 (t4,s2,e2) • • Filters Transformations • • • • • Apply functions Aggregations Access DB, APIs etc. Emitting new streams Trident = a high level abstraction on top of Storm Topologies (t5,s4) 32 (t4,s2,e2) Storm cluster Deploy Compare with Hadoop: Topology (JobTracker) Nimbus Zookeeper Supervisor Supervisor Supervisor Mesos/YARN 33 Supervisor Supervisor (TaskTrackers) Links Apache Kafka: Papers and presentations Main project page Small Mediawiki case study Storm: Introductory article Realtime discussing blog post Kafka+Storm for realtime BigData Trifecta blog post: Kafka+Storm+Cassandra IBM developer article Kafka+Storm@Twitter BigData Quadfecta blog post 34