Lorenzo Alberton

Kafka proposed as Apache incubator project

PHP, Performance, Scalability, NoSQL, Message Queues, Kafka · 24 June 2011 08:38 · 1 comment

Abstract: Kafka is a distributed publish-subscribe messaging system developed at LinkedIn, designed to support very high throughput, persistent messages, and parallel loading into Hadoop. A proposal has been submitted to make Kafka an Apache incubator project.

At the end of November 2010, LinkedIn officially published its first public Kafka release, along with some design documents. It was immediately clear that it was going to be a real contender in the messaging systems arena, and some of its features immediately caught my eye: Hadoop support, a distributed and persistent queue, and extremely high throughput (hundreds of thousands of messages per second).

At DataSift we deal with several thousand messages every second, which means we need to move data around really, really fast. For a while, we've been using Redis internally for the intermediate queues, perhaps abusing its pub-sub support. Redis has served us well so far, but it has shown several minor issues, and we've been gradually replacing it with real messaging systems, or to be more precise, with a high-performance duo: Kafka and 0mq. I'm going to talk about how we use 0mq, and the patterns that are most appropriate for each use case, in another article.

Today I'm happy to read that Kafka has been proposed as an Apache incubator project. It has come a long way since November: after the native Scala and Java clients, I contributed the first PHP client, and now there are clients in Python, Ruby, C#, Clojure and Node.js. This week, my colleague Ben wrote a C++ producer that we'll contribute soon.

Kafka is still a young project, but it's maturing fast, and we're confident enough to use it in production (as a matter of fact, we've been using it for months now), both in front of our HBase cluster and to collect monitoring events sent from all our internal services. We chose Kafka especially for its persistent storage (which is essentially a partitioned binary log), but we plan to do some analytics via its support for Hadoop soon. Its distributed nature (coordination between consumers and brokers is done via ZooKeeper) makes it very appealing too.
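To make the "partitioned binary log" idea a bit more concrete, here is a minimal in-memory sketch in Python. It is not Kafka's code or API: the `PartitionedLog` class, its method names, the CRC32-based partitioning and the use of message indexes as offsets are all assumptions made purely for illustration. The only point it tries to show is that messages are appended to one of several independent logs, and that reading from an offset never removes anything, so a consumer can re-read older messages.

```python
import zlib
from typing import List, Tuple


class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only log (hypothetical)."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions < 1:
            raise ValueError("num_partitions must be at least 1")
        # One independent append-only list of messages per partition.
        self._partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: bytes) -> int:
        # Stable hash, so the same key always lands in the same partition.
        # (Assumption of this sketch, not a statement about Kafka internals.)
        return zlib.crc32(key) % len(self._partitions)

    def append(self, key: bytes, message: bytes) -> Tuple[int, int]:
        """Append a message and return its (partition, offset)."""
        partition = self.partition_for(key)
        self._partitions[partition].append(message)
        return partition, len(self._partitions[partition]) - 1

    def read(self, partition: int, offset: int, max_messages: int = 10) -> List[bytes]:
        """Return up to max_messages starting at offset, without consuming them."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return self._partitions[partition][offset:offset + max_messages]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=4)
    partition, first = log.append(b"service-a", b"cpu=0.42")
    log.append(b"service-a", b"cpu=0.57")
    # Reading twice from the same offset yields the same messages.
    print(log.read(partition, first))
    print(log.read(partition, first))
```

In this sketch each consumer would simply remember the last offset it read per partition, which is what lets several consumers read the same data at their own pace.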

We encourage everyone to try it out and vote for it by subscribing to the general incubator mailing list. If you're already using Kafka, I'd love to hear about your experiences with it: please leave a comment below.

Update 2011-07-04: Kafka is now an Apache Incubator Project!
