Hello Kafka

Lately I’ve been looking into distributed messaging systems. As the new kid on the block, I decided to start with Apache Kafka, and I was surprised by how easy it is to get started with the basics.

With Kafka,

  • Producers push and consumers pull messages.
  • Messages are persisted to the filesystem and organized into topics.
  • Topics can have multiple partitions (on different machines, if needed).
  • Kafka runs in clusters.
  • Nodes/servers are called brokers.
(Image: anatomy of a partitioned log)
Source: http://kafka.apache.org/documentation.html

Each message within a partition has an offset, which allows you to start consuming a topic from a specific position. Every new message is appended to the end of the partition’s log, and a specific message is identified by its topic, partition and offset. Kafka allows multiple consumers to consume the same partition of the same topic, as messages are not removed once they are consumed. E.g. ConsumerA is consuming from offset 3 and ConsumerB also started consuming from offset 3 but is currently reading at offset 10; when ConsumerA reaches offset 10 it will have consumed the exact same messages as ConsumerB.

Note, though, that the persistence time is configurable: Kafka persists messages for a predefined amount of time, so in the previous example, if ConsumerA takes too long it may not get to consume the exact same messages. Kafka doesn’t keep track of consumers, i.e. it is each consumer’s responsibility to keep track of what it has consumed and where it currently is in the topic (partition & offset).

Each broker holds one or more partitions, and each partition has a leader. Consumers and producers only talk to the partition leader; the leader replicates the data to its followers.

Producers load balance across partitions, i.e. a producer picks a partition at random and talks to its leader for a certain amount of time; when that time expires, the producer selects another leader to talk to. It is also possible to customise the load balancing to suit your needs.

(Image: consumer groups)
Source: http://kafka.apache.org/documentation.html

Consumers can be organized into consumer groups. Each consumer takes a part of the data, and the group as a whole does the work. This is handy for scaling: if a consumer is struggling to keep up with the workload, you can fire up a new consumer to help out.

Ok, enough talk. Let’s get our hands dirty.

First you need to download Kafka and extract it to your server/computer. Kafka ships with scripts for both Windows and Linux systems. For both operating systems the procedure is pretty much the same; on Windows, use the “yourKafkaFolderLocation\bin\windows” folder.

First things first, let’s have a look at the configuration file. Go to the config folder and open up server.properties. The content is pretty much self-explanatory. Each instance of Kafka requires its own .properties file, so if you intend to run more than one instance of Kafka on your machine/server, you need a copy of the server.properties file per instance, and each .properties file has to have a unique broker.id and port number.
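As an illustration, a second broker’s copy of server.properties might differ only in the following lines (the values are just examples; the log directory should also be unique so the two brokers don’t write to the same logs):

```properties
# server-1.properties — example overrides for a second broker instance
broker.id=1
port=9093
log.dirs=/tmp/kafka-logs-1
```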

Now, let’s create a topic by using the following command for Linux:
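Using the kafka-topics.sh script that ships with recent Kafka releases, and assuming Zookeeper is running on localhost:2181 and a topic named “test” (both are assumptions for a local setup), the command looks something like:

```shell
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic test
```

Note that Zookeeper has to be up for this to succeed; starting it is covered just below.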

And for Windows:
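The same thing with the Windows batch script (same assumptions: Zookeeper on localhost:2181, a topic named “test”):

```shell
bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
```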

To run Kafka you first need to have an instance of Zookeeper running. To do so, from the Kafka folder run the following command for Linux:
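Kafka ships with a convenience script and a default Zookeeper configuration, so something like this should do (paths relative to the Kafka folder):

```shell
bin/zookeeper-server-start.sh config/zookeeper.properties
```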

and for Windows:
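The Windows equivalent of the Zookeeper start script:

```shell
bin\windows\zookeeper-server-start.bat config\zookeeper.properties
```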

Once Zookeeper is up and running you can fire up Kafka. To do so, on Linux run the following command:
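The broker start script takes the .properties file discussed earlier as its argument (here the default one; pass your own copy when running multiple instances):

```shell
bin/kafka-server-start.sh config/server.properties
```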

and for Windows:
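The Windows equivalent of the broker start script:

```shell
bin\windows\kafka-server-start.bat config\server.properties
```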

Now that you can start an instance of Kafka let’s code a producer and a consumer.

For the producer, we need to set some configurations. We do so with the following method.
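A sketch of such a method, using the 0.8.x Java producer API; the broker address and the specific property values are assumptions for a local, single-broker setup:

```java
import java.util.Properties;

import kafka.producer.ProducerConfig;

// Builds the producer configuration (lives inside the producer class).
private static ProducerConfig createProducerConfig() {
    Properties props = new Properties();
    props.put("metadata.broker.list", "localhost:9092");             // broker url
    props.put("serializer.class", "kafka.serializer.StringEncoder"); // message serializer
    props.put("request.required.acks", "1"); // wait for the leader's acknowledgement
    props.put("producer.type", "sync");      // synchronous ("async" for asynchronous)
    return new ProducerConfig(props);
}
```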

Here we set the broker URL and the message serializer, we tell the producer to wait for the leader’s acknowledgement after the leader replica has received the data, and we specify the producer type (whether it is a synchronous or an asynchronous producer).
More information about these and other possible configurations can be found in the Kafka documentation (http://kafka.apache.org/documentation.html).

To produce a message to a Kafka topic we call the following method.
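With the same producer API, the method boils down to a single send call; the method and parameter names here are illustrative:

```java
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;

// The producer would be created once, e.g.:
// Producer<String, String> producer = new Producer<String, String>(createProducerConfig());
public static void produce(Producer<String, String> producer, String topic, String message) {
    // A KeyedMessage without a key: the default partitioner picks the partition.
    producer.send(new KeyedMessage<String, String>(topic, message));
}
```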

And that’s pretty much it.
On the consumer side we also need to set up some configurations. To do so, we use the following method.
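A sketch using the 0.8.x high-level consumer API; the Zookeeper address, group id and timeout values are assumptions for a local setup:

```java
import java.util.Properties;

import kafka.consumer.ConsumerConfig;

// Builds the consumer configuration (lives inside the consumer class).
private static ConsumerConfig createConsumerConfig() {
    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181"); // Zookeeper address
    props.put("group.id", "group1");                  // consumer group id
    props.put("zookeeper.session.timeout.ms", "400");
    props.put("zookeeper.sync.time.ms", "200");
    props.put("auto.commit.interval.ms", "1000");     // how often consumed offsets are committed
    return new ConsumerConfig(props);
}
```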

Then, we start the consumer by calling the start() method.
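A plausible sketch of a start() method with the high-level consumer: it creates the connector and grabs one stream’s iterator. The class and field names are assumptions, and the configuration method from the previous step is assumed to be called createConsumerConfig():

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class SimpleConsumer {

    private ConsumerConnector connector;
    private ConsumerIterator<byte[], byte[]> iterator;

    public void start(String topic) {
        connector = Consumer.createJavaConsumerConnector(createConsumerConfig());
        // Ask Kafka for a single stream for this topic.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap(topic, 1));
        iterator = streams.get(topic).get(0).iterator();
    }
}
```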

Whenever we want to fetch messages we call the following method: using the stream iterator, we retrieve the next message and return it.
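A sketch of such a fetch method, assuming the consumer holds on to the stream iterator (the names are illustrative); note that next() blocks until a message is available:

```java
import kafka.consumer.ConsumerIterator;
import kafka.message.MessageAndMetadata;

// iterator is the stream iterator obtained when the consumer was started.
public static String fetch(ConsumerIterator<byte[], byte[]> iterator) {
    MessageAndMetadata<byte[], byte[]> next = iterator.next(); // blocks until a message arrives
    return new String(next.message());
}
```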

Here is the output for the producer.

(Image: producer console output)

And here is the consumer output.

(Image: consumer console output)

The full code can be found on my GitHub repository or by following this link.

For the sake of brevity and simplicity I followed the KISS principle. A lot more can be done with Kafka. I’ll try to come up with slightly more advanced use cases for Kafka in the future.