What is Kafka?

Kafka

Kafka is a low-level queue system. You read and write data onto queues. The first data written in is the first data read out, i.e., first-in-first-out (FIFO). It's like a car tunnel: the first car to go in is the first car to pop out the other side (hopefully!).

The cool (or uncool) part is that there isn't any magic added to those queues. Kafka wants to make that simple process fast, robust and scalable. Its benefits are high throughput, robust and redundant persistence, scalability, and the goodies that make for a large system. Its downsides (compared to other queue systems, like RabbitMQ) are a lack of features for handling the data inside the queues: no routing, no filtering for specific consumers, etc. Much of that is left to you, or to extensions to the Kafka system.

When does this make sense?

Because it's low-level, there are quite a few use-cases for it. One interesting use-case is event-sourcing. Instead of relying on databases as the source of truth, you rely on events.

Other use-cases include dealing with high volumes of streamed data that needs to be summarized or filtered, like handling logging messages.

How Kafka works

Producers write Logs to Topics via Brokers, and Consumers read from those Topics, also via Brokers. It doesn't matter much what is being written, nor what happens when you read it. Each write to a Topic is put into a box called a Log, which we can think of as a message or event. Logs have a key and a value, though we'll focus on the value, not the key.
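
To make that concrete, here's a rough illustration of a single Log in JavaScript. The key name and payload are made up for this sketch; in reality both the key and the value travel as bytes.

// Illustrative only: a Log is just a key and a value
const log = {
  key: 'account-42',         // hypothetical key, e.g. which account the Log belongs to
  value: JSON.stringify(10), // the payload, here a dollar amount
}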

Kafka will persist the Topic's data, quite robustly, so that if any services die, the data should still be waiting there, just where things left off. And reading a Log from the Topic will not remove it; instead Kafka keeps track of which Log was last read by which Consumers.

This is a little different from other queues, which try to route the data directly from the producer to the consumer. Kafka instead keeps the Topics immutable: data written into them doesn't change, regardless of how many consumers want to read it, or whether there are any consumers at all.

There isn't infinite space, so a Topic will rotate when it runs out of room: the oldest Logs are dropped to make space for newer Logs. There are ways to scale this, using Partitions.

An example

We can think of event-sourcing as saying your bank balance is made up of all the transactions that have happened since you opened the account. This is different from saying the bank balance is whatever the database says it is.

Using a bank account model, we can have a Topic called bank.transactions and write positive or negative dollar amounts into it. Then a Consumer called keep-balance can read those one-by-one and keep track of the running balance.

If we reset/kill/corrupt our consumer, it can start reading from the beginning again and should arrive back at the same answer. This is cool, since we can migrate databases on the side while keeping the old one running! It does assume you have kept every Log around and haven't dropped any.

If we write too many Logs to the Topic, then it will drop the older Logs.
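
To make the replay idea concrete, here's a minimal sketch in plain JavaScript (the transaction amounts are made up). Folding every Log from the beginning always lands on the same balance, which is exactly what keep-balance does.

// Hypothetical transaction amounts, in the order they were written to the Topic
const transactions = [10, 20, -5]

// Replay from the beginning: fold every transaction into a running balance
const balance = transactions.reduce((sum, amount) => sum + amount, 0)

console.log('Balance: $' + balance) // Balance: $25, every single time we replay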

[Interactive demo: the bank.transactions Topic holding Logs of $10 and $20, with the keep-balance Consumer reading them and tracking the Balance (starting at $0).]

Partitions and Consumer Groups

Conceptually, Producers write Logs to Topics and Consumers read Logs from Topics. But there is a lower-level abstraction here: Producers and Consumers actually write/read to Partitions of a Topic.

Which Partitions? Producers write to any Partition they like; they are responsible for deciding which is best. Consumers will read from the Partition that their Consumer Group tells them to read from.

What is a Partition?

A Partition is a slice of a Topic. Like a shard in a database. You can divide a Topic into one or more Partitions, and each Partition will store some of the Logs for that Topic. If a Topic has three Partitions and ten Logs, you may find three Logs in Partition-0, three Logs in Partition-1 and the remaining four Logs in Partition-2. Partitions are "real" in that they live on a server somewhere and have an identity.

What is a Consumer Group?

A Consumer Group is a group of Consumers that are all trying to do the same thing. Each Consumer in the group will process some of the Logs in the Topic. This is for scaling the processing of those Logs. You can have as many Consumer Groups as you like and each will probably do something different.

Consumer Groups are defined in Kafka; they don't exist as their own service. The Consumers are their own services, and they're the ones running application code that does some useful processing of the Logs. The group will keep track of which Consumers have read which Logs and will try to balance Partitions across Consumers, so that each Consumer knows which Partition(s) to read from and two Consumers within the group don't both read the same Log.
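
Kafka's built-in assignors (range, round-robin, cooperative sticky) handle this balancing for you, but a simplified sketch of the idea looks like this. The Partition and Consumer names are just for illustration.

// Simplified illustration only: spread the Partitions across the Consumers in the
// group, so that every Partition is read by exactly one Consumer
const partitions = ['Partition-0', 'Partition-1', 'Partition-2']
const consumers = ['Consumer-A', 'Consumer-B']

const assignment = Object.fromEntries(consumers.map((consumer) => [consumer, []]))
partitions.forEach((partition, index) => {
  assignment[consumers[index % consumers.length]].push(partition)
})

console.log(assignment)
// { 'Consumer-A': [ 'Partition-0', 'Partition-2' ], 'Consumer-B': [ 'Partition-1' ] }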

[Interactive demo: the bank.transactions Topic split into Partition-0, Partition-1 and Partition-2; the keep-balance Consumer Group (Lag: 0, all Offsets at 0) assigns Partition-0 and Partition-1 to Consumer-A and Partition-2 to Consumer-B, while Producer-A writes $10 to Partition-0 and the Balance starts at $0.]

Scaling things

We scale things in Kafka by ensuring that each Topic has enough Partitions from the start. Then, when we need to handle more load, we add more Consumers to Consumer Groups (which can be done on the fly).

Another important feature is robustness: we want to ensure that our data is safe. To do that we set a Replication Factor, which is the number of copies we want our Topic's data to have. A factor of two means every Partition in the Topic is stored on two Brokers. Your replication needs may vary depending on what the data is.

An important consideration is the number of Partitions you set your Topic to have. It ends up being the maximum number of Consumers (within a Consumer Group) that can read from the Topic. If you have ten Partitions for your Topic, then each Consumer Group can only use ten Consumers. Adding more Consumers to those groups will have no effect. This is because of a constraint in Kafka: a Partition can only be read by one Consumer within a Consumer Group.

What is a good number of Partitions? If you don't know, go with 5-100 Partitions for a Topic.
One partition is fine if you'll never have more than one Consumer (per Consumer Group) reading from that Topic.
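
Partition count and Replication Factor are set when the Topic is created. Here's a rough sketch using the KafkaJS admin client; the topic name, counts and broker address are only for illustration, and the retention setting is optional.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['kafka1.example.com:9092'], // hypothetical bootstrap Broker
})

const admin = kafka.admin()

async function createTopic() {
  await admin.connect()
  await admin.createTopics({
    topics: [
      {
        topic: 'bank.transactions',
        numPartitions: 10,    // caps how many Consumers a Consumer Group can use
        replicationFactor: 2, // each Partition is stored on two Brokers
        // Optional: control when old Logs are dropped (roughly 7 days here)
        configEntries: [{ name: 'retention.ms', value: '604800000' }],
      },
    ],
  })
  await admin.disconnect()
}

createTopic().catch(console.error)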

The mechanics of Kafka

You work with Kafka by using their SDKs. The over-the-wire protocol that Kafka exposes isn't designed to be used directly.

As a Producer service your job will be to write data into Partitions of a Topic. The SDK will likely provide some out-of-the-box strategies... like "just write to any random Partition".

As a Consumer service you'll join a Consumer Group and subscribe to a Topic, then process Logs any way you see fit. You can lean on your SDK to acknowledge that you've processed a Log, or do it manually.

The protocol Kafka uses to send data over the wire is binary and includes ideas from Protobuf. It's not advised to interact with Kafka directly over this protocol; instead there are SDKs and other API frontends (like REST): https://kafka.apache.org/protocol. For the reasons why Kafka went with its own protocol, there's a nice segment on the same page: https://kafka.apache.org/protocol#protocol_philosophy.

Writing events to a Topic

Let's get started by writing an event onto a Topic. The mechanics look like this...

  • Connect (via TCP) to any Kafka Broker
  • Fetch metadata of the cluster. This will include where all the Brokers are hosted
  • Fetch metadata of the Topic. This will include where all the Partitions are, i.e., which Brokers have them.
  • Write a Log to a Partition of your choice.
  • Done

That was pretty straightforward. The only magic is for high availability, and figuring out where to do the actual writing. Fetching the cluster information gives you the list of Topics, but more importantly the list of Partitions.

Kafka doesn't tell you which Partitions to write to, but it doesn't need to be complicated. There are a couple of strategies. One is round-robin: if you wrote to Partition-0, write to Partition-1 next. Another is random, which is fine too. And lastly, hashing the key. If you want to group Logs based on something for performance (like the branch location of the bank), you can hash that info so that Logs for the same bank branch are written to the same Partition. Of course this requires some thought, as you could end up writing every Log to the same Partition; it depends entirely on how/what you hash.
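
As a rough sketch of the key-hashing strategy with KafkaJS (the broker addresses and branch key are made up): the client's default partitioner hashes the key, so Logs for the same branch land in the same Partition. Leave the key out and the client spreads Logs across Partitions for you instead.

const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['kafka1.example.com:9092', 'kafka2.example.com:9092'],
})

const producer = kafka.producer()

async function writeTransaction(branch, dollarAmount) {
  await producer.connect()
  await producer.send({
    topic: 'bank.transactions',
    messages: [
      {
        key: branch,                         // hashed to pick the Partition
        value: JSON.stringify(dollarAmount), // what keep-balance will read
      },
    ],
  })
  await producer.disconnect()
}

writeTransaction('branch-melbourne', 10).catch(console.error)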

Reading events

Kafka Topics are meant to be read in full. You're not really supposed to read from the middle of them (you can), as that might take things out of context. You could, though, start from the latest Log and ignore all the Logs before it.

Because Kafka wants to be really scalable, it introduced the idea of a Consumer Group. The group makes sure that every Log in a Topic gets read, while spreading the work across many Consumers. It does so by mapping Consumers (individual services) to Partitions. A Partition can only be read by one Consumer, which ensures that two Consumers (within a Consumer Group) will never read the same Log.

Because of that one-consumer-per-partition constraint, part of the infrastructure concern is to ensure that there are enough Partitions in a Topic to accommodate more load, i.e., more Consumers. If you only have one Partition, then you'll only be able to support one Consumer!

The process of reading a Log goes like this.

  • Connect (via TCP) to any Kafka Broker
  • Fetch metadata of the cluster. This will include where all the Brokers are hosted
  • Connect to a Consumer Group, optionally giving your ID.
  • Subscribe to a Topic
  • Process Logs as they arrive (they can be batched)...
  • Tell the Consumer Group you've read a Log, so it can keep track. Either automatically via the SDK or manually if you want more control.

const { Kafka } = require('kafkajs')
let balance = 0

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['kafka1.example.com:9092', 'kafka2.example.com:9092'],
})

// Join the keep-balance Consumer Group and read the Topic from the very first Log
const consumer = kafka.consumer({ groupId: 'keep-balance' })
await consumer.connect()
await consumer.subscribe({ topic: 'bank.transactions', fromBeginning: true })

console.log('Connected to bank.transactions')

await consumer.run({
  // Called once per Log; offsets are committed automatically by the SDK
  eachMessage: async ({ topic, partition, message }) => {
    const dollarAmount = JSON.parse(message.value.toString())
    console.log('Transaction: $' + dollarAmount)
    balance += dollarAmount
    console.log('Balance: $' + balance)
  },
})

The example above uses the KafkaJS client library. Once it connects, it prints "Connected to bank.transactions" and then the running balance as Logs arrive.
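
The example lets KafkaJS acknowledge Logs automatically. If you want the manual control mentioned in the steps above, the library also lets you commit offsets yourself; here's a rough sketch reusing the same consumer (the +1 is because the committed offset points at the next Log you want).

await consumer.run({
  autoCommit: false, // we take over acknowledging Logs ourselves
  eachMessage: async ({ topic, partition, message }) => {
    const dollarAmount = JSON.parse(message.value.toString())
    balance += dollarAmount

    // Only acknowledge the Log once it has actually been processed
    await consumer.commitOffsets([
      { topic, partition, offset: (Number(message.offset) + 1).toString() },
    ])
  },
})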

Summary

Kafka is cool for high volume, or for acting as a special kind of database (event-sourcing). Knowing how it works should hopefully add a new tool to your design arsenal. Here are my highlights.

  • Kafka is low level, not much magic.
  • When it comes to designing the API or Event-Driven Architecture, Kafka is simple.
  • Queues are called Topics and data flows FIFO.
  • Producers are responsible for distributing Logs to Partitions, but this doesn't need to be complicated.
  • Consumer Groups allow for splitting up the processing work to one or more Consumers (application code).
  • There is some complexity around the scalability and infrastructure side, but that makes it powerful.

Jargon

For even more reference, here is a glossary.

Broker: The Kafka server; there is usually more than one for high availability. Consumers and Producers both connect to these.
Cluster: A collection of Kafka Brokers connected together.
Log: An event or message, depending which you prefer to think of it as.
Topic: Where you write/read Logs to/from.
Partition: Topics are divided into Partitions; each Partition will contain some of the Logs for the Topic.
Consumer: A service that reads Logs from a Partition via a Broker. Consumers must belong to a Consumer Group.
Consumer Group: A group of Consumers. The group will try to process all Logs on a Topic. The group will balance Partitions and Consumers together. You can have as many Consumer Groups as you like on a Topic.
Consumer Offset: Which Log a Consumer Group is currently processing. This plus the Lag gives you the total size of the Topic.
Producer: A service that writes Logs to Partitions of a Topic, via a Broker.
Lag: How many Logs the consumer has still got to process.
Zookeeper: A highly-available (HA), distributed key-value store. Separate from Kafka, it aims to bring HA to Kafka, like assigning leadership.
Bootstrap URL(s): The URL(s) pointing to Brokers. Kafka clients will start with these, then fetch more metadata to find out all the other Brokers, their Topics and which are leaders for which Partitions.
Replication Factor: The number of copies of the Partitions within a Topic. Each Topic can set its own Replication Factor.