
Kafka

Setup

We will be using a virtual machine in the faculty's cloud.

When creating a virtual machine in the Launch Instance window:

  • Name your VM using the following convention: cc_lab<no>_<username>, where <no> is the lab number and <username> is your institutional account.
  • Select Boot from image in the Instance Boot Source section
  • Select CC 2024-2025 in the Image Name section
  • Select the m1.xlarge flavor.

In the base virtual machine:

  • Download the laboratory archive from here. Use: wget https://repository.grid.pub.ro/cs/cc/laboratoare/lab-kafka.zip to download the archive.
  • Extract the archive.
  • Run the setup script bash lab-kafka.sh.
$ # download the archive
$ wget https://repository.grid.pub.ro/cs/cc/laboratoare/lab-kafka.zip
$ unzip lab-kafka.zip
$ # run setup script; it may take a while
$ bash lab-kafka.sh

Before we start

What is Kafka?

Apache Kafka is an open-source distributed event streaming platform used to build real-time data pipelines and streaming applications. Originally developed at LinkedIn and later open-sourced through the Apache Software Foundation, Kafka is designed for high-throughput, low-latency handling of real-time data feeds.

The key features that Kafka provides are:

  • High throughput: Kafka can handle millions of messages per second
  • Scalability: Kafka clusters can scale horizontally by adding more brokers and partitions
  • Durability: Data is persisted on disk and replicated across multiple brokers for fault tolerance

We will explain each keyword in the following chapters.

Why do we need Kafka?

Kafka is used extensively across industries and domains that require real-time data streaming, reliable data pipelines and scalable system architectures.

Here is a list of companies that use Kafka in their tech stack:

  1. Netflix
    • monitors streaming quality and buffer times
    • delivers recommendations based on recent watch history
    • triggers adaptive bitrate switching based on network conditions
  2. Riot Games (League of Legends)
    • tracks millions of in-game events per second
    • analyzes performance, gameplay balance and user engagement
    • pushes live updates without interrupting gameplay
  3. Uber
    • tracks driver location for live updates
    • updates trip status (driver on the way, trip completed)
    • computes ETAs, route updates and dynamic pricing (surge)

Kafka is powerful, but what are the alternatives?

Some of the most commonly used Kafka alternatives are RabbitMQ, Redis and ActiveMQ.

RabbitMQ is a traditional message broker. Its key features are:

  • Supports multiple messaging protocols: including AMQP, MQTT, STOMP and Pub/Sub
  • User-friendly setup and management: the system is designed to ensure an easier setup and management
  • High Availability (HA) Cluster Support: includes native support for HA clusters, ensuring redundancy and fault tolerance to keep the messaging system operational
  • Built-in Monitoring Web UI: a web-based user interface is available by default for real-time monitoring
  • Message prioritization: critical messages are processed first, ensuring that more important tasks are handled promptly even in high-traffic conditions

Why Kafka might be better:

  • Lower throughput compared to Kafka: performance may not match Kafka's capabilities, especially in high-throughput scenarios
  • Scalability Challenges: may encounter operational difficulties and performance bottlenecks when deployed at large scale or under heavy load
  • RabbitMQ is designed with simplicity in mind, making it ideal for lightweight messaging scenarios rather than complex, large-scale data streaming architectures

Redis is an open-source, in-memory data structure store that is often used as a cache, message broker or key-value store. It is designed to provide fast access to data by keeping it in memory rather than on disk. Its key features are:

  • Exceptional throughput: capable of handling a large number of operations per second with ease
  • Ultra-low Latency: delivers minimal delay for fast data retrieval and processing
  • Memory efficient: optimized to make the most of available memory for better resource usage
  • High Availability (HA) Cluster support: offers support for HA through clustering, ensuring system uptime and redundancy

Why Kafka might be better:

  • Not designed for persistence: Primarily intended for in-memory use and not optimized for long-term data storage
  • Higher throughput than RabbitMQ, but slower than Kafka with persistence: while Redis offers better throughput than RabbitMQ, enabling persistence can slow it down compared to Kafka

ActiveMQ is an open-source message broker developed by Apache that facilitates communication between distributed systems by enabling asynchronous message passing. Its key features are:

  • Supports multiple messaging protocols: including AMQP, MQTT, STOMP and Pub/Sub
  • Message persistence: ensures reliable message storage for durability and recovery
  • High Availability (HA) cluster support: provides robust clustering features for redundancy and fault tolerance
  • Highly customizable: offers extensive configuration options to tailor the system to specific needs
  • Built-in Monitoring Web UI: a web-based user interface is available by default for real-time monitoring

Why Kafka might be better:

  • Lowest performance: offers the lowest performance among the alternatives discussed here
  • Complex setup and configuration: requires significant effort for setup and fine-tuning
  • High resource consumption: demands considerable system resources for optimal operation
  • Challenges with horizontal scalability: struggles to scale efficiently across multiple nodes
|             | Kafka                  | RabbitMQ         | Redis                  | ActiveMQ                                |
| ----------- | ---------------------- | ---------------- | ---------------------- | --------------------------------------- |
| Performance | Very High              | Moderate         | High                   | Moderate to High                        |
| Latency     | Low                    | Low to Moderate  | Very Low               | Moderate                                |
| Persistence | Strong (Log-based)     | Optional         | Limited (Volatile)     | Optional (Persistent or Non-Persistent) |
| Scalability | Excellent (Horizontal) | Moderate to High | Moderate (Single-node) | Moderate                                |

What tools will we use?

In this lab, we'll use two Docker containers: the Kafka broker and a third-party UI visualization tool called kafka-ui. For the coding (tutorial and exercises), we'll use Python3 with the confluent-kafka package.

note

kafka-ui is not an official tool, but a visualization tool designed and developed by the community, useful in the development process.

info

kafka-ui is a web application, already configured in your VM to run on localhost:8080.

There are two options for connecting to the kafka-ui user interface: SSH tunneling or Chrome Remote Desktop.

info

Option 1: SSH tunneling

Follow this tutorial to configure the SSH service to bind and forward port 8080 to your machine:

ssh -J fep -L 8080:127.0.0.1:8080 -i ~/.ssh/id_fep  student@10.9.X.Y
info

Option 2: Chrome Remote Desktop

An alternative to SSH tunneling or X11 forwarding is Chrome Remote Desktop, which allows you to connect to the graphical interface of your VM.

If you want to use this method, follow the steps from here.

Set the connection

Go to localhost:8080 and configure as follows:

  • Cluster name: CC_lab
  • Host: kafka
  • Port: 9092

At the bottom of the page, click Validate. If everything goes well, we will see a toast notification: Configuration is valid. We can click on Submit. After a couple of seconds, we will be able to see more options in the left panel:

  • The Brokers page shows the list of kafka nodes available in the cluster.
  • The Topics page presents all the topics.
  • The Consumers page lists the consumer groups, with all the consumers in each group.

Useful scripts

Kafka provides a set of useful scripts for cluster management or just producing and consuming events.

For this part of the lab, we will need two shell instances. In both of them, run the following commands to enter the container and go to the scripts directory:

$ docker exec -it kafka bash
$ cd /opt/kafka/bin

Kafka topics

Kafka topics represent logical channels or categories to which records (messages) are published and from which records are consumed. We can view a topic as a directory in our filesystem: each topic has a name, just as each directory does, and consuming (reading) from a topic by its name is like listing the files in a directory.

Each topic is composed of one or more partitions, which are units of parallelism and storage within a topic. Imagine a Kafka topic as a highway where data (messages) flows like cars. The highway has three lanes, and each lane represents a partition of the topic. Cars in a single lane (partition) follow a strict order, but multiple lanes allow more cars to travel at once.

(Schema: a topic split into multiple partitions)

Create a topic

Now that we have all the technical details, let's create our first topic.

$ ./kafka-topics.sh --bootstrap-server localhost:19092 --create --topic post.office

This script will connect to our Kafka node on port 19092 (used for external clients) and create a topic called post.office. Let's go back to kafka-ui on the Topics page and we will see a new topic post.office with 3 Partitions and 1 Replication Factor.
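
The same operation can also be done from Python. Below is a minimal sketch using confluent-kafka's admin client, assuming the localhost:19092 listener is reachable from where the script runs and that the lab defaults (3 partitions, replication factor 1) apply:

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:19092"})

# create_topics() is asynchronous and returns a dict of topic name -> Future
futures = admin.create_topics(
    [NewTopic("post.office", num_partitions=3, replication_factor=1)]
)
for topic, future in futures.items():
    future.result()  # raises an exception if the creation failed
    print(f"Topic {topic} created")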

Delete a topic

We can delete topics using the same script.

warning

If you run the script below, you will have to recreate the post.office topic.

$ ./kafka-topics.sh --bootstrap-server localhost:19092 --delete --topic post.office
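
With the Python admin client from the previous sketch, the equivalent call would be delete_topics:

# delete_topics() is also asynchronous and returns a dict of Futures
futures = admin.delete_topics(["post.office"])
futures["post.office"].result()  # raises an exception if the deletion failed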

Produce events on a topic

The following command will produce an event on the post.office topic.

$ echo '{"event": "new_envelope", "to": "Alan Turing", "message": "You are my role model"}' | ./kafka-console-producer.sh --bootstrap-server localhost:19092 --topic post.office

Let's take a look at the Topics page. We will see that the Number of messages field has increased. Click on the topic name and a more detailed page will pop up. We can see our message on the Messages tab, but the Consumers tab is empty.
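
The same event can be produced from Python as well. A minimal sketch with confluent-kafka's Producer, again assuming the localhost:19092 listener:

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:19092"})

event = {"event": "new_envelope", "to": "Alan Turing",
         "message": "You are my role model"}

# produce() only enqueues the event locally; flush() blocks until delivery
producer.produce("post.office", value=json.dumps(event))
producer.flush()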

Consume events from a topic

The following instruction will consume an event from the post.office topic.

$ ./kafka-console-consumer.sh --bootstrap-server localhost:19092 --topic post.office
note

As we can see, the script will lock our terminal waiting for events, but the first event is not received: by default, the console consumer only reads events produced after it starts. Let's produce the same event again and see the result.

note

If we go to the Messages tab, there is a high chance we will see that the second message was produced on a different partition than the first one. That is a result of Kafka's internal routing system: Kafka will try to spread messages across partitions to increase parallelism when we have multiple consumers.
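
For reference, here is a minimal consumer sketch in Python. The group.id value is a hypothetical name, and auto.offset.reset is set to earliest so that, unlike the console consumer above, it also reads events produced before it started:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:19092",
    "group.id": "post.office.readers",  # hypothetical consumer group name
    "auto.offset.reset": "earliest",    # read events produced before we started
})
consumer.subscribe(["post.office"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # returns None if nothing arrived
        if msg is None or msg.error():
            continue
        print(msg.value().decode())
finally:
    consumer.close()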

Updating the number of partitions

As we saw in the section above, more partitions mean higher parallelism. We want to increase the number of partitions from 3 to 5 using the following script:

$ ./kafka-topics.sh --alter --topic post.office --partitions 5 --bootstrap-server localhost:19092

A real-world example would be an online shop. We use Kafka to produce events to another service that sends emails to customers. For the entire year, three partitions work just fine, but then Black Friday comes. All the customers start searching for products and purchasing all kinds of stuff. Three event consumers might not be enough, and we don't want to miss or delay sending any purchase email.
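
The partition count can also be changed from Python with the admin client; a sketch under the same assumptions as before:

from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "localhost:19092"})

# NewPartitions takes the new total count (5), not the number to add
futures = admin.create_partitions([NewPartitions("post.office", 5)])
futures["post.office"].result()  # raises an exception if the broker refused

Note that Kafka only allows increasing the partition count; shrinking a topic requires deleting and recreating it.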

Exercises

From this point on, we will use the Python script provided in the ZIP archive.

note

To continue the lab, we need to create a virtual environment and install the confluent-kafka package. Run the following commands:

$ python3.10 -m venv venv
$ source venv/bin/activate
$ pip3 install confluent-kafka
warning

Make sure your prompt contains the (venv) marker, as in the following example:

(venv) student@cc-lab:~$

The confluent-kafka package is available only in the virtual environment, not on the machine.

For more in-depth details about confluent-kafka, check the official documentation.
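
A quick sanity check that the package is importable, run inside the virtual environment:

# if this prints a version tuple, confluent-kafka is installed correctly
import confluent_kafka
print(confluent_kafka.version())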

Task 1

Here we have a Python script that creates multiple threads, one for each consumer/producer. We will always have one producer, which creates one event per second without printing any output, until the SIGINT (CTRL + C) signal is caught. We have a variable number of consumer threads, each of which prints every time it consumes something. Example of output:

$ python3 kafka.py
Consumer 0: {"to": "Steve Jobs", "from": "KFYbPZO@gmail.com", "message": "We will have affordable prices, right?"} (key: None)
SIGINT received. Stopping thread...

Follow the TODO1 comments and let some events be produced. What is the result in each consumer?

Details

Read me after

Each consumer will get all the events. Sometimes this is what we want, but sometimes this behaviour can lead to duplicated actions. An example is an online shop that sends events each time a user purchases something. One email service would want to subscribe to these events to send details to customers. Another service, which generates invoices for businesses, would also be a consumer. Both require the same events, not just a subset of them.

What about a high-traffic day that requires two invoice services to generate the documentation in time? It would be a disaster to generate and send two invoices for one purchase, right?

Task 2

Follow the TODO2 comments and let some events be produced. What is the result in each consumer?

Details

Read me after

As we can see, grouping multiple consumers under the same ID means that we will not consume the same event twice.

Kafka has an internal routing system based on partitions and the number of consumers in a consumer group. In this case, we can have at most 3 active consumers, because we have 3 partitions. The rest of the consumers will be on hold and will run only if an active consumer stops for any reason.
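
In terms of code, the only thing that places two consumers in the same group is a shared group.id. A sketch (the group name is hypothetical):

from confluent_kafka import Consumer

common_conf = {
    "bootstrap.servers": "localhost:19092",
    "group.id": "invoice-service",  # sharing this id is what forms the group
    "auto.offset.reset": "earliest",
}

# Kafka assigns each partition to exactly one consumer in the group, so
# these two consumers split the topic's partitions between them; each
# would then poll() in its own thread, as in the lab script
consumer_a = Consumer(common_conf)
consumer_b = Consumer(common_conf)
consumer_a.subscribe(["post.office"])
consumer_b.subscribe(["post.office"])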

Task 3

Follow the TODO3 comments and let some events be produced. What is the result in each consumer?

Details

Read me after

Up until this moment, we sent events that had a value but no key. When we send an event with a key, Kafka hashes the key and assigns it to a partition. From that moment on, all the events with that key will be routed to the same partition.

You can also check the kafka-ui dashboard.

note

We are creating a small amount of events compared to what Kafka can handle. There is a chance that some consumers will not get events.

Kafka does not guarantee that events with different keys will be sent on different partitions.

Kafka does guarantee, however, that events with the same key will always land on the same partition.
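
To see this from Python, pass a key when producing. A sketch (the key value is arbitrary):

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:19092"})

# all three events share the same key, so they hash to the same partition
# and will be consumed in the order they were produced
for seq in range(3):
    event = {"event": "new_envelope", "seq": seq}
    producer.produce("post.office", key="alan.turing", value=json.dumps(event))
producer.flush()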