
Kafka

Setup

We will be using a virtual machine in the faculty's cloud.

When creating a virtual machine in the Launch Instance window:

  • Name your VM using the following convention: cc_lab<no>_<username>, where <no> is the lab number and <username> is your institutional account.
  • Select Boot from image in the Instance Boot Source section
  • Select CC 2024-2025 in the Image Name section
  • Select the m1.xlarge flavor.

In the base virtual machine:

  • Download the laboratory archive from here. Use: wget https://repository.grid.pub.ro/cs/cc/laboratoare/lab-kafka.zip to download the archive.
  • Extract the archive.
  • Run the setup script bash lab-kafka.sh.
$ # download the archive
$ wget https://repository.grid.pub.ro/cs/cc/laboratoare/lab-kafka.zip
$ unzip lab-kafka.zip
$ # run setup script; it may take a while
$ bash lab-kafka.sh

Before we start

What is Kafka?

Apache Kafka is an open-source distributed event streaming platform used to build real-time data pipelines and streaming applications. Originally developed at LinkedIn and later open-sourced through the Apache Software Foundation, Kafka is designed for high-throughput, low-latency handling of real-time data feeds.

The key features that Kafka provides are:

  • High throughput: Kafka can handle millions of messages per second
  • Scalability: Kafka clusters can scale horizontally by adding more brokers and partitions
  • Durability: Data is persisted on disk and replicated across multiple brokers for fault tolerance

We will explain each keyword in the following chapters.

Why do we need Kafka?

Kafka is used extensively across industries and domains that require real-time data streaming, reliable data pipelines and scalable system architectures.

Here is a list of companies that use Kafka in their tech stack:

  1. Netflix
    • monitors streaming quality and buffer times
    • delivers recommendations based on recent watch history
    • triggers adaptive bitrate switching based on network conditions
  2. Riot Games (League of Legends)
    • tracks millions of in-game events per second
    • analyzes performance, gameplay balance and user engagement
    • pushes live updates without interrupting gameplay
  3. Uber
    • tracks driver location for live updates
    • updates trip status (driver on the way, trip completed)
    • computes ETAs, route updates and dynamic pricing (surge)

Kafka is powerful, but what are the alternatives?

Some of the most commonly used Kafka alternatives are RabbitMQ, Redis and ActiveMQ.

RabbitMQ is a traditional message broker. Its key features are:

  • Supports multiple messaging protocols: including AMQP, MQTT, STOMP and Pub/Sub
  • User-friendly setup and management: the system is designed to ensure an easier setup and management
  • High Availability (HA) Cluster Support: includes native support for HA clusters, ensuring redundancy and fault tolerance to keep the messaging system operational
  • Built-in Monitoring Web UI: a web-based user interface is available by default for real-time monitoring
  • Message prioritization: critical messages are processed first, ensuring that more important tasks are handled promptly even in high-traffic conditions

Why Kafka might be better:

  • Lower throughput compared to Kafka: performance may not match Kafka's capabilities, especially in high-throughput scenarios
  • Scalability Challenges: may encounter operational difficulties and performance bottlenecks when deployed at large scale or under heavy load
  • RabbitMQ is designed with simplicity in mind, making it ideal for lightweight messaging scenarios rather than complex, large-scale data streaming architectures

Redis is an open-source, in-memory data structure store that is often used as a cache, message broker or key-value store. It is designed to provide fast access to data by keeping it in memory rather than on disk. Its key features are:

  • Exceptional throughput: capable of handling a large number of operations per second with ease
  • Ultra-low Latency: delivers minimal delay for fast data retrieval and processing
  • Memory efficient: optimized to make the most of available memory for better resource usage
  • High Availability (HA) Cluster support: offers support for HA through clustering, ensuring system uptime and redundancy

Why Kafka might be better:

  • Not designed for persistence: Primarily intended for in-memory use and not optimized for long-term data storage
  • Higher throughput than RabbitMQ, but slower than Kafka with persistence: while Redis offers better throughput than RabbitMQ, enabling persistence can slow it down compared to Kafka

ActiveMQ is an open-source message broker developed by Apache that facilitates communication between distributed systems by enabling asynchronous message passing. Its key features are:

  • Supports multiple messaging protocols: including AMQP, MQTT, STOMP and Pub/Sub
  • Message persistence: ensures reliable message storage for durability and recovery
  • High Availability (HA) cluster support: provides robust clustering features for redundancy and fault tolerance
  • Highly customizable: offers extensive configuration options to tailor the system to specific needs
  • Built-in Monitoring Web UI: a web-based user interface is available by default for real-time monitoring

Why Kafka might be better:

  • Lowest performance: offers the lowest performance among the alternatives discussed here
  • Complex setup and configuration: requires significant effort for setup and fine-tuning
  • High resource consumption: demands considerable system resources for optimal operation
  • Challenges with horizontal scalability: struggles to scale efficiently across multiple nodes
|             | Kafka                  | RabbitMQ         | Redis                  | ActiveMQ                                |
| ----------- | ---------------------- | ---------------- | ---------------------- | --------------------------------------- |
| Performance | Very High              | Moderate         | High                   | Moderate to High                        |
| Latency     | Low                    | Low to Moderate  | Very Low               | Moderate                                |
| Persistence | Strong (Log-based)     | Optional         | Limited (Volatile)     | Optional (Persistent or Non-Persistent) |
| Scalability | Excellent (Horizontal) | Moderate to High | Moderate (Single-node) | Moderate                                |

What tools will we use?

In this lab, we'll use two Docker containers: the Kafka broker and a third-party UI visualization tool called kafka-ui. For the coding (tutorial and exercises), we'll use Python3 with the confluent-kafka package.

note

kafka-ui is not an official tool, but a visualization tool designed and developed by the community, useful in the development process.

info

kafka-ui is a web application, already configured in your VM to run on localhost:8080.

There are two options for connecting to the kafka-ui user interface: SSH tunneling or Chrome Remote Desktop.

info

Option 1: SSH tunneling

Follow this tutorial to configure the SSH service to bind and forward port 8080 to your machine:

ssh -J fep -L 8080:127.0.0.1:8080 -i ~/.ssh/id_fep  student@10.9.X.Y
info

Option 2: Chrome Remote Desktop

An alternative to SSH tunneling or X11 forwarding is Chrome Remote Desktop, which allows you to connect to the graphical interface of your VM.

If you want to use this method, follow the steps from here.

Set the connection

Go to localhost:8080 and configure as follows:

  • Cluster name: CC_lab
  • Host: kafka
  • Port: 9092

At the bottom of the page, click Validate. If everything goes well, we will see a toast notification: Configuration is valid. We can click on Submit. After a couple of seconds, we will be able to see more options in the left panel:

  • The Brokers page shows the list of kafka nodes available in the cluster.
  • The Topics page presents all the topics.
  • The Consumers page lists the consumer groups, with all the consumers in each group.

Useful scripts

Kafka provides a set of useful scripts for cluster management or just producing and consuming events.

For this part of the lab, we will need two shell instances. In both of them, run the following commands to enter the container and go to the scripts directory:

$ docker exec -it kafka bash
$ cd /opt/kafka/bin

Kafka topics

Kafka topics represent logical channels or categories to which records (messages) are published and from which records are consumed. We can view a topic as a directory in our filesystem: each topic has a name, just as each directory does, and consuming (reading) from a topic by its name is like listing the files in a directory.

Each topic is composed of one or more partitions, which are units of parallelism and storage within a topic. Imagine a Kafka topic as a highway where data (messages) flows like cars. The highway has three lanes, and each lane represents a partition of the topic. Cars in a single lane (partition) follow a strict order, but multiple lanes allow more cars to travel at once.

(Schema: a topic split into multiple partitions)

Create a topic

Now that we have all the technical details, let's create our first topic.

$ ./kafka-topics.sh --bootstrap-server localhost:19092 --create --topic post.office

This script will connect to our Kafka node on port 19092 (used for external clients) and create a topic called post.office. Let's go back to kafka-ui on the Topics page and we will see a new topic post.office with 3 Partitions and 1 Replication Factor.
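
The same operation can also be done from Python. Below is a minimal sketch using confluent-kafka's admin client, assuming the localhost:19092 listener is reachable from where the script runs and that the lab defaults (3 partitions, replication factor 1) apply:

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:19092"})

# create_topics() is asynchronous and returns a dict of topic name -> Future
futures = admin.create_topics(
    [NewTopic("post.office", num_partitions=3, replication_factor=1)]
)
for topic, future in futures.items():
    future.result()  # raises an exception if the creation failed
    print(f"Topic {topic} created")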

Delete a topic

We can delete topics using the same script.

warning

If you run the script below, you will have to recreate the post.office topic.

$ ./kafka-topics.sh --bootstrap-server localhost:19092 --delete --topic post.office
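
With the Python admin client from the previous sketch, the equivalent call would be delete_topics:

# delete_topics() is also asynchronous and returns a dict of Futures
futures = admin.delete_topics(["post.office"])
futures["post.office"].result()  # raises an exception if the deletion failed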

Produce events on a topic

The following command will produce an event on the post.office topic.

$ echo '{"event": "new_envelope", "to": "Alan Turing", "message": "You are my role model"}' | ./kafka-console-producer.sh --bootstrap-server localhost:19092 --topic post.office

Let's take a look at the Topics page. We will see that the Number of messages field has increased. Click on the topic name and a more detailed page will pop up. We can see our message on the Messages tab, but the Consumers tab is empty.
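
The same event can be produced from Python as well. A minimal sketch with confluent-kafka's Producer, again assuming the localhost:19092 listener:

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:19092"})

event = {"event": "new_envelope", "to": "Alan Turing",
         "message": "You are my role model"}

# produce() only enqueues the event locally; flush() blocks until delivery
producer.produce("post.office", value=json.dumps(event))
producer.flush()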

Consume events from a topic

The following instruction will consume an event from the post.office topic.

$ ./kafka-console-consumer.sh --bootstrap-server localhost:19092 --topic post.office
note

As we can see, the script will lock our terminal waiting for events, but the first event is not received: by default, the console consumer only reads events produced after it starts. Let's produce the same event again and see the result.

note

If we go to the Messages tab, there is a high chance we will see that the second message was produced on a different partition than the first one. That is a result of Kafka's internal routing system: Kafka will try to spread messages across partitions to increase parallelism when we have multiple consumers.
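
For reference, here is a minimal consumer sketch in Python. The group.id value is a hypothetical name, and auto.offset.reset is set to earliest so that, unlike the console consumer above, it also reads events produced before it started:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:19092",
    "group.id": "post.office.readers",  # hypothetical consumer group name
    "auto.offset.reset": "earliest",    # read events produced before we started
})
consumer.subscribe(["post.office"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # returns None if nothing arrived
        if msg is None or msg.error():
            continue
        print(msg.value().decode())
finally:
    consumer.close()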

Updating the number of partitions

As we saw in the section above, more partitions mean higher parallelism. We want to increase the number of partitions from 3 to 5 using the following script:

$ ./kafka-topics.sh --alter --topic post.office --partitions 5 --bootstrap-server localhost:19092

A real-world example would be an online shop. We use Kafka to produce events to another service that sends emails to customers. For the entire year, three partitions work just fine, but then Black Friday comes. All the customers start searching for products and purchasing all kinds of stuff. Three event consumers might not be enough, and we don't want to miss or delay sending any purchase email.
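
The partition count can also be changed from Python with the admin client; a sketch under the same assumptions as before:

from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "localhost:19092"})

# NewPartitions takes the new total count (5), not the number to add
futures = admin.create_partitions([NewPartitions("post.office", 5)])
futures["post.office"].result()  # raises an exception if the broker refused

Note that Kafka only allows increasing the partition count; shrinking a topic requires deleting and recreating it.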

Exercises

From this point on, we will use the Python script provided in the ZIP archive.

note

To continue the lab, we need to create a virtual environment and install the confluent-kafka package. Run the following commands:

$ python3.10 -m venv venv
$ source venv/bin/activate
$ pip3 install confluent-kafka
warning

Make sure your prompt contains the (venv) marker, as in the following example:

(venv) student@cc-lab:~$

The confluent-kafka package is available only in the virtual environment, not on the machine.

For more in-depth details about confluent-kafka, check the official documentation.
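
A quick sanity check that the package is importable, run inside the virtual environment:

# if this prints a version tuple, confluent-kafka is installed correctly
import confluent_kafka
print(confluent_kafka.version())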

Task 1

Here we have a Python script that creates multiple threads, one for each consumer/producer. We will always have one producer, which creates one event per second without printing any output, until the SIGINT (CTRL + C) signal is caught. We have a variable number of consumer threads, each of which prints every time it consumes something. Example of output:

$ python3 kafka.py
Consumer 0: {"to": "Steve Jobs", "from": "KFYbPZO@gmail.com", "message": "We will have affordable prices, right?"} (key: None)
SIGINT received. Stopping thread...

Follow the TODO1 comments and let some events be produced. What is the result in each consumer?

Details

Read me after

Each consumer will get all the events. Sometimes this is what we want, but sometimes this behaviour can lead to duplicated actions. An example is an online shop that sends events each time a user purchases something. One email service would want to subscribe to these events to send details to customers. Another service, which generates invoices for businesses, would also be a consumer. Both require the same events, not just a subset of them.

What about a high-traffic day that requires two invoice services to generate the documentation in time? It would be a disaster to generate and send two invoices for one purchase, right?

Task 2

Follow the TODO2 comments and let some events be produced. What is the result in each consumer?

Details

Read me after

As we can see, grouping multiple consumers under the same ID means that we will not consume the same event twice.

Kafka has an internal routing system based on partitions and the number of consumers in a consumer group. In this case, we can have at most 3 active consumers, because we have 3 partitions. The rest of the consumers will be on hold and will run only if an active consumer stops for any reason.
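
In terms of code, the only thing that places two consumers in the same group is a shared group.id. A sketch (the group name is hypothetical):

from confluent_kafka import Consumer

common_conf = {
    "bootstrap.servers": "localhost:19092",
    "group.id": "invoice-service",  # sharing this id is what forms the group
    "auto.offset.reset": "earliest",
}

# Kafka assigns each partition to exactly one consumer in the group, so
# these two consumers split the topic's partitions between them; each
# would then poll() in its own thread, as in the lab script
consumer_a = Consumer(common_conf)
consumer_b = Consumer(common_conf)
consumer_a.subscribe(["post.office"])
consumer_b.subscribe(["post.office"])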

Task 3

Follow the TODO3 comments and let some events be produced. What is the result in each consumer?

Details

Read me after

Up until this moment, we sent events that had a value but no key. When we send an event with a key, Kafka hashes the key and assigns it to a partition. From that moment on, all the events with that key will be routed to the same partition.

You can also check the kafka-ui dashboard.

note

We are creating a small amount of events compared to what Kafka can handle. There is a chance that some consumers will not get events.

Kafka does not guarantee that events with different keys will be sent on different partitions.

Kafka does guarantee, however, that events with the same key will always land on the same partition.
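
To see this from Python, pass a key when producing. A sketch (the key value is arbitrary):

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:19092"})

# all three events share the same key, so they hash to the same partition
# and will be consumed in the order they were produced
for seq in range(3):
    event = {"event": "new_envelope", "seq": seq}
    producer.produce("post.office", key="alan.turing", value=json.dumps(event))
producer.flush()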