Kafka Fundamentals

1. Kafka Overview

In this section, we will learn the fundamentals of Apache Kafka and understand its various components, such as:

  • Kafka Topics
  • Kafka Producers
  • Kafka Consumers
  • Kafka Consumer Groups and Consumer Offsets
  • Kafka Brokers
  • Kafka Topic Replication
  • Zookeeper
  • KRaft Mode

This content is intended for users who want to understand and use Apache Kafka properly.

1.1 What is a Kafka Topic?

Similar to how databases have tables to organize and segment datasets, Kafka uses the concept of topics to organize related messages.

A topic is identified by its name. For example, we may have a topic called logs that may contain log messages from our application, and another topic called purchases that may contain purchase data from our application as it happens.

Kafka topics can contain any kind of message in any format, and the sequence of all these messages is called a data stream.

Kafka Topics - Warning

Unlike database tables, Kafka topics are not queryable. Instead, we have to create Kafka producers to send data to the topic and Kafka consumers to read the data from the topic in order.

Data in Kafka topics is deleted after one week by default (also called the default message retention period), and this value is configurable. This mechanism of deleting old data ensures a Kafka cluster does not run out of disk space by recycling topics over time.
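As a rough sketch of time-based retention: Kafka actually deletes whole expired log segments rather than individual messages, but a minimal Python model of the effect might look like this (`prune_expired` and the message dicts are illustrative, not a Kafka API):

```python
import time

RETENTION_MS = 7 * 24 * 60 * 60 * 1000  # one week, Kafka's default retention period

def prune_expired(messages, now_ms, retention_ms=RETENTION_MS):
    """Drop messages older than the retention period (simplified model:
    real Kafka deletes whole log segments, not individual messages)."""
    return [m for m in messages if now_ms - m["timestamp_ms"] <= retention_ms]

now = int(time.time() * 1000)
log = [
    {"value": "old", "timestamp_ms": now - 8 * 24 * 60 * 60 * 1000},  # 8 days old
    {"value": "recent", "timestamp_ms": now - 60 * 1000},             # 1 minute old
]
print(prune_expired(log, now))  # only the "recent" message survives
```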

1.2 What are Kafka Partitions?

Topics are broken down into a number of partitions. A single topic may have more than one partition; it is common to see topics with 100 partitions.

The number of partitions of a topic is specified at the time of topic creation. Partitions are numbered starting from 0 to N-1, where N is the number of partitions. The figure below shows a topic with three partitions, with messages being appended to the end of each one.

The offset is an integer value that Kafka adds to each message as it is written into a partition. Each message in a given partition has a unique offset.

Kafka Topics

Kafka topics are immutable: once data is written to a partition, it cannot be changed.
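A partition can be modeled as an append-only log that hands out the next integer offset to each new message. This is a conceptual sketch, not Kafka's storage format:

```python
class Partition:
    """A minimal model of a Kafka partition: an append-only message log
    where each message receives the next integer offset."""
    def __init__(self):
        self.messages = []
        self.next_offset = 0

    def append(self, value):
        offset = self.next_offset
        self.messages.append((offset, value))
        self.next_offset += 1
        return offset

p = Partition()
print(p.append("first"))   # 0
print(p.append("second"))  # 1
# Messages are immutable: we only ever append; existing entries are never updated.
```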

1.3 Kafka Topic example

A traffic company wants to track its fleet of trucks. Each truck is fitted with a GPS locator that reports its position to Kafka. We can create a topic named trucks_gps to which the trucks publish their positions. Each truck may send a message to Kafka every 20 seconds; each message will contain the truck ID and the truck position (latitude and longitude). The topic may be split into a suitable number of partitions, say 10. There may be different consumers of the topic, for example, an application that displays truck locations on a dashboard, or another application that sends notifications if an event of interest occurs.
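One message in this scenario could be sketched as follows; the field names and the `truck_gps_message` helper are hypothetical, chosen only to illustrate putting the truck ID in the key (so each truck's messages stay in order) and the position in the value:

```python
import json
import time

def truck_gps_message(truck_id: str, lat: float, lon: float) -> dict:
    """One message for the hypothetical trucks_gps topic: the truck ID as
    the key and the GPS position as a JSON value."""
    return {
        "topic": "trucks_gps",
        "key": truck_id,
        "value": json.dumps({"lat": lat, "lon": lon}),
        "timestamp_ms": int(time.time() * 1000),
    }

msg = truck_gps_message("truck_123", 52.5200, 13.4050)
print(msg["key"])  # truck_123
```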

1.4 What are Kafka Offsets?

Apache Kafka offsets represent the position of a message within a Kafka Partition. Offset numbering for every partition starts at 0 and is incremented for each message sent to a specific Kafka partition. This means that Kafka offsets only have a meaning for a specific partition, e.g., offset 3 in partition 0 doesn’t represent the same data as offset 3 in partition 1.

Kafka Offset Ordering

If a topic has more than one partition, Kafka guarantees the order of messages within a partition, but there is no ordering of messages across partitions.

Even though we know that messages in Kafka topics are deleted over time (as seen above), the offsets are not re-used. They are continually incremented in a never-ending sequence.
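The non-reuse of offsets can be shown with a small model: deleting old messages does not reset the offset counter. This `Partition` class is a conceptual sketch, not a Kafka API:

```python
class Partition:
    def __init__(self):
        self.log = []          # list of (offset, value)
        self.next_offset = 0   # never decreases, even after deletion

    def append(self, value):
        self.log.append((self.next_offset, value))
        self.next_offset += 1

    def delete_up_to(self, offset):
        # Retention removes old messages, but next_offset is untouched.
        self.log = [(o, v) for o, v in self.log if o >= offset]

p = Partition()
for v in ["a", "b", "c"]:
    p.append(v)
p.delete_up_to(2)   # "a" and "b" expire
p.append("d")
print(p.log)        # [(2, 'c'), (3, 'd')] -- offsets 0 and 1 are never reused
```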

Once a topic has been created with Kafka, the next step is to send data into the topic. This is where Kafka Producers come in.

2. Kafka Producers

Applications that send data into topics are known as Kafka producers.

A Kafka producer sends messages to a topic, and messages are distributed to partitions according to a mechanism such as key hashing (more on it below).

For a message to be successfully written into a Kafka topic, a producer must specify a level of acknowledgment (acks). This subject will be introduced in depth in the topic replication section.

2.1 Message Keys

When a producer sends a message with a key specified, Kafka uses that key (typically via a hash function) to determine which partition the message is stored in. This ensures that all messages with the same key are sent to the same partition, which is useful in certain scenarios, for example, when you want all messages from the same user to land on the same partition.

Each event message contains an optional key and a value.

If no key is specified by the producer (key=null), messages are distributed evenly across partitions in a topic. This means messages are sent in a round-robin fashion (partition p0, then p1, then p2, etc., then back to p0 and so on).

If a key is sent (key != null), then all messages that share the same key will always be sent and stored in the same Kafka partition. A key can be anything to identify a message - a string, numeric value, binary value, etc.

Kafka message keys are commonly used when there is a need for message ordering for all messages sharing the same field. For example, in the scenario of tracking trucks in a fleet, we want data from trucks to be in order at the individual truck level. In that case, we can choose the key to be truck_id. In the example shown below, the data from the truck with id truck_id_123 will always go to partition p0.
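Both behaviors can be sketched in a few lines of Python. This is a conceptual model: the stand-in hash is not Kafka's (the default partitioner uses murmur2), and recent Kafka clients actually batch null-key messages with a "sticky" partitioner, though the spread evens out over time:

```python
from itertools import cycle

NUM_PARTITIONS = 3

# Null-key messages: spread evenly across partitions, round-robin style.
rr = cycle(range(NUM_PARTITIONS))
null_key_assignments = [next(rr) for _ in range(6)]
print(null_key_assignments)  # [0, 1, 2, 0, 1, 2]

# Keyed messages: a hash of the key picks the partition, so equal keys
# always land in the same partition. sum() is a stand-in hash here;
# Kafka's default partitioner actually uses murmur2.
def partition_for(key: bytes, num_partitions: int) -> int:
    return sum(key) % num_partitions

print(partition_for(b"truck_id_123", NUM_PARTITIONS) ==
      partition_for(b"truck_id_123", NUM_PARTITIONS))  # True: deterministic
```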

2.2 Kafka Message Anatomy

Kafka messages are created by the producer. A Kafka message consists of the following elements:

  • Key. The key is optional in the Kafka message and can be null. A key may be a string, number, or any object, and is then serialized into binary format.

  • Value. The value represents the content of the message and can also be null. The value format is arbitrary and is then also serialized into binary format.

  • Compression Type. Kafka messages may be compressed. The compression type can be specified as part of the message. Options are none, gzip, lz4, snappy, and zstd.

  • Headers. There can be a list of optional Kafka message headers in the form of key-value pairs. It is common to add headers to specify metadata about the message, especially for tracing.

  • Partition + Offset. Once a message is sent into a Kafka topic, it receives a partition number and an offset id. The combination of topic+partition+offset uniquely identifies the message.

  • Timestamp. A timestamp is added either by the user or the system in the message.
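The fields above can be collected into a small data structure. This is a sketch of the logical message anatomy, not Kafka's wire format, and the class name and field names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class KafkaRecord:
    """Sketch of the message elements described above (not the wire format)."""
    key: Optional[bytes]             # optional, may be None
    value: Optional[bytes]           # may also be None
    headers: dict = field(default_factory=dict)  # optional key-value metadata
    compression_type: str = "none"   # none, gzip, lz4, snappy, or zstd
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    # Assigned once the record is written into a topic:
    partition: Optional[int] = None
    offset: Optional[int] = None

rec = KafkaRecord(key=b"truck_id_123", value=b'{"lat": 52.52}')
print(rec.partition is None)  # True: not assigned until written to a topic
```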

2.3 Kafka Message Serializers

In many programming languages, the key and value are represented as objects, which greatly increases the code readability. However, Kafka brokers expect byte arrays as keys and values of messages. The process of transforming the producer's programmatic representation of the object to binary is called message serialization.

As shown below, we have a message with an Integer key and a String value. Since the key is an integer, we have to use an IntegerSerializer to convert it into a byte array. For the value, since it is a string, we must leverage a StringSerializer.
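The two serializations can be reproduced in Python: Kafka's IntegerSerializer writes a 4-byte big-endian integer, and its StringSerializer defaults to UTF-8. The function names here are illustrative stand-ins for the Java serializer classes:

```python
import struct

def integer_serializer(n: int) -> bytes:
    # Kafka's IntegerSerializer writes a 4-byte big-endian integer
    return struct.pack(">i", n)

def string_serializer(s: str) -> bytes:
    # Kafka's StringSerializer defaults to UTF-8 encoding
    return s.encode("utf-8")

key_bytes = integer_serializer(123)
value_bytes = string_serializer("hello kafka")
print(key_bytes)    # b'\x00\x00\x00{'
print(value_bytes)  # b'hello kafka'
```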

2.4 For the curious: Kafka Message Key Hashing

A Kafka partitioner is the code logic that takes a record and determines which partition to send it to.

To that end, it is common for partitioners to leverage the Kafka message key to route a message to a specific topic-partition. As a reminder, all messages with the same key will go to the same partition.

Kafka Key Hashing

Key Hashing is the process of determining the mapping of a key to a partition.

In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm, with the following formula:

targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
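The mechanics can be sketched in Python. Note that `zlib.crc32` is used here only as a stand-in 32-bit hash, so this does not reproduce Kafka's actual partition numbers; the real partitioner hashes with murmur2:

```python
import zlib

def to_positive(h: int) -> int:
    # Mirrors Kafka's Utils.toPositive: clears the sign bit, keeping the
    # result in [0, 2**31), so the modulo below never goes negative.
    return h & 0x7FFFFFFF

def default_partition(key_bytes: bytes, num_partitions: int) -> int:
    # zlib.crc32 stands in for Utils.murmur2 just to show the mechanics.
    return to_positive(zlib.crc32(key_bytes)) % num_partitions

p = default_partition(b"truck_id_123", 10)
print(0 <= p < 10)  # True: always a valid partition index
```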