This post was originally published on my personal blog.
Apache Kafka is an amazing system for building a scalable data streaming platform within an organization. It’s being used in production everywhere from small startups to Fortune 500 companies. As the adoption of a core platform grows within an enterprise, it’s important to think about maintaining consistency and enforcing standards.
Today, I’ll be discussing how one can do that with regard to Apache Kafka and its core data structure: a topic. As most engineers who have used Kafka know, a topic is a category or feed to which messages are published and in which they are stored. Topics are similar to queues in message bus systems such as RabbitMQ or ActiveMQ.
Imagine a company building a simple order management system using Kafka as its backbone. They might create a couple of microservices that rely on a few core topics:
As the company grows, and as more teams are onboarded to the platform, more topics will be needed. The company may add data pipelines for inventory, fraud detection, and more. They might now have additional topics like:
How do you manage to keep topic names consistent? How do people from other teams know exactly what the topic contains? It is easy when there are only a few topics and a small number of people using the platform. Once you are in a large organization with many teams creating and using topics, it becomes much harder.
A quick search leads to some great blog posts, StackOverflow answers, and mailing list posts discussing how to name topics. There is also a vast number of opinions on the best way to do this. Here are some examples:
A decent topic naming strategy, proposed by Chris Riccomini in his popular blog post, How to paint a bike shed: Kafka topic naming conventions, is:
<message type>.<dataset name>.<data name>
At first glance, none of these look particularly bad – some even look great. Before we go in-depth on how to best name a Kafka topic, let’s discuss what makes a topic name good.
When it comes to naming a Kafka topic, two parts are important: the structure of the name and the semantics of the name.
The structure of a name defines what characters are allowed and the format to use. In its topic names, Kafka allows alphanumeric characters, periods ( . ), underscores ( _ ), and hyphens ( - ).
Although ‘_’ and ‘.’ are both allowed, topic names that differ only in these two characters can collide due to limitations in metric names. It is best to pick one of the two and never use both.
We cannot change what Kafka allows, but we can further define how dashes are used or enforce that all topics be lowercase.
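As a rough illustration of the structural rules above, here is a minimal Python sketch. It only checks the character set described in this post and groups names that become identical once periods are swapped for underscores. The function names and sample topics are hypothetical, and the sketch does not try to cover any other limits Kafka may apply.

```python
import re

# Characters allowed in a topic name, as listed above:
# alphanumerics, periods, underscores, and hyphens.
LEGAL_TOPIC = re.compile(r"[A-Za-z0-9._-]+")


def is_legal_topic_name(name: str) -> bool:
    """Return True if the whole name uses only the allowed characters."""
    return LEGAL_TOPIC.fullmatch(name) is not None


def find_metric_collisions(names):
    """Group distinct names that are equal once '.' is replaced with '_'."""
    groups = {}
    # dict.fromkeys drops exact duplicates while keeping the input order.
    for name in dict.fromkeys(names):
        groups.setdefault(name.replace(".", "_"), []).append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Running `find_metric_collisions` over the full list of topics in a cluster is one way to spot pairs such as `orders.created` and `orders_created` before they cause confusion.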
The semantics of a name define what fields should go in that name and in what order they should be placed. There are a few rules that should be applied to naming topics when it comes to semantics.
Fields that can change should not be used in topic names. Fields such as team name, product name, and service owner should never be used in topic names.
As most engineers know, over time, these things change as organizations evolve. It’s not an easy task to change a topic name once it is in use all over an enterprise, so it is best to leave those fields out from the beginning.
Topic names should not be tied to service names unless they are completely internal to a single service and are not meant to be produced to or consumed by any other service.
Most topics eventually end up with more than one consumer, and their producers could change in the future. It’s best to name topics after the data they hold rather than what is creating or reading the data.
Metadata that can be found elsewhere, such as in the data payload or in a schema registry, should be left out of the topic name. This includes things such as partition count, security information, schema information, etc.
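One way to back up the rules about team, product, and service names is a small deny-list check. The sketch below is only an illustration: the deny-list entries are made-up names, and a real list would have to come from your own organization.

```python
# Hypothetical deny list of team, product, and service names that
# are likely to change and so should not appear in topic names.
UNSTABLE_TOKENS = frozenset({"team-apollo", "checkout-service", "acme-pay"})


def find_unstable_tokens(topic: str, deny_list=UNSTABLE_TOKENS):
    """Return the deny-listed tokens that appear anywhere in the topic name."""
    lowered = topic.lower()
    return sorted(token for token in deny_list if token in lowered)
```

An empty result means the name passed this particular check; it says nothing about whether the name is otherwise well formed.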
Okay, so you may be thinking, there are a ton of fields to pick from and I’m not sure what semantics I should enforce. Let’s get to the details on what a great topic name convention should look like. My recommended rules to follow are:
Topic names should be completely lowercase and adhere to the following regular expression: [a-z0-9.-]. All topics should follow kebab-case, such as
Readability and ease-of-understanding play a huge role in proper topic naming. Lowercase topic names are easy to read and kebab-case flows nicely; we avoid the use of underscores due to metric naming collisions with periods. Additionally, periods make for a great separator between sections in a topic name, which is described below.
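The lowercase rule can be checked mechanically. The sketch below assumes the character class above is meant to apply to every character in the name, so it adds a `+` quantifier and matches the whole string. The function name and sample topic names are hypothetical.

```python
import re

# One or more lowercase letters, digits, periods, or hyphens.
# The "+" and the full-string match are assumptions added here.
LOWERCASE_TOPIC = re.compile(r"[a-z0-9.-]+")


def follows_lowercase_rule(name: str) -> bool:
    """Return True if the name is lowercase kebab-case with period separators."""
    return LOWERCASE_TOPIC.fullmatch(name) is not None
```

With this check, a name like `fleet.vehicle-location` passes, while `Fleet.Location` and `fleet_location` are rejected.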
My recommendation is to use the following naming convention:
Let’s discuss what each part of the name means:
The data center which the data resides in. This is not required, but is helpful when an organization reaches the size where they would like to do an active/active setup or replicate data between data centers. For example, if you have one cluster in AWS and one in Azure, your topics may be prefixed with
A domain for the data is a well understood, permanent name for the area of the system the data relates to. These should not include any product names, team names, or service names.
Examples of this vary wildly between industries. For example, in a transportation organization, some domains might be:
- comms: all events relating to device communications
- fleet: all events relating to trucking fleet management
- identity: all events relating to identity and auth services
The classification of data within a Kafka topic tells an end-user how it should be interpreted or used. This should not tell us about data format or contents. I typically use the following classifications:
- fct: Fact data is information about a thing that has happened. It is an immutable event at a specific point in time. Examples of this include data from sensors, user activity events, or notifications.
- cdc: Change data capture (CDC) indicates this topic contains all instances of a specific thing and receives all changes to those things. These topics do not capture deltas and can be used to repopulate data stores or caches. These are commonly found as compacted topics within Kafka.
- cmd: Command topics represent operations that occur against the system. This is typically found as the request-response pattern, where you have a verb and a statement. Examples might include
- sys: System topics are used for internal topics within a single service. They are operational topics that do not contain any relevant information outside of the owning system. These topics are not meant for public consumption.
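The four classifications above form a small closed set, which makes them a natural fit for an enumeration. This is a sketch only; the `Classification` type and the `is_public` helper are hypothetical names, not part of Kafka.

```python
from enum import Enum


class Classification(Enum):
    FCT = "fct"  # fact: an immutable event at a specific point in time
    CDC = "cdc"  # change data capture: every instance of a thing and its changes
    CMD = "cmd"  # command: an operation requested against the system
    SYS = "sys"  # system: internal to a single service


def is_public(classification: Classification) -> bool:
    """System topics are internal; the other classifications are shared."""
    return classification is not Classification.SYS
```

Looking up an unknown value, for example `Classification("evt")`, raises a `ValueError`, which makes typos in a topic definition easy to catch.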
The description is arguably the most important part of the name and is the event name that describes the type of data the topic holds. This is the subject of the data, such as
The version of a topic is often the most forgotten section of a proper topic name. As data evolves within a topic, there may be breaking schema changes or a complete change in the format of the data. By versioning topics, you can allow a transition period where consumers can switch to the new data without impacting any old consumers.
By convention, it is preferred to version all topics and to start them at
Examples, using this convention, may be:
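To make the parts concrete, here is a minimal sketch that builds and parses names from the sections discussed above. It assumes the sections are joined by periods in the order they were described (optional data center, domain, classification, description, version) and that the version is written as a bare integer. The `TopicName` type, the `parse_topic_name` helper, and the sample values are all hypothetical illustrations.

```python
from dataclasses import dataclass
from typing import Optional

CLASSIFICATIONS = {"fct", "cdc", "cmd", "sys"}


@dataclass(frozen=True)
class TopicName:
    domain: str
    classification: str
    description: str
    version: int
    data_center: Optional[str] = None  # optional prefix

    def __str__(self) -> str:
        parts = [self.domain, self.classification, self.description, str(self.version)]
        if self.data_center:
            parts.insert(0, self.data_center)
        return ".".join(parts)


def parse_topic_name(name: str) -> TopicName:
    """Split a name into its sections, or raise ValueError if it does not fit."""
    parts = name.split(".")
    if len(parts) == 5:
        data_center, *rest = parts
    elif len(parts) == 4:
        data_center, rest = None, parts
    else:
        raise ValueError(f"expected 4 or 5 period-separated sections: {name!r}")
    domain, classification, description, version = rest
    if classification not in CLASSIFICATIONS:
        raise ValueError(f"unknown classification {classification!r} in {name!r}")
    if not (version.isascii() and version.isdigit()):
        raise ValueError(f"version must be a non-negative integer in {name!r}")
    return TopicName(domain, classification, description, int(version), data_center)
```

Because sections are split on periods, this sketch relies on the earlier rule that periods are used only as separators and hyphens are used within a section.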
After a naming convention is decided upon and put into place, how does one enforce that topics conform to that convention? The first step is to ensure auto topic creation is disabled on the broker side; this is done by setting the auto.create.topics.enable property to false. Apache Kafka's own broker default for this property is true, so check what your version or distribution ships with rather than relying on the default. Ideally, your cluster should also have security enabled and disallow the creation of topics by services. This ensures that topic creation is done in a standardized way and controlled by an operations team.
The recommended approach is to create topics through a continuous integration pipeline, where topics are defined in source control and created through a build process. This ensures scripts can validate that all topic names conform to the desired conventions before getting created. A helpful tool to manage topics within a Kafka cluster is kafka-dsf.
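A validation step in such a pipeline could look something like the sketch below. It assumes topic names are kept in a plain text file, one per line, with `#` for comments; that file layout and the function names are hypothetical, and this is not how kafka-dsf itself defines topics.

```python
import re
import sys

# Lowercase kebab-case sections separated by periods.
TOPIC_PATTERN = re.compile(r"[a-z0-9.-]+")


def find_violations(topic_names):
    """Return (name, reason) pairs for every name that breaks the convention."""
    violations = []
    seen = set()
    for name in topic_names:
        if name in seen:
            violations.append((name, "duplicate definition"))
        seen.add(name)
        if not TOPIC_PATTERN.fullmatch(name):
            violations.append((name, "must match [a-z0-9.-]+"))
    return violations


def main(path):
    """Validate the topic file at `path`; return a process exit code."""
    with open(path, encoding="utf-8") as handle:
        stripped = (line.strip() for line in handle)
        names = [line for line in stripped if line and not line.startswith("#")]
    violations = find_violations(names)
    for name, reason in violations:
        print(f"{name}: {reason}", file=sys.stderr)
    return 1 if violations else 0
```

A build job could call `sys.exit(main("topics.txt"))` so that a non-zero exit code fails the build before any topic is created.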
Hopefully reading this has provoked some thought into how to create useful topic naming conventions and how to prevent your Kafka cluster from becoming the Wild West. It’s important to enforce consistency early and put a standard process in place before it’s too late, because things like topic names are hard to change later, and probably never will be. :-)