<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Hertwig</title>
    <description>The latest articles on DEV Community by Alexander Hertwig (@apotropos1502).</description>
    <link>https://dev.to/apotropos1502</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F895144%2F2574b91c-a445-48f3-869f-a7b8f3eccf23.png</url>
      <title>DEV Community: Alexander Hertwig</title>
      <link>https://dev.to/apotropos1502</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/apotropos1502"/>
    <language>en</language>
    <item>
      <title>Kadeck 4.0 Announcement</title>
      <dc:creator>Alexander Hertwig</dc:creator>
      <pubDate>Wed, 14 Sep 2022 13:07:35 +0000</pubDate>
      <link>https://dev.to/kadeck/kadeck-40-by-ceo-ben-3m72</link>
      <guid>https://dev.to/kadeck/kadeck-40-by-ceo-ben-3m72</guid>
      <description>&lt;p&gt;Today, we at Xeotek are pleased to announce the biggest release of Kadeck to date. Release 4.0.0 is a very special milestone for all of us - a release whose scope consists of changes and new features that will tangibly, literally visibly change the way you work with Kadeck and Apache Kafka.&lt;/p&gt;

&lt;p&gt;Many of you have given us feedback, and we've even talked to some of you about the strengths and weaknesses of Kadeck in hour-long online meetings. We are all very grateful for that at Xeotek. As an engineer, it's especially great to get feedback from other engineers on your product. And we took the feedback from you to heart, even if it meant a lot of work.&lt;/p&gt;

&lt;p&gt;You are the ones who work with Apache Kafka or Amazon Kinesis and need a tool that supports you: an application that does the tedious work for you and makes you faster, that you can rely on, and that is also fun to work with. Kadeck is your daily companion, and we are very proud of that. As your daily companion, every change has a huge impact on your daily work. Therefore, I am happy to say: this update, Kadeck 4.0.0, will revolutionize your work. We have developed a completely new user interface that not only makes Kadeck shine in a modern new light, but also makes many functions easier to use, faster to find, and self-explanatory.&lt;/p&gt;

&lt;p&gt;A new data catalog view allows you to manage all streams (Kafka or Kinesis streams) from one central location. The more data streaming applications you develop, the more confusing and time-consuming it becomes to find the right stream with the data you're looking for. With the help of the data catalog, data streams can be tagged and the responsible data owners specified, so that everything remains structured and organized. All streams from all Apache Kafka &amp;amp; Kinesis clusters connected to Kadeck are displayed in a list. Teams, in combination with Kadeck rights management (which can be connected to LDAP, e.g. Active Directory), only see the streams they have access to in their project or department.&lt;/p&gt;

&lt;p&gt;Teams are also very much at the center of this update. It is now possible to allow teams to work independently with Apache Kafka in their area of responsibility: from creating or deleting streams belonging to their project or department, to creating ACLs for those streams and for their applications in their namespace, and ultimately monitoring their applications and the data in their streams. Kadeck Rights Management allows all this to be configured via groups and even synchronized with LDAP (e.g. Active Directory).&lt;/p&gt;

&lt;p&gt;In addition, many other improvements have been added: multiple streams can now be emptied (deleting all records) or deleted entirely, and a connection test for Apache Kafka, Kinesis, and Schema Registry gives detailed information about the error when a connection cannot be established.&lt;/p&gt;

&lt;p&gt;In addition, it is now much easier to navigate the data in the Data Browser. When I think about the previous releases, I see them as preliminary work. Preliminary work that was extensive and time consuming, but necessary to lay the foundation for what we at Xeotek see as the future of data streaming. With Release 4.0.0, we complete this groundwork and lay the foundation for new features that will simplify the operation and development of data streaming applications.&lt;/p&gt;

&lt;p&gt;Release 4.0.0 is a completely new Kadeck with the strengths you love.&lt;/p&gt;

&lt;p&gt;I hope you find this release as exciting as we do. Please share your feedback with us - in the app, via our social channels or via mail.&lt;/p&gt;

&lt;p&gt;Ben&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>news</category>
      <category>writing</category>
      <category>devjournal</category>
    </item>
    <item>
      <title>Kafka topic naming conventions - 5 recommendations with examples</title>
      <dc:creator>Alexander Hertwig</dc:creator>
      <pubDate>Tue, 30 Aug 2022 14:12:30 +0000</pubDate>
      <link>https://dev.to/kadeck/kafka-topic-naming-conventions-5-recommendations-with-examples-46h4</link>
      <guid>https://dev.to/kadeck/kafka-topic-naming-conventions-5-recommendations-with-examples-46h4</guid>
      <description>&lt;p&gt;&lt;strong&gt;There are different opinions and a lot of confusion about the naming of Topics. In this article, I present the best practices that have proven themselves in my experience and that scale best, especially for larger companies.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right at the beginning of the development of new applications with Apache Kafka, the all-important question arises: &lt;strong&gt;what name do I give my Topics?&lt;/strong&gt; If each team or project has its own naming scheme, this can perhaps be tolerated at development time. However, it is not very conducive to collaboration if it is not clear which topic is to be used and which data it carries. A decision must be made at the latest when going live, in order to prevent a proliferation of naming schemes. After all, &lt;strong&gt;topics cannot be renamed afterward&lt;/strong&gt;: if you decide on a new name over time, you have to delete the old topic, create a new topic with the new name, and adapt all dependent applications. So how do you proceed, what scales best, and what should you pay attention to?&lt;/p&gt;

&lt;p&gt;Naming things is always a very sensitive topic: I well remember meetings where company-wide programming guidelines were to be decided and this agenda item just wouldn’t disappear from meeting to meeting because of disputes about variable naming. With this article, I would like to provide you with a decision-making basis for topic naming in your project or company based on our experience at Xeotek. As a vendor of data stream exploration and management software for Apache Kafka &amp;amp; Amazon Kinesis (&lt;a href="https://www.xeotek.com"&gt;Xeotek KaDeck&lt;/a&gt;), we have probably seen and experienced almost every variation in practical use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The beer coaster rule
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QFKAaB8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nkp48xo4e1rb7u1mh93u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QFKAaB8L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nkp48xo4e1rb7u1mh93u.png" alt="The beer coaster rule" width="880" height="370"&gt;&lt;/a&gt;&lt;br&gt;
The “best practices” presented here have been gained from various projects with a wide range of customers and industries. However, one thing is crucial: don’t do too little, but don’t overdo it either! The methodology used for naming topics naturally depends on the size of the company and the system landscape. Over-engineering should be avoided as much as possible: if at the end of the day the guidelines for topic names fill pages and are only understood by a small group of people, then this is not useful. Regarding the scope, a quote from a colleague always comes to mind, which seems appropriate at this point:&lt;br&gt;
&lt;strong&gt;“It has to fit on a beer coaster.“&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The structural design
&lt;/h2&gt;

&lt;p&gt;Since topics cannot technically be grouped into folders or groups, it is important to &lt;strong&gt;create a structure for grouping&lt;/strong&gt; and categorization at least via the topic name. The question arises how the different “folders”, “properties” or simply “components” should be separated. This is primarily a matter of taste. Separating components with a dot (.) and structuring names in the style of reverse domain name notation (reverse-DNS) has proven itself.&lt;/p&gt;

&lt;p&gt;This is the approach we have found most frequently with our customers, followed by underscores. CamelCase or comparable approaches, on the other hand, are found rather rarely.&lt;/p&gt;

&lt;p&gt;When separating with dots, it is recommended (as with domains) to &lt;strong&gt;avoid capitalization&lt;/strong&gt;: write everything in lower case. This is a simple rule and avoids philosophical questions like which spelling of “MyIBMId”, “MyIbmId” or “MyIBMid” is better now.&lt;/p&gt;
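These two rules (dot-separated components, all lowercase) are easy to enforce automatically, for example in a pre-creation check. A minimal sketch in Python; the regex and function name are illustrative, not part of Kafka or any tool mentioned here:

```python
import re

# Illustrative convention: dot-separated components, each made up of
# lowercase letters and digits only (e.g. "sales.ecommerce.shoppingcarts").
TOPIC_NAME = re.compile(r"[a-z0-9]+(\.[a-z0-9]+)*")

def is_valid_topic_name(name):
    """Check a topic name against the dot-separated, all-lowercase convention."""
    return TOPIC_NAME.fullmatch(name) is not None
```

For example, `is_valid_topic_name("sales.ecommerce.shoppingcarts")` passes, while `is_valid_topic_name("MyIBMId.Data")` is rejected, which sidesteps the capitalization question entirely.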

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GEkrBILj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2jom0o9j3ht4d6fjs5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GEkrBILj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w2jom0o9j3ht4d6fjs5e.png" alt="Recommendation1" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What is the name of the data?
&lt;/h2&gt;

&lt;p&gt;Once the structural design has been determined, the question is what we want to structure in the first place: what exactly belongs in the topic name? Of course, &lt;strong&gt;the topic should bear the name of the data&lt;/strong&gt;. But what is the name of the data contained in the topic?&lt;/p&gt;

&lt;p&gt;Readers who have already experienced the attempt to create a uniform, company-wide data model (there are many legends about it!) know the problem: not only can there be distinctions between technical and business names; one and the same data set can also have a completely different name in different departments (“ubiquitous language”). Therefore, data ownership must be clarified at this point: &lt;strong&gt;who is the data producer or who owns the data?&lt;/strong&gt; And in terms of domain-driven design (&lt;a href="https://en.wikipedia.org/wiki/Domain-driven_design"&gt;DDD&lt;/a&gt;): in which domain is the data located?&lt;/p&gt;

&lt;p&gt;In order to name the data, it is therefore necessary to specify the domain and, if applicable, the context. The actual, functional, or technical name of the data set is appended at the end.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;domain&amp;gt;.&amp;lt;subdomain1&amp;gt;.&amp;lt;subdomain...&amp;gt;.&amp;lt;data&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;risk.portfolio.analysis.loans.csvimport or
sales.ecommerce.shoppingcarts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the example shows, this is also a question of company size and system landscape: you may only need to specify one domain, or you may even need several subdomains.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1JLaIa13--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wdig8m8r5itbay52an7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1JLaIa13--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wdig8m8r5itbay52an7q.png" alt="Recommendation2" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Who may use the data?
&lt;/h2&gt;

&lt;p&gt;In the previous section, data was structured on the basis of domains and subdomains. Particularly in larger companies, it can make sense to mark cross-domain topics and thus control access and use. In this way, it is already clear from the topic name whether it is data that is only intended for internal processing within an area (domain), or whether the data stream (for example, after measures have been taken to ensure data quality) can be used by others as a reliable data source. Of course, this does not replace rights management and it is not intended to do so. However, explicitly marking the data as “private” or “public” with a corresponding prefix prevents other users from mistakenly working with “unofficial”, perhaps even experimental data without knowing it.&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public. sales.ecommerce.shoppingcarts
private.risk.portfolio.analysis.loans.csvimport
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0WYYjMzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ko3qbylqrndrtnlthrxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0WYYjMzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ko3qbylqrndrtnlthrxa.png" alt="Image description" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What should be avoided?
&lt;/h2&gt;

&lt;p&gt;In addition to the above recommendations that have worked well in the past, there are also a number of approaches that do not work so well. You should have good reasons before using these approaches (and such reasons may well exist); otherwise, it is best to avoid them.&lt;/p&gt;

&lt;p&gt;Among these negative experiences I count &lt;strong&gt;appending a version number&lt;/strong&gt; to the topic name. This approach quickly leads to countless topics being created, which may not be deleted as quickly. Especially with a topic or partition limit, as is common with many managed Apache Kafka providers, this can become a real problem. In the worst case, other users of the topic have to deploy one instance per topic version if their application can only read from or write to one topic. If the application can read from several topics at the same time (e.g. from all versions), the next problem arises when writing data back to a topic: do you write to only one topic, or do you again split the outgoing topics into the respective versions, because downstream processes might have a direct dependency on the different versions of the topic? As you can see: this will quickly get you into hot water. The better way is to &lt;strong&gt;add the version number of the schema in use as part of the header&lt;/strong&gt; of the respective record. This does not solve the problem of handling versions in downstream processes, but the overview is not lost. It is even better to &lt;strong&gt;use a schema registry&lt;/strong&gt; in which all information about the schema, versioning, and compatibility is stored centrally.&lt;/p&gt;
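Attaching the schema version as a record header can be sketched as follows; the helper name is made up, and the commented producer call only illustrates how such headers would typically be passed (kafka-python-style API):

```python
def with_schema_version(headers, version):
    """Return Kafka-style record headers (a list of (key, bytes) pairs)
    with the schema version attached, instead of encoding it in the topic name."""
    result = list(headers or [])
    result.append(("schema_version", str(version).encode("utf-8")))
    return result

# Illustrative producer call (kafka-python style), not executed here:
# producer.send("risk.portfolio.analysis.loans.csvimport",
#               value=payload,
#               headers=with_schema_version(None, 3))
```

Consumers can then branch on the header value per record instead of per topic, and the topic list stays stable across schema versions.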

&lt;p&gt;Using &lt;strong&gt;application names&lt;/strong&gt; as part of the topic name can also be problematic: a stronger coupling is hardly possible. However, there are exceptions here, for example for applications in the company that are set in stone anyway. In such a case, it makes no sense to create a large abstraction layer, especially if everyone in the company asks for the data of application X anyway and the “neutral” name causes confusion. However, the name of the domain service (e.g. “pricingengine”) can often be used as a good alternative in the sense of Domain-Driven Design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: Using the domain service name “pricingengine” instead of the application name to avoid coupling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private.risk.portfolio.pricingengine.assetpricing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ypDWfgTV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oyzbz1ly440mybdyk28s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ypDWfgTV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oyzbz1ly440mybdyk28s.png" alt="Recommendation4" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What about namespaces or company names?
&lt;/h2&gt;

&lt;p&gt;You should only use namespaces if there is really no other way. For example, if you have different tenants sharing an Apache Kafka environment, it makes sense to prepend the company name, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public.com.xeotek.sales.ecommerce.shoppingcarts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there is no such reason, then you should avoid this unnecessary information: your colleagues usually know the name of the company where they work. So no need to repeat this in every topic name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8mDIYmVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mecrj5kfcvzrm6n589m3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8mDIYmVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mecrj5kfcvzrm6n589m3.png" alt="Recommendation5" width="880" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enforcing topic naming rules and administrative tasks
&lt;/h2&gt;

&lt;p&gt;To enforce topic naming rules, be sure to set the &lt;strong&gt;auto.create.topics.enable&lt;/strong&gt; setting for your Apache Kafka brokers to &lt;strong&gt;false&lt;/strong&gt;. This means that topics can only be created manually, which from an organisational point of view requires an application process. For example, the responsible infrastructure team can act as the contact for the manual creation of topics. For the creation of topics, the &lt;strong&gt;kafka-topics&lt;/strong&gt; console tool supplied with Apache Kafka can be used, although a look at third-party tools with a graphical interface is recommended, not only for comprehensibility but above all for the enormous time savings on this and other typical tasks.&lt;/p&gt;
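The broker setting mentioned above is a one-line change in the broker configuration; a sketch (file location and surrounding settings vary per installation):

```properties
# server.properties (broker config): disable automatic topic creation so
# every topic must be created explicitly through the agreed process
auto.create.topics.enable=false
```

Topics are then created explicitly, e.g. with the bundled CLI: `kafka-topics.sh --create --topic sales.ecommerce.shoppingcarts --partitions 12 --replication-factor 3 --bootstrap-server localhost:9092` (topic name, partition count, and broker address here are placeholders).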

&lt;p&gt;In KaDeck Web, for example, the various teams can be granted rights for the independent creation of topics, provided that the topics correspond to a defined naming scheme. This means that teams within their own area (domain) can avoid a bureaucratic process and create and delete topics at short notice, e.g. for testing purposes, without outside help. The user, the action and the affected topic can be traced via an audit log integrated in KaDeck.&lt;/p&gt;

&lt;p&gt;By the way, &lt;strong&gt;Apache Kafka generally supports wildcards&lt;/strong&gt; when selecting topics, for example when consuming data (i.e. in the consumer) or when assigning rights via ACLs. The proposed naming scheme works very well in this combination: both the recommended separation of “private” and “public” topics and the use of domain names as part of the name allow access for teams from different domains to be created and controlled very intuitively and quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ci87a1MJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5nlhjl9q08rkg87p56pu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ci87a1MJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5nlhjl9q08rkg87p56pu.png" alt="KadeckSoftware" width="880" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article is a list of recommendations that have proven useful in the past when naming topics. The exception proves the rule: perhaps another dimension for structuring your topics makes sense, or some of the approaches I’ve listed as ones to avoid make sense in your case. Feel free to let me know (Twitter: &lt;a href="https://www.twitter.com/benjaminbuick"&gt;@benjaminbuick&lt;/a&gt; or the Xeotek team via &lt;a href="https://www.twitter.com/xeotekgmbh"&gt;@xeotekgmbh&lt;/a&gt;)!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ben&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>database</category>
      <category>cloud</category>
      <category>devjournal</category>
    </item>
    <item>
      <title>How many partitions do I need in Apache Kafka?</title>
      <dc:creator>Alexander Hertwig</dc:creator>
      <pubDate>Wed, 10 Aug 2022 18:56:00 +0000</pubDate>
      <link>https://dev.to/kadeck/how-many-partitions-do-i-need-in-apache-kafka-19i0</link>
      <guid>https://dev.to/kadeck/how-many-partitions-do-i-need-in-apache-kafka-19i0</guid>
      <description>&lt;p&gt;Apache Kafka is our rocket, and the individual partitions provide order and structure to all work processes at every stage of the flight. But how many partitions should we set up? A good question, which should be answered before we deploy Kafka to production. And also a question that the following blog post addresses. &lt;em&gt;3,2,1 – launch.&lt;/em&gt; 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need partitions in Apache Kafka at all?
&lt;/h2&gt;

&lt;p&gt;In Kafka, we use partitions to speed up the processing of large amounts of data. So instead of writing all of our data from a topic to one partition (and thus one broker), we split our topics into different partitions and distribute those partitions to our brokers. Unlike other messaging systems, our producers are responsible for deciding which partition our messages are written to. If we use keys, then the producer distributes the data such that data with the same key ends up on the same partition. This way, we can guarantee the ordering of messages with the same key. If we do not use keys, then messages are distributed to the partitions in a round-robin manner. &lt;/p&gt;

&lt;p&gt;Furthermore, we can have at most as many (useful) consumers in a consumer group as we have partitions in the topics they consume. The bottleneck in data processing is often not the broker or the producer, but the consumer, which must not only read the data, but also process it.&lt;/p&gt;
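The routing behaviour described above can be sketched as follows. This is a deliberately simplified stand-in: Kafka's default partitioner actually uses murmur2 hashing for keyed records (and, since Kafka 2.4, sticky partitioning rather than strict round-robin for keyless ones); a stable stdlib hash is used here only to illustrate the principle:

```python
import hashlib
import itertools

_round_robin = itertools.count()  # simplistic stand-in for keyless distribution

def choose_partition(key, num_partitions):
    """Simplified partitioner: the same key always maps to the same partition;
    records without a key are spread round-robin. (Kafka itself uses murmur2
    and, since 2.4, sticky partitioning instead.)"""
    if key is not None:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    return next(_round_robin) % num_partitions
```

Because the partition is derived from a hash of the key, `choose_partition("user-42", 6)` returns the same partition on every call, which is exactly what preserves per-key ordering.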

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdejbt2e6m6bnk1ha2qq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdejbt2e6m6bnk1ha2qq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In general&lt;/strong&gt;, the more partitions, the&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;higher our data throughput:&lt;/strong&gt; Both the brokers and the producers can process different partitions completely independently – and thus in parallel. This allows these systems to better utilise the available resources and process more messages. Important: The number of partitions can also be significantly higher than the number of brokers. This is not a problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;more consumers we can have in our Consumer Groups&lt;/strong&gt;: This also potentially increases the data throughput because we can spread the work across more systems. But beware: The more individual systems we have, the more parts can fail and cause issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;more open file handles we have on the brokers:&lt;/strong&gt; Kafka opens two files for each segment of a partition: the log and the index. This has little impact on performance, but we should definitely increase the allowed number of open files on the operating system side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;longer downtimes occur:&lt;/strong&gt; If a Kafka broker is shut down cleanly, then it notifies the controller and the controller can move the partition leaders to the other brokers without downtime. However, if a broker fails, there may be longer downtime because there is no leader for many partitions. Due to limitations in ZooKeeper, the controller can only move one leader at a time. This takes about 10 ms. With thousands of leaders to move, this can take minutes in some circumstances. If the controller fails, then the new controller must first read in the leaders of all partitions. If this takes about 2 ms per leader, then the process takes even longer. With KRaft this problem will become much smaller.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;more RAM is consumed by the clients:&lt;/strong&gt; the clients create buffers per partition and if a client interacts with very many partitions, possibly spread over many topics (especially as a producer), then the RAM consumption adds up a lot.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limits on partitions
&lt;/h3&gt;

&lt;p&gt;There are no hard limits on the number of partitions in Kafka clusters. But here are a few general rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximum 4000 partitions per broker (in total; distributed over many topics)&lt;/li&gt;
&lt;li&gt;maximum 200,000 partitions per Kafka cluster (in total; distributed over many topics)&lt;/li&gt;
&lt;li&gt;resulting in a maximum of 50 brokers per Kafka cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces downtime in case something does go wrong. But &lt;strong&gt;beware&lt;/strong&gt;: It should not be your goal to push these limits. In many “medium data” applications, you don’t need any of this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rules of thumb
&lt;/h2&gt;

&lt;p&gt;As already described at the beginning, there is no “right” answer to the question about the number of partitions. However, the following rules of thumb have become established over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No prime numbers:&lt;/strong&gt; Even though many examples on the Internet (or in training courses) use three partitions, it is generally a bad idea to use prime numbers, because they are very difficult to divide evenly among different numbers of brokers and consumers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Well divisible numbers:&lt;/strong&gt; Therefore, numbers that can be divided by many other numbers should always be used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multiples of consumers:&lt;/strong&gt; This allows partitions to be distributed evenly among the consumers in a consumer group.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multiples of the brokers:&lt;/strong&gt; This way we can distribute the partitions (and leaders!) evenly among all brokers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency in Kafka Cluster:&lt;/strong&gt; Especially when we want to use Kafka Streams, we realise that it makes sense to keep the number of partitions the same across our topics. For example, if we intend to do a join across two topics and these two topics do not have the same number of partitions, Kafka Streams needs to repartition a topic beforehand. This is a costly issue that we would like to avoid if possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Depending on the performance measurements:&lt;/strong&gt; If we know our target data throughput and also know the measured throughput of our Consumer and Producer per partition, we can calculate from that how many partitions we need. For example, if we know we want to move 100 MB/s of data and can achieve 10 MB/s in the Producer per partition and 20 MB/s in the Consumer, we require at least 10 partitions (and at least 5 Consumers in the Consumer Group).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do not overdo it:&lt;/strong&gt; This is not a contest to set up the largest number of partitions possible. If you only process a few tens of thousands of messages per day, you don’t need hundreds of partitions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
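The throughput rule of thumb from the last measurement-based bullet can be written down directly; a small sketch (function and parameter names are illustrative):

```python
import math

def partitions_needed(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Minimum partition count so that neither the producers nor the
    consumers become the bottleneck at the target throughput."""
    return max(math.ceil(target_mb_s / producer_mb_s_per_partition),
               math.ceil(target_mb_s / consumer_mb_s_per_partition))
```

For the example in the text (100 MB/s target, 10 MB/s per partition in the producer, 20 MB/s in the consumer), this yields 10 partitions; the consumer side then needs at least 100 / 20 = 5 consumers in the group.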

&lt;h3&gt;
  
  
  Practical examples
&lt;/h3&gt;

&lt;p&gt;In my consulting practice, &lt;strong&gt;12&lt;/strong&gt; has become a good guideline for the number of partitions. For customers who process very little data with Kafka (or need to pay per partition), even smaller numbers can make sense (2, 4, 6). If you are processing large amounts of data, then the Excel spreadsheet Pere has made available on GitHub will help: &lt;a href="https://github.com/purbon/kafka-cluster-size-calculator" rel="noopener noreferrer"&gt;kafka-cluster-size-calculator.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog post was written by a guest-author from Community Stream: by developers, for developers.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Looking to publish with us or share your Kafka Knowledge? &lt;br&gt;
Join our &lt;a href="https://discord.gg/BKMBRNQCVM" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; to take part!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As an IT trainer and frequent blogger for Xeotek, Anatoly Zelenin teaches Apache Kafka to hundreds of participants in interactive training sessions. For more than a decade, his clients from the DAX environment and German medium-sized businesses have appreciated his expertise and his inspiring manner. In that capacity, he also delivers &lt;a href="https://zelenin.de/en/courses?utm_source=xeotek" rel="noopener noreferrer"&gt;trainings&lt;/a&gt; for Xeotek clients. His book is available directly from him, from Hanser-Verlag, or on &lt;a href="https://www.amazon.de/Apache-Kafka-Von-Grundlagen-Produktiveinsatz/dp/3446461876?utm_source=xeotek" rel="noopener noreferrer"&gt;Amazon&lt;/a&gt;. You can reach Anatoly via &lt;a href="https://zelenin.de/kontakt?utm_source=xeotek" rel="noopener noreferrer"&gt;e-mail&lt;/a&gt;. In addition, he is not only an IT consultant, trainer, and blogger but also explores our planet as an adventurer.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>apachekafka</category>
      <category>javascript</category>
      <category>java</category>
    </item>
    <item>
      <title>What are your favourite communities, bubbles and people to follow? Comment your favs!</title>
      <dc:creator>Alexander Hertwig</dc:creator>
      <pubDate>Tue, 02 Aug 2022 09:11:38 +0000</pubDate>
      <link>https://dev.to/apotropos1502/what-are-your-favourite-communities-bubbles-and-people-to-follow-comment-your-favs-104e</link>
      <guid>https://dev.to/apotropos1502/what-are-your-favourite-communities-bubbles-and-people-to-follow-comment-your-favs-104e</guid>
      <description>&lt;p&gt;I love the concept of Dev.to and I'd like to find more amazing communites on dev.to or niche bubbles. Less about products, more about engaged and active communities and learning.&lt;/p&gt;

&lt;p&gt;From small but passionate creators to engaged Companies, what are your personal favourites?&lt;/p&gt;

&lt;p&gt;Cross-share your own profiles, discord links, company pages or share others :)&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>help</category>
    </item>
    <item>
      <title>Why a data-centric Kafka UI is essential for your Kafka project’s success</title>
      <dc:creator>Alexander Hertwig</dc:creator>
      <pubDate>Mon, 01 Aug 2022 14:29:00 +0000</pubDate>
      <link>https://dev.to/kadeck/why-a-data-centric-kafka-ui-is-essential-for-your-kafka-projects-success-3kc3</link>
      <guid>https://dev.to/kadeck/why-a-data-centric-kafka-ui-is-essential-for-your-kafka-projects-success-3kc3</guid>
      <description>&lt;p&gt;When integrating Apache Kafka into your company’s infrastructure and starting your first Apache Kafka project, you will stumble upon certain challenges that will slow you down but can be easily avoided.&lt;/p&gt;

&lt;p&gt;This article covers the main reasons why Apache Kafka application and data integration projects fail, and how you can prevent this from happening by using a data-centric monitoring solution and Kafka UI from the very beginning.&lt;/p&gt;

&lt;p&gt;Before getting into what a data-centric Kafka monitoring solution looks like, let me first describe the tasks and issues I have come across in projects with Apache Kafka. If you have already implemented a solution with Apache Kafka, many of them might be very familiar to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical Apache Kafka project
&lt;/h2&gt;

&lt;p&gt;These are the common tasks that you will find in any Apache Kafka project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data modelling (understanding data models and business requirements)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Expose new data objects for other services (ingest data &amp;amp; data validation)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consume data objects from existing applications or from applications developed in parallel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enrich data (consume existing data objects and create new ones)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sink: &lt;a href="https://www.xeotek.com/data-loading-in-financial-services-with-kadeck/"&gt;integrate data from Apache Kafka into standard software&lt;/a&gt; (e.g. SAP, …)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
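To make the "ingest data &amp;amp; data validation" task from the list above concrete, here is a minimal sketch of validating an incoming event before it is produced to a topic. The schema, field names, and the example order event are invented purely for illustration; real projects would typically use a schema registry or a validation library instead.

```python
# Hypothetical schema for an incoming "order" event: field name -> expected type.
REQUIRED_FIELDS = {"order_id": str, "customer_id": str, "amount": float}

def validate_order(event: dict) -> list:
    """Return a list of validation errors; an empty list means the event is valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"{field} must be {expected.__name__}")
    # Only check business rules once the structural checks have passed.
    if not errors and event["amount"] <= 0:
        errors.append("amount must be positive")
    return errors

# A valid event passes with no errors; a broken one reports what is wrong.
print(validate_order({"order_id": "A1", "customer_id": "C9", "amount": 19.99}))  # []
print(validate_order({"order_id": "A2", "customer_id": "C9"}))  # ['missing field: amount']
```

Running this kind of check at the ingest boundary is what keeps malformed data objects out of the topics that downstream consumers and sinks depend on.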

&lt;h2&gt;
  
  
  The pitfalls
&lt;/h2&gt;

&lt;p&gt;The common challenges can essentially be broken down into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transparency&lt;/li&gt;
&lt;li&gt;communication and collaboration&lt;/li&gt;
&lt;li&gt;control&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Transparency
&lt;/h2&gt;

&lt;p&gt;When consuming data from other services, developers have a hard time understanding what the actual data objects, and especially their different characteristics, look like, as documentation is often outdated or not available at all. This leads to a tedious trial-and-error process, which slows developers down and results in less time being spent on the actual work. The problem is even more relevant when several applications are developed at the same time, as changes to data models occur more often during the process, which leads us to the next problem: communication and collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Communication and collaboration
&lt;/h2&gt;

&lt;p&gt;Why do data models change so often during development? The main reason is very simple: lack of communication and collaboration. When designing business objects, feedback loops with business departments are necessary, but they also require a lot of time and effort from both sides. Because of the missing transparency, business departments and business analysts alike are not able to quickly analyse and validate the data objects being produced by a new application on their own. As a former software architect, I can tell you that bringing together the business view and the technical view is something you spend most of your time on. Giving your stakeholders quick access to, and control over, the relevant data leads to better understanding and communication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Control
&lt;/h2&gt;

&lt;p&gt;Thus, developers, your testing team, business departments, and operations all need to be in control of the relevant data. If it takes your contact person more than half an hour to give you a copy of the relevant data, you will not ask for it very often. This leads to more guessing instead of informed decision-making during application development and day-2 operations.&lt;/p&gt;

&lt;p&gt;As Apache Kafka is all about data, these problems multiply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Take care of these problems before they evolve and possibly risk your project’s success.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following section is about the approach we took at Xeotek and introduces you to our data-centric Kafka monitoring solution “KaDeck”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Available monitoring solutions
&lt;/h2&gt;

&lt;p&gt;At Xeotek we always recommend having a monitoring solution at the infrastructure level for your infrastructure operations. This is crucial for discovering service outages at a technical level and for getting a better understanding and overview of your Docker containers. &lt;a href="https://www.gartner.com/reviews/market/it-infrastructure-monitoring-tools"&gt;Gartner has published a comprehensive list of these tools with reviews.&lt;br&gt;
&lt;/a&gt;&lt;br&gt;
Having said this, a monitoring solution at the technical level is not sufficient. The challenges we have identified above require a much more data-centric approach.&lt;/p&gt;

&lt;p&gt;Currently, only a few tools are available for monitoring Apache Kafka data. And the few tools that do exist for directly viewing Apache Kafka data, so-called Kafka UIs or topic browsers, do not satisfactorily address the challenges identified above: even for developers, they make it anything but trivial to gain visibility into the data objects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Approach
&lt;/h2&gt;

&lt;p&gt;At Xeotek we wanted to solve these challenges comprehensively: our goal was to develop a solution that enables all stakeholders to get access to relevant data quickly and easily, resulting in a better understanding of the application and data landscape. By making it easy to analyse data, we wanted to create transparency that leads to informed decision-making and to collaboration spanning the full application life cycle.&lt;/p&gt;

&lt;p&gt;We wanted something that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cRgZH6h8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4qhkaq0b5zc48mn9ewke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cRgZH6h8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4qhkaq0b5zc48mn9ewke.png" alt="The data browser of KaDeck - Our data-centric Kafka monitoring solution, Kafka UI and topic browser" width="700" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is important to be able to quickly view relevant data objects, filter by various characteristics and types, and share these reports with others. At the same time, it is important to stay flexible enough to dive into the underlying technical details. We think we have managed to combine these two worlds and can provide you with one central data monitoring solution for your Apache Kafka system.&lt;/p&gt;

&lt;p&gt;We are very enthusiastic about our product and invite you to learn more at &lt;a href="https://www.xeotek.com"&gt;www.kadeck.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watch our KaDeck 3.0 release video to get an overview of the functionalities: &lt;a href="https://youtu.be/1TCojOdFhyQ"&gt;https://youtu.be/1TCojOdFhyQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you require more information about the challenges of application development and data integration with Apache Kafka, let me know in the comment section below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ben&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>architecture</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
