<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shawn Seymour</title>
    <description>The latest articles on DEV Community by Shawn Seymour (@devshawn).</description>
    <link>https://dev.to/devshawn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F140372%2F4b9d86c8-069f-4c07-8913-4fa40ce7b883.jpeg</url>
      <title>DEV Community: Shawn Seymour</title>
      <link>https://dev.to/devshawn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devshawn"/>
    <language>en</language>
    <item>
      <title>Automating Kafka Topic &amp; ACL Management</title>
      <dc:creator>Shawn Seymour</dc:creator>
      <pubDate>Wed, 27 May 2020 12:56:38 +0000</pubDate>
      <link>https://dev.to/devshawn/automating-kafka-topic-acl-management-1a4o</link>
      <guid>https://dev.to/devshawn/automating-kafka-topic-acl-management-1a4o</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on my &lt;a href="https://devshawn.com/blog/automating-kafka-topic-and-acl-mangement/"&gt;personal blog&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Apache Kafka, a distributed streaming platform, has become a key component of many organizations' infrastructure and data platforms. As adoption of Kafka grows across an organization, it is important to manage the creation of topics and access control lists (ACLs) in a centralized and standardized manner. Without proper procedures around topic and ACL management, Kafka clusters can quickly become hard to manage from a data governance and security standpoint.&lt;/p&gt;

&lt;p&gt;Today, I'll be discussing how to automate Kafka topic and ACL management and how it can be done with a continuous integration/continuous delivery (CI/CD) pipeline. I'll explain how to do this while following GitOps patterns: all topics and ACLs will be stored in version control. This is a model followed by many companies, both small and large, and can be applied to any Kafka cluster.&lt;/p&gt;

&lt;p&gt;Although I'll be discussing these practices in terms of organizations, they can be applied to local development clusters and smaller Kafka deployments as well.&lt;/p&gt;

&lt;h2&gt;Background&lt;/h2&gt;

&lt;p&gt;As most developers who have used Kafka know, it is quite easy to create topics. They can be created with a single invocation of the &lt;code&gt;kafka-topics&lt;/code&gt; tool or through various user interfaces. Before jumping into our tutorial, let's dive into some background.&lt;/p&gt;
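&lt;p&gt;For example, creating a topic with &lt;code&gt;kafka-topics&lt;/code&gt; is a single command (the bootstrap server address and topic settings here are placeholders for your own cluster):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;kafka-topics --bootstrap-server localhost:9092 --create \
  --topic test-topic --partitions 6 --replication-factor 3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;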

&lt;h3&gt;Automatic Topic Creation&lt;/h3&gt;

&lt;p&gt;Outside of the tools mentioned above, it is even easier to create topics – they are created automatically because the broker configuration &lt;code&gt;auto.create.topics.enable&lt;/code&gt; is set to &lt;code&gt;true&lt;/code&gt; by default. Although this configuration makes it easy to create topics, it is widely considered a bad practice. On some platforms, such as Confluent Cloud, it is even &lt;a href="https://riferrei.com/2020/03/17/why-the-property-auto-create-topics-enable-is-disabled-in-confluent-cloud/"&gt;impossible to enable auto topic creation&lt;/a&gt;.&lt;/p&gt;
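&lt;p&gt;To avoid this, the setting can be disabled explicitly in the broker configuration. A minimal &lt;code&gt;server.properties&lt;/code&gt; fragment (the rest of the broker configuration is omitted here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Disable automatic topic creation on the broker
auto.create.topics.enable=false
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;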

&lt;p&gt;Allowing the automatic creation of topics can be problematic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security and access control become &lt;strong&gt;a lot&lt;/strong&gt; harder to manage.&lt;/li&gt;
&lt;li&gt;Test topics and unused topics end up in the cluster and likely do not get cleaned up.&lt;/li&gt;
&lt;li&gt;Any developer or any service can create topics without giving thought to proper partitioning and potential overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Outside of a development cluster, &lt;strong&gt;every topic should have a purpose that is understood and has an underlying business need to justify its existence&lt;/strong&gt;. Additionally, allowing automatic topic creation does not solve the need for creating and managing ACLs.&lt;/p&gt;

&lt;h3&gt;Manual Topic &amp;amp; ACL Creation&lt;/h3&gt;

&lt;p&gt;The next logical step most organizations take is to create topics manually through tools such as &lt;code&gt;kafka-topics&lt;/code&gt; or Confluent Control Center. This usually happens when Kafka is fairly new to an organization or used by a small group of people, e.g. a team or two.&lt;/p&gt;

&lt;p&gt;Manually creating topics and ACLs only works until the usage of Kafka within an organization starts to grow. There are typically two patterns followed with manual topic creation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anyone has access&lt;/strong&gt;: All developers/operations team members who can access the cluster can create topics as well as ACLs. This leads to topic naming standards and security best practices being thrown out the window. If anyone can create ACLs, there is no real security on the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operations has access&lt;/strong&gt;: A centralized operations team manages topics &amp;amp; ACLs manually through a change management/request process. Although this allows for some governance to be enforced, it leaves an operations team doing manual work. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A major issue with manual topic &amp;amp; ACL creation is that it is &lt;strong&gt;not&lt;/strong&gt; repeatable. It may be enticing to use a web interface to quickly create topics, but more often than not it becomes a pain point in the future.&lt;/p&gt;

&lt;p&gt;Imagine a scenario where you want to migrate to a new cluster or spin up a new environment; how easy is it to re-create all of the topics, topic configurations, and ACLs if they are not defined and easily accessible? It's pretty hard.&lt;/p&gt;

&lt;h3&gt;Automated Topic &amp;amp; ACL Creation&lt;/h3&gt;

&lt;p&gt;After manual topic &amp;amp; ACL creation becomes a limiting factor, teams usually seek to build tooling and automation around it. Most organizations in today's world are automating as much as they can. We see automation around immutable infrastructure, deploying applications, managing business processes, and much more.&lt;/p&gt;

&lt;p&gt;The first step in automating the creation of Kafka resources is usually a simple Python or Bash script. Teams might define their topics and ACLs in JSON or YAML files. These scripts are then either run by the teams themselves or included in a continuous integration process.&lt;/p&gt;
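&lt;p&gt;As a minimal sketch of that approach – the file name and topic settings here are illustrative assumptions – a Bash script might loop over a newline-delimited list of topics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Create each topic listed in topics.txt, skipping any that already exist.
while read -r topic; do
  kafka-topics --bootstrap-server localhost:9092 --create --if-not-exists \
    --topic "$topic" --partitions 6 --replication-factor 3
done &lt; topics.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;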

&lt;p&gt;Unfortunately, these scripts are usually quick-and-dirty. They often cannot easily change topic configurations, delete unneeded topics, or provide insight into what your actual cluster has defined in terms of topics and ACLs. Lastly, ACLs can be quite verbose: it can be hard to understand the needed ACLs depending on the complexity of the application and its security needs (e.g. Kafka Connect is much more complicated than a simple consumer).&lt;/p&gt;

&lt;h2&gt;GitOps for Apache Kafka Management&lt;/h2&gt;

&lt;p&gt;GitOps, as commonly found in &lt;a href="https://thenewstack.io/what-is-gitops-and-why-it-might-be-the-next-big-thing-for-devops/"&gt;Kubernetes deployment models&lt;/a&gt;, is a pattern centered around using a version control system (such as Git) to house information and code describing a system. This information is then used in an automated fashion to make changes to infrastructure (such as deploying a new Kubernetes workload).&lt;/p&gt;

&lt;p&gt;This pattern is essentially how most implementations of &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; work: infrastructure is defined in Terraform configuration files, a plan with the desired changes is generated, and then the plan is executed to apply those changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This blog post describes how to manage topics &amp;amp; ACLs with GitOps, and not an actual Apache Kafka cluster deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Kafka GitOps&lt;/h3&gt;

&lt;p&gt;In this tutorial, I'll be introducing a tool called &lt;a href="https://github.com/devshawn/kafka-gitops"&gt;kafka-gitops&lt;/a&gt;. This project is a resources-as-code tool which allows users to automate the management of Apache Kafka topics and ACLs. Before we dive in, I'd like to introduce some terminology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Desired State&lt;/strong&gt;: A file describing what your Kafka cluster state should look like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual State&lt;/strong&gt;: The live state of what your Kafka cluster currently looks like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Plan&lt;/strong&gt;: A set of topic and/or ACL changes to apply to your Kafka cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Topics and services are defined in a YAML desired state file. When run, &lt;code&gt;kafka-gitops&lt;/code&gt; compares your desired state to the actual state of the cluster and generates a plan to execute against the cluster. The plan will include any creates, updates, or deletes to topics, topic configurations, and ACLs. After validating the plan looks correct, it can be applied and will make your topics and ACLs match your desired state.&lt;/p&gt;

&lt;p&gt;On top of topic management, if your cluster has security enabled, &lt;code&gt;kafka-gitops&lt;/code&gt; can generate the needed ACLs for most applications. There is no need to manually define a long list of ACLs for Kafka Connect or Kafka Streams. By defining your services, &lt;code&gt;kafka-gitops&lt;/code&gt; will build the applicable ACLs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V2mQ_VYr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://devshawn.com/content/images/2020/05/68747470733a2f2f692e696d6775722e636f6d2f6a6e44775970382e706e67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V2mQ_VYr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://devshawn.com/content/images/2020/05/68747470733a2f2f692e696d6775722e636f6d2f6a6e44775970382e706e67.png" alt="Automating Kafka Topic &amp;amp; ACL Management"&gt;&lt;/a&gt;Example kafka-gitops workflow&lt;/p&gt;

&lt;p&gt;The major features of &lt;code&gt;kafka-gitops&lt;/code&gt; compared to other management tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚀 &lt;strong&gt;Built For CI/CD&lt;/strong&gt;: Made for CI/CD pipelines to automate the management of topics &amp;amp; ACLs.&lt;/li&gt;
&lt;li&gt;🔥 &lt;strong&gt;Configuration as code&lt;/strong&gt;: Describe your desired state and manage it from a version-controlled declarative file.&lt;/li&gt;
&lt;li&gt;👍 &lt;strong&gt;Easy to use&lt;/strong&gt;: Deep knowledge of Kafka administration or ACL management is &lt;strong&gt;NOT&lt;/strong&gt; required.&lt;/li&gt;
&lt;li&gt;⚡️️ &lt;strong&gt;Plan &amp;amp; Apply&lt;/strong&gt;: Generate and view a plan with or without executing it against your cluster.&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;Portable&lt;/strong&gt;: Works across self-hosted clusters, managed clusters, and even Confluent Cloud clusters.&lt;/li&gt;
&lt;li&gt;🦄 &lt;strong&gt;Idempotency&lt;/strong&gt;: Executing the same desired state file on an up-to-date cluster will yield the same result.&lt;/li&gt;
&lt;li&gt;☀️ &lt;strong&gt;Continue from failures&lt;/strong&gt;: If a specific step fails during an apply, you can fix your desired state and re-run the command. You can execute &lt;code&gt;kafka-gitops&lt;/code&gt; again without needing to roll back any partial successes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Automating Topics &amp;amp; ACLs via Kafka GitOps&lt;/h2&gt;

&lt;p&gt;I'll provide an overview of how &lt;code&gt;kafka-gitops&lt;/code&gt; works and how it can be applied to any Kafka cluster. An in-depth tutorial on how to use it will be posted in the next blog post; otherwise, the &lt;a href="https://devshawn.github.io/kafka-gitops/#/"&gt;documentation&lt;/a&gt; has a great &lt;a href="https://devshawn.github.io/kafka-gitops/#/quick-start"&gt;getting started guide&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reminder&lt;/strong&gt;: This tool works on all newer Kafka clusters, including self-hosted Kafka, managed Kafka solutions, and Confluent Cloud.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Desired State File&lt;/h3&gt;

&lt;p&gt;Topics and services that interact with your Kafka cluster are defined in a YAML file, named &lt;code&gt;state.yaml&lt;/code&gt; by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example desired state file:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;topics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test-topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
    &lt;span class="na"&gt;replication&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s"&gt;cleanup.policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compact&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test-service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application&lt;/span&gt;
    &lt;span class="na"&gt;principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;User:testservice&lt;/span&gt;
    &lt;span class="na"&gt;produces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;test-topic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This state file defines two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A compacted topic named &lt;strong&gt;test-topic&lt;/strong&gt; with &lt;strong&gt;six&lt;/strong&gt; partitions and a replication factor of &lt;strong&gt;three&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;application&lt;/strong&gt; service named &lt;strong&gt;test-service&lt;/strong&gt; tied to the principal &lt;code&gt;User:testservice&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;type&lt;/strong&gt; of the service tells &lt;code&gt;kafka-gitops&lt;/code&gt; what type of ACLs to generate. In the case of &lt;strong&gt;application&lt;/strong&gt;, it will generate the needed ACLs for producing to and/or consuming from its specified topics. In this case, &lt;code&gt;kafka-gitops&lt;/code&gt; will generate a &lt;code&gt;WRITE&lt;/code&gt; ACL for the topic &lt;strong&gt;test-topic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Currently, we support three types of services: &lt;code&gt;application&lt;/code&gt;, &lt;code&gt;kafka-connect&lt;/code&gt;, and &lt;code&gt;kafka-streams&lt;/code&gt;. Each service has a slightly different schema due to the nature of the service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Kafka Streams service&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;my-stream&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-streams&lt;/span&gt;
    &lt;span class="na"&gt;principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;User:mystream&lt;/span&gt;
    &lt;span class="na"&gt;consumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;test-topic&lt;/span&gt;
    &lt;span class="na"&gt;produces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;test-topic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Kafka Streams services have special ACLs included for managing internal streams topics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Kafka Connect service&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;my-connect-cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
    &lt;span class="na"&gt;principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;User:myconnect&lt;/span&gt;
    &lt;span class="na"&gt;connectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;rabbitmq-sink&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;consumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;test-topic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Kafka Connect services have special ACLs for working with their internal topics as well as defined ACLs for each running connector.&lt;/p&gt;

&lt;p&gt;Essentially, all topics and all services for a specific cluster get put into this YAML file. If you are not using security, such as on a local development cluster, you can omit the services block.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; : For full examples and &lt;strong&gt;specific requirements&lt;/strong&gt; for each service, read the &lt;a href="https://devshawn.github.io/kafka-gitops/#/services"&gt;services documentation page&lt;/a&gt;.  The specification for the desired state file and its schema can be found on the &lt;a href="https://devshawn.github.io/kafka-gitops/#/specification"&gt;specification documentation page&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Plan Changes To A Kafka Cluster&lt;/h3&gt;

&lt;p&gt;Once your desired state file is created, you can generate a plan of changes to be applied against the cluster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; : &lt;code&gt;kafka-gitops&lt;/code&gt; is configured to connect to clusters via environment variables. See the &lt;a href="https://devshawn.github.io/kafka-gitops/#/quick-start?id=configuration"&gt;documentation for more details&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
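&lt;p&gt;As a sketch, connecting to a local, unsecured cluster might look like the following. Per the documentation, environment variables prefixed with &lt;code&gt;KAFKA_&lt;/code&gt; are translated into Kafka client properties; check the configuration docs for the variables your cluster's security setup requires.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# KAFKA_BOOTSTRAP_SERVERS becomes the bootstrap.servers client property
export KAFKA_BOOTSTRAP_SERVERS=localhost:9092
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;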

&lt;p&gt;Generating a plan does &lt;strong&gt;NOT&lt;/strong&gt; actually change the cluster. We can generate a plan by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;kafka-gitops &lt;span class="nt"&gt;-f&lt;/span&gt; state.yaml plan &lt;span class="nt"&gt;-o&lt;/span&gt; plan.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This will output a JSON file with the plan as well as a prettified description of the changes. This is an example plan for the first &lt;code&gt;state.yaml&lt;/code&gt; file above when it includes only the topics block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generating execution plan...

An execution plan has been generated and is shown below.

Resource actions are indicated with the following symbols:
  + create
  ~ update
  - delete

The following actions will be performed:

Topics: 1 to create, 0 to update, 0 to delete.
+ [TOPIC] test-topic

ACLs: 0 to create, 0 to update, 0 to delete.

Plan: 1 to create, 0 to update, 0 to delete.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If there are topics or ACLs on the cluster that are not in the desired state file, the plan will include changes to update and/or delete them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It is possible to disable deletion by passing the &lt;code&gt;--no-delete&lt;/code&gt; flag after &lt;code&gt;-f state.yaml&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
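&lt;p&gt;For example, to generate a plan that will never delete existing topics or ACLs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;kafka-gitops -f state.yaml --no-delete plan -o plan.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;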

&lt;h3&gt;Apply Changes To A Kafka Cluster&lt;/h3&gt;

&lt;p&gt;Once the plan is created, we can apply the changes to the cluster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;: This &lt;strong&gt;WILL&lt;/strong&gt; change the cluster to match the plan generated from the desired state file. Without the &lt;code&gt;--no-delete&lt;/code&gt; flag, this can be &lt;strong&gt;destructive&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Changes are applied using the apply command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;kafka-gitops &lt;span class="nt"&gt;-f&lt;/span&gt; state.yaml apply &lt;span class="nt"&gt;-p&lt;/span&gt; plan.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This will execute the changes to the running Kafka cluster and output the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Executing apply...

Applying: [CREATE]

+ [TOPIC] test-topic

Successfully applied.

[SUCCESS] Apply complete! Resources: 1 created, 0 updated, 0 deleted.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If there is a partial failure, successes will not be rolled back. Instead, fix the error in the desired state file or manually within the cluster and rerun &lt;strong&gt;plan&lt;/strong&gt; and &lt;strong&gt;apply&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After a successful apply, you can re-run the plan command to generate a new plan – except this time, there should be no changes, since your cluster is up to date with your desired state file!&lt;/p&gt;

&lt;h3&gt;Additional Features&lt;/h3&gt;

&lt;p&gt;On top of the brief description of the features above, &lt;code&gt;kafka-gitops&lt;/code&gt; supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically creating Confluent Cloud service accounts.&lt;/li&gt;
&lt;li&gt;Splitting the &lt;code&gt;topics&lt;/code&gt; and &lt;code&gt;services&lt;/code&gt; blocks into their own files.&lt;/li&gt;
&lt;li&gt;Ignoring specific topics from being deleted when not defined in the desired state file.&lt;/li&gt;
&lt;li&gt;Defining custom ACLs to a specific service (e.g. for a service such as Confluent Control Center).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Kafka Topic &amp;amp; ACL Automation Workflow&lt;/h2&gt;

&lt;p&gt;Now that we've had an overview of how &lt;a href="https://github.com/devshawn/kafka-gitops"&gt;kafka-gitops&lt;/a&gt; works, we can examine how to put this workflow into action within an organization. First, we can define typical roles within an organization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt;: Engineers who are writing applications and services utilizing Kafka.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt;: Engineers who manage, monitor, and maintain Kafka infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Engineers who are responsible for security operations within an organization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, we can define an example setup and process for a GitOps workflow. This is not a one-size-fits-all answer – a lot depends on the organization and culture; however, this is a generalized approach that will work well if implemented correctly.&lt;/p&gt;

&lt;h3&gt;Automation Workflow Overview&lt;/h3&gt;

&lt;p&gt;A scalable implementation of the kafka-gitops workflow within an organization looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All desired state files are stored within a repository owned by &lt;strong&gt;Operations&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt; owns the &lt;code&gt;master&lt;/code&gt; branch, which &lt;em&gt;should&lt;/em&gt; reflect the live state of every cluster. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; fork this repository to make changes to their topics &amp;amp; services. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; create a pull request with their changes and mark it ready to review by &lt;strong&gt;Operations&lt;/strong&gt; and &lt;strong&gt;Security&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt; and &lt;strong&gt;Security&lt;/strong&gt; review the changes and merge to &lt;code&gt;master&lt;/code&gt;. &lt;/li&gt;
&lt;li&gt;A CI/CD system kicks off a &lt;code&gt;kafka-gitops plan&lt;/code&gt; build to generate a new plan.&lt;/li&gt;
&lt;li&gt;(&lt;em&gt;Optional&lt;/em&gt;) The plan output is reviewed by &lt;strong&gt;Operations&lt;/strong&gt;, ensuring it looks correct.&lt;/li&gt;
&lt;li&gt;The plan is then applied, either manually by &lt;strong&gt;Operations&lt;/strong&gt; or automatically, through &lt;code&gt;kafka-gitops apply&lt;/code&gt;. The desired changes will then be reflected in the live cluster and the cluster will match the desired state file in &lt;code&gt;master&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As described above, all topics and services (which includes ACLs) are defined in version-controlled code. &lt;strong&gt;Developers&lt;/strong&gt; are responsible for their topic and service definitions. &lt;strong&gt;Operations&lt;/strong&gt; is responsible for managing the changes to the cluster (e.g. ensuring teams are not doing crazy things) as well as responsible for deploying the changes. &lt;strong&gt;Security&lt;/strong&gt; is responsible for ensuring sensitive data is being properly locked down to the services that require it.&lt;/p&gt;

&lt;h4&gt;Setting Up The Workflow&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Create a centralized git repository for storing Kafka cluster desired state files.&lt;/li&gt;
&lt;li&gt;In that repository, create folders for each environment and/or cluster.&lt;/li&gt;
&lt;li&gt;In each cluster's folder, create its state file. Define any existing topics, services, and ACLs. &lt;/li&gt;
&lt;/ol&gt;
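&lt;p&gt;An illustrative repository layout (the folder and environment names are arbitrary examples) might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka-clusters/
├── development/
│   └── state.yaml
├── staging/
│   └── state.yaml
└── production/
    └── state.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;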

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: If adding this workflow to an existing Kafka cluster, the easiest way to bootstrap it is to repeatedly run &lt;code&gt;plan&lt;/code&gt; against the live cluster while updating the desired state file, continuing until the plan shows no changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;Setting Up CI/CD&lt;/h4&gt;

&lt;p&gt;Setting up CI/CD is highly dependent on which build system you are using. This is a general outline of how it could be configured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up a &lt;strong&gt;main&lt;/strong&gt; CI job that is triggered on changes to the &lt;code&gt;master&lt;/code&gt; branch.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;main&lt;/strong&gt; job should look for changes in each desired state file.&lt;/li&gt;
&lt;li&gt;For each desired state file with a change, trigger a &lt;strong&gt;side&lt;/strong&gt; job.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;side&lt;/strong&gt; job(s) should utilize &lt;code&gt;kafka-gitops plan&lt;/code&gt; to generate an execution plan.&lt;/li&gt;
&lt;li&gt;(&lt;em&gt;Optional&lt;/em&gt;) The &lt;strong&gt;side&lt;/strong&gt; job(s) should then &lt;strong&gt;wait&lt;/strong&gt; until &lt;strong&gt;Operations&lt;/strong&gt; can review the generated plan.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;side&lt;/strong&gt; job(s) should then utilize &lt;code&gt;kafka-gitops apply&lt;/code&gt; to execute the planned changes to the specified Kafka cluster. &lt;/li&gt;
&lt;/ol&gt;
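&lt;p&gt;As a rough sketch only – the job structure, runner image, secret name, state file path, and the assumption that &lt;code&gt;kafka-gitops&lt;/code&gt; is already installed on the runner are all illustrative, and the details will vary by build system – a single-cluster pipeline in a GitHub Actions-style YAML could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative pipeline: plan and apply on changes to master
name: kafka-gitops
on:
  push:
    branches: [master]
jobs:
  plan-and-apply:
    runs-on: ubuntu-latest
    env:
      KAFKA_BOOTSTRAP_SERVERS: ${{ secrets.KAFKA_BOOTSTRAP_SERVERS }}
    steps:
      - uses: actions/checkout@v4
      - name: Generate plan
        run: kafka-gitops -f production/state.yaml plan -o plan.json
      - name: Apply plan
        run: kafka-gitops -f production/state.yaml apply -p plan.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;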

&lt;h3&gt;Benefits Of GitOps for Apache Kafka&lt;/h3&gt;

&lt;p&gt;Once the full process is in place, you gain many benefits that make it easy to govern your clusters as Kafka adoption grows within the organization.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; have a well-defined process to follow to create topics &amp;amp; services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt; has control over what is changing within the cluster and can ensure standards are followed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; can easily audit and monitor access changes to data within the streaming platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, &lt;code&gt;kafka-gitops&lt;/code&gt; provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A defined process to make any changes to the Kafka cluster; no manual steps.&lt;/li&gt;
&lt;li&gt;A full audit log and history of changes to your cluster via version control.&lt;/li&gt;
&lt;li&gt;Automatic ACL generation for common services, reducing time spent on security.&lt;/li&gt;
&lt;li&gt;The ability to re-create a cluster's complete topic and ACL setup (e.g. for a new environment).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Limitations and Upcoming Features&lt;/h3&gt;

&lt;p&gt;Although &lt;code&gt;kafka-gitops&lt;/code&gt; is actively being used in production, there are a few upcoming features to address some limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ability to set a custom &lt;code&gt;group.id&lt;/code&gt; for consumers &amp;amp; streams applications (currently, this must match the service name)&lt;/li&gt;
&lt;li&gt;The ability to set custom connect topic names (currently, this has a predefined pattern)&lt;/li&gt;
&lt;li&gt;Tooling around creating the initial desired state file from existing clusters&lt;/li&gt;
&lt;li&gt;Eventually, the optional ability to run it as a service that actively monitors for changes and sources state from locations such as Git, AWS S3, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Automating the management of Kafka topics and ACLs brings significant benefits to all teams working with Apache Kafka. Whether working with a large enterprise set of clusters or defining topics for your local development cluster, the GitOps pattern allows for easy, repeatable cluster resource definitions.&lt;/p&gt;

&lt;p&gt;By adopting a GitOps pattern for managing your Kafka topics and ACLs, your organization can &lt;strong&gt;reduce time spent managing Kafka&lt;/strong&gt; and &lt;strong&gt;spend more time providing value to your core business&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In some upcoming blog posts, I will be providing in-depth tutorials on using &lt;code&gt;kafka-gitops&lt;/code&gt; with self-hosted clusters and with Confluent Cloud.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>gitops</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Apache Kafka: Topic Naming Conventions</title>
      <dc:creator>Shawn Seymour</dc:creator>
      <pubDate>Sun, 29 Mar 2020 20:39:02 +0000</pubDate>
      <link>https://dev.to/devshawn/apache-kafka-topic-naming-conventions-3do6</link>
      <guid>https://dev.to/devshawn/apache-kafka-topic-naming-conventions-3do6</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on my &lt;a href="https://devshawn.com/blog/apache-kafka-topic-naming-conventions/"&gt;personal blog&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Apache Kafka is an amazing system for building a scalable data streaming platform within an organization. It’s used in production everywhere from small startups to Fortune 500 companies. As the adoption of a core platform grows within an enterprise, it’s important to think about maintaining consistency and enforcing standards.&lt;/p&gt;

&lt;p&gt;Today, I’ll be discussing how one can do that for Apache Kafka and its core data structure: the topic. As most engineers who have used Kafka know, a topic is a category or feed to which messages are stored and published. Topics are similar to queues in message bus systems such as RabbitMQ or ActiveMQ.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topic Naming: The Wild West
&lt;/h2&gt;

&lt;p&gt;Imagine a company building a simple order management system using Kafka as its backbone. They might create a couple of microservices that rely on a few core topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;orders&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;customers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;payments&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the company grows, and as more teams are onboarded to the platform, more topics will be needed. The company may add data pipelines for inventory, fraud detection, and more. They might now have additional topics like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;inventory&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pageviews&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fulfillment&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How do you manage to keep topic names consistent? How do people from other teams know exactly what the topic contains? It is easy when there are only a few topics and a small number of people using the platform. Once you are in a large organization with many teams creating and using topics, it becomes much harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Topic Naming Proposals
&lt;/h2&gt;

&lt;p&gt;A quick search leads to some great &lt;a href="https://riccomini.name/how-paint-bike-shed-kafka-topic-naming-conventions"&gt;blog posts&lt;/a&gt;, &lt;a href="https://stackoverflow.com/questions/43726571/what-is-the-best-practice-for-naming-kafka-topics"&gt;StackOverflow answers&lt;/a&gt;, and &lt;a href="http://grokbase.com/t/kafka/users/152r20xg4r/stream-naming-conventions"&gt;mailing list posts&lt;/a&gt; discussing how to name topics. There is also a vast number of opinions on the best way to do this. Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;project&amp;gt;.&amp;lt;product&amp;gt;.&amp;lt;event-name&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;app-name&amp;gt;.&amp;lt;data-type&amp;gt;.&amp;lt;event-name&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;team-name&amp;gt;.&amp;lt;app-name&amp;gt;.&amp;lt;event-type&amp;gt;.&amp;lt;event-name&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A decent topic naming strategy, proposed by Chris Riccomini in his popular blog post, &lt;a href="https://riccomini.name/how-paint-bike-shed-kafka-topic-naming-conventions"&gt;How to paint a bike shed: Kafka topic naming conventions&lt;/a&gt;, is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;message type&amp;gt;.&amp;lt;dataset name&amp;gt;.&amp;lt;data name&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first glance, none of these look particularly bad – some even look great. Before we go in-depth on how to best name a Kafka topic, let’s discuss what makes a topic name good.&lt;/p&gt;

&lt;h3&gt;
  
  
  Naming Kafka Topics: Structure
&lt;/h3&gt;

&lt;p&gt;When it comes to naming a Kafka topic, two parts are important: the structure of the name and the semantics of the name.&lt;/p&gt;

&lt;p&gt;The structure of a name defines what characters are allowed and the format to use. In its topic names, Kafka allows alphanumeric characters, periods (&lt;strong&gt;.&lt;/strong&gt;), underscores (&lt;strong&gt;_&lt;/strong&gt;), and hyphens (&lt;strong&gt;-&lt;/strong&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Although ‘_’ and ‘.’ are allowed to be used together, &lt;a href="https://github.com/apache/kafka/blob/2.3.0/core/src/main/scala/kafka/admin/TopicCommand.scala#L147-L148"&gt;they can collide&lt;/a&gt; due to limitations in metric names. It is best to pick one or the other, but not both.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We cannot change what Kafka allows, but we can further define how dashes are used or enforce that all topics be lowercase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Naming Kafka Topics: Semantics
&lt;/h3&gt;

&lt;p&gt;The semantics of a name define what fields should go in that name and in what order they should be placed. There are a few rules that should be applied to naming topics when it comes to semantics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Do not use fields that can change
&lt;/h4&gt;

&lt;p&gt;Fields that can change should not be used in topic names. Fields such as team name, product name, and service owner should never be used in topic names.&lt;/p&gt;

&lt;p&gt;As most engineers know, over time, these things change as organizations evolve. It’s not an easy task to change a topic name once it is in use all over an enterprise, so it is best to leave those fields out from the beginning.&lt;/p&gt;

&lt;h4&gt;
  
  
  Do not tie topic names to services, consumers, or producers
&lt;/h4&gt;

&lt;p&gt;Topic names should not be tied to service names unless they are completely internal to a single service and are not meant to be produced to or consumed from any other service.&lt;/p&gt;

&lt;p&gt;Most topics eventually end up with more than one consumer and its producer could change in the future. It’s best to name topics after the data they hold rather than what is creating or reading the data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Leave metadata out of the name if it can be found elsewhere
&lt;/h4&gt;

&lt;p&gt;Metadata that can be found elsewhere, such as in the data payload or in a schema registry, should be left out of the topic name. This includes things such as partition count, security information, schema information, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topic Naming Recommendations
&lt;/h2&gt;

&lt;p&gt;Okay, so you may be thinking: there are a ton of fields to pick from, and I’m not sure which semantics to enforce. Let’s get into the details of what a great topic naming convention should look like. My recommended rules to follow are:&lt;/p&gt;

&lt;h3&gt;
  
  
  Naming Format
&lt;/h3&gt;

&lt;p&gt;Topic names should be completely lowercase and adhere to the regular expression &lt;code&gt;^[a-z0-9.-]+$&lt;/code&gt;. All topics should follow &lt;code&gt;kebab-case&lt;/code&gt;, such as &lt;code&gt;my-awesome-topic-name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Readability and ease-of-understanding play a huge role in proper topic naming. Lowercase topic names are easy to read and kebab-case flows nicely; we avoid the use of underscores due to metric naming collisions with periods. Additionally, periods make for a great separator between sections in a topic name, which is described below.&lt;/p&gt;
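As an aside, the format rule above is easy to check mechanically. Here is a minimal sketch in Python (an illustrative snippet, not part of any Kafka tooling), assuming names are lowercase kebab-case sections joined by periods:

```python
import re

# One section: lowercase letters/digits separated by single hyphens.
SECTION = r"[a-z0-9]+(?:-[a-z0-9]+)*"
# Full name: one or more sections joined by periods.
TOPIC_NAME = re.compile(rf"^{SECTION}(?:\.{SECTION})*$")

def is_valid_topic_name(name: str) -> bool:
    """Check a topic name against the lowercase kebab-case convention."""
    return TOPIC_NAME.fullmatch(name) is not None
```

For instance, `is_valid_topic_name("aws.analytics.fct.pageviews.0")` returns `True`, while names containing uppercase letters or underscores are rejected.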

&lt;h3&gt;
  
  
  Naming Structure
&lt;/h3&gt;

&lt;p&gt;My recommendation is to follow the following naming convention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;data-center&amp;gt;.&amp;lt;domain&amp;gt;.&amp;lt;classification&amp;gt;.&amp;lt;description&amp;gt;.&amp;lt;version&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let’s discuss what each part of the name means:&lt;/p&gt;

&lt;h5&gt;
  
  
  Data Center
&lt;/h5&gt;

&lt;p&gt;The data center in which the data resides. This is not required, but it is helpful once an organization reaches the size where it would like to do an active/active setup or replicate data between data centers. For example, if you have one cluster in AWS and one in Azure, your topics may be prefixed with &lt;code&gt;aws&lt;/code&gt; and &lt;code&gt;azure&lt;/code&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Domain
&lt;/h5&gt;

&lt;p&gt;A domain for the data is a well-understood, permanent name for the area of the system the data relates to. These &lt;strong&gt;should not&lt;/strong&gt; include any product names, team names, or service names.&lt;/p&gt;

&lt;p&gt;Examples of this vary wildly between industries. For example, in a transportation organization, some domains might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;comms&lt;/strong&gt;: all events relating to device communications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fleet&lt;/strong&gt;: all events relating to trucking fleet management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;identity&lt;/strong&gt;: all events relating to identity and auth services&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Classification
&lt;/h5&gt;

&lt;p&gt;The classification of data within a Kafka topic tells an end-user how it should be interpreted or used. This &lt;strong&gt;should not&lt;/strong&gt; tell us about data format or contents. I typically use the following classifications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fct&lt;/strong&gt;: Fact data is information about a thing that has happened. It is an immutable event at a specific point in time. Examples of this include data from sensors, user activity events, or notifications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cdc&lt;/strong&gt;: Change data capture (CDC) indicates this topic contains all instances of a specific &lt;em&gt;thing&lt;/em&gt; and receives all changes to those &lt;em&gt;things&lt;/em&gt;. These topics do not capture deltas and can be used to repopulate data stores or caches. These are commonly found as &lt;em&gt;compacted&lt;/em&gt; topics within Kafka.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cmd&lt;/strong&gt;: Command topics represent operations that occur against the system. This is typically found as the request-response pattern, where you have a verb and a statement. Examples might include &lt;code&gt;UpdateUser&lt;/code&gt; and &lt;code&gt;UserUpdated&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sys&lt;/strong&gt;: System topics are used for internal topics within a single service. They are operational topics that do not contain any relevant information outside of the owning system. These topics are not meant for public consumption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Description
&lt;/h5&gt;

&lt;p&gt;The description is arguably the most important part of the name and is the event name that describes the type of data the topic holds. This is the subject of the data, such as &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;invoices&lt;/code&gt;, &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;payments&lt;/code&gt;, etc.&lt;/p&gt;

&lt;h5&gt;
  
  
  Version
&lt;/h5&gt;

&lt;p&gt;The version of a topic is often the most forgotten section of a proper topic name. As data evolves within a topic, there may be breaking schema changes or a complete change in the format of the data. By versioning topics, you can allow a transition period where consumers switch to the new data without impacting any old consumers.&lt;/p&gt;

&lt;p&gt;By convention, it is preferred to version all topics and to start them at &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Examples
&lt;/h4&gt;

&lt;p&gt;Some examples using this convention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;aws.analytics.fct.pageviews.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;azure.comms.fct.gps.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dc1.identity.cdc.users.1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gcp.notifications.cmd.emails.3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gcp.notifications.sys.email-cache.0&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
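A tiny helper can compose such names from their sections. This is an illustrative sketch, assuming each section is lowercase kebab-case and the version is a number:

```python
import re

# Lowercase kebab-case: letters/digits separated by single hyphens.
SECTION_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def topic_name(data_center: str, domain: str, classification: str,
               description: str, version: int = 0) -> str:
    """Build <data-center>.<domain>.<classification>.<description>.<version>."""
    sections = [data_center, domain, classification, description]
    for section in sections:
        if not SECTION_RE.fullmatch(section):
            raise ValueError(f"invalid section: {section!r}")
    return ".".join(sections + [str(version)])
```

Centralizing name construction in a shared helper like this keeps every team's topics consistent by construction rather than by convention alone.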

&lt;h2&gt;
  
  
  Enforcing Standards
&lt;/h2&gt;

&lt;p&gt;After a naming convention is decided upon and put into place, how does one enforce that topics conform to that convention? The first step is to ensure automatic topic creation is disabled on the broker side; this is done by setting the &lt;code&gt;auto.create.topics.enable&lt;/code&gt; property to &lt;code&gt;false&lt;/code&gt;. Ideally, your cluster should also have security enabled and disallow the creation of topics by services. This ensures topic creation is done in a standardized way and controlled by an operations team.&lt;/p&gt;

&lt;p&gt;The recommended approach is to create topics through a continuous integration pipeline, where topics are defined in source control and created through a build process. This allows scripts to validate that all topic names conform to the desired conventions before the topics are created. A helpful tool to manage topics within a Kafka cluster is &lt;a href="https://github.com/devshawn/kafka-dsf"&gt;kafka-dsf&lt;/a&gt;.&lt;/p&gt;
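Such a CI validation step might look like the following sketch, which checks each declared name against the full five-section convention. This is illustrative only; how topic names are declared and collected depends on your tooling.

```python
import re

# Lowercase kebab-case: letters/digits separated by single hyphens.
SECTION = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
CLASSIFICATIONS = {"fct", "cdc", "cmd", "sys"}

def violations(name: str) -> list:
    """Return reasons a name fails <data-center>.<domain>.<classification>.<description>.<version>."""
    parts = name.split(".")
    if len(parts) != 5:
        return ["expected 5 dot-separated sections"]
    problems = [f"section {p!r} is not lowercase kebab-case"
                for p in parts if not SECTION.fullmatch(p)]
    if parts[2] not in CLASSIFICATIONS:
        problems.append(f"unknown classification {parts[2]!r}")
    if not parts[4].isdigit():
        problems.append(f"version {parts[4]!r} is not a number")
    return problems
```

A CI job can then fail the build whenever any declared topic returns a non-empty list of violations.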

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully reading this has provoked some thought about how to create useful topic naming conventions and how to prevent your Kafka cluster from becoming the Wild West. It’s important to enforce consistency early and put a standard process in place before it’s too late — because things like topic names are hard to change later, and probably never will be. :-)&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>conventions</category>
      <category>standards</category>
    </item>
    <item>
      <title>Apache Kafka: Docker Quick Start</title>
      <dc:creator>Shawn Seymour</dc:creator>
      <pubDate>Mon, 25 Nov 2019 15:14:27 +0000</pubDate>
      <link>https://dev.to/devshawn/apache-kafka-docker-quick-start-3ikp</link>
      <guid>https://dev.to/devshawn/apache-kafka-docker-quick-start-3ikp</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F11%2Fdocker-bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F11%2Fdocker-bg.png" alt="Apache Kafka: Docker Quick Start"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Kafka is a distributed streaming platform that can act as a message broker, as the heart of a stream processing pipeline, or even as the backbone of a large enterprise data synchronization system. Kafka is not only a highly-available and fault-tolerant system; it also handles vastly higher throughput compared to other message brokers such as RabbitMQ or ActiveMQ.&lt;/p&gt;

&lt;p&gt;In this tutorial, you will utilize Docker &amp;amp; Docker Compose to run Apache Kafka &amp;amp; ZooKeeper. Docker with Docker Compose is the quickest way to get started with Apache Kafka and to experiment with clustering and the fault-tolerant properties Kafka provides. A full Docker Compose setup with 3 Kafka brokers and 1 ZooKeeper node can be &lt;a href="https://gist.github.com/devshawn/bf5d5afff02ea332d80fbe730e6d8e58" rel="noopener noreferrer"&gt;found here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To complete this tutorial, you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A UNIX environment (Mac or Linux)&lt;/li&gt;
&lt;li&gt;Docker &amp;amp; Docker Compose&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: Docker can be installed by following the &lt;a href="https://docs.docker.com/install/" rel="noopener noreferrer"&gt;official installation guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;Before running Kafka with Docker, let's examine the architecture of a simple Apache Kafka setup.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Cluster&lt;/strong&gt; : A group of Kafka brokers forming a distributed system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Broker&lt;/strong&gt; : An instance of Kafka that holds topics of data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZooKeeper&lt;/strong&gt; : A centralized system for storing and managing configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Producer&lt;/strong&gt; : A client that sends messages to a Kafka topic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer&lt;/strong&gt; : A client that reads messages from a Kafka topic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka utilizes ZooKeeper to manage and coordinate brokers within a cluster. Producers and consumers are the main clients that interact with Kafka, which we'll take a look at once we have a running Kafka broker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F09%2Fdocker-getting-started.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F09%2Fdocker-getting-started.png" alt="Apache Kafka: Docker Quick Start"&gt;&lt;/a&gt;Architecture diagram of integrations used in this tutorial&lt;/p&gt;

&lt;p&gt;The above diagram shows the architecture of the systems we are going to run in this tutorial. It also helps demonstrate how Kafka brokers utilize ZooKeeper and shows the ports of the running services. In this tutorial, we'll start by running &lt;strong&gt;one&lt;/strong&gt; Apache Kafka broker and &lt;strong&gt;one&lt;/strong&gt; ZooKeeper node (seen above in blue). Later on, we'll form a &lt;strong&gt;three&lt;/strong&gt; node cluster by adding in &lt;strong&gt;two&lt;/strong&gt; more Kafka brokers (seen above in green).&lt;/p&gt;

&lt;h2&gt;
  
  
  Running ZooKeeper in Docker
&lt;/h2&gt;

&lt;p&gt;Ensure you have Docker installed and running. You can verify this by running the following command; you should see a similar output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker -v
&amp;gt; Docker version 18.09.2, build 6247962
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, verify you have Docker Compose installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose -v
&amp;gt; docker-compose version 1.23.2, build 1110ad01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're ready to begin! Create a directory, such as &lt;code&gt;~/kafka&lt;/code&gt;, to store our Docker Compose files. Using your favorite text editor or IDE, create a file named &lt;code&gt;docker-compose.yml&lt;/code&gt; in your new directory.&lt;/p&gt;

&lt;p&gt;We'll start by getting ZooKeeper running. In the Docker Compose YAML file, define a &lt;code&gt;zookeeper&lt;/code&gt; service as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3'

services:
  zookeeper:
    image: zookeeper:3.4.9
    hostname: zookeeper
    ports:
      - "2181:2181"
    environment:
        ZOO_MY_ID: 1
        ZOO_PORT: 2181
        ZOO_SERVERS: server.1=zookeeper:2888:3888
    volumes:
      - ./data/zookeeper/data:/data
      - ./data/zookeeper/datalog:/datalog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A brief overview of what we're defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line &lt;code&gt;1&lt;/code&gt;: docker compose file version number, set to &lt;code&gt;3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;4&lt;/code&gt;: starting the definition of a ZooKeeper service&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;5&lt;/code&gt;: The docker image to use for ZooKeeper and its version&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;6&lt;/code&gt;: The hostname the container will use when running&lt;/li&gt;
&lt;li&gt;Lines &lt;code&gt;7-8&lt;/code&gt;: The ports to expose to the host; ZooKeeper's default port&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;10&lt;/code&gt;: The unique ID of this ZooKeeper instance, set to &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;11&lt;/code&gt;: The port this ZooKeeper instance should run with&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;12&lt;/code&gt;: The list of ZooKeeper servers; in our case just one&lt;/li&gt;
&lt;li&gt;Lines &lt;code&gt;13-15&lt;/code&gt;: Mapping volumes on the host to store ZooKeeper data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: We've mapped &lt;code&gt;./data/zookeeper&lt;/code&gt; on the host to directories within the container. This allows ZooKeeper to persist data even if you destroy the container.&lt;/p&gt;

&lt;p&gt;We can now start ZooKeeper by running the following command in the directory containing the &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logs will start printing, and should end with a line similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zookeeper_1 | ... binding to port 0.0.0.0/0.0.0.0:2181
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congrats! ZooKeeper is running and exposed on port 2181. You can verify this utilizing netcat in a new terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo ruok | nc localhost 2181
&amp;gt; imok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running Kafka In Docker
&lt;/h2&gt;

&lt;p&gt;We can now add our first &lt;code&gt;kafka&lt;/code&gt; service to our Docker Compose file. We're calling this &lt;code&gt;kafka2&lt;/code&gt; as it will have a broker id of &lt;code&gt;2&lt;/code&gt; and run on the default port of &lt;code&gt;9092&lt;/code&gt;. Later on, we'll add in &lt;code&gt;kafka1&lt;/code&gt; and &lt;code&gt;kafka3&lt;/code&gt;. This is to demonstrate that order does not matter and broker &lt;code&gt;id&lt;/code&gt;s are just for identification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3'

services:
...
  kafka2:
    image: confluentinc/cp-kafka:5.3.0
    hostname: kafka2
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka2:19092,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_BROKER_ID: 2
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    volumes:
      - ./data/kafka2/data:/var/lib/kafka/data
    depends_on:
      - zookeeper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you prefer, copy the &lt;a href="https://gist.github.com/devshawn/8e8397798b9e4a6c7fd299d723338084" rel="noopener noreferrer"&gt;full gist found here&lt;/a&gt;. A brief overview of what we're defining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line &lt;code&gt;6&lt;/code&gt;: The docker image to use for Kafka; we're using the Confluent image&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;7&lt;/code&gt;: The hostname this Kafka broker will use when running&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;8-9&lt;/code&gt;: The ports to expose; set to Kafka's default (&lt;code&gt;9092&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;11&lt;/code&gt;: Kafka's advertised listeners. Robin Moffatt has a &lt;a href="https://rmoff.net/2018/08/02/kafka-listeners-explained/" rel="noopener noreferrer"&gt;great blog post&lt;/a&gt; about this.&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;12&lt;/code&gt;: Security protocols to use for each listener.&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;13&lt;/code&gt;: The inter-broker listener name (used for internal communication)&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;14&lt;/code&gt;: The list of ZooKeeper nodes Kafka should use&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;15&lt;/code&gt;: The broker ID of this Kafka broker.&lt;/li&gt;
&lt;li&gt;Line &lt;code&gt;16&lt;/code&gt;: The replication factor of the consumer offset topic (&lt;code&gt;1&lt;/code&gt; for one broker)&lt;/li&gt;
&lt;li&gt;Lines &lt;code&gt;17-18&lt;/code&gt;: Mapping volumes on the host to store Kafka data&lt;/li&gt;
&lt;li&gt;Lines &lt;code&gt;19-20&lt;/code&gt;: Start the ZooKeeper service before the Kafka service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start the Kafka broker! In a new terminal window, run the following command in the same directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ZooKeeper should still be running in another terminal, and if it isn't, Docker Compose will start it. You'll see a lot of logs being printed and then Kafka should be running! We can verify this by creating a topic.&lt;/p&gt;

&lt;p&gt;If you have the Kafka command line tools installed, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka-topics --zookeeper localhost:2181 --create --topic new-topic --partitions 1 --replication-factor 1
&amp;gt; Created topic "new-topic".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't have the Kafka command line tools installed, you can run a command using Docker as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it kafka_kafka2_1 kafka-topics --zookeeper zookeeper:2181 --create --topic new-topic --partitions 1 --replication-factor 1
&amp;gt; Created topic "new-topic".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get any errors, verify both Kafka and ZooKeeper are running with &lt;code&gt;docker ps&lt;/code&gt; and check the logs from the terminals running Docker Compose.&lt;/p&gt;

&lt;p&gt;Yay! You now have the simplest Kafka cluster running within Docker. Kafka with broker id &lt;code&gt;2&lt;/code&gt; is exposed on port &lt;code&gt;9092&lt;/code&gt; and ZooKeeper on port &lt;code&gt;2181&lt;/code&gt;. Data for this Kafka cluster is stored in &lt;code&gt;./data/kafka2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To stop the containers, you can use &lt;code&gt;ctrl + c&lt;/code&gt; or &lt;code&gt;cmd + c&lt;/code&gt; on the running Docker Compose terminal windows. If they don't stop, you can run &lt;code&gt;docker-compose down&lt;/code&gt;. To remove the containers if they don't get removed as a part of &lt;code&gt;down&lt;/code&gt;, you can run &lt;code&gt;docker-compose rm&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Three Kafka Brokers In Docker
&lt;/h2&gt;

&lt;p&gt;To run three brokers, we need to add two more &lt;code&gt;kafka&lt;/code&gt; services to our Docker Compose file. We'll run broker &lt;code&gt;1&lt;/code&gt; on port &lt;code&gt;9091&lt;/code&gt; and broker &lt;code&gt;3&lt;/code&gt; on port &lt;code&gt;9093&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Add two more services as so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3"

services:
...
  kafka1:
    image: confluentinc/cp-kafka:5.3.0
    hostname: kafka1
    ports:
      - "9091:9091"
    environment:
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka1:19091,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9091
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_BROKER_ID: 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    volumes:
      - ./data/kafka1/data:/var/lib/kafka/data
    depends_on:
      - zookeeper

  kafka3:
    image: confluentinc/cp-kafka:5.3.0
    hostname: kafka3
    ports:
      - "9093:9093"
    environment:
      KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka3:19093,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9093
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
      KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
      KAFKA_BROKER_ID: 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    volumes:
      - ./data/kafka3/data:/var/lib/kafka/data
    depends_on:
      - zookeeper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find a &lt;a href="https://gist.github.com/devshawn/bf5d5afff02ea332d80fbe730e6d8e58" rel="noopener noreferrer"&gt;full gist with ZooKeeper and three Kafka brokers here&lt;/a&gt;. Essentially, we update the ports, the broker ID, and the data directory on the host.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: In a production setup, you'd want the offset topic replication factor to be set higher than 1, but for the purposes of this tutorial I've left it at one since we started with one broker.&lt;/p&gt;

&lt;p&gt;We can now verify that all three brokers are running by creating a topic with a replication factor of 3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it kafka_kafka2_1 kafka-topics --zookeeper zookeeper:2181 --create --topic three-isr --partitions 1 --replication-factor 3
&amp;gt; Created topic "three-isr".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you receive an error, ensure all three Kafka brokers are running. Woohoo! You've now got a Kafka cluster with three brokers running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congrats! You've successfully started a local Kafka cluster using Docker and Docker Compose. Data is persisted outside of the containers on the local machine, which means you can delete containers and restart them without losing data. For next steps, I'd suggest playing around with Kafka's fault tolerance and replication features.&lt;/p&gt;

&lt;p&gt;For example, you could create a topic with a replication factor of &lt;code&gt;3&lt;/code&gt;, produce some data, remove broker &lt;code&gt;2&lt;/code&gt; and its data directory (&lt;code&gt;./data/kafka2&lt;/code&gt;), start broker &lt;code&gt;2&lt;/code&gt; again, and see that the data is replicated to the new broker. Pretty cool!&lt;/p&gt;

&lt;p&gt;For full sets of Docker Compose files for running various Kafka Cluster setups, check out Stephane Maarek's &lt;a href="https://github.com/simplesteph/kafka-stack-docker-compose" rel="noopener noreferrer"&gt;kafka-stack-docker-compose&lt;/a&gt; repository. This post was inspired by it. :-).&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>docker</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Apache Kafka: Quick Start</title>
      <dc:creator>Shawn Seymour</dc:creator>
      <pubDate>Mon, 20 May 2019 20:30:58 +0000</pubDate>
      <link>https://dev.to/devshawn/apache-kafka-quick-start-5amd</link>
      <guid>https://dev.to/devshawn/apache-kafka-quick-start-5amd</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published on my &lt;a href="https://devshawn.com/blog/apache-kafka-quick-start/"&gt;personal blog&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Apache Kafka is a distributed streaming platform that can act as a message broker, as the heart of a stream processing pipeline, or even as the backbone of an enterprise data synchronization system. Kafka is not only a highly-available and fault-tolerant system; it also handles vastly higher throughput compared to other message brokers such as RabbitMQ or ActiveMQ.&lt;/p&gt;

&lt;p&gt;In this tutorial, you will install Apache Kafka, run three brokers in a cluster, and learn how to produce and consume messages from your cluster. This tutorial assumes that you have no existing Kafka or ZooKeeper installation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To complete this tutorial, you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A UNIX environment (Mac or Linux)&lt;/li&gt;
&lt;li&gt;Java 8+ installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: Java 7 support was dropped in Kafka 2.0.0; Java 11 support was added in Kafka 2.1.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;Download Apache Kafka and its related binaries from the Apache Kafka website. At the time of this article, the latest version is &lt;a href="https://www.apache.org/dyn/closer.cgi?path=/kafka/2.1.1/kafka_2.12-2.1.1.tgz"&gt;Apache Kafka 2.1.1&lt;/a&gt;. After downloading, extract the &lt;code&gt;.tgz&lt;/code&gt; file and move into the resulting directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; kafka_2.11-2.1.0.tgz
&lt;span class="nb"&gt;cd &lt;/span&gt;kafka_2.11-2.1.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;Let's take a look at the architecture of a simple Apache Kafka setup.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Cluster&lt;/strong&gt;: A group of Kafka brokers forming a distributed system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Broker&lt;/strong&gt;: An instance of Kafka that holds topics of data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZooKeeper&lt;/strong&gt;: A centralized system for storing and managing configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Producer&lt;/strong&gt;: A client that sends messages to a Kafka topic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer&lt;/strong&gt;: A client that reads messages from a Kafka topic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka utilizes ZooKeeper to manage and coordinate brokers within a cluster. Producers and consumers are the main components that interact with Kafka, which we'll take a look at once we have a running Kafka broker. In this tutorial, we'll be running &lt;strong&gt;three&lt;/strong&gt; Kafka brokers and &lt;strong&gt;one&lt;/strong&gt; ZooKeeper node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ45M3k4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://devshawn.com/content/images/2019/05/getting-started-diagram.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ45M3k4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://devshawn.com/content/images/2019/05/getting-started-diagram.png" alt="Apache Kafka: Installation &amp;amp; Quick Start"&gt;&lt;/a&gt;Architecture diagram of integrations used in this tutorial&lt;/p&gt;

&lt;p&gt;The above diagram shows the architecture of the systems and tools used in this tutorial. It helps demonstrate how Kafka brokers utilize ZooKeeper, which components the command line tools we'll be using interact with, and shows the ports of the running services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Zookeeper
&lt;/h2&gt;

&lt;p&gt;ZooKeeper is a centralized service that is used to maintain naming and configuration data as well as to provide flexible and robust synchronization within distributed systems. Kafka &lt;strong&gt;requires&lt;/strong&gt; ZooKeeper, so we must start an instance of ZooKeeper before we start Kafka.&lt;/p&gt;

&lt;p&gt;Conveniently, the download for Apache Kafka includes an easy way to run a ZooKeeper instance. Inside of the &lt;code&gt;bin&lt;/code&gt; directory, there is a file named &lt;code&gt;zookeeper-server-start.sh&lt;/code&gt;. To start ZooKeeper, run the following command from the root directory of your download:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/zookeeper-server-start.sh config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In your terminal, ZooKeeper logs will start flowing and you will shortly see a line that states ZooKeeper is running on port &lt;code&gt;2181&lt;/code&gt;. This is ZooKeeper's default port, and can be changed in &lt;code&gt;config/zookeeper.properties&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: The default directory where ZooKeeper stores its state is set to &lt;code&gt;/tmp/zookeeper&lt;/code&gt;. If you restart your machine, all ZooKeeper data will be lost.&lt;/p&gt;
&lt;/blockquote&gt;
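
&lt;p&gt;For reference, both defaults mentioned above come straight from &lt;code&gt;config/zookeeper.properties&lt;/code&gt;; the relevant lines in the file shipped with Kafka look like this:&lt;/p&gt;

```properties
# where ZooKeeper stores its snapshot data (under /tmp, so it is lost on reboot)
dataDir=/tmp/zookeeper
# the port that brokers and command line tools connect to
clientPort=2181
```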

&lt;p&gt;Lastly, open a new terminal window and let ZooKeeper continue running in your original terminal. Ensure you &lt;code&gt;cd&lt;/code&gt; to the root directory of your extracted Kafka download.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up A Kafka Cluster
&lt;/h2&gt;

&lt;p&gt;The official Kafka quick start guide only runs one broker, which isn't really a distributed system or a cluster, so we're going to run &lt;strong&gt;three&lt;/strong&gt; brokers! :)&lt;/p&gt;

&lt;p&gt;Let's examine the configuration file for a Kafka broker located at &lt;code&gt;config/server.properties&lt;/code&gt;. You can view the configuration file from your new terminal window by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;config/server.properties
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There's quite a bit of configuration, but the main properties we care about are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;broker.id=0&lt;/code&gt;: the unique id of the broker&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;listeners=PLAINTEXT://:9092&lt;/code&gt;: the protocol and port of the broker&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;log.dirs=/tmp/kafka-logs&lt;/code&gt;: the storage location for data in the broker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three of these configuration properties &lt;strong&gt;must&lt;/strong&gt; be unique per broker. You can see the default broker id is &lt;code&gt;0&lt;/code&gt; and the default Kafka port is &lt;code&gt;9092&lt;/code&gt;. Since we're going to start three brokers, let's copy this file once per broker and leave &lt;code&gt;server.properties&lt;/code&gt; as-is for reference. We can do this by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp &lt;/span&gt;config/server.properties config/server-1.properties
&lt;span class="nb"&gt;cp &lt;/span&gt;config/server.properties config/server-2.properties
&lt;span class="nb"&gt;cp &lt;/span&gt;config/server.properties config/server-3.properties
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Next, we need to modify the properties listed above to be unique per broker. You'll want to ensure you uncomment the &lt;code&gt;listeners&lt;/code&gt; property. Modify the files using your favorite text editor, or via a CLI program such as &lt;code&gt;vim&lt;/code&gt;. Make sure to only modify the lines below, and not to replace the whole file with them!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;server-1.properties&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;broker.id&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;listeners&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;PLAINTEXT://:9091&lt;/span&gt;
&lt;span class="py"&gt;log.dirs&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/tmp/kafka-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;server-2.properties&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;broker.id&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;listeners&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;PLAINTEXT://:9092&lt;/span&gt;
&lt;span class="py"&gt;log.dirs&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/tmp/kafka-2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;server-3.properties&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;broker.id&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;listeners&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;PLAINTEXT://:9093&lt;/span&gt;
&lt;span class="py"&gt;log.dirs&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/tmp/kafka-3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Yay! We now have a configuration file for each broker. Each broker has a unique id, listens on a unique port, and stores data in a unique location.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; As with ZooKeeper, the data is stored in the &lt;code&gt;/tmp&lt;/code&gt; directory. All data will be lost when you restart your machine.&lt;/p&gt;
&lt;/blockquote&gt;
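
&lt;p&gt;By the way, if you'd rather script the three edits above than make them by hand, a small &lt;code&gt;sed&lt;/code&gt; loop works. The snippet below is only a sketch: it builds a minimal stand-in &lt;code&gt;server.properties&lt;/code&gt; in a temporary directory so it can run anywhere; in a real Kafka directory you would drop the stand-in setup and run just the loop.&lt;/p&gt;

```shell
# Sketch: derive server-1/2/3.properties from server.properties with sed.
# The stand-in setup below exists only to make the example self-contained;
# in a real Kafka download, skip it and run the loop from the root directory.
cd "$(mktemp -d)" && mkdir config
printf 'broker.id=0\n#listeners=PLAINTEXT://:9092\nlog.dirs=/tmp/kafka-logs\n' \
  > config/server.properties

for i in 1 2 3; do
  sed -e "s/^broker\.id=.*/broker.id=${i}/" \
      -e "s|^#*listeners=.*|listeners=PLAINTEXT://:909${i}|" \
      -e "s|^log\.dirs=.*|log.dirs=/tmp/kafka-${i}|" \
      config/server.properties > "config/server-${i}.properties"
done

grep -H 'broker.id' config/server-?.properties
# config/server-1.properties:broker.id=1
# config/server-2.properties:broker.id=2
# config/server-3.properties:broker.id=3
```

Note that the &lt;code&gt;listeners&lt;/code&gt; substitution also strips the leading &lt;code&gt;#&lt;/code&gt;, which takes care of uncommenting the property.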

&lt;h2&gt;
  
  
  Starting Kafka
&lt;/h2&gt;

&lt;p&gt;In addition to your current terminal, open two more terminal windows and &lt;code&gt;cd&lt;/code&gt; to your Kafka download directory. You should have four terminals open at this point: one running ZooKeeper and three for running Kafka.&lt;/p&gt;

&lt;p&gt;To start Kafka, you'll want to run the following commands, with each one in a separate terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-server-start.sh config/server-1.properties
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-server-start.sh config/server-2.properties
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-server-start.sh config/server-3.properties
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You'll start to see logs in each terminal for the brokers you started. If you look at your ZooKeeper terminal, you'll also see logs from the brokers connecting to ZooKeeper. Each terminal should end with a line similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2019-03-02 15:28:21,074] INFO [KafkaServer id=1] started (kafka.server.KafkaServer)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Congrats! You now have a Kafka cluster running, with three brokers exposed on ports &lt;code&gt;9091&lt;/code&gt;, &lt;code&gt;9092&lt;/code&gt;, and &lt;code&gt;9093&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating A Topic
&lt;/h2&gt;

&lt;p&gt;Now that we have a Kafka cluster running, let's send some messages! To do this, we must first create a topic. Kafka includes some command line tools to do this, located in the &lt;code&gt;bin&lt;/code&gt; directory. Open a new terminal window and &lt;code&gt;cd&lt;/code&gt; to the Kafka download directory.&lt;/p&gt;

&lt;p&gt;Let's create a topic named &lt;code&gt;test&lt;/code&gt;. We can do this by utilizing the &lt;code&gt;kafka-topics.sh&lt;/code&gt; script in the &lt;code&gt;bin&lt;/code&gt; directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; localhost:2181 &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 3 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="nt"&gt;--topic&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Let's analyze the arguments we're passing the script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--create&lt;/code&gt;: flag to create a topic&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--zookeeper&lt;/code&gt;: the ZooKeeper connection string used by Kafka&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--replication-factor&lt;/code&gt;: set the replication factor&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--partitions&lt;/code&gt;: set the number of partitions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--topic&lt;/code&gt;: set the topic name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the command above, we create a single partition topic. We also set the replication factor to &lt;code&gt;3&lt;/code&gt;. This means that data will be replicated (copied for redundancy) to all of our brokers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: The maximum replication factor for a topic is the number of brokers you have running. In this case, we have a maximum replication factor of 3.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can now &lt;code&gt;describe&lt;/code&gt; our newly created topic to gain some insight into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-topics.sh &lt;span class="nt"&gt;--describe&lt;/span&gt; &lt;span class="nt"&gt;--zookeeper&lt;/span&gt; localhost:2181 &lt;span class="nt"&gt;--topic&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This will output something similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Topic:test  PartitionCount:1    ReplicationFactor:3 Configs:
    Topic: test Partition: 0    Leader: 2   Replicas: 2,3,1 Isr: 2,3,1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This explains that our topic &lt;code&gt;test&lt;/code&gt; has one partition, a replication factor of three, and no non-default configurations set. It also shows for our one partition, partition &lt;code&gt;0&lt;/code&gt;, that the leader is broker &lt;code&gt;2&lt;/code&gt; and that we have &lt;code&gt;3&lt;/code&gt; in-sync replicas. Your leader may be different than broker &lt;code&gt;2&lt;/code&gt;, but you should have &lt;code&gt;3&lt;/code&gt; in-sync replicas.&lt;/p&gt;

&lt;p&gt;To learn more about what partitions, replicas, and in-sync replicas mean, go check out and read my post &lt;a href="https://devshawn.com/blog/apache-kafka-introduction/"&gt;Apache Kafka: An Introduction&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Producing Messages
&lt;/h2&gt;

&lt;p&gt;Now that we have a Kafka topic, let's send some messages to it! We can do this using the &lt;code&gt;kafka-console-producer.sh&lt;/code&gt; script in the &lt;code&gt;bin&lt;/code&gt; directory. This is a handy tool for producing messages from the command line.&lt;/p&gt;

&lt;p&gt;Run the console producer with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-producer.sh &lt;span class="nt"&gt;--broker-list&lt;/span&gt; localhost:9091,localhost:9092,localhost:9093 &lt;span class="nt"&gt;--topic&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We pass the list of Kafka brokers with the &lt;code&gt;--broker-list&lt;/code&gt; argument and the name of the topic to produce to with the &lt;code&gt;--topic&lt;/code&gt; argument. You should now have a terminal line starting with &lt;code&gt;&amp;gt;&lt;/code&gt;. From here, you can type a message and hit enter to send it to Kafka. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; hello world, this is my first message
&amp;gt; this is a second message
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Once you've sent some messages, exit out of the console producer by using &lt;code&gt;cmd + c&lt;/code&gt; or &lt;code&gt;ctrl + c&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consuming Messages
&lt;/h2&gt;

&lt;p&gt;We've successfully sent some messages to our Kafka topic, so the last thing we need to do is read those messages. We can do this by using the &lt;code&gt;kafka-console-consumer.sh&lt;/code&gt; script in the &lt;code&gt;bin&lt;/code&gt; directory. This is a handy tool for consuming messages from the command line.&lt;/p&gt;

&lt;p&gt;Run the console consumer against our topic with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9091,localhost:9092,localhost:9093 &lt;span class="nt"&gt;--topic&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--from-beginning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We set the &lt;code&gt;--bootstrap-server&lt;/code&gt; argument to a comma-separated list of our brokers; this can be one or all of the brokers. I typically use all brokers for consistency. We also set the argument &lt;code&gt;--topic&lt;/code&gt; to our topic name and pass the &lt;code&gt;--from-beginning&lt;/code&gt; flag to read all messages in the topic. If you don't pass &lt;code&gt;--from-beginning&lt;/code&gt;, you'll only see messages that have been produced since starting the consumer.&lt;/p&gt;

&lt;p&gt;You should see the messages sent earlier appear in the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hello world, this is my first message
this is a second message
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To exit the consumer, use &lt;code&gt;cmd + c&lt;/code&gt; or &lt;code&gt;ctrl + c&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congrats! You've successfully started a local Kafka cluster, created a topic, sent messages to it with a console producer, and read messages from it with a console consumer. For fun, you can start the console producer and console consumer in separate terminal windows and produce some more messages. You'd then be able to see messages get consumed and printed in real time! Sweet!&lt;/p&gt;

&lt;p&gt;You can stop the Kafka brokers and ZooKeeper node by using &lt;code&gt;cmd + c&lt;/code&gt; or &lt;code&gt;ctrl + c&lt;/code&gt; in their respective terminal windows. I hope this tutorial helped you in getting a local Kafka cluster set up, and now you should be ready to continue on in your Kafka journey!&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Apache Kafka: An Introduction</title>
      <dc:creator>Shawn Seymour</dc:creator>
      <pubDate>Thu, 16 May 2019 21:39:07 +0000</pubDate>
      <link>https://dev.to/devshawn/apache-kafka-an-introduction-3d4o</link>
      <guid>https://dev.to/devshawn/apache-kafka-an-introduction-3d4o</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally posted on my &lt;a href="https://devshawn.com/blog/apache-kafka-introduction/" rel="noopener noreferrer"&gt;personal blog&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Apache Kafka is a distributed data streaming platform made for publishing, subscribing to, storing, and processing streams of events or records in real time. It is designed to take in data from multiple sources, store that data in a reliable way, and allow that data to be consumed from multiple systems. It is also designed to handle trillions of events per day. It was originally developed at LinkedIn and is now an open source Apache project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Message Queues
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is an alternative to traditional message queue systems, such as ActiveMQ or RabbitMQ. A &lt;em&gt;message queue&lt;/em&gt; is a form of asynchronous service-to-service communication. It allows a service to send messages to a queue, where another service can then read those messages. Services that write to a queue are typically called &lt;strong&gt;producers&lt;/strong&gt;. Services that subscribe and read from a queue are called &lt;strong&gt;consumers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This communication is called &lt;em&gt;asynchronous&lt;/em&gt; because once a service sends the message, it can continue doing other work instead of waiting for a response from another service. In a nutshell, message queues allow a number of systems to pull a message, or a batch of messages, from the end of the queue. Typically, after a message has been read, it is removed from the queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fmessage-queue.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fmessage-queue.svg" alt="Apache Kafka: An Introduction"&gt;&lt;/a&gt;Simple message queue data flow example&lt;/p&gt;

&lt;p&gt;The most general implementation of a message queue is a task list, where a consumer reads a task off the queue and processes it. Multiple consumers can be added to add concurrency and improve task processing speed, but it does &lt;strong&gt;not&lt;/strong&gt; allow for multiple actions to happen based on the &lt;em&gt;same&lt;/em&gt; message. This can be generalized as a list of &lt;em&gt;commands&lt;/em&gt; where each command is only processed by one consumer.&lt;/p&gt;

&lt;p&gt;To improve upon this, the &lt;em&gt;publish/subscribe&lt;/em&gt; model was born (a.k.a. pub/sub). In the pub/sub model, multiple consumers can subscribe to the same queue and each consumer can read the same message independently.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example, imagine a queue that provides the latest stock price of a given stock. There could be many systems that would be interested in consuming the latest stock price. Those systems can subscribe to the queue and each system will read the latest stock price, even if another independent system has already read it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This can be generalized as a list of &lt;em&gt;events&lt;/em&gt; where each consumer can process every event.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Apache Kafka
&lt;/h2&gt;

&lt;p&gt;Apache Kafka, as stated above, is an alternative messaging system that encompasses the concepts of message queues, pub/sub, and even databases. A producer can publish a record to a &lt;em&gt;topic&lt;/em&gt;, rather than a queue. Consumers can then subscribe to and read messages from that topic. Unlike most message queues, messages from a topic are not deleted once they are consumed; rather, Kafka persists them to disk. This allows you to replay messages and allows a multitude of consumers to apply differing logic to each record or, as in the example above, each &lt;em&gt;event&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;p&gt;There are many benefits provided by Apache Kafka that most message queue systems were not built to provide.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: Kafka is distributed, partitioned, replicated, and fault tolerant. We'll explore what this means later on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Kafka scales easily to multiple nodes and allows for zero-downtime deployments and upgrades.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Durability&lt;/strong&gt;: Kafka's distributed commit log allows for messages to be persisted on disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: Kafka's high throughput for publishing and subscribing allows for highly performant distributed systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As described above, Kafka provides a unique range of benefits over traditional message queues or pub/sub systems. Let's dig deeper into the internals of Kafka and how it works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka Terminology
&lt;/h3&gt;

&lt;p&gt;The architecture of Kafka is organized into a few key components. As a distributed system, Kafka runs as a &lt;em&gt;cluster&lt;/em&gt;. Each instance of Kafka within a cluster is called a &lt;em&gt;broker&lt;/em&gt;. All records within Kafka are stored in &lt;em&gt;topics&lt;/em&gt;. Topics are split into &lt;em&gt;partitions&lt;/em&gt; of data; more on that later. Lastly, &lt;em&gt;producers&lt;/em&gt; write to topics and &lt;em&gt;consumers&lt;/em&gt; read from topics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fkafka-overview.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fkafka-overview.svg" alt="Apache Kafka: An Introduction"&gt;&lt;/a&gt;Apache Kafka is the backbone of a data streaming platform&lt;/p&gt;

&lt;h3&gt;
  
  
  Commit Log
&lt;/h3&gt;

&lt;p&gt;At the heart of Apache Kafka lies a distributed, immutable commit log, which is quite similar to the &lt;em&gt;git log&lt;/em&gt; we all know and love. Each record published to a topic is committed to the end of a log and assigned a unique, sequential log-entry number. This is also often called a "write-ahead log". Essentially, we get an ordered list of events that tell us two things: &lt;em&gt;what&lt;/em&gt; happened and &lt;em&gt;when&lt;/em&gt; it happened. In distributed systems, agreeing on what happened and in what order is typically the heart of the problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Flog-example.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Flog-example.svg" alt="Apache Kafka: An Introduction"&gt;&lt;/a&gt;Example of a write-ahead log with a sequential id for each entry&lt;/p&gt;

&lt;p&gt;As a side effect of Kafka topics being based around a commit log, we get &lt;em&gt;durability&lt;/em&gt;. Data is persisted to disk and is available for consumers to read as many times as they would like to. If desired, Kafka can then be used as a source of truth, much like a database.&lt;/p&gt;
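
&lt;p&gt;The append-and-number behavior is easy to sketch in plain shell. This is a toy stand-in for a partition log, not how Kafka actually stores data on disk:&lt;/p&gt;

```shell
# Toy commit log: every record is appended to the end and stamped with the
# next sequential offset, so the log captures what happened and in what order.
log=$(mktemp)
offset=0
for event in "user registered" "email changed" "user deleted"; do
  printf '%d %s\n' "$offset" "$event" >> "$log"
  offset=$((offset + 1))
done
cat "$log"
# 0 user registered
# 1 email changed
# 2 user deleted
```

Because records are only ever appended, any consumer can re-read the log from offset &lt;code&gt;0&lt;/code&gt; and replay history, which is exactly the durability property described above.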

&lt;blockquote&gt;
&lt;p&gt;For example, imagine a &lt;code&gt;users&lt;/code&gt; topic. Each time a new user registers within an application, an event is sent to Kafka. From here, one service can then read from the &lt;code&gt;users&lt;/code&gt; topic and persist it in a database. Another service might read the &lt;code&gt;users&lt;/code&gt; topic and send a welcome email. This allows us to decouple services from one another and often helps implement microservices and event-driven architectures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Topics and Partitions within Kafka
&lt;/h3&gt;

&lt;p&gt;As described above, Kafka stores data within topics. Topics are then split into partitions. A &lt;em&gt;partition&lt;/em&gt; is an ordered, immutable log of records that is continually appended to. Each record in a partition is assigned a sequential id number, called the &lt;em&gt;offset,&lt;/em&gt; that uniquely identifies the record within the partition. A topic is made up of one or more partitions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Ftopic-partition-example.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Ftopic-partition-example.svg" alt="Apache Kafka: An Introduction"&gt;&lt;/a&gt;Example anatomy of a Kafka topic with three partitions&lt;/p&gt;

&lt;p&gt;Splitting topics into multiple partitions provides multiple benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Logs can scale larger than the size of one server; each partition must fit on a single server, but a topic with multiple partitions can spread across many servers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consumption of topics can be parallelized by having a consumer for each partition of a topic, which we will explain later on&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A Kafka cluster persists all published records using a configurable retention period. This is true for records that have &lt;strong&gt;and&lt;/strong&gt; have not been consumed. Kafka's performance is effectively constant with respect to the size of the data on disk, so storing data for a long time is not a problem. The retention period can be set based on a length of time or the size of the topic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example, if the retention policy is set to five days, then a record can be consumed for up to five days since being published. After those five days have passed, Kafka will discard the record to free up disk space.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Kafka can also persist data indefinitely based on the key of a message. This is very similar to a database table, where the latest record for each key is stored. This is called &lt;em&gt;log compaction&lt;/em&gt;, and leads to what is called a &lt;code&gt;compacted&lt;/code&gt; topic. Outdated records for a key will eventually be garbage collected and removed from the topic.&lt;/p&gt;
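
&lt;p&gt;The effect of compaction can be sketched with a one-liner: keep only the latest value seen for each key. This is a toy model of the outcome, not of Kafka's actual log cleaner:&lt;/p&gt;

```shell
# Toy log compaction: read "key value" records in order; for each key,
# only the most recently written value survives.
printf 'user1 alice@old.com\nuser2 bob@mail.com\nuser1 alice@new.com\n' |
  awk '{latest[$1] = $2} END {for (k in latest) print k, latest[k]}' |
  sort
# user1 alice@new.com
# user2 bob@mail.com
```

Note that &lt;code&gt;user1&lt;/code&gt;'s first value is gone after compaction; a consumer reading the compacted topic still sees the current state for every key.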

&lt;h3&gt;
  
  
  Distribution and Reliability within Kafka
&lt;/h3&gt;

&lt;p&gt;Each broker holds a set of partitions where each partition is either a &lt;em&gt;leader&lt;/em&gt; or a &lt;em&gt;replica&lt;/em&gt; for a given topic. All writes to and reads from a topic happen through the leader. The leader coordinates updates to replicas when new records are appended to a topic. If a leader fails, a replica takes over as a new leader. Additionally, a replica is said to be &lt;em&gt;in-sync&lt;/em&gt; if all data has been replicated from the leader. By default, only in-sync replicas can become a leader if the leader fails. Out-of-sync replicas can be a sign of broker failure or problems within Kafka.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fleader-replication-example.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fleader-replication-example.svg" alt="Apache Kafka: An Introduction"&gt;&lt;/a&gt;Example diagram showing replication for a topic with two partitions and a replication factor of three&lt;/p&gt;

&lt;p&gt;By having multiple replicas of a topic, we help ensure data is not lost if a broker fails. For a cluster with &lt;code&gt;n&lt;/code&gt; brokers and topics with a replication factor of &lt;code&gt;n&lt;/code&gt;, Kafka will tolerate up to &lt;code&gt;n-1&lt;/code&gt; server failures before data loss occurs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example, let's say you have a cluster with &lt;code&gt;3&lt;/code&gt; brokers. Imagine a &lt;code&gt;users&lt;/code&gt; topic with a replication factor of &lt;code&gt;3&lt;/code&gt;. If one broker is lost, &lt;code&gt;users&lt;/code&gt; will have &lt;code&gt;2&lt;/code&gt; in-sync replicas and no data loss occurs. Even further, if another broker is lost, &lt;code&gt;users&lt;/code&gt; will have &lt;code&gt;1&lt;/code&gt; replica and there is &lt;strong&gt;still no data loss&lt;/strong&gt;. Impressive!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cluster load is managed by distributing partition leadership across multiple brokers within the cluster. This allows Kafka to handle high volumes of reads and writes without putting all the strain on one broker – unless you only have &lt;code&gt;1&lt;/code&gt; broker!&lt;/p&gt;

&lt;h3&gt;
  
  
  Producing to Kafka
&lt;/h3&gt;

&lt;p&gt;Producers publish records to topics of their choosing and are responsible for choosing which partition within that topic each record is assigned to. This can be done in a round-robin fashion to balance load, or according to a semantic partitioning function (such as one based on a key within the record).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example, the default partitioning strategy of the Java client uses a hash of the record's key to choose the partition. This preserves message order for messages with the same key. If the record's key is &lt;code&gt;null&lt;/code&gt;, the Java client partitions the data randomly, which can be useful for easily partitioning high-volume data where order does not matter.&lt;/p&gt;
&lt;/blockquote&gt;
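&lt;p&gt;The hash-then-modulo idea above can be sketched in a few lines of Python. This is only an illustration, not the real client code: the Java producer uses murmur2 hashing, while this sketch uses CRC-32 purely to show the mechanics.&lt;/p&gt;

```python
import random
import zlib

def choose_partition(key, num_partitions):
    """Illustrative partitioner: hash the key when present so that
    records sharing a key always land in the same partition; fall
    back to a random partition when the key is None."""
    if key is None:
        return random.randrange(num_partitions)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Same key, same partition -- so per-key ordering is preserved.
p1 = choose_partition("user-42", 6)
p2 = choose_partition("user-42", 6)
```

&lt;p&gt;Because the partition is a pure function of the key, all records for &lt;code&gt;user-42&lt;/code&gt; end up in one partition, which is exactly what preserves their relative order.&lt;/p&gt;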

&lt;h3&gt;
  
  
  Consuming from Kafka
&lt;/h3&gt;

&lt;p&gt;Consumers in Kafka are organized into &lt;em&gt;consumer groups&lt;/em&gt;. A consumer group is a set of consumer instances that consume data from partitions in a topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fconsumer-groups.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdevshawn.com%2Fcontent%2Fimages%2F2019%2F05%2Fconsumer-groups.svg" alt="Apache Kafka: An Introduction"&gt;&lt;/a&gt;Example consumer group with three consumers reading from a topic with three partitions&lt;/p&gt;

&lt;p&gt;Within a consumer group, each partition is read by exactly one consumer at a time, which allows us to scale the number of consumers up to the number of partitions to increase consumption throughput. The group as a whole then consumes all messages from the entire topic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For example, imagine a topic with &lt;code&gt;6&lt;/code&gt; partitions. If you have &lt;code&gt;6&lt;/code&gt; consumers in a consumer group, each consumer will read from &lt;code&gt;1&lt;/code&gt; partition. If you have &lt;code&gt;12&lt;/code&gt;, six of the consumers will be idle while the other six each consume from &lt;code&gt;1&lt;/code&gt; partition. If you have &lt;code&gt;3&lt;/code&gt; consumers, each consumer will read from &lt;code&gt;2&lt;/code&gt; partitions. If you have &lt;code&gt;1&lt;/code&gt; consumer, it will read from all of the partitions.&lt;/p&gt;
&lt;/blockquote&gt;
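&lt;p&gt;The arithmetic in the example above can be modeled as a simple round-robin. Real Kafka assignment is performed by a pluggable assignor via the group coordinator; this sketch only models the resulting spread of partitions over consumers.&lt;/p&gt;

```python
def assign_partitions(num_partitions, consumers):
    """Model of round-robin partition assignment within one consumer
    group: each partition goes to exactly one consumer, and any
    consumers beyond the partition count sit idle."""
    assignment = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# 6 partitions, 3 consumers: each consumer reads 2 partitions.
result = assign_partitions(6, ["c1", "c2", "c3"])
```

&lt;p&gt;Running the same model with &lt;code&gt;12&lt;/code&gt; consumers leaves six of them with an empty assignment – the idle consumers from the example.&lt;/p&gt;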

&lt;p&gt;Each consumer group reads from a topic independent of any other consumer group. This allows for many systems (each having their own consumer group) to read every message in the topic, unlike consuming messages from a traditional message queue.&lt;/p&gt;

&lt;p&gt;It's important to note that &lt;strong&gt;ordering&lt;/strong&gt; within a topic is only guaranteed for each partition. Thus, if you care about the order of records, it's important to partition based on something that preserves ordering (such as a primary key) or to only use one partition.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post was only a simple introduction to the key concepts of Kafka. We'll dig deeper into the internals of Kafka, the guarantees it makes, real-world use cases, and in-depth tutorials on how to use Kafka in further posts.&lt;/p&gt;

&lt;p&gt;Overall, Kafka is quickly becoming the backbone of many organizations' data pipelines. It allows for massive message throughput while maintaining stability. It enables decoupling of producers and consumers for a flexible and adaptive architecture. Lastly, it provides reliability, consistency, and durability guarantees that many traditional message queue systems do not. I hope you enjoyed learning about how Kafka can be a useful tool when building large-scale data platforms!&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>architecture</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Kafka Shell - Supercharge Your Apache Kafka CLI</title>
      <dc:creator>Shawn Seymour</dc:creator>
      <pubDate>Wed, 20 Mar 2019 16:37:24 +0000</pubDate>
      <link>https://dev.to/devshawn/kafka-shell---supercharge-your-apache-kafka-cli-2keg</link>
      <guid>https://dev.to/devshawn/kafka-shell---supercharge-your-apache-kafka-cli-2keg</guid>
      <description>&lt;h1&gt;
  
  
  Kafka Shell
&lt;/h1&gt;

&lt;p&gt;Are you working with the Apache Kafka command line tools? Ever had trouble remembering what options are available, or remembering URLs for your clusters? Kafka shell to the rescue!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/devshawn/kafka-shell" rel="noopener noreferrer"&gt;Kafka Shell&lt;/a&gt; is a supercharged, interactive Kafka shell that is built on top of the existing Kafka command line tools. It features auto completion, auto suggestion from history, key commands, and much more. It's an open source project I just released, built with Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;Auto completion of Kafka commands, options, and configuration options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/33e27a2cab87d8e8eddeb1ba10e6cc2f75ec980f/68747470733a2f2f692e696d6775722e636f6d2f666b777a4f6b762e706e67" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/33e27a2cab87d8e8eddeb1ba10e6cc2f75ec980f/68747470733a2f2f692e696d6775722e636f6d2f666b777a4f6b762e706e67" alt="Auto Completion"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Configuration of clusters, schema registries, and properties that will be automatically added to commands being run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/5202bd45b24b607602fd10a7c80e0a8ebec13336/68747470733a2f2f692e696d6775722e636f6d2f334a6a4978794c2e706e67" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/5202bd45b24b607602fd10a7c80e0a8ebec13336/68747470733a2f2f692e696d6775722e636f6d2f334a6a4978794c2e706e67" alt="Configuration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Commands
&lt;/h2&gt;

&lt;p&gt;Kafka shell currently supports most of the popular Kafka command line tools, such as &lt;code&gt;kafka-topics&lt;/code&gt;, &lt;code&gt;kafka-console-consumer&lt;/code&gt;, and more. I plan to add the rest after I get some initial feedback!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kafka-topics&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-configs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-console-consumer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-console-producer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-avro-console-consumer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-avro-console-producer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-verifiable-consumer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-verifiable-producer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-broker-api-versions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-consumer-groups&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-delete-records&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-log-dirs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-dump-log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kafka-acls&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ksql&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get It Now
&lt;/h2&gt;

&lt;p&gt;Let me know what you think -- and I hope this can help improve working with Apache Kafka. :) &lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/devshawn" rel="noopener noreferrer"&gt;
        devshawn
      &lt;/a&gt; / &lt;a href="https://github.com/devshawn/kafka-shell" rel="noopener noreferrer"&gt;
        kafka-shell
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      ⚡A supercharged, interactive Kafka shell built on top of the existing Kafka CLI tools.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;kafka-shell&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://travis-ci.org/devshawn/kafka-shell" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d6ae0496f43511681bf420783146d97287e3124c35aa5073d8280b25d2a375e2/68747470733a2f2f7472617669732d63692e6f72672f646576736861776e2f6b61666b612d7368656c6c2e7376673f6272616e63683d6d6173746572" alt="Build Status"&gt;&lt;/a&gt; &lt;a href="https://codecov.io/gh/devshawn/kafka-shell" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ffee18c51aae9674ef1c68a8d6f37ac4296c66e42b0bca26224b339092c17684/68747470733a2f2f636f6465636f762e696f2f67682f646576736861776e2f6b61666b612d7368656c6c2f6272616e63682f6d61737465722f67726170682f62616467652e737667" alt="codecov"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/1f004e73a074f98fe69abd4ab61842f08aef7c4c51dda76fcd2fcdde4372d2c4/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6b61666b612d7368656c6c2e7376673f636f6c6f723d626c7565"&gt;&lt;img src="https://camo.githubusercontent.com/1f004e73a074f98fe69abd4ab61842f08aef7c4c51dda76fcd2fcdde4372d2c4/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f6b61666b612d7368656c6c2e7376673f636f6c6f723d626c7565" alt="PyPI"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/a80cc2d49f6b15bd2520a5acd34b9fb572bd45b29d523a69fcbe475e6a5cbd34/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f6b61666b612d7368656c6c2e737667"&gt;&lt;img src="https://camo.githubusercontent.com/a80cc2d49f6b15bd2520a5acd34b9fb572bd45b29d523a69fcbe475e6a5cbd34/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f6b61666b612d7368656c6c2e737667" alt="PyPI - Python Version"&gt;&lt;/a&gt; &lt;a href="https://github.com/devshawn/kafka-shellLICENSE" rel="noopener noreferrer"&gt;&lt;img 
src="https://camo.githubusercontent.com/859a1a0bc85ce8bbd7a730a274fec5c9e77c4726ffdf6aa762a78685e26033a4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d417061636865253230322e302d626c75652e737667" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A supercharged, interactive Kafka shell built on top of the existing Kafka CLI tools.&lt;/p&gt;
&lt;p&gt;
    &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/8cf8065a3fe5e43f08aeac082f0f429c73e7f30ce95017210252e948f79951de/68747470733a2f2f692e696d6775722e636f6d2f62316f4e545a5a2e706e67"&gt;&lt;img src="https://camo.githubusercontent.com/8cf8065a3fe5e43f08aeac082f0f429c73e7f30ce95017210252e948f79951de/68747470733a2f2f692e696d6775722e636f6d2f62316f4e545a5a2e706e67"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Kafka shell allows you to configure a list of clusters, and properties such as &lt;code&gt;--bootstrap-server&lt;/code&gt; and &lt;code&gt;--zookeeper&lt;/code&gt; for the currently selected cluster will automatically be added when the command is run. No more remembering long server addresses or ports!&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Kafka shell requires &lt;code&gt;python&lt;/code&gt; and &lt;code&gt;pip&lt;/code&gt;. Kafka shell is a wrapper over the existing Kafka command-line tools, so
those must exist within your &lt;code&gt;PATH&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can install kafka-shell using pip:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install kafka-shell&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Usage&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Kafka shell is an interactive shell. You can run it from the terminal by:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;kafka-shell&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;From here, you can start typing &lt;code&gt;kafka&lt;/code&gt; and the autocomplete will kick in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Commands&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Change Cluster&lt;/strong&gt;: The cluster that commands are run against can be cycled through by pressing &lt;code&gt;F2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy Search&lt;/strong&gt;: By default, fuzzy search of commands is enabled…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/devshawn/kafka-shell" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>kafka</category>
      <category>apache</category>
      <category>cli</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
