How to run multi-tenant Kafka

Apache Kafka is a beast. Forget for a moment what it actually does. I’m talking about running it in production. Even experienced teams find that getting the most out of Apache Kafka can become a serious time sink.

Heroku’s DevOps team looks after Kafka on behalf of thousands of developers through the Apache Kafka on Heroku service. Not everyone can justify the expense of their own dedicated Kafka cluster, though. To make Kafka available for testing, development, and smaller production use cases, Heroku offers access to multi-tenant clusters.

If running just one Kafka instance is a full-time job, though, what does it take to operate multi-tenant Kafka clusters at Heroku’s scale?

Making the right trade-offs

Running anything multi-tenant is about making the right trade-offs. But some compromises are off the table. Security, as you’d expect, isn’t up for discussion.

Performance, too, must be good enough that someone could use a multi-tenant Apache Kafka on Heroku cluster as part of their production stack.

So, what about functionality? Perhaps Heroku could reduce the number of brokers or ZooKeeper instances. But that, too, is a compromise too far. There’s little use in a testing or development instance that behaves differently from a production cluster.

So, what’s left? How do you run secure, performant, fully functional multi-tenant Kafka at a fraction of the price of a dedicated Kafka cluster? By using Kafka’s own functionality, plus some of Heroku’s, to securely divide a single cluster between multiple customers.

Oh, and lots of planning, testing, and automation.

Security through isolation

Security in multi-tenant environments usually boils down to, “Can other people see my stuff?”

The Heroku Kafka team handles this problem in two main ways. The first is to use Kafka’s access control lists (ACLs). They’re enforced at the cluster level and specify which users can perform which actions on which topics.
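To make the "which users, which actions, which topics" idea concrete, here is a minimal sketch of an ACL table in Python. It is a toy model, not Kafka's authorizer or Heroku's code: the class names, the tenant names, and the topic names are all hypothetical, and the deny-by-default behavior is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    user: str    # principal the rule applies to
    action: str  # e.g. "read" or "write"
    topic: str   # topic the rule applies to


class AclTable:
    """Toy ACL table: anything not explicitly allowed is denied."""

    def __init__(self):
        self._rules = set()

    def allow(self, user, action, topic):
        self._rules.add(AclRule(user, action, topic))

    def is_allowed(self, user, action, topic):
        return AclRule(user, action, topic) in self._rules


# Hypothetical tenants and topics, for illustration only.
acls = AclTable()
acls.allow("tenant-a", "read", "tenant-a.orders")
acls.allow("tenant-a", "write", "tenant-a.orders")

print(acls.is_allowed("tenant-a", "read", "tenant-a.orders"))  # True
print(acls.is_allowed("tenant-b", "read", "tenant-a.orders"))  # False
```

Because the check happens on the cluster rather than in the client, a tenant cannot opt out of it by changing their own code.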

Second, Heroku uses namespaces to separate each tenant’s resources. Let’s say you add multi-tenant Kafka to your Heroku account. The Heroku provisioning system automatically generates a name, something like wabash-58799, and associates it with your account when it creates the Kafka resource. After that, Heroku verifies that your account is associated with the right resource name each time you perform an action on Kafka. That way, only your account can act on that resource, which adds a layer of security specific to Heroku.
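A rough sketch of that ownership check might look like the following. Everything here is hypothetical: the registry class, the account IDs, and the function names are invented for illustration. The topic-prefixing helper in particular is an assumption about how a namespace could keep two tenants' identically named topics apart; the article does not confirm that mechanism.

```python
class ResourceRegistry:
    """Toy registry mapping generated resource names to the owning account."""

    def __init__(self):
        self._owner_by_resource = {}

    def provision(self, account_id, resource_name):
        # Record the owner at the moment the resource is created.
        self._owner_by_resource[resource_name] = account_id

    def authorize(self, account_id, resource_name):
        # Reject any action from an account that does not own the resource.
        if self._owner_by_resource.get(resource_name) != account_id:
            raise PermissionError(f"{account_id} does not own {resource_name}")


def namespaced_topic(resource_name, topic):
    """Assumed convention: prefix the topic with the resource name."""
    return f"{resource_name}.{topic}"


registry = ResourceRegistry()
registry.provision("account-1", "wabash-58799")

registry.authorize("account-1", "wabash-58799")  # passes silently
print(namespaced_topic("wabash-58799", "orders"))  # wabash-58799.orders
```

Under that assumed convention, two tenants can both create a topic called "orders" without colliding, because the cluster only ever sees the prefixed names.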

Staying on top of noisy neighbors

Just as one tenant must not be able to access another’s data, all tenants of a cluster must have fair access. So, even if another customer is processing huge numbers of events, it should not disturb your usage.

Heroku uses Kafka’s built-in support for quotas on producers and consumers, meaning that there is a fixed limit on the number of bytes each tenant can read or write per second. That way, every user gets their fair share of the computing resources available.
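The effect of a bytes-per-second quota can be illustrated with a little arithmetic. The sketch below is a simplified model, not Kafka's broker implementation: the function name and the numbers are hypothetical, and it only shows how long a client that has exceeded its quota would need to pause for its average rate to fall back to the limit.

```python
def throttle_delay_seconds(bytes_in_window, window_seconds, quota_bytes_per_second):
    """Return the pause needed so the client's average rate equals the quota."""
    allowed = quota_bytes_per_second * window_seconds
    if bytes_in_window <= allowed:
        return 0.0
    # Time the bytes already sent "should" have taken, minus time already elapsed.
    return bytes_in_window / quota_bytes_per_second - window_seconds


# Hypothetical tenant limited to 1,000 bytes per second, measured over 10 seconds.
print(throttle_delay_seconds(15_000, 10, 1_000))  # 5.0
print(throttle_delay_seconds(8_000, 10, 1_000))   # 0.0
```

In the first call the tenant sent 15,000 bytes in 10 seconds; after a 5-second pause the average is back to 1,000 bytes per second. A tenant that stays under the limit is never delayed, so a busy neighbor slows only itself.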

Maintaining availability

Noisy neighbors are a solvable problem. However, some multi-tenant services make it almost impossible to avoid them.

Think about traditional shared hosting offerings, where they promise the Earth for $3 a month. Much of the time, they’re overprovisioning. Squeezing a thousand customers each expecting 100 GB of disk space onto a machine with a 1 TB hard drive works only if most customers use only a fraction of their full allocation.
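The arithmetic behind that example is worth spelling out. This short calculation uses only the figures from the paragraph above and assumes decimal units (1 TB = 1,000 GB).

```python
customers = 1000
promised_gb_each = 100
disk_gb = 1000  # 1 TB, assuming decimal units

promised_total_gb = customers * promised_gb_each  # total space sold
oversubscription = promised_total_gb / disk_gb    # sold vs. physically present
max_average_use_gb = disk_gb / customers          # average use the disk can absorb

print(promised_total_gb)   # 100000
print(oversubscription)    # 100.0
print(max_average_use_gb)  # 1.0
```

The host has sold one hundred times the space it has, so the scheme holds only while the average customer uses no more than 1 GB, or 1% of their allocation.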

Heroku’s multi-tenant Kafka immediately provisions the full set of resources purchased. So, you don’t end up with a hundred people all trying to use the same gigabyte of disk space. And even if a customer does go beyond their disk quota, for example, Heroku will automatically expand their limit while emailing a notification that they need to upgrade their service.

Availability is largely about setting sane defaults like these, so that the system behaves in the way that maximizes its usefulness. Often that means provisioning more than the Kafka defaults: more partitions (eight rather than one, to maximize throughput), additional replicas (three rather than one, so that data is not lost), and more in-sync replicas (two rather than one, so that a write is confirmed only once a second replica has received it).
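One way to see why the replica settings matter is to compare how many broker failures each configuration can absorb while still accepting fully acknowledged writes. The sketch below uses the values from the paragraph above; the dictionary keys and function name are hypothetical, and the calculation assumes producers that wait for acknowledgement from all in-sync replicas.

```python
KAFKA_DEFAULTS = {"partitions": 1, "replication_factor": 1, "min_insync_replicas": 1}
MULTI_TENANT = {"partitions": 8, "replication_factor": 3, "min_insync_replicas": 2}


def tolerated_broker_failures(config):
    """Replicas that can be lost while the in-sync minimum is still met."""
    return config["replication_factor"] - config["min_insync_replicas"]


print(tolerated_broker_failures(KAFKA_DEFAULTS))  # 0
print(tolerated_broker_failures(MULTI_TENANT))    # 1
```

With three replicas and a minimum of two in sync, one broker can go offline without blocking writes, and every acknowledged write exists on at least two brokers.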

Testing for the real world

The saying goes that being prepared is half the battle. Knowing what could go wrong enables you to avoid those problems before they happen.

The Heroku team have run extensive tests on their multi-tenant Kafka offering to simulate real-world usage, failure scenarios, and extreme workloads. One test hammers a cluster with a million messages, then takes one of the brokers offline to see what happens. Another operates a cluster normally, then stops and restarts a server to check that failover works.

Those one-off tests have developed into a test suite that creates an empty cluster then generates fifty users. Those users attach the Kafka add-on to their application and then create several producers and consumers each. From there, realistic usage profiles are assigned, such as having 10% of the test users generate very small amounts of traffic, while 20% send very large messages at slow speeds, and so on. Then, the tests gradually increase the number of users to determine a multi-tenant cluster’s operational limits.
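The profile assignment described above could be sketched roughly as follows. This is not Heroku's test suite: the function and profile names are hypothetical, and because the article only specifies two of the profiles, the remaining users are labeled "unspecified" rather than guessed at.

```python
from collections import Counter


def assign_profiles(user_count, shares_percent):
    """Give each test user a profile name according to percentage shares."""
    assignments = []
    for name, percent in shares_percent:
        assignments.extend([name] * (user_count * percent // 100))
    # Users not covered by a stated share get a placeholder profile.
    assignments.extend(["unspecified"] * (user_count - len(assignments)))
    return assignments


profiles = assign_profiles(50, [("tiny_traffic", 10), ("large_slow_messages", 20)])
counts = Counter(profiles)

print(counts["tiny_traffic"])         # 5
print(counts["large_slow_messages"])  # 10
print(counts["unspecified"])          # 35
```

Raising `user_count` step by step while keeping the same shares mirrors the gradual ramp-up the tests use to find a cluster's operational limits.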

Through that testing, the Heroku team identified issues before they became a problem for real users. There’s more detail in a talk called “Running Hundreds of Kafka Clusters with Five People.”

Kafka in more places

For most development teams, getting the benefits of Kafka without having to actually run the Kafka cluster is the ideal situation. You’re free to focus on building your product without worrying about learning the ins and outs of yet another platform. Multi-tenant Kafka takes that a step further by making it affordable for situations where a dedicated cluster is overkill and yet where Kafka can have a benefit.

There’s more about what the Heroku team have learned from working with Kafka over on the Heroku blog.

Cover photo by Sophie Dale
Stormtrooper photo by Liam Tucker

Latest comments (2)

Marcelo Avancini • Mar 29 '21 • Edited

Great article and awesome work, congrats for Heroku team.

I have just a few questions:

when you say: "heroku uses namespaces to separate each tenant’s resources", what is a "namespace" in this context, a Kafka concept applied, or something developed internally?
how would be the isolation of topics with the same name, is it a name convention prefixing the name, for instance?

Marcelo Avancini • Mar 29 '21

I've found this video from Ali Hamidi, and it seems that the namespace is represented by a prefix on any resource like topics or groups.

youtube.com/watch?v=-AtHKoTNR1k

Let me know if I've missed something, thank you!