This blog post is adapted from a talk given by Stella Cotton at RailsConf 2018 titled "So You’ve Got Yourself a Kafka."
In recent years, designing software as a collection of services, rather than a single, monolithic codebase, has become a popular way to build applications. In this post, we'll learn the basics of Kafka and how its event-driven process can be used to power your Rails services. We’ll also talk about practical considerations and operational challenges that your event-driven Rails services might face around monitoring and scaling.
What is Kafka?
Suppose you want to know more information about how your users are engaged on your platform: the pages they visit, the buttons they click, and so on. A sufficiently popular app could produce billions of events, and sending such a high volume of data to an analytics service could be challenging, to say the least.
Enter Kafka, an integral piece for web applications that require real-time data flow. Kafka provides fault-tolerant communication between producers, which generate events, and consumers, which read those events. There can be multiple producers and consumers in any single app. In Kafka, every event is persisted for a configured length of time, so multiple consumers can read the same event over and over. A Kafka cluster is comprised of several brokers, which is just a fancy name for any instance running Kafka.
One of the key performance characteristics of Kafka is that it can process an extremely high throughput of events. Traditional enterprise queuing systems, like AMQP, have the event infrastructure itself keep track of the events that each consumer has processed. As your number of consumers scales up, that infrastructure will suffer under a greater load, as it needs to keep track of more and more states. And even establishing an agreement with a consumer is not trivial. Should a broker mark a message as "done"
once it's sent over the network? What happens if a consumer goes down and needs a broker to re-send an event?
Kafka brokers, on the other hand, do not track any of its consumers. The consumer service itself is in charge of telling Kafka where it is in the event processing stream, and what it wants from Kafka. A consumer can start in the middle, having provided Kafka an offset of a specific event to read, or it can start at the very beginning or even very end. A consumer's ability to read event data is constant time of O(1); as more events arrive, the amount of time to look up information from the stream doesn't change.
Kafka also has a scalable and fault-tolerant profile. It runs as a cluster on one or more servers that can be scaled out horizontally by adding more machines. The data itself is written to disk and then replicated across multiple brokers. For a concrete number around what scalable looks like, companies such as Netflix, LinkedIn, and Microsoft all send over a trillion messages per day through their Kafka clusters!
Setting Kafka up in Rails
Heroku provides a Kafka cluster add-on that can be used in any environment. For Ruby apps, we recommend using the ruby-kafka gem for real-world use cases. A bare minimum implementation only requires you to provide hostnames for your brokers:
# config/initializers/kafka_producer.rb
require "kafka"
# Configure the Kafka client with the broker hosts and the Rails
# logger.
$kafka = Kafka.new(["kafka1:9092", "kafka2:9092"], logger: Rails.logger)
# Set up an asynchronous producer that delivers its buffered messages
# every ten seconds:
$kafka_producer = $kafka.async_producer(
delivery_interval: 10,
)
# Make sure to shut down the producer when exiting.
at_exit { $kafka_producer.shutdown }
After setting up the Rails initializer, you can start using the gem to send event payloads. Because of the asynchronous behavior of sending events, we can write an event outside the thread of our web execution, like this:
class OrdersController < ApplicationController
def create
@comment = Order.create!(params)
$kafka_producer.produce(order.to_json, topic: "user_event", partition_key: user.id)
end
end
We'll talk more about Kafka's serialization formats below, but in this scenario, we're using good old JSON. The topic
keyword argument refers to the log where Kafka is going to write the event. Topics themselves are divided into partitions, which allow you to "split" the data in a particular topic across multiple brokers for scalability and reliability. It's a good idea to have two or more partitions per topic so if one partition fails, your events can still be written and consumed. Kafka guarantees that events are delivered in order inside a partition, but not inside a whole topic. If the order of the events is important, passing in a partition_key will ensure that all events of a specific type go to the same partition.
Kafka for your services
Some of the properties that make Kafka valuable for event pipeline systems also make it a pretty interesting fault tolerant replacement for RPC between services. Let's use an example of an e-commerce application to illustrate what this means in practice:
def create_order
create_order_record
charge_credit_card # call to Payments Service
send_confirmation_email # call to Email Service
end
Let's assume that when a user places an order, this create_order
method is going to be executed. It'll create an order record, charge the user's credit card, and send out a confirmation email. Those last two steps have been extracted out into services.
One challenge with this setup is that the upstream service is responsible for monitoring the downstream availability. If the email system is having a really bad day, the upstream service is responsible for knowing whether that email service is available. And if it isn't available, it also needs to be in charge of retrying any failing requests. How might Kafka's event stream help in this situation? Let's take a look:
In this event-oriented world, the upstream service can write an event to Kafka indicating that an order was created. Because Kafka has an "at least once" guarantee, the event is going to be written to Kafka at least once, and will be available for a downstream consumer to read. If the email service is down, the event is still persisted for it to consume. When the downstream email service comes back online, it can continue to process the events it missed in sequence.
Another challenge with an RPC-oriented architecture is that, in increasingly complex systems, integrating a new downstream service means also changing an upstream service. Suppose you'd like to integrate a new service that kicks off a fulfillment process when an order is created. In an RPC world, the upstream service would need to add a new API call out to your new fulfillment service. But in an event-oriented world, you would add a new consumer inside the fulfillment service that consumes the order once the event is created inside of Kafka.
Incorporating events in a service-oriented architecture
In a blog post titled "What do you mean by “Event-Driven”," Martin Fowler discusses the confusion surrounding "event-driven applications." When developers discuss these systems, they can actually be talking about incredibly different kinds of applications. In an effort to bring a shared understanding to what an event-driven system is, he's started defining a few architectural patterns.
Let's take a quick look at what these patterns are! If you'd like to learn more about them, check out his keynote at GOTO Chicago 2017 that covers these in-depth.
Event Notification
The first pattern Fowler talks about is called Event Notification. In this scenario, one service simply notifies the downstream services that an event happened with the bare minimum of information:
{
"event": "order_created",
"published_at": "2016-03-15T16:35:04Z"
}
If a downstream service needs more information about what happened, it will need to make a network call back upstream to retrieve it.
Event-Carried State Transfer
The second pattern is called Event-Carried State Transfer. In this design, the upstream service augments the event with additional information, so that a downstream consumer can keep a local copy of that data and not have to make a network call to retrieve it from the upstream service:
{
"event": "order_created",
"order": {
"order_id": 98765,
"size": "medium",
"color": "blue"
},
"published_at": "2016-03-15T16:35:04Z"
}
Event-Sourced
A third designation from Fowler is an Event-Sourced architecture. This implementation suggests that not only is each piece of communication between your services kicked off by an event, but that by storing a representation of an event, you could drop all your databases and still completely rebuild the state of your application by replaying that event stream. In other words, each payload encapsulates the exact state of your system at any moment.
An enormous challenge to this approach is that, over time, code changes. A future API call to a downstream service might return a different set of data that previously available, which makes it difficult to recalculate the state at that moment.
Command Query Responsibility Segregation
The final pattern mentioned is Command Query Responsibility Segregation, or CQRS. The idea here is that actions you might need to perform on a record--creating, reading, updating--are split out into separate domains. That means that one service is responsible for writing and another is responsible for reading. In event-oriented architectures, you'll often see event systems nestled in the diagrams at the place where commands are actually written.
The writer service is going to read off of the event stream, process commands, and store them to a write database. Any queries happen on a read-only database. Separating out read and write logic into two different services adds an increase in complexity, but it does allow you to optimize performance separately for those systems.
Practical considerations
Let's talk about a few practical considerations that you might run into while integrating Kafka into your service-oriented application.
The first thing to consider are slow consumers. In an event-driven system, your services need to be able to process events as quickly as the upstream service produces them. Otherwise, they will slowly drift behind, without any indication that there's a problem, because there won't be any timeouts or call failures. One place where you can identify timeouts will be on the socket connection with the Kafka brokers. If a service is not processing events fast enough, that connection can timeout, and reestablishing it has an additional time cost since it's expensive to create those sockets.
If a consumer is slow, how do you speed it up? For Kafka, you can increase the number of consumers in your consumer group so that you can process more events in parallel. You'll want at least two consumer processes running per service, so that if one goes down, any other failed partitions can be reassigned. Essentially, you can parallelize work across as many consumers as you have topic partitions. (As with any scaling issue, you can't just add consumers forever; eventually you're going to hit scaling limits on shared resources, like databases.)
It's also extremely valuable to have metrics and alerts around how far behind you are from when an event was added to the queue. ruby-kafka is instrumented with ActiveSupport notifications, but it also has StatsD and Datadog reporters that are automatically included. You can use these to report on whether you're lagging behind when the events are added. The ruby-kafka gem even provides a list of recommended metrics to monitor!
Another aspect to building systems with Kafka is to design your consumers for failure. Kafka is guaranteed to send an event at least once; there's never a chance that messages will not be sent at all. But you need to design consumers to expect duplicated events. One way to do that is to always rely on UPSERT
to add new records to your database. If a record already exists with the same attributes, the call will essentially be a no-op. Alternatively, you can include a unique identifier to each event, and just skip operating on events that have already been seen before.
Payload formats
One surprising aspect to Kafka is its very permissive attitude towards data. You can send it anything in bytes and it will simply send that back out to consumers without any verification. This feature makes its usage extremely flexible because you don't need to adopt a specific format. But what happens when an upstream service decides to change an event that it produces? If you just change that event payload, there's a really good chance that one of your downstream consumers will break.
Before you begin adopting an event-driven architecture, choose a data format and evaluate how it can help you register schemas and evolve them over time. It's much easier to think about validation and evolution of schemas before you actually implement them.
One format to use is, of course, JSON, the format of the web. It's human readable and supported in basically every programming language. But there are a few downsides. For one, the actual size of JSON payloads can be really large. The payloads require you to send key-value pairs, which are flexible, but are also often duplicated across every event. There's no built-in documentation inside a payload, such that, given a value, you might not know what it means. Schema evolution is also a challenge, since there's no built-in support for aliasing one key to another if you need to rename a field.
Confluent, the team that built Apache Kafka, recommends using Avro as a data serialization system. The data is sent over in binary, so it's not human-readable. But the upside is that there is a more robust schema support. A full Avro object includes its schema and its data. Avro comes with support for simple types, like integers, and complex ones, like dates. It also embeds documentation into the schema, which allows you to comprehend what a field does in your system. It provides built-in tools that help you evolve your schemas in backwards-compatible ways over time.
avro-builder is a gem created by Salsify that offers a very Ruby-ish DSL to help you create your schemas. If you're curious to learn more about Avro, the Salsify engineering blog has a really great writeup on avro and avro-builder!
More information
If you're curious to learn more about how we run our hosted Kafka products, or how we use Kafka internally at Heroku, we do have two talks other Heroku engineers have given that you can watch!
Jeff Chao's talk at DataEngConf SF '17 was titled "Beyond 50,000 Partitions: How Heroku Operates and Pushes the Limits of Kafka at Scale," while Pavel Pravosud spoke at Dreamforce '16 about " Dogfooding Kafka: How We Built Heroku's Real-Time Platform Event Stream." Enjoy!
Top comments (3)
Passing along a Twitter question
Thanks! I answered on Twitter if anyone is curious:
Karafka? Hell, no! I already mixed up Kafka and Karaf (OSGI runtime), but now that I know something called Karafka exists, too, I can mix up things on a whole new level. :-D