<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Smily</title>
    <description>The latest articles on DEV Community by Smily (@smily).</description>
    <link>https://dev.to/smily</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7513%2F9ed978c0-23d1-4305-889d-287223e5028d.png</url>
      <title>DEV Community: Smily</title>
      <link>https://dev.to/smily</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/smily"/>
    <language>en</language>
    <item>
      <title>RDS Database Migration Series - Integrating Ruby on Rails applications with RDS Proxy</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Mon, 29 Jul 2024 09:32:09 +0000</pubDate>
      <link>https://dev.to/smily/rds-database-migration-series-integrating-ruby-on-rails-applications-with-rds-proxy-37n7</link>
      <guid>https://dev.to/smily/rds-database-migration-series-integrating-ruby-on-rails-applications-with-rds-proxy-37n7</guid>
      <description>&lt;p&gt;In the &lt;a href="https://www.smily.com/engineering/rds-database-migration-series---facing-the-giant-how-we-migrated-11-tb-database" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;, we covered our story of migrating a giant database of almost &lt;strong&gt;11 TB.&lt;/strong&gt; Here comes the last part of the series—making Ruby on Rails applications work with RDS Proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use RDS Proxy in the first place?
&lt;/h2&gt;

&lt;p&gt;Before the migration, we were using &lt;a href="https://www.pgbouncer.org/" rel="noopener noreferrer"&gt;PgBouncer&lt;/a&gt;. We hosted multiple databases (for multiple applications) per cluster, and it was often the case that a single application alone required 300 or even 400 connections. Hence, a connection pooler was a natural solution to the issues we had. We were really happy with it, as it was simple to integrate and it did the job. Yet we decided to drop PgBouncer because AWS does not offer it as a managed service, and the entire point of the migration was to stop self-hosting databases. That left us with &lt;strong&gt;RDS Proxy&lt;/strong&gt; as the only available solution. It looked straightforward to add, and since it was the dedicated option for &lt;strong&gt;RDS,&lt;/strong&gt; we expected things to work out of the box, as long as we kept the same config as for &lt;em&gt;PgBouncer&lt;/em&gt; (which mainly meant disabling prepared statements and using transaction-level advisory locks over session-level ones). Well, it turned out that we were wrong.&lt;/p&gt;
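&lt;p&gt;For reference, the two settings mentioned above map to Rails config keys in &lt;em&gt;database.yml&lt;/em&gt;. A simplified sketch (host and pool values are illustrative, not our production config):&lt;/p&gt;

```yaml
production:
  adapter: postgresql
  host: pgbouncer.internal.example.com  # hypothetical pooler endpoint
  pool: 25
  # Prepared statements are per-session, so they break transaction pooling.
  prepared_statements: false
  # Disables Rails' session-level advisory locks (e.g. the migration lock);
  # transaction-level locks can still be taken explicitly where needed.
  advisory_locks: false
```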

&lt;h2&gt;
  
  
  First issues with RDS Proxy
&lt;/h2&gt;

&lt;p&gt;After trying out the &lt;strong&gt;RDS Proxy&lt;/strong&gt; with the first application, it looked like the connection pooling did not work. When inspecting logs, we saw tons of warnings that looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;The&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;was&lt;/span&gt; &lt;span class="n"&gt;pinned&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dbConnection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1189232136&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="no"&gt;The&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt;&lt;span class="s1"&gt;'t reuse this connection until the session ends. Reason: SQL changed session settings that the proxy doesn'&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="no"&gt;Consider&lt;/span&gt; &lt;span class="n"&gt;moving&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;initialization&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="no"&gt;Digest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"set client_encoding to $1"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connection pinning means that a connection cannot be returned to the pool and reused, which explains why the proxy appeared not to work. The warning itself pointed at the fix: moving session configuration to the &lt;em&gt;initialization query&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Thanks to some &lt;a href="https://medium.com/@andre.decamargo/rds-proxy-and-connection-pinning-d26efcadb53c" rel="noopener noreferrer"&gt;available articles&lt;/a&gt; and existing &lt;a href="https://github.com/ged/ruby-pg/issues/368" rel="noopener noreferrer"&gt;Github&lt;/a&gt; &lt;a href="https://github.com/rails/rails/issues/40207" rel="noopener noreferrer"&gt;issues&lt;/a&gt;, we figured out that we needed to move some config from the &lt;em&gt;pg&lt;/em&gt; gem and the Rails Postgres adapter to the RDS Proxy initialization query. "Moving" meant some heavy monkey-patching: adjusting some surprising &lt;a href="https://github.com/ged/ruby-pg/issues/368" rel="noopener noreferrer"&gt;low-level config&lt;/a&gt; and setting &lt;em&gt;Encoding.default_internal&lt;/em&gt;, which the &lt;em&gt;pg&lt;/em&gt; gem depends on, to nil. However, the issue was fixed in &lt;em&gt;pg&lt;/em&gt; 1.5.4, so keeping the gem up to date avoids the problem altogether.&lt;/p&gt;
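&lt;p&gt;Given that, the simplest prevention is making sure the fixed version is used, e.g. by pinning it in the Gemfile:&lt;/p&gt;

```ruby
# Gemfile - the client_encoding pinning issue was fixed in pg 1.5.4
gem "pg", ">= 1.5.4"
```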

&lt;h2&gt;
  
  
  Fixing RDS Proxy - getting it right with the initialization query
&lt;/h2&gt;

&lt;p&gt;We started addressing the issues one warning at a time, and it turned out that we had to adjust several config parameters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;client_encoding&lt;/em&gt; - the one set by the &lt;em&gt;pg&lt;/em&gt; gem based on &lt;em&gt;Encoding.default_internal&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;statement_timeout&lt;/em&gt; - we used it as extra config in &lt;em&gt;database.yml&lt;/em&gt;, so we had to make sure that none of the &lt;em&gt;variables&lt;/em&gt; were applied
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;intervalstyle&lt;/em&gt; - this one had to be removed from &lt;em&gt;ActiveRecord::ConnectionAdapters::PostgreSQLAdapter&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;client_min_messages&lt;/em&gt; - same as above, we had to monkey-patch &lt;em&gt;ActiveRecord::ConnectionAdapters::PostgreSQLAdapter&lt;/em&gt; and remove it
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;standard_conforming_strings&lt;/em&gt; - same as above
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;timezone&lt;/em&gt; - again, same as above
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is what the final &lt;em&gt;init_query&lt;/em&gt; looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;init_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SET client_encoding TO unicode; SET statement_timeout TO 300000; SET intervalstyle TO iso_8601; SET client_min_messages TO warning; SET standard_conforming_strings TO on; SET timezone TO utc"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it solved most of the issues!&lt;/p&gt;

&lt;h2&gt;
  
  
  The remaining issue that we didn't address
&lt;/h2&gt;

&lt;p&gt;There was only one problem remaining:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
2023-09-06T08:28:13.685Z [WARN] [proxyEndpoint=default] [clientConnection=51706963] The client session was pinned to the database connection [dbConnection=973587044] for the remainder of the session. The proxy can't reuse this connection until the session ends. Reason: The connection ran a SQL query which exceeded the 16384 byte limit.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, there was no easy solution to that problem as it is a known limitation of RDS Proxy. However, the number of database connections was acceptable for us, so we stopped at this point.&lt;/p&gt;
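&lt;p&gt;One mitigation worth considering (a hypothetical sketch on the application side, not something RDS Proxy provides) is keeping individual query texts small, e.g. by slicing huge id lists before they end up interpolated into a single query, so the 16384-byte limit is less likely to be hit. The helper name and the batch size of 500 are illustrative guesses, not derived bounds:&lt;/p&gt;

```ruby
# Hypothetical helper: process a large id list in slices so that no single
# "SELECT ... WHERE id IN (...)" query text grows past the proxy's 16 KB limit.
def each_id_batch(ids, batch_size: 500)
  ids.each_slice(batch_size) do |batch|
    yield batch # e.g. Property.where(id: batch).to_a
  end
end

batches = []
each_id_batch((1..1200).to_a) { |b| batches.push(b.size) }
# three slices: 500 + 500 + 200
```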

&lt;h2&gt;
  
  
  The final config for Rails applications
&lt;/h2&gt;

&lt;p&gt;We've put everything into a single initializer and added some extra ENV variables to make the release more straightforward and allow a quick rollback if something goes wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;unless&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"APPLY_CONFIG_FOR_RDS_PROXY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;

&lt;span class="no"&gt;Encoding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_internal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt; &lt;span class="c1"&gt;# for pg version &amp;gt;= 1.5.4 it's not necessary&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ActiveRecord::ConnectionAdapters::PostgreSQLAdapter&lt;/span&gt;
  &lt;span class="kp"&gt;private&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;exec_no_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;async: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;materialize_transactions&lt;/span&gt;
    &lt;span class="n"&gt;mark_transaction_written_if_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# make sure we carry over any changes to ActiveRecord.default_timezone that have been&lt;/span&gt;
    &lt;span class="c1"&gt;# made since we established the connection&lt;/span&gt;
    &lt;span class="n"&gt;update_typemap_for_default_timezone&lt;/span&gt;

    &lt;span class="n"&gt;type_casted_binds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;type_casted_binds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;binds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;type_casted_binds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;async&lt;/span&gt;&lt;span class="p"&gt;:)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="no"&gt;ActiveSupport&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Dependencies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;interlock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;permit_concurrent_loads&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="c1"&gt;# -- monkeypatch --&lt;/span&gt;
        &lt;span class="c1"&gt;# to use async_exec instead of exec_params if prepared statements are disabled&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connection_db_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configuration_hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:prepared_statements&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_s&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
          &lt;span class="no"&gt;Retryable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;times: &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;errors: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="no"&gt;PG&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ConnectionBad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;PG&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ConnectionException&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;before_retry: &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;reconnect!&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
            &lt;span class="vi"&gt;@connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;type_casted_binds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="k"&gt;end&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;
          &lt;span class="no"&gt;Retryable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;times: &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;errors: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="no"&gt;PG&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ConnectionBad&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;PG&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ConnectionException&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;before_retry: &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;reconnect!&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
            &lt;span class="vi"&gt;@connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="k"&gt;end&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
        &lt;span class="c1"&gt;# -- end of monkeypatch --&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="kp"&gt;protected&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;configure_connection&lt;/span&gt;
    &lt;span class="c1"&gt;# if @config[:encoding]&lt;/span&gt;
    &lt;span class="c1"&gt;#   @connection.set_client_encoding(@config[:encoding])&lt;/span&gt;
    &lt;span class="c1"&gt;# end&lt;/span&gt;
    &lt;span class="c1"&gt;# self.client_min_messages = @config[:min_messages] || "warning"&lt;/span&gt;
    &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema_search_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="vi"&gt;@config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:schema_search_path&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="vi"&gt;@config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:schema_order&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# # Use standard-conforming strings so we don't have to do the E'...' dance.&lt;/span&gt;
    &lt;span class="c1"&gt;# set_standard_conforming_strings&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# variables = @config.fetch(:variables, {}).stringify_keys&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# # If using Active Record's time zone support configure the connection to return&lt;/span&gt;
    &lt;span class="c1"&gt;# # TIMESTAMP WITH ZONE types in UTC.&lt;/span&gt;
    &lt;span class="c1"&gt;# unless variables["timezone"]&lt;/span&gt;
    &lt;span class="c1"&gt;#   if ActiveRecord::Base.default_timezone == :utc&lt;/span&gt;
    &lt;span class="c1"&gt;#     variables["timezone"] = "UTC"&lt;/span&gt;
    &lt;span class="c1"&gt;#   elsif @local_tz&lt;/span&gt;
    &lt;span class="c1"&gt;#     variables["timezone"] = @local_tz&lt;/span&gt;
    &lt;span class="c1"&gt;#   end&lt;/span&gt;
    &lt;span class="c1"&gt;# end&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# # Set interval output format to ISO 8601 for ease of parsing by ActiveSupport::Duration.parse&lt;/span&gt;
    &lt;span class="c1"&gt;# execute("SET intervalstyle = iso_8601", "SCHEMA")&lt;/span&gt;
    &lt;span class="c1"&gt;#&lt;/span&gt;
    &lt;span class="c1"&gt;# # SET statements from :variables config hash&lt;/span&gt;
    &lt;span class="c1"&gt;# # https://www.postgresql.org/docs/current/static/sql-set.html&lt;/span&gt;
    &lt;span class="c1"&gt;# variables.map do |k, v|&lt;/span&gt;
    &lt;span class="c1"&gt;#   if v == ":default" || v == :default&lt;/span&gt;
    &lt;span class="c1"&gt;#     # Sets the value to the global or compile default&lt;/span&gt;
    &lt;span class="c1"&gt;#     execute("SET SESSION #{k} TO DEFAULT", "SCHEMA")&lt;/span&gt;
    &lt;span class="c1"&gt;#   elsif !v.nil?&lt;/span&gt;
    &lt;span class="c1"&gt;#     execute("SET SESSION #{k} TO #{quote(v)}", "SCHEMA")&lt;/span&gt;
    &lt;span class="c1"&gt;#   end&lt;/span&gt;
    &lt;span class="c1"&gt;# end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The initializer works with Rails 6.0+.&lt;/p&gt;

&lt;p&gt;We also added some extra &lt;em&gt;retryable&lt;/em&gt; behavior because, for whatever reason (most likely idle connections being killed), RDS Proxy was randomly closing some connections. Reconnecting to the database solved most (although not all) of those errors. Here is the code behind the &lt;em&gt;Retryable&lt;/em&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Retryable&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="ss"&gt;before_retry: &lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;executed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;begin&lt;/span&gt;
      &lt;span class="n"&gt;executed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="k"&gt;yield&lt;/span&gt;
    &lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;executed&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;
        &lt;span class="n"&gt;before_retry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;retry&lt;/span&gt;
      &lt;span class="k"&gt;else&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
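&lt;p&gt;A minimal usage sketch (the class is repeated here so the snippet is self-contained, with the retry condition written in the equivalent "give up once the attempt budget is spent" form; the failing block is a stand-in for a query hitting a connection that the proxy closed):&lt;/p&gt;

```ruby
# Retryable as defined above, inlined so this example runs standalone.
class Retryable
  def self.perform(times:, errors:, before_retry: ->(_error) {})
    executed = 0
    begin
      executed += 1
      yield
    rescue *errors => e
      # Re-raise once all attempts are used up, otherwise retry.
      raise e if executed >= times
      before_retry.call(e)
      retry
    end
  end
end

# Stand-in for a query that fails twice (think PG::ConnectionBad), then succeeds.
attempts = 0
result = Retryable.perform(times: 3, errors: [RuntimeError],
                           before_retry: ->(_e) { attempts += 1 }) do
  raise "connection gone" unless attempts >= 2
  :ok
end
# result == :ok after two retries (attempts == 2)
```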



&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;While integrating &lt;strong&gt;Ruby on Rails applications&lt;/strong&gt; with &lt;strong&gt;RDS Proxy&lt;/strong&gt; turned out to be way more complex than doing it with popular &lt;strong&gt;connection poolers&lt;/strong&gt; such as &lt;a href="https://www.pgbouncer.org/" rel="noopener noreferrer"&gt;PgBouncer&lt;/a&gt;, we managed to solve most (but not all) of the issues we encountered with a single initializer on the applications' side and by fine-tuning the initialization query on the &lt;strong&gt;RDS Proxy&lt;/strong&gt; side.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>postgres</category>
      <category>ruby</category>
      <category>rails</category>
    </item>
    <item>
      <title>RDS Database Migration Series - Facing The Giant: How we migrated 11 TB database</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Mon, 13 May 2024 09:21:15 +0000</pubDate>
      <link>https://dev.to/smily/rds-database-migration-series-facing-the-giant-how-we-migrated-11-tb-database-3da3</link>
      <guid>https://dev.to/smily/rds-database-migration-series-facing-the-giant-how-we-migrated-11-tb-database-3da3</guid>
      <description>&lt;p&gt;In the &lt;a href="https://www.smily.com/engineering/rds-database-migration-series---a-horror-story-of-using-aws-dms-with-a-happy-ending"&gt;previous blog post&lt;/a&gt;, we covered our story of migrating to &lt;strong&gt;AWS&lt;/strong&gt; &lt;strong&gt;RDS&lt;/strong&gt; using &lt;strong&gt;AWS Database Migration Service (DMS)&lt;/strong&gt;, a complex initiative with multiple issues we had to address.&lt;/p&gt;

&lt;p&gt;Nevertheless, almost all the migrations we did could have been generalized to the same strategy - in the end, we found a way to use &lt;strong&gt;DMS&lt;/strong&gt; that works fine, even for mid-size databases.&lt;/p&gt;

&lt;p&gt;One database, though, required some extra preparation - the 11 TB giant (10.9 TB, to be exact). Despite all the steps we took, it was not possible to migrate it via an &lt;strong&gt;AWS DMS&lt;/strong&gt; full load within an acceptable time, even when applying a parallel load. In that case, we had to develop a custom migration script, which turned out to be almost &lt;strong&gt;20 times faster&lt;/strong&gt; than &lt;strong&gt;AWS DMS.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even more surprising is that the database shrunk to &lt;strong&gt;3.1% of its original size&lt;/strong&gt; after the migration!&lt;/p&gt;

&lt;p&gt;Let's review all the steps we took to prepare this giant database for the migration and what our custom migration script exactly looked like.&lt;/p&gt;

&lt;h2&gt;
  
  
  How did we get to an 11 TB Postgres database in the first place?
&lt;/h2&gt;

&lt;p&gt;Before we get into the details, let's start by explaining how we got to the point where the database was so massive.&lt;/p&gt;

&lt;p&gt;The primary culprits were &lt;strong&gt;two tables&lt;/strong&gt; (and their huge indexes) that contributed approximately &lt;strong&gt;90%&lt;/strong&gt; to the total size of the database. One of them was an audit trail (&lt;a href="https://github.com/paper-trail-gem/paper_trail"&gt;paper trail&lt;/a&gt; versions, to be exact), and the second one was more domain-specific for short-term rentals. It's a pre-computed cache of prices for properties depending on various conditions so that they don't need to be computed each time on the fly and can be easily distributed to other services.&lt;/p&gt;

&lt;p&gt;In both cases, the origin of the problem is the same, just with a slightly different flavor—trying to store as much as possible in the Postgres database without defining any reasonable retention policy.&lt;/p&gt;

&lt;p&gt;The tricky part is that we couldn't simply batch-delete records after a period defined by a retention policy. We had to keep some paper trail versions forever. And for the computed cache, we still had to preserve many computation results for an extended period to debug potential problems.&lt;/p&gt;

&lt;p&gt;That doesn't mean we had to keep them in the Postgres database, though - they could live in cheaper storage, e.g., &lt;strong&gt;AWS S3.&lt;/strong&gt; Pulling data on demand (and removing it once no longer needed) is an alternative that may take some extra time to develop, but it's much more efficient.&lt;/p&gt;

&lt;p&gt;This is what we did to a considerable extent for the records representing pre-computed prices; we started archiving shortly after they were no longer applicable by pushing them to S3, deleting them from the Postgres database, and pulling data on demand when debugging was needed (and deleting them again when no longer required).&lt;/p&gt;

&lt;p&gt;We also applied a retention policy for paper trail versions and archived the records by uploading them to S3 and deleting them from the Postgres database. However, we also decided to split the giant unpartitioned table into several tables (one generic one as a catch-all/default table and a couple of model-specific tables for the most important models), essentially introducing partitioning by model type. Due to that split/partitioning, we temporarily increased the total size of the database. Still, we could ignore the original table during the migration and, at the same time, make migration via &lt;strong&gt;AWS DMS&lt;/strong&gt; faster by simplifying parallelization during the &lt;em&gt;full load&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The interesting realization is that the majority of the data we stored were historical records, which are not critical business data.&lt;/p&gt;

&lt;p&gt;Overall, we deleted a massive amount of records from the Postgres database. However, the database size didn't change at all. Not even by a single gigabyte!&lt;/p&gt;

&lt;p&gt;What happened?&lt;/p&gt;

&lt;h2&gt;
  
  
  I have a massive database. Now what?
&lt;/h2&gt;

&lt;p&gt;If you are in a similar situation, you will likely have a big issue to solve. This is due to a few reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deleting records from Postgres does not make the table smaller (and neither does &lt;em&gt;vacuum&lt;/em&gt;, at least not a plain one).&lt;/li&gt;
&lt;li&gt;While there are multiple ways to shrink table size after deleting many records, they are usually complex.&lt;/li&gt;
&lt;li&gt;Even if you manage to shrink the size of massive tables or even delete them completely, you are still likely to keep paying the same price for the storage - if you use, e.g., &lt;strong&gt;AWS EBS&lt;/strong&gt; volumes, you cannot shrink them; you can only increase their size.&lt;/li&gt;
&lt;li&gt;At that point, you will likely need to migrate to a new &lt;strong&gt;EBS&lt;/strong&gt; volume (or equivalent). If you don't use a managed service, you could solve the problem like we did and migrate to &lt;strong&gt;AWS RDS.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's look at all these issues closer.&lt;/p&gt;

&lt;p&gt;It is essential to remember that deleting records in the Postgres database does nothing to reduce the table size—it only makes the records no longer available. However, running &lt;em&gt;vacuum&lt;/em&gt; marks the internal tuples from deleted rows as reusable for the future. So, it does slow down the table's growth (as a part of already allocated storage is reusable), yet we can't expect the size to go down.&lt;/p&gt;

&lt;p&gt;Nevertheless, there are multiple ways to shrink the table size:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;em&gt;vacuum full&lt;/em&gt; - the most straightforward option (as it's a matter of running a single command), yet the most dangerous one. Not only does it require an exclusive lock on the entire table for a potentially very long time, but it also requires more disk space initially, as it creates a copy of the table without the deleted records.&lt;/li&gt;
&lt;li&gt;Use an extension that provides similar functionality but does not acquire an exclusive lock on the entire table - &lt;a href="https://reorg.github.io/pg_repack/"&gt;pg_repack&lt;/a&gt; is a popular solution.&lt;/li&gt;
&lt;li&gt;Copy the data to a new table and delete the old one - this cannot be done with a single simple command, but it is potentially the safest option and offers the most flexibility: there are many ways to import the data from one table to another, keep them in sync for a while, and drop the previous table.&lt;/li&gt;
&lt;/ol&gt;
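&lt;p&gt;The third option can be sketched as generating batched &lt;em&gt;INSERT ... SELECT&lt;/em&gt; statements over id ranges. This is a simplified illustration with hypothetical table names and batch size; a real run would also keep the two tables in sync and swap them atomically:&lt;/p&gt;

```ruby
# Build the batched copy statements for moving live rows into a fresh table.
# BETWEEN bounds are inclusive; batch_size and table names are illustrative.
def batched_copy_statements(source, target, max_id, batch_size: 100_000)
  (0..max_id).step(batch_size).map do |from|
    "INSERT INTO #{target} SELECT * FROM #{source} " \
      "WHERE id BETWEEN #{from} AND #{from + batch_size - 1}"
  end
end

statements = batched_copy_statements("versions", "versions_compact", 250_000)
# 3 statements covering ids 0..99999, 100000..199999, 200000..299999
```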

&lt;p&gt;While we have several solutions to the problem, the sad reality is that if we use block-storage services such as &lt;strong&gt;AWS EBS&lt;/strong&gt;, we will still pay the same price for the allocated storage, which cannot be deallocated.&lt;/p&gt;

&lt;p&gt;If the price is an issue, the options we could consider at this point would be moving data to a smaller EBS volume or migrating to a new cluster. We went with the second option as it naturally fit our plan to move from self-managed clusters to the AWS-managed service (&lt;strong&gt;AWS RDS&lt;/strong&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarking the migration with AWS DMS - not that smooth this time
&lt;/h2&gt;

&lt;p&gt;After all the hard work of preparing the database for migration, we ran a test benchmark to see how long it would take. It looked very promising initially - almost all the tables migrated in approximately 3 hours. All except one - the table that stored the results of computed prices. After close to 5.5 hours, we gave up, as this was an unacceptable time for a single table. Five hours would have been fine for the entire downtime window, including index re-creation, not for migrating one table on top of the 3 hours it took to migrate the rest of the database.&lt;/p&gt;

&lt;p&gt;That was a big surprise - the original table was huge (2.6 TB), but we deleted over 90% of the records, so we expected the migration to be really fast. Especially with the &lt;em&gt;parallel load&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;However, most of the storage space was occupied by &lt;em&gt;LOBs&lt;/em&gt; - the arrays of decimals (prices) - which could have been the primary cause of the slowness. Most likely, the massive bloat left after deleting such a big part of the table didn't help either. And probably the most crucial reason is that there is a limit to how much we can parallelize the process: the maximum number of full load subtasks allowed on &lt;em&gt;AWS DMS&lt;/em&gt; is 49.&lt;/p&gt;

&lt;p&gt;At that point, we had to figure out some custom solution, as we didn't want to change the migration strategy from just the &lt;em&gt;full load&lt;/em&gt; to the combination of &lt;em&gt;full load&lt;/em&gt; and &lt;em&gt;CDC.&lt;/em&gt; The good news was that we had already been using something that could be very useful in designing a custom solution - performing a bulk insert (using &lt;a href="https://github.com/zdennis/activerecord-import"&gt;activerecord-import&lt;/a&gt;) of the archived records. It proved to be fast enough to restore a significant number of records. Also, nothing was preventing us from using a much higher degree of parallelization than &lt;strong&gt;DMS.&lt;/strong&gt; This could be our solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom migration script to the rescue
&lt;/h2&gt;

&lt;p&gt;We had to make a couple of tweaks to reuse the functionality we had implemented for restoring the archived records. Not only would the source of the data be different (a Postgres database instead of S3), but there were also other important considerations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The entry point for running the process would be scheduling a Sidekiq job from the &lt;em&gt;rails console&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;We had to make it easily parallelizable by allowing a huge number of Sidekiq workers to process independent migration jobs.&lt;/li&gt;
&lt;li&gt;To make it happen, the optimal way would be to use ID ranges as job arguments, which works well because the table had a numeric (&lt;em&gt;bigint&lt;/em&gt;) primary key (it would not be so simple with &lt;em&gt;uuid&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;However, given the massive number of records, we could not afford to have a single process scheduling jobs sequentially one by one for the consecutive ranges. That way, scheduling these jobs could become a bottleneck.&lt;/li&gt;
&lt;li&gt;To satisfy the requirement from the previous point, we could take the range between minimum and maximum ID from the table and split it into giant buckets. By further dividing these buckets, we could have jobs that would schedule other jobs.&lt;/li&gt;
&lt;li&gt;In the end, we would have two types of jobs:

&lt;ol&gt;
&lt;li&gt;Schedulers - operating on the huge buckets, splitting them into smaller ones; each sub-bucket would then be used as an argument for the second type of job&lt;/li&gt;
&lt;li&gt;Migrators - the jobs that would be calling the actual operation that knows how to perform the migration&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
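
&lt;p&gt;The two-level bucketing described above can be sketched in plain Ruby (the sizes below are illustrative, not the ones we used in production):&lt;/p&gt;

```ruby
# Split the full [min_id, max_id] range into big scheduler buckets, then
# split each bucket into smaller sub-ranges for the migrator jobs.
def buckets(min_id, max_id, size)
  (min_id..max_id).step(size).map { |start| [start, [start + size - 1, max_id].min] }
end

scheduler_buckets = buckets(1, 1_000, 400)
# => [[1, 400], [401, 800], [801, 1000]]

migrator_ranges = scheduler_buckets.flat_map { |lo, hi| buckets(lo, hi, 200) }
# => [[1, 200], [201, 400], [401, 600], [601, 800], [801, 1000]]
```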

&lt;p&gt;And this is how we did it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The scheduler job:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DatabaseMigration::ScheduleIdsRangeMigrationStrategyJob&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;Sidekiq&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Worker&lt;/span&gt;

  &lt;span class="n"&gt;sidekiq_options&lt;/span&gt; &lt;span class="ss"&gt;queue: :default&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_range_size_for_data_migration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size_per_job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;current_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;current_min_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;current_min_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_id&lt;/span&gt;
      &lt;span class="n"&gt;current_min_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size_per_job&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;current_index&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="n"&gt;maximum_possible_end_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_min_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;batch_size_per_job&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="n"&gt;current_max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;maximum_possible_end_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;

      &lt;span class="n"&gt;perform_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_min_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;current_max_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sub_range_size_for_data_migration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;current_index&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_range_size_for_data_migration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="no"&gt;DatabaseMigration&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;IdsRangeMigrationStrategyJob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sub_range_size_for_data_migration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The migration job:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DatabaseMigration::IdsRangeMigrationStrategyJob&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;Sidekiq&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Worker&lt;/span&gt;

  &lt;span class="n"&gt;sidekiq_options&lt;/span&gt; &lt;span class="ss"&gt;queue: :default&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_range_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_id&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;each_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sub_range_size&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;perform_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="no"&gt;DatabaseMigration&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;IdsRangeMigrationStrategy&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;id_column: &lt;/span&gt;&lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;table_name: &lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;source_database_uri: &lt;/span&gt;&lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="ss"&gt;target_database_uri: &lt;/span&gt;&lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;migrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The operation performing the actual migration:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DatabaseMigration::IdsRangeMigrationStrategy&lt;/span&gt;
  &lt;span class="nb"&gt;attr_reader&lt;/span&gt; &lt;span class="ss"&gt;:id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:source_database_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:target_database_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:batch_size&lt;/span&gt;
  &lt;span class="kp"&gt;private&lt;/span&gt;     &lt;span class="ss"&gt;:id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:source_database_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:target_database_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:batch_size&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;source_database_uri&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;target_database_uri&lt;/span&gt;&lt;span class="p"&gt;:)&lt;/span&gt;
    &lt;span class="vi"&gt;@id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;
    &lt;span class="vi"&gt;@table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;
    &lt;span class="vi"&gt;@source_database_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_database_uri&lt;/span&gt;
    &lt;span class="vi"&gt;@target_database_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_database_uri&lt;/span&gt;
    &lt;span class="vi"&gt;@batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"DMS_IDS_RANGE_STRATEGY_BATCH_SIZE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_i&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;migrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;source_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_id&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;end_id&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;in_batches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;of: &lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;records_batch&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;target_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records_batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="ss"&gt;:attributes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="ss"&gt;on_duplicate_key_ignore: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="kp"&gt;private&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;source_model&lt;/span&gt;
    &lt;span class="vi"&gt;@source_model&lt;/span&gt; &lt;span class="o"&gt;||=&lt;/span&gt; &lt;span class="no"&gt;Class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt;
        &lt;span class="s2"&gt;"SourceModelForIdsRangeMigrationStrategy"&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;establish_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_database_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;target_model&lt;/span&gt;
    &lt;span class="vi"&gt;@target_model&lt;/span&gt; &lt;span class="o"&gt;||=&lt;/span&gt; &lt;span class="no"&gt;Class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt;
        &lt;span class="s2"&gt;"TargetModelForIdsRangeMigrationStrategy"&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;klass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;establish_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_database_uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take a few minutes to analyze how things work exactly.&lt;/p&gt;
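
&lt;p&gt;One detail worth highlighting is the anonymous-class trick in &lt;code&gt;source_model&lt;/code&gt; and &lt;code&gt;target_model&lt;/code&gt;: &lt;code&gt;Class.new&lt;/code&gt; with an overridden &lt;code&gt;.name&lt;/code&gt; gives ActiveRecord a usable model class without having to define a constant. Here is the idea in isolation (plain Ruby, no database involved):&lt;/p&gt;

```ruby
# An anonymous class has no name by default; overriding .name makes it
# behave like a regularly-defined class for code that relies on the name.
model = Class.new do
  def self.name
    "SourceModelForIdsRangeMigrationStrategy"
  end
end

puts Class.new.name.inspect  # nil - anonymous classes are nameless
puts model.name              # "SourceModelForIdsRangeMigrationStrategy"
```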

&lt;p&gt;Running the entire process was limited to merely executing this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;TABLE_NAME&lt;/span&gt;
&lt;span class="n"&gt;id_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"id"&lt;/span&gt;
&lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;DMS_SOURCE_DATABASE_URL&lt;/span&gt;
&lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;DMS_TARGET_DATABASE_URL&lt;/span&gt;
&lt;span class="n"&gt;min_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MIN_ID&lt;/span&gt; &lt;span class="c1"&gt;# from YourSourceModel.minimum(:id)&lt;/span&gt;
&lt;span class="n"&gt;max_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MAX_ID&lt;/span&gt; &lt;span class="c1"&gt;# from YourSourceModel.maximum(:id)&lt;/span&gt;
&lt;span class="n"&gt;batch_size_per_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100_000_000&lt;/span&gt; &lt;span class="c1"&gt;# the big batch size&lt;/span&gt;
&lt;span class="n"&gt;sub_range_size_for_data_migration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500_000&lt;/span&gt; &lt;span class="c1"&gt;# the sub-batch size&lt;/span&gt;

&lt;span class="no"&gt;DatabaseMigration&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ScheduleIdsRangeMigrationStrategyJob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_column&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_db_uri_env_var_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sub_range_size_for_data_migration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size_per_job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it was really that simple to implement an alternative solution to &lt;strong&gt;AWS DMS!&lt;/strong&gt; We ran a test benchmark, and it turned out that with comparable parallelization (50 Sidekiq workers), migrating the problematic table took close to 45 minutes - perfectly fine for our needs! Even more interesting, there was no significant increase in database load, even though we ran the process on the production database while it was actively handling its standard workload. The potential for parallelization was even greater, which we wanted to see in action during the final migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performing the final migration
&lt;/h2&gt;

&lt;p&gt;And this day finally came - facing the 11 TB giant. Fortunately, the migration went perfectly fine! The entire process took approximately 5 hours (the actual downtime). And there were even more things to celebrate!&lt;/p&gt;

&lt;p&gt;The migration of the troublesome table took merely &lt;strong&gt;18 minutes&lt;/strong&gt; with &lt;strong&gt;125 Sidekiq workers&lt;/strong&gt;! We didn't try going beyond 5.5 hours for this table using &lt;strong&gt;AWS DMS,&lt;/strong&gt; but even assuming that would have been the final load time, our custom migration script - which took a few hours to build - turned out to be &lt;strong&gt;almost 20x faster&lt;/strong&gt; (18.3x, to be precise). And there was plenty of room to optimize it even further - for example, we could play with different bucket sizes or run the process with 200 Sidekiq workers. Furthermore, we don't know how long it would have taken &lt;strong&gt;AWS DMS&lt;/strong&gt; to finish the process - maybe 6 or 7 hours. It would not be impossible, then, for a custom process to be 30x or even 40x faster.&lt;/p&gt;
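
&lt;p&gt;The speedup arithmetic behind that claim is simple (times in minutes):&lt;/p&gt;

```ruby
# AWS DMS was stopped after 5.5 hours without finishing, so 330 minutes is
# only a lower bound; the custom script with 125 Sidekiq workers took 18.
dms_time    = 5.5 * 60  # 330 minutes
custom_time = 18.0
speedup = dms_time / custom_time
puts speedup.round(1)  # 18.3
```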

&lt;p&gt;And there was one more thing—the database size after the migration turned out to be merely &lt;strong&gt;347 GB,&lt;/strong&gt; which is &lt;strong&gt;3% of the original size&lt;/strong&gt;! Reducing all the bloat definitely paid off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Ultimately, we managed to migrate the colossus, which started as an &lt;strong&gt;11 TB giant&lt;/strong&gt; and became a mid-size and well-maintained database of &lt;strong&gt;347 GB&lt;/strong&gt; (&lt;strong&gt;3.1% of the original size&lt;/strong&gt;). That was a challenging initiative, yet thanks to all the steps we took, we managed to shrink it massively and migrate it during a reasonable downtime window. However, it wouldn't have been possible without our custom migration script, which we used together with &lt;strong&gt;the AWS DMS full load&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stay tuned for the next part, as we will provide a solution for making &lt;a href="https://aws.amazon.com/rds/proxy/"&gt;AWS RDS Proxy&lt;/a&gt; work with Rails applications, which was not that trivial.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>rds</category>
      <category>postgres</category>
      <category>database</category>
    </item>
    <item>
      <title>Event Sourcing with Rails from scratch</title>
      <dc:creator>Oleg Borys</dc:creator>
      <pubDate>Fri, 22 Mar 2024 14:48:07 +0000</pubDate>
      <link>https://dev.to/smily/event-sourcing-with-rails-from-scratch-880</link>
      <guid>https://dev.to/smily/event-sourcing-with-rails-from-scratch-880</guid>
      <description>&lt;p&gt;In the previous article &lt;strong&gt;&lt;a href="https://www.smily.com/engineering/introduction-to-event-sourcing-and-cqrs"&gt;Introduction to Event Sourcing and CQRS&lt;/a&gt;&lt;/strong&gt; we got familiar with the main concepts of Event Sourcing and reviewed the cons and pros of this approach. &lt;/p&gt;

&lt;p&gt;Implementing Event Sourcing in Rails can be a powerful way to handle complex business logic and maintain a reliable audit trail of changes to your application's state. Event Sourcing involves storing a sequence of events that represent changes to the state of your application over time. &lt;/p&gt;

&lt;p&gt;Now, before we plunge into using Event Sourcing in Rails with the help of battle-tested solutions, let's research the basics and learn how to implement things from scratch. However, be aware - you should think twice before using a custom implementation (there are too many chances you'll miss some small detail, and the consequences may be huge). &lt;br&gt;
For these purposes, let's create a simple application for listing rental advertisements on a website.&lt;/p&gt;

&lt;p&gt;The process may be quite complicated, but for the sake of simplicity, let's assume it goes as below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ad listing is created&lt;/li&gt;
&lt;li&gt;the content is updated&lt;/li&gt;
&lt;li&gt;the listing is published on the website&lt;/li&gt;
&lt;li&gt;the ad listing is removed&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Preparations
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;where we start building a path in our journey and getting familiar with events&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, let's create a new project for our implementation purposes first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; rails new event_sourced_ads --database=postgresql --skip-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We skip the default tests since I’m an &lt;code&gt;rspec&lt;/code&gt; fan, so we'll use that instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;gem&lt;/span&gt; &lt;span class="s2"&gt;"rspec-rails"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 6.1.0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;bundle&lt;/code&gt; and &lt;code&gt;rails generate rspec:install&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that the initial setup is done, we can move on to events. Let’s implement that part. And guess what we’re starting with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;dclass&lt;/span&gt; &lt;span class="no"&gt;CreateEvents&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Migration&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;7.1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;change&lt;/span&gt;
    &lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="ss"&gt;:events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;id: :uuid&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="ss"&gt;:stream_name&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="ss"&gt;:event_type&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jsonb&lt;/span&gt; &lt;span class="ss"&gt;:data&lt;/span&gt;

      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timestamps&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use UUIDs here for primary keys. When dealing with distributed systems and the need for worldwide uniqueness, a UUID is often the optimal choice.&lt;/p&gt;
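As a side note, `id: :uuid` on PostgreSQL relies on a database-side UUID generator; depending on your PostgreSQL version, you may need `enable_extension "pgcrypto"` in a migration first. On the Ruby side, the standard library generates the same kind of version-4 UUIDs:

```ruby
require "securerandom"

id = SecureRandom.uuid
# Version-4 UUIDs are random, e.g. "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d";
# collisions are practically impossible, which makes them safe as
# globally unique identifiers across services.
id.match?(/\A\h{8}-\h{4}-4\h{3}-[89ab]\h{3}-\h{12}\z/) # => true
```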

&lt;p&gt;We need streams to group events related to a particular entity. In our case, I’m going to group the events related to one single ad, so they can be easily fetched. Frankly speaking, it’s not a good idea to store stream names like that, since the same events may belong to different streams. For the sake of simplicity, let’s consider doing some evil (we’ll do more till the end).&lt;/p&gt;

&lt;p&gt;We need a basic event class that we can publish and validate input with, and we also need some way to do pub-sub (quite a crucial part). I’m excited about the &lt;code&gt;dry-rb&lt;/code&gt; stack - I can’t say it’s perfect, but it usually suits all my needs perfectly, so I'm turning on the imagination and seeing what it brings… &lt;code&gt;ImaginationCompleted.publish(data: {idea: "Create BaseEvent", pub_sub: "KISS rails has one built-in"})&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Who am I to argue with that? 😇 Let’s start with a spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"persists an event record"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;publish&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="ss"&gt;event_type: &lt;/span&gt;&lt;span class="s2"&gt;"FakeEvent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"whatever"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="ss"&gt;stream_name: &lt;/span&gt;&lt;span class="s2"&gt;"123123"&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"sends a notification"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ActiveSupport&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Notifications&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:instrument&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;publish&lt;/span&gt;
  &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;ActiveSupport&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Notifications&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_received&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:instrument&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;"FakeEvent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"whatever"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="ss"&gt;stream_name: &lt;/span&gt;&lt;span class="s2"&gt;"123123"&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and after some struggle, we come up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/events/base_event.rb&lt;/span&gt;
&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;Events&lt;/span&gt;
  &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseEvent&lt;/span&gt;
    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InvalidAttributes&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;StandardError&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MissingContract&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;StandardError&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="nb"&gt;attr_reader&lt;/span&gt; &lt;span class="ss"&gt;:data&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;inner_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;
      &lt;span class="n"&gt;define_method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:params_schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="no"&gt;Dry&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Schema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Params&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
          &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inner_schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:data&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;stream_name: &lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:stream_name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;validate_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="vi"&gt;@data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;stream_name: &lt;/span&gt;&lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="ss"&gt;event_type: &lt;/span&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="ss"&gt;stream_name:
      &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="no"&gt;ActiveSupport&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Notifications&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;:)&lt;/span&gt;
      &lt;span class="nb"&gt;self&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;params_schema&lt;/span&gt;
      &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="no"&gt;MissingContract&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Contract needs to be implemented"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;data_validation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="no"&gt;InvalidAttributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_validation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data_validation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;any?&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
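The publish side above persists the event and then broadcasts it through `ActiveSupport::Notifications`. To make the pub-sub part concrete, here is a minimal plain-Ruby stand-in for that mechanism (the `TinyNotifications` class and all names are hypothetical, purely for illustration; the real code uses the Rails API):

```ruby
# Minimal stand-in for ActiveSupport::Notifications-style pub-sub:
# `instrument` fires an event, and every matching `subscribe` block runs.
class TinyNotifications
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  def subscribe(event_name, &handler)
    @subscribers[event_name] << handler
  end

  def instrument(event_name, **payload)
    @subscribers[event_name].each { |handler| handler.call(payload) }
  end
end

bus = TinyNotifications.new
received = []
bus.subscribe("FakeEvent") { |payload| received << payload }
bus.instrument("FakeEvent", data: {name: "whatever"}, stream_name: "123123")
# received now holds [{data: {name: "whatever"}, stream_name: "123123"}]
```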



&lt;p&gt;And let’s try using our new pet. You know where to start…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;RSpec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt; &lt;span class="no"&gt;Events&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;AdCreated&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="s2"&gt;".publish"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:publish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;described_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"Some title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"Some description"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="ss"&gt;stream_name: &lt;/span&gt;&lt;span class="s2"&gt;"123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"persists the event in database"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;publish&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="ss"&gt;event_type: &lt;/span&gt;&lt;span class="s2"&gt;"Events::AdCreated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"title"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"Some title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s2"&gt;"body"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"Some description"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="ss"&gt;stream_name: &lt;/span&gt;&lt;span class="s2"&gt;"123456789"&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the event itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/events/ad_created.rb&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Events::AdCreated&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;Events&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;BaseEvent&lt;/span&gt;
  &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="no"&gt;Dry&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Schema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="no"&gt;Params&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Aggregate part
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;the one where we learn to manipulate our ads&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, as users we should be able to create a new ad, possibly modify it, and publish it. However, we shouldn’t be able to edit an already published ad. So we need some consistency in actions, with the corresponding event published after each action is executed. That’s where the aggregate comes into play. &lt;/p&gt;
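To make that rule concrete, here is a tiny, standalone sketch of the intended lifecycle (the class and method names are hypothetical; the real aggregate below is event sourced rather than mutating state directly):

```ruby
# Illustrative sketch of the lifecycle rules described above:
# a draft can be edited, a published ad cannot. Names are hypothetical.
class AdLifecycleError < StandardError; end

class AdLifecycle
  attr_reader :state, :attributes

  def initialize
    @state = :new
    @attributes = {}
  end

  def create_draft(title:, body:)
    @attributes = {title: title, body: body}
    @state = :draft
  end

  def update(new_attributes)
    raise AdLifecycleError, "only drafts can be edited" unless state == :draft
    @attributes = attributes.merge(new_attributes)
  end

  def publish
    raise AdLifecycleError, "only drafts can be published" unless state == :draft
    @state = :published
  end
end

ad = AdLifecycle.new
ad.create_draft(title: "Some title", body: "Some description")
ad.update(title: "Better title")
ad.publish

begin
  ad.update(title: "Too late")
rescue AdLifecycleError => error
  error.message # => "only drafts can be edited"
end
```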

&lt;p&gt;So, we create an &lt;code&gt;AdAggregate&lt;/code&gt; class and start with a test for the new instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# spec/services/ad_aggregate_spec.rb&lt;/span&gt;
&lt;span class="no"&gt;RSpec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt; &lt;span class="no"&gt;AdAggregate&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"has valid attributes on initialization"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="n"&gt;kind_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="ss"&gt;state: :new&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="c1"&gt;# app/services/ad_aggregate.rb&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AdAggregate&lt;/span&gt;
  &lt;span class="nb"&gt;attr_reader&lt;/span&gt; &lt;span class="ss"&gt;:id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:attributes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:state&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="vi"&gt;@id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="no"&gt;SecureRandom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid&lt;/span&gt;
    &lt;span class="vi"&gt;@state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ss"&gt;:new&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need the ability to actually create a new draft and have those attributes in the aggregate. We also need to publish an event saying that the draft has been created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="s2"&gt;"#create_draft"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:create_draft&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"with valid attributes"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;valid_attributes&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"updates attributes and state"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;create_draft&lt;/span&gt;
      &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="ss"&gt;attributes: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"Test title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"Test description"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="ss"&gt;state: :draft&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one is easy to implement, but we face a problem here: we should be able to restore the state of the aggregate later, when we want to apply further actions to it. The aggregate is supposed to be an event-sourced one, so we need a way to apply events to it; all we should actually do here is apply an event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:)&lt;/span&gt;
  &lt;span class="n"&gt;apply&lt;/span&gt; &lt;span class="no"&gt;Events&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;AdCreated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;ad_id: &lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:})&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need handlers in the aggregate that describe how the attributes are modified, how the state changes, and how the state of an aggregate can be restored from a history of events (for this purpose we’ll create another class in a while 😉). Also, the events should be published when we store the aggregate. So let’s add handler methods that explain how we want to modify the aggregate’s state on each event, plus a common method that also builds a queue of unpublished events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unpublished_events&lt;/span&gt;
    &lt;span class="vi"&gt;@unpublished_events&lt;/span&gt; &lt;span class="o"&gt;||=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"apply_&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;demodulize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;underscore&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="kp"&gt;private&lt;/span&gt; 

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;unpublished_events&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;
    &lt;span class="n"&gt;apply_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_ad_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="vi"&gt;@state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ss"&gt;:draft&lt;/span&gt;
    &lt;span class="vi"&gt;@attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
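The dynamic dispatch in `apply_event` (`apply_` plus the demodulized, underscored event class name) is also what lets us rebuild an aggregate from a stored history of events. Here is a self-contained sketch of that replay idea, with ActiveSupport's `demodulize`/`underscore` inlined and hypothetical mini-classes standing in for the real ones:

```ruby
# Hypothetical mini-events standing in for the real event classes.
module Events
  AdCreated = Struct.new(:data, keyword_init: true)
  AdUpdated = Struct.new(:data, keyword_init: true)
end

class TinyAdAggregate
  attr_reader :state, :attributes

  def initialize
    @state = :new
    @attributes = {}
  end

  # Derive the handler name from the event class, mirroring
  # `"apply_#{event.class.name.demodulize.underscore}"` above.
  def apply_event(event)
    name = event.class.name.split("::").last                # demodulize
    name = name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase  # underscore
    send("apply_#{name}", event)
  end

  private

  def apply_ad_created(event)
    @state = :draft
    @attributes = event.data.slice(:title, :body)
  end

  def apply_ad_updated(event)
    @attributes = attributes.merge(event.data.slice(:title, :body))
  end
end

# Replaying the stored history restores the aggregate's current state.
history = [
  Events::AdCreated.new(data: {title: "Old title", body: "Some body"}),
  Events::AdUpdated.new(data: {title: "New title"})
]
aggregate = TinyAdAggregate.new
history.each { |event| aggregate.apply_event(event) }
aggregate.state      # => :draft
aggregate.attributes # => {title: "New title", body: "Some body"}
```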



&lt;p&gt;So, when a new event is applied, we save it in the queue of unpublished events and call the corresponding handler. But what’s the point of that without having the events stored? And how do we fetch a previously created aggregate? We could implement all of that here in this class, though according to the Single Responsibility Principle, it’s definitely work that someone else should do. That’s where we need a repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s2"&gt;"rails_helper"&lt;/span&gt;

&lt;span class="no"&gt;RSpec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt; &lt;span class="no"&gt;Repository&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="s1"&gt;'.load'&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:load&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;described_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:aggregate_class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="no"&gt;AdAggregate&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:stream_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="no"&gt;SecureRandom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"without events"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"loads new aggregate"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;be_instance_of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_class&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;and&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="ss"&gt;state: :new&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"with existing events"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"when applying AdCreated event"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
          &lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="ss"&gt;event_type: &lt;/span&gt;&lt;span class="s2"&gt;"Events::AdCreated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt;
            &lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;ad_id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;

        &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"applies event to aggregate"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
          &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;be_a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;AdAggregate&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;and&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="ss"&gt;state: :draft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="ss"&gt;attributes: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"body"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;

        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"when applying AdPublished"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
          &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
            &lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="ss"&gt;event_type: &lt;/span&gt;&lt;span class="s2"&gt;"Events::AdPublished"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt;
              &lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;ad_id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;remote_id: &lt;/span&gt;&lt;span class="s2"&gt;"xosfjoj"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="k"&gt;end&lt;/span&gt;

          &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"applies event to aggregate"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
            &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;be_a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;AdAggregate&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;and&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="ss"&gt;state: :published&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="ss"&gt;attributes: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"body"&lt;/span&gt;
              &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="k"&gt;end&lt;/span&gt;
        &lt;span class="k"&gt;end&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="s1"&gt;'.store'&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;described_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"with unpublished events"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:aggregate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="n"&gt;instance_double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;AdAggregate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;unpublished_events: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
      &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:stream_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="no"&gt;SecureRandom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="no"&gt;Events&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;AdCreated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;ad_id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;

      &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"publishes pending events"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="n"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt;
          &lt;span class="ss"&gt;event_type: &lt;/span&gt;&lt;span class="s2"&gt;"Events::AdCreated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"ad_id"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;"title"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;"body"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"body"&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the implementation of that is easy enough. I’ll skip the detailed walkthrough to save some precious space and time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;Repository&lt;/span&gt;
  &lt;span class="kp"&gt;extend&lt;/span&gt; &lt;span class="nb"&gt;self&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;:).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;constantize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;data: &lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;aggregate_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tap&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpublished_events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;stream_name: &lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
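&lt;p&gt;To make the replay mechanics above easier to follow, here is a minimal, framework-free sketch of the same pattern. Everything in it (&lt;code&gt;TinyEventStore&lt;/code&gt;, &lt;code&gt;TinyAdAggregate&lt;/code&gt;, &lt;code&gt;InMemoryRepository&lt;/code&gt;) is hypothetical and exists only for illustration; the real implementation is the &lt;code&gt;Repository&lt;/code&gt; module shown above.&lt;/p&gt;

```ruby
# A stripped-down, in-memory illustration of the repository pattern:
# events are appended to a store keyed by stream name, and loading an
# aggregate means replaying every stored event on a fresh instance.
TinyEvent = Struct.new(:type, :data)

class TinyEventStore
  def initialize
    @streams = Hash.new { |hash, key| hash[key] = [] }
  end

  def append(stream_name, event)
    @streams[stream_name].push(event)
  end

  def read(stream_name)
    @streams[stream_name]
  end
end

class TinyAdAggregate
  attr_reader :id, :state, :attributes

  def initialize(id)
    @id = id
    @state = :new
    @attributes = {}
  end

  # Mutate the aggregate according to each event type, in order.
  def apply_event(event)
    case event.type
    when :ad_created
      @state = :draft
      @attributes = event.data
    when :ad_published
      @state = :published
    end
  end
end

class InMemoryRepository
  def initialize(store)
    @store = store
  end

  # Build a fresh aggregate and replay its whole event stream on it.
  def load(aggregate_class, stream_name)
    aggregate_class.new(stream_name).tap do |aggregate|
      @store.read(stream_name).each { |event| aggregate.apply_event(event) }
    end
  end
end
```

&lt;p&gt;Replaying an &lt;code&gt;:ad_created&lt;/code&gt; event followed by &lt;code&gt;:ad_published&lt;/code&gt; leaves the loaded aggregate in the &lt;code&gt;:published&lt;/code&gt; state, mirroring the specs above.&lt;/p&gt;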



&lt;p&gt;We can now store an aggregate and load it back from existing events. However, we are still missing one of the aggregate’s main responsibilities: enforcing invariants. We should disallow editing already published ads, and we definitely shouldn’t publish the same ad twice (well, technically we could, but that would clearly be wrong). So, as usual, specs first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="s2"&gt;"#update_content"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:update_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;new_attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:aggregate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;described_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:new_attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"Updated title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"Updated description"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"when ad is in draft state"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;valid_attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"updates ad attributes"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;update_content&lt;/span&gt;
      &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;have_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="ss"&gt;attributes: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="ss"&gt;title: &lt;/span&gt;&lt;span class="s2"&gt;"Updated title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="ss"&gt;body: &lt;/span&gt;&lt;span class="s2"&gt;"Updated description"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="ss"&gt;state: :draft&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"when ad is in published state"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;valid_attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"raises an error"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;update_content&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;raise_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;described_class&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;AlreadyPublished&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;describe&lt;/span&gt; &lt;span class="s2"&gt;"#publish"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:publish&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;let&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:aggregate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;described_class&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"when ad is in draft state"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;valid_attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"updates state to published"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;publish&lt;/span&gt;
      &lt;span class="n"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;state&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:published&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="s2"&gt;"when ad is in published state"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_draft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;valid_attributes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="s2"&gt;"raises an error"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;publish&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt; &lt;span class="n"&gt;raise_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;described_class&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;AlreadyPublished&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check the implementation of these methods in the repository, though it’d be a good exercise to try implementing them yourself 😉 &lt;/p&gt;
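&lt;p&gt;For reference, here is one possible shape of the guard logic those specs imply. This is a hedged sketch (the &lt;code&gt;SketchAdAggregate&lt;/code&gt; name is made up, and the real aggregate in the repository records events rather than mutating state directly); it only demonstrates the invariants.&lt;/p&gt;

```ruby
# A minimal sketch of an aggregate enforcing the invariants from the
# specs above: content can only change in the draft state, and both
# editing and publishing an already published ad raise AlreadyPublished.
class SketchAdAggregate
  AlreadyPublished = Class.new(StandardError)

  attr_reader :state, :attributes

  def initialize
    @state = :new
    @attributes = {}
  end

  def create_draft(title:, body:)
    @attributes = {title: title, body: body}
    @state = :draft
  end

  def update_content(title:, body:)
    raise AlreadyPublished if @state == :published
    @attributes = {title: title, body: body}
  end

  def publish
    raise AlreadyPublished if @state == :published
    @state = :published
  end
end
```

&lt;p&gt;Both the second &lt;code&gt;publish&lt;/code&gt; call and any &lt;code&gt;update_content&lt;/code&gt; after publishing raise, just as the specs demand.&lt;/p&gt;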

&lt;h3&gt;
  
  
  CQRS part
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;the one where we get familiar with read models and presentation to users&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ok, the &lt;code&gt;pub&lt;/code&gt; part is ready; now it’s time to &lt;del&gt;drink some beer&lt;/del&gt; build the &lt;code&gt;sub&lt;/code&gt; part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# config/initializers/event_listeners.rb&lt;/span&gt;
&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;application&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;after_initialize&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="no"&gt;AdEventListener&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="no"&gt;Events&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;AdCreated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="no"&gt;ActiveSupport&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Notifications&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;constantize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we define the routing: which listener handles which events. In this example, whenever an event is broadcast through &lt;code&gt;BaseEvent&lt;/code&gt;, &lt;code&gt;ActiveSupport::Notifications&lt;/code&gt; will invoke &lt;code&gt;call&lt;/code&gt; on the subscribed &lt;code&gt;AdEventListener&lt;/code&gt;. Perfect… but not exactly what we need yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApplicationEventListener&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;public_send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;"apply_&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;demodulize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;underscore&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;payload&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
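&lt;p&gt;The &lt;code&gt;demodulize&lt;/code&gt;/&lt;code&gt;underscore&lt;/code&gt; combination maps an event class name such as &lt;code&gt;Events::AdCreated&lt;/code&gt; to the handler method name &lt;code&gt;apply_ad_created&lt;/code&gt;. Outside Rails, the same mapping can be sketched in plain Ruby (the &lt;code&gt;handler_method_for&lt;/code&gt; helper below is hypothetical, shown only to illustrate what ActiveSupport does for us):&lt;/p&gt;

```ruby
# Illustrates the event-name-to-handler-method mapping used above:
# "Events::AdCreated" becomes "apply_ad_created".
# demodulize drops the namespace; underscore turns CamelCase into
# snake_case.
def handler_method_for(event_name)
  demodulized = event_name.split("::").last
  underscored = demodulized
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')
    .downcase
  "apply_#{underscored}"
end
```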



&lt;p&gt;and now we can define listeners in a very convenient form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AdEventListener&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationEventListener&lt;/span&gt;
  &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;self&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_ad_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;:)&lt;/span&gt;
      &lt;span class="no"&gt;Ad&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🤔 …but wait, something’s missing here. What’s &lt;code&gt;Ad.create!&lt;/code&gt;? We haven’t implemented that yet… and that part was omitted for a reason. &lt;/p&gt;

&lt;p&gt;What we implemented above is the CQRS part of the system, and &lt;code&gt;Ad&lt;/code&gt; is a read model. Its structure is an implementation detail: shape it to suit your query needs. In the example project I’ve also implemented &lt;code&gt;Events::AdModified&lt;/code&gt;, &lt;code&gt;Events::AdPublished&lt;/code&gt;, and &lt;code&gt;Events::AdRemoved&lt;/code&gt;. You can explore the full code in &lt;a href="https://github.com/addicted2sounds/event_sourced_ads"&gt;the project&lt;/a&gt;.&lt;/p&gt;
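&lt;p&gt;If you’d rather not dig into the repository right away, here is a hedged, in-memory sketch of what such a read model projection boils down to (the &lt;code&gt;AdReadModel&lt;/code&gt; class and its fields are hypothetical; in the example project the read model is a database-backed &lt;code&gt;Ad&lt;/code&gt;):&lt;/p&gt;

```ruby
# An in-memory stand-in for the Ad read model: a listener projects events
# into a simple queryable structure, fully decoupled from the write side.
class AdReadModel
  Record = Struct.new(:id, :title, :body, :published, keyword_init: true)

  def initialize
    @ads = {}
  end

  # Projection for Events::AdCreated: insert a new record.
  def apply_ad_created(stream_name:, data:)
    @ads[stream_name] = Record.new(
      id: stream_name, title: data[:title], body: data[:body], published: false
    )
  end

  # Projection for Events::AdPublished: flip the published flag.
  def apply_ad_published(stream_name:, data:)
    @ads[stream_name].published = true
  end

  def find(id)
    @ads[id]
  end
end
```

&lt;p&gt;The write side never queries this structure; it exists purely to serve reads, which is the whole point of separating the two models.&lt;/p&gt;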

&lt;h3&gt;
  
  
  Retrospective part
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;the one where we look over what we did&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We’ve just implemented an application using Event Sourcing from scratch. I would definitely recommend staying away from self-made solutions in production; several simplifications were made here, and you may need the omitted features once your project grows. Still, it’s good to know what’s inside the black box (gem) you use. &lt;/p&gt;

&lt;p&gt;In the upcoming articles, we’re going to play with some well-known tools for implementing event-sourced applications.&lt;/p&gt;

</description>
      <category>rails</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>RDS Database Migration Series - A horror story of using AWS DMS with a happy ending</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Mon, 18 Mar 2024 14:13:54 +0000</pubDate>
      <link>https://dev.to/smily/rds-database-migration-series-a-horror-story-of-using-aws-dms-with-a-happy-ending-ape</link>
      <guid>https://dev.to/smily/rds-database-migration-series-a-horror-story-of-using-aws-dms-with-a-happy-ending-ape</guid>
      <description>&lt;p&gt;In the &lt;a href="https://www.smily.com/engineering/a-story-of-a-spectacular-success-intro-to-aws-rds-database-migration-series"&gt;previous blog post&lt;/a&gt;, an intro to our &lt;strong&gt;database migration series&lt;/strong&gt;, we promised to tell the story of our challenges with &lt;a href="https://aws.amazon.com/dms/"&gt;AWS Database Migration Service&lt;/a&gt;, which turned out to be far from all sunshine and rainbows as it might initially seem after skimming through the documentation.&lt;/p&gt;

&lt;p&gt;Once we started using it, things went significantly downhill compared to our expectations, with plenty of errors and unexpected surprises. Nevertheless, we made the migration possible and efficient with extra custom flows outside DMS.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Database Migration Service - what is it and how does it work?
&lt;/h2&gt;

&lt;p&gt;If you already have data in any storage system (within or outside AWS) and want to move it to the Amazon-managed service, using Database Migration Service is the way to go. That implies that it's not a tool just for migrating from self-managed PostgreSQL to AWS RDS as we used it - it's just one of the possible paths. In fact, AWS DMS supports an impressive list of &lt;a href="https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Sources.html"&gt;sources&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Targets.html"&gt;targets&lt;/a&gt;, including non-obvious ones such as Redshift, S3 or Neptune.&lt;/p&gt;

&lt;p&gt;For example, it's possible to migrate data from &lt;a href="https://docs.aws.amazon.com/dms/latest/sbs/postgresql-s3datalake.html"&gt;PostgreSQL to S3&lt;/a&gt; and use AWS DMS for that purpose, which already gives an idea of how powerful the service can be.&lt;/p&gt;

&lt;p&gt;Essentially, we can have two types of migrations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Homogeneous&lt;/strong&gt; - when the source and the target database are equivalent, e.g., a migration from self-hosted PostgreSQL to AWS RDS PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heterogeneous&lt;/strong&gt; - when the source and the target database are different, e.g., a migration from Oracle to PostgreSQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our case, that was a homogeneous migration (from PostgreSQL to PostgreSQL), which sounds way simpler than the heterogeneous one (which is likely to require tricky schema conversions, among other things).&lt;/p&gt;

&lt;p&gt;When performing the migration via AWS DMS, we also need a middleman between the source and target databases responsible for reading data from the source database and loading it into the target database.&lt;/p&gt;

&lt;p&gt;There are two ways to work with that middleman:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS DMS Serverless - in that approach, AWS will take care of provisioning and managing the replication instance for us.&lt;/li&gt;
&lt;li&gt;Replication instance - the management of the replication instance is entirely up to us in this case.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On top of that, we have three types of homogeneous PostgreSQL migrations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Full load&lt;/strong&gt; - AWS DMS uses native &lt;em&gt;pg_dump/pg_restore&lt;/em&gt; for the migration process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full load + Change Data Capture (CDC)&lt;/strong&gt; - in the first stage, AWS DMS performs a Full load (so pg_dump/pg_restore) and then switches to ongoing logical replication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC&lt;/strong&gt; - the process is based entirely on logical replication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The choice here comes down mainly to the trade-off between simplicity and downtime. If you can tolerate some downtime (depending on the database size and type of data stored), a Full Load sounds like the preferable option, as fewer things can go wrong - it's simpler. If downtime is not an option, CDC (with or without Full Load) is the only way to achieve near-zero downtime, although the complexity might be a big concern.&lt;/p&gt;

&lt;h2&gt;
  
  
  The initial plan for migration and the first round of apps
&lt;/h2&gt;

&lt;p&gt;Our initial plan assumed that for applications where we could afford downtime outside of business hours (like 3 or 4 AM UTC+1), we would proceed with the Full Load approach, and for applications that could not tolerate the downtime required to perform the entire migration, we would likely go with Full Load + CDC.&lt;/p&gt;

&lt;p&gt;DMS Serverless also looked appealing as it would remove the overhead of managing the replication instance.&lt;/p&gt;

&lt;p&gt;We tested that approach with staging apps, and all migrations were smooth - there were no errors, and the process was fast. The databases were tiny, which helped with the speed, but in the end, the entire process looked promising and frictionless.&lt;/p&gt;

&lt;p&gt;So, we started with the migration of the production apps. The first few small ones were relatively fast, yet substantially slower than staging. But that made sense, as the size was significantly greater - it was no longer a few hundred megabytes but rather a few gigabytes or tens of gigabytes.&lt;/p&gt;

&lt;p&gt;Then, we got into far bigger databases, reaching over 100 GB. And this is where the severe issues began.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS DMS Serverless nightmare
&lt;/h2&gt;

&lt;p&gt;Before migrating any database, it's necessary to perform test migrations to get an idea of whether the process will work at all and how much time it might take. For big databases, it's even more critical to benchmark the process a few times to reliably estimate the duration. So, we did exactly that for the bigger databases and achieved promising and consistent results. The migrations were supposed to take quite a while, but that was still acceptable, so we proceeded and scheduled the longer migrations.&lt;/p&gt;

&lt;p&gt;Then, we ran into the first significant issue. In the previous benchmarks, we had consistently achieved a migration time below 1 hour while the database was under normal load. And then, out of nowhere, with no traffic on the source database, the migration was taking almost 2 hours with no sign of being close to finishing! Based on the expected size of the target database, which we knew from the test migrations, there was still a long way to go.&lt;/p&gt;

&lt;p&gt;Sadly, we had to stop the migration process, bring the app back up and running on the original database cluster, and think about what went wrong. Overall, waking up after 5 AM and the extended downtime were for nothing. We tried to replicate the issue with another test migration, but it worked just fine, so we considered this an isolated incident and committed to performing another migration the next day - although for a different application, as we wanted to avoid extended downtime for two days in a row.&lt;/p&gt;

&lt;p&gt;However, the next day wasn't any better. Even though the process took more or less what was expected based on the test, the database shrank from 267 GB to... 5332 MB! We expected some bloat, but bloat couldn't account for the majority of the size. And it was a very different result from what we had achieved during the test runs.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;migration status&lt;/em&gt; inside AWS DMS UI was &lt;em&gt;Load Complete&lt;/em&gt;, but after checking the number of records in the tables, it turned out all were empty!&lt;/p&gt;

&lt;p&gt;That was another failed migration, the second in a row, without any apparent reason why it failed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving to Replication Instance
&lt;/h2&gt;

&lt;p&gt;At that point, we concluded that the Serverless approach was not an option. It proved unreliable for bigger databases, and the lack of control over the process became an issue.&lt;/p&gt;

&lt;p&gt;Fortunately, we had one more option left for the &lt;em&gt;Full Load&lt;/em&gt; strategy - doing it via &lt;em&gt;Replication Instance&lt;/em&gt;. It looked more complex, but at the same time, we had more control over the process.&lt;/p&gt;

&lt;p&gt;We attempted the first migration, and wow, that was fast! It was way faster than Serverless, and all the records were there! That looked very promising. Except for the fact that all secondary indexes were missing! And foreign key constraints... And other constraints... And the sequences for generating numeric IDs! Literally everything was missing except for the actual rows...&lt;/p&gt;

&lt;p&gt;We double-checked the configuration, and there was nothing about excluding constraints. Also, the config for &lt;em&gt;LOBs&lt;/em&gt; was correct - a config param that one needs to be very careful about, as AWS DMS makes it easy for many data types to either not be migrated at all or to be truncated beyond a specific limit. And apparently, it's not only about &lt;em&gt;JSON&lt;/em&gt; or &lt;em&gt;array&lt;/em&gt; types but also &lt;em&gt;text&lt;/em&gt; types!&lt;/p&gt;

&lt;p&gt;We re-read the documentation to see what happens to the indexes during the migration, and we found very conflicting information, especially after our previous &lt;em&gt;Serverless&lt;/em&gt; migrations, which migrated the indexes and constraints without any issues.&lt;/p&gt;

&lt;p&gt;Let's see what AWS DMS documentation says about indexes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"AWS DMS supports basic schema migration, including the creation of tables and primary keys. However, AWS DMS doesn't automatically create secondary indexes, foreign keys, user accounts, and so on, in the target database." - how come it worked with Serverless, then? Based on the documentation, this statement doesn't seem to apply only to the Replication Instance.&lt;/li&gt;
&lt;li&gt;"For a full load task, we recommend that you drop primary key indexes, secondary indexes, referential integrity constraints, and data manipulation language (DML) triggers. Or you can delay their creation until after the full load tasks are complete." - here, it looks like indexes are indeed created automatically, but it's recommended to drop them before loading the data.&lt;/li&gt;
&lt;li&gt;"AWS DMS creates tables, primary keys, and in some cases unique indexes, but it doesn't create any other objects that aren't required to efficiently migrate the data from the source. For example, it doesn't create secondary indexes, non-primary key constraints, or data defaults." - &lt;em&gt;some cases?&lt;/em&gt; What does it even mean? And how does it match the behavior of the serverless approach?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyway, we had found the reason why the migration was so fast. We also had to find a way to recreate all the missing constraints, indexes, and other objects.&lt;/p&gt;

&lt;p&gt;Fortunately, native tools helped us a lot. To get all the indexes and constraints, we used &lt;em&gt;pg_dump&lt;/em&gt; with the &lt;em&gt;--section=post-data&lt;/em&gt; option and then inlined the content of the dump and ran it directly from the Postgres console to have better visibility and control over the process. To bring back sequences, we used &lt;a href="https://github.com/sinwoobang/dms-psql-post-data/blob/main/sequences_generator.sql"&gt;this script&lt;/a&gt;. It was very odd that AWS DMS does not offer any option to handle this - it's capable of migrating Oracle to Neptune, yet it can't smoothly handle indexes for the Replication Instance strategy, even though it's a trivial operation.&lt;/p&gt;
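
&lt;p&gt;The recovery steps above can be sketched as a small shell script. This is a sketch with placeholder connection URLs, not our exact tooling; &lt;em&gt;sequences_generator.sql&lt;/em&gt; is the script linked above, and the dry-run flag is an illustrative addition:&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of the post-load recovery steps. Connection URLs are placeholders.
# By default the script only prints the commands; set DRY_RUN=0 to run them.
set -eu

SOURCE_DB="postgres://user:pass@source-host/app_production"  # placeholder
TARGET_DB="postgres://user:pass@target-host/app_production"  # placeholder

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Recreate the indexes and constraints that the replication instance skipped.
run pg_dump "$SOURCE_DB" --section=post-data --file=post_data.sql
run psql "$TARGET_DB" --file=post_data.sql

# 2. Rebuild sequences: generate setval() statements on the source database,
#    then apply the generated SQL on the target database.
run psql "$SOURCE_DB" --tuples-only --file=sequences_generator.sql --output=set_sequences.sql
run psql "$TARGET_DB" --file=set_sequences.sql
```

&lt;p&gt;In dry-run mode the script just prints the four commands, which makes it easy to review the plan before pointing it at production databases.&lt;/p&gt;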

&lt;p&gt;After recreating all these items, the state of the application database looked correct according to our post-migration check script (which will be shared later)—all the indexes and constraints were there, and the record counts matched for all tables.&lt;/p&gt;

&lt;p&gt;At that point, we concluded that we were ready for another migration. And it looked smooth this time! It went fast, and the state of the source and target databases looked correct. We could bring back the application using a new database.&lt;/p&gt;

&lt;p&gt;It looked just perfect! At least until we started receiving very unexpected errors from Sentry: &lt;em&gt;PG::StringDataRightTruncation: ERROR: value too long for type character varying(8000) (ActiveRecord::ValueTooLong)&lt;/em&gt;. Why did it stop working correctly after the migration? And where is this &lt;em&gt;8000&lt;/em&gt; number coming from? Did AWS DMS convert the schema without saying anything about this?&lt;/p&gt;

&lt;p&gt;We quickly modified the database schema to remove the limit, and everything returned to normal. However, we had to find out what had happened.&lt;/p&gt;

&lt;p&gt;Let's see what the documentation says about schema conversion: "AWS DMS doesn't perform schema or code conversion". That clearly explains what happened! Another case where the documentation does not reflect reality.&lt;/p&gt;

&lt;p&gt;We couldn't find anything in the AWS DMS docs regarding the magic &lt;em&gt;8000&lt;/em&gt; number for &lt;em&gt;character varying&lt;/em&gt; type. However, we found &lt;a href="https://help.qlik.com/en-US/replicate/May2023/Content/Replicate/Main/Google%20Cloud%20SQL%20for%20PostgreSQL_Source/postgresql_data_types_source.htm"&gt;this&lt;/a&gt; - docs for &lt;em&gt;Qlik&lt;/em&gt; and the mapping between PostgreSQL types and Qlik Replicate data types. And it was there: &lt;em&gt;"CHARACTER VARYING - If no length is specified: WSTRING (8000)"&lt;/em&gt;, which was precisely the case! More conversions were also mentioned, for example, &lt;code&gt;NUMERIC&lt;/code&gt; -&amp;gt; &lt;code&gt;NUMERIC(28,6)&lt;/code&gt;, which also happened for a couple of columns in our case.&lt;/p&gt;

&lt;p&gt;It's not clear whether the two services are related, but this finding is definitely an interesting one.&lt;/p&gt;

&lt;p&gt;We haven't been able to confirm with 100% certainty why this exact magic number (8000) was applied here, but it's likely related to PostgreSQL page size, which is commonly 8 kB.&lt;/p&gt;

&lt;p&gt;That was not the end of our problems, though. The content of the affected columns got truncated! To fix this, we had to look for all records with content over 8000 characters and backfill the data from the source database to the target database if it hadn't been updated yet on the new database.&lt;/p&gt;

&lt;p&gt;We also had to do 3 more things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Review all columns using &lt;em&gt;character varying&lt;/em&gt; type and convert them to &lt;em&gt;text&lt;/em&gt; type if any row contains over 8000 characters.&lt;/li&gt;
&lt;li&gt;Stop allowing DMS to load the schema from the source database. Instead, use &lt;em&gt;pg_dump&lt;/em&gt; with the &lt;em&gt;--section=pre-data&lt;/em&gt; option to get the proper schema.&lt;/li&gt;
&lt;li&gt;Update our post-migration verification script to ensure that the schema stays the same.&lt;/li&gt;
&lt;/ol&gt;
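
&lt;p&gt;The review in step 1 can be scripted. Below is a minimal sketch that generates the audit and fix-up SQL per column - the &lt;em&gt;listings/description&lt;/em&gt; pair is a hypothetical example, and in practice the list of &lt;em&gt;character varying&lt;/em&gt; columns would come from &lt;em&gt;information_schema.columns&lt;/em&gt;:&lt;/p&gt;

```shell
#!/bin/sh
# Sketch: emit the audit and fix-up SQL for columns that ended up as
# character varying(8000) after the migration. The table/column pair below
# is a hypothetical example.
set -eu

audit_sql() {
  # Count rows whose content exceeds the unexpected 8000-character limit.
  echo "SELECT count(*) FROM $1 WHERE length($2) > 8000;"
}

fix_sql() {
  # Drop the limit by converting the column back to plain text.
  echo "ALTER TABLE $1 ALTER COLUMN $2 TYPE text;"
}

audit_sql listings description
fix_sql listings description
```

&lt;p&gt;Running the generated statements through &lt;em&gt;psql&lt;/em&gt; against the target database gives both the count of affected rows and the schema fix in one pass.&lt;/p&gt;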

&lt;h2&gt;
  
  
  Establishing the flow that works
&lt;/h2&gt;

&lt;p&gt;Until that point, the AWS DMS experience had been a horror story. Fortunately, this is where it stopped. We finally found a strategy that worked! The subsequent migrations were smooth, and we haven't experienced any issues since. Even the migrations of databases closer to 1 TB went just fine - although they were a bit slow and required a few hours of downtime.&lt;/p&gt;

&lt;p&gt;We could have achieved far better results in terms of minimizing the downtime by using CDC. But after our experience with &lt;em&gt;Full Load&lt;/em&gt;, the most straightforward approach, we didn't want to enable logical replication, let AWS DMS handle it, and find out that yet another disaster had happened - we lost trust in DMS, and we wanted to stick to something we knew worked.&lt;/p&gt;

&lt;p&gt;This approach worked well almost until the very end. The only friction we experienced with this final flow was the migration of the biggest database. We ran into a specific scenario where performance for one of the tables was far from acceptable, so we developed a simple custom service to speed up the migration. Yet, the other tables were perfectly migratable via DMS. Before the migration to RDS, that database was almost 11 TB, so it also required a serious effort to shrink its size before moving it to RDS. &lt;/p&gt;

&lt;p&gt;We will cover everything we've done to prepare that database for the migration in the upcoming blog post, along with the custom database migration service.&lt;/p&gt;

&lt;p&gt;The story might look chaotic, but that's intentional - even though we had found a couple of negative opinions about AWS DMS, the magnitude of the problems wasn't apparent, so this is the article we wish we had read before all the migrations. Hopefully, it clarifies that AWS DMS looks magnificent, but at the time of writing, the quality in many areas is closer to an open beta than to a production service that is supposed to deal with a business-critical asset - the data. Especially since AWS DMS proved incapable of fully handling a homogeneous migration - we had to use &lt;em&gt;pg_dump&lt;/em&gt;/&lt;em&gt;pg_restore&lt;/em&gt; to make it work.&lt;/p&gt;

&lt;p&gt;Nevertheless, if we were to migrate self-managed PostgreSQL clusters to AWS RDS one more time, we would use Database Migration Service again—we mastered the migration flow and understood the service's limitations to the extent that we would feel confident that we could make it work. And we developed a post-migration verification script that performs a thorough check to ensure that the target database's state is correct. Hopefully, after reading this article, you will be able to do the same thing without the problems we encountered in our migration journeys.&lt;/p&gt;

&lt;p&gt;Here is the final list of hints and recommendations for using AWS DMS when performing a homogeneous migration from self-managed PostgreSQL to AWS RDS PostgreSQL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Do not use AWS DMS Serverless for large databases. At the time of writing this article (before March 2024), it didn't prove to be a production-ready service. However, this might change in the future.&lt;/li&gt;
&lt;li&gt;Use the AWS DMS Replication Instance approach, which you can manage on your own.&lt;/li&gt;
&lt;li&gt;Execute the following steps for the migration:

&lt;ol&gt;
&lt;li&gt;Use &lt;em&gt;pg_dump&lt;/em&gt; with &lt;em&gt;--section=pre-data&lt;/em&gt; to load the schema - do not allow AWS DMS to load the schema, or you will end up with unexpected schema conversions.&lt;/li&gt;
&lt;li&gt;Use Replication Instance only to copy the rows between the source and the target database.&lt;/li&gt;
&lt;li&gt;Use &lt;em&gt;pg_dump&lt;/em&gt; with &lt;em&gt;--section=post-data&lt;/em&gt; to load the indexes and constraints after loading all the rows.&lt;/li&gt;
&lt;li&gt;Rebuild sequences (for numeric IDs - it doesn't apply to UUIDs) by running &lt;a href="https://github.com/sinwoobang/dms-psql-post-data/blob/main/sequences_generator.sql"&gt;this script&lt;/a&gt; on the source database and running the output on the target database.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;li&gt;Test the migration result with the following &lt;a href="https://gist.github.com/Azdaroth/1394e77eeb8eee59d80437642b18a549"&gt;Ruby/Rails script&lt;/a&gt;—this is the final version of the script after all the problematic migrations.&lt;/li&gt;
&lt;li&gt;Use either &lt;em&gt;Full LOB&lt;/em&gt; mode or &lt;em&gt;Inline LOB&lt;/em&gt; mode, or you will &lt;strong&gt;lose data&lt;/strong&gt; for many columns, especially with &lt;em&gt;JSON, array, or text&lt;/em&gt; types. We've managed to achieve the best performance using &lt;em&gt;Inline LOB&lt;/em&gt; mode. &lt;a href="https://gist.github.com/Azdaroth/1c4f6f5f1988f92413c7c7296c40da17"&gt;This script&lt;/a&gt; was quite handy for determining the config threshold size.&lt;/li&gt;
&lt;li&gt;Use &lt;em&gt;parallel load&lt;/em&gt;. The &lt;em&gt;range&lt;/em&gt; type works especially well for large tables using numeric IDs, as it allows you to divide the rows into segments by the range of IDs.&lt;/li&gt;
&lt;li&gt;If the source database can survive the high load during migration and there are many tables, aim for the highest value of &lt;em&gt;MaxFullLoadSubTasks&lt;/em&gt; (maximum 49), which determines the number of tables that can be loaded in parallel.&lt;/li&gt;
&lt;/ol&gt;
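
&lt;p&gt;To illustrate hint 7, here is a sketch of a table-mapping document using &lt;em&gt;parallel load&lt;/em&gt; with the &lt;em&gt;ranges&lt;/em&gt; type. The schema, table name, and ID boundaries are illustrative - the exact keys are described in the AWS DMS table-mapping documentation. &lt;em&gt;MaxFullLoadSubTasks&lt;/em&gt; from hint 8 lives separately, in the task settings under &lt;em&gt;FullLoadSettings&lt;/em&gt;, and the LOB settings under &lt;em&gt;TargetMetadata&lt;/em&gt;:&lt;/p&gt;

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-public-schema",
      "object-locator": { "schema-name": "public", "table-name": "%" },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": "2",
      "rule-name": "segment-big-table-by-id",
      "object-locator": { "schema-name": "public", "table-name": "bookings" },
      "parallel-load": {
        "type": "ranges",
        "columns": ["id"],
        "boundaries": [["25000000"], ["50000000"], ["75000000"]]
      }
    }
  ]
}
```

&lt;p&gt;With the &lt;em&gt;ranges&lt;/em&gt; type, each boundary closes one segment, so a mapping like the one above splits the table into segments loaded in parallel by ID range.&lt;/p&gt;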

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/dms/"&gt;Amazon Database Migration Service&lt;/a&gt; might initially seem like a perfect tool for a smooth and straightforward migration to &lt;a href="https://aws.amazon.com/rds/"&gt;RDS&lt;/a&gt;. However, our overall experience using it turned out to be closer to an open beta product rather than a production-ready tool for dealing with a critical asset of any company, which is its data. Nevertheless, with the extra adjustments, we made it work for almost all our needs.&lt;/p&gt;

&lt;p&gt;Stay tuned for the next part of the series, where we will focus on preparing the enormous database for the migration and on a very particular use case where our simple custom database migration tool was far more efficient than DMS (even up to 20x faster in a benchmark!), allowing us to migrate one of the databases using both AWS DMS and our custom solution simultaneously for different tables.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dms</category>
      <category>rds</category>
      <category>postgres</category>
    </item>
    <item>
      <title>A story of a spectacular success - Intro to AWS RDS database migration series</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Tue, 20 Feb 2024 11:55:17 +0000</pubDate>
      <link>https://dev.to/smily/a-story-of-a-spectacular-success-intro-to-aws-rds-database-migration-series-599n</link>
      <guid>https://dev.to/smily/a-story-of-a-spectacular-success-intro-to-aws-rds-database-migration-series-599n</guid>
      <description>&lt;h2&gt;
  
  
  Overview of our database initiative
&lt;/h2&gt;

&lt;p&gt;In recent months, one of our most significant initiatives was migrating from our self-managed PostgreSQL clusters to the managed service on Amazon - AWS RDS. Overall, we migrated 5 database clusters serving 54 applications (including staging ones), several of which had a size close to 1 TB, and one giant cluster with a single database of a massive size - 11 TB - which required a lot of extra work to make it migratable; otherwise, it would have taken an unacceptably long time to migrate. Not to mention the cost of keeping that storage.&lt;/p&gt;

&lt;p&gt;Overall, we migrated a considerable amount of data, and the initiative turned out to be way more complex than anticipated. We received very little support from the AWS Developer Support service, and - the trickiest part - the docs for Database Migration Service (AWS DMS) seemed poorly written: some parts were vague, or it was not clear which migration strategy they referred to ("AWS DMS creates only the objects required to efficiently migrate the data: tables, primary keys, and in some cases, unique indexes." - &lt;em&gt;some cases&lt;/em&gt;?). Some migration strategies rarely ever worked (&lt;em&gt;AWS DMS Serverless&lt;/em&gt; - I'm looking at you) - they often ended up with errors that said nothing about what went wrong, and the way to deal with it was to retry enough times - not exactly the smoothest path for migrating a production database! And that is just the Database Migration Service - we also had to make our Rails applications work with AWS RDS Proxy, which was another challenge!&lt;/p&gt;

&lt;p&gt;Although many things didn't go well, we managed to figure out a very robust and stable migration process, even for databases close to 1 TB in size. The outcome turned out to be way beyond expectations. Not only did we substantially decrease our AWS infrastructure costs, which may seem counter-intuitive, but a few databases ended up at 50% of their pre-migration size - thanks to getting rid of all the bloat that had accumulated for years, especially prior to introducing retention policies. And the biggest one, after a deep rework, turned out to be... slightly over 3% of its original size! Yes, from 10.9 TB to 347 GB! And all of this is on top of the great advantages that RDS brings!&lt;/p&gt;

&lt;p&gt;There is much to share and learn from these migrations. Hopefully, you will find this series helpful and will be able to assess whether migration to AWS RDS could be a good choice for you and what to expect from the process, which can play a massive role here.&lt;/p&gt;

&lt;p&gt;This article will focus on why we decided to migrate to AWS RDS with an extra overview of our infrastructure, its scale, and the challenges we used to have. In the next ones, we are going to move on to the following cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AWS DMS (Database Migration Service)&lt;/strong&gt; - all the issues we've faced, strategies we tried and one that worked for us most of the time, odd things in the documentation we encountered, and the final script we used to verify if the migration went as expected (why we even needed it in the first place is also going to be covered)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Our custom Database Migration service&lt;/strong&gt; - AWS DMS is a general-purpose tool that will be adequate for most use cases, but not all of them. We ran into two very specific scenarios where a self-built service allowed us to achieve a way better result (even 20x faster than AWS DMS!).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDS Proxy and how to make it work with Ruby on Rails&lt;/strong&gt; - don't expect things to work out of the box, even if everything was perfectly fine when using &lt;em&gt;pgbouncer&lt;/em&gt; as a connection pooler. There are extra things you will need to do to make it work, including monkey patching that looks highly questionable at first glance. Fortunately, it works without any issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to prepare for the migration of the almost 11 TB Postgres database&lt;/strong&gt; - ideally, you would never have a single PostgreSQL database this huge, but if that happens, there are certain things you might consider doing - or ways to prevent ending up in a similar situation. We will show what was enough for us and also discuss the potential alternatives we considered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extra insights from our infrastructure team&lt;/strong&gt; - what it took to integrate RDS with the rest of the infrastructure, how we made the RDS proxy work (and debugged its issues), and a couple of extras, like a more detailed integration with Datadog for low-level metrics.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For now, let's discuss how we ended up here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our infrastructure before (not only) AWS RDS migration
&lt;/h2&gt;

&lt;p&gt;Before discussing self-managed PostgreSQL clusters, let's look even further back to gain more context about our experience with self-managed services.&lt;/p&gt;

&lt;p&gt;We've been using AWS EC2 instances with self-managed components (Redis, PostgreSQL, Kubernetes...) for years, all bound by Chef, with occasional exceptions such as AWS MSK for Kafka or CloudAMQP for RabbitMQ. We had been quite satisfied with that approach, mainly when operating on a much smaller scale. As we grew, we started to experience various issues that kept getting worse.&lt;/p&gt;

&lt;p&gt;The greatest source of problems used to be our platform based on &lt;a href="https://github.com/teamhephy/workflow"&gt;Deis/Hephy Workflow&lt;/a&gt;, backed by Kubernetes. Not only did we have a lot of maintenance burden when it came to Kubernetes itself, especially upgrading it - and this matters a lot for a tiny infrastructure team that has a lot of other components to maintain - but we also significantly outgrew Deis itself, and working with it became a nightmare. Deployments for bigger applications with over 100 Docker containers were randomly failing, which is quite a big issue if the deployment takes over 30 minutes. Sometimes, the only way to make it work was to add extra computing power via provisioning an extra EC2 node. Since we didn't have autoscaling capabilities back then, you can only imagine how painful it was. On top of that, we had issues with &lt;a href="https://etcd.io/"&gt;etcd&lt;/a&gt; that used to cause the greatest problems in the middle of the night, triggering Pager Duty, especially when it was under a high load, and the way to deal with it was to significantly over-provision resources for the potential load spikes.&lt;/p&gt;

&lt;p&gt;The conclusion was clear - we had to migrate from both self-managed Kubernetes and Deis/Hephy. Given the complexity of our ecosystem of apps, keeping Kubernetes as an orchestrator was a natural choice. So was AWS EKS - Elastic Kubernetes Service. To our surprise, not only was the maintenance simpler, but it also seemed to be significantly cheaper compared to self-managed one on EC2 nodes (especially when considering overprovisioned instances for &lt;em&gt;etcd&lt;/em&gt; as well). That's how we managed to finish the migration in March 2023 to our new platform based on &lt;a href="https://aws.amazon.com/eks/"&gt;EKS&lt;/a&gt;, &lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt;, &lt;a href="https://argo-cd.readthedocs.io/en/stable/"&gt;ArgoCD&lt;/a&gt;, and &lt;a href="https://www.jenkins.io/"&gt;Jenkins&lt;/a&gt;, and the results have been amazing.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL clusters prior to migration
&lt;/h2&gt;

&lt;p&gt;Even though we achieved significant success with that massive initiative, that was not the end of our problems. PostgreSQL cluster maintenance started becoming a huge issue (sound familiar already?). Since we were hosting multiple databases on a single cluster, we are talking about several terabytes per cluster - plus one single application database that at that time was getting close to 10 TB. Performing any version upgrade was a challenging experience, and we were afraid of doing maintenance work there - if it's not broken, better not touch it. Especially after one of the incidents when the &lt;em&gt;pg_wal&lt;/em&gt; directory started to grow uncontrollably, reaching over 1 TB, due to the archiver getting stuck without any exit code. Fortunately, it eventually self-healed, yet that made the decision clear for us - we were at a scale where managing massive PostgreSQL clusters with a tiny infrastructure team was no longer an option, and we prioritized a migration to the managed PostgreSQL service, AWS RDS, especially since doing so for Kubernetes had turned out to be an ideal choice.&lt;/p&gt;

&lt;p&gt;If that wasn't enough, there were even more reasons why the migration was a good idea. Our monthly AWS invoices were huge! Not only for EC2 instances, but mostly the parts concerning Data Transfer (between regions) and EBS volumes. Things started to get interesting when we began estimating the costs of RDS clusters. The results were very promising - it turned out that RDS wouldn't be more expensive! If anything, it had great potential to be significantly cheaper - especially knowing that we had vast database bloat in most of the applications, so migrating to a new database could be a solution. You might think that performing a &lt;em&gt;vacuum full&lt;/em&gt; (which would require an extended downtime of the application or a part of it) or using &lt;a href="https://github.com/reorg/pg_repack"&gt;pg_repack&lt;/a&gt; (which doesn't require exclusive locks, but using it for enormous tables might not be trivial) would also solve the bloat problem without any migration. Not necessarily - yes, the bloat would be gone, but we would still be paying for the allocated EBS volumes, as the storage cannot be deallocated - it only grows. If we tried hard enough, we could figure out a migration path to a new EBS volume, copying the existing data after reducing the bloat and replacing the original volume, but this would merely address one item on a long list of problems (without even considering the complexity of doing that).&lt;/p&gt;

&lt;p&gt;What is more, we could start using instances powered by AWS Graviton processors, so there was also a good chance that the costs would be even lower as we would be likely able to use a smaller size of the instance compared to the legacy infra.&lt;/p&gt;

&lt;p&gt;There is also one more thing that is not considered very often - Disaster Recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disaster Recovery
&lt;/h2&gt;

&lt;p&gt;Disaster recovery is a critical aspect of database maintenance that is neglected way too often. Putting aside the incident response plan for such occasions, it gets tricky when you have tens of terabytes of data, tens of databases, and multiple clusters. At that scale, it will most likely involve a combination of taking EBS snapshots (e.g., once or twice per day) and storing WAL files to be replayed to get the latest possible state of the database.&lt;/p&gt;

&lt;p&gt;A key question at that point: how easy would it be for a small team of infra developers (who also need to maintain a lot of other components) who are not Postgres experts to maintain backups properly and to test and improve the procedure often enough to achieve a decent Recovery Point Objective (RPO - how much data we can lose, measured in time) and Recovery Time Objective (RTO - how much time it takes to bring the database back)? Let's just say that it would be challenging.&lt;/p&gt;

&lt;p&gt;Ideally, it would be something requiring minimal involvement on our side. Clearly, it cannot be fully "delegated" to a managed service - even if recovery took just one click in the UI, it would still require an intervention, so using a managed service doesn't remove the necessity of maintaining a proper Disaster Recovery plan. However, AWS RDS massively simplifies it.&lt;/p&gt;

&lt;p&gt;First of all, we don't need to know the deep details of restoring from an EBS snapshot and replaying WAL files, or any low-level procedures like that - AWS has our back here, providing proper tools for that purpose and a convenient interface, so that we can focus on the bigger picture.&lt;/p&gt;

&lt;p&gt;It also makes it trivial to achieve a very decent RPO - with &lt;em&gt;automated backups,&lt;/em&gt; it would be at most 5 minutes. For less drastic scenarios (like an outage of the master database instance), we have options such as &lt;a href="https://aws.amazon.com/rds/features/multi-az/"&gt;Multi-AZ&lt;/a&gt; offering a standby instance.&lt;/p&gt;

&lt;p&gt;Disaster Recovery with RDS deserves a separate article, and I would definitely recommend the &lt;a href="https://aws.amazon.com/blogs/database/implementing-a-disaster-recovery-strategy-with-amazon-rds/"&gt;one from the AWS blog&lt;/a&gt;. Nevertheless, the conclusion is clear - with minimal involvement, we can have a decent Disaster Recovery plan, and with some extra effort, an excellent one.&lt;/p&gt;

&lt;p&gt;That was yet another reason why we should consider AWS RDS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final decision and the results
&lt;/h2&gt;

&lt;p&gt;The final decision was made: perform a homogeneous (PostgreSQL to PostgreSQL) migration to AWS RDS. We also considered a different engine, such as &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_AuroraOverview.html"&gt;Aurora&lt;/a&gt;, which is compatible with PostgreSQL. It offers interesting benefits but would likely be more expensive than RDS PostgreSQL. It would also potentially increase the complexity of the migration and make the vendor lock-in even stronger, so sticking to plain PostgreSQL seemed like the best choice.&lt;/p&gt;

&lt;p&gt;And that was definitely the case! In the end, with migration to RDS, we achieved the following results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Drastically &lt;strong&gt;simpler maintenance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Superior&lt;/strong&gt; Disaster Recovery&lt;/li&gt;
&lt;li&gt;Significantly smaller database sizes - some approximately &lt;strong&gt;40%-50% smaller&lt;/strong&gt; thanks to eliminating the database bloat, and in one case a reduction of close to 97%, &lt;strong&gt;from almost 11 TB to less than 350 GB&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Overall &lt;strong&gt;reduction&lt;/strong&gt; of AWS &lt;strong&gt;costs&lt;/strong&gt; by close to &lt;strong&gt;30%&lt;/strong&gt; - yes, it's better and so much cheaper!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sounds exciting? If so, stay with us for the rest of the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Moving &lt;strong&gt;terabytes of data&lt;/strong&gt; and &lt;strong&gt;tens of PostgreSQL databases&lt;/strong&gt; to the &lt;strong&gt;AWS RDS&lt;/strong&gt; service was a massive and complex initiative. As our research and post-migration metrics confirm, the effort proved worth it, with certain outcomes greatly exceeding our expectations - the &lt;strong&gt;result is superior&lt;/strong&gt; in many aspects compared to self-managed clusters, including way &lt;strong&gt;lower infrastructure costs&lt;/strong&gt;! Follow this series to learn whether this might be a good choice for you and what to expect along the way.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>aws</category>
      <category>rds</category>
      <category>devops</category>
    </item>
    <item>
      <title>Integration Patterns for Distributed Architecture - Intro to dionysus-rb</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Mon, 18 Dec 2023 16:02:22 +0000</pubDate>
      <link>https://dev.to/smily/integration-patterns-for-distributed-architecture-intro-to-dionysus-rb-gi8</link>
      <guid>https://dev.to/smily/integration-patterns-for-distributed-architecture-intro-to-dionysus-rb-gi8</guid>
      <description>&lt;p&gt;Integration Patterns for Distributed Architecture - Intro to dionysus-rb&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.smily.com/engineering/integration-patterns-for-distributed-architecture-how-we-use-kafka-in-smily-and-why"&gt;In the previous blog post&lt;/a&gt;, I promised we would introduce something special this time. So here we go - meet almighty &lt;a href="http://github.com/BookingSync/dionysus-rb"&gt;Dionysus&lt;/a&gt;, who knows how to make the most of Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  Change Data Capture
&lt;/h2&gt;

&lt;p&gt;Change Data Capture is a popular pattern for establishing communication between microservices - it turns every insert/update/delete on any row in any table into an individual event that other services can consume, which not only notifies the other service about the change but also transfers the data.&lt;/p&gt;
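&lt;p&gt;To make this concrete, here is a sketch in plain Ruby of what a single row-level change event might look like. The field names are modeled loosely on Debezium's change event envelope and are illustrative only - the exact schema depends on the connector and its configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# A hypothetical CDC event for an UPDATE on the "users" table.
# Field names loosely follow Debezium's envelope (illustrative only).
cdc_event = {
  op: "u", # c - create, u - update, d - delete
  source: { table: "users" },
  before: { id: 1, name: "Old Name" },
  after: { id: 1, name: "New Name" }
}

# Columns whose values differ between the before and after images:
changed_columns = cdc_event[:after].reject { |key, value| cdc_event[:before][key] == value }.keys
puts changed_columns.inspect # prints [:name]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note how the event is expressed purely in terms of tables and columns - the consumer is exposed directly to the upstream database schema.&lt;/p&gt;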

&lt;p&gt;Thanks to tools like &lt;a href="http://debezium.io"&gt;Debezium&lt;/a&gt;, this is relatively straightforward to implement if you use Kafka. However, this approach has one serious problem - coupling to the database schema of the upstream service.&lt;/p&gt;

&lt;p&gt;Individual tables and their columns often don't reflect the domain correctly in the upstream service, especially with relational databases. And for downstream microservices, it gets even worse. Not only might your domain model be composed of multiple entities (think of Domain-Driven Design Aggregates), but some attributes' values might be the result of a computation depending on more than a single entity, or it might be desirable to publish some entity/aggregate change when one of its dependencies changes. For example, you might want to publish an event that some Account got updated when a new Rental is created, to propagate the change of a potential &lt;code&gt;rentals_count&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;Such an approach is quite natural when building HTTP APIs as it's simple to expose resources that don't directly map to database schema. Yet, with the CDC, this might be challenging. A potential workaround would be creating dedicated database tables that would store the data in the expected format and refresh them based on dependencies in the domain (so updating &lt;code&gt;rentals_count&lt;/code&gt; in an appropriate row for a given Account after a new Rental is created if considering the example above), which would be pretty similar to materialized views. Nevertheless, it's still more like a workaround to comply with some constraints - in that case, it would be CDC operating on database rows.&lt;/p&gt;

&lt;p&gt;A more natural approach would be CDC on the domain-model level. Something that would be close to defining serializers for REST APIs.&lt;/p&gt;

&lt;p&gt;Meet almighty Dionysus, who knows how to make the most of Karafka to achieve that result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dionysus-rb
&lt;/h2&gt;

&lt;p&gt;Dionysus is quite a complex gem with multiple features, and some of them could use a separate blog post, which is something that we are likely to publish in the near future. Yet, the gem's documentation would be your best friend for now. Keep in mind, though, that this has been a private gem for a long time, so at the time of writing this article, some parts of the documentation might not be super clear. &lt;/p&gt;

&lt;p&gt;Let's now implement a simple producer and consumer to demonstrate the gem's capabilities. Before releasing anything to production, read all the docs first. The following example is supposed to show the simplest possible scenario only, which is far from something that would be production-grade. &lt;/p&gt;

&lt;h2&gt;
  
  
  Example App
&lt;/h2&gt;

&lt;p&gt;Let's start with the producer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Producer
&lt;/h3&gt;

&lt;p&gt;First, generate a new application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails new dionysus_producer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and add &lt;code&gt;dionysus-rb&lt;/code&gt; to the Gemfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gem "dionysus-rb"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create the database as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now, we can create a &lt;code&gt;karafka.rb&lt;/code&gt; file with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;
&lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_application!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="ss"&gt;environment: &lt;/span&gt;&lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"RAILS_ENV"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="ss"&gt;seed_brokers: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"127.0.0.1:9092"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# assuming that this is where the kafka is running&lt;/span&gt;
  &lt;span class="ss"&gt;client_id: &lt;/span&gt;&lt;span class="s2"&gt;"dionysus_producer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;logger: &lt;/span&gt;&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logger&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a simple demo, let's assume that we have a User model with a single &lt;code&gt;name&lt;/code&gt; attribute on both the producer and the consumer side, to keep things simple.&lt;/p&gt;

&lt;p&gt;Let's generate the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails generate model User name:string
rails db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let's make this model publishable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Producer&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Outbox&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ActiveRecordPublishable&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will also use the transactional outbox pattern to ensure maximum durability so that we don't lose messages. As an optimization, we will also publish messages right after the commit. &lt;/p&gt;

&lt;p&gt;In the production setup, you should also run an outbox worker as a separate process so that it can pick up any messages that failed for some reason, but again, to keep things simple, we are not going to do this for this demonstration. &lt;/p&gt;
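&lt;p&gt;Before wiring it up, here is the essence of the pattern as a toy, in-memory sketch in plain Ruby (illustrative only - &lt;code&gt;OutboxEntry&lt;/code&gt; and the inline comments are hypothetical names, not dionysus-rb internals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy, in-memory illustration of the transactional outbox pattern.
# The business record and the outbox row are written in one database
# transaction, so either both are persisted or neither is; publishing
# to Kafka happens only after the commit.
OutboxEntry = Struct.new(:event_name, :payload, :published_at)

outbox = []

# Step 1 - inside the transaction: persist the record AND the outbox row.
user = { id: 1, name: "Dionysus" }
outbox.push(OutboxEntry.new("user_created", user, nil))

# Step 2 - after commit: publish pending entries and stamp published_at.
outbox.select { |entry| entry.published_at.nil? }.each do |entry|
  # publishing to Kafka would happen here
  entry.published_at = Time.now
end

puts outbox.first.published_at.nil? # prints false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An entry that fails to publish keeps a &lt;code&gt;nil&lt;/code&gt; &lt;code&gt;published_at&lt;/code&gt;, which is what a separate outbox worker process would later pick up and retry.&lt;/p&gt;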

&lt;p&gt;Let's generate the outbox model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails generate model DionysusOutbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And use the following migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreateDionysusOutbox&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Migration&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;7.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;change&lt;/span&gt;
    &lt;span class="n"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:dionysus_outboxes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"resource_class"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"resource_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"event_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"topic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"partition_key"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"published_at"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"failed_at"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"retry_at"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"error_class"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt; &lt;span class="s2"&gt;"error_message"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integer&lt;/span&gt; &lt;span class="s2"&gt;"attempts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;default: &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;precision: &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt; &lt;span class="s2"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;precision: &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;null: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;

      &lt;span class="c1"&gt;# some of these indexes are not needed, but they are here for convenience when checking stuff in console or when using a tartarus for archiving&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"topic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_publishing_idx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;where: &lt;/span&gt;&lt;span class="s2"&gt;"published_at IS NULL"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"resource_class"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"event_name"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_on_resource_class_and_event"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"resource_class"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"resource_id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_on_resource_class_and_resource_id"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"topic"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_on_topic"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_on_created_at"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"resource_class"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_on_resource_class_and_created_at"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"resource_class"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"published_at"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_on_resource_class_and_published_at"&lt;/span&gt;
      &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"published_at"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"index_dionysus_outboxes_on_published_at"&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run the migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And include the outbox module in the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DionysusOutbox&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Producer&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Outbox&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Model&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can move on now to more Kafka-related things - topics. Or rather a single topic - to publish users. Let's wrap it in the &lt;code&gt;dionysus_demo&lt;/code&gt; namespace so the actual Kafka topic name will be &lt;code&gt;dionysus_demo_users&lt;/code&gt;.&lt;/p&gt;
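&lt;p&gt;Assuming the naming follows the example above, the resulting Kafka topic name is simply the namespace and the topic joined with an underscore (a simplification for this demo - check the gem's docs for the exact rules):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative topic-name composition; the actual logic lives in dionysus-rb.
namespace = :dionysus_demo
topic_name = :users
kafka_topic = "#{namespace}_#{topic_name}"
puts kafka_topic # prints dionysus_demo_users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;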

&lt;p&gt;We will also need to define two serializers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the primary one that infers other serializers based on the model class (&lt;code&gt;DionysusDemoSerializer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;the actual serializer for the model (&lt;code&gt;UserSerializer&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Knowing all these things, let's create &lt;code&gt;dionysus.rb&lt;/code&gt; initializer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;application&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_prepare&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;Karafka&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;App&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;WaterDrop&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;producer_config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;producer_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kafka&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s1"&gt;'bootstrap. servers'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'localhost:9092'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# this needs to be a comma-separated list of brokers&lt;/span&gt;
        &lt;span class="s1"&gt;'request.required. acks'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"client.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"dionysus_producer"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="n"&gt;producer_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dionysus_producer"&lt;/span&gt;
      &lt;span class="n"&gt;producer_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deliver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;true&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;database_connection_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transaction_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt; 
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outbox_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;DionysusOutbox&lt;/span&gt; 
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_partition_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ss"&gt;:id&lt;/span&gt; &lt;span class="c1"&gt;# we don't care about the partition key at this time, but we need to provide something&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transactional_outbox_enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish_after_commit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;true&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;declare&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="ss"&gt;:dionysus_demo&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;serializer&lt;/span&gt; &lt;span class="no"&gt;DionysusDemoSerializer&lt;/span&gt;

      &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ss"&gt;:users&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="n"&gt;publish&lt;/span&gt; &lt;span class="no"&gt;User&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let's create the serializers mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DionysusDemoSerializer&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Producer&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Serializer&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infer_serializer&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;model_klass&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;Serializer"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;constantize&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only method we care about at this stage is &lt;code&gt;infer_serializer&lt;/code&gt;. The implementation is pretty simple - it infers the &lt;code&gt;UserSerializer&lt;/code&gt; class from the &lt;code&gt;User&lt;/code&gt; model.&lt;/p&gt;
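&lt;p&gt;If you are curious what that inference boils down to, here is a plain-Ruby sketch, using the standard library's &lt;code&gt;Object.const_get&lt;/code&gt; in place of ActiveSupport's &lt;code&gt;constantize&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Plain-Ruby equivalent of "#{model_klass}Serializer".constantize.
# constantize comes from ActiveSupport; Object.const_get is the
# standard-library counterpart for this simple case.
class User; end
class UserSerializer; end

model_klass = User
serializer_class = Object.const_get("#{model_klass}Serializer")
puts serializer_class # prints UserSerializer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;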

&lt;p&gt;And the second serializer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class UserSerializer &amp;lt; Dionysus::Producer::ModelSerializer
  attributes :name, :id, :created_at, :updated_at
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's run the Rails console and see how everything is working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"Dionysus"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="no"&gt;DionysusOutbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The outbox should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;#&amp;lt;DionysusOutbox:0x0000000112e2b400&lt;/span&gt;
 &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;resource_class: &lt;/span&gt;&lt;span class="s2"&gt;"User"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;resource_id: &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;event_name: &lt;/span&gt;&lt;span class="s2"&gt;"user_created"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;topic: &lt;/span&gt;&lt;span class="s2"&gt;"dionysus_demo_users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;partition_key: &lt;/span&gt;&lt;span class="s2"&gt;"[FILTERED]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;published_at: &lt;/span&gt;&lt;span class="no"&gt;Fri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="no"&gt;Dec&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;59&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;45.541653000&lt;/span&gt; &lt;span class="no"&gt;UTC&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;failed_at: &lt;/span&gt;&lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;retry_at: &lt;/span&gt;&lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;error_class: &lt;/span&gt;&lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;error_message: &lt;/span&gt;&lt;span class="kp"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;attempts: &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;created_at: &lt;/span&gt;&lt;span class="no"&gt;Fri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="no"&gt;Dec&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;59&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;45.481140000&lt;/span&gt; &lt;span class="no"&gt;UTC&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;updated_at: &lt;/span&gt;&lt;span class="no"&gt;Fri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="no"&gt;Dec&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;59&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;45.481140000&lt;/span&gt; &lt;span class="no"&gt;UTC&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A timestamp in &lt;code&gt;published_at&lt;/code&gt; means the record was successfully published to Kafka. So we are done as far as the producer goes!&lt;/p&gt;

&lt;p&gt;Let's add a consumer that will be able to consume these messages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumer
&lt;/h3&gt;

&lt;p&gt;First, generate a new application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails new dionysus_producer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and add &lt;code&gt;dionysus-rb&lt;/code&gt; to the Gemfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gem "dionysus-rb"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create the database as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bundle exec rake db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now, we can create a &lt;code&gt;karafka.rb&lt;/code&gt; file with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_application!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="ss"&gt;environment: &lt;/span&gt;&lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"RAILS_ENV"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="ss"&gt;seed_brokers: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"127.0.0.1:9092"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# assuming that this is where the kafka is running&lt;/span&gt;
  &lt;span class="ss"&gt;client_id: &lt;/span&gt;&lt;span class="s2"&gt;"dionysus_producer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="ss"&gt;logger: &lt;/span&gt;&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logger&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the consumer is going to consume events related to the User, let's create a model for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails generate model User name:string synced_id:bigint synced_created_at:datetime synced_updated_at:datetime synced_data:jsonb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;synced_id&lt;/code&gt; references the primary key on the producer side, &lt;code&gt;synced_created_at&lt;/code&gt;/&lt;code&gt;synced_updated_at&lt;/code&gt; are the timestamps from the producer, and &lt;code&gt;synced_data&lt;/code&gt; is a JSON document containing all the published attributes.&lt;/p&gt;

&lt;p&gt;Let's run the migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rails db:migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will need to do two more things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;declare which topic we want to consume from - we need the &lt;code&gt;users&lt;/code&gt; topic under the &lt;code&gt;dionysus_demo&lt;/code&gt; namespace&lt;/li&gt;
&lt;li&gt;infer the User model for User-related messages - we will do this via &lt;code&gt;model_factory&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's create the &lt;code&gt;dionysus.rb&lt;/code&gt; initializer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;application&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_prepare&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;declare&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="ss"&gt;:dionysus_demo&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ss"&gt;:users&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transaction_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ActiveRecord&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Base&lt;/span&gt;
      &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;DionysusModelFactory&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="no"&gt;Dionysus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_application!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="ss"&gt;environment: &lt;/span&gt;&lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"RAILS_ENV"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="ss"&gt;seed_brokers: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"127.0.0.1:9092"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="ss"&gt;client_id: &lt;/span&gt;&lt;span class="s2"&gt;"dionysus_consumer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;logger: &lt;/span&gt;&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logger&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And define the &lt;code&gt;DionysusModelFactory&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DionysusModelFactory&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;constantize&lt;/span&gt; &lt;span class="k"&gt;rescue&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, from the "User" string, we will infer the &lt;code&gt;User&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;We can now run the &lt;code&gt;karafka server&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bundle exec karafka server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let's check the end result in the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That should give us a result similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;#&amp;lt;User:0x0000000110a420e8&lt;/span&gt;
 &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s2"&gt;"Dionysus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;synced_id: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;synced_created_at: &lt;/span&gt;&lt;span class="no"&gt;Fri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="no"&gt;Dec&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;36.280000000&lt;/span&gt; &lt;span class="no"&gt;UTC&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;synced_updated_at: &lt;/span&gt;&lt;span class="no"&gt;Fri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="no"&gt;Dec&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;36.280000000&lt;/span&gt; &lt;span class="no"&gt;UTC&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;synced_data: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"Dionysus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"synced_id"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"synced_created_at"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"2023-12-08T14:02:36.280Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"synced_updated_at"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"2023-12-08T14:02:36.280Z"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
 &lt;span class="ss"&gt;created_at: &lt;/span&gt;&lt;span class="no"&gt;Fri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="no"&gt;Dec&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;42.171312000&lt;/span&gt; &lt;span class="no"&gt;UTC&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="ss"&gt;updated_at: &lt;/span&gt;&lt;span class="no"&gt;Fri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="no"&gt;Dec&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;42.171312000&lt;/span&gt; &lt;span class="no"&gt;UTC&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's that simple to use Dionysus and implement CDC on the domain model level!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;This blog post introduced dionysus-rb - a robust framework built on top of Karafka that enables CDC (Change Data Capture)/logical replication on the domain model level. We covered only a tiny portion of what Dionysus is capable of, so stay tuned for the upcoming blog posts.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>rails</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Introduction to Event Sourcing and CQRS</title>
      <dc:creator>Oleg Borys</dc:creator>
      <pubDate>Wed, 13 Dec 2023 23:36:17 +0000</pubDate>
      <link>https://dev.to/smily/introduction-to-event-sourcing-and-cqrs-1680</link>
      <guid>https://dev.to/smily/introduction-to-event-sourcing-and-cqrs-1680</guid>
      <description>&lt;p&gt;In a galaxy far, far away, enter the saga of CQRS and event sourcing, where data updates unfold like an epic space opera. Jokes away, let's see what are those&lt;/p&gt;

&lt;p&gt;Event Sourcing is not a new term. It is a powerful paradigm for managing data in the ever-evolving software development landscape. While not as widely known as other data management methods, it offers a unique approach that can significantly benefit applications.&lt;/p&gt;

&lt;p&gt;Command Query Responsibility Segregation (CQRS) is an architectural paradigm that divides the responsibilities of read and write operations within a system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At its heart is the notion that you can use a different model to update information than the model you use to read information &lt;a href="https://martinfowler.com/bliki/CQRS.html"&gt;Martin Fowler&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the realm of CQRS application architecture, the system is separated into two distinct segments. The first caters to the realm of creations, updates, and deletions – a territory we call the write model. Concurrently, the second segment takes on the noble task of reading – aptly named the read model. Unlike the conventional CRUD approach, which relies on a single database, CQRS embraces two databases (or at least different tables). Each side is obsessed with the act of reading or writing.&lt;/p&gt;
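&lt;p&gt;The split can be made concrete with a minimal plain-Ruby sketch (class names are ours, and the read model is updated synchronously here purely for brevity):&lt;/p&gt;

```ruby
# Hypothetical sketch: the write side handles commands and records events,
# while the read side keeps a view optimized purely for queries.
class OrderWriteModel
  attr_reader :events

  def initialize
    @events = []
  end

  def place_order(id)
    @events << { type: :order_placed, id: id }
  end
end

class OrderReadModel
  def initialize
    @order_ids = []
  end

  # In a real system this runs asynchronously; synchronous here for brevity.
  def apply(event)
    @order_ids << event[:id] if event[:type] == :order_placed
  end

  def all_order_ids
    @order_ids
  end
end

write = OrderWriteModel.new
read = OrderReadModel.new
write.place_order(42)
write.events.each { |event| read.apply(event) }
read.all_order_ids # => [42]
```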

&lt;p&gt;To address scalability concerns, CQRS can be coupled with event sourcing. In this scenario, events generated by commands are stored in an event store. These events are subsequently asynchronously transmitted to a separate read data store, undergoing a transformation process to align with the read data model. This integration helps overcome some of the scalability challenges inherent in CQRS.&lt;/p&gt;

&lt;p&gt;In this article, we will delve into the fundamentals of Event Sourcing coupled with CQRS, explore its advantages, and discuss when it might be better to avoid using this approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is event-sourcing
&lt;/h2&gt;

&lt;p&gt;If you work with technology, you must have come across it. It's a puissant tool used by many large organizations for data modeling. It can scale and meet the needs of the modern data processing industry.&lt;/p&gt;

&lt;p&gt;Event sourcing is a compelling architectural pattern that might initially seem a bit eccentric. Instead of focusing on the state of your system, it keeps track of every change that happens. It's like holding a detailed diary of every emotional rollercoaster it goes through. All the events that are changing the state of your system are recorded, and such records serve as both a source for the current state and an audit trail of what has happened in the application over its lifetime.&lt;/p&gt;

&lt;p&gt;Domain experts usually describe their systems as a collection of entities, which are containers for storing state, and events, which represent changes to entities as a result of processing input data within various business processes. Events are often triggered by commands invoked by users, background processes, or integrations with external systems. The naming is usually defined by the Ubiquitous Language.&lt;/p&gt;

&lt;p&gt;Ubiquitous Language is a domain-driven design (DDD) concept, a software development approach. Imagine Ubiquitous Language as the magical Babelfish of software development. It's the secret code that lets developers, domain experts, and stakeholders all speak the same lingo, like a universal translator for geeks and business folks. With this shared language, the jargon barrier becomes a thing of the past, and everyone can boogie on the same wavelength.&lt;/p&gt;

&lt;p&gt;Among existing practices to define the ubiquitous language, I'd emphasize &lt;a href="https://en.wikipedia.org/wiki/Event_storming"&gt;Event-storming&lt;/a&gt;. However, sometimes it may not be the best option (team or resource constraints, limited initial knowledge, etc.). Then, the process may be neglected, and developers apply the most widely used terms.&lt;/p&gt;

&lt;p&gt;Many architectural patterns treat entities as a primary concept. These patterns describe how to store them, how to access them, and how to modify them. Within this architectural style, events are often "on the side": they are the consequences of entity changes. Unlike the traditional systems, with an event-sourcing approach, the events are considered the only source of truth.&lt;/p&gt;

&lt;p&gt;At first glance, it may sound unusual, but most of the serious systems we know and interact with do not emphasize the concept of the current state or the final state (financial, banking, and many others). As Greg Young (creator of &lt;a href="https://en.wikipedia.org/wiki/Command_Query_Responsibility_Segregation"&gt;CQRS&lt;/a&gt;) &lt;a href="https://youtu.be/JHGkaShoyNs"&gt;said in one of his speeches&lt;/a&gt;, your bank account is not just a column in a table but the sum of all transactions that occurred on it (renewals, write-offs, and recovery). For example, if you disagree with your bank – their records show a balance of 69 dollars while you believe it is 96 – you won't hear a response like "The column states it's 69 dollars, and it's all you have"; the transaction history settles the question.&lt;/p&gt;
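&lt;p&gt;The bank-account argument can be expressed in a couple of lines: the balance is a fold over the transaction events, not a value read from a mutable column (a minimal sketch with made-up figures):&lt;/p&gt;

```ruby
# The balance is derived by replaying every transaction event,
# not read from a mutable column. Figures are made up.
transactions = [
  { type: :deposit,    amount: 100 },
  { type: :withdrawal, amount: 31 }
]

balance = transactions.reduce(0) do |sum, tx|
  tx[:type] == :deposit ? sum + tx[:amount] : sum - tx[:amount]
end

balance # => 69
```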

&lt;p&gt;Designing systems with a focus on events and event logs provides the following benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It helps reduce impedance mismatches and the need for concept mapping, allowing technology teams to "speak the same language" (ubiquitous language) as the business when discussing the system.&lt;/li&gt;
&lt;li&gt;Encourages separation of responsibility into commands and queries (command/query responsibility segregation), allowing you to optimize writing and reading independently of each other.&lt;/li&gt;
&lt;li&gt;It provides temporality and a history of change, allowing questions to be answered about what the system looked like at specific points in the past and what events occurred before that point.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Concepts of Event Sourcing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Events
&lt;/h3&gt;

&lt;p&gt;Events are the fundamental building blocks of Event Sourcing. They represent discrete changes in the state of an entity. Each event is immutable and contains information about what happened when it occurred and any relevant data associated with the change.&lt;/p&gt;

&lt;p&gt;You can register as many events as you like. The name of the event should have clear semantics. Events are always talking about the past and can reveal with their names what has changed and how it has changed (for example, OrderCanceled, OrderItemAdded, ProductRemoved, OrderPlaced).&lt;/p&gt;
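&lt;p&gt;In Ruby, such past-tense events can be modeled as small frozen value objects - a sketch under our own naming, not any particular library's API:&lt;/p&gt;

```ruby
# A minimal immutable event: what happened, when, and the data involved.
OrderPlaced = Struct.new(:order_id, :occurred_at, :data, keyword_init: true) do
  def initialize(**kwargs)
    super(**kwargs)
    freeze # events are never mutated once recorded
  end
end

event = OrderPlaced.new(order_id: 1, occurred_at: Time.now, data: { total: 100 })
event.frozen?  # => true
event.order_id # => 1
```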

&lt;h3&gt;
  
  
  Event Store
&lt;/h3&gt;

&lt;p&gt;The Event Store is the central repository for storing events in the order they occurred. It ensures that events are appended to the end of the log and provides methods to read, write, and query events.&lt;/p&gt;
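&lt;p&gt;A minimal in-memory sketch of that append-only interface (real event stores add persistence, ordering guarantees, and optimistic concurrency control on top):&lt;/p&gt;

```ruby
# An append-only, in-memory event store: events can only be appended
# and are read back in the order in which they occurred.
class InMemoryEventStore
  def initialize
    @streams = Hash.new { |hash, key| hash[key] = [] }
  end

  def append(stream, event)
    @streams[stream] << event
    event
  end

  def read(stream)
    @streams[stream].dup # never hand out the internal log itself
  end
end

store = InMemoryEventStore.new
store.append("order-1", { type: :order_placed })
store.append("order-1", { type: :order_item_added, item: "book" })
store.read("order-1").map { |event| event[:type] } # => [:order_placed, :order_item_added]
```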

&lt;h3&gt;
  
  
  Aggregates
&lt;/h3&gt;

&lt;p&gt;Imagine Aggregates as a consistency boundary around a group of domain objects, such as entities and value objects. In Event Sourcing, Aggregates are reconstituted from a single fine-grained event stream (e.g., representing an order flow). During this operation, the current state of the aggregate is calculated so that it can be used to handle a command.&lt;/p&gt;

&lt;p&gt;The state is a crucial part of an Aggregate's toolkit. It's like their memory is wiped clean after every command unless they're into some fancy snapshotting business. Consider it a superhero's utility belt, equipped to deal with duplicate commands and other unexpected villains, thanks to at-least-once delivery guarantees. It's not there to play movies; it's there to make decisions.&lt;/p&gt;
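&lt;p&gt;Reconstituting an aggregate is essentially a replay: start from a blank state and apply every event from the stream in order. A sketch using the event names mentioned earlier (class and field names are ours):&lt;/p&gt;

```ruby
# Rebuilding an aggregate's current state by replaying its event stream.
class OrderAggregate
  attr_reader :items, :cancelled

  def initialize
    @items = []
    @cancelled = false
  end

  def self.from_events(events)
    events.each_with_object(new) { |event, order| order.apply(event) }
  end

  def apply(event)
    case event[:type]
    when :order_item_added then @items << event[:item]
    when :product_removed  then @items.delete(event[:item])
    when :order_canceled   then @cancelled = true
    end
  end
end

stream = [
  { type: :order_item_added, item: "book" },
  { type: :order_item_added, item: "pen" },
  { type: :product_removed,  item: "pen" }
]

order = OrderAggregate.from_events(stream)
order.items     # => ["book"]
order.cancelled # => false
```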

&lt;h3&gt;
  
  
  Projection
&lt;/h3&gt;

&lt;p&gt;Projections are used to derive the current state of an entity or a view of the data by replaying events. Projections are separate from the Event Store and can be optimized for specific query needs.&lt;/p&gt;

&lt;p&gt;An integral part of Event Sourcing is the concept of snapshots – intermediate captures of an aggregate's state in the Event Store. Sometimes, obtaining the final state of an object requires replaying many events, starting from the very first one. To optimize this process, snapshots are taken, and the final state is restored not from the very first event in the system but from the latest snapshot. Despite the apparent benefit of this pattern, it should only be applied when obtaining the final state takes excessive time.&lt;/p&gt;

&lt;p&gt;Many might think that a downside of this approach is exponential growth of the data volume, but that's not the case: events store only the state changes, not the complete data models. In reality, the slowdown in such a system comes from constructing aggregates from all the domain events that have occurred until now.&lt;/p&gt;

&lt;p&gt;So, Event Sourcing is the storage of a series of events, and the data schema that reflects these events is a direct derivative of them. The data schema in Event Sourcing systems is temporary and can be rebuilt or reconstructed from events at any time. Isn't it like a time machine?&lt;/p&gt;
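&lt;p&gt;The snapshot optimization boils down to: restore from the latest snapshot, then replay only the events recorded after it. A minimal sketch (shapes and figures are illustrative):&lt;/p&gt;

```ruby
# Restore state from the latest snapshot, then replay only the events
# recorded after it, instead of folding over the entire history.
def current_balance(snapshot, events_after_snapshot)
  events_after_snapshot.reduce(snapshot[:balance]) do |balance, event|
    event[:type] == :deposit ? balance + event[:amount] : balance - event[:amount]
  end
end

snapshot = { balance: 500, version: 1000 } # state as of the 1000th event
newer_events = [
  { type: :deposit,    amount: 50 },
  { type: :withdrawal, amount: 20 }
]

current_balance(snapshot, newer_events) # => 530
```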

&lt;h2&gt;
  
  
  Benefits of Event Sourcing: The Fun Side of Managing Data
&lt;/h2&gt;

&lt;p&gt;Event sourcing coupled with CQRS promotes decentralized modification and reading of data. This architecture scales well and is suitable for systems that already work with event processing or want to migrate to such an architecture.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Complete Audit Trail: With Event Sourcing, you've got a complete record of your software's shenanigans. It's the ultimate way to check if your software's been up to no good or just having a few harmless adventures.&lt;/li&gt;
&lt;li&gt;Temporal Querying: It's like going back in time to see what your software was thinking and doing at a specific moment. Wondering why it behaved oddly on a Tuesday six months ago? Event Sourcing has your back.&lt;/li&gt;
&lt;li&gt;Parallel Processing: Event-sourcing turns your software into a multitasking genius. It can handle multiple tasks simultaneously, like a magician juggling flaming torches. Events are processed asynchronously, so your software can easily handle high loads.&lt;/li&gt;
&lt;li&gt;Distributed Systems: In a distributed system, events can be processed asynchronously and independently, leading to improved scalability. Each component or microservice can consume and process events at its own pace without blocking others.&lt;/li&gt;
&lt;li&gt;Flexibility: Event-sourcing is like a chameleon for data – it can adapt to different situations. It allows you to create various "disguises" for your data, tailored to specific needs. You could achieve the same with other approaches (like &lt;a href="https://www.postgresql.org/docs/current/rules-materializedviews.html"&gt;Postgres materialized views&lt;/a&gt; or building your own reports). However, event-sourcing allows you to achieve the same in a very natural way.&lt;/li&gt;
&lt;li&gt;Fault Tolerance: Your software's history is safe and sound. In case of mishaps, you can simply turn back the clock and replay events to restore order in your software universe.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  When to Avoid Event Sourcing: Not Every Party Needs a Diary
&lt;/h3&gt;

&lt;p&gt;As much as we'd love to make Event Sourcing the life of the party, it's not always the best guest for every occasion. Here's when it's better to opt for something else:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simplicity: If your application is as uncomplicated as a one-page novel, Event Sourcing might be like using a sledgehammer to crack a walnut. It's great for epic tales but overkill for short stories.&lt;/li&gt;
&lt;li&gt;Performance: Event Sourcing might seem too leisurely for applications that demand real-time updates. Due to eventual consistency, the data may be slightly stale.&lt;/li&gt;
&lt;li&gt;Overhead: In read-focused systems, traditional databases may be better suited for your "need-it-now" attitude (building the projections requires some extra computing power).&lt;/li&gt;
&lt;li&gt;Learning Curve: Implementing Event Sourcing can be like learning a new language. It's not something you can master overnight, so if your project is on a tight schedule or your team is new to this, be prepared for a learning adventure.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  In short, shorter, shortest…
&lt;/h2&gt;

&lt;p&gt;Event Sourcing is like the eccentric uncle who turns up quirky but surprisingly insightful at family gatherings. It offers a unique approach to managing data, providing complete auditability, temporal querying, scalability, flexibility, and fault tolerance. However, like any eccentric guest, it might not be the best fit for every occasion. So, choose Event Sourcing when the story is epic and complex, but feel free to opt for traditional methods when your tale is short and sweet. After all, the software world is diverse, and there's room for both diary keepers and straightforward note-takers. In the upcoming article, I'll share practical hints on implementing event-sourcing and CQRS in the Rails world.&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Integration Patterns for Distributed Architecture - Kafka at Smily</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Wed, 08 Nov 2023 09:40:53 +0000</pubDate>
      <link>https://dev.to/smily/integration-patterns-for-distributed-architecture-kafka-at-smily-3i13</link>
      <guid>https://dev.to/smily/integration-patterns-for-distributed-architecture-kafka-at-smily-3i13</guid>
      <description>&lt;p&gt;In the &lt;a href="https://www.smily.com/engineering/integration-patterns-for-distributed-architecture-intro-to-kafka"&gt;last blog post&lt;/a&gt;, we covered some fundamental concepts about Kafka. This time, let's discuss how we use Kafka in Smily, how we got where we are now, what the decision drivers were, and how the overall architecture has evolved over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  A short story of Smily Architecture
&lt;/h2&gt;

&lt;p&gt;Like most technology startups, Smily (or rather BookingSync at that time) started as a single monolithic application. Yet, almost ten years ago (yes, this is correct, in early 2014), the ecosystem began to grow significantly. Not only did new ideas appear on the roadmap that were distinct enough to warrant separate applications (communicating with the existing application - let's call it "&lt;em&gt;Core&lt;/em&gt;"), but we were also looking into opening our ecosystem to external partners interested in building integrations with us.&lt;/p&gt;

&lt;p&gt;Being a company still in its early stage meant looking for the simplest solution to the problem. Under those circumstances, the natural way was to go with an HTTP API, which resulted in the release of &lt;a href="https://developers.bookingsync.com/reference/"&gt;API v3&lt;/a&gt; - the API that, at the time of writing this article, is still in use by our own applications and external Partners.&lt;/p&gt;

&lt;p&gt;There were multiple advantages of doing so back then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Synchronous communication is easy to reason about and debug, as we explained in &lt;a href="https://www.smily.com/engineering/integration-patterns-for-distributed-architecture"&gt;the first part of this series&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Familiarity - HTTP APIs are ubiquitous. Most experienced developers can get into such a project and quickly understand what happens under the hood and figure out how to work with such an ecosystem.&lt;/li&gt;
&lt;li&gt;Dogfooding - using the same API that we expose to Partners for our applications meant killing two birds with one stone. It also helps with being knowledgeable and opinionated about API usage. We could propose to our partners the exact patterns, solutions, and tools we used for our apps. For example, the &lt;a href="https://github.com/BookingSync/synced"&gt;synced gem&lt;/a&gt; for data synchronization.&lt;/li&gt;
&lt;li&gt;Authentication/Authorization flexibility (thanks to &lt;a href="https://oauth.net/2/"&gt;OAuth&lt;/a&gt;) without reinventing the wheel.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Core-centric Model
&lt;/h2&gt;

&lt;p&gt;All these points lead to the architectural model ("Core-centric Model") that could be visualized in the following way:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NVLwjPup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8t1ypn3zgtos0a86jwkq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NVLwjPup--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8t1ypn3zgtos0a86jwkq.jpg" alt="CorE Centric Model" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model was built upon two fundamental Ruby gems:&lt;/p&gt;

&lt;p&gt;1. &lt;a href="http://github.com/BookingSync/bookingsync-api"&gt;API v3 Client gem&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. &lt;a href="http://github.com/BookingSync/synced"&gt;Synced&lt;/a&gt;, a tool for keeping local models synced with their equivalent API v3 Resources (based on long polling: periodically fetching, in subsequent queries, the records updated since the timestamp of the previous synchronization)&lt;/p&gt;
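&lt;p&gt;The updated_since flow behind &lt;em&gt;synced&lt;/em&gt; can be sketched in plain Ruby (all class and method names below are invented for illustration; the real gem talks to API v3 over HTTP):&lt;/p&gt;

```ruby
require "time"

# Illustrative sketch of the updated_since flow: the consumer stores the
# timestamp of its last synchronization and asks only for records updated
# after it. Names are invented; the real synced gem uses the API v3 client.
class UpdatedSinceSynchronizer
  def initialize(api)
    @api = api
    @last_synced_at = Time.at(0)
  end

  # Fetch records updated since the previous run and advance the cursor.
  def sync
    records = @api.fetch_updated_since(@last_synced_at)
    @last_synced_at = records.map { |r| r[:updated_at] }.max || @last_synced_at
    records
  end
end

# A fake API client standing in for the real HTTP calls.
class FakeApi
  def initialize(records)
    @records = records
  end

  def fetch_updated_since(timestamp)
    @records.select { |r| r[:updated_at] > timestamp }
  end
end

api = FakeApi.new([
  { id: 1, updated_at: Time.utc(2023, 1, 1) },
  { id: 2, updated_at: Time.utc(2023, 6, 1) }
])
synchronizer = UpdatedSinceSynchronizer.new(api)
first_batch = synchronizer.sync  # both records on the first run
second_batch = synchronizer.sync # nothing new afterwards
```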

&lt;p&gt;On top of HTTP API v3, we also introduced webhooks, based on the publish/subscribe pattern. This was mostly a way to implement the Event Notification pattern, so that consumer apps don't have to wait for the next polling interval to act (for some Partner Apps, polling happens every hour or even less often!).&lt;/p&gt;

&lt;h2&gt;
  
  
  The beginning of the problems
&lt;/h2&gt;

&lt;p&gt;This architecture was sufficient and worked quite well in the beginning, only occasionally causing more significant issues. At some point, though, problems started to appear both on Core (a significant database load caused by the massive traffic in API v3, requiring a considerable number of pumas to handle it) and in consumer apps (taking too much time to synchronize resources, OOMs in Sidekiq workers, introducing various workarounds in the &lt;em&gt;synced&lt;/em&gt; gem for large batches and various edge cases), clearly showing that this model might no longer be a good choice.&lt;/p&gt;

&lt;p&gt;The list of the suboptimal things in this architectural model could get pretty long, but these were some vital points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Things like Authentication/Authorization flexibility are great when you need to expose an API outside the internal ecosystem. For internal apps, this is often unnecessary overhead.&lt;/li&gt;
&lt;li&gt;The overhead of the HTTP protocol for internal traffic might also be unnecessary.&lt;/li&gt;
&lt;li&gt;Scalability problems

&lt;ol&gt;
&lt;li&gt;long-running requests&lt;/li&gt;
&lt;li&gt;batch processing from all paginated records requiring a lot of memory to process&lt;/li&gt;
&lt;li&gt;constantly high traffic in API v3 and a significant load on the Core database&lt;/li&gt;
&lt;li&gt;requests being slow or redundant (e.g., polling scheduled every 5 minutes could result in unnecessary requests when nothing was returned, while a polling interval that was too long could return too many items, requiring pagination through multiple pages)&lt;/li&gt;
&lt;li&gt;every application performing pretty much the same type of request, so if 10 apps needed the same resource, the same serialization would happen on Core over and over again for each of them. Caching responses wasn't an option, as each application was sending a different timestamp when using the &lt;a href="https://developers.bookingsync.com/guides/updated-since-flow/"&gt;updated_since flow&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;li&gt;Reinventing the wheel with &lt;em&gt;synced&lt;/em&gt; - the &lt;a href="https://developers.bookingsync.com/guides/updated-since-flow/"&gt;updated_since flow&lt;/a&gt;, storing the timestamp of the last data synchronization on the consumers’ side and using it as an offset in the API for a given endpoint, is pretty much a reimplementation of the Dumb Broker/Smart Consumer model (just like in Kafka) over HTTP REST API, in a very unoptimized way.&lt;/li&gt;
&lt;li&gt;That model gets pretty expensive to scale when you consider the resources needed to run so many &lt;a href="https://github.com/puma/puma"&gt;pumas&lt;/a&gt; and Sidekiq workers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That was the right time to rethink the entire architecture model. At the same time, given that we were a relatively small team back then, we wanted to avoid any significant rewrites and re-use what we had as much as possible.&lt;/p&gt;

&lt;p&gt;In the end, the list of the requirements that we were expecting from the new architecture was the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A replacement for long polling via &lt;a href="https://github.com/BookingSync/synced"&gt;synced&lt;/a&gt;/API v3, using the same HTTP resources as we had available in API v3&lt;/li&gt;
&lt;li&gt;Significantly smaller utilization of resources (CPU, memory) on the consumers' side&lt;/li&gt;
&lt;li&gt;Getting rid of a large percentage of API v3 traffic&lt;/li&gt;
&lt;li&gt;Decreasing database load, both in the Core application and consumers' applications&lt;/li&gt;
&lt;li&gt;Ability to react to changes to the resources on the consumers’ side almost right away after they happen (e.g., doing something on the &lt;code&gt;rental_created&lt;/code&gt; event a few seconds after it happened)&lt;/li&gt;
&lt;li&gt;If using any message broker, retaining events for an arbitrarily long time (ideally indefinitely)&lt;/li&gt;
&lt;li&gt;Ability to replay events, especially when a new consumer joins the ecosystem (e.g. when a new internal app is introduced that requires gigabytes of data from previous years)&lt;/li&gt;
&lt;li&gt;Ideally, a few seconds of latency between a change on Core and the time it takes for the consumers to start processing the change, as some use cases were very time-sensitive.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introducing Kafka
&lt;/h2&gt;

&lt;p&gt;Under these circumstances, Kafka was a natural choice: it fulfilled all the requirements we had, and the way we were using &lt;a href="https://github.com/BookingSync/synced"&gt;synced&lt;/a&gt; with timestamp-based offsets and the &lt;a href="https://developers.bookingsync.com/guides/updated-since-flow/"&gt;updated_since flow&lt;/a&gt; was already close to the dumb broker/smart consumer model implemented by Kafka. It was also straightforward to adapt all the components used for serializing resources to JSON in API v3 and do the same thing when publishing to Kafka upon every change of a resource (any change that would bump the &lt;code&gt;updated_at&lt;/code&gt; timestamp).&lt;/p&gt;
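&lt;p&gt;A minimal sketch of that publishing flow, assuming a hypothetical publisher and a fake producer (in the real setup, the payload comes from the same serializers as API v3 and goes through an actual Kafka producer):&lt;/p&gt;

```ruby
require "json"

# Illustrative sketch: every change that bumps updated_at results in the
# full resource projection being published. FakeProducer stands in for a
# real Kafka producer; all names here are invented for illustration.
class FakeProducer
  attr_reader :messages

  def initialize
    @messages = []
  end

  def publish(topic:, key:, payload:)
    @messages.push(topic: topic, key: key, payload: payload)
  end
end

class RentalPublisher
  def initialize(producer)
    @producer = producer
  end

  # Serializes the model the same way an API endpoint would and publishes
  # it, keyed by model name + ID.
  def call(rental)
    @producer.publish(
      topic: "rentals",
      key: "Rental-#{rental[:id]}",
      payload: JSON.generate(rental)
    )
  end
end

producer = FakeProducer.new
RentalPublisher.new(producer).call(id: 123, name: "Villa", updated_at: "2023-01-01T00:00:00Z")
```

&lt;p&gt;Keying messages by model name and ID is also what makes log compaction possible later on.&lt;/p&gt;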

&lt;p&gt;Thanks to this change, our system turned into the following model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--shYKABzO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7qo950z0x0wuqedtg2gn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--shYKABzO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7qo950z0x0wuqedtg2gn.jpg" alt="Core Centric Model with Kafka" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It could be argued that this model was still a Core-centric one - the difference was that an extra layer (Kafka) was introduced to decouple consumer apps from the Core application. Nevertheless, it turned out to be a great success, and this change brought considerable benefits in solving the problems we used to have with the model based on &lt;em&gt;synced&lt;/em&gt;/API v3.&lt;/p&gt;

&lt;p&gt;Also, given how simple it was to introduce Kafka publishers in other applications (especially compared to how much it would take to build an HTTP API), it was pretty straightforward to turn that model into the following one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q5r6IjY7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pyo48jejvv2fvfv93kcq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q5r6IjY7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pyo48jejvv2fvfv93kcq.jpg" alt="Kafka Event Lake" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to that, Kafka could become a &lt;a href="https://aws.amazon.com/what-is/data-lake/"&gt;data lake/event lake&lt;/a&gt; for the entire ecosystem if needed, and in the future it will allow us to split bigger applications (like Core) into smaller (micro)services.&lt;/p&gt;

&lt;h2&gt;
  
  
  How did we get here?
&lt;/h2&gt;

&lt;p&gt;You might be wondering at this point how we made this happen - how could we switch so quickly from HTTP-based long polling to Kafka, especially since one of the requirements was to keep using API v3 resources?&lt;/p&gt;

&lt;p&gt;We developed our own framework on top of &lt;a href="http://karafka.io/"&gt;karafka&lt;/a&gt; that made it trivial to introduce new producers and consumers, thanks to the powerful declarative DSL in that gem and to adopting something that could be compared to the &lt;a href="https://www.confluent.io/learn/change-data-capture"&gt;Change Data Capture (CDC)&lt;/a&gt; pattern - applied not at the database level, but at the model level.&lt;/p&gt;
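&lt;p&gt;To illustrate the declarative idea (and only the idea - karafka's actual routing DSL is different and much richer), here is a toy router mapping topics to processing logic:&lt;/p&gt;

```ruby
# A toy router illustrating a declarative topic-to-handler mapping. This is
# NOT karafka's API - just a minimal sketch of the declarative style.
class MiniRouter
  attr_reader :routes

  def initialize(&block)
    @routes = {}
    instance_eval(&block)
  end

  # Declare which handler processes messages from a given topic.
  def topic(name, &handler)
    @routes[name] = handler
  end

  # Deliver a message to the handler declared for the topic.
  def dispatch(name, message)
    @routes.fetch(name).call(message)
  end
end

router = MiniRouter.new do
  topic(:rentals) { |msg| "synced rental #{msg[:id]}" }
end
result = router.dispatch(:rentals, id: 7)
```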

&lt;p&gt;And given that this is almost the end of this blog post, you probably already know what the next part of this series will be about :).&lt;/p&gt;

&lt;p&gt;For that special occasion, we will release our framework publicly (after removing some internal dependencies and reworking the entire documentation, as it's heavily based on the details of our ecosystem). Stay tuned - this will be an opportunity to learn about a complete framework that allows integrating Ruby on Rails applications via Kafka in a very simple way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we covered our old architecture model used to integrate our applications, what problems we experienced with it, and why we decided to switch to Kafka.&lt;/p&gt;

&lt;p&gt;Stay tuned for the next part of this series, which is going to introduce our framework for doing CDC at the domain level.&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>kafka</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Integration Patterns for Distributed Architecture - Intro to Kafka (for Rubyists)</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Thu, 05 Oct 2023 11:12:58 +0000</pubDate>
      <link>https://dev.to/smily/integration-patterns-for-distributed-architecture-intro-to-kafka-for-rubyists-mn1</link>
      <guid>https://dev.to/smily/integration-patterns-for-distributed-architecture-intro-to-kafka-for-rubyists-mn1</guid>
      <description>&lt;p&gt;In the &lt;a href="https://www.smily.com/engineering/integration-patterns-for-distributed-architecture" rel="noopener noreferrer"&gt;last blog post&lt;/a&gt;, we covered a general overview of &lt;strong&gt;integration patterns for distributed architecture&lt;/strong&gt;, and now it's time to get into further details.&lt;/p&gt;

&lt;p&gt;Let's start with perhaps the most exciting piece of tech we use in &lt;strong&gt;Smily&lt;/strong&gt; - &lt;strong&gt;Apache Kafka&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kafka?
&lt;/h2&gt;

&lt;p&gt;Generally speaking, &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; is an open-source distributed event streaming platform developed originally at LinkedIn. It is designed to handle data streams and provide a fault-tolerant, durable, and scalable framework for building real-time data pipelines and streaming applications.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://www.smily.com/engineering/integration-patterns-for-distributed-architecture" rel="noopener noreferrer"&gt;previous blog post&lt;/a&gt;, we learned that Kafka is a tool we can use to implement the publish/subscribe type of integration between services. Given that there is a variety of message brokers that we could use to achieve the same result, let's focus on what makes Kafka unique and its major advantages.&lt;/p&gt;

&lt;p&gt;Let's take a look at the basic visualization of how Kafka works, and let's make sure we understand the key concepts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figv438994cftftchstnk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figv438994cftftchstnk.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything starts on the &lt;strong&gt;Producer's&lt;/strong&gt; side, responsible for &lt;strong&gt;publishing events.&lt;/strong&gt; For example, if we use Kafka for activity tracking (as LinkedIn did when creating Kafka), we could send an event such as &lt;em&gt;page_visited&lt;/em&gt; with some JSON payload containing a timestamp, user ID, and many other things that could be useful.&lt;/p&gt;

&lt;p&gt;These events will get published to &lt;strong&gt;topics,&lt;/strong&gt; which are essentially append-only logs where each event can be identified under a given &lt;strong&gt;offset&lt;/strong&gt; (similar to an array's index).&lt;/p&gt;
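&lt;p&gt;An append-only log with offsets can be modeled in a few lines (a simplification, of course - real partitions are durable, replicated logs):&lt;/p&gt;

```ruby
# Minimal in-memory model of a topic partition as an append-only log,
# where each event is identified by its offset (like an array index).
class PartitionLog
  def initialize
    @events = []
  end

  # Append an event and return its offset.
  def append(event)
    @events.push(event)
    @events.size - 1
  end

  # Read all events starting from the given offset.
  def read_from(offset)
    @events[offset..] || []
  end
end

log = PartitionLog.new
first_offset = log.append("page_visited by user 1")
second_offset = log.append("page_visited by user 2")
```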

&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt; can be divided into multiple &lt;strong&gt;partitions&lt;/strong&gt; to allow parallelization, and the &lt;em&gt;partition key&lt;/em&gt; provided when publishing the message determines to which partition exactly the event will be delivered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topics&lt;/strong&gt; are like categories - so events that are somehow similar should go into the same topic. This does not necessarily mean that, for example, each database table/application model would have a dedicated topic in Kafka. Actually, that could be a really poor design in many cases.&lt;/p&gt;

&lt;p&gt;When designing &lt;em&gt;topics&lt;/em&gt;, we need to remember the critical factor - that we deal with append-only logs, so all the events within a given topic's partition will be ordered. In many cases, we want to preserve the causality/sequence of the events. For example, we would expect the &lt;em&gt;payment_paid&lt;/em&gt; event to be processed after the &lt;em&gt;payment_created&lt;/em&gt; event. But if we published these two events into separate topics, that might not necessarily be the case! The same goes for events such as &lt;em&gt;order_created&lt;/em&gt; and &lt;em&gt;payment_paid&lt;/em&gt; (for a given order) - there is a good chance that we want to keep the order of such events and have them in the same topic. And things related to a given order should be in the same partition (which will be determined by the provided &lt;strong&gt;partition key&lt;/strong&gt;, which could be, for example, the order ID). But we probably don't care whether &lt;em&gt;customer_profile_picture_updated&lt;/em&gt; is processed before or after the payment got paid, so there is a good chance that we could use separate topics here.&lt;/p&gt;
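&lt;p&gt;The routing by partition key can be sketched as hashing the key modulo the partition count (real clients use their own hash functions, so the exact mapping differs, but the per-key ordering property is the same):&lt;/p&gt;

```ruby
require "zlib"

# Sketch of routing by partition key: hash the key modulo the number of
# partitions. Equal keys always land in the same partition, which is what
# preserves per-key ordering.
PARTITIONS = 5

def partition_for(key)
  Zlib.crc32(key) % PARTITIONS
end

# Both events for order 42 use the same key, so order_created is stored
# before payment_paid in the same partition's log.
created_partition = partition_for("order-42") # order_created
paid_partition = partition_for("order-42")    # payment_paid
```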

&lt;p&gt;Since we've already started discussing how things are processed, let's move to &lt;strong&gt;consumers&lt;/strong&gt; organized within &lt;strong&gt;consumer groups&lt;/strong&gt;. Consumers are responsible for processing events. Think of them as workers - separate processes consuming from the &lt;strong&gt;topics/partitions,&lt;/strong&gt; just like &lt;strong&gt;Sidekiq&lt;/strong&gt; workers process jobs from queues. And &lt;strong&gt;consumer groups&lt;/strong&gt; are like independent receivers. For example, you might have two applications that need to consume payment-related events from Kafka - one for payment processing and the other for business intelligence. These two would be two &lt;strong&gt;consumer groups&lt;/strong&gt;. However, you can also have multiple &lt;strong&gt;consumer groups in a single application.&lt;/strong&gt; For example, if you have a modular monolith, each module/Bounded Context could be a separate &lt;strong&gt;consumer group&lt;/strong&gt; and consume things independently from all other modules.&lt;/p&gt;

&lt;p&gt;What we need to keep in mind is that within the same &lt;strong&gt;consumer group,&lt;/strong&gt; a single consumer can consume from multiple partitions, but a given partition can have only one consumer assigned! This is the only way to ensure that the events will be processed in order (there are some ways to parallelize processing within a given partition and still preserve the order to a limited extent, but that's not available in Kafka itself). Nothing, however, blocks us from having one consumer consuming from multiple partitions.&lt;/p&gt;

&lt;p&gt;For example, if we have a single topic with five partitions, we could have just a single consumer (in a given consumer group), and that consumer would process all the messages from the partitions. However, if the consumer does not process messages fast enough resulting in a &lt;strong&gt;lag&lt;/strong&gt; (the difference between the offset of the latest message published to the given partition and the last processed offset on the consumer side), we could increase the number of consumers up to five. That way, each consumer would be consuming from a single partition only.&lt;/p&gt;

&lt;p&gt;And what if we added one more consumer? That would be essentially useless - you cannot have more than a single consumer within a single consumer group for a given partition, so having more workers than partitions will result in workers that have nothing to process. That's why having an appropriate number of partitions is critical, as this is how to parallelize processing and ensure it's fast enough.&lt;/p&gt;
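&lt;p&gt;The assignment rule can be sketched as follows (a simplified round-robin; real rebalancing protocols are more involved):&lt;/p&gt;

```ruby
# Sketch of assigning partitions to consumers within one consumer group:
# each partition goes to exactly one consumer (round-robin here), and any
# consumer beyond the partition count is left idle.
def assign(partitions, consumers)
  assignment = consumers.to_h { |consumer| [consumer, []] }
  partitions.each_with_index do |partition, index|
    assignment[consumers[index % consumers.size]].push(partition)
  end
  assignment
end

partitions = [0, 1, 2, 3, 4]
solo = assign(partitions, ["c1"])                 # one consumer takes all five
crowd = assign(partitions, %w[c1 c2 c3 c4 c5 c6]) # the sixth gets nothing
```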

&lt;p&gt;What consumers do under the hood is go through messages one by one (usually by fetching a batch of events), execute the processing logic, and periodically store the &lt;strong&gt;offset&lt;/strong&gt; of the latest processed event in a dedicated internal Kafka topic (this behavior is configurable, but it's more or less a standard use case for microservices integration). That's how the consumers can identify where they should start processing another batch of events.&lt;/p&gt;

&lt;p&gt;And what happens if something crashes during the processing of the batch? This is dependent on the config, as we can have three delivery semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;at-most-once&lt;/strong&gt; - the event will be processed either once (when everything goes fine) or &lt;em&gt;might&lt;/em&gt; not be processed at all (when something goes wrong). However, due to how it works internally (offsets are committed at fixed-time intervals), there is still a chance that the event will be processed more than once. This is probably not a good config for the integration between microservices. Still, for frequent data reading from sensors, for example, it might be acceptable to lose some messages if we can achieve higher throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;at-least-once&lt;/strong&gt; - the event will be processed either once (when everything goes fine) or potentially more than once (when something goes wrong), as the offset is committed only after processing the messages. This would be the recommended semantics for the integration between microservices. However, in this scenario, we need to make sure that the processing is &lt;em&gt;idempotent&lt;/em&gt;, so that processing the same event twice will not result in side effects being executed twice as well (for example, we probably want to ensure that we won't charge a credit card twice).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;exactly-once&lt;/strong&gt; - somewhat arguable, given that we are talking about distributed systems; yet you will quickly find that Kafka supports such semantics. Discussing &lt;strong&gt;exactly-once&lt;/strong&gt; semantics would go way beyond the scope of an intro to Kafka. If you want to understand it a bit more, I recommend reading &lt;a href="https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/" rel="noopener noreferrer"&gt;this article from Confluent&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
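&lt;p&gt;The idempotency requirement of at-least-once processing can be sketched like this (an in-memory dedupe set; a real consumer would persist the processed IDs):&lt;/p&gt;

```ruby
require "set"

# Sketch of idempotent at-least-once processing: remember the IDs of
# already-processed events, so a redelivered event (e.g. after a crash
# before the offset commit) does not trigger its side effect twice.
class IdempotentConsumer
  attr_reader :charges

  def initialize
    @processed_ids = Set.new
    @charges = 0
  end

  def process(event)
    return if @processed_ids.include?(event[:id]) # duplicate - skip
    @charges += 1 # the side effect (think: charging a credit card)
    @processed_ids.add(event[:id])
  end
end

consumer = IdempotentConsumer.new
event = { id: "payment-1", type: "payment_paid" }
consumer.process(event)
consumer.process(event) # simulated redelivery after a crash
```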

&lt;p&gt;And this is why we say that Kafka implements the &lt;strong&gt;Dumb Broker/Smart Consumer&lt;/strong&gt; model - the broker is not responsible for delivering anything to consumers; it's up to the consumers to handle consuming and keep track of the offset.&lt;/p&gt;

&lt;p&gt;However, this is not everything that concerns the delivery semantics. We've just discussed the one between the broker and the consumer. What about the one between the Producer and the broker?&lt;/p&gt;

&lt;p&gt;As you might expect, we also have &lt;strong&gt;at-most-once/at-least-once&lt;/strong&gt; (and &lt;strong&gt;exactly-once,&lt;/strong&gt; when the producer is configured to be idempotent, but the exact details go beyond the scope of this article) semantics, with some interesting edge cases - such as &lt;strong&gt;at-least-once delivery&lt;/strong&gt; that still carries some probability of data loss!&lt;/p&gt;

&lt;p&gt;In most production systems, we want to achieve high availability and ensure that the Kafka cluster will be operational, even if some broker goes down. That means we need to have multiple brokers (usually 3 or 5) and replication.&lt;/p&gt;

&lt;p&gt;The semantics will be mainly determined based on the config of &lt;strong&gt;Acks&lt;/strong&gt; (acknowledgments). We have three options here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Acks = 0 - it's essentially a "fire and forget" approach. The producer just publishes the event and doesn't care about any response from the broker. That way, we can achieve a higher throughput, but we also have a higher risk of data loss. This is the way to achieve &lt;strong&gt;at-most-once semantics&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Acks = 1 - in that case, the producer expects to get a response from the broker that everything went fine. If there is no response, it will keep retrying until it receives the response or hits the retry limit. Given that this approach involves multiple attempts, it might turn out that the same event will be delivered more than once. This is the way to achieve &lt;strong&gt;at-least-once semantics&lt;/strong&gt;. However, replication is an independent step that happens afterwards, so the broker might go down between acknowledging the message and replicating it.&lt;/li&gt;
&lt;li&gt;Acks = All - similar to the previous case, yet the broker responds only after replication has been performed. That does not necessarily mean replication to all the brokers! It depends on the separate configuration option for minimum in-sync replicas - if you set it to 1, you might end up with a very different result than you would expect from Acks set to All.&lt;/li&gt;
&lt;/ul&gt;
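&lt;p&gt;For reference, in librdkafka-based Ruby clients (such as those used by karafka), these levels map to producer properties along these lines (an illustrative fragment, not a recommended setup):&lt;/p&gt;

```ruby
# Illustrative settings using librdkafka property names (values here are
# examples, not recommendations); -1 is the "all" setting:
ACKS_FIRE_AND_FORGET = { 'request.required.acks': 0 }  # at-most-once
ACKS_LEADER_ONLY     = { 'request.required.acks': 1 }  # at-least-once
ACKS_ALL_IN_SYNC     = { 'request.required.acks': -1 } # wait for in-sync replicas

# Acks = All is only as strong as the broker/topic-level min.insync.replicas
# setting - with min.insync.replicas=1, it degrades to leader-only behavior.
TOPIC_DURABILITY = { 'min.insync.replicas': 2 }
```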

&lt;p&gt;There is a clear trade-off between durability, availability, and latency. The production setup for microservices integration requires a careful analysis of the actual needs, as well as getting familiar with more advanced concepts. The minimum in-sync replicas config is just a start; there is more to it - for example, the leader election process and its impact on potential data loss, especially &lt;a href="https://www.datadoghq.com/blog/kafka-at-datadog/#unclean-leader-elections-to-enable-or-not-to-enable" rel="noopener noreferrer"&gt;the unclean leader election&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Consequences of the design &amp;amp; some challenges&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that we've learned quite a lot about how Kafka works internally, let's think about some consequences of that design, both good and bad, and some other aspects worth considering when dealing with Kafka.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Retention&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first one would be retention - since it's up to the consumer to manage its position in the log (offset), we have some interesting things to consider, especially as we don't have the behavior of a typical message queue, where the event is gone after processing it.&lt;/p&gt;

&lt;p&gt;It turns out that in Kafka, retention is what we set it to. And we can even set it to be indefinite as if it was a database!&lt;/p&gt;

&lt;p&gt;We have two options: retention specified by time (e.g., retaining events for seven days), which is probably more popular, and retention based on the total size of the log.&lt;/p&gt;
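&lt;p&gt;Both kinds of retention are plain topic configs; with the stock Kafka admin CLI, setting them could look like this (an illustrative invocation - topic name and values are examples):&lt;/p&gt;

```shell
# Illustrative: set time-based retention (7 days) and size-based retention
# (1 GiB) on a topic with the stock Kafka admin CLI.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name rentals \
  --alter --add-config retention.ms=604800000,retention.bytes=1073741824
```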

&lt;h3&gt;
  
  
  &lt;strong&gt;Replaying events/skipping events&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Consumers in a given consumer group know where to start processing based on the offset they stored in Kafka for a given partition. And it also turns out that we can change the value of the stored offset ourselves!&lt;/p&gt;

&lt;p&gt;Nothing prevents us from resetting the offset to the position from the previous day if we discover some potential bug and need to reprocess the events. Or maybe we want to skip some messages: for example, when a massive number of events we don't care much about got published and processing them would take hours, while some other important events that would ideally be processed immediately are about to be published.&lt;/p&gt;
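&lt;p&gt;Offsets can be reset with the stock Kafka CLI (hypothetical group and topic names below; note that the consumer group has to be inactive while resetting):&lt;/p&gt;

```shell
# Illustrative: rewind a consumer group's offsets on one topic to a point
# in time (e.g. the previous day)...
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group billing-app --topic payments \
  --reset-offsets --to-datetime 2023-10-04T00:00:00.000 --execute

# ...or skip ahead to the latest offset, ignoring the backlog.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group billing-app --topic payments \
  --reset-offsets --to-latest --execute
```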

&lt;h3&gt;
  
  
  &lt;strong&gt;Dead-letter queue&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here comes an interesting question: what happens on the consumer side if there is some error when processing the message, especially when it's not an issue the consumer can self-heal, perhaps due to some bug in the processing logic?&lt;/p&gt;

&lt;p&gt;The retry policy is defined on the consumer side, but there is one essential problem here - until the message gets processed, the consumer will not move on to the next one, which means the consumer might be stuck forever on that single message!&lt;/p&gt;

&lt;p&gt;There is no &lt;a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/DeadLetterChannel.html" rel="noopener noreferrer"&gt;dead-letter queue&lt;/a&gt; equivalent available out-of-box in Kafka (remember - it's a dumb broker/smart consumer model), so it's up to the consumer to handle exceptions correctly.&lt;/p&gt;

&lt;p&gt;Fortunately, we have &lt;a href="https://karafka.io/docs/Dead-Letter-Queue/" rel="noopener noreferrer"&gt;some options&lt;/a&gt; for the Ruby on Rails application that make it straightforward to handle such cases, which I'll get back to in a moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Log compaction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine that what you publish to Kafka are projections of the models that get updated very often, and you have a very long retention configured for the topics. That will mean a lot of data will be stored in Kafka. However, there is a good chance that it would be enough to keep just the most recent projection of the model (as we typically do when using a database).&lt;/p&gt;

&lt;p&gt;By default, if a given model is published 100 times after the updates to Kafka, we will have 100 events stored there, which is not optimal for storage. Fortunately, we can enable &lt;strong&gt;log compaction&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Thanks to that feature of Kafka, as long as we send the same &lt;strong&gt;message key&lt;/strong&gt; for a given model with every update (which should be straightforward; we can use the model name and its ID, for example, &lt;em&gt;"Rental-123"&lt;/em&gt;) and enable &lt;strong&gt;log compaction,&lt;/strong&gt; we can be sure that the previous messages with that &lt;strong&gt;message key&lt;/strong&gt; will be dropped (or rather compacted).&lt;/p&gt;
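&lt;p&gt;What compaction achieves can be sketched as "keep the last event per key" (a simplification - real compaction runs asynchronously on the broker and never touches the active segment):&lt;/p&gt;

```ruby
# Sketch of what log compaction achieves: for each message key, only the
# most recent value survives, so earlier projections of "Rental-123" are
# dropped.
def compact(log)
  log.group_by { |event| event[:key] }
     .map { |_key, events| events.last }
end

log = [
  { key: "Rental-123", value: { name: "Villa" } },
  { key: "Rental-456", value: { name: "Chalet" } },
  { key: "Rental-123", value: { name: "Villa Deluxe" } }
]
compacted = compact(log)
```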

&lt;h3&gt;
  
  
  &lt;strong&gt;Slow consumers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is something that is rarely thought about when starting to use Kafka until the first time you experience the issue.&lt;/p&gt;

&lt;p&gt;Kafka (the broker) somehow needs to be able to distinguish between consumers that are "alive" and actively processing messages and the ones that are no longer processing anything - especially since only one consumer within a single consumer group can consume from a given partition. This also matters when something goes wrong, or even during deployments.&lt;/p&gt;

&lt;p&gt;This mechanism is based on heartbeats - the broker expects to "hear" from the consumer within a given time interval, and if it doesn't, the consumer will be considered inactive and "kicked out". If processing events from a batch takes longer than this expected time interval, you are guaranteed to experience a huge problem and potentially stuck consumers.&lt;/p&gt;

&lt;p&gt;Fortunately, as with everything else in Kafka, this is configurable, yet the awareness of the potential issue is essential.&lt;/p&gt;
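&lt;p&gt;For reference, these are the librdkafka consumer properties typically involved (values below are examples, not advice):&lt;/p&gt;

```ruby
# Illustrative consumer settings (librdkafka property names) relevant to
# the "slow consumer" problem; values are examples only:
KAFKA_CONSUMER_SETTINGS = {
  # The consumer is considered dead if no heartbeat arrives in this window.
  'session.timeout.ms': 30_000,
  # How often the heartbeat is sent (should be well below the timeout).
  'heartbeat.interval.ms': 5_000,
  # Max time between polls before the consumer is kicked out of the group.
  'max.poll.interval.ms': 300_000
}
```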

&lt;p&gt;In reality, slow consumers are more complex than that, and there are multiple configuration options involved. And if you know what you're doing, you can even have &lt;a href="https://karafka.io/docs/Pro-Long-Running-Jobs/" rel="noopener noreferrer"&gt;long-running jobs with Kafka&lt;/a&gt;, but I wanted to focus on a problem that is overlooked too often.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Overall, Kafka is a complex tool, and there are a lot of things that can go wrong for various reasons. Given that it's possible to run into a problem where a consumer is stuck for hours with some message, solid monitoring is essential when running Kafka in production.&lt;/p&gt;

&lt;p&gt;What exactly we should monitor when using Kafka deserves a separate article (you can expect it in the near future), but for now, the takeaway would be that it's critical to set it up.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Production setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Just use some managed service, such as &lt;a href="https://aws.amazon.com/msk/" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka (MSK)&lt;/a&gt;. Running Kafka in production might be quite a challenge to get right, especially when considering high availability and durability. Configuring Kafka and using it optimally is already a challenge; don't add an even bigger one unless you know what you're doing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Kafka?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After reading all of this, you might wonder if it's a good idea ever to use Kafka because it seems like everything can go wrong!&lt;/p&gt;

&lt;p&gt;Don't worry, your Sidekiq/Redis combo probably has been regularly losing jobs unless you configured it for &lt;a href="https://karolgalanciak.com/blog/2019/08/18/durable-sidekiq-jobs-how-to-maximize-reliability-of-sidekiq-and-redis/" rel="noopener noreferrer"&gt;minimum reasonable durability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Joking aside, the essential idea is that you need to understand the tools you use. Even such a popular combination as Sidekiq/Redis can cause some unexpected problems unless you are aware of them and you know what to do to prevent them from happening in the first place.&lt;/p&gt;

&lt;p&gt;The same goes for Kafka - as long as you understand how it works, at least at the fundamental level, and have appropriate monitoring in place, you will most likely be fine.&lt;/p&gt;

&lt;p&gt;But before that, you must ensure that Kafka is exactly what you need.&lt;/p&gt;

&lt;p&gt;Consider Kafka if at least one of the following scenarios applies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you need strict ordering of the events&lt;/li&gt;
&lt;li&gt;you do stream processing&lt;/li&gt;
&lt;li&gt;you build data pipelines&lt;/li&gt;
&lt;li&gt;you process a considerable amount of data/huge number of events&lt;/li&gt;
&lt;li&gt;you need the actual retention of the events&lt;/li&gt;
&lt;li&gt;you are sure that what you need is something that implements a dumb broker/smart consumer model&lt;/li&gt;
&lt;li&gt;the tooling/framework available for Kafka will allow you to get the job done significantly easier, even if you could use some alternative&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need just a standard message queue, using &lt;a href="https://www.rabbitmq.com/" rel="noopener noreferrer"&gt;RabbitMQ&lt;/a&gt; or Amazon &lt;a href="https://aws.amazon.com/sns/" rel="noopener noreferrer"&gt;SNS&lt;/a&gt;/&lt;a href="https://aws.amazon.com/sqs/" rel="noopener noreferrer"&gt;SQS&lt;/a&gt; would probably be a better idea, as they are simply less complex solutions to that problem.&lt;/p&gt;

&lt;p&gt;There are also some alternatives to Kafka that would be appropriate for the scenarios mentioned above. One example would be &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;Apache Pulsar&lt;/a&gt;, which could be a superior choice in some scenarios. Yet, it's a less popular tool, so fewer tools and integrations are available.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Kafka with Ruby on Rails&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's now see Kafka in action.&lt;/p&gt;

&lt;p&gt;The good news is that we have many tools available that we could add to our Ruby on Rails applications to make them work with Kafka. And there is even better news - one of these tools is a clear winner - &lt;a href="https://karafka.io/" rel="noopener noreferrer"&gt;Karafka&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Not only does it provide a straightforward way to implement Kafka producers and consumers, but it also provides many extras that often allow you to bypass "traditional" Kafka limitations. Here are a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://karafka.io/docs/Dead-Letter-Queue" rel="noopener noreferrer"&gt;Dead Letter Queue&lt;/a&gt; - we've discussed the scenario where the processing can be blocked due to some error, so it's already apparent how useful this feature could be.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://karafka.io/docs/Active-Job/" rel="noopener noreferrer"&gt;Active Job Adapter&lt;/a&gt; and support for &lt;a href="https://karafka.io/docs/Pro-Long-Running-Jobs/" rel="noopener noreferrer"&gt;long-running jobs&lt;/a&gt; - Kafka is often discouraged as a tool for background jobs processing, especially for long-running ones. With Karafka, this is simple as well.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://karafka.io/docs/Pro-Routing-Patterns/" rel="noopener noreferrer"&gt;Complex routing patterns&lt;/a&gt; - via regular expression&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://karafka.io/docs/Pro-Virtual-Partitions/" rel="noopener noreferrer"&gt;Virtual partitions&lt;/a&gt; - remember the part about consumers and partitions and that partitions are the parallelization unit, and there can be only one consumer in a given consumer group for a given partition? Clearly, we cannot have more than one consumer for a partition. However, we can have further parallelization within a single partition while preserving the order of the messages in most cases, thanks to virtual partitions!&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://karafka.io/docs/Web-UI-Getting-Started/" rel="noopener noreferrer"&gt;Web UI&lt;/a&gt; - essential for debugging. If you cannot imagine using Sidekiq without Web UI, you can only imagine how useful it could be for Kafka given the overall complexity.&lt;/li&gt;
&lt;/ul&gt;
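
&lt;p&gt;As a taste of the routing DSL, here is a hypothetical fragment wiring up the Dead Letter Queue feature mentioned above (the topic, DLQ topic, and consumer names are made up for illustration):&lt;/p&gt;

```ruby
# karafka.rb routing fragment - all names here are hypothetical
routes.draw do
  topic :orders_events do
    consumer OrdersEventsConsumer
    # After 3 failed attempts, move the offending message to a separate
    # topic so it no longer blocks the partition
    dead_letter_queue(topic: 'orders_events_dlq', max_retries: 3)
  end
end
```

&lt;p&gt;With a setup like this, a single poison message costs you three processing attempts instead of halting the whole partition.&lt;/p&gt;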

&lt;p&gt;Let's see what building a minimal producer and consumer would take. As this is a simple proof of concept, we don't really need two separate applications. A single one will be enough.&lt;/p&gt;

&lt;p&gt;Assuming that you have Kafka already set up, you can start by adding the &lt;em&gt;karafka&lt;/em&gt; gem to the &lt;em&gt;Gemfile:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
gem "karafka"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Right afterward, you can run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bundle exec karafka install

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's going to create the &lt;em&gt;karafka.rb&lt;/em&gt; config file, &lt;em&gt;app/consumers/application_consumer.rb&lt;/em&gt; (a base class for all consumers), and &lt;em&gt;app/consumers/example_consumer.rb&lt;/em&gt; (as the name indicates, an example consumer).&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;karafka.rb&lt;/em&gt; config file should look more or less like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KarafkaApp&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;Karafka&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;App&lt;/span&gt;
  &lt;span class="n"&gt;setup&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kafka&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s1"&gt;'bootstrap.servers'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'127.0.0.1:9092'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'example_app'&lt;/span&gt;
    &lt;span class="c1"&gt;# Recreate consumers with each batch. This will allow Rails code reload to work in the&lt;/span&gt;
    &lt;span class="c1"&gt;# development mode. Otherwise Karafka process would not be aware of code changes&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consumer_persistence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="no"&gt;Rails&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;development?&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="c1"&gt;# Comment out this part if you are not using instrumentation and/or you are not&lt;/span&gt;
  &lt;span class="c1"&gt;# interested in logging events for certain environments. Since instrumentation&lt;/span&gt;
  &lt;span class="c1"&gt;# notifications add extra boilerplate, if you want to achieve max performance,&lt;/span&gt;
  &lt;span class="c1"&gt;# listen to only what you really need for given environment.&lt;/span&gt;
  &lt;span class="no"&gt;Karafka&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Karafka&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Instrumentation&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;LoggerListener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Karafka.monitor.subscribe(Karafka::Instrumentation::ProctitleListener.new)&lt;/span&gt;

  &lt;span class="c1"&gt;# This logger prints the producer development info using the Karafka logger.&lt;/span&gt;
  &lt;span class="c1"&gt;# It is similar to the consumer logger listener but producer oriented.&lt;/span&gt;
  &lt;span class="no"&gt;Karafka&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="no"&gt;WaterDrop&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Instrumentation&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;LoggerListener&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="c1"&gt;# Log producer operations using the Karafka logger&lt;/span&gt;
      &lt;span class="no"&gt;Karafka&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="c1"&gt;# If you set this to true, logs will contain each message details&lt;/span&gt;
      &lt;span class="c1"&gt;# Please note, that this can be extensive&lt;/span&gt;
      &lt;span class="ss"&gt;log_messages: &lt;/span&gt;&lt;span class="kp"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;routes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;draw&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="c1"&gt;# Uncomment this if you use Karafka with ActiveJob&lt;/span&gt;
    &lt;span class="c1"&gt;# You need to define the topic per each queue name you use&lt;/span&gt;
    &lt;span class="c1"&gt;# active_job_topic :default&lt;/span&gt;
    &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ss"&gt;:example&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="c1"&gt;# Uncomment this if you want Karafka to manage your topics configuration&lt;/span&gt;
      &lt;span class="c1"&gt;# Managing topics configuration via routing will allow you to ensure config consistency&lt;/span&gt;
      &lt;span class="c1"&gt;# across multiple environments&lt;/span&gt;
      &lt;span class="c1"&gt;#&lt;/span&gt;
      &lt;span class="c1"&gt;# config(partitions: 2, 'cleanup.policy': 'compact')&lt;/span&gt;
      &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="no"&gt;ExampleConsumer&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key part for us will be the &lt;em&gt;routes.draw do&lt;/em&gt; block - it declares that the application will consume from the &lt;em&gt;example&lt;/em&gt; topic (all of its partitions) via &lt;em&gt;ExampleConsumer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;em&gt;ExampleConsumer&lt;/em&gt; will probably look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# frozen_string_literal: true&lt;/span&gt;

&lt;span class="c1"&gt;# Example consumer that prints messages payloads&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExampleConsumer&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationConsumer&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consume&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="c1"&gt;# Run anything upon partition being revoked&lt;/span&gt;
  &lt;span class="c1"&gt;# def revoked&lt;/span&gt;
  &lt;span class="c1"&gt;# end&lt;/span&gt;

  &lt;span class="c1"&gt;# Define here any teardown things you want when Karafka server stops&lt;/span&gt;
  &lt;span class="c1"&gt;# def shutdown&lt;/span&gt;
  &lt;span class="c1"&gt;# end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So it only prints out the payload of each message in the batch. And &lt;em&gt;ApplicationConsumer&lt;/em&gt; is merely a base class that inherits from &lt;em&gt;Karafka::BaseConsumer.&lt;/em&gt;&lt;/p&gt;
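
&lt;p&gt;To make the batch-and-offset mechanics more tangible, here is a self-contained sketch in plain Ruby (all names are hypothetical, and the stub classes stand in for Karafka's runtime): it parses each JSON payload and records the committed offset per message, mimicking what &lt;em&gt;mark_as_consumed&lt;/em&gt; does in a real consumer:&lt;/p&gt;

```ruby
require "json"

# Stand-in for a Karafka message: just a payload and an offset
Message = Struct.new(:payload, :offset)

# Sketch of a consumer that processes a batch message by message and
# commits the offset after each one, so a crash mid-batch would not
# reprocess the messages that were already handled
class OffsetTrackingConsumer
  attr_reader :handled, :last_committed_offset

  def initialize(messages)
    @messages = messages
    @handled = []
  end

  def consume
    @messages.each do |message|
      @handled.push(JSON.parse(message.payload)) # "process" the message
      mark_as_consumed(message)                  # in Karafka: commits the offset
    end
  end

  private

  # Stand-in for Karafka::BaseConsumer#mark_as_consumed
  def mark_as_consumed(message)
    @last_committed_offset = message.offset
  end
end

batch = [Message.new('{"id":1}', 0), Message.new('{"id":2}', 1)]
consumer = OffsetTrackingConsumer.new(batch)
consumer.consume
```

&lt;p&gt;After &lt;em&gt;consume&lt;/em&gt; runs, both payloads are handled and the last committed offset is 1, so a restarted consumer would resume after the second message.&lt;/p&gt;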

&lt;p&gt;Let's see our consumer in action now!&lt;/p&gt;

&lt;p&gt;Start the &lt;em&gt;karafka server&lt;/em&gt; process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bundle exec karafka server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And from the &lt;em&gt;rails console,&lt;/em&gt; let's publish an event to the &lt;em&gt;example&lt;/em&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;
&lt;span class="mf"&gt;3.2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;001&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;Karafka&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;topic: &lt;/span&gt;&lt;span class="s2"&gt;"example"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;payload: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"Karafka is awesome"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c3e48c35d33d&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="no"&gt;Sync&lt;/span&gt; &lt;span class="n"&gt;producing&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="s1"&gt;'example'&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="n"&gt;took&lt;/span&gt; &lt;span class="mf"&gt;17.234999895095825&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the &lt;em&gt;karafka server&lt;/em&gt; output, we should see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
[b3d1d38425a2] Polled 1 messages in 277.64600002765656ms

[076ac2fd7b7b] Consume job for ExampleConsumer on example/0 started

{"Karafka is awesome"=&amp;gt;"true"}

[076ac2fd7b7b] Consume job for ExampleConsumer on example/0 finished in 0.18400001525878906ms

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! That's enough to set up communication via Kafka using Karafka!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Wrapping up&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We've just covered some &lt;strong&gt;key aspects&lt;/strong&gt; of &lt;strong&gt;Kafka&lt;/strong&gt; - what it is, how it works, some good reasons to use it, and a simple demonstration of &lt;a href="http://github.com/karafka/karafka" rel="noopener noreferrer"&gt;the karafka framework&lt;/a&gt; that makes Kafka straightforward to use with Ruby (on Rails) applications.&lt;/p&gt;

&lt;p&gt;Stay tuned for the upcoming article that will get into more detail on &lt;strong&gt;how we use Kafka at Smily&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>microservices</category>
      <category>ruby</category>
      <category>rails</category>
    </item>
    <item>
      <title>Integration patterns for distributed architecture</title>
      <dc:creator>Karol Galanciak</dc:creator>
      <pubDate>Fri, 08 Sep 2023 07:28:34 +0000</pubDate>
      <link>https://dev.to/smily/integration-patterns-for-distributed-architecture-2p7c</link>
      <guid>https://dev.to/smily/integration-patterns-for-distributed-architecture-2p7c</guid>
      <description>&lt;p&gt;&lt;strong&gt;Distributed architectures&lt;/strong&gt; have been growing in popularity for quite a while for some good reasons. The rise of &lt;strong&gt;cloud services&lt;/strong&gt; making the deployments simpler, as well as the ever-growing complexity of the applications, resulted in a &lt;strong&gt;shift away from monolithic&lt;/strong&gt; architecture for many technical ecosystems. Microservices have emerged as an alternative solution offering &lt;strong&gt;greater modularity, scalability, reliability, agility, and ease of collaboration between multiple teams&lt;/strong&gt;. Nevertheless, these benefits &lt;strong&gt;don't come for free&lt;/strong&gt;. The price to pay could be significant due to many factors, and one of them is dealing with some challenges that don't necessarily happen when working on a monolith. One of such challenges is establishing the best way of &lt;strong&gt;integration and communication between services&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's examine the &lt;strong&gt;four primary ways&lt;/strong&gt; services can be integrated and how they all play their part in our architecture in Smily (formerly BookingSync). This article aims to provide a general overview of these patterns, and we will cover them in more detail in the upcoming blog posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Primary Ways of Integration of Microservices
&lt;/h2&gt;

&lt;p&gt;Are there really four ways of integration/communication in distributed architecture? Isn't it just HTTP API and async events?&lt;/p&gt;

&lt;p&gt;It turns out that there are some other ways. One is often considered an anti-pattern, and the other is a bit questionable as a standalone communication pattern, as it usually requires another one to be involved, but it's still worth mentioning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared database
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tPSw_f2f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxpqq7rly1scqofqax6i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tPSw_f2f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zxpqq7rly1scqofqax6i.jpg" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using a shared database is probably the simplest way to establish interaction ("communication" might be an overstatement in that case). You might have two or more applications using the same database without extra overhead, such as building extra APIs or publishing events, so it sounds very appealing, at least at the beginning.&lt;/p&gt;

&lt;p&gt;That's why using a shared database is often considered an anti-pattern - as it can easily lead to a poor design with extremely tight coupling and limited scalability. Just think about using a shared PostgreSQL database - coupling to the same schema is just the beginning of potential problems. Deadlocks can also become a headache at some point. And what about a massive number of connections and a significant load on the database cluster causing performance degradation? However, is it truly an anti-pattern? &lt;/p&gt;

&lt;p&gt;Let's think about the definition of an "anti-pattern". It's usually defined as something that might initially look like a good idea but turns out to be a wrong choice in the end. If we introduce tight coupling and limit scalability, it could indeed be an anti-pattern.&lt;/p&gt;

&lt;p&gt;But at the same time, it might not be a problem at all. Or maybe these trade-offs are perfectly justified. It really all comes down to a trade-off analysis and making deliberate decisions.&lt;/p&gt;

&lt;p&gt;Imagine that you have a single monolithic Ruby on Rails application. At some point, you want to introduce some Business Intelligence that might require heavy reporting. It could turn out that, due to some technological choices and the type of analysis you will perform on the data, a new service is required - for example, a new Python app, as Python is often a preferred solution in that domain. This app will need access to the data from the original monolith, which will only involve periodically reading that data.&lt;/p&gt;

&lt;p&gt;Which pattern would be more appropriate?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Building a dedicated REST/GraphQL API for the new service to fetch the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Introducing Kafka to the system and doing Change Data Capture to let the new app consume the stream of the events&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connecting to the database of a monolithic application&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Given the complexity and time needed to implement the first and second options, the shared database will probably be the best choice. It's not that the dedicated API or doing CDC over Kafka are wrong solutions - having them could be highly beneficial for multiple reasons, and they would also work in this particular case, but they are not the right solutions to this problem. The shared database is not perfect either, although there are ways to improve it - for example, connecting to a read-only replica instead of the primary to avoid excessive load that could degrade performance for the monolith.&lt;/p&gt;
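
&lt;p&gt;For the Rails side of such a setup, a minimal sketch using the multiple-databases API introduced in Rails 6 could look like this (the &lt;em&gt;monolith_replica&lt;/em&gt; entry in &lt;em&gt;database.yml&lt;/em&gt; and the model names are assumptions for illustration):&lt;/p&gt;

```ruby
# Hypothetical reporting-side setup (Rails 6+ multiple databases API):
# reads go to the monolith's read-only replica, never to the primary.
class ReportingRecord &lt; ActiveRecord::Base
  self.abstract_class = true

  # :monolith_replica is an assumed entry in config/database.yml
  connects_to database: { writing: :primary, reading: :monolith_replica }
end

class Booking &lt; ReportingRecord
  self.table_name = "bookings" # reuse the monolith's schema as-is
end

# Force the reading role for the reporting query
ActiveRecord::Base.connected_to(role: :reading) do
  Booking.where("created_at &gt;= ?", 30.days.ago).count
end
```

&lt;p&gt;The coupling to the monolith's schema is still there, of course - this only removes the load-related risk, not the design trade-off.&lt;/p&gt;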

&lt;p&gt;There are also other cases where using a shared database might be an interesting option - for example, as a temporary means of communication between services when breaking down a monolith into multiple applications.&lt;/p&gt;

&lt;p&gt;Claiming that a shared database is an anti-pattern is simply harmful, as it might be a good choice for specific use cases. Just because it's a bad one for many of them doesn't mean it needs to be crossed out entirely. Architecture is ultimately about trade-offs and supporting the key non-functional requirements, so making well-informed decisions is essential.&lt;/p&gt;

&lt;p&gt;This pattern has such a bad reputation, even though it's fine in a couple of use cases, that we will most likely publish a separate article covering it in more detail in the near future.&lt;/p&gt;

&lt;p&gt;And how do we use this pattern at Smily?&lt;/p&gt;

&lt;p&gt;There are two distinct use cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Business Intelligence. We have a dedicated service responsible for data preparation and storing the data in the PostgreSQL database, and we use &lt;a href="https://aws.amazon.com/quicksight/"&gt;AWS Quicksight&lt;/a&gt; as a Business Intelligence tool that reads the data from the read-only replica.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Avoiding having every microservice that needs a massive amount of data process it itself - a single application does the processing, and the other ones read from its database. This use case is fairly complex, and the decision to architect it that way deserves a separate article. Yet, to keep it simple for now, it's one of the cases where there was no perfect solution, and it was about picking the lesser evil - especially when comparing the costs of the potential solutions: processing a massive amount of data is not cheap, particularly considering the price of the required AWS EC2 instances and of storage on AWS EBS volumes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  File transfer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a0-8-sS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i5yg95wi0z2nqjbj363d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a0-8-sS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i5yg95wi0z2nqjbj363d.jpg" alt="Image description" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not the typical pattern you think of when integrating microservices, especially since the first thought it brings is probably old-school FTP. It's essentially about exporting the data to a file and letting the consumer take care of it. It's not necessarily a standalone pattern, as it requires some other communication pattern (such as a synchronous HTTP API). Yet, it's pretty handy when moving a large volume of data, so let's discuss it separately.&lt;/p&gt;

&lt;p&gt;Imagine the following use case - there is a need to export gigabytes of data periodically for multiple consumers. Fetching some data is perfectly normal for almost every HTTP API, and you could use pagination when many records are involved. Still, this may not be the most efficient solution if we are talking about gigabytes of data.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a simple alternative - export the data to a CSV file (e.g., via &lt;a href="https://github.com/diogob/postgres-copy"&gt;postgres-copy gem&lt;/a&gt; ), upload the file to some cloud storage, such as AWS S3, and return the links in the API. &lt;/p&gt;

&lt;p&gt;And this is exactly how we use this pattern at Smily in one of our public APIs! The results are partitioned by day, and a single response contains a few hundred links to CSV files on AWS S3 that are periodically uploaded by background jobs. This massively limits the traffic in our API (although some Sidekiq workers take care of exporting the data) and simplifies the entire process for API consumers - they can get everything in a single request, and processing the files can easily be parallelized thanks to the partitioning by day.&lt;/p&gt;
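
&lt;p&gt;The partition-by-day idea can be sketched in a few lines of plain Ruby (the record fields and helper name are made up for illustration; in a real setup the per-day CSVs would be produced with something like the postgres-copy gem and uploaded to S3):&lt;/p&gt;

```ruby
require "csv"
require "date"

# Group records by day and render one CSV document per partition; each
# document would then be uploaded to S3 and linked from the API response
def csv_partitions_by_day(records)
  records.group_by { |r| r[:updated_at].to_date }.transform_values do |rows|
    CSV.generate(headers: %w[id name], write_headers: true) do |csv|
      rows.each { |row| csv.add_row([row[:id], row[:name]]) }
    end
  end
end

records = [
  { id: 1, name: "Villa", updated_at: Time.utc(2023, 9, 1, 10) },
  { id: 2, name: "Chalet", updated_at: Time.utc(2023, 9, 1, 12) },
  { id: 3, name: "Loft", updated_at: Time.utc(2023, 9, 2, 9) }
]

partitions = csv_partitions_by_day(records)
```

&lt;p&gt;Each key in the resulting hash is a day, and each value is a complete CSV document for that day - the unit a consumer can download and process independently.&lt;/p&gt;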

&lt;h2&gt;
  
  
  Synchronous Request-Response
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x9yOqQLb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhiu5xjk4cyqunl7c20h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x9yOqQLb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhiu5xjk4cyqunl7c20h.jpg" alt="Image description" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is probably the most common communication pattern in distributed architectures, and for some good reasons - at least if we consider the typical use case: an HTTP API, such as a REST API. We cannot forget about RPCs (Remote Procedure Calls) here, which have some great benefits, and even though RPC might be a less popular integration pattern, it can be a superior choice compared to a REST or GraphQL API.&lt;/p&gt;

&lt;p&gt;RPC definitely deserves a separate article, as it comes in different flavors (&lt;a href="https://grpc.io"&gt;gRPC&lt;/a&gt; has been growing in popularity for quite a while, but even &lt;a href="https://www.rabbitmq.com"&gt;RabbitMQ&lt;/a&gt;, a message broker for typically asynchronous messaging, makes it relatively straightforward to implement RPC. And there is &lt;a href="https://en.wikipedia.org/wiki/SOAP"&gt;SOAP&lt;/a&gt;, but at this point it is pretty much dead), and we are going to cover it in more detail in the future.&lt;/p&gt;

&lt;p&gt;And for now, let's focus on typical HTTP APIs and some of their significant benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;HTTP APIs are ubiquitous, both REST and GraphQL, so most of the experienced developers are familiar with the concepts and the expected problems and patterns to handle them (such as retrying failed requests, timeouts, circuit breakers, idempotency keys)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No extra tech is required to establish the communication, such as message brokers, so there is no additional overhead of managing new infrastructure, establishing extra monitoring, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple standards are available (for example, &lt;a href="https://jsonapi.org"&gt;JSONAPI&lt;/a&gt; or GraphQL itself), so there is no need to reinvent the wheel for the payload structure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple to reason about thanks to the synchronous nature - the feedback is immediately available&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexibility of authentication and authorization and well-known standards for that (&lt;a href="https://jwt.io"&gt;JSON Web Tokens&lt;/a&gt;, &lt;a href="https://oauth.net/2/"&gt;OAuth&lt;/a&gt;) &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
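
&lt;p&gt;As a small illustration of two of the client-side patterns listed above - bounded retries and idempotency keys - here is a generic sketch in plain Ruby (the names and the callable standing in for an HTTP client are made up; this is not any specific library's API):&lt;/p&gt;

```ruby
require "securerandom"

# Retry a request a bounded number of times, reusing one idempotency key
# across all attempts so the server can deduplicate repeated deliveries
def call_with_retries(request, max_attempts: 3, idempotency_key: SecureRandom.uuid)
  attempts = 0
  begin
    attempts += 1
    request.call(idempotency_key)
  rescue StandardError
    retry unless attempts == max_attempts
    raise
  end
end

# Simulated flaky endpoint: it times out twice and then succeeds
seen_keys = []
failures = 2
flaky_endpoint = lambda do |key|
  seen_keys.push(key)
  if failures.positive?
    failures -= 1
    raise "timeout"
  end
  { status: 200, key: key }
end

response = call_with_retries(flaky_endpoint)
```

&lt;p&gt;The third attempt succeeds, and all three attempts carry the same idempotency key, which is what allows the server to treat them as one logical request.&lt;/p&gt;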

&lt;p&gt;As great as it sounds, this integration pattern can be a wrong choice for many cases and reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Complex/high-latency operations are involved - if generating a response takes minutes or even hours, synchronous communication is definitely not an efficient solution. Even though you could, to some extent, design it so that you don't need to introduce asynchronous events, e.g., by having an endpoint where you could enqueue the operation to be processed and then periodically check the completion status, it doesn't mean that it's the best way to solve this problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increased coupling - using HTTP APIs leads to a way tighter coupling than async messaging, as you need to know quite a lot about the service you are calling. Also, when one service is down, the failure can propagate to the other services. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability - the synchronous nature of communication involves way more overhead than the async one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fetching huge volumes of data - even though this is possible via a REST API, as demonstrated in the previous section about file transfers, it can be a highly suboptimal choice for many use cases, often leading to reinventing Kafka over REST (once we cover Kafka in more detail, that phrase will become clearer). Imagine that you operate on millions of records, and somehow you have already managed to fetch them via the API. For the subsequent GET calls, you only want the records that have changed since the last request. Usually, this is implemented by storing timestamps and providing them in subsequent calls to filter out the records updated since that time, which is, to some extent, operating with timestamp-based offsets. It might sound like a decent solution on a small scale, but for a massive volume of data that is updated often, which multiple services want to fetch as quickly as possible, it quickly becomes ugly. It requires handling a massive number of requests, which only grows with the number of endpoints where this happens; the same work is repeated for every service; and the responses cannot be cached easily, as they depend on the timestamps. Even with some fixed timestamp values, storing all the cache would be another challenge. Just because it might be doable via a REST API doesn't mean it's the best way to do it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Not suitable for complex workflows - it can become quite awkward when you implement sagas with REST API and deal with compensating transactions upon failures and generally error handling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
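
&lt;p&gt;To make the timestamp-based offsets point concrete, here is an in-memory sketch of the "give me everything updated since X" polling described above (the field names and the in-memory store are purely illustrative):&lt;/p&gt;

```ruby
# Returns true when the given time is strictly after the cursor
def after?(time, cursor)
  (time - cursor).positive?
end

# Each call returns the records changed after the cursor plus a new cursor,
# which the client persists and sends with its next request
def fetch_changed_since(records, cursor)
  changed = records
            .select { |r| after?(r[:updated_at], cursor) }
            .sort_by { |r| r[:updated_at] }
  next_cursor = changed.empty? ? cursor : changed.last[:updated_at]
  [changed, next_cursor]
end

records = [
  { id: 1, updated_at: Time.utc(2023, 9, 1) },
  { id: 2, updated_at: Time.utc(2023, 9, 3) },
  { id: 3, updated_at: Time.utc(2023, 9, 5) }
]

first_page, cursor = fetch_changed_since(records, Time.utc(2023, 9, 2))
second_page, cursor = fetch_changed_since(records, cursor)
```

&lt;p&gt;Every consumer has to run this polling loop on its own, against every endpoint, which is precisely the overhead that makes the approach ugly at scale.&lt;/p&gt;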

&lt;p&gt;It turns out that the synchronous request-response communication style is not necessarily a clear winner for most cases, but again, architecture is about the trade-offs.&lt;/p&gt;

&lt;p&gt;And how do we use it at Smily?&lt;/p&gt;

&lt;p&gt;To start with, we have two public APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.bookingsync.com/"&gt;The primary general-purpose one&lt;/a&gt; that we use quite heavily for internal applications as well&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://channel-api.developers.bookingsync.com/"&gt;A more specialized one&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On top of that, we have some private APIs, for example, as backends for Ember single-page apps or for typical inter-service communication. &lt;/p&gt;

&lt;p&gt;And, of course, we consume plenty of APIs ourselves, both REST and GraphQL, so in general, HTTP APIs are abundant in our ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Asynchronous Events
&lt;/h2&gt;

&lt;p&gt;To a limited extent, we've covered this already as a contrasting integration pattern to the synchronous request-response communication style.&lt;/p&gt;

&lt;p&gt;When thinking about async messages or events, &lt;a href="https://www.rabbitmq.com"&gt;RabbitMQ&lt;/a&gt; or &lt;a href="https://kafka.apache.org"&gt;Kafka&lt;/a&gt; might be the first things that come to mind as typical examples. We will get to these in a moment, but let's start with a not-so-obvious pattern - &lt;a href="https://zapier.com/blog/what-are-webhooks/"&gt;webhooks&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Webhooks
&lt;/h3&gt;

&lt;p&gt;Yes, webhooks are also asynchronous messages, and they can be great both as additions to HTTP APIs and as a standalone pattern that lets you benefit from a push flow instead of pulling the data from the API. That way, you can receive messages easily, even from third-party applications, so it's possible to have async events without involving any extra broker.&lt;/p&gt;

&lt;p&gt;To benefit from webhooks, you need to expose an HTTP endpoint (so, to some extent, it involves an HTTP API) to which a message will be delivered, typically in JSON, XML, or form-encoded format, often secured by an extra signature so the receiver can tell whether whoever is sending the given webhook is a legitimate sender.&lt;/p&gt;

&lt;p&gt;A simple type of webhook could look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST https://my-awesome-app.com/api/webhooks?event=booking_created&amp;amp;id=1&amp;amp;signature=123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's how we can get notified that a booking (notice the &lt;code&gt;event&lt;/code&gt; param with &lt;code&gt;booking_created&lt;/code&gt; value) with an ID of &lt;code&gt;1&lt;/code&gt; has just been created. And there is also a signature for security purposes that hopefully would look a bit more secure in a real-world case.&lt;/p&gt;
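&lt;p&gt;The signature part is worth sketching out. A common scheme (though an assumption here - each provider documents its own signing method) is an HMAC over the payload with a shared secret, compared in constant time on the receiving end:&lt;/p&gt;

```ruby
require "openssl"

# A sketch of verifying a webhook signature with HMAC-SHA256. The secret
# and signing scheme here are assumptions - real providers document their
# exact signing method, so always check it.
WEBHOOK_SECRET = "shared-secret".freeze

def signature_for(payload)
  OpenSSL::HMAC.hexdigest("SHA256", WEBHOOK_SECRET, payload)
end

# Constant-time comparison so attackers cannot probe the signature
# byte by byte via response timing.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize
  result = 0
  a.bytes.zip(b.bytes) { |x, y| result |= x ^ y }
  result.zero?
end

def valid_signature?(payload, given_signature)
  secure_compare(signature_for(payload), given_signature)
end

payload = '{"event":"booking_created","id":1}'
signature = signature_for(payload)
valid_signature?(payload, signature) # true
valid_signature?(payload, "bogus")   # false
```

&lt;p&gt;Anything that fails verification should be rejected before doing any work, since the endpoint is, by design, open to the public internet.&lt;/p&gt;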

&lt;p&gt;In Smily, webhooks are an integral part of our &lt;a href="https://developers.bookingsync.com"&gt;primary public API&lt;/a&gt; and are highly recommended for building a robust integration. You can find documentation about them &lt;a href="https://developers.bookingsync.com/guides/webhook-subscriptions/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Message brokers
&lt;/h3&gt;

&lt;p&gt;Now, let's focus on the more classic case where a middleman called a "message broker" connects producers of the messages with their consumers, allowing the implementation of the publish/subscribe model (or point-to-point messaging where the message is delivered to a single specific consumer instead of multiple subscribers). Thanks to that, you can publish a single event, and the message broker will ensure it's consumable by all the appropriate subscribers based on the defined config and routing, which depend on the specific message broker.&lt;/p&gt;

&lt;p&gt;The differences between message brokers can be pretty significant, and perhaps the most meaningful one is what kind of model they implement, as we have two different types of models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Smart broker/dumb consumer - the message broker is responsible directly for delivering the message to the consumer so that the consumer just waits to process events. Notable example: &lt;a href="https://www.rabbitmq.com"&gt;RabbitMQ&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dumb broker/smart consumer - the messages are available in the broker, but it is up to consumers to deal with these messages. Notable example: &lt;a href="https://kafka.apache.org"&gt;Kafka&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choosing a message broker might already sound complex once you are sure you need async messaging. There is bad news, though: it only gets more complex from this point.&lt;/p&gt;

&lt;p&gt;The sneaky issue is that most problems, at least when defined generically, can be solved with either of these models. Some use cases might require a bit more effort or an extra third-party tool, but in general, you should be able to achieve the result with either choice.&lt;/p&gt;

&lt;p&gt;The topic is so broad and complex that we will publish a couple of follow-up articles to cover the differences, among many other things, but to at least have a simple overview now, let's cover two brokers: RabbitMQ and Kafka.&lt;/p&gt;

&lt;h4&gt;
  
  
  RabbitMQ
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hArY_tMj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81fcanljadmprgotkgz7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hArY_tMj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/81fcanljadmprgotkgz7.jpg" alt="Image description" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RabbitMQ is essentially about publishing messages to queues from which consumers can read and process them. It might sound a bit like using Redis and Sidekiq, where jobs are pushed to the queues, and Sidekiq workers take them and handle the processing. Still, there is one essential difference - when using RabbitMQ, producers don't push directly to the queues; they publish messages to exchanges, which are ultimately responsible for delivering messages to the queues bound to them based on routing keys. The consumers within a single consumer group can subscribe to the same queue, competing for the messages (for parallelization), and once a message gets processed, it's gone from the broker.&lt;/p&gt;

&lt;p&gt;This design has a profound impact on what RabbitMQ is capable of. Exchanges make it possible to implement a publish/subscribe model with multiple consumer groups and the killer feature of RabbitMQ - smart routing based on the routing key (which is an extra attribute of the message), including wildcard matches.&lt;/p&gt;

&lt;p&gt;For example, you can publish messages with the following routing keys: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"rentals.eu.france.paris"&lt;/li&gt;
&lt;li&gt;"rentals.eu.france.nevache"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose you want the consumer to process messages concerning only Nevache (so the ones with "rentals.eu.france.nevache"). In that case, it's pretty straightforward - use "rentals.eu.france.nevache" as a binding key. However, what if you want to process all messages regarding rentals? Or all rentals from France? You can use wildcard matching! In this case, "rentals.#" and "rentals.eu.france.*" would be the appropriate binding keys ("#" matches zero or more words, "*" matches exactly one), and the exchange will be smart enough to deliver the desired messages to the queues (this works only for a &lt;em&gt;topic&lt;/em&gt; exchange, but don't worry about it at this point - we are going to cover all types of exchanges in an upcoming article).&lt;/p&gt;
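&lt;p&gt;If the wildcard semantics feel abstract, here is a pure-Ruby sketch of how a topic exchange matches binding keys against routing keys. This only illustrates the matching rules ("*" stands for exactly one dot-separated word, "#" for zero or more) - it is not how RabbitMQ is actually implemented:&lt;/p&gt;

```ruby
# Recursive matcher over dot-separated words, mimicking the binding-key
# semantics of an AMQP topic exchange.
def topic_match?(binding_words, routing_words)
  if binding_words.empty?
    routing_words.empty?
  elsif binding_words.first == "#"
    # "#" can absorb any number of routing words, including none
    (0..routing_words.size).any? do |skip|
      topic_match?(binding_words[1..], routing_words[skip..])
    end
  elsif binding_words.first == "*" || binding_words.first == routing_words.first
    !routing_words.empty? && topic_match?(binding_words[1..], routing_words[1..])
  else
    false
  end
end

def binding_matches?(binding_key, routing_key)
  topic_match?(binding_key.split("."), routing_key.split("."))
end

binding_matches?("rentals.#", "rentals.eu.france.nevache")         # true
binding_matches?("rentals.eu.france.*", "rentals.eu.france.paris") # true
binding_matches?("rentals.eu.france.*", "rentals.eu.italy.rome")   # false
```

&lt;p&gt;In RabbitMQ itself, you simply declare a queue bound to a topic exchange with such a binding key, and the broker takes care of the routing.&lt;/p&gt;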

&lt;p&gt;Also, what is interesting about RabbitMQ is that you can implement RPC, thanks to the callback queues.&lt;/p&gt;

&lt;p&gt;On top of that, RabbitMQ offers priority queues and dead-letter exchanges.&lt;/p&gt;

&lt;h4&gt;
  
  
  Kafka
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l5MtpLnL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vbt7o5k8xrcw49ilbzmv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l5MtpLnL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vbt7o5k8xrcw49ilbzmv.jpg" alt="Image description" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kafka is a distributed streaming platform that takes a different approach from RabbitMQ. Kafka essentially stores all the messages in an append-only log. Producers publish messages to the topics that can be split into multiple partitions, and each partition represents a separate append-only log in which events are ordered. And every message in the partition has its own index (offset) based on which we can identify its position in the log. &lt;/p&gt;

&lt;p&gt;The consumers read data from partitions periodically, and once they are done with a batch of events, they persist the current offset and move on to another batch. What is important is that within one consumer group, a single partition can have only a single consumer (as this is the only way to guarantee strict ordering).&lt;/p&gt;

&lt;p&gt;This design is what makes Kafka so powerful. What is more, you can even replay events - whether by a consumer that has already processed the messages (just rewind its current offset to an earlier one) or by a brand-new consumer group that starts processing from the beginning - so there is no need to republish events to make them available for processing again.&lt;/p&gt;
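&lt;p&gt;A toy model can make the offset mechanics tangible. The sketch below models a single partition as an append-only array with one committed offset per consumer group - purely illustrative, of course; real Kafka is a distributed system with brokers, replication, and much more:&lt;/p&gt;

```ruby
# Toy model of a single Kafka partition: an append-only log plus a
# committed offset per consumer group.
class Partition
  def initialize
    @log = []              # append-only list of messages
    @offsets = Hash.new(0) # committed offset per consumer group
  end

  def publish(message)
    @log.push(message)
  end

  # Read up to max messages from the group's committed offset onward,
  # then commit the new offset (a simplified at-least-once flow).
  def poll(group, max: 10)
    from = @offsets[group]
    batch = @log[from, max] || []
    @offsets[group] = from + batch.size
    batch
  end

  # Replaying events is just rewinding the committed offset.
  def seek(group, offset)
    @offsets[group] = offset
  end
end

partition = Partition.new
%w[created updated cancelled].each { |event| partition.publish(event) }

first = partition.poll("billing")            # all three events, in order
nothing = partition.poll("billing")          # [] - offset is at the end
partition.seek("billing", 0)                 # rewind to replay
replayed = partition.poll("billing", max: 2) # first two events again
fresh = partition.poll("analytics")          # new group starts from offset 0
```

&lt;p&gt;Notice that consuming never removes anything from the log - that is exactly why a new consumer group, or a rewound offset, can re-read history without republishing.&lt;/p&gt;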

&lt;p&gt;And as far as the retention goes, there is a lot of flexibility. You can define it based on the storage size or time. For example, you can configure it to store messages for 3 days, and everything beyond that will be automatically removed. Or you can configure it to retain messages forever (well, approximately - there is no truly infinite retention, but you can keep messages for hundreds of years).&lt;/p&gt;

&lt;p&gt;Performance and the ability to scale are other strong points of Kafka. If it's good enough for activity tracking at LinkedIn, it's not something you will need to worry about for quite a while, at least as long as the consumers are not bottlenecks and the number of partitions is optimal.&lt;/p&gt;

&lt;p&gt;Activity tracking, log/events aggregation, anomaly detection, and (nearly) real-time data processing are quite typical use cases for Kafka as well, thanks to the ease of integrating it with so many other tools like &lt;a href="https://flink.apache.org"&gt;Apache Flink&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  RabbitMQ vs. Kafka
&lt;/h4&gt;

&lt;p&gt;Choosing between RabbitMQ and Kafka is not a simple decision. Nevertheless, let's summarize it with some general hints and guidelines.&lt;/p&gt;

&lt;p&gt;Use RabbitMQ when you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;need complex routing&lt;/li&gt;
&lt;li&gt;don't need to retain messages or replay them&lt;/li&gt;
&lt;li&gt;need priority queues &lt;/li&gt;
&lt;li&gt;need to support RPC calls&lt;/li&gt;
&lt;li&gt;need a "simple" broker to get the job done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use Kafka when you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;need to handle an extreme throughput of messages&lt;/li&gt;
&lt;li&gt;need strict ordering of the messages&lt;/li&gt;
&lt;li&gt;need to retain messages for an extended time or replay them&lt;/li&gt;
&lt;li&gt;do the actual stream processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In reality, though, it's a bit more complex than that. A lot depends on the overall ecosystem, throughput, available tools, or even... your monitoring practices.&lt;/p&gt;

&lt;p&gt;To give you an example, in my experience, the integration via RabbitMQ generally works very smoothly and requires very little attention. With Kafka, it's a bit of a different story - if you don't have good monitoring practices, I wouldn't even consider it as a possible option. For example, suppose a message cannot be processed due to some error. In that case, processing from the given partition will be blocked until the issue is addressed, so you'd better have proper monitoring to tell you about it if you cannot avoid it in the first place (or use a third-party tool that implements a dead-letter queue). When it takes too long to process messages, you might also see odd things happen, like the same batch being reprocessed constantly and never finishing. So again, monitoring is critical here.&lt;/p&gt;

&lt;p&gt;On the other hand, it's still possible to have the strict ordering of the messages in RabbitMQ - at least if you don't have multiple consumers competing for the messages from a single queue. But that will have an impact on the scalability and performance.&lt;/p&gt;

&lt;p&gt;Ultimately, the final choice requires carefully evaluating the trade-offs and a deep understanding of the ecosystem where the broker will be used.&lt;/p&gt;

&lt;p&gt;And the final question: which one do we use in Smily? We use both, for different use cases, and for each of them, we've developed a custom gem that massively simplifies working with it.&lt;/p&gt;

&lt;p&gt;For RabbitMQ, we use &lt;a href="http://github.com/BookingSync/hermes-rb"&gt;hermes-rb&lt;/a&gt; which has been available for quite a while, and for Kafka, we use something that is not yet publicly available, but it will be very soon. And both will be covered in upcoming articles, including more details on how and why we use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;In this article, we've covered &lt;strong&gt;four primary integration patterns&lt;/strong&gt; for the distributed architecture: &lt;strong&gt;shared database&lt;/strong&gt;, &lt;strong&gt;file transfer&lt;/strong&gt;, &lt;strong&gt;synchronous request-response&lt;/strong&gt;, and &lt;strong&gt;asynchronous events&lt;/strong&gt;. We've also discussed the differences between &lt;strong&gt;Kafka&lt;/strong&gt; and &lt;strong&gt;RabbitMQ&lt;/strong&gt; and briefly mentioned how we apply these patterns in Smily.&lt;/p&gt;

&lt;p&gt;Stay tuned for the upcoming articles as they will go much deeper, especially about asynchronous events.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>architecture</category>
      <category>kafka</category>
      <category>distributedsystems</category>
    </item>
  </channel>
</rss>
