Rajesh Gunasekaran

Deploying Amazon MSK at Scale: A Platform Engineer's Journey at Wehkamp

Introduction

At Wehkamp, we embarked on a journey to modernize our messaging infrastructure by migrating from self-managed Kafka on EC2 to Amazon MSK (Managed Streaming for Apache Kafka). This post shares real-world experiences from the platform team perspective - the challenges we faced, lessons learned, and how we built a scalable, multi-account MSK platform using Infrastructure as Code.

Background: Why MSK?

Wehkamp's legacy Kafka infrastructure (established around 2014-2015) ran on EC2 instances. While it served us well initially, we faced several challenges:

  • Scaling complexity: Manual broker management across multiple business units
  • Operational overhead: Patching, ZooKeeper management, monitoring setup
  • Stability concerns: Resource contention and performance issues
  • Multi-account sprawl: Difficult to maintain consistency across environments

Our Migration Goals

We set out to achieve:

  • Reduce operational burden on the platform team
  • Improve reliability with AWS-managed infrastructure
  • Standardize Kafka deployments across all AWS accounts
  • Enable faster scaling and easier maintenance
  • Free up engineering time for higher-value work

Infrastructure as Code: Multi-Account MSK with Terraform

At Wehkamp, we managed MSK clusters across multiple AWS accounts representing different business units and environments (dev, staging, production).

Here's how we structured our Terraform setup:

Repository Structure

terraform/
├── modules/
│   └── msk-cluster/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       └── broker-config.tf
├── accounts/
│   ├── bu1-prod/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── msk.tf
│   ├── bu1-dev/
│   │   └── ...
│   ├── bu2-prod/
│   │   └── ...
│   └── bu2-dev/
│       └── ...
└── aws-config/
    └── credentials

The MSK Module Pattern

We created a reusable MSK module that handled all the complexity:

  # modules/msk-cluster/main.tf
  resource "aws_msk_cluster" "main" {
    cluster_name           = var.cluster_name
    kafka_version          = var.kafka_version
    number_of_broker_nodes = var.broker_count

    broker_node_group_info {
      instance_type   = var.instance_type
      client_subnets  = var.subnet_ids
      security_groups = [aws_security_group.msk.id]

      storage_info {
        ebs_storage_info {
          volume_size = var.storage_size_gb
        }
      }
    }

    configuration_info {
      arn      = aws_msk_configuration.main.arn
      revision = aws_msk_configuration.main.latest_revision
    }

    encryption_info {
      encryption_in_transit {
        client_broker = "TLS"
        in_cluster    = true
      }
    }

    tags = merge(var.common_tags, {
      Environment = var.environment
      ManagedBy   = "Terraform"
    })
  }
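
The module's main.tf also references aws_security_group.msk, and the repository layout above lists an outputs.tf; neither is shown in full here. A minimal sketch of those pieces might look like the following (var.vpc_id and var.allowed_cidr_blocks are illustrative variables, not our exact configuration):

  # modules/msk-cluster/main.tf (continued) - sketch only
  # var.vpc_id and var.allowed_cidr_blocks are illustrative inputs
  resource "aws_security_group" "msk" {
    name_prefix = "${var.cluster_name}-msk-"
    vpc_id      = var.vpc_id

    ingress {
      description = "Kafka TLS from trusted networks"
      from_port   = 9094
      to_port     = 9094
      protocol    = "tcp"
      cidr_blocks = var.allowed_cidr_blocks
    }

    tags = var.common_tags
  }

  # modules/msk-cluster/outputs.tf - expose connection details to callers
  output "bootstrap_brokers_tls" {
    description = "TLS bootstrap broker string for client configuration"
    value       = aws_msk_cluster.main.bootstrap_brokers_tls
  }

  output "zookeeper_connect_string" {
    value = aws_msk_cluster.main.zookeeper_connect_string
  }

Exposing bootstrap_brokers_tls from the module lets each account hand broker addresses to applications and monitoring without hard-coding them.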

Calling the Module Per Account

Each AWS account directory called the module with environment-specific values:

  # accounts/bu1-prod/msk.tf
  module "msk_cluster" {
    source = "../../modules/msk-cluster"

    cluster_name    = "bu1-prod-msk"
    kafka_version   = "2.8.1"
    broker_count    = 3
    instance_type   = "kafka.m5.large"
    storage_size_gb = 1000
    subnet_ids      = data.aws_subnets.private.ids

    environment = "production"

    common_tags = {
      BusinessUnit = "BU1"
      CostCenter   = "engineering"
    }
  }
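
For completeness, the module's variables.tf declares the inputs used in this call. A minimal sketch (types inferred from the values above) could be:

  # modules/msk-cluster/variables.tf - minimal sketch matching the call above
  variable "cluster_name"    { type = string }
  variable "kafka_version"   { type = string }
  variable "broker_count"    { type = number }
  variable "instance_type"   { type = string }
  variable "storage_size_gb" { type = number }
  variable "subnet_ids"      { type = list(string) }
  variable "environment"     { type = string }

  variable "common_tags" {
    type    = map(string)
    default = {}
  }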

Multi-Account Authentication

Each AWS account directory contained its own variable files and configuration:

terraform/
├── accounts/
│   ├── bu1-prod/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars   # Variables for this account
│   │   └── msk.tf
│   ├── bu1-dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars   # Variables for this account
│   │   └── msk.tf

We used AWS credential profiles, and Terraform automatically picked up the tfvars file in each directory:

  # Deploying to a specific account
  cd terraform/accounts/bu1-prod
  terraform init
  terraform plan
  terraform apply

Terraform automatically used:

  • The terraform.tfvars file in the current directory
  • The appropriate AWS credential profile configured in aws-config (see the sketch below)
  • Module references pointing to ../../modules/msk-cluster
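
For reference, the per-account provider wiring can be as small as a single block. The region and profile below are placeholders rather than our actual values, and shared_credentials_files assumes AWS provider v4 or later:

  # accounts/bu1-prod/main.tf - illustrative sketch, values are placeholders
  provider "aws" {
    region                   = "eu-west-1"                       # example region
    profile                  = "bu1-prod"                        # profile defined in aws-config/credentials
    shared_credentials_files = ["../../aws-config/credentials"]  # repo-local credentials file
  }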

The Initial Setup

When provisioning our MSK clusters, we followed the AWS MSK best practices guide (https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html) and calculated our requirements based on:

  • Expected throughput (MB/s)
  • Number of partitions
  • Retention policies
  • Client connections

Based on these calculations, we initially deployed with kafka.m5.large instances.
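
As an illustration of that kind of back-of-the-envelope sizing, the storage part of the calculation can even live next to the Terraform code as locals. The numbers below are hypothetical, not our production figures:

  # Illustrative sizing sketch - all numbers are hypothetical
  locals {
    ingress_mb_per_s   = 5  # expected peak write throughput
    retention_hours    = 24 # topic retention
    replication_factor = 3
    broker_count       = 3

    # Data written per retention window, replicated, spread across brokers
    storage_per_broker_gb = ceil(
      local.ingress_mb_per_s * 3600 * local.retention_hours
      * local.replication_factor / local.broker_count / 1024
    )
  }

  output "suggested_storage_per_broker_gb" {
    value = local.storage_per_broker_gb # ~422 GB with the numbers above
  }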

Problem: Capacity Exceeded

Shortly after production deployment, we hit a critical issue: our partition count grew beyond what the instance type could accommodate. MSK has recommended partition limits per broker that depend on the instance type:

| Instance Type | Max Partitions per Broker |
|------------------|---------------------------|
| kafka.m5.large | ~1000 partitions |
| kafka.m5.xlarge | ~2000 partitions |
| kafka.m5.2xlarge | ~4000 partitions |

With multiple business units creating topics organically, we quickly exceeded capacity, causing:

  • Broker CPU spikes
  • Increased replication lag
  • Connection timeouts

Solution: Upgrade to kafka.m5.2xlarge

We upgraded the instance type via Terraform:

  module "msk_cluster" {
    source = "../../modules/msk-cluster"

    # Changed from kafka.m5.large
    instance_type = "kafka.m5.2xlarge"

    # ... other config
  }



  terraform apply
  # MSK performs rolling upgrade, no downtime

The upgrade worked smoothly - MSK performed a rolling update with zero downtime.

New Problem: Over-Provisioning & Cost

A few weeks later, our AWS cost reports flagged a significant increase. We had over-provisioned:

  • Actual usage: ~1,500 partitions
  • Provisioned capacity: kafka.m5.2xlarge (4,000 partitions)
  • Cost impact: 2x more expensive than needed

We wanted to downgrade to kafka.m5.xlarge (the Goldilocks size), but discovered a critical MSK limitation:

⚠️ MSK does NOT support downgrading instance types through the console or API.

You can only upgrade, never downgrade.

The Workaround: AWS Support Case

We raised an AWS Support case requesting manual downgrade assistance. Here's what we learned:

AWS Support Options:

  1. Recommended approach: Create new cluster with correct size, migrate topics
    • Pros: Clean slate, proper sizing
    • Cons: Complex migration, client reconfiguration
  2. Manual intervention (what we did): AWS engineers performed backend downgrade
    • Pros: Faster, no client changes
    • Cons: Requires support case, not guaranteed for all scenarios

The AWS support team successfully downgraded our cluster, but the process took 48-72 hours and required careful planning during a maintenance window.

Lessons Learned

  1. Start conservatively, but not too conservatively
    • Use monitoring data to right-size within the first 30 days
    • Build in 30-40% headroom for growth
  2. Monitor partition count actively
    • Set CloudWatch alarms for partition count thresholds
    • Implement topic creation governance (approval process)
  3. MSK instance type changes are one-way
    • You can upgrade easily, but downgrades require AWS support
    • Plan sizing carefully to avoid cost optimization headaches
  4. Consider kafka.m5.xlarge as default starting point
    • Good balance of capacity and cost
    • Enough headroom for most workloads
  5. Implement topic quotas
    • Prevent unbounded topic/partition growth
    • Use Kafka quotas or approval workflows

Cost Optimization

After this experience, we implemented partition count monitoring:

  # CloudWatch alarm for partition count
  resource "aws_cloudwatch_metric_alarm" "partition_count" {
    alarm_name          = "msk-partition-count-high"
    comparison_operator = "GreaterThanThreshold"
    evaluation_periods  = 2
    metric_name         = "PartitionCount"
    namespace           = "AWS/Kafka"
    period              = 300
    statistic           = "Average"
    threshold           = 1500 # 75% of kafka.m5.xlarge capacity
    alarm_description   = "MSK partition count approaching instance type limit"

    dimensions = {
      "Cluster Name" = aws_msk_cluster.main.cluster_name
    }
  }

Estimated cost savings: ~$500/month per cluster by right-sizing.


Operational Challenge #2: Production Incident - MSK Disk Space Crisis

Our Monitoring Setup

At Wehkamp, we implemented a tiered alerting strategy (a rough wiring sketch follows the list):

  • Informational alerts → Slack channel #aws-platform-alerts
  • Critical alerts → Immediate action required
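
One way to wire that tiering up in Terraform is to give each tier its own SNS topic and let every alarm pick a tier via alarm_actions. The topic names here are hypothetical, and forwarding SNS to Slack (for example via AWS Chatbot) is not shown:

  # Tiered alert routing - sketch only
  resource "aws_sns_topic" "msk_info_alerts" {
    name = "msk-info-alerts" # ends up in #aws-platform-alerts
  }

  resource "aws_sns_topic" "msk_critical_alerts" {
    name = "msk-critical-alerts" # requires immediate action
  }

  # CloudWatch alarms (such as the partition-count alarm above) choose a tier
  # by pointing alarm_actions at the matching topic, e.g.:
  #   alarm_actions = [aws_sns_topic.msk_critical_alerts.arn]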

The Incident Timeline
