Introduction
At Wehkamp, we embarked on a journey to modernize our messaging infrastructure by migrating from self-managed Kafka on EC2 to Amazon MSK (Managed Streaming for Apache Kafka). This post shares real-world experiences from the platform team's perspective: the challenges we faced, the lessons we learned, and how we built a scalable, multi-account MSK platform using Infrastructure as Code.
Background: Why MSK?
Wehkamp's legacy Kafka infrastructure (established around 2014-2015) ran on EC2 instances. While it served us well initially, we faced several challenges:
- Scaling complexity: Manual broker management across multiple business units
- Operational overhead: Patching, ZooKeeper management, monitoring setup
- Stability concerns: Resource contention and performance issues
- Multi-account sprawl: Difficult to maintain consistency across environments
Our Migration Goals
We set out to:
- Reduce operational burden on the platform team
- Improve reliability with AWS-managed infrastructure
- Standardize Kafka deployments across all AWS accounts
- Enable faster scaling and easier maintenance
- Free up engineering time for higher-value work
Infrastructure as Code: Multi-Account MSK with Terraform
At Wehkamp, we managed MSK clusters across multiple AWS accounts representing different business units and environments (dev, staging, production).
Here's how we structured our Terraform setup:
Repository Structure
terraform/
├── modules/
│   └── msk-cluster/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       └── broker-config.tf
├── accounts/
│   ├── bu1-prod/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── msk.tf
│   ├── bu1-dev/
│   │   └── ...
│   ├── bu2-prod/
│   │   └── ...
│   └── bu2-dev/
│       └── ...
└── aws-config/
    └── credentials
The MSK Module Pattern
We created a reusable MSK module that handled all the complexity:
# modules/msk-cluster/main.tf
resource "aws_msk_cluster" "main" {
cluster_name = var.cluster_name
kafka_version = var.kafka_version
number_of_broker_nodes = var.broker_count
broker_node_group_info {
instance_type = var.instance_type
client_subnets = var.subnet_ids
security_groups = [aws_security_group.msk.id]
storage_info {
ebs_storage_info {
volume_size = var.storage_size_gb
}
}
}
configuration_info {
arn = aws_msk_configuration.main.arn
revision = aws_msk_configuration.main.latest_revision
}
encryption_info {
encryption_in_transit {
client_broker = "TLS"
in_cluster = true
}
}
tags = merge(var.common_tags, {
Environment = var.environment
ManagedBy = "Terraform"
})
}
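The main.tf above references aws_msk_configuration.main (defined in broker-config.tf, per the repository layout). A minimal sketch of what that configuration side can look like - the server properties shown here are illustrative defaults, not our exact production values:
# modules/msk-cluster/broker-config.tf (sketch)
resource "aws_msk_configuration" "main" {
  name           = "${var.cluster_name}-config"
  kafka_versions = [var.kafka_version]

  # Illustrative broker defaults - adjust to your own workloads
  server_properties = <<-PROPERTIES
    auto.create.topics.enable=false
    default.replication.factor=3
    min.insync.replicas=2
    num.partitions=3
  PROPERTIES
}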
Calling the Module Per Account
Each AWS account directory called the module with environment-specific values:
# accounts/bu1-prod/msk.tf
module "msk_cluster" {
source = "../../modules/msk-cluster"
cluster_name = "bu1-prod-msk"
kafka_version = "2.8.1"
broker_count = 3
instance_type = "kafka.m5.large"
storage_size_gb = 1000
subnet_ids = data.aws_subnets.private.ids
environment = "production"
common_tags = {
BusinessUnit = "BU1"
CostCenter = "engineering"
}
}
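The subnet_ids value comes from a data source lookup in the account's own configuration. Roughly like this - the VPC lookup and the tag names are assumptions for illustration, not taken from our actual setup:
# accounts/bu1-prod/main.tf (sketch - VPC name and tag values are illustrative)
data "aws_vpc" "main" {
  tags = {
    Name = "bu1-prod"
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }

  tags = {
    Tier = "private"
  }
}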
Multi-Account Authentication
Each AWS account directory contained its own variable files and configuration:
terraform/
├── accounts/
│   ├── bu1-prod/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars   # Variables for this account
│   │   └── msk.tf
│   ├── bu1-dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars   # Variables for this account
│   │   └── msk.tf
We used AWS credential profiles, and Terraform automatically picked up the tfvars file in each directory:
# Deploying to specific account
cd terraform/accounts/bu1-prod
terraform init
terraform plan
terraform apply
Terraform automatically used:
- The terraform.tfvars file in the current directory
- The appropriate AWS credential profile configured in aws-config
- Module references pointing to ../../modules/msk-cluster
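One way this wiring typically looks is a per-account provider block pointing at a named profile - a sketch only; the region, profile name, and credentials path here are assumptions:
# accounts/bu1-prod/main.tf (sketch - profile and region are illustrative)
provider "aws" {
  region                   = "eu-west-1"
  profile                  = "bu1-prod"
  shared_credentials_files = ["../../aws-config/credentials"]
}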
The Initial Setup
When provisioning our MSK clusters, we followed the AWS MSK best practices guide (https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html) and calculated our requirements based on:
- Expected throughput (MB/s)
- Number of partitions
- Retention policies
- Client connections
Based on these calculations, we initially deployed with kafka.m5.large instances.
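As a rough illustration of the storage part of that arithmetic (all figures below are invented for the example, not our actual workload numbers): sustained ingress times the replication factor times the retention window gives the raw disk requirement, which is then split across brokers and padded with headroom.
# Back-of-the-envelope storage sizing (illustrative numbers only)
locals {
  ingress_mb_per_s   = 10 # sustained producer throughput
  replication_factor = 3
  retention_days     = 7
  broker_count       = 3

  # MB/s x replication x retention (in seconds), converted to GB
  cluster_storage_gb = ceil(
    local.ingress_mb_per_s * local.replication_factor * local.retention_days * 86400 / 1024
  )

  # Per-broker EBS volume size before adding growth headroom
  per_broker_storage_gb = ceil(local.cluster_storage_gb / local.broker_count)
}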
Problem: Capacity Exceeded
Shortly after production deployment, we hit a critical issue: our partition count exceeded what the instance type could accommodate. MSK has specific limits per broker instance type:
| Instance Type | Max Partitions per Broker |
|------------------|---------------------------|
| kafka.m5.large | ~1000 partitions |
| kafka.m5.xlarge | ~2000 partitions |
| kafka.m5.2xlarge | ~4000 partitions |
With multiple business units creating topics organically, we quickly exceeded capacity, causing:
- Broker CPU spikes
- Increased replication lag
- Connection timeouts
Solution: Upgrade to kafka.m5.2xlarge
We upgraded the instance type via Terraform:
module "msk_cluster" {
source = "../../modules/msk-cluster"
# Changed from kafka.m5.large
instance_type = "kafka.m5.2xlarge"
# ... other config
}
terraform apply
# MSK performs rolling upgrade, no downtime
The upgrade worked smoothly - MSK performed a rolling update with zero downtime.
New Problem: Over-Provisioning & Cost
A few weeks later, our AWS cost reports flagged a significant increase. We had over-provisioned:
- Actual usage: ~1,500 partitions
- Provisioned capacity: kafka.m5.2xlarge (4,000 partitions)
- Cost impact: 2x more expensive than needed
We wanted to downgrade to kafka.m5.xlarge (the Goldilocks size), but discovered a critical MSK limitation:
⚠️ MSK does NOT support downgrading instance types through the console or API.
You can only upgrade, never downgrade.
The Workaround: AWS Support Case
We raised an AWS Support case requesting manual downgrade assistance. Here's what we learned:
AWS Support Options:
- Recommended approach: Create new cluster with correct size, migrate topics
  - Pros: Clean slate, proper sizing
  - Cons: Complex migration, client reconfiguration
- Manual intervention (what we did): AWS engineers performed backend downgrade
  - Pros: Faster, no client changes
  - Cons: Requires support case, not guaranteed for all scenarios
The AWS support team successfully downgraded our cluster, but the process took 48-72 hours and required careful planning during a maintenance window.
Lessons Learned
- Start conservatively, but not too conservatively
  - Use monitoring data to right-size within the first 30 days
  - Build in 30-40% headroom for growth
- Monitor partition count actively
  - Set CloudWatch alarms for partition count thresholds
  - Implement topic creation governance (approval process)
- MSK instance type changes are one-way
  - You can upgrade easily, but downgrades require AWS support
  - Plan sizing carefully to avoid cost optimization headaches
- Consider kafka.m5.xlarge as the default starting point (see the sketch after this list)
  - Good balance of capacity and cost
  - Enough headroom for most workloads
- Implement topic quotas
  - Prevent unbounded topic/partition growth
  - Use Kafka quotas or approval workflows
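One small way to bake the sizing default into the module itself is through the variable definition - a minimal sketch of what that could look like in variables.tf; the description wording is ours, not from the original module:
# modules/msk-cluster/variables.tf (sketch)
variable "instance_type" {
  description = "MSK broker instance type. Upgrades are online; downgrades require an AWS Support case."
  type        = string
  default     = "kafka.m5.xlarge"
}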
Cost Optimization
After this experience, we implemented partition count monitoring:
# CloudWatch alarm for partition count
resource "aws_cloudwatch_metric_alarm" "partition_count" {
alarm_name = "msk-partition-count-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "PartitionCount"
namespace = "AWS/Kafka"
period = 300
statistic = "Average"
threshold = 1500 # 75% of kafka.m5.xlarge capacity
alarm_description = "MSK partition count approaching instance type limit"
dimensions = {
"Cluster Name" = aws_msk_cluster.main.cluster_name
}
}
Estimated cost savings: ~$500/month per cluster by right-sizing.
Operational Challenge #2: Production Incident - MSK Disk Space Crisis
Our Monitoring Setup
At Wehkamp, we implemented a tiered alerting strategy:
- Informational alerts → Slack channel #aws-platform-alerts
- Critical alerts → Immediate action required
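To make the disk-space angle concrete, a per-broker alarm on the KafkaDataLogsDiskUsed metric is the kind of critical alert that applies here. A minimal sketch - the 80% threshold and the hard-coded broker IDs are assumptions, not our actual configuration:
# Critical alert: broker data volume filling up (sketch - threshold and broker IDs are illustrative)
resource "aws_cloudwatch_metric_alarm" "disk_usage" {
  for_each = toset(["1", "2", "3"]) # one alarm per broker ID

  alarm_name          = "msk-disk-usage-broker-${each.key}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "KafkaDataLogsDiskUsed"
  namespace           = "AWS/Kafka"
  period              = 300
  statistic           = "Average"
  threshold           = 80 # percent of the broker's EBS data volume
  alarm_description   = "MSK broker data volume above 80% - check retention and throughput"

  dimensions = {
    "Cluster Name" = aws_msk_cluster.main.cluster_name
    "Broker ID"    = each.key
  }
}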
The Incident Timeline
