Two years ago our team inherited Atlas — a B2B SaaS monolith with ~680,000 lines of Rails code, a decade of accumulated complexity, and a deployment pipeline that caused anxiety every Friday afternoon.
The symptoms were familiar:
- Single PostgreSQL instance handling transactional and reporting load
- In-process custom thread pool for background jobs
- Singleton service objects maintaining application-level state
- Session data stored in application memory behind a sticky-session load balancer
- Deployments caused ~4 minutes of session loss for active users
Our goal wasn't a big-bang rewrite. It was a disciplined, iterative extraction — shipping features throughout the process while incrementally achieving statelessness.
Phase 1 — Audit: Finding State Where It Shouldn't Be
Before touching code, we needed a full inventory of where state lived. This sounds simple. It isn't.
Identifying Stateful Singletons
The most dangerous pattern we found: service objects instantiated once at boot and shared across all requests.
```ruby
# config/initializers/rate_limiter.rb
RateLimiter = RateLimiterService.new(
  store: {}, # mutable hash in process memory
  window: 60,
  limit: 100
)
```
RateLimiterService held a mutable store hash accumulating per-user request counts. Under a single process: fine. Under autoscaling with multiple pods: each pod maintained independent counters, making rate limiting effectively broken.
We audited the entire codebase and found 23 such singletons across four categories:
| Category | Count | Problem |
|---|---|---|
| Rate limiters / throttles | 6 | Per-process counters — ineffective at scale |
| Feature flag caches | 4 | Stale state — flags wouldn't propagate across pods |
| Third-party client pools | 8 | Connection state not shared, over-connecting |
| Tenant config aggregators | 5 | Mutations in one request leaked into others |
The fix for every category followed the same principle: move state out of process memory into a shared external store.
```ruby
# Before — process-local hash
RateLimiter = RateLimiterService.new(store: {})

# After — Redis-backed, shared across all pods
RateLimiter = RateLimiterService.new(
  store: RedisStore.new(
    client: Redis.current,
    namespace: "rate_limiter"
  )
)
```
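The article never shows `RedisStore` itself. Below is a minimal fixed-window sketch in Python (the language of the extracted services) of the same idea: counters live in a shared store keyed by user and time window, so every pod sees the same counts. The class names, key format, and `InMemoryStub` are illustrative, not Atlas code; the injected client only needs `incr` and `expire`, matching redis-py.

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter backed by a shared store (e.g. Redis).

    The client must expose incr(key) and expire(key, seconds), as
    redis-py does. Because the counters live outside process memory,
    every pod enforces the same limit.
    """

    def __init__(self, client, window=60, limit=100):
        self.client = client
        self.window = window
        self.limit = limit

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        window_id = int(now) // self.window
        key = f"rate_limiter:{user_id}:{window_id}"
        count = self.client.incr(key)
        if count == 1:
            # First hit in this window: set a TTL so keys clean themselves up
            self.client.expire(key, self.window)
        return count <= self.limit

class InMemoryStub:
    """Single-process stand-in for redis-py, for illustration only."""
    def __init__(self):
        self.data = {}
    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]
    def expire(self, key, seconds):
        pass  # TTL omitted in the stub

limiter = FixedWindowLimiter(InMemoryStub(), window=60, limit=3)
allowed = [limiter.allow("user-1", now=1000.0) for _ in range(5)]
# The first `limit` requests in a window pass; the rest are throttled
```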
The Sticky Session Problem
Atlas used Rails' default :cookie_store — but with a twist. A developer years earlier needed to store non-serializable ActiveRecord objects in the session. Rather than fix serialization, they mirrored part of the session server-side in a process-level hash, then configured the load balancer with IP-based sticky sessions to make sure users always hit the same pod.
The consequences:
- Horizontal scaling didn't distribute load evenly
- Pod restarts caused immediate session loss
- Rolling deployments silently logged out active users
Fix part 1 — Migrate session storage to Redis:
```ruby
# Gemfile
gem 'redis-session-store'

# config/initializers/session_store.rb
Rails.application.config.session_store :redis_session_store,
  key: '_atlas_session',
  redis: {
    expire_after: 2.hours,
    key_prefix: 'atlas:session:',
    client: Redis.current
  }
```
Fix part 2 — Eliminate non-serializable objects from session entirely:
```ruby
# Before — storing the full ActiveRecord object
session[:current_user] = current_user

# After — store only the primitive ID, hydrate on each request
session[:current_user_id] = current_user.id

# In ApplicationController
def current_user
  @current_user ||= User.find_by(id: session[:current_user_id])
end
```
After this change we removed sticky sessions from the load balancer entirely. Session data became a first-class shared resource, not a pod-local artifact. We also measured an immediate 18% reduction in p99 login latency after retiring the stale server-side session mirror and the database workarounds that had accumulated around it.
Phase 2 — Extraction: Breaking State Ownership
With state inventoried and moved to external stores, we began service extraction using a strangler fig pattern — new services intercepted traffic slice by slice while the monolith continued running.
Choosing What to Extract First
We scored candidates across three dimensions:
Score = (Domain Clarity × Async Viability) / Shared State Coupling
High score = good first candidate. Notifications won.
Notifications had:
- Clear bounded domain
- No need for synchronous response
- High enough volume to justify the infrastructure
- A clean, definable event contract
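The scoring rule above can be sketched directly. The candidate names come from the article, but the numeric ratings below are invented for illustration (we assume each dimension is rated 1-5 by the team):

```python
def extraction_score(domain_clarity, async_viability, shared_state_coupling):
    """Score = (Domain Clarity x Async Viability) / Shared State Coupling.

    Inputs are team-assigned ratings on an assumed 1-5 scale.
    Higher scores are better first extraction candidates.
    """
    return (domain_clarity * async_viability) / shared_state_coupling

# Illustrative ratings only -- not the actual Atlas scores
candidates = {
    "notifications": extraction_score(5, 5, 1),  # clear domain, fully async
    "billing": extraction_score(4, 2, 5),        # tangled shared state
    "reporting": extraction_score(3, 4, 3),
}
best = max(candidates, key=candidates.get)  # "notifications"
```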
The original monolith code was fully synchronous and blocking:
```ruby
def complete_order(order)
  order.finalize!
  NotificationMailer.order_confirmation(order).deliver_now # blocks thread
  WebhookService.deliver(:order_completed, order)          # blocks thread
  ActivityFeed.append(order.user, :order_completed, order) # blocks thread
end
```
If the email provider was slow, order completion was slow. If a webhook endpoint timed out, the user waited. Classic incidental coupling through synchrony.
Phase 3 — SQS for Async State Updates
We replaced the synchronous calls with event publishing to SQS. The monolith's job changed: finalize the order, emit a fact, return.
The Event Bus Abstraction
We built a thin EventBus wrapper around the AWS SDK to keep the calling code clean:
```ruby
# lib/event_bus.rb
class EventBus
  QUEUE_URLS = {
    "order.completed" => ENV["SQS_ORDER_EVENTS_URL"],
    "user.registered" => ENV["SQS_USER_EVENTS_URL"],
    "subscription.changed" => ENV["SQS_BILLING_EVENTS_URL"]
  }.freeze

  def self.publish(event_type, payload)
    url = QUEUE_URLS.fetch(event_type) do
      raise ArgumentError, "Unknown event type: #{event_type}"
    end

    sqs_client.send_message(
      queue_url: url,
      message_body: JSON.generate({
        event: event_type,
        payload: payload,
        timestamp: Time.now.utc.iso8601,
        version: "1.0"
      }),
      message_attributes: {
        "EventType" => {
          string_value: event_type,
          data_type: "String"
        }
      }
    )
  end

  def self.sqs_client
    @sqs_client ||= Aws::SQS::Client.new(region: ENV["AWS_REGION"])
  end
end
```
The refactored order completion:
```ruby
def complete_order(order)
  order.finalize!
  EventBus.publish("order.completed", {
    order_id: order.id,
    user_id: order.user_id,
    user_email: order.user.email,
    total: order.total.to_s,
    currency: order.currency,
    line_items: order.line_items.map { |li| { sku: li.sku, qty: li.qty } }
  })
end
```
The controller returns immediately. No email. No webhook. No feed update. All of those happen asynchronously, decoupled from the user's request thread.
The Notification Service Consumer
The extracted notification service is a standalone Python service (we used Python for new services — the team had strong Python ML tooling we wanted to leverage long-term). It polls SQS and processes events:
```python
# notification_service/consumer.py
import json
import logging

from handlers import order_handlers, user_handlers

logger = logging.getLogger(__name__)

HANDLER_MAP = {
    "order.completed": order_handlers.handle_order_completed,
    "user.registered": user_handlers.handle_user_registered,
    "subscription.changed": user_handlers.handle_subscription_changed,
}

def poll(queue_url: str, sqs_client):
    while True:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling — reduces empty receives
            MessageAttributeNames=["EventType"]
        )
        for message in response.get("Messages", []):
            process_message(message, queue_url, sqs_client)

def process_message(message: dict, queue_url: str, sqs_client):
    try:
        body = json.loads(message["Body"])
        event_type = body["event"]
        handler = HANDLER_MAP.get(event_type)
        if not handler:
            logger.warning(f"No handler registered for event: {event_type}")
        else:
            handler(body["payload"])
        # Only delete the message if processing succeeded
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"]
        )
    except Exception as exc:
        # Do NOT delete — message returns to queue after visibility timeout
        # SQS dead-letter queue (DLQ) handles repeated failures
        logger.error(f"Failed to process message {message['MessageId']}: {exc}")
```
```python
# notification_service/handlers/order_handlers.py
from clients.email_client import send_transactional_email
from clients.webhook_client import deliver_webhook
from clients.feed_client import append_feed_event

def handle_order_completed(payload: dict):
    send_transactional_email(
        template="order_confirmation",
        to=payload["user_email"],
        context={
            "order_id": payload["order_id"],
            "total": payload["total"],
            "currency": payload["currency"],
            "line_items": payload["line_items"]
        }
    )
    deliver_webhook(
        event="order.completed",
        payload=payload
    )
    append_feed_event(
        user_id=payload["user_id"],
        event_type="order_completed",
        metadata={"order_id": payload["order_id"]}
    )
```
SQS Configuration: Dead-Letter Queue
Idempotency and failure handling are non-negotiable in async systems. We configured a DLQ for every event queue:
```hcl
# terraform/sqs.tf
resource "aws_sqs_queue" "order_events" {
  name                       = "atlas-order-events"
  visibility_timeout_seconds = 30
  message_retention_seconds  = 86400 # 24 hours

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_events_dlq.arn
    maxReceiveCount     = 3 # 3 failed attempts → DLQ
  })
}

resource "aws_sqs_queue" "order_events_dlq" {
  name                      = "atlas-order-events-dlq"
  message_retention_seconds = 1209600 # 14 days — time to investigate
}
```
With maxReceiveCount = 3, a message that fails processing three times moves to the DLQ automatically. We set up CloudWatch alarms on DLQ depth so failures surface as alerts rather than silent data loss.
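The depth check behind those alarms can be sketched with a boto3-style SQS client: `GetQueueAttributes` with `ApproximateNumberOfMessages` is the standard way to read queue depth. This is our illustration, not Atlas code; `StubSQS` exists only so the example runs without AWS, and the threshold is an assumption.

```python
ALERT_THRESHOLD = 1  # any stranded DLQ message is a failed business event

def dlq_depth(sqs_client, queue_url):
    """Return the approximate message count for a queue.

    sqs_client is expected to behave like a boto3 SQS client.
    """
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def should_alert(depth, threshold=ALERT_THRESHOLD):
    return depth >= threshold

class StubSQS:
    """Stands in for boto3 here so the sketch runs without AWS."""
    def get_queue_attributes(self, QueueUrl, AttributeNames):
        return {"Attributes": {"ApproximateNumberOfMessages": "4"}}

depth = dlq_depth(StubSQS(), "https://sqs.example/atlas-order-events-dlq")
```

In production the same function would run on a schedule (or be replaced by a native CloudWatch alarm on the queue metric) and page when `should_alert` fires.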
Phase 4 — Event-Driven Patterns Across Services
Once SQS was in place and the team had built muscle with it, we extended the pattern across the rest of the extraction effort. Three patterns proved especially valuable.
Pattern 1: Event Sourcing for Audit Trails
The billing domain previously had an updated_at column and a loosely maintained versions table. After extraction, every billing event became a first-class fact published to SQS and persisted in a dedicated event store:
```python
# billing_service/event_store.py
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("atlas-billing-events")

def record_event(aggregate_id: str, event_type: str, payload: dict):
    timestamp = datetime.now(timezone.utc).isoformat()
    table.put_item(Item={
        "pk": f"BILLING#{aggregate_id}",
        "sk": f"EVENT#{timestamp}",
        "event_type": event_type,
        "payload": payload,
        "version": get_next_version(aggregate_id)
    })

def get_billing_history(aggregate_id: str) -> list:
    response = table.query(
        KeyConditionExpression="pk = :pk AND begins_with(sk, :prefix)",
        ExpressionAttributeValues={
            ":pk": f"BILLING#{aggregate_id}",
            ":prefix": "EVENT#"
        },
        ScanIndexForward=True
    )
    return response["Items"]
```
This replaced what used to be a SELECT * FROM billing_audits WHERE account_id = ? query on the monolith's main database — which had become a serious bottleneck at reporting time.
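The event store above calls `get_next_version` without showing it. One plausible implementation, assuming the same table layout: query the newest event for the aggregate and add one. This is a sketch, not the Atlas code; it takes the table as a parameter purely to keep the example self-contained.

```python
from typing import Optional

def next_version(last_event: Optional[dict]) -> int:
    """First event is version 1; otherwise previous version + 1."""
    return 1 if last_event is None else int(last_event["version"]) + 1

def get_next_version(table, aggregate_id: str) -> int:
    # Fetch only the newest event for this aggregate (descending sort key).
    response = table.query(
        KeyConditionExpression="pk = :pk AND begins_with(sk, :prefix)",
        ExpressionAttributeValues={
            ":pk": f"BILLING#{aggregate_id}",
            ":prefix": "EVENT#",
        },
        ScanIndexForward=False,  # newest first
        Limit=1,
    )
    items = response["Items"]
    # Note: read-then-write versioning races under concurrent writers;
    # a conditional put keyed on (pk, version) would close that gap.
    return next_version(items[0] if items else None)
```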
Pattern 2: Saga for Distributed Transactions
Distributed systems don't have ACID transactions across service boundaries. When we extracted the subscription service, we had to handle the scenario: charge succeeds → subscription activation fails. We implemented a choreography-based saga using SQS + SNS:
```
charge.succeeded ──► [Subscription Service] ──► subscription.activated
                              │
                         (on failure)
                              │
                              ▼
                 subscription.activation_failed
                              │
                              ▼
                 [Billing Service] ──► charge.reversed
```
```python
# subscription_service/saga.py
def handle_charge_succeeded(payload: dict):
    account_id = payload["account_id"]
    plan_id = payload["plan_id"]
    try:
        activate_subscription(account_id, plan_id)
        publish_event("subscription.activated", {
            "account_id": account_id,
            "plan_id": plan_id,
            "charge_id": payload["charge_id"]
        })
    except ActivationError as exc:
        # Compensating transaction — tell billing to reverse the charge
        publish_event("subscription.activation_failed", {
            "account_id": account_id,
            "charge_id": payload["charge_id"],
            "reason": str(exc)
        })
```
```python
# billing_service/saga_handlers.py
def handle_activation_failed(payload: dict):
    charge_id = payload["charge_id"]
    reverse_charge(charge_id)
    publish_event("charge.reversed", {
        "charge_id": charge_id,
        "account_id": payload["account_id"],
        "reason": payload["reason"]
    })
```
The key insight: each service owns its own compensating logic. There's no central orchestrator that becomes a bottleneck or single point of failure.
Pattern 3: Read Model Projections (CQRS)
The reporting module was one of the last pieces consuming the monolith's main Postgres instance. We extracted it by building read-model projections — denormalized views rebuilt from events.
Each time an event arrived, a projection worker updated a read-optimized store:
```python
# reporting_service/projections/revenue_projection.py
import boto3

dynamodb = boto3.client("dynamodb")

def on_order_completed(event: dict):
    """
    Update the revenue projection whenever an order completes.
    Receives the full event envelope ({event, payload, timestamp, version})
    because the timestamp lives on the envelope, not the payload.
    This is eventually consistent — typically < 500ms behind real-time.
    """
    payload = event["payload"]
    date_key = event["timestamp"][:10]  # e.g. "2024-11-15"
    dynamodb.update_item(
        TableName="atlas-revenue-daily",
        Key={
            "date": {"S": date_key},
            "currency": {"S": payload["currency"]}
        },
        UpdateExpression="ADD total_cents :amount, order_count :one",
        ExpressionAttributeValues={
            ":amount": {"N": str(payload["total_cents"])},
            ":one": {"N": "1"}
        }
    )
```
The reporting API now reads from DynamoDB, never from Postgres. The main database load dropped by ~31% after this migration.
Results After 14 Months
| Metric | Before | After |
|---|---|---|
| Deployment time | 18 min (full downtime) | 3 min (rolling, zero downtime) |
| Horizontal scaling | Not possible | Auto-scales to 12 pods |
| p99 API latency | 1,840ms | 290ms |
| Session-related incidents | ~3/month | 0 in last 6 months |
| Primary DB CPU (peak) | 94% | 41% |
| MTTR for notification failures | ~2 hours | ~8 minutes (DLQ alert → fix) |
Lessons Worth Remembering
1. State audit before extraction.
Do not extract a service and discover later it secretly depends on shared mutable state. Build the inventory first. Every singleton, every in-memory store, every sticky-session dependency.
2. Make events self-contained.
An event payload should carry enough data that consumers never need to call back into the emitting service to process it. Avoid events like { "order_id": 123 } that force consumers to make a synchronous call to the monolith. Include the data the consumers need.
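One way to enforce that rule is a small publish-time guard. This sketch is ours, not Atlas code; the required-field set is copied from the order.completed payload shown earlier, and the function name is illustrative:

```python
REQUIRED_FIELDS = {
    "order.completed": {
        "order_id", "user_id", "user_email",
        "total", "currency", "line_items",
    },
}

def assert_self_contained(event_type, payload):
    """Fail fast at publish time if consumers would need a callback."""
    missing = REQUIRED_FIELDS.get(event_type, set()) - payload.keys()
    if missing:
        raise ValueError(f"{event_type} payload missing: {sorted(missing)}")
```

Calling this inside the event bus before `send_message` turns a subtle runtime coupling (a consumer quietly calling back into the monolith) into a loud publisher-side error.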
3. Idempotency is mandatory, not optional.
SQS delivers messages at least once. Consumers will see duplicates. Every handler must be safe to invoke multiple times with the same payload. Use a dedupe key:
```python
def handle_order_completed(payload: dict):
    dedupe_key = f"notification:order_completed:{payload['order_id']}"
    if cache.exists(dedupe_key):
        return  # Already processed — safe to skip
    # ... process ...
    # Note: this check-then-set is not atomic; two workers racing on the
    # same message can both pass the exists() check. That is acceptable
    # only because the handlers themselves are safe to invoke twice.
    cache.set(dedupe_key, "1", ex=86400)  # 24-hour TTL
```
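For a stricter variant, the check and the write can be collapsed into one atomic operation. In redis-py, `set(..., nx=True)` maps to Redis `SET NX` and returns a falsy value when the key already exists. This helper and the `FakeRedis` stub are our illustration, not Atlas code:

```python
def claim(cache, dedupe_key, ttl=86400):
    """Atomically claim a dedupe key; True only for the first worker.

    SET NX performs the existence check and the write in a single
    round trip, closing the check-then-set race.
    """
    return bool(cache.set(dedupe_key, "1", nx=True, ex=ttl))

class FakeRedis:
    """In-memory stand-in for redis-py, for illustration only."""
    def __init__(self):
        self.store = {}
    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None  # redis-py returns None when SET NX loses
        self.store[key] = value
        return True
```

The trade-off: claiming the key before processing means a crash mid-handler suppresses the retry, while claiming after processing reopens the race. Which side to err on is a per-handler decision.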
4. DLQs are your observability layer.
A growing DLQ is a canary. Alert on it aggressively. Every message in the DLQ represents a business event that did not complete — treat it with the same urgency as a failed database write.
5. Don't extract the hardest domain first.
Start with high-async-viability, low-shared-state domains (notifications, emails, audit logs). Build team confidence and tooling patterns before tackling the hard stuff (billing, auth, core domain logic).
The Code That Made the Biggest Difference
If you take one pattern from this article, take this one: the thin event bus wrapper with a strict schema contract.
```ruby
# The monolith's entire async surface area lives here.
# Every service interaction is an event.
# Every event has a version field.
# Schema changes are additive, never breaking.
EventBus.publish("order.completed", {
  schema_version: "1.2",
  order_id: order.id,
  # ... fields
})
```
Versioning events from day one saved us multiple breaking-change incidents. When the notification service needed a new field, we added it to the payload (non-breaking). When a field needed to change shape, we bumped the version and ran both handlers in parallel until consumers migrated.
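The run-both-handlers-in-parallel migration can be sketched as a dispatch table keyed on (event type, version). The handler names, the `calls` list, and the default-version fallback are illustrative, not Atlas code:

```python
calls = []  # records which handler version ran, for demonstration

def handle_order_completed_v1(payload):
    calls.append(("v1", payload["order_id"]))

def handle_order_completed_v2(payload):
    calls.append(("v2", payload["order_id"]))

# Both versions stay registered until every producer has migrated.
HANDLERS = {
    ("order.completed", "1.0"): handle_order_completed_v1,
    ("order.completed", "1.2"): handle_order_completed_v2,
}

def dispatch(body):
    # Envelopes published before versioning default to the original schema
    key = (body["event"], body.get("version", "1.0"))
    handler = HANDLERS.get(key)
    if handler is None:
        raise ValueError(f"No handler for {key}")
    handler(body["payload"])

dispatch({"event": "order.completed", "version": "1.2",
          "payload": {"order_id": 1}})
dispatch({"event": "order.completed",  # old envelope, no version field
          "payload": {"order_id": 2}})
```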
The migration is ongoing — we still have three bounded contexts inside the monolith. But the hardest part wasn't the code. It was building shared mental models across the team about what "stateless" actually means in a distributed system, and why the discipline of event contracts pays for itself a hundred times over.