Two years ago our team inherited Atlas — a B2B SaaS monolith with ~680,000 lines of Rails code, a decade of accumulated complexity, and a deployment pipeline that caused anxiety every Friday afternoon.
The symptoms were familiar:
- Single PostgreSQL instance handling transactional and reporting load
- In-process custom thread pool for background jobs
- Singleton service objects maintaining application-level state
- Session data stored in application memory behind a sticky-session load balancer
- Deployments caused ~4 minutes of session loss for active users
Our goal wasn't a big-bang rewrite. It was a disciplined, iterative extraction — shipping features throughout the process while incrementally achieving statelessness.
Phase 1 — Audit: Finding State Where It Shouldn't Be
Before touching code, we needed a full inventory of where state lived. This sounds simple. It isn't.
Identifying Stateful Singletons
The most dangerous pattern we found: service objects instantiated once at boot and shared across all requests.
```ruby
# config/initializers/rate_limiter.rb
RateLimiter = RateLimiterService.new(
  store: {}, # mutable hash in process memory
  window: 60,
  limit: 100
)
```
RateLimiterService held a mutable store hash accumulating per-user request counts. Under a single process: fine. Under autoscaling with multiple pods: each pod maintained independent counters, making rate limiting effectively broken.
We audited the entire codebase and found 23 such singletons across four categories:
| Category | Count | Problem |
|---|---|---|
| Rate limiters / throttles | 6 | Per-process counters — ineffective at scale |
| Feature flag caches | 4 | Stale state — flags wouldn't propagate across pods |
| Third-party client pools | 8 | Connection state not shared, over-connecting |
| Tenant config aggregators | 5 | Mutations in one request leaked into others |
The fix for every category followed the same principle: move state out of process memory into a shared external store.
```ruby
# Before — process-local hash
RateLimiter = RateLimiterService.new(store: {})

# After — Redis-backed, shared across all pods
RateLimiter = RateLimiterService.new(
  store: RedisStore.new(
    client: Redis.current,
    namespace: "rate_limiter"
  )
)
```
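The article never shows `RedisStore` itself. Below is a minimal fixed-window sketch in Python (the language of the extracted services) of the same idea: counters live in a shared store keyed by user and time window, so every pod sees the same counts. The class names, key format, and `InMemoryStub` are illustrative, not Atlas code; the injected client only needs `incr` and `expire`, matching redis-py.

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter backed by a shared store (e.g. Redis).

    The client must expose incr(key) and expire(key, seconds), as
    redis-py does. Because the counters live outside process memory,
    every pod enforces the same limit.
    """

    def __init__(self, client, window=60, limit=100):
        self.client = client
        self.window = window
        self.limit = limit

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        window_id = int(now) // self.window
        key = f"rate_limiter:{user_id}:{window_id}"
        count = self.client.incr(key)
        if count == 1:
            # First hit in this window: set a TTL so keys clean themselves up
            self.client.expire(key, self.window)
        return count <= self.limit

class InMemoryStub:
    """Single-process stand-in for redis-py, for illustration only."""
    def __init__(self):
        self.data = {}
    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]
    def expire(self, key, seconds):
        pass  # TTL omitted in the stub

limiter = FixedWindowLimiter(InMemoryStub(), window=60, limit=3)
allowed = [limiter.allow("user-1", now=1000.0) for _ in range(5)]
# The first `limit` requests in a window pass; the rest are throttled
```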
The Sticky Session Problem
Atlas used Rails' default :cookie_store — but with a twist. A developer years earlier needed to store non-serializable ActiveRecord objects in the session. Rather than fix serialization, they mirrored part of the session server-side in a process-level hash, then configured the load balancer with IP-based sticky sessions to make sure users always hit the same pod.
The consequences:
- Horizontal scaling didn't distribute load evenly
- Pod restarts caused immediate session loss
- Rolling deployments silently logged out active users
Fix part 1 — Migrate session storage to Redis:
```ruby
# Gemfile
gem 'redis-session-store'

# config/initializers/session_store.rb
Rails.application.config.session_store :redis_session_store,
  key: '_atlas_session',
  redis: {
    expire_after: 2.hours,
    key_prefix: 'atlas:session:',
    client: Redis.current
  }
```
Fix part 2 — Eliminate non-serializable objects from session entirely:
```ruby
# Before — storing the full ActiveRecord object
session[:current_user] = current_user

# After — store only the primitive ID, hydrate on each request
session[:current_user_id] = current_user.id

# In ApplicationController
def current_user
  @current_user ||= User.find_by(id: session[:current_user_id])
end
```
After this change we removed sticky sessions from the load balancer entirely. Session data became a first-class shared resource, not a pod-local artifact. We also measured an immediate 18% reduction in p99 login latency after retiring the stale server-side session mirror and the database workarounds that had accumulated around it.
Phase 2 — Extraction: Breaking State Ownership
With state inventoried and moved to external stores, we began service extraction using a strangler fig pattern — new services intercepted traffic slice by slice while the monolith continued running.
Choosing What to Extract First
We scored candidates across three dimensions:
Score = (Domain Clarity × Async Viability) / Shared State Coupling
High score = good first candidate. Notifications won.
Notifications had:
- Clear bounded domain
- No need for synchronous response
- High enough volume to justify the infrastructure
- A clean, definable event contract
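The scoring rule above can be sketched directly. The candidate names come from the article, but the numeric ratings below are invented for illustration (we assume each dimension is rated 1-5 by the team):

```python
def extraction_score(domain_clarity, async_viability, shared_state_coupling):
    """Score = (Domain Clarity x Async Viability) / Shared State Coupling.

    Inputs are team-assigned ratings on an assumed 1-5 scale.
    Higher scores are better first extraction candidates.
    """
    return (domain_clarity * async_viability) / shared_state_coupling

# Illustrative ratings only -- not the actual Atlas scores
candidates = {
    "notifications": extraction_score(5, 5, 1),  # clear domain, fully async
    "billing": extraction_score(4, 2, 5),        # tangled shared state
    "reporting": extraction_score(3, 4, 3),
}
best = max(candidates, key=candidates.get)  # "notifications"
```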
The original monolith code was fully synchronous and blocking:
```ruby
def complete_order(order)
  order.finalize!
  NotificationMailer.order_confirmation(order).deliver_now # blocks thread
  WebhookService.deliver(:order_completed, order)          # blocks thread
  ActivityFeed.append(order.user, :order_completed, order) # blocks thread
end
```
If the email provider was slow, order completion was slow. If a webhook endpoint timed out, the user waited. Classic incidental coupling through synchrony.
Phase 3 — SQS for Async State Updates
We replaced the synchronous calls with event publishing to SQS. The monolith's job changed: finalize the order, emit a fact, return.
The Event Bus Abstraction
We built a thin EventBus wrapper around the AWS SDK to keep the calling code clean:
```ruby
# lib/event_bus.rb
class EventBus
  QUEUE_URLS = {
    "order.completed" => ENV["SQS_ORDER_EVENTS_URL"],
    "user.registered" => ENV["SQS_USER_EVENTS_URL"],
    "subscription.changed" => ENV["SQS_BILLING_EVENTS_URL"]
  }.freeze

  def self.publish(event_type, payload)
    url = QUEUE_URLS.fetch(event_type) do
      raise ArgumentError, "Unknown event type: #{event_type}"
    end

    sqs_client.send_message(
      queue_url: url,
      message_body: JSON.generate({
        event: event_type,
        payload: payload,
        timestamp: Time.now.utc.iso8601,
        version: "1.0"
      }),
      message_attributes: {
        "EventType" => {
          string_value: event_type,
          data_type: "String"
        }
      }
    )
  end

  def self.sqs_client
    @sqs_client ||= Aws::SQS::Client.new(region: ENV["AWS_REGION"])
  end
end
```
The refactored order completion:
```ruby
def complete_order(order)
  order.finalize!
  EventBus.publish("order.completed", {
    order_id: order.id,
    user_id: order.user_id,
    user_email: order.user.email,
    total: order.total.to_s,
    currency: order.currency,
    line_items: order.line_items.map { |li| { sku: li.sku, qty: li.qty } }
  })
end
```
The controller returns immediately. No email. No webhook. No feed update. All of those happen asynchronously, decoupled from the user's request thread.
The Notification Service Consumer
The extracted notification service is a standalone Python service (we used Python for new services — the team had strong Python ML tooling we wanted to leverage long-term). It polls SQS and processes events:
```python
# notification_service/consumer.py
import json
import logging

from handlers import order_handlers, user_handlers

logger = logging.getLogger(__name__)

HANDLER_MAP = {
    "order.completed": order_handlers.handle_order_completed,
    "user.registered": user_handlers.handle_user_registered,
    "subscription.changed": user_handlers.handle_subscription_changed,
}

def poll(queue_url: str, sqs_client):
    while True:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling — reduces empty receives
            MessageAttributeNames=["EventType"]
        )
        for message in response.get("Messages", []):
            process_message(message, queue_url, sqs_client)

def process_message(message: dict, queue_url: str, sqs_client):
    try:
        body = json.loads(message["Body"])
        event_type = body["event"]
        handler = HANDLER_MAP.get(event_type)
        if not handler:
            logger.warning(f"No handler registered for event: {event_type}")
        else:
            handler(body["payload"])
        # Only delete the message if processing succeeded
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"]
        )
    except Exception as exc:
        # Do NOT delete — message returns to queue after visibility timeout
        # SQS dead-letter queue (DLQ) handles repeated failures
        logger.error(f"Failed to process message {message['MessageId']}: {exc}")
```
```python
# notification_service/handlers/order_handlers.py
from clients.email_client import send_transactional_email
from clients.webhook_client import deliver_webhook
from clients.feed_client import append_feed_event

def handle_order_completed(payload: dict):
    send_transactional_email(
        template="order_confirmation",
        to=payload["user_email"],
        context={
            "order_id": payload["order_id"],
            "total": payload["total"],
            "currency": payload["currency"],
            "line_items": payload["line_items"]
        }
    )
    deliver_webhook(
        event="order.completed",
        payload=payload
    )
    append_feed_event(
        user_id=payload["user_id"],
        event_type="order_completed",
        metadata={"order_id": payload["order_id"]}
    )
```
SQS Configuration: Dead-Letter Queue
Idempotency and failure handling are non-negotiable in async systems. We configured a DLQ for every event queue:
```hcl
# terraform/sqs.tf
resource "aws_sqs_queue" "order_events" {
  name                       = "atlas-order-events"
  visibility_timeout_seconds = 30
  message_retention_seconds  = 86400 # 24 hours

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_events_dlq.arn
    maxReceiveCount     = 3 # 3 failed attempts → DLQ
  })
}

resource "aws_sqs_queue" "order_events_dlq" {
  name                      = "atlas-order-events-dlq"
  message_retention_seconds = 1209600 # 14 days — time to investigate
}
```
With maxReceiveCount = 3, a message that fails processing three times moves to the DLQ automatically. We set up CloudWatch alarms on DLQ depth so failures surface as alerts rather than silent data loss.
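The depth check behind those alarms can be sketched with a boto3-style SQS client: `GetQueueAttributes` with `ApproximateNumberOfMessages` is the standard way to read queue depth. This is our illustration, not Atlas code; `StubSQS` exists only so the example runs without AWS, and the threshold is an assumption.

```python
ALERT_THRESHOLD = 1  # any stranded DLQ message is a failed business event

def dlq_depth(sqs_client, queue_url):
    """Return the approximate message count for a queue.

    sqs_client is expected to behave like a boto3 SQS client.
    """
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

def should_alert(depth, threshold=ALERT_THRESHOLD):
    return depth >= threshold

class StubSQS:
    """Stands in for boto3 here so the sketch runs without AWS."""
    def get_queue_attributes(self, QueueUrl, AttributeNames):
        return {"Attributes": {"ApproximateNumberOfMessages": "4"}}

depth = dlq_depth(StubSQS(), "https://sqs.example/atlas-order-events-dlq")
```

In production the same function would run on a schedule (or be replaced by a native CloudWatch alarm on the queue metric) and page when `should_alert` fires.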
Phase 4 — Event-Driven Patterns Across Services
Once SQS was in place and the team had built muscle with it, we extended the pattern across the rest of the extraction effort. Three patterns proved especially valuable.
Pattern 1: Event Sourcing for Audit Trails
The billing domain previously had an updated_at column and a loosely maintained versions table. After extraction, every billing event became a first-class fact published to SQS and persisted in a dedicated event store:
```python
# billing_service/event_store.py
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("atlas-billing-events")

def record_event(aggregate_id: str, event_type: str, payload: dict):
    timestamp = datetime.now(timezone.utc).isoformat()
    table.put_item(Item={
        "pk": f"BILLING#{aggregate_id}",
        "sk": f"EVENT#{timestamp}",
        "event_type": event_type,
        "payload": payload,
        "version": get_next_version(aggregate_id)
    })

def get_billing_history(aggregate_id: str) -> list:
    response = table.query(
        KeyConditionExpression="pk = :pk AND begins_with(sk, :prefix)",
        ExpressionAttributeValues={
            ":pk": f"BILLING#{aggregate_id}",
            ":prefix": "EVENT#"
        },
        ScanIndexForward=True
    )
    return response["Items"]
```
This replaced what used to be a SELECT * FROM billing_audits WHERE account_id = ? query on the monolith's main database — which had become a serious bottleneck at reporting time.
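The event store above calls `get_next_version` without showing it. One plausible implementation, assuming the same table layout: query the newest event for the aggregate and add one. This is a sketch, not the Atlas code; it takes the table as a parameter purely to keep the example self-contained.

```python
from typing import Optional

def next_version(last_event: Optional[dict]) -> int:
    """First event is version 1; otherwise previous version + 1."""
    return 1 if last_event is None else int(last_event["version"]) + 1

def get_next_version(table, aggregate_id: str) -> int:
    # Fetch only the newest event for this aggregate (descending sort key).
    response = table.query(
        KeyConditionExpression="pk = :pk AND begins_with(sk, :prefix)",
        ExpressionAttributeValues={
            ":pk": f"BILLING#{aggregate_id}",
            ":prefix": "EVENT#",
        },
        ScanIndexForward=False,  # newest first
        Limit=1,
    )
    items = response["Items"]
    # Note: read-then-write versioning races under concurrent writers;
    # a conditional put keyed on (pk, version) would close that gap.
    return next_version(items[0] if items else None)
```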
Pattern 2: Saga for Distributed Transactions
Distributed systems don't have ACID transactions across service boundaries. When we extracted the subscription service, we had to handle the scenario: charge succeeds → subscription activation fails. We implemented a choreography-based saga using SQS + SNS:
```
charge.succeeded ──► [Subscription Service] ──► subscription.activated
                              │
                         (on failure)
                              │
                              ▼
                 subscription.activation_failed
                              │
                              ▼
                 [Billing Service] ──► charge.reversed
```
```python
# subscription_service/saga.py
def handle_charge_succeeded(payload: dict):
    account_id = payload["account_id"]
    plan_id = payload["plan_id"]
    try:
        activate_subscription(account_id, plan_id)
        publish_event("subscription.activated", {
            "account_id": account_id,
            "plan_id": plan_id,
            "charge_id": payload["charge_id"]
        })
    except ActivationError as exc:
        # Compensating transaction — tell billing to reverse the charge
        publish_event("subscription.activation_failed", {
            "account_id": account_id,
            "charge_id": payload["charge_id"],
            "reason": str(exc)
        })
```
```python
# billing_service/saga_handlers.py
def handle_activation_failed(payload: dict):
    charge_id = payload["charge_id"]
    reverse_charge(charge_id)
    publish_event("charge.reversed", {
        "charge_id": charge_id,
        "account_id": payload["account_id"],
        "reason": payload["reason"]
    })
```
The key insight: each service owns its own compensating logic. There's no central orchestrator that becomes a bottleneck or single point of failure.
Pattern 3: Read Model Projections (CQRS)
The reporting module was one of the last pieces consuming the monolith's main Postgres instance. We extracted it by building read-model projections — denormalized views rebuilt from events.
Each time an event arrived, a projection worker updated a read-optimized store:
```python
# reporting_service/projections/revenue_projection.py
import boto3

dynamodb = boto3.client("dynamodb")

def on_order_completed(event: dict):
    """
    Update the revenue projection whenever an order completes.
    Receives the full event envelope ({event, payload, timestamp, version})
    because the timestamp lives on the envelope, not the payload.
    This is eventually consistent — typically < 500ms behind real-time.
    """
    payload = event["payload"]
    date_key = event["timestamp"][:10]  # e.g. "2024-11-15"
    dynamodb.update_item(
        TableName="atlas-revenue-daily",
        Key={
            "date": {"S": date_key},
            "currency": {"S": payload["currency"]}
        },
        UpdateExpression="ADD total_cents :amount, order_count :one",
        ExpressionAttributeValues={
            ":amount": {"N": str(payload["total_cents"])},
            ":one": {"N": "1"}
        }
    )
```
The reporting API now reads from DynamoDB, never from Postgres. The main database load dropped by ~31% after this migration.
Results After 14 Months
| Metric | Before | After |
|---|---|---|
| Deployment time | 18 min (full downtime) | 3 min (rolling, zero downtime) |
| Horizontal scaling | Not possible | Auto-scales to 12 pods |
| p99 API latency | 1,840ms | 290ms |
| Session-related incidents | ~3/month | 0 in last 6 months |
| Primary DB CPU (peak) | 94% | 41% |
| MTTR for notification failures | ~2 hours | ~8 minutes (DLQ alert → fix) |
Lessons Worth Remembering
1. State audit before extraction.
Do not extract a service and discover later it secretly depends on shared mutable state. Build the inventory first. Every singleton, every in-memory store, every sticky-session dependency.
2. Make events self-contained.
An event payload should carry enough data that consumers never need to call back into the emitting service to process it. Avoid events like { "order_id": 123 } that force consumers to make a synchronous call to the monolith. Include the data the consumers need.
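One way to enforce that rule is a small publish-time guard. This sketch is ours, not Atlas code; the required-field set is copied from the order.completed payload shown earlier, and the function name is illustrative:

```python
REQUIRED_FIELDS = {
    "order.completed": {
        "order_id", "user_id", "user_email",
        "total", "currency", "line_items",
    },
}

def assert_self_contained(event_type, payload):
    """Fail fast at publish time if consumers would need a callback."""
    missing = REQUIRED_FIELDS.get(event_type, set()) - payload.keys()
    if missing:
        raise ValueError(f"{event_type} payload missing: {sorted(missing)}")
```

Calling this inside the event bus before `send_message` turns a subtle runtime coupling (a consumer quietly calling back into the monolith) into a loud publisher-side error.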
3. Idempotency is mandatory, not optional.
SQS delivers messages at least once. Consumers will see duplicates. Every handler must be safe to invoke multiple times with the same payload. Use a dedupe key:
```python
def handle_order_completed(payload: dict):
    dedupe_key = f"notification:order_completed:{payload['order_id']}"
    if cache.exists(dedupe_key):
        return  # Already processed — safe to skip
    # ... process ...
    # Note: this check-then-set is not atomic; two workers racing on the
    # same message can both pass the exists() check. That is acceptable
    # only because the handlers themselves are safe to invoke twice.
    cache.set(dedupe_key, "1", ex=86400)  # 24-hour TTL
```
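For a stricter variant, the check and the write can be collapsed into one atomic operation. In redis-py, `set(..., nx=True)` maps to Redis `SET NX` and returns a falsy value when the key already exists. This helper and the `FakeRedis` stub are our illustration, not Atlas code:

```python
def claim(cache, dedupe_key, ttl=86400):
    """Atomically claim a dedupe key; True only for the first worker.

    SET NX performs the existence check and the write in a single
    round trip, closing the check-then-set race.
    """
    return bool(cache.set(dedupe_key, "1", nx=True, ex=ttl))

class FakeRedis:
    """In-memory stand-in for redis-py, for illustration only."""
    def __init__(self):
        self.store = {}
    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None  # redis-py returns None when SET NX loses
        self.store[key] = value
        return True
```

The trade-off: claiming the key before processing means a crash mid-handler suppresses the retry, while claiming after processing reopens the race. Which side to err on is a per-handler decision.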
4. DLQs are your observability layer.
A growing DLQ is a canary. Alert on it aggressively. Every message in the DLQ represents a business event that did not complete — treat it with the same urgency as a failed database write.
5. Don't extract the hardest domain first.
Start with high-async-viability, low-shared-state domains (notifications, emails, audit logs). Build team confidence and tooling patterns before tackling the hard stuff (billing, auth, core domain logic).
The Code That Made the Biggest Difference
If you take one pattern from this article, take this one: the thin event bus wrapper with a strict schema contract.
```ruby
# The monolith's entire async surface area lives here.
# Every service interaction is an event.
# Every event has a version field.
# Schema changes are additive, never breaking.
EventBus.publish("order.completed", {
  schema_version: "1.2",
  order_id: order.id,
  # ... fields
})
```
Versioning events from day one saved us multiple breaking-change incidents. When the notification service needed a new field, we added it to the payload (non-breaking). When a field needed to change shape, we bumped the version and ran both handlers in parallel until consumers migrated.
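The run-both-handlers-in-parallel migration can be sketched as a dispatch table keyed on (event type, version). The handler names, the `calls` list, and the default-version fallback are illustrative, not Atlas code:

```python
calls = []  # records which handler version ran, for demonstration

def handle_order_completed_v1(payload):
    calls.append(("v1", payload["order_id"]))

def handle_order_completed_v2(payload):
    calls.append(("v2", payload["order_id"]))

# Both versions stay registered until every producer has migrated.
HANDLERS = {
    ("order.completed", "1.0"): handle_order_completed_v1,
    ("order.completed", "1.2"): handle_order_completed_v2,
}

def dispatch(body):
    # Envelopes published before versioning default to the original schema
    key = (body["event"], body.get("version", "1.0"))
    handler = HANDLERS.get(key)
    if handler is None:
        raise ValueError(f"No handler for {key}")
    handler(body["payload"])

dispatch({"event": "order.completed", "version": "1.2",
          "payload": {"order_id": 1}})
dispatch({"event": "order.completed",  # old envelope, no version field
          "payload": {"order_id": 2}})
```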
The migration is ongoing — we still have three bounded contexts inside the monolith. But the hardest part wasn't the code. It was building shared mental models across the team about what "stateless" actually means in a distributed system, and why the discipline of event contracts pays for itself a hundred times over.