Eslam Genedy
From Monolith State to Stateless Microservices: A Real-World Refactor

Two years ago our team inherited Atlas — a B2B SaaS monolith with ~680,000 lines of Rails code, a decade of accumulated complexity, and a deployment pipeline that caused anxiety every Friday afternoon.

The symptoms were familiar:

  • Single PostgreSQL instance handling transactional and reporting load
  • In-process custom thread pool for background jobs
  • Singleton service objects maintaining application-level state
  • Session data stored in application memory behind a sticky-session load balancer
  • Deployments caused ~4 minutes of session loss for active users

Our goal wasn't a big-bang rewrite. It was a disciplined, iterative extraction — shipping features throughout the process while incrementally achieving statelessness.


Phase 1 — Audit: Finding State Where It Shouldn't Be

Before touching code, we needed a full inventory of where state lived. This sounds simple. It isn't.

Identifying Stateful Singletons

The most dangerous pattern we found: service objects instantiated once at boot and shared across all requests.

# config/initializers/rate_limiter.rb
RateLimiter = RateLimiterService.new(
  store: {},     # mutable hash in process memory
  window: 60,
  limit: 100
)

RateLimiterService held a mutable store hash accumulating per-user request counts. Under a single process: fine. Under autoscaling with multiple pods: each pod maintained independent counters, making rate limiting effectively broken.

We audited the entire codebase and found 23 such singletons across four categories:

| Category | Count | Problem |
|---|---|---|
| Rate limiters / throttles | 6 | Per-process counters — ineffective at scale |
| Feature flag caches | 4 | Stale state — flags wouldn't propagate across pods |
| Third-party client pools | 8 | Connection state not shared, over-connecting |
| Tenant config aggregators | 5 | Mutations in one request leaked into others |

The fix for every category followed the same principle: move state out of process memory into a shared external store.

# Before — process-local hash
RateLimiter = RateLimiterService.new(store: {})

# After — Redis-backed, shared across all pods
RateLimiter = RateLimiterService.new(
  store: RedisStore.new(
    client: Redis.current,
    namespace: "rate_limiter"
  )
)
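For illustration, the fixed-window counter that a store like RedisStore might implement can be sketched in a few lines. This is my sketch of the pattern, not the actual RateLimiterService internals; `client` stands in for any redis-py-compatible object exposing `incr`/`expire`:

```python
import time

def allow_request(client, user_id: str, limit: int = 100, window: int = 60) -> bool:
    """Fixed-window rate limit backed by a shared store.

    Because the counter lives in Redis rather than process memory,
    every pod increments and reads the same counts.
    """
    bucket = int(time.time()) // window        # current window number
    key = f"rate_limiter:{user_id}:{bucket}"

    count = client.incr(key)                   # atomic increment, shared across pods
    if count == 1:
        client.expire(key, window)             # first hit in the window sets the TTL

    return count <= limit
```

Per request this would be called as `allow_request(redis.Redis(), user.id)`; the bucket key rotates every `window` seconds, so stale counters simply expire.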

The Sticky Session Problem

Atlas used Rails' default :cookie_store — but with a twist. A developer years earlier needed to store non-serializable ActiveRecord objects in the session. Rather than fix serialization, they mirrored part of the session server-side in a process-level hash, then configured the load balancer with IP-based sticky sessions to make sure users always hit the same pod.

The consequences:

  • Horizontal scaling didn't distribute load evenly
  • Pod restarts caused immediate session loss
  • Rolling deployments silently logged out active users

Fix part 1 — Migrate session storage to Redis:

# Gemfile
gem 'redis-session-store'

# config/initializers/session_store.rb
Rails.application.config.session_store :redis_session_store,
  key: '_atlas_session',
  redis: {
    expire_after: 2.hours,
    key_prefix: 'atlas:session:',
    client: Redis.current
  }

Fix part 2 — Eliminate non-serializable objects from session entirely:

# Before — storing the full ActiveRecord object
session[:current_user] = current_user

# After — store only the primitive ID, hydrate on each request
session[:current_user_id] = current_user.id

# In ApplicationController
def current_user
  @current_user ||= User.find_by(id: session[:current_user_id])
end

After this change we removed sticky sessions from the load balancer entirely. Session data became a first-class shared resource, not a pod-local artifact. We also saw an immediate 18% reduction in p99 login latency once the stale in-memory session cache, and the extra database queries needed to work around it, were gone.


Phase 2 — Extraction: Breaking State Ownership

With state inventoried and moved to external stores, we began service extraction using a strangler fig pattern — new services intercepted traffic slice by slice while the monolith continued running.

Choosing What to Extract First

We scored candidates across three dimensions:

Score = (Domain Clarity × Async Viability) / Shared State Coupling

High score = good first candidate. Notifications won.

Notifications had:

  • Clear bounded domain
  • No need for synchronous response
  • High enough volume to justify the infrastructure
  • A clean, definable event contract
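As a rough illustration, the scoring formula above comes down to a few lines of Python. The 1–5 ratings here are hypothetical, not the team's actual numbers:

```python
def extraction_score(domain_clarity: float, async_viability: float,
                     shared_state_coupling: float) -> float:
    # Score = (Domain Clarity x Async Viability) / Shared State Coupling
    return (domain_clarity * async_viability) / shared_state_coupling

# Hypothetical ratings for three candidate domains
candidates = {
    "notifications": extraction_score(5, 5, 1),   # clear domain, fully async
    "reporting":     extraction_score(3, 4, 4),   # reads shared tables
    "billing":       extraction_score(4, 2, 5),   # sync flows, heavy coupling
}

best = max(candidates, key=candidates.get)        # → "notifications"
```

Low shared-state coupling in the denominator dominates: a domain that touches everyone else's data scores poorly no matter how clean its boundary looks.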

The original monolith code was fully synchronous and blocking:

def complete_order(order)
  order.finalize!
  NotificationMailer.order_confirmation(order).deliver_now  # blocks thread
  WebhookService.deliver(:order_completed, order)           # blocks thread
  ActivityFeed.append(order.user, :order_completed, order)  # blocks thread
end

If the email provider was slow, order completion was slow. If a webhook endpoint timed out, the user waited. Classic incidental coupling through synchrony.


Phase 3 — SQS for Async State Updates

We replaced the synchronous calls with event publishing to SQS. The monolith's job changed: finalize the order, emit a fact, return.

The Event Bus Abstraction

We built a thin EventBus wrapper around the AWS SDK to keep the calling code clean:

# lib/event_bus.rb
class EventBus
  QUEUE_URLS = {
    "order.completed"      => ENV["SQS_ORDER_EVENTS_URL"],
    "user.registered"      => ENV["SQS_USER_EVENTS_URL"],
    "subscription.changed" => ENV["SQS_BILLING_EVENTS_URL"]
  }.freeze

  def self.publish(event_type, payload)
    url = QUEUE_URLS.fetch(event_type) do
      raise ArgumentError, "Unknown event type: #{event_type}"
    end

    sqs_client.send_message(
      queue_url:    url,
      message_body: JSON.generate({
        event:     event_type,
        payload:   payload,
        timestamp: Time.now.utc.iso8601,
        version:   "1.0"
      }),
      message_attributes: {
        "EventType" => {
          string_value: event_type,
          data_type:    "String"
        }
      }
    )
  end

  def self.sqs_client
    @sqs_client ||= Aws::SQS::Client.new(region: ENV["AWS_REGION"])
  end
end

The refactored order completion:

def complete_order(order)
  order.finalize!

  EventBus.publish("order.completed", {
    order_id:   order.id,
    user_id:    order.user_id,
    user_email: order.user.email,
    total:      order.total.to_s,
    currency:   order.currency,
    line_items: order.line_items.map { |li| { sku: li.sku, qty: li.qty } }
  })
end

The controller returns immediately. No email. No webhook. No feed update. All of those happen asynchronously, decoupled from the user's request thread.


The Notification Service Consumer

The extracted notification service is a standalone Python service (we used Python for new services — the team had strong Python ML tooling we wanted to leverage long-term). It polls SQS and processes events:

# notification_service/consumer.py
import boto3, json, logging
from handlers import order_handlers, user_handlers

logger = logging.getLogger(__name__)

HANDLER_MAP = {
    "order.completed":      order_handlers.handle_order_completed,
    "user.registered":      user_handlers.handle_user_registered,
    "subscription.changed": user_handlers.handle_subscription_changed,
}

def poll(queue_url: str, sqs_client):
    while True:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,          # long polling — reduces empty receives
            MessageAttributeNames=["EventType"]
        )

        for message in response.get("Messages", []):
            process_message(message, queue_url, sqs_client)

def process_message(message: dict, queue_url: str, sqs_client):
    try:
        body       = json.loads(message["Body"])
        event_type = body["event"]
        handler    = HANDLER_MAP.get(event_type)

        if not handler:
            logger.warning(f"No handler registered for event: {event_type}")
        else:
            handler(body["payload"])

        # Delete on success. Unknown event types are deleted too, so they
        # don't recirculate as poison messages.
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"]
        )

    except Exception as exc:
        # Do NOT delete — message returns to queue after visibility timeout
        # SQS dead-letter queue (DLQ) handles repeated failures
        logger.error(f"Failed to process message {message['MessageId']}: {exc}")
# notification_service/handlers/order_handlers.py
from clients.email_client import send_transactional_email
from clients.webhook_client import deliver_webhook
from clients.feed_client import append_feed_event

def handle_order_completed(payload: dict):
    send_transactional_email(
        template="order_confirmation",
        to=payload["user_email"],
        context={
            "order_id":   payload["order_id"],
            "total":      payload["total"],
            "currency":   payload["currency"],
            "line_items": payload["line_items"]
        }
    )

    deliver_webhook(
        event="order.completed",
        payload=payload
    )

    append_feed_event(
        user_id=payload["user_id"],
        event_type="order_completed",
        metadata={"order_id": payload["order_id"]}
    )

SQS Configuration: Dead-Letter Queue

Idempotency and failure handling are non-negotiable in async systems. We configured a DLQ for every event queue:

# terraform/sqs.tf

resource "aws_sqs_queue" "order_events" {
  name                       = "atlas-order-events"
  visibility_timeout_seconds = 30
  message_retention_seconds  = 86400   # 24 hours

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_events_dlq.arn
    maxReceiveCount     = 3            # 3 failed attempts → DLQ
  })
}

resource "aws_sqs_queue" "order_events_dlq" {
  name                      = "atlas-order-events-dlq"
  message_retention_seconds = 1209600  # 14 days — time to investigate
}

With maxReceiveCount = 3, a message that fails processing three times moves to the DLQ automatically. We set up CloudWatch alarms on DLQ depth so failures surface as alerts rather than silent data loss.
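The article doesn't show the alarm itself; a minimal boto3 sketch of a DLQ-depth alarm might look like the following. The queue name and SNS topic ARN are placeholders, not Atlas's real resources:

```python
def dlq_alarm_params(queue_name: str, alerts_topic_arn: str) -> dict:
    """Alarm as soon as any message becomes visible in the DLQ."""
    return {
        "AlarmName":          f"{queue_name}-depth",
        "Namespace":          "AWS/SQS",
        "MetricName":         "ApproximateNumberOfMessagesVisible",
        "Dimensions":         [{"Name": "QueueName", "Value": queue_name}],
        "Statistic":          "Maximum",
        "Period":             60,                     # evaluate every minute
        "EvaluationPeriods":  1,
        "Threshold":          0,
        "ComparisonOperator": "GreaterThanThreshold", # depth > 0 triggers
        "AlarmActions":       [alerts_topic_arn],
    }

# Applying it (assumes AWS credentials and an existing SNS topic):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**dlq_alarm_params(
#     "atlas-order-events-dlq",
#     "arn:aws:sns:us-east-1:123456789012:atlas-alerts"))
```

A threshold of zero is deliberate: per the article's framing, any message in the DLQ is a business event that failed, so the alarm should fire on the first one.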


Phase 4 — Event-Driven Patterns Across Services

Once SQS was in place and the team had built muscle with it, we extended the pattern across the rest of the extraction effort. Three patterns proved especially valuable.

Pattern 1: Event Sourcing for Audit Trails

The billing domain previously had an updated_at column and a loosely maintained versions table. After extraction, every billing event became a first-class fact published to SQS and persisted in a dedicated event store:

# billing_service/event_store.py
import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
table    = dynamodb.Table("atlas-billing-events")

def record_event(aggregate_id: str, event_type: str, payload: dict):
    timestamp = datetime.now(timezone.utc).isoformat()

    table.put_item(Item={
        "pk":         f"BILLING#{aggregate_id}",
        "sk":         f"EVENT#{timestamp}",
        "event_type": event_type,
        "payload":    payload,
        "version":    get_next_version(aggregate_id)
    })

def get_next_version(aggregate_id: str) -> int:
    # Read the most recent event for this aggregate and increment its version.
    # Fine for low-contention aggregates; concurrent writers would need a
    # conditional put to enforce uniqueness.
    response = table.query(
        KeyConditionExpression="pk = :pk",
        ExpressionAttributeValues={":pk": f"BILLING#{aggregate_id}"},
        ScanIndexForward=False,   # newest first
        Limit=1
    )
    items = response["Items"]
    return int(items[0]["version"]) + 1 if items else 1

def get_billing_history(aggregate_id: str) -> list:
    response = table.query(
        KeyConditionExpression="pk = :pk AND begins_with(sk, :prefix)",
        ExpressionAttributeValues={
            ":pk":     f"BILLING#{aggregate_id}",
            ":prefix": "EVENT#"
        },
        ScanIndexForward=True     # oldest first — replay order
    )
    return response["Items"]

This replaced what used to be a SELECT * FROM billing_audits WHERE account_id = ? query on the monolith's main database — which had become a serious bottleneck at reporting time.


Pattern 2: Saga for Distributed Transactions

Distributed systems don't have ACID transactions across service boundaries. When we extracted the subscription service, we had to handle the scenario: charge succeeds → subscription activation fails. We implemented a choreography-based saga using SQS + SNS:

charge.succeeded  ──► [Subscription Service] ──► subscription.activated
                                │
                           (on failure)
                                │
                                ▼
                       subscription.activation_failed
                                │
                                ▼
                       [Billing Service] ──► charge.reversed
# subscription_service/saga.py

def handle_charge_succeeded(payload: dict):
    account_id = payload["account_id"]
    plan_id    = payload["plan_id"]

    try:
        activate_subscription(account_id, plan_id)

        publish_event("subscription.activated", {
            "account_id": account_id,
            "plan_id":    plan_id,
            "charge_id":  payload["charge_id"]
        })

    except ActivationError as exc:
        # Compensating transaction — tell billing to reverse the charge
        publish_event("subscription.activation_failed", {
            "account_id": account_id,
            "charge_id":  payload["charge_id"],
            "reason":     str(exc)
        })
# billing_service/saga_handlers.py

def handle_activation_failed(payload: dict):
    charge_id = payload["charge_id"]
    reverse_charge(charge_id)

    publish_event("charge.reversed", {
        "charge_id":  charge_id,
        "account_id": payload["account_id"],
        "reason":     payload["reason"]
    })

The key insight: each service owns its own compensating logic. There's no central orchestrator that becomes a bottleneck or single point of failure.


Pattern 3: Read Model Projections (CQRS)

The reporting module was one of the last pieces consuming the monolith's main Postgres instance. We extracted it by building read-model projections — denormalized views rebuilt from events.

Each time an event arrived, a projection worker updated a read-optimized store:

# reporting_service/projections/revenue_projection.py
import boto3
from decimal import Decimal

dynamodb = boto3.client("dynamodb")

def on_order_completed(event: dict):
    """
    Update the revenue projection whenever an order completes.
    Receives the full event envelope: the timestamp lives at the top
    level, not inside the payload. Eventually consistent — typically
    < 500ms behind real-time.
    """
    payload     = event["payload"]
    date_key    = event["timestamp"][:10]                # "2024-11-15"
    total_cents = int(Decimal(payload["total"]) * 100)   # "total" is a decimal string

    dynamodb.update_item(
        TableName="atlas-revenue-daily",
        Key={
            "date":     {"S": date_key},
            "currency": {"S": payload["currency"]}
        },
        UpdateExpression="ADD total_cents :amount, order_count :one",
        ExpressionAttributeValues={
            ":amount": {"N": str(total_cents)},
            ":one":    {"N": "1"}
        }
    )

The reporting API now reads from DynamoDB, never from Postgres. The main database load dropped by ~31% after this migration.


Results After 14 Months

| Metric | Before | After |
|---|---|---|
| Deployment time | 18 min (full downtime) | 3 min (rolling, zero downtime) |
| Horizontal scaling | Not possible | Auto-scales to 12 pods |
| p99 API latency | 1,840 ms | 290 ms |
| Session-related incidents | ~3/month | 0 in last 6 months |
| Primary DB CPU (peak) | 94% | 41% |
| MTTR for notification failures | ~2 hours | ~8 minutes (DLQ alert → fix) |

Lessons Worth Remembering

1. State audit before extraction.
Do not extract a service and discover later it secretly depends on shared mutable state. Build the inventory first. Every singleton, every in-memory store, every sticky-session dependency.

2. Make events self-contained.
An event payload should carry enough data that consumers never need to call back into the emitting service to process it. Avoid events like { "order_id": 123 } that force consumers to make a synchronous call to the monolith. Include the data the consumers need.

3. Idempotency is mandatory, not optional.
SQS delivers messages at least once. Consumers will see duplicates. Every handler must be safe to invoke multiple times with the same payload. Use a dedupe key:

import redis

cache = redis.Redis()  # must be a store shared by every consumer pod

def handle_order_completed(payload: dict):
    dedupe_key = f"notification:order_completed:{payload['order_id']}"

    if cache.exists(dedupe_key):
        return   # Already processed — safe to skip

    # ... process ...

    # Mark as done only after processing succeeds; a crash mid-handler
    # leaves the key unset, so SQS redelivery retries the work.
    cache.set(dedupe_key, "1", ex=86400)  # 24-hour TTL

4. DLQs are your observability layer.
A growing DLQ is a canary. Alert on it aggressively. Every message in the DLQ represents a business event that did not complete — treat it with the same urgency as a failed database write.

5. Don't extract the hardest domain first.
Start with high-async-viability, low-shared-state domains (notifications, emails, audit logs). Build team confidence and tooling patterns before tackling the hard stuff (billing, auth, core domain logic).


The Code That Made the Biggest Difference

If you take one pattern from this article, take this one: the thin event bus wrapper with a strict schema contract.

# The monolith's entire async surface area lives here.
# Every service interaction is an event.
# Every event has a version field.
# Schema changes are additive, never breaking.

EventBus.publish("order.completed", {
  schema_version: "1.2",
  order_id:       order.id,
  # ... fields
})

Versioning events from day one saved us multiple breaking-change incidents. When the notification service needed a new field, we added it to the payload (non-breaking). When a field needed to change shape, we bumped the version and ran both handlers in parallel until consumers migrated.
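That parallel-handler period can be sketched as a consumer that dispatches on the envelope's version field. The handler names and payload shapes here are hypothetical illustrations:

```python
def handle_order_completed_v1(payload: dict) -> str:
    # Legacy payload shape: "total" is a decimal string
    return f"v1 processed order {payload['order_id']}"

def handle_order_completed_v2(payload: dict) -> str:
    # New payload shape: "total_cents" is an integer
    return f"v2 processed order {payload['order_id']}"

# Both handlers stay registered until every producer emits the new version
VERSIONED_HANDLERS = {
    "1": handle_order_completed_v1,
    "2": handle_order_completed_v2,
}

def dispatch(body: dict) -> str:
    # Route on the major version; "1.0" and "1.2" share a handler
    major = body.get("version", "1.0").split(".")[0]
    handler = VERSIONED_HANDLERS.get(major)
    if handler is None:
        raise ValueError(f"Unsupported event version: {body.get('version')}")
    return handler(body["payload"])
```

Keying on the major version only is what makes additive changes free: a consumer on `1.0` can safely ignore fields added in `1.2`, and only a shape change forces a new entry in the map.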


The migration is ongoing — we still have three bounded contexts inside the monolith. But the hardest part wasn't the code. It was building shared mental models across the team about what "stateless" actually means in a distributed system, and why the discipline of event contracts pays for itself a hundred times over.
