Fu'ad Husnan

Posted on May 30

Building Our Backend House of Cards

#softwareengineering #backend #api #programming

Every backend system starts with the best intentions: clean models, sensible routes, a database schema that made perfect sense on the whiteboard. Then the product grows, the team doubles, the deadlines compress, and slowly, without anyone making a single catastrophic decision, you find yourself maintaining a backend house of cards: one wrong pull and the whole thing trembles.

Building a backend that doesn't collapse under its own weight is one of the most underappreciated disciplines in software engineering. It's not glamorous. Nobody tweets about the service they refactored to be more resilient. But the engineers who get it — who understand how structural debt accumulates and how to fight it without stopping product delivery — are the ones teams depend on when things get hard.

This article is about that. How backend systems become fragile, what the warning signs look like in real code, and what you can actually do about it before (or after) the cards start falling.

How a Backend Becomes Fragile

Fragility rarely happens all at once. It's a slow accumulation of shortcuts taken under pressure, abstractions that were never quite right, and coupling between services that seemed harmless at the time. The backend becomes a house of cards, not because anyone was careless, but because every individual decision was locally reasonable.

The most common culprit is tight coupling between components. When your user service directly calls your billing service, which calls your notification service, you've created a chain of dependencies where a latency spike in one place propagates straight back up to every caller in the chain. It feels efficient (no queues, no indirection) right up until your billing provider has a slow night and your entire authentication flow starts timing out.

Another structural weakness is shared mutable state. A database table that three different services write to without coordination becomes a source of race conditions and data corruption that's almost impossible to reproduce locally. The bugs appear in production, under load, in edge cases that your test suite never hits. By the time you trace it back to the root cause, you've already lost user trust.
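One common way to coordinate writers on a shared row is optimistic locking with a version column. The sketch below is only an illustration, using an in-memory SQLite database so it runs anywhere; the `inventory` table, its `version` column, and the `reserve` helper are assumptions made for the example, not part of any system described in this post.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create a throwaway in-memory table with a version column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        "item_id INTEGER PRIMARY KEY, "
        "available INTEGER NOT NULL, "
        "version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO inventory VALUES (1, 10, 0)")
    conn.commit()
    return conn


def read_item(conn: sqlite3.Connection, item_id: int) -> tuple:
    """Return (available, version) as seen at read time."""
    return conn.execute(
        "SELECT available, version FROM inventory WHERE item_id = ?",
        (item_id,),
    ).fetchone()


def reserve(conn, item_id, quantity, seen_available, seen_version) -> bool:
    """Write only if nobody changed the row since it was read."""
    cur = conn.execute(
        "UPDATE inventory SET available = ?, version = version + 1 "
        "WHERE item_id = ? AND version = ?",
        (seen_available - quantity, item_id, seen_version),
    )
    conn.commit()
    # Zero rows updated means another writer got there first.
    return cur.rowcount == 1


conn = setup()

# Two writers read the same snapshot of the row.
snapshot_a = read_item(conn, 1)
snapshot_b = read_item(conn, 1)

first = reserve(conn, 1, 3, *snapshot_a)   # succeeds and bumps the version
second = reserve(conn, 1, 4, *snapshot_b)  # stale version, so it is rejected
```

The rejected writer has to re-read and try again, which turns a silent lost update into an explicit, testable conflict.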

The Code That Tells You You're in Trouble

One of the most reliable signals that your backend is becoming fragile is the emergence of what engineers sometimes call "God objects" — classes or modules that know too much and do too much. When you open a file, and it imports from fifteen other modules, coordinates three external API calls, manages its own retry logic, and also handles serialization, that's a load-bearing card. Touch it carefully.

Consider this kind of function, which is more common than any team wants to admit:

def process_order(order_id: int, user_id: int):
    user = db.query(User).filter(User.id == user_id).first()
    order = db.query(Order).filter(Order.id == order_id).first()

    inventory = requests.get(f"{INVENTORY_SERVICE}/check/{order.item_id}")
    if inventory.json()["available"] < order.quantity:
        send_email(user.email, "out_of_stock_template", order)
        return {"status": "failed", "reason": "out_of_stock"}

    charge = stripe.charge(user.stripe_token, order.total_price)
    order.status = "paid"
    order.stripe_charge_id = charge.id
    db.commit()

    requests.post(f"{WAREHOUSE_SERVICE}/fulfill", json={"order_id": order_id})
    send_email(user.email, "order_confirmed_template", order)

    return {"status": "success"}

This function is doing at least five distinct jobs: reading user and order state, checking inventory, charging a payment method, updating the database, and triggering fulfillment, with customer email on top. If the warehouse service call fails after the payment succeeds, the order is paid but never fulfilled, and nothing retries it. If db.commit() fails after the charge goes through, the customer has been charged and the database has no record of it. Every step is a potential failure point with no recovery path.

Decoupling as a Survival Strategy

The antidote to tight coupling isn't a full microservices rewrite (that's a different set of problems). It's introducing the right amount of indirection at the right boundaries. The most practical tool for this is asynchronous messaging — moving from direct synchronous calls to event-driven communication wherever the business logic doesn't require an immediate response.

Instead of process_order calling the warehouse synchronously, it should emit an event and let the warehouse service pick it up independently:

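import json
from datetime import datetime, timezone

import boto3

def process_order(order_id: int, user_id: int):
    # ... inventory check and payment logic ...

    # Publish an event instead of calling the warehouse directly.
    # FULFILLMENT_QUEUE_URL is assumed to be defined in configuration.
    sqs = boto3.client("sqs")
    sqs.send_message(
        QueueUrl=FULFILLMENT_QUEUE_URL,
        MessageBody=json.dumps({
            "event": "order_paid",
            "order_id": order_id,
            "timestamp": datetime.now(timezone.utc).isoformat()
        })
    )

    return {"status": "success"}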

Now the payment service and the warehouse service are temporally decoupled: neither needs the other to be up at the same moment. If the warehouse service is down, the message waits in the queue and gets processed when it recovers. The payment service doesn't care, because its job is done. This single change eliminates an entire class of failure modes.
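To make the decoupling concrete, here is a minimal sketch of the consuming side. It uses Python's in-process `queue.Queue` as a stand-in for SQS so that it runs without AWS credentials; `publish_order_paid`, `fulfill_order`, and `drain` are hypothetical names, and a real consumer would poll the queue over the network and delete each message only after handling it.

```python
import json
import queue

# In-process stand-in for the fulfillment queue (an assumption for this sketch).
fulfillment_queue = queue.Queue()
fulfilled = []


def publish_order_paid(order_id: int) -> None:
    """Producer side: emit the event and return immediately."""
    fulfillment_queue.put(json.dumps({"event": "order_paid", "order_id": order_id}))


def fulfill_order(order_id: int) -> None:
    """Placeholder for the warehouse's real fulfillment logic."""
    fulfilled.append(order_id)


def drain() -> int:
    """Consumer side: process everything currently waiting in the queue."""
    handled = 0
    while True:
        try:
            body = fulfillment_queue.get_nowait()
        except queue.Empty:
            return handled
        message = json.loads(body)
        if message["event"] == "order_paid":
            fulfill_order(message["order_id"])
        handled += 1


# While the warehouse is "down", producers keep publishing without errors.
publish_order_paid(101)
publish_order_paid(102)
backlog = fulfillment_queue.qsize()

# When the warehouse comes back, it works through the backlog in order.
handled = drain()
```

The producer never learns whether the consumer was running, which is exactly the property the queue is there to provide.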

The downside is real: debugging asynchronous flows is harder, observability requirements go up, and your team needs to reason about eventual consistency rather than immediate consistency. These are costs worth paying as a system grows, but you don't need to pay them everywhere. Apply async messaging at the boundaries between distinct business domains, and keep synchronous calls within a single bounded context.

Idempotency: The Safety Net You're Not Using

One of the most important and least discussed properties of a resilient backend is idempotency — the guarantee that calling an operation multiple times produces the same result as calling it once. It sounds simple. In practice, most teams only think about it after their retry logic causes duplicate charges.

Any operation that writes state — creating a record, sending an email, triggering a payment — should be idempotent. The simplest way to achieve this is with client-generated idempotency keys:

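from typing import Optional
from uuid import uuid4

def create_charge(user_id: int, amount: int, idempotency_key: Optional[str] = None):
    # Callers that may retry should pass their own key and reuse it on every
    # attempt. A key generated here is new on each call, so it gives no
    # protection to a caller that retries without passing one.
    if idempotency_key is None:
        idempotency_key = str(uuid4())

    existing = db.query(Charge).filter(
        Charge.idempotency_key == idempotency_key
    ).first()

    if existing:
        return existing  # Return the original result, don't charge again

    # This check-then-insert is only safe under concurrent requests if
    # Charge.idempotency_key also has a unique constraint in the database.
    charge = Charge(
        user_id=user_id,
        amount=amount,
        idempotency_key=idempotency_key,
        status="pending"
    )
    db.add(charge)
    db.commit()

    # ... proceed with actual charge ...
    return charge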

This pattern means your retry logic can safely re-attempt failed requests without fear of side effects. It also makes your system significantly easier to reason about under network failures, because "try again" becomes a safe operation rather than a dangerous one.
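As a rough illustration of why this makes retries safe, the sketch below simulates the classic failure: the server performs the charge, but the response is lost and the client retries. Everything here is an in-memory stand-in invented for the example (`charges_by_key`, `charge_gateway`, `call_over_network`, `charge_with_retry`); it is not a real payment client.

```python
from uuid import uuid4

charges_by_key = {}    # stand-in for the charges table, keyed by idempotency key
gateway_calls = 0      # how many times money would actually have moved
responses_to_drop = 1  # simulate one response lost in transit


def charge_gateway(amount: int) -> str:
    """Pretend payment gateway; each call represents a real charge."""
    global gateway_calls
    gateway_calls += 1
    return f"ch_{gateway_calls}"


def create_charge(user_id: int, amount: int, idempotency_key: str) -> dict:
    """Server side: return the stored result if this key was already handled."""
    existing = charges_by_key.get(idempotency_key)
    if existing is not None:
        return existing
    charge = {"user_id": user_id, "amount": amount, "gateway_id": charge_gateway(amount)}
    charges_by_key[idempotency_key] = charge
    return charge


def call_over_network(user_id: int, amount: int, key: str) -> dict:
    """The server does the work, but the first response never arrives."""
    global responses_to_drop
    result = create_charge(user_id, amount, key)
    if responses_to_drop > 0:
        responses_to_drop -= 1
        raise TimeoutError("response lost")
    return result


def charge_with_retry(user_id: int, amount: int, attempts: int = 3) -> dict:
    """Client side: generate the key once and reuse it on every attempt."""
    key = str(uuid4())
    for attempt in range(attempts):
        try:
            return call_over_network(user_id, amount, key)
        except TimeoutError:
            if attempt == attempts - 1:
                raise


result = charge_with_retry(user_id=7, amount=1999)
```

The client made two requests, but because both carried the same key, the gateway was only called once.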

Stripe, several AWS APIs, and many other well-designed APIs expose idempotency keys or client tokens for exactly this reason. If your own internal APIs don't, that's worth fixing before you add retry logic. Otherwise you're building a retry mechanism that makes things worse.

Observability Is Not Optional

A fragile backend and an unobservable backend are two sides of the same problem. You can't fix what you can't see. Many teams invest heavily in writing good code but ship it into a production environment where, when something goes wrong, they're flying blind — refreshing dashboards and grepping through log files trying to reconstruct what happened.

Structured logging is the baseline. Every log line that enters production should be machine-parseable JSON with consistent fields: a timestamp, a severity level, a request ID that traces through your entire call stack, and the relevant business context. Free-text log messages like "Something went wrong in payment" are almost useless when you're trying to understand an incident at 2 am.
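A minimal sketch of what that can look like with only Python's standard `logging` module is shown below. The `JsonFormatter` class and the `request_id` and `context` field names are assumptions chosen for this example; many teams use a dedicated library instead, and the output is written to an in-memory stream here only so the result can be inspected.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with consistent fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload)


# Write to an in-memory stream for the example; production would use stdout.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "charge failed",
    extra={"request_id": "req-123", "context": {"order_id": 42, "reason": "card_declined"}},
)

entry = json.loads(stream.getvalue())
```

With every line shaped like this, "show me everything for request req-123" becomes a query rather than a grep.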

Beyond logging, distributed tracing — using something like OpenTelemetry — gives you the ability to see the full lifecycle of a request as it moves through your system. When a request is slow, you can see exactly which service, which database query, or which external call is the bottleneck. This visibility is what separates teams that fix incidents in twenty minutes from teams that spend three hours guessing.
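The idea behind a trace can be sketched without any tracing library. The code below is deliberately not the OpenTelemetry API: `span`, `finished_spans`, and `handle_request` are hypothetical names, and the sleeps stand in for real work. It only illustrates how nested, timed spans let you point at the slow step instead of guessing.

```python
import time
from contextlib import contextmanager
from contextvars import ContextVar

# The span currently in progress, so that children can record their parent.
_current_span = ContextVar("current_span", default=None)
finished_spans = []


@contextmanager
def span(name: str):
    """Time a block of work and remember which span it ran inside."""
    parent = _current_span.get()
    record = {"name": name, "parent": parent["name"] if parent else None}
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        finished_spans.append(record)


def handle_request() -> None:
    with span("POST /orders"):
        with span("db.load_order"):
            time.sleep(0.005)  # stand-in for a fast query
        with span("inventory.check"):
            time.sleep(0.05)   # stand-in for a slow external call


handle_request()

# The bottleneck is the child span with the longest duration.
children = [s for s in finished_spans if s["parent"] is not None]
slowest_child = max(children, key=lambda s: s["duration_ms"])
```

A real tracing setup adds the parts this sketch leaves out: propagating the trace across service boundaries and exporting spans to a backend where they can be searched.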

Refactoring Under Load: The Real Skill

The hardest part about fixing a backend house of cards isn't knowing what to do — it's doing it while the house is still standing and people are living in it. You can't stop shipping features to do a six-month architectural rewrite. The team that tries that usually ends up with a half-finished new architecture and a legacy system that still needs to be maintained.

The right approach is incremental strangling, commonly known as the strangler fig pattern. You identify a bounded piece of the fragile system (a single table, a single service boundary, a single API endpoint) and you build the better version alongside the old one. You route a small percentage of traffic to the new path, verify it works, and gradually shift more traffic until the old path is unused and can be deleted.
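One simple way to route "a small percentage of traffic" is a deterministic hash bucket, sketched below. The function names and the choice of hashing the order ID are assumptions for illustration; feature-flag services and load balancers offer the same idea with more controls.

```python
import hashlib


def bucket_for(key: str) -> int:
    """Map a key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_path(key: str, rollout_percent: int) -> bool:
    """True if this key falls inside the current rollout percentage."""
    return bucket_for(key) < rollout_percent


def handle_order(order_id: int, rollout_percent: int) -> str:
    """Dispatch to the new or old implementation (represented here by labels)."""
    if use_new_path(str(order_id), rollout_percent):
        return "new"
    return "old"


# At 10%, roughly one order in ten takes the new path.
share_at_10 = sum(handle_order(i, 10) == "new" for i in range(10_000)) / 10_000
```

Because the bucket depends only on the key, a given order always takes the same path at a given percentage, and raising the percentage only ever moves traffic from old to new, never back.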

This takes longer than a rewrite. It requires discipline in not adding new features to the old path while the migration is in progress. But it's the only approach that keeps the product moving and the system stable simultaneously. The teams that do it well treat it like any other engineering project: scoped, measured, tracked in the same backlog as feature work.

Conclusion

A backend house of cards isn't a moral failing — it's a natural consequence of building under real-world constraints. The goal isn't to avoid all architectural debt, which is impossible, but to stay aware of where it's accumulating and to address it deliberately before it becomes load-bearing.

Start by identifying the tightest coupling in your system — the synchronous call chain that scares you, the God object that everyone touches carefully — and introduce one layer of indirection. Add idempotency keys to your most critical write operations. Set up structured logging if you don't have it. Each of these changes is small in isolation, but together they shift your backend from something fragile to something you can actually debug, extend, and trust.

The house of cards doesn't have to stay a house of cards. Pick one card, brace it properly, and go from there.

Top comments (1)

Harjot Singh • May 31

"House of cards" is the honest name for what most early backends actually are - each feature glued onto the last, working until one assumption shifts and the whole thing wobbles. The interesting question is whether that's avoidable or just the natural early-stage state: at MVP speed, some house-of-cards is rational (you don't know what'll survive, so over-engineering is waste), and the skill is knowing which cards are load-bearing enough to harden vs which to leave flimsy because they might get deleted.

The AI-era twist: AI makes it even easier to add another card fast, so the pile grows quicker - which makes the "which parts deserve real structure" judgment more important, not less. The load-bearing stuff (auth, data integrity, money) should be solid from day one; the rest can stay cards. That's the split I bake in with Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) - the boring-but-critical 20% as solid defaults, the rest fast. Relatable post - what's the card that finally collapsed and forced the rewrite? That moment is the best teacher.