Monolith First, Services Later: A Phased Architecture Playbook
“Start simple” is easy to say and hard to do—especially when the future looks big. This playbook shows how to begin with a monolith, scale it calmly, and split it only when the signals are undeniable. The goal isn’t ideology; it’s speed to value, reliability, and a system your team can actually carry.
This guide covers:
- Why a monolith is often the right first architecture
- How to structure it so future splits are cheap
- Clear signals that say “it’s time to extract”
- A low-drama migration plan you can run inside a sprint cadence
- Metrics that catch complexity creep before it bites
Why a monolith wins the early game
- Shortest path to learning. One deployable unit, one place to debug, one mental model.
- Cheapest to carry. Fewer repos, infra pieces, and failure modes while you’re still finding product-market fit.
- Better iteration speed. Cross-cutting changes (schema + API + UI) land together without waiting on service contracts.
The point isn’t to avoid services forever—it’s to earn the right to introduce them.
Design your “service-ready” monolith
Think of your monolith as a set of modules in one process, with strict boundaries and clean seams.
1) Define business modules
Organize code by cohesive business capability, not by tech layer:
- `accounts/` (users, auth, billing profiles)
- `catalog/` (products, categories, pricing)
- `orders/` (cart, checkout, fulfillment)
- `reporting/` (analytics, exports)
Inside each module, keep controllers/handlers, domain models, and data access close together. That locality is what you’ll later lift into a service if needed.
2) Stabilize module interfaces
Expose module APIs inside the monolith as if they were network calls:
- Require DTOs for requests/responses (no “reach into my tables” shortcuts).
- For async flows, publish domain events internally (in-process bus).
- Avoid cross-module imports of private types; go through the interface.
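A minimal sketch of what a stable module interface can look like, in Python. The names here (`PlaceOrderRequest`, `OrdersApi`) are hypothetical, and the in-memory counter stands in for real persistence:

```python
from dataclasses import dataclass


# Hypothetical DTOs for the orders module's public, in-process API.
# Other modules import these types and OrdersApi, never the ORM models
# or tables behind them.

@dataclass(frozen=True)
class PlaceOrderRequest:
    customer_id: str
    sku: str
    quantity: int


@dataclass(frozen=True)
class PlaceOrderResponse:
    order_id: str
    status: str


class OrdersApi:
    """The only entry point other modules may call."""

    def __init__(self) -> None:
        self._next_id = 1  # stands in for real persistence

    def place_order(self, req: PlaceOrderRequest) -> PlaceOrderResponse:
        if req.quantity <= 0:
            raise ValueError("quantity must be positive")
        order_id = f"ord-{self._next_id}"
        self._next_id += 1
        return PlaceOrderResponse(order_id=order_id, status="PLACED")
```

Today this is a local function call; later the same signature can front an HTTP client, and callers never change.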
3) Keep a single database with bounded schemas
Use one physical database but separate schemas/tables per module. No other module is allowed to touch your tables directly—ever.
4) Capture domain events (even in-process)
Emit events like `OrderPlaced`, `PaymentCaptured`, `InventoryReserved`. At first, handlers live in the same process. You’re training your system to be event-aware without paying the distributed-systems tax yet.
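An in-process bus can be as small as this sketch (class and event names assumed for illustration); the `publish()` call site stays identical when a real broker replaces it later:

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InProcessEventBus:
    """Synchronous pub/sub inside one process.

    Handlers run inline today; only this class changes when a real
    message broker takes over, not the publishers or subscribers.
    """

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        # Deliver to every registered handler, in subscription order.
        for handler in self._handlers[event_name]:
            handler(payload)
```

Usage looks like `bus.publish("OrderPlaced", {"order_id": "ord-42"})`, with the reporting module subscribed on the other side.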
5) Instrument from day one
- Request metrics by module (`/orders/*`, `/catalog/*`).
- Tail latency (p95/p99), error rate, and resource use per module.
- A correlation ID through the stack so you can trace “one user action”.
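One way to thread a correlation ID through the stack, sketched with Python’s `contextvars`; the header name and function names are illustrative:

```python
import uuid
from contextvars import ContextVar

# A context variable carries the correlation ID down the call stack
# (including async code) without threading it through every signature.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")


def handle_request(headers: dict) -> str:
    # Reuse the caller's ID if present; otherwise mint one at the edge.
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    return process_order()


def process_order() -> str:
    # Any log line deep in the stack can tag itself with the same ID,
    # so "one user action" is greppable end to end.
    return f"[{correlation_id.get()}] order processed"
```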
When to split: objective signals (not vibes)
Only start extracting services when two or more of these persist across sprints:
1) Team throughput hits a coordination wall.
Two teams keep stepping on each other because their modules change independently and often.
2) Hot path saturates resources independently.
One module is CPU/IO heavy and drives vertical scaling, starving others.
3) Availability needs diverge.
E.g., checkout needs 99.95% and can’t be blocked by reporting or catalog rebuilds.
4) Change cadence diverges.
A module deploys 10× more frequently and needs faster approval windows.
5) Compliance or data isolation.
Clear legal/runtime boundary (e.g., PII, tenant isolation) that justifies a separate blast radius.
6) Operational boundaries are obvious.
You naturally have dedicated ownership/on-call around a module.
If the motivation is “microservices are cool” or “we might need to scale someday,” don’t split. The real cost is not lines of code—it’s orchestration, observability, failure modes, and human carrying capacity.
The low-drama extraction plan
You’ve decided to split a module (say, orders) from the monolith. Here’s a path that keeps risk small.
Phase 0 — Prep (in the monolith)
- Harden the module interface. Freeze it for a sprint; fix leaky calls.
- Event inventory. List the domain events this module emits/consumes.
- Data fence. Ensure only the module accesses its tables.
Phase 1 — Strangle with an internal boundary
- Create an internal adapter: `orders.api` (a function or HTTP client wrapper).
- All callers use the adapter; no one touches orders internals anymore.
- Add contract tests against the adapter to lock behavior.
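A contract test can be a single check function run against every implementation of the adapter: in-process today, networked tomorrow. A sketch with hypothetical names and behavior:

```python
class InProcessOrdersApi:
    """Today's implementation; a future HTTP client must pass the same checks."""

    def place_order(self, customer_id: str, sku: str, quantity: int) -> dict:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        return {"order_id": "ord-1", "status": "PLACED", "sku": sku}


def check_orders_contract(api) -> None:
    """Behavioral assertions every orders.api implementation must satisfy."""
    resp = api.place_order(customer_id="c1", sku="sku-1", quantity=2)
    assert {"order_id", "status"} <= resp.keys()  # shape is frozen
    assert resp["status"] == "PLACED"
    try:
        api.place_order(customer_id="c1", sku="sku-1", quantity=0)
        raise AssertionError("zero quantity must be rejected")
    except ValueError:
        pass  # invalid input is rejected the same way on both paths
```

Run `check_orders_contract(...)` in CI against the in-process adapter now and the service client later; a green run means the network hop didn’t change behavior.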
Phase 2 — Extract the codebase
- Copy `orders/` into a new repo (or package) with its own CI/CD.
- Start a small HTTP or gRPC service with the same API as the adapter.
- Wire a feature flag: monolith path vs. network path.
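The flag can live inside the adapter itself, so callers never know which path served them. A sketch with assumed names; `network_pct` doubles as the gradual-rollout dial used at cutover:

```python
import random


class OrdersAdapter:
    """Single entry point; a flag decides monolith path vs. network path."""

    def __init__(self, monolith_impl, service_client, network_pct: float = 0.0):
        self._monolith = monolith_impl
        self._service = service_client
        self._network_pct = network_pct  # fraction of traffic on the new path

    def place_order(self, **kwargs) -> dict:
        # 0.0 = all traffic on the monolith, 1.0 = all on the service.
        if random.random() < self._network_pct:
            return self._service.place_order(**kwargs)
        return self._monolith.place_order(**kwargs)
```

Rolling back is one config change: set `network_pct` back to `0.0`.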
Phase 3 — Dual run (shadow)
- In staging (and optionally prod), call both paths. Compare responses.
- Log diffs; fix mismatches until they converge.
- Keep writes going to the monolith DB; the service reads only.
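Dual-running can be a small wrapper that always serves the monolith’s answer and only logs the service’s disagreements. A minimal sketch:

```python
def shadow_compare(primary_fn, shadow_fn, request: dict, diff_log: list) -> dict:
    """Serve from the primary (monolith) path; call the new service in the
    shadow, record any mismatch, and never let the shadow affect the user."""
    primary = primary_fn(request)
    try:
        shadow = shadow_fn(request)
        if shadow != primary:
            diff_log.append({"request": request,
                             "primary": primary,
                             "shadow": shadow})
    except Exception as exc:
        # The shadow path must never break the real one; log and move on.
        diff_log.append({"request": request, "shadow_error": repr(exc)})
    return primary
```

Drain `diff_log` into your logging pipeline and fix mismatches until it stays empty for a few days of real traffic.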
Phase 4 — Data move
- Provision an orders database (or schema) under the service’s ownership.
- Incremental migration: backfill historical orders, then switch writes, then reads.
- Keep a change-data-capture (CDC) or sync job temporarily for safety.
Phase 5 — Cutover
- Flip the feature flag for a small % of traffic.
- Watch p95/p99, error rate, and business KPIs (conversion, order success).
- Roll forward when stable; roll back in one click if not.
Phase 6 — Retire the old path
- Remove monolith internals and the adapter’s monolith branch.
- Keep the adapter as the only client entry point (now networked).
- Update runbooks, dashboards, and on-call rotations.
You didn’t “replatform.” You replaced a vein one module at a time with minimal blood loss.
Data patterns that avoid pain
- Own your write model. Each service owns its tables; no shared writes.
- Read copies for other services. If another service needs your data, offer:
  - a read API, or
  - event streams + read models on their side.
- Idempotent events. Include event IDs and versioning; handle duplicates.
- Backfills as jobs, not scripts. Logged, retryable, and reversible.
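Idempotent handling usually reduces to deduplicating on event ID before applying. A minimal in-memory sketch; production would persist the seen-set (e.g. a unique-keyed table), not hold it in process memory:

```python
class IdempotentHandler:
    """Applies each event exactly once by deduplicating on event ID."""

    def __init__(self) -> None:
        self._seen: set = set()   # durable store in production
        self.applied: list = []   # stands in for the real side effect

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate delivery: safely ignored
        self._seen.add(event_id)
        self.applied.append(event["payload"])
        return True
```

With this in place, at-least-once delivery from the bus or CDC job becomes effectively exactly-once processing.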
Observability (so you can sleep)
- Golden signals per service: p95 latency, error rate, saturation, traffic.
- End-to-end traces: user action → monolith → service → back.
- SLOs with burn alerts: alert on SLO burn, not every spike.
- Dead letter queues for events with dashboards and runbooks.
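Burn-rate alerting compares how fast errors are consuming the budget against a threshold, over two windows. A sketch; the 14.4 fast-burn threshold is a common convention from SRE practice, not a rule, and should be tuned to your SLO:

```python
def burn_rate(observed_error_rate: float, error_budget: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    return observed_error_rate / error_budget


def should_page(short_window_rate: float, long_window_rate: float,
                error_budget: float, threshold: float = 14.4) -> bool:
    # Multi-window check: page only when BOTH a short and a long window
    # burn hot, which filters one-off spikes but catches sustained burns.
    return (burn_rate(short_window_rate, error_budget) >= threshold
            and burn_rate(long_window_rate, error_budget) >= threshold)
```

For a 99.9% SLO the budget is an error rate of 0.001, so a sustained 2% error rate burns at 20x and pages; a single spiky minute does not.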
People & process (the real decoupling)
- Ownership maps to services. One team, one on-call, one backlog.
- Release cadence per service. Stop waiting on the slowest component.
- Contract-first changes. Propose API changes as PRs to shared contracts.
- Platform guardrails. Templates for CI/CD, auth, logging, and metrics so every service starts with the basics.
Metrics that tell you it’s working
- Lead time to production (for the split module) ↓
- Change failure rate ↓ and MTTR ↓
- Team throughput (completed stories) ↑ with fewer cross-team collisions
- Infra cost per request stable or improving for the hot path
- Business KPIs (checkout success, etc.) unchanged or better
If these don’t move, stop splitting. You might be adding ceremony without outcome.
Anti-patterns to avoid
- “Nano-services.” Dozens of trivial services nobody can reason about.
- Shared database across services. Creates tight coupling and blame storms.
- Premature event explosions. Start with a few meaningful domain events.
- Hidden “glue teams.” One platform team drowning in bespoke requests.
- One-off infra. Every service should look boringly similar to operate.
A 30-day starter plan
Week 1: Identify the noisiest module (metrics + team complaints). Freeze its interface.
Week 2: Build the internal adapter + contract tests.
Week 3: Extract service repo; shadow requests in staging; fix diffs.
Week 4: Migrate writes, then reads; controlled cutover with a feature flag. Hold a post-cutover review and document runbooks.
The takeaway
A monolith isn’t the enemy; undisciplined complexity is. Start with a monolith that respects boundaries, watch the signals, and split only when the evidence is undeniable. Move one vein at a time, instrument the journey, and let outcomes—not architecture fashion—decide what comes next.
Discussion prompts
- Which signal first told you a module needed to split?
- What’s your favorite pattern for safe data migration during a service extraction?