The Moment of Choice
You've read the series so far:
- Article 1: Every Software Architecture Is a Lie. Here's Why That's OK.
- Article 2: How AWS Secretly Breaks the Laws of Software Physics (And You Can Too)
- Article 3: Microservices Destroyed Our Startup. Yours Could Be Next.
- Article 4: The $15 Million Mistake That Killed a Bank (And What It Teaches You)
- Article 5: Your "Perfect" Decision Today Is a Nightmare Waiting to Happen.
Now comes the hard part: How do you actually make decisions in the face of these paradoxes?
This article is about practical tools and mindsets: not silver bullets, but battle-tested techniques to make trade-offs visible, reversible, and survivable.
"The goal is not to avoid mistakes. The goal is to make mistakes that you can recover from."
🧰 The Architect's Toolkit for Living With Paradox
We'll cover six core techniques, each with real-world examples:
| Technique | What It Solves | Article Reference |
|---|---|---|
| 1. Architecture Decision Records (ADRs) | Hidden assumptions & forgotten rationale | Articles 1-5 |
| 2. Fitness Functions | Preventing architectural drift | Article 3 (microservices sprawl) |
| 3. Bulkheads | Containing failure blast radius | Article 2 (AWS cells) & Article 4 (ESB) |
| 4. Two-Way Door Decisions | Keeping reversibility alive | Article 5 (Stripe versioning) |
| 5. Delayed Decision-Making | Avoiding premature lock-in | Article 3 (modular monolith first) |
| 6. Chaos Engineering | Testing your trade-offs to destruction | Article 4 (bank ESB would have survived) |
1️⃣ Architecture Decision Records (ADRs): Making the Invisible Visible
The Problem
Teams make architectural decisions every day. Six months later, no one remembers why. A new engineer asks, "Why do we use Kafka instead of SQS?" The answer: "I don't know. It's always been that way."
Hidden assumptions fossilise. The bank's ESB team never wrote down: "We assume failover will preserve in-flight state. We have not tested split-brain scenarios." That assumption killed them.
The Solution: ADRs
An Architecture Decision Record is a short text file (Markdown) that captures a single decision, its context, and its trade-offs.
Minimal ADR template:
# ADR-012: Use PostgreSQL for the transaction log
## Status
Accepted (2024-01-15)
## Context
We need durable storage for financial transactions. Requirements: ACID, high write throughput, familiar to the team.
## Decision
We will use PostgreSQL with logical replication to a read replica for reporting.
## Consequences (Trade-offs)
- ✅ Strong consistency, ACID transactions.
- ✅ Team already knows PostgreSQL.
- ❌ Horizontal scaling is limited: we'll need to shard manually if we exceed 10TB.
- ❌ Cross-shard queries will be impossible.
## Reversibility
We can migrate to CockroachDB or a distributed SQL database if we outgrow PostgreSQL. Estimated effort: 3 months.
## Assumptions (Explicit)
- Transaction volume will stay under 50,000 TPS for the next 2 years.
- We do not need cross-region active-active writes.
Why ADRs Tame the Paradox
- Forces explicit trade-offs: you cannot write an ADR without listing what you lose.
- Documents assumptions: future you will know what you bet on.
- Makes reversibility a first-class concern: the "Reversibility" section is mandatory.
- Creates a decision log: new team members can read history, not reverse-engineer it.
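A rule like "the Reversibility section is mandatory" is easier to keep when a machine checks it. The following is a minimal sketch, assuming ADRs are Markdown files that use the `## Section` headings from the template above. `AdrLint` and the sample draft are hypothetical names for illustration, not part of any ADR tooling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks that an ADR's Markdown text contains the required headings. */
final class AdrLint {
    // Section names taken from the template above; adjust them to your own template.
    static final List<String> REQUIRED_SECTIONS = List.of(
            "Status", "Context", "Decision", "Consequences", "Reversibility", "Assumptions");

    /** Returns the required sections that have no "## <name>" heading in the given ADR text. */
    static List<String> missingSections(String adrMarkdown) {
        List<String> missing = new ArrayList<>();
        String[] lines = adrMarkdown.split("\\R");
        for (String section : REQUIRED_SECTIONS) {
            boolean found = false;
            for (String line : lines) {
                // Prefix match, so "## Consequences (Trade-offs)" still counts.
                if (line.trim().startsWith("## " + section)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                missing.add(section);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical draft that forgot two sections.
        String draft = String.join("\n",
                "# ADR-013: Example draft",
                "## Status",
                "Proposed",
                "## Context",
                "...",
                "## Decision",
                "...",
                "## Consequences (Trade-offs)",
                "...");
        // A CI step could fail the build whenever this list is non-empty.
        System.out.println("Missing sections: " + missingSections(draft));
    }
}
```

Run against the draft above, the check reports the Reversibility and Assumptions sections as missing, which is exactly the kind of gap the template is meant to prevent.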
Real-Time Example: Fintech "LedgerHub"
LedgerHub adopted ADRs after a near-disaster (similar to FastPay in Article 3). Their first ADR was:
"We will keep the transaction processing logic in a modular monolith until we reach 100 engineers OR need to scale processing separately. This decision will be reviewed every 6 months."
Two years later, they still haven't split into microservices, but the ADR reminds them why and when they should reconsider.
2️⃣ Fitness Functions: Automating Architectural Governance
The Problem
You designed a beautiful modular monolith with strict boundaries. Then, under deadline pressure, a developer imports the payment module directly into the notification module, bypassing the API. Architectural drift begins.
Manual code reviews miss these violations. The architecture decays.
The Solution: Fitness Functions
A fitness function is an automated test that validates an architectural characteristic. Think of it as unit tests for architecture.
Examples:
| Architectural Requirement | Fitness Function |
|---|---|
| No direct database access from the web module | Static analysis rule (e.g., ArchUnit) that fails the build |
| All services must have a circuit breaker | Integration test that simulates a downstream failure |
| API version header is mandatory | HTTP middleware test that rejects requests without version |
| P95 latency < 100ms | Performance test that runs on every PR |
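To make the first kind of rule concrete, here is a minimal sketch of a fitness function for the payment/notification boundary described earlier. It scans source text for forbidden imports using only the standard library. The package names (`com.example.notification`, `com.example.payment.internal`) and the class `ModuleBoundaryCheck` are assumptions for illustration. A tool such as ArchUnit, which the table mentions, checks this kind of rule against compiled classes and is the more robust choice in a real build.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical fitness function: the notification module must not import payment internals. */
final class ModuleBoundaryCheck {
    // Assumed package layout; only the payment module's public API package may be imported.
    static final String SOURCE_PACKAGE = "package com.example.notification";
    static final String FORBIDDEN_IMPORT = "import com.example.payment.internal.";

    /** Returns one message per offending import line, keyed by file name. */
    static List<String> violations(Map<String, String> sourcesByFile) {
        List<String> found = new ArrayList<>();
        // TreeMap gives a stable, sorted report order.
        for (Map.Entry<String, String> file : new TreeMap<>(sourcesByFile).entrySet()) {
            String source = file.getValue();
            if (!source.contains(SOURCE_PACKAGE)) {
                continue; // the rule only applies to the notification module
            }
            for (String line : source.split("\\R")) {
                if (line.trim().startsWith(FORBIDDEN_IMPORT)) {
                    found.add(file.getKey() + ": " + line.trim());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> sources = Map.of(
                "EmailSender.java", String.join("\n",
                        "package com.example.notification;",
                        "import com.example.payment.api.PaymentEvents;",
                        "class EmailSender {}"),
                "ReceiptMailer.java", String.join("\n",
                        "package com.example.notification;",
                        "import com.example.payment.internal.LedgerDao;",
                        "class ReceiptMailer {}"));
        List<String> found = violations(sources);
        found.forEach(System.out::println);
        // In CI, a non-empty list would fail the build.
        if (!found.isEmpty()) {
            System.out.println("Architecture rule violated: " + found.size() + " forbidden import(s)");
        }
    }
}
```

The design choice worth noting is that the rule lives next to the code and runs on every build, so the boundary is enforced by the pipeline instead of by reviewer memory.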
Real-Time Example: Uber's "Dependency Rules"
Uber (after their own microservices chaos) introduced fitness functions that enforce:
- No cycles between service packages.
- No direct database access from API layers.
- All RPC calls must go through the service mesh (no "short-circuiting").
When a developer violates a rule, the CI pipeline fails with a message: "You are breaking architectural rule #42. See ADR-042 for rationale."
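A "no cycles" rule of this kind reduces to cycle detection on a dependency graph. The sketch below is a generic depth-first search, assuming the graph has already been extracted as a map from each package to the packages it depends on. It is an illustration of the idea, not Uber's tooling, and `CycleCheck` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical fitness function: fails when the package dependency graph contains a cycle. */
final class CycleCheck {
    /** Returns true if the graph (package -> packages it depends on) contains a cycle. */
    static boolean hasCycle(Map<String, List<String>> deps) {
        Set<String> done = new HashSet<>();    // fully explored nodes
        Set<String> onPath = new HashSet<>();  // nodes on the current DFS path
        for (String node : deps.keySet()) {
            if (visit(node, deps, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node, Map<String, List<String>> deps,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(node)) {
            return true; // we came back to a node on the current path: cycle
        }
        if (done.contains(node)) {
            return false; // already explored, no cycle through here
        }
        onPath.add(node);
        for (String next : deps.getOrDefault(node, List.of())) {
            if (visit(next, deps, done, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> layered = Map.of(
                "api", List.of("service"),
                "service", List.of("storage"),
                "storage", List.of());
        Map<String, List<String>> tangled = Map.of(
                "orders", List.of("billing"),
                "billing", List.of("orders"));
        System.out.println("layered has cycle: " + hasCycle(layered)); // false
        System.out.println("tangled has cycle: " + hasCycle(tangled)); // true
    }
}
```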
Why Fitness Functions Tame the Paradox
- Prevents silent debt accumulation: violations are caught immediately.
- Makes trade-offs enforceable: if you decided "no shared database", you can enforce it.
- Reduces review burden: machines check rules; humans review intent.
3️⃣ Bulkheads: Containing the Explosion
The Problem
In Article 4, the bank's ESB failed globally because there were no bulkheads: every channel shared the same critical path. A failure in one area consumed all resources and took down everything.
The Solution: Bulkheads (Physical or Logical)
In ship design, bulkheads are the partitions that divide the hull into watertight compartments. If the hull is breached, only one compartment floods and the ship stays afloat.
Software bulkheads:
- Separate thread pools, so a slow dependency doesn't starve other requests.
- Separate deployment units, so a crash in one service doesn't crash others.
- Separate databases, so a lock storm in one table doesn't freeze everything.
- Separate clusters / cells, as AWS does (Article 2).
Real-Time Example: Netflix's "Hystrix" (Now Superseded by Resilience4j)
Netflix built Hystrix to implement bulkheading at the thread-pool level. Hystrix is now in maintenance mode, and Resilience4j is the commonly recommended successor. Each downstream dependency gets its own thread pool. If the recommendations service slows down, it exhausts only its own thread pool, while billing and playback continue unaffected.
Code example (pseudo):
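import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Without bulkheads: one pool for everything, so one slow dependency can tie up all 100 threads
ExecutorService sharedPool = Executors.newFixedThreadPool(100);

// With bulkheads: each dependency gets its own fixed-size pool (sizes are illustrative)
ExecutorService billingPool = Executors.newFixedThreadPool(20);
ExecutorService recsPool = Executors.newFixedThreadPool(10);
ExecutorService playbackPool = Executors.newFixedThreadPool(70);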
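Separate thread pools are one way to build a compartment. A lighter alternative is to cap concurrency with a semaphore and reject the overflow. The sketch below is a minimal illustration using only `java.util.concurrent`. The `Bulkhead` class is hypothetical and is not the Hystrix or Resilience4j API.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Minimal semaphore bulkhead: at most maxConcurrent calls run at once; extras get the fallback. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the action if a permit is free, otherwise returns the fallback immediately. */
    <T> T call(Supplier<T> action, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // compartment is full: fail fast instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always hand the permit back, even if the action throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        Bulkhead recs = new Bulkhead(1);
        // Simulate a second request arriving while the first one is still in flight.
        Supplier<String> secondRequest = () -> recs.call(() -> "recommendations", "rejected");
        System.out.println(recs.call(secondRequest, "rejected")); // prints "rejected"
        // Once the first call has finished, the permit is free again.
        System.out.println(recs.call(() -> "recommendations", "rejected")); // prints "recommendations"
    }
}
```

The trade-off compared with thread pools is that a semaphore limits how many callers enter but does not isolate them onto separate threads, so a hung call still blocks its caller. Pairing it with a timeout is the usual complement.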
Why Bulkheads Tame the Paradox
- Limits blast radius: failure stays in its compartment.
- Preserves partial availability: 90% of the system can work even if 10% fails.
- Makes trade-offs visible: you must decide how many threads to allocate to each bulkhead.
4️⃣ Two-Way Door Decisions: Keeping Reversibility Alive
The Problem
Many architectural decisions feel permanent. But Jeff Bezos (Amazon) famously distinguishes between two-way doors (reversible) and one-way doors (irreversible). Most decisions are two-way doors, but we treat them as one-way because of fear.
The Solution: Design for Reversibility
Before making a decision, ask: "If we're wrong, how hard is it to change?"
If the answer is "very hard", invest in making it less hard before committing.
Examples of reversible design:
| Decision | Irreversible Approach | Reversible Approach |
|---|---|---|
| Database choice | Write core logic directly against the PostgreSQL API | Write a repository abstraction, so swapping databases requires changing only the adapter |
| Cloud provider | Use AWS DynamoDB SDK everywhere | Use a thin wrapper (e.g., KeyValueStore interface), with DynamoDB as one implementation |
| Authentication | Hardcode session cookies | Use a pluggable auth middleware â swap sessions for OAuth with config change |
| API versioning | No versioning (clients break on changes) | Version header from day one (Stripe model) |
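The "thin wrapper" rows above can be sketched in a few lines. The example below assumes a `KeyValueStore` interface as named in the table. The in-memory adapter and the `SessionService` consumer are hypothetical, and a DynamoDB- or PostgreSQL-backed adapter would be another class implementing the same interface (not shown here, to avoid guessing at vendor SDK calls).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical port: the only storage API that business code is allowed to see. */
interface KeyValueStore {
    void put(String key, String value);

    Optional<String> get(String key);
}

/** In-memory adapter, handy for tests; a vendor-backed adapter would implement the same interface. */
final class InMemoryKeyValueStore implements KeyValueStore {
    private final Map<String, String> data = new HashMap<>();

    @Override
    public void put(String key, String value) {
        data.put(key, value);
    }

    @Override
    public Optional<String> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}

/** Business logic depends on the interface, never on a vendor SDK. */
final class SessionService {
    private final KeyValueStore store;

    SessionService(KeyValueStore store) {
        this.store = store;
    }

    void login(String sessionId, String userId) {
        store.put("session:" + sessionId, userId);
    }

    String currentUser(String sessionId) {
        return store.get("session:" + sessionId).orElse("anonymous");
    }
}

final class TwoWayDoorDemo {
    public static void main(String[] args) {
        // Swapping the backing store means changing only this one constructor call.
        SessionService sessions = new SessionService(new InMemoryKeyValueStore());
        sessions.login("abc", "user-42");
        System.out.println(sessions.currentUser("abc"));     // prints "user-42"
        System.out.println(sessions.currentUser("missing")); // prints "anonymous"
    }
}
```

An abstraction like this lowers the cost of switching but does not remove it: features such as transactions, secondary indexes, or consistency guarantees tend to leak through, so the interface should expose only what the application genuinely needs.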
Real-Time Example: Airbnb's "Repository Pattern"
Airbnb started with a monolithic Rails app using PostgreSQL. They knew they might need to shard or move to a different database. Instead of waiting, they built a repository layer early: every database query went through a UserRepository, BookingRepository, etc.
When they eventually needed to move some tables to Cassandra, the change was localised: they rewrote only the repository implementations. The rest of the code never knew.
Why TwoâWay Doors Tame the Paradox
- Reduces fear of making decisions: you know you can reverse.
- Preserves optionality: you don't get locked into a dead end.
- Encourages experimentation: try a pattern; if it fails, revert.
5️⃣ Delayed Decision-Making: The Art of Not Deciding Yet
The Problem
Architects often feel pressure to "decide everything upfront". But many decisions are better made later, when you have more data.
The Solution: Delay Until the Last Responsible Moment
Ask: âDoes this decision need to be made now, or can we wait?â
If waiting costs little and gives you more information, wait.
Decisions to delay:
- Exact instance sizes (use auto-scaling with conservative guesses first)
- Specific NoSQL database (start with PostgreSQL, measure, then migrate if needed)
- Microservice boundaries (start modular monolith, split only when pain is real)
Decisions NOT to delay:
- Authentication scheme (hard to add later)
- API versioning strategy (painful to retrofit once clients exist)
- Data partitioning key (changing later means migrating all data)
Real-Time Example: Etsy's "Monolith First, Ask Questions Later"
Etsy ran on a monolith for years, even as they grew to millions of users and hundreds of engineers. They delayed splitting into services until the pain of the monolith (deployment conflicts, slow tests) exceeded the pain of distributed systems. When they finally split, they had clear data on which boundaries made sense.
Why Delayed Decisions Tame the Paradox
- Avoids premature optimisation: solving problems you don't yet have.
- Reduces architectural debt: decisions made with more data are less likely to be wrong.
- Preserves energy for real problems: don't boil the ocean.
6️⃣ Chaos Engineering: Testing Your Trade-Offs to Destruction
The Problem
You think your architecture is resilient. You think your bulkheads work. You think failover preserves state. But you've never actually tested it under real failure conditions.
The bank's ESB team thought their failover worked. They were wrong.
The Solution: Chaos Engineering
Chaos engineering is the practice of running experiments that inject failures into a production-like system to verify its resilience.
Principles:
- Define a steady state (e.g., "95% of requests succeed within 200ms").
- Inject a real-world failure (kill a node, corrupt a cache, slow a network).
- Observe if the steady state holds.
- If it doesn't, you have a gap. Fix it.
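The four principles above can be walked through with a toy simulation. The sketch below is not a chaos tool: it models three in-process "replicas" and a round-robin client, all hypothetical, to show the shape of an experiment (check the steady state, inject a failure, observe). The 95% threshold mirrors the example steady state; latency is left out to keep the model deterministic.

```java
import java.util.List;

/** Toy chaos experiment: does a 95% success rate survive the loss of one replica? */
final class ChaosExperiment {
    static final double STEADY_STATE_MIN_SUCCESS = 0.95;

    /** Toy replica that the experiment can "kill". */
    static final class Replica {
        boolean alive = true;
    }

    /** Toy round-robin client that tries up to maxAttempts replicas per request. */
    static final class Client {
        private final List<Replica> replicas;
        private final int maxAttempts;
        private int next = 0;

        Client(List<Replica> replicas, int maxAttempts) {
            this.replicas = replicas;
            this.maxAttempts = maxAttempts;
        }

        boolean request() {
            int start = next;
            next = (next + 1) % replicas.size();
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                if (replicas.get((start + attempt) % replicas.size()).alive) {
                    return true;
                }
            }
            return false;
        }
    }

    static double successRate(Client client, int requests) {
        int ok = 0;
        for (int i = 0; i < requests; i++) {
            if (client.request()) {
                ok++;
            }
        }
        return (double) ok / requests;
    }

    /** Runs the experiment and reports whether the steady state held after the failure. */
    static boolean steadyStateHoldsAfterReplicaLoss(int maxAttempts) {
        List<Replica> replicas = List.of(new Replica(), new Replica(), new Replica());
        Client client = new Client(replicas, maxAttempts);
        // Step 1: confirm the steady state before touching anything.
        if (successRate(client, 300) < STEADY_STATE_MIN_SUCCESS) {
            throw new IllegalStateException("Not in steady state; abort the experiment");
        }
        // Step 2: inject the failure.
        replicas.get(0).alive = false;
        // Step 3: observe whether the steady state still holds.
        return successRate(client, 300) >= STEADY_STATE_MIN_SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println("No retries: " + steadyStateHoldsAfterReplicaLoss(1)); // false
        System.out.println("One retry:  " + steadyStateHoldsAfterReplicaLoss(2)); // true
    }
}
```

Without retries, a third of the requests land on the dead replica and the steady state breaks. With a single retry it holds. That is the fourth principle in miniature: the experiment exposed a gap, and the fix is a specific, testable change.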
Real-Time Example: Netflix's "Simian Army"
Netflix runs Chaos Monkey, a service that randomly terminates production instances during business hours. This forces every team to build systems that survive instance death. They also have:
- Latency Monkey: injects artificial delays.
- Conformity Monkey: finds instances that don't follow best practices.
- Doctor Monkey: detects unhealthy instances (e.g., high CPU, disk full).
Practical Chaos for the Rest of Us
You don't need Netflix scale. Start small:
| Failure Injection | How to Test |
|---|---|
| Kill a database replica | In staging, stop the replica: does read traffic still work? |
| Slow a downstream service | Add a 5-second delay to a third-party API call: does your circuit breaker trip? |
| Crash a service instance | In Kubernetes, kubectl delete pod: does the service recover? |
| Lose a cache entry | Manually delete a Redis key: does the system fall back to the database? |
| Exhaust a connection pool | Simulate many concurrent requests: does the pool correctly reject or queue? |
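The "does your circuit breaker trip?" question assumes you know what tripping looks like. Here is a minimal sketch of a breaker that opens after a number of consecutive failures and allows a trial call after a cool-down. `CircuitBreaker` is a hypothetical class, not a library API. It treats any `RuntimeException` as a failure (in practice a call timeout would raise one for the slow-dependency case), takes an injected clock so it can be tested without sleeping, and is not thread-safe.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal, single-threaded circuit breaker sketch. */
final class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the breaker is closed

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True while the breaker is open and the cool-down has not yet elapsed. */
    boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    /** Runs the action unless the breaker is open; failures and open state yield the fallback. */
    <T> T call(Supplier<T> action, T fallback) {
        if (isOpen()) {
            return fallback; // fail fast without touching the dependency
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            openedAt = -1; // a success closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // trip (or re-trip after a failed trial call)
            }
            return fallback;
        }
    }

    public static void main(String[] args) {
        long[] now = {0};    // fake clock in milliseconds
        int[] attempts = {0};
        CircuitBreaker breaker = new CircuitBreaker(3, 1_000, () -> now[0]);
        Supplier<String> flaky = () -> {
            attempts[0]++;
            throw new IllegalStateException("simulated timeout");
        };
        for (int i = 0; i < 5; i++) {
            breaker.call(flaky, "fallback");
        }
        System.out.println("Dependency was called " + attempts[0] + " times"); // 3, then the breaker opened
        now[0] = 1_000; // cool-down elapsed: the next call is a trial
        System.out.println(breaker.call(() -> "recovered", "fallback"));      // prints "recovered"
    }
}
```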
Why Chaos Engineering Tames the Paradox
- Reveals hidden assumptions: the ones that kill you in production.
- Builds confidence in trade-offs: you know your bulkheads work because you've seen them work.
- Makes failure boring: when failures happen regularly in testing, they're less scary in production.
Putting It All Together: A Decision-Making Framework
When facing an architectural decision, run this checklist:
| Step | Question or Action | What to Do |
|---|---|---|
| 1 | Is this a two-way door? | If yes, decide quickly. If no, proceed. |
| 2 | Can we delay this decision? | If yes, set a calendar reminder for review. If no, proceed. |
| 3 | Document the decision | Write an ADR with trade-offs and reversibility plan. |
| 4 | Enforce the decision | Write a fitness function to prevent drift. |
| 5 | Add bulkheads | Limit blast radius if the decision turns out wrong. |
| 6 | Test the decision | Write a chaos experiment that verifies the decision's assumptions. |
🧠 Real-Time Example: Applying the Framework to a Real Choice
Scenario: Your team must choose a message queue for a new order processing system.
| Step | Action Taken |
|---|---|
| Two-way door? | Yes. You can change queues later if you use an abstraction. |
| Delay? | No. You need it now for the MVP. |
| ADR | Written: "Use RabbitMQ because the team knows it, but we'll wrap it with a MessageQueue interface." |
| Fitness function | Test that no code directly imports the RabbitMQ client, only the wrapper. |
| Bulkheads | Separate queues per order type (standard vs. express) so one doesn't starve the other. |
| Chaos | In staging, kill RabbitMQ nodes: does the system degrade gracefully? Does it replay unacked messages? |
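The wrapper and the per-order-type queues from the table can be sketched together. The `MessageQueue` interface name comes from the ADR row. Everything else here (the method signatures, the in-memory adapter, `OrderPublisher`, and the queue names `orders.standard` and `orders.express`) is a hypothetical illustration. A RabbitMQ-backed adapter would implement the same interface and is deliberately not shown.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;

/** Wrapper named in the ADR: application code sees only this interface. */
interface MessageQueue {
    void publish(String queueName, String message);

    Optional<String> poll(String queueName);
}

/** In-memory adapter for tests and local runs. */
final class InMemoryMessageQueue implements MessageQueue {
    private final Map<String, Queue<String>> queues = new HashMap<>();

    @Override
    public void publish(String queueName, String message) {
        queues.computeIfAbsent(queueName, name -> new ArrayDeque<>()).add(message);
    }

    @Override
    public Optional<String> poll(String queueName) {
        Queue<String> queue = queues.get(queueName);
        return queue == null ? Optional.empty() : Optional.ofNullable(queue.poll());
    }
}

/** Routes each order type to its own queue, so a backlog in one does not delay the other. */
final class OrderPublisher {
    static final String STANDARD_QUEUE = "orders.standard";
    static final String EXPRESS_QUEUE = "orders.express";

    private final MessageQueue queue;

    OrderPublisher(MessageQueue queue) {
        this.queue = queue;
    }

    void submit(String orderId, boolean express) {
        queue.publish(express ? EXPRESS_QUEUE : STANDARD_QUEUE, orderId);
    }
}

final class OrderQueueDemo {
    public static void main(String[] args) {
        MessageQueue queue = new InMemoryMessageQueue();
        OrderPublisher publisher = new OrderPublisher(queue);
        publisher.submit("order-1", false);
        publisher.submit("order-2", false);
        publisher.submit("order-3", true);
        // The express consumer gets its order without waiting behind the standard backlog.
        System.out.println(queue.poll(OrderPublisher.EXPRESS_QUEUE).orElse("none"));  // prints "order-3"
        System.out.println(queue.poll(OrderPublisher.STANDARD_QUEUE).orElse("none")); // prints "order-1"
    }
}
```

This sketch covers only publish and poll. Acknowledgement and redelivery semantics, which the chaos step probes, differ between brokers and would need to be part of the interface before a real swap is as cheap as the ADR hopes.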
The decision is made confidently because the framework forces you to think about failure modes and reversibility, not just happy paths.
Article 6 Summary
"The paradox doesn't go away. But with ADRs, fitness functions, bulkheads, two-way doors, delayed decisions, and chaos engineering, you can live with it, and even thrive."
The six tools are not a silver bullet. They won't eliminate trade-offs. But they will:
- Make trade-offs visible (ADRs)
- Prevent silent decay (fitness functions)
- Limit damage when you're wrong (bulkheads)
- Keep options open (two-way doors, delayed decisions)
- Reveal hidden assumptions before they kill you (chaos engineering)
The best architects are not the ones who are never wrong. They are the ones who fail safely, learn quickly, and adapt gracefully.
Next in the Series... (The Grand Finale)
You've seen the paradox, the disasters, the tools. Now comes the hardest part: changing your mindset.
Article 7 (Coming Tuesday, Series Finale): "Stop Trying to Build the Perfect System. Do This Instead."
Spoiler: The 7 mindset shifts that separate great architects from burnt-out ones, and why "good enough" is the only sustainable goal.
This is the Zen of Architectural Pragmatism. Don't miss it. ☯️
Found this useful? Share it with a team that's about to make an irreversible decision without a reversibility plan.
Have a tool we missed? The paradox loves new weapons, so reply.






