๐Ÿ›๏ธ The Solution Architect Playbook ๐Ÿ“š: From Best Designer to Best Bridge ๐ŸŒ‰

A deep, opinionated, practical guide for the engineer-architect who designs end-to-end solutions across systems, teams, and business units. The mental models, decision frameworks, discovery tactics, design methods, communication patterns, and anti-patterns that separate the SA whose solutions actually ship and run for years from the one whose 80-page Visio decks gather dust on Confluence. Grounded in current reality: multi-cloud by default, AI woven into every solution, smaller delivery teams per dollar of revenue, regulation by frameworks that didn't exist five years ago, and customers who can read a SOC 2 report.

If you read only a few sections first, read §2 Mindset, §6 Discovery, §9 NFRs, and §13 Build vs Buy. Everything else is the implementation of those four.

Companion to 🧑‍💻 The Tech Lead Playbook: From Best IC to Multiplier 🚀 (the team-level role), 👨‍💻 The CTO Playbook 📘: From Best Builder to Best Bet ♟️ (the org-level role), 🏛️ The System Design Playbook 📖 (the design vocabulary), 🛠️ The Senior Software Engineer Playbook 📖: From Good Coder to High-Impact Engineer 🚀 (deep IC craft), 🤖 The AI SaaS Playbook (Practical Edition) 📘 (AI overlay), and 🚀 The SaaS Template Playbook 📖 (delivery foundations). This one is for the technical professional who is accountable for a solution end-to-end across systems, teams, and stakeholders – whether at a consulting firm, cloud vendor, ISV, or in-house enterprise team.


📋 Table of Contents

  1. ⚡ Read This First
  2. 🧠 The Solution Architect Mindset
  3. 🎭 The SA Landscape: Five Archetypes
  4. 🪜 SA vs TL vs Software Architect vs EA vs CTO
  5. 🚪 The First 90 Days
  6. 🔍 Discovery: The Real Job Begins Here
  7. 📐 Solution Design Methodology
  8. 🗂️ Documenting a Solution: C4, ADRs, arc42
  9. 🎯 Non-Functional Requirements: The Real Job
  10. ☁️ Cloud Architecture (AWS, Azure, GCP, Multi)
  11. 🔌 Integration Architecture
  12. 🗄️ Data & AI Architecture
  13. ⚖️ Build vs Buy vs Customize
  14. 🛒 Vendor Evaluation & Selection
  15. 💰 Cost & TCO Modeling
  16. 🛡️ Security, Compliance & Risk
  17. 🚚 Migration Architecture: 6Rs and Beyond
  18. 💬 Communication: Diagrams, Documents, Presentations
  19. 🤝 Stakeholder Management
  20. 🤵 Pre-Sales SA: The Consultative Sale
  21. 🛠️ Post-Sales SA: Delivery Architecture
  22. 🚀 Working with Delivery Teams
  23. ⏱️ The Operating Cadence
  24. 🤖 AI in the SA Role
  25. 🧰 Tools of the Trade
  26. ⚠️ The SA Anti-Pattern Catalog
  27. 🗺️ The Phased Roadmap (Day 1 → Year 5)
  28. 📋 Cheat Sheet & Resources

1. ⚡ Read This First

Seven truths that will save you the first 18 months of mistakes every new solution architect makes:

  1. You are paid for the solution, not the technology. Technology is the cheapest input to a solution. The expensive inputs are: the problem you chose to solve, the constraints you accepted, the integrations you didn't anticipate, the stakeholders you forgot to align, and the operational cost the customer didn't budget. A great SA renders a business problem into a runnable, affordable, supportable system. A mediocre SA renders a Visio diagram. Recognize which one you are this quarter.
  2. Your authority is borrowed. You usually don't manage the people who will build the thing. You don't sign the cheque. You don't run the production system. Your influence comes from technical credibility (people trust your judgment), clarity (people know what to do and why), and being the only person who has read the whole problem (you are the connective tissue). If you try to lead with "because the architect said so," you have already lost.
  3. NFRs are the job; functional requirements are table stakes. Every junior can list "the system should let users log in." A senior SA writes: "login p99 ≤ 400ms at 5,000 RPS, 99.95% available, MFA required for admin actions, SOC 2 evidence captured per session, and per-tenant audit retention of 7 years." The first sentence is the menu. The second is the contract. The contract is where projects succeed or fail. Most SA failures aren't bad designs – they're missing or sloppy non-functional requirements.
  4. The boring decisions compound. Naming conventions, ADR templates, environment promotion rules, IAM patterns, secrets handling, observability standards, vendor onboarding workflow. A solution where these are boring and consistent ships in 4 months. A solution where every team improvises ships in 14 months and never gets to "production-grade." Predictable, written, unsexy patterns beat clever bespoke designs every time.
  5. You will spend more time in conversations than in diagrams. Discovery interviews. Vendor calls. Risk reviews. Stakeholder alignment. Steering committee briefings. PMO standups. DevOps handoffs. Most new SAs over-index on diagram quality and under-index on conversation quality. The single highest-leverage skill is: walk into a 60-minute meeting with five people who disagree and walk out with a written, signed decision. Practice it explicitly.
  6. Reversibility is your most valuable axis. Bezos's two-way / one-way door framing matters more for an SA than for almost any other role. Your job is to isolate the irreversible decisions (cloud provider, primary identity store, core data model, the integration contract two business units depend on) and surface them with appropriate care, while deliberately defaulting all reversible decisions to fast and cheap. SAs who treat every decision as one-way burn quarters; SAs who treat every decision as two-way leak risk.
  7. Writing is the operating system of your job. Architecture briefs, ADRs, RFP responses, runbooks, risk registers, decision memos, vendor scorecards, post-mortems. If your writing is mediocre, every other lever is dampened. The SAs who scale fastest are the ones whose writing is so clear that the team can act without needing a meeting. Ship that skill before you ship anything else.

The rest is implementation of these seven.

Who this is for

  • You were just made (or are about to be made) Solution Architect, Principal Architect, or Senior Cloud Architect at a consulting firm, ISV, cloud vendor, SI, or in-house team.
  • You're a senior/staff engineer being pulled into pre-sales, vendor selection, or end-to-end design and want to learn the discipline rather than wing it.
  • You're a tech lead whose scope just expanded across teams or business units and you no longer have a single team's people leverage.
  • You're an enterprise architect or program lead who wants the next layer down โ€” how solutions actually get designed and delivered.

Who this is not for

A note on context

The default voice assumes a mid-to-senior solution architect on a multi-team, multi-system engagement, ~3 to 12 months of design-plus-delivery duration, current reality (multi-cloud by default, AI woven through every solution, GenAI in copilots, FinOps mandatory, a regulatory surface that grew teeth). Pre-sales SAs in vendor/SI roles should read everything but lean hardest into §6, §14, §18, §20. In-house enterprise SAs should focus on §9, §16, §22, §23. Boutique and freelance SAs need every section, doubly so §1, §13, §15.


2. 🧠 The Solution Architect Mindset

The mindset shift from senior engineer or tech lead to SA is harder than the skill shift. Most failed SAs were technically capable; they failed at the positional layer – they kept thinking like a builder when their job was to think like a connector.

2.1 Identity reframe: from "best designer" to "best bridge"

You used to be measured by the system you designed. Now you are measured by whether the right system gets designed, gets bought (literally or organizationally), and gets shipped, given the constraints and stakeholders in play. Your output is a solution that closes a business problem, and that includes everything from "the integration is feasible" to "the CFO signed off on the cost" to "the security team accepted the risk register" to "the delivery team can actually build it." This breaks five engineering instincts you must consciously rewire:

| Old engineering instinct | New SA instinct |
| --- | --- |
| "I'll design the cleanest system" | "Which 3 constraints determine 80% of this design? Optimize there, accept the rest." |
| "Let me research the best technology" | "What does the customer already have, what can they operate, and what can they afford?" |
| "I'll just code a prototype" | "What's the smallest demo, document, or whiteboard that decides this?" |
| "We need consensus on the design" | "Who owns this decision? When and how do they decide? Who do they need to hear from?" |
| "Production is the next team's problem" | "Operability is part of my design. If it can't be run, I haven't designed it." |

Practical: write a one-line role description and pin it to your monitor. "I am the Solution Architect for [Project / Account / Domain]. My job is to deliver a runnable, affordable, supportable solution that closes the business problem within the agreed constraints, working through teams I do not manage and stakeholders I do not control." If you can't articulate this, your stakeholders can't either, and they will silently form their own (often conflicting) definitions of your job.

2.2 The five hats – and how they fight

You wear five hats simultaneously, and they actively interfere:

| Hat | Mode | Time horizon | Output |
| --- | --- | --- | --- |
| Discoverer | Curious, slow, listening | Days–weeks | Interview notes, context map, problem statement |
| Designer | Deep, abstract, system-level | Weeks | Architecture brief, C4 diagrams, ADRs |
| Negotiator | Diplomatic, fast, decisive | Hours–days | Decisions logged, alignment achieved, scope clarified |
| Salesperson | Confident, narrative, value-led | Hours | Pitch decks, RFP responses, executive briefings |
| Operator | Pragmatic, hands-dirty | Days–weeks | Runbooks, governance gates, delivery escalations |

Each demands a different brain state. A 2-hour design session with engineers and a 2-hour vendor pitch to a CIO cannot share the same morning. Batch by hat, not by topic. The most common failure mode: defaulting to Designer mode whenever uncomfortable. Discovery is messy, negotiation is stressful, sales feels icky, operations is tedious. Designer mode produces gorgeous diagrams that no one will pay for, no one will sign off on, and no one will run. Calendar discipline beats willpower. See §23 for the cadence.

2.3 The four voices

Every SA has four internal voices. They lie in different ways. Notice them.

  1. The Architect Astronaut Voice – "This deserves a layered abstraction with a domain-driven hexagonal core." Lies upward – turns simple problems into 18-month platform plays. Common in SAs who came from heavy frameworks or who haven't shipped recently.
  2. The Vendor-Whisperer Voice – "AWS launched X last week, this is a perfect use case." Lies sideways – fits the customer to the technology rather than the technology to the customer. Especially common in vendor-employed SAs and the newly certified.
  3. The Imposter Voice – "They hired me by mistake; the real architects know more about [obscure pattern]." Lies downward – talks you out of necessary calls and produces a consensus-only SA who never makes a decision and is invisible at the steering committee.
  4. The Steward Voice – "What does this customer need to be capable of in 18 months given their team, budget, and regulatory reality? What's the smallest system that gets there?" Lies the least. Cultivate this one.

When the Astronaut, Vendor-Whisperer, or Imposter voice is driving a decision, write the decision down and revisit in 24 hours. Most regretted SA decisions happen in the 24 hours after a glossy vendor briefing, a hostile steering committee, or a public dressing-down. Sleep first.

2.4 The leverage hierarchy

Rank your time by leverage. Always work top-down:

  1. Problem framing. What is actually being solved, for whom, with what constraints. 1 hour here = 100 hours saved later.
  2. NFR negotiation. Latency, availability, cost ceiling, RPO/RTO, data residency, compliance class. The contract.
  3. Stakeholder alignment. Who owns each decision, who signs which doc, who attends which gate. The political wiring of the project.
  4. Build vs buy vs reuse. The biggest cost lever. Wrong here = wasted years.
  5. Reference architecture & ADRs. The shape of the solution, the irreversible choices, the rationale.
  6. Cost / TCO model. Without this you cannot defend the design.
  7. Integration design. Where systems meet is where projects fail. Spend disproportionate time here.
  8. Risk register & mitigation plan. The brutal honest list of what could kill this.
  9. Delivery handoff. The team needs to own this solution, not implement it under your dictation.
  10. Reviewing. Other people's diagrams, PRs, vendor decks. Useful in moderation. Stop being on the critical path.
  11. Building. Your own code. Lowest-leverage of all. Do only what literally only you can do โ€” usually a thin spike to prove a tradeoff, never production code.

When you feel busy but useless, you've inverted the stack. Reset by asking: "In the last 5 working hours, how much did I spend on items 1–4?" If the answer is "<2," that's the problem.

2.5 Reversible vs irreversible decisions

The single most clarifying frame in your toolkit. Examples calibrated to the SA seat:

  • Two-way doors (reversible): which CI provider, which monitoring vendor, the exact format of an ADR, sprint cadence, the choice between two equivalent serializers, naming a microservice. Decide fast, reverse if wrong, do not run a six-week working group on these.
  • One-way doors (hard or expensive to reverse): primary cloud provider for production data, identity provider, core data model, public API shape, primary database for OLTP, the customer-facing event schema, a long-term integration contract with a partner, the multi-tenant boundary, the country of data residency. Slow down. Write it up. Get input. Get expert review. Sleep on it. Document why.

A good SA visibly labels each decision in the running ADR log: Reversibility: Two-way / One-way / One-and-a-half-way (reversible only with notable cost). This single column changes how stakeholders engage. It also gives you political air cover: "This is one-way. We need a written decision from the data owner. Until then, we're building the two-way pieces around it."

2.6 The "Design for the second-best engineer" rule

You will not be the one operating this thing in production. The team that operates it will not be the most senior team in the company. Design for the engineer who is the second-best on the team that will inherit it, on a Tuesday afternoon, three months after you've moved on. That engineer is intelligent but tired, has not read your 40-page design, has half a Slack thread of context, and just got paged.

If your design requires the brilliant engineer to keep it running, your design is wrong. Examples of the rule applied:

  • Prefer obvious over clever. If you must choose between a standard managed service and a custom event-driven mesh, the managed service wins unless the data forces otherwise.
  • Keep the operating model boring: standard SLOs, standard runbooks, standard observability stack, standard secrets store.
  • Eliminate "context-only-the-architect-knows" from the critical path. Every load-bearing decision must be a written ADR.

2.7 Three habits that separate principal from staff

  1. Quantify before you draw. Every box on the diagram has an estimated load (RPS, GB/day, concurrent users), a latency budget, a failure mode, and a cost. If you cannot fill those four columns, you have not designed it; you have drawn it. A back-of-envelope sketch follows this list.
  2. Name the failure modes. For every component: "What happens when this is slow / down / wrong / saturated / breached?" Then "Who finds out, how fast, and what do they do?" If you cannot answer, the design is incomplete.
  3. Defer the exotic. Reach for the boring tool until measurements force the exotic one. The career graveyard is full of solution architects who chose Cassandra-on-Day-One because the marketing said "scales," and now the customer has a six-node ops nightmare for 3,000 RPS.
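
To make "quantify before you draw" concrete, here is a minimal back-of-envelope sketch in Python. Every number in it is an illustrative assumption rather than a recommended target; the point is that each box gets a load estimate, a storage trajectory, and an explicit latency budget before it earns a place on the diagram.

```python
# Back-of-envelope sizing for one box on the diagram (all numbers are illustrative assumptions).
peak_rps = 5_000            # assumed peak request rate
avg_payload_kb = 2          # assumed average payload size
persist_fraction = 0.05     # assume 5% of requests write durable data
yoy_growth = 1.6            # assumed year-over-year growth multiplier (i.e. 60% growth)

ingress_mb_s = peak_rps * avg_payload_kb / 1024
daily_storage_gb = peak_rps * persist_fraction * avg_payload_kb * 86_400 / (1024 ** 2)
three_year_storage_tb = daily_storage_gb * 365 * (1 + yoy_growth + yoy_growth ** 2) / 1024

# Latency budget: split the end-to-end p99 target across hops (assumed split).
e2e_p99_ms = 400
budget_ms = {"edge/LB": 20, "api": 80, "downstream calls": 200, "db": 80, "headroom": 20}
assert sum(budget_ms.values()) == e2e_p99_ms

print(f"ingress ≈ {ingress_mb_s:.1f} MB/s, storage ≈ {three_year_storage_tb:.1f} TB over 3 years")
```

If you cannot fill in numbers like these for a component, that is the signal that it has been drawn, not designed.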

3. 🎭 The SA Landscape: Five Archetypes

"Solution Architect" is not one job; it is at least five. Be honest about which one you are this quarter โ€” the playbook chapters land differently depending on the answer.

| Archetype | Sits in | Time horizon | Primary deliverable | Compensation model | Key risk |
| --- | --- | --- | --- | --- | --- |
| Pre-sales SA | Vendor, SI, cloud provider | Days–weeks | Demo, RFP response, statement of work | Tied to bookings/quota | Selling solutions you can't deliver |
| Delivery / Engagement SA | SI, consulting, internal program | Months | Reference architecture, ADRs, governance, handoff | Project / utilization | Diagrams that don't survive contact with reality |
| In-house Enterprise SA | Big-co IT, regulated industry | Quarters–years | Domain reference architecture, integration contracts, vendor list | Salary, sometimes bonus | Becoming a process bottleneck |
| Cloud / Platform SA | Cloud or platform vendor | Continuous | Reference architectures, customer reviews, partner enablement | Salary + variable | "Vendor goggles" – every problem solved with your stack |
| Independent / Fractional SA | Boutique or freelance | Days–months | Strategy memo, vendor selection, Phase-0 design | Day rate | Scope creep, no installed credibility, payment risk |

A few non-obvious points:

  • The same person can wear all five hats over a career; the operating model differs sharply. A pre-sales SA who promises a feature wins the deal; a delivery SA who promises that same feature loses the project. Watch your incentives.
  • Cloud-vendor SAs are sometimes called "Solutions Architect" formally but spend ~70% of their time on enablement and reference architectures, not on a single customer's solution end-to-end. Same title, different job.
  • Enterprise SAs in regulated industries (banking, insurance, health, telco) are often part of a governance function with veto power on certain designs. The skill is wielding that veto sparingly.

Cross-archetype constants (every SA does these): write ADRs, run NFR negotiations, design for operability, manage stakeholders, model cost. Everything else varies.


4. 🪜 SA vs TL vs Software Architect vs EA vs CTO

The single most common confusion in the role. Five real adjacent positions:

| Role | Owns | Time horizon | People management | Code authorship | Where they fail |
| --- | --- | --- | --- | --- | --- |
| Tech Lead | One team's delivery and quality | Sprints–quarters | Often dotted-line | High (15–40% of time) | Stays IC, never grows the team |
| Software / Application Architect | One product or system's internal design | Months–year | None | Medium (5–20%) | Becomes "the only one who knows it" |
| Solution Architect | One solution across systems & teams | 3–18 months | None (lateral influence) | Low (<5%, mostly spikes) | Diagrams that don't ship |
| Enterprise Architect (EA) | Enterprise IT landscape, governance, capabilities | 1–5 years | Sometimes | Almost zero | Frameworks > outcomes; "the strategy team that ships nothing" |
| CTO / VP Eng | The whole engineering organization | 6–24 months and beyond | Yes, 5–500 reports | Zero in steady state | Goes too IC or too political |

A useful mental geometry:

  • TL is vertical-narrow (one team, deep on its delivery).
  • Software Architect is vertical-deep (one product, deep on its internal structure).
  • Solution Architect is horizontal – across systems, vendors, teams – for a finite engagement.
  • EA is horizontal-and-permanent – across all of IT, with multi-year governance horizons.
  • CTO is the line manager of the system that produces all of the above.

A few specific clarifications you'll need to make to a stakeholder, probably weekly:

  • "I am a Solution Architect, not a Software Architect โ€” I will not pick the unit-test framework. I will pick the integration contract between system A and B, the data residency boundary, and the build-vs-buy on the search component." โ€” sets scope cleanly.
  • "I am a Solution Architect, not an Enterprise Architect โ€” I am accountable for this solution. I will align with the EA's principles where they exist; I will not author them." โ€” keeps scope from ballooning.
  • "I am not the Tech Lead โ€” I do not own velocity. I own the design and the decision log. The TL owns the burn-down." โ€” keeps you out of standups you shouldn't be in.

The role names vary by company. Validate by responsibilities, not by title. A "Senior Cloud Architect" at one shop is a Pre-sales SA; at another, an in-house Enterprise SA; at a third, a Software Architect with a vendor focus.


5. 🚪 The First 90 Days

You are new to the engagement, the team, the customer, or all three. The first 90 days are almost entirely about earning the right to design. Skip this and you will make a beautiful design that nobody implements.

5.1 The 30-day plan: listen, map, baseline

Goals: Understand the business, the people, the existing landscape, the constraints, and the political wiring. Resist every urge to draw a diagram in week one.

Do:

  • Run 15–25 discovery interviews (see §6). Across business, product, engineering, ops, security, finance, vendors, customers if possible.
  • Build a stakeholder map: who decides, who advises, who is informed, who blocks. Include their concerns and what they consider success.
  • Build a system context map: every system touching this solution, every owner, every integration. This is not a target architecture โ€” it's archaeology.
  • Read the last 6 months of relevant documents: design docs, post-mortems, board updates, audit reports, RFP responses, vendor contracts, incident reports. Most of your design constraints are in those documents already.
  • Identify the 3 burning constraints: cost ceiling, regulatory deadline, key-person dependency, integration that's already on fire, etc. These will dominate the design.
  • Listen for the 3 zombie projects: prior attempts to solve this problem that died. Why? You inherit those carcasses.

Do not:

  • Propose a target architecture. You don't have permission yet.
  • Promise scope. You don't know what's deliverable.
  • Bash an existing system, even if it's bad. The person who built it is in the room.
  • Default to "your" stack. The customer has a stack, a team that runs it, and a budget for it.

Output by day 30: a written Discovery Findings memo (4–8 pages): business problem, current state context map, top 5 NFRs (draft), top 5 risks, top 3 zombie projects, list of unanswered questions, proposed next-30-day plan.

5.2 The 60-day plan: frame the problem, propose the shape

Goals: Get alignment on the problem, the NFRs, and the shape of the solution. Still no detailed design. The question to answer is not "what should we build?" but "what are we trying to be true at the end of this?"

Do:

  • Run an NFR workshop with the right stakeholders (see §9). Output: a signed-off NFR register with quantified targets and acceptance criteria.
  • Produce a Solution Vision doc (3–5 pages): the future state in plain English, the 3–5 architectural principles you propose to follow, the major shape (monolith vs distributed, sync vs async, on-prem vs cloud), and the top 3 strategic options at a high level (e.g., Option A: Build in-house on AWS, Option B: Buy SaaS X, Option C: Hybrid).
  • Run a risk workshop to surface the top 10 risks and their owners. Compliance, legal, vendor, key-person, technical, schedule.
  • Validate the cost ceiling with finance/CFO/Procurement: not "how much will it cost," but "what's the budget you've actually approved."

Output by day 60: a Solution Vision doc and a signed NFR register. Stakeholders should be able to repeat the problem and the principles in their own words. If they can't, you haven't done the work yet.

5.3 The 90-day plan: design, gate, and start delivery

Goals: Produce the reference architecture, the major ADRs, the cost model, the migration plan (if applicable), and hand off to delivery. Run the first design-review gate.

Do:

  • Produce the Reference Architecture: C4 Levels 1–3 (see §8), the major data flows, the integration contracts, the deployment topology. With NFR mapping (which component delivers which NFR target).
  • Produce the first 5–10 ADRs: cloud provider, identity, primary data store, integration backbone, compute model, observability stack, secrets, multi-tenancy boundary. (Trim to what your solution actually needs.)
  • Produce the TCO model (see §15): year 1, year 3, sensitivities. Cross-check against the budget.
  • Run the architecture review with the steering committee, security, compliance, and the EA. Capture decisions and dissent.
  • Hand off to the delivery TLs and PMs with a written delivery plan and the first sprint scope.

Output by day 90: the Solution Design Pack – Vision, NFRs, Ref Arch, ADR set, Risk Register, TCO. This is what you'll be measured against for the next 6–18 months.

A common mistake: trying to "complete" the design at day 90. You won't. The design will keep evolving as delivery exposes assumptions. The day-90 design is the design that's good enough to start. Plan for at least three major design review gates ahead.

5.4 The 90-day mistakes to avoid

  • Premature toolchain commitment. "We'll use Kafka." Until you know the data velocity, the team's Kafka skill, the cost, the integration mode, and whether managed Kafka exists in this region, that's a guess. Defer.
  • Saying yes to every interview. You'll burn 90 days in meetings. Prioritize the 25 highest-signal interviews; the rest go in a survey.
  • Skipping the EA. If there's an Enterprise Architect, brief them in week 1, before you produce anything. Their good will saves quarters.
  • Skipping security. Same. Bring them in early; they'll be your first reviewer or your last blocker. Choose.
  • Skipping finance. The cheapest way to discover the budget is to ask. The most expensive way is to design first.

6. 🔍 Discovery: The Real Job Begins Here

Discovery is not a phase you finish; it's the foundation that quietly determines whether the design is right. Most failed solutions are failures of discovery, not of design. You designed a great solution to the wrong problem.

6.1 The five layers of discovery

You have to surface all five. Skipping any will haunt you.

| Layer | What you're trying to learn | Asked of |
| --- | --- | --- |
| Business | Why this solution, what outcomes, what dollar value, what deadline | Sponsor, business owner, CFO |
| User / Customer | Who uses this, how, when, what's painful, what does success feel like | Product, end users, support |
| Functional | The capabilities the solution must provide | Product, BAs, domain experts |
| Non-functional | The quality attributes (perf, availability, cost ceiling, security, compliance) | Ops, security, compliance, finance |
| Constraint | What the customer already has, can run, will allow, can pay | All of the above + procurement, legal, vendor management |

A solution that ships is one where the constraint layer was discovered first. Most SAs discover it last – usually the day before architecture review, when procurement says "we don't have a contract with that vendor and won't get one in your timeline."

6.2 The Five Whys, applied to solution design

When a stakeholder hands you a "requirement," it is almost always a solution they already chose, not the actual requirement. Apply the Five Whys.

Stakeholder: "We need a real-time dashboard."
SA: "Why?"
"So executives can see the funnel."
SA: "Why does that need real-time?"
"Well, end-of-day is fine, but the current system is two days behind."
SA: "If we made it next-day reliable, would that solve the problem?"
"Yes, that's actually fine."

You just saved $200k of streaming infra and 4 months. Do this on every requirement. Real-time, high-availability, multi-region, full-mesh, blockchain – these are almost always pre-baked solutions. Find the underlying need.

6.3 The discovery interview: a script

Each interview is 45–60 minutes. Always one note-taker (you, or a co-architect) so eye contact is preserved.

  1. Their context (5 min): role, team, what they own, how long they've been in the seat.
  2. Their world today (15 min): "Walk me through a typical week. What's working, what's broken, what wakes you up?" Listen for the language they use – that's the language to use back.
  3. Their wishlist (10 min): "If I could give you three things tomorrow, what would they be?" Distinguish wish from need.
  4. Their constraints (15 min): "What can't change? What's off-limits? What would your boss kill?" – these are the irreversible boundaries.
  5. Their concerns (10 min): "What's the most likely way this project goes wrong?" – the most undervalued question. Their answer is your risk register, free.
  6. Wrap (5 min): summarize back, ask "did I get that right?", ask "who else should I talk to?", thank, schedule follow-up if needed.

Anti-patterns:

  • Leading with technology. "Are you on AWS or Azure?" – you're hiring, not researching. Save for the constraint interview.
  • Selling. You're not pitching yet. Asking and listening is the entire job for now.
  • Note-light. Memory degrades by 50% in 24 hours. Type or transcribe; review same-day.

6.4 The context map โ€” your most reused artifact

A context map is a one-page diagram of every system, every team, every integration, every data flow that touches this solution today, with arrows labeled. Not a target architecture; not beautiful; exhaustive.

This single artifact will be the most-photographed page of every meeting you run for the next 6 months. Conventions:

  • Every box has an owner (team or person).
  • Every arrow has a protocol (REST, gRPC, file drop, JDBC, message queue) and a frequency.
  • Every system has a "stability" tag: green (stable), yellow (planned change), red (deprecating, on fire, or unowned).
  • Every external system has a vendor name and contract status.

If you can produce a high-quality context map and the stakeholders argue with it, you've already done your job – you've surfaced their misalignment about what they have today. Half of "design problems" are actually "we don't agree on the current state."

6.5 The unspoken constraints

The constraints stakeholders don't say are usually the ones that kill the project.

  • Vendor relationships. "We can't use AWS โ€” the CIO had a fight with their AE in 2024." (True story.)
  • Data residency. "Our German customers' data cannot leave the EU." Often only spoken when the contract review starts.
  • Internal politics. "The data team will block any solution that has its own database." Unstated until day 60.
  • Off-the-record commitments. "We promised the regulator we'd be on-prem until 2027." Lives in someone's email, not the wiki.
  • Headcount realities. "We will lose half the platform team in Q3 to the new product." Spoken only at the leaving drinks.

You discover these by asking specifically: "What are the things the org has decided that aren't written down?" "What does the CFO/CIO/CISO refuse to do?" "Who is leaving in the next year?" Ask once per interview, in the constraints block. Some you'll only learn by being around for 60+ days.

6.6 The discovery output

A 4–8 page memo with these sections, every time:

  1. Problem statement (1 paragraph). The business outcome, not the technology.
  2. Stakeholders (table). Who decides, advises, blocks, is informed.
  3. Current state (1 page + context map). What's running today.
  4. Top 5 NFR drafts (table with quantified targets). Subject to §9.
  5. Top 10 risks (table). With owners.
  6. Open questions (list). With dates by which they must be answered.
  7. Recommended next steps (numbered list).

Send it. Get reactions. Iterate. Do not design the solution before this memo is signed off. If you do, you'll design the wrong solution.


7. 📐 Solution Design Methodology

You have the discovery in hand. Now you design. The disciplined SA does not start in Visio; they start in a structured methodology that compresses what we know into what we're choosing.

7.1 RAPID-S, adapted for solutions

The system-design interview framework adapts well to real solutions. Six phases, in order:

  1. R – Requirements: functional + non-functional + constraints. Already done in discovery; reformulate as a one-pager.
  2. A – API / Interface contracts: what does this solution expose, to whom, with what guarantees. Public APIs, integration contracts, event schemas.
  3. P – Persistence model: data ownership, schema sketch, retention, residency. Not the table schema – the boundaries of data.
  4. I – Infrastructure: compute model, deployment topology, network, identity, observability stack.
  5. D – Decisions: ADRs for the irreversible 5–10 choices. The lasting artifact.
  6. S – Scaling, security, sustainability: the NFR enforcement plan. How the solution holds at 10× load, an attempted breach, and 3 years from now.

Walk it in this order. RA-first, not I-first. The most common mistake is jumping to I (the cloud diagram) before R is signed off – you end up architecting to the wrong NFR class.

7.2 The two designs โ€” current vs target โ€” and the gap

Every design is really three documents in one:

  • Current state architecture (CSA): what's running today.
  • Target state architecture (TSA): where we want to be.
  • Transition architecture(s): the intermediate states that are themselves runnable.

A common mistake: drawing only the TSA. The TSA is hypothetical until the transition is designed. Most projects fail in the transition, not in the target. The transition has to be runnable: every milestone is a live, supported, monitored state.

For migration-heavy work, draw at least 3 transition architectures, not 1. (See §17.)

7.3 The principles set: the design constitution

Before drawing a single box, write 5–7 principles the solution will follow. These are explicit value choices the team can cite during inevitable arguments. Examples:

  • "Buy before build, unless build is a clear strategic differentiator."
  • "Every service is owned by exactly one team."
  • "All data classified as PII is encrypted at rest with a customer-managed key."
  • "Synchronous calls only between services in the same trust boundary; cross-boundary is async."
  • "Single primary cloud (AWS); secondary cloud only for DR or specific regulated workloads."
  • "Every public API is versioned and documented in OpenAPI before code is written."
  • "Observability stack is shared; teams do not roll their own."

Principles are most useful when they cost something. "Be secure" is not a principle, it's a wish. "Customer-managed keys for all PII" is a principle – it costs latency, complexity, and budget. That's why it's load-bearing.

7.4 The strategic options analysis (SOA)

Before committing to an architecture, write 2–4 strategic options and analyze each. Don't compare 8 – analysis paralysis. Don't compare 1 – that's a recommendation, not analysis. Three is usually right.

| Option | Description | Pros | Cons | Cost (Y1 / Y3) | Risk | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| A | Build in-house on AWS | Full control, integrates with rest of stack | 9-month build, hire 4 engineers | $1.2M / $2.4M | Hiring market | Default |
| B | Buy SaaS (Vendor X) | 6 weeks to live, vendor handles ops | Lock-in, integration cost, $400k/yr forever | $0.5M / $1.5M | Vendor risk | Recommended |
| C | Hybrid – buy core, build edges | Best of both | Two teams to manage, integration complexity | $0.9M / $2.1M | Coordination | Acceptable backup |

This is a steering-committee artifact. It compresses 200 pages of analysis into one defensible recommendation. Commit to one option in the SOA, with rationale. Wishy-washy "any could work" outputs get re-debated for months.

7.5 The "shape before the boxes" principle

A design has a shape before it has components. Decide the shape first:

  • Topology: monolith, modular monolith, microservices, mesh, micro-frontends, event-driven, batch.
  • Data flow: request/response, fan-out, pipeline, lake.
  • State: stateless services + data tier, stateful services with replication, ephemeral compute.
  • Multi-tenancy: shared everything, shared infra-isolated data, per-tenant deployment.
  • Failure model: graceful degradation, circuit breaker, retry, fallback to cache, fail fast.

Decide these before the cloud diagram. The cloud diagram is the implementation of the shape; many cloud diagrams can render the same shape; many shapes can be incompatible with the same NFRs. Get the shape right – the rest is wiring.


8. 🗂️ Documenting a Solution: C4, ADRs, arc42

Three documentation tools cover 90% of SA work. Use them. Stop using "shapes in PowerPoint."

8.1 The C4 Model (Simon Brown)

A hierarchy of architecture diagrams that scales from "show this to a CFO" to "show this to a developer." Four levels:

| Level | Audience | What it shows | Example |
| --- | --- | --- | --- |
| L1 – System Context | Non-technical stakeholders, exec, customer | The system as one box, with users and external systems around it | "Order System receives orders from Web/Mobile, queries Inventory and CRM, sends to Fulfillment" |
| L2 – Container | Architects, leads, sec, ops | Internal containers (apps, databases, queues) inside the system box | "API service, worker, Postgres, Redis, S3" |
| L3 – Component | Engineers, designers | Components inside one container | "OrderController → OrderService → OrderRepository" |
| L4 – Code | Engineers (rarely) | Class diagrams (mostly auto-generated) | Skip in 99% of cases |

For a typical solution: produce L1 always, L2 always, L3 for the 2–3 most novel containers, L4 never. Tooling: Structurizr, draw.io, Excalidraw, Mermaid (inline in Markdown – composes with ADRs beautifully).

A common SA failure: starting at L2 with a 40-box diagram and never producing L1. Without L1 the CFO has no idea what they're funding. Always L1 first.

8.2 Architecture Decision Records (ADRs)

The single most important document genre in solution architecture. An ADR captures one decision, the alternatives, the rationale, and the consequences. Format (Michael Nygard variant, lightly extended for SA use):

# ADR-0007: Use AWS Aurora PostgreSQL for the OLTP store

Date: 2026-05-06
Status: Accepted
Reversibility: One-way (data migration is expensive)
Context owners: SA, Data Lead, Platform Lead

## Context
We need a primary OLTP store for order, inventory, and customer data, sized for 5,000 RPS peak, sub-50ms p99 reads, RPO ≤ 5min, RTO ≤ 1hr, single region with read replicas, encryption at rest with CMK, regional residency in eu-west-1.

## Decision
Use Amazon Aurora PostgreSQL 16, multi-AZ, with two read replicas, snapshot every 6 hours.

## Alternatives considered
- Self-managed PostgreSQL on EC2: rejected – operational cost, no team capacity for tuning.
- Amazon RDS PostgreSQL: viable, but Aurora's storage model gives better failover characteristics for our RTO target.
- DynamoDB: rejected – relational schema, ad-hoc joins required for the order workflow, would force redesign.
- CockroachDB: rejected – multi-region not yet a requirement, adds operational burden.

## Consequences
+ Managed, in-region, meets RPO/RTO.
+ Familiar SQL surface for the team.
+ Encryption with CMK supported natively.
- Vendor lock-in to AWS (mitigated by standard PostgreSQL surface).
- Cost: ~$8k/month at the targeted size (see TCO doc §3).

## Compliance and security notes
- CMK in KMS, rotated annually.
- IAM authentication enabled; no static passwords.
- Audit logging to S3 → CloudWatch → SIEM, retained 7 years per policy P-23.

## Open follow-ups
- Validate read-replica lag under failover (load test before go-live).
- Decide PITR window with Compliance team.

Rules of ADR hygiene that compound over years:

  • Numbered, never deleted. ADR-0007-aurora.md. If a decision is reversed, write ADR-0023: Reverse ADR-0007 – switch to RDS for cost reasons. Append history. Never rewrite.
  • One decision per ADR. Two decisions = two ADRs. Otherwise the rationale becomes mush.
  • Reversibility tag. Forces honesty.
  • Alternatives section is mandatory. A decision without alternatives is a preference. Always list ≥2.
  • Consequences are signed. A consequence labeled "we accept higher latency for cross-region reads" is a contract โ€” surface it during review.
  • Stored with the code. docs/adr/0001-cloud-provider.md in the repo, not buried in Confluence. Engineers read code; they only sometimes read Confluence.

A solution with 25–60 well-maintained ADRs is unkillable – its decisions can be defended, audited, and evolved. A solution with 200 PowerPoint slides and zero ADRs is unmaintainable – when anyone leaves, the rationale is lost and the design starts decaying.

8.3 arc42

A 12-section architecture documentation template. Use it as the table of contents for your Solution Design Pack (§5.3). Sections (lightly summarized):

  1. Introduction & Goals
  2. Constraints
  3. Context & Scope (= C4 L1)
  4. Solution Strategy (= the principles, the SOA recommendation)
  5. Building Block View (= C4 L2/L3)
  6. Runtime View (sequence diagrams for key flows)
  7. Deployment View (the actual cloud topology)
  8. Cross-cutting Concepts (security, observability, resilience patterns)
  9. Architecture Decisions (link to ADRs)
  10. Quality Requirements (= NFRs, see §9)
  11. Risks and Technical Debt (= risk register)
  12. Glossary

You don't need every section every time, but having a consistent ToC across solutions removes a class of "where do I look?" overhead for everyone downstream. Pair arc42 with C4 for diagrams and ADRs for decisions, and you have a complete kit.

8.4 Documentation that ages

The hardest discipline in SA documentation is keeping it alive. Three rules that make the difference:

  1. Source-of-truth in the repo. Markdown, diagrams in Mermaid/Structurizr, ADRs as files. PR reviews catch drift; Confluence hides it.
  2. Reviewed at gates. Every steering committee, every release, every quarter – pop the relevant doc, ask the team "is this still true?" If not, fix it now.
  3. Owned by name. Each doc lists an owner. When the owner leaves the project, ownership transfers in writing. Otherwise the doc dies the day they leave.

9. 🎯 Non-Functional Requirements: The Real Job

If you take one section away, take this one. Most SA failures aren't bad designs – they're sloppy or missing NFRs. The contract between business and technology lives in this section.

9.1 The eight NFR classes

Every solution has targets in eight classes. Make them explicit, quantified, and acceptance-tested.

| Class | What to specify | Example |
| --- | --- | --- |
| Performance | Latency p50/p95/p99, throughput, cold-start | "p99 ≤ 400ms at 5,000 RPS, p99 cold-start ≤ 2s" |
| Availability | Uptime SLO, error budget, planned downtime | "99.95% per calendar month, ≤ 4hr planned/yr" |
| Reliability / Resilience | RPO, RTO, max tolerated dependency outage | "RPO ≤ 5min, RTO ≤ 1hr, survive single AZ loss" |
| Scalability | Peak load, growth runway, scale type | "10× burst, 3-year runway, horizontal-only" |
| Security | Threat model, controls, IAM model, encryption | "STRIDE-reviewed, CMK at rest, MFA admin" |
| Compliance | Frameworks, audit obligations, data classes | "SOC 2 Type II, GDPR, HIPAA-eligible, PCI out of scope" |
| Cost | Y1/Y3 ceiling, $/transaction, cost-per-tenant | "≤ $80k/mo Y1, $0.04/order, scale linearly to $200k/mo at 10×" |
| Operability | Monitoring, on-call expectations, runbook coverage | "Every critical path observed; on-call rotation; ≤ 30min p99 MTTD" |

Add as needed: usability, accessibility (WCAG 2.2 AA), localization, internationalization, sustainability (kgCO2e/req), data quality.

9.2 The NFR negotiation

Every NFR target costs something. Every target in the NFR register maps directly to a line in the cost model. The negotiation is not "what do we need," it's "what are we willing to pay for."

Examples of the cost curve:

  • 99.9% → 99.95% availability: roughly 2× infra cost (multi-AZ active-active, replicated state, faster failover). Plus on-call maturity.
  • p99 ≤ 200ms → p99 ≤ 50ms: usually a fundamental architecture change (cache layer, edge compute, denormalization). Sometimes 5×.
  • RPO 5min → RPO 0: synchronous replication, multi-region writes, conflict resolution, latency hit. Often the hardest NFR.
  • Multi-region active-active: 2–3× infra cost, 5–10× design complexity. Don't accept it without an explicit business case.

Run an NFR workshop during the 30–60 day window. Whiteboard. Each line: target / cost / acceptance test. Force the business owner to commit to the target with the cost on the table. Sign the page. Photograph it. That's the contract.
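
The cost multipliers above are judgment calls, but the downtime arithmetic behind an availability target is exact and worth having on the whiteboard during the negotiation. A minimal sketch:

```python
# Downtime budget implied by an availability SLO, over a 30-day month.
def monthly_downtime_minutes(slo: float, days: int = 30) -> float:
    return (1 - slo) * days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.4%} -> {monthly_downtime_minutes(slo):.1f} min/month of downtime budget")

# 99.9000% -> 43.2 min/month
# 99.9500% -> 21.6 min/month
# 99.9900% -> 4.3 min/month
```

Halving the budget from roughly 43 to roughly 22 minutes a month is what the business is actually buying when it asks for the extra "5"; that framing tends to shorten the 2× infra cost discussion considerably.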

9.3 NFR acceptance tests

An NFR target without an acceptance test is a wish. For every quantified target, write how you will verify it.

| NFR | Target | Acceptance test |
| --- | --- | --- |
| Latency | p99 ≤ 400ms at 5,000 RPS | k6 load test, soak 1hr, p99 from server-side metrics |
| Availability | 99.95%/month | SLO measured by SLI = (success/total) over 30d trailing |
| RPO | ≤ 5min | DR drill quarterly; restore from backup measured within RPO |
| Cost | ≤ $80k/mo | FinOps weekly tag-based report; alert at 80% threshold |
| Security | STRIDE-passed | Threat model reviewed by security pre go-live; pen-test pre-prod |
| Compliance | SOC 2 Type II | External auditor, annual; controls evidenced in GRC tool |

If you can't write an acceptance test, you don't have a real NFR. Refuse vague NFRs ("highly available", "fast", "secure") until they're quantified.
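
Acceptance tests are most useful when they run automatically. Below is a minimal sketch of a latency-SLO gate: it computes p99 from server-side samples and fails the pipeline if the target from the NFR register is missed. The file format and threshold are assumptions; in practice you would wire it to the summary output of your load-test tool (k6, Gatling, etc.).

```python
# Minimal latency-SLO gate: fail the build if observed p99 misses the NFR target.
import json
import math
import sys

P99_TARGET_MS = 400  # from the signed NFR register (assumed value)

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of per-request latencies."""
    s = sorted(samples_ms)
    return s[math.ceil(0.99 * len(s)) - 1]

def main(path: str) -> int:
    with open(path) as f:
        samples = json.load(f)   # expected: a JSON array of per-request latencies in milliseconds
    observed = p99(samples)
    print(f"p99 = {observed:.1f} ms (target ≤ {P99_TARGET_MS} ms)")
    return 0 if observed <= P99_TARGET_MS else 1

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```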

9.4 NFR mapping to components

For each NFR, identify which components in the architecture deliver it. This map should be in the Reference Architecture doc.

Availability 99.95% – delivered by:
  - Multi-AZ Aurora (primary + replicas)
  - ALB across 2 AZs
  - ECS Fargate with min 2 tasks per AZ
  - DNS failover (Route 53 health checks)
  - Runbook RB-007 (db failover) drilled quarterly

When a stakeholder questions "are we sure we hit 99.95%?", you point to the map. When the on-call engineer asks "why is everything in multi-AZ?", you point to the map. When the CFO asks "why are we spending 2× on infra?", you point to the map.

9.5 The NFR-to-architecture pressure test

Before the architecture review, take each NFR and stress-test:

  • "What if we 10ร—'d the latency target?" โ€” is that just a knob, or a redesign?
  • "What if compliance moved from SOC 2 to FedRAMP Moderate?" โ€” fundamental redesign or incremental?
  • "What if cost dropped 50%?" โ€” what would we cut?
  • "What if availability moved from 99.95% to 99.5%?" โ€” what could we simplify?

If a small NFR change forces a fundamental redesign, you've got an architecture that's brittle to its NFRs. Flag this as a risk and consider a more flexible shape.


10. ☁️ Cloud Architecture (AWS, Azure, GCP, Multi)

The default substrate for solution architecture today is the cloud. You will design for at least one and increasingly for more than one. Six things to get right.

10.1 The cloud-provider choice (one-way door)

The single most consequential ADR you'll write on most solutions. Drivers, in roughly this order:

  1. What the customer already runs. Skill, contracts, operating model. A 5-year AWS shop is rarely best served switching.
  2. Regulatory residency. Some regions are only on some clouds. Some governments only certify some clouds.
  3. Native services that matter. BigQuery is in GCP. Active Directory and Microsoft 365 integration favor Azure. SageMaker, EKS-with-Fargate, deep AI/ML breadth favor AWS.
  4. Pricing posture. Reserved instance / commitment discounts you've already negotiated.
  5. Specific service maturity. Vector DB, identity-aware proxy, managed Kubernetes, edge compute, etc.

Multi-cloud as default = mistake. Cost doubles, ops complexity quadruples, the team gets shallow on both. Multi-cloud for specific reasons (DR for a single critical workload, regulatory mandate, cost arbitrage on egress, vendor avoidance) – fine. Decide deliberately.

10.2 The Well-Architected lens

Each major cloud publishes a Well-Architected Framework (AWS WAF, Azure WAF, GCP Architecture Framework). They're surprisingly good. Six pillars (with cross-cloud equivalents):

  1. Operational Excellence – runbooks, IaC, observability, change management.
  2. Security – IAM, encryption, network segmentation, secrets, audit.
  3. Reliability – failure modes, recovery, multi-AZ/region, capacity headroom.
  4. Performance Efficiency – sizing, latency, scaling, hot spots.
  5. Cost Optimization – sizing, reservations, lifecycle, FinOps.
  6. Sustainability – efficiency, region selection, lifecycle.

Run a Well-Architected review at the design milestone, mid-delivery, and pre-go-live. Most cloud vendors will run one for free if you're a meaningful spender – take them up on it.

10.3 Landing zone and shared platform

A landing zone is the foundation: account/subscription structure, network, identity, logging, billing, baseline security. Don't reinvent it; use the vendor's reference (AWS Control Tower, Azure Landing Zones, GCP Cloud Foundation). For solution architects, two things matter:

  • Don't be the one designing the landing zone for a single solution. It's a multi-solution foundation. Coordinate with the platform team / EA. If there is no landing zone, raise it as a project-level risk.
  • Inherit, don't fight. If the landing zone forces a tagging schema, IAM boundary, or network topology, work within it. Solutions that fight the landing zone get vetoed.

10.4 Compute model

The default decision tree, in order of preference:

  1. Managed serverless (Lambda/Functions/Cloud Run) – cheap, simple, scales to zero. Default for low-to-medium load, event-driven, async workloads. Limits: cold starts, runtime, vendor lock surface.
  2. Managed containers (ECS Fargate, AKS, GKE Autopilot, Cloud Run) – solid middle ground. Reasonable lock-in if you stick to Kubernetes.
  3. Self-managed Kubernetes (EKS, AKS, GKE classic) – only if you have the team. Yes, "we'll learn it" is a lie when the team is 6 people.
  4. VMs – only when there's a specific reason (license, kernel module, vendor support).

Anti-pattern: defaulting to Kubernetes. Kubernetes is a power tool. It's correct when you have ≥10 services, a platform team, and stable deployment patterns. It's wrong on day 1 of a 4-service product with no platform team – Cloud Run / Fargate / Container Apps win there.

10.5 Network and identity

Two areas SAs underestimate, and that auditors and incidents both punish.

  • Network: VPC layout, subnetting, peering, transit gateway / hub-spoke, private endpoints, egress control. Egress is the blind spot – most data exfiltration paths are egress-shaped, and egress is also a major cost line.
  • Identity: workload identity (instance profiles, managed identities, workload identity federation) over static keys, every time. Human identity through SSO/IdP only – no shared admin accounts. Service-to-service: short-lived tokens, mTLS, or workload identity. Never use long-lived credentials in production.

A solution that gets identity right almost always gets the security review on the first pass. A solution that gets identity wrong almost always gets blocked in week 2.
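
As a small illustration of the workload-identity point, the difference shows up directly in code. A hedged sketch with the AWS SDK for Python (the bucket listing is just a stand-in call; the same idea applies to managed identities on Azure and workload identity on GCP):

```python
# Workload identity over static keys: let the SDK resolve credentials from the runtime
# (instance profile, ECS task role, or IRSA on EKS) instead of embedding secrets.
import boto3

# Good: no credentials in code or config. The default provider chain picks up the role
# attached to the workload, and the resulting tokens are short-lived and rotated for you.
s3 = boto3.client("s3", region_name="eu-west-1")

# Bad (don't do this): long-lived keys pasted into code, env files, or CI variables.
# s3 = boto3.client("s3", aws_access_key_id="AKIA...", aws_secret_access_key="...")

for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```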

10.6 Multi-cloud, hybrid, and edge

  • Multi-cloud for a single workload: rarely correct, almost never worth the operational cost. Exception: regulated workloads or strategic vendor avoidance.
  • Multi-cloud at the portfolio level: common in enterprises (CRM in one, data lake in another). Solution architect for one solution still picks one cloud; the EA owns the portfolio.
  • Hybrid (cloud + on-prem): legitimate for legacy + regulated systems. Design the boundary carefully – direct connect, identity federation, data sync.
  • Edge / point-of-sale / IoT: a different design – intermittent connectivity, local data, conflict resolution, OTA updates. Bring an edge specialist; this is its own discipline.

11. 🔌 Integration Architecture

Where systems meet is where projects fail. Integration is the most underestimated portion of a solution, typically by a factor of 2–3×. Spend disproportionate time here.

11.1 Integration styles, picked deliberately

| Style | Best for | Avoid when |
| --- | --- | --- |
| Synchronous REST / gRPC | Request/response, low latency, strong contract | High-fanout, long-running, brittle dependencies |
| Asynchronous events (pub/sub, Kafka, EventBridge, Service Bus) | Decoupling, fan-out, audit trail, replay | Strict ordering across topics, instant consistency required |
| Message queues (SQS, RabbitMQ) | Worker pools, retries, backpressure | Pub/sub patterns (use a topic) |
| Batch / file drop | Legacy, bulk, regulatory data exchange | Real-time needs |
| Database integration (shared DB) | Almost never | Almost always – coupling at the data layer is the worst kind |
| API gateway aggregation | BFF for mobile/web | Backend-to-backend (just call directly) |
| Webhooks | Outbound notifications to partners | Internal – too brittle for retries/auth |
| CDC (change data capture) | Replicating data without writing client code | Real-time business logic – events are better |

Default rule: synchronous within a service boundary, asynchronous across service boundaries. Async-everywhere is over-engineering; sync-everywhere is brittle.

11.2 Contracts: the integration's NFRs

Every integration is a contract. Document it explicitly:

  • Schema: OpenAPI / AsyncAPI / Protobuf. Versioned. Stored in a shared registry.
  • Compatibility policy: backward-compatible always; breaking changes go through a deprecation window.
  • SLA: latency, availability, error rate. Both sides sign.
  • Auth: OAuth/OIDC scope, mTLS cert, service account. Documented.
  • Idempotency: are repeated calls safe? With what key?
  • Retry policy: exponential backoff, max attempts, jitter, dead-letter destination.
  • Rate limits: documented; both sides aware.
  • Failure semantics: what do consumers see when this is down? Cached? Errored? Skipped?

A common failure: each team having their own opinion of the contract. The SA's job is to make the contract canonical, schema-checked, and version-controlled. Everything else flows from that.

11.3 Patterns for unreliable upstreams

You will integrate with a system that breaks more often than yours can tolerate. Apply patterns:

  • Circuit breaker: stop calling a degraded service after a threshold; back off.
  • Bulkhead: isolate threadpools/connections per upstream so one slow upstream doesn't drag the rest.
  • Retry with backoff + jitter: idempotent calls only.
  • Timeout, always: no unbounded calls, ever. Set p99-budget-aware timeouts.
  • Cache with TTL (or stale-while-revalidate): tolerate brief upstream outages with served-stale.
  • Dead-letter queue + alarm: failed messages go somewhere you can replay them.
  • Compensating transaction (Saga): for distributed flows that can't be a single transaction.

Each pattern has a cost (latency, complexity, eventual consistency). Apply them where the upstream merits, not by default.
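
A minimal sketch of three of these patterns combined – a bounded timeout, retry with exponential backoff and full jitter, and a coarse in-process circuit breaker. The thresholds and the module-level state are illustrative assumptions; a production implementation would normally lean on a maintained resilience library and per-upstream configuration.

```python
# Timeout + retry-with-jitter + coarse circuit breaker for an unreliable upstream (sketch).
import random
import time
import requests

FAILURE_THRESHOLD = 5      # consecutive failures before the breaker opens (assumed)
OPEN_SECONDS = 30          # how long to fail fast once the breaker is open (assumed)
_failures, _opened_at = 0, 0.0

def call_upstream(url: str, attempts: int = 3, timeout_s: float = 2.0) -> dict:
    global _failures, _opened_at
    if _failures >= FAILURE_THRESHOLD and time.monotonic() - _opened_at < OPEN_SECONDS:
        raise RuntimeError("circuit open: upstream presumed down, failing fast")

    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout_s)   # never an unbounded call
            resp.raise_for_status()
            _failures = 0                                  # success closes the breaker
            return resp.json()
        except requests.RequestException:
            _failures += 1
            if _failures >= FAILURE_THRESHOLD:
                _opened_at = time.monotonic()
            if attempt == attempts - 1:
                raise
            # exponential backoff with full jitter; only safe because the call is idempotent
            time.sleep(random.uniform(0, 0.2 * (2 ** attempt)))
```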

11.4 The data contract

Increasingly the most under-defined part of integrations. Data contract = schema + semantics + freshness + ownership + retention + classification.

Examples:

  • "The customers.id field is a UUID v4 owned by the CRM team. Never mutated. Mapped to legacy cust_no only at the boundary."
  • "The orders topic is at-least-once with idempotency key order_id. Schema in registry. Compatibility: backward-compatible. Retention: 7 days for replay."
  • "The pii fields in the events stream are tokenized at source; raw values only available via the Identity Service with audit-logged lookup."

Without explicit data contracts, integrations rot. Every addition has to ask "is this safe?" and the answer is folklore. With them, the answer is in the registry.
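
One way to keep a data contract from rotting is to make it code that lives next to the event definition and is published to the schema registry from CI. A minimal sketch (the field names, policies, and registry workflow are illustrative assumptions, not a prescribed format):

```python
# A data contract made explicit in code: schema + semantics + ownership + retention.
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID

CONTRACT = {
    "topic": "orders",
    "owner": "order-team",
    "delivery": "at-least-once",     # consumers must deduplicate on order_id
    "compatibility": "backward",     # breaking changes require a new major version
    "retention_days": 7,             # replay window
    "pii_fields": [],                # PII is tokenized at source per policy
}

@dataclass(frozen=True)
class OrderPlacedV1:
    order_id: UUID        # idempotency key
    customer_id: UUID     # owned by the CRM team; never mutated here
    total_cents: int
    currency: str         # ISO 4217 code
    placed_at: datetime   # UTC event time, not ingestion time
```

With the contract in the repository, "is this change safe?" becomes a CI check against the registry instead of folklore.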

11.5 Integration platforms (iPaaS) and ESBs

Be honest:

  • iPaaS (Workato, Mulesoft, Boomi, Azure Logic Apps, AWS AppFlow, Tray) shines for citizen-developer style integrations, SaaS-to-SaaS, low-volume, low-business-criticality. Bad for high-volume, transactional, latency-sensitive, programmable workflows.
  • ESB is largely a legacy term. If your customer has one, you'll work with it; if they don't, don't introduce one.

Default to direct event/REST integration with a registry. Reach for iPaaS for SaaS-stitching, not for the core path.


12. 🗄️ Data & AI Architecture

Data is half of every solution; AI is increasingly half of every data solution. Three sub-architectures matter: operational data, analytical data, and AI/ML.

12.1 The operational data plane

The OLTP store(s) for the solution. Decisions:

  • Polyglot persistence vs single store. Default to a single primary store unless the access pattern demands otherwise. PostgreSQL handles 80% of cases (relational, JSONB, full-text, geo, vector with pgvector). DynamoDB handles single-digit-ms key-value at scale. Specialized stores (Redis for cache, Elastic/OpenSearch for search, time-series DB for metrics) bolted on as needed.
  • Schema ownership. One team owns the schema. No two teams write to the same table. Cross-team reads via API or replicated views.
  • Migrations. Online, backward-compatible, expand/contract (add → backfill → switch read → switch write → remove); a sketch follows this list. Documented in ADRs.
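A minimal sketch of the expand/contract migration above, assuming PostgreSQL; the table, column names, and batching parameters are made up for illustration.

```python
# Expand/contract migration expressed as ordered steps (PostgreSQL flavour).
# Each step is backward-compatible on its own, so old and new application code
# can run side by side between deploys. Names are illustrative.

EXPAND_CONTRACT_STEPS = [
    # 1. add: a new nullable column; old writers are unaffected
    "ALTER TABLE customers ADD COLUMN email_normalized text;",
    # 2. backfill: in small id-range batches, off-peak, to avoid long locks
    """UPDATE customers
          SET email_normalized = lower(email)
        WHERE email_normalized IS NULL
          AND id BETWEEN %(lo)s AND %(hi)s;""",
    # 3. switch read: deploy code that reads the new column, falling back to old
    # 4. switch write: deploy code that writes both, then only the new column
    # 5. remove: only once no reader or writer touches the old column
    "ALTER TABLE customers DROP COLUMN email;",
]
```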

12.2 The analytical data plane

Where reporting, dashboards, ML training, and ad-hoc analysis live. The current default stack:

  • Lakehouse (S3/ADLS/GCS + Delta Lake / Iceberg / Hudi) as the storage substrate.
  • Warehouse (Snowflake / BigQuery / Redshift / Databricks SQL) on top, or as the primary for many use cases.
  • Streaming (Kafka / Kinesis / Pub-Sub) for real-time pipelines.
  • dbt as the SQL transformation backbone.
  • Reverse-ETL (Hightouch / Census) to push warehouse data back to operational SaaS tools.

The SA's job is not to design the entire data platform โ€” that's a Data Architect's job. Your job is to:

  1. Decide what data the operational solution emits (events, CDC, snapshots) and at what cadence.
  2. Decide what data the operational solution consumes from the warehouse and how (reverse-ETL, scheduled fetch).
  3. Negotiate data contracts at the boundary (see §11.4).
  4. Ensure PII / regulated data is handled per policy on both sides of the boundary.

12.3 AI / ML in the solution

Today, almost every solution has an AI component. Four patterns dominate:

| Pattern | When to use | Build cost | Operational cost |
| --- | --- | --- | --- |
| LLM API call (OpenAI, Anthropic, Google) | Most NL / generation tasks | Low | Per-token, predictable |
| RAG (Retrieval-Augmented Generation) | Q&A over private content, customer support | Medium | Per-token + vector DB |
| Fine-tuned / hosted small model | Domain-specific NLP at scale, latency-sensitive, data-sovereign | High | Compute reservation |
| Custom ML pipeline | Predictive (churn, fraud, recommendation) | Highest | Training + inference + monitoring |

Most "AI in the solution" requirements should default to LLM API + RAG, unless data sovereignty, latency, or volume forces otherwise. See ๐Ÿค– The AI SaaS Playbook (Practical Edition)๐Ÿ“˜ for the depth.

Key design points the SA owns:

  • Data flow to/from the model: what leaves your boundary? Logged where? Retained how long?
  • Prompt strategy: stored where, versioned how, evaluated how?
  • Evaluation harness: how do we know it's still working? Golden sets, online evals, human review.
  • Cost guardrails: per-tenant token budget, prompt size caps, model fallback to cheaper tier.
  • Failure mode: when the model is slow/down/wrong, what does the user see? (Increasingly: the most critical question.)

12.4 Vector stores and embeddings

For RAG and semantic search, you'll pick a vector store. Three tiers:

  • Embedded (pgvector on Postgres, sqlite-vec): default for ≤10M vectors and where you already have the DB.
  • Managed (Pinecone, Weaviate Cloud, Qdrant Cloud, Vertex Vector Search, Atlas Search): default for ≥10M vectors or when latency targets demand it.
  • Self-hosted at scale (Milvus, Vespa): only when you have a platform team and a reason.

Don't reach for a dedicated vector store on day 1. pgvector serves until you have data showing you've outgrown it.
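For a feel of how little ceremony the default involves, here is a minimal sketch of pgvector for semantic search, assuming psycopg 3 and pgvector ≥ 0.5 (for the HNSW index); the table, dimension, and query values are placeholders.

```python
import psycopg  # assumed: psycopg 3, connecting to Postgres with the pgvector extension

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS doc_chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)            -- dimension must match your embedding model
);
CREATE INDEX IF NOT EXISTS doc_chunks_embedding_idx
    ON doc_chunks USING hnsw (embedding vector_cosine_ops);
"""


def top_k_chunks(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
    """Nearest-neighbour lookup by cosine distance (<=> is pgvector's cosine operator)."""
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content FROM doc_chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, k),
        )
        return cur.fetchall()
```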

12.5 Data residency and sovereignty

Increasingly mandatory and increasingly hard. Three rules:

  1. Map data classes early. What's PII? Health data? Financial? Regulated by which jurisdiction?
  2. Default to single-region for regulated data. Multi-region adds replication paths the regulator will scrutinize.
  3. Keep AI in the picture. Many AI providers run inference in specific regions. "Calls to the LLM cross the EU boundary" is a finding waiting to happen. Use region-pinned endpoints; many providers now offer them.

13. ⚖️ Build vs Buy vs Customize

The single biggest cost lever in any solution. Wrong here = wasted years. Right here = hire fewer engineers, ship faster, focus on the differentiator.

13.1 The framework

Apply this in order, for every meaningful capability in the solution:

  1. Is it a strategic differentiator? If yes (the thing customers buy us for), build. If no, default to buy/reuse.
  2. Is there a mature off-the-shelf option? If yes, score it (see §14). If no, build.
  3. Is there a viable open-source option we can self-host? Score: TCO of self-hosting vs SaaS pricing.
  4. Is the cost of switching low (two-way door)? If yes, buy. If no, slow down โ€” vendor lock-in is expensive.
  5. Does our team have the skill to operate the build option? If no, default to buy unless we're prepared to hire.
  6. What's the time-to-value difference? If "buy = 8 weeks, build = 9 months," that's usually decisive.

Note the order: the question "is this a differentiator?" comes first. Most teams build the wrong things first (the auth system, the CMS, the ticketing system, none of which differentiate them) and starve the differentiator of time.

13.2 The classic "always buy" list

Capabilities that are almost always wrong to build today:

  • Authentication / SSO / IdP (Auth0, Cognito, Entra, Okta, WorkOS)
  • Email / transactional messaging (Postmark, SendGrid, Resend, SES)
  • Payments (Stripe, Adyen, Braintree)
  • Logging / observability platform (Datadog, New Relic, Grafana Cloud, Honeycomb)
  • Error tracking (Sentry, Rollbar)
  • Analytics (Amplitude, Mixpanel, PostHog)
  • Search infrastructure (Algolia, OpenSearch managed)
  • File storage (S3 / equivalent)
  • Customer support (Zendesk, Intercom, HelpScout)
  • Status pages (Statuspage.io)
  • DAM, CDN, WAF, DDoS โ€” all categories where infrastructure providers excel

Building any of these requires a written justification. The default is buy. The bias is strongly toward buy.

13.3 The classic "consider build" list

Capabilities where build is more often correct:

  • The core product surface (your differentiator)
  • Domain-specific data models that no SaaS product expresses
  • Workflow / orchestration of your business processes
  • Customer-facing UX (you're the brand)
  • Pricing engine, recommendation engine, ranking model โ€” where your data is the moat
  • Multi-tenant isolation, residency, audit โ€” when SaaS options can't meet your specific compliance posture

13.4 The "customize" trap

A vendor offers a platform you can heavily customize (Salesforce, ServiceNow, Pega, Microsoft Dynamics, low-code platforms). The trap: you start with "10% customization" and end with a 100-FTE practice maintaining a snowflake. Customization budget compounds.

Rules:

  • Be ruthless about what you customize. Workflows: yes. UI: maybe. Data model: only if forced. Core engine: never.
  • Time-box customization investment. Set an explicit budget (FTE-years and dollars) and revisit annually.
  • Plan an exit strategy. Even if you never use it, know how you'd leave. The vendor's roadmap is not yours.

13.5 The TCO comparison

Always quantify, always over 3 years. Don't compare list price; compare full TCO.

| Cost component | Build | Buy SaaS | Self-host OSS |
| --- | --- | --- | --- |
| Build / setup | 8–12 FTE-months | 1–2 FTE-months | 2–4 FTE-months |
| Annual licenses | 0 | $X/seat × N | 0 |
| Annual ops | 1–2 FTE | 0.1 FTE | 0.5–1 FTE |
| Cloud infra | $A/yr | usually included | $B/yr |
| Y3 cost | rapid growth | scales with usage | sub-linear |
| Risk | schedule, attrition, scope | vendor, lock-in, price | community, security, ops |

A common trap: comparing "build cost" (engineers building) vs "SaaS cost" (license fee), forgetting the build option carries lifetime ops + maintenance + team-context cost too. Three-year TCO almost always favors buy for non-differentiator capabilities.
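A minimal sketch of that three-year comparison in code; every number below is a placeholder assumption (loaded FTE cost, license fee, infra spend), there to show the shape of the calculation rather than real prices.

```python
def three_year_tco(build_fte_months: float, annual_ops_fte: float,
                   annual_licenses: float, annual_infra: float,
                   loaded_fte_month_cost: float = 15_000.0) -> float:
    """Total cost of ownership over 3 years: one-time build plus 3 years of run."""
    build = build_fte_months * loaded_fte_month_cost
    run = 3 * (annual_ops_fte * 12 * loaded_fte_month_cost
               + annual_licenses + annual_infra)
    return build + run


# Illustrative comparison for a non-differentiator capability
build_option = three_year_tco(build_fte_months=10, annual_ops_fte=1.5,
                              annual_licenses=0, annual_infra=60_000)
buy_option = three_year_tco(build_fte_months=1.5, annual_ops_fte=0.1,
                            annual_licenses=120_000, annual_infra=0)
print(f"build ~ ${build_option:,.0f}  vs  buy ~ ${buy_option:,.0f} over 3 years")
```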


14. 🛒 Vendor Evaluation & Selection

You will pick vendors. Often. Do it as a process, not a vibes-based fight in a meeting.

14.1 The funnel

  1. Long list (≥5 vendors): gather from analyst reports (Gartner, Forrester, G2 grids), peer recommendations, your network. The point of a long list is to avoid anchoring on "the two we already heard about."
  2. Short list (3 vendors): cut on table-stakes (region availability, compliance certifications, integration availability, price band, scale).
  3. RFP / questionnaire: standardized, scored, with the same questions to all 3. (See §14.2.)
  4. Proof of concept (PoC): same scenario for all 3, same evaluation rubric, time-boxed.
  5. Reference calls: ≥2 references each, asking the uncomfortable questions (see §14.4).
  6. Commercial negotiation: only after technical decision is made.
  7. Decision: written ADR with the scoring artifact attached.

14.2 The questionnaire (RFP)

A single questionnaire, applied to all 3 vendors. Categories and weights that work in practice:

| Category | Weight | Sample questions |
| --- | --- | --- |
| Functional fit | 25% | Does it cover capabilities X, Y, Z? Demo the workflow A. |
| Non-functional | 20% | SLA, availability, RPO, scale, observability surface |
| Integration | 15% | API quality, OpenAPI, events, SDK languages, rate limits, idempotency |
| Security / compliance | 15% | SOC 2 Type II, ISO 27001, GDPR posture, sub-processors, data residency, MFA, SSO, audit log retention |
| Operability | 10% | Status page, incident transparency, support tier responses, observability into our tenant |
| Roadmap & viability | 5% | Funding stage, customer count, growth, top customers, leadership stability |
| Commercial | 10% | Pricing model, predictability at scale, exit terms, data export, MSA flexibility |

Vendors will resist standardized questionnaires. Insist. "We are evaluating three vendors with the same questionnaire to give you a fair comparison." They comply.
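The scoring itself is simple arithmetic; keeping it in code (or a spreadsheet) keeps the fight out of the meeting. A minimal sketch using the category weights above; the vendors and scores are placeholders.

```python
# Weighted scoring across the questionnaire categories (weights sum to 100).
# Scores are 1-5 per category; vendor names and scores are illustrative.

WEIGHTS = {
    "functional_fit": 25, "non_functional": 20, "integration": 15,
    "security_compliance": 15, "operability": 10, "roadmap_viability": 5,
    "commercial": 10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Return a 0-100 score: sum of weight * (score / 5) over all categories."""
    return sum(WEIGHTS[cat] * (scores[cat] / 5.0) for cat in WEIGHTS)


vendors = {
    "vendor_a": {"functional_fit": 4, "non_functional": 4, "integration": 3,
                 "security_compliance": 5, "operability": 4,
                 "roadmap_viability": 3, "commercial": 3},
    "vendor_b": {"functional_fit": 5, "non_functional": 3, "integration": 4,
                 "security_compliance": 3, "operability": 3,
                 "roadmap_viability": 4, "commercial": 4},
}

for name, scores in sorted(vendors.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(scores):.1f} / 100")
```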

14.3 The PoC

A 2–4 week structured trial, with the same scenario across all 3 vendors, scored on a published rubric. Hard rules:

  • The customer's engineers run the PoC, with vendor support. Not vendor-led.
  • Time-boxed; the same time box for each vendor.
  • Acceptance criteria written before the PoC starts. Otherwise you'll move the goalposts.
  • Document failures, not just successes โ€” "vendor 2 needed a workaround for our SSO" is a finding.

14.4 The reference call: ask the uncomfortable

Vendors' references are pre-selected; assume they're friendly. Get value anyway by asking:

  • "What's the worst incident you've had with this vendor in the last 18 months? How was it handled?"
  • "What did you wish you'd known before signing?"
  • "What's the next vendor capability that's blocking you?"
  • "How predictable is your bill quarter to quarter?"
  • "If you were starting today, would you choose them again?"
  • "Who else did you evaluate, and why did they lose?"

Ask for one reference not on the vendor's list โ€” usually possible through your network.

14.5 The vendor scorecard (running)

After selection, don't stop scoring. Maintain a running scorecard for any meaningful vendor:

  • SLA met (each month).
  • Incident count and severity.
  • Roadmap items shipped vs promised.
  • Cost trajectory vs forecast.
  • Support responsiveness.

When the scorecard goes red over two quarters, it's time to revisit. Most vendor problems are gradual decline, not sudden death โ€” the scorecard catches them early.

14.6 Lock-in: the four flavors

Not all lock-in is equal. Distinguish:

  • Data lock-in: getting your data out is hard or expensive. The most dangerous. Always negotiate data export terms upfront.
  • Operational lock-in: your team has skilled up and integrated workflows. Costly but survivable.
  • API lock-in: your code calls vendor APIs. Use abstraction at the boundary if the cost of switching matters.
  • Commercial lock-in: pricing escalators, multi-year commits, penalty clauses. Read the contract.

Data lock-in is the deal-breaker. Always have a written, tested, sub-week data export path.


15. 💰 Cost & TCO Modeling

If you can't defend the cost, you can't defend the design. SAs who don't model cost don't get to architect โ€” they get overruled. Cost is a first-class design constraint, not a finance afterthought.

15.1 The three-year TCO

Always model three years. Year 1 hides the ramp; Year 3 reveals the steady-state. Categories:

| Category | Y1 | Y2 | Y3 | Notes |
| --- | --- | --- | --- | --- |
| Cloud infra (compute, storage, network, data transfer) | | | | Usage-based; model 3 scenarios |
| Managed services (DB, queue, cache, CDN) | | | | Mix of base + usage |
| SaaS / vendor licenses | | | | Per-seat, per-event, per-tenant |
| AI / LLM API spend | | | | Per-token; sensitivity to volume |
| Build cost (FTEs × loaded cost × duration) | | | | Y1-heavy |
| Run cost (FTEs operating) | | | | Compounding |
| Compliance / audit | | | | Often overlooked |
| Support / training | | | | Often overlooked |
| Hidden: data transfer, snapshot retention, log volume, dev/staging environments | | | | The biggest blind spots |

Sum it. Show base case + optimistic + pessimistic (10× growth). Compare alternatives.

15.2 The cost-per-business-event metric

The most useful unit metric for a solution is cost per business event: per order, per request, per active user, per ML inference, per ticket. Calculate it; it's how you'll defend cost to the business.

Examples:

  • "$0.04 per order, of which $0.02 is database, $0.01 is compute, $0.005 is network, $0.005 is log volume."
  • "$0.18 per support conversation, of which $0.12 is LLM tokens (decreasing with caching), $0.04 is vector DB lookups."
  • "$2.10 per active user per month, dominated by storage and CDN."

When the number changes by 30%, you investigate. When the business asks "what does this cost?" โ€” you have the answer.
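A minimal sketch of the unit-metric calculation; the component costs and event volume are placeholders, chosen to reproduce the $0.04-per-order example above. In practice the inputs come from your tagged billing export and your event counts.

```python
def cost_per_event(monthly_component_costs: dict[str, float],
                   monthly_events: int) -> dict[str, float]:
    """Break a month's spend into cost per business event, per component."""
    return {component: cost / monthly_events
            for component, cost in monthly_component_costs.items()}


# Illustrative: 1.2M orders in a month
breakdown = cost_per_event(
    {"database": 24_000, "compute": 12_000, "network": 6_000, "log_volume": 6_000},
    monthly_events=1_200_000,
)
total = sum(breakdown.values())
print(f"${total:.3f} per order: "
      + ", ".join(f"{k}=${v:.3f}" for k, v in breakdown.items()))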

15.3 Cloud cost levers

  • Right-sizing: most workloads are 30–60% over-provisioned by default. Saves 20–40% almost always.
  • Reserved instances / savings plans: 30–60% off list, for predictable workloads. Budget for the commitment.
  • Spot / preemptible: 60–90% off, for fault-tolerant batch and stateless. Only with the right workload shape.
  • Storage class / lifecycle: hot → infrequent → cold → glacier. Saves 50–95% on cold data.
  • Data transfer: the sneakiest cost. Cross-region, cross-AZ, NAT gateways. Architect to avoid.
  • Log volume: ingestion + storage + retention. Sample, drop, route by class. Often the biggest reduction lever after right-sizing.
  • Idle environments: dev/staging running 24/7 → switch off nights/weekends. Saves 50–70% on those environments.

15.4 FinOps integration

Make the solution FinOps-aware from day 1, not retrofit later:

  • Tagging schema: every resource tagged with application, environment, cost-center, owner, data-class. Without tags, you have a cost line, not a cost story. (A validation sketch follows this list.)
  • Budget alerts: at 50%, 80%, 100% of monthly budget, by tag. Alert the owner.
  • Showback / chargeback: monthly cost report by team / tenant / feature. Visibility changes behavior.
  • Anomaly detection: enable cloud-native (AWS Cost Anomaly Detection, equivalents). Catch the runaway batch job in 24h, not 28d.
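A minimal sketch of enforcing the tagging schema above; the required tag names, resource identifiers, and inventory format are illustrative assumptions. In practice the inventory comes from your cloud's tag/resource API and the check runs in CI or a nightly audit.

```python
REQUIRED_TAGS = {"application", "environment", "cost-center", "owner", "data-class"}


def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    """Return required tags that are absent or empty for one resource."""
    return {t for t in REQUIRED_TAGS if not resource_tags.get(t)}


# Illustrative resource inventory
inventory = {
    "rds/orders-prod": {"application": "orders", "environment": "prod",
                        "cost-center": "cc-142", "owner": "team-orders",
                        "data-class": "confidential"},
    "s3/tmp-export-bucket": {"application": "orders", "environment": "prod"},
}

for resource, tags in inventory.items():
    gaps = missing_tags(tags)
    if gaps:
        print(f"{resource}: missing tags {sorted(gaps)}")
```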

15.5 Cost as a design driver

Surface cost in the architecture review. For each major component, attach: (load) × (unit cost) = (monthly cost). When a component is a 40% line item, defend it explicitly. Sometimes the design changes: a $40k/mo component you discovered late might be cheaper in a different topology.

A common SA upgrade: bring the FinOps person into the architecture review. They're often hungry to be invited; they'll find waste you missed; the design improves.


16. 🛡️ Security, Compliance & Risk

Security is not a section to bolt on at the end. It's a constraint that touches every box on the diagram. Compliance is the codification of security that somebody (regulator, auditor, customer) checks. Risk is the brutal honest list of what could kill the project.

16.1 Threat modeling โ€” early, with the security team

Run a threat model at the design stage, not at go-live. STRIDE is the workhorse:

  • Spoofing: identity assumption; covered by auth/IAM
  • Tampering: data alteration; covered by integrity checks, signing
  • Repudiation: denying actions; covered by audit logs
  • Information disclosure: leaks; covered by encryption, access control
  • Denial of service: outage; covered by rate limiting, autoscale, isolation
  • Elevation of privilege: gaining more rights; covered by least privilege, segmentation

For each component on the C4 L2 diagram, walk STRIDE. Document the controls. The output is a threat model artifact (typically 3–10 pages) the security team signs.

16.2 The control catalogue (mapped to compliance)

Compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI DSS, FedRAMP, GDPR, NIS2) all reduce to roughly the same set of controls. Map your design against this canonical list:

| Control | What it means in design |
| --- | --- |
| Identity & access | SSO, MFA, RBAC, least privilege, JIT access for admin |
| Encryption at rest | CMK in KMS, rotated, with audited key access |
| Encryption in transit | TLS 1.2+ everywhere, mTLS for service-to-service |
| Audit logging | Every privileged action logged, immutable, retained per policy |
| Vulnerability management | Image scanning, dependency scanning, periodic pen-test |
| Change management | All changes via PR, reviewed, tested, reversible |
| Backup & recovery | RPO/RTO tested, DR drilled |
| Incident response | Runbooks, on-call, post-mortem culture |
| Data classification | Each data element tagged; PII handled distinctly |
| Vendor / sub-processor management | Inventory, DPAs, security questionnaires |
| Physical / environmental | Cloud provider's responsibility (in the shared model) |
| Personnel | Background checks, training, separation procedures (HR / IT) |

The SA's job: ensure the design enables each control. Not necessarily implement them all directly โ€” but never design a solution that prevents a control.

16.3 The shared responsibility model

In cloud, security is shared. The cloud provider secures the substrate; you secure what you build on it. SAs frequently get the line wrong, either claiming AWS does too much or doing AWS's job for them.

A specific, clear table by service tier (illustrative):

  • IaaS (EC2, VMs): provider handles hypervisor, network fabric, physical. You handle OS patching, runtime, app, identity.
  • Managed services (RDS, ECS Fargate): provider handles OS, DB engine. You handle config, IAM, data, app.
  • Serverless (Lambda, Cloud Run): provider handles runtime. You handle code, IAM, secrets, data.
  • SaaS: provider handles almost everything. You handle identity (SSO), data classification, config.

State this explicitly in the security architecture document. Auditors love it. Engineers stop arguing about whose job patching is.

16.4 The risk register โ€” the brutal list

A risk register is the honest list of what could derail this solution. Format:

| ID | Risk | Likelihood | Impact | Owner | Mitigation | Status |
| --- | --- | --- | --- | --- | --- | --- |
| R-01 | Vendor X bankrupt within 12 months | M | H | SA | Data export tested, secondary vendor researched | Open |
| R-02 | Key engineer departs before go-live | M | H | EM | Pair-programming, design docs, knowledge transfer plan | Open |
| R-03 | Data residency requirement changes mid-project | L | H | Compliance | Design abstracts region; abstraction tested | Mitigated |
| R-04 | LLM cost grows 5× at 10× usage | M | M | SA | Caching, prompt budget, model fallback | In progress |

Review the register at every steering committee. A risk register that doesn't change is a risk register that's not being maintained. Risks should appear, mitigate, close.

16.5 Privacy by design (GDPR and beyond)

If the solution touches personal data, design for privacy from day 1:

  • Data minimization: collect the least; design schemas around it.
  • Purpose limitation: each data element has a documented purpose; new use requires re-consent or DPIA.
  • Storage limitation: retention by data class, automated deletion.
  • Right to erasure: design for deletion. (This is harder than it sounds โ€” backups, logs, analytics.)
  • Data subject access requests (DSAR): design an API for "give me a user's data."
  • Cross-border transfers: SCCs, adequacy, residency design.

Privacy is non-trivial to retrofit. Asking these questions in week 4 is cheap; asking them in week 40 is expensive.

16.6 Compliance posture as a design output

By go-live, the solution should ship with:

  • A compliance posture document (1–3 pages): which frameworks apply, which are out-of-scope, which controls are evidenced where.
  • A control mapping: every control mapped to where it's implemented and how it's evidenced.
  • A DPIA (if EU / personal data): Data Protection Impact Assessment.
  • A record of processing (GDPR Article 30): for data flows.

These artifacts are increasingly commercial assets โ€” customers ask for them in security questionnaires, sales asks for them in deals, regulators ask for them in audits. Designing the solution to produce them naturally beats retrofitting them under audit pressure.


17. 🚚 Migration Architecture: 6Rs and Beyond

Many SA engagements are migrations more than greenfield. The "6Rs" framework (originally Gartner's 5Rs, extended) is the canonical taxonomy.

17.1 The 6Rs

For each system in scope, pick exactly one R:

| R | Action | When | Cost | Risk |
| --- | --- | --- | --- | --- |
| Retain | Leave it where it is | Stable, not strategic, low risk of staying | Lowest | Lowest |
| Retire | Decommission | No longer needed, redundant, replaced | Low (one-time) | Low if scoped right |
| Rehost ("lift-and-shift") | Move as-is to cloud | Speed > optimization, simple stateless workloads | Medium | Medium: works, but expensive at run |
| Replatform | Move with minimal changes (e.g., to managed DB) | Easy wins via managed services | Medium-high | Medium |
| Refactor | Re-architect | Cloud-native is required, scale demands it | High | High |
| Repurchase | Replace with SaaS | Off-the-shelf option exists | Medium-low (license + integration) | Vendor risk |

For each system: write the R, the rationale, the cost, the schedule, and the success criteria. A migration plan that can't articulate the R per system is not a plan.

17.2 The strangler fig pattern

For migrating large systems incrementally rather than big-bang. Conceptually: stand up the new system alongside the old, route a slice of traffic to new, validate, expand the slice, eventually retire the old.

Implementation patterns:

  • Reverse proxy / API gateway: route by path or feature flag (see the sketch after this list).
  • Dual-write: write to old + new for a window; reconcile.
  • Read from new, fall back to old: for read paths.
  • CDC: replicate old → new while migrating.
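A minimal sketch of the gateway-routing pattern: already-migrated paths go to the new system, and a hashed percentage of the remaining traffic is sliced over per customer. The prefixes, rollout percentage, and hashing choice are illustrative assumptions; real gateways (ALB rules, NGINX, Envoy) express the same logic as configuration.

```python
import hashlib

MIGRATED_PREFIXES = ("/api/orders",)   # paths already served by the new system
ROLLOUT_PERCENT = 10                   # slice of remaining traffic sent to new


def route(path: str, customer_id: str) -> str:
    """Decide per request whether to proxy to the old or the new system.

    Hashing the customer id keeps each customer consistently on one side,
    which makes reconciliation and debugging much easier during the transition.
    """
    if path.startswith(MIGRATED_PREFIXES):
        return "new"
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < ROLLOUT_PERCENT else "old"


assert route("/api/orders/42", "cust-1") == "new"
print(route("/api/invoices/7", "cust-1"), route("/api/invoices/7", "cust-2"))
```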

Hard parts:

  • Data convergence: how do you ensure old + new agree during transition? Reconciliation jobs, comparison metrics.
  • Schema divergence: new schema may differ; transformation at the boundary.
  • Long tail: the last 10% of features takes 50% of the time. Plan for it.

17.3 The migration runway

Every migration has a runway. Plan it:

  • Phase 0 (Foundations): landing zone, identity, network, observability, IaC. Done before any workload moves.
  • Phase 1 (Pilot): one low-risk workload, end-to-end. Prove the pipeline.
  • Phase 2 (Wave): group similar workloads, migrate in 4–8 week sprints.
  • Phase 3 (Tail): the hard cases. Strangler, replatform, or accept retain.
  • Phase 4 (Retire): decommission old infra. The most-skipped phase. Until you turn it off, you pay double.

A common failure: declaring victory at Phase 2. The legacy infra stays "for safety" for 18 months and you pay 1.7× run cost the whole time.

17.4 Migration cost shapes

Migrations have a characteristic "U-shape" cost: high during transition, theoretically lower after. Two traps:

  1. Underestimating transition cost. Dual-running, training, parallel teams. Often 1.5–2× steady-state for 6–18 months.
  2. Overestimating post-migration savings. Lift-and-shift to cloud is often more expensive than on-prem for the first 1–2 years, until right-sizing and managed services pay off.

Be honest in the TCO model. The CFO will remember.


18. 💬 Communication: Diagrams, Documents, Presentations

Most of your impact lands through communication. Bad communication kills good designs. Two principles dominate: audience-first and progressive disclosure.

18.1 The three-audience problem

Every artifact has at least three audiences:

| Audience | Wants | Hates |
| --- | --- | --- |
| Executive | The headline, the cost, the risk, the recommendation | Detail, jargon, indecision |
| Architect peer | The decisions, the alternatives, the rationale | Hand-waving, missing tradeoffs |
| Engineer | The implementation truth, the contracts, the failure modes | Vague abstractions, no examples |

A single document cannot serve all three. Either produce three layered documents (recommended), or one document with clear sections labeled by audience.

The rough hierarchy:

  • Executive brief (1–2 pages): problem, recommendation, cost, risk, decision needed. No diagrams more complex than C4 L1.
  • Architecture brief / RFC (8–20 pages): full design, decisions, alternatives, NFRs, risks. Architects' bread and butter.
  • Technical spec / detailed design (per component): the engineer-facing detail.

18.2 Diagrams that earn their pixels

Rules:

  1. Title every diagram. "Figure 3: Order Flow โ€” happy path, sync, p99 budget 400ms." Untitled diagrams are riddles.
  2. Legend, always. Every shape and arrow color means something.
  3. One concept per diagram. A C4 L2 + sequence diagram + deployment view in one box is unreadable.
  4. Annotate the load and latency. Each box: estimated RPS, p99, cost contribution. Diagrams without numbers are decoration.
  5. Pretty is a feature. A clean diagram earns trust; a tangled one earns suspicion. Spend the extra hour.
  6. Mermaid > Visio for living architecture. Diagrams in code stay current; diagrams in Visio rot.

A well-known anti-pattern: the Buzzword Soup Diagram โ€” 60 boxes, 200 arrows, every cloud icon, no information. It says "I am working." It does not say what the system does. Replace with a 12-box C4 L2.

18.3 The architecture brief: a template

A reusable arc42-flavored skeleton:

  1. Summary (½ page): problem, recommended solution, cost, risk, decisions needed now.
  2. Context (1–2 pages): current state, business outcome, scope, out-of-scope.
  3. Constraints & NFRs (1 page): table.
  4. Strategic options (1 page): A/B/C with recommendation.
  5. Solution (3–6 pages): C4 L1, L2, key flows, deployment.
  6. Decisions (link to ADRs).
  7. Cost & TCO (1 page): Y1/Y3, sensitivity.
  8. Risks (½–1 page): top 10 with mitigation.
  9. Migration / rollout (½–1 page): phases.
  10. Open questions & decisions needed (½ page): explicit, named, dated.

Length cap: 20 pages. If you can't fit it, layer it: this brief + linked ADRs + linked detailed designs.

18.4 The executive presentation

Different beast. 5–10 slides, a 15-minute briefing, a 30-minute decision meeting. Slide structure that works:

  1. The problem (1 slide, 1 sentence).
  2. What we recommend (1 slide, 3 bullets).
  3. Why this and not the alternatives (1 slide, 3 columns).
  4. What it costs and when it pays back (1 slide, 1 chart).
  5. What could go wrong, and our mitigation (1 slide, top 3 risks).
  6. What we need from you, and by when (1 slide, decisions list).
  7. Backup: full architecture, full TCO, full risk register. Don't open unless asked.

Anti-pattern: the 60-slide architecture deck where slide 23 has the recommendation. An exec gives you about 60 seconds before deciding whether to keep listening; by slide 4 you have spent it. Lead with the answer.

18.5 The status update

Weekly or bi-weekly. Keep it boring. A template that works:

Project: <name>
Week of: <date>
RAG status: G/A/R (with reason if not G)

Highlights (3 max):
- ...

Decisions made this week:
- ...

Risks updated:
- ...

Decisions needed (with owner & date):
- ...

Next week:
- ...

Boring is the strategy. Stakeholders need to know they don't have to read closely. The week you flip from green to amber, they read; that's the value.


19. 🤝 Stakeholder Management

Eighty percent of the SA job is alignment with people you don't manage. The patterns:

19.1 The stakeholder map (RACI variant)

For each major decision, label four kinds of stakeholders:

  • Responsible (does the work)
  • Accountable (single owner of the decision)
  • Consulted (input; two-way)
  • Informed (one-way)

Rules:

  • Exactly one A. If you have two, you have zero.
  • The A is rarely the SA. The SA is often the R or C, sometimes the I.
  • Publish the map. Re-check at every gate. Decisions stall when A is unclear.

19.2 The decision log

Every decision gets an entry. Date, decision, alternatives, decider, rationale, reversibility. Stored alongside ADRs. Reviewed at gates.

A specific failure mode: "we kind of decided" decisions โ€” discussed in a meeting, never written. Six weeks later, the team rediscovers the question and re-decides differently. Cost: weeks. Solution: the SA writes it down within 24 hours, sends to the room, gets confirmation.

19.3 The "single throat to choke" pattern

For a complex solution, one person should be accountable for the solution end-to-end. Often that's you, the SA, or it's the Engagement Manager / Program Lead. Make it explicit. The customer should know whose phone to dial when something is going wrong. Distributed accountability = no accountability.

19.4 Difficult stakeholders

Patterns and counter-patterns:

| Stakeholder type | Pattern | Counter |
| --- | --- | --- |
| The dictator ("we're using X technology, end of story") | Gives orders without rationale | Ask "what problem are you solving with X?"; re-route to the actual decision |
| The bikesheder (debates trivial things) | Spends meetings on the color of buttons | Time-box the meeting; explicitly defer trivial choices to the team |
| The veto (security, legal, EA) | Blocks late, never engages early | Bring them in week 1; share artifacts early; get conditional approvals |
| The ghost (decision-maker who never shows) | Books, cancels, no replies | Escalate via their boss with written rationale; make absence costly |
| The polite blocker (says yes, does nothing) | Agrees in meetings, no follow-through | Ask for written commitment, dates; track in decision log |
| The technologist (a peer with strong tech opinions) | Argues every choice as an aesthetic | Push to write-up; force them to commit alternatives in ADR form |

For each, the counter-pattern is make work visible and dated. Ambiguity is the enemy.

19.5 The quarterly steering committee

Every meaningful solution has a steering committee โ€” sponsor + key business + key tech leads + you. The cadence is monthly or quarterly. Run it as:

  1. RAG status (1 slide).
  2. Decisions needed today (3 slides max, one per decision).
  3. Risks updated (1 slide, focus on what changed).
  4. Roadmap (1 slide, gantt).
  5. AOB (10 min).

Goal: leave with written, signed decisions on every "decision needed today" item. If you don't, the next 2-4 weeks stall. The SA's job is to make the steering committee productive, not informational.

19.6 Bringing bad news

You will deliver bad news โ€” over budget, over schedule, the design is wrong, the vendor failed, the engineer left. Rules:

  1. Surface early. Bad news ages worse than fish. Tell the sponsor in 24h, not at the next steering.
  2. Bring options, not just problems. "We're 30 days behind. Three paths: cut scope X, add 2 contractors, accept slip. Recommendation: cut X."
  3. No blame. Talk about the system, not the people. People who fear blame hide problems.
  4. Take responsibility. As the SA, you're the connective tissue. If a thing didn't get caught, it's partly your job.
  5. Follow up in writing. Verbal news is half-news.

Sponsors who learn early that you bring honest, structured bad news with options trust you forever. Sponsors who learn late that you sat on it stop trusting you forever. Choose.


20. 🤵 Pre-Sales SA: The Consultative Sale

A pre-sales SA inside a vendor or SI has a different operating model. Not selling โ€” consulting โ€” but you do have a quota. The shape of the work:

20.1 The funnel and your role

Pre-sales SAs sit on the technical side of the sales funnel:

  1. Discovery โ€” sales-led, you co-attend. You listen for real problems; sales listens for budget and timing.
  2. Demo โ€” you lead. Tailored to the customer's actual problem, not the canned demo.
  3. PoC โ€” you scope, deliver or oversee, defend. Time-boxed, success-criteria-led.
  4. RFP / RFI response โ€” you write the technical sections. Often the deal is decided here.
  5. Statement of work / Pricing โ€” collaboration with sales / engagement managers.
  6. Close โ€” sales-led, you support objection handling.

20.2 The consultative sale

The pattern that wins, regardless of vendor:

  1. Understand the customer's business problem first. Not the technical requirement. Not the RFP question. The actual business outcome.
  2. Reflect it back. "You're trying to reduce time-to-resolution on tier-1 tickets from 8h to 1h, because customer churn correlates with first-touch latency. Did I get that right?" โ€” earns trust on the first call.
  3. Educate, don't pitch. Walk the customer through how similar customers solved similar problems โ€” yours and otherwise. They learn; trust compounds.
  4. Be the trusted advisor on the category, not the salesperson for the product. Mention competitors honestly. "If you have a heavy Salesforce footprint, our integration to product X may be less mature than competitor Y's; here's how customers handle it."
  5. Disqualify when needed. "Honestly, we're not the best fit for this use case. Vendor Z is stronger." โ€” this loses some deals and wins more, bigger, longer-term.

The sales reps who hit quota for years partner with SAs who do this. The ones who don't? They burn customers and the funnel goes dry.

20.3 The technical demo

A 30–60 minute live walk-through. Rules:

  • Personalized: customer logo, customer data flavor, customer problem on screen. Generic demos lose.
  • Outcome-led: "By the end you'll see how this solves your tier-1 ticket time."
  • Failure-prepared: you've rehearsed, you've cached responses, you've got backup screenshots. The demo gods are cruel; the prepared SA is not surprised.
  • Q&A handled in real-time: if you don't know, say so, write it down, follow up within 48h. Honesty earns the deal.
  • No 60-slide intro. Start in the product. Slides for context, not for content.

20.4 The PoC: the scary one

PoCs are where deals are won or lost โ€” and where pre-sales SAs go off the rails. Rules:

  • Scoped explicitly: 2–3 use cases, 2–4 weeks, written success criteria. The customer signs the criteria.
  • Customer-led where possible: their engineers do the work, you support. They build muscle; they buy.
  • Failure modes documented: where the product doesn't fit, write it down. Surprises in production kill renewals.
  • Done = done. When the success criteria are met, celebrate and close. Don't drift into "while we're here, can you also..." That's free consulting and it tanks the deal close.

20.5 The RFP response

RFPs are a war of attrition. Practical patterns:

  • Reuse aggressively: maintain a question bank with last year's answers, scored by win/loss.
  • Answer the question asked, not the one you wish was asked. RFP scorers are unforgiving.
  • Use diagrams and tables in technical sections โ€” text walls don't score well.
  • Highlight unique strengths in 1โ€“2 places โ€” once at the top of the technical section, once in the executive summary.
  • Refuse low-quality RFPs: if the RFP looks copy-pasted from a competitor's marketing, you're column fodder. Decide whether to bid.

20.6 The handoff to delivery

The single most important moment in pre-sales SA work. Anti-pattern: pre-sales SA promises feature X to win the deal; delivery team didn't know; six months later the customer churns. Counter-patterns:

  • Internal SOW review: delivery sees the SOW before it's signed. They sign off in writing.
  • Documented promises: every commitment beyond the standard product is in a "delivery commitments" appendix. No verbal-only promises.
  • Joint kickoff: pre-sales SA + delivery SA + customer in the same room for handoff.
  • Pre-sales SA stays for first 30 days: as advisor, not driver. Continuity beats clean handoff.

21. 🛠️ Post-Sales SA: Delivery Architecture

You won the deal, or you're an in-house SA on a greenfield. Now the work is delivery โ€” design that ships, runs, and renews.

21.1 Phase 0: foundations

Before any feature work:

  • Landing zone (cloud accounts, network, identity, observability, baseline IAM).
  • CI/CD pipeline (test, scan, deploy to dev/staging/prod).
  • Observability stack (logs, metrics, traces, dashboards, alerts).
  • Secrets management (Vault, KMS, AWS Secrets Manager).
  • Compliance baseline (audit logging, encryption defaults, change management).
  • Reference architecture & ADR baseline.

Phase 0 typically takes 4–8 weeks. SAs new to delivery underestimate this and start feature work on shaky ground. Defer feature work; build foundations.

21.2 The delivery rhythm

Your operating cadence after Phase 0:

  • Daily: in standups occasionally (not every day โ€” that's the TL's job). Available on Slack for unblocks.
  • Weekly: design reviews on the week's hard topics. ADR updates. Cost dashboard review.
  • Bi-weekly: stakeholder update. Risk register review.
  • Monthly: steering committee. Deep architecture review.
  • Quarterly: WAR (Well-Architected Review) or equivalent technical health check.

Keep the engineering team's calendar light and your own stakeholder-communication calendar heavy. They need flow; you need alignment.

21.3 Design reviews โ€” running them

Most teams' design reviews are bad โ€” too long, too vague, no decisions. A working format:

  1. Pre-read (10 min before). Author posts a 3-page brief with: problem, options, recommendation, NFR impact, open questions.
  2. Reviewer prep: each reviewer reads silently, leaves comments in the doc, comes with at most 3 "must-discuss" points.
  3. Meeting (45 min max): walk the must-discuss list, decide each. Decisions captured live.
  4. Output: an updated doc + decision-log entries, sent within 24h.

Patterns that ruin reviews:

  • "Cold" review where reviewers read the doc live. Wastes the room.
  • Architect monologue. Reviewers should be reacting, not listening.
  • No decisions captured. Six weeks later, no one remembers.

21.4 Architecture governance โ€” light, not heavy

Goal: enforce the important architectural principles (security, NFRs, integration contracts) without blocking velocity on minor decisions.

A working model:

  • Tier 1 โ€” automated: linters, IaC policy (OPA/Sentinel), dependency scanners. The team self-services.
  • Tier 2 โ€” peer review: PR with the right reviewer. No central architect needed.
  • Tier 3 โ€” ADR + design review: the SA or an architecture board reviews. For the load-bearing decisions only.
  • Tier 4 โ€” exception process: documented, time-boxed, expirable.

Anti-pattern: every change must go to the architecture board. Velocity collapses, the team goes around you, the architecture decays. Reserve the board for irreversible decisions.

21.5 The drift problem

Architectures drift. Teams adopt a new library, a new pattern, a new approach without updating the docs. Six months in, the running system doesn't match the design. Counter-measures:

  • Architecture validation in CI: probes that fail when the production topology diverges from the documented one (see the sketch after this list).
  • Quarterly drift review: SA + leads walk the system vs the doc; close the gap.
  • ADRs are living: when a new decision invalidates an old one, write a new ADR; don't silently change.
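A minimal sketch of such a CI probe, assuming a documented topology file versioned next to the ADRs and some way to list what is actually running (stubbed here); the file path, format, and service names are illustrative assumptions.

```python
import json
import sys


def load_documented_services(path: str = "architecture/topology.json") -> set[str]:
    """The documented topology, versioned alongside the ADRs (assumed file layout)."""
    with open(path) as f:
        return set(json.load(f)["services"])


def load_running_services() -> set[str]:
    """Stub: in practice, query your orchestrator or service catalog API."""
    return {"orders-api", "orders-worker", "payments-adapter", "reporting-export"}


def main() -> int:
    documented = load_documented_services()
    running = load_running_services()
    undocumented = running - documented
    missing = documented - running
    if undocumented or missing:
        print(f"drift: undocumented={sorted(undocumented)} missing={sorted(missing)}")
        return 1  # fail the pipeline; someone updates the doc or the system
    return 0


if __name__ == "__main__":
    sys.exit(main())
```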

21.6 The transition out

Eventually you leave the project. The transition is part of the design.

  • Documentation handoff: the next SA can read your docs cold and operate. Not a verbal walkthrough.
  • Decision log handoff: every irreversible decision documented with rationale and reversibility tag.
  • Risk register handoff: mitigations in flight, decisions still pending.
  • Stakeholder handoff: introduce the next SA in person to the top 5 stakeholders.

The mark of a good SA engagement: six months after you leave, the team is still operating well and the design is still coherent. If it falls apart in 6 weeks, you didn't transition โ€” you abandoned.


22. 🚀 Working with Delivery Teams

You design; they build. The relationship determines whether the design lives.

22.1 Don't out-design the team

The most common SA failure: producing a design the team can't operate. Symptoms:

  • The design depends on tools the team doesn't know.
  • The design assumes 24/7 on-call when the team is 4 people EU-only.
  • The design has 11 environments, 23 services, and a service mesh; the team is 6 engineers.
  • The design optimizes for problems the team will not face for 3 years.

The fix: design with the team, not for them. Bring the TL into discovery. Bring engineers into ADRs. Walk the design with the team before the steering. They'll find issues you'd miss; they'll buy in earlier; they'll own it longer.

22.2 The SA's relationship with the TL

You and the team's tech lead are partners, not competitors. Roles:

  • TL: owns the team's velocity, code quality, day-to-day execution, sprint scope, code review.
  • SA: owns the cross-team integration, the major ADRs, the NFR negotiation, the stakeholder alignment, the long-arc design.

Lines blur in the middle. Resolve early:

  • "Who picks the unit test framework?" TL.
  • "Who decides the inter-service event schema?" SA, with TL input.
  • "Who chooses the database technology?" SA writes ADR; TL co-signs.
  • "Who runs the design review?" SA. "Who runs the sprint review?" TL.

Misalignment between SA and TL is poison โ€” the team gets contradictory direction, picks one, the other escalates, trust evaporates. Have the conversation explicitly in week 1.

22.3 Pairing in the design

The most underused tactic in solution architecture: pair with an engineer on the hard parts of the design. Walk a flow at the whiteboard. Sketch the schema together. Run a load-test plan together. Two effects:

  1. The engineer's local truth surfaces โ€” "actually, that join is 80ms in production, not the 8ms you think."
  2. The design becomes their design too. They defend it.

A common bad SA pattern: produce the design alone, deliver as fait accompli. The team disagrees, can't say so politely, builds something half-aligned, and resents it. Pair early.

22.4 The "spike" tool

When a design decision hinges on uncertainty (will this integration work? what's the actual latency? does this library do what its docs claim?), don't argue; spike. A 1–3 day prototype that answers exactly one question, then is thrown away. Rules:

  • Time-boxed: max 3 days. If you can't answer in 3 days, the question is too big โ€” break it down.
  • Single-question: "Can we get sub-200ms p99 with this integration?" โ€” yes/no.
  • Disposable: spike code is not production code. Throw it away. Do not let a spike become the foundation.

The SA either runs the spike themselves (rare) or writes the spike brief and hands it to a senior engineer.

22.5 The handoff document

When you're handing a design to delivery for build:

  • Reference architecture (C4 L1, L2, L3 of key bits).
  • All ADRs (decisions made + their rationale).
  • NFR register with acceptance tests.
  • Integration contracts (OpenAPI, AsyncAPI, schemas).
  • Runtime view (sequence diagrams of key flows).
  • Operational architecture (observability, on-call, runbook list).
  • Risk register with mitigations the team owns.
  • Open questions with named owners.

Anti-pattern: a 200-slide deck. Counter: a Markdown bundle in the repo, with diagrams in code, ADRs alongside.


23. ⏱️ The Operating Cadence

Without a cadence, the SA defaults to firefighting and inbox-archaeology. With one, the role is leveraged. The default week:

23.1 The weekly template

| Block | Day(s) | Duration | Purpose |
| --- | --- | --- | --- |
| Deep design / writing | Mon, Wed AM | 3h × 2 | ADRs, briefs, RFC review, longer thinking |
| Stakeholder 1:1s | Tue, Thu | 30 min × 4 | Sponsor, delivery TLs, EA, security, finance |
| Design review | Wed PM | 2h | The team's hard design topic of the week |
| Vendor / external | Thu PM | 2h | Vendor calls, partner integrations |
| Discovery interviews (during the phase) | Various | 1h × 3–5 | When in the 30/60-day window |
| Steering committee prep | Fri AM | 2h | Slides, decisions list |
| Steering committee (monthly) | Last Fri | 90 min | The big meeting |
| Operating dashboard review | Fri PM | 30 min | Cost, SLO, risk register, ADR backlog |
| Reading / learning | Fri PM | 1h | Vendor releases, peer practice, conference talks |

About 18–22h of "scheduled" work. The rest is reactive: Slack, ad-hoc unblocks, escalations, urgent design questions, customer crises. Protect the deep blocks. They're where the actual design work happens. Without them, you're just a busy person who attends meetings.

23.2 The quarterly cadence

  • Quarter open: re-confirm NFRs, refresh roadmap, re-cost the TCO.
  • Mid-quarter: WAR (Well-Architected Review) on a specific workload. Drift check.
  • Quarter end: deep retro on the quarter's design decisions โ€” what's standing, what drifted, what should change. Update the principles set if needed.

23.3 The annual cadence

  • Strategic re-baseline: revisit the whole solution shape vs. the original vision. Is the customer's business still the same shape? Is the platform stack still the right one?
  • Cost re-baseline: full TCO recalculation with actuals; re-negotiate vendor commitments.
  • Talent / team check: who's leaving, who's growing, who needs cross-training. (Even though you don't manage them, their continuity is your design's continuity.)
  • Compliance / audit cycle: SOC 2, ISO, etc. Re-evidence controls.

23.4 Boundaries

Without protection, your calendar will fill with meetings other people benefit from.

  • No-meeting block at least one half-day a week. This is when ADRs get written.
  • Default to async. Most "let's get on a call" can be a doc comment.
  • One-screen rule: if the meeting can't be 30 minutes, it should be a doc instead.
  • The "decision-needed" filter: if the meeting has no decision needed, decline or downgrade to async update.

24. 🤖 AI in the SA Role

AI is now in every solution and every SA's workflow. Two flavors: AI in the solution you design, and AI augmenting your SA work.

24.1 AI in the solution: the patterns

Already covered in §12.3. The SA-level design points:

  • Default to LLM API + RAG for natural language workloads. Don't build a model unless data sovereignty, scale, or latency forces it.
  • Treat the LLM as an unreliable upstream: apply circuit breakers, fallbacks, evals.
  • Cost guardrails are mandatory. Token budget per tenant, prompt caching, model fallback (see the sketch after this list). AI cost is the new data-egress cost; it sneaks up.
  • Evaluation harness in production. Golden sets, online evals, human review for sensitive paths.
  • Privacy review. Where do prompts go? Who can see them? How long are they retained? Plenty of recent data-leak incidents started with "we shipped an LLM call." Don't be the next one.
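A minimal sketch of the cost-guardrail bullet above: a per-tenant monthly token budget with a cheaper-model fallback. The budget, threshold, model-tier names, and in-memory counter are all placeholder assumptions; in production the counter would be a durable, per-month store.

```python
from collections import defaultdict

MONTHLY_TOKEN_BUDGET = 2_000_000   # per tenant; an assumed commercial limit
FALLBACK_THRESHOLD = 0.8           # switch to the cheaper model at 80% of budget

_usage: dict[str, int] = defaultdict(int)   # tenant_id -> tokens used this month


def pick_model(tenant_id: str, estimated_tokens: int) -> str:
    """Return which model tier to call, enforcing the tenant's token budget."""
    used = _usage[tenant_id]
    if used + estimated_tokens > MONTHLY_TOKEN_BUDGET:
        raise RuntimeError("token budget exhausted: queue, degrade, or upsell")
    if used > FALLBACK_THRESHOLD * MONTHLY_TOKEN_BUDGET:
        return "small-cheap-model"   # placeholder name for the fallback tier
    return "large-default-model"     # placeholder name for the default tier


def record_usage(tenant_id: str, tokens_used: int) -> None:
    _usage[tenant_id] += tokens_used  # in production: a durable, per-month counter
```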

24.2 AI in the SA workflow

Things you can leverage AI for, today:

  • Discovery synthesis: paste interview notes, get a structured context map. Verify, don't trust blind.
  • First-draft ADRs: "Write an ADR comparing AWS Aurora vs. RDS PostgreSQL for the following NFRs." Then you edit, sign, own.
  • RFP response drafts: maintain a question bank; have the model produce first drafts; human-in-the-loop for accuracy.
  • Diagram generation: Mermaid / PlantUML / Structurizr produced from natural-language descriptions.
  • Cost modeling: spreadsheets and TCO comparisons sketched fast.
  • Threat modeling: a STRIDE walk on a C4 diagram, first-draft.
  • Documentation refresh: bring stale docs up to current state by pasting code + asking for diff.

Things to not delegate to AI:

  • The decision itself. Your name is on the ADR; you defend it; you sleep on it.
  • The stakeholder call. No model can read a CIO's mood or the silence after a security objection.
  • Final review. Models hallucinate constraints, invent compliance frameworks, and confidently misquote contracts. Always read the output as if a junior wrote it.

24.3 The hybrid workflow

A typical SA week looks like this:

  1. Spend 10 minutes describing the problem to your AI assistant. It produces a first-draft architecture brief, complete with C4 sketch, NFR draft, ADR stubs.
  2. Spend 90 minutes editing and rewriting โ€” fixing where it's wrong, deepening where it's shallow, removing where it's overconfident.
  3. Spend 30 minutes in a stakeholder call walking the resulting brief. Record. Feed the recording back to the model for a synthesized "decisions and follow-ups" memo.
  4. Spend 15 minutes reviewing and editing the memo. Send.

The 10-90-30-15 split (or thereabouts) is roughly 3× faster than pure-human and 2× higher quality than pure-AI. The "centaur" pattern is the SA's modern toolkit.

24.4 The "AI-native solution" pattern

When the customer asks for an "AI-native" solution, what they often want is a human-in-the-loop system: the model does the heavy lifting; the human approves, edits, escalates. The architectural shape:

  • Inference layer (LLM + RAG + tools).
  • Action layer with explicit approval/escalation gates.
  • Observability layer that captures every prompt, response, decision.
  • Eval layer that scores model outputs continuously.
  • Cost layer that tracks per-tenant spend, caps it, alerts.
  • Compliance layer with audit logs of every model interaction.

This shape repeats across customer support, document review, code review, content moderation, claims processing. Recognize it; reuse it.


25. 🧰 Tools of the Trade

A lean toolkit beats a sprawling one. The SAs who deliver consistently rely on a small, mastered set.

25.1 The core kit

  • Diagramming: Excalidraw (whiteboard), Mermaid (in-doc), Structurizr or Lucidchart (formal C4). Stop using Visio for living architecture.
  • Documentation: Markdown in Git, with ADRs as files. Confluence as a publish target, not a source of truth.
  • Modeling: Spreadsheet (Google Sheets, Excel) for TCO, capacity, NFR matrix. Don't underestimate the spreadsheet.
  • Diagrams-as-code: Mermaid for flow/sequence, Structurizr DSL for C4, draw.io / Excalidraw for sketches. Diagrams in code stay current; diagrams in PowerPoint die.
  • Knowledge management: a personal Obsidian / Notion vault for vendor research, customer notes, design patterns, cheat sheets. Reuse aggressively.
  • AI assistant: Claude / ChatGPT / Cursor / Codeium. Become fluent.
  • Collaboration: Slack / Teams for ambient, doc comments for considered, calendar for protected.
  • Project tracking: Linear / Jira for the team, your own running decision log alongside. Don't run the SA's life inside the PM tool.

25.2 Cloud-specific tooling

  • AWS: Well-Architected Tool, Cost Explorer, Trusted Advisor, AWS Application Composer.
  • Azure: Azure Advisor, Cost Management, Architecture Center reference docs.
  • GCP: Active Assist, Cost Recommender, Architecture Framework docs.

For each cloud, there's a vendor-published reference architecture catalog. Read these. Most of your design has been done before by the vendor and is sitting on their site, free.

25.3 The frameworks that pay back

  • C4 model: covered in §8.
  • arc42: covered in §8.
  • TOGAF: enterprise architecture framework. Useful in regulated big-cos. Skim TOGAF 10's ADM cycle once; you'll recognize the pattern in EA conversations. Don't try to be TOGAF.
  • AWS Well-Architected Framework / Azure WAF / GCP Architecture Framework: the cloud-vendor lens. Run a review at gates.
  • DDD (Domain-Driven Design): useful for bounded contexts and cross-team boundaries. Read the Eric Evans book once; quote sparingly.
  • Risk-Based Architecture: surface the top 5 risks and design to mitigate them; bias time-spent toward risk-resolution.

25.4 Reading discipline

The SA who falls behind on the platform stack ages out fast. A working diet:

  • 1 hour a week minimum, blocked, on cloud release notes (one cloud, alternated).
  • 1 vendor briefing or webinar a month on a new category (vector DB, observability, security).
  • 1 architecture-related book a quarter โ€” Designing Data-Intensive Applications, Software Architecture: The Hard Parts, the Phoenix/Unicorn series, Accelerate, Domain-Driven Design, Building Microservices.
  • 1 conference a year, if possible. KubeCon, AWS re:Invent, Azure Build, QCon, GOTO, DDD Europe โ€” pick by what you're designing.

26. ⚠️ The SA Anti-Pattern Catalog

The recurring mistakes. Recognize, name, avoid.

26.1 The Architecture Astronaut

Symptom: layers of abstraction, every system a kafka-event-driven hexagonal-domain mesh, no actual feature ships in 6 months.

Cause: SA is more interested in being clever than in being useful.

Counter: every design has a "what would the simplest thing be?" sentence. If your design is 10× more complex than the simple thing, defend the 10× explicitly. Often it can be cut.

26.2 The Vendor-Captured SA

Symptom: every problem is a use-case for the SA's favorite vendor (AWS Step Functions, ServiceNow, Snowflake โ€” pick your poison).

Cause: certifications, comfort, sales relationship, or being employed by said vendor.

Counter: ask "what would I recommend if this customer was on a different stack?" The answer reveals captivity.

26.3 The Diagram-Heavy, Decision-Light SA

Symptom: 80-page design pack, zero ADRs, "design is still being finalized" for 6 months.

Cause: avoiding the discomfort of irreversible decisions.

Counter: target 1 ADR per week. If a week passed without one, you're stalling.

26.4 The Whiteboard Designer Who Never Ships

Symptom: brilliant in the room, vague on paper, the team builds something different from what was discussed.

Cause: the design lives in the SA's head; the team builds what they understood, which is different.

Counter: write before you whiteboard. Or whiteboard, then immediately photograph and write up. The artifact is the design; the meeting is the discussion about it.

26.5 The "Forever in Discovery" SA

Symptom: month 4, still no design. Just more interviews. The customer is paying.

Cause: fear of committing, masquerading as thoroughness.

Counter: time-box discovery (30 days for most engagements, 60 for big enterprise). After that, ship a design even if rough. Iterate.

26.6 The Over-Architect of Trivial Things

Symptom: a 12-page ADR on the choice between two equivalent libraries. A formal design review for a config flag.

Cause: applying one-way-door rigor to two-way-door decisions.

Counter: explicitly tag every decision as one-way or two-way. Defaults: two-way → fast/cheap. One-way → slow/careful.

26.7 The Solo Architect

Symptom: design is "done," delivery team has questions you can't answer because the design didn't survive contact with the team.

Cause: producing the design alone, without the team.

Counter: design pairing (§22.3). The first draft is yours; the second draft is the team's; the third draft is jointly owned.

26.8 The "Build to Resume" SA

Symptom: every solution involves the technology the SA wants experience with โ€” Kubernetes, Kafka, Cassandra โ€” regardless of fit.

Cause: SA's career incentives ≠ customer's outcome.

Counter: declare your preferences explicitly to a peer; have them challenge you. Or use the "would I recommend this in 5 years to a friend" test.

26.9 The Compliance-Avoider

Symptom: design ignores compliance until week 18, then a compliance review forces a 3-month redesign.

Cause: compliance is boring; engineers postpone.

Counter: bring compliance into discovery. Make compliance constraints explicit in NFRs. Treat them as design inputs, not gates.

26.10 The Cost-Blind SA

Symptom: design works perfectly; bill is 4× what the customer expected; CFO kills the project.

Cause: cost was finance's problem.

Counter: TCO is part of the design (§15). Cost is an NFR. Defend it like latency.

26.11 The Handoff Cliff

Symptom: SA designs, leaves; six months later the team has rewritten half of it.

Cause: design didn't fit the team's reality; team wasn't on board.

Counter: pair-design with the team (§22.3); plan the transition out (§21.6) rather than walking away.

26.12 The Status-Update Theater

Symptom: weekly 12-slide deck, beautiful charts, but the steering can't tell what's blocked or decide anything.

Cause: confusing visibility with clarity.

Counter: use the boring template (§18.5). Lead with RAG, lead with decisions needed, lead with risks updated.

26.13 The Promised Feature

Symptom (pre-sales): SA promises capability X in the demo to win the deal; delivery team didn't know; deal churns.

Cause: incentive misalignment, no internal review of commitments.

Counter: every promise is a written delivery commitment, reviewed by delivery before the SOW signs.

26.14 The "Single Source of Truth" That Isn't

Symptom: three Confluence pages, two Notion docs, one diagram in Lucidchart, and a Slack thread โ€” all describing the same thing, all slightly different.

Cause: no documentation discipline.

Counter: ONE source-of-truth, declared and linked. Everything else is a mirror or summary, with link-back. Old artifacts archived, not deleted.

26.15 The Architecture Board That Slows Everything

Symptom: every change must go through a weekly board, the queue is 4 weeks long, teams route around it.

Cause: governance over-applied.

Counter: tier governance (§21.4). Most changes are auto + peer; only the load-bearing ones go to the board.


27. 🗺️ The Phased Roadmap (Day 1 → Year 5)

Where you are in your SA career changes which sections matter most.

27.1 Year 0–1: The new SA

You are: a senior engineer or tech lead newly given an SA title, or a first-job SA at a vendor.

Focus:

  • §2 Mindset (it's the hardest shift)
  • §6 Discovery (where most failures originate)
  • §8 ADRs (the skill that compounds most over a career)
  • §9 NFRs (the contract; overlearn it)
  • §18 Communication (writing first, then diagrams)

Avoid:

  • Pretending you have authority you don't.
  • Diagrams without numbers.
  • Designing alone.

Win: ship one solution end-to-end, with documented ADRs, that runs in production and gets renewed.

27.2 Year 2–3: The competent SA

You are: shipping multiple solutions, recognized as the technical lead in a room of stakeholders.

Focus:

  • §13 Build vs Buy (becomes your highest-leverage skill)
  • §14 Vendor evaluation (RFP responses, PoCs)
  • §15 Cost (the language of business)
  • §19 Stakeholder management (the underrated skill)
  • §22 Working with delivery teams (your designs need to ship through people)

Avoid:

  • Becoming captive to a single vendor or stack.
  • Letting your IC craft atrophy completely (the role still needs technical credibility).
  • Thinking the role is done at the SOW signature.

Win: a solution you designed at year 2 is still running well at year 4, run by a team you trust.

27.3 Year 4–6: The principal SA

You are: trusted with the largest, most ambiguous engagements. Mentoring junior SAs.

Focus:

  • §3 Archetypes (consciously choosing your seat)
  • §7 Methodology (yours, opinionated, repeatable)
  • §10–11 Cloud + integration patterns at depth
  • §16 Compliance (becomes a competitive advantage)
  • §24 AI in the role (centaur workflow)

Avoid:

  • Becoming the bottleneck for every decision (delegate downward; mentor up).
  • Drifting into pure pre-sales or pure delivery; keep both muscles.
  • Thinking the playbook is done; the platform stack changes every 2 years.

Win: your patterns (templates, ADR catalog, NFR register, vendor scorecards) are reused across engagements. You are the one teaching the next SA.

27.4 Year 7+: The strategic SA / Chief Architect / EA

Your fork:

  • Path A: Principal SA. Bigger, more strategic engagements, fewer of them, deeper. The "we hire you for the hard ones" path.
  • Path B: Chief Architect / Director. Own the SA practice; mentor a team of architects; set standards. People-leverage.
  • Path C: Enterprise Architect. Multi-year horizon, capability heatmaps, governance board. Less project, more program.
  • Path D: CTO / VPE. You take on the org. Read 👨‍💻 The CTO Playbook 📘: From Best Builder to Best Bet ♟️.

The skills overlap, but the daily life diverges sharply. Choose deliberately. Many great SAs miscast themselves into a chief-architect role and find they hate management; many great chief architects miscast themselves into a CTO role and find they hate the board. Try the role for 6 months in some way (interim, secondment, shadowing) before committing.


28. 📋 Cheat Sheet & Resources

28.1 The 30-second SA pitch

"I'm the Solution Architect for [project]. My job is to deliver a runnable, affordable, supportable solution that closes the business problem within the agreed constraints, working through teams I do not manage and stakeholders I do not control. I will spend the first 30 days listening, the next 30 framing, the next 30 designing and gating, and the rest delivering โ€” through ADRs, an NFR register, a TCO model, and a risk register that I'll keep alive and visible."

28.2 The questions a good SA asks every week

  • "What's the most likely way this project goes wrong this quarter?"
  • "What decision is stuck because nobody owns it?"
  • "What's the cost trajectory vs. what we modeled?"
  • "What's drifting from the design?"
  • "Who hasn't I talked to in two weeks who matters?"

28.3 The pre-meeting checklist

Before any architecture-related meeting:

  • Pre-read sent? (≥24h ahead)
  • Decision needed today, named explicitly?
  • Decider in the room?
  • Alternatives on a slide / in the doc?
  • NFR impact stated?
  • Cost impact stated?
  • Reversibility tagged?
  • Note-taker assigned?

If five of eight are no, the meeting will fail. Reschedule.

28.4 The "ship it or not" gate

Before declaring a solution shippable:

  • All P1 NFRs have passing acceptance tests
  • Threat model signed by security
  • Compliance posture documented
  • TCO Y1 within budget; Y3 within tolerance
  • DR drilled at least once
  • On-call rotation staffed and trained
  • Runbooks for the top 5 incidents
  • Observability covering the critical paths
  • ADRs current and reviewed
  • Risk register reviewed and at acceptable residual

If any are no, ship a limited go-live (single tenant, soft-launch, beta), not a full GA.
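
For teams that like to make the gate executable, a toy version of the checklist above, assuming the ten items are tracked as booleans (names are illustrative, not a standard); any failing item downgrades the launch from full GA to a limited go-live:

```python
# The ship-it-or-not gate as data: True = satisfied. Item names are illustrative.
gate = {
    "p1_nfrs_pass_acceptance_tests": True,
    "threat_model_signed_by_security": True,
    "compliance_posture_documented": True,
    "tco_y1_in_budget_y3_in_tolerance": True,
    "dr_drilled_at_least_once": False,   # example: the DR drill is still pending
    "on_call_staffed_and_trained": True,
    "runbooks_for_top_5_incidents": True,
    "observability_on_critical_paths": True,
    "adrs_current_and_reviewed": True,
    "risk_register_at_acceptable_residual": True,
}

failing = [item for item, ok in gate.items() if not ok]
if failing:
    print("Limited go-live (single tenant / soft launch / beta). Blocked on:", failing)
else:
    print("Full GA.")
```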

28.5 Reusable artifact templates

Maintain a personal vault with reusable templates:

  • ADR template (Markdown)
  • Architecture brief template (arc42)
  • NFR register (spreadsheet)
  • TCO model (spreadsheet, parameterized)
  • Risk register (spreadsheet)
  • Vendor scorecard (spreadsheet)
  • Discovery interview script
  • Steering committee deck skeleton (≤10 slides)
  • Status update template
  • Threat model template (STRIDE)

Each saves hours per engagement and improves quality. Sharpen them every quarter.
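
As an example of what "parameterized" means for the TCO template, a toy three-year model; every category and number below is a placeholder assumption, not a benchmark. The value is that each parameter is explicit and can be challenged line by line against §15.

```python
def tco_per_year(infra_monthly: float, licenses_yearly: float, ops_fte: float,
                 fte_cost_yearly: float, usage_growth: float, years: int = 3) -> list[float]:
    """Toy TCO model (all inputs are placeholder assumptions).

    Infrastructure is assumed to grow with usage; licenses and the operations
    team are assumed flat. Returns cost per year so Y1 can be checked against
    budget and Y3 against tolerance.
    """
    costs = []
    for year in range(years):
        infra = infra_monthly * 12 * (1 + usage_growth) ** year
        people = ops_fte * fte_cost_yearly
        costs.append(round(infra + licenses_yearly + people, 2))
    return costs

# Made-up example: $20k/month infra, $50k/yr licenses, 1.5 ops FTE at $150k/yr,
# 30% yearly usage growth.
print(tco_per_year(20_000, 50_000, 1.5, 150_000, 0.30))
```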

28.6 The reading list (focused)

If you only read 5 books in your SA career:

  1. Designing Data-Intensive Applications by Kleppmann. The vocabulary of data architecture.
  2. Software Architecture: The Hard Parts by Ford, Richards. Tradeoffs, distributed systems, decision frameworks.
  3. Fundamentals of Software Architecture by Ford, Richards. The companion volume.
  4. Building Microservices by Newman. Even if you don't do microservices, the boundary thinking is essential.
  5. The Phoenix Project + The Unicorn Project by Kim. Operational thinking. Less "architecture," more "why architecture fails in practice."

Plus periodically:

  • Domain-Driven Design by Evans (skim, but you must know the vocabulary)
  • Accelerate by Forsgren et al. (the metrics that matter)
  • Site Reliability Engineering by Beyer et al. (the operational mindset)
  • Thinking in Systems by Meadows (the meta-skill)

28.7 Online resources

  • Cloud reference architectures: AWS Architecture Center, Azure Architecture Center, GCP Architecture Framework. Free, vendor-published, current.
  • Martin Fowler's site: martinfowler.com. Patterns and articles that age extraordinarily well.
  • Simon Brown's C4 model: c4model.com. Read this once.
  • arc42: arc42.org. Templates and examples.
  • High Scalability: highscalability.com. Real-world architectures.
  • InfoQ Architecture & Design: infoq.com.
  • CNCF Landscape: landscape.cncf.io. The platform-tooling map.

28.8 The companion playbooks in this repo

28.9 The closing reminder

The Solution Architect role is one of the most leveraged in tech: a single good solution shipped for the right reasons can save a customer years and millions, and a single misframed one can burn just as much. You sit at a unique intersection: technical enough to design, business-fluent enough to negotiate, organized enough to deliver, and patient enough to listen. Few roles touch all four; most engineers are stronger on the design axis and weaker on the others. The SAs who scale are the ones who deliberately level up all four, year over year.

The work compounds. Every engagement teaches you a constraint you hadn't seen, a vendor who let you down, a stakeholder who taught you a new question, a design that survived contact with reality and another that didn't. Keep your vault. Update your patterns. Mentor the next SA. The discipline is younger than software engineering itself; the next decade of practice is being written by the people who are practicing it now, deliberately. Be one of them.


If you found this helpful, let me know by leaving a 👍 or a comment! And if you think this post could help someone, feel free to share it. Thank you very much! 😃
