Design & Trade-Off Thinking
Why did I choose this design over at least two alternatives?
What am I optimizing for: latency, cost, scalability, simplicity, or speed to market?
What assumptions am I making that could later prove false?
Which part of this design is the most fragile?
If load doubles, which component breaks first?
If requirements change, which component is hardest to modify?
What would I change if I had half the budget?
What would I change if traffic increased 10× overnight?
Scale & Performance
Which component becomes the bottleneck at scale?
How does this behave under uneven traffic or hot keys?
What happens during a traffic spike?
How do we protect downstream systems?
How do we degrade gracefully instead of failing hard?
Which data access paths are on the critical path?
How do we cache without breaking correctness?
How do we scale reads and writes independently?
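One way to make the questions about protecting downstream systems and degrading gracefully concrete is a load shedder: cap the number of in-flight calls to a dependency and serve a fallback when the cap is hit. The sketch below is a minimal illustration, not a production implementation; `LoadShedder`, `max_in_flight`, `primary`, and `fallback` are hypothetical names chosen for this example.

```python
import threading


class LoadShedder:
    """Caps concurrent calls to a downstream dependency and sheds the excess."""

    def __init__(self, max_in_flight):
        # Each in-flight call holds one slot until it finishes.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, primary, fallback):
        # Non-blocking acquire: if no slot is free, degrade instead of queueing.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return primary()
        finally:
            self._slots.release()
```

The fallback might return a cached or simplified response. The point is that excess traffic gets a fast, cheap answer instead of piling up on a dependency that is already saturated.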
Failure & Resilience
What fails first in this system?
What happens when a dependency is slow or down?
How does the system recover from partial failures?
Is the failure visible or silent?
How do we prevent cascading failures?
Do retries make things worse?
What happens during deployment failures?
Can we roll back safely?
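Retries tend to make things worse when every client retries immediately and in lockstep. A common mitigation is capped exponential backoff with jitter and a bounded number of attempts. The sketch below assumes that only timeouts and connection errors are worth retrying; `call_with_retries` and its parameters are hypothetical names for illustration.

```python
import random
import time


def call_with_retries(fn, attempts=4, base=0.1, cap=2.0,
                      retryable=(TimeoutError, ConnectionError),
                      sleep=time.sleep, rng=random.random):
    """Calls fn, retrying transient errors with full-jitter exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            # Out of attempts: surface the error instead of retrying forever.
            if attempt == attempts - 1:
                raise
            # Full jitter: sleep a random time in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))
```

Errors outside `retryable` propagate immediately, which keeps the helper from retrying requests that can never succeed. Retried operations also need to be idempotent, or the retry itself can duplicate work.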
Cost & Efficiency
What is the monthly cost of this design?
Which components drive the most cost?
How does cost scale with traffic?
What happens to cost at 10× usage?
Where can we trade cost for latency?
Where can we trade cost for reliability?
Are we paying for unused capacity?
Is serverless cheaper or more expensive here?
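The questions about how cost scales, and whether serverless is cheaper, can be answered roughly with a linear model: a fixed monthly charge plus a price per million requests. The sketch below assumes that model; all prices in the usage note are made-up illustrative numbers, not real provider pricing.

```python
def monthly_cost(requests, fixed, per_million):
    """Monthly cost under a simple fixed-plus-variable model."""
    return fixed + per_million * requests / 1_000_000


def break_even_requests(fixed_a, per_million_a, fixed_b, per_million_b):
    """Monthly request volume at which options A and B cost the same.

    Returns None when both options have the same per-request price,
    because the two cost lines never cross.
    """
    if per_million_a == per_million_b:
        return None
    return (fixed_b - fixed_a) * 1_000_000 / (per_million_a - per_million_b)
```

With a hypothetical serverless option at 0 fixed and 20 per million requests, and a provisioned option at 400 fixed and 4 per million, the break-even is 25 million requests per month. At 10 million requests the serverless option costs 200 against 440; at 100 million it costs 2,000 against 800, so the answer flips as usage grows.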
Security & Risk
What data is sensitive?
Where is data exposed in transit or at rest?
How do we limit blast radius if credentials leak?
What happens if this API is abused?
How do we enforce least privilege?
How do we audit access?
How do we detect suspicious behavior?
How do we comply with regulations and standards (HIPAA, SOC 2, GDPR)?
Operability & Supportability
How do we know the system is healthy?
What metrics matter most?
How fast can we detect and debug issues?
Can on-call engineers understand this system at 3 AM?
What logs are critical?
What dashboards must exist?
What alerts are actionable vs noisy?
Data & Consistency
What consistency model do we need?
Where is eventual consistency acceptable?
What happens if data is duplicated?
How do we handle partial updates?
How do we reconcile failures?
What is the source of truth?
How do schema changes affect the system?
How do we migrate data safely?
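For the question about duplicated data, one common answer is to make writes idempotent: each request carries a unique key, and a repeated key returns the stored result instead of running the operation again. The sketch below keeps results in an in-memory dictionary purely for illustration; a real system would need a durable store with an atomic check-and-set. `IdempotentProcessor` and `handler` are hypothetical names.

```python
class IdempotentProcessor:
    """Runs a handler at most once per idempotency key."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: in-memory storage; not durable and not safe across processes.
        self._results = {}

    def process(self, key, payload):
        # A duplicate delivery returns the earlier result without side effects.
        if key in self._results:
            return self._results[key]
        result = self._handler(payload)
        self._results[key] = result
        return result
```

A message delivered twice with the same key then produces one side effect, which is what makes retries and at-least-once delivery safe to combine with this pattern.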
API & Integration Design
- Who are the consumers of this API?
- How do we version APIs without breaking clients?
- How do we handle backward compatibility?
- What happens if clients misuse the API?
- How do we enforce rate limits?
- How do we communicate breaking changes?
- Is synchronous or asynchronous better here?
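A token bucket is one widely used way to enforce rate limits: tokens refill at a steady rate up to a burst capacity, and each request spends one. The sketch below is a single-process illustration with an injectable clock; `TokenBucket`, `rate`, and `capacity` are hypothetical names, and a shared limiter across several servers would need a central store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill for the elapsed time, never exceeding the burst capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per client or API key and answer with HTTP 429 when `allow()` returns `False`.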
AI / GenAI / Agentic Systems
- Why use GenAI here instead of rules?
- What happens when the model hallucinates?
- How do we validate AI responses?
- How do we control cost per request?
- What data should never go to the model?
- What tools does the agent have access to?
- What if the agent makes a wrong decision?
- Where is human approval required?
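The questions about validating AI responses and requiring human approval can be combined into a gate that sits between the model and any side effect. The sketch below assumes the model is asked to return a JSON object with an `action` field; the action names, the `amount` field, and the `max_refund` limit are all hypothetical and exist only for illustration.

```python
import json

# Hypothetical allow-list: the only actions the agent may request.
ALLOWED_ACTIONS = {"refund", "escalate", "reply"}


def validate_model_output(raw, max_refund=100.0):
    """Returns (decision, reason), where decision is accept, review, or reject."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "reject", "not valid JSON"
    if not isinstance(data, dict):
        return "reject", "expected a JSON object"
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return "reject", "unknown action"
    if action == "refund":
        amount = data.get("amount")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
            return "reject", "invalid amount"
        if amount > max_refund:
            # Valid but high-impact: route to a human instead of acting.
            return "review", "amount above auto-approval limit"
    return "accept", "ok"
```

Structural checks like these do not catch every hallucination, but they keep malformed or out-of-policy output from reaching tools, and they give an explicit place to require human approval.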
Business & Long-Term Thinking
- How does this architecture support business goals?
- What business risk does this reduce?
- How does this enable faster feature delivery?
- How do I explain this to a non-technical leader?
- How will this system evolve in 2–3 years?
- Which decisions are hard to reverse?
- What tech debt is acceptable vs dangerous?