When a company brings me in as a fractional CTO or for a technical review, the first question is always some version of: "Is our system healthy?"
That question is too broad to answer directly. So I break it down into a structured audit across 12 dimensions — each scored 0–10, each with a risk level, each driving a prioritized action list. Over the past few years this framework has become my go-to tool for evaluating systems I've never seen before, in industries ranging from logistics to legal tech to government SaaS.
This post explains the framework in detail. Not as a checklist to print and forget, but as a mental model for thinking about systems holistically.
Why Frameworks Matter (and Where They Break Down)
Ad-hoc reviews produce ad-hoc findings. You notice what you know, miss what you don't, and end up with a list of opinions rather than an assessment.
A scoring framework forces coverage. It also creates an artifact that a non-technical stakeholder can read. An executive doesn't care that you're using Optional instead of T | None in Python. They do care that your Security score is 4/10 and that three of the findings are OWASP Top 10 violations.
The limitation: scores are a starting point, not a verdict. A system with a 7/10 Architecture score and a 3/10 Security score is more dangerous than one with consistent 5s across the board. Context matters. The framework surfaces what to look at; judgment determines what to do about it.
The 12 Dimensions
1. General Architecture
Before looking at any code, I ask: what is this system's intended architecture, and does the implementation match the intent?
The key distinctions:
- Monolith vs. microservices: Is the choice appropriate for the team size and traffic? A five-person team running twelve microservices is paying complexity costs they can't afford.
- Layered vs. Clean Architecture: Are business rules isolated from infrastructure? Can you swap the database without rewriting business logic?
- DDD alignment: Do the bounded contexts in the code map to the actual problem domain, or did someone read a book and add "Domain" to every class name?
What I'm actually looking for: coupling. How much does changing one thing break other things? High coupling is the primary driver of slow development and high defect rates.
Red flags: circular dependencies between modules, business logic in controller/route handlers, God classes with 2000+ lines, no clear seam between "what the system does" and "how it does it."
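The first red flag, circular dependencies, is one of the few that can be checked mechanically. A minimal sketch, assuming you have already extracted a module-to-imports mapping from the codebase (the `deps` dictionary and the module names below are hypothetical, and real import extraction is out of scope here):

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of module names, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {module: WHITE for module in deps}
    path: List[str] = []

    def visit(module: str) -> Optional[List[str]]:
        color[module] = GRAY
        path.append(module)
        for dep in deps.get(module, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we have looped back to it
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[module] = BLACK
        return None

    for module in list(deps):
        if color[module] == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


# Hypothetical module graph: billing -> orders -> users -> billing
graph = {
    "billing": ["orders"],
    "orders": ["users"],
    "users": ["billing"],
    "reports": ["orders"],
}
cycle = find_cycle(graph)
```

This is a plain depth-first search, so it reports only the first cycle it meets. That is usually enough to start the conversation about where the seams should be.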
2. Backend
Framework choice is a proxy war. The real questions are about how the framework is being used.
I evaluate:
- SOLID principles: Not academically — are responsibilities separated in a way that makes the code testable and changeable?
- Dependency Injection: Are dependencies passed in from outside, or is everything instantiated inline, making testing impossible without real infrastructure?
- Async architecture: For I/O-bound systems, is async used throughout, or are there blocking calls hiding in the hot path?
- Exception management: Are errors caught at the right level? Is the distinction between "expected failure" (user error) and "unexpected failure" (bug) clear in the codebase?
- Logging: Can you debug a production incident from the logs alone? Are log levels used meaningfully, or is everything INFO?
What I'm actually looking for: testability and debuggability. These are proxies for maintainability. A system that's hard to test is a system that's hard to change safely.
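To make the dependency injection and exception points concrete, here is a small sketch. All names (`UserService`, `UserRepository`, `UserNotFound`) are hypothetical and not taken from any audited system. The service receives its repository from the caller, so a test can pass an in-memory fake instead of a real database, and the "expected failure" has its own exception type.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class User:
    id: int
    email: str


class UserRepository(Protocol):
    """The only thing the service knows about storage."""

    def get(self, user_id: int) -> Optional[User]: ...


class InMemoryUserRepository:
    """Test double: satisfies UserRepository without any infrastructure."""

    def __init__(self, users: Dict[int, User]) -> None:
        self._users = users

    def get(self, user_id: int) -> Optional[User]:
        return self._users.get(user_id)


class UserNotFound(Exception):
    """Expected failure: the caller asked for a user that does not exist."""


class UserService:
    def __init__(self, repo: UserRepository) -> None:
        # The dependency is injected, not constructed inline.
        self._repo = repo

    def email_domain(self, user_id: int) -> str:
        user = self._repo.get(user_id)
        if user is None:
            raise UserNotFound(user_id)
        return user.email.split("@", 1)[1]


service = UserService(InMemoryUserRepository({1: User(1, "ana@example.com")}))
```

If `UserService` built its own database connection inside `__init__`, the same test would need a running database. That difference is what the review is looking for.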
3. Frontend
Frontend reviews often get less rigor than backend reviews. That's a mistake — frontend is where most user-facing bugs live, and where security issues like XSS originate.
Key areas:
- Component architecture: Is there a clear separation between presentational and container components? Are components appropriately sized?
- State management: Is global state minimized? Are you using a state manager because you need to, or because it was in the boilerplate?
- Performance: Time to interactive, bundle size, lazy loading. Are you shipping 4MB of JavaScript for a mostly-static page?
- Accessibility: Can the application be used without a mouse? Are ARIA attributes present and meaningful?
The question that cuts through the noise: If a junior developer joins tomorrow, can they find where to make a UI change in under 10 minutes?
4. Database
Schema design is where technical debt crystallizes into permanent form. Bad schemas don't get fixed — they get worked around, forever.
I look at:
- Normalization: Are there repeating groups? Is data that should be derived being stored redundantly?
- Indexing: Are there foreign keys without indexes? Is the query planner using the indexes you have? (Explain plans, not assumptions.)
- Query patterns: Are N+1 queries present in ORM usage? Are there unbounded SELECT * queries in production code paths?
- Migration strategy: Is schema change managed through a migration tool (Alembic, Flyway, etc.) with version-controlled, reversible migrations? Or are developers running ALTER TABLE directly on production?
- Backup and recovery: When was the last backup tested? "We have backups" and "we can restore from backups" are different things.
The hardest question: What is your RTO/RPO, and does your backup strategy actually meet it?
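The N+1 pattern from the list above is easier to recognize once you have seen it counted. The sketch below uses Python's built-in `sqlite3` with a hypothetical `authors`/`posts` schema and a small counting wrapper (`CountingDb` is an illustration, not a library class). An ORM hides the same behavior behind lazy-loaded relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title TEXT NOT NULL
    );
    INSERT INTO authors (id, name) VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO posts (author_id, title) VALUES (1, 'a1'), (1, 'a2'), (2, 'b1');
""")


class CountingDb:
    """Thin wrapper that counts how many queries are sent to the database."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection
        self.count = 0

    def query(self, sql: str, params: tuple = ()) -> list:
        self.count += 1
        return self.connection.execute(sql, params).fetchall()


def titles_n_plus_one(db: CountingDb) -> dict:
    # One query for the authors, then one more query per author.
    result = {}
    for author_id, name in db.query("SELECT id, name FROM authors ORDER BY id"):
        rows = db.query(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(db: CountingDb) -> dict:
    # Same result from a single query using a LEFT JOIN.
    result: dict = {}
    rows = db.query("""
        SELECT a.name, p.title
        FROM authors a LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """)
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result


slow_db, fast_db = CountingDb(conn), CountingDb(conn)
slow_result = titles_n_plus_one(slow_db)
fast_result = titles_joined(fast_db)
```

With three authors the first version issues four queries and the second issues one. The gap grows linearly with the number of parent rows, which is why it tends to stay invisible in development data.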
5. API Design
A well-designed API is a contract. A poorly designed one is a trap — for your frontend developers, your integration partners, and your future self.
Evaluation criteria:
- REST semantics: Are HTTP verbs used correctly? Is GET idempotent? Does DELETE return the right status code?
- Versioning: Is there a versioning strategy before it's needed, or will you break clients when you need to change a response shape?
- Error handling: Do error responses follow a consistent structure? Does 400 vs 422 vs 500 mean something, or is it random?
- Rate limiting: Is rate limiting implemented before launch, or after the first abuse incident?
- Pagination: Are list endpoints paginated from day one? SELECT * FROM events returning 2 million rows is a real incident.
What I'm actually looking for: does the API communicate intent, or does it require the consumer to know implementation details?
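As one illustration of the pagination and error-handling criteria, here is a framework-agnostic sketch. The envelope shape, the `MAX_PAGE_SIZE` cap, and the `invalid_pagination` error code are all assumptions made for the example, not a standard. What matters is that every list response is bounded and every error has the same structure.

```python
from typing import Any, Dict, Sequence

MAX_PAGE_SIZE = 100  # hypothetical server-side cap; clients cannot exceed it


def error_response(status: int, code: str, message: str) -> Dict[str, Any]:
    """Every error uses the same shape, whatever the endpoint."""
    return {"status": status, "error": {"code": code, "message": message}}


def paginate(items: Sequence[Any], offset: int = 0, limit: int = 20) -> Dict[str, Any]:
    """Return one bounded page of items plus the data needed to fetch the next."""
    if offset < 0 or limit < 1:
        return error_response(
            422, "invalid_pagination", "offset must be >= 0 and limit must be >= 1"
        )
    limit = min(limit, MAX_PAGE_SIZE)  # never trust the requested page size
    page = list(items[offset:offset + limit])
    has_more = offset + limit < len(items)
    return {
        "status": 200,
        "data": page,
        "pagination": {
            "offset": offset,
            "limit": limit,
            "total": len(items),
            "next_offset": offset + limit if has_more else None,
        },
    }


first_page = paginate(range(250), offset=0, limit=1000)
```

Offset pagination is used here only because it is the shortest to show. Cursor-based pagination is often a better fit for large or frequently changing tables.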
6. Security
Security review is where I slow down. A missed finding here has a different consequence than a missed finding in "Code Quality."
I follow OWASP Top 10 as a baseline and add:
Authentication: Are credentials hashed with a modern algorithm (bcrypt, Argon2)? Is there account lockout? Is password reset implemented securely (time-limited tokens, single-use)?
Authorization: Is authorization checked at the service layer, not just the route layer? Is there a test for horizontal privilege escalation (user A accessing user B's data)?
Injection: Are all SQL queries built with parameter binding (through the ORM or the driver), or is raw string interpolation present anywhere? Is user input ever passed to shell commands?
Secrets management: Are secrets in environment variables (acceptable) or hardcoded/committed to source control (not acceptable)? Is .env in .gitignore?
Transport security: Is HTTPS enforced? Are security headers present (CSP, HSTS, X-Frame-Options)?
The audit pattern that finds the most real vulnerabilities: Follow the data. Pick a piece of user-provided data and trace it from HTTP request to database and back. Every transformation and validation point along the way is a potential vulnerability.
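A small example of what "follow the data" tends to turn up at the database boundary. The sketch uses Python's built-in `sqlite3` and a hypothetical `users` table. The same contrast between interpolation and parameter binding applies to any driver or ORM escape hatch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("ana@example.com",), ("ben@example.com",)],
)


def find_unsafe(email: str) -> list:
    # VULNERABLE: user input becomes part of the SQL text itself.
    sql = f"SELECT id, email FROM users WHERE email = '{email}'"
    return conn.execute(sql).fetchall()


def find_safe(email: str) -> list:
    # Parameter binding: the input is sent as a value, never parsed as SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?", (email,)
    ).fetchall()


# A classic injection payload: closes the string and adds an always-true clause.
payload = "x' OR '1'='1"
leaked_rows = find_unsafe(payload)
safe_rows = find_safe(payload)
```

The unsafe version returns every row in the table for this payload, while the bound version returns nothing because no user has that literal email address. In a review, a search for f-strings or string concatenation near `execute` calls is a cheap first pass.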
7. Performance
Performance problems have two categories: those that exist now, and those that will exist at 10x current load.
Current state:
- Is there caching at the right layer (in-memory, Redis, CDN)?
- Are images optimized? Is there a CDN in front of static assets?
- Is compression enabled (Gzip/Brotli) for API responses?
Future state:
- What is the bottleneck at 10x? Is it the database? The application server? A third-party API with rate limits?
- Are there synchronous operations in the hot path that could be moved to a queue?
- Is there a single database instance, with no read replicas, that becomes the bottleneck under read load?
The question I ask in every review: What is the slowest operation in the system, and is it in the hot path?
8. DevOps
DevOps is the operational envelope around the software. A perfect codebase deployed manually to a single server is fragile. An imperfect codebase with solid CI/CD, automated rollbacks, and structured logging is manageable.
Key areas:
- CI/CD: Is every merge to main automatically tested and deployed? What is the deploy time? Can you roll back in under 5 minutes?
- Container strategy: Are containers immutable? Is the image built once and promoted through environments, or rebuilt at each stage?
- Environment parity: Are staging and production running the same infrastructure? Environment-specific bugs are a signal that they're not.
- Observability: Can you answer "is the system working right now?" without checking logs manually? Are there dashboards and alerts?
- Incident response: Is there a runbook? Does the team practice incident response, or is every incident a first-time experience?
9. Cloud Infrastructure
Cloud-specific concerns that are separate from application-level DevOps:
- Scalability: Is compute auto-scaling configured? Will the system scale on demand, or require manual intervention?
- High Availability: Are there single points of failure? Is the database replicated? Is the application deployed across availability zones?
- Disaster Recovery: Is there a tested DR plan? What is the geographic footprint?
- Cost model: Are resources appropriately sized? Are there reserved instances for predictable load?
The question that exposes unexamined assumptions: What happens if the primary region goes down for 4 hours? Walk me through it.
10. Testing
Testing is the only mechanism that provides evidence (not just confidence) that the system works correctly.
I evaluate:
- Coverage: What is the unit test coverage? More importantly: are the right things tested (business logic, not getters/setters)?
- Integration tests: Are there tests that exercise the real database, not mocks? Mocks that diverge from reality cause incidents.
- E2E tests: Is there a smoke test suite that runs on every production deployment?
- Test design: Are tests isolated? Is there test data management? Do tests fail for one reason, or do they entangle multiple concerns?
The coverage trap: 85% coverage achieved by testing trivial code is less valuable than 40% coverage of the critical paths. Coverage is a proxy metric. What matters is: can you confidently refactor?
11. Code Quality
Code quality is the accumulation of thousands of small decisions. I evaluate it at the structural level:
- Readability: Can a new engineer understand what a function does without reading its dependencies?
- Naming: Do variable and function names express intent? Is there Hungarian notation, single-letter variables in non-trivial contexts, or names that contradict the actual behavior?
- Modularity: Are modules sized appropriately? Are there 5000-line files?
- Technical debt: Is there a TODO archaeology layer where comments from three years ago reference tickets that no longer exist?
What I'm actually measuring: the cognitive load required to make a change. High cognitive load → slow velocity → more bugs → more technical debt. It compounds.
12. AI Readiness
In 2026, AI readiness is a first-class architectural concern. Not "are you using AI?" but "is the system designed to integrate AI as capabilities evolve?"
Key questions:
- API-first design: Is the system's business logic accessible via clean APIs, or is it buried in monolithic processes that AI agents can't interact with?
- Event-driven architecture: Is significant state change emitted as events that AI agents can react to?
- MCP compatibility: Are you thinking about Model Context Protocol? Can an AI agent inspect and act on your system's data through structured interfaces?
- RAG readiness: Is content structured in a way that supports retrieval-augmented generation? Are there semantic search capabilities or hooks for them?
- Data quality: AI is only as good as its training and retrieval data. Is your data structured, clean, and accessible?
The strategic question: In 18 months, a competitor will have AI agents that can automate 30% of what your users do manually. Does your architecture make that possible, or does it make it impossible?
The Scoring Model
Each dimension is scored 0–10:
| Score | Interpretation |
|---|---|
| 9–10 | Best practice. No meaningful improvement needed. |
| 7–8 | Solid. Minor improvements would have diminishing returns. |
| 5–6 | Adequate. Known gaps but no immediate risk. |
| 3–4 | Significant gaps. Improvement needed within 3–6 months. |
| 1–2 | Critical deficiencies. Immediate attention required. |
| 0 | Not implemented at all. |
From the 12 category scores I derive composite metrics:
- Overall Technical Score (0–100): weighted average, security and reliability weighted higher
- Production Readiness (%): a function of security, DevOps, testing, and observability scores
- Enterprise Readiness (%): adds compliance posture, audit logging, and multi-tenancy to production readiness
- AI Readiness Score: standalone, given its increasing strategic importance
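The composite calculation can be sketched as follows. The specific weights are assumptions chosen for illustration. The text above only says that security and reliability are weighted higher, so treat the numbers (and the choice of DevOps and Testing as the reliability proxies) as placeholders rather than the actual model.

```python
from typing import Dict

# Hypothetical weights: anything not listed gets DEFAULT_WEIGHT.
WEIGHTS: Dict[str, float] = {"Security": 2.0, "DevOps": 1.5, "Testing": 1.5}
DEFAULT_WEIGHT = 1.0

# Lower bound of each band in the scoring table, highest first.
BANDS = [
    (9, "Best practice"),
    (7, "Solid"),
    (5, "Adequate"),
    (3, "Significant gaps"),
    (1, "Critical deficiencies"),
    (0, "Not implemented"),
]


def interpret(score: int) -> str:
    """Map a 0-10 dimension score to its band label from the table."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    for floor, label in BANDS:
        if score >= floor:
            return label
    raise AssertionError("unreachable: the bands cover 0-10")


def overall_score(scores: Dict[str, float]) -> float:
    """Weighted average of 0-10 dimension scores, scaled to 0-100."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    for name, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} score out of range: {value}")
    total_weight = sum(WEIGHTS.get(name, DEFAULT_WEIGHT) for name in scores)
    weighted_sum = sum(
        value * WEIGHTS.get(name, DEFAULT_WEIGHT) for name, value in scores.items()
    )
    return round(10 * weighted_sum / total_weight, 1)


example = overall_score({"Security": 3, "General Architecture": 7})
```

With these placeholder weights, a weak Security score pulls the overall number down further than a plain average would (43.3 instead of 50.0 in the example). A single composite still hides outliers, so the per-dimension scores should always be reported alongside it.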
The Output
The audit produces four deliverables:
1. The scorecard: 12 scores, risk levels (Low/Medium/High/Critical), and one-line findings.
2. Critical issues: The 10 findings that carry the most risk, regardless of category. These are the things that could cause an incident, a breach, or a failed due diligence during a fundraise.
3. Quick wins: Changes that take under a week and have disproportionate impact. These maintain momentum and demonstrate value early.
4. Roadmap: Medium-term (1–3 months) and long-term (6–12 months) architectural improvements, with rough effort estimates and business justification.
What Surprises People Most
When I share results with engineering teams, the finding that surprises them most is almost always in Testing or Database — not Security, which most teams are at least anxious about.
Specifically:
- Integration tests that mock the database: Teams confidently claim "we have 80% coverage" and then discover that none of those tests would catch a query that works on SQLite but fails on PostgreSQL. This is a real incident pattern.
- Unmigrated schema changes: Tables that were altered directly in production, with no corresponding migration file. The application works, but you can't reproduce the production environment from source control alone.
- Missing composite indexes: The queries work fine at current data volumes. They'll time out at 10x. Nobody notices because load testing isn't in the development process.
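The missing-index finding is the easiest of the three to demonstrate, because the query planner will tell you directly. The sketch below uses Python's built-in `sqlite3` purely because it needs no setup. The `events` table and index name are hypothetical, and the plan wording differs between SQLite versions and from PostgreSQL's `EXPLAIN` output, so in a real audit you would run this against the production engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        tenant_id INTEGER NOT NULL,
        created_at TEXT NOT NULL,
        payload TEXT
    )
""")

QUERY = "SELECT id FROM events WHERE tenant_id = ? AND created_at >= ?"
PARAMS = (1, "2024-01-01")


def plan(sql: str, params: tuple) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable detail string.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY, PARAMS)  # full table scan: no usable index exists

# Composite index matching the filter: equality column first, range column second.
conn.execute(
    "CREATE INDEX idx_events_tenant_created ON events (tenant_id, created_at)"
)
plan_after = plan(QUERY, PARAMS)  # the planner now searches via the new index
```

Column order matters in a composite index: putting `created_at` first would serve this query less well, because the equality filter on `tenant_id` could no longer narrow the range scan.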
A Note on Judgment
This framework is a tool for structured thinking, not a replacement for it.
A system with a 6/10 Security score at an internal tool company is a different situation than a 6/10 at a payment processor. A 4/10 Testing score on a prototype is acceptable; on a system processing medical records, it's a liability.
The framework surfaces what to look at. The review conversation — with the engineering team, with the CTO, with the business stakeholders — determines what it means and what to do about it.
That's the part that can't be automated.
If you're about to make a significant technical investment — new hire, rewrite, acquisition — and want an independent architecture review, I'm available for fractional engagements.