We are going to build a design review agent that reads an engineering spec and flags missing context, security gaps, and operational concerns before implementation starts. It runs from a single Python file and returns a structured JSON report you can paste into a PR or ticket. Teams that review specs manually spend hours on consistency checks that a model can handle in seconds, and Oxlo.ai's flat per-request pricing (see https://oxlo.ai/pricing) means long documents do not inflate your bill.
What you'll need
- Python 3.10 or newer
- An Oxlo.ai API key from https://portal.oxlo.ai
- The OpenAI SDK: pip install openai
- A sample markdown spec to test against
Step 1: Scaffold the client and verify the connection
Start by instantiating the client against Oxlo.ai. I keep the API key in an environment variable. I use Llama 3.3 70B because it handles long context windows reliably, and since Oxlo.ai charges per request rather than per token, feeding it a 3,000-word spec costs the same as a one-liner.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"],
)

# Quick smoke test
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Reply with exactly: Connection OK"}],
)
assert "Connection OK" in response.choices[0].message.content
print("Oxlo.ai client ready")
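Indexing `os.environ["OXLO_API_KEY"]` directly raises a bare `KeyError` when the variable is unset, which can be confusing on a fresh machine. A minimal sketch of a friendlier lookup follows; `load_api_key` is a hypothetical helper name, not part of the original script.

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment or raise a readable error.

    Hypothetical helper: the original script indexes os.environ directly.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before running the reviewer."
        )
    return key
```

With this in place, the client could be built with `api_key=load_api_key()` instead of the direct lookup.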
Step 2: Define the review rubric
The agent needs a rigid rubric or it will invent its own criteria. I store the prompt in a module-level constant so it can be versioned in git. The rubric covers seven categories: goals, non-goals, dependencies, security, observability, rollback plans, and capacity.
SYSTEM_PROMPT = """You are a staff engineer performing a design review.
Read the provided engineering spec and evaluate it against the rubric below.
Return ONLY a JSON object. Do not wrap it in markdown.
Rubric categories (score 0-10):
1. goals: Are the goals and success criteria explicit?
2. non_goals: Are boundaries and out-of-scope items listed?
3. dependencies: Are upstream and downstream dependencies identified?
4. security: Are threat model, authz, and data handling addressed?
5. observability: Are metrics, logs, and alerts defined?
6. rollback: Is there a rollback or kill-switch plan?
7. capacity: Are load estimates and scaling assumptions stated?
For each category scoring below 7, emit a finding with a specific recommendation.
Also provide an overall summary and priority-ordered next steps.
JSON schema:
{
  "scores": {"goals": 8, "non_goals": 5, ...},
  "findings": [
    {"category": "non_goals", "score": 5, "issue": "...", "recommendation": "..."}
  ],
  "summary": "...",
  "next_steps": ["...", "..."]
}
"""
Step 3: Build the review function with JSON mode
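Now I wrap the API call in a function that accepts a spec string and returns parsed JSON. Oxlo.ai supports JSON mode on all chat models, so I set response_format to get syntactically valid JSON back. In OpenAI-style JSON mode that guarantees parseable output, not adherence to the schema, which still depends on the prompt.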
import json

def review_spec(spec_text: str) -> dict:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Engineering spec:\n\n{spec_text}"},
        ],
        response_format={"type": "json_object"},
    )
    raw = response.choices[0].message.content
    return json.loads(raw)
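Parseable JSON is not necessarily a report that follows the rubric, so a quick consistency check on the parsed result may be worthwhile. The sketch below assumes the schema described in SYSTEM_PROMPT; `validate_report`, `RUBRIC_CATEGORIES`, and `FINDING_THRESHOLD` are hypothetical names introduced for illustration.

```python
# Category names and the threshold mirror the rubric text in SYSTEM_PROMPT.
RUBRIC_CATEGORIES = (
    "goals", "non_goals", "dependencies", "security",
    "observability", "rollback", "capacity",
)
FINDING_THRESHOLD = 7  # the rubric asks for a finding below this score


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_report(report: dict) -> list[str]:
    """Return human-readable problems; an empty list means the report is consistent."""
    problems = []
    scores = report.get("scores", {})
    flagged = {
        f.get("category")
        for f in report.get("findings", [])
        if isinstance(f, dict)
    }
    for category in RUBRIC_CATEGORIES:
        score = scores.get(category)
        if not _is_number(score):
            problems.append(f"missing or non-numeric score for {category}")
        elif not 0 <= score <= 10:
            problems.append(f"score for {category} out of range: {score}")
        elif score < FINDING_THRESHOLD and category not in flagged:
            problems.append(f"{category} scored {score} but has no finding")
    return problems
```

One way to use it is to call `validate_report(result)` right after `review_spec` and retry or abort when the list is non-empty.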
Step 4: Add severity classification
A raw score is useful, but I want anything involving auth or data integrity escalated. I add a second pass that tags each finding with LOW, MEDIUM, HIGH, or CRITICAL. Because this is a separate API call, Oxlo.ai's flat per-request pricing keeps the cost predictable even when you chain multiple reasoning steps.
SEVERITY_PROMPT = """You are an SRE triaging engineering findings.
Given the JSON findings below, assign each a severity: LOW, MEDIUM, HIGH, or CRITICAL.
CRITICAL means auth, PII, data loss, or missing rollback without a guardrail.
Return a JSON object with a single key "findings" whose value is the same list,
with a 'severity' field added to each item."""

def classify_severity(findings: list) -> list:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SEVERITY_PROMPT},
            # Send an object rather than a bare list so the reply shape matches.
            {"role": "user", "content": json.dumps({"findings": findings}, indent=2)},
        ],
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    # Fall back to the untagged findings if the model omits the key.
    return data.get("findings", findings)
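Because the severity labels come from a model, it can help to normalize them and apply a deterministic floor for sensitive categories. The sketch below is one possible post-processing step, not part of the original script: `normalize_severities`, the MEDIUM default, and the HIGH floor for `security` and `rollback` are all assumptions chosen for illustration.

```python
SEVERITY_ORDER = ("LOW", "MEDIUM", "HIGH", "CRITICAL")
# Assumed policy: these categories are never reported below HIGH.
SEVERITY_FLOOR = {"security": "HIGH", "rollback": "HIGH"}


def normalize_severities(findings: list[dict]) -> list[dict]:
    """Return copies of the findings with clean severity labels, most severe first."""
    result = []
    for finding in findings:
        item = dict(finding)  # copy so the caller's data is untouched
        severity = str(item.get("severity", "")).strip().upper()
        if severity not in SEVERITY_ORDER:
            severity = "MEDIUM"  # assumed default for missing or unknown labels
        floor = SEVERITY_FLOOR.get(item.get("category"))
        if floor and SEVERITY_ORDER.index(severity) < SEVERITY_ORDER.index(floor):
            severity = floor
        item["severity"] = severity
        result.append(item)
    # Stable sort: findings with equal severity keep their original order.
    result.sort(key=lambda f: SEVERITY_ORDER.index(f["severity"]), reverse=True)
    return result
```

Applied as `normalize_severities(classify_severity(findings))`, this also guarantees every finding has a `severity` key before the report is printed.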
Step 5: Wire up the CLI
Finally, I add a small CLI that reads a markdown file, runs the review, classifies severity, and prints a formatted report.
import sys

def main(file_path: str):
    with open(file_path, "r", encoding="utf-8") as spec_file:
        spec = spec_file.read()
    result = review_spec(spec)
    result["findings"] = classify_severity(result["findings"])
    print("=" * 60)
    print("DESIGN REVIEW REPORT")
    print("=" * 60)
    print(f"Overall summary: {result['summary']}\n")
    for cat, score in result["scores"].items():
        print(f"{cat:15s}: {score}/10")
    print()
    for finding in result["findings"]:
        # Default the tag in case the severity pass returned an untagged finding.
        severity = finding.get("severity", "UNKNOWN")
        print(f"[{severity:8s}] {finding['category']}: {finding['issue']}")
        print(f" Recommendation: {finding['recommendation']}\n")
    print("Next steps:")
    for step in result["next_steps"]:
        print(f" - {step}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python reviewer.py design.md")
        sys.exit(1)
    main(sys.argv[1])
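The CLI above prints a formatted text report. If the raw JSON is more convenient for a PR or ticket, an optional switch could be added. This is a sketch using the standard-library argparse module; the `--json` flag, `build_parser`, and `format_json` are hypothetical additions, not part of the original script.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """CLI parser with an optional --json switch (hypothetical flag)."""
    parser = argparse.ArgumentParser(
        prog="reviewer.py",
        description="Review an engineering spec against the rubric.",
    )
    parser.add_argument("spec", help="path to the markdown spec")
    parser.add_argument(
        "--json",
        dest="as_json",
        action="store_true",
        help="print the raw JSON report instead of the formatted text",
    )
    return parser


def format_json(report: dict) -> str:
    """Serialize the report for pasting into a PR or ticket."""
    return json.dumps(report, indent=2, ensure_ascii=False)
```

In the entry point, `args = build_parser().parse_args()` would replace the manual `sys.argv` check, and `print(format_json(result))` could run when `args.as_json` is set.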
Run it
Create a file named design.md with a thin spec, then run the script.
# Service X Migration
## Goals
Move user data from Postgres to DynamoDB.
## Non-Goals
None.
## Dependencies
Internal auth service.
## Rollback
TBD.
Run the agent:
$ export OXLO_API_KEY=your-api-key
$ python reviewer.py design.md
Example output:
============================================================
DESIGN REVIEW REPORT
============================================================
Overall summary: The spec identifies a migration goal but lacks non-goals, capacity planning, observability, and a concrete rollback strategy.

goals          : 8/10
non_goals      : 2/10
dependencies   : 6/10
security       : 4/10
observability  : 1/10
rollback       : 2/10
capacity       : 0/10

[CRITICAL] rollback: No rollback plan defined for a data migration.
 Recommendation: Define a backward-compatible write path and a revert script with data validation before cutover.

[HIGH    ] observability: No metrics, alarms, or log aggregation strategy listed.
 Recommendation: Add row-count reconciliation metrics and p99 latency alarms for both stores during dual-write.

[MEDIUM  ] non_goals: Section is empty, allowing scope creep.
 Recommendation: Explicitly list out-of-scope features, such as historical data backfill and real-time sync.

Next steps:
- Draft a detailed rollback runbook with go/no-go criteria.
- Add observability requirements including dashboards and paging thresholds.
- Fill capacity estimates for peak write throughput.
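The console report is plain text. For pasting into a PR or ticket, the same result dictionary could be rendered as Markdown instead. A minimal sketch, assuming the report keys from the schema in SYSTEM_PROMPT plus the `severity` field; `render_markdown` is a hypothetical helper, not part of the original script.

```python
def render_markdown(report: dict) -> str:
    """Render the review as Markdown; missing keys fall back to empty values."""
    lines = ["## Design review", "", str(report.get("summary", "")), ""]
    lines += ["| Category | Score |", "| --- | --- |"]
    for category, score in report.get("scores", {}).items():
        lines.append(f"| {category} | {score}/10 |")
    lines.append("")
    for finding in report.get("findings", []):
        severity = finding.get("severity", "UNKNOWN")
        lines.append(
            f"- **{severity}** `{finding.get('category', '')}`: "
            f"{finding.get('issue', '')}"
        )
        lines.append(f"  - Recommendation: {finding.get('recommendation', '')}")
    steps = report.get("next_steps", [])
    if steps:
        lines += ["", "### Next steps"]
        lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines)
```

Calling `print(render_markdown(result))` at the end of `main` would produce a comment-ready version of the same report.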
Wrap-up
This agent replaces the first-pass consistency check I used to do manually. If you want to extend it, wire it into a GitHub Action so every PR containing a design doc gets an automatic review comment. You could also swap in DeepSeek V3.2 or Qwen 3 32B on Oxlo.ai for heavier reasoning workloads by changing only the model string, because the client is fully OpenAI-compatible.
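For the GitHub Action idea, one option is to have the script exit with a non-zero status when a blocking finding exists, since a CI step fails on a non-zero exit code. A minimal sketch; `exit_code_for` and the choice of HIGH and CRITICAL as blocking levels are assumptions, not part of the original script.

```python
# Assumed policy: HIGH and CRITICAL findings should fail the CI step.
BLOCKING_SEVERITIES = frozenset({"HIGH", "CRITICAL"})


def exit_code_for(report: dict, blocking=BLOCKING_SEVERITIES) -> int:
    """Return 1 if any finding carries a blocking severity, otherwise 0."""
    for finding in report.get("findings", []):
        if str(finding.get("severity", "")).strip().upper() in blocking:
            return 1
    return 0
```

Ending `main` with `sys.exit(exit_code_for(result))` would pass that status to the workflow.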