The Problem with "AI-Assisted" Development
Most AI coding tools today are autocomplete on steroids. They make you faster at typing, but the fundamental loop hasn't changed: you still decompose requirements, design architecture, write code, write tests, and review — one step at a time, context-switching between roles.
What if you could delegate the whole loop?
That's the question behind AI SDLC — a multi-agent pipeline where a chain of specialised AI agents handles every phase of the software development life cycle. You write a plain-English task description. One command later, you have:
- A structured software specification
- A full technical design with an implementation checklist
- Working Python source code
- pytest unit tests (edge cases included)
- A code review with severity-coded issues
No scaffolding. No boilerplate. No switching tabs.
The Landscape: Agentic AI Frameworks in 2026
Before diving into the implementation, it's worth understanding where this fits in the current ecosystem.
Approaches to multi-agent orchestration
| Framework | Model | Best for |
|---|---|---|
| LangGraph | Graph/state machine | Sequential pipelines, conditional routing, checkpointing |
| AutoGen | Conversation-based | Back-and-forth agent dialogues, human-in-the-loop |
| CrewAI | Role-based crew | Parallel task execution, hierarchical delegation |
| OpenAI Swarm | Handoff-based | Lightweight, low-boilerplate agent handoffs |
| Semantic Kernel | Plugin/planner | Enterprise .NET/Python integrations |
Each has its niche. LangGraph is the right choice here because the SDLC is fundamentally a directed acyclic pipeline with conditional error exits. State flows forward, agents don't loop back, and failures need to short-circuit gracefully. That's exactly what LangGraph's StateGraph was built for.
Why not just one big prompt?
A single "write me an app from this description" prompt degrades quickly for non-trivial tasks:
- Context collapse — one prompt can't simultaneously be a BA, architect, developer, QA engineer, and reviewer without each role undermining the others
- No specialisation — a general prompt produces general output; specialised prompts with role-specific context produce expert output
- No accountability — you can't easily replay from the architect stage if only the code was wrong
- Token ceiling — a single-turn mega-prompt blows up for anything beyond toy examples
The pipeline approach solves all four.
Architecture Overview
Every agent is a LangGraph node. Every edge is either unconditional (`start → load_task`, `write_artifacts → END`) or conditional (check `state["status"]`, route to `error_handler` if `"error"`).
The entire pipeline shares one typed state object (SDLCState), defined once and validated throughout:
```python
from typing import Annotated, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class SDLCState(TypedDict):
    task_md: str                     # Input
    spec_md: str                     # BA output
    tech_design_md: str              # Architect output (updated by Dev)
    generated_code: dict[str, str]   # Dev output: filename → content
    test_code: dict[str, str]        # QA output: filename → content
    code_review_md: str              # Review output
    project_name: str                # Extracted from spec
    status: str                      # "running" | "error" | "done"
    current_agent: str
    error: Optional[str]
    messages: Annotated[list[BaseMessage], add_messages]
```
Agents return only the keys they modify. LangGraph merges partial updates into the full state automatically.
The Five Agents
1. BA Agent — gpt-4o-mini
Input: raw task description
Output: structured Markdown spec
The BA agent takes the free-form task and produces a proper specification document:
```markdown
project_name: simple_cli_todo_list

## Overview
A command-line to-do application that runs in a loop...

## Goals
- Provide a simple, interactive interface for managing tasks
- Support add, show, and delete operations

## Functional Requirements
- FR-1: `add "item"` appends a new item to the list
- FR-2: `show` displays all items numbered 1-based
- FR-3: `delete N` removes the item at position N
- FR-4: `quit` exits the loop gracefully

## Non-Functional Requirements
- Pure Python, no external dependencies
- Single-file implementation preferred
```
The first line is always `project_name: <snake_case_name>` — this is parsed with a regex and used to name all output folders for the rest of the run.
Why gpt-4o-mini? Structured document generation from a template is a lightweight task. The mini model is fast, cheap, and plenty capable here.
2. Architect Agent — gpt-4o-mini
Input: spec from BA
Output: full technical design + implementation checklist
The Architect produces a complete design document covering components, data models, data flow, tech stack, and file structure. The critical part is the Implementation Plan section — a numbered checklist in - [ ] format:
```markdown
## Implementation Plan
- [ ] 1. Define `TodoList` class with internal list storage
- [ ] 2. Implement `add_item(text)` method
- [ ] 3. Implement `show_items()` method
- [ ] 4. Implement `delete_item(n)` method with bounds checking
- [ ] 5. Write `main()` loop with command parsing
- [ ] 6. Handle invalid commands and out-of-range deletes
```
This checklist isn't just documentation — the Dev Agent updates it after code generation.
3. Dev Agent — gpt-4o (two LLM calls)
Input: technical design
Output: source files as {filename: content} dict + updated tech design
This is the most complex agent. It makes two sequential LLM calls:
Call 1 — Code generation:
Returns a JSON object mapping filenames to file content. The strict JSON output contract lets us reliably parse multi-file outputs regardless of LLM formatting variations:
```json
{
  "todo.py": "\"\"\"Simple CLI to-do list.\"\"\"\n\nclass TodoList:\n    ...",
  "main.py": "from todo import TodoList\n\ndef main():\n    ..."
}
```
A `_parse_json_output()` helper strips markdown fences before parsing — LLMs are inconsistent about whether they wrap JSON in fenced code blocks.
Call 2 — Plan update:
Takes the tech design + generated filenames, rewrites the implementation plan with all steps marked [x] and annotated with the file that implements each step:
```markdown
- [x] 1. Define `TodoList` class → todo.py
- [x] 2. Implement `add_item(text)` method → todo.py
- [x] 5. Write `main()` loop with command parsing → main.py
```
The updated tech_design_md (with checked-off plan) replaces the original in state and gets persisted to disk. When you open artifacts/<project>/tech_design.md after a run, you see exactly what was built and where.
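The check-off half of that contract is simple enough to illustrate without an LLM (a deterministic sketch — the real second call also appends the `→ file` annotations, which needs the model):

```python
import re


def mark_plan_done(tech_design_md: str) -> str:
    """Check every unchecked box in a `- [ ]` implementation plan."""
    return re.sub(r"- \[ \]", "- [x]", tech_design_md)
```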
Why gpt-4o for Dev? Code generation quality matters. The gap between gpt-4o and gpt-4o-mini on code is meaningful, especially for edge case handling, idiom correctness, and docstrings.
4. QA Agent — gpt-4o
Input: all generated source files
Output: pytest test files as {filename: content} dict
The QA Agent reads every source file and writes comprehensive pytest tests. The key insight in the prompt: the QA agent is given the actual source code, not just the spec — so the tests match the implementation's structure (real method names, real class names).
Generated tests cover:
- Happy paths (standard usage)
- Edge cases (empty list, boundary indices)
- Error conditions (invalid input, out-of-range delete)
- `unittest.mock` for any I/O or external calls
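For the to-do example, a generated test might look something like this (illustrative only — the inline `TodoList` stub is a stand-in so the snippet is self-contained, not the pipeline's actual output):

```python
import pytest


class TodoList:
    """Minimal stand-in matching the design's API, so the example runs."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def add_item(self, text: str) -> None:
        self._items.append(text)

    def delete_item(self, n: int) -> None:
        # 1-based index with bounds checking, per the implementation plan
        if not 1 <= n <= len(self._items):
            raise IndexError("item number out of range")
        del self._items[n - 1]


def test_delete_out_of_range():
    todo = TodoList()
    todo.add_item("buy milk")
    with pytest.raises(IndexError):
        todo.delete_item(5)
```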
5. Review Agent — gpt-4o-mini
Input: all source files + all test files
Output: structured Markdown code review
The review doc follows a consistent schema:
```markdown
## Summary
Clean implementation of the requirements. Single-file structure is appropriate.

## Issues
| # | Severity | Location | Issue | Recommendation |
|---|---|---|---|---|
| 1 | 🟡 Minor | todo.py:14 | No type hints on public methods | Add `-> None` / `-> str` annotations |
| 2 | 🔵 Info | main.py:3 | No `if __name__ == "__main__"` guard | Wrap `main()` call |

## Test Coverage Assessment
Tests cover all three commands and error paths. Missing: concurrent access scenario (out of scope for CLI).

**Verdict: ✅ Approved**
```
Severity codes: 🔴 Critical, 🟠 High, 🟡 Minor, 🔵 Info.
State Management: The Secret Sauce
LangGraph's state model is what makes this architecture clean.
Context isolation between agents
Each agent resets "messages": [] in its return dict. Because messages uses LangGraph's add_messages reducer — which accumulates messages — returning an empty list clears the accumulated history:
```python
def ba_agent(state: SDLCState) -> dict:
    response = llm.invoke([SystemMessage(...), HumanMessage(...)])
    return {
        "spec_md": response.content,
        "project_name": _extract_project_name(response.content),
        "current_agent": "ba_agent",
        "messages": [],  # ← clears history for the next agent
    }
```
Without this, each subsequent agent would see the entire conversation history from all previous agents — a context bleed that confuses specialised roles and wastes tokens.
Conditional error routing
Every edge (except the final ones) uses the same routing factory:
```python
def _route(next_node: str):
    def route(state: SDLCState) -> str:
        if state.get("status") == "error":
            return "error_handler"
        return next_node
    return route


builder.add_conditional_edges("ba_agent", _route("architect_agent"))
builder.add_conditional_edges("architect_agent", _route("dev_agent"))
# ... and so on for dev → qa → review
```
Any agent can fail gracefully by returning {"status": "error", "error": "message"}. The graph short-circuits to error_handler without affecting already-written artifacts. This is critical for real-world use where LLM calls occasionally fail or return malformed output.
Checkpointing and resumability
The graph compiles with MemorySaver():
```python
graph = build_graph()  # compiled with MemorySaver checkpointer
```
Every invocation gets a unique thread_id (UUID). This means state is checkpointed at every node boundary. You can resume a failed run or inspect intermediate state without replaying the whole pipeline.
The CLI exposes this with run-from:
```bash
# Code generation was wrong? Re-run from Dev, reusing existing spec + tech design
python sdlc_cli.py run-from dev
```
This loads persisted artifacts back into state up to the requested restart point, saving both time and API cost.
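A sketch of that reload step, assuming the artifact filenames from the Output Structure section (the helper name, the stage→artifact mapping, and the `root` parameter are all hypothetical):

```python
from pathlib import Path

# Artifacts each restart point needs already on disk (assumed mapping)
REQUIRED_FOR = {
    "architect": ["spec.md"],
    "dev": ["spec.md", "tech_design.md"],
    "qa": ["spec.md", "tech_design.md"],
    "review": ["spec.md", "tech_design.md"],
}

# Which state key each artifact file feeds
STATE_KEYS = {"spec.md": "spec_md", "tech_design.md": "tech_design_md"}


def load_state_for(stage: str, project: str,
                   root: Path = Path("artifacts")) -> dict:
    """Read persisted artifacts back into the state keys a restart needs."""
    state: dict = {"project_name": project, "status": "running"}
    for filename in REQUIRED_FOR.get(stage, []):
        path = root / project / filename
        state[STATE_KEYS[filename]] = path.read_text(encoding="utf-8")
    return state
```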
The write_artifacts Node
One of the stronger design decisions: agents never touch the filesystem.
All agents are pure functions of state → state. The filesystem write is centralised in a single write_artifacts node that runs only after all agents succeed:
```python
def write_artifacts(state: SDLCState) -> dict:
    name = state["project_name"]
    write_artifact(name, "spec.md", state["spec_md"])
    write_artifact(name, "tech_design.md", state["tech_design_md"])
    write_artifact(name, "code_review.md", state["code_review_md"])
    all_code = {**state["generated_code"], **state["test_code"]}
    write_code_files(name, all_code)
    return {"status": "done"}
```
Benefits:
- Testable agents — unit tests mock the LLM, never the filesystem
- Atomic output — you don't get partially written artifacts from a failed run
- Single I/O boundary — one place to change output format, destination, or cloud upload
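`write_artifact` itself isn't shown in the post; a plausible version might look like this (the `root` parameter is an addition here for testability — the repo likely hard-codes `artifacts/`):

```python
from pathlib import Path


def write_artifact(project: str, filename: str, content: str,
                   root: Path = Path("artifacts")) -> None:
    """Write one artifact file, creating the project folder on first use."""
    path = root / project / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
```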
Output Structure
After python sdlc_cli.py run on a "simple CLI to-do list" task:
```plaintext
artifacts/simple_cli_todo_list/
    spec.md           ← BA spec with functional requirements
    tech_design.md    ← Architect design with ✓ checked implementation plan
    code_review.md    ← Severity-coded review with verdict
code/simple_cli_todo_list/
    todo.py           ← TodoList class implementation
    main.py           ← CLI loop and command parser
    test_todo.py      ← pytest tests for TodoList
    test_main.py      ← pytest tests for the CLI
```
Both directories are gitignored — they're runtime outputs.
AI Tool Integrations
The pipeline is designed to be AI-tool agnostic. Every popular coding assistant gets its own integration file that delegates to sdlc_cli.py:
| Tool | Integration | Invocation |
|---|---|---|
| Claude Code | `.claude/commands/sdlc.md` | `/sdlc run` |
| Cursor | `.cursor/commands/sdlc.md` | `@sdlc run` |
| GitHub Copilot | `.github/prompts/sdlc.md` | Prompt panel |
| Continue.dev | `.continue/prompts/sdlc.md` | `/sdlc run` |
| Windsurf | `.windsurf/rules/sdlc.md` | Rules panel |
AGENTS.md at the repo root is a universal context file — any tool can read it to understand the project architecture and available commands without tool-specific configuration.
This pattern is increasingly important: your automation shouldn't be locked to a single AI assistant.
CLI Reference
```bash
# Run full pipeline on input/task.md
python sdlc_cli.py run

# One-liner — write task inline and run
python sdlc_cli.py new "Build a CLI password generator"

# Re-run from a specific agent (reuses prior artifacts)
python sdlc_cli.py run-from dev   # valid: ba, architect, dev, qa, review

# Inspect outputs
python sdlc_cli.py show spec
python sdlc_cli.py show tech_design
python sdlc_cli.py show code
python sdlc_cli.py show code_review

# Check what's been built
python sdlc_cli.py status

# Run framework unit tests (all mocked, no API keys needed)
python sdlc_cli.py test

# Run QA-generated tests for the last project
python sdlc_cli.py test-generated
```
Testing the Pipeline Itself
The framework ships with its own unit tests in tests/. These test each agent in isolation — no real API calls, no API keys required:
```python
# tests/test_dev_agent.py
@patch("sdlc.agents.dev_agent.llm")
def test_dev_agent_makes_two_llm_calls(mock_llm):
    mock_llm.invoke.side_effect = [
        AIMessage(content='{"main.py": "print(\'hello\')"}'),
        AIMessage(content="- [x] 1. Create main.py → main.py"),
    ]
    result = dev_agent(base_state())
    assert mock_llm.invoke.call_count == 2
    assert "main.py" in result["generated_code"]
```
The Dev Agent test specifically asserts exactly two LLM calls — if the implementation changes to make one or three calls, the test catches it. This kind of behavioural assertion is more valuable than output-content assertions for LLM-calling code.
Design Decisions Worth Noting
Why separate write_artifacts from agents?
Agents stay pure and testable. A failed run doesn't leave half-written files. One node controls all I/O.
Why JSON for multi-file output?
Markdown code fences are ambiguous when embedding multiple files. JSON gives a reliable, parseable structure: {"filename.py": "content..."}. The _parse_json_output() helper handles LLM fence inconsistencies.
Why reset messages between agents?
Each agent is a standalone expert. Prior conversation context from other agents would confuse the role and waste tokens. Clean slate per agent.
Why gpt-4o for Dev/QA but gpt-4o-mini for the rest?
Code generation and test generation have the highest quality ceiling — stronger model pays off. Structured document generation (spec, design, review) works well with the faster mini model.
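One way to keep that decision in a single place — a sketch; the repo may wire models differently:

```python
# Per-agent model assignment: the stronger model only where the
# quality ceiling is high (code and tests)
AGENT_MODELS = {
    "ba_agent": "gpt-4o-mini",
    "architect_agent": "gpt-4o-mini",
    "dev_agent": "gpt-4o",
    "qa_agent": "gpt-4o",
    "review_agent": "gpt-4o-mini",
}


def model_for(agent: str) -> str:
    """Look up an agent's model, defaulting to the mini model."""
    return AGENT_MODELS.get(agent, "gpt-4o-mini")
```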
Why a CLI instead of direct Python calls?
sdlc_cli.py is a single, tool-agnostic interface. Every AI coding assistant can invoke the same commands. No tool-specific knowledge required.
Getting Started
```bash
git clone https://git.epam.com/alexander_uspensky/ai-sdlc.git
cd ai-sdlc
python -m venv .venv && .venv\Scripts\activate   # Windows
source .venv/bin/activate                        # macOS/Linux
pip install -r requirements.txt
cp .env.example .env   # add your OPENAI_API_KEY
```
Write your task:
```bash
# Edit input/task.md with your task description, then:
python sdlc_cli.py run

# Or inline:
python sdlc_cli.py new "Build a REST API for a bookmark manager with FastAPI"
```
Watch the pipeline run:
```console
Multi-Agent SDLC Pipeline

[Pipeline] Loading task...
[BA] Analysing requirements...
[Architect] Designing system...
[Dev] Generating code (call 1/2)...
[Dev] Updating implementation plan (call 2/2)...
[QA] Writing tests...
[Review] Reviewing code...
[Pipeline] Writing artifacts to disk...

✅ Pipeline complete!
   Project  : simple_cli_todo_list
   Artifacts: artifacts/simple_cli_todo_list/
   Code     : code/simple_cli_todo_list/
```
What's Next
The current pipeline is linear — each agent hands off sequentially. Obvious extensions:
- Parallel QA + Review — once Dev finishes, QA and Review could run concurrently (LangGraph supports fan-out/fan-in natively)
- Feedback loops — if Review flags Critical issues, route back to Dev for a fix pass
- LangSmith tracing — set `LANGCHAIN_TRACING_V2=true` and every LLM call is logged with inputs, outputs, latency, and token usage
- Model pluggability — swap agents to Claude Sonnet 4.6, Gemini 2.0 Flash, or local Llama models without changing graph structure
- Web UI — LangGraph Platform can serve the graph as an API with a streaming interface
Conclusion
Multi-agent SDLC isn't about replacing developers — it's about automating the mechanical parts of the cycle so you can focus on the creative parts: system design decisions, edge case identification, architectural trade-offs.
The LangGraph approach specifically gives you:
- Explicit, auditable data flow — state is typed and visible at every step
- Reliable error handling — any agent can fail gracefully without corrupting output
- Composable architecture — add, remove, or swap agents without touching the graph structure
- Resumability — run from any checkpoint, save API costs on partial failures
The full project is on GitHub with working code:
👉 https://github.com/alexander-uspenskiy/ai_sdlc
Built with LangGraph 1.1.0, LangChain, OpenAI GPT-4o, Python 3.13.
All agent tests run without API keys — pytest tests/ works out of the box.
