The Problem with "AI-Assisted" Development
Most AI coding tools today are autocomplete on steroids. They make you faster at typing, but the fundamental loop hasn't changed: you still decompose requirements, design architecture, write code, write tests, and review, one step at a time, context-switching between roles.
What if you could delegate the whole loop?
That's the question behind AI SDLC: a multi-agent pipeline where a chain of specialised AI agents handles every phase of the software development life cycle. You write a plain-English task description. One command later, you have:
- A structured software specification
- A full technical design with an implementation checklist
- Working Python source code
- pytest unit tests (edge cases included)
- A code review with severity-coded issues
No scaffolding. No boilerplate. No switching tabs.
The full project is on GitHub with working code: https://github.com/alexander-uspenskiy/ai_sdlc
Why GPT-4.x models?
They are used here for PoC purposes only. For any production environment it is highly recommended to use GPT-5.3 or higher, or Opus/Sonnet 4.5 or higher (as of publication).
The Landscape: Agentic AI Frameworks in 2026
Before diving into the implementation, it's worth understanding where this fits in the current ecosystem.
Approaches to multi-agent orchestration
| Framework | Model | Best for |
|---|---|---|
| LangGraph | Graph/state machine | Sequential pipelines, conditional routing, checkpointing |
| AutoGen | Conversation-based | Back-and-forth agent dialogues, human-in-the-loop |
| CrewAI | Role-based crew | Parallel task execution, hierarchical delegation |
| OpenAI Swarm | Handoff-based | Lightweight, low-boilerplate agent handoffs |
| Semantic Kernel | Plugin/planner | Enterprise .NET/Python integrations |
Each has its niche. LangGraph is the right choice here because the SDLC is fundamentally a directed acyclic pipeline with conditional error exits. State flows forward, agents don't loop back, and failures need to short-circuit gracefully. That's exactly what LangGraph's StateGraph was built for.
Why not just one big prompt?
A single "write me an app from this description" prompt degrades quickly for non-trivial tasks:
- Context collapse: one prompt can't simultaneously be a BA, architect, developer, QA engineer, and reviewer without each role undermining the others
- No specialisation: a general prompt produces general output; specialised prompts with role-specific context produce expert output
- No accountability: you can't easily replay from the architect stage if only the code was wrong
- Token ceiling: a single-turn mega-prompt blows up for anything beyond toy examples
The pipeline approach solves all four.
Architecture Overview
Every node is a LangGraph node. Every edge is either unconditional (start → load_task, write_artifacts → END) or conditional (check state["status"], route to error_handler if "error").
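To make that control flow concrete, here is a toy stdlib driver that mimics the merge-and-short-circuit behaviour. This is an illustration, not the LangGraph API; the agent functions are stand-in stubs.

```python
# Toy illustration of the pipeline's control flow (NOT the LangGraph API):
# each node returns a partial update; an "error" status short-circuits
# the run into the error handler.
PIPELINE = [
    "load_task", "ba_agent", "architect_agent",
    "dev_agent", "qa_agent", "review_agent", "write_artifacts",
]

def run_pipeline(nodes: dict, state: dict) -> dict:
    for name in PIPELINE:
        state = {**state, **nodes[name](state)}  # merge partial update
        if state.get("status") == "error":
            return {**state, **nodes["error_handler"](state)}
    return state

# Stub agents: everything succeeds except dev_agent.
nodes = {name: (lambda s: {}) for name in PIPELINE}
nodes["error_handler"] = lambda s: {"current_agent": "error_handler"}
nodes["dev_agent"] = lambda s: {"status": "error", "error": "malformed JSON"}

final = run_pipeline(nodes, {"status": "running"})
print(final["current_agent"], final["error"])  # → error_handler malformed JSON
```

The real graph expresses the same logic declaratively through conditional edges, which is what makes it checkpointable and resumable.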
The entire pipeline shares one typed state object (SDLCState), defined once and validated throughout:
```python
from typing import Annotated, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class SDLCState(TypedDict):
    task_md: str                    # Input
    spec_md: str                    # BA output
    tech_design_md: str             # Architect output (updated by Dev)
    generated_code: dict[str, str]  # Dev output: filename -> content
    test_code: dict[str, str]       # QA output: filename -> content
    code_review_md: str             # Review output
    project_name: str               # Extracted from spec
    status: str                     # "running" | "error" | "done"
    current_agent: str
    error: Optional[str]
    messages: Annotated[list[BaseMessage], add_messages]
```
Agents return only the keys they modify. LangGraph merges partial updates into the full state automatically.
The Five Agents
1. BA Agent
Input: raw task description
Output: structured Markdown spec
The BA agent takes the free-form task and produces a proper specification document:
```markdown
project_name: simple_cli_todo_list

## Overview
A command-line to-do application that runs in a loop...

## Goals
- Provide a simple, interactive interface for managing tasks
- Support add, show, and delete operations

## Functional Requirements
- FR-1: `add "item"` appends a new item to the list
- FR-2: `show` displays all items numbered 1-based
- FR-3: `delete N` removes the item at position N
- FR-4: `quit` exits the loop gracefully

## Non-Functional Requirements
- Pure Python, no external dependencies
- Single-file implementation preferred
```
The first line is always `project_name: <snake_case_name>`; this is parsed with a regex and used to name all output folders for the rest of the run.
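A minimal sketch of that parsing step, assuming the `project_name:` header convention above (the helper mirrors the `_extract_project_name` call shown later; the repo's exact implementation may differ):

```python
import re

# Hypothetical sketch: pull the snake_case project name from the
# first line of the BA spec, falling back to a default if absent.
_NAME_RE = re.compile(r"^project_name:\s*([a-z0-9_]+)\s*$", re.MULTILINE)

def extract_project_name(spec_md: str, default: str = "unnamed_project") -> str:
    match = _NAME_RE.search(spec_md)
    return match.group(1) if match else default
```

A fallback default keeps the pipeline running even if the BA agent drops the header.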
Why gpt-4o-mini? Structured document generation from a template is a lightweight task. The mini model is fast, cheap, and plenty capable here.
2. Architect Agent
Input: spec from BA
Output: full technical design + implementation checklist
The Architect produces a complete design document covering components, data models, data flow, tech stack, and file structure. The critical part is the Implementation Plan section, a numbered checklist in `- [ ]` format:
```markdown
## Implementation Plan
- [ ] 1. Define `TodoList` class with internal list storage
- [ ] 2. Implement `add_item(text)` method
- [ ] 3. Implement `show_items()` method
- [ ] 4. Implement `delete_item(n)` method with bounds checking
- [ ] 5. Write `main()` loop with command parsing
- [ ] 6. Handle invalid commands and out-of-range deletes
```
This checklist isn't just documentation: the Dev Agent updates it after code generation.
3. Dev Agent (two LLM calls)
Input: technical design
Output: source files as {filename: content} dict + updated tech design
This is the most complex agent. It makes two sequential LLM calls:
Call 1: code generation.
Returns a JSON object mapping filenames to file content. The strict JSON output contract lets us reliably parse multi-file outputs regardless of LLM formatting variations:
```json
{
  "todo.py": "\"\"\"Simple CLI to-do list.\"\"\"\n\nclass TodoList:\n    ...",
  "main.py": "from todo import TodoList\n\ndef main():\n    ..."
}
```
A `_parse_json_output()` helper strips Markdown fences before parsing, since LLMs are inconsistent about whether they wrap JSON in fenced code blocks.
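A hypothetical sketch of such a fence-stripping parser (the repo's `_parse_json_output()` may differ in detail):

```python
import json
import re

# Hypothetical sketch: tolerate LLM output that is either bare JSON
# or JSON wrapped in Markdown code fences, then parse it.
def parse_json_output(raw: str) -> dict[str, str]:
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = re.sub(r"^```[a-zA-Z]*\s*\n", "", cleaned)  # opening fence
        cleaned = re.sub(r"\n?```$", "", cleaned)             # closing fence
    return json.loads(cleaned)
```

Letting `json.loads` raise on genuinely malformed output is deliberate: the agent can catch the exception and return `{"status": "error", ...}` so the graph routes to the error handler.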
Call 2: plan update.
Takes the tech design + generated filenames, rewrites the implementation plan with all steps marked [x] and annotated with the file that implements each step:
```markdown
- [x] 1. Define `TodoList` class → todo.py
- [x] 2. Implement `add_item(text)` method → todo.py
- [x] 5. Write `main()` loop with command parsing → main.py
```
The updated tech_design_md (with checked-off plan) replaces the original in state and gets persisted to disk. When you open artifacts/<project>/tech_design.md after a run, you see exactly what was built and where.
Why gpt-4o for Dev? Code generation quality matters. The gap between gpt-4o and gpt-4o-mini on code is meaningful, especially for edge case handling, idiom correctness, and docstrings.
4. QA Agent
Input: all generated source files
Output: pytest test files as {filename: content} dict
The QA Agent reads every source file and writes comprehensive pytest tests. The key insight in the prompt: the QA agent is given the actual source code, not just the spec, so the tests match the implementation's structure (real method names, real class names).
Generated tests cover:
- Happy paths (standard usage)
- Edge cases (empty list, boundary indices)
- Error conditions (invalid input, out-of-range delete)
- `unittest.mock` for any I/O or external calls
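For flavour, a hand-written sketch of the kind of test the QA agent emits for the to-do example. The `TodoList` stub is inlined here so the snippet is self-contained; real generated tests import the generated module.

```python
# Inline stub standing in for the generated todo.py (illustration only).
class TodoList:
    def __init__(self) -> None:
        self.items: list[str] = []

    def add_item(self, text: str) -> None:
        self.items.append(text)

    def delete_item(self, n: int) -> None:
        if not 1 <= n <= len(self.items):
            raise IndexError("item number out of range")
        del self.items[n - 1]

# pytest-style tests: one happy path, one edge case.
def test_add_then_delete():
    todo = TodoList()
    todo.add_item("buy milk")
    todo.delete_item(1)
    assert todo.items == []

def test_delete_out_of_range():
    todo = TodoList()
    try:
        todo.delete_item(5)
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError")
```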
5. Review Agent
Input: all source files + all test files
Output: structured Markdown code review
The review doc follows a consistent schema:
```markdown
## Summary
Clean implementation of the requirements. Single-file structure is appropriate.

## Issues
| # | Severity | Location | Issue | Recommendation |
|---|----------|----------|-------|----------------|
| 1 | 🟡 Minor | todo.py:14 | No type hints on public methods | Add `-> None` / `-> str` annotations |
| 2 | 🔵 Info | main.py:3 | No `if __name__ == "__main__"` guard | Wrap the `main()` call |

## Test Coverage Assessment
Tests cover all three commands and error paths. Missing: concurrent access scenario (out of scope for CLI).

## Verdict: ✅ Approved
```
Severity codes: 🔴 Critical, 🟠 High, 🟡 Minor, 🔵 Info.
State Management: The Secret Sauce
LangGraph's state model is what makes this architecture clean.
Context isolation between agents
Each agent returns "messages": [] in its return dict. Because messages uses LangGraph's add_messages reducer, which appends rather than replaces, an empty list adds nothing to the shared history; combined with each agent building its prompt from scratch, no conversation state bleeds between roles:
```python
def ba_agent(state: SDLCState) -> dict:
    response = llm.invoke([SystemMessage(...), HumanMessage(...)])
    return {
        "spec_md": response.content,
        "project_name": _extract_project_name(response.content),
        "current_agent": "ba_agent",
        "messages": [],  # appends nothing; the next agent starts clean
    }
```
The alternative, threading one shared message history through every agent, would mean each subsequent agent sees the entire conversation from all previous agents: a context bleed that confuses specialised roles and wastes tokens.
Conditional error routing
Every edge (except the final ones) uses the same routing factory:
```python
def _route(next_node: str):
    def route(state: SDLCState) -> str:
        if state.get("status") == "error":
            return "error_handler"
        return next_node
    return route

builder.add_conditional_edges("ba_agent", _route("architect_agent"))
builder.add_conditional_edges("architect_agent", _route("dev_agent"))
# ... etc
```
Any agent can fail gracefully by returning {"status": "error", "error": "message"}. The graph short-circuits to error_handler without affecting already-written artifacts. This is critical for real-world use where LLM calls occasionally fail or return malformed output.
Checkpointing and resumability
The graph compiles with MemorySaver():
```python
graph = build_graph()  # compiled with MemorySaver checkpointer
```
Every invocation gets a unique thread_id (UUID). This means state is checkpointed at every node boundary. You can resume a failed run or inspect intermediate state without replaying the whole pipeline.
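Conceptually, the checkpointer snapshots state per thread at each node boundary. A toy stdlib illustration of that idea (not LangGraph's actual checkpointer API):

```python
import copy
import uuid

# Toy per-thread checkpoint store (illustration, not LangGraph's API):
# each node boundary snapshots the state under the run's thread_id.
class ToyCheckpointer:
    def __init__(self) -> None:
        self._store: dict[str, list[tuple[str, dict]]] = {}

    def save(self, thread_id: str, node: str, state: dict) -> None:
        snapshot = copy.deepcopy(state)  # freeze state at this boundary
        self._store.setdefault(thread_id, []).append((node, snapshot))

    def latest(self, thread_id: str) -> tuple[str, dict]:
        return self._store[thread_id][-1]

thread_id = str(uuid.uuid4())  # each invocation gets its own thread
cp = ToyCheckpointer()
cp.save(thread_id, "ba_agent", {"spec_md": "## Overview", "status": "running"})
cp.save(thread_id, "architect_agent",
        {"spec_md": "## Overview", "tech_design_md": "## Design", "status": "running"})
node, state = cp.latest(thread_id)  # resume point: after architect_agent
```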
The CLI exposes this with run-from:
```shell
# Code generation was wrong? Re-run from Dev, reusing existing spec + tech design
python sdlc_cli.py run-from dev
```
This loads persisted artifacts back into state up to the requested restart point, saving both time and API cost.
The write_artifacts Node
One of the stronger design decisions: agents never touch the filesystem.
All agents are pure functions of state → state. The filesystem write is centralised in a single write_artifacts node that runs only after all agents succeed:
```python
def write_artifacts(state: SDLCState) -> dict:
    name = state["project_name"]
    write_artifact(name, "spec.md", state["spec_md"])
    write_artifact(name, "tech_design.md", state["tech_design_md"])
    write_artifact(name, "code_review.md", state["code_review_md"])
    all_code = {**state["generated_code"], **state["test_code"]}
    write_code_files(name, all_code)
    return {"status": "done"}
```
Benefits:
- Testable agents: unit tests mock the LLM, never the filesystem
- Atomic output: you don't get partially written artifacts from a failed run
- Single I/O boundary: one place to change output format, destination, or cloud upload
Output Structure
After `python sdlc_cli.py run` on a "simple CLI to-do list" task:

```text
artifacts/simple_cli_todo_list/
  spec.md         - BA spec with functional requirements
  tech_design.md  - Architect design with checked-off implementation plan
  code_review.md  - Severity-coded review with verdict

code/simple_cli_todo_list/
  todo.py         - TodoList class implementation
  main.py         - CLI loop and command parser
  test_todo.py    - pytest tests for TodoList
  test_main.py    - pytest tests for the CLI
```

Both directories are gitignored; they're runtime outputs.
AI Tool Integrations
The pipeline is designed to be AI-tool agnostic. Every popular coding assistant gets its own integration file that delegates to sdlc_cli.py:
| Tool | Integration | Invocation |
|---|---|---|
| Claude Code | `.claude/commands/sdlc.md` | `/sdlc run` |
| Cursor | `.cursor/commands/sdlc.md` | `@sdlc run` |
| GitHub Copilot | `.github/prompts/sdlc.md` | Prompt panel |
| Continue.dev | `.continue/prompts/sdlc.md` | `/sdlc run` |
| Windsurf | `.windsurf/rules/sdlc.md` | Rules panel |
AGENTS.md at the repo root is a universal context file β any tool can read it to understand the project architecture and available commands without tool-specific configuration.
This pattern is increasingly important: your automation shouldn't be locked to a single AI assistant.
Advanced Monitoring (LLM, Cost)
LangSmith dashboards enable advanced monitoring and debugging: requests, responses, timing, cost, and more.
CLI Reference
```shell
# Run full pipeline on input/task.md
python sdlc_cli.py run

# One-liner: write task inline and run
python sdlc_cli.py new "Build a CLI password generator"

# Re-run from a specific agent (reuses prior artifacts)
python sdlc_cli.py run-from dev   # valid: ba, architect, dev, qa, review

# Inspect outputs
python sdlc_cli.py show spec
python sdlc_cli.py show tech_design
python sdlc_cli.py show code
python sdlc_cli.py show code_review

# Check what's been built
python sdlc_cli.py status

# Run framework unit tests (all mocked, no API keys needed)
python sdlc_cli.py test

# Run QA-generated tests for the last project
python sdlc_cli.py test-generated
```
Testing the Pipeline Itself
The framework ships with its own unit tests in tests/. These test each agent in isolation, with no real API calls and no API keys required:
```python
# tests/test_dev_agent.py
@patch("sdlc.agents.dev_agent.llm")
def test_dev_agent_makes_two_llm_calls(mock_llm):
    mock_llm.invoke.side_effect = [
        AIMessage(content='{"main.py": "print(\'hello\')"}'),
        AIMessage(content="- [x] 1. Create main.py → main.py"),
    ]
    result = dev_agent(base_state())
    assert mock_llm.invoke.call_count == 2
    assert "main.py" in result["generated_code"]
```
The Dev Agent test specifically asserts exactly two LLM calls; if the implementation changes to make one or three calls, the test catches it. This kind of behavioural assertion is more valuable than output-content assertions for LLM-calling code.
Design Decisions Worth Noting
Why separate write_artifacts from agents?
Agents stay pure and testable. A failed run doesn't leave half-written files. One node controls all I/O.
Why JSON for multi-file output?
Markdown code fences are ambiguous when embedding multiple files. JSON gives a reliable, parseable structure: {"filename.py": "content..."}. The _parse_json_output() helper handles LLM fence inconsistencies.
Why reset messages between agents?
Each agent is a standalone expert. Prior conversation context from other agents would confuse the role and waste tokens. Clean slate per agent.
Why gpt-4o for Dev/QA but gpt-4o-mini for the rest?
Code generation and test generation have the highest quality ceiling β stronger model pays off. Structured document generation (spec, design, review) works well with the faster mini model.
Why a CLI instead of direct Python calls?
sdlc_cli.py is a single, tool-agnostic interface. Every AI coding assistant can invoke the same commands. No tool-specific knowledge required.
Getting Started
```shell
git clone https://git.epam.com/alexander_uspensky/ai-sdlc.git
cd ai-sdlc
python -m venv .venv && .venv\Scripts\activate   # Windows
# source .venv/bin/activate                      # macOS/Linux
pip install -r requirements.txt
cp .env.example .env   # add your OPENAI_API_KEY
```
Write your task:
```shell
# Edit input/task.md with your task description, then:
python sdlc_cli.py run

# Or inline:
python sdlc_cli.py new "Build a REST API for a bookmark manager with FastAPI"
```
Watch the pipeline run:
```text
============================================================
Multi-Agent SDLC Pipeline
============================================================
[Pipeline] Loading task...
[BA] Analysing requirements...
[Architect] Designing system...
[Dev] Generating code (call 1/2)...
[Dev] Updating implementation plan (call 2/2)...
[QA] Writing tests...
[Review] Reviewing code...
[Pipeline] Writing artifacts to disk...
============================================================
✅ Pipeline complete!
   Project  : simple_cli_todo_list
   Artifacts: artifacts/simple_cli_todo_list/
   Code     : code/simple_cli_todo_list/
============================================================
```
What's Next
The current pipeline is linear: each agent hands off sequentially. Obvious extensions:
- Parallel QA + Review: once Dev finishes, QA and Review could run concurrently (LangGraph supports fan-out/fan-in natively)
- Feedback loops: if Review flags Critical issues, route back to Dev for a fix pass
- LangSmith tracing: set `LANGCHAIN_TRACING_V2=true` and every LLM call is logged with inputs, outputs, latency, and token usage
- Model pluggability: swap agents to Claude Sonnet 4.6, Gemini 2.0 Flash, or local Llama models without changing graph structure
- Web UI: LangGraph Platform can serve the graph as an API with a streaming interface
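As a taste of the feedback-loop extension, the post-review routing could look like this. A sketch only: the `fix_passes` counter is an assumed new state key, not part of the current SDLCState.

```python
# Hypothetical post-review router for a bounded Dev fix-pass loop.
# `fix_passes` is an assumed counter key (not in the current SDLCState).
MAX_FIX_PASSES = 2

def route_after_review(state: dict) -> str:
    has_critical = "Critical" in state.get("code_review_md", "")
    if has_critical and state.get("fix_passes", 0) < MAX_FIX_PASSES:
        return "dev_agent"       # loop back for another fix pass
    return "write_artifacts"     # otherwise proceed to the I/O node
```

Bounding the loop with a counter keeps the graph from cycling forever on a review the Dev agent can't satisfy.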
Conclusion
Multi-agent SDLC isn't about replacing developers; it's about automating the mechanical parts of the cycle so you can focus on the creative parts: system design decisions, edge case identification, architectural trade-offs.
The LangGraph approach specifically gives you:
- Explicit, auditable data flow: state is typed and visible at every step
- Reliable error handling: any agent can fail gracefully without corrupting output
- Composable architecture: add, remove, or swap agents without touching the graph structure
- Resumability: run from any checkpoint, save API costs on partial failures
The full project is on GitHub with working code: https://github.com/alexander-uspenskiy/ai_sdlc
Built with LangGraph 1.1.0, LangChain, OpenAI GPT-4o, Python 3.13.
All agent tests run without API keys; `pytest tests/` works out of the box.