Alexander Uspenskiy

AI SDLC Pipeline: 5 Agents, Fully Autonomous

The Problem with "AI-Assisted" Development

Most AI coding tools today are autocomplete on steroids. They make you faster at typing, but the fundamental loop hasn't changed: you still decompose requirements, design architecture, write code, write tests, and review — one step at a time, context-switching between roles.

What if you could delegate the whole loop?

That's the question behind AI SDLC — a multi-agent pipeline where a chain of specialised AI agents handles every phase of the software development life cycle. You write a plain-English task description. One command later, you have:

  • A structured software specification
  • A full technical design with an implementation checklist
  • Working Python source code
  • pytest unit tests (edge cases included)
  • A code review with severity-coded issues

No scaffolding. No boilerplate. No switching tabs.


The Landscape: Agentic AI Frameworks in 2026

Before diving into the implementation, it's worth understanding where this fits in the current ecosystem.

Approaches to multi-agent orchestration

| Framework | Model | Best for |
|---|---|---|
| LangGraph | Graph/state machine | Sequential pipelines, conditional routing, checkpointing |
| AutoGen | Conversation-based | Back-and-forth agent dialogues, human-in-the-loop |
| CrewAI | Role-based crew | Parallel task execution, hierarchical delegation |
| OpenAI Swarm | Handoff-based | Lightweight, low-boilerplate agent handoffs |
| Semantic Kernel | Plugin/planner | Enterprise .NET/Python integrations |

Each has its niche. LangGraph is the right choice here because the SDLC is fundamentally a directed acyclic pipeline with conditional error exits. State flows forward, agents don't loop back, and failures need to short-circuit gracefully. That's exactly what LangGraph's StateGraph was built for.

Why not just one big prompt?

A single "write me an app from this description" prompt degrades quickly for non-trivial tasks:

  1. Context collapse — one prompt can't simultaneously be a BA, architect, developer, QA engineer, and reviewer without each role undermining the others
  2. No specialisation — a general prompt produces general output; specialised prompts with role-specific context produce expert output
  3. No accountability — you can't easily replay from the architect stage if only the code was wrong
  4. Token ceiling — a single-turn mega-prompt blows up for anything beyond toy examples

The pipeline approach solves all four.


Architecture Overview

Every agent is a LangGraph node. Every edge is either unconditional (start → load_task, write_artifacts → END) or conditional (check state["status"], route to error_handler if "error").

The entire pipeline shares one typed state object (SDLCState), defined once and validated throughout:

```python
from typing import Annotated, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class SDLCState(TypedDict):
    task_md: str                              # Input
    spec_md: str                              # BA output
    tech_design_md: str                       # Architect output (updated by Dev)
    generated_code: dict[str, str]            # Dev output: filename → content
    test_code: dict[str, str]                 # QA output: filename → content
    code_review_md: str                       # Review output
    project_name: str                         # Extracted from spec
    status: str                               # "running" | "error" | "done"
    current_agent: str
    error: Optional[str]
    messages: Annotated[list[BaseMessage], add_messages]
```

Agents return only the keys they modify. LangGraph merges partial updates into the full state automatically.
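Conceptually, the merge behaves like the stdlib sketch below — this is an illustrative emulation, not LangGraph's actual implementation. Keys without a reducer are simply overwritten by the node's return value, while messages uses an append-style reducer like add_messages:

```python
def merge_update(state: dict, update: dict) -> dict:
    """Emulate LangGraph merging a node's partial return into full state."""
    merged = dict(state)
    for key, value in update.items():
        if key == "messages":                  # add_messages-style reducer: append
            merged[key] = merged.get(key, []) + value
        else:                                  # default channel: last write wins
            merged[key] = value
    return merged

state = {"task_md": "todo app", "status": "running", "messages": []}
update = {"spec_md": "# Spec", "current_agent": "ba_agent"}  # only modified keys
state = merge_update(state, update)
```

Untouched keys such as status survive the merge, which is why each agent can stay oblivious to the rest of the state.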


The Five Agents

1. BA Agent — gpt-4o-mini

Input: raw task description
Output: structured Markdown spec

The BA agent takes the free-form task and produces a proper specification document:

```markdown
project_name: simple_cli_todo_list

## Overview
A command-line to-do application that runs in a loop...

## Goals
- Provide a simple, interactive interface for managing tasks
- Support add, show, and delete operations

## Functional Requirements
- FR-1: `add "item"` appends a new item to the list
- FR-2: `show` displays all items numbered 1-based
- FR-3: `delete N` removes the item at position N
- FR-4: `quit` exits the loop gracefully

## Non-Functional Requirements
- Pure Python, no external dependencies
- Single-file implementation preferred
```

The first line is always project_name: <snake_case_name> — this is parsed with a regex and used to name all output folders for the rest of the run.
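The extraction helper could look something like this sketch (the real `_extract_project_name` may differ; the fallback name is an assumption):

```python
import re

def extract_project_name(spec_md: str) -> str:
    """Pull the snake_case project name from the spec's first line.

    Falls back to a default if the BA output deviates from the contract.
    """
    match = re.search(r"^project_name:\s*([a-z0-9_]+)\s*$", spec_md, re.MULTILINE)
    return match.group(1) if match else "unnamed_project"

name = extract_project_name("project_name: simple_cli_todo_list\n\n## Overview\n...")
```

Anchoring the pattern to a whole line keeps a stray `project_name:` mention later in the document from hijacking the folder name.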

Why gpt-4o-mini? Structured document generation from a template is a lightweight task. The mini model is fast, cheap, and plenty capable here.

2. Architect Agent — gpt-4o-mini

Input: spec from BA
Output: full technical design + implementation checklist

The Architect produces a complete design document covering components, data models, data flow, tech stack, and file structure. The critical part is the Implementation Plan section — a numbered checklist in - [ ] format:

```markdown
## Implementation Plan

- [ ] 1. Define `TodoList` class with internal list storage
- [ ] 2. Implement `add_item(text)` method
- [ ] 3. Implement `show_items()` method
- [ ] 4. Implement `delete_item(n)` method with bounds checking
- [ ] 5. Write `main()` loop with command parsing
- [ ] 6. Handle invalid commands and out-of-range deletes
```

This checklist isn't just documentation — the Dev Agent updates it after code generation.

3. Dev Agent — gpt-4o (two LLM calls)

Input: technical design
Output: source files as {filename: content} dict + updated tech design

This is the most complex agent. It makes two sequential LLM calls:

Call 1 — Code generation:
Returns a JSON object mapping filenames to file content. The strict JSON output contract lets us reliably parse multi-file outputs regardless of LLM formatting variations:

```json
{
  "todo.py": "\"\"\"Simple CLI to-do list.\"\"\"\n\nclass TodoList:\n    ...",
  "main.py": "from todo import TodoList\n\ndef main():\n    ..."
}
```

A _parse_json_output() helper strips markdown fences before parsing — LLMs are inconsistent about whether they wrap JSON in fenced code blocks.
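A helper along these lines would do the job (a sketch of what `_parse_json_output()` might look like, not the project's actual code):

```python
import json
import re

def parse_json_output(raw: str) -> dict[str, str]:
    """Strip an optional Markdown code fence, then parse the filename→content map."""
    text = raw.strip()
    # Matches a whole-response fence, with or without a "json" language tag.
    fenced = re.match(r"^```(?:json)?\s*\n(.*)\n```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)

files = parse_json_output('```json\n{"todo.py": "print(1)"}\n```')
```

Because json.loads raises on malformed input, the Dev agent can catch the exception and return {"status": "error", ...} instead of crashing the graph.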

Call 2 — Plan update:
Takes the tech design + generated filenames, rewrites the implementation plan with all steps marked [x] and annotated with the file that implements each step:

```markdown
- [x] 1. Define TodoList class → todo.py
- [x] 2. Implement add_item(text) method → todo.py
- [x] 5. Write main() loop with command parsing → main.py
```

The updated tech_design_md (with checked-off plan) replaces the original in state and gets persisted to disk. When you open artifacts/<project>/tech_design.md after a run, you see exactly what was built and where.
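In the pipeline this rewrite is done by the LLM in call 2, but the check-off part of the transformation could also be done deterministically — a hypothetical sketch (without the per-file annotations the LLM adds):

```python
import re

def check_off(plan_md: str, done_steps: set[int]) -> str:
    """Mark numbered '- [ ] N.' checklist items as '- [x]' for completed steps."""
    def repl(m: re.Match) -> str:
        step = int(m.group(1))
        return f"- [x] {step}." if step in done_steps else m.group(0)
    return re.sub(r"- \[ \] (\d+)\.", repl, plan_md)

plan = "- [ ] 1. Define class\n- [ ] 2. Write main loop"
updated = check_off(plan, {1})
```

Pushing the task through the LLM instead buys the file annotations and tolerance for checklist formatting drift, at the cost of a second API call.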

Why gpt-4o for Dev? Code generation quality matters. The gap between gpt-4o and gpt-4o-mini on code is meaningful, especially for edge case handling, idiom correctness, and docstrings.

4. QA Agent — gpt-4o

Input: all generated source files
Output: pytest test files as {filename: content} dict

The QA Agent reads every source file and writes comprehensive pytest tests. The key insight in the prompt: the agent is given the actual source code, not just the spec — so the generated tests match the implementation's structure (real method names, real class names).

Generated tests cover:

  • Happy paths (standard usage)
  • Edge cases (empty list, boundary indices)
  • Error conditions (invalid input, out-of-range delete)
  • unittest.mock for any I/O or external calls
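Assembling that source-grounded prompt might look like this sketch (the wording and layout are assumptions, not the project's actual prompt):

```python
def build_qa_prompt(sources: dict[str, str]) -> str:
    """Concatenate every generated source file into one prompt body,
    tagging each section with its filename so tests import the right modules."""
    parts = ["Write pytest tests for the following files.\n"]
    for filename, content in sources.items():
        parts.append(f"### {filename}\n{content}\n")
    return "\n".join(parts)

prompt = build_qa_prompt({"todo.py": "class TodoList: ...",
                          "main.py": "def main(): ..."})
```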

5. Review Agent — gpt-4o-mini

Input: all source files + all test files
Output: structured Markdown code review

The review doc follows a consistent schema:

```markdown
## Summary
Clean implementation of the requirements. Single-file structure is appropriate.

## Issues
| # | Severity | Location | Issue | Recommendation |
|---|---|---|---|---|
| 1 | 🟡 Minor | todo.py:14 | No type hints on public methods | Add `-> None` / `-> str` annotations |
| 2 | 🔵 Info | main.py:3 | No `if __name__ == "__main__"` guard | Wrap the `main()` call |

## Test Coverage Assessment
Tests cover all three commands and error paths. Missing: concurrent access scenario (out of scope for CLI).

**Verdict:** ✅ Approved
```

Severity codes: 🔴 Critical, 🟠 High, 🟡 Minor, 🔵 Info.
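Because the schema is consistent, downstream tooling can count severities mechanically — for instance, a gate that blocks on Critical findings. A stdlib sketch (the marker strings are assumptions based on the codes above):

```python
import re

SEVERITIES = ("🔴 Critical", "🟠 High", "🟡 Minor", "🔵 Info")

def count_severities(review_md: str) -> dict[str, int]:
    """Count each severity marker appearing in the review document."""
    return {label: len(re.findall(re.escape(label), review_md))
            for label in SEVERITIES}

counts = count_severities("| 1 | 🟡 Minor | ... |\n| 2 | 🔵 Info | ... |")
has_blockers = counts["🔴 Critical"] > 0
```

A check like has_blockers is exactly what a Review → Dev feedback loop would branch on.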


State Management: The Secret Sauce

LangGraph's state model is what makes this architecture clean.

Context isolation between agents

Each agent returns "messages": [] and builds its prompt from scratch with its own SystemMessage/HumanMessage. Because messages uses LangGraph's add_messages reducer — which appends rather than overwrites — the empty list adds nothing to the shared history, and since no agent reads that history into its prompt, no conversation context leaks between roles:

```python
def ba_agent(state: SDLCState) -> dict:
    response = llm.invoke([SystemMessage(...), HumanMessage(...)])
    return {
        "spec_md": response.content,
        "project_name": _extract_project_name(response.content),
        "current_agent": "ba_agent",
        "messages": [],  # ← empty update: no history carried into the next agent
    }
```

Without this, each subsequent agent would see the entire conversation history from all previous agents — a context bleed that confuses specialised roles and wastes tokens.

Conditional error routing

Every edge (except the final ones) uses the same routing factory:

```python
def _route(next_node: str):
    def route(state: SDLCState) -> str:
        if state.get("status") == "error":
            return "error_handler"
        return next_node
    return route

builder.add_conditional_edges("ba_agent", _route("architect_agent"))
builder.add_conditional_edges("architect_agent", _route("dev_agent"))
# ... etc
```

Any agent can fail gracefully by returning {"status": "error", "error": "message"}. The graph short-circuits to error_handler without affecting already-written artifacts. This is critical for real-world use where LLM calls occasionally fail or return malformed output.

Checkpointing and resumability

The graph compiles with MemorySaver():

```python
graph = build_graph()  # compiled with MemorySaver checkpointer
```

Every invocation gets a unique thread_id (UUID). This means state is checkpointed at every node boundary. You can resume a failed run or inspect intermediate state without replaying the whole pipeline.
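Conceptually, node-boundary checkpointing behaves like this stdlib emulation (MemorySaver does far more — per-thread persistence, replay, interrupts — but the core idea is a state snapshot after each node):

```python
import uuid

def run_with_checkpoints(nodes, state: dict) -> tuple[dict, dict]:
    """Run nodes in order, snapshotting merged state after each one.

    Illustrative emulation of checkpoint-per-node, keyed by a run's thread_id.
    """
    thread_id = str(uuid.uuid4())          # one checkpoint namespace per run
    checkpoints = {}
    for name, fn in nodes:
        state = {**state, **fn(state)}     # merge the node's partial update
        checkpoints[(thread_id, name)] = dict(state)
    return state, checkpoints

final, snaps = run_with_checkpoints(
    [("ba", lambda s: {"spec_md": "# Spec"}),
     ("architect", lambda s: {"tech_design_md": "# Design"})],
    {"task_md": "todo app", "status": "running"},
)
```

Resuming is then just re-entering the loop from a chosen snapshot instead of the initial state.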

The CLI exposes this with run-from:

```bash
# Code generation was wrong? Re-run from Dev, reusing existing spec + tech design
python sdlc_cli.py run-from dev
```

This loads persisted artifacts back into state up to the requested restart point, saving both time and API cost.
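Restoring state for run-from could look like this sketch — the stage-to-artifact mapping and filenames are assumptions inferred from the output layout, not the CLI's actual code:

```python
import tempfile
from pathlib import Path

# Hypothetical mapping: restart point → state keys it needs restored from disk.
REQUIRED = {
    "architect": ["spec_md"],
    "dev": ["spec_md", "tech_design_md"],
    "qa": ["spec_md", "tech_design_md"],
}
FILES = {"spec_md": "spec.md", "tech_design_md": "tech_design.md"}

def load_state_for(stage: str, artifacts_dir: Path) -> dict:
    """Read persisted artifacts back into pipeline state for a mid-run restart."""
    return {key: (artifacts_dir / FILES[key]).read_text(encoding="utf-8")
            for key in REQUIRED.get(stage, [])}

# Demo against a throwaway artifacts directory:
demo = Path(tempfile.mkdtemp())
(demo / "spec.md").write_text("# Spec", encoding="utf-8")
(demo / "tech_design.md").write_text("# Design", encoding="utf-8")
state = load_state_for("dev", demo)
```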


The write_artifacts Node

One of the stronger design decisions: agents never touch the filesystem.

All agents are pure functions of state → state. The filesystem write is centralised in a single write_artifacts node that runs only after all agents succeed:

```python
def write_artifacts(state: SDLCState) -> dict:
    name = state["project_name"]
    write_artifact(name, "spec.md", state["spec_md"])
    write_artifact(name, "tech_design.md", state["tech_design_md"])
    write_artifact(name, "code_review.md", state["code_review_md"])
    all_code = {**state["generated_code"], **state["test_code"]}
    write_code_files(name, all_code)
    return {"status": "done"}
```

Benefits:

  • Testable agents — unit tests mock the LLM, never the filesystem
  • Atomic output — you don't get partially written artifacts from a failed run
  • Single I/O boundary — one place to change output format, destination, or cloud upload

Output Structure

After python sdlc_cli.py run on a "simple CLI to-do list" task:

```plaintext
artifacts/simple_cli_todo_list/
├── spec.md           ← BA spec with functional requirements
├── tech_design.md    ← Architect design with ✓ checked implementation plan
└── code_review.md    ← Severity-coded review with verdict

code/simple_cli_todo_list/
├── todo.py           ← TodoList class implementation
├── main.py           ← CLI loop and command parser
├── test_todo.py      ← pytest tests for TodoList
└── test_main.py      ← pytest tests for the CLI
```

Both directories are gitignored — they're runtime outputs.


AI Tool Integrations

The pipeline is designed to be AI-tool agnostic. Every popular coding assistant gets its own integration file that delegates to sdlc_cli.py:

| Tool | Integration | Invocation |
|---|---|---|
| Claude Code | .claude/commands/sdlc.md | /sdlc run |
| Cursor | .cursor/commands/sdlc.md | @sdlc run |
| GitHub Copilot | .github/prompts/sdlc.md | Prompt panel |
| Continue.dev | .continue/prompts/sdlc.md | /sdlc run |
| Windsurf | .windsurf/rules/sdlc.md | Rules panel |

AGENTS.md at the repo root is a universal context file — any tool can read it to understand the project architecture and available commands without tool-specific configuration.

This pattern is increasingly important: your automation shouldn't be locked to a single AI assistant.


CLI Reference

```bash
# Run full pipeline on input/task.md
python sdlc_cli.py run

# One-liner — write task inline and run
python sdlc_cli.py new "Build a CLI password generator"

# Re-run from a specific agent (reuses prior artifacts)
python sdlc_cli.py run-from dev   # valid: ba, architect, dev, qa, review

# Inspect outputs
python sdlc_cli.py show spec
python sdlc_cli.py show tech_design
python sdlc_cli.py show code
python sdlc_cli.py show code_review

# Check what's been built
python sdlc_cli.py status

# Run framework unit tests (all mocked, no API keys needed)
python sdlc_cli.py test

# Run QA-generated tests for the last project
python sdlc_cli.py test-generated
```


Testing the Pipeline Itself

The framework ships with its own unit tests in tests/. These test each agent in isolation — no real API calls, no API keys required:

```python
# tests/test_dev_agent.py
@patch("sdlc.agents.dev_agent.llm")
def test_dev_agent_makes_two_llm_calls(mock_llm):
    mock_llm.invoke.side_effect = [
        AIMessage(content='{"main.py": "print(\'hello\')"}'),
        AIMessage(content="- [x] 1. Create main.py → main.py"),
    ]
    result = dev_agent(base_state())
    assert mock_llm.invoke.call_count == 2
    assert "main.py" in result["generated_code"]
```

The Dev Agent test specifically asserts exactly two LLM calls — if the implementation changes to make one or three calls, the test catches it. This kind of behavioural assertion is more valuable than output-content assertions for LLM-calling code.


Design Decisions Worth Noting

Why separate write_artifacts from agents?
Agents stay pure and testable. A failed run doesn't leave half-written files. One node controls all I/O.

Why JSON for multi-file output?
Markdown code fences are ambiguous when embedding multiple files. JSON gives a reliable, parseable structure: {"filename.py": "content..."}. The _parse_json_output() helper handles LLM fence inconsistencies.

Why reset messages between agents?
Each agent is a standalone expert. Prior conversation context from other agents would confuse the role and waste tokens. Clean slate per agent.

Why gpt-4o for Dev/QA but gpt-4o-mini for the rest?
Code generation and test generation have the highest quality ceiling — stronger model pays off. Structured document generation (spec, design, review) works well with the faster mini model.

Why a CLI instead of direct Python calls?
sdlc_cli.py is a single, tool-agnostic interface. Every AI coding assistant can invoke the same commands. No tool-specific knowledge required.


Getting Started

```bash
git clone https://git.epam.com/alexander_uspensky/ai-sdlc.git
cd ai-sdlc
python -m venv .venv
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # macOS/Linux
pip install -r requirements.txt
cp .env.example .env          # add your OPENAI_API_KEY
```

Write your task:

```bash
# Edit input/task.md with your task description, then:
python sdlc_cli.py run

# Or inline:
python sdlc_cli.py new "Build a REST API for a bookmark manager with FastAPI"
```

Watch the pipeline run:

```console
Multi-Agent SDLC Pipeline

[Pipeline] Loading task...
[BA] Analysing requirements...
[Architect] Designing system...
[Dev] Generating code (call 1/2)...
[Dev] Updating implementation plan (call 2/2)...
[QA] Writing tests...
[Review] Reviewing code...

[Pipeline] Writing artifacts to disk...

✅ Pipeline complete!
   Project  : simple_cli_todo_list
   Artifacts: artifacts/simple_cli_todo_list/
   Code     : code/simple_cli_todo_list/
```


What's Next

The current pipeline is linear — each agent hands off sequentially. Obvious extensions:

  • Parallel QA + Review — once Dev finishes, QA and Review could run concurrently (LangGraph supports fan-out/fan-in natively)
  • Feedback loops — if Review flags Critical issues, route back to Dev for a fix pass
  • LangSmith tracing — set LANGCHAIN_TRACING_V2=true and every LLM call is logged with inputs, outputs, latency, and token usage
  • Model pluggability — swap agents to Claude Sonnet 4.6, Gemini 2.0 Flash, or local Llama models without changing graph structure
  • Web UI — LangGraph's LangGraph Platform can serve the graph as an API with a streaming interface

Conclusion

Multi-agent SDLC isn't about replacing developers — it's about automating the mechanical parts of the cycle so you can focus on the creative parts: system design decisions, edge case identification, architectural trade-offs.

The LangGraph approach specifically gives you:

  • Explicit, auditable data flow — state is typed and visible at every step
  • Reliable error handling — any agent can fail gracefully without corrupting output
  • Composable architecture — add, remove, or swap agents without touching the graph structure
  • Resumability — run from any checkpoint, save API costs on partial failures

The full project is on GitHub with working code:
👉 https://github.com/alexander-uspenskiy/ai_sdlc


Built with LangGraph 1.1.0, LangChain, OpenAI GPT-4o, Python 3.13.
All agent tests run without API keys — pytest tests/ works out of the box.
