The Problem with "AI-Assisted" Development
Most AI coding tools today are autocomplete on steroids. They make you faster at typing, but the fundamental loop hasn't changed: you still decompose requirements, design architecture, write code, write tests, and review, one step at a time, context-switching between roles.
What if you could delegate the whole loop?
That's the question behind AI SDLC: a multi-agent pipeline where a chain of specialised AI agents handles every phase of the software development life cycle. You write a plain-English task description. One command later, you have:
- A structured software specification
- A full technical design with an implementation checklist
- Working Python source code
- pytest unit tests (edge cases included)
- A code review with severity-coded issues
No scaffolding. No boilerplate. No switching tabs.
The full project is on GitHub with working code: https://github.com/alexander-uspenskiy/ai_sdlc
Why GPT-4.x models?
They are used here for PoC purposes only. For any production environment it is highly recommended to use GPT-5.3 or higher, or Opus/Sonnet 4.5 or higher (as of publication).
The Landscape: Agentic AI Frameworks in 2026
Before diving into the implementation, it's worth understanding where this fits in the current ecosystem.
Approaches to multi-agent orchestration
| Framework | Model | Best for |
|---|---|---|
| LangGraph | Graph/state machine | Sequential pipelines, conditional routing, checkpointing |
| AutoGen | Conversation-based | Back-and-forth agent dialogues, human-in-the-loop |
| CrewAI | Role-based crew | Parallel task execution, hierarchical delegation |
| OpenAI Swarm | Handoff-based | Lightweight, low-boilerplate agent handoffs |
| Semantic Kernel | Plugin/planner | Enterprise .NET/Python integrations |
Each has its niche. LangGraph is the right choice here because the SDLC is fundamentally a directed acyclic pipeline with conditional error exits. State flows forward, agents don't loop back, and failures need to short-circuit gracefully. That's exactly what LangGraph's StateGraph was built for.
Why not just one big prompt?
A single "write me an app from this description" prompt degrades quickly for non-trivial tasks:
- Context collapse: one prompt can't simultaneously be a BA, architect, developer, QA engineer, and reviewer without each role undermining the others
- No specialisation: a general prompt produces general output; specialised prompts with role-specific context produce expert output
- No accountability: you can't easily replay from the architect stage if only the code was wrong
- Token ceiling: a single-turn mega-prompt blows up for anything beyond toy examples
The pipeline approach solves all four.
Architecture Overview
Every node is a LangGraph node. Every edge is either unconditional (start → load_task, write_artifacts → END) or conditional (check state["status"], route to error_handler if "error").
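To make that control flow concrete, here is a toy stdlib driver that mimics the merge-and-short-circuit behaviour. This is an illustration, not the LangGraph API; the agent functions are stand-in stubs.

```python
# Toy illustration of the pipeline's control flow (NOT the LangGraph API):
# each node returns a partial update; an "error" status short-circuits
# the run into the error handler.
PIPELINE = [
    "load_task", "ba_agent", "architect_agent",
    "dev_agent", "qa_agent", "review_agent", "write_artifacts",
]

def run_pipeline(nodes: dict, state: dict) -> dict:
    for name in PIPELINE:
        state = {**state, **nodes[name](state)}  # merge partial update
        if state.get("status") == "error":
            return {**state, **nodes["error_handler"](state)}
    return state

# Stub agents: everything succeeds except dev_agent.
nodes = {name: (lambda s: {}) for name in PIPELINE}
nodes["error_handler"] = lambda s: {"current_agent": "error_handler"}
nodes["dev_agent"] = lambda s: {"status": "error", "error": "malformed JSON"}

final = run_pipeline(nodes, {"status": "running"})
print(final["current_agent"], final["error"])  # → error_handler malformed JSON
```

The real graph expresses the same logic declaratively through conditional edges, which is what makes it checkpointable and resumable.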
The entire pipeline shares one typed state object (SDLCState), defined once and validated throughout:
```python
from typing import Annotated, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class SDLCState(TypedDict):
    task_md: str                    # Input
    spec_md: str                    # BA output
    tech_design_md: str             # Architect output (updated by Dev)
    generated_code: dict[str, str]  # Dev output: filename -> content
    test_code: dict[str, str]       # QA output: filename -> content
    code_review_md: str             # Review output
    project_name: str               # Extracted from spec
    status: str                     # "running" | "error" | "done"
    current_agent: str
    error: Optional[str]
    messages: Annotated[list[BaseMessage], add_messages]
```
Agents return only the keys they modify. LangGraph merges partial updates into the full state automatically.
The Five Agents
1. BA Agent
Input: raw task description
Output: structured Markdown spec
The BA agent takes the free-form task and produces a proper specification document:
```markdown
project_name: simple_cli_todo_list

## Overview
A command-line to-do application that runs in a loop...

## Goals
- Provide a simple, interactive interface for managing tasks
- Support add, show, and delete operations

## Functional Requirements
- FR-1: `add "item"` appends a new item to the list
- FR-2: `show` displays all items numbered 1-based
- FR-3: `delete N` removes the item at position N
- FR-4: `quit` exits the loop gracefully

## Non-Functional Requirements
- Pure Python, no external dependencies
- Single-file implementation preferred
```
The first line is always `project_name: <snake_case_name>`; this is parsed with a regex and used to name all output folders for the rest of the run.
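A minimal sketch of that parsing step, assuming the `project_name:` header convention above (the helper mirrors the `_extract_project_name` call shown later; the repo's exact implementation may differ):

```python
import re

# Hypothetical sketch: pull the snake_case project name from the
# first line of the BA spec, falling back to a default if absent.
_NAME_RE = re.compile(r"^project_name:\s*([a-z0-9_]+)\s*$", re.MULTILINE)

def extract_project_name(spec_md: str, default: str = "unnamed_project") -> str:
    match = _NAME_RE.search(spec_md)
    return match.group(1) if match else default
```

A fallback default keeps the pipeline running even if the BA agent drops the header.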
Why gpt-4o-mini? Structured document generation from a template is a lightweight task. The mini model is fast, cheap, and plenty capable here.
2. Architect Agent
Input: spec from BA
Output: full technical design + implementation checklist
The Architect produces a complete design document covering components, data models, data flow, tech stack, and file structure. The critical part is the Implementation Plan section, a numbered checklist in `- [ ]` format:
```markdown
## Implementation Plan
- [ ] 1. Define `TodoList` class with internal list storage
- [ ] 2. Implement `add_item(text)` method
- [ ] 3. Implement `show_items()` method
- [ ] 4. Implement `delete_item(n)` method with bounds checking
- [ ] 5. Write `main()` loop with command parsing
- [ ] 6. Handle invalid commands and out-of-range deletes
```
This checklist isn't just documentation: the Dev Agent updates it after code generation.
3. Dev Agent (two LLM calls)
Input: technical design
Output: source files as {filename: content} dict + updated tech design
This is the most complex agent. It makes two sequential LLM calls:
Call 1: code generation.
Returns a JSON object mapping filenames to file content. The strict JSON output contract lets us reliably parse multi-file outputs regardless of LLM formatting variations:
```json
{
  "todo.py": "\"\"\"Simple CLI to-do list.\"\"\"\n\nclass TodoList:\n    ...",
  "main.py": "from todo import TodoList\n\ndef main():\n    ..."
}
```
A `_parse_json_output()` helper strips Markdown fences before parsing, since LLMs are inconsistent about whether they wrap JSON in fenced code blocks.
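A hypothetical sketch of such a fence-stripping parser (the repo's `_parse_json_output()` may differ in detail):

```python
import json
import re

# Hypothetical sketch: tolerate LLM output that is either bare JSON
# or JSON wrapped in Markdown code fences, then parse it.
def parse_json_output(raw: str) -> dict[str, str]:
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = re.sub(r"^```[a-zA-Z]*\s*\n", "", cleaned)  # opening fence
        cleaned = re.sub(r"\n?```$", "", cleaned)             # closing fence
    return json.loads(cleaned)
```

Letting `json.loads` raise on genuinely malformed output is deliberate: the agent can catch the exception and return `{"status": "error", ...}` so the graph routes to the error handler.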
Call 2: plan update.
Takes the tech design + generated filenames, rewrites the implementation plan with all steps marked [x] and annotated with the file that implements each step:
```markdown
- [x] 1. Define `TodoList` class → todo.py
- [x] 2. Implement `add_item(text)` method → todo.py
- [x] 5. Write `main()` loop with command parsing → main.py
```
The updated tech_design_md (with checked-off plan) replaces the original in state and gets persisted to disk. When you open artifacts/<project>/tech_design.md after a run, you see exactly what was built and where.
Why gpt-4o for Dev? Code generation quality matters. The gap between gpt-4o and gpt-4o-mini on code is meaningful, especially for edge case handling, idiom correctness, and docstrings.
4. QA Agent
Input: all generated source files
Output: pytest test files as {filename: content} dict
The QA Agent reads every source file and writes comprehensive pytest tests. The key insight in the prompt: the QA agent is given the actual source code, not just the spec, so the tests match the implementation's structure (real method names, real class names).
Generated tests cover:
- Happy paths (standard usage)
- Edge cases (empty list, boundary indices)
- Error conditions (invalid input, out-of-range delete)
- `unittest.mock` for any I/O or external calls
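For flavour, a hand-written sketch of the kind of test the QA agent emits for the to-do example. The `TodoList` stub is inlined here so the snippet is self-contained; real generated tests import the generated module.

```python
# Inline stub standing in for the generated todo.py (illustration only).
class TodoList:
    def __init__(self) -> None:
        self.items: list[str] = []

    def add_item(self, text: str) -> None:
        self.items.append(text)

    def delete_item(self, n: int) -> None:
        if not 1 <= n <= len(self.items):
            raise IndexError("item number out of range")
        del self.items[n - 1]

# pytest-style tests: one happy path, one edge case.
def test_add_then_delete():
    todo = TodoList()
    todo.add_item("buy milk")
    todo.delete_item(1)
    assert todo.items == []

def test_delete_out_of_range():
    todo = TodoList()
    try:
        todo.delete_item(5)
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError")
```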
5. Review Agent
Input: all source files + all test files
Output: structured Markdown code review
The review doc follows a consistent schema:
```markdown
## Summary
Clean implementation of the requirements. Single-file structure is appropriate.

## Issues
| # | Severity | Location | Issue | Recommendation |
|---|----------|----------|-------|----------------|
| 1 | 🟡 Minor | todo.py:14 | No type hints on public methods | Add `-> None` / `-> str` annotations |
| 2 | 🔵 Info | main.py:3 | No `if __name__ == "__main__"` guard | Wrap the `main()` call |

## Test Coverage Assessment
Tests cover all three commands and error paths. Missing: concurrent access scenario (out of scope for CLI).

## Verdict: ✅ Approved
```
Severity codes: 🔴 Critical, 🟠 High, 🟡 Minor, 🔵 Info.
State Management: The Secret Sauce
LangGraph's state model is what makes this architecture clean.
Context isolation between agents
Each agent returns "messages": [] in its return dict. Because messages uses LangGraph's add_messages reducer, which appends rather than replaces, an empty list adds nothing to the shared history; combined with each agent building its prompt from scratch, no conversation state bleeds between roles:
```python
def ba_agent(state: SDLCState) -> dict:
    response = llm.invoke([SystemMessage(...), HumanMessage(...)])
    return {
        "spec_md": response.content,
        "project_name": _extract_project_name(response.content),
        "current_agent": "ba_agent",
        "messages": [],  # appends nothing; the next agent starts clean
    }
```
The alternative, threading one shared message history through every agent, would mean each subsequent agent sees the entire conversation from all previous agents: a context bleed that confuses specialised roles and wastes tokens.
Conditional error routing
Every edge (except the final ones) uses the same routing factory:
```python
def _route(next_node: str):
    def route(state: SDLCState) -> str:
        if state.get("status") == "error":
            return "error_handler"
        return next_node
    return route

builder.add_conditional_edges("ba_agent", _route("architect_agent"))
builder.add_conditional_edges("architect_agent", _route("dev_agent"))
# ... etc
```
Any agent can fail gracefully by returning {"status": "error", "error": "message"}. The graph short-circuits to error_handler without affecting already-written artifacts. This is critical for real-world use where LLM calls occasionally fail or return malformed output.
Checkpointing and resumability
The graph compiles with MemorySaver():
```python
graph = build_graph()  # compiled with MemorySaver checkpointer
```
Every invocation gets a unique thread_id (UUID). This means state is checkpointed at every node boundary. You can resume a failed run or inspect intermediate state without replaying the whole pipeline.
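Conceptually, the checkpointer snapshots state per thread at each node boundary. A toy stdlib illustration of that idea (not LangGraph's actual checkpointer API):

```python
import copy
import uuid

# Toy per-thread checkpoint store (illustration, not LangGraph's API):
# each node boundary snapshots the state under the run's thread_id.
class ToyCheckpointer:
    def __init__(self) -> None:
        self._store: dict[str, list[tuple[str, dict]]] = {}

    def save(self, thread_id: str, node: str, state: dict) -> None:
        snapshot = copy.deepcopy(state)  # freeze state at this boundary
        self._store.setdefault(thread_id, []).append((node, snapshot))

    def latest(self, thread_id: str) -> tuple[str, dict]:
        return self._store[thread_id][-1]

thread_id = str(uuid.uuid4())  # each invocation gets its own thread
cp = ToyCheckpointer()
cp.save(thread_id, "ba_agent", {"spec_md": "## Overview", "status": "running"})
cp.save(thread_id, "architect_agent",
        {"spec_md": "## Overview", "tech_design_md": "## Design", "status": "running"})
node, state = cp.latest(thread_id)  # resume point: after architect_agent
```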
The CLI exposes this with run-from:
```shell
# Code generation was wrong? Re-run from Dev, reusing existing spec + tech design
python sdlc_cli.py run-from dev
```
This loads persisted artifacts back into state up to the requested restart point, saving both time and API cost.
The write_artifacts Node
One of the stronger design decisions: agents never touch the filesystem.
All agents are pure functions of state → state. The filesystem write is centralised in a single write_artifacts node that runs only after all agents succeed:
```python
def write_artifacts(state: SDLCState) -> dict:
    name = state["project_name"]
    write_artifact(name, "spec.md", state["spec_md"])
    write_artifact(name, "tech_design.md", state["tech_design_md"])
    write_artifact(name, "code_review.md", state["code_review_md"])
    all_code = {**state["generated_code"], **state["test_code"]}
    write_code_files(name, all_code)
    return {"status": "done"}
```
Benefits:
- Testable agents: unit tests mock the LLM, never the filesystem
- Atomic output: you don't get partially written artifacts from a failed run
- Single I/O boundary: one place to change output format, destination, or cloud upload
Output Structure
After `python sdlc_cli.py run` on a "simple CLI to-do list" task:

```text
artifacts/simple_cli_todo_list/
  spec.md         - BA spec with functional requirements
  tech_design.md  - Architect design with checked-off implementation plan
  code_review.md  - Severity-coded review with verdict

code/simple_cli_todo_list/
  todo.py         - TodoList class implementation
  main.py         - CLI loop and command parser
  test_todo.py    - pytest tests for TodoList
  test_main.py    - pytest tests for the CLI
```

Both directories are gitignored; they're runtime outputs.
AI Tool Integrations
The pipeline is designed to be AI-tool agnostic. Every popular coding assistant gets its own integration file that delegates to sdlc_cli.py:
| Tool | Integration | Invocation |
|---|---|---|
| Claude Code | `.claude/commands/sdlc.md` | `/sdlc run` |
| Cursor | `.cursor/commands/sdlc.md` | `@sdlc run` |
| GitHub Copilot | `.github/prompts/sdlc.md` | Prompt panel |
| Continue.dev | `.continue/prompts/sdlc.md` | `/sdlc run` |
| Windsurf | `.windsurf/rules/sdlc.md` | Rules panel |
AGENTS.md at the repo root is a universal context file β any tool can read it to understand the project architecture and available commands without tool-specific configuration.
This pattern is increasingly important: your automation shouldn't be locked to a single AI assistant.
Advanced Monitoring (LLM, Cost)
LangSmith dashboards enable advanced monitoring and debugging: requests, responses, timing, cost, and more.
CLI Reference
```shell
# Run full pipeline on input/task.md
python sdlc_cli.py run

# One-liner: write task inline and run
python sdlc_cli.py new "Build a CLI password generator"

# Re-run from a specific agent (reuses prior artifacts)
python sdlc_cli.py run-from dev   # valid: ba, architect, dev, qa, review

# Inspect outputs
python sdlc_cli.py show spec
python sdlc_cli.py show tech_design
python sdlc_cli.py show code
python sdlc_cli.py show code_review

# Check what's been built
python sdlc_cli.py status

# Run framework unit tests (all mocked, no API keys needed)
python sdlc_cli.py test

# Run QA-generated tests for the last project
python sdlc_cli.py test-generated
```
Testing the Pipeline Itself
The framework ships with its own unit tests in tests/. These test each agent in isolation, with no real API calls and no API keys required:
```python
# tests/test_dev_agent.py
@patch("sdlc.agents.dev_agent.llm")
def test_dev_agent_makes_two_llm_calls(mock_llm):
    mock_llm.invoke.side_effect = [
        AIMessage(content='{"main.py": "print(\'hello\')"}'),
        AIMessage(content="- [x] 1. Create main.py → main.py"),
    ]
    result = dev_agent(base_state())
    assert mock_llm.invoke.call_count == 2
    assert "main.py" in result["generated_code"]
```
The Dev Agent test specifically asserts exactly two LLM calls; if the implementation changes to make one or three calls, the test catches it. This kind of behavioural assertion is more valuable than output-content assertions for LLM-calling code.
Design Decisions Worth Noting
Why separate write_artifacts from agents?
Agents stay pure and testable. A failed run doesn't leave half-written files. One node controls all I/O.
Why JSON for multi-file output?
Markdown code fences are ambiguous when embedding multiple files. JSON gives a reliable, parseable structure: {"filename.py": "content..."}. The _parse_json_output() helper handles LLM fence inconsistencies.
Why reset messages between agents?
Each agent is a standalone expert. Prior conversation context from other agents would confuse the role and waste tokens. Clean slate per agent.
Why gpt-4o for Dev/QA but gpt-4o-mini for the rest?
Code generation and test generation have the highest quality ceiling β stronger model pays off. Structured document generation (spec, design, review) works well with the faster mini model.
Why a CLI instead of direct Python calls?
sdlc_cli.py is a single, tool-agnostic interface. Every AI coding assistant can invoke the same commands. No tool-specific knowledge required.
Getting Started
```shell
git clone https://git.epam.com/alexander_uspensky/ai-sdlc.git
cd ai-sdlc
python -m venv .venv && .venv\Scripts\activate   # Windows
# source .venv/bin/activate                      # macOS/Linux
pip install -r requirements.txt
cp .env.example .env   # add your OPENAI_API_KEY
```
Write your task:
```shell
# Edit input/task.md with your task description, then:
python sdlc_cli.py run

# Or inline:
python sdlc_cli.py new "Build a REST API for a bookmark manager with FastAPI"
```
Watch the pipeline run:
```text
============================================================
Multi-Agent SDLC Pipeline
============================================================
[Pipeline] Loading task...
[BA] Analysing requirements...
[Architect] Designing system...
[Dev] Generating code (call 1/2)...
[Dev] Updating implementation plan (call 2/2)...
[QA] Writing tests...
[Review] Reviewing code...
[Pipeline] Writing artifacts to disk...
============================================================
✅ Pipeline complete!
   Project  : simple_cli_todo_list
   Artifacts: artifacts/simple_cli_todo_list/
   Code     : code/simple_cli_todo_list/
============================================================
```
What's Next
The current pipeline is linear: each agent hands off sequentially. Obvious extensions:
- Parallel QA + Review: once Dev finishes, QA and Review could run concurrently (LangGraph supports fan-out/fan-in natively)
- Feedback loops: if Review flags Critical issues, route back to Dev for a fix pass
- LangSmith tracing: set `LANGCHAIN_TRACING_V2=true` and every LLM call is logged with inputs, outputs, latency, and token usage
- Model pluggability: swap agents to Claude Sonnet 4.6, Gemini 2.0 Flash, or local Llama models without changing graph structure
- Web UI: LangGraph Platform can serve the graph as an API with a streaming interface
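As a taste of the feedback-loop extension, the post-review routing could look like this. A sketch only: the `fix_passes` counter is an assumed new state key, not part of the current SDLCState.

```python
# Hypothetical post-review router for a bounded Dev fix-pass loop.
# `fix_passes` is an assumed counter key (not in the current SDLCState).
MAX_FIX_PASSES = 2

def route_after_review(state: dict) -> str:
    has_critical = "Critical" in state.get("code_review_md", "")
    if has_critical and state.get("fix_passes", 0) < MAX_FIX_PASSES:
        return "dev_agent"       # loop back for another fix pass
    return "write_artifacts"     # otherwise proceed to the I/O node
```

Bounding the loop with a counter keeps the graph from cycling forever on a review the Dev agent can't satisfy.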
Conclusion
Multi-agent SDLC isn't about replacing developers; it's about automating the mechanical parts of the cycle so you can focus on the creative parts: system design decisions, edge case identification, architectural trade-offs.
The LangGraph approach specifically gives you:
- Explicit, auditable data flow: state is typed and visible at every step
- Reliable error handling: any agent can fail gracefully without corrupting output
- Composable architecture: add, remove, or swap agents without touching the graph structure
- Resumability: run from any checkpoint, save API costs on partial failures
The full project is on GitHub with working code: https://github.com/alexander-uspenskiy/ai_sdlc
Built with LangGraph 1.1.0, LangChain, OpenAI GPT-4o, Python 3.13.
All agent tests run without API keys; `pytest tests/` works out of the box.