DEV Community

Alexander Uspenskiy


How to build an AI SDLC Pipeline in 15 minutes using LangGraph: Fully Autonomous Development Team with 5 Agents

The Problem with "AI-Assisted" Development

Most AI coding tools today are autocomplete on steroids. They make you faster at typing, but the fundamental loop hasn't changed: you still decompose requirements, design architecture, write code, write tests, and review — one step at a time, context-switching between roles.

What if you could delegate the whole loop?

That's the question behind AI SDLC — a multi-agent pipeline where a chain of specialised AI agents handles every phase of the software development life cycle. You write a plain-English task description. One command later, you have:

  • A structured software specification
  • A full technical design with an implementation checklist
  • Working Python source code
  • pytest unit tests (edge cases included)
  • A code review with severity-coded issues

No scaffolding. No boilerplate. No switching tabs.

The full project is on GitHub with working code: 👉 https://github.com/alexander-uspenskiy/ai_sdlc


Why GPT-4.x models?

The GPT-4.x models are used here only for PoC purposes. For any production environment it is highly recommended to use GPT-5.3 or higher, or Opus/Sonnet 4.5 or higher (as of publication).


The Landscape: Agentic AI Frameworks in 2026

Before diving into the implementation, it's worth understanding where this fits in the current ecosystem.

Approaches to multi-agent orchestration

| Framework | Model | Best for |
|---|---|---|
| LangGraph | Graph/state machine | Sequential pipelines, conditional routing, checkpointing |
| AutoGen | Conversation-based | Back-and-forth agent dialogues, human-in-the-loop |
| CrewAI | Role-based crew | Parallel task execution, hierarchical delegation |
| OpenAI Swarm | Handoff-based | Lightweight, low-boilerplate agent handoffs |
| Semantic Kernel | Plugin/planner | Enterprise .NET/Python integrations |

Each has its niche. LangGraph is the right choice here because the SDLC is fundamentally a directed acyclic pipeline with conditional error exits. State flows forward, agents don't loop back, and failures need to short-circuit gracefully. That's exactly what LangGraph's StateGraph was built for.

Why not just one big prompt?

A single "write me an app from this description" prompt degrades quickly for non-trivial tasks:

  1. Context collapse — one prompt can't simultaneously be a BA, architect, developer, QA engineer, and reviewer without each role undermining the others
  2. No specialisation — a general prompt produces general output; specialised prompts with role-specific context produce expert output
  3. No accountability — you can't easily replay from the architect stage if only the code was wrong
  4. Token ceiling — a single-turn mega-prompt blows up for anything beyond toy examples

The pipeline approach solves all four.


Architecture Overview

Every phase is a LangGraph node. Every edge is either unconditional (start → load_task, write_artifacts → END) or conditional (check state["status"], route to error_handler if "error").

The entire pipeline shares one typed state object (SDLCState), defined once and validated throughout:


from typing import Annotated, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class SDLCState(TypedDict):
    task_md: str                              # Input
    spec_md: str                              # BA output
    tech_design_md: str                       # Architect output (updated by Dev)
    generated_code: dict[str, str]            # Dev output: filename → content
    test_code: dict[str, str]                 # QA output: filename → content
    code_review_md: str                       # Review output
    project_name: str                         # Extracted from spec
    status: str                               # "running" | "error" | "done"
    current_agent: str
    error: Optional[str]
    messages: Annotated[list[BaseMessage], add_messages]


Agents return only the keys they modify. LangGraph merges partial updates into the full state automatically.
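LangGraph's merge rule can be sketched without the library: plain keys are overwritten, while reducer-annotated keys (like `messages` with `add_messages`) are combined by their reducer. A minimal stdlib emulation, purely for illustration (`merge_state` and `reducers` are hypothetical names, not part of LangGraph):

```python
# Hypothetical stdlib emulation of LangGraph's partial-state merge.
# Plain keys are overwritten; reducer-annotated keys are combined.

def merge_state(state: dict, update: dict, reducers: dict) -> dict:
    merged = dict(state)
    for key, value in update.items():
        if key in reducers:
            # e.g. messages: combine old and new via the reducer
            merged[key] = reducers[key](merged.get(key, []), value)
        else:
            merged[key] = value
    return merged

# add_messages-style reducer: append new messages to the existing list
reducers = {"messages": lambda old, new: old + new}

state = {"task_md": "Build a to-do CLI", "status": "running", "messages": []}
update = {"spec_md": "## Overview ...", "current_agent": "ba_agent", "messages": []}
state = merge_state(state, update, reducers)
# spec_md is now present; task_md survives untouched
```

This is why each agent can stay small: it returns only its own keys and everything else flows through unchanged.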


The Five Agents

1. BA Agent

Input: raw task description
Output: structured Markdown spec

The BA agent takes the free-form task and produces a proper specification document:


project_name: simple_cli_todo_list

## Overview
A command-line to-do application that runs in a loop...

## Goals
- Provide a simple, interactive interface for managing tasks
- Support add, show, and delete operations

## Functional Requirements
- FR-1: `add "item"` appends a new item to the list
- FR-2: `show` displays all items numbered 1-based
- FR-3: `delete N` removes the item at position N
- FR-4: `quit` exits the loop gracefully

## Non-Functional Requirements
- Pure Python, no external dependencies
- Single-file implementation preferred


The first line is always project_name: <snake_case_name> — this is parsed with a regex and used to name all output folders for the rest of the run.

Why gpt-4o-mini? Structured document generation from a template is a lightweight task. The mini model is fast, cheap, and plenty capable here.

2. Architect Agent

Input: spec from BA
Output: full technical design + implementation checklist

The Architect produces a complete design document covering components, data models, data flow, tech stack, and file structure. The critical part is the Implementation Plan section — a numbered checklist in - [ ] format:


## Implementation Plan

- [ ] 1. Define `TodoList` class with internal list storage
- [ ] 2. Implement `add_item(text)` method
- [ ] 3. Implement `show_items()` method
- [ ] 4. Implement `delete_item(n)` method with bounds checking
- [ ] 5. Write `main()` loop with command parsing
- [ ] 6. Handle invalid commands and out-of-range deletes


This checklist isn't just documentation — the Dev Agent updates it after code generation.

3. Dev Agent (two LLM calls)

Input: technical design
Output: source files as {filename: content} dict + updated tech design

This is the most complex agent. It makes two sequential LLM calls:

Call 1 β€” Code generation:
Returns a JSON object mapping filenames to file content. The strict JSON output contract lets us reliably parse multi-file outputs regardless of LLM formatting variations:


{
  "todo.py": "\"\"\"Simple CLI to-do list.\"\"\"\n\nclass TodoList:\n    ...",
  "main.py": "from todo import TodoList\n\ndef main():\n    ..."
}


A _parse_json_output() helper strips markdown fences before parsing — LLMs are inconsistent about whether they wrap JSON in ```json blocks.

Call 2 β€” Plan update:
Takes the tech design + generated filenames, rewrites the implementation plan with all steps marked [x] and annotated with the file that implements each step:


- [x] 1. Define `TodoList` class → todo.py
- [x] 2. Implement `add_item(text)` method → todo.py
- [x] 5. Write `main()` loop with command parsing → main.py


The updated tech_design_md (with checked-off plan) replaces the original in state and gets persisted to disk. When you open artifacts/<project>/tech_design.md after a run, you see exactly what was built and where.

Why gpt-4o for Dev? Code generation quality matters. The gap between gpt-4o and gpt-4o-mini on code is meaningful, especially for edge case handling, idiom correctness, and docstrings.

4. QA Agent

Input: all generated source files
Output: pytest test files as {filename: content} dict

The QA Agent reads every source file and writes comprehensive pytest tests. The key insight in the prompt: the QA Agent is given the actual source code, not just the spec — this means the tests actually match the implementation's structure (real method names, real class names).

Generated tests cover:

  • Happy paths (standard usage)
  • Edge cases (empty list, boundary indices)
  • Error conditions (invalid input, out-of-range delete)
  • unittest.mock for any I/O or external calls

5. Review Agent

Input: all source files + all test files
Output: structured Markdown code review

The review doc follows a consistent schema:



## Summary
Clean implementation of the requirements. Single-file structure is appropriate.

## Issues

| # | Severity | Location | Issue | Recommendation |
|---|----------|----------|-------|----------------|
| 1 | 🟑 Minor | todo.py:14 | No type hints on public methods | Add `-> None` / `-> str` annotations |
| 2 | 🔵 Info  | main.py:3  | No `if __name__ == "__main__"` guard | Wrap main() call |

## Test Coverage Assessment
Tests cover all three commands and error paths. Missing: concurrent access scenario (out of scope for CLI).

## Verdict: ✅ Approved


Severity codes: 🔴 Critical, 🟠 High, 🟑 Minor, 🔵 Info.


State Management: The Secret Sauce

LangGraph's state model is what makes this architecture clean.

Context isolation between agents

Each agent returns "messages": [] in its return dict. Because messages uses LangGraph's add_messages reducer, which appends rather than replaces, an empty list adds nothing to the shared history. Each agent also builds its prompt from explicit SystemMessage and HumanMessage objects instead of the accumulated state, so no prior-agent conversation leaks into its calls:


def ba_agent(state: SDLCState) -> dict:
    response = llm.invoke([SystemMessage(...), HumanMessage(...)])
    return {
        "spec_md": response.content,
        "project_name": _extract_project_name(response.content),
        "current_agent": "ba_agent",
        "messages": [],  # ← no shared history carried to the next agent
    }


Without this isolation, each subsequent agent would see the entire conversation history from all previous agents: a context bleed that confuses specialised roles and wastes tokens.

Conditional error routing

Every edge (except the final ones) uses the same routing factory:


def _route(next_node: str):
    def route(state: SDLCState) -> str:
        if state.get("status") == "error":
            return "error_handler"
        return next_node
    return route

builder.add_conditional_edges("ba_agent", _route("architect_agent"))
builder.add_conditional_edges("architect_agent", _route("dev_agent"))
# ... etc


Any agent can fail gracefully by returning {"status": "error", "error": "message"}. The graph short-circuits to error_handler without affecting already-written artifacts. This is critical for real-world use where LLM calls occasionally fail or return malformed output.

Checkpointing and resumability

The graph compiles with MemorySaver():


graph = build_graph()  # compiled with MemorySaver checkpointer


Every invocation gets a unique thread_id (UUID). This means state is checkpointed at every node boundary. You can resume a failed run or inspect intermediate state without replaying the whole pipeline.
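In LangGraph, the thread id travels in the `configurable` section of the run config. A minimal sketch of what each invocation presumably looks like (`graph` and the initial state keys come from the article; the config shape is standard LangGraph usage):

```python
import uuid

# Each run gets its own thread_id so MemorySaver checkpoints it separately
config = {"configurable": {"thread_id": str(uuid.uuid4())}}

# The invocation itself needs langgraph installed, hence commented out here:
# final_state = graph.invoke({"task_md": task_md, "status": "running"}, config=config)
```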

The CLI exposes this with run-from:


# Code generation was wrong? Re-run from Dev, reusing existing spec + tech design
python sdlc_cli.py run-from dev


This loads persisted artifacts back into state up to the requested restart point, saving both time and API cost.


The write_artifacts Node

One of the stronger design decisions: agents never touch the filesystem.

All agents are pure functions of state → state. The filesystem write is centralised in a single write_artifacts node that runs only after all agents succeed:


def write_artifacts(state: SDLCState) -> dict:
    name = state["project_name"]
    write_artifact(name, "spec.md", state["spec_md"])
    write_artifact(name, "tech_design.md", state["tech_design_md"])
    write_artifact(name, "code_review.md", state["code_review_md"])
    all_code = {**state["generated_code"], **state["test_code"]}
    write_code_files(name, all_code)
    return {"status": "done"}


Benefits:

  • Testable agents — unit tests mock the LLM, never the filesystem
  • Atomic output — you don't get partially written artifacts from a failed run
  • Single I/O boundary — one place to change output format, destination, or cloud upload
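The write_artifact / write_code_files helpers aren't shown in the post; a plausible `pathlib` sketch, assuming the artifacts/ and code/ layout described in the Output Structure section (names and paths are assumptions):

```python
from pathlib import Path

ARTIFACTS_DIR = Path("artifacts")  # assumed roots, matching the output layout
CODE_DIR = Path("code")

def write_artifact(project: str, filename: str, content: str) -> None:
    """Write one Markdown artifact under artifacts/<project>/."""
    target = ARTIFACTS_DIR / project / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")

def write_code_files(project: str, files: dict[str, str]) -> None:
    """Write every generated source and test file under code/<project>/."""
    for filename, content in files.items():
        target = CODE_DIR / project / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
```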

Output Structure

After python sdlc_cli.py run on a "simple CLI to-do list" task:


artifacts/simple_cli_todo_list/
    spec.md              ← BA spec with functional requirements
    tech_design.md       ← Architect design with ✓ checked implementation plan
    code_review.md       ← Severity-coded review with verdict

code/simple_cli_todo_list/
    todo.py              ← TodoList class implementation
    main.py              ← CLI loop and command parser
    test_todo.py         ← pytest tests for TodoList
    test_main.py         ← pytest tests for the CLI


Both directories are gitignored — they're runtime outputs.


AI Tool Integrations

The pipeline is designed to be AI-tool agnostic. Every popular coding assistant gets its own integration file that delegates to sdlc_cli.py:

| Tool | Integration | Invocation |
|---|---|---|
| Claude Code | .claude/commands/sdlc.md | /sdlc run |
| Cursor | .cursor/commands/sdlc.md | @sdlc run |
| GitHub Copilot | .github/prompts/sdlc.md | Prompt panel |
| Continue.dev | .continue/prompts/sdlc.md | /sdlc run |
| Windsurf | .windsurf/rules/sdlc.md | Rules panel |

AGENTS.md at the repo root is a universal context file — any tool can read it to understand the project architecture and available commands without tool-specific configuration.

This pattern is increasingly important: your automation shouldn't be locked to a single AI assistant.


Advanced Monitoring (LLM, Cost)

LangSmith dashboards provide advanced monitoring and debugging for the pipeline: queries, responses, timing, cost, and more.


CLI Reference


# Run full pipeline on input/task.md
python sdlc_cli.py run

# One-liner β€” write task inline and run
python sdlc_cli.py new "Build a CLI password generator"

# Re-run from a specific agent (reuses prior artifacts)
python sdlc_cli.py run-from dev   # valid: ba, architect, dev, qa, review

# Inspect outputs
python sdlc_cli.py show spec
python sdlc_cli.py show tech_design
python sdlc_cli.py show code
python sdlc_cli.py show code_review

# Check what's been built
python sdlc_cli.py status

# Run framework unit tests (all mocked, no API keys needed)
python sdlc_cli.py test

# Run QA-generated tests for the last project
python sdlc_cli.py test-generated



Testing the Pipeline Itself

The framework ships with its own unit tests in tests/. These test each agent in isolation — no real API calls, no API keys required:


# tests/test_dev_agent.py
@patch("sdlc.agents.dev_agent.llm")
def test_dev_agent_makes_two_llm_calls(mock_llm):
    mock_llm.invoke.side_effect = [
        AIMessage(content='{"main.py": "print(\'hello\')"}'),
        AIMessage(content="- [x] 1. Create main.py β†’ main.py"),
    ]
    result = dev_agent(base_state())
    assert mock_llm.invoke.call_count == 2
    assert "main.py" in result["generated_code"]


The Dev Agent test specifically asserts exactly two LLM calls — if the implementation changes to make one or three calls, the test catches it. This kind of behavioural assertion is more valuable than output-content assertions for LLM-calling code.


Design Decisions Worth Noting

Why separate write_artifacts from agents?
Agents stay pure and testable. A failed run doesn't leave half-written files. One node controls all I/O.

Why JSON for multi-file output?
Markdown code fences are ambiguous when embedding multiple files. JSON gives a reliable, parseable structure: {"filename.py": "content..."}. The _parse_json_output() helper handles LLM fence inconsistencies.

Why reset messages between agents?
Each agent is a standalone expert. Prior conversation context from other agents would confuse the role and waste tokens. Clean slate per agent.

Why gpt-4o for Dev/QA but gpt-4o-mini for the rest?
Code generation and test generation have the highest quality ceiling — stronger model pays off. Structured document generation (spec, design, review) works well with the faster mini model.

Why a CLI instead of direct Python calls?
sdlc_cli.py is a single, tool-agnostic interface. Every AI coding assistant can invoke the same commands. No tool-specific knowledge required.


Getting Started


git clone https://github.com/alexander-uspenskiy/ai_sdlc.git
cd ai_sdlc
python -m venv .venv && .venv\Scripts\activate   # Windows
# source .venv/bin/activate                      # macOS/Linux
pip install -r requirements.txt
cp .env.example .env  # add your OPENAI_API_KEY


Write your task:


# Edit input/task.md with your task description, then:
python sdlc_cli.py run

# Or inline:
python sdlc_cli.py new "Build a REST API for a bookmark manager with FastAPI"


Watch the pipeline run:


============================================================
  Multi-Agent SDLC Pipeline
============================================================
[Pipeline] Loading task...
[BA] Analysing requirements...
[Architect] Designing system...
[Dev] Generating code (call 1/2)...
[Dev] Updating implementation plan (call 2/2)...
[QA] Writing tests...
[Review] Reviewing code...
[Pipeline] Writing artifacts to disk...
============================================================
  ✅ Pipeline complete!
  Project  : simple_cli_todo_list
  Artifacts: artifacts/simple_cli_todo_list/
  Code     : code/simple_cli_todo_list/
============================================================



What's Next

The current pipeline is linear — each agent hands off sequentially. Obvious extensions:

  • Parallel QA + Review — once Dev finishes, QA and Review could run concurrently (LangGraph supports fan-out/fan-in natively)
  • Feedback loops — if Review flags Critical issues, route back to Dev for a fix pass
  • LangSmith tracing — set LANGCHAIN_TRACING_V2=true and every LLM call is logged with inputs, outputs, latency, and token usage
  • Model pluggability — swap agents to Claude Sonnet 4.6, Gemini 2.0 Flash, or local Llama models without changing graph structure
  • Web UI — LangGraph Platform can serve the graph as an API with a streaming interface

Conclusion

Multi-agent SDLC isn't about replacing developers — it's about automating the mechanical parts of the cycle so you can focus on the creative parts: system design decisions, edge case identification, architectural trade-offs.

The LangGraph approach specifically gives you:

  • Explicit, auditable data flow — state is typed and visible at every step
  • Reliable error handling — any agent can fail gracefully without corrupting output
  • Composable architecture — add, remove, or swap agents without touching the graph structure
  • Resumability — run from any checkpoint, save API costs on partial failures

The full project is on GitHub with working code: 👉 https://github.com/alexander-uspenskiy/ai_sdlc


Built with LangGraph 1.1.0, LangChain, OpenAI GPT-4o, Python 3.13.
All agent tests run without API keys — pytest tests/ works out of the box.
