<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Uspenskiy</title>
    <description>The latest articles on DEV Community by Alexander Uspenskiy (@alexander_uspenskiy_the_great).</description>
    <link>https://dev.to/alexander_uspenskiy_the_great</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1636857%2Fe769a564-5262-45d0-918e-c4c093972c9d.jpg</url>
      <title>DEV Community: Alexander Uspenskiy</title>
      <link>https://dev.to/alexander_uspenskiy_the_great</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexander_uspenskiy_the_great"/>
    <language>en</language>
    <item>
      <title>How to build AI SDLC Pipeline in 15 minutes using LangGraph: Fully Autonomous Development Team with 5 Agents</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Fri, 13 Mar 2026 23:40:19 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/ai-sdlc-pipeline-5-agentsfully-autonomus-40hb</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/ai-sdlc-pipeline-5-agentsfully-autonomus-40hb</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with "AI-Assisted" Development
&lt;/h2&gt;

&lt;p&gt;Most AI coding tools today are autocomplete on steroids. They make you faster at &lt;em&gt;typing&lt;/em&gt;, but the fundamental loop hasn't changed: you still decompose requirements, design architecture, write code, write tests, and review — one step at a time, context-switching between roles.&lt;/p&gt;

&lt;p&gt;What if you could delegate the &lt;em&gt;whole loop&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;That's the question behind &lt;strong&gt;AI SDLC&lt;/strong&gt; — a multi-agent pipeline where a chain of specialised AI agents handles every phase of the software development life cycle. You write a plain-English task description. One command later, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A structured software specification&lt;/li&gt;
&lt;li&gt;A full technical design with an implementation checklist&lt;/li&gt;
&lt;li&gt;Working Python source code&lt;/li&gt;
&lt;li&gt;pytest unit tests (edge cases included)&lt;/li&gt;
&lt;li&gt;A code review with severity-coded issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No scaffolding. No boilerplate. No switching tabs.&lt;/p&gt;

&lt;p&gt;The full project is on GitHub with working code: 👉 &lt;a href="https://github.com/alexander-uspenskiy/ai_sdlc" rel="noopener noreferrer"&gt;https://github.com/alexander-uspenskiy/ai_sdlc&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why GPT-4.x models?
&lt;/h2&gt;

&lt;p&gt;GPT-4.x models are used here only for proof-of-concept purposes. For any production environment, it is highly recommended to use GPT-5.3 or higher, or Claude Opus/Sonnet 4.5 or higher (as of the time this article was published).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Landscape: Agentic AI Frameworks in 2026
&lt;/h2&gt;

&lt;p&gt;Before diving into the implementation, it's worth understanding where this fits in the current ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approaches to multi-agent orchestration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graph/state machine&lt;/td&gt;
&lt;td&gt;Sequential pipelines, conditional routing, checkpointing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AutoGen&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversation-based&lt;/td&gt;
&lt;td&gt;Back-and-forth agent dialogues, human-in-the-loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Role-based crew&lt;/td&gt;
&lt;td&gt;Parallel task execution, hierarchical delegation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Swarm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handoff-based&lt;/td&gt;
&lt;td&gt;Lightweight, low-boilerplate agent handoffs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Kernel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Plugin/planner&lt;/td&gt;
&lt;td&gt;Enterprise .NET/Python integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each has its niche. LangGraph is the right choice here because the SDLC is fundamentally a &lt;strong&gt;directed acyclic pipeline&lt;/strong&gt; with &lt;strong&gt;conditional error exits&lt;/strong&gt;. State flows forward, agents don't loop back, and failures need to short-circuit gracefully. That's exactly what LangGraph's &lt;code&gt;StateGraph&lt;/code&gt; was built for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not just one big prompt?
&lt;/h3&gt;

&lt;p&gt;A single "write me an app from this description" prompt degrades quickly for non-trivial tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context collapse&lt;/strong&gt; — one prompt can't simultaneously be a BA, architect, developer, QA engineer, and reviewer without each role undermining the others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No specialisation&lt;/strong&gt; — a general prompt produces general output; specialised prompts with role-specific context produce expert output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No accountability&lt;/strong&gt; — you can't easily replay from the architect stage if only the code was wrong&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token ceiling&lt;/strong&gt; — a single-turn mega-prompt blows up for anything beyond toy examples&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pipeline approach solves all four.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6fvajqjxmpe35x5sy1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6fvajqjxmpe35x5sy1b.png" alt=" " width="800" height="799"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every node is a LangGraph node. Every edge is either unconditional (start → load_task, write_artifacts → END) or conditional (check &lt;code&gt;state["status"]&lt;/code&gt;, route to &lt;code&gt;error_handler&lt;/code&gt; if &lt;code&gt;"error"&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The entire pipeline shares one &lt;strong&gt;typed state object&lt;/strong&gt; (&lt;code&gt;SDLCState&lt;/code&gt;), defined once and validated throughout:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
from typing import Annotated, Optional, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class SDLCState(TypedDict):
    task_md: str                              # Input
    spec_md: str                              # BA output
    tech_design_md: str                       # Architect output (updated by Dev)
    generated_code: dict[str, str]            # Dev output: filename → content
    test_code: dict[str, str]                 # QA output: filename → content
    code_review_md: str                       # Review output
    project_name: str                         # Extracted from spec
    status: str                               # "running" | "error" | "done"
    current_agent: str
    error: Optional[str]
    messages: Annotated[list[BaseMessage], add_messages]

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Agents return only the keys they modify. LangGraph merges partial updates into the full state automatically.&lt;/p&gt;
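&lt;p&gt;As a rough illustration of that merge behaviour (plain Python, not LangGraph internals; &lt;code&gt;merge_state&lt;/code&gt; and &lt;code&gt;reducers&lt;/code&gt; are illustrative names), keys without a reducer are simply overwritten, while reducer-backed keys such as &lt;code&gt;messages&lt;/code&gt; are combined:&lt;/p&gt;

```python
# Conceptual sketch of partial-update merging, not LangGraph internals:
# plain state keys are overwritten by an agent's partial update, while keys
# that declare a reducer (like `messages` with `add_messages`) are combined.

def merge_state(state: dict, update: dict, reducers: dict) -> dict:
    merged = dict(state)
    for key, value in update.items():
        if key in reducers:
            merged[key] = reducers[key](merged.get(key, []), value)
        else:
            merged[key] = value
    return merged

# `messages` accumulates; every other key is a simple overwrite.
reducers = {"messages": lambda old, new: old + new}

state = {"spec_md": "", "messages": ["task loaded"]}
update = {"spec_md": "# Spec", "messages": ["spec written"]}

state = merge_state(state, update, reducers)
print(state["spec_md"])   # "# Spec"
print(state["messages"])  # ["task loaded", "spec written"]
```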




&lt;h2&gt;
  
  
  The Five Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. BA Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; raw task description&lt;br&gt;
&lt;strong&gt;Output:&lt;/strong&gt; structured Markdown spec&lt;/p&gt;

&lt;p&gt;The BA agent takes the free-form task and produces a proper specification document:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
project_name: simple_cli_todo_list

## Overview
A command-line to-do application that runs in a loop...

## Goals
- Provide a simple, interactive interface for managing tasks
- Support add, show, and delete operations

## Functional Requirements
- FR-1: `add "item"` appends a new item to the list
- FR-2: `show` displays all items numbered 1-based
- FR-3: `delete N` removes the item at position N
- FR-4: `quit` exits the loop gracefully

## Non-Functional Requirements
- Pure Python, no external dependencies
- Single-file implementation preferred

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;The first line is always &lt;code&gt;project_name: &amp;lt;snake_case_name&amp;gt;&lt;/code&gt; — this is parsed with a regex and used to name all output folders for the rest of the run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;gpt-4o-mini&lt;/code&gt;?&lt;/strong&gt; Structured document generation from a template is a lightweight task. The mini model is fast, cheap, and plenty capable here.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Architect Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; spec from BA&lt;br&gt;
&lt;strong&gt;Output:&lt;/strong&gt; full technical design + implementation checklist&lt;/p&gt;

&lt;p&gt;The Architect produces a complete design document covering components, data models, data flow, tech stack, and file structure. The critical part is the &lt;strong&gt;Implementation Plan&lt;/strong&gt; section — a numbered checklist in &lt;code&gt;- [ ]&lt;/code&gt; format:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
## Implementation Plan

- [ ] 1. Define `TodoList` class with internal list storage
- [ ] 2. Implement `add_item(text)` method
- [ ] 3. Implement `show_items()` method
- [ ] 4. Implement `delete_item(n)` method with bounds checking
- [ ] 5. Write `main()` loop with command parsing
- [ ] 6. Handle invalid commands and out-of-range deletes

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;This checklist isn't just documentation — the Dev Agent &lt;em&gt;updates it&lt;/em&gt; after code generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Dev Agent (two LLM calls)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; technical design&lt;br&gt;
&lt;strong&gt;Output:&lt;/strong&gt; source files as &lt;code&gt;{filename: content}&lt;/code&gt; dict + updated tech design&lt;/p&gt;

&lt;p&gt;This is the most complex agent. It makes &lt;strong&gt;two sequential LLM calls&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Call 1 — Code generation:&lt;/strong&gt;&lt;br&gt;
Returns a JSON object mapping filenames to file content. The strict JSON output contract lets us reliably parse multi-file outputs regardless of LLM formatting variations:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
{
  "todo.py": "\"\"\"Simple CLI to-do list.\"\"\"\n\nclass TodoList:\n    ...",
  "main.py": "from todo import TodoList\n\ndef main():\n    ..."
}

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;A &lt;code&gt;_parse_json_output()&lt;/code&gt; helper strips markdown fences before parsing — LLMs are inconsistent about whether they wrap JSON in fenced &lt;code&gt;json&lt;/code&gt; code blocks.&lt;/p&gt;
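&lt;p&gt;A fence-tolerant parser can be sketched in a few lines (the real &lt;code&gt;_parse_json_output()&lt;/code&gt; may differ; this just shows the idea):&lt;/p&gt;

```python
import json
import re

# Sketch of a fence-tolerant JSON parser: strip an optional leading
# ```json (or bare ```) fence and a trailing ``` fence, then parse.

def parse_json_output(raw: str) -> dict:
    text = raw.strip()
    text = re.sub(r"^```[a-zA-Z]*\s*", "", text)  # leading fence, if any
    text = re.sub(r"\s*```$", "", text)           # trailing fence, if any
    return json.loads(text)

fenced = '```json\n{"todo.py": "class TodoList: ..."}\n```'
files = parse_json_output(fenced)
print(list(files))  # ['todo.py']
```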

&lt;p&gt;&lt;strong&gt;Call 2 — Plan update:&lt;/strong&gt;&lt;br&gt;
Takes the tech design + generated filenames, rewrites the implementation plan with all steps marked &lt;code&gt;[x]&lt;/code&gt; and annotated with the file that implements each step:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
- [x] 1. Define `TodoList` class → todo.py
- [x] 2. Implement `add_item(text)` method → todo.py
- [x] 5. Write `main()` loop with command parsing → main.py

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;The updated &lt;code&gt;tech_design_md&lt;/code&gt; (with checked-off plan) replaces the original in state and gets persisted to disk. When you open &lt;code&gt;artifacts/&amp;lt;project&amp;gt;/tech_design.md&lt;/code&gt; after a run, you see exactly what was built and where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;gpt-4o&lt;/code&gt; for Dev?&lt;/strong&gt; Code generation quality matters. The gap between &lt;code&gt;gpt-4o&lt;/code&gt; and &lt;code&gt;gpt-4o-mini&lt;/code&gt; on code is meaningful, especially for edge case handling, idiom correctness, and docstrings.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. QA Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; all generated source files&lt;br&gt;
&lt;strong&gt;Output:&lt;/strong&gt; pytest test files as &lt;code&gt;{filename: content}&lt;/code&gt; dict&lt;/p&gt;

&lt;p&gt;The QA Agent reads every source file and writes comprehensive pytest tests. The key insight in the prompt: &lt;em&gt;test files are given the actual source code, not just the spec&lt;/em&gt; — this means the tests actually match the implementation's structure (real method names, real class names).&lt;/p&gt;
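&lt;p&gt;A hedged sketch of how such a prompt might embed the real source files so generated tests match actual class and method names (the template wording below is an assumption, not the repo's actual prompt):&lt;/p&gt;

```python
# Illustrative QA prompt assembly: concatenate every generated source file
# into the prompt so the model tests real names, not guessed ones.

def build_qa_prompt(generated_code: dict) -> str:
    sections = []
    for filename, content in sorted(generated_code.items()):
        sections.append(f"### {filename}\n{content}")
    sources = "\n\n".join(sections)
    return (
        "Write pytest tests for the following source files. "
        "Cover happy paths, edge cases, and error conditions. "
        "Use unittest.mock for any I/O.\n\n" + sources
    )

prompt = build_qa_prompt({"todo.py": "class TodoList: ..."})
print("### todo.py" in prompt)  # True
```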

&lt;p&gt;Generated tests cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Happy paths (standard usage)&lt;/li&gt;
&lt;li&gt;Edge cases (empty list, boundary indices)&lt;/li&gt;
&lt;li&gt;Error conditions (invalid input, out-of-range delete)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unittest.mock&lt;/code&gt; for any I/O or external calls&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Review Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; all source files + all test files&lt;br&gt;
&lt;strong&gt;Output:&lt;/strong&gt; structured Markdown code review&lt;/p&gt;

&lt;p&gt;The review doc follows a consistent schema:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;

## Summary
Clean implementation of the requirements. Single-file structure is appropriate.

## Issues

| # | Severity | Location | Issue | Recommendation |
|---|----------|----------|-------|----------------|
| 1 | 🟡 Minor | todo.py:14 | No type hints on public methods | Add `-&amp;gt; None` / `-&amp;gt; str` annotations |
| 2 | 🔵 Info  | main.py:3  | No `if __name__ == "__main__"` guard | Wrap main() call |

## Test Coverage Assessment
Tests cover all three commands and error paths. Missing: concurrent access scenario (out of scope for CLI).

## Verdict: ✅ Approved

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Severity codes: 🔴 Critical, 🟠 High, 🟡 Minor, 🔵 Info.&lt;/p&gt;




&lt;h2&gt;
  
  
  State Management: The Secret Sauce
&lt;/h2&gt;

&lt;p&gt;LangGraph's state model is what makes this architecture clean.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context isolation between agents
&lt;/h3&gt;

&lt;p&gt;Each agent resets &lt;code&gt;"messages": []&lt;/code&gt; in its return dict. Because &lt;code&gt;messages&lt;/code&gt; uses LangGraph's &lt;code&gt;add_messages&lt;/code&gt; reducer — which &lt;em&gt;accumulates&lt;/em&gt; messages — returning an empty list clears the accumulated history:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
def ba_agent(state: SDLCState) -&amp;gt; dict:
    response = llm.invoke([SystemMessage(...), HumanMessage(...)])
    return {
        "spec_md": response.content,
        "project_name": _extract_project_name(response.content),
        "current_agent": "ba_agent",
        "messages": [],  # ← clears history for the next agent
    }

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Without this, each subsequent agent would see the entire conversation history from all previous agents — a context bleed that confuses specialised roles and wastes tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conditional error routing
&lt;/h3&gt;

&lt;p&gt;Every edge (except the final ones) uses the same routing factory:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
def _route(next_node: str):
    def route(state: SDLCState) -&amp;gt; str:
        if state.get("status") == "error":
            return "error_handler"
        return next_node
    return route

builder.add_conditional_edges("ba_agent", _route("architect_agent"))
builder.add_conditional_edges("architect_agent", _route("dev_agent"))
# ... etc

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Any agent can fail gracefully by returning &lt;code&gt;{"status": "error", "error": "message"}&lt;/code&gt;. The graph short-circuits to &lt;code&gt;error_handler&lt;/code&gt; without affecting already-written artifacts. This is critical for real-world use where LLM calls occasionally fail or return malformed output.&lt;/p&gt;
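&lt;p&gt;One way to enforce that contract uniformly is a small wrapper that converts uncaught exceptions into the error-status shape the router understands (an illustrative sketch, not code from the repo):&lt;/p&gt;

```python
from functools import wraps

# Illustrative decorator: any exception inside an agent becomes a
# {"status": "error", ...} partial update, so the conditional edges can
# short-circuit the graph to error_handler instead of crashing the run.

def safe_agent(agent_fn):
    @wraps(agent_fn)
    def wrapper(state: dict) -> dict:
        try:
            return agent_fn(state)
        except Exception as exc:
            return {"status": "error", "error": f"{agent_fn.__name__}: {exc}"}
    return wrapper

@safe_agent
def flaky_agent(state: dict) -> dict:
    raise ValueError("LLM returned malformed JSON")

result = flaky_agent({})
print(result["status"])  # error
```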

&lt;h3&gt;
  
  
  Checkpointing and resumability
&lt;/h3&gt;

&lt;p&gt;The graph compiles with &lt;code&gt;MemorySaver()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
graph = build_graph()  # compiled with MemorySaver checkpointer

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Every invocation gets a unique &lt;code&gt;thread_id&lt;/code&gt; (UUID). This means state is checkpointed at every node boundary. You can resume a failed run or inspect intermediate state without replaying the whole pipeline.&lt;/p&gt;

&lt;p&gt;The CLI exposes this with &lt;code&gt;run-from&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
# Code generation was wrong? Re-run from Dev, reusing existing spec + tech design
python sdlc_cli.py run-from dev

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;This loads persisted artifacts back into state up to the requested restart point, saving both time and API cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  The write_artifacts Node
&lt;/h2&gt;

&lt;p&gt;One of the stronger design decisions: &lt;strong&gt;agents never touch the filesystem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All agents are pure functions of state → state. The filesystem write is centralised in a single &lt;code&gt;write_artifacts&lt;/code&gt; node that runs only after all agents succeed:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
def write_artifacts(state: SDLCState) -&amp;gt; dict:
    name = state["project_name"]
    write_artifact(name, "spec.md", state["spec_md"])
    write_artifact(name, "tech_design.md", state["tech_design_md"])
    write_artifact(name, "code_review.md", state["code_review_md"])
    all_code = {**state["generated_code"], **state["test_code"]}
    write_code_files(name, all_code)
    return {"status": "done"}

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Testable agents&lt;/strong&gt; — unit tests mock the LLM, never the filesystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic output&lt;/strong&gt; — you don't get partially written artifacts from a failed run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single I/O boundary&lt;/strong&gt; — one place to change output format, destination, or cloud upload&lt;/li&gt;
&lt;/ul&gt;
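&lt;p&gt;The I/O helpers themselves can be tiny; a minimal sketch (real signatures in the repo may differ), with all writes funnelled under the project's output directories:&lt;/p&gt;

```python
from pathlib import Path

# Minimal sketch of the centralised I/O layer: one helper creates the
# project folder and writes a single file; a second fans it out over the
# merged code + test dict. Agents never call these directly.

def write_artifact(project: str, filename: str, content: str,
                   root: str = "artifacts") -> Path:
    path = Path(root) / project / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path

def write_code_files(project: str, files: dict, root: str = "code") -> list:
    return [write_artifact(project, name, body, root)
            for name, body in files.items()]
```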




&lt;h2&gt;
  
  
  Output Structure
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;python sdlc_cli.py run&lt;/code&gt; on a "simple CLI to-do list" task:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
artifacts/simple_cli_todo_list/
    spec.md              ← BA spec with functional requirements
    tech_design.md       ← Architect design with ✓ checked implementation plan
    code_review.md       ← Severity-coded review with verdict

code/simple_cli_todo_list/
    todo.py              ← TodoList class implementation
    main.py              ← CLI loop and command parser
    test_todo.py         ← pytest tests for TodoList
    test_main.py         ← pytest tests for the CLI

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Both directories are gitignored — they're runtime outputs.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI Tool Integrations
&lt;/h2&gt;

&lt;p&gt;The pipeline is designed to be &lt;strong&gt;AI-tool agnostic&lt;/strong&gt;. Every popular coding assistant gets its own integration file that delegates to &lt;code&gt;sdlc_cli.py&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Integration&lt;/th&gt;
&lt;th&gt;Invocation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/commands/sdlc.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/sdlc run&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cursor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.cursor/commands/sdlc.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@sdlc run&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.github/prompts/sdlc.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prompt panel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Continue.dev&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.continue/prompts/sdlc.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/sdlc run&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Windsurf&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.windsurf/rules/sdlc.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rules panel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; at the repo root is a universal context file — any tool can read it to understand the project architecture and available commands without tool-specific configuration.&lt;/p&gt;

&lt;p&gt;This pattern is increasingly important: your automation shouldn't be locked to a single AI assistant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Advanced Monitoring (LLM, Cost)
&lt;/h2&gt;

&lt;p&gt;For advanced monitoring and debugging, the pipeline integrates with the &lt;a href="https://smith.langchain.com/" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt; dashboard, which records prompts, responses, timing, cost, and more for every LLM call.&lt;/p&gt;




&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;pre&gt;
&lt;code&gt;
# Run full pipeline on input/task.md
python sdlc_cli.py run

# One-liner — write task inline and run
python sdlc_cli.py new "Build a CLI password generator"

# Re-run from a specific agent (reuses prior artifacts)
python sdlc_cli.py run-from dev   # valid: ba, architect, dev, qa, review

# Inspect outputs
python sdlc_cli.py show spec
python sdlc_cli.py show tech_design
python sdlc_cli.py show code
python sdlc_cli.py show code_review

# Check what's been built
python sdlc_cli.py status

# Run framework unit tests (all mocked, no API keys needed)
python sdlc_cli.py test

# Run QA-generated tests for the last project
python sdlc_cli.py test-generated

&lt;/code&gt;
&lt;/pre&gt;




&lt;h2&gt;
  
  
  Testing the Pipeline Itself
&lt;/h2&gt;

&lt;p&gt;The framework ships with its own unit tests in &lt;code&gt;tests/&lt;/code&gt;. These test each agent in isolation — no real API calls, no API keys required:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
# tests/test_dev_agent.py
from unittest.mock import patch

from langchain_core.messages import AIMessage

from sdlc.agents.dev_agent import dev_agent

@patch("sdlc.agents.dev_agent.llm")
def test_dev_agent_makes_two_llm_calls(mock_llm):
    mock_llm.invoke.side_effect = [
        AIMessage(content='{"main.py": "print(\'hello\')"}'),
        AIMessage(content="- [x] 1. Create main.py → main.py"),
    ]
    result = dev_agent(base_state())
    assert mock_llm.invoke.call_count == 2
    assert "main.py" in result["generated_code"]

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;The Dev Agent test specifically asserts &lt;strong&gt;exactly two LLM calls&lt;/strong&gt; — if the implementation changes to make one or three calls, the test catches it. This kind of behavioural assertion is more valuable than output-content assertions for LLM-calling code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Decisions Worth Noting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why separate &lt;code&gt;write_artifacts&lt;/code&gt; from agents?&lt;/strong&gt;&lt;br&gt;
Agents stay pure and testable. A failed run doesn't leave half-written files. One node controls all I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why JSON for multi-file output?&lt;/strong&gt;&lt;br&gt;
Markdown code fences are ambiguous when embedding multiple files. JSON gives a reliable, parseable structure: &lt;code&gt;{"filename.py": "content..."}&lt;/code&gt;. The &lt;code&gt;_parse_json_output()&lt;/code&gt; helper handles LLM fence inconsistencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why reset &lt;code&gt;messages&lt;/code&gt; between agents?&lt;/strong&gt;&lt;br&gt;
Each agent is a standalone expert. Prior conversation context from other agents would confuse the role and waste tokens. Clean slate per agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;gpt-4o&lt;/code&gt; for Dev/QA but &lt;code&gt;gpt-4o-mini&lt;/code&gt; for the rest?&lt;/strong&gt;&lt;br&gt;
Code generation and test generation have the highest quality ceiling — stronger model pays off. Structured document generation (spec, design, review) works well with the faster mini model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a CLI instead of direct Python calls?&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;sdlc_cli.py&lt;/code&gt; is a single, tool-agnostic interface. Every AI coding assistant can invoke the same commands. No tool-specific knowledge required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;pre&gt;
&lt;code&gt;
git clone https://git.epam.com/alexander_uspensky/ai-sdlc.git
cd ai-sdlc
python -m venv .venv &amp;amp;&amp;amp; .venv\Scripts\activate   # Windows
# source .venv/bin/activate                      # macOS/Linux
pip install -r requirements.txt
cp .env.example .env  # add your OPENAI_API_KEY

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Write your task:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
# Edit input/task.md with your task description, then:
python sdlc_cli.py run

# Or inline:
python sdlc_cli.py new "Build a REST API for a bookmark manager with FastAPI"

&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;Watch the pipeline run:&lt;/p&gt;

&lt;pre&gt;
&lt;code&gt;
============================================================
  Multi-Agent SDLC Pipeline
============================================================
[Pipeline] Loading task...
[BA] Analysing requirements...
[Architect] Designing system...
[Dev] Generating code (call 1/2)...
[Dev] Updating implementation plan (call 2/2)...
[QA] Writing tests...
[Review] Reviewing code...
[Pipeline] Writing artifacts to disk...
============================================================
  ✅ Pipeline complete!
  Project  : simple_cli_todo_list
  Artifacts: artifacts/simple_cli_todo_list/
  Code     : code/simple_cli_todo_list/
============================================================

&lt;/code&gt;
&lt;/pre&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The current pipeline is linear — each agent hands off sequentially. Obvious extensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel QA + Review&lt;/strong&gt; — once Dev finishes, QA and Review could run concurrently (LangGraph supports fan-out/fan-in natively)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback loops&lt;/strong&gt; — if Review flags Critical issues, route back to Dev for a fix pass&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith tracing&lt;/strong&gt; — set &lt;code&gt;LANGCHAIN_TRACING_V2=true&lt;/code&gt; and every LLM call is logged with inputs, outputs, latency, and token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model pluggability&lt;/strong&gt; — swap agents to Claude Sonnet 4.6, Gemini 2.0 Flash, or local Llama models without changing graph structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web UI&lt;/strong&gt; — LangGraph's &lt;code&gt;LangGraph Platform&lt;/code&gt; can serve the graph as an API with a streaming interface&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Multi-agent SDLC isn't about replacing developers — it's about automating the &lt;em&gt;mechanical&lt;/em&gt; parts of the cycle so you can focus on the &lt;em&gt;creative&lt;/em&gt; parts: system design decisions, edge case identification, architectural trade-offs.&lt;/p&gt;

&lt;p&gt;The LangGraph approach specifically gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicit, auditable data flow&lt;/strong&gt; — state is typed and visible at every step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable error handling&lt;/strong&gt; — any agent can fail gracefully without corrupting output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable architecture&lt;/strong&gt; — add, remove, or swap agents without touching the graph structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resumability&lt;/strong&gt; — run from any checkpoint, save API costs on partial failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full project is on GitHub with working code: 👉 &lt;a href="https://github.com/alexander-uspenskiy/ai_sdlc" rel="noopener noreferrer"&gt;https://github.com/alexander-uspenskiy/ai_sdlc&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with LangGraph 1.1.0, LangChain, OpenAI GPT-4o, Python 3.13.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;All agent tests run without API keys — &lt;code&gt;pytest tests/&lt;/code&gt; works out of the box.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>sdlc</category>
      <category>llm</category>
    </item>
    <item>
      <title>Semantic Similarity Score for AI RAG</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Mon, 19 May 2025 17:46:13 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/semantic-similarity-score-for-ai-rag-2fck</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/semantic-similarity-score-for-ai-rag-2fck</guid>
      <description>&lt;h2&gt;
  
  
  What is Semantic Similarity Score?
&lt;/h2&gt;

&lt;p&gt;A semantic similarity score measures how closely two pieces of text (like a question and an answer) relate in meaning—regardless of exact wording. In AI systems, it’s used to rank or retrieve the most relevant answers by comparing their vector embeddings. A higher score (closer to 1) means the texts are more alike in context and intent.&lt;/p&gt;

&lt;p&gt;Think of it as how well your AI understood what you meant—beyond just matching keywords.&lt;/p&gt;
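&lt;p&gt;A toy illustration of the idea using cosine similarity, the usual scoring function over embeddings in RAG pipelines. Real systems compare high-dimensional embeddings from a model; the 3-d vectors here are made up for the example:&lt;/p&gt;

```python
import math

# Cosine similarity over two embedding vectors: the dot product divided by
# the product of the vector norms. Scores closer to 1 mean closer meaning.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.3]
answer_close = [0.8, 0.2, 0.25]  # similar direction to the query
answer_far = [0.1, 0.9, 0.0]     # mostly orthogonal meaning

print(round(cosine_similarity(query, answer_close), 3))  # close to 1
print(round(cosine_similarity(query, answer_far), 3))    # much lower
```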

&lt;p&gt;My next RAG PoC will use the similarity score to decide which source better answers a query: the data in the vector database or the context returned by the web search agent.&lt;/p&gt;

&lt;p&gt;If interested you can find my previous articles on RAG POCs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/alexander_uspenskiy_the_great/build-the-smartest-bot-youve-ever-seen-a-7b-model-web-search-right-on-your-laptop-5eoe"&gt;Build the Smartest AI Bot You’ve Ever Seen — A 7B Model + Web Search, Right on Your Laptop&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/alexander_uspenskiy_the_great/how-to-create-your-own-rag-with-free-llm-models-and-a-knowledge-base-2odm"&gt;How to Create Your Own RAG with Free LLM Models and a Knowledge Base&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Build the Smartest AI Bot You’ve Ever Seen — A 7B Model + Web Search, Right on Your Laptop</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Tue, 22 Apr 2025 21:54:01 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/build-the-smartest-bot-youve-ever-seen-a-7b-model-web-search-right-on-your-laptop-5eoe</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/build-the-smartest-bot-youve-ever-seen-a-7b-model-web-search-right-on-your-laptop-5eoe</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcccdw80t7wqu1mv1vl6j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcccdw80t7wqu1mv1vl6j.jpg" alt="Smartest AI Bot" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary:
&lt;/h2&gt;

&lt;p&gt;RAG Web is a Python-based application that combines web search and natural language processing to answer user queries. It uses DuckDuckGo for retrieving web search results and a Hugging Face Zephyr-7B-beta model for generating answers based on the retrieved context. &lt;/p&gt;

&lt;p&gt;This is my second article related to RAG implementation, the first part related to vector in-memory RAG on your laptop you can find here: &lt;a href="https://dev.to/alexander_uspenskiy_the_great/how-to-create-your-own-rag-with-free-llm-models-and-a-knowledge-base-2odm"&gt;How to create your own RAG&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Web Search RAG Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7cvfq8918s4v0c03vjv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7cvfq8918s4v0c03vjv.png" alt="Web Search RAG Architecture" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this proof of concept, the user query is sent as input to an external web search. This implementation uses the DuckDuckGo service to avoid the API and security restrictions of more capable search engines such as Google. The search result (as context) is then sent, together with the original user query, to the language model (HuggingFaceH4/zephyr-7b-beta), which summarises the context, extracts the answer, and outputs it to the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Instructions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone / Copy the Project&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/alexander-uspenskiy/rag_web
&lt;span class="nb"&gt;cd &lt;/span&gt;rag_web
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create and Activate Virtual Environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Install Requirements&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Run the Script&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python rag_web.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How the Script Works
&lt;/h2&gt;

&lt;p&gt;This is a lightweight Retrieval-Augmented Generation (RAG) implementation using:&lt;br&gt;
    • A 7B language model (Zephyr) from Hugging Face&lt;br&gt;
    • DuckDuckGo for real-time web search (no API key needed)&lt;/p&gt;
&lt;h2&gt;
  
  
  Code Breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Imports and Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;duckduckgo_search&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DDGS&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;textwrap&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;transformers: From Hugging Face, used to load and interact with the LLM.&lt;/li&gt;
&lt;li&gt;DDGS: DuckDuckGo’s Python interface for search queries.&lt;/li&gt;
&lt;li&gt;textwrap: Used for formatting the output neatly.&lt;/li&gt;
&lt;li&gt;re: Regular expressions to clean the model’s output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Web Search Function&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DDGS&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ddgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ddgs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Purpose: Takes a user query and performs a web search.&lt;/li&gt;
&lt;li&gt;How it works: Uses the DDGS().text(...) method to fetch search results.&lt;/li&gt;
&lt;li&gt;Returns: A list of snippet texts (just the bodies, without links/titles).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Context Generation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;snippets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snippets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;textwrap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Combines all snippet results into one big context paragraph.&lt;/li&gt;
&lt;li&gt;Applies word wrapping to improve readability (optional for model input but nice for debugging/logging).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Model Initialization&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;qa_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceH4/zephyr-7b-beta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceH4/zephyr-7b-beta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Loads Zephyr-7B, a chat-tuned model from Hugging Face.&lt;/li&gt;
&lt;li&gt;device_map="auto" lets Hugging Face offload model parts across available hardware (e.g., MPS or CUDA).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Question Answering Function&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_question&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;a) Get Context&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Performs search and prepares the retrieved content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;b) Prepare Prompt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;[CONTEXT]
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

[QUESTION]
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

[ANSWER]
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This RAG-style prompt provides the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[CONTEXT] = retrieved text from the web&lt;/li&gt;
&lt;li&gt;[QUESTION] = user’s query&lt;/li&gt;
&lt;li&gt;[ANSWER] = expected model output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;c) Generate Answer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;qa_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The model generates text following the [ANSWER] tag.&lt;/li&gt;
&lt;li&gt;do_sample=True allows some creativity/randomness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;d) Post-processing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;answer_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[ANSWER]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;[^&amp;gt;]+&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Strips the prompt from the output.&lt;/li&gt;
&lt;li&gt;Removes any stray XML/HTML-style tags the model might emit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. User Interaction Loop&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Opens a CLI loop.&lt;/li&gt;
&lt;li&gt;Reads user input from the terminal.&lt;/li&gt;
&lt;li&gt;Runs the full search + answer pipeline.&lt;/li&gt;
&lt;li&gt;Displays the answer and continues unless the user types exit or quit.&lt;/li&gt;
&lt;/ul&gt;
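
&lt;p&gt;Only the &lt;code&gt;if __name__&lt;/code&gt; guard is shown above; the loop the bullets describe can be sketched as follows. This is a hedged reconstruction, not the repository's exact code: the real script calls &lt;code&gt;answer_question(query)&lt;/code&gt;, which is passed in here as &lt;code&gt;answer_fn&lt;/code&gt; so the sketch stays self-contained.&lt;/p&gt;

```python
# Hedged sketch of the CLI loop described above. answer_fn stands in for the
# answer_question() function defined earlier in the script.
def should_exit(user_input):
    """True when the user typed exit or quit (case-insensitive)."""
    return user_input.strip().lower() in {"exit", "quit"}

def main(answer_fn, input_fn=input, print_fn=print):
    """Read queries from the terminal until the user asks to leave."""
    while True:
        query = input_fn("Ask a question: ")
        if should_exit(query):
            break
        print_fn(answer_fn(query))
```

&lt;p&gt;Passing &lt;code&gt;input_fn&lt;/code&gt; and &lt;code&gt;print_fn&lt;/code&gt; as parameters is an illustrative choice that makes the loop easy to exercise without a terminal.&lt;/p&gt;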

&lt;h2&gt;
  
  
  Architecture Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User Query]
     ↓
DuckDuckGo Search API
     ↓
[Web Snippets]
     ↓
[CONTEXT] + [QUESTION] Prompt
     ↓
Zephyr 7B (Hugging Face)
     ↓
[Generated Answer]
     ↓
Display in Terminal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Zephyr-7B?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Zephyr&lt;/strong&gt; is a family of instruction-tuned, open-weight language models developed by &lt;a href="https://huggingface.co/HuggingFaceH4/zephyr-7b-beta" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;. It's designed to be helpful, honest, and harmless — and small enough to run on consumer hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Characteristics
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7 Billion parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Based on Mistral-7B (dense transformer, multi-query attention)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tuning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fine-tuned using DPO (Direct Preference Optimization)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Length&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supports up to 8,192 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs locally on M1/M2 Macs, GPUs, or even CPU with quantization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optimized for dialogue, instructions, and chat use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Why I Picked Zephyr for This Script
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open weights&lt;/strong&gt; — no API keys, no rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs on laptop&lt;/strong&gt; — 7B is small enough for consumer devices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruction-tuned&lt;/strong&gt; — great at handling prompts containing context and questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Friendly outputs&lt;/strong&gt; — fine-tuned to be helpful and safe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy integration&lt;/strong&gt; — via Hugging Face &lt;code&gt;transformers&lt;/code&gt; pipeline&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Compared to Other Models
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zephyr-7B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open, chat-tuned, lightweight&lt;/td&gt;
&lt;td&gt;Slightly less fluent than GPT-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-3.5/4&lt;/td&gt;
&lt;td&gt;Top-tier reasoning&lt;/td&gt;
&lt;td&gt;Closed, pay-per-use, no local use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral-7B&lt;/td&gt;
&lt;td&gt;High-speed base model&lt;/td&gt;
&lt;td&gt;Needs fine-tuning for QA/chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLaMA2 7B&lt;/td&gt;
&lt;td&gt;Open and popular&lt;/td&gt;
&lt;td&gt;Less optimized for chat out-of-box&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Final Thoughts on the Model
&lt;/h2&gt;

&lt;p&gt;Zephyr-7B hits the sweet spot between performance, privacy, and portability. It gives you GPT-style interaction with full local control — and when combined with web search, it becomes a surprisingly capable assistant.&lt;/p&gt;

&lt;p&gt;If you're building a local AI assistant or just want to experiment with RAG pipelines without burning through API tokens, &lt;strong&gt;Zephyr-7B is a strong starting point.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage example
&lt;/h2&gt;

&lt;p&gt;You can see how the RAG pipeline searches for real-time data, adds it to the context, and sends it to the model so the model can generate an answer:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybl8k8j0o9b3agfa8f5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybl8k8j0o9b3agfa8f5v.png" alt="Usage example" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Optimization
&lt;/h2&gt;

&lt;p&gt;While the baseline implementation is functional and responsive, several optimizations can improve performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model Quantization&lt;/strong&gt;: Use 4-bit or 8-bit quantized versions of the model with &lt;code&gt;bitsandbytes&lt;/code&gt; to reduce memory usage and inference time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Inference&lt;/strong&gt;: Implement token streaming for faster perceived response times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching Search Results&lt;/strong&gt;: Avoid redundant queries by caching recent DuckDuckGo results locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async Execution&lt;/strong&gt;: Use &lt;code&gt;asyncio&lt;/code&gt; to parallelize web search and token generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Truncation&lt;/strong&gt;: Dynamically trim context to fit within model’s token limits, prioritizing relevance.&lt;/li&gt;
&lt;/ul&gt;
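
&lt;p&gt;The caching idea above can be sketched in a few lines with the standard library. This is an illustrative sketch, not part of the original script: &lt;code&gt;search_web&lt;/code&gt; here is a stub standing in for the DuckDuckGo function shown earlier, instrumented to show that repeated queries skip the network call.&lt;/p&gt;

```python
# Hedged sketch: memoise search results so repeated queries skip the network.
from functools import lru_cache

call_count = {"n": 0}

def search_web(query, num_results=3):
    # Stub for the real DuckDuckGo call; counts invocations for illustration.
    call_count["n"] += 1
    return ["snippet about " + query] * num_results

@lru_cache(maxsize=128)
def cached_search(query):
    # lru_cache needs hashable return values, so convert the list to a tuple.
    return tuple(search_web(query))
```

&lt;p&gt;For a long-running service you would likely want a time-based expiry instead of a pure LRU policy, since web results go stale.&lt;/p&gt;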

&lt;h2&gt;
  
  
  Future Enhancements for Enterprise RAG
&lt;/h2&gt;

&lt;p&gt;To scale this into an enterprise-grade RAG system, consider the following enhancements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector Search Integration&lt;/strong&gt;: Complement web search with a hybrid retrieval system using vector embeddings (e.g., FAISS, Weaviate, Pinecone).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Base Sync&lt;/strong&gt;: Sync data from private sources like Confluence, Notion, SharePoint, or document stores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn Memory&lt;/strong&gt;: Add a conversation memory layer using a session buffer or vector memory for context retention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Feedback Loop&lt;/strong&gt;: Incorporate thumbs-up/down voting to improve results and fine-tune retrieval relevance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security &amp;amp; Auditability&lt;/strong&gt;: Wrap API access and logging in enterprise security layers (SSO, encryption, RBAC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Run inference via model serving tools like vLLM, TGI, or TorchServe with GPU acceleration and autoscaling.&lt;/li&gt;
&lt;/ul&gt;
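
&lt;p&gt;The multi-turn memory enhancement can be sketched as a bounded session buffer that folds recent exchanges into the next prompt's context. All names here are illustrative assumptions, not part of the original script.&lt;/p&gt;

```python
# Hedged sketch of a session buffer for multi-turn memory.
from collections import deque

class SessionMemory:
    def __init__(self, max_turns=5):
        # deque with maxlen silently drops the oldest turn once full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def as_context(self):
        # Render recent turns in a form that can be prepended to the prompt.
        return "\n".join("Q: " + q + "\nA: " + a for q, a in self.turns)
```

&lt;p&gt;A vector-store memory would scale further, but a fixed-size buffer like this keeps the prompt within the model's 8K-token context window without extra infrastructure.&lt;/p&gt;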

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article explores how to build a lightweight Retrieval-Augmented Generation (RAG) assistant using a 7B parameter open-source language model (Zephyr-7B) and real-time web search via DuckDuckGo.&lt;/p&gt;

&lt;p&gt;The solution runs locally, requires no external APIs, and leverages Hugging Face's &lt;code&gt;transformers&lt;/code&gt; library to deliver intelligent, contextual responses to user queries.&lt;/p&gt;

&lt;p&gt;Zephyr-7B was chosen for its balance of performance and portability. It is instruction-tuned, easy to run on consumer hardware, and excels in structured question-answering tasks. When paired with live search results, it creates a powerful, self-contained research assistant.&lt;/p&gt;

&lt;p&gt;This project is ideal for developers looking to experiment with local LLMs, build RAG prototypes, or create privacy-respecting AI tools without relying on paid cloud APIs.&lt;/p&gt;

&lt;p&gt;The full implementation, code walkthrough, and architecture are detailed below.&lt;/p&gt;

&lt;p&gt;The proof-of-concept code is available here: &lt;a href="https://github.com/alexander-uspenskiy/rag_web" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Coding!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>architecture</category>
    </item>
    <item>
      <title>RAG with free LLM Model</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Thu, 13 Feb 2025 22:05:57 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/rag-with-free-llm-model-44n4</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/rag-with-free-llm-model-44n4</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/alexander_uspenskiy_the_great" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1636857%2Fe769a564-5262-45d0-918e-c4c093972c9d.jpg" alt="alexander_uspenskiy_the_great"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/alexander_uspenskiy_the_great/how-to-create-your-own-rag-with-free-llm-models-and-a-knowledge-base-2odm" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to Create Your Own RAG with Free LLM Models and a Knowledge Base&lt;/h2&gt;
      &lt;h3&gt;Alexander Uspenskiy ・ Dec 16 '24&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#rag&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#vectordatabase&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>python</category>
      <category>ai</category>
      <category>rag</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>DeepSeek on your laptop!</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Thu, 13 Feb 2025 22:05:23 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/deepseek-on-your-laptop-347</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/deepseek-on-your-laptop-347</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/alexander_uspenskiy_the_great" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1636857%2Fe769a564-5262-45d0-918e-c4c093972c9d.jpg" alt="alexander_uspenskiy_the_great"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/alexander_uspenskiy_the_great/unlock-deepseek-r1-7b-on-your-laptop-experience-the-smartest-ai-model-i-ever-tested-1n49" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Unlock DeepSeek R1 7B on Your Laptop—Experience the Smartest AI Model I Ever Tested!&lt;/h2&gt;
      &lt;h3&gt;Alexander Uspenskiy ・ Jan 31 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deepseek&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#code&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>python</category>
      <category>code</category>
    </item>
    <item>
      <title>Any alternative to DeepSeek?</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Thu, 13 Feb 2025 22:04:36 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/any-alternative-to-deepseek-8kh</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/any-alternative-to-deepseek-8kh</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/alexander_uspenskiy_the_great" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1636857%2Fe769a564-5262-45d0-918e-c4c093972c9d.jpg" alt="alexander_uspenskiy_the_great"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/alexander_uspenskiy_the_great/mistrals-small-24b-parameter-model-blows-minds-no-data-sent-to-china-just-pure-ai-power-2p30" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Mistral’s ‘Small’ 24B Parameter Model Blows Minds—No Data Sent to China, Just Pure AI Power!&lt;/h2&gt;
      &lt;h3&gt;Alexander Uspenskiy ・ Feb 4 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deepseek&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#code&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>code</category>
      <category>python</category>
    </item>
    <item>
      <title>Agentic Ai, is it our future?</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Thu, 13 Feb 2025 22:03:41 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/agentic-ai-is-it-our-future-2bo0</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/agentic-ai-is-it-our-future-2bo0</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/alexander_uspenskiy_the_great" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1636857%2Fe769a564-5262-45d0-918e-c4c093972c9d.jpg" alt="alexander_uspenskiy_the_great"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/alexander_uspenskiy_the_great/agentic-ai-revolutionizing-next-generation-software-development-teams-509p" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Agentic AI: Revolutionizing Next-Generation Software Development Teams&lt;/h2&gt;
      &lt;h3&gt;Alexander Uspenskiy ・ Feb 13 '25&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#sdl&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>webdev</category>
      <category>ai</category>
      <category>sdl</category>
      <category>programming</category>
    </item>
    <item>
      <title>Agentic AI: Revolutionizing Next-Generation Software Development Teams</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Thu, 13 Feb 2025 19:49:50 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/agentic-ai-revolutionizing-next-generation-software-development-teams-509p</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/agentic-ai-revolutionizing-next-generation-software-development-teams-509p</guid>
      <description>&lt;p&gt;As we move forward at light speed toward the implementation of Agentic AI and Artificial General Intelligence (AGI), I think it’s time to consider a next-generation Software Development Life Cycle in terms of team structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, what is Agentic AI?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Agentic AI refers to AI systems that operate with a level of autonomy, decision-making, and adaptability similar to human agents. These AI models can independently plan, execute tasks, and adjust their behavior based on goals, feedback, and environmental changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since there is no real Agentic AI on the market at the moment, we need to consider how to incorporate these systems into current or future IT teams.&lt;/p&gt;

&lt;p&gt;I would assume that, in the near future, there will be several proposals for AI-based developers—either universal or language/technology-specific. As a leader, you will need to compose a hybrid team of both human IT specialists and Agentic AI virtual specialists.&lt;/p&gt;

&lt;p&gt;Another assumption is that there will be more than one proposal on the market, meaning we’ll likely see different AI agents (on-premises as LLM/RAG, online, in private/public clouds, etc.). One of the most important questions here is security. I would classify the key parameters as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Cost&lt;/li&gt;
&lt;li&gt;Performance&lt;/li&gt;
&lt;li&gt;Redundancy&lt;/li&gt;
&lt;li&gt;Special Requirements (context window, fine-tuning, integrations, etc.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’re working in government or finance, you should start thinking about on-premises/private cloud resources in advance. For any other commercial use of agents, you can begin evaluating options for the level of security your business model can tolerate.&lt;/p&gt;

&lt;p&gt;I believe that hybrid teams are the near future. For example, in a development team, you might have a set of Agentic AI developers and human dev/engineering leads with specialized AI-related skills. It’s likely that the role of a regular programmer or coder will diminish quickly.&lt;/p&gt;

&lt;p&gt;One of the interesting new fields is Agentic AI synchronization and communication interfaces. Such interfaces should be secure, fast, and agnostic to different model providers.&lt;/p&gt;

&lt;p&gt;I also believe that a “next-gen Jira” could be integrated with these interfaces, allowing a dev lead to assign tasks directly to AI models or let the models select tasks themselves, much like in a classic Agile environment.&lt;/p&gt;
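
&lt;p&gt;As a rough illustration of what such a provider-agnostic task-assignment interface might look like, here is a minimal sketch in Python. All names here (&lt;code&gt;Task&lt;/code&gt;, &lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;EchoAgent&lt;/code&gt;) are hypothetical and for illustration only, not a real API:&lt;/p&gt;

```python
# Hypothetical sketch of a provider-agnostic agent interface; not a real API.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    """A unit of work a dev lead could assign to a human or an AI agent."""
    title: str
    description: str
    status: str = "todo"


class Agent(Protocol):
    """Minimal contract any agent (human proxy or AI model) would implement."""
    def accept(self, task: Task) -> bool: ...
    def execute(self, task: Task) -> str: ...


class EchoAgent:
    """Toy agent to exercise the interface; a real one would call an LLM."""
    def accept(self, task: Task) -> bool:
        # Only pick up tasks that are still open, as in a classic Agile board.
        return task.status == "todo"

    def execute(self, task: Task) -> str:
        task.status = "done"
        return f"completed: {task.title}"


task = Task(title="write unit tests", description="cover the auth module")
agent = EchoAgent()
if agent.accept(task):
    print(agent.execute(task))  # completed: write unit tests
```

&lt;p&gt;Because the contract is a structural protocol rather than a vendor SDK, a "next-gen Jira" could hand the same &lt;code&gt;Task&lt;/code&gt; to any model provider that satisfies it.&lt;/p&gt;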

&lt;p&gt;Finally, I see that the positions of Scrum Masters, Performance Engineers, and QA specialists could also be handled by Agentic AI. My hope is that software quality will rise as new approaches to automated testing are implemented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I believe we are, once again, close to the transformation point, and it is wise to start planning the organizational transformation in advance and to understand how to incorporate Agentic AI (or AGI) into existing (or new) IT teams as frictionlessly as possible.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>sdl</category>
      <category>programming</category>
    </item>
    <item>
      <title>Mistral’s ‘Small’ 24B Parameter Model Blows Minds—No Data Sent to China, Just Pure AI Power!</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Tue, 04 Feb 2025 00:56:17 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/mistrals-small-24b-parameter-model-blows-minds-no-data-sent-to-china-just-pure-ai-power-2p30</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/mistrals-small-24b-parameter-model-blows-minds-no-data-sent-to-china-just-pure-ai-power-2p30</guid>
      <description>&lt;p&gt;I've tested the latest release from Mistral: Mistral-Small-24B-Instruct. It is bigger and slower than &lt;a href="https://dev.to/alexander_uspenskiy_the_great/unlock-deepseek-r1-7b-on-your-laptop-experience-the-smartest-ai-model-i-ever-tested-1n49"&gt;deepseek-ai/deepseek-r1-distill-qwen-7b&lt;/a&gt;, but it also shows how it is thinking, and it doesn't send your sensitive data to Chinese soil :)&lt;/p&gt;

&lt;p&gt;So let's start. &lt;/p&gt;

&lt;p&gt;This project provides an interactive chat interface for the mistralai/Mistral-Small-24B-Instruct-2501 model using PyTorch and the Hugging Face Transformers library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;PyTorch&lt;/li&gt;
&lt;li&gt;Transformers&lt;/li&gt;
&lt;li&gt;An Apple Silicon device (optional, for MPS support)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/alexander-uspenskiy/mistral.git
&lt;span class="nb"&gt;cd &lt;/span&gt;mistral
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create and activate a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows use `venv\Scripts\activate`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install the required packages:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;torch transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your Hugging Face Hub token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HUGGINGFACE_HUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_token_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run the chat interface:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python mistral.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactive chat interface with the Mistral-Small-24B-Instruct-2501 model.&lt;/li&gt;
&lt;li&gt;Progress indicator while generating responses.&lt;/li&gt;
&lt;li&gt;Supports Apple Silicon GPU (MPS) for faster inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;

&lt;span class="c1"&gt;# Check if MPS (Apple Silicon GPU) is available
&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the Mistral-Small-24B-Base-2501 model
&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-Small-24B-Instruct-2501&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HUGGINGFACE_HUB_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Optimized for M1 GPU
&lt;/span&gt;    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_progress&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;stop_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|/-&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r&lt;/span&gt;&lt;span class="s"&gt;Generating response &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Interactive terminal loop
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistralai/Mistral-Small-24B-Instruct-2501 Chat Interface (type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to quit)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;interaction_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# Initialize counter
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;stop_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;progress_thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;show_progress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;progress_thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;pad_token_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token_id&lt;/span&gt;  &lt;span class="c1"&gt;# Added to avoid warning
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;stop_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;progress_thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Clear the progress line
&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mistral: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: The bat and the ball costs $1.10 in total, bat is expensive than a ball costs $1.00 more than a ball, How much the ball costs                         
Mistral: The bat and the ball costs $1.10 in total, bat is expensive than a ball costs $1.00 more than a ball, How much the ball costs?
Let's denote the cost of the ball as \( B \) and the cost of the bat as \( B + 1.00 \).

According to the problem, the total cost of the bat and the ball is $1.10. Therefore, we can write the equation:

\[ B + (B + 1.00) = 1.10 \]

Simplifying the equation:

\[ 2B + 1.00 = 1.10 \]

Subtract 1.00 from both sides:

\[ 2B = 0.10 \]

Divide both sides by 2:

\[ B = 0.05 \]

So, the ball costs $0.05.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary:
&lt;/h2&gt;

&lt;p&gt;As you can see, modern models can run locally and solve logical tasks with excellent performance.&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>code</category>
      <category>python</category>
    </item>
    <item>
      <title>Unlock DeepSeek R1 7B on Your Laptop—Experience the Smartest AI Model I Ever Tested!</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Fri, 31 Jan 2025 22:52:40 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/unlock-deepseek-r1-7b-on-your-laptop-experience-the-smartest-ai-model-i-ever-tested-1n49</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/unlock-deepseek-r1-7b-on-your-laptop-experience-the-smartest-ai-model-i-ever-tested-1n49</guid>
      <description>&lt;p&gt;To be honest, I didn’t have high expectations for the buzz surrounding DeepSeek R1. However, I decided to test the 7B DeepSeek model, deepseek-ai/deepseek-r1-distill-qwen-7b, and what I discovered truly amazed me.&lt;/p&gt;

&lt;p&gt;So let's start.&lt;/p&gt;

&lt;p&gt;This project uses transformer models to generate responses to user queries. It leverages the &lt;code&gt;transformers&lt;/code&gt; library from Hugging Face and &lt;code&gt;torch&lt;/code&gt; for efficient model handling and inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.7 or higher&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pip&lt;/code&gt; (Python package installer)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/alexander-uspenskiy/deepseek
&lt;span class="nb"&gt;cd &lt;/span&gt;deepseek
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create and activate a virtual environment:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install the required packages:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;transformers torch
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Run the main script:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python deepseek.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Follow the prompts to enter your questions. Type 'quit' to exit the interaction.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deepseek.py&lt;/code&gt;: Main script containing the model setup and response generation logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(venv) $ python deepseek.py

Enter your question (or 'quit' to exit): A bat and a ball costs 1 dollar and 10 cents in total. The bat costs 1 dollar more than a ball. How much does the ball costs?

Response: Question: A bat and a ball costs 1 dollar and 10 cents in total. The bat costs 1 dollar more than a ball. How much does the ball costs?
Answer: 5 cents.

But wait, that doesn't make sense. If the ball is 5 cents, then the bat is 1 dollar more, which would be $1.05, and together that's $1.10. So, the ball is 5 cents.

Wait, but that seems a bit tricky. Let me think again. Maybe I should set up some equations to solve this properly.

Let me denote the cost of the ball as \( x \) dollars. Then, according to the problem, the bat costs \( x + 1 \) dollars. The total cost of the bat and the ball is $1.10, so:

\( x + (x + 1) = 1.10 \)

Simplify the equation:

\( 2x + 1 = 1.10 \)

Subtract 1 from both sides:

\( 2x = 0.10 \)

Divide both sides by 2:

\( x = 0.05 \)

So, the ball costs 5 cents. That seems correct, but let me verify.

If the ball is 5 cents, the bat is $1.05. Adding them together: 5 + 105 = 110 cents, which is $1.10. Yes, that checks out.

Wait, but sometimes these problems can be tricky. Is there any other way to interpret the problem that could lead to a different answer? Maybe the phrasing could be ambiguous. Let me read it again:

"A bat and a ball costs 1 dollar and 10 cents in total. The bat costs 1 dollar more than a ball. How much does the ball costs?"

Hmm, no, the way it's phrased seems pretty straightforward. It says the total is $1.10, and the bat is $1 more than the ball. So, with the equations I set up, it leads to the ball being 5 cents.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the response shows the whole reasoning process, which is amazing for a model that can run on your laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Model ID from HuggingFace
&lt;/span&gt;    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-ai/deepseek-r1-distill-qwen-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize tokenizer
&lt;/span&gt;    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Load model with lower precision for memory efficiency
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Use fp16 for efficiency
&lt;/span&gt;        &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Automatically handle device placement
&lt;/span&gt;        &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Tokenize input
&lt;/span&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate response
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Pass attention_mask
&lt;/span&gt;            &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;pad_token_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pad_token_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Decode and return response
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Setup model and tokenizer
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Example QA interaction
&lt;/span&gt;        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Enter your question (or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to exit): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

            &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;If the model download or execution fails, first confirm that your internet connection is stable, then work through the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Ensure the virtual environment is activated:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reinstall the required packages:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; transformers torch
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check the Python interpreter being used:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;which python
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
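If the steps above don't resolve the problem, a quick sanity check is to confirm that the packages installed earlier actually resolve in the active interpreter (a minimal sketch; the package names match the pip install command above):

```python
import importlib.util

# Confirm the required packages are visible to the current interpreter
for pkg in ("torch", "transformers"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING - reinstall inside the venv'}")
```

If either package reports MISSING, you are most likely running a different Python than the one in your virtual environment, which is exactly what the `which python` check above surfaces.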

</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>python</category>
      <category>code</category>
    </item>
    <item>
      <title>Unlock AI-Powered Image Processing on Your Laptop with Stable Diffusion v1.5 – It’s Easier Than You Think!</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Wed, 29 Jan 2025 16:32:35 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/unlock-ai-powered-image-processing-on-your-laptop-with-stable-diffusion-v15-its-easier-than-you-2e0c</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/unlock-ai-powered-image-processing-on-your-laptop-with-stable-diffusion-v15-its-easier-than-you-2e0c</guid>
      <description>&lt;p&gt;This script leverages Stable Diffusion v1.5 from Hugging Face's Diffusers library to generate image variations based on a given text prompt. By using torch and PIL, it processes an input image, applies AI-driven transformations, and saves the results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can clone this repo to get the code &lt;a href="https://github.com/alexander-uspenskiy/image_variations" rel="noopener noreferrer"&gt;https://github.com/alexander-uspenskiy/image_variations&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Source code:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;diffusers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StableDiffusionImg2ImgPipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Load and preprocess the input image
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Resize and preserve aspect ratio
&lt;/span&gt;    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Resampling&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LANCZOS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create new image with padding to reach target size
&lt;/span&gt;    &lt;span class="n"&gt;new_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;new_image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_size&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_image&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_image_variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stable-diffusion-v1-5/stable-diffusion-v1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;strength&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;7.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Generate variations of an input image using a specified prompt

    Parameters:
    - input_image_path: Path or URL to the input image
    - prompt: Text prompt to guide the image generation
    - model_id: Hugging Face model ID
    - num_images: Number of variations to generate
    - strength: How much to transform the input image (0-1)
    - guidance_scale: How closely to follow the prompt
    - seed: Random seed for reproducibility

    Returns:
    - List of generated images
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Set random seed if provided
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Load the model
&lt;/span&gt;    &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StableDiffusionImg2ImgPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Load and preprocess the input image
&lt;/span&gt;    &lt;span class="n"&gt;init_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate images
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;init_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_images_per_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;strength&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;strength&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;guidance_scale&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_generated_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Save the generated images with sequential numbering
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images-out/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Example parameters
&lt;/span&gt;    &lt;span class="n"&gt;input_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images-in/Image_name.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# or URL
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Draw the image in modern art style, photorealistic and detailed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate variations
&lt;/span&gt;    &lt;span class="n"&gt;generated_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_image_variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;input_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;strength&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;  &lt;span class="c1"&gt;# Optional: for reproducibility
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save the results
&lt;/span&gt;    &lt;span class="nf"&gt;save_generated_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generated_images&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
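The padding step in load_image centers the resized thumbnail on a white canvas before pasting. The offset arithmetic can be checked in isolation (the helper name here is illustrative, not from the repo):

```python
def letterbox_offsets(src_size, target_size=(768, 768)):
    # Center the pasted image: half the leftover space on each axis,
    # matching the (target - size) // 2 expressions in load_image
    return ((target_size[0] - src_size[0]) // 2,
            (target_size[1] - src_size[1]) // 2)

# A 768x512 thumbnail gets 128px of padding above and below
print(letterbox_offsets((768, 512)))  # -> (0, 128)
```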



&lt;h2&gt;
  
  
  How It Works:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Load &amp;amp; preprocess the input image&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts both local file paths and URLs.&lt;/li&gt;
&lt;li&gt;Converts the image to RGB and resizes it to fit within 768×768 while preserving the aspect ratio.&lt;/li&gt;
&lt;li&gt;Adds white padding to reach the exact target size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Initialize Stable Diffusion v1.5&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads the model on CUDA if available, otherwise falls back to CPU.&lt;/li&gt;
&lt;li&gt;Uses StableDiffusionImg2ImgPipeline to process the input image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Generate AI-modified image variations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes a text prompt to guide the transformation.&lt;/li&gt;
&lt;li&gt;Parameters such as strength (0–1) and guidance_scale (higher = stricter prompt adherence) allow customization.&lt;/li&gt;
&lt;li&gt;Supports multiple output images per prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Save the results&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writes the generated images to the images-out directory with a sequential naming scheme (generated_0.png, generated_1.png, etc.).&lt;/li&gt;
&lt;/ul&gt;
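The sequential naming scheme used by save_generated_images can be sketched on its own (the helper name is illustrative, not from the repo):

```python
def output_paths(num_images, prefix="generated", out_dir="images-out"):
    # Mirrors save_generated_images: generated_0.png, generated_1.png, ...
    return [f"{out_dir}/{prefix}_{i}.png" for i in range(num_images)]

print(output_paths(2))  # -> ['images-out/generated_0.png', 'images-out/generated_1.png']
```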

&lt;h2&gt;
  
  
  Example Use Case
&lt;/h2&gt;

&lt;p&gt;You can transform an image of a person into a medieval king using a prompt like:&lt;br&gt;
&lt;code&gt;prompt = "Draw this person as a powerful king, photorealistic and detailed, in a medieval setting."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initial image:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lufutf67wf6a6lefxf9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lufutf67wf6a6lefxf9.jpg" alt=" " width="800" height="705"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkkercc6f08bsrd25pqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkkercc6f08bsrd25pqz.png" alt=" " width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros &amp;amp; Cons
&lt;/h2&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs locally (no cloud services required).&lt;/li&gt;
&lt;li&gt;Customizable parameters for fine-tuning the output.&lt;/li&gt;
&lt;li&gt;Reproducible results via an optional random seed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be slow on some hardware configurations.&lt;/li&gt;
&lt;li&gt;Quality limitations inherent to a relatively small model.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>diy</category>
      <category>python</category>
      <category>coding</category>
    </item>
    <item>
      <title>Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-500M Model</title>
      <dc:creator>Alexander Uspenskiy</dc:creator>
      <pubDate>Fri, 24 Jan 2025 02:36:19 +0000</pubDate>
      <link>https://dev.to/alexander_uspenskiy_the_great/unlock-the-magic-of-images-a-quick-and-easy-guide-to-using-the-cutting-edge-smolvlm-500m-model-366c</link>
      <guid>https://dev.to/alexander_uspenskiy_the_great/unlock-the-magic-of-images-a-quick-and-easy-guide-to-using-the-cutting-edge-smolvlm-500m-model-366c</guid>
      <description>&lt;p&gt;The model &lt;a href="https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct" rel="noopener noreferrer"&gt;SmolVLM-500M-Instruct&lt;/a&gt; is a state-of-the-art, compact model with 500 million parameters. Despite its relatively small size, its capabilities are remarkably impressive.&lt;/p&gt;

&lt;p&gt;Let's jump to the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForVision2Seq&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;warnings&lt;/span&gt;

&lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filterwarnings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Some kwargs in processor config are unused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_and_describe_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceTB/SmolVLM-500M-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForVision2Seq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceTB/SmolVLM-500M-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe the content of this &amp;lt;image&amp;gt; in detail, give only answers in a form of text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pixel_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pixel_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images/bender.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;upload_and_describe_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image Description:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Python script uses the Hugging Face Transformers library to generate a textual description of an image. It loads a pre-trained vision-language model and its processor, encodes the input image, generates descriptive text by sampling (temperature 0.7, up to 150 new tokens), decodes and strips the result, and prints it. Any exception raised along the way is caught and reported.&lt;/p&gt;
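&lt;p&gt;For comparison, here is a minimal, self-contained sketch of the same idea using the BLIP captioning model from the Transformers hub. The model name, image path, and generation length are illustrative choices, not the exact setup from this article's repository:&lt;/p&gt;

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image


def describe_image(image_path: str) -> str:
    """Generate a short caption for an image with a pre-trained VLM."""
    # Load the processor (image preprocessing) and the captioning model.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    # Encode the image into tensors the model expects.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Greedy decoding is enough for a short caption; raise max_new_tokens
    # (or enable sampling, as in the script above) for longer descriptions.
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


if __name__ == "__main__":
    print(describe_image("images/bender.jpg"))
```

&lt;p&gt;The first run downloads the model weights; subsequent runs use the local cache, which is what keeps this approach fast and lightweight compared to calling a full LLM.&lt;/p&gt;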

&lt;blockquote&gt;
&lt;p&gt;You can download it here: &lt;a href="https://github.com/alexander-uspenskiy/vlm" rel="noopener noreferrer"&gt;https://github.com/alexander-uspenskiy/vlm&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Based on this original, non-stock image (place it in the images directory of the project): &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lubhpk7sixmcz13590z.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lubhpk7sixmcz13590z.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Take a look at the description generated by the model (you can play with the prompt and parameters in the code to format the output better for any purpose): &lt;strong&gt;The robot is sitting on a couch. It has eyes and mouth. He is reading something. He is holding a book with his hands. He is looking at the book. In the background, there are books in a shelf. Behind the books, there is a wall and a door. At the bottom of the image, there is a chair. The chair is white. The chair has a cushion on it. In the background, the wall is brown. The floor is grey.  in the image, the robot is silver and cream color. The book is brown.  The book is open. The robot is holding the book with both hands. The robot is looking at the book. The robot is sitting on the couch.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The result looks excellent, and the model is both fast and resource-efficient compared to full-size LLMs.&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vlm</category>
      <category>python</category>
      <category>howto</category>
    </item>
  </channel>
</rss>
