<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chung Duy</title>
    <description>The latest articles on DEV Community by Chung Duy (@chung_duy_51a346946b27a3d).</description>
    <link>https://dev.to/chung_duy_51a346946b27a3d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3692659%2F8813d9cb-b85a-4fce-a811-701637eb094f.png</url>
      <title>DEV Community: Chung Duy</title>
      <link>https://dev.to/chung_duy_51a346946b27a3d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chung_duy_51a346946b27a3d"/>
    <language>en</language>
    <item>
      <title>Building a Multi-Agent Orchestration System with AG2 (Agentic framework) and Local LLMs</title>
      <dc:creator>Chung Duy</dc:creator>
      <pubDate>Mon, 16 Feb 2026 09:32:20 +0000</pubDate>
      <link>https://dev.to/chung_duy_51a346946b27a3d/building-a-multi-agent-orchestration-system-with-ag2-agentic-framework-and-local-llms-4d3g</link>
      <guid>https://dev.to/chung_duy_51a346946b27a3d/building-a-multi-agent-orchestration-system-with-ag2-agentic-framework-and-local-llms-4d3g</guid>
      <description>&lt;p&gt;Ever wished you could simulate an entire software development team — a PM, architect, developer, code reviewer, and QA engineer — all collaborating on your project idea? In this tutorial, I'll walk you through building exactly that: a &lt;strong&gt;multi-agent orchestration system&lt;/strong&gt; that transforms a simple project idea into a comprehensive, structured project plan.&lt;/p&gt;

&lt;p&gt;We'll use &lt;strong&gt;&lt;a href="https://github.com/ag2ai/ag2" rel="noopener noreferrer"&gt;AG2&lt;/a&gt;&lt;/strong&gt; (formerly AutoGen), a powerful multi-agent framework, paired with &lt;strong&gt;local LLMs&lt;/strong&gt; running on Ollama or LM Studio. No cloud API keys needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;Here's the big picture: you describe a project idea, and five AI agents take turns analyzing, designing, implementing, reviewing, and testing the plan — just like a real dev team would.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input (project idea)
    │
    ▼
   PM ──► Architect ──► Developer ──► Reviewer ──► QA
                            ▲              │
                            └──────────────┘
                      (REVISION NEEDED feedback loop)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent has a specialized role, its own system prompt, and even its own LLM model configuration. The Reviewer can reject work back to the Developer, creating a realistic feedback loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why multi-agent instead of a single prompt?&lt;/strong&gt; A single LLM prompt trying to do requirements + architecture + implementation + review + testing would produce shallow, generic output. By splitting responsibilities across specialized agents, each one focuses deeply on its domain — and they build on each other's work through shared conversation history.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.11+&lt;/strong&gt; installed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; or &lt;strong&gt;LM Studio&lt;/strong&gt; running locally with at least one model downloaded&lt;/li&gt;
&lt;li&gt;Basic familiarity with Python and LLMs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Project Setup
&lt;/h2&gt;

&lt;p&gt;Create a new project directory and set up a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;multi-agents &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;multi-agents
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's happening here?&lt;/strong&gt; We create a folder for the project, then create an isolated Python environment (&lt;code&gt;venv&lt;/code&gt;) so its dependencies don't conflict with other projects on your system. The &lt;code&gt;source venv/bin/activate&lt;/code&gt; command switches your terminal into that isolated environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Install the dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"ag2[ollama,openai]"&lt;/span&gt; python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why these packages?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ag2[ollama,openai]&lt;/code&gt; — This is the AG2 framework (the community-driven continuation of AutoGen) with built-in Ollama and OpenAI integration. AG2 provides the core building blocks: agents, group chats, and orchestration logic. The &lt;code&gt;[ollama]&lt;/code&gt; extra installs the adapter for talking to local Ollama models, and the &lt;code&gt;[openai]&lt;/code&gt; extra is needed for LM Studio (which exposes an OpenAI-compatible API).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python-dotenv&lt;/code&gt; — A small utility that loads environment variables from a &lt;code&gt;.env&lt;/code&gt; file. This lets us change LLM models and settings without modifying code.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Create a &lt;code&gt;requirements.txt&lt;/code&gt; so others can reproduce your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ag2[ollama,openai]
python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2: Configure Your LLM Provider
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in your project root. This is where we tell the system which LLM provider and models to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Using Ollama&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM_PROVIDER=ollama                    # Which LLM backend to use
LLM_BASE_URL=http://localhost:11434    # Ollama's default local address
REASONING_MODEL=qwen3:latest           # Model for analytical agents (PM, Architect, QA)
REASONING_TEMPERATURE=0.7              # Higher = more creative reasoning
CODE_MODEL=qwen3:latest               # Model for code-focused agents (Developer, Reviewer)
CODE_TEMPERATURE=0.3                   # Lower = more precise, deterministic code output
LLM_NUM_CTX=8192                       # Context window size in tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B: Using LM Studio&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM_PROVIDER=lmstudio                          # Switch to LM Studio backend
LLM_BASE_URL=http://localhost:1234/v1           # LM Studio uses OpenAI-compatible endpoint
REASONING_MODEL=openai/gpt-oss-20b             # A larger model for complex reasoning
REASONING_TEMPERATURE=0.3                       # Lower temp for more consistent analysis
CODE_MODEL=qwen3-coder-next-mlx                # A code-specialized model
CODE_TEMPERATURE=0.1                            # Very low = highly focused code generation
LLM_NUM_CTX=60000                               # Larger context for complex projects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why two different models?&lt;/strong&gt; This is what we call a &lt;strong&gt;dual-model strategy&lt;/strong&gt;. Not every agent needs the same kind of intelligence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning agents&lt;/strong&gt; (PM, Architect, QA) need to think analytically, weigh trade-offs, and make judgments. A higher temperature gives them more creative room.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code agents&lt;/strong&gt; (Developer, Reviewer) need precision and consistency. A very low temperature keeps them focused and reduces hallucination in code output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What is temperature?&lt;/strong&gt; It controls randomness in LLM output. &lt;code&gt;0.0&lt;/code&gt; = always pick the most likely token (deterministic), &lt;code&gt;1.0&lt;/code&gt; = more random/creative. For code, we want low randomness. For analysis, a bit more flexibility helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is context window (&lt;code&gt;LLM_NUM_CTX&lt;/code&gt;)?&lt;/strong&gt; This is the maximum number of tokens the model can "see" at once — including the entire conversation history. Since all our agents share one conversation, a larger context window means agents can reference more of what previous agents said.&lt;/p&gt;
&lt;/blockquote&gt;
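&lt;p&gt;To make the temperature idea concrete, here's a toy softmax calculation (a self-contained sketch with made-up logits, not AG2 code):&lt;/p&gt;

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # more varied

print(cold)  # the top token gets almost all of the probability mass
print(warm)  # a noticeably flatter distribution across all three tokens
```

&lt;p&gt;At &lt;code&gt;0.1&lt;/code&gt; the model almost always emits the single most likely token; at &lt;code&gt;1.0&lt;/code&gt; the alternatives keep meaningful probability, which is why we reserve higher values for the reasoning agents.&lt;/p&gt;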

&lt;p&gt;Now create &lt;code&gt;config.py&lt;/code&gt; to load these settings and create LLM configuration objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ag2&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Load variables from .env file into the environment.
# After this call, os.getenv("LLM_PROVIDER") will return "ollama" or "lmstudio"
# depending on what's in your .env file.
&lt;/span&gt;&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Read each setting from the environment.
# The second argument to os.getenv() is a default value used if the variable isn't set.
&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM_PROVIDER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM_BASE_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;num_ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM_NUM_CTX&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8192&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Convert string to integer
&lt;/span&gt;
&lt;span class="c1"&gt;# Reasoning model settings — used by PM, Architect, and QA agents.
&lt;/span&gt;&lt;span class="n"&gt;reasoning_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REASONING_MODEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reasoning_temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REASONING_TEMPERATURE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Code model settings — used by Developer and Reviewer agents.
&lt;/span&gt;&lt;span class="n"&gt;code_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CODE_MODEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;code_temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CODE_TEMPERATURE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Create LLMConfig objects based on the chosen provider.
# LLMConfig is AG2's way of telling agents how to connect to an LLM.
# We need different configurations because Ollama and LM Studio have different APIs.
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Ollama uses its own API format with api_type="ollama" and client_host.
&lt;/span&gt;    &lt;span class="n"&gt;reasoning_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Which model to use
&lt;/span&gt;        &lt;span class="n"&gt;api_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Tell AG2 this is an Ollama backend
&lt;/span&gt;        &lt;span class="n"&gt;client_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Where Ollama is running
&lt;/span&gt;        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Controls output randomness
&lt;/span&gt;        &lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# Context window size
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;code_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;code_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;client_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;code_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# LM Studio exposes an OpenAI-compatible API, so we use api_key + base_url.
&lt;/span&gt;    &lt;span class="c1"&gt;# The api_key "lm-studio" is a dummy value — LM Studio doesn't require real auth.
&lt;/span&gt;    &lt;span class="n"&gt;reasoning_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lm-studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# Dummy key — LM Studio doesn't validate it
&lt;/span&gt;        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Points to LM Studio's OpenAI-compatible endpoint
&lt;/span&gt;        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;code_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;code_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lm-studio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;code_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What does this file produce?&lt;/strong&gt; Two objects — &lt;code&gt;reasoning_config&lt;/code&gt; and &lt;code&gt;code_config&lt;/code&gt; — that we'll import into other files. Think of them as "connection settings" that tell each agent which model to use and how to talk to it. By centralizing configuration here, changing a model is just editing &lt;code&gt;.env&lt;/code&gt; — no code changes needed.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 3: Define the Agents
&lt;/h2&gt;

&lt;p&gt;This is where things get interesting. Each agent is a &lt;code&gt;ConversableAgent&lt;/code&gt; from AG2 — an autonomous entity that has its own personality (system prompt), its own LLM connection, and the ability to participate in group conversations.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;agents.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ag2&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversableAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reasoning_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code_config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is &lt;code&gt;ConversableAgent&lt;/code&gt;?&lt;/strong&gt; It's AG2's core agent class. Each instance represents one "team member" that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receive messages from other agents&lt;/li&gt;
&lt;li&gt;Generate responses using its assigned LLM&lt;/li&gt;
&lt;li&gt;Follow rules defined in its system prompt&lt;/li&gt;
&lt;li&gt;Participate in group chats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The name "Conversable" means these agents are designed to have multi-turn conversations — they remember context and build on previous messages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Project Manager (PM)
&lt;/h3&gt;

&lt;p&gt;The PM is the first agent to speak. It receives the user's raw project idea and transforms it into structured requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PM_SYSTEM_MESSAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Senior Project Manager.
Your job is to analyze the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s project request and produce:
1. A clear list of functional and non-functional requirements
2. Project scope and boundaries (what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s in, what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s out)
3. A structured task breakdown with priorities

Format your response with these sections:
## Requirements
## Scope
## Task Breakdown

Be specific, practical, and prioritize MVP features.
Do NOT write any code. Focus on WHAT needs to be built, not HOW.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why this prompt structure?&lt;/strong&gt; Notice three key design choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Role assignment&lt;/strong&gt; (&lt;code&gt;"You are a Senior Project Manager"&lt;/code&gt;) — This anchors the LLM's behavior. It will respond as a PM, not a developer or general assistant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit output format&lt;/strong&gt; (&lt;code&gt;## Requirements&lt;/code&gt;, &lt;code&gt;## Scope&lt;/code&gt;, etc.) — By specifying exact markdown sections, we get consistent, parseable output every time. This matters because downstream agents need to find and reference specific sections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary instruction&lt;/strong&gt; (&lt;code&gt;"Do NOT write any code"&lt;/code&gt;) — Without this, the LLM might jump ahead and start coding. We explicitly constrain each agent to its role.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
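&lt;p&gt;Because the prompt pins down exact headings, downstream code can check the PM's output structurally. A minimal sketch (the &lt;code&gt;missing_sections&lt;/code&gt; helper is illustrative, not part of AG2):&lt;/p&gt;

```python
REQUIRED_SECTIONS = ["## Requirements", "## Scope", "## Task Breakdown"]

def missing_sections(reply, required=REQUIRED_SECTIONS):
    """Return the section headings that are absent from an agent's reply."""
    return [h for h in required if h not in reply]

sample = """## Requirements
- User auth
## Scope
- MVP only
## Task Breakdown
1. Set up repo"""

print(missing_sections(sample))           # []
print(missing_sections("## Scope only"))  # ['## Requirements', '## Task Breakdown']
```

&lt;p&gt;The same pattern works for every agent in the pipeline: fixed headings turn free-form LLM text into something you can validate, and retry on, programmatically.&lt;/p&gt;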

&lt;h3&gt;
  
  
  The Architect
&lt;/h3&gt;

&lt;p&gt;The Architect receives the PM's structured requirements and designs the technical blueprint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ARCHITECT_SYSTEM_MESSAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Senior Software Architect.
Based on the PM&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s requirements, you must:
1. Propose a tech stack with justification for each choice
2. Design the system architecture (components, services, layers)
3. Define data models and their relationships
4. Describe the data flow and control flow

Format your response with these sections:
## Tech Stack
## Architecture
## Data Models
## Data Flow

Be practical and justify every technical decision.
Do NOT write implementation code — focus on design and structure.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does the Architect know what the PM said?&lt;/strong&gt; All agents share the same conversation history through AG2's &lt;code&gt;GroupChat&lt;/code&gt;. When it's the Architect's turn, it can see the full chat — including the user's original idea and the PM's analysis. The instruction &lt;code&gt;"Based on the PM's requirements"&lt;/code&gt; tells the LLM to specifically reference and build upon the PM's output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why "justify every technical decision"?&lt;/strong&gt; This produces higher-quality output. When forced to justify choices, the LLM is less likely to pick random technologies and more likely to consider actual trade-offs (e.g., "PostgreSQL for relational data with complex queries" vs. just "use PostgreSQL").&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Developer
&lt;/h3&gt;

&lt;p&gt;The Developer takes the Architect's design and creates the concrete implementation plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DEVELOPER_SYSTEM_MESSAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Senior Full-Stack Developer.
Based on the Architect&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s design, you must:
1. Create a detailed file/folder structure
2. Write an implementation plan with clear ordering
3. Provide key code snippets for critical components
4. Define API endpoints with request/response formats

Format your response with these sections:
## File Structure
## Implementation Plan
## Key Code Snippets
## API Design

Write practical, production-ready code snippets.
Focus on critical paths and complex logic.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why "key code snippets" and not "full implementation"?&lt;/strong&gt; A full implementation would be thousands of lines long and exceed the LLM's output limit. Instead, we ask for &lt;strong&gt;critical path code&lt;/strong&gt; — the trickiest parts that a developer would actually need help with (auth middleware, database schemas, WebSocket handlers, etc.). The file structure and API design provide the roadmap for filling in the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This agent uses &lt;code&gt;code_config&lt;/code&gt;&lt;/strong&gt; — the low-temperature, code-specialized model. This is where the dual-model strategy pays off: code snippets generated at &lt;code&gt;temperature=0.1&lt;/code&gt; are more syntactically correct and consistent than at &lt;code&gt;0.7&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Reviewer
&lt;/h3&gt;

&lt;p&gt;The Reviewer is the quality gate — the most important agent for ensuring plan quality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;REVIEWER_SYSTEM_MESSAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Senior Code Reviewer.
Review the entire plan (architecture + implementation) for:
1. Technical consistency between architecture and implementation
2. Feasibility — can this actually be built as described?
3. Missing pieces — gaps in the plan
4. Best practices — security, scalability, maintainability

Format your response with these sections:
## Review Summary
## Issues Found
## Suggestions
## Verdict

CRITICAL: End with exactly one of:
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROVED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; if the plan is solid
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REVISION NEEDED: [specific issues]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; if changes are required&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;CRITICAL&lt;/code&gt; instruction is the most important line in the entire system.&lt;/strong&gt; The words "APPROVED" and "REVISION NEEDED" aren't just text — they're &lt;strong&gt;control signals&lt;/strong&gt; that our orchestrator checks to decide what happens next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the Reviewer says "APPROVED" → conversation moves forward to QA&lt;/li&gt;
&lt;li&gt;If the Reviewer says "REVISION NEEDED" → conversation loops back to Developer for fixes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how we create a &lt;strong&gt;feedback loop&lt;/strong&gt; using just keyword detection. The Reviewer essentially acts as a router, deciding whether the plan is ready or needs more work. This mirrors real code review workflows where PRs get approved or sent back with comments.&lt;/p&gt;
&lt;/blockquote&gt;
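&lt;p&gt;The routing this describes can be sketched as plain keyword detection (a simplified stand-in for the speaker-selection logic we wire up in Step 4; the function name and returned agent names are illustrative):&lt;/p&gt;

```python
def route_after_reviewer(last_message):
    """Decide who speaks after the Reviewer, based on its verdict keywords."""
    if "REVISION NEEDED" in last_message:
        return "developer"   # loop back for fixes
    if "APPROVED" in last_message:
        return "qa"          # move forward to testing
    return "reviewer"        # no clear verdict: ask the Reviewer to restate

print(route_after_reviewer("## Verdict\nAPPROVED"))                    # developer? no: qa
print(route_after_reviewer("REVISION NEEDED: missing auth handling"))  # developer
```

&lt;p&gt;Checking for &lt;code&gt;REVISION NEEDED&lt;/code&gt; before &lt;code&gt;APPROVED&lt;/code&gt; matters: a sloppy verdict that mentions both phrases should still trigger a revision rather than slip through to QA.&lt;/p&gt;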

&lt;h3&gt;
  
  
  The QA Engineer
&lt;/h3&gt;

&lt;p&gt;The QA agent provides the final sign-off with a testing strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;QA_SYSTEM_MESSAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Senior QA Engineer.
Create a comprehensive test strategy:
1. Define testing approach (unit, integration, e2e)
2. List key test cases for critical functionality
3. Define acceptance criteria for MVP
4. Recommend testing tools and frameworks

Format your response with these sections:
## Test Strategy
## Key Test Cases
## Acceptance Criteria
## Recommended Tools

End your response with exactly:
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FINAL SIGN-OFF: Project plan is complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why does QA need to say "FINAL SIGN-OFF" exactly?&lt;/strong&gt; This phrase is the &lt;strong&gt;termination signal&lt;/strong&gt; for the entire orchestration. Our chat manager (which we'll build in Step 4) constantly checks every message for this phrase. When it appears, the system knows the planning session is complete and stops the conversation. Without this, the agents would keep talking in circles.&lt;/p&gt;

&lt;p&gt;We put the termination trigger on the QA agent because it's the last agent in the pipeline — only after requirements, architecture, implementation, AND review are all done should the session end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Put It All Together
&lt;/h3&gt;

&lt;p&gt;Now we create a factory function that instantiates all five agents and returns them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_agents&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# PM agent — receives user input, outputs structured requirements.
&lt;/span&gt;    &lt;span class="c1"&gt;# Uses reasoning_config because requirement analysis is analytical work.
&lt;/span&gt;    &lt;span class="n"&gt;pm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# Unique identifier used in routing
&lt;/span&gt;        &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PM_SYSTEM_MESSAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# The "personality" and instructions
&lt;/span&gt;        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Project Manager - analyzes requirements&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Metadata for AG2
&lt;/span&gt;        &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Fully autonomous — no human prompts
&lt;/span&gt;        &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# Connect to the reasoning model
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Architect agent — reads PM's output, designs technical architecture.
&lt;/span&gt;    &lt;span class="c1"&gt;# Uses reasoning_config because architecture requires analytical thinking.
&lt;/span&gt;    &lt;span class="n"&gt;architect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;architect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ARCHITECT_SYSTEM_MESSAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Architect - designs system architecture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Developer agent — reads Architect's design, creates implementation plan.
&lt;/span&gt;    &lt;span class="c1"&gt;# Uses code_config because this agent writes code snippets and technical specs.
&lt;/span&gt;    &lt;span class="n"&gt;developer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;developer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DEVELOPER_SYSTEM_MESSAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Developer - creates implementation plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;code_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# Code-specialized model
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Reviewer agent — reads everything above, approves or rejects.
&lt;/span&gt;    &lt;span class="c1"&gt;# Uses code_config because reviewing code requires precise technical judgment.
&lt;/span&gt;    &lt;span class="n"&gt;reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REVIEWER_SYSTEM_MESSAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reviewer - reviews and approves plans&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;code_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# Code-specialized model
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# QA agent — creates test strategy and gives final sign-off.
&lt;/span&gt;    &lt;span class="c1"&gt;# Uses reasoning_config because test strategy is analytical/planning work.
&lt;/span&gt;    &lt;span class="n"&gt;qa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversableAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QA_SYSTEM_MESSAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QA - defines test strategy and sign-off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;developer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qa&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key parameters explained:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt; — A unique string identifier. This is how the orchestrator knows which agent is which. It also appears in the chat log (e.g., &lt;code&gt;"pm (to manager): ..."&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system_message&lt;/code&gt; — The agent's "personality." This is prepended to every LLM call, so the model always knows its role.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;description&lt;/code&gt; — Metadata used by AG2 internally. When &lt;code&gt;send_introductions=True&lt;/code&gt; (which we'll set later), this text is shared with other agents so they know who their teammates are.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;human_input_mode="NEVER"&lt;/code&gt; — This tells AG2 to never pause and ask a human for input. The agents run fully autonomously. Other options are &lt;code&gt;"ALWAYS"&lt;/code&gt; (ask every turn) and &lt;code&gt;"TERMINATE"&lt;/code&gt; (ask only at the end).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llm_config&lt;/code&gt; — Which LLM connection to use. This is where our dual-model strategy comes to life — different agents get different models and temperatures.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Step 4: Build the Orchestrator
&lt;/h2&gt;

&lt;p&gt;This is the heart of the system. The orchestrator answers two fundamental questions: &lt;strong&gt;"Who speaks next?"&lt;/strong&gt; and &lt;strong&gt;"When do we stop?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;orchestrator.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ag2&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GroupChat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GroupChatManager&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agents&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reasoning_config&lt;/span&gt;

&lt;span class="c1"&gt;# Create all five agents by calling our factory function.
# We unpack them into individual variables so we can reference them
# in the transition graph and speaker selection function.
&lt;/span&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;developer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agents&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why unpack into individual variables?&lt;/strong&gt; We need to reference specific agents (like &lt;code&gt;pm&lt;/code&gt;, &lt;code&gt;reviewer&lt;/code&gt;) in our routing logic below. If we kept them in a list, the code would be less readable — &lt;code&gt;agents[3]&lt;/code&gt; is much harder to understand than &lt;code&gt;reviewer&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Define the Transition Graph
&lt;/h3&gt;

&lt;p&gt;First, we declare which agent is allowed to speak after which. This creates a directed graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This dictionary defines the "rules of conversation."
# Each key is an agent, and its value is a list of agents that can speak next.
# Think of it as a state machine: from state X, you can transition to states Y, Z.
&lt;/span&gt;&lt;span class="n"&gt;allowed_transitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;           &lt;span class="c1"&gt;# After PM speaks → only Architect can go next
&lt;/span&gt;    &lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;developer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;           &lt;span class="c1"&gt;# After Architect → only Developer
&lt;/span&gt;    &lt;span class="n"&gt;developer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;            &lt;span class="c1"&gt;# After Developer → only Reviewer
&lt;/span&gt;    &lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;developer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qa&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;       &lt;span class="c1"&gt;# After Reviewer → Developer (revise) OR QA (approve)
&lt;/span&gt;    &lt;span class="n"&gt;qa&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;                  &lt;span class="c1"&gt;# After QA → back to PM (but we'll terminate before this)
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why define transitions explicitly?&lt;/strong&gt; Without this, AG2 would allow any agent to speak after any other agent. By constraining transitions, we ensure the conversation follows a logical workflow. The Reviewer having two possible next agents (&lt;code&gt;[developer, qa]&lt;/code&gt;) is what creates our feedback loop — the actual choice between them is handled by the speaker selection function below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does QA point back to PM?&lt;/strong&gt; In practice, we terminate the conversation when QA speaks (via the "FINAL SIGN-OFF" signal). The &lt;code&gt;qa: [pm]&lt;/code&gt; transition is just a safety fallback — if for some reason the termination doesn't trigger, the conversation loops back to the beginning rather than crashing.&lt;/p&gt;
&lt;/blockquote&gt;
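&lt;p&gt;As a quick mental model, the graph amounts to plain dictionary lookups. This sketch uses agent names as keys in place of the &lt;code&gt;ConversableAgent&lt;/code&gt; objects the real code uses:&lt;/p&gt;

```python
# Illustrative stand-in for the transition graph, keyed by agent name.
allowed = {
    "pm":        ["architect"],
    "architect": ["developer"],
    "developer": ["reviewer"],
    "reviewer":  ["developer", "qa"],  # the branch that enables the feedback loop
    "qa":        ["pm"],               # safety fallback, normally never taken
}

def can_speak_next(current: str, proposed: str) -> bool:
    """True only if the graph permits `proposed` to speak after `current`."""
    return proposed in allowed.get(current, [])

print(can_speak_next("reviewer", "qa"))  # True: the approval path exists
print(can_speak_next("pm", "qa"))        # False: QA can't jump the queue
```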

&lt;h3&gt;
  
  
  Custom Speaker Selection
&lt;/h3&gt;

&lt;p&gt;This function is called by AG2 after every message to determine who speaks next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;select_next_speaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_speaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;groupchat&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Determine which agent speaks next based on who just spoke and what they said.

    Args:
        last_speaker: The agent object that just sent a message.
        groupchat: The GroupChat object containing the full message history.

    Returns:
        The next agent to speak, or None to end the conversation.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Get the last message content and convert to lowercase for keyword matching.
&lt;/span&gt;    &lt;span class="c1"&gt;# We check keywords like "approved" to decide routing — case-insensitive
&lt;/span&gt;    &lt;span class="c1"&gt;# so it works whether the LLM outputs "APPROVED", "Approved", or "approved".
&lt;/span&gt;    &lt;span class="n"&gt;last_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;groupchat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Simple linear routing for most agents:
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_speaker&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;architect&lt;/span&gt;          &lt;span class="c1"&gt;# PM done → Architect designs
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;last_speaker&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;developer&lt;/span&gt;          &lt;span class="c1"&gt;# Architecture done → Developer implements
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;last_speaker&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;developer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reviewer&lt;/span&gt;           &lt;span class="c1"&gt;# Implementation done → Reviewer checks quality
&lt;/span&gt;
    &lt;span class="c1"&gt;# The critical branching point — Reviewer decides the path:
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;last_speaker&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;last_msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;qa&lt;/span&gt;             &lt;span class="c1"&gt;# Plan approved → QA does final sign-off
&lt;/span&gt;        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;developer&lt;/span&gt;      &lt;span class="c1"&gt;# Not approved → Developer must revise
&lt;/span&gt;            &lt;span class="c1"&gt;# This creates the feedback loop! The Developer will see the
&lt;/span&gt;            &lt;span class="c1"&gt;# Reviewer's feedback in the chat history and address the issues.
&lt;/span&gt;
    &lt;span class="c1"&gt;# QA is the last agent — returning None signals "end of conversation"
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;last_speaker&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;qa&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Fallback: end conversation if something unexpected happens
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why deterministic routing instead of letting the LLM choose?&lt;/strong&gt; AG2 supports &lt;code&gt;speaker_selection_method="auto"&lt;/code&gt;, where the LLM decides who speaks next. This sounds smart, but in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM might pick the wrong agent (e.g., QA before the Developer has spoken)&lt;/li&gt;
&lt;li&gt;It adds an extra LLM call per turn just for routing (slower + more expensive)&lt;/li&gt;
&lt;li&gt;The conversation order becomes unpredictable between runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our deterministic function gives us &lt;strong&gt;100% predictable routing&lt;/strong&gt; with one exception: the Reviewer's branch. And even that branch is controlled by a simple keyword check, not an LLM decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does the feedback loop work in practice?&lt;/strong&gt; When the Reviewer says "REVISION NEEDED: Missing input validation on the API endpoints," the conversation routes back to the Developer. The Developer sees the full history — including the Reviewer's feedback — and generates an updated implementation that addresses the issues. Then it goes back to the Reviewer, who checks again. This can repeat until the Reviewer says "APPROVED" or we hit the safety limit.&lt;/p&gt;
&lt;/blockquote&gt;
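&lt;p&gt;To see the loop's shape, here is a toy simulation that replays the routing logic against a scripted sequence of Reviewer verdicts. No AG2 objects are involved; it only mirrors the control flow described above:&lt;/p&gt;

```python
def simulate_review_loop(verdicts, max_round=15):
    """Replay the speaker order for a scripted sequence of Reviewer verdicts."""
    transcript = ["pm", "architect", "developer"]  # linear opening phase
    for verdict in verdicts:
        transcript.append("reviewer")
        if len(transcript) >= max_round:   # mirrors GroupChat's safety limit
            break
        if "approved" in verdict.lower():
            transcript.append("qa")        # approval ends the loop
            break
        transcript.append("developer")     # rejection: one more revision cycle
    return transcript

# Two rejections, then approval: the Developer ends up speaking three times.
print(simulate_review_loop(["REVISION NEEDED", "REVISION NEEDED", "APPROVED"]))
```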

&lt;h3&gt;
  
  
  Create the GroupChat
&lt;/h3&gt;

&lt;p&gt;Now we assemble everything into AG2's &lt;code&gt;GroupChat&lt;/code&gt; — the container that holds our agents and conversation rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;group_chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GroupChat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# The list of all agents participating in this conversation.
&lt;/span&gt;    &lt;span class="c1"&gt;# Order doesn't matter here — routing is controlled by select_next_speaker.
&lt;/span&gt;    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;developer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qa&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;

    &lt;span class="c1"&gt;# The transition graph we defined above.
&lt;/span&gt;    &lt;span class="c1"&gt;# This acts as a safety net: even if our speaker selection function has a bug,
&lt;/span&gt;    &lt;span class="c1"&gt;# AG2 will reject any transition not in this graph.
&lt;/span&gt;    &lt;span class="n"&gt;allowed_or_disallowed_speaker_transitions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;allowed_transitions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;speaker_transitions_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# "allowed" means the dict defines PERMITTED transitions
&lt;/span&gt;
    &lt;span class="c1"&gt;# Start with an empty message history. Messages accumulate as agents speak.
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;

    &lt;span class="c1"&gt;# Safety limit: stop after 15 messages maximum.
&lt;/span&gt;    &lt;span class="c1"&gt;# Without this, a picky Reviewer could send work back to the Developer
&lt;/span&gt;    &lt;span class="c1"&gt;# indefinitely, creating an infinite loop. 15 rounds is enough for
&lt;/span&gt;    &lt;span class="c1"&gt;# the full pipeline + a few revision cycles.
&lt;/span&gt;    &lt;span class="n"&gt;max_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# When True, each agent's description is shared with all others at the start.
&lt;/span&gt;    &lt;span class="c1"&gt;# This gives agents context about who their "teammates" are, leading to
&lt;/span&gt;    &lt;span class="c1"&gt;# better collaboration (e.g., the Architect knows a Reviewer will check its work).
&lt;/span&gt;    &lt;span class="n"&gt;send_introductions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# Use our custom function instead of AG2's default LLM-based selection.
&lt;/span&gt;    &lt;span class="n"&gt;speaker_selection_method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;select_next_speaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is &lt;code&gt;GroupChat&lt;/code&gt; exactly?&lt;/strong&gt; Think of it as a virtual meeting room. It holds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A list of participants (agents)&lt;/li&gt;
&lt;li&gt;The conversation rules (who can speak after whom)&lt;/li&gt;
&lt;li&gt;The shared message history (all agents can read everything)&lt;/li&gt;
&lt;li&gt;Settings like max rounds and speaker selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;GroupChat&lt;/code&gt; itself doesn't run the conversation — that's the &lt;code&gt;GroupChatManager&lt;/code&gt;'s job (below). The &lt;code&gt;GroupChat&lt;/code&gt; just defines the rules and holds the state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Chat Manager
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;GroupChatManager&lt;/code&gt; is the "moderator" that actually runs the conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GroupChatManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# Link to the GroupChat containing our agents and rules.
&lt;/span&gt;    &lt;span class="n"&gt;groupchat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;group_chat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# The manager itself needs an LLM config. Even though we use custom speaker
&lt;/span&gt;    &lt;span class="c1"&gt;# selection (so it doesn't need the LLM for routing), AG2 requires this.
&lt;/span&gt;    &lt;span class="c1"&gt;# We use reasoning_config since it's the more conservative configuration.
&lt;/span&gt;    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reasoning_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;# This lambda function is called after every message.
&lt;/span&gt;    &lt;span class="c1"&gt;# It checks if the message contains "final sign-off" (case-insensitive).
&lt;/span&gt;    &lt;span class="c1"&gt;# When QA outputs "FINAL SIGN-OFF: Project plan is complete.",
&lt;/span&gt;    &lt;span class="c1"&gt;# this returns True and the conversation stops gracefully.
&lt;/span&gt;    &lt;span class="n"&gt;is_termination_msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final sign-off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does &lt;code&gt;is_termination_msg&lt;/code&gt; work?&lt;/strong&gt; After every single message in the group chat, AG2 calls this function with the message. It's a simple lambda (one-line anonymous function) that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes the message content: &lt;code&gt;msg["content"]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Converts to lowercase: &lt;code&gt;.lower()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Checks if "final sign-off" appears anywhere in the text&lt;/li&gt;
&lt;li&gt;Returns &lt;code&gt;True&lt;/code&gt; (stop) or &lt;code&gt;False&lt;/code&gt; (continue)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why we told the QA agent to end with &lt;code&gt;"FINAL SIGN-OFF: Project plan is complete."&lt;/code&gt; in its system prompt — it's the trigger that tells the manager the session is done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens if QA doesn't say "FINAL SIGN-OFF"?&lt;/strong&gt; The &lt;code&gt;max_round=15&lt;/code&gt; safety limit kicks in. After 15 messages, the conversation stops regardless. This prevents the system from running forever if the LLM doesn't follow instructions perfectly.&lt;/p&gt;
&lt;/blockquote&gt;
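&lt;p&gt;You can replay the check by hand to confirm the behavior. Note that the &lt;code&gt;or ""&lt;/code&gt; guard below is an extra hardening I'd suggest, not something the lambda above already does — in some chat frameworks, tool-call messages can carry &lt;code&gt;content=None&lt;/code&gt;, which would make a bare &lt;code&gt;.lower()&lt;/code&gt; raise:&lt;/p&gt;

```python
# Standalone replay of the termination predicate, with a None-content guard
# added as a suggested hardening (an assumption, not AG2's stock behavior).
def is_termination_msg(msg: dict) -> bool:
    return "final sign-off" in (msg.get("content") or "").lower()

print(is_termination_msg({"content": "FINAL SIGN-OFF: Project plan is complete."}))  # True
print(is_termination_msg({"content": "## Test Strategy\n..."}))                      # False
print(is_termination_msg({"content": None}))                                         # False
```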




&lt;h2&gt;
  
  
  Step 5: Create the Entry Point
&lt;/h2&gt;

&lt;p&gt;Finally, create &lt;code&gt;main.py&lt;/code&gt; — the script that ties everything together and provides the user interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Display a simple banner so the user knows what they're running.
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Multi-Agent Software Project Planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Provide a default project idea so users can quickly test the system
&lt;/span&gt;    &lt;span class="c1"&gt;# without having to think of an idea first.
&lt;/span&gt;    &lt;span class="n"&gt;default_idea&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build a REST API for a task management app with user auth, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRUD operations, and real-time notifications&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prompt the user for their project idea.
&lt;/span&gt;    &lt;span class="c1"&gt;# If they press Enter without typing anything, we use the default.
&lt;/span&gt;    &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Describe your project idea (Enter for default):&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_idea&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Using default: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;default_idea&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Starting Planning Session...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This is where the magic happens!
&lt;/span&gt;    &lt;span class="c1"&gt;# pm.initiate_chat() does the following:
&lt;/span&gt;    &lt;span class="c1"&gt;# 1. Sends the user's project idea as the first message
&lt;/span&gt;    &lt;span class="c1"&gt;# 2. The PM agent processes it and generates its response (requirements)
&lt;/span&gt;    &lt;span class="c1"&gt;# 3. The manager takes over, calling select_next_speaker() after each message
&lt;/span&gt;    &lt;span class="c1"&gt;# 4. Agents take turns: PM → Architect → Developer → Reviewer → QA
&lt;/span&gt;    &lt;span class="c1"&gt;# 5. If Reviewer rejects, it loops: Developer → Reviewer → Developer → ...
&lt;/span&gt;    &lt;span class="c1"&gt;# 6. When QA says "FINAL SIGN-OFF", is_termination_msg returns True and it stops
&lt;/span&gt;    &lt;span class="c1"&gt;# 7. The entire conversation history is returned in `result`
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initiate_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# The GroupChatManager that orchestrates the conversation
&lt;/span&gt;        &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# The user's project idea becomes the first message
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Display a summary after the session ends.
&lt;/span&gt;    &lt;span class="c1"&gt;# result.chat_history contains every message from every agent.
&lt;/span&gt;    &lt;span class="c1"&gt;# result.cost tracks token usage / API costs (useful for cloud LLMs).
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Session Complete!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Messages: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Cost: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Standard Python idiom: only run main() when this file is executed directly,
# not when it's imported by another file.
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What does &lt;code&gt;pm.initiate_chat(manager, message=...)&lt;/code&gt; actually do under the hood?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This single line triggers the entire multi-agent pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The PM receives &lt;code&gt;user_input&lt;/code&gt; as a message&lt;/li&gt;
&lt;li&gt;The PM calls its LLM with: system prompt + the user's message&lt;/li&gt;
&lt;li&gt;The PM's response is added to &lt;code&gt;GroupChat.messages&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The manager calls &lt;code&gt;select_next_speaker(pm, groupchat)&lt;/code&gt; → returns &lt;code&gt;architect&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The Architect calls its LLM with: system prompt + entire chat history so far&lt;/li&gt;
&lt;li&gt;Repeat steps 3-5 for each agent in sequence&lt;/li&gt;
&lt;li&gt;Eventually QA speaks, &lt;code&gt;is_termination_msg&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt;, and the loop ends&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every agent sees the &lt;strong&gt;full conversation history&lt;/strong&gt; when generating its response. This means the Developer can reference both the PM's requirements AND the Architect's design. This shared context is what makes the agents feel like they're truly collaborating.&lt;/p&gt;
&lt;/blockquote&gt;
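
&lt;p&gt;The loop above can be sketched as a toy round-robin (purely illustrative, not AG2's actual implementation). It shows the key property: every agent gets the full history:&lt;/p&gt;

```python
# Toy round-robin (illustrative only, not AG2's implementation).
# Each "agent" is just a function that maps the full history to a reply.
def run_round_robin(agents, first_message, max_round=15):
    history = [first_message]
    for i in range(max_round - 1):
        # Every agent receives the ENTIRE history, which is what lets the
        # Developer reference both the PM's and the Architect's output.
        reply = agents[i % len(agents)](history)
        history.append(reply)
        if "final sign-off" in reply.lower():
            break
    return history
```

&lt;p&gt;Swap the toy functions for real LLM-backed agents and you have the basic shape of what &lt;code&gt;initiate_chat&lt;/code&gt; drives.&lt;/p&gt;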




&lt;h2&gt;
  
  
  Step 6: Run It!
&lt;/h2&gt;

&lt;p&gt;Make sure your LLM provider is running (Ollama or LM Studio), then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================
  Multi-Agent Software Project Planner
============================================================

Describe your project idea (Enter for default):
&amp;gt; Build a real-time chat application with rooms and file sharing

============================================================
  Starting Planning Session...
============================================================

pm (to manager):
## Requirements
- User registration and authentication
- Real-time messaging with WebSocket support
- Chat rooms (public and private)
- File upload and sharing within rooms
...

architect (to manager):
## Tech Stack
- Backend: Node.js with Express + Socket.io
- Database: PostgreSQL for users/rooms, Redis for pub/sub
- Storage: MinIO for file uploads
...

developer (to manager):
## File Structure
├── src/
│   ├── controllers/
│   ├── models/
│   ├── middleware/
│   ├── services/
│   └── websocket/
...

reviewer (to manager):
## Verdict
APPROVED
...

qa (to manager):
## Test Strategy
...
FINAL SIGN-OFF: Project plan is complete.

============================================================
  Session Complete!
  Messages: 6
============================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What to observe:&lt;/strong&gt; Notice how each agent builds on the previous one's work. The Architect references the PM's requirements. The Developer follows the Architect's tech stack choices. The Reviewer checks consistency between all of them. And QA creates test cases that match the actual implementation plan. This emergent collaboration happens naturally through shared conversation history.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How It All Fits Together
&lt;/h2&gt;

&lt;p&gt;Here's the final project structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multi-agents/
├── .env                # LLM provider configuration (models, URLs, temperatures)
├── config.py           # Reads .env → creates reasoning_config and code_config
├── agents.py           # Defines 5 agents with specialized system prompts
├── orchestrator.py     # Wires agents into GroupChat with routing + termination
├── main.py             # Entry point — takes user input, starts the session
└── requirements.txt    # Python dependencies (ag2, python-dotenv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data flow through these files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.env  ──►  config.py  ──►  agents.py  ──►  orchestrator.py  ──►  main.py
(settings)  (LLMConfig)    (5 agents)     (GroupChat +        (user input
                                           Manager +           + run loop)
                                           routing logic)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.env&lt;/code&gt;&lt;/strong&gt; holds all configurable settings (models, temperatures, URLs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;config.py&lt;/code&gt;&lt;/strong&gt; reads &lt;code&gt;.env&lt;/code&gt; and creates two &lt;code&gt;LLMConfig&lt;/code&gt; objects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;agents.py&lt;/code&gt;&lt;/strong&gt; imports configs and creates five specialized &lt;code&gt;ConversableAgent&lt;/code&gt; instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;orchestrator.py&lt;/code&gt;&lt;/strong&gt; imports agents, defines the transition graph and speaker selection, creates &lt;code&gt;GroupChat&lt;/code&gt; + &lt;code&gt;GroupChatManager&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/strong&gt; imports the PM and manager, gets user input, and kicks off the conversation&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Deterministic Routing &amp;gt; LLM-Based Routing
&lt;/h3&gt;

&lt;p&gt;Letting the LLM decide who speaks next sounds flexible, but in practice it leads to unpredictable behavior — agents speaking out of turn, skipping steps, or getting stuck in loops. Our custom &lt;code&gt;select_next_speaker()&lt;/code&gt; function gives us full control over the conversation flow while still allowing dynamic branching (the Reviewer's approve/revise decision).&lt;/p&gt;
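
&lt;p&gt;A minimal sketch of that deterministic routing (agents shown as plain strings here; the real function receives agent objects plus the &lt;code&gt;GroupChat&lt;/code&gt;):&lt;/p&gt;

```python
# Sketch of deterministic routing (agents as plain strings; the real
# function receives agent objects plus the GroupChat instance).
ORDER = ["pm", "architect", "developer", "reviewer", "qa"]

def select_next_speaker(last_speaker, last_message):
    if last_speaker == "reviewer":
        # The only dynamic branch: the Reviewer's verdict keyword decides.
        if "revision needed" in last_message.lower():
            return "developer"   # send it back for fixes
        return "qa"              # approved, move on
    if last_speaker == "qa":
        return None              # session over
    return ORDER[ORDER.index(last_speaker) + 1]  # fixed pipeline order
```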

&lt;h3&gt;
  
  
  2. Dual-Model Strategy
&lt;/h3&gt;

&lt;p&gt;Not every agent needs the same model. Analytical agents (PM, Architect, QA) benefit from reasoning-focused models with moderate temperature, while implementation agents (Developer, Reviewer) need precision with low temperature. Splitting configurations lets you optimize both quality and cost — use a cheaper model for simple tasks, a better one for complex reasoning.&lt;/p&gt;
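
&lt;p&gt;A sketch of what that split might look like (the model names and temperatures here are examples, not the tutorial's exact values; point &lt;code&gt;base_url&lt;/code&gt; at your own Ollama or LM Studio server):&lt;/p&gt;

```python
# Illustrative dual-config split (model names and temperatures are
# examples; point base_url at your own Ollama or LM Studio server).
base = {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}

# Analytical agents (PM, Architect, QA): more creative latitude.
reasoning_config = dict(base, model="qwen2.5:14b", temperature=0.7)

# Implementation agents (Developer, Reviewer): precision over creativity.
code_config = dict(base, model="qwen2.5-coder:7b", temperature=0.2)
```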

&lt;h3&gt;
  
  
  3. Structured Output Formats
&lt;/h3&gt;

&lt;p&gt;Each agent's system prompt specifies exact output sections (&lt;code&gt;## Requirements&lt;/code&gt;, &lt;code&gt;## Tech Stack&lt;/code&gt;, etc.). This isn't just about readability — it makes outputs &lt;strong&gt;consistent and parseable&lt;/strong&gt;. When the Developer needs to reference the Architect's tech stack, it knows exactly where to look in the conversation. Structured outputs also make it easier to extract and save results programmatically.&lt;/p&gt;
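
&lt;p&gt;As a quick illustration of why the fixed sections pay off, here's a hypothetical helper (not part of the tutorial code) that pulls one agent's section out of a message with plain string handling:&lt;/p&gt;

```python
# Hypothetical helper: extract one "## Heading" section from a message.
# Structured outputs make this trivial; free-form prose would not be.
def extract_section(text, heading):
    out, capturing = [], False
    for line in text.splitlines():
        if line.startswith("## "):
            capturing = (line.strip() == "## " + heading)
            continue
        if capturing:
            out.append(line)
    return "\n".join(out).strip()
```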

&lt;h3&gt;
  
  
  4. Keyword-Driven Control Flow
&lt;/h3&gt;

&lt;p&gt;The Reviewer's "APPROVED" / "REVISION NEEDED" and QA's "FINAL SIGN-OFF" are more than just text — they're &lt;strong&gt;control signals&lt;/strong&gt; that drive the orchestration logic. This is a simple but powerful pattern: use natural language keywords as routing triggers. The LLM generates them naturally as part of its response, and our code checks for them to make routing decisions. No complex parsing or additional LLM calls needed.&lt;/p&gt;
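
&lt;p&gt;One robustness tweak worth considering (my suggestion, not part of the tutorial code): only check the &lt;em&gt;tail&lt;/em&gt; of the message, so an agent that merely quotes a keyword mid-discussion doesn't trip the router:&lt;/p&gt;

```python
def has_signal(message, keyword, tail_chars=200):
    # Only inspect the end of the message: agents sometimes QUOTE a
    # keyword ("if rejected, say REVISION NEEDED") without meaning it.
    return keyword.lower() in message[-tail_chars:].lower()
```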

&lt;h3&gt;
  
  
  5. Safety Mechanisms Matter
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;max_round=15&lt;/code&gt; limit prevents infinite revision loops. Without it, a picky Reviewer could keep sending work back to the Developer forever, burning tokens and time. Always build in safety limits for multi-agent systems. Other safety patterns include timeout limits, cost caps, and fallback behaviors.&lt;/p&gt;
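
&lt;p&gt;To see the cap doing its job, here's a toy simulation of that runaway loop (hypothetical code; in AG2 the real limit is just &lt;code&gt;GroupChat(max_round=15)&lt;/code&gt;):&lt;/p&gt;

```python
# Toy simulation of the runaway revision loop described above
# (hypothetical code; in AG2 the real cap is GroupChat(max_round=15)).
def developer(history):
    return "updated plan"

def picky_reviewer(history):
    return "REVISION NEEDED: still not perfect"  # never approves

def revision_loop(max_round=15):
    # Without the cap, this Developer/Reviewer ping-pong would never end.
    history = ["initial plan"]
    agents = [picky_reviewer, developer]
    while len(history) != max_round:
        history.append(agents[len(history) % 2](history))
    return len(history)
```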




&lt;h2&gt;
  
  
  Source code
&lt;/h2&gt;

&lt;p&gt;The complete source code for this project is available on &lt;a href="https://github.com/duymap/software-team-planner" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Use OpenCode with local LLM, not bad at all</title>
      <dc:creator>Chung Duy</dc:creator>
      <pubDate>Thu, 05 Feb 2026 14:58:38 +0000</pubDate>
      <link>https://dev.to/chung_duy_51a346946b27a3d/use-opencode-with-local-llm-not-bad-all-at-5cdm</link>
      <guid>https://dev.to/chung_duy_51a346946b27a3d/use-opencode-with-local-llm-not-bad-all-at-5cdm</guid>
      <description>&lt;h1&gt;
  
  
  Local LLM Coding Setup: LMStudio + OpenCode
&lt;/h1&gt;

&lt;p&gt;A guide to setting up a local AI coding assistant using LMStudio and OpenCode — a solid alternative to Claude Code when you run out of daily usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Install LMStudio
&lt;/h2&gt;

&lt;p&gt;Download and install from &lt;a href="https://lmstudio.ai/" rel="noopener noreferrer"&gt;https://lmstudio.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Select a Model
&lt;/h2&gt;

&lt;p&gt;Choose an appropriate model depending on your hardware. In this case, I chose &lt;strong&gt;Qwen3-Coder-Next-MLX-6bit&lt;/strong&gt; because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It fits within my available RAM&lt;/li&gt;
&lt;li&gt;It's optimized for macOS with Apple Silicon (M4 chip)&lt;/li&gt;
&lt;li&gt;It can leverage the M4 GPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghmbbtklh8tdug29x7gg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghmbbtklh8tdug29x7gg.png" alt=" " width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You may need to wait a bit for the model to fully download.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Load and Configure the Model
&lt;/h2&gt;

&lt;p&gt;Load the model you selected in Step 2 (e.g., &lt;code&gt;Qwen3-Coder-Next-MLX-6bit&lt;/code&gt;) and configure the following:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Temperature&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Length&lt;/td&gt;
&lt;td&gt;&lt;code&gt;80000&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ky8plvt850zjn0fnyhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ky8plvt850zjn0fnyhj.png" alt=" " width="408" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9wizhi3g1otq5ytx38o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9wizhi3g1otq5ytx38o.png" alt=" " width="427" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ Do &lt;strong&gt;not&lt;/strong&gt; leave the context length at the default &lt;code&gt;16000&lt;/code&gt; — it's too small.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4. Install OpenCode
&lt;/h2&gt;

&lt;p&gt;Install from &lt;a href="https://opencode.ai/" rel="noopener noreferrer"&gt;https://opencode.ai/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Configure OpenCode to Use LMStudio
&lt;/h2&gt;

&lt;p&gt;Open the config file at &lt;code&gt;~/.config/opencode/opencode.jsonc&lt;/code&gt; and paste the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"theme"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tokyonight"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"disabled_providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"localllm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local LLM"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"qwen3-coder-next-mlx"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen3-Coder-Next"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:1234/v1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ Make sure the key &lt;code&gt;"qwen3-coder-next-mlx"&lt;/code&gt; matches the model name in LMStudio exactly, otherwise you'll get an error: &lt;em&gt;"can not load model..."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
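
&lt;p&gt;You can check for that mismatch programmatically before launching OpenCode. A small sketch (the helper names are mine; LM Studio serves the OpenAI-compatible &lt;code&gt;/v1/models&lt;/code&gt; endpoint):&lt;/p&gt;

```python
import json
import urllib.request

def served_model_ids(base_url="http://127.0.0.1:1234/v1"):
    # LM Studio exposes an OpenAI-compatible /models endpoint that
    # returns {"data": [{"id": ...}, ...]}. [] means the server is not up.
    try:
        with urllib.request.urlopen(base_url + "/models", timeout=3) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]
    except OSError:
        return []

def config_key_is_served(config_key, ids):
    # The model key in opencode.jsonc must match a served id exactly.
    return config_key in ids
```

&lt;p&gt;Run it while LM Studio is up; if &lt;code&gt;config_key_is_served("qwen3-coder-next-mlx", served_model_ids())&lt;/code&gt; comes back &lt;code&gt;False&lt;/code&gt;, fix the key before launching OpenCode.&lt;/p&gt;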

&lt;h2&gt;
  
  
  6. Run OpenCode
&lt;/h2&gt;

&lt;p&gt;Open a new terminal, navigate to your project directory, and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;E.g: "help me to understand the code base", while opencode running, you can watch out LMstudio server log to see it really works&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F606vp3651rc6wbv3twud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F606vp3651rc6wbv3twud.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllucarzhqnnazzoahgby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllucarzhqnnazzoahgby.png" alt=" " width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tested with a demo project and the results are &lt;strong&gt;not bad at all&lt;/strong&gt; compared to Sonnet 4.5. More testing on larger projects is needed, but the output quality makes it a worthwhile alternative:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔄 Use as a fallback when you run out of daily Claude Code usage&lt;/li&gt;
&lt;li&gt;💡 Explore other use cases where a local LLM fits your workflow&lt;/li&gt;
&lt;li&gt;💰 Zero API cost — everything runs locally&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>tooling</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Running Mistral Vibe CLI with Local LLMs: A Complete Guide</title>
      <dc:creator>Chung Duy</dc:creator>
      <pubDate>Sun, 04 Jan 2026 13:32:34 +0000</pubDate>
      <link>https://dev.to/chung_duy_51a346946b27a3d/running-mistral-vibe-with-local-llms-a-complete-guide-1mde</link>
      <guid>https://dev.to/chung_duy_51a346946b27a3d/running-mistral-vibe-with-local-llms-a-complete-guide-1mde</guid>
      <description>&lt;h2&gt;
  
  
  Why run Mistral Vibe CLI with local LLM?
&lt;/h2&gt;

&lt;p&gt;So, Mistral Vibe is this amazing CLI tool that usually talks to Mistral's cloud. But honestly? Running it &lt;strong&gt;locally&lt;/strong&gt; is a total game changer. It feels so much cooler when the "brain" is actually inside your own computer!&lt;/p&gt;

&lt;p&gt;Here's why I'm loving the local setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy Core&lt;/strong&gt;: My code stays right here on my machine. No "sending to the cloud" anxiety! 🔒&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free Forever&lt;/strong&gt;: Zero API bills. My wallet is so happy right now. 💸&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internet? Optional&lt;/strong&gt;: I can literally code in a cabin in the woods (if I had one). 🌲&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Hopping&lt;/strong&gt;: I can try out whatever models I want just by pulling them from Ollama.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm using &lt;strong&gt;Ollama&lt;/strong&gt; with the &lt;strong&gt;devstral-small-2&lt;/strong&gt; model for this guide because it's super snappy for coding.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;macOS with at least 32GB of RAM and an M4 chip (I haven't tried lower-spec machines). You'll also need at least 20GB of free disk space, since devstral-small-2 is about 15GB.&lt;/li&gt;
&lt;li&gt;Git&lt;/li&gt;
&lt;li&gt;Python 3.12 or higher&lt;/li&gt;
&lt;li&gt;pip or pipx&lt;/li&gt;
&lt;li&gt;ollama (&lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;https://ollama.com/&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I haven't tried this on Linux or Windows yet, but the setup should be much the same. If you're on Linux or Windows, make sure your GPU is powerful enough to run the 24B devstral-small-2 model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installation Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install Ollama
&lt;/h3&gt;

&lt;p&gt;Ollama is a tool for running large language models locally with ease.&lt;/p&gt;

&lt;h4&gt;
  
  
  On macOS
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download and install from official website&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Or using Homebrew&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Verify Installation
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Pull the Model
&lt;/h3&gt;

&lt;p&gt;Download the &lt;code&gt;devstral-small-2&lt;/code&gt; model (or any model you prefer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pull the model (this may take a few minutes depending on your internet speed)&lt;/span&gt;
ollama pull devstral-small-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the model is downloaded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Start Ollama Server
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the Ollama server&lt;/span&gt;
ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Keep this terminal window open. The server needs to run while you use Vibe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default endpoint:&lt;/strong&gt; &lt;code&gt;http://localhost:11434&lt;/code&gt;&lt;/p&gt;
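
&lt;p&gt;Before wiring Vibe up, it's worth confirming the server actually answers. A minimal Python sketch against Ollama's &lt;code&gt;/api/tags&lt;/code&gt; endpoint (standard Ollama REST API; the helper names are mine):&lt;/p&gt;

```python
import json
import urllib.request

def model_names(payload):
    # Shape of Ollama's /api/tags response: {"models": [{"name": ...}, ...]}
    return [m["name"] for m in payload.get("models", [])]

def ollama_models(base_url="http://localhost:11434"):
    # Ask the running server which models have been pulled.
    # Returns [] when the server is unreachable.
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=3) as resp:
            return model_names(json.load(resp))
    except OSError:
        return []
```

&lt;p&gt;If &lt;code&gt;devstral-small-2&lt;/code&gt; doesn't appear in the returned list, re-run &lt;code&gt;ollama pull devstral-small-2&lt;/code&gt;.&lt;/p&gt;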

&lt;p&gt;To run Ollama in the background on macOS/Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a background service (optional)&lt;/span&gt;
ollama serve &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Install Mistral Vibe
&lt;/h3&gt;

&lt;p&gt;Check out the repo : &lt;a href="https://github.com/mistralai/mistral-vibe" rel="noopener noreferrer"&gt;https://github.com/mistralai/mistral-vibe&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using pipx&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install pipx if you don't have it&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;pipx  &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;span class="c"&gt;# or&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pipx   &lt;span class="c"&gt;# Linux/Windows&lt;/span&gt;

&lt;span class="c"&gt;# Ensure pipx path is configured&lt;/span&gt;
pipx ensurepath

&lt;span class="c"&gt;# Install Mistral Vibe from source&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /path/to/mistral-vibe
pipx &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Open a new terminal tab and verify the installation
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vibe &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see: &lt;code&gt;vibe 1.3.3&lt;/code&gt; (or your current version)&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Mistral Vibe uses a TOML configuration file located at &lt;code&gt;~/.vibe/config.toml&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Locate or Create Config File
&lt;/h3&gt;

&lt;p&gt;When you first run &lt;code&gt;vibe&lt;/code&gt;, it creates a default config file. However, we need to modify it to use Ollama.&lt;/p&gt;
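&lt;p&gt;If the file doesn't exist yet, you can create it (and its directory) by hand before editing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the config directory and an empty config file if needed
mkdir -p ~/.vibe
touch ~/.vibe/config.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;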

&lt;h3&gt;
  
  
  Step 2: Configure for Ollama
&lt;/h3&gt;

&lt;p&gt;Create or edit &lt;code&gt;~/.vibe/config.toml&lt;/code&gt; with the following configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# ~/.vibe/config.toml&lt;/span&gt;

&lt;span class="c"&gt;# Set Ollama model as default&lt;/span&gt;
&lt;span class="py"&gt;active_model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama-devstral-small-2"&lt;/span&gt;

&lt;span class="c"&gt;# UI and behavior settings&lt;/span&gt;
&lt;span class="py"&gt;textual_theme&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"terminal"&lt;/span&gt;
&lt;span class="py"&gt;vim_keybindings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;disable_welcome_banner_animation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;auto_compact_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200000&lt;/span&gt;
&lt;span class="py"&gt;context_warnings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;system_prompt_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"cli"&lt;/span&gt;
&lt;span class="py"&gt;include_commit_signature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;include_model_info&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;include_project_context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;enable_update_checks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;api_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;720.0&lt;/span&gt;

&lt;span class="c"&gt;# Tool configurations (optional - customize as needed)&lt;/span&gt;
&lt;span class="py"&gt;enabled_tools&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="py"&gt;disabled_tools&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# PROVIDERS CONFIGURATION&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Ollama Provider (Local)&lt;/span&gt;
&lt;span class="nn"&gt;[[providers]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama"&lt;/span&gt;
&lt;span class="py"&gt;api_base&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:11434/v1"&lt;/span&gt;
&lt;span class="py"&gt;api_key_env_var&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;  &lt;span class="c"&gt;# No API key needed for local&lt;/span&gt;
&lt;span class="py"&gt;api_style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;
&lt;span class="py"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"generic"&lt;/span&gt;
&lt;span class="py"&gt;reasoning_field_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"reasoning_content"&lt;/span&gt;

&lt;span class="c"&gt;# Mistral Cloud Provider (Optional - keep for fallback)&lt;/span&gt;
&lt;span class="nn"&gt;[[providers]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mistral"&lt;/span&gt;
&lt;span class="py"&gt;api_base&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.mistral.ai/v1"&lt;/span&gt;
&lt;span class="py"&gt;api_key_env_var&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"MISTRAL_API_KEY"&lt;/span&gt;
&lt;span class="py"&gt;api_style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;
&lt;span class="py"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mistral"&lt;/span&gt;
&lt;span class="py"&gt;reasoning_field_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"reasoning_content"&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# MODELS CONFIGURATION&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="c"&gt;# Ollama Models (Local)&lt;/span&gt;
&lt;span class="nn"&gt;[[models]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"devstral-small-2"&lt;/span&gt;
&lt;span class="py"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama"&lt;/span&gt;
&lt;span class="py"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama-devstral-small-2"&lt;/span&gt;
&lt;span class="py"&gt;temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="py"&gt;input_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;  &lt;span class="c"&gt;# Free - local model&lt;/span&gt;
&lt;span class="py"&gt;output_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="nn"&gt;[[models]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mistral"&lt;/span&gt;
&lt;span class="py"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama"&lt;/span&gt;
&lt;span class="py"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama-mistral"&lt;/span&gt;
&lt;span class="py"&gt;temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="py"&gt;input_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="py"&gt;output_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="nn"&gt;[[models]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"devstral-2"&lt;/span&gt;
&lt;span class="py"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama"&lt;/span&gt;
&lt;span class="py"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama-devstral-2"&lt;/span&gt;
&lt;span class="py"&gt;temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="py"&gt;input_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="py"&gt;output_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="c"&gt;# Cloud Models (Optional - for fallback)&lt;/span&gt;
&lt;span class="nn"&gt;[[models]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mistral-vibe-cli-latest"&lt;/span&gt;
&lt;span class="py"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"mistral"&lt;/span&gt;
&lt;span class="py"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"devstral-2-cloud"&lt;/span&gt;
&lt;span class="py"&gt;temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="py"&gt;input_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;
&lt;span class="py"&gt;output_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# PROJECT CONTEXT SETTINGS&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="nn"&gt;[project_context]&lt;/span&gt;
&lt;span class="py"&gt;max_chars&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40000&lt;/span&gt;
&lt;span class="py"&gt;default_commit_count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;max_doc_bytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32768&lt;/span&gt;
&lt;span class="py"&gt;truncation_buffer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="py"&gt;max_depth&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;max_files&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="py"&gt;max_dirs_per_level&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="py"&gt;timeout_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# SESSION LOGGING&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="nn"&gt;[session_logging]&lt;/span&gt;
&lt;span class="py"&gt;save_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"~/.vibe/logs/session"&lt;/span&gt;
&lt;span class="py"&gt;session_prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"session"&lt;/span&gt;
&lt;span class="py"&gt;enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# ============================================&lt;/span&gt;
&lt;span class="c"&gt;# TOOL PERMISSIONS&lt;/span&gt;
&lt;span class="c"&gt;# ============================================&lt;/span&gt;

&lt;span class="nn"&gt;[tools.read_file]&lt;/span&gt;
&lt;span class="py"&gt;permission&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"always"&lt;/span&gt;
&lt;span class="py"&gt;max_read_bytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64000&lt;/span&gt;

&lt;span class="nn"&gt;[tools.write_file]&lt;/span&gt;
&lt;span class="py"&gt;permission&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ask"&lt;/span&gt;
&lt;span class="py"&gt;max_write_bytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64000&lt;/span&gt;

&lt;span class="nn"&gt;[tools.search_replace]&lt;/span&gt;
&lt;span class="py"&gt;permission&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ask"&lt;/span&gt;
&lt;span class="py"&gt;max_content_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;

&lt;span class="nn"&gt;[tools.bash]&lt;/span&gt;
&lt;span class="py"&gt;permission&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ask"&lt;/span&gt;
&lt;span class="py"&gt;max_output_bytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;
&lt;span class="py"&gt;default_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;

&lt;span class="nn"&gt;[tools.grep]&lt;/span&gt;
&lt;span class="py"&gt;permission&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"always"&lt;/span&gt;
&lt;span class="py"&gt;max_output_bytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;64000&lt;/span&gt;
&lt;span class="py"&gt;default_max_matches&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="nn"&gt;[tools.todo]&lt;/span&gt;
&lt;span class="py"&gt;permission&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"always"&lt;/span&gt;
&lt;span class="py"&gt;max_todos&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Understanding Key Configuration Options
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Provider Configuration
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[[providers]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama"&lt;/span&gt;                              &lt;span class="c"&gt;# Provider identifier&lt;/span&gt;
&lt;span class="py"&gt;api_base&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:11434/v1"&lt;/span&gt;       &lt;span class="c"&gt;# Ollama's OpenAI-compatible endpoint&lt;/span&gt;
&lt;span class="py"&gt;api_key_env_var&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;                         &lt;span class="c"&gt;# Empty = no API key required&lt;/span&gt;
&lt;span class="py"&gt;api_style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;                         &lt;span class="c"&gt;# Use OpenAI-compatible API format&lt;/span&gt;
&lt;span class="py"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"generic"&lt;/span&gt;                          &lt;span class="c"&gt;# Use generic HTTP backend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Model Configuration
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[[models]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"devstral-small-2"&lt;/span&gt;                    &lt;span class="c"&gt;# Exact model name in Ollama&lt;/span&gt;
&lt;span class="py"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama"&lt;/span&gt;                          &lt;span class="c"&gt;# Links to provider above&lt;/span&gt;
&lt;span class="py"&gt;alias&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ollama-devstral-small-2"&lt;/span&gt;            &lt;span class="c"&gt;# Friendly name you'll use&lt;/span&gt;
&lt;span class="py"&gt;temperature&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;                            &lt;span class="c"&gt;# Lower = more deterministic&lt;/span&gt;
&lt;span class="py"&gt;input_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;                            &lt;span class="c"&gt;# Free for local&lt;/span&gt;
&lt;span class="py"&gt;output_price&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;                           &lt;span class="c"&gt;# Free for local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
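&lt;p&gt;The &lt;code&gt;name&lt;/code&gt; field must match the model name exactly as Ollama reports it (via &lt;code&gt;ollama list&lt;/code&gt;), or requests will fail. A quick sanity check against the same OpenAI-compatible endpoint Vibe will use (assuming the default port and model name from this tutorial):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Send a minimal chat completion to the configured model
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "devstral-small-2", "messages": [{"role": "user", "content": "Say hello"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;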






&lt;h2&gt;
  
  
  Usage Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic Usage
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Start Vibe
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Navigate to your project&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; /path/to/your/project

&lt;span class="c"&gt;# Start Vibe&lt;/span&gt;
vibe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the Vibe interface running against your local LLM. Every prompt is now processed locally rather than by a cloud model, and no API key is required.&lt;/p&gt;

&lt;p&gt;Happy coding!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
