TechLatest

Posted on Jun 3 • Originally published at faun.pub on Jun 3

Deploy a Qwen 3.6 Agentic RAG — Step-by-Step Walkthrough

#opensource #qwen36 #localai #retrievalaugmentedge

Today we’ll build and deploy an agentic RAG service powered by Alibaba’s latest Qwen 3.6, with the model running fully on your machine.

What you’ll build

A private API where two AI agents collaborate:

- Researcher Agent: retrieves context from a vector database or the web
- Writer Agent: turns that research into a polished answer

Tool stack

| Tool | Role |
|------|------|
| **Qwen 3.6** (via Ollama) | Local LLM — no cloud API needed |
| **CrewAI** | Multi-agent orchestration |
| **Firecrawl** | Web search when the vector DB doesn't have the answer |
| **Qdrant** | Local vector database for your knowledge base |
| **LitServe** | Production-style HTTP API deployment |

Architecture

Flow:

1. Client sends a query to LitServe
2. Researcher Agent picks the right tool (vector DB or Firecrawl)
3. Writer Agent synthesizes the final answer
4. LitServe returns JSON to the client

Prerequisites

1. Remove old models (optional cleanup)

If you had other Ollama models taking disk space:

ollama list
ollama rm gemma4:e2b # example — use your model name

2. Pull Qwen 3.6

On a 16GB Mac, use the 27B variant:

ollama pull qwen3.6:27b

Verify:

ollama run qwen3.6:27b "Say hello in one sentence."

3. Install Python dependencies

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

4. Environment variables

cp .env.example .env

Edit .env:

FIRECRAWL_API_KEY=fc-...
OLLAMA_MODEL=ollama/qwen3.6:27b
OLLAMA_BASE_URL=http://localhost:11434

Get a Firecrawl key at firecrawl.dev.

5. Start Qdrant

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

6. Build the knowledge base

python setup_vectordb.py

This embeds 20 ML FAQ chunks into Qdrant using nomic-embed-text-v1.5.
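As a rough idea of what such an indexing script can look like, here is a minimal sketch. It is not the tutorial's actual setup_vectordb.py: the collection name `ml_faq`, the `text` payload key, and the use of sentence-transformers to load the embedding model are all assumptions made for illustration. The `search_document: ` prefix is the task prefix the nomic-embed-text models expect for stored passages.

```python
COLLECTION = "ml_faq"              # hypothetical collection name
DOC_PREFIX = "search_document: "   # task prefix for passages stored in the index


def prepare_documents(chunks):
    """Drop blank chunks and add the document prefix before embedding."""
    return [DOC_PREFIX + c.strip() for c in chunks if c.strip()]


def index_chunks(chunks, url="http://localhost:6333"):
    """Embed the chunks and upsert them into Qdrant; returns how many were stored."""
    # Imported lazily so prepare_documents() can be used without these packages.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer  # assumed embedding backend

    texts = [c.strip() for c in chunks if c.strip()]
    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
    vectors = model.encode(prepare_documents(texts))

    client = QdrantClient(url=url)
    if not client.collection_exists(COLLECTION):
        client.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
        )
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=i, vector=vec.tolist(), payload={"text": text})
            for i, (vec, text) in enumerate(zip(vectors, texts))
        ],
    )
    return len(texts)
```

The original text is stored in the payload without the prefix, so the retrieval tool can hand clean passages back to the agent.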

Step 1 — Set up the LLM

CrewAI integrates with Ollama through its LLM class. We point it at your local Qwen 3.6 model:
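A minimal sketch of that setup, assuming the variable names from the `.env` file above have already been loaded into the process environment (for example with python-dotenv). The helper names `llm_settings` and `build_llm` are made up for this example.

```python
import os


def llm_settings(env=None):
    """Read the model name and Ollama URL, falling back to the tutorial's values."""
    env = os.environ if env is None else env
    return {
        "model": env.get("OLLAMA_MODEL", "ollama/qwen3.6:27b"),
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    }


def build_llm(env=None):
    """Create the CrewAI LLM object that both agents will share."""
    # Imported lazily so llm_settings() works even where CrewAI is not installed.
    from crewai import LLM

    return LLM(**llm_settings(env))
```

The `ollama/` prefix in the model name is what tells CrewAI to route requests to the Ollama server at `base_url` rather than to a hosted provider.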

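Why qwen3.6:27b? Qwen 3.6 adds stronger agentic reasoning and tool use. This walkthrough uses the 27B quantized model (~17GB) on a 16GB machine; because the weights are slightly larger than physical RAM, expect the OS to swap and responses to be slow.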

Step 2 — Define the Research Agent and Task

The Researcher gets two tools:

- `ml_faq_retrieval_tool`: searches your Qdrant vector DB
- `FirecrawlSearchTool`: searches the web for fresh or out-of-scope topics

Vector DB tool (tools.py)

The custom tool wraps Qdrant retrieval:
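One way such a wrapper might look is sketched below. The collection name `ml_faq`, the `text` payload key, and the sentence-transformers embedding backend are assumptions, not details taken from the tutorial's tools.py. The `search_query: ` prefix is the task prefix the nomic-embed-text models expect for queries.

```python
def format_hits(payloads, text_key="text"):
    """Join retrieved chunk payloads into one context string for the agent."""
    chunks = [p[text_key] for p in payloads if p and p.get(text_key)]
    return "\n\n---\n\n".join(chunks) if chunks else "No relevant FAQ entries found."


def search_faq(query, limit=3, url="http://localhost:6333", collection="ml_faq"):
    """Embed the query and return the closest FAQ chunks from Qdrant."""
    # Imported lazily so format_hits() can be used without these packages.
    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer  # assumed embedding backend

    # Loading the model on every call is slow; a real tool would cache it.
    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
    vector = model.encode("search_query: " + query).tolist()

    client = QdrantClient(url=url)
    result = client.query_points(collection_name=collection, query=vector, limit=limit)
    return format_hits([point.payload for point in result.points])


def build_retrieval_tool():
    """Wrap search_faq() as a CrewAI tool the Researcher can call."""
    from crewai.tools import tool

    @tool("ml_faq_retrieval_tool")
    def ml_faq_retrieval_tool(query: str) -> str:
        """Search the local ML FAQ knowledge base for passages relevant to the query."""
        return search_faq(query)

    return ml_faq_retrieval_tool
```

The tool's docstring doubles as the description the LLM sees when choosing a tool, so it should say plainly what the knowledge base covers.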

The agent decides which tool to call — that’s what makes this “agentic” RAG instead of a fixed retrieve-then-generate pipeline.
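The Researcher and its task could be defined roughly as follows. The role, goal, backstory, and prompt wording are illustrative placeholders, not the tutorial's original text; only the `{query}` placeholder and the two-tool setup come from the article.

```python
RESEARCH_TASK = (
    "Find the most relevant information for this question: {query}\n"
    "Prefer ml_faq_retrieval_tool for machine-learning FAQ topics; "
    "fall back to web search when the knowledge base has no good match."
)


def build_researcher(llm, tools):
    """Create the Researcher agent and its task (prompt wording is illustrative)."""
    # Imported lazily so the prompt text above can be inspected without CrewAI.
    from crewai import Agent, Task

    researcher = Agent(
        role="Researcher",
        goal="Retrieve accurate context for the user's question",
        backstory="You choose the best source for each question before answering.",
        tools=tools,
        llm=llm,
        verbose=True,
    )
    research_task = Task(
        description=RESEARCH_TASK,
        expected_output="A concise summary of the retrieved context.",
        agent=researcher,
    )
    return researcher, research_task
```

A call might look like `build_researcher(llm, [ml_faq_retrieval_tool, FirecrawlSearchTool()])`, with `FirecrawlSearchTool` imported from `crewai_tools`.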

Step 3 — Define the Writer Agent and Task

The Writer receives the Researcher’s output via context=[researcher_task]:
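A sketch of that wiring, again with placeholder prompt text. The part taken from the article is the `context=[researcher_task]` argument, which feeds the first task's output into the second.

```python
WRITE_TASK = (
    "Using the research findings, write a clear and accurate answer to: {query}"
)


def build_writer(llm, researcher_task):
    """Create the Writer agent and a task that consumes the Researcher's output."""
    # Imported lazily so this module loads even where CrewAI is not installed.
    from crewai import Agent, Task

    writer = Agent(
        role="Writer",
        goal="Turn research notes into a polished answer",
        backstory="You explain technical topics concisely and stick to the sources.",
        llm=llm,
        verbose=True,
    )
    writer_task = Task(
        description=WRITE_TASK,
        expected_output="A well-structured answer grounded in the research.",
        agent=writer,
        context=[researcher_task],  # pass the Researcher's result to the Writer
    )
    return writer, writer_task
```

The Writer gets no tools here, so it can only work with what the Researcher found.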

Step 4 — Set up the Crew

Orchestrate both agents inside LitServe’s setup() method (runs once at startup):
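A minimal sketch of the crew assembly, assuming the agents and tasks were created beforehand. `build_crew` is a helper name invented for this example; in server.py the same call would sit inside `setup(self, device)` on the `ls.LitAPI` subclass.

```python
def build_crew(researcher, research_task, writer, writer_task):
    """Assemble a sequential crew: the research task runs first, then the writing task."""
    # Imported lazily so this helper can be loaded without CrewAI installed.
    from crewai import Crew, Process

    return Crew(
        agents=[researcher, writer],
        tasks=[research_task, writer_task],
        process=Process.sequential,
        verbose=True,
    )


# Inside the ls.LitAPI subclass (sketch):
#
#     def setup(self, device):
#         self.crew = build_crew(researcher, research_task, writer, writer_task)
```

Building the crew once in `setup()` means each request reuses the same agents instead of recreating them.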

Step 5 — Decode request

Extract the user query from the incoming JSON body:
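In LitServe this happens in the `decode_request` hook. A minimal sketch, with the `ls.LitAPI` base class left out so the snippet runs without LitServe installed:

```python
class AgenticRAGAPI:
    """In server.py this class subclasses ls.LitAPI; the base is omitted in this sketch."""

    def decode_request(self, request):
        # LitServe passes the parsed JSON body as a dict.
        return request["query"]
```

A request without a `query` key raises `KeyError` in this version; a production server would likely validate the body and return a clearer error.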

Example request:

{"query": "What is cross-validation and why is it important?"}

Step 6 — Predict

Pass the query to the Crew. The {query} placeholder in task descriptions is filled from inputs:
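A minimal sketch of the `predict` hook, assuming `self.crew` was assigned during startup. The constructor exists only so the snippet can be tried with a stand-in crew; the real class would subclass `ls.LitAPI` and set `self.crew` in `setup()`.

```python
class AgenticRAGAPI:
    """In server.py this class subclasses ls.LitAPI; the base is omitted in this sketch."""

    def __init__(self, crew=None):
        # Stand-in for setup(), which assigns self.crew in the real server.
        self.crew = crew

    def predict(self, query):
        # kickoff() substitutes {query} in every task description, then runs the tasks.
        return self.crew.kickoff(inputs={"query": query})
```

The keys of `inputs` must match the placeholder names used in the task descriptions, so a `{query}` placeholder needs a `"query"` key.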

Behind the scenes:

- Researcher runs and may call vector DB and/or Firecrawl
- Writer reads those findings and drafts the answer
- Qwen 3.6 powers both agents through Ollama

Step 7 — Encode response

Return the final answer as JSON:
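A minimal sketch of the `encode_response` hook. The response key `output` is an assumption; the `.raw` attribute is where CrewAI's result object holds the final text.

```python
class AgenticRAGAPI:
    """In server.py this class subclasses ls.LitAPI; the base is omitted in this sketch."""

    def encode_response(self, output):
        # kickoff() returns a result object; .raw is the final answer as a string.
        return {"output": output.raw}
```

LitServe serializes the returned dict to JSON, so the client receives something like `{"output": "..."}`.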

Step 8 — Start the server

timeout=False matters here: it turns off LitServe’s per-request timeout, and agent crews with tool calls can take several minutes on local hardware.
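Starting the server can be sketched as below. `run_server` is a helper name invented for this example, and port 8000 is assumed because it is LitServe's default.

```python
def run_server(api, port=8000):
    """Serve the given LitAPI instance with the request timeout disabled."""
    # Imported lazily so this helper can be loaded without LitServe installed.
    import litserve as ls

    server = ls.LitServer(api, timeout=False)  # crews may run for minutes
    server.run(port=port)
```

In server.py this would be called as `run_server(AgenticRAGAPI())`, where `AgenticRAGAPI` is the `ls.LitAPI` subclass holding the `setup`, `decode_request`, `predict`, and `encode_response` methods.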

Client code

client.py sends a POST to /predict:
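A minimal client along those lines, written with the standard library only. The original client.py may well use a different HTTP library; the URL assumes the server listens on LitServe's default port 8000.

```python
import argparse
import json
import urllib.request

SERVER_URL = "http://localhost:8000/predict"  # assumes LitServe's default port


def build_request(query, url=SERVER_URL):
    """Build the JSON POST request the server's decode_request step expects."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query, url=SERVER_URL):
    """Send the query and return the decoded JSON response."""
    # No client-side timeout is set, so multi-minute crew runs can finish.
    with urllib.request.urlopen(build_request(query, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def main(argv=None):
    parser = argparse.ArgumentParser(description="Query the agentic RAG server")
    parser.add_argument("--query", required=True, help="question to send")
    args = parser.parse_args(argv)
    print(json.dumps(ask(args.query), indent=2))
```

When saving this as client.py, call `main()` under the usual `if __name__ == "__main__":` guard so the commands below work.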

Run it:

# Terminal 1
python server.py

# Terminal 2
python client.py --query "How do I avoid overfitting?"
python client.py --query "What is the latest news about Qwen 3.6?"

The second query should trigger Firecrawl because that topic is not covered by the ML FAQ knowledge base.

Full server code

For reference, here is the complete server.py:

Agentic RAG vs classic RAG

| Classic RAG | Agentic RAG (this tutorial) |
|-------------|----------------------------|
| Fixed: always retrieve → generate | Agent chooses tools dynamically |
| Single LLM call | Multi-agent pipeline |
| One data source | Vector DB + web fallback |
| Hard to extend | Add tools without rewriting the pipeline |

Troubleshooting

| Issue | Fix |
|-------|-----|
| `connection refused` on port 6333 | Start Qdrant with Docker |
| Ollama model not found | Run `ollama pull qwen3.6:27b` |
| Very slow responses | Normal on 16GB RAM; close other apps |
| Firecrawl errors | Check `FIRECRAWL_API_KEY` in `.env` |
| Empty vector results | Run `python setup_vectordb.py` first |