Today we’ll build and deploy an agentic RAG system powered by Alibaba’s latest Qwen 3.6, with the model running entirely on your machine.
What you’ll build
A private API where two AI agents collaborate:
- Researcher Agent — retrieves context from a vector database or the web
- Writer Agent — turns that research into a polished answer
Tool stack
| Tool | Role |
|------|------|
| **Qwen 3.6** (via Ollama) | Local LLM — no cloud API needed |
| **CrewAI** | Multi-agent orchestration |
| **Firecrawl** | Web search when the vector DB doesn't have the answer |
| **Qdrant** | Local vector database for your knowledge base |
| **LitServe** | Production-style HTTP API deployment |
Architecture
Flow:
- Client sends a query to LitServe
- Researcher Agent picks the right tool (vector DB or Firecrawl)
- Writer Agent synthesizes the final answer
- LitServe returns JSON to the client
Prerequisites
1. Remove old models (optional cleanup)
If other Ollama models are taking up disk space, list them and remove the ones you no longer need:
ollama list
ollama rm gemma4:e2b # example — use your model name
2. Pull Qwen 3.6
On a 16GB Mac, use the 27B variant:
ollama pull qwen3.6:27b
Verify:
ollama run qwen3.6:27b "Say hello in one sentence."
3. Install Python dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
4. Environment variables
cp .env.example .env
Edit .env:
FIRECRAWL_API_KEY=fc-...
OLLAMA_MODEL=ollama/qwen3.6:27b
OLLAMA_BASE_URL=http://localhost:11434
Get a Firecrawl key at firecrawl.dev.
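As a rough illustration of how the scripts might consume these values, the sketch below reads them from the environment and falls back to the defaults shown in `.env`. The `Settings` class and `load_settings` helper are hypothetical names, not part of any library, and loading the `.env` file itself (for example with python-dotenv) is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three values defined in .env."""
    ollama_model: str
    ollama_base_url: str
    firecrawl_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, using the tutorial's values as defaults."""
    return Settings(
        ollama_model=env.get("OLLAMA_MODEL", "ollama/qwen3.6:27b"),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        firecrawl_api_key=env.get("FIRECRAWL_API_KEY"),  # None when the key is not set
    )
```

Passing a plain dict instead of `os.environ` makes the helper easy to check in isolation, for example `load_settings({"OLLAMA_MODEL": "ollama/other"})`.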
5. Start Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
6. Build the knowledge base
python setup_vectordb.py
This embeds 20 ML FAQ chunks into Qdrant using nomic-embed-text-v1.5.
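A minimal sketch of what an indexing script along these lines could do. It assumes the embedding model is loaded through the sentence-transformers package and that chunks are stored under a `text` payload key in a collection named `ml_faq`; both names are hypothetical, and the actual `setup_vectordb.py` may load the model differently.

```python
from typing import List

COLLECTION = "ml_faq"  # hypothetical collection name
EMBED_MODEL = "nomic-ai/nomic-embed-text-v1.5"


def as_documents(chunks: List[str]) -> List[str]:
    """Add the 'search_document: ' task prefix that nomic-embed-text uses for stored text."""
    return [f"search_document: {chunk}" for chunk in chunks]


def index_chunks(chunks: List[str], url: str = "http://localhost:6333") -> int:
    """Embed the chunks and upsert them into Qdrant; returns the number of points written."""
    texts = [c.strip() for c in chunks if c.strip()]
    if not texts:
        return 0

    # Third-party imports stay local so the helpers above work without them installed.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(EMBED_MODEL, trust_remote_code=True)
    vectors = model.encode(as_documents(texts)).tolist()

    client = QdrantClient(url=url)
    if not client.collection_exists(COLLECTION):
        client.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=len(vectors[0]), distance=Distance.COSINE),
        )
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=i, vector=vec, payload={"text": text})
            for i, (vec, text) in enumerate(zip(vectors, texts))
        ],
    )
    return len(texts)
```

The vector size is taken from the first embedding rather than hard-coded, so the sketch keeps working if a different embedding model is swapped in.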
Step 1 — Set up the LLM
CrewAI integrates with Ollama through its LLM class. We point it at your local Qwen 3.6 model:
Why qwen3.6:27b? Qwen 3.6 adds stronger agentic reasoning and tool use. The quantized 27B model (~17GB) is the variant this tutorial uses; because it is slightly larger than 16GB of RAM, expect slow responses on such a machine.
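A minimal sketch of the LLM setup described in this step, assuming CrewAI's `LLM` class with its `model` and `base_url` arguments. The `llm_kwargs` and `make_llm` helpers are hypothetical names introduced for illustration.

```python
import os
from typing import Dict, Mapping


def llm_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the arguments for crewai.LLM from the environment variables set in .env."""
    return {
        # The "ollama/" prefix selects the Ollama provider.
        "model": env.get("OLLAMA_MODEL", "ollama/qwen3.6:27b"),
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    }


def make_llm(env: Mapping[str, str] = os.environ):
    """Create the CrewAI LLM object that both agents will share."""
    from crewai import LLM  # imported lazily so llm_kwargs() works without crewai installed

    return LLM(**llm_kwargs(env))
```

Keeping the model name in `OLLAMA_MODEL` means a smaller or larger variant can be tried without touching the code.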
Step 2 — Define the Research Agent and Task
The Researcher gets two tools:
- ml_faq_retrieval_tool — searches your Qdrant vector DB
- FirecrawlSearchTool — searches the web for fresh or out-of-scope topics
Vector DB tool (tools.py)
The custom tool wraps Qdrant retrieval:
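One way such a tool could be written is sketched below, assuming CrewAI's `@tool` decorator, the `qdrant-client` package, and sentence-transformers for query embeddings. The collection name `ml_faq`, the `text` payload key, and the `format_hits` and `make_retrieval_tool` helpers are hypothetical; the original `tools.py` may differ.

```python
from typing import List


def format_hits(texts: List[str]) -> str:
    """Join retrieved chunks into one numbered context string for the agent."""
    if not texts:
        return "No relevant entries found in the knowledge base."
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(texts, start=1))


def make_retrieval_tool(url: str = "http://localhost:6333",
                        collection: str = "ml_faq",
                        top_k: int = 3):
    """Create the ml_faq_retrieval_tool bound to one Qdrant collection."""
    # Imported lazily so format_hits() can be used without these packages installed.
    from crewai.tools import tool
    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer

    client = QdrantClient(url=url)
    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

    @tool("ml_faq_retrieval_tool")
    def ml_faq_retrieval_tool(query: str) -> str:
        """Search the ML FAQ knowledge base and return the most relevant chunks."""
        # nomic-embed-text uses the 'search_query: ' prefix for queries.
        vector = model.encode(f"search_query: {query}").tolist()
        result = client.query_points(collection_name=collection, query=vector, limit=top_k)
        return format_hits([(p.payload or {}).get("text", "") for p in result.points])

    return ml_faq_retrieval_tool
```

The docstring on the decorated function matters: it is the description the agent sees when deciding whether to call the tool.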
The agent decides which tool to call — that’s what makes this “agentic” RAG instead of a fixed retrieve-then-generate pipeline.
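A minimal sketch of the Researcher definition, assuming CrewAI's `Agent` and `Task` classes. The role, goal, backstory, and task wording are illustrative placeholders rather than the tutorial's exact prompts, and `build_researcher` is a hypothetical helper.

```python
RESEARCH_TASK_DESCRIPTION = (
    "Find the most relevant information for the user query: {query}. "
    "Try the ML FAQ retrieval tool first and fall back to web search "
    "if the knowledge base does not cover the topic."
)


def build_researcher(llm, tools):
    """Return (agent, task) for the research step; `tools` holds the two tools listed above."""
    from crewai import Agent, Task  # imported lazily so this module loads without crewai

    researcher = Agent(
        role="Researcher",
        goal="Retrieve accurate, relevant context for the user's question",
        backstory="A careful analyst who checks the knowledge base before searching the web.",
        tools=tools,
        llm=llm,
        verbose=True,
    )
    researcher_task = Task(
        description=RESEARCH_TASK_DESCRIPTION,  # {query} is filled in at kickoff time
        expected_output="A concise summary of the retrieved facts and where they came from.",
        agent=researcher,
    )
    return researcher, researcher_task
```

Nothing in the code selects a tool; the choice is left to the model, guided by the task description and each tool's own description.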
Step 3 — Define the Writer Agent and Task
The Writer receives the Researcher’s output via context=[researcher_task]:
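A minimal sketch of the Writer side, again assuming CrewAI's `Agent` and `Task` classes. The prompt text is an illustrative placeholder and `build_writer` is a hypothetical helper; the `context` argument is the part taken from the text above.

```python
WRITER_TASK_DESCRIPTION = (
    "Using the research findings, write a clear, well-structured answer to: {query}"
)


def build_writer(llm, researcher_task):
    """Return (agent, task) for the writing step, wired to the Researcher's output."""
    from crewai import Agent, Task  # imported lazily so this module loads without crewai

    writer = Agent(
        role="Writer",
        goal="Turn research notes into a polished, accurate answer",
        backstory="A technical writer who explains ML concepts plainly.",
        llm=llm,
        verbose=True,
    )
    writer_task = Task(
        description=WRITER_TASK_DESCRIPTION,
        expected_output="A complete answer to the user's question in a few paragraphs.",
        agent=writer,
        context=[researcher_task],  # the Researcher's result is passed in as context
    )
    return writer, writer_task
```

The Writer gets no tools here, so it can only work from what the Researcher found.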
Step 4 — Set up the Crew
Orchestrate both agents inside LitServe’s setup() method (runs once at startup):
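A minimal sketch of the crew assembly that `setup()` could call, assuming CrewAI's `Crew` class and `Process.sequential`. `build_crew` is a hypothetical helper name.

```python
def build_crew(researcher, writer, researcher_task, writer_task):
    """Assemble the two agents and their tasks into a sequential crew."""
    from crewai import Crew, Process  # imported lazily so this module loads without crewai

    return Crew(
        agents=[researcher, writer],
        tasks=[researcher_task, writer_task],  # run in this order
        process=Process.sequential,
        verbose=True,
    )
```

Inside a LitServe API class, `setup(self, device)` would store the result once, for example as `self.crew`, so agents and tools are not rebuilt on every request.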
Step 5 — Decode request
Extract the user query from the incoming JSON body:
Example request:
{"query": "What is cross-validation and why is it important?"}
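A minimal sketch of the decoding step, written as a standalone function for clarity. In a LitServe API class it would be the `decode_request(self, request)` method; the validation is an addition that the original may not include.

```python
from typing import Any, Mapping


def decode_request(request: Mapping[str, Any]) -> str:
    """Pull the user's question out of the parsed JSON body."""
    query = request.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("request body must contain a non-empty 'query' string")
    return query.strip()
```

Rejecting empty queries here avoids spending several minutes of agent time on a request that cannot produce a useful answer.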
Step 6 — Predict
Pass the query to the Crew. The {query} placeholder in task descriptions is filled from inputs:
Behind the scenes:
- Researcher runs and may call the vector DB, Firecrawl, or both
- Writer reads those findings and drafts the answer
- Qwen 3.6 powers both agents through Ollama
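A minimal sketch of the predict step, assuming the crew exposes CrewAI's `kickoff(inputs=...)` method. `EchoCrew` is a hypothetical stand-in used only to show the call shape; it imitates placeholder filling with `str.format` and is not how CrewAI implements it.

```python
def predict(crew, query: str):
    """Run the crew for one query; the inputs dict fills the {query} placeholder."""
    return crew.kickoff(inputs={"query": query})


class EchoCrew:
    """Stand-in for a real Crew: fills the placeholders and returns the text."""

    def __init__(self, description: str):
        self.description = description

    def kickoff(self, inputs: dict) -> str:
        return self.description.format(**inputs)


if __name__ == "__main__":
    print(predict(EchoCrew("Research and answer: {query}"), "How do I avoid overfitting?"))
```

The key in `inputs` has to match the placeholder name used in the task descriptions, here `query`.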
Step 7 — Encode response
Return the final answer as JSON:
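A minimal sketch of the encoding step as a standalone function. It assumes the crew result carries its final text in a `raw` attribute, as CrewAI's `CrewOutput` does, and falls back to `str()` otherwise; the `output` key in the response is a hypothetical choice.

```python
from typing import Any, Dict


def encode_response(output: Any) -> Dict[str, str]:
    """Wrap the crew's final answer in a JSON-serializable dict."""
    text = getattr(output, "raw", None)
    if text is None:
        text = str(output)  # plain strings and other result types
    return {"output": text}
```

Whatever key is chosen here is the one the client has to read from the response.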
Step 8 — Start the server
timeout=False is important — agent crews with tool calls can take several minutes on local hardware.
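A minimal sketch of the launch code, assuming LitServe's `LitServer` class and its `timeout` argument. The `serve` helper and the port 8000 default are illustrative choices.

```python
def serve(api, port: int = 8000) -> None:
    """Start LitServe for the given LitAPI instance with the request timeout disabled."""
    import litserve as ls  # imported lazily so this module loads without litserve

    server = ls.LitServer(api, timeout=False)  # no per-request time limit
    server.run(port=port)
```

With the timeout left at its default, long-running requests could be cut off before the Writer finishes.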
Client code
client.py sends a POST to /predict:
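A minimal sketch of such a client using only the standard library. The original may well use the `requests` package instead; the URL and port are assumptions that match LitServe's default of 8000.

```python
import argparse
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:8000/predict"  # assumed host and port


def build_request(query: str, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Create the POST request carrying the query as a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, url: str = DEFAULT_URL) -> dict:
    """Send the query and return the decoded JSON response."""
    # No timeout is set: agent runs can take several minutes on local hardware.
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))


def main(argv=None) -> None:
    """Parse --query from the command line and print the server's answer."""
    parser = argparse.ArgumentParser(description="Query the Agentic RAG API")
    parser.add_argument("--query", required=True, help="question to send")
    args = parser.parse_args(argv)
    print(json.dumps(ask(args.query), indent=2))
```

To use it as a script, call `main()` under the usual `if __name__ == "__main__":` guard at the bottom of the file.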
Run it:
# Terminal 1
python server.py
# Terminal 2
python client.py --query "How do I avoid overfitting?"
python client.py --query "What is the latest news about Qwen 3.6?"
The second query should trigger Firecrawl because it’s not in the ML FAQ knowledge base.
Full server code
For reference, here is the complete server.py:
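A condensed sketch of what such a `server.py` could look like, assuming LitServe's `LitAPI` hooks (`setup`, `decode_request`, `predict`, `encode_response`) and CrewAI's `kickoff`. The `crew_factory` argument, the `build_api` and `main` helpers, and the `output` response key are hypothetical; building the agents and tools is left to the factory.

```python
def build_api(crew_factory):
    """Return a LitAPI instance whose setup() builds the crew once via crew_factory()."""
    import litserve as ls  # imported lazily so this module loads without litserve

    class AgenticRAGAPI(ls.LitAPI):
        def setup(self, device):
            # Runs once at startup: create the LLM, agents, tasks, and crew.
            self.crew = crew_factory()

        def decode_request(self, request):
            # Extract the user's question from the JSON body.
            return request["query"]

        def predict(self, query):
            # Researcher runs first, then Writer; {query} is filled from inputs.
            return self.crew.kickoff(inputs={"query": query})

        def encode_response(self, output):
            # CrewOutput exposes the final text as .raw; fall back to str().
            return {"output": getattr(output, "raw", str(output))}

    return AgenticRAGAPI()


def main(crew_factory, port: int = 8000) -> None:
    """Start the HTTP server with the request timeout disabled."""
    import litserve as ls

    server = ls.LitServer(build_api(crew_factory), timeout=False)
    server.run(port=port)
```

Calling `main(my_crew_factory)` starts the server, where `my_crew_factory` is any zero-argument function that returns a configured crew.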
Agentic RAG vs classic RAG
| Classic RAG | Agentic RAG (this tutorial) |
|-------------|----------------------------|
| Fixed: always retrieve → generate | Agent chooses tools dynamically |
| Single LLM call | Multi-agent pipeline |
| One data source | Vector DB + web fallback |
| Hard to extend | Add tools without rewriting the pipeline |
Troubleshooting
| Issue | Fix |
|-------|-----|
| `connection refused` on port 6333 | Start Qdrant with Docker |
| Ollama model not found | Run `ollama pull qwen3.6:27b` |
| Very slow responses | Normal on 16GB RAM; close other apps |
| Firecrawl errors | Check `FIRECRAWL_API_KEY` in `.env` |
| Empty vector results | Run `python setup_vectordb.py` first |
What’s next
- Replace the sample FAQ with your own documents in rag_code.py
- Add a Gradio UI in front of the LitServe API
- Swap Firecrawl for another search provider
- Deploy LitServe behind Docker or Lightning AI Cloud
Summary
You deployed a Qwen 3.6 agentic RAG system with fully local inference:
- Qwen 3.6 runs locally via Ollama
- CrewAI orchestrates Researcher + Writer agents
- Qdrant stores your knowledge base
- Firecrawl fills gaps with live web data
- LitServe exposes everything as a clean REST API
Done!