We Built a Production-Ready Auto-Reply Chatbot (FastAPI + OpenAI + Hybrid Retrieval)
Most "chatbot tutorials" stop at:
- A single app.py with ~50 lines of OpenAI calls
- No logging
- No retrieval
- No evaluation
- No production thinking
That's not how real systems work.
So we built a production-style auto-reply chatbot using:
- FastAPI
- OpenAI Chat Completions
- OpenAI Embeddings
- Hybrid retrieval (vector + keyword ready)
- Clean service architecture
- Separation of LLM / Retrieval / API layers
Full open-source repo: auto-reply-chatbot (FastAPI + OpenAI + Retrieval)
If you find it useful, consider starring the repo ⭐
What Problem This Solves
If you're building:
- Customer support auto-reply
- Ticket answering system
- Live chat AI
- Internal knowledge assistant
- RAG-based chatbot
You don't need another toy example.
You need:
- Structured backend
- Clear LLM gateway
- Retrieval service
- Embedding pipeline
- Production-ready folder layout
That's what this project demonstrates.
Architecture Overview
High-level flow:
API (FastAPI)
↓
AnswerService
↓
RetrievalService → Embeddings → Vector Search
↓
LLM Gateway → OpenAI Chat Completion
↓
Final Answer
This separation makes it:
- Testable
- Replaceable (swap LLM provider easily)
- Scalable
- Production-friendly
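The layering above can be sketched in a few classes. This is an illustrative outline, not the repo's actual code: the class names mirror the architecture diagram, and both the retrieval step and the LLM call are stubbed so the sketch stays self-contained.

```python
# Hypothetical sketch of the layered flow: API → AnswerService →
# RetrievalService → LLM Gateway. Stubs stand in for real embedding
# search and the real OpenAI call.
from dataclasses import dataclass


@dataclass
class RetrievalService:
    docs: list[str]

    def search(self, query: str, k: int = 3) -> list[str]:
        # Stub: the real implementation embeds the query and runs vector search.
        words = query.lower().split()
        return [d for d in self.docs if any(w in d.lower() for w in words)][:k]


@dataclass
class LLMGateway:
    model: str = "gpt-4o-mini"

    def chat(self, system: str, user: str) -> str:
        # Stub: the real implementation calls the provider behind one interface.
        return f"[{self.model}] answer grounded in: {system[:40]}..."


@dataclass
class AnswerService:
    retrieval: RetrievalService
    llm: LLMGateway

    def answer(self, question: str) -> str:
        # Orchestrate: fetch evidence, then ask the LLM with that context.
        evidence = self.retrieval.search(question)
        context = "\n".join(evidence)
        return self.llm.chat(system=f"Use this context:\n{context}", user=question)
```

Because `AnswerService` only depends on the interfaces of its two collaborators, each layer can be unit-tested or swapped independently.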
Project Structure
app/
├── api/
│   └── routes/
│       └── conversations.py
├── services/
│   ├── answer_service.py
│   ├── retrieval.py
│   ├── ingestion.py
│   └── llm_gateway.py
├── search/
│   └── embeddings.py
└── main.py
Why does this matter?
Most examples mix everything in one file.
This project separates:
- API layer
- Business logic
- Retrieval logic
- LLM provider abstraction
- Embedding layer
That's how real systems are built.
LLM Layer (Gateway Pattern)
Instead of calling OpenAI directly everywhere:
openai.chat.completions.create(...)
We wrap it in:
llm_gateway.chat(...)
Why?
Because:
- You may change models
- You may change providers
- You may add logging
- You may add retry policies
- You may measure token cost
This pattern prevents vendor lock-in chaos.
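A minimal version of that gateway might look like this. The retry policy, logging fields, and `_call_provider` helper are assumptions for illustration; the real `llm_gateway.chat(...)` in the repo may differ.

```python
# Illustrative gateway wrapper: one entry point for model choice,
# retries, and latency logging. `_call_provider` is a stand-in so the
# sketch runs without a live OpenAI client.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_gateway")


def _call_provider(model: str, messages: list[dict]) -> str:
    # Swap in the real call here, e.g.:
    # openai.chat.completions.create(model=model, messages=messages)
    return "stub response"


def chat(messages: list[dict], model: str = "gpt-4o-mini",
         max_retries: int = 3, backoff: float = 0.5) -> str:
    """Single choke point for every LLM call the app makes."""
    for attempt in range(1, max_retries + 1):
        try:
            start = time.perf_counter()
            reply = _call_provider(model, messages)
            log.info("model=%s latency=%.3fs", model, time.perf_counter() - start)
            return reply
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff * 2 ** (attempt - 1))  # exponential backoff
```

Changing providers, adding token-cost tracking, or tightening the retry policy now means editing one function instead of every call site.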
Retrieval + Embeddings
The system uses:
- text-embedding-3-small embeddings
- A vector search flow
- A document ingestion pipeline
Two flows exist:
| Flow | Description |
|---|---|
| Ingestion | Document → Chunk → Embed → Store |
| Retrieval | User Query → Embed → Vector Search → Evidence → LLM |
This creates a clean RAG-ready foundation.
Even if you're not using a full vector DB yet, the structure is ready for:
- pgvector
- Weaviate
- Pinecone
- Milvus
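The two flows from the table can be sketched in-memory before any vector DB is wired in. Everything here is illustrative: `fake_embed` is a deterministic stand-in for text-embedding-3-small (its similarities carry no semantic meaning), and a real deployment would delegate `retrieve` to pgvector, Weaviate, Pinecone, or Milvus.

```python
# In-memory sketch of Ingestion (Document → Chunk → Embed → Store) and
# Retrieval (Query → Embed → Vector Search → Evidence). fake_embed is a
# placeholder; swap in the real embeddings API for production.
import hashlib
import math


def fake_embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic pseudo-embedding for illustration only.
    digest = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255 for b in digest[:dim]]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


def ingest(doc: str, store: list, chunk_size: int = 50) -> None:
    # Document → Chunk → Embed → Store
    chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    store.extend((chunk, fake_embed(chunk)) for chunk in chunks)


def retrieve(query: str, store: list, k: int = 2) -> list[str]:
    # Query → Embed → Vector Search → Evidence
    qv = fake_embed(query)
    ranked = sorted(store, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Swapping the list-of-tuples `store` for a real vector index changes only `ingest` and `retrieve`, which is exactly the point of keeping the flows behind their own functions.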
Why This Repo Is Different
Most repos show:
| ❌ Typical repos | ✅ This repo |
|---|---|
| "Hello world" chatbot | Clear service boundaries |
| No architecture | Retrieval-first mindset |
| No layering | LLM abstraction |
| No production thinking | Ready for RAG |
| | FastAPI production pattern |
🛠 Use Cases
You can extend this into:
- SaaS auto-reply platform
- AI support desk
- AI ticket triage
- Enterprise RAG assistant
- Multi-tenant AI backend
It's a backend-first design — you can plug any frontend later.
🧪 What You Can Experiment With
- Swap GPT-4o → GPT-4o-mini
- Add hybrid retrieval (BM25 + vector)
- Add eval loop
- Add grounding verification
- Add cost tracking
- Add retry logic and latency control
This repo gives you the skeleton.
You build the muscle.
🚀 Why We Open-Sourced This
Because most AI tutorials skip the hard parts:
- Architecture
- Reliability
- Separation of concerns
- Scaling thinking
If you're serious about building AI systems — not just demos — this repo will help.
⭐ GitHub Repository
👉 https://github.com/OptyxStack/rag-knowledge-base-chatbot
If this project helps you:
- ⭐ Star the repo
- 🍴 Fork it
- 🛠 Contribute improvements
- 🔁 Share it
💡 Future Improvements Planned
- Hybrid retrieval implementation
- Evaluation pipeline
- Cost monitoring
- Latency optimization
- Tool-calling support
- Multi-tenant design