Daniel R. Foster for OptyxStack


We Built a Production-Ready Auto-Reply Chatbot (FastAPI + OpenAI + Hybrid Retrieval)


Most "chatbot tutorials" stop at:

  • app.py
  • 50 lines of OpenAI calls
  • No logging
  • No retrieval
  • No evaluation
  • No production thinking

That's not how real systems work.

So we built a production-style auto-reply chatbot using:

  • FastAPI
  • OpenAI Chat Completions
  • OpenAI Embeddings
  • Hybrid retrieval (vector + keyword ready)
  • Clean service architecture
  • Separation of LLM / Retrieval / API layers

Full open-source repo: auto-reply-chatbot (FastAPI + OpenAI + Retrieval)

If you find it useful, consider starring the repo ⭐


What Problem This Solves

If you're building:

  • Customer support auto-reply
  • Ticket answering system
  • Live chat AI
  • Internal knowledge assistant
  • RAG-based chatbot

You don't need another toy example.

You need:

  • Structured backend
  • Clear LLM gateway
  • Retrieval service
  • Embedding pipeline
  • Production-ready folder layout

That's what this project demonstrates.


Architecture Overview

High-level flow:

API (FastAPI)
   ↓
AnswerService
   ↓
RetrievalService → Embeddings → Vector Search
   ↓
LLM Gateway → OpenAI Chat Completion
   ↓
Final Answer

This separation makes it:

  • Testable
  • Replaceable (swap LLM provider easily)
  • Scalable
  • Production-friendly

Project Structure

app/
├── api/
│   └── routes/
│       └── conversations.py
├── services/
│   ├── answer_service.py
│   ├── retrieval.py
│   ├── ingestion.py
│   └── llm_gateway.py
├── search/
│   └── embeddings.py
└── main.py

Why does this matter?

Most examples mix everything in one file.

This project separates:

  • API layer
  • Business logic
  • Retrieval logic
  • LLM provider abstraction
  • Embedding layer

That's how real systems are built.


LLM Layer (Gateway Pattern)

Instead of calling OpenAI directly everywhere:

openai.chat.completions.create(...)

We wrap it in:

llm_gateway.chat(...)

Why?

Because:

  • You may change models
  • You may change providers
  • You may add logging
  • You may add retry policies
  • You may measure token cost

This pattern prevents vendor lock-in chaos.
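A minimal sketch of what such a gateway can look like. The class name, defaults, and backoff policy here are assumptions for illustration; the repo's `llm_gateway.py` may differ, but the idea is the same: one place for model choice, retries, and logging.

```python
import logging
import time
from typing import Any

logger = logging.getLogger("llm_gateway")


class LLMGateway:
    """Thin wrapper around a chat-completions client (illustrative sketch)."""

    def __init__(self, client: Any, model: str = "gpt-4o-mini", max_retries: int = 3):
        self.client = client          # e.g. openai.OpenAI() in the real service
        self.model = model
        self.max_retries = max_retries

    def chat(self, messages: list) -> str:
        for attempt in range(1, self.max_retries + 1):
            try:
                resp = self.client.chat.completions.create(
                    model=self.model, messages=messages
                )
                return resp.choices[0].message.content
            except Exception as exc:  # real code would catch specific API errors
                logger.warning("LLM call attempt %d failed: %s", attempt, exc)
                time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
        raise RuntimeError("LLM call failed after retries")
```

Because callers only see `gateway.chat(...)`, switching models or providers means editing one file, and token/cost logging has a single chokepoint.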


Retrieval + Embeddings

The system uses:

  • text-embedding-3-small
  • Vector search flow
  • Document ingestion pipeline

Two flows exist:

Flow        Description
Ingestion   Document → Chunk → Embed → Store
Retrieval   User Query → Embed → Vector Search → Evidence → LLM

This creates a clean RAG-ready foundation.

Even if you're not using a full vector DB yet, the structure is ready for:

  • pgvector
  • Weaviate
  • Pinecone
  • Milvus
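The two flows fit in a few functions. This sketch swaps the real OpenAI embeddings call for a toy character-frequency embedding so it runs offline; everything else (chunk → embed → store, then query → embed → search → evidence) mirrors the pipeline above. All function names here are hypothetical.

```python
import math


def embed(text: str) -> list:
    # Stand-in for the OpenAI embeddings call, e.g.
    #   client.embeddings.create(model="text-embedding-3-small", input=text)
    # Here: a trivial 26-dim letter-frequency vector so the sketch runs offline.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


# Ingestion flow: document -> chunk -> embed -> store
store = []  # list of (chunk_text, embedding) pairs; a vector DB in production


def ingest(doc: str, chunk_size: int = 200) -> None:
    for i in range(0, len(doc), chunk_size):
        chunk = doc[i : i + chunk_size]
        store.append((chunk, embed(chunk)))


# Retrieval flow: query -> embed -> vector search -> evidence
def retrieve(query: str, k: int = 2) -> list:
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Swapping in pgvector or Pinecone later only changes `store`, `ingest`, and the search inside `retrieve`; the rest of the service never notices.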

Why This Repo Is Different

Most repos stop at the left column; this project aims for the right one:

Most repos show           This repo shows
"Hello world" chatbot     Clear service boundaries
No architecture           Retrieval-first mindset
No layering               LLM abstraction
No production thinking    Ready for RAG
                          FastAPI production pattern

🛠 Use Cases

You can extend this into:

  • SaaS auto-reply platform
  • AI support desk
  • AI ticket triage
  • Enterprise RAG assistant
  • Multi-tenant AI backend

It's a backend-first design — you can plug any frontend later.


🧪 What You Can Experiment With

  • Swap GPT-4o → GPT-4o-mini
  • Add hybrid retrieval (BM25 + vector)
  • Add eval loop
  • Add grounding verification
  • Add cost tracking
  • Add retry logic and latency control

This repo gives you the skeleton.

You build the muscle.
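For the hybrid retrieval experiment from the list above, one common way to merge BM25 and vector rankings is reciprocal rank fusion (RRF). This is a generic technique, not code from the repo; the function below is a self-contained sketch.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge ranked lists of doc IDs (e.g. one from BM25, one from vector search).

    Each document scores sum(1 / (k + rank)) over every list it appears in;
    k=60 is the commonly used constant. Documents ranked well by both
    retrievers float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: BM25 prefers d1, vector search prefers d2; fusion rewards d2,
# which is ranked highly by both.
bm25_ranked = ["d1", "d2", "d3"]
vector_ranked = ["d2", "d3", "d1"]
fused = reciprocal_rank_fusion([bm25_ranked, vector_ranked])
```

RRF needs no score normalization between the two retrievers, which is why it is a popular first step toward hybrid retrieval.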


🚀 Why I Open-Sourced This

Because most AI tutorials skip the hard parts:

  • Architecture
  • Reliability
  • Separation of concerns
  • Scaling thinking

If you're serious about building AI systems — not just demos — this repo will help.


⭐ GitHub Repository

👉 https://github.com/OptyxStack/rag-knowledge-base-chatbot

If this project helps you:

  • ⭐ Star the repo
  • 🍴 Fork it
  • 🛠 Contribute improvements
  • 🔁 Share it

💡 Future Improvements Planned

  • Hybrid retrieval implementation
  • Evaluation pipeline
  • Cost monitoring
  • Latency optimization
  • Tool-calling support
  • Multi-tenant design
