DEV Community

SMITHA YENUGU


AI Chatbot with RAG: RGUKT ChatBot Journey

How I built a production AI chatbot that answers questions from university documents without hallucinating


The Problem

Imagine you're a student at RGUKT (my university), and you have a question about:

  • Eligibility criteria for B.Tech programs
  • Scholarship details
  • Admission deadlines
  • Campus facilities

Where do you go?

  • Google the RGUKT website (slow, outdated)
  • Ask in a WhatsApp group (inconsistent answers)
  • Email the office (wait 3 days for a reply)
  • Read 50-page PDF handbooks (pain)

There had to be a better way.

I decided to build an AI chatbot that could answer these questions instantly, accurately, and 24/7 — without making stuff up (hallucinating).

Enter: Retrieval-Augmented Generation (RAG).


What's RAG and Why Not Just Use ChatGPT?

If you just asked ChatGPT "What's the RGUKT B.Tech eligibility criteria?", here's what happens:

User: "What's RGUKT's B.Tech eligibility?"
ChatGPT: "Typically, B.Tech programs require 10+2 with PCM, 
           and a score of at least 75%..."

The problem: This is generic knowledge. ChatGPT doesn't actually know RGUKT's specific criteria because:

  1. Its training data stops at a cutoff date (RGUKT might have updated eligibility last month)
  2. It doesn't have access to RGUKT's internal documents
  3. When it doesn't know, it makes something up (hallucination)

RAG solves this by:

  1. Retrieve relevant documents from a knowledge base
  2. Augment the LLM prompt with those documents
  3. Generate an answer grounded in real facts

User: "What's RGUKT's B.Tech eligibility?"
     ↓
Vector Search (find relevant docs) → returns RGUKT's official PDF
     ↓
Augment Prompt: "Here's info from RGUKT's official document: [PDF excerpt]
                  Answer based ONLY on this information."
     ↓
ChatGPT: "According to RGUKT's official B.Tech handbook,
          eligibility requires..."

This way, the chatbot uses real data, not generic knowledge.


The Architecture

My chatbot has three layers:

Layer 1: Knowledge Base (Vector Database)

RGUKT Official PDFs
├── Academic Regulations
├── Admission Guidelines
├── Scholarship Info
├── Campus Facilities
└── Fee Structure
     ↓
Chunk into small pieces (e.g., 256 tokens each)
     ↓
Convert each chunk to an embedding (numerical vector)
     ↓
Store in Chroma (vector database)

Why chunks? A 50-page PDF is too long to fit in the LLM prompt. I break it into smaller pieces (paragraphs/sections), index them all, and retrieve only the most relevant pieces.

Why embeddings? An embedding is a numerical representation of text meaning. Similar texts have similar embeddings. So when a user asks a question, I:

  1. Convert the question to an embedding
  2. Find chunks with similar embeddings (cosine similarity)
  3. Retrieve the top 5 most relevant chunks

This is semantic search — it understands meaning, not just keyword matching.
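
The three retrieval steps above can be sketched in plain Python. This is a toy illustration, not the chatbot's actual code: in the real pipeline Chroma does the search, and the 3-dimensional vectors below are hand-written stand-ins for real embeddings. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as "not similar"
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=5):
    """Return the k chunk texts whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy index of (chunk text, made-up embedding) pairs.
index = [
    ("Scholarship eligibility rules", [0.9, 0.1, 0.0]),
    ("Hostel and mess facilities", [0.0, 0.2, 0.9]),
    ("Scholarship application deadline", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend this is the embedding of a scholarship question
results = top_k(query, index, k=2)
```

Both scholarship chunks point in nearly the same direction as the query vector, so they outrank the facilities chunk even though no keywords are compared.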

Layer 2: Retrieval & Augmentation (Backend)

# User asks a question
question = "What's the scholarship amount?"

# Step 1: Search vector database
relevant_chunks = vector_store.search(question, top_k=5)
# Returns: [
#   "RGUKT offers merit-based scholarships up to ₹50,000 per semester...",
#   "Eligibility for scholarships: GPA >= 8.0, attendance >= 85%...",
#   "Application deadline: March 15th..."
# ]

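# Step 2: Build the prompt
# Join the chunks into plain text; interpolating the list directly
# would paste Python's list syntax (brackets, quotes) into the prompt.
context = "\n\n".join(relevant_chunks)
prompt = f"""You are a helpful assistant for RGUKT students.
Answer the question ONLY based on the provided information.
If you don't know, say "I don't have this information."

Information from RGUKT documents:
{context}

Question: {question}

Answer:"""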

# Step 3: Call LLM
response = gemini.generate(prompt)
# Returns: "RGUKT offers merit-based scholarships up to ₹50,000 per semester.
#           To be eligible, you need a GPA of at least 8.0 and 85% attendance..."

The key insight: The LLM is far less likely to make things up because the prompt restricts it to the retrieved documents and tells it to admit when the answer isn't there.

Layer 3: UI (Frontend)

A ChatGPT-like interface where users can:

  • Type questions
  • See the answer formatted nicely
  • Toggle dark/light mode
  • See quick-question cards for common queries

Technical Stack

Frontend

  • React + Vite (faster than Create React App)
  • Tailwind CSS for styling
  • Deployed on Render (free static hosting)

Backend

  • FastAPI (Python, async for speed)
  • LangChain (orchestrates the RAG pipeline)
  • Chroma (vector database, runs locally)
  • sentence-transformers (generates embeddings)
  • Gemini 2.5 Flash (primary LLM)
  • Groq's gpt-oss-20b (fallback LLM for resilience)
  • BeautifulSoup (scrapes live RGUKT website)
  • Deployed on Hugging Face Spaces (free Docker hosting with 16GB RAM)

Why Two Deployment Platforms?

I initially deployed everything on Render's free tier. But then something went wrong:

2024-03-15 12:34:56 - OUT OF MEMORY - Process exited with code 137

Why? Loading the sentence-transformer model (~400MB) + Chroma vector store (~130MB) + LangChain overhead needs more than Render's 512MB free tier. I needed at least 2GB.

Solution: Move the backend to Hugging Face Spaces (16GB free RAM) and keep the lightweight React frontend on Render.

Cost: $0 for both. Problem solved. ✅


Challenges & Solutions

🚨 Challenge #1: Chunking Strategy

The Problem:
I split PDFs into fixed-size chunks (256 tokens each). But this caused a disaster:

Original text:
"...The B.Tech program requires completion of 160 credit hours.
Eligibility: 10+2 with PCM. Admission is merit-based..."

After naive chunking:
Chunk 1: "...completion of 160 credit hours."
Chunk 2: "Eligibility: 10+2 with PCM. Admission is..."

When user asks "What's the eligibility?":
→ Retrieves Chunk 2
→ Missing context about which program!

The Solution:
I used a sliding window with overlap:

chunk_size = 256
overlap = 50  # 50 tokens overlap between chunks

# Chunk 1: tokens 0-256
# Chunk 2: tokens 206-462 (overlaps with Chunk 1)
# Chunk 3: tokens 412-668 (overlaps with Chunk 2)

This way, important context doesn't get lost at chunk boundaries.
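
Here is a minimal sketch of that sliding window, assuming the text has already been turned into a list of tokens. The function name `chunk_tokens` is hypothetical, and the placeholder strings stand in for whatever a real tokenizer would produce.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # 206 with the default values
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens t0..t599 stand in for a real tokenized document.
tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
starts = [chunk[0] for chunk in chunks]
```

With 600 tokens this yields three chunks starting at tokens 0, 206, and 412, matching the offsets listed above; the last chunk is simply shorter.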

🚨 Challenge #2: LLM Rate Limiting

The Problem:
Google Gemini has rate limits (free tier: 60 requests/minute). During testing, I hit the limit:

429 Too Many Requests - You have exceeded your rate limit

One failed request and the whole chatbot breaks for that user.

The Solution:
Implement automatic fallback:

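try:
    response = gemini_api.generate(prompt)
except RateLimitError:
    print("Gemini rate limited, falling back to Groq...")
    try:
        # Errors raised inside an except block are not caught by sibling
        # except clauses, so the fallback call needs its own handler.
        response = groq_api.generate(prompt)
    except Exception:
        response = "Sorry, I'm having trouble. Try again in a moment."
except Exception:
    response = "Sorry, I'm having trouble. Try again in a moment."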

Now if Gemini fails, it automatically uses Groq's model instead. User experience: seamless.

This taught me: always have a fallback for external APIs.
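
The same idea generalizes to any number of providers. The sketch below is one possible shape, not the chatbot's actual code: `generate_with_fallback` is a hypothetical helper, and `fake_gemini` and `fake_groq` are stand-ins for real SDK calls.

```python
APOLOGY = "Sorry, I'm having trouble. Try again in a moment."

def generate_with_fallback(prompt, providers, apology=APOLOGY):
    """Try each (name, generate_fn) pair in order; return the first success."""
    for name, generate in providers:
        try:
            return name, generate(prompt)
        except Exception as exc:  # real code should catch the SDK's specific errors
            print(f"{name} failed ({exc!r}), trying next provider...")
    return None, apology  # every provider failed

# Stand-in providers: the first simulates a 429, the second succeeds.
def fake_gemini(prompt):
    raise RuntimeError("429 Too Many Requests")

def fake_groq(prompt):
    return f"(groq) answer to: {prompt}"

used, answer = generate_with_fallback(
    "What's the scholarship amount?",
    [("gemini", fake_gemini), ("groq", fake_groq)],
)
```

Returning the provider name alongside the answer makes it easy to log how often the fallback is actually used.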

🚨 Challenge #3: Stale Information

The Problem:
I built the vector database once and deployed it. But RGUKT updates its website constantly. Students would ask about deadlines from 2024, but my knowledge base had 2023 info.

The Solution:
I added a live web scraper that runs at query time for time-sensitive questions:

# For questions about admissions/deadlines/dates,
# scrape the RGUKT website in real-time
relevant_urls = find_urls_for_query(question)
scraped_content = ""
for url in relevant_urls:
    scraped_content += scrape_url(url) + "\n"

# Combine with vector search results (a list of chunk strings)
final_context = "\n".join(vector_search_results) + "\n" + scraped_content

Now the chatbot has:

  • Static context from PDFs (policies, regulations — don't change often)
  • Dynamic context from live website (deadlines, events — change frequently)

Best of both worlds.
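
One way to decide when the dynamic path is worth its extra latency is a simple keyword check. This is a rough sketch under stated assumptions: the keyword list, `needs_live_data`, and `build_context` are all hypothetical, and the scraper is passed in as a function so the example runs without network access.

```python
# Assumed list of words that hint at frequently changing information.
TIME_SENSITIVE_KEYWORDS = ("deadline", "date", "admission", "schedule", "event")

def needs_live_data(question):
    """Rough heuristic: does the question mention something that changes often?"""
    lowered = question.lower()
    return any(keyword in lowered for keyword in TIME_SENSITIVE_KEYWORDS)

def build_context(question, pdf_chunks, scrape):
    """Combine static PDF chunks with scraped text, scraping only when needed."""
    parts = list(pdf_chunks)
    if needs_live_data(question):
        parts.append(scrape(question))
    return "\n\n".join(parts)

def fake_scrape(question):
    """Stand-in for the live scraper; returns placeholder text."""
    return "Website: placeholder text scraped from the admissions page."

static_only = build_context("What is the attendance policy?", ["PDF excerpt A"], fake_scrape)
with_live = build_context("When is the admission deadline?", ["PDF excerpt B"], fake_scrape)
```

Substring matching is crude (for example, "date" also matches "update"), so a real router would likely need something more careful.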


How It Actually Works (Technical Deep Dive)

When you ask "What's the scholarship amount?", here's the journey:

1. Frontend sends POST to /api/chat
   {
     "text": "What's the scholarship amount?",
     "session_id": "12345",
     "chat_history": []
   }

2. Backend receives request → FastAPI router

3. RAG Pipeline:
   a) Convert question to embedding using sentence-transformers
   b) Search Chroma for top 5 similar chunks
      → Returns RGUKT PDF excerpts about scholarships
   c) Scrape RGUKT website for current scholarship info
   d) Build final prompt with all context

4. Prompt looks like:
   "You are an RGUKT assistant...
    Here's information from our documents:
    [PDF: Scholarships can be up to ₹50,000...]
    [Website: Spring 2024 deadline: March 15...]

    User question: What's the scholarship amount?

    Answer based ONLY on this information:"

5. Call Gemini API → get response

6. Format response as HTML with styling

7. Return to frontend:
   {
     "response": "<div>RGUKT offers merit-based scholarships..."
   }

8. Frontend displays in chat bubble

The entire process takes 1-3 seconds (mostly LLM latency, not our code).
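
Steps 3 to 7 can be sketched as a single function with the retriever, scraper, and LLM injected as plain callables. This is an illustration, not the deployed handler: `handle_chat` and its parameters are hypothetical, and escaping the model output before wrapping it in HTML is an addition of this sketch.

```python
import html

def handle_chat(payload, retrieve, scrape, generate):
    """Run steps 3-7 for one request; the three callables are injected stand-ins."""
    question = payload["text"]
    chunks = retrieve(question, top_k=5)  # 3a-3b: embed the question, search the index
    live = scrape(question)               # 3c: current website content
    context = "\n".join([f"[PDF: {c}]" for c in chunks] + [f"[Website: {live}]"])
    prompt = (
        "You are an RGUKT assistant.\n"
        f"Here's information from our documents:\n{context}\n\n"
        f"User question: {question}\n\n"
        "Answer based ONLY on this information:"
    )
    answer = generate(prompt)             # 5: LLM call
    # 6: escape the model output so stray angle brackets cannot break the page
    return {"response": f"<div>{html.escape(answer)}</div>"}

# Usage with placeholder callables instead of Chroma, the scraper, and Gemini.
result = handle_chat(
    {"text": "What's the scholarship amount?", "session_id": "12345", "chat_history": []},
    retrieve=lambda q, top_k: ["placeholder scholarship excerpt"],
    scrape=lambda q: "placeholder website text",
    generate=lambda prompt: "See the <official> notice.",
)
```

Keeping the dependencies injectable means the whole flow can be tested without a vector store, network access, or an API key.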


Lessons Learned

1. RAG is Not Magic (But It's Damn Effective)

Before RAG, I tried:

  • Fine-tuning models (expensive, slow, overkill)
  • Prompt engineering alone (hallucination city)
  • Simple keyword search (no semantic understanding)

RAG beats all of these for knowledge-grounded chatbots because it:

  • Keeps costs low (no fine-tuning)
  • Prevents hallucinations (grounds in documents)
  • Handles semantic understanding (embeddings)
  • Scales easily (just add more documents)

2. You Need Multiple LLMs

Depending on one LLM is risky. I use:

  • Gemini 2.5 Flash (primary — fast, accurate)
  • Groq gpt-oss-20b (fallback, open-weight model with its own separate rate limits)
  • Claude (for testing — different perspective)

If the primary fails, the fallback takes over. This is production-grade thinking.

3. Performance Matters

The first version took 8 seconds to answer a question. Too slow. Users left.

I optimized:

  • Switched from heavy models to lightweight all-MiniLM-L6-v2 for embeddings
  • Used async/await in FastAPI to handle concurrent requests
  • Cached embeddings so repeated questions are instant
  • Used Groq's API instead of OpenAI (faster)

Result: Answers now in 1-3 seconds. Much better.
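
The embedding cache can be as simple as `functools.lru_cache` from the standard library. The sketch below assumes that approach; `slow_embed` is a stand-in for the real model call, and `normalize` and `embed_question` are hypothetical helper names.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the "model" is actually invoked

def slow_embed(text):
    """Stand-in for a real embedding model call."""
    calls["count"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])

def normalize(question):
    """Collapse case and whitespace so near-identical phrasings share a cache entry."""
    return " ".join(question.lower().split())

@lru_cache(maxsize=1024)
def _embed_normalized(text):
    # Returns a tuple so the cached value cannot be mutated by callers.
    return slow_embed(text)

def embed_question(question):
    return _embed_normalized(normalize(question))

first = embed_question("What's the scholarship amount?")
second = embed_question("  what's the SCHOLARSHIP amount? ")
```

The second call never reaches the model because both questions normalize to the same string. This only helps with exact repeats after normalization, not with paraphrases.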

4. Context Length is a Hard Constraint

LLMs have input limits. Gemini 2.5 Flash accepts about 1M input tokens, but I can't use all of them:

  • Some for the LLM's "thinking"
  • Some for user chat history
  • Some for my prompt instructions
  • Remaining for retrieved context

I had to limit context to 3000 characters to stay under the limit. Early on, I didn't do this and got truncated responses. Now it's:

MAX_CONTEXT = 3000  # characters, not tokens
context = "\n".join(chunks)[:MAX_CONTEXT]
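
A plain character slice can cut the last chunk off mid-sentence. A possible refinement, sketched here as an assumption rather than what the chatbot does, is to keep only whole chunks until the budget runs out. The name `fit_chunks` is hypothetical.

```python
def fit_chunks(chunks, max_chars=3000, separator="\n"):
    """Keep whole chunks, in ranked order, until the character budget is used up."""
    kept, used = [], 0
    for chunk in chunks:
        # Every chunk after the first also costs one separator.
        cost = len(chunk) + (len(separator) if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(chunk)
        used += cost
    return separator.join(kept)

# Three 1200-character chunks: only the first two fit in 3000 characters.
chunks = ["a" * 1200, "b" * 1200, "c" * 1200]
context = fit_chunks(chunks)
```

Because retrieval returns chunks in relevance order, dropping from the end discards the least relevant material first. If the very first chunk is larger than the budget, this returns an empty string, so a real version may want to fall back to slicing in that case.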

5. User Feedback Loops Are Everything

I deployed the chatbot, and students started using it. Within a day, I had feedback:

  • "It answers admissions questions perfectly but fails on campus facilities"
  • "I asked about scholarships and it gave me generic answers"

This told me:

  • My vector search was missing facility-related documents (added them)
  • Scholarship scraper wasn't working (debugged live scraper)
  • Some questions needed specialized handling (built FAQ fallback)

Lesson: Ship early, iterate based on real usage.


Deployment Checklist

Deploying an AI app is different from regular web apps:

  • ✅ Git LFS configured for large files (Chroma database)
  • ✅ API keys as secrets (never hardcoded)
  • ✅ CORS configured for frontend domain
  • ✅ Rate limiting on backend
  • ✅ Error handling for LLM failures
  • ✅ Monitoring (response time, error rate)
  • ✅ Logging (for debugging user issues)
  • ✅ Load testing (what if 1000 users ask simultaneously?)
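
For the rate-limiting item, here is a minimal in-memory sketch using only the standard library. It assumes a single backend process; `SlidingWindowLimiter` is a hypothetical class, and the limits shown are made-up values for illustration.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id):
        now = self.clock()
        recent = self.hits[client_id]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()  # forget requests that fell out of the window
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True

# A fake clock makes the behaviour easy to check without sleeping.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60, clock=lambda: fake_now[0])
decisions = [limiter.allow("session-12345") for _ in range(3)]
fake_now[0] = 61.0
after_window = limiter.allow("session-12345")
```

The third request inside the window is rejected, and the client is allowed again once the window has passed. State lives in process memory, so it resets on restart and is not shared across multiple workers.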

Results

RGUKT ChatBot is live at https://rgukt-bot-1.onrender.com.

Statistics (since launch):

  • 500+ conversations with students
  • 95% of questions answered accurately (based on student feedback)
  • Handles 20+ concurrent users without crashing
  • $0 hosting cost (free tier Render + Hugging Face + Google API credits)

Students can now get answers about:

  • Admissions eligibility
  • Scholarship details
  • Attendance policies
  • Placement statistics
  • Campus facilities
  • Exam schedules

All instantly, 24/7, without hallucinations.


What I'd Do Differently Next Time

  1. Start with a managed vector store (Pinecone, Weaviate) instead of running Chroma locally, which should be more reliable in production

  2. Implement proper logging from day one — I was debugging blind for the first month

  3. Use structured output from LLMs (JSON schema) — easier to format on frontend

  4. Build a feedback loop where users can say "this answer was wrong" → retrains the system

  5. Add human escalation — for questions the bot can't answer, route to a human


Key Takeaways for LLM Developers

  1. RAG > Fine-tuning > Prompting, for knowledge-grounded tasks. Use RAG first.

  2. Embeddings are underrated. Most of the magic in RAG comes from good embeddings, not the LLM.

  3. Always have a fallback LLM. Single points of failure kill production systems.

  4. Context size matters. Spend time optimizing what context you pass to the LLM.

  5. Ship something imperfect. Real user feedback is worth 100x more than perfect planning.


Resources

If you want to build RAG chatbots:


Have you built a RAG system? What was your biggest challenge? Drop a comment!

Happy building 🚀


RGUKT ChatBot source code: https://github.com/smithayenugu/Rgukt-bot

Live chatbot: https://rgukt-bot-1.onrender.com
