Soham Sharma
From Hallucinations to Grounded AI: Building a Gemini RAG System with Qdrant

Built with Google Gemini: Writing Challenge

This is a submission for the Built with Google Gemini: Writing Challenge

What I Built with Google Gemini

Large Language Models are powerful, but they still struggle with one major issue — hallucinations.

While building AI assistants, I often found that models could generate answers that sounded convincing but were not actually grounded in real data. That led me to explore Retrieval-Augmented Generation (RAG) and build a system that allows Gemini to answer questions using real documents instead of guesses.


The Problem

Large Language Models are incredibly powerful, but they have a well-known limitation: they can generate answers that sound convincing but are actually wrong. This behavior is called AI hallucination, where a model produces fluent text that is not grounded in real facts or evidence.

These hallucinations don’t happen randomly — they usually occur because of structural limitations in how LLMs work.

Some common causes include:

  • Limited context window

    LLMs can only “remember” a fixed number of tokens in a conversation. When the context becomes too long, earlier information may drop out of the window, causing the model to lose important instructions or details.

  • Long or complex documents

    When documents are very large, the model may struggle to reason over the entire content and can miss dependencies between different parts of the text.

  • Outdated training knowledge

    LLMs rely on training data collected at a specific point in time. If new information appears after that, the model may provide answers based on stale or incomplete knowledge.

  • Probabilistic text generation

    Language models generate responses by predicting the most likely next token rather than verifying facts, which can lead to confident but incorrect outputs.

Illustration showing AI hallucination and developer frustration


Why Retrieval-Augmented Generation (RAG)

For applications like document search, knowledge assistants, or research tools, these limitations become a serious problem. Users need answers that are grounded in real documents, not guesses.

This challenge led me to explore Retrieval-Augmented Generation (RAG) — a technique that helps language models answer questions using real data instead of relying only on what they remember.

Instead of relying only on the model’s training data, a RAG system first retrieves relevant information from external documents and then uses that information as context when generating an answer.

Before vs After illustration showing LLM vs RAG system

The idea is simple: rather than asking the model to rely purely on memory, we give it access to the correct information at the moment it generates the response.

By grounding responses in retrieved documents, RAG systems help:

  • reduce hallucinations
  • provide answers based on real data
  • work with private or domain-specific knowledge
  • keep information up-to-date without retraining the model
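The grounding idea can be illustrated with a small sketch: before calling the model, the retrieved chunks are stitched into the prompt so the model answers from them instead of from memory. The function name and prompt wording below are illustrative, not the exact code from this project:

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that tells the model to answer from the
    retrieved context rather than from its training memory."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "Which vector database does the system use?",
    ["Embeddings are stored in Qdrant.", "FastAPI serves the API routes."],
)
```

Because the instruction explicitly allows "I don't know", the model has an escape hatch instead of being pushed to invent an answer.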

Visual explanation of the RAG pipeline


System Architecture

The system is built as a Retrieval-Augmented Generation (RAG) pipeline using FastAPI, Google Gemini (for both LLMs and embeddings), and Qdrant (as the vector database).

Users upload documents, which are processed into embeddings and stored in a vector database. When a query is made, relevant document chunks are retrieved and used as context for Gemini to generate a grounded response.


Core Components

| Layer | Technology | Role |
| --- | --- | --- |
| Frontend | HTML, CSS, JavaScript | User interface for uploading PDFs and interacting with the chatbot |
| API Server | FastAPI | Handles routes, request processing, and serves the frontend |
| Embedding Model | Gemini `gemini-embedding-001` | Converts queries and documents into vector embeddings |
| Vector Database | Qdrant | Stores embeddings and retrieves similar document chunks |
| LLM | Gemini Flash models | Generates answers based on retrieved context |

Backend Modules

| Module | File | Responsibility |
| --- | --- | --- |
| API Layer | `main.py` | Defines endpoints (`/upload`, `/search`, `/chat`) and initializes services |
| Document Ingestion | `ingest.py` | Extracts PDF text, cleans it, and splits it into overlapping chunks |
| Embeddings | `embeddings.py` | Generates embeddings for queries and document chunks |
| Vector DB Utility | `qdrant_utils.py` | Handles Qdrant connection and collection initialization |
| Search Pipeline | `search.py` | Performs retrieval and generates answers using Gemini |
| Chat Pipeline | `chat.py` | Implements conversational RAG with memory and streaming responses |
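The chunking step in `ingest.py` can be sketched as a sliding window with overlap, so text cut at a chunk boundary also appears at the start of the next chunk. The sizes and function name below are illustrative defaults, not the project's actual values:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so content near a boundary
    also appears at the start of the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, overlap=100)
```

The overlap trades a little storage for better retrieval: a sentence split across two chunks is still retrievable in full from one of them.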

Data Flow

| Step | Process |
| --- | --- |
| 1 | User uploads a PDF document |
| 2 | Text is extracted and split into chunks |
| 3 | Chunks are converted into embeddings using Gemini |
| 4 | Embeddings and text are stored in Qdrant |
| 5 | User submits a query |
| 6 | Query is embedded and matched against stored vectors |
| 7 | Relevant document chunks are retrieved |
| 8 | Gemini generates an answer using the retrieved context |
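Steps 5-8 hinge on vector similarity. The real system uses Gemini embeddings and Qdrant, but the matching idea can be shown with a toy cosine-similarity search over hand-made 3-dimensional vectors; everything below is illustrative stdlib code, not the project's pipeline:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend these are embeddings of stored document chunks.
index = {
    "Qdrant stores the vectors.": [0.9, 0.1, 0.0],
    "FastAPI serves the routes.": [0.1, 0.9, 0.0],
    "Gemini generates the answer.": [0.0, 0.2, 0.9],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda c: cosine_similarity(query_vec, index[c]), reverse=True)
    return ranked[:k]

# A query vector pointing roughly the same way as the Qdrant chunk.
hits = top_k([0.85, 0.15, 0.0], k=2)
```

Qdrant does exactly this at scale, over thousands of 768-dimensional vectors, with approximate-nearest-neighbour indexing instead of a full scan.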

Tech Arsenal

| Component | Technology |
| --- | --- |
| Backend | FastAPI |
| LLM | Google Gemini Flash |
| Embeddings | Gemini `gemini-embedding-001` |
| Vector Database | Qdrant |
| Framework | LangChain |
| Frontend | HTML, CSS, JavaScript |

System Workflow

1) Overall RAG System Workflow

Overall RAG system workflow diagram

2) Document Ingestion Pipeline

Document ingestion pipeline diagram

3) Query Retrieval Pipeline

Query retrieval pipeline diagram

4) Conversational Chat Flow

Conversational chat flow diagram


Demo

To make the system easier to explore, I built a simple web interface where users can upload documents and interact with the RAG system in real time.

Interaction

Note: the live interaction may be slow or occasionally unresponsive because the app is deployed on Render.

Video

GitHub Repo

Gemini-Qdrant RAG Backend

This is a production-ready RAG (Retrieval-Augmented Generation) backend built with FastAPI. It uses Google Gemini Embeddings for high-quality semantic search and Qdrant as the vector database.

The pipeline is non-blocking, designed for concurrent requests, and optimized for indexing documents.

⚙️ Setup and Installation

```shell
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
fastapi dev backend/main.py
```

Create a .env file in the root of your project directory and set your access keys and database URL.

```shell
# Gemini API Key (required for embedding and generation)
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"

# Qdrant Vector Database Connection
QDRANT_URL="http://localhost:6333"
QDRANT_API_KEY=""  # use only if your Qdrant instance requires it
QDRANT_COLLECTION="rag_documents_768"
```
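These variables can then be read once at startup. A minimal sketch using `os.getenv`, with defaults matching the example values; the real project may load its settings differently:

```python
import os

def load_settings() -> dict:
    """Read the RAG backend configuration from the environment,
    falling back to local-development defaults where safe."""
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required")
    return {
        "gemini_api_key": api_key,
        "qdrant_url": os.getenv("QDRANT_URL", "http://localhost:6333"),
        "qdrant_api_key": os.getenv("QDRANT_API_KEY", ""),
        "qdrant_collection": os.getenv("QDRANT_COLLECTION", "rag_documents_768"),
    }
```

Failing fast on a missing API key surfaces misconfiguration at boot rather than on the first request.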

Future endeavours:

- [ ] Implementing GraphRAG, PageIndex, and other RAG variants
- [ ] Deploying a full-fledged app





What I Learned

  • This project helped me understand how RAG actually works in practice, not just in theory. Building the pipeline made me see how document ingestion, chunking, embeddings, vector search, and LLM generation all connect together. 🧩🤖

  • I also learned how embeddings convert text into vectors and why they are essential for semantic search. 🔢🔍

  • Working with vector databases like Qdrant helped me understand how similarity search and top-k retrieval power document-based AI systems. 🗂️⚡

  • And honestly, working with documentation can sometimes be a headache 🥲. A lot of the learning came from experimenting, debugging, and figuring things out along the way. 🛠️😇


Future Endeavours ⛰️

  1. Adding an authentication and user management system to manage chats and create separate databases for each user.

  2. Adding support for different database services like MongoDB, both local and cloud-based.

  3. Using local LLMs to minimize inference costs.

  4. Experimenting with advanced RAG architectures such as PageIndex.

  5. Deploying the system properly on cloud platforms like AWS, Azure, or GCP.


Google Gemini Feedback

LLM API — Smooth Experience 👍👍

  • Calling the Gemini generative models was straightforward.

  • The google-genai SDK has a clean interface. Methods like:

```python
client.models.generate_content()
client.models.embed_content()
```

feel intuitive, and getting a working prototype running was quick.

  • The models themselves performed well for RAG use cases. Once retrieved document chunks were provided as context, Gemini produced grounded responses and streaming worked reliably in conversational flows.
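For context, here is roughly how that grounded generation call looks; `client.models.generate_content` is the SDK method mentioned above, while the helper, prompt wording, and model name are my assumptions rather than this project's exact code:

```python
def format_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks into the context section of the prompt."""
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}"

def ask_gemini(question: str, chunks: list[str], api_key: str) -> str:
    """Generate a grounded answer with a Gemini Flash model.

    Needs `pip install google-genai`; the model name is an assumption.
    """
    from google import genai  # imported lazily so the sketch loads without the SDK

    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=format_grounded_prompt(question, chunks),
    )
    return response.text
```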

Embeddings API — Documentation Could Be Clearer

The embeddings API worked well overall, but the documentation could be easier to navigate.

While integrating gemini-embedding-001, some configuration details were not immediately obvious or well explained, especially parameters like:

```python
task_type = "RETRIEVAL_QUERY"
task_type = "RETRIEVAL_DOCUMENT"
output_dimensionality
batch_size_limits
```

Because of that, I ended up relying on trial and error (and occasionally AI coding tools) to determine the correct request structure.
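For reference, this is the request shape I eventually landed on, sketched against the current `google-genai` SDK, where `types.EmbedContentConfig` carries `task_type` and `output_dimensionality` (the wrapper function and the 768 value are examples, not prescriptions):

```python
def embed_for_retrieval(texts: list[str], api_key: str, as_query: bool = False) -> list[list[float]]:
    """Embed texts with gemini-embedding-001, choosing the task_type that
    matches their role: RETRIEVAL_QUERY for questions, RETRIEVAL_DOCUMENT
    for stored chunks. Needs `pip install google-genai`.
    """
    from google import genai  # imported lazily so the sketch loads without the SDK
    from google.genai import types

    client = genai.Client(api_key=api_key)
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=texts,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY" if as_query else "RETRIEVAL_DOCUMENT",
            output_dimensionality=768,  # must match the Qdrant collection's vector size
        ),
    )
    return [e.values for e in result.embeddings]
```

The query/document split matters because the two task types are optimized for asymmetric search: short questions matched against longer passages.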

The API itself works well, but a more consolidated explanation of embedding configuration would make the developer experience even smoother.

