This is a submission for the Built with Google Gemini: Writing Challenge
What I Built with Google Gemini
Large Language Models are powerful, but they still struggle with one major issue — hallucinations.
While building AI assistants, I often found that models could generate answers that sounded convincing but were not actually grounded in real data. That led me to explore Retrieval-Augmented Generation (RAG) and build a system that allows Gemini to answer questions using real documents instead of guesses.
The Problem
Large Language Models are incredibly powerful, but they have a well-known limitation: they can generate answers that sound convincing but are actually wrong. This behavior is called AI hallucination, where a model produces fluent text that is not grounded in real facts or evidence.
These hallucinations don’t happen randomly — they usually occur because of structural limitations in how LLMs work.
Some common causes include:
**Limited context window.** LLMs can only “remember” a fixed number of tokens in a conversation. When the context grows too long, earlier information drops out of the window, and the model loses important instructions or details.

**Long or complex documents.** When documents are very large, the model may struggle to reason over the entire content and can miss dependencies between different parts of the text.

**Outdated training knowledge.** LLMs rely on training data collected at a specific point in time. If new information appears after that cutoff, the model may answer from stale or incomplete knowledge.

**Probabilistic text generation.** Language models generate responses by predicting the most likely next token rather than verifying facts, which can lead to confident but incorrect outputs.
Why Retrieval-Augmented Generation (RAG)
For applications like document search, knowledge assistants, or research tools, these limitations become a serious problem. Users need answers that are grounded in real documents, not guesses.
This challenge led me to explore Retrieval-Augmented Generation (RAG) — a technique that helps language models answer questions using real data instead of relying only on what they remember.
Instead of relying only on the model’s training data, a RAG system first retrieves relevant information from external documents and then uses that information as context when generating an answer.
The idea is simple: rather than asking the model to rely purely on memory, we give it access to the correct information at the moment it generates the response.
By grounding responses in retrieved documents, RAG systems help:
- reduce hallucinations
- provide answers based on real data
- work with private or domain-specific knowledge
- keep information up-to-date without retraining the model
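In practice, the “grounding” step is largely prompt assembly: the retrieved chunks are injected into the prompt before the model is called. A minimal sketch of what that could look like (the function name and prompt wording here are my own, not the project's actual code):

```python
def build_grounded_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt that instructs the model to answer only
    from the retrieved document chunks, numbered for traceability."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks also makes it easy to ask the model to cite which passage each claim came from.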
System Architecture
The system is built as a Retrieval-Augmented Generation (RAG) pipeline using FastAPI, Google Gemini (for both LLMs and embeddings), and Qdrant (as the vector database).
Users upload documents, which are processed into embeddings and stored in a vector database. When a query is made, relevant document chunks are retrieved and used as context for Gemini to generate a grounded response.
Core Components
| Layer | Technology | Role |
|---|---|---|
| Frontend | HTML, CSS, JavaScript | User interface for uploading PDFs and interacting with the chatbot |
| API Server | FastAPI | Handles routes, request processing, and serves the frontend |
| Embedding Model | Gemini `gemini-embedding-001` | Converts queries and documents into vector embeddings |
| Vector Database | Qdrant | Stores embeddings and retrieves similar document chunks |
| LLM | Gemini Flash models | Generates answers based on retrieved context |
Backend Modules
| Module | File | Responsibility |
|---|---|---|
| API Layer | `main.py` | Defines endpoints (`/upload`, `/search`, `/chat`) and initializes services |
| Document Ingestion | `ingest.py` | Extracts PDF text, cleans it, and splits it into overlapping chunks |
| Embeddings | `embeddings.py` | Generates embeddings for queries and document chunks |
| Vector DB Utility | `qdrant_utils.py` | Handles the Qdrant connection and collection initialization |
| Search Pipeline | `search.py` | Performs retrieval and generates answers using Gemini |
| Chat Pipeline | `chat.py` | Implements conversational RAG with memory and streaming responses |
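The chunking step in `ingest.py` is worth a closer look, since overlap is what keeps sentences that straddle a boundary from being split away from their context. A hypothetical sketch of such a splitter (sizes and function name are illustrative, not the repo's actual implementation):

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one, so content near a
    boundary is retrievable from either side."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Real pipelines often split on sentence or token boundaries rather than raw characters, but the overlap idea is the same.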
Data Flow
| Step | Process |
|---|---|
| 1 | User uploads a PDF document |
| 2 | Text is extracted and split into chunks |
| 3 | Chunks are converted into embeddings using Gemini |
| 4 | Embeddings and text are stored in Qdrant |
| 5 | User submits a query |
| 6 | Query is embedded and matched against stored vectors |
| 7 | Relevant document chunks are retrieved |
| 8 | Gemini generates an answer using the retrieved context |
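Steps 6 and 7 boil down to nearest-neighbour search over embedding vectors. Qdrant does this efficiently at scale, but the core idea can be sketched in plain Python using cosine similarity (a hypothetical helper, not how Qdrant is actually queried):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the texts of the k chunks whose vectors are most similar
    to the query vector."""
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

A vector database replaces this linear scan with an approximate index (e.g. HNSW), which is what makes retrieval fast over millions of chunks.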
TECH ARSENAL
| Component | Technology |
|---|---|
| Backend | FastAPI |
| LLM | Google Gemini Flash |
| Embeddings | Gemini `gemini-embedding-001` |
| Vector Database | Qdrant |
| Framework | LangChain |
| Frontend | HTML, CSS, JavaScript |
System Workflow
1) Overall RAG System Workflow
2) Document Ingestion Pipeline
3) Query Retrieval Pipeline
4) Conversational Chat Flow
Demo
To make the system easier to explore, I built a simple web interface where users can upload documents and interact with the RAG system in real time.
Interaction
The live demo may be slightly slow or not behave as expected because it is deployed on Render.
Video
GitHub Repo
Gemini-Qdrant RAG Backend
This is a production-ready RAG (Retrieval-Augmented Generation) backend built with FastAPI. It uses Google Gemini Embeddings for high-quality semantic search and Qdrant as the vector database.
The pipeline is non-blocking, designed for concurrent requests, and optimized for indexing documents.
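“Non-blocking” here means that slow I/O (embedding and LLM calls) must not stall the server's event loop while other requests are waiting. One common pattern for this in async Python is pushing blocking SDK calls onto worker threads; this is a sketch of that pattern under my own assumptions, not necessarily how the repo implements it:

```python
import asyncio

def embed_sync(text: str) -> list[float]:
    # Stand-in for a blocking SDK call such as client.models.embed_content().
    return [float(len(text))]

async def embed_many(texts: list[str]) -> list[list[float]]:
    # Run the blocking embedding calls in worker threads concurrently,
    # so the event loop stays free to serve other requests while a
    # document is being indexed.
    return await asyncio.gather(*(asyncio.to_thread(embed_sync, t) for t in texts))
```

Inside an async FastAPI endpoint, `await embed_many(chunks)` would then index a document without blocking concurrent traffic.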
⚙️ Setup and Installation
```shell
python -m venv .venv
.venv\Scripts\activate        # Windows (use `source .venv/bin/activate` on Linux/macOS)
pip install -r requirements.txt
fastapi dev backend/main.py
```
Create a .env file in the root of your project directory and set your access keys and database URL.
```shell
# Gemini API key (required for embedding and generation)
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"

# Qdrant vector database connection
QDRANT_URL="http://localhost:6333"
QDRANT_API_KEY=""   # use only if your Qdrant instance requires it
QDRANT_COLLECTION="rag_documents_768"
```
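The backend can then read these settings from the environment at startup. A minimal, assumed loader (the variable names match the `.env` above; the helper itself is hypothetical):

```python
import os

def load_settings() -> dict:
    """Read configuration from environment variables, falling back to
    the same defaults shown in the .env example."""
    return {
        "gemini_api_key": os.environ["GEMINI_API_KEY"],  # required, no default
        "qdrant_url": os.getenv("QDRANT_URL", "http://localhost:6333"),
        "qdrant_api_key": os.getenv("QDRANT_API_KEY", ""),
        "qdrant_collection": os.getenv("QDRANT_COLLECTION", "rag_documents_768"),
    }
```

Failing fast with a `KeyError` when `GEMINI_API_KEY` is missing is usually preferable to discovering it on the first API call.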
Future endeavors:
- [ ] Implement GraphRAG, PageIndex, and other RAG variants
- [ ] Deploy a full-fledged app
What I Learned
This project helped me understand how RAG actually works in practice, not just in theory. Building the pipeline made me see how document ingestion, chunking, embeddings, vector search, and LLM generation all connect together. 🧩🤖
I also learned how embeddings convert text into vectors and why they are essential for semantic search. 🔢🔍
Working with vector databases like Qdrant helped me understand how similarity search and top-k retrieval power document-based AI systems. 🗂️⚡
And honestly, working with documentation can sometimes be a headache 🥲. A lot of the learning came from experimenting, debugging, and figuring things out along the way. 🛠️😇
Future Endeavours ⛰️
- Adding an authentication and user management system to manage chats and create separate databases for each user.
- Adding support for different database services, such as MongoDB, both local and cloud-based.
- Using local LLMs to minimize inference costs.
- Experimenting with advanced RAG architectures such as PageIndex.
- Deploying the system properly on cloud platforms like AWS, Azure, or GCP.
Google Gemini Feedback
LLM API — Smooth Experience 👍👍
Calling the Gemini generative models was straightforward.
The `google-genai` SDK has a clean interface. Methods like `client.models.generate_content()` and `client.models.embed_content()` feel intuitive, and getting a working prototype running was quick.
The models themselves performed well for RAG use cases. Once retrieved document chunks were provided as context, Gemini produced grounded responses, and streaming worked reliably in conversational flows.
Embeddings API — Documentation Could Be Clearer
The embeddings API worked well overall, but the documentation could be easier to navigate.
While integrating `gemini-embedding-001`, some configuration details were not immediately obvious or well explained, especially parameters like:

- `task_type="RETRIEVAL_QUERY"`
- `task_type="RETRIEVAL_DOCUMENT"`
- `output_dimensionality`
- batch size limits
Because of that, I ended up doing some trial-and-error and experimentation (and occasionally using AI coding tools) to determine the correct request structure.
The API itself works well, but a more consolidated explanation of embedding configuration would make the developer experience even smoother.
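For reference, this is roughly the request structure I converged on for the embedding endpoint, shown as a plain dict in the shape of the REST request body (field names as I understand the Generative Language API; verify against the current docs before relying on them):

```python
def build_embed_request(text: str, *, for_query: bool, dims: int = 768) -> dict:
    """Build an embedContent-style request body. RETRIEVAL_QUERY is used
    for user questions, RETRIEVAL_DOCUMENT for indexed chunks, so the two
    sides of the search are embedded with matching task types."""
    return {
        "model": "models/gemini-embedding-001",
        "content": {"parts": [{"text": text}]},
        "taskType": "RETRIEVAL_QUERY" if for_query else "RETRIEVAL_DOCUMENT",
        "outputDimensionality": dims,  # must match the Qdrant collection size
    }
```

The key detail that cost me the most time: queries and documents get *different* task types, and `outputDimensionality` has to agree with the dimension the Qdrant collection was created with (768 in this project).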