DEV Community

Backend Engineering for an AI Platform: The Infrastructure Decisions Nobody Talks About

Video explanation: https://youtu.be/1xL_Dtp7Ksg

By Raghav Bansal | Backend Engineer & Memory Infrastructure Lead, Team FireFury
EduRag Project | Stack: FastAPI, Supabase PostgreSQL, Hindsight Memory, SlowAPI, WebSockets

Good backend engineering is invisible. When I have done my job well, the people using EduRag do not think about the database schema or the rate limiter or the memory storage design. They just use the app and things work. The only time they notice infrastructure is when it fails — and I have broken it enough times to know exactly what that feels like.
I owned the backend infrastructure for EduRag: the Supabase schema, the FastAPI endpoint layer for all API operations, the memory integration with Hindsight, rate limiting, WebSocket-based chat, and the security middleware stack. This is the part of the system that nobody demos but that everything else depends on.
The Database Schema — Eight Tables, Zero Local Storage
EduRag runs entirely on Supabase PostgreSQL with no local database. That decision was made early and it simplified deployment enormously, but it required getting the schema right from the start because migrating table structures in production is painful.
The schema has eight tables. users stores account information: the institution ID, name, bcrypt-hashed password, role enum, and avatar preference. The role enum is the foundation of the entire RBAC system — everything downstream of authentication reads from this field. search_history logs every RAG query per user with the query text, timestamp, and result count; it feeds the trending topics analytics and the personalized recommendations.
The three content tables are pdfs, pdf_chunks, and rag_embeddings. pdfs stores upload metadata and indexing status. pdf_chunks stores the extracted text segments with their position metadata. rag_embeddings stores the Gemini vector for each chunk as a float array alongside a source reference back to the chunk. The cosine similarity computation runs directly in the database using PostgreSQL's array operations — I did not want to pull thousands of vectors into Python just to sort them.
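The database-side computation is ordinary cosine similarity over the stored float arrays. A minimal pure-Python reference of the same math (the production version runs as a SQL expression inside Postgres, not in application code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Reference implementation of the similarity the database computes
    over stored embedding arrays. Names here are illustrative."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # guard against empty or zero vectors
    return dot / (norm_a * norm_b)
```

Keeping this in SQL means the database sorts candidate chunks by similarity and returns only the top matches, instead of shipping every vector over the wire.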
Row Level Security is enabled on every table. This means authorization is enforced at the database level, not just in application code. A bug in a route handler cannot accidentally expose student data to another student, because the RLS policy blocks the query before it returns results. Setting up RLS took an extra day of schema work. It has prevented at least two bugs from becoming data exposure incidents.
The PDF Indexing Bug That Took a Full Day to Find
This one still bothers me a little, because the root cause was something I should have caught earlier.
We kept getting reports that PDFs appeared as 'indexed' in the teacher dashboard but returned no results when students searched. The PDF metadata row showed indexed_at populated, the storage bucket showed the file was there, but the rag_embeddings table had no rows for that PDF's chunks.
I added logging to every step of the indexing pipeline and ran it manually three times. Everything worked. The embeddings were generated and stored correctly. I could not reproduce the failure.
It only failed under concurrency. When two teachers uploaded and indexed PDFs at the same moment, one of them would silently lose its embedding writes. The logging I added ran in the main task and showed success. The failure was happening in a background task I fired with asyncio.create_task() — and never awaited or tracked.
FastAPI's route handler was returning a 200 response immediately after creating the background task. Because nothing held a reference to that task, the event loop was free to garbage-collect it mid-flight, and the Supabase write inside it sometimes got interrupted. The task reference was lost; the failure was silent.
The fix: I moved the embedding job to a synchronous function called directly in the route handler, accepting the latency. For large PDFs that take more than a few seconds to index, the endpoint now returns 202 Accepted with a job ID. The teacher dashboard polls a status endpoint until indexing completes. The pdfs table has a failed_at column and error_message column that get populated if anything goes wrong — so failures are now visible, not silent.
Memory Infrastructure — How Hindsight Fits Into the Backend
The memory system was my primary ownership area alongside the core infrastructure. I built four endpoints that wrap the Hindsight API: status check, retain, recall, and reflect.
Retain is called automatically after each successful RAG search. It takes a structured fact string describing what the student searched — something like 'student searched for enzyme inhibition mechanisms in biochemistry' — and stores it in that user's Hindsight memory bank. Each user gets their own bank, keyed by their user ID in our system.
Before any fact is sent to Hindsight, it passes through a PII redaction function. This runs a regex pass that strips email addresses and phone numbers from the fact string. The function is called in the retain path every single time, not optionally. Storing student personally identifiable information in an external memory service is not something I was willing to leave up to chance.
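A minimal sketch of that redaction pass; the patterns here are illustrative, and the production regexes may be stricter:

```python
import re

# Illustrative patterns for the two PII classes the retain path strips.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(fact: str) -> str:
    """Strip emails and phone numbers before a fact leaves our system."""
    fact = EMAIL_RE.sub("[REDACTED_EMAIL]", fact)
    fact = PHONE_RE.sub("[REDACTED_PHONE]", fact)
    return fact
```

Because the function sits directly in the retain code path, there is no way to store a fact without it running first.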
Recall is used at query time by the RAG pipeline. Shrikant's retrieval code calls this endpoint with the current query, and Hindsight returns memory facts that are semantically relevant to it. Those facts get injected into the generation prompt. I made this call with a six-second timeout — the HINDSIGHT_TIMEOUT config variable — because a slow memory lookup should not block an answer. If recall times out, the query continues without memory context rather than returning an error.
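The timeout-with-fallback logic looks roughly like this, assuming a generic async client call in place of the real Hindsight client:

```python
import asyncio

HINDSIGHT_TIMEOUT = 6.0  # seconds, mirroring the config variable

async def recall_with_timeout(query: str, recall_fn,
                              timeout: float = HINDSIGHT_TIMEOUT) -> list[str]:
    """Return memory facts, or an empty list if recall is slow or failing.
    recall_fn stands in for the real Hindsight recall call."""
    try:
        return await asyncio.wait_for(recall_fn(query), timeout=timeout)
    except Exception:
        # A slow or broken memory lookup must not block the answer.
        return []
```

The empty-list fallback is what lets the RAG pipeline treat memory as optional context rather than a hard dependency.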
Reflect is an admin-triggered operation that asks Hindsight to generate an AI summary of a user's full memory bank. I exposed this through the admin dashboard. A teacher selects a student, triggers a reflection, and gets back a natural language paragraph describing what topics that student has covered, how frequently, and where the engagement dropped off. This took about 20 lines of backend code and became one of the most-used admin features.
What Memory Changed at the Product Level
Before Hindsight, the recommendations endpoint was returning platform-wide trending topics. It was technically functional and completely impersonal. A student who had been studying organic chemistry for three weeks was seeing the same recommendations as a student who had never opened the platform before.
After integrating memory recall into the recommendations logic, each student's recommendations are derived from their own recalled facts rather than aggregate trends. The platform-trending fallback still exists — if a student is new and has no memory facts yet, they see trending topics. But for returning students, the recommendations are theirs specifically.
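The conditional branch is genuinely small. A simplified sketch, with a hypothetical topic-extraction helper standing in for the real derivation logic:

```python
def derive_topics(facts: list[str]) -> list[str]:
    """Hypothetical helper: pull the trailing subject phrase from each fact."""
    return [f.split(" in ")[-1] for f in facts]

def recommendations_for(user_facts: list[str], trending: list[str]) -> list[str]:
    """Personalized if the student has recalled memory facts,
    otherwise fall back to platform-wide trending topics."""
    if user_facts:
        return derive_topics(user_facts)
    return trending
```

New students hit the fallback; returning students get recommendations derived from their own history.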
There is a difference between a platform that serves you information and a platform that recognizes you. Memory is what creates the second kind. From a backend perspective, the implementation difference is small — an extra API call to Hindsight, a conditional branch in the recommendations logic. But from a user perspective, it changes what the product fundamentally is.
Rate Limiting and Security Headers — The Boring Stuff That Matters
I added rate limiting relatively late in the project, which I regret. SlowAPI integrates with FastAPI's middleware chain and implements token-bucket rate limiting per IP address. We set 60 requests per minute as the limit.
That number came from a specific calculation: a classroom lab session with 30 students all querying the platform simultaneously over 30 minutes produces roughly 30 requests per minute at peak. A 60 req/min limit gives headroom for that traffic while blocking automated abuse. Below 60, legitimate classroom usage was occasionally being throttled. Above it, Gemini API costs during adversarial load testing became unpredictable.
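To make the policy concrete, here is a self-contained token bucket illustrating what a 60 requests/minute limit means; this shows the concept SlowAPI enforces, not the library's internals:

```python
import time

class TokenBucket:
    """Per-client token bucket: capacity 60, refilling one token per second
    gives a sustained 60 requests/minute with burst headroom."""
    def __init__(self, capacity: int = 60, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In production the bucket is keyed per IP address by SlowAPI's middleware; one instance per client, not one global bucket.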
Security headers middleware adds Content-Security-Policy, HSTS, and X-Frame-Options to every response. This took about 30 minutes to implement. I deferred it for two weeks for no good reason. These headers prevent a class of client-side attacks that the application logic cannot prevent on its own. They should have been in place on day one.
CORS configuration was another thing I spent more time on than it should have taken. EduRag runs on Vercel, which means the frontend origin changes between local development (localhost:3000), Vercel preview deployments (random subdomain), and production (fixed domain). Getting CORS to work correctly across all three environments required explicit origin lists in the FastAPI middleware and a clear decision that preview deployment origins were allowed during staging but not in production.
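The decision logic reduces to a small predicate. A sketch with hypothetical domain names and a hypothetical preview-origin pattern (the real config feeds explicit lists and a regex into FastAPI's CORS middleware):

```python
import re

LOCAL_ORIGIN = "http://localhost:3000"
PRODUCTION_ORIGIN = "https://edurag.example.com"  # hypothetical fixed domain
# Hypothetical pattern for Vercel preview deployments' random subdomains.
PREVIEW_RE = re.compile(r"^https://edurag-[a-z0-9-]+\.vercel\.app$")

def origin_allowed(origin: str, env: str) -> bool:
    """Previews are allowed in staging but never in production."""
    if origin == PRODUCTION_ORIGIN:
        return True
    if env == "local" and origin == LOCAL_ORIGIN:
        return True
    if env == "staging" and PREVIEW_RE.match(origin) is not None:
        return True
    return False
```

Writing the rule down as one function made the staging-versus-production decision explicit instead of scattered across middleware config.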
The WebSocket Chatroom — Surprisingly Tricky
EduRag includes a real-time student chatroom. Students can send messages that broadcast to all other connected students. Messages auto-expire — they are deleted after a configurable TTL — so the chatroom does not become a permanent record.
The implementation uses FastAPI's WebSocket support with a ConnectionManager class that tracks active connections in a dict keyed by user ID. JWT validation happens at connection time rather than per-message, because WebSocket clients cannot set HTTP headers after the handshake. The JWT is passed as a query parameter in the connection URL.
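A sketch of that manager pattern; the real class wraps FastAPI WebSocket objects, so the socket type here is deliberately generic:

```python
import asyncio

class ConnectionManager:
    """Tracks active chat connections in a dict keyed by user ID."""
    def __init__(self):
        self.active: dict[str, object] = {}

    def connect(self, user_id: str, websocket) -> None:
        self.active[user_id] = websocket

    def disconnect(self, user_id: str) -> None:
        self.active.pop(user_id, None)

    async def broadcast(self, message: str) -> None:
        # Iterate over a copy so a dead socket can be dropped mid-loop.
        for user_id, ws in list(self.active.items()):
            try:
                await ws.send_text(message)
            except Exception:
                self.disconnect(user_id)
```

Dropping a connection on send failure keeps the dict from accumulating dead sockets as students close their tabs.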
The tricky part was the auto-expiry. A background task runs every 60 seconds and deletes messages older than the TTL from Supabase. I use asyncio.create_task() for this — yes, the same pattern that caused the indexing bug — but here it is appropriate because message expiry is genuinely fire-and-forget. A missed expiry cycle is not a data integrity problem. The distinction matters: fire-and-forget is fine for best-effort operations. It is dangerous for operations that need to succeed.
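The expiry loop reduces to a single best-effort pass repeated on an interval. A sketch against an in-memory list standing in for the Supabase table:

```python
import asyncio
import time

def expire_pass(store: list[dict], ttl_seconds: float, now: float) -> None:
    """Delete messages older than the TTL. One missed pass only delays
    deletion, which is why fire-and-forget is acceptable here."""
    store[:] = [m for m in store if now - m["created_at"] < ttl_seconds]

async def expiry_loop(store: list[dict], ttl_seconds: float,
                      interval: float = 60.0) -> None:
    """Launched with asyncio.create_task(expiry_loop(...)) at startup."""
    while True:
        expire_pass(store, ttl_seconds, time.time())
        await asyncio.sleep(interval)
```

The contrast with the indexing bug is the whole lesson: this task losing a cycle costs nothing, while the embedding write losing a cycle corrupted state.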
What the HINDSIGHT_ENABLED Flag Did For Us
There was an afternoon during development when the Hindsight API was unreachable for about two hours. I know this because the backend started returning 500 errors on every RAG search — not because the RAG pipeline failed, but because the automatic retain call at the end of each search was throwing an unhandled exception.
I added the HINDSIGHT_ENABLED flag that same afternoon. When it is set to false, all memory calls are skipped silently. The system operates in stateless mode — RAG still works, recommendations fall back to trending topics, and the memory bank UI shows an appropriate empty state. No errors, no degraded responses, just fewer personalized features.
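The guard is a few lines. A sketch with the flag passed in explicitly so the behavior is easy to see (in production it is read once from the environment):

```python
import os

HINDSIGHT_ENABLED = os.getenv("HINDSIGHT_ENABLED", "true").lower() == "true"

def retain_fact(fact: str, retain_call=None,
                enabled: bool = HINDSIGHT_ENABLED) -> bool:
    """Skip memory writes entirely when the flag is off, and never let a
    memory failure surface as a 500 on the search path.
    retain_call stands in for the real Hindsight client call."""
    if not enabled or retain_call is None:
        return False  # stateless mode: silently skipped
    try:
        retain_call(fact)
        return True
    except Exception:
        return False  # degraded, not broken
```

Returning a boolean instead of raising is the point: the caller can log a miss, but the search response is never held hostage by the memory service.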
This is the right way to integrate any external dependency into a system that needs to be reliable. The feature you are adding is not worth making the core product unreliable. Give operators the ability to turn it off cleanly.
Lessons Worth Passing Along
• Row Level Security in Supabase is worth the schema design overhead. It enforces authorization at the layer that cannot be bypassed by application bugs.
• Never use asyncio.create_task() for operations that need guaranteed completion. Use it only for genuinely best-effort background work.
• Expose failures explicitly. A failed_at column and an error_message column turn silent failures into visible ones. Add them to any table that represents an async operation.
• Rate limiting and security headers should be first-week infrastructure, not last-week additions. The implementation cost is low; the cost of adding them under pressure is higher.
• Feature flags for external dependencies are not optional. If you cannot turn a service off cleanly, you have made your reliability dependent on their availability.
• PII redaction before external memory storage belongs in the code path that writes, not in documentation that people might not read.
Closing Thought
Backend work does not get featured in demos. Nobody is going to walk through the RLS policies or the rate limiter in a presentation. That is fine. The job is to make the parts that do get featured work correctly and reliably, and to make sure that when something goes wrong — because something always goes wrong — the failure is visible, contained, and recoverable.
The memory infrastructure in particular is the piece I am most satisfied with, not because it is technically complex but because it directly changed what the product could do for students. Infrastructure that produces visible user value is the best kind.

— Raghav Bansal, Backend Engineer & Memory Infrastructure Lead, Team FireFury
