Arnab Saha
I Built a Multi-Provider AI Chat App with Django and React — Here's What Actually Happened

The Idea

The core idea was simple: a ChatGPT-like interface backed by free-tier AI APIs, with Google login, persistent chat history, and real-time streaming. The twist was multi-provider routing — if Groq's free quota runs out mid-day, the app silently falls back to Gemini, then Mistral. The user never sees an error. They just keep chatting.

Stack I chose:

  • Django 6 + Django REST Framework on the backend
  • React 19 + TypeScript + Tailwind CSS v4 on the frontend
  • Groq (Llama 3.3 70B), Google Gemini 2.0 Flash, Mistral Small for AI
  • Google OAuth 2.0 for auth
  • Render (backend) + Vercel (frontend) + Neon (serverless PostgreSQL) — all free tier

Starting with Auth

I wanted Google OAuth because it's frictionless for users. No password forms, no email verification — just "Sign in with Google" and you're in.

The standard django-allauth approach assumes a server-side redirect flow — Google redirects back to your Django backend, and Django sets a session cookie. That doesn't work cleanly when your frontend is a separate React SPA on a different domain.

So I went with a client-side token flow (via Google Identity Services, rather than the classic OAuth implicit flow): the frontend handles the Google popup, gets the credential (a signed ID token) directly from Google, then sends it to Django. Django validates it against Google's API, finds or creates the user, and returns a JWT pair (access + refresh). From that point forward, it's just JWT auth.

This meant writing a custom GoogleLoginView that takes a Google credential, calls Google's tokeninfo endpoint, extracts the email and profile data, and issues tokens. It worked — but it took longer than expected because the django-allauth docs assume the redirect flow, and the custom path is not well documented.
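
The validation step can be sketched as a plain function. This is my illustration, not the app's actual code: the name extract_google_profile is hypothetical, but the field names (aud, email, email_verified, exp) follow Google's documented tokeninfo response, which serializes booleans as strings.

```python
import time

class InvalidToken(Exception):
    pass

def extract_google_profile(tokeninfo: dict, expected_aud: str) -> dict:
    """Validate a tokeninfo response body and pull out profile fields.

    `tokeninfo` is the JSON returned by Google's tokeninfo endpoint
    for the ID token the frontend received from the popup.
    """
    # The token must have been issued for *this* app's client ID.
    if tokeninfo.get("aud") != expected_aud:
        raise InvalidToken("audience mismatch")
    # tokeninfo serializes booleans as strings ("true"/"false").
    if tokeninfo.get("email_verified") not in ("true", True):
        raise InvalidToken("email not verified")
    # Reject expired tokens (exp is a unix timestamp, as a string).
    if int(tokeninfo.get("exp", 0)) < time.time():
        raise InvalidToken("token expired")
    return {
        "email": tokeninfo["email"],
        "name": tokeninfo.get("name", ""),
        "picture": tokeninfo.get("picture", ""),
    }
```

The view then just calls this, does a get_or_create on the user by email, and issues the JWT pair.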

What I learned: When your frontend and backend live on different origins, think about auth flows from first principles. Don't assume a library's default flow fits your architecture.

Designing the Database Before Writing Any AI Code

I made a deliberate choice to fully design and implement the data layer before touching any AI integration. The models were:

  • ChatSession — belongs to a user, has a title, tracks created/updated time
  • Message — belongs to a session, stores role (user/assistant), content, which AI provider responded, and which model was used
  • ProviderUsage — tracks daily usage per provider so the routing logic knows when to fall back
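
The three models above can be sketched roughly like this (field names are my illustration of the description, not necessarily the app's actual schema):

```python
from django.conf import settings
from django.db import models

class ChatSession(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    title = models.CharField(max_length=200, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

class Message(models.Model):
    session = models.ForeignKey(ChatSession, related_name="messages",
                                on_delete=models.CASCADE)
    role = models.CharField(max_length=10)                   # "user" / "assistant"
    content = models.TextField()
    provider = models.CharField(max_length=20, blank=True)   # which API answered
    model = models.CharField(max_length=50, blank=True)      # which model answered
    created_at = models.DateTimeField(auto_now_add=True)

class ProviderUsage(models.Model):
    provider = models.CharField(max_length=20)
    date = models.DateField()
    exhausted = models.BooleanField(default=False)  # survives server restarts

    class Meta:
        unique_together = ("provider", "date")
```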

That last model was the important one. The fallback logic needed to survive server restarts. If I tracked quota exhaustion in memory only, a Render instance restart would reset it and I'd hammer a dead quota again immediately. Storing it in the database means the state persists across restarts.

Building the Multi-Provider Router

This is the part I'm most satisfied with.

The router works like this:

  1. Try Groq first (fastest, Llama 3.3 70B)
  2. If Groq returns a quota error, mark it exhausted in the DB for today and try Gemini
  3. If Gemini fails, fall back to Mistral
  4. If all three are exhausted, return a clear error to the user

Every message response records which provider and model actually served it. So in the DB you can see "this message was answered by Gemini because Groq was exhausted."

The tricky part was making the fallback transparent to the streaming layer. Once I start streaming tokens to the client, I can't switch providers mid-response. So the fallback only happens before streaming begins — the router tries to open a connection with the provider, and if that fails, it retries with the next one before sending a single byte to the client.
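
The pre-stream fallback can be sketched in plain Python. QuotaExceeded, mark_exhausted, and the provider callables here are stand-ins for the real client libraries and the ProviderUsage write:

```python
from typing import Callable, Iterator

class QuotaExceeded(Exception):
    """Raised when a provider reports its free-tier quota is spent."""

def route_chat(
    providers: list[tuple[str, Callable[[list], Iterator[str]]]],
    messages: list,
    mark_exhausted: Callable[[str], None],
) -> tuple[str, Iterator[str]]:
    """Try providers in order; fall back BEFORE any byte reaches the client."""
    for name, open_stream in providers:
        try:
            # Real clients surface quota errors when the connection is
            # opened, so a failure here is still invisible to the client.
            return name, open_stream(messages)
        except QuotaExceeded:
            mark_exhausted(name)  # persist to the DB so restarts don't forget
    raise RuntimeError("All providers exhausted for today")
```

Once route_chat returns, the chosen iterator is handed to the streaming response, and no further switching is possible.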

SSE Streaming Was Harder Than I Expected

Server-Sent Events looked simple on paper: Django yields chunks, the browser reads them. In practice, several rough edges appeared.

CORS and SSE don't play well by default. My frontend is on Vercel, my backend on Render — two different origins. SSE connections are long-lived and behave differently from regular XHR requests when it comes to CORS headers. I had to verify that Django's CORS middleware applied correctly to streaming responses.
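
For illustration, assuming the django-cors-headers package (the usual way to handle CORS in Django), the relevant settings look roughly like this:

```python
# settings.py — sketch, assuming django-cors-headers
INSTALLED_APPS = [
    # ...
    "corsheaders",
]

MIDDLEWARE = [
    "corsheaders.middleware.CorsMiddleware",  # must sit high, before CommonMiddleware
    # ...
]

CORS_ALLOWED_ORIGINS = [
    "https://arnabsahawrk-ai-chat-assistant.vercel.app",  # the Vercel frontend
]
```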

Django's ORM doesn't like being called inside a generator. When you yield from a Django streaming response, you're inside a generator function. Database writes (like saving the completed message after streaming finishes) need careful handling. I collected the full streamed content and wrote it to the DB in the generator's finally block.

The client needed an abort mechanism. If the user clicks "stop generating," the frontend needs to cleanly close the SSE connection. On the React side, this meant using an AbortController. On the Django side, catching GeneratorExit when the client disconnects.
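
Both rough edges land in the same generator. A stripped-down sketch, where save_message stands in for the real DB write:

```python
from typing import Callable, Iterator

def sse_stream(tokens: Iterator[str],
               save_message: Callable[[str], None]) -> Iterator[str]:
    """Yield SSE frames; persist whatever was streamed, even on disconnect."""
    parts: list[str] = []
    try:
        for tok in tokens:
            parts.append(tok)
            yield f"data: {tok}\n\n"  # SSE frame format
    except GeneratorExit:
        # Client hit "stop generating" (AbortController) or closed the tab.
        raise
    finally:
        # Runs on normal completion AND on disconnect, so the partial
        # response still lands in the database.
        save_message("".join(parts))
```

On the Django side this generator is wrapped in a StreamingHttpResponse; closing the connection triggers GeneratorExit at the current yield, and the finally block still fires.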

Tailwind v4 and the Design Token System

Tailwind v4 is a significant departure from v3. There's no tailwind.config.js anymore — all custom tokens are defined in your CSS file using @theme:

```css
@theme {
  --color-surface-base: #1a1a1a;
  --color-ink-primary: #f0ede8;
  --color-line: #2e2e2e;
}
```

I mirrored these in a theme.ts file for the rare cases where I needed values in JavaScript. The rule I set for myself: no hardcoded hex values in any component. Everything goes through a token class like bg-surface-base or text-ink-primary. This paid off when I wanted to adjust the base background — one line in the CSS file, done everywhere.

The downside: Tailwind v4 is still new and the ecosystem hasn't fully caught up. I hit cases where shadcn/ui components assumed v3 configuration and needed manual adaptation.

Deploying on Free Tiers

Both Render and Vercel have generous free tiers, but they come with real constraints you have to design around.

Render's free backend spins down after 15 minutes of inactivity. The first request after a cold start can take 30-50 seconds. For a chat app where the first thing a user does is send a message, that's a painful first impression.

The fix turned out to be simple. I created a lightweight /health endpoint on the Django backend that just returns a 200 OK, then set up a free UptimeRobot monitor to ping it every 5 minutes. Render never gets the chance to spin the instance down. No cold starts, no 50-second waits, no frustrated users — and it costs nothing on either side.
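
The endpoint itself is as small as it sounds. A sketch (view and route names are my own):

```python
# views.py
from django.http import JsonResponse

def health(request):
    """Keep-alive target for UptimeRobot; does no DB work."""
    return JsonResponse({"status": "ok"})

# urls.py
# from django.urls import path
# urlpatterns = [path("health/", health), ...]
```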

Sometimes the "real" fix is just a cron job with a URL.

PostgreSQL on Neon free tier has connection limits. Django's default connection behavior opens a new connection per request. Under any real load, this exhausts the connection pool. The fix is CONN_MAX_AGE in Django's database settings.
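
A sketch of that setting (values illustrative; CONN_HEALTH_CHECKS requires Django 4.1+):

```python
# settings.py — reuse connections instead of opening one per request
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        # ... host/name/credentials from environment ...
        "CONN_MAX_AGE": 60,          # keep each connection alive for 60s
        "CONN_HEALTH_CHECKS": True,  # ping a pooled connection before reuse
    }
}
```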

Environment variables across two deployments get tedious. Django needs GOOGLE_CLIENT_ID. React also needs VITE_GOOGLE_CLIENT_ID (Vite's required prefix for client-side vars). The same value, two names, two dashboards. Small thing, but error-prone. I kept a .env.example for both services and updated them in sync.

The Context Window Problem

Early on, I was only sending the user's latest message to the AI. The AI had no memory — every message was a fresh conversation. Obviously wrong for a chat app.

The fix is sending the last N messages as conversation history on every request. But N has to be bounded — very long sessions would exceed the model's context window limit and the API would return an error.

I settled on a configurable CONTEXT_WINDOW_SIZE (last 20 messages by default), queried from the DB before every AI call.
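
The trimming itself is trivial once the messages are in hand. A sketch (in the real app the input would come from something like session.messages ordered by creation time; only role and content are forwarded to the API):

```python
CONTEXT_WINDOW_SIZE = 20  # last N messages sent as history

def build_history(messages: list[dict],
                  limit: int = CONTEXT_WINDOW_SIZE) -> list[dict]:
    """Bound the history so long sessions never blow the context window.

    `messages` is oldest-first; keep only the most recent `limit`.
    """
    recent = messages[-limit:] if limit else messages
    return [{"role": m["role"], "content": m["content"]} for m in recent]
```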

Auto-Generating Session Titles

When a new session starts, it has no title. I wanted it to appear automatically after the first exchange.

After the first user message and AI response are saved, a separate lightweight AI call generates a short title (under 8 words) based on the conversation content. It runs after streaming completes so it doesn't block anything. The frontend polls for the session update and renders the title in the sidebar when it arrives.
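
The shape of that call, sketched with hypothetical names (the prompt wording and cleanup are illustrative; models love wrapping titles in quotes, so the output needs normalizing):

```python
TITLE_PROMPT = (
    "Summarize this exchange as a chat title of at most 8 words. "
    "Reply with the title only.\n\nUser: {user}\nAssistant: {assistant}"
)

def clean_title(raw: str, max_words: int = 8) -> str:
    """Strip stray quotes and trailing periods, cap the word count."""
    words = raw.strip().strip('"').rstrip(".").split()
    return " ".join(words[:max_words])
```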

Small feature, but it makes the app feel complete rather than like a half-finished prototype.

What I Would Do Differently

Start with the streaming architecture. I built the basic request/response flow first, then had to rework significant parts to support SSE. Designing for streaming from day one would have avoided most of that rework.

Write the API contract first. I built backend and frontend somewhat in parallel, which meant changing endpoint shapes while the frontend was already consuming them. A simple spec written upfront would have saved some self-coordination overhead.

What I Actually Learned

Free tiers shape your architecture whether you plan for it or not. Cold starts, connection limits, quota limits — these aren't edge cases on free hosting. They're the default. Design for them early.

Multi-provider AI routing is genuinely useful, not just clever. Since deploying, Groq's free tier has hit its daily limit multiple times. Without the fallback, the app would be broken for the rest of the day. With it, users don't notice.

SSE is underused and underappreciated. For one-directional streaming from server to client, SSE is simpler than WebSockets in almost every way. It works over plain HTTP, handles reconnection natively, and needs no special infrastructure. Reach for it before WebSockets when you only need server-to-client streaming.

Design tokens are worth the upfront investment. The Tailwind v4 token setup took a couple of hours. But having a consistent, named color system meant I never debated "is this the right shade?" The answer was always "use the token and move on."

Try It

The app is live at arnabsahawrk-ai-chat-assistant.vercel.app. Sign in with Google and send a message. If Groq's quota is spent for the day, you will be talking to Gemini without knowing it.

Source code is on GitHub and the API documentation is at ai-chat-assistant-4agm.onrender.com.

If you spot something that could be done better or have an idea worth trying, drop it in the comments — always open to it.
