TL;DR: I built Aria — a production-grade AI assistant for CRM agents. Agents log in with Google, type natural language questions about their leads, tasks, and bookings, and get real answers streamed from a live database. No hallucinations. No canned responses. Just SQL executed against real data, formatted by GPT-4o, and streamed token by token. Built entirely in Go, with a Python schema intelligence pipeline, pgvector for semantic search, Redis for caching and sessions, and a Next.js frontend. Here's everything I learned.
Why I built this
I work as a backend engineer at a PropTech company. Every day I watch pre-sales agents open five different tabs to answer one question — leads dashboard, task list, booking tracker, payment sheet, activity log. The data exists. It's all in PostgreSQL. But to answer "which of my leads haven't been contacted in 3 days?", an agent has to mentally join four tables.
I wanted to build something that let an agent just ask that question — in plain English — and get a real answer from the live database. Not a chatbot that makes things up. An AI that runs actual SQL and tells you what it found.
That's Aria.
What Aria actually does
Before I get into the technical decisions, let me show you what it looks like in practice.
An agent logs in with Google. They see a chat interface. They type:
"How many leads are assigned to me?"
Aria responds:
"You have 5 leads assigned to you."
With a "SQL executed" toggle that reveals the actual query that ran. Then they ask:
"What tasks are due today?"
"There is one task due today. It is a high-priority call titled 'Welcome call — Priya,' which involves an intro call and confirming the budget range. The task is currently pending and has not been completed yet."
Then:
"Recent activity on my leads"
"Recent activities on your leads include:
- Email Activity — Type: Outbound Email, Subject: Sent brochure, Body: PDF with Manchester options, Occurred: May 4, 2026 at 12:43 PM
- Call Activity — Type: Inbound Call, Subject: Inbound budget check, Body: Student asked about bills-inclusive options, Occurred: May 3, 2026 at 4:43 PM, Outcome: Connected"
Real data. Live database. The agent never wrote a query.
The core technical problem — and why RAG isn't the answer
My first instinct was RAG (Retrieval-Augmented Generation) — the pattern everyone uses for document question-answering. Embed all your CRM data as vectors, find the most similar documents when someone asks a question, feed them to GPT-4o.
This is completely wrong for structured CRM data.
RAG is for unstructured data — PDFs, notes, articles. The answer to "how many leads are assigned to me?" isn't found by semantic similarity. It requires an exact COUNT query with a WHERE clause. Semantic search on a vector of "John Smith, high priority, interested state, assigned to agent_42" doesn't give you a count. It gives you a similar-sounding document.
The right approach is Text-to-SQL — translate the natural language question into an actual SQL query, execute it, get real rows, format the result.
But raw Text-to-SQL (just dumping your schema into a prompt and asking GPT-4o to write SQL) breaks down on complex schemas with business-specific terminology. "High priority leads" only maps to priority = 'HIGH' if the LLM knows your schema well enough. "Stale leads" needs to know that means last_activity_at < NOW() - INTERVAL '3 days'.
The solution I landed on: RAG over the schema, not over the data rows.
I build a semantic layer — plain-English descriptions of every table, every column, every enum value, and what they mean in business terms. That semantic layer gets embedded as vectors. When an agent asks a question, I retrieve the most relevant schema descriptions first, then ask GPT-4o to generate SQL using that focused context. The LLM sees exactly what it needs, not the entire 15-table schema.
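To make that concrete, the retrieval step is a single pgvector query. Here's a minimal sketch: the helper name and exact columns are my own, but the <=> cosine-distance operator and the pgx calls are real, and 1536 dimensions matches text-embedding-3-small.

import (
    "context"

    "github.com/jackc/pgx/v5/pgxpool"
)

// topSchemaDocs returns the five schema descriptions closest to the
// question embedding. Assumes schema_embeddings(doc_text TEXT,
// embedding VECTOR(1536)) with an HNSW index on the embedding column.
func topSchemaDocs(ctx context.Context, pool *pgxpool.Pool, questionEmbedding string) ([]string, error) {
    rows, err := pool.Query(ctx,
        `SELECT doc_text
           FROM schema_embeddings
          ORDER BY embedding <=> $1::vector  -- pgvector cosine distance
          LIMIT 5`,
        questionEmbedding, // pgvector literal, e.g. "[0.01,-0.02,...]"
    )
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var docs []string
    for rows.Next() {
        var doc string
        if err := rows.Scan(&doc); err != nil {
            return nil, err
        }
        docs = append(docs, doc)
    }
    return docs, rows.Err()
}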
Why Go
I'd been using Ruby on Rails professionally for over a year. I wanted to learn Go properly — not from tutorials, but by building something real with real constraints.
Go turned out to be the right call for this specific project for three reasons.
Goroutines for SSE streaming. LLM responses stream token by token. In Go, you open a channel, pump tokens from the OpenAI stream into it, and flush each one as an SSE event. The pattern is clean and idiomatic. The http.Flusher interface handles the rest. Zero boilerplate.
Type safety for LLM tool calling. OpenAI's function calling returns JSON that maps to your defined tool schema. In Go, you define a struct, and the type system ensures you're handling the response correctly. No runtime surprises from dynamic typing.
Concurrency for parallel queries. Some questions require multiple SQL queries — "show me my leads and today's bookings." Go's sync.WaitGroup + goroutines make parallel execution natural. Run both queries simultaneously, merge results, format the combined answer. Python could do this, but Go makes it feel like the default way to think.
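Here's the shape of that pattern. This is a sketch rather than Aria's actual code, and it uses golang.org/x/sync/errgroup instead of a bare sync.WaitGroup, since errgroup propagates the first error and cancels the shared context (the coordination problem I come back to at the end):

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// Row stands in for whatever result type the executor returns.
type Row map[string]any

// Illustrative query helpers; assume each runs one SELECT via pgx.
func queryLeads(ctx context.Context) ([]Row, error)    { /* ... */ return nil, nil }
func queryBookings(ctx context.Context) ([]Row, error) { /* ... */ return nil, nil }

// fetchLeadsAndBookings runs both queries concurrently. The first error
// cancels ctx, so the surviving query stops instead of running to completion.
func fetchLeadsAndBookings(ctx context.Context) ([]Row, []Row, error) {
    g, ctx := errgroup.WithContext(ctx)

    var leads, bookings []Row
    g.Go(func() error {
        var err error
        leads, err = queryLeads(ctx)
        return err
    })
    g.Go(func() error {
        var err error
        bookings, err = queryBookings(ctx)
        return err
    })

    if err := g.Wait(); err != nil { // blocks until both goroutines finish
        return nil, nil, err
    }
    return leads, bookings, nil
}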
The one place I kept Python: the schema intelligence pipeline that reads the DB schema and generates semantic documentation via GPT-4o. The Python AI ecosystem (psycopg2, pgvector library, openai SDK) is simply more mature for that one-time build step.
The architecture
Browser (Next.js)
│
│ Google OAuth → JWT → httpOnly refresh cookie
│ POST /chat → SSE stream
│
Go API (Chi router)
│
├── JWT middleware (validates + blacklists in Redis)
├── Rate limiter (30 req/min per agent, Redis)
│
├── Schema retriever
│ → embed question → pgvector cosine search → top-5 relevant schema docs
│
├── Intent example retriever
│ → cosine search on pre-seeded Q→SQL examples
│
├── GPT-4o tool calling
│ → system prompt: agent context + retrieved schema + examples + history
│ → tool: query_crm_database(sql, explanation)
│
├── SQL validator
│ → must be SELECT only
│ → inject WHERE assigned_agent_id = $1 (agent isolation)
│
├── pgx executor (read-only role, 5s timeout)
│
├── Redis query cache (5 min TTL, invalidated on thumbs-down)
├── Redis session history (last 10 messages, 2h TTL)
│
└── SSE streamer → tokens flow to browser
PostgreSQL
- 15 tables: leads, tasks, bookings, payments, users, partners, properties, activities...
- pgvector extension: schema_embeddings, intent_examples
- Read-only role: aria_readonly (SELECT only at DB level)
Python schema pipeline (runs once on setup)
- Introspects PostgreSQL schema
- GPT-4o generates plain-English column descriptions
- Embeds + stores in schema_embeddings
- Auto-generates 30 example Q→SQL pairs → intent_examples
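One piece of the diagram worth making concrete before moving on: the query_crm_database tool resolves to a plain Go struct, which is what makes the tool-calling step type-safe. A sketch, with struct and helper names that are mine rather than Aria's:

import (
    "encoding/json"
    "fmt"
)

// QueryCRMArgs mirrors the query_crm_database(sql, explanation) tool schema
// from the diagram above.
type QueryCRMArgs struct {
    SQL         string `json:"sql"`
    Explanation string `json:"explanation"`
}

// parseToolCall decodes the raw JSON arguments OpenAI returns for a tool
// call. A malformed or empty payload fails here, not deep in the executor.
func parseToolCall(rawArgs string) (QueryCRMArgs, error) {
    var args QueryCRMArgs
    if err := json.Unmarshal([]byte(rawArgs), &args); err != nil {
        return QueryCRMArgs{}, fmt.Errorf("malformed tool arguments: %w", err)
    }
    if args.SQL == "" {
        return QueryCRMArgs{}, fmt.Errorf("tool call missing sql")
    }
    return args, nil
}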
The decision that matters most: agent data isolation
This is not a technical trick. It's a security requirement.
When agent A asks "show me my leads," they must only see their leads. Not agent B's. Not all agents' combined. This sounds obvious but there are three places where it could fail:
The LLM layer. GPT-4o might forget to include a WHERE clause. The LLM is non-deterministic — you cannot trust it to always include the agent filter.
The application layer. This is where I enforce it. After the LLM generates SQL, before it ever hits the database, I inject AND assigned_agent_id = $1 and pass the authenticated agent's ID as a parameterised argument. The agent ID comes from the JWT — which is cryptographically signed. This runs regardless of what the LLM produced.
The database layer. The application connects as aria_readonly — a PostgreSQL role with only SELECT permissions. Even if somehow a non-SELECT query got through, it would fail at the database level.
Three independent layers. Any one of them catches a failure in the others.
import (
    "fmt"
    "strings"
)

// injectAgentFilter runs after the LLM generates SQL and before execution.
// The authenticated agent's ID (from the signed JWT) is bound to $1 by the
// executor — it is never interpolated into the SQL string itself.
func injectAgentFilter(sql string) string {
    if !strings.Contains(strings.ToUpper(sql), "ASSIGNED_AGENT_ID") {
        // The LLM omitted the filter entirely: wrap the whole query so the
        // outer WHERE scopes every returned row to the calling agent.
        return fmt.Sprintf(
            "SELECT * FROM (%s) AS q WHERE q.assigned_agent_id = $1", sql,
        )
    }
    // The LLM included the filter with the :agent_id placeholder:
    // swap it for the positional parameter.
    return strings.ReplaceAll(sql, ":agent_id", "$1")
}
The schema intelligence pipeline
This is the piece I'm most proud of, because it solves a real problem that existing tools don't.
The problem: Text-to-SQL tools expect you to manually write documentation about your schema — what each column means, what each enum value represents. For a 15-table CRM with columns like ai_ineligible_reason, source_details, meta_ai_call_status, manually writing that documentation is hours of work. And every time you add a column, you have to update the docs.
My solution: a Python script that introspects the PostgreSQL schema and sends each table's DDL to GPT-4o, which generates the documentation automatically.
import json

def generate_table_doc(client, table_name, ddl, foreign_keys, sample_rows):
    prompt = f"""
    You are documenting a CRM database for a student accommodation company.

    Here is the DDL for the '{table_name}' table:
    {ddl}

    Foreign keys: {foreign_keys}
    Sample rows (3): {sample_rows}

    Generate a JSON object with:
    - table purpose (1-2 sentences)
    - for each column: plain English description, possible values if enum
    - common business questions this table answers
    - relationships to other tables
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # guarantees parseable JSON back
    )
    return json.loads(response.choices[0].message.content)
The output for the leads table tells GPT-4o things like: "state = 'interested' means the student has shown interest but has not yet booked. pre_sales_agent_id is the agent currently responsible for converting this lead." These descriptions get embedded as vectors and stored in schema_embeddings.
The pipeline is incremental — it hashes each table's DDL. On re-run, only tables whose schema has changed get re-documented. Add a column to leads, only leads gets reprocessed. The rest is instant.
The dual-write problem and why I store conversations in both Redis and PostgreSQL
Conversation history lives in two places:
Redis (session:{user_id}:{session_id}, TTL 2 hours) — the last 10 messages, for fast lookup on every API call. The LLM needs this context to answer follow-up questions.
PostgreSQL (conversations and messages tables) — permanent storage for audit, feedback training, and conversation history display.
Why both? Speed vs durability. Redis is fast — loading session history from Redis takes ~1ms. PostgreSQL with a JOIN takes ~10ms. For something that happens on every LLM call, that 9ms difference matters. But Redis is ephemeral — TTL expires, memory pressure evicts keys. The PostgreSQL record is permanent and queryable.
This is the same pattern I use at my day job for a different problem. The tradeoff is always: what needs to be fast, and what needs to be permanent?
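A sketch of what the dual write might look like, assuming go-redis v9 and pgx; the messages columns and the helper itself are illustrative, while the key format and TTLs match the description above:

import (
    "context"
    "fmt"
    "time"

    "github.com/jackc/pgx/v5/pgxpool"
    "github.com/redis/go-redis/v9"
)

// saveMessage writes the durable copy first, then refreshes the fast copy.
func saveMessage(ctx context.Context, pool *pgxpool.Pool, rdb *redis.Client,
    userID, sessionID, role, content string) error {

    // Permanent, queryable record in PostgreSQL.
    if _, err := pool.Exec(ctx,
        `INSERT INTO messages (session_id, role, content) VALUES ($1, $2, $3)`,
        sessionID, role, content,
    ); err != nil {
        return err
    }

    // Fast session window in Redis: last 10 messages, 2 hour TTL.
    key := fmt.Sprintf("session:%s:%s", userID, sessionID)
    pipe := rdb.Pipeline()
    pipe.RPush(ctx, key, role+": "+content)
    pipe.LTrim(ctx, key, -10, -1) // keep only the newest 10 entries
    pipe.Expire(ctx, key, 2*time.Hour)
    _, err := pipe.Exec(ctx)
    return err
}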
Streaming with SSE in Go
Server-Sent Events is the right choice for streaming LLM responses. It's one-directional (server to client), works over standard HTTP, and auto-reconnects on disconnect. No WebSocket handshake overhead.
The Go implementation is clean:
func streamResponse(w http.ResponseWriter, openaiStream *openai.ChatCompletionStream) {
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    w.Header().Set("X-Accel-Buffering", "no") // stop nginx buffering the stream

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }

    for {
        chunk, err := openaiStream.Recv()
        if err == io.EOF {
            fmt.Fprintf(w, "data: [DONE]\n\n")
            flusher.Flush()
            return
        }
        if err != nil {
            // Surface upstream failures as an SSE error event rather than looping forever.
            fmt.Fprintf(w, "data: %s\n\n", `{"type":"error","text":"stream failed"}`)
            flusher.Flush()
            return
        }
        if len(chunk.Choices) == 0 {
            continue // some chunks carry no delta
        }
        token := chunk.Choices[0].Delta.Content
        event := fmt.Sprintf(`{"type":"token","text":%q}`, token)
        fmt.Fprintf(w, "data: %s\n\n", event)
        flusher.Flush()
    }
}
The frontend uses the EventSource API with a custom hook that parses event types (sql, token, done, error) and updates React state accordingly. The agent sees tokens appearing one by one — exactly like ChatGPT, but with real CRM data behind it.
The feedback loop
Every AI response has thumbs-up / thumbs-down buttons. This isn't just UX polish — it's the learning mechanism.
Thumbs up: The Q→SQL pair gets stored in query_feedback with is_helpful = true. After 3 upvotes on similar question patterns, the pair gets auto-promoted to intent_examples — it becomes part of the examples that future questions retrieve during the intent-example retrieval step. The system gets smarter from usage.
Thumbs down: The agent can optionally add a correction note. The bad Q→SQL pair is flagged. The Redis cache entry for that question is immediately invalidated — the next time someone asks the same question, it goes to the LLM fresh instead of serving the cached wrong answer.
A nightly job clusters negative feedback to identify recurring failure patterns — questions the system consistently gets wrong — and writes them to an intent_gaps table for manual review.
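The thumbs-down invalidation is the simplest piece to sketch. This assumes the cache is keyed per agent by a hash of the normalised question; the key scheme here is illustrative, not Aria's actual one:

import (
    "context"
    "crypto/sha256"
    "encoding/hex"
    "strings"

    "github.com/redis/go-redis/v9"
)

// cacheKey derives a stable per-agent Redis key from the question text.
func cacheKey(agentID, question string) string {
    sum := sha256.Sum256([]byte(strings.ToLower(strings.TrimSpace(question))))
    return "qcache:" + agentID + ":" + hex.EncodeToString(sum[:])
}

// onThumbsDown drops the cached answer so the next identical question
// goes to the LLM fresh instead of replaying the bad response.
func onThumbsDown(ctx context.Context, rdb *redis.Client, agentID, question string) error {
    return rdb.Del(ctx, cacheKey(agentID, question)).Err()
}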
What I'd do differently
Add OpenTelemetry from the start. I added structured logging but not distributed tracing. Being able to see a full trace from HTTP request → schema retrieval → LLM call → SQL execution → SSE stream with timing at each step would have made debugging much faster.
Add a "confidence score" to responses. Right now, Aria answers with equal confidence whether it's 100% sure or guessing. A mechanism to say "I'm not sure this is right — here's what I think you're asking" would make the product more trustworthy.
Separate the schema pipeline into its own service. It's a Python CLI script today. Making it a proper FastAPI service that the Go API can call on-demand (when a new table is detected) would make the incremental update flow smoother.
What this project taught me about Go
Four things I didn't expect before building this:
Go's error handling forces you to think about every failure mode. There's no try/catch that lets you defer the "what if this fails?" question. Every database call, every API call, every Redis operation has an explicit error path. The code is more verbose but the failure modes are all visible.
Interfaces in Go make dependency injection natural. My AIService depends on a SchemaRetriever interface, not a concrete type. Swapping implementations for tests, or switching from pgvector to a different vector store, requires changing one line. There's a sketch of the shape right after this list.
The context package is everywhere and it's the right abstraction. Timeouts, cancellation, request-scoped values (like the authenticated user ID) — all flow through context.Context. Once you understand why, you stop fighting it.
Goroutines are cheap but coordination is hard. Running parallel SQL queries with sync.WaitGroup is easy. Making sure errors from each goroutine are correctly propagated and the context cancels properly when one fails — that took more thought.
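On the interfaces point, the shape is roughly this (the method set is illustrative):

import "context"

// SchemaRetriever is the seam: AIService never knows whether schema docs
// come from pgvector, a different vector store, or a test stub.
type SchemaRetriever interface {
    TopDocs(ctx context.Context, question string, k int) ([]string, error)
}

type AIService struct {
    retriever SchemaRetriever
}

// In tests, a canned implementation stands in. No database required.
type fakeRetriever struct{ docs []string }

func (f fakeRetriever) TopDocs(_ context.Context, _ string, _ int) ([]string, error) {
    return f.docs, nil
}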
The numbers
- 15 tables with relational integrity
- 500 seeded leads, 300 tasks, 200 bookings, 150 payments
- 30 pre-seeded intent examples from the schema pipeline
- ~90 schema embeddings covering every table and column
- Sub-100ms response time on cache hits
- ~1.5–2s to first token on cache misses (LLM latency)
- 5 second SQL execution timeout, hard limit at DB level
- Read-only PostgreSQL role — no write access possible
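For those last two items, the enforcement is split between Go and PostgreSQL. A sketch of the executor side, assuming pgx; the ALTER ROLE backstop is standard PostgreSQL, and treating it as part of Aria's setup is my assumption:

import (
    "context"
    "time"

    "github.com/jackc/pgx/v5/pgxpool"
)

// executeSQL runs the validated, agent-scoped query with a hard 5s budget.
// The pool connects as aria_readonly, so even a bug upstream cannot write.
// Database-side backstop, run once at setup (assumption):
//   ALTER ROLE aria_readonly SET statement_timeout = '5s';
func executeSQL(ctx context.Context, pool *pgxpool.Pool, sql, agentID string) ([]map[string]any, error) {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel() // safe: rows are fully drained before we return

    rows, err := pool.Query(ctx, sql, agentID) // $1 = agent ID from the JWT
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var out []map[string]any
    for rows.Next() {
        values, err := rows.Values()
        if err != nil {
            return nil, err
        }
        row := make(map[string]any, len(values))
        for i, fd := range rows.FieldDescriptions() {
            row[fd.Name] = values[i]
        }
        out = append(out, row)
    }
    return out, rows.Err()
}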
The stack
Backend: Go 1.22, Chi router, pgx v5, go-redis v9, official openai-go SDK, golang-jwt
AI layer: GPT-4o for SQL generation and response formatting, text-embedding-3-small for schema embeddings, pgvector (HNSW index) for similarity search
Schema pipeline: Python 3.12, psycopg2, pgvector Python client, openai Python SDK
Frontend: Next.js 14, TypeScript, Tailwind CSS, Zustand for auth state, React Query
Infrastructure: Docker Compose, PostgreSQL 16, Redis 7
The code
Everything is on GitHub: github.com/Deonkar/Aria
It runs with docker compose up. Run make seed to populate the database. Run make schema-pipeline to generate the semantic documentation and embeddings. Then ask it anything about your leads.
The thing to try first: ask it the same question twice. Watch the first response stream token by token. Watch the second response come back instantly from cache with the "SQL executed" badge still visible. That's the difference between a demo and a system.
What's next
The natural next step is the RAG layer — adding support for unstructured data like call transcripts, email threads, and agent notes. "What did we discuss with this lead last week?" can't be answered by SQL alone. That question needs semantic search over unstructured text — which is where pgvector plus proper chunking comes in. The infrastructure is already there. The schema pipeline already uses pgvector for embeddings. Extending it to cover activity bodies and notes is the obvious v2.
The other direction is multi-agent context — letting admins ask questions across all agents ("which agent has the most stale leads this week?") while keeping agent-level queries scoped. The agent-isolation layer already handles this via the role claim in the JWT. It's a configuration change, not an architecture change.
If you're building something similar or have thoughts on the architecture decisions — particularly around the schema intelligence pipeline or the agent isolation approach — I'd like to hear from you. Drop a comment or find me on LinkedIn.