DEV Community: Zeeshan Ghazanfar

What Actually Helped Our Talk-to-DB Agent Stop Relearning the Same SQL

Zeeshan Ghazanfar — Mon, 25 May 2026 12:05:38 +0000

Natural language to SQL gets expensive when the agent keeps solving the same problem from scratch.

In our Talk-to-DB layer, we added a semantic query cache:

text-embedding-3-large
3072-dimensional embeddings
pgvector cosine similarity
top 5 similar cached queries injected into the system prompt
table schemas stored with the cached SQL
user feedback stored as thumb_up or thumb_down

The failure mode was predictable.

A user would ask:

"Show revenue by product."

The agent would generate a working SQL query.

Next week, another user would ask:

"Which products made the most sales?"

Same intent. Same schema. But the model might regenerate a slightly different query, sometimes with the classic Odoo mistake: using product_product.list_price when the column actually lives on product_template.

The cache does not blindly replay SQL. It gives the agent prior working queries plus the table schemas used by those queries. If the request is close enough, the agent can adapt the known-good pattern instead of rediscovering joins.

For standardized Odoo databases, we also support a global cache because the schema shape is repeatable across deployments. For custom databases, the cache stays scoped to the datasource.

One honest caveat: similarity thresholds matter. Too low and irrelevant SQL leaks into the prompt. Too high and the agent misses useful prior work. We currently use per-instance config with top 5 candidates and a similarity threshold that needs tuning against real query logs.
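
The lookup described above can be sketched in a few lines. This is a minimal in-memory illustration, not the production pgvector query: the `CachedQuery` class, the `top_cached_queries` function, the toy 3-dimensional vectors, and the 0.8 threshold are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class CachedQuery:
    question: str
    sql: str
    embedding: list  # real embeddings would have 3072 dimensions


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_cached_queries(query_embedding, cache, k=5, threshold=0.8):
    """Return up to k (similarity, entry) pairs at or above the threshold."""
    scored = [(cosine_similarity(query_embedding, c.embedding), c) for c in cache]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Toy vectors stand in for real embeddings.
cache = [
    CachedQuery("Show revenue by product.", "SELECT 1", [1.0, 0.0, 0.0]),
    CachedQuery("List open support tickets.", "SELECT 2", [0.0, 1.0, 0.0]),
    CachedQuery("Which products made the most sales?", "SELECT 3", [0.9, 0.1, 0.0]),
]
hits = top_cached_queries([1.0, 0.0, 0.0], cache, k=5, threshold=0.8)
```

Lowering `threshold` lets the unrelated ticket query into the prompt; raising it too far drops the paraphrased sales question. That is the trade-off to tune against logs.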

This is the part of production AI people underestimate.

The first deployment is not the finish line. The system gets better when every query, failure, schema mismatch, and user correction becomes operating memory.

Why PDF-Style RAG Fails on Structured Enterprise Data

Zeeshan Ghazanfar — Thu, 14 May 2026 06:28:30 +0000

Most teams try to use document RAG patterns on structured enterprise data.

That usually breaks.

PDF RAG and structured-data RAG are not the same problem.

With PDF RAG, the system usually retrieves text chunks and asks the model to answer from them.

With ERP or CRM data, the problem is different:

Which table contains the answer?
Which fields are reliable?
Which joins are allowed?
Which filters map to the user’s business language?
Which rows are stale, duplicated, or operationally invalid?

We tested a basic vector-only RAG setup over structured records.

It looked fine in demos.

In production-style evals, it failed on multi-step questions because the retriever found semantically similar records, but missed the required relational constraints.

The fix was not “better embeddings”.

The fix was schema grounding.

We moved to a hybrid pattern:

classify the user intent
map terms to business entities
retrieve schema and field definitions
generate constrained SQL or API calls
validate outputs against business rules
only then pass the final result to the model for explanation

Accuracy improved because the model stopped guessing from loose chunks and started operating against the real data model.

One failure mode we still monitor closely:

The model can produce a correct-looking answer from incomplete data.

That is worse than an obvious error.

For structured enterprise systems, the hard part is not retrieval.

The hard part is knowing when the retrieved data is not enough.

RAG patterns that work for structured data vs ones that fail

Zeeshan Ghazanfar — Tue, 05 May 2026 12:06:46 +0000

Most RAG failures in enterprise systems do not come from the embedding model.

They come from using the same retrieval pattern for every kind of data.

A policy document, a support article, and a 12-year-old ERP schema are not the same problem. Treating them the same is how teams end up with demos that look useful and production systems that quietly return wrong answers.

At BrainPack, we see this most often when teams try to use naive RAG over structured enterprise data.

The pattern usually looks like this:

Export tables or reports into text chunks
Embed those chunks
Retrieve the top 5 matches
Ask the model to answer from them

It works in a demo because the question is usually close to the text that was embedded.

It fails in production because enterprise questions are rarely simple lookup questions.

Where naive RAG breaks

One deployment involved a legacy ERP with roughly 180 tables, inconsistent column names, and several business concepts spread across multiple modules.

A simple semantic RAG baseline gave acceptable answers for definition-style questions, but it broke badly on operational questions.

Examples:

“Which customers had delayed payments last quarter?”
“Show vendors with duplicate tax IDs.”
“Which orders were approved but not dispatched?”
“What changed in receivables compared to the previous period?”

The issue was not that the model could not reason.

The issue was that the retrieved context was incomplete.

For one receivables query, the retriever pulled chunks describing invoice status, but missed payment allocation rules stored in a separate table. The answer looked confident, but the number was wrong by about 18%.

That is the dangerous failure mode with structured data RAG.

The model does not say, “I am missing the join path.”

It answers with the partial context it has.

Pattern that failed: embedding raw table exports

This is the most common failed pattern.

Teams dump rows or reports into text, embed them, and expect semantic search to behave like a database engine.

It does not.

Embedding raw rows loses too much structure:

Joins are not explicit
Column meanings are ambiguous
Filters are applied after retrieval, not before
Aggregations are guessed instead of computed
Similar-looking records compete with the correct records

In one benchmark, raw row-level RAG answered simple entity lookup questions reasonably well, but failed most aggregation questions.

The failure rate was highest when the answer required joins across three or more tables.

That is expected. Vector search is not designed to discover relational execution plans.

Pattern that failed: embedding schema docs only

Another pattern is embedding schema documentation and asking the model to infer the query.

This helps with table discovery, but it is not enough.

The model may identify the right table family, but still miss business rules like:

Which status values count as active
Which date field represents the business event
Whether cancelled records should be excluded
Whether amounts are gross, net, posted, pending, or reversed

We tested this on ERP-style reporting questions.

Schema-doc RAG improved table selection, but still produced unstable answers because it did not enforce execution rules.

The system knew where to look, but not how to calculate.

That difference matters.

Pattern that works: schema-aware retrieval plus SQL execution

For structured data, the strongest pattern is usually not pure RAG.

It is schema-aware retrieval feeding a constrained query-generation layer.

The retrieval layer should find:

Relevant tables
Column definitions
Join paths
Business rules
Known query examples
Validation checks

Then the model generates SQL or an intermediate query plan.

The database computes the result.

The model explains it.

That separation is important.

The model should not calculate totals from retrieved text when the database can calculate them exactly.
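
A small sketch of that separation, using an in-memory SQLite database as a stand-in for the real ERP. The `invoices` table, its rows, and the `run_generated_sql` helper are hypothetical; the point is only that the total comes from the database, and the model receives the computed rows to explain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT
    );
    INSERT INTO invoices (customer, amount, status) VALUES
        ('Acme', 1200.0, 'posted'),
        ('Acme', 300.0, 'cancelled'),
        ('Globex', 450.5, 'posted');
    """
)


def run_generated_sql(connection, sql, params=()):
    """Execute model-generated SQL; the database does the arithmetic."""
    return connection.execute(sql, params).fetchall()


# In a real system this string would come from the constrained generation step.
generated_sql = (
    "SELECT customer, SUM(amount) FROM invoices "
    "WHERE status = ? GROUP BY customer ORDER BY customer"
)
rows = run_generated_sql(conn, generated_sql, ("posted",))

# Only the executed query and its rows go to the model for explanation.
explanation_input = {"sql": generated_sql, "rows": rows}
```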

In one internal evaluation, moving from raw text RAG to schema-aware retrieval plus SQL execution improved accuracy on structured reporting questions from the low 40% range to the high 70% range.

The remaining failures were not random. Most came from ambiguous business definitions, especially when the same metric had different meanings across departments.

That is fixable with evaluation sets and explicit metric definitions.

Pattern that works: business concept maps

Enterprise users do not ask questions in table names.

They ask in business language.

“Delayed payment” might map to:

Invoice due date
Payment posting date
Allocation status
Partial payment rules
Customer credit terms
Reversal handling

If the system only retrieves by schema similarity, it misses this.

We maintain business concept maps for production systems.

A concept map links business terms to tables, fields, filters, joins, and known edge cases.

This reduces one of the most common failure modes: the model choosing the right-looking field instead of the correct business field.

For example, “order date” may exist in multiple places:

Created date
Approved date
Confirmed date
Dispatch date
Posted date

A generic model will often choose the most obvious one.

A managed AI layer should choose the one that matches the business definition.
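
One possible shape for a concept map, as a rough sketch. The table names, field names, and filter strings below are invented for illustration and do not come from any real deployment; the substring matching is deliberately naive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    table: str
    date_field: str
    filters: tuple = ()
    edge_cases: tuple = ()


# Hypothetical entries: business term -> where and how to look.
CONCEPT_MAP = {
    "delayed payment": Concept(
        table="invoice",
        date_field="payment_posting_date",
        filters=("payment_posting_date > due_date",),
        edge_cases=("partial payments", "reversals"),
    ),
    "order date": Concept(
        table="sales_order",
        date_field="confirmed_date",
        edge_cases=("created_date and dispatch_date also exist",),
    ),
}


def resolve_concepts(question):
    """Return every mapped business term that appears in the question."""
    lowered = question.lower()
    return {term: c for term, c in CONCEPT_MAP.items() if term in lowered}


found = resolve_concepts("Which customers had delayed payments last quarter?")
```

The resolved entry, not the raw question, is what would be handed to the query-generation step, so the date field is chosen by definition rather than by how obvious its name looks.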

Pattern that works: retrieval by task type

Not every enterprise question should trigger the same pipeline.

We usually separate structured-data questions into task types:

Lookup
Aggregation
Comparison
Exception detection
Explanation
Policy or SOP reference

A lookup question can use a lighter path.

An aggregation question should usually go through SQL execution.

An explanation question may need both data and policy context.

This routing matters because the wrong pipeline can produce an answer that sounds correct but is operationally useless.

A common example is exception detection.

If a user asks, “Which invoices look abnormal this month?” naive RAG retrieves invoices containing similar language.

That is not anomaly detection.

The system needs computed baselines, thresholds, historical comparison, and sometimes human-defined rules.
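
A toy router makes the idea concrete. The keyword rules and pipeline names here are assumptions for the sketch; a production classifier would more likely be a model call or a trained classifier than a keyword list.

```python
import re
from enum import Enum


class TaskType(Enum):
    LOOKUP = "lookup"
    AGGREGATION = "aggregation"
    COMPARISON = "comparison"
    EXCEPTION = "exception_detection"
    EXPLANATION = "explanation"
    POLICY = "policy_reference"


# Checked in order; the first matching rule wins. Keywords match word prefixes.
KEYWORD_RULES = [
    (TaskType.EXCEPTION, ("abnormal", "anomal", "duplicate", "unusual")),
    (TaskType.COMPARISON, ("compared", "versus", "previous period")),
    (TaskType.AGGREGATION, ("total", "sum", "count", "average", "how many")),
    (TaskType.POLICY, ("policy", "sop", "procedure")),
    (TaskType.EXPLANATION, ("why", "explain")),
]

# Hypothetical pipeline names.
PIPELINES = {
    TaskType.LOOKUP: "light_lookup",
    TaskType.AGGREGATION: "sql_execution",
    TaskType.COMPARISON: "sql_execution",
    TaskType.EXCEPTION: "baseline_rules",
    TaskType.EXPLANATION: "sql_plus_policy_context",
    TaskType.POLICY: "document_rag",
}


def classify(question):
    lowered = question.lower()
    for task_type, keywords in KEYWORD_RULES:
        if any(re.search(r"\b" + re.escape(k), lowered) for k in keywords):
            return task_type
    return TaskType.LOOKUP  # default to the lightest path


def route(question):
    return PIPELINES[classify(question)]
```

With these rules, "Which invoices look abnormal this month?" goes to the baseline-and-rules path instead of a similarity search over invoice text.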

Pattern that works: answer validation before response

For production systems, we do not treat the first model answer as final.

We validate it.

Useful checks include:

Did the generated query use approved tables?
Did it include required filters?
Did it avoid deprecated fields?
Did row counts look plausible?
Did totals reconcile with known reports?
Did the answer cite the executed query or source record?

One failure we saw repeatedly was date leakage.

The model would use invoice creation date when the metric required posting date.

The answer was syntactically valid and semantically plausible, but financially wrong.

A validation rule caught it because that metric was only allowed to use posting date.
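
A rule like that can be approximated with a few string checks. This is a simplified sketch: the table list, deprecated field, metric name, and rule format are hypothetical, and the regex-based table extraction would miss many SQL shapes that a real parser handles.

```python
import re

APPROVED_TABLES = {"invoice", "payment"}
DEPRECATED_FIELDS = {"legacy_amount"}
# Hypothetical rule: this metric may only be computed on posting_date.
METRIC_RULES = {"posted_revenue": {"required_date_field": "posting_date"}}


def referenced_tables(sql):
    """Naively collect identifiers that follow FROM or JOIN."""
    pattern = r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)"
    return {name.lower() for name in re.findall(pattern, sql, flags=re.IGNORECASE)}


def validate_sql(sql, metric=None):
    """Return a list of problems; an empty list means the query passed."""
    problems = []
    lowered = sql.lower()
    unapproved = sorted(referenced_tables(sql) - APPROVED_TABLES)
    if unapproved:
        problems.append("unapproved tables: " + ", ".join(unapproved))
    for field in sorted(DEPRECATED_FIELDS):
        if re.search(r"\b" + re.escape(field) + r"\b", lowered):
            problems.append("deprecated field: " + field)
    required = METRIC_RULES.get(metric, {}).get("required_date_field")
    if required and not re.search(r"\b" + re.escape(required) + r"\b", lowered):
        problems.append(f"metric '{metric}' must use {required}")
    return problems
```

A query that filters on `create_date` for `posted_revenue` comes back with one problem, which is enough to block the response or trigger a regeneration.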

That is the difference between a chatbot and a managed AI operation.

What we learned

For structured enterprise data, RAG is useful, but only when it is not asked to do the database’s job.

Naive RAG works for:

Definitions
SOP references
Field descriptions
Simple record lookup
Explaining existing reports

Naive RAG fails for:

Multi-table joins
Aggregations
Period comparisons
Financial calculations
Exception detection
Metrics with department-specific definitions

The better architecture is usually hybrid:

RAG for context
SQL or APIs for computation
Business concept maps for meaning
Evaluations for regression detection
Validation rules for production safety

This is why we describe BrainPack as Enterprise AI Operating Infrastructure, not a one-time chatbot deployment.

The work does not end when the agent answers the first question.

The real work starts when users ask the 500th question, the schema changes, a department redefines a metric, and the model still has to answer safely.

That only happens when the AI layer is monitored, evaluated, re-prompted, and maintained continuously.

What Broke In Our Voice Agent In Production

Zeeshan Ghazanfar — Wed, 29 Apr 2026 17:00:52 +0000

I work on the AI layer at BrainPack - agents that run against real enterprise systems, not clean demo sandboxes.

One failure mode we hit with production voice agents was not "the model gave a bad answer."

It was quieter than that.

The agent needed to call a business tool, wait on a database-backed workflow, and then come back with a useful spoken answer. In a demo, that looks fine. In production, the user hears silence.

Silence is a failure mode.

For text chat, a 5-second delay is usually tolerable if the UI shows a loading state. For voice, even 2 or 3 seconds of unexplained silence feels broken. If the tool call takes longer, users start repeating themselves, interrupting, or assuming the call dropped.

In our LiveKit voice stack, the production configuration had several moving parts:

STT provider and model
LLM provider and model
TTS provider and model
Turn detection
Voice activity detection
Tool execution
Transcript storage
User information capture
Optional Talk-to-DB access
Call recording metadata

The first version treated tool latency mostly as a backend problem.

That was wrong.

Tool latency is also a conversation design problem.

What Changed

1. We added mandatory pre-tool speech

Before a tool runs, the agent now gives a short spoken update like:

Let me check that for you.

Not a paragraph. Not fake confidence. Just a small signal that the call is alive.
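
The ordering is the important part, and it can be sketched with plain asyncio. Nothing here is LiveKit API: `speak` and `slow_lookup` are hypothetical stand-ins for the TTS call and the database-backed tool.

```python
import asyncio


async def speak(text, spoken):
    """Stand-in for a TTS call; it only records what would be said."""
    spoken.append(text)


async def slow_lookup(question):
    """Stand-in for a database-backed tool with noticeable latency."""
    await asyncio.sleep(0.05)
    return f"Result for: {question}"


async def answer_with_pre_tool_speech(question, spoken):
    # Start the tool first so the filler line does not add latency.
    tool_task = asyncio.create_task(slow_lookup(question))
    # The caller hears something while the tool is still running.
    await speak("Let me check that for you.", spoken)
    result = await tool_task
    await speak(result, spoken)
    return result


spoken_log = []
asyncio.run(answer_with_pre_tool_speech("open invoices for Acme", spoken_log))
```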

2. We separated normal conversation from data-backed questions

The assistant should not call Talk-to-DB for greetings, policy questions, or small talk. It should only route to data when the user is clearly asking about records, reports, counts, trends, metrics, filters, comparisons, or other database-backed facts.

This reduced unnecessary tool calls.

3. We added a voice-specific path for oversized answers

Some database answers are too large to read aloud. The system now detects when a Talk-to-DB response is too large and can move the full answer into a PDF email flow instead of forcing a bad voice experience.
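
A rough sketch of that decision. The word and row limits, the channel names, and the spoken fallback line are assumptions for illustration; the real thresholds would be tuned on actual calls.

```python
# Hypothetical limits for what is comfortable to hear read aloud.
MAX_SPOKEN_WORDS = 60
MAX_SPOKEN_ROWS = 5


def choose_delivery(answer_text, row_count):
    """Pick voice for short answers and a PDF email flow for large ones."""
    too_long = len(answer_text.split()) > MAX_SPOKEN_WORDS
    too_many_rows = row_count > MAX_SPOKEN_ROWS
    if too_long or too_many_rows:
        return {
            "channel": "pdf_email",
            "spoken": "That is a long result, so I will email you the full report as a PDF.",
        }
    return {"channel": "voice", "spoken": answer_text}
```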

4. We made the waiting state audible

The LiveKit agent config includes ambient and busy audio settings:

Ambient volume: 0.5
Busy volume: 1.0
Keyboard busy probability: 0.8
Mouse busy probability: 0.2

These numbers are not magic. They are just explicit controls so the waiting state can be tuned and measured instead of left to vibes.
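
One way to keep such settings explicit is a small validated config object. This sketch is not the LiveKit configuration format; the class name and the 0 to 1 range are assumptions, and it assumes the two probabilities act as relative weights for choosing which busy sound to play.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WaitingAudioConfig:
    ambient_volume: float = 0.5
    busy_volume: float = 1.0
    keyboard_busy_probability: float = 0.8
    mouse_busy_probability: float = 0.2

    def __post_init__(self):
        # Assumed valid range for every setting: 0.0 to 1.0 inclusive.
        for name, value in vars(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def pick_busy_sound(self, rng):
        """Choose a busy sound using the two probabilities as weights."""
        weights = [self.keyboard_busy_probability, self.mouse_busy_probability]
        return rng.choices(["keyboard", "mouse"], weights=weights, k=1)[0]


config = WaitingAudioConfig()
sound = config.pick_busy_sound(random.Random(0))
```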

5. We treated the assistant as a maintained system

The agent has a tracked worker state:

idle
building
starting
running
stopping
stopped
unhealthy
error

That matters because voice agents fail in operational ways too. Containers stop. Providers change behavior. Prompt instructions decay as new tools are added. Turn detection that works in one environment can behave badly in another.

The Lesson

Production voice AI is not just model selection.

The model is one part of a longer chain:

speech in -> transcript -> intent -> tool decision -> external system -> response planning -> speech out

Every link can fail.

At BrainPack, fully managed AI means we keep watching those links after launch. We monitor transcripts, tool behavior, silence points, model drift, prompt behavior, and worker health. Then we re-prompt, re-evaluate, and adjust the system when production exposes something the demo did not.

Most voice agent failures do not look dramatic in logs.

Sometimes the real bug is a user waiting in silence.

That is still a bug.

GPT-4 launched at $30 per million tokens. Sixteen months later, the same class of output costs ~15 cents. Roughly a 200x drop.

Zeeshan Ghazanfar — Tue, 28 Apr 2026 07:33:18 +0000

Most people stop the analysis there.

We didn’t.

At BrainPack, we run agents in production environments - against real systems, with real failure consequences. The cost drop is real, but raw token price is not what determines value.

Here is what we see in practice.

A simple agent running 24/7/365 costs a few hundred dollars a year. On paper, that is 4 to 6 cents per hour across 8,760 hours.
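
The per-hour figure is easy to check. The helper name below is made up; the inputs are the 4 to 6 cents per hour quoted above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a non-leap year


def annual_cost_dollars(cents_per_hour):
    """Convert an hourly cost in cents to a yearly cost in dollars."""
    return cents_per_hour / 100 * HOURS_PER_YEAR


low = annual_cost_dollars(4)   # about 350 dollars
high = annual_cost_dollars(6)  # about 526 dollars
```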

But that number is misleading if you don’t control for failure.

In early deployments, before orchestration:

Task success rate: ~62%
Silent logical errors: ~14%
Human review required: ~38% of outputs

Cheap tokens did not help here. They just made failure cheaper.

This is where most teams get stuck. They deploy a model, see low cost, and assume they have leverage. In reality, they have a system that produces inconsistent output at scale.

What actually matters is usable output per dollar.

This is the layer we build at BrainPack.

We don’t treat the model as the system. We treat it as one component inside a controlled execution loop.

What changed the economics for us:

Orchestration over raw inference

We run multi-step agents:

retrieval before generation
constrained execution paths
post-generation validation

This alone moved task success from ~62% to ~81% in one deployment.

Structured output enforcement

Free-form responses fail in production.

We enforce:

schema-bound outputs
strict validation
retries on failure

This reduced silent logical errors from ~14% to under 5%.
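
A minimal sketch of schema-bound output with retries. The required fields, the `generate` callable, and its `(prompt, previous_error)` signature are hypothetical; a real setup might use a JSON Schema validator or the provider's structured-output mode instead of hand-written checks.

```python
import json

# Hypothetical output contract: field name -> accepted Python types.
REQUIRED_FIELDS = {"sql": str, "tables": list, "confidence": (int, float)}


class OutputValidationError(ValueError):
    pass


def parse_structured_output(raw):
    """Parse model output and reject anything that breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputValidationError("top-level value must be an object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise OutputValidationError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise OutputValidationError(f"wrong type for field: {name}")
    return data


def generate_with_retries(generate, prompt, max_attempts=3):
    """Call generate until its output validates, feeding back the last error."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt, last_error)
        try:
            return parse_structured_output(raw)
        except OutputValidationError as exc:
            last_error = str(exc)
    raise OutputValidationError(f"gave up after {max_attempts} attempts: {last_error}")
```

Passing the previous error back into the next attempt gives the model something specific to fix instead of a blind retry.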

Evaluation in the loop

We don’t evaluate once. We continuously measure:

task success
failure types
drift over time

Agents get re-prompted and adjusted based on real logs, not static benchmarks.

Model routing

Not all tasks need the same model.

We route:

smaller models for deterministic steps
stronger models only where reasoning is required

This cut cost by ~40% without reducing accuracy.

After orchestration:

Task success rate: ~89%
Silent logical errors: ~4%
Human review: down to ~11%

Now the cost advantage becomes real.
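
To see why, it helps to price a usable output rather than a token. The sketch below uses the success and review rates quoted above; the per-task model cost and per-review human cost are made-up numbers, and the formula itself is one simple way to frame it, not a standard metric.

```python
def cost_per_usable_output(model_cost, review_cost, success_rate, review_rate):
    """Expected spend per task divided by the fraction of tasks that succeed."""
    expected_spend = model_cost + review_rate * review_cost
    return expected_spend / success_rate


MODEL_COST = 0.01   # hypothetical dollars of inference per task
REVIEW_COST = 2.00  # hypothetical dollars per human review

before = cost_per_usable_output(MODEL_COST, REVIEW_COST, success_rate=0.62, review_rate=0.38)
after = cost_per_usable_output(MODEL_COST, REVIEW_COST, success_rate=0.89, review_rate=0.11)
```

With these illustrative costs, a usable output drops from about $1.24 to about $0.26. Almost all of that comes from fewer reviews and fewer failures, not from the token price.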

This is the difference most discussions miss.

The price curve has moved. That is true.

But without orchestration, you are scaling inconsistency.

At BrainPack, we focus on making AI systems usable every day - not just cheap to run.

The leverage is not in lower token cost.

It is in turning that cost into reliable output.

We measured 72% → 91% accuracy on Text-to-SQL over a 600-table ERP - what actually fixed it

Zeeshan Ghazanfar — Mon, 27 Apr 2026 11:45:17 +0000

We deployed a Text-to-SQL agent on a legacy ERP with 612 tables, ~8,400 columns, and inconsistent naming across modules. Initial offline eval looked acceptable. Production said otherwise.

Baseline

Model: GPT-4 class via API
Context: full schema dump + user query
Prompt: standard "generate SQL from question"
Eval set: 220 business questions from analysts

Results:

Exact match accuracy: 72%
Execution success: 81%
Latency: ~4.8s avg
Production failure rate (week 1): 31% required human correction

What failed

Schema overload

600+ tables in context diluted attention
Wrong joins between similarly named tables like orders vs order_hdr vs order_archive

Semantic mismatch

Business language did not match schema
Example: "revenue" mapped to total_amount instead of net_sales

Join path ambiguity

Multiple valid joins existed
Model picked shortest path, not correct one

Silent logical errors

Queries executed but returned wrong numbers
Hardest to detect

What moved accuracy to 91%

This was not fixed with prompting alone. We changed the system.

Schema retrieval layer

Instead of passing full schema:

Embedded table names, column names, sample values
Retrieved top 8–15 relevant tables per query

Impact:

Accuracy: 72% → 84%
Latency reduced by ~1.2s

Join graph constraints

Built join graph from DB metadata + query logs
Forced model to only use valid relationships

Impact:

Accuracy: 84% → 88%
~60% reduction in incorrect joins
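
A simplified sketch of a join allow-list. The table pairs and join conditions are hypothetical, and a real graph would be built from foreign keys and query logs rather than written by hand.

```python
# Hypothetical approved relationships: unordered table pair -> join condition.
ALLOWED_JOINS = {
    frozenset({"sales_order_hdr", "sales_order_line"}):
        "sales_order_hdr.id = sales_order_line.order_id",
    frozenset({"sales_order_hdr", "client"}):
        "sales_order_hdr.client_id = client.id",
    frozenset({"sales_order_line", "product"}):
        "sales_order_line.product_id = product.id",
}


def joins_for_prompt(tables):
    """Join conditions the model may use among the retrieved tables."""
    selected = set(tables)
    return sorted(cond for pair, cond in ALLOWED_JOINS.items() if pair <= selected)


def invalid_joins(proposed_pairs):
    """Table pairs in a generated plan that are not approved relationships."""
    return [pair for pair in proposed_pairs if frozenset(pair) not in ALLOWED_JOINS]
```

The first function limits what the model sees; the second rejects a plan that invents a relationship anyway, such as joining `client` directly to `product`.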

Business term mapping

Added translation layer:

revenue → net_sales
customer → client_id
orders → sales_order_hdr

Hybrid approach (rules + embeddings) worked best

Impact:

Accuracy: 88% → 90%
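
A small sketch of the rules-first, similarity-second idea. The three rules are the mappings listed above; `difflib` string similarity is used here only as a dependency-free stand-in for the embedding lookup, and the 0.8 cutoff is an assumption.

```python
import difflib

# Exact rules, taken from the mappings above.
TERM_RULES = {
    "revenue": "net_sales",
    "customer": "client_id",
    "orders": "sales_order_hdr",
}


def map_term(term, cutoff=0.8):
    """Return (schema_name, how_it_matched) for a business term."""
    key = term.strip().lower()
    if key in TERM_RULES:
        return TERM_RULES[key], "rule"
    # Stand-in for embedding similarity: closest known term by string ratio.
    close = difflib.get_close_matches(key, list(TERM_RULES), n=1, cutoff=cutoff)
    if close:
        return TERM_RULES[close[0]], "fuzzy"
    return None, "unmapped"
```

Returning how the match was made keeps fuzzy hits visible in logs, so they can be promoted to rules or corrected.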

SQL self-check and retry

Second pass:

Validate generated SQL against schema and intent
Allow one retry

Impact:

Accuracy: 90% → 91%
Execution success: 81% → 93%
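
The schema half of that check can be sketched with SQLite, which rejects unknown tables and columns when asked to plan a statement. This covers only the schema check, not intent; the `generate` callable and its `(question, previous_error)` signature are hypothetical.

```python
import sqlite3


def check_sql(conn, sql):
    """Return None if SQLite can plan the statement, else the error text."""
    try:
        conn.execute("EXPLAIN " + sql)  # plans the query without running it
        return None
    except sqlite3.Error as exc:
        return str(exc)


def generate_checked_sql(conn, question, generate, max_retries=1):
    """Generate SQL, validate it against the schema, and allow one retry."""
    error = None
    for _ in range(max_retries + 1):
        sql = generate(question, error)
        error = check_sql(conn, sql)
        if error is None:
            return sql
    raise ValueError(f"SQL still invalid after retry: {error}")
```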

Evaluation change

Switched to:

exact match OR result equivalence
added real production edge cases

Measured accuracy dropped at first, then stabilized once the eval set reflected real production cases
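
A sketch of an "exact match OR result equivalence" scorer, again using SQLite as a stand-in database. The function names are made up, and the comparison ignores row order, so questions where ordering matters would need a stricter check.

```python
import sqlite3
from collections import Counter


def normalize(sql):
    """Collapse whitespace, drop a trailing semicolon, and lowercase."""
    return " ".join(sql.strip().rstrip(";").split()).lower()


def results_equivalent(conn, predicted, gold):
    """Compare result sets as multisets of rows, ignoring row order."""
    rows_predicted = Counter(conn.execute(predicted).fetchall())
    rows_gold = Counter(conn.execute(gold).fetchall())
    return rows_predicted == rows_gold


def is_correct(conn, predicted, gold):
    if normalize(predicted) == normalize(gold):
        return True
    try:
        return results_equivalent(conn, predicted, gold)
    except sqlite3.Error:
        return False  # a query that does not execute counts as wrong
```

Scoring this way stops penalizing queries that are written differently but return the same rows, while a query that runs and returns different numbers still fails.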

What did not work

Larger models: <2% gain
Longer prompts: worse performance
Few-shot examples: inconsistent results
Free schema exploration: more hallucinated joins

Final production metrics

Accuracy: 91%
Execution success: 93%
Avg latency: 3.6s
Human intervention: 31% → 9%

Still unsolved

Complex financial aggregations
Time-based comparisons across inconsistent date fields
Implicit business rules not present in DB

Bottom line

Text-to-SQL at enterprise scale is not a prompt problem.

It is:

retrieval
constraints
evaluation

Without controlling schema exposure and join paths, accuracy plateaus early.