DEV Community: Trilok Kanwar

How to Detect Agent Instability Before Production

Trilok Kanwar — Tue, 03 Mar 2026 18:03:17 +0000

When building conversational agents, I made a mistake early on.
I validated prompts with single responses.

Everything looked great until real conversations happened.

By turn 3 or 4:
-constraints softened
-tone drifted
-instructions faded

The insight: users experience conversations, not outputs.

So I changed the workflow. Every prompt edit now gets tested across multiple multi-turn conversations immediately. It exposed instability that single-response testing never revealed.

That shift made iteration more structured and less reactive.

If you're building chat or voice agents, consider validating trajectories, not just responses.

I’ve documented the workflow here: https://shorturl.at/r7sfP

What we learned from 100+ production RAG deployments (free 118-page handbook)

Trilok Kanwar — Tue, 17 Feb 2026 18:13:00 +0000

We’ve been building RAG systems for a while and wanted to share a resource we just published. It’s a 118-page handbook covering the patterns that separate prototype RAG from production RAG.

If you’re building RAG right now, here are the problems this covers:

Your vector search returns “close enough” results instead of exact matches. The handbook covers hybrid retrieval that runs semantic and keyword search in parallel.
Your chunking splits documents in weird places. It covers semantic chunking, code-aware chunking using ASTs, and parent-child structures that keep context intact.
You have no idea if your retrieval is actually good. It covers evaluation frameworks that work without manually labeling test data.
Your costs keep growing and you can’t figure out why. It covers production observability that traces every step of your pipeline.

It also has dedicated chapters on building RAG for specific domains: code generation, text-to-SQL, legal search, and medical knowledge retrieval. Each one has different failure modes that generic approaches miss.

Free PDF - https://shorturl.at/rRXXP

Would love to hear what problems others are hitting with production RAG, always helps to know what to cover next.

Reimagining Synthetic Data Generation at Future AGI

Trilok Kanwar — Fri, 13 Feb 2026 17:45:21 +0000

We’ve been quietly upgrading synthetic data generation at Future AGI.

Here’s what’s new:

Grounded generation tied to uploaded knowledge bases (~90% coverage observed)

-1.78× faster dataset creation

-Non-linear scaling as dataset size increases

-Mid-generation editing support

-Improved diversity beyond 5,000 rows

-SOP-driven scenario generation with edge cases

-One-click variable generation for prompt testing

For teams in regulated industries, voice AI, or LLM evaluation workflows, this reduces manual overhead significantly.

There’s more in the changelog. We’ll break it down in a separate post.

Synthetic Data Generation: https://shorturl.at/Osgwr

Prompt Optimization, Not Prompt Guessing

Trilok Kanwar — Thu, 12 Feb 2026 18:49:34 +0000

In sales, support, and fintech workflows, teams rely on prompts to classify conversations, extract signals, and route decisions.

A skilled prompt engineer can make 100 examples look perfect.

That is exactly the problem.

Here’s the contradiction nobody talks about:
the more skilled you are at writing prompts, the more dangerous your process becomes.

Because intuition works on small samples.
It does not generalize to 10,000 inputs, multiple failure modes, and cost constraints you have not measured.

Expert intuition produces prompts that feel right.
But they cannot be reliably reproduced, versioned, or defended with metrics.

The fix is not better intuition.

It is replacing intuition with an objective function.

Dataset → Evaluator → Optimizer → Ranked prompts.

This is the same class of problem as hyperparameter tuning.
We just forgot to treat it that way.

Our team documented the full workflow in a cookbook.
https://shorturl.at/aI0zg

Keeping multimodal experimentation in one place

Trilok Kanwar — Tue, 10 Feb 2026 20:10:48 +0000

Teams building with image generation models or vision pipelines often hit the same problem. The model produces an image, but you cannot see it where the prompt lives.

That makes reviewing quality manual, comparing runs messy, and iteration slower than it should be.

We just shipped native image rendering inside Datasets and Prompt Workbench. Generated images now appear directly next to the prompts that created them.

This means:

Faster output review
Easy visual comparison across runs
Iteration without switching tools or losing context

Prompting, generating, reviewing, and experimenting now happen in one place. Multimodal workflows finally get tooling that matches how they actually work.

Mutimodal - Image Generation in Datasets & Prompt: https://shorturl.at/athOG

Your Agent Is Slow Because of Inference

Trilok Kanwar — Fri, 06 Feb 2026 15:36:26 +0000

When an agent feels sluggish, the instinct is to blame reasoning quality.

But in agentic AI systems, reasoning is rarely the real problem.

Inference today looks like:

-planning a path forward
-calling tools
-waiting on external systems
-re-planning based on outputs
-generating a final response across long sessions

That entire loop is inference.

In a recent chat with Yunmo and Alex from FriendliAI, we explored why inference has quietly become the biggest bottleneck in agent performance and how teams are optimizing for it.

The key shift:
Latency, throughput, and cost aren’t infra trade-offs anymore. They’re product decisions.

If you’re building agentic systems, this is worth rethinking.

▶️ Full webinar link: https://shorturl.at/moj3x