<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shreeni D</title>
    <description>The latest articles on DEV Community by Shreeni D (@shreeni_d).</description>
    <link>https://dev.to/shreeni_d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869011%2F8a6e4588-bd8a-420f-8aed-d61f59490003.png</url>
      <title>DEV Community: Shreeni D</title>
      <link>https://dev.to/shreeni_d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shreeni_d"/>
    <language>en</language>
    <item>
      <title>Stop Trying to Prompt Your Way Out of a Hallucination</title>
      <dc:creator>Shreeni D</dc:creator>
      <pubDate>Tue, 21 Apr 2026 16:02:30 +0000</pubDate>
      <link>https://dev.to/shreeni_d/stop-trying-to-prompt-your-way-out-of-a-hallucination-3hkj</link>
      <guid>https://dev.to/shreeni_d/stop-trying-to-prompt-your-way-out-of-a-hallucination-3hkj</guid>
      <description>&lt;p&gt;I learned this the hard way.&lt;/p&gt;

&lt;p&gt;I recently built a personal agent to map dependencies across my local repositories. The goal was simple: ask a question like &lt;em&gt;“Which projects use this specific library version?”&lt;/em&gt; and get a clean, reliable answer.&lt;/p&gt;

&lt;p&gt;What I got instead was something else entirely.&lt;/p&gt;

&lt;p&gt;The agent responded with confidence. The formatting was pristine. The explanation was coherent. And the answer was completely fabricated.&lt;/p&gt;

&lt;p&gt;It even referenced a directory that didn’t exist.&lt;/p&gt;

&lt;p&gt;I spent time chasing that ghost—digging through my filesystem, double-checking paths—before realizing what had happened: the model had &lt;em&gt;guessed&lt;/em&gt;. It saw a pattern in my project names and filled in the blanks with something that &lt;em&gt;looked&lt;/em&gt; right.&lt;/p&gt;

&lt;p&gt;That’s when it clicked.&lt;/p&gt;

&lt;h2&gt;The Problem Isn’t the Prompt&lt;/h2&gt;

&lt;p&gt;My first instinct was to tweak the prompt.&lt;/p&gt;

&lt;p&gt;Maybe I needed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be more explicit&lt;/li&gt;
&lt;li&gt;Add constraints&lt;/li&gt;
&lt;li&gt;Tell it to “only use verified data”&lt;/li&gt;
&lt;li&gt;Emphasize accuracy over completeness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But none of that actually solves the problem.&lt;/p&gt;

&lt;p&gt;Because hallucination isn’t a prompt failure. It’s a system design failure.&lt;/p&gt;

&lt;p&gt;You’re asking a probabilistic model to behave like a deterministic system—and no amount of prompt engineering will change that.&lt;/p&gt;

&lt;h2&gt;The Fix: Add a Source of Truth&lt;/h2&gt;

&lt;p&gt;The real solution wasn’t better wording. It was better architecture.&lt;/p&gt;

&lt;p&gt;Instead of asking the model to &lt;em&gt;know&lt;/em&gt;, I required it to &lt;em&gt;check&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I had the AI generate a Python script that scans my local repositories and verifies whether a given library version actually exists. Then I wired that script into the agent’s workflow with a hard rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The agent must execute the verification step before responding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the script finds matches, great—the model can explain and format the results.&lt;/p&gt;

&lt;p&gt;If it doesn’t?&lt;/p&gt;

&lt;p&gt;The agent is explicitly forced to say:&lt;br&gt;
&lt;strong&gt;“I don’t have enough information.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No guessing. No filling in gaps. No “best effort” answers.&lt;/p&gt;
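
&lt;p&gt;For reference, here’s a minimal sketch of that verification step. The repo root and the &lt;code&gt;requirements.txt&lt;/code&gt; convention are assumptions for illustration; the real script the agent generated can differ:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch: verify a library version actually exists on disk
# before the agent is allowed to answer. Paths are illustrative.
from pathlib import Path

def repos_using(library, version, root="~/repos"):
    target = f"{library}=={version}"
    hits = []
    for req in Path(root).expanduser().glob("*/requirements.txt"):
        if target in req.read_text():
            hits.append(str(req.parent))
    return hits

def verified_answer(library, version):
    repos = repos_using(library, version)
    if not repos:
        return "I don't have enough information."  # the hard rule
    return repos  # only now may the model explain and format the results
&lt;/code&gt;&lt;/pre&gt;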

&lt;h2&gt;Let the Model Build Its Own Guardrails&lt;/h2&gt;

&lt;p&gt;The interesting part is that I still used the LLM—but not as a source of truth.&lt;/p&gt;

&lt;p&gt;I used it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate the verification script&lt;/li&gt;
&lt;li&gt;Define the workflow&lt;/li&gt;
&lt;li&gt;Integrate reasoning with execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the model helped build the system that limits its own behavior.&lt;/p&gt;

&lt;p&gt;That’s the shift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From &lt;em&gt;magic&lt;/em&gt; → to &lt;em&gt;mechanism&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;em&gt;answers&lt;/em&gt; → to &lt;em&gt;process&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;em&gt;trusting outputs&lt;/em&gt; → to &lt;em&gt;verifying inputs&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Reasoning Engine, Not Database&lt;/h2&gt;

&lt;p&gt;LLMs are incredibly good at reasoning over information.&lt;/p&gt;

&lt;p&gt;They are not reliable sources of truth.&lt;/p&gt;

&lt;p&gt;If your agent is answering questions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your codebase&lt;/li&gt;
&lt;li&gt;Your infrastructure&lt;/li&gt;
&lt;li&gt;Your documents&lt;/li&gt;
&lt;li&gt;Your data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then the model should never be the final authority.&lt;/p&gt;

&lt;p&gt;It should sit &lt;em&gt;on top&lt;/em&gt; of a system that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve real data&lt;/li&gt;
&lt;li&gt;Execute checks&lt;/li&gt;
&lt;li&gt;Enforce constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your LLM is the brain.&lt;br&gt;
Your code is the nervous system.&lt;br&gt;
Your data is reality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If those aren’t connected, you don’t have intelligence—you have improv.&lt;/p&gt;

&lt;h2&gt;Stop Prompting, Start Designing&lt;/h2&gt;

&lt;p&gt;When an AI system fails, it’s tempting to stay in the prompt layer. It feels fast, iterative, and controllable.&lt;/p&gt;

&lt;p&gt;But that’s often just avoiding the real work.&lt;/p&gt;

&lt;p&gt;If your agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinates&lt;/li&gt;
&lt;li&gt;Makes unverifiable claims&lt;/li&gt;
&lt;li&gt;Invents structure&lt;/li&gt;
&lt;li&gt;Sounds right but isn’t&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…don’t rewrite the prompt.&lt;/p&gt;

&lt;p&gt;Fix the architecture.&lt;/p&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic checks&lt;/li&gt;
&lt;li&gt;Tooling integrations&lt;/li&gt;
&lt;li&gt;Execution steps&lt;/li&gt;
&lt;li&gt;Clear failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliability doesn’t come from convincing the model to behave.&lt;/p&gt;

&lt;p&gt;It comes from building a system where it &lt;em&gt;can’t misbehave without being caught&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;A Better Default&lt;/h2&gt;

&lt;p&gt;Here’s a simple rule that has held up well for me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the answer depends on real-world state, the model must verify it before speaking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one constraint eliminates an entire class of hallucinations.&lt;/p&gt;

&lt;h2&gt;The Real Question&lt;/h2&gt;

&lt;p&gt;So now I’m curious:&lt;/p&gt;

&lt;p&gt;When your AI fails, do you reach for the prompt…&lt;/p&gt;

&lt;p&gt;—or the architecture?&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Observability in agentic and AI applications: the essential roles of monitoring and evaluation</title>
      <dc:creator>Shreeni D</dc:creator>
      <pubDate>Tue, 14 Apr 2026 03:53:50 +0000</pubDate>
      <link>https://dev.to/shreeni_d/observability-in-agentic-and-ai-applications-the-essential-roles-of-monitoring-and-evaluation-2ej1</link>
      <guid>https://dev.to/shreeni_d/observability-in-agentic-and-ai-applications-the-essential-roles-of-monitoring-and-evaluation-2ej1</guid>
      <description>&lt;p&gt;In the artificial intelligence (AI) landscape of today, organizations are increasingly adopting agentic- and large language model (LLM)-based applications to automate tasks, streamline processes and deliver personalized experiences. However, despite their immense potential, these applications introduce new challenges in governance, control and quality assurance: challenges that cannot be adequately addressed with legacy monitoring approaches. Observability moved from checking if the server is up to verifying if the output from the application leveraging a model is helpful, safe and accurate. This is where robust observability through monitoring and evaluation becomes indispensable.&lt;/p&gt;

&lt;h2&gt;Monitoring: the eyes and ears of AI governance&lt;/h2&gt;

&lt;p&gt;Monitoring is a foundational aspect of governance and control in AI applications, especially those leveraging LLM orchestration. Yet monitoring AI agents imposes requirements distinct from those of traditional software.&lt;/p&gt;

&lt;h3&gt;What should modern AI monitoring capture?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chain-of-thought tracing:&lt;/strong&gt; Track the input prompts supplied to agents, the context provided and the outputs generated by every component (including LLMs and tools), so that the exact prompt, retrieved documents, system instructions and final output appear in one unified timeline (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analysis of core interactions:&lt;/strong&gt; Log and visualize every LLM call, database and resource interaction, and tool invocation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metrics:&lt;/strong&gt; Monitor metrics such as token consumption (costs) and latency (bottlenecks), and facilitate side-by-side comparisons of multiple runs or activity traces.&lt;/li&gt;
&lt;/ul&gt;
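
&lt;p&gt;As a deliberately minimal illustration, a tracing decorator can record each LLM or tool call as a span in one timeline; &lt;code&gt;TRACE&lt;/code&gt; here is a stand-in for a real tracing backend:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal tracing sketch: record every LLM or tool call as a span
# in one unified timeline. TRACE stands in for a real tracing backend.
import time

TRACE = []

def traced(span_name):
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            TRACE.append({
                "span": span_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_s": round(time.time() - start, 3),
            })
            return result
        return inner
    return wrap

@traced("llm_call")
def call_llm(prompt):
    ...  # call your model here; the decorator records inputs, output, latency
&lt;/code&gt;&lt;/pre&gt;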

&lt;h3&gt;Key capabilities of a good monitoring process&lt;/h3&gt;

&lt;p&gt;A good monitoring process relies on a system that enables capabilities including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility:&lt;/strong&gt; Rerun LLM, tool or function calls to validate outputs across different scenarios.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experimentation:&lt;/strong&gt; Adjust prompts, states or context to observe differences in generated results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Filtering and visualization:&lt;/strong&gt; Filter outputs by metrics such as time taken or tokens consumed, enhancing exploratory analysis and understanding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt-centric features:&lt;/strong&gt; Since prompts are pivotal in LLM applications, monitoring should empower users to (a) experiment and iterate on prompt wording across a workflow; (b) apply version control to prompts and maintain a prompt repository, removing the need to hardcode system contexts; and (c) auto-optimize, template and instantly update prompts in production (a minimal prompt-store sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
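
&lt;p&gt;To make the prompt-repository idea concrete, here is a minimal file-based sketch; the directory layout and function names are illustrative assumptions, not any particular product’s API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of a versioned, file-based prompt repository.
# Layout assumption: prompts/{name}/{version}.txt
from pathlib import Path

REPO = Path("prompts")

def save_prompt(name, version, text):
    path = REPO / name / f"{version}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)

def load_prompt(name, version="latest"):
    if version == "latest":
        # assumes lexically sortable version names, e.g. 001, 002, ...
        return sorted((REPO / name).glob("*.txt"))[-1].read_text()
    return (REPO / name / f"{version}.txt").read_text()
&lt;/code&gt;&lt;/pre&gt;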

&lt;h3&gt;Debugging in AI monitoring&lt;/h3&gt;

&lt;p&gt;Unlike debugging classic applications, debugging agentic workflows involves hot-reloading agent nodes. Developers should be able to modify prompts, update agent states and re-execute workflows, ideally with breakpoints, step-through capabilities and data inspection.&lt;/p&gt;

&lt;p&gt;This hands-on process helps identify bottlenecks or misbehaviors at each stage of complex AI-driven logic.&lt;/p&gt;

&lt;h2&gt;Evaluation: measuring performance and verifying consistency&lt;/h2&gt;

&lt;p&gt;While monitoring gives you operational insight, evaluation answers an equally vital question: How good is your application’s output?&lt;/p&gt;

&lt;p&gt;Evaluation is the systematic process of measuring the performance of your AI application. It confirms that your system’s output remains within acceptable boundaries over time, even as data changes, models evolve, or new features are added.&lt;/p&gt;

&lt;h3&gt;Why is evaluation essential?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Preventing model drift:&lt;/strong&gt; Evaluation safeguards against performance drift, such as LLMs producing inconsistent responses for identical queries in production. It also helps you understand how system upgrades, such as swapping model versions, might impact business-critical results: a model version update or a newly released model can subtly change how your prompts are interpreted, leading to silent regressions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Safety and compliance:&lt;/strong&gt; Evaluation can also help establish that the agent stays within brand voice and safety guardrails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Types of evaluation&lt;/h3&gt;

&lt;p&gt;Since generative AI (GenAI) systems produce variable outputs, teams in practice combine two complementary approaches: one to verify correctness and consistency, and one to approximate human judgment for open-ended tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Deterministic evaluation:&lt;/strong&gt; Uses closed-ended metrics, such as mathematical correctness, where outputs have clear right or wrong values. Best for formatting and factual lookup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nondeterministic evaluation (LLM-as-a-Judge):&lt;/strong&gt; Since many AI tasks lack definitive answers, equivalent or superior LLMs can be leveraged to score outputs by comparing them with reference or “golden” outputs, much as a human would. You establish threshold criteria for pass/fail based on these scores. Best for tone, helpfulness and nuance (both styles are sketched after this list).&lt;/li&gt;
&lt;/ul&gt;
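
&lt;p&gt;A minimal sketch of both styles; the &lt;code&gt;judge_llm&lt;/code&gt; callable and the 0.8 threshold are illustrative assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Deterministic check: closed-ended, clear right or wrong.
def exact_match(output, expected):
    return output.strip() == expected.strip()

# LLM-as-a-Judge: a (stronger) model scores the output against a golden
# reference; judge_llm is an illustrative callable returning a numeric string.
def judge_score(output, golden, judge_llm):
    prompt = (
        "Score from 0 to 1 how well the candidate matches the reference.\n"
        f"Reference: {golden}\nCandidate: {output}\nScore:"
    )
    return float(judge_llm(prompt))

def passes(output, golden, judge_llm, threshold=0.8):
    return judge_score(output, golden, judge_llm) &gt;= threshold
&lt;/code&gt;&lt;/pre&gt;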

&lt;h3&gt;Building robust evaluation frameworks&lt;/h3&gt;

&lt;p&gt;An evaluation framework is more than just a test suite; it is a structured environment designed to quantify the nuances of AI behavior. To build a framework that actually mirrors real-world performance, focus on these two pillars.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom metrics:&lt;/strong&gt; Beyond general correctness, define and enforce metrics such as simplicity, relevance and explainability, tailoring each to your specific use case.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data set construction:&lt;/strong&gt; (a) Use historical examples (deemed correct) from production; and (b) create synthetic data sets, leveraging expert knowledge or even GenAI tools to craft plausible inputs and expected outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most applications benefit from a dedicated “golden data set” for ongoing evaluation.&lt;/p&gt;

&lt;h3&gt;Approaches&lt;/h3&gt;

&lt;p&gt;The when and where of evaluation are just as important as the what. A mature AI lifecycle utilizes two distinct approaches to catch errors before and after they reach the user.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Offline evaluation (the preflight check):&lt;/strong&gt; Runs against a static golden data set using new or updated models to check for regressions or improvements before deploying changes. For example, if you switch from a large LLM to a smaller, faster model, offline evaluation tells you exactly where the smaller model fails to meet the quality bar of the larger one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Online evaluation:&lt;/strong&gt; Continuously assesses production outputs in real time, providing instant feedback and safeguarding against silent quality degradation. Even if your model passed offline tests, real-world data can shift (data drift), or external tools your agent relies on might change their behavior. By using LLM-as-a-Judge to score live samples, you can trigger alerts or even fallback mechanisms if the quality of a live response falls below a specific threshold, for example a faithfulness score of less than 0.7 (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
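
&lt;p&gt;A minimal sketch of the online path, reusing the judge idea above; the 0.7 threshold and the &lt;code&gt;alert&lt;/code&gt; and &lt;code&gt;fallback&lt;/code&gt; hooks are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal online-evaluation sketch: score live responses for
# faithfulness and fall back when quality drops below the threshold.
# judge_llm, alert and fallback are illustrative callables.
def faithfulness(answer, sources, judge_llm):
    prompt = (
        "Rate from 0 to 1 how faithfully the answer is supported "
        f"by the sources.\nSources: {sources}\nAnswer: {answer}\nScore:"
    )
    return float(judge_llm(prompt))

def on_response(question, answer, sources, judge_llm, alert, fallback):
    score = faithfulness(answer, sources, judge_llm)
    if score &gt;= 0.7:
        return answer  # quality bar met; serve the live response
    alert(f"Faithfulness {score:.2f} below 0.7 for: {question}")
    return fallback(question)  # e.g., a retrieval-only or canned reply
&lt;/code&gt;&lt;/pre&gt;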

&lt;h2&gt;Dashboarding: the observability hub&lt;/h2&gt;

&lt;p&gt;A well-designed dashboard brings monitoring and evaluation data together in a unified view, empowering teams to spot anomalies, track trends and make data-driven decisions on iteration and deployment with confidence. It merges operational traces with evaluation scores. When you see a dip in your helpfulness score on the dashboard, you should be able to click directly into the specific trace to see the exact prompt and retrieved document that caused the failure.&lt;/p&gt;

&lt;p&gt;As agentic and AI applications become integral to transformative strategies, organizations must reimagine observability beyond traditional monitoring and testing. Follow the suggestions listed above to deliver a reliable, transparent and continually improving AI application landscape.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>AI Agents Explained: How They Work and How to Build Your First One</title>
      <dc:creator>Shreeni D</dc:creator>
      <pubDate>Thu, 09 Apr 2026 06:03:12 +0000</pubDate>
      <link>https://dev.to/shreeni_d/ai-agents-explained-how-they-work-and-how-to-build-your-first-one-4ann</link>
      <guid>https://dev.to/shreeni_d/ai-agents-explained-how-they-work-and-how-to-build-your-first-one-4ann</guid>
      <description>&lt;h1&gt;
  
  
  Building AI Agents: What They Are and How to Create Them
&lt;/h1&gt;

&lt;p&gt;AI agents are getting a lot of attention right now, but most explanations stay high level and skip what it actually takes to build one.&lt;/p&gt;

&lt;p&gt;At a practical level, an AI agent is not just an LLM with a prompt. It is a system that can take a goal, decide what to do next, call tools, observe results, and repeat until it reaches an outcome. The LLM is only the reasoning layer. The real system is everything around it.&lt;/p&gt;

&lt;h2&gt;What is an AI Agent?&lt;/h2&gt;

&lt;p&gt;A simple way to think about an agent is as a loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a goal or user input
&lt;/li&gt;
&lt;li&gt;Decide the next action
&lt;/li&gt;
&lt;li&gt;Call a tool or API
&lt;/li&gt;
&lt;li&gt;Observe the result
&lt;/li&gt;
&lt;li&gt;Repeat until done
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop is what makes agents different from a single model call. Instead of answering once, they can plan and act over multiple steps.&lt;/p&gt;

&lt;h2&gt;Core Components&lt;/h2&gt;

&lt;p&gt;To build a useful agent, you need a few key pieces.&lt;/p&gt;

&lt;h3&gt;1. LLM (Reasoning Layer)&lt;/h3&gt;

&lt;p&gt;The LLM decides what to do next. It interprets the goal, selects tools, and generates actions.&lt;/p&gt;

&lt;h3&gt;2. Tools (Execution Layer)&lt;/h3&gt;

&lt;p&gt;Tools define what the agent can actually do. These can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs
&lt;/li&gt;
&lt;li&gt;database queries
&lt;/li&gt;
&lt;li&gt;external services
&lt;/li&gt;
&lt;li&gt;internal microservices
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without tools, the agent cannot take meaningful actions.&lt;/p&gt;
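
&lt;p&gt;A minimal sketch of a tool layer; the registry shape and the example tools are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of an execution layer: tools the agent invokes by name.
# The tool names and bodies are placeholders for real APIs or services.
def search_orders(customer_id):
    ...  # e.g., call an internal microservice

def run_sql(query):
    ...  # e.g., run a read-only database query

TOOLS = {
    "search_orders": search_orders,
    "run_sql": run_sql,
}
&lt;/code&gt;&lt;/pre&gt;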

&lt;h3&gt;3. Control Flow&lt;/h3&gt;

&lt;p&gt;You need a structure for how the agent operates. This can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simple step-by-step reasoning
&lt;/li&gt;
&lt;li&gt;loop-based execution
&lt;/li&gt;
&lt;li&gt;graph-based workflows for complex systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer controls how decisions are made and when the agent stops.&lt;/p&gt;

&lt;h3&gt;4. Memory&lt;/h3&gt;

&lt;p&gt;For multi-step tasks, the agent needs context. Memory can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;conversation history
&lt;/li&gt;
&lt;li&gt;intermediate results
&lt;/li&gt;
&lt;li&gt;task state
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without memory, agents lose track of progress and become unreliable.&lt;/p&gt;
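
&lt;p&gt;A minimal memory sketch covering all three, assuming a simple in-process store; a production agent would typically persist this externally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of agent memory: history, intermediate results, task state.
class AgentMemory:
    def __init__(self, goal):
        self.goal = goal
        self.history = []   # conversation history
        self.results = {}   # intermediate results, keyed by step
        self.state = "in_progress"  # task state

    def record(self, step, result):
        self.results[step] = result
        self.history.append((step, result))
&lt;/code&gt;&lt;/pre&gt;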

&lt;h2&gt;How to Create an AI Agent&lt;/h2&gt;

&lt;p&gt;A basic approach to building an agent looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define the goal
&lt;/li&gt;
&lt;li&gt;Define available tools
&lt;/li&gt;
&lt;li&gt;Create a reasoning loop
&lt;/li&gt;
&lt;li&gt;Execute actions and collect results
&lt;/li&gt;
&lt;li&gt;Repeat until completion
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a simple pseudo-flow as a Python sketch; the &lt;code&gt;llm.decide&lt;/code&gt; helper, the action shape and the step limit are illustrative, not any specific framework’s API:&lt;/p&gt;
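
&lt;pre&gt;&lt;code&gt;# A minimal sketch of the loop from the five steps above.
def run_agent(goal, tools, llm, max_steps=10):
    context = [goal]                                # 1. take a goal
    for _ in range(max_steps):
        action = llm.decide(context)                # 2. decide the next action
        if action.name == "finish":
            return action.answer                    # 5. stop when done
        result = tools[action.name](**action.args)  # 3. call a tool or API
        context.append(result)                      # 4. observe the result
    return "Stopped: step limit reached."
&lt;/code&gt;&lt;/pre&gt;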

&lt;p&gt;In practice, frameworks like LangChain, LangGraph, or Bedrock Agents help structure this loop, but the core idea remains the same.&lt;/p&gt;

&lt;h2&gt;Where Agents Work Well&lt;/h2&gt;

&lt;p&gt;Agents are useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks involve multiple steps
&lt;/li&gt;
&lt;li&gt;decisions depend on intermediate results
&lt;/li&gt;
&lt;li&gt;multiple systems need to be coordinated
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data aggregation from multiple sources
&lt;/li&gt;
&lt;li&gt;workflow automation
&lt;/li&gt;
&lt;li&gt;multi-step decision systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Common Mistakes&lt;/h2&gt;

&lt;p&gt;One common mistake is trying to make the agent do everything internally.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;running heavy ML models inside the agent
&lt;/li&gt;
&lt;li&gt;handling complex computation in the reasoning loop
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This usually leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;higher latency
&lt;/li&gt;
&lt;li&gt;increased cost
&lt;/li&gt;
&lt;li&gt;harder debugging
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A better approach is to treat the agent as an orchestrator and let specialized systems handle actual computation.&lt;/p&gt;

&lt;h2&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;AI agents are not just about prompts. They are about system design.&lt;/p&gt;

&lt;p&gt;The shift happening now is from single model calls to systems that can plan, act, and adapt over time. The teams that get this right focus on building reliable tool layers, clear control flow, and well-defined boundaries.&lt;/p&gt;

&lt;p&gt;We are still early, but the patterns are starting to emerge.&lt;/p&gt;

&lt;p&gt;The real challenge is not calling an LLM. It is designing the system around it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
