<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shreeni D</title>
    <description>The latest articles on DEV Community by Shreeni D (@shreeni_d).</description>
    <link>https://dev.to/shreeni_d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869011%2F8a6e4588-bd8a-420f-8aed-d61f59490003.png</url>
      <title>DEV Community: Shreeni D</title>
      <link>https://dev.to/shreeni_d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shreeni_d"/>
    <language>en</language>
    <item>
      <title>Stop Trying to Prompt Your Way Out of a Hallucination</title>
      <dc:creator>Shreeni D</dc:creator>
      <pubDate>Tue, 21 Apr 2026 16:02:30 +0000</pubDate>
      <link>https://dev.to/shreeni_d/stop-trying-to-prompt-your-way-out-of-a-hallucination-3hkj</link>
      <guid>https://dev.to/shreeni_d/stop-trying-to-prompt-your-way-out-of-a-hallucination-3hkj</guid>
      <description>&lt;p&gt;I learned this the hard way.&lt;/p&gt;

&lt;p&gt;I recently built a personal agent to map dependencies across my local repositories. The goal was simple: ask a question like &lt;em&gt;“Which projects use this specific library version?”&lt;/em&gt; and get a clean, reliable answer.&lt;/p&gt;

&lt;p&gt;What I got instead was something else entirely.&lt;/p&gt;

&lt;p&gt;The agent responded with confidence. The formatting was pristine. The explanation was coherent. And the answer was completely fabricated.&lt;/p&gt;

&lt;p&gt;It even referenced a directory that didn’t exist.&lt;/p&gt;

&lt;p&gt;I spent time chasing that ghost—digging through my filesystem, double-checking paths—before realizing what had happened: the model had &lt;em&gt;guessed&lt;/em&gt;. It saw a pattern in my project names and filled in the blanks with something that &lt;em&gt;looked&lt;/em&gt; right.&lt;/p&gt;

&lt;p&gt;That’s when it clicked.&lt;/p&gt;

&lt;h2&gt;The Problem Isn’t the Prompt&lt;/h2&gt;

&lt;p&gt;My first instinct was to tweak the prompt.&lt;/p&gt;

&lt;p&gt;Maybe I needed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be more explicit&lt;/li&gt;
&lt;li&gt;Add constraints&lt;/li&gt;
&lt;li&gt;Tell it to “only use verified data”&lt;/li&gt;
&lt;li&gt;Emphasize accuracy over completeness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But none of that actually solves the problem.&lt;/p&gt;

&lt;p&gt;Because hallucination isn’t a prompt failure. It’s a system design failure.&lt;/p&gt;

&lt;p&gt;You’re asking a probabilistic model to behave like a deterministic system—and no amount of prompt engineering will change that.&lt;/p&gt;

&lt;h2&gt;The Fix: Add a Source of Truth&lt;/h2&gt;

&lt;p&gt;The real solution wasn’t better wording. It was better architecture.&lt;/p&gt;

&lt;p&gt;Instead of asking the model to &lt;em&gt;know&lt;/em&gt;, I required it to &lt;em&gt;check&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I had the AI generate a Python script that scans my local repositories and verifies whether a given library version actually exists. Then I wired that script into the agent’s workflow with a hard rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The agent must execute the verification step before responding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the script finds matches, great—the model can explain and format the results.&lt;/p&gt;

&lt;p&gt;If it doesn’t?&lt;/p&gt;

&lt;p&gt;The agent is explicitly forced to say:&lt;br&gt;
&lt;strong&gt;“I don’t have enough information.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No guessing. No filling in gaps. No “best effort” answers.&lt;/p&gt;
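
&lt;p&gt;For reference, here’s a minimal sketch of that verification step. The repo root and the &lt;code&gt;requirements.txt&lt;/code&gt; convention are assumptions for illustration; the real script the agent generated can differ:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch: verify a library version actually exists on disk
# before the agent is allowed to answer. Paths are illustrative.
from pathlib import Path

def repos_using(library, version, root="~/repos"):
    target = f"{library}=={version}"
    hits = []
    for req in Path(root).expanduser().glob("*/requirements.txt"):
        if target in req.read_text():
            hits.append(str(req.parent))
    return hits

def verified_answer(library, version):
    repos = repos_using(library, version)
    if not repos:
        return "I don't have enough information."  # the hard rule
    return repos  # only now may the model explain and format the results
&lt;/code&gt;&lt;/pre&gt;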

&lt;h2&gt;Let the Model Build Its Own Guardrails&lt;/h2&gt;

&lt;p&gt;The interesting part is that I still used the LLM—but not as a source of truth.&lt;/p&gt;

&lt;p&gt;I used it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate the verification script&lt;/li&gt;
&lt;li&gt;Define the workflow&lt;/li&gt;
&lt;li&gt;Integrate reasoning with execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the model helped build the system that limits its own behavior.&lt;/p&gt;

&lt;p&gt;That’s the shift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From &lt;em&gt;magic&lt;/em&gt; → to &lt;em&gt;mechanism&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;em&gt;answers&lt;/em&gt; → to &lt;em&gt;process&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;em&gt;trusting outputs&lt;/em&gt; → to &lt;em&gt;verifying inputs&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Reasoning Engine, Not Database&lt;/h2&gt;

&lt;p&gt;LLMs are incredibly good at reasoning over information.&lt;/p&gt;

&lt;p&gt;They are not reliable sources of truth.&lt;/p&gt;

&lt;p&gt;If your agent is answering questions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your codebase&lt;/li&gt;
&lt;li&gt;Your infrastructure&lt;/li&gt;
&lt;li&gt;Your documents&lt;/li&gt;
&lt;li&gt;Your data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then the model should never be the final authority.&lt;/p&gt;

&lt;p&gt;It should sit &lt;em&gt;on top&lt;/em&gt; of a system that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve real data&lt;/li&gt;
&lt;li&gt;Execute checks&lt;/li&gt;
&lt;li&gt;Enforce constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your LLM is the brain.&lt;br&gt;
Your code is the nervous system.&lt;br&gt;
Your data is reality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If those aren’t connected, you don’t have intelligence—you have improv.&lt;/p&gt;

&lt;h2&gt;Stop Prompting, Start Designing&lt;/h2&gt;

&lt;p&gt;When an AI system fails, it’s tempting to stay in the prompt layer. It feels fast, iterative, and controllable.&lt;/p&gt;

&lt;p&gt;But that’s often just avoiding the real work.&lt;/p&gt;

&lt;p&gt;If your agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinates&lt;/li&gt;
&lt;li&gt;Makes unverifiable claims&lt;/li&gt;
&lt;li&gt;Invents structure&lt;/li&gt;
&lt;li&gt;Sounds right but isn’t&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…don’t rewrite the prompt.&lt;/p&gt;

&lt;p&gt;Fix the architecture.&lt;/p&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic checks&lt;/li&gt;
&lt;li&gt;Tooling integrations&lt;/li&gt;
&lt;li&gt;Execution steps&lt;/li&gt;
&lt;li&gt;Clear failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliability doesn’t come from convincing the model to behave.&lt;/p&gt;

&lt;p&gt;It comes from building a system where it &lt;em&gt;can’t misbehave without being caught&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;A Better Default&lt;/h2&gt;

&lt;p&gt;Here’s a simple rule that has held up well for me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If the answer depends on real-world state, the model must verify it before speaking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That one constraint eliminates an entire class of hallucinations.&lt;/p&gt;

&lt;h2&gt;The Real Question&lt;/h2&gt;

&lt;p&gt;So now I’m curious:&lt;/p&gt;

&lt;p&gt;When your AI fails, do you reach for the prompt…&lt;/p&gt;

&lt;p&gt;—or the architecture?&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Observability in agentic and AI applications: the essential roles of monitoring and evaluation</title>
      <dc:creator>Shreeni D</dc:creator>
      <pubDate>Tue, 14 Apr 2026 03:53:50 +0000</pubDate>
      <link>https://dev.to/shreeni_d/observability-in-agentic-and-ai-applications-the-essential-roles-of-monitoring-and-evaluation-2ej1</link>
      <guid>https://dev.to/shreeni_d/observability-in-agentic-and-ai-applications-the-essential-roles-of-monitoring-and-evaluation-2ej1</guid>
      <description>&lt;p&gt;In the artificial intelligence (AI) landscape of today, organizations are increasingly adopting agentic- and large language model (LLM)-based applications to automate tasks, streamline processes and deliver personalized experiences. However, despite their immense potential, these applications introduce new challenges in governance, control and quality assurance: challenges that cannot be adequately addressed with legacy monitoring approaches. Observability moved from checking if the server is up to verifying if the output from the application leveraging a model is helpful, safe and accurate. This is where robust observability through monitoring and evaluation becomes indispensable.&lt;/p&gt;

&lt;h2&gt;Monitoring: the eyes and ears of AI governance&lt;/h2&gt;

&lt;p&gt;Monitoring is a foundational aspect of governance and control in AI applications, especially those leveraging LLM orchestration. Yet monitoring AI agents imposes requirements distinct from those of traditional software.&lt;/p&gt;

&lt;h3&gt;What should modern AI monitoring capture?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chain-of-thought tracing:&lt;/strong&gt; Track the input prompts supplied to agents, the context provided and the outputs generated by every component (including LLMs and tools), so that the exact prompt, retrieved documents, system instructions and final output appear in one unified timeline (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analysis of core interactions:&lt;/strong&gt; Log and visualize every LLM call, database and resource interaction, and tool invocation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metrics:&lt;/strong&gt; Monitor metrics such as token consumption (costs) and latency (bottlenecks), and facilitate side-by-side comparisons of multiple runs or activity traces.&lt;/li&gt;
&lt;/ul&gt;
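
&lt;p&gt;As a deliberately minimal illustration, a tracing decorator can record each LLM or tool call as a span in one timeline; &lt;code&gt;TRACE&lt;/code&gt; here is a stand-in for a real tracing backend:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal tracing sketch: record every LLM or tool call as a span
# in one unified timeline. TRACE stands in for a real tracing backend.
import time

TRACE = []

def traced(span_name):
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            TRACE.append({
                "span": span_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_s": round(time.time() - start, 3),
            })
            return result
        return inner
    return wrap

@traced("llm_call")
def call_llm(prompt):
    ...  # call your model here; the decorator records inputs, output, latency
&lt;/code&gt;&lt;/pre&gt;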

&lt;h3&gt;Key capabilities of a good monitoring process&lt;/h3&gt;

&lt;p&gt;A good monitoring process relies on a system that enables capabilities including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility:&lt;/strong&gt; Rerun LLM, tool or function calls to validate outputs across different scenarios.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experimentation:&lt;/strong&gt; Adjust prompts, states or context to observe differences in generated results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Filtering and visualization:&lt;/strong&gt; Filter outputs by metrics such as time taken or tokens consumed, enhancing exploratory analysis and understanding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt-centric features:&lt;/strong&gt; Since prompts are pivotal in LLM applications, monitoring should empower users to (a) experiment and iterate on prompt wording across a workflow; (b) apply version control to prompts and maintain a prompt repository, removing the need to hardcode system contexts; and (c) auto-optimize, template and instantly update prompts in production (a minimal prompt-store sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
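
&lt;p&gt;To make the prompt-repository idea concrete, here is a minimal file-based sketch; the directory layout and function names are illustrative assumptions, not any particular product’s API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of a versioned, file-based prompt repository.
# Layout assumption: prompts/{name}/{version}.txt
from pathlib import Path

REPO = Path("prompts")

def save_prompt(name, version, text):
    path = REPO / name / f"{version}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)

def load_prompt(name, version="latest"):
    if version == "latest":
        # assumes lexically sortable version names, e.g. 001, 002, ...
        return sorted((REPO / name).glob("*.txt"))[-1].read_text()
    return (REPO / name / f"{version}.txt").read_text()
&lt;/code&gt;&lt;/pre&gt;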

&lt;h3&gt;Debugging in AI monitoring&lt;/h3&gt;

&lt;p&gt;Unlike debugging classic applications, debugging agentic workflows involves hot-reloading agent nodes. Developers should be able to modify prompts, update agent states and re-execute workflows, ideally with breakpoints, step-through capabilities and data inspection.&lt;/p&gt;

&lt;p&gt;This hands-on process helps identify bottlenecks or misbehaviors at each stage of complex AI-driven logic.&lt;/p&gt;

&lt;h2&gt;Evaluation: measuring performance and verifying consistency&lt;/h2&gt;

&lt;p&gt;While monitoring gives you operational insight, evaluation answers an equally vital question: How good is your application’s output?&lt;/p&gt;

&lt;p&gt;Evaluation is the systematic process of measuring the performance of your AI application. It confirms that your system’s output remains within acceptable boundaries over time, even as data changes, models evolve, or new features are added.&lt;/p&gt;

&lt;h3&gt;Why is evaluation essential?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Preventing model drift:&lt;/strong&gt; Evaluation safeguards against performance drift, such as LLMs producing inconsistent responses for identical queries in production. It also helps you understand how system upgrades, such as swapping model versions, might impact business-critical results: a model version update or a newly released model can subtly change how your prompts are interpreted, leading to silent regressions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Safety and compliance:&lt;/strong&gt; Evaluation can also help establish that the agent stays within brand voice and safety guardrails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Types of evaluation&lt;/h3&gt;

&lt;p&gt;Since generative AI (GenAI) systems produce variable outputs, teams in practice combine two complementary approaches: one to verify correctness and consistency, and one to approximate human judgment for open-ended tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Deterministic evaluation:&lt;/strong&gt; Uses closed-ended metrics, such as mathematical correctness, where outputs have clear right or wrong values. Best for formatting and factual lookup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nondeterministic evaluation (LLM-as-a-Judge):&lt;/strong&gt; Since many AI tasks lack definitive answers, equivalent or superior LLMs can be leveraged to score outputs by comparing them with reference or “golden” outputs, much as a human would. You establish threshold criteria for pass/fail based on these scores. Best for tone, helpfulness and nuance (both styles are sketched after this list).&lt;/li&gt;
&lt;/ul&gt;
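
&lt;p&gt;A minimal sketch of both styles; the &lt;code&gt;judge_llm&lt;/code&gt; callable and the 0.8 threshold are illustrative assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Deterministic check: closed-ended, clear right or wrong.
def exact_match(output, expected):
    return output.strip() == expected.strip()

# LLM-as-a-Judge: a (stronger) model scores the output against a golden
# reference; judge_llm is an illustrative callable returning a numeric string.
def judge_score(output, golden, judge_llm):
    prompt = (
        "Score from 0 to 1 how well the candidate matches the reference.\n"
        f"Reference: {golden}\nCandidate: {output}\nScore:"
    )
    return float(judge_llm(prompt))

def passes(output, golden, judge_llm, threshold=0.8):
    return judge_score(output, golden, judge_llm) &gt;= threshold
&lt;/code&gt;&lt;/pre&gt;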

&lt;h3&gt;Building robust evaluation frameworks&lt;/h3&gt;

&lt;p&gt;An evaluation framework is more than just a test suite; it is a structured environment designed to quantify the nuances of AI behavior. To build a framework that actually mirrors real-world performance, focus on these two pillars.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom metrics:&lt;/strong&gt; Beyond general correctness, define and enforce metrics such as simplicity, relevance and explainability, tailoring each to your specific use case.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data set construction:&lt;/strong&gt; (a) Use historical examples (deemed correct) from production; and (b) create synthetic data sets, leveraging expert knowledge or even GenAI tools to craft plausible inputs and expected outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most applications benefit from a dedicated “golden data set” for ongoing evaluation.&lt;/p&gt;

&lt;h3&gt;Approaches&lt;/h3&gt;

&lt;p&gt;The when and where of evaluation are just as important as the what. A mature AI lifecycle utilizes two distinct approaches to catch errors before and after they reach the user.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Offline evaluation (the preflight check):&lt;/strong&gt; Runs against a static golden data set using new or updated models to check for regressions or improvements before deploying changes. For example, if you switch from a large LLM to a smaller, faster model, offline evaluation tells you exactly where the smaller model fails to meet the quality bar of the larger one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Online evaluation:&lt;/strong&gt; Continuously assesses production outputs in real time, providing instant feedback and safeguarding against silent quality degradation. Even if your model passed offline tests, real-world data can shift (data drift), or external tools your agent relies on might change their behavior. By using LLM-as-a-Judge to score live samples, you can trigger alerts or even fallback mechanisms if the quality of a live response falls below a specific threshold, for example a faithfulness score of less than 0.7 (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
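
&lt;p&gt;A minimal sketch of the online path, reusing the judge idea above; the 0.7 threshold and the &lt;code&gt;alert&lt;/code&gt; and &lt;code&gt;fallback&lt;/code&gt; hooks are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal online-evaluation sketch: score live responses for
# faithfulness and fall back when quality drops below the threshold.
# judge_llm, alert and fallback are illustrative callables.
def faithfulness(answer, sources, judge_llm):
    prompt = (
        "Rate from 0 to 1 how faithfully the answer is supported "
        f"by the sources.\nSources: {sources}\nAnswer: {answer}\nScore:"
    )
    return float(judge_llm(prompt))

def on_response(question, answer, sources, judge_llm, alert, fallback):
    score = faithfulness(answer, sources, judge_llm)
    if score &gt;= 0.7:
        return answer  # quality bar met; serve the live response
    alert(f"Faithfulness {score:.2f} below 0.7 for: {question}")
    return fallback(question)  # e.g., a retrieval-only or canned reply
&lt;/code&gt;&lt;/pre&gt;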

&lt;h2&gt;Dashboarding: the observability hub&lt;/h2&gt;

&lt;p&gt;A well-designed dashboard brings monitoring and evaluation data together in a unified view, empowering teams to spot anomalies, track trends and make data-driven decisions on iteration and deployment with confidence. It merges operational traces with evaluation scores. When you see a dip in your helpfulness score on the dashboard, you should be able to click directly into the specific trace to see the exact prompt and retrieved document that caused the failure.&lt;/p&gt;

&lt;p&gt;As agentic and AI applications become integral to transformative strategies, organizations must reimagine observability beyond traditional monitoring and testing. Follow the suggestions listed above to deliver a reliable, transparent and continually improving AI application landscape.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>AI Agents Explained: How They Work and How to Build Your First One</title>
      <dc:creator>Shreeni D</dc:creator>
      <pubDate>Thu, 09 Apr 2026 06:03:12 +0000</pubDate>
      <link>https://dev.to/shreeni_d/ai-agents-explained-how-they-work-and-how-to-build-your-first-one-4ann</link>
      <guid>https://dev.to/shreeni_d/ai-agents-explained-how-they-work-and-how-to-build-your-first-one-4ann</guid>
      <description>&lt;h1&gt;
  
  
  Building AI Agents: What They Are and How to Create Them
&lt;/h1&gt;

&lt;p&gt;AI agents are getting a lot of attention right now, but most explanations stay high level and skip what it actually takes to build one.&lt;/p&gt;

&lt;p&gt;At a practical level, an AI agent is not just an LLM with a prompt. It is a system that can take a goal, decide what to do next, call tools, observe results, and repeat until it reaches an outcome. The LLM is only the reasoning layer. The real system is everything around it.&lt;/p&gt;

&lt;h2&gt;What is an AI Agent?&lt;/h2&gt;

&lt;p&gt;A simple way to think about an agent is as a loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a goal or user input
&lt;/li&gt;
&lt;li&gt;Decide the next action
&lt;/li&gt;
&lt;li&gt;Call a tool or API
&lt;/li&gt;
&lt;li&gt;Observe the result
&lt;/li&gt;
&lt;li&gt;Repeat until done
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop is what makes agents different from a single model call. Instead of answering once, they can plan and act over multiple steps.&lt;/p&gt;

&lt;h2&gt;Core Components&lt;/h2&gt;

&lt;p&gt;To build a useful agent, you need a few key pieces.&lt;/p&gt;

&lt;h3&gt;1. LLM (Reasoning Layer)&lt;/h3&gt;

&lt;p&gt;The LLM decides what to do next. It interprets the goal, selects tools, and generates actions.&lt;/p&gt;

&lt;h3&gt;2. Tools (Execution Layer)&lt;/h3&gt;

&lt;p&gt;Tools define what the agent can actually do. These can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs
&lt;/li&gt;
&lt;li&gt;database queries
&lt;/li&gt;
&lt;li&gt;external services
&lt;/li&gt;
&lt;li&gt;internal microservices
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without tools, the agent cannot take meaningful actions.&lt;/p&gt;
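
&lt;p&gt;A minimal sketch of a tool layer; the registry shape and the example tools are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of an execution layer: tools the agent invokes by name.
# The tool names and bodies are placeholders for real APIs or services.
def search_orders(customer_id):
    ...  # e.g., call an internal microservice

def run_sql(query):
    ...  # e.g., run a read-only database query

TOOLS = {
    "search_orders": search_orders,
    "run_sql": run_sql,
}
&lt;/code&gt;&lt;/pre&gt;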

&lt;h3&gt;3. Control Flow&lt;/h3&gt;

&lt;p&gt;You need a structure for how the agent operates. This can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simple step-by-step reasoning
&lt;/li&gt;
&lt;li&gt;loop-based execution
&lt;/li&gt;
&lt;li&gt;graph-based workflows for complex systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer controls how decisions are made and when the agent stops.&lt;/p&gt;

&lt;h3&gt;4. Memory&lt;/h3&gt;

&lt;p&gt;For multi-step tasks, the agent needs context. Memory can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;conversation history
&lt;/li&gt;
&lt;li&gt;intermediate results
&lt;/li&gt;
&lt;li&gt;task state
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without memory, agents lose track of progress and become unreliable.&lt;/p&gt;
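
&lt;p&gt;A minimal memory sketch covering all three, assuming a simple in-process store; a production agent would typically persist this externally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of agent memory: history, intermediate results, task state.
class AgentMemory:
    def __init__(self, goal):
        self.goal = goal
        self.history = []   # conversation history
        self.results = {}   # intermediate results, keyed by step
        self.state = "in_progress"  # task state

    def record(self, step, result):
        self.results[step] = result
        self.history.append((step, result))
&lt;/code&gt;&lt;/pre&gt;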

&lt;h2&gt;How to Create an AI Agent&lt;/h2&gt;

&lt;p&gt;A basic approach to building an agent looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define the goal
&lt;/li&gt;
&lt;li&gt;Define available tools
&lt;/li&gt;
&lt;li&gt;Create a reasoning loop
&lt;/li&gt;
&lt;li&gt;Execute actions and collect results
&lt;/li&gt;
&lt;li&gt;Repeat until completion
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a simple pseudo-flow as a Python sketch; the &lt;code&gt;llm.decide&lt;/code&gt; helper, the action shape and the step limit are illustrative, not any specific framework’s API:&lt;/p&gt;
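
&lt;pre&gt;&lt;code&gt;# A minimal sketch of the loop from the five steps above.
def run_agent(goal, tools, llm, max_steps=10):
    context = [goal]                                # 1. take a goal
    for _ in range(max_steps):
        action = llm.decide(context)                # 2. decide the next action
        if action.name == "finish":
            return action.answer                    # 5. stop when done
        result = tools[action.name](**action.args)  # 3. call a tool or API
        context.append(result)                      # 4. observe the result
    return "Stopped: step limit reached."
&lt;/code&gt;&lt;/pre&gt;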

&lt;p&gt;In practice, frameworks like LangChain, LangGraph, or Bedrock Agents help structure this loop, but the core idea remains the same.&lt;/p&gt;

&lt;h2&gt;Where Agents Work Well&lt;/h2&gt;

&lt;p&gt;Agents are useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tasks involve multiple steps
&lt;/li&gt;
&lt;li&gt;decisions depend on intermediate results
&lt;/li&gt;
&lt;li&gt;multiple systems need to be coordinated
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data aggregation from multiple sources
&lt;/li&gt;
&lt;li&gt;workflow automation
&lt;/li&gt;
&lt;li&gt;multi-step decision systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Common Mistakes&lt;/h2&gt;

&lt;p&gt;One common mistake is trying to make the agent do everything internally.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;running heavy ML models inside the agent
&lt;/li&gt;
&lt;li&gt;handling complex computation in the reasoning loop
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This usually leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;higher latency
&lt;/li&gt;
&lt;li&gt;increased cost
&lt;/li&gt;
&lt;li&gt;harder debugging
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A better approach is to treat the agent as an orchestrator and let specialized systems handle actual computation.&lt;/p&gt;

&lt;h2&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;AI agents are not just about prompts. They are about system design.&lt;/p&gt;

&lt;p&gt;The shift happening now is from single model calls to systems that can plan, act, and adapt over time. The teams that get this right focus on building reliable tool layers, clear control flow, and well-defined boundaries.&lt;/p&gt;

&lt;p&gt;We are still early, but the patterns are starting to emerge.&lt;/p&gt;

&lt;p&gt;The real challenge is not calling an LLM. It is designing the system around it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
