<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Praneeth Vadlapati</title>
    <description>The latest articles on DEV Community by Praneeth Vadlapati (@praneeth-v).</description>
    <link>https://dev.to/praneeth-v</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3670618%2Ff39fcc87-1f39-40ed-bc48-e5c6242aadad.png</url>
      <title>DEV Community: Praneeth Vadlapati</title>
      <link>https://dev.to/praneeth-v</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/praneeth-v"/>
    <language>en</language>
    <item>
      <title>AI Reliability Gap: Why Large Language Models are not for Safety-Critical Systems</title>
      <dc:creator>Praneeth Vadlapati</dc:creator>
      <pubDate>Thu, 26 Mar 2026 15:23:36 +0000</pubDate>
      <link>https://dev.to/praneeth-v/ai-reliability-gap-why-large-language-models-are-not-for-safety-critical-systems-48ge</link>
      <guid>https://dev.to/praneeth-v/ai-reliability-gap-why-large-language-models-are-not-for-safety-critical-systems-48ge</guid>
      <description>&lt;p&gt;&lt;em&gt;High benchmark scores are not the same as operational trustworthiness — and in healthcare and defense, that gap can be deadly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We are deploying AI into hospitals and military operations faster than we can verify it belongs there.&lt;/p&gt;

&lt;p&gt;The sales pitch is compelling: large language models pass medical licensing exams, synthesize intelligence reports, and assist clinical decisions at speeds no human can match. Benchmark scores climb. Press releases follow. And somewhere along the way, a critical question gets skipped:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this system actually &lt;em&gt;reliable&lt;/em&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "capable." Not "accurate on average." &lt;em&gt;Reliable&lt;/em&gt; — meaning you can predict when and how it will fail, and those failures won't kill someone.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.researchgate.net/publication/401422885_AI_Reliability_Gap_Why_Large_Language_Models_Fail_in_Safety-Critical_Systems" rel="noopener noreferrer"&gt;new research paper&lt;/a&gt; argues that we cannot answer that question yet — and that continuing to deploy autonomous AI systems in life-critical environments before we can is a serious mistake.&lt;/p&gt;




&lt;h2&gt;Capability Is Not Reliability&lt;/h2&gt;

&lt;p&gt;Here's a distinction the AI industry has been quietly avoiding: a system can be &lt;em&gt;capable&lt;/em&gt; and &lt;em&gt;unreliable&lt;/em&gt; at the same time.&lt;/p&gt;

&lt;p&gt;A capable system produces correct outputs under controlled conditions. A reliable system has a &lt;em&gt;predictable failure distribution&lt;/em&gt; — you know where it breaks, how often, and how badly. Safety-critical engineering — think aircraft, nuclear plants, surgical robots — is built entirely around the second property, not the first.&lt;/p&gt;

&lt;p&gt;Large language models have demonstrated capability. They have not demonstrated reliability.&lt;/p&gt;

&lt;p&gt;A model that scores 95% on a benchmark has not told you anything useful about that remaining 5%. Are those failures random? Concentrated in the highest-stakes inputs? Invisible to the clinician reviewing the output? Benchmark accuracy doesn't answer any of those questions.&lt;/p&gt;




&lt;h2&gt;Eight Ways AI Fails When It Matters Most&lt;/h2&gt;

&lt;p&gt;The paper introduces a framework — the &lt;strong&gt;LLM Operational Reliability Failure Taxonomy (ORFT)&lt;/strong&gt; — cataloguing eight failure classes that current AI systems exhibit in critical deployments. Together, they form a portrait of a technology that is impressive in the lab and dangerously unpredictable in the field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Epistemic Hallucination.&lt;/strong&gt; The model fabricates facts with complete fluency — a drug interaction that doesn't exist, a citation that was never written, an intelligence assessment built on nothing. The output is indistinguishable from a correct one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Overconfidence Failure.&lt;/strong&gt; When AI sounds certain, humans stop checking. Research confirms this: even domain experts reduce their scrutiny when AI outputs are presented confidently — and larger, more capable models are &lt;em&gt;more&lt;/em&gt; prone to producing confident wrong answers than their smaller predecessors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Abstention Failure.&lt;/strong&gt; Sometimes a model should say "I don't know." Newer, more capable models are less likely to refuse a question and more likely to substitute a confident incorrect answer instead. That's not an improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prompt Fragility.&lt;/strong&gt; Change a few words in a question — same meaning, different phrasing — and you may get a substantially different answer. One MIT study found that typos, informal language, and formatting inconsistencies in patient messages caused AI systems to make clinically unacceptable errors. Real patients don't write textbook prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Temporal Drift.&lt;/strong&gt; The model you validated last quarter is not the model running today. Fine-tuning updates, guardrail adjustments, and new versions change behavior — often undocumented, rarely re-validated. A system that met your reliability threshold in January may not meet it in June.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Reasoning Collapse.&lt;/strong&gt; Push the model with a long document, a multi-step problem, or complex logical chains, and coherence can break down entirely. In real-time operational contexts, this may manifest as truncated responses or outputs that look correct but aren't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Agentic Escalation.&lt;/strong&gt; When AI agents take actions — calling APIs, executing code, controlling systems — a single reasoning error can trigger irreversible consequences downstream. In defense contexts, this is not theoretical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Adversarial Manipulation.&lt;/strong&gt; Malicious inputs embedded in documents or messages can cause a model to deviate from its instructions entirely. In contested environments, adversaries will find and exploit this.&lt;/p&gt;




&lt;h2&gt;The Benchmark Problem&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth about how we evaluate AI: the dominant method is multiple-choice tests.&lt;/p&gt;

&lt;p&gt;Models are fed standardized questions, scored against fixed answer keys, and ranked by accuracy. That paradigm was designed to track performance improvements across generations of models. It was never designed to measure operational reliability.&lt;/p&gt;

&lt;p&gt;The paper cites a striking finding: on free-response versions of equivalent medical questions, frontier AI models perform an average of &lt;strong&gt;39 percentage points worse&lt;/strong&gt; than on multiple-choice formats. And those same models score above chance even when the question text is completely hidden — suggesting they're pattern-matching to answer formats, not actually reasoning through the problem.&lt;/p&gt;

&lt;p&gt;We are certifying AI systems for clinical use based on tests the models can partially pass without reading the question.&lt;/p&gt;

&lt;p&gt;The ranking platforms we use to compare models have their own problem: an MIT study found that removing a small slice of the underlying crowdsourced evaluation data can significantly change which model comes out on top. The scoreboards we rely on to make deployment decisions are not stable.&lt;/p&gt;




&lt;h2&gt;The Mitigations We Have Aren't Enough&lt;/h2&gt;

&lt;p&gt;The industry has real tools for improving AI reliability. Retrieval-augmented generation reduces hallucination. Guardrails filter harmful outputs. Fine-tuning improves domain performance. Human oversight catches mistakes.&lt;/p&gt;

&lt;p&gt;None of them close the reliability gap. Each addresses some failure classes while leaving others untouched, and some introduce new failure modes of their own.&lt;/p&gt;

&lt;p&gt;Guardrails don't reduce hallucination — they intercept outputs after the fact and can be bypassed by sophisticated prompt injection. RAG reduces reliance on the model's memory but introduces retrieval errors and its own drift problems. Fine-tuning improves average performance but leaves tail failures — the rare, high-consequence errors — largely unaddressed. And human oversight is systematically undermined by the overconfidence failure: when AI sounds certain, humans defer.&lt;/p&gt;

&lt;p&gt;The paper is clear-eyed about this: we do not currently have the evaluation infrastructure, regulatory frameworks, or monitoring systems required to deploy autonomous AI safely in life-critical applications. We are building the plane while flying it — and some passengers are patients.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9evyv1wta6pt1isvw6lb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9evyv1wta6pt1isvw6lb.jpeg" alt="cover" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;What Reliable AI Would Actually Require&lt;/h2&gt;

&lt;p&gt;The paper doesn't just diagnose the problem. It proposes a path forward, anchored in three concrete proposals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The CRIT-LLM Benchmark&lt;/strong&gt; — an evaluation instrument designed around adversarial inputs, noisy real-world prompts, long-context reasoning, multilingual conditions, and agentic task sequences. The kind of test that reflects how AI actually gets used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Operational Reliability Score (ORS)&lt;/strong&gt; — a composite metric that captures not just accuracy, but confidence calibration, failure concentration in high-stakes inputs, and temporal stability across model updates. A system that scores well on benchmarks but fails catastrophically in adversarial conditions would score poorly on the ORS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The LLM Reliability Stress Test Suite (LRSTS)&lt;/strong&gt; — a modular collection of targeted tests for individual failure classes, deployable as a pre-deployment checklist for critical applications.&lt;/p&gt;

&lt;p&gt;Alongside these, the paper calls for domain-specific operational profiles from regulators — the FDA and defense acquisition authorities need to define what reliability actually means for their contexts, not defer to academic benchmarks — and mandatory continuous monitoring after deployment.&lt;/p&gt;




&lt;h2&gt;The Honest Bottom Line&lt;/h2&gt;

&lt;p&gt;The paper's conclusion deserves to be stated plainly: &lt;strong&gt;frontier AI systems have not yet demonstrated the reliability required for autonomous deployment in life-critical or mission-critical environments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's not an argument against AI in healthcare or defense. The potential is real. It's an argument that we are moving faster than our evidence base supports, deploying technology we cannot yet verify, in situations where the cost of being wrong is measured in lives.&lt;/p&gt;

&lt;p&gt;Anthropic, one of the leading AI developers, has stated explicitly that current AI systems do not meet the reliability requirements for fully autonomous weapons systems. The International AI Safety Report 2026, produced by more than 100 independent experts, identifies AI agents as still prone to basic errors and notes that human oversight becomes harder — not easier — as these systems grow more complex.&lt;/p&gt;

&lt;p&gt;The benchmark scores are impressive. They are also not the right question.&lt;/p&gt;

&lt;p&gt;The right question is: &lt;em&gt;when this system fails, will we know? Will we see it coming? Can we contain it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Until we can answer yes, meaningful human oversight isn't a limitation to be engineered around. It's the only thing standing between capability and catastrophe.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Read the full research paper:&lt;/em&gt; &lt;a href="https://www.researchgate.net/publication/401422885_AI_Reliability_Gap_Why_Large_Language_Models_Fail_in_Safety-Critical_Systems" rel="noopener noreferrer"&gt;https://www.researchgate.net/publication/401422885_AI_Reliability_Gap_Why_Large_Language_Models_Fail_in_Safety-Critical_Systems&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>genai</category>
      <category>aisafety</category>
    </item>
    <item>
      <title>Index-RAG: Citation-first approach to RAG</title>
      <dc:creator>Praneeth Vadlapati</dc:creator>
      <pubDate>Thu, 26 Mar 2026 13:59:02 +0000</pubDate>
      <link>https://dev.to/praneeth-v/index-rag-citation-first-approach-to-rag-3i5k</link>
      <guid>https://dev.to/praneeth-v/index-rag-citation-first-approach-to-rag-3i5k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yala4uz2uns06zkmk1a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yala4uz2uns06zkmk1a.jpg" alt=" " width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The silent citation crisis in RAG systems is finally solved. Meet Index-RAG.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your AI Just Lied About Its Sources — And It Doesn't Even Know It&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You ask your AI assistant a question. It gives you a confident, well-structured answer and tells you it came from "page 5 of the compliance manual." You go to page 5. It's not there. It never was.&lt;/p&gt;

&lt;p&gt;This isn't a fringe bug. It's the defining flaw of how most AI retrieval systems are built today — and for industries where source accuracy is non-negotiable, it's a dealbreaker.&lt;/p&gt;

&lt;p&gt;A new paper, &lt;em&gt;Index-RAG: Storing Text Locations in Vector Databases for Question-Answering Tasks&lt;/em&gt;, presents a deceptively simple but genuinely powerful fix. And if you're building or using AI systems that need to cite sources — in law, medicine, finance, compliance, or research — this is the most important RAG paper you'll read this year.&lt;/p&gt;




&lt;h2&gt;The Hidden Flaw in Traditional RAG&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) was supposed to solve AI hallucination. The idea: instead of asking an LLM to recall facts from memory, you give it a retrieval system that fetches relevant document chunks at query time. The model answers based on what it actually reads, not what it thinks it remembers.&lt;/p&gt;

&lt;p&gt;It works — mostly. RAG systems do improve factual accuracy. But they left one problem completely unsolved: &lt;strong&gt;citation precision.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's why. Traditional RAG pipelines cut documents into fixed-size token chunks before embedding them. It's computationally easy, but it brutally discards the document's structural information — page numbers, line numbers, paragraph boundaries. The system might retrieve exactly the right passage, but it has no idea &lt;em&gt;where&lt;/em&gt; in the original document that passage lives.&lt;/p&gt;

&lt;p&gt;The result? When asked for a source, most RAG systems can only tell you the document title. They approximate, guess, or worse — hallucinate a specific page number that sounds plausible. In regulated industries, that's not just unhelpful. It's dangerous.&lt;/p&gt;




&lt;h2&gt;Workflow&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5c49l1ma4nmwm0eizn4o.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5c49l1ma4nmwm0eizn4o.gif" alt="Ingestion" width="1239" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What Makes Index-RAG Different&lt;/h2&gt;

&lt;p&gt;Index-RAG (i-RAG) is built on one core insight that turns out to change everything: &lt;strong&gt;don't store the text, store the location.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In traditional RAG, when you create multiple embeddings for a document (one per chunk, plus maybe some query expansions), you end up storing the raw text multiple times. That's expensive. And you still don't get precise citations.&lt;/p&gt;

&lt;p&gt;i-RAG flips the model. Every embedding stored in the vector database carries precise &lt;strong&gt;location metadata&lt;/strong&gt; — filename, page number, and line number — pointing back to the canonical source document. No redundant text. No approximations. When the system retrieves a passage, it retrieves the exact coordinates needed to find it in the original.&lt;/p&gt;

&lt;p&gt;This is the key architectural decision: treat document coordinates as &lt;strong&gt;first-class retrieval metadata&lt;/strong&gt;, not an afterthought.&lt;/p&gt;




&lt;h2&gt;How It Works: A Clean, Elegant Pipeline&lt;/h2&gt;

&lt;p&gt;The i-RAG pipeline has four stages that work together to deliver both citation accuracy and retrieval performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Paragraph-Level Segmentation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rather than cutting at arbitrary token counts, i-RAG segments documents at natural paragraph boundaries. Paragraphs are coherent semantic units. They map cleanly to topics. And crucially, they have well-defined positions in the source document — which is what makes precise line-number extraction possible. PDF structural metadata is used to extract exact page numbers, and line numbers are computed from character offsets within each page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Query Expansion Indexing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each document paragraph, i-RAG uses a language model to generate multiple questions that the paragraph could answer. These questions are embedded and stored alongside the paragraph's chunk embedding — all pointing to the same location metadata. This creates multiple semantic entry points per document, solving the classic vocabulary mismatch problem: the user's phrasing might not match the document's phrasing, but it will likely match one of the generated question formulations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Multi-Vector Storage Without Redundancy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each document ends up with several embeddings in the vector index: one per chunk, one per generated question. None of them store a copy of the raw text. They store only location pointers. The vector index stays lean while retrieval coverage expands dramatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Blended Retrieval Scoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At query time, the system retrieves the top candidates using cosine similarity and blends chunk scores (weighted at 0.6) with query-expansion scores (weighted at 0.4) per document. The location metadata attached to the winning result is used to construct a fully qualified citation: filename, page, and line.&lt;/p&gt;




&lt;h2&gt;The Numbers Don't Lie&lt;/h2&gt;

&lt;p&gt;i-RAG was evaluated against a conventional RAG baseline across four standard retrieval metrics. The results are consistent and meaningful:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline RAG&lt;/th&gt;
&lt;th&gt;Index-RAG&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Precision@1&lt;/td&gt;
&lt;td&gt;0.667&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+25.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision@5&lt;/td&gt;
&lt;td&gt;0.367&lt;/td&gt;
&lt;td&gt;0.383&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+4.6%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MRR&lt;/td&gt;
&lt;td&gt;0.819&lt;/td&gt;
&lt;td&gt;0.917&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+11.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nDCG@10&lt;/td&gt;
&lt;td&gt;0.866&lt;/td&gt;
&lt;td&gt;0.934&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+7.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 25% relative improvement in Precision@1 (0.667 to 0.833) means the single most relevant document is retrieved correctly a quarter more often than before. For applications where users ask a question and expect one authoritative answer — legal lookups, medical reference, compliance queries — this is significant.&lt;/p&gt;

&lt;p&gt;And unlike reasoning-based RAG alternatives, which achieve citation accuracy by running expensive LLM reasoning passes over entire documents at query time, i-RAG maintains &lt;strong&gt;fast retrieval&lt;/strong&gt;. It's not trading speed for precision. It's getting both.&lt;/p&gt;




&lt;h2&gt;Why This Matters Beyond the Benchmarks&lt;/h2&gt;

&lt;p&gt;Think about what citation accuracy actually unlocks in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In legal work&lt;/strong&gt;, an AI assistant that can point to &lt;em&gt;Smith v. Jones, exhibit 4, page 17, line 8&lt;/em&gt; is usable in a professional workflow. One that says "somewhere in the case documents" is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In medical research&lt;/strong&gt;, a clinician querying a drug interaction database needs to know whether the retrieved contraindication came from a peer-reviewed trial or a case report, and exactly where to go verify it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In compliance&lt;/strong&gt;, an audit trail isn't just about what the AI said — it's about being able to prove exactly which regulation or policy provision the AI was grounding its response in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In academic research&lt;/strong&gt;, imprecise citations aren't citations at all. They're noise.&lt;/p&gt;

&lt;p&gt;The paper's author frames this problem sharply: "Imprecise citations undermine the reliability of AI-assisted information systems and limit the reliable use of generative AI in professional settings." i-RAG addresses exactly this barrier to enterprise adoption.&lt;/p&gt;




&lt;h2&gt;The Deeper Point&lt;/h2&gt;

&lt;p&gt;There's a version of AI that's impressive in demos but unreliable in production. It answers confidently. It sounds credible. But it can't show its work — not really. In domains where showing your work is legally, professionally, or ethically required, that AI isn't usable at all.&lt;/p&gt;

&lt;p&gt;i-RAG is a step toward AI systems that are not just accurate, but &lt;strong&gt;verifiably accurate&lt;/strong&gt;. Systems that don't just retrieve the right information, but can tell you, to the line, where it came from.&lt;/p&gt;

&lt;p&gt;That's not a minor feature. That's the difference between a research novelty and a production system.&lt;/p&gt;




&lt;h2&gt;Read the Full Research&lt;/h2&gt;

&lt;p&gt;The technical architecture of i-RAG — including the full treatment of the query expansion mechanism, the multi-vector scoring strategy, the evaluation methodology, and the discussion of edge cases in paragraph segmentation — is detailed in the original paper.&lt;/p&gt;

&lt;p&gt;If you're building RAG systems, evaluating AI for professional use cases, or just curious about the state of the art in citation-accurate retrieval, the paper is worth your time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📄 Read the full paper on ResearchGate:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.researchgate.net/publication/397745877_Index-RAG_Storing_Text_Location_in_Vector_Databases_for_QA_tasks" rel="noopener noreferrer"&gt;https://www.researchgate.net/publication/397745877_Index-RAG_Storing_Text_Location_in_Vector_Databases_for_QA_tasks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem of AI that can't cite its sources has been treated as inevitable for too long. Index-RAG makes a compelling case that it doesn't have to be.&lt;/p&gt;

&lt;p&gt;The source code is open at &lt;a href="https://github.com/Pro-GenAI/Index-RAG" rel="noopener noreferrer"&gt;github.com/Pro-GenAI/Index-RAG&lt;/a&gt;, and the system is designed to be up and running in minutes with Pinecone, Cohere, and OpenAI API keys.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Interested in citation-accurate AI, trustworthy LLM systems, and the future of RAG? Follow for more deep dives into applied AI research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: #MachineLearning #RAG #LLM #ArtificialIntelligence #NLP #VectorDatabases #AIEngineering #GenAI #CitationAccuracy #RetrievalAugmentedGeneration&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>🛡️ Agent Action Guard: Framework for Safer AI Agents</title>
      <dc:creator>Praneeth Vadlapati</dc:creator>
      <pubDate>Wed, 25 Mar 2026 15:42:59 +0000</pubDate>
      <link>https://dev.to/praneeth-v/agent-action-guard-framework-for-safer-ai-agents-4h4i</link>
      <guid>https://dev.to/praneeth-v/agent-action-guard-framework-for-safer-ai-agents-4h4i</guid>
      <description>&lt;p&gt;AI is perceived as a threat to humanity. As AI agents gain the ability to call APIs, run code, modify files, and interact with external systems, a new challenge emerges: how do we ensure the safety of the actions they take — not just the text they generate?&lt;/p&gt;

&lt;p&gt;Today’s guardrails mostly filter responses, not actions. But in real-world testing, agents sometimes executed harmful actions even while verbally refusing to do so. That’s a critical gap in modern AI safety.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4jnzr9f823ef5z1v8ev.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4jnzr9f823ef5z1v8ev.jpg" alt=" " width="671" height="635"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faajadc66jjcnmha99qr4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faajadc66jjcnmha99qr4.gif" alt="Implementation" width="936" height="828"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To address this, I developed Agent Action Guard, a framework designed to identify and block unsafe actions before they execute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6onfpj6706zi6drdsihc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6onfpj6706zi6drdsihc.gif" alt="demo" width="760" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔒 What Agent Action Guard framework includes&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;HarmActions Dataset&lt;/strong&gt;&lt;br&gt;
A first-of-its-kind dataset focused on agent actions, not prompts. Each example includes:
&lt;ul&gt;
&lt;li&gt;MCP-style action objects&lt;/li&gt;
&lt;li&gt;Labels: safe, harmful, unethical&lt;/li&gt;
&lt;li&gt;Risk levels&lt;/li&gt;
&lt;li&gt;Adversarial prompts (e.g., letter substitutions)&lt;/li&gt;
&lt;/ul&gt;
This dataset highlights real failure modes in tools like file operations, messaging APIs, and code execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action Classifier&lt;/strong&gt;&lt;br&gt;
A compact neural classifier built using MiniLM embeddings. It runs in real time inside agent loops, classifying each action as “Safe,” “Harmful,” or “Unethical”. Despite being lightweight, it reaches 90.32% accuracy and avoids the heavy cost of running an LLM for every action’s classification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HarmActionsEval Benchmark&lt;/strong&gt;&lt;br&gt;
A new evaluation method using a metric called Harm@k, which measures how likely an agent is to produce harmful actions within its first k attempts. In testing, some large open-source models produced harmful actions more than 70% of the time under adversarial prompts — proof that action-level safety checks are urgently needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🚀 Why This Matters&lt;br&gt;
Agentic AI systems are becoming more capable every day.&lt;br&gt;
But without action-level supervision, they can silently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send harmful messages&lt;/li&gt;
&lt;li&gt;Execute unsafe code&lt;/li&gt;
&lt;li&gt;Modify sensitive files&lt;/li&gt;
&lt;li&gt;Interact with external APIs in risky ways&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agent Action Guard adds a missing layer of protection — one that sits between the model and its tools, intercepting dangerous behavior before it causes real harm.&lt;/p&gt;

&lt;p&gt;HarmActionsEval results:&lt;/p&gt;

&lt;p&gt;80% of the LLMs tested executed actions at the first attempt for over 97% of the harmful prompts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v2ipnjuk9yf3xi8kqp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v2ipnjuk9yf3xi8kqp0.png" alt=" " width="548" height="738"&gt;&lt;/a&gt;&lt;br&gt;
*popular proprietary models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupajp6flfbl2b2812d4l.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupajp6flfbl2b2812d4l.gif" alt="banner" width="600" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💻 Try the code yourself&lt;br&gt;
If you’re building AI agents — or researching agent safety — the GitHub repository contains the full framework: dataset, classifier, MCP proxy implementation, and evaluation code.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/Pro-GenAI/Agent-Action-Guard" rel="noopener noreferrer"&gt;https://github.com/Pro-GenAI/Agent-Action-Guard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project is open-source and designed to be extended.&lt;br&gt;
If you’re working on agentic safety, I’d love for you to explore it, test it, and help push the field forward by implementing it in your project.&lt;/p&gt;

&lt;p&gt;📘 Paper&lt;br&gt;
&lt;a href="https://www.researchgate.net/publication/396525269_Agent_Action_Guard_Safe_AI_Agents_through_Action_Classifier" rel="noopener noreferrer"&gt;https://www.researchgate.net/publication/396525269_Agent_Action_Guard_Safe_AI_Agents_through_Action_Classifier&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>agents</category>
      <category>openclaw</category>
    </item>
  </channel>
</rss>
