Delafosse Olivier

Posted on May 20 • Originally published at coreprose.com

How AI Hallucinations Are Creating Real Security Risks in Critical Infrastructure

#ai #machinelearning #llm #programming

Originally published on CoreProse KB-incidents

Large language models (LLMs) now sit at the core of enterprise AI stacks:

SOC copilots triaging security threats
OT dashboards summarizing telemetry
Cloud copilots modifying IAM
Conversational AI for customer service and supply chain management
AI-native engineering tools shipping infrastructure-as-code

Once these systems hallucinate—and those hallucinations drive tools, AI agents, or human actions—the result is a new class of security incident, not a UX glitch. [3][11]

By 2026, many enterprises treat generative AI as a nervous system for decisions across finance, operations, and governance, turning hallucinations into board-level risk. [11] In critical infrastructure—data centers, energy grids, transport, and financial plumbing—where physical processes and regulation are tightly coupled, this becomes especially dangerous.

Early cases show real legal and financial exposure: Air Canada was held to a refund policy its chatbot invented, and Deloitte refunded part of a contract after a report contained AI-generated fake citations. Pushed into SOCs, OT dashboards, and operational SaaS, the same pattern can cause safety and availability incidents.

This article explains how hallucinations interact with agentic AI, RAG, and production tooling; how attackers exploit them; and which defenses engineering teams can actually ship for critical environments.

1. From Funny Mistakes to Systemic Risk: Why Hallucinations Matter for Critical Infrastructure

Here, hallucinations are model outputs that are not grounded in fact or in the sources the model was given: they fabricate plausible facts, interpretations, or citations. [6] Two core types expand on earlier taxonomies like Lilian Weng’s:

Factual hallucinations – confident but false statements
Fidelity hallucinations – distortions of source documents or telemetry [6]

In SOCs and OT environments, both are dangerous because operators increasingly rely on AI to interpret SIEM alerts, OT sensor data, or regulatory texts. [6] A mis-summarized OT alert can downplay an anomaly that should trip a safety interlock, with national-scale disruption and security implications. [5][9]

By 2026, the target shifts from “zero hallucination” to calibrated uncertainty: systems should express doubt, expose confidence levels, and surface citations—especially in incident response and control rooms. [6][11] Surveys of 225 security, IT, and risk leaders show most organizations see AI risk management and verification as necessary tradeoffs for speed and automation. [11]
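A minimal sketch of how calibrated output handling might look in Python. The `CopilotAnswer` structure, the `route_answer` function, and the 0.8 threshold are illustrative assumptions, not part of any cited framework. The confidence value is assumed to come from a separate calibration step or detector rather than from the model's self-report alone.

```python
from dataclasses import dataclass, field


@dataclass
class CopilotAnswer:
    """Hypothetical structured answer from a SOC or control-room copilot."""
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]
    citations: list[str] = field(default_factory=list)  # IDs of supporting sources


def route_answer(answer: CopilotAnswer, min_confidence: float = 0.8) -> str:
    """Decide how an answer is presented; the threshold is an assumed policy value."""
    if not answer.citations:
        # No evidence attached: a human should check the raw data first.
        return "escalate"
    if answer.confidence < min_confidence:
        # Shown to the operator, but visibly flagged as uncertain.
        return "show_with_warning"
    return "show"
```

The design choice here is that missing citations outrank a high confidence score: an uncited answer is escalated even when the model appears certain.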

⚠️ Risk shift

Hallucinations are treated as systemic security and governance risks, not just product-quality bugs. [11]

Hallucinations as security, not UX, problems

Generative AI security is now a distinct discipline, covering prompt injection, model theft, and data poisoning—threats that legacy defenses cannot see. [3][5] Firewalls and EDR observe TCP flows and system calls, not reasoning failures or fabricated outputs.

AI risk frameworks (OWASP Top 10 for LLM Applications, NIST AI RMF) increasingly govern how hallucination-prone systems are deployed in safety-critical or regulated domains. [3][5] Hallucination-driven failures are often filed under:

Misuse and abuse of AI systems
Escalation of autonomous systems
Compliance and governance failures [5][11]

For critical infrastructure operators, the real danger is system-level coupling. A hallucinated instruction can:

Change cloud posture (open storage, weaken firewall rules)
Mislead defenders during an incident
Trigger the wrong physical response in OT systems [9][10]

When AI systems sit in the middle of operational and financial supply chains, these errors ripple outward to vendors, partners, and customers, a pattern that recent AI security outlooks for 2026 repeatedly flag.

💡 Mini-conclusion

In critical infrastructure, hallucinations are failure modes in tightly coupled socio-technical systems that can cascade into physical and economic damage. [5][11]

2. How Hallucinations Interact with Agentic AI and Tooling

Wrapped in agentic AI systems—agents that plan, decide, and act with partial autonomy—hallucinations become actions. Agents can call APIs, run code, modify databases, or change access policies based on stochastic internal reasoning. [2][4]

When an agent hallucinates a plan or tool parameter, that error can directly change production.

Agent behavior in security-sensitive environments

By 2026, guidance warns that poorly understood agent behavior, limited operator expertise, and rapid deployment let hallucinated decisions propagate without review. [2] Databricks notes that useful agents almost always combine: [4]

Sensitive data
Untrusted inputs
External actions

This combination makes prompt injection and hallucination exploitation dangerous in SOCs, CI/CD, OT‑IT bridges, and critical SaaS apps. [4][10] Industrialized cybercrime can leverage off-the-shelf agents and GPT-based copilots to scale attacks.

⚠️ Rule of Two for Agents

No agent should simultaneously have (1) high-privilege tools and (2) exposure to untrusted content without at least one strong mitigating control (moderation, isolation, containment, or human review). [4]
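The rule above can be checked mechanically at deployment time. The sketch below is one hedged way to do it; `AgentConfig`, its field names, and `violates_rule_of_two` are hypothetical and not taken from Databricks' material. Only the four mitigation names come from the text.

```python
from dataclasses import dataclass, field

# Mitigating controls named in the rule above.
MITIGATIONS = {"moderation", "isolation", "containment", "human_review"}


@dataclass
class AgentConfig:
    """Hypothetical deployment descriptor for one agent."""
    name: str
    high_privilege_tools: bool
    untrusted_content: bool
    controls: set[str] = field(default_factory=set)


def violates_rule_of_two(cfg: AgentConfig) -> bool:
    """True when the risky combination is present with no recognized mitigation."""
    risky = cfg.high_privilege_tools and cfg.untrusted_content
    return risky and not (cfg.controls & MITIGATIONS)
```

A check like this could run in CI against agent manifests, so that a violating configuration fails review before it reaches production.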

New failure modes: tool hijacking, memory, and cascading actions

Agentic threat analyses highlight: [10]

Tool hijacking – malicious content steering an agent to the wrong tool
Privilege escalation – hallucinated or manipulated role assumptions
Memory poisoning – false “facts” stored and reused in planning
Cascading failures – one bad action spawning many follow-ons

As enterprises grant agents autonomous code execution and database modification, hallucinated commands become a direct attack surface on core infrastructure. [9][10] AI security engineers across vendors and labs concur on the need for strong guardrails as these capabilities move into production.

A utility SOC copilot that misreads a benign OT maintenance event as lateral movement and pushes the wrong EDR containment playbook illustrates the risk: errors appear as confident automation unless humans verify raw data.

💼 Mini-conclusion

Once agents are wired to production tools, hallucinations become operational choices. Without strict tool governance and privilege boundaries, those choices can resemble full attacker control. [2][4][10]

3. Concrete Attack Paths: From Hallucinated Outputs to Exploitable Channels

Hallucinations are not just random noise; adversaries can steer and exploit them.

AI assistants as covert C2 channels

Check Point Research showed that an LLM assistant with web access can be repurposed as a covert command-and-control (C2) channel, without dedicated C2 infrastructure or API keys. [1] In controlled tests against Grok and Microsoft Copilot, malware:

Asked the assistant to fetch an attacker-controlled URL
Used instructions on that page as “commands”
Exfiltrated data via follow-up queries [1]

Because many organizations treat assistant traffic as low-risk and under-instrumented, these flows often bypass EDR and blend into normal SaaS collaboration. [1]

📊 Key pattern

Attackers extend the pattern of abusing trusted cloud services as C2, now via AI assistants that are even less instrumented and harder to block. [1] Stealthy data exfiltration via embedded instructions or chained queries becomes realistic for critical operators.
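Where an organization runs its own assistant or proxies assistant traffic, one possible control is to restrict which URLs the assistant may fetch. The sketch below assumes such a chokepoint exists; it does not apply to consumer assistants outside your control. The allowlist contents and the `is_fetch_allowed` name are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; a real deployment would load this from policy.
ALLOWED_FETCH_DOMAINS = {"docs.example.com", "intel.example.org"}


def is_fetch_allowed(url: str, allowed: set[str] = ALLOWED_FETCH_DOMAINS) -> bool:
    """Allow only HTTPS fetches to an allowlisted domain or one of its subdomains."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Exact match or a true subdomain; "docs.example.com.evil.test" does not pass.
    return any(host == d or host.endswith("." + d) for d in allowed)
```

Denied fetches are worth logging as well as blocking, since repeated attempts to reach unknown domains are the signal the C2 pattern would produce.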

Prompt injection and hallucinated state

Databricks’ threat model shows how untrusted content—logs, wikis, API docs—can embed prompts that cause an agent to hallucinate system state and execute dangerous actions, especially when it: [4]

Assumes it sees ground truth
Takes multi-step actions without checks
Summarizes or transforms content before humans read it [4][10]
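One way to counter the first two conditions is to compare what the agent believes about system state with an authoritative source before any multi-step action runs. This is a minimal sketch under that assumption; the function names and the flat key-value state model are hypothetical simplifications.

```python
def verify_claimed_state(claimed: dict[str, str], authoritative: dict[str, str]) -> list[str]:
    """Return the keys where the agent's claimed state disagrees with the source of truth."""
    return sorted(
        key for key, value in claimed.items()
        if authoritative.get(key) != value
    )


def safe_to_act(claimed: dict[str, str], authoritative: dict[str, str]) -> bool:
    """Proceed only when every claim the plan depends on is confirmed."""
    return not verify_claimed_state(claimed, authoritative)
```

A claim about an asset that the authoritative source does not know at all counts as a mismatch, which is the conservative choice for hallucinated hosts or resources.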

SentinelOne’s taxonomy calls this misuse and escalation of autonomous systems—adversaries steering models into high-impact behaviors that blend hallucination, prompt injection, and tool misuse. [5]

In offensive cloud PoCs, multi-agent LLM systems autonomously performed 80–90% of an espionage campaign against a misconfigured GCP environment, chaining recon, misconfig exploitation, and exfiltration. [9] Imperfect but fast behavior was still highly effective. [9][10]

⚡ Mini-conclusion

Attackers do not need perfectly accurate AI—just AI good enough at exploring options. Hallucinations can help by generating diverse behaviors defenders did not anticipate. [1][4][9]

4. Hallucinations Inside the AI Stack: RAG, Memory, and Detection Systems

Many teams expect Retrieval-Augmented Generation (RAG) to “fix” hallucinations. In practice, RAG changes failure modes instead of eliminating them. [6][8]

RAG and fidelity failures

Models can still fabricate citations or distort retrieved content, especially with loose prompts or system messages. [6][8] A SOC copilot summarizing SIEM alerts, threat intel, and OT telemetry can:

Omit key fields
Merge distinct alerts
Add plausible but nonexistent indicators [6]

In incident response, such distortions can change priorities, containment scope, or regulatory reporting, amplifying business impact. Hypothetical financial-sector scenarios show the same effect when a misclassification delays response.
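The third failure, invented indicators, is one of the easier ones to test for automatically. The sketch below checks only IPv4-looking strings, as an assumed simplification; a fuller version would also cover hashes, domains, and hostnames. The function name is hypothetical.

```python
import re

# Loose IPv4 shape; it does not validate octet ranges, which is enough for comparison.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def unsupported_indicators(summary: str, sources: list[str]) -> set[str]:
    """IPv4-looking strings in the summary that appear in none of the source texts."""
    in_summary = set(IPV4_PATTERN.findall(summary))
    in_sources: set[str] = set()
    for text in sources:
        in_sources.update(IPV4_PATTERN.findall(text))
    return in_summary - in_sources
```

A non-empty result does not prove a hallucination, but it is a cheap reason to show the operator the raw alerts next to the summary.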

💡 Guarded RAG pattern

To reduce hallucinations, production RAG systems usually combine: [7][8]

Explicit source citation and “according to these documents” phrasing
Constrained formats (JSON schemas, enums, strict field types)
Retrieval grounding, instructing the model to answer only from context

These are mitigations, not guarantees. [7][8]
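As an illustration of constrained formats combined with citation checks, the following sketch validates a model's JSON output before anything downstream consumes it. The field names (`summary`, `severity`, `sources`), the severity enum, and the function name are assumptions made for this example.

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def rag_answer_errors(raw: str, retrieved_ids: set[str]) -> list[str]:
    """Return a list of validation problems; an empty list means the answer is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    errors = []
    if not isinstance(data.get("summary"), str):
        errors.append("summary must be a string")
    severity = data.get("severity")
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        errors.append("severity outside allowed enum")
    sources = data.get("sources")
    if not isinstance(sources, list) or not sources:
        errors.append("at least one source is required")
    else:
        # Every cited ID must be one of the documents actually retrieved.
        unknown = [s for s in sources if not isinstance(s, str) or s not in retrieved_ids]
        if unknown:
            errors.append("cited sources not in retrieved context")
    return errors
```

This catches fabricated citation IDs and off-schema values, but not a summary that cites a real document while misstating it, which is why the text treats these as mitigations only.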

Guardrails and internal detectors

Modern guardrail architectures separate: [7]

Input control (prompt validation, prompt-injection filters, input sanitization)
Output moderation (toxicity, PII, policy violations)
Governance loops (usage analytics, AI risk management, feedback)

Advanced techniques like Cross-Layer Attention Probing (CLAP) train lightweight classifiers on model activations to flag likely hallucinations in real time, even without external ground truth—useful for SOC copilots and change-management bots. [6]

Agent memory adds memory poisoning: prior hallucinations or adversarial inserts stored in long-term memory later shape planning as if they were facts. [6][10]
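A hedged sketch of one defense against memory poisoning: record provenance with every memory entry and expose only verified entries to the planner as facts. The `MemoryEntry` structure and the origin labels are hypothetical, and how an entry becomes verified is left to your own process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """Hypothetical long-term memory record with provenance."""
    claim: str
    origin: str  # e.g. "model_inference", "untrusted_document", "verified_telemetry"
    verified: bool = False


def planning_context(memory: list[MemoryEntry]) -> list[str]:
    """Only verified entries are handed to the planner as facts."""
    return [m.claim for m in memory if m.verified]


def needs_review(memory: list[MemoryEntry]) -> list[MemoryEntry]:
    """Unverified entries form the queue for periodic scrubbing."""
    return [m for m in memory if not m.verified]
```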

⚠️ Mini-conclusion

RAG, memory, and guardrails form a complex ecosystem. Without logging, monitoring, and periodic scrubbing of what the system “believes,” hallucinations accumulate and propagate into every future decision. [6][7][11]

5. Threat Modeling Hallucinations in Critical Infrastructure Workflows

To manage hallucinations as security risks, fold them into standard threat modeling and governance—not just model-quality discussions.

Make the AI stack visible

Generative AI security programs recommend building an AI bill of materials (AI-BOM) so defenders know: [3]

Where LLMs and GPT-style models sit in control paths
Which tools and APIs they can call
What data they can read and write

Without this, “shadow agents” emerge in side projects with quiet access to OT telemetry, production databases, or IAM APIs. [2][3][10] Less obvious surfaces include plugins and knowledge-base tools that summarize logs before humans see them.
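An AI-BOM can start as a simple structured inventory. The sketch below is one possible shape; the field names, the `CONTROL_SURFACES` set, and the example systems are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed set of surfaces where a write changes real-world posture.
CONTROL_SURFACES = {"iam", "firewall", "plc", "saas_admin"}


@dataclass
class AIBomEntry:
    """Hypothetical AI-BOM record for one AI-enabled system."""
    system: str
    model: str
    tools: list[str] = field(default_factory=list)
    reads: list[str] = field(default_factory=list)
    writes: list[str] = field(default_factory=list)
    owner: Optional[str] = None  # None marks an unowned, possibly "shadow" deployment


def control_path_systems(bom: list[AIBomEntry]) -> list[str]:
    """Systems whose write access reaches at least one control surface."""
    return [e.system for e in bom if CONTROL_SURFACES & set(e.writes)]


def unowned_systems(bom: list[AIBomEntry]) -> list[str]:
    """Systems with no accountable owner recorded."""
    return [e.system for e in bom if e.owner is None]
```

Even this small structure answers the three questions in the list above and gives shadow agents a place to show up once they are discovered.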

AI risk frameworks categorize threats like adversarial inputs, supply-chain attacks, privacy leakage, misuse, and bias—all of which hallucinations can trigger or worsen. [5] Offensive cloud PoCs show AI mainly accelerates recon and misconfig exploitation using existing services. [9]

📊 Governance impact

Case studies describe hallucinations costing millions through bad strategic decisions, mispriced risk, and compliance failures, often because leadership implicitly trusted AI outputs. [11] Large consultancies now frame this as central to long-term value capture, not a side ethics issue.

Applying zero-trust to AI outputs

Security leaders are urged to align AI with zero-trust, treating model outputs as untrusted data that must be validated before influencing identity, network, or OT controls. [3][5]

A practical threat-modeling checklist:

Map every place AI can write to a control surface (firewall, IAM, PLC, SaaS admin)
Classify each as read-only, suggest, or write/execute
Require human or independent verification for any “write/execute” path
Instrument detailed logging for every AI-driven action and decision [3][5][11]
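The checklist translates fairly directly into code. In this sketch the three access modes come from the checklist, while the `AI_TOUCHPOINTS` inventory, the surface names, and the audit record fields are hypothetical.

```python
import json
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AccessMode(Enum):
    READ_ONLY = "read_only"
    SUGGEST = "suggest"
    WRITE_EXECUTE = "write_execute"


# Hypothetical inventory of places where AI touches a control surface.
AI_TOUCHPOINTS = {
    "siem_dashboard": AccessMode.READ_ONLY,
    "firewall": AccessMode.SUGGEST,
    "iam": AccessMode.WRITE_EXECUTE,
}


def mode_for(surface: str) -> AccessMode:
    """Unknown surfaces default to the strictest classification."""
    return AI_TOUCHPOINTS.get(surface, AccessMode.WRITE_EXECUTE)


def requires_verification(surface: str) -> bool:
    return mode_for(surface) is AccessMode.WRITE_EXECUTE


def audit_record(surface: str, action: str, approved_by: Optional[str]) -> str:
    """Serialize one AI-driven action as a JSON log line."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "surface": surface,
        "action": action,
        "mode": mode_for(surface).value,
        "approved_by": approved_by,
    }, sort_keys=True)
```

Treating unmapped surfaces as write/execute means a forgotten integration fails toward more review, not less.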

Upskilling security teams on AI-specific risks is now seen as a prerequisite for deploying agentic systems into production infrastructure. [2][5] Practitioners and academics emphasize the need for playbooks specific to hallucination-driven failures, not just generic cyber hygiene.

💼 Mini-conclusion

Treat hallucinations like any exploitable error: visible in architecture diagrams, explicitly modeled in threat scenarios, and constrained by zero-trust controls before they reach real-world levers. [3][5][9][11]

6. Engineering Defenses: Patterns, Controls, and Implementation Guidance

Hallucinations are inherent to current LLMs, so defenses must assume they will happen and constrain blast radius. [6][11] The focus shifts from model perfection to system resilience.

Layered guardrails for critical systems

Guardrail frameworks stress that input filtering, output moderation, and governance telemetry should be designed together. [7] For critical infrastructure, that often means:

A policy engine validating every planned action
Strong tool schemas and safe defaults
Out-of-band monitoring for anomalous AI behavior [4][7]

💡 Example: policy-enforced tool call

def execute_change(request, agent_suggestion):
    # evaluate_policy, log_event, create_ticket, and apply_change are
    # placeholders for your own policy engine, audit log, and change tooling.
    policy = evaluate_policy(request, agent_suggestion)  # "deny", "requires_approval", or "allow"

    if policy.decision == "requires_approval":
        create_ticket(request, agent_suggestion)
        return "Change pending human review."

    if policy.decision == "allow":
        # Only low-risk changes get here
        return apply_change(agent_suggestion)

    # Fail closed: an explicit "deny" or any unrecognized decision blocks the change
    log_event("ai_change_blocked", details=policy.reason)
    return "Change rejected by policy engine."

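With this gate, a hallucinated change can execute without human review only if the policy engine itself rates it low-risk, so the policy rules, not the model, bound the blast radius.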
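Strong tool schemas and safe defaults, listed earlier in this section, can be sketched in the same spirit. The example below validates arguments for a hypothetical firewall-rule tool; the argument names, the port allowlist, and the /24 limit are assumptions chosen for illustration.

```python
import ipaddress

ALLOWED_PORTS = {22, 443}  # assumed allowlist for this hypothetical tool
MIN_PREFIX_LEN = 24        # reject source ranges broader than a /24


def firewall_rule_errors(args: dict) -> list[str]:
    """Return validation problems for a proposed rule; an empty list means acceptable."""
    errors = []
    cidr = args.get("source_cidr")
    try:
        if not isinstance(cidr, str):
            raise ValueError("source_cidr missing or not a string")
        network = ipaddress.IPv4Network(cidr, strict=True)
        if network.prefixlen < MIN_PREFIX_LEN:
            errors.append("source range too broad")
    except ValueError:
        errors.append("source_cidr is not a valid IPv4 network")

    port = args.get("port")
    if not isinstance(port, int) or port not in ALLOWED_PORTS:
        errors.append("port not in allowlist")
    return errors
```

With a schema like this, a hallucinated "open to 0.0.0.0/0" parameter is rejected at the tool boundary regardless of how confident the agent's plan looked.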

Reducing hallucinations where it matters

Effective mitigation techniques include: [6][8]

RAG grounding with high-quality, scoped retrieval
Explicit source citation and “according to these logs/documents” prompts
Constrained response schemas (JSON, enums, fixed ranges)
Prompts that encourage uncertainty and “I don’t know”

These live in prompt templates and middleware, not only fine-tuning. [6][8] Creative systems can tolerate more hallucinations; SOC, OT, and finance tools cannot.
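A prompt template that combines grounding, citation, and permission to say "I don't know" might look like the sketch below. The wording of the template and the `build_grounded_prompt` helper are illustrative assumptions; the exact phrasing that works best varies by model.

```python
GROUNDED_TEMPLATE = """You are assisting a security operations team.
Answer ONLY from the numbered sources below and cite them as [n].
If the sources do not contain the answer, reply exactly: I don't know.

Sources:
{sources}

Question: {question}
"""


def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Number the retrieved documents and insert them into the template."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return GROUNDED_TEMPLATE.format(sources=numbered, question=question)
```

Numbering the sources in middleware, not in the model, gives downstream validators stable IDs to check cited references against.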

Databricks’ layered controls show how to restrict data access, validate inputs (including input sanitization), and constrain outputs for agents, implementing the Rule of Two so a single hallucination cannot reach high-impact tools. [4]

Monitoring and lifecycle risk management

Generative AI security best practices recommend: [3][5]

Zero-trust access controls around AI tools and data
Specialized AI security platforms for discovering shadow AI and attack paths
Continuous monitoring for prompt injection and anomalous outputs

Enterprise AI risk programs emphasize continuous assessment from data collection through deployment to catch new hallucination patterns and emerging threats. [5][11] Incidents involving shadow AI, browser extensions, and poorly governed copilots already show unanticipated exposure.

Because hallucinations are unavoidable, governance guidance urges processes where critical decisions require human confirmation or independent data verification before execution. [6][11] In the emerging “Answer Economy,” AI drafts answers, and humans specialize in verification before those answers touch money, safety, or reputation.

⚠️ Mini-conclusion

The engineering goal is not perfect LLM accuracy but system resilience to LLM errors—via guardrails, monitoring, policy engines, and human-in-the-loop checks around every high-impact control surface. [3][4][6][7][10][11]

Conclusion: Treat Hallucinations as a First-Class Threat

Hallucinations are a pervasive failure mode that interacts with agents, tools, cloud services, and human decision-making across your environment. [5][6]

In critical infrastructure—where AI is woven into SOC workflows, OT dashboards, data centers, and governance—these stochastic errors become concrete security risks, amplified by prompt injection, memory poisoning, covert C2 over trusted AI traffic, and AI’s expansion into national-scale infrastructure. [1][5][9][10][11]

Defensive mindset shift:

Treat LLM outputs as untrusted
Instrument agents with strong guardrails and policy engines
Apply AI-specific security frameworks and AI-BOM visibility
Build governance that assumes model error and demands verification for high-impact actions [3][5][7][11]

As you design or review AI-powered systems, map every place a hallucinated output can touch a control surface—cloud, identity, network, OT, or financial—and treat each as a threat-modeling exercise. Coordinate ML, platform, security, and AI ethics teams to implement guardrails, monitoring, and human-in-the-loop checks before hallucinations appear in your incident reports. [2][3][5][11]
