Uchi Uchibeke

Posted on Mar 16 • Originally published at uchibeke.com

AI Guardrail Poisoning: Someone Rewrote McKinsey’s Lilli With One SQL Query

#aisecurity #guardrails #aiagents #security

Someone rewrote McKinsey's AI chatbot's guardrails with a single SQL UPDATE statement. No deployment needed. No code change. No one noticed until a security researcher wrote it up.

That's the story of Lilli, McKinsey's internal AI assistant used by thousands of consultants. A researcher found a SQL injection flaw in the application layer. Because the flaw was read-write, an attacker could silently rewrite the prompts that controlled how Lilli behaved: what guardrails it followed, how it cited sources, what it refused to do. The Register covered it last week.

"No deployment needed. No code change. Just a single UPDATE statement wrapped in a single HTTP call."

The holes are now patched. But the larger threat, as the researcher told The Register, remains.

This is what I'd call guardrail poisoning. And it's more common than the industry wants to admit.

TL;DR

McKinsey's Lilli AI had its behavioral guardrails silently rewritten via SQL injection
The attack vector: guardrails stored as mutable database rows, not enforced at runtime
Static guardrails (stored as config) decay; runtime authorization (verified at call time) does not
The fix isn't better SQL sanitization; it's moving the trust boundary from storage to execution
Pre-action authorization at the tool call level is the architecture that makes this class of attack structurally impossible

Why guardrail poisoning is different from prompt injection

Prompt injection is the one people talk about. An attacker slips instructions into a document or user input, and the agent follows them. It's been widely discussed since 2023, and most developers are at least aware of it.

Guardrail poisoning is quieter and, I'd argue, harder to detect.

In prompt injection, the attacker convinces the AI to do something it shouldn't do right now. In guardrail poisoning, the attacker changes what the AI believes it is allowed to do, persistently, across every future interaction.

Think of it this way. Prompt injection is a forged boarding pass. Guardrail poisoning is getting into the airline's system and rewriting your travel history so you're now registered as a trusted crew member.

One is a one-time exploit. The other is a persistent identity compromise.

The architecture that makes this possible

Here's what I believe happened in the Lilli case, based on the public writeup.

The AI's behavioral rules, things like "cite sources this way," "refuse requests about X," "don't discuss Y topics," were stored as rows in a database. The application layer read those rows at query time and injected them into the prompt context.

That's a common pattern. It's flexible. It lets product teams update guardrail behavior without a code deploy. And on the surface, it makes sense.

The problem is this: a guardrail that can be rewritten by anyone with database write access is not a guardrail. It's a preference.

The attack surface here is not the AI model. It's not the inference layer. It's the database that happens to hold the behavioral configuration. And SQL injection vulnerabilities are not rare. They are, according to OWASP's 2021 Top 10, the third most common web application vulnerability class. They're not exotic. They're table stakes.

When your guardrails live in a mutable row, every SQL injection, every misconfigured admin panel, every insider with database write access is a potential attacker.

Static configuration versus runtime enforcement

This is the distinction the industry keeps underweighting.

	Static Guardrails	Runtime Authorization
Where enforced	At prompt assembly time	At action execution time
Trust source	The stored config	An independently verified policy
Vulnerable to	SQL injection, config overwrite, prompt injection	Only a compromised signing key
Audit trail	Optional, often absent	Inherent (receipt per action)
What Lilli had	This	Not this

Static guardrails: rules stored as text, injected into prompts, evaluated by the model's own judgment. They can be updated, overwritten, ignored by a sufficiently adversarial prompt, or, as in Lilli's case, silently replaced before the model ever sees them.

Runtime authorization: a check that fires at the moment the agent is about to take an action, compares the action against a policy, and allows or blocks it regardless of what the model was told in the system prompt.

The difference is the trust boundary. Static guardrails trust the storage. Runtime authorization trusts neither the storage nor the model. It enforces at the point of execution.

I've been building in this space with APort, and one of the clearest things I've learned is that the most dangerous assumption in AI security is this: "we already told the model what not to do."

Telling a model what not to do is useful. Verifying what it's about to do, at the moment it's about to do it, is what actually stops things.

What pre-action authorization looks like in practice

When I wrote about pre-action authorization earlier in this series, the core idea was simple: put a checkpoint between the agent and the tool.

Here is what that looks like in a minimal implementation:

# Before the agent executes a tool call
def before_tool_call(tool_name: str, params: dict, context: dict) -> bool:
    decision = aport.verify(
        tool=tool_name,
        params=params,
        agent_id=context["agent_id"],
        policy_scope=context["policy_scope"]
    )
    if not decision.allow:
        raise AuthorizationError(f"Blocked: {decision.reason}")
    log_receipt(decision.receipt_id)
    return True

The key properties this has that a stored guardrail does not:

It runs at execution time, not at prompt assembly time. Rewriting the system prompt doesn't affect it.
The policy is evaluated by a separate process, not the model itself. The model's opinion of what it should do is not the enforcement mechanism.
Every blocked and allowed action produces a receipt. Audit trail is inherent, not optional.
The policy source can be cryptographically signed. If someone tries to rewrite the policy, the signature fails.

Point four is the direct answer to what happened with Lilli. If the guardrail policy carried a signature that the runtime enforcement layer verified before applying, a SQL injection that changed the rows would produce a signature mismatch and fail closed.

The vulnerability is not "SQL injection exists." The vulnerability is "the system trusted modified rows without verification."

This is not a McKinsey problem; it is an industry pattern

I want to be careful here. This is not a takedown of McKinsey's engineering. SQL injection vulnerabilities happen to careful teams. The more interesting question is why the architecture made this attack so impactful.

And the answer is that the industry has largely converged on a pattern where behavioral control of AI agents lives in a layer that was never designed for security enforcement: the prompt.

Prompts are text. Text can be overwritten, injected, extended, and ignored. Building your security model on top of text that gets fed to a probabilistic model is not security engineering. It's optimistic text engineering.

NIST's AI Risk Management Framework (AI RMF 1.0) specifically flags this under the "Govern" function: AI systems need controls that operate independently of the model's learned behavior. The model should not be the policy enforcement point.

A recent analysis of enterprise AI agent security in 2026 found that 88% of organizations had AI agent security incidents last year, yet a third still have no process to validate AI security before deployment. Not validate AI accuracy. Validate AI security. A third.

We are deploying agents into production that can send emails, write to databases, call APIs, and execute code, and a significant fraction of those agents have no authorization layer that operates independently of the prompts fed to the model.

What this means if you're building agents today

If your AI agent's behavioral rules are stored as rows in a database, or as strings in a config file, or as text in a system prompt: ask yourself what happens if those strings change.

Can they change without a code deploy? Can they change without a review? Can they be changed by anyone with SQL write access, or S3 write access, or environment variable write access?

If yes, you don't have guardrails. You have defaults.

The Lilli attack is a clarifying example, but it's not the only vector. Prompt injection via user input, jailbreaks, compromised retrieval sources that inject into RAG context, and insider modification of stored configurations all share the same underlying flaw: they all assume the model or the stored config can be trusted at execution time.

The fix is the same in each case: enforce at execution time, independent of the model's own judgment, with receipts.

My experience building identity infrastructure for financial systems taught me this the hard way. In fintech, we never trusted the transaction description. We verified the transaction. The authorization step was not optional and it did not read from a user-supplied field. It compared against a signed, independently stored policy.

That is the model AI agent security needs to borrow.

What this is NOT

Pre-action authorization is not a silver bullet for all AI security concerns. It does not protect against a compromised policy store if the policy store itself has no integrity verification. It does not prevent the model from producing bad outputs that don't involve tool calls. It does not replace prompt engineering or input validation.

What it does: it closes the specific attack class where the agent takes a consequential external action that was not authorized by a current, verified policy. That class includes the Lilli scenario. It includes the production database deletion I have seen in my own testing. It includes the accidental bulk email sends that show up on HN every few months.

Those are the actions you cannot undo. Those are the ones that need a hard checkpoint.

The question the industry needs to answer

The Lilli holes are closed. But the researcher's point stands: the larger threat remains.

Every team building production AI agents is making a choice, often implicitly, about where the trust boundary lives. Is it the model? The system prompt? The stored config? The database?

Runtime authorization says: none of those. The trust boundary is the execution checkpoint, and the policy it enforces is independently verified every single time.

That is not a new idea. It is how we built secure financial systems, secure access control, and secure identity infrastructure. We are just overdue to apply it to AI agents.

Over to You

Has your AI agent ever done something it was never supposed to do? Not a prompt injection demo in a sandbox; a real production action that surprised you. What was the first sign something was wrong?

I'm curious whether the failure came from the model ignoring a rule, from a misconfigured policy, or from something upstream that changed the context the agent was operating in.

DEV Community