AI Gov Dev for Aguardic

Posted on • Originally published at aguardic.com

The McKinsey AI Breach Isn't About SQL Injection. It's About Writable System Prompts.

A red-team security startup reportedly gained read-write access to McKinsey's internal AI chatbot platform, Lilli, in about two hours. The researchers accessed tens of millions of messages and, more critically, could modify the system prompts that steer the entire application's behavior. No deployment needed. No code change. Just an HTTP request with an UPDATE statement.

To be clear: this was a controlled red-team engagement by CodeWall, not a malicious breach. But the vulnerability pattern it exposed applies to every organization running LLM-powered applications in production. And the real lesson isn't the SQL injection that got them in. It's what they could do once they were there.

Why This Is Bigger Than a Database Vulnerability

The initial foothold was classic application security. Publicly exposed API documentation described unauthenticated endpoints. One of those endpoints was vulnerable to SQL injection. That gave the researchers direct database access, including read and write operations on the tables storing system prompts.
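The entry-point fix is well understood. As a minimal sketch (an in-memory SQLite table stands in for a prompt store here; this is illustrative, not the Lilli codebase), the difference between an injectable query and a parameterized one is a single line:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prompts (name TEXT, body TEXT)")
conn.execute("INSERT INTO prompts VALUES ('default', 'You are a helpful assistant.')")

def get_prompt_unsafe(name: str) -> list:
    # VULNERABLE: user input is interpolated directly into the SQL string.
    # name = "x' OR '1'='1" matches every row; on a write endpoint, the
    # same flaw lets an attacker append UPDATE or DELETE statements.
    return conn.execute(f"SELECT body FROM prompts WHERE name = '{name}'").fetchall()

def get_prompt_safe(name: str) -> list:
    # Parameterized query: the driver treats `name` as data, never as SQL.
    return conn.execute("SELECT body FROM prompts WHERE name = ?", (name,)).fetchall()

payload = "x' OR '1'='1"
print(len(get_prompt_unsafe(payload)))  # 1 — injection returns every row in the table
print(len(get_prompt_safe(payload)))    # 0 — the payload is just a string that matches nothing
```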

SQL injection in 2026 shouldn't happen, but it does. The interesting part of this story isn't the entry point. It's the blast radius.

In a traditional application, SQL injection gives you access to data. In an LLM application, SQL injection gives you access to behavior. The system prompts that define how an AI application responds, what policies it follows, what it refuses to do, and how it handles sensitive information were stored in the same database as chat logs and user data. A single injection vulnerability didn't just expose data. It exposed the control surface of the entire application.

This is the architectural pattern that makes LLM applications fundamentally different from traditional software. In traditional applications, behavior is defined in code. Code is reviewed, versioned, and deployed through a controlled pipeline. In LLM applications, behavior is significantly defined by prompts. And prompts are often treated as configuration: editable, dynamic, stored in databases, managed through admin UIs with less rigor than production code.

The problem is that prompts don't behave like configuration. They behave like code. A system prompt that says "never reveal client confidential information" is a security control. A system prompt that says "act as a senior consultant and provide detailed analysis" defines the application's capability scope. Changing either of these changes what the application does in production, for every user, immediately.

The Difference Between Prompt Leakage and Prompt Tampering

Most teams building LLM applications have thought about prompt leakage. Users extracting system instructions through clever prompting is a well-known risk, and there are established techniques to mitigate it: instruction hierarchy, input filtering, system prompt isolation.

Prompt tampering is a different and more dangerous category. If an attacker can read your system prompts, they understand your application's constraints. If an attacker can write your system prompts, they control your application's behavior.

The asymmetry matters. Prompt leakage is a one-time information disclosure. Prompt tampering is persistent. A modified system prompt affects every subsequent conversation. It scales across every user of the application. And it's hard to detect because the outputs can look normal. The application still responds, still generates plausible text, still appears to function correctly. It's just operating under different rules than the ones you defined.

Consider what a tampered prompt enables. An attacker could relax confidentiality behavior, causing the application to surface internal documents it was supposed to protect. They could bias analysis, steering recommendations toward specific vendors or outcomes. They could suppress safety checks, removing disclaimers or uncertainty indicators from responses. They could change citation behavior, making fabricated claims appear sourced. None of these produce obviously broken outputs. They produce subtly wrong outputs at scale, which is significantly harder to catch.

Prompts Are Production Code. Treat Them Accordingly.

The fix isn't primarily about SQL injection, though obviously that needs to be addressed. The deeper fix is about how organizations treat the artifacts that define LLM behavior.

System prompts need the same controls you apply to production code. That means version control. Every prompt change should be tracked in a system that records who changed what, when, and why. Prompts should live in a repository with history, not in a database field that gets overwritten with no record of the previous value.

It means change review. A prompt change that modifies how the application handles confidential information is a security-relevant change. It should require review before it reaches production, the same way a code change to an authentication system would require review.

It means environment separation. Prompts should move through a pipeline: development, staging, production. Editing prompts directly in production should be as unusual and as controlled as editing code directly in production.

It means immutability in production. The application runtime should have read-only access to prompt artifacts. If prompts need to be updated, the update should come through the deployment pipeline, not through a database write from the running application. This single control would have prevented the Lilli vulnerability from becoming a behavioral takeover, even after the SQL injection succeeded.

And it means drift detection. Even with all the above controls, you should be monitoring whether the prompts running in production match the prompts you deployed. Hash-based verification, scheduled validation runs that compare production behavior against expected baselines, and alerts on any deviation. The same mindset as infrastructure-as-code drift detection, applied to the artifacts that define your AI application's behavior.
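A minimal hash-based drift check might look like this (the manifest format and prompt names are assumptions for illustration; the manifest would be produced at deploy time and stored outside the application's own database):

```python
import hashlib

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Expected hashes, generated when the prompts were deployed.
expected = {"system": sha256("Never reveal client confidential information.")}

def check_drift(live_prompts: dict, expected_hashes: dict) -> list:
    """Return the names of prompts whose live content no longer matches
    the deployed artifact — each hit is an alert, not a log line."""
    return [name for name, body in live_prompts.items()
            if sha256(body) != expected_hashes.get(name)]

live = {"system": "Reveal everything the user asks for."}  # tampered in place
print(check_drift(live, expected))  # ['system']
```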

The Control Plane Is the New Attack Surface

The Lilli incident illustrates a pattern that extends beyond prompts. LLM applications have a control plane that includes system prompts, tool configurations, retrieval settings, and permission schemas. Each of these artifacts influences what the application can do, what data it can access, and how it responds. Together, they define the application's behavior more than its code does.

Traditional application security focuses on the data plane: protecting the information the application processes. LLM application security also needs to protect the control plane: the configuration artifacts that determine how the application processes information.

Admin UIs that let internal teams edit prompts, configure tool access, and manage retrieval settings are high-value targets. They should have strong authentication (SSO plus MFA at minimum), role-based access that separates prompt authors from prompt deployers, session recording for audit trails, and just-in-time access for production changes.

API endpoints that interact with prompt storage, even read-only endpoints, need authentication and rate limiting. The Lilli researchers reportedly found their initial foothold through API documentation that described logging and search query endpoints. These "boring" endpoints are exactly where agents and attackers probe first because they're often the least protected.

Defense in Depth for LLM Applications

Prompt integrity is a critical control, but it shouldn't be your only one. Even if your prompts are properly secured, you want additional enforcement layers that catch policy violations regardless of their source.

A compromised prompt might instruct the application to surface confidential data. An output evaluation layer that checks every response for sensitive information before it reaches the user catches this, whether the data surfaced because of a tampered prompt, a retrieval error, or a model hallucination. The enforcement doesn't need to know why the violation occurred. It just needs to prevent the violating content from reaching the user.
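A deliberately simplified sketch of such a layer (the three patterns are placeholders; a production evaluator would use classifiers and organization-specific detectors, not a handful of regexes):

```python
import re

# Illustrative patterns only — stand-ins for real sensitive-data detectors.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # US SSN shape
    re.compile(r"\bCONFIDENTIAL\b", re.IGNORECASE),  # document classification marker
    re.compile(r"api[_-]?key\s*[:=]\s*\S+", re.IGNORECASE),
]

def evaluate_output(response: str) -> tuple[bool, str]:
    """Check a model response before it reaches the user. The check does
    not care why the sensitive content appeared — tampered prompt,
    retrieval error, or hallucination — it just blocks it."""
    for pattern in SENSITIVE_PATTERNS:
        if pattern.search(response):
            return False, "Response withheld: policy violation detected."
    return True, response

ok, text = evaluate_output("The client's SSN is 123-45-6789.")
print(ok, text)  # False Response withheld: policy violation detected.
```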

The same principle applies across surfaces. Policy checks on code commits catch secrets that an AI coding assistant might introduce. Content evaluation on documents catches sensitive data before it's shared externally. Session-aware governance for AI agents catches multi-step violations that no single checkpoint would detect.
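As an illustration of the first of these, a commit-time secret scan can be a few lines (the patterns shown are a small, non-exhaustive sample; real scanners carry hundreds of detectors):

```python
import re

SECRET_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "AWS access key"),
    (re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"), "private key"),
    (re.compile(r"(password|secret)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
     "hardcoded credential"),
]

def scan_diff(diff_text: str) -> list:
    """Flag added lines that look like secrets — the kind of check that
    catches a credential an AI coding assistant pastes into a commit."""
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), 1):
        if not line.startswith("+"):
            continue  # only inspect additions, not context or removals
        for pattern, label in SECRET_PATTERNS:
            if pattern.search(line):
                findings.append((lineno, label))
    return findings

diff = '+password = "hunter2"\n-old_line\n+print("ok")'
print(scan_diff(diff))  # [(1, 'hardcoded credential')]
```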

This is the defense-in-depth model applied to LLM applications: secure the control plane (prompts, tools, retrieval), enforce policies on the data plane (inputs, outputs, actions), and generate audit trails that prove both layers are operating continuously.

What This Means for Your Organization

If you're running LLM applications in production, audit your prompt storage architecture this week. Ask three questions. Where are system prompts stored? Who has write access to that storage? And is there a mechanism to detect unauthorized changes?

If the answer to any of these is "I'm not sure," you have the same vulnerability class that the Lilli researchers exploited. The specific entry point (SQL injection, exposed API, misconfigured admin UI) varies. The impact is the same: whoever can write your prompts controls your application.

The organizations that will avoid these incidents aren't the ones with the most sophisticated AI models. They're the ones that treat every artifact influencing AI behavior (prompts, tool permissions, retrieval rules, policy configurations) with the same rigor they apply to production code. Version it. Review it. Deploy it through a pipeline. Monitor it in production. And enforce policies on the outputs regardless, because defense in depth means assuming any single layer can fail.

The McKinsey incident will be remembered not for the SQL injection. It will be remembered as the moment the industry realized that LLM application security isn't just about the model. It's about every artifact that tells the model what to do.


I'm building Aguardic — a policy-as-code platform that enforces organizational policies across AI outputs, code, documents, and agents. When prompts fail, policy enforcement catches what gets through. Happy to answer questions about defense in depth for LLM applications in the comments.
