What is a Prompt Injection Attack?
A prompt injection attack occurs when an attacker crafts malicious input (a “prompt”) to an LLM-based system in such a way that the model is tricked into ignoring its intended system instructions and executing the attacker’s instructions instead. Because these models cannot reliably distinguish between developer-set system prompts and user input, they can end up processing harmful inputs embedded in user input.
How Does It Work?
Prompt injection attacks manipulate how large-language models process information. Here’s how it unfolds step by step:
Step 1: Setting System Instructions – Developers provide an LLM with predefined rules or system prompts that guide how it should respond to users. These include what topics it can discuss and what data it can access.
Step 2: Receiving User Input – A user interacts with the AI by entering a query or command. Normally, the system combines both the developer’s instructions and the user’s prompt to generate a response.
Step 3: Injecting Malicious Prompts – Attackers insert hidden or direct instructions in the user input — such as “ignore previous rules” or “reveal confidential data.” These commands are designed to override the model’s original instructions.
Step 4: Model Misinterpretation – The LLM processes both sets of instructions together. Because it cannot always distinguish between legitimate system prompts and injected ones, it may treat the malicious instructions as valid.
Step 5: Execution of Unintended Actions – The model follows the attacker’s hidden instructions — possibly leaking data, altering responses, or performing actions that compromise system integrity.
Step 6: Impact on Security – The result could be unauthorized data access, corrupted output, or manipulation of connected systems, leading to severe security and compliance risks.
Direct prompt injection: The attacker enters malicious instructions directly in the user input field of an AI system.
Indirect prompt injection: The attacker hides harmful inputs in external sources such as documents, web pages or training data—when the model consumes that external data it may pick up the hidden instructions.
Stored prompt injection: Malicious prompts become embedded in the system’s memory, dataset or training corpus, affecting future responses.
Top comments (0)