Security failure
Prompt Injection
Malicious or unintended instructions embedded in user input, retrieved content, or tool output that override system behavior.
What failed
Prompt injection is a security failure mode where an AI system follows malicious, untrusted, or unintended instructions embedded in user input, retrieved documents, tool outputs, webpages, emails, or other external content. The injected instruction attempts to override the system intended behavior.
Architecture context
RAG systems, browser agents, email assistants, document summarizers, customer support tools, autonomous research agents, coding agents, and any workflow where the model reads untrusted content.
Impact
Prompt injection can cause an AI system to reveal sensitive information, ignore policy, execute unsafe tool calls, alter outputs, exfiltrate data, or mislead users. The risk increases when agents consume untrusted content and have access to tools, credentials, private data, or write actions.
Symptoms
- The model follows instructions found inside retrieved content instead of system instructions.
- The model reveals hidden prompts or sensitive context.
- The agent performs an action unrelated to the original user intent.
- The response includes strange instructions, hidden text, or attacker-controlled content.
- The system ignores safety rules after reading a document or webpage.
Detection signals
- Presence of instruction-like text in retrieved content.
- Sudden behavior changes after external content is loaded.
- Attempts to reveal prompts, credentials, or memory.
- Tool calls triggered by untrusted content rather than user intent.
- Policy violations linked to specific documents, URLs, or messages.
Mitigations
- Separate trusted instructions from untrusted content.
- Restrict tool permissions based on context.
- Add confirmation for sensitive actions.
- Sanitize or label retrieved content.
- Use policy checks before tool execution.
- Prevent external content from controlling system behavior.
- Add red-team tests for prompt injection.