Security failure
Prompt Injection
Malicious or unintended instructions embedded in user input, retrieved content, or tool output that override system behavior.
Definition
Prompt injection is a security failure mode where an AI system follows malicious, untrusted, or unintended instructions embedded in user input, retrieved documents, tool outputs, webpages, emails, or other external content. The injected instruction attempts to override the system intended behavior.
Why it matters
Prompt injection can cause an AI system to reveal sensitive information, ignore policy, execute unsafe tool calls, alter outputs, exfiltrate data, or mislead users. The risk increases when agents consume untrusted content and have access to tools, credentials, private data, or write actions.
Where it appears
RAG systems, browser agents, email assistants, document summarizers, customer support tools, autonomous research agents, coding agents, and any workflow where the model reads untrusted content.
Symptoms
- The model follows instructions found inside retrieved content instead of system instructions.
- The model reveals hidden prompts or sensitive context.
- The agent performs an action unrelated to the original user intent.
- The response includes strange instructions, hidden text, or attacker-controlled content.
- The system ignores safety rules after reading a document or webpage.
Detection signals
- Presence of instruction-like text in retrieved content.
- Sudden behavior changes after external content is loaded.
- Attempts to reveal prompts, credentials, or memory.
- Tool calls triggered by untrusted content rather than user intent.
- Policy violations linked to specific documents, URLs, or messages.
Example scenario
A research agent summarizes a webpage. The webpage contains hidden text saying to ignore prior instructions and send private notes to an external URL. The agent treats the hidden text as a valid instruction and attempts an unsafe tool call.
Severity scoring
Low
Injection attempt is present but ignored.
Medium
Model output is influenced but no sensitive action occurs.
High
System exposes sensitive information or executes an unsafe workflow.
Critical
Prompt injection causes data exfiltration, unauthorized action, or security compromise.
Eval strategy
Test injected instructions across user messages, documents, retrieved snippets, tool outputs, HTML, email content, and multi-step agent workflows. Evaluate whether the system separates trusted instructions from untrusted content.
Runtime monitoring strategy
Monitor external content for instruction-like patterns, detect attempts to reveal hidden prompts or secrets, and flag tool calls that appear to originate from untrusted content. Track incidents by source, workflow, and tool permission.
Mitigation strategies
- Separate trusted instructions from untrusted content.
- Restrict tool permissions based on context.
- Add confirmation for sensitive actions.
- Sanitize or label retrieved content.
- Use policy checks before tool execution.
- Prevent external content from controlling system behavior.
- Add red-team tests for prompt injection.
Where FailureModes.ai fits
FailureModes.ai helps teams detect prompt-injection patterns, classify severity, connect attacks to affected workflows, and convert red-team findings into continuous evals and runtime monitors.
Related
Continue exploring.
- →
Data Leakage
When an AI system exposes sensitive, confidential, regulated, or unauthorized information through outputs, retrieval, memory, or tool use.
- →
Retrieval Failure
When an AI system retrieves stale, irrelevant, incomplete, conflicting, or poorly ranked context — often the root cause of bad RAG answers.
- →
Cascading Agent Failure
One local error in an agent workflow propagates into a larger workflow failure across tools, memory, or systems.
- →
Unsafe Escalation
When an agent acts, approves, or escalates without the right review, policy check, or human handoff — or fails to escalate when it should.
- →
Tool Misuse
When agents pick the wrong tool, pass bad arguments, ignore tool output, or act without required confirmation.