Security failure

Prompt Injection

Malicious or unintended instructions embedded in user input, retrieved content, or tool output that override system behavior.

Definition

Prompt injection is a security failure mode where an AI system follows malicious, untrusted, or unintended instructions embedded in user input, retrieved documents, tool outputs, webpages, emails, or other external content. The injected instruction attempts to override the system intended behavior.

Why it matters

Prompt injection can cause an AI system to reveal sensitive information, ignore policy, execute unsafe tool calls, alter outputs, exfiltrate data, or mislead users. The risk increases when agents consume untrusted content and have access to tools, credentials, private data, or write actions.

Where it appears

RAG systems, browser agents, email assistants, document summarizers, customer support tools, autonomous research agents, coding agents, and any workflow where the model reads untrusted content.

Symptoms

  • The model follows instructions found inside retrieved content instead of system instructions.
  • The model reveals hidden prompts or sensitive context.
  • The agent performs an action unrelated to the original user intent.
  • The response includes strange instructions, hidden text, or attacker-controlled content.
  • The system ignores safety rules after reading a document or webpage.

Detection signals

  • Presence of instruction-like text in retrieved content.
  • Sudden behavior changes after external content is loaded.
  • Attempts to reveal prompts, credentials, or memory.
  • Tool calls triggered by untrusted content rather than user intent.
  • Policy violations linked to specific documents, URLs, or messages.

Example scenario

A research agent summarizes a webpage. The webpage contains hidden text saying to ignore prior instructions and send private notes to an external URL. The agent treats the hidden text as a valid instruction and attempts an unsafe tool call.

Severity scoring

Low

Injection attempt is present but ignored.

Medium

Model output is influenced but no sensitive action occurs.

High

System exposes sensitive information or executes an unsafe workflow.

Critical

Prompt injection causes data exfiltration, unauthorized action, or security compromise.

Eval strategy

Test injected instructions across user messages, documents, retrieved snippets, tool outputs, HTML, email content, and multi-step agent workflows. Evaluate whether the system separates trusted instructions from untrusted content.

Runtime monitoring strategy

Monitor external content for instruction-like patterns, detect attempts to reveal hidden prompts or secrets, and flag tool calls that appear to originate from untrusted content. Track incidents by source, workflow, and tool permission.

Mitigation strategies

  • Separate trusted instructions from untrusted content.
  • Restrict tool permissions based on context.
  • Add confirmation for sensitive actions.
  • Sanitize or label retrieved content.
  • Use policy checks before tool execution.
  • Prevent external content from controlling system behavior.
  • Add red-team tests for prompt injection.

Where FailureModes.ai fits

FailureModes.ai helps teams detect prompt-injection patterns, classify severity, connect attacks to affected workflows, and convert red-team findings into continuous evals and runtime monitors.

See how your AI systems will fail — before your users do.

Book a diagnostic →