Operational failure
Infinite Loop
When an agent repeats reasoning, tool calls, or retries without making meaningful progress.
Definition
An infinite loop occurs when an AI agent repeats reasoning steps, tool calls, retrieval attempts, or retries without making meaningful progress. Some loops are literal and never terminate on their own; others are cut off by system limits, such as a maximum step count, but still waste time, money, and user trust.
Why it matters
Loops can create cost runaway, latency spikes, tool-rate-limit problems, bad user experiences, and downstream workflow delays. A loop also indicates that the agent may lack proper stopping conditions, uncertainty handling, or recovery behavior.
Where it appears
Research agents, coding agents, browsing agents, IT automation, customer support agents, retrieval workflows, and systems that automatically retry failed tool calls.
Symptoms
- Repeated calls to the same tool with similar arguments.
- Repeated retrieval of similar documents.
- The agent keeps revising without improving.
- The agent retries after the same error.
- Long traces with no state progress.
- The user waits while the agent continues unnecessary work.
- Repeated action patterns, such as cycling through the same sequence of steps.
Detection signals
- High retry counts.
- No meaningful state transition.
- Similar tool inputs across steps.
- Cost or latency spikes.
- Agent reaches maximum step limit.
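As one illustration of the "similar tool inputs across steps" signal, the following is a minimal sketch that counts near-duplicate tool calls in a trace. The trace shape (dicts with `tool` and `args` keys), the function name `count_near_duplicates`, and the 0.9 similarity threshold are all assumptions made for this example, not part of any particular agent framework.

```python
import json
from difflib import SequenceMatcher


def canonical(call):
    """Serialize a tool call to a stable string (keys sorted)."""
    return call["tool"] + ":" + json.dumps(call["args"], sort_keys=True)


def count_near_duplicates(calls, threshold=0.9):
    """Count calls that closely match any earlier call in the trace."""
    seen = []
    duplicates = 0
    for call in calls:
        key = canonical(call)
        # A call is a near-duplicate if its text is almost identical
        # to any call already seen in this trace.
        if any(SequenceMatcher(None, key, prev).ratio() >= threshold for prev in seen):
            duplicates += 1
        seen.append(key)
    return duplicates


# Hypothetical trace: the agent edits a file and reruns the same test repeatedly.
trace = [
    {"tool": "run_tests", "args": {"path": "tests/test_api.py"}},
    {"tool": "edit_file", "args": {"path": "api.py", "line": 10}},
    {"tool": "run_tests", "args": {"path": "tests/test_api.py"}},
    {"tool": "edit_file", "args": {"path": "api.py", "line": 11}},
    {"tool": "run_tests", "args": {"path": "tests/test_api.py"}},
]

print(count_near_duplicates(trace))
```

A high count relative to trace length is only a hint. Legitimate iteration can also repeat calls, so this signal is best combined with a check for state progress.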
Example scenario
A coding agent encounters a test failure. It repeatedly makes small code changes, reruns the same test, receives the same error, and continues without diagnosing the underlying dependency mismatch.
Severity scoring
- Low: Short loop stopped automatically.
- Medium: Loop increases latency or cost.
- High: Loop blocks workflow or consumes significant resources.
- Critical: Loop triggers external actions, rate limits, outages, or material cost exposure.
Eval strategy
Test ambiguous, impossible, or tool-error scenarios. Evaluate whether the agent stops, asks for clarification, changes strategy, or escalates instead of repeating the same behavior.
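A tool-error scenario of this kind might be sketched as follows. Everything here is hypothetical: `AlwaysFailingTool` stands in for a real tool, the two toy agents stand in for a real agent under test, and the outcome labels and call budget are illustrative choices.

```python
# Outcomes that count as acceptable behavior in an impossible scenario.
ACCEPTABLE = {"stopped", "asked_clarification", "changed_strategy", "escalated"}


class AlwaysFailingTool:
    """Hypothetical tool stub that returns the same error on every call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, query):
        self.calls += 1
        return {"ok": False, "error": "404 not found"}


def naive_agent(tool, max_steps=20):
    """Toy agent that retries the same call until it hits the step limit."""
    for _ in range(max_steps):
        if tool("find invoice 123")["ok"]:
            return "stopped"
    return "hit_step_limit"


def careful_agent(tool, max_attempts=5):
    """Toy agent that escalates as soon as the same error repeats."""
    last_error = None
    for _ in range(max_attempts):
        result = tool("find invoice 123")
        if result["ok"]:
            return "stopped"
        if result["error"] == last_error:
            return "escalated"
        last_error = result["error"]
    return "escalated"


def evaluate(agent, call_budget=5):
    """Run the agent against the failing tool and score its behavior."""
    tool = AlwaysFailingTool()
    outcome = agent(tool)
    passed = outcome in ACCEPTABLE and tool.calls <= call_budget
    return {"outcome": outcome, "tool_calls": tool.calls, "passed": passed}


for agent in (naive_agent, careful_agent):
    print(agent.__name__, evaluate(agent))
```

The check scores both the final outcome and the number of tool calls, so an agent that eventually escalates but only after exhausting its budget still fails.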
Runtime monitoring strategy
Monitor repeated tool calls, retries, state progress, trace length, cost, and latency. Alert when workflows exceed expected step counts or repeat patterns without new evidence.
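One way to approximate "state progress" at runtime is to hash a snapshot of the workflow state after each step and alert when it stops changing. The sketch below assumes the state can be serialized to JSON; the class name `LoopMonitor`, the alert labels, and the default thresholds are hypothetical.

```python
import hashlib
import json


class LoopMonitor:
    """Tracks step count and consecutive steps with an unchanged state."""

    def __init__(self, max_steps=15, max_stalled_steps=3):
        self.max_steps = max_steps
        self.max_stalled_steps = max_stalled_steps
        self.steps = 0
        self.stalled = 0
        self.last_digest = None

    def record(self, state):
        """Record one step's state snapshot and return a list of alerts."""
        self.steps += 1
        payload = json.dumps(state, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        # An identical digest means nothing observable changed this step.
        if digest == self.last_digest:
            self.stalled += 1
        else:
            self.stalled = 0
        self.last_digest = digest

        alerts = []
        if self.steps > self.max_steps:
            alerts.append("step_limit_exceeded")
        if self.stalled >= self.max_stalled_steps:
            alerts.append("no_state_progress")
        return alerts


# Hypothetical usage: the set of failing tests never changes across steps.
monitor = LoopMonitor(max_steps=10, max_stalled_steps=2)
for _ in range(3):
    alerts = monitor.record({"failing_tests": ["test_login"]})
print(alerts)
```

What belongs in the snapshot is a design choice. It should include the evidence the agent is expected to change (failing tests, retrieved document IDs, open subtasks) and exclude values that change on every step regardless, such as timestamps.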
Mitigation strategies
- Add maximum step and retry limits.
- Detect repeated actions.
- Require strategy change after repeated failure.
- Add explicit stop conditions.
- Escalate when progress stalls.
- Budget tokens, tool calls, and elapsed time.
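Several of these mitigations can be combined in a small guard around the agent loop. The following is a minimal sketch, not a definitive implementation: `Budget`, `BudgetExceeded`, and `run_with_fallbacks` are hypothetical names, the default limits are placeholders, and each strategy is assumed to be a callable that returns `True` on success.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when the run exceeds one of its limits."""


@dataclass
class Budget:
    # All limits are illustrative defaults, not recommendations.
    max_tool_calls: int = 20
    max_tokens: int = 50_000
    max_seconds: float = 120.0
    tool_calls: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0) -> None:
        """Record one tool call and fail fast if any limit is exceeded."""
        self.tool_calls += 1
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("time budget exhausted")


def run_with_fallbacks(strategies, budget, max_retries=2):
    """Try each strategy up to max_retries times, then switch to the next.

    Returns "done" on the first success and "escalate" when every
    strategy has failed or the budget runs out.
    """
    try:
        for strategy in strategies:
            for _ in range(max_retries):
                budget.charge()
                if strategy():
                    return "done"  # explicit stop condition
    except BudgetExceeded:
        return "escalate"
    return "escalate"


# Hypothetical strategies for the coding-agent scenario.
def patch_code():
    return False  # small code edits keep failing


def pin_dependency():
    return True  # a different approach fixes the root cause


print(run_with_fallbacks([patch_code, pin_dependency], Budget()))
```

Capping retries per strategy forces a change of approach after repeated failure, while the shared budget bounds total cost even if every strategy fails.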
Where FailureModes.ai fits
FailureModes.ai helps teams detect looping behavior in agent traces, distinguish normal iteration from failure, and add monitors that prevent repeated actions from becoming production incidents.
Related
Continue exploring.
- Cost Runaway: AI systems consuming far more resources than expected through retries, loops, long context, or excessive tool calls.
- Tool Misuse: When agents pick the wrong tool, pass bad arguments, ignore tool output, or act without required confirmation.
- Planning Failure: When an AI agent decomposes a task incorrectly, picks a wrong strategy, skips required steps, or fails to adapt to new information.
- Cascading Agent Failure: One local error in an agent workflow propagates into a larger workflow failure across tools, memory, or systems.
- Retrieval Failure: When an AI system retrieves stale, irrelevant, incomplete, conflicting, or poorly ranked context, often the root cause of bad RAG answers.