Operational failure

Cost Runaway

AI systems consuming far more resources than expected through retries, loops, long context, or excessive tool calls.

What failed

Cost runaway occurs when an AI system consumes far more resources than expected. In LLM and agent systems, this can happen through excessive token usage, repeated retries, large context windows, unnecessary tool calls, inefficient retrieval, long-running agent loops, or cascading workflows.
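
To see why long-running loops and growing context compound each other, consider a rough sketch. It assumes a hypothetical agent that appends a fixed number of tokens per turn and resends the full history on every call, with no truncation, summarization, or prompt caching. The function name and the 500-token figure are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent when every turn resends all prior turns.

    Turn k sends k * tokens_per_turn tokens, so the total is the
    arithmetic series tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


# Assumed figure: each turn adds 500 tokens to the history.
for turns in (5, 20, 50):
    print(turns, cumulative_input_tokens(turns, 500))
```

Under these assumptions, input volume grows roughly with the square of the loop length, so an agent that runs ten times longer sends on the order of a hundred times more input tokens.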

Architecture context

Most common in autonomous agents, research agents, coding assistants, RAG systems, customer support bots, batch summarization pipelines, and any workflow that uses expensive models or tools.

Impact

Cost runaway can make an AI product economically unsustainable. It can also indicate reliability problems: the system may be confused, looping, retrieving irrelevant context, or retrying failed tools. In production, cost spikes can affect margins, budgets, user experience, and system availability.

Symptoms

  • Token usage rises sharply without better outcomes.
  • Agents call many tools for simple requests.
  • The system retries the same failing operation.
  • Context windows grow with irrelevant history.
  • Costs increase after prompt, model, or routing changes.
  • A small group of workflows drives disproportionate spend.

Detection signals

  • Cost per task.
  • Tokens per successful completion.
  • Tool calls per task.
  • Retry counts.
  • Loop length (steps per agent run).
  • Model routing frequency (share of requests sent to each model).
  • Cost spikes by workflow, tenant, user, or model version.
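
Several of the signals above can be derived from per-task usage logs. The following is a minimal sketch, assuming each task is logged as a record with the hypothetical fields shown (`workflow`, `tokens`, `tool_calls`, `retries`, `cost_usd`, `succeeded`). The record shape and sample values are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskRecord:
    # Hypothetical log schema; adapt the fields to your own telemetry.
    workflow: str
    tokens: int
    tool_calls: int
    retries: int
    cost_usd: float
    succeeded: bool


def summarize(records):
    """Aggregate per-workflow cost signals from a list of TaskRecord."""
    groups = defaultdict(list)
    for record in records:
        groups[record.workflow].append(record)

    total_cost = sum(r.cost_usd for r in records)
    summary = {}
    for workflow, rs in groups.items():
        n = len(rs)
        successes = sum(1 for r in rs if r.succeeded)
        cost = sum(r.cost_usd for r in rs)
        tokens = sum(r.tokens for r in rs)
        summary[workflow] = {
            "cost_per_task": cost / n,
            # Tokens spent on failed tasks still count against successes.
            "tokens_per_success": tokens / successes if successes else float("inf"),
            "tool_calls_per_task": sum(r.tool_calls for r in rs) / n,
            "retries_per_task": sum(r.retries for r in rs) / n,
            "share_of_spend": cost / total_cost if total_cost else 0.0,
        }
    return summary


# Invented sample data: a cheap FAQ workflow and an expensive research one.
records = [
    TaskRecord("faq", 1_000, 1, 0, 0.01, True),
    TaskRecord("faq", 1_200, 1, 0, 0.01, True),
    TaskRecord("research", 90_000, 25, 6, 1.50, False),
    TaskRecord("research", 60_000, 18, 3, 0.98, True),
]
for workflow, stats in summarize(records).items():
    print(workflow, stats)
```

Grouping by tenant, user, or model version instead of workflow only requires changing the grouping key. Alert thresholds are left out because they depend on each product's baseline.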

Mitigations

  • Set token and tool-call budgets.
  • Add loop and retry limits.
  • Route simple tasks to cheaper models.
  • Compress or summarize context.
  • Deduplicate retrieved content.
  • Stop work when confidence is sufficient.
  • Require approval for expensive workflows.
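
The first two mitigations can be enforced in code with a per-task budget object and a bounded retry helper. This is a minimal sketch under stated assumptions: `TaskBudget`, `BudgetExceeded`, `run_agent`, and `call_with_retries` are hypothetical names, the limits are placeholder values, and `step` stands in for whatever function performs one agent iteration and reports its own usage.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a task goes past one of its limits."""


@dataclass
class TaskBudget:
    # Placeholder limits; tune them per workflow.
    max_tokens: int = 20_000
    max_tool_calls: int = 10
    max_steps: int = 8
    tokens_used: int = 0
    tool_calls_used: int = 0
    steps_used: int = 0

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record one agent step and raise if any limit is now exceeded."""
        self.steps_used += 1
        self.tokens_used += tokens
        self.tool_calls_used += tool_calls
        if self.steps_used > self.max_steps:
            raise BudgetExceeded("step limit exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        if self.tool_calls_used > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exceeded")


def run_agent(step, budget: TaskBudget) -> str:
    """Call step() until it reports completion or the budget stops it.

    step() is assumed to return (done, tokens_used, tool_calls_made).
    """
    while True:
        done, tokens, tool_calls = step()
        budget.charge(tokens=tokens, tool_calls=tool_calls)
        if done:
            return "done"


def call_with_retries(operation, max_attempts: int = 3):
    """Retry a failing operation a bounded number of times, then give up."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# A step that never finishes: the budget, not the agent, ends the loop.
budget = TaskBudget(max_tokens=5_000)
try:
    run_agent(lambda: (False, 1_500, 1), budget)
except BudgetExceeded as exc:
    print("stopped:", exc, "after", budget.steps_used, "steps")
```

Raising an exception rather than silently truncating makes the stop visible to the caller, which can then return a partial result, escalate for approval, or route the task to a cheaper model.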
