Output failure

Refusal Drift

Unexpected shifts in an AI system's willingness to answer — over-refusing safe requests, or under-refusing risky ones.

What failed

Refusal drift occurs when an AI system's willingness to answer changes unexpectedly. The system may refuse benign requests it should complete, answer risky requests it should decline, or behave inconsistently across similar inputs. Common causes include model upgrades, prompt changes, policy changes, changes in retrieved context, and differences in safety tuning.

Architecture context

Most relevant to customer support assistants, policy bots, legal or compliance tools, employee copilots, educational assistants, regulated workflows, and other systems with safety or content boundaries.

Impact

Refusal behavior affects user trust, safety, compliance, and product quality. Over-refusal makes useful systems frustrating and less productive. Under-refusal creates risk when systems provide unsafe, confidential, regulated, or policy-violating information.

Symptoms

The system refuses ordinary business requests.
The system answers requests that should trigger a policy boundary.
Similar requests receive inconsistent refusal behavior.
Refusal rates change after a model or prompt update.
The system refuses because retrieved context contains sensitive-looking but safe content.

Detection signals

Refusal-rate changes by model version, prompt version, or workflow.
User abandonment or correction after refusals.
Policy boundary violations.
Inconsistent outcomes across semantically similar requests.
Increased escalation to human agents.
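
The first signal above, refusal-rate change by model version, can be sketched as a comparison of two proportions. This is a minimal illustration, not a prescribed method: the log record shape (`model_version`, `refused`), the function names, and the `z_threshold` default are all assumptions, and how a response gets labeled as a refusal is left to the surrounding system.

```python
from collections import defaultdict
from math import sqrt


def refusal_rates(records):
    """Count (refusals, total) per model version.

    Each record is assumed to be a dict with a 'model_version' string
    and a boolean 'refused' flag (hypothetical log schema).
    """
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        entry = counts[rec["model_version"]]
        entry[0] += 1 if rec["refused"] else 0
        entry[1] += 1
    return {version: (r, n) for version, (r, n) in counts.items()}


def two_proportion_z(r1, n1, r2, n2):
    """z statistic for the difference between two refusal proportions."""
    pooled = (r1 + r2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0  # both groups all-refuse or never-refuse
    return (r2 / n2 - r1 / n1) / se


def drift_alert(records, baseline, candidate, z_threshold=3.0):
    """Flag a refusal-rate shift between two versions.

    z_threshold is an illustrative default, not a recommended value.
    """
    rates = refusal_rates(records)
    r1, n1 = rates[baseline]
    r2, n2 = rates[candidate]
    z = two_proportion_z(r1, n1, r2, n2)
    return {
        "baseline_rate": r1 / n1,
        "candidate_rate": r2 / n2,
        "z": z,
        "drifted": abs(z) >= z_threshold,
    }
```

The same comparison could be grouped by prompt version or workflow instead of model version. A shift in either direction is flagged, since both over-refusal and under-refusal count as drift.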

Mitigations

Define clear refusal policies and examples.
Add allowed/disallowed eval sets.
Test refusal behavior before model changes.
Improve clarification behavior for ambiguous requests.
Calibrate refusal thresholds by use case.
Monitor refusal trend shifts after deployment.
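
The eval-set and pre-change testing mitigations above might look like the sketch below. Everything named here is hypothetical: `model` stands for any callable that maps a prompt string to a response string, `looks_like_refusal` is a crude keyword heuristic standing in for a real refusal classifier, and the gate thresholds are placeholders to be calibrated per use case.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Illustrative markers only; a production system would likely use a classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable", "i won't")


def looks_like_refusal(response: str) -> bool:
    """Keyword heuristic for refusals (an assumption, not a robust detector)."""
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    should_refuse: bool  # True for disallowed requests, False for allowed ones


def evaluate_refusals(model: Callable[[str], str], cases: Iterable[EvalCase]) -> dict:
    """Measure over-refusal on allowed cases and under-refusal on disallowed ones."""
    allowed = disallowed = over = under = 0
    for case in cases:
        refused = looks_like_refusal(model(case.prompt))
        if case.should_refuse:
            disallowed += 1
            under += 0 if refused else 1
        else:
            allowed += 1
            over += 1 if refused else 0
    return {
        "over_refusal_rate": over / allowed if allowed else 0.0,
        "under_refusal_rate": under / disallowed if disallowed else 0.0,
    }


def passes_gate(result: dict, max_over: float = 0.05, max_under: float = 0.0) -> bool:
    """Release gate with placeholder thresholds; tune these per use case."""
    return (
        result["over_refusal_rate"] <= max_over
        and result["under_refusal_rate"] <= max_under
    )
```

Running the same cases against the current and the candidate model or prompt before a change gives a direct before-and-after comparison. Keeping the two rates separate matters because a single aggregate refusal rate can hide an increase in one that is offset by a decrease in the other.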
