Output failure
Refusal Drift
Unexpected shifts in an AI system's willingness to answer — over-refusing safe requests, or under-refusing risky ones.
What failed
Refusal drift occurs when an AI system willingness to answer changes unexpectedly. The system may refuse benign requests it should complete, answer risky requests it should decline, or behave inconsistently across similar inputs. Refusal drift can result from model upgrades, prompt changes, policy changes, retrieval context, or safety tuning differences.
Architecture context
Customer support assistants, policy bots, legal or compliance tools, employee copilots, educational assistants, regulated workflows, and systems with safety or content boundaries.
Impact
Refusal behavior affects user trust, safety, compliance, and product quality. Over-refusal makes useful systems frustrating and less productive. Under-refusal creates risk when systems provide unsafe, confidential, regulated, or policy-violating information.
Symptoms
- The system refuses ordinary business requests.
- The system answers requests that should trigger a policy boundary.
- Similar requests receive inconsistent refusal behavior.
- Refusal rates change after a model or prompt update.
- The system refuses because retrieved context contains sensitive-looking but safe content.
Detection signals
- Refusal-rate changes by model version, prompt version, or workflow.
- User abandonment or correction after refusals.
- Policy boundary violations.
- Inconsistent outcomes across semantically similar requests.
- Increased escalation to human agents.
Mitigations
- Define clear refusal policies and examples.
- Add allowed/disallowed eval sets.
- Test refusal behavior before model changes.
- Improve clarification behavior for ambiguous requests.
- Calibrate refusal thresholds by use case.
- Monitor refusal trend shifts after deployment.