A chatbot that gives a wrong answer wastes a minute. An agent that issues a wrong refund, sends a wrong email, or updates the wrong record has real-world consequences. Designing agents is mostly about designing how they fail.
Make actions reversible by default
The single biggest safety lever is reversibility. Prefer drafts over sends, holds over charges, and staged changes over direct writes. An action you can undo is an action a mistake cannot make catastrophic.
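
One way to picture "staged changes over direct writes" is a small staging area where every proposed action carries its own undo. The sketch below is illustrative only: `StagedAction`, `Stage`, and the in-memory `records` dictionary are hypothetical names invented for this example, not part of any agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StagedAction:
    """A pending change paired with the function that reverses it."""
    description: str
    apply: Callable[[], None]
    undo: Callable[[], None]


@dataclass
class Stage:
    """Collects actions as drafts; nothing runs until commit()."""
    pending: List[StagedAction] = field(default_factory=list)
    applied: List[StagedAction] = field(default_factory=list)

    def propose(self, action: StagedAction) -> None:
        self.pending.append(action)

    def commit(self) -> None:
        # Apply in order, remembering what ran so it can be undone.
        while self.pending:
            action = self.pending.pop(0)
            action.apply()
            self.applied.append(action)

    def rollback(self) -> None:
        # Undo in reverse order of application.
        while self.applied:
            self.applied.pop().undo()


# Hypothetical in-memory store standing in for a real system of record.
records = {"order-1": "paid"}

stage = Stage()
stage.propose(StagedAction(
    description="mark order-1 refunded",
    apply=lambda: records.update({"order-1": "refunded"}),
    undo=lambda: records.update({"order-1": "paid"}),
))
before_commit = records["order-1"]   # proposing alone changes nothing
stage.commit()
after_commit = records["order-1"]
stage.rollback()
after_rollback = records["order-1"]
```

Here the record stays "paid" until `commit()` runs, and `rollback()` restores it afterwards. Real systems rarely offer such a clean undo (a sent email cannot be unsent), which is the argument for keeping those actions as drafts or holds for as long as possible.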

Layers of containment
- Scoped tools: the agent can only call the actions it genuinely needs.
- Typed arguments: validate every tool call before it executes.
- Human checkpoints: high-impact actions wait for an approval.
- Budgets: cap steps, spend, and blast radius per run.
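
The four layers above can be combined in a single gate that every tool call passes through. This is a minimal sketch under stated assumptions: `Tool`, `Runner`, `Rejected`, and the two example tools are hypothetical, the `approve` callback stands in for a real human review step, and the only budget shown is a step cap.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class Rejected(Exception):
    """Raised when a tool call is blocked before it executes."""


@dataclass
class Tool:
    func: Callable[..., str]
    arg_types: Dict[str, type]      # typed arguments
    needs_approval: bool = False    # human checkpoint


class Runner:
    def __init__(self, tools: Dict[str, Tool], max_steps: int,
                 approve: Callable[[str, dict], bool]):
        self.tools = tools          # scoped: only these tools are callable
        self.max_steps = max_steps  # budget: cap on attempted calls per run
        self.approve = approve
        self.steps = 0

    def call(self, name: str, **args) -> str:
        if self.steps >= self.max_steps:
            raise Rejected("step budget exhausted")
        self.steps += 1  # rejected attempts also count against the budget
        if name not in self.tools:
            raise Rejected(f"tool {name!r} is out of scope")
        tool = self.tools[name]
        if set(args) != set(tool.arg_types):
            raise Rejected("unexpected or missing arguments")
        for key, expected in tool.arg_types.items():
            if not isinstance(args[key], expected):
                raise Rejected(f"{key} must be {expected.__name__}")
        if tool.needs_approval and not self.approve(name, args):
            raise Rejected("approval denied")
        return tool.func(**args)


# Hypothetical tools for illustration.
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: paid"


def issue_refund(order_id: str, amount_cents: int) -> str:
    return f"refunded {amount_cents} cents on {order_id}"


runner = Runner(
    tools={
        "lookup_order": Tool(lookup_order, {"order_id": str}),
        "issue_refund": Tool(issue_refund,
                             {"order_id": str, "amount_cents": int},
                             needs_approval=True),
    },
    max_steps=3,
    # Stand-in for a human reviewer: approve refunds up to 5000 cents.
    approve=lambda name, args: args.get("amount_cents", 0) <= 5000,
)


def attempt(name: str, **args) -> str:
    try:
        return runner.call(name, **args)
    except Rejected as err:
        return f"rejected: {err}"


results = [
    attempt("lookup_order", order_id="A1"),
    attempt("issue_refund", order_id="A1", amount_cents="lots"),
    attempt("issue_refund", order_id="A1", amount_cents=2500),
    attempt("lookup_order", order_id="A1"),
]
```

In this run the first call succeeds, the second is rejected on its argument type, the third passes the approval stub, and the fourth hits the step budget. Counting rejected attempts against the budget is a design choice made here so that an agent stuck in a retry loop runs out of steps instead of retrying forever.
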
Observe everything
Every plan, tool call, and result should be logged and replayable. When something goes wrong, you want a trace you can read, not a black box you can only guess at. Observability is what turns a scary incident into a five-minute fix.
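
As a rough illustration of "logged and replayable", the sketch below records each plan, tool call, and result as one JSON line, then re-runs the logged calls from that text alone. The `Trace` class, the event field names, and the `add` tool are assumptions made for this example, not a standard format.

```python
import json
from typing import Callable, Dict, List


class Trace:
    """Append-only record of every plan, tool call, and result in a run."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def record(self, kind: str, **data) -> None:
        self.events.append({"seq": len(self.events), "kind": kind, **data})

    def dump(self) -> str:
        # One JSON object per line, so the trace is easy to grep and diff.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


def run(plan: List[dict], tools: Dict[str, Callable], trace: Trace) -> None:
    """Execute each planned step, logging the call before and the result after."""
    trace.record("plan", steps=plan)
    for step in plan:
        trace.record("tool_call", tool=step["tool"], args=step["args"])
        result = tools[step["tool"]](**step["args"])
        trace.record("result", value=result)


def replay(dumped: str, tools: Dict[str, Callable]) -> List[object]:
    """Re-run the logged tool calls and return the fresh results."""
    events = [json.loads(line) for line in dumped.splitlines()]
    calls = [e for e in events if e["kind"] == "tool_call"]
    return [tools[c["tool"]](**c["args"]) for c in calls]


# Hypothetical, side-effect-free tool for illustration.
tools = {"add": lambda a, b: a + b}

trace = Trace()
run([{"tool": "add", "args": {"a": 2, "b": 3}}], tools, trace)
logged = [e["value"] for e in trace.events if e["kind"] == "result"]
replayed = replay(trace.dump(), tools)
```

Comparing `logged` with `replayed` shows whether a run reproduces. Replaying against live tools would repeat their side effects, so in practice a replay would more likely run against stubs or a sandbox.
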
A safe agent is not one that never makes a mistake. It is one whose mistakes are cheap, visible, and reversible.
Choose the model for the job
For long-horizon, high-stakes work, a model that stops and asks when it is unsure beats a faster one that confidently does the wrong thing. Reliability outranks latency when actions have consequences.
A pre-production checklist
- Can every action the agent can take be undone or held for review?
- Is each tool scoped, typed, and validated before execution?
- Are high-impact steps gated behind a human approval?
- Can you replay any run from its logged plan and tool calls?
If you can answer yes to all four, you have an agent whose worst day is an inconvenience, not an incident. That is the only kind worth putting in front of customers.



