Guardrails & hallucination control
What stops the agent from doing or saying the wrong thing.
The defence-in-depth picture
No single layer is enough. Multiple weak filters compose into a strong one.
What each layer does
Input guardrails
- Prompt injection detection — catches inputs trying to override the system prompt (“ignore previous instructions…”)
- PII redaction — strips or masks card numbers, IDs, etc. before they reach logs, and before they reach the model when it doesn’t need them
- Rate limiting — caps requests per user / per minute
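The three input checks above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the phrase list, the card-number regex, and the `RateLimiter` class are all hypothetical, and a production detector would normally pair simple patterns like these with a trained classifier.

```python
import re
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

# Hypothetical phrase patterns; real injection attempts are far more varied.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?system\s+prompt", re.IGNORECASE),
]

# Assumption: 13 to 16 digits, optionally separated by spaces or hyphens,
# is treated as a card number. This will also catch other long digit runs.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def looks_like_injection(text: str) -> bool:
    """Return True if any known override phrase appears in the input."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def redact_card_numbers(text: str) -> str:
    """Mask card-number-like digit runs before logging or prompting."""
    return CARD_PATTERN.sub("[REDACTED_CARD]", text)


class RateLimiter:
    """Sliding window: at most `limit` requests per `window_s` seconds per user."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Each check is deliberately weak on its own, which is the point of layering them: a request has to pass all three before it reaches the model.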
Output guardrails
- PII / profanity filter — strips sensitive data and inappropriate language before the response leaves
- Policy compliance — checks the response against customer-specific policies (no quoting prices outside approval, no promising delivery dates without checking, etc.)
- Confidence threshold — if model confidence < X, escalate instead of acting
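A rough sketch of how the policy check and the confidence threshold might combine into one routing decision. The rule names, the regexes, and the 0.8 threshold are illustrative assumptions. How the confidence score is produced is not specified here, so the sketch simply assumes a number between 0 and 1 arrives with the draft.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Draft:
    text: str
    confidence: float  # assumed to be a score in [0, 1] supplied upstream


# Hypothetical policy rules keyed by a short violation name.
POLICY_RULES = {
    "unapproved_price": re.compile(r"[$€£]\s?\d"),
    "delivery_promise": re.compile(
        r"\b(will|guaranteed to) (arrive|be delivered)\b", re.IGNORECASE
    ),
}


def check_output(draft: Draft, threshold: float = 0.8) -> Tuple[str, List[str]]:
    """Return ("block" | "escalate" | "send", list of violated rule names)."""
    violations = [name for name, rule in POLICY_RULES.items() if rule.search(draft.text)]
    if violations:
        return ("block", violations)
    if draft.confidence < threshold:
        # Below the threshold the agent hands off instead of acting.
        return ("escalate", [])
    return ("send", [])
```

Policy violations are checked first so that a confident but non-compliant response is still stopped.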
Grounding (RAG)
Anchors responses in customer data (SOPs, contracts, prior tickets) so the model doesn’t invent answers. This is the single most powerful anti-hallucination technique.
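To make the idea concrete, here is a toy sketch of grounding: pick the passages that overlap with the question, then build a prompt that restricts the model to them. The keyword-overlap scoring, the document shape (`id` and `text` keys), and the prompt wording are all assumptions for illustration. A real deployment would typically use embedding search against a proper index.

```python
import re
from typing import Dict, List


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens, used for a naive overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Return up to k docs sharing at least one token with the query."""
    q = _tokens(query)
    scored = [(len(q & _tokens(d["text"])), i, d) for i, d in enumerate(docs)]
    # Highest overlap first; ties keep the original document order.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [d for _, _, d in ranked[:k]]


def build_grounded_prompt(query: str, passages: List[Dict[str, str]]) -> str:
    """Build a prompt that confines the answer to the retrieved passages."""
    if passages:
        context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    else:
        context = "(no relevant passages found)"
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Tagging each passage with its id lets the response cite where an answer came from, which makes wrong answers easier to trace.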
Human-in-the-loop (HITL)
For high-stakes actions (payments, refunds, fraud routing, customer-facing decisions over a threshold), the agent proposes and a human approves. Always-on for production deployments.
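The propose-then-approve flow can be sketched as a simple gate. The action kinds mirror the examples above, but the names, the `customer_facing` flag, and the 500.0 threshold are hypothetical placeholders, not the platform's actual configuration.

```python
from dataclasses import dataclass
from typing import Optional

# Kinds that always need a human, taken from the examples in the text.
ALWAYS_REVIEW = {"payment", "refund", "fraud_routing"}


@dataclass
class ProposedAction:
    kind: str
    value: float = 0.0
    customer_facing: bool = False


def needs_human_approval(action: ProposedAction, threshold: float = 500.0) -> bool:
    """High-stakes kinds always need approval; customer-facing ones only above the threshold."""
    if action.kind in ALWAYS_REVIEW:
        return True
    return action.customer_facing and action.value > threshold


def execute(action: ProposedAction, approved_by: Optional[str] = None) -> str:
    """Run the action only if it is low-stakes or a human has signed off."""
    if needs_human_approval(action) and approved_by is None:
        return "pending_approval"
    return "executed"
```

The agent never bypasses the gate itself: an action that needs review stays in `pending_approval` until a named approver is supplied.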
The hallucination problem specifically
LLMs hallucinate when they generate plausible-sounding text that isn’t true. Common causes and their mitigations:
| Cause | Mitigation |
|---|---|
| Missing context | RAG — feed in the relevant data |
| Out-of-date knowledge | RAG with recent data + explicit “as of” markers |
| Ambiguous query | Ask clarifying question instead of answering |
| Pressure to answer | Explicit “say ‘I don’t know’ if not in retrieved context” instruction |
| Tool result formatting confuses the model | Cleaner tool output schemas |
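As an illustration of the last two kinds of mitigation in the table, the sketch below flattens a raw tool response into a small, explicit schema and carries an "as of" timestamp with it. The raw payload shape, the field names, and the status codes are invented for the example and do not describe any real tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical mapping from terse internal codes to plain words.
STATUS_LABELS = {"IN_TRANSIT": "in transit", "DLV": "delivered"}


@dataclass
class ShipmentStatus:
    shipment_id: str
    status: str
    eta_date: Optional[str]  # None when the tool has no ETA, never an empty string
    as_of: str               # when the tool produced this data


def normalize_tool_result(raw: dict) -> ShipmentStatus:
    """Flatten a nested, abbreviated tool payload into explicit fields."""
    shipment = raw["data"]["shp"]
    return ShipmentStatus(
        shipment_id=shipment["id"],
        status=STATUS_LABELS.get(shipment["st"], "unknown"),
        eta_date=shipment.get("eta") or None,
        as_of=raw["ts"],
    )


def render_for_model(status: ShipmentStatus) -> str:
    """Serialize with stable, self-describing keys for the prompt."""
    return json.dumps(asdict(status), sort_keys=True)
```

Turning an empty ETA into an explicit null gives the model a clear "unknown" signal instead of a blank it may be tempted to fill in.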
You can’t eliminate hallucination, only reduce it. The right framing for customers: the agent will be more accurate than the average human operator on the things it’s grounded in, and will escalate when it’s not sure.
What CS folks tell customers
Three things every customer asks about. Have answers ready:
- “What if it gives a wrong answer?” Guardrails catch most. RAG grounds the rest. Confidence thresholds and HITL catch the residual. We track accuracy in production via evals.
- “What if someone tries to trick it?” Input guardrails detect prompt injection. Output guardrails prevent PII / policy leaks. Adversarial evals run before every release.
- “What about PII / GDPR / data residency?” Masked at input, encrypted at rest, redacted in logs. On-prem deployment available where data residency requires it. See Security & compliance.
Sources
- BDO and Carrix decks, “Shipsy AI Platform — Secure, Compliant & Observable” section
- See Security & compliance for the compliance posture
Changelog
- 26 May 2026: Initial draft.