Guardrails & hallucination control

What stops the agent from doing or saying the wrong thing.

The defence-in-depth picture

No single layer is enough. Multiple weak filters compose into a strong one.
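As a rough illustration of why layering helps, suppose each layer independently misses some fraction of bad inputs or outputs. The miss rates below are invented for the example, not measured figures. Independence is also an optimistic assumption, since real layers tend to fail in correlated ways, so treat the result as a best case.

```python
from math import prod

def combined_miss_rate(miss_rates):
    """Fraction of bad cases that slip past every layer,
    assuming (optimistically) that layers fail independently."""
    return prod(miss_rates)

# Hypothetical per-layer miss rates, for illustration only.
layers = {
    "input guardrails": 0.30,
    "grounding": 0.20,
    "output guardrails": 0.30,
    "HITL": 0.10,
}
print(f"{combined_miss_rate(layers.values()):.4f}")  # prints 0.0018
```

Under these assumed numbers, four individually leaky layers let through well under 1% of bad cases, which is the sense in which weak filters compose into a strong one.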

What each layer does

Input guardrails

Prompt injection detection — catches inputs trying to override the system prompt (“ignore previous instructions…”)
PII redaction — strips/masks card numbers, IDs, etc. before they hit logs or the model where unnecessary
Rate limiting — caps requests per user / per minute
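The three input checks above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the regex patterns, the `[REDACTED_CARD]` placeholder, and the `RateLimiter` class are all hypothetical, and a production system would likely use a trained classifier for injection detection rather than fixed patterns.

```python
import re
import time
from collections import defaultdict, deque
from typing import Optional

# Hypothetical patterns, for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the )?system prompt", re.I),
]
# Card-like numbers: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def looks_like_injection(text: str) -> bool:
    """True if the input matches any known override phrase."""
    return any(p.search(text) for p in INJECTION_PATTERNS)

def redact_pii(text: str) -> str:
    """Mask card-like numbers before the text reaches logs or the model."""
    return CARD_RE.sub("[REDACTED_CARD]", text)

class RateLimiter:
    """Sliding window: at most `limit` requests per `window_s` seconds per user."""

    def __init__(self, limit: int = 5, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Pattern matching like this only catches known phrasings, which is one reason the input layer is not relied on alone.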

Output guardrails

PII / profanity filter — strips sensitive data and inappropriate language before the response leaves
Policy compliance — checks the response against customer-specific policies (no quoting prices outside approval, no promising delivery dates without checking, etc.)
Confidence threshold — if model confidence < X, escalate instead of acting
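A minimal sketch of how the policy check and confidence threshold might combine into a single decision. The `Draft` type, the banned phrases, and the `0.8` threshold are placeholders chosen for illustration. The source leaves the threshold as X and does not say how the confidence score is produced, so here it is simply assumed to be supplied by the caller.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float  # assumed: a score in [0, 1] supplied by the caller

# Hypothetical policy rules: phrases the agent may not send without approval.
BANNED_PHRASES = ("guaranteed delivery", "we promise delivery")

def policy_violations(text: str) -> list:
    """Return the banned phrases found in the text (case-insensitive)."""
    lowered = text.lower()
    return [p for p in BANNED_PHRASES if p in lowered]

def check_output(draft: Draft, threshold: float = 0.8) -> str:
    """Return 'block', 'escalate', or 'send' for a drafted response."""
    if policy_violations(draft.text):
        return "block"
    if draft.confidence < threshold:
        return "escalate"
    return "send"
```

Checking policy before confidence means a confident but non-compliant response is still stopped.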

Grounding (RAG)

Anchors responses in customer data (SOPs, contracts, prior tickets) so the model doesn’t invent answers. It is the single most effective anti-hallucination technique of the four layers.
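To make the grounding step concrete, here is a toy sketch. The word-overlap `retrieve` function is a deliberately naive stand-in for a real retriever (typically embedding search), and all function names are hypothetical. The point it illustrates is that the prompt is built only from retrieved customer text, and that nothing is sent to the model when nothing relevant is found.

```python
def retrieve(query, documents, k=2):
    """Rank documents by naive word overlap with the query.
    A stand-in for a real retriever, for illustration only."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), d) for d in documents]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for _, d in scored[:k]]

def build_grounded_prompt(query, documents):
    """Return a prompt containing retrieved context,
    or None if nothing relevant was found."""
    context = retrieve(query, documents)
    if not context:
        return None  # nothing to ground on: escalate or ask for clarification
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

docs = [
    "Refunds are processed within 5 business days.",
    "Drivers must scan parcels at pickup.",
]
print(build_grounded_prompt("how are refunds processed", docs))
```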

Human-in-the-loop (HITL)

For high-stakes actions (payments, refunds, fraud routing, customer-facing decisions over a threshold), the agent proposes and a human approves. HITL is always on in production deployments.
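The propose-then-approve flow can be sketched as a gate in front of execution. The action names, the `100.0` amount threshold, and the `approve` / `run` callbacks are hypothetical placeholders. In practice `approve` would block on a human reviewer's decision.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds that always require approval.
HIGH_STAKES = {"payment", "refund", "fraud_routing"}

@dataclass
class ProposedAction:
    kind: str
    amount: float = 0.0

def needs_approval(action: ProposedAction, amount_threshold: float = 100.0) -> bool:
    """High-stakes kinds always need approval; other actions do above a threshold."""
    return action.kind in HIGH_STAKES or action.amount > amount_threshold

def execute(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    run: Callable[[ProposedAction], str],
) -> str:
    """Run the action only if it is low-stakes or a human approved it."""
    if needs_approval(action) and not approve(action):
        return "rejected"
    return run(action)
```

The agent never calls `run` directly for a gated action, so a missing or negative approval fails closed.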

The hallucination problem specifically

LLMs hallucinate when they generate plausible-sounding text that isn’t true. Common cases:

Cause	Mitigation
Missing context	RAG — feed in the relevant data
Out-of-date knowledge	RAG with recent data + explicit “as of” markers
Ambiguous query	Ask clarifying question instead of answering
Pressure to answer	Explicit “say ‘I don’t know’ if not in retrieved context” instruction
Tool result formatting confuses the model	Cleaner tool output schemas
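Two of the mitigations in the table, the “as of” markers and the explicit “I don’t know” instruction, come down to how the prompt is assembled. A minimal sketch, assuming the retriever returns each snippet with the date it was last known to be valid. The wording of the instruction and the function names are illustrative, not the platform's actual prompt.

```python
from datetime import date

# Illustrative wording for the abstain instruction.
ABSTAIN_RULE = (
    "Answer only from the context below. "
    "If the answer is not in the context, say \"I don't know\"."
)

def format_snippet(text: str, as_of: date) -> str:
    """Prefix a snippet with an explicit 'as of' marker."""
    return f"[as of {as_of.isoformat()}] {text}"

def build_prompt(question, snippets):
    """snippets: list of (text, as_of_date) pairs from the retriever."""
    context = "\n".join(format_snippet(t, d) for t, d in snippets)
    return f"{ABSTAIN_RULE}\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the SLA?", [("SLA is 48 hours.", date(2026, 5, 1))]))
```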

You can’t eliminate hallucination, only reduce it. The right framing for customers: the agent will be more accurate than the average human operator on the things it’s grounded in, and will escalate when it’s not sure.

What CS folks tell customers

Three things every customer asks about. Have answers ready:

“What if it gives a wrong answer?” Grounding with RAG keeps most answers tied to customer data, guardrails catch most of what slips through, and the confidence threshold plus HITL catch the residual. We track accuracy in production via evals.
“What if someone tries to trick it?” Input guardrails detect prompt injection. Output guardrails prevent PII / policy leaks. Adversarial evals run before every release.
“What about PII / GDPR / data residency?” Masked at input, encrypted at rest, redacted in logs. On-prem deployment available where data residency requires it. See Security & compliance.

Sources

BDO and Carrix decks, “Shipsy AI Platform — Secure, Compliant & Observable” section
See Security & compliance for the compliance posture

Changelog

26 May 2026: Initial draft.
