
Guardrails & hallucination control

What stops the agent from doing or saying the wrong thing.

The defence-in-depth picture

No single layer is enough. Multiple weak filters compose into a strong one.
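A rough worked example of why stacking weak filters helps. This is a sketch under a strong assumption: the layers fail independently, which real guardrails only approximate. The miss rates are made-up numbers, not measurements.

```python
from math import prod

def combined_miss_rate(layer_miss_rates):
    """Probability that a bad case slips past every layer,
    assuming the layers fail independently (an idealisation)."""
    return prod(layer_miss_rates)

# Hypothetical: four layers that each miss 20% of bad cases.
layers = [0.2, 0.2, 0.2, 0.2]
print(f"{combined_miss_rate(layers):.4f}")  # prints 0.0016
```

Each layer alone lets through one bad case in five, but together they let through roughly one in 600. In practice layers tend to miss the same hard cases, so the real gain is smaller than this idealised figure.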

What each layer does

Input guardrails

  • Prompt injection detection — catches inputs trying to override the system prompt (“ignore previous instructions…”)
  • PII redaction — strips/masks card numbers, IDs, etc. before they hit logs or the model where unnecessary
  • Rate limiting — caps requests per user / per minute
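The three input checks above can be sketched roughly as follows. This is an illustration only, not the platform's implementation: the injection phrases, the card-number pattern, and the `RateLimiter` class are all hypothetical, and production detectors usually combine patterns like these with a trained classifier.

```python
import re
import time
from collections import defaultdict, deque
from typing import Optional

# Illustrative phrases only (assumption); attackers vary wording freely.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the )?system prompt", re.I),
]

# Naive card-number pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def looks_like_injection(text: str) -> bool:
    """True if any known override phrase appears in the input."""
    return any(p.search(text) for p in INJECTION_PATTERNS)

def redact_cards(text: str) -> str:
    """Mask anything that looks like a card number before logging or prompting."""
    return CARD_RE.sub("[REDACTED_CARD]", text)

class RateLimiter:
    """Sliding window: at most max_requests per window_s seconds per user."""

    def __init__(self, max_requests: int, window_s: float = 60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request would pass through all three in turn: reject if `allow` returns False, flag or refuse if `looks_like_injection` is True, and pass only the output of `redact_cards` to logs and to the model.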

Output guardrails

  • PII / profanity filter — strips sensitive data and inappropriate language before the response leaves
  • Policy compliance — checks the response against customer-specific policies (no quoting prices outside approval, no promising delivery dates without checking, etc.)
  • Confidence threshold — if model confidence < X, escalate instead of acting
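A minimal sketch of how the policy check and confidence threshold might combine into one decision. The rules, the threshold value, and the `Draft` type are hypothetical. The sketch also assumes a confidence score is supplied from somewhere (for example a verifier model), since an LLM does not emit a calibrated confidence on its own.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Draft:
    text: str
    confidence: float  # assumed: a score in [0, 1] supplied by the caller

# Hypothetical customer policy rules: (pattern, reason for blocking).
POLICY_RULES = [
    (re.compile(r"\bguarantee(d)?\b.*\bdeliver", re.I), "promises a delivery date"),
    (re.compile(r"[$€£]\s?\d"), "quotes a price"),
]

CONFIDENCE_THRESHOLD = 0.75  # the "X" in the text; the value is illustrative

def review(draft: Draft) -> Tuple[str, List[str]]:
    """Return ("send" | "escalate" | "block", reasons)."""
    violations = [reason for pattern, reason in POLICY_RULES
                  if pattern.search(draft.text)]
    if violations:
        return "block", violations
    if draft.confidence < CONFIDENCE_THRESHOLD:
        return "escalate", ["confidence below threshold"]
    return "send", []
```

Policy violations are checked first, so a confident but non-compliant response is still stopped.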

Grounding (RAG)

Anchors responses in customer data (SOPs, contracts, prior tickets) so the model doesn’t invent answers. The single most powerful anti-hallucination technique.
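To make the grounding idea concrete, here is a toy sketch: retrieve the most relevant documents, then build a prompt that restricts the model to them. The word-overlap retriever stands in for a real embedding or search index, and the `Doc` type, document names, and prompt wording are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Doc:
    source: str   # hypothetical identifier, e.g. an SOP or contract name
    as_of: date   # when the content was last known to be valid
    text: str

def retrieve(query: str, docs: List[Doc], k: int = 2) -> List[Doc]:
    """Toy retriever: rank documents by number of words shared with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.text.lower().split())), d) for d in docs]
    scored = [(s, d) for s, d in scored if s > 0]  # drop irrelevant docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def build_grounded_prompt(query: str, context: List[Doc]) -> str:
    """Assemble a prompt that limits the model to the retrieved context."""
    lines = [
        "Answer using ONLY the context below.",
        "If the answer is not in the context, reply: I don't know.",
        "",
        "Context:",
    ]
    for d in context:
        lines.append(f"[{d.source}, as of {d.as_of.isoformat()}] {d.text}")
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)
```

When `retrieve` returns nothing, the context section is empty and the instruction steers the model towards "I don't know" rather than a guess.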

Human-in-the-loop (HITL)

For high-stakes actions (payments, refunds, fraud routing, customer-facing decisions over a threshold), the agent proposes and a human approves. Always-on for production deployments.
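The propose-then-approve pattern can be sketched as a small approval queue. This is a hypothetical illustration: the action kinds, the amount threshold, and the `ApprovalQueue` class are assumptions, not the product's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ProposedAction:
    kind: str                    # e.g. "refund", "payment", "status_update"
    amount: float
    execute: Callable[[], str]   # runs the action and returns a result string

HIGH_STAKES_KINDS = {"payment", "refund", "fraud_routing"}  # illustrative
AMOUNT_THRESHOLD = 100.0                                    # illustrative

def needs_approval(action: ProposedAction) -> bool:
    return action.kind in HIGH_STAKES_KINDS or action.amount > AMOUNT_THRESHOLD

@dataclass
class ApprovalQueue:
    pending: List[ProposedAction] = field(default_factory=list)

    def submit(self, action: ProposedAction) -> str:
        """Low-stakes actions run immediately; high-stakes ones wait for a human."""
        if needs_approval(action):
            self.pending.append(action)
            return "pending_approval"
        return action.execute()

    def approve(self, index: int) -> str:
        """A human approves: remove from the queue and execute."""
        return self.pending.pop(index).execute()

    def reject(self, index: int) -> None:
        """A human rejects: remove from the queue without executing."""
        self.pending.pop(index)
```

The agent only ever calls `submit`. `approve` and `reject` belong to the human reviewer's interface, so nothing high-stakes executes without a person in the path.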

The hallucination problem specifically

An LLM hallucinates when it generates plausible-sounding text that isn’t true. Common causes and their mitigations:

Cause | Mitigation
Missing context | RAG: feed in the relevant data
Out-of-date knowledge | RAG with recent data + explicit “as of” markers
Ambiguous query | Ask a clarifying question instead of answering
Pressure to answer | Explicit “say ‘I don’t know’ if not in retrieved context” instruction
Tool result formatting confuses the model | Cleaner tool output schemas
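As an illustration of the last mitigation, a tool wrapper can reduce a verbose, nested response to a small, flat schema before the model sees it. The field names and response shape below are invented for the example and do not describe a real API.

```python
import json
from typing import Any, Dict

def clean_tracking_result(raw: Dict[str, Any]) -> str:
    """Keep only the fields the model needs, with stable names and
    explicit nulls for missing values. All field names are hypothetical."""
    shipment = (raw.get("data") or {}).get("shipment") or {}
    cleaned = {
        "tracking_id": shipment.get("id"),
        "status": (shipment.get("status") or "unknown").lower(),
        "eta_date": shipment.get("eta"),  # ISO date string, or None if unknown
    }
    return json.dumps(cleaned, sort_keys=True)
```

An explicit `null` for a missing ETA gives the model less room to invent a date than a field that silently disappears.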

You can’t eliminate hallucination, only reduce it. The right framing for customers: the agent will be more accurate than the average human operator on the things it’s grounded in, and will escalate when it’s not sure.

What CS folks tell customers

Three things every customer asks about. Have answers ready:

  1. “What if it gives a wrong answer?” Guardrails catch most wrong answers. RAG grounding prevents many others. The confidence threshold plus HITL catch the residual. We track accuracy in production via evals.
  2. “What if someone tries to trick it?” Input guardrails detect prompt injection. Output guardrails prevent PII / policy leaks. Adversarial evals run before every release.
  3. “What about PII / GDPR / data residency?” Masked at input, encrypted at rest, redacted in logs. On-prem deployment available where data residency requires it. See Security & compliance.

Sources

  • BDO and Carrix decks, “Shipsy AI Platform — Secure, Compliant & Observable” section
  • See Security & compliance for the compliance posture

Changelog

  • 26 May 2026: Initial draft.