Evals · how we know it works

You can’t ship an agent without evals. This page is what “eval” actually means.

The problem

LLM agents are non-deterministic. The same input can produce slightly different outputs across runs. Traditional software tests (assert equals) don’t work cleanly.

So you build eval sets — curated input/output pairs (or input + judging criteria) — and run them as a suite. You measure pass rate, latency, cost over time and across model/prompt changes.

Four kinds of evals Shipsy runs

Eval type	When	What it answers
Scenario-based eval set	Pre-deployment	Does the agent handle the cases we care about?
LLM-as-judge	Pre-deployment	Was the response correct/safe/on-tone? (graded by another LLM)
Continuous canary	In prod	Did anything drift since last week?
Human feedback	In prod	What did real users think?

Plus benchmarking — A/B tests on prompts and model variants to optimise outcomes.

What a Shipsy eval set looks like

Placeholder — link to real example: The agent-platform repo has eval-set examples. Add a link here and inline a small sample.

Conceptually:

agent: clara
eval_set: bdo_card_delivery_v1
scenarios:
  - id: status_query_happy_path
    input:
      caller_speech: "Where is my card?"
      caller_id: "verified"
      order_state: "out_for_delivery"
    expected:
      intent: "status_query"
      action: "speak_status"
      contains_eta: true
    score: pass/fail
 
  - id: caller_hangs_up_mid_auth
    input:
      caller_speech: "..."
      caller_id: "unverified"
      caller_disconnects_at_turn: 2
    expected:
      action: "log_incomplete_call"
      retry_scheduled: true
    score: pass/fail
 
  # ... 50+ more

What Shipsy measures

Metric	Why
Accuracy / correctness	Did the agent do the right thing?
Hallucination rate	Did it make stuff up?
Latency (first token, full response)	Especially matters for voice
Cost per interaction	Tracks token economy
Escalation rate to human	Inverse of self-resolution
Drift over time	Same eval set, different days — has the model or platform changed?

Adversarial / safety evals

Separate suite. Red-team prompts designed to make the agent break rules:

Prompt injection (“ignore previous instructions and …”)
PII extraction attempts
Off-topic / abuse handling
Multilingual edge cases

These run before every release.

Anti-patterns

Shipping without an eval set. You’ll be debugging blind in prod.
Eval sets that only cover happy paths. The whole point is to catch edge cases.
Treating eval scores as absolute. Use them to detect regressions, not to pass arbitrary thresholds.
No human-in-the-loop for borderline cases. LLM-as-judge has its own failure modes.

Sources

See Eval framework for the Shipsy-specific framework
Lab 4 · Build an eval set for hands-on

Changelog

26 May 2026: Initial draft.

Tools & MCP Guardrails & hallucination control