2 · AI & agentsEvals: how we know it works

Evals · how we know it works

You can’t ship an agent without evals. This page is what “eval” actually means.

The problem

LLM agents are non-deterministic. The same input can produce slightly different outputs across runs. Traditional software tests (assert equals) don’t work cleanly.

So you build eval sets — curated input/output pairs (or input + judging criteria) — and run them as a suite. You measure pass rate, latency, cost over time and across model/prompt changes.

Four kinds of evals Shipsy runs

Eval typeWhenWhat it answers
Scenario-based eval setPre-deploymentDoes the agent handle the cases we care about?
LLM-as-judgePre-deploymentWas the response correct/safe/on-tone? (graded by another LLM)
Continuous canaryIn prodDid anything drift since last week?
Human feedbackIn prodWhat did real users think?

Plus benchmarking — A/B tests on prompts and model variants to optimise outcomes.

What a Shipsy eval set looks like

Placeholder — link to real example: The agent-platform repo has eval-set examples. Add a link here and inline a small sample.

Conceptually:

agent: clara
eval_set: bdo_card_delivery_v1
scenarios:
  - id: status_query_happy_path
    input:
      caller_speech: "Where is my card?"
      caller_id: "verified"
      order_state: "out_for_delivery"
    expected:
      intent: "status_query"
      action: "speak_status"
      contains_eta: true
    score: pass/fail
 
  - id: caller_hangs_up_mid_auth
    input:
      caller_speech: "..."
      caller_id: "unverified"
      caller_disconnects_at_turn: 2
    expected:
      action: "log_incomplete_call"
      retry_scheduled: true
    score: pass/fail
 
  # ... 50+ more

What Shipsy measures

MetricWhy
Accuracy / correctnessDid the agent do the right thing?
Hallucination rateDid it make stuff up?
Latency (first token, full response)Especially matters for voice
Cost per interactionTracks token economy
Escalation rate to humanInverse of self-resolution
Drift over timeSame eval set, different days — has the model or platform changed?

Adversarial / safety evals

Separate suite. Red-team prompts designed to make the agent break rules:

  • Prompt injection (“ignore previous instructions and …”)
  • PII extraction attempts
  • Off-topic / abuse handling
  • Multilingual edge cases

These run before every release.

Anti-patterns

  • Shipping without an eval set. You’ll be debugging blind in prod.
  • Eval sets that only cover happy paths. The whole point is to catch edge cases.
  • Treating eval scores as absolute. Use them to detect regressions, not to pass arbitrary thresholds.
  • No human-in-the-loop for borderline cases. LLM-as-judge has its own failure modes.

Sources

Changelog

  • 26 May 2026: Initial draft.