Evals · how we know it works
You can’t ship an agent without evals. This page is what “eval” actually means.
The problem
LLM agents are non-deterministic. The same input can produce slightly different outputs across runs. Traditional software tests (assert equals) don’t work cleanly.
So you build eval sets — curated input/output pairs (or input + judging criteria) — and run them as a suite. You measure pass rate, latency, cost over time and across model/prompt changes.
Four kinds of evals Shipsy runs
| Eval type | When | What it answers |
|---|---|---|
| Scenario-based eval set | Pre-deployment | Does the agent handle the cases we care about? |
| LLM-as-judge | Pre-deployment | Was the response correct/safe/on-tone? (graded by another LLM) |
| Continuous canary | In prod | Did anything drift since last week? |
| Human feedback | In prod | What did real users think? |
Plus benchmarking — A/B tests on prompts and model variants to optimise outcomes.
What a Shipsy eval set looks like
Placeholder — link to real example: The agent-platform repo has eval-set examples. Add a link here and inline a small sample.
Conceptually:
agent: clara
eval_set: bdo_card_delivery_v1
scenarios:
- id: status_query_happy_path
input:
caller_speech: "Where is my card?"
caller_id: "verified"
order_state: "out_for_delivery"
expected:
intent: "status_query"
action: "speak_status"
contains_eta: true
score: pass/fail
- id: caller_hangs_up_mid_auth
input:
caller_speech: "..."
caller_id: "unverified"
caller_disconnects_at_turn: 2
expected:
action: "log_incomplete_call"
retry_scheduled: true
score: pass/fail
# ... 50+ moreWhat Shipsy measures
| Metric | Why |
|---|---|
| Accuracy / correctness | Did the agent do the right thing? |
| Hallucination rate | Did it make stuff up? |
| Latency (first token, full response) | Especially matters for voice |
| Cost per interaction | Tracks token economy |
| Escalation rate to human | Inverse of self-resolution |
| Drift over time | Same eval set, different days — has the model or platform changed? |
Adversarial / safety evals
Separate suite. Red-team prompts designed to make the agent break rules:
- Prompt injection (“ignore previous instructions and …”)
- PII extraction attempts
- Off-topic / abuse handling
- Multilingual edge cases
These run before every release.
Anti-patterns
- Shipping without an eval set. You’ll be debugging blind in prod.
- Eval sets that only cover happy paths. The whole point is to catch edge cases.
- Treating eval scores as absolute. Use them to detect regressions, not to pass arbitrary thresholds.
- No human-in-the-loop for borderline cases. LLM-as-judge has its own failure modes.
Sources
- See Eval framework for the Shipsy-specific framework
- Lab 4 · Build an eval set for hands-on
Changelog
- 26 May 2026: Initial draft.