Eval framework
At a glance
- Framework: pytest-based, with the EvalRunsDefinition model for agent evaluation
- Coverage targets: 80% minimum overall; 90%+ for auth/DAOs; 85%+ for LLM service and orchestration
- Structure: tests/unit/ (20+ subdirectories), tests/integration/, tests/middleware/, tests/services/
- Pattern: Arrange-Act-Assert, mock at service boundaries, scenario-based agent evals
Why this matters
Evals are how you know an agent works before it goes live — and how you catch drift after it’s been running for months. When a customer reports that “the agent started giving wrong answers last week,” the eval framework is where you start. For CS, understanding evals means you can help scope test scenarios, explain what’s being tested, and interpret results.
Types of testing
1. Unit tests
Standard Python unit tests covering individual functions and services. Located in tests/unit/ with 20+ subdirectories mirroring the app/ structure.
| Area | What’s tested | Target coverage |
|---|---|---|
| Auth | ProjectX auth, API key validation, HMAC | 90%+ |
| DAOs | Database access objects (CRUD operations) | 90%+ |
| LLM service | Provider routing, response parsing, error handling | 85%+ |
| Orchestration | Node builders, graph compilation, state management | 85%+ |
| Middleware | All 14 middleware types | 80%+ |
| Communication | Channel routing, provider selection, rate limiting | 80%+ |
| Tools | Tool registry, function execution | 80%+ |
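A minimal sketch of the Arrange-Act-Assert pattern with mocks at the service boundary. The `route_completion` function and the provider `complete` interface are hypothetical and are not taken from the agent-platform repo; only `unittest.mock` and the pytest naming convention are real.

```python
from unittest.mock import Mock


def route_completion(prompt, primary, fallback):
    """Hypothetical service function: try the primary provider, fall back on timeout."""
    try:
        return primary.complete(prompt)
    except TimeoutError:
        return fallback.complete(prompt)


def test_falls_back_when_primary_times_out():
    # Arrange: replace both providers with mocks at the service boundary.
    primary = Mock()
    primary.complete.side_effect = TimeoutError("primary provider timed out")
    fallback = Mock()
    fallback.complete.return_value = "ok from fallback"

    # Act: call the function under test.
    result = route_completion("Where is my order?", primary, fallback)

    # Assert: the fallback answer is returned and each provider was called once.
    assert result == "ok from fallback"
    primary.complete.assert_called_once_with("Where is my order?")
    fallback.complete.assert_called_once_with("Where is my order?")
```

Mocking at the boundary keeps the test independent of any real provider, so a failure points at the routing logic rather than at network conditions.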
2. Integration tests
End-to-end tests that run actual workflows against test databases. Located in tests/integration/.
3. Middleware tests
Dedicated test suite for the 14 node-level middleware types. Located in tests/middleware/. Tests middleware ordering, error propagation, and interaction between middleware layers.
4. Agent evaluation runs
The EvalRunsDefinition model supports structured agent evaluation:
| Field | Purpose |
|---|---|
| dataset_name | Name of the test scenario dataset |
| config_json | Evaluation configuration (which agent, which model, parameters) |
| metrics_json | Metrics to evaluate (accuracy, latency, cost, tool-call correctness) |
| result | Evaluation outcome |
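To make the four fields concrete, here is an illustrative stand-in built as a plain dataclass. It is not the real EvalRunsDefinition model: the class name `EvalRunSketch`, the field types (JSON stored as strings), the dataset name, and the configuration keys are all assumptions.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRunSketch:
    """Illustrative stand-in for EvalRunsDefinition; the real model may differ."""
    dataset_name: str
    config_json: str               # serialized evaluation configuration (assumed to be a JSON string)
    metrics_json: str              # serialized list of metrics to compute (assumed to be a JSON string)
    result: Optional[str] = None   # stays empty until the run completes


run = EvalRunSketch(
    dataset_name="wismo_happy_path",  # hypothetical dataset name
    config_json=json.dumps({"agent": "wismo", "model": "example-model", "temperature": 0.0}),
    metrics_json=json.dumps(["accuracy", "latency", "cost", "tool_call_correctness"]),
)
```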
Building an eval set for a new agent
Step 1: Define scenarios
For each agent, create scenarios that cover:
| Category | What to test | Example |
|---|---|---|
| Happy path | Standard input, expected output | WISMO: customer provides valid AWB, agent returns correct status |
| Edge cases | Unusual but valid inputs | Multiple orders, partial data, uncommon formats |
| Adversarial | Inputs designed to break the agent | SQL injection in order number, prompt injection attempts |
| Timeout | Slow or unresponsive tool calls | External API takes 30s to respond |
| Language switch | Multilingual handling | Customer starts in English, switches to Hindi |
| Escalation | Out-of-scope queries | Customer asks about billing when agent only handles tracking |
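One way to keep the scenario categories honest is to store scenarios as data and check which categories are still uncovered. The sketch below assumes a hypothetical `Scenario` record and category identifiers derived from the table above; none of these names come from the repo.

```python
from dataclasses import dataclass

# Category identifiers derived from the table above (the names are assumptions).
CATEGORIES = {
    "happy_path", "edge_case", "adversarial",
    "timeout", "language_switch", "escalation",
}


@dataclass(frozen=True)
class Scenario:
    name: str
    category: str
    user_input: str
    expected_behavior: str


SCENARIOS = [
    Scenario("valid_awb", "happy_path",
             "Where is AWB 1234567890?", "returns the current shipment status"),
    Scenario("sql_in_order_number", "adversarial",
             "Order '; DROP TABLE orders; --", "rejects the input without a tool call"),
    Scenario("billing_question", "escalation",
             "Why was I charged twice?", "escalates to a human"),
]


def missing_categories(scenarios):
    """Return the categories that have no scenario yet, sorted by name."""
    covered = {s.category for s in scenarios}
    return sorted(CATEGORIES - covered)
```

A list like `SCENARIOS` can be passed to `pytest.mark.parametrize("scenario", SCENARIOS, ids=lambda s: s.name)` so that each scenario appears as its own test case in the report.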
Step 2: Define metrics
| Metric | What it measures |
|---|---|
| Accuracy | Did the agent return the correct answer? (LLM-as-judge or exact match) |
| Tool-call correctness | Did the agent call the right tools with the right parameters? |
| Latency | End-to-end response time |
| Cost | Total token cost (USD) per scenario |
| Escalation rate | Percentage of scenarios that triggered escalation |
| Guardrail triggers | How often content safety / PII detection fired |
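The metrics above can be aggregated from per-scenario results roughly as follows. This is a sketch under assumptions: the `ScenarioResult` record and its field names are hypothetical, and correctness is assumed to have been judged already (by exact match or LLM-as-judge).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScenarioResult:
    correct: bool              # judged upstream by exact match or LLM-as-judge
    tool_calls_correct: bool   # right tools called with the right parameters
    latency_s: float           # end-to-end response time in seconds
    cost_usd: float            # total token cost for this scenario
    escalated: bool
    guardrail_triggers: int


def summarize(results):
    """Aggregate per-scenario results into the metrics listed above."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "tool_call_correctness": sum(r.tool_calls_correct for r in results) / n,
        "mean_latency_s": mean(r.latency_s for r in results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
        "escalation_rate": sum(r.escalated for r in results) / n,
        "guardrail_triggers": sum(r.guardrail_triggers for r in results),
    }
```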
Step 3: Run and analyze
# Run the full test suite
pytest tests/ -v --cov=app --cov-report=term-missing
# Run a specific eval set
pytest tests/integration/ -k "eval" -v
Results feed into Langfuse for visualization and trend analysis over time.
Drift detection
After deployment, evals should run periodically (via the TaskIQ scheduler or CI/CD) to catch:
- Model drift: A model update changes behavior (e.g., GPT version bump)
- Data drift: Changes in customer query patterns that the agent wasn’t tested for
- Tool drift: External APIs change their response format
- Prompt drift: System prompt edits that have unintended side effects
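Whatever the cause of drift, a periodic run can flag it by comparing current metrics with a stored baseline. The sketch below is one possible approach, not the platform's implementation; the metric names, tolerances, and the `detect_regressions` helper are assumptions.

```python
# Metrics where a lower value is a regression; for all others a higher value is worse.
HIGHER_IS_BETTER = {"accuracy", "tool_call_correctness"}


def detect_regressions(baseline, current, tolerance):
    """Return {metric: (baseline, current)} for metrics that worsened by more than tolerance[metric]."""
    regressions = {}
    for metric, allowed in tolerance.items():
        delta = current[metric] - baseline[metric]
        # Normalize so that a positive number always means "got worse".
        worsened_by = -delta if metric in HIGHER_IS_BETTER else delta
        if worsened_by > allowed:
            regressions[metric] = (baseline[metric], current[metric])
    return regressions


baseline = {"accuracy": 0.95, "mean_latency_s": 2.0}
current = {"accuracy": 0.88, "mean_latency_s": 2.1}
tolerance = {"accuracy": 0.03, "mean_latency_s": 0.5}
flagged = detect_regressions(baseline, current, tolerance)
```

In this example only accuracy is flagged: it fell by 0.07 against an allowed 0.03, while latency rose by 0.1 s against an allowed 0.5 s.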
CI/CD integration
Tests run automatically both locally and in the Jenkins pipeline:
- Pre-push hook: enabled with git config core.hooksPath hooks; runs all tests before push
- CI pipeline: Full test suite runs in Jenkins before Docker build
- Coverage gate: Build fails if coverage drops below 80%
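The source does not say how the gate is enforced. With pytest-cov, the `--cov-fail-under=80` option is one common way to fail a run below a threshold. The sketch below shows the same check in plain Python, extended to the per-area targets from the coverage table. Whether those per-area targets are enforced in CI is not stated, and the area keys and the `coverage_failures` helper are hypothetical.

```python
OVERALL_MINIMUM = 80.0

# Per-area targets from the coverage table; the key names are assumptions.
AREA_TARGETS = {
    "auth": 90.0,
    "daos": 90.0,
    "llm_service": 85.0,
    "orchestration": 85.0,
}


def coverage_failures(overall, per_area):
    """Return a human-readable message for every threshold that is not met."""
    failures = []
    if overall < OVERALL_MINIMUM:
        failures.append(f"overall: {overall:.1f}% < {OVERALL_MINIMUM:.1f}%")
    for area, target in AREA_TARGETS.items():
        measured = per_area.get(area)
        # Areas without a measurement are skipped rather than treated as failures.
        if measured is not None and measured < target:
            failures.append(f"{area}: {measured:.1f}% < {target:.1f}%")
    return failures
```

A CI step could exit non-zero whenever the returned list is non-empty, which would stop the build before the Docker image is produced.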
Sources
- agent-platform repo: tests/
- agent-platform repo: agents-helper/test_agents.md
- agent-platform repo: pytest.ini
- See Observability for Langfuse eval visualization
- See The agent-platform repo for test directory layout
Changelog
- 26 May 2026: Full content from GitHub repo exploration. Test structure, eval model, scenario framework, drift detection.