Eval framework

At a glance

Framework: pytest-based with EvalRunsDefinition model for agent evaluation
Coverage targets: 80% minimum overall; 90%+ for auth/DAOs; 85%+ for LLM service and orchestration
Structure: tests/unit/ (20+ subdirectories), tests/integration/, tests/middleware/, tests/services/
Pattern: Arrange-Act-Assert, mock at service boundaries, scenario-based agent evals

Why this matters

Evals are how you know an agent works before it goes live, and how you catch drift after it has been running for months. When a customer reports that “the agent started giving wrong answers last week,” the eval framework is where you start. For CS, understanding evals means you can help scope test scenarios, explain what is being tested, and interpret results.

Types of testing

1. Unit tests

Standard Python unit tests covering individual functions and services. Located in tests/unit/ with 20+ subdirectories mirroring the app/ structure.

Area	What’s tested	Target coverage
Auth	ProjectX auth, API key validation, HMAC	90%+
DAOs	Database access objects (CRUD operations)	90%+
LLM service	Provider routing, response parsing, error handling	85%+
Orchestration	Node builders, graph compilation, state management	85%+
Middleware	All 14 middleware types	80%+
Communication	Channel routing, provider selection, rate limiting	80%+
Tools	Tool registry, function execution	80%+
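The Arrange-Act-Assert pattern with mocking at the service boundary can be sketched as follows. This is a minimal illustration, not code from the repo: `LLMService`, its `complete` method, and the provider client's `send` method are hypothetical names chosen for the example. Only `unittest.mock.Mock` is a real library API.

```python
from unittest.mock import Mock


class LLMService:
    """Hypothetical service under test: forwards a prompt to a provider client."""

    def __init__(self, provider_client):
        self.provider_client = provider_client

    def complete(self, prompt: str) -> str:
        # Parse the provider response and normalize whitespace.
        raw = self.provider_client.send(prompt)
        return raw["text"].strip()


def test_complete_strips_whitespace_from_provider_response():
    # Arrange: replace the external provider with a mock at the service boundary.
    provider = Mock()
    provider.send.return_value = {"text": "  shipped  "}
    service = LLMService(provider)

    # Act: call the unit under test.
    result = service.complete("Where is my order?")

    # Assert: check both the parsed output and the call made across the boundary.
    assert result == "shipped"
    provider.send.assert_called_once_with("Where is my order?")
```

Mocking only the provider client keeps the service's own parsing logic under test while removing the network dependency.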

2. Integration tests

End-to-end tests that run actual workflows against test databases. Located in tests/integration/.

3. Middleware tests

Dedicated test suite for the 14 node-level middleware types. Located in tests/middleware/. The suite covers middleware ordering, error propagation, and interactions between middleware layers.

4. Agent evaluation runs

The EvalRunsDefinition model supports structured agent evaluation:

Field	Purpose
`dataset_name`	Name of the test scenario dataset
`config_json`	Evaluation configuration (which agent, which model, parameters)
`metrics_json`	Metrics to evaluate (accuracy, latency, cost, tool-call correctness)
`result`	Evaluation outcome
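As a rough picture of how these four fields fit together, the sketch below models an eval run as a plain dataclass. The field names come from the table above. Everything else is an assumption for illustration: the class name `EvalRunSketch`, the field types (JSON stored as strings), and the example values. The actual EvalRunsDefinition model in the repo may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRunSketch:
    """Illustrative stand-in for EvalRunsDefinition; types are assumed."""

    dataset_name: str              # name of the test scenario dataset
    config_json: str               # which agent, which model, parameters
    metrics_json: str              # metrics to evaluate
    result: Optional[str] = None   # filled in once the run finishes


# Hypothetical example values, not taken from the repo.
run = EvalRunSketch(
    dataset_name="wismo_happy_path",
    config_json=json.dumps({"agent": "wismo", "model": "example-model"}),
    metrics_json=json.dumps(
        ["accuracy", "latency", "cost", "tool_call_correctness"]
    ),
)
```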

Building an eval set for a new agent

Step 1: Define scenarios

For each agent, create scenarios that cover:

Category	What to test	Example
Happy path	Standard input, expected output	WISMO: customer provides valid AWB, agent returns correct status
Edge cases	Unusual but valid inputs	Multiple orders, partial data, uncommon formats
Adversarial	Inputs designed to break the agent	SQL injection in order number, prompt injection attempts
Timeout	Slow or unresponsive tool calls	External API takes 30s to respond
Language switch	Multilingual handling	Customer starts in English, switches to Hindi
Escalation	Out-of-scope queries	Customer asks about billing when agent only handles tracking
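One way to keep a scenario set honest is to store the category with each scenario and check that all six categories above are represented. The sketch below assumes a simple in-memory representation. The `Scenario` class, the category keys, the tool name `track_shipment`, and the sample inputs are all hypothetical.

```python
from dataclasses import dataclass

# Category keys mirror the table above; the exact spelling is an assumption.
REQUIRED_CATEGORIES = {
    "happy_path", "edge_case", "adversarial",
    "timeout", "language_switch", "escalation",
}


@dataclass(frozen=True)
class Scenario:
    category: str
    user_input: str
    expected_tools: tuple = ()       # tool names the agent should call
    expect_escalation: bool = False  # True when the query is out of scope


SCENARIOS = [
    Scenario("happy_path", "Where is my order? AWB 1234567890",
             expected_tools=("track_shipment",)),
    Scenario("adversarial", "Order number: 1'; DROP TABLE orders;--"),
    Scenario("escalation", "I was charged twice for my order",
             expect_escalation=True),
]


def missing_categories(scenarios):
    """Return the required categories that no scenario covers yet."""
    return REQUIRED_CATEGORIES - {s.category for s in scenarios}
```

Calling `missing_categories(SCENARIOS)` on this partial set reports the categories that still need scenarios before the eval set is complete.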

Step 2: Define metrics

Metric	What it measures
Accuracy	Did the agent return the correct answer? (LLM-as-judge or exact match)
Tool-call correctness	Did the agent call the right tools with the right parameters?
Latency	End-to-end response time
Cost	Total token cost (USD) per scenario
Escalation rate	Percentage of scenarios that triggered escalation
Guardrail triggers	How often content safety / PII detection fired
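The metrics above are per-run aggregates over per-scenario results. The sketch below shows one plausible way to compute them. The `ScenarioResult` shape and `summarize` function are assumptions for illustration, and tool-call correctness is simplified here to an exact match on the ordered list of (tool name, parameters) pairs.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    correct: bool          # verdict from exact match or an LLM judge
    expected_tools: list   # list of (tool_name, params) pairs
    called_tools: list     # what the agent actually called
    latency_s: float       # end-to-end response time in seconds
    cost_usd: float        # total token cost for the scenario
    escalated: bool
    guardrail_fired: bool


def summarize(results):
    """Aggregate per-scenario results into run-level metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("cannot summarize an empty result set")
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "tool_call_correctness": sum(
            r.called_tools == r.expected_tools for r in results
        ) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
        "escalation_rate": sum(r.escalated for r in results) / n,
        "guardrail_trigger_rate": sum(r.guardrail_fired for r in results) / n,
    }
```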

Step 3: Run and analyze

# Run the full test suite
pytest tests/ -v --cov=app --cov-report=term-missing
 
# Run a specific eval set
pytest tests/integration/ -k "eval" -v

Results feed into Langfuse for visualization and trend analysis over time.

Drift detection

After deployment, evals should run periodically (via the TaskIQ scheduler or CI/CD) to catch:

Model drift: A model update changes behavior (e.g., GPT version bump)
Data drift: Changes in customer query patterns that the agent wasn’t tested for
Tool drift: External APIs change their response format
Prompt drift: System prompt edits that have unintended side effects
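Whatever the cause, drift shows up the same way: a metric moves away from the value recorded when the agent was last known to be good. A minimal sketch of that comparison follows. The function name, the metric names, and the 0.05 tolerance are assumptions, and it only handles metrics where higher is better.

```python
def detect_regressions(baseline, current, tolerance=0.05):
    """Return metrics that dropped more than `tolerance` below baseline.

    Only meaningful for higher-is-better metrics such as accuracy.
    Metrics missing from `current` are skipped rather than flagged.
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue
        if base_value - current[name] > tolerance:
            regressions[name] = (base_value, current[name])
    return regressions


# Hypothetical numbers for illustration only.
baseline = {"accuracy": 0.95, "tool_call_correctness": 0.90}
current = {"accuracy": 0.80, "tool_call_correctness": 0.89}
flagged = detect_regressions(baseline, current)
```

Here `flagged` contains only `accuracy`, since the small dip in tool-call correctness stays within tolerance. Lower-is-better metrics such as latency and cost would need the comparison reversed.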

CI/CD integration

Tests run automatically both locally and in the Jenkins pipeline:

Pre-push hook: enabled with git config core.hooksPath hooks; runs all tests before each push
CI pipeline: Full test suite runs in Jenkins before Docker build
Coverage gate: Build fails if coverage drops below 80%
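This page does not say how the coverage gate is implemented. As one possible sketch, the check below reads a coverage.py JSON report (the format written by `coverage json`, which records the overall percentage under `totals.percent_covered`) and compares it with the 80% minimum. The function name and default threshold argument are assumptions. pytest-cov also offers a `--cov-fail-under` option that achieves the same effect from the command line.

```python
import json


def coverage_gate(report_text: str, minimum: float = 80.0) -> bool:
    """Return True when overall coverage meets the minimum percentage."""
    report = json.loads(report_text)
    return report["totals"]["percent_covered"] >= minimum


# Hypothetical report fragments; a real report contains many more keys.
passing = json.dumps({"totals": {"percent_covered": 91.2}})
failing = json.dumps({"totals": {"percent_covered": 79.4}})
```

A CI step could exit non-zero when `coverage_gate` returns False, which stops the build before the Docker image is produced.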

Sources

agent-platform repo: tests/
agent-platform repo: agents-helper/test_agents.md
agent-platform repo: pytest.ini
See Observability for Langfuse eval visualization
See The agent-platform repo for test directory layout

Changelog

26 May 2026: Full content from GitHub repo exploration. Test structure, eval model, scenario framework, drift detection.
