Eval framework

At a glance

  • Framework: pytest-based with EvalRunsDefinition model for agent evaluation
  • Coverage targets: 80% minimum overall; 90%+ for auth/DAOs; 85%+ for LLM service and orchestration
  • Structure: tests/unit/ (20+ subdirectories), tests/integration/, tests/middleware/, tests/services/
  • Pattern: Arrange-Act-Assert, mock at service boundaries, scenario-based agent evals

Why this matters

Evals are how you know an agent works before it goes live — and how you catch drift after it’s been running for months. When a customer reports that “the agent started giving wrong answers last week,” the eval framework is where you start. For CS, understanding evals means you can help scope test scenarios, explain what’s being tested, and interpret results.

Types of testing

1. Unit tests

Standard Python unit tests covering individual functions and services. Located in tests/unit/ with 20+ subdirectories mirroring the app/ structure.

Area | What's tested | Target coverage
Auth | ProjectX auth, API key validation, HMAC | 90%+
DAOs | Database access objects (CRUD operations) | 90%+
LLM service | Provider routing, response parsing, error handling | 85%+
Orchestration | Node builders, graph compilation, state management | 85%+
Middleware | All 14 middleware types | 80%+
Communication | Channel routing, provider selection, rate limiting | 80%+
Tools | Tool registry, function execution | 80%+
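
The Arrange-Act-Assert pattern with mocking at the service boundary can be sketched as follows. This is a minimal illustration, not code from the repo: LLMService, its complete method, and the provider client's send method are hypothetical names chosen for the example.

```python
from unittest.mock import Mock


class LLMService:
    """Hypothetical service: sends a prompt to a provider client and parses the reply."""

    def __init__(self, provider_client):
        self.provider_client = provider_client

    def complete(self, prompt: str) -> str:
        # Service boundary: the only place that talks to the external provider
        raw = self.provider_client.send(prompt)
        return raw["text"].strip()


def test_complete_strips_whitespace_from_provider_reply():
    # Arrange: replace the provider client with a mock at the service boundary
    provider_client = Mock()
    provider_client.send.return_value = {"text": "  shipped  "}
    service = LLMService(provider_client)

    # Act: call the unit under test
    result = service.complete("Where is my order?")

    # Assert: check both the parsed output and the outgoing call
    assert result == "shipped"
    provider_client.send.assert_called_once_with("Where is my order?")
```

Mocking only the provider client keeps the parsing logic under test while avoiding real network calls.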

2. Integration tests

End-to-end tests that run actual workflows against test databases. Located in tests/integration/.

3. Middleware tests

Dedicated test suite for the 14 node-level middleware types. Located in tests/middleware/. The suite covers middleware ordering, error propagation, and interactions between middleware layers.
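
As a rough sketch of what ordering and error-propagation tests can look like, the example below wraps a node handler in recording middleware. The middleware signature, the apply_middleware helper, and the names "auth" and "pii" are assumptions made for illustration; the actual middleware API is not described here.

```python
from typing import Callable, List

Handler = Callable[[dict], dict]
Middleware = Callable[[Handler], Handler]


def make_recorder(name: str, log: List[str]) -> Middleware:
    """Build a hypothetical middleware that logs entry and exit."""
    def middleware(next_handler: Handler) -> Handler:
        def wrapped(state: dict) -> dict:
            log.append(f"{name}:before")
            result = next_handler(state)
            log.append(f"{name}:after")  # skipped if next_handler raises
            return result
        return wrapped
    return middleware


def apply_middleware(handler: Handler, middlewares: List[Middleware]) -> Handler:
    # The first middleware in the list becomes the outermost wrapper
    for mw in reversed(middlewares):
        handler = mw(handler)
    return handler


def test_middleware_runs_in_declared_order():
    log: List[str] = []
    node = apply_middleware(
        lambda state: {**state, "done": True},
        [make_recorder("auth", log), make_recorder("pii", log)],
    )
    assert node({}) == {"done": True}
    assert log == ["auth:before", "pii:before", "pii:after", "auth:after"]


def test_error_propagates_through_outer_middleware():
    log: List[str] = []

    def failing(state: dict) -> dict:
        raise ValueError("tool failed")

    node = apply_middleware(failing, [make_recorder("auth", log)])
    try:
        node({})
    except ValueError as exc:
        assert str(exc) == "tool failed"
    else:
        raise AssertionError("expected ValueError")
    # The exception bypassed the "after" step of the outer middleware
    assert log == ["auth:before"]
```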

4. Agent evaluation runs

The EvalRunsDefinition model supports structured agent evaluation:

Field | Purpose
dataset_name | Name of the test scenario dataset
config_json | Evaluation configuration (which agent, which model, parameters)
metrics_json | Metrics to evaluate (accuracy, latency, cost, tool-call correctness)
result | Evaluation outcome
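
To make the four fields concrete, here is a stand-in dataclass that mirrors them. It is not the real EvalRunsDefinition model: the class name EvalRunRecord, the field types, and all example values (dataset name, agent, model, parameters) are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class EvalRunRecord:
    """Stand-in mirroring the four documented fields; types are assumed."""
    dataset_name: str
    config_json: Dict[str, Any]
    metrics_json: Dict[str, Any]
    result: Optional[Dict[str, Any]] = None  # filled in once the run finishes


# Hypothetical run definition; every value here is a placeholder
run = EvalRunRecord(
    dataset_name="wismo_scenarios",
    config_json={"agent": "wismo", "model": "example-model", "temperature": 0.0},
    metrics_json={"metrics": ["accuracy", "latency", "cost", "tool_call_correctness"]},
)

# Serialize for storage or logging
serialized = json.dumps(asdict(run), sort_keys=True)
```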

Building an eval set for a new agent

Step 1: Define scenarios

For each agent, create scenarios that cover:

Category | What to test | Example
Happy path | Standard input, expected output | WISMO: customer provides valid AWB, agent returns correct status
Edge cases | Unusual but valid inputs | Multiple orders, partial data, uncommon formats
Adversarial | Inputs designed to break the agent | SQL injection in order number, prompt injection attempts
Timeout | Slow or unresponsive tool calls | External API takes 30s to respond
Language switch | Multilingual handling | Customer starts in English, switches to Hindi
Escalation | Out-of-scope queries | Customer asks about billing when agent only handles tracking
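
One way to keep a scenario set honest is to store scenarios as data and check that every category above is represented. The sketch below assumes a hypothetical Scenario shape; the field names, the AWB value, and the sample messages are invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

CATEGORIES: Set[str] = {
    "happy_path", "edge_case", "adversarial",
    "timeout", "language_switch", "escalation",
}


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; not the repo's dataset format."""
    scenario_id: str
    category: str
    user_messages: Tuple[str, ...]
    tool_delay_s: float = 0.0        # simulated delay for tool calls
    expect_escalation: bool = False

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


SCENARIOS = [
    Scenario("wismo-001", "happy_path", ("Where is AWB123456789?",)),
    Scenario("wismo-002", "edge_case", ("I have two orders, where are they?",)),
    Scenario("wismo-003", "adversarial", ("Track order 1'; DROP TABLE orders;--",)),
    Scenario("wismo-004", "timeout", ("Where is AWB123456789?",), tool_delay_s=30.0),
    Scenario("wismo-005", "language_switch", ("Where is my order?", "Mera order kahan hai?")),
    Scenario("wismo-006", "escalation", ("Why was I charged twice?",), expect_escalation=True),
]


def missing_categories(scenarios: Iterable[Scenario]) -> Set[str]:
    """Return the categories that have no scenario yet."""
    return CATEGORIES - {s.category for s in scenarios}
```

A list like SCENARIOS can be passed to pytest.mark.parametrize so that each scenario runs as its own test case.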

Step 2: Define metrics

Metric | What it measures
Accuracy | Did the agent return the correct answer? (LLM-as-judge or exact match)
Tool-call correctness | Did the agent call the right tools with the right parameters?
Latency | End-to-end response time
Cost | Total token cost (USD) per scenario
Escalation rate | Percentage of scenarios that triggered escalation
Guardrail triggers | How often content safety / PII detection fired
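
The sketch below shows how several of these metrics could be aggregated from per-scenario outcomes. It assumes a hypothetical ScenarioOutcome record and uses a normalized exact match (case and surrounding whitespace ignored) for accuracy; LLM-as-judge scoring and guardrail triggers are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

ToolCall = Tuple[str, dict]  # (tool name, parameters)


@dataclass
class ScenarioOutcome:
    """Hypothetical per-scenario result; field names are assumptions."""
    expected_answer: str
    actual_answer: str
    expected_tool_calls: List[ToolCall]
    actual_tool_calls: List[ToolCall]
    latency_s: float
    cost_usd: float
    escalated: bool


def _normalize(text: str) -> str:
    return text.strip().lower()


def summarize(outcomes: List[ScenarioOutcome]) -> Dict[str, float]:
    """Aggregate per-scenario outcomes into run-level metrics."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    n = len(outcomes)
    return {
        # Normalized exact match between expected and actual answers
        "accuracy": sum(
            _normalize(o.expected_answer) == _normalize(o.actual_answer)
            for o in outcomes
        ) / n,
        # Same tools, same parameters, same order
        "tool_call_correctness": sum(
            o.expected_tool_calls == o.actual_tool_calls for o in outcomes
        ) / n,
        "mean_latency_s": sum(o.latency_s for o in outcomes) / n,
        "total_cost_usd": sum(o.cost_usd for o in outcomes),
        "escalation_rate": sum(o.escalated for o in outcomes) / n,
    }
```

Comparing tool calls as ordered lists is a strict choice; an order-insensitive comparison may suit agents that can call tools in any sequence.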

Step 3: Run and analyze

# Run the full test suite
pytest tests/ -v --cov=app --cov-report=term-missing
 
# Run a specific eval set
pytest tests/integration/ -k "eval" -v

Results feed into Langfuse for visualization and trend analysis over time.

Drift detection

After deployment, evals should run periodically (via the TaskIQ scheduler or CI/CD) to catch:

  • Model drift: A model update changes behavior (e.g., GPT version bump)
  • Data drift: Changes in customer query patterns that the agent wasn’t tested for
  • Tool drift: External APIs change their response format
  • Prompt drift: System prompt edits that have unintended side effects
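
Whatever the cause, drift shows up as a change in eval metrics between runs. A minimal sketch of a baseline comparison follows; the metric names, the direction sets, and the tolerance values are assumptions, and real thresholds would need tuning per agent.

```python
from typing import Dict

# Assumed metric directions for this sketch
HIGHER_IS_BETTER = {"accuracy", "tool_call_correctness"}
LOWER_IS_BETTER = {"mean_latency_s", "total_cost_usd"}


def detect_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: Dict[str, float],
) -> Dict[str, float]:
    """Return {metric: delta} for metrics that got worse by more than the tolerance."""
    regressions = {}
    for metric, allowed in tolerance.items():
        if metric not in HIGHER_IS_BETTER and metric not in LOWER_IS_BETTER:
            raise ValueError(f"no direction defined for metric: {metric}")
        delta = current[metric] - baseline[metric]
        # Convert the change into "how much worse", whatever the direction
        worse_by = -delta if metric in HIGHER_IS_BETTER else delta
        if worse_by > allowed:
            regressions[metric] = delta
    return regressions


# Illustrative numbers only, not real results
baseline = {"accuracy": 0.75, "mean_latency_s": 2.0}
current = {"accuracy": 0.5, "mean_latency_s": 2.5}
flagged = detect_regressions(
    baseline, current, tolerance={"accuracy": 0.125, "mean_latency_s": 1.0}
)
```

Here accuracy dropped by more than its tolerance and is flagged, while the latency increase stays within bounds. A scheduled job could run this comparison after each periodic eval and alert when the result is non-empty.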

CI/CD integration

Tests run automatically at three points:

  1. Pre-push hook: enabled with git config core.hooksPath hooks; runs all tests before each push
  2. CI pipeline: Full test suite runs in Jenkins before Docker build
  3. Coverage gate: Build fails if coverage drops below 80%
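
The overall 80% gate can be enforced with pytest-cov's --cov-fail-under=80 option. The per-area targets listed under unit tests are stricter, and the sketch below shows one way they could be checked from a coverage.py JSON report (produced by coverage json). The path prefixes are assumptions about the app/ layout, and nothing here is confirmed to be how the Jenkins pipeline implements its gate.

```python
from typing import Dict, List

OVERALL_TARGET = 80.0

# Assumed path prefixes; adjust to the real app/ layout
AREA_TARGETS: Dict[str, float] = {
    "app/auth/": 90.0,
    "app/daos/": 90.0,
    "app/llm/": 85.0,
    "app/orchestration/": 85.0,
}


def coverage_failures(report: dict) -> List[str]:
    """Return one message per missed target; an empty list means the gate passes."""
    failures = []
    overall = report["totals"]["percent_covered"]
    if overall < OVERALL_TARGET:
        failures.append(f"overall: {overall:.1f}% < {OVERALL_TARGET:.1f}%")

    for prefix, target in AREA_TARGETS.items():
        covered = statements = 0
        for path, data in report["files"].items():
            if path.startswith(prefix):
                covered += data["summary"]["covered_lines"]
                statements += data["summary"]["num_statements"]
        if statements == 0:
            continue  # no measured files under this prefix
        pct = 100.0 * covered / statements
        if pct < target:
            failures.append(f"{prefix}: {pct:.1f}% < {target:.1f}%")
    return failures
```

In CI, a small wrapper could load coverage.json with json.load, call coverage_failures, and exit non-zero if the list is not empty.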

Changelog

  • 26 May 2026: Full content from GitHub repo exploration. Test structure, eval model, scenario framework, drift detection.