Observability & monitoring

At a glance

Three-layer observability stack: New Relic (application), Elasticsearch (execution logs), Langfuse (LLM tracing).
Every reasoning step, tool call, and decision an agent makes is logged.
The AgentFleet Dashboard is the CS-facing view; the layers below are for debugging.

Why this matters

When a customer reports “the agent gave a wrong answer” or “the agent is slow”, you need to trace what happened. The observability stack is your forensic toolkit. Know which layer to look at and you’ll resolve issues in minutes instead of hours.

The three layers

Layer 1: New Relic — is the platform healthy?

New Relic monitors the agent-platform application itself. Use it when the question is “is the service up?” or “are we seeing elevated errors?”

What to check	Where in NR	When
Service health	APM → agent-platform	First thing during any incident
Error rate spike	APM → Error analytics	Customer reports failures
Latency degradation	APM → Distributed tracing	Agent responses feel slow
Alerts	Alerts & AI → Open incidents	Proactive monitoring

NR alerts are configured for agent-platform health checks. ECS tasks run with desiredCount >= 2 for high availability.
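
The high-availability rule above can be spot-checked with a short sketch. It assumes you already hold a response shaped like the one returned by the ECS DescribeServices API (top-level `services` list, each entry carrying `serviceName` and `desiredCount`); the function name and the example service names are hypothetical, and no AWS call is made here.

```python
def services_below_min_count(describe_services_response, minimum=2):
    """Return names of services whose desiredCount is below `minimum`.

    `describe_services_response` is assumed to follow the ECS
    DescribeServices response shape; services missing a desiredCount
    are treated as 0 and therefore flagged.
    """
    return [
        svc["serviceName"]
        for svc in describe_services_response.get("services", [])
        if svc.get("desiredCount", 0) < minimum
    ]


if __name__ == "__main__":
    # Hypothetical response for illustration only.
    example = {
        "services": [
            {"serviceName": "agent-platform", "desiredCount": 2},
            {"serviceName": "example-worker", "desiredCount": 1},
        ]
    }
    print(services_below_min_count(example))
```

An empty list means every service in the response meets the minimum.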

Layer 2: Elasticsearch — what did the agent do?

Elasticsearch holds the execution logs for every agent run. Each step in an agent workflow logs its input and output there. Logs are retained for approximately 1-2 months, so older conversations may no longer be searchable.

Use this layer when the question is “what happened in this specific conversation?”

How to trace a conversation:

Get the conversation ID from the Dashboard or customer report.
Search Elasticsearch for that conversation ID.
Walk through each step: intent classification → tool calls → response generation → escalation decision.
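
The search step above can be sketched as an Elasticsearch query body. This is a minimal illustration, not the platform's actual query: the field names `conversation_id` and `@timestamp` are assumptions about how the execution logs are mapped, and the index to search is not specified here. The sketch only builds the request body; pass it to whichever Elasticsearch client or console you normally use.

```python
import json


def build_conversation_query(conversation_id: str, size: int = 200) -> dict:
    """Build a search body that returns one conversation's steps in time order.

    Assumes (hypothetically) that each log document stores the conversation
    ID in a keyword field named `conversation_id` and its time in `@timestamp`.
    """
    return {
        "size": size,
        # Exact match on the conversation ID.
        "query": {"term": {"conversation_id": conversation_id}},
        # Oldest first, so the steps read in the order the agent ran them.
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


if __name__ == "__main__":
    print(json.dumps(build_conversation_query("conv-123"), indent=2))
```

Sorting ascending by timestamp lets you read the results as the walk-through order: intent classification, tool calls, response generation, escalation decision.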

Layer 3: Langfuse — how well is the LLM performing?

Langfuse is the prompt-level observability layer. Use it when the question is “is the model producing good outputs?” or “are we spending too much on tokens?”

What Langfuse shows	Why it matters
Prompt evaluations	Are we meeting quality benchmarks?
LLM call traces	Which model was called, with what prompt, and what it returned
Token usage & cost	Budget tracking per agent, per customer
Latency breakdown	How much time is LLM inference vs. tool calls?
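
The token-cost and latency rows above amount to simple arithmetic, sketched below under stated assumptions. The record shape (`agent`, `customer`, `input_tokens`, `output_tokens`, `cost_usd`) is hypothetical and is not Langfuse's export schema; adapt the keys to whatever your export actually contains.

```python
from collections import defaultdict


def summarize_llm_calls(calls):
    """Sum tokens and cost per (agent, customer) pair.

    `calls` is an iterable of dicts using the hypothetical keys
    agent, customer, input_tokens, output_tokens, cost_usd.
    """
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0})
    for call in calls:
        key = (call["agent"], call["customer"])
        totals[key]["tokens"] += call["input_tokens"] + call["output_tokens"]
        totals[key]["cost_usd"] += call["cost_usd"]
    return dict(totals)


def llm_share_of_latency(llm_ms, tool_ms):
    """Fraction of total step time spent in LLM inference (0.0 if no time recorded)."""
    total = llm_ms + tool_ms
    return llm_ms / total if total else 0.0
```

A share close to 1.0 suggests the model call dominates response time; a low share points at tool calls instead.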

What to look at first

When something goes wrong, work top-down:

New Relic: is the platform healthy? If NR shows elevated errors or a degraded service, that is the most likely root cause. Escalate to engineering.
Elasticsearch: if the platform is healthy, trace the specific conversation. Did the agent classify intent correctly? Did the tool call succeed? Did it escalate when it should have?
Langfuse: if the conversation trace looks structurally right but the output was wrong, check the prompt evaluation. Was the model hallucinating? Was context missing from the RAG retrieval?
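
The top-down order above can be written as a small decision function. This is only a sketch of the triage logic described here; the names `Layer` and `next_layer` are hypothetical, and the two boolean inputs stand in for judgments you make by looking at each tool.

```python
from enum import Enum


class Layer(Enum):
    NEW_RELIC = "New Relic"
    ELASTICSEARCH = "Elasticsearch"
    LANGFUSE = "Langfuse"


def next_layer(platform_healthy: bool, trace_looks_correct: bool) -> Layer:
    """Return the layer to focus on, following the top-down order."""
    if not platform_healthy:
        # Elevated errors or degraded service: stay here and escalate to engineering.
        return Layer.NEW_RELIC
    if not trace_looks_correct:
        # Platform is fine: inspect the conversation's steps.
        return Layer.ELASTICSEARCH
    # Steps look structurally right but the answer was wrong: check the prompt level.
    return Layer.LANGFUSE


if __name__ == "__main__":
    print(next_layer(platform_healthy=True, trace_looks_correct=False).value)
```

Platform health is checked first on purpose: a degraded service can make every downstream trace look wrong, so conversation-level debugging is only worthwhile once New Relic is clean.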

The Dashboard (CS view)

The AgentFleet Dashboard is the friendly layer on top. It surfaces the human-in-the-loop (HITL) queue, per-agent SLAs, latency, accuracy, escalation rate, and the alert feed. Most CS workflows start and end in the Dashboard. Dive into Elasticsearch or Langfuse only when you need to debug something the Dashboard doesn’t explain.

Sources

Slack: #team-ai — engineering discussions on observability stack
See Architecture overview for how observability fits in the platform
Agent Platform Enhancements doc (internal)

Changelog

26 May 2026: Full content from Slack engineering discussions. Three-layer stack documented.
