Observability & monitoring

At a glance

  • Three-layer observability stack: New Relic (application), Elasticsearch (execution logs), Langfuse (LLM tracing).
  • Every reasoning step, tool call, and decision an agent makes is logged.
  • The AgentFleet Dashboard is the CS-facing view; the layers below are for debugging.

Why this matters

When a customer reports “the agent gave a wrong answer” or “the agent is slow”, you need to trace what happened. The observability stack is your forensic toolkit. Know which layer to look at and you’ll resolve issues in minutes instead of hours.

The three layers

Layer 1: New Relic — is the platform healthy?

New Relic monitors the agent-platform application itself. Use it when the question is “is the service up?” or “are we seeing elevated errors?”

What to check          Where in NR                     When
Service health         APM → agent-platform            First thing during any incident
Error rate spike       APM → Error analytics           Customer reports failures
Latency degradation    APM → Distributed tracing       Agent responses feel slow
Alerts                 Alerts & AI → Open incidents    Proactive monitoring

NR alerts are configured for agent-platform health checks. ECS tasks run with desiredCount >= 2 for high availability.

Layer 2: Elasticsearch — what did the agent do?

Elasticsearch stores execution logs for every agent run: each step in an agent workflow logs its input and output there. Logs are retained for approximately 1-2 months.

Use this layer when the question is “what happened in this specific conversation?”

How to trace a conversation:

  1. Get the conversation ID from the Dashboard or customer report.
  2. Search Elasticsearch for that conversation ID.
  3. Walk through each step: intent classification → tool calls → response generation → escalation decision.
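Steps 2 and 3 can be sketched as a small Python helper. This is a minimal sketch under stated assumptions, not the platform's actual tooling: the field names `conversation_id`, `@timestamp`, and `step`, and the shape of the log records, are hypothetical and should be replaced with whatever the real index uses. The query body itself is ordinary Elasticsearch Query DSL (a `term` filter plus a timestamp sort).

```python
from typing import Any


def build_conversation_query(conversation_id: str, size: int = 200) -> dict[str, Any]:
    """Build a Query DSL body fetching all log entries for one conversation, oldest first.

    The field names `conversation_id` and `@timestamp` are assumptions.
    """
    return {
        "size": size,
        "query": {"term": {"conversation_id": conversation_id}},
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


def summarize_steps(hits: list[dict[str, Any]]) -> list[str]:
    """Turn raw search hits into one readable line per step, in time order.

    Each hit is expected to carry a `_source` dict with `@timestamp` and `step`
    (both assumed names). Timestamps are assumed to be ISO 8601 strings in the
    same format, so plain string sorting puts them in chronological order.
    """
    sources = sorted((hit["_source"] for hit in hits), key=lambda s: s["@timestamp"])
    return [f'{s["@timestamp"]} {s["step"]}' for s in sources]
```

The body returned by `build_conversation_query` can be pasted into a search request against the execution-log index, and the resulting hits passed to `summarize_steps` to read the run from intent classification through to the escalation decision.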

Layer 3: Langfuse — how well is the LLM performing?

Langfuse is the prompt-level observability layer. Use it when the question is “is the model producing good outputs?” or “are we spending too much on tokens?”

What Langfuse shows    Why it matters
Prompt evaluations     Are we meeting quality benchmarks?
LLM call traces        Which model was called, with what prompt, and what it returned
Token usage & cost     Budget tracking per agent, per customer
Latency breakdown      How much time is LLM inference vs. tool calls?
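To make the latency breakdown concrete, the sketch below computes what share of a run's time went to LLM inference versus tool calls. It is an illustration only: the `(kind, duration_ms)` span records and the `"llm"` / `"tool"` labels are hypothetical, not the Langfuse data model or API, and would need to be mapped from whatever the trace export actually contains.

```python
from collections import defaultdict


def latency_breakdown(spans: list[tuple[str, float]]) -> dict[str, float]:
    """Return each span kind's share of total duration, as a fraction of 1.

    `spans` is a list of (kind, duration_ms) pairs, a hypothetical record shape.
    An empty or zero-duration input yields an empty dict.
    """
    totals: dict[str, float] = defaultdict(float)
    for kind, duration_ms in spans:
        totals[kind] += duration_ms

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {kind: ms / grand_total for kind, ms in totals.items()}
```

For example, two LLM spans of 1200 ms and 200 ms plus one tool span of 600 ms give roughly 70% LLM inference and 30% tool calls, which suggests looking at the model call first when a response feels slow.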

What to look at first

When something goes wrong, work top-down:

  1. New Relic: is the platform healthy? If NR shows elevated errors or a degraded service, that is the most likely root cause. Escalate to engineering.
  2. Elasticsearch: if the platform is healthy, trace the specific conversation. Did the agent classify intent correctly? Did the tool call succeed? Did it escalate when it should have?
  3. Langfuse: if the conversation trace looks structurally right but the output was wrong, check the prompt evaluation. Was the model hallucinating? Was context missing from the RAG retrieval?
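The top-down order above amounts to a short decision function. The sketch below is only a way to make the ordering explicit; the two boolean inputs are hypothetical and stand for judgments a person makes after looking at New Relic and at the Elasticsearch trace, not values the platform exposes.

```python
from enum import Enum


class Layer(Enum):
    NEW_RELIC = "New Relic"
    ELASTICSEARCH = "Elasticsearch"
    LANGFUSE = "Langfuse"


def layer_to_investigate(platform_healthy: bool, trace_structurally_ok: bool) -> Layer:
    """Pick the layer where the problem most likely sits, checking top-down."""
    if not platform_healthy:
        # Elevated errors or degraded service: a platform issue, escalate to engineering.
        return Layer.NEW_RELIC
    if not trace_structurally_ok:
        # Wrong intent, failed tool call, or missed escalation shows up in the execution log.
        return Layer.ELASTICSEARCH
    # Platform and workflow look fine, so the model output itself is the suspect.
    return Layer.LANGFUSE
```

Note that platform health is checked first regardless of what the trace shows, matching the order of the list.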

The Dashboard (CS view)

The AgentFleet Dashboard is the friendly layer on top. It surfaces the HITL (human-in-the-loop) queue, per-agent SLAs, latency, accuracy, escalation rate, and the alert feed. Most CS workflows start and end in the Dashboard. Dive into Elasticsearch or Langfuse only when you need to debug something the Dashboard doesn’t explain.

Changelog

  • 26 May 2026: Full content from Slack engineering discussions. Three-layer stack documented.