Agent Monitoring

The agent monitoring system provides real-time health dashboards and performance metrics for every agent in your deployment. It surfaces anomalies, tracks execution history, and gives operators immediate visibility into agent status without digging through logs.

Health Dashboard

The main monitoring view displays all registered agents with their current status. Each agent is categorized into one of three states:

Status    Indicator  Description
Healthy   Green      Agent is responding within expected latency and error thresholds
Degraded  Amber      Agent is operational but showing elevated latency or error rates
Offline   Red        Agent has failed health checks or is unreachable

Status is determined automatically based on configurable thresholds for latency, error rate, and throughput. When an agent crosses a threshold, its status transitions and an alert is generated.
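
The threshold-based classification described above could look roughly like the following sketch. The threshold names and default values (`max_latency_p95_ms`, `max_error_rate`, `min_executions_per_hour`) are hypothetical and chosen only for illustration; the actual configuration keys and the exact rule for combining thresholds are not specified here.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    OFFLINE = "offline"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; real configuration keys and defaults may differ.
    max_latency_p95_ms: float = 3000.0
    max_error_rate: float = 0.01
    min_executions_per_hour: int = 1


def classify(health_check_ok, latency_p95_ms, error_rate,
             executions_last_hour, thresholds=Thresholds()):
    """Map raw health signals to one of the three dashboard states."""
    # A failed health check or unreachable agent is Offline regardless of metrics.
    if not health_check_ok:
        return Status.OFFLINE
    # Assumption: crossing any single threshold is enough to mark the agent Degraded.
    degraded = (
        latency_p95_ms > thresholds.max_latency_p95_ms
        or error_rate > thresholds.max_error_rate
        or executions_last_hour < thresholds.min_executions_per_hour
    )
    return Status.DEGRADED if degraded else Status.HEALTHY
```

With these illustrative defaults, an agent reporting a P95 latency of 2800 ms, an error rate of 0.002, and 342 executions in the last hour would be classified as healthy.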

Performance Metrics

Each agent tracks four core metrics in real time:

Latency

  • P50, P95, P99 response times measured per execution
  • Historical latency trends displayed as time-series charts
  • Latency spikes are flagged as anomalies when they exceed 2x the rolling average
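
As an illustration of the latency metrics above, the sketch below computes percentiles with the nearest-rank method and flags a sample as a spike when it exceeds 2x the rolling average of the samples seen before it. The window size, the class and function names, and the choice of nearest-rank percentiles are assumptions, not a description of how the monitoring system is implemented.

```python
import math
from collections import deque


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


class SpikeDetector:
    """Flags latencies above `factor` times the rolling average."""

    def __init__(self, window=100, factor=2.0):
        # Hypothetical window size: the last `window` executions.
        self.samples = deque(maxlen=window)
        self.factor = factor

    def observe(self, latency_ms):
        """Record a sample; return True if it is a spike versus prior samples."""
        is_spike = False
        if self.samples:
            rolling_avg = sum(self.samples) / len(self.samples)
            is_spike = latency_ms > self.factor * rolling_avg
        # Design choice in this sketch: spikes still enter the rolling window.
        self.samples.append(latency_ms)
        return is_spike
```

For example, after observing 1000 ms and 1200 ms (rolling average 1100 ms), a 2500 ms execution would be flagged because it exceeds 2200 ms.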

Throughput

  • Executions per minute/hour/day
  • Pipeline-level throughput aggregation
  • Capacity utilization relative to configured concurrency limits

Error Rates

  • Percentage of failed executions over a rolling window
  • Error categorization: timeout, model error, input validation, internal
  • Error rate trends with configurable alert thresholds
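
A rolling-window error rate with per-category counts can be sketched as follows. The category identifiers mirror the four categories listed above, but their exact spelling, the `ErrorWindow` class, and the window length are assumptions made for this example.

```python
from collections import Counter, deque

# Assumed identifiers for the four documented error categories.
CATEGORIES = {"timeout", "model_error", "input_validation", "internal"}


class ErrorWindow:
    """Tracks executions over a time-based rolling window."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, category or None for success)

    def record(self, timestamp, category=None):
        """Record one execution; pass a category only for failures."""
        if category is not None and category not in CATEGORIES:
            raise ValueError(f"unknown error category: {category}")
        self.events.append((timestamp, category))

    def _evict(self, now):
        # Drop executions that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def error_rate(self, now):
        """Fraction of failed executions within the window (0.0 if empty)."""
        self._evict(now)
        if not self.events:
            return 0.0
        failures = sum(1 for _, cat in self.events if cat is not None)
        return failures / len(self.events)

    def breakdown(self, now):
        """Failure counts per category within the window."""
        self._evict(now)
        return Counter(cat for _, cat in self.events if cat is not None)
```

The value returned by `error_rate` is a fraction between 0 and 1, so it can be compared directly against an alert threshold expressed the same way.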

Token Usage

  • Input and output token counts per execution
  • Average tokens per agent over time
  • Token efficiency metrics (output quality relative to token spend)

Anomaly Detection

The monitoring system applies statistical anomaly detection to agent metrics. When a metric deviates significantly from its baseline, an alert is raised in the dashboard.

Anomaly detection covers:

  • Latency spikes — sudden increases in response time
  • Error bursts — rapid increase in failure rate
  • Cost anomalies — unexpected jumps in per-agent spend
  • Throughput drops — agent processing fewer requests than expected
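
The specific statistical method is not documented here. One common way to express "deviates significantly from its baseline" is a z-score test, sketched below under that assumption. The function name and the threshold of 3 standard deviations are illustrative only.

```python
import statistics


def is_anomalous(baseline, value, z_threshold=3.0):
    """Return True if `value` lies more than `z_threshold` standard
    deviations from the mean of the `baseline` samples."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # Perfectly flat baseline: any change at all counts as a deviation.
        return value != mean
    return abs(value - mean) / stdev > z_threshold
```

The same check applies to any of the metric types above: pass recent latency, error-rate, cost, or throughput samples as `baseline` and the newest observation as `value`. Because it uses the absolute deviation, it catches both spikes and drops.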

Alerts appear in the dashboard notification center and can be configured to trigger webhooks for integration with external alerting systems.

Per-Agent Detail View

Clicking an agent in the dashboard opens a detail view with:

  • Execution history — paginated list of recent runs with status, duration, and cost
  • Input/output inspection — view the exact data passed to and returned from the agent
  • Configuration — current model, prompt version, and settings
  • Error log — filtered view of failed executions with stack traces
  • Performance charts — dedicated latency, throughput, and cost charts for the agent

API Endpoints

Agent Status

Returns the current status and health metrics for all agents.

GET /api/v1/admin/agents/status

{
  "agents": [
    {
      "agent_id": "style_analysis_v3",
      "status": "healthy",
      "latency_p50_ms": 1240,
      "latency_p95_ms": 2800,
      "error_rate": 0.002,
      "executions_last_hour": 342,
      "last_execution": "2026-02-08T14:23:11Z"
    }
  ]
}

Agent Summary

Returns aggregate metrics and summary statistics across all agents.

GET /api/v1/admin/agents/summary

{
  "total_agents": 54,
  "healthy": 51,
  "degraded": 2,
  "offline": 1,
  "total_executions_today": 12847,
  "avg_latency_ms": 1580,
  "total_cost_today": 42.17
}

Both endpoints require a valid JWT and the X-Org-ID header for tenant-scoped results.
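
A minimal client sketch using only the Python standard library is shown below. Two things are assumptions rather than documented behavior: that the JWT is sent as an `Authorization: Bearer` header, and the base URL, which is a placeholder to replace with your deployment's host. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://example.invalid"  # placeholder host, not a real deployment


def build_request(path, token, org_id, base_url=BASE_URL):
    """Build an authenticated, tenant-scoped GET request."""
    return urllib.request.Request(
        base_url + path,
        headers={
            # Assumption: the JWT is passed as a bearer token.
            "Authorization": f"Bearer {token}",
            "X-Org-ID": org_id,
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_json(request, timeout=10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def unhealthy_agents(status_payload):
    """List agent IDs whose status is anything other than "healthy"."""
    return [
        agent["agent_id"]
        for agent in status_payload["agents"]
        if agent["status"] != "healthy"
    ]
```

A typical call would be `fetch_json(build_request("/api/v1/admin/agents/status", token, org_id))`, with the result passed to `unhealthy_agents` to find agents that need attention.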