Agent Monitoring

The agent monitoring system provides real-time health dashboards and performance metrics for every agent in your deployment. It surfaces anomalies, tracks execution history, and lets operators check agent status at a glance instead of digging through logs.

Health Dashboard

The main monitoring view displays all registered agents with their current status. Each agent is categorized into one of three states:

Status	Indicator	Description
Healthy	Green	Agent is responding within expected latency and error thresholds
Degraded	Amber	Agent is operational but showing elevated latency or error rates
Offline	Red	Agent has failed health checks or is unreachable

Status is determined automatically based on configurable thresholds for latency, error rate, and throughput. When an agent crosses a threshold, its status transitions and an alert is generated.
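
As a rough illustration of threshold-based status, the sketch below classifies an agent from its latest metrics. The Thresholds fields, their default values, and the rule that an unreachable agent is offline while any crossed threshold means degraded are assumptions made for this example, not the system's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical threshold names and values, chosen for illustration only.
    max_latency_p95_ms: float = 3000.0
    max_error_rate: float = 0.05
    min_executions_per_hour: int = 1


def classify_status(
    reachable: bool,
    latency_p95_ms: float,
    error_rate: float,
    executions_last_hour: int,
    thresholds: Thresholds = Thresholds(),
) -> str:
    """Map raw metrics to one of the three dashboard states."""
    if not reachable:
        return "offline"
    crossed = (
        latency_p95_ms > thresholds.max_latency_p95_ms
        or error_rate > thresholds.max_error_rate
        or executions_last_hour < thresholds.min_executions_per_hour
    )
    return "degraded" if crossed else "healthy"
```

With these assumed defaults, an agent reporting a P95 latency of 2800 ms, an error rate of 0.002, and 342 executions in the last hour would be classified as healthy.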

Performance Metrics

The system tracks four core metrics for each agent in real time:

Latency

P50, P95, and P99 response times, computed from per-execution latency measurements
Historical latency trends displayed as time-series charts
A latency measurement is flagged as an anomalous spike when it exceeds 2x the rolling average
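
The latency figures above can be sketched in a few lines. This example assumes the nearest-rank method for percentiles and a plain arithmetic mean over a list of recent samples as the rolling average; the function names and the way the window is populated are hypothetical, and the system's own calculation may differ.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def is_latency_spike(
    latency_ms: float, recent_ms: list[float], factor: float = 2.0
) -> bool:
    """Flag a latency that exceeds `factor` times the rolling average."""
    if not recent_ms:
        return False  # No history yet, so nothing to compare against.
    return latency_ms > factor * mean(recent_ms)
```

The 2x factor is the only value taken from the text; the window length is left to the caller.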

Throughput

Executions per minute/hour/day
Pipeline-level throughput aggregation
Capacity utilization relative to configured concurrency limits

Error Rates

Percentage of failed executions over a rolling window
Error categorization: timeout, model error, input validation, internal
Error rate trends with configurable alert thresholds
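
One way to picture the rolling error rate is a fixed-size buffer of recent outcomes. The sketch below assumes a count-based window (the last N executions) rather than a time-based one, and the RollingErrorRate class is hypothetical; only the error category names come from the list above.

```python
from collections import Counter, deque
from typing import Optional


class RollingErrorRate:
    """Tracks the share of failures over the most recent `window` executions."""

    def __init__(self, window: int = 100) -> None:
        # Each entry is None for a success or an error category string.
        self._outcomes: deque = deque(maxlen=window)

    def record(self, error_category: Optional[str] = None) -> None:
        """Record one execution; pass a category such as "timeout" on failure."""
        self._outcomes.append(error_category)

    def error_rate(self) -> float:
        """Fraction of failed executions in the current window (0.0 to 1.0)."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for outcome in self._outcomes if outcome is not None)
        return failures / len(self._outcomes)

    def by_category(self) -> Counter:
        """Count failures in the current window per error category."""
        return Counter(o for o in self._outcomes if o is not None)
```

Because the deque has a maximum length, the oldest outcome drops out automatically as each new execution is recorded.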

Token Usage

Input and output token counts per execution
Average tokens per agent over time
Token efficiency metrics (output quality relative to token spend)

Anomaly Detection

The monitoring system applies statistical anomaly detection to agent metrics. When a metric deviates significantly from its baseline, an alert is raised in the dashboard.

Anomaly detection covers:

Latency spikes — sudden increases in response time
Error bursts — rapid increase in failure rate
Cost anomalies — unexpected jumps in per-agent spend
Throughput drops — agent processing fewer requests than expected

Alerts appear in the dashboard notification center and can be configured to trigger webhooks for integration with external alerting systems.
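
The text does not say which statistical method is used. As one common possibility, the sketch below flags a value whose z-score against a baseline of recent samples exceeds a threshold. The is_anomalous function, the threshold of 3.0, and the choice of z-scores are all assumptions for illustration.

```python
from statistics import mean, pstdev


def is_anomalous(
    value: float, baseline: list[float], z_threshold: float = 3.0
) -> bool:
    """Return True when `value` lies more than `z_threshold` standard
    deviations from the mean of `baseline`."""
    if len(baseline) < 2:
        return False  # Not enough history to form a baseline.
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return value != mu  # Flat baseline: any change is a deviation.
    return abs(value - mu) / sigma > z_threshold
```

The same check could be applied to latency, error rate, cost, or throughput, since it only depends on a numeric series.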

Per-Agent Detail View

Clicking into any agent opens a detail view with:

Execution history — paginated list of recent runs with status, duration, and cost
Input/output inspection — view the exact data passed to and returned from the agent
Configuration — current model, prompt version, and settings
Error log — filtered view of failed executions with stack traces
Performance charts — dedicated latency, throughput, and cost charts for the agent

API Endpoints

Agent Status

Returns the current status and health metrics for all agents.


GET /api/v1/admin/agents/status


{
  "agents": [
    {
      "agent_id": "style_analysis_v3",
      "status": "healthy",
      "latency_p50_ms": 1240,
      "latency_p95_ms": 2800,
      "error_rate": 0.002,
      "executions_last_hour": 342,
      "last_execution": "2026-02-08T14:23:11Z"
    }
  ]
}
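
A client might scan this response for agents that need attention. The sketch below relies only on the agent_id and status fields shown in the sample; the unhealthy_agents helper is hypothetical.

```python
import json


def unhealthy_agents(status_payload: str) -> list[str]:
    """Return the IDs of agents whose status is not "healthy"."""
    data = json.loads(status_payload)
    return [
        agent["agent_id"]
        for agent in data["agents"]
        if agent["status"] != "healthy"
    ]
```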

Agent Summary

Returns aggregate metrics and summary statistics across all agents.


GET /api/v1/admin/agents/summary


{
  "total_agents": 54,
  "healthy": 51,
  "degraded": 2,
  "offline": 1,
  "total_executions_today": 12847,
  "avg_latency_ms": 1580,
  "total_cost_today": 42.17
}

Both endpoints require a valid JWT and an X-Org-ID header; the header scopes results to the requesting tenant.
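
A request to either endpoint might be built as follows. This is a sketch under two assumptions that the text does not confirm: that the JWT is sent as a Bearer token in the Authorization header, and a placeholder host (example.invalid). The build_request helper is hypothetical.

```python
import urllib.request

BASE_URL = "https://example.invalid"  # Placeholder host, not from the docs.


def build_request(path: str, jwt: str, org_id: str) -> urllib.request.Request:
    """Build a GET request carrying the JWT and the tenant header."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={
            # Assumption: the JWT travels as a Bearer token.
            "Authorization": f"Bearer {jwt}",
            "X-Org-ID": org_id,
        },
        method="GET",
    )
```

Passing the result to urllib.request.urlopen would send the request; for example, build_request("/api/v1/admin/agents/summary", token, org_id) targets the summary endpoint.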