Monitoring
Multi-layer health monitoring for the agent fleet with real-time alerts, automatic cleanup of stuck resources, and a fleet-wide health dashboard.
Health Levels
Ordered by severity:
| Level | Meaning |
|---|---|
| healthy | All checks passing |
| degraded | Minor issues detected |
| unhealthy | Significant problems |
| critical | Immediate attention required |
| unknown | Unable to determine status |
Three Monitoring Layers
Docker layer — Container status, CPU/memory usage, restart count, OOM detection.
Network layer — Agent HTTP reachability with latency tracking.
Business layer — Runtime availability, context usage, error rates.
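One plausible way to fold the three layers into a single agent status is to report the most severe level any layer returned. The sketch below assumes exactly that; the source does not state how layers are combined, and the names `HealthLevel`, `parse_level`, and `overall_health` are hypothetical. The numeric ranks follow the order of the Health Levels table, so treating `unknown` as the highest rank is also an assumption.

```python
from enum import IntEnum


class HealthLevel(IntEnum):
    # Ranks follow the table order; placing UNKNOWN last is an assumption.
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2
    CRITICAL = 3
    UNKNOWN = 4


def parse_level(name: str) -> HealthLevel:
    """Map a level string such as 'degraded' to a HealthLevel."""
    try:
        return HealthLevel[name.upper()]
    except KeyError:
        # Unrecognised strings are treated as "unable to determine status".
        return HealthLevel.UNKNOWN


def overall_health(layers: dict) -> HealthLevel:
    """Return the most severe level across layers (hypothetical rule)."""
    levels = [parse_level(value) for value in layers.values()]
    # With no layer results at all, the status cannot be determined.
    return max(levels, default=HealthLevel.UNKNOWN)
```

For example, `overall_health({"docker": "healthy", "network": "degraded", "business": "healthy"})` yields `HealthLevel.DEGRADED` under this rule.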
Alert Cooldowns
Repeated alerts for the same condition are throttled to prevent notification spam.
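A minimal sketch of how such throttling might work, assuming alerts are keyed by agent and condition and suppressed for a fixed window after the last send. The class name `AlertCooldown`, the key shape, and the window length are illustrative assumptions, not documented behavior.

```python
import time


class AlertCooldown:
    """Suppress repeats of the same (agent, condition) alert within a window."""

    def __init__(self, cooldown_seconds: float, clock=time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_sent = {}         # (agent, condition) -> time of last send

    def should_send(self, agent: str, condition: str) -> bool:
        key = (agent, condition)
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            return False             # still cooling down: drop the repeat
        self._last_sent[key] = now   # record the send and restart the window
        return True
```

Keying on the condition as well as the agent means a new kind of failure on the same agent still alerts immediately.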
Fleet Health Dashboard
The fleet health dashboard is an admin-only view that summarizes the health of all agents in the system. It is accessible from the admin area of the UI.
Cleanup Service
A background service that automatically recovers stuck resources:
status='running' past its per-slot timeout is marked failed.
activity_state='started' past the configured threshold is marked failed.
failed immediately and their slots are released.
Retention Sweeps
The same cleanup service runs daily retention sweeps to keep the database lean:
| Sweep | Default | Setting |
|---|---|---|
| schedule_executions.execution_log nulled past | 30 days | execution_log_retention_days |
| Terminal schedule_executions rows deleted past | 90 days | execution_row_retention_days |
| agent_health_checks rows deleted past | 7 days | health_check_retention_days |
| audit_log rows deleted past | 365 days | AUDIT_LOG_RETENTION_DAYS (floor 365) |
Each sweep is capped at 5,000 rows per cycle, so the first post-deploy backfill spans hours, not minutes. Setting any retention value to 0 disables that sweep. A daily VACUUM at 04:30 UTC reclaims freed pages.
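The capped sweep can be sketched as a batched delete. This example assumes a SQLite database and a hypothetical `checked_at` column holding ISO-8601 UTC timestamps; only the table name, the 5,000-row cap, and the "0 disables" rule come from the text above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

BATCH_CAP = 5000  # per-cycle cap described above


def sweep_health_checks(conn: sqlite3.Connection, retention_days: int, now=None) -> int:
    """Delete at most BATCH_CAP expired rows and return how many were deleted."""
    if retention_days == 0:
        return 0  # a retention value of 0 disables the sweep
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    # The subquery limits the delete to one batch; checked_at is a hypothetical column.
    cur = conn.execute(
        "DELETE FROM agent_health_checks WHERE rowid IN ("
        " SELECT rowid FROM agent_health_checks WHERE checked_at < ? LIMIT ?)",
        (cutoff, BATCH_CAP),
    )
    conn.commit()
    return cur.rowcount
```

Because each cycle removes at most one batch, a large backlog drains over repeated cycles rather than in one long-running statement.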
Real-Time Event Reliability
Trinity uses a Redis Streams-backed event bus for all WebSocket delivery (RELIABILITY-003). This is invisible during normal operation but has operator-visible behaviour in a few edge cases.
Reconnect Replay
When a browser tab reconnects after a brief disconnect (e.g., laptop sleep, flaky network), it automatically requests missed events using a ?last-event-id= cursor tracked in memory. Events are replayed from the Redis stream, so the collaboration dashboard, activity timeline, and operator queue resume without stale state.
resync_required Events
If the cursor is too far behind (>5,000 events missed) or the stream has been trimmed past the stored cursor, the server sends a {"type": "resync_required"} message. The frontend clears the cursor and refetches authoritative state via REST. Users see a brief refresh but no data loss.
The stream retains approximately the last 10,000 events (configurable via REDIS_STREAM_MAXLEN in .env).
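The replay-or-resync decision described above can be sketched as a small pure function. To keep it short, stream positions are simplified to consecutive integers, whereas real Redis stream IDs are `<milliseconds>-<sequence>` strings. The function name `plan_reconnect` and the `"replay"` message shape are hypothetical; only `resync_required` and the 5,000-event limit come from the text.

```python
REPLAY_LIMIT = 5000  # more than this many missed events forces a resync


def plan_reconnect(cursor: int, oldest_retained: int, latest: int) -> dict:
    """Decide how to serve a reconnecting client.

    cursor          -- position of the last event the client saw
    oldest_retained -- position of the oldest event still in the stream
    latest          -- position of the newest event in the stream
    """
    missed = latest - cursor
    # If the first event the client needs (cursor + 1) was already trimmed,
    # a gap-free replay is impossible.
    trimmed_past_cursor = oldest_retained > cursor + 1
    if missed > REPLAY_LIMIT or trimmed_past_cursor:
        return {"type": "resync_required"}
    return {"type": "replay", "from": cursor + 1, "count": missed}
```

On `resync_required` the client would clear its cursor and refetch state over REST, as described above.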
Admin Stats Endpoint
For soak monitoring and diagnosing delivery issues:
GET /api/debug/event-bus-stats (admin-only)
Returns counters since last backend restart:
| Field | What to check |
|---|---|
| publisher.events_published | Total events emitted |
| dispatcher.drops_queue_full | Events dropped due to slow clients |
| dispatcher.clients_evicted | Connections closed after 3 consecutive send failures |
| dispatcher.resyncs_sent | Forced full-state refreshes sent to clients |
| watchdog.cumulative_orphaned | Orphaned executions recovered by cleanup service |
Healthy baseline: drops_queue_full + clients_evicted + resyncs_sent should be < 0.1% of events_published. Non-zero cumulative_orphaned warrants investigation.
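The baseline check can be automated with a few lines. This sketch assumes the endpoint returns nested JSON in which the dotted field names map to nested objects (for example `stats["dispatcher"]["drops_queue_full"]`); the response shape and the function name `check_event_bus` are assumptions.

```python
def check_event_bus(stats: dict, max_ratio: float = 0.001) -> list:
    """Return human-readable findings; an empty list means the baseline holds."""
    published = stats["publisher"]["events_published"]
    dispatcher = stats["dispatcher"]
    problems = (
        dispatcher["drops_queue_full"]
        + dispatcher["clients_evicted"]
        + dispatcher["resyncs_sent"]
    )
    findings = []
    if published > 0:
        ratio = problems / published
    else:
        # Nothing published yet: any problem at all is out of proportion.
        ratio = float("inf") if problems else 0.0
    if ratio >= max_ratio:  # baseline requires strictly less than 0.1%
        findings.append(f"delivery problems at {ratio:.4%} of published events")
    if stats["watchdog"]["cumulative_orphaned"] > 0:
        findings.append("orphaned executions recovered; investigate")
    return findings
```

Since the counters reset on backend restart, ratios computed shortly after a restart rest on few events and can look noisier than a long-running sample.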
MCP Tools
Agents can query monitoring data through these MCP tools:
| Tool | Description |
|---|---|
| get_fleet_health() | Fleet-wide health summary |
| get_agent_health(name) | Individual agent health |
| trigger_health_check() | Force an immediate health check |
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/monitoring/status | GET | Fleet health summary |
| /api/monitoring/agents/{name} | GET | Single-agent health detail |
| /api/monitoring/agents/{name}/check | POST | Force immediate health check |
| /api/monitoring/cleanup-status | GET | Cleanup service status (admin) |
| /api/monitoring/cleanup-trigger | POST | Force a cleanup run (admin) |
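As a rough illustration of calling these endpoints from a script, the helper below builds requests with the Python standard library. The base URL, the bearer-token authentication scheme, and the helper name `monitoring_request` are assumptions; check your deployment for the actual host and auth mechanism.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical; adjust for your deployment


def monitoring_request(path: str, method: str = "GET", token: str = "") -> urllib.request.Request:
    """Build (but do not send) a request for a monitoring endpoint."""
    headers = {"Accept": "application/json"}
    if token:
        # Bearer auth is an assumption; the source does not name the scheme.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(BASE_URL + path, method=method, headers=headers)


def force_check_request(agent_name: str, token: str = "") -> urllib.request.Request:
    """Request for POST /api/monitoring/agents/{name}/check."""
    name = urllib.parse.quote(agent_name, safe="")  # escape unsafe characters
    return monitoring_request(f"/api/monitoring/agents/{name}/check", "POST", token)
```

Passing the returned object to `urllib.request.urlopen` sends the request; building it separately keeps the URL and header logic easy to inspect.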