Trinity

Monitoring

Multi-layer health monitoring for the agent fleet with real-time alerts, automatic cleanup of stuck resources, and a fleet-wide health dashboard.

Health Levels

Ordered by severity:

| Level | Meaning |
| --- | --- |
| healthy | All checks passing |
| degraded | Minor issues detected |
| unhealthy | Significant problems |
| critical | Immediate attention required |
| unknown | Unable to determine status |
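
The levels above can be modeled as an ordered enumeration so that they can be compared. This is a minimal sketch, not the project's actual code: the names `HealthLevel` and `worst_level` are hypothetical, and ranking `unknown` above `critical` is an assumption made only because the table lists it last.

```python
from enum import IntEnum


class HealthLevel(IntEnum):
    """Health levels from the table, numbered by increasing severity.

    Assumption: `unknown` is ranked last (most severe) only because the
    table lists it last; the source does not say how it compares.
    """

    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2
    CRITICAL = 3
    UNKNOWN = 4


def worst_level(levels):
    """Return the most severe level, or UNKNOWN when there is nothing to compare."""
    return max(levels, default=HealthLevel.UNKNOWN)
```

For example, `worst_level([HealthLevel.HEALTHY, HealthLevel.DEGRADED])` returns `HealthLevel.DEGRADED`.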

Three Monitoring Layers

1. Docker layer: Container status, CPU/memory usage, restart count, OOM detection.
2. Network layer: Agent HTTP reachability with latency tracking.
3. Business layer: Runtime availability, context usage, error rates.
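
One way to picture the three layers is as independent checks whose results are reported side by side. The sketch below is illustrative only: `run_layer_checks` is a hypothetical helper, each check is assumed to return one of the health level names, and recording a failed check as `unknown` is an assumption rather than documented behaviour.

```python
from typing import Callable, Dict


def run_layer_checks(checks: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run one check per layer and collect the reported level for each.

    Assumption: a check that raises could not determine a status,
    so its layer is recorded as "unknown".
    """
    results = {}
    for layer, check in checks.items():
        try:
            results[layer] = check()
        except Exception:
            results[layer] = "unknown"
    return results
```

A caller would pass one callable per layer, for example `{"docker": check_container, "network": check_http, "business": check_runtime}`, where the three functions are placeholders for real probes.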

Alert Cooldowns

Repeated alerts for the same condition are throttled to prevent notification spam.
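
Throttling of this kind is commonly implemented by remembering when each alert was last sent. The following is a minimal sketch under assumptions: the class `AlertCooldown`, the `(agent, condition)` key, and the cooldown length are all hypothetical, since the source does not state how alerts are keyed or how long the cooldown lasts.

```python
import time


class AlertCooldown:
    """Suppress repeats of the same alert until a cooldown has elapsed."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds  # hypothetical value chosen by the caller
        self._clock = clock
        self._last_sent = {}  # (agent, condition) -> time the alert last went out

    def should_send(self, agent, condition):
        key = (agent, condition)
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down: drop the repeat
        self._last_sent[key] = now
        return True
```

The first alert for a given agent and condition goes out; identical alerts are dropped until the cooldown expires, while alerts for other agents or conditions are unaffected.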

Fleet Health Dashboard

The fleet health dashboard is an admin-only view that summarizes the health of all agents in the system. It is accessible from the admin area of the UI.

Real-time WebSocket updates push health state changes as they occur.
Individual agent health is visible in both the agent header and the Agents listing page.

Cleanup Service

A background service that automatically recovers stuck resources:

Stale executions — Any execution with status='running' for longer than 120 minutes is marked failed.
Stale activities — Any activity with activity_state='started' for longer than 120 minutes is marked failed.
Stale Redis slots — Orphaned slot reservations are released.
Run frequency — Every 5 minutes, plus a one-shot sweep on backend restart.
Startup recovery — Orphaned executions (container down, not in process registry) are marked failed immediately and their slots are released.
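
The stale-record rules above amount to a simple age filter. This sketch assumes records are dictionaries with a timezone-aware `started_at` timestamp; that field name and the helper `find_stale` are hypothetical, while the 120-minute threshold and the `status` and `activity_state` values come from the text.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=120)  # threshold stated in the text


def find_stale(records, now, state_field, active_value):
    """Return records still in `active_value` that started more than 120 minutes ago.

    Assumption: each record carries a `started_at` datetime (hypothetical field).
    """
    cutoff = now - STALE_AFTER
    return [
        r for r in records
        if r[state_field] == active_value and r["started_at"] < cutoff
    ]


# Usage: the same filter covers both rules.
now = datetime.now(timezone.utc)
executions = [{"id": 1, "status": "running", "started_at": now - timedelta(minutes=150)}]
stale_executions = find_stale(executions, now, "status", "running")
stale_activities = find_stale([], now, "activity_state", "started")
```

A real cleanup run would then mark each returned record as failed and release any slot it holds.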

MCP Tools

Agents can query monitoring data through these MCP tools:

| Tool | Description |
| --- | --- |
| get_fleet_health() | Fleet-wide health summary |
| get_agent_health(name) | Individual agent health |
| trigger_health_check() | Force an immediate health check |
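
MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a request body; the helper `tool_call` is hypothetical, and passing the agent as an argument called `name` is an assumption based on the signature in the table. How the request is transported to the server is not covered here.

```python
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id


def tool_call(tool, **arguments):
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# "research-agent" is a placeholder agent name.
fleet_request = tool_call("get_fleet_health")
agent_request = tool_call("get_agent_health", name="research-agent")
```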

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/monitoring/fleet-health | GET | Fleet health summary |
| /api/monitoring/cleanup-status | GET | Cleanup service status (admin) |
| /api/monitoring/cleanup-trigger | POST | Force a cleanup run (admin) |
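
As a rough illustration of calling these endpoints, the sketch below builds requests with Python's standard library. The base URL and the bearer-token header are assumptions, since the source does not describe the host or the authentication scheme, and `monitoring_request` is a hypothetical helper.

```python
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical; replace with the real backend address


def monitoring_request(path, method="GET", token=None):
    """Build (but do not send) a request for a monitoring endpoint.

    Assumption: admin endpoints accept a bearer token; the source only
    says they are admin-only.
    """
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return Request(f"{BASE_URL}{path}", method=method, headers=headers)


fleet = monitoring_request("/api/monitoring/fleet-health")
trigger = monitoring_request("/api/monitoring/cleanup-trigger", method="POST", token="ADMIN_TOKEN")
```

Passing either object to `urllib.request.urlopen` would send the request.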