Monitoring
Six-probe health check, resource thresholds, log viewing, fleet health API, and recovery patterns for a running Trinity instance.
When to Check
- After every upgrade or restart
- When an agent stops responding
- When the platform feels slow or unresponsive
- As a daily practice on production instances
Six-Probe Health Check
| Probe | Command | Expected |
|---|---|---|
| Backend | curl -s http://localhost:8000/health | {"status":"healthy",...} |
| Scheduler | curl -s http://localhost:8001/health | {"status":"healthy","active_schedules":N} |
| Frontend | curl -s -o /dev/null -w '%{http_code}' http://localhost | 200 |
| Redis | docker exec trinity-redis redis-cli ping | PONG |
| MCP Server | curl -s http://localhost:8080/health | HTTP 200 |
| Vector | docker exec trinity-vector wget -q -O - http://localhost:8686/health | Non-empty response |
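The six probes in the table can also be wrapped in a small pass/fail script. The following is a minimal sketch, not part of Trinity itself: the helper names `probe` and `run_all_probes` are hypothetical, and the MCP probe is changed to print the HTTP status code so that "HTTP 200" can be checked as text. The health responses are matched loosely on the word `healthy`, on the assumption that the JSON looks like the examples in the table.

```shell
#!/bin/sh
# probe NAME EXPECTED COMMAND...
# Runs COMMAND and passes when it exits 0, prints something, and the output
# contains EXPECTED. An empty EXPECTED only requires non-empty output.
probe() {
    name=$1; expected=$2; shift 2
    if out=$("$@" 2>/dev/null) && [ -n "$out" ]; then
        case $out in
            *"$expected"*) echo "PASS $name"; return 0 ;;
        esac
    fi
    echo "FAIL $name"
    return 1
}

# Runs the six probes from the table and returns non-zero if any failed.
run_all_probes() {
    failed=0
    probe backend   healthy curl -s http://localhost:8000/health || failed=$((failed + 1))
    probe scheduler healthy curl -s http://localhost:8001/health || failed=$((failed + 1))
    probe frontend  200 curl -s -o /dev/null -w '%{http_code}' http://localhost || failed=$((failed + 1))
    probe redis     PONG docker exec trinity-redis redis-cli ping || failed=$((failed + 1))
    # Assumption: status code check instead of the plain body request in the table
    probe mcp       200 curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health || failed=$((failed + 1))
    probe vector    "" docker exec trinity-vector wget -q -O - http://localhost:8686/health || failed=$((failed + 1))
    echo "$failed probe(s) failed"
    [ "$failed" -eq 0 ]
}
```

Calling `run_all_probes` on the host prints one PASS or FAIL line per probe, and its exit status can drive a cron job or alert.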
Run as a block:
# 1. Backend
curl -s http://localhost:8000/health
# 2. Scheduler
curl -s http://localhost:8001/health
# 3. Frontend (HTTP 200)
curl -s -o /dev/null -w '%{http_code}' http://localhost
# 4. Redis
docker exec trinity-redis redis-cli ping
# 5. MCP Server
curl -s http://localhost:8080/health
# 6. Vector (log aggregation)
docker exec trinity-vector wget -q -O - http://localhost:8686/health
Resource Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Backend /health | — | not 200 | Restart trinity-backend |
| Scheduler /health | — | not 200 | Restart trinity-scheduler |
| Agent context usage | >75% | >90% | Reset agent context or restart agent container |
| Host CPU | >80% | >95% | Investigate runaway processes |
| Host memory | >85% | >95% | Check container memory limits |
| Disk free | <20% | <5% | Prune Docker, archive logs |
| Error rate (per hour) | >10 | >50 | Inspect platform.json log |
| Container restarts | any | repeated | docker logs <container> |
| trinity.db size | >1 GB | >5 GB | Archive old data |
| Vector log size | >5 GB | >10 GB | Trigger archival rotation |
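The warning and critical columns above can be applied mechanically once a metric has been collected. A minimal sketch, assuming integer values; the function name `classify` and its argument order are hypothetical. Metrics such as CPU, memory, and context usage alarm when they rise above a limit, while disk free alarms when it falls below one, so the direction is a parameter.

```shell
#!/bin/sh
# classify VALUE WARN CRIT [DIRECTION]
# Prints OK, WARNING, or CRITICAL for an integer VALUE.
# DIRECTION "above" (default): alarm when VALUE exceeds a limit.
# DIRECTION "below": alarm when VALUE drops under a limit.
classify() {
    value=$1; warn=$2; crit=$3; direction=${4:-above}
    if [ "$direction" = below ]; then
        if   [ "$value" -lt "$crit" ]; then echo CRITICAL
        elif [ "$value" -lt "$warn" ]; then echo WARNING
        else echo OK
        fi
    else
        if   [ "$value" -gt "$crit" ]; then echo CRITICAL
        elif [ "$value" -gt "$warn" ]; then echo WARNING
        else echo OK
        fi
    fi
}

classify 82 80 95        # host CPU at 82%: prints WARNING
classify 12 20 5 below   # disk free at 12%: prints WARNING
```

The comparisons are strict, matching the `>` and `<` signs in the table, so a value exactly on a limit is still reported as the lower severity.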
Check disk and Docker space:
df -h /
docker system df
Check trinity.db size:
# Development (named volume)
docker run --rm -v trinity_trinity-data:/data alpine ls -lh /data/trinity.db
# Production (bind mount)
ls -lh /srv/trinity-data/trinity.db
Container Status
# All platform services
docker compose ps
# Agent containers only
docker ps --filter "label=trinity.platform=agent"
# Look for unexpected restart counts
docker ps --format "table {{.Names}} {{.Status}} {{.RunningFor}}"
A Restarting status or restart count next to Up indicates a crash loop.
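Scanning the status column by eye does not scale to many agents. A minimal sketch of a filter that lists only the containers currently in a Restarting state; the function name `flag_crash_loops` is hypothetical, and it assumes each input line is a container name followed by the status text, as produced by the format string in the usage comment.

```shell
#!/bin/sh
# flag_crash_loops: reads "NAME STATUS..." lines on stdin and prints the
# names of containers whose status begins with the word "Restarting".
flag_crash_loops() {
    awk '$2 == "Restarting" { print $1 }'
}

# Usage against a live host:
#   docker ps -a --format '{{.Names}} {{.Status}}' | flag_crash_loops
```

For a single container, `docker inspect --format '{{.RestartCount}}' <container>` reports how many times Docker has restarted it.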
Viewing Logs
Structured logs (via Vector)
# Platform logs (backend, scheduler, MCP server)
docker exec trinity-vector sh -c "tail -50 /data/logs/platform.json" | jq .
# Agent logs
docker exec trinity-vector sh -c "tail -50 /data/logs/agents.json" | jq .
# Filter for errors
docker exec trinity-vector sh -c "cat /data/logs/platform.json" | jq 'select(.level == "ERROR")'
Container logs (Docker directly)
docker compose logs -f backend
docker compose logs -f frontend
docker compose logs -f scheduler
docker logs trinity-backend --tail 100
Fleet Health API
The fleet health endpoint returns per-agent health data (admin-only):
TOKEN=$(curl -s -X POST http://localhost:8000/api/token \
-d 'username=admin&password=your-admin-password' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
curl -s -H "Authorization: Bearer $TOKEN" \
http://localhost:8000/api/ops/fleet/health | jq .
Recovery Patterns
Backend not responding
docker compose restart backend
# Wait ~15 seconds
curl -s http://localhost:8000/health
Scheduler not running schedules
curl -s http://localhost:8001/health
docker compose restart scheduler
Agent network not found
This happens when docker compose down was used instead of docker compose stop: down removes the trinity-agent-network along with the containers, while stop leaves both in place.
# Recreate missing networks while leaving running containers intact
docker compose up -d
# or for production:
docker compose -f docker-compose.prod.yml up -d
Agent context >90%
Reset context via the web UI: navigate to the agent, open the Session or Chat tab, and use the reset/close option. Or restart the agent container directly:
docker restart <agent-container-name>
Database locked (SQLITE_BUSY in backend logs)
Check for duplicate backend processes (should be exactly one):
docker ps | grep trinity-backend
MCP clients disconnected after restart
JWT tokens are invalidated when the backend restarts. Users need to log in again. Claude Code MCP clients need to reconnect — run /mcp in your Claude Code session or restart the client.
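Scripts that hold a token across a backend restart presumably need the same treatment as interactive users. A minimal sketch that waits for the backend to report healthy and then requests a fresh token; the function names `wait_until` and `fresh_token` are hypothetical, the token request is the one from the Fleet Health API section, and the 30-second timeout in the usage comment is an arbitrary choice.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS COMMAND...
# Retries COMMAND about once per second until it succeeds.
# Returns 1 if it has not succeeded after roughly TIMEOUT_SECONDS.
wait_until() {
    remaining=$1; shift
    while ! "$@" >/dev/null 2>&1; do
        remaining=$((remaining - 1))
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
    done
}

# fresh_token: prints a new access token for the admin user.
# Replace your-admin-password with the real password, as in the
# Fleet Health API section.
fresh_token() {
    curl -s -X POST http://localhost:8000/api/token \
        -d 'username=admin&password=your-admin-password' \
        | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])"
}

# Usage after a backend restart:
#   wait_until 30 curl -sf http://localhost:8000/health && TOKEN=$(fresh_token)
```

The `-f` flag makes curl exit non-zero on HTTP error responses, so the loop keeps waiting while the backend is up but not yet healthy.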
Disk full — Docker cleanup
# Remove stopped containers, unused networks, dangling images, and build cache
# (this also deletes any stopped agent containers)
docker system prune -f
# Remove dangling images only
docker image prune -f
# Check size recovered
docker system df
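The cleanup can be tied to the disk-free warning threshold from the Resource Thresholds table. A minimal sketch, assuming POSIX `df -P` output for the root filesystem; the function names `free_pct` and `prune_if_low` are hypothetical, and the 20% default mirrors the warning level in the table.

```shell
#!/bin/sh
# free_pct: reads `df -P` output on stdin and prints the free percentage of
# the first filesystem listed, computed as 100 minus the Capacity column.
free_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# prune_if_low [THRESHOLD]
# Runs the Docker cleanup only when free space on / is below THRESHOLD percent.
prune_if_low() {
    threshold=${1:-20}
    pct=$(df -P / | free_pct)
    if [ "$pct" -lt "$threshold" ]; then
        echo "Disk free ${pct}% is below ${threshold}%, pruning"
        docker system prune -f
    else
        echo "Disk free ${pct}%, no cleanup needed"
    fi
}
```

Nothing runs until `prune_if_low` is called, so the block can be sourced safely before deciding whether to invoke it.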