Trinity

Monitoring

Six-probe health check, resource thresholds, log viewing, fleet health API, and recovery patterns for a running Trinity instance.

When to Check

  • After every upgrade or restart
  • When an agent stops responding
  • When the platform feels slow or unresponsive
  • As a daily practice on production instances

Six-Probe Health Check

| Probe | Command | Expected |
|---|---|---|
| Backend | curl -s http://localhost:8000/health | {"status":"healthy",...} |
| Scheduler | curl -s http://localhost:8001/health | {"status":"healthy","active_schedules":N} |
| Frontend | curl -s -o /dev/null -w '%{http_code}' http://localhost | 200 |
| Redis | docker exec trinity-redis redis-cli ping | PONG |
| MCP Server | curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health | 200 |
| Vector | docker exec trinity-vector wget -q -O - http://localhost:8686/health | Non-empty response |

Run as a block:

# 1. Backend
curl -s http://localhost:8000/health

# 2. Scheduler
curl -s http://localhost:8001/health

# 3. Frontend (HTTP 200)
curl -s -o /dev/null -w '%{http_code}' http://localhost

# 4. Redis
docker exec trinity-redis redis-cli ping

# 5. MCP Server (HTTP 200)
curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health

# 6. Vector (log aggregation)
docker exec trinity-vector wget -q -O - http://localhost:8686/health
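For unattended runs (cron, a CI job), the six probes can be wrapped so they print PASS/FAIL and return a single exit status. This is a POSIX sh sketch, not part of Trinity; the function names probe and run_all_probes are hypothetical. It uses a plain substring match on each command's output rather than a JSON parse, and reads the MCP server's status code with -w '%{http_code}' instead of printing the body.

```shell
# probe NAME EXPECTED CMD [ARGS...]
# Runs CMD and prints PASS or FAIL for NAME. The probe passes when CMD's output
# contains EXPECTED as a substring; an empty EXPECTED only requires non-empty output.
probe() {
  name=$1
  expected=$2
  shift 2
  out=$("$@" 2>/dev/null)
  if [ -z "$expected" ]; then
    [ -n "$out" ] && ok=1 || ok=0
  else
    case $out in
      *"$expected"*) ok=1 ;;
      *) ok=0 ;;
    esac
  fi
  if [ "$ok" -eq 1 ]; then
    printf 'PASS  %s\n' "$name"
    return 0
  fi
  printf 'FAIL  %s (output: %s)\n' "$name" "$out"
  return 1
}

# run_all_probes: the six checks from the table; the exit status is the
# number of failed probes (0 means every probe passed).
# '"healthy"' (with quotes) does not match the string "unhealthy".
run_all_probes() {
  failures=0
  probe Backend '"healthy"' curl -s http://localhost:8000/health || failures=$((failures + 1))
  probe Scheduler '"healthy"' curl -s http://localhost:8001/health || failures=$((failures + 1))
  probe Frontend 200 curl -s -o /dev/null -w '%{http_code}' http://localhost || failures=$((failures + 1))
  probe Redis PONG docker exec trinity-redis redis-cli ping || failures=$((failures + 1))
  probe "MCP Server" 200 curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health || failures=$((failures + 1))
  probe Vector "" docker exec trinity-vector wget -q -O - http://localhost:8686/health || failures=$((failures + 1))
  return "$failures"
}

# Usage: run_all_probes; echo "failed probes: $?"
```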

Resource Thresholds

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Backend /health | - | not 200 | Restart trinity-backend |
| Scheduler /health | - | not 200 | Restart trinity-scheduler |
| Agent context usage | >75% | >90% | Reset agent context or restart agent container |
| Host CPU | >80% | >95% | Investigate runaway processes |
| Host memory | >85% | >95% | Check container memory limits |
| Disk free | <20% | <5% | Prune Docker, archive logs |
| Error rate (per hour) | >10 | >50 | Inspect platform.json log |
| Container restarts | any | repeated | docker logs <container> |
| trinity.db size | >1 GB | >5 GB | Archive old data |
| Vector log size | >5 GB | >10 GB | Trigger archival rotation |
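The thresholds are simple to apply mechanically once you have a number. A minimal sketch, assuming integer values you have already collected yourself (percentages, counts, or byte sizes); classify_above and classify_below are hypothetical helper names.

```shell
# classify_above VALUE WARN CRIT: for metrics where higher is worse
# (context usage, CPU, memory, error rate, file sizes). Integers only.
classify_above() {
  if [ "$1" -gt "$3" ]; then
    echo CRITICAL
  elif [ "$1" -gt "$2" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# classify_below VALUE WARN CRIT: for metrics where lower is worse (disk free %).
classify_below() {
  if [ "$1" -lt "$3" ]; then
    echo CRITICAL
  elif [ "$1" -lt "$2" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Usage:
#   classify_above 82 80 95   # host CPU at 82% -> WARNING
#   classify_below 4 20 5     # 4% disk free -> CRITICAL
```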

Check disk and Docker space:

df -h /
docker system df
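To get free space as a number for the Disk free row, parse df output. A sketch assuming POSIX df -P output, which keeps each filesystem on one line with the usage percentage in the fifth column; disk_free_pct is a hypothetical name, and it reports 100 minus df's Capacity figure.

```shell
# disk_free_pct: reads `df -P` output on stdin and prints the free percentage
# of the first filesystem listed (100 minus the Capacity column).
disk_free_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Usage:
#   df -P / | disk_free_pct   # a filesystem at 80% capacity prints 20
```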

Check trinity.db size:

# Development (named volume)
docker run --rm -v trinity_trinity-data:/data alpine ls -lh /data/trinity.db

# Production (bind mount)
ls -lh /srv/trinity-data/trinity.db
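The trinity.db size check can be scripted against the 1 GB and 5 GB thresholds. A minimal sketch; size_status is a hypothetical helper, and reading "1 GB" as 1024^3 bytes is an assumption.

```shell
# size_status FILE WARN_BYTES CRIT_BYTES: prints OK, WARNING or CRITICAL
# depending on FILE's size in bytes. Returns 2 if FILE cannot be read.
size_status() {
  size=$(wc -c < "$1") || return 2
  size=$((size))  # normalise: some wc implementations pad the number with spaces
  if [ "$size" -gt "$3" ]; then
    echo CRITICAL
  elif [ "$size" -gt "$2" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Assumption: thresholds interpreted as binary gigabytes.
GB=1073741824

# Usage (production bind mount):
#   size_status /srv/trinity-data/trinity.db "$GB" "$((5 * GB))"
```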

Container Status

# All platform services
docker compose ps

# Agent containers only
docker ps --filter "label=trinity.platform=agent"

# Look for unexpected restart counts
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.RunningFor}}"

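A Restarting status, or an Up time much shorter than the RunningFor value (time since the container was created), indicates a crash loop. For an exact count, run docker inspect --format '{{.RestartCount}}' <container>.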
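To list crash-looping containers instead of scanning the table by eye, filter on the Status column. A sketch; restarting_containers is a hypothetical name, and it only catches containers that are in the Restarting state at the moment docker ps runs.

```shell
# restarting_containers: reads "NAME<TAB>STATUS" lines on stdin and prints the
# names of containers whose status starts with "Restarting".
restarting_containers() {
  awk -F '\t' '$2 ~ /^Restarting/ { print $1 }'
}

# Usage:
#   docker ps --format '{{.Names}}\t{{.Status}}' | restarting_containers
```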

Viewing Logs

Structured logs (via Vector)

# Platform logs (backend, scheduler, MCP server)
docker exec trinity-vector sh -c "tail -50 /data/logs/platform.json" | jq .

# Agent logs
docker exec trinity-vector sh -c "tail -50 /data/logs/agents.json" | jq .

# Filter for errors
docker exec trinity-vector sh -c "cat /data/logs/platform.json" | jq 'select(.level == "ERROR")'
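For the Error rate row you need a count rather than the records themselves. A rough sketch that avoids parsing JSON: it assumes one record per line (as the tail | jq pipelines above imply) and the level key spelled exactly as in the jq filter. count_errors is a hypothetical name; for exact results keep using jq. Turning the count into a true per-hour rate would also need the records' timestamps, which this sketch ignores.

```shell
# count_errors: reads newline-delimited JSON records on stdin and prints how
# many contain "level":"ERROR" (whitespace around the colon is tolerated).
# `|| true` keeps the exit status 0 when grep finds no match and prints 0.
count_errors() {
  grep -c '"level"[[:space:]]*:[[:space:]]*"ERROR"' || true
}

# Usage: errors among the last 1000 platform records
#   docker exec trinity-vector tail -n 1000 /data/logs/platform.json | count_errors
```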

Container logs (Docker directly)

docker compose logs -f backend
docker compose logs -f frontend
docker compose logs -f scheduler
docker logs trinity-backend --tail 100

Fleet Health API

The fleet health endpoint returns per-agent health data (admin-only):

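# Replace your-admin-password; --data-urlencode keeps characters such as & and = intact
TOKEN=$(curl -s -X POST http://localhost:8000/api/token \
  --data-urlencode 'username=admin' \
  --data-urlencode 'password=your-admin-password' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")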

curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/ops/fleet/health | jq .
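The one-liner above ends in a Python traceback when the login is rejected, because the response then has no access_token field. A slightly more defensive sketch; extract_token, fleet_health, and the TRINITY_ADMIN_PASSWORD variable are hypothetical names, and it uses only the two endpoints shown above.

```shell
# extract_token: reads the /api/token JSON response on stdin and prints
# access_token; exits 1 with a message on stderr if the field is missing
# or the body is not JSON.
extract_token() {
  python3 -c '
import json, sys
try:
    print(json.load(sys.stdin)["access_token"])
except (KeyError, TypeError, ValueError):
    sys.exit("login failed: no access_token in response")
'
}

# fleet_health TOKEN: calls the admin-only fleet health endpoint;
# -f makes curl return non-zero on HTTP errors such as 401 or 403.
fleet_health() {
  curl -s -f -H "Authorization: Bearer $1" http://localhost:8000/api/ops/fleet/health
}

# Usage (TRINITY_ADMIN_PASSWORD is a variable you export yourself):
#   TOKEN=$(curl -s http://localhost:8000/api/token \
#     --data-urlencode username=admin \
#     --data-urlencode "password=$TRINITY_ADMIN_PASSWORD" | extract_token) &&
#   fleet_health "$TOKEN" | jq .
```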

Recovery Patterns

Backend not responding

docker compose restart backend
# Wait ~15 seconds
curl -s http://localhost:8000/health
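Instead of a fixed wait, you can poll until the health check succeeds. A sketch; wait_until is a hypothetical helper that retries any command once per second until it succeeds or a timeout expires.

```shell
# wait_until LIMIT_SECONDS CMD [ARGS...]: re-run CMD once per second until it
# succeeds (returns 0) or LIMIT_SECONDS have passed (returns 1).
wait_until() {
  limit=$1
  shift
  elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Usage: give the backend up to 60 s; curl -f fails on HTTP error codes
#   docker compose restart backend
#   wait_until 60 curl -sf http://localhost:8000/health && echo "backend is up"
```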

Scheduler not running schedules

curl -s http://localhost:8001/health
docker compose restart scheduler
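The scheduler's /health response already reports how many schedules it has loaded, so it is worth reading before restarting. A sketch that prints the two fields shown in the probe table; scheduler_summary is a hypothetical name. Compare the count with the number of schedules you expect to be active.

```shell
# scheduler_summary: reads the scheduler /health JSON on stdin and prints
# "<status> <active_schedules>"; a missing field prints as None.
scheduler_summary() {
  python3 -c '
import json, sys
data = json.load(sys.stdin)
print(data.get("status"), data.get("active_schedules"))
'
}

# Usage:
#   curl -s http://localhost:8001/health | scheduler_summary
```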

Agent network not found

This happens when docker compose down is used instead of docker compose stop: down removes the networks defined in the Compose file, including trinity-agent-network, while stop leaves them in place.

# Recreate missing networks while leaving running containers intact
docker compose up -d
# or for production:
docker compose -f docker-compose.prod.yml up -d

Agent context >90%

Reset context via the web UI: navigate to the agent, open the Session or Chat tab, and use the reset/close option. Or restart the agent container directly:

docker restart <agent-container-name>

Database locked (SQLITE_BUSY in backend logs)

Check for duplicate backend processes (should be exactly one):

docker ps | grep trinity-backend
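To turn this into a pass/fail check, count the matching container names and compare the result with one. A sketch; count_backends is a hypothetical name, and it keeps the same substring match as the grep above.

```shell
# count_backends: reads container names (one per line) on stdin and prints how
# many contain "trinity-backend". `|| true` keeps the status 0 when there are none.
count_backends() {
  grep -c 'trinity-backend' || true
}

# Usage:
#   n=$(docker ps --format '{{.Names}}' | count_backends)
#   [ "$n" -eq 1 ] || echo "expected 1 backend container, found $n"
```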

MCP clients disconnected after restart

JWT tokens are invalidated when the backend restarts. Users need to log in again. Claude Code MCP clients need to reconnect — run /mcp in your Claude Code session or restart the client.

Disk full — Docker cleanup

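# Remove stopped containers, unused networks, dangling images and build cache.
# Caution: this also deletes stopped agent containers; check docker ps -a first.
docker system prune -f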

# Remove dangling images only
docker image prune -f

# Check size recovered
docker system df