Monitoring

Six-probe health check, resource thresholds, log viewing, fleet health API, and recovery patterns for a running Trinity instance.

When to Check

• After every upgrade or restart
• When an agent stops responding
• When the platform feels slow or unresponsive
• As a daily practice on production instances

Six-Probe Health Check

Probe	Command	Expected
Backend	curl -s http://localhost:8000/health	{"status":"healthy",...}
Scheduler	curl -s http://localhost:8001/health	{"status":"healthy","active_schedules":N}
Frontend	curl -s -o /dev/null -w '%{http_code}' http://localhost	200
Redis	docker exec trinity-redis redis-cli ping	PONG
MCP Server	curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health	200
Vector	docker exec trinity-vector wget -q -O - http://localhost:8686/health	Non-empty response

Run as a block:

# 1. Backend
curl -s http://localhost:8000/health

# 2. Scheduler
curl -s http://localhost:8001/health

# 3. Frontend (HTTP 200)
curl -s -o /dev/null -w '%{http_code}' http://localhost

# 4. Redis
docker exec trinity-redis redis-cli ping

# 5. MCP Server (HTTP 200)
curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health

# 6. Vector (log aggregation)
docker exec trinity-vector wget -q -O - http://localhost:8686/health
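
The six commands above can also be wrapped in a small pass/fail runner. The following is a minimal sketch, not part of Trinity: the helper names probe and run_all_probes are hypothetical, and it assumes the backend and scheduler responses contain the word healthy, as the table suggests.

```shell
#!/bin/sh
# Hypothetical helper: run a command and look for an expected substring in its output.
# Usage: probe NAME EXPECTED_SUBSTRING COMMAND [ARGS...]
# An empty EXPECTED_SUBSTRING only requires non-empty output.
probe() {
  probe_name=$1
  probe_expect=$2
  shift 2
  # A non-zero exit status (e.g. connection refused) counts as a failure.
  probe_out=$("$@" 2>/dev/null) || probe_out=
  if [ -n "$probe_out" ]; then
    case $probe_out in
      *"$probe_expect"*) printf 'PASS  %s\n' "$probe_name"; return 0 ;;
    esac
  fi
  printf 'FAIL  %s\n' "$probe_name"
  return 1
}

# Run the six probes from the table and report how many failed.
run_all_probes() {
  failed=0
  probe Backend healthy curl -s http://localhost:8000/health || failed=$((failed + 1))
  probe Scheduler healthy curl -s http://localhost:8001/health || failed=$((failed + 1))
  probe Frontend 200 curl -s -o /dev/null -w '%{http_code}' http://localhost || failed=$((failed + 1))
  probe Redis PONG docker exec trinity-redis redis-cli ping || failed=$((failed + 1))
  probe "MCP Server" 200 curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/health || failed=$((failed + 1))
  probe Vector "" docker exec trinity-vector wget -q -O - http://localhost:8686/health || failed=$((failed + 1))
  printf '%s probe(s) failed\n' "$failed"
  [ "$failed" -eq 0 ]
}
```

Calling run_all_probes prints one line per probe and returns non-zero if any probe failed, so it can be used from cron or a post-upgrade script.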

Resource Thresholds

Metric	Warning	Critical	Action
Backend /health	—	not 200	Restart trinity-backend
Scheduler /health	—	not 200	Restart trinity-scheduler
Agent context usage	>75%	>90%	Reset agent context or restart agent container
Host CPU	>80%	>95%	Investigate runaway processes
Host memory	>85%	>95%	Check container memory limits
Disk free	<20%	<5%	Prune Docker, archive logs
Error rate (per hour)	>10	>50	Inspect platform.json log
Container restarts	any	repeated	docker logs <container>
trinity.db size	>1 GB	>5 GB	Archive old data
Vector log size	>5 GB	>10 GB	Trigger archival rotation
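
The warning and critical columns can be applied mechanically. The sketch below is illustrative only: threshold_level and root_free_pct are hypothetical helper names, values are assumed to be integers, and the comparisons are strict (a value equal to a threshold does not trigger it), matching the > and < signs in the table.

```shell
# Hypothetical helper: classify an integer metric against warning/critical thresholds.
# Usage: threshold_level VALUE WARN CRIT [above|below]
#   above (default): higher is worse, e.g. CPU %, memory %, context usage %
#   below:           lower is worse, e.g. disk free %
threshold_level() {
  tl_value=$1
  tl_warn=$2
  tl_crit=$3
  tl_dir=${4:-above}
  if [ "$tl_dir" = below ]; then
    if [ "$tl_value" -lt "$tl_crit" ]; then echo critical
    elif [ "$tl_value" -lt "$tl_warn" ]; then echo warning
    else echo ok
    fi
  else
    if [ "$tl_value" -gt "$tl_crit" ]; then echo critical
    elif [ "$tl_value" -gt "$tl_warn" ]; then echo warning
    else echo ok
    fi
  fi
}

# Free space on / as an integer percentage.
# In `df -P` output, column 5 ("Capacity") is the used share, so free = 100 - used.
root_free_pct() {
  df -P / | awk 'NR == 2 { sub("%", "", $5); print 100 - $5 }'
}
```

For example, threshold_level "$(root_free_pct)" 20 5 below applies the Disk free row, and threshold_level 82 80 95 applies the Host CPU row to a reading of 82%.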

Check disk and Docker space:

df -h /
docker system df

Check trinity.db size:

# Development (named volume)
docker run --rm -v trinity_trinity-data:/data alpine ls -lh /data/trinity.db

# Production (bind mount)
ls -lh /srv/trinity-data/trinity.db

Container Status

# All platform services
docker compose ps

# Agent containers only
docker ps --filter "label=trinity.platform=agent"

# Look for unexpected restarts in the Status column
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.RunningFor}}"

A Restarting status, or an Up duration that keeps resetting between checks, indicates a crash loop.
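
docker ps does not print a restart count as such; docker inspect exposes it as the RestartCount field. The sketch below lists that count per container and filters for non-zero values. The function names list_restart_counts and flag_restarts are hypothetical, and xargs -r is assumed to be available (GNU and BusyBox both support it).

```shell
# Print "name restart_count" for every container, running or stopped.
# Names from `docker inspect` carry a leading slash, e.g. /trinity-backend.
list_restart_counts() {
  docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}'
}

# Hypothetical filter: keep only lines whose count exceeds a limit (default 0).
# Reads "name count" lines on stdin.
flag_restarts() {
  awk -v limit="${1:-0}" '($2 + 0) > (limit + 0) { print }'
}
```

Running list_restart_counts | flag_restarts prints only the containers Docker has restarted at least once; pass a number, such as flag_restarts 5, to ignore occasional restarts.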

Viewing Logs

Structured logs (via Vector)

# Platform logs (backend, scheduler, MCP server)
docker exec trinity-vector sh -c "tail -50 /data/logs/platform.json" | jq .

# Agent logs
docker exec trinity-vector sh -c "tail -50 /data/logs/agents.json" | jq .

# Filter for errors
docker exec trinity-vector sh -c "cat /data/logs/platform.json" | jq 'select(.level == "ERROR")'

Container logs (Docker directly)

docker compose logs -f backend
docker compose logs -f frontend
docker compose logs -f scheduler
docker logs trinity-backend --tail 100

Fleet Health API

The fleet health endpoint returns per-agent health data (admin-only):

TOKEN=$(curl -s -X POST http://localhost:8000/api/token \
  -d 'username=admin&password=your-admin-password' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/ops/fleet/health | jq .
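
If the login fails, the one-liner above ends in a Python traceback rather than a clear message. The following sketch separates token extraction from the request so a failed login is reported explicitly. The names extract_token and fleet_health, and the TRINITY_ADMIN_PASSWORD variable, are assumptions made for illustration; only the two endpoints and the access_token field come from the commands above.

```shell
# Hypothetical helper: read a JSON login response on stdin and print access_token.
# Prints nothing and returns non-zero when the field is missing or the body is not JSON.
extract_token() {
  python3 -c '
import json, sys
try:
    token = json.load(sys.stdin)["access_token"]
except (ValueError, KeyError, TypeError):
    sys.exit(1)
print(token)
'
}

# Log in as admin and query fleet health.
# The password is read from TRINITY_ADMIN_PASSWORD (an assumed variable name).
fleet_health() {
  if [ -z "${TRINITY_ADMIN_PASSWORD:-}" ]; then
    echo "TRINITY_ADMIN_PASSWORD is not set" >&2
    return 1
  fi
  # --data-urlencode keeps special characters in the password intact.
  token=$(curl -s -X POST http://localhost:8000/api/token \
    --data-urlencode 'username=admin' \
    --data-urlencode "password=$TRINITY_ADMIN_PASSWORD" \
    | extract_token) || { echo "login failed" >&2; return 1; }
  curl -s -H "Authorization: Bearer $token" \
    http://localhost:8000/api/ops/fleet/health
}
```

With the variable exported, fleet_health | jq . produces the same output as the two commands above.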

Recovery Patterns

Backend not responding

docker compose restart backend
# Wait ~15 seconds
curl -s http://localhost:8000/health
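
Instead of waiting a fixed interval, the health endpoint can be polled until it answers. This is a minimal sketch; retry_until and restart_backend_and_wait are hypothetical names, and the 15 attempts at 2-second intervals are arbitrary values chosen for illustration.

```shell
# Hypothetical helper: run a command until it succeeds.
# Usage: retry_until TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  ru_tries=$1
  ru_delay=$2
  shift 2
  ru_i=1
  while [ "$ru_i" -le "$ru_tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$ru_i" -lt "$ru_tries" ]; then
      sleep "$ru_delay"
    fi
    ru_i=$((ru_i + 1))
  done
  return 1
}

# Restart the backend, then poll /health for roughly 30 seconds.
# curl -f turns HTTP error statuses (400 and above) into a non-zero exit code.
restart_backend_and_wait() {
  docker compose restart backend || return 1
  retry_until 15 2 curl -sf http://localhost:8000/health
}
```

The same retry_until call works for the scheduler by pointing it at http://localhost:8001/health.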

Scheduler not running schedules

curl -s http://localhost:8001/health
docker compose restart scheduler

Agent network not found

This happens when docker compose down was used instead of docker compose stop: down removes the project's networks, including trinity-agent-network.

# Recreate missing networks while leaving running containers intact
docker compose up -d
# or for production:
docker compose -f docker-compose.prod.yml up -d

Agent context >90%

Reset context via the web UI: navigate to the agent, open the Session or Chat tab, and use the reset/close option. Or restart the agent container directly:

docker restart <agent-container-name>

Database locked (SQLITE_BUSY in backend logs)

Check for duplicate backend processes (should be exactly one):

docker ps | grep trinity-backend
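
The "exactly one" check can be made explicit rather than read off the grep output by eye. The sketch below uses hypothetical helper names (backend_containers, expect_single) and assumes that every backend container has trinity-backend in its name.

```shell
# Names of running containers whose name contains "trinity-backend", one per line.
backend_containers() {
  docker ps --filter name=trinity-backend --format '{{.Names}}'
}

# Hypothetical check: read container names on stdin and confirm there is exactly one.
expect_single() {
  # grep -c exits non-zero when it counts 0 lines, hence the || true.
  es_count=$(grep -c . || true)
  if [ "$es_count" -eq 1 ]; then
    echo "ok: 1 backend container"
  else
    echo "unexpected: $es_count backend containers"
    return 1
  fi
}
```

Running backend_containers | expect_single returns non-zero when zero or several backend containers are up.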

MCP clients disconnected after restart

JWT tokens are invalidated when the backend restarts, so users need to log in again. Claude Code MCP clients also need to reconnect: run /mcp in your Claude Code session or restart the client.

Disk full — Docker cleanup

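# Remove stopped containers, unused networks, dangling images, and build cache
# Note: this also deletes stopped agent containers and networks with no attached container
docker system prune -f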

# Remove dangling images only
docker image prune -f

# Check size recovered
docker system df
