
Voice Chat

Real-time voice conversations with agents via the Gemini 2.5 Flash Native Audio model (~280 ms latency). Gemini handles speech-to-speech; Claude Code remains the agent's reasoning engine and is invoked on demand via tool calling.

Concepts

•Voice Session — A live audio session bridged between the browser, Trinity backend, and Gemini Live API. Transcripts are saved to the agent's chat session on close.

•Animated Orb — Canvas-rendered visualization that reflects session state via color and particle movement.

•Tool Calling (run_task) — During a voice session, Gemini can delegate complex tasks to the underlying Claude agent. The orb shows an amber badge while the task runs.

•Voice System Prompt — Controls Gemini's persona for the session. Looked up in order: DB setting → voice-agent-system-prompt.md in the container → auto-generated from template info → generic fallback.

•Workspace Mode — A full-page voice surface with a live canvas the agent can draw on (diagrams, images, formatted text) while you talk. Admin opt-in, BETA. See Workspace Mode below.

How It Works

Open an agent's Chat tab.

Click the microphone button next to the chat input.

A full-screen voice overlay appears with an animated canvas orb.

Speak — audio is captured as PCM 16 kHz and streamed to the backend WebSocket.

The backend proxies audio to the Gemini Live API in real-time.

Agent response audio (PCM 24 kHz) plays back immediately (~280ms TTFT).

When Gemini needs to perform a complex task, it calls run_task and the orb shows an amber badge. Trinity sends the prompt to the Claude agent, which has up to 30 seconds to respond. Gemini speaks the result when the task finishes, and the orb returns to the listening state.

Click End to close the session. Transcripts are saved to the current chat session.

Orb State Reference

State	Orb color	Trigger
Idle / Connecting	Base hue (0°)	Before audio starts
Listening	+90° shift (green)	Microphone active, user speaking
Speaking	+210° shift (indigo)	Gemini responding
Tool calling	Amber badge overlay	`run_task` dispatched to Claude
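
The state-to-color mapping in the table can be read as hue offsets from a base hue. The following is a minimal sketch, assuming hues are measured in degrees on a 0 to 360 color wheel and wrap around. The names `HUE_SHIFT_DEG`, `orb_hue`, and `orb_style` are hypothetical and not part of Trinity. The tool-calling state is an overlay badge rather than a hue shift, so it is modeled as a separate flag.

```python
# Hue offsets in degrees, taken from the Orb State Reference table.
HUE_SHIFT_DEG = {"idle": 0, "connecting": 0, "listening": 90, "speaking": 210}


def orb_hue(state: str, base_hue: float = 0.0) -> float:
    """Return the orb hue for a session state, wrapped to [0, 360)."""
    return (base_hue + HUE_SHIFT_DEG[state]) % 360


def orb_style(state: str, tool_running: bool, base_hue: float = 0.0) -> dict:
    """Combine the hue with the amber badge flag shown while run_task is active."""
    return {"hue": orb_hue(state, base_hue), "amber_badge": tool_running}
```

Keeping the badge separate from the hue means a tool call can be shown on top of whichever state the session is in.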

Click Mute to silence the microphone mid-session. Gemini continues speaking.

Requirements

•GEMINI_API_KEY configured in Settings → AI Keys

•VOICE_ENABLED must be on (default: on when API key is present)

•Browser microphone permission granted

Configuration

Variable	Description	Default
GEMINI_API_KEY	API key for Gemini Live API	— (required)
VOICE_ENABLED	Global toggle	`true`
WORKSPACE_ENABLED	Enable the Workspace Mode canvas (BETA, admin opt-in)	`false`
VOICE_MODEL	Gemini model ID	`gemini-2.5-flash-native-audio-preview-12-2025`
VOICE_MAX_DURATION	Max session duration in seconds	`300`
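
As an illustration of how these settings and their defaults fit together, here is a minimal sketch that reads them from a key-value mapping. It assumes the values arrive as strings (for example from environment variables); how Trinity actually stores and parses them is not specified here, and `VoiceConfig`, `load_voice_config`, and the accepted boolean spellings are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_VOICE_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"


@dataclass(frozen=True)
class VoiceConfig:
    gemini_api_key: Optional[str]
    voice_enabled: bool
    workspace_enabled: bool
    voice_model: str
    voice_max_duration: int  # seconds


def _as_bool(value: Optional[str], default: bool) -> bool:
    """Parse a flag; the accepted spellings are an assumption of this sketch."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_voice_config(env: Mapping[str, str] = os.environ) -> VoiceConfig:
    """Apply the defaults from the configuration table to a raw mapping."""
    return VoiceConfig(
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        voice_enabled=_as_bool(env.get("VOICE_ENABLED"), True),
        workspace_enabled=_as_bool(env.get("WORKSPACE_ENABLED"), False),
        voice_model=env.get("VOICE_MODEL") or DEFAULT_VOICE_MODEL,
        voice_max_duration=int(env.get("VOICE_MAX_DURATION") or 300),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation.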

Per-Agent Voice Prompt

Set a custom voice system prompt by placing voice-agent-system-prompt.md in /home/developer/ inside the agent container. The prompt controls Gemini's persona independently of CLAUDE.md. A prompt stored as a DB setting takes precedence over the file. If no file is present, Trinity auto-generates a prompt from template info.
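
The lookup order (DB setting, then the container file, then the template-generated prompt, then a generic fallback) amounts to a first-match chain. The sketch below shows that pattern only. The function `resolve_voice_prompt` and the source callables are hypothetical stand-ins, and treating a blank prompt as missing is an assumption.

```python
from typing import Callable, Optional, Sequence

PromptSource = Callable[[], Optional[str]]


def resolve_voice_prompt(sources: Sequence[PromptSource], fallback: str) -> str:
    """Return the first non-empty prompt from the ordered sources, else the fallback."""
    for source in sources:
        prompt = source()
        # Assumption: a missing or whitespace-only prompt counts as "not set".
        if prompt and prompt.strip():
            return prompt
    return fallback


# Hypothetical sources, listed in the documented lookup order.
def from_db() -> Optional[str]:
    return None  # no DB setting configured


def from_container_file() -> Optional[str]:
    return "You are a concise voice assistant for the billing agent."


def from_template() -> Optional[str]:
    return "Auto-generated prompt from template info."


prompt = resolve_voice_prompt(
    [from_db, from_container_file, from_template],
    fallback="You are a helpful voice assistant.",
)
```

In this example the DB setting is absent, so the container file wins and the template source is never consulted.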

Tool Calling

When Gemini encounters complex requests, it calls run_task:

Gemini formulates a task prompt (max 2000 chars).

Trinity dispatches the prompt to the Claude agent.

The agent runs with full tool access.

The result is returned to Gemini.

If the agent is unreachable or the call times out (30s), Gemini recovers gracefully.

All run_task invocations are written to the platform audit log.
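
The dispatch-with-timeout flow above can be sketched with `asyncio.wait_for`. This is an illustration under stated assumptions, not Trinity's implementation: the `dispatch` callable stands in for whatever sends the prompt to the Claude agent, the result dictionary shape is invented, and rejecting an over-long prompt (rather than truncating it) is a guess.

```python
import asyncio
from typing import Awaitable, Callable

MAX_PROMPT_CHARS = 2000  # documented prompt limit
TASK_TIMEOUT_S = 30.0    # documented run_task timeout


async def run_task(
    prompt: str,
    dispatch: Callable[[str], Awaitable[str]],
    timeout: float = TASK_TIMEOUT_S,
) -> dict:
    """Send a prompt to the agent and always return a result the caller can speak."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return {"ok": False, "error": "prompt_too_long"}
    try:
        result = await asyncio.wait_for(dispatch(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        return {"ok": False, "error": "timeout"}
    except ConnectionError:
        return {"ok": False, "error": "unreachable"}
    return {"ok": True, "result": result}


# Stand-in dispatchers for demonstration only.
async def echo_agent(prompt: str) -> str:
    return f"done: {prompt}"


async def slow_agent(prompt: str) -> str:
    await asyncio.sleep(1)
    return "too late"


async def offline_agent(prompt: str) -> str:
    raise ConnectionError("agent offline")
```

Returning an error value instead of raising keeps the voice session alive, so the speech model can tell the user that the task failed.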

Workspace Mode BETA

Workspace Mode is a full-page voice surface with a live canvas beside the orb. While you talk, the agent can paint the canvas with diagrams, images, and formatted text — useful for walkthroughs, design reviews, and any conversation where a picture helps. It is opt-in and admin-gated, off by default.

Enabling Workspace Mode

Workspace Mode is hidden unless an admin enables it platform-wide via WORKSPACE_ENABLED (default false). The button only appears when workspace_available is true, which requires both voice to be available (VOICE_ENABLED + GEMINI_API_KEY) and WORKSPACE_ENABLED=true.
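
The gating rule reduces to two boolean checks. The sketch below restates it, assuming that "GEMINI_API_KEY configured" means a non-empty key; the function names are hypothetical.

```python
from typing import Optional


def voice_available(voice_enabled: bool, gemini_api_key: Optional[str]) -> bool:
    """Voice needs the global toggle on and a non-empty API key."""
    return voice_enabled and bool(gemini_api_key)


def workspace_available(
    voice_enabled: bool,
    gemini_api_key: Optional[str],
    workspace_enabled: bool,
) -> bool:
    """Workspace Mode needs voice to be available and WORKSPACE_ENABLED=true."""
    return voice_available(voice_enabled, gemini_api_key) and workspace_enabled
```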

How It Works

On the Agent Detail page (agent must be running), click Workspace in the header — it carries an amber BETA badge.

The browser opens the full-page workspace at /agents/{name}/workspace: the animated orb and controls on the left, the canvas on the right.

Start talking. The voice session behaves exactly like standard mode — same orb states, same run_task delegation to Claude.

When the agent decides a visual helps, it calls a panel tool. The canvas updates within ~300ms.

Panel Tools

The agent drives the canvas with these in-session tools (resolved inside Trinity — they never run in the agent container):

Tool	Effect on the canvas
show_markdown	Render formatted text (headings, lists, tables)
show_diagram	Render a Mermaid diagram (flowcharts, sequence diagrams, etc.)
show_image	Show an image — a web URL or a file from the agent's workspace
update_panel	Replace the canvas with an HTML layout
append_to_panel	Add content to the current panel
clear_panel	Empty the canvas

Panel History

The canvas keeps a 40-snapshot history. Use the prev/next controls or the dropdown to step back through what was shown earlier in the conversation. "Live" follows the newest snapshot; navigating back pins the view until a new update arrives.
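
One way to model this behavior is a bounded buffer plus an optional pinned index. The sketch below is an assumption about the mechanics, not the canvas's actual code: `PanelHistory` is a hypothetical class, and it assumes the oldest snapshot is dropped once the limit of 40 is reached.

```python
from collections import deque
from typing import Any, Optional


class PanelHistory:
    """Bounded snapshot history with a 'live' view and a pinned view."""

    def __init__(self, capacity: int = 40) -> None:
        self._snapshots: deque = deque(maxlen=capacity)  # oldest entries fall off
        self._pinned: Optional[int] = None  # None means the view is live

    def __len__(self) -> int:
        return len(self._snapshots)

    @property
    def is_live(self) -> bool:
        return self._pinned is None

    def push(self, snapshot: Any) -> None:
        """A new update is stored and returns the view to live."""
        self._snapshots.append(snapshot)
        self._pinned = None

    def current(self) -> Any:
        if not self._snapshots:
            return None
        index = len(self._snapshots) - 1 if self._pinned is None else self._pinned
        return self._snapshots[index]

    def prev(self) -> None:
        """Step back one snapshot and pin the view there."""
        if len(self._snapshots) < 2:
            return
        index = len(self._snapshots) - 1 if self._pinned is None else self._pinned
        self._pinned = max(index - 1, 0)

    def next(self) -> None:
        """Step forward; reaching the newest snapshot makes the view live again."""
        if self._pinned is None:
            return
        newest = len(self._snapshots) - 1
        self._pinned = None if self._pinned + 1 >= newest else self._pinned + 1
```

Because `push` clears the pin, a viewer who has stepped back is returned to the newest content as soon as the agent updates the canvas.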

Rendering & Safety

All canvas content is sanitized before display (DOMPurify, the same trust model as every other markdown surface on the platform):

•Markdown and Mermaid diagrams render directly in the page.

•Images from the agent's workspace are fetched over an authenticated channel; web URLs load directly. Paths are confined to the workspace — traversal (..), absolute escapes, and non-http schemes (data:, etc.) are rejected.

•HTML panels render as static layout only — <script> tags are stripped, so agent-supplied JavaScript (e.g. Chart.js) does not execute. Use show_diagram for dynamic visuals instead.
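
The image source rules above can be illustrated with a small classifier. This is a sketch of the stated rules only, using the standard library; `classify_image_source` and `is_allowed_image_source` are hypothetical names, and rejecting every `..` segment (even one that would stay inside the workspace after normalization) is a deliberately strict assumption.

```python
from urllib.parse import urlsplit


def classify_image_source(src: str) -> str:
    """Return 'web' for http(s) URLs or 'workspace' for confined relative paths.

    Raises ValueError for anything else.
    """
    parts = urlsplit(src)
    if parts.scheme:
        # Only http and https are allowed; data:, file:, etc. are rejected.
        if parts.scheme in ("http", "https") and parts.netloc:
            return "web"
        raise ValueError(f"scheme not allowed: {parts.scheme}")
    path = src.replace("\\", "/")
    if not path or path.startswith("/"):
        raise ValueError("absolute or empty paths are not allowed")
    if ".." in path.split("/"):
        raise ValueError("path traversal is not allowed")
    return "workspace"


def is_allowed_image_source(src: str) -> bool:
    """Boolean convenience wrapper around classify_image_source."""
    try:
        classify_image_source(src)
    except ValueError:
        return False
    return True
```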

API Endpoints

Endpoint	Method	Description
/api/agents/{name}/voice/start	POST	Start a voice session — pass `workspace_mode: true` for canvas mode. Returns `voice_session_id` and WebSocket URL
/api/agents/{name}/voice/stop	POST	Stop a voice session — returns transcript and cost
/api/agents/{name}/voice/status	GET	Get session status
/api/agents/{name}/voice/{session_id}/panel	GET	Current workspace canvas state (`type`, `content`, `title`, `updated_at`); polled by the canvas
/api/agents/{name}/voice/prompt	GET / PUT	Read or set the per-agent voice system prompt
/ws/voice/{session_id}	WebSocket	Bidirectional audio bridge
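
As a rough illustration of the start endpoint, the sketch below builds (but does not send) a request using only the standard library. The base URL, the bearer-token header, and the JSON request body are assumptions; only the path and the `workspace_mode` field come from the table above. `build_start_request` is a hypothetical helper.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_start_request(
    base_url: str,
    agent_name: str,
    token: str,
    workspace_mode: bool = False,
) -> Request:
    """Build a POST request for /api/agents/{name}/voice/start."""
    url = f"{base_url.rstrip('/')}/api/agents/{quote(agent_name, safe='')}/voice/start"
    body = json.dumps({"workspace_mode": workspace_mode}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the API accepts a bearer token; check your deployment.
        "Authorization": f"Bearer {token}",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical host and agent name, for illustration only.
request = build_start_request(
    "https://trinity.example.com/", "my agent", "TOKEN", workspace_mode=True
)
```

Passing the request to `urllib.request.urlopen` would send it; the response is documented to include `voice_session_id` and the WebSocket URL.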

WebSocket Message Types

Client → Server

{ "type": "audio", "data": "<base64 PCM 16kHz audio>" }
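
A client could build this message from a raw PCM chunk as follows. This is a minimal sketch: the message shape comes from the example above, while the sample format (16-bit signed little-endian mono) and the 20 ms chunk size are assumptions, since only the 16 kHz rate is stated. The helper names are hypothetical.

```python
import base64
import json

SAMPLE_RATE_HZ = 16_000  # documented capture rate
BYTES_PER_SAMPLE = 2     # assumption: 16-bit samples


def audio_message(pcm_chunk: bytes) -> str:
    """Wrap a raw PCM chunk in the client-to-server JSON envelope."""
    encoded = base64.b64encode(pcm_chunk).decode("ascii")
    return json.dumps({"type": "audio", "data": encoded})


def silence(duration_ms: int) -> bytes:
    """Return PCM silence of the given duration, useful for testing."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return bytes(samples * BYTES_PER_SAMPLE)


message = audio_message(silence(20))
```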

Server → Client

•audio — PCM 24 kHz response audio chunk

•transcript — Incremental or final transcript text

•status — Session state change (listening, speaking, idle)

•tool_call — run_task dispatched to Claude

•tool_result — Claude agent response returned to Gemini
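
A client typically routes these messages by type. The sketch below assumes server messages are JSON objects with a `type` field, mirroring the client-to-server format; the payload fields of each message are not documented here, so handlers receive the whole decoded object. `dispatch_server_message` is a hypothetical name.

```python
import json
from typing import Any, Callable, Dict

KNOWN_TYPES = {"audio", "transcript", "status", "tool_call", "tool_result"}

Handler = Callable[[Dict[str, Any]], None]


def dispatch_server_message(raw: str, handlers: Dict[str, Handler]) -> bool:
    """Route one server message to its handler.

    Returns False when the type is unknown or has no registered handler.
    """
    message = json.loads(raw)
    kind = message.get("type")
    handler = handlers.get(kind)
    if kind not in KNOWN_TYPES or handler is None:
        return False
    handler(message)
    return True


# Example: collect status changes and ignore everything else.
status_events: list = []
handled = dispatch_server_message('{"type": "status"}', {"status": status_events.append})
```

Ignoring unknown types rather than raising lets a client keep working if the server adds new message types later.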

Limitations

•Voice is available only in authenticated chat (not public links).

•One voice session per agent at a time.

•Maximum session duration: 300 seconds (configurable via VOICE_MAX_DURATION).

•run_task tool calls time out after 30 seconds.

•Incremental transcript display is not yet implemented; transcripts appear in chat after the session ends.

•Workspace Mode is BETA, off by default, and HTML panels are static (no JavaScript execution). Exporting canvas content (PDF/markdown) and multi-page canvases are not yet available.
