Voice Services

Grid's voice services bring natural language interaction to your agents through text-to-speech (TTS) and speech-to-text (STT) capabilities. Voice is implemented as a first-class service following Grid's closure-based architecture pattern.

Architecture Overview

Voice services follow the same architectural principles as other Grid services:

┌──────────────────────────────────────┐
│         Voice-Enabled Agents         │
│       (speak, listen methods)        │
└──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│         Voice Service Layer          │
│     (TTS, STT, Voice Management)     │
└──────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────┐
│       Provider Implementations       │
│   (ElevenLabs, Azure, Google, etc)   │
└──────────────────────────────────────┘

Voice Service Interface

The voice service follows Grid's closure-based pattern:

interface VoiceService {
  // Core capabilities
  synthesize(text: string, options?: VoiceOptions): Promise<AudioResult>;
  transcribe(audio: AudioData): Promise<TranscriptionResult>;

  // Streaming capabilities
  streamSynthesize(text: string, options?: VoiceOptions): AsyncGenerator<AudioChunk>;
  streamTranscribe(audioStream: AsyncGenerator<AudioData>): AsyncGenerator<TranscriptionChunk>;

  // Voice management
  listVoices(): Promise<Voice[]>;
  getVoice(voiceId: string): Promise<Voice | null>;

  // Availability check (used for graceful degradation below)
  isAvailable(): Promise<boolean>;

  // Voice cloning (if supported by provider)
  cloneVoice?(name: string, samples: AudioInput[]): Promise<Voice>;
  deleteVoice?(voiceId: string): Promise<boolean>;
}
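
The supporting types (VoiceOptions, AudioResult, TranscriptionResult, Voice, and friends) are not shown on this page. The sketch below infers rough shapes from how they are used in the examples that follow; treat it as an aid for provider authors, not the canonical definitions.

// Approximate shapes inferred from the examples on this page --
// the published definitions in @mrck-labs/grid-core may differ.
interface VoiceOptions {
  voiceId?: string;
  speed?: number;          // speed multiplier (0.5 to 2.0)
  [key: string]: unknown;  // provider-specific knobs (stability, style, ...)
}

interface AudioResult {
  audio: ArrayBuffer;      // raw audio bytes
  format: string;          // e.g. "mp3"
  sampleRate: number;      // e.g. 44100
  size: number;            // bytes
}

interface TranscriptionResult {
  text: string;
  confidence?: number;     // 0 to 1
  language?: string;       // detected language, if available
}

interface Voice {
  id: string;
  name: string;
}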

Creating Voice Services

Voice services are created by implementing the VoiceService interface. The baseVoiceService function provides utilities to help with common tasks:

import { baseVoiceService, type VoiceService } from "@mrck-labs/grid-core";

// Get utilities from the base voice service
const utils = baseVoiceService({
  apiKey: process.env.VOICE_API_KEY,
  defaultVoiceId: "default-voice",
  onProgress: (event) => console.log(event),
});

// Create your voice service implementation
const myVoiceService: VoiceService = {
  synthesize: async (text, options) => {
    // Use utilities for common tasks
    await utils.rateLimit();
    utils.validateText(text);
    const mergedOptions = utils.mergeOptions(options);

    // Provider-specific TTS implementation
    const audio = await providerAPI.textToSpeech(text, {
      voice: mergedOptions.voiceId,
      ...mergedOptions,
    });

    return {
      audio,
      format: "mp3",
      sampleRate: 44100,
      size: audio.byteLength,
    };
  },

  transcribe: async (audio) => {
    await utils.rateLimit();
    utils.validateAudioInput(audio);

    // Use utility to prepare audio data
    const audioData = await utils.prepareAudioInput(audio);

    // Provider-specific STT implementation
    const result = await providerAPI.speechToText(audioData);

    return {
      text: result.transcript,
      confidence: result.confidence,
      language: result.detectedLanguage,
    };
  },

  listVoices: async () => {
    // Fetch the provider's voice catalog
    return providerAPI.listVoices();
  },

  isAvailable: async () => {
    // Check whether the provider is reachable and configured
    return true;
  },

  // ...plus streamSynthesize, streamTranscribe, and getVoice
};

ElevenLabs Integration

Grid includes a built-in ElevenLabs voice service implementation:

import { elevenlabsVoiceService } from "@mrck-labs/grid-core";

const voiceService = elevenlabsVoiceService({
  apiKey: process.env.ELEVENLABS_API_KEY,
  defaultVoiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel voice
  defaultOptions: {
    stability: 0.75,
    similarityBoost: 0.75,
    style: 0.5,
    useSpeakerBoost: true,
  },
});

// List available voices
const voices = await voiceService.listVoices();
console.log(voices.map(v => `${v.name} (${v.id})`));

// Synthesize speech
const audio = await voiceService.synthesize("Hello, I'm your AI assistant!", {
  voiceId: "21m00Tcm4TlvDq8ikWAM",
  stability: 0.8,
  similarityBoost: 0.7,
});

// Stream synthesis for faster response
for await (const chunk of voiceService.streamSynthesize("This is a longer response...")) {
  // Play audio chunks as they arrive
  await playAudioChunk(chunk);
}
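
The playAudioChunk helper above is assumed to be your own playback routine. If you only need to persist the result rather than play it, the synthesized buffer can be written straight to disk (a minimal Node.js sketch using the AudioResult shape shown earlier):

import { writeFileSync } from "node:fs";

// Write the synthesized audio out using the format the provider reports
const greeting = await voiceService.synthesize("Welcome to Grid!");
writeFileSync(`greeting.${greeting.format}`, Buffer.from(greeting.audio));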

Voice-Enabled Agents

Agents become voice-enabled when provided with a voice service:

import { createConfigurableAgent } from "@mrck-labs/grid-core";

const agent = createConfigurableAgent({
  llmService,
  voiceService, // Optional voice service
  config: {
    id: "voice-assistant",
    prompts: {
      system: "You are a helpful voice assistant. Keep responses concise for speech.",
    },
    voice: {
      enabled: true,
      voiceId: "21m00Tcm4TlvDq8ikWAM",
      autoSpeak: true, // Automatically speak responses
      interruptible: true, // Allow interruption mid-speech
    },
  },
});

// The agent now has voice methods
if (agent.hasVoice()) {
  // Listen for user input
  const transcript = await agent.listen();
  console.log("User said:", transcript.text);

  // Speak a response
  await agent.speak("I heard you say: " + transcript.text);
}

// Voice is integrated into the act method
const response = await agent.act({
  messages: [{ role: "user", content: "Hello!" }],
});
// If autoSpeak is true, the response is automatically spoken

Voice Configuration

Voice behavior can be configured at multiple levels:

Agent-Level Configuration

const config = {
  voice: {
    enabled: true,        // Enable voice capabilities
    voiceId: "voice-id",  // Default voice for this agent
    autoSpeak: true,      // Automatically speak responses
    interruptible: true,  // Allow interruption
    speed: 1.0,           // Speed multiplier (0.5 to 2.0)
  },
};

Per-Request Options

// Override voice settings for specific requests
await agent.speak("Important announcement!", {
  voiceId: "different-voice-id",
  stability: 0.9,       // More consistent
  similarityBoost: 0.9, // More similar to the original voice
  style: 0.3,           // Less expressive
  speed: 0.8,           // Slower for clarity
});

Mixed Modality

Grid supports mixed voice and text interactions in the same conversation:

const conversation = createConversationLoop({
  agent: voiceEnabledAgent,
  modality: {
    allowMixed: true,
    preferred: "voice",
  },
});

// The user can speak or type
conversation.on("userSpeaking", (transcript) => {
  console.log("Voice input:", transcript);
});

conversation.on("userTyping", (text) => {
  console.log("Text input:", text);
});

// The agent responds in the appropriate modality
conversation.on("agentSpeaking", (audio) => {
  // Play audio response
});

conversation.on("agentTyping", (text) => {
  // Display text response
});

Terminal Voice Support

Grid includes terminal voice capabilities for CLI applications:

import { TerminalVoiceService } from "@mrck-labs/grid-core";

const terminalVoice = new TerminalVoiceService();

// Check if audio is available
if (await terminalVoice.checkAudioSupport()) {
  // Record audio from the microphone
  const audio = await terminalVoice.recordAudio({
    duration: 5000, // Max 5 seconds
    onProgress: (level) => {
      console.log("Audio level:", level);
    },
  });

  // Play the recording back through the speakers
  await terminalVoice.playAudio(audio);
}
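
Putting the pieces together, a push-to-talk CLI loop might look like the sketch below. It assumes the recordAudio, transcribe, act, and synthesize APIs shown above; treat it as a starting point rather than a canonical recipe.

// Sketch: record -> transcribe -> act -> speak, repeated while audio is available
const voiceLoop = async () => {
  while (await terminalVoice.checkAudioSupport()) {
    const recording = await terminalVoice.recordAudio({ duration: 5000 });
    const transcript = await voiceService.transcribe(recording);
    if (!transcript.text.trim()) continue; // heard nothing, listen again

    const response = await agent.act({
      messages: [{ role: "user", content: transcript.text }],
    });

    const speech = await voiceService.synthesize(response.content);
    await terminalVoice.playAudio(speech.audio);
  }
};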

Voice Progress Events

Voice operations emit progress events for UI feedback:

agent.on("voice:listening:start", () => {
console.log("🎤 Listening...");
});

agent.on("voice:listening:end", (duration) => {
console.log(`⏹️ Recording complete (${duration}ms)`);
});

agent.on("voice:transcribing", () => {
console.log("📝 Transcribing...");
});

agent.on("voice:speaking:start", (text) => {
console.log("🔊 Speaking:", text);
});

agent.on("voice:speaking:progress", (progress) => {
console.log(`Speaking: ${(progress * 100).toFixed(0)}%`);
});

agent.on("voice:speaking:end", () => {
console.log("✅ Finished speaking");
});

Error Handling

Voice services include comprehensive error handling:

try {
  await agent.speak("Hello world");
} catch (error) {
  if (error instanceof VoiceServiceError) {
    switch (error.code) {
      case "VOICE_NOT_FOUND":
        console.error("Selected voice is not available");
        break;
      case "QUOTA_EXCEEDED":
        console.error("Voice API quota exceeded");
        break;
      case "AUDIO_PLAYBACK_FAILED":
        console.error("Could not play audio");
        break;
      default:
        console.error("Voice error:", error.message);
    }
  }
}

// Graceful degradation
if (!agent.hasVoice() || !(await voiceService.isAvailable())) {
  // Fall back to text-only interaction
  console.log(response.content);
} else {
  await agent.speak(response.content);
}
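
Some failures, such as a temporarily exceeded quota, are worth retrying. A small backoff helper can be layered on top of speak (an illustrative sketch; Grid may ship its own retry utilities):

// Hypothetical retry helper with exponential backoff for transient voice errors
const speakWithRetry = async (text: string, attempts = 3) => {
  for (let i = 0; i < attempts; i++) {
    try {
      return await agent.speak(text);
    } catch (error) {
      const retriable =
        error instanceof VoiceServiceError && error.code === "QUOTA_EXCEEDED";
      if (!retriable || i === attempts - 1) throw error;
      await new Promise((r) => setTimeout(r, 2 ** i * 1000)); // 1s, 2s, 4s...
    }
  }
};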

Performance Optimization

Streaming Synthesis

For faster voice responses, use streamSynthesize (shown earlier), or chunk long text and synthesize incrementally:

// Start speaking while still generating
const streamResponse = async (text: string) => {
const chunks = text.match(/.{1,100}[.!?]?\s/g) || [text];

for (const chunk of chunks) {
// Synthesize and play each chunk immediately
const audio = await voiceService.synthesize(chunk);
playAudio(audio); // Non-blocking
}
};
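
Because playAudio above is non-blocking, chunks can overlap. Chaining playback promises keeps output in order while synthesis of the next chunk proceeds in parallel (a sketch assuming playAudio returns a promise that resolves when playback finishes):

// Serialize playback: synthesis of chunk N+1 overlaps playback of chunk N
const streamResponseOrdered = async (text: string) => {
  const chunks = text.match(/.{1,100}[.!?]?(\s|$)/g) || [text];
  let playback: Promise<void> = Promise.resolve();

  for (const chunk of chunks) {
    const audio = await voiceService.synthesize(chunk);
    playback = playback.then(() => playAudio(audio)); // queue, don't await yet
  }

  await playback; // resolves once the final chunk has played
};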

Voice Caching

Cache commonly spoken phrases:

const cachedVoiceService = withCache(voiceService, {
  maxSize: 100, // Cache up to 100 phrases
  ttl: 3600000, // 1 hour TTL
});

// Repeated phrases are served from cache
await cachedVoiceService.synthesize("Welcome back!"); // API call
await cachedVoiceService.synthesize("Welcome back!"); // From cache
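
If you want to see what such a wrapper does internally, a minimal version is easy to sketch (illustrative only; the real withCache helper may handle TTL, eviction, and option normalization differently):

// Minimal illustrative cache wrapper around synthesize
const withSimpleCache = (service: VoiceService, maxSize = 100): VoiceService => {
  const cache = new Map<string, AudioResult>();
  return {
    ...service,
    synthesize: async (text, options) => {
      const key = JSON.stringify([text, options]);
      const hit = cache.get(key);
      if (hit) return hit;
      const result = await service.synthesize(text, options);
      if (cache.size >= maxSize) {
        cache.delete(cache.keys().next().value!); // evict the oldest entry
      }
      cache.set(key, result);
      return result;
    },
  };
};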

Parallel Processing

Process voice and LLM operations in parallel:

const processVoiceQuery = async (audioInput: AudioData) => {
  // Start transcription and LLM warmup in parallel
  const [transcript] = await Promise.all([
    voiceService.transcribe(audioInput),
    agent.prepare(), // Pre-load models, tools, etc.
  ]);

  // Process the transcribed text
  const response = await agent.act({
    messages: [{ role: "user", content: transcript.text }],
  });

  // Start synthesis while logging
  const [audio] = await Promise.all([
    voiceService.synthesize(response.content),
    logConversation(transcript, response),
  ]);

  return audio;
};

Security Considerations

API Key Management

// Use environment variables
const voiceService = elevenlabsVoiceService({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

// Or use a key management service
const voiceService = elevenlabsVoiceService({
  apiKey: await keyVault.getSecret("elevenlabs-api-key"),
});

Content Filtering

// Filter inappropriate content before synthesis
const safeVoiceService = withContentFilter(voiceService, {
  filterProfanity: true,
  maxLength: 500, // Prevent abuse with long texts
});

Rate Limiting

// Prevent abuse with rate limiting
const rateLimitedVoice = withRateLimit(voiceService, {
  maxRequests: 100,
  windowMs: 60000, // 100 requests per minute
});
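
As with caching, the wrapper pattern makes this easy to roll yourself if needed (a sliding-window sketch; the real withRateLimit helper may differ):

// Illustrative sliding-window limiter following the same wrapper pattern
const withSimpleRateLimit = (
  service: VoiceService,
  maxRequests = 100,
  windowMs = 60_000,
): VoiceService => {
  const timestamps: number[] = [];
  const check = () => {
    const now = Date.now();
    while (timestamps.length && now - timestamps[0] > windowMs) timestamps.shift();
    if (timestamps.length >= maxRequests) throw new Error("Rate limit exceeded");
    timestamps.push(now);
  };
  return {
    ...service,
    synthesize: async (text, options) => {
      check();
      return service.synthesize(text, options);
    },
    transcribe: async (audio) => {
      check();
      return service.transcribe(audio);
    },
  };
};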

Testing Voice Services

Mock voice services for testing:

const mockVoiceService = createMockVoiceService({
  voices: [
    { id: "test-1", name: "Test Voice 1" },
    { id: "test-2", name: "Test Voice 2" },
  ],
  synthesizeDelay: 100, // Simulate API delay
  transcribeResult: { text: "Mock transcription", confidence: 0.95 },
});

// Use in tests
const testAgent = createConfigurableAgent({
  llmService: mockLLMService,
  voiceService: mockVoiceService,
  config: { /* ... */ },
});

// Test voice capabilities
expect(testAgent.hasVoice()).toBe(true);
const audio = await testAgent.speak("Test message");
expect(audio).toBeDefined();

Best Practices

  1. Keep responses concise - Speech is slower than reading
  2. Use appropriate voices - Match voice to agent personality
  3. Handle interruptions gracefully - Users may want to stop long responses
  4. Provide visual feedback - Show speaking/listening states
  5. Test fallback behavior - Ensure text-only mode works
  6. Monitor API usage - Voice APIs can be expensive
  7. Cache when possible - Reduce API calls for common phrases
  8. Use streaming - Start playback before synthesis completes
  9. Handle errors gracefully - Fall back to text when voice fails
  10. Respect user preferences - Allow disabling voice features

Next Steps