Voice Services
Grid's voice services add spoken interaction to your agents through text-to-speech (TTS) and speech-to-text (STT) capabilities. Voice is implemented as a first-class service following Grid's closure-based architecture pattern.
Architecture Overview
Voice services follow the same architectural principles as other Grid services:
┌─────────────────────────────────────────┐
│          Voice-Enabled Agents           │
│         (speak, listen methods)         │
└─────────────────────────────────────────┘
                    ↑
┌─────────────────────────────────────────┐
│           Voice Service Layer           │
│      (TTS, STT, Voice Management)       │
└─────────────────────────────────────────┘
                    ↑
┌─────────────────────────────────────────┐
│        Provider Implementations         │
│    (ElevenLabs, Azure, Google, etc.)    │
└─────────────────────────────────────────┘
Voice Service Interface
The voice service follows Grid's closure-based pattern:
interface VoiceService {
  // Core capabilities
  synthesize(text: string, options?: VoiceOptions): Promise<AudioResult>;
  transcribe(audio: AudioData): Promise<TranscriptionResult>;

  // Streaming capabilities
  streamSynthesize(text: string, options?: VoiceOptions): AsyncGenerator<AudioChunk>;
  streamTranscribe(audioStream: AsyncGenerator<AudioData>): AsyncGenerator<TranscriptionChunk>;

  // Voice management
  listVoices(): Promise<Voice[]>;
  getVoice(voiceId: string): Promise<Voice | null>;

  // Availability check (used for graceful degradation)
  isAvailable(): Promise<boolean>;

  // Voice cloning (if supported by provider)
  cloneVoice?(name: string, samples: AudioInput[]): Promise<Voice>;
  deleteVoice?(voiceId: string): Promise<boolean>;
}
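The supporting types are not reproduced here. The sketch below shows plausible shapes inferred from the return values used in this page's examples; it is illustrative only, so check the package's exported types for the exact definitions:

// Illustrative sketches of the supporting types -- consult the
// @mrck-labs/grid-core exports for the authoritative definitions.
interface VoiceOptions {
  voiceId?: string;
  speed?: number;          // e.g. 0.5 to 2.0
  [key: string]: unknown;  // provider-specific settings pass through
}

interface AudioResult {
  audio: ArrayBuffer;      // raw audio bytes
  format: string;          // e.g. "mp3", "wav"
  sampleRate: number;      // e.g. 44100
  size: number;            // byte length
}

interface TranscriptionResult {
  text: string;
  confidence?: number;     // 0 to 1, when the provider reports it
  language?: string;       // detected language code
}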
Creating Voice Services
Voice services are created by implementing the VoiceService interface. The baseVoiceService function provides utilities to help with common tasks:
import { baseVoiceService, type VoiceService } from "@mrck-labs/grid-core";

// Get utilities from the base voice service
const utils = baseVoiceService({
  apiKey: process.env.VOICE_API_KEY,
  defaultVoiceId: "default-voice",
  onProgress: (event) => console.log(event),
});
// Create your voice service implementation
const myVoiceService: VoiceService = {
  synthesize: async (text, options) => {
    // Use utilities for common tasks
    await utils.rateLimit();
    utils.validateText(text);
    const mergedOptions = utils.mergeOptions(options);

    // Provider-specific TTS implementation
    const audio = await providerAPI.textToSpeech(text, {
      voice: mergedOptions.voiceId,
      ...mergedOptions,
    });

    return {
      audio,
      format: "mp3",
      sampleRate: 44100,
      size: audio.byteLength,
    };
  },

  transcribe: async (audio) => {
    await utils.rateLimit();
    utils.validateAudioInput(audio);

    // Use utility to prepare audio data
    const audioData = await utils.prepareAudioInput(audio);

    // Provider-specific STT implementation
    const result = await providerAPI.speechToText(audioData);

    return {
      text: result.transcript,
      confidence: result.confidence,
      language: result.detectedLanguage,
    };
  },

  listVoices: async () => {
    // Provider-specific voice listing
  },

  isAvailable: async () => {
    // Check if service is available
    return true;
  },
};
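With the implementation in place, you can smoke-test it directly. A minimal check, assuming a Node environment with top-level await; the file name is arbitrary:

import { writeFile } from "node:fs/promises";

// Quick smoke test: synthesize a phrase and write the audio to disk
const result = await myVoiceService.synthesize("Testing my custom voice service.");
await writeFile("test-output.mp3", Buffer.from(result.audio));
console.log(`Wrote ${result.size} bytes of ${result.format} audio`);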
ElevenLabs Integration
Grid includes a built-in ElevenLabs voice service implementation:
import { elevenlabsVoiceService } from "@mrck-labs/grid-core";

const voiceService = elevenlabsVoiceService({
  apiKey: process.env.ELEVENLABS_API_KEY,
  defaultVoiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel voice
  defaultOptions: {
    stability: 0.75,
    similarityBoost: 0.75,
    style: 0.5,
    useSpeakerBoost: true,
  },
});

// List available voices
const voices = await voiceService.listVoices();
console.log(voices.map((v) => `${v.name} (${v.id})`));

// Synthesize speech
const audio = await voiceService.synthesize("Hello, I'm your AI assistant!", {
  voiceId: "21m00Tcm4TlvDq8ikWAM",
  stability: 0.8,
  similarityBoost: 0.7,
});

// Stream synthesis for faster response
for await (const chunk of voiceService.streamSynthesize("This is a longer response...")) {
  // Play audio chunks as they arrive
  await playAudioChunk(chunk);
}
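If you'd rather persist streamed audio than play it, the chunks can be piped to a file. A Node sketch, assuming each AudioChunk exposes its raw bytes as an audio property:

import { createWriteStream } from "node:fs";

// Collect streamed chunks into a file instead of playing them
const out = createWriteStream("response.mp3");
for await (const chunk of voiceService.streamSynthesize("This is a longer response...")) {
  out.write(Buffer.from(chunk.audio)); // `audio` property is an assumption
}
out.end();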
Voice-Enabled Agents
Agents become voice-enabled when provided with a voice service:
import { createConfigurableAgent } from "@mrck-labs/grid-core";

const agent = createConfigurableAgent({
  llmService,
  voiceService, // Optional voice service
  config: {
    id: "voice-assistant",
    prompts: {
      system: "You are a helpful voice assistant. Keep responses concise for speech.",
    },
    voice: {
      enabled: true,
      voiceId: "21m00Tcm4TlvDq8ikWAM",
      autoSpeak: true, // Automatically speak responses
      interruptible: true, // Allow interruption mid-speech
    },
  },
});

// Agent now has voice methods
if (agent.hasVoice()) {
  // Listen for user input
  const transcript = await agent.listen();
  console.log("User said:", transcript.text);

  // Speak a response
  await agent.speak("I heard you say: " + transcript.text);
}

// Voice is integrated into the act method
const response = await agent.act({
  messages: [{ role: "user", content: "Hello!" }],
});
// If autoSpeak is true, the response is automatically spoken
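Putting these pieces together gives a simple turn-based voice loop. A sketch using only the methods shown above; stopRequested is a hypothetical flag your UI or a hotkey handler would set:

// Turn-based voice conversation loop (sketch)
let stopRequested = false; // hypothetical flag, set elsewhere (e.g. on a hotkey)

while (!stopRequested) {
  const transcript = await agent.listen();
  if (!transcript.text.trim()) continue; // skip empty or silent turns

  const response = await agent.act({
    messages: [{ role: "user", content: transcript.text }],
  });

  // With autoSpeak disabled, speak the response explicitly
  await agent.speak(response.content);
}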
Voice Configuration
Voice behavior can be configured at multiple levels:
Agent-Level Configuration
const config = {
  voice: {
    enabled: true, // Enable voice capabilities
    voiceId: "voice-id", // Default voice for this agent
    autoSpeak: true, // Automatically speak responses
    interruptible: true, // Allow interruption
    speed: 1.0, // Speed multiplier (0.5 to 2.0)
  },
};
Per-Request Options
// Override voice settings for specific requests
await agent.speak("Important announcement!", {
  voiceId: "different-voice-id",
  stability: 0.9, // More consistent
  similarityBoost: 0.9, // More similar to original voice
  style: 0.3, // Less expressive
  speed: 0.8, // Slower for clarity
});
Mixed Modality
Grid supports mixed voice and text interactions in the same conversation:
const conversation = createConversationLoop({
  agent: voiceEnabledAgent,
  modality: {
    allowMixed: true,
    preferred: "voice",
  },
});

// User can speak or type
conversation.on("userSpeaking", (transcript) => {
  console.log("Voice input:", transcript);
});

conversation.on("userTyping", (text) => {
  console.log("Text input:", text);
});

// Agent responds in the appropriate modality
conversation.on("agentSpeaking", (audio) => {
  // Play audio response
});

conversation.on("agentTyping", (text) => {
  // Display text response
});
Terminal Voice Support
Grid includes terminal voice capabilities for CLI applications:
import { TerminalVoiceService } from "@mrck-labs/grid-core";

const terminalVoice = new TerminalVoiceService();

// Check if audio is available
if (await terminalVoice.checkAudioSupport()) {
  // Record audio from microphone
  const audio = await terminalVoice.recordAudio({
    duration: 5000, // Max 5 seconds
    onProgress: (level) => {
      console.log("Audio level:", level);
    },
  });

  // Play the recording back through the speakers
  await terminalVoice.playAudio(audio);
}
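Combined with a voice service and an agent, this is enough for a complete record → transcribe → respond turn in a CLI. A sketch built from the APIs shown above; passing the raw bytes from the AudioResult to playAudio is an assumption:

// One full voice turn in a CLI application (sketch)
const voiceTurn = async () => {
  if (!(await terminalVoice.checkAudioSupport())) {
    console.log("No audio support; fall back to text input.");
    return;
  }

  const recording = await terminalVoice.recordAudio({ duration: 5000 });
  const transcript = await voiceService.transcribe(recording);
  console.log("You said:", transcript.text);

  const response = await agent.act({
    messages: [{ role: "user", content: transcript.text }],
  });

  const speech = await voiceService.synthesize(response.content);
  await terminalVoice.playAudio(speech.audio); // raw-bytes argument is an assumption
};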
Voice Progress Events
Voice operations emit progress events for UI feedback:
agent.on("voice:listening:start", () => {
console.log("🎤 Listening...");
});
agent.on("voice:listening:end", (duration) => {
console.log(`⏹️ Recording complete (${duration}ms)`);
});
agent.on("voice:transcribing", () => {
console.log("📝 Transcribing...");
});
agent.on("voice:speaking:start", (text) => {
console.log("🔊 Speaking:", text);
});
agent.on("voice:speaking:progress", (progress) => {
console.log(`Speaking: ${(progress * 100).toFixed(0)}%`);
});
agent.on("voice:speaking:end", () => {
console.log("✅ Finished speaking");
});
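These events map naturally onto a single status line in a terminal UI. A minimal sketch:

// Drive a one-line status display from voice events (sketch)
const setStatus = (text: string) => {
  process.stdout.write(`\r${text.padEnd(40)}`);
};

agent.on("voice:listening:start", () => setStatus("🎤 Listening..."));
agent.on("voice:transcribing", () => setStatus("📝 Transcribing..."));
agent.on("voice:speaking:progress", (progress) =>
  setStatus(`🔊 Speaking ${(progress * 100).toFixed(0)}%`),
);
agent.on("voice:speaking:end", () => setStatus("✅ Done\n"));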
Error Handling
Voice services include comprehensive error handling:
try {
  await agent.speak("Hello world");
} catch (error) {
  if (error instanceof VoiceServiceError) {
    switch (error.code) {
      case "VOICE_NOT_FOUND":
        console.error("Selected voice is not available");
        break;
      case "QUOTA_EXCEEDED":
        console.error("Voice API quota exceeded");
        break;
      case "AUDIO_PLAYBACK_FAILED":
        console.error("Could not play audio");
        break;
      default:
        console.error("Voice error:", error.message);
    }
  }
}

// Graceful degradation
if (!agent.hasVoice() || !(await voiceService.isAvailable())) {
  // Fall back to text-only interaction
  console.log(response.content);
} else {
  await agent.speak(response.content);
}
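A small helper keeps this fallback logic out of call sites. A sketch, reusing the hasVoice and isAvailable checks above:

// Speak when voice works, print otherwise (sketch)
const speakOrPrint = async (text: string) => {
  try {
    if (agent.hasVoice() && (await voiceService.isAvailable())) {
      await agent.speak(text);
      return;
    }
  } catch (error) {
    if (!(error instanceof VoiceServiceError)) throw error;
    // Voice failed mid-call; fall through to text output
  }
  console.log(text);
};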
Performance Optimization
Streaming Synthesis
For faster perceived responses, start playback before the full text has been synthesized:
// Start speaking while synthesis is still in progress
const streamResponse = async (text: string) => {
  // Split into sentence-sized chunks (~100 chars each)
  const chunks = text.match(/.{1,100}(?:[.!?]\s+|\s|$)/g) || [text];
  for (const chunk of chunks) {
    // Synthesize and play each chunk as soon as it is ready
    const audio = await voiceService.synthesize(chunk);
    playAudio(audio); // Non-blocking playback
  }
};
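Because playAudio above is non-blocking, chunks could overlap. One way to keep playback ordered while synthesis runs ahead is a promise chain; playAudioBlocking is a hypothetical helper that resolves when its chunk finishes playing:

// Pipeline: synthesize chunk N+1 while chunk N is playing (sketch)
const streamResponsePipelined = async (text: string) => {
  const chunks = text.match(/.{1,100}(?:[.!?]\s+|\s|$)/g) || [text];
  let playback: Promise<void> = Promise.resolve();

  for (const chunk of chunks) {
    const audio = await voiceService.synthesize(chunk);
    // Chain playback so chunks play in order without overlapping
    playback = playback.then(() => playAudioBlocking(audio));
  }

  await playback; // wait for the final chunk to finish
};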
Voice Caching
Cache commonly spoken phrases:
const cachedVoiceService = withCache(voiceService, {
  maxSize: 100, // Cache up to 100 phrases
  ttl: 3600000, // 1 hour TTL
});

// Repeated phrases are served from cache
await cachedVoiceService.synthesize("Welcome back!"); // API call
await cachedVoiceService.synthesize("Welcome back!"); // From cache
Parallel Processing
Process voice and LLM operations in parallel:
const processVoiceQuery = async (audioInput: AudioData) => {
  // Start transcription and LLM warmup in parallel
  const [transcript] = await Promise.all([
    voiceService.transcribe(audioInput),
    agent.prepare(), // Pre-load models, tools, etc.
  ]);

  // Process the transcribed text
  const response = await agent.act({
    messages: [{ role: "user", content: transcript.text }],
  });

  // Start synthesis while logging in the background
  const [audio] = await Promise.all([
    voiceService.synthesize(response.content),
    logConversation(transcript, response),
  ]);

  return audio;
};
Security Considerations
API Key Management
// Use environment variables
const voiceService = elevenlabsVoiceService({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

// Or use a key management service
const voiceService = elevenlabsVoiceService({
  apiKey: await keyVault.getSecret("elevenlabs-api-key"),
});
Content Filtering
// Filter inappropriate content before synthesis
const safeVoiceService = withContentFilter(voiceService, {
  filterProfanity: true,
  maxLength: 500, // Prevent abuse with long texts
});
Rate Limiting
// Prevent abuse with rate limiting
const rateLimitedVoice = withRateLimit(voiceService, {
  maxRequests: 100,
  windowMs: 60000, // 100 requests per minute
});
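If you need a custom policy, a sliding-window limiter in the same wrapper style fits in a few lines. A sketch that guards only synthesize:

// Minimal sliding-window rate limiter (sketch)
const withSimpleRateLimit = (
  service: VoiceService,
  maxRequests: number,
  windowMs: number,
): VoiceService => {
  const timestamps: number[] = [];
  return {
    ...service,
    synthesize: async (text, options) => {
      const now = Date.now();
      // Drop request timestamps that have aged out of the window
      while (timestamps.length && now - timestamps[0] > windowMs) {
        timestamps.shift();
      }
      if (timestamps.length >= maxRequests) {
        throw new Error("Rate limit exceeded; try again later");
      }
      timestamps.push(now);
      return service.synthesize(text, options);
    },
  };
};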
Testing Voice Services
Mock voice services for testing:
const mockVoiceService = createMockVoiceService({
  voices: [
    { id: "test-1", name: "Test Voice 1" },
    { id: "test-2", name: "Test Voice 2" },
  ],
  synthesizeDelay: 100, // Simulate API delay
  transcribeResult: { text: "Mock transcription", confidence: 0.95 },
});

// Use in tests
const testAgent = createConfigurableAgent({
  llmService: mockLLMService,
  voiceService: mockVoiceService,
  config: { /* ... */ },
});

// Test voice capabilities
expect(testAgent.hasVoice()).toBe(true);
const audio = await testAgent.speak("Test message");
expect(audio).toBeDefined();
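The mock's canned transcription makes the STT path testable as well. Assuming the same Jest-style assertions; the empty ArrayBuffer is a placeholder for real audio:

// Exercise the STT path against the mock
const silence = new ArrayBuffer(0) as unknown as AudioData; // placeholder; use your real AudioData shape
const transcript = await mockVoiceService.transcribe(silence);
expect(transcript.text).toBe("Mock transcription");
expect(transcript.confidence).toBe(0.95);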
Best Practices
- Keep responses concise - Speech is slower than reading
- Use appropriate voices - Match voice to agent personality
- Handle interruptions gracefully - Users may want to stop long responses
- Provide visual feedback - Show speaking/listening states
- Test fallback behavior - Ensure text-only mode works
- Monitor API usage - Voice APIs can be expensive
- Cache when possible - Reduce API calls for common phrases
- Use streaming - Start playback before synthesis completes
- Handle errors gracefully - Fall back to text when voice fails
- Respect user preferences - Allow disabling voice features
Next Steps
- Voice Integration Guide - Step-by-step setup
- Terminal Voice Setup - CLI voice features
- Voice Service API - Detailed API reference