
Introducing Voice Capabilities in Grid 🎙️

· 4 min read
Grid Team
The team behind Grid

We're excited to announce that Grid now supports voice interactions! This major update brings natural speech synthesis and recognition to your AI agents, enabling more intuitive and accessible conversations.

What's New

Grid agents can now:

  • Speak their responses using natural-sounding voices
  • Listen to voice input (speech-to-text)
  • Mix modalities - seamlessly blend voice and text in the same conversation
  • Stream audio for faster response times

Powered by ElevenLabs

Our initial implementation leverages ElevenLabs' industry-leading voice technology, providing:

  • High-quality, natural-sounding voices
  • Multiple voice options with different personalities
  • Real-time streaming synthesis
  • Multi-language support
  • Voice cloning capabilities (Pro accounts)

Quick Example

Making your agents speak is as simple as adding a voice service:

```typescript
import {
  createConfigurableAgent,
  elevenlabsVoiceService,
} from "@mrck-labs/grid-core";

// Create voice service
const voiceService = elevenlabsVoiceService({
  apiKey: process.env.ELEVENLABS_API_KEY,
});

// Create voice-enabled agent
const agent = createConfigurableAgent({
  llmService,
  voiceService, // That's it!
  config: {
    voice: {
      enabled: true,
      autoSpeak: true, // Automatically speak responses
    },
  },
});

// Agent responses are now spoken automatically
await agent.act("Tell me about the weather");
```

Terminal Voice Experience

We've also built a complete voice conversation experience for the terminal:

```bash
# Install dependencies
brew install sox                          # macOS
# or
sudo apt-get install sox libsox-fmt-all   # Linux

# Run the terminal agent
npx terminal-agent

# Select "🎙️ Voice Conversation"
# Press SPACE to talk!
```

Features include:

  • Push-to-talk with the SPACE key
  • Beautiful ASCII animations for voice states
  • Mixed modality - type while the assistant speaks
  • Voice selection from available ElevenLabs voices
  • Built-in voice commands (/voice on|off|list)

Key Features

1. Service-Based Architecture

Voice follows Grid's established patterns - it's just another service:

```typescript
const agent = createConfigurableAgent({
  llmService, // Required
  toolExecutor, // Required
  voiceService, // Optional - enables voice!
  config: { /* ... */ },
});
```

2. Graceful Degradation

Voice features degrade gracefully when unavailable:

```typescript
if (agent.hasVoice()) {
  await agent.speak("Hello!");
} else {
  console.log("Hello!");
}
```

3. Streaming Support

Stream synthesis for faster responses:

```typescript
for await (const chunk of voiceService.streamSynthesize(text)) {
  await playAudioChunk(chunk);
}
```

4. Mixed Modality

Users can type and speak in the same conversation, inspired by ElevenLabs' own interface:

  • Speak naturally for most content
  • Type while speaking for URLs, technical terms, names
  • System intelligently merges both inputs
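As a minimal sketch of that merging step - assuming typed fragments are simply appended to the speech transcript in arrival order; the `mergeInputs` helper below is illustrative and not part of the Grid API:

```typescript
// Hypothetical helper: combine a speech transcript with typed fragments.
// Grid's actual merging is likely smarter (ordering, deduplication); this
// only illustrates blending both modalities into a single turn of input.
function mergeInputs(spokenTranscript: string, typedFragments: string[]): string {
  return [spokenTranscript, ...typedFragments]
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(" ");
}

// e.g. speak "open the docs at" and type the URL:
const turn = mergeInputs("open the docs at", ["https://example.com/docs"]);
```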

Use Cases

Customer Support

```typescript
const supportAgent = createConfigurableAgent({
  llmService,
  voiceService,
  config: {
    prompts: {
      system: "You are a friendly customer support agent...",
    },
    voice: {
      voiceId: "21m00Tcm4TlvDq8ikWAM", // Warm, friendly voice
      defaultOptions: {
        stability: 0.8,
        style: 0.6, // More expressive
      },
    },
  },
});
```

Educational Assistants

```typescript
const tutorAgent = createConfigurableAgent({
  llmService,
  voiceService,
  config: {
    prompts: {
      system: "You are a patient tutor. Speak slowly and clearly...",
    },
    voice: {
      defaultOptions: {
        speed: 0.9, // Slower pace
        stability: 0.9, // Clearer pronunciation
      },
    },
  },
});
```

Accessibility

Voice enables Grid agents to be more accessible to users with:

  • Visual impairments
  • Mobility limitations
  • Dyslexia or reading difficulties
  • Preferences for audio learning

Architecture Highlights

The implementation follows Grid's architectural principles:

  1. Closure-based services - No classes, just functions
  2. Provider abstraction - Easy to add more voice providers
  3. Type-safe - Full TypeScript support
  4. Testable - Mock voice services for testing
  5. Observable - Integrated with Grid's telemetry
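Point 4 is worth a quick illustration: because a voice service is just an object of functions, a test double is a few lines of closure state. The interface shape below (`synthesize` returning audio bytes) is an assumption for the sketch, not Grid's exact contract:

```typescript
// Hypothetical mock voice service for tests - records every text it is
// asked to synthesize and returns empty audio. Closure state, no classes.
function createMockVoiceService() {
  const synthesized: string[] = [];
  return {
    synthesize(text: string): Promise<Uint8Array> {
      synthesized.push(text); // record the call synchronously
      return Promise.resolve(new Uint8Array(0)); // silent "audio"
    },
    calls: () => [...synthesized],
  };
}
```

In a test you would pass this in place of `elevenlabsVoiceService(...)` and assert on `calls()` - no network, no audio device.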

Performance Optimization

We've implemented several optimizations:

  • Voice caching for repeated phrases
  • Parallel processing of voice and compute
  • Streaming synthesis for long responses
  • Smart chunking for natural speech flow
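As a rough sketch of the first item, caching repeated phrases can be as simple as memoizing the synthesis call by exact text. The wrapper below is illustrative only; Grid's actual cache (and any eviction policy) isn't shown here:

```typescript
// Hypothetical phrase cache: memoize synthesis by text so repeated phrases
// ("One moment, please...") skip the network round trip. Caching the promise
// itself means concurrent requests for the same phrase also share one call.
function cacheSynthesis(
  synthesize: (text: string) => Promise<Uint8Array>
): (text: string) => Promise<Uint8Array> {
  const cache = new Map<string, Promise<Uint8Array>>();
  return (text) => {
    let audio = cache.get(text);
    if (!audio) {
      audio = synthesize(text); // first request kicks off synthesis
      cache.set(text, audio); // later requests reuse the same promise
    }
    return audio;
  };
}
```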

What's Next

This is just the beginning! Our roadmap includes:

  • Additional voice providers (Azure, Google, AWS)
  • Voice activity detection (VAD) for hands-free interaction
  • Emotion and tone analysis
  • Voice-based authentication
  • Multi-speaker conversations
  • Ambient listening mode
  • Real-time translation

Getting Started

Ready to add voice to your agents? Check out:

Feedback Welcome!

We'd love to hear about your voice use cases and experiences. Please:

Special Thanks

A huge thank you to:

  • The ElevenLabs team for their amazing voice API
  • Our early testers who provided invaluable feedback
  • The community for inspiring this feature

Happy voice coding! 🎙️✨