UNITH docs

Overview

The UNITH interFace supports multiple text-to-speech (TTS) providers to give your Digital Human a natural, engaging voice. You can select voices from ElevenLabs or Microsoft Azure directly through the interface, or integrate custom voice providers using our connector framework.

For the list of voice connectors we support, see our Voice connectors documentation.

Need a different voice provider? You have full flexibility to create custom voice connectors. Please check out the following repository.


How Voice Selection Impacts Performance

A Digital Human response requires audio generation before video synthesis can begin, so audio generation speed directly affects the overall response time of your Digital Human.

Response Pipeline:

  1. User query processed
  2. Audio generated ← Voice model speed matters here
  3. Video synthesized from audio
  4. Complete response delivered

Faster audio generation means quicker responses and a more natural, engaging user experience.
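
The effect of the audio stage on total response time can be sketched with a small latency model. This is only an illustration: it assumes the four pipeline stages run strictly one after another with no overlap, and every duration below is a made-up number, not a UNITH measurement. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    """Per-stage durations in seconds (illustrative values, not measurements)."""
    query_processing: float
    audio_generation: float
    video_synthesis: float
    delivery: float

    def total(self) -> float:
        # Assumption: stages run sequentially, so their durations add up.
        return (self.query_processing + self.audio_generation
                + self.video_synthesis + self.delivery)


def time_saved(slow: StageTimings, fast: StageTimings) -> float:
    """Seconds saved end to end when switching from `slow` to `fast`."""
    return slow.total() - fast.total()


# Hypothetical numbers: only the audio stage differs between the two voices.
slow_voice = StageTimings(0.5, 2.0, 1.0, 0.25)
fast_voice = StageTimings(0.5, 0.5, 1.0, 0.25)
print(time_saved(slow_voice, fast_voice))  # 1.5
```

Under this model, every second removed from audio generation is a second removed from the full response, which is why the voice model is the lever this page focuses on.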

ElevenLabs

ElevenLabs offers a wide variety of voices powered by different models, each optimized for specific use cases. For Digital Human applications, we recommend using voices powered by their speed-optimized models.

Recommended Models

Model        Characteristics                        Use Case
flash_v2     Fastest generation, balanced quality   Real-time conversations
flash_v2_5   Enhanced flash model                   Real-time conversations with improved quality
turbo_v2     High-speed generation                  Low-latency interactions
turbo_v2_5   Latest turbo generation                Optimal balance of speed and quality

Best Practice: Select ElevenLabs voices that use flash_v2, flash_v2_5, turbo_v2, or turbo_v2_5 models for the fastest Digital Human response times.

Important Notes

  • All ElevenLabs models will function correctly with Digital Humans
  • Non-optimized models may result in longer response delays
  • Speed-optimized models are specifically designed for real-time conversational applications
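
If you keep your own list of candidate voices, a check like the following can flag which ones use a recommended model. It is a minimal sketch: the voice records and their field names are hypothetical, and the handling of an `eleven_` prefix on model IDs is an assumption about how the IDs may appear in your data, not something this page specifies.

```python
# Short model names exactly as listed in the Recommended Models table.
RECOMMENDED_MODELS = {"flash_v2", "flash_v2_5", "turbo_v2", "turbo_v2_5"}


def normalize_model(model_id: str) -> str:
    """Lower-case the ID and drop an optional 'eleven_' prefix (assumed)."""
    model_id = model_id.strip().lower()
    prefix = "eleven_"
    return model_id[len(prefix):] if model_id.startswith(prefix) else model_id


def is_speed_optimized(model_id: str) -> bool:
    """True when the model is one of the recommended speed-optimized models."""
    return normalize_model(model_id) in RECOMMENDED_MODELS


# Hypothetical voice records; the field names are illustrative only.
voices = [
    {"name": "Voice A", "model": "eleven_flash_v2_5"},
    {"name": "Voice B", "model": "eleven_multilingual_v2"},
    {"name": "Voice C", "model": "turbo_v2"},
]
fast_voices = [v["name"] for v in voices if is_speed_optimized(v["model"])]
print(fast_voices)  # ['Voice A', 'Voice C']
```

Voices that fail the check still work with Digital Humans; they are simply more likely to add response delay.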

For a complete list of available voices and their associated models, visit the ElevenLabs Voice Library.


Microsoft Azure

Microsoft Azure offers an extensive voice catalog across multiple performance tiers. For optimal Digital Human performance, we recommend selecting voices from their speed-optimized tiers.

Recommended Voice Types

Select voices that include one of these identifiers in their name:

Voice Type           Identifier          Language Support                   Performance
Turbo Multilingual   TurboMultilingual   40+ languages                      Fastest generation across multiple languages
HD Flash             HDFlash             English (US), Chinese (Mandarin)   Very fast with high-definition quality

Voices to Avoid

Avoid voices containing HDNeural in their name, as these prioritize audio quality over generation speed and will result in longer response times.
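
The naming rules above can be applied mechanically to a list of voice names. The sketch below does a case-sensitive substring match on the identifiers given on this page; the voice names in the example are placeholders that only illustrate the pattern, not entries from the Azure catalog, and the function name is hypothetical.

```python
RECOMMENDED_IDENTIFIERS = ("TurboMultilingual", "HDFlash")
AVOID_IDENTIFIERS = ("HDNeural",)


def classify_voice(voice_name: str) -> str:
    """Return 'recommended', 'avoid', or 'unclassified' for a voice name."""
    if any(tag in voice_name for tag in RECOMMENDED_IDENTIFIERS):
        return "recommended"
    if any(tag in voice_name for tag in AVOID_IDENTIFIERS):
        return "avoid"
    return "unclassified"


# Placeholder names that only illustrate the naming pattern.
for name in ("xx-XX-ExampleTurboMultilingualNeural",
             "xx-XX-ExampleHDFlashNeural",
             "xx-XX-ExampleHDNeural",
             "xx-XX-ExampleNeural"):
    print(name, "->", classify_voice(name))
```

Names that match neither list come back as "unclassified", so they can be tested individually rather than silently accepted or rejected.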

Azure Voice Performance Tiers

The table below provides an overview of Microsoft Azure's voice catalog organized by performance characteristics:

Performance Tier   Language Coverage                Available Voices   Recommended
Turbo              English (US) only                7                  ✅ Yes - Fastest option
HD Flash           English (US), Mandarin Chinese   10                 ✅ Yes - Fast with HD quality
Multilingual       40+ languages                    52                 ✅ Yes - Best for multilingual applications
HD Neural          Limited (10-15 languages)        54                 ⚠️ Not recommended - Slower generation
Standard Neural    150+ locales                     500+               ⚠️ Mixed performance

For multilingual Digital Humans, prioritize voices with TurboMultilingual in their name to maintain fast response times across all supported languages.

For the complete Azure voice catalog and detailed specifications, visit the Microsoft Azure TTS Documentation.


Voice Selection Best Practices

  1. Prioritize Speed-Optimized Models: Choose voices specifically designed for low-latency applications
  2. Test Before Deploying: Always test selected voices with your Digital Human to ensure they meet your quality and performance requirements
  3. Consider Your Audience: Balance response speed with voice quality based on your use case
  4. Language Requirements: If you need multilingual support, select voices that cover all required languages while maintaining performance
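
For the "Test Before Deploying" step, a small timing harness can make voice comparisons repeatable. This is a sketch under stated assumptions: `fake_synthesize` is a stand-in that only sleeps briefly, and you would swap in a function that calls your actual TTS provider. The function names and the `clock` parameter are illustrative and not part of any UNITH or provider API.

```python
import statistics
import time
from typing import Callable


def measure_latency(synthesize: Callable[[str], bytes], text: str,
                    runs: int = 5,
                    clock: Callable[[], float] = time.perf_counter) -> dict:
    """Call `synthesize(text)` `runs` times and summarize the durations."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = clock()
        synthesize(text)
        durations.append(clock() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real TTS call; replace with your provider's client."""
    time.sleep(0.01)
    return text.encode("utf-8")


print(measure_latency(fake_synthesize, "Hello, how can I help you today?"))
```

Running the same text through each candidate voice and comparing the median and worst-case durations gives a like-for-like view of audio generation time before you judge voice quality.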

Last updated Mar 6, 2026