Voice Selection Guide [TTS]
Overview
The UNITH interFace supports multiple text-to-speech (TTS) providers to give your Digital Human a natural, engaging voice. You can select voices from ElevenLabs or Microsoft Azure directly through the interface, or integrate custom voice providers using our connector framework.
For the list of supported providers, see our documentation on Voice connectors.
Need a different voice provider? You have full flexibility to create custom voice connectors. Please check out the following repository.
How Voice Selection Impacts Performance
A Digital Human must generate audio before video synthesis can begin, so audio generation speed directly affects its overall response time.
Response Pipeline:
- User query processed
- Audio generated ← Voice model speed matters here
- Video synthesized from audio
- Complete response delivered
Faster audio generation means quicker responses and a more natural, engaging user experience.
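To make the effect concrete, here is a minimal sketch that treats the pipeline above as strictly sequential stages whose latencies add up. The class name, function name, and all timing values are illustrative assumptions, not measurements from UNITH or any voice provider.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    """Per-stage latency in seconds. All values used below are made up."""
    query_processing: float
    audio_generation: float
    video_synthesis: float

    def total(self) -> float:
        # Assumption: stages run one after another, so latencies simply add.
        return self.query_processing + self.audio_generation + self.video_synthesis


def time_saved(slow: StageTimings, fast: StageTimings) -> float:
    """Seconds saved end to end when switching from `slow` to `fast`."""
    return slow.total() - fast.total()


# Same pipeline, only the audio generation stage differs (hypothetical numbers).
slow_voice = StageTimings(query_processing=0.5, audio_generation=2.0, video_synthesis=1.0)
fast_voice = StageTimings(query_processing=0.5, audio_generation=0.5, video_synthesis=1.0)

print(slow_voice.total())                  # 3.5
print(fast_voice.total())                  # 2.0
print(time_saved(slow_voice, fast_voice))  # 1.5
```

Under this simplified model, every second removed from audio generation is a second removed from the full response.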
Recommended Voices by Provider
ElevenLabs
ElevenLabs offers a wide variety of voices powered by different models, each optimized for specific use cases. For Digital Human applications, we recommend using voices powered by their speed-optimized models.
Recommended Models
| Model | Characteristics | Use Case |
|---|---|---|
| flash_v2 | Fastest generation, balanced quality | Real-time conversations |
| flash_v2_5 | Enhanced flash model | Real-time conversations with improved quality |
| turbo_v2 | High-speed generation | Low-latency interactions |
| turbo_v2_5 | Latest turbo generation | Optimal balance of speed and quality |
Best Practice: Select ElevenLabs voices that use flash_v2, flash_v2_5, turbo_v2, or turbo_v2_5 models for the fastest Digital Human response times.
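If you keep your own list of candidate voices, a small filter can flag the ones that use a recommended model. The sketch below is only an illustration: the voice records, their `name` and `model` keys, and the handling of an `eleven_` vendor prefix on model identifiers are assumptions, not part of the UNITH or ElevenLabs APIs.

```python
from typing import Dict, List

# Model names exactly as listed in the table above.
RECOMMENDED_MODELS = {"flash_v2", "flash_v2_5", "turbo_v2", "turbo_v2_5"}


def is_speed_optimized(model_id: str) -> bool:
    """Return True if the model identifier names a recommended model."""
    normalized = model_id.strip().lower()
    # Assumption: identifiers may carry a vendor prefix such as "eleven_".
    if normalized.startswith("eleven_"):
        normalized = normalized[len("eleven_"):]
    return normalized in RECOMMENDED_MODELS


def filter_fast_voices(voices: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only voices whose 'model' field is speed-optimized."""
    return [voice for voice in voices if is_speed_optimized(voice["model"])]


# Hypothetical voice records for illustration only.
candidates = [
    {"name": "Sample Voice A", "model": "flash_v2_5"},
    {"name": "Sample Voice B", "model": "eleven_turbo_v2"},
    {"name": "Sample Voice C", "model": "some_other_model"},
]

for voice in filter_fast_voices(candidates):
    print(voice["name"])
```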
Important Notes
- All ElevenLabs models will function correctly with Digital Humans
- Non-optimized models may result in longer response delays
- Speed-optimized models are specifically designed for real-time conversational applications
For a complete list of available voices and their associated models, visit the ElevenLabs Voice Library.
Microsoft Azure
Microsoft Azure offers an extensive voice catalog across multiple performance tiers. For optimal Digital Human performance, we recommend selecting voices from their speed-optimized tiers.
Recommended Voice Types
Select voices that include one of these identifiers in their name:
| Voice Type | Identifier | Language Support | Performance |
|---|---|---|---|
| Turbo Multilingual | TurboMultilingual | 40+ languages | Fastest generation across multiple languages |
| HD Flash | HDFlash | English (US), Chinese (Mandarin) | Very fast with high-definition quality |
Voices to Avoid
Avoid voices containing HDNeural in their name, as these prioritize audio quality over generation speed and will result in longer response times.
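The naming rules above can be applied as a simple substring check. This is a rough sketch, not an Azure SDK call: the function name and the sample voice names are hypothetical placeholders, so substitute real names from the Azure catalog when you use it.

```python
# Identifiers taken from the guidance above.
RECOMMENDED_IDENTIFIERS = ("TurboMultilingual", "HDFlash")
AVOID_IDENTIFIERS = ("HDNeural",)


def classify_azure_voice(voice_name: str) -> str:
    """Classify a voice name as 'recommended', 'avoid', or 'unclassified'."""
    if any(tag in voice_name for tag in RECOMMENDED_IDENTIFIERS):
        return "recommended"
    if any(tag in voice_name for tag in AVOID_IDENTIFIERS):
        return "avoid"
    # Names without a known identifier need manual testing.
    return "unclassified"


# Hypothetical voice names for illustration only.
samples = [
    "xx-XX-SampleTurboMultilingualNeural",
    "xx-XX-SampleHDFlash",
    "xx-XX-SampleHDNeural",
    "xx-XX-SampleNeural",
]

for name in samples:
    print(name, "->", classify_azure_voice(name))
```

Voices that come back as `unclassified` are neither recommended nor ruled out by name alone, so test them before use.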
Azure Voice Performance Tiers
The table below provides an overview of Microsoft Azure's voice catalog organized by performance characteristics:
| Performance Tier | Language Coverage | Available Voices | Recommended |
|---|---|---|---|
| Turbo | English (US) only | 7 | ✅ Yes - Fastest option |
| HD Flash | English (US), Chinese (Mandarin) | 10 | ✅ Yes - Fast with HD quality |
| Multilingual | 40+ languages | 52 | ✅ Yes - Best for multilingual applications |
| HD Neural | Limited (10-15 languages) | 54 | ⚠️ Not recommended - Slower generation |
| Standard Neural | 150+ locales | 500+ | ⚠️ Mixed performance |
For multilingual Digital Humans, prioritize voices with TurboMultilingual in their name to maintain fast response times across all supported languages.
For the complete Azure voice catalog and detailed specifications, visit the Microsoft Azure TTS Documentation.
Voice Selection Best Practices
- Prioritize Speed-Optimized Models: Choose voices specifically designed for low-latency applications
- Test Before Deploying: Always test selected voices with your Digital Human to ensure they meet your quality and performance requirements
- Consider Your Audience: Balance response speed with voice quality based on your use case
- Language Requirements: If you need multilingual support, select voices that cover all required languages while maintaining performance
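One way to act on the "Test Before Deploying" advice is to time audio generation for each candidate voice with the same input text. The harness below is a minimal sketch: `measure_latency` is a hypothetical helper, and `fake_synthesize` is a stand-in that only sleeps, so replace it with your own call to the provider you are evaluating.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(synthesize: Callable[[str], bytes], text: str, runs: int = 5) -> float:
    """Return the median wall-clock seconds of synthesize(text) over `runs` calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to a single slow outlier than the mean.
    return statistics.median(samples)


def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real TTS call (assumption: replace with your provider)."""
    time.sleep(0.01)
    return text.encode("utf-8")


latency = measure_latency(fake_synthesize, "Hello, how can I help you today?", runs=3)
print(f"median latency: {latency:.3f} s")
```

Run the same text through each voice you are considering and compare the medians alongside a listening check for quality.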