1. Use cases
Text-to-Speech (TTS) models are AI models that convert text into spoken output. These models generate natural and expressive speech from input text, suitable for various use cases:
- Providing audio readings for blog articles
- Generating multilingual speech content
- Supporting real-time streaming audio output
2. API usage guide
- Endpoint: /audio/speech. For detailed usage, refer to the API documentation.
- Main request parameters:
  - model: The model used for speech synthesis; see the list of supported models below.
  - input: The text content to be converted into audio.
  - voice: The reference voice, supporting system-predefined voices, user-predefined voices, and user-dynamic voices. For detailed parameters, refer to Creating a text-to-speech request.
  - speed: Controls the audio speed; a float with a default value of 1.0 and a range of [0.25, 4.0].
  - gain: Audio gain in dB, controlling the volume; a float with a default value of 0.0 and a range of [-10, 10].
  - response_format: Controls the output format, supporting mp3, opus, wav, and pcm. Different output formats allow different sampling rates.
  - sample_rate: Controls the output sampling rate; the default value and allowed range depend on the output format:
    - opus: Currently supports only 48000 Hz.
    - wav, pcm: Supports 8000, 16000, 24000, 32000, and 44100 Hz; default is 44100.
    - mp3: Supports 32000 and 44100 Hz; default is 44100.
Note: Do not add spaces to the input content, and the reference audio should be shorter than 30 seconds.
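To make the parameters concrete, here is a minimal sketch of a synthesis request using Python's requests library. The base URL and API key are placeholders for your provider's actual endpoint and credentials; the parameter names follow the list above.

```python
import requests

API_URL = "https://api.example.com/v1/audio/speech"  # placeholder base URL
API_KEY = "YOUR_API_KEY"                             # placeholder credential

payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "Hello, this is a text-to-speech test.",
    "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",  # a system-predefined voice (see 2.1)
    "speed": 1.0,             # float, range [0.25, 4.0]
    "gain": 0.0,              # dB, range [-10, 10]
    "response_format": "mp3",
    "sample_rate": 44100,     # default for mp3
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
response.raise_for_status()

# The response body is the binary audio stream in the requested format.
with open("output.mp3", "wb") as f:
    f.write(response.content)
```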
2.1 System-predefined voices
Currently, the system provides the following 8 voice options:
- Male voices:
- Female voices:
For example, FunAudioLLM/CosyVoice2-0.5B:alex indicates the alex voice from the FunAudioLLM/CosyVoice2-0.5B model.
2.2 User-predefined voices
Note: Using user-predefined voices requires real-name authentication.
2.2.1 Upload user-predefined voices using base64 encoding format
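A sketch of a base64 upload follows. The /uploads/audio/voice path and the customName, audio, and text field names are assumptions for illustration; check Creating a text-to-speech request for the actual schema.

```python
import base64
import requests

UPLOAD_URL = "https://api.example.com/v1/uploads/audio/voice"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

# Read the reference clip (under 30 seconds) and encode it as a base64 data URI.
with open("reference.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "customName": "my-voice",                     # hypothetical field name
    "audio": f"data:audio/mpeg;base64,{audio_b64}",
    "text": "Transcript of the reference clip.",  # hypothetical field name
}

resp = requests.post(UPLOAD_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload)
resp.raise_for_status()
print(resp.json())  # typically returns a voice identifier for later requests
```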
2.2.2 Upload user-predefined voices through a file
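The same upload can be sketched as a multipart file request; again, the endpoint path and field names are assumptions for illustration.

```python
import requests

UPLOAD_URL = "https://api.example.com/v1/uploads/audio/voice"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

with open("reference.mp3", "rb") as f:
    resp = requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("reference.mp3", f, "audio/mpeg")},  # hypothetical field name
        data={
            "model": "FunAudioLLM/CosyVoice2-0.5B",
            "customName": "my-voice",                     # hypothetical field name
            "text": "Transcript of the reference clip.",  # hypothetical field name
        },
    )
resp.raise_for_status()
print(resp.json())
```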
2.3 Get the list of user-dynamic voices
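A minimal sketch of listing the voices you have created, assuming a hypothetical /audio/voice/list endpoint and response shape:

```python
import requests

LIST_URL = "https://api.example.com/v1/audio/voice/list"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

resp = requests.get(LIST_URL, headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()

# The "result" key and per-voice fields are assumptions about the response shape.
for voice in resp.json().get("result", []):
    print(voice)
```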
2.4 Use user-dynamic voices
Note: Using user-predefined voices requires real-name authentication.
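To synthesize with a dynamic voice, pass the identifier returned at upload time as the voice parameter of /audio/speech. The identifier format below is a placeholder.

```python
import requests

API_URL = "https://api.example.com/v1/audio/speech"  # placeholder base URL
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "Synthesized with a cloned voice.",
    "voice": "speech:my-voice:xxxx:yyyy",  # placeholder identifier returned at upload time
    "response_format": "mp3",
}

resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload)
resp.raise_for_status()
with open("cloned.mp3", "wb") as f:
    f.write(resp.content)
```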
2.5 Delete user-dynamic voices
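A deletion sketch, assuming a hypothetical /audio/voice/deletions endpoint that takes the voice identifier:

```python
import requests

DELETE_URL = "https://api.example.com/v1/audio/voice/deletions"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

resp = requests.post(
    DELETE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"uri": "speech:my-voice:xxxx:yyyy"},  # hypothetical field name, placeholder identifier
)
resp.raise_for_status()
```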
3. List of supported models
Note: The supported TTS models may be subject to change. Please filter by the "Speech" tag on the Models page to obtain the current list of supported models.
Charging method: Charged based on the number of UTF-8 bytes in the input text; see the online byte counter demo.
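Since billing is per UTF-8 byte rather than per character, multi-byte scripts cost more per character. A quick way to count billable bytes in Python:

```python
# ASCII characters encode to 1 byte each in UTF-8, while most CJK
# characters encode to 3 bytes each.
text = "Hello, 你好"
print(len(text.encode("utf-8")))  # 7 ASCII bytes + 2 * 3 bytes = 13
```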
3.1 FunAudioLLM/CosyVoice2-0.5B series models
- Cross-language text-to-speech: Achieves speech synthesis across different languages, including Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Zhengzhou dialect, Changsha dialect, Tianjin dialect)
- Emotional control: Supports generating speech with various emotional expressions, including happiness, excitement, sadness, anger, etc.
- Fine-grained control: Controls the emotion and rhythm of the generated speech through rich text markup or natural language instructions; see the sketch after this list
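As a sketch of fine-grained control, the request below embeds control markers in the input text. The [laughter] and <strong></strong> markers follow CosyVoice's documented rich-text conventions, but the exact marker syntax is model-specific and should be verified against the model card.

```python
import requests

# Placeholder base URL and key, as in the earlier sketches.
API_URL = "https://api.example.com/v1/audio/speech"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",
    # Rich-text markers control emotion and rhythm inline; verify the
    # supported markers against the model's documentation.
    "input": "That joke was hilarious [laughter], I could not <strong>stop</strong> laughing.",
    "response_format": "mp3",
}

resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload)
resp.raise_for_status()
```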
3.2 fnlp/MOSS-TTSD-v0.5
- High-Expressivity Voice: Natural conversational tone with emotional expression support
- Dual-Voice Cloning: Zero-shot cloning with automatic speaker switching
- Bilingual Support (Chinese & English): Seamless mixed synthesis and natural pronunciation
- Long-Form Speech Generation: Low-latency, stable output for lengthy texts
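A dialogue-synthesis sketch for this model is shown below. The [S1]/[S2] speaker-tag convention is an assumption for illustration; check the model card for the exact script format.

```python
import requests

API_URL = "https://api.example.com/v1/audio/speech"  # placeholder base URL
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "fnlp/MOSS-TTSD-v0.5",
    # [S1]/[S2] speaker tags are an assumed convention for two-speaker scripts.
    "input": "[S1] Did you watch the game last night? [S2] I did, what a finish!",
    "response_format": "mp3",
}

resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload)
resp.raise_for_status()
with open("dialogue.mp3", "wb") as f:
    f.write(resp.content)
```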
4. Best practices for reference audio
Providing high-quality reference audio samples can enhance the cloning effect.
4.1 Audio quality guidelines
- Single speaker only
- Clear articulation, steady volume, tone, and emotion
- Brief pauses (0.5 seconds recommended)
- Ideal: No background noise, professional recording quality, no room echo
- Suggested duration: 8 to 10 seconds
4.2 File formats
- Supported formats: mp3, wav, pcm, opus
- Recommend using mp3 with a bitrate above 192 kbps to avoid quality loss
- Additional benefits of uncompressed formats (e.g., WAV) are limited
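A small sketch that sanity-checks a WAV reference clip against the guidelines above, using only Python's built-in wave module (mp3/opus files would need a decoding library):

```python
import wave

with wave.open("reference.wav", "rb") as w:
    duration = w.getnframes() / w.getframerate()
    channels = w.getnchannels()

if duration >= 30:
    print(f"Too long: {duration:.1f} s (reference audio must be under 30 s)")
elif not 8 <= duration <= 10:
    print(f"Duration {duration:.1f} s is outside the suggested 8-10 s range")
if channels != 1:
    print("Stereo file: a mono recording of a single speaker is a safer choice")
```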