Text to speech
1. Use cases
Text-to-Speech (TTS) models are AI models that convert text into spoken output. These models generate natural, expressive speech from input text and are suitable for a variety of use cases:
- Providing audio readings for blog articles
- Generating multilingual speech content
- Supporting real-time streaming audio output
2. API usage guide
- Endpoint: /audio/speech. For detailed usage, refer to the API documentation.
- Main request parameters (a minimal request sketch follows this list):
- model: The model used for speech synthesis; see the list of supported models below.
- input: The text content to be converted into audio.
- voice: The reference voice; system-predefined voices, user-predefined voices, and user-dynamic voices are supported. For detailed parameters, refer to Creating a text-to-speech request.
- speed: Controls the audio speed, a float type with a default value of 1.0 and a selectable range of [0.25, 4.0].
- gain: Audio gain in dB, controls the volume of the audio, a float type with a default value of 0.0 and a selectable range of [-10, 10].
- response_format: Controls the output format, supporting mp3, opus, wav, and pcm formats. Different output formats result in different sampling rates.
- sample_rate: Controls the output sampling rate, with different default values and selectable ranges depending on the output format:
- opus: Currently supports only 48000 Hz.
- wav, pcm: Supports (8000, 16000, 24000, 32000, 44100), default is 44100.
- mp3: Supports (32000, 44100), default is 44100.
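The sketch below shows a minimal speech request using Python's requests library. The base URL, API-key handling, and output file name are assumptions for illustration; the endpoint and parameters are those listed above.

```python
import requests

API_BASE = "https://api.example.com/v1"  # assumption: replace with the provider's base URL
API_KEY = "YOUR_API_KEY"                 # assumption: your actual API key

payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "Hello! This is a text-to-speech test.",
    "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",  # system-predefined voice, see 2.1
    "speed": 1.0,              # float, range [0.25, 4.0], default 1.0
    "gain": 0.0,               # dB, float, range [-10, 10], default 0.0
    "response_format": "mp3",  # mp3 / opus / wav / pcm
    "sample_rate": 44100,      # default for mp3
}

resp = requests.post(
    f"{API_BASE}/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
resp.raise_for_status()

# the endpoint returns the synthesized audio as raw bytes in the requested format
with open("speech.mp3", "wb") as f:
    f.write(resp.content)
```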
2.1 System-predefined voices
Currently, the system provides the following 8 voice options:
- Male voices: alex, benjamin, charles, david
- Female voices: anna, bella, claire, diana
Audio samples of the above voices are available for online listening.
When using a system-predefined voice in a request, prepend the model name to the voice name, for example:
- FunAudioLLM/CosyVoice2-0.5B:alex refers to the alex voice of the FunAudioLLM/CosyVoice2-0.5B model.
- fishaudio/fish-speech-1.5:anna refers to the anna voice of the fishaudio/fish-speech-1.5 model.
- RVC-Boss/GPT-SoVITS:david refers to the david voice of the RVC-Boss/GPT-SoVITS model.
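The snippet below illustrates this convention as a sketch; the input text is hypothetical, and it assumes the model parameter should name the same model as the voice prefix.

```python
# each system voice is written "<model name>:<voice name>"
voice = "fishaudio/fish-speech-1.5:anna"
payload = {
    "model": voice.split(":")[0],  # assumption: model should match the voice's prefix
    "voice": voice,
    "input": "Hello from a predefined voice.",
}
# send payload to /audio/speech as in the request sketch under section 2
```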
2.2 User-predefined voices
2.2.1 Upload a user-predefined voice in base64-encoded format
The URI field in the response is the ID of the custom voice; pass this ID as the voice parameter in subsequent text-to-speech requests.
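A hedged sketch of the base64 upload follows. The document does not specify the upload endpoint, so the path /uploads/audio/voice and the field names (customName, audio, text) are assumptions; consult the API reference for the exact contract.

```python
import base64
import requests

API_BASE = "https://api.example.com/v1"  # assumption
API_KEY = "YOUR_API_KEY"                 # assumption

with open("reference.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{API_BASE}/uploads/audio/voice",   # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "customName": "my-voice",                        # assumed field name
        "audio": f"data:audio/mpeg;base64,{audio_b64}",  # assumed data-URI wrapping
        "text": "Transcript of the reference audio.",    # assumed field name
    },
)
resp.raise_for_status()
voice_id = resp.json()["uri"]  # the URI field is the custom voice ID
```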
2.2.2 Upload a user-predefined voice through a file
The URI field in the response is the ID of the custom voice; pass this ID as the voice parameter in subsequent text-to-speech requests.
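The same upload as a multipart file request, again as a sketch; the endpoint path and form-field names are assumptions carried over from the base64 example above.

```python
import requests

API_BASE = "https://api.example.com/v1"  # assumption
API_KEY = "YOUR_API_KEY"                 # assumption

with open("reference.mp3", "rb") as f:
    resp = requests.post(
        f"{API_BASE}/uploads/audio/voice",   # assumed endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("reference.mp3", f, "audio/mpeg")},
        data={
            "model": "FunAudioLLM/CosyVoice2-0.5B",
            "customName": "my-voice",                      # assumed field name
            "text": "Transcript of the reference audio.",  # assumed field name
        },
    )
resp.raise_for_status()
voice_id = resp.json()["uri"]  # the URI field is the custom voice ID
```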
2.3 Get the list of user-dynamic voices
Each voice in the returned list carries a URI field, which is the ID of that custom voice; it can be passed as the voice parameter in subsequent text-to-speech requests.
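A sketch of listing the voices; the path /audio/voice/list and the response shape are assumptions, so verify both against the API reference.

```python
import requests

API_BASE = "https://api.example.com/v1"  # assumption
API_KEY = "YOUR_API_KEY"                 # assumption

resp = requests.get(
    f"{API_BASE}/audio/voice/list",      # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
for voice in resp.json().get("results", []):  # assumed response shape
    print(voice["uri"])  # each URI can be passed as the voice parameter
```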
2.4 Use user-dynamic voices
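To use a dynamic voice, pass its URI as the voice parameter of an ordinary speech request, as in this sketch; the placeholder URI stands in for the ID returned by one of the upload requests above.

```python
import requests

API_BASE = "https://api.example.com/v1"  # assumption
API_KEY = "YOUR_API_KEY"                 # assumption
voice_id = "YOUR_CUSTOM_VOICE_URI"       # the URI returned when the voice was created

resp = requests.post(
    f"{API_BASE}/audio/speech",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "input": "Testing my cloned voice.",
        "voice": voice_id,               # custom voice ID instead of a system voice
        "response_format": "mp3",
    },
)
resp.raise_for_status()
with open("cloned.mp3", "wb") as f:
    f.write(resp.content)
```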
2.5 Delete user-dynamic voices
Specify the custom voice to be deleted by its URI (the ID returned when the voice was created).
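A deletion sketch; the path /audio/voice/deletions, the HTTP method, and the request body field are all assumptions to be checked against the API reference.

```python
import requests

API_BASE = "https://api.example.com/v1"  # assumption
API_KEY = "YOUR_API_KEY"                 # assumption

resp = requests.post(
    f"{API_BASE}/audio/voice/deletions",    # assumed endpoint path and method
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"uri": "YOUR_CUSTOM_VOICE_URI"},  # assumed field: URI of the voice to delete
)
resp.raise_for_status()
```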
3. List of supported models
3.1 fishaudio/fish-speech series models
- fish-speech-1.5 supported languages: Chinese, English, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese
- fish-speech-1.4 supported languages: Chinese, English, Japanese, German, French, Spanish, Korean, Arabic
3.2 RVC-Boss/GPT-SoVITS series models
- Zero-shot text-to-speech (TTS): Generate speech instantly from just a 5-second audio sample.
- Cross-language support: Support inference in languages different from the training dataset. Currently supports English, Japanese, Korean, Cantonese, and Chinese.
3.3 FunAudioLLM/CosyVoice2-0.5B series models
- Cross-language text-to-speech: Synthesize speech across different languages, including Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Zhengzhou dialect, Changsha dialect, Tianjin dialect)
- Emotional control: Support the generation of speech with various emotional expressions, including happiness, excitement, sadness, anger, etc.
- Fine-grained control: Control the emotional and rhythmic aspects of the generated speech through rich text or natural language.
4. Best practices for reference audio
Providing high-quality reference audio samples can improve voice-cloning quality.
4.1 Audio quality guidelines
- Single speaker only
- Stable volume, pitch, and emotion
- Short pauses (around 0.5 seconds is suggested)
- Ideal scenario: No background noise, professional recording quality, no room echo
4.2 File formats
- Supported formats: mp3, wav, pcm, opus
- mp3 with a bitrate of 192 kbps or higher is recommended to avoid quality loss
- Uncompressed formats (e.g., WAV) offer only limited additional benefit