Create speech
Generate audio from input text. The interface returns the audio as binary data, which the caller is responsible for handling. Reference: https://docs.siliconflow.cn/capabilities/text-to-speech#5
Authorizations
Use the following format for authentication: Bearer <your api key>
Body
The name of the model to use. To improve service quality, we periodically change the models provided by this service, including but not limited to bringing models online or taking them offline and adjusting model service capabilities. Where feasible, we will notify you of such changes through appropriate means such as announcements or message pushes.
FunAudioLLM/CosyVoice2-0.5B
For natural language instructions, place the special end marker "<|endofprompt|>" after the natural language description and before the text to be spoken. These descriptions cover aspects such as emotion, speaking speed, role-playing, and dialects. For more detailed control, insert markers such as "[laughter]" and "[breath]" within the text. Pitch feature markers can also be applied to phrases. For example: Can you say it with a happy emotion? <|endofprompt|> Today is really happy, Spring Festival is coming! I'm so happy, Spring Festival is coming! [laughter] [breath].
1 - 128000
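As an illustration of the marker placement described above, the following minimal sketch assembles an input string from an optional instruction and the text to be spoken. The helper name `build_input` is hypothetical, and the sketch assumes the 1 - 128000 range refers to the length of the input in characters.

```python
from typing import Optional

END_OF_PROMPT = "<|endofprompt|>"
MAX_INPUT_LENGTH = 128000  # assumption: the 1 - 128000 range counts characters


def build_input(text: str, instruction: Optional[str] = None) -> str:
    """Return the input string, with the instruction placed before the end marker."""
    if instruction:
        # The natural language description comes first, then the marker, then the text.
        combined = f"{instruction} {END_OF_PROMPT}{text}"
    else:
        combined = text
    if not 1 <= len(combined) <= MAX_INPUT_LENGTH:
        raise ValueError(f"input length must be between 1 and {MAX_INPUT_LENGTH}")
    return combined
```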
"Can you say it with a happy emotion? <|endofprompt|>I'm so happy, Spring Festival is coming!"
FunAudioLLM/CosyVoice2-0.5B:alex, FunAudioLLM/CosyVoice2-0.5B:anna, FunAudioLLM/CosyVoice2-0.5B:bella, FunAudioLLM/CosyVoice2-0.5B:benjamin, FunAudioLLM/CosyVoice2-0.5B:charles, FunAudioLLM/CosyVoice2-0.5B:claire, FunAudioLLM/CosyVoice2-0.5B:david, FunAudioLLM/CosyVoice2-0.5B:diana
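The voice identifiers listed above all follow a model:name pattern. A small sketch of composing and checking such an identifier is shown here; the helper `voice_id` is hypothetical and not part of the service.

```python
MODEL = "FunAudioLLM/CosyVoice2-0.5B"
PRESET_VOICES = (
    "alex", "anna", "bella", "benjamin",
    "charles", "claire", "david", "diana",
)


def voice_id(name: str, model: str = MODEL) -> str:
    """Compose a voice identifier of the form '<model>:<name>'."""
    if name not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {name!r}")
    return f"{model}:{name}"
```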
The output audio format. Supported formats are mp3, opus, wav, and pcm.
mp3, opus, wav, pcm
Controls the output sample rate. The supported values and defaults differ by audio output format, as follows: opus: supports 48000 Hz. wav, pcm: support 8000, 16000, 24000, 32000, and 44100 Hz, with a default of 44100 Hz. mp3: supports 32000 and 44100 Hz, with a default of 44100 Hz.
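The per-format rules above can be captured in a lookup table and checked before sending a request. This is a sketch only: the function name `resolve_sample_rate` is hypothetical, and because no default is stated for opus, the sketch assumes its single supported rate (48000 Hz) is used when none is given.

```python
from typing import Optional

SUPPORTED_SAMPLE_RATES = {
    "opus": (48000,),
    "wav": (8000, 16000, 24000, 32000, 44100),
    "pcm": (8000, 16000, 24000, 32000, 44100),
    "mp3": (32000, 44100),
}
# Assumption: opus falls back to its only supported rate, since no default is stated.
DEFAULT_SAMPLE_RATE = {"opus": 48000, "wav": 44100, "pcm": 44100, "mp3": 44100}


def resolve_sample_rate(audio_format: str, sample_rate: Optional[int] = None) -> int:
    """Return a sample rate that is valid for the given output format."""
    if audio_format not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported format: {audio_format!r}")
    if sample_rate is None:
        return DEFAULT_SAMPLE_RATE[audio_format]
    if sample_rate not in SUPPORTED_SAMPLE_RATES[audio_format]:
        raise ValueError(
            f"{audio_format} supports {SUPPORTED_SAMPLE_RATES[audio_format]}, got {sample_rate}"
        )
    return sample_rate
```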
Whether to stream the output.
The speed of the generated audio. Select a value from 0.25 to 4.0. The default is 1.0.
0.25 <= x <= 4
-10 <= x <= 10
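The two numeric ranges above can be enforced client-side with a simple bounds check. In this sketch the first range is applied to the speed value. The second range appears above without a parameter name, so the name `gain` used here is an assumption, as are the helper names.

```python
def check_range(name: str, value: float, low: float, high: float) -> float:
    """Return value as a float if low <= value <= high, otherwise raise ValueError."""
    if not low <= value <= high:
        raise ValueError(f"{name} must satisfy {low} <= x <= {high}, got {value}")
    return float(value)


def check_speed(speed: float = 1.0) -> float:
    # Range and default taken from the speed description: 0.25 <= x <= 4, default 1.0.
    return check_range("speed", speed, 0.25, 4.0)


def check_gain(gain: float) -> float:
    # Assumption: the unnamed -10 <= x <= 10 range belongs to a parameter called "gain".
    return check_range("gain", gain, -10.0, 10.0)
```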
Response
The response is of type file.
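Since the response is raw audio data, a client typically writes it to disk. The sketch below uses only the Python standard library to build an authenticated request with the Bearer scheme and save the returned bytes. The endpoint URL and the JSON field names in `EXAMPLE_PAYLOAD` do not appear in the text above and are assumptions, as are the function names; check them against the linked reference before use.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: endpoint URL is not stated in this section.
API_URL = "https://api.siliconflow.cn/v1/audio/speech"

# Assumption: field names are illustrative; the values come from the section above.
EXAMPLE_PAYLOAD = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "Can you say it with a happy emotion? <|endofprompt|>I'm so happy, Spring Festival is coming!",
    "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",
    "response_format": "mp3",
    "sample_rate": 44100,
    "speed": 1.0,
}


def build_request(api_key: str, payload: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the JSON body and the Bearer token."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_audio(chunks, path) -> int:
    """Write an iterable of byte chunks to path and return the number of bytes written."""
    total = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            total += len(chunk)
    return total


def create_speech(api_key: str, payload: dict, path, url: str = API_URL) -> int:
    """Send the request and save the binary audio response to path."""
    # urlopen raises urllib.error.HTTPError for non-success status codes.
    with urllib.request.urlopen(build_request(api_key, payload, url)) as resp:
        return save_audio(iter(lambda: resp.read(8192), b""), path)
```

A call such as `create_speech(api_key, EXAMPLE_PAYLOAD, "speech.mp3")` would then write the audio to `speech.mp3`, assuming the endpoint and field names above match the service.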