POST /audio/speech
curl --request POST \
  --url https://api.siliconflow.cn/v1/audio/speech \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "FunAudioLLM/CosyVoice2-0.5B",
  "input": "Can you say it with a happy emotion? <|endofprompt|>I'\''m so happy, Spring Festival is coming!",
  "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",
  "response_format": "mp3",
  "sample_rate": 44100,
  "stream": true,
  "speed": 1,
  "gain": 0
}'

Authorizations

Authorization
string
header
required

Use the following format for authentication: Bearer <your api key>

Body

application/json
model
enum<string>
required

The name of the model to use. To maintain service quality, we periodically change the models provided by this service, including but not limited to bringing models online or offline and adjusting model service capabilities. Where feasible, we will notify you of such changes through announcements or message pushes.

Available options:
FunAudioLLM/CosyVoice2-0.5B
input
string
default:Can you say it with a happy emotion? <|endofprompt|>I'm so happy, Spring Festival is coming!
required

To steer the speech with a natural language instruction, write the instruction first, end it with the special marker "<|endofprompt|>", and then give the text to be spoken. Instructions can cover aspects such as emotion, speaking speed, role-playing, and dialects. For finer-grained control, insert markers such as "[laughter]" and "[breath]" between words in the text. Pitch feature markers can also be applied to phrases. For example: Can you say it with a happy emotion? <|endofprompt|> Today is really happy, Spring Festival is coming! I’m so happy, Spring Festival is coming! [laughter] [breath].

Required string length: 1 - 128000
Example:

"Can you say it with a happy emotion? <|endofprompt|>I'm so happy, Spring Festival is coming!"

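The instruction format described above can be assembled programmatically. The following is a minimal sketch, not part of the API: `build_input` and `END_OF_PROMPT` are hypothetical names, and the length check assumes the documented limit of 1 to 128000 counts characters.

```python
from typing import Optional

# Marker that ends the natural language instruction (taken from the field description).
END_OF_PROMPT = "<|endofprompt|>"


def build_input(text: str, instruction: Optional[str] = None) -> str:
    """Return the `input` string, optionally prefixed with an instruction."""
    if instruction:
        # Instruction first, then the marker, then the text to be spoken.
        combined = f"{instruction.strip()} {END_OF_PROMPT}{text}"
    else:
        combined = text
    # Assumption: the documented length limit is measured in characters.
    if not 1 <= len(combined) <= 128000:
        raise ValueError("input must be between 1 and 128000 characters long")
    return combined


example = build_input(
    "I'm so happy, Spring Festival is coming!",
    instruction="Can you say it with a happy emotion?",
)
```

Here `example` reproduces the documented example string for this field.
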
voice
enum<string>
required
Available options:
FunAudioLLM/CosyVoice2-0.5B:alex,
FunAudioLLM/CosyVoice2-0.5B:anna,
FunAudioLLM/CosyVoice2-0.5B:bella,
FunAudioLLM/CosyVoice2-0.5B:benjamin,
FunAudioLLM/CosyVoice2-0.5B:charles,
FunAudioLLM/CosyVoice2-0.5B:claire,
FunAudioLLM/CosyVoice2-0.5B:david,
FunAudioLLM/CosyVoice2-0.5B:diana
response_format
enum<string>
default:mp3

The output audio format. Supported formats are mp3, opus, wav, and pcm.

Available options:
mp3,
opus,
wav,
pcm
sample_rate
number

Controls the output sample rate. The supported values and defaults differ by output format: opus supports 48000 Hz; wav and pcm support 8000, 16000, 24000, 32000, and 44100 Hz, with a default of 44100 Hz; mp3 supports 32000 and 44100 Hz, with a default of 44100 Hz.
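
Because the valid sample rates depend on `response_format`, it may help to check the combination on the client before sending a request. This is a sketch under stated assumptions: `resolve_sample_rate` and both tables are hypothetical helpers built from the values listed above, and using 48000 Hz as the opus default is an assumption, since the description lists it only as the supported value.

```python
from typing import Optional

# Supported sample rates per output format, copied from the field description.
SUPPORTED_SAMPLE_RATES = {
    "opus": (48000,),
    "wav": (8000, 16000, 24000, 32000, 44100),
    "pcm": (8000, 16000, 24000, 32000, 44100),
    "mp3": (32000, 44100),
}

# Documented defaults; the opus entry is an assumption (its only listed rate).
DEFAULT_SAMPLE_RATES = {"opus": 48000, "wav": 44100, "pcm": 44100, "mp3": 44100}


def resolve_sample_rate(response_format: str = "mp3",
                        sample_rate: Optional[int] = None) -> int:
    """Return a sample rate that is valid for the given output format."""
    if response_format not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if sample_rate is None:
        return DEFAULT_SAMPLE_RATES[response_format]
    if sample_rate not in SUPPORTED_SAMPLE_RATES[response_format]:
        allowed = SUPPORTED_SAMPLE_RATES[response_format]
        raise ValueError(
            f"{sample_rate} Hz is not supported for {response_format}; use one of {allowed}"
        )
    return sample_rate
```

A value such as 123 is rejected for every format, while `resolve_sample_rate("wav", 16000)` passes through unchanged.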

stream
boolean
default:true

Whether to stream the generated audio.

speed
number
default:1

The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.

Required range: 0.25 <= x <= 4
gain
number
default:0
Required range: -10 <= x <= 10
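
The documented ranges for `speed` and `gain` can also be enforced on the client side. The sketch below is illustrative only: `check_range` and `audio_options` are hypothetical helpers, and the bounds are the ones listed for these two fields.

```python
def check_range(name: str, value: float, low: float, high: float) -> float:
    """Return `value` if it lies within [low, high], otherwise raise ValueError."""
    if not low <= value <= high:
        raise ValueError(f"{name} must satisfy {low} <= x <= {high}, got {value}")
    return value


def audio_options(speed: float = 1.0, gain: float = 0.0) -> dict:
    """Build the `speed` and `gain` part of the request body."""
    return {
        "speed": check_range("speed", speed, 0.25, 4.0),  # documented range
        "gain": check_range("gain", gain, -10, 10),       # documented range
    }
```

Calling `audio_options()` with no arguments yields the documented defaults for both fields.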

Response

200
application/audio
Generates audio from the input text. The response body is binary audio data, which the caller must process themselves. Reference: https://docs.siliconflow.cn/capabilities/text-to-speech#5

The response is of type file.
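
Since the response is raw binary audio, one way to handle it is to write the bytes to a file as they arrive. The following sketch uses only the Python standard library; `build_request` and `synthesize_to_file` are hypothetical helper names, the chunk size is arbitrary, and error handling for non-200 responses is left out for brevity.

```python
import json
import urllib.request

API_URL = "https://api.siliconflow.cn/v1/audio/speech"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Create the POST request with the bearer token and a JSON body."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(api_key: str, payload: dict, path: str,
                       chunk_size: int = 8192) -> str:
    """Send the request and write the binary audio response to `path`."""
    request = build_request(api_key, payload)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)  # read the body piece by piece
            if not chunk:
                break
            out.write(chunk)
    return path


# Usage (requires a valid API key and network access):
# synthesize_to_file("<your api key>", {
#     "model": "FunAudioLLM/CosyVoice2-0.5B",
#     "input": "I'm so happy, Spring Festival is coming!",
#     "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",
#     "response_format": "mp3",
# }, "speech.mp3")
```

Reading in chunks keeps memory use flat for long inputs and works whether or not `stream` is enabled.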