Create speech

cURL

curl --location 'https://api.siliconflow.cn/v1/audio/speech' \
--header 'Authorization: Bearer sk-xx' \
--header 'Content-Type: application/json' \
--data '{
  "model": "fnlp/MOSS-TTSD-v0.5",
  "input": "你站在桥上看风景，看风景的人在楼上看你。明月装饰了你的窗子，你装饰了别人的梦",
  "voice": "fnlp/MOSS-TTSD-v0.5:alex",
  "response_format": "mp3",
  "stream": true
}'

"Binary audio data"

POST /audio/speech

Authorizations

Authorization

string

header

required

Use the following format for authentication: Bearer <your API key>

Body

application/json

model

enum<string>

required

MOSS-TTSD (text to spoken dialogue) is an open-source bilingual spoken dialogue synthesis model that supports both Chinese and English. It can transform dialogue scripts between two speakers into natural, expressive conversational speech. MOSS-TTSD supports voice cloning and long single-session speech generation, making it ideal for AI podcast production.

To improve service quality, we periodically change the models provided by this service, including but not limited to bringing models online or offline and adjusting model service capabilities. Where feasible, we will notify you of such changes through announcements, message pushes, or other appropriate means.

Available options:

fnlp/MOSS-TTSD-v0.5

input

string

required

The dialogue text uses speaker tags to indicate turns:

[S1] indicates that Speaker 1 is speaking.

[S2] indicates that Speaker 2 is speaking.

Required string length: 1 - 128000

Example:

"[S1]Hello, how are you today?[S2]I'm doing great, thanks for asking![S1]That's wonderful to hear "

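The tagged format above can be assembled programmatically. The following is a minimal sketch, not part of the official API: `build_dialogue` is a hypothetical helper name, and it assumes the documented length limit of 1 to 128000 is counted in characters.

```python
from typing import Iterable, Tuple


def build_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker, text) turns into one string tagged with [S1]/[S2]."""
    parts = []
    for speaker, text in turns:
        # Only the [S1] and [S2] tags are documented for this model.
        if speaker not in (1, 2):
            raise ValueError("only speakers 1 and 2 are documented")
        parts.append(f"[S{speaker}]{text.strip()}")
    script = "".join(parts)
    # Assumption: the 1 - 128000 limit is measured in characters.
    if not 1 <= len(script) <= 128000:
        raise ValueError("input must be 1 to 128000 characters long")
    return script


script = build_dialogue([
    (1, "Hello, how are you today?"),
    (2, "I'm doing great, thanks for asking!"),
    (1, "That's wonderful to hear"),
])
print(script)
```

The resulting string can be passed as the `input` field of the request body.
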
max_tokens

integer

default:2048

The maximum number of tokens to generate. Input and output tokens combined must not exceed 32k.

Example:

4096

references

object[]

The voice and references fields are mutually exclusive. To use scripted dialogue, pass two reference voices through the references field. Scripted dialogue is only available for the MOSS-TTSD model.

voice

enum<string>

The "voice" field does not currently support two timbres. To supply two timbres, use the "references" field instead.

Available options:

fnlp/MOSS-TTSD-v0.5:alex,

fnlp/MOSS-TTSD-v0.5:anna,

fnlp/MOSS-TTSD-v0.5:bella,

fnlp/MOSS-TTSD-v0.5:benjamin,

fnlp/MOSS-TTSD-v0.5:charles,

fnlp/MOSS-TTSD-v0.5:claire,

fnlp/MOSS-TTSD-v0.5:david,

fnlp/MOSS-TTSD-v0.5:diana
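
Because voice and references are mutually exclusive, it can help to check the request body on the client before sending it. The sketch below is an illustration under assumptions: `build_payload` and `PRESET_VOICES` are hypothetical names, and the entries of `references` are passed through untouched because their child attributes are not listed on this page.

```python
from typing import Any, Dict, List, Optional

MODEL = "fnlp/MOSS-TTSD-v0.5"

# Preset voices copied from the "Available options" list for the voice field.
PRESET_VOICES = {
    f"{MODEL}:{name}"
    for name in ("alex", "anna", "bella", "benjamin",
                 "charles", "claire", "david", "diana")
}


def build_payload(text: str,
                  voice: Optional[str] = None,
                  references: Optional[List[Dict[str, Any]]] = None
                  ) -> Dict[str, Any]:
    """Build a request body that never carries both voice and references."""
    if voice is not None and references is not None:
        raise ValueError("voice and references are mutually exclusive")
    payload: Dict[str, Any] = {"model": MODEL, "input": text}
    if voice is not None:
        if voice not in PRESET_VOICES:
            raise ValueError(f"unknown preset voice: {voice}")
        payload["voice"] = voice
    if references is not None:
        # Opaque pass-through: the object structure is not described here.
        payload["references"] = references
    return payload


payload = build_payload("[S1]Hello there.", voice=f"{MODEL}:alex")
print(payload)
```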

response_format

enum<string>

default:mp3

The output audio format. Supported formats are mp3, opus, wav, and pcm.

Available options:

mp3,

opus,

wav,

pcm

sample_rate

number

default:32000

Controls the output sample rate. Supported values and defaults differ by audio output format:

opus: supports 48000 Hz.

wav, pcm: support 8000, 16000, 24000, 32000, and 44100 Hz; the default is 44100 Hz.

mp3: supports 32000 and 44100 Hz; the default is 44100 Hz.
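
The per-format rules for sample_rate can be captured in a small lookup table. This is a sketch under assumptions: `resolve_sample_rate` is a hypothetical helper, and because no default is stated for opus, the function returns `None` in that case so the field can be omitted and the choice left to the server.

```python
from typing import Optional

# Supported sample rates per response_format, as listed in the description.
SUPPORTED_RATES = {
    "opus": (48000,),
    "wav": (8000, 16000, 24000, 32000, 44100),
    "pcm": (8000, 16000, 24000, 32000, 44100),
    "mp3": (32000, 44100),
}

# Per-format defaults stated in the description (none is given for opus).
DEFAULT_RATES = {"wav": 44100, "pcm": 44100, "mp3": 44100}


def resolve_sample_rate(response_format: str,
                        sample_rate: Optional[int] = None) -> Optional[int]:
    """Return a sample rate that is valid for the chosen output format."""
    if response_format not in SUPPORTED_RATES:
        raise ValueError(f"unsupported response_format: {response_format}")
    if sample_rate is None:
        # None means "omit the field" when no default is documented.
        return DEFAULT_RATES.get(response_format)
    if sample_rate not in SUPPORTED_RATES[response_format]:
        raise ValueError(
            f"{sample_rate} Hz is not supported for {response_format}")
    return sample_rate


print(resolve_sample_rate("mp3"))
print(resolve_sample_rate("wav", 16000))
```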

stream

boolean

default:true

Whether to stream the response.

speed

number<float>

default:1

The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.

Required range: 0.25 <= x <= 4

gain

number<float>

default:0

Required range: -10 <= x <= 10
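
The documented ranges for speed and gain can be enforced before a request is sent. The helper names below (`check_range`, `audio_controls`) are hypothetical, and the sketch only mirrors the ranges and defaults listed above; it makes no claim about the unit of gain.

```python
from typing import Dict


def check_range(name: str, value: float, low: float, high: float) -> float:
    """Return value as a float, or raise if it is outside [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{name} must satisfy {low} <= x <= {high}")
    return float(value)


def audio_controls(speed: float = 1.0, gain: float = 0.0) -> Dict[str, float]:
    """Build the speed and gain fields using the documented ranges."""
    return {
        "speed": check_range("speed", speed, 0.25, 4.0),
        "gain": check_range("gain", gain, -10.0, 10.0),
    }


print(audio_controls(speed=1.5, gain=-3))
```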

Response

Generates audio from the input text. The response body is binary audio data that the caller must process themselves. Reference: https://docs.siliconflow.cn/capabilities/text-to-speech#5

The response header contains the x-siliconcloud-trace-id field, a unique identifier for tracing the request that helps with log queries and troubleshooting.

The response is of type file.
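
One way to handle the binary response is to write it to a file in chunks and keep the trace ID for troubleshooting. The following sketch uses only the Python standard library; `build_request` and `save_speech` are hypothetical helper names, and `sk-xx` is the placeholder key from the cURL sample, not a working credential.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.siliconflow.cn/v1/audio/speech"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str,
                chunk_size: int = 8192) -> Optional[str]:
    """Send the request, write the audio bytes to path, return the trace ID."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        trace_id = response.headers.get("x-siliconcloud-trace-id")
        # Read in chunks so a streamed response is written as it arrives.
        while chunk := response.read(chunk_size):
            out.write(chunk)
    return trace_id


request = build_request("sk-xx", {
    "model": "fnlp/MOSS-TTSD-v0.5",
    "input": "[S1]Hello, how are you today?[S2]I'm doing great!",
    "voice": "fnlp/MOSS-TTSD-v0.5:alex",
    "response_format": "mp3",
    "stream": True,
})
# With a real API key, the call below performs the network request:
# trace_id = save_speech(request, "speech.mp3")
```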
