POST /audio/speech
Create Speech
curl --request POST \
  --url https://api.siliconflow.cn/v1/audio/speech \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "fnlp/MOSS-TTSD-v0.5",
  "input": "[S1]Hello, how are you today?[S2]I'\''m doing great, thanks for asking![S1]That'\''s wonderful to hear "
}'

Authorizations

Authorization
string
header
required

Use the following format for authentication: Bearer <your api key>

Body

application/json
model
enum<string>
required

MOSS-TTSD (text to spoken dialogue) is an open-source bilingual spoken dialogue synthesis model that supports both Chinese and English. It can transform dialogue scripts between two speakers into natural, expressive conversational speech. MOSS-TTSD supports voice cloning and long single-session speech generation, making it ideal for AI podcast production.

To better enhance service quality, we will make periodic changes to the models provided by this service, including but not limited to model on/offlining and adjustments to model service capabilities. We will notify you of such changes through appropriate means such as announcements or message pushes where feasible.

Available options:
fnlp/MOSS-TTSD-v0.5
input
string
required

The dialogue text uses speaker tags to mark turns: [S1] indicates that Speaker 1 is speaking, and [S2] indicates that Speaker 2 is speaking.

Required string length: 1 - 128000
Example:

"[S1]Hello, how are you today?[S2]I'm doing great, thanks for asking![S1]That's wonderful to hear "

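The tagged input format can be assembled programmatically. The following is a minimal sketch, not part of the official API: `build_dialogue` and `MAX_INPUT_CHARS` are hypothetical names, and it assumes the 1 - 128000 length limit is counted in characters.

```python
MAX_INPUT_CHARS = 128000  # upper bound from "Required string length" (assumed to be characters)


def build_dialogue(turns):
    """Join (speaker, text) pairs into a tagged script such as "[S1]Hi[S2]Hello"."""
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        parts.append(f"[S{speaker}]{text}")
    script = "".join(parts)
    # Reject empty or oversized scripts before sending them to the API.
    if not 1 <= len(script) <= MAX_INPUT_CHARS:
        raise ValueError("input length must be between 1 and 128000")
    return script


script = build_dialogue([
    (1, "Hello, how are you today?"),
    (2, "I'm doing great, thanks for asking!"),
    (1, "That's wonderful to hear"),
])
```

The resulting `script` string can be passed as the `input` field of the request body.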
max_tokens
integer
default:2048

The maximum number of tokens to generate. The combined input and output must not exceed 32k tokens.

Example:

4096

references
object[]

The voice and references fields are mutually exclusive. For scripted dialogue, pass two reference voices (timbres) through the references field. Scripted dialogue is available only for the MOSS model.

voice
enum<string>

The "voice" field does not currently support two timbres. To supply two timbres, use the "references" field instead.

Available options:
fnlp/MOSS-TTSD-v0.5:alex,
fnlp/MOSS-TTSD-v0.5:anna,
fnlp/MOSS-TTSD-v0.5:bella,
fnlp/MOSS-TTSD-v0.5:benjamin,
fnlp/MOSS-TTSD-v0.5:charles,
fnlp/MOSS-TTSD-v0.5:claire,
fnlp/MOSS-TTSD-v0.5:david,
fnlp/MOSS-TTSD-v0.5:diana
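Because voice and references cannot be combined, a client may want to enforce that rule before sending a request. The sketch below is one way to do so; `build_payload` is a hypothetical helper, and the entries of `references` are passed through untouched because their object schema is not described on this page.

```python
def build_payload(script, voice=None, references=None, model="fnlp/MOSS-TTSD-v0.5"):
    """Build a request body, refusing to set both voice and references."""
    if voice is not None and references is not None:
        raise ValueError("pass either voice or references, not both")
    payload = {"model": model, "input": script}
    if voice is not None:
        payload["voice"] = voice
    if references is not None:
        # The reference objects are treated as opaque; their fields are not documented here.
        payload["references"] = list(references)
    return payload


payload = build_payload(
    "[S1]Hello, how are you today?[S2]I'm doing great, thanks for asking!",
    voice="fnlp/MOSS-TTSD-v0.5:alex",
)
```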
response_format
enum<string>
default:mp3

The output audio format. Supported formats are mp3, opus, wav, and pcm.

Available options:
mp3,
opus,
wav,
pcm
sample_rate
number
default:32000

Controls the output sample rate. The supported values and defaults differ by audio output format, as follows:
opus: Supports 48000 Hz.
wav, pcm: Support 8000, 16000, 24000, 32000, and 44100 Hz, with a default of 44100 Hz.
mp3: Supports 32000 and 44100 Hz, with a default of 44100 Hz.
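The per-format sample rates listed above can be checked on the client side before a request is sent. This is an illustrative sketch only: `SUPPORTED_RATES` and `check_sample_rate` are hypothetical names, and the table simply mirrors the values stated in this description.

```python
# Sample rates (Hz) per response_format, copied from the description above.
SUPPORTED_RATES = {
    "opus": (48000,),
    "wav": (8000, 16000, 24000, 32000, 44100),
    "pcm": (8000, 16000, 24000, 32000, 44100),
    "mp3": (32000, 44100),
}


def check_sample_rate(response_format, sample_rate):
    """Return sample_rate if it is listed for response_format, else raise ValueError."""
    if response_format not in SUPPORTED_RATES:
        raise ValueError(f"unknown response_format: {response_format!r}")
    allowed = SUPPORTED_RATES[response_format]
    if sample_rate not in allowed:
        raise ValueError(
            f"{response_format} supports {allowed}, got {sample_rate}"
        )
    return sample_rate


rate = check_sample_rate("mp3", 44100)
```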

stream
boolean
default:true

Whether to stream the audio output.

speed
number
default:1

The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.

Required range: 0.25 <= x <= 4
gain
number
default:0
Required range: -10 <= x <= 10
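The documented ranges for speed and gain lend themselves to a simple client-side check. The following sketch assumes nothing beyond the ranges and defaults stated above; `check_range` and `audio_options` are hypothetical helpers.

```python
def check_range(name, value, low, high):
    """Return value if low <= value <= high, else raise ValueError."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def audio_options(speed=1.0, gain=0.0):
    """Validate speed (0.25 to 4) and gain (-10 to 10) and return them as body fields."""
    return {
        "speed": check_range("speed", speed, 0.25, 4.0),
        "gain": check_range("gain", gain, -10, 10),
    }


options = audio_options(speed=1.5, gain=-3)
```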

Response

Generates audio from the input text. The endpoint returns binary data, which the caller must handle, for example by writing it to a file. Reference: https://docs.siliconflow.cn/capabilities/text-to-speech#5

The response is of type file.
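Since the response is binary audio, a client typically writes the bytes to a file. The sketch below mirrors the curl request using only the Python standard library; `build_speech_request` and `save_speech` are hypothetical helper names, `<token>` is a placeholder for your API key, and the `.mp3` file name assumes the default response_format.

```python
import json
import shutil
import urllib.request

API_URL = "https://api.siliconflow.cn/v1/audio/speech"


def build_speech_request(api_key, payload):
    """Prepare a POST request carrying the JSON body and Bearer token."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request, path):
    """Send the request and copy the binary response body to path."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


request = build_speech_request("<token>", {
    "model": "fnlp/MOSS-TTSD-v0.5",
    "input": "[S1]Hello, how are you today?[S2]I'm doing great, thanks for asking!",
})
# save_speech(request, "dialogue.mp3")  # uncomment to call the API with a real key
```

Copying the response in chunks with `shutil.copyfileobj` avoids holding the whole audio file in memory, which is a reasonable choice for long dialogues.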