Authorizations
Use the following format for authentication: Bearer <your api key>
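As an illustration of the Bearer format above, the following minimal sketch builds the request headers. The helper name `auth_headers` is hypothetical, and the `Content-Type` value is an assumption, since this page does not state the body encoding.

```python
def auth_headers(api_key: str) -> dict:
    """Build request headers that carry the API key as a Bearer token."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        "Authorization": f"Bearer {api_key}",
        # Assumption: the body is sent as JSON; the page does not say so explicitly.
        "Content-Type": "application/json",
    }
```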
Body
MOSS-TTSD (text to spoken dialogue) is an open-source bilingual spoken dialogue synthesis model that supports both Chinese and English. It can transform dialogue scripts between two speakers into natural, expressive conversational speech. MOSS-TTSD supports voice cloning and long single-session speech generation, making it ideal for AI podcast production.
To improve service quality, we periodically change the models provided by this service, including but not limited to bringing models online or taking them offline and adjusting model service capabilities. Where feasible, we will notify you of such changes through appropriate means such as announcements or message pushes.
fnlp/MOSS-TTSD-v0.5
The dialogue text uses speaker tags to indicate turns: [S1] indicates that Speaker 1 is speaking, and [S2] indicates that Speaker 2 is speaking.
1 - 128000
"[S1]Hello, how are you today?[S2]I'm doing great, thanks for asking![S1]That's wonderful to hear "
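A tagged script like the example above can be assembled from a list of turns. This is a minimal sketch; the helper name `format_dialogue` and the (speaker, text) tuple layout are hypothetical conveniences, not part of the API.

```python
def format_dialogue(turns):
    """Join (speaker, text) pairs into a tagged script such as "[S1]Hi[S2]Hello"."""
    parts = []
    for speaker, text in turns:
        # The page describes only two speaker tags, [S1] and [S2].
        if speaker not in (1, 2):
            raise ValueError("only speakers 1 and 2 are supported")
        parts.append(f"[S{speaker}]{text}")
    return "".join(parts)


script = format_dialogue([
    (1, "Hello, how are you today?"),
    (2, "I'm doing great, thanks for asking!"),
])
```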
The maximum number of tokens to generate. The combined input and output must not exceed 32k tokens.
4096
The voice and references fields are mutually exclusive. For scripted dialogue, pass two timbres through the references field. Scripted dialogue is only available for the MOSS model.
The voice field does not currently support two timbres. To supply two timbres, use the references field.
fnlp/MOSS-TTSD-v0.5:alex, fnlp/MOSS-TTSD-v0.5:anna, fnlp/MOSS-TTSD-v0.5:bella, fnlp/MOSS-TTSD-v0.5:benjamin, fnlp/MOSS-TTSD-v0.5:charles, fnlp/MOSS-TTSD-v0.5:claire, fnlp/MOSS-TTSD-v0.5:david, fnlp/MOSS-TTSD-v0.5:diana
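The mutual-exclusion rule for voice and references can be checked on the client before sending a request. The sketch below assumes the payload is a plain dictionary keyed by `voice` and `references`; the structure of each references entry is not described on this page, so entries are treated as opaque values. The helper name `check_voice_fields` is hypothetical.

```python
PRESET_VOICES = {
    f"fnlp/MOSS-TTSD-v0.5:{name}"
    for name in ("alex", "anna", "bella", "benjamin",
                 "charles", "claire", "david", "diana")
}


def check_voice_fields(payload: dict) -> None:
    """Raise ValueError if voice and references are combined or voice is unknown."""
    voice = payload.get("voice")
    references = payload.get("references")
    if voice is not None and references is not None:
        raise ValueError("voice and references are mutually exclusive")
    if voice is not None and voice not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {voice}")
```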
The output audio format. Supported formats are mp3, opus, wav, and pcm.
mp3, opus, wav, pcm
Controls the output sample rate. The supported values and defaults differ by audio output format, as follows: opus: Supports 48000 Hz. wav, pcm: Support 8000, 16000, 24000, 32000, and 44100 Hz, with a default of 44100 Hz. mp3: Supports 32000 and 44100 Hz, with a default of 44100 Hz.
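The per-format sample rate rules can be expressed as a small lookup table. This is a sketch only: the helper name `resolve_sample_rate` is hypothetical, and the opus default of 48000 Hz is an inference from it being the only supported value, since the page does not state an opus default.

```python
SUPPORTED_RATES = {
    "opus": (48000,),
    "wav": (8000, 16000, 24000, 32000, 44100),
    "pcm": (8000, 16000, 24000, 32000, 44100),
    "mp3": (32000, 44100),
}

DEFAULT_RATES = {
    "opus": 48000,  # Assumption: inferred, as 48000 is the only supported value.
    "wav": 44100,
    "pcm": 44100,
    "mp3": 44100,
}


def resolve_sample_rate(audio_format: str, sample_rate=None) -> int:
    """Return a valid sample rate for the format, falling back to its default."""
    if audio_format not in SUPPORTED_RATES:
        raise ValueError(f"unsupported format: {audio_format}")
    if sample_rate is None:
        return DEFAULT_RATES[audio_format]
    if sample_rate not in SUPPORTED_RATES[audio_format]:
        raise ValueError(f"{audio_format} does not support {sample_rate} Hz")
    return sample_rate
```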
Whether to stream the output.
The speed of the generated audio. Select a value from 0.25 to 4.0. The default is 1.0.
0.25 <= x <= 4
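The speed range can be enforced before a request is sent. This is a minimal sketch with a hypothetical helper name, `validate_speed`; whether the service rejects or clamps out-of-range values is not stated on this page, so the sketch simply raises an error.

```python
def validate_speed(speed: float = 1.0) -> float:
    """Return speed unchanged if it lies within 0.25 <= speed <= 4, else raise."""
    if not 0.25 <= speed <= 4:
        raise ValueError("speed must satisfy 0.25 <= speed <= 4")
    return speed
```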
-10 <= x <= 10
Response
Generates audio from the input text. The interface returns binary data, which the caller must process. Reference: https://docs.siliconflow.cn/capabilities/text-to-speech#5
The response is of type file.
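Because the response is raw binary audio, one way to handle it is to write the bytes straight to a file. The sketch below uses only the Python standard library. The function names are hypothetical, the endpoint URL is not given in this section and must be supplied by the caller, and the JSON body encoding is an assumption.

```python
import json
import urllib.request


def save_audio(audio: bytes, out_path: str) -> int:
    """Write raw audio bytes to out_path and return the number of bytes written."""
    with open(out_path, "wb") as handle:
        return handle.write(audio)


def synthesize_to_file(url: str, api_key: str, payload: dict, out_path: str) -> int:
    """POST the payload and save the binary response body to out_path."""
    request = urllib.request.Request(
        url,  # Assumption: the caller supplies the endpoint; it is not listed here.
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # Assumption: JSON body.
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    return save_audio(audio, out_path)
```

The file extension chosen for `out_path` should match the requested output format, for example `.mp3` when the format is mp3.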