Text to speech

1. Use cases

ext-to-Speech (TTS) models are AI models that convert text information into spoken output. These models generate natural and expressive speech from input text, suitable for various use cases:

Providing audio readings for blog articles
Generating multilingual speech content
Supporting real-time streaming audio output

2. API usage guide

Endpoint: /audio/speech. For detailed usage, refer to the API documentation
Main request parameters:：
- model: The model used for speech synthesis, with a list of supported models。
- input: The text content to be converted into audio.
- voice: The reference voice, supporting system-predefined voices、user-predefined voices、user-dynamic voices. Detailed parameters: Refer toCreating a text-to-speech request。
- speed: Controls the audio speed, a float type with a default value of 1.0 and a selectable range of [0.25, 4.0].
- gain: Audio gain in dB, controls the volume of the audio, a float type with a default value of 0.0 and a selectable range of [-10, 10].
- response_format: Controls the output format, supporting mp3, opus, wav, and pcm formats. Different output formats result in different sampling rates.
- sample_rate: Controls the output sampling rate, with different default values and selectable ranges depending on the output format:
  - opus: Currently supports only 48000 Hz.
  - wav, pcm: Supports (8000, 16000, 24000, 32000, 44100), default is 44100.
  - mp3: Supports (32000, 44100), default is 44100.

2.1 System-predefined voices:

Currently, the system provides the following 8 voice options:

male voices:
- steady male voice: alex
- deep male voice: benjamin
- magnetic male voice: charles
- cheerful male voice: david
female voice:
- steady female voice: anna
- passionate female voice: bella
- gentle female voice: claire
- cheerful female voice: diana

online listeningof the above audio. When using the system-predefined voices in the request, you need to prepend the model name, such as: FunAudioLLM/CosyVoice2-0.5B:alex indicates the alex voice from the FunAudioLLM/CosyVoice2-0.5B model.

2.2 User-predefined voices:

Note: Using user-predefined voices requires real-name authentication.

To ensure the quality of the generated voice, it is recommended that users upload a voice sample that is 8 to 10 seconds long, with clear pronunciation and no background noise or interference.

2.2.1 Upload user-predefined voices using base64 encoding format

import requests
import json

url = "https://api.siliconflow.cn/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key", # 从 https://cloud.siliconflow.cn/account/ak 获取
    "Content-Type": "application/json"
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B", # model name
    "customName": "your-voice-name", # user-predefined voices name
    "audio": "data:audio/mpeg;base64,SUQzBAAAAAAAIlRTU0UAAAAOAAADTGF2ZjYxLjcuMTAwAAAAAAAAAAAAAAD/40DAAAAAAAAAAAAASW5mbwAAAA8AAAAWAAAJywAfHx8fKioqKio1NTU1Pz8/Pz9KSkpKVVVVVVVfX19fampqamp1dXV1f39/f3+KioqKlZWVlZWfn5+fn6qqqqq1tbW1tb+/v7/KysrKytXV1dXf39/f3+rq6ur19fX19f////", # 参考音频的 base64 编码
    "text": "在一无所知中, 梦里的一天结束了，一个新的轮回便会开始" # 参考音频的文字内容
}

response = requests.post(url, headers=headers, data=json.dumps(data))

# 打印响应状态码和响应内容
print(response.status_code)
print(response.json())  # 如果响应是 JSON 格式

The URI field in the response is the ID of the custom voice. Users can use this ID as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

In the request, use the user-predefined voices as indicated.

2.2.2 Upload user-predefined voices through a file

import requests

url = "https://api.siliconflow.cn/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key" # 从 https://cloud.siliconflow.cn/account/ak 获取
}
files = {
    "file": open("/Users/senseb/Downloads/fish_audio-Alex.mp3", "rb") # 参考音频文件
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B", # 模型名称
    "customName": "your-voice-name", # 参考音频名称
    "text": "在一无所知中, 梦里的一天结束了，一个新的轮回便会开始" # 参考音频的文字内容
}

response = requests.post(url, headers=headers, files=files, data=data)

print(response.status_code)
print(response.json())

The URI field in the response is the ID of the custom voice. Users can use this ID as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

In the request, use the user-predefined voices as indicated.

2.3 Get the list of user-dynamic voices

import requests
url = "https://api.siliconflow.cn/v1/audio/voice/list"

headers = {
    "Authorization": "Bearer your-api-key" # 从https://cloud.siliconflow.cn/account/ak获取
}
response = requests.get(url, headers=headers)

print(response.status_code)
print(response.json())

The URI field in the response is the ID of the custom voice. Users can use this ID as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

In the request, use the user-predefined voices as indicated.

2.4 Use user-dynamic voices

Note: Using user-predefined voices requires real-name authentication.

Use the user-dynamic voices in the request as indicated.

2.5 Delete user-dynamic voices

import requests

url = "https://api.siliconflow.cn/v1/audio/voice/deletions"
headers = {
    "Authorization": "Bearer your-api-key",
    "Content-Type": "application/json"
}
payload = {
    "uri": "speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd"
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.status_code)
print(response.text)

The URI field in the response is the ID of the custom voice. Users can use this ID as the voice parameter in subsequent requests.

3. List of supported models

Note: The supported TTS models may be subject to change. Please filter by the “Speech” tag on the 「Models」to obtain the current list of supported models.

Charging method: Charged based on the number of UTF-8 bytes corresponding to the input text. Online byte counter demo.

3.1 FunAudioLLM/CosyVoice2-0.5B series models

Cross-Language text-to-speech: Achieve text-to-speech synthesis across different languages, including Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghai dialect, Zhengzhou dialect, Changsha dialect, Tianjin dialect)
Emotional control: Support the generation of speech with various emotional expressions, including happiness, excitement, sadness, anger, etc.
Fine-grained control: Control the emotional and rhythmic aspects of the generated speech through rich text or natural language.

3.2 fnlp/MOSS-TTSD-v0.5

High-Expressivity Voice: Natural conversational tone with emotional expression support
Dual-Voice Cloning: Zero-shot cloning with automatic speaker switching
Bilingual Support (Chinese & English): Seamless mixed synthesis and natural pronunciation
Long-Form Speech Generation: Low-latency, stable output for lengthy texts

4. Best practices for reference audio

Providing high-quality reference audio samples can enhance the cloning effect.

4.1 Audio quality guidelines

Single speaker only
Clear articulation, steady volume, tone, and emotion
Brief pauses (0.5 seconds recommended)
Ideal: No background noise, professional recording quality, no room echo
Suggested duration: 8 to 10 seconds

4.2 File formats

Supported formats: mp3, wav, pcm, opus
Recommend using mp3 with a bitrate above 192kbps to avoid quality loss
Additional benefits of uncompressed formats (e.g., WAV) are limited

5. Usage examples

5.1 Use system-predefined voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="您的 APIKEY", # 从 https://cloud.siliconflow.cn/account/ak 获取
    base_url="https://api.siliconflow.cn/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B", # 支持 fishaudio / GPT-SoVITS / CosyVoice2-0.5B 系列模型
  voice="FunAudioLLM/CosyVoice2-0.5B:alex", # 系统预置音色
  # 用户输入信息
  input="你能用高兴的情感说吗？<|endofprompt|>今天真是太开心了，马上要放假了！I'm so happy, Spring Festival is coming!",
  response_format="mp3" # 支持 mp3, wav, pcm, opus 格式
) as response:
    response.stream_to_file(speech_file_path)

5.2 Use user-predefined voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="您的 APIKEY", # 从 https://cloud.siliconflow.cn/account/ak 获取
    base_url="https://api.siliconflow.cn/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B", # 支持 fishaudio / GPT-SoVITS / CosyVoice2-0.5B 系列模型
  voice="speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd", # 用户上传音色名称，参考
  # 用户输入信息
  input=" 请问你能模仿粤语的口音吗？< |endofprompt| >多保重，早休息。",
  response_format="mp3"
) as response:
    response.stream_to_file(speech_file_path)

5.3 Use user-dynamic voices

from pathlib import Path
from openai import OpenAI
client = OpenAI()

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="您的 APIKEY", # 从 https://cloud.siliconflow.cn/account/ak 获取
    base_url="https://api.siliconflow.cn/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B", 
  voice="", # 此处传入空值，表示使用动态音色
  # 用户输入信息
  input="  [laughter]有时候，看着小孩子们的天真行为[laughter]，我们总会会心一笑。",
  response_format="mp3",
  extra_body={"references":[
        {
            "audio": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/voice_template/fish_audio-Alex.mp3", # 参考音频 url。也支持 base64 格式
            "text": "在一无所知中, 梦里的一天结束了，一个新的轮回便会开始", # 参考音频的文字内容
        }
    ]}
) as response:
    response.stream_to_file(speech_file_path)

5.4 fnlp/MOSS-TTSD-v0.5 usage examples

# -*- coding: utf-8 -*-
import requests
import json

url = "https://api.siliconflow.cn/v1/audio/speech"
token = "Your-api-key"

request_data = {
                "model": "fnlp/MOSS-TTSD-v0.5",
                "stream": True,
                "input": "[S1]Hello, how are you today?[S2]I'm doing great, thanks for asking![S1]That's wonderful to hear.",
                "references": [
                    {
                        "audio": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/voice_template/fish_audio-Charles.mp3",
                        "text": "他又躺在那里，眼睛闭着，仍然沉浸在梦境的气氛里。那是个庞杂而亮堂的梦",
                    },
                    {
                        "audio": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/voice_template/fish_audio-Claire.mp3",
                        "text": "他又躺在那里，眼睛闭着，仍然沉浸在梦境的气氛里。那是个庞杂而亮堂的梦"
                    }
                    ],
                "max_tokens": 1600,
                "response_format": "mp3",
                "speed": 1,
                "gain": 0
                }
headers = {'Content-Type': 'application/json', 'Authorization': "Bearer " + token}
try:
    res = requests.post(url=url, data=json.dumps(request_data), headers=headers)
    if res.status_code != 200:
        print(res.text)
    with open('./test.mp3', 'wb') as file:
        file.write(res.content)
except Exception as e:
    print(request_data)
    print(e)

GET STARTED

CAPABILITIES

FEATURES

RATE LIMITS

SILICONFLOW PRODUCT SUITE

1. Use cases

2. API usage guide

2.1 System-predefined voices:

2.2 User-predefined voices:

2.2.1 Upload user-predefined voices using base64 encoding format

2.2.2 Upload user-predefined voices through a file

2.3 Get the list of user-dynamic voices

2.4 Use user-dynamic voices

2.5 Delete user-dynamic voices

3. List of supported models

3.1 FunAudioLLM/CosyVoice2-0.5B series models

3.2 fnlp/MOSS-TTSD-v0.5

4. Best practices for reference audio

4.1 Audio quality guidelines

4.2 File formats

5. Usage examples

5.1 Use system-predefined voices

5.2 Use user-predefined voices

5.3 Use user-dynamic voices

5.4 fnlp/MOSS-TTSD-v0.5 usage examples

GET STARTED

CAPABILITIES

FEATURES

RATE LIMITS

SILICONFLOW PRODUCT SUITE

​1. Use cases

​2. API usage guide

​2.1 System-predefined voices:

​2.2 User-predefined voices:

​2.2.1 Upload user-predefined voices using base64 encoding format

​2.2.2 Upload user-predefined voices through a file

​2.3 Get the list of user-dynamic voices

​2.4 Use user-dynamic voices

​2.5 Delete user-dynamic voices

​3. List of supported models

​3.1 FunAudioLLM/CosyVoice2-0.5B series models

​3.2 fnlp/MOSS-TTSD-v0.5

​4. Best practices for reference audio

​4.1 Audio quality guidelines

​4.2 File formats

​5. Usage examples

​5.1 Use system-predefined voices

​5.2 Use user-predefined voices

​5.3 Use user-dynamic voices

​5.4 fnlp/MOSS-TTSD-v0.5 usage examples

1. Use cases

2. API usage guide

2.1 System-predefined voices:

2.2 User-predefined voices:

2.2.1 Upload user-predefined voices using base64 encoding format

2.2.2 Upload user-predefined voices through a file

2.3 Get the list of user-dynamic voices

2.4 Use user-dynamic voices

2.5 Delete user-dynamic voices

3. List of supported models

3.1 FunAudioLLM/CosyVoice2-0.5B series models

3.2 fnlp/MOSS-TTSD-v0.5

4. Best practices for reference audio

4.1 Audio quality guidelines

4.2 File formats

5. Usage examples

5.1 Use system-predefined voices

5.2 Use user-predefined voices

5.3 Use user-dynamic voices

5.4 fnlp/MOSS-TTSD-v0.5 usage examples