1. Use cases

Text-to-Speech (TTS) models are AI models that convert text into spoken output. They generate natural, expressive speech from input text and are suitable for a variety of use cases:

  • Providing audio readings for blog articles
  • Generating multilingual speech content
  • Supporting real-time streaming audio output

2. API usage guide

  • Endpoint: /audio/speech. For detailed usage, refer to the API documentation.
  • Main request parameters:
    • model: The model used for speech synthesis; see the list of supported models below.
    • input: The text content to be converted into audio.
    • voice: The reference voice, supporting system-predefined voices, user-predefined voices, and user-dynamic voices. For detailed parameters, refer to Creating a text-to-speech request.
    • speed: Controls the audio speed; a float with a default value of 1.0 and a selectable range of [0.25, 4.0].
    • gain: Audio gain in dB, controlling the volume; a float with a default value of 0.0 and a selectable range of [-10, 10].
    • response_format: Controls the output format, supporting mp3, opus, wav, and pcm. Different output formats support different sampling rates.
    • sample_rate: Controls the output sampling rate; the default value and selectable range depend on the output format:
      • opus: Currently supports only 48000 Hz.
      • wav, pcm: Supports 8000, 16000, 24000, 32000, and 44100; default is 44100.
      • mp3: Supports 32000 and 44100; default is 44100.
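The parameters above combine into a single JSON request body. The sketch below only assembles and serializes such a body; the endpoint URL comes from this document, while the API key and input text are placeholders, and the actual HTTP call is left as a comment.

```python
import json

API_KEY = "your-api-key"  # get from https://cloud.siliconflow.cn/account/ak
URL = "https://api.siliconflow.cn/v1/audio/speech"

# Assemble the request body from the parameters described above.
payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "Hello, welcome to the text-to-speech service.",
    "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",  # system-predefined voice
    "speed": 1.0,           # float, range [0.25, 4.0], default 1.0
    "gain": 0.0,            # dB, range [-10, 10], default 0.0
    "response_format": "mp3",
    "sample_rate": 44100,   # must be valid for the chosen format (see above)
}

body = json.dumps(payload)

# With the requests library installed, the call would look like:
# resp = requests.post(URL, data=body,
#                      headers={"Authorization": f"Bearer {API_KEY}",
#                               "Content-Type": "application/json"})
# open("speech.mp3", "wb").write(resp.content)
```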

2.1 System-predefined voices:

Currently, the system provides the following 8 voice options:

  • male voices:

    • steady male voice: alex
    • deep male voice: benjamin
    • magnetic male voice: charles
    • cheerful male voice: david

  • female voices:

    • steady female voice: anna
    • passionate female voice: bella
    • gentle female voice: claire
    • cheerful female voice: diana

Online listening for the above voices is available.

When using the system-predefined voices in the request, you need to prepend the model name, such as:

FunAudioLLM/CosyVoice2-0.5B:alex indicates the alex voice from the FunAudioLLM/CosyVoice2-0.5B model.

fishaudio/fish-speech-1.5:anna indicates the anna voice from the fishaudio/fish-speech-1.5 model.

RVC-Boss/GPT-SoVITS:david indicates the david voice from the RVC-Boss/GPT-SoVITS model.
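The model-prefixed naming shown above is plain string concatenation. A small helper (hypothetical, not part of any SDK) makes the format explicit:

```python
def system_voice(model: str, voice: str) -> str:
    """Build the fully qualified voice parameter in '<model>:<voice>' form."""
    return f"{model}:{voice}"

print(system_voice("FunAudioLLM/CosyVoice2-0.5B", "alex"))
# → FunAudioLLM/CosyVoice2-0.5B:alex
```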

2.2 User-predefined voices:

Note: Using user-predefined voices requires real-name authentication.
To ensure the quality of the generated voice, it is recommended that users upload a voice sample that is 8 to 10 seconds long, with clear pronunciation and no background noise or interference.
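The 8-to-10-second recommendation can be checked programmatically before uploading. The sketch below uses only the standard-library wave module, so it applies to WAV samples; checking an mp3 would need a third-party library such as mutagen.

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def is_good_reference_length(path: str) -> bool:
    """True if the sample falls inside the recommended 8-10 second window."""
    return 8.0 <= wav_duration_seconds(path) <= 10.0
```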

2.2.1 Upload user-predefined voices using base64 encoding format

import requests
import json

url = "https://api.siliconflow.cn/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key", # get from https://cloud.siliconflow.cn/account/ak
    "Content-Type": "application/json"
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B", # model name
    "customName": "your-voice-name", # user-defined voice name
    "audio": "data:audio/mpeg;base64,SUQzBAAAAAAAIlRTU0UAAAAOAAADTGF2ZjYxLjcuMTAwAAAAAAAAAAAAAAD/40DAAAAAAAAAAAAASW5mbwAAAA8AAAAWAAAJywAfHx8fKioqKio1NTU1Pz8/Pz9KSkpKVVVVVVVfX19fampqamp1dXV1f39/f3+KioqKlZWVlZWfn5+fn6qqqqq1tbW1tb+/v7/KysrKytXV1dXf39/f3+rq6ur19fX19f////", # base64-encoded reference audio
    "text": "在一无所知中, 梦里的一天结束了,一个新的轮回便会开始" # transcript of the reference audio
}

response = requests.post(url, headers=headers, data=json.dumps(data))

# Print the response status code and body
print(response.status_code)
print(response.json())  # if the response is JSON

The URI field in the response is the ID of the custom voice. Users can use this ID as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

In subsequent speech requests, pass this voice ID as the voice parameter (see section 5.2).
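The audio field in the upload above is a data URI. A local file can be converted to that form with the standard library; the file path and MIME type here are illustrative.

```python
import base64

def audio_to_data_uri(path: str, mime: str = "audio/mpeg") -> str:
    """Encode a local audio file as a 'data:<mime>;base64,<payload>' string."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```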

2.2.2 Upload user-predefined voices through a file

import requests

url = "https://api.siliconflow.cn/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key" # get from https://cloud.siliconflow.cn/account/ak
}
files = {
    "file": open("/Users/senseb/Downloads/fish_audio-Alex.mp3", "rb") # reference audio file
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B", # model name
    "customName": "your-voice-name", # reference audio name
    "text": "在一无所知中, 梦里的一天结束了,一个新的轮回便会开始" # transcript of the reference audio
}

response = requests.post(url, headers=headers, files=files, data=data)

print(response.status_code)
print(response.json())  # print the response body (if JSON)

The URI field in the response is the ID of the custom voice. Users can use this ID as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

In subsequent speech requests, pass this voice ID as the voice parameter (see section 5.2).

2.3 Get the list of user-dynamic voices

import requests

url = "https://api.siliconflow.cn/v1/audio/voice/list"

headers = {
    "Authorization": "Bearer your-api-key" # get from https://cloud.siliconflow.cn/account/ak
}
response = requests.get(url, headers=headers)

print(response.status_code)
print(response.json()) # print the response body (if JSON)

Each entry in the response contains a uri field that identifies a custom voice. Users can use this ID as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

In subsequent requests, pass the chosen uri as the voice parameter.
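Assuming the list endpoint wraps the voices in a result array (the field names here are an assumption; verify them against the actual payload), the voice IDs can be collected like this:

```python
# Hypothetical response body; check the real field names against the API.
sample_response = {
    "result": [
        {
            "customName": "your-voice-name",
            "uri": "speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd",
        }
    ]
}

# Collect every uploaded voice ID for later use as the voice parameter.
voice_ids = [entry["uri"] for entry in sample_response.get("result", [])]
print(voice_ids)
```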

2.4 Use user-dynamic voices

Note: Using user-dynamic voices requires real-name authentication.
Use the user-dynamic voices in the request as shown in section 5.3.

2.5 Delete user-dynamic voices

import requests

url = "https://api.siliconflow.cn/v1/audio/voice/deletions"
headers = {
    "Authorization": "Bearer your-api-key",
    "Content-Type": "application/json"
}
payload = {
    "uri": "speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd"
}

response = requests.post(url, json=payload, headers=headers)

print(response.status_code)
print(response.text) # print the response body


3. List of supported models

Note: The supported TTS models may change. Filter by the "Speech" tag on the Model Square page to obtain the current list of supported models.
Charging method: Charged based on the number of UTF-8 bytes of the input text. An online byte counter demo is available.

3.1 fishaudio/fish-speech series models

Note: The current fishaudio/fish-speech series models only support payment with recharge balance. Please ensure that your account has sufficient recharge balance before use.
  • fish-speech-1.5 supported languages: Chinese, English, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese
  • fish-speech-1.4 supported languages: Chinese, English, Japanese, German, French, Spanish, Korean, Arabic

3.2 RVC-Boss/GPT-SoVITS series models

  • Zero-shot text-to-speech (TTS): Generate speech instantly from a 5-second audio sample.
  • Cross-language support: Support inference in languages different from the training dataset. Currently supports English, Japanese, Korean, Cantonese, and Chinese.

3.3 FunAudioLLM/CosyVoice2-0.5B series models

  • Cross-Language text-to-speech: Achieve text-to-speech synthesis across different languages, including Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghai dialect, Zhengzhou dialect, Changsha dialect, Tianjin dialect)
  • Emotional control: Support the generation of speech with various emotional expressions, including happiness, excitement, sadness, anger, etc.
  • Fine-grained control: Control the emotional and rhythmic aspects of the generated speech through rich text or natural language.

4. Best practices for reference audio

Providing high-quality reference audio samples can enhance the cloning effect.

4.1 Audio quality guidelines

  • Single speaker only
  • Stable volume, pitch, and emotion
  • Short pauses (0.5 seconds suggested)
  • Ideal scenario: No background noise, professional recording quality, no room echo

4.2 File formats

  • Supported formats: mp3, wav, pcm, opus
  • Recommended: mp3 with a bitrate of 192 kbps or higher to avoid quality loss
  • Additional benefits of uncompressed formats (e.g., WAV) are limited

5. Usage examples

5.1 Use system-predefined voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="your-api-key", # get from https://cloud.siliconflow.cn/account/ak
    base_url="https://api.siliconflow.cn/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B", # supports the fishaudio / GPT-SoVITS / CosyVoice2-0.5B series models
  voice="FunAudioLLM/CosyVoice2-0.5B:alex", # system-predefined voice
  # user input
  input="你能用高兴的情感说吗?<|endofprompt|>今天真是太开心了,马上要放假了!I'm so happy, Spring Festival is coming!",
  response_format="mp3" # supports mp3, wav, pcm, opus
) as response:
    response.stream_to_file(speech_file_path)

5.2 Use user-predefined voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="your-api-key", # get from https://cloud.siliconflow.cn/account/ak
    base_url="https://api.siliconflow.cn/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B", # supports the fishaudio / GPT-SoVITS / CosyVoice2-0.5B series models
  voice="speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd", # user-uploaded voice ID (see section 2.2)
  # user input
  input="请问你能模仿粤语的口音吗?<|endofprompt|>多保重,早休息。",
  response_format="mp3"
) as response:
    response.stream_to_file(speech_file_path)

5.3 Use user-dynamic voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="your-api-key", # get from https://cloud.siliconflow.cn/account/ak
    base_url="https://api.siliconflow.cn/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B",
  voice="", # pass an empty value here to use a dynamic voice
  # user input
  input="[laughter]有时候,看着小孩子们的天真行为[laughter],我们总会会心一笑。",
  response_format="mp3",
  extra_body={"references":[
        {
            "audio": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/voice_template/fish_audio-Alex.mp3", # reference audio URL; base64 format is also supported
            "text": "在一无所知中, 梦里的一天结束了,一个新的轮回便会开始", # transcript of the reference audio
        }
    ]}
) as response:
    response.stream_to_file(speech_file_path)