1. Overview

Reasoning models are AI systems based on deep learning that solve complex tasks through logical deduction, knowledge association, and context analysis. Typical applications include mathematical problem solving, code generation, logical judgment, and multi-step reasoning scenarios. These types of models typically have the following characteristics:

  • Structured thinking: Using techniques like Chain-of-Thought to break down complex problems
  • Knowledge integration: Combining domain knowledge bases with common sense reasoning capabilities
  • Self-correction mechanism: Enhancing result reliability through feedback loops
  • Multimodal processing: Some advanced models support mixed input of text, code, and formulas

2. Supported model list

  • Qwen Series:
    • Tongyi-Zhiwen/QwenLong-L1-32B
    • Qwen/Qwen3-30B-A3B
    • Qwen/Qwen3-32B
    • Qwen/Qwen3-14B
    • Qwen/Qwen3-8B
    • Qwen/Qwen3-235B-A22B
    • Qwen/QwQ-32B
  • THUDM Series:
    • THUDM/GLM-Z1-32B-0414
    • THUDM/GLM-Z1-Rumination-32B-0414
    • THUDM/GLM-Z1-9B-0414
  • deepseek-ai Series:
    • deepseek-ai/DeepSeek-R1
    • Pro/deepseek-ai/DeepSeek-R1
    • deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
    • deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
    • deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
    • Pro/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    • Pro/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

3. Usage recommendations

3.1 API parameters

3.1.1 Input parameters

  • Maximum Chain-of-Thought Length (thinking_budget): the number of tokens the model may use for internal reasoning. Adjust thinking_budget to control the length of the chain-of-thought process.

  • Maximum Response Length (max_tokens): limits the number of tokens in the final output returned to the user, excluding the chain-of-thought portion. Set it as usual to control the maximum length of the response.

  • Maximum Context Length (context_length): the maximum total length, including the user input, the chain of thought, and the output. It is not a request parameter and does not need to be set by the user. (A minimal sketch of the two request parameters follows this list.)
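
As a minimal sketch of how the two request parameters fit together (the model name and token values here are illustrative, and thinking_budget is passed via extra_body because it is not a standard OpenAI SDK argument, matching the full examples in Section 4):

from openai import OpenAI

client = OpenAI(base_url="https://api.siliconflow.cn/v1/", api_key="your api_key")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # illustrative model choice
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    max_tokens=4096,                       # caps the final answer, excluding the chain of thought
    extra_body={"thinking_budget": 1024},  # caps the internal chain-of-thought tokens
)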

The maximum response length, maximum reasoning chain length, and maximum context length supported by different models are shown in the table below:

| Model                      | Maximum Response Length | Maximum Reasoning Chain Length | Maximum Context Length |
|----------------------------|-------------------------|--------------------------------|------------------------|
| DeepSeek-R1                | 16384                   | 32768                          | 98304                  |
| DeepSeek-R1-Distill Series | 16384                   | 32768                          | 131072                 |
| Qwen3 Series               | 8192                    | 32768                          | 131072                 |
| QwQ-32B                    | 32768                   | 16384                          | 131072                 |
| GLM-Z1 Series              | 16384                   | 32768                          | 131072                 |
  • With the reasoning model’s chain-of-thought process decoupled from the response length, the output follows these rules:

    • If the number of tokens generated during the thinking phase reaches the thinking_budget, Qwen3 series reasoning models, which natively support this parameter, forcibly stop the chain-of-thought reasoning. Other reasoning models may continue to output thinking content.
    • If the response length exceeds the max_tokens limit or the total length exceeds the context_length restriction, the response content is truncated and the finish_reason field in the response is set to length, indicating that the output was terminated due to a length constraint (see the sketch below).
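
To make the truncation rule concrete, here is a minimal sketch that detects a length-limited response by inspecting finish_reason (the model name and the deliberately small max_tokens are illustrative):

from openai import OpenAI

client = OpenAI(base_url="https://api.siliconflow.cn/v1/", api_key="your api_key")
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # illustrative model choice
    messages=[{"role": "user", "content": "Summarize the history of the Olympic Games."}],
    max_tokens=64,  # deliberately small so the answer is likely to be truncated
)
if response.choices[0].finish_reason == "length":
    print("Output was cut off by max_tokens or context_length.")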

3.1.2 Return parameters

  • reasoning_content: the reasoning chain content, returned at the same level as content.
  • content: the final answer content (a sketch for reading both fields follows this list).
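
For non-streaming responses, both fields sit on the message object. A short sketch for reading them (reasoning_content is not a field the OpenAI SDK declares, so it is read defensively here; `response` is assumed to come from a non-streaming request as in Section 4.2):

message = response.choices[0].message
answer = message.content                                 # final answer
reasoning = getattr(message, "reasoning_content", None)  # chain of thought, may be absent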

3.2 DeepSeek-R1 Usage Recommendations

  • Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetition or incoherent output.
  • Set the value of top_p to 0.95.
  • Avoid adding a system prompt; all instructions should be contained within the user prompt.
  • For mathematical problems, include a directive in your prompt such as: “Please reason step by step, and put your final answer within \boxed{}.” (These settings are applied in the sketch after this list.)
  • When evaluating model performance, it is recommended to conduct multiple tests and average the results.
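
A minimal sketch applying these recommendations together (the sample problem is illustrative):

from openai import OpenAI

client = OpenAI(base_url="https://api.siliconflow.cn/v1/", api_key="your api_key")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{
        # No system prompt: the instruction lives entirely in the user turn.
        "role": "user",
        "content": "Solve x^2 - 5x + 6 = 0. Please reason step by step, "
                   "and put your final answer within \\boxed{}.",
    }],
    temperature=0.6,  # recommended range: 0.5-0.7
    top_p=0.95,
)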

4. OpenAI request examples

4.1 Stream Mode Request

from openai import OpenAI

url = 'https://api.siliconflow.cn/v1/'
api_key = 'your api_key'

client = OpenAI(
    base_url=url,
    api_key=api_key
)

# Send a request with streaming output.
content = ""
reasoning_content = ""
messages = [
    {"role": "user", "content": "Who are the legendary athletes of the Olympic Games?"}
]
response = client.chat.completions.create(
    model="Pro/deepseek-ai/DeepSeek-R1",
    messages=messages,
    stream=True,  # Enable streaming output.
    max_tokens=4096,
    extra_body={
        "thinking_budget": 1024
    }
)
# Receive and process the response chunk by chunk.
for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # The answer and the chain of thought arrive in separate fields;
    # reasoning_content is a non-standard field, so read it defensively.
    if delta.content:
        content += delta.content
    if getattr(delta, "reasoning_content", None):
        reasoning_content += delta.reasoning_content

# Round 2: append the first answer, then continue the conversation.
messages.append({"role": "assistant", "content": content})
messages.append({"role": "user", "content": "Go on"})
response = client.chat.completions.create(
    model="Pro/deepseek-ai/DeepSeek-R1",
    messages=messages,
    stream=True
)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        content += chunk.choices[0].delta.content

4.2 Non-Stream Mode Request

from openai import OpenAI

url = 'https://api.siliconflow.cn/v1/'
api_key = 'your api_key'

client = OpenAI(
    base_url=url,
    api_key=api_key
)

# Send a non-streaming request.
messages = [
    {"role": "user", "content": "Who are the legendary athletes of the Olympic Games?"}
]
response = client.chat.completions.create(
    model="Pro/deepseek-ai/DeepSeek-R1",
    messages=messages,
    stream=False,
    max_tokens=4096,
    extra_body={
        "thinking_budget": 1024
    }
)
content = response.choices[0].message.content
# reasoning_content is a non-standard field, so read it defensively.
reasoning_content = getattr(response.choices[0].message, "reasoning_content", None)

# Round 2: append the first answer, then continue the conversation.
messages.append({"role": "assistant", "content": content})
messages.append({"role": "user", "content": "Go on"})
response = client.chat.completions.create(
    model="Pro/deepseek-ai/DeepSeek-R1",
    messages=messages,
    stream=False
)
content = response.choices[0].message.content

5. Notes

  • API Key: Ensure you use the correct API key for authentication.
  • Stream Mode: Stream mode is suitable for scenarios where responses need to be received incrementally, while non-stream mode is suitable for scenarios where a complete response is needed at once.

6. Common questions

  • How to obtain the API key?

    Please visit SiliconFlow to register and obtain the API key.

  • How to handle long text?

    You can adjust the max_tokens parameter to control the length of the output, but note that the maximum response length depends on the model; see the table in Section 3.1.1 (for example, 16384 tokens for DeepSeek-R1).