Vision

1. Use cases

Vision-Language Models (VLMs) are large language models that accept both visual (image) and linguistic (text) inputs. Given an image together with text, a VLM can interpret the image in light of the surrounding context and respond accordingly. For example:

- Visual content interpretation: ask the model to interpret and describe what an image contains, such as objects, text, spatial relationships, colors, and atmosphere;
- Multi-turn conversations that combine visual content with context;
- Partial replacement of traditional machine vision models such as OCR;
- As model capabilities continue to improve, future applications may include visual intelligent agents and robots.

2. Usage

For VLMs, call the /chat/completions interface with a message whose content contains an image URL or a base64-encoded image. Use the detail parameter to control how the image is preprocessed.

2.1 Description of image detail control parameters

SiliconCloud accepts three values for the detail parameter: low, high, and auto. For currently supported models, the high (high-resolution) mode is used when detail is omitted or set to high, and the low (low-resolution) mode is used when detail is set to low or auto.
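The mapping from the detail value to the mode that is actually applied can be sketched as a small helper. This is only an illustration of the rule stated above; the function name resolve_detail_mode is hypothetical and not part of any SDK.

```python
from typing import Optional


def resolve_detail_mode(detail: Optional[str] = None) -> str:
    """Return the preprocessing mode ("high" or "low") selected by a detail value.

    Hypothetical helper that mirrors the rule described in this section:
    omitted or "high" selects high-resolution mode, "low" or "auto" selects
    low-resolution mode.
    """
    if detail is None or detail == "high":
        return "high"
    if detail in ("low", "auto"):
        return "low"
    raise ValueError(f"Unsupported detail value: {detail!r}")


for value in (None, "high", "low", "auto"):
    print(value, "->", resolve_detail_mode(value))
```

Note that under this rule auto behaves like low, so omitting detail and passing auto do not give the same result.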

2.2 Example format of message containing an image

For InternVL series models: for best results, place { "type":"text", "text":"text-prompt here" } after the image in the request body.

2.2.1 Using image URL format

{
    "role": "user",
    "content":[
        {
            "type": "image_url",
            "image_url": {
                "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png",
                "detail":"high"
            }
        },
        {
            "type": "text",
            "text": "text-prompt here"
        }
    ]
}

2.2.2 base64 format

{
    "role": "user",
    "content":[
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}",
                "detail":"low"
            }
        },
        {
            "type": "text",
            "text": "text-prompt here"
        }
    ]
}

Example of converting an image to base64

from PIL import Image
import io
import base64


def convert_image_to_webp_base64(input_image_path):
    """Re-encode the image at input_image_path as WebP and return it as a base64 string."""
    try:
        with Image.open(input_image_path) as img:
            # Encode into an in-memory buffer instead of writing a file.
            byte_arr = io.BytesIO()
            img.save(byte_arr, format='webp')
            base64_str = base64.b64encode(byte_arr.getvalue()).decode('utf-8')
            return base64_str
    except IOError:
        print(f"Error: Unable to open or convert the image {input_image_path}")
        return None


input_image_path = "path/to/image.png"  # placeholder: replace with your own image path
base64_image = convert_image_to_webp_base64(input_image_path)
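As a complement, here is a minimal sketch of turning already-encoded image bytes into the data URL and the image_url content part used in the message formats above. The helper names to_data_url and image_part are hypothetical. The sketch also assumes that the MIME type in the data URL should name the actual encoding of the bytes (for example image/webp for WebP data); the text above does not state whether the service checks this.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/webp") -> str:
    """Wrap raw image bytes in a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_part(url: str, detail: str = "") -> dict:
    """Build one image_url content part; detail is added only when given."""
    image_url = {"url": url}
    if detail:
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}


# Placeholder bytes stand in for real image data in this illustration.
part = image_part(to_data_url(b"abc", "image/webp"), detail="low")
print(part)
```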

2.2.3 Multiple images, where each image can be in either of the two forms mentioned above

Note that the DeepseekVL2 series models are best suited to short contexts, so it is recommended to input at most 2 images. If more than 2 images are input, the model automatically resizes each image to 384*384 and ignores the specified detail parameter.

{
    "role": "user",
    "content":[
        {
            "type": "image_url",
            "image_url": {
                "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/outputs/658c7434-ec12-49cc-90e6-fe22ccccaf62_00001_.png"
            }
        },
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            }
        },
        {
            "type": "text",
            "text": "text-prompt here"
        }
    ]
}
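A message like the one above can also be assembled programmatically. The following is a minimal sketch, not an official helper: build_user_message is a hypothetical name, and the example.com URL and the base64 payload are placeholders. It puts every image first and the text prompt last, which also matches the ordering recommended for InternVL models.

```python
from typing import List, Optional


def build_user_message(image_urls: List[str], prompt: str,
                       detail: Optional[str] = None) -> dict:
    """Build a user message with all images first and the text prompt last."""
    content = []
    for url in image_urls:
        image_url = {"url": url}
        if detail is not None:
            # Only send detail when the caller chose a value explicitly.
            image_url["detail"] = detail
        content.append({"type": "image_url", "image_url": image_url})
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


message = build_user_message(
    ["https://example.com/a.png", "data:image/jpeg;base64,AAAA"],  # placeholders
    "text-prompt here",
)
print([part["type"] for part in message["content"]])
```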

3. Supported model list

Currently supported VLM models:

THUDM series:
- Pro/THUDM/GLM-4.1V-9B-Thinking
- THUDM/GLM-4.1V-9B-Thinking
Qwen series:
- Qwen/Qwen2.5-VL-32B-Instruct
- Qwen/Qwen2.5-VL-72B-Instruct
- Qwen/QVQ-72B-Preview
- Qwen/Qwen2-VL-72B-Instruct
- Pro/Qwen/Qwen2.5-VL-7B-Instruct
DeepseekVL2 series:
- deepseek-ai/deepseek-vl2

Note: The list of supported VLMs may change. Filter by the “Visual” tag on the Models page to view the models currently supported.

4. Billing for visual input content

Visual inputs such as images are converted into tokens, which count toward the model's context and are billed accordingly. The conversion method varies by model; the methods currently used by the supported models are described below.

4.1 Qwen series

Rules: Qwen supports a maximum resolution of 3584 * 3584 = 12845056 and a minimum resolution of 56 * 56 = 3136. The dimensions of each image will be resized to multiples of 28, specifically (h * 28) * (w * 28). If the resolution is not within the minimum and maximum pixel range, it will be proportionally resized to that range.

When detail=low, all images will be resized to 448 * 448 pixels, corresponding to 256 tokens;
When detail=high, the resolution will be proportionally resized. First, the width and height will be rounded up to the nearest multiple of 28, then proportionally resized to the pixel range (3136, 12845056), ensuring that both the width and height are multiples of 28.

Examples:

Images with dimensions 224 * 448, 1024 * 1024, and 3172 * 4096 each consume 256 tokens when detail=low is selected;
An image with dimensions 224 * 448, when detail=high is selected, is already within the pixel range and both its width and height are multiples of 28, so it consumes (224/28) * (448/28) = 8 * 16 = 128 tokens;
An image with dimensions 1024 * 1024, when detail=high is selected, is rounded up to the nearest multiple of 28, giving 1036 * 1036, which is within the pixel range and consumes (1036/28) * (1036/28) = 1369 tokens;
An image with dimensions 3172 * 4096, when detail=high is selected, is rounded up to the nearest multiple of 28, giving 3192 * 4116, which exceeds the maximum pixel limit. It is then proportionally resized to 3136 * 4060, consuming (3136/28) * (4060/28) = 16240 tokens.
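The arithmetic in these examples can be reproduced with a short estimator. This is a sketch under stated assumptions, not the service's billing code: the function name qwen_image_tokens is hypothetical, and the proportional resize is assumed to scale both sides by the square root of the pixel ratio and then round down (when shrinking) or up (when enlarging) to a multiple of 28. That assumption matches the three examples above but may differ from the real implementation on other inputs.

```python
import math

PATCH = 28                 # each 28 x 28 block becomes one token
MIN_PIXELS = 56 * 56       # 3136
MAX_PIXELS = 3584 * 3584   # 12845056


def qwen_image_tokens(height: int, width: int, detail: str = "high") -> int:
    """Estimate image tokens for the Qwen rules described above (sketch)."""
    if detail in ("low", "auto"):
        # Low mode: every image is resized to 448 x 448.
        return (448 // PATCH) ** 2

    # Round each side up to the nearest multiple of 28.
    h = math.ceil(height / PATCH) * PATCH
    w = math.ceil(width / PATCH) * PATCH

    if h * w > MAX_PIXELS:
        # Assumed: shrink proportionally, rounding down to multiples of 28.
        scale = math.sqrt(h * w / MAX_PIXELS)
        h = max(PATCH, math.floor(h / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(w / scale / PATCH) * PATCH)
    elif h * w < MIN_PIXELS:
        # Assumed: enlarge proportionally, rounding up to multiples of 28.
        scale = math.sqrt(MIN_PIXELS / (h * w))
        h = math.ceil(h * scale / PATCH) * PATCH
        w = math.ceil(w * scale / PATCH) * PATCH

    return (h // PATCH) * (w // PATCH)


for size in [(224, 448), (1024, 1024), (3172, 4096)]:
    print(size, qwen_image_tokens(*size), qwen_image_tokens(*size, detail="low"))
```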

4.2 DeepseekVL2 series

Rules: DeepseekVL2 processes each image into two parts: global_view and local_view. global_view resizes the original image to 384*384 pixels, while local_view divides the image into multiple 384*384 blocks. Additional tokens are added to connect the blocks based on the width.

When detail=low, all images will be resized to 384*384 pixels.
When detail=high, the images will be resized to dimensions that are multiples of 384 (OpenAI uses 512), that is (h*384) * (w*384), where 1 <= h*w <= 9.

The scaling dimensions h * w will be chosen according to the following rules:
- Both h and w are integers, and within the constraint 1 <= h*w <= 9, traverse the combinations of (h, w).
- Resize the image to (h*384, w*384) pixels and compare with the original image’s pixels. Take the minimum value between the new image’s pixels and the original image’s pixels as the effective pixel value. Take the difference between the original image’s pixels and the effective pixel value as the invalid pixel value. If the effective pixel value exceeds the previously determined effective pixel value, or if the effective pixel value is the same but the invalid pixel value is smaller, choose the current (h*384, w*384) combination.
- Token consumption follows this rule:
  - (h*w + 1) * 196 + (w+1) * 14 + 1 tokens

Examples:

Images with dimensions 224 x 448, 1024 x 1024, and 2048 x 4096, when detail=low is selected, will consume 421 tokens each.
An image with dimensions 384 x 768, when detail=high is selected, has an aspect ratio of 1:2 and will be resized to 384 x 768. At this point, h=1, w=2, consuming (1*2 + 1) * 196 + (2+1) * 14 + 1 = 631 tokens.
An image with dimensions 1024 x 1024, when detail=high is selected, will be resized to 1152 x 1152 (h=w=3), consuming (3*3 + 1) * 196 + (3+1) * 14 + 1 = 2017 tokens.
An image with dimensions 2048 x 4096, when detail=high is selected, has an aspect ratio of 1:2 and will be resized to 768 x 1536 (h=2, w=4), consuming (2*4 + 1) * 196 + (4+1) * 14 + 1 = 1835 tokens.
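The grid selection and the token formula can be sketched together as follows. Several points here are assumptions rather than facts from the rules above: dimensions are read as height x width, the image is assumed to be fitted into each candidate canvas with its aspect ratio preserved, and the tie-breaking "invalid pixel" count is taken to be the candidate canvas area minus the effective pixels (a common way to pick a best-fit resolution). The function names are hypothetical. Under these assumptions the sketch reproduces the four examples above.

```python
TILE = 384      # side length of one block
MAX_TILES = 9   # constraint 1 <= h*w <= 9


def select_grid(height: int, width: int) -> tuple:
    """Pick (h, w) for the (h*384) x (w*384) canvas (sketch, see assumptions)."""
    orig_pixels = height * width
    best_key, best_grid = None, (1, 1)
    for h in range(1, MAX_TILES + 1):
        for w in range(1, MAX_TILES // h + 1):
            target_h, target_w = h * TILE, w * TILE
            # Fit the image inside the canvas, keeping its aspect ratio
            # (integer arithmetic avoids floating-point rounding surprises).
            if target_h * width <= target_w * height:
                fit_h, fit_w = target_h, width * target_h // height
            else:
                fit_h, fit_w = height * target_w // width, target_w
            effective = min(fit_h * fit_w, orig_pixels)
            wasted = target_h * target_w - effective  # assumed definition
            key = (effective, -wasted)
            # More effective pixels wins; on a tie, less waste wins.
            if best_key is None or key > best_key:
                best_key, best_grid = key, (h, w)
    return best_grid


def deepseek_vl2_tokens(height: int, width: int, detail: str = "high") -> int:
    """Apply (h*w + 1) * 196 + (w + 1) * 14 + 1 to the selected grid."""
    h, w = (1, 1) if detail in ("low", "auto") else select_grid(height, width)
    return (h * w + 1) * 196 + (w + 1) * 14 + 1


for size in [(384, 768), (1024, 1024), (2048, 4096)]:
    print(size, select_grid(*size), deepseek_vl2_tokens(*size))
```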

4.3 GLM-4.1V-9B-Thinking

Rules: GLM-4.1V supports a minimum pixel size of 28 * 28, scaling image dimensions proportionally to the nearest integer multiple of 28 pixels. If the scaled pixel size is smaller than 112 * 112 or larger than 4816894, adjust the dimensions proportionally to fit within the range while maintaining multiples of 28.

detail=low: Resize all images to 448*448 pixels, resulting in 256 tokens.
detail=high: Scale proportionally by first rounding the dimensions to the nearest 28-pixel multiple, then adjusting to fit within the pixel range (12544, 4816894) while ensuring both dimensions remain multiples of 28.

Examples:

224 x 448, 1024 x 1024, 3172 x 4096: With detail=low, all consume 256 tokens.
224 x 448: With detail=high, since dimensions are within range and multiples of 28, tokens = (224//28) * (448//28) = 8 * 16 = 128 tokens.
1024 x 1024: With detail=high, dimensions are rounded to 1036*1036 (within range), tokens = (1036//28) * (1036//28) = 1369 tokens.
3172 x 4096: With detail=high, rounded to 3192 x 4088 (exceeds maximum), then scaled proportionally to 1932 x 2464, tokens = (1932//28) * (2464//28) = 6072 tokens.
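The final step of each example, converting the resized dimensions into a token count, is plain integer arithmetic and can be checked with a tiny helper. The function name tokens_from_resized is hypothetical, and the sketch deliberately covers only this last step: it takes the already-resized dimensions from the examples as given and does not attempt to reproduce the proportional resize itself.

```python
PATCH = 28  # one token per 28 x 28 block


def tokens_from_resized(height: int, width: int) -> int:
    """Token count for dimensions that are already multiples of 28."""
    if height % PATCH or width % PATCH:
        raise ValueError("dimensions must be multiples of 28")
    return (height // PATCH) * (width // PATCH)


# Resized dimensions taken from the examples above.
for dims in [(448, 448), (224, 448), (1036, 1036), (1932, 2464)]:
    print(dims, tokens_from_resized(*dims))
```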

5. Usage example

5.1 Example 1: image understanding

from openai import OpenAI

client = OpenAI(
    api_key="Your APIKEY",  # Get from https://cloud.siliconflow.cn/account/ak
    base_url="https://api.siliconflow.cn/v1"
)

# Send one image followed by the text prompt and stream the reply.
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/dog.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe the image."
                }
            ]
        }
    ],
    stream=True
)

for chunk in response:
    # Skip chunks without text: delta.content can be None while streaming.
    if not chunk.choices:
        continue
    chunk_message = chunk.choices[0].delta.content
    if chunk_message:
        print(chunk_message, end='', flush=True)

5.2 Example 2: multi-image understanding

from openai import OpenAI

client = OpenAI(
    api_key="Your APIKEY",  # Get from https://cloud.siliconflow.cn/account/ak
    base_url="https://api.siliconflow.cn/v1"
)

# Send two images followed by the text prompt and stream the reply.
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/dog.png"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/shark.jpg"
                    }
                },
                {
                    "type": "text",
                    "text": "Identify the similarities between these images."
                }
            ]
        }
    ],
    stream=True
)

for chunk in response:
    # Skip chunks without text: delta.content can be None while streaming.
    if not chunk.choices:
        continue
    chunk_message = chunk.choices[0].delta.content
    if chunk_message:
        print(chunk_message, end='', flush=True)
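When the full reply is needed as one string rather than printed piece by piece, the streamed chunks can be accumulated. The sketch below is illustrative only: collect_stream_text and fake_chunk are hypothetical names, and the fake chunks merely imitate the choices[0].delta.content shape used in the examples above so that the snippet runs without a network call. It skips chunks that have no choices or no text content.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream_text(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        text = chunk.choices[0].delta.content
        if text:  # content can be None or empty
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in object with the same attribute shape as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [fake_chunk("A dog"), fake_chunk(None),
          SimpleNamespace(choices=[]), fake_chunk(" on grass.")]
print(collect_stream_text(stream))
```

With a real request, the response object returned by client.chat.completions.create(..., stream=True) would be passed in place of the fake list.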
