Vision
1. Use cases
Vision-Language Models (VLM) are large language models that can accept both visual (image) and linguistic (text) inputs. Based on VLM, you can input image and text information, and the model can understand both the image and the context information and respond accordingly. For example:
- Visual Content Interpretation: Require the model to interpret and describe the information in the image, such as objects, text, spatial relationships, colors, and atmosphere in the image;
- Conduct multi-round conversations combining visual content and context;
- Partially replace traditional machine vision models like OCR;
- With the continuous improvement of model capabilities, future applications can include visual intelligent agents and robots.
2. Usage
For VLM models, you can construct a message content containing image URL or base64 encoded image when calling the /chat/completions interface. Use the detail parameter to control the preprocessing of the image.
2.1 Description of image detail control parameters
SiliconCloud provides three detail parameter options: low
, high
, and auto
. For currently supported models, if detail is not specified or set to high
, the high
(high-resolution) mode will be used. If detail is set to low
or auto
, the low
(low-resolution) mode will be used.
2.2 Example format of message containing an image
{ "type":"text", "text":"text-prompt here" }
place after the image in the request body to achieve the best results. Using image URL format
2.2 base64 format
2.3 Multiple image, where each image can be one of the two forms mentioned above
DeepseekVL2
series models are suitable for handling short contexts. It is recommended to input a maximum of 2 images. If more than 2 images are input, the model will automatically resize the images to 384*384, and the specified detail parameter will be invalid. 3. Supported model list
Currently supported VLM models:
- Qwen series:
- Qwen/Qwen2-VL-72B-Instruct
- Pro/Qwen/Qwen2-VL-7B-Instruct
- Qwen/QVQ-72B-Preview
- InternVL series:
- OpenGVLab/InternVL2-Llama3-76B
- OpenGVLab/InternVL2-26B
- Pro/OpenGVLab/InternVL2-8B
- DeepseekVL2 series:
- deepseek-ai/deepseek-vl2
4. Billing for visual input content
For visual input content such as images, the models will convert them into tokens, which will be included in the model output context and billed accordingly. The conversion methods vary for different models. Below are the current conversion methods for supported models.
4.1 Qwen series
Rules:
Qwen
supports a maximum resolution of 3584 * 3584 = 12845056
and a minimum resolution of 56 * 56 = 3136
. The dimensions of each image will be resized to multiples of 28
, specifically (h * 28) * (w * 28)
. If the resolution is not within the minimum and maximum pixel range, it will be proportionally resized to that range.
- When
detail=low
, all images will be resized to448 * 448
pixels, corresponding to256 tokens
; - When
detail=high
, the resolution will be proportionally resized. First, the width and height will be rounded up to the nearest multiple of28
, then proportionally resized to the pixel range(3136, 12845056)
, ensuring that both the width and height are multiples of28
.
Examples:
- Images with dimensions
224 * 448
,1024 * 1024
, and3172 * 4096
, whendetail=low
is selected, will consume256 tokens
each; - An image with dimensions
224 * 448
, whendetail=high
is selected, because224 * 448
is within the pixel range and both width and height are multiples of28
, it will consume(224/28) * (448/28) = 8 * 16 = 128 tokens
; - An image with dimensions
1024 * 1024
, when detail=high is selected, will be rounded up to the nearest multiple of28
to1036 * 1036
, which is within the pixel range, consuming(1036/28) * (1036/28) = 1369 tokens
; - An image with dimensions
3172 * 4096
, whendetail=high
is selected, will be rounded up to the nearest multiple of28
to3192 * 4116
, which exceeds the maximum pixel limit. It will then be proportionally resized to3136 * 4060
, consuming(3136/28) * (4060/28) = 16240 tokens
.
4.2 InternVL series
Rules:
InternVL2
actually processes pixels and consumes tokens
based on the aspect ratio of the original image. The minimum processing pixel is 448 * 448
, and the maximum is 12 * 448 * 448
.
- When
detail=low
, all images will be resized to448 * 448
pixels, corresponding to256 tokens
; - When
detail=high
, the images will be resized to dimensions that are multiples of448
,(h * 448) * (w * 448)
,and1 <= h * w <=12
。- The scaling dimensions
h * w
will be chosen according to the following rules:- Both
h
andw
are integers, and within the constraint1 <= h * w <= 12
traverse the combinations ofh * w
from smallest to largest. - For the current
(h, w)
combination, if the aspect ratio of the original image is closer toh / w
,choose this(h, w)
combination. - For subsequent
(h, w)
combinations with the same ratio but larger values, if the original image pixels are greater than0.5 * h * w * 448 * 448
, choose the larger(h, w)
combination.
- Both
- Token consumption will follow the following rules:
- If
h * w = 1
,consume256 tokens
; - If
h * w > 1
,consume an additional256 token
for each448 * 448
sliding window, totaling(h * w + 1) * 256 tokens
。
- If
- The scaling dimensions
Examples:
- Images with dimensions
224 * 448
,1024 * 1024
, and2048 * 4096
, whendetail=low
is selected, will consume256 tokens
each; - An image with dimensions
224 * 448
, whendetail=high
is selected, has an aspect ratio of1:2
, and will be resized to448 x 896
. At this point,h = 1, w = 2
, consuming(h * w + 1) * 256 = 768 tokens
; - An image with dimensions
1024 * 1024
, whendetail=high
is selected, has an aspect ratio of1:1
, and will be resized to1344 * 1344 (h = w = 3)
. Since1024 * 1024 > 0.5 * 1344 * 1344
, at this point,h = w = 3
, consuming(3 * 3 + 1) * 256 = 2560 tokens
; - An image with dimensions
2048 * 4096
, whendetail=high
is selected, has an aspect ratio of1:2
, and under the condition1 <= h * w <= 12
, the largest(h, w)
combination ish = 2, w = 4
. Therefore, it will be resized to896 * 1792
, consuming(2 * 4 + 1) * 256 = 2304 tokens
.
4.3 DeepseekVL2 series
Rules:
DeepseekVL2
processes each image into two parts: global_view and local_view. global_view resizes the original image to 384*384
pixels, while local_view divides the image into multiple 384*384
blocks. Additional tokens are added to connect the blocks based on the width.
- When
detail=low
, all images will be resized to384*384
pixels. - When
detail=high
, the images will be resized to dimensions that are multiples of384(OpenAI uses 512)
,(h*384) * (w * 384)
, and1 <= h*w <= 9
.
-
The scaling dimensions
h * w
will be chosen according to the following rules:-
Both
h
andw
are integers, and within the constraint1 <= h*w <= 9
, traverse the combinations of(h, w)
. -
Resize the image to
(h*384, w*384)
pixels and compare with the original image’s pixels. Take the minimum value between the new image’s pixels and the original image’s pixels as the effective pixel value. Take the difference between the original image’s pixels and the effective pixel value as the invalid pixel value. If the effective pixel value exceeds the previously determined effective pixel value, or if the effective pixel value is the same but the invalid pixel value is smaller, choose the current(h*384, w*384)
combination. -
Token consumption will follow the following rules:
(h*w + 1) * 196 + (w+1) * 14 + 1 token
-
Examples:
- Images with dimensions
224 x 448
,1024 x 1024
, and2048 x 4096
, whendetail=low
is selected, will consume421 tokens
each. - An image with dimensions
384 x 768
, whendetail=high
is selected, has an aspect ratio of1:1
and will be resized to384 x 768
. At this point,h=1, w=2
, consuming(1*2 + 1) * 196 + (2+1) * 14 + 1 = 631 tokens
. - An image with dimensions
1024 x 1024
, whendetail=high
is selected, will be resized to1152*1152(h=w=3)
, consuming(3*3 + 1) * 196 + (3+1) * 14 + 1 = 2017 tokens
. - An image with dimensions
2048 x 4096
, whendetail=high
is selected, has an aspect ratio of1:2
and will be resized to768*1536(h=2, w=4)
,consuming (2*4 + 1) * 196 + (4+1) * 14 + 1 = 1835 tokens
.