Qwen3-VL-235B-A22B-Instruct-AWQ

Qwen3-VL-235B-A22B is the flagship vision-language model featuring a Mixture-of-Experts (MoE) architecture with 235 billion total parameters (22 billion active per token). This AWQ (4-bit) version is optimized for high-throughput deployment, offering superior performance in document intelligence, spatial reasoning, and video understanding.

Key Capabilities

Unified Modality: Native support for images and videos without separate encoders.
Complex Visual Coding: Convert UI wireframes or handwritten diagrams directly into HTML/CSS/JS.
Spatial Grounding: Precision 2D/3D object detection and location within visual scenes.
Long Video Comprehension: Support for analyzing dynamic content up to 20-60 minutes with second-level indexing.

Billing

Billed per 1M tokens (Input + Output). Images and videos are tokenized based on their resolution and frame count.

Request Parameters

Parameter	Type	Required	Description
`model`	`string`	Yes	Must be `Qwen3-VL-235B-A22B-Instruct-AWQ`.
`messages`	`array`	Yes	Standard OpenAI-compatible message objects. Supports multimodal `content` array (text, image, video).
`max_tokens`	`integer`	No	Maximum tokens to generate. Supports a massive 262,144 token context window.
`temperature`	`float`	No	Controls randomness (0.0 - 2.0). Recommended: 0.1 for extraction, 0.7 for chat.
`top_p`	`float`	No	Nucleus sampling threshold. Recommended: 0.8 to balance speed and quality.
`video_fps`	`float`	No	Frame sampling rate for video analysis. Default: `1.0`.
`stream`	`boolean`	No	Whether to stream the response tokens in real-time.

Optional Parameters (Qwen3-VL Optimization)

The AWQ-quantized MoE architecture is highly efficient but sensitive to sampling. Use these values to optimize for specific multimodal tasks:

Scenario	Recommended Params	Purpose
OCR & Table Extraction	`temperature: 0.1`, `top_p: 1.0`	Ensures zero hallucinations and strict structural accuracy in data extraction.
Visual UI Coding	`temperature: 0.2`, `top_p: 0.9`	Balances technical syntax precision with creative design layout for frontend code.
Dynamic Video Search	`video_fps: 2.0`, `stream: true`	Provides better temporal resolution for tracking fast-moving objects in video clips.
Spatial Grounding	`temperature: 0.0`, `max_tokens: 1024`	Ideal for obtaining precise [ymin, xmin, ymax, xmax] coordinates in object detection.
Creative Analysis	`temperature: 0.8`, `top_p: 0.95`	Best for comparing multiple images or creative storytelling based on visual input.

Parameters Summary

model: Must be "Qwen3-VL-235B-A22B-Instruct-AWQ".
messages: Array supporting multimodal blocks (image, video, text).
max_tokens: Supports up to 262K context window.
temperature: Default 0.7. Recommended 0.1 for precise extraction.

Qwen3-VL-235B-A22B-Instruct-AWQ

Input

Output

Qwen3-VL-235B-A22B-Instruct-AWQ

Key Capabilities

Billing

Request Parameters

Optional Parameters (Qwen3-VL Optimization)

Parameters Summary

Unlock the most affordable AI hosting