
Qwen3-VL-235B-A22B-Instruct-AWQ
Template
Qwen3-VL-235B-A22B is the flagship vision-language model featuring a Mixture-of-Experts (MoE) architecture with 235 billion total parameters (22 billion active per token). This AWQ (4-bit) version is optimized for high-throughput deployment, offering superior performance in document intelligence, spatial reasoning, and video understanding.
Billed per 1M tokens (Input + Output). Images and videos are tokenized based on their resolution and frame count.
| Parameter | Type | Required | Description |
|---|---|---|---|
model |
string |
Yes | Must be Qwen3-VL-235B-A22B-Instruct-AWQ. |
messages |
array |
Yes | Standard OpenAI-compatible message objects. Supports multimodal content array (text, image, video). |
max_tokens |
integer |
No | Maximum tokens to generate. Supports a massive 262,144 token context window. |
temperature |
float |
No | Controls randomness (0.0 - 2.0). Recommended: 0.1 for extraction, 0.7 for chat. |
top_p |
float |
No | Nucleus sampling threshold. Recommended: 0.8 to balance speed and quality. |
video_fps |
float |
No | Frame sampling rate for video analysis. Default: 1.0. |
stream |
boolean |
No | Whether to stream the response tokens in real-time. |
The AWQ-quantized MoE architecture is highly efficient but sensitive to sampling. Use these values to optimize for specific multimodal tasks:
| Scenario | Recommended Params | Purpose |
|---|---|---|
| OCR & Table Extraction | temperature: 0.1, top_p: 1.0 |
Ensures zero hallucinations and strict structural accuracy in data extraction. |
| Visual UI Coding | temperature: 0.2, top_p: 0.9 |
Balances technical syntax precision with creative design layout for frontend code. |
| Dynamic Video Search | video_fps: 2.0, stream: true |
Provides better temporal resolution for tracking fast-moving objects in video clips. |
| Spatial Grounding | temperature: 0.0, max_tokens: 1024 |
Ideal for obtaining precise [ymin, xmin, ymax, xmax] coordinates in object detection. |
| Creative Analysis | temperature: 0.8, top_p: 0.95 |
Best for comparing multiple images or creative storytelling based on visual input. |
"Qwen3-VL-235B-A22B-Instruct-AWQ".image, video, text).0.7. Recommended 0.1 for precise extraction.Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.