Qwen3-VL-235B-A22B-Instruct-AWQ Robot

Qwen3-VL-235B-A22B-Instruct-AWQ

Qwen3-VL-235B-A22B-Instruct-AWQ

0.0025 $ / 1k

Input

Template

Output

Qwen3-VL-235B-A22B-Instruct-AWQ

Qwen3-VL-235B-A22B is the flagship vision-language model featuring a Mixture-of-Experts (MoE) architecture with 235 billion total parameters (22 billion active per token). This AWQ (4-bit) version is optimized for high-throughput deployment, offering superior performance in document intelligence, spatial reasoning, and video understanding.

Key Capabilities

  • Unified Modality: Native support for images and videos without separate encoders.
  • Complex Visual Coding: Convert UI wireframes or handwritten diagrams directly into HTML/CSS/JS.
  • Spatial Grounding: Precision 2D/3D object detection and location within visual scenes.
  • Long Video Comprehension: Support for analyzing dynamic content up to 20-60 minutes with second-level indexing.

Billing

Billed per 1M tokens (Input + Output). Images and videos are tokenized based on their resolution and frame count.

Request Parameters

Parameter Type Required Description
model string Yes Must be Qwen3-VL-235B-A22B-Instruct-AWQ.
messages array Yes Standard OpenAI-compatible message objects. Supports multimodal content array (text, image, video).
max_tokens integer No Maximum tokens to generate. Supports a massive 262,144 token context window.
temperature float No Controls randomness (0.0 - 2.0). Recommended: 0.1 for extraction, 0.7 for chat.
top_p float No Nucleus sampling threshold. Recommended: 0.8 to balance speed and quality.
video_fps float No Frame sampling rate for video analysis. Default: 1.0.
stream boolean No Whether to stream the response tokens in real-time.

Optional Parameters (Qwen3-VL Optimization)

The AWQ-quantized MoE architecture is highly efficient but sensitive to sampling. Use these values to optimize for specific multimodal tasks:

Scenario Recommended Params Purpose
OCR & Table Extraction temperature: 0.1, top_p: 1.0 Ensures zero hallucinations and strict structural accuracy in data extraction.
Visual UI Coding temperature: 0.2, top_p: 0.9 Balances technical syntax precision with creative design layout for frontend code.
Dynamic Video Search video_fps: 2.0, stream: true Provides better temporal resolution for tracking fast-moving objects in video clips.
Spatial Grounding temperature: 0.0, max_tokens: 1024 Ideal for obtaining precise [ymin, xmin, ymax, xmax] coordinates in object detection.
Creative Analysis temperature: 0.8, top_p: 0.95 Best for comparing multiple images or creative storytelling based on visual input.

Parameters Summary

  • model: Must be "Qwen3-VL-235B-A22B-Instruct-AWQ".
  • messages: Array supporting multimodal blocks (image, video, text).
  • max_tokens: Supports up to 262K context window.
  • temperature: Default 0.7. Recommended 0.1 for precise extraction.

Unlock the most affordable AI hosting

Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.

Contact Sales