The Gemma 4 26B-A4B-it (Instruct) represents Google DeepMind’s latest evolution in Mixture-of-Experts (MoE) architecture, released in April 2026. This model is specifically optimized for high-throughput efficiency, balancing the massive knowledge base of a 26B parameter model with the inference speed of a much smaller 4B active parameter model.
Gemma 4 26B-A4B-it is a sparse Mixture-of-Experts (MoE) model. While it contains a total of 26.2 Billion parameters, only 4.1 Billion parameters are activated per token. This allows for the reasoning capabilities of a large model with the latency profile of a lightweight model.
To interact with this model via vLLM or OpenAI-compatible endpoints:
| Parameter | Type | Required | Description |
|---|---|---|---|
model |
string |
Yes | Use "google/gemma-4-26b-a4b-it". |
messages |
array |
Yes | Standard chat format (role/content). |
max_tokens |
integer |
No | Maximum generation length. Default: 4096. |
temperature |
float |
No | Recommended: 0.1 for coding; 0.7 for chat. |
top_p |
float |
No | Nucleus sampling. Default: 0.9. |
| Parameter | Type | Default | Description |
|---|---|---|---|
frequency_penalty |
float |
0.0 |
Prevents repetitive word usage. |
stop |
array |
null |
Sequences to end generation (e.g., ["\nUser:"]). |
logprobs |
boolean |
false |
Returns the probability of the generated tokens. |
Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.
