Google Gemma 4 E4B-it Documentation

The Gemma 4 E4B-it is a 4.1-billion parameter dense model built for high-performance edge computing. It serves as a superior alternative to traditional 7B models, offering similar reasoning benchmarks while requiring 40% less memory. It is the ideal choice for local RAG (Retrieval-Augmented Generation), mobile integration, and high-speed agentic workflows.

Key Capabilities

Edge Mastery: Optimized for NPU (Neural Processing Unit) acceleration on mobile and desktop chips (Apple M-series, Snapdragon, Intel Core Ultra).
Instruction Following: Fine-tuned using RLHF for strict adherence to complex system prompts and JSON output formats.
128K Context Window: Large enough for analyzing multiple source documents locally without offloading to the cloud.
Low Latency: Capable of generating over 100 tokens per second on mid-range consumer GPUs.
Privacy First: Designed for "On-Device" deployment where data security and offline functionality are paramount.

Request Parameters

To interact with this model via the us-01.bytecompute.ai endpoint:

Parameter	Type	Required	Description
`model`	`string`	Yes	Use `"google/gemma-4-e4b-it"`.
`messages`	`array`	Yes	Standard role-based message objects (system, user, assistant).
`max_tokens`	`integer`	No	Maximum generation length. Default: `2048`.
`temperature`	`float`	No	Controls creativity. Recommended: `0.1` for logic, `0.6` for chat.
`top_p`	`float`	No	Nucleus sampling threshold. Default: `0.9`.

gemma-4-E4B-it

Input

Output

Google Gemma 4 E4B-it Documentation

Key Capabilities

Request Parameters

Unlock the most affordable AI hosting