Google DeepMind Launches DiffusionGemma, Boosting Local AI Text Generation Speed Fourfold

Google DeepMind has introduced DiffusionGemma, a model in its Gemma 4 family that generates blocks of text in parallel rather than one token at a time, running up to four times faster on local hardware such as Nvidia RTX and H100 GPUs.

DiffusionGemma runs 4x faster than similar-sized autoregressive models on local GPUs
Model uses a diffusion-style approach producing text in parallel rather than sequentially
Available now under Apache 2.0 license, optimized for RTX and Nvidia enterprise platforms

What happened

Google DeepMind has released DiffusionGemma, a new open AI model that differs fundamentally from typical autoregressive text generators by producing entire blocks of text in parallel through a diffusion process. Instead of emitting one token at a time, the model starts from a canvas of placeholder tokens and refines the whole block over multiple passes until the final text emerges.
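The refinement loop can be pictured with a toy sketch. This is not DiffusionGemma's algorithm, which the briefing does not detail. It assumes a simple scheme in which a fully masked canvas is filled in over a fixed number of passes, and it uses a known target sentence as a stand-in for the model's predictions. The names toy_denoise and MASK are hypothetical.

```python
import random

MASK = "_"  # hypothetical placeholder token


def toy_denoise(target, steps, seed=0):
    """Fill a fully masked canvas over at most `steps` passes.

    `target` stands in for what a model would predict. A real denoiser
    scores every position with a neural network and may also revise
    tokens it committed earlier; this toy only unmasks.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # Commit an even share of the remaining masked positions each
        # pass, so the last pass fills whatever is left.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        for i in rng.sample(masked, k):
            canvas[i] = target[i]
        history.append(list(canvas))
    return history


if __name__ == "__main__":
    for snapshot in toy_denoise("the cat sat on the mat".split(), steps=3):
        print(" ".join(snapshot))
```

Each printed line is one pass over the whole canvas. The point of the sketch is that the number of passes is set independently of the number of tokens, whereas an autoregressive model needs one pass per token.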

DiffusionGemma is a Mixture of Experts (MoE) model with 26 billion parameters, of which only 3.8 billion are active during inference, allowing it to run on GPUs with around 18GB of memory. Testing shows it reaches around 700 tokens per second on an Nvidia RTX 5090 and more than 1,000 tokens per second on an Nvidia H100, approximately four times the speed of similar-sized autoregressive Gemma models.
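A quick back-of-envelope check puts these figures in proportion. The sketch below uses only the numbers quoted above and assumes that "18GB" means 18 billion bytes and that the fourfold speedup applies directly to the RTX 5090 figure. The variable names are illustrative.

```python
TOTAL_PARAMS = 26e9     # total parameters
ACTIVE_PARAMS = 3.8e9   # parameters active during inference
MEMORY_BYTES = 18e9     # "around 18GB", assumed to mean decimal gigabytes

# Share of the model that does work at inference time.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Upper bound on average storage per parameter if all weights fit in
# the stated memory (activations and caches would lower it further).
bits_per_param = MEMORY_BYTES * 8 / TOTAL_PARAMS

# Memory the same weights would need at 16 bits per parameter.
fp16_gigabytes = TOTAL_PARAMS * 2 / 1e9

# Autoregressive throughput implied by "about four times faster",
# assuming the ratio holds exactly for the RTX 5090 figure.
implied_baseline_tps = 700 / 4

if __name__ == "__main__":
    print(f"active fraction: {active_fraction:.1%}")
    print(f"bits per parameter (upper bound): {bits_per_param:.2f}")
    print(f"16-bit weights would need: {fp16_gigabytes:.0f} GB")
    print(f"implied baseline: {implied_baseline_tps:.0f} tokens/s")
```

Roughly 15 percent of the parameters are active per token. Fitting all 26 billion weights in 18GB works out to under six bits per parameter, which suggests the quoted memory figure refers to a quantized build. That is an inference from the arithmetic, not something the source states.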

Why it matters

This diffusion-based approach shifts the bottleneck from memory bandwidth to compute power. That is an advantage for local AI deployments, where memory bandwidth typically limits autoregressive models and much of the GPU's compute sits idle. Parallel generation is also better suited to non-linear tasks such as in-line text editing, molecular sequencing, and puzzles like Sudoku.
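The bottleneck shift can be illustrated with a roofline-style estimate, in which a forward pass takes as long as the slower of two things: reading the weights from memory, or doing the arithmetic. All constants below are hypothetical round numbers chosen for illustration, not measurements of DiffusionGemma or of any GPU, and the model ignores details such as MoE routing reading different experts for different positions.

```python
# Hypothetical round numbers, not measurements of any real GPU or model.
WEIGHT_BYTES = 4e9       # bytes of weights read per forward pass
FLOPS_PER_TOKEN = 8e9    # arithmetic per token position per pass
BANDWIDTH = 1e12         # memory bandwidth, bytes per second
PEAK_FLOPS = 1e14        # peak compute, operations per second


def pass_times(tokens_per_pass):
    """Return (memory_seconds, compute_seconds) for one forward pass."""
    memory_s = WEIGHT_BYTES / BANDWIDTH
    compute_s = FLOPS_PER_TOKEN * tokens_per_pass / PEAK_FLOPS
    return memory_s, compute_s


def bottleneck(tokens_per_pass):
    """Name the resource that limits a pass over `tokens_per_pass` positions."""
    memory_s, compute_s = pass_times(tokens_per_pass)
    return "memory" if memory_s >= compute_s else "compute"


if __name__ == "__main__":
    for n in (1, 16, 256):
        print(n, "positions per pass ->", bottleneck(n), "bound")
```

With one position per pass, as in autoregressive decoding, the weight read dominates and the arithmetic units wait. With a whole block per pass, the same weight read is spread over many positions and the arithmetic becomes the limit. The crossover point depends entirely on the assumed constants.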

Diffusion models have so far succeeded mainly in image generation. Applying the technique to text offers speed and efficiency gains at the cost of a somewhat higher error rate. This makes DiffusionGemma particularly suitable for offline or edge computing scenarios where fast, local inference matters more than the scale and batching efficiencies of cloud-hosted autoregressive models.

What to watch next

Despite its experimental nature and trade-offs in accuracy, DiffusionGemma is openly available under the Apache 2.0 license and has been optimized in collaboration with Nvidia for consumer GPUs and enterprise AI platforms. The release may signal wider adoption of diffusion methods in future text generation models built for local and embedded environments.

Google is also exploring complementary speed improvements, such as Multi-Token Prediction (MTP), but DiffusionGemma currently surpasses these alternatives in throughput. Observers should watch for further development that balances diffusion’s speed with error reduction, as well as potential integration into hybrid or cloud-edge AI systems.

Source assisted: This briefing began from a discovered source item from Ars Technica.
