Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family, delivering significantly faster inference. Built on a specialized speculative decoding architecture, these drafters can provide up to a 3x speedup without compromising output quality or reasoning capabilities. The technique addresses the memory-bandwidth bottleneck of autoregressive generation: a lightweight drafter predicts several future tokens, which the larger target model then verifies in a single parallel pass.
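The draft-then-verify loop behind speculative decoding can be sketched in a few lines of Python. The `drafter_next` and `target_next` functions below are toy stand-ins for the small drafter and the large target model (illustrative assumptions, not Gemma APIs); the loop shows why verification preserves the target model's exact greedy output, since every emitted token is either confirmed or supplied by the target.

```python
def target_next(ctx):
    """Toy 'large model': greedy next token is always last + 1."""
    return ctx[-1] + 1

def drafter_next(ctx):
    """Toy 'drafter': cheap and usually right, but wrong whenever
    the last token is congruent to 4 mod 5 (simulates draft errors)."""
    return 0 if ctx[-1] % 5 == 4 else ctx[-1] + 1

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens, drafting k at a time and verifying in bulk."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. Drafter proposes k tokens autoregressively (cheap, sequential).
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = drafter_next(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. Target checks all k positions; in a real system this is a
        #    single parallel forward pass rather than a Python loop.
        accepted, ctx = [], tokens[:]
        for t in draft:
            expected = target_next(ctx)
            if t == expected:
                accepted.append(t)       # draft token confirmed
                ctx.append(t)
            else:
                accepted.append(expected)  # first mismatch: use target's token
                break
        else:
            # All k drafts accepted: the verify pass yields one bonus token.
            accepted.append(target_next(ctx))

        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]

print(speculative_decode([0], 10))  # matches plain greedy decoding: [0, 1, ..., 10]
```

Each verification pass emits between one token (immediate mismatch) and k+1 tokens (all drafts accepted plus a bonus), which is where the speedup comes from: the target model runs once per batch of drafts instead of once per token.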
Key points:
* Improved responsiveness for real-time chat, voice applications, and agentic workflows.
* Faster local development on personal computers and consumer GPUs.
* Enhanced performance and battery efficiency on edge devices.
* Architectural optimizations, including KV-cache sharing and reuse of the target model's activations by the drafter.
* Available now under the Apache 2.0 license via Hugging Face and Kaggle.