Tags: inference optimization

  1. Google Research has introduced TurboQuant, a quantization algorithm that compresses the Key-Value (KV) cache of large language models by up to 6x. Using a two-step process that combines randomized Hadamard transforms with Quantized Johnson-Lindenstrauss (QJL) transforms, the method achieves 3.5-bit compression with near-zero accuracy loss on benchmarks such as LongBench. This addresses the massive VRAM footprint of long-context windows, potentially allowing large models to run on significantly less powerful hardware. (A toy sketch of the rotate-then-quantize idea follows the key points below.)
    Key points:
    * Compresses KV cache down to 3.5 bits per value.
    * Maintains inference accuracy without requiring model retraining.
    * Uses random rotations and QJL transforms to smooth out the outlier-heavy activation distributions that make low-bit quantization hard.
    * Reduces the memory bottleneck for long-context LLM inference.
    * Enables massive context windows on more modest hardware configurations.
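
A rough illustration of why rotation helps low-bit quantization: the NumPy sketch below shows the generic rotate-then-quantize pattern, not TurboQuant's actual algorithm. The function names, the per-vector uniform quantizer, and the 4-bit width (a stand-in for the paper's fractional 3.5-bit rate) are all illustrative assumptions.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard_rotate(x, rng):
    """Multiply x by D @ H / sqrt(n): D = random +/-1 signs, H = Hadamard.
    This orthogonal rotation spreads a single outlier coordinate's energy
    roughly evenly across all n dimensions."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)  # orthogonal and symmetric, so its own inverse
    return (x * signs) @ H, signs, H

def quantize(x, bits):
    """Per-vector uniform quantizer: map x onto 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = max((hi - lo) / levels, 1e-12)
    return np.round((x - lo) / scale), lo, scale

def dequantize(q, lo, scale):
    return q * scale + lo

# Toy "key" vector with one large outlier coordinate, mimicking KV-cache activations.
rng = np.random.default_rng(0)
key = rng.normal(0.0, 0.1, 64)
key[7] = 10.0

rotated, signs, H = randomized_hadamard_rotate(key, rng)
q, lo, scale = quantize(rotated, bits=4)  # 4-bit stand-in for the 3.5-bit rate

# Decode: dequantize, then undo the rotation (apply H again, then the same signs).
recovered = (dequantize(q, lo, scale) @ H) * signs
print("max abs reconstruction error:", np.abs(recovered - key).max())
```

Without the rotation, the same 4-bit quantizer would spend nearly its whole range covering the single outlier coordinate and collapse the remaining values onto a handful of levels; after rotation, all coordinates have comparable magnitude and the quantizer's levels are used efficiently.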
  2. Sarvam AI is releasing Sarvam 30B and Sarvam 105B as open-source models, trained from scratch on large-scale, high-quality datasets. The models demonstrate strong reasoning, programming, and agentic capabilities and are optimized for efficient deployment across a range of hardware; Sarvam 30B powers Samvaad, while Sarvam 105B powers Indus. The release covers the model architecture, training process, benchmark results, and inference optimizations. Both models are available on AI Kosh and Hugging Face, and the article details their performance on benchmarks and in real-world applications such as webpage generation, JEE problem solving, and conversational agents.
