Tags: inference optimization

  1. Google Research has introduced TurboQuant, a quantization algorithm that compresses the Key-Value (KV) cache of large language models by up to 6x. Using a two-step process that combines randomized Hadamard transforms with Quantized Johnson-Lindenstrauss (QJL) transforms, the method achieves 3.5-bit compression with near-zero accuracy loss on benchmarks such as LongBench. This addresses the massive VRAM footprint of long-context windows, potentially allowing large models to run on significantly less powerful hardware. (A toy sketch of the rotate-then-quantize idea follows the key points below.)
    Key points:
    * Compresses KV cache down to 3.5 bits per value.
    * Maintains inference accuracy without requiring model retraining.
    * Uses random rotations and QJL transforms to smooth out the outlier-heavy activation distributions that make low-bit quantization hard.
    * Reduces the memory bottleneck for long-context LLM inference.
    * Enables massive context windows on more modest hardware configurations.
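
A rough illustration of why rotation helps low-bit quantization: the NumPy sketch below shows the generic rotate-then-quantize pattern, not TurboQuant's actual algorithm. The function names, the per-vector uniform quantizer, and the 4-bit width (a stand-in for the paper's fractional 3.5-bit rate) are all illustrative assumptions.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard_rotate(x, rng):
    """Multiply x by D @ H / sqrt(n): D = random +/-1 signs, H = Hadamard.
    This orthogonal rotation spreads a single outlier coordinate's energy
    roughly evenly across all n dimensions."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)  # orthogonal and symmetric, so its own inverse
    return (x * signs) @ H, signs, H

def quantize(x, bits):
    """Per-vector uniform quantizer: map x onto 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = max((hi - lo) / levels, 1e-12)
    return np.round((x - lo) / scale), lo, scale

def dequantize(q, lo, scale):
    return q * scale + lo

# Toy "key" vector with one large outlier coordinate, mimicking KV-cache activations.
rng = np.random.default_rng(0)
key = rng.normal(0.0, 0.1, 64)
key[7] = 10.0

rotated, signs, H = randomized_hadamard_rotate(key, rng)
q, lo, scale = quantize(rotated, bits=4)  # 4-bit stand-in for the 3.5-bit rate

# Decode: dequantize, then undo the rotation (apply H again, then the same signs).
recovered = (dequantize(q, lo, scale) @ H) * signs
print("max abs reconstruction error:", np.abs(recovered - key).max())
```

Without the rotation, the same 4-bit quantizer would spend nearly its whole range covering the single outlier coordinate and collapse the remaining values onto a handful of levels; after rotation, all coordinates have comparable magnitude and the quantizer's levels are used efficiently.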
  2. Sarvam AI is releasing Sarvam 30B and Sarvam 105B as open-source models, trained from scratch on large-scale, high-quality datasets. The models demonstrate strong reasoning, programming, and agentic capabilities and are optimized for efficient deployment across a range of hardware; Sarvam 30B powers Samvaad, while Sarvam 105B powers Indus. The release covers the model architecture, training process, benchmark results, and inference optimizations. Both models are available on AI Kosh and Hugging Face, and the article details their performance on benchmarks and in real-world applications such as webpage generation, JEE problem solving, and conversational agents.
