Tags: kv cache + machine learning

  1. Google Research has introduced TurboQuant, a quantization algorithm that compresses the key-value (KV) cache of large language models by up to 6x. Using a two-step process of randomized Hadamard transforms followed by Quantized Johnson-Lindenstrauss (QJL) transforms, the method achieves 3.5-bit compression with near-zero accuracy loss on benchmarks such as LongBench. This addresses the massive VRAM requirements of long-context windows, potentially allowing large models to run on significantly less powerful hardware (a rough sketch of the rotate-then-quantize idea follows the key points below).
    Key points:
    * Compresses KV cache down to 3.5 bits per value.
    * Maintains inference accuracy without requiring model retraining.
    * Rotates data vectors and applies QJL transforms to tame outlier-skewed coordinate distributions.
    * Reduces the memory bottleneck for long-context LLM inference.
    * Enables massive context windows on more modest hardware configurations.
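
A minimal sketch of the rotate-then-quantize idea, assuming NumPy and a power-of-two feature dimension; the function names, the fast Walsh-Hadamard routine, and the 4-bit setting here are illustrative, not TurboQuant's actual implementation (which pairs the rotation with QJL-based quantization):

```python
import numpy as np

def hadamard_rotate(x: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Randomized Hadamard transform H @ diag(signs) applied to each row of x.

    x: (n, d) with d a power of two; signs: (d,) array of random +/-1 entries.
    """
    y = x * signs                          # D: random sign flips decorrelate structure
    d = y.shape[-1]
    h = 1
    while h < d:                           # fast Walsh-Hadamard transform, O(d log d)
        for i in range(0, d, 2 * h):
            a = y[:, i:i + h].copy()
            b = y[:, i + h:i + 2 * h].copy()
            y[:, i:i + h] = a + b
            y[:, i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(d)                  # orthonormal scaling

def quantize_uniform(y: np.ndarray, bits: int = 4):
    """Per-row symmetric uniform quantization to `bits` bits."""
    scale = np.abs(y).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    q = np.round(y / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """H is orthonormal and symmetric, diag(signs) is its own inverse,
    so x ~= diag(signs) @ H @ y undoes the rotation."""
    y = q.astype(np.float32) * scale
    return hadamard_rotate(y, np.ones_like(signs)) * signs

# After rotation, each coordinate's distribution is close to Gaussian, so a
# uniform low-bit grid wastes far less of its range on outliers than it
# would on the raw keys/values.
rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 128)) * np.exp(rng.standard_normal(128))  # outlier-heavy
signs = rng.choice([-1.0, 1.0], size=128)
q, scale = quantize_uniform(hadamard_rotate(kv, signs))
err = np.abs(dequantize(q, scale, signs) - kv).mean()
```
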
  2. This paper introduces KVTC, a lightweight transform coder for compressing the key-value (KV) caches that are central to efficient large language model (LLM) serving. KV caches enable reuse across conversation turns but can consume significant GPU memory. KVTC reduces cache size without any change to the underlying model by applying techniques from classical media compression: PCA-based decorrelation, adaptive quantization, and entropy coding. The authors report up to 20x compression while maintaining reasoning accuracy and long-context performance, and even higher compression for specific applications (a toy version of the transform-coding pipeline is sketched below).
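
KVTC's pipeline maps naturally onto classical transform coding. The toy sketch below, assuming NumPy, fits a PCA basis on a calibration sample, uniformly quantizes the decorrelated coefficients, and estimates an entropy coder's rate rather than running one; all names and the 4-bit setting are hypothetical, not KVTC's code:

```python
import numpy as np

def fit_pca_basis(kv_sample: np.ndarray):
    """Fit an orthonormal decorrelating basis from a calibration sample of KV rows."""
    mean = kv_sample.mean(axis=0)
    _, _, vt = np.linalg.svd(kv_sample - mean, full_matrices=False)
    return mean, vt                        # rows of vt are the principal directions

def encode(kv: np.ndarray, mean, vt, bits: int = 4):
    """Decorrelate, then quantize each dimension with a symmetric uniform grid."""
    coeffs = (kv - mean) @ vt.T            # energy compacts into the leading dims
    scale = np.abs(coeffs).max(axis=0) / (2 ** (bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)
    q = np.round(coeffs / scale).astype(np.int8)
    return q, scale

def decode(q: np.ndarray, scale, mean, vt) -> np.ndarray:
    """Invert quantization, then rotate back with the orthonormal basis."""
    return (q.astype(np.float32) * scale) @ vt + mean

def entropy_bits_per_symbol(q: np.ndarray) -> float:
    """Shannon estimate of what an entropy coder would spend per coefficient."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
kv = rng.standard_normal((4096, 128)) @ rng.standard_normal((128, 128))  # correlated rows
mean, vt = fit_pca_basis(kv[:512])         # small calibration slice
q, scale = encode(kv, mean, vt)
rate = entropy_bits_per_symbol(q)          # usually lands below the raw 4 bits
```

Decorrelation is what makes the later stages cheap: once the coefficients are concentrated near zero in most dimensions, both the quantizer and the entropy coder spend their bits where the signal actually is.
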
  3. A deep dive into the process of LLM inference, covering tokenization, transformer architecture, KV caching, and optimization techniques for efficient text generation.
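
As a companion to that deep dive, here is a minimal, self-contained illustration (NumPy, single head, toy random projections standing in for the model) of what a KV cache actually stores during autoregressive decoding:

```python
import numpy as np

def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])  # similarity to every cached position
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax over past tokens
    return w @ V

d = 64
K_cache = np.empty((0, d))                 # grows by one row per generated token
V_cache = np.empty((0, d))
rng = np.random.default_rng(0)

for step in range(8):                      # stand-in for a real decode loop
    k, v, q = rng.standard_normal((3, d))  # per-token projections from the model
    K_cache = np.vstack([K_cache, k])      # append instead of recomputing history
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)      # attends over all cached positions

# Without the cache, each step would recompute keys and values for the entire
# prefix; caching trades that recomputation for the GPU memory that the two
# compression papers above try to shrink.
```
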
