Tags: kv cache*


  1. This paper introduces KVTC, a lightweight transform coder designed to compress key-value (KV) caches, which are crucial for efficient large language model (LLM) serving. KV caches enable reuse across conversation turns, but can consume significant GPU memory. KVTC addresses this by applying techniques from classical media compression (PCA-based decorrelation, adaptive quantization, and entropy coding) to reduce cache size without requiring changes to the underlying model. The authors demonstrate that KVTC achieves up to 20x compression while maintaining reasoning accuracy and long-context performance, and even higher compression for specific applications.
  2. A deep dive into the process of LLM inference, covering tokenization, transformer architecture, KV caching, and optimization techniques for efficient text generation.
  3. K8S-native cluster-wide deployment for vLLM. Provides a reference implementation for building an inference stack on top of vLLM, enabling scaling, monitoring, request routing, and KV cache offloading with easy cloud deployment.
  4. vLLM Production Stack provides a reference implementation on how to build an inference stack on top of vLLM, allowing for scalable, monitored, and performant LLM deployments using Kubernetes and Helm.
  5. The article discusses how the Key-Value (KV) Cache is used to optimize the inference process of Large Language Models (LLMs) by reducing redundant computations and improving performance.
    2024-12-27 by klotz
  6. This post explores optimization techniques for the Key-Value (KV) cache in Large Language Models (LLMs) to enhance scalability and reduce memory footprint, covering methods like Grouped-query Attention, Sliding Window Attention, PagedAttention, and distributed KV cache across multiple GPUs.
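
The mechanism behind several of the bookmarks above (the cache of per-token key/value projections reused across decoding steps) can be sketched in a few lines. This is a toy single-head example, not any particular library's implementation: `Wk` and `Wv` stand in for the model's key/value projection matrices, and the point is that appending one new row per step to the cache gives the same attention output as reprojecting the whole prefix from scratch.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    # q: (d,); K, V: (t, d) -- the keys/values for all t tokens so far.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, steps = 4, 5
# Hypothetical key/value projection weights (stand-ins for a real model's).
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
xs = rng.normal(size=(steps, d))        # token representations, one per step

# Incremental decoding with a KV cache: each step projects only the new
# token and appends it, instead of reprojecting the entire prefix.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for t in range(steps):
    K_cache = np.vstack([K_cache, xs[t] @ Wk])
    V_cache = np.vstack([V_cache, xs[t] @ Wv])
    out_cached = attention(xs[t], K_cache, V_cache)

# Without a cache, the final step would redo all projections:
out_full = attention(xs[-1], xs @ Wk, xs @ Wv)
assert np.allclose(out_cached, out_full)
```

The cache grows by one `(d,)` key row and one value row per generated token per layer per head, which is exactly the memory footprint that the compression and paging techniques in the bookmarks above (KVTC, PagedAttention, offloading) aim to reduce.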


SemanticScuttle - klotz.me: tagged with "kv cache"
