This paper introduces KVTC, a lightweight transform coder designed to compress key-value (KV) caches, which are crucial for efficient large language model (LLM) serving. KV caches enable reuse across conversation turns, but can consume significant GPU memory. KVTC addresses this by applying techniques from classical media compression – PCA-based decorrelation, adaptive quantization, and entropy coding – to reduce cache size without requiring changes to the underlying model. The authors demonstrate that KVTC achieves up to 20x compression while maintaining reasoning accuracy and long-context performance, and even higher compression for specific applications.
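The transform-coding recipe the summary names — decorrelate with PCA, quantize, then entropy-code — can be illustrated in a few lines of NumPy. This is a generic sketch of the classical pipeline, not KVTC's actual implementation; the data, quantization step, and function names are all assumptions for illustration.

```python
import numpy as np

def fit_pca(X):
    """Return the mean and principal axes of the rows of X."""
    mu = X.mean(axis=0)
    # Right singular vectors of the centered data form the
    # decorrelating (PCA) basis.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt

def quantize(coeffs, step):
    """Uniform scalar quantization: round to the nearest multiple of step."""
    return np.round(coeffs / step).astype(np.int32)

def dequantize(q, step):
    return q.astype(np.float64) * step

rng = np.random.default_rng(0)
# Toy stand-in for correlated cache vectors: 256 rows of dimension 8
# that really live near a 2-D subspace, so PCA concentrates the energy.
base = rng.normal(size=(256, 2))
X = base @ rng.normal(size=(2, 8)) + 0.01 * rng.normal(size=(256, 8))

mu, Vt = fit_pca(X)
coeffs = (X - mu) @ Vt.T           # decorrelated coefficients
q = quantize(coeffs, step=0.1)     # small integers; most are near zero
# In a real coder, q would now be entropy-coded (e.g. with a range coder);
# here we only verify the lossy round trip.
X_hat = dequantize(q, 0.1) @ Vt + mu

err = np.max(np.abs(X - X_hat))
print(f"max reconstruction error: {err:.3f}")
```

The point of the decorrelation step is that after projecting onto the PCA basis, most coefficients are small and highly compressible by the entropy coder, while the reconstruction error stays bounded by the quantization step.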
OpenZL is a new open-source data compression framework that offers lossless compression for structured data, achieving performance comparable to specialized compressors by applying configurable transforms that reveal hidden order in the data.
A detailed analysis of the DeepMind/Meta study: how large language models achieve unprecedented compression rates on text, image, and audio data, and what these results imply.
ffmpeg -i demo.mov -c:v libx264 demo.mp4