Tags: kv cache*


  1. This paper introduces KVTC, a lightweight transform coder designed to compress key-value (KV) caches, which are crucial for efficient large language model (LLM) serving. KV caches enable reuse across conversation turns, but can consume significant GPU memory. KVTC addresses this by applying techniques from classical media compression (PCA-based decorrelation, adaptive quantization, and entropy coding) to reduce cache size without requiring changes to the underlying model. The authors demonstrate that KVTC achieves up to 20x compression while maintaining reasoning accuracy and long-context performance, and even higher compression for specific applications.
  2. A deep dive into the process of LLM inference, covering tokenization, transformer architecture, KV caching, and optimization techniques for efficient text generation.
  3. K8S-native cluster-wide deployment for vLLM. Provides a reference implementation for building an inference stack on top of vLLM, enabling scaling, monitoring, request routing, and KV cache offloading with easy cloud deployment.
  4. vLLM Production Stack provides a reference implementation on how to build an inference stack on top of vLLM, allowing for scalable, monitored, and performant LLM deployments using Kubernetes and Helm.
  5. The article discusses how the Key-Value (KV) Cache is used to optimize the inference process of Large Language Models (LLMs) by reducing redundant computations and improving performance.
    2024-12-27 by klotz
  6. This post explores optimization techniques for the Key-Value (KV) cache in Large Language Models (LLMs) to enhance scalability and reduce memory footprint, covering methods like Grouped-query Attention, Sliding Window Attention, PagedAttention, and distributed KV cache across multiple GPUs.
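
The mechanism behind several of the bookmarks above (the cache of per-token key/value projections reused across decoding steps) can be sketched in a few lines. This is a toy single-head example, not any particular library's implementation: `Wk` and `Wv` stand in for the model's key/value projection matrices, and the point is that appending one new row per step to the cache gives the same attention output as reprojecting the whole prefix from scratch.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    # q: (d,); K, V: (t, d) -- the keys/values for all t tokens so far.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, steps = 4, 5
# Hypothetical key/value projection weights (stand-ins for a real model's).
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
xs = rng.normal(size=(steps, d))        # token representations, one per step

# Incremental decoding with a KV cache: each step projects only the new
# token and appends it, instead of reprojecting the entire prefix.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for t in range(steps):
    K_cache = np.vstack([K_cache, xs[t] @ Wk])
    V_cache = np.vstack([V_cache, xs[t] @ Wv])
    out_cached = attention(xs[t], K_cache, V_cache)

# Without a cache, the final step would redo all projections:
out_full = attention(xs[-1], xs @ Wk, xs @ Wv)
assert np.allclose(out_cached, out_full)
```

The cache grows by one `(d,)` key row and one value row per generated token per layer per head, which is exactly the memory footprint that the compression and paging techniques in the bookmarks above (KVTC, PagedAttention, offloading) aim to reduce.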


SemanticScuttle - klotz.me: tagged with "kv cache"
