SemanticScuttle - klotz.me » klotz: key-value cache

klotz: key-value cache*

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

This paper introduces Cross-Layer Attention (CLA), an extension of Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) for reducing the size of the key-value cache in transformer-based autoregressive large language models (LLMs). The authors demonstrate that CLA can reduce the cache size by another 2x while maintaining nearly the same accuracy as unmodified MQA, enabling inference with longer sequence lengths and larger batch sizes.

2024-05-26 Tags: transformer, autoregressive language models, key-value cache, attention, multiquery attention, cross-layer attention, machine learning, computer science, llm, mit, csail by klotz
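To make the cache-size claim in the entry above concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the paper: the function name `kv_cache_bytes`, the `cla_factor` parameter, and the model dimensions (32 layers, 32 query heads, head dimension 128, 4096 tokens, 16-bit values) are all hypothetical and chosen only for illustration. It assumes the cache stores one key and one value vector per token, per KV head, per layer that keeps its own cache, and that cross-layer sharing lets groups of adjacent layers reuse a single set of keys and values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, cla_factor=1):
    """Approximate KV cache size in bytes (illustrative estimate only).

    cla_factor is a hypothetical parameter: the number of adjacent layers
    that share one set of keys/values (1 = no cross-layer sharing).
    """
    if n_layers % cla_factor != 0:
        raise ValueError("n_layers must be divisible by cla_factor")
    # Only one layer per sharing group stores keys and values.
    kv_layers = n_layers // cla_factor
    # The leading 2 accounts for storing both keys and values.
    return (2 * kv_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model dimensions, not from the paper.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096, batch_size=1)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)                     # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)                      # 8 shared KV heads
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)                      # a single KV head
mqa_cla2 = kv_cache_bytes(n_kv_heads=1, cla_factor=2, **cfg)   # MQA plus sharing across layer pairs

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MQA+CLA2", mqa_cla2)]:
    print(f"{name:9s} {size / 2**20:8.1f} MiB")
```

Under these assumptions, sharing keys and values across pairs of layers halves the MQA figure, which matches the 2x reduction described in the summary. Accuracy effects are outside the scope of this estimate.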

Techniques to Improve Memory and Computational Efficiency of Large Language Models

An overview of techniques for improving the memory and computational efficiency of Large Language Models (LLMs) when handling long input sequences, in tasks such as retrieval-augmented question answering, summarization, and chat. It covers lower-precision computation, the Flash Attention algorithm, positional embedding methods, and key-value caching strategies. These methods reduce memory consumption and increase inference speed while maintaining high accuracy in LLM applications. It also highlights more advanced approaches, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which further improve computational and memory efficiency without compromising performance.

2024-01-30 Tags: llm, quantization, flash attention, position embeddings, key-value cache, multi-query-attention, grouped-query-attention, performance, optimization by klotz
