SemanticScuttle - klotz.me » klotz: attention+llm

What Is Attention in Language Models?

2023-02-14 Tags: llm, attention by klotz

Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs

2024-04-19 Tags: llm, attention, python, pytorch, self-attention by klotz

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

This paper introduces Cross-Layer Attention (CLA), an extension of Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) for reducing the size of the key-value cache in transformer-based autoregressive large language models (LLMs). The authors demonstrate that CLA can reduce the cache size by another 2x while maintaining nearly the same accuracy as unmodified MQA, enabling inference with longer sequence lengths and larger batch sizes.

2024-05-26 Tags: transformer, autoregressive language models, key-value cache, attention, multiquery attention, cross-layer attention, machine learning, computer science, llm, mit, csail by klotz
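
The cache-size claim in the Cross-Layer Attention entry above can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope estimate, not code from the paper: the function name `kv_cache_bytes`, the parameter `cla_sharing_factor`, and the model dimensions (32 layers, 32 heads, head size 128, 4096 tokens, 16-bit values) are all assumptions chosen for illustration. It assumes the cache stores one key and one value vector per token, per key-value head, per layer that keeps its own cache, and that with CLA only one layer in each sharing group stores keys and values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, cla_sharing_factor=1):
    """Rough KV cache size in bytes (illustrative estimate only).

    cla_sharing_factor is a hypothetical knob: the number of adjacent
    layers that share one set of cached keys and values (1 = no sharing).
    """
    # Number of layers that actually store K/V (ceiling division).
    kv_layers = -(-n_layers // cla_sharing_factor)
    # Factor of 2 accounts for storing both keys and values.
    return (2 * kv_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed model shape, for illustration only.
shape = dict(n_layers=32, head_dim=128, seq_len=4096, batch_size=1)

mha = kv_cache_bytes(n_kv_heads=32, **shape)   # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)    # 8 shared KV heads
mqa = kv_cache_bytes(n_kv_heads=1, **shape)    # a single shared KV head
mqa_cla2 = kv_cache_bytes(n_kv_heads=1, cla_sharing_factor=2, **shape)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa),
                   ("MQA + CLA2", mqa_cla2)]:
    print(f"{name:>10}: {size / 2**20:8.1f} MiB")
```

Under these assumptions, sharing keys and values between pairs of adjacent layers halves the MQA cache, which matches the "another 2x" figure quoted in the summary. The accuracy trade-off is an empirical result from the paper and is not something this arithmetic can show.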
