Tags: key-value cache* + computer science* + transformer*

0 bookmark(s) - Sort by: Date ↓ / Title /

  1. This paper introduces Cross-Layer Attention (CLA), an extension of Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) for reducing the size of the key-value cache in transformer-based autoregressive large language models (LLMs). The authors demonstrate that CLA can reduce the cache size by another 2x while maintaining nearly the same accuracy as unmodified MQA, enabling inference with longer sequence lengths and larger batch sizes.

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: tagged with "key-value cache+computer science+transformer"

About - Propulsed by SemanticScuttle