Long contexts in language models are bottlenecked by KV cache size. Summarization compacts the token space but can lose information. This work introduces Attention Matching, a fast method that compacts the KV cache in latent space by matching attention outputs, achieving up to 50x compression with little quality degradation as a faster alternative to full optimization.
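A minimal sketch of the attention-matching idea, under stated assumptions (the actual method and its names are not specified here): pick a small set of compressed keys, then fit compressed values so that attention outputs over probe queries match those of the full cache. Here the compressed keys are a simple subset and the values are solved in closed form by least squares; a real implementation would optimize both.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V):
    # Scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

rng = np.random.default_rng(0)
n, m, d, s = 512, 32, 64, 256    # full length, compressed length, head dim, probe queries

K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
Q = rng.normal(size=(s, d))      # probe queries sampled for the fit

# Targets: outputs of attention over the full KV cache.
O = attn(Q, K, V)

# Compressed keys: a strided subset (illustrative; a real method would optimize these too).
K_c = K[:: n // m][:m]

# Fit compressed values so compressed attention matches the target outputs.
A = softmax(Q @ K_c.T / np.sqrt(d))          # (s, m) attention weights over compressed keys
V_c, *_ = np.linalg.lstsq(A, O, rcond=None)  # least-squares "attention matching"

err = np.linalg.norm(attn(Q, K_c, V_c) - O) / np.linalg.norm(O)
print(f"compression {n / m:.0f}x, relative output error {err:.3f}")
```

The closed-form value fit is what makes this fast relative to full gradient-based optimization: only the keys would need iterative treatment.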
A Python implementation of Recursive Language Models for processing unbounded context lengths: handle 100k+ tokens with any LLM by storing the context as variables instead of placing it in the prompt.
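A minimal sketch of the context-as-variable idea (the class and method names here are illustrative assumptions, not the repo's actual API): the long context lives in an environment the model can query with code, so no single prompt ever holds the full text, and slices can be handed to recursive sub-calls.

```python
import re

class ContextEnv:
    """Holds a long context as a variable; exposes cheap inspection ops
    the model can invoke instead of reading the whole text in-prompt."""

    def __init__(self, text: str):
        self.text = text

    def peek(self, start: int = 0, n: int = 200) -> str:
        # Inspect a small window of the context.
        return self.text[start:start + n]

    def grep(self, pattern: str, window: int = 80) -> list[str]:
        # Return a snippet around each regex match, not the whole context.
        return [
            self.text[max(m.start() - window, 0): m.end() + window]
            for m in re.finditer(pattern, self.text)
        ]

    def chunks(self, size: int = 10_000):
        # Slices suitable for recursive sub-calls to the LLM.
        for i in range(0, len(self.text), size):
            yield self.text[i:i + size]

# Usage: the root call sees only short snippets, never the 200k-char context.
env = ContextEnv("needle-free filler. " * 10_000 + "The secret code is 4217.")
hits = env.grep(r"secret code is (\d+)")
print(hits[-1])   # a short snippet containing the answer
```

The design point is that prompt size stays bounded regardless of context length: the model decides which slices to read or delegate, trading one huge call for several small ones.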