Google DeepMind research reveals a fundamental architectural limitation in Retrieval-Augmented Generation (RAG) systems related to fixed-size embeddings. The research demonstrates that retrieval performance degrades as database size increases, with theoretical limits determined by embedding dimensionality. The researchers introduce the LIMIT benchmark to test these limitations empirically and suggest alternatives such as cross-encoders, multi-vector models, and sparse models.
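To make the contrast concrete, here is a minimal NumPy sketch of how scoring differs between a single fixed-size embedding per document and a ColBERT-style multi-vector (late-interaction) scheme, one of the alternatives mentioned above. The vectors below are random placeholders standing in for real model outputs, and the dimensions are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64          # fixed embedding dimension: the bottleneck for single-vector retrieval
n_docs = 1000

# --- Single-vector retrieval: one d-dimensional embedding per query and document ---
query_vec = rng.standard_normal(d)
doc_vecs = rng.standard_normal((n_docs, d))
single_scores = doc_vecs @ query_vec            # one dot product per document

# --- Multi-vector (late interaction): one embedding per token ---
# Score = sum over query tokens of the max similarity to any document token.
query_toks = rng.standard_normal((8, d))        # 8 query-token embeddings
doc_toks = rng.standard_normal((n_docs, 32, d)) # 32 token embeddings per document

def maxsim_score(q_toks, d_toks):
    sims = q_toks @ d_toks.T                    # (query tokens x doc tokens) similarities
    return sims.max(axis=1).sum()               # best match per query token, summed

multi_scores = np.array([maxsim_score(query_toks, doc_toks[i]) for i in range(n_docs)])

print("top-5 (single-vector):", np.argsort(-single_scores)[:5])
print("top-5 (multi-vector): ", np.argsort(-multi_scores)[:5])
```

The point of the sketch is that the multi-vector scorer has many more degrees of freedom per document than a single d-dimensional vector, which is why such models are proposed as a way around the fixed-size embedding ceiling.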
Researchers from Google DeepMind have developed Differentiable Cache Augmentation, a method that uses a coprocessor to augment a frozen LLM's key-value (KV) cache with latent embeddings, enhancing reasoning capabilities without increasing the base model's computational burden.
"The methodology revolves around a three-stage process. First, the frozen LLM generates a kv-cache from an input sequence, encapsulating its internal representation. This kv-cache is passed to the coprocessor, which processes it with additional trainable soft tokens. Not tied to specific words, these tokens act as abstract prompts for generating latent embeddings. Once processed, the augmented kv-cache is fed back into the LLM, enabling it to generate contextually enriched outputs. This asynchronous operation ensures the coprocessor’s enhancements are applied efficiently without delaying the LLM’s primary functions. Training the coprocessor is conducted using a language modeling loss, focusing solely on its parameters while preserving the integrity of the frozen LLM. This targeted approach allows for scalable and effective optimization."