This paper proposes SkyMemory, a key-value cache (KVC) hosted on a LEO satellite constellation to accelerate transformer-based inference, particularly for large language models (LLMs). It explores different chunk-to-server mapping strategies (rotation-aware, hop-aware, and combined) and presents simulation results and a proof-of-concept implementation demonstrating performance improvements.
In this notebook, we will build a typical RAG solution using an open-source model and the vector database Chroma DB. On top of it, we will integrate a semantic cache system that stores previous user queries and, for each new query, decides whether to enrich the prompt with information retrieved from the vector database or with the answer already held in the cache.
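As a rough illustration of that decision flow, here is a minimal sketch. It assumes sentence-transformers for query embeddings and a simple in-memory cache; the `SemanticCache` class, the `build_prompt` helper, and the 0.9 similarity threshold are illustrative choices, not the notebook's actual code.

```python
# Sketch of a semantic cache in front of a Chroma-backed RAG pipeline.
# chromadb and sentence-transformers calls follow their public APIs;
# everything else (names, threshold) is an illustrative assumption.
import numpy as np
import chromadb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
client = chromadb.Client()
collection = client.get_or_create_collection("documents")  # the RAG knowledge base

SIM_THRESHOLD = 0.9  # cosine similarity above which a cached query counts as a hit


class SemanticCache:
    """Stores embeddings of past queries together with their retrieved context."""

    def __init__(self):
        self.embeddings: list[np.ndarray] = []
        self.contexts: list[str] = []

    def lookup(self, query_emb: np.ndarray):
        # Return the cached context of the most similar past query, if close enough.
        best_sim, best_ctx = -1.0, None
        for emb, ctx in zip(self.embeddings, self.contexts):
            sim = float(np.dot(query_emb, emb)
                        / (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
            if sim > best_sim:
                best_sim, best_ctx = sim, ctx
        return best_ctx if best_sim >= SIM_THRESHOLD else None

    def add(self, query_emb: np.ndarray, context: str) -> None:
        self.embeddings.append(query_emb)
        self.contexts.append(context)


cache = SemanticCache()


def build_prompt(user_query: str) -> str:
    """Enrich the prompt from the cache on a hit, else from the vector DB."""
    query_emb = encoder.encode(user_query)
    context = cache.lookup(query_emb)
    if context is None:
        # Cache miss: retrieve from Chroma and remember the result.
        results = collection.query(query_texts=[user_query], n_results=3)
        context = "\n".join(results["documents"][0])
        cache.add(query_emb, context)
    return f"Answer using this context:\n{context}\n\nQuestion: {user_query}"
```

A cache hit skips both the embedding search against the full document collection and the retrieval round trip, so repeated or paraphrased questions are answered from the cheaper in-memory path.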