A user demonstrates how to run a 120B model efficiently on hardware with only 8 GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, achieving high performance despite the small VRAM footprint.
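As a rough illustration of that setup, the sketch below launches a llama.cpp server with the expert tensors pinned to the CPU. It assumes a recent llama.cpp build that supports the --override-tensor (-ot) option; the GGUF filename, context size, port, and tensor-name pattern are illustrative assumptions, not details taken from the post.

```python
# Sketch: start a llama.cpp server that keeps attention layers on the GPU
# while forcing the MoE expert FFN weights to stay in system RAM.
# Flag names assume a recent llama.cpp build; model path and pattern are hypothetical.
import subprocess

cmd = [
    "llama-server",
    "-m", "gpt-oss-120b.gguf",                    # hypothetical GGUF filename
    "--n-gpu-layers", "99",                       # request all layers on the GPU first...
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",  # ...then push expert FFN tensors back to CPU RAM
    "--ctx-size", "8192",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```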
A 120 billion parameter OpenAI model can now run on consumer hardware thanks to its Mixture of Experts (MoE) architecture: only a small fraction of the expert parameters is active for each token, so the bulky expert weights can sit in system RAM and be computed on the CPU while a modest GPU handles the attention layers.
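A back-of-envelope estimate shows why this split fits in 8 GB of VRAM. All figures below (roughly 117B total parameters, a few billion non-expert parameters, ~4-bit quantized weights, a modest KV cache) are assumptions for illustration, not numbers from the article.

```python
# Rough VRAM estimate when only non-expert weights and the KV cache live on the GPU.
GiB = 1024 ** 3

total_params      = 117e9       # assumed: whole MoE model, mostly expert FFN weights
non_expert_params = 4e9         # assumed: attention, embeddings, norms, router
bytes_per_weight  = 0.5         # assumed: ~4-bit quantization
kv_cache_bytes    = 1.5 * GiB   # assumed: KV cache for a modest context window

expert_bytes = (total_params - non_expert_params) * bytes_per_weight
gpu_bytes    = non_expert_params * bytes_per_weight + kv_cache_bytes

print(f"expert weights kept in system RAM: {expert_bytes / GiB:5.1f} GiB")
print(f"weights + KV cache kept in VRAM:   {gpu_bytes / GiB:5.1f} GiB")
# With numbers in this ballpark, the GPU share stays well under 8 GiB,
# which is why only the expert layers need to spill over to the CPU.
```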