oLLM is a Python library for running large-context Transformer models on NVIDIA GPUs by offloading the weights and the KV cache to SSD. It supports models such as Llama-3, GPT-OSS-20B, and Qwen3-Next-80B, and enables contexts of up to 100K tokens on GPUs with 8-10 GB of memory, without quantization.
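A back-of-the-envelope sketch of why offloading is needed at this context length. It is not oLLM code: the helper name `kv_cache_bytes` is hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values, roughly 8 billion parameters) is an assumption based on the published Llama-3-8B configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Unquantized KV-cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Assumed Llama-3-8B shape (not taken from oLLM itself).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=1)
full_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=100_000)

# Assumed parameter count of about 8e9, stored at 2 bytes each (16-bit).
weight_bytes = 8_000_000_000 * 2

print(f"KV cache per token:   {per_token / 1024:.0f} KiB")
print(f"KV cache, 100K tokens: {full_context / 1e9:.1f} GB")
print(f"Weights at 16 bits:    {weight_bytes / 1e9:.1f} GB")
```

Under these assumptions the cache alone comes to about 13 GB and the weights to about 16 GB, so neither fits in 8-10 GB of GPU memory unless most of it lives elsewhere, which is the gap SSD offloading is meant to close.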
This article by Dmitrii Eliuseev shows how to test small language models, namely the 3.8B Phi-3 and the 8B Llama-3, on a PC and a Raspberry Pi using LlamaCpp and ONNX.