This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showcasing how it can quantize LLMs like DeepSeek-V3.1 to as low as 1-bit while outperforming models like GPT-4.5 and Claude-4-Opus. It also covers benchmark setup, comparisons to other quantization methods, and chat template bug fixes.
An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
The article discusses the growing trend of running Large Language Models (LLMs) locally on personal machines, exploring the motivations behind this shift – including privacy concerns, cost savings, and a desire for technological sovereignty – as well as the hardware and software advancements making it increasingly feasible.
This article details 7 lessons the author learned while self-hosting Large Language Models (LLMs), covering topics like the importance of memory bandwidth, quantization, electricity costs, hardware choices beyond Nvidia, prompt engineering, Mixture of Experts models, and starting with simpler tools like LM Studio.
This page details the DeepSeek-R1-0528-Qwen3-8B model, a quantized version of DeepSeek-R1-0528, highlighting its improved reasoning capabilities, evaluation results, usage guidelines, and licensing information. It offers various quantization options (GGUF) for local execution.
SGLang is a fast serving framework for large language models and vision language models. It focuses on efficient serving and controllable interaction through co-designed backend runtime and frontend language.
This article details the often overlooked cost of storing embeddings for RAG systems, and how quantization techniques (int8 and binary) can significantly reduce storage requirements and improve retrieval speed without substantial accuracy loss.
This document details how to run Gemma models, covering framework selection, variant choice, and running generation/inference requests. It emphasizes considering available hardware resources and provides recommendations for beginners.
This document details how to run Qwen models locally using the Text Generation Web UI (oobabooga), covering installation, setup, and launching the web interface.
Meta AI has released quantized versions of the Llama 3.2 models (1B and 3B), which improve inference speed by up to 2-4x and reduce model size by 56%, making advanced AI technology more accessible to a wider range of users.