This article details benchmarks for Unsloth Dynamic GGUFs of the Qwen3.5 model, including analysis of perplexity, KL divergence, and MXFP4. It covers performance across different bit widths and quant types, highlighting the impact of the importance matrix (imatrix) and the limitations of certain quantization approaches. Full benchmark data is also provided.
Qwen3-Coder-Next is an 80B MoE model with 256K context designed for fast, agentic coding and local use. It offers performance comparable to models with 10-20x more active parameters and excels in long-horizon reasoning, complex tool use, and recovery from execution failures.
A deep dive into the process of LLM inference, covering tokenization, transformer architecture, KV caching, and optimization techniques for efficient text generation.
This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showing that LLMs like DeepSeek-V3.1 can be quantized to as low as 1-bit and still outperform models like GPT-4.5 and Claude-4-Opus. It also covers benchmark setup, comparisons to other quantization methods, and chat template bug fixes.
An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
The article discusses the growing trend of running Large Language Models (LLMs) locally on personal machines, exploring the motivations behind this shift – including privacy concerns, cost savings, and a desire for technological sovereignty – as well as the hardware and software advancements making it increasingly feasible.
This article details 7 lessons the author learned while self-hosting Large Language Models (LLMs), covering topics like the importance of memory bandwidth, quantization, electricity costs, hardware choices beyond Nvidia, prompt engineering, Mixture of Experts models, and starting with simpler tools like LM Studio.
This page details the DeepSeek-R1-0528-Qwen3-8B model, a Qwen3 8B model distilled from the chain-of-thought of DeepSeek-R1-0528, highlighting its improved reasoning capabilities, evaluation results, usage guidelines, and licensing information. It offers various GGUF quantization options for local execution.
SGLang is a fast serving framework for large language models and vision language models. It focuses on efficient serving and controllable interaction through a co-designed backend runtime and frontend language.
This article details the often overlooked cost of storing embeddings for RAG systems, and how quantization techniques (int8 and binary) can significantly reduce storage requirements and improve retrieval speed without substantial accuracy loss.
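The int8 and binary embedding quantization mentioned in the last summary can be illustrated with a minimal NumPy sketch. This is not the article's code: the helper names (`quantize_int8`, `quantize_binary`, `hamming_distances`), the per-dimension min/max calibration, the zero threshold for binarization, and the random 1024-dimensional embeddings are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_int8(emb):
    """Scalar-quantize float32 embeddings to int8 (assumed min/max calibration)."""
    lo = emb.min(axis=0)
    hi = emb.max(axis=0)
    scale = (hi - lo) / 255.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    # Map each dimension onto 0..255, then shift into the int8 range -128..127.
    q = np.round((emb - lo) / scale) - 128
    return q.astype(np.int8), lo, scale

def quantize_binary(emb):
    """Keep one bit per dimension (sign of the value), packed 8 bits per byte."""
    bits = (emb > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_distances(query_bits, corpus_bits):
    """Count differing bits between one packed query and every packed row."""
    xor = np.bitwise_xor(corpus_bits, query_bits)
    return np.unpackbits(xor, axis=1).sum(axis=1)

# Hypothetical corpus: 1000 random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 1024)).astype(np.float32)

int8_emb, lo, scale = quantize_int8(embeddings)
binary_emb = quantize_binary(embeddings)

# Storage in bytes: float32 vs int8 (4x smaller) vs binary (32x smaller).
print(embeddings.nbytes, int8_emb.nbytes, binary_emb.nbytes)
```

Binary vectors are compared by Hamming distance, which is cheap to compute; a common pattern, assumed here rather than taken from the article, is to retrieve candidates with the binary index and rescore them with higher-precision vectors.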