This guide helps engineers build and ship LLM products by covering the full technical stack. It moves from core mechanics (tokenization, embeddings, attention) to training methodologies (pretraining, SFT, RLHF/DPO) and deployment optimizations (LoRA, quantization, vLLM). The focus is on managing critical production tradeoffs between accuracy, latency, memory, and cost.
This document details how to run Google's Gemma 4 models locally, including the E2B, E4B, 26B-A4B, and 31B variants. Gemma 4 is a family of open models supporting over 140 languages and up to 256K context, available in both dense and MoE configurations. The E2B and E4B models support image and audio input. These models can be run locally on your device and fine-tuned using Unsloth Studio. The document outlines hardware requirements, recommended settings, and best practices for prompting and multimodal use, including guidance on context length and thinking mode.
Understand API rate limits and restrictions. This document details how OpenAI's rate limit system works, including usage tiers, response headers, error-mitigation strategies such as exponential backoff, and request batching.
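The exponential-backoff strategy mentioned above can be sketched in pure Python; the function name and parameters here are illustrative, not part of OpenAI's SDK:

```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying after failures with exponentially growing delays.

    The delay doubles each attempt (base_delay * 2**attempt), is capped at
    max_delay, and gets random jitter so many clients retrying at once do
    not hit the API in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

In production you would catch only rate-limit errors (e.g. HTTP 429) rather than all exceptions, and honor any retry-after hints the API returns.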
This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showcasing how they can quantize LLMs like DeepSeek-V3.1 to as low as 1-bit while outperforming models like GPT-4.5 and Claude-4-Opus. It also covers benchmark setup, comparisons to other quantization methods, and chat template bug fixes.
This blog post details a fine-tuning workflow for the gpt-oss model that recovers post-training accuracy while retaining the performance benefits of FP4. It involves supervised fine-tuning (SFT) on an upcasted BF16 version of the model, followed by quantization-aware training (QAT) using NVIDIA TensorRT Model Optimizer. The article also discusses the benefits of using NVFP4 for even better convergence and accuracy recovery.
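The core idea behind quantization-aware training is to run the forward pass through "fake-quantized" weights so the model adapts to the low-precision grid. A minimal pure-Python sketch of symmetric fake quantization follows; it is a generic illustration, not NVIDIA TensorRT Model Optimizer's API, and uses a simple integer grid rather than the FP4/NVFP4 formats the article discusses:

```python
def fake_quantize(x, bits=4):
    """Snap each value to a symmetric low-bit grid, then dequantize.

    qmax is the largest representable magnitude on the grid (7 for 4-bit);
    scale maps the tensor's max magnitude onto that grid. During QAT the
    forward pass sees these snapped values, so training learns weights
    that survive the precision loss.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid 0 scale for all-zero input
    return [round(v / scale) * scale for v in x]
```

Real QAT pipelines apply this per-channel or per-block, and use a straight-through estimator so gradients flow through the non-differentiable rounding step.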
How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, and Open WebUI, and how to fine-tune with Unsloth! This page details running Gemma 3 on various platforms, including phones, and fine-tuning it using Unsloth, addressing potential issues with float16 precision and providing optimal configuration settings.
Learn how to run and fine-tune Mistral Devstral 1.1, including Small-2507 and 2505. This guide covers official recommended settings, tutorials for running Devstral in Ollama and llama.cpp, experimental vision support, and fine-tuning with Unsloth.
A post with pithy observations and clear conclusions from building complex LLM workflows, covering topics like prompt chaining, data structuring, model limitations, and fine-tuning strategies.
This document details how to run and fine-tune Gemma 3 models (1B, 4B, 12B, and 27B) using Unsloth, covering setup with Ollama and llama.cpp, and addressing potential float16 precision issues. It also highlights Unsloth's unique ability to run Gemma 3 in float16 on machines like Colab notebooks with Tesla T4 GPUs.
This article details a method for training large language models (LLMs) for code generation using a secure, local WebAssembly-based code interpreter and reinforcement learning with Group Relative Policy Optimization (GRPO). It covers the setup, training process, evaluation, and potential next steps.
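The group-relative advantage computation at the heart of GRPO can be sketched in a few lines. This is an illustrative pure-Python version (function name is hypothetical); the article's full training loop adds policy gradients, clipping, and a KL penalty on top:

```python
from statistics import mean, stdev

def grpo_advantages(rewards):
    """Compute GRPO's group-relative advantages.

    Instead of training a separate value network as a baseline, GRPO
    samples a group of completions for the same prompt and normalizes
    each completion's reward against the group's mean and standard
    deviation. Completions better than their siblings get positive
    advantage, worse ones negative.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

For the code-generation setup described above, each reward would come from running the completion in the sandboxed WebAssembly interpreter (e.g. 1.0 if the tests pass, 0.0 otherwise).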