This repository provides the GGUF quantized weights for Qwen3.6-27B, a flagship-level coding model designed for stability and real-world utility. The model features significant upgrades in agentic coding capabilities, allowing it to handle frontend workflows and repository-level reasoning with high precision. It also introduces thinking preservation, which enables the model to retain reasoning context from historical messages to improve iterative development.
Key technical highlights:
* Native context length of 262,144 tokens, extensible up to 1,010,000 via RoPE scaling (YaRN).
* Enhanced tool-calling capabilities for complex agentic tasks.
* Support for multimodal inputs including images and video.
* Optimized for various inference frameworks like SGLang, vLLM, and KTransformers.
Unsloth AI presents performance benchmarks for Qwen3.6-35B-A3B GGUF quantizations, claiming state-of-the-art results in mean KL divergence across most model sizes. The discussion includes community analysis regarding SWE-bench Verified performance, where some users noted unexpected discrepancies between Qwen3.5 and Qwen3.6 quantization results during coding tasks.
Key points:
- Unsloth ranks first in 21 of 22 model sizes for mean KL divergence.
- Community debate over SWE-bench testing methodology and sample sizes.
- Reported performance variations between different quantization levels (Q4, Q5, Q6, Q8).
- Discussion on system prompt adherence and error rates in coding benchmarks.
This document details how to run Google's Gemma 4 models locally, including the E2B, E4B, 26B-A4B, and 31B variants. Gemma 4 is a family of open models supporting over 140 languages and up to 256K context, available in both dense and MoE configurations. The E2B and E4B models support image and audio input. These models can be run locally on your device and fine-tuned using Unsloth Studio. The document outlines hardware requirements, recommended settings, and best practices for prompting and multimodal use, including guidance on context length and thinking mode.
This Hugging Face page details the Gemma 4 31B-it model, an open-weights multimodal model created by Google DeepMind. Gemma 4 can process both text and image inputs, generating text outputs, with smaller models also supporting audio. It comes in various sizes (E2B, E4B, 26B A4B, and 31B) allowing for deployment on diverse hardware, from phones to servers.
The model boasts a context window of up to 256K tokens and supports over 140 languages. It utilizes dense and Mixture-of-Experts (MoE) architectures, excelling in tasks like text generation, coding, and reasoning. The page provides details on model data, training, ethics, usage, limitations, and best practices, along with code snippets for getting started with Transformers.
This article details benchmarks for Unsloth Dynamic GGUFs of the Qwen3.5 model, including analysis of perplexity, KL divergence, and MXFP4. It covers performance across different bit widths and quant types, highlighting the impact of Imatrix and the limitations of certain quantization approaches. Full benchmark data is also provided.
This guide explains how to use tool calling with local LLMs, including examples with mathematical, story, Python code, and terminal functions, using llama.cpp, llama-server, and OpenAI endpoints.
This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showcasing how it can quantize LLMs like DeepSeek-V3.1 to as low as 1-bit while outperforming models like GPT-4.5 and Claude-4-Opus. It also covers benchmark setup, comparisons to other quantization methods, and chat template bug fixes.
How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth! This page details running Gemma 3 on various platforms, including phones, and fine-tuning it using Unsloth, addressing potential issues with float16 precision and providing optimal configuration settings.
Learn how to run and fine-tune Mistral Devstral 1.1, including Small-2507 and 2505. This guide covers official recommended settings, tutorials for running Devstral in Ollama and llama.cpp, experimental vision support, and fine-tuning with Unsloth.
This document details how to run and fine-tune Gemma 3 models (1B, 4B, 12B, and 27B) using Unsloth, covering setup with Ollama and llama.cpp, and addressing potential float16 precision issues. It also highlights Unsloth's unique ability to run Gemma 3 in float16 on machines like Colab notebooks with Tesla T4 GPUs.