This repository provides the GGUF quantized weights for Qwen3.6-27B, a flagship-level coding model designed for stability and real-world utility. The model features significant upgrades in agentic coding capabilities, allowing it to handle frontend workflows and repository-level reasoning with high precision. It also introduces thinking preservation, which enables the model to retain reasoning context from historical messages to improve iterative development.
Key technical highlights:
* Native context length of 262,144 tokens, extensible up to 1,010,000 via RoPE scaling (YaRN); a configuration sketch follows this list.
* Enhanced tool-calling capabilities for complex agentic tasks.
* Support for multimodal inputs including images and video.
* Optimized for various inference frameworks like SGLang, vLLM, and KTransformers.
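For illustration, YaRN-style extension is typically switched on through a `rope_scaling` entry in the model's `config.json`; below is a minimal sketch assuming the Qwen-style schema, with the scaling factor derived from the figures above (1,010,000 / 262,144 ≈ 3.85) rather than taken from an official card.

```python
import json

# Minimal sketch: enabling YaRN RoPE scaling via config.json, assuming
# the Qwen-style "rope_scaling" schema. Values are derived from the
# numbers quoted above, not copied from an official model card.
NATIVE_CTX = 262_144
TARGET_CTX = 1_010_000

config = {
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": round(TARGET_CTX / NATIVE_CTX, 2),  # ~3.85
        "original_max_position_embeddings": NATIVE_CTX,
    }
}
print(json.dumps(config, indent=2))
```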
Unsloth AI presents performance benchmarks for Qwen3.6-35B-A3B GGUF quantizations, claiming the lowest mean KL divergence across nearly all tested model sizes. The discussion includes community analysis of SWE-bench Verified performance, where some users noted unexpected discrepancies between Qwen3.5 and Qwen3.6 quantization results on coding tasks.
Key points:
- Unsloth ranks first in 21 of 22 model sizes for mean KL divergence (a toy computation of the metric follows this list).
- Community debate over SWE-bench testing methodology and sample sizes.
- Reported performance variations between different quantization levels (Q4, Q5, Q6, Q8).
- Discussion on system prompt adherence and error rates in coding benchmarks.
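To make the headline metric concrete, here is a toy sketch of mean KL divergence between a full-precision model's next-token distributions and a quantized model's, computed per position and averaged. The arrays are illustrative stand-ins, not the benchmark's actual data.

```python
import numpy as np

def mean_kl(logits_ref: np.ndarray, logits_quant: np.ndarray) -> float:
    """Mean KL(P_ref || P_quant) over token positions.

    Both inputs have shape (positions, vocab). Illustrative only; the
    benchmark's exact pipeline is not described in this summary.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(logits_ref)
    logq = log_softmax(logits_quant)
    kl_per_pos = (np.exp(logp) * (logp - logq)).sum(axis=-1)
    return float(kl_per_pos.mean())

rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 32000))                      # stand-in FP16 logits
quant = ref + rng.normal(scale=0.05, size=ref.shape)   # perturbed "quantized" logits
print(f"mean KL: {mean_kl(ref, quant):.6f}")
```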
A comprehensive technical guide on setting up a high-performance local large language model environment for agentic coding tasks. The author demonstrates how to run a quantized Qwen3.5-27B model on a remote RTX 4090 workstation and access it from a MacBook using Tailscale, integrating the setup with OpenCode and Codex.
Key topics include:
* Step-by-step llama.cpp build configuration for CUDA support.
* Using Tailscale to create a secure network between the client and the GPU machine (a client-side sketch follows this list).
* Optimizing VRAM usage through specific quantization (UD-Q4_K_XL) and context size management.
* Implementing a corrected chat template to prevent tool-calling errors in agentic workflows.
* Performance insights regarding hybrid architectures and KV cache precision.
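As referenced in the list above, once llama-server is reachable over the tailnet, any OpenAI-compatible client can talk to it. A minimal sketch follows; the hostname, port, and model name are hypothetical placeholders, not values from the guide.

```python
# Minimal sketch: querying a remote llama.cpp server over a Tailscale
# network via its OpenAI-compatible endpoint. Hostname, port, and model
# name are hypothetical placeholders, not values from the original guide.
from openai import OpenAI

client = OpenAI(
    base_url="http://rtx4090-box.tailnet.ts.net:8080/v1",  # MagicDNS name (placeholder)
    api_key="sk-local",  # llama-server ignores the key unless --api-key is set
)

resp = client.chat.completions.create(
    model="qwen3.5-27b-ud-q4_k_xl",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
)
print(resp.choices[0].message.content)
```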
The llama.cpp server has introduced support for the Anthropic Messages API, a highly requested feature that allows users to run Claude-compatible clients with locally hosted models. This implementation enables powerful tools like Claude Code to interface directly with local GGUF models by internally converting Anthropic's message format to OpenAI's standard. Key features of this update include full support for chat completions with streaming, advanced tool use through function calling, token counting capabilities, vision support for multimodal models, and extended thinking for reasoning models. This development bridges the gap between proprietary AI ecosystems and local, privacy-focused inference pipelines, providing a seamless experience for developers working with agentic workloads and coding assistants.
Claude Code can then be pointed at the local endpoint through its standard environment variables, such as ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and ANTHROPIC_MODEL.
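Beyond Claude Code, the official Anthropic Python SDK can also target the new endpoint directly. A minimal sketch, assuming a local llama-server on port 8080; the model alias is a placeholder.

```python
# Minimal sketch: using the Anthropic Python SDK against llama.cpp's
# Anthropic Messages API endpoint. The base URL assumes a local
# llama-server on port 8080; the model name is a placeholder alias.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8080",
    api_key="placeholder",  # accepted but unused without server-side auth
)

message = client.messages.create(
    model="local-gguf",  # placeholder alias for the loaded model
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain GGUF in one paragraph."}],
)
print(message.content[0].text)
```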
A technical guide to running lightweight OCR models (LightOnOCR, GLM-OCR, Deepseek-OCR) on low-end hardware using llama.cpp. Includes implementation details for CLI, REST APIs, and performance optimization.
Topics Covered:
- llama.cpp OCR integration
- Low-spec hardware optimization
- CLI & REST API setup (a REST example follows this list)
- Quantization & Prompting
- Hallucination mitigation
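As noted in the list above, llama.cpp's server accepts images through its OpenAI-compatible chat endpoint when a multimodal projector is loaded. Here is a minimal sketch of an OCR request; the file path, port, and server setup are assumptions, not details from the guide.

```python
# Minimal sketch: sending an image to a llama.cpp server for OCR via
# the OpenAI-compatible chat endpoint. Assumes the server was started
# with an OCR model plus its mmproj file; paths and port are placeholders.
import base64
import requests

with open("receipt.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe all text in this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
    "temperature": 0.0,  # low temperature helps curb OCR hallucinations
}

r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])
```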
Bonsai-8B-GGUF-1bit is an end-to-end 1-bit language model designed for high-efficiency deployment using llama.cpp across CUDA, Metal, and CPU architectures. This model provides a massive 14.1x reduction in memory footprint compared to standard FP16, requiring only 1.15 GB of parameter memory. By leveraging the GGUF Q1_0_g128 format, it achieves significant performance boosts, including 6.2x faster throughput on an RTX 4090 and substantially lower energy consumption per token. It is an ideal solution for on-device assistants, mobile applications, and edge robotics where memory, thermal, and power constraints are paramount.
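The quoted footprint is consistent with roughly 1.1-1.2 effective bits per parameter once group scales are included. A back-of-the-envelope check follows; the 16-bit group-scale width is an assumption about the format, not a documented detail.

```python
# Back-of-the-envelope check of the quoted numbers. The 16-bit group
# scale per 128 weights is an assumption about Q1_0_g128, not a
# documented detail of the format.
PARAMS = 8e9
GROUP = 128
bits_per_param = 1 + 16 / GROUP           # 1 payload bit + amortized scale = 1.125
q1_gb = PARAMS * bits_per_param / 8 / 1e9
fp16_gb = PARAMS * 16 / 8 / 1e9

print(f"Q1_0_g128: ~{q1_gb:.2f} GB")          # ~1.13 GB, near the quoted 1.15 GB
print(f"FP16:      ~{fp16_gb:.2f} GB")        # 16 GB
print(f"Reduction: ~{fp16_gb / q1_gb:.1f}x")  # ~14.2x, near the quoted 14.1x
```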
This collection, curated by prism-ml, features a specialized series of 1-bit Bonsai models designed for efficient text generation. The repository includes various model architectures and sizes, such as the 8B, 4B, and 1.7B parameter versions, optimized through extreme quantization. Available in formats like GGUF and MLX-1bit, these models are highly compressed to maximize performance while minimizing the computational footprint. This makes them ideal for running large language model tasks on hardware with limited resources. The collection serves as a hub for exploring the potential of ultra-compact, highly compressed models in the evolving landscape of machine learning and efficient inference.
This Hugging Face page details the Gemma 4 31B-it model, an open-weights multimodal model created by Google DeepMind. Gemma 4 can process both text and image inputs, generating text outputs, with smaller models also supporting audio. It comes in various sizes (E2B, E4B, 26B A4B, and 31B) allowing for deployment on diverse hardware, from phones to servers.
The model boasts a context window of up to 256K tokens and supports over 140 languages. It utilizes dense and Mixture-of-Experts (MoE) architectures, excelling in tasks like text generation, coding, and reasoning. The page provides details on model data, training, ethics, usage, limitations, and best practices, along with code snippets for getting started with Transformers.
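The page's getting-started snippets follow the usual Transformers multimodal pattern; a minimal sketch along those lines is shown below. The checkpoint id is a hypothetical placeholder, not copied from the card.

```python
# Minimal sketch of the standard Transformers multimodal pattern.
# The checkpoint id is a hypothetical placeholder, not taken from
# the model card.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-4-31b-it")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])
```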
This article details benchmarks for Unsloth Dynamic GGUFs of the Qwen3.5 model, including analysis of perplexity, KL divergence, and the MXFP4 format. It covers performance across different bit widths and quant types, highlighting the impact of the importance matrix (imatrix) and the limitations of certain quantization approaches. Full benchmark data is also provided.
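For reference, the perplexity metric used in such benchmarks is the exponential of the mean per-token negative log-likelihood; a minimal sketch of the computation:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood) over scored tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative per-token log-probabilities (natural log), not real data.
print(f"{perplexity([-1.2, -0.4, -2.1, -0.8]):.3f}")  # ~3.080
```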
Announcement that ggml.ai is joining Hugging Face to ensure the long-term sustainability and progress of the ggml/llama.cpp community and Local AI. Highlights continued open-source development, improved user experience, and integration with the Hugging Face transformers library.