SemanticScuttle - klotz.me » Tags: gguf+quantization

Tags: gguf* + quantization*

Unsloth AI presents performance benchmarks for Qwen3.6-35B-A3B GGUF quantizations, claiming state-of-the-art results in mean KL divergence across most model sizes. The discussion includes community analysis regarding SWE-bench Verified performance, where some users noted unexpected discrepancies between Qwen3.5 and Qwen3.6 quantization results during coding tasks.
Key points:
- Unsloth ranks first in 21 of 22 model sizes for mean KL divergence.
- Community debate over SWE-bench testing methodology and sample sizes.
- Reported performance variations between different quantization levels (Q4, Q5, Q6, Q8).
- Discussion on system prompt adherence and error rates in coding benchmarks.
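For readers unfamiliar with the metric, mean KL divergence can be sketched as follows: for each token position, compare the next-token distribution of the full-precision model with that of the quantized model, then average over positions. This is a minimal illustration in plain Python; the logits are made-up toy values, and the function names are hypothetical rather than taken from Unsloth or llama.cpp.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference, q the quantized distribution."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(reference_logits, quantized_logits):
    """Average the per-position KL divergence over all token positions."""
    kls = [
        kl_divergence(softmax(ref), softmax(quant))
        for ref, quant in zip(reference_logits, quantized_logits)
    ]
    return sum(kls) / len(kls)

# Toy logits for two token positions over a three-token vocabulary (illustrative only).
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quantized = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.5]]

print(f"mean KL divergence: {mean_kl(reference, quantized):.6f} nats")
```

A value of zero means the quantized model reproduces the reference distributions exactly; lower is better.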

2026-04-18 Tags: unsloth, qwen3.6, gguf, benchmarks, quantization, swe-bench, llm performance by klotz

Bonsai-8B-GGUF-1bit

Bonsai-8B-GGUF-1bit is an end-to-end 1-bit language model designed for efficient deployment with llama.cpp on CUDA, Metal, and CPU backends. It reduces the memory footprint 14.1x compared with standard FP16, requiring only 1.15 GB of parameter memory. Using the GGUF Q1_0_g128 format, it reports 6.2x faster throughput on an RTX 4090 and substantially lower energy consumption per token. It targets on-device assistants, mobile applications, and edge robotics, where memory, thermal, and power constraints dominate.
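The memory figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes that FP16 uses 16 bits per weight and that the reduction factor applies uniformly to all parameters, which is a simplification; the helper names are hypothetical.

```python
FP16_BITS_PER_WEIGHT = 16  # standard half-precision storage

def implied_fp16_gb(quantized_gb, reduction_factor):
    """FP16 footprint implied by the quantized size and the stated reduction."""
    return quantized_gb * reduction_factor

def implied_bits_per_weight(reduction_factor):
    """Average bits per weight, assuming a uniform reduction from FP16."""
    return FP16_BITS_PER_WEIGHT / reduction_factor

quantized_gb = 1.15  # parameter memory stated in the summary
reduction = 14.1     # reduction factor stated in the summary

fp16_gb = implied_fp16_gb(quantized_gb, reduction)
bpw = implied_bits_per_weight(reduction)

print(f"implied FP16 footprint: about {fp16_gb:.1f} GB")
print(f"implied average storage: about {bpw:.2f} bits per weight")
```

An average slightly above 1 bit per weight is what one would expect if the format stores some per-group metadata alongside the 1-bit weights, although the summary does not spell out the layout.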

2026-04-05 Tags: bonsai-8b, 1-bit language model, gguf, llama.cpp, quantization, prism ml, on-device ai, efficiency by klotz

Bonsai 1-bit LLM

This collection, curated by prism-ml, gathers the 1-bit Bonsai models for text generation. It includes 8B, 4B, and 1.7B parameter versions, each compressed through extreme quantization and published in formats such as GGUF and MLX-1bit. The small memory and compute footprint makes the models suitable for running language model tasks on resource-limited hardware.

2026-04-05 Tags: bonsai, 1-bit models, prism-ml, text generation, gguf, mlx, quantization, llm, machine learning by klotz

Qwen3.5 GGUF Benchmarks

This article details benchmarks for Unsloth Dynamic GGUFs of the Qwen3.5 model, including analysis of perplexity, KL divergence, and MXFP4. It covers performance across different bit widths and quant types, highlighting the impact of Imatrix and the limitations of certain quantization approaches. Full benchmark data is also provided.
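Perplexity, one of the metrics mentioned above, is the exponential of the average negative log-likelihood the model assigns to the actual tokens of a text. A minimal sketch, using made-up per-token probabilities rather than real model output:

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities assigned to the correct next token at four positions.
full_precision = [math.log(p) for p in (0.50, 0.25, 0.80, 0.40)]
quantized = [math.log(p) for p in (0.45, 0.20, 0.75, 0.40)]

print(f"full precision: {perplexity(full_precision):.3f}")
print(f"quantized:      {perplexity(quantized):.3f}")
```

Lower perplexity is better. Because perplexity only looks at the probability of the correct token, it can hide shifts elsewhere in the distribution, which is one reason KL divergence against the full-precision model is often reported alongside it.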

2026-03-01 Tags: qwen3.5, gguf, benchmarks, quantization, perplexity, kl divergence, mxfp4, imatrix, llm, inference, dynamic quantization, unsloth by klotz

Unsloth Dynamic GGUFs on Aider Polyglot

This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showing that LLMs such as DeepSeek-V3.1 can be quantized to as low as 1-bit while still outperforming models like GPT-4.5 and Claude-4-Opus. It also covers the benchmark setup, comparisons with other quantization methods, and chat template bug fixes.

2025-10-13 Tags: unsloth, gguf, aider polyglot, llm, quantization, deepseek-v3.1, gpt-4, claude-4, model compression, fine-tuning, inference by klotz

DeepSeek-R1-0528-Qwen3-8B-GGUF

This page details DeepSeek-R1-0528-Qwen3-8B, a Qwen3-8B model distilled from DeepSeek-R1-0528, highlighting its improved reasoning capabilities, evaluation results, usage guidelines, and licensing information. It offers several GGUF quantization options for local execution.

2025-05-30 Tags: deepseek-r1, qwen3, gguf, llm, quantization, reasoning, text generation, transformers, model card, mcp, huggingface by klotz

Server approved! 4xH100 (320gb vram). Looking for advice

A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment with RKE2, Nvidia GPU Operator, and tools like vLLM, llama.cpp, and Litellm. They are also exploring the option of GPU pass-through with a hypervisor. The post details their current infrastructure and asks for potential gotchas or best practices.

2025-04-28 Tags: h100, kubernetes, vllm, llama.cpp, gpu, ai, deployment, rke2, litellm, quantization, sxm, fp8, awq, gguf, production engineering, inference engineering, scale, reddit, localllama by klotz

GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU

This article explains how to quantize a Large Language Model (LLM) accurately and convert it to the GGUF format for efficient CPU inference. It covers the use of an importance matrix (imatrix) together with the K-Quantization method, using Gemma 2 Instruct as an example, and notes that the same steps apply to other models such as Qwen2, Llama 3, and Phi-3.
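The idea behind an importance matrix can be illustrated with a toy example: when picking the scale for a block of weights, weight the rounding error of each value by how much that value matters (for instance, as estimated from calibration activations). The sketch below is a simplified illustration under that assumption; it is not llama.cpp's actual K-quant algorithm, and all names and numbers are hypothetical.

```python
def quantize_block(weights, scale, qmax=7):
    """Round each weight to a signed 4-bit integer level for the given scale."""
    return [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]

def weighted_error(weights, importance, scale, qmax=7):
    """Importance-weighted squared reconstruction error for one block."""
    levels = quantize_block(weights, scale, qmax)
    return sum(
        imp * (w - q * scale) ** 2
        for w, q, imp in zip(weights, levels, importance)
    )

def best_scale(weights, importance, qmax=7, steps=50):
    """Search scales from 0.5x to 1.5x of the max-abs scale, keeping the best."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return 1.0
    base = max_abs / qmax
    candidates = [base * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, importance, s, qmax))

# Toy block of weights and two importance profiles (illustrative values only).
block = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
uniform = [1.0] * len(block)
skewed = [10.0, 1.0, 0.1, 10.0, 1.0, 0.1, 1.0, 10.0]  # small weights matter most

scale_uniform = best_scale(block, uniform)
scale_skewed = best_scale(block, skewed)

print("error under skewed importance, uniform-tuned scale:",
      weighted_error(block, skewed, scale_uniform))
print("error under skewed importance, importance-tuned scale:",
      weighted_error(block, skewed, scale_skewed))
```

Tuning the scale against the importance profile can never do worse on that profile than tuning without it, since both searches draw from the same candidate set.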

2024-09-14 Tags: gguf, quantization, llm, cpu, inference, imatrix by klotz

Artifacts Quantized LLM Inference Performance Results on 70b+ Models

This document contains the quantized LLM inference performance results on 70b+ models.

2024-06-23 Tags: artifacts, quantization, llm, gguf by klotz

Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)

Exploring Pre-Quantized Large Language Models

2023-11-15 Tags: llm, quantization, gguf by klotz
