Tags: gguf* + quantization*


  1. Unsloth AI presents performance benchmarks for Qwen3.6-35B-A3B GGUF quantizations, claiming state-of-the-art results in mean KL divergence across most model sizes. The discussion includes community analysis regarding SWE-bench Verified performance, where some users noted unexpected discrepancies between Qwen3.5 and Qwen3.6 quantization results during coding tasks.
    Key points:
    - Unsloth ranks first in 21 of 22 model sizes for mean KL divergence.
    - Community debate over SWE-bench testing methodology and sample sizes.
    - Reported performance variations between different quantization levels (Q4, Q5, Q6, Q8).
    - Discussion on system prompt adherence and error rates in coding benchmarks.
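A mean-KL-divergence ranking like the one above is typically computed by comparing the per-token output distributions of the full-precision and quantized models. A minimal sketch (the logit values are hypothetical; real evaluations run both models over a shared corpus):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; P is the full-precision reference distribution."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(ref_logits_seq, quant_logits_seq):
    """Average per-token KL between reference and quantized model outputs."""
    kls = [kl_divergence(softmax(r), softmax(q))
           for r, q in zip(ref_logits_seq, quant_logits_seq)]
    return sum(kls) / len(kls)

# Hypothetical per-token logits from an FP16 reference and a quantized model.
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2]]
qnt = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.3]]
print(mean_kl(ref, qnt))  # 0.0 only if the distributions match exactly
```

Lower mean KL means the quantized model's next-token distribution stays closer to the full-precision model's, which is why it is used as a quantization-quality metric independent of any downstream benchmark.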
  2. Bonsai-8B-GGUF-1bit is an end-to-end 1-bit language model designed for high-efficiency deployment using llama.cpp across CUDA, Metal, and CPU architectures. This model provides a massive 14.1x reduction in memory footprint compared to standard FP16, requiring only 1.15 GB of parameter memory. By leveraging the GGUF Q1_0_g128 format, it achieves significant performance boosts, including 6.2x faster throughput on an RTX 4090 and substantially lower energy consumption per token. It is an ideal solution for on-device assistants, mobile applications, and edge robotics where memory, thermal, and power constraints are paramount.
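The memory figures quoted in the entry above follow directly from parameter count and bit width. A minimal sketch (assuming 8B parameters; the quoted 1.15 GB exceeds the pure 1-bit number because per-group scales add overhead):

```python
def param_memory_gb(n_params, bits_per_param):
    """Raw parameter storage in GB (1 GB = 2**30 bytes), ignoring
    per-block scales and activation/KV-cache memory."""
    return n_params * bits_per_param / 8 / 2**30

n = 8_000_000_000                # assumed 8B parameters
fp16 = param_memory_gb(n, 16)    # ~14.9 GB
one_bit = param_memory_gb(n, 1)  # ~0.93 GB
print(fp16, one_bit, fp16 / one_bit)  # exactly 16x before scale overhead
```

Group scales (e.g. one per 128 weights in a g128 layout) push the effective bits per parameter slightly above 1, which is consistent with the reported 14.1x reduction rather than a full 16x.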
  3. This collection, curated by prism-ml, features a specialized series of 1-bit Bonsai models designed for efficient text generation. The repository includes various model architectures and sizes, such as the 8B, 4B, and 1.7B parameter versions, optimized through extreme quantization. Available in formats like GGUF and MLX-1bit, these models are highly compressed to maximize performance while minimizing the computational footprint. This makes them ideal for running large language model tasks on hardware with limited resources. The collection serves as a hub for exploring the potential of ultra-compact, highly compressed models in the evolving landscape of machine learning and efficient inference.
  4. This article details benchmarks for Unsloth Dynamic GGUFs of the Qwen3.5 model, including analysis of perplexity, KL divergence, and the MXFP4 format. It covers performance across different bit widths and quant types, highlighting the impact of the importance matrix (imatrix) and the limitations of certain quantization approaches. Full benchmark data is also provided.
  5. This article details the performance of Unsloth Dynamic GGUFs on the Aider Polyglot benchmark, showcasing how the method can quantize LLMs such as DeepSeek-V3.1 to as low as 1-bit while still outperforming models like GPT-4.5 and Claude-4-Opus. It also covers the benchmark setup, comparisons to other quantization methods, and chat template bug fixes.
  6. This page details the DeepSeek-R1-0528-Qwen3-8B model, a distillation of DeepSeek-R1-0528 onto Qwen3-8B, highlighting its improved reasoning capabilities, evaluation results, usage guidelines, and licensing information. It offers various quantization options (GGUF) for local execution.
  7. A user is seeking advice on deploying a new server with 4x H100 GPUs (320GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment with RKE2, the NVIDIA GPU Operator, and tools like vLLM, llama.cpp, and LiteLLM. They are also exploring the option of GPU pass-through with a hypervisor. The post details their current infrastructure and asks for potential gotchas and best practices.
  8. This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) and K-Quantization method with Gemma 2 Instruct as an example, while highlighting its applicability to other models like Qwen2, Llama 3, and Phi-3.
    2024-09-14 by klotz
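The core idea behind imatrix-guided quantization described in the entry above is to weight quantization error by each weight's measured importance when choosing block scales. A toy sketch of that principle (a symmetric round-to-nearest quantizer with a weighted scale search; this illustrates the idea only and is not llama.cpp's actual K-quant implementation):

```python
import numpy as np

def quantize_block(w, importance, bits=4, n_scales=64):
    """Toy symmetric blockwise quantizer: search candidate scales and
    keep the one minimizing importance-weighted squared error."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    base = np.abs(w).max() / qmax        # plain absmax scale
    best = None
    for f in np.linspace(0.5, 1.0, n_scales):  # shrunk-scale candidates
        s = base * f
        if s == 0:
            return np.zeros_like(w), 0.0
        q = np.clip(np.round(w / s), -qmax - 1, qmax)
        err = np.sum(importance * (w - q * s) ** 2)
        if best is None or err < best[0]:
            best = (err, q, s)
    _, q, s = best
    return q * s, s

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)      # one 128-weight block
imp = rng.uniform(0.1, 1.0, 128).astype(np.float32)  # hypothetical imatrix weights
w_hat, scale = quantize_block(w, imp)
print(float(np.mean((w - w_hat) ** 2)))  # small reconstruction error
```

Because the scale is chosen to minimize the importance-weighted error, weights that activations rely on most are reconstructed more faithfully, which is why imatrix-based quants tend to beat naive round-to-nearest at the same bit width.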
  9. This document contains quantized LLM inference performance results for 70B+ models.
    2024-06-23 by klotz
  10. Exploring Pre-Quantized Large Language Models
    2023-11-15 by klotz


SemanticScuttle - klotz.me: tagged with "gguf+quantization"
