klotz: gguf* + quantization*


  1. This article presents Aider Polyglot benchmark results for Unsloth Dynamic GGUFs, showing that models like DeepSeek-V3.1 can be quantized to as low as 1-bit while still outperforming GPT-4.5 and Claude-4-Opus on the benchmark. It also covers the benchmark setup, comparisons with other quantization methods, and chat template bug fixes.
  2. This page documents DeepSeek-R1-0528-Qwen3-8B, an 8B model distilled from DeepSeek-R1-0528, covering its improved reasoning capabilities, evaluation results, usage guidelines, and licensing. It offers GGUF files at several quantization levels for local execution; a minimal loading sketch appears after this list.
  3. A user asks for advice on deploying a new server with 4x H100 GPUs (320 GB VRAM) for on-premise AI workloads. They are weighing a Kubernetes-based deployment using RKE2, the NVIDIA GPU Operator, and tools such as vLLM, llama.cpp, and LiteLLM against GPU pass-through under a hypervisor. The post details their current infrastructure and asks for potential gotchas and best practices; a vLLM serving sketch for this kind of setup follows the list.
  4. This article explains how to accurately quantize a large language model (LLM) and convert it to the GGUF format for efficient CPU inference, using an importance matrix (imatrix) together with K-quantization. Gemma 2 Instruct is the worked example, but the method applies equally to other models such as Qwen2, Llama 3, and Phi-3; see the workflow sketch after this list.
    2024-09-14 by klotz
  5. This document collects quantized-LLM inference performance results for 70B+ models.
    2024-06-23 by klotz
  6. Exploring Pre-Quantized Large Language Models
    2023-11-15 by klotz
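
A minimal sketch of running one of the GGUF quantizations from entry 2 locally with llama-cpp-python. The repo id, filename, quantization level, and settings are illustrative assumptions, not taken from the page.

    # Download an assumed GGUF quantization and run it locally.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    model_path = hf_hub_download(
        repo_id="unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF",   # assumed repo id
        filename="DeepSeek-R1-0528-Qwen3-8B-Q4_K_M.gguf",   # assumed filename
    )

    llm = Llama(
        model_path=model_path,
        n_ctx=8192,       # context window; raise if RAM allows
        n_gpu_layers=-1,  # offload all layers to the GPU if one is present
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize GGUF quantization."}]
    )
    print(out["choices"][0]["message"]["content"])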
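
For the 4x H100 deployment question in entry 3, this is a minimal sketch of the vLLM side: serving a single model with tensor parallelism across the four GPUs. The model name and settings are assumptions for illustration, not from the post.

    # Shard one model across 4 GPUs with vLLM's offline engine.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # assumed model
        tensor_parallel_size=4,                     # one shard per H100
        gpu_memory_utilization=0.90,
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(
        ["What are common gotchas in on-prem GPU serving?"], params
    )
    print(outputs[0].outputs[0].text)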
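
The imatrix + K-quantization workflow from entry 4 can be driven from Python via the llama.cpp command-line tools, roughly as below. The model paths, output names, and calibration corpus are assumptions; the tool names (convert_hf_to_gguf.py, llama-imatrix, llama-quantize) are the stock llama.cpp utilities.

    import subprocess

    # 1. Convert the Hugging Face checkpoint to an FP16 GGUF.
    subprocess.run(
        ["python", "convert_hf_to_gguf.py", "models/gemma-2-9b-it",
         "--outfile", "gemma-2-9b-it-f16.gguf", "--outtype", "f16"],
        check=True,
    )

    # 2. Compute an importance matrix from a calibration text file.
    subprocess.run(
        ["./llama-imatrix", "-m", "gemma-2-9b-it-f16.gguf",
         "-f", "calibration.txt",  # assumed calibration corpus
         "-o", "gemma-2-9b-it.imatrix"],
        check=True,
    )

    # 3. K-quantize, weighting important columns using the imatrix.
    subprocess.run(
        ["./llama-quantize", "--imatrix", "gemma-2-9b-it.imatrix",
         "gemma-2-9b-it-f16.gguf", "gemma-2-9b-it-Q4_K_M.gguf", "Q4_K_M"],
        check=True,
    )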


