klotz: quantization


  1. Meta AI has released quantized versions of the Llama 3.2 models (1B and 3B), which speed up inference by 2-4x and reduce model size by 56%, making advanced AI technology accessible to a wider range of users.

    2024-10-26 by klotz
  2. This article discusses Neural Magic's extensive evaluation of quantized large language models (LLMs), which found that quantized LLMs maintain accuracy and efficiency competitive with their full-precision counterparts.

    • Quantization Schemes: Three different quantization schemes were tested: W8A8-INT, W8A8-FP, and W4A16-INT, each optimized for different hardware and deployment scenarios.
    • Accuracy Recovery: The quantized models demonstrated high accuracy recovery, often reaching over 99%, across a range of benchmarks, including OpenLLM Leaderboard v1 and v2, Arena-Hard, and HumanEval.
    • Text Similarity: Text generated by quantized models was found to be highly similar to that generated by full-precision models, maintaining semantic and structural consistency.
    2025-02-27 by klotz
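The scheme names above encode the bit widths used for weights (W) and activations (A): W8A8-INT means 8-bit integer weights and activations, while W4A16-INT pairs 4-bit integer weights with 16-bit activations. A minimal, stdlib-only sketch of the core operation, symmetric round-to-nearest INT8 weight quantization (the helper names and values here are illustrative, not Neural Magic's implementation):

```python
def quantize_int8(weights):
    """Map floats to int8 using a single per-tensor scale (symmetric)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [x * scale for x in q]

weights = [0.02, -0.5, 1.27, -1.27]
q, scale = quantize_int8(weights)       # q = [2, -50, 127, -127]
restored = dequantize(q, scale)         # close to the original weights
```

Per-tensor scaling like this is the simplest variant; production schemes typically use per-channel or per-group scales to limit rounding error.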
  3. A guide on how to download, convert, quantize, and use Llama 3.1 8B model with llama.cpp on a Mac.

    2024-09-28 by klotz
  4. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods, including GPTQ, AWQ, SmoothQuant, and FP8, on models ranging from 7B to 405B parameters. A key finding: quantizing a larger LLM down to roughly the size of a smaller FP16 LLM generally performs better across most benchmarks, except on hallucination detection and instruction following.

    2024-09-22 by klotz
  5. This article explains how to accurately quantize a large language model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers using an importance matrix (imatrix) with the K-Quantization method, taking Gemma 2 Instruct as the example, and notes that the process applies equally to other models such as Qwen2, Llama 3, and Phi-3.

    2024-09-14 by klotz
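K-Quantization, in broad strokes, splits a weight tensor into small blocks and stores a separate scale per block, so an outlier in one block does not degrade precision everywhere else. A hedged sketch of that block-wise idea (the block size, bit width, and function name are illustrative; llama.cpp's actual K-quant formats use a more elaborate super-block layout):

```python
def quantize_blockwise(weights, block_size=32, bits=4):
    """Per-block symmetric quantization: each block stores low-bit
    integers plus one float scale, so scales adapt to local ranges."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = (max(abs(w) for w in block) / qmax) or 1.0  # guard all-zero block
        blocks.append(([round(w / scale) for w in block], scale))
    return blocks

# Two blocks with very different magnitudes keep independent scales,
# so the small-valued block is not crushed by the large one.
data = [0.01, -0.02, 0.03, 0.07] + [1.0, -2.0, 3.0, -7.0]
blocks = quantize_blockwise(data, block_size=4, bits=4)
```

The imatrix refinement mentioned above goes one step further, weighting quantization error by how much each weight matters to model outputs.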
  6. Introducing sqlite-vec, a new SQLite extension for vector search written entirely in C. The stable release can be installed in several ways, runs on a wide range of platforms, is fast, and supports quantization techniques for efficient storage and search.

  7. A Ruby script that calculates VRAM requirements for large language models (LLMs) from the model, bits per weight (bpw), and context length. It can solve for the required VRAM, the maximum context length, or the best bpw for a given amount of available VRAM.
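The estimate such a script makes boils down to weight bytes plus KV-cache bytes. A back-of-envelope sketch (the function, shape constants, and formula are illustrative assumptions, not the script's actual code; it ignores grouped-query attention and runtime overhead, which change the KV term):

```python
def estimate_vram_gib(n_params, bpw, n_ctx, n_layers, d_model, kv_bits=16):
    """Rough VRAM estimate: quantized weights plus KV cache, in GiB."""
    weight_bytes = n_params * bpw / 8                        # bits/weight -> bytes
    # KV cache: one K and one V tensor per layer, per context token
    kv_bytes = 2 * n_layers * n_ctx * d_model * kv_bits / 8
    return (weight_bytes + kv_bytes) / 1024**3

# e.g. an 8B-parameter model at 4.5 bpw with an 8k context, using
# Llama-3-8B-like shape assumptions (32 layers, d_model 4096)
needed = estimate_vram_gib(8e9, 4.5, 8192, 32, 4096)
```

Solving the same equation for n_ctx or bpw gives the script's other two modes: maximum context for a VRAM budget, or best bpw for a fixed context.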

  8. This article explores the concept of quantization in large language models (LLMs) and its benefits, including reducing memory usage and improving performance. It also discusses various quantization methods and their effects on model quality.

    2024-07-14 by klotz
  9. An explanation of the quant names used in the llama.cpp implementation, along with information on the different quantization schemes available.

    2024-06-23 by klotz
  10. This document contains quantized LLM inference performance results for 70B+ models.

    2024-06-23 by klotz

SemanticScuttle - klotz.me: Tags: quantization

About - Propulsed by SemanticScuttle