SemanticScuttle - klotz.me » Tags: quantization

Tags: quantization*

TIL: Quantize and use Llama 3.1 with llama.cpp on a Mac

A guide to downloading, converting, quantizing, and running the Llama 3.1 8B model with llama.cpp on a Mac.

2024-09-28 Tags: llama.cpp, quantization, llm, howto by klotz

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

This paper evaluates the performance of instruction-tuned LLMs across various quantization methods, including GPTQ, AWQ, SmoothQuant, and FP8, on models ranging from 7B to 405B. A key finding is that a larger LLM quantized down to roughly the size of a smaller FP16 LLM generally outperforms the smaller model on most benchmarks, with hallucination detection and instruction following as the exceptions.

2024-09-22 Tags: quantization, llm by klotz

GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU

This article explains how to accurately quantize a Large Language Model (LLM) and convert it to the GGUF format for efficient CPU inference. It covers the use of an importance matrix (imatrix) and the K-Quantization method, with Gemma 2 Instruct as the example, and notes that the same approach applies to other models such as Qwen2, Llama 3, and Phi-3.

2024-09-14 Tags: gguf, quantization, llm, cpu, inference, imatrix by klotz

Introducing sqlite-vec v0.1.0: a vector search SQLite extension that runs everywhere

An announcement of sqlite-vec, a new SQLite extension for vector search written entirely in C. This is a stable release that can be installed in multiple ways. It runs on various platforms, is fast, and supports quantization techniques for efficient storage and search.

2024-08-04 Tags: sqlite-vec, sqlite, vector search, embeddings, quantization, database, ann, knn, machine learning by klotz

[Script] Calculate VRAM requirements for LLM models

A Ruby script that calculates VRAM requirements for large language models (LLMs) based on the model, bits per weight (bpw), and context length. It can determine the required VRAM, the maximum context length, or the best bpw for a given amount of available VRAM.

2024-08-01 Tags: llm, vram, script, quantization, model size, context length, cli, linux, ruby, reddit, foss by klotz
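
As a rough illustration of the kind of arithmetic such a calculator performs, the Python sketch below estimates VRAM from parameter count, bits per weight, and context length. It is not the bookmarked Ruby script: the function names, the fixed overhead value, and the Llama-3.1-8B-like shape (32 layers, 8 KV heads, head dimension 128, about 8e9 parameters) are assumptions chosen for illustration only.

```python
GIB = 2 ** 30  # bytes per GiB


def weights_gib(n_params: float, bpw: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bpw / 8 / GIB


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes per token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache memory for a full context window, in GiB."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return per_token * context_len / GIB


def estimate_vram_gib(n_params: float, bpw: float, n_layers: int,
                      n_kv_heads: int, head_dim: int, context_len: int,
                      overhead_gib: float = 0.5) -> float:
    """Weights + KV cache + a fixed overhead (the overhead is an assumed value)."""
    return (weights_gib(n_params, bpw)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len)
            + overhead_gib)


def max_context_len(vram_gib: float, n_params: float, bpw: float,
                    n_layers: int, n_kv_heads: int, head_dim: int,
                    overhead_gib: float = 0.5) -> int:
    """Largest context that fits once weights and overhead are subtracted."""
    available = vram_gib - weights_gib(n_params, bpw) - overhead_gib
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim)
    return max(0, int(available * GIB // per_token))


if __name__ == "__main__":
    # Assumed shape values, roughly matching an 8B Llama-style model.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    need = estimate_vram_gib(8e9, 4.0, context_len=8192, **shape)
    print(f"Estimated VRAM at 4.0 bpw, 8192 context: {need:.2f} GiB")
    print("Max context in 8 GiB:", max_context_len(8.0, 8e9, 4.0, **shape))
```

The estimate ignores activation buffers and runtime-specific allocations, so real usage will differ; the point is only that weight memory scales with bpw while KV-cache memory scales with context length.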

Honey, I shrunk the LLM! A beginner's guide to quantization

This article explores the concept of quantization in large language models (LLMs) and its benefits, including reducing memory usage and improving performance. It also discusses various quantization methods and their effects on model quality.

2024-07-14 Tags: llm, quantization, gpu, benchmark by klotz
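
To make the core idea concrete, here is a minimal Python sketch of symmetric "absmax" quantization, one of the simplest schemes: floats are mapped to small integers with a single scale factor, then mapped back with some rounding error. This is an illustrative assumption about what basic quantization looks like, not the specific methods covered in the bookmarked article; the function names are hypothetical.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.12, -0.97, 0.45, 0.03, -0.31]
    for bits in (8, 4):
        q, scale = quantize_absmax(weights, bits=bits)
        restored = dequantize(q, scale)
        worst = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"{bits}-bit: ints={q} max abs error={worst:.4f}")
```

Fewer bits mean less memory per weight but a coarser grid, which is the memory-versus-quality trade-off the entry above refers to.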

llama.cpp quant names

An explanation of the quant names used in the llama.cpp implementation, as well as information on the different types of quant schemes available.

2024-06-23 Tags: llama.cpp, quantization, llm by klotz

Artifacts Quantized LLM Inference Performance Results on 70b+ Models

This document contains the quantized LLM inference performance results on 70b+ models.

2024-06-23 Tags: artifacts, quantization, llm, gguf by klotz

Fine-Tuning LLM Models Course | freeCodeCamp.org

This article announces a comprehensive course on fine-tuning large language models (LLMs), offered on the freeCodeCamp.org YouTube channel. The course, developed by Krish Naik, covers topics such as QLoRA, LoRA, quantization with Llama 2, gradient, and the Google Gemma model, among others. It aims to help learners deepen their understanding of machine learning and artificial intelligence.

2024-05-24 Tags: freecodecamp, course, fine-tuning, llm, qlora, lora, quantization, llama2 by klotz

mobiusml/hqq: Official implementation of Half-Quadratic Quantization (HQQ)

HQQ is a fast and accurate model quantizer that skips the need for calibration data. It's super simple to implement (just a few lines of code for the optimizer). It can crunch through quantizing the Llama2-70B model in only 4 minutes!

2024-02-24 Tags: llm, hqq, quantization, github by klotz
