Tags: llama.cpp* + quantization*

  1. A user seeks advice on deploying a new server with 4x H100 GPUs (320 GB VRAM) for on-premise AI workloads. They are considering a Kubernetes-based deployment built on RKE2, the NVIDIA GPU Operator, and tools such as vLLM, llama.cpp, and LiteLLM, and are also exploring GPU pass-through under a hypervisor. The post details their current infrastructure and asks for potential gotchas and best practices.
  2. This document details how to run Qwen models locally using the Text Generation Web UI (oobabooga), covering installation, setup, and launching the web interface.
  3. A guide on how to download, convert, quantize, and run the Llama 3.1 8B model with llama.cpp on a Mac; the workflow is sketched below.
    2024-09-28 by klotz
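
    A minimal sketch of the pipeline such guides describe, assuming a built llama.cpp checkout in the working directory. The tool names (convert_hf_to_gguf.py, llama-quantize, llama-cli) and the model paths are assumptions based on recent llama.cpp releases; older checkouts used convert-hf-to-gguf.py and quantize instead:

        import subprocess

        # Placeholder path to a downloaded Hugging Face checkpoint.
        model_dir = "Meta-Llama-3.1-8B-Instruct"
        f16_gguf = "llama-3.1-8b-f16.gguf"
        q4_gguf = "llama-3.1-8b-Q4_K_M.gguf"

        # 1. Convert the checkpoint to a 16-bit GGUF file.
        subprocess.run(["python", "convert_hf_to_gguf.py", model_dir,
                        "--outfile", f16_gguf, "--outtype", "f16"], check=True)

        # 2. Quantize to a 4-bit k-quant (Q4_K_M, a common size/quality trade-off).
        subprocess.run(["./llama-quantize", f16_gguf, q4_gguf, "Q4_K_M"], check=True)

        # 3. Smoke-test the quantized model with a short prompt.
        subprocess.run(["./llama-cli", "-m", q4_gguf, "-p", "Hello"], check=True)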
  4. An explanation of the quant names used in the llama.cpp implementation, as well as information on the different quantization schemes available; the "type-0"/"type-1" distinction those schemes build on is sketched below.
    2024-06-23 by klotz
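
    The "type-0" and "type-1" labels used in the next entry refer to ggml's two dequantization forms: type-0 reconstructs a weight from a block scale alone, while type-1 adds a per-block minimum. A minimal sketch (function names are illustrative, not ggml's):

        def dequant_type0(q, d):
            # "type-0": weight = scale * quant
            return d * q

        def dequant_type1(q, d, m):
            # "type-1": weight = scale * quant + block minimum
            return d * q + m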
  5. Explanation of the new k-quant methods; a worked bits-per-weight check follows this entry.
    The new methods available are:

    GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
    GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
    GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
    GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
    GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
    GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference from the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
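
    The bpw figures above follow from the stated block layouts: a super-block covers 256 weights, each weight costs its quant bits, each block adds its quantized scale (and, for type-1, min) bits, and the super-block carries fp16 scale metadata. A quick arithmetic check (the super-block metadata sizes are inferred from the quoted figures, not read from the ggml structs):

        # Bits per weight for a 256-weight super-block.
        def bpw(weight_bits, blocks, scale_bits, min_bits=0, super_meta_bits=16):
            data = 256 * weight_bits                  # quantized weights
            meta = blocks * (scale_bits + min_bits)   # per-block scales (+ mins)
            return (data + meta + super_meta_bits) / 256

        print(bpw(2, 16, 4, min_bits=4))                     # Q2_K -> 2.5625
        print(bpw(3, 16, 6))                                 # Q3_K -> 3.4375
        print(bpw(4, 8, 6, min_bits=6, super_meta_bits=32))  # Q4_K -> 4.5
        print(bpw(5, 8, 6, min_bits=6, super_meta_bits=32))  # Q5_K -> 5.5
        print(bpw(6, 16, 8))                                 # Q6_K -> 6.5625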
  6. 2023-06-06 by klotz
