SemanticScuttle - klotz.me » klotz: quantization

klotz: quantization*

mobiusml/hqq: Official implementation of Half-Quadratic Quantization (HQQ)

HQQ is a fast and accurate model quantizer that skips the need for calibration data. It's super simple to implement (just a few lines of code for the optimizer). It can crunch through quantizing the Llama2-70B model in only 4 minutes!
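As a point of reference for what "no calibration data" means, here is a minimal sketch of plain round-to-nearest affine quantization over one group of weights. This is not the HQQ optimizer and not the mobiusml/hqq API; the function names `quantize_group` and `dequantize_group` are hypothetical. It only illustrates that a quantizer can work from the weights alone, without sample inputs.

```python
# Baseline sketch: round-to-nearest affine quantization of one weight group.
# NOT the HQQ algorithm; it needs only the weights, no calibration data.

def quantize_group(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a min/max range."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Avoid a zero scale when every weight in the group is identical.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + zero for c in codes]


group = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, scale, zero = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale, zero)
max_error = max(abs(w - r) for w, r in zip(group, restored))
```

For values inside the group's min/max range, the reconstruction error of this scheme is bounded by half the step size `scale`.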

2024-02-24 Tags: llm.hqq, quantization, github by klotz

LoneStriker/Everyone-Coder-4x7b-Base-5.0bpw-h6-exl2 · Hugging Face

Not Mixtral MoE but Merge-kit MoE

The EveryoneLLM series is a new family of Mixtral-type models created using experts that were fine-tuned by the community, for the community. This is the first model released in the series, and it is a coding-specific model. EveryoneLLM, which will be a more generalized model, will be released in the near future, after more work is done to fine-tune the process of merging Mistral models into a larger Mixtral-type model with greater success.

The goal of the EveryoneLLM series of models is to be a replacement for, or an alternative to, Mixtral-8x7b that is more suitable for general and specific use, as well as easier to fine-tune. Since Mistralai is being secretive about the "secret sauce" that makes Mixtral-Instruct such an effective fine-tune of the Mixtral-base model, I've decided it's time for the community to directly compete with Mistralai on our own.

2024-02-09 Tags: llm, huggingface, everyone, coder, mistral, moe, mixtral, quantization, lonestriker by klotz

Perfecting Merge-kit MoE's - Google Docs

Not Mixtral MoE but Merge-kit MoE

- What makes a perfect MoE: The secret formula
- Why is a proper merge considered a base model, and how do we distinguish them from a FrankenMoE?
- Why the community working together to improve as a whole is the only way we will get Mixtral right

2024-02-09 Tags: llm, everyone, coder, mistral, moe, frankenmoe, mixtral, quantization, lonestriker by klotz

Techniques to Improve Memory and Computational Efficiency of Large Language Models

Covers how to improve the memory and computational efficiency of Large Language Models (LLMs) when handling long input sequences, in tasks such as retrieval-augmented question answering, summarization, and chat. The techniques include lower-precision computing, the Flash Attention algorithm, positional embedding methods, and key-value caching strategies. These methods reduce memory consumption and increase inference speed while maintaining high accuracy in LLM applications. It also highlights more advanced approaches such as Multi-Query-Attention (MQA) and Grouped-Query-Attention (GQA), which improve computational and memory efficiency further without compromising performance.
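To make the key-value cache and MQA/GQA points concrete, here is a rough sketch of the usual cache-size arithmetic: two tensors (keys and values) per layer, each holding one vector of size `head_dim` per key-value head per token. The model shape used below (32 layers, 32 query heads, head dimension 128, 4096 tokens, 16-bit values) is an illustrative assumption, not taken from the linked article, and `kv_cache_bytes` is a hypothetical helper name.

```python
# Sketch of key-value cache sizing under an assumed, illustrative model shape.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor 2: one tensor for keys plus one for values.
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value


LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096  # assumed values

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)            # 8 shared KV heads (assumed)
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)            # a single shared KV head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in proportion to the number of key-value heads, which is where MQA and GQA get their memory savings.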

2024-01-30 Tags: llm, quantization, flash attention, position embeddings, key-value cache, multi-query-attention, grouped-query-attention, performance, optimization by klotz

GitHub unsloth: 5X faster 60% less memory QLoRA finetuning

2024-01-29 Tags: llm, qlora, fine tuning, github, quantization by klotz

How to fine-tune an open-source LLaMa using QLoRa

2024-01-28 Tags: llm, qlora, llama, fine tuning, quantization by klotz

The Most Simple Way to Set Up ChatGPT Locally

2024-01-18 Tags: llm, quantization, llama.cpp, self-hosted, tutorial by klotz

Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ)

Exploring Pre-Quantized Large Language Models

2023-11-15 Tags: llm, quantization, gguf by klotz

ggml : LocalLLaMA

2023-06-09 Tags: ggml, llm, model, quantization, reddit, foss, self-hosted by klotz

TheBloke/Wizard-Vicuna-30B-Uncensored-GGML · Hugging Face

Explanation of the new k-quant methods
The new methods available are:

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
- GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference from the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
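The bits-per-weight figures listed above can be checked with simple arithmetic: quantized weights, plus per-block scales (and mins for "type-1"), plus some number of 16-bit fields stored once per super-block. The count of 16-bit super-block fields is not stated in the description, so it is an assumption here, chosen so that the result matches the listed figure; `bits_per_weight` is a hypothetical helper name.

```python
# Sketch: reproduce the listed bpw figures from the super-block layout.
# The number of 16-bit super-block fields is an ASSUMPTION, not from the text.

def bits_per_weight(quant_bits, blocks, weights_per_block,
                    scale_bits, has_mins, fp16_fields):
    """Average storage per weight for one super-block, in bits."""
    n_weights = blocks * weights_per_block
    weight_bits = quant_bits * n_weights
    # "type-1" formats store a scale and a min per block, "type-0" only a scale.
    block_meta_bits = blocks * scale_bits * (2 if has_mins else 1)
    superblock_bits = 16 * fp16_fields
    return (weight_bits + block_meta_bits + superblock_bits) / n_weights


q2_k = bits_per_weight(2, 16, 16, 4, has_mins=True, fp16_fields=1)
q3_k = bits_per_weight(3, 16, 16, 6, has_mins=False, fp16_fields=1)
q4_k = bits_per_weight(4, 8, 32, 6, has_mins=True, fp16_fields=2)
q5_k = bits_per_weight(5, 8, 32, 6, has_mins=True, fp16_fields=2)
q6_k = bits_per_weight(6, 16, 16, 8, has_mins=False, fp16_fields=1)
```

Note that the listed 2.5625 bpw for Q2_K is reproduced only when a single 16-bit field is counted; counting two such fields, as for Q4_K and Q5_K, would give 2.625 instead.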

2023-06-08 Tags: huggingface, llama, vicuna, quantization, k-quant, gpu, cpu, acceleration, llama.cpp by klotz
