Tags: quantization* + llm*


  1. Not Mixtral MoE but Merge-kit MoE

    • What makes a perfect MoE: The secret formula
    • Why is a proper merge considered a base model, and how do we distinguish it from a FrankenMoE?
    • Why the community working together to improve as a whole is the only way we will get Mixtral right
  2. Improving the memory and computational efficiency of Large Language Models (LLMs) for handling long input sequences, including retrieval-augmented question answering, summarization, and chat tasks. It covers techniques such as lower-precision computing, the Flash Attention algorithm, positional embedding methods, and key-value caching strategies, which reduce memory consumption and increase inference speed while maintaining accuracy. It also highlights more advanced approaches such as Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which further improve computational and memory efficiency without compromising performance. A minimal sketch of lower-precision loading and key-value caching appears after this list.

  3. 2024-01-29 by klotz
  4. 2024-01-28 by klotz
  5. Exploring Pre-Quantized Large Language Models

    2023-11-15 by klotz
  6. 2023-06-09 by klotz
  7. 2023-06-06 by klotz

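The efficiency techniques summarized in item 2 can be tried out directly. Below is a minimal sketch, assuming the Hugging Face transformers and accelerate libraries and a placeholder model name (any causal LM from the Hub could be substituted), of loading a model in bfloat16 (lower precision) and generating with the key-value cache enabled:

```python
# Minimal sketch: lower-precision loading and key-value caching with
# Hugging Face transformers. The model name is a placeholder, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder; swap in any causal LM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # lower precision roughly halves weight memory vs. float32
    device_map="auto",           # requires `accelerate`; spreads layers across available devices
)

prompt = "Summarize the key ideas behind key-value caching in transformers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# use_cache=True (the default) stores past keys and values, so each new token
# attends over cached states instead of recomputing attention for the full prefix.
outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The key-value cache trades extra memory for speed: past keys and values are kept around so decoding cost per token stays roughly constant instead of growing with the prefix length.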