Zhipu AI has released GLM-4.7-Flash, a 30B-A3B MoE model designed for efficient local coding and agent applications. It offers strong coding and reasoning performance with a 128k token context length and supports English and Chinese.
An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
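Of the mechanisms listed there, grouped-query attention (GQA) is the easiest to show compactly: several query heads share each key/value head, which shrinks the KV cache. A minimal PyTorch sketch with illustrative head counts (not GPT-OSS's actual configuration) and causal masking omitted:

```python
import torch

def grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2):
    # x: (batch, seq, d_model); wq: (d_model, d_model),
    # wk/wv: (d_model, n_kv_heads * head_dim). Head counts are illustrative.
    b, t, d = x.shape
    head_dim = d // n_heads
    group = n_heads // n_kv_heads            # query heads sharing one KV head

    q = (x @ wq).view(b, t, n_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, t, n_kv_heads, head_dim).transpose(1, 2)

    # The memory saving: only n_kv_heads K/V heads are stored in the KV cache;
    # each is then broadcast to `group` query heads for the score computation.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5   # (b, n_heads, t, t)
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, d)

# Usage with toy shapes: 512-dim model, 8 query heads, 2 KV heads of size 64.
x = torch.randn(1, 16, 512)
out = grouped_query_attention(
    x, torch.randn(512, 512), torch.randn(512, 128), torch.randn(512, 128)
)
```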
A user demonstrates how to run a 120B model efficiently on hardware with only 8GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, achieving high performance with minimal VRAM usage.
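The summary does not name the runtime, so the following is only a rough illustration of the same placement idea (expert weights in system RAM, attention and everything else on the GPU) using Hugging Face Transformers and Accelerate; the model ID is a placeholder and the "experts" module-name substring is an assumption that varies by architecture:

```python
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "example/large-moe-model"            # placeholder checkpoint id

# Build the module tree on the meta device so no weights are allocated yet.
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    skeleton = AutoModelForCausalLM.from_config(config)

# Route expert FFN weights to CPU RAM, everything else to GPU 0.
device_map = {}
for name, _ in skeleton.named_parameters():
    module_name = name.rsplit(".", 1)[0]        # drop the trailing ".weight"/".bias"
    device_map[module_name] = "cpu" if "experts" in name else 0

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,                      # accelerate dispatches modules per entry
    torch_dtype=torch.bfloat16,
)
```

The reason this is viable at all is the MoE structure itself: only a few experts fire per token, so the CPU-side expert matmuls stay tolerable, while the attention weights and KV cache are small enough to fit in 8GB of VRAM.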
A detailed comparison of the architectures of recent large language models (LLMs), including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.
1. **DeepSeek V3/R1**:
- Uses Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) for efficiency.
- MLA compresses key and value tensors to reduce KV cache memory usage.
   - MoE activates only a subset of experts per token, improving inference efficiency (a minimal routing sketch follows this list).
2. **OLMo 2**:
- Focuses on transparency in training data and code.
- Uses RMSNorm layers placed after attention and feed-forward modules (Post-Norm).
   - Introduces QK-Norm, an additional RMSNorm layer applied to queries and keys inside the attention mechanism (sketched after this list).
3. **Gemma 3**:
   - Employs sliding window attention to reduce memory requirements in the KV cache (see the masking sketch after this list).
- Uses a 5:1 ratio of sliding window attention to global attention layers.
- Combines Pre-Norm and Post-Norm RMSNorm layers around the attention module.
4. **Mistral Small 3.1**:
- Outperforms Gemma 3 27B on several benchmarks while being faster.
   - Uses a standard architecture with a custom tokenizer, a smaller KV cache, and a reduced layer count.
5. **Llama 4**:
- Adopts an MoE approach similar to DeepSeek V3 but with fewer, larger experts.
- Alternates MoE and dense modules in every other transformer block.
6. **Qwen3**:
- Comes in both dense and MoE variants.
- Dense models are easier to fine-tune and deploy, while MoE models are optimized for scaling inference.
7. **SmolLM3**:
- Uses No Positional Embeddings (NoPE), omitting explicit positional information injection.
- NoPE improves length generalization, meaning performance deteriorates less with increased sequence length.
8. **Kimi K2 and Kimi K2 Thinking**:
    - Trained with a variant of the Muon optimizer instead of AdamW.
- Kimi K2 Thinking extends the context size to 256k tokens.
9. **GPT-OSS**:
- OpenAI's first open-weight models since GPT-2.
    - Uses sliding window attention and trades depth for width (fewer but wider transformer blocks).
10. **Grok 2.5**:
- Uses a small number of large experts and a shared expert module.
- Reflects an older trend in MoE architectures.
11. **GLM-4.5**:
- Comes in two variants: a 355-billion-parameter model and a more compact 106-billion-parameter version.
- Uses a shared expert and starts with several dense layers before introducing MoE blocks.
12. **Qwen3-Next**:
- Introduces a Gated DeltaNet + Gated Attention hybrid mechanism.
- Uses Multi-Token Prediction (MTP) for efficiency.
13. **MiniMax-M2**:
- Uses per-layer QK-Norm and partial RoPE.
- More "sparse" than Qwen3, with fewer active experts per token.
14. **Kimi Linear**:
- Modifies the linear attention mechanism with Kimi Delta Attention (KDA).
- Combines Gated DeltaNet with Multi-Head Latent Attention (MLA).
15. **Olmo 3 Thinking**:
- Uses sliding window attention and YaRN for context extension.
- Comes in base, instruct, and reasoning variants.
16. **DeepSeek V3.2**:
- Adds a sparse attention mechanism to improve efficiency.
- On par with GPT-5.1 and Gemini 3.0 Pro on certain benchmarks.
17. **Mistral 3**:
    - Mistral's first MoE model since Mixtral in 2023.
- Partnered with NVIDIA for optimization on Blackwell chips.
18. **Nemotron 3**:
- A Transformer-Mamba hybrid architecture.
- Interleaves Mamba-2 sequence-modeling blocks with sparse MoE feed-forward layers.
19. **Xiaomi MiMo-V2-Flash**:
- Uses sliding window attention in a 5:1 ratio with global attention.
- Employs multi-token prediction (MTP) for efficiency.
20. **Arcee AI Trinity Large**:
    - Uses alternating local/global attention layers, NoPE, and gated attention.
- Introduces depth-scaled sandwich norm for training stability.
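Several entries above (DeepSeek V3, Llama 4, Qwen3's MoE variants, GLM-4.5, Grok 2.5) build on the same top-k routing step. A minimal sketch with illustrative sizes, leaving out the shared expert and load-balancing losses that production models add:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal MoE feed-forward layer: each token is routed to its top-k experts."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)           # routing probabilities
        weights, idx = gate.topk(self.k, dim=-1)        # top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():                            # only selected tokens visit expert e
                out[rows] += weights[rows, slots, None] * expert(x[rows])
        return out
```

Only `k` of the `n_experts` expert FFNs run per token, which is why total and active parameter counts diverge so sharply in these models (e.g. 30B total versus 3B active in the GLM-4.7-Flash entry above).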
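Sliding window attention, which recurs in Gemma 3, GPT-OSS, Olmo 3, and MiMo-V2-Flash, bounds how far back each query may look, so the KV cache of those layers stops growing with context length. A minimal mask construction with an illustrative window size:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask, True where attention is allowed: causal attention
    restricted to the last `window` key positions for each query."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    return (j <= i) & (j > i - window)

local_mask  = sliding_window_causal_mask(8, window=4)  # a "local" layer
global_mask = sliding_window_causal_mask(8, window=8)  # window >= seq_len == full causal
```

In a 5:1 layout like Gemma 3's, five consecutive layers use the local mask and every sixth layer uses the global one, so long-range information still propagates while most of the KV cache stays bounded.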
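QK-Norm, noted above for OLMo 2 and MiniMax-M2, is just an extra RMSNorm applied to the query and key vectors before the attention scores are computed, which keeps the attention logits from blowing up during training. A minimal sketch, assuming a recent PyTorch (for `torch.nn.RMSNorm`) and illustrative dimensions:

```python
import torch
import torch.nn as nn

class QKNormScores(nn.Module):
    """Fragment of an attention block: RMSNorm on queries and keys
    before the scaled dot product (illustrative head_dim)."""

    def __init__(self, head_dim: int = 64):
        super().__init__()
        self.q_norm = nn.RMSNorm(head_dim)   # learned scale over the head dimension
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, q, k):                 # q, k: (batch, heads, seq, head_dim)
        q = self.q_norm(q)
        k = self.k_norm(k)
        return (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
```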
Not Mixtral MoE but Merge-kit MoE
The EveryoneLLM series is a new Mixtral-type family of models built from experts that were fine-tuned by the community, for the community. This is the first release in the series, and it is a coding-specific model. EveryoneLLM, a more generalized model, will follow once the process of merging Mistral models into larger Mixtral-style models has been refined further.
The goal of the EveryoneLLM series is to be a replacement for, or an alternative to, Mixtral-8x7b that is better suited to both general and specialized use and easier to fine-tune. Since Mistralai is being secretive about the "secret sauce" that makes Mixtral-Instruct such an effective fine-tune of the Mixtral base model, I've decided it's time for the community to compete with Mistralai directly on our own.
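A rough sketch of that merging idea, purely illustrative: the experts of one MoE layer are copied from the corresponding feed-forward blocks of separately fine-tuned dense Mistral models, while attention and norms stay shared from a base model and the router starts untrained. The `mlp` attribute name follows the Hugging Face Mistral layout but is otherwise an assumption, and real tooling such as mergekit also handles router-initialization strategies, tokenizer alignment, and weight sharding:

```python
import copy
import torch.nn as nn

def donate_experts(donor_layers, hidden_size: int):
    """Assemble the expert set and router for one MoE layer.

    `donor_layers` are the decoder layers (same layer index, same architecture)
    taken from N community fine-tunes of the same dense base model.
    """
    experts = nn.ModuleList(copy.deepcopy(d.mlp) for d in donor_layers)
    gate = nn.Linear(hidden_size, len(experts), bias=False)
    nn.init.normal_(gate.weight, std=0.02)   # the router has no pre-trained weights

    # Attention, norms, and embeddings stay shared from the base model; only the
    # feed-forward path is replaced by this gated expert set. Token routing then
    # works like the top-k MoE sketch after the architecture list above.
    return experts, gate
```

How that fresh router gets initialized and calibrated is one of the open questions behind the "secret formula" discussed below.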
Not Mixtral MoE but Merge-kit MoE
- What makes a perfect MoE: The secret formula
- Why is a proper merge considered a base model, and how do we distinguish it from a FrankenMoE?
- Why the community working together to improve as a whole is the only way we will get Mixtral right