Zhipu AI has released GLM-4.7-Flash, a 30B-A3B MoE model designed for efficient local coding and agent applications. It offers strong coding and reasoning performance with a 128k token context length and supports English and Chinese.
An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
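Of the mechanisms listed there, grouped-query attention (GQA) is the easiest to show compactly: several query heads share each key/value head, which shrinks the KV cache. A minimal PyTorch sketch with illustrative head counts (not GPT-OSS's actual configuration) and causal masking omitted:

```python
import torch

def grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2):
    # x: (batch, seq, d_model); wq: (d_model, d_model),
    # wk/wv: (d_model, n_kv_heads * head_dim). Head counts are illustrative.
    b, t, d = x.shape
    head_dim = d // n_heads
    group = n_heads // n_kv_heads            # query heads sharing one KV head

    q = (x @ wq).view(b, t, n_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, t, n_kv_heads, head_dim).transpose(1, 2)

    # The memory saving: only n_kv_heads K/V heads are stored in the KV cache;
    # each is then broadcast to `group` query heads for the score computation.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5   # (b, n_heads, t, t)
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, d)

# Usage with toy shapes: 512-dim model, 8 query heads, 2 KV heads of size 64.
x = torch.randn(1, 16, 512)
out = grouped_query_attention(
    x, torch.randn(512, 512), torch.randn(512, 128), torch.randn(512, 128)
)
```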
A user demonstrates how to run a 120B model efficiently on hardware with only 8GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, achieving high performance with minimal VRAM usage.
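The summary does not name the runtime, so the following is only a rough illustration of the same placement idea (expert weights in system RAM, attention and everything else on the GPU) using Hugging Face Transformers and Accelerate; the model ID is a placeholder and the "experts" module-name substring is an assumption that varies by architecture:

```python
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "example/large-moe-model"            # placeholder checkpoint id

# Build the module tree on the meta device so no weights are allocated yet.
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    skeleton = AutoModelForCausalLM.from_config(config)

# Route expert FFN weights to CPU RAM, everything else to GPU 0.
device_map = {}
for name, _ in skeleton.named_parameters():
    module_name = name.rsplit(".", 1)[0]        # drop the trailing ".weight"/".bias"
    device_map[module_name] = "cpu" if "experts" in name else 0

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,                      # accelerate dispatches modules per entry
    torch_dtype=torch.bfloat16,
)
```

The reason this is viable at all is the MoE structure itself: only a few experts fire per token, so the CPU-side expert matmuls stay tolerable, while the attention weights and KV cache are small enough to fit in 8GB of VRAM.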
A detailed comparison of the architectures of recent large language models (LLMs), including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.
1. **DeepSeek V3/R1**:
- Uses Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) for efficiency.
- MLA compresses key and value tensors to reduce KV cache memory usage.
   - MoE activates only a subset of experts per token, improving inference efficiency (a minimal routing sketch follows this list).
2. **OLMo 2**:
- Focuses on transparency in training data and code.
- Uses RMSNorm layers placed after attention and feed-forward modules (Post-Norm).
   - Introduces QK-Norm, an additional RMSNorm layer applied to queries and keys inside the attention mechanism (sketched after this list).
3. **Gemma 3**:
   - Employs sliding window attention to reduce memory requirements in the KV cache (see the masking sketch after this list).
- Uses a 5:1 ratio of sliding window attention to global attention layers.
- Combines Pre-Norm and Post-Norm RMSNorm layers around the attention module.
4. **Mistral Small 3.1**:
- Outperforms Gemma 3 27B on several benchmarks while being faster.
   - Uses a standard architecture with a custom tokenizer, a smaller KV cache, and a reduced layer count.
5. **Llama 4**:
- Adopts an MoE approach similar to DeepSeek V3 but with fewer, larger experts.
- Alternates MoE and dense modules in every other transformer block.
6. **Qwen3**:
- Comes in both dense and MoE variants.
- Dense models are easier to fine-tune and deploy, while MoE models are optimized for scaling inference.
7. **SmolLM3**:
- Uses No Positional Embeddings (NoPE), omitting explicit positional information injection.
- NoPE improves length generalization, meaning performance deteriorates less with increased sequence length.
8. **Kimi K2 and Kimi K2 Thinking**:
    - Trained with a variant of the Muon optimizer instead of AdamW.
- Kimi K2 Thinking extends the context size to 256k tokens.
9. **GPT-OSS**:
- OpenAI's first open-weight models since GPT-2.
    - Uses sliding window attention and trades depth for width (fewer but wider transformer blocks).
10. **Grok 2.5**:
- Uses a small number of large experts and a shared expert module.
- Reflects an older trend in MoE architectures.
11. **GLM-4.5**:
- Comes in two variants: a 355-billion-parameter model and a more compact 106-billion-parameter version.
- Uses a shared expert and starts with several dense layers before introducing MoE blocks.
12. **Qwen3-Next**:
- Introduces a Gated DeltaNet + Gated Attention hybrid mechanism.
- Uses Multi-Token Prediction (MTP) for efficiency.
13. **MiniMax-M2**:
- Uses per-layer QK-Norm and partial RoPE.
- More "sparse" than Qwen3, with fewer active experts per token.
14. **Kimi Linear**:
- Modifies the linear attention mechanism with Kimi Delta Attention (KDA).
- Combines Gated DeltaNet with Multi-Head Latent Attention (MLA).
15. **Olmo 3 Thinking**:
- Uses sliding window attention and YaRN for context extension.
- Comes in base, instruct, and reasoning variants.
16. **DeepSeek V3.2**:
- Adds a sparse attention mechanism to improve efficiency.
- On par with GPT-5.1 and Gemini 3.0 Pro on certain benchmarks.
17. **Mistral 3**:
    - Mistral's first MoE model since Mixtral in 2023.
- Partnered with NVIDIA for optimization on Blackwell chips.
18. **Nemotron 3**:
- A Transformer-Mamba hybrid architecture.
- Interleaves Mamba-2 sequence-modeling blocks with sparse MoE feed-forward layers.
19. **Xiaomi MiMo-V2-Flash**:
- Uses sliding window attention in a 5:1 ratio with global attention.
- Employs multi-token prediction (MTP) for efficiency.
20. **Arcee AI Trinity Large**:
    - Uses alternating local/global attention layers, NoPE, and gated attention.
- Introduces depth-scaled sandwich norm for training stability.
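Several entries above (DeepSeek V3, Llama 4, Qwen3's MoE variants, GLM-4.5, Grok 2.5) build on the same top-k routing step. A minimal sketch with illustrative sizes, leaving out the shared expert and load-balancing losses that production models add:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal MoE feed-forward layer: each token is routed to its top-k experts."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)           # routing probabilities
        weights, idx = gate.topk(self.k, dim=-1)        # top-k experts per token
        weights = weights / weights.sum(-1, keepdim=True)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():                            # only selected tokens visit expert e
                out[rows] += weights[rows, slots, None] * expert(x[rows])
        return out
```

Only `k` of the `n_experts` expert FFNs run per token, which is why total and active parameter counts diverge so sharply in these models (e.g. 30B total versus 3B active in the GLM-4.7-Flash entry above).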
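Sliding window attention, which recurs in Gemma 3, GPT-OSS, Olmo 3, and MiMo-V2-Flash, bounds how far back each query may look, so the KV cache of those layers stops growing with context length. A minimal mask construction with an illustrative window size:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask, True where attention is allowed: causal attention
    restricted to the last `window` key positions for each query."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    return (j <= i) & (j > i - window)

local_mask  = sliding_window_causal_mask(8, window=4)  # a "local" layer
global_mask = sliding_window_causal_mask(8, window=8)  # window >= seq_len == full causal
```

In a 5:1 layout like Gemma 3's, five consecutive layers use the local mask and every sixth layer uses the global one, so long-range information still propagates while most of the KV cache stays bounded.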
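QK-Norm, noted above for OLMo 2 and MiniMax-M2, is just an extra RMSNorm applied to the query and key vectors before the attention scores are computed, which keeps the attention logits from blowing up during training. A minimal sketch, assuming a recent PyTorch (for `torch.nn.RMSNorm`) and illustrative dimensions:

```python
import torch
import torch.nn as nn

class QKNormScores(nn.Module):
    """Fragment of an attention block: RMSNorm on queries and keys
    before the scaled dot product (illustrative head_dim)."""

    def __init__(self, head_dim: int = 64):
        super().__init__()
        self.q_norm = nn.RMSNorm(head_dim)   # learned scale over the head dimension
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, q, k):                 # q, k: (batch, heads, seq, head_dim)
        q = self.q_norm(q)
        k = self.k_norm(k)
        return (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
```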
Not Mixtral MoE but Merge-kit MoE
The EveryoneLLM series is a new Mixtral-type family of models built from experts that were fine-tuned by the community, for the community. This is the first release in the series, and it is a coding-specific model. EveryoneLLM, a more generalized model, will follow once the process of merging Mistral models into larger Mixtral-style models has been refined further.
The goal of the EveryoneLLM series is to be a replacement for, or an alternative to, Mixtral-8x7b that is better suited to both general and specialized use and easier to fine-tune. Since Mistralai is being secretive about the "secret sauce" that makes Mixtral-Instruct such an effective fine-tune of the Mixtral base model, I've decided it's time for the community to compete with Mistralai directly on our own.
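A rough sketch of that merging idea, purely illustrative: the experts of one MoE layer are copied from the corresponding feed-forward blocks of separately fine-tuned dense Mistral models, while attention and norms stay shared from a base model and the router starts untrained. The `mlp` attribute name follows the Hugging Face Mistral layout but is otherwise an assumption, and real tooling such as mergekit also handles router-initialization strategies, tokenizer alignment, and weight sharding:

```python
import copy
import torch.nn as nn

def donate_experts(donor_layers, hidden_size: int):
    """Assemble the expert set and router for one MoE layer.

    `donor_layers` are the decoder layers (same layer index, same architecture)
    taken from N community fine-tunes of the same dense base model.
    """
    experts = nn.ModuleList(copy.deepcopy(d.mlp) for d in donor_layers)
    gate = nn.Linear(hidden_size, len(experts), bias=False)
    nn.init.normal_(gate.weight, std=0.02)   # the router has no pre-trained weights

    # Attention, norms, and embeddings stay shared from the base model; only the
    # feed-forward path is replaced by this gated expert set. Token routing then
    # works like the top-k MoE sketch after the architecture list above.
    return experts, gate
```

How that fresh router gets initialized and calibrated is one of the open questions behind the "secret formula" discussed below.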
Not Mixtral MoE but Merge-kit MoE
- What makes a perfect MoE: The secret formula
- Why is a proper merge considered a base model, and how do we distinguish it from a FrankenMoE?
- Why the community working together to improve as a whole is the only way we will get Mixtral right