An open-source, theoretical implementation of the Claude Mythos model architecture. The project implements a Recurrent-Depth Transformer (RDT) consisting of three stages: a Prelude, a looped Recurrent Block, and a final Coda. It uses attention that can be switched between Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA), alongside a sparse Mixture of Experts (MoE) design, to support compute-adaptive reasoning in continuous latent space.
Key technical features include:
* Recurrent-Depth Transformer architecture for implicit chain-of-thought reasoning.
* LTI-stable injection parameters to prevent residual explosion during training.
* Support for multiple model scales ranging from 1B to 1T parameters.
* Integration of Adaptive Computation Time (ACT) or similar halting mechanisms to manage overthinking.
* Use of fine-grained MoE with shared experts to balance breadth and depth.
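As a rough illustration of the looped block and the stability idea listed above, here is a minimal sketch; it is not the project's code. It assumes a per-dimension update `h <- a*h + b*e + tanh(h)`, where `a` is squashed into (0, 1) by a sigmoid so the linear part of the recurrence cannot grow. The names `stable_decay`, `run_loops`, `raw_a`, and `b` are hypothetical, and `tanh` stands in for the real recurrent block.

```python
import math


def stable_decay(raw: float) -> float:
    """Map an unconstrained parameter to (0, 1) with a sigmoid (assumed parameterization)."""
    return 1.0 / (1.0 + math.exp(-raw))


def run_loops(e, raw_a, b, n_loops):
    """Apply the same toy block n_loops times, re-injecting the embedding e each loop.

    e        : list of floats, stand-in for the Prelude output
    raw_a, b : hypothetical injection parameters
    """
    a = stable_decay(raw_a)          # 0 < a < 1 keeps the linear part contractive
    h = [0.0] * len(e)               # latent state starts at zero
    for _ in range(n_loops):
        # tanh is a bounded stand-in for the recurrent Transformer block
        h = [a * hi + b * ei + math.tanh(hi) for hi, ei in zip(h, e)]
    return h


if __name__ == "__main__":
    print(run_loops([1.0, -2.0], raw_a=0.0, b=0.1, n_loops=16))
```

Under these assumptions, each component of the state stays within `(|b*e| + 1) / (1 - a)` however many loops are run, which is the kind of guarantee a stable injection is meant to give.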
Alibaba's Qwen team has open-sourced Qwen3.6-35B-A3B, a sparse mixture-of-experts (MoE) model designed for high performance at low computational cost. Although the model has 35 billion total parameters, it activates only about 3 billion per token, which allows it to outperform larger dense models in logical reasoning and programming tasks.
Key highlights:
- Uses MoE architecture to achieve high intelligence with minimal activated parameters.
- Demonstrates exceptional multimodal capabilities, particularly in spatial intelligence and visual perception.
- Competes closely with large-scale models like Gemma4-31B and Claude Sonnet 4.5 in specific metrics.
- Integrated into Qwen Studio and available via Alibaba Cloud BaiLian as qwen3.6-flash.
- Supports advanced features like thinking chain retention and seamless integration with AI programming assistants.
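To make "35 billion total, 3 billion active" concrete, here is a minimal sketch of top-k expert routing. It assumes the common variant in which the router keeps the k highest-scoring experts and applies a softmax over just those; Qwen's actual router may differ. The function names and the toy scalar experts are hypothetical.

```python
import math


def top_k_route(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}   # weights sum to 1


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


def active_fraction(total_params, active_params):
    """Share of the parameters that participate in a single token's forward pass."""
    return active_params / total_params


if __name__ == "__main__":
    experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x + 1]
    print(moe_forward(3.0, experts, [0.1, 2.0, -1.0, 2.0], k=2))
    print(f"{active_fraction(35e9, 3e9):.1%} of parameters active per token")
```

Only the selected experts are evaluated, so compute per token scales with the active parameters rather than the total.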
The article covers the release of Qwen3-Coder-Next, a new 80-billion-parameter open-source large language model (LLM) from Alibaba’s Qwen team. The model is designed for coding tasks and uses an ultra-sparse Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token for efficiency. It has a 262,144-token context window and uses techniques such as Gated DeltaNet and Best-Fit Packing to overcome limitations of traditional LLMs. Qwen3-Coder-Next was trained with an "agentic training" pipeline that learns from real-world coding scenarios and feedback. It supports 370 programming languages and is reported to perform competitively against leading models such as OpenAI’s Codex and Anthropic’s Claude, while also showing strong security features. The release is positioned as a significant advance in open-weight AI and a challenge to proprietary coding models.
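For readers unfamiliar with the packing idea named above, here is a minimal sketch of generic best-fit-decreasing packing of documents into fixed-length training sequences. It is an illustration of the general technique, not the Qwen training pipeline; `best_fit_pack` and its arguments are hypothetical names, and documents are represented only by their token counts.

```python
def best_fit_pack(doc_lengths, max_len):
    """Pack document chunks into sequences of at most max_len tokens.

    Documents longer than max_len are first split into max_len-sized chunks.
    Chunks are then placed, largest first, into the sequence with the least
    remaining room that still fits them (best-fit decreasing).
    """
    chunks = []
    for n in doc_lengths:
        while n > max_len:           # split over-long documents
            chunks.append(max_len)
            n -= max_len
        if n > 0:
            chunks.append(n)

    bins, free = [], []              # packed sequences and their remaining capacity
    for c in sorted(chunks, reverse=True):
        best = None
        for i, f in enumerate(free):
            if f >= c and (best is None or f < free[best]):
                best = i             # tightest sequence that still fits c
        if best is None:
            bins.append([c])         # open a new sequence
            free.append(max_len - c)
        else:
            bins[best].append(c)
            free[best] -= c
    return bins


if __name__ == "__main__":
    print(best_fit_pack([5, 3, 7, 2, 9, 4], max_len=8))
```

Compared with concatenating documents and cutting every `max_len` tokens, this only truncates documents that are themselves longer than one sequence.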
Zhipu AI has released GLM-4.7-Flash, a 30B-A3B MoE model designed for efficient local coding and agent applications. It offers strong coding and reasoning performance with a 128k token context length and supports English and Chinese.
An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
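Two of the attention mechanisms mentioned here, RoPE and GQA, are easy to sketch in isolation. The following is a minimal illustration, assuming the interleaved-pair RoPE convention and the usual base of 10000; real implementations, including GPT-OSS, may pair dimensions differently or rescale frequencies. All function names are hypothetical.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by an angle that depends on position."""
    d = len(x)                                   # head dimension, assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)           # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """GQA: consecutive groups of query heads share one key/value head."""
    return q_head // (n_q_heads // n_kv_heads)


if __name__ == "__main__":
    q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
    # The attention score depends only on the relative offset (2 in both cases).
    print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
    print([kv_head_for(h, 8, 2) for h in range(8)])
```

The first print shows why RoPE encodes relative position: shifting both positions by the same amount leaves the query-key score unchanged. The second shows how GQA shrinks the KV cache by storing fewer key/value heads than query heads.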
A user demonstrates how to run a 120B model on hardware with only 8 GB of VRAM by offloading the MoE layers to the CPU and keeping only the attention layers on the GPU, achieving good performance with minimal VRAM usage.
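The placement rule described above can be sketched as a simple name-based split. This is an illustration only: the tensor names follow a GGUF-style naming pattern as an assumption, the sizes are made up, and `place_tensors` and `bytes_on` are hypothetical helpers rather than part of any inference runtime.

```python
import re

# Assumed naming convention: per-expert feed-forward weights contain "_exps".
EXPERT_PATTERN = re.compile(r"ffn_(up|down|gate)_exps")


def place_tensors(tensor_sizes):
    """Send expert weights to the CPU and everything else to the GPU."""
    placement = {"gpu": [], "cpu": []}
    for name in tensor_sizes:
        device = "cpu" if EXPERT_PATTERN.search(name) else "gpu"
        placement[device].append(name)
    return placement


def bytes_on(device, placement, tensor_sizes):
    """Total size of the tensors assigned to one device."""
    return sum(tensor_sizes[name] for name in placement[device])


if __name__ == "__main__":
    sizes = {  # illustrative sizes in arbitrary units, not measurements
        "blk.0.attn_q.weight": 10,
        "blk.0.attn_k.weight": 5,
        "blk.0.ffn_gate_inp.weight": 1,      # router, stays on the GPU
        "blk.0.ffn_up_exps.weight": 400,
        "blk.0.ffn_down_exps.weight": 400,
    }
    placement = place_tensors(sizes)
    print(bytes_on("gpu", placement, sizes), bytes_on("cpu", placement, sizes))
```

The split works because expert weights make up most of an MoE model's size but only a few experts are read per token, while attention weights are small and used on every token.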
A detailed comparison of the architectures of recent large language models (LLMs) including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.
1. **DeepSeek V3/R1**:
- Uses Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) for efficiency.
- MLA compresses key and value tensors to reduce KV cache memory usage.
- MoE activates only a subset of experts per token, improving inference efficiency.
2. **OLMo 2**:
- Focuses on transparency in training data and code.
- Uses RMSNorm layers placed after attention and feed-forward modules (Post-Norm).
- Introduces QK-Norm, an additional RMSNorm layer applied to queries and keys inside the attention mechanism.
3. **Gemma 3**:
- Employs sliding window attention to reduce memory requirements in the KV cache.
- Uses a 5:1 ratio of sliding window attention to global attention layers.
- Combines Pre-Norm and Post-Norm RMSNorm layers around the attention module.
4. **Mistral Small 3.1**:
- Outperforms Gemma 3 27B on several benchmarks while being faster.
- Uses a standard architecture with a custom tokenizer and reduced KV cache and layer count.
5. **Llama 4**:
- Adopts an MoE approach similar to DeepSeek V3 but with fewer, larger experts.
- Alternates MoE and dense modules, placing an MoE module in every other transformer block.
6. **Qwen3**:
- Comes in both dense and MoE variants.
- Dense models are easier to fine-tune and deploy, while MoE models are optimized for scaling inference.
7. **SmolLM3**:
- Uses No Positional Embeddings (NoPE), omitting explicit positional information injection.
- NoPE improves length generalization, meaning performance deteriorates less with increased sequence length.
8. **Kimi K2 and Kimi K2 Thinking**:
- Kimi K2 is trained with a variant of the Muon optimizer instead of AdamW.
- Kimi K2 Thinking extends the context size to 256k tokens.
9. **GPT-OSS**:
- OpenAI's first open-weight models since GPT-2.
- Uses sliding window attention and a width-versus-depth trade-off.
10. **Grok 2.5**:
- Uses a small number of large experts and a shared expert module.
- Reflects an older trend in MoE architectures.
11. **GLM-4.5**:
- Comes in two variants: a 355-billion-parameter model and a more compact 106-billion-parameter version.
- Uses a shared expert and starts with several dense layers before introducing MoE blocks.
12. **Qwen3-Next**:
- Introduces a Gated DeltaNet + Gated Attention hybrid mechanism.
- Uses Multi-Token Prediction (MTP) for efficiency.
13. **MiniMax-M2**:
- Uses per-layer QK-Norm and partial RoPE.
- More "sparse" than Qwen3, with fewer active experts per token.
14. **Kimi Linear**:
- Modifies the linear attention mechanism with Kimi Delta Attention (KDA).
- Combines Gated DeltaNet with Multi-Head Latent Attention (MLA).
15. **Olmo 3 Thinking**:
- Uses sliding window attention and YaRN for context extension.
- Comes in base, instruct, and reasoning variants.
16. **DeepSeek V3.2**:
- Adds a sparse attention mechanism to improve efficiency.
- On par with GPT-5.1 and Gemini 3.0 Pro on certain benchmarks.
17. **Mistral 3**:
- Mistral's first MoE model since Mixtral in 2023.
- Partnered with NVIDIA for optimization on Blackwell chips.
18. **Nemotron 3**:
- A Transformer-Mamba hybrid architecture.
- Interleaves Mamba-2 sequence-modeling blocks with sparse MoE feed-forward layers.
19. **Xiaomi MiMo-V2-Flash**:
- Uses sliding window attention in a 5:1 ratio with global attention.
- Employs multi-token prediction (MTP) for efficiency.
20. **Arcee AI Trinity Large**:
- Uses alternating local:global attention layers, NoPE, and gated attention.
- Introduces depth-scaled sandwich norm for training stability.
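Several entries in the list above (Gemma 3, MiMo-V2-Flash) mix sliding-window and global attention layers in a 5:1 ratio. The sketch below shows what that means for the layer schedule, the attention mask, and the number of cached positions. The layer count, window size, and sequence length are illustrative assumptions, not figures from any of these models, and the function names are hypothetical.

```python
def layer_types(n_layers, local_per_global=5):
    """Every (local_per_global + 1)-th layer is global; the rest use a sliding window."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


def can_attend(q_pos, k_pos, window=None):
    """Causal mask; with a window, only the most recent `window` positions are visible."""
    if k_pos > q_pos:
        return False
    return window is None or q_pos - k_pos < window


def cached_positions(seq_len, n_layers, window, local_per_global=5):
    """Token positions kept in the KV cache, summed over layers."""
    total = 0
    for kind in layer_types(n_layers, local_per_global):
        total += seq_len if kind == "global" else min(seq_len, window)
    return total


if __name__ == "__main__":
    seq_len, n_layers, window = 8192, 12, 1024   # illustrative values only
    mixed = cached_positions(seq_len, n_layers, window)
    full = seq_len * n_layers
    print(layer_types(n_layers))
    print(mixed, full, f"{mixed / full:.0%} of the all-global cache")
```

With these toy numbers the mixed schedule caches 26,624 positions instead of 98,304, because ten of the twelve layers never need more than the last 1,024 positions.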
Not Mixtral MoE but Merge-kit MoE
The EveryoneLLM series is a new family of Mixtral-type models built from experts that were fine-tuned by the community, for the community. This is the first model released in the series, and it is a coding-specific model. EveryoneLLM, a more generalized model, will be released in the near future, once more work has been done to fine-tune the process of merging Mistral models into larger Mixtral models with greater success.
The goal of the EveryoneLLM series is to be a replacement for, or an alternative to, Mixtral-8x7b that is more suitable for general and specific use, as well as easier to fine-tune. Since Mistral AI is being secretive about the "secret sauce" that makes Mixtral-Instruct such an effective fine-tune of the Mixtral base model, I've decided it's time for the community to compete with Mistral AI directly on our own.
Not Mixtral MoE but Merge-kit MoE
- What makes a perfect MoE: The secret formula
- Why is a proper merge considered a base model, and how do we distinguish them from a FrankenMoE?
- Why the community working together to improve as a whole is the only way we will get Mixtral right