A detailed comparison of the architectures of recent large language models (LLMs), including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.
1. **DeepSeek V3/R1**:
- Uses Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) for efficiency.
- MLA compresses key and value tensors to reduce KV cache memory usage.
- MoE activates only a subset of experts per token, improving inference efficiency.
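As a rough illustration, top-k expert routing can be sketched as follows. All dimensions, weights, and the number of experts are toy stand-ins, not DeepSeek's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2   # toy sizes, not DeepSeek's real config

# Each expert is a feed-forward layer; reduced here to one weight matrix each.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))

def moe_forward(x):
    """Route a single token vector through its top-k experts."""
    logits = x @ router                     # (n_experts,) routing scores
    top = np.argsort(logits)[-top_k:]       # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the selected experts only
    # Only the chosen experts run, which is what makes MoE inference cheap.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```

The key point is that the parameter count grows with `n_experts`, while per-token compute grows only with `top_k`.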
2. **OLMo 2**:
- Focuses on transparency in training data and code.
- Uses RMSNorm layers placed after attention and feed-forward modules (Post-Norm).
- Introduces QK-Norm, an additional RMSNorm layer applied to queries and keys inside the attention mechanism.
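A minimal sketch of QK-Norm, assuming an unscaled RMSNorm and single-head attention with illustrative dimensions:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm over the last axis (learned scale omitted for simplicity)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """Scaled dot-product attention with QK-Norm on queries and keys."""
    q, k = rms_norm(q), rms_norm(k)          # the extra normalization step
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(5, 16)) for _ in range(3))
out = qk_norm_attention(q, k, v)
print(out.shape)  # (5, 16)
```

Normalizing queries and keys bounds the magnitude of the attention logits, which helps stabilize training.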
3. **Gemma 3**:
- Employs sliding window attention to reduce memory requirements in the KV cache.
- Uses a 5:1 ratio of sliding window attention to global attention layers.
- Combines Pre-Norm and Post-Norm RMSNorm layers around the attention module.
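The sliding-window restriction amounts to a band-shaped causal attention mask; a minimal sketch (the window size here is illustrative, not Gemma 3's actual value):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where each token attends only to the last `window` tokens."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Because each sliding-window layer only ever reads the last `window` positions, its KV cache can be capped at `window` entries regardless of sequence length.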
4. **Mistral Small 3.1**:
- Outperforms Gemma 3 27B on several benchmarks while being faster.
- Uses a standard architecture with a custom tokenizer and reduced KV cache and layer count.
5. **Llama 4**:
- Adopts an MoE approach similar to DeepSeek V3 but with fewer, larger experts.
- Alternates MoE and dense modules in every other transformer block.
6. **Qwen3**:
- Comes in both dense and MoE variants.
- Dense models are easier to fine-tune and deploy, while MoE models are optimized for scaling inference.
7. **SmolLM3**:
- Uses No Positional Embeddings (NoPE), omitting explicit positional information injection.
- NoPE improves length generalization, meaning performance deteriorates less with increased sequence length.
8. **Kimi K2 and Kimi K2 Thinking**:
- Trains with a variant of the Muon optimizer instead of AdamW.
- Kimi K2 Thinking extends the context size to 256k tokens.
9. **GPT-OSS**:
- OpenAI's first open-weight models since GPT-2.
- Uses sliding window attention and a width-versus-depth trade-off.
10. **Grok 2.5**:
- Uses a small number of large experts and a shared expert module.
- Reflects an older trend in MoE architectures.
11. **GLM-4.5**:
- Comes in two variants: a 355-billion-parameter model and a more compact 106-billion-parameter version.
- Uses a shared expert and starts with several dense layers before introducing MoE blocks.
12. **Qwen3-Next**:
- Introduces a Gated DeltaNet + Gated Attention hybrid mechanism.
- Uses Multi-Token Prediction (MTP) for efficiency.
13. **MiniMax-M2**:
- Uses per-layer QK-Norm and partial RoPE.
- More "sparse" than Qwen3, with fewer active experts per token.
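Partial RoPE applies rotary position embeddings to only a fraction of each head's dimensions, leaving the rest position-free. A hedged sketch with toy sizes (the rotary fraction and dimensions are illustrative, not MiniMax-M2's actual values):

```python
import numpy as np

def rope(x, positions):
    """Standard RoPE: rotate dimension pairs by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    angles = positions[:, None] * freqs[None, :]          # (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(angles) - x2 * np.sin(angles),
         x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

def partial_rope(x, positions, rotary_frac=0.5):
    """Apply RoPE only to the first `rotary_frac` of the head dimensions."""
    d_rot = int(x.shape[-1] * rotary_frac)
    rotated = rope(x[..., :d_rot], positions)
    return np.concatenate([rotated, x[..., d_rot:]], axis=-1)

rng = np.random.default_rng(2)
q = rng.normal(size=(4, 8))              # (seq_len, head_dim)
out = partial_rope(q, np.arange(4))
print(out.shape)  # (4, 8)
```

At position 0 the rotation angles are zero, so the vector passes through unchanged; the unrotated half is identical at every position.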
14. **Kimi Linear**:
- Modifies the linear attention mechanism with Kimi Delta Attention (KDA).
- Combines Gated DeltaNet with Multi-Head Latent Attention (MLA).
15. **Olmo 3 Thinking**:
- Uses sliding window attention and YaRN for context extension.
- Comes in base, instruct, and reasoning variants.
16. **DeepSeek V3.2**:
- Adds a sparse attention mechanism to improve efficiency.
- On par with GPT-5.1 and Gemini 3.0 Pro on certain benchmarks.
17. **Mistral 3**:
- Mistral's first MoE model since Mixtral in 2023.
- Partnered with NVIDIA for optimization on Blackwell chips.
18. **Nemotron 3**:
- A Transformer-Mamba hybrid architecture.
- Interleaves Mamba-2 sequence-modeling blocks with sparse MoE feed-forward layers.
19. **Xiaomi MiMo-V2-Flash**:
- Uses sliding window attention in a 5:1 ratio with global attention.
- Employs multi-token prediction (MTP) for efficiency.
20. **Arcee AI Trinity Large**:
- Uses alternating local/global attention layers, NoPE, and gated attention.
- Introduces depth-scaled sandwich norm for training stability.
This article demonstrates how to use the attention mechanism in a time series classification framework, specifically for classifying normal sine waves versus 'modified' (flattened) sine waves. It details the data generation, model implementation (using a bidirectional LSTM with attention), and results, achieving high accuracy.
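The attention-pooling step of such a classifier can be sketched as follows, using random stand-ins for the bidirectional LSTM's per-timestep hidden states; names and sizes are illustrative, not the article's exact setup:

```python
import numpy as np

rng = np.random.default_rng(3)

seq_len, hidden = 50, 32                 # toy sizes
# Stand-in for bidirectional LSTM outputs: one hidden state per timestep.
h = rng.normal(size=(seq_len, hidden))
score_vec = rng.normal(size=hidden)      # learned attention scoring vector

scores = h @ score_vec                   # (seq_len,) relevance per timestep
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax attention weights
context = weights @ h                    # weighted sum -> (hidden,) summary

print(context.shape)  # (32,)
```

A linear classifier on `context` then decides normal versus modified; the attention weights additionally show *which* timesteps (e.g. the flattened region) drove the decision.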
This article provides a beginner-friendly explanation of attention mechanisms and transformer models, covering sequence-to-sequence modeling, the limitations of RNNs, the concept of attention, and how transformers address these limitations with self-attention and parallelization.
The attention mechanism in Large Language Models (LLMs) helps derive the meaning of a word from its context. This involves encoding words as multi-dimensional vectors, calculating query and key vectors, and using attention weights to adjust the embedding based on contextual relevance.
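These steps can be sketched with toy dimensions; the projection matrices below are random stand-ins for learned weights:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
embeddings = rng.normal(size=(4, d))     # 4 words as d-dimensional vectors
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

q = embeddings @ W_q                     # query: "what am I looking for?"
k = embeddings @ W_k                     # key:   "what do I offer as context?"
v = embeddings @ W_v                     # value: the content to mix in

scores = q @ k.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # relevance of each word to each other

contextual = embeddings + weights @ v    # adjust each embedding by its context
print(contextual.shape)  # (4, 8)
```

Each row of `weights` sums to one, so every word's embedding is nudged by a convex mixture of the value vectors of its most relevant context words.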
The article delves into how large language models (LLMs) store facts, focusing on the role of multi-layer perceptrons (MLPs) in this process. It explains the mechanics of MLPs, including matrix multiplication, bias addition, and the Rectified Linear Unit (ReLU) function, using the example of encoding the fact that Michael Jordan plays basketball. The article also discusses the concept of superposition, which allows models to store a vast number of features by utilizing nearly perpendicular directions in high-dimensional spaces.
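The MLP mechanics described above (matrix multiplication, bias addition, and ReLU) reduce to a few lines; the weights and sizes below are random toy stand-ins:

```python
import numpy as np

def mlp(x, W_up, b_up, W_down, b_down):
    """One Transformer MLP block: up-projection, bias, ReLU, down-projection."""
    hidden = np.maximum(0.0, x @ W_up + b_up)   # ReLU gates which "features" fire
    return hidden @ W_down + b_down

rng = np.random.default_rng(5)
d_model, d_hidden = 8, 32          # toy sizes; real models use e.g. a 4x expansion
x = rng.normal(size=d_model)       # e.g. the embedding of "Michael Jordan"
out = mlp(x,
          rng.normal(size=(d_model, d_hidden)), rng.normal(size=d_hidden),
          rng.normal(size=(d_hidden, d_model)), rng.normal(size=d_model))
print(out.shape)  # (8,)
```

In the fact-storage picture, a row of `W_up` acts as a detector (does the input point in the "Michael Jordan" direction?) and the corresponding row of `W_down` writes a direction (e.g. "basketball") into the output when that detector fires.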
The article explores the architectural changes that enable DeepSeek's models to perform well with fewer resources, focusing on Multi-Head Latent Attention (MLA). It discusses the evolution of attention mechanisms, from Bahdanau to Transformer's Multi-Head Attention (MHA), and introduces Grouped-Query Attention (GQA) as a solution to MHA's memory inefficiencies. The article highlights DeepSeek's competitive performance despite lower reported training costs.
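The KV-cache saving that motivates GQA is easy to quantify: keys and values are cached per KV head, so sharing each KV head across several query heads shrinks the cache proportionally. The configuration below is a toy example, not any particular model's:

```python
def kv_cache_bytes(n_kv_heads, head_dim, seq_len, n_layers, bytes_per_val=2):
    """KV cache size: keys + values for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Toy config: head_dim 128, 32 layers, 8k context, fp16 (2 bytes per value).
mha = kv_cache_bytes(n_kv_heads=32, head_dim=128, seq_len=8192, n_layers=32)
gqa = kv_cache_bytes(n_kv_heads=8,  head_dim=128, seq_len=8192, n_layers=32)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
```

Going from 32 KV heads (MHA) to 8 (GQA) cuts the cache by 4x; MLA compresses further by caching a low-rank latent instead of full keys and values.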
This article is part of a series titled ‘LLMs from Scratch’, a complete guide to understanding and building Large Language Models (LLMs). In this article, we discuss the self-attention mechanism and how it is used by transformers to create rich and context-aware transformer embeddings.
The Self-Attention mechanism is used to add context to learned embeddings, which are vectors representing each word in the input sequence. The process involves the following steps:
1. Learned Embeddings: These are the initial vector representations of words, learned during the training phase. The weight matrix that stores them is held in the first layer of the Transformer architecture.
2. Positional Encoding: This step adds positional information to the learned embeddings. Transformers process all words in parallel, so without this information the model would lose the order of the words in the input sequence.
3. Self-Attention: The core of the Self-Attention mechanism is to update the learned embeddings with context from the surrounding words in the input sequence. This mechanism determines which words provide context to other words, and this contextual information is used to produce the final contextualized embeddings.
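The three steps above can be sketched end to end; the embeddings below are random stand-ins for learned weights, and the self-attention step omits separate query/key/value projections for brevity:

```python
import numpy as np

def positional_encoding(seq_len, d):
    """Sinusoidal positional encoding from the original Transformer paper."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

def self_attention(x):
    """Minimal self-attention: inputs serve as queries, keys, and values."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(6)
seq_len, d = 6, 16
learned = rng.normal(size=(seq_len, d))         # step 1: learned embeddings
x = learned + positional_encoding(seq_len, d)   # step 2: add positional info
contextual = self_attention(x)                  # step 3: contextualize
print(contextual.shape)  # (6, 16)
```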
Combined with the growing trend toward multimodality (models that combine language, image, and other capabilities), we may see AI models operating more like a committee of different components than a monolithic block. This approach has many conceptual similarities to ideas described by Marvin Minsky and Seymour Papert in the early days of AI.