Tags: kimi* + llama*


  1. A detailed comparison of the architectures of recent large language models (LLMs), including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.

    1. **DeepSeek V3/R1**:
    - Uses Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) for efficiency.
    - MLA compresses key and value tensors to reduce KV cache memory usage.
    - MoE activates only a subset of experts per token, improving inference efficiency.
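
    Below is a minimal sketch of the KV-compression idea behind MLA, assuming a single KV head and ignoring RoPE and the query-side compression used in the real model; the dimension names and sizes are illustrative. The point is that only the small latent tensor needs to live in the KV cache.

    ```python
    import torch
    import torch.nn as nn

    class LatentKVCompression(nn.Module):
        """Toy MLA-style compression: cache a small latent, expand it to K/V on the fly."""
        def __init__(self, d_model=1024, d_latent=128, d_head=64):
            super().__init__()
            self.down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress hidden state
            self.up_k = nn.Linear(d_latent, d_head, bias=False)      # latent -> keys
            self.up_v = nn.Linear(d_latent, d_head, bias=False)      # latent -> values

        def forward(self, hidden):            # hidden: (batch, seq, d_model)
            c_kv = self.down_kv(hidden)       # (batch, seq, d_latent): this is what gets cached
            return c_kv, self.up_k(c_kv), self.up_v(c_kv)

    c_kv, k, v = LatentKVCompression()(torch.randn(2, 16, 1024))
    print(c_kv.shape, k.shape, v.shape)       # cache 128 values per token instead of full K and V
    ```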

    2. **OLMo 2**:
    - Focuses on transparency in training data and code.
    - Uses RMSNorm layers placed after attention and feed-forward modules (Post-Norm).
    - Introduces QK-Norm, an additional RMSNorm layer applied to queries and keys inside the attention mechanism.
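
    A hedged sketch of QK-Norm: an extra RMSNorm applied per head to the query and key projections before the attention scores are computed. The shapes and the hand-rolled RMSNorm are illustrative rather than OLMo 2's actual code.

    ```python
    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(dim))
            self.eps = eps
        def forward(self, x):
            return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    class AttentionWithQKNorm(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
            self.out = nn.Linear(d_model, d_model, bias=False)
            self.q_norm = RMSNorm(self.d_head)   # QK-Norm: normalize queries...
            self.k_norm = RMSNorm(self.d_head)   # ...and keys before computing scores

        def forward(self, x):                    # x: (batch, seq, d_model)
            b, s, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q = self.q_norm(q.view(b, s, self.n_heads, self.d_head)).transpose(1, 2)
            k = self.k_norm(k.view(b, s, self.n_heads, self.d_head)).transpose(1, 2)
            v = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
            y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.out(y.transpose(1, 2).reshape(b, s, -1))

    print(AttentionWithQKNorm()(torch.randn(2, 10, 512)).shape)
    ```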

    3. **Gemma 3**:
    - Employs sliding window attention to reduce memory requirements in the KV cache.
    - Uses a 5:1 ratio of sliding window attention to global attention layers.
    - Combines Pre-Norm and Post-Norm RMSNorm layers around the attention module.
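
    A toy illustration of the two attention ingredients, assuming the 5:1 ratio means five sliding-window layers for every global layer (the exact ordering within each group of six is an assumption): a banded causal mask plus a repeating layer schedule.

    ```python
    import torch

    def sliding_window_causal_mask(seq_len, window):
        """True where attention is allowed: causal, and at most `window - 1` tokens back."""
        i = torch.arange(seq_len).unsqueeze(1)   # query positions
        j = torch.arange(seq_len).unsqueeze(0)   # key positions
        return (j <= i) & (j > i - window)

    # 5 local (sliding-window) layers per 1 global layer, repeated across the stack.
    layer_types = ["global" if (n + 1) % 6 == 0 else "local" for n in range(12)]
    print(layer_types)
    print(sliding_window_causal_mask(seq_len=8, window=4).int())
    ```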

    4. **Mistral Small 3.1**:
    - Outperforms Gemma 3 27B on several benchmarks while being faster.
    - Uses a fairly standard architecture with a custom tokenizer, a smaller KV cache, and fewer layers.

    5. **Llama 4**:
    - Adopts an MoE approach similar to DeepSeek V3 but with fewer, larger experts.
    - Uses MoE in every other transformer block, alternating with dense feed-forward modules.

    6. **Qwen3**:
    - Comes in both dense and MoE variants.
    - Dense models are easier to fine-tune and deploy, while MoE models are optimized for scaling inference.

    7. **SmolLM3**:
    - Uses No Positional Embeddings (NoPE), omitting explicit positional information injection.
    - NoPE improves length generalization, meaning performance deteriorates less with increased sequence length.

    8. **Kimi K2 and Kimi K2 Thinking**:
    - Trained with a variant of the Muon optimizer instead of AdamW.
    - Kimi K2 Thinking extends the context size to 256k tokens.
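
    A heavily simplified sketch of the core Muon idea: accumulate plain momentum for each 2D weight matrix, approximately orthogonalize it with a quintic Newton-Schulz iteration, and apply that as the update. The coefficients follow the public Muon reference code; Nesterov momentum, per-shape scaling, and Kimi's specific modifications are omitted.

    ```python
    import torch

    def newton_schulz_orthogonalize(m, steps=5, eps=1e-7):
        """Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration."""
        a, b, c = 3.4445, -4.7750, 2.0315        # coefficients from the public Muon reference code
        x = m / (m.norm() + eps)                 # normalize so the iteration is stable
        transposed = x.shape[0] > x.shape[1]
        if transposed:                           # iterate on the wide orientation
            x = x.T
        for _ in range(steps):
            s = x @ x.T
            x = a * x + (b * s + c * s @ s) @ x
        return x.T if transposed else x

    def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
        """One simplified Muon update for a single 2D weight matrix (updates tensors in place)."""
        momentum.mul_(beta).add_(grad)           # plain momentum accumulation
        weight.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
        return weight

    w, g = torch.randn(256, 128), torch.randn(256, 128)
    muon_step(w, g, torch.zeros_like(w))
    print(w.shape)
    ```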

    9. **GPT-OSS**:
    - OpenAI's first open-weight models since GPT-2.
    - Uses sliding window attention and favors a wider, shallower design in the width-versus-depth trade-off.

    10. **Grok 2.5**:
    - Uses a small number of large experts and a shared expert module.
    - Reflects an older trend in MoE architectures.
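
    A naive sketch of the shared-expert pattern: one expert processes every token unconditionally, while a router sends each token to its top-k choices among the routed experts. Expert sizes, counts, and the per-token loop are illustrative simplifications (real implementations batch tokens by expert).

    ```python
    import torch
    import torch.nn as nn

    class SharedExpertMoE(nn.Module):
        """Toy MoE layer: one always-on shared expert plus top-k routed experts."""
        def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
            super().__init__()
            ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                        nn.Linear(d_ff, d_model))
            self.shared_expert = ffn()                       # sees every token
            self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.top_k = top_k

        def forward(self, x):                                # x: (n_tokens, d_model)
            weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
            outputs = []
            for t in range(x.shape[0]):                      # per-token loop for clarity only
                y = self.shared_expert(x[t])
                for w, e in zip(weights[t], idx[t]):
                    y = y + w * self.experts[int(e)](x[t])   # only top-k experts run per token
                outputs.append(y)
            return torch.stack(outputs)

    print(SharedExpertMoE()(torch.randn(4, 512)).shape)
    ```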

    11. **GLM-4.5**:
    - Comes in two variants: a 355-billion-parameter model and a more compact 106-billion-parameter version.
    - Uses a shared expert and starts with several dense layers before introducing MoE blocks.

    12. **Qwen3-Next**:
    - Introduces a Gated DeltaNet + Gated Attention hybrid mechanism.
    - Uses Multi-Token Prediction (MTP) for efficiency.
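
    A rough, per-timestep sketch of the gated delta rule underlying Gated DeltaNet-style layers, assuming a single head: a matrix-valued memory is decayed by a gate and corrected toward each new key/value pair, and the query reads it out. The chunked parallel form, output gating, and normalization used in practice are omitted, and the variable names are illustrative.

    ```python
    import torch

    def gated_delta_rule(q, k, v, alpha, beta):
        """Naive recurrent form of a gated delta rule.
        q, k, v: (seq, d); alpha (decay gate) and beta (write strength): (seq,) in (0, 1).
        """
        seq, d = q.shape
        state = torch.zeros(d, d)                     # matrix-valued memory
        outputs = []
        for t in range(seq):
            k_t = k[t] / (k[t].norm() + 1e-6)         # normalized key
            # decay the old memory, then apply a delta-rule correction toward (k_t -> v_t)
            state = (alpha[t] * state @ (torch.eye(d) - beta[t] * torch.outer(k_t, k_t))
                     + beta[t] * torch.outer(v[t], k_t))
            outputs.append(state @ q[t])              # read out with the query
        return torch.stack(outputs)

    seq, d = 8, 16
    out = gated_delta_rule(torch.randn(seq, d), torch.randn(seq, d), torch.randn(seq, d),
                           torch.rand(seq), torch.rand(seq))
    print(out.shape)
    ```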

    13. **MiniMax-M2**:
    - Uses per-layer QK-Norm and partial RoPE.
    - More "sparse" than Qwen3, with fewer active experts per token.

    14. **Kimi Linear**:
    - Modifies the linear attention mechanism with Kimi Delta Attention (KDA).
    - Combines Gated DeltaNet with Multi-Head Latent Attention (MLA).

    15. **Olmo 3 Thinking**:
    - Uses sliding window attention and YaRN for context extension.
    - Comes in base, instruct, and reasoning variants.

    16. **DeepSeek V3.2**:
    - Adds a sparse attention mechanism to improve efficiency.
    - On par with GPT-5.1 and Gemini 3.0 Pro on certain benchmarks.
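
    A toy sketch of the top-k idea behind this kind of sparse attention, for a single head: a cheap scoring pass (a random matrix here stands in for a lightweight learned indexer) picks a handful of earlier tokens per query, and full attention is computed only over those. DeepSeek's actual indexer, head layout, and selection rules are not reproduced.

    ```python
    import torch
    import torch.nn.functional as F

    def topk_sparse_attention(q, k, v, index_scores, k_keep=4):
        """Toy top-k sparse attention for a single head.
        q, k, v: (seq, d); index_scores: (seq, seq) cheap relevance scores per query/key pair.
        Each query attends only to its k_keep highest-scoring earlier positions.
        """
        seq, d = q.shape
        causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
        index_scores = index_scores.masked_fill(~causal, float("-inf"))
        keep = torch.zeros(seq, seq, dtype=torch.bool)
        topk = index_scores.topk(min(k_keep, seq), dim=-1).indices   # selected key positions
        keep.scatter_(1, topk, True)
        keep &= causal                                               # never look ahead
        scores = (q @ k.T) / d**0.5
        scores = scores.masked_fill(~keep, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    seq, d = 16, 32
    q, k, v = (torch.randn(seq, d) for _ in range(3))
    indexer = torch.randn(seq, seq)          # stand-in for a small learned indexer
    print(topk_sparse_attention(q, k, v, indexer).shape)
    ```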

    17. **Mistral 3**:
    - Mistral's first MoE model since Mixtral in 2023.
    - Partnered with NVIDIA for optimization on Blackwell chips.

    18. **Nemotron 3**:
    - A Transformer-Mamba hybrid architecture.
    - Interleaves Mamba-2 sequence-modeling blocks with sparse MoE feed-forward layers.

    19. **Xiaomi MiMo-V2-Flash**:
    - Uses sliding window attention in a 5:1 ratio with global attention.
    - Employs multi-token prediction (MTP) for efficiency.
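
    A minimal sketch of the multi-token-prediction idea: next to the usual next-token head, a small auxiliary module predicts the token two steps ahead, and its loss is added with a down-weighting factor. Real MTP modules (e.g. sequential DeepSeek-V3-style blocks) are more elaborate; the module names and the 0.3 weight are illustrative.

    ```python
    import torch
    import torch.nn as nn

    class TinyMTPHead(nn.Module):
        """Toy multi-token prediction: predict t+1 with the main head and t+2 with an extra head."""
        def __init__(self, d_model=256, vocab_size=1000):
            super().__init__()
            self.next_head = nn.Linear(d_model, vocab_size)    # standard next-token logits
            self.mtp_proj = nn.Linear(d_model, d_model)        # small extra module for t+2
            self.mtp_head = nn.Linear(d_model, vocab_size)

        def forward(self, hidden):                             # hidden: (batch, seq, d_model)
            logits_next = self.next_head(hidden)
            logits_plus2 = self.mtp_head(torch.tanh(self.mtp_proj(hidden)))
            return logits_next, logits_plus2

    def mtp_loss(logits_next, logits_plus2, tokens, mtp_weight=0.3):
        """Cross-entropy on the t+1 targets plus a down-weighted term on the t+2 targets."""
        ce = nn.functional.cross_entropy
        loss1 = ce(logits_next[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())
        loss2 = ce(logits_plus2[:, :-2].flatten(0, 1), tokens[:, 2:].flatten())
        return loss1 + mtp_weight * loss2

    hidden = torch.randn(2, 12, 256)
    tokens = torch.randint(0, 1000, (2, 12))
    print(mtp_loss(*TinyMTPHead()(hidden), tokens))
    ```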

    20. **Arcee AI Trinity Large**:
    - Uses alternating local and global attention layers, NoPE, and gated attention.
    - Introduces depth-scaled sandwich norm for training stability.
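
    One plausible reading of sandwich norm, sketched below for a feed-forward sub-block: normalize the input to the sublayer and normalize its output again before it re-enters the residual stream. The depth-dependent 1/sqrt(2*n_layers) scale is only an illustrative guess at what depth scaling could look like, not Arcee's published formula.

    ```python
    import torch
    import torch.nn as nn

    class SandwichFFNBlock(nn.Module):
        """Toy sandwich-norm sub-block: norm before AND after the sublayer, plus a depth scale."""
        def __init__(self, d_model=256, n_layers=24):
            super().__init__()
            self.pre_norm = nn.LayerNorm(d_model)
            self.post_norm = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
            self.scale = (2.0 * n_layers) ** -0.5    # deeper stacks -> smaller residual updates

        def forward(self, x):
            y = self.ffn(self.pre_norm(x))               # usual pre-norm path
            return x + self.scale * self.post_norm(y)    # second norm + depth scaling on the output

    print(SandwichFFNBlock()(torch.randn(2, 8, 256)).shape)
    ```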
