Tags: moe*


  1. Zhipu AI has released GLM-4.7-Flash, a 30B-A3B MoE model designed for efficient local coding and agent applications. It offers strong coding and reasoning performance with a 128k token context length and supports English and Chinese.
  2. An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
  3. A user demonstrates how to run a 120B MoE model efficiently on hardware with only 8 GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, achieving high performance with minimal VRAM usage (a rough memory estimate is sketched below).
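
    A rough back-of-the-envelope sketch of why this works: only the attention projections and the KV cache need to live in VRAM, while the expert feed-forward weights (the bulk of a 120B MoE model's parameters) stay in system RAM and are streamed per token. The layer count, hidden size, head counts, and byte widths below are illustrative assumptions, not the configuration of any specific model.

    ```python
    GIB = 1024**3

    # Illustrative, assumed model dimensions (not a real config).
    n_layers     = 36      # transformer blocks
    d_model      = 2880    # hidden size
    n_kv_heads   = 8       # grouped-query KV heads
    head_dim     = 64      # per-head dimension
    ctx_len      = 8192    # context length to serve
    bytes_weight = 0.5     # ~4-bit quantized weights
    bytes_kv     = 2       # fp16 KV-cache entries

    # Attention projections (Q, K, V, O) stay on the GPU.
    # 4 * d_model^2 slightly over-counts K/V under GQA, which is fine
    # for an upper-bound estimate.
    attn_vram = n_layers * 4 * d_model * d_model * bytes_weight

    # The KV cache also stays on the GPU: K and V per layer per token.
    kv_vram = 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_kv

    # Expert FFN weights live in CPU RAM, so they cost no VRAM at all.
    print(f"attention weights on GPU: {attn_vram / GIB:.2f} GiB")
    print(f"KV cache on GPU:          {kv_vram / GIB:.2f} GiB")
    print(f"total GPU footprint:      {(attn_vram + kv_vram) / GIB:.2f} GiB")
    ```
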
  4. A detailed comparison of the architectures of recent large language models (LLMs), including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.

    1. **DeepSeek V3/R1**:
    - Uses Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE) for efficiency.
    - MLA compresses the key and value tensors into a small latent vector to reduce KV-cache memory usage (a minimal sketch follows this list).
    - MoE activates only a subset of experts per token, improving inference efficiency (a minimal routing sketch also follows this list).

    2. **OLMo 2**:
    - Focuses on transparency in training data and code.
    - Uses RMSNorm layers placed after attention and feed-forward modules (Post-Norm).
    - Introduces QK-Norm, an additional RMSNorm layer applied to the queries and keys inside the attention mechanism (a small sketch follows this list).

    3. **Gemma 3**:
    - Employs sliding window attention to reduce memory requirements in the KV cache.
    - Uses a 5:1 ratio of sliding-window attention layers to global attention layers (see the layer-schedule sketch after this list).
    - Combines Pre-Norm and Post-Norm RMSNorm layers around the attention module.

    4. **Mistral Small 3.1**:
    - Outperforms Gemma 3 27B on several benchmarks while being faster.
    - Uses a standard architecture with a custom tokenizer and reduced KV cache and layer count.

    5. **Llama 4**:
    - Adopts an MoE approach similar to DeepSeek V3 but with fewer, larger experts.
    - Alternates MoE and dense modules in every other transformer block.

    6. **Qwen3**:
    - Comes in both dense and MoE variants.
    - Dense models are easier to fine-tune and deploy, while MoE models are optimized for scaling inference.

    7. **SmolLM3**:
    - Uses No Positional Embeddings (NoPE), omitting explicit positional information injection.
    - NoPE improves length generalization, meaning performance deteriorates less with increased sequence length.

    8. **Kimi K2 and Kimi K2 Thinking**:
    - Trained with a variant of the Muon optimizer instead of AdamW.
    - Kimi K2 Thinking extends the context size to 256k tokens.

    9. **GPT-OSS**:
    - OpenAI's first open-weight models since GPT-2.
    - Uses sliding window attention and a width-versus-depth trade-off.

    10. **Grok 2.5**:
    - Uses a small number of large experts and a shared expert module.
    - Reflects an older trend in MoE architectures.

    11. **GLM-4.5**:
    - Comes in two variants: a 355-billion-parameter model and a more compact 106-billion-parameter version.
    - Uses a shared expert and starts with several dense layers before introducing MoE blocks.

    12. **Qwen3-Next**:
    - Introduces a Gated DeltaNet + Gated Attention hybrid mechanism.
    - Uses Multi-Token Prediction (MTP) for efficiency.

    13. **MiniMax-M2**:
    - Uses per-layer QK-Norm and partial RoPE.
    - More "sparse" than Qwen3, with fewer active experts per token.

    14. **Kimi Linear**:
    - Modifies the linear attention mechanism with Kimi Delta Attention (KDA).
    - Combines Gated DeltaNet with Multi-Head Latent Attention (MLA).

    15. **Olmo 3 Thinking**:
    - Uses sliding window attention and YaRN for context extension.
    - Comes in base, instruct, and reasoning variants.

    16. **DeepSeek V3.2**:
    - Adds a sparse attention mechanism to improve efficiency.
    - On par with GPT-5.1 and Gemini 3.0 Pro on certain benchmarks.

    17. **Mistral 3**:
    - First MoE model since Mixtral in 2023.
    - Partnered with NVIDIA for optimization on Blackwell chips.

    18. **Nemotron 3**:
    - A Transformer-Mamba hybrid architecture.
    - Interleaves Mamba-2 sequence-modeling blocks with sparse MoE feed-forward layers.

    19. **Xiaomi MiMo-V2-Flash**:
    - Uses sliding window attention in a 5:1 ratio with global attention.
    - Employs multi-token prediction (MTP) for efficiency.

    20. **Arcee AI Trinity Large**:
    - Uses alternating local:global attention layers, NoPE, and gated attention.
    - Introduces depth-scaled sandwich norm for training stability.
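
    To make the recurring MoE idea concrete, here is a minimal, generic top-k routed feed-forward block in PyTorch. It sketches only the routing pattern; it omits the shared expert, sigmoid routing, and load-balancing losses that DeepSeek V3 and several other models above actually use, and all dimensions and expert counts are arbitrary example values.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        """Top-k MoE feed-forward block: a router scores all experts per token,
        but only the k best-scoring experts are executed, which is what keeps
        the active parameter count far below the total parameter count."""

        def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x):                     # x: (n_tokens, d_model)
            scores = self.router(x)               # (n_tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)  # normalize over chosen experts
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                token_ids, slot = (idx == e).nonzero(as_tuple=True)
                if token_ids.numel():             # run expert e on its tokens only
                    out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
            return out

    moe = SparseMoE()
    print(moe(torch.randn(4, 256)).shape)         # torch.Size([4, 256])
    ```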
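
    A similarly simplified sketch of the Multi-Head Latent Attention caching idea: store one small latent vector per token and re-expand it into K and V only when attending. Real MLA additionally carries a decoupled RoPE path and folds the up-projections into the attention computation; the dimensions here are arbitrary assumptions.

    ```python
    import torch
    import torch.nn as nn

    class LatentKVCache(nn.Module):
        """Cache a compressed latent per token instead of full per-head K/V."""

        def __init__(self, d_model=1024, d_latent=128, n_heads=8, head_dim=128):
            super().__init__()
            self.down = nn.Linear(d_model, d_latent, bias=False)             # compress
            self.up_k = nn.Linear(d_latent, n_heads * head_dim, bias=False)  # re-expand
            self.up_v = nn.Linear(d_latent, n_heads * head_dim, bias=False)

        def compress(self, h):      # h: (seq, d_model) -> (seq, d_latent); this is all we cache
            return self.down(h)

        def expand(self, latent):   # rebuild K and V at attention time
            return self.up_k(latent), self.up_v(latent)

    mla = LatentKVCache()
    latent = mla.compress(torch.randn(16, 1024))
    k, v = mla.expand(latent)
    # Per token: 128 cached floats vs. 2 * 8 * 128 = 2048 for an uncompressed KV cache.
    print(latent.shape, k.shape, v.shape)
    ```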
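
    QK-Norm, as used by OLMo 2 (and per layer in MiniMax-M2), is simply an extra RMSNorm applied to the query and key vectors before the attention dot-product, which helps keep attention logits stable during training. A minimal sketch (assumes a PyTorch version that ships nn.RMSNorm; dimensions are arbitrary):

    ```python
    import torch
    import torch.nn as nn

    class QKNormProjection(nn.Module):
        """Q/K projections with an extra per-head RMSNorm (QK-Norm)."""

        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, d_model // n_heads
            self.wq = nn.Linear(d_model, d_model, bias=False)
            self.wk = nn.Linear(d_model, d_model, bias=False)
            self.q_norm = nn.RMSNorm(self.head_dim)  # requires PyTorch >= 2.4
            self.k_norm = nn.RMSNorm(self.head_dim)

        def forward(self, x):                        # x: (seq, d_model)
            q = self.wq(x).view(-1, self.n_heads, self.head_dim)
            k = self.wk(x).view(-1, self.n_heads, self.head_dim)
            return self.q_norm(q), self.k_norm(k)    # normalized per head, pre-RoPE

    q, k = QKNormProjection()(torch.randn(16, 512))
    print(q.shape, k.shape)                          # (16, 8, 64) each
    ```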
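
    Finally, a small sketch of the 5:1 sliding-window-to-global layer schedule mentioned for Gemma 3 and MiMo-V2-Flash, and of why it shrinks the KV cache for long contexts. The layer count, context length, and window size are made-up example values.

    ```python
    def layer_schedule(n_layers: int, ratio: int = 5) -> list:
        """ratio sliding-window layers followed by one global layer, repeated."""
        return ["global" if (i + 1) % (ratio + 1) == 0 else "sliding"
                for i in range(n_layers)]

    def kv_cache_entries(n_layers: int, ctx_len: int, window: int, ratio: int = 5) -> int:
        """Tokens whose K/V must be kept, summed over layers, for one sequence."""
        return sum(ctx_len if kind == "global" else min(window, ctx_len)
                   for kind in layer_schedule(n_layers, ratio))

    print(layer_schedule(12))
    all_global = kv_cache_entries(48, ctx_len=32_768, window=1024, ratio=0)  # every layer global
    mixed_5_1  = kv_cache_entries(48, ctx_len=32_768, window=1024, ratio=5)
    print(f"all-global KV entries: {all_global:,}  vs. 5:1 mixed: {mixed_5_1:,}")
    ```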
  5. Not Mixtral MoE but Merge-kit MoE

    The EveryoneLLM series of models is a new Mixtral-type family created using experts that were fine-tuned by the community, for the community. This is the first model released in the series, and it is a coding-specific model. EveryoneLLM, which will be a more generalized model, will be released in the near future, after more work is done to fine-tune the process of merging Mistral models into larger Mixtral-style models with greater success.

    The goal of the EveryoneLLM series of models is to be a replacement for, or an alternative to, Mixtral-8x7b that is more suitable for general and specific use, as well as easier to fine-tune. Since Mistralai is being secretive about the "secret sauce" that makes Mixtral-Instruct such an effective fine-tune of the Mixtral base model, I've decided it's time for the community to directly compete with Mistralai on our own.
  6. Not Mixtral MoE but Merge-kit MoE

    - What makes a perfect MoE: The secret formula
    - Why is a proper merge considered a base model, and how do we distinguish it from a FrankenMoE?
    - Why the community working together to improve as a whole is the only way we will get Mixtral right
