An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
A user demonstrates how to run a 120B model on hardware with only 8GB of VRAM by offloading the MoE expert layers to the CPU and keeping only the attention layers on the GPU, which keeps VRAM usage minimal while maintaining high performance.
A detailed comparison of the architectures of recent large language models (LLMs), including DeepSeek-V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, and Kimi K2, focusing on key design choices and their impact on performance and efficiency.
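The CPU/GPU split mentioned above for the 120B model (expert layers on the CPU, attention on the GPU) can be pictured as a simple placement rule over tensor names. The following is only a minimal sketch: the tensor names, the `_exps` naming convention, and the sizes are assumptions made up for illustration, and `plan_placement` is a hypothetical helper, not part of any inference library.

```python
import re

# Hypothetical tensor inventory for one transformer block: (name, size in GiB).
# Names and sizes are illustrative assumptions, not measurements of a real model.
TENSORS = [
    ("blk.0.attn_q.weight", 0.05),
    ("blk.0.attn_k.weight", 0.01),
    ("blk.0.attn_v.weight", 0.01),
    ("blk.0.attn_output.weight", 0.05),
    ("blk.0.ffn_gate_inp.weight", 0.001),  # router: small, kept with attention
    ("blk.0.ffn_gate_exps.weight", 0.5),   # expert weights: large
    ("blk.0.ffn_up_exps.weight", 0.5),
    ("blk.0.ffn_down_exps.weight", 0.5),
]

# Assumed convention: expert weight tensors carry an "_exps" suffix.
EXPERT_PATTERN = re.compile(r"\.ffn_(gate|up|down)_exps\.")


def plan_placement(tensors, pattern=EXPERT_PATTERN):
    """Send expert tensors to the CPU and everything else to the GPU."""
    placement = {"gpu": [], "cpu": []}
    for name, size in tensors:
        device = "cpu" if pattern.search(name) else "gpu"
        placement[device].append((name, size))
    return placement


def total_gib(entries):
    """Sum the sizes (GiB) of a list of (name, size) entries."""
    return sum(size for _, size in entries)


plan = plan_placement(TENSORS)
print(f"GPU: {total_gib(plan['gpu']):.3f} GiB")
print(f"CPU: {total_gib(plan['cpu']):.3f} GiB")
```

The reasoning behind this kind of split is that an MoE model activates only a few experts per token, so the expert weights account for most of the memory but little of the per-token compute, whereas the attention weights are used for every token.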
Not Mixtral MoE but Merge-kit MoE
The EveryoneLLM series is a new line of Mixtral-type models built from experts that were fine-tuned by the community, for the community. This is the first model released in the series, and it is a coding-specific model. EveryoneLLM, which will be a more generalized model, will be released in the near future, after more work is done to fine-tune the process of merging Mistral models into larger Mixtral models with greater success.
The goal of the EveryoneLLM series is to be a replacement for, or an alternative to, Mixtral-8x7b that is more suitable for both general and specific use, as well as easier to fine-tune. Since Mistral AI is being secretive about the "secret sauce" that makes Mixtral-Instruct such an effective fine-tune of the Mixtral base model, I've decided it's time for the community to compete with Mistral AI directly on our own.
- What makes a perfect MoE: The secret formula
- Why is a proper merge considered a base model, and how do we distinguish them from a FrankenMoE?
- Why the community working together to improve as a whole is the only way we will get Mixtral right