An in-depth look at the architecture of OpenAI's GPT-OSS models, detailing tokenization, embeddings, transformer blocks, Mixture of Experts, attention mechanisms (GQA and RoPE), and quantization techniques.
The article explores the architectural changes that enable DeepSeek's models to perform well with fewer resources, focusing on Multi-Head Latent Attention (MLA). It discusses the evolution of attention mechanisms, from Bahdanau to Transformer's Multi-Head Attention (MHA), and introduces Grouped-Query Attention (GQA) as a solution to MHA's memory inefficiencies. The article highlights DeepSeek's competitive performance despite lower reported training costs.
novel concepts that Mistral AI added to traditional Transformer architectures and we perform a comparison of inference time between Mistral 7B and Llama 2 7B and a comparison of memory, inference time and response quality between Mixtral 8x7B and LLama 2 70B. RAG systems and a public Amazon dataset with customer reviews.