An open-source, theoretical implementation of the Claude Mythos model architecture. The project implements a Recurrent-Depth Transformer (RDT) with three stages: a Prelude, a looped Recurrent Block, and a final Coda. Attention is switchable between Multi-head Latent Attention (MLA) and Grouped Query Attention (GQA), and the feed-forward layers use a sparse Mixture of Experts (MoE) design. Together these support compute-adaptive reasoning in continuous latent space.
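The three-stage control flow can be illustrated with a minimal sketch. This is not the project's code: `recurrent_depth_forward`, the toy stand-in layers, the zero initial state, and the re-injection of the Prelude output on every loop are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from typing import Callable, List

Vector = List[float]


def recurrent_depth_forward(
    x: Vector,
    prelude: Callable[[Vector], Vector],
    recurrent_block: Callable[[Vector, Vector], Vector],
    coda: Callable[[Vector], float],
    num_loops: int,
) -> float:
    """Hypothetical RDT forward pass: Prelude once, block looped, Coda once."""
    e = prelude(x)              # Prelude: embed the input a single time
    h = [0.0] * len(e)          # initial latent state (assumed to be zeros)
    for _ in range(num_loops):  # same block, same weights, every iteration
        h = recurrent_block(h, e)
    return coda(h)              # Coda: map the final latent state to an output


# Toy stand-ins for the real layers (hypothetical, for illustration only).
def toy_prelude(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def toy_block(h: Vector, e: Vector) -> Vector:
    return [0.5 * hi + ei for hi, ei in zip(h, e)]


def toy_coda(h: Vector) -> float:
    return sum(h)


result = recurrent_depth_forward(
    [1.0, 2.0], toy_prelude, toy_block, toy_coda, num_loops=3
)  # 10.5
```

Raising `num_loops` spends more compute on the same input without adding parameters, which is the sense in which the depth is adaptive.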
Key technical features include:
* Recurrent-Depth Transformer architecture for implicit chain-of-thought reasoning.
* Injection parameters constrained to be stable in the linear time-invariant (LTI) sense, to prevent residual explosion during training.
* Support for multiple model scales ranging from 1B to 1T parameters.
* Adaptive Computation Time (ACT) or a similar halting mechanism to limit overthinking.
* Use of fine-grained MoE with shared experts to balance breadth and depth.
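
Two of the features above, stable injection and halting, can be sketched together on a scalar state. Everything here is an assumption rather than the project's method: the source does not say how stability is parameterized, so this sketch uses a decay of the form `exp(-softplus(raw))`, and the halting rule is a simplified ACT variant with a constant per-step halting probability (in ACT proper, that probability is computed from the state at each step). The names `stable_decay` and `looped_update` are hypothetical.

```python
import math
from typing import Tuple


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_decay(raw: float) -> float:
    # exp(-softplus(raw)) equals 1 / (1 + exp(raw)), which is strictly
    # between 0 and 1 for moderate raw values, so the linear recurrence
    # h <- a*h + b*e contracts instead of growing without bound.
    return math.exp(-softplus(raw))


def looped_update(
    e: float,
    raw_decay: float,
    b: float,
    halt_logit: float,
    max_loops: int = 16,
    eps: float = 0.01,
) -> Tuple[float, int]:
    """Run h <- a*h + b*e with ACT-style halting; return (output, loops used)."""
    a = stable_decay(raw_decay)
    # Simplification: one constant halting probability for every step.
    p = 1.0 / (1.0 + math.exp(-halt_logit))
    h, output, cumulative = 0.0, 0.0, 0.0
    for step in range(1, max_loops + 1):
        h = a * h + b * e  # injected input e is re-added on every loop
        if cumulative + p >= 1.0 - eps or step == max_loops:
            # Final step: the remaining probability mass weights this state.
            output += (1.0 - cumulative) * h
            return output, step
        output += p * h
        cumulative += p
    return output, 0  # only reached when max_loops < 1


out, steps = looped_update(e=2.0, raw_decay=0.0, b=1.0, halt_logit=0.0)
```

A lower `halt_logit` makes each step less likely to halt, so the loop runs longer before the accumulated probability reaches `1 - eps`; `max_loops` caps the cost either way.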