This blog post details how to implement high-performance matrix multiplication using NVIDIA cuTile, focusing on tile loading, computation, and storage, as well as block-level parallel programming. It also covers best practices for tile programming and performance-optimization strategies.
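The load/compute/store tile pattern the post describes can be sketched in plain NumPy. This is a generic illustration of block-decomposed matrix multiplication, not cuTile's actual API; the `TILE` size and `tiled_matmul` name are illustrative choices:

```python
import numpy as np

TILE = 4  # tile edge length (chosen for illustration)

def tiled_matmul(A, B):
    """Block matmul: each output tile accumulates products of input tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and N % TILE == 0 and K % TILE == 0
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, TILE):              # in cuTile-style code, each
        for j in range(0, N, TILE):          # (i, j) pair maps to one block
            acc = np.zeros((TILE, TILE), dtype=C.dtype)  # accumulator tile
            for k in range(0, K, TILE):
                a = A[i:i + TILE, k:k + TILE]   # load a tile of A
                b = B[k:k + TILE, j:j + TILE]   # load a tile of B
                acc += a @ b                    # compute on the tiles
            C[i:i + TILE, j:j + TILE] = acc     # store the result tile
    return C
```

On a GPU, the two outer loops become the block grid and the tiles live in fast on-chip memory; the NumPy version only shows the decomposition, not the performance.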
This blog post explains the causes of nondeterminism in LLM inference, arguing that it's not simply due to floating-point non-associativity and concurrency, but rather a lack of batch invariance in kernels. It details how to achieve batch invariance in RMSNorm, matrix multiplication, and attention, and presents experimental results demonstrating deterministic completions and the benefits for on-policy RL.
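The batch-invariance argument can be made concrete with a small NumPy sketch: a reduction whose split size changes (as it might with batch size in a real kernel) rounds differently and so produces bitwise-different sums. The `chunked_sum` helper and the input values are illustrative, not taken from the post:

```python
import numpy as np

def chunked_sum(x, chunk_size):
    """Sum x by reducing fixed-size chunks, then combining the partials.

    Real kernels pick the split based on load or batch size; different
    splits round differently, so the result is not batch invariant.
    """
    partials = []
    for i in range(0, len(x), chunk_size):
        s = np.float32(0.0)
        for v in x[i:i + chunk_size]:
            s = np.float32(s + v)   # each add rounds to float32
        partials.append(s)
    total = np.float32(0.0)
    for p in partials:
        total = np.float32(total + p)
    return total

# Values chosen so the rounding provably depends on the split:
x = np.array([1e8, 1.0, -1e8, 1.0], dtype=np.float32)
print(chunked_sum(x, 4))  # one chunk (sequential) -> 1.0
print(chunked_sum(x, 2))  # two chunks             -> 0.0
```

Fixing the split strategy so it never depends on batch size makes each element's reduction order, and hence its bit pattern, identical across batch sizes; that is the batch invariance the post builds for RMSNorm, matmul, and attention kernels.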