An Apple study shows that large language models (LLMs) can improve performance by using a checklist-based reinforcement learning scheme, similar to a simple productivity trick of checking one's work.
DeepMind researchers propose a new 'streams' approach to AI development, focusing on experiential learning and autonomous interaction with the world, moving beyond the limitations of current large language models and potentially surpassing human intelligence.
ByteDance Research has released DAPO (Dynamic Sampling Policy Optimization), an open-source reinforcement learning system for LLMs, aiming to improve reasoning abilities and address reproducibility issues. DAPO includes innovations like Clip-Higher, Dynamic Sampling, Token-level Policy Gradient Loss, and Overlong Reward Shaping, achieving a score of 50 on the AIME 2024 benchmark with the Qwen2.5-32B model.