The article introduces test-time scaling, an approach to language modeling that improves performance by spending additional compute at inference time. The authors pair a curated dataset with a technique called budget forcing to control how much compute the model uses, letting it double-check its answers and extend its reasoning. Applied to the Qwen2.5-32B-Instruct language model, the approach yields significant improvements on competition math questions.
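The budget-forcing idea can be illustrated with a short sketch. The `generate_step` helper, the `</think>` delimiter, and the literal "Wait" continuation are illustrative assumptions here, not the paper's exact code: decoding is cut off once a thinking-token budget is exhausted, and if the model tries to stop too early, an appended "Wait" nudges it to keep reasoning and re-check its answer.

```python
# Sketch of budget forcing during decoding (illustrative only; `generate_step`
# is an assumed one-token decoding API, not a real library call).

def generate_with_budget(model, prompt, min_thinking_tokens, max_thinking_tokens):
    """Decode a reasoning trace while enforcing a thinking-token budget."""
    END_THINKING = "</think>"  # assumed delimiter marking the end of reasoning
    trace, n_tokens = prompt, 0

    while n_tokens < max_thinking_tokens:
        token = model.generate_step(trace)   # one decoding step (assumed API)
        if token == END_THINKING:
            if n_tokens < min_thinking_tokens:
                trace += "Wait"              # suppress the early stop: force more reasoning
                continue
            break                            # minimum budget satisfied, let the model stop
        trace += token
        n_tokens += 1

    # Append the end-of-thinking delimiter (budget exhausted or model stopped)
    # so the model moves on to producing its final answer.
    return trace + END_THINKING
```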
The article explores the DeepSeek-R1 models, focusing on how reinforcement learning (RL) is used to develop advanced reasoning capabilities in AI. It discusses the DeepSeek-R1-Zero model, which learns reasoning without supervised fine-tuning, and the DeepSeek-R1 model, which combines RL with a small amount of supervised data for improved performance. The article highlights the use of distillation to transfer reasoning patterns to smaller models and addresses challenges and future directions in RL for AI.
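DeepSeek-R1's RL stage scores a group of sampled answers per prompt and uses group-relative rewards as advantages (GRPO). A minimal sketch of that advantage computation, with a toy rule-based reward as a placeholder rather than DeepSeek's actual setup:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled answer's reward against
    the mean and standard deviation of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example: 4 answers sampled for one math prompt, rewarded 1.0 when the
# final answer is correct (rule-based reward, as described for R1-Zero).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```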
DeepSeek-R1 is a groundbreaking AI model that uses reinforcement learning to teach large language models to reason, outperforming models like OpenAI's o1 at a fraction of the computational cost.
TinyZero is a reproduction of DeepSeek R1 Zero in countdown and multiplication tasks. It is built upon veRL and allows the 3B base LM to develop self-verification and search abilities through reinforcement learning.
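The countdown task lends itself to a simple rule-based reward: the model proposes an arithmetic expression, and a verifier checks that it uses exactly the given numbers and evaluates to the target. A rough sketch of such a verifier (an illustration, not TinyZero's actual reward code):

```python
import ast
from collections import Counter

def countdown_reward(expression, numbers, target):
    """Rule-based reward sketch: 1.0 if the expression uses exactly the
    given numbers and evaluates to the target, else 0.0."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return 0.0

    # Accept only numeric literals and the four basic arithmetic operators.
    allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)
    if not all(isinstance(node, allowed) for node in ast.walk(tree)):
        return 0.0

    used = [node.value for node in ast.walk(tree) if isinstance(node, ast.Constant)]
    if Counter(used) != Counter(numbers):
        return 0.0                       # must use each given number exactly once

    try:
        value = eval(compile(tree, "<expr>", "eval"))
    except ZeroDivisionError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

print(countdown_reward("(25 - 5) * 3", [25, 5, 3], 60))  # 1.0
```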
Hugging Face's initiative to replicate DeepSeek-R1, focusing on developing datasets and sharing training pipelines for reasoning models.
The article introduces Hugging Face's Open-R1 project, a community-driven initiative to reconstruct and expand upon DeepSeek-R1, a cutting-edge reasoning language model. DeepSeek-R1, which emerged as a significant breakthrough, utilizes pure reinforcement learning to enhance a base model's reasoning capabilities without human supervision. However, DeepSeek did not release the datasets, training code, or detailed hyperparameters used to create the model, leaving key aspects of its development opaque.
The Open-R1 project aims to address these gaps by systematically replicating and improving upon DeepSeek-R1's methodology in three main steps: replicating the R1-Distill models by distilling a high-quality reasoning dataset from DeepSeek-R1, reproducing the pure reinforcement learning pipeline behind R1-Zero, and demonstrating the full multi-stage path from a base model through supervised fine-tuning to RL.
This article discusses the process of training a large language model (LLM) with reinforcement learning from human feedback (RLHF) and a newer alternative, Direct Preference Optimization (DPO). It explains how these methods align the LLM with human preferences and why DPO makes the alignment process simpler and more efficient than RLHF.
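At its core, DPO replaces the RL step with a classification-style loss over preference pairs. A minimal sketch of the per-pair loss, assuming the log-probabilities of the chosen and rejected responses under the policy and a frozen reference model have already been computed (the numbers in the example are placeholders):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    logits = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# Example: the policy prefers the chosen response slightly more than the
# reference model does, so the loss dips below log(2) ≈ 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -12.0, beta=0.1))
```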
This article surveys the latest open LLM (large language model) releases, including Mixtral 8x22B, Meta AI's Llama 3, and Microsoft's Phi-3, and compares their performance on the MMLU benchmark. It also covers Apple's OpenELM, an efficient language-model family released with an open-source training and inference framework, and examines the use of PPO and DPO algorithms for instruction finetuning and alignment in LLMs.
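For contrast with the DPO loss above, the PPO objective used in RLHF-style alignment clips the policy ratio to keep each update close to the old policy. A minimal per-token sketch (the log-probabilities and advantage are placeholder inputs, not values from any of the models discussed):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate for a single token/action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)  # pessimistic (clipped) bound

# Example: a large ratio with a positive advantage is clipped at 1.2 * advantage.
print(ppo_clip_objective(logp_new=-0.5, logp_old=-1.5, advantage=1.0))  # 1.2
```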