Researchers from various institutions have created an AI benchmark using NPR's Sunday Puzzle questions to test AI reasoning capabilities. They found that reasoning models like OpenAI’s o1 and DeepSeek’s R1 can struggle with complex puzzles, sometimes even acknowledging when they are wrong. The benchmark aims to assess AI models on problems that require general human knowledge rather than specialized skills.