Task-Specific Performance
On the 'NYT Connections' puzzle benchmark, O3-Mini scores 72.4 against DeepSeek-R1’s 54.4, an 18-point lead and the widest gap of the three comparisons here. On the LiveBench global average, O3-Mini also leads, but far more narrowly: 73.94 versus DeepSeek-R1’s 71.38, a difference of 2.56 points. In mathematics tasks the ranking reverses: DeepSeek-R1 scores 79.54 to O3-Mini’s 65.65, a 13.89-point advantage that indicates stronger numerical reasoning.
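The margins can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `scores` dictionary and the `margin` helper are hypothetical names introduced here, and the numbers are simply the ones quoted in the paragraph above.

```python
# Scores as quoted in the text; the task labels are illustrative keys only.
scores = {
    "NYT Connections": {"O3-Mini": 72.4, "DeepSeek-R1": 54.4},
    "LiveBench global average": {"O3-Mini": 73.94, "DeepSeek-R1": 71.38},
    "Mathematics": {"O3-Mini": 65.65, "DeepSeek-R1": 79.54},
}


def margin(task):
    """Return (leading model, lead in points) for one task.

    The lead is rounded to two decimals to hide floating-point noise.
    """
    ranked = sorted(scores[task].items(), key=lambda kv: kv[1], reverse=True)
    (leader, top), (_, runner_up) = ranked
    return leader, round(top - runner_up, 2)


for task in scores:
    leader, gap = margin(task)
    print(f"{task}: {leader} leads by {gap} points")
```

Comparing the three margins side by side makes the pattern easier to see: a large O3-Mini lead on the puzzle task, a small one on the global average, and a sizeable DeepSeek-R1 lead in mathematics.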