Hey all, our ResearchClawBench leaderboard just updated!
We let AI do real science: 40 tasks across 10 disciplines, compared to human papers. Hard example? Glacier mass change: the AI must integrate 233 datasets from 35 teams and 4 methods, then reproduce 6542±387 Gt ice loss vs IPCC. No toy problems.
Latest leaderboard (2026-06-09):
Agents: 🥇 Claude Code 21.5 (50 = match human), $5.3; 🥈 EvoScientist 18.8, $4.1; 🥉 Codex CLI 18.4, just $2.0
LLMs+Harness: 🥇 Claude-Opus-4.8 21.1, $4.0; 🥈 Claude-Opus-4.7 20.7; 🥉 MiniMax-M3 19.8, only $0.45; Qwen3.7-Max 18.7, $0.42, 11min
Claude still king, but MiniMax/Qwen/DeepSeek are crazy cheap and competitive. Expensive isn't always better.
Code & star: https://github.com/InternScience/ResearchClawBench
Website: https://internscience.github.io/ResearchClawBench-Home/
Upvote paper: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research (2606.07591)