Process Reward Models that Think -- https://arxiv.org/abs/2504.16828
AI & ML interests
Factuality, reasoning, alignment, LLM applications
spaces 7
LudoBench 🎲 (Running): Multimodal Game Reasoning Benchmark [ICLR 2026]
Answer Convergence Early Stopping 🛑 (Sleeping): Demo for EMNLP Paper "Answer Convergence as a Signal..."
FactRBench 🏆 (Runtime error): View and analyze long-form factuality leaderboard
ExpertLongBench 🚀 (Running): Leaderboard for ExpertLongBench
ManyICLBench 🚀 (Sleeping): Leaderboard for ManyICLBench
MLRC-BENCH 📊 (Running): Display model performance rankings
datasets 13
launch/LudoBench
Viewer • Updated • 638 • 15
launch/ExpertLongBench
Preview • Updated • 630 • 10
launch/thinkprm-1K-verification-cots
Viewer • Updated • 1k • 34 • 6
launch/ManyICLBench
Viewer • Updated • 66 • 581 • 1
launch/CMV
Viewer • Updated • 133 • 31
launch/FactRBench
Viewer • Updated • 1.06k • 84 • 1
launch/FactBench
Viewer • Updated • 1k • 74 • 3
launch/CLASH
Viewer • Updated • 345 • 47 • 4
launch/gov_report
Viewer • Updated • 58.4k • 315 • 11
launch/gov_report_qs
Viewer • Updated • 7.87k • 85 • 4