Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
Abstract
A framework for Scientific General Intelligence (SGI) is presented, state-of-the-art LLMs are evaluated with SGI-Bench, and hypothesis novelty is improved with Test-Time Reinforcement Learning, highlighting gaps in existing models' scientific capabilities.
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10–20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
Community
Scientific General Intelligence (SGI) is defined as an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry—Deliberation, Conception, Action, and Perception—with the versatility and proficiency of a human scientist. SGI-Bench operationalizes this definition via four scientist-aligned task families: deep research, idea generation, dry/wet experiments, and multimodal experimental reasoning. The benchmark spans 10 disciplines and more than 1,000 expert-curated samples inspired by Science's 125 Big Questions.

Grounded in the Practical Inquiry Model (PIM), SGI-Bench views science, at a high level, as an iterative cycle that links deliberation, conception, action, and perception. Under this lens, Scientific General Intelligence (SGI) denotes an AI's capacity to traverse that cycle coherently and autonomously, integrating knowledge retrieval, idea formation, action execution, and interpretation into a unified loop of inquiry.
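To make the loop concrete, here is a minimal sketch of a PIM-style inquiry cycle. The phase names follow the paper (Deliberation, Conception, Action, Perception); the agent interface, state fields, and stopping rule are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical illustration of the PIM-style inquiry loop described above.
# The agent methods and convergence check are assumed, not the paper's API.

@dataclass
class InquiryState:
    question: str
    evidence: list = field(default_factory=list)   # retrieved knowledge (Deliberation)
    hypothesis: str = ""                            # proposed idea (Conception)
    result: str = ""                                # experiment outcome (Action)
    interpretation: str = ""                        # reflection on outcome (Perception)

def run_inquiry_cycle(agent, question: str, max_rounds: int = 3) -> InquiryState:
    """Iterate the Deliberation -> Conception -> Action -> Perception loop."""
    state = InquiryState(question=question)
    for _ in range(max_rounds):
        state.evidence = agent.deliberate(state)      # gather and retrieve knowledge
        state.hypothesis = agent.conceive(state)      # form a testable idea
        state.result = agent.act(state)               # run a dry/wet experiment
        state.interpretation = agent.perceive(state)  # interpret the result
        if agent.is_resolved(state):                  # assumed convergence check
            break
    return state
```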

SGI-Bench data is scientist-aligned and high-fidelity: an expert-sourced corpus spanning 10 disciplines (inspired by Science's 125 Big Questions); questions constructed by 100+ Master's and PhD holders with continuous scientist-in-the-loop review; multi-stage cleaning (rule-based checks, model checks, and expert QA) to ensure executability and unique answers; and difficulty filtering that removes items solved by more than 50% of strong LLMs. The result is a set of authentic, challenging, and broadly representative scientific tasks.
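A minimal sketch of the difficulty filter described above: an item is dropped if more than half of a panel of strong LLMs already answer it correctly. The `solve` callable, the item schema, and the per-item threshold are assumed details; only the ">50% solved, then remove" rule comes from the text.

```python
# Sketch of the difficulty-filtering stage, assuming each item carries a unique
# reference answer and `solve(model, item)` returns that model's answer.

def difficulty_filter(items, models, solve, threshold=0.5):
    """Keep only items that at most `threshold` of the panel models solve."""
    kept = []
    for item in items:
        n_correct = sum(1 for m in models if solve(m, item) == item["answer"])
        if n_correct / len(models) <= threshold:
            kept.append(item)
    return kept
```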

Evaluation runs on an agent-based stack that coordinates specialized agents and tools to assess models end-to-end with task-specific, customizable metrics. By formalizing question selection, metric construction, scoring, and reporting into traceable stages, it strengthens reproducibility, mitigates evaluator-model coupling bias, and offers scientist-aligned, actionable insights for model selection and iteration.
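The staged flow can be pictured as follows. The stage names mirror the text (selection, metric construction, scoring, reporting); the function signatures, item schema, and report format are illustrative assumptions rather than the benchmark's actual interfaces.

```python
# Illustrative sketch of the traceable, staged evaluation flow described above.

def evaluate_model(model, benchmark, metric_registry, task_type):
    # Stage 1: question selection (restrict the benchmark to one task family)
    questions = [q for q in benchmark if q["task"] == task_type]

    # Stage 2: metric construction (task-specific, customizable metrics)
    metrics = metric_registry[task_type]

    # Stage 3: scoring (each metric scores every model response)
    records = []
    for q in questions:
        response = model.generate(q["prompt"])
        scores = {name: metric(response, q) for name, metric in metrics.items()}
        records.append({"id": q["id"], "scores": scores})

    # Stage 4: reporting (aggregate per-metric averages into a traceable summary)
    summary = {
        name: sum(r["scores"][name] for r in records) / len(records)
        for name in metrics
    }
    return {"task": task_type, "per_item": records, "summary": summary}
```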

TTRL converts open-ended scientific exploration into a measurable test-time optimization process. In practice, it improves idea novelty without labels by coupling a strict output structure with retrieval-grounded rewards, and it generalizes to multi-objective, scientist-aligned rewards (rigor, feasibility, safety, cost) that balance creativity with practicality, making scientific ideation auditable and adaptable at inference time.
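As a rough sketch of the reward side, the snippet below scores a candidate hypothesis by its dissimilarity to retrieved prior work and keeps the most novel of several samples. The embedding model, retriever, and simple best-of-n selection are illustrative simplifications of test-time optimization, not the authors' TTRL training procedure.

```python
import numpy as np

# Sketch of a retrieval-grounded novelty reward of the kind TTRL optimizes at inference.
# `model`, `embed`, and `retrieve` are assumed callables, not the paper's components.

def novelty_reward(idea_embedding: np.ndarray, retrieved_embeddings: np.ndarray) -> float:
    """Reward = 1 - max cosine similarity to retrieved prior work (higher = more novel)."""
    idea = idea_embedding / np.linalg.norm(idea_embedding)
    refs = retrieved_embeddings / np.linalg.norm(retrieved_embeddings, axis=1, keepdims=True)
    return float(1.0 - np.max(refs @ idea))

def select_most_novel(model, embed, retrieve, prompt: str, n_samples: int = 8) -> str:
    """Sample n candidate ideas and keep the one with the highest novelty reward."""
    best_idea, best_reward = "", -np.inf
    for _ in range(n_samples):
        idea = model.generate(prompt)          # candidate hypothesis
        refs = retrieve(idea)                  # embeddings of related prior work
        reward = novelty_reward(embed(idea), refs)
        if reward > best_reward:
            best_idea, best_reward = idea, reward
    return best_idea
```

A multi-objective variant would simply add weighted reward terms for rigor, feasibility, safety, or cost alongside the novelty term.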
