DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
Abstract
Training Qwen3-8B on DIVE data improves performance across out-of-distribution benchmarks, with diversity scaling outperforming quantity scaling even with less data.
Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection–Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68%. Remarkably, controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4× less data.
Community
We introduce DIVE, an iterative evidence-driven recipe that scales diversity in agentic task synthesis for generalizable tool use. Unlike prior approaches that synthesize tasks first, DIVE inverts the order: it executes diverse, real-world tools first and reverse-derives tasks from the resulting traces. We build a pool of 373 real-world tools across 5 domains, ensuring every synthesized task is grounded in verifiable tool executions.
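The inverted synthesis order can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the paper's actual API: `collect_evidence` stands in for the Evidence Collection phase (sample a toolset, execute it, record a trace) and `derive_task` for Task Derivation (reverse-derive a question whose answer is entailed by the trace, so the task is verifiable by construction).

```python
import random

# Hypothetical sketch of DIVE's Evidence Collection -> Task Derivation loop.
# Function and field names are illustrative, not the paper's implementation.

def collect_evidence(tool_pool, toolset_size, seed=0):
    """Sample a toolset, execute each tool, and record the grounded trace."""
    rng = random.Random(seed)
    toolset = rng.sample(tool_pool, toolset_size)  # per-task toolset variety
    trace = []
    state = 2  # seed input threaded through the tool chain
    for tool in toolset:
        state = tool["run"](state)  # real execution -> verifiable output
        trace.append({"tool": tool["name"], "result": state})
    return trace

def derive_task(trace):
    """Reverse-derive a task strictly entailed by the execution trace."""
    steps = ", then ".join(step["tool"] for step in trace)
    return {
        "question": f"Starting from 2, apply: {steps}. What is the result?",
        "answer": trace[-1]["result"],  # ground truth comes from execution
    }

# Two toy "tools"; the real pool spans 373 tools across 5 domains.
tools = [
    {"name": "double", "run": lambda x: x * 2},
    {"name": "increment", "run": lambda x: x + 1},
]
trace = collect_evidence(tools, toolset_size=2)
task = derive_task(trace)
```

Because the answer is read off the trace rather than guessed in advance, every derived task is executable and verifiable by construction, which is the grounding property the recipe relies on.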
Key results:
- Training Qwen3-8B on DIVE data yields +22 avg points across 9 OOD benchmarks, outperforming the strongest 8B baseline by +68%
- Without any task-specific training, DIVE surpasses specialist agents on their home benchmarks (e.g., GAIA 61.2 vs 50.0, text-only validation set), while specialists suffer negative transfer on unseen domains
- Scaling analysis reveals: diversity scaling yields stronger OOD gains than quantity scaling even with 4× less data; RL further amplifies the diversity advantage; scaling homogeneous tasks alone hits a ceiling regardless of data volume
- Despite being 8B, DIVE is competitive with 10–100× larger frontier models (e.g., outperforming Claude-4-Sonnet on BrowseComp and FinSearchComp; on the zero-shot Toolathlon benchmark, improving from near-zero (0.9) to 8.3, approaching GPT-OSS-120B (9.8) and Gemini-2.5-Pro (10.5))
We release code, model, and data: SFT data (20K subset), RL data (3.2K), evaluation benchmark, and trained model.