From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
Abstract
CAPO, a curriculum advantage policy optimization method, enhances reinforcement learning for large language models by strategically staging positive and negative advantage signals, improving reasoning capabilities and generalization.
Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting whether it performed better or worse than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially in the early stages, can lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
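To make the mechanism concrete, the following is a minimal Python sketch of the advantage-gating idea, assuming GRPO-style group-normalized advantages; the function names, the fixed `imitation_steps` threshold, and the hard positive-only masking rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group-normalized advantages for one prompt's rollouts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < 1e-8:  # all rollouts scored the same: no learning signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def curriculum_advantages(rewards, step, imitation_steps=500):
    """Hypothetical CAPO-style gating of the advantage signal."""
    adv = group_advantages(rewards)
    if step < imitation_steps:
        # Imitation phase: keep only positive advantages (imitate to learn).
        return np.where(adv > 0, adv, 0.0)
    # Discrimination phase: restore the full, unbiased signal.
    return adv

# Example: four rollouts for one prompt with verifier rewards.
print(curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=100))   # negatives zeroed
print(curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=2000))  # full signal
```

Zeroing the negative advantages early biases the update toward imitating above-average rollouts; lifting the mask later reinstates the unbiased gradient estimate so negative rollouts can sharpen decision boundaries.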
Community
[New Paper] CAPO: From Imitation to Discrimination - Rethinking Advantage in RL
Early RL training often suffers from instability due to "mixed signals" (simultaneous positive & negative feedback). Inspired by child cognitive development, we propose CAPO (Curriculum Advantage Policy Optimization).
The Core Intuition:
Instead of a static curriculum, we leverage Advantage values to create a dynamic, two-phase process:
1️⃣ Imitation Phase: Train on positive-advantage samples only. This reduces variance and establishes a stable behavioral foundation (Imitate to learn).
2️⃣ Discrimination Phase: Introduce negative signals later. This restores unbiased estimation and refines decision boundaries (Discriminate to generalize). A loss-level sketch follows below.
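For readers who want to see where this gating enters the objective, here is a PyTorch-style sketch of a clipped surrogate loss that consumes the phase-dependent advantages; the fixed-step switch and parameter names are illustrative assumptions rather than the paper's exact schedule:

```python
import torch

def capo_policy_loss(logprobs, old_logprobs, advantages, step,
                     imitation_steps=500, clip_eps=0.2):
    """Clipped policy-gradient surrogate with a phase-dependent advantage signal."""
    adv = advantages
    if step < imitation_steps:
        adv = torch.clamp(adv, min=0.0)         # phase 1: positive-only imitation
    ratio = torch.exp(logprobs - old_logprobs)  # importance ratio pi_new / pi_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()  # minimize negative surrogate
```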
Highlights:
- Plug-and-Play: Delivers consistent gains when combined with GRPO, PPO, RLOO, and Reinforce++.
- Cross-Domain: Not just for Math! CAPO demonstrates impressive generalization on Multimodal GUI Agent tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning (2025)
- Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models (2025)
- Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models (2025)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning (2025)
- MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards (2025)
- Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch (2025)
- HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness (2025)