Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning

Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge 葛钰峣, Yilong Xu, Jiafeng Guo, Xueqi Cheng

Nov 15, 2025

Abstract:

Reinforcement learning (RL) has shown great promise in enhancing the reasoning of large language models (LLMs), but current approaches mainly focus on single domains with verifiable rewards. We propose RGR-GRPO, a rubric-driven RL framework for multi-domain reasoning that uses rubrics to provide fine-grained reward signals and offline guidance. With it, LLMs receive dense, informative rewards while exploring a larger solution space during Group Relative Policy Optimization (GRPO) training. Experiments across 14 benchmarks show that RGR-GRPO outperforms alternatives, with average gains of +7.0%, +5.4%, +8.4%, and +6.6% on math, physics, chemistry, and general reasoning tasks, respectively. The method also keeps entropy fluctuations stable during off-policy training and achieves stronger pass@k performance, indicating sustained exploration.
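
To make the idea of a rubric-based reward feeding GRPO concrete, here is a minimal sketch; it is not the authors' implementation. It assumes each rubric is a list of weighted pass/fail criteria and that the reward is the weighted fraction of criteria met; the function names `rubric_reward` and `group_relative_advantages`, the weights, and the use of the population standard deviation are all illustrative assumptions. The only part taken from GRPO itself is the general idea of normalizing each reward against the other samples in its group.

```python
from statistics import mean, pstdev


def rubric_reward(satisfied, weights):
    """Hypothetical rubric score: weighted fraction of criteria met, in [0, 1].

    satisfied: list of bools, one per rubric criterion (assumed to come
               from some external judge; not specified by the abstract).
    weights:   list of positive floats, one per criterion (assumed).
    """
    total = sum(weights)
    return sum(w for ok, w in zip(satisfied, weights) if ok) / total


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, GRPO-style.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored against a 3-criterion rubric.
weights = [2.0, 1.0, 1.0]
group = [
    [True, True, True],
    [True, False, True],
    [False, True, False],
    [False, False, False],
]
rewards = [rubric_reward(s, weights) for s in group]
advantages = group_relative_advantages(rewards)
```

Because the rubric score takes several values between 0 and 1 rather than a single pass/fail bit, responses in the same group are more likely to receive distinct advantages, which is one way to read the abstract's claim about dense and informative rewards.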
