The first systematic study of reward design for RL-based tool-integrated reasoning: unlocking emergent agentic behaviors through principled, fine-grained reward signals.
University of Illinois Urbana-Champaign
LLMs increasingly need to use external tools: search engines, calculators, code interpreters, APIs. While supervised fine-tuning (SFT) teaches models what tool calls look like, it fails to teach agents how to reason about using tools, leading to overfitting, overthinking, and poor generalization.
ToolRL presents the first systematic study of reward design for tool-integrated reasoning (TIR) under the RL paradigm. We explore reward type, scale, granularity, and temporal dynamics, and propose a principled decomposed reward that drives stable, generalizing, emergent tool-use behavior through GRPO training.
Unlike binary rewards, ToolRL's decomposed correctness signal separately evaluates what to call, how to parameterize it, and what values to pass, providing richer feedback for complex multi-tool interactions.
Mathematical reasoning rewards are simple: is the final answer correct? Tool-integrated reasoning is fundamentally different. A single turn may invoke multiple tools with complex parameters. Coarse binary rewards fail to distinguish "correct tool, wrong parameters" from "wrong tool entirely". ToolRL addresses this with a four-dimensional reward analysis:
- **What to reward:** format compliance, tool selection correctness, or parameter accuracy.
- **How strongly to reward:** dynamic scaling helps models transition from simple to complex behavior.
- **How detailed the signal:** fine-grained decomposition leads to more stable, effective learning.
- **How rewards evolve over training:** curriculum-style dynamics prevent plateau and collapse.
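As a rough illustration, these four dimensions can be read as knobs on a single scalar reward. The sketch below is a hypothetical composition (the function name, weights, and ranges are assumptions for illustration, not the paper's exact values):

```python
def total_reward(format_ok: bool, correctness: float, scale: float = 1.0) -> float:
    """Hypothetical ToolRL-style reward: a binary format-compliance term
    plus a fine-grained correctness term (assumed range [0, 3]).
    `scale` is the knob the temporal dimension adjusts over training."""
    r_format = 1.0 if format_ok else 0.0
    return r_format + scale * correctness
```

A perfectly formatted, fully correct call scores highest, while a well-formatted but partially correct call still earns partial credit instead of zero.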
ToolRL applies Group Relative Policy Optimization (GRPO), which normalizes advantages within groups of responses to the same query, reducing variance without a separate value function. We train from a cold start β no SFT pre-initialization β and find this leads to better generalization than SFT-initialized RL, because SFT memorization actively hurts exploration.
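The group-wise normalization at the heart of GRPO can be sketched in a few lines. This is a minimal illustration of the advantage computation only, not the full clipped policy-gradient objective:

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled response's reward against the mean and std
    of its own group (all responses to the same query), so no separate
    learned value function is needed to estimate advantages."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Because rewards are centered within each group, responses better than their siblings get positive advantages and worse ones negative, regardless of the absolute reward scale of the query.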
We evaluate on BFCL V3 (Berkeley Function Calling Leaderboard), which covers non-live AST evaluation, live API testing, multi-turn conversations, and relevance/irrelevance detection. GRPO Cold Start achieves the highest or near-highest overall accuracy across the evaluated model families, with the largest gains on Qwen2.5-7B.
| Model / Method | Overall Acc | Non-Live AST | Live Acc | Multi-Turn |
|---|---|---|---|---|
| Qwen2.5-1.5B (Raw) | 19.41% | 16.00% | 35.58% | 0.00% |
| Qwen2.5-1.5B (SFT 4k) | 40.67% | 59.94% | 59.31% | 1.00% |
| Qwen2.5-1.5B GRPO Cold Start | 46.20% | 77.96% | 60.73% | 2.25% |
| Qwen2.5-7B (Raw) | 41.97% | 66.02% | 53.51% | 4.25% |
| Qwen2.5-7B (SFT 4k) | 36.53% | 45.15% | 57.13% | 0.75% |
| Qwen2.5-7B GRPO Cold Start | 58.38% | 86.17% | 74.90% | 18.12% |
| Llama-3.2-3B (Raw) | 22.09% | 17.44% | 43.85% | 0.00% |
| Llama-3.2-3B (SFT 4k) | 44.16% | 65.42% | 63.04% | 1.38% |
| Llama-3.2-3B GRPO Cold Start | 44.10% | 74.38% | 56.86% | 1.37% |
We also evaluate on API-Bank (a three-level API composition benchmark) and Bamboogle (open-domain QA requiring search tools). GRPO Cold Start achieves the highest scores in both benchmarks, demonstrating that the reward design generalizes beyond benchmark format.
| Model / Method | API-Bank Overall | Level 1 | Level 2 | Level 3 | Bamboogle |
|---|---|---|---|---|---|
| Qwen2.5-1.5B (Raw) | 30.65% | 28.32% | 35.82% | 35.11% | 20.80% |
| Qwen2.5-1.5B (SFT 4k) | 47.07% | 52.88% | 52.24% | 26.72% | 23.20% |
| Qwen2.5-1.5B GRPO Cold Start | 63.15% | 70.68% | 61.19% | 41.22% | 44.00% |
| Qwen2.5-3B (Raw) | 51.59% | 59.65% | 32.84% | 36.64% | 52.00% |
| Qwen2.5-3B GRPO Cold Start | 67.00% | 73.43% | 67.16% | 47.33% | 60.00% |
| Qwen2.5-7B (Raw) | 62.48% | 70.68% | 49.25% | 44.27% | 69.60% |
| Qwen2.5-7B GRPO Cold Start | 64.66% | 76.69% | 64.18% | 38.93% | 72.00% |
Longer reasoning traces are not inherently better. Adding length rewards degrades performance. In tool-integrated reasoning, verbosity from reward hacking leads to unfocused reasoning chains. The model should be taught to think precisely, not at length.
Dynamic reward scaling helps transition from simple to complex behavior. Static reward scales either undershoot (too easy tasks dominate) or overshoot (reward collapse on hard tasks). Gradually increasing scale encourages the model to first master format and simple calls before tackling multi-parameter, multi-tool scenarios.
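One way to implement the dynamic scaling described above is a simple linear schedule on the correctness-reward ceiling. The function below is a hypothetical schedule (names and endpoints are assumptions), so early training is dominated by format and simple calls, later training by full multi-parameter calls:

```python
def correctness_scale(step: int, total_steps: int,
                      start: float = 1.0, end: float = 3.0) -> float:
    """Linearly grow the weight on the correctness reward from `start`
    to `end` over training, curriculum-style."""
    frac = min(1.0, step / max(1, total_steps))
    return start + (end - start) * frac
```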
Fine-grained reward decomposition is the most important single design choice. Splitting the correctness signal into tool name matching, parameter key matching, and parameter value matching, rather than using a single binary reward, leads to significantly more stable training and better final accuracy. The model can learn from partially correct calls rather than treating everything as pass/fail.
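One way to realize this decomposition, assuming a call is represented as a dict with `name` and `arguments` fields (the field names and equal weights here are illustrative assumptions):

```python
def correctness_reward(pred: dict, gold: dict) -> float:
    """Sketch of a decomposed correctness reward: separate credit for the
    tool name, the set of parameter keys, and the parameter values, so a
    partially correct call still earns signal. Returns a score in [0, 3]."""
    name = 1.0 if pred.get("name") == gold.get("name") else 0.0
    pred_args, gold_args = pred.get("arguments", {}), gold.get("arguments", {})
    keys = set(gold_args)
    key_hit = len(set(pred_args) & keys) / len(keys) if keys else 1.0
    val_hit = (sum(pred_args.get(k) == v for k, v in gold_args.items()) / len(keys)
               if keys else 1.0)
    return name + key_hit + val_hit
```

Under this scheme, calling the right tool with one wrong argument value scores 2.5 rather than 0, which is exactly the gradient signal a binary reward throws away.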
GRPO cold start beats SFT-initialized GRPO. Counterintuitively, starting RL from the raw instruct model outperforms starting from an SFT checkpoint. SFT-initialized models achieve higher training rewards (due to distributional alignment) but generalize worse β a signature of memorization. Cold start forces the model to genuinely explore and learn.
Emergent behaviors arise: proactiveness and metacognitive reasoning. Models trained with ToolRL begin to exhibit spontaneous behaviors not seen in SFT baselines: voluntarily calling additional verification tools, rejecting inappropriate tool invocations, and reasoning out loud about tool selection trade-offs. These are properties of genuine reasoning, not memorized patterns.
GRPO is consistently more stable than PPO for cold-start tool learning. PPO with a cold start shows inconsistency across model families and benefits more from SFT initialization. GRPO's group-wise advantage normalization removes the need for a value function, making it particularly well-suited for the diverse reward distributions of tool-calling tasks.
If you find ToolRL useful in your research, please consider citing our work.
```bibtex
@article{qian2025toolrl,
  title   = {ToolRL: Reward is All Tool Learning Needs},
  author  = {Qian, Cheng and Acikgoz, Emre Can and He, Qi and Wang, Hongru and Chen, Xiusi and Hakkani-T{\"u}r, Dilek and Tur, Gokhan and Ji, Heng},
  journal = {arXiv preprint arXiv:2504.13958},
  year    = {2025}
}
```