The first systematic study of reward design for RL-based tool-integrated reasoning: unlocking emergent agentic behaviors through principled, fine-grained reward signals.
University of Illinois Urbana-Champaign
LLMs increasingly need to use external tools: search engines, calculators, code interpreters, APIs. While supervised fine-tuning (SFT) teaches models what tool calls look like, it fails to teach agents how to reason about using tools, leading to overfitting, overthinking, and poor generalization.
ToolRL presents the first systematic study of reward design for tool-integrated reasoning (TIR) under the RL paradigm. We explore reward type, scale, granularity, and temporal dynamics, and propose a principled decomposed reward that drives stable, generalizing, emergent tool-use behavior through GRPO training.
Unlike binary rewards, ToolRL's decomposed correctness signal separately evaluates what to call, how to parameterize it, and what values to pass, providing richer feedback for complex multi-tool interactions.
Mathematical reasoning rewards are simple: is the final answer correct? Tool-integrated reasoning is fundamentally different. A single turn may invoke multiple tools with complex parameters. Coarse binary rewards fail to distinguish "correct tool, wrong parameters" from "wrong tool entirely". ToolRL addresses this with a four-dimensional reward analysis:
- **What to reward:** format compliance, tool selection correctness, or parameter accuracy.
- **How strongly to reward:** dynamic scaling helps models transition from simple to complex behavior.
- **How detailed the signal:** fine-grained decomposition leads to more stable, effective learning.
- **How rewards evolve over training:** curriculum-style dynamics prevent plateau and collapse.
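As a rough illustration, these four dimensions can be read as knobs on a single scalar reward. The sketch below is a hypothetical composition (the function name, weights, and ranges are assumptions for illustration, not the paper's exact values):

```python
def total_reward(format_ok: bool, correctness: float, scale: float = 1.0) -> float:
    """Hypothetical ToolRL-style reward: a binary format-compliance term
    plus a fine-grained correctness term (assumed range [0, 3]).
    `scale` is the knob the temporal dimension adjusts over training."""
    r_format = 1.0 if format_ok else 0.0
    return r_format + scale * correctness
```

A perfectly formatted, fully correct call scores highest, while a well-formatted but partially correct call still earns partial credit instead of zero.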
ToolRL applies Group Relative Policy Optimization (GRPO), which normalizes advantages within groups of responses to the same query, reducing variance without a separate value function. We train from a cold start β no SFT pre-initialization β and find this leads to better generalization than SFT-initialized RL, because SFT memorization actively hurts exploration.
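The group-wise normalization at the heart of GRPO can be sketched in a few lines. This is a minimal illustration of the advantage computation only, not the full clipped policy-gradient objective:

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled response's reward against the mean and std
    of its own group (all responses to the same query), so no separate
    learned value function is needed to estimate advantages."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Because rewards are centered within each group, responses better than their siblings get positive advantages and worse ones negative, regardless of the absolute reward scale of the query.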
We evaluate on BFCL V3 (Berkeley Function Calling Leaderboard), which covers non-live AST evaluation, live API testing, multi-turn conversations, and relevance/irrelevance detection. GRPO Cold Start achieves the highest or near-highest overall accuracy across the evaluated model families, with the largest gains on Qwen2.5-7B.
| Model / Method | Overall Acc | Non-Live AST | Live Acc | Multi-Turn |
|---|---|---|---|---|
| Qwen2.5-1.5B (Raw) | 19.41% | 16.00% | 35.58% | 0.00% |
| Qwen2.5-1.5B (SFT 4k) | 40.67% | 59.94% | 59.31% | 1.00% |
| Qwen2.5-1.5B GRPO Cold Start | 46.20% | 77.96% | 60.73% | 2.25% |
| Qwen2.5-7B (Raw) | 41.97% | 66.02% | 53.51% | 4.25% |
| Qwen2.5-7B (SFT 4k) | 36.53% | 45.15% | 57.13% | 0.75% |
| Qwen2.5-7B GRPO Cold Start | 58.38% | 86.17% | 74.90% | 18.12% |
| Llama-3.2-3B (Raw) | 22.09% | 17.44% | 43.85% | 0.00% |
| Llama-3.2-3B (SFT 4k) | 44.16% | 65.42% | 63.04% | 1.38% |
| Llama-3.2-3B GRPO Cold Start | 44.10% | 74.38% | 56.86% | 1.37% |
We also evaluate on API-Bank (a three-level API composition benchmark) and Bamboogle (open-domain QA requiring search tools). GRPO Cold Start achieves the highest scores in both benchmarks, demonstrating that the reward design generalizes beyond benchmark format.
| Model / Method | API-Bank Overall | Level 1 | Level 2 | Level 3 | Bamboogle |
|---|---|---|---|---|---|
| Qwen2.5-1.5B (Raw) | 30.65% | 28.32% | 35.82% | 35.11% | 20.80% |
| Qwen2.5-1.5B (SFT 4k) | 47.07% | 52.88% | 52.24% | 26.72% | 23.20% |
| Qwen2.5-1.5B GRPO Cold Start | 63.15% | 70.68% | 61.19% | 41.22% | 44.00% |
| Qwen2.5-3B (Raw) | 51.59% | 59.65% | 32.84% | 36.64% | 52.00% |
| Qwen2.5-3B GRPO Cold Start | 67.00% | 73.43% | 67.16% | 47.33% | 60.00% |
| Qwen2.5-7B (Raw) | 62.48% | 70.68% | 49.25% | 44.27% | 69.60% |
| Qwen2.5-7B GRPO Cold Start | 64.66% | 76.69% | 64.18% | 38.93% | 72.00% |
Longer reasoning traces are not inherently better. Adding length rewards degrades performance. In tool-integrated reasoning, verbosity from reward hacking leads to unfocused reasoning chains. The model should be taught to think precisely, not at length.
Dynamic reward scaling helps transition from simple to complex behavior. Static reward scales either undershoot (too easy tasks dominate) or overshoot (reward collapse on hard tasks). Gradually increasing scale encourages the model to first master format and simple calls before tackling multi-parameter, multi-tool scenarios.
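One way to implement the dynamic scaling described above is a simple linear schedule on the correctness-reward ceiling. The function below is a hypothetical schedule (names and endpoints are assumptions), so early training is dominated by format and simple calls, later training by full multi-parameter calls:

```python
def correctness_scale(step: int, total_steps: int,
                      start: float = 1.0, end: float = 3.0) -> float:
    """Linearly grow the weight on the correctness reward from `start`
    to `end` over training, curriculum-style."""
    frac = min(1.0, step / max(1, total_steps))
    return start + (end - start) * frac
```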
Fine-grained reward decomposition is the most important single design choice. Splitting the correctness signal into tool name matching, parameter key matching, and parameter value matching, rather than using a single binary reward, leads to significantly more stable training and better final accuracy. The model can learn from partially correct calls rather than treating everything as pass/fail.
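One way to realize this decomposition, assuming a call is represented as a dict with `name` and `arguments` fields (the field names and equal weights here are illustrative assumptions):

```python
def correctness_reward(pred: dict, gold: dict) -> float:
    """Sketch of a decomposed correctness reward: separate credit for the
    tool name, the set of parameter keys, and the parameter values, so a
    partially correct call still earns signal. Returns a score in [0, 3]."""
    name = 1.0 if pred.get("name") == gold.get("name") else 0.0
    pred_args, gold_args = pred.get("arguments", {}), gold.get("arguments", {})
    keys = set(gold_args)
    key_hit = len(set(pred_args) & keys) / len(keys) if keys else 1.0
    val_hit = (sum(pred_args.get(k) == v for k, v in gold_args.items()) / len(keys)
               if keys else 1.0)
    return name + key_hit + val_hit
```

Under this scheme, calling the right tool with one wrong argument value scores 2.5 rather than 0, which is exactly the gradient signal a binary reward throws away.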
GRPO cold start beats SFT-initialized GRPO. Counterintuitively, starting RL from the raw instruct model outperforms starting from an SFT checkpoint. SFT-initialized models achieve higher training rewards (due to distributional alignment) but generalize worse β a signature of memorization. Cold start forces the model to genuinely explore and learn.
Emergent behaviors arise: proactiveness and metacognitive reasoning. Models trained with ToolRL begin to exhibit spontaneous behaviors not seen in SFT baselines: voluntarily calling additional verification tools, rejecting inappropriate tool invocations, and reasoning out loud about tool selection trade-offs. These are properties of genuine reasoning, not memorized patterns.
GRPO is consistently more stable than PPO for cold-start tool learning. PPO with a cold start shows inconsistency across model families and benefits more from SFT initialization. GRPO's group-wise advantage normalization removes the need for a value function, making it particularly well-suited for the diverse reward distributions of tool-calling tasks.
If you find ToolRL useful in your research, please consider citing our work.
```bibtex
@article{qian2025toolrl,
  title   = {ToolRL: Reward is All Tool Learning Needs},
  author  = {Qian, Cheng and Acikgoz, Emre Can and He, Qi and Wang, Hongru and Chen, Xiusi and Hakkani-T{\"u}r, Dilek and Tur, Gokhan and Ji, Heng},
  journal = {arXiv preprint arXiv:2504.13958},
  year    = {2025}
}
```