Teaching LLM agents when not to use tools: strategic metacognitive reasoning that reduces tool calls while improving performance.
1University of Illinois Urbana-Champaign 2IBM Research AI *Equal contribution
SMART teaches agents to strike that balance: invoke tools only when parametric knowledge falls short.
Current LLM agents are strong reasoners and capable tool-users, but they fail to balance these two modes. Agents routinely reach for external tools on questions their own parametric knowledge could answer just fine, wasting compute and often degrading accuracy. We call this Tool Overuse.
Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that equips agents with calibrated self-awareness. We build SMART-ER, a multi-domain dataset with explicit reasoning chains that justify each decision to use (or skip) a tool. Fine-tuning on SMART-ER yields SMARTAgent: a family of models that reduce tool usage by 24% while boosting task performance by over 37%.
XAgent and AgentGPT, both powered by GPT, still overuse tools on GSM8K tasks that require no tools, running ~10× slower than direct reasoning.
SMART draws from metacognitive theory in cognitive psychology: just as humans implicitly recognize when to recall a fact from memory versus when to Google something, SMART-trained agents develop explicit awareness of their own knowledge boundaries.
We construct SMART-ER (SMART-Enhanced Reasoning), a 3,000+ question dataset spanning three domains. Each question is compositional: it mixes steps the model handles well (parametric reasoning) with steps that genuinely require tools. Crucially, every step includes an explicit metacognitive justification explaining the choice.
Adapted from the MATH benchmark. Each query blends challenging computations (tool: Code executor) with simple arithmetic the model knows, forcing calibrated decisions about when precision tools are truly needed.
Adapted from FreshQA. Queries mix fast-changing facts (tool: Search) that require real-time knowledge with slow-changing facts the model can answer from its training data without any search.
Adapted from Intention-in-Interaction (IN3). Queries contain implicit user preferences that the model cannot guess, requiring targeted AskUser tool calls, while other parts are answerable from general knowledge.
Each query is decomposed into subgoals and annotated with a binary flag: needs tool or parametric knowledge suffices. GPT-4o executes the reasoning chain, calling tools where needed and recording the results. Every step is then enriched with an explicit justification (the metacognitive rationale) that the student model learns to mimic during supervised fine-tuning.
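As a rough illustration of this pipeline, the sketch below shows one way a flagged, justified reasoning step could be represented and serialized into a fine-tuning target. The class names, tags, and text format here are hypothetical, assumptions for illustration rather than the paper's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReasoningStep:
    subgoal: str                 # one subgoal from the query decomposition
    needs_tool: bool             # the binary flag: tool needed, or parametric?
    tool: Optional[str]          # e.g. "code", "search", "ask_user", or None
    justification: str           # the metacognitive rationale the student mimics

def render_chain(steps):
    """Serialize steps into a supervised fine-tuning target (format illustrative)."""
    lines = []
    for i, s in enumerate(steps, 1):
        mode = f"[TOOL:{s.tool}]" if s.needs_tool else "[PARAMETRIC]"
        lines.append(f"Step {i} {mode} {s.subgoal} -- {s.justification}")
    return "\n".join(lines)

steps = [
    ReasoningStep("Compute 17 * 23", False, None,
                  "Simple arithmetic; parametric knowledge suffices."),
    ReasoningStep("Evaluate a 40-digit exact product", True, "code",
                  "Large exact computation; a code executor is more reliable."),
]
print(render_chain(steps))
```

The point of the serialization is that the flag and the justification appear together in the target text, so the student model is trained to emit the rationale alongside each use-or-skip decision.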
We evaluate SMARTAgent against base models with standard reasoning prompts and base models equipped with tool prompts, across three in-domain task categories and two out-of-distribution benchmarks. The pattern is consistent: fewer tool calls, better answers.
| Model / Method | Math Acc ↑ | Math Tools ↓ | Time Acc ↑ | Time Tools ↓ | Intent Coverage ↑ |
|---|---|---|---|---|---|
| Mistral-7B + Tool Prompt | 13.25 | 3.90 | 49.00 | 1.67 | 63.04 |
| SMARTAgent (Mistral-7B) | 22.75 | 0.60 | 64.00 | 1.00 | 81.76 |
| Llama-3.1-8B + Tool Prompt | 51.00 | 1.93 | 56.00 | 2.05 | 70.20 |
| SMARTAgent (Llama-3.1-8B) | 54.75 | 0.88 | 67.00 | 1.05 | 78.28 |
| Llama-3.1-70B + Tool Prompt | 67.50 | 3.53 | 63.00 | 2.08 | 61.68 |
Macro-average across all models: tool use ↓ 24%, performance ↑ 37%.
A striking result: 7B-scale SMARTAgent outperforms its 70B counterparts while making far fewer tool calls. Strategic self-awareness bridges the gap between small and large models in a way that scale alone cannot.
We test SMARTAgent on benchmarks it has never seen: GSM8K (grade school math) and MINTQA (multi-hop QA requiring up-to-date knowledge). The model generalizes cleanly, maintaining accuracy with dramatically fewer tool invocations.
| Method | GSM8K Acc ↑ | GSM8K Tools ↓ | MINTQA Acc ↑ | MINTQA Tools ↓ |
|---|---|---|---|---|
| Llama-3.1-8B (Normal) | 80.29 | 0.00 | 21.65 | 0.00 |
| Llama-3.1-8B + Tool Prompt | 83.17 | 2.53 | 16.49 | 4.03 |
| SMARTAgent (Llama-3.1-8B) | 83.40 | 0.76 | 29.90 | 1.06 |
| Mistral-7B + Tool Prompt | 55.34 | 3.56 | 10.31 | 6.46 |
| SMARTAgent (Mistral-7B) | 58.98 | 0.45 | 25.77 | 0.99 |
On MINTQA, a benchmark where indiscriminate tool prompting actually hurts performance (the baseline drops from 21.65% to 16.49%), SMARTAgent achieves the best result at 29.90%, with roughly a quarter of the tool calls (1.06 vs. 4.03). The model has learned to be genuinely strategic, not just permissive.
Tool overuse is not benign: it actively degrades performance. In MINTQA and math tasks, arbitrary tool use reduces accuracy compared to plain chain-of-thought. Unnecessary tool calls introduce compounding errors through multi-round interactions, ultimately leading to wrong answers even when the right tool is available.
7B models with strategic tool use match or exceed 70B baselines. SMARTAgent (7B / 8B) outperforms Llama-3.1-70B and GPT-4o in Time and Intention domains while using significantly fewer tools. Strategic calibration is a more effective lever than raw model scale for tool-integrated tasks.
Metacognitive justifications are the key training signal. Explicitly labeling why each step uses or skips a tool β not just whether to call it β is what enables the model to generalize to unseen task distributions. The model learns a decision principle, not just an input-output mapping.
Larger models miscalibrate too. In the Intention domain, GPT-4o invokes tools less often than it should, suffering a larger performance drop than the 7B SMARTAgent. Overconfidence, the flip side of over-reliance, is a failure mode that strategic training must also address.
Confidence analysis validates calibration. Inspecting token logits at decision points (reasoning vs. tool call) shows that SMARTAgent produces significantly higher confidence at correct decision steps. The model isn't just making better decisions: it knows when it is right.
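A minimal sketch of this kind of check, using a generic softmax probability of the argmax token as the confidence proxy. This is our own illustrative formulation under the assumption that per-token logits at the decision step are available; it is not the paper's exact procedure.

```python
import math

def decision_confidence(logits):
    """Softmax probability of the argmax token: a simple confidence proxy.

    `logits` are raw scores over the candidate decision tokens
    (e.g. continue reasoning vs. emit a tool call) at one decision point.
    """
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

# Toy logits over two decision tokens (values are illustrative only).
conf_decisive = decision_confidence([4.0, 1.0])   # clear preference
conf_near_tie = decision_confidence([2.1, 2.0])   # near-tie, low confidence
print(conf_decisive, conf_near_tie)
```

Comparing this quantity at correct versus incorrect decision steps is one straightforward way to test whether higher confidence tracks better decisions.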
If you find SMART useful in your research, please consider citing our work.
@article{qian2025smart,
title = {SMART: Self-Aware Agent for Tool Overuse Mitigation},
author = {Qian, Cheng and Acikgoz, Emre Can and Wang, Hongru and Chen, Xiusi and Sil, Avirup and Hakkani-T{\"u}r, Dilek and Tur, Gokhan and Ji, Heng},
journal = {arXiv preprint arXiv:2502.11435},
year = {2025}
}