Part of Project Lifelong Agents

SMART: Self-Aware Agent for Tool Overuse Mitigation

Teaching LLM agents when not to use tools: strategic metacognitive reasoning that reduces tool calls while improving performance.

Cheng Qian1*, Emre Can Acikgoz1*, Hongru Wang1†, Xiusi Chen1, Avirup Sil2, Dilek Hakkani-Tür1, Gokhan Tur1, Heng Ji1†

1University of Illinois Urbana-Champaign   2IBM Research AI   *Equal contribution

Strategic.
Self-Aware.
Efficient.

LLMs unnecessarily invoke tools over 30% of the time, even for tasks their own knowledge handles perfectly. SMART instills metacognitive awareness so agents know when tools help and when they don't.

🧠 Human metacognition: recall from memory vs. search online
💡 Parametric knowledge ⟷ 🔧 External tools

SMART teaches agents this same balance, invoking tools only when parametric knowledge falls short.

Abstract

TL;DR

Current LLM agents are strong reasoners and capable tool-users, but they fail to balance these two modes. Agents routinely reach for external tools on questions their own parametric knowledge could answer just fine, wasting compute and often degrading accuracy. We call this Tool Overuse.

Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that equips agents with calibrated self-awareness. We build SMART-ER, a multi-domain dataset with explicit reasoning chains that justify each decision to use (or skip) a tool. Fine-tuning on SMART-ER yields SMARTAgent: a family of models that reduce tool usage by 24% while boosting task performance by over 37%.

−24% reduction in tool calls
+37% performance improvement
7B ≈ 70B: small models match large-model baselines
⅕ the calls: OOD generalization with 5× fewer tools
The Problem

Tool Overuse is Pervasive

πŸ” Query arrives Agent decides how to solve
β†’
πŸ”§ Tool invoked Even when unnecessary
β†’
❌ Overhead + errors Slower, often less accurate
⚠️ Both LLMs (>30%) and agent systems show systematic tool overuse

XAgent and AgentGPT, both powered by GPT, still overuse tools on GSM8K tasks that require no tools, running roughly 10× slower than direct reasoning.

Approach

SMART: Metacognition for Agents

SMART draws from metacognitive theory in cognitive psychology: just as humans implicitly recognize when to recall a fact from memory versus when to Google something, SMART-trained agents develop explicit awareness of their own knowledge boundaries.
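In implementation terms, this metacognitive decision reduces to a simple runtime rule: a tool is executed only when the agent's own output explicitly requests one, and parametric answers pass through untouched. The sketch below is a minimal illustration of that dispatch, assuming a hypothetical `SEARCH(...)` tool-call syntax (mirroring the dataset examples in this page) and a stub search function; it is not the paper's actual agent runtime.

```python
import re

# Hypothetical SMART-style step dispatcher: the agent emits either a direct
# (parametric) answer or an explicit tool call. The runtime invokes the tool
# only when the output actually contains a tool-call tag.

TOOL_PATTERN = re.compile(r"SEARCH\((.*?)\)")  # illustrative tool-call syntax

def run_step(model_output: str, search_fn) -> str:
    """Invoke the search tool only if the model explicitly requested it."""
    match = TOOL_PATTERN.search(model_output)
    if match is None:
        # Parametric path: the model answered from its own knowledge.
        return model_output
    # Tool-dependent path: execute the requested search and return its result.
    return search_fn(match.group(1))

# Stub tool, standing in for a real search backend.
def fake_search(query: str) -> str:
    return f"[search results for: {query}]"

parametric = run_step("Tim Cook has been Apple's CEO since 2011.", fake_search)
tool_based = run_step('SEARCH("latest Apple chip announcement")', fake_search)
```

The point of the design is that tool use is opt-in per step, so an agent trained to justify each decision pays the tool's latency and error cost only when its own knowledge genuinely falls short.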

The SMART-ER Dataset

We construct SMART-ER (SMART-Enhanced Reasoning), a 3,000+ question dataset spanning three domains. Each question is compositional: it mixes steps the model handles well (parametric reasoning) with steps that genuinely require tools. Crucially, every step includes an explicit metacognitive justification explaining the choice.

📐

Math Domain

Adapted from the MATH benchmark. Each query blends challenging computations (tool: Code executor) with simple arithmetic the model knows, forcing calibrated decisions about when precision tools are truly needed.

📅

Time Domain

Adapted from FreshQA. Queries mix fast-changing facts (tool: Search) requiring real-time knowledge with slow-changing facts the model can answer from its training data without any search.

🎯

Intention Domain

Adapted from Intention-in-Interaction (IN3). Queries contain implicit user preferences that the model cannot guess, requiring targeted AskUser tool calls, while other parts are answerable from general knowledge.

How SMART-ER Is Built

Each query is decomposed into subgoals and annotated with a binary flag: needs tool or parametric knowledge suffices. GPT-4o executes the reasoning chain, calling tools where needed and recording the results. Every step is then enriched with an explicit justification (the metacognitive rationale) that the student model learns to mimic during supervised fine-tuning.

Step 1 (parametric): "Tim Cook has been Apple's CEO since 2011." → [Reasoning: This is a stable fact within my training data.]
Step 2 (tool-dependent): SEARCH("latest Apple chip announcement") → [Reasoning: Chip releases happen frequently; I cannot rely on stale knowledge.]
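The annotation pipeline above suggests a simple per-step schema: reasoning text, a binary tool flag, and the metacognitive justification. The field names below are assumptions for illustration, not the released SMART-ER format.

```python
from dataclasses import dataclass

# Illustrative schema for one SMART-ER-style reasoning chain: each step
# carries a binary needs-tool flag plus the explicit metacognitive rationale
# that the student model learns to reproduce during fine-tuning.

@dataclass
class Step:
    content: str        # reasoning text or tool call
    needs_tool: bool    # True: tool required; False: parametric knowledge suffices
    justification: str  # explicit metacognitive rationale for the choice

example_chain = [
    Step(
        content="Tim Cook has been Apple's CEO since 2011.",
        needs_tool=False,
        justification="This is a stable fact within my training data.",
    ),
    Step(
        content='SEARCH("latest Apple chip announcement")',
        needs_tool=True,
        justification="Chip releases happen frequently; I cannot rely on stale knowledge.",
    ),
]

# A compositional question mixes both step types, so only a fraction of
# steps should actually trigger a tool.
tool_steps = sum(step.needs_tool for step in example_chain)
```

Training on the `justification` field alongside the flag is what distinguishes this data from plain tool-call traces: the model sees why each decision was made, not just which action was taken.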
Results

Main Results

We evaluate SMARTAgent against base models with standard reasoning prompts and base models equipped with tool prompts, across three in-domain task categories and two out-of-distribution benchmarks. The pattern is consistent: fewer tool calls, better answers.

Model / Method | Math Acc ↑ | Math Tools ↓ | Time Acc ↑ | Time Tools ↓ | Intent Coverage ↑
Mistral-7B + Tool Prompt | 13.25 | 3.90 | 49.00 | 1.67 | 63.04
SMARTAgent (Mistral-7B) | 22.75 | 0.60 | 64.00 | 1.00 | 81.76
Llama-3.1-8B + Tool Prompt | 51.00 | 1.93 | 56.00 | 2.05 | 70.20
SMARTAgent (Llama-3.1-8B) | 54.75 | 0.88 | 67.00 | 1.05 | 78.28
Llama-3.1-70B + Tool Prompt | 67.50 | 3.53 | 63.00 | 2.08 | 61.68
Macro-average across all models: Tool use ↓ 24%   |   Performance ↑ 37%
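The per-row relative changes behind numbers like these are straightforward to compute. The snippet below works through the Math-domain pairs shown in the table (baseline "+ Tool Prompt" vs. SMARTAgent); note that the headline −24% / +37% figures are macro-averages over all models and domains in the paper, so these two rows alone are only an illustration of the calculation.

```python
# Relative tool-call reduction and accuracy gain for the Math-domain rows
# shown above. Values are copied from the table; the computation itself is
# the standard (new - old) / old relative change.

rows = {
    # model: (baseline_acc, baseline_tools, smart_acc, smart_tools)
    "Mistral-7B":   (13.25, 3.90, 22.75, 0.60),
    "Llama-3.1-8B": (51.00, 1.93, 54.75, 0.88),
}

results = {}
for model, (b_acc, b_tools, s_acc, s_tools) in rows.items():
    tool_reduction = 1 - s_tools / b_tools  # fraction of tool calls avoided
    acc_gain = (s_acc - b_acc) / b_acc      # relative accuracy change
    results[model] = (tool_reduction, acc_gain)
    print(f"{model}: tools -{tool_reduction:.0%}, accuracy +{acc_gain:.0%}")
```

On these rows the Math-domain swings are much larger than the cross-domain macro-average, which is expected: the averages also fold in domains and models where the gap is smaller.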

A striking result: 7B-scale SMARTAgent outperforms its 70B counterparts while making far fewer tool calls. Strategic self-awareness bridges the gap between small and large models in a way that scale alone cannot.

Generalization

Out-of-Distribution Performance

We test SMARTAgent on benchmarks it has never seen: GSM8K (grade school math) and MINTQA (multi-hop QA requiring up-to-date knowledge). The model generalizes cleanly, maintaining accuracy with dramatically fewer tool invocations.

Method | GSM8K Acc ↑ | GSM8K Tools ↓ | MINTQA Acc ↑ | MINTQA Tools ↓
Llama-3.1-8B (Normal) | 80.29 | 0.00 | 21.65 | 0.00
Llama-3.1-8B + Tool Prompt | 83.17 | 2.53 | 16.49 | 4.03
SMARTAgent (Llama-3.1-8B) | 83.40 | 0.76 | 29.90 | 1.06
Mistral-7B + Tool Prompt | 55.34 | 3.56 | 10.31 | 6.46
SMARTAgent (Mistral-7B) | 58.98 | 0.45 | 25.77 | 0.99

On MINTQA, a benchmark where arbitrary tool prompting actually hurts performance (the baseline drops from 21.65% to 16.49%), SMARTAgent achieves the best result at 29.90%, with only one-fifth the tool calls. The model has learned to be genuinely strategic, not just permissive.

Insights

Key Findings

Finding 1

Tool overuse is not benign: it actively degrades performance. In MINTQA and math tasks, arbitrary tool use reduces accuracy compared to plain chain-of-thought. Unnecessary tool calls introduce compounding errors through multi-round interactions, ultimately leading to wrong answers even when the right tool is available.

Finding 2

7B models with strategic tool use match or exceed 70B baselines. SMARTAgent (7B / 8B) outperforms Llama-3.1-70B and GPT-4o in Time and Intention domains while using significantly fewer tools. Strategic calibration is a more effective lever than raw model scale for tool-integrated tasks.

Finding 3

Metacognitive justifications are the key training signal. Explicitly labeling why each step uses or skips a tool (not just whether to call it) is what enables the model to generalize to unseen task distributions. The model learns a decision principle, not just an input-output mapping.

Finding 4

Larger models can overuse tools too. GPT-4o uses tools less frequently in the Intention domain, causing a larger performance drop than the 7B SMARTAgent. Overconfidence, the flip side of over-reliance, is also a failure mode that strategic training must address.

Finding 5

Confidence analysis validates calibration. Inspecting token logits at decision points (reasoning vs. tool call) shows that SMARTAgent produces significantly higher confidence at correct decision steps. The model isn't just making better decisions; it knows when it is right.
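The probe described here amounts to reading the softmax probability the model assigns to its chosen action at each decision token. The sketch below shows that computation on toy numbers; the decision-token names and logit values are illustrative assumptions, not the paper's actual vocabulary or measurements.

```python
import math

# Confidence at a decision point: softmax probability of the chosen option
# among the candidate decision tokens (answer directly, search, ask the user).

def decision_confidence(logits: dict, chosen: str) -> float:
    """Softmax probability of the chosen decision token."""
    mx = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - mx) for tok, v in logits.items()}
    return exps[chosen] / sum(exps.values())

# Toy logits at one decision point: the model strongly prefers answering
# from parametric knowledge over invoking a tool.
logits = {"<answer>": 4.2, "<search>": 1.1, "<ask_user>": 0.3}
conf = decision_confidence(logits, "<answer>")
```

A calibrated agent would show high values of this quantity precisely at the steps where its decision turns out to be correct, which is the pattern the finding reports.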

Citation

If you find SMART useful in your research, please consider citing our work.

@article{qian2025smart,
  title   = {SMART: Self-Aware Agent for Tool Overuse Mitigation},
  author  = {Qian, Cheng and Acikgoz, Emre Can and Wang, Hongru and Chen, Xiusi and Sil, Avirup and Hakkani-T{\"u}r, Dilek and Tur, Gokhan and Ji, Heng},
  journal = {arXiv preprint arXiv:2502.11435},
  year    = {2025}
}