Part of Project Lifelong Agents

Current Agents Fail to Leverage World Models as Tool for Foresight

An empirical analysis of why LLM agents hesitate to simulate the future, and how systematic failures in invocation, interpretation, and integration block anticipatory cognition.

Cheng Qian1, Emre Can Acikgoz1, Bingxuan Li1, Xiusi Chen1, Yuji Zhang1, Bingxiang He2, Qinyu Luo3,
Dilek Hakkani-Tür1, Gokhan Tur1, Yunzhu Li4, Heng Ji1

1University of Illinois Urbana-Champaign   2Tsinghua University   3Johns Hopkins University   4Columbia University

Foresight Available. Foresight Unused.

We give agents access to state-of-the-art generative world models. They almost never use them. And when they do, the results are often worse than acting blindly. This paper dissects why, and charts a path forward.

World Model as Tool: Agent Decision Loop

🤖 Agent (VLM): perceives state, forms plan, then chooses one of two branches:
🔮 Simulate (world-model call): rarely used, <1% of decisions, or
⚡ Act (real environment): the default behavior.
⚠️ When forced to simulate: performance degrades by up to 5% vs. no simulation.
Abstract

TL;DR

Modern AI agents face long-horizon tasks with irreversible consequences: exactly the conditions where foresight should matter most. Generative world models (Sora, WAN, environment simulators) can now produce coherent future-state predictions. The natural question: can agents use these world models as tools to look before they leap?

The answer, empirically, is no. Across diverse agentic control tasks and visual QA benchmarks, we find that current agents rarely invoke world-model simulation, frequently misuse predicted rollouts when they do, and often exhibit degraded performance when simulation is available or enforced. This paper provides a rigorous empirical taxonomy of why, and of what it will take to fix it.

<1%: world-model invocation rate (VQA)
~15%: misuse rate when the world model is called
−5%: worst-case performance drop when forced to use the WM
9: VLM families evaluated, across open and closed source
Framework

World Model as Tool: The Evaluation Design

๐Ÿค– VLM Agent Perceives environment state
โ†’
๐Ÿค” Decide Simulate or act directly?
โ†’
๐Ÿ”ฎ World Model Predicts future state
โ†’
โšก Real Action Grounded in foresight
Normal Mode
WM optional โ€” agent decides
WM Invisible
No WM โ€” baseline condition
WM Force
Mandatory WM use

Agentic tasks use cloned-environment simulators for ground-truth rollouts. VQA tasks use WAN 2.1 (video generation) to produce imagined visual futures based on agent-specified queries.
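The three evaluation conditions can be sketched as a single decision step. This is a minimal illustration, not the paper's code: the `agent`, `world_model`, and `env` interfaces below are hypothetical placeholders.

```python
from enum import Enum

class WMMode(Enum):
    NORMAL = "normal"        # world model exposed as an optional tool; the agent decides
    INVISIBLE = "invisible"  # no world model available: the baseline condition
    FORCE = "force"          # a world-model call is mandated before every real action

def step(agent, world_model, env, mode):
    """One decision step under the three evaluation conditions."""
    state = env.observe()
    rollout = None
    if mode is not WMMode.INVISIBLE and (
        mode is WMMode.FORCE or agent.wants_simulation(state)
    ):
        query = agent.formulate_query(state)          # where "misguided input formulation" bites
        rollout = world_model.simulate(state, query)  # an imagined future, not a real transition
    action = agent.choose_action(state, rollout)      # interpretation and integration happen here
    return env.act(action)
```

In Normal Mode the branch hinges on `agent.wants_simulation`, which is exactly the decision the paper finds agents almost never take.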

Root Causes

Three Governance Breakdowns

Attribution analysis reveals that failures cluster into three recurring patterns that together explain the majority of observed regressions when world-model access is provided.

🎯

Misguided Input Formulation

Agents query the world model with underspecified or ambiguous instructions. The simulation produces a plausible-looking rollout that does not actually answer the agent's implicit question, leading to misinformed decisions downstream.

🌫️

Ambiguous Outcome Interpretation

Even when the world model produces a useful rollout, agents struggle to extract decision-relevant information from it. They either over-trust a single deterministic prediction or dismiss visually ambiguous outputs altogether.

⚡

Unstable Action Integration

Agents fail to coherently integrate simulated evidence with internal reasoning. Rollouts are sometimes overridden by overconfident internal beliefs, and sometimes over-weighted, causing the agent to rigidly follow a simulated path even when real-world observations contradict it.
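The third breakdown can be made concrete. A minimal sketch, assuming action preferences are expressed as numeric scores (an assumption for exposition, not the paper's formulation): the two pathologies above correspond to a trust weight pinned at either extreme.

```python
def blend_action_scores(internal, simulated, trust):
    """Blend internal preferences with world-model evidence.

    internal, simulated: dicts mapping action -> score.
    trust in [0, 1]. trust == 0 reproduces "rollout overridden by
    overconfident internal beliefs"; trust == 1 reproduces "rigidly
    follow the simulated path". Stable integration lives in between.
    """
    actions = set(internal) | set(simulated)
    return {
        a: (1 - trust) * internal.get(a, 0.0) + trust * simulated.get(a, 0.0)
        for a in actions
    }
```

At `trust=0.0` the simulated evidence is invisible; at `trust=1.0` the internal belief is invisible; both failure modes observed in the paper are extremes of this one knob.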

Results

Agent Task Performance: With vs. Without World Model

Across four agentic control tasks (FrozenLake grid navigation, 3D Navigation, robotic PrimitiveSkill manipulation, and Sokoban box-pushing), access to a world model rarely helps and often hurts. Only the GPT-5 family shows consistent marginal improvement.

Each per-task cell reads: score without WM → score with WM.

Model              FrozenLake   Navigate    Prim.Skill   Sokoban      Avg w/o WM   Avg w/ WM   Δ
GPT-4o-mini        0.36→0.39    0.26→0.24   0.32→0.22    0.00→0.00    0.27         0.22        ↓0.05
GPT-4o             0.58→0.58    0.35→0.31   0.51→0.46    0.02→0.02    0.40         0.36        ↓0.04
GPT-5-mini         0.86→0.89    0.66→0.63   0.11→0.19    0.03→0.00    0.41         0.43        ↑0.02
GPT-5              0.77→0.89    0.74→0.71   0.19→0.20    0.06→0.14    0.47         0.48        ↑0.01
Llama-4-Maverick   0.70→0.66    0.31→0.20   0.40→0.32    0.00→0.03    0.35         0.27        ↓0.08
Llama-4-Scout      0.59→0.55    0.55→0.54   0.32→0.24    0.02→0.03    0.42         0.38        ↓0.04
Qwen2.5-VL-7B      0.36→0.45    0.26→0.26   0.13→0.12    0.02→0.00    0.20         0.20        ±0.00
Qwen2.5-VL-72B     0.61→0.45    0.38→0.37   0.41→0.32    0.00→0.00    0.37         0.33        ↓0.04

The pattern is stark: most models perform strictly worse when the world model is available. Only GPT-5-class models eke out marginal gains, suggesting that effective foresight integration demands frontier-level reasoning capacity that most current models lack.

Results

VQA Task Accuracy: World Model Access Has Near-Zero Impact

On four spatial Visual Question Answering benchmarks (3DSRBench, MMSI, SAT, and Spatial), providing access to a generative world model (WAN 2.1) produces almost no measurable change.

Each per-benchmark cell reads: score without WM → score with WM.

Model              3DSRBench    MMSI        SAT          Spatial      Avg w/o WM   Avg w/ WM   Δ
GPT-4o-mini        0.58→0.59    0.28→0.27   0.52→0.57    0.65→0.66    0.56         0.56        ±0.00
GPT-4o             0.66→0.66    0.31→0.30   0.71→0.73    0.72→0.72    0.63         0.63        ±0.00
GPT-5-mini         0.67→0.68    0.35→0.36   0.85→0.83    0.78→0.79    0.66         0.67        ↑0.01
GPT-5              0.69→0.70    0.38→0.37   0.86→0.85    0.80→0.79    0.68         0.68        ±0.00
Llama-4-Maverick   0.61→0.62    0.27→0.28   0.52→0.47    0.74→0.75    0.60         0.60        ±0.00
Llama-4-Scout      0.59→0.59    0.27→0.28   0.41→0.36    0.74→0.73    0.58         0.58        ±0.00
Qwen2.5-VL-7B      0.53→0.54    0.24→0.24   0.59→0.66    0.62→0.63    0.52         0.52        ±0.00
Qwen2.5-VL-32B     0.59→0.58    0.30→0.28   0.47→0.47    0.67→0.67    0.57         0.56        ↓0.01
Qwen2.5-VL-72B     0.61→0.61    0.29→0.29   0.47→0.48    0.71→0.73    0.59         0.59        ±0.00

Gains and losses are distributed nearly symmetrically: individual benchmark swings, even sizable ones, cancel at the average level, leaving most models' aggregate scores unchanged. This contrasts with the agentic tasks, where regressions are more consistent, and underscores that the bottleneck in VQA is not information availability but the agents' inability to decide when and how to invoke simulation.

World Model Invocation Rates

Models are not failing because they misuse world-model rollouts; they are mostly not calling the world model at all. Usage rates remain below 10% for VQA tasks across all but the Llama family, and even Llama models that invoke simulation more often show no measurable benefit from doing so.

GPT Series
<5% usage
Larger GPT models use it even less: high self-confidence overrides simulation willingness.
Llama Series
Most proactive
Higher willingness to call the WM, but still little measurable benefit: willing but not effective.
Qwen Series
Near-zero (VQA)
Qwen-7B, the smallest, least capable model tested, is paradoxically the most reluctant to call the WM.
Insights

Key Findings

Finding 1

World models do not reliably improve agent performance. Across both agentic control and visual QA tasks, adding world-model access fails to deliver the expected foresight advantage. For most model families in agent tasks, world-model signals introduce noise rather than guidance. In VQA, gains and losses occur at nearly equal rates, a net-neutral impact. The assumption that more information always helps is demonstrably false here.

Finding 2

Models rarely choose to invoke the world model. Usage rates stay below 1% for VQA across most model families. Models appear to lack a clear internal strategy for when world-model rollouts would improve predictions. This is not a capability gap; it is a decision-making gap. Models simply do not recognize simulation as a useful tool, even when explicit demonstrations are provided in-context.

Finding 3

Invocation habits are family-specific and shaped by model self-confidence. GPT and Qwen models frequently bypass simulation entirely; larger variants in these families do so even more, suggesting that higher capability correlates with overconfidence that crowds out external simulation. Llama variants are more willing to query, but their queries do not produce better outcomes: willingness and effectiveness are decoupled.

Finding 4

Forced simulation makes things worse. When we mandate world-model use in Force Mode, performance declines for every model and every task. Optional access already introduces systematic failure modes; making it mandatory amplifies those weaknesses. This conclusively rules out the idea that reluctance is the only problem: even when agents do simulate, they cannot reliably extract value from it.

Finding 5

Agents use world models for confirmation, not exploration. Attribution analysis reveals that when world models do help, the pattern is: the agent forms a hypothesis, queries the world model to verify it, and the simulation confirms it. This brittle "confirmation bias" pattern means that when the initial hypothesis is wrong, the world model locks in the error rather than correcting it. A healthier pattern (propose multiple hypotheses, simulate each, select the best) is rarely observed.

Finding 6

The bottleneck is not simulation quality; it is agent behavior. Our agentic tasks use ground-truth cloned-environment simulators with perfect fidelity. Even with zero simulation error, most agents fail to leverage foresight effectively. The problem is not that the world model is inaccurate; the problem is that agents cannot decide when to simulate, how to interpret rollouts, and how to integrate foresight into action. These are three distinct competencies that future training must address separately.

Future Work

Paths Toward Reliable Anticipatory Cognition

Our findings define three concrete challenges, and corresponding remedies, for building agents that can genuinely leverage world models.

🧩
Dedicated Governance Modules.

A three-part interaction loop: a Decider that proposes candidate actions to test, a Reflector that evaluates simulated versus real outcomes, and a Memory that maintains long-horizon task objectives across simulated branches. This makes foresight usage explicit and structured.
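A minimal sketch of how such a loop might fit together. The `decider`, `reflector`, `world_model`, and `env` interfaces are illustrative placeholders under the stated assumptions, not a specification from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Keeps the long-horizon objective stable across simulated branches."""
    objective: str
    notes: list = field(default_factory=list)

def governed_step(decider, reflector, memory, world_model, env):
    """One pass of the hypothetical Decider / Reflector / Memory loop."""
    state = env.observe()
    # Decider: explicitly propose candidate actions worth testing in simulation.
    candidates = decider.propose(state, memory.objective)
    # Simulate every candidate rather than acting on the first guess.
    rollouts = {a: world_model.simulate(state, a) for a in candidates}
    # Reflector: judge simulated outcomes against the task objective.
    best = reflector.evaluate(state, rollouts, memory)
    # Memory: record the branch taken so later steps stay on-objective.
    memory.notes.append((state, best))
    return env.act(best)
```

The point of the structure is that invocation, interpretation, and integration each get a dedicated owner instead of being left implicit in one monolithic prompt.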

🎲
Multi-Hypothesis Simulation.

Rather than confirming a single guess, agents should generate several competing hypotheses, simulate each, and select based on predicted-outcome quality. This turns simulation into structured hypothesis testing, correcting errors rather than reinforcing them.
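One way to sketch this selection loop; the `propose`, `simulate`, and `score` interfaces are hypothetical names introduced here for illustration.

```python
def multi_hypothesis_act(agent, world_model, state, k=3):
    """Propose k hypotheses, simulate each, pick the best-scoring one.

    Contrast with the confirmation pattern, which simulates only the
    agent's first guess and so cannot recover from a wrong prior.
    """
    hypotheses = agent.propose(state, k)          # k competing candidate plans
    scored = []
    for h in hypotheses:
        rollout = world_model.simulate(state, h)  # imagined consequence of plan h
        scored.append((agent.score(state, h, rollout), h))
    # Select on predicted-outcome quality, not on the order of proposal.
    return max(scored, key=lambda t: t[0])[1]
```

Because every hypothesis is simulated before any is committed to, a wrong first guess is outvoted by its own predicted outcome instead of being locked in.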

🏋️
RL Training with World-Model-as-Tool Rewards.

Beyond prompting, intrinsic training is required. Online multi-turn RL with rewards that incentivize appropriate invocation frequency, diverse query formulation, and information gain (hypothesis entropy reduction) can directly target the identified failure modes.
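A hedged illustration of such a shaped reward. The linear combination, the target invocation rate, and all weight values are assumptions made for exposition, not the paper's design.

```python
def wm_tool_reward(task_reward, n_calls, n_steps, query_diversity,
                   entropy_before, entropy_after,
                   target_rate=0.2, w_rate=0.1, w_div=0.1, w_info=0.5):
    """Illustrative shaped reward for world-model-as-tool RL.

    Combines (all weights are assumed, not from the paper):
    - a penalty for deviating from a target invocation rate, so the
      policy neither ignores the WM nor spams it;
    - a bonus for diverse query formulation (query_diversity in [0, 1]);
    - a bonus for information gain, measured as the reduction in
      hypothesis entropy from before to after simulation.
    """
    invocation_rate = n_calls / max(n_steps, 1)
    rate_term = -w_rate * abs(invocation_rate - target_rate)
    div_term = w_div * query_diversity
    info_term = w_info * max(entropy_before - entropy_after, 0.0)
    return task_reward + rate_term + div_term + info_term
```

Each term targets one of the three failure modes identified earlier: the rate term addresses invocation, the diversity term addresses input formulation, and the information-gain term rewards rollouts that actually change the agent's beliefs.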

Citation

If you find this work useful in your research, please consider citing our paper.

@article{qian2026worldagent,
  title   = {Current Agents Fail to Leverage World Model as Tool for Foresight},
  author  = {Qian, Cheng and Acikgoz, Emre Can and Li, Bingxuan and Chen, Xiusi and Zhang, Yuji and He, Bingxiang and Luo, Qinyu and Hakkani-T{\"u}r, Dilek and Tur, Gokhan and Li, Yunzhu and Ji, Heng},
  journal = {arXiv preprint arXiv:2601.03905},
  year    = {2026}
}