Part of Project Lifelong Agents

Current Agents Fail to Leverage World Models as Tool for Foresight

An empirical analysis of why LLM agents hesitate to simulate the future, and how systematic failures in invocation, interpretation, and integration block anticipatory cognition.

Cheng Qian1, Emre Can Acikgoz1, Bingxuan Li1, Xiusi Chen1, Yuji Zhang1, Bingxiang He2, Qinyu Luo3,
Dilek Hakkani-Tür1, Gokhan Tur1, Yunzhu Li4, Heng Ji1

1University of Illinois Urbana-Champaign   2Tsinghua University   3Johns Hopkins University   4Columbia University

Foresight Available. Foresight Unused.

We give agents access to state-of-the-art generative world models. They almost never use them. And when they do, the results are often worse than acting blindly. This paper dissects why, and charts a path forward.

World Model as Tool: Agent Decision Loop

🤖 Agent (VLM): perceives state, forms plan, then chooses one of two branches:
🔮 Simulate (world-model call): rarely used, <1% of decisions, or
⚡ Act (real environment): the default behavior.
⚠️ When forced to simulate: performance degrades by up to 5% vs. no simulation.
Abstract

TL;DR

Modern AI agents face long-horizon tasks with irreversible consequences: exactly the conditions where foresight should matter most. Generative world models (Sora, WAN, environment simulators) can now produce coherent future-state predictions. The natural question: can agents use these world models as tools to look before they leap?

The answer, empirically, is no. Across diverse agentic control tasks and visual QA benchmarks, we find that current agents rarely invoke world-model simulation, frequently misuse predicted rollouts when they do, and often exhibit degraded performance when simulation is available or enforced. This paper provides a rigorous empirical taxonomy of why, and of what it will take to fix it.

<1%: world-model invocation rate (VQA)
~15%: misuse rate when the world model is called
−5%: worst-case performance drop when forced to use the WM
9: VLM families evaluated, across open and closed source
Framework

World Model as Tool: The Evaluation Design

๐Ÿค– VLM Agent Perceives environment state
โ†’
๐Ÿค” Decide Simulate or act directly?
โ†’
๐Ÿ”ฎ World Model Predicts future state
โ†’
โšก Real Action Grounded in foresight
Normal Mode
WM optional โ€” agent decides
WM Invisible
No WM โ€” baseline condition
WM Force
Mandatory WM use

Agentic tasks use cloned-environment simulators for ground-truth rollouts. VQA tasks use WAN 2.1 (video generation) to produce imagined visual futures based on agent-specified queries.
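The three evaluation conditions can be sketched as a single decision step. This is a minimal illustration, not the paper's code: the `agent`, `world_model`, and `env` interfaces below are hypothetical placeholders.

```python
from enum import Enum

class WMMode(Enum):
    NORMAL = "normal"        # world model exposed as an optional tool; the agent decides
    INVISIBLE = "invisible"  # no world model available: the baseline condition
    FORCE = "force"          # a world-model call is mandated before every real action

def step(agent, world_model, env, mode):
    """One decision step under the three evaluation conditions."""
    state = env.observe()
    rollout = None
    if mode is not WMMode.INVISIBLE and (
        mode is WMMode.FORCE or agent.wants_simulation(state)
    ):
        query = agent.formulate_query(state)          # where "misguided input formulation" bites
        rollout = world_model.simulate(state, query)  # an imagined future, not a real transition
    action = agent.choose_action(state, rollout)      # interpretation and integration happen here
    return env.act(action)
```

In Normal Mode the branch hinges on `agent.wants_simulation`, which is exactly the decision the paper finds agents almost never take.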

Root Causes

Three Governance Breakdowns

Attribution analysis reveals that failures cluster into three recurring patterns that together explain the majority of observed regressions when world-model access is provided.

🎯

Misguided Input Formulation

Agents query the world model with underspecified or ambiguous instructions. The simulation produces a plausible-looking rollout that does not actually answer the agent's implicit question, leading to misinformed decisions downstream.

🌫️

Ambiguous Outcome Interpretation

Even when the world model produces a useful rollout, agents struggle to extract decision-relevant information from it. They either over-trust a single deterministic prediction or dismiss visually ambiguous outputs altogether.

⚡

Unstable Action Integration

Agents fail to coherently integrate simulated evidence with internal reasoning. Rollouts are sometimes overridden by overconfident internal beliefs, and sometimes over-weighted, causing the agent to rigidly follow a simulated path even when real-world observations contradict it.
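The third breakdown can be made concrete. A minimal sketch, assuming action preferences are expressed as numeric scores (an assumption for exposition, not the paper's formulation): the two pathologies above correspond to a trust weight pinned at either extreme.

```python
def blend_action_scores(internal, simulated, trust):
    """Blend internal preferences with world-model evidence.

    internal, simulated: dicts mapping action -> score.
    trust in [0, 1]. trust == 0 reproduces "rollout overridden by
    overconfident internal beliefs"; trust == 1 reproduces "rigidly
    follow the simulated path". Stable integration lives in between.
    """
    actions = set(internal) | set(simulated)
    return {
        a: (1 - trust) * internal.get(a, 0.0) + trust * simulated.get(a, 0.0)
        for a in actions
    }
```

At `trust=0.0` the simulated evidence is invisible; at `trust=1.0` the internal belief is invisible; both failure modes observed in the paper are extremes of this one knob.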

Results

Agent Task Performance: With vs. Without World Model

Across four agentic control tasks (FrozenLake grid navigation, 3D Navigation, robotic PrimitiveSkill manipulation, and Sokoban box-pushing), access to a world model rarely helps and often hurts. Only the GPT-5 family shows consistent marginal improvement.

Each per-task cell reads: score without WM → score with WM.

Model              FrozenLake   Navigate    Prim.Skill   Sokoban      Avg w/o WM   Avg w/ WM   Δ
GPT-4o-mini        0.36→0.39    0.26→0.24   0.32→0.22    0.00→0.00    0.27         0.22        ↓0.05
GPT-4o             0.58→0.58    0.35→0.31   0.51→0.46    0.02→0.02    0.40         0.36        ↓0.04
GPT-5-mini         0.86→0.89    0.66→0.63   0.11→0.19    0.03→0.00    0.41         0.43        ↑0.02
GPT-5              0.77→0.89    0.74→0.71   0.19→0.20    0.06→0.14    0.47         0.48        ↑0.01
Llama-4-Maverick   0.70→0.66    0.31→0.20   0.40→0.32    0.00→0.03    0.35         0.27        ↓0.08
Llama-4-Scout      0.59→0.55    0.55→0.54   0.32→0.24    0.02→0.03    0.42         0.38        ↓0.04
Qwen2.5-VL-7B      0.36→0.45    0.26→0.26   0.13→0.12    0.02→0.00    0.20         0.20        ±0.00
Qwen2.5-VL-72B     0.61→0.45    0.38→0.37   0.41→0.32    0.00→0.00    0.37         0.33        ↓0.04

The pattern is stark: most models perform strictly worse when the world model is available. Only GPT-5-class models eke out marginal gains, suggesting that effective foresight integration demands frontier-level reasoning capacity that most current models lack.

Results

VQA Task Accuracy: World Model Access Has Near-Zero Impact

On four spatial Visual Question Answering benchmarks (3DSRBench, MMSI, SAT, and Spatial), providing access to a generative world model (WAN 2.1) produces almost no measurable change.

Each per-benchmark cell reads: score without WM → score with WM.

Model              3DSRBench    MMSI        SAT          Spatial      Avg w/o WM   Avg w/ WM   Δ
GPT-4o-mini        0.58→0.59    0.28→0.27   0.52→0.57    0.65→0.66    0.56         0.56        ±0.00
GPT-4o             0.66→0.66    0.31→0.30   0.71→0.73    0.72→0.72    0.63         0.63        ±0.00
GPT-5-mini         0.67→0.68    0.35→0.36   0.85→0.83    0.78→0.79    0.66         0.67        ↑0.01
GPT-5              0.69→0.70    0.38→0.37   0.86→0.85    0.80→0.79    0.68         0.68        ±0.00
Llama-4-Maverick   0.61→0.62    0.27→0.28   0.52→0.47    0.74→0.75    0.60         0.60        ±0.00
Llama-4-Scout      0.59→0.59    0.27→0.28   0.41→0.36    0.74→0.73    0.58         0.58        ±0.00
Qwen2.5-VL-7B      0.53→0.54    0.24→0.24   0.59→0.66    0.62→0.63    0.52         0.52        ±0.00
Qwen2.5-VL-32B     0.59→0.58    0.30→0.28   0.47→0.47    0.67→0.67    0.57         0.56        ↓0.01
Qwen2.5-VL-72B     0.61→0.61    0.29→0.29   0.47→0.48    0.71→0.73    0.59         0.59        ±0.00

Gains and losses are distributed nearly symmetrically: individual benchmark swings, even sizable ones, cancel at the average level, leaving most models' aggregate scores unchanged. This contrasts with the agentic tasks, where regressions are more consistent, and underscores that the bottleneck in VQA is not information availability but the agents' inability to decide when and how to invoke simulation.

World Model Invocation Rates

Models are not failing because they misuse world-model rollouts; they are mostly not calling the world model at all. Usage rates remain below 10% for VQA tasks across all but the Llama family, and even Llama models that invoke simulation more often show no measurable benefit from doing so.

GPT Series
<5% usage
Larger GPT models use it even less: high self-confidence overrides simulation willingness.
Llama Series
Most proactive
Higher willingness to call the WM, but still little measurable benefit: willing but not effective.
Qwen Series
Near-zero (VQA)
Qwen-7B, the smallest, least capable model tested, is paradoxically the most reluctant to call the WM.
Insights

Key Findings

Finding 1

World models do not reliably improve agent performance. Across both agentic control and visual QA tasks, adding world-model access fails to deliver the expected foresight advantage. For most model families in agent tasks, world-model signals introduce noise rather than guidance. In VQA, gains and losses occur at nearly equal rates, a net-neutral impact. The assumption that more information always helps is demonstrably false here.

Finding 2

Models rarely choose to invoke the world model. Usage rates stay below 1% for VQA across most model families. Models appear to lack a clear internal strategy for when world-model rollouts would improve predictions. This is not a capability gap; it is a decision-making gap. Models simply do not recognize simulation as a useful tool, even when explicit demonstrations are provided in-context.

Finding 3

Invocation habits are family-specific and shaped by model self-confidence. GPT and Qwen models frequently bypass simulation entirely; larger variants in these families do so even more, suggesting that higher capability correlates with overconfidence that crowds out external simulation. Llama variants are more willing to query, but their queries do not produce better outcomes: willingness and effectiveness are decoupled.

Finding 4

Forced simulation makes things worse. When we mandate world-model use in Force Mode, performance declines for every model and every task. Optional access already introduces systematic failure modes; making it mandatory amplifies those weaknesses. This conclusively rules out the idea that reluctance is the only problem: even when agents do simulate, they cannot reliably extract value from it.

Finding 5

Agents use world models for confirmation, not exploration. Attribution analysis reveals that when world models do help, the pattern is: the agent forms a hypothesis, queries the world model to verify it, and the simulation confirms it. This brittle "confirmation bias" pattern means that when the initial hypothesis is wrong, the world model locks in the error rather than correcting it. A healthier pattern (propose multiple hypotheses, simulate each, select the best) is rarely observed.

Finding 6

The bottleneck is not simulation quality; it is agent behavior. Our agentic tasks use ground-truth cloned-environment simulators with perfect fidelity. Even with zero simulation error, most agents fail to leverage foresight effectively. The problem is not that the world model is inaccurate; the problem is that agents cannot decide when to simulate, how to interpret rollouts, and how to integrate foresight into action. These are three distinct competencies that future training must address separately.

Future Work

Paths Toward Reliable Anticipatory Cognition

Our findings define three concrete challenges, and corresponding remedies, for building agents that can genuinely leverage world models.

🧩
Dedicated Governance Modules.

A three-part interaction loop: a Decider that proposes candidate actions to test, a Reflector that evaluates simulated versus real outcomes, and a Memory that maintains long-horizon task objectives across simulated branches. This makes foresight usage explicit and structured.
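A minimal sketch of how such a loop might fit together. The `decider`, `reflector`, `world_model`, and `env` interfaces are illustrative placeholders under the stated assumptions, not a specification from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Keeps the long-horizon objective stable across simulated branches."""
    objective: str
    notes: list = field(default_factory=list)

def governed_step(decider, reflector, memory, world_model, env):
    """One pass of the hypothetical Decider / Reflector / Memory loop."""
    state = env.observe()
    # Decider: explicitly propose candidate actions worth testing in simulation.
    candidates = decider.propose(state, memory.objective)
    # Simulate every candidate rather than acting on the first guess.
    rollouts = {a: world_model.simulate(state, a) for a in candidates}
    # Reflector: judge simulated outcomes against the task objective.
    best = reflector.evaluate(state, rollouts, memory)
    # Memory: record the branch taken so later steps stay on-objective.
    memory.notes.append((state, best))
    return env.act(best)
```

The point of the structure is that invocation, interpretation, and integration each get a dedicated owner instead of being left implicit in one monolithic prompt.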

🎲
Multi-Hypothesis Simulation.

Rather than confirming a single guess, agents should generate several competing hypotheses, simulate each, and select based on predicted-outcome quality. This turns simulation into structured hypothesis testing, correcting errors rather than reinforcing them.
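One way to sketch this selection loop; the `propose`, `simulate`, and `score` interfaces are hypothetical names introduced here for illustration.

```python
def multi_hypothesis_act(agent, world_model, state, k=3):
    """Propose k hypotheses, simulate each, pick the best-scoring one.

    Contrast with the confirmation pattern, which simulates only the
    agent's first guess and so cannot recover from a wrong prior.
    """
    hypotheses = agent.propose(state, k)          # k competing candidate plans
    scored = []
    for h in hypotheses:
        rollout = world_model.simulate(state, h)  # imagined consequence of plan h
        scored.append((agent.score(state, h, rollout), h))
    # Select on predicted-outcome quality, not on the order of proposal.
    return max(scored, key=lambda t: t[0])[1]
```

Because every hypothesis is simulated before any is committed to, a wrong first guess is outvoted by its own predicted outcome instead of being locked in.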

🏋️
RL Training with World-Model-as-Tool Rewards.

Beyond prompting, intrinsic training is required. Online multi-turn RL with rewards that incentivize appropriate invocation frequency, diverse query formulation, and information gain (hypothesis entropy reduction) can directly target the identified failure modes.
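A hedged illustration of such a shaped reward. The linear combination, the target invocation rate, and all weight values are assumptions made for exposition, not the paper's design.

```python
def wm_tool_reward(task_reward, n_calls, n_steps, query_diversity,
                   entropy_before, entropy_after,
                   target_rate=0.2, w_rate=0.1, w_div=0.1, w_info=0.5):
    """Illustrative shaped reward for world-model-as-tool RL.

    Combines (all weights are assumed, not from the paper):
    - a penalty for deviating from a target invocation rate, so the
      policy neither ignores the WM nor spams it;
    - a bonus for diverse query formulation (query_diversity in [0, 1]);
    - a bonus for information gain, measured as the reduction in
      hypothesis entropy from before to after simulation.
    """
    invocation_rate = n_calls / max(n_steps, 1)
    rate_term = -w_rate * abs(invocation_rate - target_rate)
    div_term = w_div * query_diversity
    info_term = w_info * max(entropy_before - entropy_after, 0.0)
    return task_reward + rate_term + div_term + info_term
```

Each term targets one of the three failure modes identified earlier: the rate term addresses invocation, the diversity term addresses input formulation, and the information-gain term rewards rollouts that actually change the agent's beliefs.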

Citation

If you find this work useful in your research, please consider citing our paper.

@article{qian2026worldagent,
  title   = {Current Agents Fail to Leverage World Model as Tool for Foresight},
  author  = {Qian, Cheng and Acikgoz, Emre Can and Li, Bingxuan and Chen, Xiusi and Zhang, Yuji and He, Bingxiang and Luo, Qinyu and Hakkani-T{\"u}r, Dilek and Tur, Gokhan and Li, Yunzhu and Ji, Heng},
  journal = {arXiv preprint arXiv:2601.03905},
  year    = {2026}
}