Part of Project Lifelong Agents

UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

A unified framework for training and evaluating user-centric LLM agents through standardized gym environments, simulated users, and carefully designed RL rewards.

Cheng Qian1, Zuxin Liu2, Akshara Prabhakar2, Jielin Qiu2, Zhiwei Liu2, Haolin Chen2, Shirley Kokane2, Heng Ji1, Weiran Yao2, Shelby Heinecke2, Silvio Savarese2, Caiming Xiong2, Huan Wang2

1University of Illinois Urbana-Champaign · 2Salesforce AI Research

User-First.
Multi-Turn.
RL-Trained.

Most RL agents are trained to complete tasks — not to satisfy users. UserRL bridges this gap by pairing standardized gym environments with simulated users and training agents via GRPO to handle diverse, dynamic, multi-turn interactions — revealing that an SFT cold start and deliberate trajectory scoring are critical to making such training work.

UserRL — Multi-Turn Interaction Loop

👤 Simulated User — sends requests, evaluates responses (multi-turn dialogue)
🤖 LLM Agent — responds, clarifies, takes action
🎯 Turn Reward — per-step signal
📈 Traj. Score — sequence-level
⚙️ GRPO — RL optimizer
Abstract

TL;DR

Reinforcement learning has shown promise in training agentic models for dynamic, multi-turn interactions — but most existing frameworks optimize for task completion, not user satisfaction. The diversity and dynamics of real user interactions pose fundamentally different challenges.

UserRL proposes a unified framework for training and evaluating user-centric agents through standardized gym environments paired with simulated users. By systematically varying turn-level reward assignment and trajectory-level score calculation, UserRL reveals how different reward formulations affect learning under GRPO — establishing a practical pathway for building robust, user-serving agentic models.

0.565 — best avg. score (trained Qwen3-8B)
+81% — over raw open-source Qwen3 models
+20% — over the best closed-source model (Gemini-2.5-Pro)
8 gyms — diverse user-centric evaluation environments
Core Framework

UserRL: Gym Environments × Simulated Users × RL

🏋️
Gym Environments

Standardized multi-turn interaction environments built on UserBench — covering diverse user-assistance tasks with reproducible evaluation protocols.

🎭
Simulated Users

LLM-based user proxies (GPT-4o or Qwen3-32B) that generate realistic requests, respond to agent clarifications, and provide feedback signals for reward computation.

🎯
Reward Design

Systematic variation of turn-level reward assignment and trajectory-level score calculation — analyzed for impact on GRPO learning stability and final performance.

UserRL builds upon UserBench — a comprehensive benchmark for evaluating user-centric agent capabilities across 15 task types and multiple interaction modes.
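The interaction loop between agent, simulated user, and reward signals can be sketched in code. This is an illustrative skeleton, not the UserRL API: `rollout`, `Trajectory`, and every method name on `agent` and `sim_user` are hypothetical stand-ins for the components described above.

```python
# Illustrative sketch (not the UserRL API): one multi-turn rollout of an
# agent policy against an LLM-based simulated user. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)   # (agent_msg, user_msg, turn_reward)
    score: float = 0.0                          # trajectory-level training signal

def rollout(agent, sim_user, max_turns=8):
    """Roll the agent against a simulated user, collecting per-turn rewards."""
    traj = Trajectory()
    user_msg = sim_user.open_request()          # user states an initial need
    for _ in range(max_turns):
        agent_msg = agent.respond(user_msg)     # clarify, answer, or act
        user_msg, turn_reward, done = sim_user.react(agent_msg)
        traj.turns.append((agent_msg, user_msg, turn_reward))
        if done:
            break
    traj.score = sim_user.final_score(traj)     # sequence-level quality signal
    return traj
```

The key structural point is that the simulated user plays both roles: conversation partner (`react`) and reward generator (per-turn rewards plus a final trajectory score).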

Approach

Reward Design for Multi-Turn User Interaction

User-centric tasks differ fundamentally from task-completion benchmarks: success is not a binary flag at the end of a trajectory — it accumulates across turns, depends on user-side signals, and is sensitive to how the agent communicates, not just what it ultimately produces. UserRL systematically studies two reward design dimensions:

DIMENSION 1
Turn-Level Reward Assignment

How to assign rewards within a trajectory — uniform distribution, last-turn only, or feedback-weighted. Controls how the agent attributes value to individual interaction steps.

DIMENSION 2
Trajectory-Level Score Calculation

How to aggregate interaction quality into a scalar training signal — average, minimum, or deliberate scoring. Deliberate scoring emphasizes the most informative conversational turns.
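The two dimensions can be made concrete with a small sketch. Caveats: the paper's exact formulas for its Equalized, EM, and R2G variants are not reproduced here. Below, `equalized` spreads the total reward evenly across turns, `last_turn` credits only the final step, and R2G is read as a discounted reward-to-go — one plausible interpretation of the name, stated as an assumption; the EM (exact-match) variant is omitted. Function names are hypothetical.

```python
# Hedged sketch of the two reward-design dimensions (illustrative formulas,
# not UserRL's exact implementation).

def assign_turn_rewards(turn_rewards, strategy="equalized", gamma=0.9):
    """Dimension 1: redistribute credit across the turns of one trajectory."""
    T = len(turn_rewards)
    total = sum(turn_rewards)
    if strategy == "equalized":                 # spread the total evenly
        return [total / T] * T
    if strategy == "last_turn":                 # credit only the final turn
        return [0.0] * (T - 1) + [total]
    if strategy == "r2g":                       # discounted reward-to-go (assumed)
        return [sum(gamma ** (j - i) * turn_rewards[j] for j in range(i, T))
                for i in range(T)]
    raise ValueError(strategy)

def trajectory_score(turn_rewards, method="sum", gamma=0.9):
    """Dimension 2: collapse a trajectory into one scalar training signal."""
    if method == "sum":                         # plain sum of turn rewards
        return sum(turn_rewards)
    if method == "avg":                         # mean over turns
        return sum(turn_rewards) / len(turn_rewards)
    if method == "r2g":                         # discounted return from turn 0
        return sum(gamma ** i * r for i, r in enumerate(turn_rewards))
    raise ValueError(method)
```

A configuration such as Equalized/R2G pairs one choice from each dimension: equalized turn-level credit with a discount-weighted trajectory score.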

Training with GRPO on Simulated Users

UserRL applies Group Relative Policy Optimization (GRPO) on trajectories generated by rolling out the agent against simulated users. The simulated user serves as both an interaction partner and a reward signal generator — evaluating agent responses turn by turn. Two simulator options are studied: GPT-4o (strong, closed-source) and Qwen3-32B (open-source, cost-effective). Both produce transferable training signal, but Qwen3-32B offers a practical pathway to cost-effective and reproducible training.
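GRPO's distinguishing step is a group-relative advantage: each trajectory's score is normalized against the other rollouts in its group. A minimal sketch of this computation in its commonly published form (assumed here; UserRL's exact implementation may differ):

```python
# Minimal sketch of GRPO's group-relative advantage (standard normalized
# form, assumed; not UserRL's exact code).
import statistics

def grpo_advantages(group_scores, eps=1e-6):
    """Normalize each trajectory's score against its rollout group."""
    mean = statistics.fmean(group_scores)
    std = statistics.pstdev(group_scores)
    return [(s - mean) / (std + eps) for s in group_scores]
```

Note what happens when every rollout in a group earns the same score: all advantages collapse to zero and the policy receives no gradient signal, which is why trajectory-score diversity matters for this optimizer.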

SFT Cold Start: The Essential Precursor

A critical finding of UserRL is that SFT pre-training is not optional for multi-turn interaction RL. Without an SFT cold start, models lack the basic dialogue competence necessary to generate meaningful multi-turn trajectories — making GRPO's group-relative advantage computation degenerate. SFT provides the initial interaction ability that RL then refines and extends far beyond what SFT alone achieves.

Results

Gym Evaluation: Trained Agents Dominate Closed-Source Baselines

We evaluate all models across 8 UserBench Gym environments — 4 held-in (TravelGym, TurtleGym, FunctionGym, TauGym) and 4 held-out (PersuadeGym, IntentionGym, TelepathyGym, SearchGym) — to test both in-distribution performance and generalization. The notation (TurnReward / TrajScore) denotes the reward formulation: turn-level assignment strategy (Equalized, EM, R2G) paired with trajectory-level scoring (R2G or Sum).

| Model | TravelGym | TurtleGym | FunctionGym | TauGym | PersuadeGym | IntentionGym | TelepathyGym | SearchGym | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-Source — Trained (Qwen3-8B) | | | | | | | | | |
| Qwen3-8B (Equalized/R2G) | 0.5730 | 0.1854 | 0.4231 | 0.1818 | 0.5317 | 1.8175 | 0.5610 | 0.8880 | 0.5652 |
| Qwen3-8B (EM/R2G) | 0.5025 | 0.1917 | 0.4103 | 0.2000 | 0.5397 | 1.9025 | 0.5366 | 0.8640 | 0.5343 |
| Qwen3-8B (R2G/R2G) | 0.5724 | 0.1615 | 0.4231 | 0.1394 | 0.5238 | 1.8525 | 0.5854 | 0.8480 | 0.5539 |
| Qwen3-8B (Equalized/Sum) | 0.5054 | 0.1323 | 0.2692 | 0.2121 | 0.5040 | 1.6275 | 0.5366 | 0.8320 | 0.5076 |
| Open-Source — Trained (Qwen3-4B) | | | | | | | | | |
| Qwen3-4B (Equalized/R2G) | 0.5086 | 0.1844 | 0.3333 | 0.2000 | 0.4643 | 1.8075 | 0.6098 | 0.8640 | 0.5269 |
| Qwen3-4B (EM/R2G) | 0.5076 | 0.1417 | 0.3333 | 0.1576 | 0.4563 | 1.7375 | 0.6341 | 0.8640 | 0.5154 |
| Qwen3-4B (R2G/R2G) | 0.4629 | 0.1687 | 0.3974 | 0.1333 | 0.5794 | 1.5975 | 0.4634 | 0.8640 | 0.4895 |
| Qwen3-4B (Equalized/Sum) | 0.4456 | 0.1615 | 0.2308 | 0.1333 | 0.4524 | 1.7150 | 0.4878 | 0.8400 | 0.4656 |
| Open-Source — Raw (No Training) | | | | | | | | | |
| Qwen3-32B (Raw) | 0.1724 | 0.1510 | 0.1538 | 0.0000 | 0.4841 | 1.8300 | 0.5610 | 0.7920 | 0.3128 |
| Qwen3-14B (Raw) | 0.1924 | 0.1417 | 0.1667 | 0.1030 | 0.5317 | 1.7000 | 0.5854 | 0.5120 | 0.3027 |
| Qwen3-4B (Raw) | 0.1405 | 0.0854 | 0.0769 | 0.0364 | 0.4048 | 1.7400 | 0.4878 | 0.8560 | 0.2929 |
| Closed-Source Baselines | | | | | | | | | |
| Gemini-2.5-Pro | 0.3468 | 0.2740 | 0.4103 | 0.1939 | 0.4246 | 1.5900 | 0.9024 | 0.9280 | 0.4702 |
| Gemini-2.5-Flash | 0.2553 | 0.1958 | 0.3205 | 0.1212 | 0.4087 | 1.6850 | 0.6341 | 0.9280 | 0.3973 |
| GPT-4o | 0.3643 | 0.2917 | 0.2821 | 0.0303 | 0.3770 | 1.8975 | 0.8537 | 0.8800 | 0.4449 |
| GPT-4o-mini | 0.0976 | 0.0906 | 0.1538 | 0.2061 | 0.5317 | 0.2500 | 0.0488 | 0.3520 | 0.1729 |
The results reveal a striking pattern: a trained Qwen3-8B with Equalized/R2G reward (avg. 0.5652) outperforms every closed-source model, including Gemini-2.5-Pro (0.4702) and GPT-4o (0.4449) — by a margin of +20% and +27% respectively. Even the smaller trained Qwen3-4B (0.5269) surpasses all closed-source baselines. Raw models without training remain far behind their trained counterparts (0.31 vs 0.56), confirming that user-centric interaction ability does not emerge naturally at scale — it must be trained.
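The quoted margins follow directly from the average scores in the table; a quick arithmetic check:

```python
# Relative improvements implied by the average scores reported above.
def rel_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100.0

print(f"vs Gemini-2.5-Pro: +{rel_gain(0.5652, 0.4702):.0f}%")   # -> +20%
print(f"vs GPT-4o:         +{rel_gain(0.5652, 0.4449):.0f}%")   # -> +27%
print(f"vs raw Qwen3-32B:  +{rel_gain(0.5652, 0.3128):.0f}%")   # -> +81%
```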

Insights

Key Findings

Finding 1

SFT cold start is critical for unlocking interaction ability and enabling sustained RL improvements. Without an SFT warm-up, pure RL training stagnates — models cannot form coherent multi-turn dialogues, so GRPO receives degenerate training signal. SFT provides the necessary dialogue foundation that RL then exploits to push far beyond the SFT performance ceiling. Concretely, the best trained Qwen3-8B (avg 0.5652) vastly outperforms its raw counterpart Qwen3-32B (avg 0.3128) despite being a 4× smaller model — a gap attributable solely to user-centric RL training. This is a sharper dependency than in single-turn tool-use tasks, reflecting the complexity of sustained user interaction.

Finding 2

Deliberate trajectory scoring (R2G) yields the most efficient and effective multi-turn interactions. R2G trajectory scoring consistently outperforms Sum across both Qwen3-8B and Qwen3-4B model sizes. The Equalized/R2G configuration achieves the best averages of 0.5652 (8B) and 0.5269 (4B), while switching to Sum scoring lowers the average by up to 0.06. R2G correctly focuses learning on turns that carry meaningful user-satisfaction signal, while equalized turn reward prevents any single step from dominating. Among turn-level strategies, Equalized slightly outperforms EM for both model sizes, suggesting that distributing reward equally across turns provides more stable gradient signal than exact-match-based attribution.

Finding 3

Open-source user simulators (Qwen3-32B) are a cost-effective and transferable alternative to GPT-4o. While stronger simulated users (GPT-4o) do facilitate more nuanced training signal, Qwen3-32B achieves competitive training outcomes at a fraction of the cost. Crucially, agents trained with open-source simulators transfer well to GPT-4o-evaluated benchmarks — the best trained model achieves +20% avg over Gemini-2.5-Pro (0.5652 vs 0.4702) and +27% avg over GPT-4o (0.5652 vs 0.4449), suggesting that simulator strength matters less than simulator consistency. Open-source infrastructure is viable for reproducing and scaling UserRL training, removing the proprietary dependency that has blocked community adoption of similar frameworks.

Citation

If you find UserRL useful in your research, please consider citing our work.

@article{qian2025userrl,
  title   = {UserRL: Training Interactive User-Centric Agent via Reinforcement Learning},
  author  = {Qian, Cheng and Liu, Zuxin and Prabhakar, Akshara and Qiu, Jielin and Liu, Zhiwei and Chen, Haolin and Kokane, Shirley and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal = {arXiv preprint arXiv:2509.19736},
  year    = {2025}
}