A unified framework for training and evaluating user-centric LLM agents through standardized gym environments, simulated users, and carefully designed RL rewards.
¹University of Illinois Urbana-Champaign · ²Salesforce AI Research
Reinforcement learning has shown promise in training agentic models for dynamic, multi-turn interactions, but most existing frameworks optimize for task completion rather than user satisfaction. The diversity and dynamics of real user interactions pose fundamentally different challenges.
UserRL provides a unified framework for training and evaluating user-centric agents through standardized gym environments paired with simulated users. By systematically varying turn-level reward assignment and trajectory-level score calculation, UserRL reveals how different reward formulations affect learning under GRPO, establishing a practical pathway for building robust, user-serving agentic models.
Standardized multi-turn interaction environments built on UserBench, covering diverse user-assistance tasks with reproducible evaluation protocols.
LLM-based user proxies (GPT-4o or Qwen3-32B) that generate realistic requests, respond to agent clarifications, and provide feedback signals for reward computation.
Systematic variation of turn-level reward assignment and trajectory-level score calculation, analyzed for impact on GRPO learning stability and final performance.
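The interaction loop that ties these components together can be sketched as follows. This is an illustrative sketch, not the UserRL API: the `agent` and `sim_user` objects and their method names are assumptions, standing in for the policy model and the LLM-based user proxy that returns per-turn feedback.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    agent_msg: str
    user_msg: str
    feedback: float  # user-side signal for this turn, used later for rewards

@dataclass
class Trajectory:
    turns: list = field(default_factory=list)

def rollout(agent, sim_user, max_turns=8):
    """Roll the agent out against a simulated user, collecting
    per-turn feedback for later reward computation."""
    traj = Trajectory()
    user_msg = sim_user.open()  # simulated user states an initial request
    for _ in range(max_turns):
        agent_msg = agent.respond(user_msg)
        user_msg, feedback, done = sim_user.react(agent_msg)
        traj.turns.append(Turn(agent_msg, user_msg, feedback))
        if done:
            break
    return traj
```

In the actual framework the simulated user is itself an LLM (GPT-4o or Qwen3-32B); the sketch only fixes the shape of the data that flows into reward computation.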
UserRL builds upon UserBench, a comprehensive benchmark for evaluating user-centric agent capabilities across 15 task types and multiple interaction modes.
User-centric tasks differ fundamentally from task-completion benchmarks: success is not a binary flag at the end of a trajectory. It accumulates across turns, depends on user-side signals, and is sensitive to how the agent communicates, not just what it ultimately produces. UserRL systematically studies two reward design dimensions:
Turn-level reward assignment: how to assign rewards within a trajectory (uniform distribution, last-turn only, or feedback-weighted). Controls how the agent attributes value to individual interaction steps.
Trajectory-level score calculation: how to aggregate interaction quality into a scalar training signal (average, minimum, or deliberate scoring). Deliberate scoring emphasizes the most informative conversational turns.
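The two dimensions above can be made concrete with a small sketch. The function names, the discount factor, and the exact form of the deliberate "reward-to-go" (R2G) score are illustrative assumptions; the source describes the schemes only at this level of detail.

```python
def assign_turn_rewards(feedback, scheme="uniform"):
    """Distribute trajectory credit over turns.
    feedback: list of per-turn user feedback scores."""
    total, n = sum(feedback), len(feedback)
    if scheme == "uniform":            # equal credit on every turn
        return [total / n] * n
    if scheme == "last_turn":          # all credit on the final turn
        return [0.0] * (n - 1) + [total]
    if scheme == "feedback_weighted":  # credit proportional to feedback
        return list(feedback)
    raise ValueError(scheme)

def trajectory_score(feedback, scheme="r2g"):
    """Collapse per-turn feedback into one scalar training signal."""
    if scheme == "avg":
        return sum(feedback) / len(feedback)
    if scheme == "min":
        return min(feedback)
    if scheme == "r2g":                # assumed form: discounted reward-to-go
        gamma, score = 0.9, 0.0
        for r in reversed(feedback):
            score = r + gamma * score
        return score
    raise ValueError(scheme)
```

The point of the sketch is the separation of concerns: the first function shapes per-step credit assignment inside a trajectory, the second produces the scalar that the policy-optimization step compares across trajectories.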
UserRL applies Group Relative Policy Optimization (GRPO) on trajectories generated by rolling out the agent against simulated users. The simulated user serves as both an interaction partner and a reward-signal generator, evaluating agent responses turn by turn. Two simulator options are studied: GPT-4o (strong, closed-source) and Qwen3-32B (open-source, cost-effective). Both produce transferable training signal, but Qwen3-32B offers a practical pathway to cost-effective and reproducible training.
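The quantity those trajectory scores feed into is GRPO's group-relative advantage, which can be sketched in a few lines. This is the standard GRPO normalization, not UserRL's exact code:

```python
def group_relative_advantages(scores, eps=1e-8):
    """scores: trajectory-level scores for G rollouts of the same prompt.
    Returns one advantage per rollout: (s - group mean) / (group std + eps)."""
    g = len(scores)
    mean = sum(scores) / g
    std = (sum((s - mean) ** 2 for s in scores) / g) ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

Note that if every rollout in a group receives the same score (e.g. all trajectories fail in the same way), the advantages collapse to zero and provide no gradient signal; this is the degenerate regime that the SFT cold-start discussion below refers to.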
A critical finding of UserRL is that SFT pre-training is not optional for multi-turn interaction RL. Without an SFT cold start, models lack the basic dialogue competence necessary to generate meaningful multi-turn trajectories, making GRPO's group-relative advantage computation degenerate. SFT provides the initial interaction ability that RL then refines and extends far beyond what SFT alone achieves.
We evaluate all models across 8 UserBench Gym environments, 4 held-in (TravelGym, TurtleGym, FunctionGym, TauGym) and 4 held-out (PersuadeGym, IntentionGym, TelepathyGym, SearchGym), to test both in-distribution performance and generalization. The notation (TurnReward / TrajScore) denotes the reward formulation: a turn-level assignment strategy (Equalized, EM, R2G) paired with a trajectory-level scoring scheme (R2G or Sum).
| Model | TravelGym | TurtleGym | FunctionGym | TauGym | PersuadeGym | IntentionGym | TelepathyGym | SearchGym | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Open-Source – Trained (Qwen3-8B) | |||||||||
| Qwen3-8B (Equalized/R2G) | 0.5730 | 0.1854 | 0.4231 | 0.1818 | 0.5317 | 1.8175 | 0.5610 | 0.8880 | 0.5652 |
| Qwen3-8B (EM/R2G) | 0.5025 | 0.1917 | 0.4103 | 0.2000 | 0.5397 | 1.9025 | 0.5366 | 0.8640 | 0.5343 |
| Qwen3-8B (R2G/R2G) | 0.5724 | 0.1615 | 0.4231 | 0.1394 | 0.5238 | 1.8525 | 0.5854 | 0.8480 | 0.5539 |
| Qwen3-8B (Equalized/Sum) | 0.5054 | 0.1323 | 0.2692 | 0.2121 | 0.5040 | 1.6275 | 0.5366 | 0.8320 | 0.5076 |
| Open-Source – Trained (Qwen3-4B) | |||||||||
| Qwen3-4B (Equalized/R2G) | 0.5086 | 0.1844 | 0.3333 | 0.2000 | 0.4643 | 1.8075 | 0.6098 | 0.8640 | 0.5269 |
| Qwen3-4B (EM/R2G) | 0.5076 | 0.1417 | 0.3333 | 0.1576 | 0.4563 | 1.7375 | 0.6341 | 0.8640 | 0.5154 |
| Qwen3-4B (R2G/R2G) | 0.4629 | 0.1687 | 0.3974 | 0.1333 | 0.5794 | 1.5975 | 0.4634 | 0.8640 | 0.4895 |
| Qwen3-4B (Equalized/Sum) | 0.4456 | 0.1615 | 0.2308 | 0.1333 | 0.4524 | 1.7150 | 0.4878 | 0.8400 | 0.4656 |
| Open-Source – Raw (No Training) | |||||||||
| Qwen3-32B (Raw) | 0.1724 | 0.1510 | 0.1538 | 0.0000 | 0.4841 | 1.8300 | 0.5610 | 0.7920 | 0.3128 |
| Qwen3-14B (Raw) | 0.1924 | 0.1417 | 0.1667 | 0.1030 | 0.5317 | 1.7000 | 0.5854 | 0.5120 | 0.3027 |
| Qwen3-4B (Raw) | 0.1405 | 0.0854 | 0.0769 | 0.0364 | 0.4048 | 1.7400 | 0.4878 | 0.8560 | 0.2929 |
| Closed-Source Baselines | |||||||||
| Gemini-2.5-Pro | 0.3468 | 0.2740 | 0.4103 | 0.1939 | 0.4246 | 1.5900 | 0.9024 | 0.9280 | 0.4702 |
| Gemini-2.5-Flash | 0.2553 | 0.1958 | 0.3205 | 0.1212 | 0.4087 | 1.6850 | 0.6341 | 0.9280 | 0.3973 |
| GPT-4o | 0.3643 | 0.2917 | 0.2821 | 0.0303 | 0.3770 | 1.8975 | 0.8537 | 0.8800 | 0.4449 |
| GPT-4o-mini | 0.0976 | 0.0906 | 0.1538 | 0.2061 | 0.5317 | 0.2500 | 0.0488 | 0.3520 | 0.1729 |
The results reveal a striking pattern: a trained Qwen3-8B with the Equalized/R2G reward (avg. 0.5652) outperforms every closed-source model, beating Gemini-2.5-Pro (0.4702) by a relative +20% and GPT-4o (0.4449) by +27%. Even the smaller trained Qwen3-4B (0.5269) surpasses all closed-source baselines. Raw models without training remain far behind their trained counterparts (roughly 0.31 vs 0.56 avg), confirming that user-centric interaction ability does not emerge naturally at scale; it must be trained.
SFT cold start is critical for unlocking interaction ability and enabling sustained RL improvements. Without an SFT warm-up, pure RL training stagnates: models cannot form coherent multi-turn dialogues, so GRPO receives degenerate training signal. SFT provides the dialogue foundation that RL then exploits to push far beyond the SFT performance ceiling. Concretely, the best trained Qwen3-8B (avg. 0.5652) vastly outperforms the 4× larger raw Qwen3-32B (avg. 0.3128), a gap attributable to user-centric RL training. This dependency is sharper than in single-turn tool-use tasks, reflecting the complexity of sustained user interaction.
Deliberate trajectory scoring (R2G) yields the most efficient and effective multi-turn interactions. R2G trajectory scoring consistently outperforms Sum at both the Qwen3-8B and Qwen3-4B model sizes. The Equalized/R2G configuration achieves the best averages of 0.5652 (8B) and 0.5269 (4B), while switching to Sum scoring drops the average by up to 0.06. R2G correctly focuses learning on turns that carry meaningful user-satisfaction signal, while equalized turn reward prevents any single step from dominating. Among turn-level strategies, Equalized slightly outperforms EM for both model sizes, suggesting that distributing reward equally across turns provides a more stable gradient signal than exact-match-based attribution.
Open-source user simulators (Qwen3-32B) are a cost-effective and transferable alternative to GPT-4o. While stronger simulated users (GPT-4o) do facilitate more nuanced training signal, Qwen3-32B achieves competitive training outcomes at a fraction of the cost. Crucially, agents trained with open-source simulators transfer well to GPT-4o-evaluated benchmarks: the best trained model achieves +20% avg over Gemini-2.5-Pro (0.5652 vs 0.4702) and +27% avg over GPT-4o (0.5652 vs 0.4449), suggesting that simulator strength matters less than simulator consistency. Open-source infrastructure is viable for reproducing and scaling UserRL training, removing the proprietary dependency that has blocked community adoption of similar frameworks.
If you find UserRL useful in your research, please consider citing our work.
@article{qian2025userrl,
title = {UserRL: Training Interactive User-Centric Agent via Reinforcement Learning},
author = {Qian, Cheng and Liu, Zuxin and Prabhakar, Akshara and Qiu, Jielin and Liu, Zhiwei and Chen, Haolin and Kokane, Shirley and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
journal = {arXiv preprint arXiv:2509.19736},
year = {2025}
}