Yao Zhang

I am a final-year Ph.D. candidate in Computer Science at LMU Munich, advised by Prof. Volker Tresp. Previously, I received my M.Sc. (2021) and B.Sc. (2019) from LMU Munich.

My research focuses on agent reliability: reducing compounding failures in long-horizon tasks through structured search and process supervision. As agents take on increasingly autonomous roles, reliable multi-step execution remains one of the central challenges for real-world deployment. I develop methods that address this challenge, from single-agent execution to multi-agent system design. Earlier in my Ph.D., I also worked on multimodal learning, including parameter-efficient fine-tuning in federated and continual settings. Looking ahead, I am interested in building agents that can be trusted to operate over extended periods with minimal human oversight. Feel free to reach out if you are interested in my work or would like to connect.

I am on the job market for Research Scientist / Applied Scientist positions in agentic systems.

Research Interests: Agent Reliability, Reward Modeling, Agentic Systems, Multimodal Learning

News

Jan 2026	WebArbiter accepted at ICLR 2026.
Nov 2025	AUVIC accepted at AAAI 2026.
Oct 2025	GroundedPRM presented at LAW@NeurIPS 2025.
Aug 2025	SwarmAgentic accepted at EMNLP 2025 (Main).
Mar 2025	FedBiP accepted at CVPR 2025.
Dec 2024	WebPilot accepted at AAAI 2025.
Oct 2024	CL-CrossVQA accepted at WACV 2025.
Dec 2023	FedDAT accepted at AAAI 2024.

Selected Publications


WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents

Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, and Volker Tresp

ICLR 2026

WebArbiter is a reasoning-first, principle-inducing WebPRM that formulates process reward modeling as text generation, producing structured justifications that conclude with a preference verdict to identify the action most conducive to task completion. Trained via reasoning distillation and reinforcement learning, it achieves SOTA on WebPRMBench and delivers substantial gains in reward-guided trajectory search on WebArena-Lite.


Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 6.4 points, underscoring its robustness and practical value in complex web tasks.
@misc{zhang2026webarbiterprincipleguidedreasoningprocess,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Zhang, Yao and Tang, Shijie and Li, Zeyu and Han, Zhen and Tresp, Volker},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

Yao Zhang, Chenyang Lin, Shijie Tang, Haokun Chen, Shijie Zhou, Yunpu Ma, and Volker Tresp

EMNLP 2025 (Main)

SwarmAgentic is a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. It maintains a population of candidate systems and evolves them via feedback-guided updates inspired by Particle Swarm Optimization, enabling efficient exploration of the agentic system design space.


The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated multi-agent system generation.
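For readers unfamiliar with the optimizer that inspires the population updates, here is a minimal sketch of classical numeric Particle Swarm Optimization. It is not SwarmAgentic's language-driven procedure: the function name `pso_minimize`, the toy `sphere` objective, and the hyperparameters `w`, `c1`, and `c2` are illustrative assumptions, and particles here are plain vectors rather than candidate agentic systems.

```python
import random


def pso_minimize(objective, dim, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Classical PSO: each particle is pulled toward its own best
    position and toward the best position found by the whole swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]

    # Global best is the best personal best.
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]                               # inertia
                             + c1 * r1 * (pbest[i][d] - pos[i][d])       # personal pull
                             + c2 * r2 * (gbest[d] - pos[i][d]))         # swarm pull
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=2)
```

In the abstract's setting, the numeric velocity update would be replaced by feedback-guided textual updates to each candidate system; the sketch only illustrates the personal-best and global-best structure that the population-based search borrows.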
@misc{zhang2025swarmagenticfullyautomatedagentic,
  title={SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence},
  author={Zhang, Yao and Lin, Chenyang and Tang, Shijie and Chen, Haokun and Zhou, Shijie and Ma, Yunpu and Tresp, Volker},
  year={2025},
  eprint={2506.15672},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.15672},
}

GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, and Volker Tresp

LAW@NeurIPS 2025

GroundedPRM is a tree-guided and fidelity-aware framework for automatic process reward modeling that combines MCTS-guided path construction with tool-based step verification. Trained on only 40K automatically labeled samples, 10% of the data used by the best-performing PRM trained with auto-labeled supervision, it achieves up to a 26% relative improvement in average performance on ProcessBench.


Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
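The reward-guided greedy search mentioned above can be sketched generically as follows. This is a toy illustration under stated assumptions, not the paper's evaluation code: `propose`, `score`, and `is_final` are hypothetical callables standing in for a step generator, a process reward model, and a termination check.

```python
def greedy_search(propose, score, is_final, max_steps=10):
    """At every step, keep only the candidate the reward model scores highest."""
    steps = []
    for _ in range(max_steps):
        candidates = propose(steps)
        if not candidates:
            break
        # The (hypothetical) PRM scores each candidate given the steps so far.
        best = max(candidates, key=lambda c: score(steps, c))
        steps.append(best)
        if is_final(steps):
            break
    return steps


# Toy task: reach a running total of 10 using increments of 1, 3, or 5.
TARGET = 10


def propose(steps):
    return [1, 3, 5]


def score(steps, candidate):
    # Stand-in reward: the closer the new total is to the target, the better.
    return -abs(TARGET - (sum(steps) + candidate))


def is_final(steps):
    return sum(steps) >= TARGET


path = greedy_search(propose, score, is_final)
```

Because greedy search commits to one step at a time, the quality of the final path depends directly on how well the step-level scores track correctness, which is why noisy or hallucinated rewards are costly in this setting.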
@misc{zhang2025groundedprmtreeguidedfidelityawareprocess,
  title={GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning},
  author={Zhang, Yao and Wu, Yu and Zhang, Haowei and Li, Weiguo and Chen, Haokun and Wu, Jingpei and Li, Guohao and Han, Zhen and Tresp, Volker},
  year={2025},
  eprint={2510.14942},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.14942},
}

WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration

Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, and Volker Tresp

AAAI 2025

WebPilot is a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. It uses Global Optimization for high-level planning and Local Optimization for executing subtasks, achieving SOTA performance on WebArena with GPT-4 and a 93% relative increase in success rate over the concurrent tree search-based method.


LLM-based autonomous agents often fail to execute complex web tasks that require dynamic interaction due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, which lack the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks and continuously refining this plan, thereby focusing the search process and mitigating the challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, achieving a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot marks a significant advancement in general autonomous agent capabilities, paving the way for more advanced and reliable decision-making in practical environments.
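As background for the classical MCTS that WebPilot builds on, the standard UCT selection rule can be sketched as below. This shows only the textbook rule, not WebPilot's tailored variant; the dictionary-based node layout and the name `uct_select` are illustrative assumptions.

```python
import math


def uct_select(children, c=math.sqrt(2)):
    """Pick the child maximizing mean value plus an exploration bonus (UCB1)."""
    total_visits = sum(ch["visits"] for ch in children)

    def uct(ch):
        if ch["visits"] == 0:
            return math.inf  # always try unvisited children first
        mean = ch["value"] / ch["visits"]
        bonus = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return mean + bonus

    return max(children, key=uct)


children = [
    {"name": "a", "visits": 10, "value": 7.0},  # well explored, mean 0.7
    {"name": "b", "visits": 2, "value": 1.0},   # rarely explored, mean 0.5
]

explore_choice = uct_select(children)["name"]       # bonus favors the rarely visited child
exploit_choice = uct_select(children, c=0.0)["name"]  # pure exploitation picks the best mean
unvisited_choice = uct_select(
    children + [{"name": "c", "visits": 0, "value": 0.0}]
)["name"]
```

The bonus term shrinks as a child accumulates visits, so the rule assumes each action can be sampled many times; with the vast action spaces and unpredictable transitions of web tasks that assumption is strained, which is the limitation the abstract attributes to classical MCTS.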
@misc{zhang2024webpilotversatileautonomousmultiagent,
  title={WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration},
  author={Zhang, Yao and Ma, Zijian and Ma, Yunpu and Han, Zhen and Wu, Yu and Tresp, Volker},
  year={2025},
  eprint={2408.15978},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2408.15978},
}

Academic Service

Conference Reviewer: ARR, NeurIPS, CVPR, AAAI, BMVC
Teaching Assistant: Bachelor Seminar on Generative AI; Master Seminar on Knowledge Graphs, LMU Munich