WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents

ICLR 2026

1LMU Munich  2Technical University of Munich  3Munich Center for Machine Learning (MCML)
Code: Apache 2.0 Model: Apache 2.0 Data: CC BY 4.0

A reasoning-first, principle-inducing WebPRM with structured justifications for step-level web agent supervision.

Overview of WebArbiter. Stage 1 distills principle-guided reasoning from a teacher LLM. Stage 2 applies RL with verifiable rewards. At inference, the model induces principles, applies them to candidates, and outputs a preference verdict.

Why Process Reward Models for Web Agents?

Web interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited:

Scalar WebPRM

Collapses progress into coarse scores with little interpretability and weak grounding.

Checklist-based WebPRM

Relies on checklists that are brittle under dynamic layouts and state-dependent action semantics, often mislabeling superficially correct actions as successful.

LLM-as-Judge

Costly and hard to scale, susceptible to hallucination, and prone to rewarding fluent but incorrect actions.

WebArbiter: a reasoning-first, principle-inducing WebPRM that produces structured justifications concluding with a preference verdict identifying the action most conducive to task completion.

Abstract

Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WEBPRMBENCH, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WEBPRMBENCH, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 6.4 points, underscoring its robustness and practical value in complex web tasks.

Key Contributions

🧠 Reasoning-First WebPRM

We propose WebArbiter, a reasoning-first, principle-inducing PRM trained with reasoning distillation and RL, providing auditable reasoning chains and correctness-aligned signals.

📊 WebPRMBench

We release WebPRMBench, the first comprehensive benchmark for systematic WebPRM evaluation, spanning 4 web environments and using Pairwise and Best-of-N (BoN) Accuracy as standard metrics.
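As a rough illustration, the two metrics can be computed as follows. This is a minimal sketch; the function names and scoring interface are our own assumptions, not the benchmark's released code.

```python
def pairwise_accuracy(judgments):
    """Fraction of (chosen, rejected) pairs where the PRM prefers the
    annotated-correct action. `judgments` is a list of booleans: True if
    the PRM preferred the correct action over a rejected alternative."""
    return sum(judgments) / len(judgments)

def best_of_n_accuracy(instances):
    """Fraction of instances where the PRM's top-ranked candidate is the
    correct action. Each instance is (scores, correct_index), with one
    score per candidate action."""
    hits = 0
    for scores, correct_index in instances:
        top = max(range(len(scores)), key=scores.__getitem__)
        hits += int(top == correct_index)
    return hits / len(instances)
```

BoN Accuracy is the stricter metric: the correct action must beat all four distractors at once, not just one of them in isolation.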

🏆 State-of-the-Art Performance

WebArbiter achieves SOTA on WebPRMBench, surpassing both proprietary LLMs and prior WebPRMs, and delivers up to 6.4-point gains in reward-guided trajectory search on WebArena-Lite.

🔍 Training Design Insights

We analyze the effects of different training components through systematic ablations, showing that cold-start RL alone is unstable across environments, whereas reasoning distillation and explicit principles are essential for stable and transferable progress-aware judgments, with RL primarily acting as an amplifier.

Method Overview

1 Principle Induction

Dynamically derive evaluation criteria from user intent and current state

2 Structured Reasoning

Ground each candidate action against principles with auditable reasoning chains

3 Preference Verdict

Conclude with a verdict identifying the action most conducive to task completion
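To make the three steps concrete, here is a hypothetical sketch of what one structured judgment might look like as data. The field names and contents are illustrative assumptions, not the model's actual output schema:

```python
# Hypothetical structured judgment: induced principles, per-candidate
# grounding, and a final preference verdict (all contents illustrative).
example_judgment = {
    "principles": [
        "The action must advance the stated user intent.",
        "Prefer actions that are valid in the current page state.",
    ],
    "reasoning": {
        "action_A": "Opens the settings dialog the task requires.",
        "action_B": "Navigates away from the relevant page.",
    },
    "verdict": "action_A",
}
```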

Two-Stage Training

1 Reasoning Distillation

Distill principle-guided reasoning from a stronger teacher, promoting judgments grounded in explicit principles rather than surface heuristics. Trained via negative log-likelihood on teacher-generated justifications.
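The stage-1 objective is standard token-level negative log-likelihood over the teacher's justification and verdict. A minimal sketch, assuming the student's per-token probabilities have already been computed (the function name and interface are ours):

```python
import math

def distillation_nll(token_probs):
    """Mean negative log-likelihood of a teacher-written justification
    under the student. `token_probs` holds the student's probability for
    each token of the teacher's justification + verdict; stage-1 training
    minimizes this quantity."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)
```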

2 Reinforcement Learning

Maximize expected reward with KL regularization via GRPO under binary verifiable rewards R ∈ {−1, +1}, correcting teacher biases and enabling cross-environment generalization.
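A minimal sketch of GRPO's group-relative advantage computation under the binary verifiable reward R ∈ {−1, +1}. The full objective also includes the KL regularizer and the clipped policy-gradient terms, which are omitted here, and normalization details may differ from the paper's implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt: each sampled judgment's
    verifiable reward (+1 if the verdict matches the annotated preference,
    -1 otherwise) is normalized by the group's mean and std. A zero std
    (all rewards equal) falls back to 1.0, yielding zero advantages."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```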

Reward-Guided Trajectory Search

At each decision step, the policy samples 5 candidate actions. WebArbiter runs a knockout tournament—pairwise comparisons grounded in dynamically induced principles—to select the action most conducive to task completion.
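The tournament reduces the candidate pool to a single winner through repeated pairwise verdicts. A minimal sketch, where `prefer` stands in for a WebArbiter pairwise call; the function names and the bye handling for odd pool sizes are our assumptions:

```python
def knockout_tournament(candidates, prefer):
    """Select one action by successive pairwise comparisons.

    `prefer(a, b)` is the PRM's pairwise verdict: it returns the preferred
    action of the pair. With 5 candidates, one candidate gets a bye into
    the next round each time the pool size is odd."""
    pool = list(candidates)
    while len(pool) > 1:
        nxt = [prefer(pool[i], pool[i + 1])
               for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:            # odd pool: last candidate advances on a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]
```

With 5 candidates this costs 4 pairwise calls, versus 10 for a full round-robin.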

1 Sample Candidates
2 Knockout Tournament
3 Principle-Guided Reasoning
4 Execute Winner
Interactive demo (task: "Set my gitlab status as Enjoying life", 5 steps, success): at each step the agent proposes 5 candidate actions, WebArbiter runs the knockout bracket under induced principles, and the winning action (here, click [106] at step 1) is executed.

5 example trajectories across 5 WebArena-Lite websites · All 72 successful trajectories on HuggingFace COMING SOON

Main Results

On WebPRMBench, WebArbiter-7B achieves the highest Avg. BoN Accuracy, outperforming GPT-5 by 9.1 points and surpassing WebShepherd-8B by an absolute gain of 31 points.

Left: Average Best-of-N Acc vs. model size, showing superior efficiency despite smaller scale. Right: Domain-wise Avg BoN Acc, where WebArbiter achieves the best results across all environments.
| Model | Mind2Web (Pair. / BoN) | WebArena (Pair. / BoN) | AssistantBench (Pair. / BoN) | WorkArena (Pair. / BoN) | Average (Pair. / BoN) |
|---|---|---|---|---|---|
| LLM-as-Judge, Proprietary | | | | | |
| GPT-4o-mini | 81.74 / 50.92 | 78.23 / 56.72 | 89.17 / 73.33 | 81.43 / 46.70 | 82.64 / 56.92 |
| GPT-4o | 79.99 / 52.62 | 84.58 / 66.67 | 85.83 / 66.67 | 84.33 / 55.19 | 83.68 / 60.29 |
| GPT-5 | 80.86 / 62.39 | 84.83 / 71.64 | 81.67 / 63.33 | 81.14 / 64.62 | 82.13 / 65.50 |
| Claude-3.7-Sonnet | 80.20 / 57.90 | 82.80 / 64.10 | 81.50 / 61.30 | 82.10 / 60.60 | 81.65 / 60.98 |
| Gemini-2.5-Flash | 81.30 / 57.01 | 82.71 / 62.19 | 80.00 / 63.33 | 83.30 / 56.13 | 81.83 / 59.67 |
| DeepSeek-R1 | 81.62 / 57.37 | 82.04 / 60.21 | 78.49 / 56.18 | 84.12 / 63.89 | 81.57 / 59.41 |
| LLM-as-Judge, Open-Source | | | | | |
| Qwen2.5-3B-Instruct | 76.46 / 36.93 | 60.32 / 15.42 | 75.83 / 33.33 | 64.45 / 19.34 | 69.27 / 26.76 |
| Qwen2.5-7B-Instruct | 77.79 / 39.18 | 74.88 / 42.79 | 84.17 / 53.33 | 77.58 / 35.85 | 77.61 / 42.78 |
| Llama-3-70B-Instruct | 80.55 / 49.36 | 77.36 / 50.75 | 85.83 / 70.00 | 79.08 / 40.09 | 80.71 / 52.55 |
| WebPRMs (3B) | | | | | |
| WebShepherd-3B | 87.50 / 65.21 | 68.16 / 41.29 | 66.67 / 46.67 | 50.00 / 21.23 | 68.08 / 43.60 |
| WebArbiter-3B | 93.32 / 78.42 | 81.97 / 56.22 | 78.33 / 46.67 | 81.01 / 54.81 | 83.65 / 59.06 |
| WebPRMs (7B+) | | | | | |
| WebShepherd-8B | 86.66 / 73.69 | 68.33 / 43.88 | 55.92 / 30.00 | 54.56 / 25.53 | 64.34 / 43.28 |
| WebArbiter-7B | 97.07 / 89.53 | 88.43 / 68.66 | 89.17 / 70.00 | 82.09 / 70.19 | 89.19 / 74.60 |
Table 1. Results on WebPRMBench with Pairwise and BoN Accuracy. Bold: best; underline: second best.

We evaluate WebArbiter in reward-guided trajectory search on WebArena-Lite, using Best-of-N sampling with a Knockout Tournament mechanism. WebArbiter surpasses WebShepherd by up to 6.4 points, further demonstrating robustness in realistic interaction settings.

| Policy | WebPRM | Shopping | CMS | Reddit | GitLab | MAP | Avg. | Δ |
|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | w/o Trajectory Search* | 21.74 | 22.86 | 19.05 | 34.38 | 19.35 | 23.48 | |
| GPT-4o-mini | GPT-4o-mini | 24.44 | 22.86 | 26.32 | 33.33 | 15.38 | 24.47 | +0.99 |
| GPT-4o-mini | WebShepherd-8B* | 26.09 | 45.71 | 23.81 | 40.62 | 35.48 | 34.34 | +10.86 |
| GPT-4o-mini | WebArbiter-7B | 37.78 | 42.86 | 36.84 | 46.67 | 38.46 | 40.52 | +17.04 |
| GPT-4o | w/o Trajectory Search* | 23.91 | 31.43 | 28.57 | 56.25 | 19.35 | 31.90 | |
| GPT-4o | GPT-4o-mini | 26.67 | 37.14 | 42.11 | 40.00 | 19.23 | 33.03 | +1.13 |
| GPT-4o | WebShepherd-8B* | 30.43 | 42.86 | 47.62 | 46.88 | 35.48 | 40.65 | +8.75 |
| GPT-4o | WebArbiter-7B | 44.44 | 42.86 | 52.63 | 56.67 | 38.46 | 47.01 | +15.11 |
Table 2. Success rates (%) of trajectory search with GPT-4o-mini and GPT-4o as policy on WebArena-Lite. * Results reported from WebShepherd. Δ is relative to the w/o Trajectory Search baseline. WebArbiter consistently achieves the highest gains across both policy models.

Training Design Insights

We compare four training variants to disentangle the effects of RL, principle guidance, and justification style (Table 3).

Cold-Start RL

Instruct + Cold Start RL performs well on in-domain Mind2Web but collapses on out-of-domain benchmarks. Reward optimization without reasoning distillation struggles in noisy and complex environments.

RL + Principle Prompting

Instruct + Cold Start RL + Principles improves both average Pairwise and BoN Acc, especially on AssistantBench and WorkArena, where tasks need context- and state-dependent judgments beyond surface layout cues. Principle-guided reasoning provides transferable criteria for true task progress.

Reasoning w/o Principles + RL

Instruct + SFT w/o Principles + RL uses narrative-style justifications only; fluency improves, but performance consistently lags the principle-aware settings. Without explicit principles, the model tends to rationalize actions post hoc from surface plausibility and spurious cues.

| Method | Mind2Web (Pair. / BoN) | WebArena (Pair. / BoN) | AssistantBench (Pair. / BoN) | WorkArena (Pair. / BoN) | Average (Pair. / BoN) |
|---|---|---|---|---|---|
| Instruct (Original) | 77.79 / 39.18 | 74.88 / 42.79 | 84.17 / 53.33 | 77.58 / 35.85 | 77.61 / 42.78 |
| Instruct + Cold Start RL | 96.18 / 86.00 | 71.10 / 35.80 | 72.40 / 33.60 | 74.90 / 37.90 | 78.15 / 48.33 |
| Instruct + Cold Start RL + Principles | 96.18 / 88.00 | 77.80 / 46.30 | 80.10 / 48.90 | 82.40 / 51.80 | 84.12 / 58.75 |
| Instruct + SFT w/o Principles + RL | 98.48 / 94.34 | 74.60 / 41.50 | 77.20 / 40.20 | 79.10 / 44.60 | 82.35 / 55.16 |
| WebArbiter-7B | 97.07 / 89.53 | 88.43 / 68.66 | 89.17 / 70.00 | 82.09 / 70.19 | 89.19 / 74.60 |
Table 3. Ablation results on WebPRMBench (Qwen2.5-7B-Instruct backbone). WebArbiter, combining principle-guided reasoning distillation with RL, achieves the highest overall performance.

Reasoning Supervision Analysis

We analyze the role of reasoning supervision by comparing answer-only SFT, distilled reasoning, and RL under both full-data and limited-data (10K) settings (Table 4).

| Method | Mind2Web (Pair. / BoN) | WebArena (Pair. / BoN) | AssistantBench (Pair. / BoN) | WorkArena (Pair. / BoN) | Average (Pair. / BoN) |
|---|---|---|---|---|---|
| Train on Full Data | | | | | |
| Instruct + SFT | 85.14 / 60.91 | 80.85 / 52.73 | 82.50 / 56.67 | 79.57 / 52.88 | 82.02 / 55.80 |
| Instruct + Distilled + SFT | 87.42 / 61.18 | 81.59 / 52.73 | 83.33 / 63.33 | 81.13 / 56.73 | 83.37 / 58.49 |
| WebArbiter-7B | 97.07 / 89.53 | 88.43 / 68.66 | 89.17 / 70.00 | 82.09 / 70.19 | 89.19 / 74.60 |
| Train on 10K (Stage-1 Reasoning Distillation) Data | | | | | |
| Instruct + SFT | 84.53 / 60.82 | 82.21 / 58.71 | 82.50 / 56.67 | 80.58 / 39.62 | 82.46 / 53.96 |
| Instruct + Distilled | 85.20 / 63.40 | 83.10 / 61.80 | 83.00 / 60.20 | 81.40 / 55.60 | 83.18 / 60.25 |
Table 4. Results under full-data and limited-data (10K) training regimes. Reasoning distillation improves over answer-only SFT, while WebArbiter (reasoning distillation + RL) achieves the best overall performance.

Distillation + RL as Amplifier

Reasoning supervision yields more reliable judgments, especially under BoN Acc (multi-candidate settings). With full data, answer-only SFT after distillation gives environment-dependent gains, because final-answer optimization can reintroduce shortcuts; distillation still grounds judgments in true task progress. RL then enlarges the margin between progress-making and spurious trajectories.

Especially Effective Under Limited Data

Under the 10K (stage-1 distillation) setting, Instruct + Distilled beats Instruct + SFT on both Pairwise and BoN Acc in every environment (e.g., +6.29 Avg. BoN Acc). Since the data budgets are identical, the gains come from biasing the model toward progress-aware reward judgments rather than from scale.

Inference-Time Scaling

As the number of sampled evaluations K increases, both Pairwise and BoN Accuracy improve consistently. Gains are more pronounced under the stricter BoN Acc, highlighting the advantage of additional inference-time compute in multi-distractor ranking.
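One plausible way to aggregate K sampled evaluations is a majority vote over the verdicts. The sketch below is an assumption for illustration; the paper's exact aggregation scheme is not specified here:

```python
from collections import Counter

def aggregate_verdicts(verdicts):
    """Majority vote over K sampled preference verdicts (one possible
    aggregation; ties are broken by first occurrence among the samples)."""
    counts = Counter(verdicts)
    best = max(counts.values())
    for v in verdicts:
        if counts[v] == best:
            return v
```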

Figure 5. Inference-time scaling of WebArbiter. Left: Pairwise and Right: BoN Acc as the number of sampled reward evaluations K increases.

Case Study

We compare WebArbiter with WebShepherd on real trajectory search examples from WebArena-Lite. WebArbiter's principle-guided reasoning correctly identifies the preferred action, while checklist-based methods are misled by surface-level cues.

Case study: WebArbiter principle-guided reasoning vs WebShepherd checklist approach


Failure Cases

We examine two recurring failure patterns on GitLab, revealing open challenges for text-based WebPRMs that rely on accessibility-tree observations.

Failure case: Safe-action bias in reward-guided trajectory search on GitLab


WebPRMBench

We introduce WebPRMBench, the first comprehensive benchmark for evaluating WebPRMs. It provides 1,150 step-level preference instances, each pairing one correct action with four rejected alternatives, collected across 4 web environments.

1,150 preference instances · 4 web environments · 1 correct + 4 rejected actions per instance · 2 metrics (Pair. & BoN)

Mind2Web: 707 (61.5%) · WebArena: 201 (17.5%) · AssistantBench: 30 (2.6%) · WorkArena: 212 (18.4%)

BibTeX

@misc{zhang2026webarbiterprincipleguidedreasoningprocess,
      title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
      author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
      year={2026},
      eprint={2601.21872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21872},
}