
WebJudge


WebJudge preserves critical intermediate screenshots while mitigating the token overload issue, resulting in more accurate and reliable evaluations. Please check our paper for more details.
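At a high level, WebJudge first identifies key points for the task, then scores the agent's intermediate screenshots against them and keeps only the most informative ones for the final judgment. A minimal sketch of that flow, with hypothetical `extract_key_points`, `score_screenshot`, and `judge` helpers standing in for the actual prompt calls (see the paper and repository for the real prompts and thresholds):

# Sketch of the WebJudge flow described above; helper names are hypothetical,
# and the real prompts, scoring, and thresholds are defined in the paper/repository.
def evaluate_trajectory(task: str, screenshots: list, k: int = 5) -> str:
    key_points = extract_key_points(task)                           # 1. key point identification
    scored = [(score_screenshot(s, key_points), s) for s in screenshots]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    key_shots = [s for _, s in scored[:k]]                          # 2. keep only critical screenshots
    return judge(task, key_points, key_shots)                       # 3. success/failure judgment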

Results

Comparison against Existing Evaluation Methods on Online-Mind2Web

| Model | Auto-Eval | SeeAct | Agent-E | Browser Use | Claude 3.5 | Claude 3.7 | Operator | Avg AR |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | Autonomous Eval | 84.7 | 85.0 | 76.0 | 83.7 | 75.5 | 71.7 | 79.4 |
| GPT-4o | AgentTrek Eval | 73.0 | 64.3 | 63.3 | -- | -- | -- | 66.9 |
| GPT-4o | WebVoyager | -- | 75.3 | 71.3 | 74.0 | 72.0 | 76.7 | 73.9 |
| GPT-4o | WebJudge | 86.7 | 86.0 | 81.4 | 86.3 | 79.1 | 81.8 | 83.6 |
| o4-mini | Autonomous Eval | 79.7 | 85.7 | 86.0 | 84.3 | 68.0 | 73.3 | 79.5 |
| o4-mini | WebVoyager | -- | 80.3 | 79.0 | 81.7 | 74.3 | 78.3 | 78.7 |
| o4-mini | WebJudge | 85.3 | 86.3 | 89.3 | 87.0 | 82.3 | 83.7 | 85.7 |
| WebJudge-7B | WebJudge | 86.0 | 87.3 | 88.3 | 89.7 | 84.3 | 86.3 | 87.0 |
WebJudge powered by GPT-4o and o4-mini consistently achieves the highest agreement with human judgment, averaging 83.6% and 85.7%, respectively. Meanwhile, WebJudge-7B even outperforms the o4-mini-based variant, reaching 87.0% agreement.

Excellent generalization capabilities on AgentRewardBench (5 OOD benchmarks)

| Methods | AB | VWA | WA | Work | Wk++ | Overall |
|---|---|---|---|---|---|---|
| Rule-based* | 25.0 | 85.2 | 79.0 | 100.0 | 83.3 | 83.8 |
| Autonomous Eval* | 83.3 | 61.2 | 67.6 | 96.4 | 59.3 | 67.6 |
| GPT-4o (A11y Tree)* | 77.8 | 63.0 | 70.2 | 94.6 | 63.0 | 69.8 |
| WebJudge (GPT-4o) | 66.7 | 69.8 | 72.6 | 92.3 | 75.0 | 73.7 |
| WebJudge-7B | 80.0 | 66.7 | 77.5 | 100.0 | 70.0 | 75.7 |
| WebJudge (o4-mini) | 100.0 | 74.5 | 81.2 | 100.0 | 90.0 | 82.0 |

WebJudge significantly outperforms existing methods, achieving an impressive overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) across 1,302 trajectories from AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++).

The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.
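For instance, WebJudge could serve as a trajectory filter for Rejection Sampling Fine-Tuning. A hedged sketch, assuming a hypothetical `webjudge_verdict` wrapper around the inference call described below:

# Hypothetical use of WebJudge as a reward signal for rejection sampling:
# keep only trajectories judged successful and reuse them as fine-tuning data.
def filter_trajectories(trajectories):
    accepted = []
    for traj in trajectories:
        if webjudge_verdict(traj.task, traj.screenshots) == "success":
            accepted.append(traj)
    return accepted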

Inference

vLLM server

vllm serve osunlp/WebJudge-7B --port PORT --api-key API_KEY

or

LLaMA-Factory API

API_PORT=PORT llamafactory-cli api examples/inference/qwen2_vl.yaml
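Either command exposes an OpenAI-compatible endpoint. A quick connectivity check from Python (assuming a local server and the same PORT and API_KEY placeholders as above):

# List the served models to confirm the endpoint is reachable.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:PORT/v1", api_key="API_KEY")
print([m.id for m in client.models.list().data])  # should include osunlp/WebJudge-7B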

Prompt

Please check our Repository and Paper for more details about the prompt.

text = """**Task**: {task}

**Key Points for Task Completion**: {key_points}

The snapshot of the web page is shown in the image."""

messages = [
                {"role": "system", "content": system_msg},
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": text},
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{jpg_base64_image}", "detail": "high"},
                        },
                    ],
                }
            ]
completion = client.chat.completions.create(
    model=model_path,
    messages=messages,
    temperature=0
)
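The response format is defined by the system prompt in the Repository. Purely as an illustration, if the judgment were returned on a line such as `Status: success` or `Status: failure` (an assumed format, not necessarily the official one), it could be parsed like this:

# Hypothetical post-processing; adjust to the actual answer format required by the prompt.
response_text = completion.choices[0].message.content
status_lines = [line for line in response_text.splitlines() if line.lower().startswith("status")]
is_success = bool(status_lines) and "success" in status_lines[-1].lower()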

Citation Information

Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.

@article{xue2025illusionprogressassessingcurrent,
      title={An Illusion of Progress? Assessing the Current State of Web Agents}, 
      author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
      year={2025},
      eprint={2504.01382},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.01382}, 
}

@inproceedings{deng2023mind2web,
 author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
 pages = {28091--28114},
 publisher = {Curran Associates, Inc.},
 title = {Mind2Web: Towards a Generalist Agent for the Web},
 url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
 volume = {36},
 year = {2023}
}