	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: image-text-to-text
	tags:
	- multimodal
	library_name: transformers
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	---

	<style>
	img {
	display: inline;
	}
	</style>

	[![Model architecture](https://img.shields.io/badge/Qwen2.5-VL-blue#model-badge)](#model-architecture)
	| [![Model size](https://img.shields.io/badge/Params-7B-green#model-badge)](#model-architecture)
	| [![Language](https://img.shields.io/badge/Language-en-orange#model-badge)](#datasets)

	# WebJudge

	![image](https://raw.githubusercontent.com/OSU-NLP-Group/Online-Mind2Web/refs/heads/main/images/WebJudge.jpg)

	WebJudge is an automatic evaluator for web agent trajectories. It preserves critical intermediate screenshots while mitigating token overload, which yields more accurate and reliable evaluations. See our [paper](https://arxiv.org/abs/2504.01382) for details.

	- [Repository](https://github.com/OSU-NLP-Group/Online-Mind2Web)
	- 📃 [Paper](https://arxiv.org/abs/2504.01382)
	- 🏆 [Leaderboard](https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard)
	- 🤗 [Data](https://huggingface.co/datasets/osunlp/Online-Mind2Web)
	- [Model](https://huggingface.co/osunlp/WebJudge-7B)


	## Results

	### Comparison against Existing Evaluation Methods on Online-Mind2Web
	<table>
	<tr>
	<th>Model</th>
	<th>Auto-Eval</th>
	<th>SeeAct</th>
	<th>Agent-E</th>
	<th>Browser Use</th>
	<th>Claude 3.5</th>
	<th>Claude 3.7</th>
	<th>Operator</th>
	<th>Avg AR</th>
	</tr>
	<tr>
	<th rowspan="4">GPT-4o</th>
	<td>Autonomous Eval</td>
	<td>84.7</td>
	<td>85.0</td>
	<td>76.0</td>
	<td>83.7</td>
	<td>75.5</td>
	<td>71.7</td>
	<td>79.4</td>
	</tr>
	<tr>
	<td>AgentTrek Eval</td>
	<td>73.0</td>
	<td>64.3</td>
	<td>63.3</td>
	<td>--</td>
	<td>--</td>
	<td>--</td>
	<td>66.9</td>
	</tr>
	<tr>
	<td>WebVoyager</td>
	<td>--</td>
	<td>75.3</td>
	<td>71.3</td>
	<td>74.0</td>
	<td>72.0</td>
	<td>76.7</td>
	<td>73.9</td>
	</tr>
	<tr>
	<td>WebJudge</td>
	<td>86.7</td>
	<td>86.0</td>
	<td>81.4</td>
	<td>86.3</td>
	<td>79.1</td>
	<td>81.8</td>
	<td><b>83.6</b></td>
	</tr>

	<tr>
	<th rowspan="3">o4-mini</th>
	<td>Autonomous Eval</td>
	<td>79.7</td>
	<td>85.7</td>
	<td>86.0</td>
	<td>84.3</td>
	<td>68.0</td>
	<td>73.3</td>
	<td>79.5</td>
	</tr>
	<tr>
	<td>WebVoyager</td>
	<td>--</td>
	<td>80.3</td>
	<td>79.0</td>
	<td>81.7</td>
	<td>74.3</td>
	<td>78.3</td>
	<td>78.7</td>
	</tr>
	<tr>
	<td>WebJudge</td>
	<td>85.3</td>
	<td>86.3</td>
	<td>89.3</td>
	<td>87.0</td>
	<td>82.3</td>
	<td>83.7</td>
	<td><b>85.7</b></td>
	</tr>

	<tr>
	<th></th>
	<td>WebJudge-7B</td>
	<td>86.0</td>
	<td>87.3</td>
	<td>88.3</td>
	<td>89.7</td>
	<td>84.3</td>
	<td>86.3</td>
	<td><b>87.0</b></td>
	</tr>
	</table>
	With either GPT-4o or o4-mini as the backbone, WebJudge consistently achieves the highest agreement rate (AR) with human judgment among the compared methods, averaging 83.6% and 85.7%, respectively. WebJudge-7B outperforms even the o4-mini variant, reaching an average agreement of 87.0%.
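As a quick check on the table, the sketch below recomputes the Avg AR entry for WebJudge-7B. It assumes Avg AR is the unweighted mean of the six per-agent agreement rates, which is consistent with the reported numbers, and that an agreement rate is the share of tasks where the judge's success label matches the human label. The function names and the example labels are hypothetical, not taken from the repository.

```python
# Per-agent agreement rates (%) for WebJudge-7B, copied from the table above.
webjudge_7b_ar = {
    "SeeAct": 86.0,
    "Agent-E": 87.3,
    "Browser Use": 88.3,
    "Claude 3.5": 89.7,
    "Claude 3.7": 84.3,
    "Operator": 86.3,
}


def agreement_rate(judge_labels, human_labels):
    """Percentage of tasks where the judge and the human give the same label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return 100.0 * matches / len(human_labels)


def average_agreement(rates):
    """Unweighted mean of per-agent agreement rates (assumed definition of Avg AR)."""
    return sum(rates.values()) / len(rates)


avg_ar = average_agreement(webjudge_7b_ar)
print(f"{avg_ar:.1f}")  # prints 87.0

# Hypothetical labels: the judge agrees with the human on 2 of 4 tasks.
print(agreement_rate([True, False, True, True], [True, True, True, False]))  # prints 50.0
```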


	### Excellent generalization capabilities on [AgentRewardBench](https://agent-reward-bench.github.io/) (5 OOD benchmarks)
	| Methods             | AB    | VWA  | WA   | Work  | Wk++ | Overall |
	|---------------------|-------|------|------|-------|------|---------|
	| Rule-based*         | 25.0  | 85.2 | 79.0 | 100.0 | 83.3 | 83.8    |
	| Autonomous Eval*    | 83.3  | 61.2 | 67.6 | 96.4  | 59.3 | 67.6    |
	| GPT-4o (A11y Tree)* | 77.8  | 63.0 | 70.2 | 94.6  | 63.0 | 69.8    |
	| WebJudge (GPT-4o)   | 66.7  | 69.8 | 72.6 | 92.3  | 75.0 | 73.7    |
	| WebJudge-7B         | 80.0  | 66.7 | 77.5 | 100.0 | 70.0 | 75.7    |
	| WebJudge (o4-mini)  | 100.0 | 74.5 | 81.2 | 100.0 | 90.0 | 82.0    |

	WebJudge significantly outperforms the existing automatic judges (Autonomous Eval and GPT-4o with the accessibility tree), achieving overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) across 1302 trajectories from AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++).

	The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.
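To make the link between precision and reward modeling concrete, here is a minimal sketch. It uses the standard definition of precision (of the trajectories the judge marks successful, the share that humans also mark successful) and a simple rejection-sampling filter that keeps only judge-approved trajectories. The function names, trajectory IDs, and labels are hypothetical; this is not the benchmark's evaluation code.

```python
def precision(judge_labels, human_labels):
    """Precision (%) over the trajectories the judge marked as successful."""
    accepted = [h for j, h in zip(judge_labels, human_labels) if j]
    if not accepted:
        return 0.0  # the judge accepted nothing, so precision is undefined; report 0
    return 100.0 * sum(accepted) / len(accepted)


def rejection_sample(trajectories, judge_labels):
    """Keep only trajectories the judge accepts, e.g. as fine-tuning data."""
    return [t for t, ok in zip(trajectories, judge_labels) if ok]


# Hypothetical example: the judge accepts 3 trajectories, 2 of which are truly successful.
trajectories = ["traj_1", "traj_2", "traj_3", "traj_4"]
judge = [True, True, False, True]
human = [True, False, False, True]

print(f"{precision(judge, human):.1f}")       # prints 66.7
print(rejection_sample(trajectories, judge))  # prints ['traj_1', 'traj_2', 'traj_4']
```

With a high-precision judge, few failed trajectories slip through the filter, which is what matters when the kept trajectories are used as training signal.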

	## Inference

	### vLLM server

	```bash
	vllm serve osunlp/WebJudge-7B --port PORT --api-key API_KEY
	```

	or

	### LLaMA-Factory API

	```bash
	API_PORT=PORT llamafactory-cli api examples/inference/qwen2_vl.yaml
	```

	### Prompt
	See our [Repository](https://github.com/OSU-NLP-Group/Online-Mind2Web) and [Paper](https://arxiv.org/abs/2504.01382) for the full prompt, including the system message.

	```python
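	from openai import OpenAI

	# Assumption: one of the servers above is running locally and exposes an
	# OpenAI-compatible API; PORT and API_KEY are the placeholders passed to it.
	client = OpenAI(base_url="http://localhost:PORT/v1", api_key="API_KEY")
	model_path = "osunlp/WebJudge-7B"  # model name as served by vLLM

	# system_msg, task, key_points, and jpg_base64_image are supplied by the caller;
	# see the repository for the full system prompt.
	text = """Task: {task}

	Key Points for Task Completion: {key_points}

	The snapshot of the web page is shown in the image.""".format(task=task, key_points=key_points)

	messages = [
	    {"role": "system", "content": system_msg},
	    {
	        "role": "user",
	        "content": [
	            {"type": "text", "text": text},
	            {
	                # Screenshot passed inline as a base64-encoded JPEG data URL.
	                "type": "image_url",
	                "image_url": {"url": f"data:image/jpeg;base64,{jpg_base64_image}", "detail": "high"},
	            },
	        ],
	    }
	]

	# temperature=0 keeps the judgment as deterministic as possible.
	completion = client.chat.completions.create(
	    model=model_path,
	    messages=messages,
	    temperature=0
	)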
	```
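The snippet above expects `jpg_base64_image` to hold a base64-encoded JPEG. One way to produce it with the standard library is sketched below; the helper names and the screenshot path are hypothetical, and the repository may prepare images differently (for example, resizing them first).

```python
import base64
from pathlib import Path


def encode_jpeg_base64(jpeg_bytes: bytes) -> str:
    """Base64-encode raw JPEG bytes and return them as ASCII text."""
    return base64.b64encode(jpeg_bytes).decode("ascii")


def load_jpeg_base64(path: str) -> str:
    """Read a JPEG screenshot from disk and base64-encode it."""
    return encode_jpeg_base64(Path(path).read_bytes())


def image_part(jpg_base64_image: str) -> dict:
    """Build the image entry of the user message, matching the format above."""
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{jpg_base64_image}",
            "detail": "high",
        },
    }


# Usage with a hypothetical screenshot path:
# jpg_base64_image = load_jpeg_base64("screenshots/step_3.jpg")
```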

	## Citation Information

	Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.

	```bibtex
	@article{xue2025illusionprogressassessingcurrent,
	title={An Illusion of Progress? Assessing the Current State of Web Agents},
	author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
	year={2025},
	eprint={2504.01382},
	archivePrefix={arXiv},
	primaryClass={cs.AI},
	url={https://arxiv.org/abs/2504.01382},
	}

	@inproceedings{deng2023mind2web,
	author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
	booktitle = {Advances in Neural Information Processing Systems},
	editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
	pages = {28091--28114},
	publisher = {Curran Associates, Inc.},
	title = {Mind2Web: Towards a Generalist Agent for the Web},
	url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
	volume = {36},
	year = {2023}
	}
	```