---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
<style>
img {
display: inline;
}
</style>
[![Model architecture](https://img.shields.io/badge/Qwen2.5-VL-blue#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-7B-green#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-en-orange#model-badge)](#datasets)
# WebJudge
![image](https://raw.githubusercontent.com/OSU-NLP-Group/Online-Mind2Web/refs/heads/main/images/WebJudge.jpg)
WebJudge is an automatic evaluation method for web agent trajectories. It preserves critical intermediate screenshots while mitigating the token-overload issue, yielding more accurate and reliable evaluations. Please check our [paper](https://arxiv.org/abs/2504.01382) for more details.
- **[Repository](https://github.com/OSU-NLP-Group/Online-Mind2Web)**
- **📃 [Paper](https://arxiv.org/abs/2504.01382)**
- **🏆 [Leaderboard](https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard)**
- **🤗 [Data](https://huggingface.co/datasets/osunlp/Online-Mind2Web)**
- **[Model](https://huggingface.co/osunlp/WebJudge-7B)**
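At a high level, WebJudge first identifies the key points a task requires, then keeps only the intermediate screenshots relevant to those key points, and finally judges success from this compressed evidence. A minimal sketch of that flow (the `judge` methods are illustrative stand-ins for the LLM prompts in the repository, not an actual API):

```python
# Illustrative sketch of the WebJudge flow; the `judge` methods are
# hypothetical stand-ins for the LLM prompts in the repository.
def webjudge(task: str, screenshots: list, judge) -> bool:
    # 1. Derive the key points the task must satisfy.
    key_points = judge.identify_key_points(task)
    # 2. Keep only screenshots relevant to those key points,
    #    which caps the token cost of the final judgment call.
    evidence = [s for s in screenshots if judge.is_relevant(s, key_points)]
    # 3. Final success/failure verdict over the compressed evidence.
    return judge.verdict(task, key_points, evidence)
```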
## Results
### Comparison against Existing Evaluation Methods on Online-Mind2Web
<table>
<tr>
<th>Model</th>
<th>Auto-Eval</th>
<th>SeeAct</th>
<th>Agent-E</th>
<th>Browser Use</th>
<th>Claude 3.5</th>
<th>Claude 3.7</th>
<th>Operator</th>
<th>Avg. Agreement Rate</th>
</tr>
<tr>
<th rowspan="4">GPT-4o</th>
<td>Autonomous Eval</td>
<td>84.7</td>
<td>85.0</td>
<td>76.0</td>
<td>83.7</td>
<td>75.5</td>
<td>71.7</td>
<td>79.4</td>
</tr>
<tr>
<td>AgentTrek Eval</td>
<td>73.0</td>
<td>64.3</td>
<td>63.3</td>
<td>--</td>
<td>--</td>
<td>--</td>
<td>66.9</td>
</tr>
<tr>
<td>WebVoyager</td>
<td>--</td>
<td>75.3</td>
<td>71.3</td>
<td>74.0</td>
<td>72.0</td>
<td>76.7</td>
<td>73.9</td>
</tr>
<tr>
<td>WebJudge</td>
<td>86.7</td>
<td>86.0</td>
<td>81.4</td>
<td>86.3</td>
<td>79.1</td>
<td>81.8</td>
<td><b>83.6</b></td>
</tr>
<tr>
<th rowspan="3">o4-mini</th>
<td>Autonomous Eval</td>
<td>79.7</td>
<td>85.7</td>
<td>86.0</td>
<td>84.3</td>
<td>68.0</td>
<td>73.3</td>
<td>79.5</td>
</tr>
<tr>
<td>WebVoyager</td>
<td>--</td>
<td>80.3</td>
<td>79.0</td>
<td>81.7</td>
<td>74.3</td>
<td>78.3</td>
<td>78.7</td>
</tr>
<tr>
<td>WebJudge</td>
<td>85.3</td>
<td>86.3</td>
<td>89.3</td>
<td>87.0</td>
<td>82.3</td>
<td>83.7</td>
<td><b>85.7</b></td>
</tr>
<tr>
<th>--</th>
<td>WebJudge-7B</td>
<td>86.0</td>
<td>87.3</td>
<td>88.3</td>
<td>89.7</td>
<td>84.3</td>
<td>86.3</td>
<td><b>87.0</b></td>
</tr>
</table>
WebJudge powered by GPT-4o and o4-mini consistently achieves the highest agreement with human judgment, averaging 83.6% and 85.7%, respectively. Notably, WebJudge-7B even outperforms the o4-mini-powered variant, reaching 87.0% agreement.
### Excellent generalization capabilities on [AgentRewardBench](https://agent-reward-bench.github.io/) (5 OOD benchmarks)
| **Methods** | **AB** | **VWA** | **WA** | **Work** | **Wk++** | **Overall** |
|--------------|--------|--------|--------|----------|----------|--------------|
| Rule-based* | 25.0 | **85.2** | 79.0 | 100.0 | 83.3 | 83.8 |
| Autonomous Eval* | 83.3 | 61.2 | 67.6 | 96.4 | 59.3 | 67.6 |
| GPT-4o (A11y Tree)* | 77.8 | 63.0 | 70.2 | 94.6 | 63.0 | 69.8 |
| WebJudge (GPT-4o) | 66.7 | 69.8 | 72.6 | 92.3 | 75.0 | 73.7 |
| WebJudge-7B | 80.0 | 66.7 | 77.5 | 100.0 | 70.0 | 75.7 |
| WebJudge (o4-mini) | **100.0** | 74.5 | **81.2** | **100.0** | **90.0** | **82.0** |
WebJudge significantly outperforms existing methods, achieving overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) on AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++), across 1,302 trajectories in total.
The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.
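As a concrete illustration, a rejection-sampling loop could keep only the trajectories the judge marks successful; a hypothetical sketch reusing the `webjudge` helper sketched above:

```python
def rejection_sample(trajectories, judge):
    """Hypothetical reward-model usage: keep only trajectories
    that WebJudge labels successful, for later fine-tuning."""
    return [t for t in trajectories if webjudge(t.task, t.screenshots, judge)]
```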
## Inference
### vLLM server
```bash
vllm serve osunlp/WebJudge-7B --port PORT --api-key API_KEY
```
or
### LLaMA-Factory API
```bash
API_PORT=PORT llamafactory-cli api examples/inference/qwen2_vl.yaml
```
### Prompt
Please check our [Repository](https://github.com/OSU-NLP-Group/Online-Mind2Web) and [Paper](https://arxiv.org/abs/2504.01382) for full details about the prompts.
```python
import base64

from openai import OpenAI

# Connect to the OpenAI-compatible server started above
# (replace the base URL and API key with the values you served with).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="API_KEY")
model_path = "osunlp/WebJudge-7B"

# Base64-encode a screenshot for the image_url payload.
with open("screenshot.jpg", "rb") as f:
    jpg_base64_image = base64.b64encode(f.read()).decode("utf-8")

system_msg = "..."  # WebJudge system prompt; see the repository
task = "..."        # the task given to the web agent
key_points = "..."  # key points extracted for the task

text = """**Task**: {task}
**Key Points for Task Completion**: {key_points}
The snapshot of the web page is shown in the image.""".format(task=task, key_points=key_points)

messages = [
    {"role": "system", "content": system_msg},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{jpg_base64_image}", "detail": "high"},
            },
        ],
    },
]

completion = client.chat.completions.create(
    model=model_path,
    messages=messages,
    temperature=0,
)
print(completion.choices[0].message.content)
```
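The judge replies in natural language, and the final verdict is extracted from that response. A minimal, illustrative check (the actual output format is defined by the prompts in the repository):

```python
# Illustrative only: assumes the verdict mentions "success";
# use the repository's parsing logic for the real format.
response = completion.choices[0].message.content
is_success = "success" in response.lower()
```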
## Citation Information
Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.
```bibtex
@article{xue2025illusionprogressassessingcurrent,
title={An Illusion of Progress? Assessing the Current State of Web Agents},
author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
year={2025},
eprint={2504.01382},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2504.01382},
}
@inproceedings{deng2023mind2web,
author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {28091--28114},
publisher = {Curran Associates, Inc.},
title = {Mind2Web: Towards a Generalist Agent for the Web},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
volume = {36},
year = {2023}
}
```