Instructions to use osunlp/WebJudge-7B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use osunlp/WebJudge-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="osunlp/WebJudge-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("osunlp/WebJudge-7B")
model = AutoModelForMultimodalLM.from_pretrained("osunlp/WebJudge-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use osunlp/WebJudge-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "osunlp/WebJudge-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "osunlp/WebJudge-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/osunlp/WebJudge-7B

SGLang

How to use osunlp/WebJudge-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "osunlp/WebJudge-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "osunlp/WebJudge-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "osunlp/WebJudge-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "osunlp/WebJudge-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use osunlp/WebJudge-7B with Docker Model Runner:
```
docker model run hf.co/osunlp/WebJudge-7B
```


WebJudge

WebJudge preserves critical intermediate screenshots while mitigating the token overload issue, resulting in more accurate and reliable evaluations. Please check our paper for more details.

Repository
📃 Paper
🏆 Leaderboard
🤗 Data
Model

Results

Comparison against Existing Evaluation Methods on Online-Mind2Web

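| Model | Auto-Eval | SeeAct | Agent-E | Browser Use | Claude 3.5 | Claude 3.7 | Operator | Avg AR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | Autonomous Eval | 84.7 | 85.0 | 76.0 | 83.7 | 75.5 | 71.7 | 79.4 |
| | AgentTrek Eval | 73.0 | 64.3 | 63.3 | -- | -- | -- | 66.9 |
| | WebVoyager | -- | 75.3 | 71.3 | 74.0 | 72.0 | 76.7 | 73.9 |
| | WebJudge | 86.7 | 86.0 | 81.4 | 86.3 | 79.1 | 81.8 | 83.6 |
| o4-mini | Autonomous Eval | 79.7 | 85.7 | 86.0 | 84.3 | 68.0 | 73.3 | 79.5 |
| | WebVoyager | -- | 80.3 | 79.0 | 81.7 | 74.3 | 78.3 | 78.7 |
| | WebJudge | 85.3 | 86.3 | 89.3 | 87.0 | 82.3 | 83.7 | 85.7 |
| | WebJudge-7B | 86.0 | 87.3 | 88.3 | 89.7 | 84.3 | 86.3 | 87.0 |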

WebJudge powered by GPT-4o and o4-mini consistently achieves the highest agreement with human judgment among the methods that share the same backbone, averaging 83.6% and 85.7%, respectively. WebJudge-7B goes further and outperforms WebJudge with o4-mini, reaching an average agreement of 87.0%.
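
As a quick sanity check on the table, the Avg AR column appears to be the unweighted mean of the six per-agent agreement rates. That reading is an assumption inferred from the numbers, not something stated here; the sketch below recomputes it for the three WebJudge rows and compares against the reported value, allowing for one-decimal rounding.

```python
# Per-agent agreement rates copied from the table above, in column order:
# SeeAct, Agent-E, Browser Use, Claude 3.5, Claude 3.7, Operator.
# Each entry pairs the six rates with the reported Avg AR.
rows = {
    "WebJudge (GPT-4o)": ([86.7, 86.0, 81.4, 86.3, 79.1, 81.8], 83.6),
    "WebJudge (o4-mini)": ([85.3, 86.3, 89.3, 87.0, 82.3, 83.7], 85.7),
    "WebJudge-7B": ([86.0, 87.3, 88.3, 89.7, 84.3, 86.3], 87.0),
}


def mean(values):
    """Unweighted arithmetic mean (assumed definition of Avg AR)."""
    return sum(values) / len(values)


def matches_reported(values, reported, tol=0.051):
    """True if the mean equals the reported value up to one-decimal rounding.

    The tolerance is half a unit in the last reported digit plus a little
    slack for floating-point error.
    """
    return abs(mean(values) - reported) <= tol


for name, (rates, reported) in rows.items():
    ok = matches_reported(rates, reported)
    print(f"{name}: mean={mean(rates):.2f}, reported={reported}, consistent={ok}")
```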

Excellent generalization capabilities on AgentRewardBench (5 OOD benchmarks)

| Methods | AB | VWA | WA | Work | Wk++ | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Rule-based* | 25.0 | 85.2 | 79.0 | 100.0 | 83.3 | 83.8 |
| Autonomous Eval* | 83.3 | 61.2 | 67.6 | 96.4 | 59.3 | 67.6 |
| GPT-4o (A11y Tree)* | 77.8 | 63.0 | 70.2 | 94.6 | 63.0 | 69.8 |
| WebJudge (GPT-4o) | 66.7 | 69.8 | 72.6 | 92.3 | 75.0 | 73.7 |
| WebJudge-7B | 80.0 | 66.7 | 77.5 | 100.0 | 70.0 | 75.7 |
| WebJudge (o4-mini) | 100.0 | 74.5 | 81.2 | 100.0 | 90.0 | 82.0 |

WebJudge outperforms the existing LLM-based judges in the table (Autonomous Eval and GPT-4o with A11y Tree), achieving overall precision of 73.7% (GPT-4o), 75.7% (WebJudge-7B), and 82.0% (o4-mini) across 1302 trajectories from AssistantBench (AB), VisualWebArena (VWA), WebArena (WA), WorkArena (Work), and WorkArena++ (Wk++).
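
For reference, precision is used here in its standard sense: of the trajectories a judge marks as successful, the fraction that human annotators also mark as successful. A minimal sketch of that computation follows; the labels are made up for illustration and are not AgentRewardBench data.

```python
def precision(judge_labels, human_labels):
    """Fraction of judge-approved trajectories that humans also approve."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(judge_labels, human_labels))
    true_pos = sum(1 for j, h in pairs if j and h)
    false_pos = sum(1 for j, h in pairs if j and not h)
    predicted_pos = true_pos + false_pos
    # Precision is undefined when the judge approves nothing.
    return true_pos / predicted_pos if predicted_pos else None


# Illustrative labels only (True = trajectory judged successful).
judge = [True, True, False, True, False]
human = [True, False, False, True, True]
print(precision(judge, human))
```

A judge with high precision rarely approves a failed trajectory, which is the property that matters most when its verdicts are used as a reward signal.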

The high precision suggests that WebJudge holds potential as a robust and scalable reward model for downstream applications such as Rejection Sampling Fine-Tuning, Reflection, and Reinforcement Learning.
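
One way such a judge could serve Rejection Sampling Fine-Tuning is as a filter that keeps only the sampled trajectories it marks as successful. The sketch below illustrates the idea under stated assumptions: the `Trajectory` type, the `rejection_sample` helper, and the `toy_judge` stand-in are hypothetical and are not part of the WebJudge codebase.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    """Hypothetical container for one agent rollout."""
    task: str
    actions: List[str]


def rejection_sample(
    trajectories: Iterable[Trajectory],
    judge: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only the trajectories the judge accepts as successful."""
    return [t for t in trajectories if judge(t)]


def toy_judge(trajectory: Trajectory) -> bool:
    """Stand-in for a real judge call: accept rollouts that end in 'submit'."""
    return bool(trajectory.actions) and trajectory.actions[-1] == "submit"


samples = [
    Trajectory("book a table", ["search", "select", "submit"]),
    Trajectory("book a table", ["search", "back"]),
    Trajectory("book a table", []),
]
kept = rejection_sample(samples, toy_judge)
print(len(kept), "of", len(samples), "trajectories kept for fine-tuning")
```

In practice `toy_judge` would be replaced by a function that queries the served model and parses its verdict.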

Inference

vLLM server

vllm serve osunlp/WebJudge-7B --port PORT --api-key API_KEY
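
When the server is started with `--api-key`, clients pass the same key as a bearer token on the OpenAI-compatible endpoints. A minimal sketch using only the standard library is shown below; the host `localhost`, port `8000`, and the key value are placeholder assumptions, and the request is built without being sent.

```python
import json
import urllib.request


def build_chat_request(port, api_key, prompt, host="localhost"):
    """Build (but do not send) a chat completion request for the server."""
    payload = {
        "model": "osunlp/WebJudge-7B",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Must match the value passed to --api-key.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(port=8000, api_key="API_KEY", prompt="Hello")
print(req.full_url)

# To send it once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```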

LLaMA-Factory API

API_PORT=PORT llamafactory-cli api examples/inference/qwen2_vl.yaml

Prompt

Please check our Repository and Paper for more details about the prompt.

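# Assumes the caller has already defined: system_msg, task, key_points,
# jpg_base64_image (a base64-encoded JPEG screenshot), model_path, and
# client (an OpenAI-compatible client pointed at one of the servers above).

# Prompt template; the placeholders are filled in below.
text = """**Task**: {task}

**Key Points for Task Completion**: {key_points}

The snapshot of the web page is shown in the image."""
text = text.format(task=task, key_points=key_points)

messages = [
    {"role": "system", "content": system_msg},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{jpg_base64_image}", "detail": "high"},
            },
        ],
    },
]
# temperature=0 keeps the judgment as deterministic as possible.
completion = client.chat.completions.create(
    model=model_path,
    messages=messages,
    temperature=0
)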
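
The snippet above leaves the base64 encoding of the screenshot and the message assembly to the caller. One way to do both is sketched below; the helper name `build_judge_messages`, the example task, and the dummy image bytes are illustrative assumptions, and the actual system prompt should be taken from the repository.

```python
import base64

PROMPT_TEMPLATE = """**Task**: {task}

**Key Points for Task Completion**: {key_points}

The snapshot of the web page is shown in the image."""


def build_judge_messages(system_msg, task, key_points, jpg_bytes):
    """Assemble chat messages for one screenshot, given raw JPEG bytes."""
    jpg_base64_image = base64.b64encode(jpg_bytes).decode("ascii")
    text = PROMPT_TEMPLATE.format(task=task, key_points=key_points)
    return [
        {"role": "system", "content": system_msg},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{jpg_base64_image}",
                        "detail": "high",
                    },
                },
            ],
        },
    ]


# Dummy bytes stand in for a real screenshot read with open(path, "rb").
messages = build_judge_messages(
    system_msg="<system prompt from the repository>",
    task="Find the cheapest flight",
    key_points="1. Sort results by price",
    jpg_bytes=b"\xff\xd8\xff\xe0 placeholder",
)
print(messages[1]["content"][0]["text"])
```

The returned list can be passed directly as `messages` to `client.chat.completions.create`.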

Citation Information

Note: Online-Mind2Web is derived from the original Mind2Web dataset. We kindly ask that you cite both the original and this work when using or referencing the data.

@article{xue2025illusionprogressassessingcurrent,
      title={An Illusion of Progress? Assessing the Current State of Web Agents}, 
      author={Tianci Xue and Weijian Qi and Tianneng Shi and Chan Hee Song and Boyu Gou and Dawn Song and Huan Sun and Yu Su},
      year={2025},
      eprint={2504.01382},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.01382}, 
}

@inproceedings{deng2023mind2web,
 author = {Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
 pages = {28091--28114},
 publisher = {Curran Associates, Inc.},
 title = {Mind2Web: Towards a Generalist Agent for the Web},
 url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf},
 volume = {36},
 year = {2023}
}