qianhuiwu committed
Commit 34e15c4 · verified · 1 Parent(s): febc622

Update model card.

Files changed (1)
  1. README.md +99 -4
README.md CHANGED
@@ -6,14 +6,109 @@ base_model:
 
  # GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
 
- [GUI-Actor-7B-Qwen2-VL]() | [GUI-Actor-2B-Qwen-2-VL]() | [GUI-Actor-7B-Qwen2.5-VL]() | [GUI-Actor-3B-Qwen2.5-VL]()
 
- This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents** (Wu et al, 2025)](https://arxiv.org/abs/2403.12968).
- It is developed based on [Qwen2-VL-7B-Instruct ](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), augmented by an attention-based action head and finetuned to perform GUI grounding using the dataset [here]() (will update later).
 
  For more details on model design and evaluation, please check the project page at [GUI-Actor](https://aka.ms/GUI-Actor).
 
- ## Usage
 
  ## Citation
  ```
 
 
  # GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
 
+ - [GUI-Actor-7B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)
+ - [GUI-Actor-2B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL)
+ - [GUI-Actor-7B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL)
+ - [GUI-Actor-3B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)
+ - [GUI-Actor-Verifier-2B](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B)
 
+ This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents** (Wu et al., 2025)](https://github.com/microsoft/GUI-Actor).
+ It is built on [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), augmented with an attention-based action head, and fine-tuned for GUI grounding on the dataset [here (coming soon)]().
 
  For more details on model design and evaluation, please check the project page at [GUI-Actor](https://aka.ms/GUI-Actor).
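The model card does not describe how the attention-based action head turns attention into a click location. As a rough illustration only, and not the GUI-Actor implementation, the sketch below assumes the head yields one score per image patch on a regular grid and simply returns the center of the highest-weighted patch in normalized coordinates. The names `softmax`, `patch_center`, and `pick_click_point`, as well as the grid layout, are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-patch scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def patch_center(index, grid_w, grid_h):
    """Normalized (x, y) center of a patch in a row-major grid (assumed layout)."""
    row, col = divmod(index, grid_w)
    return ((col + 0.5) / grid_w, (row + 0.5) / grid_h)


def pick_click_point(patch_scores, grid_w, grid_h):
    """Return the center of the highest-weighted patch and its weight."""
    if len(patch_scores) != grid_w * grid_h:
        raise ValueError("expected one score per patch")
    weights = softmax(patch_scores)
    best = max(range(len(weights)), key=weights.__getitem__)
    return patch_center(best, grid_w, grid_h), weights[best]


# Toy 3x2 grid: the patch in the second row, middle column scores highest.
scores = [0.1, 0.2, 0.1, 0.0, 3.0, 0.3]
point, weight = pick_click_point(scores, grid_w=3, grid_h=2)
print(point, round(weight, 3))
```

Because the result is a patch location rather than generated coordinate text, this matches the "coordinate-free" framing in the paper title; the paper and project page remain the reference for the actual head design.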
 
+ ## 📊 Performance Comparison on GUI Grounding Benchmarks
+ Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.
+ | Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
+ |------------------|--------------|----------------|------------|----------------|
+ | **_72B models:_**
+ | AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
+ | UGround-V1-72B | Qwen2-VL | 34.5 | **89.4** | - |
+ | UI-TARS-72B | Qwen2-VL | **38.1** | 88.4 | **90.3** |
+ | **_7B models:_**
+ | OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
+ | AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
+ | UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
+ | UI-TARS-7B | Qwen2-VL | 35.7 | **89.5** | **91.6** |
+ | GUI-Actor-7B | Qwen2-VL | **40.7** | 88.3 | 89.5 |
+ | GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
+ | **_2B models:_**
+ | UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
+ | UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
+ | GUI-Actor-2B | Qwen2-VL | **36.7** | **86.5** | **88.6** |
+ | GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
+
+ Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with **Qwen2.5-VL** as the backbone.
+ | Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
+ |----------------|---------------|----------------|----------------|
+ | **_7B models:_**
+ | Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
+ | Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
+ | GUI-Actor-7B | Qwen2.5-VL | **44.6** | **92.1** |
+ | GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
+ | **_3B models:_**
+ | Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
+ | Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
+ | GUI-Actor-3B | Qwen2.5-VL | **42.2** | **91.0** |
+ | GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
+
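To make the effect of the verifier easier to read off Tables 1 and 2, the following sketch computes the ScreenSpot-Pro gain for each GUI-Actor model. The scores are copied from the tables; the dictionary layout and the helper name `verifier_gains` are illustrative choices, not part of the GUI-Actor codebase.

```python
# (model, backbone) -> (ScreenSpot-Pro without verifier, with verifier),
# values copied from Table 1 and Table 2.
screenspot_pro = {
    ("GUI-Actor-7B", "Qwen2-VL"): (40.7, 44.2),
    ("GUI-Actor-2B", "Qwen2-VL"): (36.7, 41.8),
    ("GUI-Actor-7B", "Qwen2.5-VL"): (44.6, 47.7),
    ("GUI-Actor-3B", "Qwen2.5-VL"): (42.2, 45.9),
}


def verifier_gains(results):
    """Absolute gain in points from adding the verifier, rounded to 0.1."""
    return {
        key: round(with_verifier - base, 1)
        for key, (base, with_verifier) in results.items()
    }


gains = verifier_gains(screenspot_pro)
for (model, backbone), gain in gains.items():
    print(f"{model} ({backbone}): +{gain} points")
```

On these numbers the verifier adds between 3.1 and 5.1 points on ScreenSpot-Pro, with the largest gain for the 2B model.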
+ ## 🚀 Usage
+ ```python
+ import torch
+
+ from qwen_vl_utils import process_vision_info
+ from datasets import load_dataset
+ from transformers import Qwen2VLProcessor
+ from gui_actor.constants import chat_template
+ from gui_actor.modeling import Qwen2VLForConditionalGenerationWithActionHead
+ from gui_actor.inference import inference
+
+
+ # load model
+ model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
+ data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
+ tokenizer = data_processor.tokenizer
+ model = Qwen2VLForConditionalGenerationWithActionHead.from_pretrained(
+     model_name_or_path,
+     torch_dtype=torch.bfloat16,
+     device_map="cuda:0",
+     attn_implementation="flash_attention_2"
+ ).eval()
+
+ # prepare example: first sample of the ScreenSpot test split
+ dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
+ example = dataset[0]
+ conversation = [
+     {
+         "role": "system",
+         "content": [
+             {
+                 "type": "text",
+                 "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
+             }
+         ]
+     },
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": example["image"], # PIL.Image.Image or str to path
+                 # "image_url": "https://xxxxx.png" or "https://xxxxx.jpg" or "file://xxxxx.png" or "data:image/png;base64,xxxxxxxx", will be split by "base64,"
+             },
+             {
+                 "type": "text",
+                 "text": example["instruction"]
+             },
+         ],
+     },
+ ]
+
+ # inference
+ # The committed call also passed logits_processor=logits_processor_actor, but that
+ # name is never defined in this snippet. It is omitted here on the assumption that
+ # inference() falls back to a default logits processor.
+ pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
+ px, py = pred["topk_points"][0]  # highest-ranked of the top-k predicted points
+ print(f"Click point: [{px}, {py}]")
+ ```
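The snippet above prints the predicted point but does not say what coordinate system it uses. The sketch below assumes, without confirmation from this diff, that `topk_points` holds coordinates normalized to [0, 1] and that a ground-truth box is given as `(x1, y1, x2, y2)` on the same scale. Under those assumptions it shows how a point could be mapped to pixels and checked against a target box, which is the usual way grounding accuracy is scored. The helpers `to_pixels` and `hits_bbox` are hypothetical and not part of `gui_actor`.

```python
def to_pixels(point, width, height):
    """Map a normalized (x, y) point to integer pixel coordinates."""
    x, y = point
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("expected normalized coordinates in [0, 1]")
    # Clamp so that x == 1.0 or y == 1.0 still lands on the last pixel.
    return min(int(x * width), width - 1), min(int(y * height), height - 1)


def hits_bbox(point, bbox):
    """True if the point lies inside the (x1, y1, x2, y2) box, edges included."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


# Made-up values for illustration; not real model output.
point = (0.25, 0.5)
print(to_pixels(point, width=1920, height=1080))   # (480, 540)
print(hits_bbox(point, (0.2, 0.45, 0.3, 0.55)))    # True
```

If the model instead returns pixel coordinates, `to_pixels` is unnecessary and the box check should be run on pixel values.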
 
  ## Citation
  ```