qianhuiwu committed on
Commit d8a8183 · verified · 1 Parent(s): 42bcf36

Add model card.

Files changed (1)
  1. README.md +132 -3
README.md CHANGED
---
license: mit
base_model:
- Qwen/Qwen2-VL-2B-Instruct
---

# GUI-Actor-2B with Qwen2-VL-2B as backbone VLM

- [GUI-Actor-7B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)
- [GUI-Actor-2B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL)
- [GUI-Actor-7B-Qwen2.5-VL (coming soon)](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL)
- [GUI-Actor-3B-Qwen2.5-VL (coming soon)](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)
- [GUI-Actor-Verifier-2B](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B)

This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://aka.ms/GUI-Actor).
It is built on [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), augmented with an attention-based action head, and fine-tuned to perform GUI grounding using the dataset [here (coming soon)]().

For more details on model design and evaluation, please check: [🏠 Project Page](https://aka.ms/GUI-Actor) | [💻 GitHub Repo](https://github.com/microsoft/GUI-Actor) | [📑 Paper]().

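The attention-based action head is only named above, not described. As a toy illustration of the coordinate-free idea (hypothetical patch-grid size and random scores, not GUI-Actor's actual implementation), the sketch below scores a grid of visual patches and reads a normalized click point off the best-scoring patch, instead of generating `(x, y)` coordinates as text:

```python
import torch

# Toy sketch only: a stand-in for an attention-based action head.
# The 14x20 grid and the random logits are invented purely to show how a
# click point can be decoded from patch-level scores rather than generated
# as coordinate text.
num_rows, num_cols = 14, 20                                 # hypothetical patch grid
patch_logits = torch.randn(num_rows, num_cols)              # stand-in attention logits
patch_probs = torch.softmax(patch_logits.flatten(), dim=0)  # normalize over all patches

best = patch_probs.argmax().item()
row, col = divmod(best, num_cols)

# Map the winning patch back to a normalized (x, y) point in [0, 1].
toy_x = (col + 0.5) / num_cols
toy_y = (row + 0.5) / num_rows
print(f"Toy click point: [{toy_x:.4f}, {toy_y:.4f}]")
```

The usage example below shows the real interface: the model returns ranked click points directly.
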
## 📊 Performance Comparison on GUI Grounding Benchmarks

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|------------------|--------------|----------------|------------|----------------|
| **_72B models:_** | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | **89.4** | - |
| UI-TARS-72B | Qwen2-VL | **38.1** | 88.4 | **90.3** |
| **_7B models:_** | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | **89.5** | **91.6** |
| GUI-Actor-7B | Qwen2-VL | **40.7** | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| **_2B models:_** | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | **36.7** | **86.5** | **88.6** |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with **Qwen2.5-VL** as the backbone.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|----------------|---------------|----------------|----------------|
| **_7B models:_** | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | **44.6** | **92.1** |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| **_3B models:_** | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | **42.2** | **91.0** |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |

## 🚀 Usage

```python
import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# load model
model_name_or_path = "microsoft/GUI-Actor-2B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# prepare example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 4) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],  # PIL.Image.Image or a path string
                # Alternatively: "image_url": "https://xxxxx.png", "https://xxxxx.jpg", "file://xxxxx.png",
                # or "data:image/png;base64,xxxxxxxx" (base64 URLs are split on "base64,")
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# inference
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Expected output
# Instruction: close this window
# Ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
```
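The predicted click point is in the same normalized [0, 1] form as the ground-truth region, so acting on it only requires scaling by the screenshot size. A minimal follow-up sketch (an addition, not part of the official example; it assumes `pyautogui` is installed and that the screenshot is displayed at its native resolution):

```python
# Follow-up sketch (assumption-laden, not from the official example): scale the
# normalized click point to pixels and perform the click with pyautogui.
import pyautogui

image = example["image"]             # PIL.Image.Image from the ScreenSpot sample
pixel_x = int(px * image.width)      # normalized x -> pixel column
pixel_y = int(py * image.height)     # normalized y -> pixel row
print(f"Clicking at pixel ({pixel_x}, {pixel_y})")

pyautogui.moveTo(pixel_x, pixel_y)   # assumes the screenshot fills the screen 1:1
pyautogui.click()
```

With `topk=3`, `pred["topk_points"]` also contains the lower-ranked candidate points, which can be useful for downstream re-ranking or verification.
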
## 📝 Citation

```
@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={},
}
```