---
base_model:
- openbmb/MiniCPM-V-2_6
pipeline_tag: image-text-to-text
---

# AgentCPM-GUI

[GitHub](https://github.com/OpenBMB/AgentCPM-GUI) | Technical Blog

## News

* [2025-05-13] 🚀🚀🚀 We have open-sourced **AgentCPM-GUI**, an on-device GUI agent capable of operating Chinese & English apps and equipped with RFT-enhanced reasoning abilities.

## Overview

**AgentCPM-GUI** is an open-source on-device LLM agent model jointly developed by [THUNLP](https://nlp.csai.tsinghua.edu.cn) and [ModelBest](https://modelbest.cn/en). Built on [MiniCPM-V](https://github.com/OpenBMB/MiniCPM-V) with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks.

Key features include:

- **High-quality GUI grounding** — Pre-training on a large-scale bilingual Android dataset significantly boosts localization and comprehension of common GUI widgets (buttons, input boxes, labels, icons, etc.).
- **Chinese-app operation** — The first open-source GUI agent fine-tuned for Chinese apps, covering 30+ popular apps such as Amap, Dianping, bilibili, and Xiaohongshu.
- **Enhanced planning & reasoning** — Reinforcement fine-tuning (RFT) lets the model “think” before outputting an action, greatly improving success on complex tasks.
- **Compact action-space design** — An optimized action space and concise JSON format reduce the average action length to 9.7 tokens, boosting on-device inference efficiency (see the example below).
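
For reference, a single step in this compact action format looks like the following. This is an illustrative example modeled on the Quick Start output later in this README; `POINT` is assumed here to use relative coordinates on a 0-1000 grid, with `eval/utils/schema/schema.json` as the authoritative schema:

```JSON
{"thought":"The target button is in the top navigation bar.","POINT":[729,69]}
```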

Demo Case (1x speed):

https://github.com/user-attachments/assets/5472a659-cd71-4bce-a181-0981129c6a81

## Quick Start

### Install dependencies

```bash
git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt
```

### Download the model

Download [AgentCPM-GUI](https://huggingface.co/openbmb/AgentCPM-GUI) from Hugging Face and place it in `model/AgentCPM-GUI`.
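
If you prefer a scripted download, a minimal sketch using the `huggingface_hub` library (assuming `pip install huggingface_hub`):

```python
from huggingface_hub import snapshot_download

# Download the weights into the path the examples below expect
snapshot_download("openbmb/AgentCPM-GUI", local_dir="model/AgentCPM-GUI")
```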

#### Hugging Face Inference

```python
import json

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Load the model and tokenizer
model_path = "model/AgentCPM-GUI"  # model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0")

# 2. Build the input
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the 'Member' button on the screen"
image_path = "assets/test.jpeg"
image = Image.open(image_path).convert("RGB")

# 3. Resize the longer side to 1120 px to save compute & memory
def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

image = __resize__(image)

# 4. Build the message format
messages = [{
    "role": "user",
    "content": [
        f"<Question>{instruction}</Question>\n当前屏幕截图:",  # "...\nCurrent screenshot:"
        image
    ]
}]

# 5. Inference
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))  # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
# System prompt (Chinese). English gloss: "You are an agent familiar with Android
# touchscreen GUI operations. Given the user's question and the current screenshot,
# analyze the GUI elements and layout, and output the next action as compact JSON
# that conforms to the Schema below."
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)

# 6. Output
print(outputs)
```

Expected output:

```JSON
{"thought":"任务目标是点击屏幕上的‘会员’按钮。当前界面显示了应用的推荐页面,顶部有一个导航栏。点击‘会员’按钮可以访问应用的会员相关内容。","POINT":[729,69]}
```

(The Chinese `thought` reads: "The goal is to tap the 'Member' button. The current screen shows the app's recommendation page with a navigation bar at the top. Tapping 'Member' opens the app's membership content.")
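
Note that `POINT` is not in pixels. A minimal sketch of mapping it back to screen coordinates, assuming the action space uses relative coordinates on a 0-1000 grid (an assumption here; `eval/utils/schema/schema.json` is the authoritative definition):

```python
import json

def point_to_pixels(action_json: str, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Map a relative POINT (assumed 0-1000 grid) to pixel coordinates."""
    action = json.loads(action_json)
    x, y = action["POINT"]
    return round(x / 1000 * screen_w), round(y / 1000 * screen_h)

# e.g. on a 1080x2400 screen, POINT [729,69] lands near (787, 166)
print(point_to_pixels('{"POINT":[729,69]}', 1080, 2400))
```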

#### vLLM Inference

```bash
# Launch the vLLM server
vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code
```
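
Once the server is up, the OpenAI-compatible model listing endpoint offers a quick sanity check (assuming the default port 8000):

```python
import requests

# Should list "AgentCPM-GUI" among the served models
print(requests.get("http://localhost:8000/v1/models").json())
```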

```python
import base64
import io
import json

import requests
from PIL import Image

END_POINT = "http://localhost:8000/v1/chat/completions"  # Replace with actual endpoint

# System prompt: identical to the one in the Hugging Face example above
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))  # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

def encode_image(image: Image.Image) -> str:
    """Convert PIL Image to base64-encoded string."""
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

def __resize__(origin_img):
    """Resize the longer side to 1120 px, as in the Hugging Face example."""
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

def predict(text_prompt: str, image: Image.Image):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{text_prompt}</Question>\n当前屏幕截图:"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}}
        ]}
    ]

    payload = {
        "model": "AgentCPM-GUI",  # Your model name
        "temperature": 0.1,
        "messages": messages,
        "max_tokens": 2048,
    }

    headers = {
        "Content-Type": "application/json",
    }

    response = requests.post(END_POINT, headers=headers, json=payload)
    assistant_msg = response.json()["choices"][0]["message"]["content"]
    return assistant_msg

image = __resize__(Image.open("assets/test.jpeg"))
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the 'Member' button on the screen"
response = predict(instruction, image)
print(response)
```
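
Because vLLM exposes an OpenAI-compatible API, the official `openai` Python client works as well. A sketch, assuming `pip install openai` and reusing `SYSTEM_PROMPT`, `instruction`, `image`, and `encode_image` from the snippet above (the `api_key` value is arbitrary for a local server):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="AgentCPM-GUI",
    temperature=0.1,
    max_tokens=2048,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{instruction}</Question>\n当前屏幕截图:"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}},
        ]},
    ],
)
print(completion.choices[0].message.content)
```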

## Fine-tuning

Source code for SFT and RFT training is provided — see [SFT](sft/readme.md) and [RFT](rft/readme.md).

## Performance Evaluation

### Grounding Benchmark

| Model                 | fun2point | text2point | bbox2text | average  |
| --------------------- | --------- | ---------- | --------- | -------- |
| **AgentCPM-GUI-8B**   | **79.1**  | **76.5**   | **58.2**  | **71.3** |
| Qwen2.5-VL-7B         | 36.8      | 52.0       | 44.1      | 44.3     |
| Intern2.5-VL-8B       | 17.2      | 24.2       | 45.9      | 29.1     |
| Intern2.5-VL-26B      | 14.8      | 16.6       | 36.3      | 22.6     |
| OS-Genesis-7B         | 8.3       | 5.8        | 4.0       | 6.0      |
| UI-TARS-7B            | 56.8      | 66.7       | 1.4       | 41.6     |
| OS-Atlas-7B           | 53.6      | 60.7       | 0.4       | 38.2     |
| Aguvis-7B             | 60.8      | **76.5**   | 0.2       | 45.8     |
| GPT-4o                | 22.1      | 19.9       | 14.3      | 18.8     |
| GPT-4o with Grounding | 44.3      | 44.0       | 14.3      | 44.2     |

### Agent Benchmark

TM and EM denote type match and exact match, respectively.

| Model | Android Control-Low TM | Android Control-Low EM | Android Control-High TM | Android Control-High EM | GUI-Odyssey TM | GUI-Odyssey EM | AITZ TM | AITZ EM | Chinese APP TM | Chinese APP EM |
| ------------------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- | --------- |
| **AgentCPM-GUI-8B** | **94.39** | **90.20** | **77.70** | **69.17** | **90.85** | **74.96** | **85.71** | **76.38** | **96.86** | **91.28** |
| Qwen2.5-VL-7B       | 92.11     | 82.12     | 69.65     | 57.36     | 55.33     | 40.90     | 73.16     | 57.58     | 68.53     | 48.80     |
| UI-TARS-7B          | 93.52     | 88.89     | 68.53     | 60.81     | 78.79     | 57.33     | 71.74     | 55.31     | 71.01     | 53.92     |
| OS-Genesis-7B       | 90.74     | 74.22     | 65.92     | 44.43     | 11.67     | 3.63      | 19.98     | 8.45      | 38.10     | 14.50     |
| OS-Atlas-7B         | 73.03     | 67.25     | 70.36     | 56.53     | 91.83\*   | 76.76\*   | 74.13     | 58.45     | 81.53     | 55.89     |
| Aguvis-7B           | 93.85     | 89.40     | 65.56     | 54.18     | 26.71     | 13.54     | 35.71     | 18.99     | 67.43     | 38.20     |
| OdysseyAgent-7B     | 65.10     | 39.16     | 58.80     | 32.74     | 90.83     | 73.67     | 59.17     | 31.60     | 67.56     | 25.44     |
| GPT-4o              | -         | 19.49     | -         | 20.80     | -         | 20.39     | 70.00     | 35.30     | 3.67      | 3.67      |
| Gemini 2.0          | -         | 28.50     | -         | 60.20     | -         | 3.27      | -         | -         | -         | -         |
| Claude              | -         | 19.40     | -         | 12.50     | 60.90     | -         | -         | -         | -         | -         |

> \*Different train/test splits

All evaluation data and code are open-sourced — see [here](eval) for details.

## Evaluation Data

We provide **CAGUI**, an evaluation benchmark for Chinese apps covering **grounding** and **agent** tasks.
See the dataset on [Hugging Face](https://huggingface.co/datasets/openbmb/CAGUI).

## License

* Code in this repository is released under the [Apache-2.0](./LICENSE) license.

## Citation

If **AgentCPM-GUI** is useful for your research, please cite:

```bibtex
@misc{2025,
  author       = {THUNLP},
  title        = {AgentCPM-GUI},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/OpenBMB/AgentCPM-GUI}}
}
```