base_model:
- openbmb/MiniCPM-V-2_6
pipeline_tag: image-text-to-text
---

# AgentCPM-GUI

[GitHub](https://github.com/OpenBMB/AgentCPM-GUI) | Technical Blog

## News

* [2025-05-13] 🚀🚀🚀 We have open-sourced **AgentCPM-GUI**, an on-device GUI agent capable of operating Chinese & English apps and equipped with RFT-enhanced reasoning abilities.

## Overview

**AgentCPM-GUI** is an open-source on-device LLM agent model jointly developed by [THUNLP](https://nlp.csai.tsinghua.edu.cn) and [ModelBest](https://modelbest.cn/en). Built on [MiniCPM-V](https://github.com/OpenBMB/MiniCPM-V) with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks.

Key features include:

- **High-quality GUI grounding** — Pre-training on a large-scale bilingual Android dataset significantly boosts localization and comprehension of common GUI widgets (buttons, input boxes, labels, icons, etc.).
- **Chinese-app operation** — The first open-source GUI agent finely tuned for Chinese apps, covering 30+ popular titles such as Amap, Dianping, bilibili and Xiaohongshu.
- **Enhanced planning & reasoning** — Reinforcement fine-tuning (RFT) lets the model “think” before outputting an action, greatly improving success on complex tasks.
- **Compact action-space design** — An optimized action space and concise JSON format reduce the average action length to 9.7 tokens, boosting on-device inference efficiency.
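
The compact format is easiest to see with a concrete action. Below is a minimal sketch reusing the `thought`/`POINT` fields that appear in the sample output later in this README; the action shown is illustrative only, and the authoritative action vocabulary is defined by `eval/utils/schema/schema.json`:

```python
import json

# An illustrative action only; the real action space is defined by the schema.
tap = {"thought": "Tap the target button.", "POINT": [729, 69]}

# separators=(',', ':') removes all whitespace, producing the short
# serialization that keeps actions down to ~10 tokens on average
compact = json.dumps(tap, ensure_ascii=False, separators=(',', ':'))
print(compact)  # → {"thought":"Tap the target button.","POINT":[729,69]}
```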

Demo Case (1x speed):

https://github.com/user-attachments/assets/5472a659-cd71-4bce-a181-0981129c6a81

## Quick Start

### Install dependencies

```bash
git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt
```

### Download the model

Download [AgentCPM-GUI](https://huggingface.co/openbmb/AgentCPM-GUI) from Hugging Face and place it in `model/AgentCPM-GUI`.

#### Hugging Face Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import json

# 1. Load the model and tokenizer
model_path = "model/AgentCPM-GUI"  # model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0")

# 2. Build the input
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the '会员' (Membership) button on the screen"
image_path = "assets/test.jpeg"
image = Image.open(image_path).convert("RGB")

# 3. Resize the longer side to 1120 px to save compute & memory
def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

image = __resize__(image)

# 4. Build the message format
messages = [{
    "role": "user",
    "content": [
        f"<Question>{instruction}</Question>\n当前屏幕截图:",  # "Current screenshot:"
        image
    ]
}]

# 5. Inference
# The system prompt (in Chinese) defines the agent's role, task, and output
# rules, and embeds the JSON action schema the output must follow.
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))  # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)

# 6. Output
print(outputs)
```

Expected output:

```json
{"thought":"任务目标是点击屏幕上的‘会员’按钮。当前界面显示了应用的推荐页面,顶部有一个导航栏。点击‘会员’按钮可以访问应用的会员相关内容。","POINT":[729,69]}
```

(The `thought` translates to: “The goal is to tap the ‘会员’ (Membership) button. The current screen shows the app's recommendation page with a navigation bar at the top; tapping ‘会员’ opens the app's membership content.”)
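
Note that `POINT` values are not raw pixels but coordinates on a normalized grid (assumed here to be 0–1000, a common GUI-agent convention; verify the exact range against `eval/utils/schema/schema.json`). A minimal sketch of mapping a predicted point back to device pixels; `point_to_pixels` is our hypothetical helper, not part of the repository:

```python
import json

def point_to_pixels(action_json: str, screen_w: int, screen_h: int):
    """Map a POINT on the assumed 0-1000 normalized grid to absolute pixels."""
    action = json.loads(action_json)
    x, y = action["POINT"]
    return round(x * screen_w / 1000), round(y * screen_h / 1000)

sample = '{"thought":"...","POINT":[729,69]}'
print(point_to_pixels(sample, 1080, 2340))  # → (787, 161)
```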

#### vLLM Inference

```bash
# Launch the vLLM server
vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code
```

```python
import base64
import io
import json
import requests
from PIL import Image

END_POINT = "http://localhost:8000/v1/chat/completions"  # Replace with actual endpoint

# System prompt: same construction as in the Hugging Face example above
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))  # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

def encode_image(image: Image.Image) -> str:
    """Convert a PIL Image to a base64-encoded JPEG string."""
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

# Resize the longer side to 1120 px to save compute & memory
def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

def predict(text_prompt: str, image: Image.Image) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{text_prompt}</Question>\n当前屏幕截图:"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}}
        ]}
    ]

    payload = {
        "model": "AgentCPM-GUI",  # Your model name
        "temperature": 0.1,
        "messages": messages,
        "max_tokens": 2048,
    }

    headers = {
        "Content-Type": "application/json",
    }

    response = requests.post(END_POINT, headers=headers, json=payload)
    assistant_msg = response.json()["choices"][0]["message"]["content"]
    return assistant_msg

image = __resize__(Image.open("assets/test.jpeg"))
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the '会员' (Membership) button on the screen"
response = predict(instruction, image)
print(response)
```
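
With either backend, the reply arrives as a string that should contain one compact JSON action. A defensive parsing sketch (`parse_action` is our hypothetical helper, not part of the repository; it simply extracts the outermost braces before decoding):

```python
import json

def parse_action(raw: str) -> dict:
    """Extract and parse the first {...} JSON object in the model reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object found in: {raw!r}")
    return json.loads(raw[start:end + 1])

action = parse_action('{"thought":"ok","POINT":[500,500]}')
print(action["POINT"])  # → [500, 500]
```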

## Fine-tuning

Source code for SFT and RFT training is provided — see [SFT](sft/readme.md) and [RFT](rft/readme.md).

## Performance Evaluation

### Grounding Benchmark

| Model                 | fun2point | text2point | bbox2text | average  |
| --------------------- | --------- | ---------- | --------- | -------- |
| **AgentCPM-GUI-8B**   | **79.1**  | **76.5**   | **58.2**  | **71.3** |
| Qwen2.5-VL-7B         | 36.8      | 52.0       | 44.1      | 44.3     |
| InternVL2.5-8B        | 17.2      | 24.2       | 45.9      | 29.1     |
| InternVL2.5-26B       | 14.8      | 16.6       | 36.3      | 22.6     |
| OS-Genesis-7B         | 8.3       | 5.8        | 4.0       | 6.0      |
| UI-TARS-7B            | 56.8      | 66.7       | 1.4       | 41.6     |
| OS-Atlas-7B           | 53.6      | 60.7       | 0.4       | 38.2     |
| Aguvis-7B             | 60.8      | **76.5**   | 0.2       | 45.8     |
| GPT-4o                | 22.1      | 19.9       | 14.3      | 18.8     |
| GPT-4o with Grounding | 44.3      | 44.0       | 14.3      | 44.2     |

### Agent Benchmark

TM and EM denote type-match and exact-match accuracy, respectively.

| Dataset             | Android Control-Low TM | Android Control-Low EM | Android Control-High TM | Android Control-High EM | GUI-Odyssey TM | GUI-Odyssey EM | AITZ TM   | AITZ EM   | Chinese APP TM | Chinese APP EM |
| ------------------- | ---------------------- | ---------------------- | ----------------------- | ----------------------- | -------------- | -------------- | --------- | --------- | -------------- | -------------- |
| **AgentCPM-GUI-8B** | **94.39**              | **90.20**              | **77.70**               | **69.17**               | **90.85**      | **74.96**      | **85.71** | **76.38** | **96.86**      | **91.28**      |
| Qwen2.5-VL-7B       | 92.11                  | 82.12                  | 69.65                   | 57.36                   | 55.33          | 40.90          | 73.16     | 57.58     | 68.53          | 48.80          |
| UI-TARS-7B          | 93.52                  | 88.89                  | 68.53                   | 60.81                   | 78.79          | 57.33          | 71.74     | 55.31     | 71.01          | 53.92          |
| OS-Genesis-7B       | 90.74                  | 74.22                  | 65.92                   | 44.43                   | 11.67          | 3.63           | 19.98     | 8.45      | 38.10          | 14.50          |
| OS-Atlas-7B         | 73.03                  | 67.25                  | 70.36                   | 56.53                   | 91.83\*        | 76.76\*        | 74.13     | 58.45     | 81.53          | 55.89          |
| Aguvis-7B           | 93.85                  | 89.40                  | 65.56                   | 54.18                   | 26.71          | 13.54          | 35.71     | 18.99     | 67.43          | 38.20          |
| OdysseyAgent-7B     | 65.10                  | 39.16                  | 58.80                   | 32.74                   | 90.83          | 73.67          | 59.17     | 31.60     | 67.56          | 25.44          |
| GPT-4o              | -                      | 19.49                  | -                       | 20.80                   | -              | 20.39          | 70.00     | 35.30     | 3.67           | 3.67           |
| Gemini 2.0          | -                      | 28.50                  | -                       | 60.20                   | -              | 3.27           | -         | -         | -              | -              |
| Claude              | -                      | 19.40                  | -                       | 12.50                   | 60.90          | -              | -         | -         | -              | -              |

> \*Different train/test splits

All evaluation data and code are open-sourced — see [here](eval) for details.

## Evaluation Data

We provide **CAGUI**, an evaluation benchmark for Chinese apps covering **grounding** and **agent** tasks.
See the dataset on [Hugging Face](https://huggingface.co/datasets/openbmb/CAGUI).

## License

* Code in this repository is released under the [Apache-2.0](./LICENSE) license.

## Citation

If **AgentCPM-GUI** is useful for your research, please cite:

```bibtex
@misc{agentcpmgui2025,
  author       = {THUNLP},
  title        = {AgentCPM-GUI},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/OpenBMB/AgentCPM-GUI}}
}
```