AgentCPM-GUI

GitHub | Technical Blog

News

  • [2025-05-13] 🚀🚀🚀 We have open-sourced AgentCPM-GUI, an on-device GUI agent capable of operating Chinese & English apps and equipped with RFT-enhanced reasoning abilities.

Overview

AgentCPM-GUI is an open-source on-device LLM agent model jointly developed by THUNLP and ModelBest. Built on MiniCPM-V with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks.

Key features include:

  • High-quality GUI grounding - Pre-training on a large-scale bilingual Android dataset significantly boosts localization and comprehension of common GUI widgets (buttons, input boxes, labels, icons, etc.).
  • Chinese-app operation - The first open-source GUI agent fine-tuned for Chinese apps, covering 30+ popular titles such as Amap, Dianping, bilibili, and Xiaohongshu.
  • Enhanced planning & reasoning - Reinforcement fine-tuning (RFT) lets the model "think" before outputting an action, greatly improving success rates on complex tasks.
  • Compact action-space design - An optimized action space and concise JSON format reduce the average action length to 9.7 tokens, boosting on-device inference efficiency (see the illustrative action below).
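
For a concrete feel, a single tap action in this compact format looks like the line below (abridged from the expected output in the Quick Start; "thought" carries the model's optional reasoning and "POINT" the target coordinates):

{"thought":"...","POINT":[729,69]}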

Demo Case (1x speed):

https://github.com/user-attachments/assets/5472a659-cd71-4bce-a181-0981129c6a81

Quick Start

Install dependencies

git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt

Download the model

Download AgentCPM-GUI from Hugging Face and place it in model/AgentCPM-GUI.
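
If you prefer to script the download, a minimal sketch with huggingface_hub (the repo id openbmb/AgentCPM-GUI matches this model card; the target directory is the one the examples below expect):

from huggingface_hub import snapshot_download

# Fetch the checkpoint into the local path used by the inference examples.
snapshot_download(repo_id="openbmb/AgentCPM-GUI", local_dir="model/AgentCPM-GUI")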

Hugging Face Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import json

# 1. Load the model and tokenizer
model_path = "model/AgentCPM-GUI"  # model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0") 

# 2. Build the input
instruction = "请点击屏幕上的'会员'按钮"  # "Please tap the '会员' (Member) button on the screen"
image_path = "assets/test.jpeg"
image = Image.open(image_path).convert("RGB")

# 3. Resize the longer side to 1120 px to save compute & memory
def resize_image(origin_img: Image.Image) -> Image.Image:
    """Scale the image so that its longer side is at most 1120 px."""
    w, h = origin_img.size
    max_line_res = 1120
    if h > max_line_res:
        w = int(w * max_line_res / h)
        h = max_line_res
    if w > max_line_res:
        h = int(h * max_line_res / w)
        w = max_line_res
    return origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)

image = resize_image(image)

# 4. Build the message format
messages = [{
    "role": "user",
    "content": [
        f"<Question>{instruction}</Question>\nๅฝ“ๅ‰ๅฑๅน•ๆˆชๅ›พ๏ผš",
        image
    ]
}]

# 5. Inference
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"])) # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
# The system prompt is in Chinese; roughly translated it reads: "You are an
# agent familiar with Android touchscreen GUI operations. Given the user's
# question and the current screenshot, output the next action as compact JSON
# that follows the Schema constraints."
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体，将根据用户的问题，分析当前界面的GUI元素和布局，生成相应的操作。

# Task
针对用户问题，根据输入的当前屏幕截图，输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)

# 6. Output
print(outputs)

Expected output:

{"thought":"ไปปๅŠก็›ฎๆ ‡ๆ˜ฏ็‚นๅ‡ปๅฑๅน•ไธŠ็š„โ€˜ไผšๅ‘˜โ€™ๆŒ‰้’ฎใ€‚ๅฝ“ๅ‰็•Œ้ขๆ˜พ็คบไบ†ๅบ”็”จ็š„ๆŽจ่้กต้ข๏ผŒ้กถ้ƒจๆœ‰ไธ€ไธชๅฏผ่ˆชๆ ใ€‚็‚นๅ‡ปโ€˜ไผšๅ‘˜โ€™ๆŒ‰้’ฎๅฏไปฅ่ฎฟ้—ฎๅบ”็”จ็š„ไผšๅ‘˜็›ธๅ…ณๅ†…ๅฎนใ€‚","POINT":[729,69]}

vLLM Inference

# Launch the vLLM server (run in a shell)
vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code

Then query it from a separate Python client:

import base64
import io
import json
import requests
from PIL import Image

END_POINT = "http://localhost:8000/v1/chat/completions"  # Replace with actual endpoint

# Build the system prompt (the same Chinese prompt as in the Hugging Face example above)
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"])) # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体，将根据用户的问题，分析当前界面的GUI元素和布局，生成相应的操作。

# Task
针对用户问题，根据输入的当前屏幕截图，输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

def encode_image(image: Image.Image) -> str:
    """Convert PIL Image to base64-encoded string."""
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

def resize_image(origin_img: Image.Image) -> Image.Image:
    """Scale the image so that its longer side is at most 1120 px."""
    w, h = origin_img.size
    max_line_res = 1120
    if h > max_line_res:
        w = int(w * max_line_res / h)
        h = max_line_res
    if w > max_line_res:
        h = int(h * max_line_res / w)
        w = max_line_res
    return origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)

def predict(text_prompt: str, image: Image.Image):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{text_prompt}</Question>\nๅฝ“ๅ‰ๅฑๅน•ๆˆชๅ›พ๏ผš"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}}
        ]}
    ]

    payload = {
        "model": "AgentCPM-GUI",  # Your model name
        "temperature": 0.1,
        "messages": messages,
        "max_tokens": 2048,
    }

    headers = {
        "Content-Type": "application/json",
    }

    response = requests.post(END_POINT, headers=headers, json=payload)
    response.raise_for_status()  # surface HTTP errors instead of failing on the JSON parse
    assistant_msg = response.json()["choices"][0]["message"]["content"]
    return assistant_msg

image = resize_image(Image.open("assets/test.jpeg"))
instruction = "请点击屏幕上的'会员'按钮"  # "Please tap the '会员' (Member) button on the screen"
response = predict(instruction, image)
print(response)
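
For multi-step tasks, the same predict function can drive a simple agent loop: capture a screenshot, query the model, execute the returned action, repeat. A hypothetical skeleton reusing predict and resize_image from above; capture_screenshot and execute_action are placeholders for your own device automation (e.g. via ADB), and the termination check is an assumption to verify against the action schema:

import json
from PIL import Image

def capture_screenshot() -> Image.Image:
    raise NotImplementedError  # placeholder: grab the current device screen

def execute_action(action: dict) -> None:
    raise NotImplementedError  # placeholder: perform the tap/type/swipe on-device

def run_episode(task: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        image = resize_image(capture_screenshot())
        action = json.loads(predict(task, image))
        if "STATUS" in action:  # assumption: a terminal action signals completion
            break
        execute_action(action)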

Fine-tuning

Source code for SFT and RFT training is provided; see SFT and RFT.

Performance Evaluation

Grounding Benchmark

| Model                 | fun2point | text2point | bbox2text | average |
|-----------------------|-----------|------------|-----------|---------|
| AgentCPM-GUI-8B       | 79.1      | 76.5       | 58.2      | 71.3    |
| Qwen2.5-VL-7B         | 36.8      | 52.0       | 44.1      | 44.3    |
| Intern2.5-VL-8B       | 17.2      | 24.2       | 45.9      | 29.1    |
| Intern2.5-VL-26B      | 14.8      | 16.6       | 36.3      | 22.6    |
| OS-Genesis-7B         | 8.3       | 5.8        | 4.0       | 6.0     |
| UI-TARS-7B            | 56.8      | 66.7       | 1.4       | 41.6    |
| OS-Atlas-7B           | 53.6      | 60.7       | 0.4       | 38.2    |
| Aguvis-7B             | 60.8      | 76.5       | 0.2       | 45.8    |
| GPT-4o                | 22.1      | 19.9       | 14.3      | 18.8    |
| GPT-4o with Grounding | 44.3      | 44.0       | 14.3      | 44.2    |

Agent Benchmark

TM denotes type match (the predicted action type is correct); EM denotes exact match (both the type and its arguments are correct).

| Model           | Android Control-Low TM | Android Control-Low EM | Android Control-High TM | Android Control-High EM | GUI-Odyssey TM | GUI-Odyssey EM | AITZ TM | AITZ EM | Chinese APP TM | Chinese APP EM |
|-----------------|------------------------|------------------------|-------------------------|-------------------------|----------------|----------------|---------|---------|----------------|----------------|
| AgentCPM-GUI-8B | 94.39 | 90.20 | 77.70 | 69.17 | 90.85 | 74.96 | 85.71 | 76.38 | 96.86 | 91.28 |
| Qwen2.5-VL-7B   | 92.11 | 82.12 | 69.65 | 57.36 | 55.33 | 40.90 | 73.16 | 57.58 | 68.53 | 48.80 |
| UI-TARS-7B      | 93.52 | 88.89 | 68.53 | 60.81 | 78.79 | 57.33 | 71.74 | 55.31 | 71.01 | 53.92 |
| OS-Genesis-7B   | 90.74 | 74.22 | 65.92 | 44.43 | 11.67 | 3.63  | 19.98 | 8.45  | 38.10 | 14.50 |
| OS-Atlas-7B     | 73.03 | 67.25 | 70.36 | 56.53 | 91.83* | 76.76* | 74.13 | 58.45 | 81.53 | 55.89 |
| Aguvis-7B       | 93.85 | 89.40 | 65.56 | 54.18 | 26.71 | 13.54 | 35.71 | 18.99 | 67.43 | 38.20 |
| OdysseyAgent-7B | 65.10 | 39.16 | 58.80 | 32.74 | 90.83 | 73.67 | 59.17 | 31.60 | 67.56 | 25.44 |
| GPT-4o          | -     | 19.49 | -     | 20.80 | -     | 20.39 | 70.00 | 35.30 | 3.67  | 3.67  |
| Gemini 2.0      | -     | 28.50 | -     | 60.20 | -     | 3.27  | -     | -     | -     | -     |
| Claude          | -     | 19.40 | -     | 12.50 | 60.90 | -     | -     | -     | -     | -     |

*Evaluated with different GUI-Odyssey train/test splits, so the starred scores are not directly comparable.

All evaluation data and code are open-sourced; see here for details.

Evaluation Data

We provide CAGUI, an evaluation benchmark for Chinese apps covering grounding and agent tasks. See the dataset on Hugging Face.
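
A minimal sketch for loading it with the datasets library; the dataset id below is an assumption, so check the Hugging Face dataset page for the exact id and available configs:

from datasets import load_dataset

# Hypothetical dataset id; replace with the id from the dataset page.
cagui = load_dataset("openbmb/CAGUI")
print(cagui)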

License

  • Code in this repository is released under the Apache-2.0 license.

Citation

If AgentCPM-GUI is useful for your research, please cite:

@misc{2025,
  author       = {THUNLP},
  title        = {AgentCPM-GUI},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/OpenBMB/AgentCPM-GUI}}
}