BAAI/RoboBrain · Safetensors · English · llava_onevision

yuheng2000 committed · Commit 78dcdd5 · verified · 1 parent: 28e6cb5

Update README.md

Files changed (1): README.md (+241 -85)

README.md CHANGED
@@ -7,116 +7,200 @@ language:
  - en
  ---
 
- # RoboBrain
 
- <!-- Provide a quick summary of what the model is/does. -->
- [CVPR 2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete.
 
  <p align="center">
- ⭐️ <a href="https://superrobobrain.github.io/">Project</a>&nbsp;&nbsp; | &nbsp;&nbsp;🧠 <a href="https://github.com/FlagOpen/RoboBrain/">GitHub</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://www.modelscope.cn/models/BAAI/RoboBrain/files/">ModelScope</a>&nbsp;&nbsp; | &nbsp;&nbsp;🌎 <a href="https://github.com/FlagOpen/ShareRobot">Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="http://arxiv.org/abs/2502.21257">Paper</a>
  </p>
  <p align="center">
  🎯 <a href="">RoboOS (Coming Soon)</a>: An Efficient Open-Source Multi-Robot Coordination System for RoboBrain.
  </p>
  <p align="center">
- 🎯 <a href="https://tanhuajie.github.io/ReasonRFT/">ReasonRFT</a>: Exploring a New RFT Paradigm to Enhance RoboBrain's Visual Reasoning Capabilities.
  </p>
 
- ## 🤗 Checkpoints
  | Models | Checkpoint | Description |
  |----------------------|----------------------------------------------------------------|------------------------------------------------------------|
  | Planning Model | [🤗 Planning CKPTs](https://huggingface.co/BAAI/RoboBrain/) | Used for Planning prediction in our paper |
- | Affordance (A-LoRA) | [🤗 Affordance CKPTs](https://huggingface.co/BAAI/RoboBrain-LoRA-Affordance) | Used for Affordance prediction in our paper |
- | Trajectory (T-LoRA) | [🤗 Trajectory CKPTs](https://huggingface.co/BAAI/RoboBrain/) | Used for Trajectory prediction in our paper *(Coming Soon)* |
 
- ## 🔥 Introduction
 
- In recent years, the rapid development of multimodal large language models (MLLMs) has significantly advanced research progress toward artificial general intelligence (AGI).
- By utilizing vast multimodal data from the internet and combining it with self-supervised learning techniques,
- MLLMs have demonstrated exceptional capabilities in visual perception and in understanding human language instructions.
- However, despite their impressive performance on general tasks, MLLMs still face substantial challenges in embodied scenarios,
- particularly in long-horizon manipulation tasks.
 
- In robotics, long-horizon manipulation is one of the core capabilities a robot needs for executing complex tasks.
- These tasks typically involve multiple steps and long-term interactions, such as "preparing a cup of tea in the kitchen" or "completing item sorting in a warehouse."
- Such tasks require robots not only to understand abstract instructions but also to convert those instructions into concrete actions.
- Specifically, the successful execution of long-horizon manipulation tasks relies on three core capabilities:
 
- - **Planning**: Robots need to decompose complex abstract instructions into executable subtasks. For example, "lifting the teapot and pouring water into the cup" must be broken down into steps such as "approaching the teapot and lifting it," "moving the teapot to align the spout over the cup," and "tilting the teapot to pour water."
-
- - **Affordance Perception**: Robots must accurately identify the actionable regions of objects, such as the handle or spout of a teapot, to ensure the precision of their actions.
-
- - **Trajectory Prediction**: Robots need to predict the complete path from the starting point to the target position based on the task instructions, such as the movement trajectory from the current position to the handle of the teapot.
 
- However, existing MLLMs exhibit significant shortcomings in these areas. For instance, when faced with the task of "lifting the teapot and pouring water into the cup," MLLMs may struggle to accurately decompose the task steps, identify the graspable regions of the teapot, or predict the complete trajectory from the starting point to the target position. These limitations primarily stem from the current lack of large-scale, fine-grained datasets specifically designed for MLLMs and long-horizon manipulation tasks in robotics.
 
- To address this gap, we propose <a target="_blank" href="https://huggingface.co/datasets/BAAI/ShareRobot">ShareRobot</a>, a high-quality heterogeneous dataset specifically designed for robotic manipulation tasks.
- ShareRobot annotates multidimensional information, including task planning, object affordance regions, and end-effector trajectories,
- providing a solid foundation for enhancing robotic capabilities.
- Based on ShareRobot, we developed **RoboBrain**, a unified embodied multimodal brain model that aims to enhance robots' capabilities in long-horizon manipulation tasks.
- Through carefully designed data ratios, multi-stage training strategies, and inputs of long videos and high-resolution images, RoboBrain achieves a cognitive leap from abstract task instructions to concrete action expressions, demonstrating its potential for practical applications in robotics.
 
- RoboBrain integrates three key robotic capabilities for long-horizon manipulation tasks: **planning**, **affordance perception**, and **trajectory prediction**.
- Built on the ShareRobot dataset we constructed, RoboBrain achieves state-of-the-art performance on multiple robotic benchmarks through a well-designed multi-stage training process,
- realizing a cognitive leap from abstract instruction understanding to concrete action expression.
- <p align="center">
- <img src="https://superrobobrain.github.io/images/RoboBrain_teaser.png" />
- </p>
 
- ## 🤖 Inference
 
- ### Option 1: HF inference
 
- #### Run python script as example:
  ```python
- import torch
- from transformers import AutoProcessor, AutoModelForPreTraining
 
  model_id = "BAAI/RoboBrain"
 
- print("Loading Checkpoint ...")
- model = AutoModelForPreTraining.from_pretrained(
-     model_id,
-     torch_dtype=torch.float16,
-     low_cpu_mem_usage=True,
- ).to("cuda:0")
-
- processor = AutoProcessor.from_pretrained(model_id)
-
- # Define a chat history and use `apply_chat_template` to get the correctly formatted prompt
- # Each value in "content" has to be a list of dicts with types ("text", "image")
- messages = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "text", "text": "What is shown in this image?"},
-             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
-         ],
-     },
- ]
-
- print("Processing input...")
- inputs = processor.apply_chat_template(
-     messages,
-     add_generation_prompt=True,
-     tokenize=True,
-     return_dict=True,
-     return_tensors="pt"
- )
 
- inputs = {k: v.to("cuda:0") for k, v in inputs.items()}
 
- print("Generating output...")
- output = model.generate(**inputs, max_new_tokens=250)
- print(processor.decode(output[0][2:], skip_special_tokens=True))
  ```
 
- ### Option 2: VLLM inference
- #### Install and launch VLLM
  ```bash
  # Install vllm package
  pip install vllm==0.6.6.post1
@@ -125,7 +209,7 @@ pip install vllm==0.6.6.post1
  python -m vllm.entrypoints.openai.api_server --model BAAI/RoboBrain --served-model-name robobrain --max_model_len 16384 --limit_mm_per_prompt image=8
  ```
 
- #### Run python script as example:
  ```python
  from openai import OpenAI
  import base64
@@ -138,19 +222,23 @@ client = OpenAI(
      base_url=openai_api_base,
  )
 
  response = client.chat.completions.create(
      model="robobrain",
      messages=[
          {
              "role": "user",
              "content": [
-                 {
-                     "type": "image_url",
-                     "image_url": {
-                         "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
-                     },
-                 },
-                 {"type": "text", "text": "What is shown in this image?"},
              ],
          },
      ]
@@ -158,15 +246,83 @@ response = client.chat.completions.create(
 
  content = response.choices[0].message.content
  print(content)
  ```
 
- ## 📑 Citation
 
  ```
  @article{ji2025robobrain,
    title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
    author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
    journal={arXiv preprint arXiv:2502.21257},
    year={2025}
  }
- ```
  - en
  ---
 
+ <div align="center">
+ <img src="https://github.com/FlagOpen/RoboBrain/raw/main/assets/logo.jpg" width="400"/>
+ </div>
+
+ # [CVPR 25] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
+
 
  <p align="center">
+ ⭐️ <a href="https://superrobobrain.github.io/">Project</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤗 <a href="https://huggingface.co/BAAI/RoboBrain/">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://www.modelscope.cn/models/BAAI/RoboBrain/files/">ModelScope</a>&nbsp;&nbsp; | &nbsp;&nbsp;🌎 <a href="https://github.com/FlagOpen/ShareRobot">Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="http://arxiv.org/abs/2502.21257">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;💬 <a href="./assets/wechat.png">WeChat</a>
  </p>
  <p align="center">
  🎯 <a href="">RoboOS (Coming Soon)</a>: An Efficient Open-Source Multi-Robot Coordination System for RoboBrain.
  </p>
  <p align="center">
+ 🎯 <a href="https://tanhuajie.github.io/ReasonRFT/">Reason-RFT</a>: Exploring a New RFT Paradigm to Enhance RoboBrain's Visual Reasoning Capabilities.
  </p>
 
+ ## 🔥 Overview
+ Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: **(1) Planning Capability**, which involves decomposing complex manipulation instructions into manageable sub-tasks; **(2) Affordance Perception**, the ability to recognize and interpret the affordances of interactive objects; and **(3) Trajectory Prediction**, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multimodal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
+
+ <div align="center">
+ <img src="https://github.com/FlagOpen/RoboBrain/blob/main/assets/overview.png" />
+ </div>
+
+ ## 🚀 Features
+ This repository supports:
+ - **`Data Preparation`**: Please refer to [Dataset Preparation](https://github.com/FlagOpen/ShareRobot) for how to prepare the dataset.
+ - **`Training for RoboBrain`**: Please refer to the [Training Section](#Training) for the usage of the training scripts.
+ - **`HF/VLLM Inference`**: Please see the [Inference Section](#Inference); we now also support inference with [VLLM](https://github.com/vllm-project/vllm).
+ - **`Evaluation for RoboBrain`**: Please refer to the [Evaluation Section](#Evaluation) for how to prepare the benchmarks.
+ - **`ShareRobot Generation`**: Please refer to [ShareRobot](https://github.com/FlagOpen/ShareRobot) for details.
+
+ ## 🗞️ News
+
+ - **`2025-03-29`**: 🤗 We have released the [Affordance Checkpoint](https://huggingface.co/BAAI/RoboBrain-LoRA-Affordance/) on Hugging Face.
+ - **`2025-03-27`**: 🤗 We have released the [Planning Checkpoint](https://huggingface.co/BAAI/RoboBrain/) on Hugging Face.
+ - **`2025-03-26`**: 🔥 We have released the [RoboBrain](https://github.com/FlagOpen/RoboBrain/) repository.
+ - **`2025-02-27`**: 🌍 Our [RoboBrain](http://arxiv.org/abs/2502.21257/) was accepted by CVPR 2025.
+
+ ## 📆 Todo
+ - [x] Release scripts for model training and inference.
+ - [x] Release the Planning checkpoint.
+ - [x] Release the Affordance checkpoint.
+ - [ ] Release the ShareRobot dataset. *(Uploading ...)*
+ - [ ] Release the Trajectory checkpoint.
+ - [ ] Release evaluation scripts for the benchmarks.
+ - [ ] Train a more powerful **RoboBrain-v2**.
+
+ ## 🤗 Models
+
+ - **[`Base Planning Model`](https://huggingface.co/BAAI/RoboBrain/)**: Trained on general datasets in Stages 1–2 and on the Robotic Planning dataset in Stage 3; it is designed for Planning prediction.
+ - **[`A-LoRA for Affordance`](https://huggingface.co/BAAI/RoboBrain-LoRA-Affordance/)**: Based on the Base Planning Model, Stage 4 adds LoRA-based training on our Affordance dataset to predict affordances.
+ - **[`T-LoRA for Trajectory`](https://huggingface.co/BAAI/RoboBrain/)**: Based on the Base Planning Model, Stage 4 adds LoRA-based training on our Trajectory dataset to predict trajectories. *(Coming Soon)*
+
+ <div align="center">
+ <img src="https://github.com/FlagOpen/RoboBrain/blob/main/assets/training.png" />
+ </div>
+
  | Models | Checkpoint | Description |
  |----------------------|----------------------------------------------------------------|------------------------------------------------------------|
  | Planning Model | [🤗 Planning CKPTs](https://huggingface.co/BAAI/RoboBrain/) | Used for Planning prediction in our paper |
+ | Affordance (A-LoRA) | [🤗 Affordance CKPTs](https://huggingface.co/BAAI/RoboBrain-LoRA-Affordance/) | Used for Affordance prediction in our paper |
+ | Trajectory (T-LoRA) | [🤗 Trajectory CKPTs](https://huggingface.co/BAAI/RoboBrain-LoRA-Trajectory/) | Used for Trajectory prediction in our paper *(Coming Soon)* |
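
The A-LoRA and T-LoRA checkpoints are LoRA adapters on top of the base Planning model. The repository's own `SimpleInference` wrapper (shown in the Inference section below) is the supported way to use them; purely as an illustrative sketch, and assuming the adapters are exported in standard PEFT format, they could in principle be attached with `transformers` + `peft` like this:

```python
# Illustrative sketch only (not the official API): attach the Affordance LoRA
# to the base Planning checkpoint, assuming a standard PEFT-format adapter.
import torch
from transformers import AutoProcessor, AutoModelForPreTraining
from peft import PeftModel

base_id = "BAAI/RoboBrain"
lora_id = "BAAI/RoboBrain-LoRA-Affordance"

processor = AutoProcessor.from_pretrained(base_id)
base = AutoModelForPreTraining.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to("cuda:0")

# Wraps the base model with the LoRA adapter without merging the weights,
# so the same base checkpoint can later be reused with T-LoRA.
model = PeftModel.from_pretrained(base, lora_id)
```
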
+ ## 🛠️ Setup
+
+ ```bash
+ # Clone the repo.
+ git clone https://github.com/FlagOpen/RoboBrain.git
+ cd RoboBrain
+
+ # Build the conda env.
+ conda create -n robobrain python=3.10
+ conda activate robobrain
+ pip install -r requirements.txt
+ ```
+
+ ## <a id="Training">🤖 Training</a>
+
+ ### 1. Data Preparation
+
+ ```bash
+ # Modify datasets for Stage 1, please refer to:
+ - yaml_path: scripts/train/yaml/stage_1_0.yaml
+
+ # Modify datasets for Stage 1.5, please refer to:
+ - yaml_path: scripts/train/yaml/stage_1_5.yaml
+
+ # Modify datasets for Stage 2_si, please refer to:
+ - yaml_path: scripts/train/yaml/stage_2_si.yaml
+
+ # Modify datasets for Stage 2_ov, please refer to:
+ - yaml_path: scripts/train/yaml/stage_2_ov.yaml
+
+ # Modify datasets for Stage 3_plan, please refer to:
+ - yaml_path: scripts/train/yaml/stage_3_planning.yaml
+
+ # Modify datasets for Stage 4_aff, please refer to:
+ - yaml_path: scripts/train/yaml/stage_4_affordance.yaml
+
+ # Modify datasets for Stage 4_traj, please refer to:
+ - yaml_path: scripts/train/yaml/stage_4_trajectory.yaml
+ ```
+ **Note:** The sample format in each JSON file should look like the following (see the loading sketch after this block):
+ ```json
+ {
+     "id": "xxxx",
+     "image": [
+         "image1.png",
+         "image2.png"
+     ],
+     "conversations": [
+         {
+             "from": "human",
+             "value": "<image>\n<image>\nAre there numerous dials near the bottom left of the tv?"
+         },
+         {
+             "from": "gpt",
+             "value": "Yes. The sun casts shadows ... a serene, clear sky."
+         }
+     ]
+ },
+ ```
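
As a quick sanity check on data formatting, a minimal sketch like the following can load such a file and verify that each sample carries the expected fields (the file name `stage_3_planning.json` is a hypothetical example, not a file shipped with the repo):

```python
# Minimal sketch for sanity-checking a training JSON file against the format above.
# "stage_3_planning.json" is a hypothetical example path.
import json

with open("stage_3_planning.json") as f:
    samples = json.load(f)  # expected to be a list of samples in the format above

for sample in samples:
    assert "id" in sample and "conversations" in sample
    images = sample.get("image", [])
    # Each "<image>" placeholder in the human turns should correspond to one entry in "image".
    human_turns = [c["value"] for c in sample["conversations"] if c["from"] == "human"]
    n_placeholders = sum(v.count("<image>") for v in human_turns)
    assert n_placeholders == len(images), f"sample {sample['id']}: image count mismatch"

print(f"Checked {len(samples)} samples.")
```
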
+
+ ### 2. Training
+ ```bash
+ # Training on Stage 1:
+ bash scripts/train/stage_1_0_pretrain.sh
+
+ # Training on Stage 1.5:
+ bash scripts/train/stage_1_5_direct_finetune.sh
+
+ # Training on Stage 2_si:
+ bash scripts/train/stage_2_0_resume_finetune_si.sh
+
+ # Training on Stage 2_ov:
+ bash scripts/train/stage_2_0_resume_finetune_ov.sh
+
+ # Training on Stage 3_plan:
+ bash scripts/train/stage_3_0_resume_finetune_robo.sh
+
+ # Training on Stage 4_aff:
+ bash scripts/train/stage_4_0_resume_finetune_lora_a.sh
+
+ # Training on Stage 4_traj:
+ bash scripts/train/stage_4_0_resume_finetune_lora_t.sh
+ ```
+ **Note:** Please change the environment variables (e.g., *DATA_PATH*, *IMAGE_FOLDER*, *PREV_STAGE_CHECKPOINT*) in each script to your own paths.
+
+ ### 3. Convert original weights to HF weights
+ ```bash
+ # Planning Model
+ python model/llava_utils/convert_robobrain_to_hf.py --model_dir /path/to/original/checkpoint/ --dump_path /path/to/output/
+
+ # A-LoRA & T-LoRA
+ python model/llava_utils/convert_lora_weights_to_hf.py --model_dir /path/to/original/checkpoint/ --dump_path /path/to/output/
+ ```
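
As a rough post-conversion check (a sketch, not part of the repo's tooling), the dumped HF-format checkpoint should load directly with `transformers`, mirroring how the released `BAAI/RoboBrain` checkpoint is loaded:

```python
# Sketch: verify that a converted HF-format checkpoint loads.
# "/path/to/output/" mirrors the --dump_path placeholder above.
import torch
from transformers import AutoProcessor, AutoModelForPreTraining

ckpt_dir = "/path/to/output/"
processor = AutoProcessor.from_pretrained(ckpt_dir)
model = AutoModelForPreTraining.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
print(model.config.model_type)  # expected to report the llava_onevision architecture tag
```
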
+
+ ## <a id="Inference">⭐️ Inference</a>
+
+ ### 1. Usage for Planning Prediction
+
+ #### Option 1: HF inference
 
  ```python
+ from inference import SimpleInference
 
  model_id = "BAAI/RoboBrain"
+ model = SimpleInference(model_id)
 
+ prompt = "Given the objects in the image, if you are required to complete the task \"Put the apple in the basket\", what is your detailed plan? Write your plan and explain it in detail, using the following format: Step_1: xxx\nStep_2: xxx\n ...\nStep_n: xxx\n"
+
+ image = "./assets/demo/planning.png"
+
+ pred = model.inference(prompt, image, do_sample=True)
+ print(f"Prediction: {pred}")
+
+ '''
+ Prediction: (as an example)
+ Step_1: Move to the apple. Move towards the apple on the table.
+ Step_2: Pick up the apple. Grab the apple and lift it off the table.
+ Step_3: Move towards the basket. Move the apple towards the basket without dropping it.
+ Step_4: Put the apple in the basket. Place the apple inside the basket, ensuring it is in a stable position.
+ '''
  ```
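
Since the plan comes back as a single string in the `Step_n: ...` format requested by the prompt, downstream code usually needs it as a list of sub-tasks. A small sketch (assuming the model followed that format) could split it like this:

```python
# Sketch: split a "Step_1: ... Step_2: ..." plan string into an ordered list of sub-tasks.
# Assumes the model followed the Step_n format requested in the prompt above.
import re

def parse_plan(pred: str) -> list[str]:
    # Capture the text after each "Step_k:" up to the next "Step_k:" or the end of the string.
    steps = re.findall(r"Step_\d+:\s*(.+?)(?=\s*Step_\d+:|\Z)", pred, flags=re.S)
    return [s.strip() for s in steps]

plan = parse_plan(
    "Step_1: Move to the apple. Move towards the apple on the table.\n"
    "Step_2: Pick up the apple. Grab the apple and lift it off the table.\n"
)
print(plan)  # ['Move to the apple. ...', 'Pick up the apple. ...']
```
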
+ #### Option 2: VLLM inference
+ Install and launch VLLM:
  ```bash
  # Install vllm package
  pip install vllm==0.6.6.post1
 
  python -m vllm.entrypoints.openai.api_server --model BAAI/RoboBrain --served-model-name robobrain --max_model_len 16384 --limit_mm_per_prompt image=8
  ```
 
+ Run the following python script as an example:
  ```python
  from openai import OpenAI
  import base64
 
      base_url=openai_api_base,
  )
 
+ prompt = "Given the objects in the image, if you are required to complete the task \"Put the apple in the basket\", what is your detailed plan? Write your plan and explain it in detail, using the following format: Step_1: xxx\nStep_2: xxx\n ...\nStep_n: xxx\n"
+
+ image = "./assets/demo/planning.png"
+
+ with open(image, "rb") as f:
+     encoded_image = base64.b64encode(f.read())
+ encoded_image = encoded_image.decode("utf-8")
+ base64_img = f"data:image;base64,{encoded_image}"
+
  response = client.chat.completions.create(
      model="robobrain",
      messages=[
          {
              "role": "user",
              "content": [
+                 {"type": "image_url", "image_url": {"url": base64_img}},
+                 {"type": "text", "text": prompt},
              ],
          },
      ]
 
  content = response.choices[0].message.content
  print(content)
+
+ '''
+ Prediction: (as an example)
+ Step_1: Move to the apple. Move towards the apple on the table.
+ Step_2: Pick up the apple. Grab the apple and lift it off the table.
+ Step_3: Move towards the basket. Move the apple towards the basket without dropping it.
+ Step_4: Put the apple in the basket. Place the apple inside the basket, ensuring it is in a stable position.
+ '''
  ```
 
+ ### 2. Usage for Affordance Prediction
+ ```python
+ from inference import SimpleInference
+
+ model_id = "BAAI/RoboBrain"
+ lora_id = "BAAI/RoboBrain-LoRA-Affordance"
+ model = SimpleInference(model_id, lora_id)
+
+ # Example 1:
+ prompt = "You are a robot using the joint control. The task is \"pick_up the suitcase\". Please predict a possible affordance area of the end effector?"
+
+ image = "./assets/demo/affordance_1.jpg"
+
+ pred = model.inference(prompt, image, do_sample=False)
+ print(f"Prediction: {pred}")
+
+ '''
+ Prediction: [0.733, 0.158, 0.845, 0.263]
+ '''
+
+ # Example 2:
+ prompt = "You are a robot using the joint control. The task is \"push the bicycle\". Please predict a possible affordance area of the end effector?"
+
+ image = "./assets/demo/affordance_2.jpg"
+
+ pred = model.inference(prompt, image, do_sample=False)
+ print(f"Prediction: {pred}")
+
+ '''
+ Prediction: [0.600, 0.127, 0.692, 0.227]
+ '''
  ```
+
+ <div align="center">
+ <img src="https://github.com/FlagOpen/RoboBrain/blob/main/assets/demo/examples.png" />
+ </div>
+
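The predicted affordance area is returned as a box with coordinates in the 0–1 range. Assuming it is a normalized `[x1, y1, x2, y2]` box relative to the input image (a reasonable reading of the example outputs above, not an official specification), it can be mapped back to pixel coordinates like this:

```python
# Sketch: map a normalized [x1, y1, x2, y2] affordance box back to pixel coordinates.
# The normalized-xyxy interpretation is an assumption based on the example outputs above.
from PIL import Image

def to_pixel_box(norm_box, image_path):
    w, h = Image.open(image_path).size
    x1, y1, x2, y2 = norm_box
    return int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)

print(to_pixel_box([0.733, 0.158, 0.845, 0.263], "./assets/demo/affordance_1.jpg"))
```
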
+ ### 3. Usage for Trajectory Prediction
+ *Coming Soon ...*
+
+ ## <a id="Evaluation">🤖 Evaluation</a>
+ *Coming Soon ...*
+
+ <div align="center">
+ <img src="https://github.com/FlagOpen/RoboBrain/blob/main/assets/result.png" />
+ </div>
+
+ ## 😊 Acknowledgement
+
+ We would like to express our sincere gratitude to the developers and contributors of the following projects:
+ 1. [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT): A comprehensive codebase for training Vision-Language Models (VLMs).
+ 2. [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval): A powerful evaluation tool for Vision-Language Models (VLMs).
+ 3. [vllm](https://github.com/vllm-project/vllm): A high-throughput and memory-efficient inference engine for LLMs/VLMs.
+ 4. [OpenEQA](https://github.com/facebookresearch/open-eqa): A wonderful benchmark for Embodied Question Answering.
+ 5. [RoboVQA](https://github.com/google-deepmind/robovqa): Provides high-level reasoning models and datasets for robotics applications.
+
+ Their outstanding contributions have played a pivotal role in advancing our research and development initiatives.
+
+ ## 📑 Citation
+ If you find this project useful, please cite us:
+ ```bib
  @article{ji2025robobrain,
    title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
    author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
    journal={arXiv preprint arXiv:2502.21257},
    year={2025}
  }
+ ```