hamedrahimi committed
Commit 705e21d · verified · 1 Parent(s): 1c6ecac

Update README.md

Files changed (1)
  1. README.md +3 -0
README.md CHANGED
@@ -12,6 +12,7 @@ base_model:
 pipeline_tag: image-text-to-text
 ---
 # User-VLM 360°
+![Architecture](result-final.pdf)
 
 ## Overview
 **User-VLM 360°** is a series of personalized Vision-Language Models (VLMs) designed for social human-robot interactions. The model introduces **User-aware tuning**, addressing the **semantic gap** that arises from the misalignment between user queries and the observed scene as captured by a robot's camera. Unlike traditional instruction tuning, which introduces latency and reduces performance, **User-VLM 360°** enables **real-time, robust adaptation** in dynamic robotic environments by inherently aligning cross-modal user representations.
@@ -21,6 +22,8 @@ This model allows for **customization of open-weight VLMs** to produce **persona
 ## Training Details
 **Base Model:** User-VLM 360° is built on **PaliGemma 2**, which consists of a **SigLIP vision encoder** and **Gemma 2 as the language model**.
 
+![Deployment on Pepper](pepper2.pdf)
+
 ### Fine-tuning Process:
 1. **Base Model Tuning:**
    - Tuned the **MLP layer** to provide **user and scene descriptions** over **1 epoch**.
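
The "Base Model Tuning" step in the updated README describes tuning only the MLP layer of PaliGemma 2. Below is a minimal sketch of what that stage could look like with Hugging Face `transformers`, assuming a public PaliGemma 2 checkpoint as the base and the `multi_modal_projector` attribute as the MLP in question; the actual training data, optimizer, and schedule are not part of this commit.

```python
# Illustrative sketch of the "Base Model Tuning" step: freeze the SigLIP vision
# tower and the Gemma 2 language model, and train only the MLP projector that
# maps vision features into the language-model embedding space. The base
# checkpoint and attribute names are assumptions; only "tuned the MLP layer
# ... over 1 epoch" comes from the README.
import torch
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",  # assumed PaliGemma 2 base checkpoint
    torch_dtype=torch.bfloat16,
)

for param in model.parameters():          # freeze everything first
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():  # unfreeze the MLP projector
    param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:2]}")
```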
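Since the card's `pipeline_tag` is `image-text-to-text` and the base is PaliGemma 2, a minimal inference sketch is shown below. The repository id, image path, and prompt are placeholders, and it assumes the released weights load through the standard `AutoProcessor` / `PaliGemmaForConditionalGeneration` classes.

```python
# Minimal inference sketch. Assumptions: the released weights load through the
# standard PaliGemma classes in transformers; the repo id, image path, and
# prompt below are placeholders, not values from this commit.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "your-org/User-VLM-360"  # placeholder repo id
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("robot_camera_frame.jpg")  # placeholder frame from the robot's camera
prompt = "Describe the user and the scene."   # user/scene-aware query

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(generated[0], skip_special_tokens=True))
```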