# Robot Data Augmentation with Cosmos-Transfer1

This pipeline provides a two-step process to augment robotic videos using **Cosmos-Transfer1-7B**. It leverages **spatial-temporal control** to modify backgrounds while preserving the shape and/or appearance of the robot foreground.

## Overview of Settings

We propose two augmentation settings:

### Setting 1 (fg_vis_edge_bg_seg): Preserve Shape and Appearance of the Robot (foreground)
- **Foreground Controls**: `Edge`, `Vis`
- **Background Controls**: `Segmentation`
- **Weights**:
  - `w_edge(FG) = 1`
  - `w_vis(FG) = 1`
  - `w_seg(BG) = 1`
  - All other weights = 0

### Setting 2 (fg_edge_bg_seg): Preserve Only Shape of the Robot (foreground)
- **Foreground Controls**: `Edge`
- **Background Controls**: `Segmentation`
- **Weights**:
  - `w_edge(FG) = 1`
  - `w_seg(BG) = 1`
  - All other weights = 0
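
Conceptually, each setting turns a binary foreground (robot) mask into per-modality spatial weight maps. The NumPy sketch below is illustrative only; the actual `spatial_temporal_weight.py` script additionally handles per-frame masks, video I/O, and the output format expected by inference:

```python
import numpy as np

def build_weight_maps(fg_mask: np.ndarray, setting: str) -> dict:
    """Turn a binary foreground (robot) mask into per-control weight maps.

    fg_mask: H x W array, 1 where the robot is, 0 for background.
    Returns a dict of H x W float32 weight maps, one per control modality.
    """
    fg = fg_mask.astype(np.float32)
    bg = 1.0 - fg
    if setting == "setting1":    # fg_vis_edge_bg_seg: keep shape and appearance
        return {"vis": fg, "edge": fg, "seg": bg}
    elif setting == "setting2":  # fg_edge_bg_seg: keep shape only
        return {"vis": np.zeros_like(fg), "edge": fg, "seg": bg}
    raise ValueError(f"unknown setting: {setting}")

mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1  # toy robot blob in the center
w = build_weight_maps(mask, "setting1")
```

In both settings every pixel is covered by exactly one control per region: the robot pixels are constrained by Edge (and Vis in setting 1), while background pixels are free to change under Segmentation guidance.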

## Step-by-Step Instructions

### Step 1: Generate Spatial-Temporal Weights

This script extracts foreground (robot) and background information from semantic segmentation data. It processes per-frame segmentation masks and color-to-class mappings to generate spatial-temporal weight matrices for each control modality based on the selected setting.

#### Input Requirements:
- A `segmentation` folder containing per-frame segmentation masks in PNG format
- A `segmentation_label` folder containing color-to-class mapping JSON files for each frame, for example:
  ```json
  {
      "(29, 0, 0, 255)": {
          "class": "gripper0_right_r_palm_vis"
      },
      "(31, 0, 0, 255)": {
          "class": "gripper0_right_R_thumb_proximal_base_link_vis"
      },
      "(33, 0, 0, 255)": {
          "class": "gripper0_right_R_thumb_proximal_link_vis"
      }
  }
  ```
- An input video file

Here is an example input format:
[Example input directory](https://github.com/google-deepmind/cosmos/tree/main/assets/robot_augmentation_example/example1)
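
The mapping files shown above can be parsed to decide which segmentation colors belong to the robot. A minimal sketch of that step (the function name is illustrative, not the script's actual API; matching follows the `--robot-keywords` behavior described below):

```python
import json

ROBOT_KEYWORDS = ["world_robot", "gripper", "robot"]

def robot_colors(label_json: str, keywords=ROBOT_KEYWORDS) -> set:
    """Return the RGBA color tuples whose class names contain a robot keyword."""
    mapping = json.loads(label_json)
    colors = set()
    for color_str, info in mapping.items():
        cls = info["class"].lower()
        if any(kw in cls for kw in keywords):
            # keys look like "(29, 0, 0, 255)" -> parse into an int tuple
            colors.add(tuple(int(v) for v in color_str.strip("()").split(",")))
    return colors

sample = (
    '{"(29, 0, 0, 255)": {"class": "gripper0_right_r_palm_vis"},'
    ' "(40, 0, 0, 255)": {"class": "table_top"}}'
)
fg_colors = robot_colors(sample)  # only the gripper color matches
```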

#### Usage

```bash
PYTHONPATH=$(pwd) python cosmos_transfer1/auxiliary/robot_augmentation/spatial_temporal_weight.py \
    --setting setting1 \
    --robot-keywords world_robot gripper robot \
    --input-dir assets/robot_augmentation_example \
    --output-dir outputs/robot_augmentation_example
```

#### Parameters:

* `--setting`: Weight setting to use (choices: 'setting1', 'setting2', default: 'setting1')
  * setting1: Emphasizes robot in visual and edge features (vis: 1.0 foreground, edge: 1.0 foreground, seg: 1.0 background)
  * setting2: Emphasizes robot only in edge features (edge: 1.0 foreground, seg: 1.0 background)

* `--input-dir`: Input directory containing example folders
  * Default: 'assets/robot_augmentation_example'

* `--output-dir`: Output directory for weight matrices
  * Default: 'outputs/robot_augmentation_example'

* `--robot-keywords`: Keywords used to identify robot classes
  * Default: ["world_robot", "gripper", "robot"]
  * Any semantic class containing these keywords will be treated as robot foreground
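
Putting the pieces together, a per-frame foreground mask can be derived from a segmentation PNG once the robot colors are known. A minimal sketch under the assumption that the frame is already loaded as an H x W x 4 RGBA array (e.g. via Pillow); the function name is illustrative:

```python
import numpy as np

def frame_fg_mask(seg_rgba: np.ndarray, robot_colors: set) -> np.ndarray:
    """Binary H x W mask: 1 where a pixel's RGBA value belongs to a robot class.

    seg_rgba: H x W x 4 uint8 array (a segmentation frame).
    robot_colors: set of (R, G, B, A) tuples identified as robot classes.
    """
    mask = np.zeros(seg_rgba.shape[:2], dtype=np.uint8)
    for color in robot_colors:
        # OR in all pixels that exactly match this robot color
        mask |= np.all(seg_rgba == np.asarray(color, dtype=np.uint8), axis=-1)
    return mask

# Toy 1x2 frame: left pixel is a robot color, right pixel is not.
frame = np.array([[[29, 0, 0, 255], [40, 0, 0, 255]]], dtype=np.uint8)
fg = frame_fg_mask(frame, {(29, 0, 0, 255)})
```

Such a mask (one per frame) is what the weight settings above are applied to when producing the spatial-temporal weight matrices.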

### Step 2: Run Cosmos-Transfer1 Inference

Use the generated spatial-temporal weight matrices to perform video augmentation with the proper controls.

```bash
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"

PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 \
cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/robot_example_spatial_temporal_setting1 \
    --controlnet_specs assets/robot_augmentation_example/example1/inference_cosmos_transfer1_robot_spatiotemporal_weights.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU
```

- Augmented videos are saved in `outputs/robot_example_spatial_temporal_setting1/`

## Input and Output Examples

Input video:

<video src="https://github.com/user-attachments/assets/9c2df99d-7d0c-4dcf-af87-4ec9f65328ed">
  Your browser does not support the video tag.
</video>

You can run inference multiple times with different prompts (e.g., `assets/robot_augmentation_example/example1/example1_prompts.json`) to get different augmentation results:

<video src="https://github.com/user-attachments/assets/6dee15f5-9d8b-469a-a92a-3419cb466d44">
  Your browser does not support the video tag.
</video>