Robot Data Augmentation with Cosmos-Transfer1

This pipeline provides a two-step process to augment robotic videos using Cosmos-Transfer1-7B. It leverages spatial-temporal control to modify backgrounds while preserving the shape and/or appearance of the robot foreground.

Overview of Settings

We propose two augmentation settings; a short sketch after the two lists below shows how each setting expands into per-pixel weight maps:

Setting 1 (fg_vis_edge_bg_seg): Preserve Shape and Appearance of the Robot (foreground)

  • Foreground Controls: Edge, Vis
  • Background Controls: Segmentation
  • Weights:
    • w_edge(FG) = 1
    • w_vis(FG) = 1
    • w_seg(BG) = 1
    • All other weights = 0

Setting 2 (fg_edge_bg_seg): Preserve Only Shape of the Robot (foreground)

  • Foreground Controls: Edge
  • Background Controls: Segmentation
  • Weights:
    • w_edge(FG) = 1
    • w_seg(BG) = 1
    • All other weights = 0
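
Conceptually, each setting expands into one spatial weight map per control modality: the foreground weight applies on robot pixels and the background weight everywhere else. A minimal sketch of that expansion (NumPy; the SETTINGS table and helper are illustrative, not the repository's code):

import numpy as np

# (foreground weight, background weight) per control modality.
SETTINGS = {
    "setting1": {"vis": (1.0, 0.0), "edge": (1.0, 0.0), "seg": (0.0, 1.0)},
    "setting2": {"edge": (1.0, 0.0), "seg": (0.0, 1.0)},
}

def weight_matrices(fg_mask, setting):
    """Expand a binary robot mask (H, W) into one weight map per modality."""
    cfg = SETTINGS[setting]
    return {modality: w_fg * fg_mask + w_bg * (1.0 - fg_mask)
            for modality, (w_fg, w_bg) in cfg.items()}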

Step-by-Step Instructions

Step 1: Generate Spatial-Temporal Weights

This script extracts foreground (robot) and background information from semantic segmentation data. It processes per-frame segmentation masks and color-to-class mappings to generate spatial-temporal weight matrices for each control modality based on the selected setting.

Input Requirements:

  • A segmentation folder containing per-frame segmentation masks in PNG format
  • A segmentation_label folder containing color-to-class mapping JSON files for each frame, for example:
    {
        "(29, 0, 0, 255)": {
            "class": "gripper0_right_r_palm_vis"
        },
        "(31, 0, 0, 255)": {
            "class": "gripper0_right_R_thumb_proximal_base_link_vis"
        },
        "(33, 0, 0, 255)": {
            "class": "gripper0_right_R_thumb_proximal_link_vis"
        }
    }
    
  • An input video file

An example of this input layout is provided in the example input directory under assets/robot_augmentation_example.
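
For illustration, a robot foreground mask could be recovered from these two inputs roughly as follows (a hedged sketch; the helper below and its parsing details are assumptions, not the actual script):

import json

import numpy as np
from PIL import Image

ROBOT_KEYWORDS = ["world_robot", "gripper", "robot"]

def robot_mask(seg_png, label_json):
    """Return a binary (H, W) mask of robot pixels for a single frame."""
    seg = np.array(Image.open(seg_png).convert("RGBA"))  # (H, W, 4)
    with open(label_json) as f:
        color_to_class = json.load(f)
    mask = np.zeros(seg.shape[:2], dtype=bool)
    for color_str, info in color_to_class.items():
        # Keys look like "(29, 0, 0, 255)"; values hold the class name.
        if any(kw in info["class"] for kw in ROBOT_KEYWORDS):
            rgba = [int(c) for c in color_str.strip("()").split(",")]
            mask |= np.all(seg == rgba, axis=-1)
    return mask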

Usage

PYTHONPATH=$(pwd) python cosmos_transfer1/auxiliary/robot_augmentation/spatial_temporal_weight.py \
    --setting setting1 \
    --robot-keywords world_robot gripper robot \
    --input-dir assets/robot_augmentation_example \
    --output-dir outputs/robot_augmentation_example

Parameters:

  • --setting: Weight setting to use (choices: 'setting1', 'setting2', default: 'setting1'); an example invocation follows this list

    • setting1: Emphasizes robot in visual and edge features (vis: 1.0 foreground, edge: 1.0 foreground, seg: 1.0 background)
    • setting2: Emphasizes robot only in edge features (edge: 1.0 foreground, seg: 1.0 background)
  • --input-dir: Input directory containing example folders

    • Default: 'assets/robot_augmentation_example'
  • --output-dir: Output directory for weight matrices

    • Default: 'outputs/robot_augmentation_example'
  • --robot-keywords: Keywords used to identify robot classes

    • Default: ["world_robot", "gripper", "robot"]
    • Any semantic class containing these keywords will be treated as robot foreground
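
For example, to generate Setting 2 weights with a narrower keyword list and a separate output directory:

PYTHONPATH=$(pwd) python cosmos_transfer1/auxiliary/robot_augmentation/spatial_temporal_weight.py \
    --setting setting2 \
    --robot-keywords robot gripper \
    --input-dir assets/robot_augmentation_example \
    --output-dir outputs/robot_augmentation_example_setting2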

Step 2: Run Cosmos-Transfer1 Inference

Use the spatial-temporal weight matrices generated in Step 1 to perform video augmentation with the corresponding controls.
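
The controlnet spec ties the Step 1 weight matrices to each control. As a rough illustration only, such a spec might look like the JSON below; the exact schema and the weight-file paths are assumptions here, so consult the shipped inference_cosmos_transfer1_robot_spatiotemporal_weights.json for the authoritative format:

{
    "prompt": "A robot arm operating in a sunlit industrial workshop",
    "input_video_path": "assets/robot_augmentation_example/example1/input_video.mp4",
    "edge": {
        "control_weight": "outputs/robot_augmentation_example/example1/edge_weights.pt"
    },
    "vis": {
        "control_weight": "outputs/robot_augmentation_example/example1/vis_weights.pt"
    },
    "seg": {
        "control_weight": "outputs/robot_augmentation_example/example1/seg_weights.pt"
    }
}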

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"

PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 \
cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/robot_example_spatial_temporal_setting1 \
    --controlnet_specs assets/robot_augmentation_example/example1/inference_cosmos_transfer1_robot_spatiotemporal_weights.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU
  • Augmented videos are saved in outputs/robot_example_spatial_temporal_setting1/
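
To shard inference across several GPUs, override the environment variables before running the command above, e.g.:

export CUDA_VISIBLE_DEVICES=0,1,2,3
export NUM_GPU=4

torchrun then launches one inference process per visible GPU.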

Input and Output Examples

Input video:

You can run inference multiple times with different prompts (e.g., assets/robot_augmentation_example/example1/example1_prompts.json) to obtain different augmentation results.
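
A hedged sketch of such a prompt sweep, assuming example1_prompts.json is a flat JSON list of prompt strings and that the controlnet spec carries a top-level "prompt" field (both are assumptions; check the shipped files):

import json
import os

example_dir = "assets/robot_augmentation_example/example1"

# Assumption: the prompts file is a flat JSON list of prompt strings.
with open(os.path.join(example_dir, "example1_prompts.json")) as f:
    prompts = json.load(f)

with open(os.path.join(
        example_dir,
        "inference_cosmos_transfer1_robot_spatiotemporal_weights.json")) as f:
    spec = json.load(f)

os.makedirs("outputs/prompt_sweep", exist_ok=True)
for i, prompt in enumerate(prompts):
    spec["prompt"] = prompt  # assumption: the spec carries a top-level prompt
    with open(f"outputs/prompt_sweep/spec_{i}.json", "w") as f:
        json.dump(spec, f, indent=4)

# Then run the Step 2 torchrun command once per generated spec,
# pointing --controlnet_specs at outputs/prompt_sweep/spec_<i>.json.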