---

title: TransPixelerTest
app_file: app.py
sdk: gradio
sdk_version: 5.35.0
---

## TransPixeler: Advancing Text-to-Video Generation with Transparency (CVPR 2025)
<br>
<a href="https://arxiv.org/abs/2501.03006"><img src='https://img.shields.io/badge/arXiv-2501.03006-b31b1b.svg'></a>
<a href='https://wileewang.github.io/TransPixeler'><img src='https://img.shields.io/badge/Project_Page-TransPixeler-blue'></a>
<a href='https://huggingface.co/spaces/wileewang/TransPixar'><img src='https://img.shields.io/badge/HuggingFace-TransPixeler-yellow'></a>
<a href="https://discord.gg/7Xds3Qjr"><img src="https://img.shields.io/badge/Discord-join-blueviolet?logo=discord&amp"></a>
<a href="https://github.com/wileewang/TransPixar/blob/main/wechat_group.jpg"><img src="https://img.shields.io/badge/Wechat-Join-green?logo=wechat&amp"></a>
<a href='https://openbayes.com/console/public/tutorials/tKhPalKrDb9'><img src='https://img.shields.io/badge/Demo-OpenBayes่ดๅผ่ฎก็ฎ—-blue'></a>
<br>

[Luozhou Wang*](https://wileewang.github.io/), 
[Yijun Li**](https://yijunmaverick.github.io/), 
[Zhifei Chen](), 
[Jui-Hsien Wang](http://juiwang.com/), 
[Zhifei Zhang](https://zzutk.github.io/), 
[He Zhang](https://sites.google.com/site/hezhangsprinter), 
[Zhe Lin](https://sites.google.com/site/zhelin625/home), 
[Ying-Cong Chen†](https://www.yingcong.me)

HKUST(GZ), HKUST, Adobe Research.

\* Internship Project  
\** Project Lead  
† Corresponding Author

Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes.  
We introduce TransPixeler, a method that extends pretrained video models to RGBA generation while retaining the original RGB capabilities. TransPixeler builds on a diffusion transformer (DiT) architecture, appending alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing the attention mechanism, TransPixeler preserves the strengths of the original RGB model and achieves strong alignment between the RGB and alpha channels despite limited training data.  
Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
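The joint-generation idea can be illustrated with a small, self-contained sketch. This is **not** the authors' implementation; it is a toy PyTorch example (assuming `torch` is installed, with made-up dimensions) showing the general pattern of appending alpha tokens to a DiT's RGB token sequence and adapting the shared attention projection with a zero-initialized LoRA so the pretrained RGB behavior is untouched at the start of fine-tuning.

```python
# Toy sketch of joint RGB + alpha token generation with a LoRA-adapted projection.
# Dimensions and module names are illustrative, not taken from the TransPixeler code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep the pretrained RGB weights intact
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # LoRA starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

dim, n_rgb = 64, 16
qkv = LoRALinear(nn.Linear(dim, 3 * dim))               # "pretrained" QKV projection + LoRA
rgb_tokens = torch.randn(1, n_rgb, dim)                  # tokens of the pretrained RGB branch
alpha_tokens = torch.zeros(1, n_rgb, dim)                # newly added alpha tokens
tokens = torch.cat([rgb_tokens, alpha_tokens], dim=1)    # joint sequence: RGB and alpha
q, k, v = qkv(tokens).chunk(3, dim=-1)                   # shared attention over both halves
attn = torch.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)
out = attn @ v
print(out.shape)  # torch.Size([1, 32, 64]): RGB and alpha tokens are generated jointly
```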

<!-- insert a teaser gif -->
<!-- <img src="assets/mi.gif"  width="640" /> -->



## 📰 News

- **[2025.04.28]** We have introduced a new development branch [`wan`](https://github.com/wileewang/TransPixar/tree/wan) that integrates the [Wan2.1](https://github.com/Wan-Video/Wan2.1) video generation model to support **joint generation** tasks. This branch includes training code tailored for generating both RGB and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.

- **[2025.02.26]** **TransPixeler** has been accepted to CVPR 2025! See you in Nashville!

- **[2025.01.19]** We've renamed our project from **TransPixar** to **TransPixeler**!!

- **[2025.01.17]** We've created a [Discord group](https://discord.gg/7Xds3Qjr) and a [WeChat group](https://github.com/wileewang/TransPixar/blob/main/wechat_group.jpg)! Everyone is welcome to join for discussions and collaborations.

- **[2025.01.14]** Added new tasks to the repository's roadmap, including support for Hunyuan and LTX video models, and ComfyUI integration.

- **[2025.01.07]** Released project page, arXiv paper, inference code, and Hugging Face demo.




## 🔥 New Branch for Joint Generation with Wan2.1

We have introduced a new development branch [`wan`](https://github.com/wileewang/TransPixar/tree/wan) that integrates the [Wan2.1](https://github.com/Wan-Video/Wan2.1) video generation model to support **joint generation** tasks.

In the `wan` branch, we have developed and released training code tailored for joint generation scenarios, enabling the simultaneous generation of RGB videos and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.

**Key features of the `wan` branch:**
- **Integration of Wan2.1**: Leverages the capabilities of the Wan2.1 video generation model for enhanced performance.
- **Joint Generation Support**: Facilitates the concurrent generation of RGB and paired modality videos.
- **Dataset Structure**: Expects each sample to include (see the example layout after this list):
  - A primary video file (`001.mp4`) representing the RGB content.
  - A paired secondary video file (`001_seg.mp4`) with a fixed `_seg` suffix, representing the associated modality.
  - A caption text file (`001.txt`) with the same base name as the primary video.
- **Periodic Evaluation**: Supports periodic video sampling during training by setting `eval_every_step` or `eval_every_epoch` in the configuration.
- **Customized Pipelines**: Offers tailored training and inference pipelines designed specifically for joint generation tasks.
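As a concrete illustration of the dataset convention above, a hypothetical training folder with two samples might look like the tree below. The file names are examples; what matters is the shared base name per sample and the fixed `_seg` suffix on the paired video.

```
dataset/
├── 001.mp4       # primary RGB video
├── 001_seg.mp4   # paired modality (e.g., segmentation map or alpha mask)
├── 001.txt       # caption for sample 001
├── 002.mp4
├── 002_seg.mp4
└── 002.txt
```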

👉 To use the joint generation features, please check out the [`wan`](https://github.com/wileewang/TransPixar/tree/wan) branch.




## Contents

* [Installation](#installation)
* [TransPixeler LoRA Weights](#transpixeler-lora-weights)
* [Training](#training---rgb--alpha-joint-generation)
* [Inference](#inference---gradio-demo)
* [Acknowledgement](#acknowledgement)
* [Citation](#citation)



## Installation

```bash
# For the main branch
conda create -n TransPixeler python=3.10
conda activate TransPixeler
pip install -r requirements.txt
```

**Note:**  
If you want to use the **Wan2.1 model**, please first check out the `wan` branch:

```bash
git checkout wan
```

## TransPixeler LoRA Weights

Our pipeline is designed to support various video tasks, including Text-to-RGBA Video and Image-to-RGBA Video.

We provide the following pre-trained LoRA weights:

| Task          | Base Model                                                    | Frames | LoRA weights                                                       | Inference VRAM |
|---------------|---------------------------------------------------------------|--------|--------------------------------------------------------------------|----------------|
| T2V + RGBA    | [THUDM/CogVideoX-5B](https://huggingface.co/THUDM/CogVideoX-5b)       | 49     | [link](https://huggingface.co/wileewang/TransPixar/blob/main/cogvideox_rgba_lora.safetensors) | ~24GB          |

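To use a weight from the table above locally, one option (a sketch assuming the `huggingface_hub` package is installed) is to download the file programmatically and pass the resulting path to the inference scripts:

```python
# Sketch: fetch the released CogVideoX RGBA LoRA from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

lora_path = hf_hub_download(
    repo_id="wileewang/TransPixar",
    filename="cogvideox_rgba_lora.safetensors",
)
print(lora_path)  # e.g., pass this path to cli.py via --lora_path
```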

## Training - RGB + Alpha Joint Generation
We have open-sourced the training code for RGBA joint generation based on **Mochi**. Please refer to the [Mochi README](Mochi/README.md) for details.


## Inference - Gradio Demo
In addition to the [Hugging Face online demo](https://huggingface.co/spaces/wileewang/TransPixar), users can also launch a local inference demo based on CogVideoX-5B by running the following command:

```bash
python app.py
```

## Inference - Command Line Interface (CLI)
To generate RGBA videos, navigate to the corresponding directory for the video model and execute the following command:
```bash
python cli.py \
    --lora_path /path/to/lora \
    --prompt "..."
```

---

## Acknowledgement

* [finetrainers](https://github.com/a-r-r-o-w/finetrainers): We followed their implementation of Mochi training and inference.
* [CogVideoX](https://github.com/THUDM/CogVideo): We followed their implementation of CogVideoX training and inference.

We are grateful for their exceptional work and generous contribution to the open-source community.

## Citation

```bibtex
@misc{wang2025transpixeler,
      title={TransPixeler: Advancing Text-to-Video Generation with Transparency},
      author={Luozhou Wang and Yijun Li and Zhifei Chen and Jui-Hsien Wang and Zhifei Zhang and He Zhang and Zhe Lin and Ying-Cong Chen},
      year={2025},
      eprint={2501.03006},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.03006},
}
```

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=wileewang/TransPixeler&type=Date)](https://star-history.com/#wileewang/TransPixeler&Date)