--- license: mit pipeline_tag: image-to-video library_name: diffusers ---
Haoyu Zhen*, Qiao Sun*, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
We propose TesserAct, the 4D Embodied World Model, which takes input images and text instruction to generate RGB, depth, and normal videos, reconstructing a 4D scene and predicting actions.