---
license: mit
pipeline_tag: image-to-video
library_name: diffusers
---

<p align="center">
  <h1 align="center">TesserAct: Learning 4D Embodied World Models</h1>
  <p align="center">
    <a href="https://haoyuzhen.com">Haoyu Zhen*</a>,
    <a href="https://qiaosun22.github.io/">Qiao Sun*</a>,
    <a href="https://icefoxzhx.github.io/">Hongxin Zhang</a>,
    <a href="https://senfu.github.io/">Junyan Li</a>,
    <a href="https://rainbow979.github.io/">Siyuan Zhou</a>,
    <a href="https://yilundu.github.io/">Yilun Du</a>,
    <a href="https://people.csail.mit.edu/ganchuang">Chuang Gan</a>
  </p>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2504.20995">Paper</a>
  &nbsp;|&nbsp;
  <a href="https://tesseractworld.github.io">Project Page</a>
  &nbsp;|&nbsp;
  <a href="https://huggingface.co/anyeZHY/tesseract">Model on Hugging Face</a>
  &nbsp;|&nbsp;
  <a href="https://github.com/UMass-Embodied-AGI/TesserAct">Code</a>
</p>


We propose TesserAct, a 4D embodied world model that takes input images and a text instruction and generates RGB, depth,
and normal videos, reconstructing a 4D scene and predicting actions.
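
To make the link between a depth video and a 4D scene concrete, here is a minimal sketch of lifting per-frame depth maps to time-stamped 3D points. It is not TesserAct's reconstruction pipeline: it assumes a simple pinhole camera, and the function name `backproject_depth_video`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth values are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One point of the 4D scene: (x, y, z, frame index).
Point4D = Tuple[float, float, float, int]


def backproject_depth_video(
    depth_video: List[List[List[float]]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point4D]:
    """Lift a depth video (frames x rows x cols) to time-stamped 3D points.

    Assumes a pinhole camera with hypothetical intrinsics fx, fy, cx, cy
    and depth measured along the camera z-axis.
    """
    points: List[Point4D] = []
    for t, frame in enumerate(depth_video):
        for v, row in enumerate(frame):
            for u, z in enumerate(row):
                if z <= 0:  # skip invalid or missing depth
                    continue
                # Standard pinhole back-projection of pixel (u, v) at depth z.
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z, t))
    return points


# Toy 2-frame, 2x2 depth video (made-up values, not model output).
toy_depth = [
    [[1.0, 1.0], [2.0, 0.0]],
    [[1.0, 1.0], [2.0, 2.0]],
]
cloud = backproject_depth_video(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In practice the same per-pixel mapping would be vectorized and applied to the generated depth frames, with the matching RGB and normal values attached to each point; the pure-Python loop is used here only to keep the sketch dependency-free.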