---
license: mit
language:
- en
---
# RDT-1B

RDT-1B is a 1B-parameter imitation learning Diffusion Transformer pre-trained on 1M+ multi-robot episodes. Given a language instruction and RGB image observations from 3 views, RDT predicts the next 64 robot actions. RDT is compatible with almost all kinds of modern mobile manipulators: single-arm or dual-arm, joint or end-effector (EEF) control, position or velocity control, and even those with a mobile chassis.

All the [code]() and pre-trained model weights are licensed under the MIT license.

Please refer to our [project page](https://rdt-robotics.github.io/rdt-robotics/) and [paper]() for more information.

## Model Details

- **Developed by:** The RDT team from Tsinghua University
- **License:** MIT
- **Language(s) (NLP):** en
- **Model Architecture:** Diffusion Transformer
- **Pre-training dataset:** A curated dataset assembled from 46 datasets. Please see [here]() for details.
- **Repository:** [repo_url]
- **Paper:** [paper_url]
- **Project Page:** https://rdt-robotics.github.io/rdt-robotics/

## Uses

RDT takes a language instruction, image observations, and proprioception as input, and predicts the next 64 robot actions in the form of a unified action space vector.
The unified action space vector includes all the main physical quantities of a robot (e.g., end-effector and joint states, positions and velocities, base movement) and can be applied to a wide range of robotic embodiments.
The pre-trained RDT model can be fine-tuned for a specific robotic embodiment and deployed on real-world robots.

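
As a rough illustration of the unified action space idea described above, the sketch below scatters an embodiment-specific state into a fixed-size vector and records which slots are filled. The vector size (`UNIFIED_DIM = 32`), the slot names, the index layout, and the split of a 14-dimensional state into 6 joints plus 1 gripper per arm are all assumptions made for illustration; the layout RDT actually uses is defined in its repository.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: the size and slot ranges below are illustrative
# assumptions, not the layout used by RDT.
UNIFIED_DIM = 32
SLOTS: Dict[str, range] = {
    "right_arm_joint_pos": range(0, 6),
    "right_gripper_open": range(6, 7),
    "left_arm_joint_pos": range(10, 16),
    "left_gripper_open": range(16, 17),
    "base_velocity": range(20, 22),
}


def to_unified(parts: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Scatter named quantities into a fixed-size vector plus a 0/1 validity mask."""
    vec = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for name, values in parts.items():
        slot = SLOTS[name]
        if len(values) != len(slot):
            raise ValueError(f"{name}: expected {len(slot)} values, got {len(values)}")
        for idx, value in zip(slot, values):
            vec[idx] = float(value)
            mask[idx] = 1
    return vec, mask


def from_unified(vec: List[float], names: List[str]) -> Dict[str, List[float]]:
    """Gather the named quantities back out of a unified vector."""
    return {name: [vec[i] for i in SLOTS[name]] for name in names}


# A 14-dimensional dual-arm state, assumed here to be 6 joints + 1 gripper per arm.
state = {
    "right_arm_joint_pos": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "right_gripper_open": [1.0],
    "left_arm_joint_pos": [-0.1, -0.2, -0.3, -0.4, -0.5, -0.6],
    "left_gripper_open": [0.0],
}
unified, mask = to_unified(state)
```

Under this scheme a robot without a mobile base simply leaves the `base_velocity` slots at zero with a zero mask, which is one way a single vector format could serve different embodiments.
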
Here's an example of how to use the RDT-1B model for inference on a Mobile-ALOHA robot:

```python
# Run this from the root of the cloned repository, after installing its dependencies
from typing import List

import torch
from PIL import Image

from scripts.agilex_model import create_model

# Names of cameras used for visual input
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
config = {
    'episode_len': 1000,  # Max length of one episode
    'state_dim': 14,      # Dimension of the robot's state
    'chunk_size': 64,     # Number of actions to predict in one step
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"

# Create the policy with the specified configuration
policy = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    control_frequency=25,
)

# Start inference process
# Load pre-computed language embeddings
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']
images: List[Image.Image] = ...  # Images from the last 2 frames
proprio = ...  # The current robot state

# Perform inference to predict the next chunk_size actions
actions = policy.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding
)
```

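
The example predicts a chunk of 64 actions per call at a control frequency of 25 Hz, so one chunk covers 64 / 25 = 2.56 s of motion. One way to use such a policy on a robot is to execute part of each chunk and then re-plan from a fresh observation. The sketch below shows that loop with a stub in place of the model. `StubPolicy`, `run_episode`, `execute_steps`, the string placeholders for images, the oldest-frame-first ordering, and the padding of the first frame are all hypothetical; only the `step(proprio, images, text_embeds)` call shape and the numeric settings come from the example.

```python
from collections import deque
from typing import Deque, List, Sequence

CHUNK_SIZE = 64          # actions predicted per call (from the example config)
CONTROL_FREQUENCY = 25   # Hz (from the example)
NUM_CAMERAS = 3          # cam_high, cam_right_wrist, cam_left_wrist
HISTORY = 2              # the example feeds images from the last 2 frames


class StubPolicy:
    """Stand-in for the real model: returns CHUNK_SIZE copies of the current state."""

    def step(self, proprio: Sequence[float], images: list, text_embeds: object) -> List[List[float]]:
        assert len(images) == NUM_CAMERAS * HISTORY
        return [list(proprio) for _ in range(CHUNK_SIZE)]


def capture(t: int) -> List[str]:
    """Placeholder observation: one fake 'image' per camera at time step t."""
    return [f"cam{c}@t{t}" for c in range(NUM_CAMERAS)]


def run_episode(policy, episode_len: int, execute_steps: int) -> int:
    """Run a receding-horizon loop and return the number of policy calls made."""
    frames: Deque[List[str]] = deque(maxlen=HISTORY)
    frames.append(capture(0))
    proprio = [0.0] * 14
    calls = 0
    t = 0
    while t < episode_len:
        # Repeat the oldest frame until the history is full (an assumption).
        history = [frames[0]] * (HISTORY - len(frames)) + list(frames)
        images = [img for frame in history for img in frame]  # oldest frame first
        actions = policy.step(proprio=proprio, images=images, text_embeds=None)
        calls += 1
        # Execute only the first `execute_steps` actions, then re-plan.
        for action in actions[:execute_steps]:
            proprio = action  # a real robot would send the action and read back its state
            t += 1
            frames.append(capture(t))
            if t >= episode_len:
                break
    return calls


chunk_seconds = CHUNK_SIZE / CONTROL_FREQUENCY
calls_full = run_episode(StubPolicy(), episode_len=1000, execute_steps=CHUNK_SIZE)
calls_half = run_episode(StubPolicy(), episode_len=1000, execute_steps=CHUNK_SIZE // 2)
```

Executing fewer actions per chunk means more model calls per episode in exchange for reacting to new observations sooner; which setting works best depends on the robot and the task.
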
RDT-1B supports fine-tuning on custom datasets, deployment and inference on real robots, and pre-training.
Please refer to [our repository](https://github.com/GeneralEmbodiedSystem/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md) for guides on all of the above.

## Citation

**BibTeX:**

[More Information Needed]