{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "n3ryhkSfIEfl" }, "source": [ "# Video Tokenization Using [NVIDIA Cosmos Tokenizer](https://github.com/nvidia-cosmos/cosmos-predict1/blob/main/cosmos_predict1/models/tokenizer) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nvidia-cosmos/cosmos-predict1/blob/main/cosmos_predict1/models/tokenizer/notebook/Video_Tokenization.ipynb)\n", "\n", "This Jupyter notebook example uses the **Cosmos-Tokenizer** pretrained models, which include Continuous Video (CV) tokenizers that transform videos into continuous spatio-temporal latents and Discrete Video (DV) tokenizers that transform videos into discrete tokens. Both CV and DV tokenizers are available at compression rates (in `TxHxW` format) of 4x8x8, 8x8x8, and 8x16x16. For instance, **CV4x8x8** downsamples the number of frames by a factor of 4 and both height and width by a factor of 8.\n", "\n", "Within the notebook, the `CausalVideoTokenizer` class from the `cosmos_predict1.tokenizer.inference.video_lib` module manages the encoder and decoder components of the model. The encoder compresses the input video into a compact latent representation (CV) or discrete integer tokens (DV), while the decoder reconstructs the video from that representation.\n", "\n", "This example demonstrates the Cosmos Tokenizer's autoencoding capability: compressing a video into a smaller latent space and then reconstructing it to its original form. It showcases high-fidelity video reconstruction under significant spatio-temporal compression, a highly desirable property for generative modeling.\n" ] },
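{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration of what these compression rates mean for tensor sizes, the sketch below estimates the latent dimensions of a clip under each available rate. This is a back-of-the-envelope helper written for this notebook, not a library function; in particular, the `(T - 1) / t + 1` temporal term assumes the causal convention in which the first frame is tokenized on its own.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Back-of-the-envelope latent shapes for a TxHxW compression rate.\n", "# Illustration only: the \"+ 1\" temporal term assumes the causal convention\n", "# in which the first frame is tokenized on its own.\n", "def latent_shape(num_frames, height, width, rate=(4, 8, 8)):\n", "    t, h, w = rate\n", "    return ((num_frames - 1) // t + 1, height // h, width // w)\n", "\n", "# e.g. a 49-frame 720x1280 clip under each available rate:\n", "for rate in [(4, 8, 8), (8, 8, 8), (8, 16, 16)]:\n", "    print(rate, \"->\", latent_shape(49, 720, 1280, rate))" ] },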
{ "cell_type": "markdown", "metadata": { "id": "5BkjyLTPLM6e" }, "source": [ "This tutorial follows a simple, step-by-step approach, making it easy to understand and adapt.\n", "\n", "## Step 1: Clone the Cosmos Tokenizer Repository" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TEV88M9YG973" }, "outputs": [], "source": [ "!git clone https://github.com/NVIDIA-Cosmos/cosmos-predict1.git" ] }, { "cell_type": "markdown", "metadata": { "id": "AxOMEJpFL9QL" }, "source": [ "## Step 2: Install **Cosmos-Tokenizer**\n", "Before proceeding, ensure the **Cosmos Tokenizer** dependencies are installed. If you cloned the repository in Step 1, the following cell changes into the repository, fetches the Git LFS assets, and installs the Python requirements:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XuwUR6HrIxD8" }, "outputs": [], "source": [ "# Step 2: Install Cosmos-Tokenizer and its Python dependencies.\n", "import os\n", "\n", "if os.path.exists(\"cosmos-predict1\"):\n", "    os.chdir(\"cosmos-predict1\")\n", "    !apt-get update\n", "    !apt-get install -y git-lfs\n", "    !git lfs pull\n", "    %pip install -r requirements.txt\n", "else:\n", "    print(\"cosmos-predict1 not found in the current directory; either Step 1 has not been run, or you are already inside the repository and the dependencies are installed.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "id29RPiyMOtB" }, "source": [ "## Step 3: Set Up Hugging Face API Token and Download Pretrained Models\n", "\n", "In this step, you'll configure the Hugging Face API token and download the pretrained model weights required for the **Cosmos Tokenizer**.\n", "\n", "1. **Ensure You Have a Hugging Face Account**  \n", "   If you do not already have a Hugging Face account, follow these steps to create one and generate an API token:\n", "   - Go to the [Hugging Face website](https://huggingface.co/) and sign up for a free account.\n", "   - After logging in, navigate to your [Settings → Access Tokens](https://huggingface.co/settings/tokens).\n", "   - Click \"New Token\" to generate an API token with the required permissions.\n", "\n", "2. **Set the Hugging Face Token**  \n", "   The next cell checks whether the token is already set in the environment variables and prompts you for it otherwise. The token is required to authenticate with Hugging Face and download the models.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "joxcyOlnM7HQ" }, "outputs": [], "source": [ "# Check if the token is already set; prompt for it if not.\n", "if \"HUGGINGFACE_TOKEN\" not in os.environ:\n", "    os.environ[\"HUGGINGFACE_TOKEN\"] = input(\"Please enter your Hugging Face API token: \")\n", "!git config --global credential.helper store" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Lq7MAQ9pGPH9" }, "outputs": [], "source": [ "from huggingface_hub import login, snapshot_download\n", "import os\n", "\n", "HUGGINGFACE_TOKEN = os.environ.get(\"HUGGINGFACE_TOKEN\")\n", "login(token=HUGGINGFACE_TOKEN, add_to_git_credential=True)\n", "\n", "model_names = [\n", "    \"Cosmos-0.1-Tokenizer-CV4x8x8\",\n", "    \"Cosmos-0.1-Tokenizer-CV8x8x8\",\n", "    \"Cosmos-0.1-Tokenizer-CV8x16x16\",\n", "    \"Cosmos-0.1-Tokenizer-DV4x8x8\",\n", "    \"Cosmos-0.1-Tokenizer-DV8x8x8\",\n", "    \"Cosmos-0.1-Tokenizer-DV8x16x16\",\n", "    \"Cosmos-Tokenize1-CV8x8x8-720p\",\n", "    \"Cosmos-Tokenize1-DV8x16x16-720p\",\n", "]\n", "for model_name in model_names:\n", "    hf_repo = \"nvidia/\" + model_name\n", "    local_dir = \"checkpoints/\" + model_name\n", "    os.makedirs(local_dir, exist_ok=True)\n", "    print(f\"downloading {model_name}...\")\n", "    snapshot_download(repo_id=hf_repo, local_dir=local_dir)" ] },
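{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, it can help to confirm that each download produced the JIT-compiled encoder and decoder files that Step 4 loads. This is a plain filesystem check written for this notebook; it reuses the `model_names` list from the previous cell, and the only other assumption is the `encoder.jit`/`decoder.jit` filenames used below.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sanity-check the downloaded checkpoints: Step 4 loads encoder.jit and\n", "# decoder.jit from each checkpoint directory.\n", "import os\n", "\n", "for name in model_names:\n", "    ckpt_dir = os.path.join(\"checkpoints\", name)\n", "    missing = [f for f in (\"encoder.jit\", \"decoder.jit\")\n", "               if not os.path.exists(os.path.join(ckpt_dir, f))]\n", "    print(f\"{name}: {'OK' if not missing else 'MISSING ' + str(missing)}\")" ] },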
{ "cell_type": "markdown", "metadata": { "id": "ltZ-v2vzNv74" }, "source": [ "## Step 4: Use Cosmos Tokenizer for Video Reconstruction\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "colab": { "base_uri": "https://localhost:8080/", "height": 594 }, "id": "gZFPrGCBGwtC", "outputId": "ad18dc16-c1f2-410c-937b-787c677ec27e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:19<00:00,  6.45s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Input video read from:\t /home/freda/Cosmos/cosmos1/models/tokenizer/test_data/video.mp4\n", "Reconstruction saved:\t /home/freda/Cosmos/cosmos1/models/tokenizer/test_data/video_CV8x8x8.mp4\n" ] }, { "data": { "text/html": [ "[Embedded video players removed: Input Video (left) | Reconstructed Video (right)]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# @title In this step, load the required checkpoints and perform video reconstruction. {\"run\":\"auto\"}\n", "import os\n", "\n", "import numpy as np\n", "import mediapy as media\n", "\n", "from cosmos_predict1.tokenizer.inference.video_lib import CausalVideoTokenizer\n", "\n", "\n", "# 1) Specify the model name and the paths to the encoder/decoder checkpoints.\n", "model_name = 'Cosmos-Tokenize1-CV8x8x8-720p'  # @param [\"Cosmos-0.1-Tokenizer-CV4x8x8\", \"Cosmos-0.1-Tokenizer-CV8x8x8\", \"Cosmos-0.1-Tokenizer-CV8x16x16\", \"Cosmos-0.1-Tokenizer-DV4x8x8\", \"Cosmos-0.1-Tokenizer-DV8x8x8\", \"Cosmos-0.1-Tokenizer-DV8x16x16\", \"Cosmos-Tokenize1-CV8x8x8-720p\", \"Cosmos-Tokenize1-DV8x16x16-720p\"]\n", "temporal_window = 49  # @param {type:\"slider\", min:1, max:121, step:8}\n", "\n", "encoder_ckpt = f\"checkpoints/{model_name}/encoder.jit\"\n", "decoder_ckpt = f\"checkpoints/{model_name}/decoder.jit\"\n", "\n", "# 2) Load or provide the video filename you want to tokenize & reconstruct.\n", "input_filepath = \"cosmos_predict1/tokenizer/test_data/video.mp4\"\n", "\n", "# 3) Read the video from disk (shape = T x H x W x 3, RGB, uint8).\n", "input_video = media.read_video(input_filepath)[..., :3]\n", "assert input_video.ndim == 4 and input_video.shape[-1] == 3, \"Frames must have shape T x H x W x 3\"\n", "\n", "# 4) Expand dimensions to B x T x H x W x C, since the CausalVideoTokenizer expects\n", "#    a batch dimension in the input. (Batch size = 1 in this example.)\n", "batched_input_video = np.expand_dims(input_video, axis=0)\n", "\n", "# 5) Create the CausalVideoTokenizer instance with the encoder & decoder.\n", "#    - device=\"cuda\" uses the GPU\n", "#    - dtype=\"bfloat16\" expects an Ampere or newer GPU (A100, RTX 30xx, etc.)\n", "tokenizer = CausalVideoTokenizer(\n", "    checkpoint_enc=encoder_ckpt,\n", "    checkpoint_dec=decoder_ckpt,\n", "    device=\"cuda\",\n", "    dtype=\"bfloat16\",\n", ")\n", "\n", "# 6) Use the tokenizer to autoencode (encode & decode) the video.\n", "#    The output is a NumPy array with shape = B x T x H x W x C, range [0..255].\n", "batched_output_video = tokenizer(batched_input_video,\n", "                                 temporal_window=temporal_window)\n", "\n", "# 7) Extract the single video from the batch (index 0).\n", "output_video = batched_output_video[0]\n", "\n", "# 8) Save the reconstructed video to disk.\n", "input_dir, input_filename = os.path.split(input_filepath)\n", "filename, ext = os.path.splitext(input_filename)\n", "output_filepath = f\"{input_dir}/{filename}_{model_name.split('-')[-1]}{ext}\"\n", "media.write_video(output_filepath, output_video)\n", "print(\"Input video read from:\\t\", f\"{os.getcwd()}/{input_filepath}\")\n", "print(\"Reconstruction saved:\\t\", f\"{os.getcwd()}/{output_filepath}\")\n", "\n", "# 9) Visualize the input video (left) and the reconstruction (right).\n", "media.show_videos([input_video, output_video], [\"Input Video\", \"Reconstructed Video\"], height=480)" ] }
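, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell above exercises the end-to-end autoencoding path. To work with the latents themselves (for example, to feed a downstream generative model), the encoder and decoder can also be called separately. The sketch below follows the `encode`/`decode` interface shown in the Cosmos-Tokenizer README; the exact signatures, the `(latent,)` unpacking for CV models, and the dummy input shape are assumptions to verify against your installed version.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of driving the encoder and decoder separately, assuming the\n", "# encode()/decode() interface from the Cosmos-Tokenizer README carries over to\n", "# cosmos_predict1. Verify the signatures against your installed version.\n", "import torch\n", "\n", "# Dummy input of shape B x C x T x H x W; 9 = 8*1 + 1 frames keeps the causal\n", "# temporal arithmetic clean for an 8x temporal compression factor.\n", "dummy = torch.randn(1, 3, 9, 512, 512, device=\"cuda\", dtype=torch.bfloat16)\n", "\n", "(latent,) = tokenizer.encode(dummy)  # CV models return a continuous latent;\n", "                                     # DV models return (indices, codes) instead.\n", "print(\"latent shape:\", latent.shape)  # e.g. T'=2, H'=W'=64 for an 8x8x8 model\n", "\n", "reconstructed = tokenizer.decode(latent)\n", "print(\"reconstruction shape:\", reconstructed.shape)" ] }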
], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }