{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "n3ryhkSfIEfl" }, "source": [ "# Video Tokenization Using [NVIDIA Cosmos Tokenizer](https://github.com/nvidia-cosmos/cosmos-predict1/blob/main/cosmos_predict1/models/tokenizer) | [](https://colab.research.google.com/github/nvidia-cosmos/cosmos-predict1/blob/main/cosmos_predict1/models/tokenizer/notebook/Video_Tokenization.ipynb)\n", "\n", "The Jupyter Notebook example utilizes the **Cosmos-Tokenizer** pretrained models, which include Continuous Video (CV) tokenizers that transform videos into continuous spatio-temporal latents and Discrete Video (DI) tokenizers that transform videos into discrete tokens. Both CV and DV tokenizers are available with compression rates of (`TxHxW` format) 4x8x8 and 8x8x8, and 8x16x16. For instance, **CV4x8x8** effectively downsizes the number of frames by a factor of 4 and both height and width by a factor of 8.\n", "\n", "Within the notebook, the `VideoTokenizer` class from the `cosmos_tokenizer.video_lib` module is employed to manage the encoder and decoder components of this model. The encoder compresses the input video into a condensed latent representation or discrete integers, while the decoder reconstructs the video from this latent representation or discrete integers.\n", "\n", "This instance of the Cosmos Tokenizer demonstrates its autoencoding capability: compressing a video into a smaller latent space and subsequently reconstructing it to its original form. 
This showcases how video tokenization achieves significant spatio-temporal compression while still reconstructing videos faithfully, a highly desirable property for generative modeling.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5BkjyLTPLM6e" }, "source": [ "This tutorial follows a simple, step-by-step approach, making it easy to understand and adapt.\n", "\n", "## Step 1: Clone the Cosmos Tokenizer Repository" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TEV88M9YG973" }, "outputs": [], "source": [ "!git clone https://github.com/nvidia-cosmos/cosmos-predict1.git" ] }, { "cell_type": "markdown", "metadata": { "id": "AxOMEJpFL9QL" }, "source": [ "## Step 2: Install **Cosmos-Tokenizer**\n", "Before proceeding, make sure the repository from Step 1 is available, then use the following commands to install its dependencies:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XuwUR6HrIxD8" }, "outputs": [], "source": [ "# Step 2: Install Cosmos-Tokenizer and its Python dependencies.\n", "import os\n", "if os.path.exists(\"cosmos-predict1\"):\n", "    os.chdir(\"cosmos-predict1\")\n", "    !apt-get update\n", "    !apt-get install -y git-lfs\n", "    !git lfs pull\n", "    %pip install -r requirements.txt\n", "else:\n", "    print(\"cosmos-predict1 not found. Run Step 1 to clone the repository first.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "id29RPiyMOtB" }, "source": [ "## Step 3: Set Up Hugging Face API Token and Download Pretrained Models\n", "\n", "In this step, you'll configure the Hugging Face API token and download the pretrained model weights required for the **Cosmos Tokenizer**.\n", "\n", "1. 
**Ensure You Have a Hugging Face Account** \n", " If you do not already have a Hugging Face account, follow these steps to create one and generate an API token:\n", " - Go to the [Hugging Face website](https://huggingface.co/) and sign up for a free account.\n", " - After logging in, navigate to your [Settings → Access Tokens](https://huggingface.co/settings/tokens).\n", " - Click on \"New Token\" to generate an API token with the required permissions.\n", "\n", "2. **Set the Hugging Face Token** \n", " Check if the Hugging Face token is already set in the environment variables. If not, you will be prompted to enter it manually. The token is essential to authenticate and access the Hugging Face models.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "joxcyOlnM7HQ" }, "outputs": [], "source": [ "# Check if the token is already set\n", "if \"HUGGINGFACE_TOKEN\" not in os.environ:\n", " os.environ[\"HUGGINGFACE_TOKEN\"] = input(\"Please enter your Hugging Face API token: \")\n", "!git config --global credential.helper store" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Lq7MAQ9pGPH9" }, "outputs": [], "source": [ "from huggingface_hub import login, snapshot_download\n", "import os\n", "HUGGINGFACE_TOKEN = os.environ.get(\"HUGGINGFACE_TOKEN\")\n", "login(token=HUGGINGFACE_TOKEN, add_to_git_credential=True)\n", "model_names = [\n", " \"Cosmos-0.1-Tokenizer-CV4x8x8\",\n", " \"Cosmos-0.1-Tokenizer-CV8x8x8\",\n", " \"Cosmos-0.1-Tokenizer-CV8x16x16\",\n", " \"Cosmos-0.1-Tokenizer-DV4x8x8\",\n", " \"Cosmos-0.1-Tokenizer-DV8x8x8\",\n", " \"Cosmos-0.1-Tokenizer-DV8x16x16\",\n", " \"Cosmos-Tokenize1-CV8x8x8-720p\",\n", " \"Cosmos-Tokenize1-DV8x16x16-720p\",\n", "]\n", "for model_name in model_names:\n", " hf_repo = \"nvidia/\" + model_name\n", " local_dir = \"checkpoints/\" + model_name\n", " os.makedirs(local_dir, exist_ok=True)\n", " print(f\"downloading {model_name}...\")\n", " snapshot_download(repo_id=hf_repo, 
local_dir=local_dir)" ] }, { "cell_type": "markdown", "metadata": { "id": "ltZ-v2vzNv74" }, "source": [ "## Step 4: Use Cosmos Tokenizer for Video Reconstruction\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "colab": { "base_uri": "https://localhost:8080/", "height": 594 }, "id": "gZFPrGCBGwtC", "outputId": "ad18dc16-c1f2-410c-937b-787c677ec27e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:19<00:00, 6.45s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Input video read from:\t /home/freda/Cosmos/cosmos1/models/tokenizer/test_data/video.mp4\n", "Reconstruction saved:\t /home/freda/Cosmos/cosmos1/models/tokenizer/test_data/video_CV8x8x8.mp4\n" ] }, { "data": { "text/html": [ "
\n",
" \n",
" Input Video | \n",
" \n",
" Reconstructed Video |