{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "n3ryhkSfIEfl" }, "source": [ "# Image Tokenization Using [NVIDIA Cosmos Tokenizer](https://github.com/NVIDIA-Cosmos/cosmos-predict1/blob/main/cosmos1/models/tokenizer) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nvidia-cosmos/cosmos-predict1/blob/main/cosmos_predict1/models/tokenizer/notebook/Image_Tokenization.ipynb)\n", "\n", "This notebook uses the pretrained **Cosmos-Tokenizer** models: Continuous Image (CI) tokenizers, which transform images into continuous latents, and Discrete Image (DI) tokenizers, which transform images into discrete tokens. Both CI and DI tokenizers are available with spatial compression rates of 8x8 and 16x16. For instance, **CI16x16** downsizes both height and width by a factor of 16.\n", "\n", "The notebook uses the `ImageTokenizer` class from the `cosmos_predict1.tokenizer.inference.image_lib` module to manage the model's encoder and decoder. The encoder compresses the input image into a compact latent representation (CI) or discrete integer tokens (DI), and the decoder reconstructs the image from that representation.\n", "\n", "The example demonstrates the tokenizer's autoencoding capability: compressing an image into a smaller latent space and then reconstructing an approximation of the original image from it. 
This illustrates how image tokenization can achieve significant spatial compression while still supporting image reconstruction, a desirable property for generative modeling.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "5BkjyLTPLM6e" }, "source": [ "This tutorial follows a simple, step-by-step approach, making it easy to understand and adapt.\n", "\n", "## Step 1: Clone the Cosmos Tokenizer Repository" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TEV88M9YG973" }, "outputs": [], "source": [ "!git clone https://github.com/NVIDIA-Cosmos/cosmos-predict1.git" ] }, { "cell_type": "markdown", "metadata": { "id": "AxOMEJpFL9QL" }, "source": [ "## Step 2: Install **Cosmos-Tokenizer**\n", "Before proceeding, install the Python dependencies of the **Cosmos Tokenizer**. If you cloned the repository in Step 1, the following cell changes into the repository directory and installs the packages listed in `requirements.txt`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XuwUR6HrIxD8" }, "outputs": [], "source": [ "# Step 2: Install Cosmos and its Python dependencies.\n", "import os\n", "\n", "if os.path.exists(\"cosmos-predict1\"):\n", "    os.chdir(\"cosmos-predict1\")\n", "    %pip install -r requirements.txt\n", "else:\n", "    # The clone is not in the current directory, e.g. because this cell already ran.\n", "    print('cosmos-predict1 not found in the current directory; assuming it is already installed.')" ] }, { "cell_type": "markdown", "metadata": { "id": "id29RPiyMOtB" }, "source": [ "## Step 3: Set Up Hugging Face API Token and Download Pretrained Models\n", "\n", "In this step, you'll configure the Hugging Face API token and download the pretrained model weights required for the **Cosmos Tokenizer**.\n", "\n", "1. 
**Ensure You Have a Hugging Face Account**  \n", "   If you do not already have a Hugging Face account, follow these steps to create one and generate an API token:\n", "   - Go to the [Hugging Face website](https://huggingface.co/) and sign up for a free account.\n", "   - After logging in, navigate to your [Settings → Access Tokens](https://huggingface.co/settings/tokens).\n", "   - Click on \"New Token\" to generate an API token with the required permissions.\n", "\n", "2. **Set the Hugging Face Token**  \n", "   The next cell checks whether the Hugging Face token is already set in the `HUGGINGFACE_TOKEN` environment variable. If it is not, you will be prompted to enter it manually. The token is required to authenticate and download the models from Hugging Face.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "joxcyOlnM7HQ" }, "outputs": [], "source": [ "# Check if the token is already set; otherwise prompt for it.\n", "if \"HUGGINGFACE_TOKEN\" not in os.environ:\n", "    os.environ[\"HUGGINGFACE_TOKEN\"] = input(\"Please enter your Hugging Face API token: \")\n", "!git config --global credential.helper store" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Lq7MAQ9pGPH9" }, "outputs": [], "source": [ "import os\n", "\n", "from huggingface_hub import login, snapshot_download\n", "\n", "# Authenticate with the token set in the previous cell.\n", "HUGGINGFACE_TOKEN = os.environ.get(\"HUGGINGFACE_TOKEN\")\n", "login(token=HUGGINGFACE_TOKEN, add_to_git_credential=True)\n", "\n", "# Download each pretrained tokenizer into checkpoints/<model_name>.\n", "model_names = [\n", "    \"Cosmos-0.1-Tokenizer-CI8x8\",\n", "    \"Cosmos-0.1-Tokenizer-CI16x16\",\n", "    \"Cosmos-0.1-Tokenizer-DI8x8\",\n", "    \"Cosmos-0.1-Tokenizer-DI16x16\",\n", "]\n", "for model_name in model_names:\n", "    hf_repo = \"nvidia/\" + model_name\n", "    local_dir = \"checkpoints/\" + model_name\n", "    os.makedirs(local_dir, exist_ok=True)\n", "    print(f\"downloading {model_name}...\")\n", "    snapshot_download(repo_id=hf_repo, local_dir=local_dir)" ] }, { "cell_type": "markdown", "metadata": { "id": "ltZ-v2vzNv74" }, "source": [ "## Step 4: Use Cosmos Tokenizer for Image Reconstruction\n", "\n" ] 
}, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "colab": { "base_uri": "https://localhost:8080/", "height": 839 }, "id": "gZFPrGCBGwtC", "outputId": "0df7efc4-7a40-4011-81a6-3c541ba1601f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input image read from:\t /content/Cosmos-Tokenizer/test_data/image.png\n", "Reconstruction saved:\t /content/Cosmos-Tokenizer/test_data/image_CI8x8.png\n" ] }, { "data": { "text/html": [ "
Input Image | Reconstructed Image
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# @title In this step, load the required checkpoints and perform image reconstruction. {\"run\":\"auto\"}\n", "import os\n", "\n", "import mediapy as media\n", "import numpy as np\n", "\n", "from cosmos_predict1.tokenizer.inference.image_lib import ImageTokenizer\n", "\n", "\n", "# 1) Specify the model name, and the paths to the encoder/decoder checkpoints.\n", "model_name = 'Cosmos-0.1-Tokenizer-CI8x8' # @param [\"Cosmos-0.1-Tokenizer-CI16x16\", \"Cosmos-0.1-Tokenizer-CI8x8\", \"Cosmos-0.1-Tokenizer-DI8x8\", \"Cosmos-0.1-Tokenizer-DI16x16\"]\n", "\n", "encoder_ckpt = f\"checkpoints/{model_name}/encoder.jit\"\n", "decoder_ckpt = f\"checkpoints/{model_name}/decoder.jit\"\n", "\n", "# 2) Load or provide the image filename you want to tokenize & reconstruct.\n", "input_filepath = \"cosmos_predict1/tokenizer/test_data/image.png\"\n", "\n", "# 3) Read the image from disk in RGB order and keep the first three channels,\n", "# which drops the alpha channel if one is present (shape = H x W x 3).\n", "input_image = media.read_image(input_filepath)[..., :3]\n", "assert input_image.ndim == 3 and input_image.shape[2] == 3, \"Image must have shape H x W x 3\"\n", "\n", "# 4) Expand dimensions to B x H x W x C, since the ImageTokenizer expects a batch dimension\n", "# in the input. 
(Batch size = 1 in this example.)\n", "batched_input_image = np.expand_dims(input_image, axis=0)\n", "\n", "# 5) Create the ImageTokenizer instance with the encoder & decoder.\n", "# - device=\"cuda\" uses the GPU\n", "# - dtype=\"bfloat16\" expects Ampere or newer GPU (A100, RTX 30xx, etc.)\n", "tokenizer = ImageTokenizer(\n", "    checkpoint_enc=encoder_ckpt,\n", "    checkpoint_dec=decoder_ckpt,\n", "    device=\"cuda\",\n", "    dtype=\"bfloat16\",\n", ")\n", "\n", "# 6) Use the tokenizer to autoencode (encode & decode) the image.\n", "# The output is a NumPy array with shape = B x H x W x C, range [0..255].\n", "batched_output_image = tokenizer(batched_input_image)\n", "\n", "# 7) Extract the single image from the batch (index 0).\n", "output_image = batched_output_image[0]\n", "\n", "# 8) Save the reconstructed image next to the input, with the model variant as a suffix.\n", "input_dir, input_filename = os.path.split(input_filepath)\n", "filename, ext = os.path.splitext(input_filename)\n", "output_filepath = f\"{input_dir}/{filename}_{model_name.split('-')[-1]}{ext}\"\n", "media.write_image(output_filepath, output_image)\n", "print(\"Input image read from:\\t\", f\"{os.getcwd()}/{input_filepath}\")\n", "print(\"Reconstruction saved:\\t\", f\"{os.getcwd()}/{output_filepath}\")\n", "\n", "# 9) Visualization of the input image (left) and the reconstruction (right).\n", "media.show_images([input_image, output_image], [\"Input Image\", \"Reconstructed Image\"])" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }