Add application file
- .gitattributes +1 -0
- LICENSE +35 -0
- README.md +168 -10
- addit_attention_processors.py +297 -0
- addit_attention_store.py +316 -0
- addit_blending_utils.py +232 -0
- addit_flux_pipeline.py +1389 -0
- addit_flux_transformer.py +521 -0
- addit_methods.py +186 -0
- addit_scheduler.py +101 -0
- app.py +416 -0
- images/bed_dark_room.jpg +3 -0
- images/flower.jpg +3 -0
- requirements.txt +20 -0
- run_CLI_addit_generated.py +102 -0
- run_CLI_addit_real.py +121 -0
- run_addit_generated.ipynb +80 -0
- run_addit_real.ipynb +85 -0
- visualization_utils.py +235 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.jpg filter=lfs diff=lfs merge=lfs -text
LICENSE
ADDED
@@ -0,0 +1,35 @@
NVIDIA License

1. Definitions

“Licensor” means any person or entity that distributes its Work.
“Work” means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.
The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.
Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.

2. License Grant

2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.

3. Limitations

3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.

3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.

3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, “non-commercially” means for research or evaluation purposes only.

3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.

3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this license.

3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.

4. Disclaimer of Warranty.

THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.

5. Limitation of Liability.

EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
README.md
CHANGED
@@ -1,13 +1,171 @@
(removed: the previous 13-line README, a "---"-delimited front-matter block; its contents are not shown in this view)
# 🎨 Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

<div align="center">

[](https://arxiv.org/abs/2411.07232)
[](https://research.nvidia.com/labs/par/addit/)

</div>

## 👥 Authors

**Yoad Tewel**<sup>1,2</sup>, **Rinon Gal**<sup>1,2</sup>, **Dvir Samuel**<sup>3</sup>, **Yuval Atzmon**<sup>1</sup>, **Lior Wolf**<sup>2</sup>, **Gal Chechik**<sup>1</sup>

<sup>1</sup>NVIDIA • <sup>2</sup>Tel Aviv University • <sup>3</sup>Bar-Ilan University

<div align="center">
<img src="https://research.nvidia.com/labs/par/addit/static/images/Teaser.png" alt="Add-it Teaser" width="800"/>
</div>

## 📄 Abstract

Adding objects into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes.

We introduce **Add-it**, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement.

Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.
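The weighted extended-attention step described above is implemented for FLUX in `addit_attention_processors.py`, which is added in this commit. Below is a rough, self-contained PyTorch sketch of the idea only, with toy random tensors standing in for the real 512 text + 64×64 image tokens; `extended_scale` mirrors the CLI flag of the same name.

```python
import torch
import torch.nn.functional as F

def extended_attention(q_tgt, k_src, v_src, k_tgt, v_tgt, extended_scale=1.05):
    # The target stream's query attends over the source image's keys/values
    # concatenated with its own, with the target keys up-weighted by
    # `extended_scale` (cf. apply_extended_attention, extend_type="pixels").
    k = torch.cat([k_src, extended_scale * k_tgt], dim=-2)
    v = torch.cat([v_src, v_tgt], dim=-2)
    return F.scaled_dot_product_attention(q_tgt, k, v)

# Toy shapes: [batch, heads, tokens, head_dim]; 512 text tokens + 64*64 image tokens.
q_tgt = torch.randn(1, 24, 512 + 4096, 64)
k_src = torch.randn(1, 24, 4096, 64)        # source-image keys (image tokens only)
v_src = torch.randn(1, 24, 4096, 64)
k_tgt = torch.randn(1, 24, 512 + 4096, 64)  # target stream: text + image tokens
v_tgt = torch.randn(1, 24, 512 + 4096, 64)
print(extended_attention(q_tgt, k_src, v_src, k_tgt, v_tgt).shape)  # [1, 24, 4608, 64]
```

Values of `extended_scale` slightly above 1.0 up-weight the target stream's own keys relative to the copied source-image keys, which is why the tips below suggest increasing it when the object fails to appear.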
---

## 📋 Description

This repository contains the official implementation of the Add-it paper, providing tools for seamless object insertion into images using pretrained diffusion models.

---

## 🛠️ Setup

```bash
conda env create -f environment.yml
conda activate addit
```

---

## 🚀 Usage

### 💻 Command Line Interface (CLI)

Add-it provides two CLI scripts for different use cases:

#### 1. 🎭 Adding Objects to Generated Images

Use `run_CLI_addit_generated.py` to add objects to AI-generated images:

```bash
python run_CLI_addit_generated.py \
    --prompt_source "A photo of a cat sitting on the couch" \
    --prompt_target "A photo of a cat wearing a red hat sitting on the couch" \
    --subject_token "hat"
```

##### ⚙️ Options for Generated Images

**🔴 Required Arguments:**
- `--prompt_source`: Source prompt for generating the base image
- `--prompt_target`: Target prompt describing the desired edited image
- `--subject_token`: Single token representing the subject to add (must appear in `prompt_target`)

**🔵 Optional Arguments:**
- `--output_dir`: Directory to save output images (default: "outputs")
- `--seed_src`: Seed for source image generation (default: 6311)
- `--seed_obj`: Seed for edited image generation (default: 1)
- `--extended_scale`: Extended attention scale (default: 1.05)
- `--structure_transfer_step`: Structure transfer step (default: 2)
- `--blend_steps`: Blend steps (default: [15]). To allow changes to the input image, pass `--blend_steps` with an empty value.
- `--localization_model`: Localization model (default: "attention_points_sam")
  - **Options:** `attention_points_sam`, `attention`, `attention_box_sam`, `attention_mask_sam`, `grounding_sam`
- `--show_attention`: Show attention maps using pyplot (flag); the maps are also saved to `attn_vis.png`.

#### 2. 📸 Adding Objects to Real Images

Use `run_CLI_addit_real.py` to add objects to existing images:

```bash
python run_CLI_addit_real.py \
    --source_image "images/bed_dark_room.jpg" \
    --prompt_source "A photo of a bed in a dark room" \
    --prompt_target "A photo of a dog lying on a bed in a dark room" \
    --subject_token "dog"
```

##### ⚙️ Options for Real Images

**🔴 Required Arguments:**
- `--source_image`: Path to the source image (default: "images/bed_dark_room.jpg")
- `--prompt_source`: Source prompt describing the original image
- `--prompt_target`: Target prompt describing the desired edited image
- `--subject_token`: Subject token to add to the image (must appear in `prompt_target`)

**🔵 Optional Arguments:**
- `--output_dir`: Directory to save output images (default: "outputs")
- `--seed_src`: Seed for source generation (default: 6311)
- `--seed_obj`: Seed for edited image generation (default: 1)
- `--extended_scale`: Extended attention scale (default: 1.1)
- `--structure_transfer_step`: Structure transfer step (default: 4)
- `--blend_steps`: Blend steps (default: [18]). To allow changes to the input image, pass `--blend_steps` with an empty value.
- `--localization_model`: Localization model (default: "attention")
  - **Options:** `attention_points_sam`, `attention`, `attention_box_sam`, `attention_mask_sam`, `grounding_sam`
- `--use_offset`: Use offset in processing (flag)
- `--show_attention`: Show attention maps using pyplot (flag); the maps are also saved to `attn_vis.png`.
- `--disable_inversion`: Disable source image inversion (flag)

---

### 📓 Jupyter Notebooks

You can run Add-it in two interactive modes:

| Mode | Notebook | Description |
|------|----------|-------------|
| 🎭 **Generated Images** | `run_addit_generated.ipynb` | Adding objects to AI-generated images |
| 📸 **Real Images** | `run_addit_real.ipynb` | Adding objects to existing real images |

The notebooks contain examples of different prompts and parameters that can be adjusted to control the object insertion process.

---

## 💡 Tips for Better Results

- **Prompt Design**: The `--prompt_target` should be similar to the `--prompt_source`, but include a description of the new object to insert
- **Seed Variation**: Try different values for `--seed_obj`; some prompts may require a few attempts to get a satisfying result
- **Localization Models**: The most effective `--localization_model` options are `attention_points_sam` and `attention`. Use the `--show_attention` flag to visualize localization performance
- **Object Placement Issues**: If the object is not added to the image:
  - Try **decreasing** `--structure_transfer_step`
  - Try **increasing** `--extended_scale`
- **Flexibility**: To allow more flexibility in modifying the source image, pass `--blend_steps` with an empty value so that an empty list (`[]`) is used

---

## 📰 News

- **🎉 July 2025**: The official Add-it implementation is released!

---

## 📝 TODO

- [x] Release code

---

## 📚 Citation

If you make use of our work, please cite our paper:

```bibtex
@misc{tewel2024addit,
      title={Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models},
      author={Yoad Tewel and Rinon Gal and Dvir Samuel and Yuval Atzmon and Lior Wolf and Gal Chechik},
      year={2024},
      eprint={2411.07232},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

---

<div align="center">
<strong>🌟 Star this repo if you find it useful! 🌟</strong>
</div>
addit_attention_processors.py
ADDED
@@ -0,0 +1,297 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

from collections import defaultdict
from diffusers.models.attention_processor import Attention, apply_rope
from typing import Callable, List, Optional, Tuple, Union

from addit_attention_store import AttentionStore
from visualization_utils import show_tensors

import torch
import torch.nn.functional as F
import numpy as np
from scipy.optimize import brentq

def apply_standard_attention(query, key, value, attn, attention_probs=None):
    batch_size, attn_heads, _, head_dim = query.shape

    # Do normal attention, to cache the attention scores
    query = query.reshape(batch_size*attn_heads, -1, head_dim)
    key = key.reshape(batch_size*attn_heads, -1, head_dim)
    value = value.reshape(batch_size*attn_heads, -1, head_dim)

    if attention_probs is None:
        attention_probs = attn.get_attention_scores(query, key)

    hidden_states = torch.bmm(attention_probs, value)
    hidden_states = hidden_states.view(batch_size, attn_heads, -1, head_dim)

    return hidden_states, attention_probs

def apply_extended_attention(query, key, value, attention_store, attn, layer_name, step_index, extend_type="pixels",
                             extended_scale=1., record_attention=False):
    batch_size = query.size(0)
    extend_query = query[1:]

    if extend_type == "full":
        added_key = key[0] * extended_scale
        added_value = value[0]
    elif extend_type == "text":
        added_key = key[0, :, :512] * extended_scale
        added_value = value[0, :, :512]
    elif extend_type == "pixels":
        added_key = key[0, :, 512:]
        added_value = value[0, :, 512:]

    key[1] = key[1] * extended_scale

    extend_key = torch.cat([added_key, key[1]], dim=1).unsqueeze(0)
    extend_value = torch.cat([added_value, value[1]], dim=1).unsqueeze(0)

    hidden_states_0 = F.scaled_dot_product_attention(query[:1], key[:1], value[:1], dropout_p=0.0, is_causal=False)

    if record_attention or attention_store.is_cache_attn_ratio(step_index):
        hidden_states_1, attention_probs_1 = apply_standard_attention(extend_query, extend_key, extend_value, attn)
    else:
        hidden_states_1 = F.scaled_dot_product_attention(extend_query, extend_key, extend_value, dropout_p=0.0, is_causal=False)

    if record_attention:
        # Store Attention
        seq_len = attention_probs_1.size(2) - attention_probs_1.size(1)
        self_attention_probs_1 = attention_probs_1[:,:,seq_len:]
        attention_store.store_attention(self_attention_probs_1, layer_name, 1, attn.heads)

    if attention_store.is_cache_attn_ratio(step_index):
        attention_store.store_attention_ratios(attention_probs_1, step_index, layer_name)

    hidden_states = torch.cat([hidden_states_0, hidden_states_1], dim=0)

    return hidden_states

def apply_attention(query, key, value, attention_store, attn, layer_name, step_index,
                    record_attention, extended_attention, extended_scale):
    if extended_attention:
        hidden_states = apply_extended_attention(query, key, value, attention_store, attn, layer_name, step_index,
                                                 extended_scale=extended_scale,
                                                 record_attention=record_attention)
    else:
        if record_attention:
            hidden_states_0 = F.scaled_dot_product_attention(query[:1], key[:1], value[:1], dropout_p=0.0, is_causal=False)
            hidden_states_1, attention_probs_1 = apply_standard_attention(query[1:], key[1:], value[1:], attn)
            attention_store.store_attention(attention_probs_1, layer_name, 1, attn.heads)

            hidden_states = torch.cat([hidden_states_0, hidden_states_1], dim=0)
        else:
            hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)

    return hidden_states

class AdditFluxAttnProcessor2_0:
    """Attention processor used typically in processing the SD3-like self-attention projections."""

    def __init__(self, layer_name: str, attention_store: AttentionStore,
                 extended_steps: Tuple[int, int] = (0, 30), **kwargs):
        if not hasattr(F, "scaled_dot_product_attention"):
            raise ImportError("FluxAttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")

        self.layer_name = layer_name
        self.layer_idx = int(layer_name.split(".")[-1])
        self.attention_store = attention_store

        self.extended_steps = (0, extended_steps) if isinstance(extended_steps, int) else extended_steps

    def __call__(
        self,
        attn: Attention,
        hidden_states: torch.FloatTensor,
        encoder_hidden_states: torch.FloatTensor = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        image_rotary_emb: Optional[torch.Tensor] = None,

        step_index: Optional[int] = None,
        extended_scale: Optional[float] = 1.0,
    ) -> torch.FloatTensor:
        input_ndim = hidden_states.ndim
        if input_ndim == 4:
            batch_size, channel, height, width = hidden_states.shape
            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
        context_input_ndim = encoder_hidden_states.ndim
        if context_input_ndim == 4:
            batch_size, channel, height, width = encoder_hidden_states.shape
            encoder_hidden_states = encoder_hidden_states.view(batch_size, channel, height * width).transpose(1, 2)

        batch_size = encoder_hidden_states.shape[0]

        # `sample` projections.
        query = attn.to_q(hidden_states)
        key = attn.to_k(hidden_states)
        value = attn.to_v(hidden_states)

        inner_dim = key.shape[-1]
        head_dim = inner_dim // attn.heads

        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        if attn.norm_q is not None:
            query = attn.norm_q(query)
        if attn.norm_k is not None:
            key = attn.norm_k(key)

        # `context` projections.
        encoder_hidden_states_query_proj = attn.add_q_proj(encoder_hidden_states)
        encoder_hidden_states_key_proj = attn.add_k_proj(encoder_hidden_states)
        encoder_hidden_states_value_proj = attn.add_v_proj(encoder_hidden_states)

        encoder_hidden_states_query_proj = encoder_hidden_states_query_proj.view(
            batch_size, -1, attn.heads, head_dim
        ).transpose(1, 2)
        encoder_hidden_states_key_proj = encoder_hidden_states_key_proj.view(
            batch_size, -1, attn.heads, head_dim
        ).transpose(1, 2)
        encoder_hidden_states_value_proj = encoder_hidden_states_value_proj.view(
            batch_size, -1, attn.heads, head_dim
        ).transpose(1, 2)

        if attn.norm_added_q is not None:
            encoder_hidden_states_query_proj = attn.norm_added_q(encoder_hidden_states_query_proj)
        if attn.norm_added_k is not None:
            encoder_hidden_states_key_proj = attn.norm_added_k(encoder_hidden_states_key_proj)

        # attention
        query = torch.cat([encoder_hidden_states_query_proj, query], dim=2)
        key = torch.cat([encoder_hidden_states_key_proj, key], dim=2)
        value = torch.cat([encoder_hidden_states_value_proj, value], dim=2)

        if image_rotary_emb is not None:
            # YiYi to-do: update uising apply_rotary_emb
            # from ..embeddings import apply_rotary_emb
            # query = apply_rotary_emb(query, image_rotary_emb)
            # key = apply_rotary_emb(key, image_rotary_emb)
            query, key = apply_rope(query, key, image_rotary_emb)

        record_attention = self.attention_store.is_record_attention(self.layer_name, step_index)
        extend_start, extend_end = self.extended_steps
        extended_attention = extend_start <= step_index <= extend_end

        hidden_states = apply_attention(query, key, value, self.attention_store, attn, self.layer_name, step_index,
                                        record_attention, extended_attention, extended_scale)

        hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
        hidden_states = hidden_states.to(query.dtype)

        encoder_hidden_states, hidden_states = (
            hidden_states[:, : encoder_hidden_states.shape[1]],
            hidden_states[:, encoder_hidden_states.shape[1] :],
        )

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)
        encoder_hidden_states = attn.to_add_out(encoder_hidden_states)

        if input_ndim == 4:
            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
        if context_input_ndim == 4:
            encoder_hidden_states = encoder_hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)

        return hidden_states, encoder_hidden_states

class AdditFluxSingleAttnProcessor2_0:
    r"""
    Processor for implementing scaled dot-product attention (enabled by default if you're using PyTorch 2.0).
    """

    def __init__(self, layer_name: str, attention_store: AttentionStore,
                 extended_steps: Tuple[int, int] = (0, 30), **kwargs):
        if not hasattr(F, "scaled_dot_product_attention"):
            raise ImportError("AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.")

        self.layer_name = layer_name
        self.layer_idx = int(layer_name.split(".")[-1])
        self.attention_store = attention_store

        self.extended_steps = (0, extended_steps) if isinstance(extended_steps, int) else extended_steps

    def __call__(
        self,
        attn: Attention,
        hidden_states: torch.Tensor,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        image_rotary_emb: Optional[torch.Tensor] = None,
        step_index: Optional[int] = None,
        extended_scale: Optional[float] = 1.0,
    ) -> torch.Tensor:
        input_ndim = hidden_states.ndim

        if input_ndim == 4:
            batch_size, channel, height, width = hidden_states.shape
            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)

        batch_size, _, _ = hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape

        query = attn.to_q(hidden_states)
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states

        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        inner_dim = key.shape[-1]
        head_dim = inner_dim // attn.heads

        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        if attn.norm_q is not None:
            query = attn.norm_q(query)
        if attn.norm_k is not None:
            key = attn.norm_k(key)

        # Apply RoPE if needed
        if image_rotary_emb is not None:
            # YiYi to-do: update uising apply_rotary_emb
            # from ..embeddings import apply_rotary_emb
            # query = apply_rotary_emb(query, image_rotary_emb)
            # key = apply_rotary_emb(key, image_rotary_emb)
            query, key = apply_rope(query, key, image_rotary_emb)

        # the output of sdp = (batch, num_heads, seq_len, head_dim)
        # TODO: add support for attn.scale when we move to Torch 2.1

        record_attention = self.attention_store.is_record_attention(self.layer_name, step_index)
        extend_start, extend_end = self.extended_steps
        extended_attention = extend_start <= step_index <= extend_end

        hidden_states = apply_attention(query, key, value, self.attention_store, attn, self.layer_name, step_index,
                                        record_attention, extended_attention, extended_scale)

        hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
        hidden_states = hidden_states.to(query.dtype)

        if input_ndim == 4:
            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)

        return hidden_states
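Each processor above is constructed per attention layer with a `layer_name` such as `transformer_blocks.13` (the `AttentionStore` filters recordings by these names), and it expects the pipeline to forward `step_index` and `extended_scale` as extra attention kwargs at every denoising step. The repository's own wiring lives in `addit_flux_transformer.py`, which is not reproduced in this view; the snippet below is only a sketch of the registration pattern, assuming a diffusers-style FLUX transformer that exposes `attn_processors`/`set_attn_processor` with `"...attn.processor"` keys, and using a hypothetical helper name `install_addit_processors`.

```python
# Hypothetical registration sketch (the actual wiring is in addit_flux_transformer.py).
from addit_attention_processors import AdditFluxAttnProcessor2_0, AdditFluxSingleAttnProcessor2_0

def install_addit_processors(transformer, attention_store, extended_steps=(0, 30)):
    procs = {}
    for key in transformer.attn_processors:                # e.g. "transformer_blocks.13.attn.processor"
        layer_name = key.rsplit(".attn.processor", 1)[0]   # -> "transformer_blocks.13"
        cls = (AdditFluxSingleAttnProcessor2_0
               if key.startswith("single_transformer_blocks")
               else AdditFluxAttnProcessor2_0)
        procs[key] = cls(layer_name, attention_store, extended_steps=extended_steps)
    transformer.set_attn_processor(procs)
```

With such a mapping in place, the denoising loop would pass `step_index` and `extended_scale` through the transformer's attention kwargs (an assumption; the exact mechanism is defined by the repo's custom transformer and pipeline).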
addit_attention_store.py
ADDED
@@ -0,0 +1,316 @@
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

import torch
from skimage import filters
import cv2
import torch.nn.functional as F
from skimage.filters import threshold_li, threshold_yen, threshold_multiotsu
import numpy as np
from visualization_utils import show_tensors
import matplotlib.pyplot as plt

def text_to_tokens(text, tokenizer):
    return [tokenizer.decode(x) for x in tokenizer(text, padding="longest", return_tensors="pt").input_ids[0]]

def flatten_list(l):
    return [item for sublist in l for item in sublist]

def gaussian_blur(heatmap, kernel_size=7, sigma=0):
    # Shape of heatmap: (H, W)
    heatmap = heatmap.cpu().numpy()
    heatmap = cv2.GaussianBlur(heatmap, (kernel_size, kernel_size), sigma)
    heatmap = torch.tensor(heatmap)

    return heatmap

def min_max_norm(x):
    return (x - x.min()) / (x.max() - x.min())

class AttentionStore:
    def __init__(self, prompts, tokenizer,
                 subject_token=None, record_attention_steps=[],
                 is_cache_attn_ratio=False, attn_ratios_steps=[5]):

        self.text2image_store = {}
        self.image2text_store = {}
        self.count_per_layer = {}

        self.record_attention_steps = record_attention_steps
        self.record_attention_layers = ["transformer_blocks.13","transformer_blocks.14", "transformer_blocks.18", "single_transformer_blocks.23", "single_transformer_blocks.33"]

        self.attention_ratios = {}
        self._is_cache_attn_ratio = is_cache_attn_ratio
        self.attn_ratios_steps = attn_ratios_steps
        self.ratio_source = 'text'

        self.max_tokens_to_record = 10

        if isinstance(prompts, str):
            prompts = [prompts]
            batch_size = 1
        else:
            batch_size = len(prompts)

        tokens_per_prompt = []

        for prompt in prompts:
            tokens = text_to_tokens(prompt, tokenizer)
            tokens_per_prompt.append(tokens)

        self.tokens_to_record = []
        self.token_idxs_to_record = []

        if len(record_attention_steps) > 0:
            self.subject_tokens = flatten_list([text_to_tokens(x, tokenizer)[:-1] for x in [subject_token]])
            self.subject_tokens_idx = [tokens_per_prompt[1].index(x) for x in self.subject_tokens]
            self.add_token_idx = self.subject_tokens_idx[-1]

    def is_record_attention(self, layer_name, step_index):
        is_correct_layer = (self.record_attention_layers is None) or (layer_name in self.record_attention_layers)

        record_attention = (step_index in self.record_attention_steps) and (is_correct_layer)

        return record_attention

    def store_attention(self, attention_probs, layer_name, batch_size, num_heads):
        text_len = 512
        timesteps = len(self.record_attention_steps)

        # Split batch and heads
        attention_probs = attention_probs.view(batch_size, num_heads, *attention_probs.shape[1:])

        # Mean over the heads
        attention_probs = attention_probs.mean(dim=1)

        # Attention: text -> image
        attention_probs_text2image = attention_probs[:, :text_len, text_len:]
        attention_probs_text2image = [attention_probs_text2image[0, self.subject_tokens_idx, :]]

        # Attention: image -> text
        attention_probs_image2text = attention_probs[:, text_len:, :text_len].transpose(1,2)
        attention_probs_image2text = [attention_probs_image2text[0, self.subject_tokens_idx, :]]

        if layer_name not in self.text2image_store:
            self.text2image_store[layer_name] = [x for x in attention_probs_text2image]
            self.image2text_store[layer_name] = [x for x in attention_probs_image2text]
        else:
            self.text2image_store[layer_name] = [self.text2image_store[layer_name][i] + x for i, x in enumerate(attention_probs_text2image)]
            self.image2text_store[layer_name] = [self.text2image_store[layer_name][i] + x for i, x in enumerate(attention_probs_image2text)]

    def is_cache_attn_ratio(self, step_index):
        return (self._is_cache_attn_ratio) and (step_index in self.attn_ratios_steps)

    def store_attention_ratios(self, attention_probs, step_index, layer_name):
        layer_prefix = layer_name.split(".")[0]

        if self.ratio_source == 'pixels':
            extended_attention_probs = attention_probs.mean(dim=0)[512:, :]
            extended_attention_probs_source = extended_attention_probs[:,:4096].sum(dim=1).view(64,64).float().cpu()
            extended_attention_probs_text = extended_attention_probs[:,4096:4096+512].sum(dim=1).view(64,64).float().cpu()
            extended_attention_probs_target = extended_attention_probs[:,4096+512:].sum(dim=1).view(64,64).float().cpu()
            token_attention = extended_attention_probs[:,4096+self.add_token_idx].view(64,64).float().cpu()

            stacked_attention_ratios = torch.cat([extended_attention_probs_source, extended_attention_probs_text, extended_attention_probs_target, token_attention], dim=1)
        elif self.ratio_source == 'text':
            extended_attention_probs = attention_probs.mean(dim=0)[:512, :]
            extended_attention_probs_source = extended_attention_probs[:,:4096].sum(dim=0).view(64,64).float().cpu()
            extended_attention_probs_target = extended_attention_probs[:,4096+512:].sum(dim=0).view(64,64).float().cpu()

            stacked_attention_ratios = torch.cat([extended_attention_probs_source, extended_attention_probs_target], dim=1)

        if step_index not in self.attention_ratios:
            self.attention_ratios[step_index] = {}

        if layer_prefix not in self.attention_ratios[step_index]:
            self.attention_ratios[step_index][layer_prefix] = []

        self.attention_ratios[step_index][layer_prefix].append(stacked_attention_ratios)

    def get_attention_ratios(self, step_indices=None, display_imgs=False):
        ratios = []

        if step_indices is None:
            step_indices = list(self.attention_ratios.keys())

        if len(step_indices) == 1:
            steps = f"Step: {step_indices[0]}"
        else:
            steps = f"Steps: [{step_indices[0]}-{step_indices[-1]}]"

        layer_prefixes = list(self.attention_ratios[step_indices[0]].keys())
        scores_per_layer = {}

        for layer_prefix in layer_prefixes:
            ratios = []

            for step_index in step_indices:
                if layer_prefix in self.attention_ratios[step_index]:
                    step_ratios = self.attention_ratios[step_index][layer_prefix]
                    step_ratios = torch.stack(step_ratios).mean(dim=0)
                    ratios.append(step_ratios)

            # Mean over the steps
            ratios = torch.stack(ratios).mean(dim=0)

            if self.ratio_source == 'pixels':
                source, text, target, token = torch.split(ratios, 64, dim=1)
                title = f"{steps}: Source={source.sum().item():.2f}, Text={text.sum().item():.2f}, Target={target.sum().item():.2f}, Token={token.sum().item():.2f}"
                ratios = min_max_norm(torch.cat([source, text, target], dim=1))
                token = min_max_norm(token)
                ratios = torch.cat([ratios, token], dim=1)
            elif self.ratio_source == 'text':
                source, target = torch.split(ratios, 64, dim=1)
                source_sum = source.sum().item()
                target_sum = target.sum().item()
                text_sum = 512 - (source_sum + target_sum)

                title = f"{steps}: Source={source_sum:.2f}, Target={target_sum:.2f}"
                ratios = min_max_norm(torch.cat([source, target], dim=1))

            if display_imgs:
                print(f"Layer: {layer_prefix}")
                show_tensors([ratios], [title])

            scores_per_layer[layer_prefix] = (source_sum, text_sum, target_sum)

        return scores_per_layer

    def plot_attention_ratios(self, step_indices=None):
        steps = list(self.attention_ratios.keys())
        score_per_layer = {
            'transformer_blocks': {},
            'single_transformer_blocks': {}
        }

        for i in steps:
            scores_per_layer = self.get_attention_ratios(step_indices=[i], display_imgs=False)

            for layer in self.attention_ratios[i]:
                source, text, target = scores_per_layer[layer]
                score_per_layer[layer][i] = (source, text, target)

        for layer_type in score_per_layer:
            x = list(score_per_layer[layer_type].keys())
            source_sums = [x[0] for x in score_per_layer[layer_type].values()]
            text_sums = [x[1] for x in score_per_layer[layer_type].values()]
            target_sums = [x[2] for x in score_per_layer[layer_type].values()]

            # Calculate the total sums for each stack (source + text + target)
            total_sums = [source_sums[j] + text_sums[j] + target_sums[j] for j in range(len(source_sums))]

            # Create stacked bar plots
            fig, ax = plt.subplots(figsize=(10, 6))
            indices = np.arange(len(x))

            # Plot source at the bottom
            ax.bar(indices, source_sums, label='Source', color='#6A2C70')

            # Plot text stacked on source
            ax.bar(indices, text_sums, label='Text', color='#B83B5E', bottom=source_sums)

            # Plot target stacked on text + source
            target_bottom = [source_sums[j] + text_sums[j] for j in range(len(source_sums))]
            ax.bar(indices, target_sums, label='Target', color='#F08A5D', bottom=target_bottom)

            # Annotate bars with percentage values
            for j, index in enumerate(indices):

                font_size = 12

                # Source percentage
                source_percentage = 100 * source_sums[j] / total_sums[j]
                ax.text(index, source_sums[j] / 2, f'{source_percentage:.1f}%',
                        ha='center', va='center', rotation=90, color='white',
                        fontsize=font_size, fontweight='bold')

                # Text percentage
                text_percentage = 100 * text_sums[j] / total_sums[j]
                ax.text(index, source_sums[j] + (text_sums[j] / 2), f'{text_percentage:.1f}%',
                        ha='center', va='center', rotation=90, color='white',
                        fontsize=font_size, fontweight='bold')

                # Target percentage
                target_percentage = 100 * target_sums[j] / total_sums[j]
                ax.text(index, source_sums[j] + text_sums[j] + (target_sums[j] / 2), f'{target_percentage:.1f}%',
                        ha='center', va='center', rotation=90, color='white',
                        fontsize=font_size, fontweight='bold')

            ax.set_xlabel('Step Index')
            ax.set_ylabel('Attention Ratio')
            ax.set_title(f'Attention Ratios for {layer_type}')
            ax.set_xticks(indices)
            ax.set_xticklabels(x)

            plt.legend()
            plt.show()

    def aggregate_attention(self, store, target_layers=None, resolution=None,
                            gaussian_kernel=3, thr_type='otsu', thr_number=0.5):
        if target_layers is None:
            store_vals = list(store.values())
        elif isinstance(target_layers, list):
            store_vals = [store[x] for x in target_layers]
        else:
            raise ValueError("target_layers must be a list of layer names or None.")

        # store vals = List[layers] of Tensor[batch_size, text_tokens, image_tokens]
        batch_size = len(store_vals[0])

        attention_maps = []
        attention_masks = []

        for i in range(batch_size):
            # Average over the layers
            agg_vals = torch.stack([x[i] for x in store_vals]).mean(dim=0)

            if resolution is None:
                size = int(agg_vals.shape[-1] ** 0.5)
                resolution = (size, size)

            agg_vals = agg_vals.view(agg_vals.shape[0], *resolution)

            if gaussian_kernel > 0:
                agg_vals = torch.stack([gaussian_blur(x.float(), kernel_size=gaussian_kernel) for x in agg_vals]).to(agg_vals.dtype)

            mask_vals = agg_vals.clone()

            for j in range(mask_vals.shape[0]):
                mask_vals[j] = (mask_vals[j] - mask_vals[j].min()) / (mask_vals[j].max() - mask_vals[j].min())
                np_vals = mask_vals[j].float().cpu().numpy()

                otsu_thr = filters.threshold_otsu(np_vals)
                li_thr = threshold_li(np_vals, initial_guess=otsu_thr)
                yen_thr = threshold_yen(np_vals)

                if thr_type == 'otsu':
                    thr = otsu_thr
                elif thr_type == 'yen':
                    thr = yen_thr
                elif thr_type == 'li':
                    thr = li_thr
                elif thr_type == 'number':
                    thr = thr_number
                elif thr_type == 'multiotsu':
                    thrs = threshold_multiotsu(np_vals, classes=3)

                    if thrs[1] > thrs[0] * 3.5:
                        thr = thrs[1]
                    else:
                        thr = thrs[0]

                    # Take the closest threshold to otsu_thr
                    # thr = thrs[np.argmin(np.abs(thrs - otsu_thr))]

                # alpha = 0.8
                # thr = (alpha * thr + (1-alpha) * mask_vals[j].max())

                mask_vals[j] = (mask_vals[j] > thr).to(mask_vals[j].dtype)

            attention_maps.append(agg_vals)
            attention_masks.append(mask_vals)

        return attention_maps, attention_masks, self.tokens_to_record
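As a toy illustration of the data flow: during recorded steps the store accumulates, per layer, the subject token's text→image attention over the 64×64 latent grid, and `aggregate_attention` averages those maps across layers, optionally blurs them, and thresholds them into a binary localization mask. The sketch below uses random tensors in place of recorded attention and a CLIP tokenizer standing in for whichever tokenizer the pipeline actually passes (both are assumptions made for the example only):

```python
import torch
from transformers import CLIPTokenizer  # stand-in tokenizer for this toy example
from addit_attention_store import AttentionStore

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
store = AttentionStore("A photo of a cat wearing a red hat", tokenizer)

# Pretend two of the recorded layers accumulated a [1, 4096] attention row
# for the subject token over a 64x64 latent grid.
fake_text2image = {
    "transformer_blocks.13": [torch.rand(1, 64 * 64)],
    "transformer_blocks.18": [torch.rand(1, 64 * 64)],
}
maps, masks, _ = store.aggregate_attention(fake_text2image, gaussian_kernel=3, thr_type="otsu")
print(maps[0].shape, masks[0].shape)  # torch.Size([1, 64, 64]) torch.Size([1, 64, 64])
```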
addit_blending_utils.py
ADDED
@@ -0,0 +1,232 @@
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

import torch
import numpy as np
import torch.nn.functional as F
from skimage import filters
import matplotlib.pyplot as plt
from scipy.ndimage import maximum_filter, label, find_objects

def dilate_mask(latents_mask, k, latents_dtype):
    # Reshape the mask to 2D (64x64)
    mask_2d = latents_mask.view(64, 64)

    # Create a square kernel for dilation
    kernel = torch.ones(2*k+1, 2*k+1, device=mask_2d.device, dtype=mask_2d.dtype)

    # Add two dimensions to make it compatible with conv2d
    mask_4d = mask_2d.unsqueeze(0).unsqueeze(0)

    # Perform dilation using conv2d
    dilated_mask = F.conv2d(mask_4d, kernel.unsqueeze(0).unsqueeze(0), padding=k)

    # Threshold the result to get a binary mask
    dilated_mask = (dilated_mask > 0).to(mask_2d.dtype)

    # Reshape back to the original shape and convert to the desired dtype
    dilated_mask = dilated_mask.view(4096, 1).to(latents_dtype)

    return dilated_mask

def clipseg_predict(model, processor, image, text, device):
    inputs = processor(text=text, images=image, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        preds = outputs.logits.unsqueeze(1)
        preds = torch.sigmoid(preds)

    otsu_thr = filters.threshold_otsu(preds.cpu().numpy())
    subject_mask = (preds > otsu_thr).float()

    return subject_mask

def grounding_sam_predict(model, processor, sam_predictor, image, text, device):
    inputs = processor(images=image, text=text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.4,
        text_threshold=0.3,
        target_sizes=[image.size[::-1]]
    )

    input_boxes = results[0]["boxes"].cpu().numpy()

    if input_boxes.shape[0] == 0:
        return torch.ones((64, 64), device=device)

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        sam_predictor.set_image(image)
        masks, scores, logits = sam_predictor.predict(
            point_coords=None,
            point_labels=None,
            box=input_boxes,
            multimask_output=False,
        )

    subject_mask = torch.tensor(masks[0], device=device)

    return subject_mask

def mask_to_box_sam_predict(mask, sam_predictor, image, text, device):
    H, W = image.size

    # Resize clipseg mask to image size
    mask = F.interpolate(mask.view(1, 1, mask.shape[-2], mask.shape[-1]), size=(H, W), mode='bilinear').view(H, W)
    mask_indices = torch.nonzero(mask)
    top_left = mask_indices.min(dim=0)[0]
    bottom_right = mask_indices.max(dim=0)[0]

    # numpy shape [1,4]
    input_boxes = np.array([[top_left[1].item(), top_left[0].item(), bottom_right[1].item(), bottom_right[0].item()]])

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        sam_predictor.set_image(image)
        masks, scores, logits = sam_predictor.predict(
            point_coords=None,
            point_labels=None,
            box=input_boxes,
            multimask_output=True,
        )

    # subject_mask = torch.tensor(masks[0], device=device)
    subject_mask = torch.tensor(np.max(masks, axis=0), device=device)

    return subject_mask, input_boxes[0]

def mask_to_mask_sam_predict(mask, sam_predictor, image, text, device):
    H, W = (256, 256)

    # Resize clipseg mask to image size
    mask = F.interpolate(mask.view(1, 1, mask.shape[-2], mask.shape[-1]), size=(H, W), mode='bilinear').view(1, H, W)
    mask_input = mask.float().cpu().numpy()

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        sam_predictor.set_image(image)
        masks, scores, logits = sam_predictor.predict(
            point_coords=None,
            point_labels=None,
            mask_input=mask_input,
            multimask_output=False,
        )

    subject_mask = torch.tensor(masks[0], device=device)

    return subject_mask

def mask_to_points_sam_predict(mask, sam_predictor, image, text, device):
    H, W = image.size

    # Resize clipseg mask to image size
    mask = F.interpolate(mask.view(1, 1, mask.shape[-2], mask.shape[-1]), size=(H, W), mode='bilinear').view(H, W)
    mask_indices = torch.nonzero(mask)

    # Randomly sample 10 points from the mask
    n_points = 2
    point_coords = mask_indices[torch.randperm(mask_indices.shape[0])[:n_points]].float().cpu().numpy()
    point_labels = torch.ones((n_points,)).float().cpu().numpy()

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        sam_predictor.set_image(image)
        masks, scores, logits = sam_predictor.predict(
            point_coords=point_coords,
            point_labels=point_labels,
            multimask_output=False,
        )

    subject_mask = torch.tensor(masks[0], device=device)

    return subject_mask

def attention_to_points_sam_predict(subject_attention, subject_mask, sam_predictor, image, text, device):
    H, W = image.size

    # Resize clipseg mask to image size
    subject_attention = F.interpolate(subject_attention.view(1, 1, subject_attention.shape[-2], subject_attention.shape[-1]), size=(H, W), mode='bilinear').view(H, W)
    subject_mask = F.interpolate(subject_mask.view(1, 1, subject_mask.shape[-2], subject_mask.shape[-1]), size=(H, W), mode='bilinear').view(H, W)

    # Get mask_bbox
    subject_mask_indices = torch.nonzero(subject_mask)
    top_left = subject_mask_indices.min(dim=0)[0]
    bottom_right = subject_mask_indices.max(dim=0)[0]
    box_width = bottom_right[1] - top_left[1]
    box_height = bottom_right[0] - top_left[0]

    # Define the number of points and minimum distance between points
    n_points = 3
    max_thr = 0.35
    max_attention = torch.max(subject_attention)
    min_distance = max(box_width, box_height) // (n_points + 1)  # Adjust this value to control spread
    # min_distance = max(min_distance, 75)

    # Initialize list to store selected points
    selected_points = []

    # Create a copy of the attention map
    remaining_attention = subject_attention.clone()

    for _ in range(n_points):
        if remaining_attention.max() < max_thr * max_attention:
            break

        # Find the highest attention point
        point = torch.argmax(remaining_attention)
        y, x = torch.unravel_index(point, remaining_attention.shape)
        y, x = y.item(), x.item()

        # Add the point to our list
        selected_points.append((x, y))

        # Zero out the area around the selected point
        y_min = max(0, y - min_distance)
        y_max = min(H, y + min_distance + 1)
        x_min = max(0, x - min_distance)
        x_max = min(W, x + min_distance + 1)
        remaining_attention[y_min:y_max, x_min:x_max] = 0

    # Convert selected points to numpy array
    point_coords = np.array(selected_points)
    point_labels = np.ones(point_coords.shape[0], dtype=int)

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        sam_predictor.set_image(image)
        masks, scores, logits = sam_predictor.predict(
            point_coords=point_coords,
            point_labels=point_labels,
            multimask_output=False,
        )

    subject_mask = torch.tensor(masks[0], device=device)

    return subject_mask, point_coords

def sam_refine_step(mask, sam_predictor, image, device):
    mask_indices = torch.nonzero(mask)
    top_left = mask_indices.min(dim=0)[0]
    bottom_right = mask_indices.max(dim=0)[0]

    # numpy shape [1,4]
    input_boxes = np.array([[top_left[1].item(), top_left[0].item(), bottom_right[1].item(), bottom_right[0].item()]])

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        sam_predictor.set_image(image)
        masks, scores, logits = sam_predictor.predict(
            point_coords=None,
            point_labels=None,
            box=input_boxes,
            multimask_output=True,
        )

    # subject_mask = torch.tensor(masks[0], device=device)
    subject_mask = torch.tensor(np.max(masks, axis=0), device=device)

    return subject_mask, input_boxes[0]
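These helpers appear to serve the pipeline's localization and latent-blending steps: the `*_sam_predict` variants line up with the `--localization_model` options listed in the README, and `dilate_mask` grows a 64×64 latent mask (stored flattened as `[4096, 1]`) before blending. A quick self-contained check of `dilate_mask` on a toy mask:

```python
import torch
from addit_blending_utils import dilate_mask

# A 4x4 block on the 64x64 latent grid, flattened to [4096, 1] as the pipeline stores it;
# dilating by k=2 grows it to an 8x8 block (64 cells).
mask = torch.zeros(64, 64)
mask[30:34, 30:34] = 1.0
dilated = dilate_mask(mask.view(4096, 1), k=2, latents_dtype=torch.bfloat16)
print(dilated.shape, int(dilated.sum()))  # torch.Size([4096, 1]) 64
```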
addit_flux_pipeline.py
ADDED
@@ -0,0 +1,1389 @@
1 |
+
# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
|
2 |
+
#
|
3 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
4 |
+
# you may not use this file except in compliance with the License.
|
5 |
+
# You may obtain a copy of the License at
|
6 |
+
#
|
7 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
8 |
+
#
|
9 |
+
# Unless required by applicable law or agreed to in writing, software
|
10 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
11 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
12 |
+
# See the License for the specific language governing permissions and
|
13 |
+
# limitations under the License.
|
14 |
+
#
|
15 |
+
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
|
16 |
+
#
|
17 |
+
# This work is licensed under the LICENSE file
|
18 |
+
# located at the root directory.
|
19 |
+
|
20 |
+
from typing import Any, Callable, Dict, List, Optional, Union
|
21 |
+
import torch
|
22 |
+
import numpy as np
|
23 |
+
from PIL import Image
|
24 |
+
from diffusers.pipelines.flux.pipeline_flux import FluxPipeline, calculate_shift, retrieve_timesteps
|
25 |
+
from diffusers.pipelines.flux.pipeline_output import FluxPipelineOutput
|
26 |
+
from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
|
27 |
+
from diffusers.utils.torch_utils import randn_tensor
|
28 |
+
import matplotlib.pyplot as plt
|
29 |
+
|
30 |
+
import torch.fft
|
31 |
+
import torch.nn.functional as F
|
32 |
+
|
33 |
+
from diffusers.models.attention_processor import FluxAttnProcessor2_0, FluxSingleAttnProcessor2_0
|
34 |
+
from addit_attention_processors import AdditFluxAttnProcessor2_0, AdditFluxSingleAttnProcessor2_0
|
35 |
+
from addit_attention_store import AttentionStore
|
36 |
+
|
37 |
+
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
|
38 |
+
from skimage import filters
|
39 |
+
from visualization_utils import show_image_and_heatmap, show_images, draw_points_on_pil_image, draw_bboxes_on_image
|
40 |
+
from addit_blending_utils import clipseg_predict, grounding_sam_predict, mask_to_box_sam_predict, \
|
41 |
+
mask_to_mask_sam_predict, attention_to_points_sam_predict
|
42 |
+
|
43 |
+
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
|
44 |
+
from sam2.sam2_image_predictor import SAM2ImagePredictor
|
45 |
+
|
46 |
+
from scipy.optimize import brentq
|
47 |
+
from scipy.optimize import root_scalar
|
48 |
+
|
49 |
+
def register_my_attention_processors(transformer, attention_store, extended_steps_multi, extended_steps_single):
|
50 |
+
attn_procs = {}
|
51 |
+
|
52 |
+
for i, (name, processor) in enumerate(transformer.attn_processors.items()):
|
53 |
+
layer_name = ".".join(name.split(".")[:2])
|
54 |
+
|
55 |
+
if layer_name.startswith("transformer_blocks"):
|
56 |
+
attn_procs[name] = AdditFluxAttnProcessor2_0(layer_name=layer_name,
|
57 |
+
attention_store=attention_store,
|
58 |
+
extended_steps=extended_steps_multi)
|
59 |
+
elif layer_name.startswith("single_transformer_blocks"):
|
60 |
+
attn_procs[name] = AdditFluxSingleAttnProcessor2_0(layer_name=layer_name,
|
61 |
+
attention_store=attention_store,
|
62 |
+
extended_steps=extended_steps_single)
|
63 |
+
|
64 |
+
transformer.set_attn_processor(attn_procs)
|
65 |
+
|
66 |
+
def register_regular_attention_processors(transformer):
|
67 |
+
attn_procs = {}
|
68 |
+
|
69 |
+
for i, (name, processor) in enumerate(transformer.attn_processors.items()):
|
70 |
+
layer_name = ".".join(name.split(".")[:2])
|
71 |
+
|
72 |
+
if layer_name.startswith("transformer_blocks"):
|
73 |
+
attn_procs[name] = FluxAttnProcessor2_0()
|
74 |
+
elif layer_name.startswith("single_transformer_blocks"):
|
75 |
+
attn_procs[name] = FluxSingleAttnProcessor2_0()
|
76 |
+
|
77 |
+
transformer.set_attn_processor(attn_procs)
|
78 |
+
|
79 |
+
def img2img_retrieve_latents(
|
80 |
+
encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
|
81 |
+
):
|
82 |
+
if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
|
83 |
+
return encoder_output.latent_dist.sample(generator)
|
84 |
+
elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
|
85 |
+
return encoder_output.latent_dist.mode()
|
86 |
+
elif hasattr(encoder_output, "latents"):
|
87 |
+
return encoder_output.latents
|
88 |
+
else:
|
89 |
+
raise AttributeError("Could not access latents of provided encoder_output")
|
90 |
+
|
91 |
+
class AdditFluxPipeline(FluxPipeline):
|
92 |
+
def prepare_latents(
|
93 |
+
self,
|
94 |
+
batch_size,
|
95 |
+
num_channels_latents,
|
96 |
+
height,
|
97 |
+
width,
|
98 |
+
dtype,
|
99 |
+
device,
|
100 |
+
generator,
|
101 |
+
latents=None,
|
102 |
+
):
|
103 |
+
height = 2 * (int(height) // self.vae_scale_factor)
|
104 |
+
width = 2 * (int(width) // self.vae_scale_factor)
|
105 |
+
|
106 |
+
shape = (batch_size, num_channels_latents, height, width)
|
107 |
+
|
108 |
+
if latents is not None:
|
109 |
+
latent_image_ids = self._prepare_latent_image_ids(batch_size, height, width, device, dtype)
|
110 |
+
return latents.to(device=device, dtype=dtype), latent_image_ids
|
111 |
+
|
112 |
+
if isinstance(generator, list) and len(generator) != batch_size:
|
113 |
+
raise ValueError(
|
114 |
+
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
|
115 |
+
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
|
116 |
+
)
|
117 |
+
|
118 |
+
if isinstance(generator, list):
|
119 |
+
latents = torch.empty(shape, device=device, dtype=dtype)
|
120 |
+
|
121 |
+
latents_list = [randn_tensor(shape, generator=g, device=device, dtype=dtype) for g in generator]
|
122 |
+
|
123 |
+
for i, l_i in enumerate(latents_list):
|
124 |
+
latents[i] = l_i[i]
|
125 |
+
else:
|
126 |
+
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
|
127 |
+
|
128 |
+
latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)
|
129 |
+
|
130 |
+
latent_image_ids = self._prepare_latent_image_ids(batch_size, height, width, device, dtype)
|
131 |
+
|
132 |
+
return latents, latent_image_ids
|
133 |
+
|
134 |
+
@torch.no_grad()
|
135 |
+
def __call__(
|
136 |
+
self,
|
137 |
+
prompt: Union[str, List[str]] = None,
|
138 |
+
prompt_2: Optional[Union[str, List[str]]] = None,
|
139 |
+
height: Optional[int] = None,
|
140 |
+
width: Optional[int] = None,
|
141 |
+
num_inference_steps: int = 28,
|
142 |
+
timesteps: List[int] = None,
|
143 |
+
guidance_scale: Union[float, List[float]] = 7.0,
|
144 |
+
num_images_per_prompt: Optional[int] = 1,
|
145 |
+
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
|
146 |
+
latents: Optional[torch.FloatTensor] = None,
|
147 |
+
prompt_embeds: Optional[torch.FloatTensor] = None,
|
148 |
+
pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
|
149 |
+
output_type: Optional[str] = "pil",
|
150 |
+
return_dict: bool = True,
|
151 |
+
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
|
152 |
+
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
|
153 |
+
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
|
154 |
+
max_sequence_length: int = 512,
|
155 |
+
|
156 |
+
seed: Optional[Union[int, List[int]]] = None,
|
157 |
+
same_latent_for_all_prompts: bool = False,
|
158 |
+
|
159 |
+
# Extended Attention
|
160 |
+
extended_steps_multi: Optional[int] = -1,
|
161 |
+
extended_steps_single: Optional[int] = -1,
|
162 |
+
extended_scale: Optional[Union[float, str]] = 1.0,
|
163 |
+
|
164 |
+
# Structure Transfer
|
165 |
+
source_latents: Optional[torch.FloatTensor] = None,
|
166 |
+
structure_transfer_step: int = 5,
|
167 |
+
|
168 |
+
# Latent Blending
|
169 |
+
subject_token: Optional[str] = None,
|
170 |
+
localization_model: Optional[str] = "attention_points_sam",
|
171 |
+
blend_steps: List[int] = [],
|
172 |
+
show_attention: bool = False,
|
173 |
+
|
174 |
+
# Real Image Source
|
175 |
+
is_img_src: bool = False,
|
176 |
+
use_offset: bool = False,
|
177 |
+
img_src_latents: Optional[List[torch.FloatTensor]] = None,
|
178 |
+
):
|
179 |
+
r"""
|
180 |
+
Function invoked when calling the pipeline for generation.
|
181 |
+
|
182 |
+
Args:
|
183 |
+
prompt (`str` or `List[str]`, *optional*):
|
184 |
+
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
|
185 |
+
instead.
|
186 |
+
prompt_2 (`str` or `List[str]`, *optional*):
|
187 |
+
The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt`
|
188 |
+
will be used instead
|
189 |
+
height (`int`, *optional*, defaults to self.default_sample_size * self.vae_scale_factor):
|
190 |
+
The height in pixels of the generated image. This is set to 1024 by default for the best results.
|
191 |
+
width (`int`, *optional*, defaults to self.default_sample_size * self.vae_scale_factor):
|
192 |
+
The width in pixels of the generated image. This is set to 1024 by default for the best results.
|
193 |
+
num_inference_steps (`int`, *optional*, defaults to 28):
|
194 |
+
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
|
195 |
+
expense of slower inference.
|
196 |
+
timesteps (`List[int]`, *optional*):
|
197 |
+
Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
|
198 |
+
in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
|
199 |
+
passed will be used. Must be in descending order.
|
200 |
+
guidance_scale (`float` or `List[float]`, *optional*, defaults to 7.0):
|
201 |
+
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
|
202 |
+
`guidance_scale` is defined as `w` of equation 2. of [Imagen
|
203 |
+
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
|
204 |
+
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
|
205 |
+
usually at the expense of lower image quality.
|
206 |
+
num_images_per_prompt (`int`, *optional*, defaults to 1):
|
207 |
+
The number of images to generate per prompt.
|
208 |
+
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
|
209 |
+
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
|
210 |
+
to make generation deterministic.
|
211 |
+
latents (`torch.FloatTensor`, *optional*):
|
212 |
+
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
|
213 |
+
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
|
214 |
+
tensor will be generated by sampling using the supplied random `generator`.
|
215 |
+
prompt_embeds (`torch.FloatTensor`, *optional*):
|
216 |
+
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
|
217 |
+
provided, text embeddings will be generated from `prompt` input argument.
|
218 |
+
pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
|
219 |
+
Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
|
220 |
+
If not provided, pooled text embeddings will be generated from `prompt` input argument.
|
221 |
+
output_type (`str`, *optional*, defaults to `"pil"`):
|
222 |
+
The output format of the generated image. Choose between
|
223 |
+
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
|
224 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
225 |
+
Whether or not to return a [`~pipelines.flux.FluxPipelineOutput`] instead of a plain tuple.
|
226 |
+
joint_attention_kwargs (`dict`, *optional*):
|
227 |
+
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
228 |
+
`self.processor` in
|
229 |
+
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
230 |
+
callback_on_step_end (`Callable`, *optional*):
|
231 |
+
A function that is called at the end of each denoising step during inference. The function is called
|
232 |
+
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
|
233 |
+
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
|
234 |
+
`callback_on_step_end_tensor_inputs`.
|
235 |
+
callback_on_step_end_tensor_inputs (`List`, *optional*):
|
236 |
+
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
|
237 |
+
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
|
238 |
+
`._callback_tensor_inputs` attribute of your pipeline class.
|
239 |
+
max_sequence_length (`int` defaults to 512): Maximum sequence length to use with the `prompt`.
|
240 |
+
|
241 |
+
Examples:
|
242 |
+
|
243 |
+
Returns:
|
244 |
+
[`~pipelines.flux.FluxPipelineOutput`] or `tuple`: [`~pipelines.flux.FluxPipelineOutput`] if `return_dict`
|
245 |
+
is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated
|
246 |
+
images.
|
247 |
+
"""
|
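# Illustrative usage sketch for the extended __call__ arguments (the checkpoint id, prompts and
# settings below are placeholders, not values prescribed by this file):
#   pipe = AdditFluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",
#                                            torch_dtype=torch.bfloat16).to("cuda")
#   out = pipe(
#       prompt=["A photo of a cat", "A photo of a cat wearing a red hat"],
#       seed=[0, 1], same_latent_for_all_prompts=True,
#       subject_token="hat", blend_steps=[15], localization_model="attention_points_sam",
#       extended_scale="auto",
#   )
#   out.images[1].save("cat_with_hat.png")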
248 |
+
|
249 |
+
device = self._execution_device
|
250 |
+
|
251 |
+
# Blend Steps
|
252 |
+
blend_models = {}
|
253 |
+
if len(blend_steps) > 0:
|
254 |
+
if localization_model == "clipseg":
|
255 |
+
blend_models["clipseg_processor"] = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
|
256 |
+
blend_models["clipseg_model"] = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").to(device)
|
257 |
+
elif localization_model == "grounding_sam":
|
258 |
+
grounding_dino_model_id = "IDEA-Research/grounding-dino-base"
|
259 |
+
blend_models["grounding_processor"] = AutoProcessor.from_pretrained(grounding_dino_model_id)
|
260 |
+
blend_models["grounding_model"] = AutoModelForZeroShotObjectDetection.from_pretrained(grounding_dino_model_id).to(device)
|
261 |
+
blend_models["sam_predictor"] = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
|
262 |
+
elif localization_model == "clipseg_sam":
|
263 |
+
blend_models["clipseg_processor"] = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
|
264 |
+
blend_models["clipseg_model"] = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").to(device)
|
265 |
+
blend_models["sam_predictor"] = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
|
266 |
+
elif localization_model == "attention":
|
267 |
+
pass
|
268 |
+
elif localization_model in ["attention_box_sam", "attention_mask_sam", "attention_points_sam"]:
|
269 |
+
blend_models["sam_predictor"] = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
|
270 |
+
|
271 |
+
height = height or self.default_sample_size * self.vae_scale_factor
|
272 |
+
width = width or self.default_sample_size * self.vae_scale_factor
|
273 |
+
|
274 |
+
# 1. Check inputs. Raise error if not correct
|
275 |
+
self.check_inputs(
|
276 |
+
prompt,
|
277 |
+
prompt_2,
|
278 |
+
height,
|
279 |
+
width,
|
280 |
+
prompt_embeds=prompt_embeds,
|
281 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
282 |
+
callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
|
283 |
+
max_sequence_length=max_sequence_length,
|
284 |
+
)
|
285 |
+
|
286 |
+
self._guidance_scale = guidance_scale
|
287 |
+
self._joint_attention_kwargs = joint_attention_kwargs
|
288 |
+
self._interrupt = False
|
289 |
+
|
290 |
+
# 2. Define call parameters
|
291 |
+
if prompt is not None and isinstance(prompt, str):
|
292 |
+
batch_size = 1
|
293 |
+
elif prompt is not None and isinstance(prompt, list):
|
294 |
+
batch_size = len(prompt)
|
295 |
+
else:
|
296 |
+
batch_size = prompt_embeds.shape[0]
|
297 |
+
|
298 |
+
device = self._execution_device
|
299 |
+
|
300 |
+
lora_scale = (
|
301 |
+
self.joint_attention_kwargs.get("scale", None) if self.joint_attention_kwargs is not None else None
|
302 |
+
)
|
303 |
+
(
|
304 |
+
prompt_embeds,
|
305 |
+
pooled_prompt_embeds,
|
306 |
+
text_ids,
|
307 |
+
) = self.encode_prompt(
|
308 |
+
prompt=prompt,
|
309 |
+
prompt_2=prompt_2,
|
310 |
+
prompt_embeds=prompt_embeds,
|
311 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
312 |
+
device=device,
|
313 |
+
num_images_per_prompt=num_images_per_prompt,
|
314 |
+
max_sequence_length=max_sequence_length,
|
315 |
+
lora_scale=lora_scale,
|
316 |
+
)
|
317 |
+
|
318 |
+
# 4. Prepare latent variables
|
319 |
+
if (generator is None) and seed is not None:
|
320 |
+
if isinstance(seed, int):
|
321 |
+
generator = torch.Generator(device=device).manual_seed(seed)
|
322 |
+
else:
|
323 |
+
assert len(seed) == batch_size, "The number of seeds must match the batch size"
|
324 |
+
generator = [torch.Generator(device=device).manual_seed(s) for s in seed]
|
325 |
+
|
326 |
+
num_channels_latents = self.transformer.config.in_channels // 4
|
327 |
+
latents, latent_image_ids = self.prepare_latents(
|
328 |
+
batch_size * num_images_per_prompt,
|
329 |
+
num_channels_latents,
|
330 |
+
height,
|
331 |
+
width,
|
332 |
+
prompt_embeds.dtype,
|
333 |
+
device,
|
334 |
+
generator,
|
335 |
+
latents,
|
336 |
+
)
|
337 |
+
|
338 |
+
if same_latent_for_all_prompts:
|
339 |
+
latents = latents[:1].repeat(batch_size * num_images_per_prompt, 1, 1)
|
340 |
+
|
341 |
+
noise = latents.clone()
|
342 |
+
|
343 |
+
attention_store_kwargs = {}
|
344 |
+
|
345 |
+
if extended_scale == "auto":
|
346 |
+
is_auto_extend_scale = True
|
347 |
+
extended_scale = 1.05
|
348 |
+
attention_store_kwargs["is_cache_attn_ratio"] = True
|
349 |
+
auto_extended_step = 5
|
350 |
+
target_auto_ratio = 1.05
|
351 |
+
else:
|
352 |
+
is_auto_extend_scale = False
|
353 |
+
|
354 |
+
if len(blend_steps) > 0:
|
355 |
+
attn_steps = range(blend_steps[0] - 2, blend_steps[0] + 1)
|
356 |
+
attention_store_kwargs["record_attention_steps"] = attn_steps
|
357 |
+
|
358 |
+
self.attention_store = AttentionStore(prompts=prompt, tokenizer=self.tokenizer_2, subject_token=subject_token, **attention_store_kwargs)
|
359 |
+
register_my_attention_processors(self.transformer, self.attention_store, extended_steps_multi, extended_steps_single)
|
360 |
+
|
361 |
+
# 5. Prepare timesteps
|
362 |
+
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps)
|
363 |
+
image_seq_len = latents.shape[1]
|
364 |
+
mu = calculate_shift(
|
365 |
+
image_seq_len,
|
366 |
+
self.scheduler.config.base_image_seq_len,
|
367 |
+
self.scheduler.config.max_image_seq_len,
|
368 |
+
self.scheduler.config.base_shift,
|
369 |
+
self.scheduler.config.max_shift,
|
370 |
+
)
|
371 |
+
timesteps, num_inference_steps = retrieve_timesteps(
|
372 |
+
self.scheduler,
|
373 |
+
num_inference_steps,
|
374 |
+
device,
|
375 |
+
timesteps,
|
376 |
+
sigmas,
|
377 |
+
mu=mu,
|
378 |
+
)
|
379 |
+
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
|
380 |
+
self._num_timesteps = len(timesteps)
|
381 |
+
|
382 |
+
# handle guidance
|
383 |
+
if self.transformer.config.guidance_embeds:
|
384 |
+
if isinstance(guidance_scale, float):
|
385 |
+
guidance = torch.full([1], guidance_scale, device=device, dtype=torch.float32)
|
386 |
+
guidance = guidance.expand(latents.shape[0])
|
387 |
+
elif isinstance(guidance_scale, list):
|
388 |
+
assert len(guidance_scale) == latents.shape[0], "The number of guidance scales must match the batch size"
|
389 |
+
guidance = torch.tensor(guidance_scale, device=device, dtype=torch.float32)
|
390 |
+
else:
|
391 |
+
guidance = None
|
392 |
+
|
393 |
+
if is_img_src and img_src_latents is None:
|
394 |
+
assert source_latents is not None, "source_latents must be provided when is_img_src is True"
|
395 |
+
|
396 |
+
rand_noise = noise[0].clone()
|
397 |
+
img_src_latents = []
|
398 |
+
|
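# The loop below precomputes a noisy version of the source latent for every timestep, using the
# flow-matching interpolation x_t = (1 - sigma_t) * x_0 + sigma_t * noise with one fixed noise
# sample, so the source branch (latents[0]) can be reset to it at each denoising step.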
399 |
+
for i in range(timesteps.shape[0]):
|
400 |
+
sigma = self.scheduler.sigmas[i]
|
401 |
+
img_src_latents.append((1.0 - sigma) * source_latents[0] + sigma * rand_noise)
|
402 |
+
|
403 |
+
# 6. Denoising loop
|
404 |
+
with self.progress_bar(total=num_inference_steps) as progress_bar:
|
405 |
+
for i, t in enumerate(timesteps):
|
406 |
+
if self.interrupt:
|
407 |
+
continue
|
408 |
+
|
409 |
+
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
|
410 |
+
timestep = t.expand(latents.shape[0]).to(latents.dtype)
|
411 |
+
|
412 |
+
# For denoising from source image
|
413 |
+
if is_img_src:
|
414 |
+
latents[0] = img_src_latents[i]
|
415 |
+
|
416 |
+
# For Structure Transfer
|
417 |
+
if (source_latents is not None) and i == structure_transfer_step:
|
418 |
+
sigma = self.scheduler.sigmas[i]
|
419 |
+
latents[1] = (1.0 - sigma) * source_latents[0] + sigma * noise[1]
|
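# Structure transfer: the edited branch (index 1) is re-noised from the source latent at this
# step, so its subsequent denoising starts from the source image's coarse layout.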
420 |
+
|
421 |
+
if is_auto_extend_scale and i == auto_extended_step:
|
422 |
+
def f(gamma):
|
423 |
+
self.attention_store.attention_ratios[i] = {}
|
424 |
+
noise_pred = self.transformer(
|
425 |
+
hidden_states=latents,
|
426 |
+
timestep=timestep / 1000,
|
427 |
+
guidance=guidance,
|
428 |
+
pooled_projections=pooled_prompt_embeds,
|
429 |
+
encoder_hidden_states=prompt_embeds,
|
430 |
+
txt_ids=text_ids,
|
431 |
+
img_ids=latent_image_ids,
|
432 |
+
joint_attention_kwargs=self.joint_attention_kwargs,
|
433 |
+
return_dict=False,
|
434 |
+
proccesor_kwargs={"step_index": i, "extended_scale": gamma},
|
435 |
+
)[0]
|
436 |
+
|
437 |
+
scores_per_layer = self.attention_store.get_attention_ratios(step_indices=[i], display_imgs=False)
|
438 |
+
source_sum, text_sum, target_sum = scores_per_layer['transformer_blocks']
|
439 |
+
|
440 |
+
# We want to find the gamma that makes the ratio equal to target_auto_ratio
|
441 |
+
ratio = (target_sum / source_sum)
|
442 |
+
return (ratio - target_auto_ratio)
|
443 |
+
|
444 |
+
gamma_sol = brentq(f, 1.0, 1.2, xtol=0.01)
|
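# brentq performs a bracketed root search on f over [1.0, 1.2] (it assumes f changes sign across
# the bracket), returning a gamma (to within xtol) at which the target/source attention ratio
# approximately equals target_auto_ratio.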
445 |
+
|
446 |
+
print('Chosen gamma:', gamma_sol)
|
447 |
+
extended_scale = gamma_sol
|
448 |
+
else:
|
449 |
+
noise_pred = self.transformer(
|
450 |
+
hidden_states=latents,
|
451 |
+
timestep=timestep / 1000,
|
452 |
+
guidance=guidance,
|
453 |
+
pooled_projections=pooled_prompt_embeds,
|
454 |
+
encoder_hidden_states=prompt_embeds,
|
455 |
+
txt_ids=text_ids,
|
456 |
+
img_ids=latent_image_ids,
|
457 |
+
joint_attention_kwargs=self.joint_attention_kwargs,
|
458 |
+
return_dict=False,
|
459 |
+
proccesor_kwargs={"step_index": i, "extended_scale": extended_scale},
|
460 |
+
)[0]
|
461 |
+
|
462 |
+
# compute the previous noisy sample x_t -> x_t-1
|
463 |
+
latents_dtype = latents.dtype
|
464 |
+
latents, x0 = self.scheduler.step(noise_pred, t, latents, return_dict=False, step_index=i)
|
465 |
+
|
466 |
+
if use_offset and is_img_src and (i+1 < len(img_src_latents)):
|
467 |
+
next_latent = img_src_latents[i+1]
|
468 |
+
offset = (next_latent - latents[0])
|
469 |
+
latents[1] = latents[1] + offset
|
470 |
+
|
471 |
+
# blend latents
|
472 |
+
if i in blend_steps and (subject_token is not None) and (localization_model is not None):
|
473 |
+
x0 = self._unpack_latents(x0, height, width, self.vae_scale_factor)
|
474 |
+
x0 = (x0 / self.vae.config.scaling_factor) + self.vae.config.shift_factor
|
475 |
+
images = self.vae.decode(x0, return_dict=False)[0]
|
476 |
+
images = self.image_processor.postprocess(images, output_type="pil")
|
477 |
+
|
478 |
+
self.do_step_blend(images, latents, subject_token, localization_model, show_attention, i, blend_models)
|
479 |
+
|
480 |
+
if latents.dtype != latents_dtype:
|
481 |
+
if torch.backends.mps.is_available():
|
482 |
+
# some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
|
483 |
+
latents = latents.to(latents_dtype)
|
484 |
+
|
485 |
+
if callback_on_step_end is not None:
|
486 |
+
callback_kwargs = {}
|
487 |
+
for k in callback_on_step_end_tensor_inputs:
|
488 |
+
callback_kwargs[k] = locals()[k]
|
489 |
+
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
|
490 |
+
|
491 |
+
latents = callback_outputs.pop("latents", latents)
|
492 |
+
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
|
493 |
+
|
494 |
+
# call the callback, if provided
|
495 |
+
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
|
496 |
+
progress_bar.update()
|
497 |
+
|
498 |
+
# if XLA_AVAILABLE:
|
499 |
+
# xm.mark_step()
|
500 |
+
|
501 |
+
if output_type == "latent":
|
502 |
+
image = latents
|
503 |
+
elif output_type == "both":
|
504 |
+
return_latents = latents
|
505 |
+
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
|
506 |
+
|
507 |
+
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
|
508 |
+
image = self.vae.decode(latents, return_dict=False)[0]
|
509 |
+
image = self.image_processor.postprocess(image, output_type="pil")
|
510 |
+
|
511 |
+
return (image, return_latents)
|
512 |
+
else:
|
513 |
+
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
|
514 |
+
|
515 |
+
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
|
516 |
+
image = self.vae.decode(latents, return_dict=False)[0]
|
517 |
+
image = self.image_processor.postprocess(image, output_type=output_type)
|
518 |
+
|
519 |
+
# Offload all models
|
520 |
+
self.maybe_free_model_hooks()
|
521 |
+
|
522 |
+
if not return_dict:
|
523 |
+
return (image,)
|
524 |
+
|
525 |
+
return FluxPipelineOutput(images=image)
|
526 |
+
|
527 |
+
def do_step_blend(self, images, latents, subject_token, localization_model,
|
528 |
+
show_attention, i, blend_models):
|
529 |
+
|
530 |
+
device = latents.device
|
531 |
+
latents_dtype = latents.dtype
|
532 |
+
|
533 |
+
clipseg_processor = blend_models.get("clipseg_processor", None)
|
534 |
+
clipseg_model = blend_models.get("clipseg_model", None)
|
535 |
+
grounding_processor = blend_models.get("grounding_processor", None)
|
536 |
+
grounding_model = blend_models.get("grounding_model", None)
|
537 |
+
sam_predictor = blend_models.get("sam_predictor", None)
|
538 |
+
|
539 |
+
image_to_display = []
|
540 |
+
titles_to_display = []
|
541 |
+
|
542 |
+
if show_attention:
|
543 |
+
image_to_display += [images[0], images[1]]
|
544 |
+
titles_to_display += ["Source X0", "Target X0"]
|
545 |
+
|
546 |
+
if localization_model == "clipseg":
|
547 |
+
subject_mask = clipseg_predict(clipseg_model, clipseg_processor, [images[-1]], f"A photo of {subject_token}", device)
|
548 |
+
elif localization_model == "grounding_sam":
|
549 |
+
subject_mask = grounding_sam_predict(grounding_model, grounding_processor, sam_predictor, images[-1], f"A {subject_token}.", device)
|
550 |
+
elif localization_model == "clipseg_sam":
|
551 |
+
subject_mask = clipseg_predict(clipseg_model, clipseg_processor, [images[-1]], f"A photo of {subject_token}", device)
|
552 |
+
subject_mask = mask_to_box_sam_predict(subject_mask, sam_predictor, images[-1], None, device)
|
553 |
+
elif localization_model == "attention":
|
554 |
+
store = self.attention_store.image2text_store
|
555 |
+
attention_maps, attention_masks, tokens = self.attention_store.aggregate_attention(store, target_layers=None, gaussian_kernel=3)
|
556 |
+
|
557 |
+
subject_mask = attention_masks[0][-1].to(device)
|
558 |
+
subject_attention = attention_maps[0][-1].to(device)
|
559 |
+
|
560 |
+
if show_attention:
|
561 |
+
attentioned_image = show_image_and_heatmap(subject_attention.float(), images[1], relevnace_res=512)
|
562 |
+
attention_masked_image = show_image_and_heatmap(subject_mask.float(), images[1], relevnace_res=512)
|
563 |
+
|
564 |
+
image_to_display += [attentioned_image, attention_masked_image]
|
565 |
+
titles_to_display += ["Attention", "Attention Mask"]
|
566 |
+
|
567 |
+
elif localization_model == "attention_box_sam":
|
568 |
+
store = self.attention_store.image2text_store
|
569 |
+
attention_maps, attention_masks, tokens = self.attention_store.aggregate_attention(store, target_layers=None, gaussian_kernel=3)
|
570 |
+
|
571 |
+
attention_mask = attention_masks[0][-1].to(device)
|
572 |
+
subject_attention = attention_maps[0][-1].to(device)
|
573 |
+
|
574 |
+
subject_mask, bbox = mask_to_box_sam_predict(attention_mask, sam_predictor, images[-1], None, device)
|
575 |
+
|
576 |
+
if show_attention:
|
577 |
+
attentioned_image = show_image_and_heatmap(subject_attention.float(), images[1], relevnace_res=512)
|
578 |
+
attention_masked_image = show_image_and_heatmap(attention_mask.float(), images[1], relevnace_res=512)
|
579 |
+
|
580 |
+
sam_masked_image = show_image_and_heatmap(subject_mask.float(), images[1], relevnace_res=1024)
|
581 |
+
sam_masked_image = draw_bboxes_on_image(sam_masked_image, [bbox.tolist()], color="green", thickness=5)
|
582 |
+
|
583 |
+
image_to_display += [attentioned_image, attention_masked_image, sam_masked_image]
|
584 |
+
titles_to_display += ["Attention", "Attention Mask", "SAM Mask"]
|
585 |
+
|
586 |
+
elif localization_model == "attention_mask_sam":
|
587 |
+
store = self.attention_store.image2text_store
|
588 |
+
attention_maps, attention_masks, tokens = self.attention_store.aggregate_attention(store, target_layers=None, gaussian_kernel=3)
|
589 |
+
|
590 |
+
attention_mask = attention_masks[0][-1].to(device)
|
591 |
+
subject_attention = attention_maps[0][-1].to(device)
|
592 |
+
|
593 |
+
subject_mask = mask_to_mask_sam_predict(attention_mask, sam_predictor, images[-1], None, device)
|
594 |
+
|
595 |
+
if show_attention:
|
596 |
+
print('Attention:')
|
597 |
+
attentioned_image = show_image_and_heatmap(subject_attention.float(), images[1], relevnace_res=512)
|
598 |
+
attention_masked_image = show_image_and_heatmap(attention_mask.float(), images[1], relevnace_res=512)
|
599 |
+
sam_masked_image = show_image_and_heatmap(subject_mask.float(), images[1], relevnace_res=1024)
|
600 |
+
|
601 |
+
image_to_display += [attentioned_image, attention_masked_image, sam_masked_image]
|
602 |
+
titles_to_display += ["Attention", "Attention Mask", "SAM Mask"]
|
603 |
+
|
604 |
+
elif localization_model == "attention_points_sam":
|
605 |
+
store = self.attention_store.image2text_store
|
606 |
+
attention_maps, attention_masks, tokens = self.attention_store.aggregate_attention(store, target_layers=None, gaussian_kernel=3)
|
607 |
+
|
608 |
+
attention_mask = attention_masks[0][-1].to(device)
|
609 |
+
subject_attention = attention_maps[0][-1].to(device)
|
610 |
+
|
611 |
+
subject_mask, point_coords = attention_to_points_sam_predict(subject_attention, attention_mask, sam_predictor, images[1], None, device)
|
612 |
+
|
613 |
+
if show_attention:
|
614 |
+
print('Attention:')
|
615 |
+
attentioned_image = show_image_and_heatmap(subject_attention.float(), images[1], relevnace_res=512)
|
616 |
+
attention_masked_image = show_image_and_heatmap(attention_mask.float(), images[1], relevnace_res=512)
|
617 |
+
|
618 |
+
sam_masked_image = show_image_and_heatmap(subject_mask.float(), images[1], relevnace_res=1024)
|
619 |
+
sam_masked_image = draw_points_on_pil_image(sam_masked_image, point_coords, point_color="green", radius=10)
|
620 |
+
|
621 |
+
image_to_display += [attentioned_image, attention_masked_image, sam_masked_image]
|
622 |
+
titles_to_display += ["Attention", "Attention Mask", "SAM Mask"]
|
623 |
+
|
624 |
+
if show_attention:
|
625 |
+
show_images(image_to_display, titles_to_display, size=512, save_path="attn_vis.png")
|
626 |
+
|
627 |
+
# Resize the mask to latents size
|
628 |
+
latents_mask = torch.nn.functional.interpolate(subject_mask.view(1,1,subject_mask.shape[-2],subject_mask.shape[-1]), size=64, mode='bilinear').view(4096, 1).to(latents_dtype)
|
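# Note: size=64 and the 4096-token view assume a 1024x1024 output, whose 128x128 VAE latent is
# packed 2x2 into a 64x64 grid of image tokens.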
629 |
+
latents_mask[latents_mask > 0.01] = 1
|
630 |
+
|
631 |
+
latents[1] = latents[1] * latents_mask + latents[0] * (1 - latents_mask)
|
632 |
+
|
633 |
+
############# Image to Image Methods #############
|
634 |
+
def img2img_encode_vae_image(self, image: torch.Tensor, generator: torch.Generator):
|
635 |
+
if isinstance(generator, list):
|
636 |
+
image_latents = [
|
637 |
+
img2img_retrieve_latents(self.vae.encode(image[i : i + 1]), generator=generator[i])
|
638 |
+
for i in range(image.shape[0])
|
639 |
+
]
|
640 |
+
image_latents = torch.cat(image_latents, dim=0)
|
641 |
+
else:
|
642 |
+
image_latents = img2img_retrieve_latents(self.vae.encode(image), generator=generator)
|
643 |
+
|
644 |
+
image_latents = (image_latents - self.vae.config.shift_factor) * self.vae.config.scaling_factor
|
645 |
+
|
646 |
+
return image_latents
|
647 |
+
|
648 |
+
def img2img_prepare_latents(
|
649 |
+
self,
|
650 |
+
image,
|
651 |
+
timestep,
|
652 |
+
batch_size,
|
653 |
+
num_channels_latents,
|
654 |
+
height,
|
655 |
+
width,
|
656 |
+
dtype,
|
657 |
+
device,
|
658 |
+
generator,
|
659 |
+
latents=None,
|
660 |
+
):
|
661 |
+
if isinstance(generator, list) and len(generator) != batch_size:
|
662 |
+
raise ValueError(
|
663 |
+
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
|
664 |
+
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
|
665 |
+
)
|
666 |
+
|
667 |
+
height = 2 * (int(height) // self.vae_scale_factor)
|
668 |
+
width = 2 * (int(width) // self.vae_scale_factor)
|
669 |
+
|
670 |
+
shape = (batch_size, num_channels_latents, height, width)
|
671 |
+
latent_image_ids = self.img2img_prepare_latent_image_ids(batch_size, height, width, device, dtype)
|
672 |
+
|
673 |
+
if latents is not None:
|
674 |
+
return latents.to(device=device, dtype=dtype), latent_image_ids
|
675 |
+
|
676 |
+
image = image.to(device=device, dtype=dtype)
|
677 |
+
image_latents = self.img2img_encode_vae_image(image=image, generator=generator)
|
678 |
+
if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:
|
679 |
+
# expand init_latents for batch_size
|
680 |
+
additional_image_per_prompt = batch_size // image_latents.shape[0]
|
681 |
+
image_latents = torch.cat([image_latents] * additional_image_per_prompt, dim=0)
|
682 |
+
elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:
|
683 |
+
raise ValueError(
|
684 |
+
f"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts."
|
685 |
+
)
|
686 |
+
else:
|
687 |
+
image_latents = torch.cat([image_latents], dim=0)
|
688 |
+
|
689 |
+
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
|
690 |
+
latents = self.scheduler.scale_noise(image_latents, timestep, noise)
|
691 |
+
latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)
|
692 |
+
return latents, latent_image_ids
|
693 |
+
|
694 |
+
def img2img_check_inputs(
|
695 |
+
self,
|
696 |
+
prompt,
|
697 |
+
prompt_2,
|
698 |
+
strength,
|
699 |
+
height,
|
700 |
+
width,
|
701 |
+
prompt_embeds=None,
|
702 |
+
pooled_prompt_embeds=None,
|
703 |
+
callback_on_step_end_tensor_inputs=None,
|
704 |
+
max_sequence_length=None,
|
705 |
+
):
|
706 |
+
if strength < 0 or strength > 1:
|
707 |
+
raise ValueError(f"The value of strength should in [0.0, 1.0] but is {strength}")
|
708 |
+
|
709 |
+
if height % 8 != 0 or width % 8 != 0:
|
710 |
+
raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
|
711 |
+
|
712 |
+
if callback_on_step_end_tensor_inputs is not None and not all(
|
713 |
+
k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
|
714 |
+
):
|
715 |
+
raise ValueError(
|
716 |
+
f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
|
717 |
+
)
|
718 |
+
|
719 |
+
if prompt is not None and prompt_embeds is not None:
|
720 |
+
raise ValueError(
|
721 |
+
f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
|
722 |
+
" only forward one of the two."
|
723 |
+
)
|
724 |
+
elif prompt_2 is not None and prompt_embeds is not None:
|
725 |
+
raise ValueError(
|
726 |
+
f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
|
727 |
+
" only forward one of the two."
|
728 |
+
)
|
729 |
+
elif prompt is None and prompt_embeds is None:
|
730 |
+
raise ValueError(
|
731 |
+
"Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
|
732 |
+
)
|
733 |
+
elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
|
734 |
+
raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
|
735 |
+
elif prompt_2 is not None and (not isinstance(prompt_2, str) and not isinstance(prompt_2, list)):
|
736 |
+
raise ValueError(f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}")
|
737 |
+
|
738 |
+
if prompt_embeds is not None and pooled_prompt_embeds is None:
|
739 |
+
raise ValueError(
|
740 |
+
"If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed. Make sure to generate `pooled_prompt_embeds` from the same text encoder that was used to generate `prompt_embeds`."
|
741 |
+
)
|
742 |
+
|
743 |
+
if max_sequence_length is not None and max_sequence_length > 512:
|
744 |
+
raise ValueError(f"`max_sequence_length` cannot be greater than 512 but is {max_sequence_length}")
|
745 |
+
|
746 |
+
# Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3_img2img.StableDiffusion3Img2ImgPipeline.get_timesteps
|
747 |
+
def img2img_get_timesteps(self, num_inference_steps, strength, device):
|
748 |
+
# get the original timestep using init_timestep
|
749 |
+
init_timestep = min(num_inference_steps * strength, num_inference_steps)
|
750 |
+
|
751 |
+
t_start = int(max(num_inference_steps - init_timestep, 0))
|
752 |
+
timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
|
753 |
+
if hasattr(self.scheduler, "set_begin_index"):
|
754 |
+
self.scheduler.set_begin_index(t_start * self.scheduler.order)
|
755 |
+
|
756 |
+
return timesteps, num_inference_steps - t_start
|
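# Example: with num_inference_steps=28 and strength=0.6, init_timestep = 16.8 and t_start = 11,
# so the last 17 of the 28 scheduled timesteps are used.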
757 |
+
|
758 |
+
@staticmethod
|
759 |
+
# Copied from diffusers.pipelines.flux.pipeline_flux.FluxPipeline._prepare_latent_image_ids
|
760 |
+
def img2img_prepare_latent_image_ids(batch_size, height, width, device, dtype):
|
761 |
+
latent_image_ids = torch.zeros(height // 2, width // 2, 3)
|
762 |
+
latent_image_ids[..., 1] = latent_image_ids[..., 1] + torch.arange(height // 2)[:, None]
|
763 |
+
latent_image_ids[..., 2] = latent_image_ids[..., 2] + torch.arange(width // 2)[None, :]
|
764 |
+
|
765 |
+
latent_image_id_height, latent_image_id_width, latent_image_id_channels = latent_image_ids.shape
|
766 |
+
|
767 |
+
latent_image_ids = latent_image_ids.reshape(
|
768 |
+
latent_image_id_height * latent_image_id_width, latent_image_id_channels
|
769 |
+
)
|
770 |
+
|
771 |
+
return latent_image_ids.to(device=device, dtype=dtype)
|
772 |
+
|
773 |
+
@torch.no_grad()
|
774 |
+
def call_img2img(
|
775 |
+
self,
|
776 |
+
prompt: Union[str, List[str]] = None,
|
777 |
+
prompt_2: Optional[Union[str, List[str]]] = None,
|
778 |
+
image: PipelineImageInput = None,
|
779 |
+
height: Optional[int] = None,
|
780 |
+
width: Optional[int] = None,
|
781 |
+
strength: float = 0.6,
|
782 |
+
num_inference_steps: int = 28,
|
783 |
+
timesteps: List[int] = None,
|
784 |
+
guidance_scale: float = 7.0,
|
785 |
+
num_images_per_prompt: Optional[int] = 1,
|
786 |
+
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
|
787 |
+
latents: Optional[torch.FloatTensor] = None,
|
788 |
+
prompt_embeds: Optional[torch.FloatTensor] = None,
|
789 |
+
pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
|
790 |
+
output_type: Optional[str] = "pil",
|
791 |
+
return_dict: bool = True,
|
792 |
+
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
|
793 |
+
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
|
794 |
+
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
|
795 |
+
max_sequence_length: int = 512,
|
796 |
+
):
|
797 |
+
r"""
|
798 |
+
Function invoked when calling the pipeline for generation.
|
799 |
+
|
800 |
+
Args:
|
801 |
+
prompt (`str` or `List[str]`, *optional*):
|
802 |
+
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
|
803 |
+
instead.
|
804 |
+
prompt_2 (`str` or `List[str]`, *optional*):
|
805 |
+
The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt`
|
806 |
+
will be used instead
|
807 |
+
image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
|
808 |
+
`Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
|
809 |
+
numpy array and pytorch tensor, the expected value range is between `[0, 1]`. If it's a tensor or a list
|
810 |
+
of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
|
811 |
+
list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`. It can also accept image
|
812 |
+
latents as `image`, but if passing latents directly it is not encoded again.
|
813 |
+
height (`int`, *optional*, defaults to self.default_sample_size * self.vae_scale_factor):
|
814 |
+
The height in pixels of the generated image. This is set to 1024 by default for the best results.
|
815 |
+
width (`int`, *optional*, defaults to self.default_sample_size * self.vae_scale_factor):
|
816 |
+
The width in pixels of the generated image. This is set to 1024 by default for the best results.
|
817 |
+
strength (`float`, *optional*, defaults to 1.0):
|
818 |
+
Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a
|
819 |
+
starting point and more noise is added the higher the `strength`. The number of denoising steps depends
|
820 |
+
on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
|
821 |
+
process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
|
822 |
+
essentially ignores `image`.
|
823 |
+
num_inference_steps (`int`, *optional*, defaults to 50):
|
824 |
+
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
|
825 |
+
expense of slower inference.
|
826 |
+
timesteps (`List[int]`, *optional*):
|
827 |
+
Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
|
828 |
+
in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
|
829 |
+
passed will be used. Must be in descending order.
|
830 |
+
guidance_scale (`float`, *optional*, defaults to 7.0):
|
831 |
+
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
|
832 |
+
`guidance_scale` is defined as `w` of equation 2. of [Imagen
|
833 |
+
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
|
834 |
+
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
|
835 |
+
usually at the expense of lower image quality.
|
836 |
+
num_images_per_prompt (`int`, *optional*, defaults to 1):
|
837 |
+
The number of images to generate per prompt.
|
838 |
+
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
|
839 |
+
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
|
840 |
+
to make generation deterministic.
|
841 |
+
latents (`torch.FloatTensor`, *optional*):
|
842 |
+
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
|
843 |
+
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
|
844 |
+
tensor will be generated by sampling using the supplied random `generator`.
|
845 |
+
prompt_embeds (`torch.FloatTensor`, *optional*):
|
846 |
+
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
|
847 |
+
provided, text embeddings will be generated from `prompt` input argument.
|
848 |
+
pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
|
849 |
+
Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
|
850 |
+
If not provided, pooled text embeddings will be generated from `prompt` input argument.
|
851 |
+
output_type (`str`, *optional*, defaults to `"pil"`):
|
852 |
+
The output format of the generated image. Choose between
|
853 |
+
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
|
854 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
855 |
+
Whether or not to return a [`~pipelines.flux.FluxPipelineOutput`] instead of a plain tuple.
|
856 |
+
joint_attention_kwargs (`dict`, *optional*):
|
857 |
+
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
858 |
+
`self.processor` in
|
859 |
+
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
860 |
+
callback_on_step_end (`Callable`, *optional*):
|
861 |
+
A function that is called at the end of each denoising step during inference. The function is called
|
862 |
+
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
|
863 |
+
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
|
864 |
+
`callback_on_step_end_tensor_inputs`.
|
865 |
+
callback_on_step_end_tensor_inputs (`List`, *optional*):
|
866 |
+
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
|
867 |
+
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
|
868 |
+
`._callback_tensor_inputs` attribute of your pipeline class.
|
869 |
+
max_sequence_length (`int` defaults to 512): Maximum sequence length to use with the `prompt`.
|
870 |
+
|
871 |
+
Examples:
|
872 |
+
|
873 |
+
Returns:
|
874 |
+
[`~pipelines.flux.FluxPipelineOutput`] or `tuple`: [`~pipelines.flux.FluxPipelineOutput`] if `return_dict`
|
875 |
+
is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated
|
876 |
+
images.
|
877 |
+
"""
|
878 |
+
|
879 |
+
height = height or self.default_sample_size * self.vae_scale_factor
|
880 |
+
width = width or self.default_sample_size * self.vae_scale_factor
|
881 |
+
|
882 |
+
# 1. Check inputs. Raise error if not correct
|
883 |
+
self.img2img_check_inputs(
|
884 |
+
prompt,
|
885 |
+
prompt_2,
|
886 |
+
strength,
|
887 |
+
height,
|
888 |
+
width,
|
889 |
+
prompt_embeds=prompt_embeds,
|
890 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
891 |
+
callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
|
892 |
+
max_sequence_length=max_sequence_length,
|
893 |
+
)
|
894 |
+
|
895 |
+
self._guidance_scale = guidance_scale
|
896 |
+
self._joint_attention_kwargs = joint_attention_kwargs
|
897 |
+
self._interrupt = False
|
898 |
+
|
899 |
+
# 2. Preprocess image
|
900 |
+
init_image = self.image_processor.preprocess(image, height=height, width=width)
|
901 |
+
init_image = init_image.to(dtype=torch.float32)
|
902 |
+
|
903 |
+
# 3. Define call parameters
|
904 |
+
if prompt is not None and isinstance(prompt, str):
|
905 |
+
batch_size = 1
|
906 |
+
elif prompt is not None and isinstance(prompt, list):
|
907 |
+
batch_size = len(prompt)
|
908 |
+
else:
|
909 |
+
batch_size = prompt_embeds.shape[0]
|
910 |
+
|
911 |
+
device = self._execution_device
|
912 |
+
|
913 |
+
lora_scale = (
|
914 |
+
self.joint_attention_kwargs.get("scale", None) if self.joint_attention_kwargs is not None else None
|
915 |
+
)
|
916 |
+
(
|
917 |
+
prompt_embeds,
|
918 |
+
pooled_prompt_embeds,
|
919 |
+
text_ids,
|
920 |
+
) = self.encode_prompt(
|
921 |
+
prompt=prompt,
|
922 |
+
prompt_2=prompt_2,
|
923 |
+
prompt_embeds=prompt_embeds,
|
924 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
925 |
+
device=device,
|
926 |
+
num_images_per_prompt=num_images_per_prompt,
|
927 |
+
max_sequence_length=max_sequence_length,
|
928 |
+
lora_scale=lora_scale,
|
929 |
+
)
|
930 |
+
|
931 |
+
register_regular_attention_processors(self.transformer)
|
932 |
+
|
933 |
+
# 4. Prepare timesteps
|
934 |
+
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps)
|
935 |
+
image_seq_len = (int(height) // self.vae_scale_factor) * (int(width) // self.vae_scale_factor)
|
936 |
+
mu = calculate_shift(
|
937 |
+
image_seq_len,
|
938 |
+
self.scheduler.config.base_image_seq_len,
|
939 |
+
self.scheduler.config.max_image_seq_len,
|
940 |
+
self.scheduler.config.base_shift,
|
941 |
+
self.scheduler.config.max_shift,
|
942 |
+
)
|
943 |
+
timesteps, num_inference_steps = retrieve_timesteps(
|
944 |
+
self.scheduler,
|
945 |
+
num_inference_steps,
|
946 |
+
device,
|
947 |
+
timesteps,
|
948 |
+
sigmas,
|
949 |
+
mu=mu,
|
950 |
+
)
|
951 |
+
timesteps, num_inference_steps = self.img2img_get_timesteps(num_inference_steps, strength, device)
|
952 |
+
|
953 |
+
if num_inference_steps < 1:
|
954 |
+
raise ValueError(
|
955 |
+
f"After adjusting the num_inference_steps by strength parameter: {strength}, the number of pipeline"
|
956 |
+
f"steps is {num_inference_steps} which is < 1 and not appropriate for this pipeline."
|
957 |
+
)
|
958 |
+
latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
|
959 |
+
|
960 |
+
# 5. Prepare latent variables
|
961 |
+
num_channels_latents = self.transformer.config.in_channels // 4
|
962 |
+
|
963 |
+
latents, latent_image_ids = self.img2img_prepare_latents(
|
964 |
+
init_image,
|
965 |
+
latent_timestep,
|
966 |
+
batch_size * num_images_per_prompt,
|
967 |
+
num_channels_latents,
|
968 |
+
height,
|
969 |
+
width,
|
970 |
+
prompt_embeds.dtype,
|
971 |
+
device,
|
972 |
+
generator,
|
973 |
+
latents,
|
974 |
+
)
|
975 |
+
|
976 |
+
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
|
977 |
+
self._num_timesteps = len(timesteps)
|
978 |
+
|
979 |
+
# handle guidance
|
980 |
+
if self.transformer.config.guidance_embeds:
|
981 |
+
guidance = torch.full([1], guidance_scale, device=device, dtype=torch.float32)
|
982 |
+
guidance = guidance.expand(latents.shape[0])
|
983 |
+
else:
|
984 |
+
guidance = None
|
985 |
+
|
986 |
+
text_ids = text_ids.expand(latents.shape[0], -1, -1)
|
987 |
+
latent_image_ids = latent_image_ids.expand(latents.shape[0], -1, -1)
|
988 |
+
|
989 |
+
# 6. Denoising loop
|
990 |
+
with self.progress_bar(total=num_inference_steps) as progress_bar:
|
991 |
+
for i, t in enumerate(timesteps):
|
992 |
+
if self.interrupt:
|
993 |
+
continue
|
994 |
+
|
995 |
+
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
|
996 |
+
timestep = t.expand(latents.shape[0]).to(latents.dtype)
|
997 |
+
noise_pred = self.transformer(
|
998 |
+
hidden_states=latents,
|
999 |
+
timestep=timestep / 1000,
|
1000 |
+
guidance=guidance,
|
1001 |
+
pooled_projections=pooled_prompt_embeds,
|
1002 |
+
encoder_hidden_states=prompt_embeds,
|
1003 |
+
txt_ids=text_ids,
|
1004 |
+
img_ids=latent_image_ids,
|
1005 |
+
joint_attention_kwargs=self.joint_attention_kwargs,
|
1006 |
+
return_dict=False,
|
1007 |
+
)[0]
|
1008 |
+
|
1009 |
+
# compute the previous noisy sample x_t -> x_t-1
|
1010 |
+
latents_dtype = latents.dtype
|
1011 |
+
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
|
1012 |
+
|
1013 |
+
if latents.dtype != latents_dtype:
|
1014 |
+
if torch.backends.mps.is_available():
|
1015 |
+
# some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
|
1016 |
+
latents = latents.to(latents_dtype)
|
1017 |
+
|
1018 |
+
if callback_on_step_end is not None:
|
1019 |
+
callback_kwargs = {}
|
1020 |
+
for k in callback_on_step_end_tensor_inputs:
|
1021 |
+
callback_kwargs[k] = locals()[k]
|
1022 |
+
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
|
1023 |
+
|
1024 |
+
latents = callback_outputs.pop("latents", latents)
|
1025 |
+
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
|
1026 |
+
|
1027 |
+
# call the callback, if provided
|
1028 |
+
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
|
1029 |
+
progress_bar.update()
|
1030 |
+
|
1031 |
+
# if XLA_AVAILABLE:
|
1032 |
+
# xm.mark_step()
|
1033 |
+
|
1034 |
+
if output_type == "latent":
|
1035 |
+
image = latents
|
1036 |
+
|
1037 |
+
else:
|
1038 |
+
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
|
1039 |
+
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
|
1040 |
+
image = self.vae.decode(latents, return_dict=False)[0]
|
1041 |
+
image = self.image_processor.postprocess(image, output_type=output_type)
|
1042 |
+
|
1043 |
+
# Offload all models
|
1044 |
+
self.maybe_free_model_hooks()
|
1045 |
+
|
1046 |
+
if not return_dict:
|
1047 |
+
return (image,)
|
1048 |
+
|
1049 |
+
return FluxPipelineOutput(images=image)
|
1050 |
+
|
1051 |
+
############# Invert Methods #############
|
1052 |
+
def invert_prepare_latents(
|
1053 |
+
self,
|
1054 |
+
image,
|
1055 |
+
timestep,
|
1056 |
+
batch_size,
|
1057 |
+
num_channels_latents,
|
1058 |
+
height,
|
1059 |
+
width,
|
1060 |
+
dtype,
|
1061 |
+
device,
|
1062 |
+
generator,
|
1063 |
+
latents=None,
|
1064 |
+
add_noise=False,
|
1065 |
+
):
|
1066 |
+
if isinstance(generator, list) and len(generator) != batch_size:
|
1067 |
+
raise ValueError(
|
1068 |
+
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
|
1069 |
+
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
|
1070 |
+
)
|
1071 |
+
|
1072 |
+
height = 2 * (int(height) // self.vae_scale_factor)
|
1073 |
+
width = 2 * (int(width) // self.vae_scale_factor)
|
1074 |
+
|
1075 |
+
shape = (batch_size, num_channels_latents, height, width)
|
1076 |
+
latent_image_ids = self._prepare_latent_image_ids(batch_size, height, width, device, dtype)
|
1077 |
+
|
1078 |
+
if latents is not None:
|
1079 |
+
return latents.to(device=device, dtype=dtype), latent_image_ids
|
1080 |
+
|
1081 |
+
image = image.to(device=device, dtype=dtype)
|
1082 |
+
image_latents = self.img2img_encode_vae_image(image=image, generator=generator)
|
1083 |
+
|
1084 |
+
if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:
|
1085 |
+
# expand init_latents for batch_size
|
1086 |
+
additional_image_per_prompt = batch_size // image_latents.shape[0]
|
1087 |
+
image_latents = torch.cat([image_latents] * additional_image_per_prompt, dim=0)
|
1088 |
+
elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:
|
1089 |
+
raise ValueError(
|
1090 |
+
f"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts."
|
1091 |
+
)
|
1092 |
+
else:
|
1093 |
+
image_latents = torch.cat([image_latents], dim=0)
|
1094 |
+
|
1095 |
+
if add_noise:
|
1096 |
+
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
|
1097 |
+
latents = self.scheduler.scale_noise(image_latents, timestep, noise)
|
1098 |
+
else:
|
1099 |
+
latents = image_latents
|
1100 |
+
|
1101 |
+
latents = self._pack_latents(latents, batch_size, num_channels_latents, height, width)
|
1102 |
+
|
1103 |
+
return latents, latent_image_ids
|
1104 |
+
|
1105 |
+
@torch.no_grad()
|
1106 |
+
def call_invert(
|
1107 |
+
self,
|
1108 |
+
prompt: Union[str, List[str]] = None,
|
1109 |
+
prompt_2: Optional[Union[str, List[str]]] = None,
|
1110 |
+
image: PipelineImageInput = None,
|
1111 |
+
height: Optional[int] = None,
|
1112 |
+
width: Optional[int] = None,
|
1113 |
+
num_inference_steps: int = 28,
|
1114 |
+
timesteps: List[int] = None,
|
1115 |
+
guidance_scale: float = 7.0,
|
1116 |
+
num_images_per_prompt: Optional[int] = 1,
|
1117 |
+
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
|
1118 |
+
latents: Optional[torch.FloatTensor] = None,
|
1119 |
+
prompt_embeds: Optional[torch.FloatTensor] = None,
|
1120 |
+
pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
|
1121 |
+
output_type: Optional[str] = "pil",
|
1122 |
+
return_dict: bool = True,
|
1123 |
+
joint_attention_kwargs: Optional[Dict[str, Any]] = None,
|
1124 |
+
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
|
1125 |
+
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
|
1126 |
+
max_sequence_length: int = 512,
|
1127 |
+
|
1128 |
+
fixed_point_iterations: int = 1,
|
1129 |
+
):
|
1130 |
+
r"""
|
1131 |
+
Function invoked when calling the pipeline for generation.
|
1132 |
+
|
1133 |
+
Args:
|
1134 |
+
prompt (`str` or `List[str]`, *optional*):
|
1135 |
+
The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
|
1136 |
+
instead.
|
1137 |
+
prompt_2 (`str` or `List[str]`, *optional*):
|
1138 |
+
The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
|
1139 |
+
will be used instead
|
1140 |
+
height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
|
1141 |
+
The height in pixels of the generated image. This is set to 1024 by default for the best results.
|
1142 |
+
width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
|
1143 |
+
The width in pixels of the generated image. This is set to 1024 by default for the best results.
|
1144 |
+
num_inference_steps (`int`, *optional*, defaults to 50):
|
1145 |
+
The number of denoising steps. More denoising steps usually lead to a higher quality image at the
|
1146 |
+
expense of slower inference.
|
1147 |
+
timesteps (`List[int]`, *optional*):
|
1148 |
+
Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
|
1149 |
+
in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
|
1150 |
+
passed will be used. Must be in descending order.
|
1151 |
+
guidance_scale (`float`, *optional*, defaults to 7.0):
|
1152 |
+
Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
|
1153 |
+
`guidance_scale` is defined as `w` of equation 2. of [Imagen
|
1154 |
+
Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
|
1155 |
+
1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
|
1156 |
+
usually at the expense of lower image quality.
|
1157 |
+
num_images_per_prompt (`int`, *optional*, defaults to 1):
|
1158 |
+
The number of images to generate per prompt.
|
1159 |
+
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
|
1160 |
+
One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
|
1161 |
+
to make generation deterministic.
|
1162 |
+
latents (`torch.FloatTensor`, *optional*):
|
1163 |
+
Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
|
1164 |
+
generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
|
1165 |
+
tensor will ge generated by sampling using the supplied random `generator`.
|
1166 |
+
prompt_embeds (`torch.FloatTensor`, *optional*):
|
1167 |
+
Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
|
1168 |
+
provided, text embeddings will be generated from `prompt` input argument.
|
1169 |
+
pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
|
1170 |
+
Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
|
1171 |
+
If not provided, pooled text embeddings will be generated from `prompt` input argument.
|
1172 |
+
output_type (`str`, *optional*, defaults to `"pil"`):
|
1173 |
+
The output format of the generate image. Choose between
|
1174 |
+
[PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
|
1175 |
+
return_dict (`bool`, *optional*, defaults to `True`):
|
1176 |
+
Whether or not to return a [`~pipelines.flux.FluxPipelineOutput`] instead of a plain tuple.
|
1177 |
+
joint_attention_kwargs (`dict`, *optional*):
|
1178 |
+
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
1179 |
+
`self.processor` in
|
1180 |
+
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
1181 |
+
callback_on_step_end (`Callable`, *optional*):
|
1182 |
+
A function that calls at the end of each denoising steps during the inference. The function is called
|
1183 |
+
with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
|
1184 |
+
callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
|
1185 |
+
`callback_on_step_end_tensor_inputs`.
|
1186 |
+
callback_on_step_end_tensor_inputs (`List`, *optional*):
|
1187 |
+
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
|
1188 |
+
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
|
1189 |
+
`._callback_tensor_inputs` attribute of your pipeline class.
|
1190 |
+
max_sequence_length (`int` defaults to 512): Maximum sequence length to use with the `prompt`.
|
1191 |
+
|
1192 |
+
Examples:
|
1193 |
+
|
1194 |
+
Returns:
|
1195 |
+
[`~pipelines.flux.FluxPipelineOutput`] or `tuple`: [`~pipelines.flux.FluxPipelineOutput`] if `return_dict`
|
1196 |
+
is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated
|
1197 |
+
images.
|
1198 |
+
"""
|
1199 |
+
height = height or self.default_sample_size * self.vae_scale_factor
|
1200 |
+
width = width or self.default_sample_size * self.vae_scale_factor
|
1201 |
+
|
1202 |
+
# 1. Check inputs. Raise error if not correct
|
1203 |
+
self.check_inputs(
|
1204 |
+
prompt,
|
1205 |
+
prompt_2,
|
1206 |
+
height,
|
1207 |
+
width,
|
1208 |
+
prompt_embeds=prompt_embeds,
|
1209 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
1210 |
+
callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
|
1211 |
+
max_sequence_length=max_sequence_length,
|
1212 |
+
)
|
1213 |
+
|
1214 |
+
self._guidance_scale = guidance_scale
|
1215 |
+
self._joint_attention_kwargs = joint_attention_kwargs
|
1216 |
+
self._interrupt = False
|
1217 |
+
|
1218 |
+
# 1.5. Preprocess image
|
1219 |
+
if isinstance(image, Image.Image):
|
1220 |
+
init_image = self.image_processor.preprocess(image, height=height, width=width)
|
1221 |
+
elif isinstance(image, torch.Tensor):
|
1222 |
+
init_image = image
|
1223 |
+
latents = image
|
1224 |
+
else:
|
1225 |
+
raise ValueError("Image must be of type `PIL.Image.Image` or `torch.Tensor`")
|
1226 |
+
|
1227 |
+
init_image = init_image.to(dtype=torch.float32)
|
1228 |
+
|
1229 |
+
# 2. Define call parameters
|
1230 |
+
if prompt is not None and isinstance(prompt, str):
|
1231 |
+
batch_size = 1
|
1232 |
+
elif prompt is not None and isinstance(prompt, list):
|
1233 |
+
batch_size = len(prompt)
|
1234 |
+
else:
|
1235 |
+
batch_size = prompt_embeds.shape[0]
|
1236 |
+
|
1237 |
+
device = self._execution_device
|
1238 |
+
|
1239 |
+
lora_scale = (
|
1240 |
+
self.joint_attention_kwargs.get("scale", None) if self.joint_attention_kwargs is not None else None
|
1241 |
+
)
|
1242 |
+
(
|
1243 |
+
prompt_embeds,
|
1244 |
+
pooled_prompt_embeds,
|
1245 |
+
text_ids,
|
1246 |
+
) = self.encode_prompt(
|
1247 |
+
prompt=prompt,
|
1248 |
+
prompt_2=prompt_2,
|
1249 |
+
prompt_embeds=prompt_embeds,
|
1250 |
+
pooled_prompt_embeds=pooled_prompt_embeds,
|
1251 |
+
device=device,
|
1252 |
+
num_images_per_prompt=num_images_per_prompt,
|
1253 |
+
max_sequence_length=max_sequence_length,
|
1254 |
+
lora_scale=lora_scale,
|
1255 |
+
)
|
1256 |
+
|
1257 |
+
# 4. Prepare latent variables
|
1258 |
+
num_channels_latents = self.transformer.config.in_channels // 4
|
1259 |
+
# latents, latent_image_ids = self.prepare_latents(
|
1260 |
+
# batch_size * num_images_per_prompt,
|
1261 |
+
# num_channels_latents,
|
1262 |
+
# height,
|
1263 |
+
# width,
|
1264 |
+
# prompt_embeds.dtype,
|
1265 |
+
# device,
|
1266 |
+
# generator,
|
1267 |
+
# latents,
|
1268 |
+
# )
|
1269 |
+
latents, latent_image_ids = self.invert_prepare_latents(
|
1270 |
+
init_image,
|
1271 |
+
None,
|
1272 |
+
batch_size * num_images_per_prompt,
|
1273 |
+
num_channels_latents,
|
1274 |
+
height,
|
1275 |
+
width,
|
1276 |
+
prompt_embeds.dtype,
|
1277 |
+
device,
|
1278 |
+
generator,
|
1279 |
+
latents,
|
1280 |
+
False
|
1281 |
+
)
|
1282 |
+
|
1283 |
+
register_regular_attention_processors(self.transformer)
|
1284 |
+
|
1285 |
+
# 5. Prepare timesteps
|
1286 |
+
sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps)
|
1287 |
+
image_seq_len = latents.shape[1]
|
1288 |
+
mu = calculate_shift(
|
1289 |
+
image_seq_len,
|
1290 |
+
self.scheduler.config.base_image_seq_len,
|
1291 |
+
self.scheduler.config.max_image_seq_len,
|
1292 |
+
self.scheduler.config.base_shift,
|
1293 |
+
self.scheduler.config.max_shift,
|
1294 |
+
)
|
1295 |
+
|
1296 |
+
# For Inversion, reverse the sigmas
|
1297 |
+
# sigmas = sigmas[::-1]
|
1298 |
+
|
1299 |
+
timesteps, num_inference_steps = retrieve_timesteps(
|
1300 |
+
self.scheduler,
|
1301 |
+
num_inference_steps,
|
1302 |
+
device,
|
1303 |
+
timesteps,
|
1304 |
+
sigmas,
|
1305 |
+
mu=mu,
|
1306 |
+
)
|
1307 |
+
num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
|
1308 |
+
self._num_timesteps = len(timesteps)
|
1309 |
+
|
1310 |
+
# handle guidance
|
1311 |
+
if self.transformer.config.guidance_embeds:
|
1312 |
+
guidance = torch.tensor([guidance_scale], device=device)
|
1313 |
+
guidance = guidance.expand(latents.shape[0])
|
1314 |
+
else:
|
1315 |
+
guidance = None
|
1316 |
+
|
1317 |
+
self.scheduler.sigmas = reversed(self.scheduler.sigmas)
|
1318 |
+
|
1319 |
+
timesteps_zero_start = reversed(torch.cat([self.scheduler.timesteps[1:], torch.tensor([0], device=device)]))
|
1320 |
+
timesteps_one_start = reversed(self.scheduler.timesteps)
|
1321 |
+
|
1322 |
+
self.scheduler.timesteps = timesteps_zero_start
|
1323 |
+
# self.scheduler.timesteps = timesteps_one_start
|
1324 |
+
|
1325 |
+
timesteps = self.scheduler.timesteps
|
1326 |
+
|
1327 |
+
latents_list = []
|
1328 |
+
latents_list.append(latents)
|
1329 |
+
|
1330 |
+
# 6. Denoising loop
|
1331 |
+
with self.progress_bar(total=num_inference_steps * fixed_point_iterations) as progress_bar:
|
1332 |
+
for i, t in enumerate(timesteps):
|
1333 |
+
original_latents = latents.clone()
|
1334 |
+
for j in range(fixed_point_iterations):
|
1335 |
+
if self.interrupt:
|
1336 |
+
continue
|
1337 |
+
|
1338 |
+
if j == 0:
|
1339 |
+
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
|
1340 |
+
timestep = timesteps[i].expand(latents.shape[0]).to(latents.dtype)
|
1341 |
+
else:
|
1342 |
+
timestep = timesteps_one_start[i].expand(latents.shape[0]).to(latents.dtype)
|
1343 |
+
|
1344 |
+
noise_pred = self.transformer(
|
1345 |
+
hidden_states=latents,
|
1346 |
+
# YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
|
1347 |
+
timestep=timestep / 1000,
|
1348 |
+
guidance=guidance,
|
1349 |
+
pooled_projections=pooled_prompt_embeds,
|
1350 |
+
encoder_hidden_states=prompt_embeds,
|
1351 |
+
txt_ids=text_ids,
|
1352 |
+
img_ids=latent_image_ids,
|
1353 |
+
joint_attention_kwargs=self.joint_attention_kwargs,
|
1354 |
+
return_dict=False,
|
1355 |
+
)[0]
|
1356 |
+
|
1357 |
+
# compute the previous noisy sample x_t -> x_t-1
|
1358 |
+
latents_dtype = latents.dtype
|
1359 |
+
|
1360 |
+
# noise_pred = -noise_pred
|
1361 |
+
latents = self.scheduler.step(noise_pred, t, original_latents, return_dict=False, step_index=i)[0]
|
1362 |
+
|
1363 |
+
if latents.dtype != latents_dtype:
|
1364 |
+
if torch.backends.mps.is_available():
|
1365 |
+
# some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
|
1366 |
+
latents = latents.to(latents_dtype)
|
1367 |
+
|
1368 |
+
if callback_on_step_end is not None:
|
1369 |
+
callback_kwargs = {}
|
1370 |
+
for k in callback_on_step_end_tensor_inputs:
|
1371 |
+
callback_kwargs[k] = locals()[k]
|
1372 |
+
callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
|
1373 |
+
|
1374 |
+
latents = callback_outputs.pop("latents", latents)
|
1375 |
+
prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
|
1376 |
+
|
1377 |
+
# call the callback, if provided
|
1378 |
+
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
|
1379 |
+
progress_bar.update()
|
1380 |
+
|
1381 |
+
# if XLA_AVAILABLE:
|
1382 |
+
# xm.mark_step()
|
1383 |
+
|
1384 |
+
latents_list.append(latents)
|
1385 |
+
|
1386 |
+
# Offload all models
|
1387 |
+
self.maybe_free_model_hooks()
|
1388 |
+
|
1389 |
+
return latents_list
|
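Taken together, `call_img2img` (with `output_type="latent"`) and `call_invert` are how the pipeline pulls a real image into the FLUX latent space before editing. Below is a minimal usage sketch: the model setup mirrors `app.py` from this commit, the call arguments mirror `add_object_real` in `addit_methods.py`, and the image path and prompt are illustrative placeholders rather than values taken from the repository.

import torch
from PIL import Image

from addit_flux_pipeline import AdditFluxPipeline
from addit_flux_transformer import AdditFluxTransformer2DModel
from addit_scheduler import AdditFlowMatchEulerDiscreteScheduler

# Build the pipeline the same way app.py does (FLUX.1-dev weights are assumed to be available).
transformer = AdditFluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = AdditFluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
pipe.scheduler = AdditFlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)

# Illustrative input image and prompt.
source = Image.open("images/bed_dark_room.jpg").resize((1024, 1024))
prompt = "A photo of a bed in a dark room"

# Lightly noise-denoise the real image to obtain latents aligned with the model.
source_latents = pipe.call_img2img(
    prompt=prompt,
    image=source,
    num_inference_steps=30,
    strength=0.1,
    guidance_scale=3.5,
    output_type="latent",
    generator=torch.Generator(device=pipe.device).manual_seed(0),
).images

# Optionally refine those latents with the fixed-point inversion above;
# call_invert returns one latent tensor per timestep.
latents_list = pipe.call_invert(
    prompt=prompt,
    image=source_latents,
    num_inference_steps=30,
    guidance_scale=1,
    fixed_point_iterations=2,
    generator=torch.Generator(device=pipe.device).manual_seed(0),
)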
addit_flux_transformer.py
ADDED
@@ -0,0 +1,521 @@
# Copyright 2024 Black Forest Labs, The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Any, Dict, List, Optional, Union

import torch
import torch.nn as nn
import torch.nn.functional as F

from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.loaders import FromOriginalModelMixin, PeftAdapterMixin
from diffusers.models.attention import FeedForward
from diffusers.models.attention_processor import Attention, FluxAttnProcessor2_0, FluxSingleAttnProcessor2_0
from diffusers.models.modeling_utils import ModelMixin
from diffusers.models.normalization import AdaLayerNormContinuous, AdaLayerNormZero, AdaLayerNormZeroSingle
from diffusers.utils import USE_PEFT_BACKEND, is_torch_version, logging, scale_lora_layers, unscale_lora_layers
from diffusers.utils.torch_utils import maybe_allow_in_graph
from diffusers.models.embeddings import CombinedTimestepGuidanceTextProjEmbeddings, CombinedTimestepTextProjEmbeddings
from diffusers.models.modeling_outputs import Transformer2DModelOutput

from addit_attention_processors import AdditFluxAttnProcessor2_0, AdditFluxSingleAttnProcessor2_0

logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


# YiYi to-do: refactor rope related functions/classes
def rope(pos: torch.Tensor, dim: int, theta: int) -> torch.Tensor:
    assert dim % 2 == 0, "The dimension must be even."

    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
    omega = 1.0 / (theta**scale)

    batch_size, seq_length = pos.shape
    out = torch.einsum("...n,d->...nd", pos, omega)
    cos_out = torch.cos(out)
    sin_out = torch.sin(out)

    stacked_out = torch.stack([cos_out, -sin_out, sin_out, cos_out], dim=-1)
    out = stacked_out.view(batch_size, -1, dim // 2, 2, 2)
    return out.float()


# YiYi to-do: refactor rope related functions/classes
class EmbedND(nn.Module):
    def __init__(self, dim: int, theta: int, axes_dim: List[int]):
        super().__init__()
        self.dim = dim
        self.theta = theta
        self.axes_dim = axes_dim

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        n_axes = ids.shape[-1]
        emb = torch.cat(
            [rope(ids[..., i], self.axes_dim[i], self.theta) for i in range(n_axes)],
            dim=-3,
        )
        return emb.unsqueeze(1)


@maybe_allow_in_graph
class AdditFluxSingleTransformerBlock(nn.Module):
    r"""
    A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.

    Reference: https://arxiv.org/abs/2403.03206

    Parameters:
        dim (`int`): The number of channels in the input and output.
        num_attention_heads (`int`): The number of heads to use for multi-head attention.
        attention_head_dim (`int`): The number of channels in each head.
        context_pre_only (`bool`): Boolean to determine if we should add some blocks associated with the
            processing of `context` conditions.
    """

    def __init__(self, dim, num_attention_heads, attention_head_dim, mlp_ratio=4.0):
        super().__init__()
        self.mlp_hidden_dim = int(dim * mlp_ratio)

        self.norm = AdaLayerNormZeroSingle(dim)
        self.proj_mlp = nn.Linear(dim, self.mlp_hidden_dim)
        self.act_mlp = nn.GELU(approximate="tanh")
        self.proj_out = nn.Linear(dim + self.mlp_hidden_dim, dim)

        processor = FluxSingleAttnProcessor2_0()
        self.attn = Attention(
            query_dim=dim,
            cross_attention_dim=None,
            dim_head=attention_head_dim,
            heads=num_attention_heads,
            out_dim=dim,
            bias=True,
            processor=processor,
            qk_norm="rms_norm",
            eps=1e-6,
            pre_only=True,
        )

    def forward(
        self,
        hidden_states: torch.FloatTensor,
        temb: torch.FloatTensor,
        image_rotary_emb=None,
        proccesor_kwargs=None,
    ):
        residual = hidden_states
        norm_hidden_states, gate = self.norm(hidden_states, emb=temb)
        mlp_hidden_states = self.act_mlp(self.proj_mlp(norm_hidden_states))

        attn_output = self.attn(
            hidden_states=norm_hidden_states,
            image_rotary_emb=image_rotary_emb,
            **(proccesor_kwargs or {}),
        )

        hidden_states = torch.cat([attn_output, mlp_hidden_states], dim=2)
        gate = gate.unsqueeze(1)
        hidden_states = gate * self.proj_out(hidden_states)
        hidden_states = residual + hidden_states
        if hidden_states.dtype == torch.float16:
            hidden_states = hidden_states.clip(-65504, 65504)

        return hidden_states


@maybe_allow_in_graph
class AdditFluxTransformerBlock(nn.Module):
    r"""
    A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.

    Reference: https://arxiv.org/abs/2403.03206

    Parameters:
        dim (`int`): The number of channels in the input and output.
        num_attention_heads (`int`): The number of heads to use for multi-head attention.
        attention_head_dim (`int`): The number of channels in each head.
        context_pre_only (`bool`): Boolean to determine if we should add some blocks associated with the
            processing of `context` conditions.
    """

    def __init__(self, dim, num_attention_heads, attention_head_dim, qk_norm="rms_norm", eps=1e-6):
        super().__init__()

        self.norm1 = AdaLayerNormZero(dim)

        self.norm1_context = AdaLayerNormZero(dim)

        if hasattr(F, "scaled_dot_product_attention"):
            processor = FluxAttnProcessor2_0()
        else:
            raise ValueError(
                "The current PyTorch version does not support the `scaled_dot_product_attention` function."
            )
        self.attn = Attention(
            query_dim=dim,
            cross_attention_dim=None,
            added_kv_proj_dim=dim,
            dim_head=attention_head_dim,
            heads=num_attention_heads,
            out_dim=dim,
            context_pre_only=False,
            bias=True,
            processor=processor,
            qk_norm=qk_norm,
            eps=eps,
        )

        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.ff = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")

        self.norm2_context = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.ff_context = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")

        # let chunk size default to None
        self._chunk_size = None
        self._chunk_dim = 0

    def forward(
        self,
        hidden_states: torch.FloatTensor,
        encoder_hidden_states: torch.FloatTensor,
        temb: torch.FloatTensor,
        image_rotary_emb=None,
        proccesor_kwargs=None,
    ):
        norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(hidden_states, emb=temb)

        norm_encoder_hidden_states, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.norm1_context(
            encoder_hidden_states, emb=temb
        )

        # Attention.
        attn_output, context_attn_output = self.attn(
            hidden_states=norm_hidden_states,
            encoder_hidden_states=norm_encoder_hidden_states,
            image_rotary_emb=image_rotary_emb,
            **(proccesor_kwargs or {}),
        )

        # Process attention outputs for the `hidden_states`.
        attn_output = gate_msa.unsqueeze(1) * attn_output
        hidden_states = hidden_states + attn_output

        norm_hidden_states = self.norm2(hidden_states)
        norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]

        ff_output = self.ff(norm_hidden_states)
        ff_output = gate_mlp.unsqueeze(1) * ff_output

        hidden_states = hidden_states + ff_output

        # Process attention outputs for the `encoder_hidden_states`.

        context_attn_output = c_gate_msa.unsqueeze(1) * context_attn_output
        encoder_hidden_states = encoder_hidden_states + context_attn_output

        norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states)
        norm_encoder_hidden_states = norm_encoder_hidden_states * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]

        context_ff_output = self.ff_context(norm_encoder_hidden_states)
        encoder_hidden_states = encoder_hidden_states + c_gate_mlp.unsqueeze(1) * context_ff_output
        if encoder_hidden_states.dtype == torch.float16:
            encoder_hidden_states = encoder_hidden_states.clip(-65504, 65504)

        return encoder_hidden_states, hidden_states


class AdditFluxTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
    """
    The Transformer model introduced in Flux.

    Reference: https://blackforestlabs.ai/announcing-black-forest-labs/

    Parameters:
        patch_size (`int`): Patch size to turn the input data into small patches.
        in_channels (`int`, *optional*, defaults to 16): The number of channels in the input.
        num_layers (`int`, *optional*, defaults to 18): The number of layers of MMDiT blocks to use.
        num_single_layers (`int`, *optional*, defaults to 18): The number of layers of single DiT blocks to use.
        attention_head_dim (`int`, *optional*, defaults to 64): The number of channels in each head.
        num_attention_heads (`int`, *optional*, defaults to 18): The number of heads to use for multi-head attention.
        joint_attention_dim (`int`, *optional*): The number of `encoder_hidden_states` dimensions to use.
        pooled_projection_dim (`int`): Number of dimensions to use when projecting the `pooled_projections`.
        guidance_embeds (`bool`, defaults to False): Whether to use guidance embeddings.
    """

    _supports_gradient_checkpointing = True

    @register_to_config
    def __init__(
        self,
        patch_size: int = 1,
        in_channels: int = 64,
        num_layers: int = 19,
        num_single_layers: int = 38,
        attention_head_dim: int = 128,
        num_attention_heads: int = 24,
        joint_attention_dim: int = 4096,
        pooled_projection_dim: int = 768,
        guidance_embeds: bool = False,
        axes_dims_rope: List[int] = [16, 56, 56],
    ):
        super().__init__()
        self.out_channels = in_channels
        self.inner_dim = self.config.num_attention_heads * self.config.attention_head_dim

        self.pos_embed = EmbedND(dim=self.inner_dim, theta=10000, axes_dim=axes_dims_rope)
        text_time_guidance_cls = (
            CombinedTimestepGuidanceTextProjEmbeddings if guidance_embeds else CombinedTimestepTextProjEmbeddings
        )
        self.time_text_embed = text_time_guidance_cls(
            embedding_dim=self.inner_dim, pooled_projection_dim=self.config.pooled_projection_dim
        )

        self.context_embedder = nn.Linear(self.config.joint_attention_dim, self.inner_dim)
        self.x_embedder = torch.nn.Linear(self.config.in_channels, self.inner_dim)

        self.transformer_blocks = nn.ModuleList(
            [
                AdditFluxTransformerBlock(
                    dim=self.inner_dim,
                    num_attention_heads=self.config.num_attention_heads,
                    attention_head_dim=self.config.attention_head_dim,
                )
                for i in range(self.config.num_layers)
            ]
        )

        self.single_transformer_blocks = nn.ModuleList(
            [
                AdditFluxSingleTransformerBlock(
                    dim=self.inner_dim,
                    num_attention_heads=self.config.num_attention_heads,
                    attention_head_dim=self.config.attention_head_dim,
                )
                for i in range(self.config.num_single_layers)
            ]
        )

        self.norm_out = AdaLayerNormContinuous(self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6)
        self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True)

        self.gradient_checkpointing = False

    def _set_gradient_checkpointing(self, module, value=False):
        if hasattr(module, "gradient_checkpointing"):
            module.gradient_checkpointing = value

    @property
    def attn_processors(self):
        r"""
        Returns:
            `dict` of attention processors: A dictionary containing all attention processors used in the model with
            indexed by its weight name.
        """
        # set recursively
        processors = {}

        def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors):
            if hasattr(module, "get_processor"):
                processors[f"{name}.processor"] = module.get_processor()

            for sub_name, child in module.named_children():
                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)

            return processors

        for name, module in self.named_children():
            fn_recursive_add_processors(name, module, processors)

        return processors

    def set_attn_processor(
        self, processor
    ):
        r"""
        Sets the attention processor to use to compute attention.

        Parameters:
            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
                The instantiated processor class or a dictionary of processor classes that will be set as the processor
                for **all** `Attention` layers.

                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
                processor. This is strongly recommended when setting trainable attention processors.
        """
        count = len(self.attn_processors.keys())

        if isinstance(processor, dict) and len(processor) != count:
            raise ValueError(
                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
            )

        def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
            if hasattr(module, "set_processor"):
                if not isinstance(processor, dict):
                    module.set_processor(processor)
                else:
                    module.set_processor(processor.pop(f"{name}.processor"))

            for sub_name, child in module.named_children():
                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)

        for name, module in self.named_children():
            fn_recursive_attn_processor(name, module, processor)

    def forward(
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor = None,
        pooled_projections: torch.Tensor = None,
        timestep: torch.LongTensor = None,
        img_ids: torch.Tensor = None,
        txt_ids: torch.Tensor = None,
        guidance: torch.Tensor = None,
        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
        return_dict: bool = True,
        proccesor_kwargs: Optional[Dict[str, Any]] = None,
    ) -> Union[torch.FloatTensor, Transformer2DModelOutput]:
        """
        The [`FluxTransformer2DModel`] forward method.

        Args:
            hidden_states (`torch.FloatTensor` of shape `(batch size, channel, height, width)`):
                Input `hidden_states`.
            encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
            pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`): Embeddings projected
                from the embeddings of input conditions.
            timestep ( `torch.LongTensor`):
                Used to indicate denoising step.
            block_controlnet_hidden_states: (`list` of `torch.Tensor`):
                A list of tensors that if specified are added to the residuals of transformer blocks.
            joint_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
                tuple.

        Returns:
            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
            `tuple` where the first element is the sample tensor.
        """
        if joint_attention_kwargs is not None:
            joint_attention_kwargs = joint_attention_kwargs.copy()
            lora_scale = joint_attention_kwargs.pop("scale", 1.0)
        else:
            lora_scale = 1.0

        if USE_PEFT_BACKEND:
            # weight the lora layers by setting `lora_scale` for each PEFT layer
            scale_lora_layers(self, lora_scale)
        else:
            if joint_attention_kwargs is not None and joint_attention_kwargs.get("scale", None) is not None:
                logger.warning(
                    "Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective."
                )
        hidden_states = self.x_embedder(hidden_states)

        timestep = timestep.to(hidden_states.dtype) * 1000
        if guidance is not None:
            guidance = guidance.to(hidden_states.dtype) * 1000
        else:
            guidance = None
        temb = (
            self.time_text_embed(timestep, pooled_projections)
            if guidance is None
            else self.time_text_embed(timestep, guidance, pooled_projections)
        )
        encoder_hidden_states = self.context_embedder(encoder_hidden_states)

        ids = torch.cat((txt_ids, img_ids), dim=1)
        image_rotary_emb = self.pos_embed(ids)

        for index_block, block in enumerate(self.transformer_blocks):
            if self.training and self.gradient_checkpointing:

                def create_custom_forward(module, return_dict=None):
                    def custom_forward(*inputs):
                        if return_dict is not None:
                            return module(*inputs, return_dict=return_dict)
                        else:
                            return module(*inputs)

                    return custom_forward

                ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
                encoder_hidden_states, hidden_states = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(block),
                    hidden_states,
                    encoder_hidden_states,
                    temb,
                    image_rotary_emb,
                    **ckpt_kwargs,
                )

            else:
                encoder_hidden_states, hidden_states = block(
                    hidden_states=hidden_states,
                    encoder_hidden_states=encoder_hidden_states,
                    temb=temb,
                    image_rotary_emb=image_rotary_emb,
                    proccesor_kwargs=proccesor_kwargs,
                )

        hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)

        for index_block, block in enumerate(self.single_transformer_blocks):
            if self.training and self.gradient_checkpointing:

                def create_custom_forward(module, return_dict=None):
                    def custom_forward(*inputs):
                        if return_dict is not None:
                            return module(*inputs, return_dict=return_dict)
                        else:
                            return module(*inputs)

                    return custom_forward

                ckpt_kwargs: Dict[str, Any] = {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
                hidden_states = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(block),
                    hidden_states,
                    temb,
                    image_rotary_emb,
                    **ckpt_kwargs,
                )

            else:
                hidden_states = block(
                    hidden_states=hidden_states,
                    temb=temb,
                    image_rotary_emb=image_rotary_emb,
                    proccesor_kwargs=proccesor_kwargs,
                )

        hidden_states = hidden_states[:, encoder_hidden_states.shape[1] :, ...]

        hidden_states = self.norm_out(hidden_states, temb)
        output = self.proj_out(hidden_states)

        if USE_PEFT_BACKEND:
            # remove `lora_scale` from each PEFT layer
            unscale_lora_layers(self, lora_scale)

        if not return_dict:
            return (output,)

        return Transformer2DModelOutput(sample=output)
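The `attn_processors` property and `set_attn_processor` method are the hooks used to swap ADDIT's extended-attention processors in and out of this model. A minimal sketch, assuming the FLUX.1-dev weights are available locally, that resets every attention layer to the stock diffusers Flux processors imported at the top of this file:

import torch
from diffusers.models.attention_processor import FluxAttnProcessor2_0, FluxSingleAttnProcessor2_0

from addit_flux_transformer import AdditFluxTransformer2DModel

transformer = AdditFluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# attn_processors is keyed by module path (e.g. "transformer_blocks.0.attn.processor"),
# so single-stream blocks can be told apart from dual-stream blocks by name.
processors = {
    name: FluxSingleAttnProcessor2_0() if "single_transformer_blocks" in name else FluxAttnProcessor2_0()
    for name in transformer.attn_processors.keys()
}
transformer.set_attn_processor(processors)

Passing a single processor instance instead of a dict applies it to every `Attention` layer at once.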
addit_methods.py
ADDED
@@ -0,0 +1,186 @@
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

import gc
import torch
from visualization_utils import show_images

def _add_object(
    pipe,
    prompts,
    seed_src,
    seed_obj,
    extended_scale,
    source_latents,
    structure_transfer_step,
    subject_token,
    blend_steps,
    show_attention=False,
    localization_model="attention_points_sam",
    is_img_src=False,
    img_src_latents=None,
    use_offset=False,
    display_output=False,
):
    gc.collect()
    torch.cuda.empty_cache()

    out = pipe(
        prompt=prompts,
        guidance_scale=3.5 if (not is_img_src) else [1, 3.5],
        height=1024,
        width=1024,
        max_sequence_length=512,
        num_inference_steps=30,
        seed=[seed_src, seed_obj],

        # Extended Attention
        extended_scale=extended_scale,
        extended_steps_multi=10,
        extended_steps_single=20,

        # Structure Transfer
        source_latents=source_latents,
        structure_transfer_step=structure_transfer_step,

        # Latent Blending
        subject_token=subject_token,
        localization_model=localization_model,
        blend_steps=blend_steps,
        show_attention=show_attention,

        # Real Image Source
        is_img_src=is_img_src,
        img_src_latents=img_src_latents,
        use_offset=use_offset,
    )

    if display_output:
        show_images(out.images)

    return out.images

def add_object_generated(
    pipe,
    prompt_source,
    prompt_object,
    subject_token,
    seed_src,
    seed_obj,
    show_attention=False,
    extended_scale=1.05,
    structure_transfer_step=2,
    blend_steps=[15],
    localization_model="attention_points_sam",
    display_output=False
):
    gc.collect()
    torch.cuda.empty_cache()

    # Generate the source image and latents for the source seed
    print('Generating source image...')
    source_image, source_latents = pipe(
        prompt=[prompt_source],
        guidance_scale=3.5,
        height=1024,
        width=1024,
        max_sequence_length=512,
        num_inference_steps=30,
        seed=[seed_src],
        output_type="both",
    )
    source_image = source_image[0]

    # Run the core combination logic
    print('Running Addit...')
    src_image, edited_image = _add_object(
        pipe=pipe,
        prompts=[prompt_source, prompt_object],
        subject_token=subject_token,
        seed_src=seed_src,
        seed_obj=seed_obj,
        source_latents=source_latents,
        structure_transfer_step=structure_transfer_step,
        extended_scale=extended_scale,
        blend_steps=blend_steps,
        show_attention=show_attention,
        localization_model=localization_model,
        display_output=display_output
    )

    return src_image, edited_image

def add_object_real(
    pipe,
    source_image,
    prompt_source,
    prompt_object,
    subject_token,
    seed_src,
    seed_obj,
    localization_model="attention_points_sam",
    extended_scale=1.05,
    structure_transfer_step=4,
    blend_steps=[20],
    use_offset=False,
    show_attention=False,
    use_inversion=False,
    display_output=False
):
    print('Noising-Denoising Original Image')
    gc.collect()
    torch.cuda.empty_cache()

    # Get initial latents
    source_latents = pipe.call_img2img(
        prompt=prompt_source,
        image=source_image,
        num_inference_steps=30,
        strength=0.1,
        guidance_scale=3.5,
        output_type="latent",
        generator=torch.Generator(device=pipe.device).manual_seed(0)
    ).images

    # Optional inversion step
    img_src_latents = None
    if use_inversion:
        print('Inverting Image')
        gc.collect()
        torch.cuda.empty_cache()

        latents_list = pipe.call_invert(
            prompt=prompt_source,
            image=source_latents,
            num_inference_steps=30,
            guidance_scale=1,
            fixed_point_iterations=2,
            generator=torch.Generator(device=pipe.device).manual_seed(0)
        )
        img_src_latents = [x[0] for x in latents_list][::-1]

    print('Running Addit')
    gc.collect()
    torch.cuda.empty_cache()

    src_image, edited_image = _add_object(
        pipe,
        prompts=[prompt_source, prompt_object],
        seed_src=seed_src,
        seed_obj=seed_obj,
        extended_scale=extended_scale,
        source_latents=source_latents,
        structure_transfer_step=structure_transfer_step,
        subject_token=subject_token,
        blend_steps=blend_steps,
        show_attention=show_attention,
        localization_model=localization_model,
        is_img_src=True,
        img_src_latents=img_src_latents,
        use_offset=use_offset,
        display_output=display_output,
    )

    return src_image, edited_image
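For reference, a minimal call of `add_object_generated`; `pipe` is assumed to be an `AdditFluxPipeline` configured as in `app.py`, and the prompts, subject token, and seeds below are made-up examples rather than values from the repository:

from addit_methods import add_object_generated

src_image, edited_image = add_object_generated(
    pipe=pipe,
    prompt_source="A photo of a cat sitting on a couch",
    prompt_object="A photo of a cat wearing a red hat sitting on a couch",
    subject_token="hat",      # must appear in prompt_object
    seed_src=0,
    seed_obj=1,
    extended_scale=1.05,
    structure_transfer_step=2,
    blend_steps=[15],
    localization_model="attention_points_sam",
    display_output=False,
)
edited_image.save("edited.png")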
addit_scheduler.py
ADDED
@@ -0,0 +1,101 @@
# Copyright 2024 Stability AI, Katherine Crowson and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from diffusers.schedulers.scheduling_flow_match_euler_discrete import FlowMatchEulerDiscreteScheduler, FlowMatchEulerDiscreteSchedulerOutput
from typing import Union, Optional, Tuple
import torch

class AdditFlowMatchEulerDiscreteScheduler(FlowMatchEulerDiscreteScheduler):
    def step(
        self,
        model_output: torch.FloatTensor,
        timestep: Union[float, torch.FloatTensor],
        sample: torch.FloatTensor,
        s_churn: float = 0.0,
        s_tmin: float = 0.0,
        s_tmax: float = float("inf"),
        s_noise: float = 1.0,
        generator: Optional[torch.Generator] = None,
        return_dict: bool = True,
        step_index: Optional[int] = None,
    ) -> Union[FlowMatchEulerDiscreteSchedulerOutput, Tuple]:
        """
        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
        process from the learned model outputs (most often the predicted noise).

        Args:
            model_output (`torch.FloatTensor`):
                The direct output from learned diffusion model.
            timestep (`float`):
                The current discrete timestep in the diffusion chain.
            sample (`torch.FloatTensor`):
                A current instance of a sample created by the diffusion process.
            s_churn (`float`):
            s_tmin (`float`):
            s_tmax (`float`):
            s_noise (`float`, defaults to 1.0):
                Scaling factor for noise added to the sample.
            generator (`torch.Generator`, *optional*):
                A random number generator.
            return_dict (`bool`):
                Whether or not to return a [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or
                tuple.

        Returns:
            [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or `tuple`:
                If return_dict is `True`, [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] is
                returned, otherwise a tuple is returned where the first element is the sample tensor.
        """

        if (
            isinstance(timestep, int)
            or isinstance(timestep, torch.IntTensor)
            or isinstance(timestep, torch.LongTensor)
        ):
            raise ValueError(
                (
                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
                    " `EulerDiscreteScheduler.step()` is not supported. Make sure to pass"
                    " one of the `scheduler.timesteps` as a timestep."
                ),
            )

        if step_index is not None:
            self._step_index = step_index

        if self.step_index is None:
            self._init_step_index(timestep)

        # Upcast to avoid precision issues when computing prev_sample
        sample = sample.to(torch.float32)

        sigma = self.sigmas[self.step_index]
        sigma_next = self.sigmas[self.step_index + 1]

        prev_sample = sample + (sigma_next - sigma) * model_output

        # Calculate X_0
        x_0 = sample - sigma * model_output

        # Cast sample back to model compatible dtype
        prev_sample = prev_sample.to(model_output.dtype)
        x_0 = x_0.to(model_output.dtype)

        # upon completion increase step index by one
        self._step_index += 1

        if not return_dict:
            return (prev_sample, x_0)

        return FlowMatchEulerDiscreteSchedulerOutput(prev_sample=prev_sample)
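Relative to the parent class, the additions here are the optional `step_index` argument, which lets the fixed-point inversion loop in `addit_flux_pipeline.py` replay the same sigma pair across iterations, and the extra `x_0` element in the tuple return. A small self-contained sketch with synthetic tensors (the scheduler construction and the latent shape are illustrative assumptions):

import torch
from addit_scheduler import AdditFlowMatchEulerDiscreteScheduler

scheduler = AdditFlowMatchEulerDiscreteScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps=30)

sample = torch.randn(1, 4096, 64)         # packed Flux latents: (batch, seq_len, channels)
model_output = torch.randn_like(sample)   # velocity prediction from the transformer

# Euler flow-match update: prev_sample = sample + (sigma_next - sigma) * velocity.
t = scheduler.timesteps[5]
prev_sample, x_0 = scheduler.step(model_output, t, sample, return_dict=False, step_index=5)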
app.py
ADDED
@@ -0,0 +1,416 @@
#!/usr/bin/env python3
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

import os
import gradio as gr
import spaces
import torch
import numpy as np
from PIL import Image
import tempfile
import gc

from addit_flux_pipeline import AdditFluxPipeline
from addit_flux_transformer import AdditFluxTransformer2DModel
from addit_scheduler import AdditFlowMatchEulerDiscreteScheduler
from addit_methods import add_object_generated, add_object_real

# Global variables for model
pipe = None
device = None

# Initialize model at startup
print("Initializing ADDIT model...")
try:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Load transformer
    my_transformer = AdditFluxTransformer2DModel.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        subfolder="transformer",
        torch_dtype=torch.bfloat16
    )

    # Load pipeline
    pipe = AdditFluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        transformer=my_transformer,
        torch_dtype=torch.bfloat16
    ).to(device)

    # Set scheduler
    pipe.scheduler = AdditFlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)

    print("Model initialized successfully!")

except Exception as e:
    print(f"Error initializing model: {str(e)}")
    print("The application will start but model functionality will be unavailable.")

def validate_inputs(prompt_source, prompt_target, subject_token):
    """Validate user inputs"""
    if not prompt_source.strip():
        return "Source prompt cannot be empty"
    if not prompt_target.strip():
        return "Target prompt cannot be empty"
    if not subject_token.strip():
        return "Subject token cannot be empty"
    if subject_token not in prompt_target:
        return f"Subject token '{subject_token}' must appear in the target prompt"
    return None

@spaces.GPU
def process_generated_image(
    prompt_source,
    prompt_target,
    subject_token,
    seed_src,
    seed_obj,
    extended_scale,
    structure_transfer_step,
    blend_steps,
    localization_model,
    progress=gr.Progress(track_tqdm=True)
):
    """Process generated image with ADDIT"""
    global pipe

    if pipe is None:
        return None, None, "Model not initialized. Please restart the application."

    # Validate inputs
    error_msg = validate_inputs(prompt_source, prompt_target, subject_token)
    if error_msg:
        return None, None, error_msg

    try:
        # Parse blend steps
        if blend_steps.strip():
            blend_steps_list = [int(x.strip()) for x in blend_steps.split(',') if x.strip()]
        else:
            blend_steps_list = []

        # Generate images
        src_image, edited_image = add_object_generated(
            pipe=pipe,
            prompt_source=prompt_source,
            prompt_object=prompt_target,
            subject_token=subject_token,
            seed_src=seed_src,
            seed_obj=seed_obj,
            show_attention=False,
            extended_scale=extended_scale,
            structure_transfer_step=structure_transfer_step,
            blend_steps=blend_steps_list,
            localization_model=localization_model,
            display_output=False
        )

        return src_image, edited_image, "Images generated successfully!"

    except Exception as e:
        error_msg = f"Error generating images: {str(e)}"
        print(error_msg)
        return None, None, error_msg

@spaces.GPU
def process_real_image(
    source_image,
    prompt_source,
    prompt_target,
    subject_token,
    seed_src,
    seed_obj,
    extended_scale,
    structure_transfer_step,
    blend_steps,
    localization_model,
    use_offset,
    disable_inversion,
    progress=gr.Progress(track_tqdm=True)
):
    """Process real image with ADDIT"""
    global pipe

    if pipe is None:
        return None, None, "Model not initialized. Please restart the application."

    if source_image is None:
        return None, None, "Please upload a source image"

    # Validate inputs
    error_msg = validate_inputs(prompt_source, prompt_target, subject_token)
    if error_msg:
        return None, None, error_msg

    try:
        # Resize source image
        source_image = source_image.resize((1024, 1024))

        # Parse blend steps
        if blend_steps.strip():
            blend_steps_list = [int(x.strip()) for x in blend_steps.split(',') if x.strip()]
        else:
            blend_steps_list = []

        # Process image
        src_image, edited_image = add_object_real(
            pipe=pipe,
            source_image=source_image,
            prompt_source=prompt_source,
            prompt_object=prompt_target,
            subject_token=subject_token,
            seed_src=seed_src,
            seed_obj=seed_obj,
            extended_scale=extended_scale,
            structure_transfer_step=structure_transfer_step,
            blend_steps=blend_steps_list,
            localization_model=localization_model,
            use_offset=use_offset,
            show_attention=False,
            use_inversion=not disable_inversion,
            display_output=False
        )

        return src_image, edited_image, "Image edited successfully!"

    except Exception as e:
        error_msg = f"Error processing image: {str(e)}"
        print(error_msg)
        return None, None, error_msg

def create_interface():
    """Create the Gradio interface"""

    # Show model status in the interface
    model_status = "Model ready!" if pipe is not None else "Model initialization failed - functionality unavailable"

    with gr.Blocks(title="🎨 Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models", theme=gr.themes.Soft()) as demo:
        gr.HTML(f"""
        <div style="text-align: center; margin-bottom: 20px;">
            <h1>🎨 Add-it: Training-Free Object Insertion</h1>
            <p>Add objects to images using pretrained diffusion models</p>
            <p><a href="https://research.nvidia.com/labs/par/addit/" target="_blank">🌐 Project Website</a> |
            <a href="https://arxiv.org/abs/2411.07232" target="_blank">📄 Paper</a> |
            <a href="https://github.com/NVlabs/addit" target="_blank">💻 Code</a></p>
            <p style="color: {'green' if pipe is not None else 'red'}; font-weight: bold;">Status: {model_status}</p>
        </div>
        """)

        # Main interface
        with gr.Tabs():
            # Generated Images Tab
            with gr.TabItem("🎭 Generated Images"):
                gr.Markdown("### Generate a base image and add objects to it")

                with gr.Row():
                    with gr.Column(scale=1):
                        gen_prompt_source = gr.Textbox(
                            label="Source Prompt",
                            placeholder="A photo of a cat sitting on the couch",
                            value="A photo of a cat sitting on the couch"
                        )
                        gen_prompt_target = gr.Textbox(
                            label="Target Prompt",
                            placeholder="A photo of a cat wearing a red hat sitting on the couch",
                            value="A photo of a cat wearing a red hat sitting on the couch"
                        )
                        gen_subject_token = gr.Textbox(
                            label="Subject Token",
                            placeholder="hat",
                            value="hat",
                            info="Single token representing the object to add **(must appear in target prompt)**"
                        )

                        with gr.Accordion("Advanced Settings", open=False):
                            gen_seed_src = gr.Number(label="Source Seed", value=6311, precision=0)
                            gen_seed_obj = gr.Number(label="Object Seed", value=1, precision=0)
                            gen_extended_scale = gr.Slider(
                                label="Extended Scale",
                                minimum=1.0,
                                maximum=1.3,
                                value=1.05,
                                step=0.01
                            )
                            gen_structure_transfer_step = gr.Slider(
                                label="Structure Transfer Step",
                                minimum=0,
                                maximum=10,
                                value=2,
                                step=1
                            )
                            gen_blend_steps = gr.Textbox(
                                label="Blend Steps",
                                value="15",
                                info="Comma-separated list of steps (e.g., '15,20') or empty for no blending"
                            )
                            gen_localization_model = gr.Dropdown(
                                label="Localization Model",
                                choices=[
                                    "attention_points_sam",
                                    "attention",
                                    "attention_box_sam",
                                    "attention_mask_sam",
                                    "grounding_sam"
                                ],
                                value="attention_points_sam"
                            )

                        gen_submit_btn = gr.Button("🎨 Generate & Edit", variant="primary")

                    with gr.Column(scale=2):
                        with gr.Row():
                            gen_src_output = gr.Image(label="Generated Source Image", type="pil")
                            gen_edited_output = gr.Image(label="Edited Image", type="pil")
                        gen_status = gr.Textbox(label="Status", interactive=False)

                gen_submit_btn.click(
                    fn=process_generated_image,
                    inputs=[
                        gen_prompt_source, gen_prompt_target, gen_subject_token,
                        gen_seed_src, gen_seed_obj, gen_extended_scale,
                        gen_structure_transfer_step, gen_blend_steps,
                        gen_localization_model
                    ],
                    outputs=[gen_src_output, gen_edited_output, gen_status]
                )

                # Examples for generated images
                gr.Examples(
                    examples=[
                        ["A photo of a man sitting on a bench", "A photo of a man sitting on a bench with a dog", "dog"],
                        ["A photo of a cat sitting on the couch", "A photo of a cat wearing a red hat sitting on the couch", "hat"],
                        ["A car driving through an empty street", "A pink car driving through an empty street", "car"]
                    ],
                    inputs=[
                        gen_prompt_source, gen_prompt_target, gen_subject_token
                    ],
                    label="Example Prompts"
                )

            # Real Images Tab
            with gr.TabItem("📸 Real Images"):
                gr.Markdown("### Upload an image and add objects to it")

                with gr.Row():
                    with gr.Column(scale=1):
                        real_source_image = gr.Image(label="Source Image", type="pil")
                        real_prompt_source = gr.Textbox(
                            label="Source Prompt",
                            placeholder="A photo of a bed in a dark room",
                            value="A photo of a bed in a dark room"
                        )
                        real_prompt_target = gr.Textbox(
                            label="Target Prompt",
                            placeholder="A photo of a dog lying on a bed in a dark room",
                            value="A photo of a dog lying on a bed in a dark room"
                        )
                        real_subject_token = gr.Textbox(
                            label="Subject Token",
                            placeholder="dog",
                            value="dog",
                            info="Single token representing the object to add **(must appear in target prompt)**"
                        )

                        with gr.Accordion("Advanced Settings", open=False):
                            real_seed_src = gr.Number(label="Source Seed", value=6311, precision=0)
                            real_seed_obj = gr.Number(label="Object Seed", value=1, precision=0)
                            real_extended_scale = gr.Slider(
                                label="Extended Scale",
                                minimum=1.0,
                                maximum=1.3,
                                value=1.1,
                                step=0.01
                            )
                            real_structure_transfer_step = gr.Slider(
                                label="Structure Transfer Step",
                                minimum=0,
                                maximum=10,
                                value=4,
                                step=1
                            )
                            real_blend_steps = gr.Textbox(
                                label="Blend Steps",
                                value="18",
                                info="Comma-separated list of steps (e.g., '15,20') or empty for no blending"
                            )
                            real_localization_model = gr.Dropdown(
                                label="Localization Model",
                                choices=[
                                    "attention",
                                    "attention_points_sam",
                                    "attention_box_sam",
                                    "attention_mask_sam",
                                    "grounding_sam"
                                ],
                                value="attention"
                            )
                            real_use_offset = gr.Checkbox(label="Use Offset", value=False)
                            real_disable_inversion = gr.Checkbox(label="Disable Inversion", value=False)

                        real_submit_btn = gr.Button("🎨 Edit Image", variant="primary")

                    with gr.Column(scale=2):
                        with gr.Row():
                            real_src_output = gr.Image(label="Source Image", type="pil")
                            real_edited_output = gr.Image(label="Edited Image", type="pil")
                        real_status = gr.Textbox(label="Status", interactive=False)

                real_submit_btn.click(
                    fn=process_real_image,
                    inputs=[
                        real_source_image, real_prompt_source, real_prompt_target, real_subject_token,
                        real_seed_src, real_seed_obj, real_extended_scale,
                        real_structure_transfer_step, real_blend_steps,
                        real_localization_model, real_use_offset,
                        real_disable_inversion
                    ],
                    outputs=[real_src_output, real_edited_output, real_status]
                )

                # Examples for real images
                gr.Examples(
                    examples=[
                        [
                            "images/bed_dark_room.jpg",
                            "A photo of a bed in a dark room",
                            "A photo of a dog lying on a bed in a dark room",
                            "dog"
                        ],
                        [
                            "images/flower.jpg",
                            "A photo of a flower",
                            "A bee standing on a flower",
                            "bee"
                        ]
                    ],
                    inputs=[
                        real_source_image, real_prompt_source, real_prompt_target, real_subject_token
                    ],
                    label="Example Images & Prompts"
                )

        # Tips
        with gr.Accordion("💡 Tips for Better Results", open=False):
            gr.Markdown("""
            - **Prompt Design**: The Target Prompt should be similar to the Source Prompt, but include a description of the new object to insert
            - **Seed Variation**: Try different values for Object Seed - some prompts may require a few attempts to get satisfying results
            - **Localization Models**: The most effective options are `attention_points_sam` and `attention`. Use Show Attention to visualize localization performance
            - **Object Placement Issues**: If the object is not added to the image:
                - Try **decreasing** Structure Transfer Step
                - Try **increasing** Extended Scale
            - **Flexibility**: To allow more flexibility in modifying the source image, leave Blend Steps empty to send an empty list
            """)

    return demo

demo = create_interface()
demo.launch(
    server_name="0.0.0.0",
    server_port=7860,
    share=True
)
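Both handlers above turn the free-text Blend Steps box into the integer list the pipeline expects. A standalone sketch of that parsing, with illustrative values:

# Same comma-separated parsing as in process_generated_image / process_real_image (values are illustrative).
blend_steps = "15, 20"
blend_steps_list = [int(x.strip()) for x in blend_steps.split(',') if x.strip()]
print(blend_steps_list)  # [15, 20]; an empty textbox yields []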
images/bed_dark_room.jpg
ADDED
(binary image, tracked with Git LFS)
images/flower.jpg
ADDED
(binary image, tracked with Git LFS)
requirements.txt
ADDED
@@ -0,0 +1,20 @@
torch==2.5.1
torchvision==0.20.1
numpy==1.26.4
scipy==1.14.1
scikit-image==0.24.0
pandas==2.2.2
matplotlib
transformers==4.44.0
accelerate==0.33.0
diffusers @ git+https://github.com/huggingface/diffusers.git@15eb77bc4cf2ccb40781cb630b9a734b43cffcb8
opencv-python
pyarrow
fastparquet
ipykernel
sentencepiece==0.2.0
protobuf==5.27.3
python-dotenv
git+https://github.com/facebookresearch/sam2.git
gradio
spaces
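A quick way to confirm that the pinned core packages resolved in the active environment; this is a minimal sketch using only the standard library, with package names taken from the list above:

# Illustrative sanity check of installed versions against the pins above.
from importlib.metadata import version
for pkg in ["torch", "transformers", "accelerate", "diffusers", "gradio"]:
    print(f"{pkg}=={version(pkg)}")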
run_CLI_addit_generated.py
ADDED
@@ -0,0 +1,102 @@
#!/usr/bin/env python3
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

import os
import argparse
import torch
import random

from visualization_utils import show_images
from addit_flux_pipeline import AdditFluxPipeline
from addit_flux_transformer import AdditFluxTransformer2DModel
from addit_scheduler import AdditFlowMatchEulerDiscreteScheduler
from addit_methods import add_object_generated

def main():
    parser = argparse.ArgumentParser(description='Run ADDIT with generated images')

    # Required arguments
    parser.add_argument('--prompt_source', type=str, default="A photo of a cat sitting on the couch",
                        help='Source prompt for generating the base image')
    parser.add_argument('--prompt_target', type=str, default="A photo of a cat wearing a red hat sitting on the couch",
                        help='Target prompt describing the desired edited image')
    parser.add_argument('--subject_token', type=str, default="hat",
                        help='Single token representing the subject to add to the image, must appear in the prompt_target')

    # Optional arguments
    parser.add_argument('--output_dir', type=str, default='outputs',
                        help='Directory to save output images (default: outputs)')
    parser.add_argument('--seed_src', type=int, default=6311,
                        help='Seed for source generation')
    parser.add_argument('--seed_obj', type=int, default=1,
                        help='Seed for edited image generation')
    parser.add_argument('--extended_scale', type=float, default=1.05,
                        help='Extended attention scale (default: 1.05)')
    parser.add_argument('--structure_transfer_step', type=int, default=2,
                        help='Structure transfer step (default: 2)')
    parser.add_argument('--blend_steps', type=int, nargs='*', default=[15],
                        help='Blend steps (default: [15])')
    parser.add_argument('--localization_model', type=str, default="attention_points_sam",
                        help='Localization model (default: attention_points_sam, Options: [attention_points_sam, attention, attention_box_sam, attention_mask_sam, grounding_sam])')
    parser.add_argument('--show_attention', action='store_true',
                        help='Show attention maps')
    parser.add_argument('--display_output', action='store_true',
                        help='Display output images during processing')

    args = parser.parse_args()

    assert args.subject_token in args.prompt_target, "Subject token must appear in the prompt_target"

    # Set up device and model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    my_transformer = AdditFluxTransformer2DModel.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        subfolder="transformer",
        torch_dtype=torch.bfloat16
    )

    pipe = AdditFluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        transformer=my_transformer,
        torch_dtype=torch.bfloat16
    ).to(device)

    pipe.scheduler = AdditFlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)

    # Create output directory
    os.makedirs(args.output_dir, exist_ok=True)

    # Process the seeds
    print(f"\nProcessing with source seed: {args.seed_src}, object seed: {args.seed_obj}")

    src_image, edited_image = add_object_generated(
        pipe,
        args.prompt_source,
        args.prompt_target,
        args.subject_token,
        args.seed_src,
        args.seed_obj,
        show_attention=args.show_attention,
        extended_scale=args.extended_scale,
        structure_transfer_step=args.structure_transfer_step,
        blend_steps=args.blend_steps,
        localization_model=args.localization_model,
        display_output=args.display_output
    )

    # Save output images
    src_filename = f"src_{args.prompt_source}_seed-src={args.seed_src}.png"
    edited_filename = f"edited_{args.prompt_target}_seed-src={args.seed_src}_seed-obj={args.seed_obj}.png"

    src_image.save(os.path.join(args.output_dir, src_filename))
    edited_image.save(os.path.join(args.output_dir, edited_filename))

    print(f"Saved images: {src_filename}, {edited_filename}")

if __name__ == "__main__":
    main()
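The script is meant to be driven from the command line. A hedged sketch of invoking it from Python with a subset of the flags defined in the argparse setup above (the prompt values shown are the script's own defaults):

# Illustrative only: calling the CLI via subprocess with flags taken from the parser above.
import subprocess
subprocess.run([
    "python", "run_CLI_addit_generated.py",
    "--prompt_source", "A photo of a cat sitting on the couch",
    "--prompt_target", "A photo of a cat wearing a red hat sitting on the couch",
    "--subject_token", "hat",
    "--blend_steps", "15",
    "--output_dir", "outputs",
], check=True)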
run_CLI_addit_real.py
ADDED
@@ -0,0 +1,121 @@
#!/usr/bin/env python3
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

import os
import argparse
import torch
import random
from PIL import Image

from visualization_utils import show_images
from addit_flux_pipeline import AdditFluxPipeline
from addit_flux_transformer import AdditFluxTransformer2DModel
from addit_scheduler import AdditFlowMatchEulerDiscreteScheduler
from addit_methods import add_object_real

def main():
    parser = argparse.ArgumentParser(description='Run ADDIT with real images')

    # Required arguments
    parser.add_argument('--source_image', type=str, default="images/bed_dark_room.jpg",
                        help='Path to the source image')
    parser.add_argument('--prompt_source', type=str, default="A photo of a bed in a dark room",
                        help='Source prompt describing the original image')
    parser.add_argument('--prompt_target', type=str, default="A photo of a dog lying on a bed in a dark room",
                        help='Target prompt describing the desired edited image')
    parser.add_argument('--subject_token', type=str, default="dog",
                        help='Subject token to add to the image')

    # Optional arguments
    parser.add_argument('--output_dir', type=str, default='outputs',
                        help='Directory to save output images (default: outputs)')
    parser.add_argument('--seed_src', type=int, default=6311,
                        help='Seed for source generation')
    parser.add_argument('--seed_obj', type=int, default=1,
                        help='Seed for edited image generation')
    parser.add_argument('--extended_scale', type=float, default=1.1,
                        help='Extended attention scale (default: 1.1)')
    parser.add_argument('--structure_transfer_step', type=int, default=4,
                        help='Structure transfer step (default: 4)')
    parser.add_argument('--blend_steps', type=int, nargs='*', default=[18],
                        help='Blend steps (default: [18])')
    parser.add_argument('--localization_model', type=str, default="attention",
                        help='Localization model (default: attention, Options: [attention_points_sam, attention, attention_box_sam, attention_mask_sam, grounding_sam])')
    parser.add_argument('--use_offset', action='store_true',
                        help='Use offset in processing')
    parser.add_argument('--show_attention', action='store_true',
                        help='Show attention maps')
    parser.add_argument('--disable_inversion', action='store_true',
                        help='Disable source image inversion')
    parser.add_argument('--display_output', action='store_true',
                        help='Display output images during processing')

    args = parser.parse_args()

    assert args.subject_token in args.prompt_target, "Subject token must appear in the prompt_target"

    # Set up device and model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    my_transformer = AdditFluxTransformer2DModel.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        subfolder="transformer",
        torch_dtype=torch.bfloat16
    )

    pipe = AdditFluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        transformer=my_transformer,
        torch_dtype=torch.bfloat16
    ).to(device)

    pipe.scheduler = AdditFlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)

    # Load and resize source image
    source_image = Image.open(args.source_image).resize((1024, 1024))
    print(f"Loaded source image: {args.source_image}")

    # Set random seed
    if args.seed_src is None:
        random.seed(0)
        args.seed_src = random.randint(0, 10000)

    # Create output directory
    os.makedirs(args.output_dir, exist_ok=True)

    # Process the seeds
    print(f"\nProcessing with source seed: {args.seed_src}, object seed: {args.seed_obj}")

    src_image, edited_image = add_object_real(
        pipe,
        source_image=source_image,
        prompt_source=args.prompt_source,
        prompt_object=args.prompt_target,
        subject_token=args.subject_token,
        seed_src=args.seed_src,
        seed_obj=args.seed_obj,
        extended_scale=args.extended_scale,
        structure_transfer_step=args.structure_transfer_step,
        blend_steps=args.blend_steps,
        localization_model=args.localization_model,
        use_offset=args.use_offset,
        show_attention=args.show_attention,
        use_inversion=not args.disable_inversion,
        display_output=args.display_output
    )

    # Save output images
    src_filename = f"src_{args.prompt_source}_seed-src={args.seed_src}.png"
    edited_filename = f"edited_{args.prompt_target}_seed-src={args.seed_src}_seed-obj={args.seed_obj}.png"

    src_image.save(os.path.join(args.output_dir, src_filename))
    edited_image.save(os.path.join(args.output_dir, edited_filename))

    print(f"Saved images: {src_filename}, {edited_filename}")

if __name__ == "__main__":
    main()
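Because `--blend_steps` is declared with `nargs='*'` in both CLI scripts, it accepts zero or more integers. A small standalone sketch of how argparse resolves the three common cases (the parser here is a stripped-down stand-in for the one above):

# Illustrative: parsing behavior of the nargs='*' blend_steps flag.
import argparse
p = argparse.ArgumentParser()
p.add_argument('--blend_steps', type=int, nargs='*', default=[18])
print(p.parse_args([]).blend_steps)                              # [18]  (default)
print(p.parse_args(['--blend_steps', '15', '20']).blend_steps)   # [15, 20]
print(p.parse_args(['--blend_steps']).blend_steps)               # []    (disables blending)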
run_addit_generated.ipynb
ADDED
@@ -0,0 +1,80 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.\n",
    "#\n",
    "# This work is licensed under the LICENSE file\n",
    "# located at the root directory.\n",
    "import torch\n",
    "import random\n",
    "\n",
    "from visualization_utils import show_images\n",
    "from addit_flux_pipeline import AdditFluxPipeline\n",
    "from addit_flux_transformer import AdditFluxTransformer2DModel\n",
    "from addit_scheduler import AdditFlowMatchEulerDiscreteScheduler\n",
    "from addit_methods import add_object_generated\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "my_transformer = AdditFluxTransformer2DModel.from_pretrained(\"black-forest-labs/FLUX.1-dev\", subfolder=\"transformer\", torch_dtype=torch.bfloat16)\n",
    "\n",
    "pipe = AdditFluxPipeline.from_pretrained(\"black-forest-labs/FLUX.1-dev\", \n",
    "                                         transformer=my_transformer,\n",
    "                                         torch_dtype=torch.bfloat16).to(device)\n",
    "pipe.scheduler = AdditFlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reset the GPU memory tracking\n",
    "torch.cuda.reset_max_memory_allocated(0)\n",
    "\n",
    "(prompt1, prompt2), subject_token = [\"A photo of a man sitting on a bench\", \"A photo of a man sitting on a bench with a dog\"], \"dog\"\n",
    "\n",
    "\n",
    "random.seed(0)\n",
    "seeds_src = [663]\n",
    "seeds_obj = [0,1,2]\n",
    "\n",
    "for seed_src in seeds_src:\n",
    "    for seed_obj in seeds_obj:\n",
    "        src_image, edited_image = add_object_generated(pipe, prompt1, prompt2, subject_token, seed_src, seed_obj, show_attention=True, \n",
    "                                                        extended_scale=1.05, structure_transfer_step=2, blend_steps=[15], \n",
    "                                                        localization_model=\"attention_points_sam\", display_output=True)\n",
    "\n",
    "# Report maximum GPU memory usage in GB\n",
    "max_memory_used = torch.cuda.max_memory_allocated(0) / (1024**3) # Convert to GB\n",
    "print(f\"Maximum GPU memory used: {max_memory_used:.2f} GB\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "addit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
run_addit_real.ipynb
ADDED
@@ -0,0 +1,85 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.\n",
    "#\n",
    "# This work is licensed under the LICENSE file\n",
    "# located at the root directory.\n",
    "\n",
    "import torch\n",
    "import random\n",
    "from PIL import Image\n",
    "\n",
    "from visualization_utils import show_images\n",
    "from addit_flux_pipeline import AdditFluxPipeline\n",
    "from addit_flux_transformer import AdditFluxTransformer2DModel\n",
    "from addit_scheduler import AdditFlowMatchEulerDiscreteScheduler\n",
    "from addit_methods import add_object_real\n",
    "\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "my_transformer = AdditFluxTransformer2DModel.from_pretrained(\"black-forest-labs/FLUX.1-dev\", subfolder=\"transformer\", torch_dtype=torch.bfloat16)\n",
    "\n",
    "pipe = AdditFluxPipeline.from_pretrained(\"black-forest-labs/FLUX.1-dev\", \n",
    "                                         transformer=my_transformer,\n",
    "                                         torch_dtype=torch.bfloat16).to(device)\n",
    "pipe.scheduler = AdditFlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reset the GPU memory tracking\n",
    "torch.cuda.reset_max_memory_allocated(0)\n",
    "\n",
    "# source_image = Image.open(\"images/cat.jpg\").resize((1024, 1024))\n",
    "# (prompt_src, prompt_tgt), subject_token = [\"A photo of a cat\", \"A photo of a cat wearing a scarf\"], \"scarf\"\n",
    "\n",
    "source_image = Image.open(\"images/bed_dark_room.jpg\").resize((1024, 1024))\n",
    "(prompt_src, prompt_tgt), subject_token = [\"A photo of a bed in a dark room\", \"A photo of a dog lying on a bed in a dark room\"], \"dog\"\n",
    "\n",
    "random.seed(0)\n",
    "seed_src = random.randint(0, 10000)\n",
    "seeds_obj = [0,1,2]\n",
    "\n",
    "for seed_obj in seeds_obj:\n",
    "    images_list = add_object_real(pipe, source_image=source_image, prompt_source=prompt_src, prompt_object=prompt_tgt, \n",
    "                                  subject_token=subject_token, seed_src=seed_src, seed_obj=seed_obj, \n",
    "                                  extended_scale=1.1, structure_transfer_step=4, blend_steps=[18], #localization_model=\"attention\",\n",
    "                                  use_offset=False, show_attention=True, use_inversion=True, display_output=True)\n",
    "\n",
    "# Report maximum GPU memory usage in GB\n",
    "max_memory_used = torch.cuda.max_memory_allocated(0) / (1024**3) # Convert to GB\n",
    "print(f\"Maximum GPU memory used: {max_memory_used:.2f} GB\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "addit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
visualization_utils.py
ADDED
@@ -0,0 +1,235 @@
# Copyright (C) 2025 NVIDIA Corporation. All rights reserved.
#
# This work is licensed under the LICENSE file
# located at the root directory.

import cv2
import numpy as np
from PIL import Image, ImageDraw
import torch
import matplotlib.pyplot as plt
from skimage import filters
from IPython.display import display

def gaussian_blur(heatmap, kernel_size=7):
    # Shape of heatmap: (H, W)
    heatmap = heatmap.cpu().numpy()
    heatmap = cv2.GaussianBlur(heatmap, (kernel_size, kernel_size), 0)
    heatmap = torch.tensor(heatmap)

    return heatmap

def show_cam_on_image(img, mask):
    heatmap = cv2.applyColorMap(np.uint8(255 * mask), cv2.COLORMAP_JET)
    heatmap = np.float32(heatmap) / 255
    cam = heatmap + np.float32(img)
    cam = cam / np.max(cam)
    return cam

def show_image_and_heatmap(heatmap: torch.Tensor, image: Image.Image, relevnace_res: int = 256, interpolation: str = 'bilinear', gassussian_kernel_size: int = 3):
    image = image.resize((relevnace_res, relevnace_res))
    image = np.array(image)
    image = (image - image.min()) / (image.max() - image.min())

    # Apply gaussian blur to heatmap
    # heatmap = gaussian_blur(heatmap, kernel_size=gassussian_kernel_size)

    # heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
    # otsu_thr = filters.threshold_otsu(heatmap.cpu().numpy())
    # heatmap = (heatmap > otsu_thr).to(heatmap.dtype)

    heatmap = heatmap.reshape(1, 1, heatmap.shape[-1], heatmap.shape[-1])
    heatmap = torch.nn.functional.interpolate(heatmap, size=relevnace_res, mode=interpolation)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
    heatmap = heatmap.reshape(relevnace_res, relevnace_res).cpu()

    vis = show_cam_on_image(image, heatmap)
    vis = np.uint8(255 * vis)
    vis = cv2.cvtColor(np.array(vis), cv2.COLOR_RGB2BGR)

    vis = vis.astype(np.uint8)
    vis = Image.fromarray(vis).resize((relevnace_res, relevnace_res))

    return vis

def show_only_heatmap(heatmap: torch.Tensor, relevnace_res: int = 256, interpolation: str = 'bilinear', gassussian_kernel_size: int = 3):
    # Apply gaussian blur to heatmap
    # heatmap = gaussian_blur(heatmap, kernel_size=gassussian_kernel_size)

    heatmap = heatmap.reshape(1, 1, heatmap.shape[-1], heatmap.shape[-1])
    heatmap = torch.nn.functional.interpolate(heatmap, size=relevnace_res, mode=interpolation)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
    heatmap = heatmap.reshape(relevnace_res, relevnace_res).cpu()

    vis = heatmap
    vis = np.uint8(255 * vis)

    # Show in black and white
    vis = cv2.cvtColor(np.array(vis), cv2.COLOR_GRAY2BGR)

    vis = Image.fromarray(vis).resize((relevnace_res, relevnace_res))

    return vis

def visualize_tokens_attentions(attention, tokens, image, heatmap_interpolation="nearest", show_on_image=True):
    # Tokens: list of strings
    # attention: tensor of shape (batch_size, num_tokens, width, height)
    token_vis = []
    for j, token in enumerate(tokens):
        if j >= attention.shape[0]:
            break

        if show_on_image:
            vis = show_image_and_heatmap(attention[j], image, relevnace_res=512, interpolation=heatmap_interpolation)
        else:
            vis = show_only_heatmap(attention[j], relevnace_res=512, interpolation=heatmap_interpolation)

        token_vis.append((token, vis))

    # Display the token and the attention map in a grid, with K tokens per row
    K = 4
    n_rows = (len(token_vis) + K - 1) // K  # Ceiling division
    fig, axs = plt.subplots(n_rows, K, figsize=(K*5, n_rows*5))

    for i, (token, vis) in enumerate(token_vis):
        row, col = divmod(i, K)
        if n_rows > 1:
            ax = axs[row, col]
        elif K > 1:
            ax = axs[col]
        else:
            ax = axs

        ax.imshow(vis)
        ax.set_title(token)
        ax.axis("off")

    # Hide unused subplots
    for j in range(i + 1, n_rows * K):
        row, col = divmod(j, K)
        if n_rows > 1:
            axs[row, col].axis('off')
        elif K > 1:
            axs[col].axis('off')

    plt.tight_layout()

    # We want to return the figure so that we can save it to a file
    return fig

def show_images(images, titles=None, size=1024, max_row_length=5, figsize=None, col_height=10, save_path=None):
    if isinstance(images, Image.Image):
        images = [images]

    if len(images) == 1:
        img = images[0]
        img = img.resize((size, size))
        plt.imshow(img)
        plt.axis('off')

        if titles is not None:
            plt.title(titles[0])

        if save_path:
            plt.savefig(save_path, bbox_inches='tight', dpi=150)

        plt.show()
    else:
        images = [img.resize((size, size)) for img in images]

        # Check if the number of titles matches the number of images
        if titles is not None:
            assert len(images) == len(titles), "Number of titles should match the number of images"

        n_images = len(images)
        n_cols = min(n_images, max_row_length)
        n_rows = (n_images + n_cols - 1) // n_cols  # Calculate the number of rows needed

        if figsize is None:
            figsize = (n_cols * col_height, n_rows * col_height)

        fig, axs = plt.subplots(n_rows, n_cols, figsize=figsize)
        axs = axs.flatten() if isinstance(axs, np.ndarray) else [axs]

        # Display images in the subplots
        for i, img in enumerate(images):
            axs[i].imshow(img)
            if titles is not None:
                axs[i].set_title(titles[i])
            axs[i].axis("off")

        # Turn off any unused subplots
        for ax in axs[len(images):]:
            ax.axis("off")

        if save_path:
            plt.savefig(save_path, bbox_inches='tight', dpi=150)

        plt.show()

def show_tensors(tensors, titles=None, size=None, max_row_length=5):
    # Shape of tensors: List[Tensor[H, W]]
    if size is not None:
        tensors = [torch.nn.functional.interpolate(t.unsqueeze(0).unsqueeze(0), size=(size, size), mode='bilinear').squeeze() for t in tensors]

    if len(tensors) == 1:
        plt.imshow(tensors[0].cpu().numpy())
        plt.axis('off')

        if titles is not None:
            plt.title(titles[0])

        plt.show()
    else:
        # Check if the number of titles matches the number of images
        if titles is not None:
            assert len(tensors) == len(titles), "Number of titles should match the number of images"

        n_tensors = len(tensors)
        n_cols = min(n_tensors, max_row_length)
        n_rows = (n_tensors + n_cols - 1) // n_cols

        fig, axs = plt.subplots(n_rows, n_cols, figsize=(n_cols * 10, n_rows * 10))
        axs = axs.flatten() if isinstance(axs, np.ndarray) else [axs]

        for i, tensor in enumerate(tensors):
            axs[i].imshow(tensor.cpu().numpy())
            if titles is not None:
                axs[i].set_title(titles[i])
            axs[i].axis("off")

        for ax in axs[len(tensors):]:
            ax.axis("off")

        plt.show()

def draw_bboxes_on_image(image, bboxes, color="red", thickness=2):
    image = image.copy()
    draw = ImageDraw.Draw(image)
    for bbox in bboxes:
        draw.rectangle(bbox, outline=color, width=thickness)
    return image

def draw_points_on_pil_image(pil_image, point_coords, point_color="red", radius=5):
    """
    Draw points (circles) on a PIL image and return the modified image.

    :param pil_image: PIL Image (e.g., sam_masked_image)
    :param point_coords: An array-like of shape (N, 2), with x,y coordinates
    :param point_color: Color of the point (default 'red')
    :param radius: Radius of the drawn circles
    :return: PIL Image with points drawn
    """
    # Copy so we don't modify the original
    out_img = pil_image.copy()
    draw = ImageDraw.Draw(out_img)

    # Draw each point
    for x, y in point_coords:
        # Calculate bounding box of the circle
        left_up_point = (x - radius, y - radius)
        right_down_point = (x + radius, y + radius)
        # Draw the circle
        draw.ellipse([left_up_point, right_down_point], fill=point_color, outline=point_color)

    return out_img
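A short usage sketch for the helpers above; the image path and point coordinates are placeholders chosen for illustration:

# Illustrative usage of draw_points_on_pil_image and show_images (placeholder path and points).
from PIL import Image
from visualization_utils import draw_points_on_pil_image, show_images

img = Image.open("images/flower.jpg")
annotated = draw_points_on_pil_image(img, [(256, 256), (400, 300)], point_color="red", radius=6)
show_images([img, annotated], titles=["original", "with points"], size=512)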