Commit b197ccc · Parent(s): a1c0b29
add code

Browse files:
- README.md +63 -12
- app.py +115 -0
- nested_attention_pipeline.py +248 -0
- nested_attention_processor.py +363 -0
- resampler.py +169 -0
- utils.py +128 -0

README.md
CHANGED
@@ -1,12 +1,63 @@
# Nested Attention: Semantic-aware Attention Values for Concept Personalization (SIGGRAPH 2025)



> **Nested Attention: Semantic-aware Attention Values for Concept Personalization**
> Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
> https://arxiv.org/abs/2501.01407
>
> **Abstract:** Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often struggle to balance identity preservation with alignment to the input text prompt. Some methods rely on a single textual token to represent a subject, limiting expressiveness, while others use richer representations but disrupt the model's prior, weakening prompt alignment.
> In this work, we introduce **Nested Attention**, a novel mechanism that injects rich and expressive image representations into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
> We integrate these nested layers into an encoder-based personalization method and show that they enable strong identity preservation while maintaining adherence to input text prompts. Our approach is general and can be trained across various domains. Additionally, its prior preservation allows for combining multiple personalized subjects from different domains in a single image.

## Description

Official implementation of **Nested Attention**, an encoder-based method for text-to-image personalization using a novel nested attention mechanism.

The implementation of the nested attention mechanism can be found in `nested_attention_processor.py`.
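In simplified form, each image query first attends over the encoded subject tokens to produce its own query-dependent subject value; that value then replaces the value of the special token inside the standard cross-attention. The toy sketch below illustrates this flow (single head, no batching, and without the value renormalization used in the actual processor); it is an illustration only, not the repository API:

```python
import torch

def nested_cross_attention(q, text_k, text_v, nested_k, nested_v, special_idx):
    """q: (n_pixels, d) image queries; text_k/text_v: (n_text, d) textual keys/values;
    nested_k/nested_v: (n_subj, d) projected subject tokens; special_idx: position of <person>."""
    d = q.shape[-1]

    # inner (nested) attention: every image query selects its own subject value
    inner_attn = torch.softmax(q @ nested_k.T / d**0.5, dim=-1)      # (n_pixels, n_subj)
    per_query_subject_value = inner_attn @ nested_v                  # (n_pixels, d)

    # outer attention: regular cross-attention, except that the shared value of the
    # special token is swapped for the query-dependent subject value
    outer_attn = torch.softmax(q @ text_k.T / d**0.5, dim=-1)        # (n_pixels, n_text)
    v = text_v.clone()
    v[special_idx] = 0.0                                             # zero out the old shared value
    out = outer_attn @ v + outer_attn[:, special_idx:special_idx + 1] * per_query_subject_value
    return out
```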
This repository provides:
- An inference notebook (`inference_notebook.ipynb`)
- A trained encoder for faces
- A Gradio-based application

## Setup

Please download the following models:
- https://github.com/ageitgey/face_recognition_models/blob/master/face_recognition_models/models/shape_predictor_68_face_landmarks.dat
- https://github.com/justadudewhohacks/face-recognition.js-models/blob/master/models/mmod_human_face_detector.dat
- image encoder (add link)
- trained encoder (add link)

Tested with:
- `torch==2.6.0`
- `diffusers==0.33.1`
- `transformers==4.51.2`
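For reference, the Gradio app in this commit pulls these assets from the `orpatashnik/NestedAttentionEncoder` repository on the Hugging Face Hub; a condensed version of the download calls in `app.py`:

```python
import os
from huggingface_hub import hf_hub_download, snapshot_download

# dlib face landmark predictor and CNN face detector
shape_predictor_path = hf_hub_download("orpatashnik/NestedAttentionEncoder", "shape_predictor_68_face_landmarks.dat")
face_detector_path = hf_hub_download("orpatashnik/NestedAttentionEncoder", "mmod_human_face_detector.dat")

# CLIP image encoder and the trained personalization encoder
image_encoder_path = os.path.join(
    snapshot_download("orpatashnik/NestedAttentionEncoder", allow_patterns=["image_encoder/**"]),
    "image_encoder",
)
personalization_ckpt = hf_hub_download("orpatashnik/NestedAttentionEncoder", "personalization_encoder/pytorch_model.bin")
```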
## Usage

Refer to the inference notebook for an example. Key usage notes:
- The input image should be aligned and cropped.
- The special token `<person>` represents the personalized subject and **must appear exactly once** in the input prompt.
- The parameter `special_token_weight` corresponds to $\lambda$ in the paper, controlling the tradeoff between identity preservation and prompt adherence. Increasing this parameter improves identity preservation.
- The code supports multiple input images of the same subject. To enable this, set `multiple_images=True` and provide a list of images. For single-image usage, pass an image directly instead of a list (see the sketch below).
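A minimal single-subject sketch, condensed from `app.py` and the notebook flow (the input/output file names are placeholders; checkpoints are fetched as in the Setup section):

```python
import os
import torch
from PIL import Image
from diffusers import StableDiffusionXLPipeline
from huggingface_hub import hf_hub_download, snapshot_download

from nested_attention_pipeline import NestedAdapterInference, add_special_token_to_tokenizer

image_encoder_path = os.path.join(
    snapshot_download("orpatashnik/NestedAttentionEncoder", allow_patterns=["image_encoder/**"]),
    "image_encoder",
)
personalization_ckpt = hf_hub_download("orpatashnik/NestedAttentionEncoder", "personalization_encoder/pytorch_model.bin")

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
add_special_token_to_tokenizer(pipe, "<person>", "person")
ip_model = NestedAdapterInference(
    pipe, image_encoder_path, personalization_ckpt, 1024,
    vq_normalize_factor=2.0, device="cuda",
)

face = Image.open("face.jpg").resize((512, 512))  # placeholder path; image should already be aligned and cropped
placeholder_token_ids = ip_model.pipe.tokenizer.convert_tokens_to_ids(["<person>"])

images = ip_model.generate(
    pil_image=face,                      # or a list of images together with multiple_images=True
    prompt="a watercolor painting of a <person>",
    placeholder_token_ids=placeholder_token_ids,
    num_samples=2,
    num_inference_steps=30,
    guidance_scale=5.0,
    special_token_weight=2.0,            # lambda in the paper
    seed=42,
)
images[0].save("result.png")             # placeholder output path
```

Raising `special_token_weight` strengthens identity preservation at some cost in prompt adherence, as noted above.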
## Related Work

This repository builds upon [IP-Adapter](https://ip-adapter.github.io/).

## BibTeX

```bibtex
@inproceedings{patashnik2025nested,
    author = {Patashnik, Or and Gal, Rinon and Ostashev, Daniil and Tulyakov, Sergey and Aberman, Kfir and Cohen-Or, Daniel},
    title = {Nested Attention: Semantic-aware Attention Values for Concept Personalization},
    year = {2025},
    publisher = {Association for Computing Machinery},
    url = {https://doi.org/10.1145/3721238.3730634},
    booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
    articleno = {6},
    numpages = {12},
    series = {SIGGRAPH Conference Papers '25}
}
```
app.py
ADDED
@@ -0,0 +1,115 @@
import os
import torch
from diffusers import StableDiffusionXLPipeline
import gradio as gr
from huggingface_hub import hf_hub_download, snapshot_download
from nested_attention_pipeline import NestedAdapterInference, add_special_token_to_tokenizer
from utils import align_face
import dlib


# ----------------------
# Configuration (update paths as needed)
# ----------------------
SHAPE_PREDICTOR_PATH = hf_hub_download("orpatashnik/NestedAttentionEncoder", "shape_predictor_68_face_landmarks.dat")
FACE_DETECTOR_PATH = hf_hub_download("orpatashnik/NestedAttentionEncoder", "mmod_human_face_detector.dat")
base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
image_encoder_path = snapshot_download("orpatashnik/NestedAttentionEncoder", allow_patterns=["image_encoder/**"])
image_encoder_path = os.path.join(image_encoder_path, "image_encoder")
personalization_ckpt = hf_hub_download("orpatashnik/NestedAttentionEncoder", "personalization_encoder/pytorch_model.bin")
device = "cuda"

# Special token settings
placeholder_token = "<person>"
initializer_token = "person"

# ----------------------
# Load models
# ----------------------
pipe = StableDiffusionXLPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
)
add_special_token_to_tokenizer(pipe, placeholder_token, initializer_token)
ip_model = NestedAdapterInference(
    pipe,
    image_encoder_path,
    personalization_ckpt,
    1024,
    vq_normalize_factor=2.0,
    device=device
)

# Initialize face alignment predictor
predictor = dlib.shape_predictor(SHAPE_PREDICTOR_PATH)
detector = dlib.cnn_face_detection_model_v1(FACE_DETECTOR_PATH)

# Generation defaults
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
num_inference_steps = 30
guidance_scale = 5.0

# ----------------------
# Inference function with alignment
# ----------------------
def generate_images(img1, img2, img3, prompt, w, num_samples, seed):
    # Collect non-empty reference images
    refs = [img for img in (img1, img2, img3) if img is not None]
    if not refs:
        return []

    # Align directly on PIL
    aligned_refs = [align_face(img, predictor, detector) for img in refs]

    # Resize to model resolution
    pil_images = [aligned.resize((512, 512)) for aligned in aligned_refs]
    placeholder_token_ids = ip_model.pipe.tokenizer.convert_tokens_to_ids([placeholder_token])

    # Generate personalized samples
    results = ip_model.generate(
        pil_image=pil_images,
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_samples=num_samples,
        num_inference_steps=num_inference_steps,
        placeholder_token_ids=placeholder_token_ids,
        seed=seed if seed > 0 else None,
        guidance_scale=guidance_scale,
        multiple_images=True,
        special_token_weight=w
    )
    return results

# ----------------------
# Gradio UI
# ----------------------
with gr.Blocks() as demo:
    gr.Markdown("## Personalized Image Generation Demo")
    gr.Markdown(
        "Upload up to 3 reference images. "
        "Faces will be auto-aligned before personalization. Include the placeholder token (e.g., \\<person\\>) in your prompt, "
        "set token weight, and choose how many outputs you want."
    )
    with gr.Row():
        with gr.Column(scale=1):
            # Reference images
            with gr.Row():
                img1 = gr.Image(type="pil", label="Reference Image 1")
                img2 = gr.Image(type="pil", label="Reference Image 2 (optional)")
                img3 = gr.Image(type="pil", label="Reference Image 3 (optional)")
            prompt_input = gr.Textbox(label="Prompt", placeholder="e.g., an abstract pencil drawing of a <person>")
            w_input = gr.Slider(minimum=1.0, maximum=5.0, step=0.5, value=1.0, label="Special Token Weight (w)")
            num_samples_input = gr.Slider(minimum=1, maximum=6, step=1, value=4, label="Number of Images to Generate")
            seed_input = gr.Slider(minimum=-1, maximum=100000, step=1, value=-1, label="Random Seed (use -1 for random and up to 100000)")
            generate_button = gr.Button("Generate Images")
        with gr.Column(scale=1):
            output_gallery = gr.Gallery(label="Generated Images", columns=3)

    generate_button.click(
        fn=generate_images,
        inputs=[img1, img2, img3, prompt_input, w_input, num_samples_input, seed_input],
        outputs=output_gallery
    )

if __name__ == "__main__":
    demo.launch(share=True, debug=True)
nested_attention_pipeline.py
ADDED
@@ -0,0 +1,248 @@
import os
from typing import List

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

from nested_attention_processor import AttnProcessor, NestedAttnProcessor
from utils import get_generator

from resampler import Resampler


def add_special_token_to_tokenizer(
    pipe,
    placeholder_token,
    initializer_token
):
    num_added_tokens1 = pipe.tokenizer.add_tokens([placeholder_token])
    num_added_tokens2 = pipe.tokenizer_2.add_tokens([placeholder_token])
    if num_added_tokens1 != 1 or num_added_tokens2 != 1:
        raise ValueError("Failed to add placeholder token to tokenizer")

    token_ids1 = pipe.tokenizer.encode(initializer_token, add_special_tokens=False)
    token_ids2 = pipe.tokenizer_2.encode(initializer_token, add_special_tokens=False)
    if len(token_ids1) > 1 or len(token_ids2) > 1:
        raise ValueError("The initializer token must be a single token.")
    initializer_token_id1 = token_ids1[0]
    initializer_token_id2 = token_ids2[0]
    placeholder_token_ids1 = pipe.tokenizer.convert_tokens_to_ids([placeholder_token])
    placeholder_token_ids2 = pipe.tokenizer_2.convert_tokens_to_ids([placeholder_token])
    pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
    pipe.text_encoder_2.resize_token_embeddings(len(pipe.tokenizer_2))
    token_embeds1 = pipe.text_encoder.get_input_embeddings().weight.data
    token_embeds2 = pipe.text_encoder_2.get_input_embeddings().weight.data
    with torch.no_grad():
        for token_id in placeholder_token_ids1:
            token_embeds1[token_id] = token_embeds1[initializer_token_id1].clone()
        for token_id in placeholder_token_ids2:
            token_embeds2[token_id] = token_embeds2[initializer_token_id2].clone()


class NestedAdapterInference:
    def __init__(
        self,
        sd_pipe,
        image_encoder_path,
        adapter_ckpt,
        resampler_num_queries,
        vq_normalize_factor,
        device,
    ):
        self.device = device
        self.image_encoder_path = image_encoder_path
        self.adapter_ckpt = adapter_ckpt

        self.vq_normalize_factor = vq_normalize_factor

        self.pipe = sd_pipe.to(self.device)
        self.set_nested_adapter()

        # load image encoder
        self.image_encoder = CLIPVisionModelWithProjection.from_pretrained(
            self.image_encoder_path
        ).to(self.device, dtype=torch.float16)
        self.clip_image_processor = CLIPImageProcessor()

        # spatial features model
        self.qformer = Resampler(
            dim=self.pipe.unet.config.cross_attention_dim,
            depth=4,
            dim_head=64,
            heads=12,
            num_queries=resampler_num_queries,
            embedding_dim=self.image_encoder.config.hidden_size,
            output_dim=self.pipe.unet.config.cross_attention_dim,
            ff_mult=4,
        ).to(self.device, dtype=torch.float16)

        if adapter_ckpt is not None:
            self.load_nested_adapter()

    def set_nested_adapter(self):
        unet = self.pipe.unet
        attn_procs = {}
        for name in unet.attn_processors.keys():
            cross_attention_dim = (
                None
                if name.endswith("attn1.processor")
                else unet.config.cross_attention_dim
            )
            if name.startswith("mid_block"):
                hidden_size = unet.config.block_out_channels[-1]
            elif name.startswith("up_blocks"):
                block_id = int(name[len("up_blocks.")])
                hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
            elif name.startswith("down_blocks"):
                block_id = int(name[len("down_blocks.")])
                hidden_size = unet.config.block_out_channels[block_id]
            if cross_attention_dim is None:
                attn_procs[name] = AttnProcessor()
            else:
                attn_procs[name] = NestedAttnProcessor(
                    hidden_size=hidden_size,
                    cross_attention_dim=cross_attention_dim,
                    normalize_factor=self.vq_normalize_factor,
                ).to(self.device, dtype=torch.float16)
        unet.set_attn_processor(attn_procs)

    def load_nested_adapter(self):
        state_dict = {"adapter_modules": {}, "qformer": {}}
        f = torch.load(self.adapter_ckpt, map_location="cpu")
        for key in f.keys():
            if key.startswith("adapter_modules."):
                state_dict["adapter_modules"][key.replace("adapter_modules.", "")] = f[
                    key
                ]
            elif key.startswith("spatial_features_model."):
                state_dict["qformer"][key.replace("spatial_features_model.", "")] = f[
                    key
                ]
        self.qformer.load_state_dict(state_dict["qformer"])
        adapter_layers = torch.nn.ModuleList(self.pipe.unet.attn_processors.values())
        adapter_layers.load_state_dict(state_dict["adapter_modules"])

    @torch.inference_mode()
    def get_image_embeds(self, pil_image=None, clip_image_embeds=None):
        if isinstance(pil_image, Image.Image):
            pil_image = [pil_image]
        clip_image = self.clip_image_processor(
            images=pil_image, return_tensors="pt"
        ).pixel_values
        clip_image_embeds = self.image_encoder(
            clip_image.to(self.device, dtype=torch.float16)
        )
        spatial_clip_image_embeds = clip_image_embeds.last_hidden_state
        spatial_clip_image_embeds = spatial_clip_image_embeds[:, 1:]  # remove CLS token
        return spatial_clip_image_embeds

    def generate(
        self,
        pil_image=None,
        clip_image_embeds=None,
        prompt=None,
        placeholder_token_ids=None,
        negative_prompt=None,
        scale=1.0,
        num_samples=4,
        seed=None,
        guidance_scale=5.0,
        num_inference_steps=30,
        multiple_images=False,
        special_token_weight=1.0,
        **kwargs,
    ):
        if pil_image is not None:
            num_prompts = (
                1
                if isinstance(pil_image, Image.Image) or multiple_images
                else len(pil_image)
            )
        else:
            num_prompts = clip_image_embeds.size(0)

        if prompt is None:
            prompt = "best quality, high quality"
        if negative_prompt is None:
            negative_prompt = (
                "monochrome, lowres, bad anatomy, worst quality, low quality"
            )

        if not isinstance(prompt, List):
            prompt = [prompt] * num_prompts
        if not isinstance(negative_prompt, List):
            negative_prompt = [negative_prompt] * num_prompts

        text_input_ids = self.pipe.tokenizer(
            prompt,
            max_length=self.pipe.tokenizer.model_max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        ).input_ids
        special_token_indices = (text_input_ids == placeholder_token_ids[0]).nonzero()[
            :, 1
        ]

        spatial_clip_image_embeds = self.get_image_embeds(
            pil_image=pil_image, clip_image_embeds=clip_image_embeds
        )  # (bs, 256, 1280)

        with torch.no_grad():
            (
                prompt_embeds,
                negative_prompt_embeds,
                pooled_prompt_embeds,
                negative_pooled_prompt_embeds,
            ) = self.pipe.encode_prompt(
                prompt,
                num_images_per_prompt=num_samples,
                do_classifier_free_guidance=True,
                negative_prompt=negative_prompt,
            )

        special_token_indices = (text_input_ids == placeholder_token_ids[0]).nonzero()[
            :, 1
        ]

        with torch.no_grad():
            qformer_tokens_out = self.qformer(spatial_clip_image_embeds)

        if multiple_images:
            b, num_tokens, d = qformer_tokens_out.shape
            qformer_tokens_out = qformer_tokens_out.reshape(
                1, num_tokens * b, d
            )

        bs_embed, num_tokens, _ = qformer_tokens_out.shape

        qformer_tokens_out = qformer_tokens_out.repeat(1, num_samples, 1, 1)
        qformer_tokens_out = qformer_tokens_out.view(
            bs_embed * num_samples, num_tokens, -1
        )
        qformer_tokens_out = qformer_tokens_out.repeat_interleave(2, dim=0)

        cross_attention_kwargs = {
            "qformer_tokens_out": qformer_tokens_out,
            "special_token_indices": special_token_indices,
            "special_token_weight": special_token_weight,
            "inference_mode": True,
        }

        generator = get_generator(seed, self.device)

        images = self.pipe(
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
            guidance_scale=guidance_scale,
            num_inference_steps=num_inference_steps,
            generator=generator,
            cross_attention_kwargs=cross_attention_kwargs,
            **kwargs,
        ).images

        return images
nested_attention_processor.py
ADDED
@@ -0,0 +1,363 @@
# modified from https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def my_scaled_dot_product_attention(
    query,
    key,
    value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=False,
    scale=None,
    special_token_weight=1.0,
    special_token_indices=None,
) -> torch.Tensor:
    """
    Computes the scaled dot-product attention with additional control over specific tokens.

    This function is a re-implementation of the scaled dot-product attention mechanism,
    designed to return both the attention map and the output of the attention operation.
    It also provides additional control via a scalar that modifies the attention map
    for specific tokens.
    """
    L, S = query.size(-2), key.size(-2)
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    attn_bias = torch.zeros(L, S, dtype=query.dtype).cuda()
    if is_causal:
        assert attn_mask is None
        temp_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0)
        attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
        attn_bias.to(query.dtype)

    if attn_mask is not None:
        if attn_mask.dtype == torch.bool:
            attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
        else:
            attn_bias += attn_mask
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight += attn_bias
    if special_token_indices is not None and special_token_weight != 1.0:
        bs = attn_weight.shape[0]
        attn_weight[torch.arange(bs), :, :, special_token_indices] = torch.max(
            attn_weight[torch.arange(bs), :, :, special_token_indices],
            attn_weight[torch.arange(bs), :, :, special_token_indices]
            * special_token_weight,
        )

    attn_weight = torch.softmax(attn_weight, dim=-1)
    attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
    return attn_weight @ value, attn_weight


class AttnProcessor(torch.nn.Module):
    r"""
    Processor for implementing scaled dot-product attention.
    """

    def __init__(
        self,
        hidden_size=None,
        cross_attention_dim=None,
    ):
        super().__init__()
        if not hasattr(F, "scaled_dot_product_attention"):
            raise ImportError(
                "AttnProcessor requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
            )

    def __call__(
        self,
        attn,
        hidden_states,
        qformer_tokens_out=None,
        special_token_indices=None,
        inference_mode=None,
        encoder_hidden_states=None,
        attention_mask=None,
        temb=None,
        special_token_weight=None,
    ):
        residual = hidden_states

        if attn.spatial_norm is not None:
            hidden_states = attn.spatial_norm(hidden_states, temb)

        input_ndim = hidden_states.ndim

        if input_ndim == 4:
            batch_size, channel, height, width = hidden_states.shape
            hidden_states = hidden_states.view(
                batch_size, channel, height * width
            ).transpose(1, 2)

        batch_size, sequence_length, _ = (
            hidden_states.shape
            if encoder_hidden_states is None
            else encoder_hidden_states.shape
        )

        if attention_mask is not None:
            attention_mask = attn.prepare_attention_mask(
                attention_mask, sequence_length, batch_size
            )
            # scaled_dot_product_attention expects attention_mask shape to be
            # (batch, heads, source_length, target_length)
            attention_mask = attention_mask.view(
                batch_size, attn.heads, -1, attention_mask.shape[-1]
            )

        if attn.group_norm is not None:
            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(
                1, 2
            )

        query = attn.to_q(hidden_states)

        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        elif attn.norm_cross:
            encoder_hidden_states = attn.norm_encoder_hidden_states(
                encoder_hidden_states
            )

        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        inner_dim = key.shape[-1]
        head_dim = inner_dim // attn.heads

        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)

        # the output of sdp = (batch, num_heads, seq_len, head_dim)
        hidden_states = F.scaled_dot_product_attention(
            query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
        )

        hidden_states = hidden_states.transpose(1, 2).reshape(
            batch_size, -1, attn.heads * head_dim
        )
        hidden_states = hidden_states.to(query.dtype)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        if input_ndim == 4:
            hidden_states = hidden_states.transpose(-1, -2).reshape(
                batch_size, channel, height, width
            )

        if attn.residual_connection:
            hidden_states = hidden_states + residual

        hidden_states = hidden_states / attn.rescale_output_factor

        return hidden_states


class NestedAttnProcessor(torch.nn.Module):
    r"""
    Nested Attention processor for IP-Adapter for PyTorch 2.0.
    """

    def __init__(self, hidden_size, cross_attention_dim=None, normalize_factor=1.0):
        super().__init__()

        if not hasattr(F, "scaled_dot_product_attention"):
            raise ImportError(
                "NestedAttnProcessor requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
            )

        self.hidden_size = hidden_size
        self.cross_attention_dim = cross_attention_dim

        self.normalize_factor = normalize_factor

        self.nested_to_k = nn.Linear(
            cross_attention_dim or hidden_size, hidden_size, bias=False
        )
        self.nested_to_v = nn.Linear(
            cross_attention_dim or hidden_size, hidden_size, bias=False
        )

    def __call__(
        self,
        attn,
        hidden_states,
        qformer_tokens_out,
        special_token_indices,
        inference_mode=False,
        encoder_hidden_states=None,
        attention_mask=None,
        temb=None,
        special_token_weight=1.0,
    ):
        assert (
            special_token_indices.shape[0] > 0
        ), "special_token_indices should not be empty"

        # if inference mode is set to True, the code assumes that CFG is used and the first half
        # of the batch is used for the null prompt and the second half is used for the prompt

        residual = hidden_states

        if attn.spatial_norm is not None:
            hidden_states = attn.spatial_norm(hidden_states, temb)

        input_ndim = hidden_states.ndim
        bs = hidden_states.shape[0]

        if input_ndim == 4:
            bs, channel, height, width = hidden_states.shape
            hidden_states = hidden_states.view(bs, channel, height * width).transpose(
                1, 2
            )

        bs, sequence_length, _ = (
            hidden_states.shape
            if encoder_hidden_states is None
            else encoder_hidden_states.shape
        )

        if attention_mask is not None:
            attention_mask = attn.prepare_attention_mask(
                attention_mask, sequence_length, bs
            )
            # scaled_dot_product_attention expects attention_mask shape to be
            # (batch, heads, source_length, target_length)
            attention_mask = attention_mask.view(
                bs, attn.heads, -1, attention_mask.shape[-1]
            )

        if attn.group_norm is not None:
            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(
                1, 2
            )

        query = attn.to_q(hidden_states)

        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        else:
            if attn.norm_cross:
                encoder_hidden_states = attn.norm_encoder_hidden_states(
                    encoder_hidden_states
                )

        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        inner_dim = key.shape[-1]
        head_dim = inner_dim // attn.heads

        query = query.view(bs, -1, attn.heads, head_dim).transpose(1, 2)

        key = key.view(bs, -1, attn.heads, head_dim).transpose(1, 2)
        value = value.view(bs, -1, attn.heads, head_dim).transpose(1, 2)

        # nested attention
        nested_key = self.nested_to_k(qformer_tokens_out)
        nested_value = self.nested_to_v(qformer_tokens_out)

        nested_key = nested_key.view(bs, -1, attn.heads, head_dim).transpose(1, 2)
        nested_value = nested_value.view(bs, -1, attn.heads, head_dim).transpose(1, 2)

        nested_hidden_states = F.scaled_dot_product_attention(
            query,
            nested_key,
            nested_value,
            attn_mask=None,
            dropout_p=0.0,
            is_causal=False,
        )

        # normalize V_q
        textual_values_norms = torch.norm(
            value[torch.arange(bs), :, special_token_indices], dim=-1
        )
        nested_hidden_states = (
            torch.nn.functional.normalize(nested_hidden_states, p=2, dim=-1)
            * self.normalize_factor
        )
        nested_hidden_states = (
            textual_values_norms.view(bs, -1, 1, 1) * nested_hidden_states
        )

        # outer attention
        value_without_special_tokens = value.clone()
        if inference_mode:
            value_without_special_tokens[bs // 2 : bs, :, special_token_indices, :] = (
                0.0
            )
        else:
            value_without_special_tokens[
                torch.arange(bs), :, special_token_indices, :
            ] = 0.0
        hidden_states_without_special_tokens, attn_weight = (
            my_scaled_dot_product_attention(
                query,
                key,
                value_without_special_tokens,
                attn_mask=None,
                dropout_p=0.0,
                is_causal=False,
                special_token_weight=special_token_weight,
                special_token_indices=special_token_indices,
            )
        )

        # add the special token values
        if inference_mode:
            special_token_attn_weight = attn_weight[
                bs // 2 : bs, :, :, special_token_indices
            ]
        else:
            special_token_attn_weight = attn_weight[
                torch.arange(bs), :, :, special_token_indices
            ]
        if inference_mode:
            special_token_weighted_values = (
                special_token_attn_weight * nested_hidden_states[bs // 2 : bs]
            )
        else:
            special_token_weighted_values = (
                special_token_attn_weight.unsqueeze(-1) * nested_hidden_states
            )
        if inference_mode:
            hidden_states = hidden_states_without_special_tokens
            hidden_states[bs // 2 : bs] += special_token_weighted_values
        else:
            hidden_states = (
                hidden_states_without_special_tokens + special_token_weighted_values
            )

        # arrange hidden states
        hidden_states = hidden_states.transpose(1, 2).reshape(
            bs, -1, attn.heads * head_dim
        )
        hidden_states = hidden_states.to(query.dtype)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        if input_ndim == 4:
            hidden_states = hidden_states.transpose(-1, -2).reshape(
                bs, channel, height, width
            )

        if attn.residual_connection:
            hidden_states = hidden_states + residual

        hidden_states = hidden_states / attn.rescale_output_factor

        return hidden_states
resampler.py
ADDED
@@ -0,0 +1,169 @@
# modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py
# and https://github.com/lucidrains/imagen-pytorch/blob/main/imagen_pytorch/imagen_pytorch.py

import math

import torch
import torch.nn as nn
from einops import rearrange
from einops.layers.torch import Rearrange


# FFN
def FeedForward(dim, mult=4):
    inner_dim = int(dim * mult)
    return nn.Sequential(
        nn.LayerNorm(dim),
        nn.Linear(dim, inner_dim, bias=False),
        nn.GELU(),
        nn.Linear(inner_dim, dim, bias=False),
    )


def reshape_tensor(x, heads):
    bs, length, width = x.shape
    # (bs, length, width) --> (bs, length, n_heads, dim_per_head)
    x = x.view(bs, length, heads, -1)
    # (bs, length, n_heads, dim_per_head) --> (bs, n_heads, length, dim_per_head)
    x = x.transpose(1, 2)
    # make the layout contiguous, keeping (bs, n_heads, length, dim_per_head)
    x = x.reshape(bs, heads, length, -1)
    return x


class PerceiverAttention(nn.Module):
    def __init__(self, *, dim, dim_head=64, heads=8):
        super().__init__()
        self.scale = dim_head**-0.5
        self.dim_head = dim_head
        self.heads = heads
        inner_dim = dim_head * heads

        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

        self.to_q = nn.Linear(dim, inner_dim, bias=False)
        self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x, latents):
        """
        Args:
            x (torch.Tensor): image features
                shape (b, n1, D)
            latents (torch.Tensor): latent features
                shape (b, n2, D)
        """
        x = self.norm1(x)
        latents = self.norm2(latents)

        b, l, _ = latents.shape

        q = self.to_q(latents)
        kv_input = torch.cat((x, latents), dim=-2)
        k, v = self.to_kv(kv_input).chunk(2, dim=-1)

        q = reshape_tensor(q, self.heads)
        k = reshape_tensor(k, self.heads)
        v = reshape_tensor(v, self.heads)

        # attention
        scale = 1 / math.sqrt(math.sqrt(self.dim_head))
        weight = (q * scale) @ (k * scale).transpose(
            -2, -1
        )  # More stable with f16 than dividing afterwards
        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
        out = weight @ v

        out = out.permute(0, 2, 1, 3).reshape(b, l, -1)

        return self.to_out(out)


class Resampler(nn.Module):
    def __init__(
        self,
        dim=1024,
        depth=8,
        dim_head=64,
        heads=16,
        num_queries=8,
        embedding_dim=768,
        output_dim=1024,
        ff_mult=4,
        max_seq_len: int = 257,  # CLIP tokens + CLS token
        apply_pos_emb: bool = False,
        num_latents_mean_pooled: int = 0,  # number of latents derived from mean pooled representation of the sequence
    ):
        super().__init__()
        self.num_queries = num_queries

        self.pos_emb = (
            nn.Embedding(max_seq_len, embedding_dim) if apply_pos_emb else None
        )

        self.latents = nn.Parameter(torch.randn(1, num_queries, dim) / dim**0.5)

        self.proj_in = nn.Linear(embedding_dim, dim)

        self.proj_out = nn.Linear(dim, output_dim)
        self.norm_out = nn.LayerNorm(output_dim)

        self.to_latents_from_mean_pooled_seq = (
            nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, dim * num_latents_mean_pooled),
                Rearrange("b (n d) -> b n d", n=num_latents_mean_pooled),
            )
            if num_latents_mean_pooled > 0
            else None
        )

        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(
                nn.ModuleList(
                    [
                        PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
                        FeedForward(dim=dim, mult=ff_mult),
                    ]
                )
            )

    def forward(self, x):

        if self.pos_emb is not None:
            n, device = x.shape[1], x.device
            pos_emb = self.pos_emb(torch.arange(n, device=device))
            x = x + pos_emb

        latents = self.latents.repeat(x.size(0), 1, 1)

        x = self.proj_in(x)

        if self.to_latents_from_mean_pooled_seq:
            meanpooled_seq = masked_mean(
                x,
                dim=1,
                mask=torch.ones(x.shape[:2], device=x.device, dtype=torch.bool),
            )
            meanpooled_latents = self.to_latents_from_mean_pooled_seq(meanpooled_seq)
            latents = torch.cat((meanpooled_latents, latents), dim=-2)

        for attn, ff in self.layers:
            latents = attn(x, latents) + latents
            latents = ff(latents) + latents

        latents = self.proj_out(latents)
        return self.norm_out(latents)


def masked_mean(t, *, dim, mask=None):
    if mask is None:
        return t.mean(dim=dim)

    denom = mask.sum(dim=dim, keepdim=True)
    mask = rearrange(mask, "b n -> b n 1")
    masked_t = t.masked_fill(~mask, 0.0)

    return masked_t.sum(dim=dim) / denom.clamp(min=1e-5)
utils.py
ADDED
@@ -0,0 +1,128 @@
from PIL import Image
import torch
import numpy as np
import dlib
import scipy.ndimage


def image_grid(imgs, rows, cols):
    assert len(imgs) == rows * cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols * w, rows * h))
    grid_w, grid_h = grid.size

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid


def get_generator(seed, device):

    if seed is not None:
        if isinstance(seed, list):
            generator = [
                torch.Generator(device).manual_seed(seed_item) for seed_item in seed
            ]
        else:
            generator = torch.Generator(device).manual_seed(seed)
    else:
        generator = None

    return generator


def get_landmark_pil(pil_image, predictor, detector):
    """Get 68 facial landmarks as a NumPy array of shape (68, 2)."""
    img_np = np.array(pil_image.convert("RGB"))
    dets = detector(img_np, 1)
    if not dets:
        return None
    # Handle mmod or frontal detector output
    det = dets[0].rect if hasattr(dets[0], 'rect') else dets[0]
    shape = predictor(img_np, det)
    coords = [(pt.x, pt.y) for pt in shape.parts()]
    return np.array(coords)


def align_face(pil_image, predictor, detector):
    """Align a face from a PIL.Image, returning an aligned PIL.Image of size 512x512."""
    lm = get_landmark_pil(pil_image, predictor, detector)
    if lm is None:
        return pil_image
    # Define landmark regions
    lm_chin = lm[0: 17]  # left-right
    lm_eyebrow_left = lm[17: 22]  # left-right
    lm_eyebrow_right = lm[22: 27]  # left-right
    lm_nose = lm[27: 31]  # top-down
    lm_nostrils = lm[31: 36]  # top-down
    lm_eye_left = lm[36: 42]  # left-clockwise
    lm_eye_right = lm[42: 48]  # left-clockwise
    lm_mouth_outer = lm[48: 60]  # left-clockwise
    lm_mouth_inner = lm[60: 68]  # left-clockwise

    eye_left = np.mean(lm_eye_left, axis=0)
    eye_right = np.mean(lm_eye_right, axis=0)
    eye_avg = (eye_left + eye_right) * 0.5
    eye_to_eye = eye_right - eye_left
    mouth_left = lm_mouth_outer[0]
    mouth_right = lm_mouth_outer[6]
    mouth_avg = (mouth_left + mouth_right) * 0.5
    eye_to_mouth = mouth_avg - eye_avg

    # Compute oriented crop
    x = eye_to_eye - np.flipud(eye_to_mouth) * [-1, 1]
    x /= np.hypot(*x)
    x *= max(np.hypot(*eye_to_eye) * 2.0, np.hypot(*eye_to_mouth) * 1.8)
    y = np.flipud(x) * [-1, 1]
    c = eye_avg + eye_to_mouth * 0.1
    quad = np.stack([c - x - y, c - x + y, c + x + y, c + x - y])
    qsize = np.hypot(*x) * 2

    # Prepare image
    img = pil_image.convert("RGB")
    transform_size = 512
    output_size = 512
    enable_padding = True

    # Shrink image for speed
    shrink = int(np.floor(qsize / output_size * 0.5))
    if shrink > 1:
        rsize = (int(np.rint(float(img.size[0]) / shrink)), int(np.rint(float(img.size[1]) / shrink)))
        img = img.resize(rsize, Image.Resampling.LANCZOS)
        quad /= shrink
        qsize /= shrink

    # Crop around face
    border = max(int(np.rint(qsize * 0.1)), 3)
    crop = (int(np.floor(min(quad[:, 0]))), int(np.floor(min(quad[:, 1]))), int(np.ceil(max(quad[:, 0]))),
            int(np.ceil(max(quad[:, 1]))))
    crop = (max(crop[0] - border, 0), max(crop[1] - border, 0), min(crop[2] + border, img.size[0]),
            min(crop[3] + border, img.size[1]))
    if crop[2] - crop[0] < img.size[0] or crop[3] - crop[1] < img.size[1]:
        img = img.crop(crop)
        quad -= crop[0:2]

    # Pad
    pad = (int(np.floor(min(quad[:, 0]))), int(np.floor(min(quad[:, 1]))), int(np.ceil(max(quad[:, 0]))),
           int(np.ceil(max(quad[:, 1]))))
    pad = (max(-pad[0] + border, 0), max(-pad[1] + border, 0), max(pad[2] - img.size[0] + border, 0),
           max(pad[3] - img.size[1] + border, 0))
    if enable_padding and max(pad) > border - 4:
        pad = np.maximum(pad, int(np.rint(qsize * 0.3)))
        img = np.pad(np.float32(img), ((pad[1], pad[3]), (pad[0], pad[2]), (0, 0)), 'reflect')
        h, w, _ = img.shape
        y, x, _ = np.ogrid[:h, :w, :1]
        mask = np.maximum(1.0 - np.minimum(np.float32(x) / pad[0], np.float32(w - 1 - x) / pad[2]),
                          1.0 - np.minimum(np.float32(y) / pad[1], np.float32(h - 1 - y) / pad[3]))
        blur = qsize * 0.02
        img += (scipy.ndimage.gaussian_filter(img, [blur, blur, 0]) - img) * np.clip(mask * 3.0 + 1.0, 0.0, 1.0)
        img += (np.median(img, axis=(0, 1)) - img) * np.clip(mask, 0.0, 1.0)
        img = Image.fromarray(np.uint8(np.clip(np.rint(img), 0, 255)), 'RGB')
        quad += pad[:2]

    # Transform image
    img = img.transform((transform_size, transform_size), Image.QUAD, (quad + 0.5).flatten(), Image.BILINEAR)
    if output_size < transform_size:
        img = img.resize((output_size, output_size), Image.Resampling.LANCZOS)

    # Resize to final output
    return img