Instruction-tuned Diffusion Models

non-profit

https://github.com/huggingface/instruction-tuned-sd

AI & ML interests

Instruction tuning, Diffusion models

Recent Activity

sayakpaul authored a paper 3 days ago

DynEval: Holistic Evaluations of T2I Generative Models in the Wild

sayakpaul authored a paper 13 days ago

Posterior Augmented Flow Matching

sayakpaul authored a paper 13 days ago

4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation

sayakpaul

authored a paper 3 days ago

DynEval: Holistic Evaluations of T2I Generative Models in the Wild

Paper • 2607.11199 • Published 10 days ago

sayakpaul

authored 3 papers 13 days ago

submitted a paper to Daily Papers 13 days ago

Flash-BoN: Instant Drafts for Inference-Time Scaling in Diffusion Models

Paper • 2607.04461 • Published 18 days ago • 11

adirik

authored 2 papers 4 months ago

ReasonX: MLLM-Guided Intrinsic Image Decomposition

Paper • 2512.04222 • Published Dec 3, 2025 • 1

PRISM: A Unified Framework for Photorealistic Reconstruction and Intrinsic Scene Modeling

Paper • 2504.14219 • Published Apr 19, 2025 • 2

sayakpaul

authored 2 papers 5 months ago

Fine-Grained Perturbation Guidance via Attention Head Selection

Paper • 2506.10978 • Published Jun 12, 2025 • 25

From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Paper • 2602.21778 • Published Feb 25 • 16

sayakpaul

submitted a paper to Daily Papers 5 months ago

From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors

Paper • 2602.21778 • Published Feb 25 • 16

sayakpaul

authored a paper 5 months ago

TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

Paper • 2602.15449 • Published Feb 17 • 7

sayakpaul

authored a paper 10 months ago

Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Paper • 2510.05091 • Published Oct 6, 2025 • 20

sayakpaul

posted an update 12 months ago

Post

3222

Fast LoRA inference for Flux with Diffusers and PEFT 🚨

There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their adoption.

In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:

1. torch.compile
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping for avoiding recompilation during swapping new LoRAs 🤯

We tested our recipe with Flux.1-Dev on both an H100 and an RTX 4090, and achieved at least a *2x speedup* on each GPU. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served. So, we hope this will be beneficial to the community 🤗

Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.
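To give some intuition for item 4, here is a toy sketch in plain Python lists. It is not the Diffusers or PEFT implementation. It assumes that a compiled model is specialized to tensor shapes, so one way to make swaps cheap is to zero-pad every LoRA to a shared maximum rank. The helper names (`matmul`, `pad_lora`) and the example adapters are hypothetical.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def pad_lora(lora_a, lora_b, target_rank):
    """Zero-pad a LoRA pair to target_rank (hypothetical helper).

    lora_a has shape (rank, in_features); lora_b has shape (out_features, rank).
    The padded rows/columns are zeros, so the update lora_b @ lora_a is unchanged.
    """
    rank = len(lora_a)
    if rank > target_rank:
        raise ValueError("adapter rank exceeds target_rank")
    extra = target_rank - rank
    in_features = len(lora_a[0])
    padded_a = [row[:] for row in lora_a] + [[0.0] * in_features for _ in range(extra)]
    padded_b = [row[:] + [0.0] * extra for row in lora_b]
    return padded_a, padded_b


# Two made-up adapters for the same layer: one of rank 1, one of rank 2.
small_a, small_b = [[1.0, 2.0, 3.0]], [[0.5], [-1.0]]
large_a, large_b = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]], [[1.0, 1.0], [0.0, 3.0]]

TARGET_RANK = 2  # assumed maximum rank across all adapters to be served
padded_small = pad_lora(small_a, small_b, TARGET_RANK)
padded_large = pad_lora(large_a, large_b, TARGET_RANK)
```

After padding, both adapters have identical shapes, so swapping one for the other becomes an in-place copy of values rather than a change in the model's structure.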

Learn the details and the full code here:
https://huggingface.co/blog/lora-fast

3 replies

sayakpaul

posted an update about 1 year ago

Post

3041

Diffusers supports a good variety of quantization backends. It can be challenging to navigate through them, given the complex nature of diffusion pipelines in general.

So, @derekl35 set out to write a comprehensive guide that puts users in the front seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.

Give it a go here:
https://lnkd.in/gf8Pi4-2

2 replies

sayakpaul

posted an update about 1 year ago

Post

1960

Despite the emergence of models that combine LLM and DiT architectures for T2I synthesis, the design of this combination remains severely understudied.

This was done long ago and got into CVPR25 -- super excited to finally share it now, along with the data and code ♥️

We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.

Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.

Despite the compelling results of such models, deep fusion itself remains largely unexplored, which is what we set out to address in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and explore what makes a "good deep fusion" between the two for T2I.

We explore several key questions in the work, such as:

Q1: How should we do attention? We considered several alternatives. PixArt-Alpha like attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?

Based on the findings from these experiments, we arrive at FuseDiT, which makes the following choices on top of the base architecture:

* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly

We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While not the best model, it's encouraging to develop something in a guided manner using open datasets.

To know more (code, models, all are available), please check out the paper:
https://lnkd.in/gg6qyqZX.

sayakpaul

authored 2 papers about 1 year ago

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Paper • 2505.10046 • Published May 15, 2025 • 9

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

Paper • 2504.16080 • Published Apr 22, 2025 • 15

sayakpaul

authored a paper over 1 year ago

SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Paper • 2503.09641 • Published Mar 12, 2025 • 42

sayakpaul

posted an update over 1 year ago

Post

4014

Inference-time scaling meets Flux.1-Dev (and others) 🔥

Presenting a simple re-implementation of "Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps" by Ma et al.

I did the simplest random search strategy, but results can potentially be improved with better-guided search methods.

Supports Gemini 2 Flash & Qwen2.5 as verifiers for "LLMGrading" 🤗

The steps are simple:

For each round:

1> Start by sampling 2 starting noises with different seeds.
2> Score the generations w.r.t. a metric.
3> Obtain the best generation from the current round.

If you have more compute budget, go to the next search round. Scale the noise pool (2 ** search_round) and repeat 1 - 3.
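The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code from the repository: `toy_generate` and `toy_score` are hypothetical stand-ins for the diffusion pipeline and the verifier, and the function name, arguments, and returned fields are assumptions made for the example.

```python
import random


def toy_generate(seed):
    """Hypothetical stand-in for a pipeline call: one 'image' per noise seed."""
    return random.Random(seed).gauss(0.0, 1.0)


def toy_score(generation):
    """Hypothetical stand-in for a verifier: higher is better, best at 1.0."""
    return -abs(generation - 1.0)


def random_search(generate, score, search_rounds, base_seed=0):
    """Run random search over starting noises, doubling the pool each round."""
    rng = random.Random(base_seed)
    results = []
    for search_round in range(1, search_rounds + 1):
        pool_size = 2 ** search_round  # 2 noises in round 1, then 4, 8, ...
        seeds = [rng.randrange(2 ** 32) for _ in range(pool_size)]
        # Score every generation in the pool against the metric.
        scored = [(score(generate(seed)), seed) for seed in seeds]
        best_score, best_seed = max(scored, key=lambda item: item[0])
        results.append({
            "round": search_round,
            "pool_size": pool_size,
            "best_seed": best_seed,
            "best_score": best_score,
        })
    return results


for entry in random_search(toy_generate, toy_score, search_rounds=3):
    print(entry)
```

Each round here reports only its own best generation; whether to also carry the best result across rounds is a design choice this sketch leaves open.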

This constitutes the random search method as done in the paper by Google DeepMind.

Code, more results, and a bunch of other stuff are in the repository. Check it out here: https://github.com/sayakpaul/tt-scale-flux/ 🤗

sayakpaul

posted an update over 1 year ago

Post

2220

We have been cooking a couple of fine-tuning runs on CogVideoX with finetrainers, smol datasets, and LoRA to generate cool video effects like crushing, dissolving, etc.

We are also releasing a LoRA extraction utility from a fully fine-tuned checkpoint. I know that kind of stuff has existed since eternity, but the quality on video models was nothing short of spectacular. Below are some links:

* Models and datasets: finetrainers
* finetrainers: https://github.com/a-r-r-o-w/finetrainers
* LoRA extraction: https://github.com/huggingface/diffusers/blob/main/scripts/extract_lora_from_model.py

1 reply

Team members 2