""" Curated HuggingFace Diffusers optimization knowledge base Manually extracted and organized for reliable prompt injection """ OPTIMIZATION_GUIDE = """ # DIFFUSERS OPTIMIZATION TECHNIQUES ## Memory Optimization Techniques ### 1. Model CPU Offloading Use `enable_model_cpu_offload()` to move models between GPU and CPU automatically: ```python pipe.enable_model_cpu_offload() ``` - Saves significant VRAM by keeping only active models on GPU - Automatic management, no manual intervention needed - Compatible with all pipelines ### 2. Sequential CPU Offloading Use `enable_sequential_cpu_offload()` for more aggressive memory saving: ```python pipe.enable_sequential_cpu_offload() ``` - More memory efficient than model offloading - Moves models to CPU after each forward pass - Best for very limited VRAM scenarios ### 3. Attention Slicing Use `enable_attention_slicing()` to reduce memory during attention computation: ```python pipe.enable_attention_slicing() # or specify slice size pipe.enable_attention_slicing("max") # maximum slicing pipe.enable_attention_slicing(1) # slice_size = 1 ``` - Trades compute time for memory - Most effective for high-resolution images - Can be combined with other techniques ### 4. VAE Slicing Use `enable_vae_slicing()` for large batch processing: ```python pipe.enable_vae_slicing() ``` - Decodes images one at a time instead of all at once - Essential for batch sizes > 4 - Minimal performance impact on single images ### 5. VAE Tiling Use `enable_vae_tiling()` for high-resolution image generation: ```python pipe.enable_vae_tiling() ``` - Enables 4K+ image generation on 8GB VRAM - Splits images into overlapping tiles - Automatically disabled for 512x512 or smaller images ### 6. Memory Efficient Attention (xFormers) Use `enable_xformers_memory_efficient_attention()` if xFormers is installed: ```python pipe.enable_xformers_memory_efficient_attention() ``` - Significantly reduces memory usage and improves speed - Requires xformers library installation - Compatible with most models ## Performance Optimization Techniques ### 1. Half Precision (FP16/BF16) Use lower precision for better memory and speed: ```python # FP16 (widely supported) pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) # BF16 (better numerical stability, newer hardware) pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16) ``` - FP16: Halves memory usage, widely supported - BF16: Better numerical stability, requires newer GPUs - Essential for most optimization scenarios ### 2. Torch Compile (PyTorch 2.0+) Use `torch.compile()` for significant speed improvements: ```python pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) # For some models, compile VAE too: pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True) ``` - 5-50% speed improvement - Requires PyTorch 2.0+ - First run is slower due to compilation ### 3. 
### 3. Fast Schedulers
Use faster schedulers for fewer steps:
```python
from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler

# LMS Scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

# UniPC Scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```

## Hardware-Specific Optimizations

### NVIDIA GPU Optimizations
```python
# Let cuDNN benchmark and select the fastest convolution algorithms
torch.backends.cudnn.benchmark = True

# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 for RTX 30/40 series
```

### Apple Silicon (MPS) Optimizations
```python
# Use the MPS device when available
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)

# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16  # better than float16 on Apple Silicon

# Attention slicing often helps on MPS
pipe.enable_attention_slicing()
```

### CPU Optimizations
```python
# Use float32 for CPU
torch_dtype = torch.float32

# Reduce attention memory pressure
pipe.enable_attention_slicing()
```

## Model-Specific Guidelines

### FLUX Models
- Do NOT use the guidance_scale parameter (not needed for the distilled FLUX.1-schnell)
- Use 4-8 inference steps maximum
- BF16 dtype recommended
- Enable attention slicing for memory optimization

### Stable Diffusion XL
- Enable attention slicing for high resolutions
- Use the refiner model sparingly to save memory
- Consider VAE tiling for >1024px images

### Stable Diffusion 1.5/2.1
- Very memory-efficient base models
- Can often run without optimizations on 8GB+ VRAM
- Enable VAE slicing for batch processing

## Memory Usage Estimation
- FLUX.1: ~24GB for full precision, ~12GB for FP16
- SDXL: ~7GB for FP16, ~14GB for FP32
- SD 1.5: ~2GB for FP16, ~4GB for FP32

## Optimization Combinations by VRAM

### 24GB+ VRAM (High-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

### 12-24GB VRAM (Mid-range)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# No .to("cuda") here: enable_model_cpu_offload() manages device placement itself
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
```

### 8-12GB VRAM (Entry-level)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()
```

### <8GB VRAM (Low-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
```
"""


def get_optimization_guide():
    """Return the curated optimization guide."""
    return OPTIMIZATION_GUIDE


if __name__ == "__main__":
    print("Optimization guide loaded successfully!")
    print(f"Guide length: {len(OPTIMIZATION_GUIDE)} characters")
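

# A hedged sketch of the "prompt injection" use mentioned in the module
# docstring; not part of the original module. `build_optimization_prompt` is a
# hypothetical helper that prepends the guide to a user's question so a
# downstream LLM answering diffusers questions has this context available.
def build_optimization_prompt(user_request: str) -> str:
    """Prepend the curated guide to a user request before sending it to an LLM."""
    return f"{get_optimization_guide()}\n\n# USER REQUEST\n{user_request}"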