""" | |
Curated HuggingFace Diffusers optimization knowledge base | |
Manually extracted and organized for reliable prompt injection | |
""" | |
OPTIMIZATION_GUIDE = """
# DIFFUSERS OPTIMIZATION TECHNIQUES
## Memory Optimization Techniques
### 1. Model CPU Offloading
Use `enable_model_cpu_offload()` to move models between GPU and CPU automatically:
```python
pipe.enable_model_cpu_offload()
```
- Saves significant VRAM by keeping only the active model component on the GPU
- Automatic management, no manual intervention needed
- Supported by most standard pipelines (requires the `accelerate` library); a fuller setup is sketched below
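A minimal end-to-end sketch (the checkpoint ID is just an example; `pipe.to("cuda")` is not needed because the offload hook places components on the GPU on demand):
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example checkpoint
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # keep the pipeline off the GPU; do not call pipe.to("cuda")
image = pipe("an astronaut riding a horse on the moon").images[0]
```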
### 2. Sequential CPU Offloading
Use `enable_sequential_cpu_offload()` for more aggressive memory savings:
```python
pipe.enable_sequential_cpu_offload()
```
- More memory efficient than model offloading, but noticeably slower
- Moves each submodule to the GPU only for its forward pass, then back to CPU
- Best for very limited VRAM scenarios
### 3. Attention Slicing
Use `enable_attention_slicing()` to reduce memory during attention computation:
```python
pipe.enable_attention_slicing()
# or specify slice size
pipe.enable_attention_slicing("max")  # maximum slicing
pipe.enable_attention_slicing(1)      # slice_size = 1
```
- Trades compute time for memory
- Most effective for high-resolution images
- Can be combined with other techniques
### 4. VAE Slicing
Use `enable_vae_slicing()` for large batch processing:
```python
pipe.enable_vae_slicing()
```
- Decodes images one at a time instead of all at once
- Essential for batch sizes > 4
- Minimal performance impact on single images
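A hedged sketch of batched generation with VAE slicing (the prompt list and model ID are illustrative); the batch is denoised together but decoded one image at a time, so the decode step no longer spikes VRAM:
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_vae_slicing()

# Six prompts in one call; the VAE decodes the results sequentially
prompts = ["a watercolor lighthouse"] * 3 + ["an ink sketch of a harbor"] * 3
images = pipe(prompts).images
```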
### 5. VAE Tiling
Use `enable_vae_tiling()` for high-resolution image generation:
```python
pipe.enable_vae_tiling()
```
- Enables 4K+ image generation on 8GB VRAM
- Splits images into overlapping tiles
- Automatically disabled for 512x512 or smaller images
### 6. Memory Efficient Attention (xFormers)
Use `enable_xformers_memory_efficient_attention()` if xFormers is installed:
```python
pipe.enable_xformers_memory_efficient_attention()
```
- Significantly reduces memory usage and improves speed
- Requires the xformers library to be installed
- Compatible with most models; on PyTorch 2.0+ the built-in scaled-dot-product attention offers similar benefits by default
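Since xFormers is an optional dependency, a defensive pattern like this sketch keeps the pipeline usable when it is missing:
```python
try:
    pipe.enable_xformers_memory_efficient_attention()
except (ImportError, ModuleNotFoundError):
    # xFormers is not installed; fall back to the default attention backend
    print("xFormers unavailable, using the default attention implementation")
```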
## Performance Optimization Techniques
### 1. Half Precision (FP16/BF16)
Use lower precision for better memory usage and speed:
```python
# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
- FP16: halves memory usage, widely supported
- BF16: better numerical stability, requires Ampere-or-newer GPUs
- Essential for most optimization scenarios
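One way to choose the dtype at runtime (a sketch; `model_id` is a placeholder) is to prefer BF16 only when the GPU actually supports it:
```python
import torch
from diffusers import DiffusionPipeline

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    torch_dtype = torch.bfloat16  # Ampere (RTX 30) and newer
elif torch.cuda.is_available():
    torch_dtype = torch.float16   # older CUDA GPUs
else:
    torch_dtype = torch.float32   # CPU fallback

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch_dtype)
```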
### 2. Torch Compile (PyTorch 2.0+)
Use `torch.compile()` for significant speed improvements:
```python
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# For some models, compile the VAE decoder too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
```
- 5-50% speed improvement
- Requires PyTorch 2.0+
- First run is slower due to compilation
### 3. Fast Schedulers
Use faster schedulers for fewer steps:
```python
from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler
# LMS scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
# UniPC scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```
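A hedged end-to-end sketch with UniPC (`model_id` is a placeholder; the exact step count is model- and prompt-dependent, but 15-25 steps is a common range):
```python
import torch
from diffusers import DiffusionPipeline, UniPCMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Far fewer steps than the default thanks to the faster solver
image = pipe("a castle floating in the clouds", num_inference_steps=20).images[0]
```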
## Hardware-Specific Optimizations
### NVIDIA GPU Optimizations
```python
# Let cuDNN autotune convolution kernels
torch.backends.cudnn.benchmark = True
# Allow TF32 on Ampere+ Tensor Cores for faster matmuls
torch.backends.cuda.matmul.allow_tf32 = True
# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 for RTX 30/40 series
```
### Apple Silicon (MPS) Optimizations
```python
# Use the MPS device when available
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)
# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16  # better than float16 on Apple Silicon
# Attention slicing often helps on MPS
pipe.enable_attention_slicing()
```
### CPU Optimizations
```python
# Use float32 for CPU
torch_dtype = torch.float32
# Attention slicing lowers peak memory (it does not speed up CPU inference)
pipe.enable_attention_slicing()
```
## Model-Specific Guidelines
### FLUX Models
- FLUX.1-schnell is guidance-distilled: guidance_scale has no effect, so leave it at the default
- Use 4-8 inference steps for FLUX.1-schnell (FLUX.1-dev needs more)
- BF16 dtype recommended
- Enable attention slicing for memory savings; see the sketch below
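A sketch for FLUX.1-schnell following the guidelines above (the CPU offloading call is an assumption for consumer GPUs, not a FLUX requirement):
```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # the 12B transformer rarely fits alongside the text encoders
pipe.enable_attention_slicing()

# schnell is distilled for very few steps and ignores classifier-free guidance
image = pipe("a tiny robot watering a bonsai tree", num_inference_steps=4).images[0]
```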
### Stable Diffusion XL
- Enable attention slicing for high resolutions
- Use the refiner model sparingly to save memory
- Consider VAE tiling for >1024px images
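A hedged SDXL sketch for resolutions beyond 1024px, combining the points above:
```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()
pipe.enable_vae_tiling()  # keeps VAE memory flat past 1024x1024

image = pipe("ultra-detailed city skyline at dusk", height=1536, width=1536).images[0]
```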
### Stable Diffusion 1.5/2.1
- Very memory-efficient base models
- Can often run without extra optimizations on 8GB+ VRAM
- Enable VAE slicing for batch processing
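A sketch for an SD 1.5-class checkpoint (the model ID is an example): FP16 alone is usually enough on 8GB+ cards, with VAE slicing added only for batches:
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_vae_slicing()  # only matters when generating several images at once

images = pipe("an isometric pixel-art diner", num_images_per_prompt=6).images
```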
## Memory Usage Estimation
- FLUX.1: ~24GB in BF16/FP16 for the 12B transformer alone (plus text encoders); offloading is usually required on consumer GPUs
- SDXL: ~7GB for FP16, ~14GB for FP32
- SD 1.5: ~2GB for FP16, ~4GB for FP32
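These figures are rough; when in doubt, query the device and pick a preset from the section below (a sketch using only standard torch calls):
```python
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 24:
        preset = "high-end"
    elif total_gb >= 12:
        preset = "mid-range"
    elif total_gb >= 8:
        preset = "entry-level"
    else:
        preset = "low-end"
    print(f"{total_gb:.1f} GB VRAM detected, suggested preset: {preset}")
```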
## Optimization Combinations by VRAM
### 24GB+ VRAM (High-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
### 12-24GB VRAM (Mid-range)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Do not call pipe.to("cuda") here; CPU offloading manages device placement
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
```
### 8-12GB VRAM (Entry-level)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()
```
### <8GB VRAM (Low-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
```
""" | |
def get_optimization_guide(): | |
"""Return the curated optimization guide.""" | |
return OPTIMIZATION_GUIDE | |
if __name__ == "__main__": | |
print("Optimization guide loaded successfully!") | |
print(f"Guide length: {len(OPTIMIZATION_GUIDE)} characters") |