# auto-diffuser-config/optimization_knowledge.py
"""
Curated HuggingFace Diffusers optimization knowledge base
Manually extracted and organized for reliable prompt injection
"""
OPTIMIZATION_GUIDE = """
# DIFFUSERS OPTIMIZATION TECHNIQUES
## Memory Optimization Techniques
### 1. Model CPU Offloading
Use `enable_model_cpu_offload()` to move models between GPU and CPU automatically:
```python
pipe.enable_model_cpu_offload()
```
- Saves significant VRAM by keeping only active models on GPU
- Automatic management, no manual intervention needed
- Compatible with most standard Diffusers pipelines (see the sketch below)
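For example, a minimal end-to-end sketch (the checkpoint ID and prompt are illustrative placeholders, not part of the original guide):
```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint; any UNet-based text-to-image model works the same way
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

# Note: do NOT call pipe.to("cuda") first; offloading manages device placement itself
pipe.enable_model_cpu_offload()

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```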
### 2. Sequential CPU Offloading
Use `enable_sequential_cpu_offload()` for more aggressive memory saving:
```python
pipe.enable_sequential_cpu_offload()
```
- More memory efficient than model offloading, but considerably slower
- Offloads at the submodule level, keeping each component on the GPU only for its own forward pass
- Best for very limited VRAM scenarios
### 3. Attention Slicing
Use `enable_attention_slicing()` to reduce memory during attention computation:
```python
pipe.enable_attention_slicing()
# or specify slice size
pipe.enable_attention_slicing("max") # maximum slicing
pipe.enable_attention_slicing(1) # slice_size = 1
```
- Trades compute time for memory
- Most effective for high-resolution images
- Can be combined with other techniques
### 4. VAE Slicing
Use `enable_vae_slicing()` for large batch processing:
```python
pipe.enable_vae_slicing()
```
- Decodes images one at a time instead of all at once
- Essential for batch sizes > 4
- Minimal performance impact on single images
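A short sketch of the batch case, assuming `pipe` is an already loaded text-to-image pipeline (the prompts are placeholders):
```python
pipe.enable_vae_slicing()

# Eight prompts in one call: the VAE now decodes the latents one image at a time
prompts = ["a watercolor landscape, golden hour"] * 8
images = pipe(prompts, num_inference_steps=25).images
```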
### 5. VAE Tiling
Use `enable_vae_tiling()` for high-resolution image generation:
```python
pipe.enable_vae_tiling()
```
- Enables 4K+ image generation on 8GB VRAM
- Splits images into overlapping tiles
- Automatically disabled for 512x512 or smaller images
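A sketch of the high-resolution case, assuming `pipe` is an SDXL-class pipeline already on the GPU (prompt and resolution are placeholders):
```python
pipe.enable_vae_tiling()

# The VAE decodes this 2048x2048 image tile by tile instead of in one pass
image = pipe("a detailed cityscape at dusk", width=2048, height=2048).images[0]
```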
### 6. Memory Efficient Attention (xFormers)
Use `enable_xformers_memory_efficient_attention()` if xFormers is installed:
```python
pipe.enable_xformers_memory_efficient_attention()
```
- Significantly reduces memory usage and improves speed
- Requires xformers library installation
- Compatible with most models
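Because the dependency is optional (and PyTorch 2.x already ships a comparable scaled-dot-product attention by default), a defensive sketch that only enables it when the library is present:
```python
try:
    pipe.enable_xformers_memory_efficient_attention()
except ImportError:
    # xformers not installed; fall back to the default attention implementation
    pass
```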
## Performance Optimization Techniques
### 1. Half Precision (FP16/BF16)
Use lower precision for better memory and speed:
```python
# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
- FP16: Halves memory usage, widely supported
- BF16: Better numerical stability, requires newer GPUs
- Essential for most optimization scenarios
### 2. Torch Compile (PyTorch 2.0+)
Use `torch.compile()` for significant speed improvements:
```python
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# For some models, compile VAE too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
```
- 5-50% speed improvement
- Requires PyTorch 2.0+
- First run is slower due to compilation
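A sketch of the warm-up pattern (prompts are placeholders): the first call absorbs the compilation cost so later calls run at full speed.
```python
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# First call triggers compilation and is slow
_ = pipe("warm-up prompt", num_inference_steps=2)

# Subsequent calls benefit from the compiled UNet
image = pipe("a mountain lake at sunrise", num_inference_steps=30).images[0]
```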
### 3. Fast Schedulers
Use faster schedulers for fewer steps:
```python
from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler
# LMS Scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
# UniPC Scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```
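A usage sketch, assuming `pipe` is already loaded; UniPC usually gives good results in roughly 20 steps instead of the default 50:
```python
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("an oil painting of a lighthouse", num_inference_steps=20).images[0]
```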
## Hardware-Specific Optimizations
### NVIDIA GPU Optimizations
```python
# Let cuDNN auto-tune convolution kernels for the current input sizes
torch.backends.cudnn.benchmark = True
# Allow TF32 on Tensor Cores (Ampere and newer)
torch.backends.cuda.matmul.allow_tf32 = True
# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 for RTX 30/40 series
```
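Whether BF16 is actually supported can also be checked at runtime; a small sketch (assuming `torch` is imported) that falls back to FP16 on pre-Ampere GPUs:
```python
# BF16 needs compute capability >= 8.0 (Ampere and newer, i.e. RTX 30/40 series)
if torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (8, 0):
    torch_dtype = torch.bfloat16
else:
    torch_dtype = torch.float16
```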
### Apple Silicon (MPS) Optimizations
```python
# Use MPS device
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)
# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16 # Better than float16 on Apple Silicon
# Attention slicing often helps on MPS
pipe.enable_attention_slicing()
```
### CPU Optimizations
```python
# Use float32 for CPU
torch_dtype = torch.float32
# Enable optimized attention
pipe.enable_attention_slicing()
```
## Model-Specific Guidelines
### FLUX Models
- Do NOT pass guidance_scale for FLUX.1-schnell (it is ignored); FLUX.1-dev does use it
- Use 4-8 inference steps for FLUX.1-schnell (FLUX.1-dev typically needs more)
- BF16 dtype recommended
- Enable attention slicing for memory optimization (see the sketch below)
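A minimal FLUX.1-schnell sketch; the checkpoint ID and prompt are assumptions, not part of the original guide:
```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # FLUX is large; offloading usually helps

# schnell is timestep-distilled: 4 steps, no guidance_scale needed
image = pipe("a cabin in a snowy forest", num_inference_steps=4).images[0]
```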
### Stable Diffusion XL
- Enable attention slicing for high resolutions
- Use refiner model sparingly to save memory
- Consider VAE tiling for >1024px images
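A sketch combining those points for a >1024px render (checkpoint ID, prompt, and resolution are placeholders):
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()
pipe.enable_vae_tiling()

image = pipe("a panoramic alpine valley", width=1536, height=1536).images[0]
```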
### Stable Diffusion 1.5/2.1
- Very memory efficient base models
- Can often run without optimizations on 8GB+ VRAM
- Enable VAE slicing for batch processing
## Memory Usage Estimation
- FLUX.1: ~24GB for FP16/BF16 weights, ~48GB for full precision (12B-parameter transformer)
- SDXL: ~7GB for FP16, ~14GB for FP32
- SD 1.5: ~2GB for FP16, ~4GB for FP32
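These figures follow from parameter count times bytes per parameter; a rough back-of-the-envelope sketch (the parameter counts below are approximations, not from the guide):
```python
def weight_memory_gb(num_params, bytes_per_param):
    # Approximate weight memory in GiB: parameters x bytes per element
    return num_params * bytes_per_param / 1024**3

print(weight_memory_gb(12e9, 2))   # FLUX.1 transformer, FP16 -> ~22 GiB
print(weight_memory_gb(3.5e9, 2))  # SDXL (UNet + text encoders), FP16 -> ~6.5 GiB
print(weight_memory_gb(1e9, 2))    # SD 1.5, FP16 -> ~1.9 GiB
```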
## Optimization Combinations by VRAM
### 24GB+ VRAM (High-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
### 12-24GB VRAM (Mid-range)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
```
### 8-12GB VRAM (Entry-level)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()
```
### <8GB VRAM (Low-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
```
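These tiers can also be selected programmatically; a hypothetical helper (not part of the original guide, and assuming a UNet-based pipeline for the torch.compile branch) sketching that dispatch:
```python
import torch
from diffusers import DiffusionPipeline

def load_optimized(model_id):
    # Hypothetical helper: pick an optimization tier from detected VRAM
    vram_gb = 0.0
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3

    if vram_gb >= 24:
        pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
        pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    elif vram_gb >= 12:
        pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
        pipe.enable_model_cpu_offload()
    else:
        pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
        pipe.enable_sequential_cpu_offload()
        pipe.enable_attention_slicing("max")
        pipe.enable_vae_slicing()
        pipe.enable_vae_tiling()
    return pipe
```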
"""
def get_optimization_guide():
    """Return the curated optimization guide."""
    return OPTIMIZATION_GUIDE


if __name__ == "__main__":
    print("Optimization guide loaded successfully!")
    print(f"Guide length: {len(OPTIMIZATION_GUIDE)} characters")