"""
Curated HuggingFace Diffusers optimization knowledge base
Manually extracted and organized for reliable prompt injection
"""

OPTIMIZATION_GUIDE = """
# DIFFUSERS OPTIMIZATION TECHNIQUES

## Memory Optimization Techniques

### 1. Model CPU Offloading
Use `enable_model_cpu_offload()` to move models between GPU and CPU automatically:
```python
pipe.enable_model_cpu_offload()
```
- Saves significant VRAM by keeping only the active model on the GPU
- Automatic management, no manual intervention needed
- Requires the accelerate library; works with most pipelines

### 2. Sequential CPU Offloading  
Use `enable_sequential_cpu_offload()` for more aggressive memory saving:
```python
pipe.enable_sequential_cpu_offload()
```
- More memory efficient than model offloading, but noticeably slower
- Streams weights to the GPU submodule by submodule and moves them back to CPU after each forward pass
- Best for very limited VRAM scenarios

### 3. Attention Slicing
Use `enable_attention_slicing()` to reduce memory during attention computation:
```python
pipe.enable_attention_slicing()
# or specify slice size
pipe.enable_attention_slicing("max")  # maximum slicing
pipe.enable_attention_slicing(1)      # slice_size = 1
```
- Trades compute time for memory
- Most effective for high-resolution images
- Can be combined with other techniques

### 4. VAE Slicing
Use `enable_vae_slicing()` for large batch processing:
```python
pipe.enable_vae_slicing()
```
- Decodes images one at a time instead of all at once
- Essential for batch sizes > 4
- Minimal performance impact on single images

### 5. VAE Tiling
Use `enable_vae_tiling()` for high-resolution image generation:
```python
pipe.enable_vae_tiling()
```
- Enables 4K+ image generation on 8GB VRAM
- Splits images into overlapping tiles
- Automatically disabled for 512x512 or smaller images

### 6. Memory Efficient Attention (xFormers)
Use `enable_xformers_memory_efficient_attention()` if xFormers is installed:
```python
pipe.enable_xformers_memory_efficient_attention()
```
- Significantly reduces memory usage and improves speed
- Requires xformers library installation
- Compatible with most models

## Performance Optimization Techniques

### 1. Half Precision (FP16/BF16)
Load weights in lower precision to reduce memory use and improve speed:
```python
# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
- FP16: Halves memory usage, widely supported
- BF16: Better numerical stability, requires Ampere-class (RTX 30 series, A100) or newer GPUs
- Essential for most optimization scenarios

### 2. Torch Compile (PyTorch 2.0+)
Use `torch.compile()` for significant speed improvements:
```python
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# For some models, compile VAE too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
```
- 5-50% speed improvement
- Requires PyTorch 2.0+
- First run is slower due to compilation

### 3. Fast Schedulers
Use faster schedulers for fewer steps:
```python
from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler

# LMS Scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

# UniPC Scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```
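
With a multistep scheduler such as UniPC, far fewer steps are usually enough; as a rough rule of thumb, something like `pipe(prompt, num_inference_steps=20)` often approaches the quality of ~50 steps with the default scheduler.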

## Hardware-Specific Optimizations

### NVIDIA GPU Optimizations
```python
# Let cuDNN autotune convolution kernels
torch.backends.cudnn.benchmark = True

# Allow TF32 Tensor Cores for matmuls (Ampere and newer)
torch.backends.cuda.matmul.allow_tf32 = True

# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 on Ampere (RTX 30 series) and newer
```

### Apple Silicon (MPS) Optimizations
```python
# Use MPS device
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)

# Recommended dtype for Apple Silicon
torch_dtype = torch.bfloat16  # Better than float16 on Apple Silicon

# Attention slicing often helps on MPS
pipe.enable_attention_slicing()
```

### CPU Optimizations
```python
# Use float32 for CPU
torch_dtype = torch.float32

# Attention slicing also lowers peak memory on CPU
pipe.enable_attention_slicing()
```

## Model-Specific Guidelines

### FLUX Models
- Do NOT rely on guidance_scale with the distilled FLUX.1-schnell variant (it is ignored; FLUX.1-dev does use a guidance value)
- Use 4-8 inference steps at most
- BF16 dtype recommended
- Enable attention slicing for memory optimization (see the sketch below)
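
A minimal sketch of these guidelines, assuming the `black-forest-labs/FLUX.1-schnell` checkpoint and a recent diffusers release that provides `FluxPipeline`:
```python
import torch
from diffusers import FluxPipeline

# Distilled schnell variant in BF16
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()   # keep VRAM usage manageable
pipe.enable_attention_slicing()

image = pipe(
    "a misty forest at dawn",
    num_inference_steps=4,  # schnell is designed for very few steps
).images[0]
image.save("flux_schnell.png")
```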

### Stable Diffusion XL
- Enable attention slicing for high resolutions
- Use refiner model sparingly to save memory
- Consider VAE tiling for >1024px images, as in the sketch below
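
A hedged setup sketch, assuming the `stabilityai/stable-diffusion-xl-base-1.0` checkpoint:
```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# High-resolution generation: slice attention and tile the VAE decode
pipe.enable_attention_slicing()
pipe.enable_vae_tiling()

image = pipe("a castle at sunset", height=1536, width=1536).images[0]
```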

### Stable Diffusion 1.5/2.1
- Very memory efficient base models
- Can often run without optimizations on 8GB+ VRAM
- Enable VAE slicing for batch processing, as in the sketch below
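
A short batch-processing sketch, assuming the `runwayml/stable-diffusion-v1-5` checkpoint (or one of its mirrors):
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Decode latents one image at a time to cap peak VRAM during batch generation
pipe.enable_vae_slicing()

prompts = ["a red bicycle in the rain"] * 8
images = pipe(prompts, num_inference_steps=25).images
```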

## Memory Usage Estimation
- FLUX.1: ~24GB for the 12B transformer in FP16/BF16 (roughly double in FP32), plus several GB for the text encoders
- SDXL: ~7GB for FP16, ~14GB for FP32
- SD 1.5: ~2GB for FP16, ~4GB for FP32
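
These figures follow from parameter count times bytes per parameter. A rough, weight-only rule of thumb (a sketch that ignores activations, the VAE, and framework overhead):
```python
def estimate_weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # Weight-only estimate: FP16/BF16 use 2 bytes per parameter, FP32 uses 4
    return params_billion * 1e9 * bytes_per_param / 1024**3

# e.g. a 12B-parameter FLUX transformer in BF16 -> roughly 22 GB of weights
print(f"{estimate_weight_memory_gb(12, 2):.1f} GB")
```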

## Optimization Combinations by VRAM

### 24GB+ VRAM (High-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

### 12-24GB VRAM (Mid-range)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
```

### 8-12GB VRAM (Entry-level)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()
```

### <8GB VRAM (Low-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
```
"""


def get_optimization_guide():
    """Return the curated optimization guide."""
    return OPTIMIZATION_GUIDE


if __name__ == "__main__":
    print("Optimization guide loaded successfully!")
    print(f"Guide length: {len(OPTIMIZATION_GUIDE)} characters")