---
title: Tp 1 Dgx Node Estimator
emoji: ⚙️
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.34.0
app_file: app.py
pinned: false
license: mit
short_description: for NVIDIA TRDC estimation
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# 🚀 H100 Node & CUDA Version Estimator

An interactive Gradio application for estimating H100 GPU node requirements and CUDA version recommendations based on your machine learning workload specifications.

## Features

- **Comprehensive Model Support**: Supports 40+ models including:
  - **Text Models**: LLaMA-2/3/3.1, Nemotron-4, Qwen2/2.5
  - **Vision-Language**: Qwen-VL, Qwen2-VL, NVIDIA VILA series
  - **Audio Models**: Qwen-Audio, Qwen2-Audio
  - **Physics-ML**: NVIDIA PhysicsNeMo (FNO, PINN, GraphCast, SFNO)
- **Smart Estimation**: Calculates memory requirements including model weights, KV cache, and operational overhead
- **Multimodal Support**: Handles vision-language and audio-language models with specialized memory calculations
- **Use Case Optimization**: Provides different estimates for inference, training, and fine-tuning scenarios
- **Precision Support**: Handles different precision formats (FP32, FP16, BF16, INT8, INT4)
- **Interactive Visualizations**: Memory breakdown charts and node utilization graphs
- **CUDA Recommendations**: Suggests optimal CUDA versions and driver requirements

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd tp-1-dgx-node-estimator
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

1. Run the application:
```bash
python app.py
```

2. Open your browser and navigate to `http://localhost:7860`

3. Configure your parameters:
   - **Model**: Select from the supported model families (LLaMA, Nemotron, Qwen2/2.5, Qwen-VL/Audio, VILA, PhysicsNeMo)
   - **Input Tokens**: Number of input tokens per request
   - **Output Tokens**: Number of output tokens per request
   - **Batch Size**: Number of concurrent requests
   - **Use Case**: Choose between inference, training, or fine-tuning
   - **Precision**: Select model precision/quantization level

4. Click "💡 Estimate Requirements" to get your recommendations

## Key Calculations

### Memory Estimation
- **Model Memory**: Base model weights adjusted for precision
- **KV Cache**: Calculated based on sequence length and model architecture
- **Overhead**: Use-case specific multipliers (see the sketch after this list):
  - Inference: 1.2x (20% overhead)
  - Training: 3.0x (gradients + optimizer states)
  - Fine-tuning: 2.5x (moderate overhead)
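
A minimal sketch of how these pieces combine, assuming illustrative architecture numbers (layer count, hidden size) that would normally come from the model config; the constants and function names here are for illustration, not the app's actual API:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}
OVERHEAD = {"inference": 1.2, "training": 3.0, "fine-tuning": 2.5}

def estimate_memory_gb(params_billion, precision, use_case,
                       num_layers, hidden_size, seq_len, batch_size):
    """Rough total memory in GB: weights + KV cache, scaled by use-case overhead."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]  # 1e9 params x bytes/param ~= GB
    # KV cache: 2 tensors (K and V) x layers x hidden x sequence x batch x bytes/element
    kv_gb = (2 * num_layers * hidden_size * seq_len * batch_size
             * BYTES_PER_PARAM[precision]) / 1e9
    return (weights_gb + kv_gb) * OVERHEAD[use_case]

# LLaMA-3-8B, FP16 inference, 2048 in + 512 out tokens, batch 1 -> ~21 GB
print(estimate_memory_gb(8, "fp16", "inference",
                         num_layers=32, hidden_size=4096,
                         seq_len=2048 + 512, batch_size=1))
```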

### Node Calculation
- **H100 Node**: 8 × H100 GPUs per node = 640GB HBM3 total, of which ~576GB per node is treated as usable
- **Model Parallelism**: Models too large for a single GPU are assumed to be sharded across GPUs and, when necessary, across nodes
- **Memory Efficiency**: Estimated memory is distributed evenly across the nodes in use (see the sketch below)
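
Given a total memory estimate, the node count reduces to a ceiling division against usable per-node memory; this is a minimal sketch under those assumptions, not the app's exact code:

```python
import math

USABLE_GB_PER_NODE = 576  # 8 x 80 GB H100 = 640 GB raw; ~90% treated as usable

def estimate_nodes(total_memory_gb: float) -> int:
    """Smallest whole number of 8-GPU H100 nodes that fits the estimate."""
    return max(1, math.ceil(total_memory_gb / USABLE_GB_PER_NODE))

print(estimate_nodes(21.0))   # LLaMA-3-8B inference example above -> 1
print(estimate_nodes(700.0))  # a hypothetical 700 GB workload -> 2
```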

## Example Scenarios

| Model | Tokens (In/Out) | Batch Size | Use Case | Precision | Estimated Nodes |
|-------|----------------|------------|----------|-----------|----------------|
| LLaMA-3-8B | 2048/512 | 1 | Inference | FP16 | 1 |
| LLaMA-3-70B | 4096/1024 | 4 | Inference | FP16 | 1 |
| Qwen2.5-72B | 8192/2048 | 2 | Fine-tuning | BF16 | 1 |
| Nemotron-4-340B | 2048/1024 | 1 | Inference | INT8 | 1-2 |
| Qwen2-VL-7B | 1024/256 | 1 | Inference | FP16 | 1 |
| VILA-1.5-13B | 2048/512 | 2 | Inference | BF16 | 1 |
| Qwen2-Audio-7B | 1024/256 | 1 | Inference | FP16 | 1 |
| PhysicsNeMo-FNO-Large | 512/128 | 8 | Training | FP32 | 1 |
| PhysicsNeMo-GraphCast-Medium | 1024/256 | 4 | Training | FP16 | 1 |
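
To sanity-check one row with the memory math above: LLaMA-3-70B in FP16 needs about 140 GB for weights (70B × 2 bytes); a naive KV cache for batch 4 at 5,120 tokens (ignoring grouped-query attention) adds roughly 54 GB, and the 1.2× inference overhead brings the total to about 233 GB, comfortably within one node's 576 GB of usable memory.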

## CUDA Recommendations

The application provides tailored CUDA version recommendations:

- **Optimal**: CUDA 12.4 with cuDNN 8.9+
- **Recommended**: CUDA 12.1 or newer with cuDNN 8.7+
- **Minimum**: CUDA 11.8 with cuDNN 8.5+
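
One way these tiers could be encoded is as simple lookup data; the structure below is an assumed illustration (including the Linux driver branches, which follow NVIDIA's CUDA release notes), not the app's actual data model:

```python
# Assumed encoding of the recommendation tiers above; driver branches are illustrative.
CUDA_TIERS = {
    "optimal":     {"cuda": "12.4",  "cudnn": "8.9+", "min_driver": "R550"},
    "recommended": {"cuda": "12.1+", "cudnn": "8.7+", "min_driver": "R530"},
    "minimum":     {"cuda": "11.8",  "cudnn": "8.5+", "min_driver": "R520"},
}

for tier, req in CUDA_TIERS.items():
    print(f"{tier:>11}: CUDA {req['cuda']}, cuDNN {req['cudnn']}, driver {req['min_driver']}")
```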

## Output Features

### 📊 Detailed Analysis
- Complete memory breakdown
- Parameter counts and model specifications
- Step-by-step calculation explanation

### 🔧 CUDA Recommendations
- Version compatibility matrix
- Driver requirements
- Compute capability information

### 📈 Memory Utilization
- Visual memory breakdown (pie chart)
- Node utilization distribution (bar chart)
- Efficiency metrics

## Technical Details

### Supported Models
#### Text Models
- **LLaMA**: 2-7B, 2-13B, 2-70B, 3-8B, 3-70B, 3.1-8B, 3.1-70B, 3.1-405B
- **Nemotron**: 4-15B, 4-340B
- **Qwen2**: 0.5B, 1.5B, 7B, 72B
- **Qwen2.5**: 0.5B, 1.5B, 7B, 14B, 32B, 72B

#### Vision-Language Models
- **Qwen-VL**: Base, Chat, Plus, Max variants
- **Qwen2-VL**: 2B, 7B, 72B
- **NVIDIA VILA**: 1.5-3B, 1.5-8B, 1.5-13B, 1.5-40B

#### Audio Models
- **Qwen-Audio**: Base, Chat variants
- **Qwen2-Audio**: 7B

#### Physics-ML Models (NVIDIA PhysicsNeMo)
- **Fourier Neural Operators (FNO)**: Small (1M), Medium (10M), Large (50M)
- **Physics-Informed Neural Networks (PINN)**: Small (0.5M), Medium (5M), Large (20M)
- **GraphCast**: Small (50M), Medium (200M), Large (1B) - for weather/climate modeling
- **Spherical FNO (SFNO)**: Small (25M), Medium (100M), Large (500M) - for global simulations

### Precision Impact
- **FP32**: Full precision (4 bytes per parameter)
- **FP16/BF16**: Half precision (2 bytes per parameter)
- **INT8**: 8-bit quantization (1 byte per parameter)
- **INT4**: 4-bit quantization (0.5 bytes per parameter)
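
For example, LLaMA-3.1-405B needs roughly 1,620 GB for weights alone in FP32 (405B × 4 bytes) but only about 203 GB in INT4, the difference between spanning three nodes and fitting on one.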

### Multimodal Considerations
- **Vision Models**: Process images as token sequences (typically 256-1024 tokens per image)
- **Audio Models**: Handle audio segments with frame-based tokenization
- **Memory Overhead**: Additional memory for vision/audio encoders and cross-modal attention
- **Token Estimation**: Fold multimodal inputs into the token count before estimating memory (see the sketch below)
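
One way to account for this is to convert images and audio into an effective token count and reuse the text-model memory math; the per-image and per-second rates below are illustrative assumptions, and the helper is hypothetical rather than the app's API:

```python
# Hypothetical helper: fold multimodal inputs into one effective token count.
IMAGE_TOKENS = 576            # assumed encoder output, within the 256-1024 range above
AUDIO_TOKENS_PER_SECOND = 25  # assumed frame-based tokenization rate

def effective_input_tokens(text_tokens: int, num_images: int = 0,
                           audio_seconds: float = 0.0) -> int:
    """Text tokens plus the token-equivalents of attached images/audio."""
    return (text_tokens
            + num_images * IMAGE_TOKENS
            + int(audio_seconds * AUDIO_TOKENS_PER_SECOND))

print(effective_input_tokens(1024, num_images=2))      # 1024 + 2*576 = 2176
print(effective_input_tokens(1024, audio_seconds=30))  # 1024 + 750 = 1774
```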

### PhysicsNeMo Considerations
- **Grid-Based Data**: Physics models work with spatial/temporal grids rather than text tokens (see the sketch after this list)
- **Batch Training**: Physics-ML models typically require larger batch sizes for stable training
- **Memory Patterns**: Different from LLMs - less KV cache, more gradient memory for PDE constraints
- **Precision Requirements**: Many physics simulations require FP32 for numerical stability
- **Use Cases**: 
  - **FNO**: Solving PDEs on regular grids (fluid dynamics, heat transfer)
  - **PINN**: Physics-informed training with PDE constraints
  - **GraphCast**: Weather prediction and climate modeling
  - **SFNO**: Global atmospheric and oceanic simulations
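
Because inputs are grids, a back-of-envelope activation estimate replaces the KV-cache term; this sketch and its example shapes are assumptions for illustration:

```python
def grid_activation_gb(batch: int, channels: int, height: int, width: int,
                       depth: int = 1, bytes_per_elem: int = 4) -> float:
    """Memory for one activation tensor on a spatial grid, in GB (FP32 by default)."""
    return batch * channels * height * width * depth * bytes_per_elem / 1e9

# Batch of 8 on a 720 x 1440 lat/lon grid with 64 channels (GraphCast-like scale)
print(grid_activation_gb(8, 64, 720, 1440))  # ~2.12 GB per activation tensor
```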

## Limitations

- Estimates are approximate and may vary based on:
  - Specific model implementation details
  - Framework overhead (PyTorch, TensorFlow, etc.)
  - Hardware configuration
  - Network topology for multi-node setups

## Contributing

Feel free to submit issues and enhancement requests!

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Notes

- **Node Configuration**: Each H100 node contains 8 × H100 GPUs (640GB total memory)
- For production deployments, consider adding a 10-20% buffer to estimates
- Network bandwidth and storage requirements are not included in calculations
- Estimates assume optimal memory layout and efficient implementations
- Multi-node setups require high-speed interconnects (e.g., InfiniBand between nodes; NVLink/NVSwitch within a node) for optimal performance