app4 (#66)

Files changed:
- dist/index.html (+19 -7)
- src/index.html (+19 -7)

dist/index.html (CHANGED)
@@ -2728,18 +2728,23 @@

         <h3>Training Frameworks</h3>
         <div>
-            <a href="https://github.com/
-            <p>
+            <a href="https://github.com/huggingface/nanotron"><strong>Nanotron</strong></a>
+            <p>Our framework for training large language models featuring various parallelism strategies</p>
         </div>
-
+
         <div>
             <a href="https://github.com/NVIDIA/Megatron-LM"><strong>Megatron-LM</strong></a>
-            <p>NVIDIA's framework for training large language models
+            <p>NVIDIA's framework for training large language models featuring various parallelism strategies.</p>
         </div>

         <div>
             <a href="https://www.deepspeed.ai/"><strong>DeepSpeed</strong></a>
-            <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism
+            <p>Microsoft's deep learning optimization library featuring ZeRO optimization stages and various parallelism strategies.</p>
+        </div>
+
+        <div>
+            <a href="https://github.com/facebookresearch/fairscale/tree/main"><strong>FairScale</strong></a>
+            <p>PyTorch extension library for large-scale training, offering various parallelism and optimization techniques.</p>
         </div>

         <div>

@@ -2932,7 +2937,7 @@

         <div>
             <a href="https://www.thonking.ai/"><strong>thonking.ai</strong></a>
-            <p>Some of Horace He's blogposts
+            <p>Some of Horace He's blogposts - Making GPUs go BRRR..</p>
         </div>

         <div>

@@ -3546,12 +3551,19 @@
             <li>Gradients = Parameters → <d-math>num\_layers \cdot 16h^2</d-math></li>
         </ul>

-        <p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time
+        <p>During backward pass, these gradients are communicated in buckets (default 25MB). The communication time to all-reduce each bucket is:</p>

         <d-math block>
         t_{comm} = t_{comm\_bucket} = \frac{bucket\_size \cdot 2(DP-1)}{DP \cdot peak\_bw}
         </d-math>

+        <div class="note-box">
+            <p class="note-box-title">📝 Note</p>
+            <div class="note-box-content">
+                <p>For bandwidth calculations, we use the bus bandwidth formulas from the <a href="https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#summary">NCCL documentation</a>. These formulas account for the specific communication patterns when calculating effective bandwidth between GPUs.</p>
+            </div>
+        </div>
+
         <p>The computation time for backward pass is:</p>

         <d-math block>
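The last hunk above introduces the all-reduce cost model for gradient buckets. As a quick sanity check (not part of this commit), the following minimal Python sketch plugs illustrative numbers into that formula; the DP size and peak bandwidth are assumptions, not values taken from the diff.

# Minimal sketch of the t_comm formula from the hunk above.
# Ring all-reduce cost model: each rank moves 2*(DP-1)/DP of the bucket.

def allreduce_bucket_time(bucket_bytes: float, dp: int, peak_bw: float) -> float:
    """t_comm = bucket_size * 2*(DP-1) / (DP * peak_bw), in seconds."""
    return bucket_bytes * 2 * (dp - 1) / (dp * peak_bw)

# Assumed values: 25 MB bucket (the DDP default mentioned in the diff),
# DP = 8 ranks, 450e9 B/s peak bus bandwidth (illustrative only).
t = allreduce_bucket_time(25e6, dp=8, peak_bw=450e9)
print(f"{t * 1e6:.0f} us per bucket")  # ~97 us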
src/index.html (CHANGED)
(Same changes as dist/index.html above; the two files carry identical hunks at the same line numbers.)
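For context on the "default 25MB" buckets mentioned in the new paragraph: in PyTorch this corresponds to DistributedDataParallel's bucket_cap_mb argument, whose default is 25. The setup below is an illustrative sketch, assuming a torchrun launch; the model is a placeholder.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # assumes env vars set by torchrun
model = torch.nn.Linear(4096, 4096).cuda()
# During backward, DDP all-reduces gradients in ~25 MB buckets,
# overlapping communication with the remaining backward compute.
ddp_model = DDP(model, bucket_cap_mb=25)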