diff --git "a/dist/index.html" "b/dist/index.html" --- "a/dist/index.html" +++ "b/dist/index.html" @@ -58,24 +58,15 @@ - - -

The Ultra-Scale Playbook:
Training LLMs on GPU Clusters

-
+

We ran over 4,000 scaling experiments on up to 512 GPUs and measured throughput (size of markers) and GPU utilization (color of markers). Note that both are normalized per model size in this visualization.

-
-
-
-
-
-
@@ -85,12 +76,10 @@
Thousands of GPUs humming in perfect harmony. That's what it takes to train today's most powerful AI models – a symphony of computing power that until recently was the exclusive domain of elite research labs. Open source has transformed this landscape, but not completely. Yes, you can download the latest Llama or DeepSeek models. Yes, you can read their technical and experiment reports. But the most challenging part – the training code, the knowledge and techniques necessary to coordinate GPUs to train these massive systems – remains shrouded in complexity and scattered across a series of disconnected papers and often private codebases.

-
-

+

This open source book is here to change that. Starting from the basics, we'll walk you through the knowledge necessary to scale the training of large language models (LLMs) from one GPU to tens, hundreds, and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.

-

As the size of the clusters used to train these models has grown, various techniques, such as data parallelism, tensor parallelism, pipeline parallelism, and context parallelism, as well as ZeRO and kernel fusion, have been invented to make sure that GPUs are highly utilized at all times. This significantly reduces training time and makes the most efficient use of this expensive hardware. These distributed training techniques are not only important for building initial models but have also become essential for fine-tuning large models on specialized data, which often produces the best results. In this book, we'll progressively go over all of these techniques – from the simplest to the most refined ones – while maintaining a single storyline to help you understand where each method comes from.

@@ -106,7 +95,7 @@
-
+
Memory usage breakdown
@@ -2122,7 +2111,7 @@

To run the kernel, you will also need some host code, which is executed on the CPU/host and takes care of allocating memory and loading both the data and the kernel code:

-
+
// Host code
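The book's full listing is cut off by this hunk, but as a rough sketch of what such host code typically looks like, the program below allocates device buffers, copies the inputs to the GPU, launches a simple element-wise addition kernel, and copies the result back. The kernel name vector_add and the buffer sizes are illustrative assumptions, not the book's actual code:

// Minimal sketch of CUDA host code (illustrative; not the book's listing).
// Assumes a hypothetical element-wise addition kernel named vector_add.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;               // number of elements (assumed)
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host (CPU) buffers
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device (GPU) buffers
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    // Copy the inputs from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel: enough blocks of 256 threads to cover n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    // Copy the result back to the host and spot-check it
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);       // expect 3.0

    // Free device and host memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}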