
Sayak Paul

sayakpaul

AI & ML interests

Diffusion models, representation learning

Recent Activity

replied to a-r-r-o-w's post about 20 hours ago
You've probably implemented the 3-loop matrix multiplication many times as an ML practitioner, but the naive implementation is terrible for GPU performance. Modern GPUs achieve peak performance through careful memory access patterns and by minimizing scheduling overhead.

In naive matmul (MxK . KxN), the computation happens in tiles, both for the output matrix and for how you read chunks from the input matrices. Each thread-block processes one output tile by loading the corresponding tiles from the inputs (for the sum-reduction across the K dimension), performing the computation, then terminating. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it gets assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the thread-block launch cost every time a new tile is computed.

Persistent matmul changes this approach. Instead of launching thread-blocks to compute some output tiles, computing the results on the SMs in parallel, and repeating until all output tiles are done, you launch only as many thread-blocks as there are SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, looping through multiple tiles sequentially, so each persistent thread-block may handle several output tiles. The key benefit is reduced thread-block launch latency.

This persistence strategy, combined with other optimizations such as coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, and ping-pong scheduling, helps achieve peak performance. More on this in the future!

Code snippet for testing: https://gist.github.com/a-r-r-o-w/28339b442d164084506c0967029968a8

(Bonus: since I've wanted to learn Manim for a while, this was a great opportunity to make a visualization of naive vs. persistent matmul. Enjoy ✨)
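A minimal sketch of the persistent pattern described in the post, assuming a plain fp32 CUDA kernel with 32x32 block-tiles (illustrative code written for this summary, not the linked gist; the names persistent_matmul and TILE are made up here, and real kernels add the tensor-core and tiling tricks the post lists):

#include <cuda_runtime.h>
#include <cstdio>

#define TILE 32  // edge length of one square block-tile of C

__global__ void persistent_matmul(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    const int tilesM = (M + TILE - 1) / TILE;
    const int tilesN = (N + TILE - 1) / TILE;
    const int numTiles = tilesM * tilesN;

    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    // Persistent loop: the block stays alive and grid-strides over output
    // tiles instead of terminating after computing a single tile.
    for (int tile = blockIdx.x; tile < numTiles; tile += gridDim.x) {
        const int row = (tile / tilesN) * TILE + threadIdx.y;  // row in C
        const int col = (tile % tilesN) * TILE + threadIdx.x;  // col in C

        float acc = 0.0f;
        // Sum-reduction over K in TILE-wide chunks staged through shared memory.
        for (int k0 = 0; k0 < K; k0 += TILE) {
            As[threadIdx.y][threadIdx.x] = (row < M && k0 + threadIdx.x < K)
                ? A[row * K + k0 + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (k0 + threadIdx.y < K && col < N)
                ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N) C[row * N + col] = acc;
    }
}

int main() {
    const int M = 1024, N = 1024, K = 1024;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * M * K);
    cudaMalloc(&B, sizeof(float) * K * N);
    cudaMalloc(&C, sizeof(float) * M * N);
    cudaMemset(A, 0, sizeof(float) * M * K);  // zeroed inputs: this only
    cudaMemset(B, 0, sizeof(float) * K * N);  // demonstrates the launch pattern

    // The persistent launch: one thread-block per SM (production kernels may
    // launch a small multiple of this for better latency hiding).
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    persistent_matmul<<<numSMs, dim3(TILE, TILE)>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

The naive variant falls out of the same kernel: launch tilesM * tilesN thread-blocks and drop the tile loop so each block computes exactly one output tile before terminating. The persistent version differs only in launching numSMs blocks that grid-stride over the tile indices, which is where the launch-latency saving comes from.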
reacted to a-r-r-o-w's post with 👍 about 20 hours ago
updated a Space about 21 hours ago
zerogpu-aoti/ltx-dev-fast

Organizations

Hugging Face, 🧨Diffusers, TensorFlow TPU, Hugging Face Internal Testing Organization, All Things ViTs, Probing ViTs, Evaluation on the Hub, Instruction-tuned Diffusion Models, JAX ♥️ Diffusers 🧨, (De)fusing, Huggingface Projects, Keras Dreambooth Event, Hugging Face OSS Metrics, Deploy HF TF ViTs, Open Generative AI, UniDiffuser Testing, Personal Coding Assistant, Diffusers Demo at ICCV 2023, huggingPartyParis, Latent Consistency, ZeroGPU Explorers, SPRIGHT, PEFT, NYU VisionX, Social Post Explorers, MaPO, diffusers-internal-dev, AuraFlow, lawrence, Optimum Internal Testing, Diffusion Guidance, syn-t2i, Hugging Face FineVideo, DDUF, HunyuanVideo Community, Finetrainers, Diffusion CoT, Cinematic T2V, Diffusers Internal Demos, ZeroGPU AoT