|
<!DOCTYPE html> |
|
<html lang="en"> |
|
<head> |
|
<meta charset="UTF-8"> |
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"> |
|
<title>DeepSeek Deployment with SGLang: Visual Explanation</title> |
|
<script src="https://cdn.tailwindcss.com"></script> |
|
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet"> |
|
<style> |
|
body { |
|
font-family: 'Inter', sans-serif; |
|
background-color: #f3f4f6; |
|
} |
|
.section-title { |
|
font-size: 1.75rem; |
|
font-weight: 700; |
|
color: #1e3a8a; |
|
border-bottom: 2px solid #3b82f6; |
|
padding-bottom: 0.5rem; |
|
margin-bottom: 1.5rem; |
|
} |
|
.subsection-title { |
|
font-size: 1.25rem; |
|
font-weight: 600; |
|
color: #1d4ed8; |
|
margin-top: 1rem; |
|
margin-bottom: 0.75rem; |
|
} |
|
.card { |
|
background-color: #ffffff; |
|
border-radius: 0.75rem; |
|
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); |
|
padding: 1.5rem; |
|
margin-bottom: 1.5rem; |
|
transition: transform 0.2s ease-in-out; |
|
} |
|
.card:hover { |
|
transform: translateY(-5px); |
|
} |
|
.highlight { |
|
background-color: #eff6ff; |
|
color: #1e40af; |
|
padding: 0.25rem 0.75rem; |
|
border-radius: 0.375rem; |
|
font-weight: 600; |
|
} |
|
.metric { |
|
font-size: 1.1rem; |
|
font-weight: 700; |
|
color: #16a34a; |
|
} |
|
.comparison-metric { |
|
font-size: 1rem; |
|
font-weight: 600; |
|
color: #52525b; |
|
} |
|
ul { |
|
list-style-type: none; |
|
padding-left: 0; |
|
} |
|
li { |
|
position: relative; |
|
padding-left: 1.75rem; |
|
margin-bottom: 0.75rem; |
|
line-height: 1.6; |
|
} |
|
li::before { |
|
content: '✓'; |
|
position: absolute; |
|
left: 0; |
|
color: #2563eb; |
|
font-weight: bold; |
|
font-size: 1.25rem; |
|
} |
|
.arrow { |
|
font-size: 1.5rem; |
|
color: #3b82f6; |
|
margin: 0 0.5rem; |
|
} |
|
.gpu-icon svg { |
|
width: 24px; |
|
height: 24px; |
|
fill: currentColor; |
|
margin-right: 8px; |
|
} |
|
.flex-container { |
|
display: flex; |
|
align-items: center; |
|
justify-content: space-around; |
|
flex-wrap: wrap; |
|
} |
|
.flow-item { |
|
text-align: center; |
|
margin: 1rem; |
|
padding: 1rem; |
|
background-color: #e0e7ff; |
|
border-radius: 0.5rem; |
|
min-width: 150px; |
|
} |
|
</style> |
|
</head> |
|
<body class="p-4 md:p-8"> |
|
<div class="max-w-5xl mx-auto"> |
|
<header class="mb-12 text-center"> |
|
<h1 class="text-4xl font-bold text-gray-800 mb-2">Deploying DeepSeek with SGLang</h1> |
|
<p class="text-xl text-gray-600">Achieving High Performance with PD Disaggregation & Large-scale Expert Parallelism</p> |
|
<p class="text-sm text-gray-500 mt-1">Based on the SGLang Team's blog post of May 5, 2025</p>
|
</header> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Key Achievements with SGLang</h2> |
|
<div class="grid md:grid-cols-2 gap-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">🚀 Near Official Performance</h3> |
|
<p class="text-gray-700">SGLang's implementation on 12 nodes (96 H100 GPUs) nearly matches DeepSeek's official inference throughput.</p> |
|
<p class="mt-2">Input: <span class="metric">52.3k tokens/s per node</span></p> |
|
<p>Output: <span class="metric">22.3k tokens/s per node</span> (for 2k token inputs)</p> |
|
</div> |
|
<div class="card"> |
|
<h3 class="subsection-title">💰 Cost Efficiency</h3> |
|
<p class="text-gray-700">Translates to <span class="metric">$0.20 / 1M output tokens</span>, approximately <span class="highlight">1/5th the cost</span> of the official DeepSeek Chat API.</p> |
|
</div> |
|
<div class="card md:col-span-2"> |
|
<h3 class="subsection-title">⚡ Throughput Boost</h3> |
|
<p class="text-gray-700">Optimized strategy improves output throughput by up to <span class="metric">5x</span> compared to vanilla tensor parallelism on the same resources.</p> |
|
</div> |
|
</div> |
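The cost figure above can be sanity-checked with simple arithmetic. The node rental price below is an assumption for illustration (roughly $2 per H100 GPU-hour for an 8-GPU node), not a number from the source; only the 22.3k tokens/s/node throughput comes from the summary above.

```python
# Back-of-envelope check of the ~$0.20 / 1M output tokens figure.
NODE_COST_PER_HOUR = 16.0                 # USD; ASSUMED: 8xH100 node at ~$2/GPU-hour
OUTPUT_TOKENS_PER_SEC_PER_NODE = 22_300   # from the throughput figures above

tokens_per_hour = OUTPUT_TOKENS_PER_SEC_PER_NODE * 3600
cost_per_million = NODE_COST_PER_HOUR / (tokens_per_hour / 1_000_000)
print(f"${cost_per_million:.2f} per 1M output tokens")
```

Under that assumed node price, the result lands at roughly $0.20 per million output tokens, consistent with the figure quoted above.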
|
<div class="card mt-6"> |
|
<h3 class="subsection-title">Core SGLang Enhancements</h3> |
|
<ul> |
|
<li>Support for Prefill-Decode (PD) Disaggregation.</li> |
|
<li>Large-scale Expert Parallelism (EP), including DeepEP, DeepGEMM, and EPLB.</li> |
|
<li>Open-source implementation for community access and development.</li> |
|
</ul> |
|
</div> |
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Parallelism Design Strategies</h2> |
|
<div class="grid md:grid-cols-2 gap-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">Attention Layers (MLA)</h3> |
|
<p class="text-gray-700">Utilizes <span class="highlight">DP Attention</span> (Data Parallelism):</p> |
|
<ul> |
|
<li>Eliminates KV cache duplication across devices.</li> |
|
<li>Significantly reduces memory overhead.</li> |
|
<li>Supports hybrid data and tensor parallelism for flexibility.</li> |
|
</ul> |
|
</div> |
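The memory argument for DP attention can be made concrete with a toy calculation. MLA's compressed KV cache is shared across heads, so plain TP replicates it on every rank, whereas DP attention shards requests so each GPU stores only its own requests' cache. All sizes below are assumed for illustration.

```python
# Illustrative KV-cache accounting for MLA (toy numbers, not measured).
GPUS = 8
REQUESTS = 64
KV_PER_REQUEST_GB = 0.5   # ASSUMED per-request compressed KV-cache size

# Plain TP attention: MLA's KV cache cannot be sharded by heads, so every
# GPU ends up holding the cache of every request.
tp_total_gb = GPUS * REQUESTS * KV_PER_REQUEST_GB

# DP attention: requests are partitioned across GPUs; each GPU stores only
# the KV cache of its own shard of requests.
dp_total_gb = REQUESTS * KV_PER_REQUEST_GB
```

The aggregate cache footprint shrinks by a factor equal to the number of GPUs, which is the duplication DP attention eliminates.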
|
<div class="card"> |
|
<h3 class="subsection-title">Dense FFNs</h3> |
|
<p class="text-gray-700">Adopts <span class="highlight">Data Parallelism (DP)</span> over Tensor Parallelism (TP):</p> |
|
<ul> |
|
<li><span class="font-semibold">Enhanced Scalability:</span> Avoids fragmentation and ensures balanced workloads.</li> |
|
<li><span class="font-semibold">Optimized Memory Efficiency:</span> With DP attention, a lower TP degree often already minimizes per-device memory, making DP the favorable choice.</li>
|
<li><span class="font-semibold">Minimized Communication:</span> Reduces all-reduce operations by 50% compared to pure TP.</li> |
|
</ul> |
|
</div> |
|
<div class="card"> |
|
<h3 class="subsection-title">Sparse FFNs (Mixture of Experts)</h3> |
|
<p class="text-gray-700">Implements <span class="highlight">Expert Parallelism (EP)</span>:</p> |
|
<ul> |
|
<li>Distributes expert weights across multiple devices.</li> |
|
<li>Scales memory capacity effectively.</li> |
|
<li>Addresses challenges like irregular communication and workload imbalance using DeepEP.</li> |
|
</ul> |
|
</div> |
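The workload-imbalance problem that EP introduces can be sketched in a few lines: route each token to its top-k experts and count how many tokens land on each EP rank. The random gate below is a stand-in for a learned router; all sizes are assumed for the sketch.

```python
import random

# Toy EP routing sketch: count per-rank load after top-k expert selection.
random.seed(0)
NUM_EXPERTS, TOP_K, EP_RANKS, TOKENS = 32, 2, 4, 1000
experts_per_rank = NUM_EXPERTS // EP_RANKS

load = [0] * EP_RANKS
for _ in range(TOKENS):
    # Stand-in for a learned gating network: pick TOP_K distinct experts.
    for expert in random.sample(range(NUM_EXPERTS), TOP_K):
        load[expert // experts_per_rank] += 1

# 1.0 means perfectly balanced; lower values mean some rank is a hotspot.
balancedness = (sum(load) / len(load)) / max(load)
```

Even with a uniform random gate the ranks are not perfectly balanced; real routers are far more skewed, which is the imbalance DeepEP's dispatch and EPLB (below) are built to handle.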
|
<div class="card"> |
|
<h3 class="subsection-title">LM Head</h3> |
|
<p class="text-gray-700">Employs <span class="highlight">Data Parallelism (DP)</span>:</p> |
|
<ul> |
|
<li>Mirrors the strategy for dense FFNs.</li> |
|
<li>Reduces memory overhead for large vocabulary computations.</li> |
|
<li>Simplifies communication across devices.</li> |
|
</ul> |
|
</div> |
|
</div> |
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Prefill & Decode (PD) Disaggregation</h2> |
|
<div class="card"> |
|
<p class="text-gray-700 mb-4">LLM inference has two phases: computation-intensive <span class="font-semibold">Prefill</span> and memory-intensive <span class="font-semibold">Decode</span>. Unified scheduling is inefficient.</p> |
|
<h3 class="subsection-title">Problems with Unified Scheduling:</h3> |
|
<ul> |
|
<li>Prefill batches interrupt ongoing decode batches, delaying token generation.</li>

<li>Under DP attention, imbalanced prefill/decode workloads across DP workers increase decode latency.</li>

<li>A unified scheduler cannot use both of DeepEP's dispatch modes (normal for prefill, low-latency for decode) within one batch.</li>
|
</ul> |
|
<h3 class="subsection-title mt-4">SGLang's PD Disaggregation Solution:</h3> |
|
<div class="flex-container my-4 p-4 bg-blue-50 rounded-lg"> |
|
<div class="flow-item">Input Request</div> |
|
<div class="arrow">➔</div> |
|
<div class="flow-item">Prefill Server<br/>(Computes KV Cache)</div> |
|
<div class="arrow">➔</div> |
|
<div class="flow-item">Data Transfer (RDMA)</div> |
|
<div class="arrow">➔</div> |
|
<div class="flow-item">Decode Server<br/>(Iterative Token Gen)</div> |
|
</div> |
|
<p class="text-gray-700">This separation allows tailored optimizations for each phase, maximizing GPU utilization.</p> |
|
<h4 class="font-semibold text-gray-800 mt-3 mb-1">Key Implementation Details:</h4> |
|
<ul> |
|
<li><span class="highlight">Non-blocking Transfer:</span> Background data send/receive.</li> |
|
<li><span class="highlight">RDMA-Based Transfer:</span> Efficient for non-contiguous memory.</li> |
|
<li><span class="highlight">Flexible API Integration:</span> Supports Mooncake, NIXL.</li> |
|
</ul> |
|
</div> |
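The flow diagrammed above can be sketched as three stages. The function names and data shapes are illustrative, not SGLang's actual API: prefill runs one forward pass over the whole prompt and produces the KV cache, the cache is handed off (non-blocking RDMA in the real system), and decode generates tokens iteratively against it.

```python
# Minimal sketch of the PD-disaggregated pipeline (names are hypothetical).

def prefill(prompt_tokens):
    """Prefill server: one compute-heavy pass over the full prompt."""
    kv_cache = [("k", t) for t in prompt_tokens]  # stand-in for real KV tensors
    first_token = len(prompt_tokens)              # stand-in for a sampled token
    return kv_cache, first_token

def transfer(kv_cache):
    """In SGLang this is a background RDMA send; here just a copy/handoff."""
    return list(kv_cache)

def decode(kv_cache, token, steps):
    """Decode server: memory-bound iterative generation, one token per step."""
    out = [token]
    for _ in range(steps):
        kv_cache.append(("k", out[-1]))           # cache grows each step
        out.append(out[-1] + 1)                   # stand-in for sampling
    return out

kv, tok = prefill([1, 2, 3])
generated = decode(transfer(kv), tok, steps=4)
```

Because the two servers never share a batch, each can be scheduled and optimized for its own bottleneck (compute for prefill, memory bandwidth for decode).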
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Large-scale Expert Parallelism Optimizations</h2> |
|
<div class="space-y-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">Expert Parallelism with DeepEP</h3> |
|
<p class="text-gray-700">DeepEP streamlines EP by efficiently routing tokens to experts across GPUs.</p> |
|
<p class="text-gray-700 mt-2"><span class="highlight">Normal Dispatch:</span> For prefill (long inputs, max throughput). Incompatible with CUDA Graph.</p> |
|
<p class="text-gray-700 mt-1"><span class="highlight">Low-Latency Dispatch:</span> For decode (output tokens, min delay). Supports CUDA Graph.</p> |
|
<p class="text-gray-700 mt-2">SGLang's <span class="font-semibold">PD Disaggregation</span> enables using both modes effectively with DP Attention.</p> |
|
</div> |
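The phase-to-mode pairing above reduces to a simple decision rule once prefill and decode run on separate servers. The function and field names below are assumptions for illustration, not DeepEP's API.

```python
# Hypothetical mode selection: PD disaggregation lets each server commit
# to the dispatch mode that suits its phase (names are illustrative).
def pick_dispatch_mode(phase):
    if phase == "prefill":
        # Long inputs: maximize throughput; CUDA Graph not usable here.
        return {"mode": "normal", "cuda_graph": False}
    if phase == "decode":
        # Short iterative steps: minimize latency; CUDA Graph compatible.
        return {"mode": "low_latency", "cuda_graph": True}
    raise ValueError(f"unknown phase: {phase}")
```

In a unified scheduler one batch may mix both phases, so neither mode fits; with disaggregation the choice is static per server.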
|
|
|
<div class="card"> |
|
<h3 class="subsection-title">DeepGEMM Integration</h3> |
|
<p class="text-gray-700">Optimizes MoE matrix multiplications (Grouped GEMMs).</p> |
|
<p class="text-gray-700 mt-2"><span class="highlight">Contiguous Layout Kernel:</span> For prefill (dynamic shapes). Used with DeepEP's Normal Dispatch (requires permutation).</p> |
|
<p class="text-gray-700 mt-1"><span class="highlight">Masked Layout Kernel:</span> For decode (fixed shapes, CUDA Graph compatible). Used with DeepEP's Low-Latency Dispatch.</p> |
|
</div> |
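The two layouts can be illustrated with toy shapes (all sizes below are assumed for the sketch, not taken from DeepGEMM): the contiguous layout permutes tokens so each expert's rows sit adjacently in one variable-length buffer, while the masked layout pads each expert to a fixed capacity so tensor shapes never change between steps.

```python
import numpy as np

# Toy grouped-GEMM layouts; identity matrices stand in for expert weights.
HIDDEN, CAPACITY = 4, 8
counts = [3, 5, 2]                          # tokens routed to each expert
weights = [np.eye(HIDDEN) for _ in counts]
x = np.arange(sum(counts) * HIDDEN, dtype=float).reshape(-1, HIDDEN)
offsets = np.cumsum([0] + counts)

# Contiguous layout (prefill): one [sum(counts), HIDDEN] buffer with
# per-expert offsets; shapes vary batch to batch.
contig_out = np.concatenate(
    [x[offsets[i]:offsets[i + 1]] @ weights[i] for i in range(len(counts))])

# Masked layout (decode): fixed [experts, CAPACITY, HIDDEN] buffer where
# only the first counts[i] rows per expert are valid -- static shapes,
# hence CUDA Graph friendly.
padded = np.zeros((len(counts), CAPACITY, HIDDEN))
for i, c in enumerate(counts):
    padded[i, :c] = x[offsets[i]:offsets[i + 1]]
masked_out = np.stack([padded[i] @ weights[i] for i in range(len(counts))])
```

The permutation cost of the contiguous layout is paid only in prefill, where batches are large; decode pays padding instead, trading a little wasted compute for graph capture.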
|
|
|
<div class="card"> |
|
<h3 class="subsection-title">Two-batch Overlap (TBO)</h3> |
|
<p class="text-gray-700">Splits a batch into two micro-batches to <span class="highlight">overlap computation and communication</span>.</p> |
|
<ul> |
|
<li>Lowers peak memory usage.</li> |
|
<li>Addresses limited communication bandwidth in multi-node setups.</li> |
|
<li>SGLang uses an abstraction layer (operations & yield points) for clean implementation.</li> |
|
<li>Launch order during prefill is tuned so that DeepEP's dispatch does not block the CPU.</li>
|
</ul> |
|
</div> |
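The operations-and-yield-points abstraction can be sketched with Python generators (a hypothetical miniature, not SGLang's implementation): each micro-batch yields at its communication points, so the scheduler runs micro-batch B's compute while micro-batch A's dispatch is in flight, and vice versa.

```python
# Toy two-batch overlap: generators yield where communication would occur.

def layer(batch_id, log):
    log.append((batch_id, "attn"))  # compute phase
    yield                           # MoE dispatch (communication) in flight
    log.append((batch_id, "moe"))   # compute phase
    yield                           # MoE combine (communication) in flight

def run_overlapped(num_layers):
    log = []
    for _ in range(num_layers):
        a, b = layer("A", log), layer("B", log)
        # Interleave the two micro-batches at every yield point.
        for gen in (a, b, a, b, a, b):
            next(gen, None)
    return log

schedule = run_overlapped(num_layers=1)
```

The resulting schedule alternates A and B: while A waits on dispatch, B's attention runs, which is exactly the compute/communication overlap TBO exploits.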
|
|
|
<div class="card"> |
|
<h3 class="subsection-title">Expert Parallelism Load Balancer (EPLB)</h3> |
|
<p class="text-gray-700">Addresses uneven workload distribution in MoE models.</p> |
|
<ul> |
|
<li>Computes optimal expert arrangement to minimize imbalance.</li> |
|
<li>Uses redundant experts (e.g., 288 physical experts instead of 256) for flexible placement.</li>

<li>Enables flexible parallelism sizes (e.g., EP over 12 or 72 GPUs, since 288 is divisible by both).</li>
|
<li>SGLang implements efficient, non-disruptive rebalancing.</li> |
|
</ul> |
|
<p class="mt-2 text-gray-600">Effectiveness depends on matching input distribution to serving workload (achieved via larger batches or periodic rebalancing).</p> |
|
</div> |
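The rebalancing idea can be sketched with a simple greedy placement (an illustrative algorithm, not the actual EPLB solver): replicate the hottest experts to fill the redundant slots, then assign replicas to ranks heaviest-first, always onto the currently least-loaded rank.

```python
import heapq

# Toy EPLB-style placement: split hot experts across redundant replicas,
# then greedily pack replicas onto ranks to minimize the maximum load.
def rebalance(expert_load, num_ranks, num_slots):
    replicas = [(load, e) for e, load in enumerate(expert_load)]
    # Fill redundant slots by halving the currently hottest replica.
    for _ in range(num_slots - len(expert_load)):
        replicas.sort(reverse=True)
        load, e = replicas[0]
        replicas[0] = (load / 2, e)
        replicas.append((load / 2, e))
    # Greedy bin packing: heaviest replica onto the least-loaded rank.
    ranks = [(0.0, r, []) for r in range(num_ranks)]
    heapq.heapify(ranks)
    for load, e in sorted(replicas, reverse=True):
        total, r, placed = heapq.heappop(ranks)
        heapq.heappush(ranks, (total + load, r, placed + [e]))
    return ranks

placement = rebalance([100, 10, 10, 10], num_ranks=2, num_slots=6)
```

With one dominant expert, replication plus greedy packing brings the two ranks close to even, which a plain one-expert-per-slot layout cannot achieve.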
|
</div> |
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Evaluation Highlights</h2> |
|
<div class="grid md:grid-cols-2 gap-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">Prefill Phase Performance</h3> |
|
<p class="text-gray-700">On 4 nodes (32 H100s, EP32):</p> |
|
<p>Up to <span class="metric">3.3x improvement</span> over TP16 baseline.</p> |
|
<p>Throughput within <span class="comparison-metric">5.6% of DeepSeek's official profile</span> (assuming perfect balance).</p> |
|
<p class="mt-1">Example: <span class="highlight">50,302 tokens/s per node</span> for 4K prompts.</p> |
|
</div> |
|
<div class="card"> |
|
<h3 class="subsection-title">Decode Phase Performance</h3> |
|
<p class="text-gray-700">On 9 nodes (72 H100s, EP72):</p> |
|
<p><span class="metric">5.2x speedup</span> over TP16 baseline.</p> |
|
<p>With simulated MTP, throughput <span class="comparison-metric">6.6% below DeepSeek's profile</span>.</p> |
|
<p class="mt-1">Example: <span class="highlight">22,282 tokens/s per node</span> for 2K inputs.</p> |
|
</div> |
|
</div> |
|
|
|
<div class="card mt-6"> |
|
<h3 class="subsection-title">Ablation Study: Two-batch Overlap (TBO)</h3> |
|
<p class="text-gray-700"><span class="font-semibold">Prefill:</span></p> |
|
<ul> |
|
<li>Supports larger batch sizes (e.g., 16k tokens/device with TBO, whereas runs without TBO hit OOM at 8k).</li>
|
<li><span class="metric">27-35% throughput increase</span> by overlapping computation & communication.</li> |
|
</ul> |
|
<p class="text-gray-700 mt-3"><span class="font-semibold">Decode:</span></p> |
|
<ul> |
|
<li>Speedup depends on batch size (e.g., <span class="metric">25.5% at 256 tokens/device</span>).</li>

<li>Largest speedup (<span class="metric">35%</span>) in the simulated-MTP setting, where attention computation is longer.</li>
|
</ul> |
|
</div> |
|
|
|
<div class="card mt-6"> |
|
<h3 class="subsection-title">Ablation Study: EPLB</h3> |
|
<p class="text-gray-700">Delivers significant speedup by mitigating workload imbalance:</p> |
|
<ul> |
|
<li>Prefill: <span class="metric">1.49x speedup</span>.</li> |
|
<li>Decode: <span class="metric">2.54x speedup</span>.</li> |
|
</ul> |
|
<p class="text-gray-700 mt-2">Strong correlation between <span class="highlight">workload balancedness and overall throughput</span>.</p> |
|
<p class="text-gray-700 mt-2">Prefill and decode favor different expert distributions, so PD disaggregation also enables phase-specific expert placement.</p>
|
</div> |
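The balancedness metric referenced above can be stated in one line, following the mean-over-max notion of balance used throughout this summary (the example loads below are made-up numbers for illustration):

```python
# Balancedness: mean load over max load across GPUs; 1.0 = perfectly balanced.
def balancedness(tokens_per_gpu):
    return (sum(tokens_per_gpu) / len(tokens_per_gpu)) / max(tokens_per_gpu)

before = balancedness([900, 100, 60, 140])   # hypothetical skewed, pre-EPLB load
after = balancedness([330, 280, 290, 300])   # hypothetical post-rebalancing load
```

Since the slowest (most loaded) GPU gates the whole step, raising balancedness translates almost directly into throughput, matching the strong correlation noted above.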
|
</section> |
|
|
|
<section class="mb-6"> |
|
<h2 class="section-title">Conclusion</h2> |
|
<div class="card"> |
|
<p class="text-gray-700 leading-relaxed"> |
|
By integrating Prefill-Decode Disaggregation with large-scale Expert Parallelism techniques (DeepEP, DeepGEMM, TBO, and EPLB), SGLang serves the large DeepSeek model on H100 GPUs at performance close to official reports while substantially reducing cost.

Because these components are open source, the community can build on them for efficient large-scale LLM serving.
|
</p> |
|
</div> |
|
</section> |
|
|
|
<footer class="text-center mt-12 py-6 border-t border-gray-300"> |
|
<p class="text-gray-600">Visual summary generated based on "Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs" by The SGLang Team.</p> |
|
</footer> |
|
</div></body> |
|
</html> |
|
|