<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>DeepSeek Deployment with SGLang: Visual Explanation</title>
<script src="https://cdn.tailwindcss.com"></script>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Inter', sans-serif;
background-color: #f3f4f6; /* Light gray background */
}
.section-title {
font-size: 1.75rem; /* Larger section titles */
font-weight: 700;
color: #1e3a8a; /* Dark blue */
border-bottom: 2px solid #3b82f6; /* Medium blue border */
padding-bottom: 0.5rem;
margin-bottom: 1.5rem;
}
.subsection-title {
font-size: 1.25rem;
font-weight: 600;
color: #1d4ed8; /* Slightly lighter blue */
margin-top: 1rem;
margin-bottom: 0.75rem;
}
.card {
background-color: #ffffff;
border-radius: 0.75rem; /* More rounded corners */
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
padding: 1.5rem;
margin-bottom: 1.5rem;
transition: transform 0.2s ease-in-out;
}
.card:hover {
transform: translateY(-5px);
}
.highlight {
background-color: #eff6ff; /* Light blue background for highlights */
color: #1e40af; /* Darker blue text for highlights */
padding: 0.25rem 0.75rem;
border-radius: 0.375rem;
font-weight: 600;
}
.metric {
font-size: 1.1rem;
font-weight: 700;
color: #16a34a; /* Green for positive metrics */
}
.comparison-metric {
font-size: 1rem;
font-weight: 600;
color: #52525b; /* Neutral gray for comparison details */
}
ul {
list-style-type: none; /* Remove default bullets */
padding-left: 0;
}
li {
position: relative;
padding-left: 1.75rem; /* Space for custom bullet */
margin-bottom: 0.75rem;
line-height: 1.6;
}
li::before {
content: '✓'; /* Custom checkmark bullet */
position: absolute;
left: 0;
color: #2563eb; /* Blue checkmark */
font-weight: bold;
font-size: 1.25rem;
}
.arrow {
font-size: 1.5rem;
color: #3b82f6;
margin: 0 0.5rem;
}
.gpu-icon svg {
width: 24px;
height: 24px;
fill: currentColor;
margin-right: 8px;
}
.flex-container {
display: flex;
align-items: center;
justify-content: space-around;
flex-wrap: wrap;
}
.flow-item {
text-align: center;
margin: 1rem;
padding: 1rem;
background-color: #e0e7ff;
border-radius: 0.5rem;
min-width: 150px;
}
</style>
</head>
<body class="p-4 md:p-8">
<div class="max-w-5xl mx-auto">
<header class="mb-12 text-center">
<h1 class="text-4xl font-bold text-gray-800 mb-2">Deploying DeepSeek with SGLang</h1>
<p class="text-xl text-gray-600">Achieving High Performance with PD Disaggregation & Large-scale Expert Parallelism</p>
<p class="text-sm text-gray-500 mt-1">Based on SGLang Team, May 05, 2025</p>
</header>
<section class="mb-10">
<h2 class="section-title">Key Achievements with SGLang</h2>
<div class="grid md:grid-cols-2 gap-6">
<div class="card">
<h3 class="subsection-title">π Near Official Performance</h3>
<p class="text-gray-700">SGLang's implementation on 12 nodes (96 H100 GPUs) nearly matches DeepSeek's official inference throughput.</p>
<p class="mt-2">Input: <span class="metric">52.3k tokens/s per node</span></p>
<p>Output: <span class="metric">22.3k tokens/s per node</span> (for 2k token inputs)</p>
</div>
<div class="card">
<h3 class="subsection-title">π° Cost Efficiency</h3>
<p class="text-gray-700">Translates to <span class="metric">$0.20 / 1M output tokens</span>, approximately <span class="highlight">1/5th the cost</span> of the official DeepSeek Chat API.</p>
</div>
<div class="card md:col-span-2">
<h3 class="subsection-title">β‘ Throughput Boost</h3>
<p class="text-gray-700">Optimized strategy improves output throughput by up to <span class="metric">5x</span> compared to vanilla tensor parallelism on the same resources.</p>
</div>
</div>
<div class="card mt-6">
<h3 class="subsection-title">Core SGLang Enhancements</h3>
<ul>
<li>Support for Prefill-Decode (PD) Disaggregation.</li>
<li>Large-scale Expert Parallelism (EP), including DeepEP, DeepGEMM, and EPLB.</li>
<li>Open-source implementation for community access and development.</li>
</ul>
</div>
</section>
<section class="mb-10">
<h2 class="section-title">Parallelism Design Strategies</h2>
<div class="grid md:grid-cols-2 gap-6">
<div class="card">
<h3 class="subsection-title">Attention Layers (MLA)</h3>
<p class="text-gray-700">Utilizes <span class="highlight">DP Attention</span> (Data Parallelism):</p>
<ul>
<li>Eliminates KV cache duplication across devices.</li>
<li>Significantly reduces memory overhead.</li>
<li>Supports hybrid data and tensor parallelism for flexibility.</li>
</ul>
</div>
<div class="card">
<h3 class="subsection-title">Dense FFNs</h3>
<p class="text-gray-700">Adopts <span class="highlight">Data Parallelism (DP)</span> over Tensor Parallelism (TP):</p>
<ul>
<li><span class="font-semibold">Enhanced Scalability:</span> Avoids fragmentation and ensures balanced workloads.</li>
<li><span class="font-semibold">Optimized Memory Efficiency:</span> Lower TP degree often minimizes memory, making DP favorable.</li>
<li><span class="font-semibold">Minimized Communication:</span> Reduces all-reduce operations by 50% compared to pure TP.</li>
</ul>
</div>
<div class="card">
<h3 class="subsection-title">Sparse FFNs (Mixture of Experts)</h3>
<p class="text-gray-700">Implements <span class="highlight">Expert Parallelism (EP)</span>:</p>
<ul>
<li>Distributes expert weights across multiple devices.</li>
<li>Scales memory capacity effectively.</li>
<li>Addresses challenges like irregular communication and workload imbalance using DeepEP.</li>
</ul>
</div>
<div class="card">
<h3 class="subsection-title">LM Head</h3>
<p class="text-gray-700">Employs <span class="highlight">Data Parallelism (DP)</span>:</p>
<ul>
<li>Mirrors the strategy for dense FFNs.</li>
<li>Reduces memory overhead for large vocabulary computations.</li>
<li>Simplifies communication across devices.</li>
</ul>
</div>
</div>
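<p class="text-gray-700">The routing step behind expert parallelism can be sketched in a few lines. This is an illustrative toy, not SGLang's or DeepEP's actual API: each GPU ("rank") owns a contiguous slice of experts, and tokens are bucketed by destination rank, mimicking the all-to-all dispatch DeepEP performs on real hardware. All names and sizes here are invented for the example.</p>

```python
# Toy sketch of top-k expert routing under expert parallelism (EP).
# Not SGLang/DeepEP code: names and sizes are illustrative only.

NUM_EXPERTS = 8   # toy value; DeepSeek-V3 uses 256 routed experts
NUM_RANKS = 4     # experts per rank = NUM_EXPERTS // NUM_RANKS
TOP_K = 2         # experts selected per token

def owner_rank(expert_id: int) -> int:
    """Map an expert to the GPU rank that holds its weights."""
    return expert_id // (NUM_EXPERTS // NUM_RANKS)

def dispatch(token_topk: list[list[int]]) -> dict[int, list[tuple[int, int]]]:
    """Group (token_id, expert_id) pairs by destination rank,
    as an all-to-all dispatch would on real hardware."""
    buckets: dict[int, list[tuple[int, int]]] = {r: [] for r in range(NUM_RANKS)}
    for token_id, experts in enumerate(token_topk):
        for expert_id in experts:
            buckets[owner_rank(expert_id)].append((token_id, expert_id))
    return buckets

# Three tokens, each routed to TOP_K = 2 experts.
routing = [[0, 5], [1, 6], [5, 7]]
buckets = dispatch(routing)
# Rank 0 owns experts 0-1, rank 2 owns 4-5, rank 3 owns 6-7.
```

<p class="text-gray-700">Because token-to-expert assignments are data-dependent, the per-rank buckets are irregular in size, which is exactly the communication and load-imbalance challenge DeepEP and EPLB address.</p>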
</section>
<section class="mb-10">
<h2 class="section-title">Prefill & Decode (PD) Disaggregation</h2>
<div class="card">
<p class="text-gray-700 mb-4">LLM inference has two phases: computation-intensive <span class="font-semibold">Prefill</span> and memory-intensive <span class="font-semibold">Decode</span>. Unified scheduling is inefficient.</p>
<h3 class="subsection-title">Problems with Unified Scheduling:</h3>
<ul>
<li>Prefill batches interrupt decode batches (delay).</li>
<li>DP Attention imbalance (increased decode latency).</li>
<li>Incompatible with DeepEP's dual dispatch modes.</li>
</ul>
<h3 class="subsection-title mt-4">SGLang's PD Disaggregation Solution:</h3>
<div class="flex-container my-4 p-4 bg-blue-50 rounded-lg">
<div class="flow-item">Input Request</div>
<div class="arrow">β</div>
<div class="flow-item">Prefill Server<br/>(Computes KV Cache)</div>
<div class="arrow">β</div>
<div class="flow-item">Data Transfer (RDMA)</div>
<div class="arrow">β</div>
<div class="flow-item">Decode Server<br/>(Iterative Token Gen)</div>
</div>
<p class="text-gray-700">This separation allows tailored optimizations for each phase, maximizing GPU utilization.</p>
<h4 class="font-semibold text-gray-800 mt-3 mb-1">Key Implementation Details:</h4>
<ul>
<li><span class="highlight">Non-blocking Transfer:</span> Background data send/receive.</li>
<li><span class="highlight">RDMA-Based Transfer:</span> Efficient for non-contiguous memory.</li>
<li><span class="highlight">Flexible API Integration:</span> Supports Mooncake, NIXL.</li>
</ul>
</div>
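<p class="text-gray-700">The request flow above can be sketched as two cooperating servers. This is a hedged toy model, not SGLang's implementation: the class and method names are invented, the KV cache is a plain dict (RDMA transfer in the real system), and the "token" emitted at each decode step is a stand-in for model sampling.</p>

```python
# Illustrative sketch of the PD-disaggregated flow (assumed names, toy logic).

class PrefillServer:
    def prefill(self, prompt_tokens: list[int]) -> dict:
        # One compute-bound pass over the whole prompt builds the KV cache
        # (stand-in: record the tokens "attended" so far).
        return {"kv": list(prompt_tokens)}

class DecodeServer:
    def decode(self, kv_cache: dict, max_new_tokens: int) -> list[int]:
        out = []
        for _ in range(max_new_tokens):
            # Each memory-bound step reads the cache and appends one token
            # (stand-in policy: emit the current cache length).
            tok = len(kv_cache["kv"])
            out.append(tok)
            kv_cache["kv"].append(tok)
        return out

prompt = [101, 102, 103]
kv = PrefillServer().prefill(prompt)       # compute-intensive phase
generated = DecodeServer().decode(kv, 3)   # memory-intensive phase
# generated == [3, 4, 5]
```

<p class="text-gray-700">Separating the two roles lets each server pick its own batch sizes, dispatch mode, and kernels, which is the point of the disaggregation.</p>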
</section>
<section class="mb-10">
<h2 class="section-title">Large-scale Expert Parallelism Optimizations</h2>
<div class="space-y-6">
<div class="card">
<h3 class="subsection-title">Expert Parallelism with DeepEP</h3>
<p class="text-gray-700">DeepEP streamlines EP by efficiently routing tokens to experts across GPUs.</p>
<p class="text-gray-700 mt-2"><span class="highlight">Normal Dispatch:</span> For prefill (long inputs, max throughput). Incompatible with CUDA Graph.</p>
<p class="text-gray-700 mt-1"><span class="highlight">Low-Latency Dispatch:</span> For decode (output tokens, min delay). Supports CUDA Graph.</p>
<p class="text-gray-700 mt-2">SGLang's <span class="font-semibold">PD Disaggregation</span> enables using both modes effectively with DP Attention.</p>
</div>
<div class="card">
<h3 class="subsection-title">DeepGEMM Integration</h3>
<p class="text-gray-700">Optimizes MoE matrix multiplications (Grouped GEMMs).</p>
<p class="text-gray-700 mt-2"><span class="highlight">Contiguous Layout Kernel:</span> For prefill (dynamic shapes). Used with DeepEP's Normal Dispatch (requires permutation).</p>
<p class="text-gray-700 mt-1"><span class="highlight">Masked Layout Kernel:</span> For decode (fixed shapes, CUDA Graph compatible). Used with DeepEP's Low-Latency Dispatch.</p>
</div>
<div class="card">
<h3 class="subsection-title">Two-batch Overlap (TBO)</h3>
<p class="text-gray-700">Splits a batch into two micro-batches to <span class="highlight">overlap computation and communication</span>.</p>
<ul>
<li>Lowers peak memory usage.</li>
<li>Addresses limited communication bandwidth in multi-node setups.</li>
<li>SGLang uses an abstraction layer (operations & yield points) for clean implementation.</li>
<li>Optimized launch order in prefill to avoid CPU-blocking by DeepEP.</li>
</ul>
</div>
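<p class="text-gray-700">The operations-and-yield-points abstraction can be illustrated with plain Python generators. This is a conceptual sketch only (no real GPU streams or communication): each micro-batch yields at its communication boundaries, so while one micro-batch's dispatch is "in flight," the scheduler advances the other's computation.</p>

```python
# Conceptual sketch of two-batch overlap via yield points (toy, single-threaded).

def micro_batch(name: str, log: list[str]):
    log.append(f"{name}: attention compute")
    yield  # yield point: expert dispatch (communication) in flight
    log.append(f"{name}: MoE compute")
    yield  # yield point: combine (communication) in flight
    log.append(f"{name}: done")

def run_overlapped(log: list[str]) -> None:
    a, b = micro_batch("A", log), micro_batch("B", log)
    # Interleave the two micro-batches: A's communication window is
    # filled with B's computation, and vice versa.
    for gen in (a, b, a, b, a, b):
        next(gen, None)

trace: list[str] = []
run_overlapped(trace)
```

<p class="text-gray-700">Expressing each micro-batch as a sequence of operations with explicit yield points keeps the overlap schedule out of the model code, which is the "clean implementation" benefit mentioned above.</p>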
<div class="card">
<h3 class="subsection-title">Expert Parallelism Load Balancer (EPLB)</h3>
<p class="text-gray-700">Addresses uneven workload distribution in MoE models.</p>
<ul>
<li>Computes optimal expert arrangement to minimize imbalance.</li>
<li>Uses redundant experts (e.g., 288 instead of 256) for flexible placement.</li>
<li>Enables diverse parallelism sizes (e.g., 12 or 72).</li>
<li>SGLang implements efficient, non-disruptive rebalancing.</li>
</ul>
<p class="mt-2 text-gray-600">Effectiveness depends on matching input distribution to serving workload (achieved via larger batches or periodic rebalancing).</p>
</div>
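<p class="text-gray-700">The idea of redundant experts can be sketched with a toy balancer. This is not the EPLB algorithm itself, only an assumed simplification: replicate the hottest experts into the redundant slots, split their load across replicas, then greedily pack slots onto GPUs to minimize the peak load.</p>

```python
# Toy EPLB-style rebalancing sketch (assumed logic, not the real algorithm).

def rebalance(loads: list[float], redundant: int, num_gpus: int):
    # Give each redundant slot to the currently hottest (per-replica) expert.
    replicas = [1] * len(loads)
    for _ in range(redundant):
        hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
        replicas[hot] += 1
    # Expand experts into physical slots, each carrying the per-replica load.
    slots = [(loads[e] / replicas[e], e) for e in range(len(loads))
             for _ in range(replicas[e])]
    # Greedy longest-processing-time packing onto GPUs.
    gpu_load = [0.0] * num_gpus
    placement: list[list[int]] = [[] for _ in range(num_gpus)]
    for load, e in sorted(slots, reverse=True):
        g = min(range(num_gpus), key=lambda i: gpu_load[i])
        gpu_load[g] += load
        placement[g].append(e)
    return placement, gpu_load

# Four experts with skewed load, two redundant slots, two GPUs.
placement, gpu_load = rebalance([8.0, 4.0, 2.0, 2.0], redundant=2, num_gpus=2)
```

<p class="text-gray-700">Even in this toy, replicating the hottest expert brings the peak per-GPU load below what any placement without redundancy could achieve, which is the mechanism behind the speedups reported above.</p>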
</div>
</section>
<section class="mb-10">
<h2 class="section-title">Evaluation Highlights</h2>
<div class="grid md:grid-cols-2 gap-6">
<div class="card">
<h3 class="subsection-title">Prefill Phase Performance</h3>
<p class="text-gray-700">On 4 nodes (32 H100s, EP32):</p>
<p>Up to <span class="metric">3.3x improvement</span> over TP16 baseline.</p>
<p>Throughput within <span class="comparison-metric">5.6% of DeepSeek's official profile</span> (assuming perfect balance).</p>
<p class="mt-1">Example: <span class="highlight">50,302 tokens/s per node</span> for 4K prompts.</p>
</div>
<div class="card">
<h3 class="subsection-title">Decode Phase Performance</h3>
<p class="text-gray-700">On 9 nodes (72 H100s, EP72):</p>
<p><span class="metric">5.2x speedup</span> over TP16 baseline.</p>
<p>With simulated MTP, throughput <span class="comparison-metric">6.6% below DeepSeek's profile</span>.</p>
<p class="mt-1">Example: <span class="highlight">22,282 tokens/s per node</span> for 2K inputs.</p>
</div>
</div>
<div class="card mt-6">
<h3 class="subsection-title">Ablation Study: Two-batch Overlap (TBO)</h3>
<p class="text-gray-700"><span class="font-semibold">Prefill:</span></p>
<ul>
<li>Supports larger batch sizes (e.g., 16k tokens/device with TBO, versus OOM at 8k without it).</li>
<li><span class="metric">27-35% throughput increase</span> by overlapping computation & communication.</li>
</ul>
<p class="text-gray-700 mt-3"><span class="font-semibold">Decode:</span></p>
<ul>
<li>Speedup contingent on batch size (e.g., <span class="metric">25.5% at 256 tokens/device</span>).</li>
<li>Largest speedup (<span class="metric">35%</span>) under simulated MTP, where attention takes longer and leaves more room for overlap.</li>
</ul>
</div>
<div class="card mt-6">
<h3 class="subsection-title">Ablation Study: EPLB</h3>
<p class="text-gray-700">Delivers significant speedup by mitigating workload imbalance:</p>
<ul>
<li>Prefill: <span class="metric">1.49x speedup</span>.</li>
<li>Decode: <span class="metric">2.54x speedup</span>.</li>
</ul>
<p class="text-gray-700 mt-2">Strong correlation between <span class="highlight">workload balancedness and overall throughput</span>.</p>
<p class="text-gray-700 mt-2">Different expert distributions for prefill vs. decode support PD disaggregation for phase-specific expert placement.</p>
</div>
</section>
<section class="mb-6">
<h2 class="section-title">Conclusion</h2>
<div class="card">
<p class="text-gray-700 leading-relaxed">
SGLang, by integrating advanced techniques like Prefill-Decode Disaggregation and sophisticated Expert Parallelism strategies (DeepEP, DeepGEMM, TBO, EPLB), successfully deploys the large DeepSeek model on H100 GPUs with performance nearly matching official reports and significantly reducing costs.
The open-source nature of these components empowers the community to build upon these optimizations for efficient large-scale LLM serving.
</p>
</div>
</section>
<footer class="text-center mt-12 py-6 border-t border-gray-300">
<p class="text-gray-600">Visual summary generated based on "Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs" by The SGLang Team.</p>
</footer>
</div></body>
</html>