|
<!DOCTYPE html> |
|
<html lang="en"> |
|
<head> |
|
<meta charset="UTF-8"> |
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"> |
|
<title>DeepSeek Deployment with SGLang: Visual Explanation</title> |
|
<script src="https://cdn.tailwindcss.com"></script> |
|
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet"> |
|
<style> |
|
body { |
|
font-family: 'Inter', sans-serif; |
|
background-color: #f3f4f6; |
|
} |
|
.section-title { |
|
font-size: 1.75rem; |
|
font-weight: 700; |
|
color: #1e3a8a; |
|
border-bottom: 2px solid #3b82f6; |
|
padding-bottom: 0.5rem; |
|
margin-bottom: 1.5rem; |
|
} |
|
.subsection-title { |
|
font-size: 1.25rem; |
|
font-weight: 600; |
|
color: #1d4ed8; |
|
margin-top: 1rem; |
|
margin-bottom: 0.75rem; |
|
} |
|
.card { |
|
background-color: #ffffff; |
|
border-radius: 0.75rem; |
|
box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06); |
|
padding: 1.5rem; |
|
margin-bottom: 1.5rem; |
|
transition: transform 0.2s ease-in-out; |
|
} |
|
.card:hover { |
|
transform: translateY(-5px); |
|
} |
|
.highlight { |
|
background-color: #eff6ff; |
|
color: #1e40af; |
|
padding: 0.25rem 0.75rem; |
|
border-radius: 0.375rem; |
|
font-weight: 600; |
|
} |
|
.metric { |
|
font-size: 1.1rem; |
|
font-weight: 700; |
|
color: #16a34a; |
|
} |
|
.comparison-metric { |
|
font-size: 1rem; |
|
font-weight: 600; |
|
color: #52525b; |
|
} |
|
ul { |
|
list-style-type: none; |
|
padding-left: 0; |
|
} |
|
li { |
|
position: relative; |
|
padding-left: 1.75rem; |
|
margin-bottom: 0.75rem; |
|
line-height: 1.6; |
|
} |
|
li::before { |
|
content: '✓'; |
|
position: absolute; |
|
left: 0; |
|
color: #2563eb; |
|
font-weight: bold; |
|
font-size: 1.25rem; |
|
} |
|
.arrow { |
|
font-size: 1.5rem; |
|
color: #3b82f6; |
|
margin: 0 0.5rem; |
|
} |
|
.gpu-icon svg { |
|
width: 24px; |
|
height: 24px; |
|
fill: currentColor; |
|
margin-right: 8px; |
|
} |
|
.flex-container { |
|
display: flex; |
|
align-items: center; |
|
justify-content: space-around; |
|
flex-wrap: wrap; |
|
} |
|
.flow-item { |
|
text-align: center; |
|
margin: 1rem; |
|
padding: 1rem; |
|
background-color: #e0e7ff; |
|
border-radius: 0.5rem; |
|
min-width: 150px; |
|
} |
|
</style> |
|
</head> |
|
<body class="p-4 md:p-8"> |
|
<div class="max-w-5xl mx-auto"> |
|
<header class="mb-12 text-center"> |
|
<h1 class="text-4xl font-bold text-gray-800 mb-2">Deploying DeepSeek with SGLang</h1> |
|
<p class="text-xl text-gray-600">Achieving High Performance with PD Disaggregation & Large-scale Expert Parallelism</p> |
|
<p class="text-sm text-gray-500 mt-1">Based on the SGLang Team's blog post of May 5, 2025</p>
|
</header> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Key Achievements with SGLang</h2> |
|
<div class="grid md:grid-cols-2 gap-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">🚀 Near Official Performance</h3> |
|
<p class="text-gray-700">SGLang's implementation on 12 nodes (96 H100 GPUs) nearly matches DeepSeek's official inference throughput.</p> |
|
<p class="mt-2">Input: <span class="metric">52.3k tokens/s per node</span></p> |
|
<p>Output: <span class="metric">22.3k tokens/s per node</span> (for 2k token inputs)</p> |
|
</div> |
|
<div class="card"> |
|
<h3 class="subsection-title">💰 Cost Efficiency</h3> |
|
<p class="text-gray-700">Translates to <span class="metric">$0.20 / 1M output tokens</span>, approximately <span class="highlight">1/5th the cost</span> of the official DeepSeek Chat API.</p> |
|
</div> |
|
<div class="card md:col-span-2"> |
|
<h3 class="subsection-title">⚡ Throughput Boost</h3> |
|
<p class="text-gray-700">Optimized strategy improves output throughput by up to <span class="metric">5x</span> compared to vanilla tensor parallelism on the same resources.</p> |
|
</div> |
|
</div> |
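The cost figure above can be sanity-checked with simple arithmetic. The node rental price below is an assumption for illustration (roughly $2 per H100 GPU-hour for an 8-GPU node), not a number from the source; only the 22.3k tokens/s/node throughput comes from the summary above.

```python
# Back-of-envelope check of the ~$0.20 / 1M output tokens figure.
NODE_COST_PER_HOUR = 16.0                 # USD; ASSUMED: 8xH100 node at ~$2/GPU-hour
OUTPUT_TOKENS_PER_SEC_PER_NODE = 22_300   # from the throughput figures above

tokens_per_hour = OUTPUT_TOKENS_PER_SEC_PER_NODE * 3600
cost_per_million = NODE_COST_PER_HOUR / (tokens_per_hour / 1_000_000)
print(f"${cost_per_million:.2f} per 1M output tokens")
```

Under that assumed node price, the result lands at roughly $0.20 per million output tokens, consistent with the figure quoted above.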
|
<div class="card mt-6"> |
|
<h3 class="subsection-title">Core SGLang Enhancements</h3> |
|
<ul> |
|
<li>Support for Prefill-Decode (PD) Disaggregation.</li> |
|
<li>Large-scale Expert Parallelism (EP), including DeepEP, DeepGEMM, and EPLB.</li> |
|
<li>Open-source implementation for community access and development.</li> |
|
</ul> |
|
</div> |
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Parallelism Design Strategies</h2> |
|
<div class="grid md:grid-cols-2 gap-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">Attention Layers (MLA)</h3> |
|
<p class="text-gray-700">Utilizes <span class="highlight">DP Attention</span> (Data Parallelism):</p> |
|
<ul> |
|
<li>Eliminates KV cache duplication across devices.</li> |
|
<li>Significantly reduces memory overhead.</li> |
|
<li>Supports hybrid data and tensor parallelism for flexibility.</li> |
|
</ul> |
|
</div> |
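The memory argument for DP attention can be made concrete with a toy calculation. MLA's compressed KV cache is shared across heads, so plain TP replicates it on every rank, whereas DP attention shards requests so each GPU stores only its own requests' cache. All sizes below are assumed for illustration.

```python
# Illustrative KV-cache accounting for MLA (toy numbers, not measured).
GPUS = 8
REQUESTS = 64
KV_PER_REQUEST_GB = 0.5   # ASSUMED per-request compressed KV-cache size

# Plain TP attention: MLA's KV cache cannot be sharded by heads, so every
# GPU ends up holding the cache of every request.
tp_total_gb = GPUS * REQUESTS * KV_PER_REQUEST_GB

# DP attention: requests are partitioned across GPUs; each GPU stores only
# the KV cache of its own shard of requests.
dp_total_gb = REQUESTS * KV_PER_REQUEST_GB
```

The aggregate cache footprint shrinks by a factor equal to the number of GPUs, which is the duplication DP attention eliminates.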
|
<div class="card"> |
|
<h3 class="subsection-title">Dense FFNs</h3> |
|
<p class="text-gray-700">Adopts <span class="highlight">Data Parallelism (DP)</span> over Tensor Parallelism (TP):</p> |
|
<ul> |
|
<li><span class="font-semibold">Enhanced Scalability:</span> Avoids fragmentation and ensures balanced workloads.</li> |
|
<li><span class="font-semibold">Optimized Memory Efficiency:</span> With DP attention, a lower TP degree often already minimizes per-device memory, making DP the favorable choice.</li>
|
<li><span class="font-semibold">Minimized Communication:</span> Reduces all-reduce operations by 50% compared to pure TP.</li> |
|
</ul> |
|
</div> |
|
<div class="card"> |
|
<h3 class="subsection-title">Sparse FFNs (Mixture of Experts)</h3> |
|
<p class="text-gray-700">Implements <span class="highlight">Expert Parallelism (EP)</span>:</p> |
|
<ul> |
|
<li>Distributes expert weights across multiple devices.</li> |
|
<li>Scales memory capacity effectively.</li> |
|
<li>Addresses challenges like irregular communication and workload imbalance using DeepEP.</li> |
|
</ul> |
|
</div> |
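The workload-imbalance problem that EP introduces can be sketched in a few lines: route each token to its top-k experts and count how many tokens land on each EP rank. The random gate below is a stand-in for a learned router; all sizes are assumed for the sketch.

```python
import random

# Toy EP routing sketch: count per-rank load after top-k expert selection.
random.seed(0)
NUM_EXPERTS, TOP_K, EP_RANKS, TOKENS = 32, 2, 4, 1000
experts_per_rank = NUM_EXPERTS // EP_RANKS

load = [0] * EP_RANKS
for _ in range(TOKENS):
    # Stand-in for a learned gating network: pick TOP_K distinct experts.
    for expert in random.sample(range(NUM_EXPERTS), TOP_K):
        load[expert // experts_per_rank] += 1

# 1.0 means perfectly balanced; lower values mean some rank is a hotspot.
balancedness = (sum(load) / len(load)) / max(load)
```

Even with a uniform random gate the ranks are not perfectly balanced; real routers are far more skewed, which is the imbalance DeepEP's dispatch and EPLB (below) are built to handle.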
|
<div class="card"> |
|
<h3 class="subsection-title">LM Head</h3> |
|
<p class="text-gray-700">Employs <span class="highlight">Data Parallelism (DP)</span>:</p> |
|
<ul> |
|
<li>Mirrors the strategy for dense FFNs.</li> |
|
<li>Reduces memory overhead for large vocabulary computations.</li> |
|
<li>Simplifies communication across devices.</li> |
|
</ul> |
|
</div> |
|
</div> |
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Prefill & Decode (PD) Disaggregation</h2> |
|
<div class="card"> |
|
<p class="text-gray-700 mb-4">LLM inference has two phases: computation-intensive <span class="font-semibold">Prefill</span> and memory-intensive <span class="font-semibold">Decode</span>. Unified scheduling is inefficient.</p> |
|
<h3 class="subsection-title">Problems with Unified Scheduling:</h3> |
|
<ul> |
|
<li>Prefill batches interrupt ongoing decode batches, delaying token generation.</li>

<li>Under DP attention, imbalanced prefill/decode workloads across DP workers increase decode latency.</li>

<li>A unified scheduler cannot use both of DeepEP's dispatch modes (normal for prefill, low-latency for decode) within one batch.</li>
|
</ul> |
|
<h3 class="subsection-title mt-4">SGLang's PD Disaggregation Solution:</h3> |
|
<div class="flex-container my-4 p-4 bg-blue-50 rounded-lg"> |
|
<div class="flow-item">Input Request</div> |
|
<div class="arrow">➔</div> |
|
<div class="flow-item">Prefill Server<br/>(Computes KV Cache)</div> |
|
<div class="arrow">➔</div> |
|
<div class="flow-item">Data Transfer (RDMA)</div> |
|
<div class="arrow">➔</div> |
|
<div class="flow-item">Decode Server<br/>(Iterative Token Gen)</div> |
|
</div> |
|
<p class="text-gray-700">This separation allows tailored optimizations for each phase, maximizing GPU utilization.</p> |
|
<h4 class="font-semibold text-gray-800 mt-3 mb-1">Key Implementation Details:</h4> |
|
<ul> |
|
<li><span class="highlight">Non-blocking Transfer:</span> Background data send/receive.</li> |
|
<li><span class="highlight">RDMA-Based Transfer:</span> Efficient for non-contiguous memory.</li> |
|
<li><span class="highlight">Flexible API Integration:</span> Supports Mooncake, NIXL.</li> |
|
</ul> |
|
</div> |
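The flow diagrammed above can be sketched as three stages. The function names and data shapes are illustrative, not SGLang's actual API: prefill runs one forward pass over the whole prompt and produces the KV cache, the cache is handed off (non-blocking RDMA in the real system), and decode generates tokens iteratively against it.

```python
# Minimal sketch of the PD-disaggregated pipeline (names are hypothetical).

def prefill(prompt_tokens):
    """Prefill server: one compute-heavy pass over the full prompt."""
    kv_cache = [("k", t) for t in prompt_tokens]  # stand-in for real KV tensors
    first_token = len(prompt_tokens)              # stand-in for a sampled token
    return kv_cache, first_token

def transfer(kv_cache):
    """In SGLang this is a background RDMA send; here just a copy/handoff."""
    return list(kv_cache)

def decode(kv_cache, token, steps):
    """Decode server: memory-bound iterative generation, one token per step."""
    out = [token]
    for _ in range(steps):
        kv_cache.append(("k", out[-1]))           # cache grows each step
        out.append(out[-1] + 1)                   # stand-in for sampling
    return out

kv, tok = prefill([1, 2, 3])
generated = decode(transfer(kv), tok, steps=4)
```

Because the two servers never share a batch, each can be scheduled and optimized for its own bottleneck (compute for prefill, memory bandwidth for decode).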
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Large-scale Expert Parallelism Optimizations</h2> |
|
<div class="space-y-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">Expert Parallelism with DeepEP</h3> |
|
<p class="text-gray-700">DeepEP streamlines EP by efficiently routing tokens to experts across GPUs.</p> |
|
<p class="text-gray-700 mt-2"><span class="highlight">Normal Dispatch:</span> For prefill (long inputs, max throughput). Incompatible with CUDA Graph.</p> |
|
<p class="text-gray-700 mt-1"><span class="highlight">Low-Latency Dispatch:</span> For decode (output tokens, min delay). Supports CUDA Graph.</p> |
|
<p class="text-gray-700 mt-2">SGLang's <span class="font-semibold">PD Disaggregation</span> enables using both modes effectively with DP Attention.</p> |
|
</div> |
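The phase-to-mode pairing above reduces to a simple decision rule once prefill and decode run on separate servers. The function and field names below are assumptions for illustration, not DeepEP's API.

```python
# Hypothetical mode selection: PD disaggregation lets each server commit
# to the dispatch mode that suits its phase (names are illustrative).
def pick_dispatch_mode(phase):
    if phase == "prefill":
        # Long inputs: maximize throughput; CUDA Graph not usable here.
        return {"mode": "normal", "cuda_graph": False}
    if phase == "decode":
        # Short iterative steps: minimize latency; CUDA Graph compatible.
        return {"mode": "low_latency", "cuda_graph": True}
    raise ValueError(f"unknown phase: {phase}")
```

In a unified scheduler one batch may mix both phases, so neither mode fits; with disaggregation the choice is static per server.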
|
|
|
<div class="card"> |
|
<h3 class="subsection-title">DeepGEMM Integration</h3> |
|
<p class="text-gray-700">Optimizes MoE matrix multiplications (Grouped GEMMs).</p> |
|
<p class="text-gray-700 mt-2"><span class="highlight">Contiguous Layout Kernel:</span> For prefill (dynamic shapes). Used with DeepEP's Normal Dispatch (requires permutation).</p> |
|
<p class="text-gray-700 mt-1"><span class="highlight">Masked Layout Kernel:</span> For decode (fixed shapes, CUDA Graph compatible). Used with DeepEP's Low-Latency Dispatch.</p> |
|
</div> |
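The two layouts can be illustrated with toy shapes (all sizes below are assumed for the sketch, not taken from DeepGEMM): the contiguous layout permutes tokens so each expert's rows sit adjacently in one variable-length buffer, while the masked layout pads each expert to a fixed capacity so tensor shapes never change between steps.

```python
import numpy as np

# Toy grouped-GEMM layouts; identity matrices stand in for expert weights.
HIDDEN, CAPACITY = 4, 8
counts = [3, 5, 2]                          # tokens routed to each expert
weights = [np.eye(HIDDEN) for _ in counts]
x = np.arange(sum(counts) * HIDDEN, dtype=float).reshape(-1, HIDDEN)
offsets = np.cumsum([0] + counts)

# Contiguous layout (prefill): one [sum(counts), HIDDEN] buffer with
# per-expert offsets; shapes vary batch to batch.
contig_out = np.concatenate(
    [x[offsets[i]:offsets[i + 1]] @ weights[i] for i in range(len(counts))])

# Masked layout (decode): fixed [experts, CAPACITY, HIDDEN] buffer where
# only the first counts[i] rows per expert are valid -- static shapes,
# hence CUDA Graph friendly.
padded = np.zeros((len(counts), CAPACITY, HIDDEN))
for i, c in enumerate(counts):
    padded[i, :c] = x[offsets[i]:offsets[i + 1]]
masked_out = np.stack([padded[i] @ weights[i] for i in range(len(counts))])
```

The permutation cost of the contiguous layout is paid only in prefill, where batches are large; decode pays padding instead, trading a little wasted compute for graph capture.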
|
|
|
<div class="card"> |
|
<h3 class="subsection-title">Two-batch Overlap (TBO)</h3> |
|
<p class="text-gray-700">Splits a batch into two micro-batches to <span class="highlight">overlap computation and communication</span>.</p> |
|
<ul> |
|
<li>Lowers peak memory usage.</li> |
|
<li>Addresses limited communication bandwidth in multi-node setups.</li> |
|
<li>SGLang uses an abstraction layer (operations & yield points) for clean implementation.</li> |
|
<li>Launch order during prefill is tuned so that DeepEP's dispatch does not block the CPU.</li>
|
</ul> |
|
</div> |
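The operations-and-yield-points abstraction can be sketched with Python generators (a hypothetical miniature, not SGLang's implementation): each micro-batch yields at its communication points, so the scheduler runs micro-batch B's compute while micro-batch A's dispatch is in flight, and vice versa.

```python
# Toy two-batch overlap: generators yield where communication would occur.

def layer(batch_id, log):
    log.append((batch_id, "attn"))  # compute phase
    yield                           # MoE dispatch (communication) in flight
    log.append((batch_id, "moe"))   # compute phase
    yield                           # MoE combine (communication) in flight

def run_overlapped(num_layers):
    log = []
    for _ in range(num_layers):
        a, b = layer("A", log), layer("B", log)
        # Interleave the two micro-batches at every yield point.
        for gen in (a, b, a, b, a, b):
            next(gen, None)
    return log

schedule = run_overlapped(num_layers=1)
```

The resulting schedule alternates A and B: while A waits on dispatch, B's attention runs, which is exactly the compute/communication overlap TBO exploits.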
|
|
|
<div class="card"> |
|
<h3 class="subsection-title">Expert Parallelism Load Balancer (EPLB)</h3> |
|
<p class="text-gray-700">Addresses uneven workload distribution in MoE models.</p> |
|
<ul> |
|
<li>Computes optimal expert arrangement to minimize imbalance.</li> |
|
<li>Uses redundant experts (e.g., 288 physical experts instead of 256) for flexible placement.</li>

<li>Enables flexible parallelism sizes (e.g., EP over 12 or 72 GPUs, since 288 is divisible by both).</li>
|
<li>SGLang implements efficient, non-disruptive rebalancing.</li> |
|
</ul> |
|
<p class="mt-2 text-gray-600">Effectiveness depends on matching input distribution to serving workload (achieved via larger batches or periodic rebalancing).</p> |
|
</div> |
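The rebalancing idea can be sketched with a simple greedy placement (an illustrative algorithm, not the actual EPLB solver): replicate the hottest experts to fill the redundant slots, then assign replicas to ranks heaviest-first, always onto the currently least-loaded rank.

```python
import heapq

# Toy EPLB-style placement: split hot experts across redundant replicas,
# then greedily pack replicas onto ranks to minimize the maximum load.
def rebalance(expert_load, num_ranks, num_slots):
    replicas = [(load, e) for e, load in enumerate(expert_load)]
    # Fill redundant slots by halving the currently hottest replica.
    for _ in range(num_slots - len(expert_load)):
        replicas.sort(reverse=True)
        load, e = replicas[0]
        replicas[0] = (load / 2, e)
        replicas.append((load / 2, e))
    # Greedy bin packing: heaviest replica onto the least-loaded rank.
    ranks = [(0.0, r, []) for r in range(num_ranks)]
    heapq.heapify(ranks)
    for load, e in sorted(replicas, reverse=True):
        total, r, placed = heapq.heappop(ranks)
        heapq.heappush(ranks, (total + load, r, placed + [e]))
    return ranks

placement = rebalance([100, 10, 10, 10], num_ranks=2, num_slots=6)
```

With one dominant expert, replication plus greedy packing brings the two ranks close to even, which a plain one-expert-per-slot layout cannot achieve.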
|
</div> |
|
</section> |
|
|
|
<section class="mb-10"> |
|
<h2 class="section-title">Evaluation Highlights</h2> |
|
<div class="grid md:grid-cols-2 gap-6"> |
|
<div class="card"> |
|
<h3 class="subsection-title">Prefill Phase Performance</h3> |
|
<p class="text-gray-700">On 4 nodes (32 H100s, EP32):</p> |
|
<p>Up to <span class="metric">3.3x improvement</span> over TP16 baseline.</p> |
|
<p>Throughput within <span class="comparison-metric">5.6% of DeepSeek's official profile</span> (assuming perfect balance).</p> |
|
<p class="mt-1">Example: <span class="highlight">50,302 tokens/s per node</span> for 4K prompts.</p> |
|
</div> |
|
<div class="card"> |
|
<h3 class="subsection-title">Decode Phase Performance</h3> |
|
<p class="text-gray-700">On 9 nodes (72 H100s, EP72):</p> |
|
<p><span class="metric">5.2x speedup</span> over TP16 baseline.</p> |
|
<p>With simulated MTP, throughput <span class="comparison-metric">6.6% below DeepSeek's profile</span>.</p> |
|
<p class="mt-1">Example: <span class="highlight">22,282 tokens/s per node</span> for 2K inputs.</p> |
|
</div> |
|
</div> |
|
|
|
<div class="card mt-6"> |
|
<h3 class="subsection-title">Ablation Study: Two-batch Overlap (TBO)</h3> |
|
<p class="text-gray-700"><span class="font-semibold">Prefill:</span></p> |
|
<ul> |
|
<li>Supports larger batch sizes (e.g., 16k tokens/device with TBO, whereas runs without TBO hit OOM at 8k).</li>
|
<li><span class="metric">27-35% throughput increase</span> by overlapping computation & communication.</li> |
|
</ul> |
|
<p class="text-gray-700 mt-3"><span class="font-semibold">Decode:</span></p> |
|
<ul> |
|
<li>Speedup depends on batch size (e.g., <span class="metric">25.5% at 256 tokens/device</span>).</li>

<li>Largest speedup (<span class="metric">35%</span>) in the simulated-MTP setting, where attention computation is longer.</li>
|
</ul> |
|
</div> |
|
|
|
<div class="card mt-6"> |
|
<h3 class="subsection-title">Ablation Study: EPLB</h3> |
|
<p class="text-gray-700">Delivers significant speedup by mitigating workload imbalance:</p> |
|
<ul> |
|
<li>Prefill: <span class="metric">1.49x speedup</span>.</li> |
|
<li>Decode: <span class="metric">2.54x speedup</span>.</li> |
|
</ul> |
|
<p class="text-gray-700 mt-2">Strong correlation between <span class="highlight">workload balancedness and overall throughput</span>.</p> |
|
<p class="text-gray-700 mt-2">Prefill and decode favor different expert distributions, so PD disaggregation also enables phase-specific expert placement.</p>
|
</div> |
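The balancedness metric referenced above can be stated in one line, following the mean-over-max notion of balance used throughout this summary (the example loads below are made-up numbers for illustration):

```python
# Balancedness: mean load over max load across GPUs; 1.0 = perfectly balanced.
def balancedness(tokens_per_gpu):
    return (sum(tokens_per_gpu) / len(tokens_per_gpu)) / max(tokens_per_gpu)

before = balancedness([900, 100, 60, 140])   # hypothetical skewed, pre-EPLB load
after = balancedness([330, 280, 290, 300])   # hypothetical post-rebalancing load
```

Since the slowest (most loaded) GPU gates the whole step, raising balancedness translates almost directly into throughput, matching the strong correlation noted above.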
|
</section> |
|
|
|
<section class="mb-6"> |
|
<h2 class="section-title">Conclusion</h2> |
|
<div class="card"> |
|
<p class="text-gray-700 leading-relaxed"> |
|
By integrating Prefill-Decode Disaggregation with large-scale Expert Parallelism techniques (DeepEP, DeepGEMM, TBO, and EPLB), SGLang serves the large DeepSeek model on H100 GPUs at performance close to official reports while substantially reducing cost.

Because these components are open source, the community can build on them for efficient large-scale LLM serving.
|
</p> |
|
</div> |
|
</section> |
|
|
|
<footer class="text-center mt-12 py-6 border-t border-gray-300"> |
|
<p class="text-gray-600">Visual summary generated based on "Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs" by The SGLang Team.</p> |
|
</footer> |
|
</div></body> |
|
</html> |
|
|