⚡ nano-vLLM: Lightweight, Low-Latency LLM Inference from Scratch

Introduction: What Is Inference in LLMs?
When you hear “ChatGPT responding” or “LLM generating text,” you’re witnessing inference.
Inference is the process of using a trained model to make predictions or generate outputs.
In LLMs, inference means:
- Taking your prompt
- Running it through billions of weights
- Getting a smart and relevant output
But here's the catch: inference is slow, resource-hungry, and often not optimized for edge or personal devices.
This is where optimization tools like vLLM — and now, nano-vLLM — come into play.
Why Inference Optimization Matters
Large models (even 1B+ parameters) tend to:
- Consume a lot of VRAM
- Introduce latency, especially in long generations
- Require massive infrastructure for production
We want:
- Fast token generation
- Low memory footprint
- Parallel request handling
- Feasibility on laptops, Colab, and edge devices
vLLM: The Inference Giant
vLLM is a production-grade inference engine originally built by researchers at UC Berkeley.
Key Strengths:
- PagedAttention for virtual memory-efficient KV caching
- Continuous batching for parallel prompt handling
- Prefill + Decode parallelism
- Tensor Parallelism for multi-GPU inference
- Used in HuggingFace Inference API
Challenges:
- Complex, heavy codebase (~10K+ LOC)
- Uses C++, CUDA extensions
- Harder to modify and learn from
- Not beginner or Colab-friendly
Introducing nano-vLLM: A Lightweight Rebuild
nano-vLLM is a minimal reimplementation of vLLM — just ~1200 lines of clean Python built for:
- Understanding
- Hacking
- Running on limited hardware
Think of it as vLLM’s tiny, readable sibling — yet remarkably fast and useful.
Highlights of nano-vLLM
Feature | Detail |
---|---|
Tiny Codebase | ~1.2k lines |
Pure Python & Triton | Easy to modify |
CUDA Graph + torch.compile | For faster decoding |
Flash Attention Support | Optional |
Runs on Laptops/Colab | Yes |
Tensor Parallel Support | Basic |
No C++/CUDA extensions | Simpler installs |
Hackable for research | Great for learning |
How nano-vLLM Works (Internals)
Let’s break down the core pieces of the nano-vLLM engine:
1. Prompt Tokenization
- Uses HuggingFace tokenizer
- Supports multiple prompts (batching)
- Splits generation into `prefill` and `decode` phases
2. KV Cache Management
- Implements a Triton kernel: `store_kvcache_kernel`
- Efficient key/value attention memory storage
- Supports prefix caching
3. Flash Attention
- Uses `flash-attn` for speed (if installed)
- Falls back to standard attention when it isn't
4. Decode Engine
- Reuses cached values for fast step-wise generation
- Wraps the decoding loop in `torch.cuda.graph()` when possible
5. SamplingParams
- Implements `temperature`, `top_k`, and `max_tokens`
- No unnecessary abstractions; it just works
6. Tensor Parallelism
- Lightweight `torch.distributed` wrapper
- Splits the model across multiple GPUs (optional)
How nano-vLLM Works Under the Hood (Deep Dive for ML Developers)
nano-vLLM simplifies many of vLLM’s advanced concepts while preserving performance-critical components. Here's a breakdown of its internals:
1. Prompt Tokenization and Input Formatting
nano-vLLM uses Hugging Face tokenizers to preprocess input text. During tokenization:
- Inputs are batched and packed using `cu_seqlens` (cumulative sequence lengths) to support variable-length sequences.
- The engine distinguishes between:
  - Prefill phase: when the KV cache is being initialized from the prompt
  - Decode phase: when generation happens token by token
This separation enables more efficient handling of multi-turn or streamed generation.
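To make the packing concrete, here is a minimal sketch (not nano-vLLM's actual code; the model name and shapes are only illustrative) of flattening a batch of prompts into one token tensor plus a `cu_seqlens` offset vector:

```python
# Illustrative packing of variable-length prompts for varlen attention.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
prompts = ["Hello nano-vLLM!", "Explain KV caching in one sentence."]

# Tokenize each prompt separately (no padding), then concatenate into one tensor.
token_lists = [tokenizer(p)["input_ids"] for p in prompts]
input_ids = torch.tensor([tok for seq in token_lists for tok in seq], dtype=torch.long)

# cu_seqlens[i] is the start offset of sequence i in the packed tensor;
# the last entry is the total token count.
seq_lens = torch.tensor([len(seq) for seq in token_lists], dtype=torch.int32)
cu_seqlens = torch.zeros(len(token_lists) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)

print(input_ids.shape, cu_seqlens)  # e.g. torch.Size([N]) and tensor([0, n1, n1+n2])
```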
2. KV Cache: Custom Memory Management
The KV (Key-Value) cache stores the attention keys and values of previously processed tokens, allowing the model to:
- Reuse previous context in the decode phase
- Avoid redundant computation
In nano-vLLM:
- A Triton kernel (`store_kvcache_kernel`) writes keys and values into a preallocated cache efficiently.
- Cache slots are mapped using a `slot_mapping` tensor, avoiding Python-level indexing.
- Cache layout: `[batch_size, num_heads, head_dim]` → `[total_slots, head_dim]` (flattened for performance)
This design mimics PagedAttention from vLLM but stays readable and modifiable.
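The core idea is easiest to see without Triton. Below is an illustrative pure-PyTorch version of the slot-mapped cache write (nano-vLLM does this inside `store_kvcache_kernel`; the shapes and names here are simplified assumptions):

```python
import torch

num_slots, num_heads, head_dim = 1024, 8, 64

# Preallocated, flattened KV cache: one row per cache slot.
k_cache = torch.zeros(num_slots, num_heads * head_dim)
v_cache = torch.zeros(num_slots, num_heads * head_dim)

def store_kvcache(k, v, slot_mapping):
    """Write this step's keys/values into the cache rows given by slot_mapping.

    k, v:         [num_tokens, num_heads, head_dim]
    slot_mapping: [num_tokens] long tensor of destination slot indices
    """
    k_cache[slot_mapping] = k.reshape(k.shape[0], -1)
    v_cache[slot_mapping] = v.reshape(v.shape[0], -1)

# Example: three new tokens land in slots 17, 18 and 512.
k = torch.randn(3, num_heads, head_dim)
v = torch.randn(3, num_heads, head_dim)
store_kvcache(k, v, torch.tensor([17, 18, 512]))
```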
3. ⚡ Flash Attention (v2 Compatible)
If `flash-attn` is installed, nano-vLLM:
- Calls `flash_attn_varlen_func` during the prefill phase
- Calls `flash_attn_with_kvcache` during decode
FlashAttention v2:
- Reduces memory usage by avoiding the materialization of attention matrices
- Computes softmax attention in fused GPU kernels
- Tiles the computation into blocks to make better use of the GPU memory hierarchy
A fallback to standard attention is also supported for environments where `flash-attn` isn't installed.
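A hedged sketch of that dispatch for the prefill phase might look like the following; it mirrors the idea described above rather than nano-vLLM's exact code path, and the fallback uses PyTorch's `scaled_dot_product_attention`:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_varlen_func
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False

def prefill_attention(q, k, v, cu_seqlens, max_seqlen, scale):
    # q, k, v: [total_tokens, num_heads, head_dim], packed across sequences.
    if HAS_FLASH:
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
            softmax_scale=scale, causal=True,
        )
    # Fallback: run each packed sequence separately with standard SDPA.
    outs = []
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        qi, ki, vi = (t[s:e].transpose(0, 1).unsqueeze(0) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=True, scale=scale)
        outs.append(oi.squeeze(0).transpose(0, 1))
    return torch.cat(outs)
```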
4. torch.compile + CUDA Graphs
For the decode phase, where tokens are generated one at a time:
- nano-vLLM wraps the generation loop in a CUDA Graph (if supported)
- Also uses `torch.compile()` to fuse operations and reduce Python overhead
This yields:
- Stable memory allocation
- Better kernel launch efficiency
- Reduced latency for single-token decoding
This is especially useful on consumer GPUs like T4 or RTX 30/40 series.
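For illustration, here is a minimal CUDA Graph capture of a stand-in decode step (a generic PyTorch pattern, not nano-vLLM's engine code; it needs a CUDA GPU, and the real loop would run the model rather than the toy function):

```python
import torch

def decode_step(x):
    # Stand-in for "run one token through the model"; nano-vLLM additionally
    # applies torch.compile to the real model forward.
    return x * 2.0 + 1.0

static_in = torch.zeros(1, 1024, device="cuda")
static_out = torch.zeros_like(static_in)

# Warm up on a side stream so capture sees already-initialized kernels.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out.copy_(decode_step(static_in))
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph with fixed input/output buffers.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out.copy_(decode_step(static_in))

# Each later step just refreshes the static input and replays the graph,
# avoiding per-step kernel launch and Python overhead.
static_in.copy_(torch.randn_like(static_in))
graph.replay()
print(static_out[0, :4])
```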
5. SamplingParams: Clean, Minimalistic Sampling API
The `SamplingParams` class supports:
- `temperature`: Controls randomness
- `top_k`: Top-k filtering
- `max_tokens`: Token budget per request
- `stop_tokens`: Optional stop sequence enforcement
The sampling logic is implemented efficiently using PyTorch tensor ops:
- Logits are filtered using top-k
- Softmax is applied after temperature scaling
- `torch.multinomial` is used to sample the next token
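As a sketch of those three steps (illustrative, not nano-vLLM's exact implementation):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50) -> torch.Tensor:
    """logits: [batch_size, vocab_size] raw scores for the next token."""
    if temperature == 0.0:
        return logits.argmax(dim=-1)           # greedy decoding
    logits = logits / temperature               # temperature scaling
    if top_k > 0:
        topk_vals, _ = torch.topk(logits, k=min(top_k, logits.shape[-1]), dim=-1)
        cutoff = topk_vals[..., -1, None]       # smallest logit kept per row
        logits = logits.masked_fill(logits < cutoff, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Example: batch of 2 "vocab-8" distributions
next_ids = sample_next_token(torch.randn(2, 8), temperature=0.7, top_k=3)
print(next_ids)
```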
6. Lightweight Tensor Parallelism
nano-vLLM supports basic tensor parallelism using `torch.distributed`:
- Splits model weights across multiple GPUs
- Each GPU holds a slice of the linear projections in the attention/MLP blocks
- Final output is gathered across GPUs
While not as feature-rich as DeepSpeed or vLLM’s NCCL sharding, it works well for small to medium models in research or Colab settings.
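The basic building block of this style of tensor parallelism is a column-parallel linear layer. The sketch below is illustrative (the class name and layout are assumptions, not nano-vLLM's actual modules); with a single process it degenerates to a plain `nn.Linear`:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        assert out_features % world_size == 0
        # Each rank owns out_features // world_size output columns.
        self.local = nn.Linear(in_features, out_features // world_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)                       # [..., out / world_size]
        if not dist.is_initialized() or dist.get_world_size() == 1:
            return local_out
        chunks = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(chunks, local_out)              # collect every rank's slice
        return torch.cat(chunks, dim=-1)                # [..., out_features]

# Single-process usage (behaves like a plain nn.Linear):
layer = ColumnParallelLinear(16, 32)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 32])
```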
7. Modular Design and Simplicity
The architecture is broken into clean modules:
- `llm_engine.py`: Main inference orchestrator
- `layers/`: Custom attention, MLP, rotary embeddings
- `utils/context.py`: Global inference state management (e.g., block tables, cache length)
There is no C++/CUDA custom extension involved — everything is pure Python + Triton, making it highly readable and customizable.
🛠 Summary
Component | Technique / Module Used |
---|---|
Tokenization | HuggingFace tokenizer + cu_seqlens batching |
KV Caching | Triton kernel with slot mapping |
Attention | Flash Attention v2 or custom fallback |
Sampling | PyTorch-based multinomial sampling |
Speed Optimizations | torch.compile + CUDA Graph (decode phase) |
Parallelism | Basic tensor parallelism with torch.distributed |
Code Simplicity | ~1.2k LOC, no C++ or opaque abstractions |
This modular and deeply educational structure makes nano-vLLM an excellent choice for:
- LLM engineers exploring inference
- Researchers trying new decoding algorithms
- Students learning systems-level ML
📊 Benchmarks (RTX 4070 Laptop)
Engine | Tokens Generated | Time (s) | Throughput (tokens/s) |
---|---|---|---|
vLLM | 133,966 | 98.37 | 1,361.84 |
nano-vLLM | 133,966 | 93.41 | 1,434.13 |
🚀 Faster than vLLM on the same setup.
No magic — just smart caching, Triton, and lean architecture.
🔄 nano-vLLM vs vLLM: Key Differences
Feature | vLLM | nano-vLLM |
---|---|---|
Codebase Size | ~10k+ LOC | ~1.2k LOC |
Language | C++ + Python + CUDA | Python + Triton |
Flash Attention | Built-in | Optional |
Tensor Parallelism | Advanced | Basic |
KV Caching | PagedAttention | Manual Triton kernel |
Compatibility | Full HF | HF (subset, tested) |
Quantization | External only | Coming soon via community |
Ideal Use Case | Production servers | Research, Colab, on-device |
Real Use Case: Run on Google Colab (T4 GPU)
```bash
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B
```

```python
from nanovllm import LLM, SamplingParams

llm = LLM("./Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

output = llm.generate(["Hello nano-vLLM!"], sampling_params)
print(output[0]["text"])
```
Final Thoughts
nano-vLLM is more than just a mini-inference engine — it's a powerful tool to learn, adapt, and deploy large language models entirely on your terms.
Whether you're:
- A researcher diving into LLM internals
- A hacker crafting tools for the edge
- An engineer chasing performance on small GPUs
nano-vLLM is an open-source ally: flexible, nimble, and built to empower your ideas.
🌐 Useful Link
- 🔗 GitHub: [nano-vLLM Repository](https://github.com/GeeeekExplorer/nano-vllm)