⚡ nano-vLLM: Lightweight, Low-Latency LLM Inference from Scratch

Introduction: What Is Inference in LLMs?
When you hear “ChatGPT responding” or “LLM generating text,” you’re witnessing inference.
Inference is the process of using a trained model to make predictions or generate outputs.
In LLMs, inference means:
- Taking your prompt
- Running it through billions of weights
- Getting a smart and relevant output
But here's the catch: inference is slow, resource-hungry, and often not optimized for edge or personal devices.
This is where optimization tools like vLLM — and now, nano-vLLM — come into play.
Why Inference Optimization Matters
Large models (even 1B+ parameters) tend to:
- Consume a lot of VRAM
- Introduce latency, especially in long generations
- Require massive infrastructure for production
We want:
- Fast token generation
- Low memory footprint
- Parallel request handling
- Feasibility on laptops, Colab, and edge devices
vLLM: The Inference Giant
vLLM is a production-grade inference engine originally built by researchers at UC Berkeley.
Key Strengths:
- PagedAttention for virtual memory-efficient KV caching
- Continuous batching for parallel prompt handling
- Prefill + Decode parallelism
- Tensor Parallelism for multi-GPU inference
- Used in HuggingFace Inference API
Challenges:
- Complex, heavy codebase (~10K+ LOC)
- Uses C++, CUDA extensions
- Harder to modify and learn from
- Not beginner or Colab-friendly
Introducing nano-vLLM: A Lightweight Rebuild
nano-vLLM is a minimal reimplementation of vLLM — just ~1200 lines of clean Python built for:
- Understanding
- Hacking
- Running on limited hardware
Think of it as vLLM’s tiny, readable sibling — yet remarkably fast and useful.
Highlights of nano-vLLM
Feature | Detail |
---|---|
Tiny Codebase | ~1.2k lines |
Pure Python & Triton | Easy to modify |
CUDA Graph + torch.compile | For faster decoding |
Flash Attention Support | Optional |
Runs on Laptops/Colab | Yes |
Tensor Parallel Support | Basic |
No C++/CUDA extensions | Simpler installs |
Hackable for research | Great for learning |
How nano-vLLM Works (Internals)
Let’s break down the core pieces of the nano-vLLM engine:
1. Prompt Tokenization
- Uses HuggingFace tokenizer
- Supports multiple prompts (batching)
- Splits generation into `prefill` and `decode` phases
2. KV Cache Management
- Implements a Triton kernel: `store_kvcache_kernel`
- Efficient key/value attention memory storage
- Supports prefix caching
3. Flash Attention
- Uses `flash-attn` for speed (if installed)
- Falls back to standard attention when it isn't
4. Decode Engine
- Reuses cached values for fast step-wise generation
- Wraps the decoding loop in `torch.cuda.graph()` when possible
5. SamplingParams
- Implements `temperature`, `top_k`, and `max_tokens`
- No unnecessary abstractions; it just works
6. Tensor Parallelism
- Lightweight `torch.distributed` wrapper
- Splits the model across multiple GPUs (optional)
How nano-vLLM Works Under the Hood (Deep Dive for ML Developers)
nano-vLLM simplifies many of vLLM’s advanced concepts while preserving performance-critical components. Here's a breakdown of its internals:
1. Prompt Tokenization and Input Formatting
nano-vLLM uses Hugging Face tokenizers to preprocess input text. During tokenization:
- Inputs are batched and packed using `cu_seqlens` (cumulative sequence lengths) to support variable-length sequences.
- The engine distinguishes between:
  - Prefill phase: when the KV cache is being initialized from the prompt
  - Decode phase: when generation happens token by token
This separation enables more efficient handling of multi-turn or streamed generation.
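To make the packing concrete, here is a minimal sketch (not nano-vLLM's actual code; the model name and shapes are only illustrative) of flattening a batch of prompts into one token tensor plus a `cu_seqlens` offset vector:

```python
# Illustrative packing of variable-length prompts for varlen attention.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
prompts = ["Hello nano-vLLM!", "Explain KV caching in one sentence."]

# Tokenize each prompt separately (no padding), then concatenate into one tensor.
token_lists = [tokenizer(p)["input_ids"] for p in prompts]
input_ids = torch.tensor([tok for seq in token_lists for tok in seq], dtype=torch.long)

# cu_seqlens[i] is the start offset of sequence i in the packed tensor;
# the last entry is the total token count.
seq_lens = torch.tensor([len(seq) for seq in token_lists], dtype=torch.int32)
cu_seqlens = torch.zeros(len(token_lists) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)

print(input_ids.shape, cu_seqlens)  # e.g. torch.Size([N]) and tensor([0, n1, n1+n2])
```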
2. KV Cache: Custom Memory Management
The KV (Key-Value) cache stores the attention keys and values of previously processed tokens, allowing the model to:
- Reuse previous context in the decode phase
- Avoid redundant computation
In nano-vLLM:
- A Triton kernel (`store_kvcache_kernel`) writes keys and values into a preallocated cache efficiently.
- Cache slots are mapped using a `slot_mapping` tensor, avoiding Python-level indexing.
- Cache layout: `[batch_size, num_heads, head_dim]` → `[total_slots, head_dim]` (flattened for performance)
This design mimics PagedAttention from vLLM but stays readable and modifiable.
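The core idea is easiest to see without Triton. Below is an illustrative pure-PyTorch version of the slot-mapped cache write (nano-vLLM does this inside `store_kvcache_kernel`; the shapes and names here are simplified assumptions):

```python
import torch

num_slots, num_heads, head_dim = 1024, 8, 64

# Preallocated, flattened KV cache: one row per cache slot.
k_cache = torch.zeros(num_slots, num_heads * head_dim)
v_cache = torch.zeros(num_slots, num_heads * head_dim)

def store_kvcache(k, v, slot_mapping):
    """Write this step's keys/values into the cache rows given by slot_mapping.

    k, v:         [num_tokens, num_heads, head_dim]
    slot_mapping: [num_tokens] long tensor of destination slot indices
    """
    k_cache[slot_mapping] = k.reshape(k.shape[0], -1)
    v_cache[slot_mapping] = v.reshape(v.shape[0], -1)

# Example: three new tokens land in slots 17, 18 and 512.
k = torch.randn(3, num_heads, head_dim)
v = torch.randn(3, num_heads, head_dim)
store_kvcache(k, v, torch.tensor([17, 18, 512]))
```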
3. ⚡ Flash Attention (v2 Compatible)
If `flash-attn` is installed, nano-vLLM:
- Calls `flash_attn_varlen_func` during the prefill phase
- Calls `flash_attn_with_kvcache` during decode
FlashAttention v2:
- Reduces memory usage by avoiding the materialization of attention matrices
- Computes softmax attention in fused GPU kernels
- Tiles the computation into blocks to make better use of the GPU memory hierarchy
A fallback to standard attention is also supported for environments where `flash-attn` isn't installed.
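A hedged sketch of that dispatch for the prefill phase might look like the following; it mirrors the idea described above rather than nano-vLLM's exact code path, and the fallback uses PyTorch's `scaled_dot_product_attention`:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_varlen_func
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False

def prefill_attention(q, k, v, cu_seqlens, max_seqlen, scale):
    # q, k, v: [total_tokens, num_heads, head_dim], packed across sequences.
    if HAS_FLASH:
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
            softmax_scale=scale, causal=True,
        )
    # Fallback: run each packed sequence separately with standard SDPA.
    outs = []
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        qi, ki, vi = (t[s:e].transpose(0, 1).unsqueeze(0) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=True, scale=scale)
        outs.append(oi.squeeze(0).transpose(0, 1))
    return torch.cat(outs)
```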
4. torch.compile + CUDA Graphs
For the decode phase, where tokens are generated one at a time:
- nano-vLLM wraps the generation loop in a CUDA Graph (if supported)
- Also uses `torch.compile()` to fuse operations and reduce Python overhead
This yields:
- Stable memory allocation
- Better kernel launch efficiency
- Reduced latency for single-token decoding
This is especially useful on consumer GPUs like T4 or RTX 30/40 series.
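For illustration, here is a minimal CUDA Graph capture of a stand-in decode step (a generic PyTorch pattern, not nano-vLLM's engine code; it needs a CUDA GPU, and the real loop would run the model rather than the toy function):

```python
import torch

def decode_step(x):
    # Stand-in for "run one token through the model"; nano-vLLM additionally
    # applies torch.compile to the real model forward.
    return x * 2.0 + 1.0

static_in = torch.zeros(1, 1024, device="cuda")
static_out = torch.zeros_like(static_in)

# Warm up on a side stream so capture sees already-initialized kernels.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out.copy_(decode_step(static_in))
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into a graph with fixed input/output buffers.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out.copy_(decode_step(static_in))

# Each later step just refreshes the static input and replays the graph,
# avoiding per-step kernel launch and Python overhead.
static_in.copy_(torch.randn_like(static_in))
graph.replay()
print(static_out[0, :4])
```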
5. SamplingParams: Clean, Minimalistic Sampling API
The `SamplingParams` class supports:
- `temperature`: Controls randomness
- `top_k`: Top-k filtering
- `max_tokens`: Token budget per request
- `stop_tokens`: Optional stop sequence enforcement
The sampling logic is implemented efficiently using PyTorch tensor ops:
- Logits are filtered using top-k
- Softmax is applied after temperature scaling
- `torch.multinomial` is used to sample the next token
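As a sketch of those three steps (illustrative, not nano-vLLM's exact implementation):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50) -> torch.Tensor:
    """logits: [batch_size, vocab_size] raw scores for the next token."""
    if temperature == 0.0:
        return logits.argmax(dim=-1)           # greedy decoding
    logits = logits / temperature               # temperature scaling
    if top_k > 0:
        topk_vals, _ = torch.topk(logits, k=min(top_k, logits.shape[-1]), dim=-1)
        cutoff = topk_vals[..., -1, None]       # smallest logit kept per row
        logits = logits.masked_fill(logits < cutoff, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Example: batch of 2 "vocab-8" distributions
next_ids = sample_next_token(torch.randn(2, 8), temperature=0.7, top_k=3)
print(next_ids)
```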
6. Lightweight Tensor Parallelism
nano-vLLM supports basic tensor parallelism using `torch.distributed`:
- Splits model weights across multiple GPUs
- Each GPU holds a slice of the linear projections in the attention/MLP blocks
- Final output is gathered across GPUs
While not as feature-rich as DeepSpeed or vLLM’s NCCL sharding, it works well for small to medium models in research or Colab settings.
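The basic building block of this style of tensor parallelism is a column-parallel linear layer. The sketch below is illustrative (the class name and layout are assumptions, not nano-vLLM's actual modules); with a single process it degenerates to a plain `nn.Linear`:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        assert out_features % world_size == 0
        # Each rank owns out_features // world_size output columns.
        self.local = nn.Linear(in_features, out_features // world_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)                       # [..., out / world_size]
        if not dist.is_initialized() or dist.get_world_size() == 1:
            return local_out
        chunks = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(chunks, local_out)              # collect every rank's slice
        return torch.cat(chunks, dim=-1)                # [..., out_features]

# Single-process usage (behaves like a plain nn.Linear):
layer = ColumnParallelLinear(16, 32)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 32])
```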
7. Modular Design and Simplicity
The architecture is broken into clean modules:
- `llm_engine.py`: Main inference orchestrator
- `layers/`: Custom attention, MLP, rotary embeddings
- `utils/context.py`: Global inference state management (e.g., block tables, cache length)
There is no C++/CUDA custom extension involved — everything is pure Python + Triton, making it highly readable and customizable.
🛠 Summary
Component | Technique / Module Used |
---|---|
Tokenization | HuggingFace tokenizer + cu_seqlens batching |
KV Caching | Triton kernel with slot mapping |
Attention | Flash Attention v2 or custom fallback |
Sampling | PyTorch-based multinomial sampling |
Speed Optimizations | torch.compile + CUDA Graph (decode phase) |
Parallelism | Basic tensor parallelism with torch.distributed |
Code Simplicity | ~1.2k LOC, no C++ or opaque abstractions |
This modular and deeply educational structure makes nano-vLLM an excellent choice for:
- LLM engineers exploring inference
- Researchers trying new decoding algorithms
- Students learning systems-level ML
📊 Benchmarks (RTX 4070 Laptop)
Engine | Tokens Generated | Time (s) | Throughput (tokens/s) |
---|---|---|---|
vLLM | 133,966 | 98.37 | 1,361.84 |
nano-vLLM | 133,966 | 93.41 | 1,434.13 |
🚀 Faster than vLLM on the same setup.
No magic — just smart caching, Triton, and lean architecture.
🔄 nano-vLLM vs vLLM: Key Differences
Feature | vLLM | nano-vLLM |
---|---|---|
Codebase Size | ~10k+ LOC | ~1.2k LOC |
Language | C++ + Python + CUDA | Python + Triton |
Flash Attention | Built-in | Optional |
Tensor Parallelism | Advanced | Basic |
KV Caching | PagedAttention | Manual Triton kernel |
Compatibility | Full HF | HF (subset, tested) |
Quantization | External only | Coming soon via community |
Ideal Use Case | Production servers | Research, Colab, on-device |
Real Use Case: Run on Google Colab (T4 GPU)
```bash
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B
```

```python
from nanovllm import LLM, SamplingParams

llm = LLM("./Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

output = llm.generate(["Hello nano-vLLM!"], sampling_params)
print(output[0]["text"])
```
Final Thoughts
nano-vLLM is more than just a mini-inference engine — it's a powerful tool to learn, adapt, and deploy large language models entirely on your terms.
Whether you're:
- A researcher diving into LLM internals
- A hacker crafting tools for the edge
- An engineer chasing performance on small GPUs
nano-vLLM is an open-source ally: flexible, nimble, and built to empower your ideas.
🌐 Useful Link
- 🔗 GitHub: [nano-vLLM Repository](https://github.com/GeeeekExplorer/nano-vllm)