⚡ nano-vLLM: Lightweight, Low-Latency LLM Inference from Scratch

Community Article · Published June 28, 2025

[Figure: benchmark chart showing nano-vLLM beating vLLM]

Introduction: What Is Inference in LLMs?

When you hear “ChatGPT responding” or “LLM generating text,” you’re witnessing inference.

Inference is the process of using a trained model to make predictions or generate outputs.

In LLMs, inference means:

  • Taking your prompt
  • Running it through billions of weights
  • Getting a smart and relevant output

But here's the catch: inference is slow, resource-hungry, and often not optimized for edge or personal devices.

This is where optimization tools like vLLM — and now, nano-vLLM — come into play.


Why Inference Optimization Matters

Large models (even 1B+ parameters) tend to:

  • Consume a lot of VRAM
  • Introduce latency, especially in long generations
  • Require massive infrastructure for production

We want:

  • Fast token generation
  • Low memory footprint
  • Parallel request handling
  • Feasibility on laptops, Colab, and edge devices

vLLM: The Inference Giant

vLLM is a production-grade inference engine originally developed by researchers at UC Berkeley.

Key Strengths:

  • PagedAttention for memory-efficient KV caching, inspired by OS virtual-memory paging
  • Continuous batching for parallel prompt handling
  • Prefill + Decode parallelism
  • Tensor Parallelism for multi-GPU inference
  • Used in the Hugging Face Inference API

Challenges:

  • Complex, heavy codebase (~10K+ LOC)
  • Uses C++, CUDA extensions
  • Harder to modify and learn from
  • Not beginner or Colab-friendly

Introducing nano-vLLM: A Lightweight Rebuild

nano-vLLM is a minimal reimplementation of vLLM, weighing in at roughly 1,200 lines of clean Python and built for:

  • Understanding
  • Hacking
  • Running on limited hardware

Think of it as vLLM’s tiny, readable sibling — yet remarkably fast and useful.


Highlights of nano-vLLM

| Feature | nano-vLLM |
|---|---|
| Tiny codebase | ~1.2k lines |
| Pure Python & Triton | Easy to modify |
| CUDA Graph + torch.compile | For faster decoding |
| Flash Attention support | Optional |
| Runs on laptops/Colab | Yes |
| Tensor parallel support | Basic |
| No C++/CUDA extensions | Simpler installs |
| Hackable for research | Great for learning |

How nano-vLLM Works (Internals)

Let’s break down the core pieces of the nano-vLLM engine:

1. Prompt Tokenization

  • Uses HuggingFace tokenizer
  • Supports multiple prompts (batching)
  • Splits into prefill and decode phases

2. KV Cache Management

  • Implements Triton kernel: store_kvcache_kernel
  • Efficient key/value attention memory storage
  • Supports prefix caching

3. Flash Attention

  • Uses flash-attn for speed (if installed)
  • Falls back to standard attention when flash-attn is not installed

4. Decode Engine

  • Reuses cached values for fast step-wise generation
  • Wraps the decode step in a CUDA graph (via torch.cuda.graph) when supported

5. SamplingParams

  • Implements temperature, top_k, max_tokens
  • No unnecessary abstractions — just works

6. Tensor Parallelism

  • Lightweight torch.distributed wrapper
  • Splits model across multiple GPUs (optional)

How nano-vLLM Works Under the Hood (Deep Dive for ML Developers)

nano-vLLM simplifies many of vLLM’s advanced concepts while preserving performance-critical components. Here's a breakdown of its internals:

1. Prompt Tokenization and Input Formatting

nano-vLLM uses Hugging Face tokenizers to preprocess input text. During tokenization:

  • Inputs are packed into a single flat batch, with cu_seqlens (cumulative sequence lengths) marking where each sequence starts and ends, so variable-length prompts are handled without padding.
  • The engine distinguishes between:
    • Prefill Phase: When KV cache is being initialized
    • Decode Phase: When generation is happening token by token

This separation enables more efficient handling of multi-turn or streamed generation.
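
To make the prefill input format concrete, here is a minimal sketch of the packing idea, using a Hugging Face tokenizer and building the cu_seqlens offsets by hand. The helper name pack_prompts is illustrative, not nano-vLLM's actual code:

```python
# Illustrative sketch (not nano-vLLM's exact code): pack variable-length prompts
# into one flat token tensor plus cu_seqlens offsets for the prefill phase.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

def pack_prompts(prompts):
    # Tokenize each prompt separately; the varlen path needs no padding.
    token_lists = [tokenizer(p).input_ids for p in prompts]
    lengths = torch.tensor([len(t) for t in token_lists], dtype=torch.int32)

    # cu_seqlens[i] is the start offset of sequence i in the flat tensor;
    # the last entry is the total token count.
    cu_seqlens = torch.zeros(len(token_lists) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)

    flat_tokens = torch.tensor([tok for seq in token_lists for tok in seq])
    return flat_tokens, cu_seqlens

tokens, cu_seqlens = pack_prompts(["Hello nano-vLLM!", "Explain KV caching."])
print(tokens.shape, cu_seqlens)  # flat token ids and their cumulative offsets, e.g. tensor([0, 5, 12], ...)
```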


2. KV Cache: Custom Memory Management

The KV (Key-Value) Cache stores hidden states for attention layers, allowing the model to:

  • Reuse previous context in the decode phase
  • Avoid redundant computation

In nano-vLLM:

  • A Triton kernel (store_kvcache_kernel) is used for writing keys and values into a preallocated cache efficiently.
  • Cache slots are mapped using a slot_mapping tensor, avoiding Python-level indexing.
  • Cache layout: [batch_size, num_heads, head_dim] → [total_slots, head_dim] (flattened for performance)

This design mimics PagedAttention from vLLM but stays readable and modifiable.
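
For intuition, here is a stripped-down Triton kernel in the spirit of store_kvcache_kernel. It only scatters per-token key vectors into a flat cache via a slot mapping; the real kernel handles keys and values together and differs in layout details:

```python
# Simplified sketch of a slot-mapped KV write; not the actual store_kvcache_kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def store_kv_sketch(key_ptr, cache_ptr, slot_mapping_ptr, D: tl.constexpr):
    token_idx = tl.program_id(0)                  # one program instance per token
    slot = tl.load(slot_mapping_ptr + token_idx)  # cache slot assigned to this token
    offs = tl.arange(0, D)
    k = tl.load(key_ptr + token_idx * D + offs)   # read the token's key vector
    tl.store(cache_ptr + slot * D + offs, k)      # scatter it into the flat cache

num_tokens, head_dim, num_slots = 8, 64, 256
keys = torch.randn(num_tokens, head_dim, device="cuda")
cache = torch.zeros(num_slots, head_dim, device="cuda")
slot_mapping = torch.randint(0, num_slots, (num_tokens,), device="cuda")

store_kv_sketch[(num_tokens,)](keys, cache, slot_mapping, D=head_dim)
```

Because the slot mapping lives on the GPU, no Python-level indexing happens on the hot path, which is exactly the point made above.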


3. ⚡ Flash Attention (v2 Compatible)

If flash-attn is installed, nano-vLLM:

  • Calls flash_attn_varlen_func during the prefill phase
  • Calls flash_attn_with_kvcache during decode

FlashAttention v2:

  • Reduces memory usage by never materializing the full attention matrix
  • Computes softmax attention in fused CUDA kernels, tiling the computation to stay in fast on-chip memory
  • Improves GPU utilization through better work partitioning across thread blocks

Fallback to standard attention is also supported for environments where flash-attn isn't installed.
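
A hedged sketch of that dispatch logic is below. The flash_attn_varlen_func call and its arguments come from the flash-attn package; the fallback path and the surrounding glue are illustrative, not nano-vLLM's exact code:

```python
# Use flash-attn's varlen kernel when available, otherwise fall back to
# PyTorch's scaled_dot_product_attention (a sketch of the dispatch idea).
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_varlen_func
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False

def prefill_attention(q, k, v, cu_seqlens, max_seqlen):
    # q, k, v: (total_tokens, num_heads, head_dim), packed using cu_seqlens offsets
    if HAS_FLASH:
        return flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
            causal=True,
        )
    # Fallback: run each packed sequence through SDPA (slower, but no flash-attn needed).
    bounds = cu_seqlens.tolist()
    outs = []
    for s, e in zip(bounds[:-1], bounds[1:]):
        qi, ki, vi = (t[s:e].transpose(0, 1).unsqueeze(0) for t in (q, k, v))  # (1, heads, len, dim)
        oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=True)
        outs.append(oi.squeeze(0).transpose(0, 1))  # back to (len, heads, dim)
    return torch.cat(outs, dim=0)

q = k = v = torch.randn(6, 8, 64, device="cuda", dtype=torch.float16)
out = prefill_attention(q, k, v, torch.tensor([0, 2, 6], dtype=torch.int32, device="cuda"), max_seqlen=4)
```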


4. torch.compile + CUDA Graphs

For the decode phase, where tokens are generated one at a time:

  • nano-vLLM wraps the generation loop in a CUDA Graph (if supported)
  • Also uses torch.compile() to fuse operations and reduce Python overhead

This yields:

  • Stable memory allocation
  • Better kernel launch efficiency
  • Reduced latency for single-token decoding

This is especially useful on smaller GPUs such as the T4 (common on Colab) or consumer RTX 30/40-series cards.
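
The snippet below sketches the standard PyTorch pattern for capturing a single decode step in a CUDA graph. The Linear layer just stands in for one decode forward pass; nano-vLLM's engine captures its own model step with the KV cache in place and additionally applies torch.compile, which is omitted here:

```python
# Capture one "decode step" in a CUDA graph, then replay it with new inputs.
import torch

model = torch.nn.Linear(4096, 4096, device="cuda")   # stand-in for one decode step
static_input = torch.zeros(1, 4096, device="cuda")   # fixed-address input buffer
static_output = torch.zeros(1, 4096, device="cuda")

# Warm-up on a side stream so workspaces are allocated before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_output.copy_(model(static_input))
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output.copy_(model(static_input))

# At decode time: copy the new hidden state into the static buffer and replay.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```

Replaying a pre-recorded graph avoids per-token kernel launch overhead, which is where most of the latency win in single-token decoding comes from.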


5. SamplingParams: Clean, Minimalistic Sampling API

The SamplingParams class supports:

  • temperature: Controls randomness
  • top_k: Top-k filtering
  • max_tokens: Token budget per request
  • stop_tokens: Optional stop sequence enforcement

The sampling logic is implemented efficiently using PyTorch tensor ops:

  • Logits are filtered using top-k
  • Softmax is applied post-scaling
  • torch.multinomial is used for sampling the next token
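
As a rough sketch of that pipeline (temperature scaling, top-k filtering, softmax, multinomial draw), assuming the same parameter names as SamplingParams:

```python
# Minimal sampling sketch mirroring the steps listed above; not nano-vLLM's exact code.
import torch

def sample_next_token(logits, temperature=0.7, top_k=50):
    # logits: (batch, vocab_size) for the last position of each sequence
    logits = logits / max(temperature, 1e-5)           # temperature scaling

    if top_k > 0:
        topk_vals, topk_idx = torch.topk(logits, k=top_k, dim=-1)
        filtered = torch.full_like(logits, float("-inf"))
        filtered.scatter_(-1, topk_idx, topk_vals)     # keep only the top-k logits
        logits = filtered

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)     # (batch, 1) sampled token ids

next_token = sample_next_token(torch.randn(2, 32000))
print(next_token)
```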

6. Lightweight Tensor Parallelism

nano-vLLM supports basic tensor parallelism using torch.distributed:

  • Splits model weights across multiple GPUs
  • Each GPU holds a slice of the linear projections in the attention/MLP blocks
  • Final output is gathered across GPUs

While not as feature-rich as DeepSpeed or vLLM’s NCCL sharding, it works well for small to medium models in research or Colab settings.
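
To illustrate the idea (not nano-vLLM's actual module), here is a minimal column-parallel linear layer where each rank keeps a slice of the weight matrix and the outputs are gathered afterwards; it assumes the script is launched with torchrun so torch.distributed can initialize:

```python
# Conceptual column-parallel linear layer: weights split across ranks, outputs gathered.
# Launch with: torchrun --nproc_per_node=2 this_script.py
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Each rank only holds its slice of the output dimension.
        self.local = torch.nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        local_out = self.local(x)  # (batch, out_features / world_size)
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)   # collect every rank's slice
        return torch.cat(gathered, dim=-1)     # full (batch, out_features)

if __name__ == "__main__":
    dist.init_process_group("nccl")            # torchrun provides rank / world size
    torch.cuda.set_device(dist.get_rank())
    layer = ColumnParallelLinear(1024, 4096).cuda()
    out = layer(torch.randn(2, 1024, device="cuda"))
    print(dist.get_rank(), out.shape)          # (2, 4096) on every rank
```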


7. Modular Design and Simplicity

The architecture is broken into clean modules:

  • llm_engine.py: Main inference orchestrator
  • layers/: Custom attention, MLP, rotary embeddings
  • utils/context.py: Global inference state management (e.g., block tables, cache length)

There is no C++/CUDA custom extension involved — everything is pure Python + Triton, making it highly readable and customizable.


🛠 Summary

| Component | Technique / Module Used |
|---|---|
| Tokenization | Hugging Face tokenizer + cu_seqlens batching |
| KV Caching | Triton kernel with slot mapping |
| Attention | Flash Attention v2 or custom fallback |
| Sampling | PyTorch-based multinomial sampling |
| Speed Optimizations | torch.compile + CUDA Graph (decode phase) |
| Parallelism | Basic tensor parallelism with torch.distributed |
| Code Simplicity | ~1.2k LOC, no C++ or opaque abstractions |

This modular and deeply educational structure makes nano-vLLM an excellent choice for:

  • LLM engineers exploring inference
  • Researchers trying new decoding algorithms
  • Students learning systems-level ML

📊 Benchmarks (RTX 4070 Laptop)

| Engine | Tokens Generated | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.37 | 1,361.84 |
| nano-vLLM | 133,966 | 93.41 | 1,434.13 |

🚀 Roughly 5% faster than vLLM on the same setup.
No magic — just smart caching, Triton, and lean architecture.
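
If you want a ballpark number on your own hardware, a rough throughput check looks like the sketch below. It uses the same nano-vLLM API as the usage example later in this article; the batch of 64 identical prompts is an arbitrary choice, and this is not the exact script behind the table above:

```python
# Rough throughput check; results depend on GPU, batch size, and generation length.
import time
from transformers import AutoTokenizer
from nanovllm import LLM, SamplingParams

llm = LLM("./Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Write a short story about a robot."] * 64   # one batch of requests

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

tok = AutoTokenizer.from_pretrained("./Qwen3-0.6B")
total_tokens = sum(len(tok(o["text"]).input_ids) for o in outputs)
print(f"{total_tokens} tokens in {elapsed:.2f}s -> {total_tokens / elapsed:.1f} tok/s")
```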


🔄 nano-vLLM vs vLLM: Key Differences

| Feature | vLLM | nano-vLLM |
|---|---|---|
| Codebase size | ~10k+ LOC | ~1.2k LOC |
| Language | C++ + Python + CUDA | Python + Triton |
| Flash Attention | Built-in | Optional |
| Tensor parallelism | Advanced | Basic |
| KV caching | PagedAttention | Manual Triton kernel |
| Compatibility | Full HF | HF (subset, tested) |
| Quantization | External only | Coming soon via community |
| Ideal use case | Production servers | Research, Colab, on-device |

Real Use Case: Run on Google Colab (T4 GPU)

```bash
# Install nano-vLLM and download a small model to run locally
pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
huggingface-cli download Qwen/Qwen3-0.6B --local-dir ./Qwen3-0.6B
```

```python
from nanovllm import LLM, SamplingParams

# enforce_eager=True runs in plain eager mode (no CUDA graph capture)
llm = LLM("./Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

output = llm.generate(["Hello nano-vLLM!"], sampling_params)
print(output[0]["text"])
```
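
Since the engine handles multiple prompts in a single call (as noted earlier), you can also pass a list of prompts. The snippet below repeats the setup so it runs on its own; the prompt texts are arbitrary examples:

```python
from nanovllm import LLM, SamplingParams

llm = LLM("./Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# Several prompts are processed in one call; outputs come back in the same order.
prompts = [
    "Explain KV caching in one sentence.",
    "What is tensor parallelism?",
]
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nCompletion: {out['text']}\n")
```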

Final Thoughts

nano-vLLM is more than just a mini-inference engine — it's a powerful tool to learn, adapt, and deploy large language models entirely on your terms.

Whether you're:

  • A researcher diving into LLM internals
  • A hacker crafting tools for the edge
  • An engineer chasing performance on small GPUs

nano-vLLM is your open-source armour. Flexible, nimble, and made to empower your ideas.


🌐 Useful Link

nano-vLLM on GitHub: https://github.com/GeeeekExplorer/nano-vllm
