# hotdog: highly optimised Toeplitz/Cauchy/displacement-rank GPU kernels
HOTDOG (Highly Optimised Toeplitz-Cauchy with Displacement Rank Acceleration On GPUs) is a CUDA kernel bundle for the structured linear-algebra primitives that underpin state-space models (S4/S5, Mamba), long-convolution networks (Hyena, H3), and low-displacement-rank layers.
The bundle exposes the usual Toeplitz/Cauchy ops with full autograd, BF16/FP16/complex-dtype paths, stream-aware dispatch, and `torch.compile` compliance via `pt2_compliant_tag` and `needs_fixed_stride_order`.
## Load from the Hugging Face Hub
```shell
pip install kernels torch
```

```python
import torch
from kernels import get_kernel

hotdog = get_kernel("chrisvoncsefalvay/hotdog")

B, N, L = 4, 1024, 1024
v = torch.randn(B, N, device="cuda", dtype=torch.float32)
omega = torch.linspace(0.5, 1.5, L, device="cuda", dtype=torch.float32)
lambd = torch.linspace(-1.5, -0.5, N, device="cuda", dtype=torch.float32)
out = torch.empty(B, L, device="cuda", dtype=torch.float32)

hotdog.cauchy_kernel_forward(out, v, omega, lambd)
```
The `kernels` client picks the pre-built binary that matches your Python, PyTorch, and CUDA versions. For day-to-day PyTorch use, prefer the top-level Python API (see below) over calling the raw ops.
## Public API
```python
hotdog.cauchy_kernel(v, omega, lambd)             # Cauchy kernel evaluation
hotdog.toeplitz_matvec(c, r, x)                   # y = T(c, r) @ x
hotdog.toeplitz_matvec_transpose(c, r, x)         # y = T(c, r)^T @ x
hotdog.toeplitz_solve(c, r, b)                    # T(c, r) x = b
hotdog.toeplitz_solve_symmetric(c, b)             # T(c, c) x = b (fast path)
hotdog.toeplitz_inverse_apply(gs_u, gs_v, x)      # T^{-1} x via Gohberg-Semencul
hotdog.gs_compute_generators(c, r)                # Gohberg-Semencul generators
hotdog.toeplitz_to_cauchy(c, r)                   # C = F T F^{-1}
hotdog.displacement_matvec(G, H, x, disp_type=0)  # y = A x, generator-aware
hotdog.cauchy_pin_l2(omega, lambd)                # L2 persistence pin
```
Every entry accepts an optional `stream=` keyword argument; forward-only entries additionally accept an `out=` buffer. Calls broadcast leading batch dimensions and stay correct under `torch.compile` and `torch.cuda.graph` capture. Only Sylvester-type displacement (`disp_type=0`) is supported today. `nn.Module` wrappers (`CauchyKernel`, `ToeplitzMV`, `ToeplitzSolve`, `ToeplitzInverse`, `DisplacementMV`) are provided for parameterised layers that store structured matrices as buffers.
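For reference, `cauchy_kernel` evaluates the batched Cauchy sum $\text{out}[b, l] = \sum_n v[b, n] / (\omega[l] - \lambda[n])$ under the usual S4 convention. A minimal NumPy sketch of that contract (the CUDA kernel adds tiling, compensated summation, and autograd on top; this reference is an assumption from the S4 literature, not extracted from the kernel source):

```python
import numpy as np

def cauchy_kernel_ref(v, omega, lambd):
    """Reference semantics: out[b, l] = sum_n v[b, n] / (omega[l] - lambd[n])."""
    # (L, 1) - (1, N) -> (L, N) Cauchy matrix, then contract over N.
    cauchy = 1.0 / (omega[:, None] - lambd[None, :])
    return v @ cauchy.T  # (B, N) @ (N, L) -> (B, L)

B, N, L = 2, 8, 6
rng = np.random.default_rng(0)
v = rng.standard_normal((B, N))
omega = np.linspace(0.5, 1.5, L)      # disjoint from lambd, so no poles
lambd = np.linspace(-1.5, -0.5, N)
out = cauchy_kernel_ref(v, omega, lambd)
assert out.shape == (B, L)
```

The shapes mirror the quick-start snippet above: `v` is `(B, N)`, `omega` is `(L,)`, `lambd` is `(N,)`, and the result is `(B, L)`.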
## What's included
- Cauchy: tiled shared-memory reduction with Kahan-Babushka compensated summation, `__nv_bfloat162` pair loads on the BF16 path, $+1$ bank-padded shared tiles, and a dedicated complex (`float2`) kernel for S4 poles. Custom CUDA backward over $v$, $\omega$, $\lambda$.
- Toeplitz matvec: single-block shared-memory radix-2 FFT for padded $N\leq 1024$; cuFFT-backed circulant embedding for larger sizes. Explicit transpose entry; BF16/FP16 forward via promote-on-load.
- Toeplitz solve: block-cooperative Levinson-Durbin with shared-memory linear-prediction coefficients, warp-shuffle reductions, and opt-in large shmem via `cudaFuncSetAttribute` up to the device's `sharedMemPerBlockOptin` ceiling.
- Gohberg-Semencul inverse application: two-stage batched fused path (`[U(Jv)|U(v)]` then `[L(u)|L(Ju)]`) with a shared $x$ FFT, `record_stream` discipline on every intermediate, and a small-$N$ shared-memory FFT route.
- Displacement-rank matvec: generator-aware frequency-domain Cauchy-sum kernel; $O(B\cdot r\cdot n)$ memory, pseudoinverse handling at the Sylvester diagonal pole.
- `cauchy_pin_l2`: pins $\omega$ and $\lambda$ into the L2 persistence window on the current stream for repeated-call workloads (Ampere and newer).
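The circulant-embedding trick behind `toeplitz_matvec` can be sketched in a few lines of NumPy: embed the $n\times n$ Toeplitz matrix in a $2n\times 2n$ circulant, whose matvec is a single FFT round trip. A CPU reference under the `(c, r)` convention from the API table, i.e. first column `c` and first row `r` with `c[0] == r[0]` (the CUDA kernel batches this and fuses the padding):

```python
import numpy as np

def toeplitz_matvec_ref(c, r, x):
    """y = T(c, r) @ x via circulant embedding; c is the first column,
    r the first row, and c[0] must equal r[0]."""
    n = len(c)
    # First column of the 2n x 2n circulant whose top-left n x n block is T.
    circ = np.concatenate([c, [0.0], r[:0:-1]])  # [c_0..c_{n-1}, 0, r_{n-1}..r_1]
    x_pad = np.concatenate([x, np.zeros(n)])     # zero-pad the input to length 2n
    # Circulant matvec = circular convolution = pointwise product in frequency.
    y = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(x_pad)).real
    return y[:n]
```

This is $O(n\log n)$ against $O(n^2)$ for a dense matvec, which is why the kernel pads to the next power of two and dispatches to an in-block FFT or cuFFT depending on size.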
## Performance summary
NVIDIA GB10 (sm_121), PyTorch 2.10, CUDA 13.0:
| Operation | Shape ($B\times n$) | HOTDOG | vs reference | Achieved |
|---|---|---|---|---|
| `cauchy_kernel` | 16×4096 | 0.55 ms | 25× naive | 1.87 TF/s |
| `cauchy_kernel` | 16×1024 | 0.08 ms | 8.3× naive | 855 GF/s |
| `toeplitz_matvec` | 4×256 | 0.031 ms | 1.6× FFT | 9 GF/s |
| `toeplitz_solve_symmetric` | 4×64 | 0.084 ms | 54.7× SciPy | – |
| `toeplitz_solve_symmetric` | 4×256 | 0.278 ms | 16.5× SciPy | – |
| `toeplitz_inverse_apply` | 4×256 | 0.344 ms | 28.4× over 4× FFT | – |
| `displacement_matvec` ($r=2$) | 4×256 | 0.101 ms | 1.9× dense | 44 GF/s |
At $N\geq 1024$ the Cauchy kernel runs at $\approx 60\%$ of the GB10 FP32 peak. Small-shape rows are launch-overhead bound, not kernel-bound; closing that gap is an orchestration concern (`torch.cuda.graph`, op fusion, `cauchy_pin_l2`), not a kernel choice.
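One way to amortise that launch overhead is to capture the repeated-call loop in a CUDA graph and replay it. The sketch below shows the generic capture-and-replay pattern with a stand-in elementwise op rather than a hotdog entry, and falls back to an eager loop when no GPU is present; the structure, not the op, is the point:

```python
import torch

def replay_loop(fn, x, iters=100):
    """Run fn(x) iters times; on CUDA, capture once and replay a graph."""
    if x.is_cuda:
        g = torch.cuda.CUDAGraph()
        # Warm up on a side stream before capture, per the CUDA graphs docs.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            y = fn(x)
        torch.cuda.current_stream().wait_stream(s)
        with torch.cuda.graph(g):
            y = fn(x)
        for _ in range(iters):
            g.replay()  # one graph launch per iteration instead of one per op
        torch.cuda.synchronize()
        return y
    for _ in range(iters):  # eager fallback on CPU
        y = fn(x)
    return y

out = replay_loop(torch.sigmoid, torch.zeros(4))
```

Calling `cauchy_pin_l2(omega, lambd)` once before such a loop keeps the coefficient vectors resident in L2 across replays on Ampere-class and newer parts.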
## Techniques
- Tiled shared-memory Cauchy reduction with Kahan-Babushka compensation against $|\omega-\lambda|$ cancellations.
- Architecture-aware Toeplitz dispatch between in-kernel shared-memory FFT and cuFFT-backed embedding.
- Block-cooperative Levinson-Durbin with a shared-memory-heavy L1 carveout and opt-in dynamic shmem past the 48 KB default.
- Generator-aware displacement matvec that never materialises the dense $[B,n,n]$ matrix; memory is $O(B\cdot r\cdot n)$.
- Fused Gohberg-Semencul apply folding four triangular Toeplitz matvecs into two batched stages with a shared $x$ FFT and `record_stream` discipline.
- Cache-aware loads and stores (`__ldca`, `__ldcs`, `__stwt`) plus bank-conflict padding on shared tiles.
- Principled pole handling at $\omega_i=\omega_j^{-1}$: minimum-norm pseudoinverse rather than epsilon regularisation.
- PT2 compliance: `pt2_compliant_tag`, `needs_fixed_stride_order`, and Meta (FakeTensor) implementations on every op.
## Supported dtypes
| Op | float32 | bfloat16 | float16 | complex64 |
|---|---|---|---|---|
| `cauchy_kernel` (forward) | β | β | β | β |
| `cauchy_kernel` (backward) | β | β | β | β |
| `toeplitz_matvec` | β | β | β | β |
| `toeplitz_solve_*` | β | β | β | β |
| `toeplitz_inverse_apply` | β | β | β | β |
| `displacement_matvec` | β | β | β | β |
| `toeplitz_to_cauchy` | β | β | β | β |
## Practical constraints
- Toeplitz FFT paths zero-pad non-power-of-two inputs to the next power of two.
- The shared-memory FFT path is enabled for padded $N\leq 1024$; larger sizes use cuFFT.
- The symmetric solve fast path falls back to the structured general solver when its residual check does not pass or when $c\neq r$.
- Levinson recursion requires non-singular principal minors; unstable slices are routed to a dense solve automatically.
- The fast Gohberg-Semencul application assumes symmetric generators ($v\approx\text{reverse}(u)$); non-symmetric or gradient-bearing calls take the exact recovery path.
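For intuition on the symmetric fast path, the classic $O(n^2)$ Levinson-Durbin recursion it accelerates can be written as a short NumPy reference, here in textbook form for a symmetric positive-definite Toeplitz system $T(c, c)\,x = b$ (the CUDA version block-parallelises the inner dot products, batches over $B$, and adds the residual check and dense fallback described above):

```python
import numpy as np

def toeplitz_solve_symmetric_ref(c, b):
    """Solve T x = b for symmetric Toeplitz T with first column c,
    via Levinson-Durbin (cf. Golub & Van Loan, Alg. 4.7.2)."""
    t = np.asarray(c, dtype=float) / c[0]  # normalise so t[0] == 1
    b = np.asarray(b, dtype=float) / c[0]
    n = len(t)
    x = np.zeros(n)
    x[0] = b[0]
    if n == 1:
        return x
    y = np.zeros(n - 1)                    # Yule-Walker solution, grown in place
    y[0] = -t[1]
    alpha, beta = -t[1], 1.0
    for k in range(1, n):
        beta *= 1.0 - alpha * alpha        # prediction-error update
        mu = (b[k] - t[1:k + 1] @ x[k - 1::-1]) / beta
        x[:k] += mu * y[k - 1::-1]
        x[k] = mu
        if k < n - 1:                      # extend the Yule-Walker solution
            alpha = -(t[k + 1] + t[1:k + 1] @ y[k - 1::-1]) / beta
            y[:k] += alpha * y[k - 1::-1]
            y[k] = alpha
    return x
```

The recursion breaks down when `beta` hits zero, i.e. when a principal minor is singular; that is exactly the condition under which the kernel reroutes a slice to a dense solve.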
## Build targets
Pre-built Hub variants cover sm_80, sm_86, sm_89, sm_90, and sm_120+PTX (Ampere through Blackwell) on Linux x86_64 and aarch64, for the Python and PyTorch versions kernels-community publishes against. For local development and additional architectures, build with kernel-builder:

```shell
nix run -L --max-jobs 1 --cores 4 .#build-and-copy
```
## Author
I'm Chris von Csefalvay, an AI researcher specialising in post-training and the author of *Post-Training: A Practical Guide for AI Engineers and Developers* (No Starch Press, 2026). I also write *Post-Slop*, a periodic diatribe about AI and what it's doing for society. You can also find me on LinkedIn and X.
## License
MIT