hotdog: highly optimised Toeplitz/Cauchy/displacement-rank GPU kernels

HOTDOG (Highly Optimised Toeplitz-Cauchy with Displacement Rank Acceleration On GPUs) is a CUDA kernel bundle for the structured linear-algebra primitives that underpin state-space models (S4/S5, Mamba), long-convolution networks (Hyena, H3), and low-displacement-rank layers.

The bundle exposes the usual Toeplitz/Cauchy ops with full autograd, BF16/FP16/complex-dtype paths, stream-aware dispatch, and torch.compile compliance via pt2_compliant_tag and needs_fixed_stride_order.

Load from the Hugging Face Hub

```shell
pip install kernels torch
```

```python
import torch
from kernels import get_kernel

hotdog = get_kernel("chrisvoncsefalvay/hotdog")

B, N, L = 4, 1024, 1024
v = torch.randn(B, N, device="cuda", dtype=torch.float32)
omega = torch.linspace(0.5, 1.5, L, device="cuda", dtype=torch.float32)
lambd = torch.linspace(-1.5, -0.5, N, device="cuda", dtype=torch.float32)

out = torch.empty(B, L, device="cuda", dtype=torch.float32)
hotdog.cauchy_kernel_forward(out, v, omega, lambd)
```

The kernels client picks the pre-built binary that matches your Python, PyTorch, and CUDA versions. For day-to-day PyTorch use prefer the top-level Python API (see below) rather than calling the raw ops.
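
As a semantics sanity check: assuming cauchy_kernel_forward evaluates the standard Cauchy sum $\text{out}_{b,l} = \sum_n v_{b,n}/(\omega_l - \lambda_n)$ (the usual S4 formulation; stated here as an assumption, not read from the kernel source), a NumPy reference looks like:

```python
import numpy as np

def cauchy_reference(v, omega, lambd):
    """out[b, l] = sum_n v[b, n] / (omega[l] - lambd[n])."""
    denom = omega[None, :, None] - lambd[None, None, :]  # [1, L, N]
    return (v[:, None, :] / denom).sum(axis=-1)          # [B, L]

rng = np.random.default_rng(0)
B, N, L = 2, 8, 16
v = rng.standard_normal((B, N))
omega = np.linspace(0.5, 1.5, L)
lambd = np.linspace(-1.5, -0.5, N)
out = cauchy_reference(v, omega, lambd)
```

A float64 reference like this is handy for allclose checks against the GPU path before dropping to BF16.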

Public API

```python
hotdog.cauchy_kernel(v, omega, lambd)            # Cauchy kernel evaluation
hotdog.toeplitz_matvec(c, r, x)                  # y = T(c, r) @ x
hotdog.toeplitz_matvec_transpose(c, r, x)        # y = T(c, r)^T @ x
hotdog.toeplitz_solve(c, r, b)                   # T(c, r) x = b
hotdog.toeplitz_solve_symmetric(c, b)            # T(c, c) x = b  (fast path)
hotdog.toeplitz_inverse_apply(gs_u, gs_v, x)     # T^{-1} x via Gohberg-Semencul
hotdog.gs_compute_generators(c, r)               # Gohberg-Semencul generators
hotdog.toeplitz_to_cauchy(c, r)                  # C = F T F^{-1}
hotdog.displacement_matvec(G, H, x, disp_type=0) # y = A x, generator-aware
hotdog.cauchy_pin_l2(omega, lambd)               # L2 persistence pin
```

Every entry accepts an optional stream= keyword argument; forward-only entries additionally accept an out= buffer. Calls broadcast leading batch dimensions and stay correct under torch.compile and torch.cuda.graph capture. Only Sylvester-type displacement (disp_type=0) is supported today. nn.Module wrappers (CauchyKernel, ToeplitzMV, ToeplitzSolve, ToeplitzInverse, DisplacementMV) are provided for parameterised layers that store structured matrices as buffers.
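
For intuition on disp_type=0: under Sylvester displacement $D_\omega A - A D_\lambda = G H^T$ the matrix is Cauchy-like, $A_{ij} = (G_i \cdot H_j)/(\omega_i - \lambda_j)$, so a matvec decomposes into $r$ Cauchy-weighted sums. A dense NumPy illustration of that identity (the CUDA kernel evaluates the same sums in the frequency domain without materialising the $n \times n$ Cauchy matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 2
omega = np.linspace(2.0, 3.0, n)
lambd = np.linspace(-1.0, 0.0, n)
G = rng.standard_normal((n, r))
H = rng.standard_normal((n, r))
x = rng.standard_normal(n)

# Cauchy-like matrix implied by the Sylvester equation D_w A - A D_l = G H^T
A = (G @ H.T) / (omega[:, None] - lambd[None, :])

# Generator-aware matvec: r Cauchy-weighted sums, one per generator column
C = 1.0 / (omega[:, None] - lambd[None, :])
y = sum(G[:, k] * (C @ (H[:, k] * x)) for k in range(r))
```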

What's included

  • Cauchy: tiled shared-memory reduction with Kahan-Babushka compensated summation, __nv_bfloat162 pair loads on the BF16 path, $+1$ bank-padded shared tiles, and a dedicated complex (float2) kernel for S4 poles. Custom CUDA backward over $v$, $\omega$, $\lambda$.
  • Toeplitz matvec: single-block shared-memory radix-2 FFT for padded $N\leq 1024$; cuFFT-backed circulant embedding for larger sizes. Explicit transpose entry; BF16/FP16 forward via promote-on-load.
  • Toeplitz solve: block-cooperative Levinson-Durbin with shared-memory linear-prediction coefficients, warp-shuffle reductions, and opt-in large-shmem via cudaFuncSetAttribute up to the device's sharedMemPerBlockOptin ceiling.
  • Gohberg-Semencul inverse application: two-stage batched fused path ([U(Jv)|U(v)] then [L(u)|L(Ju)]) with shared $x$ FFT, record_stream discipline on every intermediate, and a small-$N$ shared-memory FFT route.
  • Displacement-rank matvec: generator-aware frequency-domain Cauchy-sum kernel; $O(B\cdot r\cdot n)$ memory, pseudoinverse handling at the Sylvester diagonal pole.
  • cauchy_pin_l2: pins $\omega$ and $\lambda$ into the L2 persistence window on the current stream for repeated-call workloads (Ampere+).

Performance summary

NVIDIA GB10 (sm_121), PyTorch 2.10, CUDA 13.0:

| Operation | Shape ($B\times n$) | HOTDOG | vs reference | Achieved |
|---|---|---|---|---|
| cauchy_kernel | 16x4096 | 0.55 ms | 25x naive | 1.87 TF/s |
| cauchy_kernel | 16x1024 | 0.08 ms | 8.3x naive | 855 GF/s |
| toeplitz_matvec | 4x256 | 0.031 ms | 1.6x FFT | 9 GF/s |
| toeplitz_solve_symmetric | 4x64 | 0.084 ms | 54.7x SciPy | — |
| toeplitz_solve_symmetric | 4x256 | 0.278 ms | 16.5x SciPy | — |
| toeplitz_inverse_apply | 4x256 | 0.344 ms | 28.4x over 4x FFT | — |
| displacement_matvec ($r=2$) | 4x256 | 0.101 ms | 1.9x dense | 44 GF/s |

At $N\geq 1024$ the Cauchy kernel runs at $\approx 60\%$ of the GB10 fp32 peak. Small-shape rows are launch-overhead bound, not kernel-bound; closing that gap is an orchestration concern (torch.cuda.graph, op fusion, cauchy_pin_l2), not a kernel choice.

Techniques

  1. Tiled shared-memory Cauchy reduction with Kahan-Babushka compensation against $|\omega-\lambda|$ cancellations.
  2. Architecture-aware Toeplitz dispatch between in-kernel shared-memory FFT and cuFFT-backed embedding.
  3. Block-cooperative Levinson-Durbin with a shared-memory-heavy L1 carveout and opt-in dynamic shmem past the 48 KB default.
  4. Generator-aware displacement matvec that never materialises the dense $[B,n,n]$ matrix; memory is $O(B\cdot r\cdot n)$.
  5. Fused Gohberg-Semencul apply folding four triangular Toeplitz-matvecs into two batched stages with shared $x$ FFT and record_stream discipline.
  6. Cache-aware loads and stores (__ldca, __ldcs, __stwt) plus bank-conflict padding on shared tiles.
  7. Principled pole handling at $\omega_i=\omega_j^{-1}$: minimum-norm pseudoinverse rather than epsilon regularisation.
  8. PT2 compliance: pt2_compliant_tag, needs_fixed_stride_order, and Meta (FakeTensor) implementations on every op.
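
Technique 3 is the classical Levinson recursion. A serial NumPy reference for the symmetric positive-definite case, following the standard Golub-and-Van-Loan formulation (the GPU version parallelises the inner products across a block; this sketch is for validation, not performance):

```python
import numpy as np

def levinson_solve(t, b):
    """Solve T x = b for symmetric positive-definite Toeplitz T with
    first column t, in O(n^2) time and O(n) extra memory."""
    t = np.asarray(t, dtype=float)
    b = np.asarray(b, dtype=float) / t[0]
    r = t[1:] / t[0]                   # normalised off-diagonal column
    n = len(t)
    x = np.array([b[0]])
    if n == 1:
        return x
    y = np.array([-r[0]])              # Yule-Walker back-vector
    beta, alpha = 1.0, -r[0]
    for k in range(1, n):
        beta *= 1.0 - alpha * alpha
        mu = (b[k] - r[:k] @ x[::-1]) / beta
        x = np.concatenate([x + mu * y[::-1], [mu]])
        if k < n - 1:
            alpha = -(r[k] + r[:k] @ y[::-1]) / beta
            y = np.concatenate([y + alpha * y[::-1], [alpha]])
    return x
```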

Supported dtypes

| Op | float32 | bfloat16 | float16 | complex64 |
|---|---|---|---|---|
| cauchy_kernel (forward) | ✓ | ✓ | — | ✓ |
| cauchy_kernel (backward) | ✓ | — | — | — |
| toeplitz_matvec | ✓ | ✓ | ✓ | — |
| toeplitz_solve_* | ✓ | — | — | — |
| toeplitz_inverse_apply | ✓ | — | — | — |
| displacement_matvec | ✓ | — | — | — |
| toeplitz_to_cauchy | ✓ | — | — | — |

Practical constraints

  • Toeplitz FFT paths zero-pad non-power-of-two inputs to the next power of two.
  • The shared-memory FFT path is enabled for padded $N\leq 1024$; larger sizes use cuFFT.
  • The symmetric solve fast path falls back to the structured general solver when its residual check does not pass or when $c\neq r$.
  • Levinson recursion requires non-singular principal minors; unstable slices are routed to a dense solve automatically.
  • The fast Gohberg-Semencul application assumes symmetric generators ($v\approx\text{reverse}(u)$); non-symmetric or gradient-bearing calls take the exact recovery path.

Build targets

Pre-built Hub variants cover sm_80, sm_86, sm_89, sm_90, and sm_120+PTX (Ampere through Blackwell) on Linux x86_64 and aarch64 for the Python and PyTorch versions kernels-community publishes against. For local development and additional arches, build with kernel-builder:

```shell
nix run -L --max-jobs 1 --cores 4 .#build-and-copy
```

Author

I'm Chris von Csefalvay, an AI researcher specialising in post-training, and the author of Post-Training: A Practical Guide for AI Engineers and Developers (No Starch Press, 2026). I also write Post-Slop, a periodic diatribe about AI and what it's doing for society. You can also find me on LinkedIn and X.

License

MIT
