hotdog: highly optimised Toeplitz/Cauchy/displacement-rank GPU kernels

HOTDOG (Highly Optimised Toeplitz-Cauchy with Displacement Rank Acceleration On GPUs) is a CUDA kernel bundle for the structured linear-algebra primitives that underpin state-space models (S4/S5, Mamba), long-convolution networks (Hyena, H3), and low-displacement-rank layers.

The bundle exposes the usual Toeplitz/Cauchy ops with full autograd, BF16/FP16/complex-dtype paths, stream-aware dispatch, and torch.compile compliance via pt2_compliant_tag and needs_fixed_stride_order.

Load from the Hugging Face Hub

```shell
pip install kernels torch
```

```python
import torch
from kernels import get_kernel

hotdog = get_kernel("chrisvoncsefalvay/hotdog")

B, N, L = 4, 1024, 1024
v = torch.randn(B, N, device="cuda", dtype=torch.float32)
omega = torch.linspace(0.5, 1.5, L, device="cuda", dtype=torch.float32)
lambd = torch.linspace(-1.5, -0.5, N, device="cuda", dtype=torch.float32)

out = torch.empty(B, L, device="cuda", dtype=torch.float32)
hotdog.cauchy_kernel_forward(out, v, omega, lambd)
```

The kernels client picks the pre-built binary that matches your Python, PyTorch, and CUDA versions. For day-to-day PyTorch use prefer the top-level Python API (see below) rather than calling the raw ops.
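
As a semantics sanity check: assuming cauchy_kernel_forward evaluates the standard Cauchy sum $\text{out}_{b,l} = \sum_n v_{b,n}/(\omega_l - \lambda_n)$ (the usual S4 formulation; stated here as an assumption, not read from the kernel source), a NumPy reference looks like:

```python
import numpy as np

def cauchy_reference(v, omega, lambd):
    """out[b, l] = sum_n v[b, n] / (omega[l] - lambd[n])."""
    denom = omega[None, :, None] - lambd[None, None, :]  # [1, L, N]
    return (v[:, None, :] / denom).sum(axis=-1)          # [B, L]

rng = np.random.default_rng(0)
B, N, L = 2, 8, 16
v = rng.standard_normal((B, N))
omega = np.linspace(0.5, 1.5, L)
lambd = np.linspace(-1.5, -0.5, N)
out = cauchy_reference(v, omega, lambd)
```

A float64 reference like this is handy for allclose checks against the GPU path before dropping to BF16.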

Public API

```python
hotdog.cauchy_kernel(v, omega, lambd)            # Cauchy kernel evaluation
hotdog.toeplitz_matvec(c, r, x)                  # y = T(c, r) @ x
hotdog.toeplitz_matvec_transpose(c, r, x)        # y = T(c, r)^T @ x
hotdog.toeplitz_solve(c, r, b)                   # T(c, r) x = b
hotdog.toeplitz_solve_symmetric(c, b)            # T(c, c) x = b  (fast path)
hotdog.toeplitz_inverse_apply(gs_u, gs_v, x)     # T^{-1} x via Gohberg-Semencul
hotdog.gs_compute_generators(c, r)               # Gohberg-Semencul generators
hotdog.toeplitz_to_cauchy(c, r)                  # C = F T F^{-1}
hotdog.displacement_matvec(G, H, x, disp_type=0) # y = A x, generator-aware
hotdog.cauchy_pin_l2(omega, lambd)               # L2 persistence pin
```

Every entry accepts an optional stream= keyword argument; forward-only entries additionally accept an out= buffer. Calls broadcast leading batch dimensions and stay correct under torch.compile and torch.cuda.graph capture. Only Sylvester-type displacement (disp_type=0) is supported today. nn.Module wrappers (CauchyKernel, ToeplitzMV, ToeplitzSolve, ToeplitzInverse, DisplacementMV) are provided for parameterised layers that store structured matrices as buffers.
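
For intuition on disp_type=0: under Sylvester displacement $D_\omega A - A D_\lambda = G H^T$ the matrix is Cauchy-like, $A_{ij} = (G_i \cdot H_j)/(\omega_i - \lambda_j)$, so a matvec decomposes into $r$ Cauchy-weighted sums. A dense NumPy illustration of that identity (the CUDA kernel evaluates the same sums in the frequency domain without materialising the $n \times n$ Cauchy matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 2
omega = np.linspace(2.0, 3.0, n)
lambd = np.linspace(-1.0, 0.0, n)
G = rng.standard_normal((n, r))
H = rng.standard_normal((n, r))
x = rng.standard_normal(n)

# Cauchy-like matrix implied by the Sylvester equation D_w A - A D_l = G H^T
A = (G @ H.T) / (omega[:, None] - lambd[None, :])

# Generator-aware matvec: r Cauchy-weighted sums, one per generator column
C = 1.0 / (omega[:, None] - lambd[None, :])
y = sum(G[:, k] * (C @ (H[:, k] * x)) for k in range(r))
```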

What's included

  • Cauchy: tiled shared-memory reduction with Kahan-Babushka compensated summation, __nv_bfloat162 pair loads on the BF16 path, $+1$ bank-padded shared tiles, and a dedicated complex (float2) kernel for S4 poles. Custom CUDA backward over $v$, $\omega$, $\lambda$.
  • Toeplitz matvec: single-block shared-memory radix-2 FFT for padded $N\leq 1024$; cuFFT-backed circulant embedding for larger sizes. Explicit transpose entry; BF16/FP16 forward via promote-on-load.
  • Toeplitz solve: block-cooperative Levinson-Durbin with shared-memory linear-prediction coefficients, warp-shuffle reductions, and opt-in large-shmem via cudaFuncSetAttribute up to the device's sharedMemPerBlockOptin ceiling.
  • Gohberg-Semencul inverse application: two-stage batched fused path ([U(Jv)|U(v)] then [L(u)|L(Ju)]) with shared $x$ FFT, record_stream discipline on every intermediate, and a small-$N$ shared-memory FFT route.
  • Displacement-rank matvec: generator-aware frequency-domain Cauchy-sum kernel; $O(B\cdot r\cdot n)$ memory, pseudoinverse handling at the Sylvester diagonal pole.
  • cauchy_pin_l2: pins $\omega$ and $\lambda$ into the L2 persistence window on the current stream for repeated-call workloads (Ampere+).

Performance summary

NVIDIA GB10 (sm_121), PyTorch 2.10, CUDA 13.0:

| Operation | Shape ($B\times n$) | HOTDOG | vs reference | Achieved |
|---|---|---|---|---|
| cauchy_kernel | 16x4096 | 0.55 ms | 25x naive | 1.87 TF/s |
| cauchy_kernel | 16x1024 | 0.08 ms | 8.3x naive | 855 GF/s |
| toeplitz_matvec | 4x256 | 0.031 ms | 1.6x FFT | 9 GF/s |
| toeplitz_solve_symmetric | 4x64 | 0.084 ms | 54.7x SciPy | — |
| toeplitz_solve_symmetric | 4x256 | 0.278 ms | 16.5x SciPy | — |
| toeplitz_inverse_apply | 4x256 | 0.344 ms | 28.4x over 4x FFT | — |
| displacement_matvec ($r=2$) | 4x256 | 0.101 ms | 1.9x dense | 44 GF/s |

At $N\geq 1024$ the Cauchy kernel runs at $\approx 60\%$ of the GB10 fp32 peak. Small-shape rows are launch-overhead bound, not kernel-bound; closing that gap is an orchestration concern (torch.cuda.graph, op fusion, cauchy_pin_l2), not a kernel choice.

Techniques

  1. Tiled shared-memory Cauchy reduction with Kahan-Babushka compensation against $|\omega-\lambda|$ cancellations.
  2. Architecture-aware Toeplitz dispatch between in-kernel shared-memory FFT and cuFFT-backed embedding.
  3. Block-cooperative Levinson-Durbin with a shared-memory-heavy L1 carveout and opt-in dynamic shmem past the 48 KB default.
  4. Generator-aware displacement matvec that never materialises the dense $[B,n,n]$ matrix; memory is $O(B\cdot r\cdot n)$.
  5. Fused Gohberg-Semencul apply folding four triangular Toeplitz-matvecs into two batched stages with shared $x$ FFT and record_stream discipline.
  6. Cache-aware loads and stores (__ldca, __ldcs, __stwt) plus bank-conflict padding on shared tiles.
  7. Principled pole handling at $\omega_i=\omega_j^{-1}$: minimum-norm pseudoinverse rather than epsilon regularisation.
  8. PT2 compliance: pt2_compliant_tag, needs_fixed_stride_order, and Meta (FakeTensor) implementations on every op.
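
Technique 3 is the classical Levinson recursion. A serial NumPy reference for the symmetric positive-definite case, following the standard Golub-and-Van-Loan formulation (the GPU version parallelises the inner products across a block; this sketch is for validation, not performance):

```python
import numpy as np

def levinson_solve(t, b):
    """Solve T x = b for symmetric positive-definite Toeplitz T with
    first column t, in O(n^2) time and O(n) extra memory."""
    t = np.asarray(t, dtype=float)
    b = np.asarray(b, dtype=float) / t[0]
    r = t[1:] / t[0]                   # normalised off-diagonal column
    n = len(t)
    x = np.array([b[0]])
    if n == 1:
        return x
    y = np.array([-r[0]])              # Yule-Walker back-vector
    beta, alpha = 1.0, -r[0]
    for k in range(1, n):
        beta *= 1.0 - alpha * alpha
        mu = (b[k] - r[:k] @ x[::-1]) / beta
        x = np.concatenate([x + mu * y[::-1], [mu]])
        if k < n - 1:
            alpha = -(r[k] + r[:k] @ y[::-1]) / beta
            y = np.concatenate([y + alpha * y[::-1], [alpha]])
    return x
```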

Supported dtypes

| Op | float32 | bfloat16 | float16 | complex64 |
|---|---|---|---|---|
| cauchy_kernel (forward) | ✓ | ✓ | — | ✓ |
| cauchy_kernel (backward) | ✓ | — | — | — |
| toeplitz_matvec | ✓ | ✓ | ✓ | — |
| toeplitz_solve_* | ✓ | — | — | — |
| toeplitz_inverse_apply | ✓ | — | — | — |
| displacement_matvec | ✓ | — | — | — |
| toeplitz_to_cauchy | ✓ | — | — | — |

Practical constraints

  • Toeplitz FFT paths zero-pad non-power-of-two inputs to the next power of two.
  • The shared-memory FFT path is enabled for padded $N\leq 1024$; larger sizes use cuFFT.
  • The symmetric solve fast path falls back to the structured general solver when its residual check does not pass or when $c\neq r$.
  • Levinson recursion requires non-singular principal minors; unstable slices are routed to a dense solve automatically.
  • The fast Gohberg-Semencul application assumes symmetric generators ($v\approx\text{reverse}(u)$); non-symmetric or gradient-bearing calls take the exact recovery path.

Build targets

Pre-built Hub variants cover sm_80, sm_86, sm_89, sm_90, and sm_120+PTX (Ampere through Blackwell) on Linux x86_64 and aarch64 for the Python and PyTorch versions kernels-community publishes against. For local development and additional arches, build with kernel-builder:

```shell
nix run -L --max-jobs 1 --cores 4 .#build-and-copy
```

Author

I'm Chris von Csefalvay, an AI researcher specialising in post-training, and the author of Post-Training: A Practical Guide for AI Engineers and Developers (No Starch Press, 2026). I also write Post-Slop, a periodic diatribe about AI and what it's doing for society. You can also find me on LinkedIn and X.

License

MIT
