CUDA kernels for the shapes that matter

FMM-class GEMM kernels for NVIDIA tensor cores

Drop-in nn.Linear replacements measured to beat cuBLAS on production workloads — embedding inference, frontier-LLM training, FP64 HPC — with a safe fall-through to stock cuBLAS on every shape we have not measured a win on.

Fast matrix multiplication on NVIDIA hardware is governed jointly by silicon, the cuBLAS library, and matrix shape: the same algorithmic kernel can win on one corner of the (GPU, precision, shape) cube and lose on a neighbouring corner. The vendor library takes one to three years to tune each new precision class, FMM-class algorithms beat it during that window, and the frontier migrates to the next class as the vendor catches up. We measure across five NVIDIA generations (T4, L4, A100, H200) and seven precision classes, and ship the kernels as a drop-in patch that activates only on shapes we have proven to win.

Headline results

All numbers are measured, not extrapolated. Each card points to a section below with the supporting detail.

Square shape, A100-80 FP16
1.78×
Peak measured speedup over cuBLAS at N=8,192 on a 2-level FMM hybrid kernel.
Frontier-LLM training, H200 BF16
+12.1%
On the LLaMA-3-405B GPTQ Hessian operation. Generalises to LLaMA-3-70B, Qwen2.5-72B, Falcon-180B.
Embedding inference, T4 BF16
+3.6 to +7.6%
End-to-end on BGE, mxbai-embed-large-v1, GTE, all-roberta-large-v1 — measured on the production encoders, not synthetic GEMMs.
FP64 HPC, T4
+5.6 to +12.8%
HPL / LINPACK-class FP64 sweep on Turing. Largest deltas on rectangular shapes.

Where we win across the cube

Best measured speedup over a same-precision cuBLAS baseline at each (GPU, precision) corner, taken across all our measured shapes. Wins are highlighted; cells we have not surfaced a win on are dim. Read this as a map of where the patch fires, not as a per-shape table.

GPU FP64 FP32 TF32 BF16 FP16 FP8 INT8
T4 1.23× 1.52× 1.42× 1.27× 1.24×
L4 1.13× 1.40× 1.01× 1.03× 1.16× 0.86×
A100-40 1.12× 1.19× 1.07× 1.08× 1.08×
A100-80 1.12× 1.17× 1.09× 1.13× 1.78×
H200 1.13× 1.15× 1.11× 1.14× 1.08× 1.20× 1.15×

The migrating frontier: cuBLAS efficiency on each precision climbs from roughly 41% of peak to 95% of peak as the vendor matures the library, typically over one to three years. FMM-class wins are largest while a precision class is young (T4 BF16 and H200 FP8 today), narrowing as the library matures (A100 BF16). The pattern forecasts the next big wins on Blackwell FP4.

Frontier-LLM training transfer (Hopper BF16)

Three large BF16 training operations on the LLaMA-3-405B FFN-down weight matrix (out=16,384, in=53,248). The lift transfers from square-shape benchmarks to real frontier-training shapes once the sub-GEMMs dispatch to Hopper’s wgmma path. Same recipe generalises across LLaMA-3-70B, Qwen2.5-72B, Falcon-180B.

GPTQ Hessian (XTX)
1.121×
Measured at the typical calibration set K=32,768. Robust across K∈[8,192, 65,536].
Shampoo right factor (dWTdW)
1.114×
Second-order preconditioner term; same shape regime as the GPTQ Hessian.
Long-context weight gradient (XTG)
1.050×
At K=131,072 sequence-length context. Smaller delta because K dominates the operation count.

Across frontier-class LLMs

model hidden ffn_dim GPTQ Hessian Shampoo factor
LLaMA-3-405B16,38453,2481.121×1.114×
Falcon-180B 14,84859,3921.135×1.128×
Qwen2.5-72B 8,192 29,5681.097×1.095×
LLaMA-3-70B 8,192 28,6721.095×1.091×

Win condition is sharp: above an FFN-dim threshold in the mid-twenties of thousands. Below it (LLaMA-3-8B class, hidden=4,096), the form/reconstruct overhead is not amortised and the kernel falls through to stock cuBLAS. This is the regime where preconditioner sweeps and second-order optimisers on frontier-scale models live.

Embedding inference (T4 BF16, BERT-large class)

End-to-end throughput on production embedding encoders, measured on the model itself — not a synthetic GEMM sweep. Each row is one of the four leading open-weight BERT-large-class embedders. Capacity-freed is the headline; the dollar number is a footnote at T4 spot pricing and scales with whatever GPU class your fleet actually runs.

model baseline (vec/s) patched (vec/s) delta capacity freed / yr @ 10B vec/mo
BAAI/bge-large-en-v1.5 13.00813.997+7.60%~155,000 T4-hr
mixedbread-ai/mxbai-embed-large-v114.30815.241+6.52%~143,000 T4-hr
thenlper/gte-large 14.65015.333+4.66%~95,000 T4-hr
sentence-transformers/all-roberta-large-v114.75715.628+5.90%~119,000 T4-hr

The patch swaps the FFN projections (the largest GEMMs in a BERT-large encoder) foran FMM-tiled tensor-core kernel; the QKV/O projections fall below the measured win threshold and are not touched. Try the per-model dollar / hours calculator at /calculator.html.

Algorithm gain vs. implementation overhead

FMM-class algorithms reduce the multiplication count of an n×n matmul by 12.5% (the theoretical 8/7 ceiling). The remaining gap is the cost of the extra elementwise add passes. Stripping kernel-launch overhead with an 8-GEMM identical-block control places the pure algorithmic gain at 4–14% across the cube, against a 14.3% ceiling. The implication for partners: the wins are real and bounded, and the engineering work is in landing them on each new precision class before cuBLAS catches up.

Speedup vs matrix size

Square-shape envelope per GPU. At each measured N (log₂-spaced), the line shows the best speedup over a same-precision cuBLAS baseline taken across the methods we ship. Above the dashed 1.00× line is a win. One panel per GPU; one colour per precision class.

Atlas

Per-cell view of measured speedups across the (GPU, precision, shape, N) cube. Filter by GPU and precision; sort any column. We hide the method codename and the production patch-policy — those are the IP — and show the best measurement we have per cell. Square shapes use N; rectangular shapes use the labelled (M, K, N) tuple.

GPU
Precision
showing 0 of 0
GPU prec shape N / M×K×N cuBLAS ms kernel ms speedup TFLOPS
page 1 of 1

How it ships

Partner with us

We are looking for a small number of design partners who run embedding, reranking, or training workloads at scale and want a kernel-optimisation team without hiring one. The first kernel we ship is the proof point; the partnership is the recurring pipeline of wins on your specific shapes — your reranker, your quantised variants, your H100 / Blackwell path as it lands.

Concretely: a 30-minute working session, a measured win-zone probe on your actual models, a roadmap of next-kernel targets ranked by impact. No bytes leave your VPC; we ship binaries and a Python wrapper.

talk to us open the calculator