FMM-class GEMM kernels for NVIDIA tensor cores
Drop-in nn.Linear replacements measured to beat cuBLAS on production workloads — embedding inference, frontier-LLM training, FP64 HPC — with a safe fall-through to stock cuBLAS on every shape we have not measured a win on.
Fast matrix multiplication on NVIDIA hardware is governed jointly by silicon, the cuBLAS library, and matrix shape: the same algorithmic kernel can win on one corner of the (GPU, precision, shape) cube and lose on a neighbouring corner. The vendor library takes one to three years to tune each new precision class, FMM-class algorithms beat it during that window, and the frontier migrates to the next class as the vendor catches up. We measure across five NVIDIA generations (T4, L4, A100, H200) and seven precision classes, and ship the kernels as a drop-in patch that activates only on shapes we have proven to win.
Headline results
All numbers are measured, not extrapolated. Each card points to a section below with the supporting detail.
Where we win across the cube
Best measured speedup over a same-precision cuBLAS baseline at each (GPU, precision) corner, taken across all our measured shapes. Wins are highlighted; cells we have not surfaced a win on are dim. Read this as a map of where the patch fires, not as a per-shape table.
| GPU | FP64 | FP32 | TF32 | BF16 | FP16 | FP8 | INT8 |
|---|---|---|---|---|---|---|---|
| T4 | 1.23× | 1.52× | 1.42× | 1.27× | 1.24× | — | — |
| L4 | 1.13× | 1.40× | — | 1.01× | 1.03× | 1.16× | 0.86× |
| A100-40 | 1.12× | 1.19× | 1.07× | 1.08× | 1.08× | — | — |
| A100-80 | 1.12× | 1.17× | 1.09× | 1.13× | 1.78× | — | — |
| H200 | 1.13× | 1.15× | 1.11× | 1.14× | 1.08× | 1.20× | 1.15× |
The migrating frontier: cuBLAS efficiency on each precision climbs from roughly 41% of peak to 95% of peak as the vendor matures the library, typically over one to three years. FMM-class wins are largest while a precision class is young (T4 BF16 and H200 FP8 today), narrowing as the library matures (A100 BF16). The pattern forecasts the next big wins on Blackwell FP4.
Frontier-LLM training transfer (Hopper BF16)
Three large BF16 training operations on the LLaMA-3-405B FFN-down weight matrix (out=16,384, in=53,248). The lift transfers from square-shape benchmarks to real frontier-training shapes once the sub-GEMMs dispatch to Hopper’s wgmma path. Same recipe generalises across LLaMA-3-70B, Qwen2.5-72B, Falcon-180B.
XTX)dWTdW)XTG)Across frontier-class LLMs
| model | hidden | ffn_dim | GPTQ Hessian | Shampoo factor |
|---|---|---|---|---|
| LLaMA-3-405B | 16,384 | 53,248 | 1.121× | 1.114× |
| Falcon-180B | 14,848 | 59,392 | 1.135× | 1.128× |
| Qwen2.5-72B | 8,192 | 29,568 | 1.097× | 1.095× |
| LLaMA-3-70B | 8,192 | 28,672 | 1.095× | 1.091× |
Win condition is sharp: above an FFN-dim threshold in the mid-twenties of thousands. Below it (LLaMA-3-8B class, hidden=4,096), the form/reconstruct overhead is not amortised and the kernel falls through to stock cuBLAS. This is the regime where preconditioner sweeps and second-order optimisers on frontier-scale models live.
Embedding inference (T4 BF16, BERT-large class)
End-to-end throughput on production embedding encoders, measured on the model itself — not a synthetic GEMM sweep. Each row is one of the four leading open-weight BERT-large-class embedders. Capacity-freed is the headline; the dollar number is a footnote at T4 spot pricing and scales with whatever GPU class your fleet actually runs.
| model | baseline (vec/s) | patched (vec/s) | delta | capacity freed / yr @ 10B vec/mo |
|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 13.008 | 13.997 | +7.60% | ~155,000 T4-hr |
| mixedbread-ai/mxbai-embed-large-v1 | 14.308 | 15.241 | +6.52% | ~143,000 T4-hr |
| thenlper/gte-large | 14.650 | 15.333 | +4.66% | ~95,000 T4-hr |
| sentence-transformers/all-roberta-large-v1 | 14.757 | 15.628 | +5.90% | ~119,000 T4-hr |
The patch swaps the FFN projections (the largest GEMMs in a BERT-large encoder) foran FMM-tiled tensor-core kernel; the QKV/O projections fall below the measured win threshold and are not touched. Try the per-model dollar / hours calculator at /calculator.html.
Algorithm gain vs. implementation overhead
FMM-class algorithms reduce the multiplication count of an n×n matmul by 12.5% (the theoretical 8/7 ceiling). The remaining gap is the cost of the extra elementwise add passes. Stripping kernel-launch overhead with an 8-GEMM identical-block control places the pure algorithmic gain at 4–14% across the cube, against a 14.3% ceiling. The implication for partners: the wins are real and bounded, and the engineering work is in landing them on each new precision class before cuBLAS catches up.
Speedup vs matrix size
Square-shape envelope per GPU. At each measured N (log₂-spaced), the line shows the best speedup over a same-precision cuBLAS baseline taken across the methods we ship. Above the dashed 1.00× line is a win. One panel per GPU; one colour per precision class.
Atlas
Per-cell view of measured speedups across the (GPU, precision, shape, N) cube. Filter by GPU and precision; sort any column. We hide the method codename and the production patch-policy — those are the IP — and show the best measurement we have per cell. Square shapes use N; rectangular shapes use the labelled (M, K, N) tuple.
| GPU | prec | shape | N / M×K×N | cuBLAS ms | kernel ms | speedup | TFLOPS |
|---|
How it ships
-
Drop-in
nn.Linearreplacement. Wrap your model once at load time; the rest of the inference / training pipeline is unchanged. No retraining, no quantisation, no API surface change. - Win-zone gate, not a global swap. The patcher inspects the shape of every linear layer and only swaps in the FMM kernel where it has been measured to beat cuBLAS on the target SKU. Every other shape falls through to stock cuBLAS bit-identically. There is no slow path.
- Reviewable as a dependency. The off-zone path is the stock path; the on-zone path is a self-contained CUDA kernel. No new compiler. No new attention pattern. No new numerics surface for your inference team to validate.
- Measured on your model, not generalised from a GEMM sweep. For prospective partners we run the win-zone probe directly on your encoder, your reranker, your training preconditioner — whatever shapes drive your bill.
Partner with us
We are looking for a small number of design partners who run embedding, reranking, or training workloads at scale and want a kernel-optimisation team without hiring one. The first kernel we ship is the proof point; the partnership is the recurring pipeline of wins on your specific shapes — your reranker, your quantised variants, your H100 / Blackwell path as it lands.
Concretely: a 30-minute working session, a measured win-zone probe on your actual models, a roadmap of next-kernel targets ranked by impact. No bytes leave your VPC; we ship binaries and a Python wrapper.
talk to us open the calculator