CUDA kernels for the shapes that matter

GEMM kernels that beat cuBLAS on NVIDIA tensor cores

Drop-in nn.Linear replacements measured to beat cuBLAS on production workloads — embedding inference, frontier-LLM training, FP64 HPC — with a safe fall-through to stock cuBLAS on every shape we have not measured a win on.

Matrix-multiply performance on NVIDIA hardware is governed jointly by silicon, the cuBLAS library, and matrix shape: the same kernel can win on one corner of the (GPU, precision, shape) cube and lose on a neighbouring corner. The vendor library takes one to three years to tune each new precision class, leaving a window on the corners it has not yet specialised — and the frontier migrates to the next class as the vendor catches up. Our engine searches that space and generates drop-in kernels that beat the stock library where it is under-tuned. We measure across five NVIDIA generations (T4, L4, A100, H200, B200) and eight precision classes — most recently Blackwell NVFP4, where we beat cuBLAS-FP4 the same year it shipped — and ship the kernels as a drop-in patch that activates only on shapes we have proven to win.

Headline results

All numbers are measured, not extrapolated. Each card points to a section below with the supporting detail.

Square shape, A100-80 FP16

1.78×

Peak measured speedup over cuBLAS at frontier-scale matrix sizes.

Frontier-LLM training, H200 BF16

+20.1%

On the Falcon-180B long-context weight gradient; +13.6% on the LLaMA-3-405B GPTQ Hessian. Generalises to LLaMA-3-70B and Qwen2.5-72B.

Embedding inference, T4 BF16

+4.7 to +7.6%

End-to-end on BGE, mxbai-embed-large-v1, GTE, all-roberta-large-v1 — measured on the production encoders, not synthetic GEMMs.

FP64 HPC, T4

+5.6 to +12.8%

HPL / LINPACK-class FP64 sweep on Turing. Largest deltas on rectangular shapes.

Square shape, T4 INT8

1.35×

cuBLAS's Turing INT8 kernel runs at only ~20% of peak; our kernel delivers up to 1.35× over it at scale. Measured, with stock cuBLAS as the floor on every shape.

Square shape, B200 NVFP4

1.17×

End-to-end over dense cuBLAS-FP4 at frontier scale — the newest precision class, beaten the year it shipped; win grows with size.

GPU vs speedup

Best measured speedup over a same-precision cuBLAS baseline at each (GPU, precision) corner, taken across all our measured shapes. Wins are highlighted; cells where we have not surfaced a win are dimmed and marked with a dash.

GPU	FP64	FP32	TF32	BF16	FP16	FP8	INT8	FP4
T4	1.28×	1.52×	—	1.27×	1.57×	—	1.35×	—
L4	1.13×	1.40×	1.13×	1.29×	1.43×	1.13×	1.07×	—
A100-80	1.14×	1.17×	1.09×	1.17×	1.78×	—	—	—
H200	1.14×	1.19×	1.17×	1.32×	1.39×	1.20×	1.75×	—
B200	1.13×	1.20×	1.17×	1.20×	1.22×	1.15×	1.06×	1.20×

Frontier-LLM training transfer (Hopper BF16)

Three large BF16 training operations on frontier-LLM FFN-down weight matrices: the GPTQ Hessian, the Shampoo right factor, and the long-context weight gradient. The lift transfers from square-shape benchmarks to real frontier-training shapes once the kernel dispatches to Hopper’s wgmma path. The same recipe generalises across LLaMA-3-70B, Qwen2.5-72B, and Falcon-180B.

GPTQ Hessian (X^TX)

1.136×

Measured on LLaMA-3-405B at typical GPTQ calibration-set sizes; robust across the practical range.

Shampoo right factor (dW^TdW)

1.123×

Second-order preconditioner term, measured on LLaMA-3-405B; same shape regime as the GPTQ Hessian.

Long-context weight gradient (X^TG)

1.201×

Falcon-180B at long-context training sequence lengths; 1.106× on LLaMA-3-405B.
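
For reference, the three operations can be written out with their shapes. This is a notational sketch only, under assumed conventions that the page does not state: the layer computes Y = XW, X holds n rows of layer inputs (tokens or calibration samples), G is the gradient with respect to Y, and dW is the weight gradient. Scaling constants used by particular GPTQ or Shampoo implementations are omitted.

```latex
\begin{aligned}
H &= X^{\top} X, &
  X &\in \mathbb{R}^{n \times d_{\mathrm{in}}}, &
  H &\in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{in}}}
  && \text{(GPTQ Hessian)} \\
R &= \mathrm{d}W^{\top}\,\mathrm{d}W, &
  \mathrm{d}W &\in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}, &
  R &\in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{out}}}
  && \text{(Shampoo right factor)} \\
\mathrm{d}W &= X^{\top} G, &
  G &\in \mathbb{R}^{n \times d_{\mathrm{out}}}, &
  \mathrm{d}W &\in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}
  && \text{(weight gradient)}
\end{aligned}
```

Under these conventions the first and third products reduce over n, which is the dimension that grows with calibration-set size or sequence length, while the output dimensions are set by the layer widths.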

Across frontier-class LLMs

model	GPTQ Hessian	Shampoo factor	long-context wgrad
LLaMA-3-405B	1.136×	1.123×	1.106×
Falcon-180B	1.135×	1.131×	1.201×
Qwen2.5-72B	1.051×	1.095×	1.064×
LLaMA-3-70B	1.039×	1.110×	1.071×

The win condition is sharp. Frontier-scale FFN widths clear it; small-model shapes (LLaMA-3-8B class) do not, because the kernel’s fixed overhead is not amortised there, so those shapes fall through to stock cuBLAS. The frontier-scale regime is where preconditioner sweeps and second-order optimisers on these models live.

Embedding inference (T4 BF16)

Throughput on production embedding encoders. The first four rows are end-to-end model throughput on BERT-large-class encoders, measured on the model itself, not on a synthetic GEMM sweep. The last two rows are jina-v5-omni (multimodal, three towers), measured per-FFN and aggregated across the towers the selector fires on. The capacity-freed column applies to the BERT-large-class rows and is based on T4 spot pricing; it scales with whatever GPU class your fleet actually runs.

model	throughput delta	capacity freed per year at 10B vectors/month
BAAI/bge-large-en-v1.5	+7.60%	~155,000 T4-hr
mixedbread-ai/mxbai-embed-large-v1	+6.52%	~143,000 T4-hr
thenlper/gte-large	+4.66%	~95,000 T4-hr
sentence-transformers/all-roberta-large-v1	+5.90%	~119,000 T4-hr
jinaai/jina-embeddings-v5-omni-small (per-FFN, vision + audio towers)	+5.27%	—
jinaai/jina-embeddings-v5-omni-nano (per-FFN, audio tower only)	+7.74%	—
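
The link between a throughput delta and freed capacity can be sketched as simple arithmetic. The example below is an illustration under stated assumptions, not the calculation behind the table: the baseline per-GPU throughput is a hypothetical input (the page does not give one), and it uses the convention that a throughput gain of δ cuts the GPU-hours for a fixed workload by δ/(1+δ). The published figures may use a different baseline or convention, so the output is not expected to reproduce the column above.

```python
def gpu_hours_freed_per_year(vectors_per_month: float,
                             baseline_vectors_per_sec: float,
                             throughput_delta: float) -> float:
    """GPU-hours saved per year when throughput rises by `throughput_delta`.

    `baseline_vectors_per_sec` is a hypothetical per-GPU rate, not a figure
    taken from this page.
    """
    if baseline_vectors_per_sec <= 0 or throughput_delta <= -1:
        raise ValueError("throughput must stay positive")
    vectors_per_year = vectors_per_month * 12
    baseline_hours = vectors_per_year / baseline_vectors_per_sec / 3600
    # The same work at (1 + delta) times the rate takes 1 / (1 + delta) of the time.
    faster_hours = baseline_hours / (1 + throughput_delta)
    return baseline_hours - faster_hours


# Illustrative inputs only: 10B vectors/month at an assumed 20 vectors/s per GPU.
freed = gpu_hours_freed_per_year(10e9, 20.0, 0.076)
print(f"{freed:,.0f} GPU-hours freed per year")
```

Swapping in your own measured baseline rate and monthly volume gives a fleet-specific estimate.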

Speedup vs matrix size

Square-shape envelope per GPU. At each measured N (log₂-spaced), the line shows the best speedup over a same-precision cuBLAS baseline taken across the methods we ship. Above the dashed 1.00× line is a win. One panel per GPU; one colour per precision class.
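
The envelope described above, and the per-corner best in the GPU vs speedup table, are both maxima over raw timings. A minimal sketch of that aggregation follows, assuming a hypothetical record layout of (gpu, precision, n, method, kernel_ms, cublas_ms); the GPU and method names and every timing are made-up placeholders, not measurements from this page.

```python
from collections import defaultdict

# Each record: (gpu, precision, n, method, kernel_ms, cublas_ms).
# Placeholder values for illustration only.
records = [
    ("GPU-A", "FP16", 4096, "method_1", 1.00, 1.10),
    ("GPU-A", "FP16", 4096, "method_2", 0.95, 1.10),
    ("GPU-A", "FP16", 8192, "method_1", 8.40, 8.00),
    ("GPU-A", "FP16", 8192, "method_2", 7.50, 8.00),
]


def envelope(records):
    """Best speedup over cuBLAS per (gpu, precision, n), across methods."""
    best = defaultdict(float)
    for gpu, precision, n, _method, kernel_ms, cublas_ms in records:
        speedup = cublas_ms / kernel_ms  # > 1.0 means faster than the baseline
        key = (gpu, precision, n)
        best[key] = max(best[key], speedup)
    return dict(best)


def corner_best(env):
    """Collapse the envelope over n: one number per (gpu, precision) corner."""
    out = defaultdict(float)
    for (gpu, precision, _n), speedup in env.items():
        out[(gpu, precision)] = max(out[(gpu, precision)], speedup)
    return dict(out)


env = envelope(records)
for key in sorted(env):
    flag = "win" if env[key] > 1.0 else "no win"
    print(key, f"{env[key]:.3f}x", flag)
print(corner_best(env))
```

Both functions compare against the cuBLAS timing recorded for the same precision, which is what keeps the baseline same-precision.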

How it ships

Drop-in nn.Linear replacement. Wrap your model once at load time; everything else in your inference / training pipeline stays exactly as it is. No retraining, no quantisation, no API change.
Faster where it counts, never slower. The dispatcher profiles every linear layer and applies the kernel precisely where it is measured to beat cuBLAS on your hardware — bit-identical stock behaviour everywhere else, by construction.
Self-contained and production-ready. A single CUDA kernel behind the standard PyTorch interface. No new compiler, no framework migration, nothing new for your team to maintain.
Measured on your model. For prospective partners we benchmark directly on your encoder, your reranker, your training preconditioner — and report wall-clock savings on the workloads that drive your bill.
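
The "faster where it counts, never slower" behaviour amounts to a lookup with a fall-through. The sketch below shows that selection logic in plain Python, under assumptions: a hypothetical table of measured speedups keyed by (GPU, dtype, M, N, K) and two interchangeable callables. The names `MEASURED_WINS`, `MIN_SPEEDUP`, and `select_backend`, the key layout, and the noise margin are illustrative and are not the shipped dispatcher.

```python
from typing import Callable, Dict, Tuple

ShapeKey = Tuple[str, str, int, int, int]  # (gpu, dtype, M, N, K)

# Hypothetical entries; a real table would be filled from benchmark runs.
MEASURED_WINS: Dict[ShapeKey, float] = {
    ("GPU-A", "bf16", 8192, 8192, 8192): 1.12,
    ("GPU-A", "bf16", 512, 512, 512): 0.97,
}

MIN_SPEEDUP = 1.02  # assumed margin so measurement noise is not counted as a win


def select_backend(key: ShapeKey,
                   custom: Callable,
                   stock: Callable,
                   wins: Dict[ShapeKey, float] = MEASURED_WINS) -> Callable:
    """Return `custom` only for shapes with a measured win, else `stock`."""
    speedup = wins.get(key)  # None for shapes that were never measured
    if speedup is not None and speedup >= MIN_SPEEDUP:
        return custom
    return stock


def stock_matmul(a, b):
    # Stand-in for the stock library call.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def custom_matmul(a, b):
    # Stand-in for a tuned kernel; it must return the same result as stock.
    return stock_matmul(a, b)


backend = select_backend(("GPU-A", "bf16", 4096, 4096, 4096),
                         custom_matmul, stock_matmul)
print(backend.__name__)  # unmeasured shape, so this falls through to stock
```

Because unmeasured and losing shapes both return the stock callable unchanged, the default path stays the stock one in this design.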

Interested?

talk to us