FMM-patched embedding inference · NVIDIA T4 BF16
How many T4-hours of ingest capacity does the FMM patch free up on your embedding stack?
Pick a model, kernel, GPU price tier, and monthly volume; you’ll see the
cuBLAS baseline, the patched cost, and the dollars and T4-hours you get
back. Numbers are wall-clock measurements on production BERT-large-class embedding
models. The patch is a drop-in nn.Linear replacement: no
retraining, no quantization, no inference API change.
| Metric | cuBLAS baseline | FMM patched | Delta |
|---|---|---|---|
| Throughput | — | — | — |
| $ per billion vectors | — | — | — |
| Monthly cost at volume | — | — | — |
| Capacity freed (yearly) | — | — | — |
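The arithmetic behind these four rows can be sketched as follows. This is a minimal illustration, not the page's actual calculator: the class name `Scenario`, the helper `yearly_savings`, and every input number are assumptions chosen for easy arithmetic, not measured throughput or real prices.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600


@dataclass(frozen=True)
class Scenario:
    """One column of the table (hypothetical names, placeholder inputs)."""
    throughput_vps: float      # vectors per second on one T4
    price_per_gpu_hour: float  # dollars per T4-hour
    monthly_vectors: float     # vectors embedded per month

    def gpu_hours_per_month(self) -> float:
        # volume / rate gives seconds of GPU time; convert to hours
        return self.monthly_vectors / self.throughput_vps / SECONDS_PER_HOUR

    def dollars_per_billion(self) -> float:
        # GPU-hours needed for 1e9 vectors, times the hourly price
        return 1e9 / self.throughput_vps / SECONDS_PER_HOUR * self.price_per_gpu_hour

    def monthly_cost(self) -> float:
        return self.gpu_hours_per_month() * self.price_per_gpu_hour


def yearly_savings(baseline: Scenario, patched: Scenario) -> tuple[float, float]:
    """Return (T4-hours freed per year, dollars saved per year)."""
    hours = 12 * (baseline.gpu_hours_per_month() - patched.gpu_hours_per_month())
    dollars = 12 * (baseline.monthly_cost() - patched.monthly_cost())
    return hours, dollars


# Placeholder inputs only: 1000 vs 1250 vectors/s, $0.50 per T4-hour,
# 3.6 billion vectors per month. None of these are measurements.
baseline = Scenario(throughput_vps=1000, price_per_gpu_hour=0.50, monthly_vectors=3.6e9)
patched = Scenario(throughput_vps=1250, price_per_gpu_hour=0.50, monthly_vectors=3.6e9)

hours_freed, dollars_saved = yearly_savings(baseline, patched)
print(f"baseline: {baseline.gpu_hours_per_month():.0f} T4-hours/month, "
      f"${baseline.monthly_cost():.2f}/month")
print(f"patched:  {patched.gpu_hours_per_month():.0f} T4-hours/month, "
      f"${patched.monthly_cost():.2f}/month")
print(f"freed per year: {hours_freed:.0f} T4-hours, ${dollars_saved:.2f}")
```

With price and volume held fixed, every row scales with the inverse of throughput, so the Delta column depends only on the ratio of patched to baseline throughput.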
How it works
- Drop-in PyTorch nn.Linear replacement. Wrap your model once; the rest of the inference pipeline is unchanged. No retraining, no quantization, no API surface change.
- Safe across your fleet. The patcher inspects the shape of every linear layer at load time and only swaps in the FMM kernel where it’s been measured to beat cuBLAS on your hardware. Everything else stays on cuBLAS. There is no slow path.
- Want numbers on your model and shape? The landing page summarises where the patch wins across the (GPU, precision) cube; for a benchmark on your specific model, reach out at hello@unified-sciences.com.
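The load-time selection described above can be pictured as a lookup keyed on layer shape. The sketch below is an assumption about how such a gate might look, not the patcher's actual code: `MEASURED_WINS`, `plan_patches`, and the allowlist entries are hypothetical, and the listed shapes are placeholders rather than measured wins.

```python
from typing import Iterable

# Hypothetical allowlist of (in_features, out_features) shapes where the FMM
# kernel is assumed to have been measured faster than cuBLAS for one
# (GPU, precision) pair. Placeholder entries, not benchmark results.
MEASURED_WINS = frozenset({(1024, 4096), (4096, 1024)})


def plan_patches(
    layers: Iterable[tuple[str, int, int]],
    measured_wins: frozenset = MEASURED_WINS,
) -> dict[str, str]:
    """Map each linear layer name to the kernel it should run on.

    A layer is routed to "fmm" only if its exact shape is in the allowlist;
    every other layer keeps the default "cublas" path.
    """
    plan = {}
    for name, in_features, out_features in layers:
        shape = (in_features, out_features)
        plan[name] = "fmm" if shape in measured_wins else "cublas"
    return plan


# Illustrative layer list using BERT-large dimensions (hidden size 1024,
# feed-forward size 4096); the layer names are examples only.
layers = [
    ("encoder.layer.0.intermediate.dense", 1024, 4096),
    ("encoder.layer.0.output.dense", 4096, 1024),
    ("encoder.layer.0.attention.self.query", 1024, 1024),
]

plan = plan_patches(layers)
for name, kernel in plan.items():
    print(f"{name}: {kernel}")
```

In a PyTorch model, the (name, in_features, out_features) triples could be collected by iterating `model.named_modules()` and keeping the `nn.Linear` instances. Defaulting to cuBLAS for any shape not on the list is what keeps unmeasured layers on their existing path.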