FMM-patched embedding inference · NVIDIA T4 BF16

How many T4-hours of ingest capacity does the FMM patch free up on your embedding stack?

Pick a model, kernel, GPU price tier, and monthly volume; you’ll see the cuBLAS baseline, the patched cost, and the dollars and T4-hours you get back. All figures are wall-clock measurements on production BERT-large-class embedding models. The patch is a drop-in nn.Linear replacement: no retraining, no quantization, no inference API change.

Workload
GPU + pricing
NVIDIA T4 (BF16)
Volume (vectors per month)
10,000,000,000

cuBLAS baseline
FMM patched
Delta
Throughput
$ per billion vectors
Monthly cost at volume
Capacity freed (yearly)
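
The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual code: the `Scenario` fields, the function names, and every throughput and price figure in the example call are hypothetical placeholders, not measured results. It assumes throughput is expressed in vectors per second on a single T4 and that cost scales linearly with GPU-hours.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600
MONTHS_PER_YEAR = 12


@dataclass(frozen=True)
class Scenario:
    """Inputs to the estimate. All field names are illustrative assumptions."""
    vectors_per_month: int    # monthly embedding volume
    baseline_vps: float       # cuBLAS throughput, vectors/second on one T4
    patched_vps: float        # FMM-patched throughput, vectors/second on one T4
    usd_per_gpu_hour: float   # price of one T4-hour at the chosen tier


def gpu_hours(vectors: float, vectors_per_second: float) -> float:
    """T4-hours needed to embed `vectors` at a given sustained throughput."""
    if vectors_per_second <= 0:
        raise ValueError("throughput must be positive")
    return vectors / vectors_per_second / SECONDS_PER_HOUR


def column(s: Scenario, vps: float) -> dict:
    """One column of the table (baseline or patched)."""
    monthly_hours = gpu_hours(s.vectors_per_month, vps)
    return {
        "throughput_vps": vps,
        "usd_per_billion": gpu_hours(1e9, vps) * s.usd_per_gpu_hour,
        "monthly_gpu_hours": monthly_hours,
        "monthly_cost_usd": monthly_hours * s.usd_per_gpu_hour,
    }


def summarize(s: Scenario) -> dict:
    """Baseline column, patched column, and the delta between them."""
    base = column(s, s.baseline_vps)
    patched = column(s, s.patched_vps)
    freed_monthly = base["monthly_gpu_hours"] - patched["monthly_gpu_hours"]
    return {
        "baseline": base,
        "patched": patched,
        "delta": {
            "monthly_savings_usd": base["monthly_cost_usd"] - patched["monthly_cost_usd"],
            "gpu_hours_freed_yearly": freed_monthly * MONTHS_PER_YEAR,
        },
    }


if __name__ == "__main__":
    # Placeholder inputs for illustration only; substitute your own measurements.
    example = Scenario(
        vectors_per_month=10_000_000_000,
        baseline_vps=1000.0,
        patched_vps=1250.0,
        usd_per_gpu_hour=0.50,
    )
    for name, values in summarize(example).items():
        print(name, values)
```

Keeping the per-column calculation in one function means the baseline and patched figures are always derived the same way, so the delta reflects only the throughput difference.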

Outreach blurb

Plain-text. Drop into a cold email or slide footer; the numbers update with the form above.


How it works