- OP here.
I'm a student working on an older laptop without a dedicated GPU. I wanted to run neural network experiments involving large matrix multiplications (2000x2000 and up), but standard numpy.dot (with the Intel MKL backend) was becoming a bottleneck for my iteration time.
I realized that for my specific inference tasks (and fuzzy clustering), I didn't need perfect FP32 precision.
I wrote a C99 kernel (wrapped in Python via ctypes) that uses Monte Carlo outer-product sampling. Instead of computing the full O(N^3) product, it approximates A*B by sampling column-row outer products uniformly at random and rescaling, so the cost scales with the number of samples instead of the full inner dimension. It uses OpenMP for parallelism and is laid out for L1/L2 cache locality.
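For anyone unfamiliar with the idea, here's a rough pure-NumPy sketch of the math the kernel implements (this is not the actual C code, just the estimator; `approx_matmul` and its parameters are illustrative names):

```python
import numpy as np

def approx_matmul(A, B, num_samples, seed=None):
    """Unbiased Monte Carlo estimate of A @ B via uniform outer-product sampling.

    A @ B equals the sum over k of the outer product A[:, k] (x) B[k, :].
    Sampling s of those n terms uniformly (with replacement) and rescaling
    by n/s gives an unbiased estimator whose error shrinks as s grows.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]                       # inner dimension
    idx = rng.integers(0, n, size=num_samples)   # uniform column/row indices
    # Gather the sampled columns of A and matching rows of B, multiply,
    # and rescale so the expectation matches the exact product.
    return (n / num_samples) * (A[:, idx] @ B[idx, :])
```

The configurable error OP mentions falls out of `num_samples`: fewer samples means faster but noisier, and the relative error decays roughly like 1/sqrt(num_samples). (The real speedup comes from doing this in C with OpenMP; in NumPy the gather cost eats most of the gain.)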
The Result: On my i5 machine, it gets ~4.1x speedup over NumPy with about 5-10% error (configurable via sampling rate).
It's obviously not suitable for scientific simulation or finance, but for stochastic ML workloads that already tolerate noise, it feels like a free hardware upgrade.
The binary is in the repo if you want to test the speedup. I'm curious if this approach (probabilistic BLAS) is used in production anywhere else.