• Unlike GPUs, CPUs aren't designed for massive parallelism. Because of this, batching inference won't necessarily give you a speed boost here. In fact, it can actually slow the process down.

    Instead, I'd recommend exploring CPU-specific AI optimizations. For instance, leveraging AVX512_BF16 instructions could reduce the inference time by 2x or 3x compared to the results in the article. OpenVINO supports this really well on Intel CPUs, and converting an ONNX model to OpenVINO is straightforward.

    • A consumer CPU like a 285K caps out around 130 GB/s of memory bandwidth.

      Each of its 24 cores can do two 8-wide FMA ops per cycle. Let's say it holds a continuous 4 GHz clock speed.

      This works out to over 1.5 trillion 32-bit floating-point multiplies per second.

      If you are doing vector-matrix multiplies (`xW`, as in single-token inference with no batching), then each weight loaded gets used in just 1 multiplication and 1 addition.

      Doing the math, you can clearly see that even if each weight were just 1 byte, you could load at most 130 billion of them per second from memory.

      But in the same timespan you could have done over 1.5 trillion multiplications.

      So you are still memory bound.

    • +1 for OpenVINO, we utilise it for our model. The inference speed you can get from CPUs is quite amazing; most people would assume the model is running on a GPU.
    • I'm not sure how you manage to be so wrong about something this simple.

      If you do a single inference at a time, you do GEMV, which spends most of the time loading parameters and then performs one multiplication and one add per parameter.

      If you do batching, then you get to do GEMM, which means you load the parameter once and perform multiple calculations per parameter. This is faster even for a purely sequential matrix multiplication implementation. CPUs tend to have both SIMD and multiple cores these days. This means that your computational resources exceed the available memory bandwidth by far.

      What you suggested in the second "paragraph" is just letting someone else do the batching but with a lower precision data type. You're starting to contradict your first point.

    • ONNX Runtime has AVX512 CPU kernels too, and openvino uses ONNX internally (and ONNX Runtime supports an OpenVINO backend)
      • > openvino uses ONNX internally

        OpenVINO only uses ONNX to parse the model, not to execute it. It runs computations through its own highly optimized inference engine specifically designed for Intel hardware. It doesn't rely on the ONNX engine at all, and it will even automatically convert eligible model weights to BF16 for you.
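
The back-of-envelope numbers from the memory-bandwidth replies above can be checked with a few lines of arithmetic. This is only a rough sketch: the constants are the figures quoted in the thread (24 cores, two 8-wide FMAs per cycle, 4 GHz, 130 GB/s), the function names are made up for illustration, and the model assumes the weights do not fit in cache and ignores activation traffic entirely.

```python
# Rough roofline-style check using the figures quoted in the thread.
# All constants are the commenters' assumptions, not measured values.
CORES = 24
FMA_UNITS_PER_CORE = 2        # two FMA ops issued per cycle per core
LANES_FP32 = 8                # 8-wide 32-bit vectors
CLOCK_HZ = 4.0e9              # assumed sustained clock
MEM_BANDWIDTH_BYTES_PER_S = 130e9


def peak_fma_per_second() -> float:
    """Upper bound on fused multiply-adds per second across all cores."""
    return CORES * FMA_UNITS_PER_CORE * LANES_FP32 * CLOCK_HZ


def weights_per_second(bytes_per_weight: float) -> float:
    """Upper bound on weights streamed from memory per second."""
    return MEM_BANDWIDTH_BYTES_PER_S / bytes_per_weight


def break_even_batch(bytes_per_weight: float) -> float:
    """Batch size at which compute time matches weight-loading time.

    With a batch of B rows, each loaded weight is used in B FMAs (GEMM)
    instead of one (GEMV), so B must reach compute / bandwidth before
    the FMA units, rather than memory, become the limit.
    """
    return peak_fma_per_second() / weights_per_second(bytes_per_weight)


if __name__ == "__main__":
    print(f"peak FMA/s:            {peak_fma_per_second():.3e}")
    for nbytes in (1, 4):
        print(f"{nbytes}-byte weights/s:      {weights_per_second(nbytes):.3e}")
        print(f"break-even batch size: {break_even_batch(nbytes):.1f}")
```

Under these assumptions a batch of one leaves most of the FMA throughput idle, and it takes a batch of roughly a dozen rows with 1-byte weights (more with wider weights) before compute becomes the limit, which is the GEMV versus GEMM point made above.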

  • We really need a replacement for all-MiniLM-L12-v2 that can create more robust embeddings with the same compute.

    You can technically do Q4 quantization for larger embedding models but I am not sure if that plays nice with ONNX.

    • it's a pain in the ass to do properly.

      what we really need is something like auto-round for ONNX

  • ONNX is my first suggestion to people looking for speed gains on CPU
  • Spinlocks are basically the heroin of parallel programming. So many people are addicted to them in the pursuit of performance but the truth is that in 99% of all cases, they are a terrible idea.

    Something people don't get about spinlocks is that you're basically saying you own the entire CPU core. In any other situation where the core is shared by multiple processes, it is inherently illogical to use a spinlock.

    • In my experience, there haven't been many cases where I can't assign a core or multiple of them to spinlocks, if it's in my hot path.

      That said, I probably wouldn't ship spinlocks in consumer libraries or code I expect to be reused across deployments.
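
To make the "you own the entire core" point concrete, here is a minimal sketch of what a spinlock does while it waits. Python is used only to show the control flow; real spinlocks are normally built on atomic instructions in C, C++, or Rust. The `SpinLock` class and the `count_with_lock` helper are hypothetical names made up for this example.

```python
import threading


class SpinLock:
    """Busy-waiting lock built on a non-blocking try-acquire (illustrative only)."""

    def __init__(self) -> None:
        self._inner = threading.Lock()

    def acquire(self) -> None:
        # Spin: retry immediately instead of sleeping, so the waiting
        # thread keeps burning CPU time for the whole wait.
        while not self._inner.acquire(blocking=False):
            pass

    def release(self) -> None:
        self._inner.release()

    def __enter__(self) -> "SpinLock":
        self.acquire()
        return self

    def __exit__(self, *exc) -> None:
        self.release()


def count_with_lock(lock, n_threads: int = 4, increments: int = 2_000) -> int:
    """Increment a shared counter from several threads under the given lock."""
    total = 0

    def worker() -> None:
        nonlocal total
        for _ in range(increments):
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(count_with_lock(SpinLock()))        # busy-waits while contended
    print(count_with_lock(threading.Lock()))  # blocks while contended
```

Both variants produce the same count; the difference is what a waiting thread does in the meantime. In CPython the interpreter lock plays the role of a shared core: a spinning thread uses up its time slice while the thread that actually holds the lock cannot run, which is the failure mode described above for cores shared between processes.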