• WebGPU (WGSL) handles this by having a specified accuracy for each operation.

    https://www.w3.org/TR/WGSL/#concrete-float-accuracy

    This is all fully tested in the CTS.

    https://gpuweb.github.io/cts/standalone/?q=webgpu:shader,*
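
    As a rough illustration of what a per-operation accuracy bound means, here is a minimal Python sketch that measures the error of a binary32 result in ULPs (units in the last place). It assumes the bound is stated in ULPs. The helper names are made up for this example and are not part of WGSL or the CTS, and `ulp_f32` only covers the normal binary32 range.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulp_f32(x: float) -> float:
    """Spacing of adjacent binary32 values around x (normal range only)."""
    _, e = math.frexp(abs(x))       # abs(x) = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 24)  # binary32 has a 24-bit significand

def ulp_error(computed: float, reference: float) -> float:
    """Distance between computed and reference, in binary32 ULPs of the reference."""
    return abs(computed - reference) / ulp_f32(reference)

# A correctly rounded result is at most half an ULP away from the exact value.
third = 1.0 / 3.0
err_rounded = ulp_error(to_f32(third), third)

# A result that is off by one representable binary32 step is exactly 1 ULP away.
err_one_step = ulp_error(1.0 + 2.0 ** -23, 1.0)
```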

  • > This is not the first time we can see Nvidia taking shortcuts to achieve maximum performance of their GPUs

    Why is implementing it correctly not performant? For context I have no idea how rounding is typically implemented anyways.

    • It is not correct because it does not implement the FP arithmetic standard and this can lead to much greater numerical errors than expected.

      NVIDIA is not solely responsible, because the Microsoft DirectX specification includes the non-standard behavior.

      Nevertheless, as shown in TFA, both the AMD and Intel GPUs allow the user to choose between correct behavior and incorrect behavior that might be faster, while NVIDIA ignores what the user requests and implements only the non-standard behavior.

      The developers of graphics or ML/AI applications do not care about errors, but there are also people who want to use GPUs for normal computations, where the accuracy of the results matters, so they want to be able to choose between correct behavior and incorrect but faster behavior.

      Actually "faster" is a misnomer, because denormals can be handled correctly without diminishing the speed, but that costs additional die area. Thus what NVIDIA gains by not implementing the right behavior is a reduced production cost.

  • Denormals happen to be the way that Zero can even be represented at all?
  • Flush denormals to zero. Even their inventor had trouble writing correct code in their presence - see the Appendix to that "what every programmer should know..." paper
    • On the other hand, they (unexpectedly to the inventor, who intended them to be a debugging tool) underpin a few foundational results in correctly rounded computation, such as https://en.wikipedia.org/wiki/Sterbenz_lemma.
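
      A minimal Python sketch of that point, using binary64 because that is what Python floats are. The `flush_to_zero` helper is a made-up emulation for this example, not a real hardware mode switch. With gradual underflow the difference of two nearby tiny values is exact and nonzero; with flushing it collapses to zero even though the inputs differ.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # 2**-1022, smallest positive normal binary64

def flush_to_zero(x: float) -> float:
    """Emulate flush-to-zero: replace a subnormal value with a signed zero."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

# Two normal numbers with y/2 <= x <= 2*y, as the Sterbenz lemma requires.
x = 1.5 * MIN_NORMAL
y = 1.0 * MIN_NORMAL

# With subnormals, the difference is the exact value 2**-1023.
diff = x - y

# The same subtraction under emulated flush-to-zero loses the result entirely.
flushed = flush_to_zero(diff)
```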
    • > Even their inventor had trouble writing correct code in their presence

      I didn't know that. Could you provide a more specific reference?

  • Another thing to keep in mind is that CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago.

    For a lot of applications the difference between a denormal and zero is small enough to be irrelevant, so if you expect near-zero values to be common, enabling a denormals-to-zero compiler flag might give you a pretty nice performance boost for free.

    • Denormal processing is slow only on certain CPUs, where the designers have been lazy, so when denormals are encountered they are handled by a microprogrammed sequence.

      During the last half century there have been plenty of CPUs where denormals are handled in hardware, so that any slowdown caused by them is negligible.

      Except when generating graphics seen by humans or in ML/AI applications, neither flushing results to zero nor treating denormal inputs as zero is acceptable, because both can lead to huge errors.

      Whoever fears that denormals can slow down an application must enable the underflow exception. In that case denormals are never generated, but the underflow exceptions must be handled: when denormals are not desired but underflows happen, there are bugs in the program that must be fixed.

      Denormals were created so that people can mask the underflow exception and avoid handling it, without dire consequences.

      However, this habit of no longer handling floating-point exceptions, as was done before the IEEE 754 standard, has produced younger developers who are unaware of how FP arithmetic must be handled to avoid errors. As a result, too many now believe that "-ffast-math" is acceptable in general-purpose programs, not only in special applications where result accuracy does not matter.

      For correct results, you must use either denormals or underflow exception handling. There is no third choice. The third choice, like in GPUs, is only for when correctness is irrelevant.
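
      To make the second option concrete, here is a minimal Python sketch of a software check that plays the role of an unmasked underflow exception. Python does not expose the hardware underflow trap for its floats, so `checked_mul` and `UnderflowDetected` are hypothetical names for an emulation, not a real API. The check fires whenever a product of nonzero operands is tiny, whether it landed on a subnormal or on zero.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal binary64

class UnderflowDetected(ArithmeticError):
    """Raised when a product falls below the normal range (hypothetical)."""

def checked_mul(a: float, b: float) -> float:
    """Multiply a and b, raising instead of silently returning a tiny result."""
    r = a * b
    # Nonzero operands whose product is below the smallest normal number
    # means the result underflowed to a subnormal or to zero.
    if a != 0.0 and b != 0.0 and abs(r) < MIN_NORMAL:
        raise UnderflowDetected(f"{a!r} * {b!r} underflowed to {r!r}")
    return r

def underflows(a: float, b: float) -> bool:
    """Report whether checked_mul(a, b) signals an underflow."""
    try:
        checked_mul(a, b)
    except UnderflowDetected:
        return True
    return False
```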

    • > CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago

      Intel CPU processing, where slowdowns can be as bad as a couple hundred cycles. AMD CPUs penalize them much more mildly, usually single-digit cycles. (No idea about ARM.)

    • CPUs that aren't Intel are plenty fast on denormals. Intel is the only one where denormals are 100x slower (and Intel has fixed that on their new CPUs, but only on their E-cores).
    • More like 100x, but not sure how true that is nowadays.
  • It's one of several issues with the design of IEEE floats, unfortunately. I wish we could start thinking more seriously about a new design, to complement if not replace IEEE in the long term. Posits are an example https://github.com/andrepd/posit-rust