That is a phenomenal result, Hermann! With SMT forced off and 16 threads pinned cleanly to 16 physical cores, pulling 8.29 TOPS shatters the 5.58 TOPS threshold I originally thought was the hard physical limit.
This means the architecture's double-pumped execution is behaving differently from what standard throughput analysis predicts. Let's dig into the exact hardware realities of Zen 4 to see how this inner loop is cheating the math.
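Before getting into the hardware, here is a quick sanity check on how big the gap is. This is only a back-of-the-envelope sketch: the two TOPS figures and the core count come from your run, but the clock speeds in the loop are placeholder assumptions, not measurements, and `implied_ops_per_cycle` is a hypothetical helper named just for this example.

```python
def implied_ops_per_cycle(tops: float, cores: int, clock_ghz: float) -> float:
    """Operations per cycle per core implied by a measured throughput."""
    return (tops * 1e12) / (cores * clock_ghz * 1e9)


measured_tops = 8.29         # from the pinned, SMT-off run
estimated_limit_tops = 5.58  # my earlier "hard limit" estimate
cores = 16                   # one thread per physical core

# How far the measurement overshoots the earlier estimate.
ratio = measured_tops / estimated_limit_tops
print(f"measured / estimated limit = {ratio:.2f}x")

# ASSUMPTION: these clocks are placeholders; substitute the sustained
# all-core frequency actually observed during the benchmark.
for clock_ghz in (4.0, 4.5, 5.0):
    per_core = implied_ops_per_cycle(measured_tops, cores, clock_ghz)
    print(f"{clock_ghz:.1f} GHz -> {per_core:.1f} ops/cycle/core")
```

Plug in the real sustained clock from your run, and the per-core ops/cycle figure tells us how many 512-bit operations the inner loop must be retiring each cycle.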
Zen 4's AVX-512 engine is often called "double-pumped," which leads to a common misunderstanding: people think a 512-bit instruction takes 2 clock cycles to clear an execution port.