Scalability of N-Body Simulations on FPGAs and CPUs Published and to be Presented

Under the title "The Strong Scaling Advantage of FPGAs in HPC for N-Body Simulations", our recent results on the scalability of N-Body simulations have been published in the ACM Transactions on Reconfigurable Technology and Systems (TRETS) and are available as an open access article. The work will be presented and discussed at the International Conference on Field-Programmable Technology (FPT'21), held in a virtual format, in the night from 9 to 10 December (corrected time: midnight, 0:00 CET). Registration for the entire conference program is still possible and free of charge.

For this work, we developed highly optimized implementations of N-Body simulations for the Intel Stratix 10 FPGAs and the Xeon Gold 6148 "Skylake" CPUs in Noctua 1. The FPGA design includes a novel solution for the required triangular accumulation pattern, and the CPU version profits from manual vectorization, a refinement scheme for inverse square roots, and manual scheduling of instructions. Both designs employ a ring communication scheme to scale over multiple nodes, where the FPGA solution makes use of the user-customizable FPGA-to-FPGA communication infrastructure of Noctua 1. Update 2022: The FPGAs with user-customizable FPGA-to-FPGA communication infrastructure are now integrated in Noctua 2.
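To illustrate two of the ideas named above, the following is a minimal single-process sketch in Python, not the project's actual FPGA or CPU code. It assumes a Newton-Raphson iteration as the refinement scheme for the inverse square root and emulates the coarse starting estimate by rounding to single precision. It also simulates the ring scheme by rotating blocks of particle indices between hypothetical "devices". All function names are made up for illustration. The sketch evaluates every pair twice and does not attempt the triangular accumulation pattern.

```python
import math
import struct


def rsqrt_refined(x, steps=2):
    """Approximate 1/sqrt(x) from a coarse estimate plus Newton-Raphson steps."""
    # Assumption: emulate a low-precision hardware estimate by rounding
    # the exact value to single precision.
    y = struct.unpack("f", struct.pack("f", 1.0 / math.sqrt(x)))[0]
    for _ in range(steps):
        # Newton-Raphson step for f(y) = 1/y^2 - x
        y = y * (1.5 - 0.5 * x * y * y)
    return y


def accumulate(local, visiting, pos, mass, acc):
    """Add the accelerations caused by the visiting particles to the local ones."""
    for i in local:
        for j in visiting:
            if i == j:
                continue  # skip self-interaction
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            inv = rsqrt_refined(d[0] * d[0] + d[1] * d[1] + d[2] * d[2])
            s = mass[j] * inv * inv * inv
            for k in range(3):
                acc[i][k] += s * d[k]


def ring_accelerations(pos, mass, num_devices):
    """Simulate the ring: each device keeps its block, visiting blocks rotate."""
    n = len(pos)
    local = [list(range(d, n, num_devices)) for d in range(num_devices)]
    visiting = [list(block) for block in local]
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for _ in range(num_devices):
        for d in range(num_devices):
            accumulate(local[d], visiting[d], pos, mass, acc)
        # Device d receives the block previously held by its neighbour d - 1.
        visiting = [visiting[(d - 1) % num_devices] for d in range(num_devices)]
    return acc


def direct_accelerations(pos, mass):
    """Reference: all-pairs evaluation without any partitioning."""
    everyone = list(range(len(pos)))
    acc = [[0.0, 0.0, 0.0] for _ in everyone]
    accumulate(everyone, everyone, pos, mass, acc)
    return acc
```

After as many steps as there are devices, every local block has met every visiting block once, so the result agrees with the direct all-pairs evaluation up to floating-point summation order. In a real multi-node setting, the rotation would be a neighbour-to-neighbour transfer that can overlap with the computation on the current block.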

While the CPU version achieves higher peak performance in double precision, the FPGA version reaches the best double-precision performance in several strong scaling scenarios, where fine-grained task-level parallelism and deterministic overlapping of communication with computation make it possible to achieve full utilization even for small problem sizes of around 1000 elements per device. An outlook on single-precision designs indicates that in this case the FPGA architecture can outperform the CPU even in terms of peak performance.

The full source code and documentation of the project are available at https://github.com/pc2/n-body-ring-solver.