PC2 at ISC'18

We are participating in ISC 2018 in Frankfurt, the largest European conference and trade show on high-performance computing. You can meet us at the Gauss-Allianz booth (M-230) on Tuesday from 10:00 to 11:00.

Dr. Tobias Kenter will present a talk on "Accelerating Modern Scientific Simulations with FPGAs".

Abstract

The Paderborn Center for Parallel Computing is investing in Field-Programmable Gate Arrays (FPGAs) as an accelerator technology for HPC. Complementing a recently signed procurement of a system with 32 latest-generation Stratix 10 FPGAs, development efforts go into FPGA acceleration libraries, infrastructure, and applications. First results of a collaboration project on accelerating modern scientific simulation codes were recently presented at one of the major scientific conferences of the FPGA community, the 26th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM). At ISC, we want to make the HPC community aware of these results and their context.

The exploration of FPGAs as accelerators for scientific simulations has so far mostly focused on small kernels of methods working on regular data structures, for example stencil computations for finite difference methods. Computational science, however, often employs more advanced methods that promise better stability, convergence, locality, and scaling. Unstructured meshes have been shown to represent computational domains of various shapes more effectively and accurately than regular grids. On unstructured meshes, the discontinuous Galerkin method preserves the ability to perform explicit local update operations for simulations in the time domain.

In our current work, we investigate FPGAs as the target platform for implementing the nodal discontinuous Galerkin method to compute time-domain solutions of Maxwell's equations on an unstructured mesh. In a first step, off-chip memory bandwidth demands are limited by maximizing data reuse and by fitting constant coefficients into registers and on-chip RAM blocks. The data layout and parallel access patterns are partitioned so that this local memory can reliably feed wide data paths with hundreds of parallel floating-point operations with data in every cycle.
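To illustrate why constant coefficients fit into on-chip storage, the per-element update of a nodal discontinuous Galerkin scheme can be sketched as a few small dense matrix products. The following Python/NumPy sketch uses assumed names and toy shapes (it is not the actual OpenCL kernel): the operator matrices depend only on the polynomial order, so on an FPGA they can stay in registers or RAM blocks and be reused for every element, which is exactly what limits off-chip traffic.

```python
import numpy as np

def dg_local_update(u, Dr, lift, flux):
    """Illustrative per-element DG update: volume term plus surface correction.

    u    : (n_nodes,)         field values at the element's nodes
    Dr   : (n_nodes, n_nodes) differentiation operator (constant per order)
    lift : (n_nodes, n_face)  lift operator mapping face fluxes to nodes
    flux : (n_face,)          numerical flux on the element's faces

    Dr and lift are the constant coefficients that, on the FPGA, are held
    in registers / on-chip RAM and reused across all elements.
    """
    return Dr @ u + lift @ flux

# Toy usage: a first-order element with 2 nodes and 2 face points.
Dr = np.array([[-0.5, 0.5],
               [-0.5, 0.5]])
lift = np.eye(2)
u = np.array([1.0, 3.0])
flux = np.array([0.0, 0.0])
rhs = dg_local_update(u, Dr, lift, flux)
```

Because the same `Dr` and `lift` apply to every element of a given order, only the per-element node values and fluxes need to stream in from off-chip memory.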

Despite the high arithmetic intensity obtained after customizing the local memory layout, some phases and variants of the application require a high sustained off-chip bandwidth on the order of tens of GB/s, close to the target platform's peak bandwidth. Achieving this bandwidth is particularly challenging due to indirect addressing into the unstructured mesh, which leads to deterministic yet irregular memory access patterns. We show that by decoupling off-chip memory accesses from the computations on the FPGA, high memory bandwidth can be sustained even for this indirect addressing pattern.
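The decoupling idea can be sketched in plain Python (illustrative only, with hypothetical names; on the FPGA the stages would be separate kernels connected by a hardware FIFO): a dedicated load stage resolves the irregular mesh indices and pushes the gathered values into a FIFO, so the compute stage consumes a purely sequential stream and never stalls on the indirect accesses.

```python
from collections import deque

def load_stage(values, neighbor_idx, fifo):
    # Indirect addressing: the index list is known ahead of time
    # (deterministic) but irregular, so the load stage alone deals
    # with the gather pattern.
    for idx in neighbor_idx:
        fifo.append(values[idx])

def compute_stage(fifo, n_faces):
    # Consumes a regular stream from the FIFO; as a stand-in for the
    # real flux computation, it just combines two neighbor values per face.
    return [fifo.popleft() + fifo.popleft() for _ in range(n_faces)]

values = [10.0, 20.0, 30.0, 40.0]
neighbor_idx = [2, 0, 3, 1, 1, 2]   # two neighbors per face, irregular order
fifo = deque()
load_stage(values, neighbor_idx, fifo)
result = compute_stage(fifo, 3)
```

Separating the stages this way lets the memory side issue accesses continuously at full bandwidth while the compute side runs at its own pace, which is the essence of sustaining bandwidth under indirect addressing.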

Using the Intel OpenCL SDK for FPGAs, we present implementations for different polynomial orders of the discontinuous Galerkin method. In different phases of the algorithm, either the computational or the bandwidth limits of the Arria 10 platform are almost reached, achieving more than 100 GFLOP/s in several kernels and thus outperforming a highly multithreaded CPU implementation by around 2x overall. With ongoing efforts towards making these codes maintainable for computational scientists and towards scaling the designs to the larger next-generation Stratix 10 FPGAs and across multiple compute nodes, we see this application domain as a highly promising field for FPGA acceleration in HPC.