FPGA Acceleration of Electromagnetic Simulations

Figure: FPGA cards installed in one of the Noctua servers.

Dr. Tobias Kenter, Universität Paderborn

Simulations of electromagnetic effects in novel materials and surfaces contribute a significant workload to our HPC systems. To improve the performance and energy efficiency of these workloads, we have investigated them as a target for FPGA acceleration, focusing on applications that operate on unstructured meshes and use the Discontinuous Galerkin method. In an extensive initial case study, we found that several characteristics of these applications allow them to benefit from the flexibility of FPGA architectures. A single FPGA of the previous Arria 10 generation outperforms a two-socket CPU node of the previous Oculus HPC cluster by around 1.5–2x at much lower power consumption. With multiple FPGA nodes available and connected by a high-speed interconnect, we scaled this application to up to 32 Stratix 10 FPGAs, communicating through the host via MPI. Using multiple FPGAs makes it possible to solve larger problems, or to solve smaller problems faster, but the parallel efficiency of this host-based approach still left room for improvement. The addition of a direct FPGA-to-FPGA interconnect provides the foundation for such improvement. Initial designs on 2 and 4 FPGAs resolved the efficiency problems and indicated significant headroom for more communication-intensive scenarios.
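To make the term "parallel efficiency" concrete, the following minimal sketch computes speedup and efficiency from run times using the usual definitions (speedup = single-device time / multi-device time, efficiency = speedup / number of devices). The function names and all timings are hypothetical placeholders chosen for illustration; they are not measurements from the Noctua system.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup of a parallel run relative to a single-device run."""
    return t_single / t_parallel


def parallel_efficiency(t_single: float, t_parallel: float, n_devices: int) -> float:
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup(t_single, t_parallel) / n_devices


# Hypothetical run times in seconds for the same problem on 1, 2 and 4 devices.
# These numbers are invented for illustration only.
timings = {1: 64.0, 2: 32.0, 4: 20.0}

efficiencies = {
    n: parallel_efficiency(timings[1], t, n) for n, t in timings.items()
}

for n, eff in efficiencies.items():
    print(f"{n} device(s): efficiency = {eff:.2f}")
```

An efficiency of 1.0 corresponds to ideal scaling. Values below 1.0 typically reflect communication or synchronization overhead, which is the kind of cost a direct FPGA-to-FPGA interconnect is meant to reduce.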

In the collaborative BMBF-funded project HighPerMeshes, we are working on generalizing these case studies to a wider range of applications on unstructured meshes by using a domain-specific language (DSL) embedded in C++. The manually optimized designs will be complemented by code generation for FPGAs and other targets. Scaling is achieved transparently to the application writer through distributed execution of suitable loop structures within the DSL.

 

References


OpenCL-based FPGA Design to Accelerate the Nodal Discontinuous Galerkin Method for Unstructured Meshes

T. Kenter, G. Mahale, S. Alhaddad, Y. Grynko, C. Schmitt, A. Afzal, F. Hannig, J. Förstner, C. Plessl, in: Proc. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2018

The exploration of FPGAs as accelerators for scientific simulations has so far mostly focused on small kernels of methods working on regular data structures, for example in the form of stencil computations for finite difference methods. In computational sciences, more advanced methods are often employed that promise better stability, convergence, locality and scaling. Unstructured meshes have been shown to be more effective and more accurate than regular grids in representing computation domains of various shapes. Using unstructured meshes, the discontinuous Galerkin method preserves the ability to perform explicit local update operations for simulations in the time domain. In this work, we investigate FPGAs as a target platform for an implementation of the nodal discontinuous Galerkin method to find time-domain solutions of Maxwell's equations in an unstructured mesh. When maximizing data reuse and fitting constant coefficients into suitably partitioned on-chip memory, high computational intensity allows us to implement and feed wide data paths with hundreds of floating-point operators. By decoupling off-chip memory accesses from the computations, high memory bandwidth can be sustained, even for the irregular access pattern required by parts of the application. Using the Intel/Altera OpenCL SDK for FPGAs, we present different implementation variants for different polynomial orders of the method. In different phases of the algorithm, either the computational or the bandwidth limits of the Arria 10 platform are almost reached, thus outperforming a highly multithreaded CPU implementation by around 2x.
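The idea of decoupling irregular memory accesses from the computation can be illustrated with a small two-stage pipeline. The sketch below is only a conceptual analogy in Python, not the OpenCL design from the paper: the names `gather_stage` and `compute_stage`, the tiny four-element mesh, and the simple "sum of jumps to neighbors" operation are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def gather_stage(
    values: List[float], neighbors: List[List[int]]
) -> Iterator[Tuple[float, List[float]]]:
    """Memory stage: performs the irregular, index-driven reads.

    For each element it yields the element's own value together with the
    values of its neighbors, looked up through the connectivity table.
    """
    for elem, nbr_ids in enumerate(neighbors):
        yield values[elem], [values[j] for j in nbr_ids]


def compute_stage(stream: Iterable[Tuple[float, List[float]]]) -> List[float]:
    """Compute stage: consumes a regular operand stream, no index lookups.

    As a stand-in for a flux-like term, it sums the differences between
    each neighbor value and the element's own value.
    """
    return [sum(nv - own for nv in nbr_vals) for own, nbr_vals in stream]


# Hypothetical unstructured connectivity: 4 elements, 2 neighbors each.
values = [1.0, 2.0, 4.0, 8.0]
neighbors = [[1, 2], [0, 3], [0, 3], [1, 2]]

jumps = compute_stage(gather_stage(values, neighbors))
print(jumps)  # [4.0, 5.0, 1.0, -10.0]
```

Because the compute stage only sees an ordered stream of operands, it can be kept busy as long as the gather stage delivers data. On an FPGA, the two stages would plausibly be separate hardware pipelines connected by a FIFO, so that memory latency in the first stage does not stall the arithmetic in the second.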

