
HighPerMeshes - Domain-specific programming and target-platform-aware compiler infrastructure for algorithms on unstructured grids

HighPerMeshes is a collaborative research project funded by the German Federal Ministry of Education and Research (BMBF). The project consortium comprises four funded partners and one associated partner and is coordinated by the Paderborn Center for Parallel Computing at Paderborn University.


The goal of HighPerMeshes is to develop a practically usable domain-specific framework for the efficient, parallel, and scalable implementation of iterative algorithms on unstructured grids. Time-domain simulation software in this category (e.g. TD-FEM, TD-DG, network simulations) has increasingly been used in scientific and industrial domains in recent years and complements comparable methods on structured grids. With the results of this project, developers can extend existing source code written in high-level languages with domain-specific library and language elements at moderate effort. The compiler infrastructure uses domain knowledge to enable performance-optimized, highly parallel execution on all relevant modern hardware architectures (multicore, manycore, GPU, FPGA), including heterogeneous systems. The project thus offers many HPC developers in science and industry an easy and sustainable path towards scalable use of the most efficient current and future target architectures.
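
As a minimal illustration of what an iterative algorithm on an unstructured grid looks like, the following Python sketch repeatedly replaces each free node value by the mean of its neighbours, where the neighbours are found through an explicit adjacency list instead of index arithmetic. It is not the HighPerMeshes interface: the function name `jacobi_smooth`, the five-node mesh, and the boundary values are assumptions made up for this example.

```python
def jacobi_smooth(values, neighbors, fixed, n_iters):
    """Jacobi-style sweep on a mesh given as per-node neighbour lists.

    values    -- initial value per node
    neighbors -- neighbors[i] lists the node indices adjacent to node i
    fixed     -- set of node indices whose values are never changed
    n_iters   -- number of sweeps over the mesh
    """
    current = list(values)
    for _ in range(n_iters):
        updated = list(current)
        for node, adjacent in enumerate(neighbors):
            if node in fixed or not adjacent:
                continue
            # Indirect (irregular) access: neighbours come from the mesh
            # connectivity, not from a fixed stencil.
            updated[node] = sum(current[k] for k in adjacent) / len(adjacent)
        current = updated
    return current


# Hypothetical mesh of three triangles: (0, 1, 3), (1, 3, 4), (1, 2, 4).
example_neighbors = [
    [1, 3],        # node 0
    [0, 2, 3, 4],  # node 1
    [1, 4],        # node 2
    [0, 1, 4],     # node 3
    [1, 2, 3],     # node 4
]
example_values = [0.0, 0.0, 3.0, 0.0, 0.0]
example_result = jacobi_smooth(example_values, example_neighbors, {0, 1, 2}, 50)
```

Every node update within one sweep reads only values from the previous sweep, so the updates are independent of each other. This independence is the kind of parallelism that a framework for such algorithms can exploit.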

Funding Identifier: 01IH16005
Project Runtime: 4/2017–3/2020


Paderborn University

Paderborn University (UPB) will be represented within the project by the Paderborn Center for Parallel Computing (PC²). PC² was founded in 1991 as an interdisciplinary institute of UPB and since then has established a reputation as a competence centre for parallel and distributed computing and innovative computer architecture. Two research groups whose heads are affiliated with PC² will contribute to the project.

The High-Performance IT Systems group headed by Prof. Christian Plessl contributes expertise in the area of design methods and application of FPGA-based custom computing technology to scientific computing problems.

The Theoretical Electrical Engineering Group headed by Prof. Jens Förstner contributes expertise in the area of the theoretical description and simulation of photonic and optoelectronic systems. The key expertise of the group is combining advanced material models with state-of-the-art numerical simulation methods for electromagnetic fields.

The role of UPB in the HighPerMeshes project is:

  • coordinating the project
  • providing a computational nanophotonics code and case studies
  • contributing to the design and validation of the domain-specific language
  • developing domain-specific code generation and optimization strategies for FPGAs
  • validating results on FPGAs


Friedrich-Alexander Universität Erlangen-Nürnberg

Friedrich-Alexander University Erlangen-Nürnberg (FAU) is a strong research university and one of the largest universities in Germany, with 40,000 students and 4,000 academic staff (including over 570 professors).

HighPerMeshes' project partner at FAU is the Chair of Hardware/Software Co-Design. Here, the Architecture and Compiler Design (ACD) group headed by Dr. Frank Hannig contributes expert knowledge in the fields of domain-specific computing and programming languages, as well as compilation techniques for parallel processor architectures including accelerators.

The role of FAU in the HighPerMeshes project is:

  • design of a domain-specific compiler front-end and intermediate representation
  • development of compiler infrastructure for transformation and optimization
  • development of OpenCL code generation backend


Fraunhofer ITWM

The Fraunhofer Institute for Industrial Mathematics ITWM in Kaiserslautern is one of currently 66 institutes of the Fraunhofer-Gesellschaft, an application-oriented research organization. Fraunhofer ITWM focuses on the development of mathematical applications for industry, technology, and the economy. Mathematical approaches to practical challenges are the institute's specific competence and complement knowledge in engineering and economics.

The Competence Center High Performance Computing contributes expertise in the development and application of innovative new software tools like the communication library GPI-2 for the efficient and highly scalable implementation of parallel software.


The role of ITWM in the HighPerMeshes project is:

  • development of code transformations for GPI-2 based interprocess communication and synchronization
  • development of problem partitioning strategies for large-scale clusters
  • providing task models for load balancing within a compute node and within the cluster


Konrad-Zuse-Zentrum für Informationstechnik Berlin

The Zuse Institute Berlin (ZIB) is a non-university research institute of the State of Berlin. In close interdisciplinary cooperation with the Berlin universities and scientific institutions, ZIB conducts research and development in the field of information technology with a particular focus on application-oriented algorithmic mathematics and practical computer science. ZIB also provides high-performance computing capacity as an accompanying service.

The Supercomputing department develops advanced programming tools and flexible runtime environments for complex application settings, targeting emerging technologies such as heterogeneous many-core systems.

The Numerical Mathematics department develops efficient modelling, simulation, and optimization tools and algorithms for challenging application problems from medicine, systems biology, molecular dynamics and nano-optical systems.

The role of ZIB in the HighPerMeshes project is:

  • providing the Kaskade FE toolbox reference code and case studies
  • contributing to the design and validation of the domain-specific language
  • developing domain-specific code generation and optimization strategies for multi- and many-cores (Xeon, Xeon Phi) and GPUs (OpenCL)
  • designing the intermediate representation for communication and task offloading
  • validating results on multi- and many-cores


Computer Simulation Technology AG (associated project partner)

CST is a market leader in delivering 3D electromagnetic (EM) field simulation tools through a global network of sales and support staff and representatives. CST develops CST STUDIO SUITE, a package of high-performance software for the simulation of EM fields in all frequency bands. Its growing success is based on a combination of leading edge technology, a user-friendly interface and knowledgeable support staff. CST solutions are used by market leaders in a diverse range of industries, including aerospace, automotive, defense, electronics, healthcare and telecommunications. CST is part of SIMULIA, a Dassault Systèmes brand. Further information about CST is available on the web at

The role of CST in the HighPerMeshes project is:

  • contributing to the definition of requirements for the domain-specific language and compilation framework
  • providing insights into customer demand and technological developments
  • validating results with CST Microwave Studio



Kaskade 7

Kaskade 7 is a finite element toolbox for the solution of stationary and transient systems of partial differential equations. The library is written in C++ and utilizes template meta-programming to achieve flexibility and efficiency. It is based to a large extent on the DUNE (Distributed and Unified Numerics Environment) core modules. The Kaskade 7 code is under active development by the "Computational Medicine" research group at ZIB.

Two application examples implemented with Kaskade 7, modeling cardiac electrophysiology and elastomechanics, serve as case studies in the HighPerMeshes project.




The HighPerMeshes framework for numerical algorithms on unstructured grids

S. Alhaddad, J. Förstner, S. Groth, D. Grünewald, Y. Grynko, F. Hannig, T. Kenter, F. Pfreundt, C. Plessl, M. Schotte, T. Steinke, J. Teich, M. Weiser, F. Wende, Concurrency and Computation: Practice and Experience (2021), pp. e6616


A Runtime System for Finite Element Methods in a Partitioned Global Address Space

S. Groth, D. Grünewald, J. Teich, F. Hannig, in: Proceedings of the 17th ACM International Conference on Computing Frontiers (CF '2020), ACM, 2020



OpenCL Implementation of Cannon's Matrix Multiplication Algorithm on Intel Stratix 10 FPGAs

P. Gorlani, T. Kenter, C. Plessl, in: Proceedings of the International Conference on Field-Programmable Technology (FPT), IEEE, 2019

Stratix 10 FPGA cards have good potential for the acceleration of HPC workloads since the Stratix 10 product line introduces devices with a large number of DSP and memory blocks. The high-level synthesis of OpenCL codes can play a fundamental role for FPGAs in HPC, because it allows implementing different designs with lower development effort compared to hand-optimized HDL. However, Stratix 10 cards are still hard to fully exploit using the Intel FPGA SDK for OpenCL. The implementation of designs with thousands of concurrent arithmetic operations often suffers from place and route problems that limit the maximum frequency or entirely prevent a successful synthesis. In order to overcome these issues for the implementation of the matrix multiplication, we formulate Cannon's matrix multiplication algorithm with regard to its efficient synthesis within the FPGA logic. We obtain a two-level block algorithm, where the lower-level sub-matrices are multiplied using our Cannon's algorithm implementation. Following this design approach with multiple compute units, we are able to get maximum frequencies close to and above 300 MHz with high utilization of DSP and memory blocks. This allows for performance results above 1 TeraFLOPS.
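
For readers unfamiliar with Cannon's algorithm, the data movement it relies on can be sketched in a few lines of sequential Python. This is only a textbook-style illustration under simplifying assumptions (one scalar per processing element, square matrices, a hypothetical function name `cannon_matmul`), not the OpenCL design or the two-level blocking described in the paper.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices using the shift pattern of Cannon's algorithm.

    Each position (i, j) plays the role of one processing element that holds
    one element of A, one element of B, and one accumulator of C.
    """
    n = len(a)
    # Initial skew: row i of A is rotated left by i positions,
    # column j of B is rotated up by j positions.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]

    for _ in range(n):
        # Every element multiplies its two local values and accumulates.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Nearest-neighbour communication only:
        # A moves one step left, B moves one step up (both cyclically).
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


product = cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# product == [[19, 22], [43, 50]]
```

The point of the scheme is that each element only ever exchanges data with its direct neighbours, which keeps connections short and regular when the grid is laid out in hardware.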

SYCL Code Generation for Multigrid Methods

S. Groth, C. Schmitt, J. Teich, F. Hannig, in: Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems - SCOPES '19, 2019

Multigrid methods are fast and scalable numerical solvers for partial differential equations (PDEs) that possess a large design space for implementing their algorithmic components. Code generation approaches allow formulating multigrid methods on a higher level of abstraction that can then be used to derive problem- and hardware-specific solutions. Since these problems have a considerable implementation variability, it is crucial to investigate a general mapping of core components in multigrid methods to the target software. With SYCL there exists a high-level C++ abstraction layer that is capable of targeting a multitude of architectures. We contribute a general way to map multigrid components to SYCL functionality and provide a performance evaluation for specific algorithmic components.


OpenCL-based FPGA Design to Accelerate the Nodal Discontinuous Galerkin Method for Unstructured Meshes

T. Kenter, G. Mahale, S. Alhaddad, Y. Grynko, C. Schmitt, A. Afzal, F. Hannig, J. Förstner, C. Plessl, in: Proc. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2018

The exploration of FPGAs as accelerators for scientific simulations has so far mostly been focused on small kernels of methods working on regular data structures, for example in the form of stencil computations for finite difference methods. In computational sciences, often more advanced methods are employed that promise better stability, convergence, locality and scaling. Unstructured meshes are shown to be more effective and more accurate, compared to regular grids, in representing computation domains of various shapes. Using unstructured meshes, the discontinuous Galerkin method preserves the ability to perform explicit local update operations for simulations in the time domain. In this work, we investigate FPGAs as target platform for an implementation of the nodal discontinuous Galerkin method to find time-domain solutions of Maxwell's equations in an unstructured mesh. When maximizing data reuse and fitting constant coefficients into suitably partitioned on-chip memory, high computational intensity allows us to implement and feed wide data paths with hundreds of floating point operators. By decoupling off-chip memory accesses from the computations, high memory bandwidth can be sustained, even for the irregular access pattern required by parts of the application. Using the Intel/Altera OpenCL SDK for FPGAs, we present different implementation variants for different polynomial orders of the method. In different phases of the algorithm, either computational or bandwidth limits of the Arria 10 platform are almost reached, thus outperforming a highly multithreaded CPU implementation by around 2x.

Solving Maxwell's Equations with Modern C++ and SYCL: A Case Study

A. Afzal, C. Schmitt, S. Alhaddad, Y. Grynko, J. Teich, J. Förstner, F. Hannig, in: Proceedings of the 29th Annual IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2018, pp. 49-56

In scientific computing, unstructured meshes are a crucial foundation for the simulation of real-world physical phenomena. Compared to regular grids, they allow resembling the computational domain with a much higher accuracy, which in turn leads to more efficient computations. There exists a wealth of supporting libraries and frameworks that aid programmers with the implementation of applications working on such grids, each built on top of existing parallelization technologies. However, many approaches require the programmer to introduce a different programming paradigm into their application or provide different variants of the code. SYCL is a new programming standard providing a remedy to this dilemma by building on standard C++17 with its so-called single-source approach: Programmers write standard C++ code and expose parallelism using C++17 keywords. The application is then transformed into a concrete implementation by the SYCL implementation. By encapsulating the OpenCL ecosystem, different SYCL implementations enable not only the programming of CPUs but also of heterogeneous platforms such as GPUs or other devices. For the first time, this paper showcases a SYCL-based solver for the nodal Discontinuous Galerkin method for Maxwell’s equations on unstructured meshes. We compare our solution to a previous C-based implementation with respect to programmability and performance on heterogeneous platforms.


Flexible FPGA design for FDTD using OpenCL

T. Kenter, J. Förstner, C. Plessl, in: Proc. Int. Conf. on Field Programmable Logic and Applications (FPL), IEEE, 2017

Compared to classical HDL designs, generating FPGA designs with high-level synthesis from an OpenCL specification promises easier exploration of different design alternatives and, through ready-to-use infrastructure and common abstractions for host and memory interfaces, easier portability between different FPGA families. In this work, we evaluate the extent of this promise. To this end, we present a parameterized FDTD implementation for photonic microcavity simulations. Our design can trade off different forms of parallelism and works for two independent OpenCL-based FPGA design flows. Hence, we can target FPGAs from different vendors and different FPGA families. We describe how we used pre-processor macros to achieve this flexibility and to work around different shortcomings of the current tools. Choosing the right design configurations, we are able to present two extremely competitive solutions for very different FPGA targets, reaching up to 172 GFLOPS sustained performance. With the portability and flexibility demonstrated, code developers not only avoid vendor lock-in, but can even make best use of real trade-offs between different architectures.
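
The core of any FDTD code is a leapfrog update in which electric and magnetic fields are advanced alternately from the spatial differences of each other. The following Python sketch shows that update in one dimension with normalized units. It is a generic textbook-style example, not the parameterized OpenCL design from the paper; the function name `fdtd_1d`, the Gaussian source, and all parameter values are assumptions chosen for illustration.

```python
import math


def fdtd_1d(n_cells, n_steps, courant=0.5):
    """Advance normalized E and H fields on a 1D staggered (Yee) grid.

    E nodes sit on cell boundaries, H nodes halfway between them.
    The two outermost E nodes are never updated and stay at zero,
    which acts as a perfectly conducting boundary.
    """
    ez = [0.0] * n_cells
    hy = [0.0] * (n_cells - 1)
    source_cell = n_cells // 2

    for step in range(n_steps):
        # Update H from the spatial difference of E.
        for i in range(n_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Update interior E nodes from the spatial difference of H.
        for i in range(1, n_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Soft source: add a Gaussian pulse in time at the centre node.
        ez[source_cell] += math.exp(-(((step - 30) / 10.0) ** 2))
    return ez, hy


ez_field, hy_field = fdtd_1d(n_cells=101, n_steps=60)
```

Both inner loops touch only direct neighbours and have no dependencies between iterations, which is why such stencil updates lend themselves to wide, pipelined data paths.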



Prof. Dr. Christian Plessl

Paderborn Center for Parallel Computing (PC²)

Christian Plessl
+49 5251 60-5399
+49 5251 60-1714
