

### OpenCL Design Flows for Intel and Xilinx FPGAs

Common Optimization Strategies, Design Patterns and Vendor-specific Differences

#### **Tobias Kenter**

Paderborn Center for Parallel Computing & Department of Computer Science Paderborn University, Germany



DATE, Monday Tutorials – 25 March 2019 – Florence, Italy

# Part 1 Common Design Patterns Key Differences

# Introduction

# My Agenda

• Our mission at PC<sup>2</sup>

Promote and Establish FPGAs as accelerators in HPC



Paderborn Center for Parallel Computing

• Objectives for applications and libraries

Achieve Throughput Close to Architectural Limits

Use OpenCL as Performance Portable FPGA Design Tool

• How far can those coexist?

# **My Background**

- Research interests / background
  - application acceleration
  - architecture exploration
  - compilation tools
    - tool user: OpenCL, Maxeler
    - compiler extensions: LLVM, Clang
- Experience with OpenCL FPGA tool chains since 2016
  - FDTD stencil computations with Xilinx and Intel
  - DG code with Intel
  - matrix multiplication with Intel and Xilinx
  - CNN, convolutions with Xilinx and Intel
  - FFT with Intel
  - image processing and generalization with Xilinx
  - elliptic curve method with Xilinx
  - external channel communication with Intel

### **My Biases**

- Currently more focus on Intel tools due to our hardware setup
- Xilinx SDAccel has an extensive GUI that I mostly ignore here
  - makefile + command line flow to quickly switch targets

| Cr Brern tro                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                                                                                                                                                                                                                                                       |                                                                                  | Quick Access             | 9   |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|--------------------------|-----|
| 🖕 Project Explorer 😫 👘 🗖                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | ≪vscale1_vec ಔ @ kml_vadd.cl @ vscale1_vec.cl D macros.h                                                                                                                                                                                                              |                                                                                  |                          | - 1 |
| - 🕸 🖮 📍                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | SDx Application Project Settings                                                                                                                                                                                                                                      | Active build con                                                                 | figuration: Emulation-SW | 4   |
| Solution States Stat | General                                                                                                                                                                                                                                                               | Options                                                                          |                          |     |
| <ul> <li>Èsdx</li> <li>ÈEmulation-HW</li> <li>ÈEmulation-SW</li> <li>Èsrc</li> <li>Èsrn[vadd.cl</li> <li>Èsmacros.h</li> <li>Èsvadd.cpp</li> <li>Èvadd.h</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | Project name:       vscale1_vec         Project flow:       SDAccel         Platform:       alpha-data_adm-pcie-8k5_dynamic_5_0         Runtime:       OpenCL         System configuration:       Linux         Number of devices:       1         Hardware Functions | Target:<br>Host debug:<br>Kernel debug:<br>Report type:<br>Hun warm opt materian | Software Emulation       | × × |
| <ul> <li>✓ scale1_vec.cl</li> <li>✓ scl.cpp</li> <li>✓ scl.h</li> <li>✓ Export_Compliance_Notice.md</li> <li>✓ System</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | Name Compute Units Port Data Width                                                                                                                                                                                                                                    | Max Mem                                                                          |                          |     |
| Assistant 🛙 🖛                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                                                                                                                                                                                                                                                       | P D Tarnet Connect                                                               | i 📮 Emulation Cons 🛙     |     |
| <pre>vscale1_vec [OpenCL] C Emulation-SW [Software Emulation C binary_container_1 Vscale [OpenCL] C Emulation-HW [Hardware Emulation System [Hardware]</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | SDx Build Console [vscale1_vec, Emulation-SW]<br>Attempting to get a License: ap openci<br>INFO: [XOCC 17-1540] The version limit for your license is '2018                                                                                                           | 3.12<br>5 X04                                                                    |                          | 0   |

### **Outline Part 1**

- Overview FPGAs and Goals
- OpenCL Overview
- Example 1: Vector Scale
  - compilation
  - reports
  - performance analysis
- Vector Scale Variations
  - automatic unrolling
- Example 2: SAXPY
  - blockwise design pattern
- Outer Loop Pipelining
- Streaming Kernels

# **Overview: FPGAs and Goals**



- Field-programmable Gate Array
  - Gates
    - fundamental building blocks are logic gates



- in all current FPGAs: LUTs (Lookup Tables)
- truth table stored in SRAM
- Array
  - many gates (LUTs) in a regular 2D structure
- Field-programmable
  - configuration can be changed "in the field", many times
  - in practice: currently takes up to few 100 ms
  - faster alternatives possible

### **FPGA – Basic Structures**



## **FPGA – Configuration and Application Domains**

- configuration
  - all FPGAs components are programmable (logic cell, DSP, IO-block functions, routing, ...)
  - configuration data (bitstream) is stored in SRAM cells
  - bitstream loaded from non-volatile memory at boot time
  - some devices can be re-configured at runtime
- application domains
  - glue logic
  - rapid prototyping, emulation
  - embedded systems
    - configurable system-on-chip
    - ASIC replacement
  - reconfigurable computing
    - computing without CPUs
    - combine processor-like programmability with ASIC-like performance
    - recent hot topic: CNNs with customized precision

# **FPGA Technology Today**

#### Example: Intel Stratix 10 GX2800 FPGA

- > 900000 configurable logic blocks
  - up to 4 Boolean functions of 8 inputs
- 5760 hardened arithmetic units (DSP)
  - fixed point and IEEE 754 SP floating-point
- 11721 independent SRAM blocks
  - width/depth/ports highly configurable
- integrated DDR4 memory controllers
- up to 96 serial transceivers, up to 28.3 Gbps
- typically about 300-600MHz
- power consumption 50-225W



# **Classical FPGA Development**



- Hardware design is traditionally done by modeling the system in a hardware description language (e.g. VHDL or Verilog)
- An FPGA synthesis tool (compiler) generates an netlist of basic logic elements,

- which is then translated (mapped) to components available on the FPGA,
- which are placed on the chip,
- and the connecting signals are routed through the interconnection network.
- The resulting configuration data (bitstream) for programing the FPGA is created

### **HDL Synthesis**



### **Technology Mapping**



### **Place & Route**





### **Modern FPGA Development**



}

for (int i = 0; i < SIZE; i++){
 c[i] = a[i] \* b[i];</pre>

### **Execution on CPU vs on FPGA**

```
for (int i = 0; i < SIZE; i++){
    c[i] = a[i] * b[i];
}</pre>
```

#### **Execution on CPU**

• Series of instructions

```
loop:
    ld %a $a(%i)
    ld %b $b(%i)
    %c = %a * %b
    st $c(%i) %c
    %i = %i + 1
    branch i<SIZE: loop</pre>
```

#### **Execution on FPGA**

• Spatial data path + control



# **Pipelining**

- Use functional units every cycle
- Initiation intervall II
  - describes pipeline fill rate

| ld \$a(%i) | ld \$b(%i)                                                                                                                                                                                        |        |
|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|
|            |                                                                                                                                                                                                   |        |
| %a * %b    | 이이 지하지 이 가지 않고<br>하지 아이가 되어 2 세계 1<br>지하지 아이가 되어 2 세계 1<br>지하지 아이가 아이가 있는 것 같이<br>하지 아이가 아이가 있는 것 같이 같이 하지 않는 것 같이 하는 것 같이 않는 것 같이 있는 것 않 |        |
|            |                                                                                                                                                                                                   |        |
| st \$c(%i) |                                                                                                                                                                                                   |        |
|            |                                                                                                                                                                                                   | enable |
|            | i <size< td=""><td></td></size<>                                                                                                                                                                  |        |
|            |                                                                                                                                                                                                   |        |



#### loop iteration / work item

## **High Level Design Goals**

use (expensive) arithmetic units (almost) every cycle

have scaling designs up to resource or bandwidth limits



- This loop may use
  - 2 memory blocks for inputs
  - 1 DSP for multiplication
  - 1 memory block for output
  - 280 logic cells for counter and control
- Could create 3907 instances of this block
  - critical resource: 11721 memory blocks / 3
- or 3906 different blocks of this size
- or ...

# **OpenCL** Overview

# **OpenCL Standard and Platform Model**



- OpenCL 1.0 standard + selected features

https://www.khronos.org/registry/OpenCL/

### **Host Code**

- Detect a platform ( = runtime library, driver here )
- Detect devices
- Allocate devices ( = create context )
- Create and build program (on FPGA platforms = load and configure bitstreams)
- Create kernel objects
- Create command queues
- Allocate device memory
- Transfer data
- Setup kernel arguments
- Call kernels
- Synchronize

### **Kernel Code**

- Specify accelerator functionality in C syntax
- Special language features
  - function qualifier ( \_\_kernel )
  - vector data types and operations
  - address space qualifiers
- NDRangeKernel concept
  - express data parallel execution of work items and work groups
  - get\_global\_id
  - get\_local\_id
  - supported in FPGA platforms, but often not the most efficient method

# **Used Intel OpenCL Platform**

- Intel FPGA SDK for OpenCL 18.1.1
- <u>https://www.intel.com/content/www/us/en/programmable/products/design-software/embedded-software-developers/opencl/support.html</u>
  - Release Notes
  - Getting Started Guide
  - Programming Guide
  - Best Practices Guide
  - ...
  - Download the version specific PDFs!
- Target board: Bittware 520N
- Target FPGA: Intel Stratix 10 GX 2800
  - 933120 ALMs
  - 11721 M20k memory blocks (20kb each)
  - 5760 DSP blocks, 1x 32 bit IEEE 754 SP floating-point or 2x 18x19 multipiers



# **Used Xilinx OpenCL Platform**

- Xilinx SDx 2018.3 SDAccel
- <u>https://www.xilinx.com/html\_docs/xilinx2018\_3/sdaccel\_doc/index.html</u>
- https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html#documentation
  - Release Notes, Installation, and Licensing Guide
  - Environment User Guide
  - SDAccel Environment Programmers Guide
  - SDAccel Environment Profiling and Optimization Guide
  - SDx Pragma Reference Guide
  - ...
  - Download the version specific PDFs!
- Target board: Alpha Data ADM-PCIE-8k5
- Target FPGA: Xilinx Kintex Ultrascale KU115-
  - 663360 CLB LUTs
  - 2160 BRAM blocks, 36kb each
  - 5520 DSP slices, 27x18 multipliers



### **Note on Xilinx Tool Scope**

- SDx combines GUI tool and command line compiler for
  - SoCs (Zynq) and discrete target platforms (PCle)
    - SoCs
      - enables shared memory and CPU-FPGA interactions beyond OpenCL platform model
      - uses SDSoC license
    - discrete platforms
      - use BSP following OpenCL platform model
      - use SDAccel license
  - OpenCL and C/C++ kernel specification
    - OpenCL
      - attributes can be used to guide high-level synthesis step
    - C/C++
      - HLS pragmas are used to guide high-level synthesis step (more available)
      - fixed kernel interface for discrete target platforms
- Scope in this talk: discrete target platforms with OpenCL

### **Outline Part 1**

- Overview FPGAs and Goals
- OpenCL Overview
- Example 1: Vector Scale
  - compilation
  - reports
  - performance analysis
- Vector Scale Variations
  - automatic unrolling
- Example 2: SAXPY
  - blockwise design pattern
- Outer Loop Pipelining
- Streaming Kernels

# **Example 1: vector scale**

### **Vector Scale Single-Work Item Kernel**

• Examples and essential reports of both tools available at

https://github.com/kenter/OpenCL-FPGA-examples

```
kernel
void vscale(
global float16 *restrict x,
 global float16 *restrict y,
const float a,
const int size16)
{
    vscale:
    for(int i=0; i<size16; i++){</pre>
        y[i] = x[i]*a;
    }
```

## **Pipelining: Expectation**

```
kernel
void vscale(
global float8 *restrict x,
global float8 *restrict y,
const float a,
const int size8)
ł
    vscale:
    for(int i=0; i<size8; i++){</pre>
        y[i] = x[i]*a;
    }
```



### **Compiling with Intel FPGA SDK for OpenCL**

 aoc -rtl -report -v -board=p520\_max\_sg2801 -fp-relaxed -fpc device/vscale1\_vec.cl

[[kenter@fe-1 examples]\$ make reportIntel-vscale1\_vec aoc -rtl -report -v -board=p520\_max\_sg280l -fp-relaxed -fpc device/vscale1\_vec.cl aoc: Environment checks are completed successfully. aoc: Cached files in /var/tmp/aocl/ may be used to reduce compilation time aoc: Selected target board p520\_max\_sg280l aoc: Running OpenCL parser.... aoc: OpenCL parser completed successfully. aoc: Linking Object files.... aoc: Optimizing and doing static analysis of code... aoc: Linking with IP library ... Checking if memory usage is larger than 100%

| Estimated Resource Usage Summary | +       |  |
|----------------------------------|---------|--|
| Resource                         | + Usage |  |
| Logic utilization                | ; 69%   |  |
| ALUTS                            | ; 36%   |  |
| Dedicated logic registers        | ; 36%   |  |
| Memory blocks                    | ; 32%   |  |
| DSP blocks                       | : 29%   |  |



# Intel Report (1) Summary

- reports/report.html
- Summary
  - 1 Single work-item kernel
  - high resource includes BSP

| Info                                                                                                                            |                                                                  |                                                 |                                       |                  |                           |
|---------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|-------------------------------------------------|---------------------------------------|------------------|---------------------------|
| Project Name                                                                                                                    | vscale1_v                                                        | ec                                              |                                       |                  |                           |
| Target Family, Device, Bo                                                                                                       | oard Stratix 10                                                  | , 1SG280LU3F50E1                                | /GS1, nalla_pcie:                     | p520_max_sg2     | BOL                       |
| AOC Version                                                                                                                     | 18.1.1 Bui                                                       | ild 263                                         |                                       |                  |                           |
| Quartus Version                                                                                                                 | 18.1.1 Bui                                                       | ld 263 Pro                                      |                                       |                  |                           |
| Command                                                                                                                         |                                                                  | eport -v -board=p52<br>cale1_vec.cl             | 0_max_sg280l -                        | fp-relaxed -fpc  |                           |
| Reports Generated At                                                                                                            | Fri Mar 22                                                       | 13:52:36 2019                                   |                                       |                  |                           |
| Quartus Fit Summa                                                                                                               | ary                                                              |                                                 |                                       |                  |                           |
|                                                                                                                                 |                                                                  | C                                               | va information                        |                  |                           |
| Run Quartus compile to                                                                                                          | o populate this sectio                                           | n. See details for mo                           | ore information.                      |                  |                           |
|                                                                                                                                 |                                                                  | n, See details for mo                           |                                       | HyperFlex        |                           |
| Kernel Summary<br>Kernel Name Kernel 7                                                                                          | Type Autorur                                                     | n Workgroup Size                                | # Compute Ur                          | Control O        | x<br>ptimizatior          |
| Kernel Summary<br>Kernel Name Kernel 7                                                                                          |                                                                  |                                                 |                                       | aite             |                           |
| Kernel Summary<br>Kernel Name Kernel T<br>vscale Single w                                                                       | Type Autorur<br>vork-item No                                     | n Workgroup Size                                | # Compute Ur                          | Control O        |                           |
| Kernel Summary<br>Kernel Name Kernel T<br>vscale Single w                                                                       | Type Autorur<br>vork-item No                                     | n Workgroup Size                                | # Compute Ur                          | Control O        |                           |
| Kernel Summary<br>Kernel Name Kernel 7<br>vscale Single w<br>Estimated Resource                                                 | Type Autorur<br>vork-item No<br>e Usage                          | 1 Workgroup Size                                | # Compute Ur<br>1                     | On               | ptimizatior               |
| Kernel Summary<br>Kernel Name Kernel T<br>vscale Single w<br>Estimated Resource<br>Kernel Name                                  | Type Autorur<br>vork-item No<br>e Usage<br>ALUTs                 | 1 Workgroup Size<br>1,1,1<br>FFs                | # Compute Ur<br>1<br>RAMs             | On<br>On<br>DSPs | ptimizatior<br>MLABs      |
| Kernel Summary<br>Kernel Name Kernel T<br>vscale Single w<br>Estimated Resource<br>Kernel Name<br>vscale                        | Type Autorur<br>vork-item No<br>e Usage<br>ALUTs<br>3846         | Workgroup Size<br>1,1,1<br>FFs<br>9482          | # Compute Ur<br>1<br>RAMs<br>46       | On<br>On<br>DSPs | ptimization<br>MLABs<br>8 |
| Kernel Summary<br>Kernel Name Kernel T<br>vscale Single w<br>Estimated Resource<br>Kernel Name<br>vscale<br>Global Interconnect | Type Autorur<br>vork-item No<br>e Usage<br>ALUTs<br>3846<br>7490 | Workgroup Size<br>1,1,1<br>FFs<br>9482<br>15614 | # Compute Ur<br>1<br>RAMs<br>46<br>52 | DSPs<br>0        | MLABs<br>8<br>0           |

| sca  | e1_vec.cl •                                               | ×     |
|------|-----------------------------------------------------------|-------|
| 1    | <pre>#include "macros.h"</pre>                            |       |
| 2    | · · · · · · · · · · · · · · · · · · ·                     |       |
| 3    | kernel                                                    |       |
| 4    | void vscale(                                              |       |
| 5    | global float16 *restrict x,                               |       |
| 6    | global float16 *restrict y,                               |       |
| 7    | const float a,                                            |       |
| 8    | const int size)                                           |       |
| 9 -  |                                                           |       |
| 10   | vscale:                                                   |       |
| 11   | <pre>//attribute((xcl_pipeline_loop(</pre>                | (1))) |
| 12 - | <pre>for(int i=0; i<size; i++){<="" pre=""></size;></pre> |       |
| 13   | y[i] = x[i]*a;                                            |       |
| 14   | }                                                         |       |
| 15   | }                                                         |       |
| 16   |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |
|      |                                                           |       |

# Intel Report (2) Loop Analysis

• Loop analysis

| Loops analysis                    | lysis     |    | Show full  | y unrolled loops        |
|-----------------------------------|-----------|----|------------|-------------------------|
|                                   | Pipelined | н  | Bottleneck | Details                 |
| Kernel: vscale (vscale1_vec.cl:4) |           |    |            | Single work-item ex     |
| vscale.B2 (vscale1_vec.cl:12)     | Yes       | ~1 | n/a        | II is an approximation. |



#### vscale.B2:

- Loop orchestration compiler optimization is enabled.
- Il is an approximation due to the following stallable instructions:
  - Load Operation (vscale1\_vec.cl: 13)
  - Store Operation (vscale1\_vec.cl: 13)



### Intel Report (3) System viewer

| <ul> <li>System viewer</li> </ul> | • |
|-----------------------------------|---|
|-----------------------------------|---|

- selecting the loop denoted as vsacle.B2

| Details                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |     |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| vscale.B2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |     |
| Latency                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | 656 |
| II Contraction of the second sec | 1   |
| Subloops                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | No  |
| Pipelined                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Yes |
| Fmax Bottlenecks                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | No  |
| Loop Info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |     |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |     |



# **Pipeline Following System Viewer**



## Intel Report (4) Area Analysis

| Notation file:X > file:Y indica                | ates a function o | all on line X w | as inlined usi | ing code on l | line Y. |                                                  | 1 #include "macros.h"<br>2<br>3kernel                           |
|------------------------------------------------|-------------------|-----------------|----------------|---------------|---------|--------------------------------------------------|-----------------------------------------------------------------|
|                                                | ALUTs             | FFs             | RAMs           | DSPs          | MLABs   | Details                                          | <pre>4 void vscale( 5global float16 *restrict x,</pre>          |
| Static Partition                               | 480580 (35%)      | 961160 (35%)    | 2766 (31%)     | 1292 (29%)    | 0 (0%)  |                                                  | <pre>6global float16 *restrict y,<br/>7 const float a,</pre>    |
| Kernel System                                  | 11338 (1%)        | 25167 (1%)      | 100 (1%)       | 16 (0%)       | 8 (0%)  |                                                  | <pre>8 const int size) 9 - {</pre>                              |
| Global interconnect                            | 7490              | 15614           | 52             | 0             | 0       | For 1 global load a                              | <pre>10 vscale:<br/>11 //attribute((xcl_pipeline_loop(1))</pre> |
| System description ROM                         | 2                 | 71              | 2              | 0             | 0       | Contains informati                               | 12 * for(int i=0; i <size; i++){<br="">13</size;>               |
| ♥ vscale                                       | 3846 (0%)         | 9482 (0%)       | 46 (1%)        | 16 (0%)       | B (0%)  | 1 compute unit.                                  | 14 }<br>15 }                                                    |
| Function overhead                              | 1463              | 1467            | 0              | 0             | 6       | Kernel dispatch lo                               | 16                                                              |
| Private Variable:<br>- 'i' (vscale1_vec.cl:12) | 32                | 130             | 0              | 0             | 0       | Register,<br>1 reg, 32 width,<br>1 reg, 33 width |                                                                 |
| > vscale.B0                                    | 191 (0%)          | 144 (0%)        | 0 (0%)         | 0 (0%)        | 1 (0%)  |                                                  |                                                                 |
| ♥ vscale.B2                                    | 2160 (0%)         | 7741 (0%)       | 46 (1%)        | 16 (0%)       | 1 (0%)  |                                                  |                                                                 |
| Cluster logic                                  | 418               | 722             | 16             | 0             | 1       | Logic required to e                              |                                                                 |
| > State                                        | 34                | 697             | 1              | 0             | 0       | Live values and co                               |                                                                 |
| > Feedback                                     | 65                | 41              | 0              | 0             | 0       | Loop-carried depe                                |                                                                 |
| ✓ Computation                                  | 1643              | 6281            | 29             | 16            | 0       |                                                  |                                                                 |
| vscale1_vec.cl:12                              | 105               | 0               | 0              | 0             | 0       |                                                  |                                                                 |
| > vscale1_vec.cl:13                            | 1538              | 6281            | 29             | 16            | 0       |                                                  |                                                                 |

• 16 DSPs for 16 float multiplications

## **Design Review (1)**

#### use (expensive) arithmetic units (almost) every cycle

have scaling designs up to resource or bandwidth limits

- Initiation Intervall II = 1
- Latency L = 656
- Iterations N
- Time in Cycles C = N x II + L

| Ν      | С      | Efficiency |
|--------|--------|------------|
| 10     | 666    | 1.5%       |
| 100    | 756    | 13.2%      |
| 1000   | 1656   | 60.4%      |
| 10000  | 10656  | 93.8%      |
| 100000 | 100656 | 99.3%      |

## **Design Review (2)**

#### use (expensive) arithmetic units (almost) every cycle

#### have scaling designs up to resource or bandwidth limits

- Read and write 16 floats (32 bit) per cycle
  - $-2 \times 512$  bit = 2 x 64 byte per cycle
- Peak bandwidth of board
  - 4 x (64+8) bit x 2400 MHz (physical interface)
  - 4 x 512 bit x 300 MHz (OpenCL interface)
  - can unroll 2x more or have 2 compute units
- Kernel can run at > 300 MHz (350-400 MHz for this type of simple kernel)
  - 2x unrolled version mildly bandwidth limited
- Main problem: low arithmetic intensity
  - only 16 of 5760 DSPs used 0.28% utilization 0.55% with another 2x unrolling



#### **Compiling with Xilinx SDx (SDAccel)**

xocc -g -R 2 -s --platform=alpha-data\_adm-pcie-8k5\_dynamic\_5\_0

 --memory\_port\_data\_width all:512 -c device/vscale1\_vec.cl -o
 vscale1 vec.xo

[kenter@fe-1 examples]\$ make reportXilinx-vscale1\_vec make: aocl: Command not found make: aocl: Command not found xocc -g -R 2 -s --platform=alpha-data\_adm-pcie-8k5\_dynamic\_5\_0 --memory\_port\_data\_width all:512 -c device/vscale1\_vec.cl -o vscale1\_vec.xo

\*\*\*\*\*\* xocc v2018.3 (64-bit)

\*\*\*\* SW Build 2405991 on Thu Dec 6 23:36:41 MST 2018

\*\* Copyright 1986-2018 Xilinx, Inc. All Rights Reserved.

Attempting to get a license: ap\_opencl

INFO: [XOCC 60-1306] Additional information associated with this xocc compile can be found at: Reports: /upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/\_x/reports/vscale1\_vec Log files: /upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/\_x/logs/vscale1\_vec INFO: [XOCC 60-585] Compiling for hardware target Running SDx Rule Check Server on port:36586 INFO: [XOCC 60-895] Target platform: /opt/Xilinx/SDx/2018.3/platforms/alpha-data\_adm-pcie-8k5\_dynamic\_5\_0/alpha-data\_adm-pcie-8k5\_dynamic\_5\_0.xpfm INFO: [XOCC 60-423] Target device: alpha-data\_adm-pcie-8k5\_dynamic\_5\_0

===>The following messages were generated while performing high-level synthesis for kernel: vscale Log file: /upb/scratch/departments/pc2/groups/pc2-mita rbeiter/kenter/gitlab/2019-date-tutorial/examples/\_x/vscale1\_vec/vscale/vivado\_hls.log :

INFO: [XOCC 204-6] Pipelining loop 'vscale'. INFO: [XOCC 204-6] Pipelining result : Target II = 1, Final II = 1, Depth = 10. INFO: [XOCC 60-50] Finished kernel compilation

INFO: [XOCC 60-244] Generating system estimate report...

INFO: [XOCC 60-1092] Generated system estimate report: /upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/\_x/re ports/vscale1\_vec/system\_estimate\_vscale1\_vec.xtxt

## Xilinx Report (1) Vivado HLS Log

#### • Vivado HLS log

38 INFO: [HLS 214-115] Burst read of variable length and width 512 has been inferred on 'gmem' (/upb/scratch/ departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/device/vscale1\_vec.cl:12:5)
39 INFO: [HLS 214-115] Burst write of variable length and width 512 has been inferred on 'gmem' (/upb/scratch/ departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/device/vscale1\_vec.cl:12:5)

#### • Similar 512 bit burst loads / stores

| 53 INFO: [HLS 200-10]                                                        |           |
|------------------------------------------------------------------------------|-----------|
| 54 INF0: [SCHED 204–11] Starting scheduling                                  |           |
| 55 INFO: [SCHED 204-61] Pipelining loop 'vscale'.                            |           |
| 56 INFO: [SCHED 204-61] Pipelining result : Target II = 1, Final II = 1, Dep | oth = 10. |

- II = 1
- Depth = 10 vs. Latency = 656 in Intel Design
  - different terminology, different treatment of off-chip memory latency
  - latency is still there (will see in next example) estimate of loop efficiency harder

### Xilinx Report (2) System Estimate

#### • System estimate

| 35 | Area Informat | ion         |             |      |      |     |      |
|----|---------------|-------------|-------------|------|------|-----|------|
| 36 | Compute Unit  | Kernel Name | Module Name | FF   | LUT  | DSP | BRAM |
| 37 |               |             |             |      |      |     |      |
| 38 | vscale_1      | vscale      | vscale      | 6213 | 4806 | 48  | 30   |
| 39 |               |             |             |      |      |     |      |

- 3 DSPs (+ some logic) per MUL
  - need to combine 27x18 multipliers
  - Vivado HLS provides some control over balance between DSPs and logic
  - SDx with OpenCL inputs not directly
  - short multiplications can be done with single DSP

```
xocc -g -R 2 -s --platform=alpha-data_adm-pcie-8k5_dynamic_5_0 --memory_port_data_width all:256
-c device/vscale5_short.cl -o vscale5_short.xo
```

| 35 | Area Informat | 10n         |             |      |      |     |      |
|----|---------------|-------------|-------------|------|------|-----|------|
| 36 | Compute Unit  | Kernel Name | Module Name | FF   | LUT  | DSP | BRAM |
| 37 |               |             |             |      |      |     |      |
| 38 | vscale_1      | vscale      | vscale      | 1982 | 2273 | 16  | 16   |
| 39 |               |             |             |      |      |     |      |

#### **Vector Scale Summary**

- 2 very similar designs
- Found pipelining in reports
- Found 512 bit wide burst-coalesced loads / stores in reports
- Found 16 parallel floating point MULs indirectly in resource estimate

use (expensive) arithmetic units (almost) every cycle

have scaling designs up to resource or **bandwidth** limits

• It's much easier to reach bandwidth limits than compute resource limits

## **Vector Scale Variations**

#### **Vector Scale with Unrolling**

```
kernel
void vscale(
  global float *restrict x,
  global float *restrict y,
const float a,
const int size)
{
     attribute ((opencl_unroll_hint(16)))
    for(int i=0; i<size; i++){</pre>
        y[i] = x[i]*a;
    }
```

More typical alternative for Intel compiler #pragma unroll 16

- <u>https://github.com/kenter/OpenCL-FPGA-examples</u> -> vscale2\_u.cl
  - report files in reportIntel and reportXilinx
  - What has changed in contrast to vscale1\_vec (throughput, resources, ...)?

## **Intel Report: Area Analysis**

| Area analysis of system<br>(area utilization values are e<br>Notation <i>file:X</i> > <i>file:Y</i> indica |              | call on line X w | as inlined usi | Area analysis of system<br>(area utilization values are<br>Notation <i>file:X</i> > <i>file:Y</i> indi | and the second of the second second | call on line X v | was inlined u | sing code on |         | Collapse All                 |
|------------------------------------------------------------------------------------------------------------|--------------|------------------|----------------|--------------------------------------------------------------------------------------------------------|-------------------------------------|------------------|---------------|--------------|---------|------------------------------|
|                                                                                                            | ALUTs        | FFs              | RAMs           |                                                                                                        | ALUTs                               | FFs              | RAMs          | DSPs         | MLABs   | Details                      |
| Static Partition                                                                                           | 480580 (35%) | 961160 (35%)     | 2766 (31%)     | > Static Partition                                                                                     | 480580 (35%)                        | 961160 (35%)     | 2766 (31%)    | 1292 (29%)   | 0 (0%)  |                              |
| ✓ Kernel System                                                                                            | 11338 (1%)   | 25167 (1%)       | 100 (1%)       | ♥ Kernel System                                                                                        | 12134 (1%)                          | 25559 (1%)       | 105 (1%)      | 16 (0%)      | 12 (0%) |                              |
| Global interconnect                                                                                        | 7490         | 15614            | 52             | Global interconnect                                                                                    | 7490                                | 15614            | 52            | 0            | 0       | For 1 global load a          |
| System description ROM                                                                                     | 2            | 71               | 2              | System description ROM                                                                                 | 2                                   | 71               | 2             | 0            | 0       | Contains informati           |
| ♥ vscale                                                                                                   | 3846 (0%)    | 9482 (0%)        | 46 (1%)        | ♥ vscale                                                                                               | 4642 (0%)                           | 9874 (0%)        | 51 (1%)       | 16 (0%)      | 12 (0%) | 1 compute unit.              |
| Function overhead                                                                                          | 1463         | 1467             | 0              | Function overhead                                                                                      | 1463                                | 1467             | 0             | 0            | 6       | Kernel dispatch lo           |
| Private Variable:<br>- 'i' (vscale1_vec.cl:12)                                                             | 32           | 130              | 0              | Private Variable:<br>- 'i' (vscale2_u.cl:12)                                                           | 24                                  | 64               | 0             | 0            | 0       | Register,<br>1 reg, 32 width |
| > vscale.B0                                                                                                | 191 (0%)     | 144 (0%)         | 0 (0%)         | > vscale.B0                                                                                            | 59 (0%)                             | 75 (0%)          | 0 (0%)        | 0 (0%)       | 1 (0%)  |                              |
| ♥ vscale.B2                                                                                                | 2160 (0%)    | 7741 (0%)        | 46 (1%)        | ♥ vscale.B1                                                                                            | 3096 (0%)                           | 8268 (0%)        | 51 (1%)       | 16 (0%)      | 5 (0%)  |                              |
| Cluster logic                                                                                              | 418          | 722              | 16             | Cluster logic                                                                                          | 497                                 | 865              | 20            | 0            | 1       | Logic required to e          |
| > State                                                                                                    | 34           | 697              | 1              | > State                                                                                                | 92                                  | 972              | 2             | 0            | 4       | Live values and co           |
| > Feedback                                                                                                 | 65           | 41               | 0              | > Feedback                                                                                             | 48                                  | 41               | 0             | 0            | 0       | Loop-carried depe            |
| ✓ Computation                                                                                              | 1643         | 6281             | 29             | ✓ Computation                                                                                          | 2459                                | 6390             | 29            | 16           | 0       |                              |
| > vscale1_vec.cl:12                                                                                        | 105          | 0                | 0              | > No Source Line                                                                                       | 256                                 | 64               | 0             | 0            | 0       |                              |
| > vscale1_vec.cl:13                                                                                        | 1538         | 6281             | 29             | > vscale2_u.cl:12                                                                                      | 664                                 | 44               | 0             | 0            | 0       |                              |
|                                                                                                            |              |                  |                | vscale2_u.cl:13                                                                                        | 1539                                | 6282             | 29            | 16           | 0       |                              |

• Same functionality, increased resources, predication for loop epilogue

new

#### **Xilinx Report**

- Unroll hint ignored
  - Xilinx compiler doesn't generate automatic epilogues
  - no explicit message
  - area report reveals it

# 35Area Information36Compute UnitKernel NameModule NameFFLUTDSPBRAM37-------------------------------38vscale\_1vscalevscale3551708233239-------------------------------

#### and pipelining result?

```
INF0: [XOCC 204-61] Pipelining loop 'Loop 1'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 150. was 10 before!
```

- memory interface width doesn't fit

```
xocc -g -R 2 -s --platform=alpha-data_adm-pcie-8k5_dynamic_5_0 --memory_port_data_width all:32 -c device/vscale2_u.cl -o vscale
```

```
INF0: [XOCC 204-61] Pipelining loop 'Loop 1'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 10.
```

#### Lessons

- When code pattern doesn't fit
  - > Attributes and pragmas ignored by compilers
- When code pattern fits
  - Attributes and pragmas often not needed
  - \_\_attribute\_\_((xcl\_pipeline\_loop(1)))
  - #pragma ii <desired\_initiation\_interval>
- Xilinx compiler doesn't generate automatic epilogues

#### **Vector Scale with simpler Unrolling**

```
kernel
void vscale(
 global float *restrict x,
 global float *restrict y,
const float a,
const int size)
{
    // attention, functionality only
    // identical if size is multiple of 16
    const int size16 = size / 16;
     attribute ((opencl unroll hint(16)))
    for(int i=0; i<size16*16; i++){</pre>
        y[i] = x[i]*a;
            35 Area Information
}
                Compute Unit Kernel Name Module Name FF LUT
                                                                  DSP
                                                                      BRAM
            36
            37
                vscale_1
            38
                            vscale
                                          vscale
                                                      6454
                                                                      30
                                                            4992
                                                                  48
            39
```

### **Unrolled Vector Scale with Epilogue**

| kernel INF<br>void vscale( INF                              | 0: [XOCC 204-61]<br>0: [XOCC 204-61]<br>0: [XOCC 204-61]<br>0: [XOCC 204-61]<br>0: [XOCC 204-61] | Pipelining resul<br>Pipelining loop | t : Target II =<br>'Loop 2'. |          |          |        |           |
|-------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-------------------------------------|------------------------------|----------|----------|--------|-----------|
| global <b>float</b> *restrict x,                            |                                                                                                  |                                     |                              |          |          |        |           |
| global <b>float</b> *restrict y,                            |                                                                                                  |                                     | Notes:                       |          |          |        |           |
| const float a,                                              |                                                                                                  |                                     | - logic re                   | esourc   | e overh  | ead    |           |
| const int size)                                             |                                                                                                  |                                     | - simplify                   | y iterat | tion exp | oressi | ons       |
| {                                                           |                                                                                                  |                                     | - try pree                   | dicatio  | n        |        |           |
| <pre>const int size16 = size /</pre>                        | 16;                                                                                              |                                     | - tradeo                     |          | -        | ortak  | oility    |
| attribute((opencl_unro                                      | oll_hint(1                                                                                       | 6)))                                | and pe                       | erform   | ance!    |        |           |
| <pre>for(int i=0; i<size16*16;< pre=""></size16*16;<></pre> | i++){                                                                                            |                                     |                              |          |          |        |           |
| y[i] = x[i]*a;                                              |                                                                                                  |                                     |                              |          |          |        |           |
| }                                                           |                                                                                                  |                                     |                              |          |          |        |           |
| const int rest = size - si                                  | ize16;                                                                                           |                                     |                              |          |          |        |           |
| <pre>for(int i=size16*16; i<siz< pre=""></siz<></pre>       | ze16*16+re                                                                                       | st; i++){                           |                              |          |          |        |           |
|                                                             | Area Informat                                                                                    |                                     |                              |          | diana.   |        | ter de la |
| }                                                           | Compute Unit                                                                                     | Kernel Name                         | Module Name                  | FF       | LUT      | DSP    | BRAM      |
| } 38<br>39                                                  | vscale_1                                                                                         | vscale                              | vscale                       | 8196     | 10177    | 48     | 32        |

#### **Outline Part 1**

- Overview FPGAs and Goals
- OpenCL Overview
- Example 1: Vector Scale
  - compilation
  - reports
  - performance analysis
- Vector Scale Variations
  - automatic unrolling
- Example 2: SAXPY
  - blockwise design pattern
- Outer Loop Pipelining
- Streaming Kernels

## **Example 2: SAXPY**



• Level 1 BLAS routine (single precision a times x plus y)

```
_kernel
void SAXPY(
  __global const float *restrict x,
  __global float *restrict y,
const int a,
const int size)
{
  for (int i=0; i<size; i++)
    y[i] = a*x[i] + y[i];
}
```

Differences to previous example

- Uses y as input and output
- 2 loads + 1 store

- <u>https://github.com/kenter/OpenCL-FPGA-examples</u> -> SAXPY1.cl
  - report files in reportIntel and reportXilinx
  - How does pipelining work out here?

#### **Xilinx Report**

INF0: [XOCC 204-61] Pipelining loop 'Loop 1'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 153, Depth = 161.

- Xilinx compiler generates at most 2 concurrent bursts
- More global memory access will compete for 'gmem' port of memory controller

WARNING: [SCHED 204-68] The II Violation in module 'SAXPY': Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1)

between bus access on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/ device/SAXPY.cl:11) and bus request on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/ gitlab/2019-date-tutorial/examples/device/SAXPY.cl:11).

WARNING: [SCHED 204-68] The II Violation in module 'SAXPY': Unable to enforce a carried dependence constraint (II = 2, distance = 1, offset = 1)

between bus access on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/ device/SAXPY.cl:11) and bus request on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/ gitlab/2019-date-tutorial/examples/device/SAXPY.cl:11).

WARNING: [SCHED 204-68] The II Violation in module 'SAXPY': Unable to enforce a carried dependence constraint (II = 3, distance = 1, offset = 1)

between bus access on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/ device/SAXPY.cl:11) and bus request on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/ gitlab/2019-date-tutorial/examples/device/SAXPY.cl:11).

WARNING: [SCHED 204-68] The II Violation in module 'SAXPY': Unable to enforce a carried dependence constraint (II = 4, distance = 1, offset = 1)

between bus access on port '<u>gmem</u>' (/<u>upb</u>/scratch/departments/<u>pc2</u>/groups/<u>pc2-mitarbeiter/kenter/gitlab</u>/2019-date-tutorial/examples/ device/SAXPY.cl:11) and bus request on port '<u>gmem</u>' (/<u>upb</u>/scratch/departments/<u>pc2</u>/groups/<u>pc2-mitarbeiter/kenter</u>/ gitlab/2019-date-tutorial/examples/device/SAXPY.cl:11).

WARNING: [SCHED 204-68] The II Violation in module 'SAXPY': Unable to enforce a carried dependence constraint (II = 130, distance = 1, offset = 1)

between bus access on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/ device/SAXPY.cl:11) and bus request on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/ gitlab/2019-date-tutorial/examples/device/SAXPY.cl:11).

WARNING: [SCHED 204-68] The II Violation in module 'SAXPY': Unable to enforce a carried dependence constraint (II = 145, distance = 1, offset = 1)

between bus access on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/gitlab/2019-date-tutorial/examples/ device/SAXPY.cl:11) and bus request on port 'gmem' (/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/ gitlab/2019-date-tutorial/examples/device/SAXPY.cl:11).

#### **Design Pattern: Blockwise Read-Modify-Write**

• Data without address space qualifier goes to \_\_local memory (on chip BRAM)

```
#define BLOCK SIZE 1024
 kernel
void SAXPY(
 global float *restrict x,
 global float *restrict y,
const int a, const int size)
{
    for (int i=0; i<size;</pre>
         i+=BLOCK SIZE)
```

```
float local_x[BLOCK_SIZE];
float local y[BLOCK SIZE];
  attribute ((opencl unroll hint(16)))
for(int j=0; j<BLOCK_SIZE; j++){</pre>
   local x[j] = x[i+j];
 attribute ((opencl unroll hint(16)))
for(int j=0; j<BLOCK SIZE; j++){</pre>
   local \mathbf{y}[j] = \mathbf{y}[i+j];
  attribute ((opencl_unroll hint(16)))
for (int j=0; j<BLOCK SIZE; j++){</pre>
   \mathbf{y}[j] = a * \mathbf{local}_{\mathbf{x}}[j] + \mathbf{local}_{\mathbf{y}}[j];
```

#### **Xilinx Reports and Performance Model**

#### • Xilinx Pipelining

INF0: [XOCC 204-61] Pipelining loop 'Loop 1.1'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 3.
INF0: [XOCC 204-61] Pipelining loop 'Loop 1.2'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 3.
INF0: [XOCC 204-61] Pipelining loop 'Loop 1.3'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 21.

- 3 pipelined loops inside sequential outer loop
- Per outer loop iteration
  - Time in Cycles C =
    - N x II(Loop 1.1) + L(Loop 1.1) +
    - N x II(Loop 1.2) + L(Loop 1.2) +
    - N x II(Loop 1.3) + L(Loop 1.3)
  - = 1024 + 3 + 1024 + 3 + 1024 + 21
  - Asymptotically N x 3
- Still much better than II 151 and no bursts



#### Intel SAXPY: the Good...

- No fixed 'ports' on global memory
- Can sustain multiple burst transfers concurrently
  - see later case study on efficiency
- Original SAXYP implementation efficiently pipelined

| Loops analysis             | Show fully unrolled loop |    |            |                            |  |  |  |  |  |  |
|----------------------------|--------------------------|----|------------|----------------------------|--|--|--|--|--|--|
|                            | Pipelined                | Ш  | Bottleneck | Details                    |  |  |  |  |  |  |
| Kernel: SAXPY (SAXPY.cl:4) |                          |    |            | Single work-item execution |  |  |  |  |  |  |
| SAXPY.B2 (SAXPY.cl:10)     | Yes                      | ~1 | n/a        | II is an approximation.    |  |  |  |  |  |  |



#### Intel SAXPY Blockwise: the (not so) Bad...

- In this example: blockwise design for portability
- General reasons for blockwise designs
  - data reuse within block
  - reordering / indirect / irregular data access

| Loops analysis                                      |           |     |            | inrolled loops     | 1.1.1                              | And the second sec |     |
|-----------------------------------------------------|-----------|-----|------------|--------------------|------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
|                                                     | Pipelined | п   | Bottleneck | Details            | 12 - 13                            | <pre>for (int i=0; i<size; float="" i+="BLOCK_SIZE){" local_x[block_size];="" local_x[block_size];<="" pre=""></size;></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |     |
| ernel: SAXPY (SAXPY_block.cl:6)                     |           |     |            | Single work-item   | 14<br>15<br>16 *                   | <pre>float local_y[BLOCK_SIZE];<br/>attribute((opencl_unroll_hint(16<br/>for (int j=0; j<block_size; j++){<="" pre=""></block_size;></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | ))) |
| SAXPY.B2 (SAXPY_block.cl:12)                        | Yes       | >=1 | n/a        | Serial exe: Memo   | 17<br>18                           | <pre>local_x[j] = x[i+j]; }</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |     |
| 16X Partially unrolled SAXPY.B4 (SAXPY_block.cl:16) | Yes       | ~1  | n/a        | II is an approxima | 19<br>20 -                         | <pre>attribute((opencl_unroll_hint(16 for (int j=0; j<block_size; j++){<="" pre=""></block_size;></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | ))) |
| 16X Partially unrolled SAXPY.B6 (SAXPY_block.cl:20) | Yes       | ~1  | n/a        | II is an approxima | 21<br>22                           | <pre>local_y[j] = y[i+j]; }</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |     |
| 16X Partially unrolled SAXPY.B8 (SAXPY_block.cl:24) | Yes       | ~1  | n/a        | II is an approxima | 23<br>24 -<br>25<br>26<br>27<br>28 | <pre>attribute((opencl_unroll_hint(16 for (int j=0; j<block_size; +="" j++){="" local_y[j]="" pre="" y[j]="a*local_x[j]" }="" }<=""></block_size;></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |     |
| Details                                             |           |     |            |                    |                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | 3   |

- Iteration executed serially across SAXPY.B6, SAXPY.B8. Only a single loop iteration will execute inside this region due to memory dependency:
  - From: Load Operation (SAXPY\_block.cl: 21)
  - To: Store Operation (SAXPY\_block.cl: 25)

### **Intel Serial Execution (1)**

• Technically the outer loop is pipelined, check aocl-best-practices-guide for details

#### Serial Execution

|              | N   | 400   |       |       |       |   |       |       |                 |       |              |       |                        |           |
|--------------|-----|-------|-------|-------|-------|---|-------|-------|-----------------|-------|--------------|-------|------------------------|-----------|
|              | i   | 0     | 0     | 0     | 0     |   | 0     | 1     |                 | 6     |              | 7     | ***                    | 8         |
|              | Ĵ   | 0     | 1     | 2     | 3     |   | 399   | 0     |                 | 0     |              | 0     |                        | 0         |
|              | 1   | i = 0 |       |       |       |   | 1 - 1 |       |                 |       |              |       |                        |           |
|              | 2   |       |       |       |       |   |       | i = 1 |                 |       |              |       |                        |           |
|              | 3   |       |       | 1     |       | - |       |       |                 |       | _            |       |                        |           |
|              |     |       | =     |       |       |   |       |       |                 |       | _            |       |                        |           |
|              | 6   |       |       |       |       | _ |       |       |                 |       |              |       |                        |           |
|              | 7   |       |       |       |       | - |       |       |                 | i = 6 |              |       |                        |           |
|              | 8   | i = 0 | 1.    |       |       |   |       |       |                 |       |              | i = 7 |                        |           |
|              | 9   |       | i = 0 |       |       |   | 1     | 11    |                 |       |              |       |                        |           |
| 1.1.6.1.     | 10  |       |       | i = 0 |       |   |       |       |                 |       |              |       |                        |           |
| Clock Cycles | 11  |       |       |       | i = 0 |   |       |       |                 |       |              |       |                        | 1         |
|              | 12  |       |       |       |       |   |       |       |                 |       | -            | 11.1  |                        |           |
|              | in  |       |       |       |       |   |       |       |                 |       |              |       |                        |           |
|              | 407 |       |       | 10.00 |       |   | i = 0 |       |                 |       |              |       |                        | and and a |
|              | 408 | 1     | -     | 122   |       |   |       | 1-1   |                 |       |              | -     | assessed in the second | -         |
|              | 409 |       |       |       |       |   |       |       |                 | -     | and services |       |                        |           |
|              | 410 |       |       |       |       |   |       |       | al and a second |       |              |       |                        |           |
|              | 411 |       |       | 1     |       |   |       | i = 1 |                 |       | 1            |       |                        | i =       |
|              | 412 | 1.000 |       | 12.   |       |   |       |       |                 |       |              |       |                        |           |

#### **Intel Serial Execution (2)**

• Technically the outer loop is pipelined, check aocl-best-practices-guide for details Consider the following code example:

In the example, the dependence in the outer loop resulted in the serial execution of the inner loop. The main difference in performance is the steady state II of outer loop = II of inner loop \* (trip count of inner loop - 1) + latency. In this example, II of inner loop is 1 with latency of 4 and II of outer loop is 1 with latency of 7. If N is large, such as 400, when compared to latency, then serial execution has little impact from the outer loop II.

#### Intel SAXPY Blockwise: the (not so) Bad... Performance

- 3 pipelined loops inside serial execution outer loop
- Per outer loop iteration
  - Time in Cycles C =
    - N x II(Loop 1.1) + L(Loop 1.1) +
    - N x II(Loop 1.2) + L(Loop 1.2) +
    - N x II(Loop 1.3) + L(Loop 1.3)
  - = 1024 + **244** + 1024 + **244** + 1024 + **79**
  - Asymptotically N x 3
- Asymptotically same throughput as Xilinx design



### Intel SAXPY Blockwise: the (slightly) Ugly

Additional memory resources are allocated for outer loop pipelining

| SAXPY_block.cl:13 (local_x) 0 | 0 | 13 | o | o   | Good H<br>4096B<br>16384 | 15<br>16 -<br>17<br>18 |
|-------------------------------|---|----|---|-----|--------------------------|------------------------|
|                               |   |    |   |     | ~ ~~                     | 19<br>20 -             |
| Details                       |   |    |   |     |                          |                        |
| SAXPY_block.cl:13 (local_x):  |   |    |   |     |                          |                        |
| Requested size                |   |    |   | 409 | 6 bytes                  |                        |
| Implemented size              |   |    |   | 163 | 84 bytes                 |                        |
| Total replication             |   |    |   | 4   |                          |                        |

- Number of banks Bank depth
- minor overhead in this case
- can be modified
  - #pragma max\_concurrency 1

- 1 (banked on bit 4294967295) 64 words
- Requested size 4096 bytes, implemented size 16384 bytes, replicated 4 times total, stall-free, 1 read and 1 write.
- 4 independent copies of this memory were created to enable simultaneous execution of 4 loop iterations defined at (SAXPY block.cl: 12)
- You can reduce the number of copies of this memory by limiting the concurrency of its loop; see the OpenCL Programming Guide for details.
- Private memory implemented in on-chip block RAM.

#### **Lessons from SAXPY**

- Xilinx designs suffer from competition on 'gmem' ports
  - next slide: brief look at Intel LSUs
- Blockwise designs can involve overheads like 3 x
  - will introduce streaming kernels as broadly applicable pattern to overcome this
  - sometimes the solution is simpler
- Intel compiler replicates local memories for outer loop pipelining
  - will look at example without 'serial execution'

## Intel LSUs

- LSU: Load Store Unit
  - initiate burst transfer to local buffer
  - feed kernel with data from local buffer
- Linear buffer or cache
  - automatic decision, mostly works well

#### Cached

Burst-coalesced LSUs might sometimes include a cache. A cache is created when the memory access pattern is data-dependent or appears to be repetitive. The cache cannot be shared with other loads even if the loads want the same data. The cache is flushed on kernel start and consumes more hardware resources than an equivalent LSU without a cache. The cache can be disabled by simplifying the access pattern or marking the pointer as volatile.



#### **Lessons from SAXPY**

- Xilinx designs suffer from competition on 'gmem' ports
  - next slide: brief look at Intel LSUs  $\checkmark$
- Blockwise designs can involve overheads like 3 x
  - will introduce streaming kernels as broadly applicable pattern to overcome this
  - sometimes the solution is simpler
- Intel compiler replicates local memories for outer loop pipelining
  - will look at example without 'serial execution'

## **Outer Loop Pipelining**

## **Resolving Serial Execution (1)**

#### • Review reason for serial execution

|                                                     | Pipelined | н   | Bottleneck | Details            |
|-----------------------------------------------------|-----------|-----|------------|--------------------|
| rnel: SAXPY (SAXPY_block.cl:6)                      |           |     |            | Single work-item   |
| SAXPY.B2 (SAXPY_block.cl:12)                        | Yes       | >=1 | n/a        | Serial exe: Memo   |
| 16X Partially unrolled SAXPY.B4 (SAXPY_block.cl:16) | Yes       | ~1  | n/a        | II is an approxima |
| 16X Partially unrolled SAXPY.B6 (SAXPY_block.cl:20) | Yes       | ~1  | n/a        | II is an approxima |
| 16X Partially unrolled SAXPY.B8 (SAXPY_block.cl:24) | Yes       | ~1  | n/a        | II is an approxima |



Details

#### SAXPY.B2:

- Iteration executed serially across SAXPY.B6, SAXPY.B8. Only a single loop iteration will execute inside this region due to memory dependency:
  - From: Load Operation (SAXPY\_block.cl: 21)
  - To: Store Operation (SAXPY\_block.cl: 25)
- See Best Practices Guide : Nested Loops for more information

### **Resolving Serial Execution (2)**

10

12

13 -

14

15 16

17 -

18

19

20

21 -22

- Tell compiler that blocks are independent
  - #pragma ivdep

|                                                     | Pipelined | н   | Bottleneck | Details            |
|-----------------------------------------------------|-----------|-----|------------|--------------------|
| Kernel: SAXPY (SAXPY_ivdep.cl:6)                    |           |     |            | Single work-item   |
| SAXPY.B2 (SAXPY_ivdep.cl:13)                        | Yes       | >=1 | n/a        |                    |
| 16X Partially unrolled SAXPY.B4 (SAXPY_ivdep.cl:17) | Yes       | ~1  | n/a        | II is an approxima |
| 16X Partially unrolled SAXPY.B6 (SAXPY_ivdep.cl:21) | Yes       | ~1  | n/a        | II is an approxima |
| 16X Partially unrolled SAXPY.B8 (SAXPY_ivdep.cl:25) | Yes       | ~1  | n/a        | II is an approxima |
|                                                     |           |     |            |                    |

```
const int size)
11 - {
         #pragma ivdep
         for (int i=0; i<size; i+=BLOCK_SIZE){</pre>
             float local_x[BLOCK_SIZE];
             float local_y[BLOCK_SIZE];
             __attribute__((opencl_unroll_hint(16)))
             for (int j=0; j<BLOCK_SIZE; j++){</pre>
                 local_x[j] = x[i+j];
             3
             __attribute__((opencl_unroll_hint(16)))
             for (int j=0; j<BLOCK_SIZE; j++){</pre>
                 local_y[j] = y[i+j];
```

### **Execution flow with Outer Loop Pipelining**



#### **Outer Loop Pipelining Performance**

- Asymptotically all functional units filled in every cycle
- Pipeline takes long to fill
  - recap from earlier example

| Ν      | С      | Efficiency |  |
|--------|--------|------------|--|
| 10     | 666    | 1.5%       |  |
| 100    | 756    | 13.2%      |  |
| 1000   | 1656   | 60.4%      |  |
| 10000  | 10656  | 93.8%      |  |
| 100000 | 100656 | 99.3%      |  |

- now similar efficiency considerations apply to inner and outer loops
- e.g. N inner = N outer = 1000
  - efficiency = 0.604 \* 0.604 = 0.365 -> 36.5%
- in practice, latency of outer loop is much higher!

#### **Intel Outer Loop Pipelining Summary**

- Very powerful tool
  - this example: constant and identical trip counts of inner loops
  - successfully tested: different trip counts of inner loops based on runtime arguments
  - works also for deeper nesting levels
- Memory replication can be very costly
  - resource balance: ~2 block RAMs for 1 DSP
  - replication can easily lead to 3-5 x more block RAM usage

#### **Xilinx Counterpart**

- Can request pipelining in one outer loop (or function)
- \_\_attribute\_\_((xcl\_dataflow))
- Generally: less flexible than Intel counterpart
- In this example: doesn't overcome 'gmem' conflict

ERROR: [XOCC 203-711] Bundled bus interface gmem failed dataflow checking: it cannot read data in multiple processes. ERROR: [XOCC 203-711] Bundled bus interface gmem has read operations in function: 'SAXPY\_proc.1.1' and 'SAXPY\_proc.1.1.2'.

m failed dataflow checking: it cannot read data in multiple processes.
m has read operations in function: 'SAXPY\_proc.1.1' and 'SAXPY\_proc.1.1.

# **Streaming Kernels**

#### **Task-level Parallelism**

#### use (expensive) arithmetic units (almost) every cycle

have scaling designs up to resource or bandwidth limits

- Scaling option: add more different tasks
- Advantage: may lead to better balanced resource mix
- Key goals
  - execute tasks concurrently
  - forward reused data on chip from one task to the next
  - FPGA architecture: wires, FIFO buffers
- OpenCL 2.0 feature: pipe

#### **OpenCL FPGA Tool Adaptions of Pipes**

- OpenCL 2.0 pipe
  - dynamic allocation from host code
  - CPUs and GPUs don't have kernel-to-kernel wires, use shared memory
  - default: non blocking (polling)
- Intel FPGA adaptation
  - introduce name channel
  - #pragma OPENCL EXTENSION cl\_intel\_channels : enable
  - require static instantiation in .cl file
  - allow \_\_attribute\_\_((depth(N)))
  - default: blocking
  - less efficient, more standard conform pipes available
- Xilinx adaptation
  - require static instantiation in .cl file
  - require \_\_attribute \_\_((xcl\_reqd\_pipe\_depth(N))) N in [16,32,64,...32768]
  - add blocking mode (and recommend using it)

#### **Header File for Portable FPGA Pipes**

• Use blocking semantics by default

```
#pragma OPENCL EXTENSION cl_intel_channels : enable
 2
 3
   #if defined(___xilinx___)
 4
 5
       #define PIPE pipe
 6
       #define PIPE_READ(name, val) read_pipe_block(name, &val)
 7
       #define PIPE_WRITE(name, val) write_pipe_block(name, &val)
 8
       #define LABEL(x) x:
 9
   #elif defined(INTELFPGA_CL)
10
       #define PIPE channel
       #define PIPE_READ(name, val) val = read_channel_intel(name)
11
       #define PIPE_WRITE(name, val) write_channel_intel(name, val)
12
       #define LABEL(x)
13
14 #endif
```

#### **Pipes in SAXPY Streaming Kernel**

```
#include "macros.h"
                                         kernel
PIPE float p y
                                        void SAXPY(
 _attribute__((xcl_reqd_pipe_depth
                                          global const float16 *restrict x,
(32)));
                                         global float16 *restrict y,
                                        const int a,
 kernel
                                        const int size16
void readY(
 global float16 *restrict y,
const int size16
                                            for (int i=0; i<size16; i++){</pre>
                                                float16 y in;
                                                PIPE_READ(p y, y in);
    for (int i=0; i<size16; i++){</pre>
                                                y[i] = a * x[i] + y in;
        float16 y in = y[i];
        PIPE_WRITE(p y, y in);
```

#### **SAXPY Streaming Result**

• Xilinx (and Intel) design with 2 overlapping kernels with II = 1 loops

==>The following messages were generated while performing high-level synthesis for kernel: SAXPY Log file: /upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/ gitlab/2019-date-tutorial/examples/\_x/SAXPY\_streaming16/SAXPY/vivado\_hls.log : INF0: [XOCC 204-61] Option 'relax\_ii\_for\_timing' is enabled, will increase II to pre serve clock frequency constraints. INF0: [XOCC 204-61] Pipelining loop 'XCL\_WG\_DIM\_Z\_XCL\_WG\_DIM\_Y\_XCL\_WG\_DIM\_X.1'. INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 21.

==>The following messages were generated while performing high-level synthesis for kernel: readY Log file: /upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/ gitlab/2019-date-tutorial/examples/\_x/SAXPY\_streaming16/readY/vivado\_hls.log : INFO: [XOCC 204-61] Option 'relax\_ii\_for\_timing' is enabled, will increase II to pre serve clock frequency constraints. INFO: [XOCC 204-61] Pipelining loop 'XCL\_WG\_DIM\_Z\_XCL\_WG\_DIM\_Y\_XCL\_WG\_DIM\_X.1'. INFO: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 3. INFO: [XOCC 60-594] Finished kernel compilation

#### **Streaming Kernel Summary**

- Pipes for task-level parallelism
- Decoupling with pipes can also resolve other pipelining obstacles or kernel stalls
  - here: global memory interface restrictions for Xilinx
- Note on resources
  - visible resource utilization low
  - but pipes need wires can prohibit successful routing
    - rule of thumb 512 bit pipes (like memory interface) are fine
    - much wider pipes cause problems
- Note on host code
  - OpenCL command queues are sequential by default
  - use multiple command queues for concurrent kernel execution
  - Xilinx only alternative: out-of-order command queue

## **Conclusion Part 1**

#### **Outline Part 1**

- Overview FPGAs and Goals
- OpenCL Overview
- Example 1: Vector Scale
  - compilation
  - reports
  - performance analysis
- Vector Scale Variations
  - automatic unrolling
- Example 2: SAXPY
  - blockwise design pattern
- Outer Loop Pipelining
- Streaming Kernels

#### **Concept Summary**

- Covered concepts
  - Pipelining
  - Unrolling / Vectorization
  - Local Memory
  - Blockwise operations
  - Outer loop pipelining
  - Streaming
- Other important concepts
  - Local memory layout
  - Loop coalescing
  - Reductions
  - Shift Registers
  - Latency hiding

#### **General Remarks**

- Intel and Xilinx OpenCL compilers have mostly improved over the last 3+ years
  - Intel removed support for variadic macros
- New FPGA architectures always pose challenges
  - Xilinx introduction of super logic regions (SLRs) seems well resolved now
  - Xilinx introduction of UltraRAM unknown status to me
  - Intel Stratix 10 HyperFlex
    - higher clock frequencies partially realized already
    - tools have introduced much higher latencies
    - blocking channels discouraged
  - next challenge for both: high bandwidth memory (HMB)
    - 32 x 512 bit memory interfaces?

# Part 2

# Vendor Matrix Multiplications Complex Design Examples

# Simple, yet Efficient Matrix Multiplication Designs with OpenCL

#### **Vendor Example Resources**

#### **Intel FPGA**

- <u>https://www.intel.com/content/www/us/en/programmable/products/design-</u> software/embedded-software-developers/opencl/support.html#design-examples
  - examples driven by application scenario, pargmatic combination of concepts
  - each example optimized for peak performance on one target device

#### Xilinx

- https://github.com/Xilinx/SDAccel\_Examples
  - focus on presenting one or few concepts in working example
  - most examples (getting started group) not optimized to fully utilize device

#### **Matrix Multiplication**

•  $C = A \times B$ 



- Overall data used: 3 x N^2
- Computations (MAC) per element: N
- Overall computations: N^3
- Peak arithmetic intensity: N

#### **Tutorial Examples for Matrix Multiplication**

#### **Intel FPGA**

- <u>https://www.intel.com/content/www/us/en/programmable/support/support-</u> resources/design-examples/design-software/opencl/matrix-multiplication.html
- Matrix Multiplication with ND range kernel
  - 64x64 tiles, up to 16x64 MAC operations per cycle

Xilinx

- <u>https://github.com/Xilinx/SDAccel\_Examples/tree/master/getting\_started/kernel\_o</u> <u>pt/systolic\_array\_ocl</u>
- Matrix Multiplication with systolic array
  - integer operations

**Tutorial copies** 

- <u>https://github.com/kenter/OpenCL-FPGA-examples</u>
  - matrix\_mult.cl
  - mmult.cl

# Intel FPGA matrix\_mul

#### Intel matrix\_mult.cl

• NDRange kernel

```
105 <u>kernel</u>
```

- 106 \_\_attribute((reqd\_work\_group\_size(BLOCK\_SIZE,BLOCK\_SIZE,1)))
- 107 \_\_attribute((num\_simd\_work\_items(SIMD\_WORK\_ITEMS)))
- 108 void matrixMult( // Input and output matrices
- Read code inside kernel from perspective of one work item
- IDs are used to determine element positions

```
// Block index
int block_x = get_group_id(0);
int block_y = get_group_id(1);
// Local ID index (offset within a block)
int local_x = get_local_id(0);
int local_y = get_local_id(1);
// Compute loop bounds
int a_start = A_width * BLOCK_SIZE * block_y;
int a_end = a_start + A_width - 1;
int b_start = BLOCK_SIZE * block_x;
```

### **Tiling in Work Groups and Items**

- Work item computes result element
- Work group computes result tile



• Process inputs per tile



- <u>https://github.com/kenter/OpenCL-FPGA-examples</u> -> matrix\_mult.cl
  - additional reports in reportIntel
  - Throughput of this design?

#### **Allocation of Local Memory**

#### • Input Tiles

```
// Local storage for a block of input matrices A and B
__local float A_local[BLOCK_SIZE][BLOCK_SIZE];
__local float B_local[BLOCK_SIZE][BLOCK_SIZE];
```

- Where's the output tile?
- float running\_sum = 0.0f;
- Only output value work item
- Input tiles are shared in group (\_\_local)
- Output elements are work-item private (still \_\_local memory space)

```
// Store result in matrix C
C[get_global_id(1) * get_global_size(0) + get_global_id(0)] =
    running_sum;
```

### Two loops

• Loop over input tiles

```
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b
    += (BLOCK_SIZE * B_width))
{</pre>
```

- Note: need to synchronize between work items after loading tiles
- Loop over tile vectors

```
#pragma unroll
for (int k = 0; k < BLOCK_SIZE; ++k)
{
    running_sum += A_local[local_y][k] * B_local[local_x][k];
}</pre>
```

• Fully unrolled: 64 MACs per cycle



#### **SIMD Work Items**

• NDRange kernel feature SIMD Work Items (max 16)

105 <u>kernel</u>

- 106 \_\_attribute((reqd\_work\_group\_size(BLOCK\_SIZE,BLOCK\_SIZE,1)))
- 107 \_\_attribute((num\_simd\_work\_items(SIMD\_WORK\_ITEMS)))
- 108 void matrixMult( // Input and output matrices
- Process multiple work items per cycle
- Need more elements of B concurrently

#### **Resource Usage of Banking**

• Local buffer size increased by 16 banks

| matrix_mult_v1.cl:116 (A_local)                                   | 0      | 0 | 52                       | 87 |         | multipl<br>value<br>output (<br>compute    |
|-------------------------------------------------------------------|--------|---|--------------------------|----|---------|--------------------------------------------|
| matrix_mult_v1.cl:117 (B_local)                                   | 0      | 0 | 832                      | 88 | //      | in a !<br>BLOCK_S:<br>multiple<br>SIMD_WOI |
|                                                                   | _      |   |                          | 89 | 11      | See th                                     |
| San State                                                         |        |   |                          |    |         |                                            |
| Details<br>matrix_mult_v1.cl:117 (B_le                            | ocal): |   |                          |    |         |                                            |
|                                                                   | ocal): |   | 16384 byte               | es |         |                                            |
| matrix_mult_v1.cl:117 (B_l                                        | ocal): |   | 16384 byte<br>49152 byte |    |         |                                            |
| matrix_mult_v1.cl:117 (B_lo<br>Requested size                     | ocal): |   |                          |    |         |                                            |
| matrix_mult_v1.cl:117 (B_le<br>Requested size<br>Implemented size | ocal): |   | 49152 byte               | es | 3, 9, 1 | 0, 11)                                     |

#### **Intel MM Design Evaluation**

- 64 (unrolled inner loop) x 16 (SIMD work items) MAC operations per cycle
- Balanced resource usage (good fit for Arria 10 GX1150)
  - ~1024 DSPs, ~1350 BRAMs (832 for local B tile)
- Performance considerations, per pair of input tiles
  - calculate 64x64 work items, 16 in parallel -> 64x64/16 = 64x4 = 256 cycles per pair of input tiles
  - need to load 2 tiles à 64x64 floats (eventually store 1 tile à 64x64 floats)
    - 2 x 64x64 x 32 bit = 2 x 128kb per tile
    - have 256 cycles: 2 x 512 bit per cycle perfect match for memory interface
- Scaling considerations (Stratix 10)
  - higher compute to bandwidth ratio needs larger tiles
  - scaling problems for banked B tiles and registers for work item state (running\_sum and more)

#### **Concepts Used**

- Covered concepts
  - Pipelining (different here: NDRange)
  - Unrolling / Vectorization
  - Local Memory
  - Blockwise operations
  - Outer loop pipelining (different here: work groups)
  - Streaming
- Other important concepts
  - Local memory layout
  - Loop coalescing
  - Reductions
  - Shift Registers
  - Latency hiding

## Xilinx mmult

#### Xilinx mmult.cl

- Educational example on technique systolic array
- <u>https://github.com/kenter/OpenCL-FPGA-examples</u> -> mmult.cl
  - additional reports in reportXilinx
  - How many pipelined loops?
- Blockwise processing
  - 2 read blocks, 1 compute block, 1 write back block

```
INF0: [XOCC 204-61] Pipelining loop 'readA'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 74.
INF0: [XOCC 204-61] Pipelining loop 'readB'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 74.
INF0: [XOCC 204-61] Pipelining loop 'systolic1'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 8.
INF0: [XOCC 204-61] Pipelining loop 'writeC'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 73.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 73.
INF0: [XOCC 60-594] Finished kernel compilation
```

#### Xilinx mmult.cl Snippets (1)

- Array partitioning for parallel access
  - in first dimension
  - in second dimension
  - in all dimensions

int localA[MAX\_SIZE][MAX\_SIZE] \_\_attribute\_\_((xcl\_array\_partition(complete, 1)));; int localB[MAX\_SIZE][MAX\_SIZE] \_\_attribute\_\_((xcl\_array\_partition(complete, 2)));; int localC[MAX\_SIZE][MAX\_SIZE] \_\_attribute\_\_((xcl\_array\_partition(complete, 0)));;

#### Xilinx mmult.cl Snippets (2)

• Outer loop pipelined

\_\_attribute\_\_((xcl\_pipeline\_loop(1)))
\_\_attribute\_\_((xcl\_loop\_tripcount(c\_size, c\_size)))
systolic1: for(int k = 0; k < a\_col; k++) {
 \_\_attribute\_\_((xcl\_loop\_tripcount(c\_size, c\_size)))
 systolic2: for(int i = 0; i < MAX\_SIZE; i++) {
 \_\_attribute\_\_((xcl\_loop\_tripcount(c\_size, c\_size)))
 systolic3: for(int j = 0; j < MAX\_SIZE; j++) {</pre>

- what about the two loops inside?
  - again, code pattern determines further transformations

INF0: [XFORM 203-502] Unrolling all sub-loops inside loop 'systolic1'
(/upb/scratch/departments/pc2/groups/pc2-mitarbeiter/kenter/
gitlab/2019-date-tutorial/examples/device/mmult.cl:151) in function
'mmult' for pipelining.

## Xilinx mmult.cl Snippets (3)

| <ul> <li>2D loop unrolling</li> <li>– simple form of systolic array</li> </ul> |
|--------------------------------------------------------------------------------|
| <pre>attribute((xcl_pipeline_loop(1)))</pre>                                   |
| <pre>attribute((xcl_loop_tripcount(c_size, c_size)))</pre>                     |
| <pre>systolic1: for(int k = 0; k &lt; a_col; k++) {</pre>                      |
| <pre>attribute((xcl_loop_tripcount(c_size, c_size)))</pre>                     |
| <pre>systolic2: for(int i = 0; i &lt; MAX_SIZE; i++) {</pre>                   |
| <pre>attribute((xcl_loop_tripcount(c_size, c_size))</pre>                      |
| <pre>systolic3: for(int j = 0; j &lt; MAX_SIZE; j++) {</pre>                   |
|                                                                                |

| 11 | B_0     | B_1 | B_2 | B_3 |
|----|---------|-----|-----|-----|
| 11 | 1       | 1   | 1   | 1   |
| 11 | v       | v   | v   | v   |
| 11 |         |     |     |     |
| 11 | 1 1     | 1 1 | 1.1 | 1 1 |
| 11 | A0> C00 | C01 | C02 | C03 |
| 11 |         | 11  |     |     |
| 11 | 1       | 1   | 1   | 1   |
| 11 |         | _   |     |     |
| 11 | 1 1     | 1 1 | 1 1 | 1 1 |
| 11 | A1> C10 | C11 | C12 | C13 |
| 11 | 11      | 11  | I]  |     |
| // | 1       | 1   | 1   | 1   |
| 11 |         |     |     |     |
| 11 | 1 1     | 1 1 | 1 1 | 1 1 |
| // | A2> C20 | C21 | C22 | C23 |
| 11 | 11      | 1   |     | 11  |
| 11 | 1       | 1   | 1.  | 1   |
| 11 |         |     |     |     |
| 11 | 1 1     | 1 1 |     | 1 1 |
| 11 | A3> C30 | C31 | C32 | C33 |
| 11 |         |     | II  |     |
|    |         |     |     |     |

### Xilinx mmult.cl Snippets (4)

| code inside loop <ul> <li>PEs get data directly from input array</li> </ul>          |        |
|--------------------------------------------------------------------------------------|--------|
| // Get previous sum                                                                  |        |
| <pre>int last = (k==0) ? 0 : localC[i][j];</pre>                                     |        |
| <pre>// Update current sum // Handle boundary conditions</pre>                       |        |
| <pre>int a_val = (i &lt; a_row &amp;&amp; k &lt; a_col)? localA[i][k]</pre>          | : 0;   |
| <pre>int b_val = (k &lt; b_row &amp;&amp; j &lt; b_col)? localB[k][j]</pre>          | : 0;   |
| <pre>int result = last + a_val*b_val;</pre>                                          |        |
| <pre>// Write back results localC[i][j] = result; </pre> accumulation in 1 cycle red | quired |

| 11 | B_0     | B_1 | B_2 | B_3      |
|----|---------|-----|-----|----------|
| 11 | 1       | 1   | 1   | 1        |
| 11 | v       | v   | v   | V        |
| 11 |         |     |     |          |
| 11 | 1 1     | 1 1 | 1.1 | 1 1      |
| 11 | A0> C00 | C01 | C02 | C03      |
| 11 |         |     |     |          |
| 11 | 1       | 1   | 1   | 1        |
| 11 |         |     |     |          |
| 11 | 1 1     | 1 1 | 1 1 | 1 1      |
| 11 | A1> C10 | C11 | C12 | C13      |
| 11 |         | 11  |     | <u> </u> |
| 11 | 1       | 1   | 1   | 1        |
| 11 |         |     |     |          |
| 11 | 1 1     | 1 1 | 1 1 | 1 1      |
| // | A2> C20 | C21 | C22 | C23      |
| 11 | 11      | 1   |     | 1        |
| 11 | 1       | 1   | 1.  | 1        |
| 11 |         |     |     |          |
| 11 | 1 1     | 1 1 |     | 1 1      |
| 11 | A3> C30 | C31 | C32 | C33      |
| 11 |         |     |     | II       |
|    |         |     |     |          |

#### Xilinx mmult.cl Results + Limitations

• Given example with 12 x 12 = 144 parallel operations

| Area Information |             |             |       |       |     |              |  |
|------------------|-------------|-------------|-------|-------|-----|--------------|--|
| Compute Unit     | Kernel Name | Module Name | FF    | LUT   | DSP | BRAM         |  |
|                  |             |             |       |       |     |              |  |
| mmult_1          | mmult       | mmult       | 52635 | 23720 | 441 | 30           |  |
|                  |             |             |       |       |     | 2.2.2.2.2.2. |  |

• Single cycle accumulation not possible for floating point

```
INF0: [XOCC 204-61] Pipelining loop 'readA'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 74.
INF0: [XOCC 204-61] Pipelining loop 'readB'.
INF0: [XOCC 204-61] Pipelining loop 'systolic1'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 7, Depth = 23.
INF0: [XOCC 204-61] Pipelining loop 'writeC'.
INF0: [XOCC 204-61] Pipelining result : Target II = 1, Final II = 1, Depth = 73.
```

• Need to accumulate into several registers (~latency) and later sum up

#### **Concepts Used**

- Covered concepts
  - Pipelining
  - Unrolling / Vectorization (different here: systolic array, unrolling in 2 dimensions)
  - Local Memory
  - Blockwise operations
  - Outer loop pipelining
  - Streaming
- Other important concepts
  - Local memory layout
  - Loop coalescing
  - Reductions
  - Shift Registers
  - Latency hiding

Success Stories Complex Design Examples

#### **Electrical Engineering: Nanophotonics Simulations**

- FDTD stencil solver for Maxwell
   equations
  - regular 2D grid
  - acceleration with FPGAs
  - generalization of OpenCL design for Xilinx and Intel FPGA compilers
- Kenter et. al: Flexible FPGA design for FDTD using OpenCL. Proc. Int. Conf. on Field Programmable Logic and Applications (FPL). Sep. 2017.

0.2

0.15

#### **Electrical Engineering: Nanophotonics Simulations (2)**



- Discontinuous Galerkin solver for Maxwell equations
  - regular opeartions on unstructured grids
  - acceleration mit FPGAs
  - generalization in domain specific language (DSL) and compiler
- Kenter et. al: OpenCL-based FPGA design to accelerate the nodal Discontinuous Galerkin method for unstructured meshes. Proc. Int. Symp. on Field-Programmable Custom Computing Machines (FCCM). Apr. 2018.

## Thank you! Questions?