Are FPGAs the answer to HPC's woes?
<h2>Executive Summary</h2>
Not yet. I’ll demonstrate why no domain scientist would ever want to program in Verilog, then highlight a few promising directions of development that are addressing this fact.
The usual disclaimer also applies: the opinions and conjectures expressed below are mine alone and not those of my employer. Also I am not a computer scientist, so I probably don’t know what I’m talking about. And even if it seems like I do, remember that I am a storage architect who is wholly unqualified to speak on applications and processor performance.
<h2>Premise</h2>We’re now in an age where CPU cores aren’t getting any faster, and the difficulty of shrinking processes below 10 nm means we can’t really pack any more CPU cores on a die either. Where’s performance going to come from if we ever want to get to exascale and beyond?
Some vendors are betting on larger and larger vectors; ARM (with its Scalable Vector Extension) and NEC (with its Aurora coprocessors) are going down this path. However, algorithms that aren’t predominantly dense linear algebra will need very efficient scatter and gather operations that can pack vector registers quickly enough to make a single vector operation worthwhile. For example, gathering eight 64-bit values from different parts of memory to feed an eight-wide (512-bit) vector multiply requires pulling eight different cache lines; that’s moving 4,096 bits of memory for what amounts to 512 bits of computation. To keep scaling vectors out, CPUs will have to rethink how their vector units interact with memory. This means either (a) getting a lot more memory bandwidth to support these low flops-per-byte ratios, or (b) packing vectors closer to the memory so that pre-packed vectors can be fetched through the existing memory channels.
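To make the imbalance concrete, here is a minimal sketch in C using AVX-512 intrinsics (the function and array names are invented for illustration). If the eight indices each land in a different cache line, the single gather below drags 4,096 bits through the memory hierarchy to feed 512 bits of multiplication:
<pre>
#include &lt;immintrin.h&gt;

/* Sketch of the gather-then-multiply pattern described above. If the
 * eight indices in idx fall in eight different cache lines, the gather
 * moves 8 x 64 bytes = 4,096 bits of memory for one 512-bit multiply. */
void gather_multiply(const double *a, const long long *idx,
                     const double *b, double *out)
{
    __m512i vindex   = _mm512_loadu_si512(idx);            /* eight 64-bit indices  */
    __m512d gathered = _mm512_i64gather_pd(vindex, a, 8);  /* eight scattered loads */
    __m512d packed   = _mm512_loadu_pd(b);                 /* one contiguous load   */
    _mm512_storeu_pd(out, _mm512_mul_pd(gathered, packed));
}
</pre>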
GPUs are another option to consider; they work around the vector-packing issue by implementing a massive number of registers and giant crossbars to plumb those bytes into arithmetic units. Even then, though, relying on a crossbar to connect compute and data is difficult to keep scaling; the interconnect industry gave up on this long ago, which is why today’s clusters connect hundreds or thousands of crossbars into larger fat trees, hypercubes, and dragonflies. GPUs are still using larger and larger crossbars (NVIDIA’s V100 GPU is one of the physically largest single-die chips ever made), but there’s an economic limit to how large a die can be.
This bleak outlook has begun to drive HPC designers towards thinking about smarter ways to use silicon. Rather than build a general-purpose processor that can do all multiplication and addition operations at a constant rate, the notion is to bring hardware design closer to the algorithms being implemented. This isn’t a new idea (for example, RIKEN’s MDGRAPE and DESRES’s Anton are famous examples of purpose-built chips for specific scientific application areas), but this approach historically has been very expensive relative to just using general-purpose processor parts. Only now are we at a place where special-purpose hardware may be the only way to sustain HPC’s performance trajectory.
Given the diversity of applications that run on a modern supercomputer, though, expensive custom chips that only solve one problem aren’t very appetizing. FPGAs are a close compromise, and there has been a growing buzz surrounding the viability of relying on FPGAs in mainstream HPC workloads.
Many of us non-computer scientists in the HPC business have only a vague, qualitative notion of how FPGAs can realistically be used to carry out computations. With excitement around FPGAs growing as exascale approaches, I set out to get my hands dirty and figure out where they might fit in the larger HPC ecosystem.
<h2>Crash course in Verilog</h2>Verilog can be very difficult to grasp for people who already know how to program in languages like C or Fortran (like me!). On the one hand, it looks a bit like C in that it has variables to which values can be assigned, if/then/else controls, for loops, and so on. These similarities are deceptive, though, because Verilog does not execute like C; whereas a C program executes code line by line, one statement after the other, Verilog sort of executes all of the lines at the same time, all the time.
A C program to turn an LED on and off repeatedly might look like this (turn_led_on() and turn_led_off() stand in for whatever platform call actually drives the LED pin):
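<pre>
while (1) {
    turn_led_on();    /* hypothetical helper: drive the LED pin high */
    turn_led_off();   /* hypothetical helper: drive the LED pin low  */
}
</pre>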
where the LED is turned on, then the LED is turned off, then we repeat.
In Verilog, you really have to describe what components your circuit will have and how they are connected. At its most basic, the code to blink an LED in Verilog looks more like
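<pre>
// a sketch of the structure rather than a finished blinker: we declare
// a component (a one-bit register) and a connection (the register flips
// on every rising edge of the FPGA's input clock, clk)
reg state;

always @(posedge clk)
    state <= ~state;
</pre>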
Whereas C is a procedural language in that you describe a procedure for solving a problem, Verilog is more like a declarative language in that you describe how widgets can be arranged to solve the problem.
This can make tasks that are simple to accomplish in C comparatively awkward in Verilog. Take our LED blinker C code above as an example; if you want to slow down the blinking frequency, you can do something like
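<pre>
while (1) {
    turn_led_on();
    sleep(1);         /* wait one second with the LED on  */
    turn_led_off();
    sleep(1);         /* wait one second with the LED off */
}
</pre>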
Because Verilog is not procedural, there is no simple way to say “wait a second after you turn on the LED before doing something else.” Instead, you have to rely on knowing how much time passes between consecutive pulses of the clk signal.
For example, the DE10-Nano has a 50 MHz clock generator, so clk pulses once every 1/(50 MHz), or 20 nanoseconds, and everything time-based has to be derived from this fundamental clock. The following Verilog statement:
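<pre>
always @(posedge clk)
    cnt <= cnt + 1;
</pre>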
indicates that every 20 ns, the cnt register (variable) should be incremented by one. To make the LED wait for one second after it is turned on, we need to figure out a way to do nothing for 50,000,000 clock cycles (1 second / 20 nanoseconds). The canonical way to do this is to
<ol><li>create a big register that can store a number up to 50 million,</li><li>express that this register should be incremented by 1 on every clock cycle,</li><li>create a logic block that turns on the LED when our register is larger than 50 million, and</li><li>rely on the register eventually overflowing back to zero.</li></ol>If we make cnt a 26-bit register, it can count 2<sup>26</sup> = 67,108,864 different values, and our Verilog can look something like
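<pre>
reg [25:0] cnt;    // 26-bit counter; wide enough to hold 50,000,000

always @(posedge clk)
begin
    cnt <= cnt + 1;
    // when cnt > 50,000,000 the LED should be on, otherwise off
    // (how to actually drive the LED is Problem #2 below)
end
</pre>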
However, we are still left with two problems:
<ol><li>cnt will overflow back to zero once it surpasses 2<sup>26</sup> - 1</li><li>We don’t yet know how to express how the LED is connected to our FPGA and should be controlled by our circuit</li></ol>Problem #1 (cnt overflowing) means that the LED will stay off for 50,000,001 clock cycles (about 1 second) while cnt counts up to 50,000,000, but it will then stay on for only the remaining 17,108,863 cycles (about 0.34 seconds) before cnt wraps back to zero. Not exactly the one second on, one second off that our C code does.
Problem #2 is solved by understanding the following:
<ul><li>our LED is external to the FPGA, so it will be at the end of an output wire</li><li>the other end of that output wire must be connected to something inside our circuit–a register, another wire, or something else</li></ul>
The conceptually simplest solution to this problem is to create another register (variable), this time only one bit wide, in which our LED state will be stored. We can then change the state of this register in our if (cnt > 50000000) block and wire that register to our external LED:
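<pre>
// a sketch: still relies on cnt overflowing (Problem #1), but the LED
// is now wired to a one-bit register through an output port
module blinker(
    input  clk,
    output led
);
    reg [25:0] cnt       = 26'd0;
    reg        led_state = 1'b0;

    always @(posedge clk)
    begin
        cnt <= cnt + 1;
        if (cnt > 26'd50000000)
            led_state <= 1'b1;    // turn the LED on
        else
            led_state <= 1'b0;    // turn the LED off
    end

    assign led = led_state;       // wire the register to the LED pin
endmodule
</pre>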
Note that our assign statement is outside of our always @(posedge clk) block because this assignment (connecting our led output wire to our led_state register) is a persistent declaration, not the assignment of a particular value. We are saying “whatever value is stored in led_state should always be carried to whatever is on the other end of the led wire.” Whenever led_state changes, led will simultaneously change as a result.
With this knowledge, we can actually solve Problem #1 now by
<ol><li>only counting up to 50 million rather than relying on the overflow of cnt to turn the LED on or off, and</li><li>overflowing the 1-bit led_state register every 50 million clock cycles.</li></ol>Our Verilog module would look like
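<pre>
// a sketch of the complete blinker, assuming the DE10-Nano's 50 MHz clock
module blinker(
    input  clk,
    output led
);
    reg [25:0] cnt       = 26'd0;
    reg        led_state = 1'b0;

    always @(posedge clk)
    begin
        if (cnt == 26'd49999999)      // 50 million cycles = 1 second
        begin
            cnt       <= 26'd0;       // reset rather than overflow
            led_state <= ~led_state;  // "overflow" the 1-bit register
        end
        else
            cnt <= cnt + 1;
    end

    assign led = led_state;           // wire the register to the LED pin
endmodule
</pre>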
and we accomplish the “hello world” of circuit design:
[Animation: the DE10-Nano’s LED blinking, one second on and one second off.]
This Verilog is actually still missing a number of additional pieces and makes very inefficient use of the FPGA’s hardware resources. However, it shows how awkward it can be to express a simple, four-line procedural program using a hardware description language like Verilog.
<h2>So why bother with FPGAs at all?</h2>It should be clear that solving a scientific problem using a procedural language like C is generally more straightforward than with a declarative language like Verilog. That ease of programming is made possible by a ton of hardware logic that isn’t always used, though.
Consider our blinking LED example; because the C program is procedural, it takes one CPU thread to walk through the code in our program. Assuming we’re using a 64-core computer, that means we can only blink up to 64 LEDs at once. On the other hand, our Verilog module consumes a tiny number of the programmable logic blocks on an FPGA. When compiled for a $100 hobbyist-grade DE10-Nano FPGA system, it uses only 21 of 41,910 programmable blocks, meaning it can control almost 2,000 LEDs concurrently**. A high-end FPGA would easily support tens of thousands.
[Image: a Connection Machine CM2. The CM2 illuminated an LED whenever an operation was in flight; blinking the LED in Verilog is easy, but reproducing the CM2 microarchitecture is a different story. Image credit to Corestore.]
Of course, blinking LEDs haven’t been relevant to HPC since the days of Connection Machines, but if you were to replace the LED-blinking logic with floating-point arithmetic units, the same conclusions apply. In principle, a single FPGA can perform a huge number of floating-point operations every cycle by giving up the ability to perform many of the tasks that a more general-purpose CPU would be able to do. And because FPGAs are reprogrammable, they can be quickly reconfigured with an optimal mix of special-purpose parallel arithmetic units and general-purpose capabilities to suit different application requirements.
However, the fact that the fantastic potential of FPGAs hasn’t materialized into widespread adoption is a testament to how difficult it is to bridge the wide chasm between understanding how to solve a physics problem and understanding how to design a microarchitecture.
<h2>Where FPGAs fit in HPC today</h2>To date, a few scientific domains have had success in using FPGAs. For example,
<ul><li>Experimental instruments that generate data commonly deploy FPGAs close to their detectors to perform very repetitive, relatively simple data filtering or manipulation at extremely high rates. For example, Illumina HiSeq DNA sequencers incorporate both Altera and Xilinx FPGAs to assist with high-throughput image processing, and high-energy physics experiments routinely use FPGAs for signal processing.</li><li>Closer to the HPC side, Convey implemented loadable FPGA blocks that perform many algorithms common in bioinformatics. For example, they provided an FPGA-accelerated Smith-Waterman algorithm; this algorithm is used to align short DNA sequences against a reference genome and must be executed thousands of times per genome before the actual genomic analysis can start.</li><li>More recently, Edico Genome has been very successful in implementing a wide range of common bioinformatics algorithms on FPGAs and providing end-to-end analysis pipelines that act as drop-in replacements for standard genomic analysis pipelines.</li></ul><div>The success of these FPGA products is due in large part to the fact that the end-user scientists never have to interact directly with the FPGAs. In the case of experimental detectors, the FPGAs sit so close to the detector that the “raw” data delivered to the researcher has already passed through them. Convey and Edico incorporate their FPGAs into appliances, and the process of offloading certain tasks to the FPGA is hidden inside proprietary applications that, to the research scientist, look like any other command-line analysis program.</div>
<h2>Where FPGAs will fit in HPC tomorrow</h2>For FPGAs to break into mainstream HPC, a few things will have to change:
<ul><li>Users must be able to integrate FPGA acceleration into their existing applications rather than replace their applications wholesale with proprietary FPGA analogues.</li><li>It has to be as easy as f90 -fopenacc or nvcc to build an FPGA-accelerated application, and running the resulting accelerated binary has to be as easy as running an unaccelerated binary.</li></ul>
The most mature path toward this today is OpenCL, but it still falls short on several fronts:
<ul><li>OpenCL tends to be very messy to code in compared to simpler APIs such as OpenACC, OpenMP, CUDA, or HIP. As a result, not many HPC application developers are investing in OpenCL anymore.</li><li>Compiling an application for OpenCL on an FPGA still requires going through the entire Xilinx or Altera toolchain. At present this is nowhere near as simple as f90 -fopenacc or nvcc, and compiling code that targets an FPGA can take orders of magnitude longer than it would for a CPU due to the NP-hard nature of placing and routing logic across all the programmable blocks.</li><li>The FPGA OpenCL stacks are not yet polished and scientist-friendly; performance analysis and debugging generally still have to be done at the circuit level, which is untenable for domain scientists.</li></ul>
<h2>Concluding Thoughts</h2>FPGAs aren’t the answer to HPC’s woes quite yet: as the blinking LED exercise shows, the gap between describing an algorithm and describing a circuit is still too wide for domain scientists to cross on their own. But with general-purpose cores no longer getting faster, the efforts to wrap FPGAs in friendlier programming models are well worth watching.
** This is not really true. Such a design would be limited by the number of physical pins coming out of the FPGA; in reality, output pins would have to be multiplexed, and additional logic to drive this multiplexing would take up FPGA real estate. But you get the point.