Firmware Development

Making an FFT Block in Vitis HLS – Part 2

Making a bigger FFT

In our last post we made an HLS project with the top function “fftTop”. In this article I’d like to change a few things around. First, let’s make the FFT much bigger. We are using an FPGA for this, so a larger transform won’t bog down the processor at all. I want to see how many resources will be used if I go for broke and make it as big as possible. The Xilinx FFT core is limited to 65536 samples. The size is set in param1, which we defined in fftTop.h. Previously it looked like this:

struct param1 : hls::ip_fft::params_t {
// static const unsigned ordering_opt = hls::ip_fft::natural_order;
static const unsigned max_nfft = 10;

Our new header file looks like this:

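#ifndef FFT_TOP_H
#define FFT_TOP_H
#include "hls_fft.h"
#include <hls_stream.h>
#include <ap_fixed.h>
#include <complex>

#define FFT_INPUT_WIDTH 16
#define FFT_LENGTH 65536

// FFT core configuration; the transform length is 2^max_nfft samples.
struct param1 : hls::ip_fft::params_t {
    // static const unsigned ordering_opt = hls::ip_fft::natural_order;
    static const unsigned max_nfft = 16;
};

typedef hls::ip_fft::config_t<param1> config_t;
typedef ap_fixed<FFT_INPUT_WIDTH,1> data_in_t;
// Assumption: data_out_t is not shown in this listing. With the default
// scaled arithmetic the output uses the same fixed-point format as the input.
typedef ap_fixed<FFT_INPUT_WIDTH,1> data_out_t;
typedef hls::x_complex<data_in_t> cmpxDataIn;
typedef hls::x_complex<data_out_t> cmpxDataOut;
typedef hls::stream<cmpxDataIn> cmpxDataInStream;
typedef hls::stream<cmpxDataOut> cmpxDataOutStream;
typedef hls::ip_fft::status_t<param1> status_t;

#endif // FFT_TOP_H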


The max_nfft value sets our number of samples: the number of samples per transform is 2^max_nfft. Right now it’s 2^10, or 1024. Let’s change it to 16 and have 2^16 = 65536 samples. There is a scaling schedule to set as well; we can revisit that later. I have constants in the header file with conservative values for 1024 and 65536. We’ll change one line of code in fftTop.cpp as well.
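As a quick sanity check on the relationship between max_nfft and the transform length, here is a minimal sketch in plain C++ (no HLS headers needed). The helper name fftLength is just for illustration and is not part of the Xilinx library.

```cpp
#include <cstdint>

// Transform length for a given max_nfft: N = 2^max_nfft.
// fftLength is a hypothetical helper, not part of hls_fft.h.
constexpr std::uint32_t fftLength(unsigned maxNfft) {
    return std::uint32_t{1} << maxNfft;
}

// Checked at compile time: the old and new sizes from this article.
static_assert(fftLength(10) == 1024, "max_nfft = 10 gives 1024 samples");
static_assert(fftLength(16) == 65536, "max_nfft = 16 gives 65536 samples");
```

A static_assert like this could also live next to FFT_LENGTH in the header, so the two values can’t silently drift apart.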


Current (Bad) Synthesis Results

If you look at our BRAM usage, it’s 265% of our available BRAMs. That is 165 percentage points too much! But the FFT core itself, “fft_param1_s”, is only using 51 percent. Let’s scroll down a little and see where the rest went.

inArray and outArray are being created in block RAM. Together they use 256 block RAMs, far more than our lowly Zynq 7010 has. Furthermore, this is a huge waste of memory, since these values don’t need to be buffered. The Xilinx FFT core uses streams natively, and the HLS fft() function is just a wrapper for the FFT core. Since Vitis HLS 2022.1 we are able to use streams to interface with the fft() function. To do this we just need to create a new stream of type hls::stream<config_t> and place our config into it. Then we can tie the in and out arguments straight into our hls::fft() call. Our fftTop.cpp now looks like this:
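To see where a figure like 256 block RAMs can come from, here is a rough back-of-the-envelope sketch. It assumes each 18K block RAM is used as 512 words of 36 bits, and that each complex sample packs into one 32-bit word (16 bits real, 16 bits imaginary). The tool may choose a different layout, so treat this as an estimate rather than what HLS actually does; the function name is made up for illustration.

```cpp
#include <cstdint>

// Assumed geometry of one 18K block RAM: 512 words x 36 bits.
constexpr std::uint32_t kWordsPerBram18 = 512;
constexpr std::uint32_t kBitsPerBramWord = 36;

// Estimated number of 18K block RAMs needed for one array
// (hypothetical helper, not a Vitis HLS API).
constexpr std::uint32_t bram18ForArray(std::uint32_t depth, std::uint32_t wordBits) {
    // BRAMs placed side by side to cover the word width...
    const std::uint32_t wide = (wordBits + kBitsPerBramWord - 1) / kBitsPerBramWord;
    // ...times BRAMs stacked to cover the depth.
    const std::uint32_t deep = (depth + kWordsPerBram18 - 1) / kWordsPerBram18;
    return wide * deep;
}

// One buffer of 65536 complex samples, 32 bits each.
constexpr std::uint32_t kPerArray = bram18ForArray(65536, 32);
// inArray plus outArray.
constexpr std::uint32_t kBothArrays = 2 * kPerArray;
```

Under these assumptions each array needs 128 block RAMs, so the two together need 256, which matches the number in the report.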

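#include "fftTop.h"
#include <iostream>

void fftTop(cmpxDataInStream &in, cmpxDataOutStream &out, bool direction, bool &ovflo)
{
    config_t config;
    hls::stream<config_t> configStream;
    status_t status;
    hls::stream<status_t> statusStream;

    // Select a forward or inverse transform from the direction argument.
    config.setDir(direction);

    // Hand the configuration to the core through its stream, then connect
    // the input and output streams directly to the FFT.
    configStream << config;
    hls::fft<param1>(in, out, statusStream, configStream);

    // status and ovflo are not used yet (see Next Steps).
}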

Let’s have a look at our Synthesis results.

Much better. This will now easily fit on the FPGA. Granted, we are using 79 percent of the available LUTs, so we may later decide to make this FFT smaller again. But now we aren’t wasting resources on storing the input and output before sending it, and we aren’t wasting clock cycles queuing up data.

Next Steps

We are missing several important things still. 

  • We aren’t looking at the status and outputting our overflow bit
  • We aren’t using AXI-Stream input and output on our function, which is important if we want to link it to the AXI-DMA IP core later (we do)
  • This is just a nice-to-have, but I want the ONLY interface to this block to be the AXI-Stream. That means I’d like to extract the direction from a header word on the input side, and add a final footer word to my output containing the overflow boolean. We will also have to change the interface to our block to be AP_NONE.
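To make the header/footer idea in the last bullet a bit more concrete, here is a rough software-only sketch using std::queue as a stand-in for the streams. Everything in it is an assumption for illustration: the word layout (bit 0 of the header is the direction, bit 0 of the footer is the overflow flag), the function names, and the fact that the samples are simply passed through where the FFT would go. It is not the final HLS code.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical framing: the first input word is a header whose bit 0
// holds the direction; the last output word is a footer whose bit 0
// holds the overflow flag.

// Pop the header word and return the direction bit.
bool readDirection(std::queue<std::uint32_t> &in) {
    const std::uint32_t header = in.front();
    in.pop();
    return (header & 1u) != 0;
}

// Append the footer word carrying the overflow flag.
void writeFooter(std::queue<std::uint32_t> &out, bool overflow) {
    out.push(overflow ? 1u : 0u);
}

// Model of one frame: strip the header, forward the samples untouched
// (a placeholder for the FFT), then append the footer.
std::queue<std::uint32_t> processFrame(std::queue<std::uint32_t> in,
                                       bool overflow, bool &direction) {
    direction = readDirection(in);
    std::queue<std::uint32_t> out;
    while (!in.empty()) {
        out.push(in.front());
        in.pop();
    }
    writeFooter(out, overflow);
    return out;
}
```

With this layout the frame keeps the same word count going in and coming out (header plus N samples in, N samples plus footer out), which may be convenient when sizing DMA transfers.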