DEV Community

Hedy


How to implement FFT algorithm in FPGA?

You don’t “write an FFT in C and compile it to the FPGA” – you build an FFT datapath (butterflies, memories, twiddle-factor multipliers) plus control logic, or you use the vendor’s FFT IP core.

I’ll give you both paths: (A) use IP and (B) build your own.

A. The practical way: use the vendor’s FFT IP core

If your goal is “I want an FFT working on FPGA”, not “I want to learn FFT architecture design in depth”, do this:

1. Decide your requirements

  • FFT length: e.g. N = 256, 1024, 4096
  • Data type: typically fixed-point, e.g. 16-bit real/imag
  • Transform type:

    • Complex → complex
    • Real → complex (more efficient if input is real only)
  • Throughput:

    • Streaming (one sample per clock)
    • Burst/block (feed an N-point block, wait, get N outputs)

2. In Xilinx / Intel tools

Xilinx (Vivado / Vitis HLS)

  • Use the “Fast Fourier Transform (FFT)” LogiCORE IP.
  • Configure:

    • N, scaling mode (block floating point or fixed scaling)
    • Data width, twiddle width
    • Streaming vs burst
  • It gives you:

    • AXI-Stream or simple handshake interfaces
    • RTL wrapper you can drop into your design

Intel (Quartus)

  • Use the “FFT Intel FPGA IP” (formerly FFT MegaCore).
  • Same idea: configure params, generate, then connect to your logic.

3. Hook it up

  • Input: your ADC / front-end data → format to fixed-point → feed into FFT IP.
  • Output: FFT IP → magnitude/phase calculation or further processing → maybe DMA to RAM / CPU.

This is how most professional designs do FFTs in FPGAs unless there's a strong reason to go fully custom.

B. The educational way: build your own FFT in HDL

If you want to understand the architecture, here’s a high-level roadmap for a radix-2 FFT.

1. Choose fixed-point format

For example: Q1.15 (16-bit signed: 1 sign + 15 fractional bits)

  • Input: 16-bit real (and optionally 16-bit imag)
  • Internal: you may need more bits due to growth (e.g. 18–24 bits).

Decide how you handle growth:

  • Scale down (right shift) at each stage,
  • Or use block floating point,
  • Or just give more headroom bits.
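To make the Q1.15 choice concrete, here is a small Python helper pair (function names are mine) showing the quantization you'd apply at the FPGA boundary:

```python
def to_q15(x):
    """Quantize a float in [-1.0, 1.0) to a 16-bit signed Q1.15 integer,
    saturating at the representable range."""
    return max(-32768, min(32767, int(round(x * 32768))))

def from_q15(n):
    """Convert a Q1.15 integer back to float."""
    return n / 32768.0
```

For example, `to_q15(0.5)` gives 16384, and `to_q15(1.0)` saturates to 32767 because +1.0 is not representable in Q1.15 — exactly the headroom issue the scaling options above address.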

2. Algorithm: radix-2 DIF or DIT

DIT (Decimation in Time):

Input must be reordered into bit-reversed order; the butterflies then produce outputs in natural order.

DIF (Decimation in Frequency):

Input natural, output bit-reversed.

Pick one and stick with it. DIF is often convenient for streaming, but both work.

For an N-point FFT:

  • Number of stages = log₂(N).
  • Each stage performs N/2 butterfly operations.
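Before writing any HDL, it helps to have a bit-exact picture of the stage/butterfly structure. A floating-point reference model of the iterative radix-2 DIT FFT (function names are mine) looks like this:

```python
import cmath

def bit_reverse(x):
    """Reorder a list (length a power of 2) into bit-reversed index order."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]

def fft_radix2(x):
    """Iterative radix-2 DIT FFT: bit-reversed input, natural-order output.
    log2(N) stages, N/2 butterflies per stage."""
    a = bit_reverse(list(x))
    n = len(a)
    m = 2                                  # current group size
    while m <= n:
        w_m = cmath.exp(-2j * cmath.pi / m)
        for k in range(0, n, m):           # each group in this stage
            w = 1.0
            for j in range(m // 2):        # each butterfly in the group
                t = w * a[k + j + m // 2]  # twiddle multiply: W * B
                u = a[k + j]               # A
                a[k + j] = u + t           # butterfly outputs
                a[k + j + m // 2] = u - t
                w *= w_m
        m *= 2
    return a
```

Running `fft_radix2([1, 0, 0, 0])` returns four ones (the FFT of an impulse is flat), which is a quick sanity check before moving to fixed point.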

3. Design a butterfly module

Each radix-2 butterfly computes:

Y₀ = A + W·B
Y₁ = A − W·B

Where:

  • A, B = complex inputs
  • W = complex twiddle factor = e^(−jθ) = cos θ − j·sin θ, with θ = 2πk/N

In HDL, conceptually:

```verilog
module butterfly #(
    parameter W = 16
) (
    input  signed [W-1:0] a_re, a_im,
    input  signed [W-1:0] b_re, b_im,
    input  signed [W-1:0] w_re, w_im,   // twiddle factor, Q1.(W-1)
    output signed [W:0]   y0_re, y0_im, // one extra bit for the add
    output signed [W:0]   y1_re, y1_im
);
    // complex multiplication: W * B, kept at full precision
    wire signed [2*W-1:0] mul_re = b_re * w_re - b_im * w_im;
    wire signed [2*W-1:0] mul_im = b_re * w_im + b_im * w_re;

    // scale back to W bits: arithmetic shift drops the extra fractional bits
    wire signed [W-1:0] mul_re_trunc = mul_re >>> (W - 1);
    wire signed [W-1:0] mul_im_trunc = mul_im >>> (W - 1);

    assign y0_re = a_re + mul_re_trunc;
    assign y0_im = a_im + mul_im_trunc;

    assign y1_re = a_re - mul_re_trunc;
    assign y1_im = a_im - mul_im_trunc;
endmodule
```

You’ll also likely:

  • Pipeline these operations (registers between multiplies and adds).
  • Right-shift to maintain scaling.
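A bit-accurate Python model of the butterfly (including the truncation step) is useful as a golden reference when you later simulate the RTL. This sketch assumes Q1.15 inputs and arithmetic-shift truncation, matching the module above:

```python
def butterfly(a_re, a_im, b_re, b_im, w_re, w_im):
    """Bit-accurate model of a radix-2 butterfly with Q1.15 inputs.
    Python's >> on negative ints is an arithmetic shift, matching
    Verilog's >>> on signed values. Outputs grow by one bit."""
    # complex multiply W * B at full precision, then truncate to Q1.15
    mul_re = (b_re * w_re - b_im * w_im) >> 15
    mul_im = (b_re * w_im + b_im * w_re) >> 15
    y0 = (a_re + mul_re, a_im + mul_im)   # A + W*B
    y1 = (a_re - mul_re, a_im - mul_im)   # A - W*B
    return y0, y1
```

With w = (32767, 0) (≈ +1.0 in Q1.15) and b = (10000, 0), the truncated product comes out as 9999 rather than 10000 — exactly the kind of one-LSB quantization effect your verification tolerance has to allow for.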

4. Twiddle factors

You need twiddle factors for each stage & butterfly position:

Implementation options:

1. Lookup ROM:

  • Precompute cos/sin values offline (Python/MATLAB).
  • Store them as fixed-point in a ROM (BRAM).
  • Address ROM based on stage and butterfly index.

2. CORDIC:

  • Use a CORDIC block to generate sin/cos on the fly.
  • Less ROM, more logic and latency.

For a first design, ROM is simpler and very common.
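The offline precompute the ROM option describes can be done with a few lines of Python (function name is mine); the resulting integer pairs go straight into a BRAM init file:

```python
import math

def twiddle_rom(n, width=16):
    """Generate the N/2 twiddle factors W_N^k = cos(theta) - j*sin(theta),
    theta = 2*pi*k/N, quantized to Q1.(width-1) fixed point."""
    scale = (1 << (width - 1)) - 1   # 32767 for 16 bits, so +1.0 is representable
    rom = []
    for k in range(n // 2):
        theta = 2 * math.pi * k / n
        rom.append((round(scale * math.cos(theta)),
                    round(-scale * math.sin(theta))))
    return rom
```

For n = 8 this yields entries such as (32767, 0) for k = 0 and (0, −32767) for k = 2, i.e. W⁰ = 1 and W² = −j in Q1.15.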

5. Memory and data flow

You need a way to feed pairs (A, B) in the right order for each stage.

Two classic architectures:

(a) Memory-based (iterative) FFT

Single or few butterfly units.

  • You store the data in RAM (BRAM).
  • For each stage:

    • You read two points A and B, perform butterfly, write results back.
    • The address generation logic handles the bit patterns and strides.

Pros:

  • Uses fewer resources (reuses butterfly).
  • Good when throughput is moderate.

Cons:

  • More complex control (address generation).
  • Latency on the order of (N/2)·log₂N cycles with a single butterfly doing one operation per clock.
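The address-generation logic is the hard part of the memory-based approach, so it pays to prototype it in software first. A sketch for radix-2 DIF ordering (function name is mine, one possible in-place addressing scheme):

```python
def fft_addresses(n):
    """Yield (stage, addr_a, addr_b, twiddle_index) for a memory-based
    radix-2 DIF FFT: each stage reads a pair A/B from RAM, butterflies
    it, and writes the results back in place."""
    stages = n.bit_length() - 1
    for s in range(stages):
        m = n >> s                 # group size shrinks each stage: N, N/2, ...
        half = m // 2              # stride between the A and B addresses
        for g in range(0, n, m):   # each group in this stage
            for j in range(half):  # each butterfly in the group
                yield (s, g + j, g + j + half, j << s)
```

For n = 8 this produces 12 butterfly operations (3 stages × 4 pairs), starting with addresses (0, 4) at twiddle index 0 — the same sequence your RTL address counters would have to generate.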

(b) Pipelined streaming FFT (SDF / MDC)

  • Data flows through a chain of stages.
  • Each stage consists of:

    • Delay lines (FIFOs)
    • One butterfly
    • Twiddle multiply, etc.
  • Can achieve one sample per clock after pipeline fill.

Pros:

  • Very high throughput.
  • Natural streaming interface.

Cons:

Fixed N, more resources, more design effort.

For learning, a memory-based radix-2 is a good starting point. For real products, people often use pipelined FFT IP cores.

6. Control logic

You’ll need:

A global state machine:

  • Stage counter (0..log₂N–1),
  • Butterfly index within the stage,
  • Address generation for RAM reads/writes.

Optional:

  • start, busy, done signals.
  • Valid/ready handshaking if using streaming.

7. Verification

Before going to hardware:

  1. Write a testbench:
  • Feed known input vectors (e.g. impulse, single-tone sinusoid).
  • Use a reference FFT from Python/MATLAB/NumPy.
  2. Compare FPGA output vs reference (allowing for fixed-point quantization error).
  3. Check:
  • Spectral shape (sinusoid → single bin).
  • DC offset behavior.
  • Scaling (is magnitude roughly correct?).
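The single-tone check can be sketched with the standard library alone (function names are mine; in a real flow you'd compare RTL simulation output against numpy.fft.fft with a quantization tolerance):

```python
import cmath

def dft_reference(x):
    """Direct O(N^2) DFT, used as the golden reference in the testbench."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def check_tone(n=16, bin_k=3):
    """Single-tone test: a complex sinusoid at bin k should put all its
    energy in output bin k, with magnitude N (unscaled FFT)."""
    x = [cmath.exp(2j * cmath.pi * bin_k * t / n) for t in range(n)]
    mags = [abs(v) for v in dft_reference(x)]
    peak = max(range(n), key=lambda i: mags[i])
    assert peak == bin_k
    assert abs(mags[bin_k] - n) < 1e-6   # unscaled FFT: peak magnitude == N
    return True
```

The magnitude assertion also answers the scaling question above: if your hardware right-shifts at each stage, the expected peak is N divided by the total scaling, not N.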

C. When to go IP vs custom

Use IP when:

  • You mainly care about “it works” and can tweak parameters.
  • Project is real-world / time-sensitive.

Build custom when:

  • You want to learn FPGA DSP architectures.
  • You need a very specific architecture (e.g., ultra-low area, strange word lengths, partial FFT).
