You don’t “write an FFT in C and compile it to the FPGA” – you build an FFT datapath (butterflies, memories, twiddle-factor multipliers) plus control logic, or you use the vendor’s FFT IP core.
I’ll give you both paths: (A) use IP and (B) build your own.
A. The practical way: use the vendor’s FFT IP core
If your goal is “I want an FFT working on FPGA”, not “I want to learn FFT architecture design in depth”, do this:
1. Decide your requirements
- FFT length: e.g. N = 256, 1024, 4096
- Data type: typically fixed-point, e.g. 16-bit real/imag
- Transform type:
  - Complex → complex
  - Real → complex (more efficient if the input is real-only)
- Throughput:
  - Streaming (one sample per clock)
  - Burst/block (feed an N-point block, wait, get N outputs)
2. In Xilinx / Intel tools
Xilinx (Vivado / Vitis HLS)
- Use the “Fast Fourier Transform (FFT)” IP core.
- Configure:
  - N, scaling mode (block floating point or fixed scaling)
  - Data width, twiddle width
  - Streaming vs burst
- It gives you:
  - AXI-Stream or simple handshake interfaces
  - An RTL wrapper you can drop into your design
Intel (Quartus)
- Use the “FFT MegaCore / FFT IP”.
- Same idea: configure params, generate, then connect to your logic.
3. Hook it up
- Input: your ADC / front-end data → format to fixed-point → feed into FFT IP.
- Output: FFT IP → magnitude/phase calculation or further processing → maybe DMA to RAM / CPU.
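The “format to fixed-point” step is just scaling and rounding; a minimal sketch in Python (NumPy) of a Q1.15 conversion, with illustrative helper names:

```python
import numpy as np

def to_q15(x):
    """Quantize floats in [-1, 1) to 16-bit signed Q1.15 integers."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0 - 2**-15)
    return np.round(x * 2**15).astype(np.int16)

def from_q15(q):
    """Convert Q1.15 integers back to floats (for checking results on the CPU side)."""
    return np.asarray(q, dtype=np.float64) / 2**15

codes = to_q15([0.5, -0.25, 0.999])   # e.g. ADC samples normalized to [-1, 1)
```

The inverse conversion is useful when reading FFT results back out for analysis.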
This is how most professional designs do FFTs in FPGAs unless there's a strong reason to go fully custom.
B. The educational way: build your own FFT in HDL
If you want to understand the architecture, here’s a high-level roadmap for a radix-2 FFT.
1. Choose fixed-point format
For example: Q1.15 (16-bit signed: 1 sign + 15 fractional bits)
- Input: 16-bit real (and optionally 16-bit imag)
- Internal: you may need more bits due to growth (e.g. 18–24 bits).
Decide how you handle growth:
- Scale down (right shift) at each stage,
- Or use block floating point,
- Or just give more headroom bits.
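As a sanity check on headroom: an unscaled radix-2 FFT grows by roughly one bit per stage, and vendor cores typically size unscaled outputs at about input width + log₂(N) + 1 bits. A quick estimate in Python:

```python
import math

def internal_width(n_fft, input_bits=16):
    """Worst-case internal width for an unscaled radix-2 FFT:
    roughly one growth bit per stage, plus one extra bit to cover
    per-component growth from the complex multiply."""
    stages = int(math.log2(n_fft))
    return input_bits + stages + 1

for n in (256, 1024, 4096):
    print(f"N={n}: {internal_width(n)} bits")
```

So a 1024-point unscaled FFT on 16-bit inputs needs on the order of 27 internal bits, which is why per-stage scaling or block floating point is so common.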
2. Algorithm: radix-2 DIF or DIT
DIT (Decimation in Time):
Input is consumed in bit-reversed order; output comes out in natural order.
DIF (Decimation in Frequency):
Input in natural order; output in bit-reversed order.
Pick one and stick with it. DIF is often convenient for streaming, but both work.
For an N-point FFT:
- Number of stages = log₂(N).
- Each stage performs N/2 butterfly operations (these can share one physical butterfly unit or be fully unrolled in hardware).
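This structure maps directly to a software golden model. A minimal radix-2 DIT FFT in Python (floating point; the stage loop and the N/2 butterflies per stage are explicit):

```python
import cmath

def bit_reverse(x):
    """Reorder a list of length 2^k into bit-reversed index order."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]

def fft_radix2_dit(x):
    """Iterative radix-2 DIT FFT: log2(N) stages, N/2 butterflies per stage."""
    a = bit_reverse(list(x))
    n = len(a)
    m = 2
    while m <= n:                          # one pass per stage
        wm = cmath.exp(-2j * cmath.pi / m)
        for k in range(0, n, m):           # each group of m points
            w = 1.0
            for j in range(m // 2):        # butterflies within the group
                t = w * a[k + j + m // 2]
                u = a[k + j]
                a[k + j] = u + t
                a[k + j + m // 2] = u - t
                w *= wm
        m *= 2
    return a
```

This is a useful reference to diff the HDL against before worrying about fixed-point effects.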
3. Design a butterfly module
Each radix-2 butterfly computes:
X_out0 = A + W·B
X_out1 = A − W·B
Where:
- A, B = complex inputs
- W = complex twiddle factor = cosθ – j sinθ
In HDL, conceptually:
module butterfly #(
    parameter W    = 16,   // input word length (e.g. Q1.15)
    parameter FRAC = 15    // fractional bits of the twiddle factor
)(
    input  signed [W-1:0] a_re, a_im,
    input  signed [W-1:0] b_re, b_im,
    input  signed [W-1:0] w_re, w_im,   // twiddle factor
    output signed [W:0]   y0_re, y0_im, // one growth bit from the add
    output signed [W:0]   y1_re, y1_im
);
    // complex multiplication: W * B (full-precision products)
    wire signed [2*W-1:0] mul_re = b_re * w_re - b_im * w_im;
    wire signed [2*W-1:0] mul_im = b_re * w_im + b_im * w_re;

    // scale the product back to W bits (arithmetic right shift)
    wire signed [W-1:0] mul_re_trunc = mul_re >>> FRAC;
    wire signed [W-1:0] mul_im_trunc = mul_im >>> FRAC;

    assign y0_re = a_re + mul_re_trunc;
    assign y0_im = a_im + mul_im_trunc;
    assign y1_re = a_re - mul_re_trunc;
    assign y1_im = a_im - mul_im_trunc;
endmodule
You’ll also likely:
- Pipeline these operations (registers between multiplies and adds).
- Right-shift to maintain scaling.
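A bit-accurate software model of the same butterfly is handy for checking the HDL later. A sketch in Python, assuming Q1.15 twiddles and an arithmetic right shift by 15 after the complex multiply (Python's `>>` on negative integers floors, matching Verilog's `>>>` on signed values):

```python
def q15_butterfly(a_re, a_im, b_re, b_im, w_re, w_im):
    """Bit-accurate radix-2 butterfly: (A + W*B, A - W*B).
    All arguments are Python ints holding Q1.15 codes; the
    32-bit product is shifted right by 15 to return to Q1.15."""
    mul_re = (b_re * w_re - b_im * w_im) >> 15
    mul_im = (b_re * w_im + b_im * w_re) >> 15
    return ((a_re + mul_re, a_im + mul_im),
            (a_re - mul_re, a_im - mul_im))
```

Feeding the same random vectors through this model and the RTL simulation should produce identical bit patterns, which makes debugging far easier than comparing against floating point.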
4. Twiddle factors
You need twiddle factors for each stage & butterfly position:
Implementation options:
1. Lookup ROM:
- Precompute cos/sin values offline (Python/MATLAB).
- Store them as fixed-point in a ROM (BRAM).
- Address ROM based on stage and butterfly index.
2. CORDIC:
- Use a CORDIC block to generate sin/cos on the fly.
- Less ROM, more logic and latency.
For a first design, ROM is simpler and very common.
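A typical offline ROM-generation script looks like this (Python; Q1.15 with +1.0 saturated to 0x7FFF is the usual convention, and the output file name is illustrative):

```python
import math

def twiddle_rom(n_fft, frac_bits=15):
    """W_N^k = cos(2*pi*k/N) - j*sin(2*pi*k/N) for k = 0 .. N/2-1,
    quantized to signed fixed point with `frac_bits` fractional bits."""
    scale = 2**frac_bits
    max_code = scale - 1                  # +1.0 is not representable; saturate
    rom = []
    for k in range(n_fft // 2):
        theta = 2 * math.pi * k / n_fft
        re = min(round(math.cos(theta) * scale), max_code)
        im = min(round(-math.sin(theta) * scale), max_code)
        rom.append((re, im))
    return rom

# e.g. write a $readmemh-style file for the HDL:
# with open("twiddles.mem", "w") as f:
#     for re, im in twiddle_rom(1024):
#         f.write(f"{re & 0xFFFF:04x}{im & 0xFFFF:04x}\n")
```

Only N/2 twiddles are needed for a radix-2 FFT; symmetry can shrink the ROM further at the cost of a little address logic.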
5. Memory and data flow
You need a way to feed pairs (A, B) in the right order for each stage.
Two classic architectures:
(a) Memory-based (iterative) FFT
- Single or a few butterfly units.
- You store the data in RAM (BRAM).
- For each stage:
  - Read two points A and B, perform the butterfly, write the results back.
  - Address-generation logic handles the bit patterns and strides.
Pros:
- Uses fewer resources (reuses butterfly).
- Good when throughput is moderate.
Cons:
- More complex control (address generation).
- Latency on the order of (N/2)·log₂N cycles (one butterfly per cycle).
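The address generation is the tricky part, so it is worth prototyping in software first. For an in-place radix-2 DIF FFT, one stage's read/write addresses and twiddle indices can be sketched as:

```python
def dif_addresses(n_fft, stage):
    """Yield (addr_a, addr_b, twiddle_index) for every butterfly in
    one stage of an in-place radix-2 DIF FFT (stage 0 = first stage).
    The A/B stride starts at N/2 and halves each stage, while the
    twiddle step doubles: stage s uses W_N^(j << s)."""
    half = n_fft >> (stage + 1)           # stride between A and B
    for group in range(0, n_fft, 2 * half):
        for j in range(half):
            yield group + j, group + j + half, j << stage
```

Driving these addresses for stage = 0 .. log₂(N)−1 through a butterfly plus twiddle ROM gives the complete memory-based FFT; the total butterfly count is (N/2)·log₂N.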
(b) Pipelined streaming FFT (SDF / MDC)
- Data flows through a chain of stages.
- Each stage consists of:
  - Delay lines (FIFOs)
  - One butterfly
  - A twiddle multiplier, etc.
Can achieve one sample per clock after pipeline fill.
Pros:
- Very high throughput.
- Natural streaming interface.
Cons:
- Fixed N, more resources, more design effort.
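The delay-feedback dataflow of one radix-2 SDF stage can be modeled in a few lines of Python (twiddle multiply omitted; the first `delay_len` outputs are pipeline fill):

```python
def r2sdf_stage(samples, delay_len):
    """Model one radix-2 single-path delay-feedback (SDF) stage.
    Within each block of 2*delay_len samples: during the first half,
    inputs are shifted into the FIFO while its old contents drain out;
    during the second half the butterfly emits a+b immediately and
    feeds a-b back into the FIFO for the next fill phase."""
    fifo = [0] * delay_len
    out = []
    for t, x in enumerate(samples):
        if (t // delay_len) % 2 == 0:     # fill phase
            fifo.append(x)
            out.append(fifo.pop(0))
        else:                             # butterfly phase
            a = fifo.pop(0)
            out.append(a + x)             # sum goes downstream now
            fifo.append(a - x)            # difference is delayed
    return out
```

A full pipelined FFT chains log₂(N) such stages with halving delay lengths and a twiddle multiplier between stages; this toy model just shows why one butterfly per stage sustains one sample per clock.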
For learning, a memory-based radix-2 is a good starting point. For real products, people often use pipelined FFT IP cores.
6. Control logic
You’ll need:
A global state machine:
- Stage counter (0..log₂N–1),
- Butterfly index within the stage,
- Address generation for RAM reads/writes.
Optional:
- start, busy, done signals.
- Valid/ready handshaking if using streaming.
7. Verification
Before going to hardware:
- Write a testbench:
  - Feed known input vectors (e.g. impulse, single-tone sinusoid).
  - Use a reference FFT from Python/MATLAB/NumPy.
  - Compare FPGA output vs reference (allowing for fixed-point quantization error).
Check:
- Spectral shape (sinusoid → single bin).
- DC offset behavior.
- Scaling (is magnitude roughly correct?).
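A minimal software version of such a check, comparing a quantized single tone against NumPy's FFT (the tolerance is illustrative and depends on your word lengths and scaling scheme):

```python
import numpy as np

N = 256
n = np.arange(N)
tone = 0.5 * np.cos(2 * np.pi * 16 * n / N)     # single tone in bin 16

# quantize to Q1.15 and back, emulating the FPGA input path
q = np.round(tone * 2**15).astype(np.int16) / 2**15

ref = np.fft.fft(tone)
dut = np.fft.fft(q)            # stand-in for captured FPGA output

# spectral shape: the positive-frequency peak should land in bin 16
peak = int(np.argmax(np.abs(dut[:N // 2])))
assert peak == 16

# quantization error should be small relative to the peak magnitude
err = np.max(np.abs(dut - ref)) / np.max(np.abs(ref))
assert err < 1e-3
```

Swapping `dut` for the values captured from simulation (or hardware via DMA) turns this into the real pass/fail check.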
C. When to go IP vs custom
Use IP when:
- You mainly care about “it works” and can tweak parameters.
- Project is real-world / time-sensitive.
Build custom when:
- You want to learn FPGA DSP architectures.
- You need a very specific architecture (e.g., ultra-low area, strange word lengths, partial FFT).