You don’t “write an FFT in C and compile it to the FPGA” – you build an FFT datapath (butterflies, memories, twiddle-factor multipliers) plus control logic, or you use the vendor’s FFT IP core.
I’ll give you both paths: (A) use IP and (B) build your own.
A. The practical way: use the vendor’s FFT IP core
If your goal is “I want an FFT working on FPGA”, not “I want to learn FFT architecture design in depth”, do this:
1. Decide your requirements
- FFT length: e.g. N = 256, 1024, 4096
- Data type: typically fixed-point, e.g. 16-bit real/imag
- Transform type:
  - Complex → complex
  - Real → complex (more efficient if the input is real-only)
- Throughput:
  - Streaming (one sample per clock)
  - Burst/block (feed an N-point block, wait, get N outputs)
2. In Xilinx / Intel tools
Xilinx (Vivado / Vitis HLS)
- Use the “Fast Fourier Transform (FFT)” IP core.
- Configure:
  - N, scaling mode (block floating point or fixed scaling)
  - Data width, twiddle width
  - Streaming vs burst
- It gives you:
  - AXI-Stream or simple handshake interfaces
  - An RTL wrapper you can drop into your design
Intel (Quartus)
- Use the “FFT MegaCore / FFT IP”.
- Same idea: configure params, generate, then connect to your logic.
3. Hook it up
- Input: your ADC / front-end data → format to fixed-point → feed into FFT IP.
- Output: FFT IP → magnitude/phase calculation or further processing → maybe DMA to RAM / CPU.
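The “format to fixed-point” step is just scaling and rounding; a minimal sketch in Python (NumPy) of a Q1.15 conversion, with illustrative helper names:

```python
import numpy as np

def to_q15(x):
    """Quantize floats in [-1, 1) to 16-bit signed Q1.15 integers."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0 - 2**-15)
    return np.round(x * 2**15).astype(np.int16)

def from_q15(q):
    """Convert Q1.15 integers back to floats (for checking results on the CPU side)."""
    return np.asarray(q, dtype=np.float64) / 2**15

codes = to_q15([0.5, -0.25, 0.999])   # e.g. ADC samples normalized to [-1, 1)
```

The inverse conversion is useful when reading FFT results back out for analysis.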
This is how most professional designs do FFTs in FPGAs unless there's a strong reason to go fully custom.
B. The educational way: build your own FFT in HDL
If you want to understand the architecture, here’s a high-level roadmap for a radix-2 FFT.
1. Choose fixed-point format
For example: Q1.15 (16-bit signed: 1 sign + 15 fractional bits)
- Input: 16-bit real (and optionally 16-bit imag)
- Internal: you may need more bits due to growth (e.g. 18–24 bits).
Decide how you handle growth:
- Scale down (right shift) at each stage,
- Or use block floating point,
- Or just give more headroom bits.
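As a sanity check on headroom: an unscaled radix-2 FFT grows by roughly one bit per stage, and vendor cores typically size unscaled outputs at about input width + log₂(N) + 1 bits. A quick estimate in Python:

```python
import math

def internal_width(n_fft, input_bits=16):
    """Worst-case internal width for an unscaled radix-2 FFT:
    roughly one growth bit per stage, plus one extra bit to cover
    per-component growth from the complex multiply."""
    stages = int(math.log2(n_fft))
    return input_bits + stages + 1

for n in (256, 1024, 4096):
    print(f"N={n}: {internal_width(n)} bits")
```

So a 1024-point unscaled FFT on 16-bit inputs needs on the order of 27 internal bits, which is why per-stage scaling or block floating point is so common.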
2. Algorithm: radix-2 DIF or DIT
DIT (Decimation in Time):
Input is consumed in bit-reversed order; output comes out in natural order.
DIF (Decimation in Frequency):
Input in natural order; output in bit-reversed order.
Pick one and stick with it. DIF is often convenient for streaming, but both work.
For an N-point FFT:
- Number of stages = log₂(N).
- Each stage performs N/2 butterfly operations (these can share one physical butterfly unit or be fully unrolled in hardware).
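This structure maps directly to a software golden model. A minimal radix-2 DIT FFT in Python (floating point; the stage loop and the N/2 butterflies per stage are explicit):

```python
import cmath

def bit_reverse(x):
    """Reorder a list of length 2^k into bit-reversed index order."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]

def fft_radix2_dit(x):
    """Iterative radix-2 DIT FFT: log2(N) stages, N/2 butterflies per stage."""
    a = bit_reverse(list(x))
    n = len(a)
    m = 2
    while m <= n:                          # one pass per stage
        wm = cmath.exp(-2j * cmath.pi / m)
        for k in range(0, n, m):           # each group of m points
            w = 1.0
            for j in range(m // 2):        # butterflies within the group
                t = w * a[k + j + m // 2]
                u = a[k + j]
                a[k + j] = u + t
                a[k + j + m // 2] = u - t
                w *= wm
        m *= 2
    return a
```

This is a useful reference to diff the HDL against before worrying about fixed-point effects.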
3. Design a butterfly module
Each radix-2 butterfly computes:
X_out0 = A + W·B
X_out1 = A − W·B
Where:
- A, B = complex inputs
- W = complex twiddle factor = cosθ – j sinθ
In HDL, conceptually:
module butterfly #(
    parameter W    = 16,   // input word length (e.g. Q1.15)
    parameter FRAC = 15    // fractional bits of the twiddle factor
)(
    input  signed [W-1:0] a_re, a_im,
    input  signed [W-1:0] b_re, b_im,
    input  signed [W-1:0] w_re, w_im,   // twiddle factor
    output signed [W:0]   y0_re, y0_im, // one growth bit from the add
    output signed [W:0]   y1_re, y1_im
);
    // complex multiplication: W * B (full-precision products)
    wire signed [2*W-1:0] mul_re = b_re * w_re - b_im * w_im;
    wire signed [2*W-1:0] mul_im = b_re * w_im + b_im * w_re;

    // scale the product back to W bits (arithmetic right shift)
    wire signed [W-1:0] mul_re_trunc = mul_re >>> FRAC;
    wire signed [W-1:0] mul_im_trunc = mul_im >>> FRAC;

    assign y0_re = a_re + mul_re_trunc;
    assign y0_im = a_im + mul_im_trunc;
    assign y1_re = a_re - mul_re_trunc;
    assign y1_im = a_im - mul_im_trunc;
endmodule
You’ll also likely:
- Pipeline these operations (registers between multiplies and adds).
- Right-shift to maintain scaling.
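A bit-accurate software model of the same butterfly is handy for checking the HDL later. A sketch in Python, assuming Q1.15 twiddles and an arithmetic right shift by 15 after the complex multiply (Python's `>>` on negative integers floors, matching Verilog's `>>>` on signed values):

```python
def q15_butterfly(a_re, a_im, b_re, b_im, w_re, w_im):
    """Bit-accurate radix-2 butterfly: (A + W*B, A - W*B).
    All arguments are Python ints holding Q1.15 codes; the
    32-bit product is shifted right by 15 to return to Q1.15."""
    mul_re = (b_re * w_re - b_im * w_im) >> 15
    mul_im = (b_re * w_im + b_im * w_re) >> 15
    return ((a_re + mul_re, a_im + mul_im),
            (a_re - mul_re, a_im - mul_im))
```

Feeding the same random vectors through this model and the RTL simulation should produce identical bit patterns, which makes debugging far easier than comparing against floating point.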
4. Twiddle factors
You need twiddle factors for each stage & butterfly position:
Implementation options:
1. Lookup ROM:
- Precompute cos/sin values offline (Python/MATLAB).
- Store them as fixed-point in a ROM (BRAM).
- Address ROM based on stage and butterfly index.
2. CORDIC:
- Use a CORDIC block to generate sin/cos on the fly.
- Less ROM, more logic and latency.
For a first design, ROM is simpler and very common.
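A typical offline ROM-generation script looks like this (Python; Q1.15 with +1.0 saturated to 0x7FFF is the usual convention, and the output file name is illustrative):

```python
import math

def twiddle_rom(n_fft, frac_bits=15):
    """W_N^k = cos(2*pi*k/N) - j*sin(2*pi*k/N) for k = 0 .. N/2-1,
    quantized to signed fixed point with `frac_bits` fractional bits."""
    scale = 2**frac_bits
    max_code = scale - 1                  # +1.0 is not representable; saturate
    rom = []
    for k in range(n_fft // 2):
        theta = 2 * math.pi * k / n_fft
        re = min(round(math.cos(theta) * scale), max_code)
        im = min(round(-math.sin(theta) * scale), max_code)
        rom.append((re, im))
    return rom

# e.g. write a $readmemh-style file for the HDL:
# with open("twiddles.mem", "w") as f:
#     for re, im in twiddle_rom(1024):
#         f.write(f"{re & 0xFFFF:04x}{im & 0xFFFF:04x}\n")
```

Only N/2 twiddles are needed for a radix-2 FFT; symmetry can shrink the ROM further at the cost of a little address logic.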
5. Memory and data flow
You need a way to feed pairs (A, B) in the right order for each stage.
Two classic architectures:
(a) Memory-based (iterative) FFT
- Single or a few butterfly units.
- You store the data in RAM (BRAM).
- For each stage:
  - Read two points A and B, perform the butterfly, write the results back.
  - Address-generation logic handles the bit patterns and strides.
Pros:
- Uses fewer resources (reuses butterfly).
- Good when throughput is moderate.
Cons:
- More complex control (address generation).
- Latency on the order of (N/2)·log₂N cycles (one butterfly per cycle).
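The address generation is the tricky part, so it is worth prototyping in software first. For an in-place radix-2 DIF FFT, one stage's read/write addresses and twiddle indices can be sketched as:

```python
def dif_addresses(n_fft, stage):
    """Yield (addr_a, addr_b, twiddle_index) for every butterfly in
    one stage of an in-place radix-2 DIF FFT (stage 0 = first stage).
    The A/B stride starts at N/2 and halves each stage, while the
    twiddle step doubles: stage s uses W_N^(j << s)."""
    half = n_fft >> (stage + 1)           # stride between A and B
    for group in range(0, n_fft, 2 * half):
        for j in range(half):
            yield group + j, group + j + half, j << stage
```

Driving these addresses for stage = 0 .. log₂(N)−1 through a butterfly plus twiddle ROM gives the complete memory-based FFT; the total butterfly count is (N/2)·log₂N.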
(b) Pipelined streaming FFT (SDF / MDC)
- Data flows through a chain of stages.
- Each stage consists of:
  - Delay lines (FIFOs)
  - One butterfly
  - A twiddle multiplier, etc.
Can achieve one sample per clock after pipeline fill.
Pros:
- Very high throughput.
- Natural streaming interface.
Cons:
- Fixed N, more resources, more design effort.
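The delay-feedback dataflow of one radix-2 SDF stage can be modeled in a few lines of Python (twiddle multiply omitted; the first `delay_len` outputs are pipeline fill):

```python
def r2sdf_stage(samples, delay_len):
    """Model one radix-2 single-path delay-feedback (SDF) stage.
    Within each block of 2*delay_len samples: during the first half,
    inputs are shifted into the FIFO while its old contents drain out;
    during the second half the butterfly emits a+b immediately and
    feeds a-b back into the FIFO for the next fill phase."""
    fifo = [0] * delay_len
    out = []
    for t, x in enumerate(samples):
        if (t // delay_len) % 2 == 0:     # fill phase
            fifo.append(x)
            out.append(fifo.pop(0))
        else:                             # butterfly phase
            a = fifo.pop(0)
            out.append(a + x)             # sum goes downstream now
            fifo.append(a - x)            # difference is delayed
    return out
```

A full pipelined FFT chains log₂(N) such stages with halving delay lengths and a twiddle multiplier between stages; this toy model just shows why one butterfly per stage sustains one sample per clock.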
For learning, a memory-based radix-2 is a good starting point. For real products, people often use pipelined FFT IP cores.
6. Control logic
You’ll need:
A global state machine:
- Stage counter (0..log₂N–1),
- Butterfly index within the stage,
- Address generation for RAM reads/writes.
Optional:
- start, busy, done signals.
- Valid/ready handshaking if using streaming.
7. Verification
Before going to hardware:
- Write a testbench:
  - Feed known input vectors (e.g. impulse, single-tone sinusoid).
  - Use a reference FFT from Python/MATLAB/NumPy.
  - Compare FPGA output vs reference (allowing for fixed-point quantization error).
Check:
- Spectral shape (sinusoid → single bin).
- DC offset behavior.
- Scaling (is magnitude roughly correct?).
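A minimal software version of such a check, comparing a quantized single tone against NumPy's FFT (the tolerance is illustrative and depends on your word lengths and scaling scheme):

```python
import numpy as np

N = 256
n = np.arange(N)
tone = 0.5 * np.cos(2 * np.pi * 16 * n / N)     # single tone in bin 16

# quantize to Q1.15 and back, emulating the FPGA input path
q = np.round(tone * 2**15).astype(np.int16) / 2**15

ref = np.fft.fft(tone)
dut = np.fft.fft(q)            # stand-in for captured FPGA output

# spectral shape: the positive-frequency peak should land in bin 16
peak = int(np.argmax(np.abs(dut[:N // 2])))
assert peak == 16

# quantization error should be small relative to the peak magnitude
err = np.max(np.abs(dut - ref)) / np.max(np.abs(ref))
assert err < 1e-3
```

Swapping `dut` for the values captured from simulation (or hardware via DMA) turns this into the real pass/fail check.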
C. When to go IP vs custom
Use IP when:
- You mainly care about “it works” and can tweak parameters.
- Project is real-world / time-sensitive.
Build custom when:
- You want to learn FPGA DSP architectures.
- You need a very specific architecture (e.g., ultra-low area, strange word lengths, partial FFT).