DEV Community

ai pics
How to Optimize HLS Designs for FPGAs (A Practical, Vendor-Agnostic Playbook)

Optimizing High-Level Synthesis (HLS) for FPGAs means turning C/C++ into RTL that meets your throughput, latency, area, and power targets without breaking correctness. Below is a concise, field-tested checklist you can apply in Vitis HLS (AMD Xilinx), the Intel HLS Compiler, Catapult, and similar tools. Examples use Vitis HLS-style pragmas, with notes for portability.

1) Know the Optimization Stack

Algorithm level – choose math/data representations that minimize work.

Loop & task level – expose parallelism (pipeline, unroll, dataflow).

Memory & I/O – feed the beast (partition, reshape, burst, stream).

Micro-architecture – bind operators/memories, balance latencies, share resources.

Closure – verify (C simulation, C/RTL co-simulation), analyze (utilization/timing/II/latency), iterate.

2) Numerics & Code Structure
Use bit-accurate fixed types

Prefer ap_(u)int / ap_fixed (or vendor equivalents) over float/double when error budget allows.

Right-size widths aggressively to cut LUTs, FFs, and DSP usage.

#include "ap_int.h"
#include "ap_fixed.h"

using pix_t = ap_uint<10>; // example: 10-bit pixel
using coeff_t = ap_fixed<16,2>; // 16 bits total: 2 integer bits (including sign), 14 fractional
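To get a feel for what a format like ap_fixed<16,2> costs in accuracy before touching vendor headers, here is a minimal host-side sketch in plain C++. It models a signed 16-bit value with 14 fractional bits using int16_t. The helper names (to_q2_14, from_q2_14, max_rounding_error) are made up for illustration and are not part of any HLS library. Note one assumption: this sketch rounds to nearest and saturates, whereas ap_fixed truncates and wraps unless you ask otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical host-side model of a signed 16-bit fixed-point format
// with 2 integer bits (including sign) and 14 fractional bits.
constexpr int kFracBits = 14;
constexpr double kScale = static_cast<double>(1 << kFracBits);  // 16384.0

// Quantize: round to nearest, then saturate to the int16_t range.
inline int16_t to_q2_14(double x) {
    assert(std::isfinite(x));
    double scaled = std::round(x * kScale);
    if (scaled > INT16_MAX) scaled = INT16_MAX;
    if (scaled < INT16_MIN) scaled = INT16_MIN;
    return static_cast<int16_t>(scaled);
}

// Convert the raw integer back to a real value.
inline double from_q2_14(int16_t q) {
    return q / kScale;
}

// Worst-case absolute rounding error for in-range inputs: half an LSB.
inline double max_rounding_error() {
    return 0.5 / kScale;
}
```

Running your coefficient set through a round trip like this shows quickly whether 14 fractional bits fit the error budget, or whether you can shave a few more.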

Make dependencies obvious (or remove them)

Keep hot loops simple; hoist conditionals outside loops when possible.

Replace complex if/else trees on the critical path with tables or precomputed constants where sensible.

Use const, restrict (spelled __restrict in C++, where your tool supports it and the pointers truly never alias), and pass-by-reference to help the compiler infer no-aliasing and enable bursting.
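As a small illustration of the "tables instead of if/else trees" advice, the sketch below shows the same mapping written both ways. The mode-to-gain mapping and the function names are hypothetical. HLS tools typically turn a small const array into a ROM or a handful of LUTs, but check your tool's report rather than taking that for granted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example: pick a gain from a 2-bit mode field.

// Branchy version: a priority chain of comparisons.
inline uint8_t gain_branchy(uint8_t mode) {
    if (mode == 0) return 1;
    else if (mode == 1) return 2;
    else if (mode == 2) return 4;
    else return 8;
}

// Table version: one constant array indexed directly by the mode.
inline uint8_t gain_table(uint8_t mode) {
    static const uint8_t kGain[4] = {1, 2, 4, 8};
    assert(mode < 4);  // the table only covers the 2-bit range
    return kGain[mode];
}
```

Both functions return the same values for modes 0 to 3; the table form just makes the selection a single indexed read.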

3) Loop-Level Optimization
Pipeline first

Goal: II=1 on the critical loop whenever feasible.

for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    // body with no loop-carried true dependencies
}

Tip: If HLS won't reach II=1, the synthesis log usually says why (an II violation message naming the limiting dependency or resource). Common causes:

Memory port conflicts → partition/reshape arrays or widen the data path.

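Loop-carried dependency (RAW/WAR/WAW) → restructure buffers or, if you know the accesses never collide across iterations, assert independence. The tool trusts this pragma without checking it, so a wrong assertion yields incorrect hardware:

#pragma HLS DEPENDENCE variable=buf inter false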
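One common way to restructure a loop-carried dependency is to split a single accumulator into several interleaved partial sums, so each one is updated only every few iterations. The sketch below is plain host-side C++; the function name and the choice of four partials are assumptions (in practice you would pick a count at least as large as the adder latency in cycles). Whether a given tool then reaches II=1 depends on that latency and on the tool seeing the dependence distance, so treat this as a starting point.

```cpp
#include <cassert>
#include <cstddef>

// Assumed value: pick >= the adder latency (in cycles) of your target.
constexpr std::size_t kNumPartials = 4;

// Sums x[0..n) using kNumPartials interleaved accumulators.
// Iteration i only touches partial[i % kNumPartials], so consecutive
// iterations no longer wait on each other's result.
float sum_interleaved(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    float partial[kNumPartials] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; ++i) {
        partial[i % kNumPartials] += x[i];
    }
    // Final reduction of the partial sums.
    float total = 0.0f;
    for (std::size_t p = 0; p < kNumPartials; ++p) {
        total += partial[p];
    }
    return total;
}
```

Keep in mind that reordering floating-point additions can change the rounding of the result slightly, so compare against your golden model with a tolerance.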

Unroll to trade area for throughput

Partial unroll to match available memory banks/ports; full unroll only if you can feed it.

for (int k = 0; k < K; k++) {
#pragma HLS UNROLL factor=4
    ...
}

Tile / block for locality

Break large loops into tiles that fit BRAM/URAM; combine with on-chip buffers to reduce DDR traffic.

for (int ii = 0; ii < N; ii += Ti)
    for (int jj = 0; jj < M; jj += Tj)
        compute_tile(ii, jj);
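To make the tiling pattern concrete, here is a host-side sketch of a load / compute / store loop over tiles. The function name scale_tiled, the tile sizes, and the trivial "double every element" compute step are all placeholders; the local tile array stands in for an on-chip BRAM/URAM buffer. Edge tiles are clipped so the matrix dimensions need not be multiples of the tile size.

```cpp
#include <algorithm>
#include <cassert>

constexpr int Ti = 4;  // hypothetical tile height
constexpr int Tj = 4;  // hypothetical tile width

// Doubles every element of a rows x cols row-major matrix, one tile at a time.
void scale_tiled(const int* in, int* out, int rows, int cols) {
    assert(in != nullptr && out != nullptr && rows >= 0 && cols >= 0);
    int tile[Ti][Tj];  // stands in for an on-chip buffer
    for (int ii = 0; ii < rows; ii += Ti) {
        for (int jj = 0; jj < cols; jj += Tj) {
            const int ti = std::min(Ti, rows - ii);  // clip edge tiles
            const int tj = std::min(Tj, cols - jj);
            // Load: external memory -> local buffer
            for (int i = 0; i < ti; ++i)
                for (int j = 0; j < tj; ++j)
                    tile[i][j] = in[(ii + i) * cols + (jj + j)];
            // Compute: works only on the local buffer
            for (int i = 0; i < ti; ++i)
                for (int j = 0; j < tj; ++j)
                    tile[i][j] *= 2;
            // Store: local buffer -> external memory
            for (int i = 0; i < ti; ++i)
                for (int j = 0; j < tj; ++j)
                    out[(ii + i) * cols + (jj + j)] = tile[i][j];
        }
    }
}
```

Separating load, compute, and store like this also makes it easier to later overlap the three phases.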

Help the estimator

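Trip counts for variable-bound loops make the latency reports meaningful. The pragma goes inside the loop body and is for analysis only; it does not change the synthesized hardware:

#pragma HLS LOOP_TRIPCOUNT min=64 max=128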

4) Task-Level Concurrency (DATAFLOW)

Use dataflow to run producer/consumer stages concurrently. Connect stages with hls::stream (or Intel channels).

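#include "hls_stream.h"

// data_t is a placeholder for your element type.
void stageA(hls::stream<data_t>& in, hls::stream<data_t>& out);
void stageB(hls::stream<data_t>& in, hls::stream<data_t>& out);
void stageC(hls::stream<data_t>& in, hls::stream<data_t>& out);

void top(hls::stream<data_t>& in, hls::stream<data_t>& out) {
#pragma HLS DATAFLOW
    static hls::stream<data_t> s1("s1"), s2("s2");
#pragma HLS STREAM variable=s1 depth=64
#pragma HLS STREAM variable=s2 depth=64

    stageA(in, s1);   // reads the top-level input
    stageB(s1, s2);
    stageC(s2, out);  // writes the top-level output
}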

Tips

Choose FIFO depths to absorb burstiness and meet initiation intervals across stages.

Avoid reading/writing the same array from multiple tasks unless you bank/partition correctly.
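For a first guess at FIFO depth, a back-of-the-envelope occupancy model can help before you reach for co-simulation. The sketch below is a deliberately crude host-side model with made-up names: the producer writes at most one element per cycle according to a trace you supply, and the consumer pops one element every consumer_ii cycles. Real stages stall and back-pressure each other, so treat the result as a lower bound to validate in C/RTL co-simulation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the peak number of elements held in the FIFO.
// producer_writes[c] is true if the producer pushes in cycle c.
// The consumer pops on cycles that are multiples of consumer_ii,
// provided the FIFO is not empty.
std::size_t peak_fifo_occupancy(const std::vector<bool>& producer_writes,
                                std::size_t consumer_ii) {
    assert(consumer_ii >= 1);
    std::size_t occupancy = 0;
    std::size_t peak = 0;
    for (std::size_t cycle = 0; cycle < producer_writes.size(); ++cycle) {
        if (producer_writes[cycle]) ++occupancy;        // producer push
        if (occupancy > peak) peak = occupancy;         // measured before the pop
        if (cycle % consumer_ii == 0 && occupancy > 0)  // consumer pop
            --occupancy;
    }
    return peak;
}
```

For example, an 8-cycle burst feeding a consumer with II=2 peaks at 4 elements in this model, while a consumer with II=1 never holds more than 1.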

5) Memory & Interface Tuning
Partition / reshape arrays to add ports

ARRAY_PARTITION creates true parallel banks (good for random access).

ARRAY_RESHAPE packs multiple elements per word (great for sequential access and burst width).

// Random parallel reads
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4 dim=1

// Wide sequential loads/stores (e.g., 512-bit DDR beats)
#pragma HLS ARRAY_RESHAPE variable=line cyclic factor=16 dim=1
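It helps to see where each element lands under cyclic versus block partitioning, because that decides which accesses can happen in the same cycle. The sketch below models the index-to-bank mapping in plain C++; BankAddr, cyclic_map, and block_map are illustrative names, not tool APIs.

```cpp
#include <cassert>

// Which bank an element lives in, and its address inside that bank.
struct BankAddr {
    int bank;
    int offset;
};

// Cyclic: consecutive elements are dealt round-robin across banks.
inline BankAddr cyclic_map(int index, int factor) {
    assert(index >= 0 && factor > 0);
    return BankAddr{index % factor, index / factor};
}

// Block: the array is cut into `factor` contiguous chunks.
inline BankAddr block_map(int index, int factor, int size) {
    assert(index >= 0 && factor > 0 && index < size);
    const int per_bank = (size + factor - 1) / factor;  // ceiling division
    return BankAddr{index / per_bank, index % per_bank};
}
```

With factor=4 on a 16-element array, elements 8 to 11 fall into four different banks under cyclic mapping (readable in parallel) but into the same bank under block mapping. That is why cyclic suits loops unrolled over consecutive indices.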

Burst DDR and align widths

Use m_axi (Vitis) and wide types (ap_uint<256> or ap_uint<512>) to match DDR or NoC widths; ensure contiguous access patterns so the tool can infer bursts.

Add offset=slave and a distinct bundle= name per port when the ports should get separate AXI masters.

void kernel(ap_uint<512>* in, ap_uint<512>* out, int N) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem0 depth=1024
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1 depth=1024
#pragma HLS INTERFACE s_axilite port=N bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
// ...
}
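The body elided above usually follows a "read one wide word, process its lanes, write one wide word" shape. The following host-side sketch shows that shape in plain C++. Word512 is a stand-in struct for ap_uint<512> (with the real type you would slice lanes out with range(hi, lo)), and add_const_wide with its add-a-constant operation is purely illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kWordLanes = 16;  // 16 x 32-bit lanes = 512 bits

// Host-side stand-in for a 512-bit bus word.
struct Word512 {
    uint32_t lane[kWordLanes];
};

// Adds a constant to every 32-bit element, one 512-bit word per iteration.
void add_const_wide(const Word512* in, Word512* out,
                    std::size_t n_words, uint32_t c) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t w = 0; w < n_words; ++w) {  // contiguous, unit-stride
        Word512 v = in[w];                       // one wide read
        for (int l = 0; l < kWordLanes; ++l) {   // lanes are independent
            v.lane[l] += c;
        }
        out[w] = v;                              // one wide write
    }
}
```

The unit-stride outer loop is the part that lets a tool infer bursts; the independent inner loop is the natural candidate for full unrolling.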

Stream for high throughput and low latency

Use AXI4-Stream at the top and hls::stream internally for line-rate pipelines (video, radio, ML).

#pragma HLS INTERFACE axis port=in_axis
#pragma HLS INTERFACE axis port=out_axis

6) Resource Binding & Micro-Architecture
Bind operations and memories

Map multiplies to DSPs (throughput) or LUTs (save DSPs).

Choose BRAM vs URAM for large buffers; single-/dual-port appropriately.

// Vitis HLS; legacy Vivado HLS used: #pragma HLS RESOURCE variable=mul_op core=DSP48
#pragma HLS BIND_OP variable=mul_op op=mul impl=dsp
#pragma HLS BIND_STORAGE variable=tile type=ram_2p impl=bram

Control sharing vs. replication

Use UNROLL to replicate compute, or the ALLOCATION pragma to limit operator instances for area.

#pragma HLS ALLOCATION operation instances=mul limit=2

Latency balancing

For long adder trees or MAC chains, HLS will usually insert registers; you can constrain:

#pragma HLS LATENCY min=1 max=6

7) Throughput vs. Latency vs. Fmax

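II (Initiation Interval) is the number of cycles between the starts of successive loop iterations; it sets throughput (1/II samples per cycle, so II=1 means one sample every cycle).

Latency is total cycles from input to output.

Fmax is confirmed by post-implementation (place-and-route) timing; the HLS report is only an estimate. To raise it, shorten critical paths (reduce fan-out, balance trees, use DSPs).

Clocking note: Set the target period in tool constraints (e.g., Vitis HLS create_clock -period 5, i.e. 5 ns or 200 MHz) rather than in code; adjust until timing is clean with margin.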
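These three quantities combine in a simple way that is handy for sanity-checking reports. For a pipelined loop, total cycles are roughly the pipeline depth (the latency of one iteration) plus II times the remaining iterations, and sample rate is clock frequency divided by II. The helpers below are illustrative names; exact cycle counts in tool reports may differ by a cycle or two of loop entry/exit overhead.

```cpp
#include <cassert>
#include <cstdint>

// Approximate cycle count of a pipelined loop:
//   depth + II * (trip_count - 1)
// where depth is the latency of a single iteration.
inline uint64_t pipelined_loop_cycles(uint64_t trip_count,
                                      uint64_t ii,
                                      uint64_t depth) {
    assert(ii >= 1);
    if (trip_count == 0) return 0;
    return depth + ii * (trip_count - 1);
}

// Sustained throughput in mega-samples per second for a given clock.
inline double throughput_msps(double fmax_mhz, uint64_t ii) {
    assert(ii >= 1);
    return fmax_mhz / static_cast<double>(ii);
}
```

For example, 1024 iterations with a 10-cycle depth take about 1033 cycles at II=1 and about 2056 at II=2; at 200 MHz, II=2 sustains 100 MS/s.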

8) Verification & Reporting

C-sim: Prove algorithm correctness fast.

C/RTL Co-sim: Validate that RTL matches C under realistic I/O.

Reports: Inspect

Achieved II and latency,

Stall reasons (dependencies/ports),

Resource map (LUT/FF/DSP/BRAM/URAM),

Interface burst efficiency.

Fixed-point testing: check bit-exactness against a fixed-point model, and measure SNR/PSNR against a floating-point golden reference to confirm the result stays within the error budget.
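A small SNR helper is usually all a C testbench needs for the error-budget check. The sketch below is plain C++ with an assumed function name (snr_db). It takes the floating-point golden output and the design output converted back to double, and returns the signal-to-noise ratio in dB.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// SNR in dB of `test` relative to `golden`:
//   10 * log10( sum(golden^2) / sum((golden - test)^2) )
// Returns +infinity when the two sequences match exactly.
double snr_db(const double* golden, const double* test, std::size_t n) {
    assert(golden != nullptr && test != nullptr && n > 0);
    double signal = 0.0;
    double noise = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double err = golden[i] - test[i];
        signal += golden[i] * golden[i];
        noise += err * err;
    }
    if (noise == 0.0) return std::numeric_limits<double>::infinity();
    return 10.0 * std::log10(signal / noise);
}
```

In a testbench you would compare the returned value against a threshold from your spec and fail the run if it drops below it.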

9) Example: Streaming FIR with One-Sample-per-Cycle

This version sustains II=1 by unrolling the tap MAC and fully partitioning coefficients and the shift register. It uses fixed-point, AXI-Stream I/O, and works nicely inside a DATAFLOW pipeline.

#include "ap_fixed.h"
#include "hls_stream.h"

using data_t = ap_fixed<16,8>;
using acc_t = ap_fixed<32,12>; // wider accumulator
const int N = 64;

struct axis_t {
    data_t data;
    bool last;
};

void fir64(hls::stream<axis_t>& in, hls::stream<axis_t>& out, const data_t coeff[N]) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS ARRAY_PARTITION variable=coeff complete dim=1

    static data_t shift_reg[N];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1

    while (true) {
#pragma HLS PIPELINE II=1
        axis_t x = in.read();

        // shift: every register moves in the same cycle once unrolled
        for (int i = N-1; i > 0; --i) {
#pragma HLS UNROLL
            shift_reg[i] = shift_reg[i-1];
        }
        shift_reg[0] = x.data;

        // MAC: N parallel multiplies feeding an adder tree
        acc_t acc = 0;
        for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL
            acc += (acc_t)shift_reg[i] * (acc_t)coeff[i];
        }

        axis_t y;
        y.data = (data_t)acc;
        y.last = x.last;
        out.write(y);

        if (x.last) break;  // simple frame terminator
    }
}
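For C simulation of a filter like this, a floating-point golden model is useful to compare against. The sketch below is a plain C++ direct-form FIR with the same convention as the code above (zero initial state, newest sample multiplied by coeff[0]). The name fir_reference is an assumption, and std::vector is used because this runs only on the host, not through synthesis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct-form FIR reference: y[n] = sum_k coeff[k] * x[n - k],
// with samples before the start of x treated as zero.
std::vector<double> fir_reference(const std::vector<double>& x,
                                  const std::vector<double>& coeff) {
    assert(!coeff.empty());
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < coeff.size(); ++k) {
            if (n >= k) {
                y[n] += coeff[k] * x[n - k];
            }
        }
    }
    return y;
}
```

Feeding an impulse through it returns the coefficients in order, which is a quick first check before comparing the fixed-point output sample by sample within your error budget.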
