Optimizing High-Level Synthesis (HLS) for FPGAs is about turning C/C++ into RTL that meets your throughput, latency, area, and power targets—without breaking correctness. Below is a concise, field-tested checklist you can apply in Vitis HLS (Xilinx), Intel HLS, Catapult, etc. Examples use Vitis HLS-style pragmas, with notes for portability.
1) Know the Optimization Stack
Algorithm level – choose math/data representations that minimize work.
Loop & task level – expose parallelism (pipeline, unroll, dataflow).
Memory & I/O – feed the beast (partition, reshape, burst, stream).
Micro-architecture – bind operators/memories, balance latencies, share resources.
Closure – verify (C sim, RTL co-sim), analyze (utilization, timing, II, latency), iterate.
2) Numerics & Code Structure
Use bit-accurate fixed types
Prefer ap_(u)int / ap_fixed (or vendor equivalents) over float/double when error budget allows.
Right-size widths aggressively to cut LUTs, FFs, and DSP usage.
include "ap_int.h"
include "ap_fixed.h"
using pix_t = ap_uint<10>; // example: 10-bit pixel
using coeff_t = ap_fixed<16,2>; // 2 integer bits, 14 fractional
Make dependencies obvious (or remove them)
Keep hot loops simple; hoist conditionals outside loops when possible.
Replace complex if/else trees on the critical path with tables or precomputed constants where sensible (a sketch follows this list).
Use const, restrict (where safe), and pass-by-reference to help the compiler prove non-aliasing and enable bursting.
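Here is a minimal sketch of the first two points, assuming a hypothetical pixel-correction loop; the function name, the bypass flag, and the lookup table are illustrative, not taken from any particular design:

#include "ap_int.h"

using pix_t = ap_uint<10>;

// Hypothetical example: the mode test is hoisted out of the hot loop
// (checked once, not per pixel), and a per-pixel if/else chain is
// replaced by a precomputed 1024-entry table.
void correct_pixels(const pix_t in[1024], pix_t out[1024],
                    bool bypass, const pix_t lut[1024]) {
    if (bypass) {
        for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
            out[i] = in[i];
        }
    } else {
        for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
            out[i] = lut[in[i]];   // table lookup instead of an if/else tree
        }
    }
}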
3) Loop-Level Optimization
Pipeline first
Goal: II=1 on the critical loop whenever feasible.
for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    // body with no loop-carried true dependences
}
Tip: If HLS won’t reach II=1, check the synthesis log’s “stall” reason:
Memory port conflicts → partition/reshape arrays or widen the data path.
Loop-carried dependency (RAW/WAR/WAW) → restructure buffers or prove independence:
#pragma HLS DEPENDENCE variable=buf inter false
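For context, a hedged sketch of where that pragma typically lands; the scatter-style update, the names buf, idx, and val, and the claim that indices never repeat in adjacent iterations are assumptions for illustration:

void scatter_update(int buf[256], const int idx[1024], const int val[1024]) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS DEPENDENCE variable=buf inter false
        // Safe only if the designer guarantees idx[i] values do not repeat
        // within the pipeline depth; otherwise the waived dependence is real.
        buf[idx[i]] += val[i];
    }
}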
Unroll to trade area for throughput
Partial unroll to match available memory banks/ports; full unroll only if you can feed it.
for (int k = 0; k < K; k++) {
#pragma HLS UNROLL factor=4
    // ...
}
Tile / block for locality
Break large loops into tiles that fit BRAM/URAM; combine with on-chip buffers to reduce DDR traffic.
for (int ii=0; ii<N; ii+=Ti)
for (int jj=0; jj<M; jj+=Tj)
compute_tile(ii, jj);
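A hedged sketch of what compute_tile could look like; the tile sizes Ti/Tj, the row stride M, and the placeholder doubling operation are assumptions, not part of the original:

const int Ti = 64, Tj = 64;

// Stage one Ti x Tj tile into on-chip RAM, compute locally, write back.
// DDR then sees two long sequential accesses instead of scattered reads.
void compute_tile(const int* src, int* dst, int ii, int jj, int M) {
    int local[Ti][Tj];
#pragma HLS BIND_STORAGE variable=local type=ram_2p impl=bram
    for (int i = 0; i < Ti; i++)
        for (int j = 0; j < Tj; j++) {
#pragma HLS PIPELINE II=1
            local[i][j] = src[(ii + i) * M + (jj + j)];
        }
    for (int i = 0; i < Ti; i++)
        for (int j = 0; j < Tj; j++) {
#pragma HLS PIPELINE II=1
            dst[(ii + i) * M + (jj + j)] = 2 * local[i][j];   // placeholder compute
        }
}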
Help the estimator
Tripcounts improve latency reports and scheduling:
#pragma HLS LOOP_TRIPCOUNT min=64 max=128
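The pragma belongs inside the variable-bound loop it describes. A minimal sketch, where the function, the bound n, and the running sum are illustrative:

void accumulate(const int* data, int n, int& sum) {
    int acc = 0;
    for (int i = 0; i < n; i++) {   // bound unknown at synthesis time
#pragma HLS LOOP_TRIPCOUNT min=64 max=128
#pragma HLS PIPELINE II=1
        acc += data[i];
    }
    sum = acc;
}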
4) Task-Level Concurrency (DATAFLOW)
Use dataflow to run producer/consumer stages concurrently. Connect stages with hls::stream (or Intel channels).
include "hls_stream.h"
void stageA(hls::stream& out);
void stageB(hls::stream& in, hls::stream& out);
void stageC(hls::stream& in);
void top(hls::stream& in, hls::stream& out) {
pragma HLS DATAFLOW
static hls::stream s1("s1"), s2("s2");
pragma HLS STREAM variable=s1 depth=64
pragma HLS STREAM variable=s2 depth=64
stageA(s1);
stageB(s1, s2);
stageC(s2);
}
Tips
Choose FIFO depths to absorb burstiness and meet initiation intervals across stages.
Avoid reading/writing the same array from multiple tasks unless you bank/partition correctly.
5) Memory & Interface Tuning
Partition / reshape arrays to add ports
PARTITION creates true parallel banks (good for random access).
RESHAPE packs multiple elements per word (great for sequential access and burst width).
// Random parallel reads
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4 dim=1
// Wide sequential loads/stores (e.g., 512-bit DDR beats)
#pragma HLS ARRAY_RESHAPE variable=line cyclic factor=16 dim=1
Burst DDR and align widths
Use m_axi (Vitis) and wide types (ap_uint<256/512>) to match DDR or NoC widths; ensure contiguous access patterns.
Add offset=slave & proper bundle= names for multiple ports.
void kernel(ap_uint<512>* in, ap_uint<512>* out, int N) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem0 depth=1024
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1 depth=1024
#pragma HLS INTERFACE s_axilite port=N bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
// ...
}
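A hedged sketch of a burst-friendly access pattern; copy_wide, n_beats, and the pass-through "compute" are illustrative stand-ins:

#include "ap_int.h"

// A contiguous, unit-stride, pipelined loop over a wide pointer is the
// pattern HLS tools typically coalesce into long AXI read/write bursts.
void copy_wide(const ap_uint<512>* src, ap_uint<512>* dst, int n_beats) {
#pragma HLS INTERFACE m_axi port=src offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=dst offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=n_beats bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < n_beats; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=64 max=1024
        dst[i] = src[i];   // monotonically increasing addresses on both sides
    }
}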
Stream for high throughput and low latency
Use AXI4-Stream at the top and hls::stream internally for line-rate pipelines (video, radio, ML).
#pragma HLS INTERFACE axis port=in_axis
#pragma HLS INTERFACE axis port=out_axis
6) Resource Binding & Micro-Architecture
Bind operations and memories
Map multiplies to DSPs (throughput) or LUTs (save DSPs).
Choose BRAM vs URAM for large buffers; single-/dual-port appropriately.
#pragma HLS RESOURCE variable=mul_op core=DSP48       // legacy form; newer Vitis HLS uses BIND_OP
#pragma HLS BIND_STORAGE variable=tile type=ram_2p impl=bram
Control sharing vs. replication
Use UNROLL to replicate compute, or ALLOCATION/RESOURCE pragmas to limit operator instances for area.
#pragma HLS ALLOCATION operation instances=mul limit=2
Latency balancing
For long adder trees or MAC chains, HLS will usually insert registers; you can constrain:
#pragma HLS LATENCY min=1 max=6
7) Throughput vs. Latency vs. Fmax
II (Initiation Interval) controls throughput (samples/cycle).
Latency is total cycles from input to output.
Fmax comes from post-synthesis timing; shorten critical paths (reduce fan-out, balance trees, use DSPs).
Clocking note: Set the target period in tool constraints (e.g., Vitis HLS create_clock -period 5) rather than in code; adjust until timing is clean with margin.
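A quick worked example with illustrative numbers: a loop that reaches II=1 at a 300 MHz clock accepts one sample every ~3.33 ns, i.e., 300 Msamples/s, even if its latency is 40 cycles; the same loop at II=4 delivers only 75 Msamples/s at the same Fmax. Throughput ≈ Fmax / II, so attack II first, then chase Fmax.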
8) Verification & Reporting
C-sim: Prove algorithm correctness fast.
C/RTL Co-sim: Validate that RTL matches C under realistic I/O.
Reports: Inspect
Achieved II and latency,
Stall reasons (dependencies/ports),
Resource map (LUT/FF/DSP/BRAM/URAM),
Interface burst efficiency.
Bit-exact testing for fixed-point: measure SNR/PSNR or error budgets vs. floating-point golden.
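A hedged C-simulation sketch of the error-budget check; the 0.75 gain, the input sweep, and the budget threshold are illustrative:

#include <cmath>
#include <cstdio>
#include "ap_fixed.h"

using fx_t = ap_fixed<16,2>;

// Quantize a simple gain to ap_fixed and compare against the
// double-precision golden model over a sweep of the input range.
static double golden(double x) { return 0.75 * x; }
static double quantized(double x) {
    fx_t g = 0.75, xx = x;
    ap_fixed<32,4> p = g * xx;        // wider product, no extra rounding
    return p.to_double();
}

int main() {
    double max_err = 0.0;
    for (int i = 0; i < 2000; i++) {
        double x = -1.0 + i * 1e-3;
        max_err = std::fmax(max_err, std::fabs(quantized(x) - golden(x)));
    }
    std::printf("max abs error = %g\n", max_err);
    return (max_err < 1.0 / 1024) ? 0 : 1;   // illustrative error budget
}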
9) Example: Streaming FIR with One-Sample-per-Cycle
This version sustains II=1 by unrolling the tap MAC and fully partitioning coefficients and the shift register. It uses fixed-point, AXI-Stream I/O, and works nicely inside a DATAFLOW pipeline.
include "ap_fixed.h"
include "hls_stream.h"
using data_t = ap_fixed<16,8>;
using acc_t = ap_fixed<32,12>; // wider accumulator
const int N = 64;
struct axis_t {
data_t data;
bool last;
};
void fir64(hls::stream<axis_t>& in, hls::stream<axis_t>& out, const data_t coeff[N]) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS ARRAY_PARTITION variable=coeff complete dim=1
    static data_t shift_reg[N];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1
while (true) {
#pragma HLS PIPELINE II=1
axis_t x = in.read();
// shift
for (int i = N-1; i > 0; --i) {
#pragma HLS UNROLL
shift_reg[i] = shift_reg[i-1];
}
shift_reg[0] = x.data;
// MAC
acc_t acc = 0;
for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL
acc += (acc_t)shift_reg[i] * (acc_t)coeff[i];
}
axis_t y;
y.data = (data_t)acc;
y.last = x.last;
out.write(y);
if (x.last) break; // simple frame terminator
}
}
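A minimal C-simulation harness sketch for this kernel; it assumes the definitions above are in the same translation unit, and the impulse coefficients and 8-sample frame are illustrative:

#include <cstdio>

int main() {
    hls::stream<axis_t> in("in"), out("out");
    data_t coeff[N];
    for (int i = 0; i < N; i++) coeff[i] = (i == 0) ? data_t(1) : data_t(0);  // impulse taps

    // Drive one 8-sample frame terminated by 'last'.
    for (int i = 0; i < 8; i++) {
        axis_t x;
        x.data = data_t(i);
        x.last = (i == 7);
        in.write(x);
    }

    fir64(in, out, coeff);   // runs until it consumes the 'last' sample

    // With an impulse at tap 0, the output should track the input.
    while (!out.empty()) {
        axis_t y = out.read();
        std::printf("%f%s\n", y.data.to_double(), y.last ? " (last)" : "");
    }
    return 0;
}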