Apache NiFi is the flow-based data movement engine that quietly powers many of the on-prem and hybrid data platforms in banking, telco, defence, and healthcare: the workloads that are unlikely to move to a fully managed cloud stack, and the ones where a senior data engineer interview will very likely include the question "walk me through what a NiFi FlowFile actually is and why backpressure is different from a rate limiter". NiFi's mental model is not the DAG-of-tasks that Airflow drilled into a generation of engineers; it is flow-based programming, in which data packets flow through a graph of independent processors, each processor a single-responsibility gear, each connection a bounded queue that decides on its own when the upstream processor needs to pause. The engineering trade-off is not "should I use NiFi" but "which slice of my pipeline is NiFi-shaped and which slice is Airflow-shaped", and the senior signal is answering that on the whiteboard without hand-waving.
This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "explain the four axes of a NiFi processor" or "how does NiFi backpressure work when the downstream S3 sink stalls?" or "walk me through promoting a flow from dev to staging to prod using NiFi Registry". It walks through the FlowFile-plus-processor atom that every NiFi flow is built from, the connection queue with its size and object thresholds, the NiFi Expression Language you'll use to route on attribute values, the Registry's git-for-process-groups model with parameter contexts filling in env-specific values, cluster mode with a coordinator and primary node, NiFi provenance and its replay-a-single-FlowFile superpower, NiFi-Kafka integration via the Consume/Publish processors, NiFi cluster deployment on Kubernetes, and the honest NiFi vs Airflow comparison that senior interviewers keep asking about. Each section pairs a teaching block with a Solution-Tail interview answer: code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the streaming practice library, rehearse on the ETL practice library, and sharpen the routing axis with the real-time analytics practice library.
On this page
- Why NiFi still ships in on-prem and hybrid stacks
- FlowFile + Processor model
- Connections, queues, backpressure
- NiFi Registry + versioning + templates
- Cluster, provenance, and production patterns
- Cheat sheet — NiFi recipes
- Frequently asked questions
- Practice on PipeCode
1. Why NiFi still ships in on-prem and hybrid stacks
Flow-based programming — the mental model interviewers actually test
The one-sentence invariant: NiFi is not a DAG scheduler and not a stream processor — it is a flow-based programming runtime where data packets called FlowFiles travel through a graph of always-on processors, each connection is a bounded queue with its own backpressure policy, and the whole flow is described declaratively in a UI-editable canvas that also serialises to versioned JSON in NiFi Registry. Every other NiFi interview question is a consequence of that invariant.
The four "must-answer" axes interviewers actually probe.
- FlowFile. A FlowFile is not a file on disk — it is a lightweight reference object that carries a set of key-value attributes (metadata) and a pointer into the content repository (the payload bytes). Understanding that the attributes travel separately from the content is the first senior signal.
- Processor. A processor is a single-responsibility node that reads FlowFiles from an input queue, optionally transforms the attributes and/or the content, then emits FlowFiles to one or more named output relationships (success, failure, retry, matched, unmatched). Processors are scheduled continuously by the framework (timer-driven by default, with a CRON-driven strategy available when a step really must run on a calendar), so the default posture is "react as soon as data is queued", not "run this once at 3 AM".
- Backpressure. Every connection between two processors is a bounded queue with a size threshold (bytes) and an object threshold (FlowFile count). When either threshold is reached, NiFi stops scheduling the upstream processor until the queue drains below it. This is not a rate limiter: it does not cap throughput in normal operation. It is a stop signal that propagates upstream, connection by connection, until the slowest downstream sink catches up.
- Provenance. NiFi records the lineage of every FlowFile: every processor it touched, every transformation, every fork or clone, every drop. You can select a single FlowFile from three days ago and replay it through the same or a modified flow, provided its content is still retained in the content repository. Few open-source data-movement tools ship this out of the box.
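The bounded-queue behaviour in the Backpressure bullet can be sketched in a few lines of plain Python. This is a toy model rather than NiFi's implementation: the `Connection` class, its field names, and the tiny thresholds in the usage lines are all hypothetical, chosen only to show that either threshold stops the upstream side and that draining the queue lets it resume.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Toy model of a connection: a queue guarded by two thresholds."""
    max_objects: int = 10_000           # illustrative object threshold
    max_bytes: int = 1_000_000_000      # illustrative size threshold (1 GB)
    queue: deque = field(default_factory=deque)
    queued_bytes: int = 0

    def backpressure_engaged(self) -> bool:
        # Reaching either threshold is enough to stop the upstream side.
        return len(self.queue) >= self.max_objects or self.queued_bytes >= self.max_bytes

    def offer(self, size: int) -> bool:
        """Upstream tries to enqueue a FlowFile of `size` bytes."""
        if self.backpressure_engaged():
            return False                # upstream is paused; nothing is dropped
        self.queue.append(size)
        self.queued_bytes += size
        return True

    def poll(self) -> int:
        """Downstream drains one FlowFile, relieving pressure."""
        size = self.queue.popleft()
        self.queued_bytes -= size
        return size


conn = Connection(max_objects=3, max_bytes=100)
accepted = [conn.offer(10) for _ in range(5)]   # downstream stalled, queue fills up
conn.poll()                                     # downstream drains one FlowFile
resumed = conn.offer(10)                        # upstream is allowed to run again
```

Note what the model does not do: it never slows the producer while the queue is below its thresholds, which is the difference from a rate limiter.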
The 2026 reality of Apache NiFi.
- NiFi 2.x. The 2.x line dropped Java 8, adopted Java 21 as baseline, and introduced first-class Python processors: you can now write a NiFi processor in a .py file that lives alongside your Java processors and gets picked up automatically on flow load. The mental model is unchanged; the developer ergonomics moved forward a decade.
- Kubernetes-native deployment. The community nifi-on-k8s operator and Cloudera's DataFlow Kubernetes distribution let you run a NiFi cluster as a StatefulSet with persistent volumes for the flow, content, and provenance repositories. Auto-scaling remains manual (add nodes and rebalance), but the operational story is night-and-day better than the 1.x bare-metal era.
- NiFi Registry. A separate Spring Boot service that stores versioned snapshots of process groups. Registry is the glue between dev, staging, and prod NiFi instances: a flow committed on dev becomes an atomic import target for staging, then prod, with per-environment parameter contexts filling in the different credentials, endpoints, and connection strings.
- Python processors. In NiFi 2.x, the Python logic that used to be squeezed into ExecuteScript moves to a proper Python SDK. You subclass FlowFileTransform, implement transform(context, flowfile), and return a result with new attributes and content bytes. This is arguably the biggest developer-experience change since NiFi entered the Apache Incubator in late 2014.
What interviewers actually listen for.
- Do you say "flow-based programming, not DAG scheduling" in the first sentence? — senior signal.
- Do you distinguish FlowFile attributes (metadata) from content (payload bytes) without prompting? — senior signal.
- Do you describe backpressure as a bounded-queue property, not a processor property? — required answer.
- Do you know provenance replays are per-FlowFile and can be run against a modified flow? — senior signal.
- Do you have a real answer to "NiFi vs Airflow" that isn't "NiFi is better for streaming"? — required answer.
Where NiFi wins over Airflow, Kafka Streams, and custom code.
- Configuration-heavy integrations. SFTP polling with retry, HTTPS ingestion with cert pinning, JMS listeners, JDBC extraction from a legacy Oracle DB — NiFi ships a canonical processor for each. You configure it in the UI in minutes; the equivalent Airflow operator or custom Python is hours or days.
- On-prem and hybrid. Airflow deployments usually lean on Kubernetes or a managed cloud service. NiFi runs happily on a bank's air-gapped bare metal, a telco edge site, or a hospital's HIPAA-scoped VLAN: places where "just deploy to AWS" is not on the table.
- Continuous, low-latency flows. Airflow's minimum scheduling granularity is a minute; NiFi's is a millisecond. If your flow needs to react to files landing in an SFTP within seconds, NiFi is the shorter path.
- Data provenance at the FlowFile grain. Airflow tracks task-run lineage; NiFi tracks per-FlowFile lineage. When compliance asks "which specific customer record did we drop in yesterday's ingestion?", NiFi answers in one Provenance query.
Where Airflow wins over NiFi.
- Batch ETL scheduling. Nightly warehouse builds, monthly aggregations, cross-DAG dependencies — Airflow's scheduler and dagster/Prefect-style DAG modelling are stronger.
- Python-first shops. Airflow's DAG-as-code and dbt-friendly operator ecosystem let a Python team ship end-to-end pipelines in a repo. NiFi's canvas is graphical; the artifact is a JSON blob, not a .py file.
- Cross-cluster orchestration. Airflow can orchestrate Spark jobs on EMR, dbt runs in Snowflake, and a Kafka topic reset; cross-service orchestration is its wheelhouse. NiFi does data movement first and orchestration second.
Worked example — a NiFi flow versus an Airflow DAG for the same requirement
Detailed explanation. The classic ambiguous requirement: "ingest new files from an SFTP server, decompress them, validate the schema, drop the bad rows, land the good rows in S3, and notify a downstream job." The workload is a five-node linear flow with a 60-second end-to-end latency SLA and no cross-DAG dependencies — the exact shape where senior engineers must compare NiFi against Airflow and a custom Python cron with actual code, not vibes.
Question. Design the flow in both NiFi and Airflow. Compare the two solutions on latency, code volume, provenance, and operational surface area. Recommend one for a bank running on-prem NiFi already.
Input.
| Requirement | Value |
|---|---|
| Source | SFTP server (on-prem) |
| File format | gzipped CSV |
| Volume | 200–500 files/day, 10 MB–2 GB each |
| Latency SLA | End-to-end 60 s |
| Sink | S3 bucket (cross-network) |
| Notification | Kafka topic ingest.done |
| Team | On-prem infra + cloud data team |
Code.
<!-- NiFi flow: 5 processors, wired top-to-bottom on the canvas -->
<processor id="p1" type="org.apache.nifi.processors.standard.ListSFTP">
  <property name="Hostname">sftp.internal</property>
  <property name="Remote Path">/incoming</property>
  <property name="Polling Interval">10 sec</property>
  <property name="Search Recursively">false</property>
</processor>
<processor id="p2" type="org.apache.nifi.processors.standard.FetchSFTP">
  <property name="Hostname">sftp.internal</property>
  <property name="Completion Strategy">Move File</property>
  <property name="Move Destination Directory">/processed</property>
</processor>
<!-- gzip is a compression format, so CompressContent in decompress mode handles it;
     UnpackContent is for archives such as tar and zip -->
<processor id="p3" type="org.apache.nifi.processors.standard.CompressContent">
  <property name="Mode">decompress</property>
  <property name="Compression Format">gzip</property>
</processor>
<processor id="p4" type="org.apache.nifi.processors.standard.ValidateCsv">
  <property name="Schema">customer_id:INT,event_ts:TIMESTAMP,payload:STRING</property>
  <property name="Header">true</property>
  <!-- routes to `valid` and `invalid` relationships -->
</processor>
<processor id="p5" type="org.apache.nifi.processors.aws.s3.PutS3Object">
  <property name="Bucket">analytics-landing</property>
  <property name="Object Key">events/${now():format('yyyy/MM/dd')}/${filename}</property>
  <property name="Region">us-east-1</property>
</processor>
<connection source="p1" destination="p2" relationship="success"
            back-pressure-object-threshold="1000"
            back-pressure-data-size-threshold="1 GB"/>
<connection source="p2" destination="p3" relationship="success"/>
<connection source="p3" destination="p4" relationship="success"/>
<connection source="p4" destination="p5" relationship="valid"/>

# Equivalent Airflow DAG (Python 3, Airflow 2.9)
from airflow import DAG
from airflow.providers.sftp.operators.sftp import SFTPOperator
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator
from airflow.decorators import task
from datetime import datetime, timedelta
import gzip, csv, io

with DAG(
    dag_id="sftp_to_s3_ingest",
    start_date=datetime(2026, 7, 1),
    schedule="* * * * *",  # poll every minute, Airflow's minimum cron granularity
    catchup=False,
    max_active_runs=1,
) as dag:

    @task
    def list_sftp_files() -> list[str]:
        # ... SSHHook, sftp.listdir("/incoming"), filter new ...
        return ["file_2026_07_04_1200.csv.gz", "file_2026_07_04_1201.csv.gz"]

    @task
    def process_file(filename: str) -> dict:
        # ... download, decompress, validate, upload to S3 ...
        with gzip.open(f"/tmp/{filename}", "rt") as f:
            reader = csv.DictReader(f)
            # Consume the reader while the file is still open.
            valid_rows = [r for r in reader if r.get("customer_id", "").isdigit()]
        # ... upload valid_rows to s3://analytics-landing/events/...
        return {"filename": filename, "row_count": len(valid_rows)}

    @task
    def notify_kafka(result: dict):
        # ... kafka.produce("ingest.done", key=result["filename"], value=str(result["row_count"]))
        pass

    files = list_sftp_files()
    results = process_file.expand(filename=files)
    notify_kafka.expand(result=results)

Step-by-step explanation.
- The NiFi flow is five processors wired end-to-end. ListSFTP polls every 10 seconds and emits one FlowFile per new file; attributes carry filename, path, size, mod.date. FetchSFTP downloads the content bytes and marks the source file as moved. The third processor decompresses the gzip (CompressContent in decompress mode is the processor that handles gzip; UnpackContent is for archives such as tar and zip). ValidateCsv routes each FlowFile to the valid or invalid relationship (invalid data goes to a separate downstream flow, usually a dead-letter Kafka topic). PutS3Object uploads. The whole flow reacts within seconds of the file appearing on SFTP.
- The Airflow DAG polls once per minute (Airflow's minimum cron granularity), so worst-case latency is 60 s just to notice the file, before any processing. The DAG uses dynamic task mapping (.expand) to fan out across files, but every mapped task is a separate task instance in the metadata DB. 500 files per day = 500 task rows per day per mapped stage, which is roughly 180,000 rows per year per stage.
- The provenance story is stark. NiFi tracks each FlowFile end-to-end: you can select a single file in the Provenance UI, see every processor it hit with timings, and replay it. Airflow's task-run history is at task granularity; per-file replay requires custom XCom bookkeeping or a bespoke lineage layer (OpenLineage / Marquez).
- The code-volume story: NiFi is about 40 lines of XML that a UI designer built by dragging processors onto a canvas. Airflow is about 80 lines of Python once the elided task bodies are filled in, plus the runtime dependencies (SSHHook, S3Hook, Kafka producer), plus a scheduler, plus a metadata DB. The Airflow surface area is considerably larger.
- The recommendation for the bank: NiFi. The on-prem infra team already runs it, the latency requirement fits the flow-based model, the provenance story satisfies compliance, and there is no cross-DAG scheduling requirement that would favour Airflow.
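The metadata-row and polling-latency arithmetic from the steps above is easy to check by hand. The sketch below uses the upper volume bound from the requirements table; the constant names are illustrative, and the count of two mapped stages is read off the DAG listing (`process_file` and `notify_kafka` are both expanded per file).

```python
FILES_PER_DAY = 500      # upper bound from the requirements table
MAPPED_STAGES = 2        # process_file and notify_kafka are expanded per file
DAYS_PER_YEAR = 365

# One task-instance row per file per mapped stage.
rows_per_stage_per_year = FILES_PER_DAY * DAYS_PER_YEAR
rows_per_year = rows_per_stage_per_year * MAPPED_STAGES

# Worst-case time just to notice a new file equals the polling interval.
nifi_notice_s = 10       # ListSFTP polling interval in the NiFi flow
airflow_notice_s = 60    # one-minute cron schedule in the DAG
sla_s = 60               # end-to-end SLA from the requirements table

# Seconds left for download, decompress, validate, and upload after detection.
nifi_budget_s = sla_s - nifi_notice_s
airflow_budget_s = sla_s - airflow_notice_s

print(rows_per_stage_per_year, rows_per_year, nifi_budget_s, airflow_budget_s)
```

Under these assumptions the Airflow design can spend its whole SLA budget on detection alone in the worst case, which is the point the latency row of the comparison table makes.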
Output.
| Dimension | NiFi | Airflow |
|---|---|---|
| End-to-end latency | ~10 s | 60 s minimum |
| Lines of code / config | ~40 XML | ~80 Python + operators + scheduler |
| Provenance grain | Per-FlowFile, out of the box | Per-task-run; per-record needs OpenLineage |
| Operational surface | NiFi cluster only | Scheduler + workers + metadata DB + queue |
| Team fit for on-prem bank | Strong | Weak |
| Cross-DAG dependencies | Poor | Strong |
Rule of thumb. Match the tool to the shape of the workload. Continuous, per-record, low-latency, on-prem, integration-heavy → NiFi. Batch, cross-cluster, Python-first, cloud-native, schedule-driven → Airflow. Almost every senior interview NiFi question ultimately tests whether you can name the workload shape before naming the tool.
Worked example — the "why not custom Python?" pushback
Detailed explanation. A backend team pushes back on introducing NiFi with "we can just write 300 lines of Python." The concrete answer is a list of five capabilities that turn 300 lines into 3000 lines of undifferentiated retry / lineage / observability plumbing — retry with backoff, idempotency, per-file state, structured logging with metrics, and provenance. NiFi ships all five out of the box; the team's Python does none until someone spends a sprint on each.
Question. Enumerate the five capabilities that custom Python has to grow to match a NiFi flow, estimate the engineering cost in senior-engineer-weeks, and quantify the maintenance burden.
Input.
| Capability | NiFi | Custom Python (naive) | Custom Python (production) |
|---|---|---|---|
| Retry with backoff | Built-in per-processor | Not present | 2 weeks |
| Idempotency + dedupe | State store + provenance | Not present | 3 weeks |
| Per-file state (offset) | Persistent Repo | In-memory | 2 weeks |
| Structured logs + metrics | Prometheus endpoint | print()s | 1 week |
| Provenance / replay | Per-FlowFile UI | Not present | 4 weeks |
Code.
# The naive custom Python: 300 lines, works in dev, breaks in prod
import paramiko, gzip, csv, boto3, time
from pathlib import Path

def poll_and_process():
    while True:
        with paramiko.SFTPClient.from_transport(...) as sftp:
            for filename in sftp.listdir("/incoming"):
                # download, decompress, validate, upload
                sftp.get(f"/incoming/{filename}", f"/tmp/{filename}")
                with gzip.open(f"/tmp/{filename}", "rt") as f:
                    rows = list(csv.DictReader(f))
                valid = [r for r in rows if r.get("customer_id", "").isdigit()]
                # boto3 upload
                boto3.client("s3").put_object(
                    Bucket="landing", Key=f"events/{filename}", Body="\n".join(str(r) for r in valid)
                )
                sftp.rename(f"/incoming/{filename}", f"/processed/{filename}")
        time.sleep(10)

# What's missing:
# - retry with backoff on SFTP disconnect
# - idempotency (what if we crash after S3 put but before SFTP rename?)
# - state persistence (what if the process restarts mid-batch?)
# - metrics (how do we know the job is healthy?)
# - provenance (which specific file failed validation and why?)

<!-- The equivalent NiFi flow: five processors, all five capabilities baked in -->
<processor type="ListSFTP">
  <property name="Polling Interval">10 sec</property>
  <!-- state (offset) persisted automatically in the State Manager -->
</processor>
<processor type="FetchSFTP">
  <property name="Completion Strategy">Move File</property>
  <!-- idempotency: file is moved only after successful download -->
</processor>
<!-- gzip decompression is CompressContent in decompress mode; UnpackContent handles tar/zip archives -->
<processor type="CompressContent"/>
<processor type="ValidateCsv"/>
<processor type="PutS3Object">
  <!-- retry with back-off is built in: it is configured per relationship in the
       processor's settings (for example 5 attempts on `failure`), not as a processor property -->
</processor>
<!-- Prometheus /metrics endpoint, Provenance UI, State Manager, Bulletin Board: all free -->
Step-by-step explanation.
- The naive Python is 300 lines and does the happy path. It ignores every real-world failure mode: SFTP disconnects mid-download, the S3 put succeeds but the process crashes before renaming the source file, the process restarts and reprocesses everything, two workers pick up the same file, or the file is malformed.
- Retry with backoff: NiFi's per-processor retry policy handles transient failures automatically with per-relationship retry counts and back-off. The Python needs a tenacity decorator or a custom retry helper on every I/O call: 2 weeks of engineering plus edge cases forever.
- Idempotency and state: NiFi's State Manager persists ListSFTP's cursor across restarts; the FetchSFTP "Move File" completion strategy ensures a file is moved only after successful download. Python needs an external state store (Redis, DynamoDB, a local SQLite) plus a two-phase-commit-style protocol: 5 weeks across the idempotency and per-file-state rows of the table (3 + 2), plus a whole class of bugs.
- Structured logs, metrics, provenance: NiFi exposes a Prometheus /metrics endpoint out of the box, a Provenance UI where you can pick any FlowFile and replay it, and a Bulletin Board for structured operator messages. Python needs structlog + prometheus_client + a custom lineage layer: 5 weeks (1 + 4).
- The total custom-Python cost to reach production parity with NiFi: ~12 senior-engineer-weeks (2 + 5 + 5), plus a permanent maintenance burden. NiFi's cost is learning the framework, which is a one-time investment and easier to hire against.
Output.
| Metric | Naive Python | Prod Python | NiFi |
|---|---|---|---|
| Time to first prod deploy | 1 week | 12 weeks | 1 week |
| Lines of code / config | 300 | 3000+ | ~40 config |
| Retry with backoff | none | custom | built-in |
| Provenance | none | custom | built-in |
| Ongoing maintenance | high | high | low |
Rule of thumb. The push-back on frameworks is almost always a misjudgement of what "production" costs. If the workload has any of retry, idempotency, per-record state, provenance, or observability requirements, the framework wins on total cost within 6 months — and NiFi is specifically the framework whose five built-ins are exactly the five that custom Python grows into.
Worked example — NiFi 2.x Python processor for schema-drift routing
Detailed explanation. A telco team ingests device logs from 40 site-router types, each emitting a slightly different CSV schema. The requirement is a per-FlowFile schema fingerprint attribute so a RouteOnAttribute step can branch to a type-specific downstream flow. In NiFi 1.x this was a Groovy ExecuteScript; NiFi 2.x lets you write a proper Python processor that subclasses FlowFileTransform, drop it into NiFi's Python extensions directory ($NIFI_HOME/python/extensions in a default 2.x install, configurable in nifi.properties), and get auto-discovery on flow load with no .nar build and no Java.
Question. Write the Python processor, wire it into a flow, and show the resulting FlowFile attributes after processing.
Input.
| Parameter | Value |
|---|---|
| Input format | CSV, header line first |
| Attribute to add | schema.fp |
| Fingerprint | SHA-256 of the header line, first 8 hex chars |
| Downstream router | RouteOnAttribute on ${schema.fp} |
Code.
# nifi-schema-fingerprint.py: drop this into NiFi's Python extensions directory
# ($NIFI_HOME/python/extensions in a default 2.x install)
import hashlib
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
from nifiapi.properties import PropertyDescriptor, StandardValidators


class SchemaFingerprint(FlowFileTransform):
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "1.0.0"
        description = "Adds a schema.fp attribute derived from the CSV header line."
        tags = ["csv", "schema", "fingerprint", "routing"]

    HEADER_ONLY = PropertyDescriptor(
        name="Header Only",
        description="Compute fingerprint from the first line only.",
        required=True,
        default_value="true",
        validators=[StandardValidators.BOOLEAN_VALIDATOR],
    )

    def __init__(self, **kwargs):
        # The framework passes keyword arguments when it instantiates the processor.
        pass

    def getPropertyDescriptors(self):
        return [self.HEADER_ONLY]

    def transform(self, context, flowfile):
        # Read the content bytes; pull the first line
        content = flowfile.getContentsAsBytes()
        first_newline = content.find(b"\n")
        if first_newline < 0:
            return FlowFileTransformResult(
                relationship="failure",
                attributes={"schema.error": "no header line"},
            )
        header = content[:first_newline].decode("utf-8", errors="replace").strip()
        fp = hashlib.sha256(header.encode("utf-8")).hexdigest()[:8]
        return FlowFileTransformResult(
            relationship="success",
            attributes={"schema.fp": fp, "schema.header": header},
            # content unchanged
        )

<!-- Downstream routing on the attribute -->
<processor type="python.SchemaFingerprint"/>
<processor type="RouteOnAttribute">
  <property name="Routing Strategy">Route to Property Name</property>
  <property name="router_type_A">${schema.fp:equals('a1b2c3d4')}</property>
  <property name="router_type_B">${schema.fp:equals('e5f6a7b8')}</property>
  <property name="router_type_C">${schema.fp:equals('c9d0e1f2')}</property>
  <!-- FlowFiles that match none of the properties above leave on the built-in
       `unmatched` relationship; no extra property is needed for that -->
</processor>
Step-by-step explanation.
- The Python processor lives in NiFi's Python extensions directory ($NIFI_HOME/python/extensions in a default 2.x install) as nifi-schema-fingerprint.py. On NiFi 2.x startup, the runtime scans that directory, imports the class, and makes it available on the canvas under python.SchemaFingerprint. No .nar build, no Maven, no Java tooling.
- transform(context, flowfile) is the entry point. It reads the content with getContentsAsBytes(), which pulls the whole payload into the Python process (fine for log files, worth watching for multi-GB FlowFiles), takes the first line, computes an 8-char SHA-256 prefix, and returns a FlowFileTransformResult with the new attributes. The content itself is left unchanged.
- The processor emits to the success relationship for good FlowFiles and failure for headerless FlowFiles. The failure relationship is wired to a dead-letter Kafka topic downstream.
- The RouteOnAttribute processor uses NiFi's Expression Language: ${schema.fp:equals('a1b2c3d4')} returns true for FlowFiles whose fingerprint matches the router-type-A header. Each matched relationship carries the FlowFile to a downstream site-router-specific flow; FlowFiles that match no rule leave on the unmatched relationship.
- The full pipeline: ingest → fingerprint → route-on-attribute → 40 type-specific downstream flows. Adding a new router type is a one-line addition to the RouteOnAttribute property list; no code deploy, no Java build.
Output.
| FlowFile | schema.fp attribute | Routed to |
|---|---|---|
| router-A-log-2026-07-04.csv | a1b2c3d4 | router_type_A relationship |
| router-B-log-2026-07-04.csv | e5f6a7b8 | router_type_B relationship |
| router-Z-log-2026-07-04.csv | 9f8e7d6c | unmatched relationship |
| headerless-file.csv | (none, failure) | dead-letter topic |
Rule of thumb. Whenever a flow needs a per-record transform that fits in <100 lines, write a NiFi 2.x Python processor — it's the shortest path from "custom logic" to "on the canvas, reusable, observable." Reserve Java processors for high-throughput hot-path work where JVM performance matters.
Senior interview question on when to reach for NiFi
A senior interviewer often opens with: "You've inherited an on-prem data platform in a bank. The team is drowning in bespoke Python cron jobs that ingest from SFTP, JMS, and JDBC. Walk me through how you'd evaluate whether NiFi is the right consolidation target, what specific processors you'd start with, and what you'd measure to declare the migration a success."
Solution Using a three-week NiFi consolidation POC
<!-- Week 1 flow: SFTP → validate → HDFS (one of the easier legacy scripts) -->
<processor id="p1" type="ListSFTP">
  <property name="Hostname">sftp.on-prem.bank</property>
  <property name="Remote Path">/eod/positions</property>
  <property name="Polling Interval">30 sec</property>
</processor>
<processor id="p2" type="FetchSFTP">
  <property name="Completion Strategy">Move File</property>
  <property name="Move Destination Directory">/eod/processed</property>
</processor>
<processor id="p3" type="ValidateCsv">
  <property name="Schema">book:STRING,pos_ccy:STRING,pos_amt:NUMBER,as_of:DATE</property>
</processor>
<processor id="p4" type="PutHDFS">
  <property name="Directory">/warehouse/positions/${now():format('yyyy/MM/dd')}</property>
  <property name="Compression codec">SNAPPY</property>
</processor>
<connection source="p1" destination="p2" back-pressure-object-threshold="500"/>
<connection source="p2" destination="p3" back-pressure-object-threshold="500"/>
<connection source="p3" destination="p4" relationship="valid"/>
<connection source="p3" destination="deadletter" relationship="invalid"/>

# Success metrics, measured at end of week 3
week_3_success_criteria:
  latency_p99: "< 60 s"                 # SFTP → HDFS end-to-end
  throughput: "> 500 files/hr"          # equivalent to legacy cron
  replay_capability: "single-FlowFile"  # provenance-driven
  lines_of_code_deleted: "~2500"        # legacy Python cron
  new_ops_surface: "1 NiFi cluster"     # replaces 8 cron scripts
  compliance_ready: true                # provenance satisfies audit
Step-by-step trace.
| Week | Activity | Deliverable |
|---|---|---|
| 1 | Migrate SFTP → HDFS positions flow | One flow, one team trained |
| 2 | Add JMS → Kafka flow; wire NiFi Registry | Two flows, versioned in Registry |
| 3 | Add JDBC extract → Parquet; instrument provenance + Prometheus | Three flows, observability baseline |
After three weeks, the team has three canonical NiFi flows replacing eight custom Python scripts, a Registry-backed dev/staging/prod promotion path, and a Provenance UI they can hand to compliance during audit. The latency and throughput numbers match or beat the legacy cron; the operational surface is a single NiFi cluster instead of eight independent scripts.
Output:
| Metric | Legacy Python cron | NiFi POC |
|---|---|---|
| Time to first working flow | 1 sprint per script | 1 day per flow |
| End-to-end latency | 15 min (cron interval) | 30 s (polling) |
| Provenance | ad-hoc logging | per-FlowFile UI |
| Retry on failure | manual | built-in per relationship |
| Compliance evidence | grep the logs | Provenance report |
Why this works — concept by concept:
- Flow-based programming — NiFi models data movement as a graph of independent processors with bounded queues, not a schedule of Python scripts. The mental model matches the workload; the code matches the mental model.
- Per-processor retry + idempotency — every processor has a retry policy and an at-least-once delivery guarantee. The legacy Python invented these one script at a time; NiFi ships them once.
- Registry-backed versioning — commit a flow to Registry, promote across environments, roll back to a prior version — the same git workflow, applied to data-movement configuration.
- Provenance as compliance evidence — auditors ask "which record did we drop, why, and can you replay it?" — Provenance answers all three in one UI. Custom Python has to invent this from scratch.
- Cost — three senior-engineer-weeks for the POC; avoided cost of ~2500 lines of legacy Python maintenance forever. O(1) operational surface per additional flow after the framework is installed.
2. FlowFile + Processor model
The atom of NiFi — FlowFile is envelope + content, Processor is a single-responsibility gear
The mental model in one line: a NiFi FlowFile is a lightweight reference object with two parts, a map of key-value attributes and a pointer into the content repository for the payload bytes, and a NiFi processor is a stateless single-responsibility node that reads FlowFiles from an incoming queue, transforms attributes and/or content, and emits FlowFiles onto one or more named output relationships. Every advanced NiFi concept (routing, provenance, backpressure, clustering) is built on top of this two-part atom.
The FlowFile — anatomy of the envelope.
- Attributes. A Map<String, String> carried with the FlowFile. Standard attributes always present: uuid (globally unique across the cluster), filename (usually the source file name), path (source path), entryDate (when the FlowFile entered NiFi). Custom attributes are added by processors and by user-defined UpdateAttribute steps.
- Content. A pointer into the content repository, NiFi's on-disk store of FlowFile payload bytes. Content is copy-on-write: two FlowFiles that share the same payload point at the same content-repository claim. The pointer is cheap; the bytes are only rewritten when a processor actually modifies the payload.
- Lineage. The provenance repository records every event that touched the FlowFile — created, forked, cloned, dropped, sent, received, routed, attributes-modified, content-modified. This is what makes per-FlowFile replay possible.
- Persistence. Attributes live in the flowfile repository (a fast write-ahead log). Content lives in the content repository (a claim-based bag of bytes). Provenance lives in the provenance repository (a search index). Together, the three repositories give NiFi its "no data lost on crash" guarantee.
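The envelope-plus-pointer structure and the copy-on-write behaviour described above can be modelled in a few lines. This is a teaching sketch, not NiFi's data model: `ContentClaim`, `ContentRepository`, and the in-memory dict standing in for the on-disk repository are all hypothetical names.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class ContentClaim:
    """Pointer into a toy content repository; the bytes live elsewhere."""
    claim_id: str


@dataclass(frozen=True)
class FlowFile:
    attributes: Dict[str, str]
    claim: ContentClaim


class ContentRepository:
    """In-memory stand-in for the on-disk content repository."""

    def __init__(self) -> None:
        self._claims: Dict[str, bytes] = {}

    def write(self, data: bytes) -> ContentClaim:
        claim = ContentClaim(claim_id=str(uuid.uuid4()))
        self._claims[claim.claim_id] = data
        return claim

    def read(self, claim: ContentClaim) -> bytes:
        return self._claims[claim.claim_id]


repo = ContentRepository()
original = FlowFile({"filename": "a.csv"}, repo.write(b"id,ts\n1,2\n"))

# Attribute-only change: new envelope, same claim, no payload bytes copied.
tagged = replace(original, attributes={**original.attributes, "schema.fp": "a1b2c3d4"})

# Content change: the new bytes go to a new claim; the old claim is untouched.
upper = replace(tagged, claim=repo.write(repo.read(tagged.claim).upper()))
```

Because the original claim survives the content change, the earlier version of the payload is still readable, which is the property that per-FlowFile replay relies on.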
The Processor — anatomy of the gear.
- Input. One or more incoming connections (queues). A processor with zero inputs is a source, typically a Get*, List*, Consume*, or Listen* processor that pulls from or listens to an external system.
- Output relationships. Named outputs like success, failure, retry, matched, unmatched, valid, invalid. Each relationship is a distinct wire on the canvas; each wire is a distinct queue with its own backpressure config.
- Configuration. Every processor exposes typed properties: thresholds, host/port, schema, cron expressions. Sensitive properties (passwords, keys) are stored encrypted in the flow definition.
- Concurrency. Each processor has a "Concurrent Tasks" property, the number of threads the framework will use to invoke onTrigger(). Default is 1; high-throughput processors bump this to 4–16.
- Statelessness. Processors do not hold data: every FlowFile that a processor processes is a fresh input from the incoming queue. Any per-flow state (offsets, counters) lives in the NiFi State Manager, not in processor memory.
The five most-used processors in an interview answer.
- GetFile / ListFile + FetchFile. Poll a directory for new files. GetFile is the simplest single-step version; ListFile + FetchFile is the cluster-safe pair (list on the primary node, fetch on any node; see the clustering section).
- GetHTTP / InvokeHTTP / ListenHTTP. HTTP client and server. InvokeHTTP is the modern general-purpose HTTP call and the one to name first (GetHTTP is a legacy 1.x processor); ListenHTTP accepts inbound requests.
- ExecuteSQL / PutDatabaseRecord / QueryDatabaseTable. JDBC extraction and load. QueryDatabaseTable supports incremental extraction by tracking a max value column.
- ConsumeKafka / PublishKafka. Kafka integration. Both processors support the whole Kafka client config surface: SSL, SASL, transactional, headers.
- PutS3Object / FetchS3Object. S3 read/write. Similar processors exist for Azure Blob, GCS, HDFS, JMS, MQTT, MongoDB, Elasticsearch, and a hundred other targets.
Attributes vs content — the interview probe you must answer.
- Cheap operations. Attribute changes are `Map` mutations, on the order of nanoseconds. Routing on an attribute (`RouteOnAttribute`) reads the attribute value and picks a relationship without ever touching the content bytes.
- Expensive operations. Content changes copy the bytes into a new content-repository claim. `ReplaceText`, `ConvertRecord`, and `UpdateRecord` all rewrite the payload and pay the disk-I/O cost.
- The optimization. Whenever you can route or filter on an attribute without touching content, do so. Extract the routing key into an attribute once (with a cheap `ExtractText` or a Python processor that only reads the first N bytes), then route on the attribute forever after.
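The attribute/content split can be pictured with a small plain-Python model. This is a sketch, not NiFi code: `ContentRepo`, `FlowFile`, and `route_on_attribute` are hypothetical names invented for illustration, and the read counter stands in for content-repository disk I/O.

```python
from dataclasses import dataclass, field


@dataclass
class ContentRepo:
    """Toy stand-in for the content repository; counts every payload read."""
    claims: dict = field(default_factory=dict)
    reads: int = 0

    def write(self, claim_id: str, payload: bytes) -> str:
        self.claims[claim_id] = payload
        return claim_id

    def read(self, claim_id: str) -> bytes:
        self.reads += 1  # each call models one disk read of the payload
        return self.claims[claim_id]


@dataclass
class FlowFile:
    attributes: dict  # small key/value map held in memory
    claim_id: str     # pointer to the payload, not the payload itself


def route_on_attribute(ff: FlowFile, key: str, allowed: set) -> str:
    """Pick a relationship from an attribute alone; content is never read."""
    value = ff.attributes.get(key)
    return value if value in allowed else "unmatched"


repo = ContentRepo()
claim = repo.write("claim-1", b'{"region": "eu", "amount": 42}')
ff = FlowFile(attributes={"region": "eu"}, claim_id=claim)

relationship = route_on_attribute(ff, "region", {"us", "eu", "apac"})
print(relationship, repo.reads)
```

The routing call returns a relationship while the read counter stays at zero, which is the whole argument for extracting routing keys into attributes.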
The Expression Language — NiFi's attribute-templating grammar.
- Syntax. `${attribute_name}` for value lookups; `${attr:function()}` for chained functions.
- Functions. `equals`, `contains`, `startsWith`, `matches` (regex), `substring`, `toUpper`, `toLower`, `format` (dates), `now`, `getStateValue`, math (`plus`, `minus`, `mod`).
- Boolean logic. `${attr1:equals('a'):and(${attr2:contains('b')})}`.
- The gotcha. Expression Language is evaluated per-FlowFile against attributes only: you cannot reach into the content bytes from Expression Language. If you need content-driven routing, use a processor that extracts the value to an attribute first.
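To make the boolean example concrete, here is a hand translation into Python. It is not an Expression Language parser; `equals`, `contains`, and `predicate` are hypothetical helpers, and treating a missing attribute as "no match" is an assumption of this sketch.

```python
def equals(value, expected):
    # Assumption: a missing attribute never equals a literal.
    return value is not None and value == expected


def contains(value, fragment):
    # Assumption: a missing attribute contains nothing.
    return value is not None and fragment in value


def predicate(attributes: dict) -> bool:
    """Hand translation of ${attr1:equals('a'):and(${attr2:contains('b')})}."""
    return equals(attributes.get("attr1"), "a") and contains(attributes.get("attr2"), "b")


samples = [
    {"attr1": "a", "attr2": "abc"},  # both conditions hold
    {"attr1": "a", "attr2": "xyz"},  # second condition fails
    {"attr2": "abc"},                # attr1 is missing
]
results = [predicate(s) for s in samples]
print(results)
```

Note that `predicate` receives only the attribute map. Nothing in its signature can reach the payload, which mirrors the gotcha in the last bullet.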
Common interview probes on FlowFile + processor.
- "What does a FlowFile contain?" Attributes plus a content pointer, not the bytes themselves.
- "How does NiFi survive a crash?" Three repositories with write-ahead logs.
- "Why is routing-by-attribute cheaper than routing-by-content?" Attributes are `Map` reads; content is disk I/O.
- "What is a relationship?" A named output edge on a processor with its own downstream queue.
- "How do you extract a value from content into an attribute?" `ExtractText`, `EvaluateJsonPath`, `EvaluateXPath`, or a custom processor.
Worked example — a five-processor flow: get → validate → enrich → route → sink
Detailed explanation. The canonical starter flow for a NiFi interview whiteboard: ingest JSON events from an HTTPS endpoint, validate the schema, enrich with a lookup, route on the `region` business field, and sink to one of three Kafka topics. Five pipeline processors feed three Kafka sinks over seven connections; each processor adds attributes and rewrites content only when strictly required. Every senior interview flow-design question is a variant of this pattern.
Question. Design the flow, list the attributes each processor adds, and explain why the routing-by-attribute pattern is cheaper than the routing-by-content alternative.
Input.
| Parameter | Value |
|---|---|
| Source | HTTPS POST on `/events` |
| Payload | JSON `{"event_id":..., "customer_id":..., "region":"us |
| Validation | Schema — event_id, customer_id, region are required |
| Enrichment | Lookup `customer_tier` from a Postgres lookup service |
| Routing | `region` decides which Kafka topic |
| Sinks | Three Kafka topics: `events.us`, `events.eu`, `events.apac` |
Code.
<!-- Flow: 5 pipeline processors + 3 Kafka sinks, 7 connections (route branches into 3 sinks) -->
<processor id="p1" type="ListenHTTP">
<property name="Listening Port">8443</property>
<property name="Base Path">events</property>
<property name="SSL Context Service">ssl-svc</property>
</processor>
<processor id="p2" type="ValidateRecord">
<property name="Record Reader">json-tree-reader</property>
<property name="Record Writer">json-record-writer</property>
<property name="Schema Access Strategy">Use 'Schema Name' Property</property>
<property name="Schema Name">event-v1</property>
<!-- routes: valid, invalid -->
</processor>
<processor id="p3" type="LookupRecord">
<property name="Record Reader">json-tree-reader</property>
<property name="Lookup Service">customer-tier-lookup-svc</property>
<property name="Result RecordPath">/customer_tier</property>
<property name="Lookup Key Path">/customer_id</property>
</processor>
<processor id="p4" type="EvaluateJsonPath">
<property name="Destination">flowfile-attribute</property>
<property name="region">$.region</property>
<!-- extracts region into attribute; content unchanged -->
</processor>
<processor id="p5" type="RouteOnAttribute">
<property name="Routing Strategy">Route to Property Name</property>
<property name="us">${region:equals('us')}</property>
<property name="eu">${region:equals('eu')}</property>
<property name="apac">${region:equals('apac')}</property>
</processor>
<processor id="s_us" type="PublishKafka_2_6">
<property name="Topic Name">events.us</property>
</processor>
<processor id="s_eu" type="PublishKafka_2_6">
<property name="Topic Name">events.eu</property>
</processor>
<processor id="s_apac" type="PublishKafka_2_6">
<property name="Topic Name">events.apac</property>
</processor>
<connection source="p1" destination="p2" relationship="success"/>
<connection source="p2" destination="p3" relationship="valid"/>
<connection source="p3" destination="p4" relationship="success"/>
<connection source="p4" destination="p5" relationship="matched"/>
<connection source="p5" destination="s_us" relationship="us"/>
<connection source="p5" destination="s_eu" relationship="eu"/>
<connection source="p5" destination="s_apac" relationship="apac"/>
Step-by-step explanation.
- `ListenHTTP` accepts inbound POSTs on port 8443. Every request becomes one FlowFile: the request body is the content, and headers plus source IP go into attributes (`http.method`, `http.remote.host`, `http.request.uri`). Attributes are set once; content is written once.
- `ValidateRecord` runs the JSON payload against the registered `event-v1` schema. FlowFiles that pass go to the `valid` relationship; failures go to `invalid` (usually wired to a dead-letter Kafka topic for later inspection). The only attribute change here is the `record.count` bookkeeping attribute.
- `LookupRecord` calls out to the Postgres-backed `customer-tier-lookup-svc` for every FlowFile, using the `customer_id` from the JSON body as the lookup key, and writes the result back into a `customer_tier` field in the JSON. This does rewrite content: the enriched JSON is a new content-repository claim.
- `EvaluateJsonPath` extracts the `region` field from the JSON body into a FlowFile attribute (`region`). This costs one JSONPath evaluation over the content plus one attribute write, and it spares every downstream routing step from reading the content bytes again.
- `RouteOnAttribute` reads the `region` attribute and picks one of three relationships based on the value. Because we routed on the attribute (not on the content), NiFi never re-reads the payload bytes from disk for the routing decision; it's a `Map` lookup in memory. Downstream `PublishKafka` processors take the FlowFile as-is and publish the JSON to the region-specific topic.
Output.
| Step | Attribute changes | Content changes | Cost |
|---|---|---|---|
| ListenHTTP | +http.* attrs | write once (content) | O(1) attr + O(payload) disk |
| ValidateRecord | +record.count | none | O(payload) parse |
| LookupRecord | none | rewrite content | O(payload) parse + O(1) DB call |
| EvaluateJsonPath | +region | none | O(payload) JSONPath |
| RouteOnAttribute | none | none | O(1) attr read |
| PublishKafka | none | read once | O(payload) network |
Rule of thumb. In every NiFi flow, push the cheapest decisions to attributes and the inevitable content rewrites to the smallest possible number of processors. Routing on attributes is the difference between a flow that scales to millions of FlowFiles per minute and one that saturates the content repository disk.
Worked example — attribute vs content routing at scale
Detailed explanation. A high-volume telemetry flow processes 100k events/minute. The naive design routes on a field pulled from content at every hop — each hop is a disk seek plus a JSON parse. The tuned design extracts the routing key into an attribute exactly once at ingest, and every downstream routing step reads the attribute (a Map lookup in memory). Quantify the difference in CPU, disk I/O, and max single-node throughput.
Question. Compare the two designs on CPU cost per FlowFile, content-repository disk I/O, and the max throughput of a single-node NiFi.
Input.
| Parameter | Value |
|---|---|
| Throughput | 100k FlowFiles/minute (~1670/s) |
| Payload size | 2 KB JSON |
| Routing hops | 3 (region → environment → tier) |
| Node | 8-core, NVMe content repo |
Code.
<!-- NAIVE — route on content at every hop -->
<processor id="naive1" type="RouteOnContent">
<property name="Match Requirement">content must contain match</property>
<property name="us">"region":"us"</property>
<property name="eu">"region":"eu"</property>
</processor>
<processor id="naive2" type="RouteOnContent">
<property name="dev">"env":"dev"</property>
<property name="prod">"env":"prod"</property>
</processor>
<processor id="naive3" type="RouteOnContent">
<property name="gold">"tier":"gold"</property>
<property name="silver">"tier":"silver"</property>
</processor>
<!-- TUNED — extract to attributes once, route on attributes -->
<processor id="extract" type="EvaluateJsonPath">
<property name="Destination">flowfile-attribute</property>
<property name="region">$.region</property>
<property name="env">$.env</property>
<property name="tier">$.tier</property>
</processor>
<processor id="tuned1" type="RouteOnAttribute">
<property name="us">${region:equals('us')}</property>
<property name="eu">${region:equals('eu')}</property>
</processor>
<processor id="tuned2" type="RouteOnAttribute">
<property name="dev">${env:equals('dev')}</property>
<property name="prod">${env:equals('prod')}</property>
</processor>
<processor id="tuned3" type="RouteOnAttribute">
<property name="gold">${tier:equals('gold')}</property>
<property name="silver">${tier:equals('silver')}</property>
</processor>
Step-by-step explanation.
- Under the naive design, every routing hop reads the content bytes (2 KB) from the content repository, streams them through the `RouteOnContent` regex matcher, and picks a relationship. Three hops × 2 KB = 6 KB of disk read per FlowFile. At 1670 FlowFiles/second, this is ~10 MB/s of routing-only disk I/O.
- Under the tuned design, `EvaluateJsonPath` reads the content once, extracts three attributes, and hands off. All three downstream `RouteOnAttribute` processors read attributes from memory, with zero disk I/O per hop. Total content reads per FlowFile: 1 (the extract).
- CPU cost per hop (illustrative figures): naive = content read + regex scan ≈ 100 µs; tuned = `Map` lookup ≈ 100 ns. Three-hop naive = 300 µs per FlowFile; three-hop tuned = 300 ns + the one 100 µs extract ≈ 100 µs. Roughly a 3× CPU reduction.
- Max throughput on a single node (again illustrative; measure on your own hardware): naive is I/O bound at ~5000 FlowFiles/s once the content repo saturates; tuned is CPU bound at ~50000 FlowFiles/s. About 10× throughput.
- The extract step is the single most valuable pattern in a production NiFi flow. Every senior NiFi engineer's first tuning pass is "find every `RouteOnContent`, replace with `EvaluateJsonPath` + `RouteOnAttribute`."
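The arithmetic in these steps can be checked with a few lines of Python. The per-hop costs (100 µs for a content read and scan, 100 ns for an attribute lookup) are the illustrative figures from the text, not measurements, and `routing_cost` is a hypothetical helper.

```python
PAYLOAD_BYTES = 2_000               # 2 KB JSON, decimal units
FLOWFILES_PER_SEC = 100_000 / 60    # 100k FlowFiles per minute
HOPS = 3
CONTENT_HOP_US = 100.0              # assumed cost of one content read + scan
ATTRIBUTE_HOP_US = 0.1              # assumed cost of one in-memory lookup (100 ns)


def routing_cost(content_reads: int, attribute_reads: int) -> dict:
    """Disk read rate and CPU time per FlowFile for a given routing design."""
    read_bytes_per_sec = content_reads * PAYLOAD_BYTES * FLOWFILES_PER_SEC
    cpu_us = content_reads * CONTENT_HOP_US + attribute_reads * ATTRIBUTE_HOP_US
    return {
        "disk_mb_per_sec": read_bytes_per_sec / 1e6,
        "cpu_us_per_flowfile": cpu_us,
    }


naive = routing_cost(content_reads=HOPS, attribute_reads=0)
tuned = routing_cost(content_reads=1, attribute_reads=HOPS)
cpu_ratio = naive["cpu_us_per_flowfile"] / tuned["cpu_us_per_flowfile"]

print(naive)
print(tuned)
print(round(cpu_ratio, 2))
```

Under these assumptions the naive design reads about 10 MB/s for routing alone, the tuned design about a third of that, and the CPU ratio lands just under 3.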
Output.
| Design | Content reads / FlowFile | CPU / FlowFile | Max throughput |
|---|---|---|---|
| Naive (RouteOnContent × 3) | 3 | ~300 µs | ~5000/s |
| Tuned (EvaluateJsonPath + RouteOnAttribute × 3) | 1 | ~100 µs | ~50000/s |
Rule of thumb. Content is expensive; attributes are free. Extract the routing key to an attribute exactly once per flow. Never route on content when you can route on an attribute.
Worked example — a Processor with retry, failure, and a dead-letter relationship
Detailed explanation. A JDBC extract processor hits an unstable legacy Oracle DB — 95% success, 4% transient (retry), 1% permanent (dead-letter). Interviewers ask "how do you handle a flaky upstream in NiFi?" and the answer is the three-relationship pattern: success → happy path; failure → RetryFlowFile gate that caps retries; retries_exceeded → dead-letter Kafka topic where operators can inspect out-of-band.
Question. Design the three-relationship wiring for ExecuteSQL against a flaky Oracle DB with a per-FlowFile retry cap of 5 and a dead-letter topic.
Input.
| Parameter | Value |
|---|---|
| Source processor | ExecuteSQL against Oracle |
| Success rate | 95% |
| Transient failure rate | 4% |
| Permanent failure rate | 1% |
| Retry cap | 5 attempts |
| Dead-letter | Kafka topic `oracle.deadletter` |
Code.
<processor id="query_gen" type="GenerateFlowFile">
<property name="Custom Text">SELECT id, name, updated_at FROM customers WHERE updated_at > '${last_run}'</property>
<property name="Batch Size">1</property>
</processor>
<processor id="exec_sql" type="ExecuteSQL">
<property name="Database Connection Pooling Service">oracle-pool</property>
<!-- "SQL select query" is left unset, so ExecuteSQL runs the incoming FlowFile content as the query -->
<property name="Max Rows Per Flow File">10000</property>
<!-- ExecuteSQL always emits Avro. Relationships: success, failure -->
</processor>
<processor id="retry_gate" type="RetryFlowFile">
<property name="Retry Attribute">retry.count</property>
<property name="Maximum Retries">5</property>
<property name="Penalize Retries">true</property>
<!-- Relationships: retry, retries_exceeded, failure -->
</processor>
<processor id="deadletter" type="PublishKafka_2_6">
<property name="Topic Name">oracle.deadletter</property>
</processor>
<processor id="parquet_sink" type="PutHDFS">
<property name="Directory">/warehouse/customers/${now():format('yyyy/MM/dd')}</property>
</processor>
<!-- Wiring:
query_gen -> exec_sql
exec_sql (success) -> parquet_sink
exec_sql (failure) -> retry_gate
retry_gate (retry) -> exec_sql ← loops back with penalty
retry_gate (retries_exceeded) -> deadletter
-->
Step-by-step explanation.
- `GenerateFlowFile` builds the SQL query template into a FlowFile. `ExecuteSQL` runs the query against Oracle. On success, the result set is emitted as Avro on the `success` relationship.
- On a transient failure (network blip, TNS timeout), Oracle drops the connection mid-query; `ExecuteSQL` emits the input FlowFile on the `failure` relationship. This FlowFile is not the query result but the original query, so we can retry it.
- The `RetryFlowFile` processor sits on the loop from `ExecuteSQL`'s `failure` output back to `ExecuteSQL`'s input. It tracks a `retry.count` attribute on each FlowFile and increments it on every pass. Up to the retry cap (5), it emits on `retry`, which is wired back into `ExecuteSQL` for another attempt. Above the cap, it emits on `retries_exceeded`.
- The `retries_exceeded` FlowFiles go to a `PublishKafka` sink writing to `oracle.deadletter`. An operator can consume this topic, inspect the failed queries, and decide whether it's an Oracle outage (rerun the batch) or a query bug (fix and redeploy).
- The penalize setting on `RetryFlowFile` adds a per-FlowFile delay (the penalty duration, 30 s by default) on each retry. This prevents a stuck query from hammering the Oracle DB and gives the transient failure time to clear.
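The retry loop can be sketched as a small Python model. This is a simplified stand-in rather than the `RetryFlowFile` implementation: `retry_gate` and `run_query` are hypothetical names, penalties are ignored, and the exact counting rule (a FlowFile is dead-lettered on the pass after its fifth retry) is an assumption.

```python
MAX_RETRIES = 5


def retry_gate(attributes: dict, max_retries: int = MAX_RETRIES) -> str:
    """Bump retry.count, then pick 'retry' or 'retries_exceeded'."""
    count = int(attributes.get("retry.count", 0)) + 1
    if count > max_retries:
        attributes.pop("retry.count", None)  # counter is cleared on the way out
        return "retries_exceeded"
    attributes["retry.count"] = str(count)   # attributes are strings
    return "retry"


def run_query(attributes: dict, failures_before_success: int):
    """Drive one FlowFile around the exec_sql / retry_gate loop."""
    executions = 0
    while True:
        executions += 1
        if executions > failures_before_success:
            return "success", executions
        if retry_gate(attributes) == "retries_exceeded":
            return "retries_exceeded", executions


transient = run_query({}, failures_before_success=3)       # clears after 3 failures
permanent = run_query({}, failures_before_success=10**9)   # never clears
print(transient)
print(permanent)
```

In this model a permanent failure costs six executions against Oracle (the first attempt plus five retries) before the FlowFile reaches the dead-letter path, which is worth knowing when sizing the penalty duration.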
Output.
| FlowFile outcome | Relationship path | Terminal destination |
|---|---|---|
| Success | success | parquet_sink → HDFS |
| Transient failure, retried 3× then success | failure → retry → success | parquet_sink → HDFS |
| Transient failure, retried 5× and still failing | failure → retry → retries_exceeded | deadletter → Kafka |
| Permanent failure (bad SQL) | failure → retry → retries_exceeded | deadletter → Kafka |
Rule of thumb. Wire success, retry (via RetryFlowFile), and dead-letter (retries_exceeded) on every processor that touches an external system. The three-relationship pattern is the canonical NiFi answer to "handle transient failure without losing data."
Senior interview question on FlowFile + processor design
A senior interviewer might ask: "Design a NiFi flow that ingests JSON events from an HTTPS endpoint, validates each event against a registered schema, enriches with a customer-tier lookup, routes on a priority attribute to three downstream Kafka topics with different SLAs, and dead-letters malformed events. Walk me through the processors, relationships, and where you'd tune concurrency."
Solution Using the extract-once + attribute-route pattern with tuned concurrency
<!-- Full flow definition: 6 pipeline processors + 4 Kafka sinks -->
<processor id="p1" type="ListenHTTP">
<property name="Listening Port">8443</property>
<property name="Base Path">events</property>
<property name="SSL Context Service">ssl-svc</property>
<property name="Concurrent Tasks">4</property>
</processor>
<processor id="p2" type="ValidateRecord">
<property name="Record Reader">json-tree-reader</property>
<property name="Record Writer">json-record-writer</property>
<property name="Schema Name">event-v1</property>
<property name="Concurrent Tasks">4</property>
</processor>
<processor id="p3" type="LookupRecord">
<property name="Lookup Service">customer-tier-lookup-svc</property>
<property name="Result RecordPath">/customer_tier</property>
<property name="Lookup Key Path">/customer_id</property>
<property name="Concurrent Tasks">8</property>
</processor>
<processor id="p4" type="EvaluateJsonPath">
<property name="Destination">flowfile-attribute</property>
<property name="priority">$.priority</property>
<property name="region">$.region</property>
<property name="tier">$.customer_tier</property>
<property name="Concurrent Tasks">4</property>
</processor>
<processor id="p5" type="RouteOnAttribute">
<property name="Routing Strategy">Route to Property Name</property>
<property name="critical">${priority:equals('critical')}</property>
<property name="high">${priority:equals('high')}</property>
<property name="normal">${priority:equals('normal')}</property>
<property name="Concurrent Tasks">2</property>
</processor>
<processor id="p6_dlq" type="UpdateAttribute">
<property name="dlq.reason">${literal('schema-validation-failed')}</property>
</processor>
<processor id="s_critical" type="PublishKafka_2_6">
<property name="Topic Name">events.critical</property>
<property name="Delivery Guarantee">Guarantee Replicated Delivery</property>
<property name="Concurrent Tasks">4</property>
</processor>
<processor id="s_high" type="PublishKafka_2_6">
<property name="Topic Name">events.high</property>
<property name="Concurrent Tasks">4</property>
</processor>
<processor id="s_normal" type="PublishKafka_2_6">
<property name="Topic Name">events.normal</property>
<property name="Concurrent Tasks">2</property>
</processor>
<processor id="s_dlq" type="PublishKafka_2_6">
<property name="Topic Name">events.deadletter</property>
</processor>
<!-- Wiring
p1 -> p2
p2 (valid) -> p3
p2 (invalid) -> p6_dlq -> s_dlq
p3 -> p4 -> p5
p5 (critical) -> s_critical
p5 (high) -> s_high
p5 (normal) -> s_normal
p5 (unmatched) -> s_normal
-->
Step-by-step trace.
| Step | Input | Processor | Attribute changes | Content changes | Output relationship |
|---|---|---|---|---|---|
| 1 | HTTPS POST | ListenHTTP | +http.* | write once | success |
| 2 | JSON body | ValidateRecord | +record.count | none | valid / invalid |
| 3 | JSON body | LookupRecord | none | rewrite (add customer_tier) | success |
| 4 | JSON body | EvaluateJsonPath | +priority, +region, +tier | none | matched |
| 5 | attrs only | RouteOnAttribute | none | none | critical / high / normal / unmatched |
| 6 | invalid FlowFile | UpdateAttribute | +dlq.reason | none | success |
| 7 | routed FlowFile | PublishKafka | +msg.count | read once | success |
After deployment, the flow processes 50k events/minute per node. The critical-topic PublishKafka runs at 4 concurrent tasks with Guarantee Replicated Delivery; the normal-topic runs at 2 concurrent tasks with the default at-least-once semantics. LookupRecord is at 8 concurrent tasks because the Postgres lookup is the hop with the highest wall-clock latency and benefits most from parallelism.
Output:
| Metric | Value |
|---|---|
| Total processors | 10 (6 pipeline + 4 Kafka sinks) |
| Peak throughput | 50k events/min per node |
| p99 end-to-end latency | 250 ms |
| Content rewrites per FlowFile | 2 (ListenHTTP write + LookupRecord enrich) |
| Attribute mutations per FlowFile | 5 |
| Dead-letter rate | 0.2% (schema validation only) |
Why this works — concept by concept:
- FlowFile as envelope + content: attributes carry the routing keys; content carries the payload. Every routing decision after the extract step is a `Map` lookup, not a disk read.
- Extract-once pattern: `EvaluateJsonPath` pulls priority, region, and tier into attributes exactly once. Every downstream routing decision reads from attributes. Zero redundant content parses.
- Concurrent Tasks tuning: LookupRecord is the slowest hop (external DB call) and gets the highest concurrency (8). RouteOnAttribute is CPU-cheap and needs only 2. Concurrency is set per hop, not globally.
- Three-relationship failure handling: valid/invalid split at ValidateRecord; matched/unmatched at RouteOnAttribute; the invalid path funnels into the dead-letter Kafka topic with an added `dlq.reason` attribute for operators.
- Cost: 10 processors, one canvas, zero custom code. Adding a new priority tier is a one-line addition to RouteOnAttribute plus one new PublishKafka processor and a wire. O(1) marginal change; O(N) code reuse.
3. Connections, queues, backpressure
Every connection is a bounded queue — backpressure pauses upstream when downstream fills
The mental model in one line: a nifi connection is not a wire — it is a bounded, on-disk queue with a size threshold (bytes) and an object threshold (FlowFile count), a prioritizer that decides consumption order, and an automatic backpressure signal that pauses the upstream processor the instant either threshold is crossed. Every NiFi performance question — "why is my flow slow?", "how does NiFi handle a downstream outage?", "what happens when Kafka goes down?" — is answered by walking through the connection queue's state, not the processors themselves.
The connection queue — anatomy.
- Object threshold. The maximum count of FlowFiles allowed in the queue before backpressure activates. Default 10,000. Set lower on hot paths where a queue depth spike is a leading indicator of a downstream stall.
- Size threshold. The maximum bytes of aggregate content in the queue. Default 1 GB. Set based on content-repository disk budget divided by the number of connections you expect to fill simultaneously.
- Prioritizer. Optional ordering strategy: `FirstInFirstOutPrioritizer`, `NewestFlowFileFirstPrioritizer` (newest first), `OldestFlowFileFirstPrioritizer` (the scheme NiFi falls back to when no prioritizer is selected), `PriorityAttributePrioritizer` (reads a `priority` attribute). Multiple prioritizers can be stacked; NiFi evaluates them in configured order.
- Expiration. Optional per-FlowFile TTL, measured from the time the FlowFile entered NiFi. If a FlowFile older than the TTL is sitting in the connection (or arrives there), NiFi drops it (with a Provenance `EXPIRE` event). Useful for real-time flows where stale data is worse than no data.
- Load-balance strategy. In cluster mode, connections can be configured to redistribute FlowFiles across nodes (`Partition by Attribute`, `Round Robin`, `Single Node`). This is the connection-level analog of a Kafka partitioner.
Backpressure — the propagating pause.
- The trigger. When a queue's current count reaches `back-pressure-object-threshold` OR the current size reaches `back-pressure-data-size-threshold`. Both are soft limits: a task that is already running finishes its batch, so the queue can overshoot the threshold slightly.
- The effect. The upstream processor's `onTrigger()` is not scheduled. From the outside, the processor appears "yellow" or "paused" on the canvas.
- The propagation. If the upstream is itself starved of scheduling because its downstream queue is full, then its own input queue fills and its upstream is paused too. Backpressure propagates the whole way to the source.
- The recovery. When the downstream processor drains the queue below the threshold, the upstream is re-scheduled automatically. There is no manual reset.
- Why it's not a rate limiter. A rate limiter caps throughput at a fixed rate; backpressure is on/off. Below threshold, upstream runs at full speed; above threshold, upstream stops entirely. This binary behaviour is what gives NiFi flows their crisp, self-adjusting steady state.
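The trigger and the propagation can be sketched as a toy simulation in Python. It models a three-queue chain with a stalled sink; the function names, the tiny thresholds, and the `>=` comparison are assumptions chosen to keep the trace short, not NiFi internals.

```python
def backpressure_engaged(count, size_bytes,
                         object_threshold=10_000, size_threshold=1_000_000_000):
    """True when either threshold is reached (the >= comparison is an assumption)."""
    return count >= object_threshold or size_bytes >= size_threshold


def simulate_stalled_sink(ticks, stages=3, rate=5, threshold=10, payload_bytes=2_000):
    """Chain: source -> q[0] -> p1 -> q[1] -> p2 -> q[2] -> sink that never drains."""
    queues = [0] * stages
    source_scheduled = []

    def full(n):
        return backpressure_engaged(n, n * payload_bytes, object_threshold=threshold)

    for _ in range(ticks):
        # Walk from the sink side back to the source so each processor
        # checks the current state of its own outgoing queue.
        for i in range(stages - 1, 0, -1):
            if not full(queues[i]):
                moved = min(rate, queues[i - 1])
                queues[i - 1] -= moved
                queues[i] += moved
        runs = not full(queues[0])
        if runs:
            queues[0] += rate
        source_scheduled.append(runs)
    return queues, source_scheduled


queues, source_scheduled = simulate_stalled_sink(ticks=8)
print(queues)
print(source_scheduled)
```

The queue nearest the sink fills first, then the one before it, and only after every queue is at its threshold does the source stop being scheduled. That ordering is the "propagating pause" in miniature.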
Prioritizers — controlling consumption order.
- `FirstInFirstOutPrioritizer`. The FlowFile that reached the connection first is picked first. Fair, simple, matches arrival order at this queue.
- `NewestFlowFileFirstPrioritizer`. Newest first. Useful when stale data is worthless, e.g. real-time dashboards where you'd rather show the most recent event than backfill.
- `OldestFlowFileFirstPrioritizer`. The FlowFile that has been in the dataflow longest is picked first, regardless of when it reached this particular connection. This is the ordering NiFi applies when no prioritizer is selected.
- `PriorityAttributePrioritizer`. Reads a FlowFile attribute (`priority`) and picks the lowest value first. Combined with `FirstInFirstOutPrioritizer` as a tiebreaker, it gives you priority-then-FIFO, the classic priority queue.
Common interview probes on backpressure.
- "What happens when the downstream S3 is slow?" The connection queue to `PutS3Object` fills, backpressure fires, upstream processors pause.
- "How is backpressure different from a rate limit?" Binary vs continuous; automatic vs configured.
- "What is a prioritizer?" A comparator on the queue that decides FlowFile pick order.
- "Do FlowFiles get lost when backpressure kicks in?" No; they queue in the previous processor's output connection or, ultimately, don't get ingested at the source at all.
- "How do you size the object threshold?" Small enough to detect a stall quickly; large enough to absorb normal traffic bursts.
Worked example — setting backpressure to stop a runaway upstream
Detailed explanation. A flow ingests events from Kafka via ConsumeKafka and writes them to S3 via PutS3Object. During a partial S3 outage, PutS3Object slows from 5000 FlowFiles/second to 100. The connection queue between them starts filling. Backpressure is the mechanism that prevents ConsumeKafka from continuing to pull events off Kafka at 5000/second and building up an unbounded backlog in NiFi's content repository. Walk through the numbers.
- Steady-state. Both processors run at 5000 FF/s. Queue length ≈ 0.
- The outage. S3 slows to 100 FF/s. `ConsumeKafka` is still pulling at 5000 FF/s. Queue grows at 4900 FF/s.
- The threshold. Object threshold = 50,000 → queue hits threshold in ~10 seconds.
- The pause. `ConsumeKafka` pauses; Kafka consumer group stops advancing; Kafka topic lag grows.
Question. Compute the queue fill time, choose the right thresholds for a flow processing 5000 FF/s with 2 KB payloads, and quantify what happens under a 5-minute S3 outage.
Input.
| Parameter | Value |
|---|---|
| Steady-state throughput | 5000 FF/s |
| Payload size | 2 KB |
| S3 outage duration | 5 min |
| S3 degraded throughput | 100 FF/s |
| Object threshold | 50,000 (proposed) |
| Size threshold | 512 MB (proposed) |
| Content repo disk budget | 100 GB |
Code.
<connection source="consume_kafka" destination="put_s3"
back-pressure-object-threshold="50000"
back-pressure-data-size-threshold="512 MB"
flowfile-expiration="0 sec">
<prioritizers>
<prioritizer>org.apache.nifi.prioritizer.FirstInFirstOutPrioritizer</prioritizer>
</prioritizers>
</connection>
# Backpressure-driven behaviour during a 5-min S3 outage
scenario: s3_partial_outage_5min
steady_state:
  consume_kafka: 5000 ff/s
  put_s3: 5000 ff/s
  queue_depth: 0
outage_seconds:
  0: {consume: 5000, put: 100, depth: 0, kafka_lag: 0}
  1: {consume: 5000, put: 100, depth: 4900, kafka_lag: 0}
  10: {consume: 5000, put: 100, depth: 49000, kafka_lag: 0}
  11: {consume: 0, put: 100, depth: ~50000, kafka_lag: ~5000}  # backpressure fires
  60: {consume: ~100 avg, put: 100, depth: ~50000, kafka_lag: ~245000}  # kafka absorbs
  300: {consume: ~100 avg, put: 100, depth: ~50000, kafka_lag: ~1420000}
recovery:
  put_s3 returns to 5000 ff/s -> queue drains -> consume_kafka resumes -> kafka lag decreases
Step-by-step explanation.
- At t=0, S3 slows. The queue fills at (5000 in − 100 out) = 4900 FlowFiles/second, so it reaches the 50,000 object threshold at t≈10 s.
- At t≈10 s, backpressure fires. `ConsumeKafka` is no longer scheduled, apart from brief bursts whenever the queue dips back under the threshold; those bursts average out to the 100 FF/s that S3 still accepts. From the outside, NiFi looks paused; from Kafka's perspective, the consumer group all but stops advancing offsets. Consumer lag starts growing at roughly 4900 events/second.
- Over the 5-minute outage, producers write 5000 × 300 = 1.5M events. NiFi takes only about 80k of them (the 50k that fill the queue plus the ~100/s trickle), so Kafka ends the outage with roughly 1.4M events of lag. This is by design: Kafka is the durable buffer; NiFi is pushing back rather than swelling its own content repository indefinitely.
- If we had not set backpressure (or set it too high), the NiFi content repository would have grown at 4900 × 2 KB = 9.8 MB/s, about 3 GB over 5 minutes. Multi-hour outages would fill the content repository disk and crash NiFi.
- When S3 recovers, `PutS3Object` returns to 5000 FF/s and the 50k queue drains quickly. Backpressure clears and `ConsumeKafka` resumes. Kafka lag then shrinks at whatever rate the flow can sustain above the incoming 5000 events/second, so catch-up speed depends on spare capacity. The flow self-heals with zero operator intervention.
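The outage numbers can be reproduced with a second-by-second Python model. It is a deliberately crude sketch: `simulate_outage` is a hypothetical function, the consumer pulls in one-second batches of 5000, and producers are assumed to keep writing to Kafka at 5000 events/s throughout.

```python
def simulate_outage(seconds=300, produce=5000, sink=100,
                    threshold=50_000, backpressure=True):
    """Return (queue depth, Kafka lag) at the end of the outage."""
    depth = 0  # FlowFiles queued between consume_kafka and put_s3
    lag = 0    # events written to Kafka but not yet pulled by NiFi
    for _ in range(seconds):
        lag += produce
        # The consumer is scheduled only while the queue is under its threshold.
        if not backpressure or depth < threshold:
            pulled = min(lag, produce)
            lag -= pulled
            depth += pulled
        depth -= min(depth, sink)  # degraded S3 still accepts 100 FF/s
    return depth, lag


with_bp = simulate_outage()
without_bp = simulate_outage(backpressure=False)
print(with_bp)
print(without_bp)
```

With backpressure the queue ends at the threshold and Kafka holds about 1.42M events; without it, about 1.47M FlowFiles sit in NiFi. Because the model pulls whole batches, the queue briefly overshoots the threshold by a few thousand FlowFiles before the pause takes hold, which is consistent with the thresholds being soft limits.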
Output.
| Metric | Without backpressure | With backpressure (50k threshold) |
|---|---|---|
| Queue length at end of 5-min outage | ~1.47M FlowFiles | ~50k FlowFiles |
| Content repo disk used (added) | ~3 GB | ~100 MB |
| Kafka consumer lag at end of outage | 0 (drained into NiFi) | ~1.4M events |
| NiFi crash risk from disk exhaustion | high (multi-hour outages) | low (Kafka absorbs) |
| Recovery time | slow (drain content repo) | fast (drain 50k, unpause) |
Rule of thumb. Set backpressure thresholds below your content-repository budget so a multi-hour downstream outage cannot fill the disk. The rule: object_threshold × avg_payload_bytes × num_hot_connections < 0.5 × content_repo_disk. Kafka (or the upstream source) is the right place to buffer during outages; NiFi is the flow, not the buffer.
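The sizing rule is easy to turn into a check. In this Python sketch the function names are hypothetical and the figure of 20 hot connections is an invented example; the other numbers come from the input table above.

```python
GB = 1_000_000_000  # decimal gigabyte


def worst_case_queue_bytes(object_threshold, avg_payload_bytes, hot_connections):
    """Bytes pinned in the content repo if every hot connection fills up."""
    return object_threshold * avg_payload_bytes * hot_connections


def within_budget(object_threshold, avg_payload_bytes, hot_connections, content_repo_bytes):
    """Rule of thumb: worst case must stay under half the content repo disk."""
    worst = worst_case_queue_bytes(object_threshold, avg_payload_bytes, hot_connections)
    return worst < 0.5 * content_repo_bytes


# 50,000-object threshold, 2 KB payloads, 100 GB content repository.
ok = within_budget(50_000, 2_000, hot_connections=20, content_repo_bytes=100 * GB)
limit = (100 * GB // 2) // (50_000 * 2_000)  # connection count at which the budget is hit

print(ok, limit)
```

With these inputs each full connection pins about 100 MB, so the rule holds for any count below 500 simultaneously full connections. In practice the object threshold is usually constrained by stall-detection time long before disk budget.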
Worked example — priority-attribute prioritizer for critical-event promotion
Detailed explanation. A flow processes a mix of low-priority telemetry and rare critical alerts. Under load, the telemetry saturates the FIFO queue and delays the critical alerts. The fix is a PriorityAttributePrioritizer on the connection — critical alerts jump the queue.
- The steady state. 5000 telemetry events/s, 1 critical alert/s.
- The problem. During a spike, the queue holds 40k telemetry events + 8 critical alerts. FIFO would process 40k telemetry before touching the alerts.
- The fix. Configure `PriorityAttributePrioritizer` on the queue; add a `priority` attribute upstream.
Question. Configure the prioritizer, show the attribute that must be set upstream, and quantify the p99 latency for critical alerts before and after.
Input.
| Parameter | Value |
|---|---|
| Telemetry throughput | 5000/s |
| Critical throughput | 1/s |
| Downstream throughput | 5000/s |
| Priority values | 1 (critical), 10 (telemetry) |
| Sort order | Ascending — lower number = higher priority |
Code.
<!-- Upstream — set the priority attribute -->
<processor id="upstream_router" type="RouteOnAttribute">
<property name="critical">${event_type:equals('alert')}</property>
<property name="telemetry">${event_type:equals('telemetry')}</property>
</processor>
<processor id="set_prio_critical" type="UpdateAttribute">
<property name="priority">1</property>
</processor>
<processor id="set_prio_telemetry" type="UpdateAttribute">
<property name="priority">10</property>
</processor>
<!-- The prioritized connection -->
<connection source="merge_point" destination="downstream_sink"
back-pressure-object-threshold="50000"
back-pressure-data-size-threshold="512 MB">
<prioritizers>
<prioritizer>org.apache.nifi.prioritizer.PriorityAttributePrioritizer</prioritizer>
<prioritizer>org.apache.nifi.prioritizer.FirstInFirstOutPrioritizer</prioritizer>
</prioritizers>
</connection>
Step-by-step explanation.
- Upstream, a RouteOnAttribute splits the flow into critical and telemetry paths. Each path passes through an UpdateAttribute that sets the priority attribute (1 for critical, 10 for telemetry). Both paths then re-merge into a single connection.
- The connection is configured with two prioritizers, evaluated in order. PriorityAttributePrioritizer reads the priority attribute and sorts ascending — lower is higher priority. FirstInFirstOutPrioritizer is the tiebreaker for FlowFiles with the same priority.
- Under FIFO, a critical alert added at t=0 with 40k queued telemetry events ahead of it would wait until all 40k drained. At 5000/s downstream, that's 8 seconds — an unacceptable p99 for a critical alert.
- With the priority prioritizer, the critical alert jumps to the front of the queue on its next scheduling cycle. Downstream picks the critical FlowFile first; the alert reaches its destination in <1 ms.
- Telemetry throughput is unchanged — the 1 critical alert per second (per the input table) is a rounding error against 5000 telemetry/s. The priority queue re-orders without slowing.
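The ordering rule in the steps above can be sketched in a few lines of Python. This is a toy model, not NiFi's implementation: the `FlowFile` tuple, `drain_position`, and the arrival counter are hypothetical names, and it assumes 40,000 telemetry FlowFiles are already queued when one critical alert arrives, with the 5000/s downstream rate from the input table.

```python
import heapq
from typing import List, Tuple

# (priority, arrival_seq, label): a lower priority value sorts first, and
# arrival_seq acts as the FIFO tiebreaker within one priority level.
FlowFile = Tuple[int, int, str]

def drain_position(queue: List[FlowFile], label: str, use_priority: bool) -> int:
    """Return how many FlowFiles are dequeued before `label` comes out."""
    if use_priority:
        heap = list(queue)
        heapq.heapify(heap)  # orders by (priority, arrival_seq)
        order = [heapq.heappop(heap) for _ in range(len(heap))]
    else:
        order = sorted(queue, key=lambda ff: ff[1])  # pure FIFO
    return next(i for i, ff in enumerate(order) if ff[2] == label)

DOWNSTREAM_RATE = 5000  # FlowFiles per second, from the input table

# 40,000 telemetry events arrive first, then one critical alert.
backlog: List[FlowFile] = [(10, seq, "telemetry") for seq in range(40_000)]
backlog.append((1, 40_000, "alert"))

fifo_wait_s = drain_position(backlog, "alert", use_priority=False) / DOWNSTREAM_RATE
prio_wait_s = drain_position(backlog, "alert", use_priority=True) / DOWNSTREAM_RATE
print(f"FIFO wait: {fifo_wait_s:.1f} s, priority wait: {prio_wait_s:.1f} s")
```

The model only counts queue position; real latency adds scheduling and processing time on top, so the priority case is "next in line" rather than literally zero.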
Output.
| Scenario | Critical p99 latency | Telemetry throughput |
|---|---|---|
| FIFO only | 8 s (during 40k queue) | 5000/s |
| Priority + FIFO tiebreaker | < 1 ms | 5000/s |
Rule of thumb. Use PriorityAttributePrioritizer for any flow that mixes hot alerts with bulk traffic. Set the priority attribute upstream (never at the queue), stack a FIFO tiebreaker, and you get a working priority queue with zero code.
Worked example — expiration for a real-time dashboard flow
Detailed explanation. A dashboard flow displays live metrics. If the downstream renderer stalls, backfilled 10-minute-old FlowFiles are worthless — the dashboard shows stale data. Configure per-FlowFile expiration on the connection so old FlowFiles are dropped rather than delivered late.
- The trade-off. Late data is worse than no data (for a real-time dashboard).
- The mechanism. flowfile-expiration = "60 sec" on the connection — anything older than 60 seconds is dropped.
- The provenance. Dropped FlowFiles emit an EXPIRE provenance event so operators can see the loss.
Question. Configure a real-time dashboard connection with a 60-second TTL, show the resulting behaviour under a 5-minute downstream outage, and explain when this pattern is appropriate.
Input.
| Parameter | Value |
|---|---|
| Upstream throughput | 1000 FF/s |
| Downstream throughput | 1000 FF/s (steady) → 0 (outage) |
| Outage duration | 5 min |
| TTL | 60 s |
Code.
<connection source="metrics_source" destination="dashboard_renderer"
back-pressure-object-threshold="100000"
back-pressure-data-size-threshold="1 GB"
flowfile-expiration="60 sec">
<prioritizers>
<prioritizer>org.apache.nifi.prioritizer.NewestFlowFileFirstPrioritizer</prioritizer>
</prioritizers>
</connection>
Step-by-step explanation.
- Under steady state, FlowFiles flow through the connection with sub-second wait time; the TTL never fires.
- During the 5-minute outage, FlowFiles queue up at 1000 FF/s. Without a TTL the queue would hit the 100,000-object threshold at t=100 s and upstream would pause. With the 60-second TTL trimming the queue, depth levels off between roughly 60,000 and 90,000 (depending on when the last expiry sweep ran), so the object threshold acts as a safety net rather than the normal limit.
- The TTL mechanism runs on a background timer inside NiFi. Every 30 seconds, NiFi walks the connection queue and drops any FlowFile whose entryDate is more than 60 seconds ago. During the outage, FlowFiles that entered before t−60 s are dropped at each sweep.
- When downstream recovers, the queue contains only FlowFiles that entered within roughly the last 60 seconds. The dashboard sees fresh data — not the 5-minute-old stale backlog.
- NewestFlowFileFirstPrioritizer is layered on for the same reason: even within the surviving 60-second window, prefer the newest to render first. Older FlowFiles that survive the TTL get processed only if there's slack.
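The queue-depth behaviour during the outage can be checked with a small simulation. This is a sketch under simplifying assumptions: FlowFiles are tracked in one-second batches, the sink is fully stalled, and expiry happens only on the 30-second sweep described above (the function name and its parameters are hypothetical).

```python
from collections import deque

def simulate_outage(rate: int, outage_s: int, ttl_s: int, sweep_s: int):
    """Return (peak_depth, final_depth, oldest_age_s) for a fully stalled sink."""
    queue = deque()  # one entry per second: the second at which the batch entered
    peak = 0
    for now in range(1, outage_s + 1):
        queue.append(now)  # `rate` FlowFiles enter during this second
        if now % sweep_s == 0:  # periodic expiry sweep
            while queue and now - queue[0] > ttl_s:
                queue.popleft()  # drop batches older than the TTL
        peak = max(peak, len(queue) * rate)
    oldest_age = outage_s - queue[0] if queue else 0
    return peak, len(queue) * rate, oldest_age

peak, final, oldest = simulate_outage(rate=1000, outage_s=300, ttl_s=60, sweep_s=30)
print(peak, final, oldest)  # 90000 61000 60
```

Under these assumptions the queue oscillates between about 61,000 and 90,000 FlowFiles and stays below the 100,000-object threshold, and nothing older than 60 seconds is left when the outage ends.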
Output.
| Timepoint | Queue size | Oldest FlowFile age | Dashboard shows |
|---|---|---|---|
| t=0 (outage start) | ~0 | 0 s | fresh |
| t=60 s | ~60k | 60 s | last-second data |
| t=300 s (outage end) | ~60k (steady TTL trim) | 60 s | last-minute data |
| t=301 s (recovery) | ~60k drained fast (newest first) | 0 s | fresh again |
Rule of thumb. Use flowfile-expiration on any real-time flow where stale data is worse than no data — dashboards, alerts, live position feeds. Combine with NewestFlowFileFirstPrioritizer so recovery renders freshest first. Never use TTL on a batch flow — losing data is unacceptable.
Senior interview question on connection sizing and backpressure
A senior interviewer might ask: "You have a NiFi flow that reads from Kafka at 10k events/s and writes to a JDBC sink. The JDBC sink can absorb 8k/s in steady state but only 500/s during a nightly maintenance window when the Postgres primary is failing over. Design the connection queue between ConsumeKafka and the JDBC sink — thresholds, prioritizer, expiration, and where you'd measure to prove the design."
Solution Using bounded queues + backpressure + no expiration (data preservation)
<!-- Sized against a 2-hour worst-case Postgres degradation -->
<connection source="consume_kafka" destination="put_database"
back-pressure-object-threshold="200000"
back-pressure-data-size-threshold="2 GB"
flowfile-expiration="0 sec">
<prioritizers>
<prioritizer>org.apache.nifi.prioritizer.FirstInFirstOutPrioritizer</prioritizer>
</prioritizers>
</connection>
<!--
Sizing rationale:
- steady-state fill: (10000 in - 8000 out) = 2000 FF/s → 200000 = ~100s buffer
- degraded fill: (10000 in - 500 out) = 9500 FF/s
- object_threshold hit: 200000 / 9500 ≈ 21 s → backpressure fires early
- kafka absorbs the rest of the outage (durable, retained 7 days)
- expiration = 0 sec because data preservation is required (analytics workload)
- FIFO prioritizer because ingestion order matters for the downstream analytics
-->
-- Observability: measure the design in production
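-- Prometheus scrape endpoint on NiFi (metric and label names below are illustrative; confirm them against the /metrics output of your NiFi version)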
-- Metric: nifi_amount_flowfiles_queued{connection_id="..."} — should stay below 200000
-- Metric: nifi_amount_bytes_queued{connection_id="..."} — should stay below 2 GB
-- Alert: queued > 150000 for 5 min → downstream degradation warning
-- Alert: queued > 199000 for 1 min → backpressure imminent
-- Alert: nifi_backpressure_active{connection_id="..."} = 1 → upstream paused
Step-by-step trace.
| Time | consume_kafka rate | put_database rate | queue depth | Backpressure | Kafka lag |
|---|---|---|---|---|---|
| t=0 (steady) | 10000/s | 8000/s (spike absorbed) | ~0 | off | 0 |
| t=1s (maint start) | 10000/s | 500/s | 9500 | off | 0 |
| t=20s | 10000/s | 500/s | 190000 | off | 0 |
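| t=21s | ~500/s (throttled) | 500/s | 200000 | on | starts growing at ~9500/s |
| t=60s (maint mid) | ~500/s (throttled) | 500/s | ~200000 | on | ~370k |
| t=1800s (maint end) | ~500/s (throttled) | 500/s | ~200000 | on | ~16.9M |
| t=1801s (recovery) | up to 8000/s (sink-bound) | 8000/s | ~200000, then draining | releases as the queue drains | shrinks only while the sink outpaces the produce rate |
Throughout the maintenance window the NiFi queue never exceeds 200k FlowFiles (2 GB content repo footprint), Kafka absorbs the accumulated backlog (7-day retention easily covers 30 minutes of lag), and the flow resumes automatically once Postgres is back. Zero data loss; zero operator action required. Backpressure does not hold the consumer at exactly 0/s: each time the queue dips below the threshold, ConsumeKafka is scheduled again, so it effectively runs at the sink's 500/s and the queue stays pinned near 200,000.
Output:
| Metric | Value |
|---|---|
| Peak NiFi queue depth | 200,000 FlowFiles |
| Peak content repo bytes | ~2 GB |
| Peak Kafka consumer lag | ~16.9M events (9500/s from t≈21 s to t=1800 s) |
| Data loss | 0 |
| Manual intervention | 0 |
| Recovery time | lag ÷ (sink rate − produce rate); a ~1 hour drain needs roughly 4700/s of sink headroom, which 8k/s against a sustained 10k/s does not provide |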
Why this works — concept by concept:
- Bounded connection queue — object threshold + size threshold cap the NiFi content-repository exposure. The disk cost of an outage is bounded and predictable, not linear-in-outage-duration.
- Backpressure as durable-buffer signal — when NiFi pauses, Kafka becomes the buffer. Kafka's disk is cheaper and its retention model is designed exactly for this shape of outage.
- No expiration (data preservation) — analytics workload requires every event delivered. TTL is the wrong choice here — better to widen thresholds and lean on Kafka than to drop data.
- FIFO prioritizer — ingestion order matters for the downstream analytics (event-time ordering guarantees). FIFO preserves the property.
- Cost — 2 GB of content-repo disk budget, 7 days of Kafka retention, one Prometheus alert. The alternative — unbounded NiFi queue — would fail an audit on data-loss-under-disk-full and cost a full outage.
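The sizing arithmetic behind the design can be written down as two small helpers. This is a back-of-the-envelope sketch, assuming constant rates and a consumer that is throttled to the sink's drain rate once backpressure engages; the function names are hypothetical.

```python
def time_to_threshold(threshold: int, in_rate: int, out_rate: int) -> float:
    """Seconds until the queue reaches `threshold` at a constant net fill rate."""
    net = in_rate - out_rate
    if net <= 0:
        return float("inf")  # the queue never fills
    return threshold / net

def kafka_lag(window_s: float, fill_s: float, in_rate: int, out_rate: int) -> float:
    """Lag accumulated after backpressure engages, with the consumer
    throttled to the sink's drain rate for the rest of the window."""
    return max(0.0, window_s - fill_s) * (in_rate - out_rate)

fill_degraded = time_to_threshold(200_000, 10_000, 500)   # maintenance window
fill_steady = time_to_threshold(200_000, 10_000, 8_000)   # normal operation
lag_end = kafka_lag(1800, fill_degraded, 10_000, 500)
print(round(fill_degraded, 1), round(fill_steady), round(lag_end / 1e6, 1))  # 21.1 100 16.9
```

Counting lag from t=0 instead of from the moment backpressure engages gives 1800 × 9500 = 17.1M, which is a safe upper bound; the difference is exactly the 200,000 FlowFiles held in the NiFi queue.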
4. NiFi Registry + versioning + templates
Registry is git for process groups — flow_v3 lives once, deploys to dev, staging, prod
The mental model in one line: nifi registry is a separate Spring Boot service that stores versioned snapshots of NiFi process groups exactly like git stores versioned snapshots of code — you commit a flow from any NiFi instance, browse versions in the Registry UI, and import a specific version into another NiFi instance where **parameter contexts fill in the environment-specific credentials, endpoints, and connection strings**. Every "how do you promote a flow across environments?" interview question has this exact answer.
NiFi Registry — anatomy of the service.
- Deployment. A standalone Java service, shipped as a tarball or a Docker image. Runs on port 18080 by default. Metadata is backed by a local H2 database, or an external Postgres/MySQL for production.
- Buckets. Top-level namespaces inside Registry. Typical layout: one bucket per team or business domain (e.g. retail, analytics, security). Buckets have access-control policies.
- Flows. Each bucket contains flows. A "flow" in Registry is a versioned process group — the same conceptual unit you dragged onto the NiFi canvas.
- Versions. Each commit to a flow creates a new version. Version metadata carries a comment, a timestamp, and the identity of the user who committed. Registry does not prune old versions on its own.
- Access control. Registry integrates with LDAP, Kerberos, and OIDC for identity, and has bucket-level read/write/delete permissions per user or group.
Parameter contexts — the env-specific-vars layer.
- What they are. Named collections of parameters (key-value pairs) that a NiFi flow references via #{param_name} syntax. Parameters can be marked sensitive (encrypted at rest, never displayed in the UI).
- Where they live. Parameter values are resolved on each NiFi instance. A versioned flow in Registry records which parameters it references, but sensitive values are never exported to Registry. This is the right separation of concerns: the flow (Registry) is env-agnostic; the parameter values (per-instance) are env-specific.
- The promotion story. A single flow committed to Registry references #{db_url}, #{db_user}, #{db_password}, #{s3_bucket}. On the dev NiFi, the dev-params context supplies dev values; on prod, prod-params supplies prod. Zero flow changes across environments.
- Inheritance. Parameter contexts can inherit from other contexts, enabling a base common-params context with per-env overrides. Reduces duplication.
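To make the "same flow, different context" idea concrete, here is a minimal Python sketch of `#{name}` substitution. It is an illustration of the concept only, not NiFi's resolver: the `resolve` function, the regular expression, and the hostnames and usernames are all assumptions made for the example.

```python
import re
from typing import Dict

# Matches #{name}; the character set is an assumption for this sketch.
PARAM_REF = re.compile(r"#\{([A-Za-z0-9_.\- ]+)\}")

def resolve(value: str, context: Dict[str, str]) -> str:
    """Replace each #{name} reference with the value from `context`."""
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in context:
            raise KeyError(f"parameter '{name}' is not defined in this context")
        return context[name]
    return PARAM_REF.sub(lookup, value)

# One flow property, identical in every environment.
flow_property = "#{db_url}?user=#{db_user}"

# Hypothetical per-environment values.
dev_params = {"db_url": "jdbc:postgresql://pg.dev.internal:5432/landing", "db_user": "ingest_dev"}
prod_params = {"db_url": "jdbc:postgresql://pg.prod.internal:5432/landing", "db_user": "ingest_prod"}

print(resolve(flow_property, dev_params))
print(resolve(flow_property, prod_params))
```

Raising on an undefined parameter mirrors the useful failure mode here: a flow imported into an environment whose context is incomplete should fail loudly rather than run with a blank value.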
The versioning workflow — commit, browse, import, revert.
- Start version control. Right-click a process group on the NiFi canvas → "Version → Start version control" → pick a Registry + bucket + flow name → enter a commit message. The process group is now Registry-tracked.
- Commit changes. Edit the flow. NiFi shows a badge indicating uncommitted changes. Right-click → "Version → Commit local changes" → enter a message. A new version is stored in Registry.
- Browse and revert. Right-click → "Version → Show version history" → pick a version → "Change to this version". NiFi rolls the process group back to that snapshot.
- Import on another instance. On a fresh NiFi (say, staging), drag an empty process group onto the canvas → "Import from Registry" → pick the same bucket + flow + version. The flow is materialized on staging with #{param} references pointing at staging's parameter contexts.
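The commit, browse, and revert cycle above amounts to an append-only version list per flow. A minimal in-memory sketch, with `FlowHistory` and its methods as hypothetical names rather than Registry's API:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FlowHistory:
    """Append-only version list for one flow in one bucket."""
    snapshots: List[Dict] = field(default_factory=list)

    def commit(self, snapshot: Dict, comment: str) -> int:
        """Store a new snapshot and return its version number (starting at 1)."""
        self.snapshots.append({"flow": snapshot, "comment": comment})
        return len(self.snapshots)

    def get(self, version: int) -> Dict:
        """Fetch the snapshot stored under `version`."""
        return self.snapshots[version - 1]["flow"]

history = FlowHistory()
v1 = history.commit({"processors": ["ListSFTP", "FetchSFTP"]}, "Initial version")
v2 = history.commit({"processors": ["ListSFTP", "FetchSFTP", "PutS3Object"]}, "Add S3 sink")

# Importing or reverting just materialises an existing snapshot;
# newer versions stay in the history untouched.
staging_flow = history.get(v1)
print(v1, v2, staging_flow["processors"])  # 1 2 ['ListSFTP', 'FetchSFTP']
```

Because nothing is ever rewritten in place, reverting to an older version and later moving forward again are the same operation with a different version number.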
Common interview probes on Registry.
- "How do you promote a flow from dev to prod?" — commit to Registry from dev; import the same version to prod; parameter contexts fill in env-specific values.
- "What's stored in Registry vs on the NiFi instance?" — Registry holds the flow definition (env-agnostic); the NiFi instance holds parameter contexts, controller services, and runtime state.
- "How do you handle a bad deploy?" — Change to the previous version via "Show version history". Instant rollback, no code deploy.
- "What are parameter contexts?" — Named parameter maps referenced from the flow via #{name} syntax. Per-environment, per-instance.
- "How does Registry differ from git?" — Registry stores flow-graph JSON specifically; git stores arbitrary files. Registry has NiFi-native access control and instance integration.
Worked example — promote a flow from dev → staging → prod
Detailed explanation. The canonical Registry workflow. A data engineer builds a new SFTP ingestion flow on the dev NiFi, commits to Registry, promotes to staging for QA, then to prod. The whole cycle takes ~15 minutes and requires zero code deploy.
- The setup. Registry deployed at registry.on-prem.bank:18080. Three NiFi instances: nifi-dev, nifi-staging, nifi-prod. Three parameter contexts: dev-params, staging-params, prod-params.
- The flow. SFTP ingestion — the same 4-processor flow from the earlier examples, parameterised on #{sftp_host}, #{sftp_user}, #{sftp_password}, #{s3_bucket}.
- The promotion. Commit on dev, import on staging, verify, import on prod.
Question. Walk through the CLI commands (the NiFi Toolkit CLI, using its nifi and registry command groups) for the full promotion cycle. Show what changes between environments and what stays constant.
Input.
| Environment | NiFi URL | Registry URL | Parameter context |
|---|---|---|---|
| dev | https://nifi-dev:8443 | https://registry:18080 | dev-params |
| staging | https://nifi-staging:8443 | (same) | staging-params |
| prod | https://nifi-prod:8443 | (same) | prod-params |
Code.
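# Step 1 — on the dev NiFi, commit the flow to Registry
# (typically via the UI; the CLI calls below are a sketch of the equivalent.
# Stock NiFi Toolkit builds ship the CLI as bin/cli.sh, and command and flag
# names vary by version, so confirm each one with the CLI's built-in help.)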
./bin/nifi-toolkit-cli.sh nifi start-version-control \
--registryClientId <registry-uuid> \
--bucketId <bucket-uuid> \
--flowName sftp-ingest \
--processGroupId <pg-uuid> \
--flowDesc "Initial version" \
--url https://nifi-dev:8443
# Step 2 — verify the version landed in Registry
./bin/nifi-toolkit-cli.sh registry list-flow-versions \
--bucketId <bucket-uuid> \
--flowId <flow-uuid> \
--url https://registry:18080
# Output: version 1, timestamp 2026-07-04T10:00:00Z, "Initial version"
# Step 3 — on the staging NiFi, ensure parameter context exists
# staging-params.yml (applied via nifi-toolkit-cli.sh)
name: staging-params
parameters:
  sftp_host: sftp.staging.internal
  sftp_user: ingest-svc
  sftp_password: '{{sensitive-vault-lookup:staging/sftp/password}}'
  s3_bucket: staging-analytics-landing
inheritedContexts: [common-params]
# Step 4 — import the versioned flow into staging
./bin/nifi-toolkit-cli.sh nifi pg-import \
--registryClientId <registry-uuid> \
--bucketId <bucket-uuid> \
--flowId <flow-uuid> \
--flowVersion 1 \
--parentPgId <staging-parent-pg> \
--url https://nifi-staging:8443
# Bind the imported process group to the staging parameter context (once)
./bin/nifi-toolkit-cli.sh nifi pg-set-param-context \
--processGroupId <new-pg-uuid> \
--paramContextId <staging-params-uuid> \
--url https://nifi-staging:8443
# Step 5 — smoke test on staging, then promote to prod
# (identical to step 4, with --url https://nifi-prod:8443 and --paramContextId <prod-params-uuid>)
Step-by-step explanation.
- Step 1 commits the flow from dev to Registry. NiFi calculates the process-group JSON, sends it to Registry over HTTPS, and stores it as version 1 in the retail bucket. The dev NiFi's UI now shows the process group with a green "up-to-date" badge.
- Step 2 verifies the commit landed. The registry list-flow-versions command lists every version of the flow with metadata. This is the "did the commit actually work" sanity check before promoting.
- Step 3 defines the staging parameter context. staging-params supplies staging-specific SFTP credentials, endpoints, and S3 buckets. The sftp_password uses a sensitive-value indirection (Vault lookup) so the password is never in cleartext in any config file.
- Step 4 materialises the flow on staging NiFi. The pg-import CLI creates a new process group under the specified parent, materialised from Registry version 1. It's then bound to the staging parameter context via pg-set-param-context. The flow now runs on staging, hitting staging SFTP and staging S3 — same code, different credentials.
- Step 5 repeats step 4 against prod NiFi with prod parameter context. Zero flow changes; zero code deploy; total elapsed wall-clock time for the promotion is ~5 minutes.
Output.
| Component | dev | staging | prod |
|---|---|---|---|
| Flow definition | Registry v1 | Registry v1 | Registry v1 |
| SFTP host | sftp.dev | sftp.staging | sftp.prod |
| SFTP password | (dev secret) | (staging secret) | (prod secret) |
| S3 bucket | dev-landing | staging-landing | prod-landing |
| Configuration source | dev-params | staging-params | prod-params |
| Rollback | v0 (empty) | v0 (empty) | v0 (empty) |
Rule of thumb. Every production NiFi deployment runs behind Registry. Flow definition in Registry, credentials in per-environment parameter contexts, promotion via the CLI or UI — this is the canonical shape of NiFi CI/CD. Anyone editing flows directly on prod without committing to Registry is shipping an incident.
Worked example — parameter contexts with sensitive values and inheritance
Detailed explanation. A team runs 20 flows that all need the same base parameters (Kafka bootstrap servers, Zookeeper hosts, S3 region) plus flow-specific overrides. Instead of duplicating the base parameters across 20 contexts, they build a common-params context and inherit from it. Sensitive values (Kafka SSL keys, S3 credentials) go through a vault lookup, not plaintext.
- The base. common-params — Kafka bootstrap, Zookeeper, S3 region.
- The overrides. retail-params — inherits from common-params, adds retail-specific SFTP endpoint.
- The sensitive values. s3_secret_key is a {{vault-lookup}} reference, not a literal.
Question. Design the parameter-context inheritance graph, mark sensitive values, and show how the flow references parameters.
Input.
| Team | Common parameters | Team-specific parameters |
|---|---|---|
| retail | kafka_bootstrap, zk_hosts, s3_region | sftp_retail, retail_s3_bucket |
| analytics | kafka_bootstrap, zk_hosts, s3_region | dwh_url, analytics_s3_bucket |
| security | kafka_bootstrap, zk_hosts, s3_region | siem_url, security_s3_bucket |
Code.
# common-params — base shared across all teams
name: common-params
parameters:
  - name: kafka_bootstrap
    value: "kafka1:9093,kafka2:9093,kafka3:9093"
    sensitive: false
  - name: kafka_ssl_truststore
    value: "/etc/nifi/ssl/kafka-truststore.jks"
    sensitive: false
  - name: kafka_ssl_password
    value: "{{vault:secret/nifi/kafka/truststore_password}}"
    sensitive: true
  - name: zk_hosts
    value: "zk1:2181,zk2:2181,zk3:2181"
    sensitive: false
  - name: s3_region
    value: "us-east-1"
    sensitive: false
  - name: s3_access_key
    value: "{{vault:secret/nifi/s3/access_key}}"
    sensitive: true
  - name: s3_secret_key
    value: "{{vault:secret/nifi/s3/secret_key}}"
    sensitive: true
---
# retail-params — inherits from common-params
name: retail-params
inheritedContexts: [common-params]
parameters:
  - name: sftp_retail
    value: "sftp.retail.on-prem.bank"
    sensitive: false
  - name: sftp_retail_user
    value: "ingest-retail"
    sensitive: false
  - name: sftp_retail_password
    value: "{{vault:secret/nifi/retail/sftp_password}}"
    sensitive: true
  - name: retail_s3_bucket
    value: "retail-analytics-landing"
    sensitive: false
<!-- Flow references parameters via #{name} syntax -->
<processor id="fetch_sftp" type="FetchSFTP">
<property name="Hostname">#{sftp_retail}</property>
<property name="Username">#{sftp_retail_user}</property>
<property name="Password">#{sftp_retail_password}</property>
</processor>
<processor id="put_s3" type="PutS3Object">
<property name="Bucket">#{retail_s3_bucket}</property>
<property name="Region">#{s3_region}</property>
<property name="Access Key ID">#{s3_access_key}</property>
<property name="Secret Access Key">#{s3_secret_key}</property>
</processor>
Step-by-step explanation.
- common-params holds every parameter shared across teams. Kafka bootstrap, Zookeeper, S3 region — the infrastructure basics. Sensitive values (Kafka SSL password, S3 credentials) use Vault-lookup syntax, so the parameter context in Registry never sees the cleartext.
- retail-params inherits from common-params and adds retail-specific SFTP endpoint and S3 bucket. The retail-params context does not re-declare Kafka or Zookeeper — those are inherited automatically.
- A flow references any parameter — from retail-params directly or from common-params via inheritance — with #{name} syntax. The flow does not need to know whether the parameter is local or inherited.
- The {{vault:...}} placeholders are shorthand for an external secret lookup rather than literal NiFi syntax. In stock NiFi the closest built-in mechanism is a Parameter Provider (for example the HashiCorp Vault provider in recent releases), which fetches the secret values and populates sensitive parameters, so the cleartext never has to appear in a config file or in Registry.
- Promoting a flow across environments now requires only changing the parameter context binding — the flow is identical; only which context it references changes. dev-retail-params inherits from dev-common-params; prod-retail-params inherits from prod-common-params; the flow itself is env-agnostic.
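The inheritance behaviour described above can be sketched as a recursive merge. This is an illustration under an assumed precedence rule (a context's own parameters override inherited ones, and earlier entries in the inherited list win over later ones); `effective_parameters` and the dictionary layout are hypothetical, not a NiFi API.

```python
from typing import Dict

def effective_parameters(name: str, contexts: Dict[str, dict]) -> Dict[str, str]:
    """Flatten a context and everything it inherits into one name -> value map."""
    ctx = contexts[name]
    merged: Dict[str, str] = {}
    # Apply lower-precedence parents first so higher-precedence ones overwrite them.
    for parent in reversed(ctx.get("inherited", [])):
        merged.update(effective_parameters(parent, contexts))
    merged.update(ctx.get("parameters", {}))  # own parameters win
    return merged

contexts = {
    "common-params": {
        "parameters": {
            "s3_region": "us-east-1",
            "kafka_bootstrap": "kafka1:9093,kafka2:9093,kafka3:9093",
        }
    },
    "retail-params": {
        "inherited": ["common-params"],
        "parameters": {
            "sftp_retail": "sftp.retail.on-prem.bank",
            "retail_s3_bucket": "retail-analytics-landing",
        },
    },
}

params = effective_parameters("retail-params", contexts)
print(sorted(params))
```

A flow bound to retail-params sees all four names without knowing which layer supplied them, which is the property the worked example relies on.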
Output.
| Layer | Purpose | Env-agnostic | Sensitive-safe |
|---|---|---|---|
| Flow (Registry) | Data-movement logic | yes | yes (only #{param} refs) |
| common-params | Shared infra | no (per env) | yes (Vault lookup) |
| retail-params | Team overrides | no (per env) | yes (Vault lookup) |
| Vault | Secret store | yes | yes (encrypted) |
Rule of thumb. Every non-trivial NiFi deployment has at least two parameter contexts per team: one inherited base and one team-specific override. Sensitive values always go through Vault (or the platform's equivalent), never in the raw parameter value. This is the difference between a compliant deployment and a hand-wave.
Worked example — rollback via Registry version history
Detailed explanation. A prod deploy introduces a bug — the enriched-attribute path drops the customer_tier field. Twenty minutes of downstream analytics data is affected. The on-call needs to roll back the flow to the previous version right now. Registry makes this a two-click operation.
- The bug. The new LookupRecord config uses the wrong RecordPath, so customer_tier is written to the wrong JSON key.
- The impact. Downstream analytics ingests 20 minutes of records with a missing field.
- The rollback. On the prod NiFi UI, right-click the process group → Version → Show version history → pick v2 (the last-known-good) → "Change to this version".
Question. Walk through the rollback via the NiFi UI and via the CLI. Show what happens to FlowFiles in-flight during the rollback.
Input.
| Version | Committed at | Description | State |
|---|---|---|---|
| v1 | 2026-06-15 | Initial version | superseded |
| v2 | 2026-07-01 | Add tier enrichment | last-known-good |
| v3 | 2026-07-04 09:30 | Refactor LookupRecord | BUG |
Code.
# UI path (30 seconds):
# Right-click the process group → Version → Show version history
# → click v2 → click "Change to this version" → confirm
#
# CLI path (for automated rollbacks in the on-call runbook):
./bin/nifi-toolkit-cli.sh nifi pg-change-version \
--processGroupId <pg-uuid> \
--flowVersion 2 \
--url https://nifi-prod:8443
# Verify the rollback
./bin/nifi-toolkit-cli.sh nifi pg-get-version \
--processGroupId <pg-uuid> \
--url https://nifi-prod:8443 \
--outputType json \
| jq '.versionControlInformation'
# {
# "flowId": "...",
# "version": 2,
# "state": "UP_TO_DATE"
# }
Runbook — Registry-based rollback
=================================
t+0s PagerDuty alert: "downstream missing customer_tier field"
t+30s On-call opens NiFi UI, navigates to the affected process group
t+60s Right-click → Version → Show version history
t+90s Identify v3 as the bad version; v2 as last-known-good
t+120s Click "Change to this version" on v2
t+150s NiFi stops the running processors, materialises v2 config, restarts processors
t+180s Verify downstream: customer_tier field present again
t+240s File incident ticket; leave v3 in history for post-mortem
Step-by-step explanation.
- The Registry-based rollback is safe to re-run: NiFi stops the affected processors in the process group (letting active threads finish), materialises the v2 flow definition (replacing v3's LookupRecord config), and restarts them. FlowFiles waiting in connection queues are preserved; the version change itself drops nothing.
- For a flow that's mid-batch when the rollback happens, this is exactly the right behaviour. FlowFiles that already passed through v3's (buggy) LookupRecord keep the buggy output; FlowFiles still queued upstream of it are processed under v2's (working) logic once the processors restart. There's a small "smear" where a handful of records still carry the bug — usually a few seconds' worth.
- If the operator wants a cleaner rollback, they can pause the source processors first (right-click → Stop), let the queues drain, then change to v2. This trades a minute of ingestion pause for zero bug records post-rollback.
- The CLI path is what goes into the on-call runbook — pg-change-version with the target version. The runbook can literally be a shell script that any on-call runs when they see the alert. No manual UI navigation under time pressure.
- After rollback, v3 stays in Registry's version history. The post-mortem examines exactly what changed between v2 and v3, why the tests didn't catch the bug, and what to add to the CI pipeline. Rolling back does not delete the bad version; it just stops using it.
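A runbook script needs a rule for choosing the version to roll back to. One possible rule, sketched here with a hypothetical `rollback_target` helper, is "the newest version older than the current one that is not flagged as bad":

```python
from typing import Optional, Set

def rollback_target(current: int, bad_versions: Set[int]) -> Optional[int]:
    """Pick the newest version older than `current` that is not known-bad."""
    for version in range(current - 1, 0, -1):
        if version not in bad_versions:
            return version
    return None  # nothing safe to roll back to

print(rollback_target(current=3, bad_versions={3}))  # 2
```

Returning `None` instead of guessing matters in the on-call case: if every earlier version is also flagged, the script should stop and page a human rather than deploy an arbitrary version.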
Output.
| Metric | Value |
|---|---|
| Rollback wall-clock | ~3 minutes |
| FlowFiles lost | 0 |
| FlowFiles with residual bug | a few seconds' worth (~5000 events per second of smear at 5000/s) |
| Post-mortem artifact | v3 preserved in Registry |
| Manual code deploy | 0 |
Rule of thumb. Registry-based rollback is the fastest possible incident response for a NiFi flow. Every on-call runbook has a pg-change-version shell command; every alert has a documented last-known-good version to roll back to. This is why Registry is non-negotiable in prod.
Senior interview question on Registry-driven promotion
A senior interviewer might ask: "Design the end-to-end CI/CD story for a team owning 30 NiFi flows across dev, staging, and prod. Cover Registry layout, parameter contexts, promotion workflow, rollback plan, and what you'd add to catch a bad deploy before it reaches prod."
Solution Using Registry + parameter-context inheritance + a promotion pipeline
# Registry layout — one bucket per team, one flow per data-movement unit
registry:
  buckets:
    - name: retail
      flows:
        - name: sftp-positions-eod
          versions: [v1, v2, v3, ...]
        - name: kafka-orders-to-warehouse
          versions: [v1, v2, ...]
        # ... 28 more flows
      permissions:
        write: [retail-team]
        read: [retail-team, promotion-svc, sre]
# Parameter contexts — inherited base + per-team overrides + per-env instances
parameter_contexts:
  - name: base-common-params
    scope: shared across all envs, all teams
    parameters:
      - kafka_bootstrap: "kafka.internal:9093"
      - zk_hosts: "zk.internal:2181"
      - s3_region: "us-east-1"
      - kafka_ssl_pw: "{{vault:common/kafka/pw}}"
  - name: dev-common-params
    inherits: [base-common-params]
    parameters:
      - env_label: "dev"
      - s3_bucket_prefix: "dev-"
  - name: prod-common-params
    inherits: [base-common-params]
    parameters:
      - env_label: "prod"
      - s3_bucket_prefix: "prod-"
  # Per-team, per-env layered on top
  - name: dev-retail-params
    inherits: [dev-common-params]
  - name: prod-retail-params
    inherits: [prod-common-params]
# Promotion pipeline — GitLab CI job
# (validate-flow and tag-flow are placeholders for team-owned wrapper scripts,
# not stock NiFi Toolkit commands)
promotion_pipeline:
  stages: [validate, deploy_staging, smoke_test, deploy_prod, tag]
  jobs:
    validate:
      script: nifi-registry-toolkit validate-flow --bucket retail --flow sftp-positions-eod --version $VERSION
    deploy_staging:
      script: nifi-toolkit-cli.sh nifi pg-change-version --processGroupId $STAGING_PG --flowVersion $VERSION
    smoke_test:
      script: pytest tests/e2e/sftp_positions_eod.py --env staging
    deploy_prod:
      when: manual
      script: nifi-toolkit-cli.sh nifi pg-change-version --processGroupId $PROD_PG --flowVersion $VERSION
    tag:
      script: nifi-registry-toolkit tag-flow --flow sftp-positions-eod --version $VERSION --tag "prod-$(date +%Y%m%d)"
Step-by-step trace.
| Stage | Actor | Action | Duration |
|---|---|---|---|
| Dev | Data engineer | Commit v4 to Registry | 30 s |
| Validate | GitLab CI | Schema + structural checks | 1 min |
| Staging deploy | GitLab CI | pg-change-version on staging | 30 s |
| Smoke test | GitLab CI (pytest) | End-to-end test against staging | 5 min |
| Manual approve | Team lead | Click "deploy_prod" | (async) |
| Prod deploy | GitLab CI | pg-change-version on prod | 30 s |
| Tag | GitLab CI | Registry version tagged prod-20260704 | 5 s |
After the pipeline, every prod version in Registry carries a prod-YYYYMMDD tag. Rollback targets the previous prod-tagged version. Every flow follows the same pipeline; new team members do not have to invent a CI/CD story per flow.
Output:
| Layer | Env-agnostic | Env-specific | Sensitive-safe |
|---|---|---|---|
| Registry (flow) | yes | — | yes (only #{param}) |
| base-common-params | yes | — | yes (Vault) |
| dev/prod-common-params | — | yes | yes |
| dev/prod-retail-params | — | yes | yes |
| Vault | yes | — | yes (encrypted) |
Why this works — concept by concept:
- Registry per team, flow per unit — one bucket per team, one flow per data-movement unit. Access control is at the bucket; version history is per flow. Team ownership and blast-radius are aligned.
- Parameter-context inheritance — base → env-common → env-team is the canonical three-layer inheritance. Base shared across everything; env-common per environment; env-team for the team's specifics. Adding a new team is one new context.
- Vault for sensitive values — {{vault:...}} indirection means the parameter context in Registry never carries the cleartext secret. Rotation is a Vault-only operation.
- CI/CD pipeline with manual approve — validate → staging → smoke → manual approve → prod. Staging is automatic; prod requires a human. This is the shape of every mature deploy pipeline.
- Cost — one Registry service, one Vault, one CI/CD pipeline template. Marginal cost of a new flow is a one-line entry in the flows registry and a pg-change-version call. O(1) per new flow after the framework is in place.
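The gating logic of the pipeline (stop at the first failing stage, and never reach prod without approval) can be sketched independently of any CI system. The `run_pipeline` function and the lambda jobs below are hypothetical stand-ins for the real stage scripts.

```python
from typing import Callable, Dict, List

def run_pipeline(stages: List[str], jobs: Dict[str, Callable[[], bool]],
                 approved: bool) -> List[str]:
    """Run stages in order; stop at the first failure, and stop before the
    manually gated deploy_prod stage unless it has been approved."""
    completed: List[str] = []
    for stage in stages:
        if stage == "deploy_prod" and not approved:
            break
        if not jobs[stage]():
            break
        completed.append(stage)
    return completed

stages = ["validate", "deploy_staging", "smoke_test", "deploy_prod", "tag"]
jobs = {name: (lambda: True) for name in stages}
jobs["smoke_test"] = lambda: False  # simulate a failing staging test

print(run_pipeline(stages, jobs, approved=True))  # ['validate', 'deploy_staging']
```

Even with approval granted, a failed smoke test keeps the bad version out of prod, which is the property the staging stage exists to provide.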
5. Cluster, provenance, and production patterns
Cluster mode + provenance + Site-to-Site — the three production superpowers
The mental model in one line: production NiFi is a nifi cluster of three or more nodes coordinated through Zookeeper (or NiFi's embedded elector), with a **cluster coordinator managing node membership and a primary node running singleton processors, plus a nifi provenance repository recording every FlowFile event so operators can query and replay a specific FlowFile's journey, plus Site-to-Site as the canonical NiFi-to-NiFi transfer protocol for edge-to-core or DC-to-cloud patterns**. Every senior production question falls under one of those three.
Cluster mode — anatomy.
- Nodes. Every NiFi node runs the same flow definition. FlowFiles land on any node; processors run on every node in parallel; the cluster coordinator ensures the flow definition is identical across nodes.
- Cluster coordinator. A single node, elected through ZooKeeper (or another configured leader-election implementation, such as Kubernetes leases in NiFi 2.x), that owns cluster membership decisions. Coordinator failure triggers a re-election, typically within seconds.
- Primary node. A single node (elected separately from the coordinator) that runs singleton processors — ListSFTP, ListS3, QueryDatabaseTable, ConsumeJMS — any processor that would double-fetch if run on multiple nodes.
- Load balancing on connections. Every connection can be configured to redistribute FlowFiles across nodes: Partition by Attribute (consistent hash), Round Robin, or Single Node (all to one). This is how you get parallelism after a Primary-Node-only fetch.
- State Manager. Per-node local state (fast) and cluster-wide state via Zookeeper (slow but consistent). Each processor declares which scope it uses.
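The key property of Partition by Attribute is that every FlowFile with the same attribute value lands on the same node. A minimal sketch of that property using a plain hash-modulo mapping, which is simpler than a true consistent-hash ring and is not NiFi's actual algorithm (`node_for` and the `customer-N` keys are made up for the example):

```python
import hashlib
from collections import Counter

def node_for(attribute_value: str, node_count: int) -> int:
    """Map an attribute value to a node index with a stable hash.
    hashlib is used because Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(attribute_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count

NODES = 3
assignments = Counter(node_for(f"customer-{i}", NODES) for i in range(3000))

# The same key always maps to the same node, and all nodes receive work.
print(node_for("customer-42", NODES) == node_for("customer-42", NODES))
print(sorted(assignments))
```

The trade-off with hash-modulo is that changing the node count remaps most keys; a consistent-hash ring limits that reshuffling, which matters when nodes join or leave a running cluster.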
Provenance — the lineage superpower.
- What's recorded. Every FlowFile event: created, forked, cloned, dropped, sent, received, routed, attributes-modified, content-modified, expired.
- Where. Local disk on each node in the Provenance Repository. Retention is configurable by age and by total size, and events are purged when either limit is reached (30 days is the age figure used here).
- How to query. NiFi UI → Data Provenance → filter by attribute, timestamp, event type, processor. Returns matching events with links to view attributes/content and to replay.
- Replay. Pick any FlowFile from the Provenance history → click "Replay" → NiFi re-injects that FlowFile at the same point in the flow, provided its content is still available in the content repository. Useful for debugging or reprocessing after a fix.
- Compliance. Provenance is the audit answer. "Show me the lineage of customer_id=42 across the last 30 days" — one Provenance query.
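A lineage query is, at its core, a filter and sort over an event log. The sketch below models that with a hypothetical `ProvenanceEvent` record and `lineage` function; the field names and the sample log are illustrative and do not reflect NiFi's storage format.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ProvenanceEvent:
    flowfile_uuid: str
    event_type: str  # e.g. RECEIVE, ROUTE, SEND, DROP
    component: str   # the processor that emitted the event
    timestamp: float

def lineage(events: List[ProvenanceEvent], flowfile_uuid: str) -> List[ProvenanceEvent]:
    """Return one FlowFile's events in time order."""
    matching = [e for e in events if e.flowfile_uuid == flowfile_uuid]
    return sorted(matching, key=lambda e: e.timestamp)

# Events arrive interleaved across FlowFiles and out of order.
log = [
    ProvenanceEvent("ff-1", "SEND", "PutS3Object", 3.0),
    ProvenanceEvent("ff-2", "RECEIVE", "ConsumeKafka", 1.5),
    ProvenanceEvent("ff-1", "RECEIVE", "ConsumeKafka", 1.0),
    ProvenanceEvent("ff-1", "ROUTE", "RouteOnAttribute", 2.0),
]

path = [f"{e.event_type}@{e.component}" for e in lineage(log, "ff-1")]
print(path)  # ['RECEIVE@ConsumeKafka', 'ROUTE@RouteOnAttribute', 'SEND@PutS3Object']
```

Replay builds on the same record: once the event at a given point is identified, the FlowFile's attributes and content at that point are what get re-injected.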
Site-to-Site — NiFi-to-NiFi transfer.
- What it is. A native NiFi protocol for transferring FlowFiles between two NiFi instances. Compressed, encrypted, and backpressure-aware end-to-end.
- The topology. Edge NiFi (small footprint at each field site) sends to central NiFi (data-center cluster). Or on-prem NiFi → cloud NiFi across a WAN. The receiving instance publishes an "input port" that the sending instance targets.
- The batteries included. Site-to-Site handles TLS, node discovery in a cluster (send to any node, receive load-balanced), FlowFile attributes end-to-end, and clean backpressure — if the receiving cluster's queues are full, the sending cluster gets a "slow down" signal.
NiFi vs Airflow — the interview matrix.
| Dimension | NiFi | Airflow |
|---|---|---|
| Programming model | Flow-based (data packets through graph) | DAG-based (tasks with dependencies) |
| Latency floor | ~ms | ~1 minute (scheduler tick) |
| Data grain | Per-FlowFile | Per-task-run |
| Provenance | Per-FlowFile, built-in | Per-task, needs OpenLineage |
| Deployment | Self-contained cluster | Scheduler + workers + metadata DB + queue |
| Sweet spot | Integration-heavy, low-latency, on-prem | Batch ETL, cross-service orchestration |
| Programming interface | Canvas + Registry JSON | Python DAGs |
| Failure model | Bounded queues + backpressure | Task retry + SLA miss |
When you'd combine both.
- NiFi at the edge, Airflow at the warehouse. NiFi ingests, validates, lands to S3/HDFS. Airflow orchestrates the warehouse build after the daily landing completes. NiFi's Site-to-Site + PutHDFS is the ingest; an Airflow sensor on the landing location (for example `S3KeySensor` for an S3 landing) waits for the landed data and gates the transform DAG.
- NiFi for CDC, Airflow for downstream aggregation. NiFi captures CDC events from Debezium via Kafka, deduplicates, lands to raw table. Airflow runs the hourly aggregation into the analytics mart.
- The anti-pattern. Running the whole batch ETL in NiFi with a cron-driven `GenerateFlowFile`. NiFi can do this, but Airflow's DAG semantics and scheduler are a better fit for that shape.
Common interview probes on production NiFi.
- "How does a NiFi cluster work?" — coordinator + primary node; every node runs the flow; load-balancing on connections.
- "What's the difference between the coordinator and the primary node?" — coordinator manages membership; primary runs singleton processors.
- "Explain Provenance replay." — pick a Provenance event → click Replay → FlowFile re-injected at that processor.
- "When would you use Site-to-Site vs a Kafka bridge?" — Site-to-Site for NiFi ↔ NiFi with attribute preservation; Kafka when a durable buffer is required or non-NiFi consumers exist.
- "NiFi vs Airflow — pick one for our streaming pipeline." — depends on latency, data grain, and team stack; work through the matrix.
Worked example — cluster deployment on Kubernetes with a coordinator + primary + 3 workers
Detailed explanation. A team deploys a NiFi cluster on Kubernetes for a hybrid deployment — 3 NiFi nodes as a StatefulSet, backed by an external Zookeeper for the coordinator election. The single Primary Node runs ListSFTP (cluster-safe fetch pattern); all 3 nodes run the transform and sink. Walk through the K8s manifest and the load-balancing on the connection out of the ListSFTP.
- The pattern. ListSFTP (Primary-Node-only) → cluster-load-balanced connection → FetchSFTP (all nodes) → transform → sink.
- The K8s shape. StatefulSet with 3 replicas; each pod has its own PVC for flow/content/provenance repos; a headless service for inter-node discovery.
- The Zookeeper. External 3-node ZK ensemble for coordinator + primary election.
Question. Write the K8s StatefulSet, the NiFi flow, and show how load balancing on the connection distributes work after the singleton ListSFTP.
Input.
| Parameter | Value |
|---|---|
| NiFi nodes | 3 |
| Zookeeper | External 3-node ensemble |
| Storage | 3 PVCs per node: 20 GiB flow, 200 GiB content, 100 GiB provenance |
| ListSFTP | Runs on primary node only |
| FetchSFTP | Cluster-load-balanced (Round Robin) |
Code.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nifi
spec:
  serviceName: nifi-headless
  replicas: 3
  selector:
    matchLabels:
      app: nifi
  template:
    metadata:
      labels:
        app: nifi
    spec:
      terminationGracePeriodSeconds: 300
      containers:
        - name: nifi
          image: apache/nifi:2.0.0
          ports:
            - containerPort: 8443
              name: https
            - containerPort: 8080
              name: cluster
            - containerPort: 10000
              name: site2site
          env:
            - name: NIFI_WEB_HTTPS_PORT
              value: "8443"
            - name: NIFI_CLUSTER_IS_NODE
              value: "true"
            - name: NIFI_CLUSTER_NODE_PROTOCOL_PORT
              value: "8080"
            - name: NIFI_ZK_CONNECT_STRING
              value: "zk-0.zk:2181,zk-1.zk:2181,zk-2.zk:2181"
            - name: NIFI_ELECTION_MAX_WAIT
              value: "1 min"
            - name: NIFI_JVM_HEAP_INIT
              value: "8g"
            - name: NIFI_JVM_HEAP_MAX
              value: "8g"
          volumeMounts:
            - name: flow-repo
              mountPath: /opt/nifi/nifi-current/flowfile_repository
            - name: content-repo
              mountPath: /opt/nifi/nifi-current/content_repository
            - name: provenance-repo
              mountPath: /opt/nifi/nifi-current/provenance_repository
  volumeClaimTemplates:
    - metadata:
        name: flow-repo
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: content-repo
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi
    - metadata:
        name: provenance-repo
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nifi-headless
spec:
  clusterIP: None
  selector:
    app: nifi
  ports:
    - port: 8080
      name: cluster
    - port: 8443
      name: https
    - port: 10000
      name: site2site
```
```xml
<!-- Flow: ListSFTP (primary only) → FetchSFTP (all nodes, load-balanced) -->
<!-- Illustrative notation only: NiFi stores flows as JSON, and Execution Node and
     Concurrent Tasks are scheduling settings rather than processor properties. -->
<processor id="list_sftp" type="ListSFTP">
  <property name="Hostname">sftp.internal</property>
  <property name="Remote Path">/incoming</property>
  <property name="Polling Interval">30 sec</property>
  <property name="Execution Node">Primary Node</property>  <!-- singleton -->
</processor>

<processor id="fetch_sftp" type="FetchSFTP">
  <property name="Hostname">sftp.internal</property>
  <property name="Execution Node">All Nodes</property>     <!-- parallel -->
  <property name="Concurrent Tasks">4</property>
</processor>

<!-- Round Robin redistributes the listed FlowFiles across all nodes -->
<connection source="list_sftp" destination="fetch_sftp"
            back-pressure-object-threshold="10000"
            load-balance-strategy="Round Robin"
            load-balance-compression="Compress Attributes and Content">
</connection>
```
Step-by-step explanation.
- The StatefulSet deploys 3 NiFi pods with stable network identities (`nifi-0`, `nifi-1`, `nifi-2`). Each pod has three PVCs (flow, content, provenance) sized for the workload. The headless service enables pod-to-pod discovery over the cluster protocol port.
- On startup, the 3 nodes contact Zookeeper. ZK elects a coordinator (say `nifi-0`) and independently elects a primary (say `nifi-1`). The elections are separate: losing the coordinator does not lose the primary and vice versa.
- `ListSFTP` is configured with `Execution Node = Primary Node`. Only the primary polls SFTP; the other two nodes stay quiet on this processor. This is the cluster-safe pattern for source processors: no double-fetch, no cross-node race.
- The connection out of `ListSFTP` has `load-balance-strategy = Round Robin`, so the FlowFiles listed by the primary are handed out in turn to `nifi-0`, `nifi-1`, and `nifi-2`. This happens over the network, on NiFi's dedicated cluster load-balance port (`nifi.cluster.load.balance.port`, 6342 by default) rather than over Site-to-Site.
- `FetchSFTP` runs on all 3 nodes in parallel with `Concurrent Tasks = 4`, giving up to 12 concurrent SFTP downloads across the cluster. Downstream processors (transform, sink) run in parallel on the same 3 nodes. After the redistribution the cluster can approach 3× single-node throughput, provided the SFTP server and the network keep up.
Output.
| Node | Coordinator? | Primary? | ListSFTP | FetchSFTP | Downstream |
|---|---|---|---|---|---|
| nifi-0 | yes | no | idle | active (4 threads) | active |
| nifi-1 | no | yes | active | active (4 threads) | active |
| nifi-2 | no | no | idle | active (4 threads) | active |
Rule of thumb. In a NiFi cluster, always mark source processors that would double-fetch (List*, Query*, Consume* where non-consumer-group) with Execution Node = Primary Node, and pair with load-balance-strategy = Round Robin on the outgoing connection to distribute work back across the cluster. This is the canonical "list on primary, fetch on all" pattern.
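The "list on primary, fetch on all" rule can be checked with a small Python model. This is only a sketch under simplifying assumptions: the node names and file paths are made up, and each node that runs the List step is assumed to emit one FlowFile per remote file.

```python
from collections import Counter
from itertools import cycle

NODES = ["nifi-0", "nifi-1", "nifi-2"]                           # hypothetical nodes
REMOTE_FILES = [f"/incoming/part-{i:03d}.csv" for i in range(9)]  # hypothetical files


def list_on(nodes_running_list, remote_files):
    """Each node that runs the List step emits one FlowFile per remote file."""
    return [(node, path) for node in nodes_running_list for path in remote_files]


def round_robin(listings, nodes):
    """Redistribute listed FlowFiles across the cluster in turn."""
    return [(target, path) for (_, path), target in zip(listings, cycle(nodes))]


# Execution Node = All Nodes: every node lists every file.
everywhere = list_on(NODES, REMOTE_FILES)

# Execution Node = Primary Node, then a Round Robin connection.
primary_only = list_on(["nifi-1"], REMOTE_FILES)
fetch_plan = round_robin(primary_only, NODES)

print(len(everywhere), len(primary_only))
print(Counter(node for node, _ in fetch_plan))
```

In this model the All Nodes configuration produces 27 listings for 9 files, while the primary-only listing produces 9 and the Round Robin connection still gives each node an equal share of the fetch work.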
Worked example — Provenance query and replay of a specific FlowFile
Detailed explanation. A downstream analytics report shows one customer's aggregation is wrong. The team suspects a specific event was mis-routed three days ago. They use Provenance to find the exact FlowFile, view its attributes and content, identify the mis-routing, then replay a corrected FlowFile after fixing the flow.
- The starting point. Customer complaint at t+3 days. All the operator knows is `customer_id=42` and roughly what the event should look like.
- The Provenance query. Filter by `customer.id = 42` and event type = `RECEIVE` (ingest point) within the 3-day window.
- The finding. One FlowFile ingested with `region = null` instead of `region = us`, so it was routed to the `unmatched` dead-letter path instead of the `us` path.
- The fix + replay. Fix the upstream `EvaluateJsonPath` config, then find the mis-routed FlowFile in Provenance, click Replay, and re-inject it at the corrected processor.
Question. Walk through the Provenance query, the diagnosis, the fix, and the replay. Explain what Replay actually does under the hood.
Input.
| Parameter | Value |
|---|---|
| Customer complaint | customer_id=42 aggregation wrong |
| Suspicion window | last 3 days |
| Flow | events pipeline (5 processors) |
Code.
```yaml
# Step 1: Provenance query (NiFi UI → Data Provenance)
query:
  searchTerms:
    - field: FlowFileAttribute
      name: customer.id
      value: "42"
  eventTypes: [RECEIVE, ROUTE, DROP, SEND]
  startDate: "2026-07-01T00:00:00Z"
  endDate: "2026-07-04T00:00:00Z"
  maxResults: 100

# Result: one matching FlowFile with lineage
event_1:
  timestamp: "2026-07-02T14:32:11Z"
  event_type: RECEIVE
  processor: ListenHTTP
  flowfile_uuid: "a1b2c3d4-1111-2222-3333-444455556666"
  attributes:
    customer.id: "42"
    region: null                  # ← the bug
    http.remote.host: "10.4.2.11"
event_2:
  timestamp: "2026-07-02T14:32:11Z"
  event_type: ROUTE
  processor: RouteOnAttribute
  relationship: "unmatched"       # ← mis-routed
  flowfile_uuid: "a1b2c3d4-1111-2222-3333-444455556666"
event_3:
  timestamp: "2026-07-02T14:32:11Z"
  event_type: SEND
  processor: PublishKafka
  topic: "events.unmatched"       # ← wrong topic
  flowfile_uuid: "a1b2c3d4-1111-2222-3333-444455556666"
```
```xml
<!-- Step 2: the buggy upstream config -->
<processor id="extract_region_BUGGY" type="EvaluateJsonPath">
  <property name="Destination">flowfile-attribute</property>
  <property name="region">$.geo.region</property>  <!-- wrong path; actual is $.region -->
</processor>

<!-- Fixed version -->
<processor id="extract_region_FIXED" type="EvaluateJsonPath">
  <property name="Destination">flowfile-attribute</property>
  <property name="region">$.region</property>
</processor>
```
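```shell
# Step 3: replay the mis-routed FlowFile after the fix
# UI: right-click event → Replay → confirm
# Scripted: the REST API accepts a replay request for a provenance event.
# Provenance event IDs are numeric; <event-id> and <node-id> are placeholders,
# and the authentication flags depend on how the cluster is secured.
curl -X POST "https://nifi-prod:8443/nifi-api/provenance-events/replays" \
  -H "Content-Type: application/json" \
  -d '{"eventId": <event-id>, "clusterNodeId": "<node-id>"}'
```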
Step-by-step explanation.
- The Provenance query filters on the `customer.id = 42` attribute across a 3-day window. Provenance indexes timestamps and a configurable set of attributes, so the search returns quickly as long as `customer.id` is listed in `nifi.provenance.repository.indexed.attributes`.
- The result shows three events for the mis-routed FlowFile: RECEIVE (ingested with `region = null`), ROUTE (sent to the `unmatched` relationship because the region attribute was null), and SEND (published to the `events.unmatched` Kafka topic instead of `events.us`). The lineage tells the story: the extract step failed to populate `region`.
- The operator navigates to the `EvaluateJsonPath` processor in the flow, sees the incorrect JSONPath (`$.geo.region` instead of `$.region`), fixes it, and commits the fix to Registry as v4. Future FlowFiles route correctly.
- To fix the historical record, the operator opens the mis-routed FlowFile's lineage in Provenance, picks the event recorded at `EvaluateJsonPath`, and clicks Replay. NiFi re-queues a copy of the FlowFile on the connection feeding that processor, and the flow definition is now the fixed v4. Replaying the SEND event would not help, because the FlowFile would re-enter at `PublishKafka`, downstream of the fix. The FlowFile is re-processed under the corrected logic, with `region = us` populated, and it is published to `events.us` this time.
- Provenance Replay uses the content-repository claim that's still preserved (unless GC'd) to reconstruct the FlowFile bytes. Attributes are read from Provenance's per-event snapshot. The replayed FlowFile is a fresh FlowFile with the same content, not a "resurrected" original, but for downstream purposes it's indistinguishable.
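The query-then-diagnose part of this walkthrough can be sketched in plain Python over in-memory event records. The dictionary layout, the shortened UUIDs, and the extra event for customer 7 are assumptions for illustration; this is not NiFi's provenance schema or API.

```python
from datetime import datetime

BAD_ATTRS = {"customer.id": "42", "region": None}

# Event records modelled on the query result above (layout is hypothetical).
EVENTS = [
    {"id": 1, "time": "2026-07-02T14:32:11Z", "type": "RECEIVE",
     "processor": "ListenHTTP", "uuid": "a1b2c3d4", "attributes": BAD_ATTRS},
    {"id": 2, "time": "2026-07-02T14:32:11Z", "type": "ROUTE",
     "processor": "RouteOnAttribute", "uuid": "a1b2c3d4",
     "attributes": BAD_ATTRS, "relationship": "unmatched"},
    {"id": 3, "time": "2026-07-02T14:32:11Z", "type": "SEND",
     "processor": "PublishKafka", "uuid": "a1b2c3d4",
     "attributes": BAD_ATTRS, "topic": "events.unmatched"},
    {"id": 4, "time": "2026-07-02T15:00:00Z", "type": "RECEIVE",
     "processor": "ListenHTTP", "uuid": "ffff0000",
     "attributes": {"customer.id": "7", "region": "us"}},
]


def parse(timestamp):
    """Parse an ISO-8601 UTC timestamp with a trailing 'Z'."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def search(events, attribute, value, start, end, event_types):
    """Filter by attribute value, time window, and event type."""
    return [
        e for e in events
        if e["attributes"].get(attribute) == value
        and e["type"] in event_types
        and parse(start) <= parse(e["time"]) < parse(end)
    ]


def lineage(events, uuid):
    """Return one FlowFile's events in the order they were recorded."""
    return sorted((e for e in events if e["uuid"] == uuid), key=lambda e: e["id"])


def first_misroute(path, bad_relationship="unmatched"):
    """Find the first event that sent the FlowFile down the wrong relationship."""
    for event in path:
        if event.get("relationship") == bad_relationship:
            return event
    return None


hits = search(EVENTS, "customer.id", "42",
              "2026-07-01T00:00:00Z", "2026-07-04T00:00:00Z",
              {"RECEIVE", "ROUTE", "DROP", "SEND"})
path = lineage(EVENTS, hits[0]["uuid"])
for event in path:
    print(event["id"], event["type"], event["processor"])
print("mis-routed at:", first_misroute(path)["processor"])
```

The same three moves (filter, order by lineage, find the first divergence) are what the operator performs by hand in the Data Provenance UI.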
Output.
| Step | Command | Duration | Result |
|---|---|---|---|
| Query | Data Provenance → filter | 5 s | 3 events for FlowFile a1b2c3d4 |
| Diagnose | Inspect attributes | 30 s | region = null → mis-routed |
| Fix flow | Edit EvaluateJsonPath | 2 min | JSONPath corrected |
| Commit to Registry | Right-click → Commit | 30 s | v4 saved |
| Replay | Right-click event → Replay | 3 s | FlowFile re-processed correctly |
Rule of thumb. Provenance is the single most useful production-debugging feature in NiFi. Every senior operator's first step on a data-quality alert is a Provenance query. Replay lets you correct historical data without rebuilding the entire pipeline. Few other data-movement tools ship this built in.
Worked example — Site-to-Site edge → central NiFi transfer
Detailed explanation. A retail company has 200 store sites. Each store runs a lightweight NiFi (1 node, 4 GB heap) that captures POS transactions in real time. Every store's NiFi sends transactions to a central NiFi cluster in the DC via Site-to-Site. Central NiFi normalises, enriches, and lands to the analytics warehouse.
- The pattern. Edge NiFi (per-store) → Site-to-Site → central NiFi cluster (DC).
- The receiving surface. Central NiFi publishes a Site-to-Site "Input Port" named `pos-ingest`.
- The sending surface. Each edge NiFi has a `RemoteProcessGroup` targeting the central cluster's URL and the `pos-ingest` port.
- The wins. Attribute preservation across the WAN, TLS/compression, cluster-aware load balancing on receive, end-to-end backpressure.
Question. Configure the sending and receiving sides. Explain what happens when the central cluster's queues fill (does the store back up?).
Input.
| Parameter | Value |
|---|---|
| Stores | 200 |
| Edge NiFi | 1 node per store, 4 GB heap |
| Central NiFi | 5-node cluster in DC |
| Transport | Site-to-Site over TLS |
| Compression | Content compressed |
Code.
```xml
<!-- Central NiFi: root canvas -->
<inputPort id="pos_ingest" name="pos-ingest">
  <accessControl>
    <group>site-to-site-clients</group>
    <permission>SEND</permission>
  </accessControl>
</inputPort>

<!-- Downstream from pos-ingest: normalise → enrich → publish to Kafka -->
<processor id="normalize" type="JoltTransformJSON">...</processor>
<processor id="enrich" type="LookupRecord">...</processor>
<processor id="publish" type="PublishKafka">
  <property name="Topic Name">pos.transactions</property>
</processor>

<!-- Edge NiFi (each store): capture and send -->
<processor id="capture_pos" type="ListenHTTP">
  <property name="Listening Port">9000</property>
  <property name="Base Path">pos</property>
</processor>

<remoteProcessGroup id="central_dc" name="Central NiFi">
  <targetUri>https://nifi-central.dc.internal:8443/nifi</targetUri>
  <transportProtocol>HTTP</transportProtocol>
  <communicationsTimeout>30 sec</communicationsTimeout>
  <yieldDuration>10 sec</yieldDuration>
  <inputPort id="pos_ingest">
    <name>pos-ingest</name>
    <useCompression>true</useCompression>
    <batchCount>1000</batchCount>
    <batchSize>4 MB</batchSize>
    <batchDuration>5 sec</batchDuration>
  </inputPort>
</remoteProcessGroup>

<connection source="capture_pos" destination="central_dc:pos_ingest"
            back-pressure-object-threshold="20000"
            back-pressure-data-size-threshold="512 MB">
</connection>
```
Step-by-step explanation.
- Central NiFi exposes `pos-ingest` as a Site-to-Site Input Port on the root canvas. The port has an access-control policy: only members of the `site-to-site-clients` group can SEND. Downstream of the port, the flow normalises the POS JSON, enriches with a SKU-to-department lookup, and publishes to Kafka.
- Each edge NiFi has a `RemoteProcessGroup` pointing at the central cluster's URL. On startup, the RPG fetches the cluster's node list and thereafter spreads FlowFiles across the 5 central nodes. Losing one central node just means the RPG rebalances across the remaining 4.
- The `useCompression = true` and batching settings pack many small FlowFiles into fewer wire batches, which matters on a WAN link with per-request TLS overhead. A batch closes at 1000 FlowFiles, 4 MB, or 5 seconds, whichever comes first.
- Backpressure end-to-end: when central NiFi's downstream (Kafka) slows and its post-`pos-ingest` queue fills, the input port stops accepting new FlowFiles. The edge NiFi's RPG sees the "slow down" signal on the next batch and pauses. The edge NiFi's connection out of `capture_pos` then fills; backpressure fires again; the edge stops accepting new POS transactions on `ListenHTTP`.
- In the store's operator experience, a central-DC outage manifests as "store keeps working; POS transactions are buffered locally on the edge NiFi's content repo; when the DC comes back, the backlog drains automatically". That holds until the edge connection reaches its own threshold (20,000 FlowFiles or 512 MB here); past that point `ListenHTTP` rejects requests and the POS application has to retry.
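The "count or size or duration, whichever comes first" batching rule described above can be sketched as a small Python class. This is a toy model of the rule only, not Site-to-Site's implementation; the class names are hypothetical, and the duration check runs only when a FlowFile is added (a real sender would also flush on a timer).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BatchPolicy:
    max_count: int = 1000             # batchCount
    max_bytes: int = 4 * 1024 * 1024  # batchSize (4 MB)
    max_seconds: float = 5.0          # batchDuration


@dataclass
class Batcher:
    policy: BatchPolicy
    sizes: List[int] = field(default_factory=list)
    opened_at: Optional[float] = None

    def add(self, size_bytes: int, now: float) -> Optional[List[int]]:
        """Add one FlowFile; return the closed batch if any limit was hit."""
        if self.opened_at is None:
            self.opened_at = now
        self.sizes.append(size_bytes)
        full = (
            len(self.sizes) >= self.policy.max_count
            or sum(self.sizes) >= self.policy.max_bytes
            or now - self.opened_at >= self.policy.max_seconds
        )
        return self.flush() if full else None

    def flush(self) -> List[int]:
        """Close the current batch and start a new, empty one."""
        batch, self.sizes, self.opened_at = self.sizes, [], None
        return batch


# Many small FlowFiles: the count limit closes the batch.
by_count = Batcher(BatchPolicy())
flushed = [by_count.add(2048, now=0.0) for _ in range(1000)]
print(len(flushed[-1]))

# A few large FlowFiles: the size limit closes the batch.
by_size = Batcher(BatchPolicy())
print([by_size.add(1024 * 1024, now=0.0) is not None for _ in range(4)])

# A quiet store: the duration limit closes the batch.
by_time = Batcher(BatchPolicy())
by_time.add(100, now=0.0)
print(by_time.add(100, now=6.0))
```

A busy store is bounded by count or size, while a quiet store is bounded by duration, which keeps latency predictable at both extremes.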
Output.
| Layer | Behaviour when central Kafka is slow |
|---|---|
| Central Kafka | slow |
| Central publish processor | queues fill |
| Central input port | stops accepting |
| Edge RPG | pauses batches |
| Edge output connection | fills to threshold |
| Edge capture_pos | pauses (via backpressure) |
| Store POS system | HTTP 503 from ListenHTTP; app-side retry buffers |
Rule of thumb. Use Site-to-Site for every NiFi-to-NiFi transfer inside your infrastructure. The end-to-end backpressure, batched compression, and cluster-aware load balancing are three features you don't get with a Kafka bridge or a raw HTTP endpoint. Reserve Kafka for cases where a durable buffer or non-NiFi consumers are required.
Senior interview question on production NiFi patterns
A senior interviewer might ask: "You're deploying NiFi for a retail chain — 200 stores, on-prem edges, DC central cluster, hybrid analytics in the cloud. Walk me through the topology, the cluster shape, the Provenance retention, the Site-to-Site pattern, and how you'd argue for NiFi over Airflow to the CTO."
Solution Using a three-tier NiFi topology with Registry, Provenance, and Site-to-Site
```yaml
# Topology: three tiers, one Registry
tiers:
  edge:
    count: 200                      # one NiFi per store
    resources: {heap: 4Gi, disk: 50Gi}
    role: capture + local buffer + send-to-central
    processors: [ListenHTTP, ExtractText, UpdateAttribute, RemoteProcessGroup]
  central_dc:
    count: 5                        # 5-node cluster in DC
    resources: {heap: 32Gi, disk: 2Ti per node}
    role: normalise + enrich + land + send-to-cloud
    processors: [pos-ingest InputPort, JoltTransformJSON, LookupRecord, PublishKafka, PutHDFS, RemoteProcessGroup]
  cloud_analytics:
    count: 3                        # 3-node cluster in cloud VPC
    resources: {heap: 16Gi, disk: 1Ti per node}
    role: land-to-s3 + trigger downstream Airflow DAG
    processors: [pos-ingest InputPort, PutS3Object, InvokeHTTP]

registry:
  deployment: single-instance
  buckets: [edge-store, central-dc, cloud-analytics]
  promotion: gitlab-ci-driven

provenance:
  retention: 30 days
  index: on customer.id, store.id, transaction.id
  replay: enabled

site_to_site:
  edge -> central_dc: batched, compressed, TLS
  central_dc -> cloud: batched, compressed, TLS
  cluster_aware: true               # RPG discovers cluster nodes

nifi_vs_airflow:
  role_of_nifi: ingest + move + provenance
  role_of_airflow: warehouse build + cross-service orchestration + reporting
  boundary: cloud_analytics NiFi lands to S3 → S3 event triggers Airflow DAG
```
Step-by-step trace.
| Tier | Node count | Role | Sends to |
|---|---|---|---|
| Edge (per store) | 200 (1 each) | Capture + local buffer | Central DC via Site-to-Site |
| Central DC | 5 | Normalise + enrich + local Kafka + land | Cloud VPC via Site-to-Site |
| Cloud analytics | 3 | S3 landing + Airflow trigger | Airflow (via HTTP) |
| Registry | 1 | Version control | — |
| Airflow | (external) | Warehouse build | — |
After deployment, the topology captures POS transactions in real time, buffers them locally when central is down, batches efficiently across the WAN, and lands to both an on-prem Hadoop cluster (for the analytics team's legacy queries) and to S3 (for the cloud-native warehouse). Airflow orchestrates the S3-to-Snowflake ELT.
Output:
| Metric | Value |
|---|---|
| End-to-end latency (POS → central Kafka) | ~5 s (batched) |
| End-to-end latency (POS → cloud S3) | ~15 s (double-hop) |
| Provenance retention | 30 days |
| Rollback time | ~3 min (Registry-based) |
| Store outage isolation | Edge NiFi buffers 12 hours of POS |
| Central DC outage impact | 200 stores buffer to disk; auto-drain on recovery |
Why this works — concept by concept:
- Three-tier topology — edge, central DC, cloud. Each tier has a clear responsibility and its own failure isolation. Losing one edge NiFi affects one store; losing central DC just buffers everything at the edge.
- Site-to-Site between tiers — the only inter-tier transport. Backpressure end-to-end; batched + compressed for WAN; cluster-aware on receive. Zero custom code.
- Registry as the single-source-of-truth — one Registry, three buckets, one CI/CD pipeline. Every flow across all three tiers is version-controlled.
- Provenance retention indexed on domain keys: 30 days of Provenance, indexed on `customer.id`, `store.id`, `transaction.id`. Compliance and debugging queries return in seconds.
- Cost: 200 edge nodes × small resource footprint + 5 DC + 3 cloud + 1 Registry ≈ mid-single-digit-percent of the retail chain's data platform TCO. The alternative (custom Python per store + K8s + custom lineage) is 10× the engineering cost and 3× the operational cost.
Cheat sheet — NiFi recipes
- Five-processor flow layout. Source (`Get*` / `List*` + `Fetch*` / `Listen*`) → validate (`ValidateRecord` / `ValidateCsv`) → enrich (`LookupRecord` / `UpdateAttribute`) → route (`EvaluateJsonPath` + `RouteOnAttribute`) → sink (`Put*` / `Publish*`). Every flow in an interview whiteboard is a variant. Attributes carry routing keys; content is rewritten only when strictly necessary.
- Backpressure config snippet. Set `back-pressure-object-threshold` and `back-pressure-data-size-threshold` on every connection out of a source processor. Rule: `object_threshold × avg_payload_bytes × num_hot_connections < 0.5 × content_repo_disk`. Kafka (or the upstream source) is the buffer; NiFi is the flow. `flowfile-expiration = 0 sec` for data-preserving flows; positive TTL only for real-time dashboards where stale is worse than missing.
- Registry commit + import CLI. Commit a process group from the UI (Version → Start version control); `nifi-toolkit-cli.sh nifi pg-import --flowVersion N` to materialise it on another instance; `nifi-toolkit-cli.sh nifi pg-change-version --flowVersion M` to roll forward or back. Every prod rollback is one CLI call.
- Parameter context yaml. Three-layer inheritance: `base-common-params` (shared, sensitive via Vault) → `env-common-params` (dev/staging/prod) → `env-team-params` (team overrides). Reference in the flow via `#{name}`. Never put cleartext secrets in a parameter context; always `{{vault:...}}`.
- Provenance query for replay. UI: Data Provenance → search on FlowFile attribute (e.g. `customer.id = 42`) + time window + event type. Click a matching event → Replay → NiFi re-queues the FlowFile in front of the processor that emitted the event, under the current (post-fix) flow definition. Scripted: `POST /nifi-api/provenance-events/replays` on the REST API with the numeric event ID.
- Cluster mode rules. Every source that would double-fetch (`List*`, `Query*`, non-consumer-group `Consume*`) marked `Execution Node = Primary Node`. Immediately followed by a connection with `load-balance-strategy = Round Robin` to redistribute across all nodes. State that must survive node failure goes through Zookeeper via the cluster-scoped State Manager.
- Site-to-Site setup. Central NiFi exposes an Input Port with an access-control policy. Sending NiFi has a `RemoteProcessGroup` targeting the central URL and Input Port name. Enable `useCompression = true` and batch (1000 FF or 4 MB or 5 s) for WAN links. Backpressure propagates end-to-end automatically.
- NiFi 2.x Python processor. Subclass `FlowFileTransform`, implement `transform(context, flowfile)`, return `FlowFileTransformResult(relationship, attributes, contents)`. Drop the `.py` in `$NIFI_HOME/python/extensions/` (the default extensions directory). Auto-discovered on flow load. No Java, no Maven, no `.nar`.
- Expression Language essentials. `${attr}` for lookup; `${attr:equals('x')}`, `${attr:contains('sub')}`, `${attr:matches('regex')}` for booleans; `${now():format('yyyy-MM-dd')}` for dates; `${attr:substring(0, 10)}` for slicing. Attribute-only; never reaches into content.
- RouteOnAttribute vs RouteOnContent. RouteOnAttribute reads a `Map`: nanoseconds. RouteOnContent parses the payload: microseconds. Always extract routing keys to attributes once (with `EvaluateJsonPath` or `ExtractText`), then RouteOnAttribute from there.
- Failure handling pattern. Every processor that touches an external system wires three relationships: `success` → happy path, `failure` → into a `RetryFlowFile` with a `Maximum Retries` cap, `retries_exceeded` → dead-letter Kafka topic. Retries stay in the flow; permanent failures are inspected out-of-band.
- NiFi vs Airflow decision. Ingest, per-record, low-latency, integration-heavy, on-prem → NiFi. Batch, scheduled, cross-service, Python-first → Airflow. Combine them at the landing boundary: NiFi lands to S3 → S3 event triggers Airflow DAG. Never run batch ETL entirely in NiFi with a cron-scheduled `GenerateFlowFile`; that's the anti-pattern.
- Monitoring the flow. Prometheus `/metrics` endpoint scraped every 15 s. Key metrics: `nifi_amount_flowfiles_queued`, `nifi_amount_bytes_queued`, `nifi_backpressure_active`, `nifi_processor_task_millis`. Alerts: `queued > 0.75 × threshold for 5 min`; `backpressure_active = 1 for 2 min`. Runbook: identify the saturated connection → check downstream processor health → check upstream source health.
- Kubernetes deployment shape. StatefulSet with 3+ nodes, one PVC each for flow, content, and provenance repos (200 GB / 100 GB typical), external Zookeeper for coordinator + primary election, headless service for inter-node discovery, ingress for the UI at `:8443`. Auto-scaling is manual (add nodes and rebalance); rolling upgrades work fine.
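The backpressure sizing rule from the cheat sheet is simple arithmetic, so it is easy to turn into a pre-deployment check. The following Python sketch assumes the rule exactly as stated above; the function names and the example payload sizes are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024


def worst_case_queued_bytes(object_threshold, avg_payload_bytes, hot_connections):
    """Bytes held if every hot connection fills to its object threshold."""
    return object_threshold * avg_payload_bytes * hot_connections


def fits_content_repo(object_threshold, avg_payload_bytes, hot_connections,
                      content_repo_bytes, headroom=0.5):
    """Apply the rule: worst case must stay below headroom × content repo size."""
    worst = worst_case_queued_bytes(object_threshold, avg_payload_bytes, hot_connections)
    return worst < headroom * content_repo_bytes


def max_object_threshold(avg_payload_bytes, hot_connections,
                         content_repo_bytes, headroom=0.5):
    """Largest object threshold that still satisfies the strict inequality."""
    budget = headroom * content_repo_bytes
    per_object = avg_payload_bytes * hot_connections
    limit = int(budget // per_object)
    if limit * per_object >= budget:
        limit -= 1
    return max(limit, 0)


# 200 GiB content repo, 8 hot connections (illustrative numbers).
print(fits_content_repo(10_000, 50 * KIB, 8, 200 * GIB))   # small payloads
print(fits_content_repo(10_000, 5 * MIB, 8, 200 * GIB))    # large payloads
print(max_object_threshold(5 * MIB, 8, 200 * GIB))
```

With 50 KiB payloads the default 10,000-object threshold fits comfortably; with 5 MiB payloads it does not, and the object threshold would need to drop to 2,559 (or the size threshold would need to do the limiting instead).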
Frequently asked questions
What is Apache NiFi and when should a senior data engineer reach for it?
Apache NiFi is a flow-based data-movement runtime — an always-on graph of independent processors moving data packets (FlowFiles) through bounded queues with automatic backpressure, versioned via NiFi Registry, and lineage-tracked via Provenance. Reach for it when the workload is integration-heavy (SFTP, JMS, JDBC, MQTT, Kafka, S3, HDFS all mixed together), low-latency (seconds, not minutes), on-prem or hybrid (bank, telco, defence, healthcare), or when per-record provenance is a compliance requirement. Skip it when the workload is pure batch ETL orchestrated across a warehouse (Airflow's stronger), a pure stream-processing job with joins and windowed aggregations (Flink or Kafka Streams), or a single-cloud pure-Python team with no on-prem footprint. The senior signal is matching the tool to the workload shape, not defaulting to NiFi for everything.
NiFi vs Airflow — when do I pick which?
Pick NiFi when the workload is continuous data movement — files landing on SFTP, events arriving on Kafka, HTTPS ingestion — where the latency matters in seconds and the integration surface is broad. Pick Airflow when the workload is scheduled task orchestration — nightly warehouse builds, cross-service dependencies (Spark + dbt + Snowflake), Python-first pipelines where the artifact is a .py file. The clean boundary: NiFi handles ingest + move + land + provenance; Airflow orchestrates the transformations that run against the landed data. A production data platform typically has both — NiFi at the edge and ingest, Airflow at the warehouse and the transformation DAGs. The interview answer is not "one is better" — it's naming which shape of workload each is optimised for and pointing at the boundary where they compose.
What is a FlowFile in NiFi?
A FlowFile is NiFi's atomic data packet — a lightweight object with two parts: a map of key-value attributes (metadata like filename, uuid, source.host, plus any custom attributes processors add) and a pointer into the content repository (the actual payload bytes on disk). The critical insight: attributes travel with the FlowFile through processors, but content is copy-on-write — two FlowFiles sharing the same payload point at the same content-repository claim, and the bytes are rewritten only when a processor actually modifies them. This is what makes routing-on-attribute cheap (a Map lookup, nanoseconds) and routing-on-content expensive (a disk read + parse, microseconds). Every senior NiFi optimisation is "extract the routing key to an attribute once, then route on the attribute forever." FlowFile lineage is recorded per event in the Provenance Repository, which is what enables per-FlowFile replay and audit queries.
How does backpressure work in NiFi?
NiFi backpressure is a binary, propagating pause driven by bounded connection queues. Every connection has two thresholds — an object threshold (FlowFile count, default 10,000) and a size threshold (aggregate bytes, default 1 GB). When either threshold is crossed, NiFi stops scheduling the upstream processor — it pauses entirely, not throttled to a rate. If that upstream is now itself starved because its downstream queue is full, the pause propagates further upstream, all the way to the source processor. When the downstream drains the queue below the threshold, upstream resumes automatically — no manual reset. This is different from a rate limiter: rate limiters cap continuous throughput; backpressure is on/off, driven by the queue's actual state. In practice, backpressure means NiFi flows self-regulate under downstream slowdowns — the flow never fills the disk indefinitely, because the source stops pulling when the pipeline can't drain fast enough. The engineering trade-off: bounded queues + backpressure make NiFi predictable under load but require the source (Kafka, SFTP, JMS) to be the durable buffer during outages.
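The on/off, propagating nature of backpressure is easiest to see in a tiny simulation. The Python sketch below models a source → transform → sink flow with two bounded connections; the rates, thresholds, and tick-based scheduling are simplifying assumptions, not NiFi's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    object_threshold: int
    queued: int = 0

    @property
    def backpressure(self) -> bool:
        """True once the queue has reached its object threshold."""
        return self.queued >= self.object_threshold


def tick(q1, q2, source_rate, transform_rate, sink_rate):
    """Run one scheduling round; return how many FlowFiles the source emitted."""
    # Sink drains the second connection.
    q2.queued -= min(sink_rate, q2.queued)
    # Transform is scheduled only if its outgoing connection has room.
    if not q2.backpressure:
        moved = min(transform_rate, q1.queued)
        q1.queued -= moved
        q2.queued += moved
    # Source is scheduled only if its outgoing connection has room.
    if q1.backpressure:
        return 0
    q1.queued += source_rate
    return source_rate


q1 = Connection(object_threshold=10)  # source → transform
q2 = Connection(object_threshold=10)  # transform → sink

stalled = [tick(q1, q2, 5, 5, sink_rate=0) for _ in range(6)]
recovered = [tick(q1, q2, 5, 5, sink_rate=5) for _ in range(2)]

print(stalled)    # [5, 5, 5, 5, 0, 0]
print(recovered)  # [5, 5]
```

While the sink is stalled, the source keeps running at full rate until both queues are full and then stops completely; once the sink drains again, the source resumes at full rate with no manual reset. A rate limiter would instead have capped the source at a fixed rate the whole time.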
Is a NiFi cluster safe out of the box?
A NiFi cluster is safe only once source processors are marked Primary-Node-only; the defaults alone do not prevent double-fetching. The cluster runs the same flow definition on every node, so a processor like ListSFTP would double-fetch each file if allowed to run on every node: one FlowFile per file per node. The fix is Execution Node = Primary Node on every List*, Query*, Consume* (non-consumer-group), and other singleton-shaped source processors. Immediately after such a processor, put a connection with load-balance-strategy = Round Robin to redistribute the listed items across all nodes for parallel downstream processing. Coordinator + primary elections are handled by Zookeeper (external or embedded) with sub-5-second failover. State that must survive node failure goes through the cluster-scoped State Manager (also backed by ZK); node-local state stays fast. The one unsafe pattern is a source processor running on all nodes without a natural sharding scheme, which is how you accidentally 3× your ingestion volume in a 3-node cluster.
Do I need NiFi Registry to run NiFi in production?
Yes — NiFi Registry is non-negotiable in production. Without Registry, every flow lives only inside the NiFi instance's flow.json.gz file; there's no version history, no promotion across environments, no rollback capability, and every prod edit is a "hope it doesn't break" one-shot. Registry deploys as a small Spring Boot service (a single JAR or Docker container), stores versioned snapshots of process groups (buckets → flows → versions), and integrates natively with NiFi's UI ("Version → Start version control"). The canonical CI/CD story: commit from dev to Registry, import the same version on staging with staging parameter context, smoke-test, import on prod with prod parameter context — the flow definition is env-agnostic, the parameter contexts fill in env-specific credentials and endpoints. Rollback is a two-click UI operation or a single CLI command (pg-change-version --flowVersion N). Anyone editing flows directly on prod without Registry is shipping an incident. The engineering cost of adding Registry is a day; the avoided incidents pay for it within a week.
Practice on PipeCode
- Drill the streaming practice library → for the FlowFile + processor routing, backpressure sizing, and per-event lineage problems senior interviewers love.
- Rehearse on the ETL practice library → for the ingestion, validation, enrichment, and land-to-warehouse flows that motivate NiFi in the first place.
- Sharpen the real-time axis with the real-time analytics practice library → for the low-latency, prioritized, TTL-driven dashboard-shaped flows.
- Stack the prerequisites against PipeCode's broader 450+ data-engineering catalogue to anchor the FlowFile + processor + backpressure intuition against real graded inputs.
Lock in NiFi flow-based muscle memory
NiFi docs explain the API. PipeCode drills explain the decision — when transaction-level backpressure trumps a rate limit, when Registry rollback beats a code deploy, when Provenance replay closes an audit ticket in minutes. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the flow-based, integration-heavy, on-prem-plus-cloud production trade-offs senior data engineers actually face.





Top comments (0)