Gowtham Potureddi

Hadoop Interview Questions for Data Engineers: HDFS, YARN, MapReduce

Hadoop interview questions focus on the three layers every Hadoop loop tests: HDFS (the distributed file system: NameNode, DataNodes, blocks, replication, fault tolerance), YARN (the cluster resource manager: ResourceManager, NodeManager, ApplicationMaster, containers), and MapReduce (the historical compute engine: mappers, combiners, partitioners, reducers, shuffle, sort). Whether you're prepping for a financial-services data team or an experienced-hire loop at a legacy enterprise, the same three layers show up, and senior loops add the Hadoop ecosystem (Hive, Pig, HBase, Sqoop, Oozie) plus operational concerns like high availability and Kerberos authentication.

This guide walks through the themes reviewers most often test in data engineering interviews: HDFS architecture with the NameNode-DataNode split, HDFS blocks and replication factor, YARN with the ResourceManager + NodeManager + ApplicationMaster trio, the container lifecycle, MapReduce with the full map → combine → partition → shuffle → sort → reduce pipeline, Hive vs Pig for the high-level abstractions, the Hadoop ecosystem (Sqoop, Oozie, HBase, ZooKeeper), and seven gotchas (small files, single-NameNode SPOF, skewed reducers, partition imbalance, and more) that trip up many candidates. Every section ends as Hadoop interview questions with answers: a short runnable example, a step-by-step trace, and a concept-by-concept breakdown of why it works.


1. Why Hadoop interview questions still matter in 2026

Hadoop is the foundation modern lakehouses inherit — three layers cover the interview surface

The one-sentence invariant: Hadoop interview questions still circle the same three-layer architecture (HDFS for storage, YARN for resource management, MapReduce for compute) because every modern data lake / lakehouse (Spark on YARN, Hive on Tez, Iceberg on S3) inherits one or more of these patterns. Once you know the three layers, every prompt becomes "which layer is the reviewer probing?"

The three Hadoop layers at a glance.

  • HDFS (Hadoop Distributed File System) — distributed storage; files split into blocks; blocks replicated across DataNodes; managed by a NameNode.
  • YARN (Yet Another Resource Negotiator) — cluster resource manager; ResourceManager allocates resources; NodeManagers run containers; ApplicationMasters coordinate one job's containers.
  • MapReduce — historical batch compute engine; map → shuffle → reduce; still asked about even in Spark-first shops.

Why Hadoop still matters in 2026.

  • Storage layer survives: HDFS clusters are still in production at many large enterprises.
  • YARN is the de-facto resource manager for on-prem Spark clusters.
  • The conceptual model — sharded storage, locality-aware compute, replication for fault tolerance — appears in S3, GCS, Snowflake, Databricks, and every cloud-native equivalent.
  • Legacy migrations — many DE roles involve moving workloads off Hadoop to cloud lakehouses.
  • Interview signal — reviewers test whether you understand the distributed-systems primitives the modern stack stands on.

What interviewers listen for.

  • Do you name the NameNode / DataNode roles correctly when asked about HDFS? — basic-but-tested.
  • Do you mention the default block size (128MB) and replication factor (3)? — fluency signal.
  • Do you describe YARN as the resource manager that Spark / Hive / MapReduce all schedule against? — senior signal.
  • Do you contrast MapReduce's disk-write-at-every-stage model with Spark's in-memory model when asked about the difference? — interview-canonical answer.

Hadoop vs Spark: the most-asked comparison.

  • MapReduce spills map output to local disk and writes each job's output to HDFS, so a multi-job pipeline round-trips through disk between stages; Spark keeps intermediate data in memory where it fits.
  • Spark is often cited as 10-100× faster on iterative or multi-stage workloads; the gap is typically smaller for a single-pass batch job.
  • Hadoop's strengths, storage (HDFS) and resource management (YARN), are both still relevant.
  • The modern answer: use HDFS / S3 for storage, YARN / Kubernetes for resource management, Spark for compute; MapReduce is largely superseded.

The five sub-themes senior loops add.

  • HDFS HA — NameNode high availability via standby NameNodes and ZooKeeper.
  • Kerberos — authentication in secure clusters.
  • Hive Metastore — schema registry shared by Hive, Spark, Trino, etc.
  • Sqoop / Flume — legacy ingest tools; mostly replaced by Kafka / managed connectors.
  • Oozie — workflow orchestration; mostly replaced by Airflow.

Worked example — a Hadoop-style data pipeline at a glance

Detailed explanation. A realistic legacy Hadoop pipeline combines all three layers: files land in HDFS, YARN schedules the Spark / MapReduce job, the job reads from HDFS, transforms, and writes back to HDFS. Even modern Spark-on-YARN jobs follow this exact shape.

Question. A 1TB sales CSV lands in HDFS daily. Compute per-region revenue and write the result back to HDFS.

Pipeline shape (Hadoop ecosystem).

Ingest    : Sqoop / NiFi / Kafka  →  HDFS  (raw zone, /raw/sales/dt=2026-05-23/)
Process   : Spark on YARN  reads HDFS, transforms, writes HDFS (curated zone, /curated/revenue/dt=2026-05-23/)
Catalog   : Hive Metastore registers /curated/revenue/ as a partitioned table
Serve     : Hive / Trino / Spark SQL queries the curated table
Orchestrate: Oozie or Airflow runs this every day

Step-by-step explanation.

  1. HDFS storage: the 1TB CSV is split into roughly 8,000 blocks of 128MB each (7,451 for a decimal terabyte, 8,192 for a binary tebibyte), and each block is replicated on 3 DataNodes.
  2. YARN scheduling — Spark requests N executors from YARN's ResourceManager; YARN allocates containers on the right NodeManagers.
  3. Compute — Spark reads HDFS with data locality (executor on the same node as one of the block replicas), transforms, and writes back to HDFS.
  4. Hive Metastore — the curated table's location and schema are registered; downstream tools see it as a queryable table.
  5. Orchestration — Oozie / Airflow triggers the pipeline daily.

Rule of thumb: Modern jobs still touch all three Hadoop layers (HDFS for storage, YARN for resources, the compute engine of choice for processing); knowing how they interact is the senior-signal interview answer.
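
The block arithmetic in step 1 can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function name `hdfs_footprint` is made up for illustration, and it assumes the 128MB block size and replication factor 3 quoted in this guide.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (block_count, raw_bytes_on_disk) for a single file."""
    # Integer ceiling division: a partly filled last block still counts as a block.
    blocks = -(-file_bytes // block_bytes)
    # The last block only occupies its real length, so raw usage is size x replication.
    return blocks, file_bytes * replication

decimal_tb = hdfs_footprint(10**12)    # 1 TB  = 10^12 bytes
binary_tib = hdfs_footprint(1024**4)   # 1 TiB = 2^40 bytes
print(decimal_tb[0], binary_tib[0])
```

Either way the interview answer is "about 8,000 blocks and about 3TB of raw disk"; the exact count depends only on which definition of a terabyte the interviewer has in mind.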


2. HDFS architecture — NameNode, DataNode, blocks, replication

Figure: HDFS architecture. A NameNode sits above three DataNodes; each DataNode holds three block tiles, and lines connect the copies of each block across DataNodes to show 3× replication.

HDFS architecture: NameNode for metadata, DataNodes for blocks, 3× replication for fault tolerance

HDFS architecture is the most-asked Hadoop interview prompt. The senior answer in one sentence: HDFS uses a master-slave model in which the NameNode holds the filesystem metadata (block locations, file tree, permissions) in memory, DataNodes store the actual block data on local disks, and every block is replicated 3× by default across different DataNodes for fault tolerance.

The NameNode — the master.

  • Stores metadata in memory — file tree, file → block mapping, block → DataNode mapping, permissions.
  • Single instance (without HA) — historical single point of failure.
  • Edit log + FsImage: both persisted on disk; the edit log records every namespace mutation, and the FsImage is a periodic checkpoint of the whole namespace.
  • Secondary NameNode — NOT a hot standby; it periodically merges edits into FsImage to keep the edits log small.
  • HA setup — active + standby NameNode with ZooKeeper-based failover; eliminates SPOF.

DataNodes — the workers.

  • Store HDFS blocks on local disk (one folder per data drive).
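  • Heartbeat to the NameNode: every 3 seconds by default (dfs.heartbeat.interval); a full block report follows far less often, every 6 hours by default in current releases (dfs.blockreport.intervalMsec).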
  • Block report — full list of blocks stored on the DataNode.
  • Pipeline replication — when a block is written, the client sends to one DataNode, which streams to the next, which streams to the next.

Blocks.

  • Default block size — 128MB (since Hadoop 2.x); previously 64MB.
  • Configurable: dfs.blocksize; can be set to 256MB or 512MB for huge files.
  • Why large blocks? — minimise NameNode metadata, maximise sequential disk reads, reduce seek overhead.
  • Last block: may be smaller than the configured size; it occupies only as much disk as the data it actually holds.

Replication factor.

  • Default — 3.
  • Configurable per file: dfs.replication=3 is the cluster default; override it per file (for example with hadoop fs -setrep).
  • Rack-aware placement — by default, one replica on the local rack, two on a different rack; minimises rack-failure risk.
  • HDFS auto-recovery — if a DataNode fails, the NameNode notices missing replicas and tells other DataNodes to copy blocks to restore the replication factor.
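
The auto-recovery bullet can be sketched as a toy model of the NameNode's block map. This is not how the NameNode is implemented; the node names, the `handle_datanode_loss` helper, and the "pick the alphabetically first candidate" rule are all assumptions chosen to keep the example deterministic.

```python
REPLICATION = 3

# block id -> set of DataNodes holding a replica (a stand-in for the NameNode's block map)
block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
    "blk_3": {"dn1", "dn3", "dn4"},
}

def handle_datanode_loss(block_map, dead, live_nodes, replication=REPLICATION):
    """Drop the dead node's replicas and plan copies that restore the factor."""
    plan = []
    for block, holders in sorted(block_map.items()):
        holders.discard(dead)
        if not holders:
            continue  # every replica is gone: nothing left to copy from
        missing = replication - len(holders)
        candidates = sorted(n for n in live_nodes if n not in holders)
        source = sorted(holders)[0]  # any surviving replica can serve as the source
        for target in candidates[:max(missing, 0)]:
            plan.append((block, source, target))
            holders.add(target)
    return plan

plan = handle_datanode_loss(block_map, "dn3", live_nodes={"dn1", "dn2", "dn4", "dn5"})
for block, source, target in plan:
    print(f"{block}: copy {source} -> {target}")
```

A real cluster also weighs rack placement and node load when choosing targets; the sketch only shows the detect-then-copy loop.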

HDFS architecture interview question: the canonical answer template.

  • "HDFS uses a master-slave architecture. The NameNode holds metadata in memory — file tree, block locations, permissions. DataNodes store the actual blocks on local disks. Each file is split into 128MB blocks by default, and each block is replicated 3× across different DataNodes for fault tolerance, with rack-aware placement so we tolerate rack failures. The Secondary NameNode merges edits logs into FsImage snapshots; the modern HA setup uses an active + standby NameNode with ZooKeeper-based failover."

HDFS commands you'd use in a script.

hadoop fs -ls /user/data/                     # list files
hadoop fs -mkdir /user/data/output            # create directory
hadoop fs -put local.csv /user/data/raw/      # upload
hadoop fs -get /user/data/output/result.csv . # download
hadoop fs -cat /user/data/raw/local.csv       # cat
hadoop fs -du -h /user/data/                  # disk usage
hadoop fs -rm /user/data/raw/local.csv        # delete
hdfs dfsadmin -report                         # cluster health

Federation and HA — the senior topics.

  • HDFS Federation — multiple independent NameNodes share the DataNode pool; scales beyond a single NameNode's memory.
  • HA (high availability): an active NameNode writes edits to a JournalNode quorum and a standby NameNode tails them to stay current; failover takes seconds.
  • ZooKeeper Failover Controller (ZKFC) — monitors NameNode health and triggers failover.

HDFS quirks worth knowing.

  • Write-once, append-only — HDFS doesn't support random writes; append() is supported since Hadoop 0.21.
  • Strong consistency on close — a file is fully visible only after the writer closes it.
  • No real rm recovery: the Trash feature gives only a soft delete (configurable retention), and -skipTrash bypasses it.
  • Small files problem: many tiny files balloon NameNode memory (every file and every block is a separate in-memory object); see §7.


3. HDFS read and write paths — how files actually move

HDFS read path and HDFS write path: the senior-fluency follow-ups

The HDFS read path and write path are the second-level HDFS questions senior loops ask once the architecture answer is solid. Knowing the exact step-by-step flow demonstrates real systems thinking.

HDFS write path — step by step.

  • Step 1 — Client calls create() on HDFS Java client.
  • Step 2 — Client asks NameNode for the first block's locations; NameNode picks 3 DataNodes (rack-aware) and returns them.
  • Step 3: Client streams data in packets to DataNode-1 (closest); DataNode-1 forwards to DataNode-2; DataNode-2 forwards to DataNode-3 (pipeline replication). Acknowledgements travel back up the pipeline to the client.
  • Step 4 — When the block is full (128MB), client requests the next block's locations.
  • Step 5 — Repeat until file is fully written.
  • Step 6 — Client calls close(); NameNode commits the file to the namespace.

Why pipeline replication?

  • Network efficiency — each DataNode sends to one other DataNode, not the client sending to all 3.
  • Client bandwidth — the client uploads only once; the cluster handles replication internally.
  • Throughput: many packets are in flight along one pipeline at the same time, and different files write through different pipelines concurrently.
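
The network argument is easy to see in a toy simulation. The sketch below only models who sends to whom; `pipeline_write` and the node names are invented for illustration and do not correspond to any HDFS client API.

```python
def pipeline_write(packet, pipeline):
    """Forward one packet down the DataNode chain, then ack back upstream."""
    stored = {}
    transfers = []
    sender = "client"
    for node in pipeline:
        transfers.append((sender, node))  # each hop sends the packet exactly once
        stored[node] = packet
        sender = node
    acks = list(reversed(pipeline))       # acks travel from the tail back toward the client
    return stored, transfers, acks

stored, transfers, acks = pipeline_write(b"packet-0", ["dn1", "dn2", "dn3"])
client_sends = sum(1 for src, _ in transfers if src == "client")
print(transfers, acks, client_sends)
```

Three replicas end up on disk while the client uploads the packet once; the other two copies travel over DataNode-to-DataNode links.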

HDFS read path — step by step.

  • Step 1 — Client calls open() on HDFS Java client.
  • Step 2 — Client asks NameNode for the file's block locations; NameNode returns the ordered block list with all replica DataNodes per block.
  • Step 3 — Client connects directly to the nearest DataNode (by network topology) for each block.
  • Step 4 — Read block bytes; verify checksum.
  • Step 5 — Repeat for each block; client transparently switches to a different replica if the chosen DataNode is slow or fails.
  • Step 6 — Client calls close().
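
Steps 3 to 5 boil down to "try the closest replica, fall back on failure". Here is a minimal sketch of that loop, assuming the conventional topology distances (0 for the same node, 2 for the same rack, 4 for another rack); `read_block`, `fake_fetch`, and the node names are hypothetical.

```python
def read_block(replicas, distance, fetch):
    """Try replicas nearest-first; move to the next one when a read fails."""
    for node in sorted(replicas, key=distance):
        try:
            return node, fetch(node)
        except IOError:
            continue  # dead or corrupt replica: try the next-closest one
    raise IOError("no live replica for this block")

# Assumed distances from the reading client to each replica holder.
DISTANCE = {"dn1": 0, "dn2": 2, "dn3": 4}

def fake_fetch(node):
    """Stand-in for a DataNode read; dn1 is simulated as down."""
    if node == "dn1":
        raise IOError("dn1 is down")
    return b"block-bytes"

node, data = read_block(["dn3", "dn1", "dn2"], DISTANCE.get, fake_fetch)
print(node, data)
```

The caller never sees the failed attempt on the closest node, which is the "transparent" switch described in step 5.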

Data locality — Hadoop's key performance trick.

  • Compute on the same node as the data — Spark / MapReduce schedule tasks on a node that hosts a replica of the input block.
  • Locality levels: NODE_LOCAL > RACK_LOCAL > ANY.
  • The application (Spark / MapReduce) passes locality preferences to YARN in its container requests; YARN honours them when it can.
  • In cloud (S3) — locality is mostly lost; compute and storage are decoupled.

Checksum verification.

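  • Every block replica on a DataNode has a companion .meta file that stores its checksums.
  • Clients verify checksums on every read, which also defends against on-the-wire corruption; a corrupt replica is reported to the NameNode and re-replicated from a healthy DataNode.
  • DataNodes verify too: on write, and in a periodic background block scan that catches silent disk corruption.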
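
Chunked checksum verification looks roughly like the sketch below. It is a simplification: HDFS defaults to CRC32C, while this example uses `zlib.crc32` from the standard library as a stand-in, and the helper names are made up. The 512-byte chunk size mirrors the usual `dfs.bytes-per-checksum` default.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size, matching the common HDFS default

def checksums(data, chunk=BYTES_PER_CHECKSUM):
    """One CRC per fixed-size chunk of the block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def first_corrupt_chunk(data, expected, chunk=BYTES_PER_CHECKSUM):
    """Return the index of the first chunk whose CRC mismatches, or None."""
    for index, (got, want) in enumerate(zip(checksums(data, chunk), expected)):
        if got != want:
            return index
    return None

block = bytes(range(256)) * 8      # 2048 bytes -> 4 chunks
stored = checksums(block)          # what the writer recorded

damaged = bytearray(block)
damaged[1030] ^= 0xFF              # flip bits inside chunk 2 (bytes 1024-1535)

print(first_corrupt_chunk(block, stored), first_corrupt_chunk(bytes(damaged), stored))
```

Checksumming small chunks rather than whole blocks means a reader can detect corruption early and a partial read only has to verify the chunks it touches.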

Block placement strategy.

  • Default rack-aware: first replica on the local node (if the writer runs on a DataNode), second on a node in a different rack, third on a different node in that same remote rack.
  • Goals — minimise inter-rack traffic on write, maximise tolerance to rack failure.
  • Pluggable: BlockPlacementPolicy can be customised (e.g. spread by storage tier).
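
The default policy can be sketched in a few lines. This is an illustration, not the real BlockPlacementPolicy: `place_replicas` and the topology layout are hypothetical, and the sketch assumes at least one remote rack with two or more nodes and ignores load and free space.

```python
import random

def place_replicas(writer, topology, rng=random):
    """topology maps rack -> list of nodes. Returns [first, second, third]."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    # First replica: the writer's own node when the writer is a DataNode.
    first = writer if writer in rack_of else rng.choice(sorted(rack_of))
    # Second and third: two different nodes on one rack other than the first's.
    remote_racks = sorted(
        rack for rack, nodes in topology.items()
        if rack != rack_of[first] and len(nodes) >= 2
    )
    remote = rng.choice(remote_racks)
    second, third = rng.sample(sorted(topology[remote]), 2)
    return [first, second, third]

topology = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2", "b3"]}
replicas = place_replicas("a1", topology, random.Random(0))
print(replicas)
```

Losing all of rackA still leaves two replicas, losing all of rackB leaves one, and only a single copy of the block crosses the inter-rack link during the write.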

HDFS short-circuit reads.

  • dfs.client.read.shortcircuit=true — when the client and the DataNode are on the same machine, the client reads directly from the local file system, bypassing the DataNode process.
  • Faster — eliminates the DataNode hop for local reads; significant on Spark-on-YARN clusters where executors often co-locate with DataNodes.

Erasure coding — the storage-efficient alternative.

  • 3× replication uses 3× storage; erasure coding (EC) uses ~1.5× for similar fault tolerance.
  • Trade-off — EC blocks are reconstructed from parity blocks on failure, which is CPU-heavy.
  • Common policies: RS-6-3-1024k (6 data blocks + 3 parity, 1024KB cell size).
  • Use case — cold data; hot data still gets 3× replication.
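
The 3× versus ~1.5× figures follow directly from the ratio of stored units to data units. A quick check, with `storage_overhead` as an illustrative helper name:

```python
def storage_overhead(data_units, parity_units):
    """Raw bytes stored per byte of user data."""
    return (data_units + parity_units) / data_units

replication_3x = storage_overhead(1, 2)   # one original plus two extra copies
rs_6_3 = storage_overhead(6, 3)           # six data cells plus three parity cells

print(replication_3x, rs_6_3)
```

RS-6-3 can lose any three of its nine units and still rebuild the data, while 3× replication survives the loss of two copies, so the erasure-coded layout stores half as many bytes without giving up fault tolerance.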


4. YARN architecture — ResourceManager, NodeManager, ApplicationMaster

Figure: YARN architecture. A ResourceManager sits above three NodeManagers, each with one Container slot and one ApplicationMaster slot; arrows show resource requests, allocations, and heartbeats.

YARN in Hadoop: ResourceManager allocates, NodeManagers run, ApplicationMasters coordinate

YARN is the second most-asked Hadoop interview prompt. The senior answer: YARN is the cluster resource manager that Spark, Hive, Tez, and MapReduce all schedule against; it separates resource management (RM) from per-job coordination (AM).

ResourceManager (RM) — the cluster master.

  • Scheduler — decides which application gets resources next; pluggable (FIFO, Capacity, Fair).
  • ApplicationsManager — accepts job submissions, launches the first container (which becomes the ApplicationMaster).
  • Single instance (without HA); HA setup uses ZooKeeper-based active/standby.
  • Receives heartbeats from every NodeManager every few seconds.

NodeManager (NM) — the per-node worker.

  • Runs on every cluster node — manages local resources (CPU, RAM, disk).
  • Launches containers on request from the RM.
  • Monitors container resource usage — kills containers that exceed their limits.
  • Heartbeats to the RM — reports available resources every few seconds.

ApplicationMaster (AM) — the per-job coordinator.

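  • One per running application: in cluster deploy mode the Spark driver runs inside the AM; in client mode the driver stays on the submitting machine and the AM only negotiates executors.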
  • Negotiates containers from the RM based on the job's needs.
  • Coordinates the application's containers — distributes tasks, handles failures.
  • Talks back to the RM for additional containers and to the client for status.

Container — the unit of resource allocation.

  • One container = a slice of memory and CPU (vcores) on one NodeManager; other resource types are optional.
  • Each task runs in a container: Spark executor = one container, MapReduce mapper / reducer = one container.
  • yarn.scheduler.minimum-allocation-mb / yarn.scheduler.maximum-allocation-mb: min / max container size.
  • Containers are released when their process exits: MapReduce task containers are short-lived, while Spark executors usually live for the whole application.
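
The min / max settings also shape the size a container actually gets. The sketch below assumes the commonly described Capacity Scheduler behaviour, where a memory request is rounded up to a multiple of the minimum allocation and rejected above the maximum; treat that rounding rule as an assumption (the Fair Scheduler uses a separate increment setting). `normalize_request` is a made-up name, and 1024 / 8192 are the stock defaults for the two properties.

```python
def normalize_request(requested_mb, min_mb=1024, max_mb=8192):
    """Round a memory request up to a multiple of the minimum allocation."""
    if requested_mb > max_mb:
        raise ValueError(f"{requested_mb} MB exceeds the {max_mb} MB maximum allocation")
    steps = max(1, -(-requested_mb // min_mb))  # ceiling division, at least one step
    return steps * min_mb

for request in (100, 1500, 4096):
    print(request, "->", normalize_request(request))
```

This is why a 1.5GB request can occupy 2GB of a NodeManager: sizing requests to land on a multiple of the minimum avoids paying for memory the task never uses.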

YARN schedulers.

  • FIFO — first-in, first-out; rarely used; one big job blocks everything.
  • Capacity Scheduler — multiple queues with reserved capacity; each tenant gets its share.
  • Fair Scheduler: every running app gets a fair share over time; with preemption enabled it can reclaim containers from apps running above their share.
  • Default in modern Hadoop — Capacity Scheduler.

Spark on YARN — the canonical interaction.

  • spark-submit --master yarn --deploy-mode cluster — Spark driver runs as an AM in a YARN container.
  • Spark driver requests executor containers from the RM via the AM.
  • Each Spark executor runs in one YARN container on some NodeManager.
  • YARN handles fault tolerance — if a NodeManager dies, executor containers on it are killed; Spark requests replacements.

YARN vs Kubernetes: the modern comparison.

  • YARN — Hadoop-native; on-prem clusters; Capacity / Fair scheduling.
  • Kubernetes — cloud-native; multi-workload; Spark on K8s is increasingly common.
  • Modern Spark supports both: --master yarn or --master k8s://....
  • Mesos — legacy; mostly deprecated.

YARN HA.

  • Active + standby RM with ZooKeeper-based leader election.
  • Embedded vs external state store — ZooKeeper or file-based.
  • Application state is persisted so a failed-over RM can recover.

Useful YARN commands.

yarn application -list                   # running apps
yarn application -kill application_id    # kill an app
yarn node -list                          # NodeManager status
yarn rmadmin -refreshQueues              # reload queue config
yarn logs -applicationId app_id          # fetch logs


5. MapReduce execution — map, combine, partition, shuffle, sort, reduce

Figure: the MapReduce pipeline. Stages run left to right (input split → map → combine → partition → shuffle → sort → reduce → output), each with a small example card showing key/value pairs flowing through.

MapReduce in Hadoop: six stages from input split to output

MapReduce is Hadoop's historical compute engine. Spark has largely replaced it for production work, but MapReduce interview questions still come up in Hadoop loops because the conceptual pipeline (map → shuffle → reduce) underpins the distributed compute engines that followed.

The full MapReduce pipeline — six stages.

  • 1. Input split — input file divided into logical splits, typically one per HDFS block.
  • 2. Map — user-defined map(key, value) function emits (intermediate_key, intermediate_value) pairs per input record.
  • 3. Combine (optional) — local pre-aggregation on the map side; reduces shuffle volume; runs the reduce logic locally.
  • 4. Partition: the Partitioner decides which reducer each intermediate key goes to; default is HashPartitioner.
  • 5. Shuffle and sort — intermediate data moved across the network; reducer-side merge-sort by key.
  • 6. Reduce — user-defined reduce(key, values_iterator) produces final output; written to HDFS.

MapReduce word count: the canonical example.

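# Word count mapper and reducer, written as plain Python generators
def map_fn(line):
    # Emit (word, 1) for every whitespace-separated word in the line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # counts holds every 1 emitted for this word; emit the total
    yield word, sum(counts)
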
  • Map output — many (word, 1) pairs.
  • Partition — words hashed to N reducers.
  • Shuffle + sort — each reducer sees its assigned words with all their 1s grouped together.
  • Reduce — sum the 1s per word.
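
The four bullets above can be traced end to end in a single process. This is a teaching sketch rather than Hadoop code: `run_job` is a hypothetical driver, and the partition function (sum of the word's bytes modulo the reducer count) is a deterministic stand-in for a real hash partitioner.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the 1s collected for one word."""
    return word, sum(counts)

def run_job(lines, num_reducers=2):
    # Map + partition: route each (word, 1) pair to one reducer's bucket.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for word, one in map_fn(line):
            reducer = sum(word.encode()) % num_reducers
            buckets[reducer][word].append(one)
    # Shuffle + sort happened implicitly: each bucket groups values by key.
    output = []
    for bucket in buckets:
        for word in sorted(bucket):      # reducers see their keys in sorted order
            output.append(reduce_fn(word, bucket[word]))
    return output

result = run_job(["to be or not to be", "to do"])
print(result)
```

Every occurrence of a given word lands in the same bucket, which is the guarantee the partitioner provides and the reason a reducer can compute a complete count on its own.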

Combiner in MapReduce: local pre-aggregation.

  • Same signature as reducer: combine(key, values_iterator) runs on the map side.
  • Cuts shuffle volume — map emits (word, 1) 1000 times; combiner converts to (word, 1000) before shuffle.
  • Must be associative + commutative — works for SUM, MIN, MAX, COUNT; doesn't work for AVG (need to track sum + count separately).
  • Not guaranteed to run — the framework may skip it.
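
The AVG caveat is worth seeing with numbers. In this sketch the combiner emits (sum, count) pairs instead of averages, so partial results merge correctly no matter how records are split across mappers; the function names and sample data are illustrative only.

```python
def combine(pairs):
    """Map-side pre-aggregation: fold values into (sum, count) per key."""
    partial = {}
    for key, value in pairs:
        running_sum, running_count = partial.get(key, (0, 0))
        partial[key] = (running_sum + value, running_count + 1)
    return partial

def reduce_avg(partials):
    """Merge (sum, count) partials from all mappers, then divide once."""
    total = {}
    for partial in partials:
        for key, (part_sum, part_count) in partial.items():
            running_sum, running_count = total.get(key, (0, 0))
            total[key] = (running_sum + part_sum, running_count + part_count)
    return {key: s / c for key, (s, c) in total.items()}

mapper_a = combine([("us", 10), ("us", 20), ("eu", 5)])
mapper_b = combine([("us", 60)])
averages = reduce_avg([mapper_a, mapper_b])
print(averages)
```

Averaging the two per-mapper averages for "us" (15 and 60) would give 37.5, while the true mean of 10, 20, and 60 is 30. Carrying sum and count keeps the combiner associative and commutative, and because the reducer accepts the same (sum, count) shape, the result is also correct if the framework skips the combiner.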

Partitioner in MapReduce: controlling reducer assignment.

  • HashPartitioner (default) — (key.hashCode() & Integer.MAX_VALUE) % numReducers.
  • Custom Partitioner — implement getPartition(key, value, numReduceTasks) for special routing (e.g. send "US" rows to reducer 0).
  • Use case — handling skew, custom partitioning by date, joining by composite keys.
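
The HashPartitioner formula can be reproduced outside the JVM. The sketch below assumes keys hash the way `java.lang.String.hashCode()` does; Hadoop's own `Text` type computes its hash differently, so the reducer numbers here are illustrative rather than what a real job would produce. `custom_partition` is a hypothetical routing rule in the spirit of the "US" example above.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 2**32 if h >= 2**31 else h

def hash_partition(key, num_reducers):
    """(hashCode & Integer.MAX_VALUE) % numReducers"""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

def custom_partition(key, num_reducers):
    """Pin the hot key "US" to reducer 0 and hash everything else over the rest."""
    if key == "US":
        return 0
    return 1 + hash_partition(key, num_reducers - 1)  # needs num_reducers >= 2

print(hash_partition("hello", 4), custom_partition("US", 4), custom_partition("hello", 4))
```

Masking with Integer.MAX_VALUE clears the sign bit, so a negative hash code can never produce a negative partition index.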

Shuffle and sort.

  • Map output is partitioned + spilled to local disk as the map task runs.
  • Reducer fetches its assigned partition from every mapper (HTTP pull).
  • Merge-sort on the reducer side combines streams from all mappers into one sorted stream per key.
  • Disk-heavy — every intermediate stage hits disk (this is why Spark is faster).

Speculative execution.

  • Slow tasks get re-launched on different nodes; whichever finishes first wins.
  • Combats stragglers — slow disks, hot nodes, hardware variation.
  • Controlled by: mapreduce.map.speculative / mapreduce.reduce.speculative.

MapReduce vs Spark: the canonical comparison.

  • Disk writes: MapReduce spills map output to local disk and writes each job's output to HDFS; Spark keeps intermediate data in memory where it fits.
  • Latency: MapReduce pays job-setup and per-task JVM launch costs on every job; Spark reuses long-lived executors, so tasks start in milliseconds once the application is running.
  • Iteration: MapReduce reloads from HDFS each iteration; Spark keeps data in memory.
  • DAG: MapReduce is two-stage (map + reduce); Spark supports arbitrary DAGs.
  • Modern stack: MapReduce is largely superseded by Spark, Tez, or Flink.

When MapReduce still wins.

  • Very large one-shot batch jobs where the disk writes give safe checkpointing.
  • Memory-constrained clusters that can't hold the working set.
  • Legacy code bases with deep MapReduce investment.


6. Hive, Pig, and the broader Hadoop ecosystem

Figure: the Hadoop ecosystem as a layered stack. HDFS at the bottom, YARN in the middle, and compute engines (Spark, Hive, Tez, MapReduce) plus tools (Sqoop, Oozie, Hive Metastore, HBase) on top, each labelled with its role.

Hive vs Pig and the rest of the Hadoop ecosystem you must name

The Hadoop ecosystem has dozens of projects. Senior hadoop interview questions test fluency with the major ones — what each does, when to reach for it, and how they integrate.

Hive — SQL on Hadoop.

  • Hive Query Language (HQL) — SQL-like dialect that compiles to MapReduce, Tez, or Spark.
  • Hive Metastore — schema registry; tables map to HDFS paths.
  • Use case — batch SQL over HDFS / S3; ETL pipelines; ad-hoc analytics.
  • Modern replacement — Spark SQL, Trino, Presto all read the Hive Metastore directly.
  • HQL still widely used for legacy ETL and reporting.

Pig — dataflow scripting.

  • Pig Latin — procedural data-flow language; compiles to MapReduce.
  • Use case — ETL scripts where SQL feels awkward (complex multi-step transforms).
  • Less common in modern stacks — Spark DataFrames replaced most Pig use cases.

Hive vs Pig: the decision table.

Aspect | Hive | Pig
Language | SQL-like (HQL) | Procedural (Pig Latin)
User type | Analyst / Engineer | Engineer
Use case | Batch SQL queries | Multi-step ETL scripts
Schema | Required (Metastore) | Optional
Modern status | Still common via Metastore | Mostly deprecated

HBase — wide-column NoSQL.

  • Distributed, scalable column-family store on top of HDFS.
  • Use case — random reads / writes over very large datasets (Hive is batch-only).
  • Modeled after Google's Bigtable.
  • Modern alternatives — DynamoDB, Cassandra, Bigtable.

Sqoop — relational ↔ HDFS bulk transfer.

  • sqoop import — pulls data from MySQL / PostgreSQL / Oracle into HDFS or Hive.
  • sqoop export — pushes data from HDFS back to a RDBMS.
  • Parallelised via MapReduce — splits the table by primary-key range.
  • Modern replacements — Spark JDBC connectors, Airbyte, Fivetran.

Flume / Kafka — streaming ingest.

  • Flume — legacy log-ingest tool with source / channel / sink architecture.
  • Kafka — modern distributed log; standard for streaming ingest.
  • Pattern — producers → Kafka topics → consumers (Spark Streaming, Flink, custom).

Oozie — workflow orchestration.

  • XML-based workflow definitions for chaining Hadoop jobs (MapReduce, Hive, Pig, Spark).
  • Scheduler + coordinator — time-based or data-availability-based triggers.
  • Modern replacements — Airflow, Dagster, Prefect; all Python-first.

ZooKeeper — coordination service.

  • Distributed coordination for HA, leader election, configuration.
  • Used by — Kafka, HBase, HDFS HA, YARN HA, Storm, Hive Server.
  • Not directly an ETL tool — but every Hadoop production cluster depends on it.

Ambari / Cloudera Manager — cluster management.

  • Web UIs and APIs for managing Hadoop clusters.
  • Ambari — Apache, open source.
  • Cloudera Manager — proprietary; bundled with CDP.

The modern Hadoop ecosystem.

  • Storage — HDFS, S3, GCS, Azure Blob.
  • Resource management — YARN, Kubernetes.
  • Compute: Spark, Trino, Flink (MapReduce is legacy).
  • SQL engines — Hive (on Tez / Spark), Trino, Presto, Spark SQL.
  • Streaming — Kafka + Spark Structured Streaming / Flink.
  • Catalog — Hive Metastore (still ubiquitous), AWS Glue, Iceberg / Delta / Hudi for table format.
  • Orchestration — Airflow / Dagster / Prefect (Oozie is legacy).


7. Hadoop interview gotchas — small files, SPOF, skew, partition imbalance

Seven gotchas every Hadoop interview tests

Hadoop has a small surface area but a long tail of operational gotchas that reviewers love to probe.

Gotcha 1 — Small-files problem.

  • The bug: many tiny files inflate NameNode memory. Each file, directory, and block is a separate in-memory object of roughly 150 bytes, so 10M single-block files means about 20M objects, or ~3GB of NameNode heap just for metadata.
  • Symptom — NameNode OOM, slow ls, slow startup.
  • Fix — coalesce small files via Hadoop Archive (HAR), SequenceFile, or HDFS compaction tools; in Spark, coalesce(N) before writing.
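
A rough heap estimate makes the case for compaction. The sketch uses the ~150-bytes-per-object rule of thumb and counts the block object as well as the file entry, which doubles the per-file cost compared with counting files alone; `namenode_heap_bytes` is an illustrative helper, and the 1MB file size is an assumption picked for the comparison.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb heap cost per namespace object

def namenode_heap_bytes(files, blocks_per_file=1):
    """Estimate metadata heap: one object per file plus one per block."""
    objects = files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

small = namenode_heap_bytes(10_000_000)   # 10M files, one block each

# The same data, if each small file were 1MB, packed into full 128MB files:
packed_files = 10_000_000 // 128          # 78,125 files of one block each
packed = namenode_heap_bytes(packed_files)

print(f"{small / 1e9:.1f} GB vs {packed / 1e6:.1f} MB")
```

Compacting cuts the metadata footprint by the same factor of 128 that it cuts the file count, which is why HAR, SequenceFile, or a coalesce step before writing helps so much.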

Gotcha 2 — NameNode single point of failure (SPOF).

  • The bug — pre-HA Hadoop had only one NameNode; its failure took the entire cluster down.
  • Symptom — cluster unavailable while NameNode is restarted from FsImage + edits log.
  • Fix — HA setup with active + standby NameNode + JournalNodes + ZKFC.

Gotcha 3 — Reducer skew on MapReduce.

  • The bug — one key has 90% of the map output; one reducer processes it all while others sit idle.
  • Symptom — job's tail "drags"; one task with 100× more records than peers.
  • Fix — custom Partitioner that spreads the skewed key, salt the key + post-aggregate, or migrate to Spark with AQE skew handling.
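
The "salt the key + post-aggregate" fix works in two stages. Below is a minimal sketch with invented helper names: stage one spreads the hot key across several salted keys (and therefore several reducers), and stage two strips the salt and merges the partial counts.

```python
from collections import Counter

def salted_key(key, record_id, hot_keys, salts=4):
    """Append a salt to hot keys so their records spread over several reducers."""
    if key in hot_keys:
        return f"{key}#{record_id % salts}"
    return key

def strip_salt(key):
    """Recover the original key for the second aggregation stage."""
    return key.split("#", 1)[0]

records = [("US", i) for i in range(8)] + [("FR", 8), ("DE", 9)]

# Stage 1: aggregate per salted key; "US" is now four keys instead of one.
stage1 = Counter(salted_key(key, rid, hot_keys={"US"}) for key, rid in records)

# Stage 2: strip the salt and merge the partial aggregates.
stage2 = Counter()
for key, count in stage1.items():
    stage2[strip_salt(key)] += count

print(dict(stage1), dict(stage2))
```

The extra stage costs a second, much smaller shuffle, and the technique only works for aggregations that can be merged from partial results. The sketch also assumes real keys never contain the "#" separator.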

Gotcha 4: DataNode storage imbalance in HDFS (the "partition imbalance" question).

  • The bug — DataNodes have uneven free space; one DataNode is 95% full while another is 20% full.
  • Symptom — writes start failing on the full node before the cluster is actually full.
  • Fix — run the HDFS balancer (hdfs balancer -threshold 10) to redistribute blocks.

Gotcha 5 — YARN container OOM-killed.

  • The bug — task uses more memory than its container's limit; NodeManager kills it.
  • Symptom: "Container killed on request. Exit code 143" in the logs; task retries; job slows.
  • Fix — bump container size (mapreduce.map.memory.mb or spark.executor.memoryOverhead) or reduce per-task memory footprint.
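
For Spark on YARN, the container has to hold the executor heap plus its overhead. The sketch assumes Spark's documented default for `spark.executor.memoryOverhead`, the larger of 384MB and 10% of executor memory; `executor_container_mb` is a made-up helper, and YARN may round the result up further.

```python
def executor_container_mb(executor_memory_mb, overhead_mb=None):
    """Memory Spark asks YARN for per executor: heap plus off-heap overhead."""
    if overhead_mb is None:
        # Assumed default: max(384 MB, 10% of spark.executor.memory)
        overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead_mb

print(executor_container_mb(2048))         # small executor: the 384MB floor applies
print(executor_container_mb(4096))         # 10% rule applies
print(executor_container_mb(4096, 1024))   # explicit overhead override
```

If the total exceeds what the NodeManager or `yarn.scheduler.maximum-allocation-mb` allows, the request is refused; if the process outgrows the container at run time, the NodeManager kills it. Raising the overhead is the usual first fix when the heap itself is not full.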

Gotcha 6: HDFS write fails with a broken pipeline.

  • The bug: one of the 3 DataNodes in the write pipeline dies mid-write; the client attempts pipeline recovery but eventually fails.
  • Symptom: pipeline recovery failed on the write, or BlockMissingException when a later read finds no live replica.
  • Fix: usually transient; check DataNode health (and the NodeManager on the same host); ensure the write replication factor is no larger than the cluster's live DataNode count.

Gotcha 7 — Hive partition explosion.

  • The bug — over-partitioned Hive table (e.g. partition by customer_id) creates millions of tiny partitions in the Metastore; queries take minutes to plan.
  • Symptom: SHOW PARTITIONS t takes forever; query planning slow.
  • Fix — partition by low-cardinality, time-based columns (date, month); never partition by high-cardinality columns.

Operational checklists.

  • NameNode health: hdfs dfsadmin -report.
  • Block under-replication: hdfs fsck / -files -blocks -locations.
  • DataNode disk usage: hdfs dfsadmin -report shows per-DataNode capacity.
  • YARN container status: yarn application -list -appStates RUNNING.
  • MapReduce job history: mapred job -list + JobHistoryServer UI.


Choosing the right Hadoop layer (cheat sheet)

A one-screen cheat sheet for the most-asked Hadoop interview patterns.

You want to … | Hadoop layer / tool | Notes
Store a large file | HDFS | 128MB blocks; 3× replication
Avoid NameNode SPOF | HDFS HA | Active + standby + JournalNodes
Find block locations | hdfs fsck / -files -blocks -locations | Verify replication
Schedule a Spark job | spark-submit --master yarn --deploy-mode cluster | Driver runs as AM
Allocate cluster capacity per team | YARN Capacity Scheduler | Queues with reserved capacity
Run a MapReduce job | hadoop jar mr-job.jar | Use Tez or Spark in modern stacks
Reduce shuffle volume | Combiner | Local pre-aggregation in MapReduce
Control reducer assignment | Custom Partitioner | Handles skew; default is HashPartitioner
Query HDFS data with SQL | Hive on Tez / Spark | Hive Metastore registers tables
ETL scripting on Hadoop | Spark DataFrames | Pig is largely deprecated
Bulk-move data from RDBMS | Sqoop import / export | Or Spark JDBC; or modern connectors
Stream events into HDFS | Kafka + Spark Structured Streaming | Flume is legacy
Orchestrate a daily job | Airflow / Dagster | Oozie is legacy
Wide-column NoSQL access | HBase | Random reads / writes over HDFS
Tune Spark executor sizing | YARN container resource configs | yarn.scheduler.maximum-allocation-mb
Diagnose slow Hive query | Hive EXPLAIN + Tez UI | Inspect stages and shuffle volume

Frequently asked questions

What is HDFS and how does it work?

HDFS (Hadoop Distributed File System) is a distributed file system designed to store huge files reliably across a cluster of commodity machines. The architecture is master-slave: a NameNode holds the filesystem metadata (file tree, block locations, permissions) in memory, while many DataNodes store the actual block data on local disks. Files are split into blocks of 128MB by default (configurable via dfs.blocksize), and every block is replicated 3× across different DataNodes with rack-aware placement (first replica on the local rack, two on a different rack) for fault tolerance. When a DataNode fails, the NameNode notices the missing replicas via heartbeat loss and orders other DataNodes to copy blocks to restore the replication factor. Modern production clusters use HDFS HA with an active + standby NameNode plus JournalNodes and ZooKeeper-driven failover that completes within seconds, eliminating the historical single-NameNode SPOF.

What is YARN and what role does it play in Hadoop?

YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource manager. It separates resource management from per-application coordination using three components. The ResourceManager (RM) is the cluster master that arbitrates resources across all running applications via a pluggable Scheduler (FIFO, Capacity, Fair). NodeManagers (NM) are per-node workers that launch containers and report available resources via heartbeats. The ApplicationMaster (AM) runs once per application, negotiates containers from the RM, and coordinates the application's work inside them. Every task (a Spark executor, a MapReduce mapper, a Hive Tez task) runs inside a YARN container, which is a slice of memory and vcores on one NodeManager. On Spark-on-YARN with --deploy-mode cluster, the Spark driver runs inside the AM container, which then requests executor containers from the RM. YARN is the de-facto resource manager for on-prem Hadoop / Spark clusters; modern cloud-native deployments increasingly use Kubernetes instead.
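To make the RM / NM / AM split concrete, here is a toy model of container allocation. It is only a sketch: the class names mirror the YARN roles, but the first-fit placement, the 1024MB AM size, and the container ID format are assumptions for illustration, and real schedulers (Capacity, Fair) also weigh queues, locality, and priorities.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy first-fit allocator over a list of NodeManagers (not YARN's real logic)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mb, vcores):
        for node in self.nodes:
            if node.free_mb >= mb and node.free_vcores >= vcores:
                node.free_mb -= mb
                node.free_vcores -= vcores
                container = f"{app_id}@{node.name}#{len(node.containers)}"
                node.containers.append(container)
                return container
        return None  # in a real cluster the request would stay pending


def run_application(rm, app_id, executors, mb, vcores):
    """The first container hosts the AM, which then asks the RM for the rest."""
    am = rm.allocate(app_id + "-am", 1024, 1)  # assumed AM size
    if am is None:
        return None, []
    granted = [rm.allocate(app_id, mb, vcores) for _ in range(executors)]
    return am, [c for c in granted if c is not None]


rm = ResourceManager([NodeManager("nm1", 8192, 4), NodeManager("nm2", 8192, 4)])
am, executors = run_application(rm, "spark-job", executors=4, mb=4096, vcores=2)
print(am)         # spark-job-am@nm1#0
print(executors)  # ['spark-job@nm1#1', 'spark-job@nm2#0', 'spark-job@nm2#1']
```

Four executors were requested but only three fit, because the AM container already consumed part of nm1. That is the same reason a Spark job on a small queue can end up with fewer executors than it asked for.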

What's the difference between MapReduce and Spark?

MapReduce vs Spark is one of the most-asked comparisons in Hadoop interviews. MapReduce writes intermediate data to disk at every step: map output goes to local disk, and each job in a chain writes its result to HDFS before the next job reads it. Spark keeps intermediate results in RAM where it can, with disk fallback. For iterative workloads (ML training, graph algorithms), Spark is commonly quoted as 10-100× faster because data stays in memory across iterations. MapReduce is a two-stage model (map + reduce); Spark supports arbitrary DAGs with many narrow + wide transformations. MapReduce also pays container and JVM startup cost for every job and task, while Spark reuses long-lived executors, so scheduling another task or stage is much cheaper. Spark provides a unified API (batch DataFrame, streaming, ML, graph) while MapReduce only handles batch. In the modern Hadoop ecosystem, MapReduce as a compute engine is largely deprecated in favour of Spark, Tez, or Flink, but HDFS and YARN remain widely used as storage and resource management for Spark. The interview-canonical answer: keep HDFS / YARN, drop MapReduce, run Spark.

What's the difference between Hive and Pig?

Hive vs Pig is the classic Hadoop ecosystem comparison. Hive is a SQL-on-Hadoop engine: analysts and engineers write HQL (Hive Query Language, a SQL dialect), which compiles to MapReduce, Tez, or Spark jobs, and the Hive Metastore is the schema registry mapping tables to HDFS / S3 paths. Pig is a procedural data-flow scripting language (Pig Latin) for ETL pipelines that also compiles to MapReduce (or Tez in later releases). Hive's audience is anyone comfortable with SQL; Pig's audience is engineers comfortable with procedural data-flow code. In the modern Hadoop stack, Hive's Metastore is still ubiquitous (Spark SQL, Trino, Presto all read it), but Pig is largely deprecated because Spark DataFrames replaced almost every Pig use case. For a 2026 interview, mention that Hive's Metastore is the bridge between legacy and modern stacks, and that you'd reach for Spark over Pig for any new ETL work.

How does the MapReduce shuffle work?

The MapReduce shuffle is the data movement between the map phase and the reduce phase. After each map task runs the user's map(key, value) function and emits intermediate (intermediate_key, intermediate_value) pairs, the following happens: (1) the Partitioner (default HashPartitioner) decides which reducer each key goes to; (2) the buffered output is sorted by partition and key and, if a Combiner is configured, it runs here to pre-aggregate locally and reduce shuffle volume; (3) the sorted output is spilled to local disk as the map task runs, and the spills are merged into one partitioned file; (4) each reducer pulls its assigned partition from every mapper via HTTP; (5) a reducer-side merge-sort combines the streams from all mappers into one stream sorted by key; (6) the user's reduce(key, values_iterator) runs once per key on that sorted stream. The shuffle is disk-heavy: map output is always written to local disk, and reducers may spill again while merging. Spark also writes shuffle files to local disk, but it avoids the HDFS round trip between chained jobs and can cache data in memory, which is why it is commonly quoted as 10-100× faster on iterative or multi-stage workloads. Custom Partitioners and Combiners are the two main interview-grade optimizations for MapReduce shuffles.
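The shuffle stages described above can be sketched in plain Python with a word count. This is a single-process toy, not Hadoop code: the function names are hypothetical, `crc32` stands in for the key hash that HashPartitioner would compute, and in-memory lists stand in for spill files and HTTP fetches.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from zlib import crc32

NUM_REDUCERS = 2


def map_fn(line):
    for word in line.split():
        yield word, 1


def partition(key, num_reducers=NUM_REDUCERS):
    # Stand-in for HashPartitioner: stable hash of the key modulo reducer count.
    return crc32(key.encode("utf-8")) % num_reducers


def combine(sorted_pairs):
    # Combiner: sum counts per key inside one mapper's sorted partition.
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(sorted_pairs, key=lambda kv: kv[0])]


def run_mapper(lines):
    """Return {partition_id: sorted, combined pairs}, the mapper's 'spill'."""
    buckets = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            buckets[partition(key)].append((key, value))
    return {p: combine(sorted(pairs)) for p, pairs in buckets.items()}


def run_reducer(partition_id, mapper_outputs):
    # Fetch this reducer's partition from every mapper, then merge the sorted runs.
    runs = [out.get(partition_id, []) for out in mapper_outputs]
    merged = merge(*runs)
    return {k: sum(v for _, v in grp)
            for k, grp in groupby(merged, key=lambda kv: kv[0])}


splits = [["big data big cluster"], ["data node data block"]]
mapper_outputs = [run_mapper(split) for split in splits]

result = {}
for r in range(NUM_REDUCERS):
    result.update(run_reducer(r, mapper_outputs))

print(sorted(result.items()))
# [('big', 2), ('block', 1), ('cluster', 1), ('data', 3), ('node', 1)]
```

Because the second mapper combines its two `data` records into `('data', 2)` before the fetch, the reducer receives two pairs for that key instead of three. On real data with heavy key repetition, that local pre-aggregation is where the shuffle-volume saving comes from.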


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including Python practice keyed to the same distributed-compute concepts that underpin Hadoop (HDFS storage thinking, YARN resource management, MapReduce-style data flow). Whether you're drilling hadoop interview questions for freshers or grinding hadoop interview questions for experienced, the practice library mirrors the same three-layer mental model this guide teaches — plus the modern Spark / lakehouse skills you'll move toward.

Kick off via Explore practice →; drill the Python practice lane →; fan out into the ETL lane →; rehearse data-manipulation patterns →; reinforce data-analysis drills →; widen coverage on the full streaming Python library →.
