Mid-level interviews sit in a tricky spot. You are past the basics yet still expected to show you can solve real production problems. This collection of Hadoop interview questions for middle level developers covers exactly that gap, from HDFS internals to YARN tuning and distributed debugging.
Getting Ready for a Middle Hadoop Developer Interview
Structured prep separates strong candidates from those who rely on memorized definitions. Below is how targeted questions help both recruiters and developers approach the process efficiently.
How Hadoop Interview Questions Help Recruiters Assess Middles
Recruiters evaluating mid-level candidates need questions that go beyond textbook topics. A solid bank of Hadoop middle level interview questions and answers lets them compare depth of reasoning across applicants. If you also evaluate junior hires, see our list of Hadoop interview questions for beginners for entry-level benchmarks.
How Sample Hadoop Interview Questions Help Middle Developers Improve Skills
Practicing with middle level Hadoop interview questions and answers exposes weak spots in YARN scheduling, join strategies, and cluster security that textbooks gloss over. Once you master these, move on to expert-level Hadoop interview questions and answers to prepare for senior rounds.
List of 100 Hadoop Interview Questions for Middle Developers
Five sections follow. Each opens with five bad-versus-good answer contrasts so you can gauge response quality, then continues with correct answers only. These Hadoop interview questions for 3 years experience and above cover the range interviewers expect at the mid-level.
Basic Middle Hadoop Developer Interview Questions
Start here. These Hadoop interview questions for middle level engineers focus on HDFS internals, YARN resource allocation, and MapReduce mechanics that every mid-level candidate should handle without hesitation.
1: How does HDFS handle block placement across racks?
Bad Answer: It puts copies on random nodes.
Good Answer: The first replica goes to the writer’s node, the second to a different rack, and the third to another node on that second rack. This balances fault tolerance with network cost.
2: What triggers HDFS under-replication, and how is it resolved?
Bad Answer: Under-replication means the file is corrupt.
Good Answer: It happens when a DataNode goes down and the live replica count drops below the configured factor. The NameNode detects this through missed heartbeats and schedules re-replication to a healthy node.
3: Explain the role of the ResourceManager in YARN.
Bad Answer: It runs user code.
Good Answer: The ResourceManager accepts applications, allocates containers through a scheduler, and tracks node health via NodeManager heartbeats. It never executes application logic.
4: How do you enable NameNode high availability?
Bad Answer: Just add a second NameNode to the cluster.
Good Answer: Configure two NameNodes sharing edits through JournalNodes. ZooKeeper handles automatic failover. Both nodes must have identical filesystem metadata at all times.
5: What is the difference between block size and split size?
Bad Answer: They are always the same.
Good Answer: Block size is a storage property set in HDFS. Split size is a logical division for MapReduce input, often matching block size by default but configurable independently through InputFormat settings.
6: How does the checkpoint operation work in the Secondary NameNode?
It periodically downloads the current fsimage and edit log, merges them into a compact snapshot, and uploads the result back to the active NameNode.
7: What happens during HDFS safe mode?
The NameNode enters a read-only state at startup, waits for DataNodes to report live block counts, and exits once the minimum replication threshold is reached.
8: How does YARN fair scheduler differ from capacity scheduler?
Fair scheduler shares resources equally across running applications. Capacity scheduler reserves fixed slices of the cluster for queues, guaranteeing minimum capacity per tenant.
9: What is HDFS federation?
Multiple independent NameNodes manage separate namespaces over a shared pool of DataNodes, scaling the metadata layer horizontally.
10: How do you detect and repair a corrupt HDFS block?
Run hdfs fsck to find missing or corrupt blocks. If a healthy replica exists the NameNode re-replicates automatically. Otherwise restore from backup.
11: What is data locality and why does it matter?
Running computation on the same node that stores the data avoids network transfers. The framework prefers node-local, then rack-local, then any available slot.
12: How do container resource negotiations work in YARN?
The ApplicationMaster requests containers with specific CPU and memory. The ResourceManager scheduler grants them based on queue policies and node capacity.
13: What is the edit log and why can it become a bottleneck?
It journals every metadata change. If it grows too large without checkpointing, NameNode restart time increases because all edits must be replayed against the fsimage.
14: How do you tune the replication factor per file?
Use hdfs dfs -setrep to override the cluster default. Hot data can go higher for read throughput, cold data lower to save storage.
15: What is speculative execution and when should you disable it?
It launches a duplicate of a slow task. Disable it for non-idempotent tasks or when the cluster is already resource-constrained.
16: How does YARN handle node failure during a running job?
The ResourceManager detects missed NodeManager heartbeats and marks the node dead. The ApplicationMaster re-requests containers on healthy nodes and retries affected tasks.
17: What is erasure coding in HDFS and when would you use it?
Erasure coding stores parity blocks instead of full replicas, cutting storage overhead. Use it for cold data where read performance is less critical.
18: How do you manage resource queues in YARN?
Define queues in capacity-scheduler.xml or fair-scheduler.xml. Set weights, max capacities, and user limits per queue. Refresh with yarn rmadmin -refreshQueues.
19: What triggers a rebalance of HDFS data?
Node additions or decommissions skew storage. Run hdfs balancer with a threshold percentage to move blocks until usage is even.
20: What is short-circuit local read?
When the client runs on the same node as the DataNode, it reads blocks directly from disk through a Unix domain socket, bypassing the DataNode process.
21: How does YARN preemption work?
When a higher-priority queue is starved, the scheduler reclaims containers from lower-priority applications. Tasks in those containers are killed and retried.
22: What is the role of ZooKeeper in a distributed cluster?
It provides coordination primitives: leader election for NameNode HA, configuration synchronization, and distributed locking.
23: How do you configure HDFS encryption at rest?
Create an encryption zone using hdfs crypto -createZone with a key from the Key Management Server. Data inside the zone is encrypted transparently.
24: What is the timeline server in YARN?
It stores completed application history and per-framework metrics, allowing post-run analysis without log aggregation.
25: How do you handle small files in HDFS?
Each small file consumes a namespace entry. Merge them using HAR archives, SequenceFiles, or CombineFileInputFormat to reduce NameNode memory pressure.
Middle Hadoop Developer Programming Interview Questions
These Hadoop intermediate interview questions and answers probe the Java API, driver configuration, and framework internals that come up during hands-on rounds, the same ground covered by Hadoop interview questions for 5 years experience candidates.
1: How do you implement a custom Partitioner and why would you need one?
Bad Answer: Only when you use more than two reducers.
Good Answer: You extend Partitioner and override getPartition to control which reducer receives each key. This fixes data skew when HashPartitioner sends most records to a single reducer.
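The core of that answer is the getPartition method. Below is a minimal plain-Java sketch of skew-aware partitioning logic; the Hadoop types are elided (in a real job this would live in a class extending org.apache.hadoop.mapreduce.Partitioner), and the hot-key name and spread factor are illustrative assumptions.

```java
// Sketch of a skew-aware getPartition: route a known hot key's records
// across several reducers via a salt, and hash everything else the way
// the default HashPartitioner does.
public class SkewPartitioner {
    static final String HOT_KEY = "popular"; // assumed hot key
    static final int HOT_SPREAD = 4;         // partitions reserved for it

    public static int getPartition(String key, int salt, int numPartitions) {
        if (HOT_KEY.equals(key)) {
            // Spread the hot key over the first HOT_SPREAD partitions.
            return salt % Math.min(HOT_SPREAD, numPartitions);
        }
        // Default HashPartitioner behavior: mask the sign bit, then modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The salt would typically come from a counter or random value attached to the key in the mapper, with a second aggregation pass to merge the salted partials.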
2: Explain the shuffle and sort phase in MapReduce.
Bad Answer: Shuffle just sends map output to reducers.
Good Answer: Map output is partitioned, spilled to disk, and merge-sorted by key. The framework then pulls sorted segments from each mapper to the assigned reducer, which merges them before calling reduce.
3: How does the Combiner reduce network traffic?
Bad Answer: It compresses map output.
Good Answer: The Combiner runs a local reduce on each mapper’s output before shuffle. For associative operations like sum or max, it cuts the data volume transferred across the network substantially.
4: How do you handle multiple input formats in one job?
Bad Answer: Convert everything to text first.
Good Answer: Use MultipleInputs.addInputPath to register different InputFormats and mapper classes for each path. The framework delegates to the correct mapper at runtime.
5: How do you pass runtime parameters to mapper tasks?
Bad Answer: Use a global static variable.
Good Answer: Set key-value pairs on the Configuration object in the driver. Read them in the mapper’s setup method via context.getConfiguration().get().
6: How does WritableComparable differ from Writable?
WritableComparable adds a compareTo method. Keys must implement it because the framework sorts them during the shuffle phase.
7: When would you write a custom InputFormat?
When the default formats cannot parse your file structure. Extend FileInputFormat and return a RecordReader that yields your key-value pairs.
8: How do you set up a map-side join?
Load the small dataset into a HashMap in the mapper’s setup via Distributed Cache. Join against each incoming record in the map method.
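The join logic itself is simple once the small table is in memory. A plain-Java sketch, with the Distributed Cache read replaced by an in-memory array and an assumed "id,value" CSV layout:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of map-side join logic: the small table is loaded once (in a
// real mapper, from the Distributed Cache file inside setup()), then
// each incoming record is enriched in map(). Field layout is assumed.
public class MapSideJoin {
    private final Map<String, String> lookup = new HashMap<>();

    // setup(): parse "id,name" lines of the cached small file.
    public void setup(String[] smallTableLines) {
        for (String line : smallTableLines) {
            String[] f = line.split(",", 2);
            lookup.put(f[0], f[1]);
        }
    }

    // map(): join "id,amount" against the lookup; null means no match
    // (an inner join simply skips the record).
    public String map(String record) {
        String[] f = record.split(",", 2);
        String name = lookup.get(f[0]);
        return name == null ? null : f[0] + "," + name + "," + f[1];
    }
}
```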
9: What is secondary sort and when is it needed?
When you need values ordered within each key group. Create a composite key, a custom Partitioner for grouping, and a GroupingComparator on the natural key.
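The ordering rules are easiest to see in the composite key itself. A plain-Java sketch (in Hadoop this would implement WritableComparable with readFields/write as well; the day/timestamp fields are assumed):

```java
// Sketch of a composite key's ordering for secondary sort: sort by the
// natural key first, then by the secondary column. The GroupingComparator
// would compare the natural key only, so one reduce() call sees all
// values for a day, already ordered by timestamp.
public class CompositeKey implements Comparable<CompositeKey> {
    final String naturalKey; // e.g. the day
    final long secondary;    // e.g. a timestamp to order within the day

    public CompositeKey(String naturalKey, long secondary) {
        this.naturalKey = naturalKey;
        this.secondary = secondary;
    }

    @Override
    public int compareTo(CompositeKey o) {
        int c = naturalKey.compareTo(o.naturalKey);
        return c != 0 ? c : Long.compare(secondary, o.secondary);
    }

    // What the GroupingComparator would use: the natural key alone.
    public static int groupCompare(CompositeKey a, CompositeKey b) {
        return a.naturalKey.compareTo(b.naturalKey);
    }
}
```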
10: How do you implement a reduce-side join?
Tag records with their source, partition on the join key, and cross-match tags in the reducer to produce joined output.
11: What is OutputCommitter and why does it matter?
It manages the atomicity of job output. FileOutputCommitter moves task output from temporary directories to the final path only after success.
12: How do you use counters for debugging a job?
Increment custom counters in mapper or reducer code. After the job completes read them via job.getCounters() to spot skew or unexpected data.
13: How does DistributedCache work under the hood?
Files are uploaded to HDFS once by the client. NodeManagers localize them to each task’s working directory before execution.
14: What is the role of RecordReader?
It converts raw input split bytes into key-value pairs that the mapper consumes. LineRecordReader is the default for text files.
15: How do you chain two MapReduce jobs programmatically?
Run the first job to completion, then set its output path as the input of the second job. Use ControlledJob or JobControl for dependency management.
16: How do you write a custom OutputFormat?
Extend FileOutputFormat, return a RecordWriter that serializes key-value pairs to your desired format, and register it in the job configuration.
17: What is the purpose of GenericOptionsParser?
It separates generic framework flags (-conf, -D) from application arguments, injecting settings into the Configuration automatically.
18: How does the framework handle task failure and retry?
A failed task attempt is rescheduled on a different node up to the configured retry limit (mapreduce.map.maxattempts). The job fails only when all attempts are exhausted.
19: When should you set the number of reducers to zero?
For map-only jobs where aggregation is unnecessary. Each mapper writes directly to HDFS, skipping the shuffle phase entirely.
20: How do you unit test a mapper without a live cluster?
Instantiate the mapper, pass a mock Context, call map with test input, and assert the emitted pairs. MRUnit provides a fluent API for this.
21: What is NullWritable and when do you use it?
A zero-length placeholder. Use it when the key or value carries no meaningful data, such as dedup jobs where only the key matters.
22: How do you control the sort order of keys?
Implement compareTo in your WritableComparable for natural order. For custom sorting without changing the key, register a RawComparator on the job.
23: What is MultipleOutputs?
A utility that lets tasks write to named output files beyond the default part-r files, each with its own format and path.
24: How do you configure compression for intermediate data?
Set mapreduce.map.output.compress to true and choose a codec like Snappy or LZ4. This shrinks shuffle traffic with minimal CPU overhead.
25: How do you profile a slow MapReduce job?
Enable YARN timeline server and check per-task runtimes. Look for data skew, garbage collection stalls, and excessive spilling in task counters.
Middle Hadoop Developer Coding Interview Questions
Coding rounds for mid-level roles test whether you can translate design decisions into working pipelines. These questions match what you would see in rounds aimed at candidates with three to five years of experience.
1: How would you write a job that computes average salary per department from a CSV file?
Bad Answer: Sum everything in one reducer.
Good Answer: In the mapper split each CSV line and emit (department, salary). In the reducer accumulate both a running sum and a count, then divide at the end of the iterator.
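The reducer side of that answer boils down to a single pass over the values iterator. A plain-Java sketch of the accumulation (in a real reducer the Iterable would be the framework-supplied values for one department key):

```java
// Sketch of the reducer-side averaging: carry a running sum and count
// across the iterator, dividing only once at the end. Never materialize
// the values into a collection; the iterator may not fit in memory.
public class AvgReducer {
    public static double average(Iterable<Double> salaries) {
        double sum = 0.0;
        long count = 0;
        for (double s : salaries) {
            sum += s;
            count++;
        }
        return sum / count;
    }
}
```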
2: How would you implement top-N records across a large dataset?
Bad Answer: Sort the whole dataset and take the first N.
Good Answer: Each mapper keeps a local TreeMap of N entries, emits them in cleanup. A single reducer merges all partial trees and retains the final N.
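The per-mapper bookkeeping can be sketched in plain Java like this. The score-keyed TreeMap mirrors what each mapper would hold and emit in cleanup(); note that duplicate scores overwrite each other here, so a real job would fold a tiebreaker into the key.

```java
import java.util.TreeMap;

// Sketch of the per-mapper top-N: keep a TreeMap of at most N entries
// keyed by score, evicting the smallest key on overflow. cleanup()
// would emit the survivors; a single reducer merges all partial sets
// the same way to produce the final N.
public class TopN {
    public static TreeMap<Long, String> topN(long[] scores, String[] ids, int n) {
        TreeMap<Long, String> best = new TreeMap<>();
        for (int i = 0; i < scores.length; i++) {
            best.put(scores[i], ids[i]);
            if (best.size() > n) {
                best.remove(best.firstKey()); // drop the current minimum
            }
        }
        return best;
    }
}
```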
3: How do you deduplicate records in MapReduce?
Bad Answer: Load everything into a HashSet in one mapper.
Good Answer: Emit the entire record as the key with NullWritable as the value. The framework groups identical keys, and the reducer writes each key once.
4: How would you join two datasets of very different sizes?
Bad Answer: Always use a reduce-side join.
Good Answer: Put the smaller file into Distributed Cache and load it in mapper setup. Join against the larger file during the map phase, avoiding a full shuffle.
5: How would you count distinct users per day from a log file?
Bad Answer: Group by day in the reducer and loop through all users in memory.
Good Answer: Emit (day + user) as a composite key. Use a secondary sort with a GroupingComparator on day. In the reducer iterate the key group and count unique user values.
6: How do you write a job that filters records matching a regex?
Compile the pattern in setup. In map, test each line and call context.write only for matches. Set reducers to zero for a pure filter.
7: How do you implement an inverted index?
The mapper emits (word, documentId). The reducer collects all document IDs per word into a comma-separated list.
8: How would you process multi-line XML records?
Write a custom RecordReader that reads between start and end tags. Use the StreamXmlRecordReader configuration if available in your distribution.
9: How do you implement k-means iteration in MapReduce?
Each mapper assigns points to the nearest centroid and emits (centroid, point). The reducer averages all points per centroid and writes new centroids. Repeat in a driver loop until convergence.
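The two halves of one iteration can be sketched in plain Java for one-dimensional points; the assignment function is what the mapper runs, and the recompute function is what the reducers collectively produce. Multi-dimensional points just swap the distance for a Euclidean one.

```java
// Sketch of one k-means iteration: assign each point to its nearest
// centroid (map side), then average the points per centroid (reduce
// side). A driver loop repeats this until centroids stop moving.
public class KMeansStep {
    public static int nearest(double point, double[] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (Math.abs(point - centroids[i]) < Math.abs(point - centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    public static double[] recompute(double[] points, double[] centroids) {
        double[] sum = new double[centroids.length];
        long[] count = new long[centroids.length];
        for (double p : points) {
            int c = nearest(p, centroids);
            sum[c] += p;
            count[c]++;
        }
        double[] next = centroids.clone(); // empty clusters stay in place
        for (int i = 0; i < next.length; i++) {
            if (count[i] > 0) next[i] = sum[i] / count[i];
        }
        return next;
    }
}
```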
10: How do you write a self-join on a single dataset?
Emit each record with the join key. In the reducer collect all records for that key and produce pairwise combinations.
11: How would you convert Parquet files to SequenceFiles?
Read with AvroParquetInputFormat in the mapper and emit records with SequenceFileOutputFormat in the reducer or a map-only job.
12: How do you partition output by date?
Parse the date in the mapper and use MultipleOutputs to write to date-based paths.
13: How would you compute a running total in MapReduce?
Sort by the ordering column via a composite key. In the reducer maintain an accumulator across iterations. Emit each record with the running sum.
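Once the composite key has delivered the values in order, the reducer logic is a one-variable accumulator. A plain-Java sketch:

```java
// Sketch of the reducer-side running total: values arrive already
// sorted by the ordering column (courtesy of the composite key), so a
// single accumulator carried across the iteration gives each record
// its running sum.
public class RunningTotal {
    public static long[] runningTotals(long[] sortedAmounts) {
        long[] out = new long[sortedAmounts.length];
        long acc = 0;
        for (int i = 0; i < sortedAmounts.length; i++) {
            acc += sortedAmounts[i];
            out[i] = acc;
        }
        return out;
    }
}
```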
14: How do you produce sorted output across all part files?
Use TotalOrderPartitioner with a sampled partition file. Each reducer receives a non-overlapping key range, so concatenating part files gives global order.
15: How do you handle JSON input where one object spans multiple lines?
Write a RecordReader that buffers characters between matching braces and emits each complete JSON object as a Text value.
16: How do you implement a left outer join in MapReduce?
Tag records by source. In the reducer, if the right source has no match, emit the left record with empty right fields.
17: How do you sample one percent of a large dataset?
In the mapper generate a random number per record and write only those below the threshold. Set reducers to zero.
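The per-record decision is a single uniform draw against the sampling fraction. A plain-Java sketch, seeded here only so the behavior is reproducible:

```java
import java.util.Random;

// Sketch of the map-side sampler: keep a record when a uniform draw in
// [0, 1) falls below the sampling fraction. Each mapper uses its own
// Random instance; no coordination between tasks is needed.
public class Sampler {
    public static boolean keep(Random rng, double fraction) {
        return rng.nextDouble() < fraction;
    }

    public static long countKept(long records, double fraction, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        for (long i = 0; i < records; i++) {
            if (keep(rng, fraction)) kept++;
        }
        return kept;
    }
}
```

Over 100,000 records a one-percent fraction keeps roughly 1,000, with binomial variance around that figure.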
18: How do you write a job that outputs to an HBase table?
Use TableOutputFormat. In the reducer or mapper create Put objects and emit them as values with NullWritable as the key.
19: How do you compute the median from a distributed dataset?
First run a histogram job to count values per bucket. In a second pass find the bucket containing the midpoint and scan within it.
20: How do you process gzip-compressed files?
Gzip is not splittable, so each file goes to one mapper. Set the appropriate codec. For splittable compression prefer bzip2, or LZO with an index file.
21: How do you write test coverage for a reducer?
Create a mock Context, feed sorted key-value pairs to the reduce method, and assert the output pairs match expectations.
22: How do you build a co-occurrence matrix?
For each pair of items appearing together, emit (pair, 1). The reducer sums counts per pair.
23: How would you migrate a MapReduce pipeline to Spark?
Replace the mapper with flatMap and the reducer with reduceByKey or groupByKey. Adjust serialization and partition count to match the original pipeline’s intent.
24: How do you validate schema on CSV input before processing?
In the mapper’s setup, load expected column count and types. In map, reject or count malformed rows via a custom counter.
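The per-row check is ordinary string parsing. A plain-Java sketch assuming a hypothetical three-column "id,name,amount" layout; in the mapper, a false result would bump a custom counter rather than emit the row:

```java
// Sketch of map-side row validation: verify the column count and that
// numeric columns actually parse. The 3-column "id,name,amount"
// layout here is an assumption for illustration.
public class RowValidator {
    public static boolean isValid(String line) {
        String[] f = line.split(",", -1); // -1 keeps trailing empty fields
        if (f.length != 3) return false;
        try {
            Long.parseLong(f[0]);     // id must be integral
            Double.parseDouble(f[2]); // amount must be numeric
        } catch (NumberFormatException e) {
            return false;
        }
        return !f[1].isEmpty();       // name must be present
    }
}
```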
25: How do you write incremental output without overwriting previous runs?
Append the run timestamp or batch ID to the output path. Downstream consumers read the latest partition or scan all partitions.
Practice-Based Middle Hadoop Developer Interview Questions
These questions test real-world troubleshooting and design instincts. They overlap heavily with Hadoop scenario based programming interview questions that mid-level candidates encounter in final rounds.
1: A MapReduce job that ran fine for months suddenly takes three times longer. How do you investigate?
Bad Answer: Restart the cluster and rerun.
Good Answer: Check YARN resource manager for new jobs competing for capacity. Review task counters for data skew or spill increases. Compare input size against historical baselines.
2: You notice heavy data skew with 90% of records going to one reducer. What do you do?
Bad Answer: Increase the number of reducers.
Good Answer: Inspect the key distribution. Either add a salt prefix to the hot key and run a second aggregation pass, or write a custom Partitioner to spread load across reducers.
3: A cluster is running out of NameNode heap. What are your options?
Bad Answer: Add more DataNodes.
Good Answer: The problem is metadata, not storage. Increase the NameNode heap, archive small files with HAR, or implement HDFS federation to split the namespace.
4: You need to join a 500 GB table with a 200 MB lookup table. How?
Bad Answer: Do a reduce-side join.
Good Answer: Load the 200 MB table into Distributed Cache. Each mapper reads it into a HashMap in setup and joins during the map phase, eliminating a full shuffle of 500 GB.
5: Client teams complain that their jobs wait too long in the queue. How do you handle it?
Bad Answer: Tell them to wait or buy more hardware.
Good Answer: Review queue weights and capacity limits. Add preemption for priority workloads and set max resource share per user to prevent one team from hogging the cluster.
6: How do you plan a zero-downtime rolling upgrade?
Upgrade NameNodes first with HA failover. Then upgrade DataNodes one at a time while the cluster continues serving reads and writes.
7: How would you set up a disaster recovery strategy?
Replicate critical data to a standby cluster using DistCp on a schedule. Test recovery by promoting the standby and running a validation job set.
8: A job writes output but downstream consumers see partial data. What happened?
The job likely committed partially. Check for speculative task output conflicts or output from a late attempt overwriting a committed file. FileOutputCommitter algorithm v1 publishes output only at job commit, so prefer it over v2 when consumers must never observe partial results; v2 makes each task's output visible as soon as that task commits.
9: How do you handle a sudden spike in small files?
Set up an ingestion layer that batches incoming files into larger blocks using Flume or a custom compaction job that merges small files into SequenceFiles.
10: How do you debug a task that succeeds locally but fails on the cluster?
Compare library versions, configuration overrides, and classpath order. Check if environment variables or Kerberos tickets differ from the local setup.
11: How do you estimate cluster resources for a new workload?
Profile on a subset: measure map time per record, mapper spill ratio, and reducer shuffle bytes. Extrapolate to full data volume and account for concurrent jobs.
12: How do you enforce data governance on a shared cluster?
Use Apache Ranger for policy-based access control. Integrate with LDAP for authentication. Enable audit logging to track who reads what.
13: A node decommission is taking hours. How do you speed it up?
Increase dfs.namenode.replication.max-streams to allow more parallel block transfers. Monitor network throughput to avoid saturating the links.
14: How do you test a new cluster configuration before production rollout?
Deploy the change to a staging cluster or a canary set of nodes. Run a benchmark suite and compare metrics before rolling out to production.
15: How do you handle time-zone-dependent data in a global cluster?
Store all timestamps in UTC. Apply time-zone conversion only at the presentation layer or in downstream consumers.
Tricky Middle Hadoop Developer Interview Questions
These questions catch candidates who memorize answers without understanding trade-offs. Expect them in senior-facing rounds for mid-level roles.
1: Can you run a MapReduce job without a reducer?
Bad Answer: No, every job needs at least one reducer.
Good Answer: Yes. Set the number of reducers to zero. Mappers write directly to HDFS. Useful for ETL filters, format conversions, and any task that needs no aggregation.
2: What happens if two NameNodes in an HA pair both think they are active?
Bad Answer: The cluster keeps running with both.
Good Answer: Split-brain scenario. Fencing mechanisms kill one node to prevent concurrent writes that would corrupt the namespace. Without proper fencing the filesystem can become inconsistent.
3: Is it always better to increase the number of reducers?
Bad Answer: More reducers always means faster processing.
Good Answer: Not necessarily. Too many reducers create tiny output files and add scheduling overhead. The optimal count depends on data volume, cluster size, and downstream reader patterns.
4: Can you run Spark and MapReduce on the same cluster simultaneously?
Bad Answer: No, they require separate clusters.
Good Answer: YARN manages both. Configure queue isolation so Spark and MapReduce workloads share resources without starving each other.
5: Does increasing block size always improve performance?
Bad Answer: Yes, bigger blocks mean faster reads.
Good Answer: Larger blocks reduce NameNode metadata overhead and benefit sequential scans. But they hurt small random reads and waste space when files are much smaller than the block size.
6: Why might a completed job report success yet produce wrong output?
Speculative execution could cause duplicate writes if the committer is not atomic. Bugs in the combiner or a non-associative aggregation are also common culprits.
7: How does HDFS handle a network partition between racks?
DataNodes on the isolated rack miss heartbeats. The NameNode marks them dead and triggers re-replication from surviving replicas. When the partition heals, excess replicas are trimmed.
8: What is the impact of having too many small files on the NameNode?
Each file and block occupies about 150 bytes of heap. Millions of small files can exhaust NameNode memory, slow down namespace operations, and degrade checkpoint times.
9: Why might a mapper run slowly on one particular node?
Hardware issues such as a degraded disk, high CPU contention from colocated tasks, or JVM garbage collection pauses. Check node-level metrics before blaming the code.
10: Can you change the replication factor of a file that is currently being written?
No. The replication factor is set when the file is created and locked until the write completes. Change it afterward with hdfs dfs -setrep.
Tips for Hadoop Interview Preparation for Middle Developers
Answering correctly is half the story. How you frame your experience during a technical round matters just as much.
- Build a personal two-node cluster and practice failover, rebalancing, and job tuning on real data.
- Study YARN queue configuration deeply; capacity and fair scheduling questions are a staple at this level.
- Be ready to whiteboard a join strategy. Know when map-side beats reduce-side and why.
- Review real production incidents where you fixed performance or data issues. Interviewers want stories, not just definitions.
- Time yourself answering questions aloud. Two minutes per answer is a realistic interview pace.
- Read the source code for InputFormat and OutputCommitter. Understanding framework internals sets you apart from candidates who only use high-level APIs.
Conclusion
The 100 questions above cover the full range that mid-level candidates face: HDFS internals, YARN resource management, MapReduce programming, hands-on coding, and production troubleshooting. Work through them section by section, build a practice cluster, and walk into your next round with real confidence.
The post 100 Hadoop Interview Questions for Middle Level Developers with Answers first appeared on Jobs With Scala.