DEV Community

grace wambua


Resource Monitoring for Data Pipelines

As a data engineering student, I came to realize that the failures which slowly starve a system of resources don't always throw an error. Monitoring how our pipelines consume resources isn't just about performance; it's about respect for the hardware. This article is about understanding the machines we build on, so we can keep our pipelines fast and efficient.

Introduction

When running data pipelines, especially in production, resource monitoring is critical to prevent slowdowns, crashes, or system-wide failures. Simple Linux command-line tools like top, htop, df -h, and free -h provide real-time visibility into system health and help you catch issues before they escalate.

1. Monitoring CPU & Processes: top and htop

top (Built-in, lightweight)

The top command gives a live view of system processes and CPU usage.

Shows:

  • CPU utilization (user, system, idle time)
  • Running processes and their CPU/memory consumption

(screenshot: `top` command output)

Why it matters for pipelines:

  • Identify CPU bottlenecks during heavy transformations (e.g., Spark jobs, ETL scripts)
  • Detect runaway processes consuming excessive CPU
  • Spot when multiple pipelines overload the system

Tip: Press P inside top to sort by CPU usage.
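For logging rather than interactive use, top also has a batch mode. A minimal sketch, assuming the procps-ng version of top found on most Linux distributions (busybox top lacks the `-o` sort flag):

```shell
# One non-interactive snapshot of top, e.g. for a cron job that logs
# system load alongside a long-running pipeline.
top -b -n 1 | head -n 12

# The same snapshot sorted by memory instead of CPU (-o picks the field).
top -b -n 1 -o %MEM | head -n 12
```

Redirecting that output to a file every few minutes gives you a resource history to consult after a pipeline dies overnight.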

htop (Enhanced, user-friendly)

htop is an improved version of top with a more intuitive interface.

Features:

  • Color-coded CPU, memory, and swap usage
  • Easy process management (kill, renice)
  • Tree view of processes (great for pipeline dependencies)

(screenshot: `htop` command output)

Pipeline use cases:

  • Visualize parallel jobs in distributed pipelines
  • Quickly terminate stuck or zombie tasks
  • Monitor thread-level activity in real time
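htop's tree view is the quickest way to see which worker processes a pipeline has spawned; `htop -t -u "$USER"` opens it filtered to your own processes. When you need the same picture non-interactively (for a log file, or a box without htop installed), `ps --forest` is a rough stand-in, sketched here with the GNU/procps version of ps:

```shell
# Snapshot of the current user's process tree with CPU/memory share,
# a non-interactive approximation of htop's tree view (htop -t).
ps -u "$USER" --forest -o pid,pcpu,pmem,stat,etime,cmd
```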

2. Monitoring Memory Usage: free -h

The free -h command shows memory usage in a human-readable format.

Key metrics:

  • Used
  • Free
  • Buffers/cache
  • Swap usage
  • Available

(screenshot: `free -h` command output)

Example use:

  • If your data pipeline loads large datasets into memory (e.g. Pandas, Spark), watch the available memory
  • If it drops too low, the system may start swapping, drastically slowing performance

Best practice:

  • Ensure pipelines don’t consume all RAM; leave headroom for the OS and other services
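That headroom rule can be enforced before a heavy step even starts. A minimal guard script, assuming GNU free from procps (the seventh field of the `Mem:` line is the "available" column); the 1024 MB threshold is an arbitrary example value:

```shell
#!/bin/sh
# Abort early if available memory is below a threshold, instead of
# letting a big Pandas/Spark load push the machine into swap.
THRESHOLD_MB=1024   # example value; tune to your workload

avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
    echo "Refusing to start: only ${avail_mb} MB available" >&2
    exit 1
fi
echo "OK: ${avail_mb} MB available"
```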

3. Monitoring Disk Space: df -h

The df -h command displays disk usage across mounted filesystems.

Shows:

  • Total, used, and available disk space
  • Usage percentage per filesystem

(screenshot: `df -h` command output)

Data pipelines often generate:

  • Temporary files
  • Logs
  • Intermediate datasets

If disk fills up:

  • Jobs may fail unexpectedly
  • Databases or services can crash

Common risk:

A pipeline writing large intermediate files (e.g. CSV/Parquet) can silently fill up the disk, causing job failures or system instability.

Tip:

  • Watch for partitions approaching 90–100% usage
  • Clean up temp directories or rotate logs regularly
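The 90% rule from the tip above is easy to automate. A sketch using GNU df's `--output` option; the tmpfs/devtmpfs exclusions and the 90 threshold are example choices:

```shell
#!/bin/sh
# Warn about any real filesystem at or above 90% usage.
df --output=pcent,target -x tmpfs -x devtmpfs | tail -n +2 |
while read -r pcent mount; do
    used=${pcent%\%}              # strip the trailing % sign
    if [ "$used" -ge 90 ]; then
        echo "WARNING: $mount is at $pcent used" >&2
    fi
done
```

Run from cron, a script like this turns "the disk filled up at 3 a.m." into a warning you saw the day before.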

Preventing Production Failures

By combining these tools, you can proactively protect your system:

  • High CPU usage (top/htop)
    Indicates inefficient code or too many parallel jobs

  • Low available memory (free -h)
    Risk of crashes or heavy swapping

  • High disk usage (df -h)
    Risk of failed writes and system instability

Practical Workflow for Data Engineers

  • Start your pipeline
  • Open another terminal and run:

  • htop: monitor CPU + processes
  • watch free -h: track memory over time
  • watch df -h: monitor disk growth

  • Look for abnormal spikes or steady resource exhaustion
  • Adjust batch sizes, parallelism, or memory allocation accordingly
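The workflow above can also be wrapped in a small "side-car" script that samples memory and disk while the pipeline runs, leaving a log to inspect after a failure. Here `run_pipeline.sh` is a placeholder for your own job, and the 30-second interval is an example:

```shell
#!/bin/sh
# Start the pipeline in the background and log resources until it exits.
./run_pipeline.sh &        # placeholder for your actual pipeline command
pid=$!

while kill -0 "$pid" 2>/dev/null; do
    {
        date '+%F %T'
        free -h | awk '/^Mem:/  {print "  mem available:", $7}'
        df -h /  | awk 'NR == 2 {print "  disk used on /:", $5}'
    } >> pipeline_resources.log
    sleep 30
done
wait "$pid"
echo "pipeline exited with status $?"
```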

Key Takeaways

These tools are lightweight, fast, and available on most Linux systems. They provide real-time insights into system health. Regular monitoring helps:

  • Prevent crashes
  • Optimize performance
  • Ensure stable production pipelines

Conclusion

Resource monitoring is about being a good steward of the infrastructure we use. Without proper monitoring, pipelines may crash unexpectedly and systems can become unresponsive. With these tools, you gain early warning signals, debug performance issues faster, and ensure stable, reliable data processing.
