DEV Community

grace wambua


Resource Monitoring for Data Pipelines

As a data engineering student, I came to realize that the failures which slowly starve a system of resources don't always throw an error. Monitoring how our pipelines consume resources isn't just about performance; it's about respect for the hardware. This article is about understanding the machines we build on, so we can keep our pipelines fast and efficient.

Introduction

When running data pipelines, especially in production, resource monitoring is critical to prevent slowdowns, crashes, or system-wide failures. Simple Linux command-line tools like top, htop, df -h, and free -h provide real-time visibility into system health and help you catch issues before they escalate.

1. Monitoring CPU & Processes: top and htop

top (Built-in, lightweight)

The top command gives a live view of system processes and CPU usage.

Shows:

  • CPU utilization (user, system, idle time)
  • Running processes and their CPU/memory consumption

(screenshot: `top` command output)

Why it matters for pipelines:

  • Identify CPU bottlenecks during heavy transformations (e.g., Spark jobs, ETL scripts)
  • Detect runaway processes consuming excessive CPU
  • Spot when multiple pipelines overload the system

Tip: Press P inside top to sort by CPU usage.
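For logging rather than interactive use, top also has a batch mode. A minimal sketch, assuming the procps-ng version of top found on most Linux distributions (busybox top lacks the `-o` sort flag):

```shell
# One non-interactive snapshot of top, e.g. for a cron job that logs
# system load alongside a long-running pipeline.
top -b -n 1 | head -n 12

# The same snapshot sorted by memory instead of CPU (-o picks the field).
top -b -n 1 -o %MEM | head -n 12
```

Redirecting that output to a file every few minutes gives you a resource history to consult after a pipeline dies overnight.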

htop (Enhanced, user-friendly)

htop is an improved version of top with a more intuitive interface.

Features:

  • Color-coded CPU, memory, and swap usage
  • Easy process management (kill, renice)
  • Tree view of processes (great for pipeline dependencies)

(screenshot: `htop` command output)

Pipeline use cases:

  • Visualize parallel jobs in distributed pipelines
  • Quickly terminate stuck or zombie tasks
  • Monitor thread-level activity in real time
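htop's tree view is the quickest way to see which worker processes a pipeline has spawned; `htop -t -u "$USER"` opens it filtered to your own processes. When you need the same picture non-interactively (for a log file, or a box without htop installed), `ps --forest` is a rough stand-in, sketched here with the GNU/procps version of ps:

```shell
# Snapshot of the current user's process tree with CPU/memory share,
# a non-interactive approximation of htop's tree view (htop -t).
ps -u "$USER" --forest -o pid,pcpu,pmem,stat,etime,cmd
```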

2. Monitoring Memory Usage: free -h

The free -h command shows memory usage in a human-readable format.

Key metrics:

  • Used
  • Free
  • Buffers/cache
  • Swap usage
  • Available

(screenshot: `free -h` command output)

Example use:

  • If your data pipeline loads large datasets into memory (e.g. Pandas, Spark), watch the available memory
  • If it drops too low, the system may start swapping, drastically slowing performance

Best practice:

  • Ensure pipelines don’t consume all RAM; leave headroom for the OS and other services
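That headroom rule can be enforced before a heavy step even starts. A minimal guard script, assuming GNU free from procps (the seventh field of the `Mem:` line is the "available" column); the 1024 MB threshold is an arbitrary example value:

```shell
#!/bin/sh
# Abort early if available memory is below a threshold, instead of
# letting a big Pandas/Spark load push the machine into swap.
THRESHOLD_MB=1024   # example value; tune to your workload

avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
    echo "Refusing to start: only ${avail_mb} MB available" >&2
    exit 1
fi
echo "OK: ${avail_mb} MB available"
```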

3. Monitoring Disk Space: df -h

The df -h command displays disk usage across mounted filesystems.

Shows:

  • Total, used, and available disk space
  • Usage percentage per filesystem

(screenshot: `df -h` command output)

Data pipelines often generate:

  • Temporary files
  • Logs
  • Intermediate datasets

If disk fills up:

  • Jobs may fail unexpectedly
  • Databases or services can crash

Common risk:

A pipeline writing large intermediate files (e.g. CSV/Parquet) can silently fill up the disk, causing job failures or system instability.

Tip:

  • Watch for partitions approaching 90–100% usage
  • Clean up temp directories or rotate logs regularly
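The 90% rule from the tip above is easy to automate. A sketch using GNU df's `--output` option; the tmpfs/devtmpfs exclusions and the 90 threshold are example choices:

```shell
#!/bin/sh
# Warn about any real filesystem at or above 90% usage.
df --output=pcent,target -x tmpfs -x devtmpfs | tail -n +2 |
while read -r pcent mount; do
    used=${pcent%\%}              # strip the trailing % sign
    if [ "$used" -ge 90 ]; then
        echo "WARNING: $mount is at $pcent used" >&2
    fi
done
```

Run from cron, a script like this turns "the disk filled up at 3 a.m." into a warning you saw the day before.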

Preventing Production Failures

By combining these tools, you can proactively protect your system:

  • High CPU usage (top/htop)
    Indicates inefficient code or too many parallel jobs

  • Low available memory (free -h)
    Risk of crashes or heavy swapping

  • High disk usage (df -h)
    Risk of failed writes and system instability

Practical Workflow for Data Engineers

  • Start your pipeline
  • Open another terminal and run:

  • htop: monitor CPU + processes
  • watch free -h: track memory over time
  • watch df -h: monitor disk growth

  • Look for abnormal spikes or steady resource exhaustion
  • Adjust batch sizes, parallelism, or memory allocation accordingly
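The workflow above can also be wrapped in a small "side-car" script that samples memory and disk while the pipeline runs, leaving a log to inspect after a failure. Here `run_pipeline.sh` is a placeholder for your own job, and the 30-second interval is an example:

```shell
#!/bin/sh
# Start the pipeline in the background and log resources until it exits.
./run_pipeline.sh &        # placeholder for your actual pipeline command
pid=$!

while kill -0 "$pid" 2>/dev/null; do
    {
        date '+%F %T'
        free -h | awk '/^Mem:/  {print "  mem available:", $7}'
        df -h /  | awk 'NR == 2 {print "  disk used on /:", $5}'
    } >> pipeline_resources.log
    sleep 30
done
wait "$pid"
echo "pipeline exited with status $?"
```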

Key Takeaways

These tools are lightweight, fast, and available on most Linux systems. They provide real-time insights into system health. Regular monitoring helps:

  • Prevent crashes
  • Optimize performance
  • Ensure stable production pipelines

Conclusion

Resource monitoring is about being a good steward of the infrastructure we use. Without proper monitoring, pipelines may crash unexpectedly and systems can become unresponsive. With these tools, you gain early warning signals, debug performance issues faster, and ensure stable, reliable data processing.
