Mohamed Hussain S
Analyzing and Optimizing a Parquet ClickHouse Ingestion Pipeline

Hey devs,
Recently I got the chance to analyze an existing ingestion pipeline that loads large Parquet files into a ClickHouse database.

I didn’t write the pipeline from scratch - my role was to understand how it worked, identify bottlenecks, and figure out how it could be made faster and more scalable.

This post walks through how I analyzed the pipeline and the key performance lessons I learned.


Understanding the Pipeline

At a high level, the pipeline does the following:

  • Reads Parquet files from a folder
  • Performs on-the-fly transformations and calculations
  • Ingests the processed data into a ClickHouse table

Simple in theory - but performance issues tend to hide in the details.


Identifying the Bottlenecks

While analyzing the pipeline, a clear pattern stood out:

Row-by-row processing in Python

  • Many transformations were executed one row at a time
  • Python loops and .apply() were heavily used
  • Each row triggered Python-level execution

This creates a CPU-bound bottleneck - Python processes rows sequentially, which does not scale well for large datasets.
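To make the bottleneck concrete, here is a minimal sketch of the kind of row-wise code that causes it. The file path and column names (`quantity`, `unit_price`, `discount`) are hypothetical, not taken from the actual pipeline:

```python
import pandas as pd

df = pd.read_parquet("incoming/events.parquet")

# Anti-pattern: .apply() with axis=1 calls the Python lambda once per row,
# so every single row pays Python interpreter overhead.
df["total_price"] = df.apply(
    lambda row: row["quantity"] * row["unit_price"] * (1 - row["discount"]),
    axis=1,
)
```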

Sub-optimal ingestion patterns

  • Inserts were not fully optimized for batch throughput
  • Smaller inserts increased network and coordination overhead

ClickHouse is extremely fast - but only when it receives data the right way.
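As a rough illustration of the anti-pattern (assuming the `clickhouse-connect` driver and a hypothetical `events` table, not the pipeline's real client code), inserting tiny chunks in a loop looks like this:

```python
import clickhouse_connect
import pyarrow.parquet as pq

client = clickhouse_connect.get_client(host="localhost")

# Anti-pattern: one INSERT per small chunk means one network round trip
# and one new data part in ClickHouse for every 500 rows.
for batch in pq.ParquetFile("incoming/events.parquet").iter_batches(batch_size=500):
    client.insert_df("events", batch.to_pandas())
```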


Optimization Concepts That Matter

More than any specific code change, these are the engineering concepts that made the biggest difference.


Vectorized Operations

Pandas and NumPy are designed to work on entire columns at once, not row by row.

  • Vectorized operations run in optimized C code
  • They avoid Python’s loop overhead
  • They can be orders of magnitude faster

Whenever possible, column-level operations should replace .apply() and explicit loops.
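Here is how the earlier row-wise sketch might look when expressed as column-level operations (same hypothetical columns as before):

```python
import numpy as np
import pandas as pd

df = pd.read_parquet("incoming/events.parquet")

# Column arithmetic: the whole expression runs in optimized C/NumPy code
# instead of looping over rows in Python.
df["total_price"] = df["quantity"] * df["unit_price"] * (1 - df["discount"])

# Conditional logic that would otherwise be an if/else inside .apply()
df["tier"] = np.where(df["total_price"] > 100, "premium", "standard")
```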


Batch Inserts into ClickHouse

Databases perform better when data arrives in large batches.

  • Fewer round trips
  • Less connection overhead
  • Better compression and write efficiency

Batching is essential to fully utilize ClickHouse’s ingestion speed.
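A possible shape for batched ingestion, again assuming `clickhouse-connect` and pyarrow; the table name, file path, and batch size are placeholders to tune for your own data:

```python
import clickhouse_connect
import pandas as pd
import pyarrow.parquet as pq

client = clickhouse_connect.get_client(host="localhost")

BATCH_ROWS = 500_000  # target rows per insert; tune to memory and row width
buffer, buffered_rows = [], 0

for batch in pq.ParquetFile("incoming/events.parquet").iter_batches():
    buffer.append(batch.to_pandas())
    buffered_rows += batch.num_rows
    if buffered_rows >= BATCH_ROWS:
        # One large insert: a single round trip and a well-sized part in ClickHouse
        client.insert_df("events", pd.concat(buffer, ignore_index=True))
        buffer, buffered_rows = [], 0

if buffer:  # flush whatever is left at the end of the file
    client.insert_df("events", pd.concat(buffer, ignore_index=True))
```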


Parallel Processing for CPU-Heavy Work

Some transformations are inherently CPU-heavy.

  • These tasks benefit from multiprocessing
  • Work can be split across CPU cores
  • Total processing time drops significantly

This is especially important when transformations cannot be fully vectorized.
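One way to split per-file work across cores using only the standard library; `heavy_transform` is a stand-in for whatever logic can't be vectorized, and the folder layout is illustrative:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd


def heavy_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the CPU-heavy, non-vectorizable transformation
    return df


def transform_file(path: Path) -> Path:
    """Transform one Parquet file; each call runs in its own worker process."""
    df = pd.read_parquet(path)
    df = heavy_transform(df)
    out = path.with_suffix(".transformed.parquet")
    df.to_parquet(out)
    return out


if __name__ == "__main__":
    files = sorted(Path("incoming").glob("*.parquet"))
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(transform_file, files):
            print("done:", result)
```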


Smarter File Handling

File-based ingestion pipelines can suffer from operational overhead:

  • Repeatedly scanning large folders
  • Tracking which files are processed

A clean workflow that separates incoming, processing, and processed files reduces both complexity and I/O overhead.
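A minimal sketch of that three-folder workflow (directory names are illustrative): a file is claimed by moving it into `processing/`, and moved to `processed/` only after ingestion succeeds, so a crashed run never loses track of state.

```python
import shutil
from pathlib import Path

INCOMING, PROCESSING, PROCESSED = Path("incoming"), Path("processing"), Path("processed")
for d in (INCOMING, PROCESSING, PROCESSED):
    d.mkdir(exist_ok=True)


def claim_next_file() -> Path | None:
    """Move the oldest incoming file into processing/ before touching its contents."""
    for path in sorted(INCOMING.glob("*.parquet")):
        claimed = PROCESSING / path.name
        shutil.move(str(path), str(claimed))
        return claimed
    return None


def mark_done(path: Path) -> None:
    """Only call this after the data has been confirmed in ClickHouse."""
    shutil.move(str(path), str(PROCESSED / path.name))
```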


Key Lessons Learned

From analyzing this pipeline, a few clear lessons stood out:

  1. Python row-by-row processing does not scale
  2. Vectorization should be the default approach
  3. Batch inserts unlock database performance
  4. Multiprocessing helps with CPU-bound workloads
  5. Simple file-handling strategies prevent long-term pain

You don’t need to rewrite everything - understanding where time is spent is often enough to unlock major gains.


Final Thoughts

You don’t always need to build systems from scratch to learn deeply.

Analyzing an existing ETL pipeline - understanding where it slows down and why - can teach you just as much about performance, scalability, and real-world data engineering.

These lessons apply far beyond Parquet and ClickHouse - they’re useful for any large-scale ingestion pipeline.


About Me

Mohamed Hussain S

🔗 LinkedIn | GitHub

I’m still learning and figuring things out as an associate data engineer.

If you like real-world content - analyzing systems, understanding bottlenecks, and learning from actual pipelines instead of theory-only stuff - feel free to follow me.


Top comments (2)

Aurora

Couldn’t agree more with your final thoughts! I’d add that documenting the bottlenecks you find in existing ETL pipelines is just as important as fixing them—it creates a knowledge base for the team and helps avoid repeating the same mistakes. I’ve also found that even small tweaks to file-handling (like batch processing instead of single file reads) can make a massive difference in large-scale ingestion. Great job sharing these real-world lessons for new data engineers!

Mohamed Hussain S

Totally agree! Documenting bottlenecks is just as important as fixing them. And yeah, file-handling tweaks often look small but make a huge difference at scale.
Thanks for sharing!