Analyzing and Optimizing a Parquet ClickHouse Ingestion Pipeline

Hey devs,
Recently I got the chance to analyze an existing ingestion pipeline that loads large Parquet files into a ClickHouse database.

I didn’t write the pipeline from scratch - my role was to understand how it worked, identify bottlenecks, and figure out how it could be made faster and more scalable.

This post walks through how I analyzed the pipeline and the key performance lessons I learned.


Understanding the Pipeline

At a high level, the pipeline does the following:

  • Reads Parquet files from a folder
  • Performs on-the-fly transformations and calculations
  • Ingests the processed data into a ClickHouse table

Simple in theory - but performance issues tend to hide in the details.
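
To make the shape of the pipeline concrete, here is a minimal sketch of how such a loop might look, assuming pandas and the clickhouse-connect client; the folder, table, and column names (incoming, events, amount, fx_rate) are illustrative, not the actual pipeline's.

```python
from pathlib import Path

import clickhouse_connect
import pandas as pd

client = clickhouse_connect.get_client(host="localhost")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the pipeline's on-the-fly calculations
    df["amount_usd"] = df["amount"] * df["fx_rate"]
    return df

for parquet_file in sorted(Path("incoming").glob("*.parquet")):
    df = pd.read_parquet(parquet_file)     # 1. read the Parquet file
    df = transform(df)                     # 2. apply transformations
    client.insert_df("events", df)         # 3. ingest into ClickHouse
```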


Identifying the Bottlenecks

While analyzing the pipeline, a clear pattern stood out:

Row-by-row processing in Python

  • Many transformations were executed one row at a time
  • Python loops and .apply() were heavily used
  • Each row triggered Python-level execution

This creates a CPU-bound bottleneck - Python processes rows sequentially, which does not scale well for large datasets.
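
For illustration, the bottleneck looks roughly like this (the fee logic and column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE"], "amount": [100.0, 80.0]})

# Anti-pattern: .apply(axis=1) calls a Python function once per row.
def compute_fee(row):
    if row["country"] == "US":
        return row["amount"] * 0.02
    return row["amount"] * 0.035

df["fee"] = df.apply(compute_fee, axis=1)  # millions of rows -> millions of Python calls
```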

Sub-optimal ingestion patterns

  • Inserts were not fully optimized for batch throughput
  • Smaller inserts increased network and coordination overhead

ClickHouse is extremely fast - but only when it receives data the right way.
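
As a rough illustration of this second bottleneck, ingestion in many small chunks looks something like this (client, file, and table names are again made up):

```python
import clickhouse_connect
import numpy as np
import pandas as pd

client = clickhouse_connect.get_client(host="localhost")
df = pd.read_parquet("incoming/example.parquet")   # hypothetical file

# Anti-pattern: many tiny inserts, each paying a network round trip and
# creating a separate part for ClickHouse to merge in the background.
for small_chunk in np.array_split(df, 1_000):
    client.insert_df("events", small_chunk)
```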


Optimization Concepts That Matter

Rather than dwelling on specific code changes, here are the engineering concepts that made the biggest difference.


Vectorized Operations

Pandas and NumPy are designed to work on entire columns at once, not row by row.

  • Vectorized operations run in optimized C code
  • They avoid Python’s loop overhead
  • They can be orders of magnitude faster

Whenever possible, column-level operations should replace .apply() and explicit loops.
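
As a hedged example, the row-wise fee calculation shown earlier can be replaced with a single column-level expression (same hypothetical columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE"], "amount": [100.0, 80.0]})

# Vectorized equivalent of the row-wise compute_fee: np.where evaluates the
# whole column in optimized C code, with no per-row Python function calls.
df["fee"] = np.where(df["country"] == "US",
                     df["amount"] * 0.02,
                     df["amount"] * 0.035)
```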


Batch Inserts into ClickHouse

Databases perform better when data arrives in large batches.

  • Fewer round trips
  • Less connection overhead
  • Better compression and write efficiency

Batching is essential to fully utilize ClickHouse’s ingestion speed.
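
A minimal sketch of batched ingestion with the clickhouse-connect client; the 500,000-row batch size is just an illustrative starting point to tune, not a figure from the original pipeline:

```python
import clickhouse_connect
import pandas as pd

client = clickhouse_connect.get_client(host="localhost")
df = pd.read_parquet("incoming/example.parquet")   # hypothetical file

BATCH_ROWS = 500_000   # illustrative; tune for row width and available memory

# One large insert per batch: fewer round trips, fewer parts to merge,
# better compression per block.
for start in range(0, len(df), BATCH_ROWS):
    client.insert_df("events", df.iloc[start:start + BATCH_ROWS])
```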


Parallel Processing for CPU-Heavy Work

Some transformations are inherently CPU-heavy.

  • These tasks benefit from multiprocessing
  • Work can be split across CPU cores
  • Total processing time drops significantly

This is especially important when transformations cannot be fully vectorized.
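
A sketch of spreading a CPU-heavy, hard-to-vectorize transformation across cores with the standard multiprocessing module; heavy_transform and the hashing stand-in are assumptions for illustration, not the pipeline's actual logic:

```python
import hashlib
from multiprocessing import Pool

import numpy as np
import pandas as pd

def expensive_parse(value) -> str:
    # Stand-in for CPU-heavy logic that resists vectorization
    return hashlib.sha256(str(value).encode()).hexdigest()

def heavy_transform(chunk: pd.DataFrame) -> pd.DataFrame:
    chunk = chunk.copy()
    chunk["parsed"] = [expensive_parse(v) for v in chunk["raw"]]
    return chunk

def parallel_transform(df: pd.DataFrame, workers: int = 4) -> pd.DataFrame:
    chunks = np.array_split(df, workers)            # split rows across cores
    with Pool(processes=workers) as pool:
        parts = pool.map(heavy_transform, chunks)   # each worker gets a chunk
    return pd.concat(parts, ignore_index=True)

if __name__ == "__main__":                          # required for the spawn start method
    df = pd.DataFrame({"raw": list(range(100_000))})
    result = parallel_transform(df, workers=4)
```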


Smarter File Handling

File-based ingestion pipelines can suffer from operational overhead:

  • Repeatedly scanning large folders
  • Tracking which files are processed

A clean workflow that separates incoming, processing, and processed files reduces both complexity and I/O overhead.
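
One hedged way to implement that separation with pathlib; the folder names and move-based tracking are assumptions about a reasonable layout, not the original code:

```python
from pathlib import Path

# Illustrative layout: files flow incoming/ -> processing/ -> processed/
INCOMING, PROCESSING, PROCESSED = Path("incoming"), Path("processing"), Path("processed")
for folder in (INCOMING, PROCESSING, PROCESSED):
    folder.mkdir(exist_ok=True)

def claim_files():
    # Moving a file into processing/ marks it as in flight, so nothing is
    # re-scanned or double-ingested; a crash leaves an obvious trail to inspect.
    for src in sorted(INCOMING.glob("*.parquet")):
        dst = PROCESSING / src.name
        src.rename(dst)              # assumes all folders are on the same filesystem
        yield dst

def mark_done(path: Path) -> None:
    path.rename(PROCESSED / path.name)
```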


Key Lessons Learned

From analyzing this pipeline, a few clear lessons stood out:

  1. Python row-by-row processing does not scale
  2. Vectorization should be the default approach
  3. Batch inserts unlock database performance
  4. Multiprocessing helps with CPU-bound workloads
  5. Simple file-handling strategies prevent long-term pain

You don’t need to rewrite everything - understanding where time is spent is often enough to unlock major gains.


Final Thoughts

You don’t always need to build systems from scratch to learn deeply.

Analyzing an existing ETL pipeline - understanding where it slows down and why - can teach you just as much about performance, scalability, and real-world data engineering.

These lessons apply far beyond Parquet and ClickHouse - they’re useful for any large-scale ingestion pipeline.


About Me

Mohamed Hussain S

🔗 LinkedIn · GitHub

I’m still learning and figuring things out as an associate data engineer.

If you like real-world content - analyzing systems, understanding bottlenecks, and learning from actual pipelines instead of theory-only stuff - feel free to follow me.

