Mohamed Hussain S

From Pipelines to Transforms: Making Vector Work with ClickHouse

Part 3 of a series on building a metrics pipeline into ClickHouse
Read Part 2: Understanding Vector Pipelines


Where Things Got Real

By this point, the pipeline structure made sense.

I understood:

  • Sources
  • Transforms
  • Sinks

But the pipeline still wasn’t working reliably.

That’s when it became clear:

The hardest part wasn’t collecting data.
It was transforming it correctly.


Why Transforms Matter

Raw metrics are rarely usable as-is.

When sending data into ClickHouse, even small inconsistencies can break ingestion.

Some common issues I encountered:

  • Wrong data types
  • Unexpected field structures
  • Missing values
  • Incorrect timestamp formats

Even if everything else is correct, these issues cause inserts to fail.
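For context, the ClickHouse side of the pipeline is just a sink in `vector.toml`. A minimal sketch (the endpoint, database, table, and input names here are placeholders, not my exact config):

```toml
[sinks.clickhouse]
type     = "clickhouse"
inputs   = ["normalize"]           # upstream transform
endpoint = "http://localhost:8123" # ClickHouse HTTP interface
database = "metrics"
table    = "host_metrics"
# Don't reject a batch for fields the table doesn't know about
skip_unknown_fields = true
```

The sink itself is simple; everything that can go wrong lives in the data it receives.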


Enter VRL

In Vector, transformations are written using Vector Remap Language.

At first, VRL feels simple.

But in practice, it’s strict.

  • Types must be explicit
  • Fields must be handled carefully
  • Errors are not ignored

That strictness is what makes pipelines reliable, but it also makes them harder to get right.
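To make that concrete, here is a small sketch of how VRL forces you to deal with fallible functions (the field name `.value` is just an illustration):

```
# to_int can fail, so VRL refuses to compile a bare call.

# Option 1: abort processing of this event on failure
.value = to_int!(.value)

# Option 2: capture the error and fall back to a default
.value, err = to_int(.value)
if err != null {
    .value = 0
}
```

Either way, the failure path is spelled out in the program instead of surfacing later as a bad insert.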


The Timestamp Problem

One of the biggest issues I faced was timestamp handling.

ClickHouse expects timestamps in a specific format.

The raw data didn’t match that format.

Even when everything else was correct, inserts would fail silently because of this.

The fix looked like this:

```
.timestamp = to_unix_timestamp!(parse_timestamp!(.timestamp, "%+"))
```

This line did three things:

  • Parsed the incoming timestamp
  • Converted it into a Unix format
  • Made it compatible with ClickHouse

It looks simple, but this one line was a major blocker.
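For anyone who wants a safer variant: the same fix can be written without the abort-on-error `!`, falling back to ingestion time when parsing fails (a sketch, assuming the field is `.timestamp`):

```
parsed, err = parse_timestamp(.timestamp, "%+")
if err == null {
    .timestamp = to_unix_timestamp(parsed)
} else {
    # Couldn't parse; use the time Vector saw the event instead
    .timestamp = to_unix_timestamp(now())
}
```

The trade-off: events with broken timestamps still reach ClickHouse, but tagged with the wrong time, so choose based on whether you'd rather drop or approximate.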


Normalizing Metrics

Another challenge was aligning the data structure with what ClickHouse expects.

For both host and GPU metrics, this required:

  • Converting values into numeric types
  • Standardizing field names
  • Adding metadata like host and source
  • Ensuring consistent structure across all metrics

Without this step, ingestion would fail even if the pipeline looked correct.
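Roughly, the normalization transform looked like this (a sketch; `cpu_usage`, `mem_used`, `src`, and the metadata values are hypothetical names, not my exact schema):

```
# Converting values into numeric types
.cpu_usage = to_float!(.cpu_usage)
.mem_used  = to_float!(.mem_used)

# Standardizing field names
.metric_source = del(.src)

# Adding metadata like host and source
.host   = get_hostname!()
.source = "host"
```

Every metric stream went through a step like this, so the sink always saw the same shape.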


From Raw Data to Queryable Format

One important transformation was changing how metrics were structured.

Instead of storing multiple values in a single record:

```
cpu, memory, disk
```

The data was reshaped into a row-based format:

```
metric_name = "cpu", value = ...
metric_name = "memory", value = ...
```

This made it easier to:

  • Query data in ClickHouse
  • Aggregate metrics
  • Maintain a consistent schema
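In VRL, this reshape can be done by assigning an array of objects to the event root; Vector's `remap` transform then emits each element as its own event. A sketch, with assumed field names:

```
. = [
  {"timestamp": .timestamp, "host": .host, "metric_name": "cpu",    "value": to_float!(.cpu)},
  {"timestamp": .timestamp, "host": .host, "metric_name": "memory", "value": to_float!(.memory)},
  {"timestamp": .timestamp, "host": .host, "metric_name": "disk",   "value": to_float!(.disk)}
]
```

One wide record in, one narrow row out per metric.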

Why This Was Hard

Most of the time spent on this pipeline wasn’t on setup.

It was here:

```
Write transform → Run → Fail → Fix → Repeat
```

Each iteration revealed:

  • A type mismatch
  • A missing field
  • A formatting issue

This is where the pipeline actually gets built.


What Changed After This

Once the transforms were correct:

  • Data started flowing reliably
  • Inserts into ClickHouse succeeded
  • Queries started returning meaningful results

At that point, the pipeline finally felt stable.


What’s Next

Even after fixing transformations, one major challenge remained:

Debugging unexpected failures.

In the next part, I’ll walk through:

  • How I debugged pipeline issues
  • What ClickHouse logs revealed
  • And a mistake that cost me time


Final Thought

Transforms are where pipelines either succeed or fail.

Understanding how data needs to be shaped is more important than the tool itself.

Once that becomes clear, everything else starts to fall into place.

