Part 3 of a series on building a metrics pipeline into ClickHouse
Read Part 2: Understanding Vector Pipelines
Where Things Got Real
By this point, the pipeline structure made sense.
I understood:
- Sources
- Transforms
- Sinks
But the pipeline still wasn’t working reliably.
That’s when it became clear:
The hardest part wasn’t collecting data.
It was transforming it correctly.
Why Transforms Matter
Raw metrics are rarely usable as-is.
When sending data into ClickHouse, even small inconsistencies can break ingestion.
Some common issues I encountered were:
- Wrong data types
- Unexpected field structures
- Missing values
- Incorrect timestamp formats
Even if everything else is correct, these issues cause inserts to fail.
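To make the failure modes concrete, here is a rough Python sketch of the kinds of checks ingestion implicitly performs. The record and field names are hypothetical, not the actual schema:

```python
# Hypothetical raw record, roughly as a collector might emit it.
raw = {"cpu": "42.5", "timestamp": "2024-01-01T12:00:00Z"}

problems = []

# A numeric column rejects the string "42.5" - it must be cast first.
if not isinstance(raw.get("cpu"), (int, float)):
    problems.append("cpu is not numeric")

# An ISO 8601 string may not match the format the DateTime column
# expects, depending on how the insert is encoded.
if not isinstance(raw.get("timestamp"), (int, float)):
    problems.append("timestamp is not Unix time")

# The record never set "memory" at all.
if "memory" not in raw:
    problems.append("memory is missing")

print(problems)
```

Any one of these is enough to reject the whole insert, even when the rest of the pipeline is healthy.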
Enter VRL
In Vector, transformations are written using Vector Remap Language.
At first, VRL feels simple.
But in practice, it’s strict.
- Types must be explicit
- Fields must be handled carefully
- Errors are not ignored
That strictness is what makes pipelines reliable - but also harder to get right.
The Timestamp Problem
One of the biggest issues I faced was timestamp handling.
ClickHouse expects timestamps in a specific format.
The raw data didn’t match that format.
Even when everything else was correct, inserts would fail silently because of this.
The fix looked like this:
```
.timestamp = to_unix_timestamp!(parse_timestamp!(.timestamp, "%+"))
```
This line did three things:
- Parsed the incoming timestamp
- Converted it into a Unix format
- Made it compatible with ClickHouse
It seems simple - but this was a major blocker.
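For readers who don't write VRL, the same conversion can be sketched in Python. This is an illustration of the logic, not the pipeline's actual code:

```python
from datetime import datetime


def to_unix(ts: str) -> int:
    """Parse an ISO 8601 timestamp and return Unix seconds,
    mirroring to_unix_timestamp!(parse_timestamp!(.timestamp, "%+"))."""
    # fromisoformat needs an explicit offset; normalize a trailing "Z".
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return int(dt.timestamp())


print(to_unix("2024-01-01T00:00:00Z"))  # 1704067200
```

The important part is the same in both languages: parse first, then convert, and fail loudly if either step can't handle the input.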
Normalizing Metrics
Another challenge was aligning the data structure with what ClickHouse expects.
For both host and GPU metrics, this required:
- Converting values into numeric types
- Standardizing field names
- Adding metadata like host and source
- Ensuring consistent structure across all metrics
Without this step, ingestion would fail even if the pipeline looked correct.
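The normalization steps above can be sketched roughly like this in Python. The field names (`cpuPct`, `memUsed`, `cpu_percent`, and so on) are illustrative placeholders, not the actual schema:

```python
import socket


def normalize(raw: dict, source: str) -> dict:
    """Coerce values to numeric types, standardize field names,
    and attach host/source metadata to every record."""
    return {
        # Metadata added to every record so rows are traceable.
        "host": socket.gethostname(),
        "source": source,
        # Collectors often emit numbers as strings; cast explicitly
        # so numeric columns accept them.
        "cpu_percent": float(raw.get("cpuPct", 0)),
        "mem_used_bytes": float(raw.get("memUsed", 0)),
    }


row = normalize({"cpuPct": "12.5", "memUsed": 1024}, "host_metrics")
```

The same function shape works for host and GPU metrics alike, which is what keeps the structure consistent across all of them.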
From Raw Data to Queryable Format
One important transformation was changing how metrics were structured.
Instead of storing multiple values in a single record:

```
cpu, memory, disk
```

The data was reshaped into a row-based format:

```
metric_name = "cpu", value = ...
metric_name = "memory", value = ...
```
This made it easier to:
- Query data in ClickHouse
- Aggregate metrics
- Maintain a consistent schema
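The reshaping above - one wide record in, one row per metric out - can be sketched in Python. The metadata keys are assumptions for the sketch:

```python
def to_rows(record):
    """Reshape one wide record into one row per metric (long format).
    Keys other than the metadata become metric_name/value pairs."""
    meta = {k: record[k] for k in ("host", "timestamp") if k in record}
    return [
        {**meta, "metric_name": name, "value": value}
        for name, value in record.items()
        if name not in ("host", "timestamp")
    ]


rows = to_rows({"host": "node1", "timestamp": 1704067200,
                "cpu": 42.5, "memory": 63.0, "disk": 81.2})
# Each row shares the metadata and carries exactly one metric.
```

Because every row now has the same three-or-four-column shape, the table schema stays fixed no matter how many metrics are added later.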
Why This Was Hard
Most of the time spent on this pipeline wasn’t on setup.
It was here:
Write transform → Run → Fail → Fix → Repeat
Each iteration revealed:
- A type mismatch
- A missing field
- A formatting issue
This is where the pipeline actually gets built.
What Changed After This
Once the transforms were correct:
- Data started flowing reliably
- Inserts into ClickHouse succeeded
- Queries started returning meaningful results
At that point, the pipeline finally felt stable.
What’s Next
Even after fixing transformations, one major challenge remained:
Debugging unexpected failures.
In the next part, I’ll walk through:
- How I debugged pipeline issues
- What ClickHouse logs revealed
- And a mistake that cost me time
Series Overview
- Part 1: Why My Metrics Pipeline with Telegraf Didn’t Work (and What I Learned)
- Part 2: Understanding Vector Pipelines: From Config Files to Data Flow
- Part 3: Writing transforms and handling data correctly (this post)
- Part 4: THE FINALE
Final Thought
Transforms are where pipelines either succeed or fail.
Understanding how data needs to be shaped is more important than the tool itself.
Once that becomes clear, everything else starts to fall into place.