DEV Community: Mohamed Hussain S

Why Localhost Worked but the Application Couldn't Connect

Mohamed Hussain S — Wed, 08 Jul 2026 14:48:10 +0000

One of the most frustrating debugging sessions I've had wasn't caused by a broken application.

It wasn't caused by a firewall.

It wasn't even caused by the network.

The application worked perfectly.

At least, that's what I thought.

From the server itself, everything looked healthy.

curl http://127.0.0.1:8080

The application responded immediately.

The API worked.

Health checks passed.

Logs looked normal.

Yet every request coming from outside the server failed.

At first, it felt like the application was refusing connections.

The real problem was much simpler.

I misunderstood what localhost actually meant.

The Initial Assumption

When an application starts successfully, it's easy to assume it's ready for everyone to access.

That was my assumption too.

The service was running.

The process was alive.

The port existed.

So naturally, I expected clients on the network to connect without any issues.

Instead, every external request timed out.

Everything Looked Healthy

The first thing I checked was whether the application was actually running.

ps aux | grep sidecar

The process was there.

Next, I checked whether it was listening.

ss -tulpn

Again, everything seemed fine.

The application was listening on port 8080.

So why couldn't another machine connect?

The Important Detail I Missed

The answer was hidden in a single column of the output: Local Address:Port.

Instead of listening on:

0.0.0.0:8080

the application was listening on:

127.0.0.1:8080

At first glance, they both look like valid addresses.

Operationally, they mean completely different things.
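That difference can be checked mechanically. Here is a minimal Python sketch that classifies the Local Address:Port value from ss output. The function name classify_listen_address is made up for this illustration, and the sketch assumes the address formats ss -tulpn typically prints (127.0.0.1:8080, 0.0.0.0:8080, *:8080, [::]:8080).

```python
import ipaddress


def classify_listen_address(local_addr: str) -> str:
    """Classify an ss-style 'address:port' string by who can reach it."""
    host, _, _port = local_addr.rpartition(":")
    host = host.strip("[]")        # IPv6 addresses are printed as [::1]:8080
    host = host.split("%", 1)[0]   # drop an interface suffix such as %lo
    if host == "*":
        return "all interfaces"
    ip = ipaddress.ip_address(host)
    if ip.is_unspecified:          # 0.0.0.0 or ::
        return "all interfaces"
    if ip.is_loopback:             # 127.0.0.0/8 or ::1
        return "loopback only"
    return "one specific address"


for addr in ("127.0.0.1:8080", "0.0.0.0:8080", "[::1]:8080", "*:8080"):
    print(addr, "->", classify_listen_address(addr))
```

Anything classified as "loopback only" will answer curl on the server itself and nothing else.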

What Localhost Really Means

127.0.0.1 is the loopback address. It belongs to the loopback interface (lo on Linux).

Traffic sent to this address never leaves the machine.

It doesn't travel through the network.

It never reaches another computer.

Only processes running on the same host can connect to it.

So this works:

curl http://127.0.0.1:8080

But from another machine:

curl http://server-ip:8080

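the connection simply fails: refused if the packets reach the host (nothing is listening on that address), or timed out if a firewall drops them first.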

Nothing was wrong with the application.

It was doing exactly what it had been told to do.

Why 0.0.0.0 Is Different

When an application binds to:

0.0.0.0

it listens on every IPv4 address the host has, across all available network interfaces.

That means the service becomes reachable through:

the server's IP address
internal network interfaces
external interfaces (assuming routing and firewall rules allow it)

The application hasn't changed.

Only the interface it listens on has.

That tiny configuration difference completely changes who can communicate with it.
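To make the difference concrete, here is a small Python sketch using only the standard socket module. The helper names listen_on and can_connect are assumptions made for this example; they are not part of the application I was debugging.

```python
import socket


def listen_on(host: str) -> socket.socket:
    """Open a TCP listener on `host`, letting the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))
    srv.listen()
    return srv


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


loopback_only = listen_on("127.0.0.1")  # reachable from this host only
every_address = listen_on("0.0.0.0")    # reachable on every IPv4 address

print(loopback_only.getsockname())      # ('127.0.0.1', <port>)
print(every_address.getsockname())      # ('0.0.0.0', <port>)
print(can_connect("127.0.0.1", loopback_only.getsockname()[1]))
```

Both listeners answer on 127.0.0.1. From another machine, only the second one would accept a connection to the server's IP address, assuming routing and firewall rules allow it.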

My Wrong Mental Model

This debugging session taught me an important lesson.

I had been thinking:

If the application starts successfully, networking must also be working.

Those are two completely different things.

An application can:

start correctly
bind successfully
respond to localhost

and still be completely unreachable from anywhere else.

Application health and network accessibility are separate problems.

Debugging It Systematically

Instead of assuming the application was broken, I started validating each layer independently.

First, verify the process exists.

ps aux | grep sidecar

Next, verify what interface it's actually listening on.

ss -tulpn

Then test locally.

curl http://127.0.0.1:8080

Finally, test remotely.

curl http://server-ip:8080

That simple progression immediately tells you where communication stops.

Instead of debugging everything at once, you're narrowing the search space one layer at a time.
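The progression can be written down as a tiny decision function. This is only a sketch of the reasoning, not a real diagnostic tool: the function name and its four inputs are hypothetical, and you would fill them in by hand from ps, ss, and the two curl commands.

```python
from typing import Optional


def where_communication_stops(process_running: bool,
                              listen_addr: Optional[str],
                              local_ok: bool,
                              remote_ok: bool) -> str:
    """Walk the four checks in order and report the first one that fails."""
    if not process_running:
        return "process: the service is not running"
    if listen_addr is None:
        return "socket: the process is not listening on any port"
    if not local_ok:
        return "application: it listens but does not answer locally"
    if not remote_ok:
        # Simplification: only the IPv4 loopback range is recognised here.
        if listen_addr.startswith("127."):
            return "bind address: listening on loopback only"
        return "network: check firewall, routing, and DNS"
    return "nothing: reachable locally and remotely"


# The situation from this post: every check passes except the remote one.
print(where_communication_stops(True, "127.0.0.1:8080", True, False))
```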

Why This Happens So Often

This isn't unique to one application.

Many frameworks default to binding only to localhost during development.

That's perfectly reasonable.

It prevents accidentally exposing services to an entire network.

The problem appears when that same configuration moves into production.

The application still starts.

Health checks still succeed.

Logs still look clean.

Only remote clients fail.

Without checking the listening interface, it's easy to spend hours investigating the wrong thing.

The Bigger Lesson

Looking back, nothing in the deployment was actually broken.

The operating system behaved exactly as expected.

The application behaved exactly as expected.

The network behaved exactly as expected.

The only thing that was wrong was my assumption.

I treated "the application is running" as proof that "the application is reachable."

Those aren't the same statement.

One describes a running process.

The other describes network accessibility.

Confusing the two can send you down the wrong debugging path entirely.

Final Thoughts

Production debugging often isn't about finding a broken component.

It's about understanding how different layers interact.

An application can be healthy while remaining inaccessible.

A network can be perfectly functional while a service listens on the wrong interface.

The more I debug distributed systems, the more I realize that successful deployments depend less on memorizing commands and more on building accurate mental models of how systems communicate.

This experience reinforced one lesson I'll carry into every future deployment:

Just because localhost works doesn't mean your application is actually reachable.

Understanding ACME: What I Learned While Debugging HTTPS Certificate Failures

Mohamed Hussain S — Tue, 07 Jul 2026 14:26:48 +0000

Deploying an application with HTTPS feels straightforward.

Point your domain to the server, configure a reverse proxy like Caddy, and let Let's Encrypt automatically issue a certificate.

That was exactly what I expected.

Instead, certificate issuance kept failing, even though everything looked correct.

At first, I assumed something was wrong with Caddy.

It wasn't.

The real lesson wasn't about Caddy at all. It was about understanding how ACME actually validates a domain.

The Problem

I had a server running behind Caddy, with DNS records pointing to the correct IP address.

The application was reachable, the firewall allowed HTTP and HTTPS traffic, and Caddy was listening on the expected ports.

Yet every attempt to obtain a certificate failed.

Initially, I started checking the obvious things:

Was DNS configured correctly?
Were ports 80 and 443 open?
Was Caddy configured correctly?
Was the firewall blocking traffic?

Everything appeared to be fine.

So why wasn't Let's Encrypt issuing a certificate?

My Wrong Assumption

At first, I thought HTTPS certificate issuance was simply a conversation between my server and Let's Encrypt.

If my server could reach Let's Encrypt, surely the certificate should be issued.

That assumption was completely wrong.

The important connection isn't just your server reaching Let's Encrypt.

It's Let's Encrypt successfully reaching your server.

That single realization changed how I approached the entire problem.

Understanding ACME

Let's Encrypt uses the ACME (Automatic Certificate Management Environment) protocol to verify that you actually control the domain you're requesting a certificate for.

The process is roughly:

Your ACME client requests a certificate.
Let's Encrypt sends a validation challenge.
Your server proves ownership by responding correctly.
Only after successful validation is the certificate issued.

In other words, certificate issuance isn't automatic trust.

It's a verification process.
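To show what "responding correctly" means, here is a Python sketch of one challenge type, HTTP-01, as defined in RFC 8555. This assumes HTTP-01 was the challenge in play, which this post does not state. The function names and the placeholder token and thumbprint are made up for illustration. Real values come from the CA and your ACME account key.

```python
def challenge_url(domain: str, token: str) -> str:
    """URL the ACME server fetches for an HTTP-01 challenge (plain HTTP, port 80)."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def key_authorization(token: str, account_thumbprint: str) -> str:
    """Body the server must return: the token joined to the account key thumbprint."""
    return f"{token}.{account_thumbprint}"


def challenge_passes(response_body: str, token: str, account_thumbprint: str) -> bool:
    """Compare the served body (ignoring trailing whitespace) with the expected value."""
    return response_body.rstrip() == key_authorization(token, account_thumbprint)


# Placeholder values for illustration only.
token = "example-token"
thumbprint = "example-thumbprint"
print(challenge_url("example.com", token))
print(challenge_passes(f"{token}.{thumbprint}\n", token, thumbprint))
```

The key point is the direction of the request: the validation server connects to your domain, so anything that stops that inbound request from completing also stops the certificate.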

Looking at the Logs

Once I stopped guessing and started reading the logs, things became much clearer.

Instead of generic "certificate failed" messages, the ACME logs explained which validation step was failing.

That immediately narrowed the investigation.

Rather than assuming TLS was broken, I started asking more useful questions:

Can Let's Encrypt reach my server?
Is the expected challenge actually being served?
Is DNS resolving to the correct address?
Is traffic arriving where I think it is?

Those questions were far more valuable than repeatedly changing configuration files.

The Bigger Lesson

One thing I learned during this debugging session is that HTTPS failures are often not TLS failures at all.

They can be:

DNS issues
Routing problems
Network interface problems
Reverse proxy configuration mistakes
Port forwarding issues

The certificate request simply exposes those underlying networking problems.

In my case, the root cause wasn't ACME itself. It was a networking issue preventing successful validation.

Understanding how ACME worked helped me stop treating the symptoms and start investigating the actual cause.

A Better Way to Debug ACME Failures

Instead of randomly changing configuration files, I found it much more effective to work through a simple checklist.

First, verify that DNS resolves to the expected IP address.

Next, confirm that ports 80 and 443 are reachable from outside the network.

Then inspect the ACME client logs to determine which validation step is failing.

Finally, validate that your server is actually serving the expected challenge response.

Each step eliminates an entire class of possible problems.
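The first step of that checklist is easy to script. Here is a minimal Python sketch using the standard library resolver. The function names are assumptions for this example, and it only reports what your local resolver returns, which may differ from what the validation servers see.

```python
import socket
from typing import Set


def resolved_addresses(hostname: str) -> Set[str]:
    """Return every IP address the local resolver gives for `hostname`."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def dns_points_at(hostname: str, expected_ip: str) -> bool:
    """True if `expected_ip` is among the addresses `hostname` resolves to."""
    try:
        return expected_ip in resolved_addresses(hostname)
    except socket.gaierror:
        return False  # the name did not resolve at all


# A literal IP resolves to itself, so this runs without a real domain.
print(dns_points_at("127.0.0.1", "127.0.0.1"))
```

In practice you would call dns_points_at with your own domain and the server's public IP.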

What Changed My Thinking

Before this experience, I viewed HTTPS certificate issuance as a feature provided by the reverse proxy.

Now I see it differently.

ACME is fundamentally a validation protocol.

TLS certificates are simply the end result of successfully proving domain ownership.

Once I understood that, the debugging process became much more systematic.

Instead of asking:

"Why won't Let's Encrypt issue my certificate?"

I started asking:

"What is the ACME server trying to verify, and why is that verification failing?"

That shift in thinking made all the difference.

Final Thoughts

Debugging production systems often comes down to replacing assumptions with understanding.

At first, I thought the problem was Caddy.

Then I suspected DNS.

Then I questioned my firewall.

In reality, none of those were the underlying issue.

The real lesson was understanding how ACME validates a deployment before a certificate is ever issued.

Once that mental model clicked, the logs became easier to interpret, the investigation became more focused, and the actual root cause was much easier to identify.

Sometimes the hardest part of debugging isn't fixing the system.

It's understanding how the system is supposed to work in the first place.

When Logs Aren't Enough: Using tcpdump to Debug Real Network Problems

Mohamed Hussain S — Mon, 22 Jun 2026 16:13:56 +0000

In my previous post, I wrote about a Linux routing issue that broke a deployment and caused repeated validation failures.

What ultimately led me to the root cause wasn't a configuration change or a log entry.

It was a packet capture.

This article isn't about the routing issue itself. It's about the tool that helped uncover it and the lesson I took away from the entire investigation.

Everything Looked Healthy

The first step was verifying the basics.

I checked whether the service was listening on the expected ports:

sudo ss -tulpn | grep -E ':(80|443) '

Everything looked normal.

Next, I verified the application itself:

curl http://localhost

The application responded immediately.

I also verified DNS resolution and confirmed the domain was pointing to the correct public IP.

At this point, nothing looked obviously wrong. The application was healthy, the reverse proxy was healthy, and the network configuration appeared healthy. Yet validation attempts continued to fail.

The Logs Weren't Helping

The logs consistently showed variations of:

authorization failed
timeout during connect
likely firewall problem

The problem was that the logs only described the symptom.

They didn't explain why it was happening.

So I started investigating the usual suspects: DNS, firewall rules, reverse proxy configuration, and listening ports.

Everything continued to look fine.

The more I investigated, the less sense the issue made.

Looking Beyond The Logs

At some point I realized I was only looking at what the software was reporting.

I wasn't looking at what the network was actually doing.

So I decided to capture the traffic directly:

sudo tcpdump -ni ens3 tcp port 80

I triggered another validation attempt and watched the packets arrive.

Almost immediately, I saw something interesting:

IP 124.x.x.x > 51.x.x.x.80: Flags [S]
IP 124.x.x.x > 51.x.x.x.80: Flags [S]
IP 124.x.x.x > 51.x.x.x.80: Flags [S]

The requests were reaching the server.

That single observation completely changed the direction of the investigation.

The First Real Clue

Up until that moment, I had been operating under the assumption that external systems couldn't reach the server.

The packet capture proved otherwise.

Traffic was arriving.

The server was receiving connection attempts.

The problem wasn't inbound connectivity.

The problem was somewhere after that.

Instead of asking:

Why can't external systems reach my server?

I started asking:

If traffic is reaching the server, why isn't the connection completing?

That shift in thinking changed the entire investigation.

What tcpdump Revealed

To understand why the packet capture was so important, it helps to understand what a normal TCP connection looks like.

A healthy connection follows a three-way handshake:

Client  -> SYN      -> Server
Client  <- SYN-ACK  <- Server
Client  -> ACK      -> Server

What I was actually seeing looked more like this:

Client  -> SYN      -> Server
Client  -> SYN      -> Server (Retry)
Client  -> SYN      -> Server (Retry)

The incoming connection attempts were reaching the server, but the connection was never being established successfully.

That immediately ruled out several possibilities.

DNS wasn't the problem because requests were arriving.

The reverse proxy wasn't the problem because it was listening correctly.

The application wasn't the problem because it responded locally.

Within a few minutes, the packet capture had eliminated entire categories of potential root causes.
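The same reading of a capture can be sketched in Python. tcpdump prints [S] for SYN, [S.] for SYN-ACK, and [.] for a plain ACK. The function names below are made up for this example, and the interpretation strings are a rough guide for a capture taken on the server, not a complete diagnosis.

```python
import re
from collections import Counter

FLAG_NAMES = {"S": "SYN", "S.": "SYN-ACK", ".": "ACK"}


def count_handshake_packets(lines):
    """Count SYN, SYN-ACK and ACK packets in tcpdump text output."""
    counts = Counter()
    for line in lines:
        match = re.search(r"Flags \[([^\]]+)\]", line)
        if match and match.group(1) in FLAG_NAMES:
            counts[FLAG_NAMES[match.group(1)]] += 1
    return counts


def diagnose(counts):
    """Rough interpretation of the counts for a server-side capture."""
    if counts["SYN"] == 0:
        return "no connection attempts seen: traffic is not arriving"
    if counts["SYN-ACK"] == 0:
        return "SYNs arrive but nothing is sent back on this interface"
    if counts["ACK"] == 0:
        return "SYN-ACKs leave but the client never completes the handshake"
    return "handshake completes"


capture = [
    "IP 124.x.x.x > 51.x.x.x.80: Flags [S]",
    "IP 124.x.x.x > 51.x.x.x.80: Flags [S]",
    "IP 124.x.x.x > 51.x.x.x.80: Flags [S]",
]
print(diagnose(count_handshake_packets(capture)))
```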

Following The Evidence

Once I knew inbound traffic was reaching the server, I shifted my attention toward the network path itself.

I started examining interfaces, routes, and outbound traffic behaviour using commands like:

ip route

and

ip route get 8.8.8.8

Those commands eventually exposed the real issue.

The server had multiple network interfaces, and outbound traffic was being routed through an unexpected path. That routing behaviour was causing validation attempts to fail even though the service itself was perfectly healthy.

The actual root cause turned out to be a Linux routing issue.

But I might never have found it if I hadn't first verified what was happening on the wire.

Why tcpdump Was The Turning Point

Before running tcpdump, I was relying entirely on logs and assumptions.

The logs suggested a firewall issue.

The packet capture showed requests reaching the server.

Those two observations pointed in completely different directions.

Without the packet capture, I probably would have continued tweaking firewall rules, reverse proxy settings, and application configuration.

Instead, the investigation moved toward routing almost immediately.

That's what made tcpdump so valuable.

It wasn't the tool that solved the problem.

It was the tool that revealed reality.

Lessons Learned

1. Logs Don't Tell The Entire Story

Logs are useful, but they're generated by software. They only describe what the application believes is happening.

Sometimes that's not enough.

2. Verify Assumptions Early

I spent time investigating DNS, firewall rules, and reverse proxy configuration because they seemed like the most likely causes.

The packet capture disproved those assumptions within minutes.

3. Packet Captures Can Change Everything

You don't need advanced networking knowledge to get value from tcpdump.

Even a simple capture can tell you whether traffic is arriving, leaving, or disappearing somewhere in between.

4. Eliminate Entire Categories Of Problems

One of the biggest advantages of packet captures is that they quickly rule things out.

Sometimes that's more valuable than finding the answer immediately.

Final Thoughts

This incident reminded me that assumptions can be surprisingly expensive.

The logs pointed toward a firewall issue. The services looked healthy. Everything seemed to suggest a particular problem.

But the moment I looked at the packets, the entire investigation changed direction.

Since then, whenever a network issue doesn't make sense, I try to reach for tcpdump much earlier.

Because logs tell you what software thinks happened.

Packets show you what actually happened.

The Hidden Linux Routing Issue That Broke My Deployment

Mohamed Hussain S — Wed, 17 Jun 2026 03:41:37 +0000

The deployment should have taken a few minutes.

The application was running, DNS was configured correctly, and the domain was already pointing to the server's public IP. Caddy was configured as a reverse proxy and was listening on ports 80 and 443. Every item on my deployment checklist appeared healthy.

Yet every Let's Encrypt validation attempt kept failing.

The error looked simple enough:

authorization failed
timeout during connect
likely firewall problem

At first, I believed it.

I checked DNS resolution, verified firewall rules, confirmed that Caddy was listening on the expected ports, and made sure the application itself was reachable. Every check came back clean.

That was the first clue that the problem might not be where the logs were pointing.

The Obvious Things

The first assumption was DNS.

I verified that the domain resolved to the correct public IP.

dig +short my-domain.com

Everything looked correct.

Next came the firewall.

sudo ufw status

Ports 80 and 443 were open. There were no unexpected deny rules, and nothing suggested inbound traffic was being blocked.

Then I checked whether Caddy was actually listening.

sudo ss -tulpn | grep -E ':(80|443) '

Again, everything looked normal.

The application itself was healthy too.

curl http://localhost:3001

returned a valid response.

At this point I had checked most of the things engineers typically check when certificate validation fails. DNS looked good, the firewall looked good, the reverse proxy was healthy, and the application was running.

Yet the validation errors continued.

The Part That Sent Me In The Wrong Direction

The error messages kept mentioning connectivity problems and possible firewall issues.

That wording influenced my thinking more than it should have.

I spent time investigating firewall rules, reverse proxy configuration, TLS settings, and domain configuration. Every new hypothesis felt reasonable, but none of them explained why local tests consistently succeeded while external validation continued to fail.

The contradiction kept bothering me.

If the service was truly unreachable, why did everything work from inside the server?

Then I Hit The Rate Limit

This was the point where I realized I was no longer troubleshooting.

I was guessing.

After several failed validation attempts, Let's Encrypt stopped accepting new authorization requests and returned a rate-limit error.

too many failed authorizations

I had burned through multiple validation attempts without actually understanding the root cause.

Looking back, this was probably the most useful lesson from the entire incident.

Repeatedly retrying a failing system is not the same thing as debugging it.

Looking At The Network Instead Of The Logs

At this point I stopped changing configurations and started gathering evidence.

The first useful clue came from tcpdump.

sudo tcpdump -ni ens3 tcp port 80

While monitoring traffic, I triggered requests from outside the server.

The packet capture immediately showed incoming connection attempts reaching the machine.

That was important.

It meant DNS was working.

It meant external traffic was reaching the public interface.

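It meant nothing upstream was silently dropping inbound requests. (tcpdump sees packets before the host's own firewall rules are applied, so the capture alone says nothing about ufw, but I had already checked that separately.)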

The requests were arriving exactly where they were supposed to.

So why was validation timing out?

The Routing Table Finally Revealed The Problem

The next step was checking the routing table.

ip route

The output looked roughly like this:

default via 10.2.0.1 dev ens4 metric 100
default via 51.x.x.x dev ens3 metric 100

The server had two network interfaces.

ens3 connected to the public network
ens4 connected to a private network

Initially, I didn't think much of it. Multi-interface servers are fairly common.

Then I started checking where outbound traffic was actually leaving.

ip route get 8.8.8.8

The result surprised me.

8.8.8.8 via 10.2.0.1 dev ens4

I tested several additional destinations.

ip route get 1.1.1.1
ip route get 8.8.4.4
ip route get <validator-ip>

Every single lookup showed outbound traffic leaving through the private interface.

That was the breakthrough.
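Checking many destinations by eye gets tedious, so here is a small Python sketch that pulls the outgoing interface out of ip route get output. The helper names are hypothetical, and the sketch assumes the plain-text format shown above, where the interface follows the word dev.

```python
import re
from typing import Optional


def egress_interface(route_get_output: str) -> Optional[str]:
    """Extract the outgoing interface from one line of `ip route get` output."""
    match = re.search(r"\bdev\s+(\S+)", route_get_output)
    return match.group(1) if match else None


def leaves_via(route_get_output: str, expected_dev: str) -> bool:
    """True if the kernel would send this traffic out of `expected_dev`."""
    return egress_interface(route_get_output) == expected_dev


line = "8.8.8.8 via 10.2.0.1 dev ens4"  # the output shown above
print(egress_interface(line))           # ens4
print(leaves_via(line, "ens3"))         # False
```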

Understanding What Was Actually Happening

A Quick Note About Asymmetric Routing

The issue I was dealing with has a name: asymmetric routing.

Traffic was entering the server through the public interface (ens3), but Linux was attempting to send replies through the private interface (ens4).

From the application's perspective everything looked healthy.

From Let's Encrypt's perspective the connection never completed successfully.

Why This Can Cause Timeouts

While investigating the issue, I came across Linux's Reverse Path Filtering (rp_filter).

In strict mode (rp_filter=1), the kernel checks each incoming packet against the routing table: if the best route back to the packet's source would leave through a different interface than the one the packet arrived on, the kernel drops the packet.

Whether the packet was being dropped by rp_filter, upstream networking, or another layer wasn't something I conclusively proved.

But understanding this interaction finally explained why inbound requests were visible while validation attempts still timed out.
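For reference, here is a short Python sketch of how the rp_filter setting can be read and interpreted. The meanings of 0, 1, and 2 and the "larger of all and per-interface" rule come from the kernel's sysctl documentation. The function names are assumptions for this example, and read_rp_filter only works on a Linux host.

```python
from pathlib import Path

RP_FILTER_MODES = {
    0: "off: no source validation",
    1: "strict: drop packets whose return route uses another interface",
    2: "loose: drop only if the source is unreachable via any interface",
}


def read_rp_filter(name: str,
                   conf_dir: Path = Path("/proc/sys/net/ipv4/conf")) -> int:
    """Read the rp_filter value for 'all' or for one interface (Linux only)."""
    return int((conf_dir / name / "rp_filter").read_text().strip())


def effective_rp_filter(all_value: int, interface_value: int) -> int:
    """The kernel applies the larger of the 'all' and per-interface values."""
    return max(all_value, interface_value)


def describe(value: int) -> str:
    return RP_FILTER_MODES.get(value, "unknown value")


# Example: 'all' is 0 but the interface itself is set to strict mode.
print(describe(effective_rp_filter(0, 1)))
```

On the server, something like effective_rp_filter(read_rp_filter("all"), read_rp_filter("ens3")) would show which mode applies to the public interface.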

Let's Encrypt validators were connecting to my public IP.

Those packets arrived through the public interface.

Let's Encrypt
      |
      v
Public Interface (ens3)
      |
      v
    Server

So far, everything was fine.

The problem appeared when Linux generated a response.

Instead of sending the response back through the same public interface, the routing table was selecting the private interface as the preferred outbound path.

Let's Encrypt
      |
      v
Public Interface (ens3)
      |
      v
    Server
      |
      v
Private Interface (ens4)

This is the asymmetric routing described above.

Traffic enters through one interface and attempts to leave through another.

From the application's perspective, everything appears healthy.

From the remote system's perspective, the connection never completes correctly.

The result is timeouts.

Exactly what Let's Encrypt was reporting.

Why This Was So Difficult To Find

The issue hid behind several misleading signals.

The application was healthy.

The reverse proxy was healthy.

DNS was correct.

Ports were open.

The firewall was configured properly.

Every layer looked healthy when viewed independently.

The actual failure existed underneath all of them.

Most deployment troubleshooting guides focus on application configuration, reverse proxies, certificates, and firewall rules. Very few immediately point you toward route selection.

Especially when the server appears to be functioning normally.

The Fix

Once the routing issue was identified, the fix itself was straightforward.

The server needed to use the public interface for internet-bound traffic instead of attempting to route those responses through the private network.

After correcting the routing configuration, I verified the result.

ip route get 8.8.8.8

The output now showed traffic leaving through the public interface.

Exactly what I wanted.

I restarted Caddy and triggered another validation attempt.

This time the validators connected successfully, the challenge completed, and the certificate was issued within seconds.

Hours of troubleshooting ultimately came down to a routing decision that Linux was making automatically.

Lessons Learned

A few takeaways from this incident stood out.

Error messages often describe symptoms, not causes

The logs repeatedly suggested firewall issues.

The firewall was never the problem.

Stop retrying and start investigating

I hit Let's Encrypt's authorization limits because I kept retrying before understanding the failure.

That was entirely avoidable.

Packet captures reveal reality

When logs become confusing, tcpdump often provides a much clearer picture of what is actually happening on the network.

Multi-interface servers deserve extra scrutiny

If a server has both public and private interfaces, route selection should be one of the first things you verify.

Two commands can save hours

If you're debugging unexplained connectivity issues, run these early:

ip route

ip route get 8.8.8.8

Those two commands exposed the real problem faster than everything else I tried.

Final Thoughts

I started this investigation convinced I had a TLS problem.

Then I thought it was DNS.

Then I suspected the firewall.

Then I questioned my reverse proxy configuration.

In the end, none of those were responsible.

The real issue was a routing decision happening at the operating system level long before the request ever reached my application.

And like most memorable debugging sessions, the hardest part wasn't fixing the problem.

It was figuring out where the problem actually lived.

ClickHouse Duplicates: Clean Your Results vs. Clean Your Storage

Mohamed Hussain S — Sat, 13 Jun 2026 12:36:42 +0000

The word FINAL appears in multiple places in ClickHouse.

Two of the most commonly confused examples are:

SELECT *
FROM events FINAL;

and:

OPTIMIZE TABLE events FINAL;

At first glance, they sound like they should do roughly the same thing.

After all, both contain the word FINAL.

But they actually solve two completely different problems.

One affects query results.

The other affects how data is physically stored.

Understanding this distinction can save a lot of confusion when working with MergeTree tables.

Why This Confusion Happens

Most people encounter FINAL while working with engines like:

ReplacingMergeTree
SummingMergeTree
AggregatingMergeTree

Sooner or later they notice something like:

SELECT *
FROM users;

returns duplicate versions of rows.

Then they discover:

SELECT *
FROM users FINAL;

and suddenly the results look correct.

Naturally, many people assume:

FINAL merges the table.

But that's not exactly what is happening.

What SELECT FINAL Actually Does

When you run:

SELECT *
FROM users FINAL;

ClickHouse applies merge logic during query execution.

Think of it as:

"Show me what the table would look like if all relevant merges had already happened."

The important part:

It only affects the query result.

After the query finishes:

parts remain unchanged
storage remains unchanged
nothing is rewritten on disk

The merge logic happens temporarily while the query is running.

Once the query completes, the table is exactly as it was before.

What OPTIMIZE FINAL Actually Does

Now let's look at:

OPTIMIZE TABLE users FINAL;

This is a completely different operation.

Instead of modifying query results, ClickHouse physically merges parts on disk.

The operation:

rewrites data
merges eligible parts
removes obsolete versions
creates larger merged parts

Unlike SELECT FINAL, the effects remain after the command completes.

This is a storage operation, not a query operation.

The Simplest Way to Remember It

Whenever I think about these commands, I use a very simple mental model:

Command          Purpose
SELECT FINAL     Clean the result
OPTIMIZE FINAL   Clean the storage

That's really the core difference.

One affects what you see.

The other affects how the data is stored.
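A toy model in Python makes the distinction easy to see. This is only an illustration of the idea, not how ClickHouse is implemented: parts are plain lists, rows are (user_id, version, name) tuples, and the function names are made up.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int, str]   # (user_id, version, name)
Part = List[Row]


def latest_rows(parts: List[Part]) -> List[Row]:
    """Keep the highest-version row per user_id (ReplacingMergeTree-style logic)."""
    newest: Dict[int, Row] = {}
    for part in parts:
        for row in part:
            key, version = row[0], row[1]
            if key not in newest or version >= newest[key][1]:
                newest[key] = row
    return sorted(newest.values())


def select_final(parts: List[Part]) -> List[Row]:
    """Read-time merge: returns clean rows, leaves the stored parts untouched."""
    return latest_rows(parts)


def optimize_final(parts: List[Part]) -> List[Part]:
    """Storage merge: replaces the stored parts with one deduplicated part."""
    return [latest_rows(parts)]


table = [[(1, 1, "alice")], [(1, 2, "alice_v2"), (2, 1, "bob")]]

print(select_final(table))   # [(1, 2, 'alice_v2'), (2, 1, 'bob')]
print(len(table))            # 2: the parts on "disk" did not change

table = optimize_final(table)
print(len(table))            # 1: storage itself was rewritten
```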

Does OPTIMIZE FINAL Create One Giant Part?

This is another common misconception.

Suppose your table is partitioned like this:

PARTITION BY toYYYYMM(event_date)

and contains:

2025-01
2025-02
2025-03

Many people assume:

OPTIMIZE TABLE events FINAL;

will merge the entire table into one huge part.

It won't.

Merge operations do not cross partition boundaries.

What you are more likely to end up with is:

2025-01 -> one large part
2025-02 -> one large part
2025-03 -> one large part

Each partition is optimized independently.

This distinction becomes important when working with large datasets.
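Here is the same idea as a small Python sketch. It is a simplified model with made-up names: each part is a (partition, rows) pair, and merging only ever combines parts that share a partition.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Part = Tuple[str, List[int]]  # (partition id, rows in the part)


def merge_within_partitions(parts: List[Part]) -> Dict[str, List[int]]:
    """Merge parts that share a partition id; never combine across partitions."""
    merged: Dict[str, List[int]] = defaultdict(list)
    for partition, rows in parts:
        merged[partition].extend(rows)
    return {partition: sorted(rows) for partition, rows in sorted(merged.items())}


parts = [
    ("2025-01", [3, 1]), ("2025-01", [2]),
    ("2025-02", [5]), ("2025-02", [4]),
    ("2025-03", [6]),
]
result = merge_within_partitions(parts)
print(len(result))        # 3 parts, one per partition
print(result["2025-01"])  # [1, 2, 3]
```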

Should You Use SELECT FINAL Everywhere?

Not really.

FINAL is incredibly useful when correctness matters.

For example:

SELECT *
FROM users FINAL;

may be exactly what you need when querying a ReplacingMergeTree table and you want the latest state immediately.

But it still introduces additional work during query execution.

So while modern ClickHouse versions have significantly improved FINAL performance, it shouldn't automatically become your default query pattern.

Use it when you need the merge logic.

Not because it's available.

Should You Run OPTIMIZE FINAL Regularly?

Also no.

This is another mistake people sometimes make.

OPTIMIZE FINAL is a heavy operation.

It forces merges that ClickHouse would normally schedule on its own.

In many cases, background merges already do a good job of maintaining healthy storage.

Running:

OPTIMIZE TABLE events FINAL;

every time you insert data is usually unnecessary.

Think of it as an operational tool.

Not a routine query optimization technique.

When Would You Use Each?

SELECT FINAL

Useful when:

querying ReplacingMergeTree tables
validating latest state
merge results are needed immediately

OPTIMIZE FINAL

Useful when:

forcing merges intentionally
maintenance operations
testing storage behavior
special operational situations

Both have valid use cases.

They simply solve different problems.

Final Thoughts

The word FINAL appears in both commands, which makes them easy to confuse.

But once you understand the difference, many ClickHouse behaviors start making a lot more sense.

SELECT FINAL does not physically merge your table.

It only applies merge logic while reading data.

OPTIMIZE FINAL actually rewrites and merges parts on disk.

Or put another way:

SELECT FINAL cleans what you see.

OPTIMIZE FINAL cleans how the data is stored.

And that's a distinction every ClickHouse engineer should understand.

Why ClickHouse Loves Append-Heavy Workloads

Mohamed Hussain S — Wed, 27 May 2026 09:24:45 +0000

One thing that makes ClickHouse feel very different from traditional OLTP databases is how much it prefers append-heavy workloads.

And once you understand why, many ClickHouse behaviors suddenly start making sense:

immutable parts
background merges
ingestion batching
merge pressure
even why FINAL exists

At first, this can feel strange if you are coming from databases like PostgreSQL or MySQL where updates and row modifications are extremely normal.

But analytical databases think very differently internally.

Traditional OLTP Systems Think in Terms of Updates

In most transactional databases, modifying rows constantly is completely normal.

For example:

UPDATE inventory
SET stock = stock - 1
WHERE product_id = 101;

or:

UPDATE users
SET last_login = now()
WHERE user_id = 42;

These systems are heavily optimized for:

transactional correctness
row-level updates
operational consistency
frequent modifications

Because that is exactly what OLTP workloads need.

And honestly, PostgreSQL is incredibly good at this.

ClickHouse Thinks Very Differently

ClickHouse is not primarily designed around transactional row updates.

It is designed around analytical workloads:

metrics
logs
observability
event streams
historical analytics
large aggregations

And these workloads are naturally append-heavy.

For example:

INSERT INTO events VALUES (...);

or:

INSERT INTO metrics VALUES (...);

New events continuously arrive.

Old data is rarely modified.

That changes the entire storage philosophy underneath.

ClickHouse Stores Data as Immutable Parts

This is one of the most important concepts to understand.

In MergeTree engines, ClickHouse stores inserts as immutable parts on disk.

Meaning:

inserts create new parts instead of constantly rewriting existing rows directly.

And honestly, this is one of the biggest reasons ClickHouse scales analytical ingestion so well.

Because append-heavy writes are operationally much cheaper than constantly rewriting data in place.
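
To make the idea concrete, here is a minimal Python sketch of a table made of immutable parts. It is a toy model under my own assumptions, not how ClickHouse is implemented: the Part and ToyTable names and the (key, payload) row shape are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[int, str]  # (sort key, payload); hypothetical row shape


@dataclass(frozen=True)
class Part:
    """An immutable, sorted chunk of rows, loosely modeled on a MergeTree part."""
    rows: Tuple[Row, ...]


class ToyTable:
    def __init__(self) -> None:
        self.parts: List[Part] = []

    def insert(self, rows: List[Row]) -> None:
        # An insert never touches existing parts; it only adds a new sorted one.
        self.parts.append(Part(tuple(sorted(rows))))

    def merge_all(self) -> None:
        # A merge writes one larger sorted part and drops the inputs.
        merged = sorted(row for part in self.parts for row in part.rows)
        self.parts = [Part(tuple(merged))]


table = ToyTable()
table.insert([(3, "c"), (1, "a")])
table.insert([(2, "b")])
parts_before_merge = len(table.parts)
table.merge_all()
parts_after_merge = len(table.parts)
```

Two inserts leave two parts behind, and only the later merge brings that back down to one. Nothing is ever edited in place.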

Why Immutable Storage Works So Well

Immutable storage gives ClickHouse several advantages:

efficient sequential writes
better compression
reduced locking pressure
faster analytical scans
simpler background merging

Instead of constantly modifying rows directly, ClickHouse can:

append data quickly
merge parts later
optimize storage asynchronously

This model fits analytical systems extremely well.

Especially when ingesting:

logs
metrics
telemetry
clickstream data
observability events

at very large scale.

Columnar Storage Makes This Even More Powerful

Another reason append-heavy storage works so well in ClickHouse is that data is stored by column instead of by row.

This matters a lot for analytical workloads.

Because queries often need only a few columns from massive datasets.

For example:

SELECT avg(response_time_ms)
FROM metrics;

does not need to read:

user_agent
request_headers
payload columns

at all.

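And because each column is stored on its own inside immutable, sorted parts, similar values end up next to each other, which lets ClickHouse compress them efficiently with general-purpose codecs such as LZ4 or ZSTD and specialized ones such as Delta or Gorilla.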

This is one of the reasons analytical scans in ClickHouse can remain surprisingly fast even at very large scale.
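
A small Python sketch of the difference, assuming a made-up metrics dataset with three columns. It only illustrates the layout idea. Real ClickHouse column files, compression, and vectorized execution are far more involved.

```python
# Row-oriented layout: every row carries every column.
rows = [
    {"response_time_ms": 120, "user_agent": "curl/8.0", "payload": "x" * 1000},
    {"response_time_ms": 80, "user_agent": "Mozilla/5.0", "payload": "y" * 1000},
    {"response_time_ms": 100, "user_agent": "curl/8.0", "payload": "z" * 1000},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def avg_response_time(cols: dict) -> float:
    # Only the one column the query needs is touched; the wide
    # payload column is never read.
    values = cols["response_time_ms"]
    return sum(values) / len(values)


average = avg_response_time(columns)
```

With the row layout, computing the same average would mean walking past every payload value along the way.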

This Is Why Background Merges Exist

One thing that confused me initially was:

why ClickHouse relies so heavily on merges.

But once you understand immutable parts, merges make perfect sense.

Because inserts continuously create smaller parts.

And background merges later:

combine them
reduce fragmentation
improve compression
optimize query performance

This is also why:

tiny inserts become dangerous
too many parts create pressure
unhealthy fragmentation slows systems down

Many ClickHouse operational behaviors trace back to this append-heavy storage philosophy underneath.

Why Updates Feel Different in ClickHouse

This does not mean ClickHouse cannot handle updates.

It absolutely can.

But updates behave differently because the storage engine is optimized differently.

In many cases, updates are internally handled through:

mutations
part rewrites
asynchronous merge operations

instead of lightweight in-place row modifications like traditional OLTP systems.

And this is why large-scale frequent updates can feel operationally heavier in ClickHouse.

Because the system is optimizing for analytical scale first.

Not transactional mutation-heavy workloads.

Many Systems Handle Updates as New Inserts Instead

One thing I found interesting is that many ClickHouse workloads avoid frequent in-place updates entirely.

Instead, systems often:

insert newer versions of rows
append updated events
track timestamps or versions

and later use things like:

ReplacingMergeTree
argMax()
merge logic

to retrieve the latest state.

For example:

SELECT
    user_id,
    argMax(status, updated_at)
FROM user_status
GROUP BY user_id;

This fits naturally into ClickHouse’s append-heavy design philosophy.

Instead of constantly rewriting rows directly, systems continuously append newer versions while merges and analytical queries reconcile state later.
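
For readers less familiar with argMax, this Python sketch does the same latest-state lookup over an append-only list. The rows and the latest_status helper are hypothetical, and the real aggregation runs inside ClickHouse, not in application code.

```python
from typing import Dict, List, Tuple

# (user_id, status, updated_at) rows appended over time; values are made up.
user_status: List[Tuple[int, str, int]] = [
    (1, "pending", 100),
    (2, "active", 105),
    (1, "active", 110),
    (1, "suspended", 90),  # late-arriving older version
]


def latest_status(rows: List[Tuple[int, str, int]]) -> Dict[int, str]:
    """Per user, keep the status that has the largest updated_at."""
    best: Dict[int, Tuple[int, str]] = {}
    for user_id, status, updated_at in rows:
        if user_id not in best or updated_at > best[user_id][0]:
            best[user_id] = (updated_at, status)
    return {user_id: status for user_id, (_, status) in best.items()}


result = latest_status(user_status)
```

The late-arriving older row for user 1 is ignored because its updated_at is smaller, which is why ordering by a timestamp or version column matters more than insert order.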

Why Event Streams Fit ClickHouse So Naturally

This is honestly where ClickHouse feels extremely powerful.

Modern systems continuously generate:

logs
metrics
traces
user events
telemetry streams

And these workloads naturally behave like:

append-heavy event streams.

New records continuously arrive.

Historical records mostly remain unchanged.

That is exactly the kind of workload ClickHouse loves.

Which is why architectures like:

Applications
      ↓
Kafka / Streaming
      ↓
ClickHouse

feel so natural operationally.

The storage model aligns perfectly with the workload behavior.

This Also Explains Why FINAL Exists

A lot of ClickHouse behavior becomes easier to understand once you think in terms of append-heavy storage.

For example:

A ReplacingMergeTree table may temporarily contain multiple versions of the same row until background merges eventually reconcile them.

That is why queries sometimes use:

SELECT * FROM events FINAL;

to apply merge logic during query execution.

Again:

immutable parts
append-heavy ingestion
asynchronous merging

all connect back together underneath.

The Important Lesson

One thing I’ve started realizing with ClickHouse is that many operational behaviors make much more sense once you stop thinking in terms of:

"traditional transactional databases."

ClickHouse is optimizing for:

analytical ingestion
historical querying
append-heavy workloads
large-scale scans

And once you understand that design philosophy, many of its storage behaviors stop feeling strange.

Final Thought

ClickHouse is not trying to behave like a traditional OLTP database.

It is optimizing for analytical scale.

And append-heavy design is one of the biggest reasons it performs so well for:

observability
metrics
event streams
analytical systems
real-time analytics workloads

Why Too Many Parts Hurt ClickHouse Performance

Mohamed Hussain S — Mon, 25 May 2026 14:00:25 +0000

A lot of people initially think ClickHouse performance problems come from:

large queries
bad joins
massive datasets
missing indexes

And honestly, those things can matter.

But one of the most common operational problems in ClickHouse often starts much earlier:

too many tiny parts.

This is one of those issues that usually stays invisible at first.

Then suddenly:

merges fall behind
queries slow down
memory usage increases
inserts become unstable

And the cluster starts behaving strangely.

Every Insert Creates Parts

This is the first thing that’s important to understand.

In MergeTree-based engines, ClickHouse stores data as immutable parts.

Something as simple as:

INSERT INTO events VALUES (...);

creates new parts on disk.

And this is completely normal.

ClickHouse is designed around this storage model.

So:

parts themselves are not the problem.

The real issue starts when parts begin accumulating faster than merges can consolidate them.

Why Tiny Inserts Become Dangerous

At smaller scale, tiny inserts may seem harmless.

For example:

inserting row-by-row
extremely frequent micro-batches
tiny streaming flush intervals

Initially:

everything still works.

But over time, the number of parts starts growing aggressively.

Now ClickHouse has to manage:

more metadata
more merges
more scheduling
more file operations

This creates operational overhead.

Meaning:

the system spends a growing share of its resources just managing fragmentation.

Why Merges Matter So Much

ClickHouse relies heavily on background merges.

These merges:

combine smaller parts
reduce fragmentation
improve compression
optimize query performance

Under healthy ingestion patterns, merges naturally keep the system stable over time.

That is the ideal state.

But problems start when:

parts created per second
        >
parts merged per second

Now fragmented parts begin accumulating faster than ClickHouse can compact them.

And this is usually where instability slowly starts building.
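
The inequality above can be played out with a tiny simulation. This is a deliberately simplified model with made-up rates: it treats merging as a constant reduction in part count per second, which real merge scheduling is not.

```python
def simulate_part_count(
    parts_created_per_sec: float,
    parts_merged_per_sec: float,
    seconds: int,
    initial_parts: float = 0.0,
) -> float:
    """Toy model: active part count after `seconds`; never below zero."""
    parts = initial_parts
    for _ in range(seconds):
        parts = max(0.0, parts + parts_created_per_sec - parts_merged_per_sec)
    return parts


# Hypothetical rates over ten minutes.
healthy = simulate_part_count(parts_created_per_sec=2, parts_merged_per_sec=3, seconds=600)
unhealthy = simulate_part_count(parts_created_per_sec=5, parts_merged_per_sec=3, seconds=600)
```

In the healthy case the backlog stays at zero. In the unhealthy case a surplus of only two parts per second becomes 1200 unmerged parts in ten minutes, and nothing looks wrong during the first few seconds.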

The Dangerous Part Is That It Builds Slowly

This is what makes the issue tricky operationally.

You usually do not notice the problem immediately.

The cluster may look perfectly healthy initially.

Then gradually:

insert latency increases
merges lag behind
CPU usage becomes unstable
queries become heavier
replication slows down

And eventually ClickHouse may start throwing errors like:

Too many parts

At that point, the merge system is already under serious pressure.

Queries Also Become More Expensive

A lot of people think parts only affect inserts.

But queries suffer too.

Because queries now need to:

open more parts
scan more metadata
coordinate more files

Even when the actual dataset itself is not massive.

So sometimes:

performance degradation comes more from fragmentation than raw data volume.

That is a very important operational insight.

FINAL Does Not Really Solve This

One thing that’s important to understand:

FINAL is not really a solution for too many parts.

For example:

SELECT *
FROM events FINAL;

FINAL applies merge logic during query execution.

But the fragmented parts still physically exist underneath.

So if the system already has excessive fragmentation:

queries still scan many parts
merge pressure still exists
query execution can become heavier

Which means:

FINAL can actually become more expensive when fragmentation becomes unhealthy.

The real fix is usually improving ingestion and merge behavior itself.

Over-Partitioning Can Quietly Make This Worse

Another thing that often accelerates part explosion is overly granular partitioning.

For example:

PARTITION BY toStartOfHour(timestamp)

instead of something broader like:

PARTITION BY toYYYYMM(timestamp)

Now even small inserts may create parts across many partitions simultaneously.

Which means:

a single insert can end up creating multiple fragmented parts underneath.

And over time, merge pressure increases much faster than expected.
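
Here is a rough way to see the effect, assuming an insert produces at least one part for every distinct partition its rows fall into. The helper names and the sample batch are hypothetical.

```python
from datetime import datetime
from typing import Callable, Hashable, Iterable


def parts_created_by_insert(
    timestamps: Iterable[datetime],
    partition_key: Callable[[datetime], Hashable],
) -> int:
    """Count the distinct partitions, and so the minimum parts, one insert touches."""
    return len({partition_key(ts) for ts in timestamps})


def by_hour(ts: datetime) -> tuple:
    return (ts.year, ts.month, ts.day, ts.hour)


def by_month(ts: datetime) -> tuple:
    return (ts.year, ts.month)


# One hypothetical insert whose rows span 6 different hours of the same day.
batch = [datetime(2026, 5, 25, hour, 0) for hour in range(6)]

hourly_parts = parts_created_by_insert(batch, by_hour)
monthly_parts = parts_created_by_insert(batch, by_month)
```

The same six rows become six parts under an hourly key and a single part under a monthly key. Parts in different partitions are also never merged with each other, so the hourly layout stays fragmented.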

ClickHouse Also Has Ways to Help

Modern ClickHouse versions also support asynchronous inserts (the async_insert setting) to help reduce excessive tiny-part creation.

Instead of immediately flushing every small insert into separate parts, ClickHouse can buffer inserts internally before writing larger parts to disk.

This helps reduce fragmentation and merge pressure in workloads that naturally produce smaller inserts.

But async inserts are not a replacement for healthy ingestion patterns themselves.

Stable batching still matters a lot.
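
The buffering idea can be sketched in a few lines of Python. This is only an illustration of the principle: the InsertBuffer class and its max_rows threshold are hypothetical, and the real feature decides when to flush using its own size and time settings on the server side.

```python
from typing import List


class InsertBuffer:
    """Collects small inserts and flushes them as one larger part."""

    def __init__(self, max_rows: int) -> None:
        self.max_rows = max_rows
        self.pending: List[dict] = []
        self.parts_written: List[List[dict]] = []

    def insert(self, rows: List[dict]) -> None:
        self.pending.extend(rows)
        if len(self.pending) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        # One flush corresponds to one part, however many inserts fed it.
        if self.pending:
            self.parts_written.append(self.pending)
            self.pending = []


buffer = InsertBuffer(max_rows=100)
for i in range(250):
    buffer.insert([{"event_id": i}])  # 250 single-row inserts
buffer.flush()  # stand-in for a timeout-based flush of the remainder

part_sizes = [len(part) for part in buffer.parts_written]
```

Without buffering, those 250 single-row inserts would have produced 250 tiny parts. With it, they produce three.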

Why Batch Size Matters So Much

ClickHouse generally performs much better with:

larger batches
fewer inserts
healthier merge behavior

Because fewer parts means:

fewer merges
lower metadata overhead
better compression
more efficient scans

This is one of the reasons ClickHouse ingestion patterns often look very different from traditional OLTP systems.

Too Many Parts Also Affects Startup and Recovery

Another thing people often discover late:

Large numbers of parts also affect:

startup time
replication recovery
metadata loading
server restarts

Because ClickHouse now has to:

scan part metadata
validate parts
rebuild internal state

before the server becomes fully operational again.

So the issue is not just query performance.

It becomes an overall operational stability problem.

The Important Lesson

One thing I’ve noticed with ClickHouse is that many performance problems are actually merge-management problems underneath.

And too many parts is one of the clearest examples of that.

Because the issue usually is not:

“ClickHouse cannot handle large data.”

The issue is more often:

fragmentation and merge pressure slowly became unhealthy.

That is a very different operational problem.

Final Thought

ClickHouse is extremely good at handling massive analytical workloads.

But it performs best when the storage engine is allowed to merge parts efficiently.

And sometimes the biggest performance problem is not the query itself.

It is the thousands of tiny fragmented parts quietly building underneath the system over time.

Why Real-Time Analytics Eventually Changes Your Database Architecture

Mohamed Hussain S — Tue, 19 May 2026 16:36:19 +0000

A lot of systems begin with a single database.

Usually PostgreSQL.

And honestly, in the beginning, that works perfectly fine.

The application stores:

users
payments
inventory
authentication
operational state

Dashboards query the same database.

Analytics queries also run directly on PostgreSQL.

Everything feels simple.

The Problem Usually Starts Slowly

At first, analytical queries are small.

Maybe:

daily reports
lightweight aggregations
small dashboards

Nothing too serious.

But over time, systems start generating:

more events
more metrics
more logs
more historical records
more observability data

And analytical workloads start behaving very differently from transactional workloads.

For example:

SELECT
    service_name,
    avg(response_time_ms)
FROM metrics
WHERE timestamp >= now() - INTERVAL 30 DAY
GROUP BY service_name;

This is a very different kind of workload from:

UPDATE inventory
SET stock = stock - 1
WHERE product_id = 101;

One is trying to preserve operational correctness.

The other is trying to analyze huge amounts of historical data.

And eventually those workloads start colliding.

PostgreSQL Slowly Becomes Responsible for Everything

This is where things usually start getting interesting.

A lot of systems unintentionally turn PostgreSQL into:

the transactional database
the reporting database
the analytics database
the observability database

all at the same time.

And honestly, modern PostgreSQL is capable enough that this can work surprisingly well for a while.

Until:

dashboards become heavier
retention windows grow
analytical scans become larger
observability traffic increases
aggregations become expensive

Now suddenly the same database handling:

payments
authentication
users
inventory

is also handling large analytical workloads.

And this is usually where architectural pressure starts building.

The Real Problem Is Workload Isolation

This is honestly the biggest lesson.

The issue is usually not:

“PostgreSQL is slow.”

The issue is:

transactional workloads and analytical workloads optimize for completely different things.

Transactional systems care heavily about:

consistency
operational latency
updates
row-level modifications
business correctness

Analytical systems care heavily about:

large scans
aggregations
compression
historical analytics
query throughput

Those are fundamentally different workload patterns.

And eventually trying to optimize one database perfectly for both becomes painful.

Why Observability Changes Everything So Quickly

One thing I find interesting is how fast observability workloads expose architectural limitations.

Because observability systems continuously generate:

logs
metrics
traces
events

And these workloads grow aggressively over time.

Now imagine running:

large aggregations
historical scans
high-cardinality queries
real-time dashboards

on the same database handling:

authentication
inventory
operational business logic
transactional traffic

At smaller scale this may still work.

At larger scale:

query contention increases
operational latency becomes sensitive to analytical load
workload isolation becomes harder

And eventually systems start evolving toward separation.

This Is Usually When Analytical Databases Start Appearing

At some point, many systems evolve toward something like this:

Application
    ↓
PostgreSQL
    ↓
CDC / Kafka / Airbyte
    ↓
ClickHouse / OLAP DB
    ↓
Analytics / Dashboards / Observability

This pattern has become extremely common in modern analytical systems.

And honestly, the reason is pretty simple:

PostgreSQL remains responsible for operational correctness.

ClickHouse becomes responsible for analytical scale.

Each system handles the workload it was actually designed for.

Not All Analytical Data Needs PostgreSQL First

One important thing though:

Not all analytical data even originates from PostgreSQL.

A lot of observability workloads:

logs
metrics
traces
telemetry events

often flow directly into ClickHouse or another OLAP database through streaming pipelines.

Something like:

Applications / Services
        ↓
Kafka / Streaming Pipelines
        ↓
ClickHouse / OLAP DB

In many systems, PostgreSQL stores the business data while ClickHouse directly handles logs, metrics, events, and analytical workloads.

And honestly, this makes a lot of sense.

Because analytical systems are usually optimized for:

append-heavy ingestion
historical querying
event-style workloads

not transactional business operations.

Why Not Just Use ClickHouse for Everything?

This is another common misunderstanding.

ClickHouse is incredible for analytical workloads.

But transactional systems still require things like:

frequent updates
operational consistency
transactional guarantees
row-level modifications
business-critical correctness

Those are not the primary design goals of analytical databases.

You generally do not want your:

authentication system
payment workflows
inventory updates
operational application state

depending entirely on analytical database behavior.

Why CDC Pipelines Become So Important

One reason this architecture became so practical is CDC (Change Data Capture).

Instead of repeatedly exporting data manually, systems continuously stream changes from PostgreSQL into analytical systems using:

Kafka
Debezium
Airbyte
streaming pipelines

That means:

operational systems continue working normally
analytical systems receive near real-time data
workloads stay separated cleanly

And analytical queries no longer compete directly against transactional traffic.
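
As a rough illustration of what lands on the analytical side, this Python sketch turns a stream of change events into append-only, versioned rows. The event shape, the field names, and the to_append_only_rows helper are all hypothetical. Real CDC tools such as Debezium emit their own formats.

```python
from typing import Dict, List

# Hypothetical change events, in the rough shape a CDC tool might emit.
changes: List[Dict] = [
    {"op": "insert", "id": 1, "stock": 10, "lsn": 100},
    {"op": "update", "id": 1, "stock": 9, "lsn": 101},
    {"op": "delete", "id": 1, "stock": None, "lsn": 102},
]


def to_append_only_rows(events: List[Dict]) -> List[Dict]:
    """Turn every change into a new row; nothing is updated in place."""
    rows = []
    for event in events:
        rows.append(
            {
                "id": event["id"],
                "stock": event["stock"],
                "version": event["lsn"],  # later changes win
                "is_deleted": 1 if event["op"] == "delete" else 0,
            }
        )
    return rows


analytical_rows = to_append_only_rows(changes)
```

An update and a delete in PostgreSQL both become plain inserts downstream, which is the write pattern analytical databases handle best.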

Don’t Rush Into Multi-Database Architectures

One important thing though:

Most systems do not need Kafka + ClickHouse pipelines on Day 1.

Honestly, many applications can scale surprisingly far with PostgreSQL alone using:

proper indexing
query optimization
read replicas
partitioning
extensions like Citus

The goal is not to introduce more infrastructure as early as possible.

The real signal usually appears when analytical workloads start affecting operational user experience.

That is often when workload separation starts becoming worth the additional architectural complexity.

Because systems like:

CDC pipelines
Kafka
analytical databases

also introduce operational overhead of their own.

And good architecture is usually about introducing complexity only when the workload actually demands it.

The Bigger Engineering Lesson

Most systems do not start with multiple databases.

They evolve into them as workloads grow.

Transactional workloads and analytical workloads behave very differently at scale.

And eventually systems start separating:

operational correctness
analytical querying
observability workloads
historical analytics

into infrastructure optimized for each workload.

Final Thought

A lot of modern systems do not start with multiple databases.

They evolve into them.

Because transactional workloads and analytical workloads eventually want very different things from the same infrastructure.

And real-time analytics is often the thing that forces that architectural separation to happen.

FINAL in ClickHouse Isn’t as Expensive as It Used to Be

Mohamed Hussain S — Thu, 14 May 2026 16:04:00 +0000

For a long time, the advice around FINAL in ClickHouse was pretty straightforward:

Avoid it whenever possible.

And honestly, that advice existed for good reasons.

Older versions of ClickHouse could make FINAL extremely expensive depending on:

table size
partitioning
number of parts
merge state
query patterns

So people started treating FINAL almost like a red flag.

But modern ClickHouse has changed a lot.

And I think the conversation around FINAL deserves a bit more nuance now.

Why FINAL Existed in the First Place

To understand why FINAL was historically considered expensive, you first need to understand what it actually does.

In engines like:

ReplacingMergeTree
CollapsingMergeTree
VersionedCollapsingMergeTree

ClickHouse does not immediately rewrite rows in place.

Instead:

inserts create new parts
background merges reconcile rows later
deduplication happens asynchronously

That means queries can temporarily see:

duplicate versions
old versions
intermediate states

Example:

SELECT *
FROM users
FINAL;

FINAL forces ClickHouse to apply merge logic during query execution itself.

That means the query may:

read more data
perform additional deduplication work
consume more CPU and memory

This is why older advice strongly discouraged using it everywhere.
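
A toy version of that query-time merge, in Python. It assumes a ReplacingMergeTree-style rule where the highest version of each id wins, and it models parts as small sorted lists. The row shape and the select_final function are hypothetical stand-ins, not ClickHouse internals.

```python
import heapq
from typing import Iterator, List, Tuple

Row = Tuple[int, int, str]  # (id, version, status); hypothetical row shape

# Two unmerged parts, each sorted by (id, version).
part_1: List[Row] = [(1, 1, "pending"), (2, 1, "active")]
part_2: List[Row] = [(1, 2, "active"), (3, 1, "pending")]


def select_final(parts: List[List[Row]]) -> Iterator[Row]:
    """Merge sorted parts at read time, keeping the last version of each id."""
    previous = None
    for row in heapq.merge(*parts):
        if previous is not None and row[0] != previous[0]:
            yield previous
        previous = row
    if previous is not None:
        yield previous


without_final = part_1 + part_2
with_final = list(select_final([part_1, part_2]))
```

The plain read returns four rows, including both versions of id 1. The merged read returns three, but it has to walk all the parts together to get there, and that is where the extra CPU and memory go.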

The Old FINAL Problem

Historically, FINAL could become painful on large datasets.

Especially when:

partitions were large
too many parts existed
merges lagged behind
queries scanned massive ranges

People would add:

FINAL

to "fix" duplicate rows without understanding why duplicates existed in the first place.

The result was often:

slower queries
higher memory usage
unnecessary query overhead

So the community advice became:

Design your schema properly and avoid FINAL whenever possible.

And honestly?

That advice still matters.

But the implementation of FINAL itself has improved significantly over time.

Modern ClickHouse Has Improved FINAL a Lot

Recent ClickHouse versions introduced multiple improvements around FINAL.

Things like:

parallel execution (controlled by max_final_threads)
partition-aware optimizations (such as do_not_merge_across_partitions_select_final)
improved memory behavior
smarter merge execution
reduced unnecessary reads

Which means:

FINAL is no longer the monster it used to be.

And this is important because newer ClickHouse guidance has also become more practical about using it when necessary.

Even in some recent discussions and office hours from the ClickHouse ecosystem, using FINAL for latest-state queries is no longer treated as automatically wrong.

That would have sounded controversial a few years ago.

FINAL vs argMax Isn’t Always a Simple Comparison

For a long time, many ClickHouse users avoided FINAL by using patterns like:

SELECT
    id,
    argMax(status, version)
FROM users
GROUP BY id;

And honestly, for older ClickHouse versions and large workloads, that often made sense.

But modern ClickHouse has improved FINAL enough that the tradeoff is no longer as one-sided as it used to be.

In some latest-state query scenarios, using FINAL can now be:

simpler
easier to maintain
and completely reasonable

depending on:

table size
partitioning
query filters
merge behavior

The important part is understanding the workload instead of blindly following older rules.

So… Is FINAL Safe to Use Now?

This is where nuance matters.

The answer is not:

"FINAL bad"

and also not:

"FINAL free now"

The real answer is:

FINAL is much more practical in modern ClickHouse, but workload design still matters.

That distinction is important.

Where FINAL Makes Sense

There are legitimate cases where FINAL is completely reasonable now.

For example:

latest-state queries
smaller partitions
low-latency analytical workloads
deduplicated views over mutable datasets
operational analytics

Especially when using:

proper partitioning
controlled part counts
optimized schemas

In these cases, modern ClickHouse handles FINAL much better than older versions did.

Where FINAL Can Still Hurt

Even with improvements, FINAL is not magically free.

It can still become expensive when:

scanning huge datasets
querying many partitions
merges are heavily delayed
part counts explode
schema design is poor

For example:

SELECT *
FROM massive_events_table
FINAL
WHERE timestamp >= now() - INTERVAL 30 DAY;

On very large analytical datasets, this can still force substantial extra work.

So blindly adding FINAL everywhere is still not a great idea.

SELECT ... FINAL vs OPTIMIZE TABLE ... FINAL

One important distinction:

SELECT * FROM users FINAL;

and

OPTIMIZE TABLE users FINAL;

are completely different operations.

SELECT ... FINAL applies merge logic during query execution.

OPTIMIZE TABLE ... FINAL forces a heavy merge operation on storage parts themselves.

The first is a query-time behavior.

The second is a storage-level operation that can become extremely expensive on large datasets.

People often mix these two together when discussing FINAL performance, but they solve very different problems.
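
The difference is easy to see in a toy model. This Python sketch assumes a single partition and a "highest version wins" rule. ToyUsersTable and its methods are hypothetical names, not ClickHouse APIs.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int, str]  # (id, version, name); hypothetical row shape


def dedup(rows: List[Row]) -> List[Row]:
    """Keep the highest-version row per id, sorted by id."""
    latest: Dict[int, Row] = {}
    for row in rows:
        if row[0] not in latest or row[1] > latest[row[0]][1]:
            latest[row[0]] = row
    return sorted(latest.values())


class ToyUsersTable:
    def __init__(self, parts: List[List[Row]]) -> None:
        self.parts = parts

    def select_final(self) -> List[Row]:
        # Query-time: returns deduplicated rows, storage is left untouched.
        return dedup([row for part in self.parts for row in part])

    def optimize_final(self) -> None:
        # Storage-level: rewrites everything into a single merged part.
        self.parts = [dedup([row for part in self.parts for row in part])]


users = ToyUsersTable([[(1, 1, "ann")], [(1, 2, "anne")], [(2, 1, "bob")]])
rows_seen = users.select_final()
parts_after_select = len(users.parts)
users.optimize_final()
parts_after_optimize = len(users.parts)
```

After the select, the table still has three parts, so the next query pays the same cost again. After the optimize, it has one, but every row had to be rewritten to get there.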

The Bigger Lesson Is Understanding Why You Need FINAL

This is honestly the most important part.

A lot of people use FINAL reactively.

They see:

duplicate rows
outdated versions
inconsistent query results

and immediately add:

FINAL

without understanding:

merge behavior
part lifecycle
asynchronous deduplication
storage engine behavior

That usually creates larger problems later.

The better approach is:

Understand why the table requires FINAL in the first place.

Because sometimes:

the schema can improve
partitioning can improve
merges can stabilize naturally
query design can change

And sometimes:

using FINAL is actually perfectly acceptable.

ClickHouse Advice Evolves Too

One thing I find interesting about ClickHouse is how quickly operational advice evolves as the engine improves.

Advice that was absolutely correct for older versions can become incomplete later.

And I think FINAL is one of the best examples of that.

Older guidance:

avoid FINAL aggressively

Modern reality:

understand FINAL properly before deciding whether to avoid it

That is a much more useful mental model now.

Final Thought

I still would not recommend blindly adding FINAL everywhere.

But I also do not think modern ClickHouse users should automatically treat it like a disaster anymore.

The real question is not:

"Is FINAL bad?"

The real question is:

"Why does this query need FINAL, and is that tradeoff acceptable for this workload?"

That mindset leads to much better ClickHouse designs than simply following old rules blindly.

References

ClickHouse Docs - FINAL Modifier

Altinity KB - FINAL Clause Speed

Why PostgreSQL and ClickHouse Work So Well Together

Mohamed Hussain S — Mon, 11 May 2026 09:26:10 +0000

A lot of people compare PostgreSQL and ClickHouse like they are competing databases.

They really are not.

In fact, modern data systems often use both together.

And once you understand what each database is optimized for, the reason becomes pretty obvious.

PostgreSQL and ClickHouse Solve Different Problems

The biggest mistake people make is expecting both databases to behave similarly.

They are built for entirely different workloads.

PostgreSQL is primarily an OLTP database.

ClickHouse is primarily an OLAP database.

That single difference changes almost everything about how they think internally.

PostgreSQL Thinks About Transactions First

PostgreSQL is extremely good at handling transactional workloads.

Things like:

user data
payments
inventory
banking records
order systems
application state

These are systems where:

consistency matters
updates happen frequently
rows are modified constantly
transactions must be reliable

For example:

UPDATE inventory
SET stock = stock - 1
WHERE product_id = 101;

This kind of workload is where PostgreSQL shines.

You want:

ACID guarantees
reliable transactions
row-level updates
strong consistency

PostgreSQL is designed around exactly that.

ClickHouse Thinks About Analytics First

ClickHouse approaches data very differently.

Instead of optimizing for frequent row updates, it optimizes for analytical queries across massive datasets.

Things like:

metrics
observability
logs
event streams
analytical dashboards
time-series workloads

For example:

SELECT
    service_name,
    avg(response_time_ms)
FROM metrics
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service_name;

This is a completely different style of workload.

Instead of:

modifying small numbers of rows

ClickHouse is optimized for:

scanning huge amounts of data efficiently
aggregating billions of records
compressing analytical datasets
fast columnar reads

PostgreSQL Stores the Business. ClickHouse Explains It.

This is honestly the simplest way I think about it now.

PostgreSQL usually stores:

current application state
transactional business data
operational records

ClickHouse usually stores:

analytical history
events
metrics
large-scale queryable telemetry

One powers the application.

The other explains what the application is doing.

Why They Commonly Exist Together

This is where things get interesting.

In many modern architectures, PostgreSQL becomes the operational source of truth.

Then data flows into ClickHouse for analytics.

Something like this:

Application
    ↓
PostgreSQL
    ↓
CDC / Airbyte / Kafka
    ↓
ClickHouse
    ↓
Dashboards / Analytics / Observability

This pattern is far more common than many people realize.

Because each database is doing what it is best at.

Why Not Just Use PostgreSQL for Analytics?

PostgreSQL can do analytical queries.

But analytical workloads behave very differently from transactional workloads.

For example:

scanning billions of rows
large aggregations
observability queries
real-time analytics
historical trend analysis

These workloads stress databases differently.

ClickHouse is optimized around:

columnar storage
vectorized execution
aggressive compression
analytical query execution

That is why queries over huge datasets often feel dramatically faster in ClickHouse.

Why Not Just Use ClickHouse for Everything?

This is another common misunderstanding.

ClickHouse is incredible for analytics.

But transactional systems require things like:

frequent updates
transactional consistency
row-level modifications
operational application state

That is not the primary design goal of ClickHouse.

You generally do not want your:

user authentication system
banking transactions
inventory updates
operational business logic

to depend entirely on analytical database behavior.

The Interesting Part Is the Separation of Responsibilities

What I personally find interesting is how these systems complement each other instead of replacing each other.

PostgreSQL handles:

operational correctness

ClickHouse handles:

analytical scale

That separation creates much cleaner architectures.

Instead of forcing one database to solve every problem, each system handles the workload it was designed for.

CDC Is What Connects Them

One thing that makes this architecture powerful is CDC (Change Data Capture).

Instead of manually exporting data repeatedly, systems can stream changes from PostgreSQL into ClickHouse continuously.

Tools like:

Debezium
Airbyte
Kafka pipelines

make this pattern extremely practical now.

The operational system continues running normally while analytical systems receive data almost in real time.

They Even Think Differently Internally

The differences go deeper than just "transactions vs analytics".

PostgreSQL thinks heavily about:

rows
transactional consistency
updates
locking
relational integrity

ClickHouse thinks heavily about:

columns
compression
merges
partitions
analytical scans
aggregation efficiency

Even their storage engines reflect completely different priorities.

This Is Why Modern Data Stacks Often Use Both

Once you stop viewing databases as competitors and instead view them as workload-specific systems, the architecture starts making much more sense.

PostgreSQL handles the operational side.

ClickHouse handles the analytical side.

Together, they create systems that can:

process transactions reliably
scale analytical workloads efficiently
support observability
power dashboards
retain huge historical datasets

without forcing a single database to do everything.

Final Thought

The more I learn about databases, the more I realize that most modern architectures are really about separation of responsibilities.

PostgreSQL and ClickHouse work well together because they optimize for fundamentally different problems.

One is built to preserve business state reliably.

The other is built to analyze massive amounts of history efficiently.

And when combined properly, they complement each other extremely well.

PostgreSQL Restore Failures: It Wasn’t pgBackRest, It Was My Recovery Logic

Mohamed Hussain S — Wed, 06 May 2026 12:24:28 +0000

I was building and testing a PostgreSQL backup and restore workflow using pgBackRest.

The idea was simple:

take backups
restore them automatically
validate the database
make recovery predictable

Instead, I ended up repeatedly breaking PostgreSQL recovery itself.

At one point, PostgreSQL refused to start entirely, the application depending on it failed to start, and I started seeing errors like:

invalid checkpoint record
could not locate a valid checkpoint record at 0/DEAD

Later, I also hit timeline mismatch errors like:

ERROR: [058]: target timeline 3 forked from backup timeline 2

At first, I thought:

pgBackRest restores were corrupting PostgreSQL.

That assumption turned out to be completely wrong.

The real problem was the way I was handling recovery.

What I Was Building

I was testing a PostgreSQL backup/restore flow locally after repeated restore failures elsewhere.

To isolate the issue properly, I moved PostgreSQL onto my local machine and started testing the restore logic independently through API-triggered workflows.

The restore flow looked roughly like this:

Download backup repo
Stop PostgreSQL
Restore backup
Start PostgreSQL
Validate database

Sounds straightforward.

It wasn't.

The First Major Failure

After a restore attempt, PostgreSQL refused to start.

The logs looked like this:

LOG: database system was interrupted
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record at 0/DEAD

At that point:

PostgreSQL was down
the application couldn't start
authentication-related functionality stopped working
and repeated restore attempts made things even worse

What confused me initially was this:

The restore itself appeared to complete.

But PostgreSQL would immediately enter recovery problems afterward.

My Wrong Assumption

This was the real issue.

Every time recovery failed, I kept seeing files like:

backup_label
recovery.signal
standby.signal

So I assumed they were leftover artifacts from failed restores.

My restore automation started aggressively cleaning them up.

Something like this:

rm -f recovery.signal standby.signal backup_label

I genuinely believed this was helping PostgreSQL start cleanly.

In reality:

I was deleting the exact recovery metadata PostgreSQL needed.

That misunderstanding caused almost every major issue afterward.

What PostgreSQL Was Actually Trying To Do

This was the turning point.

pgBackRest wasn't randomly writing junk files into the data directory.

Those files exist for a reason.

During restore:

backup_label tells PostgreSQL where recovery should begin
recovery.signal tells PostgreSQL to enter recovery mode
WAL replay reconstructs a consistent database state

PostgreSQL was actually trying to perform a valid recovery process.

My automation kept interrupting or invalidating it.

Once I understood that, the entire problem started making sense.
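A safer alternative to deleting these files is to only report them. The following is a minimal sketch, assuming the data directory path is known; `find_recovery_markers` and `describe_state` are hypothetical helper names, not part of pgBackRest or PostgreSQL.

```python
from pathlib import Path

# Files the article names as recovery metadata. They are reported, never deleted.
RECOVERY_MARKERS = ("backup_label", "recovery.signal", "standby.signal")


def find_recovery_markers(data_dir):
    """Return the recovery-related files present in a PostgreSQL data directory."""
    base = Path(data_dir)
    return [name for name in RECOVERY_MARKERS if (base / name).is_file()]


def describe_state(data_dir):
    """Summarize what the markers suggest, as a hint for logging only."""
    present = find_recovery_markers(data_dir)
    if not present:
        return "no recovery markers found"
    return "recovery expected, leave in place: " + ", ".join(present)
```

Logging the result of `describe_state` before starting PostgreSQL gives the same visibility I wanted from the cleanup step, without touching anything recovery depends on.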

The Recovery Loop Problem

Because my cleanup logic removed recovery metadata prematurely, PostgreSQL ended up in inconsistent states repeatedly.

Sometimes it would:

enter recovery mode
fail WAL replay
lose checkpoint continuity
refuse startup entirely

Other times it would partially start, but remain stuck in recovery mode.

That led to additional logic being added just to stabilize startup behavior.

For example:

SELECT pg_is_in_recovery();

and when required:

SELECT pg_promote();

The goal wasn't to "force PostgreSQL to work".

The goal was:

let PostgreSQL finish recovery properly, then promote only when necessary.

That distinction mattered a lot.
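The "promote only when necessary" idea can be sketched as a small polling loop. This is an illustration under assumptions, not my exact code: `is_in_recovery` and `promote` are hypothetical callables that the caller wires to `SELECT pg_is_in_recovery()` and `SELECT pg_promote()`, and the grace period values are arbitrary.

```python
import time


def promote_if_still_in_recovery(is_in_recovery, promote, grace_seconds=60.0,
                                 poll_seconds=2.0, sleep=time.sleep,
                                 clock=time.monotonic):
    """Wait for recovery to end on its own; promote only if it does not.

    is_in_recovery: callable returning the result of SELECT pg_is_in_recovery()
    promote:        callable that runs SELECT pg_promote()
    Both are hypothetical hooks supplied by the caller.
    Returns "finished" if recovery ended by itself, "promoted" otherwise.
    """
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if not is_in_recovery():
            return "finished"
        sleep(poll_seconds)
    # One last check after the grace period before intervening.
    if not is_in_recovery():
        return "finished"
    promote()
    return "promoted"
```

Injecting `sleep` and `clock` keeps the function testable without a running database or real waiting.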

The Timeline Mismatch Error

At one stage, I also hit this:

ERROR: [058]: target timeline 3 forked from backup timeline 2

This one was especially confusing at first.

The issue was not just corrupted startup state anymore.

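Now the restore was rejecting the WAL history itself. The `ERROR: [058]` format is pgBackRest's own error output, not a line from the PostgreSQL server log.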

This happened because earlier restore attempts had already created inconsistent recovery timelines.

I had essentially created multiple broken recovery histories while repeatedly testing and modifying the restore process.

That was another important lesson:

PostgreSQL backups are not just data files.
They are tightly connected to WAL history and recovery timelines.

At this point, I realized I was no longer debugging a simple restore failure. I was debugging recovery history itself.
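One way to see which timelines exist is to look at file names: a PostgreSQL WAL segment name is 24 hexadecimal characters whose first 8 encode the timeline ID, and timeline history files are named like `00000003.history`. The sketch below assumes plain names as they appear in `pg_wal`; files inside a backup repository may carry extra suffixes that this does not handle. The function names are hypothetical.

```python
import re

_WAL_SEGMENT = re.compile(r"^([0-9A-F]{8})[0-9A-F]{16}$")
_HISTORY_FILE = re.compile(r"^([0-9A-F]{8})\.history$")


def timeline_id(filename):
    """Return the timeline ID encoded in a WAL segment or history file name, else None."""
    for pattern in (_WAL_SEGMENT, _HISTORY_FILE):
        match = pattern.match(filename)
        if match:
            return int(match.group(1), 16)
    return None


def timelines_present(filenames):
    """Collect the distinct timelines seen in a listing of WAL files."""
    return sorted({tli for tli in map(timeline_id, filenames) if tli is not None})


print(timeline_id("000000030000000000000012"))  # 3
```

Seeing more timelines than expected in a listing is a quick hint that earlier restore attempts promoted and forked history.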

The Real Problem In My Restore Flow

Initially, my restore logic tried to "fix" PostgreSQL after restore.

That approach was fundamentally flawed.

The older flow looked roughly like this:

Old Approach	Problem
Delta restore	Mixed old/new recovery state
Delete `backup_label`	Broke recovery metadata
Delete `recovery.signal`	Interrupted recovery
Force archive changes	Caused WAL continuity issues
Hope PostgreSQL starts	No validation or recovery awareness

I was treating recovery artifacts like corruption.

They weren't corruption.

They were part of PostgreSQL recovery itself.

The Change That Finally Fixed It

The biggest realization was this:

Stop fighting PostgreSQL recovery.

Instead of trying to manually "clean up" PostgreSQL after restore, I changed the restore flow completely.

The corrected restore flow became:

Stop PostgreSQL cleanly
Completely empty the data directory
Run pgBackRest restore properly
Let PostgreSQL recover normally
Wait for readiness
Promote only if recovery mode persists
Validate using pgBackRest check

The critical change was this:

self._run_pgbackrest("restore", "--type=immediate")

And equally important:

self._empty_directory(self.pg_data_dir)

Instead of attempting partial or delta-style recovery cleanup, the restore process now starts from a completely clean data directory.

That eliminated a huge amount of inconsistent state.
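The body of `_empty_directory` is not shown above, so here is one possible sketch of such a helper. It removes the contents but keeps the directory itself, and it refuses obviously dangerous targets; the guard conditions are my own assumption about what is worth checking.

```python
import shutil
from pathlib import Path


def empty_directory(path):
    """Delete everything inside `path` but keep the directory itself.

    Keeping the directory preserves its ownership and permissions,
    which PostgreSQL checks on startup.
    """
    target = Path(path).resolve()
    # Guard against a missing path or the filesystem root wiping far more than intended.
    if not target.is_dir() or target == Path(target.anchor):
        raise ValueError(f"refusing to empty {target!s}")
    for entry in target.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            # Regular files and symlinks: remove the entry, never follow the link.
            entry.unlink()
```

This should only run after PostgreSQL has been stopped, since it deletes files the running server depends on.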

Why `--type=immediate` Helped

This turned out to be extremely important.

--type=immediate tells pgBackRest:

recover only until the database reaches a consistent state, then stop replaying WAL.

That meant:

PostgreSQL could perform proper WAL-based recovery
recovery metadata stayed intact
WAL replay remained valid
timeline handling became predictable

Most importantly:

PostgreSQL itself was finally allowed to control recovery correctly.
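For reference, this is roughly how the restore command line could be assembled. It is a sketch, not my actual `_run_pgbackrest` helper: the stanza name is an assumption, and the optional `--target-action` flag (documented by pgBackRest with the values pause, promote and shutdown) is not something the flow above uses, since it promotes manually instead.

```python
def restore_command(stanza, target_action=None):
    """Build the argv list for an immediate-consistency pgBackRest restore."""
    argv = ["pgbackrest", f"--stanza={stanza}", "--type=immediate"]
    if target_action is not None:
        if target_action not in ("pause", "promote", "shutdown"):
            raise ValueError(f"unknown target action: {target_action}")
        argv.append(f"--target-action={target_action}")
    argv.append("restore")
    return argv


# "main" is a placeholder stanza name.
print(restore_command("main"))
```

The list can be passed to `subprocess.run(..., check=True)`. Building it as a list rather than a shell string avoids quoting problems.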

The Mistake That Increased the Blast Radius

One thing I learned the hard way:

Never test restore automation against a database actively used by an application.

Even though this was a testing workflow, the PostgreSQL instance was still tied to application startup behavior.

So whenever PostgreSQL failed:

application startup failed too
user-related functionality broke
debugging became much harder under pressure

After repeated failures, I moved the restore testing flow entirely onto my local machine and isolated PostgreSQL from the rest of the application stack.

That made debugging significantly easier.

Another Subtle Issue: Backup Failures After Restore

I also ran into another confusing problem after some restore attempts.

In certain cases, subsequent backups started failing unexpectedly after a restore.

Part of the issue came from mixing:

restore operations
delta-style restore assumptions
and archive/WAL state inconsistencies

At one stage, I was also toggling archive-related behavior incorrectly during recovery experiments, which further complicated WAL continuity.

This reinforced another important realization:

PostgreSQL backups are tightly coupled with WAL history and recovery timelines.

Even when the database appears to start correctly, inconsistent recovery state can break future backup behavior in subtle ways.

What I Learned From This

This experience completely changed how I think about PostgreSQL recovery.

Some major lessons:

backup_label and recovery.signal are not garbage files
PostgreSQL recovery is heavily WAL-dependent
Timelines matter more than most people realize
Partial cleanup creates inconsistent recovery states
A clean restore is often safer than trying to "repair" recovery manually
pgBackRest already knows how to orchestrate PostgreSQL recovery properly
Restore validation matters as much as backup creation
Backup testing should happen in isolated environments

Most importantly:

PostgreSQL recovery is not something you should "fight".

Once I stopped trying to override recovery behavior manually and instead allowed PostgreSQL + pgBackRest to handle recovery the way they were designed to, the restore flow finally became stable.

The Final Restore Flow That Actually Worked

After multiple failed recovery attempts, timeline mismatches, and broken startup states, I stopped trying to manually "fix" PostgreSQL recovery and instead simplified the restore process completely.

The final stable flow looked roughly like this:

# simplified restore flow

stop_postgres()

empty_data_directory()

pgbackrest_restore("--type=immediate")

start_postgres()

wait_for_connection()

if postgres_is_in_recovery():
    promote_postgres()

pgbackrest_check()

The important part here is not the code itself.

It's the recovery philosophy behind it.

The earlier versions of my restore logic tried to:

partially clean recovery state
remove recovery metadata
force PostgreSQL out of recovery
preserve old data directory state

That approach kept creating inconsistent recovery conditions.

The corrected flow instead does three important things:

starts from a completely clean data directory
lets pgBackRest manage recovery metadata properly
allows PostgreSQL to perform WAL recovery the way it was designed to

The biggest change was no longer treating files like backup_label or recovery.signal as corruption artifacts.

They were part of the recovery process itself.
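The simplified flow can also be made to stop at the first failing step instead of carrying on and hoping PostgreSQL starts. A minimal sketch, assuming each step is a callable that raises on failure; `run_restore_flow` and the step names are hypothetical stand-ins for the helpers in the flow above.

```python
def run_restore_flow(steps, log=print):
    """Run named steps in order and stop at the first failure.

    steps: list of (name, callable) pairs; each callable is a hypothetical
    hook such as stop_postgres or pgbackrest_check and must raise on failure.
    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, step in steps:
        log(f"running: {name}")
        try:
            step()
        except Exception as exc:
            # Later steps are skipped so they cannot act on a half-restored state.
            log(f"failed: {name}: {exc}")
            return completed, name
        completed.append(name)
    return completed, None
```

Returning the name of the failed step makes it clear where a restore stopped, which is the information I was missing while debugging the earlier versions.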

Final Thought

At the beginning, I thought PostgreSQL restores were failing because the database was corrupted.

In reality, the corruption was coming from my own recovery assumptions.

The system wasn't broken.

My mental model of PostgreSQL recovery was.

arrayJoin in ClickHouse: Why Your Rows Are Duplicating (and How to Control It)

Mohamed Hussain S — Tue, 28 Apr 2026 10:20:11 +0000

When working with arrays in ClickHouse, arrayJoin feels straightforward.

Until your query suddenly returns far more rows than expected.

The Use Case

Let’s say you have a table like this:

CREATE TABLE events (
    user_id UInt32,
    actions Array(String)
) ENGINE = MergeTree
ORDER BY user_id;

Example row:

user_id: 1
actions: ['click', 'scroll', 'purchase']

Now you want each action as a separate row.

The Tool: `arrayJoin`

SELECT user_id, arrayJoin(actions) AS action
FROM events;

Output:

1   click
1   scroll
1   purchase

So far, everything looks correct.

Where Things Go Wrong

Now let’s say you write:

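SELECT user_id,
       arrayJoin(actions) AS action,
       -- arrayConcat(actions, []) returns the same array; it only makes the two
       -- expressions differ, because ClickHouse may evaluate identical arrayJoin calls once
       arrayJoin(arrayConcat(actions, [])) AS action2
FROM events;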

You might expect:

3 rows

But you actually get:

9 rows

Why This Happens

arrayJoin doesn’t just flatten arrays.

It expands rows.

Each element in the array creates a new row.

So when you use it multiple times:

First arrayJoin → expands rows
Second arrayJoin → expands again

Result:

3 elements → 3 × 3 = 9 rows

This is effectively a Cartesian product: every element from the first expansion is paired with every element from the second.
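The same multiplication can be reproduced outside ClickHouse. This is a plain Python model of the behavior described above, not ClickHouse code, using the example row from the article.

```python
from itertools import product

actions = ["click", "scroll", "purchase"]

# One arrayJoin: one output row per element.
single = [(1, action) for action in actions]

# Two independent arrayJoins: every element paired with every element.
double = [(1, a, b) for a, b in product(actions, actions)]

print(len(single))  # 3
print(len(double))  # 9
```

With arrays of length n and m the second case produces n × m rows per source row, which is why large arrays make this expensive so quickly.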

The Hidden Impact

This becomes a real problem when:

Arrays are large
Multiple arrayJoins are used
You don’t expect row multiplication

Result:

Incorrect output
Sudden increase in row count
Slower queries

The Better Approach

1. Use a single `arrayJoin` when possible

SELECT user_id,
       arrayJoin(actions) AS action
FROM events;

2. Use `ARRAY JOIN` syntax (cleaner and explicit)

SELECT user_id, action
FROM events
ARRAY JOIN actions AS action;

3. Use `arrayZip` to avoid unintended multiplication

If you’re working with multiple arrays:

SELECT user_id,
       arrayJoin(arrayZip(actions, actions)) AS zipped
FROM events;

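This ensures elements are paired by position instead of multiplied. Zipping actions with itself is only for illustration; in practice you would zip two different arrays, which must have the same length, and read the paired values as zipped.1 and zipped.2.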
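Pairing by position is the same idea as Python's `zip`. The sketch below models it with a second array, `durations_ms`, which is hypothetical and not a column in the example table.

```python
def pair_elements(*arrays):
    """Pair arrays position by position, like arrayJoin(arrayZip(...))."""
    lengths = {len(array) for array in arrays}
    if len(lengths) > 1:
        # Mirrors the equal-length requirement when zipping arrays.
        raise ValueError("arrays must have equal length to be zipped")
    return list(zip(*arrays))


actions = ["click", "scroll", "purchase"]
durations_ms = [120, 900, 45]  # hypothetical second array

paired = pair_elements(actions, durations_ms)
print(len(paired))  # 3
```

Three elements in, three rows out: the row count follows the array length rather than the product of the lengths.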

Why This Matters

arrayJoin is powerful but easy to misuse.

If used without understanding:

Row count can explode
Queries become expensive
Results can be misleading

Real-World Use Cases

Event tracking pipelines
Flattening nested JSON
Working with semi-structured logs
Exploding arrays into rows for analysis

One Important Gotcha

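Every arrayJoin multiplies rows: each source row becomes one row per array element, so a row whose array is empty disappears from the result.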

If your result size looks unexpectedly large, this is one of the first things to check.
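A quick sanity check is to estimate the row count before running the query. This is a rough sketch, assuming each array in a row gets its own independent arrayJoin and that an empty array yields no rows; `expected_row_count` is a hypothetical helper.

```python
from math import prod


def expected_row_count(rows):
    """Estimate output rows when each array in a row gets its own arrayJoin.

    rows: iterable of tuples, each holding the array lengths for one source row.
    A row containing an empty array contributes zero rows.
    """
    return sum(prod(lengths) for lengths in rows)


print(expected_row_count([(3, 3)]))  # 9
```

In ClickHouse itself the lengths could come from `length(actions)`, so the estimate can be compared against the actual result size.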

Final Thoughts

arrayJoin is one of the most useful tools in ClickHouse.

But its behavior is not always intuitive.

In many cases, the issue is not the data itself but how the query expands it.

Understanding this early can save a lot of debugging time.