Introduction
I run several MCP servers on my home server and have Claude and ChatGPT call them. I even had AI build the server itself and its monitoring, and I felt I was running a relatively stable operation.
Yet, one day, Claude said:
"That tool cannot be found."
Huh. It was working just fine a moment ago.
When I restart the session, it works again.
After a while, it says "I can't see it" again.
Restart. It fixes it.
"Sometimes it disappears" is the trickiest issue
This pattern is frustrating because isolating the cause is difficult and progress stalls.
- The app is alive.
- The port is listening.
- Health checks are passing.
- But from the AI's side, "the tool is not registered."
Even when I look at the server logs, nothing is crashing. Sometimes, even when checking client-side logs, there isn't even a trace that a connection attempt was made.
I thought it was strange, but I let it slide for months, telling myself, "It's fine since a restart fixes it." Intermittent failures like this are easy for a human to brush off with an "oh, it failed again." You press the reload button, and that's it.
An AI that pings constantly makes the fluctuations apparent
Once the server is registered as an MCP server for an AI, the story changes.
Every time a session begins, the AI goes to connect to the registered MCP server. It resolves the name, performs the TLS handshake, fetches the tool list, and adds it to the context. If any single step fails during this process, the state remains "the tool is invisible" for the entire session.
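To see which step is the one that drops, the sequence above can be probed stage by stage. The following is a minimal sketch, not how any particular AI client works internally: the function names and the host `mcp.example.com` are hypothetical, and the tool-list fetch is left out because it depends on the server's MCP transport.

```python
import socket
import ssl
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[], None]]


def first_failing_stage(stages: List[Stage]) -> Optional[str]:
    """Run stages in order; return the name of the first one that raises, else None."""
    for name, run in stages:
        try:
            run()
        except Exception as exc:
            print(f"{name}: FAILED ({exc!r})")
            return name
        print(f"{name}: ok")
    return None


def build_stages(host: str, port: int = 443, timeout: float = 5.0) -> List[Stage]:
    """Build the DNS and TLS stages for one host (hypothetical helper)."""

    def resolve() -> None:
        # Raises socket.gaierror when the OS resolver cannot resolve the name.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)

    def tls_handshake() -> None:
        # Note: create_connection resolves the name again on its own.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # The handshake completes inside wrap_socket.
            with ctx.wrap_socket(raw, server_hostname=host):
                pass

    return [("dns", resolve), ("tls", tls_handshake)]
```

Calling `first_failing_stage(build_stages("mcp.example.com"))` from a scheduled job would record which layer failed at that moment, which is exactly the information the AI's "tool not found" message hides.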
Probabilistic failures that used to be resolved by a human thinking "Oh, it failed, let me press it again" now align perfectly with AI session boundaries, and the conversation begins in a state where the tool effectively doesn't exist.
And because the AI side only returns "that tool does not exist," the cause remains invisible to the human.
I suspected the lowest layer
I suspected the network, then TLS, then the server, in that order, and in the end only DNS remained.
Since my home server does not have a static IP, I was using DDNS (dynv6.net). It's the standard for those without a static IP. I've been using it for years.
That DDNS was failing to resolve names occasionally.
Specifically, there were moments when dig would return NXDOMAIN or SERVFAIL. I'm not sure if it was a provider-side issue, upstream cache, or rate limiting. A few minutes later, if I ran dig again, it would go through normally as if nothing had happened.
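An intermittent failure like this is easier to confirm by sampling than by running dig by hand. Here is a rough sketch under some assumptions: the function names and `home.example.com` are hypothetical, and `socket.getaddrinfo` goes through the OS resolver (including local caches and the hosts file), so it will not distinguish NXDOMAIN from SERVFAIL the way dig does.

```python
import socket
import time
from typing import Callable, List, Tuple

Sample = Tuple[float, str]


def resolve_once(name: str) -> str:
    """Return 'ok' for a successful lookup, or the resolver error text."""
    try:
        socket.getaddrinfo(name, None)
        return "ok"
    except socket.gaierror as exc:
        return f"fail: {exc}"


def sample_resolution(
    name: str,
    attempts: int,
    interval_s: float,
    resolve: Callable[[str], str] = resolve_once,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Resolve `name` repeatedly and keep a timestamped outcome per attempt."""
    results: List[Sample] = []
    for i in range(attempts):
        results.append((time.time(), resolve(name)))
        if i < attempts - 1:
            sleep(interval_s)
    return results


def failure_rate(results: List[Sample]) -> float:
    """Fraction of attempts that did not resolve; 0.0 for an empty run."""
    if not results:
        return 0.0
    failures = sum(1 for _, outcome in results if outcome != "ok")
    return failures / len(results)
```

Something like `failure_rate(sample_resolution("home.example.com", attempts=60, interval_s=60))` gives one hour of samples. The timestamps can then be lined up against the moments the AI reported the tool missing.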
……Was this it?
"A name that worked a few minutes ago does not work now" was definitely happening.
And if an AI session starts the moment the DDNS resolution drops, the tool disappears. I suspected this was likely the cause.
I had never suspected DNS before
When I realized this, I was quite surprised myself.
Until now, I had almost no concept of suspecting DNS. To me, name resolution was just something you typed into a browser address bar to see if it connected or not.
I wasn't even aware that a "sometimes it fails" mode existed.
The DNS layer simply wasn't on my radar. I had never considered that name resolution isn't a binary of "working or broken" but something that sometimes works and sometimes doesn't. Having the AI ping it daily is what finally drew my attention there.
The option of Cloudflare Tunnel
That is when I encountered Cloudflare Tunnel.
From a server without a static IP, you establish a persistent outbound connection from your end to the Cloudflare edge. The client just resolves the name via Cloudflare DNS and connects to the edge. After that, the tunnel carries the request to your home.
In other words, my server no longer needs to expose its name via DDNS. Cloudflare DNS holds the name, and Cloudflare acts as the public entry point.
I just needed to point the nameservers for my domain (kitepon.dev) at Cloudflare and set a tunnel route for each subdomain. No static IP is needed. No port forwarding is needed. No DDNS update scripts are needed.
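For a locally managed tunnel, the per-subdomain routes live in cloudflared's config file as ingress rules. The sketch below renders such a file from a mapping. The helper name, the hostname `mcp.example.com`, the local port, and the placeholder tunnel ID and credentials path are all assumptions for illustration, not my actual setup.

```python
from typing import Dict


def render_tunnel_config(
    tunnel_id: str, credentials_file: str, routes: Dict[str, str]
) -> str:
    """Render a cloudflared config with one ingress rule per hostname.

    `routes` maps a public hostname to the local service URL behind it.
    """
    lines = [
        f"tunnel: {tunnel_id}",
        f"credentials-file: {credentials_file}",
        "ingress:",
    ]
    for hostname, service in routes.items():
        lines.append(f"  - hostname: {hostname}")
        lines.append(f"    service: {service}")
    # The last ingress rule must be a catch-all with no hostname.
    lines.append("  - service: http_status:404")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_tunnel_config(
            "TUNNEL-ID",                  # placeholder, not a real tunnel ID
            "/path/to/credentials.json",  # placeholder path
            {"mcp.example.com": "http://localhost:8000"},
        )
    )
```

Each hostname also needs a DNS record pointing at the tunnel, which `cloudflared tunnel route dns` can create. Tunnels managed from the Cloudflare dashboard keep these routes on Cloudflare's side instead of in a local file.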
Two operational debts disappeared as a bonus
There were two side effects I noticed after migrating. Both were bonuses, but they quietly make a difference.
First: I was freed from the /etc/hosts pilgrimage I had been making to work around the lack of hairpin NAT.
I use SoftBank's 10G line. This does not support hairpin NAT. If I try to hit my own public domain from within my home LAN, the traffic cannot go out and come back in, so the connection never completes.
My previous solution was to go around and update /etc/hosts with the internal address for every device, every container, and every WSL instance. This was a subtle operational debt; every time I added a new device or spun up a new container, I was forced to update hosts files.
Once I moved to Cloudflare Tunnel, it uses the same path (via Cloudflare) whether I hit it from inside or outside. I removed all the special handling for hosts.
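Stale overrides are easy to forget on one device or container, so it helps to scan each hosts file for leftover entries. This is a small sketch with a hypothetical helper name, using `example.com` as a stand-in for the real domain.

```python
from typing import List, Tuple


def find_host_overrides(hosts_text: str, domain: str) -> List[Tuple[str, str]]:
    """Return (address, name) pairs in hosts-file text that pin `domain`
    or any of its subdomains to a fixed address."""
    wanted = domain.lower()
    suffix = "." + wanted
    overrides: List[Tuple[str, str]] = []
    for raw in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            lowered = name.lower()
            if lowered == wanted or lowered.endswith(suffix):
                overrides.append((address, name))
    return overrides
```

On each machine, `find_host_overrides(open("/etc/hosts").read(), "example.com")` should come back empty once the cleanup is done.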
Second: the irregular inbound blocks I occasionally suffered on the SoftBank line stopped mattering.
This was another thing that had been a minor annoyance for years. Whether it was the SoftBank line, the home gateway, or upstream security measures, I couldn't pinpoint the cause, but there were times when access from outside would not connect.
Since Cloudflare Tunnel is a persistent connection established outbound from the home server, the ISP sees only traffic initiated from home to the outside. Blocks triggered on the inbound side became structurally irrelevant.
After the migration
The AIs stopped saying "I can't see the tool."
This is an observation, not an absolute guarantee. Cloudflare itself might not be perfect either. However, the stability of the DDNS setup I operated myself and the edge availability of a commercial CDN differ by orders of magnitude. I have to admit that.
And since the hosts-file editing and the SoftBank-side blocks also went away, the triggers for the AI saying "cannot access" dropped all at once.
What I learned
By having the AI ping it daily, I realized I should suspect DNS.
To me, DNS was a mechanism for "the browser's address bar." For human access, it didn't bother me if it failed occasionally.
But when you start throwing long-term tasks at an AI, the fluctuations you previously let slide become critically impactful. When you put something that runs 24 hours a day on top of your stack, the lies in the underlying layers peel away one by one.
It just happened that in my home server, the first layer to peel away was the DNS layer.
And the means to patch that peeled layer had been sitting in front of me, free, for years. Cloudflare Tunnel became free in 2021. I just didn't have the trigger to notice.
Perhaps next, another layer will peel away. I will write about it again when that happens.