This wasn’t one of those long, overplanned migrations that drag on for years. It took us about three months from start to finish, and most of that time was spent being careful rather than building something complicated.
The system we inherited was running on Azure with multiple VMs, a self-managed MySQL instance, and a load balancer in front. It had all the symptoms of something that had grown without a plan.
Everything was configured manually. No infrastructure as code, no containers, no orchestration. If something broke, someone would log into a machine and try to fix it directly, which worked until it didn’t.
The biggest issues always showed up under load. The database would start locking, queries would slow down, and parts of the system would just stop responding. Not crash, just hang long enough to cause real problems. CPU usage would spike on certain machines while others stayed underutilized. There was no real way to scale or redistribute load cleanly.
During peak hours, especially in peak season, the system became unpredictable. Some requests would go through, others would fail silently, and queues would start backing up. It wasn’t one clear failure point; it was a combination of small issues stacking up at the worst possible time.
The traffic patterns were predictable, the load was consistently high, and most of the cost was going into abstraction layers that we weren’t really benefiting from.
So we moved to bare metal.
The new setup was built on four servers running Kubernetes.
We used Longhorn for storage, Calico for networking, and Traefik for ingress. Nothing exotic, just components that are stable and well understood.
Provisioning and configuration were handled entirely with Terraform and Ansible, to avoid repeating the manual-setup mistakes we inherited.
The goal wasn’t to build something fancy but to make the system predictable.
The migration itself was done without downtime, but it required a lot of patience. We didn’t switch everything at once: the new MySQL instance was set up first and configured to replicate from the existing database. The application continued using the old database while the new one stayed in sync in the background.
Once we were confident in the data consistency, we started shifting reads. That allowed us to test the new environment under real traffic without taking risks. After that, we switched writes, kept the old database running for a while as a safety net, and only then fully moved over.
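The phased cutover described above boils down to a small routing decision that moves through three states. Here is a minimal sketch in Python; the endpoint names and phase labels are illustrative, not our real config, and the actual work happens in MySQL replication and the application’s connection settings:

```python
# Minimal sketch of the phased cutover logic described above.
# Endpoint names and phase labels are hypothetical.

OLD_DB = "old-mysql.internal"
NEW_DB = "new-mysql.internal"

# Phases of the migration, in the order we moved through them:
#   "replicating"   - app fully on the old DB, new DB syncing behind it
#   "reads_shifted" - reads served by the new DB, writes still on the old one
#   "cut_over"      - reads and writes on the new DB; old DB kept as a safety net
PHASES = ("replicating", "reads_shifted", "cut_over")

def route(query_kind: str, phase: str) -> str:
    """Return which database endpoint a query should hit in a given phase."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    if phase == "replicating":
        return OLD_DB
    if phase == "reads_shifted":
        return NEW_DB if query_kind == "read" else OLD_DB
    return NEW_DB  # cut_over: everything on the new database
```

The middle phase is the useful one: reads exercise the new environment under real traffic while writes stay safely on the old primary, so a problem on the new side never risks data loss.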
From the outside, nothing changed during the transition. Internally, everything did.
The difference after the migration was immediate. The database stopped being a bottleneck. Locking issues disappeared under load. Query performance became consistent instead of degrading during peak traffic.
Queues that used to build up during busy periods now process almost instantly. Jobs don’t sit around waiting anymore. When something fails now, it’s because of an actual bug, not because the infrastructure couldn’t keep up.
The system handles hundreds of requests per second during peak without any noticeable degradation. The same endpoints that used to struggle under load now behave exactly the same whether traffic is low or high.
Uptime improved significantly as well. Before, enough small issues accumulated that it realistically hovered below acceptable levels; after the migration, it stabilized around what you would expect from a properly designed system.
Cost went down by almost 40% on a yearly basis, simply because we removed unnecessary overhead and reduced the amount of time spent maintaining unstable infrastructure.
Immediately after going live, we hit a critical issue.
One of the banks we worked with stopped accepting payments from us, and the failure was silent. It turned out our old IPs had been whitelisted during the initial setup years ago, but this was never documented, so every request from the new IPs was rejected.
It was, however, quickly fixed by sending them the new IPs.
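In hindsight, one pre-cutover checklist item would have caught this: diff the egress IPs of the new environment against every partner allowlist you know about. A minimal sketch, assuming you can enumerate both sets; the partner names and IPs here are made up:

```python
def undocumented_egress_changes(new_ips, partner_allowlists):
    """For each partner, report new egress IPs missing from their allowlist.

    partner_allowlists maps a partner name to the set of IPs they accept.
    Any partner returned here will silently reject traffic after cutover.
    """
    affected = {}
    for partner, allowlist in partner_allowlists.items():
        missing = set(new_ips) - set(allowlist)
        if missing:
            affected[partner] = sorted(missing)
    return affected

# Hypothetical example: the bank only ever whitelisted the old IP.
report = undocumented_egress_changes(
    new_ips={"185.0.0.10", "185.0.0.11"},
    partner_allowlists={"bank": {"20.0.0.10"}},
)
```

The hard part, of course, is knowing the allowlists exist at all; the check only helps for the dependencies someone remembered to write down.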
The biggest improvement wasn’t Kubernetes or bare metal itself. It was removing unpredictability. The system is now consistent. It behaves the same on peak days as it does on normal ones.
Peak traffic is no longer something that requires planning around or worrying about.
But this is not a blanket recommendation.
Bare metal comes with real responsibility. You are not just running applications anymore, you are running infrastructure. Power redundancy, network redundancy, hardware failures, physical access, all of that becomes your problem. If a server goes down, there is no provider quietly replacing it in the background. If your network has issues, there is no managed layer absorbing that complexity for you. Doing it wrong can easily put you in a worse position than where you started.
In our case, the workload justified it. The traffic was predictable, the system needed consistent performance, and the cost of staying in the cloud was higher than the cost of owning the infrastructure.
For most businesses the cloud is still the better choice. It removes a lot of operational risk and lets you focus on the product instead of the infrastructure.
In this case, bare metal was the right call.
In many others, it won’t be.