I Pulled My Data From the Cloud: Do I Regret It?

#devops #sistemmimarisi #software

A year ago, when we migrated the critical ERP data and the entire application of a mid-sized manufacturing company from the cloud to our own bare-metal servers, I still vividly remember the "are we making a mistake?" look on the team's faces. After years of hearing about the cloud's flexibility, ease, and promise of "infinite scalability," whether this decision would bring me regret was a big question mark at the time. So, after this ambitious rollback, do I regret it today? Absolutely not.

Our primary motivation for making this decision was to gain cost control and operational independence. When cloud bills started to swell unpredictably, I saw that not only resource usage but also network egress costs were imposing a significant burden.

Why Did We Decide to Move From Cloud to On-Premise?

The cloud's initial appeal of flexibility and rapid deployment capabilities was eventually overshadowed by unpredictable costs and vendor lock-in risks. In a manufacturing ERP, especially with real-time workloads like AI-driven production planning and operator screens, network latency and database I/O performance were critical. Optimizing the cost of such workloads in the cloud had become increasingly difficult.

The bills for our PostgreSQL instances, Redis caches, and FastAPI backends were growing much faster than expected. Network egress fees in particular swallowed more and more of the budget as data transfer volume grew. On top of that, the continuous data flow over VPN tunnels, which we needed for integration with other systems on our local network, degraded performance.

ℹ️ Cost and Control

The conveniences offered by the cloud cannot be overlooked, but long-term costs and restrictions on operational control can create significant disadvantages, especially for certain workloads. Having full control over your own infrastructure offers customized performance optimizations and fixed costs.

This situation prompted me and my team to ask, "Do we really need this much flexibility?" and "How much can we save by building our own infrastructure?" Our detailed cost analysis clearly showed that an on-premise solution would be much more economical in the medium term. Additionally, the desire to keep our security policies and data storage standards under our own control was an important factor reinforcing this decision.
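To make the "how much can we save" question concrete, here is a minimal break-even sketch. The function name and every figure in it are hypothetical placeholders, not numbers from our actual analysis; it only shows the shape of the calculation.

```python
import math


def break_even_months(upfront_cost, onprem_monthly, cloud_monthly):
    """Return the first whole month in which cumulative on-premise spend
    is no higher than cumulative cloud spend, or None if that never happens."""
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        # On-premise running costs are not lower, so the hardware never pays off.
        return None
    return math.ceil(upfront_cost / monthly_saving)


# Hypothetical figures, for illustration only.
months = break_even_months(upfront_cost=120_000, onprem_monthly=4_000, cloud_monthly=10_000)
print(f"Break-even after {months} months")
```

A real analysis would also have to count staff time, hardware refresh cycles, power, and colocation fees inside the on-premise monthly figure.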

The Migration Process: Expectations vs. Reality

Migrating data from the cloud to our own servers was a much more complex process than I anticipated. Specifically, the physical replication of a 5TB PostgreSQL database pushed network bandwidth to its limits, and this process took longer than we expected. Although we had planned in detail days in advance to minimize downtime, we experienced an additional 30 minutes of outage due to DNS propagation times and various network segmentation issues during the go-live moment.
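A quick back-of-the-envelope estimate shows why a 5TB copy is bound by the network. The sketch below assumes decimal terabytes and a hypothetical link speed and efficiency factor; our real link and throughput are not stated here.

```python
def transfer_hours(size_tb, link_gbps, efficiency=0.7):
    """Estimate the hours needed to copy size_tb terabytes over a link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable after protocol overhead and competing traffic.
    """
    size_bits = size_tb * 1e12 * 8            # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * efficiency
    return size_bits / effective_bps / 3600   # seconds to hours


# Hypothetical link speeds, for illustration only.
for gbps in (1, 10):
    print(f"{gbps} Gbps: about {transfer_hours(5, gbps):.1f} hours")
```

Even at the full nominal rate of a 1 Gbps link, 5TB needs more than 11 hours, so the base copy has to start well before the cutover window.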

When setting up the new infrastructure, adapting existing cloud configurations to bare-metal servers was a challenge in itself. For PostgreSQL, we had to fine-tune parameters like checkpoint_timeout and max_wal_size to prevent WAL bloat issues. For Redis, the choice of maxmemory-policy was critical to keeping evictions under memory pressure to a minimum, and we ran lengthy performance tests comparing volatile-lru and allkeys-lfu.
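The difference between the two Redis policies is easiest to see in a toy model. The sketch below is not how Redis implements eviction (Redis samples keys and uses approximated LRU and LFU counters); it only illustrates which keys each policy considers. The key names and counters are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KeyInfo:
    name: str
    last_access: int  # logical clock of the last access (higher = more recent)
    hits: int         # how often the key has been accessed
    has_ttl: bool     # True if an expiry is set on the key


def pick_victim(keys: List[KeyInfo], policy: str) -> Optional[str]:
    """Return the key an exact (non-sampled) version of the policy would evict."""
    if policy == "volatile-lru":
        # Only keys with an expiry are candidates; others are never evicted.
        candidates = [k for k in keys if k.has_ttl]
        if not candidates:
            return None
        return min(candidates, key=lambda k: k.last_access).name
    if policy == "allkeys-lfu":
        # Every key is a candidate; the least frequently used one goes first.
        if not keys:
            return None
        return min(keys, key=lambda k: k.hits).name
    raise ValueError(f"unsupported policy: {policy}")


# Hypothetical keys, for illustration only.
sample_keys = [
    KeyInfo("session:1", last_access=10, hits=50, has_ttl=True),
    KeyInfo("session:2", last_access=90, hits=3, has_ttl=True),
    KeyInfo("config:plant", last_access=5, hits=1, has_ttl=False),
]
print(pick_victim(sample_keys, "volatile-lru"))
print(pick_victim(sample_keys, "allkeys-lfu"))
```

In this toy data set, volatile-lru never touches the key without an expiry, while allkeys-lfu evicts it first because it is rarely read. That is the kind of trade-off the performance tests had to settle.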

⚠️ To Avoid Being Caught Off Guard

Unexpected problems can always arise during data migration processes. Details such as DNS propagation times, MTU mismatches, or database replication errors can lead to critical outages. Detailed testing and rollback plans are vital.

I experienced similar challenges when migrating the backend of my own side product. Once, when defining a new systemd unit, I accidentally set the hard memory.max limit (MemoryMax=) where I had meant to set the softer memory.high throttle (MemoryHigh=), and the application was unexpectedly OOM-killed. Small but critical errors like this showed once again how many details we need to master when we step out of the "comfort zone" provided by managed cloud services. Although these situations were frustrating at first, each problem became a new learning opportunity for me.
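A small check like the following can catch a memory.high and memory.max mix-up before it reaches production. It is a minimal sketch that assumes cgroup v2, where both files contain either a byte count or the literal string max; the helper names are my own, and the directory passed to read_limits depends on your system layout.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_cgroup_limit(raw: str) -> Optional[int]:
    """Parse a cgroup v2 limit file: 'max' means unlimited (None)."""
    value = raw.strip()
    return None if value == "max" else int(value)


def describe_limits(memory_high: str, memory_max: str) -> str:
    """Summarize how the kernel will react as the service grows."""
    high = parse_cgroup_limit(memory_high)
    hard = parse_cgroup_limit(memory_max)
    if hard is not None and (high is None or high >= hard):
        # No throttling zone below the hard cap: the first sign is an OOM kill.
        return "hard cap without effective throttling"
    if high is not None:
        return "throttled above memory.high"
    return "no memory limits"


def read_limits(cgroup_dir: str) -> Tuple[str, str]:
    """Read both files from a unit's cgroup directory (path is an assumption)."""
    base = Path(cgroup_dir)
    return (base / "memory.high").read_text(), (base / "memory.max").read_text()


print(describe_limits("max", "536870912"))        # only the hard limit was set
print(describe_limits("402653184", "536870912"))  # throttle below the hard cap
```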

Pros and Cons of On-Premise Life (From My Perspective)

The biggest advantage we gained after moving on-premise was full control over costs and performance. It was now much easier to predict monthly bills, and we could prevent unnecessary resource consumption. The direct disk write speeds of PostgreSQL on our own servers provided much higher performance without being limited by IOPS in the cloud.

For example, a batch job processing over 100,000 transaction records daily in a manufacturing company's ERP, which took an average of 45 minutes in the cloud, dropped to 18 minutes on our on-premise infrastructure, a 60% reduction in run time. This was a tremendous gain in operational efficiency. Keeping network segmentation entirely under our control with VLANs and firewall rules also strengthened our security posture. By implementing switch hardening techniques like DHCP snooping and Dynamic ARP Inspection (DAI), I built a more resilient structure against internal threats.

However, some disadvantages of on-premise cannot be ignored. The initial investment cost and the operational burden in particular can be a major barrier at the beginning. When a server fails, the network misbehaves, or the power goes out, all responsibility lies with us. We had to design our disaster recovery plans in far more detail and carry them out manually. For instance, when a RAID card on a backup server failed, I sorely missed the convenience of automatic replication in the cloud.

💡 The Value of Specialized Control

On your own servers, you can fine-tune every detail, from operating system kernel parameters to network QoS settings. This level of control is invaluable, especially for applications requiring high performance or specialized security needs.

Nevertheless, the in-depth knowledge and level of control I gained in return for this additional operational burden made this change worthwhile. Now, when a problem arises, I don't have to wait for a third-party provider's APIs or support team to find the root cause. I can directly examine systemd units, journald logs, and cgroup limits, which speeds up the problem-solving process.
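As an illustration of that kind of direct inspection, the sketch below filters journald output for errors. It assumes the log was exported with journalctl -o json, which emits one JSON object per line with string fields such as PRIORITY and MESSAGE. The sample entries and the unit name are hypothetical.

```python
import json
from typing import Iterable, List


def error_messages(journal_lines: Iterable[str], max_priority: int = 3) -> List[str]:
    """Return the messages at or above the given severity.

    journald uses syslog priorities: 0 is emergency, 3 is error, 6 is info,
    so a lower number means a more severe entry.
    """
    messages = []
    for line in journal_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Entries without a PRIORITY field are treated as informational here.
        priority = int(entry.get("PRIORITY", 6))
        if priority <= max_priority:
            messages.append(entry.get("MESSAGE", ""))
    return messages


# Hypothetical log lines, for illustration only.
sample_lines = [
    '{"PRIORITY": "6", "MESSAGE": "Started erp-api.service"}',
    '{"PRIORITY": "3", "MESSAGE": "worker exited with status 137"}',
]
print(error_messages(sample_lines))
```

In practice the lines would come from something like journalctl -u erp-api.service -o json --no-pager piped into the script, with your own unit name in place of the hypothetical one.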

So, Do I Regret It? My Final Decision.

I definitely do not regret my decision to pull my data from the cloud to my own servers. On the contrary, this process has taught me a great deal in terms of both technical knowledge and operational flexibility. Of course, this is not the right path for everyone. For small-scale projects or rapidly changing workloads, the cloud can still be an ideal solution. However, for teams that have reached a certain size, want tighter control over costs and data security, and can handle the operational burden, an on-premise or hybrid approach offers significant advantages.

In my experience, especially for long-lived and critical systems like a manufacturing ERP, investing in our own infrastructure has proven to be a much smarter strategy in the long run. Costs became more predictable, performance exceeded our expectations, and we significantly strengthened our security posture. We acted knowing the risks and challenges behind this decision, and the results show that the effort was worth it.