The beauty in simplicity: Migration in real-time, without data loss.

This was originally posted on my blog, where I publish all things tech, sysadmin, and more.

We've all heard the stories about migrating in real time, without any data loss. I've done it once before for this blog, and I've now done it again for my wife and me, in the middle of the day, with about 1k visitors online. Here's a breakdown of how I did it.

The Goal

Not only do I want to migrate providers, I want to move my content to a new country, literally continents apart: from my old blog host on Hetzner to DigitalOcean. Luckily, both sides have a 1 Gbps port (Hetzner is unmetered, DO has a cap of a few TB) - more than enough.

File size isn't an issue: there's about 2.4 GB of data, and the transfer took about 1 minute in total. That isn't bad considering the distance, though it could be improved.
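
As a quick sanity check on that transfer, here is a back-of-the-envelope calculation in Python. It assumes decimal gigabytes and a flat 60 seconds, both of which are rounded figures, so treat the result as approximate.

```python
# Rough throughput for the file transfer, using the rounded figures from the post.
SIZE_GB = 2.4        # total data moved (assumed to be decimal gigabytes)
DURATION_S = 60      # "about 1 minute"
LINK_MBPS = 1000     # 1 Gbps port on both sides


def average_mbps(size_gb: float, duration_s: float) -> float:
    """Average throughput in megabits per second."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / duration_s


rate = average_mbps(SIZE_GB, DURATION_S)
utilisation = rate / LINK_MBPS
print(f"{rate:.0f} Mbit/s, about {utilisation:.0%} of the link")
```

That works out to roughly 320 Mbit/s, or about a third of the 1 Gbps port, which is where the room for improvement comes from.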

The real issue is moving everything in real time. Behind the scenes there are analytics engines, a few headless APIs, and so on, all interacting with live data.

To handle this, we'll need to set up master/slave replication on PostgreSQL, then go to fully fledged master/master replication. Next, I imagine we'll have to slowly pull traffic off the old host and onto the new one. Finally, we turn off the old box.

Explanation

With our files transferred, I've set up both DO VMs as 1GB / 25GB SSD / 2TB bandwidth instances, with SSH keys enabled.

While browsing the DO interface, I noticed their Cloud Firewalls. This is great - the interface is better than Hetzner's firewall interface (for dedicated servers). I started by creating a firewall that is filtered by tags (production, cloudflare, blogging), so it will be applied to virtual machines with those tags.

I want SSH only from my jump servers (an OVH BHS VM or my home servers), HTTP/HTTPS only from Cloudflare, and the rest of the world denied for this instance. Cloud Firewalls are simple enough that the rules ended up mirroring that list almost exactly: one inbound rule for SSH from the jump hosts, and one for HTTP/HTTPS from Cloudflare.
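
To make the inbound policy concrete, here is a minimal Python sketch that models it as data and checks a connection against it. This is not the DigitalOcean API, just an illustration of the allow-list logic. The CIDR ranges are placeholders from the RFC 5737 documentation blocks, not my real jump hosts or Cloudflare's published ranges, and the `Rule` and `is_allowed` names are made up for this example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    ports: frozenset   # TCP ports this rule opens
    sources: tuple     # CIDR strings allowed to reach those ports


# Placeholder ranges (RFC 5737 documentation blocks), purely illustrative.
JUMP_HOSTS = ("192.0.2.10/32", "198.51.100.0/28")
CLOUDFLARE = ("203.0.113.0/24",)

INBOUND = (
    Rule(frozenset({22}), JUMP_HOSTS),        # SSH from jump servers only
    Rule(frozenset({80, 443}), CLOUDFLARE),   # HTTP/HTTPS from Cloudflare only
)


def is_allowed(src: str, port: int, rules=INBOUND) -> bool:
    """Default deny: a connection passes only if some rule matches it."""
    addr = ip_address(src)
    return any(
        port in rule.ports
        and any(addr in ip_network(cidr) for cidr in rule.sources)
        for rule in rules
    )


print(is_allowed("192.0.2.10", 22))    # jump host over SSH
print(is_allowed("203.0.113.5", 22))   # Cloudflare range, but SSH is not open to it
```

Anything that does not match a rule falls through to the default deny, which is the behaviour I wanted for the rest of the world.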

Outbound, I've allowed traffic to all destinations (for good or bad). I should look at tightening this up in the future, but for right now I'll let it be as-is. I want to make sure everything gets up for production first.

I've enabled private networking on the two VMs for our blogs, which helps save on the bandwidth count, and I've linked them together in the same DC. Sweet.

Now, for the SQL instance, I created a firewall called "int-production". It rejects ALL communication on the instance's public IPv4 and IPv6 addresses, but still lets it speak to the connected VMs that need access (e.g. the blogs). I've enabled private networking here too, so I can connect to it directly over the private network, yet again saving bandwidth (as it's metered on DigitalOcean).

I set up master/slave replication, and configured the firewall to accept inbound traffic from the Hetzner IP and to allow outbound traffic to the Hetzner IP only. After 20 minutes, we were synced up and ready to motor. I flipped it over to master/master, and started slowly redirecting the domains over.
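
One way to judge "synced up" during the master/slave phase is to compare WAL positions on both sides. The sketch below is a hedged illustration rather than the exact check I ran: it assumes PostgreSQL's built-in streaming replication, where you could read `pg_current_wal_lsn()` on the master and `pg_last_wal_replay_lsn()` on the slave (PostgreSQL 10+), and it only does the arithmetic on the two values. The function names are my own.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hex numbers: the high and low 32 bits
    of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(master_lsn: str, slave_lsn: str) -> int:
    """How many bytes of WAL the slave still has to replay."""
    return lsn_to_int(master_lsn) - lsn_to_int(slave_lsn)


def safe_to_cut_over(master_lsn: str, slave_lsn: str, max_lag_bytes: int = 0) -> bool:
    """True once the slave is within the accepted lag of the master."""
    return replication_lag_bytes(master_lsn, slave_lsn) <= max_lag_bytes


# Example values, invented for illustration.
print(replication_lag_bytes("16/B374D848", "16/B374D800"))
print(safe_to_cut_over("0/3000060", "0/3000060"))
```

PostgreSQL can do the same subtraction server-side with `pg_wal_lsn_diff()`; doing it in a script is just convenient if you want to poll both hosts and wait for the lag to hit zero.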

15 minutes later, I was confident the hosts were identical, so I shut off replication and turned off the Hetzner node. I ran over to my monitoring on UptimeRobot and saw exactly what I was looking for - 100% uptime.
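
If you want something more than a gut feeling before switching off the old node, a coarse check is to compare per-table row counts from both databases. The sketch below assumes you have already collected the counts (for example with `SELECT count(*)` per table) into two dictionaries. The table names and numbers are hypothetical, and matching counts are a sanity check, not proof that the data is identical.

```python
def diff_row_counts(old: dict, new: dict) -> dict:
    """Return tables whose row counts differ, as {table: (old_count, new_count)}.

    A table missing on one side shows up with None for that side.
    """
    tables = sorted(set(old) | set(new))
    return {
        table: (old.get(table), new.get(table))
        for table in tables
        if old.get(table) != new.get(table)
    }


# Hypothetical counts gathered from each host.
hetzner = {"posts": 412, "comments": 9120, "sessions": 1033}
digitalocean = {"posts": 412, "comments": 9118, "sessions": 1033}

mismatches = diff_row_counts(hetzner, digitalocean)
print(mismatches or "row counts match")
```

An empty result means the counts agree; anything else tells you which tables to look at before shutting the old box down.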

But wait, all my error logs now have CF's IPs in them. Before, I had set up some rules on the dedicated host server; now that these are individual VMs, I need to re-configure each of them. I found this article + script for nginx that worked like a charm, and set up a cron job for it. I run it once every week (Sunday, 11 PM GMT), which is more than sufficient for my needs.
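
The script I used isn't reproduced here, but the idea can be sketched in a few lines of Python: take the list of Cloudflare CIDR ranges and render an nginx snippet using the `set_real_ip_from` and `real_ip_header` directives from nginx's realip module. Fetching the ranges, writing the file, and reloading nginx are left out. The ranges in the example are documentation placeholders, and the script path in the cron line is hypothetical.

```python
from ipaddress import ip_network


def render_realip_conf(cidrs) -> str:
    """Render an nginx snippet that trusts the given proxy ranges.

    Raises ValueError if any entry is not a valid CIDR, so a bad
    download never ends up in the nginx config.
    """
    lines = ["# Generated file: trust Cloudflare proxies for the client address"]
    for cidr in cidrs:
        cidr = cidr.strip()
        if not cidr:
            continue  # skip blank lines from the downloaded list
        lines.append(f"set_real_ip_from {ip_network(cidr)};")
    lines.append("real_ip_header CF-Connecting-IP;")
    return "\n".join(lines) + "\n"


# Weekly run, Sunday at 23:00 (assumes the server clock is on UTC/GMT).
# The script path is a placeholder.
CRON_LINE = "0 23 * * 0 /usr/local/bin/update-cloudflare-ips"

# Placeholder ranges, not Cloudflare's real ones.
print(render_realip_conf(["203.0.113.0/24", "2001:db8::/32"]))
```

With a snippet like this included in the `http` block, nginx logs the visitor's address from the `CF-Connecting-IP` header instead of the Cloudflare edge IP.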

No data was lost, we were online the whole time, and this scales for my growing needs. My wife's and my blogs combined have hit well over 50,000 requests per day, and the old system didn't scale well and would constantly drop legitimate connections. The time and energy lost debugging this outweighed any benefit of staying at the old provider.

Thanks to the better flexibility of DigitalOcean, I've been able to apply the same rules to all my VMs via tags, instead of how I used to do it: configuring them individually and replicating them across N virtual machines. Alert Policies for memory, disk, and inbound/outbound bandwidth help too.

Future ideas

In the event of traffic spikes, and as my side project network grows, I'll spin up a DO load balancer. It's only $10 per month and seems very easy to use, and I can throw my VMs for blogs and apps behind it. I'd also love to get more oversight, possibly by migrating my Grafana + Prometheus setup to DO and using internal networking to save on bandwidth.

Let's see how it holds up. I'd love to be in a position to migrate all my production off my own colocated gear, and revert the colocated gear back to a lab. I'm going to keep testing DO for a few months, and hope to make a full switch. The price point is reasonable and the location is good (20 ms or less from where I live).

That's all for now! I'll keep updating on my experience with DO as I dive into more features. Maybe next I'll try their Kubernetes (k8s) service.