GitHub Actions vs. GCE Pull Poller: My Single VM Deployment Battle

#deployment #infra #gce #githubactions

Running a full AI product, aicoreutility.com, on a single, modest VM is a constant exercise in resource management and engineering pragmatism. Most days, it's a quiet hum of activity. But every so often, a bug or a deployment hiccup reminds me of the fragility of this setup. One such incident, which recurred a few times, was the dreaded empty .next directory after a deployment.

The Symptom: A Ghost in the Machine

The symptom was always the same: after a deploy, the website would become inaccessible, returning a 500 error. Digging into the logs, the culprit was consistently an empty or corrupted .next directory. This directory is where Next.js outputs the built application, and without it, the server has nothing to serve.

This wasn't a random occurrence. It seemed to happen most often during automated deployments triggered by pushes to GitHub. My initial thought was that the build process itself was failing, perhaps due to resource constraints on the VM. I'd check the build logs, and sometimes they'd show errors, but other times, they'd appear to complete successfully, only for the .next directory to be missing or empty upon inspection.

The Wrong Turns: Chasing Shadows

My first few attempts to fix this involved tweaking the build process itself. I experimented with different build commands, ensuring all dependencies were correctly installed, and even tried increasing the VM's memory temporarily. I also looked at the deployment scripts, trying to add more robust error checking.

One of the key parts of my deployment strategy was an atomic swap. After a successful build, the new build artifacts (in a temporary directory like .next.new) would be swapped with the current production directory (.next). This ensures that if the new build fails, the old one remains untouched. However, in this scenario, it seemed like the swap was either not happening correctly, or the build itself was producing an empty output before the swap even occurred.
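For readers who haven't seen this pattern, here is a minimal sketch of a rename-based swap in shell. It is an illustration under stated assumptions, not the actual deploy script: the function name swap_build and the .next.old rollback directory are hypothetical, and only .next and .next.new come from the setup described above.

```shell
#!/usr/bin/env bash
# Sketch of a rename-based swap. swap_build and .next.old are hypothetical
# names used for illustration only.

# swap_build APP_DIR
# Promote APP_DIR/.next.new to APP_DIR/.next and keep the previous build
# in APP_DIR/.next.old so it can be restored.
swap_build() {
  local app_dir=$1
  local new="$app_dir/.next.new"
  local live="$app_dir/.next"
  local old="$app_dir/.next.old"

  # Refuse to swap if the new build is missing or empty.
  if [ ! -d "$new" ] || [ -z "$(ls -A "$new")" ]; then
    echo "swap_build: $new is missing or empty, keeping current build" >&2
    return 1
  fi

  rm -rf "$old" || return 1

  # Each mv is a rename on the same filesystem, so .next is never
  # half-copied. There is still a short gap between the two renames
  # in which .next does not exist.
  if [ -d "$live" ]; then
    mv "$live" "$old" || return 1
  fi
  mv "$new" "$live"
}
```

Note the gap between the two mv calls: if the script is killed right there, .next is gone even though both builds still exist on disk. Flipping a symlink to a versioned build directory would close that gap, at the cost of a slightly more involved layout.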

I also had a watchdog script running, but it was primarily focused on the PM2 process and basic server health checks. It wasn't sophisticated enough to detect an empty .next directory as a critical failure before it caused a full outage. The PM2 process would still be running, but it would be serving nothing.

The Root Cause: A Race Condition on a Single VM

The recurring nature of the problem, especially with automated pushes, pointed towards a timing or concurrency issue. The core problem was that my GitHub Actions workflow executed the build and swap operations inside an external SSH session or job. If that session timed out, or the connection dropped for any reason, the swap could be left incomplete: the new build artifacts would sit in .next.new, but the move into .next would never finish, leaving the production directory empty or corrupted. PM2, unaware of the underlying file system problem, kept running, and every request returned a 500.

Essentially, the external nature of the GitHub Actions build and deploy process, combined with the potential for transient network issues or session timeouts on a single VM, created a race condition. The build could succeed, but the critical final step of making it live could fail silently, leaving the application in a broken state.

This was particularly insidious because the build logs didn't always show a clear error. The process might just... stop. And the watchdog, which wasn't designed to check the integrity of the .next directory itself, wouldn't catch it; the first sign of trouble was a failing request.

The Fix: Embracing the Simpler Path

The solution, as documented in my build logs, was to simplify the deployment process and eliminate the external dependency that was causing the race condition. I decided to move away from the GitHub Actions-driven build and swap and instead rely solely on a Google Compute Engine (GCE) pull poller.

Here's how the new system works:

GitHub Actions Disabled for Auto-Deploy: The deploy.yml workflow was modified to only allow manual triggers (workflow_dispatch). Automatic deployments via on.push were commented out, effectively disabling them for routine use.
GCE Pull Poller: A systemd timer (riel-autodeploy.timer) is set to run every 90 seconds. This timer triggers a script (auto_deploy_poll.sh).
Server-Side Build and Deploy: The auto_deploy_poll.sh script, in turn, calls redeploy.sh auto. Crucially, this script runs entirely on the server itself. It checks if a redeploy is necessary (e.g., if frontend, chat, or backend code has changed). If a redeploy is needed, it performs the build, smoke tests, and atomic swap all within the server's environment.
Flock and Gates: The redeploy.sh script now uses flock to prevent multiple instances from running concurrently and includes more robust checks. It ensures that the build is complete and the resulting .next directory is not empty and has a valid BUILD_ID before proceeding with the swap. It also ignores temporary connection issues during PM2 reloads, only rolling back on genuine failures (4xx, 5xx, chunk 404s).
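To make the last item concrete, here is a rough sketch of the two ideas in it: a flock guard and a rollback decision based on the smoke-test status code. The function names, file descriptor 9, the lock path, and the localhost URL are all assumptions made for illustration; the real redeploy.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: acquire_deploy_lock, should_rollback, fd 9, the lock path
# and the URL below are hypothetical.

# acquire_deploy_lock LOCKFILE
# Take an exclusive, non-blocking lock. Returns non-zero if another run
# already holds it. The lock is released when this shell exits.
acquire_deploy_lock() {
  local lockfile=$1
  exec 9>"$lockfile" || return 1
  flock -n 9
}

# should_rollback HTTP_CODE
# Exit 0 when a smoke-test response means a genuine failure.
# "000" is what curl's %{http_code} reports when no HTTP response
# arrived at all, for example a refused connection during a reload.
should_rollback() {
  case $1 in
    000)     return 1 ;;  # transient connection issue: retry, don't roll back
    4??|5??) return 0 ;;  # genuine error, including a 404 on a chunk URL
    *)       return 1 ;;  # 2xx or 3xx: healthy
  esac
}

# Example wiring (lock path and URL are placeholders):
#   acquire_deploy_lock /tmp/redeploy.lock || exit 0
#   code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:3000/ || true)
#   if should_rollback "$code"; then echo "rolling back"; fi
```

Using flock -n means a poll that fires while a deploy is still running simply exits instead of queueing up behind it, which fits a timer that runs every 90 seconds.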

This approach eliminates the external session dependency. The build and swap happen in a controlled environment on the VM, making it far less susceptible to network interruptions or timeouts that plagued the GitHub Actions method. The watchdog script was also updated to check for the integrity of the .next directory (non-empty, valid BUILD_ID, sufficient chunks) rather than just relying on PM2's status.
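The integrity check described above can be sketched in a few lines of shell. This is an approximation, not the actual watchdog: the function name next_build_ok and the min_chunks parameter are hypothetical, and the threshold for "sufficient chunks" is left to the caller. It relies on the standard Next.js production layout, where .next contains a BUILD_ID file and JavaScript chunks under static/chunks.

```shell
#!/usr/bin/env bash
# Sketch only: next_build_ok and min_chunks are hypothetical names.

# next_build_ok DIR [MIN_CHUNKS]
# Exit 0 if DIR looks like a usable Next.js build:
#   - the directory exists
#   - BUILD_ID exists and is non-empty
#   - at least MIN_CHUNKS .js files exist under static/chunks (default 1)
next_build_ok() {
  local dir=$1
  local min_chunks=${2:-1}

  [ -d "$dir" ] || return 1
  [ -s "$dir/BUILD_ID" ] || return 1

  local chunks
  # Arithmetic expansion strips any padding that wc -l adds on some systems.
  chunks=$(( $(find "$dir/static/chunks" -type f -name '*.js' 2>/dev/null | wc -l) ))
  [ "$chunks" -ge "$min_chunks" ]
}
```

The same function can serve both as a pre-swap gate (run it against .next.new) and as a watchdog probe (run it against .next), so the two checks can't drift apart.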

The Lesson: Simplicity is Robustness

This incident reinforced a core principle of solo development on limited infrastructure: simplicity often breeds robustness. GitHub Actions is a powerful tool, but driving the build and swap from an external session introduced a failure mode that was difficult to debug on a single VM. By moving to a simpler, server-side pull poller, I've ended up with a deployment process that is more resilient to the dropped connections and session timeouts that caused the outages. The occasional need to manually trigger a deploy via GitHub Actions is a small price to pay for the increased stability and reduced downtime.

...building aicoreutility.com in the open.

💬 This is part of Riel, a full AI product I'm building solo, in public (failures and all).