Adenuga Israel Abimbola


HNG Portal DevOps Technical Documentation

On behalf of my team...

Executive Summary

This document outlines the complete DevOps implementation for the HNG Portal project, covering CI/CD pipeline setup, server infrastructure configuration, and automated deployment workflows across staging and production environments. The project involved resolving critical infrastructure challenges, migrating services to stable AWS instances, and establishing robust automated deployment processes.


Table of Contents

  1. Infrastructure Setup & Migration
  2. CI/CD Pipeline Configuration
  3. Service Persistence & Auto-Recovery
  4. Deployment Architecture
  5. Troubleshooting & Resolution Log
  6. Deployment Workflow Summary
  7. Security Considerations
  8. Maintenance & Monitoring
  9. Repository & Live Application
  10. Lessons Learned & Best Practices

1. Infrastructure Setup & Migration

1.1 Critical Infrastructure Challenges & Resolution

Initial Server Stability Issues

We encountered significant infrastructure instability on the original server that affected both frontend and backend deployments. The issues were severe enough that we had to brief all stakeholders (the frontend team, the backend team, and project management) on the situation.

To maintain development velocity while resolving the infrastructure problems, we made the decision to provision separate AWS EC2 instances for the frontend and backend teams. This allowed both teams to continue their work uninterrupted while we systematically diagnosed and fixed the underlying server issues.

Once we fully resolved all the infrastructure problems, we successfully migrated everything to the stabilized production server, consolidating our deployment architecture.

Final Production Infrastructure:

Environment | Domain                   | API Domain                   | Server IP
Staging     | staging.takeda.emerj.net | api.staging.takeda.emerj.net | 54.91.35.111
Production  | takeda.emerj.net         | api.takeda.emerj.net         | 54.91.35.111
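
A quick way to confirm that all four domains point at the consolidated server is to resolve them directly; a small sketch, assuming dig is available on the workstation:

# Example DNS verification approach
dig +short takeda.emerj.net
dig +short api.takeda.emerj.net
dig +short staging.takeda.emerj.net
dig +short api.staging.takeda.emerj.net
# Each lookup should return 54.91.35.111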

1.2 EC2 Server Configuration

Server Access & Authentication

We started by setting up proper SSH access to the EC2 instances. Initially, there was some confusion over the SSH command syntax: the wrong flag was being used for key authentication. We corrected this to use the proper approach:

# Example SSH connection approach
ssh -i ~/path/to/keyfile.pem ubuntu@<server-ip>

# Proper key permissions
chmod 400 ~/path/to/keyfile.pem

Storage Management & Disk Expansion

One of the major infrastructure issues we faced was running out of disk space, which was causing deployment failures. The root filesystem was completely full at 7.6GB out of 7.6GB, blocking critical operations like Git pulls and package installations.

We resolved this by expanding the EBS volume from 8GB to 16GB and properly resizing the filesystem:

# Example approach to expanding partitions
sudo apt install cloud-guest-utils
sudo growpart /dev/device partition-number
sudo resize2fs /dev/partition

After the expansion, we verified the new capacity and confirmed we had sufficient space (8GB free) for normal operations. We also implemented cleanup procedures:

# Example cleanup approaches
sudo apt clean
sudo journalctl --vacuum-size=100M
docker system prune -a
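
A sketch of the checks we might run to confirm the expansion and cleanup took effect (device and path names are illustrative):

# Example verification approach
lsblk                      # confirm the partition now spans the enlarged volume
df -h /                    # confirm the root filesystem reports the new size and free space
sudo du -sh /var/log /var/lib/docker 2>/dev/null   # spot-check the usual space hogs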

1.3 GitHub Actions Self-Hosted Runner Setup

Runner Installation Process

We set up self-hosted runners on our EC2 instances to enable direct deployment from GitHub Actions. The initial setup had some challenges: the runner archive wasn't properly extracted, and we ran into authentication issues during registration.

# Example runner setup approach
tar xzf actions-runner-package.tar.gz
./config.sh --url https://github.com/organization/repository --token <TOKEN>
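
To keep the runner listening for jobs after the SSH session ends, it can also be installed as a service; a sketch, assuming the svc.sh helper bundled with the runner package:

# Example approach to running the runner as a service
sudo ./svc.sh install
sudo ./svc.sh start
sudo ./svc.sh status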

Authentication Issues & Resolution

We initially hit a 404 error during runner registration. This happened because the runner registration token had expired; these tokens are only valid for about an hour. We had to go back to the GitHub repository settings, generate a fresh token from the Actions runner page, and use it immediately for registration. This taught us to always use registration tokens as soon as they are generated.

Runner Verification

After successful registration, we verified the runner was showing as online in the GitHub Actions dashboard and tested that it could pick up and execute workflow jobs. We also configured deploy keys to allow secure repository access without exposing credentials.
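
A sketch of how a deploy key might be generated on the server (the key path and repository name are illustrative); the public key is then added as a read-only deploy key in the repository settings:

# Example deploy key approach
ssh-keygen -t ed25519 -f ~/.ssh/deploy_key -N "" -C "hng-portal-deploy"
cat ~/.ssh/deploy_key.pub    # add this under Settings > Deploy keys on GitHub
GIT_SSH_COMMAND="ssh -i ~/.ssh/deploy_key" git clone git@github.com:organization/repository.git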


2. CI/CD Pipeline Configuration

2.1 Backend CI Pipeline (Laravel)

Workflow Configuration Structure

We configured the backend CI pipeline to trigger on pull requests to the staging branch, specifically watching for changes in the backend directory. The pipeline sets up a complete testing environment with MySQL service containers.

# Example CI workflow structure
name: Backend CI Pipeline

on:
  pull_request:
    branches:
      - staging
    paths:
      - "backend/**"

services:
  database:
    image: mysql:version
    env:
      MYSQL_ROOT_PASSWORD: secure-password
      MYSQL_DATABASE: test-database

Pipeline Execution Steps

The backend pipeline goes through several stages (sketched in the example after this list):

  1. Environment Setup - We configure PHP with all the extensions Laravel requires, such as mbstring, pdo, pdo_mysql, tokenizer, xml, ctype, and fileinfo.

  2. Dependency Management - Composer installs all backend dependencies using production optimization flags.

  3. Application Configuration - The pipeline copies environment configuration templates and generates application keys.

  4. Database Preparation - Migrations run to set up the test database schema.

  5. Testing - The test suite executes to validate code quality before deployment.
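
The stages above roughly map to commands like the following (flags and file names are illustrative, not the exact workflow steps):

# Example backend CI steps
composer install --prefer-dist --no-progress --optimize-autoloader
cp .env.example .env
php artisan key:generate
php artisan migrate --force
php artisan test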

2.2 Frontend CI Pipeline (Next.js)

Trigger Configuration

The frontend pipeline is set up to trigger on both pull requests and direct pushes to the staging branch. We configured path-specific triggers so the pipeline only runs when there are actual frontend code changes, saving runner resources.

Build Process

The frontend build process involves the following (a condensed sketch follows the list):

  • Installing dependencies using pnpm
  • Building the Next.js application
  • Verifying build artifacts are created successfully
  • Preparing for deployment to the self-hosted runner
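
A condensed sketch of those build steps (flags are illustrative):

# Example frontend CI steps
pnpm install --frozen-lockfile
pnpm build
ls .next    # verify the Next.js build output exists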

2.3 Production CI Configuration

Branch Protection & Quality Gates

For the production branch, we implemented stricter controls. All CI checks must pass before any code can be merged. This ensures that only tested, verified code makes it to production. The deployment happens automatically when changes are successfully merged to the prod branch.


3. Service Persistence & Auto-Recovery

3.1 Backend Service Configuration (Systemd)

Production Backend Service

The backend API runs as a systemd service, which gives us powerful process management capabilities including automatic restarts on failure and startup on server reboot.

# Example systemd service structure
[Unit]
Description=Production Backend API Service
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/var/www/production/backend
ExecStart=/usr/bin/node server.js
Restart=on-failure
Environment=NODE_ENV=production
Environment=PORT=6001

[Install]
WantedBy=multi-user.target

The WantedBy=multi-user.target directive ensures the service starts automatically when the server boots. The Restart=on-failure policy means if the backend crashes for any reason, systemd will automatically restart it.

Service Management

We use standard systemd commands to manage the backend services:

# Example service management
sudo systemctl enable backend-service
sudo systemctl start backend-service
sudo systemctl status backend-service

Path Configuration Challenge

One significant issue we encountered was that systemd doesn't inherit the user's shell PATH. This caused failures when systemd tried to execute Node or pnpm commands that were installed via NVM. We had to explicitly specify full paths or configure the PATH environment variable in the service file.
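
One way to do this is with a systemd drop-in override; a minimal sketch, assuming the service name and the NVM Node version shown (both are illustrative):

# Example approach to exposing NVM-installed binaries to systemd
sudo mkdir -p /etc/systemd/system/prod-backend.service.d
sudo tee /etc/systemd/system/prod-backend.service.d/override.conf <<'EOF'
[Service]
# Path below assumes Node was installed via NVM for the ubuntu user; adjust to the real version
Environment=PATH=/home/ubuntu/.nvm/versions/node/v20.11.0/bin:/usr/local/bin:/usr/bin:/bin
EOF
sudo systemctl daemon-reload
sudo systemctl restart prod-backend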

3.2 Frontend Service Configuration (PM2)

Why PM2 for Frontend

The frontend applications run under PM2 rather than systemd. PM2 provides excellent process management for Node.js applications with features like:

  • Automatic restart on crashes
  • Built-in log management
  • Zero-downtime reloads
  • Easy clustering capabilities

PM2 Startup Configuration

We configured PM2 to start automatically on system boot:

# Example PM2 startup configuration
pm2 startup systemd
pm2 save

This generates a systemd service that launches PM2 on boot, which in turn starts all saved PM2 processes.

PM2 Troubleshooting

We encountered an interesting issue where PM2 was trying to execute the pnpm binary as if it were a Node script, causing syntax errors. The problem was that pnpm is actually a shell script, not JavaScript. We resolved this by ensuring PM2 was configured to execute the correct commands with proper interpreters.
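
A sketch of one possible correction (the app name, working directory, and exact command are assumptions); the key change is forcing a shell interpreter instead of letting PM2 treat pnpm as a Node script:

# Example PM2 correction approach
pm2 delete prod-frontend || true
pm2 start pnpm --name prod-frontend --interpreter bash \
  --cwd /var/www/production/frontend -- start
pm2 save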

3.3 PHP Composer Installation for Backend

Composer Setup

The Laravel backend required Composer for dependency management. We installed it globally so it's accessible system-wide:

# Example Composer installation approach
curl -sS https://getcomposer.org/installer -o composer-setup.php
sudo php composer-setup.php --install-dir=/usr/local/bin --filename=composer

In our case, we were also using Herd Lite, which provided its own Composer binary at a specific location. We ensured the correct binary path was being used in our deployment scripts.
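
A quick sketch of how to confirm which Composer binary is actually first on the PATH:

# Example check for the active Composer binary
which -a composer
composer --version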


4. Deployment Architecture

4.1 Port Allocation Strategy

We established a clear port allocation scheme to avoid conflicts and make service management straightforward:

Environment | Service  | Port
Production  | Frontend | 4001
Production  | Backend  | 6001
Staging     | Frontend | 4000
Staging     | Backend  | 6000

This layout makes it easy to remember: staging uses the x000 ports, production uses x001 ports. Backend services are on 6xxx, frontend on 4xxx.

4.2 Nginx Reverse Proxy Configuration

Infrastructure Recovery

We faced a critical issue where the entire Nginx installation was corrupted: the main configuration file and directory structure were missing. This was blocking all deployments since Nginx could not even start.

We had to perform a complete Nginx reinstallation:

# Example complete Nginx reinstallation approach
sudo apt purge nginx nginx-common nginx-full -y
sudo rm -rf /etc/nginx
sudo apt update && sudo apt install nginx -y

This recreated the proper directory structure with all default configuration files.

Production Server Blocks

We configured Nginx to reverse proxy to our frontend and backend services. Here's the conceptual approach for the production configuration:

# Example production frontend proxy
server {
    listen 80;
    server_name takeda.emerj.net;

    location / {
        proxy_pass http://localhost:4001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        # Additional proxy headers for client info
    }
}

# Example production backend API proxy
server {
    listen 80;
    server_name api.takeda.emerj.net;

    location / {
        proxy_pass http://localhost:6001;
        # Similar proxy configuration
    }
}

Staging Server Blocks

The staging configuration follows the same pattern but uses different domains and ports:

# Example staging frontend proxy
server {
    listen 80;
    server_name staging.takeda.emerj.net;

    location / {
        proxy_pass http://localhost:4000;
        # Proxy configuration
    }
}

# Example staging backend API proxy
server {
    listen 80;
    server_name api.staging.takeda.emerj.net;

    location / {
        proxy_pass http://localhost:6000;
        # Proxy configuration
    }
}

Site Activation

After creating the configuration files, we enabled them and verified everything was correct:

# Example site enablement approach
sudo ln -s /etc/nginx/sites-available/production /etc/nginx/sites-enabled/
sudo ln -s /etc/nginx/sites-available/staging /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

4.3 Logging Configuration

Nginx Log Setup

We needed to ensure proper log directories existed with correct permissions:

# Example log directory setup
sudo mkdir -p /var/log/nginx
sudo touch /var/log/nginx/error.log
sudo touch /var/log/nginx/access.log
sudo chown -R www-data:www-data /var/log/nginx

This allows us to monitor access patterns and debug any proxy issues.


5. Troubleshooting & Resolution Log

5.1 Critical Issues Encountered & Resolved

Issue 1: Complete Disk Full - Git Operations Failing

Symptom: Git pull commands were failing with "No space left on device" errors.

Investigation: We ran disk usage commands and found the root filesystem was at 100% capacity - completely full at 7.6GB with 0 bytes available.

Resolution: We implemented a multi-step cleanup process:

  • Cleared apt package cache to free immediate space
  • Vacuumed systemd journal logs that were consuming significant space
  • Removed unused Docker containers and images
  • Finally expanded the EBS volume and resized the filesystem as detailed earlier

Issue 2: pnpm Not Found in Systemd Context

Symptom: The backend service was failing with "Failed to locate executable /usr/bin/pnpm".

Investigation: We discovered that pnpm was installed via NVM in the user's home directory, not in a system-wide location. Systemd was looking for it at /usr/bin/pnpm which didn't exist.

Resolution: We updated the systemd service configuration to do one of the following:

  • Use the full path to the pnpm binary in the NVM directory
  • Set the PATH environment variable in the service file to include the NVM binary directory
  • Or use a shell wrapper that loads the full environment before executing commands

Issue 3: PM2 Crash Loop with Syntax Errors

Symptom: PM2 logs showed repeated "SyntaxError: missing ) after argument list" errors, with the process constantly restarting.

Investigation: We analyzed the logs and discovered that Node.js was trying to execute the pnpm shell script directly as JavaScript code, which obviously failed since pnpm is a bash script, not a Node module.

Resolution: We corrected the PM2 configuration to properly invoke pnpm through the shell rather than trying to run it as a Node script. This involved adjusting how the command was being called.

Issue 4: Nginx Configuration Completely Missing

Symptom: Running nginx -t produced "No such file or directory" errors for nginx.conf.

Investigation: We checked the /etc/nginx directory and found it was either missing entirely or corrupted: the main configuration file and critical subdirectories like sites-available and sites-enabled did not exist.

Resolution: We performed a complete purge and reinstall of Nginx. This was the only way to restore the proper directory structure and default configuration files. After reinstallation, we rebuilt our custom configurations from scratch.

Issue 5: 502 Bad Gateway from Nginx

Symptom: Visiting the application URL resulted in a 502 Bad Gateway error.

Investigation: We verified that Nginx was running properly but discovered the backend application wasn't actually listening on the expected port. The service was failing to start, and it was doing so silently.

Resolution: We used systemctl status and journalctl to examine the service logs, identified why it wasn't starting (usually environment or path issues), fixed the configuration, and restarted the service. Once the application was properly listening on its port, Nginx could successfully proxy to it.
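
A sketch of that diagnosis flow (the service name and port are illustrative):

# Example 502 diagnosis approach
sudo systemctl status prod-backend
sudo journalctl -u prod-backend -n 50 --no-pager
sudo ss -tlnp | grep 6001        # is anything listening on the expected port?
curl -I http://localhost:6001    # can the app be reached directly, bypassing Nginx?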

Issue 6: GitHub Runner 404 During Registration

Symptom: Attempting to register the self-hosted runner returned a 404 Not Found error from GitHub's API.

Investigation: This can happen for several reasons: a wrong repository URL, the wrong token type (repository vs. organization), or, most commonly, an expired token.

Resolution: We went back to the repository's Settings > Actions > Runners page, generated a fresh registration token, and immediately used it to register the runner. Registration tokens expire after about an hour, so timing is important.

Issue 7: Hostname Resolution Warnings in Sudo

Symptom: Every sudo command showed "unable to resolve host" warnings, though commands still executed.

Investigation: We found that the server's hostname wasn't properly listed in the /etc/hosts file, so the system couldn't resolve its own hostname.

Resolution: We edited /etc/hosts to include an entry mapping the hostname to 127.0.1.1, which resolved the warning messages.
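
A sketch of that fix, appending the current hostname to /etc/hosts:

# Example hostname resolution fix
echo "127.0.1.1 $(hostname)" | sudo tee -a /etc/hosts
getent hosts "$(hostname)"    # confirm the hostname now resolves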

5.2 Deployment Verification Process

After making changes, we established a verification checklist to ensure everything was working:

# Example verification commands

# Check frontend services (managed by PM2)
pm2 status

# Check backend services (managed by systemd)
sudo systemctl status staging-backend
sudo systemctl status prod-backend

# Verify Nginx configuration and status
sudo nginx -t
sudo systemctl status nginx

# Confirm services are listening on correct ports
sudo netstat -tlnp | grep -E ':(4000|4001|6000|6001)'

# Monitor live logs for issues
sudo journalctl -u prod-backend -f
pm2 logs
sudo tail -f /var/log/nginx/error.log

6. Deployment Workflow Summary

6.1 Automated Deployment Process

Staging Deployment Flow

When a developer pushes code to the staging branch, here's what happens automatically (a condensed sketch follows the list):

  1. GitHub detects the push and triggers the appropriate CI workflow
  2. Our self-hosted runner picks up the job
  3. The runner checks out the code
  4. Dependencies are installed (pnpm for frontend, composer for backend)
  5. The application is built
  6. For frontend: PM2 reloads the application with zero downtime
  7. For backend: The systemd service is restarted
  8. Nginx continues serving traffic throughout the process
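
A condensed sketch of what the staging frontend deployment might run on the self-hosted runner (paths, names, and commands are illustrative):

# Example staging frontend deployment steps
cd /var/www/staging/frontend
git pull origin staging
pnpm install --frozen-lockfile
pnpm build
pm2 reload staging-frontend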

Production Deployment Flow

Production deployments follow a similar but more controlled process (sketched after the list):

  1. Changes are merged to the prod branch after passing all staging tests
  2. The production CI workflow executes
  3. Build artifacts are generated with production optimizations
  4. Services are updated using the same reload/restart mechanisms
  5. We monitor logs immediately after deployment to catch any issues
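
A similar sketch for the production backend deployment (paths and the service name are illustrative):

# Example production backend deployment steps
cd /var/www/production/backend
git pull origin prod
composer install --no-dev --optimize-autoloader
php artisan migrate --force
sudo systemctl restart prod-backend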

6.2 Server Reboot Survival

All our services are configured to automatically start after a server reboot. This is critical for maintaining uptime during planned maintenance or unexpected restarts.

Frontend (PM2): The PM2 startup script runs at boot and launches all saved PM2 processes, including both staging and production frontends.

Backend (Systemd): The systemd services have WantedBy=multi-user.target, which means they start automatically during the normal multi-user boot sequence.

Nginx: The Nginx service is enabled and starts automatically, ready to proxy requests as soon as the backend and frontend services are available.

We verified this works by actually performing test reboots of the server and confirming all services came back online without manual intervention.
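
A sketch of that reboot test (service names are illustrative):

# Example reboot survival test
sudo reboot
# ...after reconnecting over SSH:
pm2 status
systemctl is-active nginx prod-backend staging-backend
curl -I http://localhost:4001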


7. Security Considerations

Throughout the deployment setup, we implemented several security best practices:

SSH Key Security: All SSH keys are kept with restrictive permissions (chmod 400) so only the owner can read them. This prevents unauthorized access even if someone gains access to the filesystem.

Service User Isolation: Backend services run under the www-data user rather than root, following the principle of least privilege. This limits the damage if a service is compromised.

Environment Variables: Sensitive configuration like database passwords and API keys are stored in systemd service files or PM2 ecosystem files, not in version control.

Nginx as Security Layer: The Nginx reverse proxy provides an additional security boundary between the internet and our application services. It also allows us to easily add features like rate limiting or SSL termination.

Database Isolation: The MySQL service is configured to only accept connections from localhost, preventing external database access.
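
A sketch of how that can be verified (the config path assumes a default Ubuntu MySQL install):

# Example check that MySQL only accepts local connections
sudo ss -tlnp | grep 3306
grep -R "bind-address" /etc/mysql/mysql.conf.d/mysqld.cnf   # should be 127.0.0.1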


8. Maintenance & Monitoring

8.1 Health Check Procedures

We established regular health check procedures to catch issues before they become critical:

# Example system health check approach
systemctl list-units --state=failed
df -h
free -h

# Application-specific checks
pm2 status
sudo journalctl -u prod-backend --since "1 hour ago"
sudo tail -f /var/log/nginx/error.log

8.2 Log Management

Nginx Logs: Access logs help us track traffic patterns, while error logs are crucial for debugging proxy issues.

Systemd Journal: Backend services log to the systemd journal, which we can query with journalctl. We keep these rotated to prevent disk space issues.

PM2 Logs: Frontend logs are managed by PM2, which provides easy access through the pm2 logs command and automatically rotates log files.

8.3 Backup & Recovery

While not explicitly detailed in the troubleshooting above, proper backup procedures are essential. We ensure the following (a sketch appears after the list):

  • Regular database backups
  • Code is in version control (not relying on server copies)
  • Configuration files are documented (like in this document)
  • AMI snapshots of the EC2 instance for disaster recovery
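
A sketch of what those backups might look like (the database name, credentials, and instance ID are illustrative):

# Example backup approaches
mysqldump -u backup_user -p --single-transaction hng_portal > hng_portal_$(date +%F).sql
aws ec2 create-image --instance-id i-0123456789abcdef0 \
  --name "hng-portal-$(date +%F)" --no-reboot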

9. Repository & Live Application

Repository Structure:

  • Staging Branch CI: .github/workflows/backend-ci-staging.yml
  • Production Branch CI: Configured with appropriate branch protection rules
  • Deploy keys configured for secure repository access
  • Self-hosted runner active and registered

Live Application URLs:

  • Staging Frontend: staging.takeda.emerj.net
  • Staging Backend API: api.staging.takeda.emerj.net
  • Production Frontend: takeda.emerj.net
  • Production Backend API: api.takeda.emerj.net

Server Infrastructure:

  • All services hosted on: 54.91.35.111
  • Multiple environment isolation via port allocation
  • Services configured for automatic startup and failure recovery

10. Lessons Learned & Best Practices

Always Use Fresh Tokens: We learned that GitHub runner registration tokens expire quickly. Generate and use them immediately.

Test Reboot Survival: Don't assume services will start on reboot - actually test it by restarting the server.

Monitor Disk Space: Many mysterious issues trace back to full disks. Regular monitoring and cleanup procedures are essential.

Document Port Allocations: Clear port allocation schemes prevent conflicts and make troubleshooting much easier.

Systemd PATH Matters: Never assume systemd services inherit your shell environment. Always explicitly configure PATH and environment variables.

Keep Nginx Configs Simple: Start with basic working configurations and add complexity gradually. This makes debugging much easier.


Conclusion

This DevOps implementation successfully delivers:

  • Fully automated CI/CD pipelines for both staging and production environments
  • Service persistence across reboots using systemd and PM2 with proper startup configurations
  • Scalable architecture with clear environment separation
  • Comprehensive logging for troubleshooting and monitoring
  • Robust error recovery with automatic restart mechanisms
  • Production-ready infrastructure that survived rigorous testing including server reboots

The infrastructure migration from unstable servers to the current consolidated instance was complex but ultimately successful. We maintained development velocity by providing temporary separate environments while systematically resolving the infrastructure issues. The final architecture is production-ready, well documented, and built to withstand the real-world challenges of maintaining uptime and deployability.

P.S.: All code snippets illustrate the thought process behind each approach rather than the exact commands and configuration we used. :)
