Solved: Backup All GitHub Repositories to S3 Bucket Automatically

🚀 Executive Summary

TL;DR: This guide provides an automated, cost-effective solution to mitigate data loss risks by backing up all GitHub repositories. It leverages a Python script to clone repositories, archive them, and upload them to an AWS S3 bucket, orchestrated by GitHub Actions for scheduled execution.

🎯 Key Takeaways

  • The solution uses a Python script with requests for GitHub API interaction, subprocess for Git operations, shutil for archiving, and boto3 for S3 uploads.
  • Authentication requires a GitHub Personal Access Token (PAT) with repo scope and AWS IAM user credentials (Access Key ID, Secret Access Key) with s3:PutObject, s3:ListBucket, and s3:GetObject permissions on the target S3 bucket.
  • Automation is achieved via GitHub Actions, configured with a cron schedule for daily backups and workflow_dispatch for manual triggers, securely passing sensitive credentials as repository secrets.

As a Senior DevOps Engineer and Technical Writer for TechResolve, I understand the critical importance of data resilience. In today’s cloud-native landscape, while platforms like GitHub offer high availability, relying solely on a single vendor for your invaluable source code can be a significant risk. Disasters, accidental deletions, or even account compromises can lead to irreparable data loss if not properly safeguarded.

Manual backups are tedious, error-prone, and often overlooked, especially in fast-paced development environments. The cost of specialized third-party backup solutions can also be prohibitive for many teams. This tutorial addresses these challenges by providing a robust, automated, and cost-effective solution to back up all your GitHub repositories directly to an Amazon S3 bucket.

By the end of this guide, you will have a fully automated system that regularly pulls all your GitHub repositories and archives them securely in S3, giving you peace of mind, improved disaster recovery capabilities, and full control over your code’s backups.

Prerequisites

Before we dive into the automation, ensure you have the following in place:

  • GitHub Account: Access to the repositories you wish to back up.
  • GitHub Personal Access Token (PAT): With appropriate scopes. For private repositories, the repo scope (all sub-options) is required. For public repositories only, public_repo is sufficient. We recommend generating a token specifically for this backup process.
  • AWS Account: With permissions to create S3 buckets and IAM users/roles.
  • AWS S3 Bucket: A bucket configured in your AWS account to store the backups.
  • AWS IAM User or Role: With programmatic access (Access Key ID and Secret Access Key) and permissions to perform s3:PutObject and s3:ListBucket actions on the designated S3 bucket.
  • Python 3.x: Installed on the system where the script will run (or implicitly available in a GitHub Actions runner).
  • pip: Python’s package installer.
  • Git: Command-line Git installed (also implicitly available in GitHub Actions runners).

Step-by-Step Guide: Automating Your GitHub Backups

Step 1: Create a GitHub Personal Access Token (PAT)

This token will allow our script to authenticate with GitHub’s API and clone your repositories. Treat it like a password.

  1. Go to your GitHub profile settings.
  2. Navigate to “Developer settings” > “Personal access tokens” > “Tokens (classic)”.
  3. Click “Generate new token” > “Generate new token (classic)”.
  4. Give it a descriptive name (e.g., “S3-Backup-Script”).
  5. Set an appropriate expiration (e.g., 90 days, 1 year, or no expiration if managed securely). Remember to rotate it regularly.
  6. Under “Select scopes”, check repo (all sub-options) to ensure it can access both public and private repositories. If you only have public repos to back up, public_repo will suffice.
  7. Click “Generate token”.
  8. IMPORTANT: Copy the token immediately. You will not be able to see it again. Store it securely.
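
To confirm the new token authenticates before wiring it into automation, a quick sanity check is to call GitHub’s /user endpoint with it. The snippet below is a minimal sketch; it assumes the token has been exported as the GITHUB_TOKEN environment variable (a name chosen here for convenience, matching the backup script later in this guide).

import os
import requests

# Minimal PAT sanity check: ask the GitHub API who we are authenticated as.
token = os.getenv('GITHUB_TOKEN')  # export your new PAT under this name first
response = requests.get(
    'https://api.github.com/user',
    headers={'Authorization': f'token {token}'}
)
response.raise_for_status()  # raises if the token is invalid or expired
print(f"Token OK - authenticated as {response.json()['login']}")

If this prints your GitHub username, the token is working; scope problems usually only surface later, when cloning private repositories.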

Step 2: Configure AWS S3 Bucket and IAM Permissions

We need an S3 bucket to store the archives and an IAM entity with the necessary permissions.

  1. Create an S3 Bucket: If you don’t have one, navigate to the S3 service in your AWS Console and create a new bucket. Choose a unique name and region. For security, consider enabling server-side encryption and versioning on the bucket.
  2. Create an IAM User (or Role):

    1. Go to the IAM service in your AWS Console.
    2. Navigate to “Users” > “Add user”.
    3. Give the user a name (e.g., “github-backup-user”). If your console version still offers a “Programmatic access” option, select it; in newer versions of the IAM console, you instead create an access key after the user exists, under the user’s “Security credentials” tab.
    4. On the “Permissions” page, choose “Attach existing policies directly” and then “Create policy”.
    5. Use the JSON tab to define a policy like this, replacing your-s3-bucket-name with your actual bucket name:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "s3:PutObject",
                      "s3:ListBucket",
                      "s3:GetObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::your-s3-bucket-name/*",
                      "arn:aws:s3:::your-s3-bucket-name"
                  ]
              }
          ]
      }
    

    This policy grants permissions to upload objects, list the bucket content, and retrieve objects (for verification if needed).

    6. Save the policy, then attach it to your newly created IAM user.
    7. Complete the user creation process. Copy the Access Key ID and Secret Access Key. Store them securely.
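
Before handing these credentials to the backup job, it can be worth sanity-checking that they actually reach the bucket. The sketch below is a minimal check, assuming the keys are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and that the bucket name and region placeholders are replaced with your own values.

import boto3
from botocore.exceptions import ClientError

BUCKET = 'your-s3-bucket-name'   # placeholder - use your real bucket name
REGION = 'us-east-1'             # placeholder - use your bucket's region

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
s3 = boto3.client('s3', region_name=REGION)

try:
    s3.head_bucket(Bucket=BUCKET)  # needs s3:ListBucket on the bucket
    s3.put_object(Bucket=BUCKET, Key='github-backups/.access-check', Body=b'')  # needs s3:PutObject
    print('Bucket is reachable and writable with these credentials.')
except ClientError as e:
    print(f'S3 access check failed: {e}')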

Step 3: Develop the Backup Script (Python)

This Python script will fetch your repositories, clone them, create archives, and upload them to S3. Create a file named backup_github.py.

First, install the necessary Python libraries:

pip install requests boto3

Now, here’s the Python script:

import os
import requests
import subprocess
import shutil
import datetime
import boto3
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# --- Configuration from Environment Variables ---
GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY')
S3_BUCKET_NAME = os.getenv('S3_BUCKET_NAME')
AWS_REGION = os.getenv('AWS_REGION', 'us-east-1') # Default to us-east-1 if not set

# Directory to temporarily store cloned repos
TEMP_DIR = 'github_backup_temp'

# --- GitHub API Functions ---
def get_user_repos(token):
    headers = {'Authorization': f'token {token}'}
    repos = []
    page = 1
    while True:
        response = requests.get(f'https://api.github.com/user/repos?type=all&per_page=100&page={page}', headers=headers)
        response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
        page_repos = response.json()
        if not page_repos:
            break
        repos.extend(page_repos)
        page += 1
    return repos

# --- Git Operations ---
def clone_repo(repo_url, local_path, token):
    try:
        # For private repos, embed the token in the URL
        # For public repos, the token is not strictly needed for cloning, but doesn't hurt.
        auth_repo_url = repo_url.replace('https://', f'https://oauth2:{token}@')
        logging.info(f"Cloning {repo_url} to {local_path}...")
        subprocess.run(['git', 'clone', auth_repo_url, local_path], check=True, capture_output=True)
        logging.info(f"Successfully cloned {repo_url}")
    except subprocess.CalledProcessError as e:
        logging.error(f"Failed to clone {repo_url}. Error: {e.stderr.decode().strip()}")
        raise

# --- Archiving ---
def create_archive(source_dir, output_filename):
    logging.info(f"Creating archive for {source_dir}...")
    # shutil.make_archive supports several formats (e.g. 'zip', 'tar', 'gztar').
    # We explicitly specify 'zip' for cross-platform consistency.
    archive_name = shutil.make_archive(output_filename, 'zip', source_dir)
    logging.info(f"Archive created: {archive_name}")
    return archive_name

# --- S3 Operations ---
def upload_to_s3(file_path, bucket_name, s3_key, region):
    logging.info(f"Uploading {file_path} to s3://{bucket_name}/{s3_key}...")
    try:
        s3 = boto3.client(
            's3',
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=region
        )
        s3.upload_file(file_path, bucket_name, s3_key)
        logging.info(f"Successfully uploaded {file_path} to S3.")
    except Exception as e:
        logging.error(f"Failed to upload {file_path} to S3. Error: {e}")
        raise

# --- Main Backup Logic ---
def main():
    if not all([GITHUB_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET_NAME]):
        logging.error("Missing one or more required environment variables (GITHUB_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET_NAME). Exiting.")
        exit(1)

    # Clean up any previous temp directory before starting
    if os.path.exists(TEMP_DIR):
        shutil.rmtree(TEMP_DIR)
        logging.info(f"Cleaned up previous temporary directory: {TEMP_DIR}")
    os.makedirs(TEMP_DIR, exist_ok=True)

    try:
        repos = get_user_repos(GITHUB_TOKEN)
        logging.info(f"Found {len(repos)} repositories to back up.")

        backup_date = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')

        for repo in repos:
            repo_name = repo['name']
            repo_clone_url = repo['clone_url']
            repo_owner = repo['owner']['login']
            local_repo_path = os.path.join(TEMP_DIR, repo_owner, repo_name)

            archive_full_path = None  # reset per repo so cleanup only touches this iteration's archive
            try:
                # Clone the repository
                clone_repo(repo_clone_url, local_repo_path, GITHUB_TOKEN)

                # Create a zip archive
                archive_base_name = f"{repo_owner}-{repo_name}-{backup_date}"
                archive_full_path = create_archive(local_repo_path, os.path.join(TEMP_DIR, archive_base_name))

                # Define S3 key (path in S3)
                s3_key = f"github-backups/{repo_owner}/{repo_name}/{os.path.basename(archive_full_path)}"

                # Upload to S3
                upload_to_s3(archive_full_path, S3_BUCKET_NAME, s3_key, AWS_REGION)

            except Exception as e:
                logging.error(f"An error occurred while processing repository {repo_name}: {e}")
            finally:
                # Clean up local clone and archive
                if os.path.exists(local_repo_path):
                    shutil.rmtree(local_repo_path)
                    logging.info(f"Cleaned up local clone of {repo_name}")
                if archive_full_path and os.path.exists(archive_full_path):
                    os.remove(archive_full_path)
                    logging.info(f"Cleaned up local archive of {repo_name}")

    except requests.exceptions.HTTPError as e:
        logging.error(f"GitHub API Error: {e}. Check your GITHUB_TOKEN and its permissions.")
    except Exception as e:
        logging.error(f"An unexpected error occurred during the backup process: {e}")
    finally:
        # Final cleanup of the main temporary directory
        if os.path.exists(TEMP_DIR):
            shutil.rmtree(TEMP_DIR)
            logging.info(f"Final cleanup: Removed temporary directory {TEMP_DIR}")

if __name__ == '__main__':
    main()

Code Logic Explanation:

  • Environment Variables: The script relies on environment variables for sensitive credentials (GitHub Token, AWS Keys) and configuration (S3 Bucket, AWS Region). This is a best practice for security.
  • get_user_repos(token): Fetches all repositories associated with the authenticated GitHub user. It handles pagination to ensure all repositories are retrieved.
  • clone_repo(...): Uses the git clone command via Python’s subprocess module. For private repositories, the GitHub PAT is embedded in the clone URL (e.g., https://oauth2:YOUR_TOKEN@github.com/owner/repo.git) for authentication.
  • create_archive(...): Utilizes Python’s shutil.make_archive to create a ZIP archive of the cloned repository. We use ZIP for broad compatibility.
  • upload_to_s3(...): Employs the boto3 library to connect to AWS S3 and upload the generated archive file. Credentials are passed explicitly when the client is initialized, which works in environments without an AWS CLI configuration, such as GitHub Actions runners that don’t use a dedicated AWS credentials action.
  • Main Loop: Iterates through each fetched repository, clones it into a temporary directory, archives it, uploads the archive to S3 with a descriptive key (path), and then cleans up the temporary local files.
  • Error Handling & Logging: Includes basic error handling for API calls, Git operations, and S3 uploads, with informative logging messages to track progress and issues.
  • Cleanup: Ensures that local temporary directories and archives are removed after each repository is processed and at the end of the script run.

Step 4: Automate with GitHub Actions

GitHub Actions is an excellent choice for this automation as it’s tightly integrated with GitHub and provides a robust, serverless environment to run your script on a schedule.

  1. Store Secrets: Go to your GitHub repository (or organization) > “Settings” > “Secrets and variables” > “Actions” > “New repository secret”.

Add the following secrets:

  • USER_GITHUB_PAT: Your GitHub Personal Access Token created in Step 1. (Note: GitHub Actions does not allow secret names starting with GITHUB_, so use a different name than GITHUB_TOKEN; this also avoids confusion with the runner’s default token.)
  • AWS_ACCESS_KEY_ID: Your AWS Access Key ID from Step 2.
  • AWS_SECRET_ACCESS_KEY: Your AWS Secret Access Key from Step 2.
  • S3_BUCKET_NAME: The name of your S3 bucket.
  • AWS_REGION: The AWS region of your S3 bucket (e.g., us-east-1).
  2. Create Workflow File: In your GitHub repository, create a directory .github/workflows/ and inside it, create a file named backup.yml.
  3. Add Workflow Content: Paste the following YAML into .github/workflows/backup.yml:
   name: Daily GitHub Repo Backup to S3

   on:
     schedule:
       # Runs every day at 02:00 AM UTC
       - cron: '0 2 * * *'
     workflow_dispatch:
       # Allows manual trigger of the workflow

   jobs:
     backup_repositories:
       runs-on: ubuntu-latest

       env:
         GITHUB_TOKEN: ${{ secrets.USER_GITHUB_PAT }}
         AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
         AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
         S3_BUCKET_NAME: ${{ secrets.S3_BUCKET_NAME }}
         AWS_REGION: ${{ secrets.AWS_REGION }}

       steps:
       - name: Checkout repository (optional, if script is in this repo)
         uses: actions/checkout@v4

       - name: Set up Python
         uses: actions/setup-python@v5
         with:
           python-version: '3.x' # Specify your preferred Python version, e.g., '3.9'

       - name: Install Python dependencies
         run: |
           python -m pip install --upgrade pip
           pip install requests boto3

       - name: Run GitHub backup script
         run: |
           # If your script is directly in the repo root:
           python backup_github.py
           # If your script is in a subdirectory, e.g., 'scripts/':
           # python scripts/backup_github.py

Workflow Logic Explanation:

  • on: schedule: This defines when the workflow will run automatically. The cron: '0 2 * * *' expression means it will run daily at 02:00 AM UTC. You can adjust this to your needs.
  • workflow_dispatch: Adds a button to manually trigger the workflow from the GitHub Actions tab, useful for testing.
  • env:: All the secrets we defined earlier are passed into the job as environment variables, making them accessible to our Python script.
  • uses: actions/checkout@v4: This step is necessary if your backup_github.py script is located within the same GitHub repository where you’re setting up the Action.
  • uses: actions/setup-python@v5: Configures a Python environment on the runner.
  • pip install ...: Installs the required Python libraries.
  • python backup_github.py: Executes your backup script.

Commit this backup.yml file to your repository. The GitHub Action will now automatically run based on your schedule, backing up all your repositories to S3!

Common Pitfalls

  • GitHub API Rate Limits: If you have an extremely large number of repositories or run the script too frequently, you might hit GitHub’s API rate limits. The script surfaces API errors via response.raise_for_status() but does not retry; for high-volume use, add exponential backoff (see the sketch after this list) or space out runs.
  • Authentication Errors (GitHub PAT or AWS Credentials):
    • GitHub: Double-check your USER_GITHUB_PAT secret. Ensure it has the correct repo scopes. If cloning fails for private repos, it’s almost always a token issue.
    • AWS: Verify your AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and the IAM policy. Ensure the policy grants s3:PutObject and s3:ListBucket on the specific S3 bucket.
    • S3 Bucket Name/Region: Ensure S3_BUCKET_NAME and AWS_REGION are exactly correct.
  • Large Repositories or Many Repos: Cloning large repositories or a very high number of repositories can take time and consume disk space on the GitHub Actions runner. GitHub Actions runners have ample space for typical use, but extreme cases might require optimization or a self-hosted runner.
  • Backups Not Appearing in S3: Verify the S3 bucket exists and that S3_BUCKET_NAME matches its name exactly. If the bucket lives in a different region than AWS_REGION specifies, requests may fail with a redirect or signature error.
  • Timeout: The GitHub Actions job might time out if the backup takes longer than the allowed job duration (default 6 hours for standard runners). Consider breaking down the backup, optimizing the script, or using self-hosted runners for very extensive backups.
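
If you do run into rate limiting, a lightweight mitigation is to wrap the GitHub API calls in a retry helper with exponential backoff. The helper below is a hypothetical drop-in replacement for the requests.get call inside get_user_repos; the function name and retry parameters are illustrative, not part of the original script.

import time
import requests

def get_with_backoff(url, headers, max_retries=5):
    """GET a GitHub API URL, backing off exponentially on rate limits or server errors."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code in (403, 429, 500, 502, 503):
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
            continue
        response.raise_for_status()
        return response
    # All retries exhausted: surface whatever error the last attempt produced
    response.raise_for_status()
    return response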

Conclusion

Congratulations! You’ve successfully set up an automated, cost-effective, and robust system to back up all your GitHub repositories to an AWS S3 bucket. This solution provides a vital layer of protection against data loss, ensuring your valuable source code is always safe and accessible, independent of GitHub’s operational status.

This automated process frees up your team from tedious manual tasks, allowing them to focus on innovation while enjoying the peace of mind that comes with a solid disaster recovery strategy.

What’s Next?

  • S3 Lifecycle Policies: Configure S3 lifecycle rules to automatically transition older backups to cheaper storage classes (like S3 Glacier) or expire them after a certain period to manage costs and data retention.
  • Backup Encryption: While S3 offers server-side encryption by default, consider client-side encryption for an extra layer of security before uploading backups.
  • Organization Repositories: Swap get_user_repos for a get_org_repos(org_name, token) variant that uses the https://api.github.com/orgs/{org_name}/repos endpoint if you need to back up repositories belonging to a GitHub organization (see the sketch after this list).
  • Backup Other GitHub Data: Extend the script to back up other critical GitHub data such as Gists, wikis, or Issues (via their respective APIs).
  • Monitoring and Alerting: Set up AWS CloudWatch alarms on your S3 bucket (e.g., for object creation) or monitor GitHub Actions workflow runs to ensure your backups are consistently succeeding.
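
For the organization case, a minimal sketch of that variant is below. It mirrors get_user_repos from the backup script, with the organization name as a parameter; the function name is illustrative and the pagination logic is unchanged.

import requests

def get_org_repos(org_name, token):
    """Fetch all repositories belonging to a GitHub organization, following pagination."""
    headers = {'Authorization': f'token {token}'}
    repos = []
    page = 1
    while True:
        response = requests.get(
            f'https://api.github.com/orgs/{org_name}/repos?type=all&per_page=100&page={page}',
            headers=headers
        )
        response.raise_for_status()
        page_repos = response.json()
        if not page_repos:
            break
        repos.extend(page_repos)
        page += 1
    return repos

In main(), you would then call get_org_repos('your-org-name', GITHUB_TOKEN) instead of get_user_repos(GITHUB_TOKEN); the rest of the clone/archive/upload loop stays the same.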

Stay vigilant, stay secure, and keep innovating with TechResolve!


Darian Vance

👉 Read the original article on TechResolve.blog

