DEV Community

kaugesaar

Posted on • Originally published at kaugesaar.se

Running Screaming Frog on GCP with Cloud Run Jobs

Running Screaming Frog on a VM means paying for idle time between crawls. Cloud Run Jobs let you spin up a container, run the crawl, and shut down - you only pay for actual compute time.

At Precis, we built an internal service that runs Screaming Frog crawls at scale using GCP. This post walks through the core setup - a simplified version you can deploy in about 30 minutes.

For this proof of concept, we'll use Cloud Run Jobs, Cloud Storage, and a simple Dockerfile with a bash script as its entrypoint.

What we're building

The setup is straightforward:

  1. Dockerfile that installs Screaming Frog
  2. Entrypoint script that runs the crawl
  3. Cloud Run Job that executes the container
  4. GCS bucket to store exports

Prerequisites

You'll need:

  • Google Cloud Project with billing enabled
  • gcloud CLI installed and configured
  • Screaming Frog license
  • Basic Docker knowledge

Setting up GCP resources

These next steps assume you have the gcloud CLI installed and that you're somewhat familiar with GCP.

Note that many of the steps below require these two environment variables to be set:

PROJECT_ID="your-gcp-project"
REGION="your-preferred-region"
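Since every later command interpolates these variables, it helps to fail fast if one is missing. A small sketch using bash's `:?` expansion (the values are placeholders, as above):

```shell
#!/bin/bash
# Example values — replace with your own project and region.
PROJECT_ID="your-gcp-project"
REGION="your-preferred-region"

# ':?' aborts the script with the given message if the variable is
# unset or empty, so a forgotten export fails immediately instead of
# producing half-configured gcloud commands.
: "${PROJECT_ID:?PROJECT_ID must be set}"
: "${REGION:?REGION must be set}"

BUCKET="gs://${PROJECT_ID}-crawl-output"
echo "Deploying to ${REGION}, exports go to ${BUCKET}"
```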

Enable the required APIs:

gcloud services enable artifactregistry.googleapis.com run.googleapis.com

Create a storage bucket for the CSV exports that Screaming Frog will generate. We'll later mount this bucket as a volume on the Cloud Run Job.

gsutil mb -l ${REGION} gs://${PROJECT_ID}-crawl-output

Create a service account for the job and give it permission to read, write and delete files:

gcloud iam service-accounts create screaming-frog-runner \
    --display-name="ScreamingFrog Runner"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:screaming-frog-runner@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

Store your Screaming Frog license and a persistent machine ID in Secret Manager:

# Enable Secret Manager API
gcloud services enable secretmanager.googleapis.com

# Create license secret
echo -n "YOUR-LICENSE-KEY" | gcloud secrets create screaming-frog-license \
    --data-file=-

# Create a persistent machine ID
uuidgen | gcloud secrets create screaming-frog-machine-id \
    --data-file=-

# Grant access to service account
gcloud secrets add-iam-policy-binding screaming-frog-license \
    --member="serviceAccount:screaming-frog-runner@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="roles/secretmanager.secretAccessor"

gcloud secrets add-iam-policy-binding screaming-frog-machine-id \
    --member="serviceAccount:screaming-frog-runner@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="roles/secretmanager.secretAccessor"
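Before moving on, it's worth confirming the secrets round-trip correctly and that the bindings took effect. These are standard gcloud commands against the secrets created above:

```shell
# Read back the stored license (this prints the key — avoid in shared terminals)
gcloud secrets versions access latest --secret=screaming-frog-license

# Confirm the machine ID was stored
gcloud secrets versions access latest --secret=screaming-frog-machine-id

# List who can access the license secret — the runner service account
# should appear with roles/secretmanager.secretAccessor
gcloud secrets get-iam-policy screaming-frog-license
```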

Building the Docker image

Create a Dockerfile:

FROM ubuntu:22.04

# Install dependencies
RUN apt-get update && apt-get install -y \
    openjdk-21-jre \
    xvfb \
    wget \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Download and install Screaming Frog
RUN wget -O /tmp/screamingfrog.deb \
    https://download.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_23.2_all.deb && \
    apt-get update && \
    apt-get install -y /tmp/screamingfrog.deb && \
    rm /tmp/screamingfrog.deb

# Copy entrypoint script
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]

A few things to note:

  • Ubuntu 22.04: Screaming Frog is distributed as a .deb package, so any Debian-based image should work fine
  • xvfb: Screaming Frog requires a display even in CLI mode; xvfb provides a virtual one so it can run fully headless

Now create entrypoint.sh:

#!/bin/bash
set -e

# Fail fast if no URL was provided
URL="${CRAWL_URL:?CRAWL_URL must be set}"
OUTPUT_BASE="${OUTPUT_DIR:-/mnt/crawl-output}"

# Create a per-run subfolder: domain/YYYY-MM-DD_HHMMSS
DOMAIN=$(echo "$URL" | sed -E 's|https?://([^/]+).*|\1|')
TIMESTAMP=$(date -u +"%Y-%m-%d_%H%M%S")
OUTPUT_DIR="${OUTPUT_BASE}/${DOMAIN}/${TIMESTAMP}"
mkdir -p "$OUTPUT_DIR"

# Setup license, machine ID, and EULA
mkdir -p /root/.ScreamingFrogSEOSpider
echo "${SF_LICENSE}" > /root/.ScreamingFrogSEOSpider/licence.txt
echo "${SF_MACHINE_ID}" > /root/.ScreamingFrogSEOSpider/machine-id.txt
cat > /root/.ScreamingFrogSEOSpider/spider.config << 'EOF'
eula.accepted=15
EOF

# Run the crawl
xvfb-run screamingfrogseospider \
    --headless \
    --crawl "$URL" \
    --output-folder "$OUTPUT_DIR" \
    --export-tabs "Internal:All,External:All,Response Codes:All,Page Titles:All,Meta Description:All,H1:All,Images:All" \
    --overwrite \
    --save-crawl

echo "Crawl completed successfully"

The script does four things:

  1. License setup: Writes the license key we stored in Secret Manager to the path Screaming Frog expects.
  2. Machine ID: Writes the persistent UUID so each container run identifies as the same machine.
  3. EULA acceptance: Required for it to run.
  4. Runs the crawl: xvfb-run provides the virtual display, and we export a few selected tabs - feel free to edit.

So now in your directory you should have:

.
├── Dockerfile
└── entrypoint.sh
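Before deploying, you can dry-run the container locally, assuming Docker is installed and you have your license key at hand. A local folder stands in for the GCS mount, and the image tag here is just an illustrative choice:

```shell
# Build the image from the Dockerfile above
docker build -t screaming-frog-crawler .

# Run a test crawl; SF_LICENSE / SF_MACHINE_ID carry the same values
# you stored in Secret Manager, and ./local-output replaces the bucket.
mkdir -p ./local-output
docker run --rm \
    -e CRAWL_URL="https://example.com" \
    -e OUTPUT_DIR="/mnt/crawl-output" \
    -e SF_LICENSE="YOUR-LICENSE-KEY" \
    -e SF_MACHINE_ID="$(uuidgen)" \
    -v "$(pwd)/local-output:/mnt/crawl-output" \
    screaming-frog-crawler
```

If the crawl succeeds, the CSV exports appear under ./local-output/example.com/.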

Deploying the Cloud Run Job

Deploy the job directly from source (this builds the image using Cloud Build and deploys in one command):

gcloud run jobs deploy screaming-frog-crawler \
    --source . \
    --region=${REGION} \
    --service-account=screaming-frog-runner@${PROJECT_ID}.iam.gserviceaccount.com \
    --cpu=4 \
    --memory=16Gi \
    --max-retries=0 \
    --task-timeout=3600 \
    --set-env-vars=OUTPUT_DIR=/mnt/crawl-output \
    --set-secrets=SF_LICENSE=screaming-frog-license:latest,SF_MACHINE_ID=screaming-frog-machine-id:latest \
    --add-volume name=crawl-storage,type=cloud-storage,bucket=${PROJECT_ID}-crawl-output \
    --add-volume-mount volume=crawl-storage,mount-path=/mnt/crawl-output

This command will:

  • Build your Docker image using Cloud Build
  • Push it to Artifact Registry automatically
  • Create (or update) the Cloud Run Job
  • Mount the bucket we created as a volume

Some notes on configuration:

I found 4 CPU / 16GB RAM to be a good starting point for most crawls. Scale up to 8 CPU / 32GB for large sites (100K+ URLs).
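Resizing doesn't require redeploying from source; the existing job can be updated in place:

```shell
# Bump the job to 8 CPU / 32GB for a large-site crawl
gcloud run jobs update screaming-frog-crawler \
    --region=${REGION} \
    --cpu=8 \
    --memory=32Gi
```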

Important: Screaming Frog periodically checks available disk space and stops the crawl if it detects 5GB or less remaining. On Cloud Run, available memory serves as disk space - there's no separate disk allocation, even with the GCS mount. So while 2 CPU / 8GB technically works, you'll be cutting it close with Screaming Frog's 5GB threshold.

Cost: With 4 CPU / 16GB, expect roughly $0.15-0.20 per hour of crawl time. A typical 10K URL crawl takes 15-30 minutes, so around $0.05-0.10 per crawl. Storage costs are negligible for CSV exports.

Running crawls

Manual execution:

gcloud run jobs execute screaming-frog-crawler \
    --region=${REGION} \
    --update-env-vars=CRAWL_URL=https://example.com
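For more than a handful of sites, a small wrapper loop works: `--wait` blocks until each execution finishes, so crawls run one at a time instead of competing for quota. A sketch (the URL list is illustrative):

```shell
#!/bin/bash
set -e

# Illustrative list — replace with your own sites
URLS=(
  "https://example.com"
  "https://example.org"
)

for url in "${URLS[@]}"; do
  echo "Starting crawl for ${url}"
  # --update-env-vars sets CRAWL_URL on the job, --wait blocks until done
  gcloud run jobs execute screaming-frog-crawler \
      --region=${REGION} \
      --update-env-vars=CRAWL_URL="${url}" \
      --wait
done
```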

Check execution status:

gcloud run jobs executions list \
    --job=screaming-frog-crawler \
    --region=${REGION}

View logs:

gcloud logging read \
    "resource.type=cloud_run_job AND resource.labels.job_name=screaming-frog-crawler" \
    --limit=50 \
    --format=json

Accessing crawl results

You can browse and download files directly from the GCP Console by navigating to your bucket. Or use the CLI:

First, create a local folder to store the files:

mkdir -p ./crawl-output/

List crawl outputs:

gsutil ls gs://${PROJECT_ID}-crawl-output/

Download the latest crawl for a domain:

LATEST=$(gsutil ls gs://${PROJECT_ID}-crawl-output/domain.tld/ | sort | tail -1)
gsutil -m cp -r "${LATEST}*" ./crawl-output/

Or download all crawls:

gsutil -m cp -r gs://${PROJECT_ID}-crawl-output/* ./crawl-output/

The output directory includes:

  • internal_all.csv - All internal URLs discovered
  • external_all.csv - External links
  • response_codes_all.csv - HTTP status codes
  • page_titles_all.csv - Page titles
  • meta_description_all.csv - Meta descriptions
  • h1_all.csv - H1 tags
  • images_all.csv - Image inventory
  • crawl.seospider - Full crawl file (open in Screaming Frog GUI)
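Once downloaded, the CSVs are easy to query from the shell. A sketch that counts non-200 responses by looking up the "Status Code" column by name — the sample file here is illustrative (real exports have many more columns), and the naive comma split assumes no quoted commas in the columns used:

```shell
# Illustrative sample — a real response_codes_all.csv has more columns
cat > response_codes_all.csv <<'EOF'
Address,Status Code,Status
https://example.com/,200,OK
https://example.com/old,301,Moved Permanently
https://example.com/missing,404,Not Found
EOF

# Find the "Status Code" column in the header row, then count rows != 200
NON_200=$(awk -F',' '
  NR==1 { for (i=1; i<=NF; i++) if ($i=="Status Code") col=i; next }
  $col != 200 { n++ }
  END { print n+0 }
' response_codes_all.csv)

echo "Non-200 responses: ${NON_200}"
```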

Screaming Frog output example

What's next

This gives you a working setup for running Screaming Frog crawls serverless. From here, you could:

  • Add configuration files: Use Screaming Frog's config files to standardize crawl settings across executions
  • Implement progress tracking: Parse log output to report crawl progress in real-time
  • Build a web UI: Create a simple interface for managing crawls and viewing results (hint: this is what we built)
  • Add notifications: Send alerts when crawls complete or fail
  • Track changes: Compare crawls over time to detect new issues

Once deployed, crawls run unattended and you only pay for what you use.
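As a starting point for unattended runs, Cloud Scheduler can trigger the job on a cron schedule via the Cloud Run Admin API's :run endpoint. A sketch, assuming the job and service account from above; the scheduler job name and schedule are illustrative, and scheduled executions will crawl whatever CRAWL_URL is currently set on the job:

```shell
# Enable Cloud Scheduler
gcloud services enable cloudscheduler.googleapis.com

# Let the service account trigger the job
gcloud run jobs add-iam-policy-binding screaming-frog-crawler \
    --region=${REGION} \
    --member="serviceAccount:screaming-frog-runner@${PROJECT_ID}.iam.gserviceaccount.com" \
    --role="roles/run.invoker"

# Weekly crawl, Mondays at 06:00 UTC
gcloud scheduler jobs create http weekly-crawl \
    --location=${REGION} \
    --schedule="0 6 * * 1" \
    --uri="https://run.googleapis.com/v2/projects/${PROJECT_ID}/locations/${REGION}/jobs/screaming-frog-crawler:run" \
    --http-method=POST \
    --oauth-service-account-email="screaming-frog-runner@${PROJECT_ID}.iam.gserviceaccount.com"
```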
