Hector Flores

Posted on May 29 • Originally published at htek.dev

The Definitive GitHub Actions Debugging Guide: 65+ Real Errors and How to Fix Them

#githubactions #devops #cicd #security

GitHub Actions is the CI/CD backbone for millions of repositories. It's also the source of some of the most confusing, silent, and undocumented failure modes in modern DevOps.

I've spent years debugging Actions workflows, first across 500+ repository migrations at enterprise scale, then building agentic DevOps platforms that push Actions to its limits. This guide is the result: every error message I've collected, every silent failure I've traced, and every workaround that actually works.

This is a reference guide, not a tutorial. Bookmark it. Search it when something breaks. Every section includes the actual error message (so you can Ctrl+F or Google it), the root cause, and the fix with copy-paste code.

Quick Diagnosis Flowchart

Before diving into 65+ scenarios, start here:

Workflow never appears in Actions tab? → YAML Syntax Issues or Trigger Problems
Workflow runs but a step fails? → Check the error message against the sections below
Workflow runs but produces wrong results silently? → Silent Failures
Secrets are empty or permissions denied? → Secrets & Permissions
Cache miss or artifact not found? → Caching & Artifacts
Jobs cancelled unexpectedly? → Concurrency Issues

Pro tip: Install actionlint right now. It catches the majority of syntax and context issues in this guide before you push. Run it locally or add it to your CI: uses: raven-actions/actionlint@v2.

YAML Syntax & Validation Errors

These errors prevent your workflow from even registering with GitHub. No run appears — the workflow is silently rejected.

Unexpected or Typo'd YAML Keys

Error:

The workflow is not valid. .github/workflows/ci.yml (Line: 6, Col: 5):
Unexpected value 'default'

unexpected key "Shell" for step to run shell command. expected one of
"continue-on-error", "env", "id", "if", "name", "run", "shell",
"timeout-minutes", "working-directory" [syntax-check]

Root cause: Workflow keys must match the GitHub Actions schema exactly, including case and pluralization. default: is not defaults:, Shell: is not shell:, and branch: is not branches:.

Fix: Use actionlint to catch these before pushing. Common corrections:

default: → defaults:
branch: → branches:
Shell: → shell:

Standard YAML linters (yamllint, Python yaml.safe_load()) won't catch these because the YAML is syntactically valid — it's semantically wrong for GitHub Actions.

Missing Required Keys

Error:

"runs-on" section is missing in job "test" [syntax-check]
"jobs" section should not be empty [syntax-check]

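Fix: Make sure jobs: contains at least one job, and that every job has runs-on: and at least one entry in steps: (or a job-level uses: when it calls a reusable workflow). Related gotcha: matrix keys are compared case-insensitively, so node and NODE cannot coexist.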

Expression Syntax Errors

Error:

got unexpected character '"' while lexing expression...
do you mean string literals? only single quotes are available
for string delimiter [expression]

Root cause: GitHub Actions expressions use a custom mini-language, not JavaScript. Double quotes are not valid string delimiters. The + operator doesn't exist for concatenation.

Fix:

# ❌ Wrong
run: echo "${{ "hello" }}"
run: echo "${{ var1 + var2 }}"

# ✅ Correct
run: echo "${{ 'hello' }}"
run: echo "${{ format('{0}{1}', var1, var2) }}"

Context Variable Type Errors

Error:

receiver of object dereference "owner" must be type of object but
got "string" [expression]

Root cause: github.repository is a string ("owner/repo"), not an object. People try github.repository.owner expecting the org name.

Fix: Use github.repository_owner for the owner. Use toJSON(env) to dump environment variables, not ${{ env }} (which outputs the string 'Object').

`secrets.*` in Unexpected Contexts — Silent Failures

Error: Often none. Depending on where the expression sits, you either get a validation error such as Unrecognized named-value: 'secrets', or no error at all and a step that is silently skipped.

Root cause: GitHub's context-availability table does not list secrets for if: conditions, and the documentation states that secrets cannot be referenced directly in if: conditionals. Even where a secret reference is evaluated, an undefined secret yields an empty string, so a condition built on it can quietly take the wrong branch.

Fix:

# ⚠️ Can behave unexpectedly with undefined secrets
- if: ${{ secrets.MY_SECRET != '' }}
  run: echo "has secret"

# ✅ Map to env first, then check env (more reliable)
- env:
    MY_SECRET: ${{ secrets.MY_SECRET }}
  run: |
    if [ -n "$MY_SECRET" ]; then
      echo "has secret"
    fi

This pattern is especially dangerous because the failure mode is silence — no error, no notification. The env-mapping approach is more explicit and actionlint can validate it.

`env` Context Unavailable in Reusable Workflow `with:`

Error:

Unrecognized named-value: 'env'. Located at position 1 within
expression: env.SOMETHING

Root cause: The env context is not available in the with: block when calling reusable workflows. This is a confirmed open bug with 226+ reactions.

Fix: Pass values via github.event.inputs, secrets: inherit, or hardcode them. There is no clean workaround — this is a known platform limitation.

`if:` Conditionals Always Evaluating to `true`

Error: No error. The step always runs regardless of condition.

Root cause: Using YAML block scalar |, trailing spaces, or wrapping ${{ }} with extra characters makes the condition a non-empty string — which is always truthy.

# ❌ Always true — trailing newline from |
if: |
  ${{ github.event_name == 'push' }}

# ❌ Always true — trailing space
if: "${{ github.event_name == 'push' }} "

# ❌ Always true — extra characters between ${{ }} blocks
if: ${{ github.event_name == 'push' }} && ${{ github.ref_name == 'main' }}

Fix:

# ✅ Correct — no extra characters
if: github.event_name == 'push'

# ✅ Correct — single expression, no wrapping needed
if: github.event_name == 'push' && github.ref_name == 'main'

Boolean Inputs Are Strings in Composite Actions

# In composite action — this is ALWAYS false:
if: ${{ inputs.realRun == true }}

Root cause: Composite actions receive all inputs as strings, even when declared with type: boolean. This is a confirmed bug with 117+ reactions.

Fix: Compare to the string 'true':

if: ${{ inputs.realRun == 'true' }}

Composite Actions: No `defaults:` Support

Root cause: Composite actions do not support the defaults: key. You cannot set a default shell. Every run: step must explicitly specify shell:.

Fix:

runs:
  using: composite
  steps:
    - run: echo "hello"
      shell: bash        # Required on EVERY step
    - run: echo "world"
      shell: bash        # Must repeat

Tab Characters in YAML

Error:

found a tab character where an indentation space is expected

Fix: YAML does not allow tabs for indentation. In VS Code: View → Render Whitespace. Add to .editorconfig:

[*.yml]
indent_style = space
indent_size = 2
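To find stray tabs before pushing, a small shell check can scan workflow files. This is a minimal sketch: the function name find_tab_lines is hypothetical, and it assumes a POSIX shell with grep.

```shell
# Hypothetical helper: print "file:line:content" for every line that
# contains a tab character. Exit status follows grep: 0 if any tab was
# found, 1 if the files are clean.
find_tab_lines() {
  tab=$(printf '\t')
  # /dev/null is passed as an extra file so grep always prints file names.
  grep -n "$tab" "$@" /dev/null
}

# Usage (not executed here):
#   find_tab_lines .github/workflows/*.yml && echo "replace tabs with spaces"
```

This flags any tab, not only indentation tabs, which is usually what you want in a workflow file.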

Silent Failures: The Most Dangerous Category

The most dangerous bugs are the ones your pipeline says passed.

These are the scenarios where nothing visibly breaks — your workflow just does the wrong thing.

Scheduled Workflows Silently Disabled After 60 Days

Symptom: A cron workflow that's been running for months just stops. No notification.

Root cause: In public repositories, GitHub automatically disables schedule-triggered workflows after 60 days of repository inactivity (no commits). Workflow runs themselves don't count as activity.

Fix:

- uses: gautamkrishnar/keepalive-workflow@v2
  with:
    time_elapsed: '45'  # triggers 15 days before the 60-day cutoff

Or re-enable manually:

gh workflow enable "Workflow Name" --repo OWNER/REPO
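If you would rather get a warning than rely on a keepalive action, a check like the following can run in any job. It is a sketch under one assumption: that the age of the last commit is a reasonable proxy for repository activity. The function names are hypothetical.

```shell
# Hypothetical helpers; the 45-day default mirrors the keepalive setting above.
days_since() {
  # $1 = older time, $2 = newer time, both in epoch seconds
  echo $(( ($2 - $1) / 86400 ))
}

schedule_at_risk() {
  # Succeeds when the last commit is at least $3 days old (default 45).
  [ "$(days_since "$1" "$2")" -ge "${3:-45}" ]
}

# Usage inside a checked-out repository (not executed here):
#   schedule_at_risk "$(git log -1 --format=%ct)" "$(date +%s)" \
#     && echo "::warning::scheduled workflows may be disabled soon"
```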

`GITHUB_TOKEN` Cannot Trigger Downstream Workflows

Symptom: A workflow pushes a commit or creates a tag, but the expected downstream workflow (triggered by on: push) never fires.

Root cause: This is by design. Events caused by GITHUB_TOKEN (pushes, tags, pull requests) do not trigger further workflow runs, with the exception of workflow_dispatch and repository_dispatch. It's GitHub's recursion prevention mechanism.

Fix: Use a GitHub App installation token or a PAT:

- uses: actions/create-github-app-token@v1
  id: app-token
  with:
    app-id: ${{ vars.APP_ID }}
    private-key: ${{ secrets.APP_PRIVATE_KEY }}

- uses: actions/checkout@v4
  with:
    token: ${{ steps.app-token.outputs.token }}

Cache Rate Limiting Falls Through as "Cache Not Found"

Error:

Warning: Failed to restore: Failed to GetCacheEntryDownloadURL:
Rate limited: Failed request: (429) Too Many Requests
Cache not found for input keys: ...

Root cause: When the cache API rate limits you, the action reports it as a cache miss — not a rate limit error. Your build proceeds without cache, silently slower.

Fix: Don't trigger hundreds of parallel matrix jobs all saving caches simultaneously. Stagger cache operations or use fewer, broader cache keys.

Fork PR Secrets Evaluate to Empty Strings

Symptom: A contributor opens a PR from a fork. Secret-dependent steps fail or skip silently.

Root cause: Secrets are not passed to workflows triggered by pull_request from forks. This is a deliberate security boundary.

Fix: Design CI to not require secrets for tests. For deployment previews after code review, use pull_request_target with a mandatory label gate:

on:
  pull_request_target:
    types: [labeled]

jobs:
  deploy-preview:
    if: github.event.label.name == 'safe to test'
    # ...

⚠️ Security warning: Never check out fork code under pull_request_target and then run it with repository secrets. This creates a pwn-request vulnerability.

Runner & Environment Problems

Self-Hosted Runner Registration & Update Loops

Error:

Runner update in progress, do not shutdown runner.
Downloading 2.277.1 runner... Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
[...loops again...]

Root cause: Containerized runners built on older Ubuntu images (18.04) hit glibc incompatibility when auto-update downloads a newer runner binary.

Fix:

Rebuild container on Ubuntu 22.04+
Disable auto-update: DISABLE_AUTO_UPDATE=1
Add rm -rf /home/runner/actions-runner to container entrypoint before ./config.sh
Add watchdog cron polling GET /orgs/{org}/actions/runners every 5 minutes

Runner Out of Disk Space

Error:

No space left on device (os error 28)

Root cause: GitHub-hosted ubuntu-latest runners start with only ~14GB of free disk space, because pre-installed toolchains (Android SDK ~8GB, .NET ~1.5GB, Haskell ~5GB) already occupy much of the disk.

Fix: Add a cleanup step before heavy builds:

- name: Free Disk Space
  uses: jlumbroso/free-disk-space@main
  with:
    tool-cache: false
    android: true
    dotnet: true
    haskell: true
    large-packages: true

This reclaims ~10-15GB.
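To fail fast instead of dying halfway through a build, a pre-flight check can measure free space first. This is a sketch, not a tuned threshold: the function names and the GB values you pass are assumptions, and it relies on df -P output, where the fourth column is available space.

```shell
# Hypothetical pre-flight check: report free space and fail below a threshold.
free_gb() {
  # df -Pk prints one POSIX-format line per filesystem; column 4 is the
  # available space in 1K blocks. Convert to whole gigabytes.
  df -Pk "${1:-.}" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

require_free_gb() {
  min=$1
  have=$(free_gb "${2:-.}")
  if [ "$have" -lt "$min" ]; then
    echo "only ${have} GB free, need ${min} GB" >&2
    return 1
  fi
  echo "${have} GB free"
}

# Usage in a step (not executed here):
#   require_free_gb 20 "$GITHUB_WORKSPACE"
```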

Environment Variables Not Persisting Between Steps

Error:

Warning: The `set-output` command is deprecated and will be disabled soon.

Root cause: ::set-output and ::set-env were deprecated in favor of environment files.

Fix:

# ❌ Deprecated
- run: echo "::set-output name=dir::$(yarn cache dir)"

# ✅ Current (give the step an id: so later steps can read the output)
- run: echo "dir=$(yarn cache dir)" >> "$GITHUB_OUTPUT"

# For multi-line values:
- run: |
    echo "MY_VAR<<EOF" >> "$GITHUB_ENV"
    echo "$multiline_value" >> "$GITHUB_ENV"
    echo "EOF" >> "$GITHUB_ENV"
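A fixed EOF delimiter breaks if the value itself contains a line that is exactly EOF. A unique delimiter per write avoids that. The helper below is a minimal sketch: write_multiline_env is a hypothetical name, and building the delimiter from the PID and the current time is my assumption, since any token that cannot appear in the value works.

```shell
# Hypothetical helper: append NAME=<multi-line VALUE> to an environment file
# using a delimiter that is unlikely to collide with the value.
write_multiline_env() {
  name=$1
  value=$2
  file=${3:-$GITHUB_ENV}
  # Assumption: PID plus epoch seconds is unique enough for a delimiter.
  delim="ghadelim_$$_$(date +%s)"
  {
    printf '%s<<%s\n' "$name" "$delim"
    printf '%s\n' "$value"
    printf '%s\n' "$delim"
  } >> "$file"
}

# Usage in a step (not executed here):
#   write_multiline_env RELEASE_NOTES "$(cat notes.md)"
```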

Tools Not Found in Next Step (PATH Issues)

Error:

/bin/bash: my-tool: command not found

Root cause: Each run: step spawns a fresh shell. export PATH=... is lost when that step ends.

Fix: Write to $GITHUB_PATH, not PATH:

- name: Install tool
  run: |
    pip install my-cli-tool
    echo "$HOME/.local/bin" >> "$GITHUB_PATH"

- name: Use tool  # PATH is now updated
  run: my-cli-tool --version

Docker Not Available on Runner

Error:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
Is the docker daemon running?

Root cause: ubuntu-latest-slim, ARC containers, and self-hosted runners without DinD don't expose Docker.

Fix:

Standard ubuntu-latest: Docker is available natively
ARC/containerized: Use DinD sidecar or switch to JavaScript/composite actions
For private registry pulls, add docker/login-action before container actions

Service Container Connectivity

Error:

connection to server at "localhost", port 5432 failed: Connection refused

Root cause: In containerized jobs (container: at job level), service containers are on a Docker bridge network. localhost doesn't work.

Fix: Always add health checks, and use the service label as hostname in containerized jobs:

services:
  postgres:
    image: postgres:15
    env:
      POSTGRES_PASSWORD: password
    ports:
      - 5432:5432
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5

For containerized jobs, connect to postgres:5432 (the service label), not localhost:5432.

Runner Image Deprecation

Error:

No hosted runners with requested label(s): 'ubuntu-18.04' can be found.
sudo: docker-compose: command not found

Fix:

# ❌ Removed
- run: sudo docker-compose up -d

# ✅ Docker Compose v2 plugin syntax
- run: sudo docker compose up -d

Track upcoming removals at the actions/runner-images releases.

Windows Runner Gotchas

Error:

AssertionError: expected '40-learnings\\passesdefaultgate.md' to contain '40-learnings/'

Root cause: Path separators (\ vs /), missing POSIX tools (jq, sed), shebangs not honored, CRLF line endings.

Fix:

defaults:
  run:
    shell: bash  # uses Git Bash on Windows

# Install missing tools
- if: runner.os == 'Windows'
  run: choco install jq -y
  shell: pwsh

# Disable CRLF auto-conversion
- run: git config --global core.autocrlf false
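When a script compares or prints paths, normalizing separators before the comparison sidesteps the mismatch shown in the error above. A minimal sketch, assuming the script runs under Git Bash or any POSIX shell; to_posix_path is a hypothetical name.

```shell
# Hypothetical helper: convert Windows-style backslashes to forward slashes.
to_posix_path() {
  printf '%s\n' "$1" | tr '\\' '/'
}

# Usage (not executed here):
#   expected=$(to_posix_path "$actual_path")
```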

Node.js Runtime Deprecation

Error:

Node.js 16 actions are deprecated. Please update the following actions
to use Node.js 20: actions/checkout@v3, actions/cache@v3

Fix: Bump to latest major versions of all actions. For own actions, update action.yml to runs.using: node24. Emergency workaround:

env:
  FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: 'true'

Deprecation timeline: node12 (cutoff mid-2023) → node16 (mid-2024) → node20 (enforcement rolling out 2025-2026). Check the GitHub Actions changelog for the latest timeline.

Secrets, Permissions & Authentication

The GitHub Actions permission model: repo defaults → workflow permissions block → GITHUB_TOKEN scope. The #1 source of 403 errors.

`GITHUB_TOKEN` Permission Denied (403)

Error:

remote: Permission to org/repo.git denied to github-actions[bot].
fatal: unable to access '...': The requested URL returned error: 403

Root cause: Default GITHUB_TOKEN is read-only since GitHub tightened defaults for new repos and orgs in February 2023.

Fix: Add explicit permissions: to the job:

permissions:
  contents: write       # git push
  pull-requests: write  # PR creation
  packages: write       # GHCR push

Critical: The permissions: block completely replaces defaults. Any permission not listed becomes none. Listing only contents: write drops all other permissions including pull-requests.

OIDC Federation Failures with AWS

Error:

Could not assume role with OIDC: Not authorized to perform
sts:AssumeRoleWithWebIdentity

Root causes and fixes:

Reusable workflows change the sub claim. The OIDC JWT subject reflects the calling repo, not the reusable workflow's repo. IAM trust policies must match the caller.
Missing permissions: id-token: write on the calling job.
Audience mismatch:

- uses: aws-actions/configure-aws-credentials@v4
  with:
    audience: sts.amazonaws.com  # must match trust policy
    role-to-assume: arn:aws:iam::123456789012:role/MyRole
    aws-region: us-east-1

Cross-Repo Access (403)

Error:

remote: Permission to other-org/other-repo.git denied to github-actions[bot].

Root cause: GITHUB_TOKEN is scoped to a single repository. It cannot access other repos — this is a security boundary by design.

Fix: Use a GitHub App installation token (recommended) or a PAT:

- uses: actions/create-github-app-token@v1
  id: app-token
  with:
    app-id: ${{ vars.APP_ID }}
    private-key: ${{ secrets.APP_PRIVATE_KEY }}
    repositories: "target-repo"

- uses: actions/checkout@v4
  with:
    token: ${{ steps.app-token.outputs.token }}
    repository: org/target-repo

Environment Protection Rules Blocking Deployments

Error:

This deployment was rejected

Root cause: The triggering ref doesn't match the environment's allowed branches/tags filter, or the required reviewer also triggered the workflow (GitHub doesn't allow self-approval).

Fix: Ensure the triggering ref matches the environment's branch filter pattern. Add a second reviewer if the triggering user is the sole required reviewer.

GitHub App Token Generation Failures

Error:

error:0909006C:PEM routines:get_name:no start line

Root cause: Private key corrupted during shell escaping or base64 encoding.

Fix: Store the raw PEM file directly as a GitHub secret:

gh secret set APP_PRIVATE_KEY < my-app.private-key.pem

Use actions/create-github-app-token@v1 (official, node20-native) instead of tibdex/github-app-token.
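Before blaming the action, it can help to confirm that the secret still looks like a PEM file once it reaches the runner. The check below is a rough sketch: looks_like_pem is a hypothetical name, and it only inspects the first and last lines, so it catches base64-wrapped keys and collapsed newlines but does not validate the key itself.

```shell
# Hypothetical sanity check for a PEM private key held in a variable.
# Returns 0 when the first and last lines look like PEM armor, 1 otherwise.
looks_like_pem() {
  first=$(printf '%s\n' "$1" | head -n 1)
  last=$(printf '%s\n' "$1" | tail -n 1)
  case $first in
    "-----BEGIN "*"PRIVATE KEY-----") ;;
    *) return 1 ;;
  esac
  case $last in
    "-----END "*"PRIVATE KEY-----") ;;
    *) return 1 ;;
  esac
  return 0
}

# Usage in a step with the secret mapped to an env var (not executed here):
#   looks_like_pem "$APP_PRIVATE_KEY" || { echo "key is not raw PEM" >&2; exit 1; }
```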

Docker Registry Auth (GHCR)

Error:

denied: installation not allowed to Write organization package

Fix:

Add permissions: packages: write to the job
For org packages: visit package settings → Manage Actions Access → add the repository with Write access
Don't set DOCKER_CONFIG: $HOME/.docker at job level — it breaks credential persistence

Dependabot Secrets Namespace

Root cause: Dependabot runs in a separate secrets namespace. Repository secrets are not available to Dependabot-triggered workflows.

Fix: Add secrets to both namespaces:

gh secret set NPM_TOKEN --body "npm_xxx" --app actions
gh secret set NPM_TOKEN --body "npm_xxx" --app dependabot

PAT vs. GITHUB_TOKEN Decision Matrix

Scenario	Use
Push to same repo	`GITHUB_TOKEN` + `contents: write`
Create PR on same repo	`GITHUB_TOKEN` + `pull-requests: write`
Push to different repo	GitHub App token or PAT
Trigger another workflow	GitHub App token or PAT (GITHUB_TOKEN only triggers workflow_dispatch and repository_dispatch)
Cross-org operations	Classic PAT with `repo` scope

Prefer GitHub App tokens over PATs: PATs are tied to individuals (leave org = token breaks), expire, and are harder to audit.

Caching, Artifacts & Dependencies

Cache Miss Despite Recent Save

Error:

Cache not found for input keys: Linux-node-abc123def456

Root causes:

Branch scoping: Caches from main are accessible to branches, but not vice-versa
Version mismatch: Changing OS or compression tool changes the cache version hash
Rate limiting: 429s fall through silently as "cache not found"
Infrastructure outage: Check githubstatus.com

Fix: Always prime cache on the default branch first. Use the List Caches API to debug version mismatches.

`cache-hit` Output Semantics

# ❌ Wrong — cache-hit is empty string (not 'false') on full miss
if: steps.cache.outputs.cache-hit == 'false'

# ✅ Correct — always use != 'true'
if: steps.cache.outputs.cache-hit != 'true'

cache-hit is 'true' on exact key match, empty string on miss, and 'false' on restore-keys match. Yes, really.
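The three states are easy to mix up, so here they are spelled out as a shell sketch. The function names are hypothetical, and the value would come from something like steps.cache.outputs.cache-hit mapped into an environment variable.

```shell
# Hypothetical helpers that make the three cache-hit states explicit.
describe_cache_hit() {
  case $1 in
    true)  echo "exact key match" ;;
    false) echo "restored from restore-keys" ;;
    "")    echo "no cache restored" ;;
    *)     echo "unexpected value: $1" ;;
  esac
}

# Mirrors the recommended condition: anything other than 'true' means
# the dependencies may be stale or missing, so install them.
needs_install() {
  [ "$1" != "true" ]
}

# Usage in a step (not executed here):
#   needs_install "$CACHE_HIT" && npm ci
```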

Cache Size Limit (10 GB Per Repo)

Symptom: Random cache misses on older branches.

Root cause: Repos have a 10 GB total cache limit. Oldest caches are LRU-evicted silently.

Fix: Clean up branch caches on PR close:

on:
  pull_request:
    types: [closed]
jobs:
  cleanup:
    runs-on: ubuntu-latest
    permissions:
      actions: write
    steps:
      - run: |
          for id in $(gh cache list --ref refs/pull/${{ github.event.pull_request.number }}/merge \
            --limit 100 --json id --jq '.[].id'); do
            gh cache delete $id
          done
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}

`upload-artifact` v3 → v4 Breaking Changes

Error:

An artifact with the same name already exists for the associated workflow run.

Root cause: v4 artifacts are immutable. Multiple jobs can no longer upload to the same artifact name.

Fix:

# v4 — unique names per matrix job
- uses: actions/upload-artifact@v4
  with:
    name: build-${{ matrix.os }}-${{ matrix.node }}

# Download all and merge
- uses: actions/download-artifact@v4
  with:
    pattern: build-*
    merge-multiple: true
    path: dist/

Cross-Workflow Artifact Download

Error:

Unable to download artifact(s): Artifact not found for name: my-artifact

Fix: Both upload and download must use the same version family (v3↔v3 or v4↔v4 — they use different storage backends):

- uses: actions/download-artifact@v4
  with:
    name: my-artifact
    github-token: ${{ secrets.GITHUB_TOKEN }}  # required for cross-workflow
    run-id: ${{ github.event.workflow_run.id }}

`npm ci` Cache Save Timeout

Error:

The operation was canceled.

Root cause: Cache save (tar compression) on large node_modules exceeds the job timeout. Missing zstd in DinD containers forces slow gzip fallback.

Fix: Cache ~/.npm (the npm cache directory), not node_modules:

- id: npm-cache-dir
  run: echo "dir=$(npm config get cache)" >> "$GITHUB_OUTPUT"

- uses: actions/cache@v5
  with:
    path: ${{ steps.npm-cache-dir.outputs.dir }}
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

For DinD environments, install zstd: apt-get install -y zstd.

Docker Layer Caching

Error:

cache export feature is currently not supported for docker driver

Fix: You must use docker/setup-buildx-action first — the default Docker driver doesn't support cache export:

- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v6
  with:
    cache-from: type=gha,scope=${{ github.workflow }}
    cache-to: type=gha,mode=max,scope=${{ github.workflow }}

Cache Corruption

Error:

tar: Error is not recoverable: exiting now
gzip: stdin: unexpected end of file

Fix: Delete the corrupt cache via CLI:

gh cache list --repo owner/repo
gh cache delete <cache-id> --repo owner/repo

Prevent future corruption with a download timeout:

env:
  SEGMENT_DOWNLOAD_TIMEOUT_MINS: 5

Git LFS Files Not Downloaded

Symptom: Binary files are 140-byte text pointers instead of actual content.

Fix:

- uses: actions/checkout@v4
  with:
    lfs: true
    fetch-depth: 1

Cache LFS objects to reduce bandwidth:

- uses: actions/cache@v5
  with:
    path: .git/lfs
    key: ${{ runner.os }}-lfs-${{ hashFiles('.lfsconfig') }}
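To confirm the symptom quickly, you can test whether a file is still a pointer. LFS pointer files begin with a fixed spec line, which is what this sketch checks. is_lfs_pointer is a hypothetical name, and head -c is assumed to be available, as it is on GNU and BSD systems.

```shell
# Hypothetical check: succeeds when the file is a Git LFS pointer rather
# than real content. Pointer files start with this exact 42-byte line.
is_lfs_pointer() {
  head -c 42 "$1" | LC_ALL=C grep -qx 'version https://git-lfs\.github\.com/spec/v1'
}

# Usage in a step (not executed here):
#   is_lfs_pointer assets/model.bin && { echo "LFS content missing" >&2; exit 1; }
```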

Lockfile Hash Returns Empty String

Error:

Cache not found for input keys: Linux-node-

Root cause: hashFiles('**/package-lock.json') matched no files, returning empty string.

Fix: Debug with:

- run: |
    echo "Hash: ${{ hashFiles('**/package-lock.json') }}"
    find . -name "package-lock.json" -not -path "*/node_modules/*"

Correct patterns per ecosystem:

# npm
key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
# pip
key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements*.txt', '**/pyproject.toml') }}
# Gradle
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle*', '**/gradle-wrapper.properties') }}
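Because an empty hash does not fail the job, you may want an explicit guard that does. The sketch below fails the step when no lockfile exists outside node_modules. require_lockfile is a hypothetical name, and it should run after checkout.

```shell
# Hypothetical guard: fail when no lockfile exists, so an empty hashFiles()
# result does not silently produce a key like "Linux-node-".
require_lockfile() {
  name=${1:-package-lock.json}
  root=${2:-.}
  found=$(find "$root" -name "$name" ! -path '*/node_modules/*' | head -n 1)
  if [ -z "$found" ]; then
    echo "no $name found under $root" >&2
    return 1
  fi
  # Print the first match so the log shows which file was found.
  echo "$found"
}

# Usage in a step before the cache step (not executed here):
#   require_lockfile package-lock.json .
```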

Trigger Problems

Workflow Not Triggering At All

No error. No run appears.

Root causes (in priority order):

Workflow file is not on the default branch
YAML syntax error (silently rejected)
Branch filter mismatch (branches: [master] but default is main)
Workflow disabled via UI or inactivity
Commit made by GITHUB_TOKEN (won't trigger downstream)

Fix:

# Check workflow state
gh workflow list
gh workflow view "My Workflow"

`workflow_dispatch` Button Not Showing

Root causes:

Workflow file not on default branch (most common)
No write access to repository
Wrong YAML indentation:

# ❌ Wrong — nested under push
on:
  push:
    branches: [main]
    workflow_dispatch:      # indented under push

# ✅ Correct — sibling of push
on:
  push:
    branches: [main]
  workflow_dispatch:        # same level as push

Cron Schedule Running Late or Not Running

Root cause: GitHub does not guarantee cron timing. During high load, scheduled runs can be delayed by hours or skipped entirely. Minimum interval is 5 minutes. Public/free-tier repos are deprioritized. All times are UTC.

A real-world case: workflow configured for */10 * * * * (expected ~144 runs/day), but only 4 runs fired in 32 hours.

Fix: For time-sensitive operations, use an external cron service to trigger workflow_dispatch via API. Accept a ±1 hour SLA for GitHub-hosted scheduled workflows.

`workflow_run` Not Firing

Root causes:

The listener workflow must be on the default branch
workflows: ["CI Build"] must exactly match the source workflow's name: field
Missing types: [completed] — without it, fires on both start and finish
Source workflow triggered by GITHUB_TOKEN (recursion prevention)

Fix:

on:
  workflow_run:
    workflows: ["CI Build"]     # exact match to name: in source workflow
    types: [completed]

jobs:
  post-build:
    if: github.event.workflow_run.conclusion == 'success'

`repository_dispatch` Returns 204 But Workflow Doesn't Run

Root cause: API returns 204 even when event_type doesn't match — the mismatch is silent.

Fix: Verify event_type exactly matches the workflow's types::

on:
  repository_dispatch:
    types: [docker-image-updated]  # must EXACTLY match API call

Path Filters Not Working as Expected

Root cause: paths: and paths-ignore: are mutually exclusive — using both on the same event is not supported. docs (without /**) matches a file literally named docs, not the directory.

Fix:

# Correct: ignore docs directory
on:
  push:
    paths-ignore:
      - 'docs/**'
      - '*.md'

Tag Push vs. Release Published

Trigger	When It Fires	Use Case
`push: tags: [v*]`	On tag push	Binary build
`release: types: [created]`	Release created	Build + draft release
`release: types: [published]`	Explicit publish	Deploy to prod

Concurrency & Timing

Jobs Cancelled Unexpectedly

Root cause: Overly broad concurrency group key. Using group: ${{ github.workflow }} alone means all runs compete, even on different branches.

Fix:

# PR workflows — cancel stale runs on same PR
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

# Production deploys — queue, never cancel
concurrency:
  group: deploy-production
  cancel-in-progress: false

# Branch-sensitive — cancel only on non-default branches
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

Empty `head_ref` Causing Cross-Branch Cancellation

Root cause: github.head_ref is empty for push events. All push-triggered runs get the same group key and cancel each other.

Fix:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
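The || fallback in that expression behaves like a shell default-value expansion, which makes it easy to reason about the resulting group keys. This is only an illustration of the logic, not how GitHub evaluates expressions internally; concurrency_group is a hypothetical name.

```shell
# Illustration of `github.workflow`-`github.head_ref || github.run_id`:
# use head_ref when it is non-empty, otherwise fall back to the run id.
concurrency_group() {
  workflow=$1
  head_ref=$2
  run_id=$3
  printf '%s-%s\n' "$workflow" "${head_ref:-$run_id}"
}

# Two pushes get distinct groups, so they never cancel each other:
concurrency_group CI "" 101
concurrency_group CI "" 102
# Two runs for the same PR branch share a group, so the older one can be cancelled:
concurrency_group CI feature/login 103
```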

Job `needs` Failure Cascading

Symptom: A downstream job is Skipped even though you want it to run after upstream failure.

Root cause: Default if: on every job is success(), meaning "only run if ALL needs jobs succeeded."

Fix:

# Always run (notifications, cleanup)
final-job:
  needs: [job-a, job-b]
  if: always()
  steps:
    - if: contains(needs.*.result, 'failure')
      run: exit 1

Default Timeout is 6 Hours

Root cause: A hung test suite silently consumes a runner for 6 hours.

Fix: Always set timeout-minutes at the job level:

jobs:
  test:
    timeout-minutes: 20
    steps:
      - run: npm test
        timeout-minutes: 10

Matrix `include` vs. `exclude` Confusion

Key insight:

include entries that match ALL existing keys add properties to the existing row — they don't create a new job
include entries that match NO existing cell create a new job
exclude requires ALL keys to exist in the base matrix — unknown keys are silently ignored
Max 256 matrix jobs per workflow run

strategy:
  fail-fast: false  # strongly recommended for diagnostics
  matrix: ${{ fromJSON(needs.prepare.outputs.matrix) }}
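For the fromJSON line to work, a prepare job has to write the matrix as single-line JSON to a step output and expose it as a job output. Here is a sketch of that step body. emit_matrix and the node key are hypothetical, and values are assumed to contain no quotes or backslashes, since nothing is escaped.

```shell
# Hypothetical "prepare" step body: write a one-line JSON matrix to the
# output file given as $1, built from the remaining arguments.
emit_matrix() {
  out=$1
  shift
  list=""
  for v in "$@"; do
    # Append each value as a quoted JSON string, comma-separated.
    list="${list:+$list,}\"$v\""
  done
  printf 'matrix={"node":[%s]}\n' "$list" >> "$out"
}

# Usage in the prepare job's step (not executed here):
#   emit_matrix "$GITHUB_OUTPUT" 18 20 22
```

The prepare job then needs an outputs: entry that maps matrix to that step's output, so that needs.prepare.outputs.matrix resolves in the consuming job.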

Dynamic Matrix and Required Status Checks

The problem: Matrix job names like test (ubuntu-latest, 16) change when matrix values change. Branch protection requires exact string matches — no wildcards.

Fix: Add a stable summary job and require that instead:

test-summary:
  needs: [test]
  if: always()
  runs-on: ubuntu-latest
  steps:
    - if: needs.test.result != 'success'
      run: exit 1

Known Unsolved Problems

These are confirmed platform limitations with no clean workaround. Understanding them saves hours of debugging dead ends.

No SSH / Interactive Debugging (#241 — 107 👍, open since 2019)

The runner has no TTY allocated. Interactive debugging is not possible natively. Workarounds like mxschmitt/action-tmate open SSH reverse tunnels but are a security risk (session URL is in public logs).

No Step-Level Retry

There's no native retry: 3 syntax on steps. Use nick-fields/retry for run: steps, or a bash loop:

for i in 1 2 3; do
  flaky-command && break
  [ "$i" -eq 3 ] && exit 1  # all attempts failed: fail the step
  sleep 15
done

No Early-Exit / Step Flow Control (#662 — 1,031 👍)

The highest-voted open runner issue. You cannot exit a job early with a specific conclusion (success/neutral). Every step must use if: guards to skip, creating verbose YAML.

Reusable Workflows Cannot Be Called from Composite Actions

Composite actions are inlined steps on the parent runner. Calling a reusable workflow (which spawns a separate runner) from inside a composite action is architecturally impossible without a lifecycle model redesign.

No `services:` or `container:` in Composite Actions (ADR 0549)

By architectural decision. Service containers require Docker lifecycle management at the job level — composite actions don't have job-level lifecycle.

Secret Masking Edge Cases (#475 — 68 👍, open since 2020)

::add-mask:: echoes the secret value before the mask takes effect. Short secrets (1-3 chars) cause entire log lines to become ***. Base64 and URL-encoded versions of secrets may not be masked.

Cost/Billing Opacity

No per-workflow, per-job, or per-repository breakdown of Actions minutes. The billing page shows total org-level usage. Use gh api /repos/{owner}/{repo}/actions/runs/{id} for approximate per-run duration.

Essential Tooling

`actionlint` — The Single Most Impactful Tool

rhysd/actionlint catches the majority of syntax, context, and type errors in this guide before you push:

# Install
go install github.com/rhysd/actionlint/cmd/actionlint@latest
# Or brew install actionlint

# Run
actionlint

# In CI
- uses: raven-actions/actionlint@v2

It validates: YAML syntax, expression types, context availability, matrix configurations, reusable workflow inputs/outputs, shell script syntax, and action version compatibility.

Online Playground

Don't want to install anything? Use the actionlint playground — paste your workflow YAML and get instant feedback.

Debug Logging

Enable debug logging for any workflow run:

Go to the failed run → "Re-run all jobs" → check "Enable debug logging"
Or set repository variable ACTIONS_STEP_DEBUG to true (adds ##[debug] output to all steps)

`gh` CLI for Debugging

# List workflow runs
gh run list --workflow ci.yml

# View specific run logs
gh run view <run-id> --log

# Download logs for grep
gh run view <run-id> --log | grep 'error'

# List and delete caches
gh cache list
gh cache delete <id>

# Check workflow state
gh workflow list
gh workflow enable "Workflow Name"

Cross-Reference: Related Guides

If you're working with GitHub Actions in the context of platform engineering and DevOps automation, these related articles go deeper on specific patterns:

Lessons from 500 GitHub Migrations — enterprise-scale GitHub rollouts
Platform Engineering with GitHub — building internal developer platforms on GitHub
GitOps for Everything: Beyond Deployments — declarative infrastructure with Actions
GitHub Agentic Workflows: Hands-On Guide — automated workflows with GitHub Copilot
CI Monitor Extension: Agent CI Feedback Loop — automated CI debugging with AI agents

Resources

Every error message, workaround, and fix in this guide is sourced from real GitHub Issues, official documentation, and architecture decision records:

rhysd/actionlint — Static linter for GitHub Actions workflows (the canonical error message reference)
actions/runner Issues — Official runner bug tracker
actions/cache Tips & Workarounds — Official caching troubleshooting
actions/upload-artifact Migration Guide — v3 → v4 breaking changes
GitHub Actions Context Availability — Which contexts are available where
GitHub Actions Security Guides — GITHUB_TOKEN, OIDC, fork PR security
actions/runner ADRs — Architecture decisions explaining why limitations exist
GitHub Status — Check for infrastructure incidents before debugging

This guide covers the scenarios that have cost me and thousands of other developers the most debugging hours. If your specific error isn't here, open an issue or reach out on LinkedIn — I'll add it to the next update.
