GitHub Actions is the CI/CD backbone for millions of repositories. It's also the source of some of the most confusing, silent, and undocumented failure modes in modern DevOps.
I've spent years debugging Actions workflows, first across 500+ repository migrations at enterprise scale, then building agentic DevOps platforms that push Actions to its limits. This guide is the result: every error message I've collected, every silent failure I've traced, and every workaround that actually works.
This is a reference guide, not a tutorial. Bookmark it. Search it when something breaks. Every section includes the actual error message (so you can Ctrl+F or Google it), the root cause, and the fix with copy-paste code.
Quick Diagnosis Flowchart

Before diving into 65+ scenarios, start here:
- Workflow never appears in Actions tab? → YAML Syntax Issues or Trigger Problems
- Workflow runs but a step fails? → Check the error message against the sections below
- Workflow runs but produces wrong results silently? → Silent Failures
- Secrets are empty or permissions denied? → Secrets & Permissions
- Cache miss or artifact not found? → Caching & Artifacts
- Jobs cancelled unexpectedly? → Concurrency Issues
Pro tip: Install actionlint right now. It catches the majority of syntax and context issues in this guide before you push. Run it locally or add it to your CI with uses: raven-actions/actionlint@v2.
YAML Syntax & Validation Errors
These errors prevent your workflow from even registering with GitHub. No run appears — the workflow is silently rejected.
Unexpected or Typo'd YAML Keys
Error:
The workflow is not valid. .github/workflows/ci.yml (Line: 6, Col: 5):
Unexpected value 'default'
unexpected key "Shell" for step to run shell command. expected one of
"continue-on-error", "env", "id", "if", "name", "run", "shell",
"timeout-minutes", "working-directory" [syntax-check]
Root cause: Workflow keys are matched exactly, so both case and spelling matter. default: is not defaults:. Shell: is not shell:. branch: is not branches:.
Fix: Use actionlint to catch these before pushing. Common corrections:
- default: → defaults:
- branch: → branches:
- Shell: → shell:
Standard YAML linters (yamllint, Python yaml.safe_load()) won't catch these because the YAML is syntactically valid — it's semantically wrong for GitHub Actions.
Missing Required Keys
Error:
"runs-on" section is missing in job "test" [syntax-check]
"jobs" section should not be empty [syntax-check]
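Fix: Every workflow needs at least one job under jobs:, and every job needs runs-on: and at least one entry in steps: (jobs that call a reusable workflow with uses: are the exception). Matrix keys are compared case-insensitively, so node and NODE cannot coexist.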
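As a minimal sketch of a workflow that satisfies these requirements (the workflow name, the job name test, and the npm test command are placeholders, not taken from a real project):

```yaml
name: CI
on: push

jobs:                        # must contain at least one job
  test:
    runs-on: ubuntu-latest   # required on every job that defines steps
    steps:                   # must contain at least one step
      - uses: actions/checkout@v4
      - run: npm test        # placeholder command
```

Removing runs-on: or emptying jobs: from this sketch should reproduce the two errors above.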
Expression Syntax Errors
Error:
got unexpected character '"' while lexing expression...
do you mean string literals? only single quotes are available
for string delimiter [expression]
Root cause: GitHub Actions expressions use a custom mini-language, not JavaScript. Double quotes are not valid string delimiters. The + operator doesn't exist for concatenation.
Fix:
# ❌ Wrong
run: echo "${{ "hello" }}"
run: echo "${{ env.FIRST + env.SECOND }}"
# ✅ Correct
run: echo "${{ 'hello' }}"
run: echo "${{ format('{0}{1}', env.FIRST, env.SECOND) }}"
Context Variable Type Errors
Error:
receiver of object dereference "owner" must be type of object but
got "string" [expression]
Root cause: github.repository is a string ("owner/repo"), not an object. People try github.repository.owner expecting the org name.
Fix: Use github.repository_owner for the owner. Use toJSON(env) to dump environment variables, not ${{ env }} (which outputs the string 'Object').
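A small sketch of both fixes in one step. The workflow-level variable APP_ENV and the step-level name ENV_JSON are hypothetical; the env dump is passed through an environment variable so the shell never has to parse the JSON inline:

```yaml
env:
  APP_ENV: staging                     # hypothetical workflow-level variable

jobs:
  inspect:
    runs-on: ubuntu-latest
    steps:
      - name: Show owner and environment
        env:
          ENV_JSON: ${{ toJSON(env) }} # serialize the env context to a JSON string
        run: |
          echo "Owner: ${{ github.repository_owner }}"
          echo "$ENV_JSON"
```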
secrets.* in if: Conditions
Error: Frequently nothing that points at the real cause: the workflow is rejected, or a step is skipped without explanation.
Root cause: GitHub's documentation states that secrets cannot be referenced directly in if: conditionals. The secrets context is not among the contexts available to if:, so a condition such as secrets.MY_SECRET != '' is not a reliable test for whether the secret is set. This matters most in composite actions, reusable workflows, and fork PRs, where the secret is often undefined.
Fix:
# ❌ Not supported: secrets referenced directly in if:
- if: ${{ secrets.MY_SECRET != '' }}
  run: echo "has secret"
# ✅ Map to env first, then check env (more reliable)
- env:
    MY_SECRET: ${{ secrets.MY_SECRET }}
  run: |
    if [ -n "$MY_SECRET" ]; then
      echo "has secret"
    fi
An undefined secret maps to an empty string, so the -n test covers both a missing and an empty secret. The env-mapping approach is more explicit, and actionlint can validate it.
env Context Unavailable in Reusable Workflow with:
Error:
Unrecognized named-value: 'env'. Located at position 1 within
expression: env.SOMETHING
Root cause: The env context is not available in the with: block when calling reusable workflows. This is a confirmed open bug with 226+ reactions.
Fix: Pass values via github.event.inputs, secrets: inherit, or hardcode them. There is no clean workaround — this is a known platform limitation.
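One detour that is sometimes used, at the cost of an extra job, relies on the needs context being available in with: even though env is not. The sketch below copies an env value into a job output and hands that to the reusable workflow. The variable DEPLOY_TARGET, the file deploy.yml, and its target input are all hypothetical, and the called workflow is assumed to declare target under on.workflow_call.inputs:

```yaml
env:
  DEPLOY_TARGET: staging          # hypothetical workflow-level variable

jobs:
  prepare:
    runs-on: ubuntu-latest
    outputs:
      target: ${{ steps.export.outputs.target }}
    steps:
      - id: export
        # env is readable inside a step, so copy it into a step output
        run: echo "target=$DEPLOY_TARGET" >> "$GITHUB_OUTPUT"

  deploy:
    needs: prepare
    uses: ./.github/workflows/deploy.yml              # hypothetical reusable workflow
    with:
      target: ${{ needs.prepare.outputs.target }}     # needs.* is allowed in with:
```

For values that do not change per run, a repository or organization configuration variable read through the vars context is a lighter option, since vars is also available in with:.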
if: Conditionals Always Evaluating to true
Error: No error. The step always runs regardless of condition.
Root cause: Using YAML block scalar |, trailing spaces, or wrapping ${{ }} with extra characters makes the condition a non-empty string — which is always truthy.
# ❌ Always true — trailing newline from |
if: |
  ${{ github.event_name == 'push' }}
# ❌ Always true — trailing space
if: "${{ github.event_name == 'push' }} "
# ❌ Always true — extra characters between ${{ }} blocks
if: ${{ github.event_name == 'push' }} && ${{ github.ref_name == 'main' }}
Fix:
# ✅ Correct — no extra characters
if: github.event_name == 'push'
# ✅ Correct — single expression, no wrapping needed
if: github.event_name == 'push' && github.ref_name == 'main'
Boolean Inputs Are Strings in Composite Actions
# In composite action — this is ALWAYS false:
if: ${{ inputs.realRun == true }}
Root cause: Composite actions receive all inputs as strings, even when declared with type: boolean. This is a confirmed bug with 117+ reactions.
Fix: Compare to the string 'true':
if: ${{ inputs.realRun == 'true' }}
Composite Actions: No defaults: Support
Root cause: Composite actions do not support the defaults: key. You cannot set a default shell. Every run: step must explicitly specify shell:.
Fix:
runs:
  using: composite
  steps:
    - run: echo "hello"
      shell: bash # Required on EVERY step
    - run: echo "world"
      shell: bash # Must repeat
Tab Characters in YAML
Error:
found a tab character where an indentation space is expected
Fix: YAML does not allow tabs for indentation. In VS Code: View → Render Whitespace. Add to .editorconfig:
[*.yml]
indent_style = space
indent_size = 2
Silent Failures: The Most Dangerous Category

The most dangerous bugs are the ones your pipeline says passed.
These are the scenarios where nothing visibly breaks — your workflow just does the wrong thing.
Scheduled Workflows Silently Disabled After 60 Days
Symptom: A cron workflow that's been running for months just stops. No notification.
Root cause: In public repositories, GitHub automatically disables schedule-triggered workflows after 60 days without repository activity (no commits). Workflow runs themselves don't count as activity.
Fix:
- uses: gautamkrishnar/keepalive-workflow@v2
  with:
    time_elapsed: '45' # triggers 15 days before the 60-day cutoff
Or re-enable manually:
gh workflow enable "Workflow Name" --repo OWNER/REPO
GITHUB_TOKEN Cannot Trigger Downstream Workflows
Symptom: A workflow pushes a commit or creates a tag, but the expected downstream workflow (triggered by on: push) never fires.
Root cause: This is by design. Events caused by GITHUB_TOKEN (pushes, tags, PR updates) do not create new workflow runs, with the exception of workflow_dispatch and repository_dispatch. It is GitHub's recursion prevention mechanism.
Fix: Use a GitHub App installation token or a PAT:
- uses: actions/create-github-app-token@v1
  id: app-token
  with:
    app-id: ${{ vars.APP_ID }}
    private-key: ${{ secrets.APP_PRIVATE_KEY }}
- uses: actions/checkout@v4
  with:
    token: ${{ steps.app-token.outputs.token }}
Cache Rate Limiting Falls Through as "Cache Not Found"
Error:
Warning: Failed to restore: Failed to GetCacheEntryDownloadURL:
Rate limited: Failed request: (429) Too Many Requests
Cache not found for input keys: ...
Root cause: When the cache API rate limits you, the action reports it as a cache miss — not a rate limit error. Your build proceeds without cache, silently slower.
Fix: Don't trigger hundreds of parallel matrix jobs all saving caches simultaneously. Stagger cache operations or use fewer, broader cache keys.
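One way to stagger a large matrix is strategy.max-parallel combined with a single shared cache key. This is a sketch only: the shard matrix, the limit of 4, and the npm commands are illustrative, and whether it keeps you under the rate limit depends on your workload:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 4                     # at most 4 matrix jobs run at once (illustrative value)
      matrix:
        shard: [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical shards
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: ~/.npm
          # one broad key shared by all shards instead of one key per shard
          key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
      - run: npm ci
      - run: npm test                     # placeholder test command
```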
Fork PR Secrets Evaluate to Empty Strings
Symptom: A contributor opens a PR from a fork. Secret-dependent steps fail or skip silently.
Root cause: Secrets are not passed to workflows triggered by pull_request from forks. This is a deliberate security boundary.
Fix: Design CI to not require secrets for tests. For deployment previews after code review, use pull_request_target with a mandatory label gate:
on:
  pull_request_target:
    types: [labeled]
jobs:
  deploy-preview:
    if: github.event.label.name == 'safe to test'
    # ...
⚠️ Security warning: Never check out fork code with pull_request_target and then run it with repository secrets. This creates a pwn-request vulnerability.
Runner & Environment Problems
Self-Hosted Runner Registration & Update Loops
Error:
Runner update in progress, do not shutdown runner.
Downloading 2.277.1 runner... Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
[...loops again...]
Root cause: Containerized runners built on older Ubuntu images (18.04) hit glibc incompatibility when auto-update downloads a newer runner binary.
Fix:
- Rebuild container on Ubuntu 22.04+
- Disable auto-update: DISABLE_AUTO_UPDATE=1
- Add rm -rf /home/runner/actions-runner to container entrypoint before ./config.sh
- Add watchdog cron polling GET /orgs/{org}/actions/runners every 5 minutes
Runner Out of Disk Space
Error:
No space left on device (os error 28)
Root cause: GitHub-hosted ubuntu-latest runners have ~14GB usable, but pre-installed toolchains (Android SDK ~8GB, .NET ~1.5GB, Haskell ~5GB) consume most of it.
Fix: Add a cleanup step before heavy builds:
- name: Free Disk Space
  uses: jlumbroso/free-disk-space@main
  with:
    tool-cache: false
    android: true
    dotnet: true
    haskell: true
    large-packages: true
This reclaims ~10-15GB.
Environment Variables Not Persisting Between Steps
Error:
Warning: The `set-output` command is deprecated and will be disabled soon.
Root cause: ::set-output and ::set-env were deprecated in favor of environment files.
Fix:
# ❌ Deprecated
- run: echo "::set-output name=dir::$(yarn cache dir)"
# ✅ Current (quote the file path so it survives spaces)
- run: echo "dir=$(yarn cache dir)" >> "$GITHUB_OUTPUT"
# For multi-line values:
- run: |
    echo "MY_VAR<<EOF" >> "$GITHUB_ENV"
    echo "$multiline_value" >> "$GITHUB_ENV"
    echo "EOF" >> "$GITHUB_ENV"
Tools Not Found in Next Step (PATH Issues)
Error:
/bin/bash: my-tool: command not found
Root cause: Each run: step spawns a fresh shell. export PATH=... is lost when that step ends.
Fix: Write to $GITHUB_PATH, not PATH:
- name: Install tool
  run: |
    pip install my-cli-tool
    echo "$HOME/.local/bin" >> $GITHUB_PATH
- name: Use tool # PATH is now updated
  run: my-cli-tool --version
Docker Not Available on Runner
Error:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
Is the docker daemon running?
Root cause: ubuntu-latest-slim, ARC containers, and self-hosted runners without DinD don't expose Docker.
Fix:
- Standard ubuntu-latest: Docker is available natively
- ARC/containerized: Use DinD sidecar or switch to JavaScript/composite actions
- For private registry pulls, add docker/login-action before container actions
Service Container Connectivity
Error:
connection to server at "localhost", port 5432 failed: Connection refused
Root cause: In containerized jobs (container: at job level), service containers are on a Docker bridge network. localhost doesn't work.
Fix: Always add health checks, and use the service label as hostname in containerized jobs:
services:
  postgres:
    image: postgres:15
    env:
      POSTGRES_PASSWORD: password
    ports:
      - 5432:5432
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5
For containerized jobs, connect to postgres:5432 (the service label), not localhost:5432.
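To make the hostname difference concrete, here is a sketch of a containerized job that reaches the service by its label. The job name, the node:20 container image, the npm test command, and the PG* variables (the standard libpq connection variables, which your client may or may not read) are illustrative assumptions:

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    container: node:20             # the job itself runs inside a container
    services:
      postgres:                    # this label doubles as the hostname
        image: postgres:15
        env:
          POSTGRES_PASSWORD: password
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - run: npm test              # placeholder test command
        env:
          PGHOST: postgres         # service label, not localhost
          PGPORT: 5432
          PGUSER: postgres
          PGPASSWORD: password
```

No ports: mapping is needed here because both containers share the same Docker network. When the job runs directly on the runner instead, keep the ports: mapping and connect to localhost:5432.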
Runner Image Deprecation
Error:
No hosted runners with requested label(s): 'ubuntu-18.04' can be found.
sudo: docker-compose: command not found
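Fix: Point runs-on: at a currently supported label (for example ubuntu-latest), and replace the removed docker-compose v1 binary with the Compose v2 plugin: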
# ❌ Removed
- run: sudo docker-compose up -d
# ✅ Docker Compose v2 plugin syntax
- run: sudo docker compose up -d
Track upcoming removals at the actions/runner-images releases.
Windows Runner Gotchas
Error:
AssertionError: expected '40-learnings\\passesdefaultgate.md' to contain '40-learnings/'
Root cause: Path separators (\ vs /), missing POSIX tools (jq, sed), shebangs not honored, CRLF line endings.
Fix:
defaults:
  run:
    shell: bash # uses Git Bash on Windows
# Install missing tools
- if: runner.os == 'Windows'
  run: choco install jq -y
  shell: pwsh
# Disable CRLF auto-conversion
- run: git config --global core.autocrlf false
Node.js Runtime Deprecation
Error:
Node.js 16 actions are deprecated. Please update the following actions
to use Node.js 20: actions/checkout@v3, actions/cache@v3
Fix: Bump to latest major versions of all actions. For own actions, update action.yml to runs.using: node24. Emergency workaround:
env:
  FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: 'true'
Deprecation timeline: node12 (cutoff mid-2023) → node16 (mid-2024) → node20 (enforcement rolling out 2025-2026). Check the GitHub Actions changelog for the latest timeline.
Secrets, Permissions & Authentication

The GitHub Actions permission model: repo defaults → workflow permissions block → GITHUB_TOKEN scope. The #1 source of 403 errors.
GITHUB_TOKEN Permission Denied (403)
Error:
remote: Permission to org/repo.git denied to github-actions[bot].
fatal: unable to access '...': The requested URL returned error: 403
Root cause: Default GITHUB_TOKEN is read-only since GitHub tightened defaults for new repos and orgs in February 2023.
Fix: Add explicit permissions: to the job:
permissions:
  contents: write # git push
  pull-requests: write # PR creation
  packages: write # GHCR push
Critical: The permissions: block completely replaces defaults. Any permission not listed becomes none. Listing only contents: write drops all other permissions including pull-requests.
OIDC Federation Failures with AWS
Error:
Could not assume role with OIDC: Not authorized to perform
sts:AssumeRoleWithWebIdentity
Root causes and fixes:
- Reusable workflows change the sub claim. The OIDC JWT subject reflects the calling repo, not the reusable workflow's repo. IAM trust policies must match the caller.
- Missing permissions: id-token: write on the calling job.
- Audience mismatch:
- uses: aws-actions/configure-aws-credentials@v4
  with:
    audience: sts.amazonaws.com # must match trust policy
    role-to-assume: arn:aws:iam::123456789012:role/MyRole
    aws-region: us-east-1
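Putting the permission and the credentials step together, a calling job might look like the sketch below. The job name and the final verification command are illustrative, and the role ARN is the same placeholder used above:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # lets the job request an OIDC token
      contents: read     # still needed for actions/checkout
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/MyRole
          aws-region: us-east-1
      - run: aws sts get-caller-identity   # quick check that the role was assumed
```

Because a permissions: block replaces the defaults, contents: read has to be listed explicitly once id-token: write is added.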
Cross-Repo Access (403)
Error:
remote: Permission to other-org/other-repo.git denied to github-actions[bot].
Root cause: GITHUB_TOKEN is scoped to a single repository. It cannot access other repos — this is a security boundary by design.
Fix: Use a GitHub App installation token (recommended) or a PAT:
- uses: actions/create-github-app-token@v1
  id: app-token
  with:
    app-id: ${{ vars.APP_ID }}
    private-key: ${{ secrets.APP_PRIVATE_KEY }}
    repositories: "target-repo"
- uses: actions/checkout@v4
  with:
    token: ${{ steps.app-token.outputs.token }}
    repository: org/target-repo
Environment Protection Rules Blocking Deployments
Error:
This deployment was rejected
Root cause: The triggering ref doesn't match the environment's allowed branches/tags filter, or the required reviewer also triggered the workflow (GitHub doesn't allow self-approval).
Fix: Ensure the triggering ref matches the environment's branch filter pattern. Add a second reviewer if the triggering user is the sole required reviewer.
GitHub App Token Generation Failures
Error:
error:0909006C:PEM routines:get_name:no start line
Root cause: Private key corrupted during shell escaping or base64 encoding.
Fix: Store the raw PEM file directly as a GitHub secret:
gh secret set APP_PRIVATE_KEY < my-app.private-key.pem
Use actions/create-github-app-token@v1 (official, node20-native) instead of tibdex/github-app-token.
Docker Registry Auth (GHCR)
Error:
denied: installation not allowed to Write organization package
Fix:
- Add permissions: packages: write to the job
- For org packages: visit package settings → Manage Actions Access → add the repository with Write access
- Don't set DOCKER_CONFIG: $HOME/.docker at job level; it breaks credential persistence
Dependabot Secrets Namespace
Root cause: Dependabot runs in a separate secrets namespace. Repository secrets are not available to Dependabot-triggered workflows.
Fix: Add secrets to both namespaces:
gh secret set NPM_TOKEN --body "npm_xxx" --app actions
gh secret set NPM_TOKEN --body "npm_xxx" --app dependabot
PAT vs. GITHUB_TOKEN Decision Matrix
| Scenario | Use |
|---|---|
| Push to same repo | GITHUB_TOKEN + contents: write |
| Create PR on same repo | GITHUB_TOKEN + pull-requests: write |
| Push to different repo | GitHub App token or PAT |
| Trigger another workflow | GitHub App token or PAT (pushes made with GITHUB_TOKEN don't start new runs) |
| Cross-org operations | Classic PAT with repo scope |
Prefer GitHub App tokens over PATs: PATs are tied to individuals (leave org = token breaks), expire, and are harder to audit.
Caching, Artifacts & Dependencies
Cache Miss Despite Recent Save
Error:
Cache not found for input keys: Linux-node-abc123def456
Root causes:
- Branch scoping: Caches from main are accessible to branches, but not vice-versa
- Version mismatch: Changing OS or compression tool changes the cache version hash
- Rate limiting: 429s fall through silently as "cache not found"
- Infrastructure outage: Check githubstatus.com
Fix: Always prime cache on the default branch first. Use the List Caches API to debug version mismatches.
cache-hit Output Semantics
# ❌ Wrong — cache-hit is empty string (not 'false') on full miss
if: steps.cache.outputs.cache-hit == 'false'
# ✅ Correct — always use != 'true'
if: steps.cache.outputs.cache-hit != 'true'
cache-hit is 'true' on exact key match, empty string on miss, and 'false' on restore-keys match. Yes, really.
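In context, the safe comparison looks like this sketch. The .venv path, the requirements file pattern, and the install command are hypothetical; the point is only the shape of the if: condition:

```yaml
steps:
  - uses: actions/checkout@v4
  - id: cache
    uses: actions/cache@v4
    with:
      path: .venv                          # hypothetical dependency directory
      key: ${{ runner.os }}-venv-${{ hashFiles('**/requirements*.txt') }}
      restore-keys: |
        ${{ runner.os }}-venv-
  # Runs on a full miss (empty string) and on a restore-keys match ('false')
  - if: steps.cache.outputs.cache-hit != 'true'
    run: |
      python -m venv .venv
      .venv/bin/pip install -r requirements.txt
```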
Cache Size Limit (10 GB Per Repo)
Symptom: Random cache misses on older branches.
Root cause: Repos have a 10 GB total cache limit. Oldest caches are LRU-evicted silently.
Fix: Clean up branch caches on PR close:
on:
  pull_request:
    types: [closed]
jobs:
  cleanup:
    runs-on: ubuntu-latest
    permissions:
      actions: write
    steps:
      - run: |
          for id in $(gh cache list --ref refs/pull/${{ github.event.pull_request.number }}/merge \
              --limit 100 --json id --jq '.[].id'); do
            gh cache delete "$id"
          done
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}
upload-artifact v3 → v4 Breaking Changes
Error:
An artifact with the same name already exists for the associated workflow run.
Root cause: v4 artifacts are immutable. Multiple jobs can no longer upload to the same artifact name.
Fix:
# v4: unique names per matrix job
- uses: actions/upload-artifact@v4
  with:
    name: build-${{ matrix.os }}-${{ matrix.node }}
# Download all and merge
- uses: actions/download-artifact@v4
  with:
    pattern: build-*
    merge-multiple: true
    path: dist/
Cross-Workflow Artifact Download
Error:
Unable to download artifact(s): Artifact not found for name: my-artifact
Fix: Both upload and download must use the same version family (v3↔v3 or v4↔v4 — they use different storage backends):
- uses: actions/download-artifact@v4
  with:
    name: my-artifact
    github-token: ${{ secrets.GITHUB_TOKEN }} # required for cross-workflow
    run-id: ${{ github.event.workflow_run.id }}
npm ci Cache Save Timeout
Error:
The operation was canceled.
Root cause: Cache save (tar compression) on large node_modules exceeds the job timeout. Missing zstd in DinD containers forces slow gzip fallback.
Fix: Cache ~/.npm (the npm cache directory), not node_modules:
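# Resolve the npm cache directory first; the cache step below reads this output
- id: npm-cache-dir
  run: echo "dir=$(npm config get cache)" >> "$GITHUB_OUTPUT"
- uses: actions/cache@v5
  with:
    path: ${{ steps.npm-cache-dir.outputs.dir }}
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}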
For DinD environments, install zstd: apt-get install -y zstd.
Docker Layer Caching
Error:
cache export feature is currently not supported for docker driver
Fix: You must use docker/setup-buildx-action first — the default Docker driver doesn't support cache export:
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v6
  with:
    cache-from: type=gha,scope=${{ github.workflow }}
    cache-to: type=gha,mode=max,scope=${{ github.workflow }}
Cache Corruption
Error:
tar: Error is not recoverable: exiting now
gzip: stdin: unexpected end of file
Fix: Delete the corrupt cache via CLI:
gh cache list --repo owner/repo
gh cache delete <cache-id> --repo owner/repo
Prevent future corruption with a download timeout:
env:
  SEGMENT_DOWNLOAD_TIMEOUT_MINS: 5
Git LFS Files Not Downloaded
Symptom: Binary files are 140-byte text pointers instead of actual content.
Fix:
- uses: actions/checkout@v4
  with:
    lfs: true
    fetch-depth: 1
Cache LFS objects to reduce bandwidth:
- uses: actions/cache@v5
  with:
    path: .git/lfs
    key: ${{ runner.os }}-lfs-${{ hashFiles('.lfsconfig') }}
Lockfile Hash Returns Empty String
Error:
Cache not found for input keys: Linux-node-
Root cause: hashFiles('**/package-lock.json') matched no files, returning empty string.
Fix: Debug with:
- run: |
    echo "Hash: ${{ hashFiles('**/package-lock.json') }}"
    find . -name "package-lock.json" -not -path "*/node_modules/*"
Correct patterns per ecosystem:
# npm
key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
# pip
key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements*.txt', '**/pyproject.toml') }}
# Gradle
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle*', '**/gradle-wrapper.properties') }}
Trigger Problems
Workflow Not Triggering At All
No error. No run appears.
Root causes (in priority order):
- Workflow file is not on the default branch
- YAML syntax error (silently rejected)
- Branch filter mismatch (branches: [master] but default is main)
- Workflow disabled via UI or inactivity
- Commit made by GITHUB_TOKEN (won't trigger downstream)
Fix:
# Check workflow state
gh workflow list
gh workflow view "My Workflow"
workflow_dispatch Button Not Showing
Root causes:
- Workflow file not on default branch (most common)
- No write access to repository
- Wrong YAML indentation:
# ❌ Wrong: nested under push
on:
  push:
    branches: [main]
    workflow_dispatch: # indented under push
# ✅ Correct: sibling of push
on:
  push:
    branches: [main]
  workflow_dispatch: # same level as push
Cron Schedule Running Late or Not Running
Root cause: GitHub does not guarantee cron timing. During high load, scheduled runs can be delayed by hours or skipped entirely. Minimum interval is 5 minutes. Public/free-tier repos are deprioritized. All times are UTC.
A real-world case: workflow configured for */10 * * * * (expected ~144 runs/day), but only 4 runs fired in 32 hours.
Fix: For time-sensitive operations, use an external cron service to trigger workflow_dispatch via API. Accept a ±1 hour SLA for GitHub-hosted scheduled workflows.
workflow_run Not Firing
Root causes:
- The listener workflow must be on the default branch
- workflows: ["CI Build"] must exactly match the source workflow's name: field
- Missing types: [completed]; without it, fires on both start and finish
- Source workflow triggered by GITHUB_TOKEN (recursion prevention)
Fix:
on:
  workflow_run:
    workflows: ["CI Build"] # exact match to name: in source workflow
    types: [completed]
jobs:
  post-build:
    if: github.event.workflow_run.conclusion == 'success'
repository_dispatch Returns 204 But Workflow Doesn't Run
Root cause: API returns 204 even when event_type doesn't match — the mismatch is silent.
Fix: Verify that event_type exactly matches an entry in the workflow's types: list:
on:
  repository_dispatch:
    types: [docker-image-updated] # must EXACTLY match API call
Path Filters Not Working as Expected
Root cause: paths: and paths-ignore: are mutually exclusive — using both on the same event is not supported. docs (without /**) matches a file literally named docs, not the directory.
Fix:
# Correct: ignore docs directory
on:
  push:
    paths-ignore:
      - 'docs/**'
      - '*.md'
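When you need to include and exclude paths for the same event, GitHub's documented approach is a single paths: filter with !-prefixed negations. A sketch, with src/ as a placeholder directory:

```yaml
on:
  push:
    paths:
      - 'src/**'          # run for changes under src/ ...
      - '!src/**/*.md'    # ... except Markdown files inside it
```

Order matters: a negated pattern only removes paths matched by the patterns listed before it.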
Tag Push vs. Release Published
| Trigger | When It Fires | Use Case |
|---|---|---|
| push: tags: [v*] | On tag push | Binary build |
| release: types: [created] | Release created | Build + draft release |
| release: types: [published] | Explicit publish | Deploy to prod |
Concurrency & Timing
Jobs Cancelled Unexpectedly
Root cause: Overly broad concurrency group key. Using group: ${{ github.workflow }} alone means all runs compete, even on different branches.
Fix:
# PR workflows: cancel stale runs on same PR
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
# Production deploys: queue, never cancel
concurrency:
  group: deploy-production
  cancel-in-progress: false
# Branch-sensitive: cancel only on non-default branches
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
Empty head_ref Causing Cross-Branch Cancellation
Root cause: github.head_ref is empty for push events. A group keyed only on github.head_ref therefore collapses to the same key for every push-triggered run, and those runs cancel each other.
Fix:
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
Job needs Failure Cascading
Symptom: A downstream job is Skipped even though you want it to run after upstream failure.
Root cause: Default if: on every job is success(), meaning "only run if ALL needs jobs succeeded."
Fix:
# Always run (notifications, cleanup)
final-job:
  needs: [job-a, job-b]
  if: always()
  runs-on: ubuntu-latest
  steps:
    - if: contains(needs.*.result, 'failure')
      run: exit 1
Default Timeout is 6 Hours
Root cause: A hung test suite silently consumes a runner for 6 hours.
Fix: Always set timeout-minutes at the job level:
jobs:
  test:
    timeout-minutes: 20
    steps:
      - run: npm test
        timeout-minutes: 10
Matrix include vs. exclude Confusion
Key insight:
- include entries that match ALL existing keys add properties to the existing row; they don't create a new job
- include entries that match NO existing cell create a new job
- exclude requires ALL keys to exist in the base matrix; unknown keys are silently ignored
- Max 256 matrix jobs per workflow run
strategy:
  fail-fast: false # strongly recommended for diagnostics
  matrix: ${{ fromJSON(needs.prepare.outputs.matrix) }}
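A static sketch of the first two rules plus exclude, using made-up os/node values and a hypothetical coverage property. Under GitHub's documented processing order (exclude first, then include), this should expand to four jobs:

```yaml
strategy:
  fail-fast: false
  matrix:
    os: [ubuntu-latest, windows-latest]
    node: [18, 20]
    exclude:
      - os: windows-latest     # removes the windows-latest / 18 cell
        node: 18
    include:
      - os: ubuntu-latest      # matches an existing cell:
        node: 20               # adds coverage to it, no new job
        coverage: true
      - os: macos-latest       # matches no existing cell:
        node: 20               # creates a new job
```

The resulting jobs are ubuntu-latest/18, ubuntu-latest/20 (with coverage), windows-latest/20, and macos-latest/20.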
Dynamic Matrix and Required Status Checks
The problem: Matrix job names like test (ubuntu-latest, 16) change when matrix values change. Branch protection requires exact string matches — no wildcards.
Fix: Add a stable summary job and require that instead:
test-summary:
  needs: [test]
  if: always()
  runs-on: ubuntu-latest
  steps:
    - if: needs.test.result != 'success'
      run: exit 1
Known Unsolved Problems
These are confirmed platform limitations with no clean workaround. Understanding them saves hours of debugging dead ends.
No SSH / Interactive Debugging (#241 — 107 👍, open since 2019)
The runner has no TTY allocated. Interactive debugging is not possible natively. Workarounds like mxschmitt/action-tmate open SSH reverse tunnels but are a security risk (session URL is in public logs).
No Step-Level Retry
There's no native retry: 3 syntax on steps. Use nick-fields/retry for run: steps, or a bash loop:
for i in 1 2 3; do
  flaky-command && break || sleep 15
done
No Early-Exit / Step Flow Control (#662 — 1,031 👍)
The highest-voted open runner issue. You cannot exit a job early with a specific conclusion (success/neutral). Every step must use if: guards to skip, creating verbose YAML.
Reusable Workflows Cannot Be Called from Composite Actions
Composite actions are inlined steps on the parent runner. Calling a reusable workflow (which spawns a separate runner) from inside a composite action is architecturally impossible without a lifecycle model redesign.
No services: or container: in Composite Actions (ADR 0549)
By architectural decision. Service containers require Docker lifecycle management at the job level — composite actions don't have job-level lifecycle.
Secret Masking Edge Cases (#475 — 68 👍, open since 2020)
::add-mask:: echoes the secret value before the mask takes effect. Short secrets (1-3 chars) cause entire log lines to become ***. Base64 and URL-encoded versions of secrets may not be masked.
Cost/Billing Opacity
No per-workflow, per-job, or per-repository breakdown of Actions minutes. The billing page shows total org-level usage. Use gh api /repos/{owner}/{repo}/actions/runs/{id} for approximate per-run duration.
Essential Tooling
actionlint — The Single Most Impactful Tool
rhysd/actionlint catches the majority of syntax, context, and type errors in this guide before you push:
# Install
go install github.com/rhysd/actionlint/cmd/actionlint@latest
# Or brew install actionlint
# Run
actionlint
# In CI
- uses: raven-actions/actionlint@v2
It validates: YAML syntax, expression types, context availability, matrix configurations, reusable workflow inputs/outputs, shell script syntax, and action version compatibility.
Online Playground
Don't want to install anything? Use the actionlint playground — paste your workflow YAML and get instant feedback.
Debug Logging
Enable debug logging for any workflow run:
- Go to the failed run → "Re-run all jobs" → check "Enable debug logging"
- Or set repository variable ACTIONS_STEP_DEBUG to true (adds ##[debug] output to all steps)
gh CLI for Debugging
# List workflow runs
gh run list --workflow ci.yml
# View specific run logs
gh run view <run-id> --log
# Download logs for grep
gh run view <run-id> --log | grep 'error'
# List and delete caches
gh cache list
gh cache delete <id>
# Check workflow state
gh workflow list
gh workflow enable "Workflow Name"
Cross-Reference: Related Guides
If you're working with GitHub Actions in the context of platform engineering and DevOps automation, these related articles go deeper on specific patterns:
- Lessons from 500 GitHub Migrations — enterprise-scale GitHub rollouts
- Platform Engineering with GitHub — building internal developer platforms on GitHub
- GitOps for Everything: Beyond Deployments — declarative infrastructure with Actions
- GitHub Agentic Workflows: Hands-On Guide — automated workflows with GitHub Copilot
- CI Monitor Extension: Agent CI Feedback Loop — automated CI debugging with AI agents
Resources
Every error message, workaround, and fix in this guide is sourced from real GitHub Issues, official documentation, and architecture decision records:
- rhysd/actionlint: Static linter for GitHub Actions workflows (the canonical error message reference)
- actions/runner Issues: Official runner bug tracker
- actions/cache Tips & Workarounds: Official caching troubleshooting
- actions/upload-artifact Migration Guide: v3 → v4 breaking changes
- GitHub Actions Context Availability: Which contexts are available where
- GitHub Actions Security Guides: GITHUB_TOKEN, OIDC, fork PR security
- actions/runner ADRs: Architecture decisions explaining why limitations exist
- GitHub Status: Check for infrastructure incidents before debugging
This guide covers the scenarios that have cost me and thousands of other developers the most debugging hours. If your specific error isn't here, open an issue or reach out on LinkedIn — I'll add it to the next update.