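⚠️ This article contains affiliate advertising (promotions). A portion of any revenue generated through the links is paid to the site operator, but this has no effect on the price readers pay.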
If you copy the workflow file and the triage script in this article, you'll get a GitHub Actions pipeline that runs your Python test suite, asks the Claude API to triage only the failures into a human-readable root-cause summary, deploys to your server on green, and posts a Slack message that tells a human whether they need to wake up, all from one push. No paid CI add-ons, no third-party bots. I run this on a real repo today and it took my feedback loop from 11m20s to 4m05s on a 900-test suite.
This is not a "here's the YAML, good luck" post. I'll show you the exact mistakes that cost me two evenings: the pip cache that silently never hit, the Claude triage step that ran on green builds and burned tokens for nothing, and the deploy that fired twice because I forgot concurrency.
The actual problem: GitHub Actions logs are unreadable when 7 of 900 tests fail
The default failure UX is brutal. A red X, then you click in, scroll through 4,000 lines of pytest output, and try to find the 7 assertions that actually broke. My team was averaging ~6 minutes just locating failures before anyone started fixing them.
So the unique angle here: I don't use AI to write tests or generate code in CI. I use it as a log compressor that only runs on failure. The job pipes pytest's machine-readable report into the Claude API and gets back a ranked summary: "3 failures are one root cause (a timezone change in utils/dates.py), 4 are flaky network tests." That single change is what made people actually read the CI output.
Step 1: The pytest + pip cache job that's actually 2.7x faster
First, the test job. The non-obvious part is --json-report (from pytest-json-report), which gives us a structured artifact instead of scraping stdout later. Save this as .github/workflows/ci.yml:
    name: CI
    on:
      push:
        branches: [main]
      pull_request:
    concurrency:
      group: ci-${{ github.ref }}
      cancel-in-progress: true
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
              cache: "pip" # <- this one line saved ~90s/run
          - name: Install deps
            run: |
              python -m pip install --upgrade pip
              pip install -r requirements.txt
              pip install pytest pytest-json-report
          - name: Run tests
            id: pytest
            run: |
              pytest -q --json-report --json-report-file=report.json
            continue-on-error: true # <- we WANT to reach the triage step on failure
          - name: Upload test report
            if: always()
            uses: actions/upload-artifact@v4
            with:
              name: pytest-report
              path: report.json
          - name: Fail job if tests failed
            if: steps.pytest.outcome == 'failure'
            run: exit 1
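The failure story: I originally cached with a hand-rolled actions/cache keyed on hashFiles('requirements.txt'). It never hit, because my requirements.txt was generated by pip-compile with a timestamp comment that changed every run. The hash changed, the cache missed, and every build reinstalled 140 packages (~90s wasted). Switching to setup-python's built-in cache: "pip" fixed it in my builds. One caveat: the built-in cache also derives its key from a hash of the dependency file (requirements.txt by default, configurable with cache-dependency-path), not from the resolved dependency tree, so the durable fix is a requirements.txt whose contents are identical between runs. Check your own builds: if "Cache restored" never appears in the setup-python log, you're paying this tax.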
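To see why a single volatile comment defeats a file-hash cache key, here is a minimal Python sketch. It only imitates the idea of hashing file contents (it is not the actual hashFiles implementation), and the two requirements files, the timestamp header, and the helper names are made up for illustration.

```python
import hashlib


def file_key(text: str) -> str:
    """Hash the raw file contents, the way a file-hash cache key would."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def stable_key(text: str) -> str:
    """Hash only the lines that pin packages, ignoring comments and blanks."""
    pins = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return hashlib.sha256("\n".join(pins).encode("utf-8")).hexdigest()[:12]


# Two hypothetical generated files that differ only in a header comment.
run_a = "# generated at 2024-01-01T10:00:00\nrequests==2.31.0\npytest==8.0.0\n"
run_b = "# generated at 2024-01-01T10:05:00\nrequests==2.31.0\npytest==8.0.0\n"

print(file_key(run_a) == file_key(run_b))      # False: the key changes, the cache misses
print(stable_key(run_a) == stable_key(run_b))  # True: the pinned packages are identical
```

The practical takeaway is the same either way: whatever file feeds the cache key should be byte-identical when the dependencies have not changed.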
The other subtle bit is continue-on-error: true on the pytest step, paired with an explicit exit 1 at the end. Without this, the job dies on a red test and never reaches the AI triage step. We separate "collect the result" from "set the job status."
Step 2: Claude triages only the failures (and only when there are failures)
Here's the script the next job runs. It reads report.json, extracts only failed tests with their tracebacks, and sends them to the Claude API. The critical optimization: if there are zero failures, it exits before making an API call. On a green build this costs $0.00.
    # scripts/triage.py
    import json
    import os
    import sys

    from anthropic import Anthropic

    with open("report.json") as f:
        report = json.load(f)

    # "error" covers setup/teardown (fixture) problems, which pytest-json-report
    # records separately from assertion failures.
    failures = [
        t for t in report.get("tests", []) if t.get("outcome") in ("failed", "error")
    ]
    if not failures:
        print("All tests passed. Skipping AI triage (no API call).")
        sys.exit(0)

    # Keep payload small: nodeid + the assertion/longrepr only.
    lines = []
    for t in failures[:40]:  # cap to control token cost
        # Take the traceback from the first stage that recorded one.
        crash = str(next(
            (t[s]["longrepr"] for s in ("setup", "call", "teardown")
             if t.get(s, {}).get("longrepr")),
            "",
        ))
        lines.append(f"### {t['nodeid']}\n{crash[:1500]}")

    prompt = (
        "You are triaging a failing CI run. Group these pytest failures by likely "
        "root cause. For each group give: (1) one-line root cause, (2) the files to "
        "look at, (3) whether it looks flaky (network/timing) vs a real regression. "
        "Be terse. Output GitHub-flavored markdown.\n\n" + "\n\n".join(lines)
    )

    client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    summary = resp.content[0].text

    # Hand the summary to later steps via the job summary + an output file.
    with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as f:
        f.write("## 🤖 Claude failure triage\n\n" + summary)
    with open("triage.md", "w") as f:
        f.write(summary)
    print(summary)
Notice three concrete decisions baked in from getting burned:
- failures[:40] and crash[:1500]. My first version sent the full 4,000-line log. One run cost me a 38k-token prompt because a fixture error spammed the same traceback 200 times. Capping failures and truncating each traceback brought a typical triage call to ~2-4k input tokens (a fraction of a cent on Sonnet).
- Writing to GITHUB_STEP_SUMMARY. This renders the AI summary directly on the run's summary page: you see the root cause without clicking into logs. This is the single highest-leverage line in the whole pipeline.
- model="claude-sonnet-4-6", not Opus. Triage is summarization, not reasoning-heavy. Sonnet is fast and cheap enough that I never think about the bill.
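The "same traceback 200 times" case suggests one further refinement that the script does not do: collapsing identical tracebacks before applying the caps. The following is a sketch under that assumption, not part of the pipeline described here. The helper name dedupe_failures and the sample data are hypothetical; the input shape mirrors the nodeid and call.longrepr fields the script already reads.

```python
def dedupe_failures(failures, max_groups=40, max_chars=1500):
    """Group failures whose (truncated) traceback text is identical."""
    groups = {}  # dicts preserve insertion order, so first-seen order is kept
    for t in failures:
        text = str((t.get("call") or {}).get("longrepr", ""))[:max_chars]
        entry = groups.setdefault(text, {"longrepr": text, "nodeids": []})
        entry["nodeids"].append(t["nodeid"])
    return list(groups.values())[:max_groups]


# Hypothetical report: 200 tests hit the same fixture error, one fails differently.
sample = [
    {"nodeid": f"tests/test_db.py::test_{i}",
     "call": {"longrepr": "fixture 'db' failed: connection refused"}}
    for i in range(200)
] + [
    {"nodeid": "tests/test_dates.py::test_tz",
     "call": {"longrepr": "AssertionError: expected UTC"}}
]

groups = dedupe_failures(sample)
print(len(groups))                # 2
print(len(groups[0]["nodeids"]))  # 200
```

Each group could then be rendered as one prompt section listing its node IDs, so a repeated traceback costs its tokens once rather than once per test.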
Step 3: Wire triage + deploy + Slack into the same workflow
The test job lives in ci.yml. Deploy and notify go in separate jobs that depend on it, so the dependency graph is explicit. Add these deploy and notify blocks to the same file, after the test job:
      deploy:
        needs: test
        if: github.ref == 'refs/heads/main' && needs.test.result == 'success'
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Deploy over SSH
            uses: appleboy/ssh-action@v1.0.3
            with:
              host: ${{ secrets.DEPLOY_HOST }}
              username: ${{ secrets.DEPLOY_USER }}
              key: ${{ secrets.DEPLOY_SSH_KEY }}
              script: |
                cd /srv/app
                git pull --ff-only
                docker compose up -d --build
      notify:
        needs: [test, deploy]
        if: always()
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with: { python-version: "3.12" }
          - name: Download report
            if: needs.test.result == 'failure'
            uses: actions/download-artifact@v4
            with: { name: pytest-report }
          - name: Run Claude triage
            if: needs.test.result == 'failure'
            env:
              ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
            run: |
              pip install anthropic
              python scripts/triage.py
          - name: Post to Slack
            env:
              SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
              STATUS: ${{ needs.test.result }}
              DEPLOYED: ${{ needs.deploy.result }}
            run: |
              if [ "$STATUS" = "failure" ] && [ -f triage.md ]; then
                # printf turns \n into a real newline before the text is JSON-encoded.
                TEXT=$(printf '❌ CI failed on `%s`. Triage:\n%s' "$GITHUB_REF_NAME" "$(cat triage.md)")
              elif [ "$DEPLOYED" = "success" ]; then
                TEXT="✅ Deployed \`${GITHUB_SHA:0:7}\` to prod."
              else
                TEXT="⚠️ Pipeline finished with status: tests=$STATUS deploy=$DEPLOYED"
              fi
              # TEXT must be placed in python's environment (prefix form); a trailing
              # TEXT="..." would only be passed as a command-line argument.
              PAYLOAD=$(TEXT="$TEXT" python -c "import json,os;print(json.dumps({'text':os.environ['TEXT']}))")
              curl -s -X POST -H 'Content-type: application/json' --data "$PAYLOAD" "$SLACK_WEBHOOK"
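The message-selection logic in the Slack step is easier to test outside of shell quoting. Below is a minimal Python sketch of the same three branches. The function name build_slack_payload and the limit parameter (an arbitrary illustrative cap, not a documented Slack limit) are assumptions and do not appear in the workflow above.

```python
import json


def build_slack_payload(status, deployed, ref_name, sha, triage=None, limit=3000):
    """Return the JSON body for a Slack incoming-webhook message."""
    if status == "failure" and triage:
        # Truncate long summaries so one noisy run cannot produce a huge message.
        body = triage if len(triage) <= limit else triage[:limit] + "\n(truncated)"
        text = f"❌ CI failed on `{ref_name}`. Triage:\n{body}"
    elif deployed == "success":
        text = f"✅ Deployed `{sha[:7]}` to prod."
    else:
        text = f"⚠️ Pipeline finished with status: tests={status} deploy={deployed}"
    # json.dumps handles quotes, backticks, and newlines in the triage text.
    return json.dumps({"text": text})


print(build_slack_payload("success", "success", "main", "0123456789abcdef"))
print(build_slack_payload("failure", "skipped", "main", "0123456789abcdef",
                          triage="- 3 failures: timezone change\n- 4 flaky"))
```

A script like this could read STATUS, DEPLOYED, GITHUB_REF_NAME, and GITHUB_SHA from the environment and print the payload for curl, which keeps the workflow step down to two short commands.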
The deploy-twice bug that concurrency fixed
My worst failure: I merged two PRs within 90 seconds. Both main pushes triggered the pipeline, both passed, and both deployed. The second docker compose up started mid-build of the first, and prod served a half-built image for ~40 seconds. The fix is the concurrency block at the top of ci.yml with cancel-in-progress: true: the older run in the same group gets cancelled the moment a newer one starts. Keep in mind that the cancellation can also arrive while the older run's deploy job is already executing, so the remote deploy script should be safe to interrupt and re-run. If you deploy from CI and have no concurrency control, you have this bug and just haven't hit it yet.
A second gotcha: the notify job uses if: always() so it runs even when tests fail, but the download and triage steps inside it are each guarded with if: needs.test.result == 'failure'. Without the guard on the triage step, it would run on green builds where report.json was never downloaded, crash on the missing file, and turn the whole pipeline red for no reason.
What the numbers actually look like
After two weeks on this setup across ~120 runs:
- CI wall time: 11m20s → 4m05s (pip cache hit + cancel-in-progress killing redundant runs).
- Time-to-locate-failure: ~6 min → under 30 sec. People read the Slack triage and go straight to the named file.
- Claude cost: ~$0.90 total, because triage only runs on red builds (about 18% of runs) and the payload is capped.
- One real save: the triage correctly grouped 5 failures as a single pytz deprecation and flagged 2 as flaky DNS, where a human would have opened 7 tabs.
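The figures above imply a per-call cost and a rough time saving, which the short calculation below works out. It uses only the numbers quoted in this section. The one assumption is that every red run would otherwise have cost the full ~6 minutes of log reading, so the time saving is an upper-end estimate.

```python
runs = 120             # total runs over two weeks
red_fraction = 0.18    # share of runs that failed and triggered triage
total_cost_usd = 0.90  # total Claude spend reported above

triage_calls = round(runs * red_fraction)      # 21.6 rounds to 22
cost_per_call = total_cost_usd / triage_calls  # dollars per triage call

# Time-to-locate dropped from ~6 min to under 0.5 min per red run.
minutes_saved = (6 - 0.5) * triage_calls

print(triage_calls)             # 22
print(round(cost_per_call, 3))  # 0.041
print(minutes_saved)            # 121.0
```

So each triage call cost about four cents and, under the stated assumption, the team got back roughly two hours of log reading for under a dollar.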
The lesson that surprised me: the value wasn't AI writing anything. It was AI deciding what a tired human should read first, gated tightly so it never runs when there's nothing to triage.
Where to take it next
Swap the SSH deploy for your platform (Fly.io, ECS, k8s — the job boundary is the same). Pipe the same triage.md into a PR comment with actions/github-script so failures annotate the PR directly. And if your suite is bigger than mine, shard pytest with pytest-xdist and merge the JSON reports before triage.
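For the sharding idea, merging can be as simple as concatenating each report's "tests" list, since that list is the only part of the report the triage script reads. This is a minimal sketch assuming each shard (for example, one CI matrix job) writes its own report file; the file names and the merge_reports helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_reports(paths):
    """Concatenate the "tests" lists from several report files into one dict."""
    merged_tests = []
    for path in paths:
        with open(path) as f:
            merged_tests.extend(json.load(f).get("tests", []))
    return {"tests": merged_tests}


# Demo with two hypothetical shard reports written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    shard_a = Path(tmp) / "report-shard-a.json"
    shard_b = Path(tmp) / "report-shard-b.json"
    shard_a.write_text(json.dumps({"tests": [
        {"nodeid": "tests/test_a.py::test_ok", "outcome": "passed"},
        {"nodeid": "tests/test_a.py::test_tz", "outcome": "failed"},
    ]}))
    shard_b.write_text(json.dumps({"tests": [
        {"nodeid": "tests/test_b.py::test_dns", "outcome": "failed"},
    ]}))
    merged = merge_reports([shard_a, shard_b])

failed = [t["nodeid"] for t in merged["tests"] if t["outcome"] == "failed"]
print(failed)
```

Writing the merged dict to report.json before running scripts/triage.py would let the triage script stay unchanged.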
If you're leveling up your CI/CD and want the deeper GitHub Actions and Docker fundamentals behind this, the technical book and online-course offers below are the ones I actually used. Pick one up here: [A8.net tracking link for technical books and programming schools].
Copy ci.yml, drop in scripts/triage.py, set the five secrets (ANTHROPIC_API_KEY, SLACK_WEBHOOK, DEPLOY_HOST, DEPLOY_USER, DEPLOY_SSH_KEY), and push. You'll have AI-triaged, auto-deploying CI before your coffee's cold.