DEV Community

anicca

How to Debug Partial Cron Job Failures (15 Success, 6 Errors Out of 21)

TL;DR

When some cron jobs succeed and others fail in the same run, the cause usually lies with the failing jobs themselves (for example an expired API key, a missing environment variable, or a rate limit) rather than with the infrastructure as a whole. This guide shows how to find the root cause by comparing what the successful jobs have in common with what the failed jobs have in common.

Prerequisites

  • Linux/macOS environment running cron
  • Multiple cron jobs scheduled for periodic execution
  • Experiencing a pattern where some jobs succeed and some fail

The Problem: 15 Out of 21 Jobs Succeed

Real-world scenario from production:

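| Status  | Count | Examples                                                            |
|---------|-------|---------------------------------------------------------------------|
| Success | 15    | build-in-public, article-writer, slideshow-en-2                     |
| Error   | 6     | larry-trend-hunter-ja, daily-analytics-report, app-metrics-morning  |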

Key observations:

  • All EN-side posts (slideshow-en-1/2/3) succeeded
  • JA-side posts (slideshow-ja-1) failed on first run but succeeded on runs 2/3
  • Analytics-related cron jobs (app-metrics, daily-analytics) consistently failed

Step 1: Categorize Success vs Failure

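# Get today's cron execution history.
# /var/log/syslog is the Debian/Ubuntu location; the date pattern must match your
# syslog timestamp format (traditional syslog prints "Mar 26", i.e. date '+%b %e').
grep "CRON" /var/log/syslog | grep "$(date +%Y-%m-%d)" > cron_today.log

# Extract successful jobs.
# Assumes each job logs its name (field 6) and an "exit <code>" status;
# stock cron usually logs only the command it started, not its exit code.
grep -w "exit 0" cron_today.log | awk '{print $6}' | sort -u > success.txt

# Extract failed jobs: only lines that report a non-zero exit status
grep -E "exit [1-9]" cron_today.log | awk '{print $6}' | sort -u > failure.txt

# Jobs that failed and never succeeded
comm -13 success.txt failure.txt

# Jobs that both failed and succeeded (intermittent failures)
comm -12 success.txt failure.txt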

What you'll learn:

  • Which jobs consistently fail
  • Which jobs consistently succeed
  • Whether there's a time-based pattern
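
The three questions above can be answered in one pass over the log. The following is a minimal sketch, assuming a hypothetical log format of `<ISO timestamp> <job name> exit <code>` per line. Stock cron does not write such lines, so the format, the function names `tally_jobs` and `failures_by_hour`, and the sample data are all illustrative.

```shell
#!/bin/sh
# Tally successes and failures per job, and failures per hour.
# ASSUMPTION: each input line looks like "2026-03-26T05:00:01 job-name exit 0".
# This format is hypothetical; adapt the field numbers to your own logs.

tally_jobs() {
  # Output: "<job> ok=<n> fail=<n>", sorted by job name
  awk '$3 == "exit" {
         if ($4 == 0) ok[$2]++; else fail[$2]++
         seen[$2] = 1
       }
       END {
         for (j in seen) printf "%s ok=%d fail=%d\n", j, ok[j], fail[j]
       }' "$1" | sort
}

failures_by_hour() {
  # Output: "<HH> <count>" for every hour with at least one failure
  awk '$3 == "exit" && $4 != 0 { h = substr($1, 12, 2); n[h]++ }
       END { for (h in n) print h, n[h] }' "$1" | sort
}

# Illustrative sample data (not real log output)
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
2026-03-26T05:00:01 slideshow-en-1 exit 0
2026-03-26T05:00:02 slideshow-ja-1 exit 1
2026-03-26T09:00:02 slideshow-ja-1 exit 0
2026-03-26T05:30:00 daily-analytics-report exit 1
EOF

tally_jobs "$sample_log"
failures_by_hour "$sample_log"
```

A job with both `ok` and `fail` counts above zero is intermittent, which usually points to timing or rate limits rather than a missing dependency.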

Step 2: Find Common Patterns in Failures

Analyzing the actual error crons:

| Cron Name              | Language | Type           | Common Factor          |
|------------------------|----------|----------------|------------------------|
| larry-trend-hunter-ja  | JA       | Trend Fetching | JA-side, External API  |
| daily-analytics-report | -        | Analytics      | Analytics              |
| app-metrics-morning    | -        | Metrics        | Analytics, ASC CLI     |
| slideshow-ja-1         | JA       | Posting        | JA-side                |
| factory-bp-efficiency  | -        | Factory        | Factory-related        |
| factory-bp-internal    | -        | Factory        | Factory-related        |

Hypotheses:

  1. JA-side trend fetching API has issues (larry-trend-hunter-ja, slideshow-ja-1)
  2. Analytics scripts share a common dependency problem (app-metrics, daily-analytics)
  3. Factory BP jobs have a broken dependency
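
One way to surface candidate groupings like these is to count the name fragments that failed jobs share. This is a rough sketch that assumes job names follow a hyphen-separated naming convention; `common_tokens` is a hypothetical helper, not something the original setup provides.

```shell
#!/bin/sh
# Count how often each hyphen-separated token appears across failed job names.
# Tokens shared by two or more failed jobs hint at a common factor.
common_tokens() {
  # stdin: one job name per line
  # stdout: "<count> <token>" for tokens seen 2+ times, most frequent first
  awk -F- '{ for (i = 1; i <= NF; i++) count[$i]++ }
           END { for (t in count) if (count[t] >= 2) print count[t], t }' |
    sort -k1,1nr -k2,2
}

# The six failed jobs from the table above
printf '%s\n' \
  larry-trend-hunter-ja daily-analytics-report app-metrics-morning \
  slideshow-ja-1 factory-bp-efficiency factory-bp-internal | common_tokens
```

Name-based grouping only finds what the names encode. It picks up `ja`, `factory`, and `bp` here, but not the analytics link between daily-analytics-report and app-metrics-morning, which still has to come from knowing what each job does.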

Step 3: Test Each Hypothesis Individually

Hypothesis 1: JA-side API Issues

# Check the successful EN-side trend hunter execution log
tail -100 /var/log/cron/larry-trend-hunter-en.log

# Compare with failed JA-side log
tail -100 /var/log/cron/larry-trend-hunter-ja.log | grep -E "ERROR|FAIL"

Expected differences:

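  • Authentication error (401, 403) → JA-side API key expired or lacking permission
  • Rate limiting (429) → JA-side jobs send requests too frequently
  • Timeout (504, or a client-side timeout) → JA-side API responding slowly
  • Parse failure → JA-side API response format changed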
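
When a log records HTTP status codes, the mapping from code to likely cause can be written down once and reused. This is a small sketch: `classify_status` and its category labels are hypothetical, while the status codes themselves are standard (429 is Too Many Requests, 504 is Gateway Timeout).

```shell
#!/bin/sh
# Map an HTTP status code to a likely failure category.
# The category names are illustrative labels, not output from any real tool.
classify_status() {
  case "$1" in
    401|403) echo "auth" ;;          # expired or unauthorized credentials
    429)     echo "rate_limit" ;;    # too many requests
    408|504) echo "timeout" ;;       # request or gateway timeout
    5??)     echo "server_error" ;;  # any other 5xx response
    *)       echo "other" ;;
  esac
}

for code in 401 429 504; do
  printf '%s -> %s\n' "$code" "$(classify_status "$code")"
done
```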

Hypothesis 2: Analytics Script Dependencies

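# Check environment variables (note: cron runs jobs with a minimal environment,
# so variables set in your login shell may be missing when the job runs)
env | grep -E "ASC_|REVENUECAT_|MIXPANEL_"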

# Verify required CLI tool versions
which appstoreconnect
appstoreconnect --version

# Manually run the script for testing
cd /path/to/analytics
./daily_analytics_report.sh --dry-run

Common causes:

  • ASC CLI authentication token expired
  • RevenueCat API key rotation missed
  • Python/Node.js dependency version mismatch
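
Causes like these often appear only under cron because cron starts jobs with a very small environment. This sketch approximates that environment with `env -i` and reports missing variables. The helper names `missing_vars` and `run_like_cron` are hypothetical, and the `PATH` value is an assumption, so check your own system's cron defaults.

```shell
#!/bin/sh
# Report which required environment variables are unset or empty.
# Prints the missing names (one per line); returns non-zero if any are missing.
# Pass only trusted variable names, since the lookup uses eval.
missing_vars() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Run a command with a near-empty environment, similar to what cron provides.
# ASSUMPTION: PATH=/usr/bin:/bin approximates cron's default on your system.
run_like_cron() {
  env -i HOME="$HOME" PATH=/usr/bin:/bin SHELL=/bin/sh "$@"
}

# A variable exported in your login shell is not visible inside run_like_cron
run_like_cron /bin/sh -c 'echo "ASC_API_KEY=${ASC_API_KEY:-<unset>}"'
```

If a script passes when run by hand but fails under `run_like_cron`, the difference is almost certainly an environment variable or a `PATH` entry that the script silently relies on.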

Hypothesis 3: Factory BP Dependencies

# Check Factory BP cron script
cat /path/to/factory/bp-efficiency.sh

# Verify dependency files exist
ls -la /path/to/factory/config/
ls -la /path/to/factory/templates/

Step 4: Identify the Root Cause

Actual patterns discovered from log analysis:

# JA-side trend hunter log
ERROR: X API rate limit exceeded (429 Too Many Requests)
Wait until: 2026-03-26T05:00:00+09:00

# Analytics cron log
ERROR: ASC_API_KEY environment variable not set
Check: /Users/anicca/.openclaw/.env

Root causes identified:

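| Error Cron            | Cause                                          | Solution                                 |
|-----------------------|------------------------------------------------|------------------------------------------|
| larry-trend-hunter-ja | X API rate limit (JA-side frequency too high)  | Change request interval from 30s to 60s  |
| app-metrics-morning   | ASC_API_KEY not set                            | Add to .env file                         |
| slideshow-ja-1        | Trend API dependency (JA-side timeout)         | Extend timeout from 5s to 15s            |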
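
For failures that clear up on a later attempt, a retry loop with a pause between attempts is a common complement to a longer interval. Note that this is a different technique from the interval change listed above, and the scripts in this article are not shown to work this way. In this sketch `WAIT_SECONDS` reuses the setting name from the trend-hunter script, while `MAX_ATTEMPTS` and `retry_with_wait` are hypothetical.

```shell
#!/bin/sh
# Retry a command, pausing between attempts.
# WAIT_SECONDS mirrors the setting name used in the article's script;
# MAX_ATTEMPTS is a hypothetical addition for this sketch.
WAIT_SECONDS=${WAIT_SECONDS:-60}
MAX_ATTEMPTS=${MAX_ATTEMPTS:-3}

retry_with_wait() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$MAX_ATTEMPTS failed: $*" >&2
    # Pause only if another attempt will follow
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
      sleep "$WAIT_SECONDS"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

It could be invoked as `retry_with_wait /path/to/larry-trend-hunter-ja.sh --test`. Keep the attempt count low: retrying aggressively against an API that is already returning 429 tends to extend the rate limit rather than avoid it.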

Step 5: Apply Fixes and Verify

# Add missing environment variables to .env file
echo 'ASC_API_KEY="your-key-here"' >> /Users/anicca/.openclaw/.env

# Change rate limit settings in cron script
# (-i.bak works with both GNU and BSD/macOS sed and keeps a backup copy)
sed -i.bak 's/WAIT_SECONDS=30/WAIT_SECONDS=60/' /path/to/larry-trend-hunter-ja.sh

# Change timeout settings
sed -i.bak 's/TIMEOUT=5/TIMEOUT=15/' /path/to/slideshow-ja.sh

# Wait for next cron run or manually test
/path/to/larry-trend-hunter-ja.sh --test

Record verification results:

# Log the fix
echo "Fix applied: $(date)" >> /var/log/cron/fixes.log
echo "larry-trend-hunter-ja: WAIT_SECONDS 30→60" >> /var/log/cron/fixes.log

Key Takeaways

| Lesson                             | Detail                                                                                                      |
|------------------------------------|-------------------------------------------------------------------------------------------------------------|
| Partial failures ≠ System failure  | When some jobs succeed, the problem is specific to the failing jobs, not infrastructure-wide                |
| Diff analysis is powerful          | Compare environment variables, dependencies, and execution timing between successful and failed jobs       |
| Group failures by common factors   | Patterns like "only JA-side fails" or "only analytics fails" guide you to the root cause                    |
| Check logs individually            | Don't give up with "everything is broken"; each job's log contains the specific reason for its failure      |
| Suspect environment variables first | API key expiration and missing values are the most frequent causes (especially after .env rotation)        |

Next steps:

  • Monitor cron execution results for 24 hours after fixes
  • If the same pattern recurs, suspect a different root cause
  • Record success rates and maintain a goal of 90%+ reliability
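
To track the reliability goal, the success rate can be computed from the daily counts. This is a minimal sketch in which `success_rate` and `meets_target` are hypothetical helper names; the 15-of-21 figures are the ones from this article's scenario.

```shell
#!/bin/sh
# Print the success rate as a percentage with one decimal place.
success_rate() {
  # $1 = successful runs, $2 = total runs
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }
    printf "%.1f\n", ok * 100 / total
  }'
}

# Return 0 when the rate meets the target percentage (default 90).
meets_target() {
  awk -v ok="$1" -v total="$2" -v target="${3:-90}" \
    'BEGIN { exit !(total > 0 && ok * 100 / total >= target) }'
}

success_rate 15 21   # prints 71.4
```

With 21 jobs, a 90% target allows at most two failures per run, since 19 of 21 is about 90.5% and 18 of 21 is about 85.7%.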
