How to add monitoring to gocron scheduled jobs in Go

#gocron #go #devops #monitoring

gocron is the most widely used job scheduling library in Go. It handles the hard parts of scheduling — cron expressions, concurrency control, timezone awareness, singleton modes — and gets out of your way.

What it doesn't do is tell you when something goes wrong.

The gocron v2 documentation lists a Monitor interface and a MonitorStatus interface for collecting metrics from job execution. Both entries note the same thing: "There are currently no open source implementations of the Monitor interface available."

That gap is what this post addresses. Here's how to instrument your gocron jobs with external monitoring so you're alerted the moment a job fails, hangs indefinitely, or stops running on schedule.

What gocron monitoring gives you natively

gocron v2 exposes two scheduler-level monitoring hooks:

Monitor collects metrics per job execution. You implement the interface and attach it to the scheduler. gocron calls your implementation before and after each job run.

MonitorStatus extends Monitor with error and status tracking per execution.

Both are useful for local observability — feeding metrics into Prometheus, for example. But they don't solve the most important monitoring problems:

Missed runs: if gocron itself crashes or your process dies, nothing fires an alert because the scheduler is gone.
Hung jobs: gocron can detect that a job is taking longer than expected, but it doesn't alert anyone externally.
Silent failures: a job that completes without error but processes zero records looks identical to a successful run.

External monitoring solves all three. The approach: your jobs send start and completion pings to an external service. If a ping doesn't arrive when expected, you get alerted — even if your entire Go process is down.

The basic instrumentation pattern

For simple gocron jobs, add HTTP pings around the work:

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    "github.com/go-co-op/gocron/v2"
)

const (
    apiKey    = "ck_live_your_api_key"
    monitorID = "your-monitor-id"
    baseURL   = "https://api.crontify.com/api/v1/ping"
)

func ping(event string) {
    url := fmt.Sprintf("%s/%s/%s", baseURL, monitorID, event)
    req, err := http.NewRequest("POST", url, nil)
    if err != nil {
        log.Printf("ping %s: failed to build request: %v", event, err)
        return
    }
    req.Header.Set("X-API-Key", apiKey)

    client := &http.Client{Timeout: 10 * time.Second}
    resp, err := client.Do(req)
    if err != nil {
        log.Printf("ping %s: request failed: %v", event, err)
        return
    }
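    defer resp.Body.Close()

    // A non-2xx status (for example, a wrong API key) likely means the
    // ping was not recorded, so surface it in the logs.
    if resp.StatusCode < 200 || resp.StatusCode >= 300 {
        log.Printf("ping %s: unexpected status %s", event, resp.Status)
    }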
}

func syncRecords() error {
    // your job logic
    return nil
}

func main() {
    s, err := gocron.NewScheduler()
    if err != nil {
        log.Fatal(err)
    }

    _, err = s.NewJob(
        gocron.CronJob("0 2 * * *", false),
        gocron.NewTask(func() {
            ping("start")

            if err := syncRecords(); err != nil {
                // Optionally POST error details to the fail endpoint
                ping("fail")
                log.Printf("syncRecords failed: %v", err)
                return
            }

            ping("success")
        }),
    )
    if err != nil {
        log.Fatal(err)
    }

    s.Start()
    select {}
}

This gives you missed run detection (if the start ping doesn't arrive within your grace period) and failed job detection (if fail is pinged instead of success).

Adding hung job detection

Hung job detection requires a start ping followed by a success or fail ping within a maximum duration. That's already provided by the pattern above — Crontify's scheduler checks for runs that started but never completed.

You can also add an application-level timeout with Go's context package, so a job that runs too long is reported as failed instead of staying silent:

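gocron.NewTask(func() {
    // Needs "context" in the import block.
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
    defer cancel()

    ping("start")

    // Buffered so the goroutine can deliver its result even after a timeout.
    done := make(chan error, 1)
    go func() {
        // syncRecords does not receive ctx, so it keeps running after the
        // timeout fires unless you change it to accept and honor a context.
        done <- syncRecords()
    }()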

    select {
    case err := <-done:
        if err != nil {
            ping("fail")
            log.Printf("job failed: %v", err)
        } else {
            ping("success")
        }
    case <-ctx.Done():
        ping("fail")
        log.Printf("job exceeded 30 minute timeout")
    }
}),

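If syncRecords() runs beyond 30 minutes, the context expires, a fail ping is sent, and Crontify records the run as failed. Note that the timeout only stops the wait: the goroutine running syncRecords() continues until it returns on its own, so the work is not actually terminated unless syncRecords accepts the context and checks it. The external hung job detection provides a second safety net: if your process itself hangs and never sends any ping, the monitor will alert after the threshold regardless.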
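To make the timeout actually stop the work, the job has to receive the context and check it. Here is a minimal sketch under that assumption. It is not part of gocron or Crontify: runWithTimeout, errTimedOut, and sleepyWork are hypothetical names, and sleepyWork stands in for a context-aware syncRecords.

```go
package main

import (
    "context"
    "errors"
    "time"
)

// errTimedOut is a hypothetical sentinel that lets callers tell a timeout
// apart from an ordinary job error.
var errTimedOut = errors.New("job exceeded its timeout")

// runWithTimeout runs work with a context that is cancelled after timeout.
// Work that honors ctx stops early. The function itself returns as soon as
// the work finishes or the deadline passes, whichever comes first.
func runWithTimeout(timeout time.Duration, work func(ctx context.Context) error) error {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    // Buffered so the goroutine never blocks on send after a timeout.
    done := make(chan error, 1)
    go func() { done <- work(ctx) }()

    select {
    case err := <-done:
        // Work that noticed the deadline reports it as a timeout too.
        if errors.Is(err, context.DeadlineExceeded) {
            return errTimedOut
        }
        return err
    case <-ctx.Done():
        return errTimedOut
    }
}

// sleepyWork is a stand-in for a context-aware syncRecords: it "works" for d
// but returns early with ctx.Err() if the context is cancelled first.
func sleepyWork(d time.Duration) func(ctx context.Context) error {
    return func(ctx context.Context) error {
        select {
        case <-time.After(d):
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}
```

Inside the gocron task you would send ping("start"), call runWithTimeout with your own context-aware job function, and then send ping("fail") if it returns an error or ping("success") otherwise.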

Attaching metadata for silent failure detection

Go jobs often process records, sync data, or run aggregations. Attaching the count of what was actually processed lets you define alert rules on the output — firing an alert when a job succeeds but processes nothing.

The Crontify success endpoint accepts a JSON body:

import (
    "bytes"
    "encoding/json"
)

type SuccessPayload struct {
    Meta map[string]any `json:"meta"`
}

func pingSuccess(meta map[string]any) {
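    payload, err := json.Marshal(SuccessPayload{Meta: meta})
    if err != nil {
        log.Printf("pingSuccess: failed to encode metadata: %v", err)
        return
    }

    url := fmt.Sprintf("%s/%s/success", baseURL, monitorID)
    req, err := http.NewRequest("POST", url, bytes.NewReader(payload))
    if err != nil {
        // Without this check a nil req would panic on the next line.
        log.Printf("pingSuccess: failed to build request: %v", err)
        return
    }
    req.Header.Set("X-API-Key", apiKey)
    req.Header.Set("Content-Type", "application/json")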

    client := &http.Client{Timeout: 10 * time.Second}
    resp, err := client.Do(req)
    if err != nil {
        log.Printf("pingSuccess failed: %v", err)
        return
    }
    defer resp.Body.Close()
}

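// In your job. This assumes syncRecords now returns a result struct
// (with Count, DurationMs, and Errors fields) alongside the error.
result, err := syncRecords()
if err != nil {
    ping("fail")
    return
}

pingSuccess(map[string]any{
    "rows_synced":    result.Count,
    "duration_ms":    result.DurationMs,
    "errors_skipped": result.Errors,
})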

In Crontify, you can then define a rule: rows_synced eq 0 → fire alert. The run is still logged as a success, but you get an immediate notification that something upstream is broken.
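For reference, here is a sketch of the request body that pingSuccess produces, assuming Crontify reads metadata from the top-level meta key as the SuccessPayload struct implies. successBody is a hypothetical helper name used only for this illustration.

```go
package main

import "encoding/json"

// SuccessPayload mirrors the struct used by pingSuccess.
type SuccessPayload struct {
    Meta map[string]any `json:"meta"`
}

// successBody encodes job metadata into the JSON body sent with a success ping.
func successBody(meta map[string]any) ([]byte, error) {
    return json.Marshal(SuccessPayload{Meta: meta})
}
```

Calling successBody(map[string]any{"rows_synced": 0}) yields {"meta":{"rows_synced":0}}, which is presumably the field a rows_synced eq 0 rule is evaluated against.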

Wrapping it up as a reusable helper

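If you have multiple gocron jobs to instrument, a small wrapper avoids repetition. Here c.ping and c.pingSuccess are method versions of the ping and pingSuccess helpers above that read the monitor ID, API key, and base URL from the struct: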

type CrontifyJob struct {
    MonitorID string
    APIKey    string
    BaseURL   string
}

func (c *CrontifyJob) Wrap(fn func() (map[string]any, error)) func() {
    return func() {
        c.ping("start")

        meta, err := fn()
        if err != nil {
            c.ping("fail")
            log.Printf("job %s failed: %v", c.MonitorID, err)
            return
        }

        if meta != nil {
            c.pingSuccess(meta)
        } else {
            c.ping("success")
        }
    }
}
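Wrap calls c.ping and c.pingSuccess, which the snippet above leaves undefined. One possible way to write them is sketched below. The helper names endpoint, send, and httpClient are hypothetical, and treating any non-2xx status as a rejected ping is an assumption rather than documented Crontify behavior.

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
)

// CrontifyJob is repeated here only so the sketch compiles on its own.
type CrontifyJob struct {
    MonitorID string
    APIKey    string
    BaseURL   string
}

// httpClient is shared across pings so connections can be reused.
var httpClient = &http.Client{Timeout: 10 * time.Second}

// endpoint builds the ping URL for an event such as "start" or "fail".
func (c *CrontifyJob) endpoint(event string) string {
    return fmt.Sprintf("%s/%s/%s", c.BaseURL, c.MonitorID, event)
}

// send POSTs to the event endpoint with an optional JSON body.
func (c *CrontifyJob) send(event string, body []byte) error {
    req, err := http.NewRequest(http.MethodPost, c.endpoint(event), bytes.NewReader(body))
    if err != nil {
        return fmt.Errorf("build request: %w", err)
    }
    req.Header.Set("X-API-Key", c.APIKey)
    if len(body) > 0 {
        req.Header.Set("Content-Type", "application/json")
    }

    resp, err := httpClient.Do(req)
    if err != nil {
        return fmt.Errorf("send request: %w", err)
    }
    defer resp.Body.Close()

    // Assumption: any non-2xx status means the ping was rejected.
    if resp.StatusCode < 200 || resp.StatusCode >= 300 {
        return fmt.Errorf("unexpected status %s", resp.Status)
    }
    return nil
}

// ping sends a bodyless event. Errors are logged rather than returned so
// that a monitoring outage never fails the job itself.
func (c *CrontifyJob) ping(event string) {
    if err := c.send(event, nil); err != nil {
        log.Printf("ping %s for %s: %v", event, c.MonitorID, err)
    }
}

// pingSuccess sends a success event with metadata attached.
func (c *CrontifyJob) pingSuccess(meta map[string]any) {
    body, err := json.Marshal(map[string]any{"meta": meta})
    if err != nil {
        log.Printf("pingSuccess for %s: encode metadata: %v", c.MonitorID, err)
        return
    }
    if err := c.send("success", body); err != nil {
        log.Printf("pingSuccess for %s: %v", c.MonitorID, err)
    }
}
```

Routing both methods through one send function keeps the header and status handling in a single place, and the shared client avoids building a new http.Client for every ping.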

Usage:

// Needs "os" in the import block. syncRecords is assumed to return a result
// struct with a Count field, as in the previous section.
nightly := &CrontifyJob{
    MonitorID: "mon_abc123",
    APIKey:    os.Getenv("CRONTIFY_API_KEY"),
    BaseURL:   "https://api.crontify.com/api/v1/ping",
}

_, err = s.NewJob(
    gocron.CronJob("0 2 * * *", false),
    gocron.NewTask(nightly.Wrap(func() (map[string]any, error) {
        result, err := syncRecords()
        if err != nil {
            return nil, err
        }
        return map[string]any{"rows_synced": result.Count}, nil
    })),
)
if err != nil {
    log.Fatal(err)
}

What you get

After instrumenting your gocron jobs with external monitoring:

Missed runs: you're alerted when a job doesn't start within its grace period. This catches process crashes, server reboots, and OOM kills that take down the entire scheduler.
Hung jobs: you're alerted when a job starts but never finishes. This catches deadlocks, infinite loops, and database locks.
Silent failures: you're alerted when a job completes but produces no output. This catches empty upstream responses, failed database conditions, and zero-record syncs.
Recovery: an alert fires automatically when a previously failing job returns to a healthy state.

Crontify is free for up to 5 monitors. No credit card required.