I run 6 AI businesses from a single VPS. When your entire operation depends on automated tasks running perfectly, you learn to build systems that fix themselves before you wake up to angry customers.
Here's how I built a cron system that monitors itself and recovers from failures automatically using pg_cron and Supabase Edge Functions.
Why I needed this
My Load Bearing Empire processes thousands of AI agent calls daily. Lead scoring runs every 15 minutes. Data sync happens hourly. Payment processing triggers every 30 minutes.
A single failed cron job costs me real money. I've been burned by silent failures too many times.
Most developers rely on external monitoring services. I prefer owning my infrastructure. This system costs me $0 in additional subscriptions and runs entirely within Supabase.
The architecture
Three components work together:
- pg_cron schedules and executes jobs
- Edge Functions handle the actual business logic
- Health monitoring table tracks job status and triggers recovery
The key insight: every cron job reports its status to a central monitoring table. If a job fails or doesn't report in, the system automatically retries and alerts me.
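Before wiring anything up, it helps to pin down what "doesn't report in" means. Here is a minimal sketch of that staleness rule as a pure TypeScript function (the name `isStale` and the grace-period parameter are my own, not part of pg_cron or Supabase):

```typescript
// Staleness rule: a job is unhealthy when its last successful run is
// older than its expected interval plus a grace period.
export function isStale(
  lastSuccess: string | null, // ISO timestamp of last success, or null if never
  intervalMinutes: number,    // how often the job is scheduled
  graceMinutes: number,       // slack before we call it a failure
  now: Date = new Date(),
): boolean {
  if (lastSuccess === null) return true; // never succeeded -> unhealthy
  const ageMinutes = (now.getTime() - new Date(lastSuccess).getTime()) / 60_000;
  return ageMinutes > intervalMinutes + graceMinutes;
}
```

The SQL monitor below applies exactly this rule, with the interval and grace period hard-coded per job family.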
Setting up the foundation
First, enable pg_cron in your Supabase project:
-- Run this in your SQL editor
CREATE EXTENSION IF NOT EXISTS pg_cron;
CREATE EXTENSION IF NOT EXISTS pg_net;  -- needed for the net.http_post calls below
-- Create the monitoring table
CREATE TABLE cron_health (
  id SERIAL PRIMARY KEY,
  job_name TEXT NOT NULL UNIQUE,  -- UNIQUE so upserts can target ON CONFLICT (job_name)
  last_run TIMESTAMP WITH TIME ZONE,
  last_success TIMESTAMP WITH TIME ZONE,
  status TEXT CHECK (status IN ('running', 'success', 'failed', 'retry_triggered')),
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Index for fast staleness lookups (job_name is already indexed via the UNIQUE constraint)
CREATE INDEX idx_cron_health_last_run ON cron_health(last_run);
Creating a self-reporting Edge Function
Here's an Edge Function that reports its own health status:
// supabase/functions/process-leads/index.ts
import { serve } from "https://deno.land/std@0.168.0/http/server.ts"
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

serve(async (req) => {
  const jobName = 'process-leads'
  const supabase = createClient(
    Deno.env.get('SUPABASE_URL') ?? '',
    Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? ''
  )

  try {
    // Mark the job as running. retry_count is owned by the health monitor,
    // so we deliberately leave it untouched here -- resetting it on every
    // run would defeat the retry cap.
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        status: 'running'
      }, { onConflict: 'job_name' })

    // Your actual business logic here
    const result = await processLeads()

    // Report success and reset the retry counter
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        last_success: new Date().toISOString(),
        status: 'success',
        error_message: null,
        retry_count: 0
      }, { onConflict: 'job_name' })

    return new Response(JSON.stringify({ success: true, processed: result.count }))
  } catch (error) {
    // Report failure; the health monitor will pick this up and retry
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        status: 'failed',
        error_message: error instanceof Error ? error.message : String(error)
      }, { onConflict: 'job_name' })

    return new Response(JSON.stringify({ error: String(error) }), { status: 500 })
  }
})

// Placeholder for the real business logic
async function processLeads(): Promise<{ count: number }> {
  // ... score and route new leads ...
  return { count: 0 }
}
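That reporting boilerplate repeats for every job. When you run many jobs, it can be factored into a generic wrapper. This is a sketch under my own names (`withHealthReport`, `HealthReporter`); the reporter is injected so the same code works with a supabase-js upsert or a test stub:

```typescript
// One health-report row, mirroring the cron_health columns written by a job.
type JobStatus = "running" | "success" | "failed";

interface HealthReport {
  job_name: string;
  last_run: string;
  last_success?: string;
  status: JobStatus;
  error_message?: string | null;
  retry_count?: number;
}

// Anything that can persist a report: a supabase-js upsert, a test stub, etc.
type HealthReporter = (r: HealthReport) => Promise<void>;

// Run `work` with running/success/failed reports around it.
export async function withHealthReport<T>(
  jobName: string,
  report: HealthReporter,
  work: () => Promise<T>,
): Promise<T> {
  await report({ job_name: jobName, last_run: new Date().toISOString(), status: "running" });
  try {
    const result = await work();
    const ts = new Date().toISOString();
    await report({
      job_name: jobName, last_run: ts, last_success: ts,
      status: "success", error_message: null, retry_count: 0,
    });
    return result;
  } catch (err) {
    await report({
      job_name: jobName,
      last_run: new Date().toISOString(),
      status: "failed",
      error_message: err instanceof Error ? err.message : String(err),
    });
    throw err; // let the caller decide the HTTP status
  }
}
```

Each Edge Function body then reduces to one line wrapping its business logic, e.g. `withHealthReport('process-leads', reporter, processLeads)`.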
The self-healing mechanism
This monitoring function runs every 5 minutes and handles recovery:
-- Create the health check function
CREATE OR REPLACE FUNCTION check_cron_health()
RETURNS void AS $$
DECLARE
  job_record RECORD;
  function_url TEXT;
BEGIN
  -- Find jobs that haven't reported success within their expected interval.
  -- A NULL last_success means the job has never succeeded, which also counts.
  FOR job_record IN
    SELECT job_name, last_run, last_success, retry_count
    FROM cron_health
    WHERE (
      -- Jobs that run every 15 minutes but haven't succeeded in 20 minutes
      (job_name LIKE '%leads%' AND (last_success IS NULL OR last_success < NOW() - INTERVAL '20 minutes')) OR
      -- Jobs that run hourly but haven't succeeded in 75 minutes
      (job_name LIKE '%sync%' AND (last_success IS NULL OR last_success < NOW() - INTERVAL '75 minutes'))
    )
    AND retry_count < 3
  LOOP
    -- Build the Edge Function URL
    function_url := 'https://your-project.supabase.co/functions/v1/' || job_record.job_name;

    -- Trigger a retry via pg_net. Headers and body must be jsonb.
    -- app.service_role_key must be configured once, e.g.:
    --   ALTER DATABASE postgres SET app.service_role_key = '<your key>';
    PERFORM net.http_post(
      url := function_url,
      headers := jsonb_build_object(
        'Authorization', 'Bearer ' || current_setting('app.service_role_key'),
        'Content-Type', 'application/json'
      ),
      body := '{}'::jsonb
    );

    -- Record the retry attempt on the job's own row
    UPDATE cron_health
    SET last_run = NOW(),
        status = 'retry_triggered',
        retry_count = retry_count + 1
    WHERE job_name = job_record.job_name;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
Scheduling everything
Now wire it all together with pg_cron:
-- Schedule your business logic (dollar-quoting avoids escaping nested quotes)
SELECT cron.schedule('process-leads', '*/15 * * * *', $$
  SELECT net.http_post(
    url := 'https://your-project.supabase.co/functions/v1/process-leads',
    headers := jsonb_build_object(
      'Authorization', 'Bearer ' || current_setting('app.service_role_key'),
      'Content-Type', 'application/json'
    ),
    body := '{}'::jsonb
  )
$$);
-- Schedule the health monitor
SELECT cron.schedule('health-check', '*/5 * * * *', 'SELECT check_cron_health()');
-- Clean up old health records weekly
SELECT cron.schedule('cleanup-health', '0 2 * * 0',
'DELETE FROM cron_health WHERE created_at < NOW() - INTERVAL ''30 days''');
Monitoring dashboard
Query this to see your system health:
-- Current status of all jobs
SELECT
job_name,
status,
last_success,
EXTRACT(EPOCH FROM (NOW() - last_success))/60 as minutes_since_success,
retry_count,
error_message
FROM cron_health
WHERE job_name NOT LIKE '%retry%'
ORDER BY last_run DESC;
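The monitor retries on its own, but the "alerts me" half still needs a decision rule: which rows warrant notifying a human? A sketch in TypeScript (the names are mine; the threshold mirrors the `retry_count < 3` cutoff in `check_cron_health`):

```typescript
// One row from the dashboard query above.
interface HealthRow {
  job_name: string;
  status: string;
  retry_count: number;
  error_message: string | null;
}

// Alert only once automatic recovery is exhausted: the job failed and
// has already been retried maxRetries times without a success.
export function findAlerts(rows: HealthRow[], maxRetries = 3): HealthRow[] {
  return rows.filter((r) => r.status === "failed" && r.retry_count >= maxRetries);
}
```

Feed the result to whatever notifier you already use (a webhook `fetch`, email, etc.) from a scheduled Edge Function, so alerting stays inside Supabase like everything else.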
Real results
Since implementing this system 3 months ago:
- Zero silent failures
- 4 automatic recoveries from network timeouts
- 99.8% job success rate
- 2 minutes average recovery time
You get infrastructure that fixes itself. Your cron jobs report their health. Failed jobs retry automatically. You sleep better knowing your systems won't fail silently.
Build systems that work without you watching them.