I run 6 AI businesses from a single VPS. When your entire operation depends on automated tasks running perfectly, you learn to build systems that fix themselves before you wake up to angry customers.
Here's how I built a cron system that monitors itself and recovers from failures automatically using pg_cron and Supabase Edge Functions.
Why I needed this
My Load Bearing Empire processes thousands of AI agent calls daily. Lead scoring runs every 15 minutes. Data sync happens hourly. Payment processing triggers every 30 minutes.
A single failed cron job costs me real money. I've been burned by silent failures too many times.
Most developers rely on external monitoring services. I prefer owning my infrastructure. This system costs me $0 in additional subscriptions and runs entirely within Supabase.
The architecture
Three components work together:
- pg_cron schedules and executes jobs
- Edge Functions handle the actual business logic
- Health monitoring table tracks job status and triggers recovery
The key insight: every cron job reports its status to a central monitoring table. If a job fails or doesn't report in, the system automatically retries and alerts me.
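Before wiring anything up, it helps to pin down what "doesn't report in" means. Here is a minimal sketch of that staleness rule as a pure TypeScript function (the name `isStale` and the grace-period parameter are my own, not part of pg_cron or Supabase):

```typescript
// Staleness rule: a job is unhealthy when its last successful run is
// older than its expected interval plus a grace period.
export function isStale(
  lastSuccess: string | null, // ISO timestamp of last success, or null if never
  intervalMinutes: number,    // how often the job is scheduled
  graceMinutes: number,       // slack before we call it a failure
  now: Date = new Date(),
): boolean {
  if (lastSuccess === null) return true; // never succeeded -> unhealthy
  const ageMinutes = (now.getTime() - new Date(lastSuccess).getTime()) / 60_000;
  return ageMinutes > intervalMinutes + graceMinutes;
}
```

The SQL monitor below applies exactly this rule, with the interval and grace period hard-coded per job family.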
Setting up the foundation
First, enable pg_cron in your Supabase project:
-- Run this in your SQL editor
CREATE EXTENSION IF NOT EXISTS pg_cron;
CREATE EXTENSION IF NOT EXISTS pg_net;  -- needed for the net.http_post calls below
-- Create the monitoring table
CREATE TABLE cron_health (
  id SERIAL PRIMARY KEY,
  job_name TEXT NOT NULL UNIQUE,  -- UNIQUE so upserts can target ON CONFLICT (job_name)
  last_run TIMESTAMP WITH TIME ZONE,
  last_success TIMESTAMP WITH TIME ZONE,
  status TEXT CHECK (status IN ('running', 'success', 'failed', 'retry_triggered')),
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- Index for fast staleness lookups (job_name is already indexed via the UNIQUE constraint)
CREATE INDEX idx_cron_health_last_run ON cron_health(last_run);
Creating a self-reporting Edge Function
Here's an Edge Function that reports its own health status:
// supabase/functions/process-leads/index.ts
import { serve } from "https://deno.land/std@0.168.0/http/server.ts"
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

serve(async (req) => {
  const jobName = 'process-leads'
  const supabase = createClient(
    Deno.env.get('SUPABASE_URL') ?? '',
    Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? ''
  )

  try {
    // Mark the job as running. retry_count is owned by the health monitor,
    // so we deliberately leave it untouched here -- resetting it on every
    // run would defeat the retry cap.
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        status: 'running'
      }, { onConflict: 'job_name' })

    // Your actual business logic here
    const result = await processLeads()

    // Report success and reset the retry counter
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        last_success: new Date().toISOString(),
        status: 'success',
        error_message: null,
        retry_count: 0
      }, { onConflict: 'job_name' })

    return new Response(JSON.stringify({ success: true, processed: result.count }))
  } catch (error) {
    // Report failure; the health monitor will pick this up and retry
    await supabase
      .from('cron_health')
      .upsert({
        job_name: jobName,
        last_run: new Date().toISOString(),
        status: 'failed',
        error_message: error instanceof Error ? error.message : String(error)
      }, { onConflict: 'job_name' })

    return new Response(JSON.stringify({ error: String(error) }), { status: 500 })
  }
})

// Placeholder for the real business logic
async function processLeads(): Promise<{ count: number }> {
  // ... score and route new leads ...
  return { count: 0 }
}
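That reporting boilerplate repeats for every job. When you run many jobs, it can be factored into a generic wrapper. This is a sketch under my own names (`withHealthReport`, `HealthReporter`); the reporter is injected so the same code works with a supabase-js upsert or a test stub:

```typescript
// One health-report row, mirroring the cron_health columns written by a job.
type JobStatus = "running" | "success" | "failed";

interface HealthReport {
  job_name: string;
  last_run: string;
  last_success?: string;
  status: JobStatus;
  error_message?: string | null;
  retry_count?: number;
}

// Anything that can persist a report: a supabase-js upsert, a test stub, etc.
type HealthReporter = (r: HealthReport) => Promise<void>;

// Run `work` with running/success/failed reports around it.
export async function withHealthReport<T>(
  jobName: string,
  report: HealthReporter,
  work: () => Promise<T>,
): Promise<T> {
  await report({ job_name: jobName, last_run: new Date().toISOString(), status: "running" });
  try {
    const result = await work();
    const ts = new Date().toISOString();
    await report({
      job_name: jobName, last_run: ts, last_success: ts,
      status: "success", error_message: null, retry_count: 0,
    });
    return result;
  } catch (err) {
    await report({
      job_name: jobName,
      last_run: new Date().toISOString(),
      status: "failed",
      error_message: err instanceof Error ? err.message : String(err),
    });
    throw err; // let the caller decide the HTTP status
  }
}
```

Each Edge Function body then reduces to one line wrapping its business logic, e.g. `withHealthReport('process-leads', reporter, processLeads)`.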
The self-healing mechanism
This monitoring function runs every 5 minutes and handles recovery:
-- Create the health check function
CREATE OR REPLACE FUNCTION check_cron_health()
RETURNS void AS $$
DECLARE
  job_record RECORD;
  function_url TEXT;
BEGIN
  -- Find jobs that haven't reported success within their expected interval.
  -- A NULL last_success means the job has never succeeded, which also counts.
  FOR job_record IN
    SELECT job_name, last_run, last_success, retry_count
    FROM cron_health
    WHERE (
      -- Jobs that run every 15 minutes but haven't succeeded in 20 minutes
      (job_name LIKE '%leads%' AND (last_success IS NULL OR last_success < NOW() - INTERVAL '20 minutes')) OR
      -- Jobs that run hourly but haven't succeeded in 75 minutes
      (job_name LIKE '%sync%' AND (last_success IS NULL OR last_success < NOW() - INTERVAL '75 minutes'))
    )
    AND retry_count < 3
  LOOP
    -- Build the Edge Function URL
    function_url := 'https://your-project.supabase.co/functions/v1/' || job_record.job_name;

    -- Trigger a retry via pg_net. Headers and body must be jsonb.
    -- app.service_role_key must be configured once, e.g.:
    --   ALTER DATABASE postgres SET app.service_role_key = '<your key>';
    PERFORM net.http_post(
      url := function_url,
      headers := jsonb_build_object(
        'Authorization', 'Bearer ' || current_setting('app.service_role_key'),
        'Content-Type', 'application/json'
      ),
      body := '{}'::jsonb
    );

    -- Record the retry attempt on the job's own row
    UPDATE cron_health
    SET last_run = NOW(),
        status = 'retry_triggered',
        retry_count = retry_count + 1
    WHERE job_name = job_record.job_name;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
Scheduling everything
Now wire it all together with pg_cron:
-- Schedule your business logic (dollar-quoting avoids escaping nested quotes)
SELECT cron.schedule('process-leads', '*/15 * * * *', $$
  SELECT net.http_post(
    url := 'https://your-project.supabase.co/functions/v1/process-leads',
    headers := jsonb_build_object(
      'Authorization', 'Bearer ' || current_setting('app.service_role_key'),
      'Content-Type', 'application/json'
    ),
    body := '{}'::jsonb
  )
$$);
-- Schedule the health monitor
SELECT cron.schedule('health-check', '*/5 * * * *', 'SELECT check_cron_health()');
-- Clean up old health records weekly
SELECT cron.schedule('cleanup-health', '0 2 * * 0',
'DELETE FROM cron_health WHERE created_at < NOW() - INTERVAL ''30 days''');
Monitoring dashboard
Query this to see your system health:
-- Current status of all jobs
SELECT
job_name,
status,
last_success,
EXTRACT(EPOCH FROM (NOW() - last_success))/60 as minutes_since_success,
retry_count,
error_message
FROM cron_health
WHERE job_name NOT LIKE '%retry%'
ORDER BY last_run DESC;
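The monitor retries on its own, but the "alerts me" half still needs a decision rule: which rows warrant notifying a human? A sketch in TypeScript (the names are mine; the threshold mirrors the `retry_count < 3` cutoff in `check_cron_health`):

```typescript
// One row from the dashboard query above.
interface HealthRow {
  job_name: string;
  status: string;
  retry_count: number;
  error_message: string | null;
}

// Alert only once automatic recovery is exhausted: the job failed and
// has already been retried maxRetries times without a success.
export function findAlerts(rows: HealthRow[], maxRetries = 3): HealthRow[] {
  return rows.filter((r) => r.status === "failed" && r.retry_count >= maxRetries);
}
```

Feed the result to whatever notifier you already use (a webhook `fetch`, email, etc.) from a scheduled Edge Function, so alerting stays inside Supabase like everything else.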
Real results
Since implementing this system 3 months ago:
- Zero silent failures
- 4 automatic recoveries from network timeouts
- 99.8% job success rate
- 2 minutes average recovery time
You get infrastructure that fixes itself. Your cron jobs report their health. Failed jobs retry automatically. You sleep better knowing your systems won't fail silently.
Build systems that work without you watching them.