Vigilmon

Posted on Jun 30

Monitoring Temporal.io Workflows with Vigilmon

#temporal #workflows #monitoring #devops

Temporal is a workflow orchestration platform that handles the hard parts of distributed systems: retries, timeouts, state persistence, and failure recovery. When it's working, your workflows run reliably across days, weeks, or months without you thinking about them.

When Temporal goes down — or when a scheduled workflow silently stops firing — business processes halt. Payments aren't processed. Reports aren't generated. Data isn't synced. And because Temporal is designed to feel invisible, failures in it are often invisible too.

This guide shows you how to monitor the Temporal server health endpoints and use Vigilmon heartbeat monitors to catch silent workflow failures before they become business incidents.

The two failure modes in Temporal deployments

Server-level failures: The Temporal frontend, history, matching, or worker services crash or become unresponsive. New workflows can't be started, existing ones can't advance.

Silent workflow failures: The Temporal server is healthy but a specific workflow type has stopped being scheduled — perhaps because the worker process that handles it crashed, the cron schedule was misconfigured, or a code deployment broke the workflow registration. The server is "green" but critical business logic isn't running.

Standard HTTP uptime monitors only catch the first type. You need heartbeat monitoring to catch the second.

Step 1: Understand Temporal's health endpoints

Temporal exposes health information through two interfaces: an HTTP gateway and gRPC.

HTTP health endpoint (Temporal Web UI + gateway)

If you're running Temporal with the bundled web UI or behind an HTTP gateway, you typically have an HTTP endpoint you can probe. The exact path depends on how your deployment is set up:

# Check the Temporal frontend via HTTP gateway
curl https://your-temporal.com/health

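# Or via the Prometheus-compatible metrics endpoint. Metrics are served on the
# listen address set in the server's Prometheus config, not on gRPC port 7233.
curl http://your-temporal.com:METRICS_PORT/metrics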

For self-hosted Temporal, the most accessible HTTP health check is through the Temporal Web UI:

curl https://your-temporal-ui.com
# Returns HTTP 200 if the UI is serving

gRPC health check (the authoritative check)

Temporal's services expose standard gRPC health checks. Using grpc-health-probe:

# Install grpc-health-probe
go install github.com/grpc-ecosystem/grpc-health-probe@latest

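# Check the frontend service (default port 7233)
grpc-health-probe -addr=your-temporal.com:7233 -service=temporal.api.workflowservice.v1.WorkflowService

# Check the history service (default port 7234). It registers its own service
# name; confirm the name against the docs for your server version.
grpc-health-probe -addr=your-temporal.com:7234 -service=temporal.api.workflowservice.v1.HistoryService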

Healthy output:

status: SERVING
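
For scripted checks, the probe's exit code is easier to act on than its text output. Here is a minimal Python sketch that wraps the command above and turns the exit code into a readable label. The exit-code meanings are taken from the grpc-health-probe README as I understand it, so treat them as an assumption and verify them against the version you install. The names `describe_probe_exit` and `probe` are illustrative, not part of any tool.

```python
import subprocess

# Exit codes as described in the grpc-health-probe README
# (assumption: verify against the version you install).
PROBE_EXIT_CODES = {
    0: "serving",
    1: "invalid arguments",
    2: "connection failed or timed out",
    3: "rpc failed or timed out",
    4: "rpc succeeded but service is not serving",
}


def describe_probe_exit(code: int) -> str:
    """Translate a grpc-health-probe exit code into a readable label."""
    return PROBE_EXIT_CODES.get(code, f"unknown exit code {code}")


def probe(addr: str, service: str = "", timeout: float = 10.0) -> str:
    """Run grpc-health-probe against addr and describe the result."""
    cmd = ["grpc-health-probe", f"-addr={addr}"]
    if service:
        cmd.append(f"-service={service}")
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except FileNotFoundError:
        return "grpc-health-probe not found on PATH"
    except subprocess.TimeoutExpired:
        return "probe timed out"
    return describe_probe_exit(done.returncode)
```

Calling `probe("your-temporal.com:7233")` from a cron job or deploy script gives you a single string to log or compare, which is handy before the HTTP wrapper is in place.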

HTTP wrapper for external monitoring

Since most external monitoring tools (including Vigilmon) probe HTTP, wrap the gRPC check in a lightweight HTTP health service on your server:

// cmd/temporal-healthcheck/main.go
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func temporalHealthHandler(w http.ResponseWriter, r *http.Request) {
    // Derive from the request context so the check is cancelled if the caller disconnects.
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, "localhost:7233",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithBlock(),
    )
    if err != nil {
        http.Error(w, fmt.Sprintf("cannot connect: %v", err), http.StatusServiceUnavailable)
        return
    }
    defer conn.Close()

    client := healthpb.NewHealthClient(conn)
    resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{
        Service: "temporal.api.workflowservice.v1.WorkflowService",
    })
    if err != nil || resp.Status != healthpb.HealthCheckResponse_SERVING {
        http.Error(w, "temporal not serving", http.StatusServiceUnavailable)
        return
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, `{"status":"ok","temporal":"serving"}`)
}

func main() {
    http.HandleFunc("/health", temporalHealthHandler)
    log.Println("health proxy listening on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Run this sidecar on your Temporal host (pick a different port if the Temporal Web UI already occupies 8080 there):

go run cmd/temporal-healthcheck/main.go

Now Vigilmon can probe http://your-temporal-host:8080/health and get a meaningful HTTP response that reflects actual Temporal service state.

Step 2: Set up Vigilmon server monitoring

With your health endpoint live, configure Vigilmon:

Sign up at vigilmon.online — free tier, no credit card
New Monitor → HTTP
URL: http://your-temporal-host:8080/health
Keyword: "status":"ok"
Set check interval to 5 minutes
Save

Also add a monitor for the Temporal Web UI if you use it:

URL: https://your-temporal-ui.com
Expected status: 200
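
Before relying on the monitor, it can help to run the same kind of check yourself from a machine outside the Temporal host. The sketch below is a local pre-flight check, not a description of how Vigilmon probes internally: it requests a URL and reports healthy only when the response is 200 and the body contains the keyword. The function name `check_health` is illustrative.

```python
import urllib.error
import urllib.request


def check_health(url: str, keyword: str, timeout: float = 5.0) -> bool:
    """Return True when url answers 200 and the body contains keyword."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and keyword in body
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and non-2xx responses (HTTPError).
        return False
```

For the proxy from Step 1 you would call `check_health("http://your-temporal-host:8080/health", '"status":"ok"')`. If that returns False from outside your network, check firewall rules before blaming the monitor.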

Step 3: Heartbeat monitoring for scheduled workflows

This is the most important monitoring pattern for Temporal. Server health checks only tell you the engine is running. Heartbeat monitoring tells you that your specific workflows are actually executing.

The pattern: at the end of each successful workflow execution (or the last activity in the workflow), ping a unique Vigilmon heartbeat URL. If the ping stops arriving within the expected interval, Vigilmon alerts you.
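
The decision on the monitoring side is simple to reason about. As a rough sketch (my own illustration of the general dead man's switch idea, not Vigilmon's actual implementation), a heartbeat counts as missed when the time since the last ping exceeds the expected interval. Treating a monitor that has never been pinged as missed is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_missed(
    last_ping: Optional[datetime],
    expected_interval: timedelta,
    now: datetime,
) -> bool:
    """True when no ping has arrived within expected_interval before now."""
    if last_ping is None:
        # Sketch assumption: a monitor that was never pinged counts as missed.
        return True
    return now - last_ping > expected_interval
```

With an expected interval of 25 hours, a last ping 24 hours ago is fine and a last ping 27 hours ago is a miss. Nothing in this check depends on the Temporal server being reachable, which is why it catches failures that a server health check cannot.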

Temporal Go SDK example

// internal/workflows/nightly_report.go
package workflows

import (
    "context"
    "net/http"
    "os"
    "time"

    "go.temporal.io/sdk/activity"
    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

func NightlyReportWorkflow(ctx workflow.Context) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Minute,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    // Execute report generation
    if err := workflow.ExecuteActivity(ctx, GenerateReport).Get(ctx, nil); err != nil {
        return err
    }

    // Ping heartbeat on successful completion
    if err := workflow.ExecuteActivity(ctx, PingHeartbeat).Get(ctx, nil); err != nil {
        workflow.GetLogger(ctx).Warn("heartbeat ping failed — non-fatal", "error", err)
        // Don't fail the workflow over a monitoring ping failure
    }

    return nil
}

func PingHeartbeat(ctx context.Context) error {
    logger := activity.GetLogger(ctx)
    heartbeatURL := os.Getenv("REPORT_HEARTBEAT_URL")
    if heartbeatURL == "" {
        return nil
    }

    client := &http.Client{Timeout: 10 * time.Second}
    resp, err := client.Get(heartbeatURL)
    if err != nil {
        logger.Warn("heartbeat ping error", "error", err)
        return nil // Don't surface this as an activity failure
    }
    defer resp.Body.Close()
    logger.Info("heartbeat pinged", "status", resp.StatusCode)
    return nil
}

Temporal Python SDK example

# workflows/sync_workflow.py
import os
from datetime import timedelta

import aiohttp
from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def ping_heartbeat() -> None:
    """Ping Vigilmon heartbeat — call after successful workflow completion."""
    url = os.environ.get("SYNC_HEARTBEAT_URL")
    if not url:
        return
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                activity.logger.info(f"Heartbeat pinged: {resp.status}")
    except Exception as e:
        activity.logger.warning(f"Heartbeat ping failed (non-fatal): {e}")


@activity.defn
async def run_data_sync() -> dict:
    """Actual sync logic."""
    # ... your sync implementation ...
    return {"synced": 0}


@workflow.defn
class DataSyncWorkflow:
    @workflow.run
    async def run(self) -> None:
        result = await workflow.execute_activity(
            run_data_sync,
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        workflow.logger.info(f"Sync complete: {result}")

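        # Ping only after a successful sync. A failed or timed-out ping is
        # logged but must not fail the workflow.
        try:
            await workflow.execute_activity(
                ping_heartbeat,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=1),
            )
        except Exception as err:  # e.g. ActivityError when the ping activity times out
            workflow.logger.warning(f"Heartbeat ping failed (non-fatal): {err}")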
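
Both examples skip the ping when the environment variable is unset. That keeps local runs quiet, but it also means a typo in your deployment config only shows up later as a missed-heartbeat alert. One option is a small startup check in the worker, sketched here with a hypothetical helper name (`missing_heartbeat_vars`); the variable names are the ones used in this guide.

```python
import os
from typing import Iterable, List, Mapping


def missing_heartbeat_vars(
    required: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names in required that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]
```

At worker startup you could call `missing_heartbeat_vars(["REPORT_HEARTBEAT_URL", "SYNC_HEARTBEAT_URL"])` and log a warning for each name returned, so a missing URL is visible on deploy rather than a day later.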

Set up the Vigilmon heartbeat monitor

In Vigilmon:

New Monitor → Heartbeat
Name: "Nightly Report Workflow"
Expected interval: 25 hours (for a nightly cron — buffer beyond 24h)
Copy the unique URL
Set in your worker environment:

# .env or environment config
REPORT_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-token-1
SYNC_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-token-2

One heartbeat per critical workflow type. When you get an alert, you know exactly which workflow stopped — not just that "something is wrong."
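
Choosing the expected interval is mostly arithmetic. Because the ping fires at the end of a run, the longest gap between two pings is roughly the schedule period plus the slowest plausible run. The helper below is a sketch of that reasoning; the function name and the 15-minute default grace margin are assumptions for illustration, not Vigilmon recommendations.

```python
from datetime import timedelta


def expected_interval(
    period: timedelta,
    worst_case_runtime: timedelta,
    grace: timedelta = timedelta(minutes=15),
) -> timedelta:
    """Schedule period plus the longest plausible run plus a grace margin."""
    return period + worst_case_runtime + grace
```

The Go example above allows three attempts at up to 30 minutes each, so a worst-case run is about 90 minutes and `expected_interval(timedelta(hours=24), timedelta(minutes=90))` comes to 25 hours 45 minutes. If your reports normally finish in a few minutes, 25 hours is a reasonable starting point; widen it if slow-but-successful runs start triggering alerts.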

Step 4: Monitor the Temporal worker processes

Your Temporal workflows won't run if the worker processes aren't polling for tasks. Workers are separate processes from the Temporal server, and they can die without the server reporting anything wrong.

Expose a liveness endpoint from your worker process:

// worker/main.go
package main

import (
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"

    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
)

func main() {
    c, err := client.Dial(client.Options{
        HostPort: os.Getenv("TEMPORAL_HOST"),
    })
    if err != nil {
        log.Fatalf("client dial failed: %v", err)
    }
    defer c.Close()

    w := worker.New(c, "main-task-queue", worker.Options{})
    // Register workflows and activities
    // w.RegisterWorkflow(...)
    // w.RegisterActivity(...)

    // Expose a simple health endpoint for the worker process
    go func() {
        http.HandleFunc("/alive", func(rw http.ResponseWriter, r *http.Request) {
            rw.WriteHeader(http.StatusOK)
            rw.Write([]byte(`{"alive":true}`))
        })
        log.Println("worker liveness on :8081")
        if err := http.ListenAndServe(":8081", nil); err != nil {
            log.Printf("worker liveness server stopped: %v", err)
        }
    }()

    // Block until interrupt
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, os.Interrupt, syscall.SIGTERM)

    if err := w.Start(); err != nil {
        log.Fatalf("worker start failed: %v", err)
    }
    <-quit
    w.Stop()
}

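Add a Vigilmon monitor on http://your-worker-host:8081/alive. If the worker process crashes, you'll know within one check interval rather than at the next missed workflow execution. Keep in mind that this endpoint only proves the process is running; the per-workflow heartbeats from Step 3 remain the check that work is actually completing.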
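
If your workers use the Python SDK, the same idea works with the standard library alone. This is a minimal sketch, assuming a background thread is acceptable in your worker process; `AliveHandler` and `start_liveness_server` are illustrative names, and the port matches the Go example.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AliveHandler(BaseHTTPRequestHandler):
    """Answers GET /alive with a small JSON body; everything else is 404."""

    def do_GET(self) -> None:
        if self.path != "/alive":
            self.send_error(404)
            return
        body = b'{"alive":true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe traffic out of the worker's logs


def start_liveness_server(port: int = 8081, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Serve /alive on a daemon thread and return the server for later shutdown."""
    server = ThreadingHTTPServer((host, port), AliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call `start_liveness_server()` just before you start the Temporal worker. Because the thread is a daemon, it exits with the process, so a crashed worker stops answering the probe.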

Step 5: Configure alerts

Slack setup:

Create an incoming webhook in Slack
Vigilmon: Notifications → New Channel → Slack
Enable on all Temporal monitors
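
It's worth confirming the webhook works before you depend on it for alerts. Slack incoming webhooks accept a JSON body with a `text` field, so a quick manual test can be sketched as below. The function names and message wording are my own; this sends a test message directly to Slack and does not go through Vigilmon.

```python
import json
import urllib.request


def build_slack_payload(monitor: str, detail: str) -> bytes:
    """Encode a minimal incoming-webhook message as JSON bytes."""
    return json.dumps({"text": f"Test alert for {monitor}: {detail}"}).encode("utf-8")


def send_test_alert(webhook_url: str, monitor: str, detail: str, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(monitor, detail),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Run `send_test_alert(your_webhook_url, "Nightly Report Workflow", "manual check")` once and confirm the message lands in the channel you expect to be watching at 3 a.m.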

When a workflow heartbeat is missed:

🔴 MISSED: Nightly Report Workflow heartbeat
Expected every: 25 hours
Last ping: 27 hours ago
Action needed: check Temporal worker logs

When the Temporal server goes down:

🔴 DOWN: temporal-health.your-domain.com/health
Status: 503 Service Unavailable
Regions: US-East, EU-West

What you've built

What	How
Server health monitoring	HTTP proxy over gRPC health check
Per-workflow monitoring	Vigilmon heartbeat — one per critical workflow type
Worker liveness	HTTP endpoint on worker process
Immediate alerts	Slack notifications on server down or missed heartbeat
Failure isolation	Named heartbeats identify exactly which workflow stopped

The server health check catches Temporal infrastructure failures. The per-workflow heartbeats catch silent execution failures that infrastructure monitors will never see. Together, they cover both the infrastructure side and the execution side of your Temporal deployment.

Get started free at vigilmon.online — your first monitor is running in under a minute.
