DEV Community

Vigilmon


Monitoring Temporal.io Workflows with Vigilmon


Temporal is a workflow orchestration platform that handles the hard parts of distributed systems: retries, timeouts, state persistence, and failure recovery. When it's working, your workflows run reliably across days, weeks, or months without you thinking about them.

When Temporal goes down — or when a scheduled workflow silently stops firing — business processes halt. Payments aren't processed. Reports aren't generated. Data isn't synced. And because Temporal is designed to feel invisible, failures in it are often invisible too.

This guide shows you how to monitor the Temporal server health endpoints and use Vigilmon heartbeat monitors to catch silent workflow failures before they become business incidents.


The two failure modes in Temporal deployments

Server-level failures: The Temporal frontend, history, matching, or worker services crash or become unresponsive. New workflows can't be started, existing ones can't advance.

Silent workflow failures: The Temporal server is healthy but a specific workflow type has stopped being scheduled — perhaps because the worker process that handles it crashed, the cron schedule was misconfigured, or a code deployment broke the workflow registration. The server is "green" but critical business logic isn't running.

Standard HTTP uptime monitors only catch the first type. You need heartbeat monitoring to catch the second.
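The difference is easiest to see in the logic a heartbeat monitor applies. The sketch below is a minimal illustration of that dead-man's-switch rule, not Vigilmon's actual implementation; the function name and the sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(last_ping: datetime, expected_interval: timedelta, now: datetime) -> bool:
    """Return True when no ping has arrived within the expected interval."""
    return now - last_ping > expected_interval


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
interval = timedelta(hours=25)

# Last ping 27 hours ago: the workflow has gone quiet, even if the server is healthy.
print(heartbeat_missed(now - timedelta(hours=27), interval, now))  # True
# Last ping 23 hours ago: still inside the window.
print(heartbeat_missed(now - timedelta(hours=23), interval, now))  # False
```

An uptime monitor asks "does the server answer right now?", while this check asks "when did the workflow last report success?", which is why it still fires when every endpoint is green.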


Step 1: Understand Temporal's health endpoints

Temporal's services speak gRPC natively, and many deployments also put an HTTP endpoint in front of them. Both can be used for health checks.

HTTP health endpoint (Temporal Web UI + gateway)

If you run Temporal behind an HTTP gateway or reverse proxy, you may already have an HTTP endpoint you can probe. The path depends on your setup, so treat /health below as a placeholder:

# Check the Temporal frontend via HTTP gateway
curl https://your-temporal.com/health

# Or via the metrics endpoint (Prometheus-compatible).
# Use the Prometheus listen port from your server config, not the gRPC port 7233.
curl http://your-temporal.com:<metrics-port>/metrics

For self-hosted Temporal, the most accessible HTTP health check is through the Temporal Web UI:

curl https://your-temporal-ui.com
# Returns HTTP 200 if the UI is serving

gRPC health check (the authoritative check)

Temporal's services expose standard gRPC health checks. Using grpc-health-probe:

# Install grpc-health-probe
go install github.com/grpc-ecosystem/grpc-health-probe@latest

# Check the frontend service (default port 7233)
grpc-health-probe -addr=your-temporal.com:7233

# Check the history service (default port 7234), which registers its own service name
grpc-health-probe -addr=your-temporal.com:7234 -service=temporal.api.workflowservice.v1.HistoryService

Healthy output:

status: SERVING

HTTP wrapper for external monitoring

Since most external monitoring tools (including Vigilmon) probe HTTP, wrap the gRPC check in a lightweight HTTP health service on your server:

// cmd/temporal-healthcheck/main.go
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func temporalHealthHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, "localhost:7233",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithBlock(),
    )
    if err != nil {
        http.Error(w, fmt.Sprintf("cannot connect: %v", err), http.StatusServiceUnavailable)
        return
    }
    defer conn.Close()

    client := healthpb.NewHealthClient(conn)
    resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{
        Service: "temporal.api.workflowservice.v1.WorkflowService",
    })
    if err != nil || resp.Status != healthpb.HealthCheckResponse_SERVING {
        http.Error(w, "temporal not serving", http.StatusServiceUnavailable)
        return
    }

    w.WriteHeader(http.StatusOK)
    fmt.Fprintln(w, `{"status":"ok","temporal":"serving"}`)
}

func main() {
    http.HandleFunc("/health", temporalHealthHandler)
    log.Println("health proxy listening on :8080")
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Run this sidecar on your Temporal host:

go run cmd/temporal-healthcheck/main.go

Now Vigilmon can probe http://your-temporal-host:8080/health and get a meaningful HTTP response that reflects actual Temporal service state.
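If you'd rather not build a Go binary, roughly the same wrapper can be sketched in Python by shelling out to grpc-health-probe. This assumes the probe binary is on the host's PATH and relies on it exiting with code 0 only when the service reports SERVING. The names probe_status, HealthHandler, and serve are illustrative, not part of Temporal or Vigilmon.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same address and service name as the Go wrapper above.
PROBE_CMD = [
    "grpc-health-probe",
    "-addr=localhost:7233",
    "-service=temporal.api.workflowservice.v1.WorkflowService",
]


def probe_status(run=subprocess.run) -> int:
    """Map the probe's exit code to an HTTP status: 0 means SERVING, anything else is unhealthy."""
    try:
        result = run(PROBE_CMD, capture_output=True, timeout=5)
    except (OSError, subprocess.TimeoutExpired):
        return 503  # probe binary missing, or the check hung
    return 200 if result.returncode == 0 else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status = probe_status()
        body = b'{"status":"ok","temporal":"serving"}' if status == 200 else b'{"status":"down"}'
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block forever, answering /health on the given port."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the proxy on port 8080, so the Vigilmon URL stays the same whichever wrapper you run.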


Step 2: Set up Vigilmon server monitoring

With your health endpoint live, configure Vigilmon:

  1. Sign up at vigilmon.online — free tier, no credit card
  2. New Monitor → HTTP
  3. URL: http://your-temporal-host:8080/health
  4. Keyword: "status":"ok"
  5. Set check interval to 5 minutes
  6. Save

Also add a monitor for the Temporal Web UI if you use it:

  • URL: https://your-temporal-ui.com
  • Expected status: 200
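Before pointing Vigilmon at the endpoint, it can help to run the same kind of check locally. The sketch below assumes a keyword monitor passes only when the response is a 200 and the body contains the keyword; that rule is an assumption about how such monitors typically behave, and both function names are made up for the example.

```python
import urllib.error
import urllib.request


def response_is_healthy(status: int, body: str, keyword: str = '"status":"ok"') -> bool:
    """A check passes only when the status is 200 and the keyword appears in the body."""
    return status == 200 and keyword in body


def check_endpoint(url: str, timeout: float = 5.0) -> bool:
    """Fetch the URL and apply the same status-plus-keyword rule."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return response_is_healthy(resp.status, resp.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or a non-2xx response
```

Running check_endpoint("http://your-temporal-host:8080/health") from a machine outside your network also confirms the port is reachable from where an external monitor would probe it.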

Step 3: Heartbeat monitoring for scheduled workflows

This is the most important monitoring pattern for Temporal. Server health checks only tell you the engine is running. Heartbeat monitoring tells you that your specific workflows are actually executing.

The pattern: at the end of each successful workflow execution (or the last activity in the workflow), ping a unique Vigilmon heartbeat URL. If the ping stops arriving within the expected interval, Vigilmon alerts you.

Temporal Go SDK example

// internal/workflows/nightly_report.go
package workflows

import (
    "context"
    "net/http"
    "os"
    "time"

    "go.temporal.io/sdk/activity"
    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

func NightlyReportWorkflow(ctx workflow.Context) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Minute,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    // Execute report generation
    if err := workflow.ExecuteActivity(ctx, GenerateReport).Get(ctx, nil); err != nil {
        return err
    }

    // Ping heartbeat on successful completion
    if err := workflow.ExecuteActivity(ctx, PingHeartbeat).Get(ctx, nil); err != nil {
        workflow.GetLogger(ctx).Warn("heartbeat ping failed — non-fatal", "error", err)
        // Don't fail the workflow over a monitoring ping failure
    }

    return nil
}

func PingHeartbeat(ctx context.Context) error {
    logger := activity.GetLogger(ctx)
    heartbeatURL := os.Getenv("REPORT_HEARTBEAT_URL")
    if heartbeatURL == "" {
        return nil
    }

    client := &http.Client{Timeout: 10 * time.Second}
    resp, err := client.Get(heartbeatURL)
    if err != nil {
        logger.Warn("heartbeat ping error", "error", err)
        return nil // Don't surface this as an activity failure
    }
    defer resp.Body.Close()
    logger.Info("heartbeat pinged", "status", resp.StatusCode)
    return nil
}

Temporal Python SDK example

# workflows/sync_workflow.py
import asyncio
import os
import aiohttp
from temporalio import activity, workflow
from temporalio.common import RetryPolicy
from datetime import timedelta


@activity.defn
async def ping_heartbeat() -> None:
    """Ping Vigilmon heartbeat — call after successful workflow completion."""
    url = os.environ.get("SYNC_HEARTBEAT_URL")
    if not url:
        return
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                activity.logger.info(f"Heartbeat pinged: {resp.status}")
    except Exception as e:
        activity.logger.warning(f"Heartbeat ping failed (non-fatal): {e}")


@activity.defn
async def run_data_sync() -> dict:
    """Actual sync logic."""
    # ... your sync implementation ...
    return {"synced": 0}


@workflow.defn
class DataSyncWorkflow:
    @workflow.run
    async def run(self) -> None:
        result = await workflow.execute_activity(
            run_data_sync,
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        workflow.logger.info(f"Sync complete: {result}")

        # Ping only after the sync succeeds; the activity swallows its own errors,
        # so a failed ping request does not raise here
        await workflow.execute_activity(
            ping_heartbeat,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=1),
        )

Set up the Vigilmon heartbeat monitor

In Vigilmon:

  1. New Monitor → Heartbeat
  2. Name: "Nightly Report Workflow"
  3. Expected interval: 25 hours (for a nightly cron — buffer beyond 24h)
  4. Copy the unique URL
  5. Set in your worker environment:
# .env or environment config
REPORT_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-token-1
SYNC_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-token-2

One heartbeat per critical workflow type. When you get an alert, you know exactly which workflow stopped — not just that "something is wrong."
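Because the ping fires at the end of a run, the gap between two pings grows whenever a run takes longer than the previous one. A rough way to size the expected interval is schedule period plus worst-case runtime plus a buffer. The sketch below uses the 30-minute timeout and 3 attempts from the Go example; the formula is a rule of thumb rather than a Vigilmon requirement, and it ignores retry backoff.

```python
from datetime import timedelta


def worst_case_runtime(start_to_close: timedelta, max_attempts: int) -> timedelta:
    """Upper bound on activity execution time, ignoring the backoff between retries."""
    return start_to_close * max_attempts


def heartbeat_interval(schedule_period: timedelta, runtime: timedelta, buffer: timedelta) -> timedelta:
    """Schedule period plus worst-case runtime plus a safety buffer."""
    return schedule_period + runtime + buffer


runtime = worst_case_runtime(timedelta(minutes=30), max_attempts=3)
interval = heartbeat_interval(timedelta(hours=24), runtime, buffer=timedelta(minutes=30))
print(runtime)   # 1:30:00
print(interval)  # 1 day, 2:00:00
```

With these numbers a 25-hour window could raise a false alarm on a night when the report needs all three attempts, so a slightly wider window may be the safer choice for retry-heavy workflows.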


Step 4: Monitor the Temporal worker processes

Your Temporal workflows won't run if the worker processes aren't polling for tasks. Workers are separate processes from the Temporal server, and when one crashes the server keeps reporting healthy.

Expose a liveness endpoint from your worker process:

// worker/main.go
package main

import (
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"

    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
)

func main() {
    c, err := client.Dial(client.Options{
        HostPort: os.Getenv("TEMPORAL_HOST"),
    })
    if err != nil {
        log.Fatalf("client dial failed: %v", err)
    }
    defer c.Close()

    w := worker.New(c, "main-task-queue", worker.Options{})
    // Register workflows and activities
    // w.RegisterWorkflow(...)
    // w.RegisterActivity(...)

    // Expose a simple health endpoint for the worker process
    go func() {
        http.HandleFunc("/alive", func(rw http.ResponseWriter, r *http.Request) {
            rw.WriteHeader(http.StatusOK)
            rw.Write([]byte(`{"alive":true}`))
        })
        log.Println("worker liveness on :8081")
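        // Log the error so a port conflict doesn't silently disable the endpoint
        log.Printf("liveness server stopped: %v", http.ListenAndServe(":8081", nil))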
    }()

    // Block until interrupt
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, os.Interrupt, syscall.SIGTERM)

    if err := w.Start(); err != nil {
        log.Fatalf("worker start failed: %v", err)
    }
    <-quit
    w.Stop()
}

Add a Vigilmon monitor on http://your-worker-host:8081/alive. If the worker process crashes, you'll know in minutes rather than at the next missed workflow execution.
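For workers written with the Python SDK, a comparable endpoint can be sketched with the standard library alone. The handler and function names are illustrative. Like the Go version, it only proves the process is up, not that the worker is actively polling.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AliveHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/alive":
            self.send_error(404)
            return
        body = b'{"alive":true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep monitor probes out of the worker logs


def start_liveness_server(port: int = 8081) -> HTTPServer:
    """Serve /alive from a daemon thread so it exits with the worker process."""
    server = HTTPServer(("0.0.0.0", port), AliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call start_liveness_server() just before awaiting the worker's run() so the endpoint disappears as soon as the worker process does.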


Step 5: Configure alerts

Slack setup:

  1. Create an incoming webhook in Slack
  2. Vigilmon: Notifications → New Channel → Slack
  3. Enable on all Temporal monitors

When a workflow heartbeat is missed:

🔴 MISSED: Nightly Report Workflow heartbeat
Expected every: 25 hours
Last ping: 27 hours ago
Action needed: check Temporal worker logs

When the Temporal server goes down:

🔴 DOWN: your-temporal-host:8080/health
Status: 503 Service Unavailable
Regions: US-East, EU-West
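Before relying on the channel, it can be worth confirming that the Slack webhook accepts messages at all. The sketch below posts a test message using the minimal incoming-webhook payload, a JSON object with a "text" field. The function names and message wording are made up for the example, and the webhook URL is whatever Slack generated for you.

```python
import json
import urllib.request


def build_slack_payload(monitor: str, detail: str) -> bytes:
    """Encode a minimal incoming-webhook message with only the "text" field set."""
    return json.dumps({"text": f"🔴 TEST: {monitor}\n{detail}"}).encode("utf-8")


def send_test_alert(webhook_url: str, monitor: str, detail: str) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=build_slack_payload(monitor, detail),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A returned status of 200 and a message in the channel mean the webhook side works, which narrows any later delivery problem down to the notification settings.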

What you've built

| What | How |
| --- | --- |
| Server health monitoring | HTTP proxy over gRPC health check |
| Per-workflow monitoring | Vigilmon heartbeat, one per critical workflow type |
| Worker liveness | HTTP endpoint on worker process |
| Immediate alerts | Slack notifications on server down or missed heartbeat |
| Failure isolation | Named heartbeats identify exactly which workflow stopped |

The server health check catches Temporal infrastructure failures. The per-workflow heartbeats catch silent execution failures that infrastructure monitors will never see. Together with the worker liveness check, they cover the server, the workers, and the workflows themselves.


Get started free at vigilmon.online — your first monitor is running in under a minute.
