Monitoring Temporal.io Workflows with Vigilmon
Temporal is a workflow orchestration platform that handles the hard parts of distributed systems: retries, timeouts, state persistence, and failure recovery. When it's working, your workflows run reliably across days, weeks, or months without you thinking about them.
When Temporal goes down — or when a scheduled workflow silently stops firing — business processes halt. Payments aren't processed. Reports aren't generated. Data isn't synced. And because Temporal is designed to feel invisible, failures in it are often invisible too.
This guide shows you how to monitor the Temporal server health endpoints and use Vigilmon heartbeat monitors to catch silent workflow failures before they become business incidents.
The two failure modes in Temporal deployments
Server-level failures: The Temporal frontend, history, matching, or worker services crash or become unresponsive. New workflows can't be started, existing ones can't advance.
Silent workflow failures: The Temporal server is healthy but a specific workflow type has stopped being scheduled — perhaps because the worker process that handles it crashed, the cron schedule was misconfigured, or a code deployment broke the workflow registration. The server is "green" but critical business logic isn't running.
Standard HTTP uptime monitors only catch the first type. You need heartbeat monitoring to catch the second.
Step 1: Understand Temporal's health endpoints
Temporal exposes health information through two interfaces: an HTTP gateway and gRPC.
HTTP health endpoint (Temporal Web UI + gateway)
If you're running Temporal with the bundled web UI or a Temporal Cloud gateway, you typically have an HTTP endpoint available:
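# Check the Temporal frontend via HTTP gateway
curl https://your-temporal.com/health
# Or via the Prometheus-compatible metrics endpoint. Use the port set as the
# Prometheus listenAddress in your server config (9090 here is only an example);
# 7233 is the frontend's gRPC port and does not serve plain HTTP.
curl http://your-temporal.com:9090/metrics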
For self-hosted Temporal, the most accessible HTTP health check is through the Temporal Web UI:
curl https://your-temporal-ui.com
# Returns HTTP 200 if the UI is serving
gRPC health check (the authoritative check)
Temporal's services expose standard gRPC health checks. Using grpc-health-probe:
# Install grpc-health-probe
go install github.com/grpc-ecosystem/grpc-health-probe@latest
# Check the frontend service (default port 7233)
grpc-health-probe -addr=your-temporal.com:7233 -service=temporal.api.workflowservice.v1.WorkflowService
# Check the history service (default port 7234). WorkflowService is the frontend's
# service name, so omit -service here to query the server's overall status.
grpc-health-probe -addr=your-temporal.com:7234
Healthy output:
status: SERVING
HTTP wrapper for external monitoring
Since most external monitoring tools (including Vigilmon) probe HTTP, wrap the gRPC check in a lightweight HTTP health service on your server:
// cmd/temporal-healthcheck/main.go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func temporalHealthHandler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "localhost:7233",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock(),
	)
	if err != nil {
		http.Error(w, fmt.Sprintf("cannot connect: %v", err), http.StatusServiceUnavailable)
		return
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)
	resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{
		Service: "temporal.api.workflowservice.v1.WorkflowService",
	})
	if err != nil || resp.Status != healthpb.HealthCheckResponse_SERVING {
		http.Error(w, "temporal not serving", http.StatusServiceUnavailable)
		return
	}

	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, `{"status":"ok","temporal":"serving"}`)
}

func main() {
	http.HandleFunc("/health", temporalHealthHandler)
	log.Println("health proxy listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Run this sidecar on your Temporal host:
go run cmd/temporal-healthcheck/main.go
Now Vigilmon can probe http://your-temporal-host:8080/health and get a meaningful HTTP response that reflects actual Temporal service state.
Step 2: Set up Vigilmon server monitoring
With your health endpoint live, configure Vigilmon:
- Sign up at vigilmon.online — free tier, no credit card
- New Monitor → HTTP
- URL: http://your-temporal-host:8080/health
- Keyword: "status":"ok"
- Set check interval to 5 minutes
- Save
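To make the keyword setting concrete, here is a minimal sketch of the kind of check a status-plus-keyword monitor performs. How Vigilmon evaluates responses internally is not described here, so treat `evaluate_http_check` and its parameters as hypothetical names used only for illustration.

```python
def evaluate_http_check(status_code: int, body: str, keyword: str,
                        expected_status: int = 200) -> bool:
    """Return True only when the status matches and the keyword appears in the body."""
    return status_code == expected_status and keyword in body


# Body returned by the health proxy when Temporal is serving.
healthy_body = '{"status":"ok","temporal":"serving"}'

print(evaluate_http_check(200, healthy_body, '"status":"ok"'))
print(evaluate_http_check(503, "temporal not serving", '"status":"ok"'))
```

Matching on the keyword as well as the status code guards against a proxy or load balancer answering 200 with an unrelated page.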
Also add a monitor for the Temporal Web UI if you use it:
- URL: https://your-temporal-ui.com
- Expected status: 200
Step 3: Heartbeat monitoring for scheduled workflows
This is the most important monitoring pattern for Temporal. Server health checks only tell you the engine is running. Heartbeat monitoring tells you that your specific workflows are actually executing.
The pattern: at the end of each successful workflow execution (or the last activity in the workflow), ping a unique Vigilmon heartbeat URL. If the ping stops arriving within the expected interval, Vigilmon alerts you.
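The heartbeat pattern is a dead man's switch: the monitor alerts on the absence of a signal. A minimal sketch of that decision, assuming the monitor stores the time of the last ping (the function name `heartbeat_missed` is hypothetical and not part of any Vigilmon API):

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(last_ping: datetime, expected_interval: timedelta,
                     now: datetime) -> bool:
    """True when no ping has arrived within the expected interval."""
    return now - last_ping > expected_interval


now = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
interval = timedelta(hours=25)

print(heartbeat_missed(now - timedelta(hours=24), interval, now))  # on schedule
print(heartbeat_missed(now - timedelta(hours=27), interval, now))  # overdue
```

Because the check runs on the monitoring side, it still fires when the worker, the schedule, or the whole host is gone and nothing is left to report an error.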
Temporal Go SDK example
// internal/workflows/nightly_report.go
package workflows

import (
	"context"
	"net/http"
	"os"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func NightlyReportWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			MaximumAttempts: 3,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// Execute report generation
	if err := workflow.ExecuteActivity(ctx, GenerateReport).Get(ctx, nil); err != nil {
		return err
	}

	// Ping heartbeat on successful completion
	if err := workflow.ExecuteActivity(ctx, PingHeartbeat).Get(ctx, nil); err != nil {
		workflow.GetLogger(ctx).Warn("heartbeat ping failed (non-fatal)", "error", err)
		// Don't fail the workflow over a monitoring ping failure
	}
	return nil
}

// PingHeartbeat notifies Vigilmon that the workflow finished. It always
// returns nil so that a monitoring outage never fails or retries the activity.
func PingHeartbeat(ctx context.Context) error {
	logger := activity.GetLogger(ctx)
	heartbeatURL := os.Getenv("REPORT_HEARTBEAT_URL")
	if heartbeatURL == "" {
		return nil // heartbeat not configured in this environment
	}

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(heartbeatURL)
	if err != nil {
		logger.Warn("heartbeat ping error", "error", err)
		return nil // Don't surface this as an activity failure
	}
	defer resp.Body.Close()

	logger.Info("heartbeat pinged", "status", resp.StatusCode)
	return nil
}
Temporal Python SDK example
# workflows/sync_workflow.py
import os
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

# This file also defines a workflow, so pass aiohttp through the workflow
# sandbox instead of letting the sandbox re-import it.
with workflow.unsafe.imports_passed_through():
    import aiohttp


@activity.defn
async def ping_heartbeat() -> None:
    """Ping the Vigilmon heartbeat; call after successful workflow completion."""
    url = os.environ.get("SYNC_HEARTBEAT_URL")
    if not url:
        return
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                activity.logger.info(f"Heartbeat pinged: {resp.status}")
    except Exception as e:
        activity.logger.warning(f"Heartbeat ping failed (non-fatal): {e}")


@activity.defn
async def run_data_sync() -> dict:
    """Actual sync logic."""
    # ... your sync implementation ...
    return {"synced": 0}


@workflow.defn
class DataSyncWorkflow:
    @workflow.run
    async def run(self) -> None:
        result = await workflow.execute_activity(
            run_data_sync,
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        workflow.logger.info(f"Sync complete: {result}")

        # Ping only after a successful sync. ping_heartbeat swallows its own
        # errors, so a failed ping never fails the workflow.
        await workflow.execute_activity(
            ping_heartbeat,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=1),
        )
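Both ping activities above log the response status but do not distinguish a 2xx from an error response. If you want that distinction without adding a dependency, a standard-library sketch could look like the following. The names `ping_heartbeat_safe` and `fetch` are hypothetical; the injectable `fetch` argument exists only so the logic can be exercised without network access.

```python
import urllib.request
from typing import Callable


def _http_get_status(url: str) -> int:
    """Default fetcher: GET the URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status


def ping_heartbeat_safe(url: str,
                        fetch: Callable[[str], int] = _http_get_status) -> bool:
    """Ping a heartbeat URL; return True on a 2xx response and never raise."""
    if not url:
        return False  # heartbeat not configured
    try:
        return 200 <= fetch(url) < 300
    except Exception:
        # Network errors, timeouts, and HTTP error responses are monitoring
        # noise here, not workflow failures.
        return False
```

Returning a boolean lets the caller log a warning on failure while keeping the same rule as the examples above: a monitoring problem must not fail the business workflow.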
Set up the Vigilmon heartbeat monitor
In Vigilmon:
- New Monitor → Heartbeat
- Name: "Nightly Report Workflow"
- Expected interval: 25 hours (for a nightly cron — buffer beyond 24h)
- Copy the unique URL
- Set in your worker environment:
# .env or environment config
REPORT_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-token-1
SYNC_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-token-2
One heartbeat per critical workflow type. When you get an alert, you know exactly which workflow stopped — not just that "something is wrong."
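One way to keep the one-heartbeat-per-workflow rule manageable is a single lookup from workflow type to the environment variable that holds its URL. This is only a sketch: the mapping and the helper `heartbeat_url_for` are hypothetical, and the workflow and variable names are taken from the examples above.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping: workflow type -> env var holding its heartbeat URL.
HEARTBEAT_ENV_VARS = {
    "NightlyReportWorkflow": "REPORT_HEARTBEAT_URL",
    "DataSyncWorkflow": "SYNC_HEARTBEAT_URL",
}


def heartbeat_url_for(workflow_type: str,
                      env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the heartbeat URL for a workflow type, or None if not configured."""
    var_name = HEARTBEAT_ENV_VARS.get(workflow_type)
    if var_name is None:
        return None
    # Treat an empty value the same as a missing one.
    return env.get(var_name) or None
```

Keeping the mapping in one place makes it easy to spot a critical workflow type that has no heartbeat configured.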
Step 4: Monitor the Temporal worker processes
Your Temporal workflows won't run if the worker processes aren't polling for tasks. Workers are separate processes from the Temporal server, and they can die without the server reporting any problem.
Expose a liveness endpoint from your worker process:
// worker/main.go
package main

import (
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{
		HostPort: os.Getenv("TEMPORAL_HOST"),
	})
	if err != nil {
		log.Fatalf("client dial failed: %v", err)
	}
	defer c.Close()

	w := worker.New(c, "main-task-queue", worker.Options{})

	// Register workflows and activities
	// w.RegisterWorkflow(...)
	// w.RegisterActivity(...)

	// Expose a simple health endpoint for the worker process
	go func() {
		http.HandleFunc("/alive", func(rw http.ResponseWriter, r *http.Request) {
			rw.WriteHeader(http.StatusOK)
			rw.Write([]byte(`{"alive":true}`))
		})
		log.Println("worker liveness on :8081")
		// Log instead of ignoring the error (for example, port already in use)
		if err := http.ListenAndServe(":8081", nil); err != nil {
			log.Printf("liveness server stopped: %v", err)
		}
	}()

	// Start polling, then block until interrupt
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, os.Interrupt, syscall.SIGTERM)
	if err := w.Start(); err != nil {
		log.Fatalf("worker start failed: %v", err)
	}
	<-quit
	w.Stop()
}
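Add a Vigilmon monitor on http://your-worker-host:8081/alive. If the worker process crashes, you'll know in minutes rather than at the next missed workflow execution. This endpoint only proves the process is up, not that it is polling and completing tasks; the per-workflow heartbeats from Step 3 cover that gap.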
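If your workers use the Python SDK, the same liveness endpoint can be sketched with the standard library alone. The names `AliveHandler` and `start_liveness_server` are hypothetical; the idea is to call the function once at worker startup, before the worker begins polling.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AliveHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/alive":
            self.send_error(404)
            return
        body = json.dumps({"alive": True}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe traffic out of the worker logs


def start_liveness_server(port: int = 8081) -> ThreadingHTTPServer:
    """Serve /alive on a daemon thread; return the server so it can be shut down."""
    server = ThreadingHTTPServer(("0.0.0.0", port), AliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The daemon thread exits with the worker process, so a crashed worker stops answering the probe without any extra cleanup code.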
Step 5: Configure alerts
Slack setup:
- Create an incoming webhook in Slack
- Vigilmon: Notifications → New Channel → Slack
- Enable on all Temporal monitors
When a workflow heartbeat is missed:
🔴 MISSED: Nightly Report Workflow heartbeat
Expected every: 25 hours
Last ping: 27 hours ago
Action needed: check Temporal worker logs
When the Temporal server goes down:
🔴 DOWN: temporal-health.your-domain.com/health
Status: 503 Service Unavailable
Regions: US-East, EU-West
What you've built
| What | How |
|---|---|
| Server health monitoring | HTTP proxy over gRPC health check |
| Per-workflow monitoring | Vigilmon heartbeat — one per critical workflow type |
| Worker liveness | HTTP endpoint on worker process |
| Immediate alerts | Slack notifications on server down or missed heartbeat |
| Failure isolation | Named heartbeats identify exactly which workflow stopped |
The server health check catches Temporal infrastructure failures. The per-workflow heartbeats catch silent execution failures that infrastructure monitors will never see. Together, they give you full observability over your Temporal deployment.
Get started free at vigilmon.online — your first monitor is running in under a minute.