DEV Community

Vigilmon

How to Monitor Kubernetes Services with Vigilmon (HTTP + TCP + Alerts)

Running services on Kubernetes gives you powerful orchestration — self-healing pods, rolling deployments, horizontal scaling. But Kubernetes also introduces a new class of failure modes that internal health probes and kubectl get pods simply cannot catch.

In this tutorial you'll set up comprehensive uptime monitoring for your Kubernetes services using Vigilmon — free tier, no credit card required.


Why Kubernetes services need external monitoring

Kubernetes has livenessProbe and readinessProbe built in. So why do you need external monitoring on top of that?

Because the internal probes only tell you what the node can see. External monitoring tells you what your users see. These are not the same thing:

  • LoadBalancer IP is unreachable — your pods are healthy, but the cloud provider's load balancer has a routing issue or the external IP never got provisioned
  • NodePort binding breaks — a node reboots and the NodePort service doesn't come back cleanly; pods show Running but traffic can't reach them
  • Ingress controller misconfiguration — an nginx-ingress or Traefik update changes routing rules; pods are fine, users get 404s
  • DNS resolution failure — your cluster-external DNS record stops resolving; Kubernetes knows nothing about this
  • Cross-namespace connectivity broken — a NetworkPolicy change silently drops traffic between namespaces

The pattern is the same: Kubernetes reports everything green while users see downtime. External monitoring from multiple geographic regions is the only reliable way to catch this class of failure.


What you'll need

  • A running Kubernetes cluster (any provider: EKS, GKE, AKS, k3s, etc.)
  • A service exposed via LoadBalancer, NodePort, or Ingress
  • A free Vigilmon account (sign up takes 30 seconds)

Step 1: Expose your service and add a health endpoint

Before setting up monitoring, make sure your service is reachable from outside the cluster and has a /health endpoint to probe.

Here's a minimal Kubernetes Deployment + Service for a Node.js app:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: your-registry/my-api:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-api
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: my-api
  ports:
    - port: 80
      targetPort: 3000
      protocol: TCP

Apply both:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

Get your external IP:

kubectl get svc my-api -n production
# NAME     TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
# my-api   LoadBalancer   10.96.42.100    203.0.113.50     80:31234/TCP   2m

Make sure your /health route returns HTTP 200:

curl http://203.0.113.50/health
# {"status":"ok","timestamp":"2024-01-15T10:23:00Z"}

Step 2: Set up HTTP monitoring in Vigilmon

Now that your service is reachable, add it to Vigilmon:

  1. Log in to vigilmon.online and go to Monitors → New Monitor
  2. Choose HTTP / HTTPS as the type
  3. Set the URL to your service endpoint, e.g. http://203.0.113.50/health (or https://api.yourdomain.com/health if you have a domain and TLS)
  4. Set the check interval to 1 minute
  5. Under Expected response, set:
    • Status code: 200
    • (Optional) Response body contains: "status":"ok"
  6. Save the monitor

Vigilmon will probe your endpoint from multiple geographic regions. If probes from multiple regions agree that your LoadBalancer is unreachable, an incident is triggered, even when your pods are running.
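
To make the expected-response settings concrete, here is a minimal Python sketch of the kind of check an HTTP monitor performs. It is not Vigilmon's implementation: the function names `evaluate_response` and `probe` are hypothetical, and the defaults simply mirror the status code and body match configured above.

```python
import urllib.error
import urllib.request
from typing import Optional


def evaluate_response(status: int, body: str,
                      expected_status: int = 200,
                      must_contain: Optional[str] = None) -> bool:
    """Return True when the response satisfies the monitor's expectations."""
    if status != expected_status:
        return False
    if must_contain is not None and must_contain not in body:
        return False
    return True


def probe(url: str, timeout: float = 10.0,
          expected_status: int = 200,
          must_contain: Optional[str] = None) -> bool:
    """Fetch the URL once and evaluate it; any network error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body, expected_status, must_contain)
    except urllib.error.HTTPError as exc:
        # urllib raises HTTPError for 4xx/5xx; compare the code, skip the body.
        return evaluate_response(exc.code, "", expected_status, must_contain)
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `probe("http://203.0.113.50/health", must_contain='"status":"ok"')` would run one such check from wherever the script executes, which is a single vantage point rather than several regions.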

The multi-region consensus advantage

This is where Vigilmon differs from single-region monitoring tools. Instead of declaring an outage the moment one probe fails, Vigilmon uses multi-region consensus: multiple independent probes from different geographic locations must agree that the service is unreachable before an alert fires.

For Kubernetes, this matters because:

  • A single probe failure could be a transient network blip on one cloud provider's transit path
  • A regional cloud provider incident might affect probes from one region but not others
  • True downtime is confirmed quickly (within seconds) when multiple regions agree

You get fewer false positives without missing real incidents.
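
The consensus idea can be sketched in a few lines of Python. Vigilmon's actual regions, thresholds, and algorithm are not described in this article, so the region names and the `quorum` parameter below are assumptions for illustration only.

```python
from typing import Mapping


def is_down(probe_results: Mapping[str, bool], quorum: int = 2) -> bool:
    """Declare an outage only when at least `quorum` regions report failure.

    probe_results maps a region name to True (reachable) or False (unreachable).
    """
    failures = sum(1 for reachable in probe_results.values() if not reachable)
    return failures >= quorum
```

With a quorum of 2, `is_down({"us-east": True, "eu-west": False, "ap-south": True})` is treated as a blip, while two failing regions confirm the outage.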


Step 3: Monitor your TCP layer

HTTP monitoring tells you if your app responds correctly. TCP monitoring tells you if the port is reachable at all — useful for detecting load balancer routing failures before they manifest as HTTP errors.

If you're running a database or other TCP service inside your cluster, you might expose it via NodePort so that an external monitor can reach it:

# postgres-nodeport.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-nodeport
  namespace: production
spec:
  type: NodePort
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
      nodePort: 30432
      protocol: TCP

Then in Vigilmon:

  1. Go to Monitors → New Monitor
  2. Choose TCP Port
  3. Enter your node's hostname or IP and the NodePort number, e.g. node-1.example.com and 30432
  4. Save the monitor

Now you'll know immediately if the TCP socket becomes unreachable — even if Kubernetes reports the pod as Running.
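
For reference, a TCP check boils down to attempting a connection and seeing whether it completes. The sketch below uses only Python's standard library; the function name `tcp_port_open` is hypothetical, and this is an illustration of the technique rather than Vigilmon's probe.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Running `tcp_port_open("node-1.example.com", 30432)` from outside the cluster checks the same path the monitor uses, which helps when reproducing an alert by hand.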


Step 4: Configure alert channels

When a Kubernetes service goes down, you want to know right away, not when a user files a support ticket.

Email alerts

  1. In Vigilmon, go to Alert Channels → Add Channel → Email
  2. Enter your on-call email address (or your team's alert email)
  3. Assign the channel to your monitors

Webhook alerts (Slack, PagerDuty, etc.)

  1. Go to Alert Channels → Add Channel → Webhook
  2. Enter your webhook URL
  3. Assign the channel to your monitors

Example payload Vigilmon sends when a service goes down:

{
  "monitor_name": "my-api /health",
  "status": "down",
  "url": "http://203.0.113.50/health",
  "started_at": "2024-01-15T10:23:00Z",
  "duration_seconds": 120
}

Use this to route alerts into PagerDuty, Opsgenie, or a custom webhook handler that enriches the alert with Kubernetes context:

# Example command a webhook handler could run to capture each pod's name, status, and restart count
kubectl get pods -n production -l app=my-api --no-headers | \
  awk '{print $1, $3, $4}' > /tmp/pod-status.txt
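
As a rough sketch of what a custom handler could do with the payload, the Python below parses it and builds a short summary to forward to a chat or paging tool. The field names come from the example payload above; the function name `summarize_alert` and the optional `pod_status` argument (for instance, the contents of /tmp/pod-status.txt) are hypothetical.

```python
import json


def summarize_alert(raw_payload: str, pod_status: str = "") -> str:
    """Turn a webhook payload into a short human-readable alert message."""
    alert = json.loads(raw_payload)
    minutes, seconds = divmod(int(alert.get("duration_seconds", 0)), 60)
    summary = (
        f"[{alert['status'].upper()}] {alert['monitor_name']} "
        f"({alert['url']}) since {alert['started_at']}, "
        f"down for {minutes}m{seconds:02d}s"
    )
    if pod_status.strip():
        # Append Kubernetes context gathered separately, e.g. via kubectl.
        summary += "\nPods:\n" + pod_status.strip()
    return summary
```

Keeping the formatting in a pure function makes it easy to test without a running cluster; the HTTP server that receives the webhook can stay a thin wrapper around it.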

Using kubectl to correlate alerts

When you get a Vigilmon alert, your first instinct should be to check whether the issue is at the Kubernetes layer or external:

# 1. Check pod status
kubectl get pods -n production -l app=my-api

# 2. Check service endpoints — are pods registered?
kubectl get endpoints my-api -n production

# 3. Check events for the service
kubectl describe svc my-api -n production

# 4. Test from inside the cluster
kubectl run tmp-curl --image=curlimages/curl --rm -it --restart=Never -- \
  curl http://my-api.production.svc.cluster.local/health

If curl from inside the cluster works but Vigilmon shows the service as down, the problem is external — load balancer, DNS, or network routing. If it fails inside the cluster too, you have a pod or service configuration issue.
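
That decision rule is simple enough to encode directly. The sketch below is one way to write it in Python; the function name `locate_failure` is hypothetical, and the last branch (external check passes while the in-cluster check fails) is not covered by the rule above, so its wording is an assumption.

```python
def locate_failure(internal_ok: bool, external_ok: bool) -> str:
    """Map the in-cluster and external check results to the likely fault layer."""
    if internal_ok and external_ok:
        return "healthy"
    if internal_ok and not external_ok:
        return "external: load balancer, DNS, or network routing"
    if not internal_ok and not external_ok:
        return "cluster: pod or service configuration"
    # Assumption: users are served, so suspect the in-cluster test itself.
    return "inconclusive: check cluster DNS or the test pod"
```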


Step 5: Create a public status page

If you're running production services for customers or internal users, a public status page reduces support noise when incidents happen.

  1. In Vigilmon, go to Status Pages → New Status Page
  2. Give it a name: "Production Cluster Services"
  3. Add your monitors: my-api /health, TCP postgres, etc.
  4. Group them logically — API, Database, Queue, etc.
  5. Publish the page

You'll get a public URL like https://status.vigilmon.online/your-page. Share it in your runbook, your app's dashboard, or with stakeholders who need visibility into cluster health without kubectl access.


Putting it all together

Here's a summary of the monitors to create for a typical Kubernetes production deployment:

| Monitor | Type | What it catches |
|---|---|---|
| https://api.yourdomain.com/health | HTTP | Pod crashes, app errors, ingress routing failures, DNS issues |
| https://api.yourdomain.com | HTTP | TLS/certificate issues, Ingress path routing |
| node-1.example.com:30432 | TCP | Postgres NodePort reachability |
| node-1.example.com:443 | TCP | HTTPS port binding at the node level |

To find the services in your cluster that are exposed externally, and are therefore candidates for monitoring, list them with kubectl:

# Check what's exposed externally
kubectl get svc -A --field-selector spec.type=LoadBalancer
kubectl get svc -A --field-selector spec.type=NodePort
kubectl get ingress -A
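
If you have many services, the same inventory can be turned into a list of monitor targets. The Python sketch below assumes you feed it the output of `kubectl get svc -A -o json`; the function name `external_targets` and the `<node>` placeholder are hypothetical, since a NodePort listens on every node and the node address has to be supplied separately.

```python
import json
from typing import List


def external_targets(services_json: str) -> List[str]:
    """List host:port targets from `kubectl get svc -A -o json` output."""
    targets: List[str] = []
    for svc in json.loads(services_json).get("items", []):
        spec = svc.get("spec", {})
        ports = spec.get("ports", [])
        if spec.get("type") == "LoadBalancer":
            lb = svc.get("status", {}).get("loadBalancer", {})
            # A pending load balancer has no ingress entries yet and is skipped.
            for entry in lb.get("ingress") or []:
                host = entry.get("ip") or entry.get("hostname")
                if host:
                    targets += [f"{host}:{p['port']}" for p in ports]
        elif spec.get("type") == "NodePort":
            targets += [f"<node>:{p['nodePort']}" for p in ports if "nodePort" in p]
    return targets
```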

What's next

  • SSL certificate expiry — Vigilmon monitors your TLS cert and alerts you before it lapses. Critical for Kubernetes clusters using cert-manager, where auto-renewal can silently fail
  • Heartbeat monitoring — if you run CronJobs in Kubernetes, Vigilmon's heartbeat monitors will alert you when a scheduled job stops reporting in, even if the pod exits cleanly
  • Multi-environment coverage — add separate monitors for staging and production, and use Vigilmon's status page grouping to keep them organized
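
To illustrate the arithmetic behind a certificate expiry check, here is a small Python sketch using the standard library's `ssl.cert_time_to_seconds`. The function name `days_until_expiry` is hypothetical, and how Vigilmon computes or thresholds this is not covered in this article.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    not_after uses the format returned by ssl.SSLSocket.getpeercert()["notAfter"],
    for example "Jan  1 00:00:00 2025 GMT". Negative values mean it has expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now.timestamp()) / 86400


# Example: a certificate expiring at the start of 2025, checked ten days earlier.
remaining = days_until_expiry(
    "Jan  1 00:00:00 2025 GMT",
    datetime(2024, 12, 22, tzinfo=timezone.utc),
)
```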

Get started free at vigilmon.online — no credit card, monitors start running in under a minute.
