Petr Petrenko

Your Database Just Died. Why Is Everything Still Running?

It's 3 AM. Your PostgreSQL pod crashes. On-call fires. The engineer wakes up, checks the dashboard, and spends 20 minutes figuring out why the payments service is returning 500s on every request. Then they realize: the database is down, but payments is still running, hammering connection pools, logging thousands of errors per second, and triggering cascading alerts.

They scale payments to zero. Problem stops. Recovery begins.

That 20 minutes was avoidable. klink would have scaled payments to zero automatically — 30 seconds after the database went down.


The Problem Nobody Talks About

Kubernetes is excellent at keeping individual services running. Liveness probes, restart policies, resource limits — it's all there. But Kubernetes has no concept of relationships between workloads.

When your database fails, Kubernetes does exactly what you told it to: keeps the dependent services running. Those services now do nothing useful — they just generate noise, consume resources, and make your incident harder to debug.

This is the gap klink fills.


What klink Does

klink introduces a new primitive: WorkloadDependency. You declare that service B depends on service A. klink watches. When A goes unhealthy, klink scales B to zero. When A recovers, klink restores B automatically.

apiVersion: deps.klink.dev/v1alpha1
kind: WorkloadDependency
metadata:
  name: payments-needs-database
  namespace: production
spec:
  dependent:
    kind: Deployment
    name: payments-service

  dependsOn:
    - kind: Deployment
      name: postgresql
      condition:
        minReadyPercent: 80    # healthy if ≥80% pods ready
        window: 30s            # ignore transient restarts
        recoveryWindow: 60s    # wait for stability before restoring

  onDegraded:
    action: ScaleToZero

  mode: strict

That's it. No code changes. No sidecars. No complex configuration.
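
To make the minReadyPercent condition concrete, here is a minimal Python sketch of how such a readiness check could be evaluated. This is not klink's actual code: the function name `is_healthy` and the treatment of a dependency with zero desired replicas are assumptions made for illustration.

```python
def is_healthy(ready_replicas: int, desired_replicas: int,
               min_ready_percent: int = 80) -> bool:
    """Return True when at least min_ready_percent of desired pods are ready."""
    if desired_replicas <= 0:
        # Assumption: a dependency with no desired replicas counts as unhealthy.
        return False
    # Integer arithmetic avoids floating-point rounding right at the threshold.
    return ready_replicas * 100 >= min_ready_percent * desired_replicas
```

With the 80% threshold from the manifest, a three-replica database with one pod down (2/3, about 67%) would count as unhealthy, while a five-replica one with one pod down (4/5, exactly 80%) would still pass.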


How It Works

When PostgreSQL crashes, klink waits 30 seconds (ignoring transient restarts), then scales payments to zero and saves the replica count. When the database recovers, klink waits 60 seconds for stability, then automatically restores payments to its original replica count.

The hysteresis window is critical. Without it, a single pod restart would cascade a shutdown. With window: 30s, klink ignores transient failures — only sustained outages trigger action.
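
The two windows can be sketched as a small state tracker. This is an illustrative Python model under stated assumptions, not klink's implementation: the class `HysteresisTracker`, its `observe` method, and the "suspend"/"restore" return values are hypothetical names, and the defaults mirror window: 30s and recoveryWindow: 60s from the manifest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HysteresisTracker:
    window: float = 30.0            # seconds unhealthy before suspending
    recovery_window: float = 60.0   # seconds healthy before restoring
    suspended: bool = False
    saved_replicas: Optional[int] = None
    _last_healthy: Optional[bool] = None
    _state_since: Optional[float] = None  # when the current health state began

    def observe(self, healthy: bool, now: float,
                current_replicas: int) -> Optional[str]:
        """Feed one health sample; return "suspend", "restore", or None."""
        if healthy != self._last_healthy:
            # Health flipped: restart the clock for the new state.
            self._last_healthy = healthy
            self._state_since = now
        elapsed = now - self._state_since

        if not healthy and not self.suspended and elapsed >= self.window:
            # Sustained outage: remember the replica count, then suspend.
            self.saved_replicas = current_replicas
            self.suspended = True
            return "suspend"
        if healthy and self.suspended and elapsed >= self.recovery_window:
            # Dependency has been stable long enough: restore.
            self.suspended = False
            return "restore"
        return None
```

Because the clock restarts every time the health state flips, a pod that blips for ten seconds never accumulates enough unhealthy time to trigger a suspend.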


Enforcement Modes

Different situations call for different behavior. klink has four modes:

| Mode | What it does | When to use |
|---|---|---|
| strict | Scales to 0 on failure. Reverts manual scale-ups within 15s. | Production services where cascade is non-negotiable |
| soft | Scales to 0 once. Respects manual overrides. | Services where operators need flexibility during incidents |
| gate | Doesn't scale down. Blocks scale-up via admission webhook. | Preventing HPA from scaling up while dependency is down |
| observe | Logs what it would do. Takes no action. | Safe onboarding: see what klink would do before enabling it |

Start with observe mode. Apply klink to your existing services, watch the logs for a week, and only switch to strict or soft once you're confident.

spec:
  mode: observe  # "would scale payments to 0 — dependency unhealthy"
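
The difference between the four modes can be summarized as a decision function. The following Python sketch is one possible reading of the mode descriptions above, not klink's source: the function `decide` and the action strings it returns are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    STRICT = "strict"
    SOFT = "soft"
    GATE = "gate"
    OBSERVE = "observe"


def decide(mode: Mode, dependency_healthy: bool, replicas: int,
           already_suspended: bool) -> str:
    """Return the action for one reconcile pass (hypothetical action names)."""
    if dependency_healthy:
        return "none"
    if mode is Mode.OBSERVE:
        return "log-only"            # report what would happen, change nothing
    if mode is Mode.GATE:
        return "block-scale-up"      # never scale down, only refuse scale-ups
    if replicas == 0:
        return "none"                # already at zero, nothing to do
    if mode is Mode.STRICT:
        return "scale-to-zero"       # also reverts manual scale-ups
    # soft: scale down once, then respect whatever the operator sets
    return "none" if already_suspended else "scale-to-zero"
```

The only place strict and soft differ in this sketch is the last line: once a soft dependency has been suspended, a manual scale-up is left alone, while strict scales it back down.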

The Mutual Dependency Problem

What happens when A depends on B and B depends on A?

Naive implementations deadlock. Both services go to zero, each waiting for the other to recover. You need a manual fix every time.

klink solves this with CoSuspended detection:

Suppose payments depends on database and database depends on payments. When database fails, klink scales payments to zero and marks it as CoSuspended: intentionally scaled down by klink, not actually broken.

When database checks its dependencies, it sees payments at zero, recognizes it as CoSuspended, and doesn't cascade. Once you manually restore database, klink restores payments automatically.

No deadlock, and no manual step to bring payments back.
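
The key idea is that a workload klink itself paused should not count as a failed dependency. A minimal Python sketch of that check follows; the `Workload` type, its `co_suspended` flag, and `should_cascade` are hypothetical names for illustration, and how klink actually records the CoSuspended marker is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    ready: int
    desired: int
    co_suspended: bool = False  # True when the controller scaled it to zero


def should_cascade(dependency: Workload, min_ready_percent: int = 80) -> bool:
    """Return True if a dependent should be suspended because of this dependency."""
    if dependency.co_suspended:
        # Intentionally paused, not broken: do not propagate the shutdown.
        return False
    if dependency.desired <= 0:
        return True
    return dependency.ready * 100 < min_ready_percent * dependency.desired
```

In the mutual-dependency case, payments at zero replicas with `co_suspended=True` is skipped, so database is never told to shut down in response.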


Argo Rollout Support — Canary Awareness

klink understands Argo Rollouts. If your payments service is in the middle of a canary deployment when its database goes down, klink defers the scale-to-zero until the rollout completes. Interrupting an active deployment would break the canary analysis and take down the stable version, so klink waits, then acts.

You never want to interrupt an active deployment, and klink handles this automatically.


CronJob Support — Suspend Instead of Scale

For batch jobs, scaling to zero makes no sense. klink sets spec.suspend: true instead:

spec:
  dependent:
    kind: CronJob
    name: nightly-billing-export
  dependsOn:
    - kind: Deployment
      name: billing-service

When billing-service goes down, the CronJob is suspended. No failed jobs accumulating in history. When billing-service recovers, the CronJob resumes automatically.
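
The difference between the two workload types comes down to which field gets patched. The Python sketch below builds the merge-patch bodies involved; `spec.replicas` and `spec.suspend` are standard Kubernetes fields, while the helper names `suspend_patch` and `resume_patch` are made up for illustration and are not part of klink.

```python
import json


def suspend_patch(kind: str) -> dict:
    """Build a merge-patch body that pauses a dependent of the given kind."""
    if kind == "CronJob":
        return {"spec": {"suspend": True}}
    if kind in ("Deployment", "StatefulSet"):
        return {"spec": {"replicas": 0}}
    raise ValueError(f"unsupported dependent kind: {kind}")


def resume_patch(kind: str, saved_replicas: int = 1) -> dict:
    """Build the inverse patch; saved_replicas is the count recorded at suspend time."""
    if kind == "CronJob":
        return {"spec": {"suspend": False}}
    if kind in ("Deployment", "StatefulSet"):
        return {"spec": {"replicas": saved_replicas}}
    raise ValueError(f"unsupported dependent kind: {kind}")


if __name__ == "__main__":
    print(json.dumps(suspend_patch("CronJob")))
```

Done by hand, the CronJob case would look something like `kubectl patch cronjob nightly-billing-export --type merge -p '{"spec":{"suspend":true}}'`.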


Notifications

Get notified when workloads are suspended or restored — before your monitoring fires:

spec:
  notify:
    webhookSecretRef:
      name: slack-webhook
      key: url
    onPhases: [Suspended, Healthy]

The notification arrives the moment klink acts, with full context:

{
  "workloadDependency": "payments-needs-database",
  "namespace": "production",
  "phase": "Suspended",
  "previousPhase": "Degraded",
  "dependent": "payments-service",
  "dependentKind": "Deployment",
  "message": "dependency postgresql not healthy: 0/3 ready",
  "timestamp": "2026-06-15T03:00:00Z"
}
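
If you route these payloads through your own receiver rather than straight into Slack, a few lines are enough to turn one into a readable summary. The Python sketch below assumes the payload has exactly the fields shown above; the `summarize` helper is hypothetical.

```python
import json

EXAMPLE = """{
  "workloadDependency": "payments-needs-database",
  "namespace": "production",
  "phase": "Suspended",
  "previousPhase": "Degraded",
  "dependent": "payments-service",
  "dependentKind": "Deployment",
  "message": "dependency postgresql not healthy: 0/3 ready",
  "timestamp": "2026-06-15T03:00:00Z"
}"""


def summarize(payload_json: str) -> str:
    """Turn a notification payload into a one-line summary."""
    p = json.loads(payload_json)
    return (f"[{p['namespace']}] {p['dependentKind']}/{p['dependent']}: "
            f"{p['previousPhase']} -> {p['phase']} ({p['message']})")


if __name__ == "__main__":
    print(summarize(EXAMPLE))
```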

Notifications are retried with exponential backoff (1s → 2s → 4s), so transient webhook outages don't silently drop alerts.
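
A retry loop with that delay schedule might look like the following Python sketch. It is an assumption that the schedule means one initial attempt followed by three retries; the function `send_with_backoff` and its injectable `sleep` parameter are illustrative, not klink's API.

```python
import time
from typing import Callable, Sequence


def send_with_backoff(send: Callable[[], bool],
                      delays: Sequence[float] = (1.0, 2.0, 4.0),
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send(); on failure wait 1s, 2s, then 4s between retries.

    Returns True as soon as one attempt succeeds, False if all attempts fail.
    """
    if send():
        return True
    for delay in delays:
        sleep(delay)          # back off before the next attempt
        if send():
            return True
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting.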


Safety Net — maxSuspendDuration

Long outages happen. Your database might be down for hours. You don't want your payments service suspended indefinitely.

spec:
  onDegraded:
    action: ScaleToZero
    maxSuspendDuration: 4h

After 4 hours, klink restores the workload regardless of dependency state and enters the Released phase. In that phase it won't re-suspend the workload until the dependency has genuinely recovered. This keeps a single bad dependency from causing an indefinite outage.
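
The phase transitions around maxSuspendDuration can be sketched as a small function. This Python model is a simplification under stated assumptions: it uses the phase names that appear in this article, ignores recoveryWindow, and the function `next_phase` is hypothetical.

```python
FOUR_HOURS = 4 * 3600.0


def next_phase(phase: str, dependency_healthy: bool,
               suspended_for: float, max_suspend: float = FOUR_HOURS) -> str:
    """Return the next phase given the current one (simplified model)."""
    if phase == "Suspended":
        if dependency_healthy:
            return "Healthy"       # normal recovery path
        if suspended_for >= max_suspend:
            return "Released"      # safety net: restore despite the outage
        return "Suspended"
    if phase == "Released":
        # Stay released, without re-suspending, until the dependency is back.
        return "Healthy" if dependency_healthy else "Released"
    return phase
```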


Observability

klink exports Prometheus metrics so you can see exactly what's happening:

klink_dependency_phase{namespace="production", name="payments-needs-database", phase="Suspended"} 1
klink_scale_to_zero_total{namespace="production", kind="Deployment", name="payments-service"} 3
klink_replicas_restored_total{namespace="production", kind="Deployment", name="payments-service"} 3

GKE users get a PodMonitoring resource automatically when metrics are enabled.


Getting Started

helm upgrade --install klink oci://ghcr.io/n0rm4l-me/charts/klink \
  --version 0.3.0 \
  --namespace klink-system \
  --create-namespace

Apply your first WorkloadDependency:

apiVersion: deps.klink.dev/v1alpha1
kind: WorkloadDependency
metadata:
  name: payments-needs-database
  namespace: default
spec:
  dependent:
    kind: Deployment
    name: payments
  dependsOn:
    - kind: Deployment
      name: database
      condition:
        minReadyPercent: 80
        window: 30s
        recoveryWindow: 60s
  onDegraded:
    action: ScaleToZero
  mode: observe  # start here — see what klink would do

Check the status:

kubectl get workloaddependencies -A

NAMESPACE    NAME                      PHASE     REPLICAS   MESSAGE
default      payments-needs-database   Healthy              all dependencies healthy

When you're comfortable with what you see in observe mode, switch to strict or soft.


What klink Supports

| Workload | As dependent | As dependency |
|---|---|---|
| Deployment | ✅ scale to 0 | ✅ readyReplicas check |
| StatefulSet | ✅ scale to 0 | ✅ readyReplicas check |
| CronJob | ✅ suspend/resume | |
| Argo Rollout | ✅ scale to 0 (canary-aware) | ✅ phase check |

The Incident That Started This

We run microservices on Kubernetes. One evening our message queue had a rolling restart — routine maintenance, 45 seconds of unavailability. But 12 services that depended on it kept running and kept trying to connect. By the time the queue was back, we had retries queued up, connection pools exhausted, and a 10-minute degraded period that should have been 45 seconds.

The fix was conceptually simple: "if the queue is down, pause the consumers." But there was no Kubernetes-native way to express that relationship.

So we built klink.


What's Next

  • Prometheus-based health conditions: promQuery: 'pg_up == 1' instead of readyReplicas
  • kubectl klink plugin: klink graph, klink why payments-service
  • DaemonSet support

The project is open source under Apache 2.0. Issues, PRs, and feedback welcome.

GitHub: github.com/n0rm4l-me/klink
