Ali

Posted on Feb 6

Kubernetes + Node.js: Health Checks and Graceful Shutdown Done Right

#kubernetes #node #gracefulshutdown #devops

TL;DR: Kubernetes sends SIGTERM and, by default, waits 30 seconds before sending SIGKILL. If your Node.js app doesn't handle SIGTERM properly, requests fail during deployments. This guide covers liveness/readiness probes, graceful shutdown, and the keep-alive problem.

You deploy a new version. Kubernetes starts rolling out pods. Users start seeing 502 errors.

Sound familiar?

The problem isn't Kubernetes. It's how your Node.js app handles the shutdown sequence. Get it right, and you get zero-downtime deployments. Get it wrong, and every deploy causes errors.

How Kubernetes Terminates Pods

When Kubernetes decides to kill a pod (deployment, scale-down, node drain), this happens:

1. Pod marked for termination; the grace period countdown starts (default: 30s)
2. In parallel: pod removed from Service endpoints (new traffic stops once proxies and load balancers catch up)
3. PreStop hook runs (if configured)
4. SIGTERM sent to container
5. If still running when the grace period expires: SIGKILL sent (forced kill)

By default, your app has 30 seconds (including any preStop time) to:

Stop accepting new connections
Finish in-flight requests
Close database/cache connections
Exit cleanly

If you don't exit in time, Kubernetes sends SIGKILL. That's an immediate, non-graceful termination. Connections drop. Transactions fail.

The SIGTERM Problem

By default, Node.js does nothing special with SIGTERM:

// Default behavior: exit immediately
const server = app.listen(3000)

// When SIGTERM arrives:
// - Active requests get ECONNRESET
// - Database connections drop
// - No cleanup runs

You need to handle it explicitly:

process.on('SIGTERM', () => {
  console.log('SIGTERM received. Shutting down gracefully...')
  server.close(() => {
    console.log('HTTP server closed')
    process.exit(0)
  })
})

But server.close() alone isn't enough. Keep reading.
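One gap in the handler above is that it never gives up: if server.close() does not finish, the process lingers until SIGKILL. Here is a minimal sketch of a handler with a hard deadline. The helper name installSigtermHandler and its timeoutMs and exit options are hypothetical, made up for this example rather than taken from any library.

```javascript
// Sketch: SIGTERM handler with a hard deadline.
// `installSigtermHandler`, `timeoutMs` and `exit` are hypothetical names for this example.
function installSigtermHandler(server, { timeoutMs = 10000, exit = process.exit } = {}) {
  process.once('SIGTERM', () => {
    // Fallback: exit with a failure code if draining takes too long.
    const timer = setTimeout(() => exit(1), timeoutMs)
    timer.unref() // the timer alone should not keep the process alive

    // Stop accepting connections; the callback runs once existing ones have closed.
    server.close(() => {
      clearTimeout(timer)
      exit(0)
    })
  })
}
```

Call it once after app.listen(), for example installSigtermHandler(server, { timeoutMs: 10000 }). The injectable exit function is there only to make the handler easy to test.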

Liveness vs Readiness Probes

Kubernetes uses two probes to decide when to restart a container and when to send it traffic:

Liveness Probe

Question: "Is this container alive?"

If it fails: Kubernetes restarts the container.

Implementation:

livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10

app.get('/health/live', (req, res) => {
  // If the process is running, it's alive
  res.status(200).json({ status: 'alive' })
})

Liveness should almost always return 200. The only time it should fail is if your app is in a broken state that requires a restart (deadlock, memory corruption).

Common mistake: Making liveness depend on database connectivity. If your database goes down, Kubernetes restarts all your pods. Now you have no pods AND no database. Bad.

Readiness Probe

Question: "Can this container handle traffic?"

If it fails: Kubernetes stops routing traffic to this pod.

Implementation:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

let isShuttingDown = false

app.get('/health/ready', (req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'shutting_down' })
  }
  res.status(200).json({ status: 'ready' })
})

process.on('SIGTERM', () => {
  isShuttingDown = true
  // Continue with shutdown...
})

During shutdown: Return 503. Kubernetes stops sending new traffic. Existing requests finish.
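The probe logic does not depend on Express. As a rough sketch, the same two endpoints can be served with Node's built-in http module, with the decision kept in a small pure function that is easy to unit test. The helper probeResponse and the state object are hypothetical names chosen for this example.

```javascript
const http = require('node:http')

// Sketch: probe endpoints without Express. `probeResponse` is a hypothetical helper.
function probeResponse(path, state) {
  if (path === '/health/live') {
    // Liveness: the process is up and able to answer.
    return { code: 200, body: { status: 'alive' } }
  }
  if (path === '/health/ready') {
    // Readiness: fail as soon as shutdown has started.
    return state.isShuttingDown
      ? { code: 503, body: { status: 'shutting_down' } }
      : { code: 200, body: { status: 'ready' } }
  }
  return null // not a probe path
}

const state = { isShuttingDown: false }

const server = http.createServer((req, res) => {
  const probe = probeResponse(req.url, state)
  if (!probe) {
    res.writeHead(404).end()
    return
  }
  res.writeHead(probe.code, { 'Content-Type': 'application/json' })
  res.end(JSON.stringify(probe.body))
})

// server.listen(3000) // uncomment to serve the probes

// Flip the flag on SIGTERM; the rest of the shutdown sequence would follow here.
process.once('SIGTERM', () => { state.isShuttingDown = true })
```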

The Keep-Alive Problem

HTTP keep-alive connections are the silent killer of graceful shutdown.

Here's what happens:

1. Client opens connection to your pod
2. Client sends request
3. Your pod responds
4. Connection stays open (keep-alive)
5. Kubernetes sends SIGTERM to your pod
6. You call server.close()
7. server.close() waits for all connections to close
8. Keep-alive connections are idle but open
9. server.close() waits... and waits...
10. 30 seconds pass
11. SIGKILL
12. Client's next request on that connection fails

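The problem: on Node.js versions before 19, server.close() doesn't close idle connections. It just stops accepting new ones. Since Node.js 19, server.close() also closes idle keep-alive connections, and Node.js 18.2+ exposes server.closeIdleConnections() to do so explicitly.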

Solution: Track and terminate connections

const connections = new Set()

server.on('connection', (socket) => {
  connections.add(socket)
  socket.on('close', () => connections.delete(socket))
})

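process.on('SIGTERM', () => {
  // Stop new connections; exit once every existing connection has closed
  server.close(() => process.exit(0))

  // Close idle keep-alive connections only (Node.js 18.2+).
  // Calling socket.end() on every tracked socket would also cut off
  // responses that are still being written.
  server.closeIdleConnections()

  // Force-close whatever is left after a timeout
  setTimeout(() => {
    for (const socket of connections) {
      socket.destroy()  // Force close
    }
  }, 10000).unref()  // unref: the timer alone won't keep the process alive
})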
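A complementary technique, not part of the code above, is to ask clients to stop reusing their connections by sending a Connection: close header on every response once shutdown has begun. Node's HTTP/1.1 server then closes the socket after that response instead of keeping it alive. Below is a sketch as Express-style middleware; connectionCloseOnShutdown and the state object are hypothetical names for this example.

```javascript
// Sketch: Express-style middleware that tells clients to stop reusing the
// connection once shutdown has begun. Names here are illustrative.
function connectionCloseOnShutdown(state) {
  return (req, res, next) => {
    if (state.isShuttingDown) {
      // Node's HTTP server closes the socket after this response is sent.
      res.setHeader('Connection', 'close')
    }
    next()
  }
}
```

Register it before your routes, for example app.use(connectionCloseOnShutdown(state)), where state.isShuttingDown is set to true in the SIGTERM handler. Connections that carry a request during shutdown then close on their own, which leaves fewer sockets to force-close at the end.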

The Race Condition

There's a race between:

Kubernetes removing your pod from endpoints
Your app receiving SIGTERM

Sometimes, requests arrive AFTER SIGTERM but BEFORE the pod is removed from the load balancer.

Solution: Shutdown delay

process.on('SIGTERM', async () => {
  console.log('SIGTERM received')

  // 1. Mark as not ready (stop new requests)
  isShuttingDown = true

  // 2. Wait for load balancer to catch up
  await new Promise(resolve => setTimeout(resolve, 5000))

  // 3. Now close the server
  server.close()
})

Or configure it in your Kubernetes spec:

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]

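The preStop hook runs before SIGTERM, giving the load balancer time to update. The hook's run time counts against terminationGracePeriodSeconds, and the exec form requires a sleep binary in the container image.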
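The preStop sleep, the in-app delay, and the final cleanup all share the same grace period, so it helps to work out how much time is actually left for draining requests. The arithmetic can be sketched as below. The function and parameter names are hypothetical, and the 2 seconds reserved for cleanup is an assumed figure, not something Kubernetes prescribes.

```javascript
// Sketch: how much time is left for draining requests inside the grace period.
// All parameter names are hypothetical; values are in seconds.
function drainBudgetSeconds({ gracePeriod, preStop = 0, appDelay = 0, cleanup = 0 }) {
  // The grace period countdown includes the preStop hook, so subtract it too.
  const budget = gracePeriod - preStop - appDelay - cleanup
  if (budget <= 0) {
    throw new RangeError('No time left to drain requests; raise terminationGracePeriodSeconds')
  }
  return budget
}

// The spec above: 60s grace period, 5s preStop sleep, plus a 5s in-app delay
// and an assumed 2s for closing databases.
console.log(drainBudgetSeconds({ gracePeriod: 60, preStop: 5, appDelay: 5, cleanup: 2 })) // 48
```

Any timeout used while waiting for in-flight requests should stay below this number, otherwise SIGKILL arrives before cleanup runs.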

Complete Manual Implementation

Here's everything together:

import express from 'express'
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()
const app = express()
let isShuttingDown = false
const connections = new Set()

// Health checks
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' })
})

app.get('/health/ready', (req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'shutting_down' })
  }
  res.status(200).json({ status: 'ready' })
})

// Your routes
app.get('/api/data', async (req, res) => {
  const data = await prisma.user.findMany()
  res.json(data)
})

const server = app.listen(3000)

// Track connections
server.on('connection', (socket) => {
  connections.add(socket)
  socket.on('close', () => connections.delete(socket))
})

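// Graceful shutdown
async function shutdown(signal) {
  if (isShuttingDown) return // Ignore repeated signals
  console.log(`${signal} received. Starting graceful shutdown...`)

  // 1. Stop accepting new work
  isShuttingDown = true

  // 2. Wait for load balancer (if in Kubernetes)
  await new Promise(resolve => setTimeout(resolve, 5000))

  // 3. Stop HTTP server; resolves once every connection has closed
  const closed = new Promise(resolve => server.close(resolve))

  // 4. Close idle connections only (Node.js 18.2+); active requests keep running
  server.closeIdleConnections()

  // 5. Wait for active requests (max 25 seconds)
  // Note: 5s + 25s fills the default 30s grace period, so raise
  // terminationGracePeriodSeconds or lower these values.
  let timer
  const timeout = new Promise(resolve => { timer = setTimeout(resolve, 25000) })
  await Promise.race([closed, timeout])
  clearTimeout(timer)

  // 6. Force close remaining connections
  for (const socket of connections) {
    socket.destroy()
  }

  // 7. Close database
  await prisma.$disconnect()

  console.log('Graceful shutdown complete')
  process.exit(0)
}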

process.on('SIGTERM', () => shutdown('SIGTERM'))
process.on('SIGINT', () => shutdown('SIGINT'))

That's 60+ lines just for proper Kubernetes shutdown. Every app needs this.
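The order of the steps matters: the HTTP server stops before the database disconnects, so in-flight requests can still run their queries. One way to keep that order explicit as an app gains more resources is a small list of named steps run in sequence. The following is only a sketch; runShutdownSteps and the step names are hypothetical.

```javascript
// Sketch: run cleanup steps strictly in order, so one failing step
// does not prevent the later ones from running. Names are illustrative.
async function runShutdownSteps(steps, log = console.error) {
  const completed = []
  for (const [name, step] of steps) {
    try {
      await step()
      completed.push(name)
    } catch (err) {
      log(`Shutdown step "${name}" failed:`, err)
    }
  }
  return completed
}

// Example usage inside a shutdown function (server and prisma as defined earlier):
// await runShutdownSteps([
//   ['http', () => new Promise(resolve => server.close(resolve))],
//   ['database', () => prisma.$disconnect()],
// ])
```

Catching errors per step is a design choice: a cache that fails to disconnect should not stop the database from closing cleanly.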

The Zero-Config Way

Kaput handles all of this automatically:

import express from 'express'
import '@joint-ops/kaput'
import { expressHealthMiddleware } from '@joint-ops/kaput'
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()
const app = express()

// Kubernetes-ready health checks
app.use(expressHealthMiddleware())

app.get('/api/data', async (req, res) => {
  const data = await prisma.user.findMany()
  res.json(data)
})

app.listen(3000)

That's it. Kaput:

Handles SIGTERM
Tracks connections
Returns 503 on readiness during shutdown
Closes Prisma in the right order
Manages timeouts

Kubernetes Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
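spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  # (the label value here is illustrative)
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec: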
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        image: my-app:latest
        ports:
        - containerPort: 3000
        livenessProbe:
          httpGet:
            path: /health/live
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "5"]

Kaput's expressHealthMiddleware() provides:

/health - General health status
/health/live - Liveness probe (always 200)
/health/ready - Readiness probe (503 during shutdown)

Common Kubernetes + Node.js Mistakes

1. No SIGTERM Handler

// Bad: Process exits immediately, connections drop
app.listen(3000)

2. Liveness Depends on External Services

# Bad: If DB is down, pods restart forever
livenessProbe:
  httpGet:
    path: /health  # Returns 500 if DB is down

# Good: Separate liveness from dependencies
livenessProbe:
  httpGet:
    path: /health/live  # Always 200 if process is running

3. Readiness Doesn't Change on Shutdown

// Bad: Keeps accepting traffic after SIGTERM
app.get('/health/ready', (req, res) => {
  res.status(200).json({ status: 'ready' })
})

4. terminationGracePeriodSeconds Too Short

# Bad: Only 10 seconds to shutdown
terminationGracePeriodSeconds: 10

# Good: Enough time for requests to complete
terminationGracePeriodSeconds: 60

5. No preStop Hook

# Bad: Race between endpoint removal and SIGTERM
containers:
- name: app
  # No preStop hook

# Good: Delay to let load balancer update
lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]

6. Using npm start in Dockerfile

# Bad: shell form runs npm under /bin/sh, so SIGTERM may never reach Node
CMD npm start

# Good: Node receives SIGTERM
CMD ["node", "dist/server.js"]

Testing Graceful Shutdown

Local Testing

# Terminal 1: Start your app
node server.js

# Terminal 2: Send requests
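# Print one HTTP status code per request (000 means the connection failed)
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/api/data; sleep 0.1; done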

# Terminal 3: Send SIGTERM
kill -SIGTERM $(pgrep -f "node server.js")

# Watch Terminal 1 for shutdown logs
# Watch Terminal 2 - requests should complete without errors

Kubernetes Testing

# Watch pod status
kubectl get pods -w

# In another terminal, trigger a rollout
kubectl rollout restart deployment/my-app

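# Watch application logs for errors (kubectl follows one pod of the deployment;
# measure 5xx rates from the client or ingress side)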
kubectl logs -f deployment/my-app | grep -E "(error|Error|ERROR)"

Quick Start with Kaput

npm install @joint-ops/kaput

// server.js
import express from 'express'
import '@joint-ops/kaput'
import { expressHealthMiddleware } from '@joint-ops/kaput'

const app = express()
app.use(expressHealthMiddleware())

app.get('/', (req, res) => res.json({ ok: true }))

app.listen(3000, () => {
  console.log('Server running on port 3000')
})

Deploy to Kubernetes with the manifest above. Zero-downtime deployments. No boilerplate.

Summary

Kubernetes graceful shutdown requires:

SIGTERM handler - Stop accepting work, clean up
Readiness probe - Return 503 during shutdown
Liveness probe - Always 200 (process alive)
Connection tracking - Close keep-alive connections
Shutdown delay - Wait for load balancer to update
Resource ordering - HTTP before databases

Kaput handles all of this with one import. Add health middleware, deploy, done.

DEV Community