Ever been frustrated by service downtime when deploying your chatbot? If you've aimed for zero-downtime deployments but ended up inconveniencing users repeatedly, this post might help. I want to share my experience implementing stable deployments using a backend rolling restart strategy.
Attempts and Pitfalls
At first, I stuck with the existing deployment method. But a chatbot service has to process user requests in real time, so even a brief pause was unacceptable. I decided to adopt a rolling restart approach instead, replacing backend servers gradually.
Like a Blue-Green deployment, this brings up a new version (Green) alongside the old one (Blue). The difference is that traffic is not switched all at once: it shifts gradually as old Pods are replaced by new ones. I also included health checks so that a new Pod only takes traffic once it is healthy.
# Example: Start deploying the new version
kubectl apply -f new-backend-deployment.yaml
# Example: Wait until new version Pods are ready
kubectl rollout status deployment/backend-deployment -n chatbot --timeout=5m
# Example: Gradually terminate old Pods and switch traffic
# (This is where an unexpected issue occurred)
However, an unforeseen problem arose during this process. Even though all the new version Pods were ready, traffic wasn't being switched correctly when the old Pods were terminated, causing some requests to be dropped. It seemed like the rolling restart wasn't functioning as intended. After three hours of struggling, I discovered the cause...
The Cause
The issue wasn't with the rolling restart logic itself, but with the service's state management approach. The chatbot service maintained specific session information in memory. When the old Pods were terminated, this information disappeared abruptly. When requests then went to the new Pods, sessions were broken. In essence, the Pods themselves were terminating correctly, but the service's state led to a poor user experience.
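Here is a toy reproduction of the failure, not the real service: the `Pod` class and the `user-42` session ID are made up for illustration. It only shows why state kept in process memory cannot survive a Pod being replaced.

```python
class Pod:
    """Toy stand-in for a backend Pod that keeps sessions in process memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # lives and dies with this Pod

    def handle(self, session_id, message):
        # Append the message to the session history and return the turn count.
        history = self.sessions.setdefault(session_id, [])
        history.append(message)
        return len(history)


old_pod = Pod("backend-old")
old_pod.handle("user-42", "hello")
turns_before = old_pod.handle("user-42", "what's my order status?")

# Rolling restart: the old Pod is terminated and a new Pod takes the traffic.
del old_pod
new_pod = Pod("backend-new")
turns_after = new_pod.handle("user-42", "are you still there?")

print(turns_before, turns_after)  # prints: 2 1
```

From the user's side the conversation silently starts over at turn 1, even though both Pods behaved exactly as designed.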
The Solution
To fix this, I moved session information out of the Pods and into an external store (such as Redis). This way, the session data survives when old Pods are terminated, and new Pods can retrieve it and resume existing sessions seamlessly.
# Example: Kubernetes Deployment configuration (partial)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-chatbot
spec:
  replicas: 3 # Example: running with 3 Pods
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 # Set to ensure only one Pod is unavailable at a time
      maxSurge: 1 # Allow up to one new Pod to be created above desired replicas
  # ... (other configurations)
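To make the two rollingUpdate settings above concrete, here is a small sketch of the arithmetic as I understand it. The helper names `resolve` and `rollout_bounds` are my own, not part of any Kubernetes tooling. Both settings also accept percentages, in which case maxUnavailable is rounded down and maxSurge is rounded up.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an absolute count or a percentage string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_unavailable, max_surge):
    """Return (minimum available Pods, maximum total Pods) during a rolling update."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(3, 1, 1))           # (2, 4)
print(rollout_bounds(10, "25%", "25%"))  # (8, 13)
```

With the configuration above, at least 2 of the 3 Pods keep serving and at most 4 exist at any moment during the rollout.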
# Example: Python code to load session data from Redis (based on Flask)
from flask import Flask, request, session  # request is needed to read cookies
import redis

app = Flask(__name__)
app.config['SECRET_KEY'] = 'your_secret_key' # Use a secure key in production
redis_client = redis.StrictRedis(host='your-redis-host', port=6379, db=0)

@app.before_request
def load_session_from_redis():
    session_id = request.cookies.get('session_id') # Or other session identifier
    if session_id:
        session_data = redis_client.get(f'session:{session_id}')
        if session_data:
            # Logic to decode and restore session data
            pass

@app.route('/')
def index():
    # ...
    pass

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
By externalizing session information like this, user sessions continue without interruption even as old Pods are terminated and new ones start. The health checks remain valid, and traffic is only routed to Pods that have passed them.
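The "decode and restore" step can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the production code: sessions are stored as JSON, the 30-minute TTL is arbitrary, and `FakeRedis` is a hypothetical in-memory stand-in that mimics only the `get` and `set` calls of a Redis client so the example runs without a server.

```python
import json


class FakeRedis:
    """In-memory stand-in for a Redis client (hypothetical, illustration only).

    It mimics only get/set and returns bytes, as redis-py does by default.
    """

    def __init__(self):
        self._data = {}

    def set(self, key, value, ex=None):
        # `ex` (expiry in seconds) is accepted but ignored in this toy version.
        self._data[key] = value.encode("utf-8") if isinstance(value, str) else value
        return True

    def get(self, key):
        return self._data.get(key)


def save_session(client, session_id, data, ttl_seconds=1800):
    """Serialize session data as JSON under the session:<id> key scheme."""
    client.set(f"session:{session_id}", json.dumps(data), ex=ttl_seconds)


def load_session(client, session_id):
    """Return the stored session dict, or an empty dict if none exists."""
    raw = client.get(f"session:{session_id}")
    return json.loads(raw) if raw else {}


store = FakeRedis()  # shared by every Pod, so it outlives any single one

# The "old Pod" writes the conversation state before it is terminated.
save_session(store, "user-42", {"history": ["hello"], "step": 2})

# The "new Pod" picks the session up from the shared store.
restored = load_session(store, "user-42")
print(restored["step"])  # prints: 2
```

Because both helpers take the client as an argument, swapping `FakeRedis` for a real client should only change the line that creates `store`.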
Results
- Downtime during deployment has been reduced to zero.
- User experience has significantly improved, with no more session interruptions.
- The rolling restart strategy has been successfully applied, enhancing deployment stability.
Summary — How to Avoid the Same Pitfall
If you're attempting rolling restarts for zero-downtime deployments, make sure to check the following:
- [ ] Verify that your service's state (like session information) is managed independently of the Pod's lifecycle (e.g., using external storage).
- [ ] Ensure your `Deployment` strategy is configured with `maxUnavailable` and `maxSurge` to allow for gradual traffic shifting.
- [ ] Validate that your `readinessProbe` and `livenessProbe` accurately reflect the actual health of your service.
- [ ] After deployment, continuously monitor actual user traffic to detect any unexpected errors.
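For the probe item in the checklist, one way to make readiness reflect real health is to have the endpoint check the dependencies the service cannot work without, such as the session store. The sketch below is only an outline: the `readiness` helper and the two check functions are hypothetical, and a real check might call the Redis client's `ping()` instead of the stubs used here.

```python
def readiness(checks):
    """Run each named dependency check and return (http_status, report).

    `checks` maps a name to a zero-argument callable that returns True when
    the dependency is usable. Any exception counts as a failure.
    """
    report = {}
    for name, check in checks.items():
        try:
            report[name] = bool(check())
        except Exception:
            report[name] = False
    status = 200 if all(report.values()) else 503
    return status, report


def store_up():
    # Stub for a healthy session store.
    return True


def store_down():
    # Stub for an unreachable session store.
    raise ConnectionError("session store unreachable")


ready_status, ready_report = readiness({"session_store": store_up})
failing_status, failing_report = readiness({"session_store": store_down})
print(ready_status, failing_status)  # prints: 200 503
```

Returning 503 while the store is unreachable keeps the Pod out of rotation until it can actually restore sessions.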