DEV Community

Chatbots for Cloud Native Incident and Change Management

Author: Shiva. Working with one of the next-generation payroll and IHCM giants. K8s enthusiast | DevOps storyteller | automation and toil-avoidance evangelist | people leader | cricket lover and photographer

Introduction

Kubernetes is the de facto standard for orchestrating containerized microservices. GitOps is a version-control-based approach to build, release engineering, and change management. It makes developers fully responsible for what they ship and enables truly agile delivery.

Many organizations are still looking for efficient and productive ways to give developers access to Kubernetes workloads in production environments. While many are finding innovative approaches, this blog covers one that has been tried and tested, albeit with some custom development effort.

Self-servicing Kubernetes workloads with a faster turnaround, especially during critical incident and change management phases, is an interesting problem to solve. The tooling around Kubernetes has evolved and now offers two ways to self-serve:

  • UI-based self-service
  • Chatbot-based self-service

Both, of course, need a solid RBAC system backing them.

In this blog, I discuss the latter (using chatbots for Kubernetes), since the former already has standard tooling in the market (the Rancher GUI with identity and auth management is one of my favorites).

Before I state the problem, refer to the diagram below, which shows the sequence from incident escalation through fix and de-escalation. It also shows the areas of efficiency improvement targeted in this blog.

[Diagram: sequence from incident escalation through fix and de-escalation]

The problem statement:

When we had a group of Kubernetes specialists doing incident and change management, the questions before us were:

● Can we give Ops tooling that speeds up incident resolution steps, and so improves the Mean Time to Recover (MTTR)?

● Can we provide a single window for all incident management actions? Can it be a chat room that promotes visibility and enables quicker service for Ops?

We chose a chatbot built on the Errbot framework. Yes, you got it right: a Python-powered bot, Python being the darling of many DevOps and SRE (Site Reliability Engineering) professionals. At least one third of the issues that ended up as critical incidents had a secret recipe for a fix:

  • Microservice rollback (and/or)
  • Microservice restart (and/or)
  • Microservice roll-forward

We left it to the best Kubernetes experts to architect microservices well for the Kubernetes cluster, and to the best release engineers to form a CI (Continuous Integration) / CD (Continuous Deployment) strategy for Kubernetes workloads. For our part, we decided to build a simple bot service that can perform all of the above incident fixes in native Kubernetes ways, using a Python Kubernetes operator (e.g. one built with Kopf, https://kopf.readthedocs.io/en/stable/).
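
The post does not show the bot's code, so the following is only a minimal sketch of what a "native Kubernetes" restart could look like from Python. It builds the patch body that changes the `kubectl.kubernetes.io/restartedAt` annotation on a Deployment's pod template, which is the mechanism behind `kubectl rollout restart`. The function name `rollout_restart_patch` is hypothetical and not taken from the original bot.

```python
from datetime import datetime, timezone


def rollout_restart_patch(now=None):
    """Build a patch body that triggers a rolling restart of a Deployment.

    Changing an annotation on the pod template makes the Deployment
    controller replace the pods gradually, without editing the image
    or the replica count.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        # Same annotation key that kubectl sets on restart.
                        "kubectl.kubernetes.io/restartedAt": now.isoformat()
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print(rollout_restart_patch())
```

With the official `kubernetes` Python client, a body like this could be passed to `AppsV1Api().patch_namespaced_deployment(name, namespace, body)`. Whether the bot described here does exactly that, or delegates to an operator, is an assumption.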

One Window for All Operations Approach – Chat Rooms for Swift actions

With this approach, our SREs worked from chat rooms, using chatbots to restart, roll back, or promote services in minutes through a single-window interface!

Compare an SRE working through K8s command lines and CI/CD tool interfaces with one using a chatbot. Voilà! A few valuable minutes of MTTR saved!

We also added process-level safeguards so that release engineering and version control remain the source of truth, in line with the organization's needs. This preserves the integrity of actions taken during an incident.
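
To make the single-window idea concrete, here is a small sketch of how a chat message might be turned into a structured request before anything touches the cluster. The `!k8s` prefix, the argument order, and the names `BotRequest` and `parse_command` are all hypothetical; the post does not describe the bot's actual command syntax.

```python
import shlex
from dataclasses import dataclass

# The three incident actions named in this post.
ALLOWED_ACTIONS = {"restart", "rollback", "promote"}

USAGE = "usage: !k8s <action> <service> <namespace> <justification>"


@dataclass(frozen=True)
class BotRequest:
    action: str
    service: str
    namespace: str
    justification: str


def parse_command(text):
    """Parse a chat message such as:
    !k8s restart payroll-api prod "PagerDuty alert 123"
    """
    parts = shlex.split(text)  # honours quotes around the justification
    if len(parts) != 5 or parts[0] != "!k8s":
        raise ValueError(USAGE)
    _, action, service, namespace, justification = parts
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return BotRequest(action, service, namespace, justification)


if __name__ == "__main__":
    print(parse_command('!k8s restart payroll-api prod "PagerDuty alert 123"'))
```

In Errbot, a handler like this would typically live in a method decorated with `@botcmd` on a `BotPlugin` subclass. The parsing is kept framework-free here so it can be tested on its own.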

Taking it beyond SREs and DevOps: are we empowering the engineering community?

Empowering the engineering community was the next big question we had to answer. The more the engineering community depends on SREs, the more demanding and toil-heavy SRE roles become!

Think of this sequence: an incident lands on a microservice team's alerting service (e.g., PagerDuty) and is later transferred to SRE just for a K8s restart, rollback, or promotion across environments. Valuable recovery time is lost while following the procedures for incident call transfers!

What if we empowered the microservice teams to use self-serviceable chatbots? That was our way forward.

To get there, we wanted to round out the bot's capabilities beyond Kubernetes operations: a bot that is reliable, follows the exact same process every time, and caters to all business use cases in incident management. In other words, a responsible bot!

A responsible bot – a process guide, a toil breaker, and a swiftness enhancer

When we set out to design a responsible bot, we identified the following features to build.

The bot should:

  • Validate the need for a self-service action
  • Give the right user the right access (least access for effective incident mitigation)
  • Track and trace all self-service actions (which raised questions we had to answer about a well-designed environment architecture)

How does the bot work?

The responsible bot

● Takes a mandatory justification and confirms it. For production-like environments, this is an observability monitor or alert link (e.g., a PagerDuty alert link), or a ticket in the proper state (e.g., a business-approved Jira ticket) for issues that were detected but did not raise an alert.
● Has granular access control over cluster, namespace, chat room, and even individual actions, deciding who can perform what through a bot RBAC feature (e.g. https://casbin.org/docs/en/rbac-api)
● Can track and trace every action through logs (e.g., ELK logs) and incident action traces (e.g., logging an action trace on Jira tickets with the action owner)
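
The three behaviours above can be sketched as one gate in front of every action. This is an illustration under stated assumptions, not the bot's real implementation: the policy rows, the user `alice`, the room name, and all function names are hypothetical. A plain set lookup stands in for an RBAC engine such as Casbin, and the justification check only validates the format. Confirming that a ticket is in an approved state would need a call to the ticketing system, which is not shown.

```python
import json
import re
from datetime import datetime, timezone
from urllib.parse import urlparse

# Hypothetical policy rows: (user, cluster, namespace, chat room, action).
# A real bot would load these from an RBAC engine rather than hard-code them.
POLICY = {
    ("alice", "prod-1", "payroll", "#payroll-incidents", "restart"),
    ("alice", "prod-1", "payroll", "#payroll-incidents", "rollback"),
}

# Jira-style issue key, e.g. OPS-1234 (project key, dash, number).
TICKET_KEY = re.compile(r"[A-Z][A-Z0-9]+-\d+")


def is_valid_justification(text):
    """Accept a ticket key or an https link; reject free text."""
    if TICKET_KEY.fullmatch(text):
        return True
    url = urlparse(text)
    return url.scheme == "https" and bool(url.netloc)


def is_allowed(user, cluster, namespace, room, action):
    """Least access: permit only combinations that are listed explicitly."""
    return (user, cluster, namespace, room, action) in POLICY


def audit_record(user, action, target, justification, allowed, now=None):
    """Return one JSON line suitable for shipping to a log pipeline."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "time": now.isoformat(),
        "user": user,
        "action": action,
        "target": target,
        "justification": justification,
        "allowed": allowed,
    }, sort_keys=True)


def handle(user, cluster, namespace, room, action, service, justification):
    """Gate a request and always emit an audit record, allowed or not."""
    allowed = (is_valid_justification(justification)
               and is_allowed(user, cluster, namespace, room, action))
    target = f"{cluster}/{namespace}/{service}"
    return allowed, audit_record(user, action, target, justification, allowed)
```

Recording denied requests as well as allowed ones is a design choice made for this sketch: it keeps the trail complete when someone later reviews what was attempted during an incident.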

Do you agree that this is indeed a responsible bot? If not, look at the value statement below.

The bot aims at:

  • shortening incident mean time to recovery
  • cutting coordination time between Dev and Ops for production-like environments
  • empowering the engineering community to take informed Kubernetes actions, even in production, without Ops dependencies
  • tracking and recording actions so that the state of microservice deployments can be improved where necessary
  • being fully self-serviceable!

MTTR can even drop from a few tens of minutes to a few minutes after an incident alert is received!

Seeding into Self Service Changes for the needy:

Organizations face the challenge of continuous releases to production-like environments. They sometimes depend on a release management team because a last-mile human validation is needed. Tuned the right way, the responsible bot is now capable of replacing a release management person!

Empowering development and test teams to do responsible, self-serviceable releases is a case, and a space, where a chatbot can enable faster change management execution. What do you think? Do you have this problem, or are you fully on continuous releases to production?

Join us

Register for Kubernetes Community Days Chennai 2022 at kcdchennai.in
