Peter Wurbs

Posted on Apr 13 • Originally published at Medium

Bridging the Gap: Integrating DevOps and ITIL

#devops #itil #itsm #operations

The Challenge of Combining DevOps and ITIL

DevOps provides methods and tools to improve the agility, speed, and quality of software development. However, it lacks widely accepted guidelines for integrating these principles with IT Service Management (ITSM) operations.

But even in a fully implemented DevOps culture, we need service processes when we want to bring our product to customers and are obliged to provide service levels. We have to provide a support interface, implement alerting, fix issues, etc.

This article outlines a rough idea of how to bring these two worlds together. The ideas are based on ITIL, the leading framework for providing best practices to implement ITSM. Here we focus on ITIL v2/v3 and only take the main processes into account.

Unlike ITIL, DevOps has no such standardized framework. I assume you know what DevOps actually means. But to make sure we're all on the same page, here's my understanding in a nutshell:

DevOps is a cultural shift and collaborative approach that breaks down traditional barriers and integrates everyone, including business stakeholders, security, operations, and quality assurance, into a single, organization-wide feedback loop. Instead of working in isolated silos, these groups join forces throughout the entire project, sharing responsibilities and communicating openly. This unified mindset allows organizations to respond to customer needs faster, solve problems more efficiently, and deliver continuous value.

TL;DR

For those who lack the time to read the entire article, here's a concise summary of the approach:

The application of IT Service Management (e.g., ITIL) remains essential, even when DevOps practices are fully implemented.
At the very least, Incident, Problem and Event Management processes are required.
It is rarely feasible to implement all IT Service Management processes within the cross-functional DevOps team itself. Therefore, compromises must be made to delegate some tasks.
Practicing DevOps significantly impacts several classic ITIL processes, especially Change and Release Management.
Informal, manually maintained, and often outdated Configuration Management Databases (CMDBs) are replaced by current and accurate data obtained from various sources such as issue tracking, Git, CI/CD, cloud and software APIs, metrics, and logs.
Also see the diagram at the end of this article for a good summary.

The Long Story

In the following, we go through the main ITIL processes and sketch how each could be integrated into a DevOps way of working, and how each is affected by practicing DevOps.

The well-known "You build it - You run it!" is the most relevant guideline. With every decision, you should consider how far you’re straying from this mantra.

Incident Management

The primary goal of Incident Management is to restore IT services to users as fast as possible. This is the most challenging process to integrate into a DevOps culture.

1st Level Support (Service Desk): Can be outsourced to specialized external organizations. Very often, these organizations provide standardized services (within a company) that are shared by many teams and products to reduce costs. The DevOps team provides troubleshooting guidelines and FAQs, while the service desk reports common incidents back to the DevOps team to improve the service.
2nd Level Support: Handles technical issues that do not require software developers. It can remain within the DevOps team to prevent silos, though offering 24x7 coverage is challenging. Alternatively, it can be outsourced for 24x7 support, provided that tight collaboration is maintained through shared tools, shared documentation, and in-depth technical training provided by the DevOps team. Often, this support level is also responsible for deploying the software into production. The risk of creating new silos is especially high here, so close collaboration, trust, and transparency are crucial. It's good practice to hold regular joint sessions to exchange information, knowledge, findings, and experiences, and to have representatives of the 2nd level support take part in the daily stand-ups and the other regular meetings of the DevOps team.
3rd Level Support: It's crucial that this task remains within the DevOps team. 3rd level issues are mostly related to software bugs, and troubleshooting and resolving them must not be separated from software development. This responsibility also covers 3rd level issues concerning the CI/CD pipeline and the deployment target environment. According to "You build it - you run it!", the team is responsible for these topics as well.

Some general guidelines and tips:

All parties should use identical tools or at least automated data synchronization to prevent information loss. Incident tickets must be integrated into the DevOps team's daily agile planning (e.g., Kanban boards) and Sprint backlogs, utilizing reserved team capacity.
Engineers working on 2nd and 3rd level issues must be able to communicate directly with users and customers. Otherwise information and speed are lost, which leads to poor support quality and dissatisfied users.
The DevOps team must reserve some velocity/resources for incident issues during sprint planning. A budget for this could be set initially, observed, and then adapted. If the budget is not fully used, the team can use the remaining time to improve ops-related topics such as observability and stability.
If the production target environment (where the software runs) is operated by an external party, the DevOps team must get transparent and fast access to all configuration data, logs, metrics, and other relevant data. This is mandatory for the feedback loop and for quick support in case of issues. See also "Operating Target Environments" below.
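The "set, observe, adapt" budget idea can be illustrated with a small sketch. The class name, the hour-based unit, and the smoothing rule are assumptions made for illustration only; a team would choose its own unit (story points, days) and adaptation rule.

```python
from dataclasses import dataclass


@dataclass
class IncidentBudget:
    """Sprint capacity reserved for incident work (hypothetical model)."""

    reserved_hours: float

    def remaining(self, spent_hours: float) -> float:
        # Hours left over for ops improvements such as observability.
        return max(self.reserved_hours - spent_hours, 0.0)

    def adapt(self, spent_hours: float, weight: float = 0.5) -> "IncidentBudget":
        # Move the next sprint's budget part of the way toward the hours
        # actually spent. `weight` is an assumed smoothing factor.
        new_reserved = (1 - weight) * self.reserved_hours + weight * spent_hours
        return IncidentBudget(reserved_hours=round(new_reserved, 1))


budget = IncidentBudget(reserved_hours=40.0)
print(budget.remaining(spent_hours=28.0))  # time free for ops improvements
print(budget.adapt(spent_hours=28.0))      # proposed budget for the next sprint
```

Smoothing instead of jumping straight to the last observed value keeps a single quiet or noisy sprint from swinging the reserved capacity too far.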

Event Management

Event Management ensures services are monitored and events are categorized for action. Nowadays, the old term "Monitoring" is usually replaced by the modern term "Observability" to reflect the versatile perspectives of modern software architectures (microservices, containers, cloud, etc.).

Data from observability tools are possible triggers for Incident, Problem, or Capacity Management and must be shared amongst all affected parties to avoid silos and borders.
As with Incident Management, the DevOps team needs direct access to all the data to support fixing or preventing issues as quickly as possible. Ideally, the DevOps team owns the concept and implementation of the observability tools.
The old assumption that "monitoring" is "ops-stuff" is no longer valid. If we take "shift-left" seriously, the DevOps team must integrate observability capabilities directly into the software architecture. The generated data provide vital business and runtime metrics and are not a purely operational concern. So it's good practice for the team to plan these aspects as early as possible in the software development lifecycle, as part of the non-functional requirements (NFRs).
As with Incident Management, it's challenging for a DevOps team to watch the monitors 24x7. The team might therefore delegate this task to an external organization (e.g. together with the 2nd level support). The same obligations for both parties apply here as well.
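How observability events can act as triggers for the other processes can be sketched as a simple routing function. The event kinds, severity labels, and the routing table below are hypothetical and not taken from ITIL or from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Process(Enum):
    INCIDENT = "Incident Management"
    PROBLEM = "Problem Management"
    CAPACITY = "Capacity Management"
    NONE = "log only"


@dataclass(frozen=True)
class Event:
    source: str
    kind: str      # e.g. "service_down", "recurring_error", "disk_usage_high"
    severity: str  # "info", "warning" or "critical"


# Hypothetical routing table; the real categories are owned by the DevOps team.
ROUTES = {
    "service_down": Process.INCIDENT,
    "recurring_error": Process.PROBLEM,
    "disk_usage_high": Process.CAPACITY,
}


def route(event: Event) -> Process:
    """Decide which ITSM process an observability event should trigger."""
    if event.severity == "info":
        return Process.NONE
    # Unknown, non-informational events default to an incident so that
    # nothing is silently dropped.
    return ROUTES.get(event.kind, Process.INCIDENT)


print(route(Event("api-gateway", "service_down", "critical")).value)
```

Keeping the table in the team's repository means an external monitoring team can apply it, while the DevOps team remains the owner of the categorization.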

Problem Management

Problem Management aims to prevent incidents and minimize their impact. The process is mostly triggered when a customer requests an RFO (Reason for Outage) report or, proactively, when repeated occurrences of the same incident must be investigated in order to prevent them in the future.

Because complex analysis is required, the execution of these tasks mostly falls to the 3rd Level Support within the DevOps team.
The above-mentioned guidelines regarding time budget, ticket management, and tools apply here too.
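The proactive trigger, repeated occurrences of the same incident, can be sketched as a simple count over incident tickets. The `fingerprint` field and the threshold of three are assumptions for illustration; real ticket systems expose different fields.

```python
from collections import Counter


def recurring_incidents(incidents: list[dict], threshold: int = 3) -> list[str]:
    """Return fingerprints that occurred at least `threshold` times.

    `fingerprint` is a hypothetical field that identifies "the same"
    incident, for example a normalized error signature.
    """
    counts = Counter(incident["fingerprint"] for incident in incidents)
    return sorted(fp for fp, n in counts.items() if n >= threshold)


tickets = [
    {"id": 1, "fingerprint": "db-connection-timeout"},
    {"id": 2, "fingerprint": "cert-expired"},
    {"id": 3, "fingerprint": "db-connection-timeout"},
    {"id": 4, "fingerprint": "db-connection-timeout"},
]
print(recurring_incidents(tickets))  # candidates for a problem ticket
```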

Capacity Management

This is the balancing act of ensuring that IT resources are exactly the right size—not too small (causing slow performance or crashes) and not too large (wasting money). We achieve good capacity management by:

appropriate upfront resource planning
a suitable configuration of the different items like servers, containers, applications, etc.
good observability (see Event Management)
a modern software architecture supporting horizontal scalability

Since these levers sit largely with the DevOps team, this process should be owned by the DevOps team too. Delegating capacity planning to an operations team has been one of the biggest failures of the past. It resulted in endless ping-pong between developers and ops.

Yes, if watching the monitors is delegated to e.g. the 2nd level support or any other "monitoring" team, then this team will see alerts and can act accordingly. But the DevOps team must own the concept, the thresholds, the resource configuration, and the instructions on how to act.
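One way to make this ownership explicit is to keep thresholds and instructions together in a rule that the monitoring team only evaluates. The following is a minimal sketch; the metric name, the threshold values, and the instruction texts are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapacityRule:
    """Thresholds and instructions defined by the DevOps team (hypothetical)."""

    metric: str
    low: float   # below this, resources are likely oversized
    high: float  # above this, resources are likely undersized
    instruction_low: str
    instruction_high: str


def evaluate(rule: CapacityRule, value: float) -> Optional[str]:
    """Return the instruction to follow, or None if the value is in range."""
    if value > rule.high:
        return rule.instruction_high
    if value < rule.low:
        return rule.instruction_low
    return None


cpu_rule = CapacityRule(
    metric="cpu_utilization_percent",
    low=20.0,
    high=80.0,
    instruction_low="Create a ticket to review replica count (possible waste).",
    instruction_high="Scale out by one replica and notify the DevOps team.",
)
print(evaluate(cpu_rule, 91.5))
```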

Configuration Management

Configuration Management in ITIL is intended to document information about the infrastructure and services required for ITSM. So-called Configuration Items (CIs) are stored in a CMDB (Configuration Management Database) to support the different ITSM processes.

We all know that keeping a CMDB up to date has never really worked. Either there was not sufficient discipline, the tools were not appropriate, or the data changed too often. I can't remember a single case in which a CMDB was helpful. The data were incomplete, outdated, or inconsistent. This becomes much more problematic when we move to the cloud, spin up 100 containers every day with ever-changing IP addresses, and don't know exactly on which server or container which part of our (microservice-based) software runs.

So, this process is the most affected and maybe even made obsolete by the modern (DevOps) world:

DevOps replaces manually maintained CMDBs with an "everything-as-code" approach. Target configuration states are stored in Git and directly used in our tool pipeline to build, test, and deploy.
This is combined with the retrieval of the current state of configuration directly from the involved systems and software using APIs, dashboards, logs, metrics, etc.
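Comparing the target state from Git with the current state reported by the systems is essentially a drift check. Here is a minimal sketch in which both states are plain dictionaries. In practice the desired state would be parsed from files in the repository and the actual state fetched from an API; both steps are omitted, and all keys and values are invented.

```python
from typing import Any


def detect_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple]:
    """Return {key: (desired_value, actual_value)} for every mismatch.

    A key missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example values.
desired_state = {"replicas": 3, "image": "shop:1.4.2", "log_level": "info"}
actual_state = {"replicas": 2, "image": "shop:1.4.2", "debug_port": 5005}

for key, (want, have) in detect_drift(desired_state, actual_state).items():
    print(f"{key}: desired={want!r} actual={have!r}")
```

Because `None` marks a missing key, this sketch cannot distinguish "missing" from an explicit `None` value. A dedicated sentinel would be needed if that matters.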

Request Fulfillment

This process manages minor, standardized changes (service requests) or requests for information from users.

Self-service for users should be provided as much as possible. This saves time and money and makes users happy.
The 1st or 2nd Level Support should handle simple, standard requests using guidelines provided by the DevOps team.
Complex or non-standard requests must be delegated to the DevOps team.

Change Management

Like Configuration Management, this is another ITIL process that becomes mostly obsolete when we truly practice DevOps. ITIL defines this process as a kind of gatekeeper of the IT environment. Its primary mission is to maximize the number of successful IT changes by ensuring that risks are properly assessed and changes are authorized before they happen. Unfortunately, in the past this led to more and more formal processes, checkpoints, gateways, and paperwork. All of this mostly slowed down the delivery of new features to users and created silos and barriers, mainly between developers and operations. When a company is really ready to introduce a DevOps culture, many of the formal rules vanish.

All changes, including new features, infrastructure, or config changes, are treated as code changes (everything-as-code), passing through the agile development in the team and CI/CD pipelines with automated tests.
However, some truly important formal procedural steps should be retained. This is especially the case if the deployment has an impact on the users' perception or some risks must be mitigated. A pragmatic view on these process steps is required.
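A pragmatic gate of this kind can be sketched as a small decision function: most changes flow through the pipeline automatically, and only user-visible or risky ones wait for an explicit approval. The attributes and the three outcomes are assumptions for illustration, not an ITIL or DevOps standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    description: str
    tests_passed: bool   # result of the automated CI/CD test stages
    user_visible: bool   # does the deployment affect the users' perception?
    high_risk: bool      # e.g. data migration or irreversible step


def decide(change: Change) -> str:
    """Return "rejected", "needs_approval" or "auto_deploy"."""
    if not change.tests_passed:
        return "rejected"
    if change.user_visible or change.high_risk:
        return "needs_approval"
    return "auto_deploy"


print(decide(Change("bump log level", True, False, False)))
print(decide(Change("new checkout flow", True, True, False)))
```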

Release Management

Release Management ensures that new or modified IT services are planned, developed, and successfully deployed to the production environment. When we practice DevOps, this integrated view is split into two decoupled activities, deployment and release:

A deployment is the technical process of delivering software to one of the target environments (dev, test, qa, prod...). Deployments should happen continuously in small pieces, ideally in the background (e.g., via dark launching) without user disruption. Only in this way does the DevOps team get valuable feedback from users quickly.
A "release" is the logical provisioning of software, new features, etc., to users. Ideally, the technical deployment has already been done, as described above. During a release, only small activation tasks might be needed, like config changes or ingress adjustments. Release Management is thus reduced to bundling feature sets while organizing a product roadmap and external communication.
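The difference between the two can be illustrated with a feature flag: the new code path is deployed but stays dark until the flag is switched on. This is a minimal in-memory sketch; the class, the flag name, and the page function are hypothetical, and a real system would typically read flags from configuration or a flag service.

```python
class FeatureFlags:
    """In-memory flag store, used here for illustration only."""

    def __init__(self) -> None:
        self._released: set[str] = set()

    def release(self, feature: str) -> None:
        # The "release" step: a small activation, no new deployment.
        self._released.add(feature)

    def is_released(self, feature: str) -> bool:
        return feature in self._released


def checkout_page(flags: FeatureFlags) -> str:
    # Both code paths are already deployed; the flag decides what users see.
    if flags.is_released("new_checkout"):
        return "new checkout"
    return "old checkout"


flags = FeatureFlags()
before = checkout_page(flags)   # deployed, but not yet released
flags.release("new_checkout")   # release decision by the Product Owner
after = checkout_page(flags)
print(before, "->", after)
```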

What does this mean for the DevOps team?

The DevOps team should be responsible for the deployment concept, for preparing the toolchain that executes deployments, and for providing sufficient information if the actual deployments are done by an external organization.
It's clear that the Product Owner is responsible for the actual Release process.

Operating Target Environments

Finally, we must ask which party should "operate" the target environments where the software actually runs. We touched on this only briefly in Incident Management. In the "old" world, the classic answer was: there is an operations department taking care of these environments. As a result, again, we have silos and borders. By the way, these borders are the reason why DevOps was invented more than 10 years ago. So, let's invest some final minutes to ask how it could work in a DevOps world.

Again, ideally, the DevOps team has the resources and knowledge to deal with the target envs too. Then, the team can optimally combine software, infrastructure, deployments, and operations without any borders, gaps, or quarrels.

But in most cases (e.g. due to company restrictions), the target envs are operated by an external team. This is fine, but then...

the DevOps team remains responsible for the infrastructure, deployment, and observability concept.
the DevOps team provides all the documentation and instructions required to enable the external "ops" team.
the DevOps team must get transparent and fast access to all configuration data, logs, metrics, and other relevant data. This is mandatory for the feedback loop and quick support in case of issues.

It's also a good approach to distinguish between dev and prod environments. Even if the operation of the prod target is externally delegated, the non-prod targets, especially dev environments, should still be maintained by the DevOps team itself. Even a shared operation is possible. See the picture below for more details.

Summary Diagram

Final Note

As an alternative framework, Google's Site Reliability Engineering (SRE) is often cited as a highly effective best practice for bridging the gap between DevOps and ITIL/Operations.
