SRE and ITIL

Feb 2

An often overlooked part of incident management is tracking information about the incident from the beginning of the outage to the completion of the last post-mortem action item. As a result, lots of knowledge tends to be lost, and fixes tend to get swept under the rug. This tends to be more of an issue with smaller companies that are experiencing rapid growth. However, even enterprises can overlook some of these issues, since they tend to lose importance once an outage is mitigated and large projects continue to drive deadlines. There are several methods to address this issue, but let's focus on the ITIL framework.

Problems

Confusion

When an outage or incident occurs, usually someone is paged or alerted in some way. This could be a seasoned SRE or a developer who has never worked in a support role before. Either way, they need to find the most effective way to mitigate the issue as quickly as possible. One of the first things that this person should look for is some sort of documentation that shows how to respond to the signals that the monitoring/alerting software is sending. In other words, how do you react to the messages in the communication that you received? Your incident management protocol should point you to a location to find the required information. Do you have a template for what sort of information is provided? Is there just a list of known issues and ways to fix them, or are there detailed architecture diagrams that show service dependencies and might allow you to troubleshoot issues that may be fringe cases? The more information, the better, especially when there is an emergency. This information should be centralized, easy to find, easy to read, and provided to you as early as possible, preferably from your incident management tool suite.

Another source of confusion comes from times when an incident or outage is reported by something other than a monitor, also known as a human. Without a proper system in place for a person to provide the information needed to properly triage an issue, it is often submitted through less formal channels, such as an email or Slack. A report made this way usually lacks context and is often sent to the wrong person or group of people. It can miss its mark completely and never get looked at, or it can deliver incorrect or incomplete information to the support organization, which just creates confusion.
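One way to picture "a proper system" for human-submitted reports is an intake form that refuses to pass along a report until the triage-critical fields are filled in. The sketch below is only an illustration: the `IncidentReport` fields and the `missing_fields` helper are assumptions made for this example, not part of ITIL or of any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentReport:
    """Hypothetical intake form for a human-reported incident."""
    reporter: str
    service: str      # which service appears to be affected
    summary: str      # one-line description of the symptom
    impact: str       # who or what is affected, and how badly
    first_seen: str   # when the reporter first noticed the issue


def missing_fields(report: IncidentReport) -> list[str]:
    """Return the names of fields the reporter left blank."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]


report = IncidentReport(
    reporter="jdoe",
    service="checkout-api",
    summary="Payments time out after 30 seconds",
    impact="",
    first_seen="",
)
print(missing_fields(report))  # ['impact', 'first_seen']
```

A form like this could sit behind a chat command or a web page; the point is that the missing context is requested from the reporter up front rather than discovered by the responder mid-incident.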

Lack of Organization

Most organizations have SOME form of documentation. Oftentimes, it is lacking. When it does exist, it is usually written hastily and in a haphazard fashion that lacks any unified form or structure. It is also usually not correlated with all of the services that it applies to, just the one category that the author supports. As a result, it can become difficult to find the one piece of information required to solve an issue or mitigate an incident.

Another example of a lack of organization is how various types of work are organized. Since SRE work can be either project-based or interrupt-driven, it needs to be handled differently. This is usually completely different from how development teams work, which is usually in 2-4 week sprints, with all of the other ornaments that come with Agile methodology. As a result, action items that get created do not get prioritized properly, and when a scrum team does try to prioritize them, there is no context for why the action item was created because there is no reference to the original outage. Because of this, these items tend to get pushed back until more context and details can be provided, which means that they will likely get a low priority or get kicked back to another team. They will only be remembered when the incident happens again, and even then the cycle may repeat, since there is no process in place to track the issue and properly prioritize it.

Loss of Knowledge and Information

During an outage, everyone's priority is to resolve the issue as quickly as possible. Every second that the issue persists not only costs money, but also damages credibility, increases burnout for the people involved, and has many other detrimental effects that may not be immediately seen. If you have a proper incident management process in place, with proper roles assigned, proper communication channels established, and other SRE best practices being followed, then you will be able to obtain a wealth of information from the incident. However, if this information is not stored, categorized, and attributed to the proper services, teams, and infrastructure, it could end up in a nebulous black hole of an incident management report somewhere that might lack context. It may not even end up in any proper documentation or runbook.

There is also a lot of metadata that can be observed from outages that can help drive larger decisions to increase stability. If you step back and look at a bigger picture, you may discover a weakness that you didn't realize that you had. For instance, let's say that you had a series of outages regarding latency across a bunch of different microservices within your on-prem datacenter. You may look at your network hardware, or create a support ticket with your ISP or your datacenter. But what if it was something simpler, maybe an issue with the hardware on a certain node? If you were able to easily correlate similar information between incidents, as well as an increasing frequency of incidents occurring with that node, it might point you to a degrading piece of hardware within a physical node. However, without a proper method to track this metadata, you would have a much harder time figuring this out.
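The node example can be made concrete with a small sketch. It assumes each incident record carries the week it occurred, the affected service, and the node the service ran on; the record layout, service names, and node names are all made up for illustration. Grouping by node instead of by service is what surfaces the pattern.

```python
from collections import Counter, defaultdict

# Hypothetical incident records: (ISO week, affected service, node it ran on)
incidents = [
    ("2024-W01", "search", "node-07"),
    ("2024-W02", "cart", "node-03"),
    ("2024-W03", "auth", "node-07"),
    ("2024-W03", "search", "node-07"),
    ("2024-W04", "cart", "node-07"),
    ("2024-W04", "auth", "node-07"),
    ("2024-W04", "search", "node-07"),
]


def incidents_per_node(records):
    """Count incidents by node, regardless of which service reported them."""
    return Counter(node for _, _, node in records)


def weekly_trend(records, node):
    """Return the per-week incident counts for one node, in week order."""
    per_week = defaultdict(int)
    for week, _, rec_node in records:
        if rec_node == node:
            per_week[week] += 1
    return [per_week[week] for week in sorted(per_week)]


worst_node, total = incidents_per_node(incidents).most_common(1)[0]
print(worst_node, total, weekly_trend(incidents, worst_node))
# node-07 6 [1, 2, 3]
```

Viewed per service, these look like unrelated latency problems in search, cart, and auth. Viewed per node, one machine accounts for six of the seven incidents, with a rising weekly count.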

Solutions

ITIL v4

ITIL v3 was released in 2007 and updated in 2011. At the time, DevOps was relatively new, and thus it was not a well-defined practice. Because of this, ITIL was a very IT Ops focused framework. The framework was not updated again until the release of ITIL v4 in 2019. Needless to say, v3 did not age well. Organizations that strictly followed ITIL required many different groups to approve releases before they could go out. This was usually done in the name of improving stability. With the introduction of DevOps, ITIL in this form quickly became obsolete and was abandoned in favor of the leaner, faster DevOps style. SRE became a discipline as well, and the methods outlined in the Google SRE publications became standard for improving stability and reliability.

Enter ITIL v4. Taking pages from many of the modern books surrounding DevOps and SRE, ITIL has evolved. The main changes are the adoption of the concept of value streams, as well as Lean, Agile, and DevOps methodologies. This puts the framework in a much better place to guide tech orgs in the modern technology ecosystem. There are a lot of different categories in this framework and a lot of information that relates to SRE, but we are going to focus on 3: incident management, problem management, and knowledge management.

Incident Management

In ITIL, incidents are defined as "an unplanned interruption to an IT service or reduction in the quality of a service". Therefore, incident management is the process of managing the lifecycle of an incident. A key term here is lifecycle. This process goes as follows:

Identification - signals come from monitoring or from humans that something is broken
Logging - some record is created regarding the details of what is broken
Categorization - the service and the responsible team for the incident are identified
Prioritization - the impact and severity of the incident are assessed (triage)
Response - the incident is investigated, worked on and mitigated

This should be a well documented process, and you should have effective tools in place that work in a streamlined fashion to make the mitigation of incidents as quick and painless as possible. It should also have as much automation as possible, as the movement between these steps in the lifecycle should involve as little manual work as possible. Lastly, a ticket should be created that acts as a parent to any other chunks of work or information that stem from the incident.
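As a rough sketch of how the five lifecycle stages and the parent ticket might be modeled, the example below treats the stages as an ordered enum and lets follow-up work items inherit the parent incident's ID. The class names, the ID format, and the rule that stages cannot be skipped are assumptions for illustration and are not prescribed by ITIL.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    LOGGED = 2
    CATEGORIZED = 3
    PRIORITIZED = 4
    RESPONDING = 5


@dataclass
class Incident:
    """Parent ticket; every follow-up item keeps a reference back to it."""
    incident_id: str
    summary: str
    stage: Stage = Stage.IDENTIFIED
    children: list = field(default_factory=list)

    def advance(self) -> Stage:
        """Move to the next lifecycle stage; stages cannot be skipped."""
        if self.stage is Stage.RESPONDING:
            raise ValueError("incident is already in the final stage")
        self.stage = Stage(self.stage.value + 1)
        return self.stage

    def add_child(self, title: str) -> str:
        """Create a child work item whose ID embeds the parent incident ID."""
        child_id = f"{self.incident_id}-{len(self.children) + 1}"
        self.children.append((child_id, title))
        return child_id


inc = Incident("INC-1", "Elevated latency on checkout")
while inc.stage is not Stage.RESPONDING:
    inc.advance()
print(inc.stage.name, inc.add_child("Add latency alert runbook"))
# RESPONDING INC-1-1
```

Because each child ID embeds the parent ID, anyone who later picks up the work item can trace it back to the outage that created it.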

Problem Management

One of the main functions of SRE is to reduce toil. This is where effective problem management comes in. This is the set of processes that take over once an incident has been mitigated. An incident can only be considered resolved once you have discovered the underlying cause and put a permanent fix in place. ITIL outlines the following stages in problem management:

Problem Identification - This can and should be done during incident response. It involves identifying and logging the suspected root cause, but at this point only as hypotheses and ideas. These can be tagged in your incident management software for further analysis later.
Problem Control - This is usually your post-mortem analysis, where you take any information that might have been recorded and review it, along with performing a formal root cause analysis. This stage also includes putting workarounds in place and sending out communications around the identification of the root cause.
Error Control - This stage is where you put permanent fixes in place. Tickets get created, assigned to dev teams, and worked into sprints as tech debt. Follow-ups need to be done to ensure that these items are completed in a timely manner. Also, retrospectives might take place here to help improve the overall process.

As with incident management, this process should also be well documented and automated. It would be especially useful to have some automation in the error control step to send out reminders about action items. It's also helpful to have your incident management platform set up to automatically add important information to some central place for analysis during problem management.
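The reminder automation mentioned for the error control step could look something like the following minimal sketch. It assumes action items are stored with an owner, a due date, and a link to the parent incident; the `ActionItem` fields and the message wording are hypothetical, and actually delivering the reminder (chat, email, ticket comment) is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical follow-up ticket that keeps a link to its parent incident."""
    title: str
    incident_id: str   # reference back to the original outage for context
    owner: str
    due: date
    done: bool = False


def overdue_items(items, today):
    """Return open action items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


def reminder_text(item, today):
    """Build a reminder that carries the incident reference along with it."""
    days_late = (today - item.due).days
    return (f"{item.owner}: '{item.title}' (from {item.incident_id}) "
            f"is {days_late} day(s) overdue")


items = [
    ActionItem("Replace failing NIC", "INC-1", "infra-team", date(2024, 3, 1)),
    ActionItem("Add retry budget", "INC-1", "checkout-team", date(2024, 3, 20)),
    ActionItem("Update runbook", "INC-2", "sre-team", date(2024, 3, 5), done=True),
]

today = date(2024, 3, 11)  # fixed date so the example is reproducible
for item in overdue_items(items, today):
    print(reminder_text(item, today))
# infra-team: 'Replace failing NIC' (from INC-1) is 10 day(s) overdue
```

Including the incident ID in every reminder keeps the context attached to the action item when it lands in a team's backlog.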

Knowledge Management

The culmination of incident management and problem management in regards to SRE is knowledge management. Consider the amount of information that we have gathered from these other processes. This is all very useful, but if you aren't able to curate, store and deliver this information to the right people at the right time, then you have done very little to increase stability and decrease toil. You still have issues with confusion and disorganization if all of these separate bits of knowledge are not handled correctly. The ITIL knowledge management practice provides an excellent framework for tackling this. It categorizes knowledge into 3 categories:

Tacit Knowledge - This is what we usually call "tribal knowledge", or knowledge that is not documented. This is almost a curse word in SRE and DevOps.
Implicit Knowledge - This is like tribal knowledge, but it may be written down somewhere. However, it is generally just known as "the way that we've always done it" and is not really organized or linked in a way that makes it easy to access or draw context from.
Explicit Knowledge - This is organized, categorized and codified knowledge. It is all found in one well known place and is set up in a way that it is easily accessible and is linked to a relevant service. This is what we would usually call a "knowledge base", a "runbook", or some other documentation of the sort. This is where you want to get to.

After any incident is resolved (has been mitigated and has completed the problem management process), there should always be something added to your knowledge management system. If there is not, this means that you have learned nothing new, which is impossible since you figured out how to mitigate the issue and have put a permanent fix in place. Even if this means that you shut the entire service down, there is going to be some knowledge that needs to be recorded, as you never know when you may need to recall some historical reference to that service. This knowledge base, runbook, or whatever you wish to call it, should be linked to the service in your incident management tools, thus completing the cycle of learning. This may include linking runbooks in your monitor comms, adding your knowledge base articles into your incident management software, or ideally using the information gained to work with other teams to come up with a way to automate toil.
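Linking runbooks into monitor communications can be sketched as a lookup that runs before an alert is sent. In the example below, the index keyed by service and alert name, the `kb/...` paths, and the `knowledge_gap` flag are all assumptions made for illustration; a real setup would depend on the monitoring and knowledge base tools in use.

```python
# Hypothetical knowledge base index: (service, alert name) -> runbook location.
RUNBOOKS = {
    ("checkout-api", "HighLatency"): "kb/checkout-api/high-latency",
    ("checkout-api", "ErrorRate"): "kb/checkout-api/error-rate",
}


def enrich_alert(alert: dict, runbooks: dict) -> dict:
    """Return a copy of the alert with a runbook link, or a flag that none exists."""
    key = (alert["service"], alert["name"])
    enriched = dict(alert)
    enriched["runbook"] = runbooks.get(key)
    # A missing entry is a knowledge gap worth fixing after the incident.
    enriched["knowledge_gap"] = enriched["runbook"] is None
    return enriched


known = enrich_alert({"service": "checkout-api", "name": "HighLatency"}, RUNBOOKS)
unknown = enrich_alert({"service": "search", "name": "DiskFull"}, RUNBOOKS)
print(known["runbook"], unknown["knowledge_gap"])
# kb/checkout-api/high-latency True
```

Alerts flagged with a knowledge gap give a natural to-do list for the knowledge management step: each one marks a signal that fired without any explicit knowledge attached to it.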

Summary

Incidents are a stressful but inevitable part of operating any system. When you have moving parts in a system, it will eventually break, and it will do so more frequently as you add more complexity. Since we are armed with this information, we are able to put systems in place to make incidents less painful. However, we need to iterate on our mitigation strategies each and every time there is an incident, or we will lose the small comfort that we gained by putting our incident management strategy in place. As we do this more and more, organization becomes a key tool in our improvement, and ITIL is a great framework to help guide us forward.
