Breakthrough

We had an incident today. This one was reported by one of our internal employees who was not an engineer. We had recently purchased a subscription to an incident management platform called Rootly. Up until today, we had done a lot of "kicking the tires", tweaking, configuring, and just generally learning how to use the product. However, my team and I decided today was the day to use it for real.

It just so happened that we had a team meeting earlier where some feedback was shared: during the last incident, we had not sent updates to stakeholders frequently enough, and our handoff to team members who were in local time was non-existent. We realized that, although there was a lot of context in Slack chat, not much of it was indexed in a way where someone could look at a page, get the major points of what had happened, and take over. This was a driving factor in our decision to take Rootly for a spin today.

Let me step back a few years, to an earlier time in my career. I was working for a major media company with a very mature SRE org and an excellent culture for learning. We had regular speakers from the C-suite of large tech corporations, and we were lucky enough to have Alex Solomon from PagerDuty give a talk on "The Anatomy of an Incident". The talk was basically a walkthrough of an outage at PagerDuty. Without going into too much detail, every engineer left that room feeling like they had known nothing about SRE prior to that day. We all had a new take on how to do our jobs, and I felt like I had a new mission. I was going to take these practices and use them everywhere I went.

Fast-forward to today. Since then, I have moved into SRE management. I had never been able to achieve the level of efficiency in incident management that I had learned about on that fateful day, but I had made strides at a couple of different companies. Today was different.

My lead SRE created the incident in our incident channel with a simple slash command. This kicked off a series of automations that created a war room channel and invited some of the relevant team members. I asked my lead to take the incident commander (IC) role, as I was in the middle of a meeting and could not pay enough attention to take it myself. She accepted, and I assigned it by selecting her in a dropdown in the war room channel, then assigned myself the comms role. I asked her to add any relevant information in the channel, and I would update our status page, which would send notifications to the various rooms that stakeholders frequented.

As we worked the ticket, we found that the issue actually involved various other teams, so we added them with a few button clicks. Since Rootly was integrated with PagerDuty, we could page people and automatically add them to the war room. Prior to this, finding the right people to respond was painful. We had to figure out which team managed which parts of our applications, and then figure out who was on call (or even awake, since we have globally distributed teams). Now that we have our services defined in PagerDuty, getting the right people on the call was quick and painless.

The issue turned out to be a repeat of one that we thought we had solved a couple of weeks back, but we had only put a band-aid on the problem. This time, with the correct developers and architects on the call, we were able to identify the actual root cause. While all of this was going on, we were also able to pin important events to our timeline, which updated our status page. All stakeholders were frequently updated, and we had a comprehensive timeline of events, since they were aggregated on our status page.

Action items were defined, and Jira tickets were produced automatically after a really painless post-mortem that happened almost immediately after the call. A fix was pushed out the same day. Everyone was relieved that we would not have to deal with this during the upcoming holiday break.

This was a huge breakthrough in my SRE experience. It wasn't perfect, by any means. We still did a lot of things manually, but we finally saw what a proper incident management process could bring us. SRE has taken on many different definitions: some systems administration, some security ops, some DevOps, and so on. But this was really, for the first time, an occurrence of what I believe true SRE is, and what it can bring to an organization: fixing something that was broken, as quickly and in as organized and efficient a way as possible. And the lessons learned will help us grow not only as an organization, but as engineers. I hope that living the experience I had heard about in that talk so many years ago will inspire my team and everyone involved as much as I was inspired by hearing the story in the third person.
