Leveling up your Ki: Achieving Super Saiyan DevOps mode

April 28, 2022 Michael Marchese

In my career, I've managed 3 different SRE groups and have been an IC in even more, especially if you include the "Systems Engineer" title. All of these were at different stages of their DevOps and SRE journeys. What I've discovered is that all of these companies have had different needs, different resources, and different cultures. Cracking open the good ol' Google SRE handbook has helped me only marginally in many situations, especially in the beginning stages of establishing a new SRE team. Even the journey to create your first SLI may include a series of introductory meetings, hours of organizing disjointed metrics, logs, and various other bodies of data, as well as simply learning what the products are designed to do.

Because of this, I've devised a way to figure out where your organization currently stands in terms of operational maturity. If you know where you are, you can get an idea of where you need to go. This is only a framework, and you may find that some aspects of your organization are further along the road toward operational Shangri-La, and some are not. And for fun, we will frame this journey in terms of Dragon Ball's Goku and his various forms.

Kid Goku

There are many great things to enjoy at this stage. Like Kid Goku, you have limitless potential. You can choose many different directions to go in terms of processes, tools, methodologies and frameworks. Also, like Kid Goku, you probably move really fast but make a lot of mistakes. Luckily, this is usually at a company's startup phase and many of the risks are acceptable in order to move quickly. Your only friends are Krillin and Yamcha (kind of), but you have many mentors. In this case, your mentors are the loads of SREs in the various DevOps communities. Be social, meet your Master Roshi and learn the Kamehameha.

Key indicators and behaviors:

Manual or slightly automated release processes with few or no gates.
Very little or no QA testing.
Few or no metrics being gathered to gauge uptime or performance. Very little visibility.
Logs only used for troubleshooting purposes.
Little or no documentation.
No formal process for incident management.
Small team of 1-3 people in charge of maintaining all environments.
Products may not be customer facing at this point, or they are still in beta. Focus is simply on "making it work".

Teen Goku

By now, you've overcome many obstacles. You've fought some difficult battles against some monstrous foes and have lost some people along the way. You've undergone some intense training and you feel that you're ready to take on anything. However, just when you think that you've got everything under control, you have to fight Piccolo Jr. and you realize that you aren't as strong as you thought that you were and that you will definitely need to train more.

It's important to realize that when a company is early in its life, it may pivot unexpectedly and possibly experience rapid growth spurts. What worked yesterday may not work today, so it's important to remain ever-vigilant and not take any moments of quiet for granted. Keep training, because you never know when you may have to face your next big challenge.

Key indicators and behaviors:

You have some monitors and dashboards that show you if your applications are up or down.
You are not yet tracking uptime percentages or any other SLIs.
You have not created any error budgets.
You are now starting to generate metrics from sources such as logs and OOB monitoring tools and integrations.
You do not have very good tagging coverage for your metrics.
You have some documentation, but it is not organized into runbooks.
You may have a larger support team, but it tends to be made up mostly of short-term contractors. Knowledge is lost each time a contractor leaves the company.
You still lack formal processes for incidents, but you may have an informal process that everyone follows.
You have a CI tool in place, but everyone is using it in different ways, with no formal release process.
Products are now public, and the company is growing quickly. Products tend to change scope and importance quickly, so your priorities may change from day to day. There is a lot of context switching due to lack of project management.
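
To make the uptime percentage and error budget from the list above a bit more concrete, here is a minimal sketch of a first availability SLI. It assumes you already collect simple up/down check results (for example, from a monitor that probes your application). The names `AvailabilityReport`, `availability_report`, and `slo_target` are hypothetical and chosen for illustration only, and a real setup would more likely compute this inside your monitoring tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AvailabilityReport:
    sli: float               # fraction of successful checks, 0.0 to 1.0
    budget_total: float      # failed checks the SLO tolerates in this window
    budget_remaining: float  # failed checks left before the SLO is breached


def availability_report(check_results: list[bool], slo_target: float) -> AvailabilityReport:
    """Summarize up/down checks against an availability target.

    check_results: one entry per check in the window, True if the check passed.
    slo_target: desired fraction of passing checks, e.g. 0.99 (an assumed value).
    """
    if not check_results:
        raise ValueError("need at least one check result")
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")

    total = len(check_results)
    failed = total - sum(check_results)

    # The error budget is the share of checks that are allowed to fail.
    budget_total = (1.0 - slo_target) * total
    return AvailabilityReport(
        sli=(total - failed) / total,
        budget_total=budget_total,
        budget_remaining=budget_total - failed,
    )


if __name__ == "__main__":
    # Hypothetical window: 1000 checks, 2 of which failed, against a 99% target.
    report = availability_report([True] * 998 + [False] * 2, slo_target=0.99)
    print(f"SLI: {report.sli:.3%}, budget remaining: {report.budget_remaining:.1f} checks")
```

A negative `budget_remaining` means the window has already breached the target, which is usually the signal to slow down releases and focus on reliability work.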

Adult Goku

You've finally made it: You have a wife and child, you've defeated all of your enemies, you've brought all of your friends back with the Dragon Balls, and you can finally sit back and relax a bit. That is until new threats emerge! Luckily you now have many powerful friends and tons of experience. However, you are facing some of your toughest challenges to date. You must devise new techniques in order to prevail.

As a company grows and becomes more successful, its products tend not only to grow in complexity but also to have a larger impact when they break. Now is not the time to grow complacent! This may be the most difficult time for support organizations, since they can be hit by severe outages that cost the company a lot of time and money. You have to start using new tools in order to make gains. Post-mortems, chaos engineering experiments, and other advanced tools will be your Solar Flare at this point.

Key indicators and behaviors:

You have robust monitors and dashboards.
You have a formal process for releases, and this has minimized issues. However, human interaction is still required.
QA testing is done and a sign-off is required from QA in order to release to production.
You have a proper team of 3-5 SREs (possibly more depending on requirements).
Your logs are primarily used for metrics, and you only look at the text when doing in-depth troubleshooting.
You have tagging in place, which helps you find things more easily. However, there is no good way to correlate metrics through a tagging standard.
You have runbooks for each product. However, there is still tribal knowledge on how to find and use the runbooks.
You have identified SLIs and have created SLOs for all products. This is in a dashboard used by the support org.
You hold post-mortem meetings after incidents and create action items. However, there is not sufficient tracking of action items and they sometimes get swept under the rug.

Super Saiyan Goku

After many trials and tribulations, you have finally achieved your ultimate form. Although you still face many strong foes, you are confident that you can defeat them. You have even more friends that you train with in order to make sure that everyone around you is able to handle any enemy that you may face.

At this phase, you should be holding game days, conducting chaos engineering experiments regularly, and building your own tools to fine-tune the user experience, tools that you then contribute back to the open source community. You are confident in your team's abilities, so you have a public-facing dashboard that displays your SLOs, as well as the status of any ongoing incidents. You're finally able to take what you've learned and give back to the community, thus protecting all of us from the androids and aliens that attack us at every turn!

Key indicators and behaviors:

You have externally facing dashboards that display your key metrics to your end users.
You conduct game days and run chaos engineering experiments.
You have automated incident remediation, put in place as a result of action items from post-mortem meetings held after outages.
All action items from post-mortem meetings are tracked and an incident is not closed until all action items are complete.
You have automated testing in your CI/CD pipelines, thus reducing human error in your production releases.
Your runbooks are part of a formal incident management process and are updated after every discovery.
You have a single-pane-of-glass view of all applications, facilitated by proper tag management.
Products are well-defined, and work on them is managed by a project manager.
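
One of the indicators above, that an incident is not closed until every post-mortem action item is complete, is easy to enforce in tooling. The sketch below shows one possible shape for that gate. It is only an illustration: the `Incident` and `ActionItem` classes are hypothetical, and in practice this rule would usually live in whatever ticketing or incident management system you already use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ActionItem:
    title: str
    done: bool = False


@dataclass
class Incident:
    name: str
    action_items: list[ActionItem] = field(default_factory=list)
    closed: bool = False

    def open_items(self) -> list[ActionItem]:
        """Return the action items that are still outstanding."""
        return [item for item in self.action_items if not item.done]

    def close(self) -> None:
        """Close the incident, refusing while any action item is still open."""
        remaining = self.open_items()
        if remaining:
            titles = ", ".join(item.title for item in remaining)
            raise RuntimeError(f"cannot close {self.name}: open action items: {titles}")
        self.closed = True


if __name__ == "__main__":
    # Hypothetical incident with one finished and one unfinished action item.
    incident = Incident(
        name="checkout-outage",
        action_items=[
            ActionItem("add disk space alert", done=True),
            ActionItem("automate log rotation"),
        ],
    )
    try:
        incident.close()
    except RuntimeError as err:
        print(err)  # the unfinished item blocks the close

    incident.action_items[1].done = True
    incident.close()
    print(incident.closed)
```

Raising an error rather than silently ignoring the close request keeps action items from getting swept under the rug, which is the gap called out in the Adult Goku stage.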