Mike Marchese

Build vs Buy - It’s Never So Simple

I’ve implemented monitoring, alerting, and incident management platforms for several companies of various sizes, and every time the same question comes up: do we spend money on a monitoring/alerting/incident management platform, or do we build one ourselves? And what exactly does that mean? It varies from place to place. Some feel that “buying” means finding a one-platform-fits-all solution, some adopt a variety of tools that integrate together, and some want to build things completely from scratch… and everything in between. I’d like to share my experience doing this in companies with very different cultures, budgets, and requirements. I hope this helps SREs, EMs, and general decision-makers in this process.

Key Factors

I’ve found that the following key factors go into the decision on your monitoring/alerting/IM tooling:

  • Budget

  • Requirements

  • Culture

  • Pre-existing tools

  • Third-party integrations

Budget

Before you can even begin looking for a solution, you need to understand how much money you have to spend on your project. If your budget is zero, you would naturally look at open source solutions. However, this is not always the best decision. No work costs zero dollars. Even if you are a non-profit, work costs money. Even if you have volunteers, there is an opportunity cost. Open source solutions require sweat equity: an engineer needs to spend time implementing them, and open source tools require lots of work. Although you are not paying licensing costs, you need something to run them on, you need time to install and configure them, and this is often a very custom implementation. Also, open source tools have bugs. You are at the mercy of the community to patch them, or you can do it yourself.

If you decide to pay for a solution, there is still work involved. Many SaaS services offer lots of capabilities. Many of them offer all-in-one solutions for monitoring/alerting/IM, but some of them specialize in one facet or another, and you need to make sure that they play well together. You may find some cost savings this way, but understand that separate tools will likely not work together as well as an all-in-one solution. You will pay a premium for the all-in-one solution. This is where it is important to understand the requirements.

Requirements

After you decide on a budget, it is time to start shopping, right? NO. You need to know exactly what you need. Different tools offer different solutions. At this stage, it’s important to understand your users and your needs. This needs to be a conversation that involves almost every department in the company. Remember, your customers will be both internal and external. You will want to monitor not only your production infrastructure, but also your lower environments, your application performance, logs, error reporting, etc. Also, who is going to visualize these metrics? Is it just going to be your internal stakeholders, or do you want to put up a dashboard that is exposed on your corporate home page? Do you need to send weekly reports to your C-suite? Does your marketing team want A/B metrics from changes done by your product team? Does your CISO want weekly reports of vulnerabilities in your container images?

Culture

Company culture plays a role in the decision-making process as well. A company may have a culture of open source only, since they may produce open source tools themselves and want to dog-food them. This usually makes the decision much easier, as they usually understand the costs involved with open source. As an engineer, this gives you room to experiment, make mistakes, customize, write code, and really produce some great work. That being said, you are ultimately responsible for the final product. That means that when something doesn’t work right, you are the one responsible for fixing it. You usually have no support number to call, no trouble ticket to create, and will likely be spending many hours sorting through other engineers’ code to find the issue, fixing it, submitting a bug report, and going through the process of merging your patch. Or at least, that is what you should be doing as a responsible member/user of the open source community.

On the other hand, you may work for a company that is all about spending man-hours on improving the product and has no interest in maintaining servers and custom implementations. This may sound empowering at first, but you will quickly find it overwhelming. There are SO MANY tools out there. Also, you will find lots and lots of overlap between these tools. Remember, all of these companies are striving to make their tool the best, most comprehensive suite of software around. This is a good time to look back at your requirements and prioritize them. All of these SaaS providers started out with one product in mind and usually expanded to include different functionality. Therefore, they are usually best at the thing that they started out with. If your company is security focused, find a vendor that started as a security monitoring company. If it is stability focused, find the one that started as an incident management company. If it is performance focused… you get where I’m going with this. What you may end up with is a mix of tools, which is OK.

Finally, you may work at a huge corporation with near-infinite sums of money to spend on this project. Congratulations. Now you will be the belle of the ball. Reach out to various big-name vendors and let them know what you’re looking for. Ask them for swag. Again, requirements are important. Usually, they will tailor the solution to your needs. Be sure to work with their solutions engineers; you will get the biggest bang for your buck. People in these sorts of companies are used to getting a premium experience as quickly as possible. That’s where automation comes in handy. I recommend using a CI/CD tool that integrates with your helpdesk system so that new monitors/dashboards/alerts/etc. can be deployed quickly with as little manual intervention as possible. Companies of this size usually have requirements large enough that manual provisioning would be painfully slow. They are not paying for that; they are paying for a premium experience. It’s ultimately your job to provide that.
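As a rough illustration of what “monitors as code” can look like, here is a minimal Go sketch of a CI/CD step that reads a monitor definition from the repo and pushes it to a vendor’s HTTP API. The endpoint path, payload file, and environment variables are hypothetical placeholders; most SaaS monitoring platforms expose some equivalent API or a Terraform provider, so treat this as a sketch, not a specific vendor’s integration.

```go
// monitor_deploy.go - a minimal sketch of a CI/CD step that pushes a monitor
// definition to a monitoring vendor's API. The endpoint, payload shape, and
// environment variables are hypothetical; substitute your vendor's real API.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// The monitor definition lives in the repo, so changes go through code review.
	body, err := os.ReadFile("monitors/checkout-latency.json")
	if err != nil {
		log.Fatalf("reading monitor definition: %v", err)
	}

	// Hypothetical vendor endpoint and API key, injected by the CI system.
	req, err := http.NewRequest(http.MethodPost,
		os.Getenv("MONITORING_API_URL")+"/v1/monitors",
		bytes.NewReader(body))
	if err != nil {
		log.Fatalf("building request: %v", err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("MONITORING_API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalf("calling monitoring API: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 {
		log.Fatalf("monitor deploy failed: %s", resp.Status)
	}
	fmt.Println("monitor deployed:", resp.Status)
}
```

Wire this into the pipeline that runs when a helpdesk ticket is approved, and a new monitor goes live without anyone clicking around in a UI.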

Pre-existing Tools

This topic sort of fits into company culture, but I feel like it is its own category. It tends to be especially important when joining a company that is moving from a young startup into a medium-sized company, usually around a Series C funding round. You will find that there are already tools used by your dev teams that are integral to their operations. You may find some of these methods absurd. However, fighting it is futile. It’s also counter-productive. It’s easier to ride the wave than to fight against the tide in this case. Your best bet is to find a way to integrate them into your new tool suite. Almost all modern SaaS tools integrate with other SaaS tools to some extent. And if it happens to be a series of bash scripts, so be it. Be creative. Throw it into a container and write a Kubernetes manifest, maybe a Helm chart… do what you can to elevate their experience. You can do a lot with Prometheus exporters and a little Go. Remember, your goal is to increase velocity and improve the value stream.
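To make that last point concrete, here is a minimal sketch of a custom Prometheus exporter in Go using the official client_golang library. The metric name, port, and the legacy script it wraps are placeholders I made up for illustration; the point is how little code it takes to make an existing bash check scrapeable.

```go
// A minimal custom Prometheus exporter wrapping a legacy check script.
// The script path, metric name, and port are hypothetical placeholders.
package main

import (
	"log"
	"net/http"
	"os/exec"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var legacyCheckUp = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "legacy_check_up",
	Help: "1 if the legacy check script exited cleanly, 0 otherwise.",
})

func runLegacyCheck() {
	// Wrap the existing bash script instead of rewriting it.
	if err := exec.Command("/opt/scripts/legacy_check.sh").Run(); err != nil {
		legacyCheckUp.Set(0)
		return
	}
	legacyCheckUp.Set(1)
}

func main() {
	prometheus.MustRegister(legacyCheckUp)

	// Refresh the metric whenever Prometheus scrapes /metrics.
	http.Handle("/metrics", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		runLegacyCheck()
		promhttp.Handler().ServeHTTP(w, r)
	}))
	log.Fatal(http.ListenAndServe(":9105", nil))
}
```

Point a scrape job at :9105/metrics and the old script suddenly gets graphs, alerts, and history like everything else in your stack.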

Third-Party Integrations

I mentioned that MOST SaaS tools integrate well with other SaaS tools. However, this is not always the case. I’ve run into situations where some suites didn’t integrate well with their OWN tools. The best way to approach this is to create a workflow chart of how you expect a failure to go. Start with the origin: something breaking. How does this happen? Is it a code change? A hardware failure? A security breach? What are the signals of this? Ideally, it is discovered by your monitoring tools, but if you have a gap in monitoring, a developer or someone in marketing or sales may find it. How do they report it? Worst case scenario, your CEO finds it. Even worse than that, a customer finds it. YES, that is worse. If your CEO finds it, you may be able to squirm out of that one, but if a customer finds it, they will likely take to Twitter and report it, and THAT IS BAD. Heads will roll. But I digress.

Once you create your workflow, you can identify the tools that will handle each step of the IM process, from the failure to the patch. You want to make sure that all of the tools you choose integrate seamlessly to give you the shortest MTT* possible. This means having the least amount of clicking around to make things happen. You should also consider ways of keeping all parties informed of what is happening. This could involve Slack, Discord, external dashboards, RSS feeds, chatbots, AI tools… there are a lot of things to consider here, and that could be a whole other blog post. Also, consider what happens after mitigation. How are you going to aggregate all of the information gained from the failure? Why did we not get alerted? What passed QA that shouldn’t have? Should we monitor Twitter (you should)? Going through chat logs to find key messages and reconstruct timelines is a laborious process, believe me. A good incident management tool will pay off handsomely during a post-mortem. However, it has to work with whatever communication channels you are using. If you’re using Slack, great. But if you’re not, this is something to consider.
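As one small example of gluing channels together, here is a hedged Go sketch that posts an incident status update to a Slack incoming webhook. The webhook URL comes from an environment variable, and the incident ID and message wording are invented for illustration; they are not any particular IM vendor’s format.

```go
// Post an incident status update to a Slack incoming webhook.
// The webhook URL, incident ID, and message text are illustrative placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

func notifySlack(webhookURL, message string) error {
	// Slack incoming webhooks accept a simple JSON payload with a "text" field.
	payload, err := json.Marshal(map[string]string{"text": message})
	if err != nil {
		return err
	}
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("slack webhook returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Injected by the incident tooling or CI environment.
	webhook := os.Getenv("SLACK_WEBHOOK_URL")
	msg := ":rotating_light: INC-1234 mitigated. Post-mortem doc to follow."
	if err := notifySlack(webhook, msg); err != nil {
		log.Fatal(err)
	}
}
```

The same handful of lines can fan out to Discord, a status page, or an RSS feed; the hard part is deciding who needs to hear what, and when.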

Conclusion

I know that there have been a lot of layoffs at large companies, and there are engineers who have only ever worked for these companies. They were able to use the best tools that money could buy, and now they have to work at smaller companies with smaller budgets, different cultures, requirements, and tools. You may be wading into uncharted waters. Your best friend is communication. Find out who your customers are (internal and external), what your resources are, and what you have to work with. Be creative, be flexible, and work with what you’ve got. I’ve been in public schools, huge media companies, and young startups. They are all different. Build vs buy is bullshit. Even when you buy, there is still building. Oftentimes, the building is trust. There is never a one-size-fits-all solution. Meet people where they are and work with them. If you’re lucky, you can buy, but you will still have to build. Those agents don’t deploy themselves.