Bridging SRE and Customer Service

Tech

Feb 13


Many SRE methods, concepts, and principles can be applied across entire industries, in ways that most people have never considered. For instance, I was recently coaching a friend of mine, a head chef, on how he can use post-mortem and retrospective meetings to reduce the number of days his kitchen is “in the weeds”. I have gotten no feedback yet, but he did realize that the methods I suggested would be much more effective than the yelling and finger-pointing that currently happens at the end of those very long and stressful days.

Strangely enough, I have not always seen these principles applied in customer service within companies that deliver software. Since SRE and customer support are both very customer-centric roles, it would make sense for them to be on the same page. Instead, I’ve often observed that customer service and support organizations are not trained in, or even aware of, the methods used to handle incidents. A customer-reported problem is frequently seen as a software bug or an issue with a feature, and it is automatically moved into the product or software team’s queue of work. Since it isn’t seen as an incident, it is not triaged, categorized, and managed the way an outage detected by a monitoring signal, for instance, would be. This increases the MTTR and, since the problem is directly customer-facing, results in a bad customer experience.
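To make the MTTR effect concrete, here is a minimal sketch in Python. The timestamps, the queue delays, and the `mean_time_to_restore` helper are all hypothetical and chosen for illustration. The point is only that the clock starts when the customer is first affected, so time spent waiting in a bug queue counts toward MTTR.

```python
from datetime import datetime, timedelta

def mean_time_to_restore(incidents):
    """Average of (restored - started) over a list of (started, restored) pairs."""
    durations = [restored - started for started, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

# Hypothetical example: the same 30-minute fix, handled two ways.
started = datetime(2024, 1, 1, 9, 0)
fix_time = timedelta(minutes=30)

# Handled as an incident: triaged 10 minutes after the customer report.
as_incident = [(started, started + timedelta(minutes=10) + fix_time)]

# Filed as a bug: waits 2 days in the development queue before anyone looks.
as_bug = [(started, started + timedelta(days=2) + fix_time)]

mttr_incident = mean_time_to_restore(as_incident)
mttr_bug = mean_time_to_restore(as_bug)
print(mttr_incident)  # 0:40:00
print(mttr_bug)       # 2 days, 0:30:00
```

The fix itself takes the same effort in both cases; only the routing differs.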

In this situation, a little training goes a long way. Imagine this scenario: you have a company that makes custom scarves, and you just added a feature that lets customers preview text on their scarf. One day a French football team starts doing really well and quickly becomes a local favorite. Lots of people suddenly want to order scarves with this team's name. When they try to customize the text on the scarf, a long-running database query locks up your database. The request eventually times out, and the order never completes. On top of the database issues, your customers in France are seeing general slowness on the site due to an unplanned surge in traffic.

The customer support person creates a ticket saying that there is an issue with the new feature, and this goes into the development team's queue as an issue/bug. Your monitoring tools might see something lock up, so you kill the query and everything looks OK again. You figure out what the offending part of the query is and put a fix in place. You also scale out your resources in eu-west-3 to handle the traffic and adjust your auto-scaling policies accordingly.

However, next week, there is a game coming up against that team's Belgian rival, and this is HUGE. Your customer support rep knows this because the customer mentioned the upcoming game when they spoke. Now there is a bunch of new orders for another team, and you are having the same issue with your resources since you are getting almost double the traffic you had before. You scale your infrastructure or possibly even deploy to eu-central-1 for more stability. However, you have suffered a second outage, eaten up your error budget, and now you’re “in the weeds”.
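The error-budget arithmetic behind that second outage can be sketched as follows. The 99.9% availability target, the 30-day window, and both outage durations are assumptions chosen for illustration; the scenario above does not come with real numbers.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)

def budget_remaining(slo_target, window_days, outage_minutes):
    """Budget left after subtracting each outage; negative means the budget is spent."""
    return error_budget_minutes(slo_target, window_days) - sum(outage_minutes)

# Assumed SLO: 99.9% availability over 30 days.
budget = error_budget_minutes(0.999, 30)

# Assumed outages: 25 minutes for the first surge, 25 more for the rival's game.
after_first = budget_remaining(0.999, 30, [25])
after_second = budget_remaining(0.999, 30, [25, 25])

print(round(budget, 1))        # 43.2
print(round(after_first, 1))   # 18.2
print(round(after_second, 1))  # -6.8
```

Under these assumptions, one outage is survivable, but a repeat of the same outage a week later pushes the remaining budget below zero.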

Now let us imagine that your customer support rep was part of your incident management process. They would have created an incident rather than a bug, so it would have been triaged just as before, only with more context. The same ticket would be created with the development teams, but with more information about the database issue, which would allow it to be resolved much more easily. Also, customer support is a great conduit between SRE and the end user. They could have been communicating with the customer in real time during the incident, which adds a new level of transparency that people love. Lastly, the customer support person would likely have mentioned during a post-mortem meeting that there was a big game coming up and that it would be a good idea to prepare for the surge in traffic. This would have allowed you to avoid the second outage altogether.

We can better understand the other perspective from an organizational mindset. How do we get technology orgs to have more visibility into the KPIs of the CS org? It is now common practice to get a survey after a support session with most companies. Because of this, it seems that these teams already have SLIs, and I suspect that they also have SLOs. However, tech teams usually only see these numbers in one slide at an all-hands meeting (maybe). Transparency, observability, and better communication are needed in this scenario. If we tear down the silos between CS and SRE, we start to have a direct relationship between the metrics that SRE uses and customer satisfaction.
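As a rough sketch of what that relationship could look like, the snippet below treats post-session survey scores as an SLI and reports it next to a request success rate. The 1-to-5 scale, the rule that a score of 4 or higher counts as satisfied, the 90% target, and all of the sample numbers are assumptions, not figures from any real CS team.

```python
def csat_sli(scores, satisfied_threshold=4):
    """Fraction of survey responses at or above the satisfaction threshold."""
    if not scores:
        raise ValueError("no survey responses in this window")
    satisfied = sum(1 for s in scores if s >= satisfied_threshold)
    return satisfied / len(scores)

def availability_sli(successful_requests, total_requests):
    """Fraction of requests that succeeded."""
    return successful_requests / total_requests

# Hypothetical data: survey scores and (successful, total) request counts per day.
surveys_by_day = {
    "normal day": [5, 4, 5, 3, 4, 5, 4, 5, 4, 5],
    "outage day": [2, 1, 4, 2, 5, 1, 3, 2, 4, 1],
}
requests_by_day = {
    "normal day": (99_950, 100_000),
    "outage day": (91_200, 100_000),
}
CSAT_SLO = 0.90  # assumed target

for day, scores in surveys_by_day.items():
    ok, total = requests_by_day[day]
    csat = csat_sli(scores)
    status = "meets SLO" if csat >= CSAT_SLO else "misses SLO"
    print(f"{day}: CSAT {csat:.0%}, availability {availability_sli(ok, total):.2%}, {status}")
```

Putting both numbers on one dashboard line is the design choice here: it lets an SRE see a drop in satisfaction next to the technical signal that may explain it, instead of on a slide weeks later.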

Here are a few suggestions to get SRE and Customer Support to see through each other's eyes:

Add social media feeds to your SRE health dashboards. Sometimes customers will find blind spots in your monitoring.
Have a regular meeting with the customer support team. You may be surprised to find issues where you did not expect them.
Hold training sessions for non-technical employees. You will find that this makes it easier to get cooperation from teams in your tech initiatives if they have a better understanding of what you do.
Include customer support in the incident management process. They may have details that were not initially provided when the incident was first reported. They also have a direct line with the end user, which will give you final confirmation of the mitigation of an issue.
