22 Mar, 2022
SLOs can improve decision making, enhance cohesion between development and operations, ensure customers are always top of mind, and drive proactivity amongst team members. If you’ve recognised a level of reactivity within your team, where customer problems are only addressed after they’ve been reported rather than spotted before customers even notice them, I’ll show you how SLOs can make your team more proactive.
Most people are familiar with SLAs (service level agreements) but might not have heard of SLOs. An SLA is a contract between a service provider and a client, which helps the client set expectations on the quality and availability of the service. It also details the consequences or penalties for the provider if they don’t deliver on the service. An SLO is the objective your system must hit to meet the SLA. It’s made up of one or more SLIs (service level indicators), which are the metrics used to measure the system you built, such as availability, uptime, and latency.
I’m going to use a real client example to show how we helped an engineering team gain confidence in deploying changes to production and reduce their ‘mean time to detect’ (MTTD) through the use of SLOs. Our client operates in the financial services sector and had recently launched a public-facing application that we helped with. The application is made up of a React front end and a Kotlin Spring Boot backend running on Kubernetes with a Postgres database. It has a few integrations with other internal services within the company as well as with third-party service providers.
When I first started on the engagement, the engineering team were often unaware of any problems with the app until customers contacted the support team, which in turn escalated the issue to us. Sometimes it would take up to a week before we were alerted. This was made worse because not all customers would contact the support team, and some would keep retrying for a few days before notifying us at all. It also made troubleshooting and rollback difficult: by then we would have deployed multiple changes into production, and we could not easily tell which change caused the problem, nor which version to roll back to. Thankfully the app was still in beta, so any issues detected were minor, but we were determined to have it running as smoothly as possible.
After I settled into the team, I recognised they would benefit from SLOs. I took the initiative to raise this with them and got them thinking about implementing SLOs for the app. The suggestion was well received because of the value SLOs bring not only to end users, but also to the developers and operations engineers themselves.
Before we dive into the details of the how, it's important to understand the value that SLOs bring.
First, they act as the main indicator for when we should deploy changes to production by using an error budget derived from the SLOs. If you’re unfamiliar with the term error budget, I will explain it later in this piece.
Second, they signify whether a change is successful or not by observing the SLIs. For example, if your error rate changes radically right after a production deployment, you can immediately tell something is wrong and you can begin a rollback or prepare a hotfix before your SLO is breached.
Finally, monitoring SLOs can alert you to problems in production. For example, if you run out of capacity or are under a cyber attack, your application’s response times may increase until it begins to return errors. Your team will be notified so they can start investigating before your customers complain about your service on social media.
How did we come up with our SLO?
What I noticed in our SLO discussions was that people were excited and jumped straight into setting a 99.9% availability SLO. The end goal was where our discussion began, which isn’t the best way to go about this. The team needed to take a step back and think about the metrics that would help us better understand the system we built and, more importantly, what our customers care about.
As a result, we first collectively decided on the SLIs we believed were important, focusing on what the right metrics were rather than how to collect them. We held a workshop to identify the gaps in our observability and turned them into JIRA tickets for the backlog.
We then asked ourselves: if we were the customer, what would we truly care about? It came down to two things.
Is our customer experiencing an error or not?
How long does our customer have to wait before they get a response?
During the workshop, we also considered some SLOs that were discarded in the end. I’m sharing this because I want to show that not everything we thought was important is in fact important, especially to the customer. We had a great conversation about an SLI measuring the response time of a third-party service provider that our business logic relies on. Yes, its latency would impact our customers’ experience, but we didn’t make it an SLO for three reasons:
There is nothing we can do to improve someone else's systems apart from raising a ticket.
If this is a fatal error, the customer is going to experience an error returned by our service regardless, which is another SLI that we are measuring.
If we make this an async process and still return a successful response to the customer while implementing retry logic for failures on our backend, our customer is actually none the wiser!
Once the metrics important to our customers were identified, we wanted to set objectives based on them. It was very tempting to set the bar high and be proud of what we’d built, but we had to think about the cost of building such a system. Here we took an under-promise and over-deliver approach, starting with a relatively low bar for several reasons: our whole team was new, we had just inherited the app, we were not sure how our system behaved, and insufficient effort had been invested in building out the system’s observability.
Finally, we had our SLOs:
95% of requests have a successful response
95% of requests have a response time within 10 seconds
I know a lot of people’s jaws will drop when they see this low bar, so let me explain how we arrived at these SLOs. Our app was still in beta, so there was not much traffic: we had a couple of hundred HTTP requests daily. If we were to set our SLOs as high as 99.9%, a handful of errors in a day would be enough to breach the threshold. Also, our response time is tightly coupled with an unreliable third-party service provider that had previously experienced latency spikes causing requests to take more than 30 seconds. After we moved to general availability (GA) and the request volume built up, we were able to review the stats and raise the bar progressively.
Key takeaways from our SLO workshop
Do not overcomplicate your SLOs - have as few as possible.
Think from the customer’s perspective - not every single SLI is important to your customer.
Under-promise and over-deliver - it's always an easier conversation to raise the bar than to drop it.
How we’re tracking the SLOs
Note: this article is not intended to be a technical guide to implementing alerting and monitoring with Prometheus and Grafana (if you are interested in knowing more about that, feel free to reach out to us).
Our app is deployed on a managed Kubernetes cluster and uses NGINX as our ingress controller. The cluster came with Prometheus and Grafana out of the box, configured to collect NGINX ingress metrics by default, so we were able to use those metrics to track our SLOs.
For our first SLO, “95% of requests have a successful response”, we needed to work out the percentage of successful responses (our SLI). This is calculated as: (1 - error responses / total responses) * 100.
It’s important to note that we only considered responses in the HTTP 5xx range to be errors. We did not count the HTTP 4xx range as errors, because those are client-side errors caused by things like invalid or expired tokens, calling incorrect paths, or supplying invalid data. Your app may behave differently, and you may therefore have a different definition of what counts as an error response.
Once we had the success rate SLI worked out, we tracked it over a rolling 30-day period and set a low-threshold alert to notify us when it drops to 97%. This gives us a chance to take corrective action before it falls below our 95% objective, and, of course, we have another alert if it drops below 95%. Here’s a snippet of our PromQL:
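The exact query will depend on your metric names and labels; the sketch below assumes the standard nginx_ingress_controller_requests counter and its status label exposed by the NGINX ingress controller, with only 5xx responses treated as errors.

```
# 30-day success rate as a percentage: 1 minus the ratio of 5xx responses to all responses.
# Assumes the standard nginx_ingress_controller_requests counter from the NGINX ingress controller.
(
  1 - (
    sum(rate(nginx_ingress_controller_requests{status=~"5.."}[30d]))
    /
    sum(rate(nginx_ingress_controller_requests[30d]))
  )
) * 100
```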
For our second SLO, “95% of requests have a response time within 10 seconds”, we first needed to work out the percentage of requests whose response time is within 10 seconds (similar to what we did for our first SLO). NGINX exposes a response time histogram that we can leverage: it has buckets that keep track of how many requests fall into different response time ranges. We can use the following formula to work out the percentage: (number of requests that take less than 10 seconds / total number of requests) * 100.
Again, once we had the response time SLI worked out, we tracked it over the same 30-day period and set low and high thresholds like the success rate SLO above. Here’s a snippet of our PromQL:
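Treat this as a sketch too: it assumes the nginx_ingress_controller_request_duration_seconds histogram exposed by the NGINX ingress controller, and that your histogram configuration includes a 10-second bucket (le="10").

```
# Percentage of requests served within 10 seconds over the last 30 days.
# Assumes the nginx_ingress_controller_request_duration_seconds histogram,
# whose buckets include a 10-second upper bound (le="10").
(
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="10"}[30d]))
  /
  sum(rate(nginx_ingress_controller_request_duration_seconds_count[30d]))
) * 100
```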
We then created a dashboard in Grafana to visualise our SLOs. This is important because humans like and respond to visuals. The dashboard has become an integral part of our team’s workflow: we check it first thing in the morning as well as after any production deployment, and it helps the team spot problems quickly.
I’m going to explain what the error budget is, because I’ve mentioned it quite a bit throughout this piece. An error budget is a measure of how many errors your system can accumulate before customer experience is impacted. In simple words, it’s your customers’ pain tolerance.
To work out our error budget, we simply calculate: 100% - our SLO (95%) = 5%.
We also visualise the error budget on our Grafana dashboard. See this snippet of our PromQL:
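As with the earlier queries, the version below is a sketch built on the assumed nginx_ingress_controller_requests counter. It converts the unspent portion of the 5% budget into minutes over a 30-day window (43,200 minutes), which maps the request-based budget onto time as an approximation.

```
# Remaining error budget over a rolling 30-day window, expressed in minutes.
# 30 days = 43200 minutes; with a 95% SLO the full budget is 5% of that (2160 minutes).
# Assumes the nginx_ingress_controller_requests counter; only 5xx responses count as errors.
(
  0.05
  - (
    sum(rate(nginx_ingress_controller_requests{status=~"5.."}[30d]))
    /
    sum(rate(nginx_ingress_controller_requests[30d]))
  )
) * 43200
```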
The above query turns the error budget into the number of minutes our system can afford to be down in a 30-day period before our SLO is breached.
Our team is able to use that to determine whether we can release a change into production. For example, if we still have 300 minutes of error budget left, we can comfortably say, “let’s do it”. If we’ve exhausted the error budget, we hold off any releases until we’ve accumulated enough budget again. This is also a great way to help our product owner make decisions on when to release major changes or test new configurations.
Near the end of my engagement with our client, the team had become well versed in identifying issues in production. There was an incident where the error rate spiked right after a deployment, impacting customers’ ability to use the application. The team was able to identify and fix the issue well before any customers contacted the support team, and without the help of a consultant. My involvement in offering the team resolutions decreased over time. This brings happy tears to all of us at Midnyte City.
SLOs are not the only thing your team should care about; they’re just a starting point. If your team is unsure of what to do when it comes to determining your service’s health, going through the process of setting up SLOs will guide you to ask important questions, identify gaps, and better understand what your customers really care about.