4 Dec, 2024
There are tons of definitions from different companies and people all over the internet (just Google “what is observability”), and there is no single right or wrong answer. In this blog post, I am going to share my view after working with many start-up and scale-up companies on their observability journeys.
Most of the time when people talk about observability, the conversation is about where to ship metrics, logs and traces, and which observability tool or vendor to select. These things are important, but they are just tools to help us get to where we want to be on our observability journey. The more important questions to consider are:
What does it actually mean to have observability?
What are the benefits?
Why do we need it?
Where do we want this observability journey to take us?
If we jump straight into implementing tools without giving those questions any thought, we might end up on a sub-optimal, and more expensive, path.
Another common conversation around observability is what metrics to monitor. If the system you built is simple and predictable, it is easy to work out a set of metrics that represent the state of the system, put some alerts on them, and call the job done. These are what we refer to as known unknowns. It works… until one day an alert goes off and you can’t work out why it was triggered or what triggered it.
Metrics often lack the context to identify why and how the system failed, especially when it fails in an unexpected way. These are the unknown unknowns. This happens in the simplest systems we build, and it happens a lot in the complex systems we work on every day! Observability, when done well, is about handling the unknown unknowns and having rich context in every event. This enables us to better understand and explain any state our systems get into, regardless of how novel or bizarre it is.
Having mature observability means we are able to understand the state of our systems at any given time, and how they got to that state, without any code change after the fact, regardless of whether it is a wild anomaly or an issue or behaviour we have seen or expected. We want to be able to work out WTF is going on, quickly, when issues happen. We want to see the impact on customers without getting our engineers to add extra lines of code to capture missing information in the heat of an outage. This principle holds regardless of your choice of tools.
If you’ve read this far, you can probably guess that sitting at the heart of observability are the high-cardinality, high-dimensionality events our systems produce. Without those rich-context events, it is very difficult for us to truly understand WTF is going on.
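To make that concrete, here is a minimal sketch of what one wide, rich-context event might look like, using the OpenTelemetry Python API with a console exporter. The service name, function name and attribute keys/values are hypothetical examples I've made up for illustration, not a prescribed schema, and a real system would export to whatever backend you use.

```python
# A minimal sketch of emitting one wide, high-cardinality event per unit of work.
# Assumes `pip install opentelemetry-api opentelemetry-sdk`; the attribute names
# and values below are hypothetical, not a prescribed schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout here; a real deployment would export to your chosen backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_checkout(order):
    # One span per request, packed with context: high-cardinality fields such as
    # customer and order IDs alongside build and feature-flag metadata, so we can
    # slice by any of them later instead of adding code in the heat of an outage.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("customer.id", order["customer_id"])
        span.set_attribute("order.id", order["order_id"])
        span.set_attribute("order.total_cents", order["total_cents"])
        span.set_attribute("payment.provider", order["payment_provider"])
        span.set_attribute("app.build_sha", "9f2c1ab")
        span.set_attribute("feature_flag.new_pricing", True)
        # ... the actual checkout work would happen here ...

handle_checkout({
    "customer_id": "cus_8421",
    "order_id": "ord_10393",
    "total_cents": 12999,
    "payment_provider": "stripe",
})
```

The point is not OpenTelemetry specifically; any structured event carrying enough attributes to answer arbitrary questions after the fact serves the same purpose.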
In the past we often had to trim the amount of telemetry our systems produced, either because of cost or because the tools we used to analyse it could not handle the volume of data we collected. Those days are gone: there are tools that incentivise us to pack more attributes into every single event, allowing us to ask arbitrary questions of the data we collect and find the answers we need. This is not a recommendation for particular tools; it is to help us understand the secret sauce of good observability. We should not bend our applications to fit the tools; a good tool should help us implement good observability practices easily, and without upsetting the CFO.
Observability is like DevOps: it is a culture change first, and the tools we choose are there to help us promote that culture. State-of-the-art observability tools that can answer any question we ask about our systems won't be much help if no one uses them. It is very important to foster a culture where engineers wonder “can we tell if all customers are doing what we want them to do?” as part of day-to-day development. Developers need to build the habit of being curious about what happens in production and how customers interact with their craft. Building that intrinsic motivation helps build a high-performing team. The anti-pattern is burning the team out with noisy alerts and keeping them permanently reactive to issues; this will drag down any team’s performance, because they will spend too much time firefighting and on unplanned work.
Observability is a culture and a thought process. It can help us build reliability into the systems we develop and maintain, as well as help build high-performing teams. The tools we choose are just tools; they help us get to our destination, and implementing an observability tool is not a silver bullet. The thinking around those high-cardinality, high-dimensionality events is the core building block of observability. Getting that culture and thought process into the muscle memory of our organisations is going to create a long-lasting effect, keeping quality food coming out of our kitchens.