21 Jul, 2023
Have you ever been in a team where there was so much unplanned work coming in that you were just spending your days responding to people asking you for things?
One company I worked at felt very much like this. The origins were a complex set of factors (team restructures, leadership leaving, etc.), but the end result was a team of devops engineers spending all their days responding to requests for help. There was no time to make any improvements that would reduce this work, and by that point most people were stuck in the grind of responding to Jira/Slack requests.
How do we break out of this? Well, at that company, the devops team rebranded as a platform team! A platform team provides internal services (through infrastructure, tools, wikis or other resources) that enable software development teams to build and deploy software more efficiently and effectively. Because that would help, right, having a team that builds self-service products and platforms? Ideally it would reduce the number of support requests coming in. But the problem was that building a platform team is a long-term commitment. We also weren’t sure exactly what services this platform would provide, or what people actually wanted. And who was going to build this platform if everyone was spending all day responding to tickets?
The other common response I see is the knee-jerk one: over-investing in tech solutions the team thinks will help…without ever validating or testing that assumption. Oh, if we implement this tool, everyone will be able to see that information for themselves and they won’t ask us! So you implement the tool…and don’t see any reduction in people asking you for help. Or sometimes help requests actually increase, because people are confused by the new tool!
If we go back to the problem of a large volume of support requests coming into the team, there are a few problems in organisations that can contribute to this:
Lack of team ownership
Maybe in the past, the software teams threw work over the fence to the ‘devops’ team to deploy, and this way of thinking hasn’t shifted. We like to think the ‘devops’ way of working, where teams own the development and operations of their services, has been cemented into every part of our industry…but it hasn’t. In many organisations there is still ongoing work to do around this change.
Teams don’t have the skills
Maybe this team has never had to run or maintain their services before, and they need help deciding on autoscaling metrics. Or they don’t know how to get their logs into Sumo Logic.
No way for teams to self-service
Even if the teams do have the skills to self-diagnose, or to make changes to their infrastructure, sometimes they don’t have the permissions to. Especially where teams rely on shared platforms, access to those is (rightfully) quite restricted to the people managing the platform.
Those problems are big, and unlikely to be solved in the short term. Some of them, like teams not having the skills, will never go away; we can’t expect every software development team to have a full end-to-end understanding of all the infrastructure deployed in the organisation. So given these problems are unlikely to be resolved any time soon, the two most useful things you can do are:
Create visibility into the problem
Find small, iterative pieces of work you can do to slowly improve
How many tickets/requests is your team resolving? How long are they taking to resolve? What sort of problems are you solving over and over again? Where is this work coming from (development teams, team managers, particular functions etc.)? All of this information is valuable, and very easy to represent visually in a graph, pie chart etc. to help your stakeholders get a better understanding of your team’s problems.

Typically though, that information is never displayed in a way that makes it easy for people to make decisions. Showing a technology leader some graphs about the number of requests your team is expected to respond to, where these requests are coming from, and typical resolution times (like the examples below) is much more useful than ‘we are super busy because everyone keeps asking us to do things’.
Seeing the data in graphs like these surfaces some key data points, and lets you ask questions like:
What sort of tickets are taking over 2 weeks to resolve and why?
Ideally all tickets would be resolved in 1-2 days; what is causing these requests to take so long to resolve?
Why are there so many CI/CD errors?
Is there something we could be doing to reduce IAM errors?
Likely you use a work tracking system (like Jira) that this data can be extracted from. If you don’t have this data easily available, tracking requests on post-it notes in the office (or an old-school Excel spreadsheet) for a couple of weeks would give you enough of a sample.
In the team we were in, we had data from Jira on the number of tickets (and how long they took), but not the type of requests. We extracted about a month’s worth of data, then went through it as a team and quickly categorised each ticket. What we got was a clear picture of what work the team was spending its time on in a month (and what work was being done over and over).
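If your tickets can be exported to a CSV, even a rough keyword-based script is enough to do this categorisation and charting. The sketch below is just that, a sketch: the file name, column names and keyword-to-category mapping are assumptions you would swap for whatever your own export looks like.

```python
# A minimal sketch: categorise exported tickets by keyword and chart the counts.
# Assumes a tickets.csv export with 'summary', 'created' and 'resolved' columns;
# the categories and keywords below are illustrative only.
import pandas as pd
import matplotlib.pyplot as plt

CATEGORIES = {
    "pipeline": "CI/CD errors",
    "deploy": "CI/CD errors",
    "iam": "IAM / access",
    "permission": "IAM / access",
    "scaling": "Infrastructure help",
    "logs": "Logging / observability",
}

def categorise(summary: str) -> str:
    summary = summary.lower()
    for keyword, category in CATEGORIES.items():
        if keyword in summary:
            return category
    return "Other"

tickets = pd.read_csv("tickets.csv", parse_dates=["created", "resolved"])
tickets["category"] = tickets["summary"].apply(categorise)
tickets["days_to_resolve"] = (tickets["resolved"] - tickets["created"]).dt.days

# Two quick views: how many requests per category, and how long they typically take.
by_category = tickets.groupby("category")["days_to_resolve"].agg(["count", "median"])
print(by_category.sort_values("count", ascending=False))

by_category["count"].plot.bar(title="Support requests by category (last month)")
plt.tight_layout()
plt.savefig("requests_by_category.png")
```

A rough pass like this is enough to see which categories dominate; you can always refine the buckets later.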
From the last exercise, you should have some good starting points for what sort of work your team could do to reduce the number of questions coming in. The next step is looking for any pieces of work that are ‘low-hanging fruit’ (easy and quick to complete) that would reduce the number of requests you get in these categories.
For example, one issue we had was a lot of people raising requests about errors in their CI/CD pipelines. After responding to these requests, we realised most of them came from people in the organisation not knowing how to read the error logs in the pipeline tool. So we created a simple wiki page with basic diagnosis steps, and tried giving this to people whenever they raised a request for help with a pipeline. This helped reduce requests coming into our team! Not all of them (there were sometimes real problems), but we discovered a lot of the issue was people’s understanding and knowledge of the CI/CD tool.
Not all problems require self-service tools to be built; sometimes a simple wiki page (or some knowledge-sharing sessions) will help. The key takeaway here is don’t leap into ‘solve all the problems with awesome tools’ mode; rather, focus on quick wins that deliver value. What can we try (very easily and quickly) to validate that this is actually helping our ‘customers’? The quicker we can iterate and get feedback, the sooner we can confirm we are on the right track in terms of the problems we are solving. Another step to consider here is chatting to the people raising the requests, to get a better understanding of what problems they are facing and what help they really need.
Over time, with an investment in slowly iterating and implementing small fixes to help your customers, you will find your team becomes less reactive. In our team, this manifested in a few different ways:
We had ongoing conversations around themes we were seeing in requests, and possible fixes (be it doco, sessions, tools etc.), which we fed into our backlog
A lot of the team’s work was still responding to requests, but we had some capacity to invest in automation/tooling to proactively address the other problems we’d identified (e.g. for ECS scaling we put together a repo of infrastructure-as-code examples based on a few different types of apps we had in the organisation; see the sketch after this list)
Individuals got more of a chance to work on things that interested them. Responding to these requests became tiring and boring very quickly (especially because so many of them were similar), and by creating space in the team to proactively fix issues, team members who wanted to help with automation/building tools could do that.
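To give a flavour of what those ECS scaling examples looked like, here is a minimal sketch of a target-tracking scaling policy for an ECS service. It uses boto3 rather than whatever infrastructure-as-code tool your organisation uses, and the cluster name, service name and thresholds are placeholders, not recommendations.

```python
# A minimal sketch of ECS service autoscaling via a target-tracking policy.
# Cluster/service names and thresholds are placeholders for illustration.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/my-cluster/my-service"  # placeholder ECS service

# Register the ECS service's desired count as a scalable target (2-10 tasks).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Scale the service to keep average CPU utilisation around 60%.
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,  # seconds to wait before scaling in again
        "ScaleOutCooldown": 60,  # seconds to wait before scaling out again
    },
)
```

The point of a repo like this isn’t the specific numbers; it’s giving teams a worked example they can copy and adjust for their own app, instead of raising a ticket.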
This work is ongoing in any team, as any commitment to continuous improvement is. However, once team members started to realise the benefits, it wasn’t hard to keep driving this approach in the team. Keeping ourselves accountable to our ‘customers’ and their needs (i.e. are we building a tool just because it’s fun, or is what people really need another wiki article?) was the tougher battle. Over time we got better though, and it showed in how much easier our interactions with other software development teams were, and how much more quickly their requests were met.