Technical debt, part 2: managing debt
In our previous article, we discussed the definition, causes and symptoms of technical debt. Now it is time to discuss how technical debt is managed. This article will focus on one model of technical debt, the process of creating an assessment, and the pros and cons of different approaches to execution (the work required to address the debt).
Miroslav Lazovic is an experienced software engineer from Belgrade. Over the past 15 years, he has worked on many different projects of every size – from small applications to mission-critical services that are used by hundreds of thousands of users on a daily basis, both as a consultant and as part of the development team.
Since 2016, he has been focused on building and managing high-performing teams and helping people do their best work. He likes discussing various important engineering and management topics – and that’s the main reason for this blog. Besides that, Miroslav has quite a few war stories to tell and probably far too many books and hobbies. There are also rumors of him being a proficient illustrator, but the origin of these rumors remains a mystery.
– Backend TL@Neon and Dev. Advocate @Holycode
Ticketmaster’s model of technical debt
One approach for classifying technical debt (that I would like to discuss here) is the technical debt model created by Ticketmaster – the world’s largest ticket sales and distribution company.
The picture below shows Ticketmaster’s model of technical debt.
This model was used a few years ago, when Ticketmaster started an initiative to reduce technical debt across its many products. I was working at the company back then, so I had first-hand experience with this model, the assessment process, and the actual results. I have also used a similar approach a few more times during my career, most recently at my current company.
What is so special about this model? This model identifies 3 different categories of technical debt – application debt, infrastructure debt and architecture debt. It makes clear that the application code is just one part of the story (as we pointed out in the previous article) – you also need to consider where and how your applications run and the overall design of your system. Let’s briefly cover each of these 3 categories (you can refer to the picture for more detailed examples):
– Application debt: this is the part of technical debt that is related to your application code. It can appear in many forms: low code coverage, poor maintainability, bad documentation, performance issues, over-engineering, insufficient logging, etc.
– Infrastructure debt: this is the part of technical debt related to the environment(s) where your applications are running. It covers the number and specifications of your servers or databases, the way your deployments are performed, the level of automation, how you ensure continuity, etc.
– Architecture debt: this is the state of the overall design of your system. It covers how easy it is to add new components, whether there are any critical points that can cause catastrophic failures, whether you have any guidelines for designing new parts of the system, etc.
Performing the assessment
If you take the presented model as a starting point, you can perform an assessment of technical debt over these 3 categories. But how do you perform the assessment? Below you can find several steps that you can follow – think of them as guidelines, not as rules set in stone.
1. First, you need to organize the sessions where you will discuss each category of the technical debt with your team (and anyone who can contribute to the discussion but is not part of your team – software architect, support team, etc.). This means that you will have at least 3 sessions (one for application, infrastructure, and architecture debt each), but there is a good chance that you will discuss one of these categories much more than the other two. This means that you might need two or more sessions for this category. Also, make sure that your team has enough time for each session to avoid having to cut short an interesting discussion (for example, for the most recent assessment, my team used 2-hour sessions).
2. During these sessions, you should try to discuss all the relevant aspects of a specific tech-debt category. So, if you stick to the model presented earlier, during the session about application debt the topics might be test coverage, maintainability, performance, etc. The presented model should be used as a guideline – you don’t need to stick blindly to it. If there is any topic that is important for your team (but is not listed here), it needs to be covered. Make sure to take plenty of notes (or record these discussions), because they will contain a lot of technical knowledge and details that are crucial for understanding the issues that your team is facing.
3. For each topic that you discuss, assign a value from 1 to 5 that represents the level of debt (1 means low/no debt, 5 means large debt that you need to fix as soon as possible). So, in the case of application debt, maybe your tests are great (score 1), performance is also at a satisfactory level (maybe 2, meaning there are some minor things that could be improved in terms of debt), but maintainability is problematic (score 4), because there are a lot of old libraries that you need to upgrade and some important parts of the codebase are really messy, making any additional modification hard.
4. Once you have completed all sessions for each of the 3 main categories, you need to arrange the topics in each category from the most to the least important. This is a good time to start thinking about the acceptable level of debt – which means that you will have to consider which issues your team should address, based on the score assigned in the previous step (for example, you might want to address only topics with a score of 3 or more).
5. Once you have ranked the topics for each of the categories, you should write a document that summarizes your findings. For each of the 3 categories, you should present the list of topics arranged as described in the previous step, but you can leave out those that have a satisfactory level of technical debt or no debt at all. If we stick with the example from the previous step, this means leaving out topics with a score of 1 or 2. For each topic that ends up on the list, you need to state why it is important and what is likely to happen if technical debt in that area is not addressed. It is even better if you can show some examples of the impact that technical debt will have on a specific area, like a performance drop, an effect on user experience, loss of profit, etc. The notes that you have taken during the initial discussions will help you greatly during this step. This document will serve both as a reminder for your team and as important information for any stakeholder that might challenge the importance of working on reducing technical debt.
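Steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Topic` class, the `summarize` function and the `THRESHOLD` constant are hypothetical names, and the sketch uses the score itself as a stand-in for importance, whereas in practice the ranking comes out of the team discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    category: str  # "application", "infrastructure" or "architecture"
    name: str
    score: int     # 1 = low/no debt, 5 = fix as soon as possible

    def __post_init__(self):
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be between 1 and 5, got {self.score}")


# Assumed acceptable level of debt: only topics scoring 3 or more are addressed.
THRESHOLD = 3


def summarize(topics, threshold=THRESHOLD):
    """Group topics by category, highest score first, dropping low-debt topics."""
    summary = {}
    for topic in sorted(topics, key=lambda t: t.score, reverse=True):
        if topic.score >= threshold:
            summary.setdefault(topic.category, []).append(topic)
    return summary


# The application-debt example from step 3.
topics = [
    Topic("application", "tests", 1),
    Topic("application", "performance", 2),
    Topic("application", "maintainability", 4),
]

for category, ranked in summarize(topics).items():
    for topic in ranked:
        print(f"{category}: {topic.name} (score {topic.score})")
```

With the example scores, only maintainability ends up in the summary, which matches the idea of leaving out topics with a score of 1 or 2. The reasons and potential impact for each topic still have to be written by hand from the session notes.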
So, the assessment is now complete and you have an overview of the technical debt of your system. Now begins the real work – and that is addressing the issues detected during the assessment (paying off the debt). There are several ways you can do this – and some are better than others. Let’s first start with the not-so-good strategies for addressing tech-debt.
Don’t rewrite or redesign everything
“If the team doesn’t have the right knowledge, or does not have a clear understanding of why these issues were present in the first place – who can guarantee that you are going to get it right when you do it again?”
When the team looks at the list of issues, there might be a strong urge to fix some of them by rewriting parts of your system from scratch (or the whole system, in the worst case). While this might be a good solution in some cases, you should be _very_ careful. If the team doesn’t have the right knowledge, or does not have a clear understanding of why these issues were present in the first place – who can guarantee that you are going to get it right when you do it again? There is a good chance that the same mistakes will be made again, so you will end up with technical debt in the very same place very soon. A rewrite or redesign may require significant effort, as well as time and resources that the team might not have. Also, redesigning a large system that is being actively used requires a good strategy for migrating from the old system to the new one, as well as a plan for dealing with potential issues that may pop up along the way. While there certainly are cases when a rewrite should be considered a proper solution (a small component with limited impact, or a legacy application that is very hard to maintain), it should not be your first choice for all of the issues. Often, a proper set of localized changes might be exactly what you need.
Don’t treat technical debt as a separate project
“You need them (debt-related work items) to stay visible because it’s still project-related work that needs to be done.”
One thing that often happens with this (or a similar) initiative is that it is treated as a separate project, with its own tasks and board. While this might work sometimes, it should not be the preferred way of tracking work on technical debt. Most teams have one board (let’s call it the default board), one “source of truth” that they use to track all work items that are important for the project. As soon as you move some of the work items to a different board, there is a chance that they will not be treated the same as the rest of the work items on the default board. Even worse, these additional boards may become places where “tasks go to die”. This is because by moving them away, you _reduce their visibility_. You need them to stay visible because it’s still project-related work that needs to be done. These work items may even be so critical that they need to be addressed before doing anything else (if your level of technical debt is so high that it causes severe problems). That is why they should go to the same backlog and on the same board as everything else. They must be visible to the whole team – otherwise, other priorities will simply get in the way and addressing technical debt will be sidelined (or will stop completely).
Now you know which strategies are not so good when it comes to managing tech-debt. There are better ways of doing this, and I will cover one of them below. I have seen this approach being used in several companies to great success, and I have used it myself several times over the past few years. This year, at my current company (Neon), we used this approach to reduce the tech-debt by almost 50% in just 6 months after the first assessment.
Assign priority and contagion factor to all technical debt items
Every work item related to technical debt should be assigned a priority and a contagion factor. Priority is self-explanatory, but contagion factor is a very interesting parameter. Contagion factor describes the probability of technical debt spreading (if not addressed). For example, you might have a very messy codebase for some of the central parts of your project, and expanding the codebase further will propagate the debt because you will be building on a bad foundation (thus compounding the problem). On the other hand, you might have well-contained technical debt, limited to some software module or service that is not considered critical; not addressing this debt for some time probably won’t affect anything, because the blast radius is limited. Following this logic, when you assign a priority and a contagion factor to your debt-related work items, you will clearly see which ones need your immediate attention (high priority and high contagion factor) and which ones could wait a bit before being addressed.
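As a rough illustration, the triage described above could look like the following sketch. The article does not prescribe any scale or formula, so the 1 to 3 scales, the `DebtItem` class, the `triage` function and the idea of multiplying the two values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    title: str
    priority: int   # hypothetical scale: 1 = low, 3 = high
    contagion: int  # hypothetical scale: 1 = well contained, 3 = likely to spread


def triage(items):
    """Order items so that high-priority, high-contagion debt comes first.

    Multiplying the two values is just one possible way to combine them;
    priority is used as a tie-breaker.
    """
    return sorted(
        items,
        key=lambda item: (item.priority * item.contagion, item.priority),
        reverse=True,
    )


backlog = [
    DebtItem("Outdated library in a non-critical service", priority=2, contagion=1),
    DebtItem("Messy code in a central module", priority=3, contagion=3),
    DebtItem("Manual deployments", priority=3, contagion=2),
]

for item in triage(backlog):
    print(f"{item.title}: priority {item.priority}, contagion {item.contagion}")
```

In this made-up backlog the messy central module comes first, because new work built on top of it would spread the debt, while the well-contained library upgrade can wait.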
Team should continuously work on technical debt items besides other tasks
“There is always a tension between the business and engineering – one side wants to deliver as much as possible, while the other side wants stability.”
It is not likely that the team will always have the freedom to focus only on addressing the technical debt (focusing on part of the debt is more realistic), but the team should try to be as consistent as possible when it comes to this. The general idea is to dedicate a certain percentage of your time to these tasks. Whether it will be 10%, 20%, a dedicated sprint or something else depends on a lot of different factors, and communication with your PO will play a very important role here.
There is always a tension between the business and engineering – one side wants to deliver as much as possible, while the other side wants stability. However, managing this tension is very important – you must make sure that everyone understands why certain engineering issues must be resolved and what can happen if they are not. That’s why defining the potential impact of technical debt was important when writing the assessment. So, it is in the best interest of everyone involved that the team works on a mix of work items – some of which will add new value for the end users and some of which will make the project more stable. The ratio of these work items will frequently swing from one side to the other during the project lifecycle, but what is important is _consistency_. For example, if you work in sprints, you may dedicate part of your velocity in each sprint to tackling technical debt. Or, you may have a fully debt-focused sprint once every X regular sprints. There are multiple approaches to this – but the most important thing is to find a strategy that works for you, stick to it and track your progress.
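The two sprint-based options mentioned above come down to simple arithmetic, sketched below. The function names and the numbers (a velocity of 40 points, a 20% share, a debt sprint every 5th sprint) are hypothetical and only serve as an example.

```python
def split_velocity(velocity_points, debt_share):
    """Split a sprint's velocity into (feature points, debt points).

    debt_share is the fraction of velocity reserved for technical debt,
    for example 0.2 for 20%.
    """
    if not 0 <= debt_share <= 1:
        raise ValueError("debt_share must be between 0 and 1")
    debt_points = round(velocity_points * debt_share)
    return velocity_points - debt_points, debt_points


def is_debt_sprint(sprint_number, every):
    """Return True when every `every`-th sprint is fully debt-focused."""
    return sprint_number % every == 0


# Option 1: reserve 20% of a 40-point sprint for debt-related work items.
feature_points, debt_points = split_velocity(40, 0.2)
print(feature_points, debt_points)

# Option 2: a fully debt-focused sprint once every 5 sprints.
print([n for n in range(1, 11) if is_debt_sprint(n, every=5)])
```

Whichever option you choose, the actual share is something to agree on with your PO, and it can change over the project lifecycle.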
Do a regular re-assessment of technical debt
Keeping technical debt below a certain level is something you should do continuously – it’s rarely a one-time effort. You should do a reassessment of technical debt occasionally, to make sure that current debt is actually being reduced and new debt is not being introduced (or, if it is being introduced, that it is done consciously, with a mitigation plan). How often should this happen? Well, it depends on how much work is happening on your project – if the project is growing rapidly, you might want to do a reassessment more often (and vice versa). In my experience, doing a reassessment once every 6 to 12 months was fine for most of the projects that I was involved in.
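If the scores from each assessment are kept, comparing two assessments is straightforward. The sketch below assumes each assessment is stored as a mapping from topic name to its 1 to 5 score; the `compare_assessments` function, the topic names and the scores are made up for illustration.

```python
def compare_assessments(previous, current):
    """Group topics by how their debt score changed between two assessments."""
    changes = {"reduced": [], "unchanged": [], "increased": [], "new": []}
    for topic, score in current.items():
        if topic not in previous:
            changes["new"].append(topic)        # debt introduced since last time
        elif score < previous[topic]:
            changes["reduced"].append(topic)    # debt is being paid off
        elif score > previous[topic]:
            changes["increased"].append(topic)  # debt is growing
        else:
            changes["unchanged"].append(topic)
    return changes


previous = {"maintainability": 4, "performance": 2}
current = {"maintainability": 3, "performance": 2, "logging": 3}

for change, topic_names in compare_assessments(previous, current).items():
    print(f"{change}: {topic_names}")
```

Topics that show up as new or increased are the ones to question: was that debt taken on consciously, and is there a mitigation plan?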
And here we conclude the discussion about technical debt. This is an important issue that engineers need to know about and that will have to be addressed at some point during the project lifecycle. Depending on how this is done, you can either make your product even better and more stable, or you may contribute to more issues down the road. While the actual management of technical debt can be done in multiple ways (one of which has been explained in this article), it is equally important to understand why debt exists in the first place and what impact it may have not only on your project, but also on the business. Even if you only partially address your technical debt, that is still better than not addressing it at all (especially if you tackle the most critical issues), because once you go past a certain point, there may be no way to recover.