Measuring the Cost of Production Issues on Development Teams

Deprioritizing quality sacrifices both software stability and velocity, resulting in costly issues. Investing in quality boosts speed and outcomes.

Image by the writer. (AI generated Midjourney)

Investing in software quality is often easier said than done. Although many engineering managers express a commitment to high-quality software, they are often cautious about allocating substantial resources to quality-focused initiatives. Pressed by tight deadlines and competing priorities, leaders frequently face tough decisions about how to allocate their team’s time and effort. As a result, investments in quality are often the first to be cut.

The tension between investing in quality and prioritizing velocity is pivotal in any engineering organization, and particularly in cutting-edge data science and machine learning projects, where delivering results is at the forefront. Unlike traditional software development, ML systems often require continuous updates to maintain model performance, adapt to changing data distributions, and integrate new features. Production issues in ML pipelines, such as data quality problems, model drift, or deployment failures, can disrupt these workflows and have cascading effects on business outcomes. Balancing the speed of experimentation and deployment with rigorous quality assurance is crucial for ML teams to deliver reliable, high-performing models. By applying a structured, scientific approach to quantifying the cost of production issues, as outlined in this blog post, ML teams can make informed decisions about where to invest in quality improvements and optimize their development velocity.

Quality often faces a formidable rival: velocity. As pressure to meet business goals and deliver critical features intensifies, it becomes difficult to justify any approach that doesn’t directly drive output. Many teams reduce non-coding activities to the bare minimum, focusing on unit tests while deprioritizing integration tests, delaying technical improvements, and relying on observability tools to catch production issues, hoping to deal with them only if they arise.

Balancing velocity and quality isn’t an easy choice, and this post doesn’t aim to simplify it. However, what leaders often overlook is that velocity and quality are deeply connected. By deprioritizing initiatives that improve software quality, teams may end up with releases that are both bug-ridden and slow. Any gains from pushing more features out quickly can quickly erode, as maintenance problems and a steady influx of issues ultimately undermine the team’s velocity.

Only by understanding the complete impact of quality on velocity and the expected ROI of quality initiatives can leaders make informed decisions about balancing their team’s backlog.

In this post, we’ll try to provide a model to measure the ROI of investing in two aspects of improving release quality: reducing the number of production issues, and reducing the time teams spend on these issues when they occur.

Escape defects, the bugs that make their way to production

Preventing regressions is probably the most direct, top-of-the-funnel measure to reduce the overhead of production issues on the team. Issues that never occurred won’t weigh the team down, cause interruptions, or threaten business continuity.

As appealing as the benefits may be, there is an inflection point after which defending the code from issues can slow releases to a grinding halt. Theoretically, the team could triple the number of required code reviews, triple its investment in tests, and build a rigorous load-testing apparatus. It would find itself preventing more issues but also extremely slow to release any new content.

Therefore, in order to justify investing in any kind of effort to prevent regressions, we need to understand the ROI better. We can try to approximate the cost saving of each 1% decrease in regressions on overall team performance to start establishing a framework we can use to balance quality investment.

Image by the writer.

The direct gain of preventing issues is, first of all, the time the team spends handling these issues. Studies show teams currently spend anywhere between 20–40% of their time working on production issues, a considerable drain on productivity.

What would be the benefit of investing in preventing issues? Using basic math, we can start estimating the improvement in productivity for each issue that can be prevented in earlier stages of the development process:

T_saved = T_issues × P

Where:

  • T_saved is the time saved through issue prevention.
  • T_issues is the current time spent on production issues.
  • P is the percentage of production issues that could be prevented.

This framework helps in assessing the cost versus the value of engineering investments. For example, a manager assigns two developers for a week to analyze performance issues using observability data. Their efforts reduce production issues by 10%.

In a 100-developer team where 40% of time is spent on issue resolution, this translates to a 4% capacity gain (40% × 10%), plus an additional 1.6% from reduced context switching, a cost covered later in this post. With 5.6% of capacity reclaimed, the investment of two developers proves worthwhile, showing how this approach can guide practical decision-making.
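The arithmetic in this example can be sketched in a few lines of Python. This is a minimal illustration, not the post's own calculator: the function name is made up for this sketch, and the 40% context-switching penalty is an assumption inferred from the ratio between the 1.6% and 4% figures above.

```python
def prevention_savings(t_issues: float, p: float) -> float:
    """Share of total team time reclaimed by preventing a fraction p of issues.

    t_issues: fraction of team time currently spent on production issues (0..1)
    p:        fraction of production issues that is prevented (0..1)
    """
    if not (0.0 <= t_issues <= 1.0 and 0.0 <= p <= 1.0):
        raise ValueError("t_issues and p must be fractions between 0 and 1")
    return t_issues * p


# Worked example from the text: 40% of time on issues, 10% of issues prevented.
direct_gain = prevention_savings(t_issues=0.40, p=0.10)

# Assumption: a 40% context-switching penalty (1.6% / 4% in the example).
CS_PENALTY = 0.40
cs_gain = direct_gain * CS_PENALTY
total_gain = direct_gain + cs_gain

TEAM_SIZE = 100
print(f"direct gain:            {direct_gain:.1%}")
print(f"context-switching gain: {cs_gain:.1%}")
print(f"total reclaimed:        {total_gain:.1%} "
      f"(about {total_gain * TEAM_SIZE:.1f} developer-equivalents)")
```

Comparing the reclaimed developer-equivalents against the two developers invested for a week is what makes the trade-off concrete.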

It’s straightforward to see the direct impact of preventing each 1% of production regressions on the team’s velocity. This represents work on production regressions that the team would not have to perform. The table below can provide some context by plugging in a few values:

Given this data, for example, the direct gain in team resources for each 1% improvement, for a team that spends 25% of its time dealing with production issues, would be 0.25%. If the team were able to prevent 20% of production issues, it would mean 5% back to the engineering team. While this might not sound like a sizeable enough chunk, there are other costs related to issues that we can try to optimize as well for an even larger impact.
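A table of this kind can be regenerated with a short script. The sketch below assumes the simple relationship T_saved = T_issues × P implied by the numbers above; the specific T_issues and P values in the grid are illustrative choices, not necessarily the ones in the original table.

```python
# Illustrative input values (assumptions, chosen to span the 20-40% range).
T_ISSUES_VALUES = [0.20, 0.25, 0.30, 0.40]  # share of time spent on issues
P_VALUES = [0.01, 0.10, 0.20]               # share of issues prevented


def savings_grid(t_issues_values, p_values):
    """Map each T_issues value to the list of T_saved values, one per P."""
    return {t: [t * p for p in p_values] for t in t_issues_values}


grid = savings_grid(T_ISSUES_VALUES, P_VALUES)

# Print a plain-text table: one row per T_issues, one column per P.
header = "T_issues".ljust(10) + "".join(f"P={p:.0%}".rjust(10) for p in P_VALUES)
print(header)
for t, row in grid.items():
    print(f"{t:.0%}".ljust(10) + "".join(f"{v:.2%}".rjust(10) for v in row))
```

The row for a team spending 25% of its time on issues reproduces the two figures quoted in the text: 0.25% per 1% prevented, and 5% when 20% of issues are prevented.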

Mean Time to Resolution (MTTR): Reducing Time Lost to Issue Resolution

In the previous example, we looked at the productivity gain achieved by preventing issues. But what about the issues that can’t be avoided? While some bugs are inevitable, we can still minimize their impact on the team’s productivity by reducing the time it takes to resolve them, known as the Mean Time to Resolution (MTTR).

Typically, resolving a bug involves several stages:

  1. Triage/Assessment: The team gathers relevant subject matter experts to determine the severity and urgency of the issue.
  2. Investigation/Root Cause Analysis (RCA): Developers dig into the problem to identify the underlying cause, often the most time-consuming phase.
  3. Repair/Resolution: The team implements the fix.
Image by the writer.

Among these stages, the investigation phase often represents the greatest opportunity for time savings. By adopting more efficient tools for tracing, debugging, and defect analysis, teams can streamline their RCA efforts, significantly reducing MTTR and, in turn, boosting productivity.

During triage, the team may involve subject matter experts to assess whether an issue belongs in the backlog and to determine its urgency. Investigation and root cause analysis (RCA) follow, where developers dig into the problem. Finally, the repair phase involves writing code to fix the issue.

Interestingly, the first two phases, especially investigation and RCA, often consume 30–50% of the total resolution time. This stage holds the greatest potential for optimization, as the key is improving how existing information is analyzed.
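One way to find where a given team sits in that range is to compute the share from its own incident records. The sketch below is a minimal illustration, assuming each incident logs hours per stage; the `Incident` class, its field names, and the sample numbers are hypothetical and not data from this post.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hours spent on each resolution stage of one production issue."""
    triage_hours: float
    investigation_hours: float
    repair_hours: float

    @property
    def total_hours(self) -> float:
        return self.triage_hours + self.investigation_hours + self.repair_hours


def investigation_share(incidents: list[Incident]) -> float:
    """Fraction of all resolution time that went into investigation/RCA."""
    total = sum(i.total_hours for i in incidents)
    if total == 0:
        raise ValueError("no recorded resolution time")
    return sum(i.investigation_hours for i in incidents) / total


def mttr_hours(incidents: list[Incident]) -> float:
    """Mean time to resolution across incidents, in hours."""
    if not incidents:
        raise ValueError("no incidents")
    return sum(i.total_hours for i in incidents) / len(incidents)


# Hypothetical sample data, for illustration only.
sample = [
    Incident(triage_hours=1, investigation_hours=4, repair_hours=5),
    Incident(triage_hours=2, investigation_hours=3, repair_hours=5),
    Incident(triage_hours=1, investigation_hours=5, repair_hours=4),
]

print(f"MTTR: {mttr_hours(sample):.1f} h")
print(f"investigation share: {investigation_share(sample):.0%}")
```

Weighting by total hours rather than averaging per-incident percentages keeps a few long incidents from being under-counted.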

To measure the effect of improving investigation time on team velocity, we can take the percentage of time the team spends on issues and reduce the proportional cost of the investigation stage. This can usually be accomplished by adopting better tooling for tracing, debugging, and defect analysis. We apply logic similar to the issue prevention assessment in order to get an idea of how much productivity the team could gain with each percentage of reduction in investigation time.

T_saved = T_issues × T_investigation × R

Where:
  1. T_saved: Percentage of team time saved
  2. R: Percentage reduction in investigation time
  3. T_investigation: Share of per-issue resolution time spent on investigation efforts
  4. T_issues: Percentage of team time spent on production issues

We can test what the performance gain would be relative to the T_investigation and T_issues variables. We will calculate the marginal gain for each percent of investigation time reduction R.

As these numbers start to add up, the team can achieve a significant gain. If we’re able to improve investigation time by 40%, for example, in a team that spends 25% of its time dealing with production issues and 40% of each issue’s resolution time on investigation, we would be reclaiming another 4% of that team’s productivity (25% × 40% × 40%).
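The same calculation can be sketched in Python. This assumes the savings are the simple product of the three variables defined above, which is consistent with the 4% figure; the function name and the grid of input values are illustrative assumptions.

```python
def investigation_savings(t_issues: float, t_investigation: float, r: float) -> float:
    """Share of total team time reclaimed by shortening investigation.

    t_issues:        fraction of team time spent on production issues
    t_investigation: fraction of per-issue time spent on investigation
    r:               fractional reduction in investigation time
    """
    return t_issues * t_investigation * r


# Example from the text: 25% on issues, 40% of that on investigation, R = 40%.
example_gain = investigation_savings(0.25, 0.40, 0.40)
print(f"example gain: {example_gain:.1%}")

# Marginal gain for each 1% of R, over an illustrative grid of inputs.
T_ISSUES_VALUES = [0.20, 0.25, 0.40]
T_INVESTIGATION_VALUES = [0.30, 0.40, 0.50]
marginal = {
    (t_iss, t_inv): investigation_savings(t_iss, t_inv, 0.01)
    for t_iss in T_ISSUES_VALUES
    for t_inv in T_INVESTIGATION_VALUES
}
for (t_iss, t_inv), gain in marginal.items():
    print(f"T_issues={t_iss:.0%}, T_investigation={t_inv:.0%}: "
          f"{gain:.3%} per 1% of R")
```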

Combining the two benefits

With these two areas of optimization in mind, we can create a unified formula to measure the combined effect of optimizing both issue prevention and the time the team spends on issues it is not able to prevent.

Image by the writer.

Going back to our example organization that spends 25% of its time on production issues and 40% of the resolution time per issue on investigation, a 40% reduction in investigation time and prevention of 20% of the issues would result in an 8.1% improvement in team productivity. However, we’re far from done.
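The exact combined formula is not recoverable from the text here, so the following is only a sketch under one plausible assumption: prevention removes a share P of issue time, and the investigation speed-up applies only to the issues that remain. Under that assumption the example inputs give roughly 8.2%, close to but not exactly the 8.1% quoted, so the original formula may differ slightly.

```python
def combined_savings(t_issues: float, p: float,
                     t_investigation: float, r: float) -> float:
    """Combined share of team time reclaimed (assumed formula, see lead-in).

    Prevention saves t_issues * p. The remaining issue time, t_issues * (1 - p),
    is then shortened by the investigation reduction t_investigation * r.
    """
    prevented = t_issues * p
    remaining = t_issues * (1 - p)
    faster_resolution = remaining * t_investigation * r
    return prevented + faster_resolution


# Example organization from the text.
gain = combined_savings(t_issues=0.25, p=0.20, t_investigation=0.40, r=0.40)
print(f"combined gain: {gain:.1%}")
```

Applying the investigation reduction only to the unprevented issues avoids double-counting time that prevention has already removed.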

Accounting for the hidden cost of context-switching

The naive calculations above do not take into account a major penalty incurred when work is interrupted by unplanned production issues: context switching (CS). Numerous studies have repeatedly shown that context switching is expensive. How expensive? A penalty of anywhere between 20% and 70% extra work due to interruptions and switching between several tasks. By reducing interrupted work time, we can also reduce the context-switching penalty.

Our original formula didn’t account for that important variable. A simple though naive way of doing that would be to assume that any unplanned work handling production issues incurs an equivalent context-switching penalty on the backlog items already assigned to the team. If we’re able to save 8% of the team’s velocity, that should result in an equivalent reduction in context switching on the originally planned tasks. In reducing unplanned work by 8%, we have therefore also reduced the CS penalty on the equivalent 8% of planned work the team needs to complete.

Let’s add that to our equation:

Image by the writer.

Continuing our example, our hypothetical organization would find that the actual impact of its improvements is now a little over 11%. For a dev team of 80 engineers, that would be more than 8 developers freed up to do something else to contribute to the backlog.
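As a rough sketch of this adjustment, the reclaimed time can be scaled by (1 + CS). The penalty value is not stated for this example, so the 40% used below is an assumption, chosen because it reproduces the "a little over 11%" figure from the 8.1% quoted earlier; the function name is also hypothetical.

```python
def with_context_switching(t_saved: float, cs_penalty: float) -> float:
    """Reclaimed time including the context-switching penalty avoided.

    Naive model from the text: every unit of unplanned work removed also
    removes an equivalent context-switching penalty on planned work.
    """
    return t_saved * (1 + cs_penalty)


T_SAVED = 0.081      # combined improvement quoted in the text
TEAM_SIZE = 80       # team size used in the text
CS_ASSUMED = 0.40    # assumption: the penalty is not stated for this example

# Show how sensitive the result is across the 20-70% range cited above.
for cs in (0.20, CS_ASSUMED, 0.70):
    total = with_context_switching(T_SAVED, cs)
    print(f"CS penalty {cs:.0%}: {total:.1%} reclaimed, "
          f"about {total * TEAM_SIZE:.1f} developers")
```

Running the loop across the cited range shows how strongly the final estimate depends on the penalty a team actually experiences.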

Use the ROI calculator

To make things easier, I’ve uploaded all of the above formulas as a simple HTML calculator that you can access here:

ROI Calculator

Measuring ROI is vital

Production issues are costly, but a clear ROI framework helps quantify the impact of quality improvements. Reducing Mean Time to Resolution (MTTR) through optimized triage and investigation can boost team productivity. For example, a 40% reduction in investigation time recovers 4% of capacity and lowers the hidden cost of context-switching.

Use the ROI Calculator to evaluate quality investments and make data-driven decisions. Access it here to see how targeted improvements enhance efficiency.

References:
1. How Much Time Do Developers Spend Actually Writing Code?
2. How to write good software faster (we spend 90% of our time debugging)
3. Survey: Fixing Bugs Stealing Time from Development
4. The Real Costs of Context-Switching
