I hate to break it to you, but it’s only a matter of time before your organization experiences a bad release. Unfortunately, they happen far more often than we’d like.
Bad releases can occur for a variety of reasons. Something may have gone wrong during the rollout, or maybe there was a traffic increase during deployment. Perhaps a database migration failed against production data or critical features broke in unexpected ways.
Regardless of why or how they happen, bad releases hurt teams and businesses. Continuous, bad releases drive customers away over time, but one catastrophically bad release can seriously damage — or even destroy — a business.
Release mismanagement occurs at all stages in the software development process. Early problems tend to become more worrisome later on. For example, insufficiently tested software may break under production traffic, engineers may write software that doesn’t pass security compliance, teams may omit telemetry code that makes things harder to debug and resolve, etc.
Unfortunately, these and other troublesome practices are often repeated time and time again, essentially guaranteeing the same, unfavorable results.
One of the goals of the DevOps culture is to root out the mismanagement that causes problems in the first place. The DevOps Handbook offers real-world approaches that have been proven to mitigate bad releases and help companies succeed in the market.
In this post, we examine three common mismanagement practices that set teams up for software release failure. We also offer a few DevOps solutions that will help you avoid bad releases.
A bad release is one that causes unexpected technical or business problems. It might be code that returns 500 error responses, or a script that breaks the integration with a new third-party service. Both impact users, but they have different resolutions and different business impacts.
A bad release can quickly turn into a horrible release if problems aren’t identified and resolved promptly. For example, Knight Capital experienced a famously bad release that led to financial ruin for the company. The post-mortem revealed a catastrophic chain of events.
The report reads like a software horror story: A previously unused feature flag was repurposed for new behavior. The previous, nine-year-old code wasn’t removed and wasn’t tested. A poorly executed manual deployment left incorrect code on one of eight servers. Then, the trading system routed orders to that one machine, which triggered the incorrect, nine-year-old code. Actions undertaken to repair the system triggered new bugs, which exacerbated the problem further.
This unfortunate series of events cost Knight Capital $440 million in 30 minutes, plus a $12 million fine to boot.
You know what a bad release looks like, but you’re probably wondering what caused the problems in the first place. Your initial thought might be that the release simply wasn’t tested enough, and that therefore the bad release could have been avoided with more testing. If so, your impression is generally correct, but only to a certain point.
First of all, testing the software is easier said than done. It requires significant infrastructure. Let’s say an organization released software with a performance issue. The team decides that from now on, it will do performance testing to prevent performance issues and potential regressions. Now, testing must happen in a specially designed performance testing environment. The resulting pipeline is longer, more complex, and more costly.
Second, testing software before going to production only identifies the problems that can be found before going to production. In other words, testing prevents shipping known bugs into production, but it doesn’t ensure that production is bug-free.
Testing prior to the time of production doesn’t identify or prevent infrastructure issues, like a server running out of disk space or memory, for example.
Testing also doesn’t protect against changes that are pushed straight to production. These changes are often critical, so teams may not have time to push them through pre-production environments first, or it may not even be possible. Over time, such changes cause environment drift. A drifting environment produces inconsistent results, since tests run against one configuration while production runs another, only to break later on in production.

In addition, testing doesn’t eliminate workflow problems that arise from pushing directly to production. A fix may be applied to production, then overwritten by the next version promoted from staging or QA. In this scenario, the original problem is unwittingly reintroduced into production.
But if testing practices alone don’t cause bad releases, what else might be contributing? There are three major mismanagement practices that are common sources of release issues.
“Big Bang” releases are the culmination of many different features, bug fixes, improvements, and other changes intended to go live in a single deployment. This type of release practically guarantees design problems.
If you’ve used Big Bang releases, you’ve probably experienced that familiar, unsettling feeling many of us get when something is about to go wrong. That sixth sense is one you should trust — it’s telling you there’s too much to account for.
Some software projects increase in scope over time until they culminate in a horrific, Big Bang release. This type of release makes it harder to test, deploy, monitor, and troubleshoot your software.
Big Bang releases may even require deploying new services from scratch, creating new infrastructure, integrating untested components, or all of the above. They are especially troublesome because most teams don’t plan for failure.
Big Bang releases can fail spectacularly, and teams are usually unprepared for failure and unsure how to roll back the release, if a rollback is even possible.
This also creates complications for the business. Consider a release with multiple features. Maybe one feature is broken, but other, new business critical features are working great. With Big Bang releases, there is no option to keep business critical features working while you fix the broken one.
Big Bang releases are hard to develop, test, verify, and operate. It follows that the opposite is also true: smaller releases are easier to develop, test, verify, and operate. DevOps principles say that teams should work in small batches and deploy as frequently as possible.
Small batches paired with feature flags effectively render Big Bang releases a thing of the past. Feature flags give you fine-grained control over which features are available at any point in time.
So, how do the two work together?
Consider a change that requires a database schema modification, new code, and carries a high infrastructure impact. First, apply small batches: create one deployment that changes the database so it works with both the current code and the future code. Next, perform a separate deployment with the code change. Now, you can leverage the feature flag to manage the rollout.
Start small by enabling the new feature for your staff and gradually add more users. If something goes wrong, flip the feature flag.
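The staged rollout above can be sketched in a few lines. This is a minimal, hypothetical feature-flag implementation, not a real flag service (production systems typically use a dedicated flag store or vendor); the class name, the staff allowlist, and the percentage bucketing are all illustrative assumptions.

```python
import hashlib

# Minimal feature-flag sketch: staff allowlist plus percentage rollout.
# Names here are hypothetical, not a real flag library's API.
class FeatureFlag:
    def __init__(self, name, rollout_percent=0, allowlist=None):
        self.name = name
        self.rollout_percent = rollout_percent  # 0-100, share of users enabled
        self.allowlist = set(allowlist or [])   # e.g. staff accounts see it first

    def is_enabled(self, user_id):
        if user_id in self.allowlist:
            return True
        # Hash the user id so each user lands in a stable bucket from 0-99;
        # raising rollout_percent enables more users without flapping.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent

# Start with staff only, widen gradually, and flip it off if things go wrong.
flag = FeatureFlag("new-checkout", rollout_percent=0, allowlist=["staff-1"])
flag.rollout_percent = 10   # then roughly 10% of all users
flag.rollout_percent = 0    # something broke? disable instantly, no redeploy
```

Because the bucket is derived from a hash of the user ID, the same users stay enabled as the percentage grows, which keeps the rollout consistent from one request to the next.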
Telemetry data tells the team what’s happening on the ground. This may be data such as CPU usage, memory, 500 error responses, or text logs. It can also be business-level data like total logins, purchases, or sign-ups.
Telemetry data accurately describes what’s happening at any point in time. It’s not enough to focus exclusively on error-related metrics. Teams must use telemetry data to verify that what should be happening is happening.
Releases become bad releases when teams don’t realize something is going wrong. Knight Capital could have mitigated their problems if they had telemetry data about the percentage of servers running the correct code.
However, this kind of technical telemetry is inadequate on its own. Teams must also consider business telemetry.
Here’s a prescriptive solution from Etsy that is discussed in The DevOps Handbook: Create a real-time deployment dashboard with your top 10 business metrics. Metrics can include new sign ups, logins, purchases, posts, etc. These metrics should be common knowledge within the team. A deployment dashboard can help you see quickly if things are heading in the right direction.
Having this information is critical after each release since most problems come during and after releases.
Business-level telemetry can help you identify, in just a few hours, issues that used to take months to surface. If you see a significant change on the deployment dashboard, that’s an indicator that the release is bad and must be investigated.
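The dashboard check described above boils down to comparing each business metric after a release against its pre-release baseline. Here is a minimal sketch of that comparison; the metric names, the sample numbers, and the 25% threshold are illustrative assumptions, not values from the source.

```python
# Sketch of a post-release check: flag any business metric that moved
# more than `threshold` (25% by default) from its pre-release baseline.
def find_anomalies(baseline, current, threshold=0.25):
    """Return {metric: percent_change} for metrics that swung past threshold."""
    anomalies = {}
    for metric, before in baseline.items():
        after = current.get(metric, 0)
        if before == 0:
            continue  # no baseline to compare against
        change = (after - before) / before
        if abs(change) > threshold:
            anomalies[metric] = round(change * 100, 1)  # percent change
    return anomalies

# Hypothetical top business metrics, before and after a deployment.
baseline = {"sign_ups": 120, "logins": 4500, "purchases": 300}
current  = {"sign_ups": 118, "logins": 4400, "purchases": 150}

print(find_anomalies(baseline, current))  # → {'purchases': -50.0}
```

Here sign-ups and logins wobble within normal noise, but purchases dropped by half right after the deployment: exactly the kind of swing a deployment dashboard should surface for investigation.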
The more humans that are involved in a release, the more likely it is to become a bad release. This is not a knock against humans. (To the contrary, I proudly count myself among their ranks!) It’s simply a realistic acknowledgement that we, as people, can’t complete the same task 10,000 times and do it exactly the same way each time. Bad releases tend to happen because of people more often than because of machines.
You’ve likely seen this play out before. The 30-step release process usually works fine, but a special-case release requires running a one-off script. Somehow, the script doesn’t make it into the instructions and something goes wrong. Or, an engineer fat-fingers a change on one of ten servers.
Manual releases create a plethora of problems. Slow, manual release processes allow fewer deployments, which pushes teams back toward those troublesome Big Bang deployments.
Manual work is slow and error-prone, so it’s best to automate as much of it as possible. This is especially true for releasing software, since automated tasks complete far faster than manual ones. Typically, businesses that use database release automation have higher release rates. These higher release rates enable faster product development, which makes them more profitable than their competitors.
Ideally, software releases should happen at the touch of a button, which requires specially designed release software. This release tooling, just like any other piece of software, should be tested and maintained along with the primary business code. More importantly, it segues into the best way to mitigate release mismanagement: continuous delivery (CD).
CD makes all changes deployable to production. It requires automating the software development process. Remember, manual work doesn’t only apply to releasing software; it also includes manual testing. Replacing manual testing requires automated testing of all code changes and deployment automation to put the changes into production.
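At its core, a push-button release is just a scripted sequence of steps with an automatic rollback on failure. Here is a minimal sketch of that idea; the step commands (`run_tests.sh`, `deploy.sh`, `health_check.sh`, `rollback.sh`) are placeholders for whatever your own pipeline tooling provides, not a real CLI.

```python
import subprocess

# Minimal "push button" release sketch: run each scripted step in order,
# and roll back automatically on the first failure. No human runs steps
# by hand, so every release executes exactly the same way.
def release(steps, rollback_cmd="./rollback.sh"):
    """Run each release step; return True on success, roll back on failure."""
    for step in steps:
        result = subprocess.run(step, shell=True)
        if result.returncode != 0:
            print(f"{step} failed, rolling back")
            subprocess.run(rollback_cmd, shell=True)
            return False
    print("release succeeded")
    return True

# Example with placeholder commands that always succeed:
release(["true", "true"], rollback_cmd="true")  # prints "release succeeded"
```

In a real pipeline the steps would be your test suite, deployment, and post-deploy health checks, and the same script would run on every release, whether routine or urgent.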
CD is a powerful practice that can transform your team and business, so don’t underestimate it.
There usually isn’t just one cause for a bad release (even though it may feel like it). There may be a single line of code or a person to blame for parts of the problem, but that analysis doesn’t go deep enough. Software development is a complex process. There are many compounding factors like technical debt, business conditions, and past engineering decisions. This melting pot can create the conditions for bad practices to be put in place.
However, these problematic practices can be exchanged for better ones — specifically, the successful practices that guide the DevOps movement.
- First, replace Big Bang deployments with smaller batches.
- Next, put key business telemetry on a deployment dashboard to immediately identify issues.
- Finally, automate as many of the processes as possible.
This approach improves how teams build, develop, and release software. However, the changes shouldn’t stop there. Teams should work to continuously improve their processes. No single solution will completely eliminate bad releases, but using best practices will reduce their occurrence and business impacts.
How can you overcome release management challenges? Read more to find out.