How to Automate Software Production Readiness
As software engineers, we can all agree that there’s no such thing as perfect software. Whether we like it or not, there’s always something that can go wrong. Rather than strive for perfection, engineering teams should minimize potential disruptions by proactively addressing the most common preventable causes of failure. This is where production readiness comes in.
Production readiness helps engineering teams answer whether their production services meet the operational standards that matter to their organization. Production readiness reviews and checklists therefore measure readiness across a number of categories, including reliability, security, observability, quality, maintainability, and more. Getting it right means that your team can avoid incidents that lead to rework, cause revenue loss and reputational damage, and even impact developer velocity and morale.
For teams that are in the early stages of leveraging production readiness, this approach often relies on manual checklists and reviews. While that is a good starting point, a mature production readiness process should be streamlined, comprehensive, and continuous. Let’s explore how you can take your production readiness to the next level by leveraging automation.
The three challenges that get in the way of production readiness
When engineering leaders start exploring opportunities for enhancing their production readiness capabilities, there are often three challenges that they come up against.
- A discoverability problem. Teams don’t have complete or up-to-date visibility into all the services and software that exist within their organization. There’s also a lack of clarity around ownership and accountability.
- A measurement problem. There isn’t a clear understanding of what needs to be measured in order to ensure readiness across multiple categories, nor of the metrics required.
- A cultural problem. Developers aren’t interested in trading product development time for ownership and production readiness tasks, nor are they motivated to do so.
To build a truly effective production readiness model, you need to address all three of these problems. At the end of the day, you can’t improve the things you don’t know exist or the things you can’t measure. And you definitely won’t get very far improving things that nobody cares about or has time for.
The solutions for each of these challenges hinge on introducing automation into your approach to production readiness.
How to solve production readiness challenges
While automation is a key driver in addressing each of these three challenges, it’s not a silver bullet. Setting your organization up for success will take time—but it’s important to remember that the investment will be worth it in the long run.
Solving the discovery problem
It’s hard to know what services need improvement if you don’t fully know what services you have in the first place. For production readiness to be effective, you need to know what all your services are, where they are, and who owns them.
Many teams rely on spreadsheets and Notion pages to manually catalog all of their services. While this may work for small teams, these documents quickly become incomplete and out of date as a team scales.
An automated service catalog can provide real-time visibility into each service, with easily accessible information and metadata including ownership, past changes and deploys, where the code lives, where the service lives in the tool chain, dependencies, and more. It integrates with different systems (e.g., Kubernetes), automatically pulling in all deployment information and more, so developers don’t have to make manual updates.
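To make the idea concrete, here is a minimal sketch of a catalog that is updated from integration data rather than by hand. The `ServiceEntry` fields, the `merge_discovered` function, and the shape of the discovered records are all hypothetical and chosen for illustration; a real catalog would pull this data from its own integrations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceEntry:
    """One catalog record. Field names are illustrative assumptions."""
    name: str
    owner: Optional[str] = None      # owning team, if known
    repo_url: Optional[str] = None   # where the code lives
    dependencies: List[str] = field(default_factory=list)


def merge_discovered(catalog: Dict[str, ServiceEntry],
                     discovered: List[dict]) -> Dict[str, ServiceEntry]:
    """Upsert services reported by an integration (hypothetical record shape)."""
    for item in discovered:
        entry = catalog.setdefault(item["name"], ServiceEntry(name=item["name"]))
        # Only overwrite fields the integration actually reported.
        entry.owner = item.get("owner") or entry.owner
        entry.repo_url = item.get("repo_url") or entry.repo_url
        entry.dependencies = item.get("dependencies", entry.dependencies)
    return catalog


def unowned(catalog: Dict[str, ServiceEntry]) -> List[str]:
    """Names of services with no recorded owner, sorted for stable output."""
    return sorted(name for name, e in catalog.items() if e.owner is None)
```

Running `merge_discovered` on every sync keeps the catalog current, and `unowned` surfaces the ownership gaps described earlier without anyone maintaining a spreadsheet.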
Solving the measurement problem
When it comes to identifying the components you want to measure as part of your production readiness, you should start by creating a list of all the things that are important to your organization. An early iteration of this might look like a rudimentary checklist that service owners have to review before taking anything to production—but that’s not going to scale well.
For instance:
- Is an owner defined?
- Are backups set up? Stored cross-region?
- Is data encrypted at rest?
- Is data encrypted in transit?
- Are secrets stored in Vault?
- Are logs emitted to ELK?
- Does the service store PII?
- Is it on the latest version of $Framework?
- Are instances running in prod VPC? Using the right security group?
- Is container scanning enabled?
There are two core challenges that many teams face with this approach. The first is data collection. There’s a lot of manual effort that goes into collecting the data required for a production readiness checklist. Service owners have to investigate all those elements, and answers may differ from one person to the next, making it somewhat unreliable.
The other challenge is the evaluation process. Today, production readiness happens mostly right before a feature or application goes to production as a single large task. This means that when a new check is introduced or a dependency changes, the production readiness of a service that’s already in production isn’t reassessed, and that can open the door to risk.
The solution here is, once again, automation. With an automated check system that integrates with detection tools and tracks all the measurements you care about, you can avoid the errors and inconsistencies that come with manual data collection. In other words, the goal is a measuring system that checks your sources of truth rather than asking a human who might not actually know the answer.
Automation also allows for continuous evaluation, keeping production readiness as a steady burn, rather than a high-intensity, one-time effort. This means everything is always being monitored in real time against the most complete and relevant production readiness checklist. In turn, this reduces friction for developers and makes production readiness efforts more efficient and effective.
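As a rough sketch of what automated, continuous checks could look like, the example below models each check as a function over facts gathered from your sources of truth. The fact keys (`owner`, `encrypted_at_rest`, `secret_store`, `container_scanning`) and function names are assumptions made for illustration, and the facts are hard-coded here where a real system would query its integrations.

```python
from typing import Callable, Dict, List

Facts = Dict[str, object]          # facts about one service (hypothetical keys)
Check = Callable[[Facts], bool]    # a check returns True when it passes


def default_checks() -> Dict[str, Check]:
    """A few items from the checklist, expressed as automated checks."""
    return {
        "owner_defined": lambda f: bool(f.get("owner")),
        "encrypted_at_rest": lambda f: f.get("encrypted_at_rest") is True,
        "secrets_in_vault": lambda f: f.get("secret_store") == "vault",
    }


def evaluate_all(services: Dict[str, Facts],
                 checks: Dict[str, Check]) -> Dict[str, Dict[str, bool]]:
    """Run every check against every service, including ones already in production."""
    return {
        name: {check_name: fn(facts) for check_name, fn in checks.items()}
        for name, facts in services.items()
    }


def failing(report: Dict[str, Dict[str, bool]]) -> Dict[str, List[str]]:
    """Map each service with at least one failure to its failing check names."""
    return {
        svc: sorted(c for c, ok in results.items() if not ok)
        for svc, results in report.items()
        if not all(results.values())
    }
```

Because `evaluate_all` always runs the full set of checks against every service, adding a new entry to the checks dictionary and re-running it reassesses services that shipped long ago, which is the continuous evaluation described above.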
At OpsLevel, we’ve introduced another evolution to the production readiness checklist by taking a graduated approach. Rather than having a flat checklist where each item is weighted the same, we recommend partitioning your checklist into multiple (typically at least three) production levels or grades. As you can see in the image above, our customers often use Bronze, Silver, and Gold rankings. The Bronze level defines the minimum threshold a service needs to meet. Silver is the core level, with baseline requirements for services that have cleared the bare minimum. Gold covers future-proofing and aspirational standards that will be critical in the future but are less of a priority now.
In this approach, each service is given a “grade” depending on the checks it has passed. If a service has all of its Bronze checks complete and only some of its Silver checks done, then it’s given a Bronze grade. This makes it easier to compare maturity against other services while also ensuring that the must-have checks are prioritized. Plus, it gives service owners manageable milestones to hit, rather than a single 100% mark to chase.
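The grading rule can be sketched in a few lines. This is an illustrative reading of the rule described here, not OpsLevel's implementation: the `grade` function and the input shape are hypothetical, and returning `None` for a service that fails a Bronze check is an assumption.

```python
from typing import Dict, Optional

LEVELS = ["Bronze", "Silver", "Gold"]  # lowest to highest


def grade(results_by_level: Dict[str, Dict[str, bool]]) -> Optional[str]:
    """Return the highest level whose checks, and all lower levels' checks, pass.

    results_by_level maps a level name to {check_name: passed}.
    Returns None if any Bronze check fails (an assumption for this sketch).
    A level with no checks defined counts as passed.
    """
    achieved = None
    for level in LEVELS:
        checks = results_by_level.get(level, {})
        if not all(checks.values()):
            break  # a failure here caps the grade at the previous level
        achieved = level
    return achieved
```

For example, a service with every Bronze check passing and one Silver check failing gets `"Bronze"`, even if all of its Gold checks happen to pass.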
If you currently have a long flat checklist for production readiness and want to evolve to this model, your organization will have to go through a pretty rigorous prioritization process to figure out where each check lives. Another important thing to consider is the natural sequence of the checks. For example, you wouldn’t implement canary deploys before you had reliable rollbacks in place, so that sequencing can help you decide which checks need to happen first.
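One way to reason about sequencing is to record which checks are prerequisites for others and let a topological sort suggest an order. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the check names, level assignments, and the `misplaced` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

LEVEL_RANK = {"Bronze": 0, "Silver": 1, "Gold": 2}


def rollout_order(prereqs: Dict[str, Set[str]]) -> List[str]:
    """Order checks so every prerequisite comes before the checks that need it.

    prereqs maps a check to the set of checks that must be in place first.
    """
    return list(TopologicalSorter(prereqs).static_order())


def misplaced(levels: Dict[str, str],
              prereqs: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Find (check, prerequisite) pairs where the prerequisite sits in a higher level."""
    return sorted(
        (check, pre)
        for check, pres in prereqs.items()
        for pre in pres
        if LEVEL_RANK[levels[pre]] > LEVEL_RANK[levels[check]]
    )
```

With `prereqs = {"canary_deploys": {"reliable_rollbacks"}}`, `rollout_order` places `reliable_rollbacks` first, and `misplaced` flags any level assignment that would ask for canary deploys at a lower level than rollbacks.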
This approach is also helpful from a visibility perspective. Service owners can quickly check in on any of their services to see its maturity ranking and identify what still needs to be done. At a higher level, leaders can see how the services in their purview are doing from a maturity perspective and where the gaps are.
Solving the cultural problem
In order to develop a successful production readiness system, you need it to be embedded into your organization’s culture—but this isn’t something that’s going to happen overnight. Building a culture that prioritizes production readiness is a multi-step process that will take time.
- Step 1: Start at the top. Having buy-in from your leadership will be a key driver in moving things forward and encouraging adoption throughout the rest of the organization.
- Step 2: Implement ruthless prioritization. This is the time to make really hard decisions. What trade-offs will your team make in terms of feature development to implement production readiness work? Having the executives from step 1 on board will be helpful in these discussions, as they will be able to rule on disagreements and advocate for changes that need to happen.
- Step 3: Incentivize teams to do the work. Giving teams the right data and tools is a great starting point, but it’s not enough. Developers need to feel like they are collectively contributing towards organizational objectives.
When it comes to incentivizing team members, we’ve seen our customers do this successfully in three ways. The first focuses on embedding production readiness into top-down goals. In practice, this can look like adding service maturity into your regular goal- and objective-setting cycle. For instance, you could have OKRs that are tied specifically to production readiness, as well as team and manager performance metrics tied to service levels. In addition, failing checks and lagging services could be added to the agenda for operational reviews. From a leadership perspective, there also needs to be complete visibility into performance against these goals and objectives through an automated reporting process.
The second approach we’ve seen our customers use is reserving capacity exclusively for production readiness. This means carving out dedicated time or resources within each team for ownership work. This could be 20% of the points in a sprint, one team member per sprint, or even having every fourth or fifth sprint dedicated to ownership tasks.
Lastly, you can also integrate service maturity into the software development lifecycle. Automating production readiness can be a stepping stone towards continuous development, and we’ve seen customers integrate our service maturity functions in their CI/CD pipeline.
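As a hedged illustration of what a pipeline integration might look like, the sketch below turns a service's current maturity level into an exit code that a CI step can act on. It does not use any OpsLevel API; the `gate` function, the level names as inputs, and the way the current level reaches the script are all assumptions.

```python
from typing import Optional

# Rank of each level; None means the service has not reached the lowest level.
LEVEL_RANK = {None: -1, "Bronze": 0, "Silver": 1, "Gold": 2}


def gate(current_level: Optional[str], required_level: str) -> int:
    """Return a process exit code: 0 lets the pipeline proceed, 1 blocks it."""
    if LEVEL_RANK[current_level] >= LEVEL_RANK[required_level]:
        return 0
    return 1
```

A pipeline step could call `sys.exit(gate(current, "Bronze"))` after looking up the service's current level, so a deploy is blocked only when the service falls below the minimum your organization has agreed on.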
The approach or approaches that you choose to implement will depend largely on your organization’s culture and way of doing things. Regardless, investing in automation will be key to reducing friction, making the trade-offs easier to negotiate, and ultimately making it easier for dev teams to do this work.
Achieving maturity in production readiness requires automation
Today’s engineering teams are being asked to be quicker, more agile, and more efficient than ever before. Often, this means that seemingly “non-essential” tasks like production readiness can be quickly deprioritized, leaving the organization open to increased risk. Leveraging automated processes—and embedding them within a culture of continuous improvement and alignment within the organization—can help teams stay agile and focused on the product while simultaneously prioritizing reliability, security, maintainability, and more. To us, that’s the ideal balance for engineering teams everywhere.