
The 3 Most Overlooked Strategies for Minimizing Downtime

OpsLevel

March 29, 2022

Downtime sucks (duh) - it means unhappy end users and engineers. Failures and error messages frustrate customers and interrupt engineers (or worse, wake them up).

But from an engineering leader’s perspective, it’s especially frustrating when you realize it all could’ve been easily avoided. Let’s review 3 overlooked strategies for minimizing downtime.

1. Codify Tribal Knowledge

Have you ever worked at a company that had that one person with encyclopedic knowledge of every single service and system deployed to production?

An engineering manager would do anything to have this person on the team. For the engineers on that team, this expert would be an incredible resource for architectural knowledge of all kinds: the details of service configurations, metadata, or dependencies. Those details are hugely helpful when debugging operational issues, and when developing new features, so you avoid reinventing the wheel with duplicate functionality.

And naturally, Support and Product teams would love this person too: they would be the universal router triaging - and often answering - hard questions!

Unfortunately, this person is rare and fleeting. Even if they do exist at your company, eventually they will leave. Or just be on vacation at the critical moment when their insights are needed most.

Plus, after you have about 50 services at your company, this person - short of a photographic memory - likely doesn’t even exist! After that point it’s simply too difficult - impossible even - for one person to keep track of all the details of all the microservices running in production.

Many engineering organizations, especially those that have just gone through rapid growth, find themselves with a single point of (human) failure. But it doesn’t have to be this way. With a microservice catalog, automated, scalable service discovery - for humans - is possible.

2. Automate & Build In Your Best Practices

A microservice catalog can also provide specific, actionable production-readiness guidelines to your developers. Then it continuously monitors for adherence to those standards with automated checks.

These guidelines can be crafted to cover every aspect of service quality, including security, scalability, reliability, and resiliency. Checks are dynamically applied to relevant services based on their language, tier, lifecycle stage, etc.
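
To make "dynamically applied" concrete, here is a minimal Python sketch of checks that are scoped by service metadata. Everything in it (the Service and Check classes, the field names, the two example checks) is hypothetical and chosen for illustration; it is not OpsLevel's API, just one way the idea could look.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Service:
    # Hypothetical catalog entry; the field names are assumptions for this sketch.
    name: str
    language: str
    tier: int        # 1 = most critical
    lifecycle: str   # e.g. "alpha", "ga"
    metadata: dict = field(default_factory=dict)


@dataclass
class Check:
    name: str
    applies_to: Callable[[Service], bool]  # which services the check targets
    passes: Callable[[Service], bool]      # whether a targeted service meets it


# Illustrative checks only; real standards would come from your own guidelines.
CHECKS = [
    Check(
        name="has-runbook",
        applies_to=lambda s: s.tier == 1 and s.lifecycle == "ga",
        passes=lambda s: bool(s.metadata.get("runbook_url")),
    ),
    Check(
        name="pinned-python-version",
        applies_to=lambda s: s.language == "python",
        passes=lambda s: "python_version" in s.metadata,
    ),
]


def failing_checks(service: Service) -> list[str]:
    """Return the names of applicable checks that the service does not pass."""
    return [c.name for c in CHECKS if c.applies_to(service) and not c.passes(service)]


payments = Service("payments", "python", tier=1, lifecycle="ga",
                   metadata={"python_version": "3.11"})
print(failing_checks(payments))  # ['has-runbook']
```

Keeping "which services does this apply to" separate from "does the service pass" is what lets a tier-1 service in general availability be held to a stricter bar than an internal prototype, without anyone maintaining per-service checklists by hand.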

With the right reminders and guardrails in place, engineers can comfortably operate existing microservices and spin up new ones with best practices built in. The result? You’ll be meeting or exceeding your service level agreements in no time.

3. Centralize Incident Response Tooling & Info

Of course incidents will still happen - so how prepared your teams are to respond matters. No matter how exacting your proactive guidelines are, teams need to be equipped to react effectively - especially when they can be paged at any time for a service they might not be an expert in.

A microservice catalog can help. It won’t replace your existing tools like PagerDuty or Datadog. Instead, it complements and unifies them by providing complete context and connecting all the dots.

With a microservice catalog, you can access all the critical information necessary to resolve an outage. There’s no need to dig through ten different outdated wikis or spreadsheets to understand what a service does, who the owner is, and where the relevant runbooks and observability data reside.

Don’t lose precious minutes during an incident because an on-call engineer has to verify whether the impacted services are monitored in New Relic or Datadog.

Instead, track all the metadata about your services in one single place so you don’t discover an orphaned or poorly documented service during a sev1 incident.
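
As a rough illustration of catching those gaps ahead of time, the Python sketch below scans a catalog for services that lack an owner, runbook, or dashboard link. The catalog shape, the required field names, and the example.com URLs are all assumptions made for this example, not a real catalog schema.

```python
# Fields this sketch treats as required; adjust to whatever your team tracks.
REQUIRED_FIELDS = ("owner", "runbook_url", "dashboard_url")


def missing_metadata(catalog: dict[str, dict]) -> dict[str, list[str]]:
    """Map each service name to the required fields it lacks (empty counts as missing)."""
    gaps = {}
    for name, meta in catalog.items():
        missing = [f for f in REQUIRED_FIELDS if not meta.get(f)]
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical catalog data for illustration.
catalog = {
    "checkout": {
        "owner": "payments-team",
        "runbook_url": "https://example.com/runbooks/checkout",
        "dashboard_url": "https://example.com/dashboards/checkout",
    },
    "legacy-reports": {"owner": "", "runbook_url": None},
}

print(missing_metadata(catalog))
# {'legacy-reports': ['owner', 'runbook_url', 'dashboard_url']}
```

Running a report like this on a schedule, rather than during an outage, is the point: the orphaned service shows up as a routine to-do instead of a surprise.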

Summary

No engineering leader, organization, or application can completely avoid downtime. But with these proactive, holistic strategies, it’s possible for software teams to be better prepared and substantially reduce their downtime.

If you’re considering any of these strategies, get in touch today to learn how OpsLevel can accelerate their implementation and make downtime a distant memory.