Understanding and Managing Configuration Drift
For most enterprises, microservices and agile methodologies tend to go together. So, when you adopt a microservice architecture, you’re embracing more than just a new paradigm for building services. You’re also committing to deploying new code and configurations more often.
More deploys mean more change, and no matter how hard you try, code and configuration changes will slip through the change management cracks.
You end up with systems that are out of sync with your deployment pipelines, release packages, and source control. That means you don’t completely understand what’s running in production. That’s a condition you never want to find yourself in. It leads to mysterious outages, unexpected regressions, and unhappy customers.
When configuration falls out of sync, we refer to it as configuration drift. Let’s look at the different ways this happens, how we can avoid it, and how to fix it.
What is Configuration Drift?
First, let’s define what configuration drift is and why it occurs.
Configuration drift is when production infrastructure configurations fall out of sync with their expected state. For example, when primary and secondary networking systems have different configurations, they have “drifted” apart from each other. Or, when a software application’s configuration file differs from its latest package, it has drifted from its expected settings.
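To make that concrete, here’s a toy sketch in Python (the setting names and values are made up purely for illustration) showing drift as a simple mismatch between the expected settings and the settings a system is actually running:

# Hypothetical expected state vs. what a system is actually running.
expected = {"max_connections": 500, "tls_enabled": True, "log_level": "info"}
actual = {"max_connections": 500, "tls_enabled": False, "log_level": "debug"}

# Collect every setting whose running value no longer matches the expected one.
drift = {
    key: (expected[key], actual.get(key))
    for key in expected
    if actual.get(key) != expected[key]
}

for key, (want, have) in drift.items():
    print(f"DRIFT: {key} expected={want!r} actual={have!r}")
# Reports that tls_enabled and log_level have drifted from their expected values.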
The consequences of configuration drift can be serious. It exposes your systems to potential data loss and extended outages. If a router fails and the secondary takes over with a different configuration, it may not function correctly, so a failover situation becomes a failure scenario.
If an engineer fixes a software application in production with a manual configuration change and doesn’t capture it in source control, the next release will overwrite the fix and reintroduce the problem.
As bad as it sounds, configuration drift is a fact of life. Changes happen, and even the most robust management system isn’t faultless.
Configuration drift is one of the primary reasons why disaster recovery and high availability systems fail. While you should make every possible effort to prevent it, you need to put procedures in place for discovering drift and recovering from it when it happens.
How Does Configuration Drift Happen?
The short answer to how configuration drift happens is “when someone subverts or skips the deployment process.” But it’s an oversimplified response and assumes that a sound deployment process is in place.
Manual Configuration Changes
There’s an outage in production. Nearly every engineer’s been in this position, and all of them want to do the right thing: fix it as soon as humanly possible.
Sometimes the fix is a simple configuration change. A port forward is missing on a firewall. You need to toggle an application setting because of a new client. A buffer value is suddenly too small because of increased client traffic.
The fastest way to fix those problems is to update the config and restart the process or let it re-read its configuration. Problem solved!
But now, you need to reflect that change in your failover systems and source control. If that doesn’t happen, you have configuration drift.
Failed or Incomplete Deployments
If a configuration update isn’t deployed to all your production systems, they fall out of sync with their expected state and, depending on the problem, with each other.
This problem can occur when a configuration change isn’t included in a new software release due to a failed merge or a packaging error. Or it can happen when a deployment fails, and some systems receive the change while others do not. Either way, you have configuration drift.
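As a rough sketch of how you might catch an incomplete rollout, the Python below fingerprints each host’s deployed config and flags the hosts that don’t match the released copy. The host names, config contents, and the idea of already having each host’s config in hand are assumptions for illustration; in practice you’d collect the files over SSH, an agent, or your tooling’s reporting:

import hashlib

def fingerprint(config_text: str) -> str:
    """Short, stable fingerprint of a config file's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()[:12]

def out_of_sync(deployed_configs: dict, released_config: str) -> list:
    """Return the hosts whose deployed config differs from the release."""
    expected = fingerprint(released_config)
    return [host for host, text in deployed_configs.items()
            if fingerprint(text) != expected]

# Hypothetical fleet: two hosts got the new config, one missed the rollout.
deployed = {
    "app-01": "timeout=30\nretries=5\n",
    "app-02": "timeout=30\nretries=5\n",
    "app-03": "timeout=10\nretries=3\n",
}
print(out_of_sync(deployed, "timeout=30\nretries=5\n"))  # ['app-03']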
Identifying and Preventing Configuration Drift
So, how do you go about preventing configuration drift? How do you know when it’s happening? These questions go hand in hand because the answers depend on how you manage your environments and how you want to manage changes.
Configuration Drift and Configuration State
We’ve referred to updating your configuration in source control a few times so far. That’s because we hope that you’ve already adopted configuration as code. If you haven’t, it’s time.
The term configuration drift implies that your configuration has an expected state and one or more systems no longer match it. This means that you need an expected state stored somewhere, and there’s no better place than source control.
Avoiding Configuration Drift
The easiest way to manage manual changes is to never make them. This sounds glib, but it’s not a bad approach, either. You can get all or part of the way there with the right tools and processes.
Manual changes to infrastructure aren’t necessary if you manage your system changes via infrastructure as code (IaC) tools like Puppet, Chef, Ansible, or Salt. You can also take steps to make manual changes impossible for all or most engineers by locking down your system permissions.
For application code, you can avoid manual changes with Continuous Integration/Continuous Deployment (CI/CD) pipelines that make deploying changes fast, simple, and reproducible. Pipelines also help you avoid failed deployments since they make it easier to detect errors.
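For example, a pipeline can include a small gate that refuses to ship a release whose packaged config no longer matches source control. This sketch assumes hypothetical paths for the repo copy and the packaged copy; the check itself, not the paths, is the point:

import sys
from pathlib import Path

def configs_match(repo_copy: str, packaged_copy: str) -> bool:
    """True if the packaged config is byte-for-byte identical to the repo copy."""
    return Path(repo_copy).read_bytes() == Path(packaged_copy).read_bytes()

# Hypothetical locations of the source-controlled and packaged config files.
if not configs_match("config/app.conf", "build/package/app.conf"):
    print("Packaged config differs from source control; failing the build.")
    sys.exit(1)
print("Config check passed.")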
Completely eliminating manual changes is probably impossible for most enterprises, though. IaC technology is powerful and mature, but it rarely covers all circumstances. Even though the technology covers a lot of ground, getting the processes and culture in place is difficult.
The key is to make manual changes unnecessary in as many cases as possible. If doing the right thing is easy, engineers will choose it over the more difficult option every time.
Eliminating Drift
If you’re not interested in finding drift before you fix it, you can periodically destroy and rebuild your environments. This may sound like a good option if you have a robust CI/CD pipeline and automated provisioning tools. You may already be doing this.
But, as we said above, configuration drift happens. Wiping out a manual change might reintroduce a problem that someone fixed earlier but never recorded.
Identifying Drift
If you’re interested in capturing drift and evaluating the changes before deleting them, you need a way to proactively look for changes and highlight them. This gives you a chance to see if the drifted changes should be committed back to your master copy.
If your expected configuration state is stored in source control, it’s reproducible. So, with the right tools, you can compare it to what’s running in production and find the changes. The IaC tools we mentioned above can do much of this work for you.
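As a minimal sketch of that comparison, the Python below diffs the copy of a config file in your repo against the one deployed in production; the file paths are hypothetical. IaC tools give you richer versions of the same report through their dry-run modes, such as ansible-playbook --check --diff or terraform plan:

import difflib
from pathlib import Path

def drift_report(expected_path: str, deployed_path: str) -> str:
    """Unified diff between the source-controlled config and the deployed one."""
    expected = Path(expected_path).read_text().splitlines(keepends=True)
    deployed = Path(deployed_path).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(
        expected, deployed, fromfile="source-control", tofile="production"))

report = drift_report("repo/config/app.conf", "/etc/myapp/app.conf")
if report:
    print(report)  # review the diff: commit the fix back, or revert the drift
else:
    print("No drift detected.")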
Managing Configuration Drift
In this post, we started out by defining configuration drift and saw how this common problem can cause serious downtime, data loss, and unhappy customers. Then we looked at its two most common causes, manual changes and deployment problems, and covered several methods for avoiding them and for detecting the configuration issues they often cause.
Configuration drift comes with managing a large set of applications and infrastructure, but you can keep it under control with the right tools and procedures. OpsLevel can help, with tools that track your microservices and integrate with your source control to track changes to configuration files. If you’re ready to start tracking your configuration changes and managing configuration drift, request your OpsLevel demo today.
This post was written by Eric Goebelbecker. Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and Financial Information eXchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).