Behind the scenes: OpsLevel performance enhancements
The running refrain in startup land is "speed wins." We've all heard it countless times, usually referring to lightning-fast execution and getting ahead of the competition. But let's take a moment to appreciate the other kind of speed that can make or break a user's experience: application performance.
Amazon has its famous study showing that every additional 100ms of latency cost it 1% in sales. Google has its own research showing that more than half of users will abandon a site that takes longer than three seconds to load. It doesn't matter how brilliant your software is or how beautifully it's designed: if it's slower than a snail on tranquilizers, your users will be frustrated.
So buckle up, because I’m excited to cover a few performance enhancements we’ve recently made here at OpsLevel.
On our frontend: ~30% faster load time
Let’s not bury the lede. We just made our frontend load much faster and it’s going to continue to get better.
For background, OpsLevel’s main application is written with Ruby on Rails on the backend and Vue.js on the frontend. We previously used webpack, one of the most popular JavaScript bundlers, to bundle all of our frontend JavaScript and assets, and we integrated it with Rails using webpacker.
With webpacker discontinued in 2021, we knew we wouldn’t be able to stay on it long-term and had to find a replacement. We could have switched to Shakapacker and kept webpack under the hood, but since we had to change our build system anyway, we wanted to see what else was out there.
Vite looked attractive: it was created by the author of Vue.js and touted a superior frontend tooling experience. So we migrated our frontend build system from webpack to Vite and tied it into Rails using vite-ruby. It’s been awesome so far!
Vite does more than JavaScript bundling, but bundling is where it really shines for us. It uses rollup.js under the covers, which provides the standard bundler features like tree shaking and minification. We also noticed better out-of-the-box tree shaking than we had before, resulting in smaller bundles with minimal configuration (more on that later).
With Vite, we also get lightning-fast hot module replacement (HMR), which is particularly useful in an application like ours with many interactions and subpages. For our developers, it means less re-navigating back to the component they were working on; changes show up nearly instantly.
After migrating to Vite, we also reviewed how we could optimize our frontend application structure.
Previously with webpack, our frontend was contained in a single entry point, somewhat like a single page application. That meant that whenever you visited a page in app.opslevel.com (e.g., your dashboard), you were downloading the JS and assets for every page in our application (e.g., teams, users, services, etc.).
The original idea behind structuring our application that way was that your first page load would be a bit slower, but subsequent page loads would be fast because you have the application cached in your browser. That made sense in the early days when OpsLevel had a smaller frontend footprint. But over time, as OpsLevel’s surface area has grown, it led to performance issues.
That webpack bundle weighed in at around 2.8 MB gzipped, meaning 2.8 MB transferred over the wire from our servers or CDN to your browser. The largest piece of JavaScript in the bundle, application.js, was approximately 9 MB uncompressed. Browsers parse JavaScript serially and block while doing it (even though speculative parsing and deferrals can make this less painful for multi-bundle applications), so there was real time spent waiting while chewing through all that JS.
After we moved to Vite, we split our frontend into multiple entry points. Today, if you visit a page in OpsLevel, you’re only downloading the assets for that page. That’s helped improve performance in a few ways.
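For a sense of what that looks like in practice, here’s a minimal sketch assuming vite-ruby’s default conventions; the plugin list and entrypoint names are illustrative, not our exact configuration.

```js
// vite.config.js — a minimal sketch assuming vite-ruby's default conventions.
import { defineConfig } from 'vite'
import vue from '@vitejs/plugin-vue'
import RubyPlugin from 'vite-plugin-ruby'

export default defineConfig({
  plugins: [
    // vite-plugin-ruby reads config/vite.json and treats every file under
    // app/frontend/entrypoints/ (e.g. dashboard.js, services.js, teams.js)
    // as its own entry point with its own output chunks.
    RubyPlugin(),
    vue(),
  ],
})
```

Each Rails view then includes only its own entrypoint (e.g., `vite_javascript_tag 'dashboard'`, where the name is illustrative), so visiting your dashboard never downloads the JavaScript for, say, the services catalog.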
On bundle size alone, we’ve gone from 2.8 MB compressed to 880 KB compressed (!!). The largest piece is now ~3 MB instead of 9 MB. Bundling with Vite has also led to us generating more .js files. While that sounds like it would hurt performance, it’s actually been a boon: the generated files are more stable and change less often, which improves the cache hit rate in our CDN.
Check out the bump in our CDN cache hit rate after June 12:
When we had one giant 9 MB (uncompressed) application.js, some part of that file changed every time we touched our UI. Since our team deploys multiple times per day, the file was invalidated just as often and never had much opportunity to stay cached in a CDN for long.
The smaller individual files generated by Vite include things like our CSS or popular JS libraries that change infrequently. Again, though there are more files, their stability means they can remain cached by our CDN and your browser, which results in faster load times.
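As an illustration of the mechanism (and not our exact configuration), Rollup’s manualChunks option in the Vite config is one common way to carve rarely-changing third-party code into its own stable chunk:

```js
// vite.config.js (excerpt) — an illustrative sketch of chunk splitting.
import { defineConfig } from 'vite'

export default defineConfig({
  build: {
    rollupOptions: {
      output: {
        manualChunks(id) {
          // Third-party code changes far less often than app code, so giving
          // it a dedicated chunk keeps its hashed filename (and its CDN cache
          // entry) stable across deploys.
          if (id.includes('node_modules')) {
            return 'vendor'
          }
        },
      },
    },
  },
})
```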
With Vite, we have a lot more frontend optimizations planned, including bundle analysis to improve tree shaking (we recently took highlight.js from 1 MB to ~100 KB just by changing how we import it), deduplicating dependencies, and more uglification/minification to squeeze every last byte out of our frontend bundle.
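To give a concrete flavor of that highlight.js change: the savings come from importing only the core engine and the grammars we actually render, rather than the full build. A sketch along those lines (the specific languages here are illustrative):

```js
// Before: importing the full build pulls in every bundled language grammar.
// import hljs from 'highlight.js'

// After: import the core engine plus only the grammars we need.
import hljs from 'highlight.js/lib/core'
import json from 'highlight.js/lib/languages/json'
import yaml from 'highlight.js/lib/languages/yaml'

hljs.registerLanguage('json', json)
hljs.registerLanguage('yaml', yaml)

export default hljs
```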
Our database: now with faster IO (and improved compliance)
Back when OpsLevel started in 2018, we set up an AWS account, clicked through the console, and spun up a shiny and adorable MySQL instance on RDS.
We kept adding other AWS infra in that single AWS account until around 2020. At that point, we began the typical journey of separating our infra into multiple AWS accounts for security and compliance reasons. We wanted to isolate production vs. non-production workloads and also tighten up access control.
For our stateless infra like EC2 instances and EKS clusters, the migration process was straightforward: spin up new infra, test it, redirect production traffic from the old to the new, and spin down the old infra.
But that one original database, opslevel-production, was a bit of a pain to move.
It’s stateful. It houses all our customer data. We can’t just wind it down without first moving the data.
Since the database actively serves our main application, any data migration would have to be an online migration, and the cutover to the new db would need to happen with minimal downtime. That’s possible, but it’s non-trivial and easy to mess up.
The easiest option would have been some way to simply reassign that existing RDS instance to a different AWS account (along with dependent bits like its security groups and parameter groups). I imagine somewhere deep within AWS there’s a database table associating that RDS instance with an AWS account ID. Alas, AWS doesn’t support changing that association, so we were left doing a full-blown database migration. Bleh.
Since we had no other choice, we put on our best 80s playlist, rolled up our sleeves, and did the migration.
We set up AWS DMS, triple-checked the schemas between the old and new databases, audited the data to ensure all tables were replicating properly, and cut over to our new database on June 10, 2023. (We did the cutover on a Saturday, as OpsLevel has lower usage on weekends.)
One fun fact about our new RDS instance: it has a faster and better disk.
When we set up our original RDS instance in 2018, it was configured with a gp2 EBS volume.
gp2 volumes provide IOPS in proportion to volume size (roughly 3 IOPS per GiB, with a 100 IOPS floor): the larger your volume, the more IOPS you get. gp2 volumes also provide a range of throughput (which is distinct from IOPS), again dependent on the size of the volume. So with gp2, you have exactly one knob to turn to increase performance: your volume size.
gp3 came out in 2020 and lets you “provision IOPS and throughput independently, without increasing storage size”. Though AWS touted that migrating an active volume from gp2 to gp3 was completely seamless and transparent, we had an unpleasant experience migrating a staging db and had since been wary of risking a production outage.
However, this database migration provided us the opportunity to change our disk type to gp3, so we took it. With separate knobs to turn around IOPS, throughput, and volume size, we’ve tuned the new volume to have the necessary headroom in each of those categories.
Based on that tuning, we believe we were previously throttled on throughput with gp2, so we expect queries that fetch or write large amounts of data (e.g., fetching large reports, or our infra catalog writing lots of data while synchronizing large cloud accounts) to be noticeably faster.
Our CI: less waiting for our devs
One final optimization was to our CI system, and it came out of our Vite migration: we were able to shave between 90 and 180 seconds off most CI builds by fixing a latent bug.
While that doesn’t help our users directly, developers spending less time waiting for CI means more time shipping features. Otherwise, you know, this happens:
For context, our CI pipeline uses GitLab CI and one step is to run Rails tests (literally bundle exec rails test).
We’d noticed that some tests would randomly take a long time to complete.
To make matters worse, our CI setup imposes a 3-minute limit on any individual test. A test that runs longer than 3 minutes is marked as failed, which fails CI. While that limit is useful for catching newly introduced slow tests, it also meant that if you hit one of these intermittent slowdowns, you had to re-run CI for your branch.
The condition was flaky, though, and we weren’t able to reproduce it consistently, which made debugging a pain. We would form a hypothesis, change our CI setup, and then, because the failure was intermittent, wait a while to see whether the slowdown stopped happening before probabilistically ruling that hypothesis out. We went through several hypotheses, including resource contention and Rails app startup time.
Finally, we had a bit of a breakthrough when we realized that it was controller tests that were slow. That gave us something more concrete to focus on.
A little more digging and we isolated a sample of a slow run where the slow test wasn’t the first test overall, but it was the first public controller test. That pointed us to a theory about Rails asset compilation.
Rails has an asset pipeline that generates the frontend assets the app needs (JS, CSS, etc.). The first time you run a public controller test that loads the app, Rails tries to load those generated assets from disk. If the assets don’t exist yet, though, Rails compiles them inline. This is where the slowdown was! The first controller test that loaded assets was compiling those assets on the spot, adding 90 to 180 seconds of CI time. (!!)
This also explains a lot of the flakiness: if we re-ran the pipeline, the assets were now present, so no inline compilation occurred and the tests ran normally.
The fix is to precompile assets before the test suite starts so that they’re always available to the tests. Since precompiling is still compiling, we also wanted to cache the generated assets so that re-run pipelines don’t re-incur the compile time. We had a few false starts defining the right cache key in .gitlab-ci.yml, but finally settled on the following:
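(The snippet below is a simplified sketch of the shape of that configuration rather than our exact file; the job name, cache key files, and output path are illustrative and depend on your asset setup.)

```yaml
# .gitlab-ci.yml (sketch)
rails-tests:
  stage: test
  cache:
    key:
      # Key the cache on files that influence the compiled assets, so a
      # re-run (or an unrelated change) can reuse the previous compile.
      files:
        - yarn.lock
        - Gemfile.lock
    paths:
      - public/vite-test/
  before_script:
    # Precompile once up front instead of paying for an inline compile
    # inside the first controller test that touches assets.
    - RAILS_ENV=test bundle exec rails assets:precompile
  script:
    - bundle exec rails test
```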
Finally, we investigated why we weren’t precompiling assets in CI from the get-go. It turns out we originally did precompile assets, but the step was accidentally removed in a recent refactor of our .gitlab-ci.yml that changed how we build our container images. Since removing asset precompilation only slowed tests down some of the time, we didn’t notice the impact right away, and it sat in our CI as a latent bug.
We made several performance improvements to our frontend, our database, and our CI to make OpsLevel faster for our users and our engineers. And we’re not done yet: we have more ideas lined up to make OpsLevel even faster.
Until then, sign up for a free trial if you want to experience OpsLevel yourself.
Thanks to Mark Katerberg and Doug Edey for their contributions to this article.