Migrating OpsLevel search to Elastic
OpsLevel recently upgraded its in-app search capabilities by migrating to Elasticsearch. We invested significant engineering resources into the project because we think search is a foundational capability for any catalog, and any foundational capability is worth doing well. For search, that means providing a fast, comprehensive, and user-friendly experience to the end user.
The first phase of the project is complete, so let’s review what prompted the switch, how we chose Elasticsearch, what we learned, and where we might go next.
The limitations of Search v1
Way back in 2018, OpsLevel’s first search was a fairly straightforward SQL query. Our data model was significantly simpler then, so the query was essentially a wildcard LIKE lookup against a single table.
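In sketch form (the column names below are illustrative, not our actual schema), it looked something like:

```ruby
# Hypothetical sketch of the original single-table search.
# Column names are illustrative, not OpsLevel's actual schema.
term = "%#{query}%" # naive wildcard; production code should also escape LIKE metacharacters

Service.where("services.name LIKE :q OR services.description LIKE :q", q: term)
```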
As OpsLevel’s product surface area grew over time, additional entities, like tags and repos, needed to be searchable, and our query grew with them.
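Again as a sketch (the associations and columns here are illustrative), the expanded query looked roughly like:

```ruby
# Hypothetical sketch of the expanded multi-table search.
# Associations and column names are illustrative.
term = "%#{query}%"

Service
  .left_joins(:tags, :repositories, :aliases)
  .where(
    "services.name LIKE :q OR services.description LIKE :q " \
    "OR tags.key LIKE :q OR tags.value LIKE :q " \
    "OR repositories.name LIKE :q OR aliases.value LIKE :q",
    q: term
  )
  .distinct
```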
As we continued to grow, new challenges emerged with our SQL-based search.
The first issue was maintainability. Though we used Arel to keep the Rails code reasonably well factored, the generated SQL query itself was gigantic and hard to reason about. We were searching on a lot of columns, which made it difficult to change the query or add new associations to it. Internally, it had achieved an almost mythical status for its complexity.
The second issue was performance. As we onboarded new customers with hundreds or thousands of services, we’d occasionally see search queries time out. That wasn’t surprising: the query was essentially doing free-text search on almost every attribute across a bunch of different tables, and MySQL’s optimizer didn’t have any great indexes to work with. :(
The final issue was the quality of the search results themselves. The search was purely a wildcard / free-text match using SQL LIKE statements; there was no concept of relevance, tokenization, or really anything a modern search engine offers.
So we opted to fix it.
In addition to improving the search UX for our larger customers, we realized rearchitecting search was an opportunity to:
- lay the groundwork for a search that would easily extend to objects beyond services
- provide users with context for why particular search results were returned
Options for Search v2
After committing to re-architect and upgrade search, our first step was to assess our options. We considered and evaluated:
- Postgres
- OpenSearch
- Elasticsearch
Postgres
OpsLevel is a Ruby app, so a key consideration for most technology choices we make is: how well do the existing Ruby gems fit our needs? For Postgres, the answer was not very well.
The most popular Postgres-driven search solution, the PGSearch gem, runs in two modes: single-model search and multi-search. Single-model search includes advanced features like ranking, ordering, and highlighting of search results, but those features do not exist in multi-search, and multi-search is what’s required for searching across multiple tables.
So PGSearch wouldn’t make it easy for us to deliver a comprehensive search (across multiple object types) that also provides context to users about search results.
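To illustrate the split, here is a sketch using pg_search’s documented API and simplified models (not code we shipped):

```ruby
# Sketch: the two pg_search modes, with simplified models.
class Service < ApplicationRecord
  include PgSearch::Model

  # Single-model search scope: this mode supports ranking, ordering,
  # and highlighting of results.
  pg_search_scope :search_by_text, against: [:name, :description]

  # Cross-model search has to go through multi-search instead,
  # which lacks those advanced features.
  multisearchable against: [:name, :description]
end

Service.search_by_text("payments") # ranked results, scoped to one model
PgSearch.multisearch("payments")   # searches every multisearchable model
```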
OpenSearch
In parallel with investigating Elasticsearch, we considered OpenSearch, the fork supported by AWS and the open source community.
There are many similarities between the two (OpenSearch is derived from Elasticsearch 7.10.2), and infrastructure and hosting cost considerations also made OpenSearch attractive.
Ultimately, Ruby compatibility was the deciding factor: OpenSearch isn’t fully compatible with the gems and tooling that exist for Elasticsearch, and the alternatives weren’t as robust.
Elasticsearch
In the end, Elasticsearch was the right choice for our needs. It’s purpose-built for search use cases (unlike SQL databases), is highly scalable and customizable, and has quality, battle-tested Ruby gems.
Migrating to Elasticsearch
Overall, we found the migration path to Elastic to be smoother than expected. Some of the highlights:
Indexing Service metadata
Indexing data per service was very straightforward. We were able to use the Rails method as_json as our serializer inside a custom as_indexed_json method, as suggested by the Ruby gem (elasticsearch-rails).
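A sketch of what that looks like (field and association names here are illustrative, not our actual model):

```ruby
# Sketch: serialize a Service for indexing by delegating to as_json,
# via the as_indexed_json hook that elasticsearch-rails looks for.
class Service < ApplicationRecord
  include Elasticsearch::Model

  def as_indexed_json(_options = {})
    as_json(
      only: [:name, :description],
      include: {
        aliases: { only: :value },
        tags:    { only: [:key, :value] }
      }
    )
  end
end
```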
For the initial indexing, we used a Sidekiq job to import via Elasticsearch's bulk API. The worker indexed services in batches of 1000, metering itself out over time.
This incremental and scalable approach let us build up our index in Elastic without overwhelming our primary database with queries for all of the data related to all of our services (names, aliases, tags, descriptions, etc.).
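A minimal sketch of that initial import worker (the queue name is illustrative, and the real worker also paced itself over time):

```ruby
# Sketch: initial bulk import as a Sidekiq job.
class ServiceSearchImportWorker
  include Sidekiq::Worker
  sidekiq_options queue: :low_priority

  def perform
    # Read services in batches of 1000 and send each batch to Elasticsearch
    # through the bulk API, rather than loading everything at once.
    Service.__elasticsearch__.import(batch_size: 1000)
  end
end
```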
We also had an existing pattern of using Wisper callbacks to trigger background jobs after any CRUD activity on Rails models, so near real-time updating of data in Elasticsearch was also easy to set up.
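Roughly, that pattern looks like this (the listener and worker names are hypothetical):

```ruby
# Sketch: a Wisper listener that re-indexes a service after CRUD activity.
class ServiceSearchListener
  def service_updated(service_id)
    ServiceReindexWorker.perform_async(service_id)
  end
end

class ServiceReindexWorker
  include Sidekiq::Worker

  def perform(service_id)
    # Push just this one document to Elasticsearch for near real-time freshness.
    Service.find(service_id).__elasticsearch__.index_document
  end
end
```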
elasticsearch-rails
The elasticsearch-rails gem made setting up our mapping (i.e. our schema definition) in Elasticsearch very simple. It also had a number of methods that made indexing, searching, and retrieving highlights easy. For example, we used the map_with_hit method; its hit object makes the highlight available via hit.highlight.
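For example, a sketch of a mapping plus a highlighted search (field names and the query are illustrative, not our production configuration):

```ruby
# Sketch: a mapping (schema definition) declared alongside the model,
# plus a search that requests highlights.
class Service < ApplicationRecord
  include Elasticsearch::Model

  settings do
    mappings dynamic: false do
      indexes :name,        type: :text
      indexes :description, type: :text
    end
  end
end

response = Service.__elasticsearch__.search(
  query:     { multi_match: { query: "payments", fields: %w[name description] } },
  highlight: { fields: { name: {}, description: {} } }
)

# map_with_hit yields each ActiveRecord record alongside its raw hit,
# which carries the highlight fragments under hit.highlight.
results = response.records.map_with_hit do |service, hit|
  { service: service, highlight: hit.highlight }
end
```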
Of course, not everything was crystal clear on the first pass. We had to sort out some vocab confusion in the various documentation as we were configuring our indexes. For example:
- An Elastic Cloud “deployment” should be thought of as a “cluster”. It contains related instances/nodes.
- An Elastic Cloud “instance” seems to be what Elasticsearch calls a “node”. The Elasticsearch nodes contain shards.
Testing
We considered a variety of testing approaches, but without the bandwidth for our Platform Engineering team to set up an Elasticsearch cluster in our CI pipeline, we elected to take a mocking or stubbing approach.
We’ve previously used WebMock for similar CI use cases, but elected to go with VCR in this instance because we wanted to test our behavior all the way through the Elastic search engine.
We needed to drive out complex behaviors in an unfamiliar domain with Test Driven Development (TDD) and have confidence that the queries we wrote would return the expected search hits, ranking, and ordering. VCR let us write tests that ran quickly in CI, without Elastic itself running in CI, while still putting the entire system under test in our local dev environments.
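A sketch of what a cassette-backed spec looked like (the RSpec setup, cassette name, and expectation are illustrative):

```ruby
# Sketch: a cassette-backed search spec.
require "vcr"

VCR.configure do |config|
  config.cassette_library_dir = "spec/cassettes"
  config.hook_into :webmock
end

RSpec.describe "Service search" do
  it "returns the exact name match first" do
    # Locally this records real Elasticsearch traffic; in CI it replays the
    # cassette, so no cluster has to run in the pipeline.
    VCR.use_cassette("search/exact_name_match") do
      results = Service.__elasticsearch__.search("shopping-cart").records.to_a

      expect(results.first.name).to eq("shopping-cart")
    end
  end
end
```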
It got the job done eventually, but not without some struggle to resolve flaky tests and level-up the team on VCR best practices.
Product Outcomes & Tradeoffs
Moving to Elasticsearch has given us:
- Faster, more reliable search
- Ranking and highlighting on our search results page
- An extensible framework for adding new objects to search
But we did make one clear concession: no more true wildcard search. In Search v1, our SQL-based approach supported this by default. With Elasticsearch, we had the opportunity to be more intentional about our configuration.
We could use a wildcard query or an nGram filter to support this use case, but the wildcard query would mean significantly slower queries, and the nGram filter would cause our index sizes to spike.
Ultimately, we decided prefix matching (a “git” search string would match “GitHub”, “GitLab”, “GitKraken”) was sufficient to support the vast majority of search use cases.
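For reference, that style of prefix matching can be expressed with a query along these lines (a sketch, not our production configuration):

```ruby
# Sketch: prefix-style matching, so "git" matches "GitHub", "GitLab",
# "GitKraken", etc. match_phrase_prefix is one simple way to express it.
response = Service.__elasticsearch__.search(
  query: { match_phrase_prefix: { name: { query: "git" } } }
)

response.records.map(&:name)
```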
Future plans
Now that we’ve completed and GA’d the first phase of our migration to Elasticsearch, we’re excited about all the possibilities ahead of us for further improving our search experience:
- Adding search to our external GraphQL API
- Adding new objects to our search, so users can find information (and OpsLevel functionality) in their catalog faster.
Potential adds include:
- API Docs
- Tech Docs
- Deploy events
- Team metadata (e.g. description or charter)
- Dependencies
- Individual Check Reports
Tired of filtering spreadsheets or toiling in Confluence to find the service metadata or docs you need? Come check out our Elasticsearch-powered service catalog.