An SRE's Perspective on Backstage
The open source developer portal project Backstage hasn’t been on the scene for long, but in that time it has gained popularity quickly. The rate at which it’s added GitHub stars is telling.
So what’s behind all the buzz?
I’ve spent time on a team that built out and supported Backstage, so I can share what I’ve learned from working hands-on with it to build a service catalog and internal developer portal.
From SRE to Platform Engineering
Before diving into my specific observations and experiences, we should level-set by reviewing when and why an organization might need tooling like this.
Many well-known engineering orgs are rebranding their Site Reliability Engineering (SRE) organizations into Platform Engineering teams. Platform Engineering isn’t a new concept, but it is the current trendy thing in our corner of software engineering.
We could have entire conversations around:
- the reasons you might want to change team names
- how this either extends or mutates the traditional SRE practices
- how best to staff and empower a Platform Engineering team
- where within an organization this team should sit
- how this team functions: as a product team or an operations team (or both?)
But let’s stay focused and agree that what you call a team is less important than how they work and what their objectives are.
Backstage == Platform Engineering?
So what’s driving Backstage’s rise? Regardless of any name change, Dev-Site-Reliability-Platform-Ops-Engineers are still tasked with making it easier, safer, and faster for developers to ship reliable, quality features to end-users.
Today, the best way to accomplish this at scale is with an internal developer portal. Instead of being roving firefighters or medics, SREs are becoming platform engineers and building their own “internal Heroku”: a single interface for all things developer.
Imagine this portal as a one-stop-shop for engineers to perform basic CRUD operations on their services and underlying cloud, compute, storage, or other relevant resources.
Day one for an onboarding developer could be as simple as logging into the developer portal. There they would find a personalized dashboard of everything they’ll be working on (based on their role). For each service they now help own, this might include:
- the status (e.g. related metrics and observability specific to the application)
- the associated git repositories
- the who/what/when of on-call rotations (and a button to page them directly)
- the documentation
- the operational runbooks
- the location of configuration and secret storage
- etc…
To create a new service, you simply click the + Add Service button in the portal! Then you are provided with a default service name (using the company-configured naming standards).
There are only a few selections left to complete the identity of your new service. These options are presented as populated drop-down lists of approved tooling choices such as language, framework, database, etc. Completing those and clicking Submit takes you to the new page describing your service.
Automation churns in the background providing you with sane defaults and plumbing your new service into all the different platforms and services that you need to get started. Best of all, you’re following your organization’s standards without any extra stress or hassle.
This is wonderful! Imagine that with one click of a button, you get everything spun up in a way that just works, period. No futzing around with creating repos and then visiting every platform or tool in the stack to generate tokens or keys and register your service.
When you account for the upfront time savings and the rework or cleanup avoided by getting everything right the first time, we’re talking about reducing toil by an order of magnitude.
So, this is all pie-in-the-sky, magical unicorns and fairy dust stuff, right!?
After first hearing about this kind of tool, I was drooling and ready to kick the tires on anything that promised it.
It felt like the culmination of an evolutionary period: less mundane, tedious work (low return on investment) and more time spent on impactful innovations for end-users and customers (high return on investment).
Now let’s turn the conversation from the theoretical to the practical!
The Job to be Done
I was on a small, eight-member SRE team chartered to support a rapidly-growing “Digital” organization of over 600 software developers.
The company was going through COVID-induced growing pains as it attempted to better compete in the digital world. As any good team of newly-minted SREs would do, we set out to make data-driven decisions about our “customers” (developers) and their needs, while also advocating for the actual customer experience of our end-users.
The initial problem that we wanted to gather data on was simple in scope: can we identify all of the services in our organization? Based on that data, we wanted to do additional discovery on questions like:
- what programming languages are most prevalent
- what’s the status of these services (legacy vs microservice vs other)
- what’s our volume of incidents
- where and why are incidents happening
- what does each of our services cost to run
- which services are most important (where are single points of failure?)
- etc…
The Spreadsheet Trap
Before beginning the quest, we had to determine which data store we could use to build our collection of service metadata.
Core requirements for this project: the collection needed to be trustworthy as well as easily discoverable and searchable by other teams or leaders in the org. As we looked to see how we were tracking these things currently, we found a single thing being used (in silos) by many different teams–spreadsheets.
When I first started in technology, a wise senior engineer told me a universal truth: “the world is run by spreadsheets.”
But in my experience, this is only because barriers to entry are low. It’s easy to quickly plug a hole with a spreadsheet. On nearly any other dimension, they fall flat. And they certainly don’t provide even a hint of the magical unicorns and fairy dust we touched on above!
Enter Backstage
With spreadsheets out of the question, we turned to Backstage. It’s a tool that you can use to solve quite a few different problems, but its primary purpose is to be “an open platform for building developer portals.” It recently achieved “Incubator” status in the CNCF and has a growing list of collaborators.
Backstage is a foundation that engineers and organizations can build on top of to meet their platform engineering objectives. You can use and write “plugins” that integrate with various other backends to centralize data about the software you run, maintain, and are generally responsible for.
So should you stop reading right here, go download it, and try to get it up and running?
Before you do, let’s walk through the Backstage journey in more detail.
Disclaimer: I am not here to bash, belittle, or put down the Backstage community. Their hard work and Open Source contributions are creating a valuable asset for many in the CNCF community!
Developer Mode: Contributing to Backstage and Making It Your Own
The first thing to understand is that Backstage uses npm and yarn to run locally (some of this setup pain has been automated away with the Backstage CLI). But for those on your team who are not Node developers, having to locally install and set up Node.js and yarn will be something to work through very carefully.
In short, before fully embarking on your Backstage journey, be sure to take into account the experience of your team members. In most cases, you will have SREs or operators who are familiar with running Go or Rust, and this workflow will be new to them.
Second, Backstage’s language of choice for plugin development is TypeScript. Writing plugins and customizations in TypeScript was a significant obstacle for me and my team; ramping up on TypeScript was a non-trivial investment of time.
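To give a sense of that plugin work, here is a minimal sketch of a frontend plugin skeleton, roughly what the Backstage CLI scaffolds for you. Treat it as illustrative: exact APIs vary by Backstage version, and the plugin id and the ServiceInsightsPage component are hypothetical names.

```typescript
// A minimal Backstage frontend plugin skeleton (sketch; APIs vary by version).
import {
  createPlugin,
  createRouteRef,
  createRoutableExtension,
} from '@backstage/core-plugin-api';

// A route ref gives the plugin's root page a stable identity for routing.
export const rootRouteRef = createRouteRef({ id: 'service-insights' });

// The plugin itself: an id plus the routes it exposes.
export const serviceInsightsPlugin = createPlugin({
  id: 'service-insights',
  routes: {
    root: rootRouteRef,
  },
});

// A routable page extension the app can mount, e.g. at /service-insights.
// './components/ServiceInsightsPage' is a hypothetical local React component.
export const ServiceInsightsPage = serviceInsightsPlugin.provide(
  createRoutableExtension({
    name: 'ServiceInsightsPage',
    component: () =>
      import('./components/ServiceInsightsPage').then(m => m.ServiceInsightsPage),
    mountPoint: rootRouteRef,
  }),
);
```

Every integration beyond the stock plugins means more files like this, plus the React components and API clients behind them.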
Operator Mode: Running Backstage
Backstage can be deployed in the typical CNCF ways (Kubernetes, containers, Helm charts), which at first makes standing up the project fairly smooth.
But have a plan to update and deploy Backstage to production before embarking on that journey. Work out your deployment patterns and pipelines to account for the Node way of doing things; otherwise you’ll stumble over messy deploys.
After a year of developing and deploying Backstage internally, it was still one of our most fragile deployments. Each release was manual and caused Backstage downtime for our internal customers while we deployed.
This problem could’ve been overcome with more proactive investments in automating deploys. But it’s a non-trivial amount of work that would’ve come at the expense of more adoption or plugin work.
Building and maintaining working authentication and team structures was also a tough nut to crack.
Working in a large enterprise with multiple AD/LDAP sources meant that we had to manually scrape and aggregate everything. Adding users to our own new AD (stood up specifically for our use in Backstage) and translating and mapping them to the correct teams and services for Backstage was the most difficult and complicated work. Worst of all, at our size, developers were always coming and going from the company, so this work was never finished.
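To make that mapping work concrete, here is a hand-rolled sketch of the kind of translation we had to maintain: turning directory records into Backstage User and Group entities. The entity shapes follow the Backstage catalog model, but the DirectoryRecord type, its attribute names, and the naming rules are assumptions for illustration.

```typescript
// Sketch: translating AD/LDAP directory records into Backstage catalog entities.
// The DirectoryRecord shape is hypothetical; the User/Group entity structure
// follows the Backstage catalog model (backstage.io/v1alpha1).

interface DirectoryRecord {
  samAccountName: string; // e.g. "jdoe"
  displayName: string;    // e.g. "Jane Doe"
  mail: string;
  memberOf: string[];     // raw group DNs from the directory
}

// Map a raw group DN like "CN=payments-team,OU=Teams,DC=corp" to a catalog-safe name.
function groupNameFromDn(dn: string): string {
  const cn = dn.split(',')[0].replace(/^CN=/i, '');
  return cn.toLowerCase().replace(/[^a-z0-9-]+/g, '-');
}

function toUserEntity(record: DirectoryRecord) {
  return {
    apiVersion: 'backstage.io/v1alpha1',
    kind: 'User',
    metadata: { name: record.samAccountName.toLowerCase() },
    spec: {
      profile: { displayName: record.displayName, email: record.mail },
      memberOf: record.memberOf.map(groupNameFromDn),
    },
  };
}

function toGroupEntities(records: DirectoryRecord[]) {
  const groupNames = new Set(records.flatMap(r => r.memberOf.map(groupNameFromDn)));
  return [...groupNames].map(name => ({
    apiVersion: 'backstage.io/v1alpha1',
    kind: 'Group',
    metadata: { name },
    spec: { type: 'team', children: [] },
  }));
}
```

Multiply this across several directory sources, plus the constant churn of people joining and leaving, and the ongoing maintenance cost becomes clear.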
I learned that it takes a dedicated team to really run and get value out of Backstage.
Integrating with Backstage
As with any technology, the sharp edges that can cut you are not in the technology itself, but in the people and processes surrounding it.
The primary way of getting ownership and other metadata into Backstage was for teams to commit a .yaml descriptor file at the root of their repo.
Even after we wrote thorough documentation, teams still resorted to using the UI to submit their services. Often, not all of the required yaml fields were present or configured correctly, so the service failed to show up in the portal. There was a lot of manual toil on our end to get everything corrected and help teams get unblocked.
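One mitigation we discussed was a small pre-submit check so a descriptor with missing fields never reached the catalog in the first place. Below is a hypothetical sketch of that idea; the required fields mirror Backstage’s Component entity format, but the helper itself is illustrative, not something Backstage provides.

```typescript
// Hypothetical pre-submit check for a service descriptor (not a Backstage API).
// The required fields mirror Backstage's Component entity format.

interface Descriptor {
  apiVersion?: string;
  kind?: string;
  metadata?: { name?: string };
  spec?: { type?: string; owner?: string; lifecycle?: string };
}

function missingFields(d: Descriptor): string[] {
  const problems: string[] = [];
  if (!d.apiVersion) problems.push('apiVersion');
  if (d.kind !== 'Component') problems.push('kind (expected Component)');
  if (!d.metadata?.name) problems.push('metadata.name');
  if (!d.spec?.type) problems.push('spec.type');
  if (!d.spec?.owner) problems.push('spec.owner');
  if (!d.spec?.lifecycle) problems.push('spec.lifecycle');
  return problems;
}

// Example: fail a CI or merge-request check when fields are missing.
const problems = missingFields({
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'Component',
  metadata: { name: 'payments-api' },
  spec: { type: 'service', owner: 'payments-team' }, // lifecycle omitted
});
if (problems.length > 0) {
  console.error(`catalog descriptor is missing: ${problems.join(', ')}`);
}
```

Even a simple gate like this, wired into CI, would have shifted some of that correction work off our backs and onto the submitting team’s merge request.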
Another rough integration point was a classic computer science challenge: naming things. We were constantly struggling to implement a standardized naming convention across our large org, or to untangle names that conflicted.
Working with multiple systems and trying to bring that information into Backstage was a painstaking and often manual process. We had a long list of services named "frontend", "backend", "redis", or some other widely-used technology. Producing educational documentation and embedding with teams to help them onboard and troubleshoot took a lot of cycles.
Adoption Issues
An issue that quickly popped up among various leaders in the org was “I already have a spreadsheet that does this.”
But once our team had done the legwork of convincing teams or leaders that “this is the way,” there was no streamlined approach to building customizations for the different software products and services that we needed to get data out of (let alone synthesize and report off of).
As an SRE org, we fully expect to write and maintain some “glue” code. But having to build full-on integrations from scratch for everything that wasn’t a basic cloud service was daunting.
This required being familiar with implementing plugins in TypeScript. We quickly found that we didn’t have the skillset (or the resources to quickly obtain it) to cover all of the integrations we needed, so adoption of Backstage slowed because of this.
Ideally, what would have made adoption (and even integration) go better would have been the ability to get dedicated help. Most discussions and requests for help with integrations or plugins turned into an Issue on the Backstage GitHub. To this day, most of these have not gotten much traction, help, or answers.
A good example of one that was often requested was a full plugin for Datadog. There has been an open Backstage GitHub issue for this since May 2020.
Slicing and Dicing Information
After more than a year, we had the ability to get a service into the catalog, but not much more. The ability to do deep reporting, correlation, or anything meaningful with the data was still on the backlog.
The appetite for contributing to this work and building all of this out didn’t scale beyond our core team. There ended up being a lot of “I’d love to build this integration one day” conversations.
An example of an “I’d Love to Have This” moment was when we dreamed of the ability to determine what services were impacted by the Log4j vulnerability via our Backstage service catalog data. Having this ability would have been magical. We could have helped change the entire organization’s weeks-long focus from identification of the issue to straight-up remediation efforts.
But alas, no magic that day, so we had to resort to manually running a gnarly GraphQL query against our 6,000+ repos in GitLab. Ideally, we’d want to be able to search each repo for the presence of a file or configuration associated with the Log4j vulnerability, but that level of introspection was not yet in the GitLab plugin for Backstage.
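For context, the workaround looked something like the sketch below: paging through GitLab’s GraphQL API and flagging projects that contain build files worth inspecting. The exact query fields (including whether repository.blobs is available) depend on your GitLab version, and the endpoint URL, token handling, and file list are assumptions for illustration.

```typescript
// Sketch: sweeping GitLab projects for build files worth inspecting during Log4j triage.
// Field availability depends on your GitLab version; the URL, GITLAB_TOKEN env var,
// and the list of paths are illustrative assumptions.

const GITLAB_GRAPHQL = 'https://gitlab.example.com/api/graphql';
const QUERY = `
  query ($after: String) {
    projects(membership: true, first: 100, after: $after) {
      nodes {
        fullPath
        repository {
          blobs(paths: ["pom.xml", "build.gradle", "build.gradle.kts"]) {
            nodes { path }
          }
        }
      }
      pageInfo { hasNextPage endCursor }
    }
  }
`;

async function projectsWithBuildFiles(): Promise<string[]> {
  const hits: string[] = [];
  let after: string | null = null;

  do {
    const res = await fetch(GITLAB_GRAPHQL, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${process.env.GITLAB_TOKEN}`,
      },
      body: JSON.stringify({ query: QUERY, variables: { after } }),
    });
    const { data } = await res.json();

    for (const project of data.projects.nodes) {
      if (project.repository?.blobs?.nodes?.length) {
        hits.push(project.fullPath);
      }
    }
    after = data.projects.pageInfo.hasNextPage ? data.projects.pageInfo.endCursor : null;
  } while (after);

  return hits;
}

projectsWithBuildFiles().then(paths => console.log(paths.join('\n')));
```

Even with results in hand, mapping each hit back to an owning team was exactly the lookup the catalog was supposed to make trivial.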
Retro: Is Backstage too Flexible?
Ultimately, after one and a half years of working on Backstage to develop, integrate, and drive adoption across our org, we had only a few hundred of our 1,000+ services cataloged.
While Backstage is nominally “free” software for cataloging your services, there are many challenges and obstacles to overcome.
In describing some of the challenges I experienced above, it’s easy to say “you could’ve solved that and contributed back to the community!” I agree in principle, but therein lies the hardest thing about Backstage: making it scale.
There are several scaling factors to consider. First, I learned that it takes a dedicated team to really run and get value out of Backstage. How many engineers it takes exactly depends on how much customization you need or want to bundle into your initial launch with Backstage.
If you want that full “developer portal” experience, the cost will be front-loaded (in terms of engineering headcount and budget) to get that work done. If you only want to dip your toe into Backstage and use it as a service catalog, you will still have to invest in your team that’s going to own the deployment, customization, and TypeScript/Node facets of Backstage.
Knowing this raises the question, “what initiatives/projects could our team have taken on if we didn’t have to contend with all of the overhead of Backstage?”
The second scaling challenge is that of getting a large organization to adopt it. With a larger org comes more variety and sprawl in technology used. To get full value there will be a lot of plugins to write (and open source!).
If your tech stack is already very consolidated in terms of vendors and tools (e.g. you have only one observability platform, not 3-5 of them), you only have a single plugin to write and could see faster results. If you aren’t consolidated and think “I’ll consolidate first and then simplify my Backstage work,” keep in mind that consolidating vendors and tooling carries a burden of its own (teams doing tech debt work to move platforms, plus SRE headcount to help coordinate and lend assistance), and then there is still the work to write the plugin.
Lastly, does running Backstage yourself really add to your business’ bottom line? As engineers, we inherently love to tinker and build. But at some point there has to be an honest decision made within the classic “build vs buy” framework. Concepts from economics, like opportunity cost and comparative advantage, come to mind.
In my experience, the more time you have under your belt as an engineer, the more you can understand the true cost of taking on a project like this. Then contrast that with the resources your company has to actually accomplish the work.
Decisions, Decisions
I suggest thinking critically about what you most need to accomplish.
Especially if you are a small team aiming to drive impactful changes across your broader engineering org, there are other options to evaluate. Buying software alleviates the overhead of operating, extending, and updating yet another project, and you’ll see a return on your investment significantly faster.
There are paid tools, like OpsLevel, that provide useful features from day one, and many of those features solve open issues and feature requests in the Backstage GitHub repo. No single tool solves every use case you have, aside from the one you build from scratch yourself.
Check with your most senior engineers and see what they think the cost to build and run a project like this would be before embarking on the journey.
Quantifying what your realistic must-haves, nice-to-haves, and don’t-cares are is vital. Ultimately, you’ll likely need to make tradeoffs. Just don’t fall into the trap of biting off more than you can chew.
What’s next?
If you’ve found our discussion interesting, be sure to follow along with our blog. We’ll be going into the details on how OpsLevel is working hard to innovate on its developer portal.
Our vision is a future where Dev-Site-Reliability-Platform-Ops-Engineers don’t have to live with tough tradeoffs or stress about steep opportunity costs in order to innovate and find ways to be successful in their craft.
Wanna try Backstage? Head over to demo.backstage.io to give it a look!
If you’d like to see what OpsLevel is up to, feel free to reach out to us here and let us know! We value feedback and are truly excited to show others our innovations and what we have built!