While almost every engineering team has considered moving to microservices at some point, the advantages they bring come with serious trade-offs. At QCon London, Alexandra Noonan described how Segment broke up its monolith into microservices and then, a few years later, went back to a monolithic architecture. In Noonan’s words, “If microservices are implemented incorrectly or used as a band-aid without addressing some of the root flaws in your system, you’ll be unable to do new product development because you’re drowning in the complexity.”
Microservices were first introduced to address the limited fault isolation of Segment’s monolith. However, as the company became more successful and integrated with more external services, the operational overhead of supporting microservices became too much to bear. The decision to move back to a monolith came with a new architecture that took into account the scaling pain points that came with company growth. While making sacrifices in modularity, environmental isolation, and visibility, the monolith addressed the major issue of operational overhead and allowed the engineering team to get back to new feature development.
Noonan explained several key points in the evolution of Segment’s architecture. The problems faced, and the decisions made at the time, will sound familiar to any experienced software engineer. Only with the advantage of hindsight is it clear which decisions could have been better. Noonan walked through each major decision point on a timeline, noting the pros and cons of each state of the system architecture.
In 2013, Segment started with a monolithic architecture. This provided low operational overhead, but lacked environmental isolation. Segment’s functionality is based around integrating data from many different providers. In the monolith, problems connecting to one provider destination could have an adverse effect on all destinations and the entire system.
The lack of isolation within the monolith was addressed by moving to microservices, with one worker service per destination. Microservices also improved modularity and visibility throughout the system, allowing the team to easily see queue lengths and identify problem workers. Noonan pointed out that visibility can be built into a monolith, but they got it for free with microservices. However, microservices came with increased operational overhead and problems around code reuse.
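The isolation problem can be illustrated with a toy model. The sketch below is not Segment's code: the destination names, the retry budget, and the "one tick per delivery attempt" cost model are all assumptions made for illustration. It compares a single shared queue, where retries for a failing destination delay every event behind them, with one queue and worker per destination.

```python
from collections import defaultdict

MAX_ATTEMPTS = 3  # hypothetical retry budget per event


def delivery_ticks(queue, failing):
    """Drain one FIFO queue with a single worker.

    Each delivery attempt costs one tick. Events for a failing
    destination burn MAX_ATTEMPTS ticks and are then dropped.
    Returns {event_id: tick at which the event was delivered}.
    """
    tick = 0
    delivered = {}
    for event_id, destination in queue:
        if destination in failing:
            tick += MAX_ATTEMPTS  # retries hold up everything behind
            continue
        tick += 1
        delivered[event_id] = tick
    return delivered


def shared_queue(events, failing):
    """Monolith-style model: every destination shares one queue."""
    return delivery_ticks(events, failing)


def per_destination_queues(events, failing):
    """Microservice-style model: one queue and worker per destination."""
    queues = defaultdict(list)
    for event in events:
        queues[event[1]].append(event)
    delivered = {}
    for queue in queues.values():
        delivered.update(delivery_ticks(queue, failing))
    return delivered


# Hypothetical traffic: dest_a is down, dest_b is healthy.
events = [(1, "dest_a"), (2, "dest_b"), (3, "dest_a"), (4, "dest_b")]
failing = {"dest_a"}

print(shared_queue(events, failing))            # {2: 4, 4: 8}
print(per_destination_queues(events, failing))  # {2: 1, 4: 2}
```

In the shared queue, the healthy destination's events are delivered at ticks 4 and 8 because they wait behind the failing destination's retries. With separate queues they are delivered at ticks 1 and 2, and the length of each queue shows directly which worker is in trouble.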
A period of hypergrowth at Segment, around 2016-2017, added over 50 new destinations, about three per month. Having a code repository for each service was manageable for a handful of destination workers, but became a problem as the scale increased. Shared libraries were created to provide behavior that was similar for all workers. However, this created a new bottleneck, where changes to the shared code could require a week of developer effort, mostly due to testing constraints. Creating versions of the shared libraries made code changes quicker to implement, but reversed the benefit the shared code was intended to provide.
Noonan pointed out the limitations of a one-size-fits-all approach to their microservices. Because there was so much effort required just to add new services, the implementations were not customized. One auto-scaling rule was applied to all services, despite each having vastly different load and CPU resource needs. Also, a proper solution for true fault isolation would have been one microservice per queue per customer, but that would have required over 10,000 microservices.
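To see why a single auto-scaling rule fits poorly across services with very different needs, consider the following sketch. It assumes a simple proportional rule (scale the instance count by the ratio of observed to target CPU utilization, then clamp to fixed bounds). The rule, the thresholds, and the two example services are hypothetical and are not taken from Segment's configuration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    cpu_target: float   # desired average CPU utilization per instance (0 to 1)
    min_instances: int  # lower bound applied to every service
    max_instances: int  # upper bound applied to every service


def desired_instances(rule, current_instances, cpu_utilization):
    """Return (needed, granted) instance counts.

    `needed` is what proportional scaling asks for; `granted` is that
    value clamped to the rule's shared bounds.
    """
    needed = math.ceil(current_instances * cpu_utilization / rule.cpu_target)
    granted = max(rule.min_instances, min(rule.max_instances, needed))
    return needed, granted


# One hypothetical rule shared by all destination workers.
shared_rule = ScalingRule(cpu_target=0.5, min_instances=2, max_instances=10)

# A busy destination: 10 instances running at 90% CPU.
print(desired_instances(shared_rule, 10, 0.9))   # (18, 10)

# A quiet destination: 2 instances running at 5% CPU.
print(desired_instances(shared_rule, 2, 0.05))   # (1, 2)
```

Under these assumed numbers, the busy worker wants 18 instances but is capped at 10, while the quiet worker needs only one instance but is held at the minimum of two. Tuning bounds and targets per service would avoid both outcomes, which is the kind of customization the team could not afford given the effort each new service already required.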
The decision in 2017 to move back to a monolith considered all the trade-offs, including being comfortable with losing the benefits of microservices. The resulting architecture, named Centrifuge, is able to handle billions of messages per day sent to dozens of public APIs. There is now a single code repository, and all destination workers use the same version of the shared library. The larger worker is better able to handle spikes in load. Adding new destinations no longer adds operational overhead, and deployments only take minutes. Most important for the business, they were able to start building new products again. The team felt all these benefits were worth the reduced modularity, environmental isolation, and visibility that came for free with microservices.
QCon attendees discussing the presentation sounded like typical engineers joining a project with a long history. Quick remarks such as, “Well, obviously you shouldn’t do what they did,” were countered with voices of experience pointing out that most decisions are made based on the best information available at the time. One of the key takeaways was that spending a few days or weeks to do more analysis could avoid a situation that takes years to correct.