This interview was done for our Microservices for Startups ebook. Be sure to check it out for practical advice on microservices. Thanks to Sam and Pawel for their time and input!
Sam Stagg is VP of Engineering at Pusher. Pawel Ledwoń is Platform Lead at Pusher. Pusher's real-time communication and collaboration APIs allow developers to build features like in-app notifications, activity streams, chat, dashboards, collaboration tools as well as multiplayer games.
For context, how big is your engineering team? Are you using Microservices, and can you give a general overview of how you’re using them?
There are 25 engineers in our team. The core Pusher product is a reliable and simple-to-use communication and collaboration API and supporting platform that gives developers everything they need to build scalable interactive apps. This is not based on microservices.
Did you start with a monolith and later adopt Microservices? If so, what was the motivation to adopt Microservices? How did you evaluate the tradeoffs? If not, did you start directly with Microservices? Why?
As mentioned above, our core product does not use microservices, but it isn’t a monolith either. It’s a very highly focused set of services, coupled but not so tightly that we don’t obey service boundaries. However, the service boundaries are mostly defined for performance and scalability reasons instead of being defined by business domains.
The core product is also an architectural feat, proving that you can build a highly reliable, scalable and performant product like Pusher out of a stack consisting of a message bus optimized for performance over stability (Redis) and an event-driven framework for a “slow” dynamic language (EventMachine & Ruby).
Although we’re happy with this architecture for our core product, the stack we use is a bit of a dead end for extension into future product development. And given the continuing commercial growth of Pusher, changing the architecture feels like changing the engine on a rocket ship during lift-off.
So when we decided to invest engineering time in new product development we decided we also needed a new platform. This new platform would help us build new products and reduce the time-to-market for new product experiments. Microservices felt like a good choice for this, as a good architecture based around microservices allows you to rapidly deploy new features which can scale naturally if they are successful, and die gracefully if they aren’t.
How did you approach the topic of microservices as a team/engineering organization? Was there discussion on aligning around what a microservice is?
We prefer to be flexible in our definition of a microservice, as we see microservices as a philosophy underpinning our platform rather than a dogmatic “one true way.” We see these underlying microservice principles as:
● Exploit infrastructure automation and tooling.
● Use virtualized and containerized infrastructure.
● Build distributed, fault-tolerant systems.
● Deploy independent services as components.
We find this more useful than defining the size or limits to functionality of a service.
Did you change the way your team(s) were organized or operated in response to adopting microservices?
We have tried to organise our team to exploit Conway’s Law (organizations are constrained to produce systems that mirror the communications structure of the organization). Thus, we have a small team who builds core shared services for the platform, and services that support specific products are built by the teams who build the products.
How much freedom is there on technology choices? Did you all agree on sticking with one stack or is there flexibility to try new? How did you arrive at that decision?
Although we’re open to experimentation, our stack looks conservative. Teams write most of their code in Go. They often use relational databases.
At our scale, technological diversity carries a big price. If we wanted to introduce a second programming language, it would take at least a year to bring it to parity with our internal Go ecosystem. We’ve developed plenty of tools and experience since we started working on the new platform. Starting from scratch incurs huge costs on product teams.
We can imagine expanding our stack in the future, since technological diversity promotes creativity and, if done well, improves efficiency. However, first we need to grow in either the organisational (more engineers) or infrastructure (more traffic) dimension.
How do you determine service boundaries? What is that discussion like within your team? Can you give some examples? What lessons have you learned around sizing services?
Each product using our new platform is a separate service developed by an independent team. It makes sense to draw boundaries between them to reduce coordination. Product teams have freedom to architect their internal systems to fit their needs.
The platform consists of a set of services running in front of product services. The platform is maintained by three teams. We have one team responsible for infrastructure (Kubernetes and peripherals), one for services (our front proxy, authentication, rate limiting, etc.) and one for analytics.
We don’t have a set of guidelines for our developers. Most of the time, we introduce boundaries for two reasons: (1) when splitting out a dependency makes it easier to reason about components and (2) when it improves scaling and availability of the system.
Proliferation of services is a big problem for small teams. We tend to start with a smaller set of larger components and refactor them only when needed.
How have Microservices impacted your development process? Your ops and deployment processes? What were some challenges that came up and how did you solve them? Can you give some examples?
Most of our challenges come from switching to Kubernetes. It’s completely different from our original stack.
Also, it doesn’t help that the project is young and lacking in tooling, especially in the deployment area.
How have Microservices impacted the way you approach testing? What are lessons learned or advice around this? Can you give some examples?
Testing systems comprising many services has always been problematic. Dependency management in test suites is still awkward. I hope that with the cloud-native ecosystem growing up, communities will develop new testing frameworks aimed at such architectures.
On the new platform, we lean towards acceptance and integration testing. Unit tests are still useful for some abstractions, but for user-facing features they either require mocks that are difficult to write or skip important parts of the system.
How have Microservices impacted security and controlling access to data? What are lessons learned or advice around this? Can you give some examples?
Moving to the new platform helped a lot with security. We have several security layers common to all products. Services can put additional local protections in place.
Kubernetes allows us to implement granular access control to our clusters and enforce network policies. On top of that, we secure communication on the application level using a service mesh.
We also found implementing secure authentication tricky. Plenty of articles describe problems with JSON Web Tokens. One of the first major decisions in the platform was integrating a common token verification system into our front proxy. This way product teams don’t need to implement authentication logic. Services don’t know our customers’ API keys, removing a significant attack vector.
Have you run into issues managing data consistency in a microservice architecture? If so, can you describe those issues and how you went about solving them?
There is one particular problem companies like Pusher face - developers need to implement each feature in at least a dozen SDKs. This is a work multiplier and the time to develop new features is proportional to the number of SDKs. We have to keep things simple for our clients by avoiding any sophisticated coordination logic on that level.
Lack of data consistency increases complexity. We keep our client libraries lean by ensuring APIs they use provide consistent information via simple endpoints.
One such problem is resuming interrupted subscriptions. For example, a client subscribed to a chat loses the connection for 5 minutes. When it comes back, the client needs to retrieve missed messages and re-establish the subscription. It seems like a trivial problem, but out-of-the-box solutions that work at Pusher’s scale are hard to find.
We found a pattern across subscription implementations - services combine a primary data store (e.g. a relational database) and a message bus (e.g. Redis) for real-time updates. Because most message buses offer limited buffering capabilities, clients sometimes need to contact the primary store before resuming the subscription. Because this design is so common, we implemented an abstraction we call “resumable subscription.”
Resumable subscriptions coordinate the primary store and the message bus to provide a simple, consistent API for clients. They prevent data loss and duplication arising from having two disjoint sources of data. Requirements for the primary store and the message bus are low enough for existing open-source projects or hosted solutions.
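The coordination described above can be sketched in Go as follows. This is an illustration under stated assumptions, not Pusher's actual abstraction: the Store type stands in for the primary data store, a Go channel stands in for the message bus, and the names Message, Store, Since and Resume are hypothetical. It also assumes every message is written to the store before it is published, that IDs increase monotonically, and that the client has already re-subscribed to the bus before it queries the store.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical event whose ID is assigned by the primary store.
type Message struct {
	ID   int
	Body string
}

// Store stands in for the primary data store (e.g. a relational database).
type Store struct {
	mu   sync.Mutex
	msgs []Message
}

// Append persists a message and returns it with its assigned ID.
func (s *Store) Append(body string) Message {
	s.mu.Lock()
	defer s.mu.Unlock()
	m := Message{ID: len(s.msgs) + 1, Body: body}
	s.msgs = append(s.msgs, m)
	return m
}

// Since returns every stored message with an ID greater than lastID.
func (s *Store) Since(lastID int) []Message {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []Message
	for _, m := range s.msgs {
		if m.ID > lastID {
			out = append(out, m)
		}
	}
	return out
}

// Resume first replays missed messages from the store, then drains the live
// feed, skipping anything already replayed. It returns the last delivered ID.
func Resume(store *Store, live <-chan Message, lastID int, deliver func(Message)) int {
	for _, m := range store.Since(lastID) { // catch up: prevents data loss
		deliver(m)
		lastID = m.ID
	}
	for m := range live {
		if m.ID <= lastID { // seen during catch-up: prevents duplication
			continue
		}
		deliver(m)
		lastID = m.ID
	}
	return lastID
}

func main() {
	store := &Store{}
	live := make(chan Message, 8)

	store.Append("first")      // ID 1: received before the disconnect
	store.Append("second")     // ID 2: missed while offline
	m := store.Append("third") // ID 3: arrives after re-subscribing
	live <- m                  // so it is in both the store and the bus
	close(live)

	Resume(store, live, 1, func(m Message) {
		fmt.Println(m.ID, m.Body)
	})
}
```

The client only has to remember the ID of the last message it saw; the ordering of "subscribe, then read the store, then drain the bus" is what lets a single comparison on the ID handle both gaps and overlaps.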