This interview was done for our Microservices for Startups ebook. Be sure to check it out for practical advice on microservices. Thanks to Christian and Stefan for their time and input!
Christian Beedgen is the CTO at Sumo Logic. Stefan Zier, the Chief Architect at Sumo, joined for this interview. Sumo Logic is a cloud-native service that is one of the most powerful machine data analytics services in the world.
For context, how big is your engineering team? Are you using Microservices and can you give a general overview of how you’re using them?
The engineering team is about 100 and two thirds of them are writing code. The idea behind Sumo was to get away from solving log management via enterprise software and turn the product into a service. For almost 10 years before starting Sumo Logic, we struggled with the scaling aspects of log management enterprise software, just processing logs at 5,000-10,000 events per second.
When we started Sumo Logic, it was not even a discussion whether we were going to build the system as a multi-tenant service -- it was obvious.
Eventually, you have the biggest Oracle box you can possibly buy and it just doesn't get any faster after that. Clearly, we needed to build a system that scaled out horizontally. Very quickly, we also learned about other aspects of scaling, such as divide and conquer, where you split the customers over multiple clusters.
We also knew there were going to be parts of the system handling data ingest, other parts handling indexing, and yet other parts handling querying, caching, and aggregating that data. It just felt very obvious that those were just small applications that would run independently and we would tie them together somehow. It was clear it would not work on a single box.
And we went through a great deal of pain and discussion around the complexity and granularity of services. I remember heated but constructive discussions between Stefan and myself in 2012 and 2013. Maybe we built a system that's too complicated? You can't even run it on a laptop anymore. In retrospect, I think it mostly worked out well for us.
Did you start with a monolith and later adopt Microservices?
Long story short, this isn't a "we had a monolith and we broke it up" kind of a story. This is an "always Microservices, always in AWS" kind of story.
How did you approach the topic of Microservices as a team/engineering organization?
We had previously built a similar application as a monolith [elsewhere] and understood the basic ins and outs of it well. Initially, we didn’t get the decomposition of the application into Microservices quite right.
We didn't fully appreciate some of the effects of data gravity and data movement, or of separating the ingest and search paths, which have different scaling behaviors driven by different inputs. It took us probably a good year or two to gravitate towards something that ended up working and scaling.
Did you change the way your team(s) were organized or operated in response to adopting microservices?
We currently subscribe to a model called Product Development Units (PDUs). They’re a pretty strong container around a set of these Microservices as well as a set of back end developers. UI developers float between PDUs. And usually, the PDUs also have a product manager attached.
The UI developers and quality engineering teams kind of float across, but product managers and back end developers are permanently attached to the PDUs. And each PDU owns some of the microservices. They're similar to the two pizza size teams Amazon likes talking about.
PDUs are fairly strong containers around pieces of code and operational knowledge, which has downsides when you need to roll out crosscutting changes. A few examples of things we've had to do recently: going from Java 7 to Java 8, which I'm kind of embarrassed to admit was recent, or migrating to VPC. Getting all the PDUs to execute a crosscutting change for their services is not easy.
The way we've organized that has certainly changed over time and the numbers have changed. When we started out, it was four or five developers to 20 plus microservices. Now it's 80 developers to 40 or 50 microservices.
Initially having each service owned by one person made no sense. Or even one team-- everyone owned everything. But eventually as the team grew, we divided the services up amongst several teams. Not an individual, but a team.
One thing that's really important is that the people building the software take full ownership for it. In other words, they not only build software, but they also run the software and they're responsible for the whole lifecycle.
They need to be given enough liberties to do that as well. It's the whole freedom and responsibility thing. If they're going to be the ones to be woken up by the software they also need to be able to make the decisions as much as possible within some basic guidelines on how to build the software.
It's a federal culture really. You've got to have a system where multiple independent units come together towards the greater goal. That limits the independence of the units to some degree, and they have to agree that there is a federal government of sorts. But within the smaller groups, the idea that they can make as many decisions on their own as possible within guidelines established on a higher level. That's one way to look at it.
How do you determine service boundaries?
That's a tough one. I always thought about this kind of distributed architecture, it's just like object-oriented design on a higher level of abstraction. We've grown up basically being taught object-oriented design and the basic rallying cry has always been, “low coupling and high cohesion.”
That's how you would try to look at classes and interactions between classes, packages, modules, and so forth. And on some level, we now just have another sort of container.
I remember looking at it very much from that perspective. How have we in the past figured out what are the things that need to go in one package, what are the things that need to go in one module, and just lift it up to services.
Some of it is intuition. Grab a whiteboard and let's think real quick what are the main components of the system, break them down, one, two, three, throw in a bunch of arrows to connect the components and start making GitHub repos and then you have a starting point.
I think domain experience has helped in this case, but there are some things we got brutally wrong. We initially separated the indexing and the searching of data. It should have really been the same module or the same service from that perspective because of...well maybe not the same module, but we took the separation too far in terms of different clusters, and the data then has to be...it's a very cumbersome process of exchanging what is essentially the same data between those clusters. It adds latency and all sorts of really nasty implications for your architecture. It took us two or three years to figure out how much this initial decision affected a core part of the architecture.
What was that discussion like within your team? Can you give some examples?
We started with every Microservice having a super hard boundary. I remember that I was really adamant about that. We had separate repos for everything because I was just so burned by people -- even in the old world when we just had one application -- by people not even following basic conventions around what should go into which package and which package should just never call some other packages.
In order to keep the coupling low, I was radical about this. So literally physical boundaries between these things, separate repos, and everything just talks via messages and a bus. We ended up with a mono repo eventually, and also request/response-based APIs in many places. I think we have -- how many modules do we have at this point? Two hundred plus. And everything just being asynchronous messaging caused a lot of people to lose a lot of hair in practice.
What lessons have you learned around sizing services?
Upgrading all services at once is a bad idea. There was a period, where for weeks at a time, we deployed, saw something go off the rails, rolled back, went back to the drawing board. Deployed, saw another thing go off the rails, rolled back, went back to the drawing board. At some point we did this for three, four, five weeks in a row, and eventually we realized that this is just absurd. Did I mention that we really enjoy banging our heads against a wall?
At the time we probably already had on the order of 20 some services. And the team was probably 15 or 20 people. The alternative was to go service-by-service which seemed equally crazy.
How are you going to upgrade one service, restart it, make sure it's okay, upgrade the next service, restart, make sure it's okay, and do that 25 or whatever times, all within a two hour maintenance window? Also not possible.
And so where we landed was in the middle. We invented a concept that we called “assembly groups,” basically smaller groupings of the 25 services. Anywhere between two, three, four, five, six, of these services that get upgraded together.
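As a rough illustration of the assembly group idea, the sketch below upgrades services one group at a time and rolls back only the group that fails its health check. It is not Sumo Logic's tooling: the group names, service names, and the `upgrade`, `healthy`, and `rollback` callbacks are all hypothetical, and the policy of stopping at the first failing group is an assumption made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical grouping: group and service names are illustrative only.
ASSEMBLY_GROUPS: Dict[str, List[str]] = {
    "ingest": ["receiver", "parser"],
    "search": ["indexer", "query", "cache"],
    "account": ["customers", "users"],
}


def upgrade_in_groups(
    groups: Dict[str, List[str]],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Upgrade one group at a time; roll back only the failing group.

    Returns the names of the groups that were upgraded successfully.
    """
    done: List[str] = []
    for name, services in groups.items():
        # Upgrade every service in the group before checking health.
        for service in services:
            upgrade(service)
        if all(healthy(service) for service in services):
            done.append(name)
            continue
        # Undo just this group and stop; earlier groups stay upgraded.
        for service in reversed(services):
            rollback(service)
        break
    return done
```

With groups of two to six services, a failure costs one small rollback instead of a full-system rollback, while the number of verify steps stays well below one per service.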
How have Microservices impacted your development process? Your ops and deployment processes? What were some challenges that came up and how did you solve them? Can you give some examples?
We realized scaling developer onboarding was highly leveraged. In the past, we've not done very well at it, and some people joining the organization had a pretty bad experience.
The way we're setting people up right now is to get paired up with a mentor, so someone who's been here a while, who knows their way around. And we're in the process of creating a curriculum that walks them through a few basic things they need to know.
We had historically written a lot of wikis to document bits and pieces. At this point, I would very strongly advise against wikis in general. Or at least wikis in the form where everyone can write anything at any time.
Content management over time is just super hard. So we're basically back to having a moderator/curator. This is very centralized and all of that and it can feel a little heavy handed. I think wikis can work if you have something on a size and scale of Wikipedia. If you just have a bunch of developers, it's just not enough incentive to consistently keep stuff updated. That's what we found, unfortunately.
We now have a curated GitHub wiki that I curate. I try to keep it limited to the key concepts that you need to understand, as opposed to everything and the kitchen sink, which the wiki eventually becomes. The individual pages are pretty strongly curated and many of them I wrote myself and then got reviewed by other folks.
How have Microservices impacted the way you approach testing?
There's a number of ways in which we commonly test. One is what we call a local deployment. We run most of the services on a laptop, so you get a fully running system. Currently, a laptop with 16GB of RAM is stretched to the limits running that.
So that kind of doesn't really scale. The second variation is what we call a personal deployment. Everyone here has their own AWS account, and our deployment tooling is so fully automated that you can stand up a full instance of Sumo Logic in AWS in about 10 minutes.
The third way we do it is what we call Stubborn, which is a stubbing module we built. Stubborn lets you write stubs of microservices that behave as if they were the real service and advertise themselves in our service discovery as if they were the real service. But they're just a dummy imitation that does something that you have control over. That is much more lightweight than running all of these services.
For example, if you're working on search components, you always need the service that knows about which customers exist. You need the service that knows about users and partitions and all of these things, but you don't really need the real version with all its complexity and functionality. You just need something that pretends like there's a customer here, and there's a user here. We use Stubborn in cases like that.
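The general stubbing pattern described here can be sketched as follows. This is not Stubborn's actual API, which the interview does not show: the `Registry` class stands in for service discovery, and the `customer-service` name, the request shape, and the `search` function are assumptions made up for the example.

```python
from typing import Callable, Dict

# Stand-in for service discovery: maps a service name to a request handler.
Handler = Callable[[dict], dict]


class Registry:
    """Hypothetical in-memory service discovery."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def advertise(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler

    def call(self, name: str, request: dict) -> dict:
        return self._handlers[name](request)


def stub_customer_service(known: Dict[str, str]) -> Handler:
    """Build a stub that answers customer lookups from a canned dict."""

    def handle(request: dict) -> dict:
        customer_id = request["customer_id"]
        if customer_id in known:
            return {"found": True, "name": known[customer_id]}
        return {"found": False}

    return handle


def search(registry: Registry, customer_id: str, query: str) -> str:
    """Code under test: it knows the service name, not who serves it."""
    customer = registry.call("customer-service", {"customer_id": customer_id})
    if not customer["found"]:
        raise LookupError(f"unknown customer {customer_id}")
    return f"running {query!r} for {customer['name']}"


# The stub advertises itself under the real service's name.
registry = Registry()
registry.advertise("customer-service", stub_customer_service({"c1": "Acme"}))
result = search(registry, "c1", "error")
```

Because the caller resolves the dependency by name, the search code runs unchanged whether the name points at the real service or at the stub.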
What are lessons learned or advice around this? Can you give some examples?
I think testing is very, very difficult with microservices overall, especially once you move towards a continuous deployment model. Take something like human exploratory testing: when do you even do it? So we've invested and continue to invest fairly heavily into integration testing, unit testing, and would do a lot more if we had the people to do it, quite honestly.
How have Microservices impacted security and controlling access to data? What are lessons learned or advice around this? Can you give some examples?
We understood that in order to get people to trust us in the cloud, security had to be front and center. And we knew that a lot of customers would need to be compliant with PCI, with HIPAA, with all of these things. We needed to design it to be aware of security from the ground up.
Our deployment tooling is model driven and the model understands things like clusters, microservices, and how they talk to each other. We can generate a pretty tight set of security controls from that model.
On a network level in particular we can generate the firewall rules strictly from the understanding of who talks to whom. This is one of the places where AWS does some of the heavy lifting for us from a security perspective as well.
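To make the idea of deriving firewall rules from a model concrete, here is a minimal sketch. The model format, the service names, and the ports are all hypothetical, and the output is an abstract allow list, not real AWS security group configuration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Rule:
    source: str
    destination: str
    port: int


# Hypothetical model: each service's listening port and who calls whom.
PORTS: Dict[str, int] = {"receiver": 8080, "indexer": 9000, "query": 9100}
CALLS: List[Tuple[str, str]] = [("receiver", "indexer"), ("query", "indexer")]


def allow_rules(
    ports: Dict[str, int], calls: Iterable[Tuple[str, str]]
) -> List[Rule]:
    """Emit one allow rule per declared caller/callee pair."""
    rules = {Rule(src, dst, ports[dst]) for src, dst in calls}
    return sorted(rules, key=lambda r: (r.source, r.destination))


def is_allowed(rules: List[Rule], source: str, destination: str, port: int) -> bool:
    """Default deny: traffic passes only if a generated rule matches it."""
    return Rule(source, destination, port) in rules


rules = allow_rules(PORTS, CALLS)
```

Since the rules come only from declared dependencies, any connection that is missing from the model is denied by default, which is the "tie everything down" property the answer describes.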
Something we realized over time is that the whole cloud thing can actually work out advantageously from a security perspective. You simply can’t do anything by hand, so everything has to be automated. And once you are starting to just script and automate the heck out of everything, it suddenly becomes much more possible to just tie everything down by default.
There’s a lot of other stuff going on here as well, including thinking through encryption and key management -- and that’s where the fact that we went multi-tenant by principle from the start just forced us to invest into the design and not brush over things with oodles of “is ops problem now” one-offs.
And finally, security is not just architecture and code -- in reality there’s a ton of process around security as well. This is something we figured out quickly. Specifically the fact that customers need to see audit reports. You take that and turn it into an advantage, look at what the audits require in terms of controls, and then test against it. Then the auditors come in and they test it again. This is certainly not without pain, but it establishes, if taken seriously, and we do take it seriously, pretty good habits. Of course audits are only one part, architecture is another part, and then you add pentesting, bug bounties and so forth. And you drive that through the entire product development organization, including all the developers.