This interview was done for our Microservices for Startups ebook. Be sure to check it out for practical advice on microservices. Thanks to Chris for his time and input!
For context, how big is your engineering team? Are you using Microservices and can you give a general overview of how you’re using them?
Our engineering team is about 60. We have a few different teams, including some development teams as well as our SRE team, which runs our production operations. The architecture is composed of a core MTA, or email engine, which is the same as our on-premises Momentum product. We have some REST APIs built into that, as well as SMTP. But then we have a dozen or so microservices, mostly written in Node.js. These microservices are used for anything from account and user management to authentication, reporting and analytics, and other kinds of data management as well.
Did you start with a monolith and later adopt microservices?
We've been in business for well over 10 years, and we actually started off as an on-premises enterprise software company. We've been in the cloud now for about three years or more. Before SparkPost, we sold Momentum, which is our MTA, again on-premises. It's a pretty mature product. It's written in C and Lua, and it's very high-performing, very flexible, very modular. But at the end of the day it's more or less a monolith. So when we were getting ready to launch in the cloud, we really started with that as our platform. We did build some APIs into it, because it has an HTTP server. So core email services are part of this monolith, but coexisting alongside it we have a bunch of other microservices. Unless something is tied directly into the email processing, it's separate from the monolith. So we basically have microservices with one large monolithic service coexisting alongside them.
What was the motivation to adopt Microservices? How did you evaluate the tradeoffs?
Continuous delivery and continuous deployment are really important to us, and that's why microservices are really important to us. Each of the teams can work more independently when working on microservices. The complexity is definitely lower when you're dealing with smaller microservices. Smaller services mean faster build times, faster test runs, faster deployments, and then every deployment is a little bit less risky as well, easier to roll back and things like that. Right now the microservices are deployed on demand, because every individual user story goes out independently of the others. On the monolith side of the platform we do deployments twice a week on a more regular schedule.
How did you approach the topic of microservices as a team/engineering organization? Was there discussion on aligning around what a microservice is?
Back in 2013 microservices weren't the buzzword they are now. What we were mostly focused on was API-first. We wanted to build APIs quickly and have them be independent of the user interface. But as we were looking at this, some team members had either read about microservices or thought they sounded cool. It seemed like they solved a lot of our problems around keeping things really nimble, reducing our cycle times, reducing complexity, and allowing us to build and deploy things fairly independently of each other. Several of us had more of a background in continuous delivery and continuous deployment. This seemed like it would also solve a lot of the problems we had in the past, where larger applications -- monoliths, if you will -- just have some challenges when you're trying to do frequent deployments.
Did you change the way your team(s) were organized or operated in response to adopting microservices?
The team that is developing these microservices started off as one team, and now it's split up into three teams under the same larger group. Each team has some level of responsibility around certain domains and certain expertise -- this is a relatively new change -- but the ownership of these services is not restricted to any one of these teams. What that allows is for any team to work on, say, new features or fixes or production issues relating to any of those services. That keeps them a little more flexible, in terms of new product development as well, because you're not getting too restricted. And that's based on our size as a company and as an engineering team; we really need to retain some flexibility. If we were a lot larger, it would make more sense to have a single, larger team own each of those microservices. Prior to these changes, which we made probably about six months ago, we had two teams, and it was much clearer which of those teams was responsible for each of those microservices. But that created all kinds of challenges when it came to, for example, production on-call rotations or other kinds of support rotations, where somebody hasn't worked with a service or doesn't really know it very well. So it can create some problems there. It's better, I think, to have somewhat broader responsibility for these services, and it gives you a little more flexibility. At least that works for us at this time, where we are as an organization.
How much freedom is there on technology choices? Did you all agree on sticking with one stack or is there flexibility to try new? How did you arrive at that decision?
The decision to standardize on Node.js worked out well for us. It could have gone a different way if Node.js hadn't continued to pick up steam, so it was a good decision that's held up very well for us since then. We haven't had a real desire to change from Node.js. We've done some experimentation with Go, and it's entirely possible we might include Go as an alternative language. Where we've been much more open, in terms of standardization, is on some of the back-end services and databases. We use a lot of different databases, and we change them pretty frequently based on our needs.
Have you broken a monolithic application into smaller microservices? If so, can you take us through that process? How did you approach the task? What were some unforeseen issues and lessons learned?
We haven't added that many additional services or API endpoints into this monolith over the last few years, so really anything new is being done with separate microservices. As I mentioned, continuous delivery and continuous deployment are really important to us, and microservices let each team build, test, and deploy those new pieces independently of the monolith, with every individual user story going out on demand rather than on the monolith's twice-weekly schedule.
How do you determine service boundaries? What was that discussion like within your team? Can you give some examples?
We try to scope the boundaries for these microservices along natural domain boundaries or even the underlying data. For example, we have a suppression microservice, and it keeps track of millions and millions of entries around suppressions, but it's all very focused just around suppression, so there's really only one or two tables there. It's the same thing for other services, like our webhooks or IP pools. Really, in most cases you're talking about one or two tables that the service is managing.
I guess another way to think about this is following the rule of high cohesion and loose coupling. In most cases we really don't want to have one service calling another service; that's generally not a good idea. Certainly in some cases that's fine, but we actually found one case -- probably our most complicated use case -- where it wasn't. We used to have one service called the Users API, and another one called the Accounts API. The Users API had users, API keys, authentication. The Accounts API had accounts, subaccounts, billing information. But the problem was that there were several calls between them. You would do something in accounts and have to look at something in users, or vice versa. So we ended up just merging them together into what we call the Accusers API, and that's now our largest service. Initially we just folded the code together into the same repo. Then we started eliminating those internal API calls between them, so they're just using the same model. And now, as we're moving stuff from Cassandra to DynamoDB, we're also refactoring. The data model could be a lot simpler: going from dozens of tables to a small handful of tables, going with a more record-based approach rather than a quasi-relational approach.
So if you think about an account: rather than having all these different tables and using Cassandra as a quasi-relational database, where every relation actually has a table in Cassandra, you take more of a record-based or document-based approach, where you have a more nested structure in DynamoDB. That's been helpful. It's helped simplify the code. We're going through and refactoring some other things, but it is still the largest service we've got, so we do have challenges around how long it takes to build and deploy. It can take up to 30 minutes to actually build and run through all the tasks for it, where we try to shoot for more of a 10-15 minute boundary. It can slow things down a little bit because this is a larger service, but if you try to break it down into smaller services, you end up with too much coupling between the two and that creates its own problems. So I think we're better off now having combined them. But again, there are some tradeoffs there.
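As a rough illustration of that modeling shift (the table and field names here are invented for illustration, not SparkPost's actual schema), the same account might look like this under each approach:

```javascript
// Quasi-relational modeling: every relation gets its own table,
// so assembling one account means several separate lookups.
const relational = {
  accounts: [{ account_id: 'a1', name: 'Acme' }],
  account_users: [{ account_id: 'a1', user_id: 'u1', role: 'admin' }],
  account_billing: [{ account_id: 'a1', plan: 'starter' }],
};

// Record/document-based modeling: the whole account is one nested item,
// so a single DynamoDB GetItem returns everything.
const document = {
  accounts: [
    {
      account_id: 'a1',
      name: 'Acme',
      users: [{ user_id: 'u1', role: 'admin' }],
      billing: { plan: 'starter' },
    },
  ],
};
```

Collapsing the relations into one nested item trades some ad-hoc query flexibility for simpler code and fewer round trips, which matches the tradeoff described here.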
What lessons have you learned around sizing services?
90% of the time you're talking about one or two tables, and then you have this outlier, which is the Accusers API. I think microservices are just services, and the whole idea of services in general is not a new concept -- going back to CORBA and other things like that, it's a very old concept. Obviously the implementation is a little bit different, but even going back to things like object-oriented design approaches, again, it's the same loose coupling and high cohesion. All the things you've learned in the past about developing services and components hold true for microservices as well.
How have microservices impacted your development process? Your ops and deployment processes? What were some challenges that came up and how did you solve them? Can you give some examples?
I think we do things maybe a little bit differently than other companies, at least some that I've talked to. Our development team is actually responsible for building and managing the deployment pipeline for the microservices. We use a tool called Bamboo, an Atlassian product, for the pipeline orchestration. Our developers then work closely with our site reliability team on monitoring, reliability, scaling, things like that. And both teams have their own on-call schedules.
The nice thing about microservices is that they don't need a very complicated operations process. Honestly, it simplifies things. They're much smaller, and it's a lot easier to manage a lot of small things than some very large things. So again, we don't have a dedicated DevOps team, for example; DevOps for us is more just the way we do things. There's nobody with the title of DevOps engineer. One of the earlier challenges we had with deployments was around automation, and we've gone through different iterations of that. We've settled on Ansible, which is really good for our use case. It solves a lot of problems around continuous deployment: managing the code, database schema changes and migrations, as well as configuration changes. We have to be able to do all those things in a forwards- and backwards-compatible way, so that you don't have any breaking changes and you don't have any downtime. Ansible was really helpful there.
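Forwards- and backwards-compatible schema changes like these are commonly done with an expand/contract pattern. Here's a minimal sketch, with the schema simulated as a plain object; the function names and phases are illustrative, not SparkPost's actual tooling:

```javascript
// Simulated table schema, standing in for a real database.
const schema = { users: ['id', 'email_addr'] };

// Phase 1 ("expand"): add the new column. The currently deployed code
// ignores it, so nothing breaks and rollback stays safe.
function expand(table, newCol) {
  schema[table] = [...schema[table], newCol];
}

// Phase 2 happens in application code: deploy a version that writes to
// both columns and prefers the new one on reads, then backfill old rows.

// Phase 3 ("contract"): once nothing reads the old column, drop it.
// The running app no longer references it, so this is also non-breaking.
function contract(table, oldCol) {
  schema[table] = schema[table].filter((col) => col !== oldCol);
}

expand('users', 'email');
// ...deploy dual-writing code, backfill existing rows...
contract('users', 'email_addr');
```

Each step is individually deployable and reversible, which is what makes zero-downtime continuous deployment possible.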
There's really a lot of responsibility on the development team in terms of building and deploying their own microservices, and I think that's honestly the best way to do it. There are a lot of good tools out there, for example New Relic. So our development team will use New Relic, which is traditionally something a DevOps or SRE team looks at. It's definitely a good tool. I think having development teams be more in touch with and responsible for production is very doable when you scope down your services, and I think you end up with a better result.
How have microservices impacted the way you approach testing? What are lessons learned or advice around this? Can you give some examples?
It's really changed a lot around how we do testing and QA. Historically, we had a separate QA team. First [our QA team] was doing manual testing, and then they started doing their own automated functional testing. But when we started doing microservices, we decided to streamline things, so there is no separate QA team. The development team writes all of its own functional and performance tests, and has its own automated functional tests and automated rollbacks. Much of what I was discussing from an operations perspective -- having the development team be fully responsible for the microservice -- means not having outside dependencies on separate functional groups. It's a lot simpler. I don't think that having separate QA brings a lot of value in our situation. On our platform team, which owns that larger monolith, we do have separate performance test engineers who really help with the benchmarking for that platform; it's a little more complicated, so some of the prior practices continue there. But again, on the microservices side, everything is so much smaller. If you're coming from a more traditional environment, with larger applications and more traditional development processes, I think it's really important to put a lot of that stuff aside and really look at how to simplify things.
Before anybody writes any API changes or a new API, they will update the documentation first and have that change reviewed to make sure it conforms with our API conventions and standards, which are all documented, and to make sure no breaking change is introduced. We'll make sure it conforms with our naming conventions and so forth as well. Once that's reviewed, we can write the API changes; sometimes those are done somewhat concurrently, so the documentation is done at the same time as the actual API changes. That's really helped out. Early on we were really strict around the API governance process, but now that's pretty much ingrained in people, and all the conventions are pretty well known and understood. So that's how we conform to those standards. We've tried other tools in the past. We started off using Apiary, which is a good API documentation tool, and then we migrated to the API Blueprint spec, with a tool called Jekyll to generate static HTML documentation; that's what we have right now, and it gives us more control over the look and feel of the documentation. We also tried a tool called Dredd, which basically takes the entire spec and validates the API against it, but it was really just a lot of trouble -- there were a lot of weird cases -- so we ended up not using it. We also have a lot of monitoring in place, so if something really serious broke, whether through tests or monitoring, we would catch it. That's not to say there haven't been some weird edge cases where somebody fixed a bug and somehow introduced some strange breaking change into the API, but those are very, very rare.
How have microservices impacted security and controlling access to data? What are lessons learned or advice around this? Can you give some examples?
What's probably worth calling out is that we use Nginx as a proxy for all of our APIs, and in Nginx you can run custom Lua. So we've integrated a custom API-key authentication and authorization mechanism there. What that allows us to do is centralize all of that logic. The services themselves don't know anything about authentication and authorization, which really keeps things simple from that perspective; we're able to bake it all into Nginx. There are other tools out there that can do that -- anything that's an API gateway type of product. The advantage is, again, you're really keeping the service itself very simple, and all the access control can be centralized. From a security perspective, APIs in general provide a better security profile than some types of applications. But we do take advantage of things like GitHub's new dependency vulnerability scanning. Since we have everything in GitHub, that's a cool new thing they do. We also scan our code, and we have secure coding practices for things like SQL injection and the like.
Have you run into issues managing data consistency in a microservice architecture? If so, describe those issues and how you went about solving them.
I think one of the challenges, if you are coming from a monolithic service or system, is the idea of these very heavy transactions. When you go to microservices, the biggest thing you need to come to terms with is that every service is responsible only for its own transactions. If you're going to have some sort of transaction at a larger level that spans multiple services, you have to basically assume that some of the services may fail, and bake in some asynchronous retries and such. For example, we'll consume data off queues, process it, and try to update multiple APIs; if one fails, we retry it as needed. You just have to accept that that's the way things work. I mentioned earlier that if one service is trying to update multiple data tables at the same time, there are different ways to deal with that. Certainly some databases will support transactions, but if you're going across databases, like we've done, we've been able to leverage more asynchronous replication using Lambda. We just update DynamoDB, and then there's an asynchronous process which synchronizes the data to other data sources. For us, the data doesn't really need to be 100% consistent, and I think once you give up on the notion that data always has to be 100% consistent, your life gets a lot better. That may be very difficult for some people to come to terms with, but when you're dealing with the cloud and some of these databases and microservices, you have to bake that into your architecture.
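The retry behavior described here can be sketched as a small helper that retries only the downstream updates that failed (the function and variable names are illustrative, not SparkPost's actual code):

```javascript
// Process one queue message by updating several downstream APIs.
// Targets that fail are retried on the next attempt; targets that
// succeeded are not called again, so each sees at most maxAttempts calls.
async function processMessage(msg, targets, maxAttempts = 3) {
  let pending = targets;
  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const results = await Promise.allSettled(pending.map((t) => t.update(msg)));
    // Keep only the targets whose update was rejected this round.
    pending = pending.filter((_, i) => results[i].status === 'rejected');
  }
  if (pending.length > 0) {
    // A real consumer might dead-letter the message here instead.
    throw new Error(`gave up on ${pending.length} target(s)`);
  }
}
```

Note this gives at-least-once semantics on the failing side, so the downstream updates need to be idempotent, which fits the "give up on 100% consistency" mindset above.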
Thanks again to Chris for his time and input! This interview was done for our Microservices for Startups ebook. Be sure to check it out for practical advice on microservices.