Two weeks ago at CodeMaker AI we announced that new, optimized APIs had become available. At the time, we didn't really explain what making that change involved or what the rationale behind it was.
When I started this project back in early 2023, I made a conscious decision to build the entire service on a serverless stack. This would be at least the fourth time in my career that I would be building a production-grade system this way, so it wasn't by any means uncharted territory for me. I was already familiar with the well-known trade-offs, like cost versus service latency, and with issues like cold starts. Cold starts turned out, in the end, not to be our biggest problem.
The initial phase of prototyping and experimenting with the tech stack was successful, and the service came to be by early March 2023. We were happy with the trade-off we had made: development was fast, we were able to deploy multiple times a day, and the entire process was automated. When we finally launched, the biggest benefit came into play: with very little usage in the early days, our first-ever bill was only $35.25 for the entire month. Everything looked great, except for one thing.
That one thing was API latency. Because the product sits in the emerging Generative AI market, its performance has certain characteristics. Model evaluation performance is typically measured in thousands of tokens per second, so it is unsurprising that a request can take anywhere from 2-3 seconds up to 60 seconds. We had also set ourselves an ambitious goal: allowing inputs that exceed the typical limits of the context window, with the limit at that time being 256 KB, since increased to 1 MB. This made it not uncommon to see API request latencies in the range of 30+ seconds.
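For a rough sense of the scale involved, here is a back-of-envelope estimate. The bytes-per-token ratio and throughput figure below are illustrative assumptions, not our measured numbers:

```python
# Back-of-envelope latency estimate for a generation request.
# Both constants are illustrative assumptions, not measured values.
BYTES_PER_TOKEN = 4      # rough average for source code
THROUGHPUT_TPS = 2_000   # assumed model throughput, tokens/second

def estimated_latency_seconds(input_size_bytes: int) -> float:
    """Estimate request duration if processing time scales
    linearly with the number of input tokens."""
    tokens = input_size_bytes / BYTES_PER_TOKEN
    return tokens / THROUGHPUT_TPS

# A 256 KB input is roughly 64k tokens, i.e. ~33 seconds:
print(estimated_latency_seconds(256 * 1024))   # ~32.8
```

Even under these generous assumptions, a full 256 KB input lands right in that 30+ second range.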
All of this would not really be a problem if not for one thing: the AWS API Gateway hard request timeout of 29 seconds. This may be a small detail, but it is an important one, because other serverless offerings on the market, like Google Cloud Run, are not constrained in this way. The API Gateway limitation forced us to build a solution that would work around it. At that time we were still committed to investing in the serverless stack, so we built the workaround on top of the existing one. Unfortunately, this introduced a fatal flaw: latencies at P0 were continuously in the 1-2 second range. Rather than hundreds of milliseconds, requests would take, at minimum, an order of magnitude more. That may not have much of an impact on requests that would take a couple of seconds either way, but in the meantime we had also built other features that were either directly tied to user-triggered actions or aimed at optimizing the end-user experience, and even these optimized versions introduced a delay visible to users.
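To make the trade-off concrete, a common way to work around a hard gateway timeout is to split one long call into an asynchronous submit-and-poll flow. The sketch below is illustrative only, with hypothetical endpoint names and payloads, not a description of our exact implementation:

```python
# Illustrative sketch of a submit-and-poll workaround for a hard
# gateway timeout. Endpoint names and payload shapes are hypothetical.
import time
import requests

API = "https://api.example.com"

def process(payload: dict, poll_interval: float = 1.0) -> dict:
    # 1. Submit the long-running job; this call returns immediately,
    #    well within the 29-second gateway limit.
    task = requests.post(f"{API}/v1/tasks", json=payload).json()
    task_id = task["id"]

    # 2. Poll until the job completes. Even an instant job now pays
    #    for at least one extra round trip plus one poll interval,
    #    which is how minimum (P0) latency creeps up.
    while True:
        status = requests.get(f"{API}/v1/tasks/{task_id}").json()
        if status["state"] == "COMPLETED":
            return status["result"]
        if status["state"] == "FAILED":
            raise RuntimeError(status.get("error", "task failed"))
        time.sleep(poll_interval)
```

The structural cost is visible in the code: every request, however fast the underlying work, now pays for an extra round trip plus the polling interval.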
Fast-forward six months, and the service has grown 21x. Our cost is now on par with running a serverful stack; in fact, switching from serverless to serverful would be a cost optimization for us at this point. We have also collected feedback from our users that latency is an actual pain point for them.
So we re-architected the service, made the changes needed to move to a serverful stack, and updated the integrations. We also used this as an opportunity to introduce a couple of small optimizations. The end result was predictable: our P0 latency decreased by 40%, and our infrastructure cost increased. But one outcome of this entire experiment was a complete surprise.
The completely unexpected consequence of this change was the users' response to it. Our service usage grew by 30% week over week, measured over the entire week after the launch, and the only thing that had changed was API latency. No new feature had been launched at that time. I was well aware of the discovery Amazon made years back, that every 100 ms of added latency dropped their sales by 1%, but this was the first time I had such a clear indication of how performance is perceived by end users and how optimizing it offers an overall better user experience.
This isn't the first time I have optimized a service, but in the past most of those services were used in fully automated use cases, where the client was another service or system. In such cases, the main motivation for optimization was simply cost savings for the service provider. In this case, our short-term cost increased, but at the same time so did the attractiveness of our product, and hopefully user satisfaction. Those are trade-offs that are worth the price.
At this point, we are committed to our new architecture. The switch required some further changes to our build and deployment processes, and from now on it also requires capacity management, but in the end this isn't something that can't be dealt with, and it can be fully automated.
The learning from this entire experience is that a serverless stack can help optimize operating costs. If the service had turned out to be a complete bust, we would not have had to spend thousands of dollars to learn that; the serverless stack was initially 20 times less expensive to operate, until usage actually caught up. It also matters who your end users are: if they are people, how long the API takes to respond will matter to them. We learned a lot through this experience, and we remain committed to making the service even slightly better every day.
CodeMaker AI offers software developers tools and automation for writing, testing, and documenting source code.