
Faster, better, cheaper: lessons from three years of running a serverless weather API

Since launching Pirate Weather in 2021, I have learned an incredible amount about some of the upsides and downsides to running a weather API. Some things have worked well, some have not, but overall, it has been an unparalleled experience. While I’ve written a number of blog posts about specific aspects of this service, I wanted to take some time to put a few stray thoughts together about topics that are important, but not enough to write a whole post about.

Traffic spikes: good news and bad news

Thanks to some incredibly exciting press coverage, Pirate Weather made the internet rounds last year! One of the upsides of an AWS serverless infrastructure is that capacity scales to meet demand; the challenge is that there are often hidden links that limit things. So my first lesson learned is that I should have done more load testing, since traffic doesn't always build up calmly enough to let you find those hidden, limiting links with time to spare.
In particular, I ran into three main issues that contributed to downtime:

  1. Signup flow, and specifically email. It turns out that AWS imposes a default limit of 50 outgoing messages per day unless the quota is raised, so after that was hit, my signup service (the API Gateway Portal) was unable to confirm new accounts, preventing new users from coming on board (see the first sketch after this list).

  2. EFS capacity. Pirate Weather relies on EFS burst credits to keep the shared file system humming along between new model data coming online and forecast requests. EFS credits are generated automatically based on the amount of data stored on the share, and I try to keep this balanced. However, when a burst of traffic hit, the share ran out of credits, throughput slowed way down, and the service was knocked offline. The quick fix was to switch to a provisioned file system, which lets me set a minimum throughput and regenerate the burst credits rapidly, and to set a CloudWatch alarm to alert me when they're running low (see the second sketch after this list). Longer term, this is addressed permanently in version two by switching to different systems for model data ingest (S3) and for serving data (local NVMe), to make sure that one doesn't overwhelm the other.

  3. Finally, a time-honoured lesson that can't be repeated enough: don't rely on single availability zone (AZ) services. When my primary AZ went down last year, I didn't have things set up to run in a second AZ, and so the service went down with it. Luckily, AWS has a ton of tools for making things run in multiple availability zones, and so now my EFS share spans multiple zones, as does the Lambda function that serves responses!
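For the email limit in the first item, a small check makes it easier to spot when a sending quota is close to being exhausted. The sketch below is a minimal example assuming the confirmation emails go out through Amazon SES; the region and the 80% threshold are assumptions, not part of my actual setup. (If the 50-message default is Cognito's built-in email sender, pointing Cognito at SES is the usual way to get a higher limit.)

```python
# Sketch: check how much of the SES daily sending quota has been used.
# Assumes signup emails go out through Amazon SES; the region is a placeholder.
import boto3

ses = boto3.client("ses", region_name="us-east-1")

quota = ses.get_send_quota()
used = quota["SentLast24Hours"]
limit = quota["Max24HourSend"]

print(f"Sent {used:.0f} of {limit:.0f} messages in the last 24 hours")
if limit > 0 and used / limit > 0.8:
    print("Warning: closing in on the daily sending quota")
```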
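For the burst-credit problem in the second item, the alarm piece only takes a few lines of boto3. This is a sketch, not my exact configuration: the file system ID, SNS topic, and threshold are all placeholders to adapt.

```python
# Sketch: a CloudWatch alarm on the EFS BurstCreditBalance metric, so you get
# alerted before the share runs out of credits and throughput collapses.
# The file system ID, threshold, and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="efs-burst-credits-low",
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                   # evaluate in 5-minute windows
    EvaluationPeriods=3,          # three consecutive breaches before alarming
    Threshold=1_000_000_000_000,  # ~1 TB of burst credits left (metric is in bytes)
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:efs-alerts"],
)
```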

Monitoring

Another great aspect of serverless computing is how easy it is to monitor things! I tried out a few different services, but eventually settled on Grafana. They have a generous free tier, and by running their agent on a tiny EC2 instance, it can query my Kong API Gateway containers and grab metrics from them.

This is then paired with their AWS account integration to grab some key metrics from Lambda and EFS, ensuring that I have a complete picture of how everything is running in one dashboard.
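For reference, the same kind of Lambda numbers the AWS integration charts can also be pulled straight from CloudWatch. Here's a minimal boto3 sketch; the function name is a placeholder, not the real one.

```python
# Sketch: pull Lambda invocation and error counts for the last hour from
# CloudWatch, the same data a dashboard integration would chart.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

for metric in ("Invocations", "Errors"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[{"Name": "FunctionName", "Value": "pirate-weather-api"}],  # placeholder name
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Sum"],
    )
    total = sum(point["Sum"] for point in stats["Datapoints"])
    print(f"{metric}: {total:.0f} over the last hour")
```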

Costs

One of the best parts of getting some Pirate Weather press was that Corey Quinn noticed my service and offered to take a look at my setup to see where costs could be optimized! His company, The Duckbill Group, is pretty much unparalleled at deciphering AWS bills and optimizing services. Now, Pirate Weather runs a pretty lean operation (I originally built it to fit in the AWS free tier), so I wasn't sure if he would find much to optimize, with just a few containers, an EFS share, and a Lambda function… I was very wrong.
There were four primary insights that he identified with my setup:

  1. SageMaker. You might be wondering why I had a SageMaker notebook instance running when that's not part of the infrastructure, as was I! It turns out that it's very, very easy to leave these notebooks running, and they're not cheap. I'd mucked around with one of them one day (link to AWS blog) and forgotten to completely disable it, so that was a quick and easy fix (see the first sketch after this list).

  2. CloudWatch Logs. This was another area where it hadn't occurred to me to look for costs, but it turns out this service can add up quickly. I'd added a few print statements to help debug a couple of processing scripts, as well as my main API response, and while those logs were great for fixing issues, they also added up on the bill. By streamlining what was being ingested and shortening the retention period, this line item was trimmed right down (see the second sketch after this list).

  3. API Gateway. AWS API Gateway is a truly phenomenal service. It let me get a public API off the ground and running with almost no setup or overhead and was the perfect solution for years. However, this comes at a (literal) cost, and the downside is that the per-request pricing isn't cheap. This was a longer-term fix, but it ultimately resulted in me moving over to the Kong API gateway, which also allowed the service to scale past 10,000 users!

  4. One for ARM and ARM for one! Speaking of containers, most of the data ingest and processing happens in Fargate containers, and Corey pointed out that ARM containers are cheaper to run and an objectively better choice when possible. Since my processing scripts were all Python, with some light tweaking I was able to move my ingest over to this container type (see the last sketch after this list).
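For the SageMaker surprise in the first item, a periodic sweep for notebook instances that are still running catches the forgotten ones. A minimal boto3 sketch; the region is a placeholder.

```python
# Sketch: list any SageMaker notebook instances that are still running, so a
# forgotten notebook can't quietly keep billing.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

running = sagemaker.list_notebook_instances(StatusEquals="InService")
for notebook in running["NotebookInstances"]:
    print(f"{notebook['NotebookInstanceName']} is still running "
          f"({notebook['InstanceType']})")
    # Uncomment to shut it down on the spot:
    # sagemaker.stop_notebook_instance(
    #     NotebookInstanceName=notebook["NotebookInstanceName"])
```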
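For the CloudWatch Logs costs in the second item, the retention side can be enforced in bulk. This sketch sets a 30-day retention (an arbitrary choice, not necessarily what I use) on any log group that currently never expires.

```python
# Sketch: set a retention period on every never-expiring log group so old
# debug output ages out instead of accumulating forever.
import boto3

logs = boto3.client("logs", region_name="us-east-1")

paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate():
    for group in page["logGroups"]:
        if "retentionInDays" not in group:  # only touch never-expire groups
            logs.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=30,
            )
            print(f"Set 30-day retention on {group['logGroupName']}")
```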
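And for the ARM switch in the last item, the Fargate side mostly comes down to the task definition's runtime platform (the container image also has to be built for arm64). A hedged sketch with placeholder names, image, and sizes:

```python
# Sketch: register a Fargate task definition that runs on ARM (Graviton) by
# setting the runtime platform. Family, image, CPU, and memory are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="weather-ingest",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="4096",
    runtimePlatform={
        "cpuArchitecture": "ARM64",
        "operatingSystemFamily": "LINUX",
    },
    containerDefinitions=[
        {
            "name": "ingest",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/ingest:arm64",
            "essential": True,
        }
    ],
)
```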

All these changes saved on the bill, which is great in and of itself, but he also had some broader insights on improving my setup. There's no one rule for optimizing AWS bills, with way more nuance than I ever imagined, and it was invaluable having him take a look!
