<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexander Rey</title>
    <description>The latest articles on DEV Community by Alexander Rey (@alexander0042).</description>
    <link>https://dev.to/alexander0042</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1211683%2Fd885642c-35b2-4e1e-aa94-5fcc8334ccd7.jpeg</url>
      <title>DEV Community: Alexander Rey</title>
      <link>https://dev.to/alexander0042</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexander0042"/>
    <language>en</language>
    <item>
      <title>Keeping Pirate Weather Afloat: Inside the AWS Pipeline and the Christmas Eve Outage</title>
      <dc:creator>Alexander Rey</dc:creator>
      <pubDate>Wed, 22 Apr 2026 15:30:47 +0000</pubDate>
      <link>https://dev.to/aws-builders/keeping-pirate-weather-afloat-inside-the-aws-pipeline-and-the-christmas-eve-outage-1g51</link>
      <guid>https://dev.to/aws-builders/keeping-pirate-weather-afloat-inside-the-aws-pipeline-and-the-christmas-eve-outage-1g51</guid>
      <description>&lt;p&gt;Since it's been a while since I last covered Pirate Weather's AWS infrastructure, I thought it was time to write a short update on how everything fits together, and also explain where things have gone wrong. At a high level, Pirate Weather is a Python script that reads Zarr files. These files are created from a series of scripts that run on a schedule, download the data, perform some light processing, and save .zip files for the response script. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion &amp;amp; Processing:&lt;/strong&gt; A suite of Python scripts runs on a precise schedule, triggered by &lt;strong&gt;Amazon EventBridge&lt;/strong&gt;. These scripts are orchestrated by &lt;strong&gt;AWS Step Functions&lt;/strong&gt;, which manage &lt;strong&gt;AWS Fargate&lt;/strong&gt; containers (using our &lt;a href="https://gallery.ecr.aws/j9v4j3c7/pirate-wgrib-python-arm" rel="noopener noreferrer"&gt;custom ARM-based image&lt;/a&gt;). These containers download raw data, perform light processing, and "chunk" the data into Zarr format for lightning-fast retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Strategy:&lt;/strong&gt; The processed Zarr data is initially persisted as zip files on &lt;strong&gt;Amazon S3&lt;/strong&gt;. To minimize latency, an &lt;strong&gt;rclone&lt;/strong&gt; container syncs these files to &lt;strong&gt;autoscaled EC2 NVMe instances&lt;/strong&gt;. 

&lt;ul&gt;
&lt;li&gt;By serving data from local NVMe storage rather than directly from S3, we achieve the IOPS necessary for real-time weather requests.&lt;/li&gt;
&lt;li&gt;Using zip files avoids having a huge number of S3 objects and the associated transaction costs.&lt;/li&gt;
&lt;li&gt;Notably, the time for each model forecast is included in every chunk, which avoids having to rely on metadata. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;ECS service:&lt;/strong&gt; An ECS service coordinates five containers running on the EC2 instances: rclone for syncing, the &lt;a href="https://gallery.ecr.aws/j9v4j3c7/pirate-alpine-zarr" rel="noopener noreferrer"&gt;production FastAPI container&lt;/a&gt;, the development container, the historic data (Time Machine) container, and Kong.

&lt;ul&gt;
&lt;li&gt;The service restarts containers when there are issues, handles placement on the instances, and manages container updates.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Traffic Management &amp;amp; Security:&lt;/strong&gt; Inbound requests are routed through &lt;strong&gt;Amazon CloudFront&lt;/strong&gt; to a &lt;strong&gt;Network Load Balancer (NLB)&lt;/strong&gt;, which forwards them to the EC2 instances. From there, traffic hits a &lt;strong&gt;Kong Gateway&lt;/strong&gt; container, which manages authentication and rate limiting.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Data Persistence:&lt;/strong&gt; The gateway and API layers are supported by &lt;strong&gt;Amazon ElastiCache (Redis)&lt;/strong&gt; for rapid session/rate-limit caching and an &lt;strong&gt;Amazon RDS&lt;/strong&gt; database for persistent metadata and user information.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FPirate-Weather%2Fpirateweather%2Fblob%2Fmain%2Fdocs%2Fimages%2FArch_Diagram_2026.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FPirate-Weather%2Fpirateweather%2Fblob%2Fmain%2Fdocs%2Fimages%2FArch_Diagram_2026.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are quite a few nuances to the various pieces; however, this is the "meat and potatoes" of it. &lt;/p&gt;
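&lt;p&gt;To make the chunked-zip storage described above concrete, here is a minimal sketch using only the Python standard library. The file names and variables are made up for illustration and are not the actual Pirate Weather code; the point is that each chunk bundles its own forecast times with the data, so no separate metadata read is needed at request time.&lt;/p&gt;

```python
import io
import json
import zipfile

# Build a toy "model run" archive. Each chunk stores its own
# forecast-valid times alongside the values, so a reader never
# has to consult separate metadata.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    chunk = {
        "times": ["2025-12-24T00:00Z", "2025-12-24T01:00Z"],
        "temperature": [271.3, 270.8],
    }
    zf.writestr("gfs/chunk_0_0.json", json.dumps(chunk))

# A forecast request opens only the one chunk it needs.
with zipfile.ZipFile(buf, "r") as zf:
    data = json.loads(zf.read("gfs/chunk_0_0.json"))

print(data["times"][0], data["temperature"][0])
```

&lt;p&gt;The real pipeline stores binary Zarr chunks rather than JSON, but the access pattern is the same: one archive per model run, with many independently readable chunks inside it.&lt;/p&gt;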

&lt;h4&gt;
  
  
  December 24, 2025 downtime incident
&lt;/h4&gt;

&lt;p&gt;The four-hour production downtime had two root causes. The first was traced to a configuration conflict between our AWS Step Function definitions and the underlying ECS cluster strategy. While our ECS cluster is architected to run a resilient 50:50 mix of Fargate Spot and Fargate On-Demand instances, the Step Function definition responsible for triggering the ingestion tasks contained an explicit override. As seen in the configuration snippet below, the task was hardcoded to rely exclusively on &lt;code&gt;FARGATE_SPOT&lt;/code&gt;. During a period of high Spot instance reclamation in our availability zone, these ingestion containers were repeatedly terminated by AWS before completion, halting the data pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"CapacityProviderStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"CapacityProvider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FARGATE_SPOT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an issue on its own, but it should have been recoverable. The ingestion failure was amplified by a logic error in the processing scripts, which lacked a fallback mechanism for missing GFS data once the two-day buffer was exceeded, causing forecast generation to fail entirely rather than serve stale or partial data. To resolve this, I have updated all Step Function task definitions to remove the explicit &lt;code&gt;CapacityProviderStrategy&lt;/code&gt; override. The tasks now defer to the ECS cluster’s default capacity provider strategy, ensuring a stable 50:50 distribution between Spot and On-Demand capacity, so even if Spot capacity is volatile, the On-Demand instances ensure the ingestion process completes successfully. I've also added additional logging for failed ingest tasks, so that failures in the underlying data are no longer missed, as well as a check to avoid serving stale model results (&lt;a href="https://github.com/Pirate-Weather/pirate-weather-code/pull/542" rel="noopener noreferrer"&gt;PR #542&lt;/a&gt;). &lt;/p&gt;
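&lt;p&gt;For reference, the tasks now inherit a cluster-level default along these lines. This is an illustrative sketch of the ECS cluster configuration, not the exact production definition; a &lt;code&gt;base&lt;/code&gt; of 1 guarantees at least one On-Demand task even when Spot capacity disappears:&lt;/p&gt;

```json
{
  "capacityProviders": ["FARGATE", "FARGATE_SPOT"],
  "defaultCapacityProviderStrategy": [
    { "capacityProvider": "FARGATE", "weight": 1, "base": 1 },
    { "capacityProvider": "FARGATE_SPOT", "weight": 1 }
  ]
}
```

&lt;p&gt;With no &lt;code&gt;CapacityProviderStrategy&lt;/code&gt; in the Step Function task parameters, ECS applies this cluster default automatically.&lt;/p&gt;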

</description>
      <category>weather</category>
      <category>aws</category>
    </item>
    <item>
      <title>Navigating Stormy Seas: Troubleshooting Pirate Weather’s Container Memory Usage with AWS Container Insights</title>
      <dc:creator>Alexander Rey</dc:creator>
      <pubDate>Thu, 20 Mar 2025 18:23:31 +0000</pubDate>
      <link>https://dev.to/alexander0042/navigating-stormy-seas-troubleshooting-pirate-weathers-container-memory-usage-with-aws-container-31pl</link>
      <guid>https://dev.to/alexander0042/navigating-stormy-seas-troubleshooting-pirate-weathers-container-memory-usage-with-aws-container-31pl</guid>
      <description>&lt;p&gt;Running applications in containers provides numerous benefits—easy deployments, rapid scaling, streamlined updates, and more. However, troubleshooting and maintaining observability can become challenging, especially when containers run in cloud environments. Issues that didn't surface during testing might appear suddenly in production, and without direct access to underlying processes, diagnosing root causes can quickly become complex. This exact scenario happened to us recently at Pirate Weather, where unexpected container failures led to brief and seemingly random downtime incidents.&lt;/p&gt;

&lt;p&gt;As outlined previously in our infrastructure overview, Pirate Weather’s production stack is hosted on Amazon Elastic Container Service (Amazon ECS), utilizing a series of ECS tasks managed by a single ECS service. Our service ensures high availability by maintaining at least two tasks running at all times, each task consisting of three distinct containers. Behind these tasks, we leverage auto-scaling EC2 instances to handle dynamic workloads.&lt;/p&gt;

&lt;p&gt;When we began experiencing intermittent downtime, our initial investigation revealed ECS tasks failing with the infamous "Error 137," indicating that containers were terminated due to exceeding memory limits. Although this was a helpful clue, we still didn't know which specific container was responsible or whether the issue stemmed from sudden memory spikes or gradual leaks.&lt;/p&gt;

&lt;p&gt;Initially, our monitoring setup involved using the Kong Prometheus plugin, but this provided insights only at the API gateway level, not deep within our ECS infrastructure. Seeking a more comprehensive solution, we discovered &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html" rel="noopener noreferrer"&gt;AWS Container Insights&lt;/a&gt;, a built-in feature of Amazon CloudWatch that offers detailed metrics and logs for containers running on Amazon ECS.&lt;/p&gt;

&lt;p&gt;Enabling AWS Container Insights was incredibly straightforward—just a single click in our ECS task definition settings, a quick update to our ECS service, and within minutes, we had detailed container-level metrics on CPU utilization, storage I/O, and crucially, memory usage available directly in CloudWatch. Arguably one of the easiest yet most impactful updates we’ve ever made!&lt;/p&gt;

&lt;p&gt;After collecting data for a couple of days (and observing several more restarts), we revisited CloudWatch. The depth of data was impressive—almost overwhelming at first glance—but by filtering on key parameters like ClusterName, ContainerName, and ServiceName, we quickly identified the culprit. Our "TimeMachine" container, responsible for handling historical data requests, was steadily leaking memory and occasionally experiencing significant spikes. These spikes caused it to exceed its allocated memory, resulting in container termination and subsequent stack downtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FPirate-Weather%2Fpirateweather%2Fraw%2Fmain%2Fdocs%2Fimages%2FCloudWatch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FPirate-Weather%2Fpirateweather%2Fraw%2Fmain%2Fdocs%2Fimages%2FCloudWatch.png" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While a complete, permanent solution requires deeper investigation into the memory leak, AWS Container Insights provided immediate actionable insights, enabling us to implement two effective short-term solutions:&lt;/p&gt;

&lt;p&gt;We modified our task definition to include the &lt;code&gt;--limit-max-requests=25&lt;/code&gt; flag in our Uvicorn Docker command, automatically restarting worker processes within the container to mitigate the slow memory leak.&lt;/p&gt;
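&lt;p&gt;For illustration, the relevant part of the container command looks something like this; the module path and port are hypothetical, but &lt;code&gt;--limit-max-requests&lt;/code&gt; is the flag doing the work:&lt;/p&gt;

```shell
# Recycle each worker process after 25 requests so a slow leak
# can never accumulate for long. Module and port are illustrative.
uvicorn timemachine.main:app --host 0.0.0.0 --port 8000 --limit-max-requests 25
```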

&lt;p&gt;We leveraged ECS's newly available container &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-restart-policy.html" rel="noopener noreferrer"&gt;restart policy&lt;/a&gt;, enabling graceful restarts of our TimeMachine container upon memory overload events. This ensured only the problematic container restarted rather than impacting the entire ECS task.&lt;/p&gt;
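&lt;p&gt;In the task definition, the restart policy is a small per-container stanza; the container name and timing value here are illustrative:&lt;/p&gt;

```json
{
  "name": "timemachine",
  "restartPolicy": {
    "enabled": true,
    "restartAttemptPeriod": 60
  }
}
```

&lt;p&gt;With this in place, ECS restarts just the exited container rather than cycling the whole task.&lt;/p&gt;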

&lt;p&gt;Though further work is needed for a long-term fix, activating AWS Container Insights significantly streamlined our troubleshooting process, demonstrating the immense value this tool offers for quickly diagnosing and resolving container-related issues.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Faster, better, cheaper: lessons from three years of running a serverless weather API</title>
      <dc:creator>Alexander Rey</dc:creator>
      <pubDate>Sun, 31 Mar 2024 16:19:21 +0000</pubDate>
      <link>https://dev.to/aws-builders/faster-better-cheaper-lessons-from-three-years-of-running-a-serverless-weather-api-1fc8</link>
      <guid>https://dev.to/aws-builders/faster-better-cheaper-lessons-from-three-years-of-running-a-serverless-weather-api-1fc8</guid>
      <description>&lt;p&gt;Since launching Pirate Weather in 2021, I have learned an incredible amount about some of the upsides and downsides to running a weather API. Some things have worked well, some have not, but overall, it has been an unparalleled experience. While I’ve written a number of blog posts about specific aspects of this service, I wanted to take some time to put a few stray thoughts together about topics that are important, but not enough to write a whole post about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traffic spikes: good news and bad news
&lt;/h2&gt;

&lt;p&gt;Thanks to some incredibly exciting press coverage, Pirate Weather made the internet rounds last year! One of the upsides of an AWS serverless infrastructure is that capacity scales to meet demand; the challenge is that there are often hidden links that limit things. So, my first lesson learned is that I should have done more load testing, since traffic doesn’t always build up calmly enough to let you find the hidden, limiting links with time to spare. &lt;br&gt;
In particular, I ran into three main issues that contributed to downtime:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The signup flow, and specifically email. It turns out that AWS imposes a default limit of 50 outgoing messages per day unless the quota is raised, and so after that was hit, my signup service (the API Gateway Portal) was unable to confirm new accounts, preventing new users from coming on board. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EFS capacity. Pirate Weather relies on EFS burst credits to keep the shared file system humming along between new model data coming online and forecast requests. EFS credits are generated automatically based on the amount of data stored on the share, and I try to keep this balanced. However, when there was a burst of traffic, the share ran out of credits, throughput slowed way down, and the service was knocked offline. The quick fix was to switch to a provisioned file system, which lets me set a minimum throughput and regenerate the burned credits rapidly, along with a CloudWatch alarm to alert me when they’re running low. Longer term, this is addressed permanently in version two by switching to different systems for model data ingest (S3) and serving data (local NVMe), to make sure that one doesn’t overwhelm the other. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, a time-honoured lesson that can’t be repeated enough: don’t rely on single availability zone (AZ) services. When my primary AZ went down last year, I didn’t have things set up to run in a second AZ, and so the service went down. Luckily, AWS has a ton of tools for making things run in multiple availability zones, and so now my EFS share spans multiple zones, as does my Lambda response!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
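&lt;p&gt;Setting up the burst-credit alarm mentioned above is a single CLI call; the file system ID, threshold, and SNS topic below are placeholders:&lt;/p&gt;

```shell
# Alarm when EFS burst credits (measured in bytes) stay below ~1 TB
# for 15 minutes, leaving time to react before throughput collapses.
aws cloudwatch put-metric-alarm \
  --alarm-name efs-burst-credits-low \
  --namespace AWS/EFS \
  --metric-name BurstCreditBalance \
  --dimensions Name=FileSystemId,Value=fs-12345678 \
  --statistic Minimum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 1000000000000 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```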

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;Another great aspect of serverless computing is how easy it is to monitor things! I tried out a few different services, but eventually settled on Grafana. They have a generous free tier, and by running the agent on a tiny EC2 instance, it can query my Kong API Gateway containers and grab metrics from them. &lt;/p&gt;

&lt;p&gt;This is then paired with their AWS account integration to grab some key metrics from Lambda and EFS, ensuring that I have a complete picture of how everything is running in one dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Costs
&lt;/h2&gt;

&lt;p&gt;One of the best parts of getting some Pirate Weather press was that Corey Quinn noticed my service and offered to take a look at my setup to see where costs could be optimized! His company, The Duckbill Group, is pretty well unparalleled at deciphering AWS bills and optimizing services. Now, Pirate Weather runs a pretty lean operation (I originally built it to fit in the AWS free tier), so I wasn’t sure if he would find much to optimize, with just a few containers, an EFS share, and a Lambda function… I was very wrong. &lt;br&gt;
There were four primary insights that he identified with my setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;SageMaker. You might be wondering why I had a SageMaker instance running when that’s not part of the infrastructure, as was I! It turns out that it’s very, very easy to leave these notebooks running, and they’re not cheap. I’d mucked around with one one day (link to AWS blog) and forgotten to completely disable it, so that was a quick and easy fix. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CloudWatch logs. This was another area where it hadn’t occurred to me to look for costs, but it turns out this service can add up quickly. I’d added a few print statements to help debug a couple of processing scripts, as well as my main API response, and while these logs were great for fixing issues, they weren’t free. By streamlining what was ingested and shortening the retention period, this bill was trimmed right down. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;API Gateway. AWS API Gateway is a truly phenomenal service. It let me get a public API off the ground with almost no setup or overhead and was the perfect solution for years. However, this comes at a (literal) cost: the per-request pricing isn’t cheap. This was a longer-term fix, but ultimately resulted in me moving over to the Kong API gateway, which also allowed the service to scale past 10,000 users!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One for ARM and ARM for one! Speaking of containers, most of the data ingest and processing happens in Fargate containers, and Corey pointed out that ARM (Graviton) containers are the cheaper choice whenever the workload supports them. Since my processing scripts were all Python, with some light tweaking I was able to move my ingest over to this container type. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All these changes saved on billing costs, which is great in and of itself, but he also had some broader insights on improving my setup. There’s no one rule for optimizing AWS bills, with way more nuance than I ever imagined, and it was invaluable having him take a look!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Deploying Kong Gateway (OSS) in Production on AWS Using Serverless Tools</title>
      <dc:creator>Alexander Rey</dc:creator>
      <pubDate>Fri, 17 Nov 2023 13:55:09 +0000</pubDate>
      <link>https://dev.to/aws-builders/deploying-kong-gateway-oss-in-production-on-aws-using-serverless-tools-72a</link>
      <guid>https://dev.to/aws-builders/deploying-kong-gateway-oss-in-production-on-aws-using-serverless-tools-72a</guid>
      <description>&lt;h3&gt;
  
  
  You can bring a data scientist to a database, but you can’t make them an administrator
&lt;/h3&gt;

&lt;p&gt;Weather APIs can be intricate, dealing with a myriad of data flowing in and out. At  &lt;a href="https://pirateweather.net" rel="noopener noreferrer"&gt;Pirate Weather&lt;/a&gt;, my background in data processing equipped me to handle file operations with Python, but delving into the realm of cloud infrastructure was an entirely new challenge. While I am familiar enough with the command line and know the basics of AWS, I started this without experience in networking or databases, and frankly, I wasn't eager to learn. This meant that serverless tools were an ideal solution, letting me abstract away the infrastructure complexities and focus on actual building.&lt;/p&gt;

&lt;p&gt;Initially, AWS API Gateway was an incredible tool, full stop. It let me rapidly deploy Pirate Weather as a functioning API using little more than a Lambda function and a URL, which was exactly what I was looking for at the time. It was easy to set up, highly performant, and allowed just the right amount of customization. However, two years post-launch, Pirate Weather started to come up against the API key limit imposed by AWS API Gateway, and the developer portal I was using had been deprecated. This meant that it was time to find a new solution, and Kong Gateway was exactly what I was looking for!&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Kong?
&lt;/h3&gt;

&lt;p&gt;Why Kong Gateway (OSS)? Five main reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cloud native. Kong is designed to run in containers and has built-in support for AWS Lambda, which meant it fit right in with my existing infrastructure.&lt;/li&gt;
&lt;li&gt;Scalable. Running in containers, Kong is more than capable of handling as many requests as I could throw at it and doesn’t impose any limit on the number of API keys.&lt;/li&gt;
&lt;li&gt;Compatible. My awesome registration provider (&lt;a href="https://www.apiable.io/" rel="noopener noreferrer"&gt;Apiable&lt;/a&gt;) already supported it as a backend, and the API is very straightforward.&lt;/li&gt;
&lt;li&gt;Customization. Kong supports custom plugins, which let me use a URL-based API key as authentication.&lt;/li&gt;
&lt;li&gt;Open Source! It felt wrong to put an open weather API behind a proprietary gateway, so this was an added plus.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;Implementing this was a significant undertaking, and I relied heavily on other published walkthroughs, so I wanted to take the time to explain the process here. At a high level, the architecture is straightforward: a network load balancer in front of a containerized Fargate service interacting with a cache, database, and Lambda function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96mnfqltth9nwwyat7vk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96mnfqltth9nwwyat7vk.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Image
&lt;/h3&gt;

&lt;p&gt;The system starts with a lightly customized &lt;a href="https://gallery.ecr.aws/j9v4j3c7/pirate-kong" rel="noopener noreferrer"&gt;Kong OSS image&lt;/a&gt;, which is built using Docker on an ARM EC2 instance and a very simple Dockerfile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

FROM kong:3.2-ubuntu
USER root
COPY kong-plugin-request-transformer_1251-0.4-1.all.rock /tmp/kong-plugin-request-transformer_1251-0.4-1.all.rock
WORKDIR /tmp
RUN luarocks install kong-plugin-request-transformer_1251-0.4-1.all.rock
USER kong


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Why the custom image? By default, the “Request Transformer” plugin runs after authentication; however, I needed it to run beforehand to extract the API key from a URL string. Specifically, the plugin adds a header from a URI capture: &lt;code&gt;apikey:$(uri_captures['apikey'])&lt;/code&gt;. This required creating an ever-so-slightly modified version of the &lt;a href="https://github.com/Kong/kong/tree/a382576530b7ddd57898c9ce917343bddeaf93f4/kong/plugins/request-transformer" rel="noopener noreferrer"&gt;built-in transformer&lt;/a&gt; with a &lt;a href="https://docs.konghq.com/gateway/latest/plugin-development/custom-logic/#handlerlua-specifications" rel="noopener noreferrer"&gt;priority of 1251&lt;/a&gt; so it would run beforehand. By downloading the request transformer files, I could adjust the priority and &lt;a href="https://github.com/luarocks/luarocks/wiki/Creating-a-rock" rel="noopener noreferrer"&gt;build a new rock&lt;/a&gt;. That produced the file that gets copied over and installed in the Dockerfile.&lt;/p&gt;
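&lt;p&gt;The rebuild itself is only a few commands. This is a rough sketch (paths and version strings are illustrative): copy the stock plugin source, edit the &lt;code&gt;PRIORITY&lt;/code&gt; field in &lt;code&gt;handler.lua&lt;/code&gt;, then pack it as a rock:&lt;/p&gt;

```shell
# Grab the stock request-transformer source from the Kong repo,
# bump its handler priority, and repackage it as a .rock file
# that the Dockerfile above can install with luarocks.
git clone --depth 1 https://github.com/Kong/kong.git
cp -r kong/kong/plugins/request-transformer kong-plugin-request-transformer-1251
# ...edit PRIORITY in handler.lua to 1251 and write a matching rockspec...
luarocks make kong-plugin-request-transformer_1251-0.4-1.rockspec
luarocks pack kong-plugin-request-transformer_1251 0.4-1
```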

&lt;h3&gt;
  
  
  Database
&lt;/h3&gt;

&lt;p&gt;Now that I had a functioning image, the AWS infrastructure fell into place around it. Kong stores everything in a Postgres database, and while I could have spun up my own, it was easier to rely on the AWS option, Aurora Postgres. Since the load is relatively small, I’m using the smallest Serverless v2 option, which uses 0.5 Aurora Capacity Units. By running this on RDS, I don’t have to worry about database updates, maintenance, or backups, and it will scale if there’s ever a wave of traffic.&lt;/p&gt;

&lt;p&gt;Kong can also rely on Redis for caching API calls or authentication. While I’m not caching any data yet, caching authentication quotas did produce a slight performance improvement, and it allows quotas to stay in sync when multiple Kong instances are running or one gets restarted. For this, I spun up a simple, single-node t4g instance, which provides a primary endpoint that Kong uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Container
&lt;/h3&gt;

&lt;p&gt;With the AWS infrastructure in place, it was time to get to Kong. At its core, it is an ECS service that runs this &lt;a href="http://pirateweather.net/en/latest/Blog/KongGatewayInfrastructure/#container" rel="noopener noreferrer"&gt;task definition&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This calls the public image with a few environment variables referencing the database and cache. The service is designed to run as many copies of this task as required, scaling them up and down as needed. In terms of networking, the containers are assigned a public IP address, which allows the image to be pulled, and are otherwise attached to a private subnet. I also had to configure security groups to allow access to the database, cache, and Lambda function.&lt;/p&gt;

&lt;p&gt;With the basic Kong container in place, it was time to get traffic to and from it! At the front end, a Network Load Balancer interfaces between the broader internet and however many Kong containers are currently running. It’s configured to distribute traffic evenly between functioning containers, and Route53 sets the DNS for api.pirateweather.net to this load balancer or a fallback (more on that in a minute). For the back end, I configured Kong to pass requests to a Lambda function using its built-in plugin and a &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc-endpoints.html" rel="noopener noreferrer"&gt;Lambda endpoint&lt;/a&gt; in my VPC. I did all the Kong setup using the wonderful &lt;a href="https://github.com/pantsel/konga" rel="noopener noreferrer"&gt;Konga&lt;/a&gt; running in a Docker container on a management EC2 VM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fallback
&lt;/h3&gt;

&lt;p&gt;While this setup is designed to be resilient and reliable, I’m always thinking about possible failures, and Route53 has a tool built specifically for this! By adding a healthcheck, Route53 will either return the DNS for the Network Load Balancer or, in case there’s an issue with my Kong setup, fall back to an AWS API Gateway HTTP API. This does not provide quota management or allow for new registrations but keeps things running at a baseline level.&lt;/p&gt;

&lt;p&gt;Compared to the regular AWS API Gateway, this setup has advantages and disadvantages. The always-on database, container, and cache result in a slightly higher bill than before; however, this should remain relatively flat with increased usage. It’s equally fast, provides a wider range of customisation options, and scaled past the 10,000th key without missing a beat! In six months of production this setup has been rock solid, easily handling more than 20 million requests per month.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apiable
&lt;/h3&gt;

&lt;p&gt;Sitting alongside all of it is &lt;a href="https://www.apiable.io/" rel="noopener noreferrer"&gt;Apiable&lt;/a&gt;. This service was exactly what I was looking for to replace my deprecated AWS developer portal, handling registration, signups, and quota plans. Their service interfaces with my Kong admin API via the same port/load balancer but a &lt;a href="https://docs.konghq.com/gateway/latest/admin-api/" rel="noopener noreferrer"&gt;different endpoint within Kong&lt;/a&gt;, so all I had to do to connect the two was create an admin consumer API key and point the URL to the correct place.&lt;/p&gt;

</description>
      <category>kong</category>
      <category>ecs</category>
      <category>aws</category>
      <category>weather</category>
    </item>
  </channel>
</rss>
