<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sagar Trivedi</title>
    <description>The latest articles on DEV Community by Sagar Trivedi (@sagart).</description>
    <link>https://dev.to/sagart</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1129509%2F26e90c9b-180a-4f87-b40e-1718ce8c08b3.jpeg</url>
      <title>DEV Community: Sagar Trivedi</title>
      <link>https://dev.to/sagart</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagart"/>
    <language>en</language>
    <item>
      <title>Building CRD Operators Faster with Tilt + AI Agents: A Practical Feedback Loop</title>
      <dc:creator>Sagar Trivedi</dc:creator>
      <pubDate>Sun, 01 Feb 2026 14:58:25 +0000</pubDate>
      <link>https://dev.to/sagart/building-crd-operators-faster-with-tilt-ai-agents-a-practical-feedback-loop-1ipl</link>
      <guid>https://dev.to/sagart/building-crd-operators-faster-with-tilt-ai-agents-a-practical-feedback-loop-1ipl</guid>
      <description>&lt;p&gt;Kubernetes CRD operators and webhooks are powerful, but the development loop is slow. You write Go code, rebuild the image, load it into a cluster, deploy manifests, test, repeat. That friction is a big reason DevOps engineers skip writing operators or webhooks, even when they know it would solve real problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Tilt Matters for CRDs and Webhooks
&lt;/h2&gt;

&lt;p&gt;Tilt is built for tight local feedback loops. Instead of “build image → push → deploy” on every change, Tilt can live‑update a running container: sync a freshly compiled binary into the pod and restart the process in seconds. For operators and webhooks, this is game‑changing because most changes are in Go code, not Dockerfile or base image. Learn more about Tilt and live updates here: &lt;a href="https://tilt.dev" rel="noopener noreferrer"&gt;https://tilt.dev&lt;/a&gt; and &lt;a href="https://docs.tilt.dev/live_update_reference.html" rel="noopener noreferrer"&gt;https://docs.tilt.dev/live_update_reference.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Local &lt;code&gt;go build&lt;/code&gt; produces the manager binary&lt;/li&gt;
&lt;li&gt;Tilt syncs that binary into the running manager pod&lt;/li&gt;
&lt;li&gt;Tilt restarts the process&lt;/li&gt;
&lt;li&gt;You immediately see new behavior in logs or events&lt;/li&gt;
&lt;/ol&gt;
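
&lt;p&gt;A minimal Tiltfile sketch of that loop might look like the following (resource names, paths, and the dev Dockerfile are placeholders for your project):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch only: assumes a kubebuilder-style layout; adjust names and paths.
load('ext://restart_process', 'docker_build_with_restart')

# 1) Compile the manager binary locally on every Go change
local_resource(
  'manager-compile',
  'CGO_ENABLED=0 GOOS=linux go build -o bin/manager ./cmd/main.go',
  deps=['api', 'internal', 'cmd'])

# 2 + 3) Sync the fresh binary into the running pod and restart the process
docker_build_with_restart(
  'controller:dev', '.',
  dockerfile='Dockerfile.dev',
  entrypoint='/manager',
  only=['bin/manager'],
  live_update=[sync('bin/manager', '/manager')])

# Deploy manifests from a kustomize dev overlay
k8s_yaml(kustomize('config/dev'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;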

&lt;p&gt;This is the fastest loop you can get without shifting the operator entirely to a local run (which is harder for admission webhooks and TLS).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Agents Usually Miss This
&lt;/h2&gt;

&lt;p&gt;Most AI agents that generate Kubernetes code stop at scaffolding. They can write a CRD schema, generate controller logic, and produce YAML, but they don’t integrate a developer feedback loop. The result is “code that exists” rather than “code you can iterate on quickly.”&lt;/p&gt;

&lt;p&gt;That gap matters because &lt;strong&gt;operators require iteration&lt;/strong&gt;. The first pass is almost always wrong: schema needs tweaks, webhook invariants need correction, RBAC is too broad, or reconcile logic misses edge cases. Without a fast loop, every fix is expensive, and agents aren’t incentivized to optimize the human-in-the-loop experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Fast, Reliable Feedback Is Essential for Agents
&lt;/h2&gt;

&lt;p&gt;An agent’s output quality depends on its ability to see consequences. If the agent can immediately deploy, observe, and validate, it can self‑correct.&lt;/p&gt;

&lt;p&gt;For CRDs and webhooks, the feedback must include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CRD registration success&lt;/li&gt;
&lt;li&gt;Webhook configuration installed and serving&lt;/li&gt;
&lt;li&gt;Manager pod running with the new binary&lt;/li&gt;
&lt;li&gt;Admission failures or reconcile errors surfaced quickly&lt;/li&gt;
&lt;/ul&gt;
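
&lt;p&gt;One way to surface those signals quickly is a handful of kubectl checks (the resource names here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CRD registered?
kubectl get crd myresources.example.com

# Webhook configurations installed?
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations

# Manager pod running with the new binary?
kubectl -n operator-system get pods

# Admission failures or reconcile errors?
kubectl -n operator-system logs deploy/operator-controller-manager -f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;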

&lt;p&gt;A &lt;strong&gt;fast, reliable feedback loop&lt;/strong&gt; makes the agent better and makes the user trust the agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Repo: K8s Operator Creator Agent
&lt;/h2&gt;

&lt;p&gt;This project is an agent‑first workspace that bakes the feedback loop into the workflow. It combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubebuilder + Kustomize for standard operator scaffolding&lt;/li&gt;
&lt;li&gt;Tilt for live‑update and rapid iteration&lt;/li&gt;
&lt;li&gt;kind for dev‑only cluster safety&lt;/li&gt;
&lt;li&gt;Optional MCP validation or kubectl verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent does a few things in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Asks for the CRD purpose, fields, validations, and webhook type&lt;/li&gt;
&lt;li&gt;Generates code and manifests&lt;/li&gt;
&lt;li&gt;Deploys to kind&lt;/li&gt;
&lt;li&gt;Validates the cluster and iterates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s designed to reduce the gap between “I have an idea for a CRD” and “I have a working operator and webhook in a local cluster.”&lt;/p&gt;

&lt;p&gt;Repository (copy/paste):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/Sagart-cactus/k8s-operator-creator-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to Try It (Sample Prompt)
&lt;/h2&gt;

&lt;p&gt;Here’s a concrete sample prompt you can use with Codex or Claude Code inside the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are the CRD/Webhook Builder agent for this repo. Follow AGENTS.md strictly.

Goal: build a “TTLJob” CRD that runs a Job and auto-deletes it after a TTL.

Requirements:
- CRD: TTLJob with spec fields:
  - image (string, required)
  - command ([]string, optional)
  - ttlSeconds (int, required)
  - backoffLimit (int, optional, default 3)
  - labels (map[string]string, optional)
- Status fields:
  - phase (string: Pending/Running/Succeeded/Failed)
  - startTime (timestamp)
  - completionTime (timestamp)
  - jobName (string)
- Validating webhook:
  - ttlSeconds must be &amp;gt;= 60
  - image must be non-empty
  - backoffLimit must be 0..10
- Mutating webhook:
  - default backoffLimit to 3
  - default ttlSeconds to 3600 if not set
- RBAC: namespace-scoped
- Safety: only mutate CRs labeled “dev-mode=true”
- Fast dev loop: use Tilt + local compile + binary sync + restart
- Use kubebuilder + kustomize conventions

Steps:
1) Ask any missing questions (only if needed)
2) Generate CRD schema, controller logic, webhook scaffolding
3) Add Tiltfile, dev overlay, and Make targets
4) Create kind cluster and deploy
5) Verify with MCP if available, otherwise kubectl
6) Summarize how to run `make dev` and test with a sample CR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
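
&lt;p&gt;Once the agent finishes, a sample CR for this TTLJob example might look like the following (the API group/version is a placeholder; use whatever the scaffolding generated):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch.example.com/v1alpha1
kind: TTLJob
metadata:
  name: hello-ttl
  labels:
    dev-mode: "true"   # required by the safety rule above
spec:
  image: busybox:1.36
  command: ["sh", "-c", "echo hello"]
  ttlSeconds: 120
  # backoffLimit omitted: the mutating webhook should default it to 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Applying it with &lt;code&gt;kubectl apply -f&lt;/code&gt; and watching the resulting Job and status fields is a quick end-to-end check of both webhooks and the controller.&lt;/p&gt;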



&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Tilt is not just a developer convenience; it’s a critical piece of the reliability story for AI agents. When an agent can test, observe, and iterate rapidly, its outputs improve dramatically. That’s exactly what this repo tries to enable: a fast, safe, and repeatable loop for CRD operators and webhooks.&lt;/p&gt;

&lt;p&gt;If you build with CRDs or work in DevOps, I’d love feedback.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Understanding Redshift</title>
      <dc:creator>Sagar Trivedi</dc:creator>
      <pubDate>Mon, 31 Jul 2023 03:25:49 +0000</pubDate>
      <link>https://dev.to/sagart/understanding-redshift-3734</link>
      <guid>https://dev.to/sagart/understanding-redshift-3734</guid>
      <description>&lt;p&gt;Amazon Redshift is a great offering by AWS and it is one of the most popular and fastest cloud data warehouses in the market right now. It gives you a great performance, can be scaled, is secure and easy to manage. The connectors and queries are almost identical to Postgres hence it becomes very easy to query the data. Today we will take a look at the architecture of Redshift and understand how it achieves performance and scale even at tera and peta byte levels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4bfj81qn3hoajg6apa9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4bfj81qn3hoajg6apa9.png" alt="AWS Redshift architecture Taken from https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html" width="616" height="431"&gt;&lt;/a&gt;&lt;em&gt;AWS Redshift architecture Taken from &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The above diagram is the architecture of AWS Redshift. We will start from the top and move towards the bottom and understand each part.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leader Node&lt;/strong&gt;: The leader node is responsible for communication with the connectors and abstracts away the underlying architecture of the compute nodes. When a connector sends a query to the leader node, it decides which compute nodes to fetch the results from, aggregates the fetched data, and sends the query result back to the connector. Similarly, it decides which data will be stored on which compute node.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Node&lt;/strong&gt;: The compute node is responsible for storing and retrieving data. Each compute node has its own CPU, memory, and disk. When it receives instructions on what needs to be fetched, it uses node slices to fetch the data. The amount of CPU, memory, and disk depends on the node type, and to handle scale we can easily increase or decrease these resources by changing the node type, giving us a truly scalable data warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Slices&lt;/strong&gt;: Node slices are the smallest unit in this architecture. The AWS docs are not very explicit about them, but think of them as a way to fetch and store data in parallel. Each node slice is allocated a portion of the node’s memory and disk space, where it processes the workload assigned to it. The leader node determines which node slice processes which workload. A single query’s workload can be distributed across node slices on different nodes, each working in parallel. The number of node slices per node depends on the node type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster&lt;/strong&gt; - This entire interconnection of a leader node and compute nodes over a high-speed network is called a cluster. One thing to note: a leader node is created only when there is more than one compute node in a cluster, so if there is only a single node, the cluster is basically that node itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network&lt;/strong&gt; - The network over which the leader and compute nodes communicate is a high-speed private network abstracted from the client. AWS uses close physical proximity and custom communication protocols to achieve this speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How data is stored in Redshift
&lt;/h2&gt;

&lt;p&gt;Now we will look into some of the concepts and components used to store the data in a particular way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Columnar Storage
&lt;/h3&gt;

&lt;p&gt;Columnar storage stores columns together instead of rows. For example, if we have data like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlxv90ozmtj8amcfr6xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlxv90ozmtj8amcfr6xn.png" alt="Table image showing a sample data" width="800" height="192"&gt;&lt;/a&gt;&lt;br&gt;
A typical relational database would store the entire row in a block, something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosh4c1e0r4bknmw2q5ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosh4c1e0r4bknmw2q5ip.png" alt="How data will be saved in relational DB" width="800" height="122"&gt;&lt;/a&gt;&lt;br&gt;
But in columnar storage, the data of a single column is combined and stored in blocks, something like this. Of course, each column will have its own set of blocks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk896t2nfyfwsusaeyis0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk896t2nfyfwsusaeyis0.png" alt="How data will be save in the columnar storage" width="800" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why would someone use columnar storage? In a traditional RDBMS, you would need to create indexes on the fields your queries filter or group by for faster execution, which requires knowing all of your queries beforehand. Consider the extreme case where you create an index on every column: that would essentially mean replicating all the data in the table again in indexes. Redshift does something similar; it does away with the row data and stores every column in a form very much like an index. Redshift thus removes row-level work by using columnar storage and only performs input/output (I/O) operations for the columns a given query actually requires.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01134g86op7ua7c31saq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01134g86op7ua7c31saq.png" alt="Number of blocks accessed in row level storage&amp;lt;br&amp;gt;
" width="800" height="529"&gt;&lt;/a&gt;&lt;em&gt;Number of blocks accessed in row level storage&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F739mtdr7imwe8d4tzjce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F739mtdr7imwe8d4tzjce.png" alt="Number of blocks accessed in Columnar storage&amp;lt;br&amp;gt;
" width="800" height="541"&gt;&lt;/a&gt;&lt;em&gt;Number of blocks accessed in Columnar storage&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The illustrations above show a suggestive comparison between traditional row-level storage and columnar storage in terms of how many blocks are accessed when we select only a single column of the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compression
&lt;/h3&gt;

&lt;p&gt;Columnar storage changes the way we can compress data, opening up big improvements over traditional row-level storage. When each block contains data of the same type with similar values, compression becomes more efficient. Moreover, we can use different compression techniques on different columns, since columns are not stored together with other data types. This is known as column encoding in Redshift. It is not to be confused with the character set a column uses; Amazon Redshift always stores data as UTF-8. Amazon Redshift can choose a compression technique for you, or you can select from any of these &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html" rel="noopener noreferrer"&gt;options&lt;/a&gt;. Efficient compression increases your effective I/O throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Massively Parallel Processing
&lt;/h3&gt;

&lt;p&gt;By taking advantage of compute nodes and node slices, Amazon Redshift can perform parallel processing at massive scale, hence the term massively parallel processing (MPP). But there is a catch: you need to design your tables so that they can take advantage of this parallelism, which becomes critical with large datasets. Amazon Redshift uses the distribution style to determine the node slice where each row will be stored.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distribution Style
&lt;/h3&gt;

&lt;p&gt;Amazon Redshift uses four distribution styles to determine how data is distributed among the node slices. Note that the distribution style is defined at the table level, not the database level. Let’s take a look at them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ALL&lt;/strong&gt; - This is the simplest distribution style: you replicate the table’s data on all the nodes. That might sound suboptimal at first, but imagine a small table that needs to be joined with other tables all the time; instead of fetching and transferring data from different compute nodes, it makes sense to keep a copy of it on every node. This is not optimal for large datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommended for: Small tables that are frequently joined with most other tables.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EVEN&lt;/strong&gt; - This distribution tries to assign the table rows evenly across all the nodes in a cluster. It is recommended when the table is queried on its own rather than joined. Queried individually, it makes maximum use of parallelism, since every node holds data, and execution time drops. Queried with a join on other tables, matching rows may live on any node, increasing network traffic and in turn hurting query performance. The one exception is a join with a table using the ALL distribution: since that table has data on every node, query performance is not compromised.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommended for: Tables queried individually or joined with tables having distribution type ALL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KEY&lt;/strong&gt; - This distribution assigns rows to nodes based on the value of one or more columns. Amazon Redshift makes sure that rows with the same value in that column reside on the same node. When you want to optimize a join between two large tables, you can use this distribution on the common column or columns so that matching rows of both tables are co-located on the same nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommended for: Large tables that are joined together and need optimization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AUTO&lt;/strong&gt; - This distribution style starts the table with ALL and switches to EVEN as the table grows; it does not switch back to ALL if the table shrinks again.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Recommended for: When you are not sure of which distribution style to use and are just starting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One additional thing to note: the distribution style is not set in stone. The amazing thing about Amazon Redshift is that a simple table alteration can change the distribution style of an existing table.&lt;/p&gt;
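
&lt;p&gt;As a sketch, here is how the distribution styles above are declared in DDL (table and column names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- KEY distribution: co-locate rows that join on user_id
CREATE TABLE page_views (
  user_id   BIGINT,
  url       VARCHAR(2048),
  viewed_at TIMESTAMP
)
DISTSTYLE KEY DISTKEY (user_id);

-- Small lookup table replicated to every node
CREATE TABLE countries (
  code CHAR(2),
  name VARCHAR(64)
)
DISTSTYLE ALL;

-- Change the style later without recreating the table
ALTER TABLE page_views ALTER DISTSTYLE EVEN;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;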

&lt;h3&gt;
  
  
  Sort Key
&lt;/h3&gt;

&lt;p&gt;Sort keys determine how the rows in a table are sorted on disk. Used properly, they help the query optimizer read fewer blocks of data, which improves query performance. When data is stored, a set of metadata called zone maps is created; a zone map records the min–max range of the values stored in each 1 MB block. Zone maps are used during query processing to find the relevant blocks before scanning the disk. If a column is not sorted on disk, those min–max ranges can overlap, so two or more blocks may have to be read for a single value. There are two sort key types in Amazon Redshift.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compound Sort Key&lt;/strong&gt; - This is the default sort type. You can specify one or more columns as the sort key; the first column specified carries the highest weight, and the weight decreases with each subsequent column. Use a compound sort key when your queries include joins, GROUP BY, ORDER BY, or PARTITION BY on the leading columns and your table size is small.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interleaved Sort Key&lt;/strong&gt; - Here Amazon Redshift gives equal weight to each column selected as a sort key, so performance improves significantly when a query uses an equality operator in the WHERE clause on secondary sort columns. Adding rows to an already sorted table degrades performance, so VACUUM and ANALYZE operations should be run regularly to re-sort the data and refresh the zone maps. Use interleaved sort keys when queries filter with equality operators on the secondary columns or when the table is huge. It is not recommended to use interleaved keys for monotonically increasing attributes like dates, timestamps, or auto-increment IDs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
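
&lt;p&gt;In DDL, the two sort key types look like this (illustrative names again):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Compound: event_date carries the most weight, then user_id
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  event_type VARCHAR(32)
)
COMPOUND SORTKEY (event_date, user_id);

-- Interleaved: equal weight to region and product
CREATE TABLE sales_facts (
  region  VARCHAR(16),
  product VARCHAR(32),
  amount  DECIMAL(12,2)
)
INTERLEAVED SORTKEY (region, product);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;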

&lt;p&gt;I hope this article will help you in better understanding Amazon Redshift. Do let me know how you like this article.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>redshift</category>
      <category>deepdive</category>
      <category>database</category>
    </item>
    <item>
      <title>Email Authentication for Dummies</title>
      <dc:creator>Sagar Trivedi</dc:creator>
      <pubDate>Mon, 31 Jul 2023 02:59:43 +0000</pubDate>
      <link>https://dev.to/sagart/email-authentication-for-dummies-e25</link>
      <guid>https://dev.to/sagart/email-authentication-for-dummies-e25</guid>
      <description>&lt;p&gt;Email authentication techniques are the ways in which a recipient mail server verifies that the email which is being sent is indeed by the sender it claims to be. It is used to block harmful or fraudulent use of emails like phishing and spam. If you are a newbie trying to understand how emails are authenticated or a marketing manager who gets baffled by terms like DMARC, SPF, DKIM etc., this post is intended for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does email authentication work?
&lt;/h2&gt;

&lt;p&gt;There are many email authentication mechanisms, but all of them roughly follow the approach below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42qgne3xnaavb9k59vwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42qgne3xnaavb9k59vwc.png" alt="How email authentication works" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The sender organization states a policy on how the emails from its domain name can be authenticated and publishes this policy for everyone.&lt;/li&gt;
&lt;li&gt;The receiver organization authenticates the incoming email by the policies that were published by the sender and then takes appropriate action on whether to deliver or flag or reject the email.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now let us look at a use case where email authentication is really necessary: using an email provider to send emails on your behalf. For example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We have a domain foo.com and an email address &lt;a href="mailto:app@foo.com"&gt;app@foo.com&lt;/a&gt;, and we use a provider like SendGrid to send the emails.&lt;/li&gt;
&lt;li&gt;When you send an email from &lt;a href="mailto:app@foo.com"&gt;app@foo.com&lt;/a&gt; to &lt;a href="mailto:bar@gmail.com"&gt;bar@gmail.com&lt;/a&gt;, Gmail receives an email claiming to be from foo.com that was actually sent by a domain or IP owned by SendGrid. How can Gmail verify that the email was indeed from foo.com and not from somebody else using SendGrid?&lt;/li&gt;
&lt;li&gt;foo.com has published an authentication policy (usually TXT and CNAME records) authorizing SendGrid’s servers to send email on its behalf. Gmail looks up the policy and checks whether SendGrid is indeed authorized.&lt;/li&gt;
&lt;li&gt;Based on that verification, Gmail decides whether to deliver the email to the inbox, send it to junk, or block it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are two major mechanisms for email authentication: SPF and DKIM.&lt;/p&gt;

&lt;h3&gt;
  
  
  SPF
&lt;/h3&gt;

&lt;p&gt;SPF (Sender Policy Framework) lets the email sender define which IP addresses are allowed to send email on their behalf.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The sender domain publishes a TXT record in a standard format specifying the mail servers that are authorized to send email on its behalf. This DNS TXT record is known as the SPF record.&lt;/li&gt;
&lt;li&gt;The receiving mail server checks this record to determine whether the mail server it received the email from is authorized, and takes action accordingly.&lt;/li&gt;
&lt;/ol&gt;
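
&lt;p&gt;For the foo.com / SendGrid example above, the SPF record is a TXT record on the domain and might look like this (the &lt;code&gt;include&lt;/code&gt; value depends on your provider):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;foo.com.  IN  TXT  "v=spf1 include:sendgrid.net ~all"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;~all&lt;/code&gt; asks receivers to soft-fail mail from any server not covered by the record; &lt;code&gt;-all&lt;/code&gt; would request a hard fail.&lt;/p&gt;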

&lt;h3&gt;
  
  
  DKIM
&lt;/h3&gt;

&lt;p&gt;DKIM (DomainKeys Identified Mail) lets the email sender attach a digital signature that the receiver can verify to confirm the email was not altered in transit.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The sender publishes a specially formatted cryptographic public key as a TXT record. While sending a message, the sender’s mail server generates a DKIM signature and attaches it to the message headers.&lt;/li&gt;
&lt;li&gt;The receiving server then uses the published DKIM key to verify the signature in the header against a freshly computed one. If the values match, the message was not altered or modified.&lt;/li&gt;
&lt;li&gt;Now the receiver knows whether the email was sent by an authorized server and has not been altered. But how does the sender know if someone is attempting malicious activity on their behalf? This is where DMARC comes into the picture.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  DMARC
&lt;/h3&gt;

&lt;p&gt;DMARC is also a DNS record; it tells the receiver what action to take if an email does not meet the authentication criteria. It contains a URL where the receiver may send reports of malicious email, and a policy that tells the receiver what to do with that email. There are three possible actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;none&lt;/strong&gt;: Just report, take no action&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;quarantine&lt;/strong&gt;: Report and move the message to junk folder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reject&lt;/strong&gt;: Report and bounce the email.&lt;/li&gt;
&lt;/ul&gt;
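
&lt;p&gt;A DMARC policy is published as a TXT record on the &lt;code&gt;_dmarc&lt;/code&gt; subdomain; for the foo.com example it might look like this (the report address is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_dmarc.foo.com.  IN  TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@foo.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;p&lt;/code&gt; selects one of the three actions above, and &lt;code&gt;rua&lt;/code&gt; is where aggregate reports are sent.&lt;/p&gt;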

&lt;p&gt;DMARC is optional and a receiving server may choose to not follow it, but lately it has become a standard and almost all major email providers honor DMARC.&lt;/p&gt;

&lt;p&gt;To summarise, email authentication is the foundation of good email deliverability. That said, it does not guarantee delivery; things like shared IPs, email content, and receiving-server policies also play an important part. Maybe a topic for next time.&lt;/p&gt;

&lt;p&gt;Sagar T.&lt;/p&gt;

</description>
      <category>email</category>
      <category>authentication</category>
      <category>security</category>
      <category>dummies</category>
    </item>
    <item>
      <title>Handling Incidents Mindfully 🧘🏽 — Circles of Control and Chaos</title>
      <dc:creator>Sagar Trivedi</dc:creator>
      <pubDate>Sun, 30 Jul 2023 17:54:25 +0000</pubDate>
      <link>https://dev.to/sagart/handling-incidents-mindfully-circles-of-control-and-chaos-1bga</link>
      <guid>https://dev.to/sagart/handling-incidents-mindfully-circles-of-control-and-chaos-1bga</guid>
      <description>&lt;p&gt;Namaskar 🙏🏽&lt;/p&gt;

&lt;p&gt;Has your team complained about the stress level when there are incidents?&lt;/p&gt;

&lt;p&gt;Is your incident resolution time usually high?&lt;/p&gt;

&lt;p&gt;Does an incident increase anxiety levels among team members?&lt;/p&gt;

&lt;p&gt;I have coined a theory about how we can answer and resolve these issues — the &lt;strong&gt;Circle of Control and Chaos&lt;/strong&gt;. Here we talk about how we can empower people to handle incidents better and reduce the anxiety and stress that come with them. A few years back, I saw a great video of Stephen Covey explaining the Circle of Concern and Circle of Influence from his book The 7 Habits of Highly Effective People. This concept is based on a similar principle of two circles, but with a different approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw4zpjj54lbdt9uzd9c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw4zpjj54lbdt9uzd9c1.png" alt="Circle of Control and Chaos" width="720" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything that must be done to handle an incident sits inside the outer &lt;strong&gt;Circle of Chaos&lt;/strong&gt;. Within it, each person involved has things they can actually do, based on their skills, authority, access, knowledge, and so on; those fall inside their inner &lt;strong&gt;Circle of Control&lt;/strong&gt;. Everything that is not possible or not relevant for them to do remains in the Circle of Chaos. The Circle of Control is a subset of the Circle of Chaos, and what falls inside or outside it is different for every person involved in the incident. Any organization that wants to handle incidents effectively needs to do two things for every person involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase the Circle of Control.&lt;/li&gt;
&lt;li&gt;Make sure no one is working on things that are in their Circle of Chaos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What are the things that will increase the circle of control and decrease the circle of chaos?&lt;/p&gt;

&lt;h2&gt;
  
  
  Giving Authority, Priority and Clarity
&lt;/h2&gt;

&lt;p&gt;By giving the person handling the incident the authority to dictate and prioritize work, and clarity in case of conflicts, you increase their circle of control. Let’s take an example: a person is accountable for an incident, but the only team that can fix the issue is already working on an important project. This clearly increases the stress and anxiety of the person handling the incident and also increases the time to resolution. One of the best ways to solve this is to provide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clarity&lt;/strong&gt; — by explicitly deprioritizing one of the two, so there is no conflict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authority&lt;/strong&gt; — to the person handling the incident to take any actions necessary to resolve the incident and restore services as soon as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Priority&lt;/strong&gt; — by giving them quick, prioritised access to the people and resources that help them resolve the issue faster.&lt;/p&gt;

&lt;p&gt;Doing these three things enlarges the Circle of Control, and the bigger the Circle of Control, the faster the incident is resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Clear Roles and Responsibilities
&lt;/h2&gt;

&lt;p&gt;Just as workplaces define evacuation policies (“In case of emergency you need to …”) so that there is no chaos at the time of a calamity, we should define policies beforehand with clear roles and responsibilities for every person: “In case of an incident you need to …”.&lt;/p&gt;

&lt;p&gt;We also need to make sure that the roles and responsibilities assigned to a person reside inside their Circle of Control. For example, communicating with stakeholders about an incident may be inside the Circle of Control for a product manager, but the same task sits in the Circle of Chaos for an engineer or developer.&lt;/p&gt;

&lt;p&gt;If you have a high sense of ownership and responsibility, you will very likely tend to think about everything inside both circles, which only increases your stress. Clear roles and responsibilities reduce that stress: you know that a thing in your Circle of Chaos is in someone else’s Circle of Control, someone with better knowledge and more authority to solve the problem. The fewer items left in the Circle of Chaos, the lower the stress among the team handling the incident.&lt;/p&gt;

&lt;p&gt;To summarize: by giving Authority, Priority and Clarity we empower the person handling the incident to do their job better, and by defining clear roles and responsibilities we reduce stress by assigning items to the people best placed to handle them. These things are not always possible, depending on the type of incident, the availability of people, and so on. But we can try our best to follow these guidelines so that an incident is resolved as quickly as possible with the least amount of stress and anxiety.&lt;/p&gt;

</description>
      <category>incident</category>
      <category>ops</category>
      <category>mindfulness</category>
    </item>
    <item>
      <title>Handling Incidents Mindfully 🧘🏽 — Part 2: Do not React, Respond !</title>
      <dc:creator>Sagar Trivedi</dc:creator>
      <pubDate>Sun, 30 Jul 2023 17:49:30 +0000</pubDate>
      <link>https://dev.to/sagart/handling-incidents-mindfully-part-2-do-not-react-respond--4djc</link>
      <guid>https://dev.to/sagart/handling-incidents-mindfully-part-2-do-not-react-respond--4djc</guid>
      <description>&lt;p&gt;Namaskar 🙏&lt;/p&gt;

&lt;p&gt;Welcome to the second article in the series “Handling Incidents Mindfully”. If you have landed directly on this article, I suggest you start with the first part of the series, where you will find a brief introduction and the first step in this journey — &lt;a href="https://dev.to/sagart/handling-incidents-mindfully-part-1-acceptance-1j21"&gt;Acceptance&lt;/a&gt;. Let’s dive into the second part: “Do not React, Respond !!!”.&lt;/p&gt;

&lt;p&gt;What is the difference between reacting and responding? Aren’t they one and the same? Well, React is as related to Respond as JavaScript is to Java. We will start by looking at the definitions of React and Respond with the help of an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3b2tyga1webt9obmh1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3b2tyga1webt9obmh1m.png" alt="React V/s Response" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;React&lt;/strong&gt; — When you act without giving it a thought, you react. The action is almost instant and is governed by your subconscious mind. The textbook example in mindfulness is from Neanderthal times: a confrontation with a dangerous wild animal would cause an adrenaline rush, and the mind would tell you either to fight or to flee. There is no thought process here, just reflexes working to save you. In today’s world there are no such dangers, yet people react to things as simple as criticism of their code or the occurrence of an incident. &lt;strong&gt;React is a state of high stress.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Respond&lt;/strong&gt; — When you act after giving it a thought, you respond. Here you are utilising your mind, hence the term mindful 😁. If you are wondering how responding would help in a confrontation with a dangerous wild animal, the answer is it would only help you get eaten. But these are not Neanderthal times, and responding will help you a lot when people are criticising your code or system design. &lt;strong&gt;Respond is a state in which you think before you act.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s connect react/respond to incidents. Anyone who has seen enough incidents knows that many involve loud, heated arguments with people under stress (in a state of react), which is why so many incident reports contain words like warfront, war room, or war-like situation. How do we handle this stress? First, accept that there are going to be incidents and that the people involved will undergo a lot of stress (we used our first step, Acceptance, here 😉). Then we need to provide them with all the tools necessary to ease that stress.&lt;/p&gt;

&lt;p&gt;Organisations usually evaluate processes and tools by how well they find the issue or help a product launch, but rarely from the point of view of how they help the people handling the incident to respond rather than react. How many organisations evaluate a tool on the basis of how much it reduces the stress of the people handling incidents? Is that a major factor when you are finalising a tool or setting a process? The state of react is uneasy and unhealthy, so the stress factor deserves a place in those decisions.&lt;/p&gt;

&lt;p&gt;If you are in a position where you decide the budgets for these tools, please consider this factor as well; spend more if it reduces stress. If you are in a position where you decide the priority or scope of implementing metrics and monitoring for a product launch, please treat them with importance and give them thorough thought. And if you do these things not because of the goals or KPIs involved in incident management but because you genuinely think about the stress people will feel, congratulations: you have practiced, or will be practicing, another great mindful value, Empathy.&lt;/p&gt;

&lt;p&gt;To summarise: incidents involve high stress, which pushes people to react instead of respond. We should accept this fact and build a culture that considers these factors when setting processes and priorities. Please note that we are talking about reducing stress, not eliminating it.&lt;/p&gt;

&lt;p&gt;Next: how do you approach solving problems when you know that every minute spent thinking of a solution is costing the business? The stress of finding the root cause is one thing, but the stress of fixing the issue is another ball game altogether, especially when the solution is not straightforward to implement. The next part covers how to handle a stressful situation where there is chaos all around and you are not sure what needs to be done, again with a catchy title: &lt;strong&gt;“The Circles of Control and Chaos”&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>incidents</category>
      <category>ops</category>
      <category>mindfulness</category>
      <category>inclusion</category>
    </item>
    <item>
      <title>Handling Incidents Mindfully 🧘🏽 — Part 1: Acceptance</title>
      <dc:creator>Sagar Trivedi</dc:creator>
      <pubDate>Sun, 30 Jul 2023 17:44:21 +0000</pubDate>
      <link>https://dev.to/sagart/handling-incidents-mindfully-part-1-acceptance-1j21</link>
      <guid>https://dev.to/sagart/handling-incidents-mindfully-part-1-acceptance-1j21</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Namaskar 🙏&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Change&lt;/code&gt; is the only constant. This adage applies both in life and in software. Today we discuss one of the things that comes as a part of the constant changes we deploy in software: INCIDENTS.&lt;/p&gt;

&lt;p&gt;From 2015 to 2019 I worked on a web application that drove the revenue of the entire organization. Any downtime or incident meant financial loss and damage to the brand’s reputation, so every change was high stakes and every deployment stressful. Even so, we went from deploying twice a week in 2015 to deploying two or even five times a day. There were many incidents along the way, and I learned a lot about the culture that helped us handle them; I would like to share those learnings with you. I am a strong supporter of meditation and have been practicing it for about four years, and it has drastically changed the way I work and interact with people. Each culture shift or value we will discuss is a mindful practice that I, my peers, and my (now ex) leaders used to follow. In this series of articles, we will look at these practices and what an organization can do in terms of its culture or values to better handle incidents.&lt;/p&gt;

&lt;p&gt;I am hoping that these values will help your organization irrespective of&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The severity of the incident&lt;/li&gt;
&lt;li&gt;Size and Complexity of the system&lt;/li&gt;
&lt;li&gt;Organization size&lt;/li&gt;
&lt;li&gt;Number/Type of users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will start with the first and the most important step: &lt;strong&gt;Acceptance&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Acceptance
&lt;/h3&gt;

&lt;p&gt;The textbook definition of an Incident is,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An incident is something that happens, especially something unusual or unpleasant.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In terms of software development,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An incident is an event that disrupts the normal operation of a system (software, website, etc.).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many organizations are afraid of incidents and associate them with financial loss and damage to the brand among clients or customers. After a major incident, the common reaction is to start adding checks and reviews to the deployment process. We add sync-ups, release meetings, review boards, and so on, doing our best to make sure there are no incidents in production. While these checks and processes have merit in reducing the number of incidents, what is often overlooked is their impact on the speed at which you deploy changes. We need to know our risk appetite and decide how much red tape to put on deployments.&lt;/p&gt;

&lt;p&gt;Aiming to reduce the number of incidents is a fair ask, and adding gates and checks to the process is fine. But adding these checks with the aim of eliminating incidents is a misconception. The only way to stop incidents from occurring is to stop releasing changes.&lt;/p&gt;

&lt;p&gt;Organizations tend to articulate and connect an incident to a bug, a gap in process or technical constraints, etc. What they should be doing is bringing a culture of Acceptance towards incidents.&lt;/p&gt;

&lt;p&gt;What do we mean by acceptance? It means recognizing that the real reason for any incident is that we keep releasing changes and keep growing at a good rate. As long as an organization is doing both, it should accept that there are going to be incidents along the way. The first step should always be accepting and normalizing them: decide on the pace at which you want to deploy changes and accept the risks involved. Once you accept those risks, you can start looking at incidents as part of the whole software lifecycle.&lt;/p&gt;

&lt;p&gt;True acceptance of incidents as part of your deployment process can only be achieved through an organization’s culture. Incidents should be looked at as a stepping stone to improvement and to how things could be better. The only things that should matter are how we respond to an incident and how we make sure the same mistakes are not repeated.&lt;/p&gt;

&lt;p&gt;So with this, we conclude our first part. Do let me know whether you liked it, hated it, or have a different view towards incidents.&lt;/p&gt;

&lt;p&gt;We will go to the next part soon, with a catchy Title “Do not React, Respond !!!”&lt;/p&gt;

&lt;p&gt;Until next time.&lt;/p&gt;

</description>
      <category>incidents</category>
      <category>mindfulness</category>
      <category>inclusion</category>
      <category>ops</category>
    </item>
    <item>
      <title>How to save cost in non-prod AWS environment</title>
      <dc:creator>Sagar Trivedi</dc:creator>
      <pubDate>Sun, 30 Jul 2023 17:37:43 +0000</pubDate>
      <link>https://dev.to/sagart/how-to-save-cost-in-non-prod-aws-environment-47m9</link>
      <guid>https://dev.to/sagart/how-to-save-cost-in-non-prod-aws-environment-47m9</guid>
      <description>&lt;p&gt;If you are in an organisation whose non-prod AWS environment costs are high and want to cut short on the cost without compromising on the development and deployment velocity, here is a short checklist/quick wins of actions that you can do in order to save money. Of Course you need to check whether the action has an impact on your environment or any process that they follow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Instance Scheduler&lt;/strong&gt;: There is a great solution in the AWS Solutions Library that I highly recommend. It stops and starts EC2 and RDS instances on a defined schedule; use it to stop the environment when it is not in use. &lt;a href="https://aws.amazon.com/solutions/implementations/instance-scheduler/" rel="noopener noreferrer"&gt;Instance Scheduler&lt;/a&gt;&lt;/p&gt;
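&lt;p&gt;The core idea behind such a scheduler can be sketched in a few lines of Python. The helper below is hypothetical (it is not part of the AWS solution): it decides, from a simple office-hours schedule, whether tagged non-prod instances should be running, and you could wire its result to boto3’s &lt;code&gt;start_instances&lt;/code&gt;/&lt;code&gt;stop_instances&lt;/code&gt; calls.&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical schedule, purely illustrative: keep non-prod
# instances running Monday-Friday, 08:00-20:00 UTC only.
RUN_DAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4
START_HOUR = 8
STOP_HOUR = 20

def desired_state(now):
    """Return 'running' or 'stopped' for the given timestamp."""
    in_window = now.weekday() in RUN_DAYS and STOP_HOUR > now.hour >= START_HOUR
    return "running" if in_window else "stopped"

# Wiring to AWS (needs boto3 and credentials) would look roughly like:
#   ec2 = boto3.client("ec2")
#   if desired_state(datetime.utcnow()) == "stopped":
#       ec2.stop_instances(InstanceIds=ids_of_tagged_nonprod_instances)
```

&lt;p&gt;The real Instance Scheduler also handles tags, time zones, and RDS, so prefer it over rolling your own; the sketch only shows how small the core decision is.&lt;/p&gt;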

&lt;p&gt;&lt;strong&gt;Spot instances&lt;/strong&gt;: If your application is stateless, you can use Spot Instances or Fargate Spot in the environment to reduce cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single NAT gateway per VPC&lt;/strong&gt;: NAT gateways are costly, and many IaC templates create one NAT gateway per AZ by default, which drives up the bill. If it does not affect your workloads, create a single NAT gateway per VPC in your non-prod environment. This brings down a considerable amount of cost.&lt;/p&gt;
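&lt;p&gt;To check whether you are already paying for more NAT gateways than you need, you can group EC2’s NAT gateway list by VPC. The helper below is a sketch: it accepts records in the shape returned by boto3’s &lt;code&gt;describe_nat_gateways()&lt;/code&gt; and flags VPCs running more than one gateway.&lt;/p&gt;

```python
from collections import Counter

def vpcs_with_extra_nat_gateways(nat_gateways):
    """Return VPC IDs that run more than one active NAT gateway.

    `nat_gateways` uses the record shape of boto3's
    describe_nat_gateways()['NatGateways'].
    """
    counts = Counter(
        gw["VpcId"] for gw in nat_gateways if gw.get("State") == "available"
    )
    return sorted(vpc for vpc, n in counts.items() if n > 1)

# With boto3, the real records would come from:
#   pages = boto3.client("ec2").get_paginator("describe_nat_gateways").paginate()
#   gateways = [gw for page in pages for gw in page["NatGateways"]]
```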

&lt;p&gt;&lt;strong&gt;RDS storage instead of IOPS&lt;/strong&gt;: If your RDS workload needs extra IOPS, check whether simply increasing a general-purpose (gp2) volume meets your requirements, since gp2 baseline IOPS scale with volume size. For a non-prod environment with reduced load, a 400 GB gp2 volume is a better deal than a 100 GB volume with 1,000 provisioned IOPS (~$46 compared to ~$112 per month).&lt;/p&gt;
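&lt;p&gt;The arithmetic behind that comparison, using illustrative per-month prices (roughly us-east-1 rates; check the AWS pricing pages for your region and engine). Note that gp2 also gives a baseline of 3 IOPS per GB, so a 400 GB volume comes with 1,200 baseline IOPS, more than the 1,000 provisioned IOPS in the second option:&lt;/p&gt;

```python
# Illustrative RDS storage prices in $/month; real prices vary by
# region and engine - verify on the AWS pricing page before relying on them.
GP2_PER_GB = 0.115    # general-purpose SSD storage
IO1_PER_GB = 0.125    # provisioned-IOPS SSD storage
IO1_PER_IOPS = 0.10   # per provisioned IOPS

def gp2_monthly(gb):
    return gb * GP2_PER_GB

def io1_monthly(gb, iops):
    return gb * IO1_PER_GB + iops * IO1_PER_IOPS

print(round(gp2_monthly(400), 2))        # 400 GB gp2: 46.0
print(round(io1_monthly(100, 1000), 2))  # 100 GB io1 + 1,000 IOPS: 112.5
```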

&lt;p&gt;&lt;strong&gt;Clean-up scripts&lt;/strong&gt;: There should be an automation script that regularly detects and deletes unused AMIs, EBS volumes, RDS snapshots, etc. Amazon Data Lifecycle Manager is a great option for handling this. &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/snapshot-lifecycle.html" rel="noopener noreferrer"&gt;snapshot lifecycle&lt;/a&gt;&lt;/p&gt;
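&lt;p&gt;The heart of such a clean-up script is just a filter. The sketch below is a hypothetical helper, not Data Lifecycle Manager itself: it picks out unattached EBS volumes older than a cutoff, in the record shape of boto3’s &lt;code&gt;describe_volumes()&lt;/code&gt;. Actual deletion would be a separate &lt;code&gt;delete_volume&lt;/code&gt; call, ideally after a dry run that only reports what would be removed.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def stale_unattached_volumes(volumes, max_age_days=30, now=None):
    """Return IDs of volumes that are unattached ('available') and were
    created more than max_age_days ago. `volumes` uses the record shape
    of boto3's describe_volumes()['Volumes']."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available" and cutoff > v["CreateTime"]
    ]
```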

&lt;p&gt;&lt;strong&gt;Scheduled scaling&lt;/strong&gt;: If you know when environment usage will be high, you can increase or decrease the number of EC2 instances or containers accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 One Zone-IA / S3 Intelligent-Tiering / lifecycle policies&lt;/strong&gt;: This is not always needed, since S3 is dirt cheap. But if objects accumulate over time and are huge, non-critical, and rarely accessed, these storage classes and policies can help.&lt;/p&gt;
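&lt;p&gt;For the scheduled-scaling item, EC2 Auto Scaling supports scheduled actions natively. The helper below is a sketch that builds the keyword arguments for boto3’s &lt;code&gt;put_scheduled_update_group_action&lt;/code&gt;; the group name and cron expressions shown are placeholders.&lt;/p&gt;

```python
def scale_action(group, name, cron, desired, min_size=0, max_size=None):
    """Build kwargs for the autoscaling client's
    put_scheduled_update_group_action call. `cron` is a UTC cron
    expression such as '0 8 * * MON-FRI'."""
    return {
        "AutoScalingGroupName": group,
        "ScheduledActionName": name,
        "Recurrence": cron,
        "MinSize": min_size,
        "MaxSize": max_size if max_size is not None else desired,
        "DesiredCapacity": desired,
    }

# With boto3 and credentials, a scale-up/scale-down pair would be applied as:
#   asg = boto3.client("autoscaling")
#   asg.put_scheduled_update_group_action(
#       **scale_action("nonprod-web", "workday-up", "0 8 * * MON-FRI", 4))
#   asg.put_scheduled_update_group_action(
#       **scale_action("nonprod-web", "workday-down", "0 20 * * MON-FRI", 0))
```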

&lt;p&gt;&lt;strong&gt;Reseller&lt;/strong&gt;: I cannot speak for other countries, but in India, if you go for offline billing via an AWS reseller, you get a flat 10% discount on the total monthly bill (you need to commit to a monthly amount, and the discount may differ from customer to customer).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reservation&lt;/strong&gt;: If your future load is planned and you have clear requirements, you can go for a Savings Plan or Reserved Instances.&lt;/p&gt;

&lt;p&gt;To summarise, the actions above can help reduce your non-prod AWS bills. Let me know what you think. Did any of these give you a TIL moment? Do you have a tip that is not listed here? Do you want me to cover more services? Do let me know.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>costoptimization</category>
      <category>cloud</category>
      <category>cloudcost</category>
    </item>
  </channel>
</rss>
