DEV Community: Armando Contreras

DevOps Roadmap 2026

Armando Contreras — Tue, 10 Mar 2026 06:19:06 +0000

What is DevOps?

So first, what is DevOps? What are the duties and responsibilities of a DevOps engineer?

There’s no official definition of DevOps as a role — it originated as a cultural philosophy, not a job title. DevOps represents collaboration between development and operations teams to deliver software faster and more reliably. As a DevOps engineer, you typically handle infrastructure automation, CI/CD pipeline management, cloud resource provisioning, monitoring and logging, automation implementations, scripting, and facilitating collaboration between teams.

That’s the theory. In practice, it’s whatever the company defines it as. Each company has its own responsibilities and definition because there’s no official reference or standard for the role — like many things in software. You might do work closer to an SRE, platform engineer, or sysadmin. You might be setting up pipelines, cloud organizational setups like landing zones, or it could be a mix of all that.

You’ll be heavily engaged with the SDLC (Software Development Lifecycle). Even though your main responsibility likely lies in building, deploying, and operating the app, you’re not limited to that. You also handle monitoring, setting up fleet agents, APM (application performance monitoring), log ingestion, building metric and performance dashboards for apps, and alerting.
If someone wants to become a DevOps engineer, what do they need?

I’ll go through the entire roadmap, but let me start with the basics. In my view, anyone who wants to become a good DevOps engineer should have a background as a developer. Why? Because you need to understand the entire SDLC — you need a holistic view of it. How will you set up a CI/CD pipeline or troubleshoot its components if you don’t know how it works from the ground up? I think this knowledge is transferable across different tech stacks. You can learn to write code in a language like Python or Node.js, which aren’t low-level languages. And even though working with languages like C# or Golang is different, the building and deployment of apps follows similar patterns — with some variations in areas like dependency management and artifact storage.

Also, having that developer background will give you the ability to develop code solutions, write scripts, build automation, generate reports, and develop APIs for internal usage like automation triggers. I mean, you can achieve this knowledge without being a developer previously, but I think it will be a little harder to get there if you don’t have that base knowledge.
Scripting Language

I’m going to use the https://roadmap.sh/devops roadmap as a base. Starting with programming languages: as I mentioned before, you can learn any language, but specifically for DevOps, my recommendation would be Python and Golang. Python for its ecosystem of libraries, which covers most DevOps tasks related to simple scripting, CLIs, and automation. And Golang because many DevOps tools like Docker, Terraform, and Kubernetes are written in Golang behind the scenes. Knowing this language will open many doors in terms of extensibility when you need to fix a provider, create your own, or build CLIs. Golang is also compiled, so it generally performs better than an interpreted language like Python, where you can expect a little more overhead. However, it will also depend on the company stack where you work. In some cases, knowing JavaScript for Node.js apps or C# helps, because even if you’re not an expert in those languages, it will be useful when setting up CI/CD, deployment, dependencies, etc.
Operating System

I recommend using any Linux or Unix distribution — Ubuntu and Debian are the main ones, but if you use macOS, it won’t be very different.

For scripting, I recommend Bash. It’s very helpful for basic script operations, and it’s useful when you build Docker images or other types of images — for example, to build for different architectures, download dependencies, or handle file moves.

Also, if you can learn Vim or Nano for terminal editors, it will be super helpful for those situations where you’re on a server with just terminal access and need to edit configuration files. You don’t need to be an expert — just learn the basics like editing a file, adding lines, removing lines, and searching text in a file.
Git

Having a good understanding of Git will be very helpful. I recommend you read at least the basics of the Pro Git book, a free book (https://git-scm.com/book/en/v2) that will give you the basics of Git like branches, commits, merging strategies, staging, reverting changes, and authentication with the provider via basic auth or SSH keys. You don’t need supercomplex knowledge of Git, but having the basics and a bit more will be super useful. For code hosting solutions, I’d recommend GitHub or GitLab. Both are great options. There are also self-hosting options, but at the end of the day, it’s better to use a solution like these. Cloud providers have their own solutions too, but in most cases you’ll end up using GitHub or GitLab.
Containers

I highly recommend learning Docker — understand the basics of how it works behind the scenes, kernel sharing, and the differences between containers and VMs. I highly recommend the book Learn Docker in a Month of Lunches. Even though there are other container engines out there, I recommend starting with Docker. Also, Docker Compose is super helpful for managing containers locally and basic POCs where you need multiple containers connected — like the database, Redis cluster, background tasks, backend, and frontend. Learn volume management and mounting, building images, how to set them up, and caching images to speed up builds.

Also, it’s not shown here on the screen, but Kubernetes is closely related to DevOps and SRE roles. Learn how it works, the concepts, fundamentals, when to use it, and how to set up apps locally. There are different mini clusters you can set up locally, like Minikube, k3s, and the Kubernetes cluster bundled with Docker Desktop. If you can, please learn it. I recommend two books: Learn Kubernetes in a Month of Lunches and Kubernetes Up and Running (3rd edition). Cover the basics of node management, control plane, etcd, components of a cluster, and resources like services, ingress, load balancers, namespaces, replica sets, deployments, pods, volume sets, volumes, daemon sets, CRDs, and networking with Docker and Kubernetes (containers in general). Also, if you have the option to set up your own cluster from scratch, it will give you a deep understanding of Kubernetes.
Load Balancers

This is very helpful because you need to forward traffic to applications — so knowing how to do it with load balancers is essential. I recommend learning reverse proxies like Nginx and Traefik (which isn’t listed here). There are also cloud-focused options like AWS ELB or Azure Load Balancer, which are compatible with Kubernetes setups via ingress — for example, the AWS Load Balancer Controller for EKS. Learn the different load balancing strategies. Rather than caching servers, I’d say caching solutions like CDNs are more important — these optimize requests before they hit the server. Also look into proxy servers and firewalls like security groups where you define rules for inbound and outbound traffic — what’s allowed or denied. You may not need to set them up yourself, but understand the use case and where they fit.
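To make "load balancing strategies" concrete, here is a minimal sketch of two common ones, round robin and least connections. It is a toy model in plain Python, not how Nginx, Traefik, or any cloud load balancer is implemented, and the backend names are hypothetical.

```python
from itertools import cycle


class RoundRobin:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        self._rotation = cycle(backends)

    def pick(self):
        return next(self._rotation)


class LeastConnections:
    """Pick the backend that currently has the fewest open connections."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def pick(self):
        # Ties are broken by insertion order of the backends.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call this when a connection to the backend is closed.
        self.active[backend] -= 1


backends = ["app-1", "app-2", "app-3"]  # hypothetical backend names

rr = RoundRobin(backends)
rr_choices = [rr.pick() for _ in range(4)]

lc = LeastConnections(backends)
first = lc.pick()
second = lc.pick()
lc.release(first)   # the first connection finishes early
third = lc.pick()   # so its backend is the least loaded again
```

Round robin ignores how busy each backend is, which is fine when requests cost about the same. Least connections tends to behave better when some requests are much longer than others.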

As you can see, a good understanding of network concepts will be very useful. Even if you’re not expected to be a network engineer, having more than the basics will be more than enough for most normal use cases.
Networking & Protocols

Do you know how DNS works? When you create a request from your local machine or a VM in the cloud, do you know how it gets routed to where it should end up? Do you understand DNS hosted zones, domains, and DNS servers? What about types of records — A, AAAA, TTL, CNAMEs, SOA, subdomain delegation? What about HTTP protocols — the meaning of the acronym, why it’s needed, what a Certificate Authority is, why some are accepted and others not? What is HTTPS? What about SSL and TLS termination? I don’t think you need to understand the depths of the TLS protocol, but if you can, take a look to understand what’s needed when a request is created at a deep level. Also, SSH and RDP — when you need to get into a server via remote connection, knowing how to make it secure, manage open ports, and understand the fundamentals is important. Most of the time, TLS termination is handled by load balancers, but there will be some use cases where you’ll set up your own Private CA to sign your own certificates for private TLS validation. It’s not very common, but if you know how it works, it will be valuable.

Understanding the OSI Model helps you grasp the different types of load balancers. For example, in the cloud you’ll commonly find application load balancers (layer 7, HTTP/HTTPS) and network load balancers (layer 4, handling TCP, UDP, and TLS).
Cloud Providers

I highly recommend learning AWS, with Azure or GCP as a second option. It will depend on the company where you work, but AWS is the gorilla of the cloud. If you can, learn the fundamentals of networking, identity and access management, and core services like EC2, ECS, S3, EKS, EBS, Lambda, CloudWatch, etc. And obviously the DevOps services like CodePipeline, CodeBuild, and CodeDeploy. I think talking about cloud providers deserves an entire video, but for now I can tell you that based on my experience — client support, services, documentation — the best is AWS. I haven’t used GCP, so I can’t give you a review of it.
Models of Compute, Services

These are the different cloud service models, which define how responsibility is split between the provider and the user:

On-premises
IaaS
PaaS
SaaS
FaaS

IaC Tools

I recommend learning Terraform above any other IaC provider. Why? Because of its modularization, cloud and service agnostic nature, extensibility to build your own providers, and many other benefits. If you work with different cloud providers, you have a tool that can be used across all of them. With a good understanding of it, Terraform can simplify a lot of things with DRY (Don’t Repeat Yourself) configuration. If you’re working with AWS, also learn how to use AWS CloudFormation and AWS CDK — these are the main foundations of IaC for that cloud. Serverless Framework can be used sometimes as well.
CI/CD Tools

First, what is CI/CD?

CI (Continuous Integration): A software development practice where developers regularly merge their code into a central repository. After each merge, one or more automated builds are initiated and tests run against the committed code. The main goal is to detect and address integration issues early by automating the build, testing, and validation processes. With CI, developers integrate their changes often, and each integration triggers an automated build and test process to ensure code quality and compatibility.

Continuous Delivery: If deployment to production occurs with continuous delivery, there will most likely be a manual approval process rather than an automated deployment. CD focuses on ensuring that software is always in a releasable state, but the actual deployment decision is made manually. The software goes through automated build, testing, and validation processes to ensure it’s ready for deployment. Once the software is deemed ready, it can be deployed to production at any time — but the decision to deploy is typically made by a human. CD provides the flexibility to release software frequently and reliably while still allowing for manual approval before deployment.

CD (Continuous Deployment): There is no manual approval process — code revisions are pushed directly into the production environment. Instead, teams rely on testing practices and guidelines to ensure the code meets quality checks before being automatically deployed to production.
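The difference between the two meanings of CD comes down to one gate before production. The sketch below models that gate in plain Python. The stage names and the `run_pipeline` function are hypothetical and only illustrate the flow; a real pipeline would be defined in your CI/CD provider's own configuration.

```python
def run_pipeline(build_ok, tests_ok, mode, approved=False):
    """Return the list of stages that ran for one commit.

    mode is "delivery" (a human approves the production release) or
    "deployment" (the production release is automatic).
    """
    stages = ["build"]
    if not build_ok:
        return stages
    stages.append("test")
    if not tests_ok:
        return stages
    # At this point the artifact is in a releasable state (CI has passed).
    stages.append("release-candidate")
    if mode == "deployment" or (mode == "delivery" and approved):
        stages.append("deploy-production")
    return stages


waiting_for_approval = run_pipeline(True, True, mode="delivery")
approved_release = run_pipeline(True, True, mode="delivery", approved=True)
automatic_release = run_pipeline(True, True, mode="deployment")
failed_tests = run_pipeline(True, False, mode="deployment")
```

In both modes a failing build or test stops the commit before it becomes a release candidate. Only the last step differs.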

Among the CI/CD providers, I recommend learning GitHub Actions. GitHub provides courses on how to use it, and its documentation is excellent. If you need examples, most open source libraries on GitHub have their CI/CD setup public — you can see how they trigger workflows when a pull request is opened or when there’s a merge to the main branch. GitLab CI is also great, especially with its templates. Both are excellent options. I’ve set up hundreds of actions with both, and I can tell you either one is a solid choice. Also, make sure to understand runners and try building your own. As I mentioned before, learn the CI/CD services of AWS or any other cloud provider that interests you in case you need them. For example, in specific cases where you need to set up CI/CD connected to the same VPC as your server or app, it’s more difficult to configure runners from other solutions — so take that into consideration.
Everything Is a Tradeoff

Though this image recommends HashiCorp Vault, I advise you to learn what fits best for you. In my case, I’ve only used cloud secret managers like AWS Secrets Manager and Azure Key Vault, and both work well for their use cases. The same applies to infrastructure monitoring and log management. In my experience, I’ve used Grafana, Elastic, Splunk, New Relic, and AWS CloudWatch, and I can tell you that even though they have some differences, in general they’re pretty similar. So knowing one well, chosen according to your needs, is okay, as long as you understand the different components of the observability stack: logs, metrics, traces, the OpenTelemetry spec, fleet agents, APM, RUM (or browser monitoring), and synthetics. Personally, I like Elastic more, but any of them is fine depending on the tradeoffs and your needs. The same goes for artifact management.
Cloud Design Patterns

https://aws.amazon.com/es/architecture/well-architected/ Rather than going into specific concepts like availability, management, etc., I recommend taking a look at the Well-Architected Framework — which is very similar across clouds and gives you a solid foundational knowledge of cloud setups and best practices.

https://roadmap.sh/pdfs/roadmaps/devops.pdf
References

https://www.reddit.com/r/devops/comments/1f7b1ix/what_is_devops_really/
https://medium.com/@acontreras-mp/devops-and-sdlc-software-development-life-cycle-43cca504d09a
https://aws.amazon.com/es/architecture/well-architected/
https://www.reddit.com/r/sre/comments/1mo453d/what_is_the_difference_between_devops_sre_and/

AWS EC2 Instances purchasing options

Armando Contreras — Mon, 01 Apr 2024 04:58:38 +0000

Understanding the different purchase options for AWS EC2 instances can be confusing at first. In this post I will explain how these models work and how you can benefit from each of them. Optimizing your compute usage can significantly impact your bill. One of the most important decisions when defining the instances for your workload is the lifecycle of each instance: will it run for a short period or a long one? There are two ways to run your instances, On-Demand and Spot; even Reserved Instances run as On-Demand instances with a billing discount applied. Then there is tenancy (shared or dedicated), which determines whether your instances are placed on shared hardware or on a dedicated host or instance. Dedicated tenancy is not covered in this post.

OPTIONS AVAILABLE

So what are the options available?

On-Demand: Pay per second, only for the time the instance is in use. Each instance type has a fixed price per hour, and usage time is metered. Some instances require specific licenses, such as Windows OS, which adds an extra fee. This model is a good option for testing or short-term launches.
Savings Plans
Reserved Instances
Spot Instances
Capacity Reservations
Dedicated Hosts: Pay for a physical host that is fully dedicated to running your instances, and bring your existing per-socket, per-core, or per-VM software licenses to reduce costs.
Dedicated Instances: Pay, by the hour, for instances that run on single-tenant hardware.

I will not cover Dedicated Hosts and Dedicated Instances because they are a very specific topic, and I don't see high value in knowing them in detail for a general audience.

Savings Plans:

Savings Plans offer a flexible pricing model that lets you save money on some AWS services, with discounts of up to 72% compared to On-Demand in the best case. There are two plan types relevant to EC2:

Compute Savings Plans: apply to EC2 instance usage, AWS Fargate, and AWS Lambda service usage, regardless of instance family, size, AZ, region, OS, or tenancy.
EC2 Instance Savings Plans: provide the lowest prices, offering savings up to 72% in exchange for a commitment to usage of an individual instance family in a region (e.g. M5 usage in N. Virginia). This automatically reduces your cost on the selected instance family in that region regardless of AZ, size, OS, or tenancy.

Savings Plans can also be used with SageMaker, but that is beyond my knowledge. The model works in exchange for a commitment to a consistent amount of usage (measured in $/hour) for a 1 or 3 year period, instead of a commitment to specific instance configurations. When you subscribe to a Savings Plan, the prices you pay remain fixed for the period of the plan. There are 3 payment options for your commitment:

All Upfront: the entire commitment is paid in a single payment in the first month.
Partial Upfront: an initial payment (at least 50%), with the remainder spread across reduced monthly payments.
No Upfront: no initial payment; each month you pay a portion of the total.

The percentage of savings depends on factors such as the commitment period and the hourly commitment. If an EC2 Instance Savings Plan is chosen, the region and instance family also become factors to consider. For services that use EC2 behind the scenes, Savings Plans do not apply to the service itself but to the underlying EC2 instances.
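As a rough illustration of how an hourly commitment plays out on the bill, the sketch below uses a simplified model: usage is charged at the discounted plan rate until the commitment is consumed, and anything beyond that is charged at On-Demand rates. The flat `discount` and the dollar amounts are hypothetical; real Savings Plan rates vary by instance family, region, term, and payment option.

```python
def hourly_bill(on_demand_usage, commitment, discount):
    """Return the charge for one hour under a simplified Savings Plan model.

    on_demand_usage: what the hour's usage would cost at On-Demand rates ($)
    commitment:      the committed spend per hour at Savings Plan rates ($)
    discount:        hypothetical flat discount vs On-Demand, e.g. 0.5 for 50%
    """
    usage_at_plan_rates = on_demand_usage * (1 - discount)
    if usage_at_plan_rates <= commitment:
        # The commitment is charged even when usage does not consume it all.
        return commitment
    # Usage beyond what the commitment covers is billed at On-Demand rates.
    covered_on_demand = commitment / (1 - discount)
    return commitment + (on_demand_usage - covered_on_demand)


quiet_hour = hourly_bill(on_demand_usage=10.0, commitment=10.0, discount=0.5)
busy_hour = hourly_bill(on_demand_usage=30.0, commitment=10.0, discount=0.5)
```

In the quiet hour you still pay the full commitment, which is why it makes sense to commit only to your steady minimum usage rather than to your peak.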

Reserved Instances

Reserved Instances (RIs) provide a significant saving on your EC2 costs compared to On-Demand pricing. They are not physical instances but a billing discount applied to On-Demand instances, which must match certain attributes, such as instance type and region, for the discount to apply.
So how does it work? Imagine you have an m6g On-Demand instance running. When you buy the RI, you choose attributes that match the running instance, and the discount is applied automatically whenever an On-Demand instance matches the RI. You can also buy the RI first and launch the instance afterwards.

Which values determine the pricing of the instance?
Instance type, region, tenancy, and OS platform.
Which term commitments are available?
The same as for a Savings Plan: 1 or 3 years, with 3 years offering a bigger discount.

Once you have bought a Reserved Instance, the purchase can't be canceled. However, a feature released the week this post was written lets you return a purchase within 7 days, and you can also modify, exchange, or sell your RI if your needs change.

Offering class

The offering class determines whether you can modify or exchange your RIs.

Comparison table of offering classes

Standard: Offers a significant discount but can only be modified. This means you can adjust some of its attributes during the term of the RI, but it can't be exchanged for a different RI.
Convertible: Provides a lower discount than Standard RIs, but its main advantage is that it can be exchanged for other Convertible RIs with different attributes, as well as modified.

The first thing to define before buying is the scope of the RI. The scope doesn't affect the price, but it changes which features are available. A regional RI doesn't reserve capacity. A zonal RI reserves capacity in its AZ, and the discount only applies to instances launched in that AZ. A zonal RI is also not flexible on instance type and size, while a regional RI applies to usage of a specific family regardless of size (a concept known as instance size flexibility). I encourage you to take a look at this comparison table to understand this in detail.

Regional: the RI is bought for a specific Region (regional RI).
Zonal: the RI is bought for a specific Availability Zone (zonal RI).

How to use the RIs?

A common mistake when we hear "Reserved Instances" is to think that a physical instance is being reserved for our use. This is not true (unless a zonal RI is selected). What really happens is that a Reserved Instance is a discount applied to On-Demand instances. As I explained at the beginning of the RI section, the On-Demand instance must match the RI attributes for the discount to be applied automatically.

Getting deep into instance size flexibility

Instance size flexibility is driven by a concept called the normalization factor. The discount applies fully or partially to On-Demand instances depending on the size of the RI (this only applies to regional RIs). The only attributes that must match are the family, tenancy, and platform.

Normalization factor table

Each instance size has a normalization factor, and this value is used to apply the discount. For example, suppose you own an RI for a t3.medium, which has a normalization factor of 2. If you launch 2 t3.small instances, each with a normalization factor of 1, the benefit applies as if you were using a single medium instance, so both are fully covered. If instead you run a t3.large instance, which has a normalization factor of 4, only 50% of its usage receives the discount.

The discount is applied from the smallest instance size to the largest in use.
Examples of RIs with normalization factor
Under organization accounts, the discount is applied first in the account where the RI was bought and then, if applicable, to another account in the organization (I don't know under which conditions this applies).
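The t3 example above can be worked through in a few lines. This is a minimal sketch that assumes a single regional RI and running instances of the same family, tenancy, and platform. The `ri_coverage` function is hypothetical, and the table lists only a subset of sizes using the normalization factors published in the AWS documentation.

```python
# Normalization factors per instance size (subset).
NORMALIZATION_FACTOR = {
    "nano": 0.25,
    "micro": 0.5,
    "small": 1,
    "medium": 2,
    "large": 4,
    "xlarge": 8,
    "2xlarge": 16,
}


def ri_coverage(ri_size, running_sizes):
    """Return (size, covered_fraction) for each running instance.

    The RI's normalized units are spent on the smallest instances first,
    and the last instance reached may be only partially covered.
    """
    remaining = NORMALIZATION_FACTOR[ri_size]
    coverage = []
    for size in sorted(running_sizes, key=NORMALIZATION_FACTOR.get):
        needed = NORMALIZATION_FACTOR[size]
        covered = min(needed, remaining)
        remaining -= covered
        coverage.append((size, covered / needed))
    return coverage


two_smalls = ri_coverage("medium", ["small", "small"])
one_large = ri_coverage("medium", ["large"])
```

With a medium RI (factor 2), two small instances are fully covered, while a single large instance (factor 4) gets the discount on half of its usage and pays On-Demand for the rest.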

I will not delve further into the topic of reserved instances but here are three links that may be useful to you and that I have not covered in my blog

RI pricing calculator

SPOT INSTANCES

Spot Instances use spare EC2 compute capacity that is not currently in use, offered at a lower price than On-Demand: up to 90% lower in the best case. AWS defines an hourly price for each Spot Instance called the Spot price. The Spot price of each instance type in each Availability Zone is set by Amazon EC2, and is adjusted gradually based on the long-term supply of and demand for Spot Instances.

Spot Instances run only when unused capacity is available.

Spot Instances are a good option for workloads that can be interrupted, giving you the most cost-effective solution with the trade-off of losing instance availability.

KEY CONCEPTS

Spot capacity pool: a group of unused EC2 instances with the same instance type and AZ.
Spot price: the price of the Spot Instance per hour.
Spot Instance request: a request for a Spot Instance, fulfilled when capacity is available. It can be one-time or persistent; EC2 automatically resubmits a persistent request after the instance it fulfilled is interrupted.
Spot Instance interruption: EC2 terminates, stops, or hibernates your Spot Instance when it needs the capacity back. An interruption notice is sent two minutes before the interruption so you can rebalance your workload.
EC2 instance rebalance recommendation: EC2 emits a rebalance recommendation to notify you that a Spot Instance is at high risk of being interrupted. This gives you the opportunity to rebalance your workloads onto other Spot Instances without waiting for the 2-minute interruption notice.

I will not delve into the usage of Spot Instances because it is a very extensive topic, and I expect to write a dedicated post in the future showing its usage in detail in conjunction with EKS and Karpenter.
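The two-minute interruption notice from the key concepts can be read from inside the instance through the instance metadata service. The sketch below uses only the standard library and assumes IMDSv2 is enabled. The helper names and the one-second timeout are my own choices, and `fetch_interruption_notice` only does something meaningful when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Parse the JSON interruption notice into (action, interruption time)."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return notice["action"], when.replace(tzinfo=timezone.utc)


def fetch_interruption_notice(timeout=1.0):
    """Return (action, time) if an interruption is scheduled, else None."""
    # IMDSv2 requires a session token obtained with a PUT request.
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()

    notice_request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(notice_request, timeout=timeout) as response:
            return parse_instance_action(response.read().decode())
    except urllib.error.HTTPError as error:
        if error.code == 404:
            # The endpoint returns 404 while no interruption is pending.
            return None
        raise


# Sample notice body in the documented shape, parsed without any network call.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
action, interruption_time = parse_instance_action(sample)
```

A worker would typically poll `fetch_interruption_notice()` every few seconds and start draining as soon as it returns something other than `None`.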

CAPACITY RESERVATION

A Capacity Reservation (CR) allows you to reserve compute capacity for EC2 instances in a specific AZ. This is very useful for workloads with strict capacity requirements that need capacity assurance for high availability: you always have access to the EC2 capacity you reserved, for as long as you need it.
There are other use cases, such as machine learning model training, where you may need more GPUs during the training process but no capacity is available at that time. With CRs you can make sure you have the capacity needed for the duration of the training (I will not delve into Capacity Blocks for ML due to my lack of knowledge in the ML area).
One of the main advantages of this purchase model is that you can create a reservation at any time, without a one-year or three-year term commitment. The capacity becomes available and billing starts as soon as the Capacity Reservation is provisioned in your account. When you no longer need the capacity assurance, cancel the Capacity Reservation to release the capacity and stop incurring charges.
This model can be combined with the billing discounts of Savings Plans and regional RIs to reduce the cost of a Capacity Reservation. Zonal RIs don't need a Capacity Reservation because capacity reservation is already part of that purchase model.

CR's can only be used by instances that match their attributes. By default they are automatically used by running instances that match the attributes. If you don't have any running instances that match the attributes of the Capacity Reservation, it remains unused until you launch an instance with matching attributes

Differences between Capacity Reservations, Reserved Instances, and Savings Plans

This is a great table for understanding the differences between the main pricing models.

It is important to know that billing starts as soon as the CR starts and ends once it expires or is cancelled.

How to make decisions?

It fully depends on your use case. For an initial POC you can start with On-Demand. Once you have a clear definition of your minimum usage, I would go with Savings Plans, not to cover all the costs but to at least get a discount on a defined minimum usage of your EC2 instances. Then, once you have a clear picture of your usage and your expectations for the future, I would look into RIs with at least a 1-year commitment, but only for workloads that I'm sure will use that capacity for the whole period. A 3-year commitment for RIs? It's a great discount, but it is a big commitment; I would prefer a Convertible RI in case the business needs change over time. Spot Instances? Only for workloads that can be interrupted and whose cost I want to optimize. I have seen amazing cost optimization stories using Spot Instances with EKS and Karpenter, so if you know how to use it you can get huge savings. In the end, you probably won't find the best option on the first try. Cost optimization is progressive work that matures over time, together with your workload expectations and business needs.

I hope you liked this content. I'm writing this EC2 in detail series as part of my deep study of EC2. I found myself in situations where I was creating cost estimates and didn't know the differences between the EC2 purchase options. Now that I've had the opportunity to delve into this topic, I feel very confident about providing the closest-to-right solution depending on the use case and the needs of my clients. Maybe that has happened to others too... being on the AWS cost calculator and not knowing the difference between each purchase option😄

AWS EC2 Instances Types (all you need to know)

Armando Contreras — Mon, 01 Apr 2024 03:24:03 +0000

I'm sure you have been involved with EC2 at some point. In my opinion, this is probably the most important service for most AWS workloads. In almost all of the solution architectures I have been involved in, the biggest cost is EC2. I have often found overprovisioned instances, cost optimization opportunities from changing the instance type or architecting applications to run on ARM processors, and workloads that would simply work better on a different instance family, all without compromising operational excellence. In this post, I will guide you through the details of EC2 instance types that you need to know to spot these opportunities and optimize your operational costs.

NITRO SYSTEM

Before going deep into the instance types, you need to know that AWS built its own virtualization technology called the Nitro System. It is a combination of hardware and software built for high performance, availability, and security, offering near bare-metal capabilities. A common myth about cloud virtualization is that performance is reduced compared with on-premises. The Nitro System removes much of the resource overhead consumed by common virtualization technologies, which is very useful for workloads that need full access to the host hardware (later I will explain what the bare metal instance types are). According to the AWS documentation, the performance is so close to bare metal that you can't notice the difference. The Nitro System focuses on the essentials and removes elements commonly used in other virtualization technologies. One of its main features is dedicated cards that take over the networking, storage, and security functions of the host, removing that duty from the instance hardware. The Nitro System is the underlying virtualization technology for the latest generation of EC2 instances, and it gives AWS full control over the virtualization stack. I encourage you to take a look into the history of Nitro and how it helped AWS develop and offer a wide range of instance types without sacrificing performance, while reducing costs and offering better security. It even enabled AWS to create its own processors, such as Graviton, giving the client exactly the compute resources it asked for.

UNDERSTANDING THE EC2 INSTANCES TYPES AND ITS NOTATION

All EC2 instance types with details (current generation)
Currently, we have two instance generations, the previous and the current generation powered by the Nitro system. You can still use the previous generation, but the recommendation is to use the latest.

INSTANCES TYPES PRIMARY CATEGORIES

The use cases of the workloads running on EC2 can be very different. For this reason, the instance types are divided into categories allowing you to right-size your workloads either horizontally or vertically.

The type of instance selected at the time of launch will determine the hardware available for your application. Each type offers different capabilities such as memory, compute, storage, and network (sometimes these capabilities are combined, I'll explain later). Each of these instance types are categorized into groups or families.

General Purpose: Offers a similar proportion of compute, memory, and networking. Ideal for apps that use these resources in a balanced proportion.
Compute Optimized: Designed for apps with intensive CPU usage.
Memory Optimized: Designed for high performance for apps/databases with high memory (RAM) usage such as cache databases.
Storage Optimized: Designed for workloads with high I/O with high volumes of storage with optimized IOPS.
Accelerated Computing: Instances that use hardware accelerators, or co-processors, to perform functions more efficiently than is possible in software running on CPUs.
HPC Optimized: Designed for HPC workloads at scale. My understanding of this topic is null so I can't provide more details.

NOMENCLATURE CONVENTION

EC2 currently offers a lot of different instance types, and I know that at the beginning it can be very confusing to decide which one makes more sense for your workloads. So let's make this easy for you and start by understanding the nomenclature. Instance types are named based on family, generation, processor family, additional capabilities, and size.

The first position is the family, for example c, r, or m.
The second position indicates the generation (version) of the instance family. Do not confuse this with the instance generations (previous or Nitro-powered). It's more like a version number for the instance type, generally a digit (5, 6, 7).
The third position indicates the processor family: g (Graviton), a (AMD), or i (Intel). The remaining letters indicate additional capabilities such as instance store volumes, networking capability, or characteristics of the CPU.
After the dot comes the size of the instance, such as small, xlarge, or metal for bare metal instances.

examples:

c6gd.medium: In this case, 'c' stands for the compute-optimized family, '6' is the generation of this family, 'g' indicates that it uses a Graviton processor, and 'd' indicates that it has additional instance storage, in most cases an NVMe volume. Lastly, '.medium' is the instance size, which in this case is 1 vCPU and 2 GiB of RAM.
c7gn.large: In this case, 'c' stands for the compute-optimized family, '7' indicates the generation of the family, 'g' indicates that it uses Graviton processors, and 'n' indicates that it is network and EBS optimized. Lastly, '.large' is the instance size, which in this case is 2 vCPUs and 4 GiB of RAM.
i3en.metal: Here, 'i' stands for the storage-optimized family, '3' indicates the generation of the family, 'e' stands for extra storage or memory, and 'n' indicates that it is network and EBS optimized. The 'metal' after the dot is the instance size, which in this case refers to a bare metal instance that provides all the resources of the underlying server without the Nitro hypervisor. With bare metal instances you have the option to run your own hypervisor. I haven't had an opportunity to test this type of setup, so I can't provide more details.
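
The naming rules above can be sketched as a small Python parser. This is only an illustration under some assumptions: the function name `parse_instance_type`, the `InstanceType` tuple, and the regular expression are my own, and the first attribute letter is treated as a processor only when it is one of the three letters listed in this post (g, a, i). Names that don't follow the common pattern, such as u-6tb1.metal, are rejected rather than guessed.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Processor letters listed in this post; any other attribute letter
# is treated as an additional capability.
PROCESSOR_LETTERS = {"a", "g", "i"}

# family letters, generation digits, optional attribute letters, dot, size
_PATTERN = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


class InstanceType(NamedTuple):
    family: str
    generation: int
    processor: Optional[str]       # None when the name carries no processor letter
    capabilities: Tuple[str, ...]  # remaining attribute letters, in order
    size: str


def parse_instance_type(name: str) -> InstanceType:
    """Split a name such as 'c6gd.medium' into its positions."""
    match = _PATTERN.match(name.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, attrs, size = match.groups()
    processor = None
    if attrs and attrs[0] in PROCESSOR_LETTERS:
        processor, attrs = attrs[0], attrs[1:]
    return InstanceType(family, int(generation), processor, tuple(attrs), size)


print(parse_instance_type("c6gd.medium"))
print(parse_instance_type("i3en.metal"))
```

For i3en.metal the parser reports no processor letter, because 'e' and 'n' are both capabilities; that matches the reading of the example above.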

Once you understand this nomenclature, let's proceed with understanding the instance families. Warning: This can be a lot of data, but I will not go deep into each instance family, just an overview to give you an idea.

INSTANCE FAMILIES

C - Compute Optimized: Optimized for workloads that require high usage of compute, ideally for applications with intensive use of CPU.
D - Dense Storage: Designed for workloads that require large quantities of dense storage (HDD), ideal for data warehousing.
G - Graphics Intensive: Equipped with GPUs and optimized for workloads with intensive use of graphics, such as 3D, video streaming, or graphic design.
HPC - High Performance Computing: Used for HPC workloads, offering high networking throughput and compute capacity. I haven't seen workloads using this instance type, and it is a very specific topic outside my knowledge.
I - Storage Optimized: Optimized for storage, ideally for databases or other types of applications that require high storage operation.
Im - Storage Optimized subcategory with a 1:4 ratio of vCPU to memory: Offers a specific proportion of vCPUs to memory, oriented to specific databases and storage applications.
Is - Storage Optimized subcategory with a 1:6 ratio of vCPU to memory.
Inf - AWS Inferentia: Designed for machine learning inference. I have no experience with machine learning instances, so I can't provide more details on this instance type.
M - General Purpose: The most common instance type, offering a balanced proportion of compute, memory, and network. Widely used for web server workloads.
Mac - macOS: Mac instances, commonly used for developing applications for Apple platforms.
P - GPU Acceleration: Offers GPUs and is optimized for parallel compute. I have seen workloads using this instance type for training machine learning models.
R - Memory Optimized: For apps that need high memory usage, such as databases, big data apps.
T - Burstable Performance: Offers a good balance between cost and performance for workloads that don't use the CPU constantly but have occasional spikes. This is done via CPU credits, which I will explain later.
Trn - AWS Trainium: Designed for training machine learning models. I don't know the difference between this, Inf, and P families.
U - High Memory: Designed for applications that need very large quantities of memory, such as in-memory databases. This makes me wonder how it differs from the R family.
VT - Video Transcoding: Optimized for transcoding of video.
X - Memory Intensive: Similar to the R family but fully focused on memory, made for workloads with extremely large quantities of memory.
F - FPGA: Field Programmable Gate Arrays. I can't give a description of this because I don't fully understand its use.
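
To tie this list back to the primary categories, here is a small lookup sketch in Python. The grouping is my own reading of how the families in this post map onto the categories (for example, Mac under General Purpose and VT and F under Accelerated Computing), so treat it as an assumption and check it against the AWS documentation. The names `FAMILY_CATEGORY` and `category_of` are hypothetical.

```python
import re

# Family prefix -> primary category. Assumed grouping, see the note above.
FAMILY_CATEGORY = {
    "m": "General Purpose",
    "t": "General Purpose",
    "mac": "General Purpose",
    "c": "Compute Optimized",
    "r": "Memory Optimized",
    "x": "Memory Optimized",
    "u": "Memory Optimized",
    "i": "Storage Optimized",
    "im": "Storage Optimized",
    "is": "Storage Optimized",
    "d": "Storage Optimized",
    "p": "Accelerated Computing",
    "g": "Accelerated Computing",
    "inf": "Accelerated Computing",
    "trn": "Accelerated Computing",
    "vt": "Accelerated Computing",
    "f": "Accelerated Computing",
    "hpc": "HPC Optimized",
}


def category_of(instance_type: str) -> str:
    """Return the primary category for a name such as 'r6g.large'."""
    # The family is the run of letters before the generation number.
    match = re.match(r"[a-z]+", instance_type.strip().lower())
    if match is None:
        return "Unknown"
    return FAMILY_CATEGORY.get(match.group(0), "Unknown")


for name in ("m5.large", "is4gen.large", "hpc7g.4xlarge", "trn1.32xlarge"):
    print(name, "->", category_of(name))
```

Families that are not in this post simply come back as "Unknown", which is a reminder that the table is a cheat sheet and not a complete catalog.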

I would like to create a better table presentation of the instance families, but this is just a high-level description. You don't need to know all of them in detail, so it's up to you to dive deep into those that caught your interest.

PROCESSOR FAMILIES

AMD and Intel each have a line of processors specifically designed for servers running virtualization technologies. If you go to their websites, you will find some of the processors that AWS offers in the instances with the letters a and i. Some instances don't have a processor letter in their name; in most cases these use Intel processors. I don't know the reason for this difference.

a - AMD processors
g - AWS Graviton processors: These instances use the Graviton processors created by AWS, based on ARM architecture. These offer a better cost-performance ratio compared to other processors of the same instance size, and can be more beneficial for some workloads.
i - Intel processors

ADDITIONAL CAPACITIES

b - Block Storage Optimization: These instances are optimized for block storage (EBS), with up to triple the EBS throughput and IOPS of the base variant.
d - Instance Store Volumes: Offers temporary storage directly attached to the host hardware, ideal for temporary data that needs high-speed access, such as caches. In most cases, it's an ephemeral NVMe volume.
e - Extra Storage or Memory: Indicates that the instance has extra memory (RAM) or extra storage capacity, which is useful for applications that need very high RAM usage or very large storage (HDD).
n - Networking and EBS Optimized: Offers high networking throughput and EBS, ideal for app workloads that require high network throughput.
z - High Performance: Designed for high performance in general. Can include processors with specific capabilities, more GHz, better networking, and IOPS.

In some cases the additional capabilities can be combined. I listed only the most important ones here; there are some other capabilities that appear in just a single instance type, so I don't see much value in knowing them.

Hey, I know it's a lot of data, and honestly you don't need to know all of it, but it is very important to understand the notation of the instance types. Save this post as a cheat sheet for when you are right-sizing your workloads. In my experience, right-sizing your workloads is a very complex task, and in most cases it is more expensive to benchmark different instance types than to simply pick one and move on. For long-term solutions, though, right-sizing is very beneficial, especially combined with the purchase options available for EC2, which I'll explain in a different post of this series.

At this point I have some questions for you.

What is the meaning of each letter? (Select different examples from the list of AWS instance types.)
If I gave you any of the most commonly used instance types, would you be able to explain its nomenclature?
What is the Nitro System?
What are the different categories of instance families?
What is the difference between a .metal (bare metal) instance and an instance of the same family that is not .metal?

I missed something important to tell you, and we are almost done...

Lastly, we have one of my favorite and most commonly used types: the burstable instances.

I encourage you to take a look at this link for a detailed explanation of them. I have seen overprovisioned workloads that could benefit from this instance type: those with occasional spikes of CPU usage followed by intermittent periods of lower usage. I will explain ASGs and all the capabilities they offer in a different post. For comparing your instance selection, the best tool that you can use is:

instances.vantage.sh

This tool allows you to compare different instance types and sizes at the same time.
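
Coming back to the burstable instances for a moment, the credit mechanism can be sketched with a bit of arithmetic. One CPU credit equals one vCPU running at 100% for one minute, so an instance earns credits at its baseline rate and spends them at its actual utilization. The function names below are hypothetical, the 2 vCPU and 10% baseline figures are example inputs I chose for illustration, and the sketch ignores unlimited mode, so check the real numbers for your instance size.

```python
def credits_earned_per_hour(vcpus: int, baseline_percent: float) -> float:
    """Credits accrued in one hour at the baseline rate."""
    # 1 credit = 1 vCPU at 100% for 1 minute, and an hour has 60 minutes.
    return vcpus * baseline_percent * 60 / 100


def credits_spent_per_hour(vcpus: int, utilization_percent: float) -> float:
    """Credits consumed in one hour at the given average CPU utilization."""
    return vcpus * utilization_percent * 60 / 100


def hours_until_depleted(balance: float, vcpus: int,
                         baseline_percent: float,
                         utilization_percent: float) -> float:
    """How long a credit balance lasts at a sustained utilization."""
    net_spend = (credits_spent_per_hour(vcpus, utilization_percent)
                 - credits_earned_per_hour(vcpus, baseline_percent))
    if net_spend <= 0:
        # At or below baseline the balance never runs out.
        return float("inf")
    return balance / net_spend


# Example inputs (assumed): 2 vCPUs with a 10% baseline per vCPU.
print(credits_earned_per_hour(2, 10))
print(hours_until_depleted(54, 2, 10, 100))
```

The useful takeaway is the shape of the calculation: a workload that averages below the baseline keeps accumulating credits, while a sustained spike drains the balance at the difference between spend and earn rates.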

Thanks for making it to the end of this post. I know a lot of us may think that we already know EC2, but knowing the details of the instance types is very beneficial for those use cases where you need to provide the best solution for a specific workload, cost-optimized without compromising operational excellence.

AWS EC2 IMDS (Instance Metadata Service): all that you need to know

Armando Contreras — Sun, 24 Mar 2024 22:56:26 +0000

In my detailed study of the AWS EC2 service, I came across a common configuration called IMDS, which stands for Instance Metadata Service. If you’ve launched an EC2 instance or created a launch template for an Auto Scaling group, you’ve likely seen this option.

So, what is it? IMDS is a local service endpoint that your services, scripts, or applications within your EC2 instances can connect to in order to acquire instance metadata, such as hostname, events, security groups, or AWS credentials. It is important to note that you can only access this endpoint from the instance itself.

The data shared by this service is not protected by authentication or any cryptographic method. This means that anyone who has access to the instance from the inside can access this endpoint. Therefore, you should never store sensitive data like passwords in the user data of your instances or launch templates.

The service is exposed on two IP addresses: 169.254.169.254 for IPv4 and [fd00:ec2::254] for IPv6. The IPv6 address of the IMDS is compatible with IMDSv2 commands and is only accessible on instances powered by the Nitro System.

How to use it?

This service comes in two flavors:

IMDSv1— a request/response method
IMDSv2 — a session-oriented method

You can allow both versions (session token optional) or require IMDSv2 only. The token headers sent with the PUT and GET requests are unique to IMDSv2. If these headers are present in the request, the request is intended for IMDSv2. If no headers are present, it is assumed the request is intended for IMDSv1. My recommendation is to use only v2 because of its security improvements, which I will comment on next.

Add defense in depth against open firewalls, reverse proxies, and SSRF vulnerabilities with enhancements to the EC2 Instance Metadata Service | AWS Security Blog

aws.amazon.com

This blog post provides a detailed explanation of the security improvements in v2 compared to v1, and I recommend reading it. As an overview, IMDSv2 requires a session token for every request. A session is requested via an HTTP PUT request with a header that sets the token duration (with a maximum of 6 hours, or 21600 seconds). There is no limit on the number of requests made within a session or on the number of sessions. The token is only usable from within the same EC2 instance.

IP_ADDRESS_IMDSV2="169.254.169.254"  # IPv4 endpoint of the IMDS
# Request a session token valid for one hour
TOKEN=$(curl -s -X PUT "http://$IP_ADDRESS_IMDSV2/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 3600")
# Send the token with every metadata request
curl "http://$IP_ADDRESS_IMDSV2/latest/meta-data/profile" -H "X-aws-ec2-metadata-token: $TOKEN"
curl "http://$IP_ADDRESS_IMDSV2/latest/user-data" -H "X-aws-ec2-metadata-token: $TOKEN"
curl "http://$IP_ADDRESS_IMDSV2/latest/dynamic" -H "X-aws-ec2-metadata-token: $TOKEN"

The use of a PUT request adds an extra layer of security to IMDSv2 against misconfigured third-party firewalls. By default, most of these firewalls do not support PUT requests. This helps avoid unauthorized access via a misconfigured Web Application Firewall (WAF) or reverse proxy.

GET and HEAD methods are allowed in IMDSv2 instance metadata requests. PUT requests are rejected if they contain an X-Forwarded-For header.
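
If your application is written in Python, the same two-step flow (PUT for a token, then GET with the token) can be sketched with only the standard library. This is a minimal sketch, not an official client: the helper names `build_token_request`, `build_metadata_request`, and `fetch_metadata` are my own, and it assumes the IPv4 endpoint and the two header names shown in the curl commands of this post.

```python
import urllib.request

IMDS_BASE_URL = "http://169.254.169.254"  # IPv4 endpoint named in this post
TOKEN_TTL_HEADER = "X-aws-ec2-metadata-token-ttl-seconds"
TOKEN_HEADER = "X-aws-ec2-metadata-token"


def build_token_request(ttl_seconds: int = 3600) -> urllib.request.Request:
    """Build the PUT request that opens an IMDSv2 session."""
    if not 1 <= ttl_seconds <= 21600:
        raise ValueError("token TTL must be between 1 and 21600 seconds")
    return urllib.request.Request(
        f"{IMDS_BASE_URL}/latest/api/token",
        method="PUT",
        headers={TOKEN_TTL_HEADER: str(ttl_seconds)},
    )


def build_metadata_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request for a path under /latest/, carrying the token."""
    return urllib.request.Request(
        f"{IMDS_BASE_URL}/latest/{path.lstrip('/')}",
        headers={TOKEN_HEADER: token},
    )


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Fetch one metadata value. Only works from inside an EC2 instance."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_metadata_request(path, token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

From inside an instance, calling `fetch_metadata("meta-data/instance-id")` should return the instance ID; anywhere else the call simply times out, because the endpoint is only reachable from the instance itself.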

Knowing all this, my advice is to configure your instances to use IMDSv2 whenever you need the metadata service or when the application running inside your instance uses the instance profile attached to it.

A good practice with containerized workloads on EKS is to restrict pod access to this endpoint. One way to block pod IMDS access is to apply a network policy, enforced by the Amazon VPC CNI or an add-on, so that pods are unable to reach the Instance Metadata Service. To do this, configure your network policy to block egress traffic to 169.254.0.0/16. Another way to block pod IMDS access is to require IMDSv2 and set the maximum hop count to 1. Configuring IMDS this way causes requests to IMDS from pods to be rejected, provided those pods do not use host networking.

Instead, to grant IAM permissions to your pods, use Kubernetes service accounts mapped to IAM roles through OIDC. This prevents pods from escalating their IAM permissions through the node's role and helps you enforce least privilege.

You can also prevent the launch of instances that do not require IMDSv2 with an IAM policy like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RequireImdsV2",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringNotEquals": {
                    "ec2:MetadataHttpTokens": "required"
                }
            }
        }
    ]
}

Or prevent the hop limit from exceeding a fixed value:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MaxImdsHopLimit",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "NumericGreaterThan": {
                    "ec2:MetadataHttpPutResponseHopLimit": "2"
                }
            }
        }
    ]
}
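
Deny statements with negated conditions can be tricky to read, so here is the logic of the two policies above written as plain Python. This is only a mental model under simplifying assumptions, not how IAM evaluates policies internally: the function `is_launch_denied` is hypothetical and looks only at the two condition keys used above.

```python
def is_launch_denied(http_tokens: str, hop_limit: int,
                     max_hop_limit: int = 2) -> bool:
    """Return True if either Deny statement above would match the launch."""
    # First policy: StringNotEquals ec2:MetadataHttpTokens "required"
    denied_by_tokens = http_tokens != "required"
    # Second policy: NumericGreaterThan ec2:MetadataHttpPutResponseHopLimit "2"
    denied_by_hop_limit = hop_limit > max_hop_limit
    return denied_by_tokens or denied_by_hop_limit


# IMDSv2 required with a hop limit of 1: the launch is allowed.
print(is_launch_denied("required", 1))
# IMDSv1 still allowed (tokens optional): the launch is denied.
print(is_launch_denied("optional", 1))
# IMDSv2 required but hop limit above the maximum: the launch is denied.
print(is_launch_denied("required", 3))
```

Read this way, the first policy says "deny anything that is not IMDSv2-only" and the second says "deny anything with a hop limit above 2"; a launch has to pass both.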

Other endpoints available to retrieve data from the IMDS
Check other IAM policies related to the IMDS

Lastly, to avoid confusion, there is an important difference between the instance identity role and the instance profile. AWS services and features that are integrated with the instance identity role can use it to identify the instance to the service. The instance identity role credentials are accessible from the Instance Metadata Service (IMDS) at /identity-credentials/ec2/security-credentials/ec2-instance. The instance profile, in contrast, is a container for an IAM role that the instance assumes in order to send requests to AWS services.

I hope this post is useful for you. This was a feature that I didn’t know about until today, and it is a very important security consideration for any workload on EC2 instances. : )