Anyone who has been following my content on social media knows that I am a huge advocate of cloud adoption; I have been focusing on various cloud-related topics for almost a decade.
When organizations take their first steps in the public cloud, or rush into it, they make a lot of mistakes. To name a few:
- Failing to understand why they are using the cloud in the first place, and what value the public cloud can bring to their business
- Bringing a legacy data center mindset and practices into the public cloud, which results in inefficiencies
- Not embedding cost as part of architecture decisions, which results in high cloud usage costs
In this post, we will focus on the next steps in embracing the public cloud, or what is sometimes referred to as Day 2 cloud operations.
What do all those days mean?
In software engineering terms, Day 0 is the design phase: you collect requirements for moving an application to the cloud, or for developing a new application in the cloud.
Day 1 in cloud operations is where most organizations are stuck. They begin migrating several applications to the cloud, deploying some workloads directly into cloud environments, and perhaps even running their first production applications for several months, or even a year or two. This is the phase where development and DevOps teams are still debating which infrastructure is most appropriate (VMs, containers, perhaps even serverless, managed vs. self-managed services, etc.).
Day 2 in cloud operations is where things get interesting. Teams begin to realize the ongoing cost of services, the amount of effort required to deploy and maintain workloads manually, the security aspects of cloud environments, and the troubleshooting and monitoring of production incidents.
What does Day 2 cloud operations mean?
When organizations reach Day 2 of their cloud usage, they look back at the mistakes they made earlier and begin to fine-tune their cloud operations.
Automation is king
Unless your production contains one or two VMs with a single database, manual work is no longer an option.
Assuming your applications are not static, it is time to switch development to a CI/CD process and automate (almost) the entire development lifecycle – from code review, static or dynamic application security testing, and quality tests, through build creation (for example, packaging source code into container images), up to the final deployment of a fully functional version of the application.
This is also the time to invest in learning and using automated infrastructure deployment using an Infrastructure as Code (IaC) language such as Terraform, OpenTofu, Pulumi, etc.
The use of IaC will allow you to benefit from code practices such as versioning, rollback, and auditing (who made which change), and naturally to reuse the same code for different environments (dev, test, prod) while getting the same results.
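As a small illustration, here is a minimal sketch using Pulumi (one of the IaC tools mentioned above) with Python. The resource, tag names, and the `environment` configuration value are assumptions made for the example, not a reference setup:

```python
"""Minimal Pulumi (Python) sketch - illustrative only, not a reference setup."""
import pulumi
import pulumi_aws as aws

# The same code is reused per environment; only the stack configuration changes.
config = pulumi.Config()
env = config.require("environment")  # e.g. "dev", "test", "prod" (assumed config key)

# A versioned S3 bucket, tagged per environment for cost tracking and auditing.
app_bucket = aws.s3.Bucket(
    f"app-artifacts-{env}",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={"environment": env, "managed-by": "pulumi"},
)

pulumi.export("bucket_name", app_bucket.id)
```

Running the same program against different stacks (e.g., `pulumi up --stack dev` vs. `--stack prod`) is what gives you identical, reproducible environments from a single code base.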
Rearchitecting and reusing cloud-native capabilities
On Day 1, it may be acceptable to keep traditional architectures (such as manually maintained VMs), but on Day 2 it is time to take full advantage of cloud-native services.
The easiest way is to replace any manual maintenance of infrastructure with managed services – in most cases, switching to a managed database, storage service, or even managed load-balancers and API gateways will provide many benefits (such as lower maintenance overhead and automatic resource allocation), while allowing IT and DevOps teams to focus on supporting and deploying new application versions instead of maintaining operating systems and servers.
If you are already re-evaluating past architecture decisions, it is time to think about moving to a microservices architecture, decoupling complex workloads into smaller, more manageable components owned by the development teams that build them.
For workloads with predictable customer demand (in terms of load spikes), consider using containers.
If your developers and DevOps teams are familiar with packaging applications inside containers, but lack experience with Kubernetes, consider using services such as Amazon ECS, Azure App Service, or Google Cloud Run.
If your developers and DevOps teams have experience using Kubernetes, consider using one of the managed flavors of Kubernetes such as Amazon EKS, Azure AKS, or Google GKE.
Do not stop at container technologies. If your workload is unpredictable (in terms of customer load), consider taking the architecture one step further and using Function-as-a-Service (FaaS) offerings such as AWS Lambda, Azure Functions, or Google Cloud Functions, or event-driven architectures built on services such as Amazon EventBridge, Azure Event Grid, or Google Eventarc.
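To make the FaaS plus event-driven idea concrete, here is a minimal AWS Lambda handler sketch that reacts to an EventBridge event. The `order.created` payload shape and the `orderId` field are hypothetical examples invented for illustration:

```python
"""Minimal AWS Lambda handler sketch for an event-driven flow (illustrative only)."""
import json
import logging

# In the Lambda runtime, the root logger already has a handler attached.
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    # EventBridge delivers the event payload under the "detail" key.
    detail = event.get("detail", {})
    order_id = detail.get("orderId", "unknown")  # hypothetical field

    # Real business logic would go here; we only log the event for illustration.
    logger.info("Processing order %s from event %s", order_id, event.get("id"))

    return {"statusCode": 200, "body": json.dumps({"processed": order_id})}
```

The function runs only when an event arrives and is billed per invocation, which is exactly why this model suits unpredictable customer load.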
Resiliency is not wishful thinking
The public cloud, and the use of cloud-native services, allow us to raise the bar in terms of building highly resilient applications.
In the past, we needed to purchase solutions such as load-balancers, API gateways, DDoS protection services, and more, and we had to learn how to maintain and configure them.
Cloud providers offer us managed services, making it easy to design and implement resilient applications.
Customer demand has also raised the bar – customers are no longer willing to accept downtime or availability issues when accessing applications. They expect (almost) zero downtime, which forces us to design applications with resiliency in mind from day one.
We need to architect our applications as clusters, deployed in multiple availability zones (and in rare cases even in multiple regions), but also make sure we constantly test the resiliency of our workloads.
We should consider implementing chaos engineering as part of the application development and test phases, conducting controlled experiments (at the bare minimum in the test stage, and ideally also in production) to understand the impact of failures on our applications.
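Below is a minimal sketch of such a controlled experiment, assuming a multi-AZ web tier behind a load balancer. The `app=web-tier` tag filter and the health-check URL are assumptions for the example; dedicated services (such as AWS Fault Injection Service) or tools like Chaos Monkey offer far more controlled experiments than this:

```python
"""Minimal chaos-experiment sketch: stop one instance, verify the app stays healthy."""
import random
import urllib.request

import boto3

ec2 = boto3.client("ec2")
HEALTH_URL = "https://app.example.com/health"  # hypothetical endpoint


def pick_random_instance(tag_value="web-tier"):
    # Find running instances belonging to the (assumed) web tier.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:app", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    return random.choice(instances) if instances else None


def run_experiment():
    instance_id = pick_random_instance()
    if not instance_id:
        print("No target instances found - aborting experiment.")
        return
    print(f"Stopping {instance_id} to simulate an instance failure...")
    ec2.stop_instances(InstanceIds=[instance_id])

    # Verify the application stays available while the cluster self-heals.
    status = urllib.request.urlopen(HEALTH_URL, timeout=5).status
    print(f"Health check returned HTTP {status}")


if __name__ == "__main__":
    run_experiment()
```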
Observability to the aid
The traditional monitoring of infrastructure and applications is no longer sufficient in modern and dynamic applications.
Traditional monitoring tools (commonly deployed as agents) cannot handle the dynamic nature of modern applications, where new components (from containers to functions) are deployed, run for a short amount of time (according to application demand and configuration), and are decommissioned when no longer needed.
We need to embed monitoring in every aspect of our workloads, at all layers – from the network layer (such as flow logs), through the infrastructure layer (load-balancer, OS, container, and function logs), all the way to application and even customer-experience logs.
Storing logs is not enough – we need managed services that constantly review logs from various sources (ideally aggregated into a central log system), use machine learning capabilities to anticipate issues before they impact the customer experience, and provide insights and near real-time recommendations for fixing the problems that arise.
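A practical prerequisite for that kind of central analysis is emitting structured logs. Here is a minimal sketch of a JSON log formatter in Python; the service name and context fields are illustrative assumptions:

```python
"""Minimal structured-logging sketch: JSON log lines a central log system can aggregate."""
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "orders-api",  # assumed service name
            "message": record.getMessage(),
            # Request-scoped context helps correlate short-lived containers and
            # functions once the logs reach the central log system.
            **getattr(record, "context", {}),
        }
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: attach context so downstream analysis can trace a single request end to end.
logger.info("order processed", extra={"context": {"request_id": "abc-123", "latency_ms": 42}})
```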
Cost and efficiency
In the public cloud, almost every service has its own pricing – sometimes it is the time a compute resource runs, the number of function invocations, the files a storage service holds, database queries, or even data egress from the cloud environment back to on-premises or to the public Internet.
Understanding the pricing of each component in a complex architecture is crucial, but not enough.
We need to embed cost in every architecture decision, understand which pricing option provides the best value (for example, choosing between on-demand, savings plans, or Spot), and monitor each workload's cost regularly.
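As one example of regular cost monitoring, here is a minimal sketch that pulls a cost-per-service breakdown from the AWS Cost Explorer API. The date range and grouping dimension are illustrative; grouping by your own cost-allocation tags is usually more useful per workload:

```python
"""Minimal cost-monitoring sketch using the AWS Cost Explorer API (illustrative only)."""
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # illustrative month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print cost per service so the most expensive components stand out.
for group in response["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:,.2f}")
```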
Cost is very important, but not enough.
We need to embed efficiency in every architecture decision – are we using the most suitable compute service, the most suitable storage tier (from real-time access to archive), the most suitable function resources (in terms of memory/CPU), etc.?
We need to combine an architect's view (being able to see the bigger picture), with an engineer or developer's experience (being able to write efficient code), to meet the business requirements.
Security is job zero
I cannot stress enough how important security is in today's world.
I have already mentioned the dynamic nature of modern cloud-native applications; together with the new threats identified every day, it requires a new mindset when talking about security.
First, we need to embed automation – from testing new versions of code, regularly scanning for vulnerable open-source libraries, and embedding SBOM (Software Bill of Materials) solutions (to know which components we are using), to automatically deploying security patches and running automated vulnerability scanning tools to detect vulnerabilities as early as possible.
We should consider implementing immutable infrastructure, switching from ever-changing VMs (containing libraries, configuration, code, and data) to read-only immutable VM or container images that are updated to new versions through an automated CI/CD process.
Data must be encrypted end-to-end, to protect its confidentiality and integrity.
Mature cloud providers allow us to manage both encryption keys (using customer-managed keys) and secrets (i.e., static credentials) with managed services that are supported by (almost) all cloud-native services, which makes it extremely easy to protect data.
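For example, here is a minimal sketch of reading database credentials from a managed secrets service (AWS Secrets Manager in this case). The secret name and its fields are hypothetical:

```python
"""Minimal sketch: fetch credentials from AWS Secrets Manager at runtime."""
import json

import boto3

secrets = boto3.client("secretsmanager")


def get_db_credentials(secret_id="prod/orders/db"):  # hypothetical secret name
    # The secret never lives in code or configuration files - it is fetched at
    # runtime and can be rotated without redeploying the application.
    response = secrets.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


creds = get_db_credentials()
# connect_to_database(user=creds["username"], password=creds["password"])  # illustrative
```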
Lastly, we should embrace a zero-trust mindset. We should always assume breach, verify any request coming from any customer over any medium (mobile, public Internet, Wi-Fi, etc.), authenticate every request, and grant each customer only the privileges required to access our applications and act, following the principle of least privilege.
Training, training, and training
On Day 1, it may be acceptable for developers and operations teams to make mistakes while taking their first steps in the public cloud.
To move to Day 2 cloud operations, organizations need to invest heavily in employee training.
Encouraging a culture of experimentation, opening cloud environments for employee training, and taking advantage of the many online courses and the publicly available cloud documentation will allow both developers and operations teams to gain confidence in using cloud services.
As more and more organizations begin to use more than a single cloud provider (not necessarily a full multi-cloud environment, but more than a single vendor), employees need hands-on experience with several cloud platforms, each offering different services and capabilities. The best way to achieve this is to train and gain experience working with the different platforms.
Summary
It is time for organizations to move from the Day 1 cloud operations phase (initial application deployment and configuration) to the Day 2 cloud operations phase (fine-tuning and ongoing maintenance).
It is a change in mindset, but it is crucial for maintaining production applications in the modern, cloud-native era.
About the author
Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.
You can connect with him on social media (https://linktr.ee/eyalestrin).
Opinions are his own and not the views of his employer.