<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: vandana.platform</title>
    <description>The latest articles on DEV Community by vandana.platform (@vandana_platform).</description>
    <link>https://dev.to/vandana_platform</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3813637%2F9b5d2d7c-b636-44a5-8e65-002552d2b5ac.png</url>
      <title>DEV Community: vandana.platform</title>
      <link>https://dev.to/vandana_platform</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vandana_platform"/>
    <language>en</language>
    <item>
      <title>Most Teams Think They Have CI/CD. They Don’t.</title>
      <dc:creator>vandana.platform</dc:creator>
      <pubDate>Tue, 07 Apr 2026 02:51:33 +0000</pubDate>
      <link>https://dev.to/vandana_platform/most-teams-think-they-have-cicd-they-dont-12j6</link>
      <guid>https://dev.to/vandana_platform/most-teams-think-they-have-cicd-they-dont-12j6</guid>
      <description>&lt;p&gt;Most teams say they have CI/CD.&lt;/p&gt;

&lt;p&gt;But if someone is still SSH-ing into a server and running Docker commands manually, the system is not truly automated.&lt;/p&gt;

&lt;p&gt;Most teams automate steps.&lt;br&gt;&lt;br&gt;
Very few automate the system.&lt;/p&gt;

&lt;p&gt;This is where the gap exists.&lt;/p&gt;

&lt;p&gt;This is not a theory post — this is based on a real working lab setup.&lt;/p&gt;

&lt;p&gt;This article breaks down how GitHub Actions actually works — using both a real-world analogy and a technical perspective — based on a hands-on lab where a Dockerized nginx application is deployed to an EC2 instance on AWS.&lt;/p&gt;




&lt;h1&gt;
  
  
  🧠 Real-World View
&lt;/h1&gt;

&lt;p&gt;Think of GitHub Actions like a diligent assistant who watches your mailbox.&lt;/p&gt;

&lt;p&gt;Your house (EC2 instance) sits inside a gated community (VPC), so only authorized people can access it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The house is on a street (public subnet)&lt;/li&gt;
&lt;li&gt;The main gate (Internet Gateway) is the only way in and out&lt;/li&gt;
&lt;li&gt;A traffic controller (route table) directs visitors correctly&lt;/li&gt;
&lt;li&gt;The front door lock (security group – ports 22 and 80) controls who can enter&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📬 What Happens When Code Changes?
&lt;/h2&gt;

&lt;p&gt;Every time a new letter arrives in your mailbox (code merged to main):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The assistant drives to your house (connects to EC2 via SSH)&lt;/li&gt;
&lt;li&gt;Picks up the old furniture (stops and removes old container)&lt;/li&gt;
&lt;li&gt;Brings in new furniture (pulls and runs new Docker image)&lt;/li&gt;
&lt;li&gt;Sends you a message when done (workflow success notification)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The assistant has a spare key (SSH key pair) stored in a secure lockbox (GitHub Secrets), so it can always access your house without asking you every time.&lt;/p&gt;

&lt;p&gt;👉 You never have to go there yourself.&lt;/p&gt;

&lt;p&gt;That’s what modern deployments should feel like:&lt;br&gt;&lt;br&gt;
repeatable, reliable, and hands-off.&lt;/p&gt;




&lt;h1&gt;
  
  
  🖼️ Real World → Technical Mapping
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowagz98msmb96c3epwel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowagz98msmb96c3epwel.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  ⚙️ Technical View
&lt;/h1&gt;

&lt;p&gt;Think of GitHub Actions like a senior DevOps engineer who automated repetitive work.&lt;/p&gt;




&lt;h2&gt;
  
  
  ❌ Before Automation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Engineers SSH into EC2 for every deployment&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;docker stop&lt;/code&gt;, &lt;code&gt;docker pull&lt;/code&gt;, and &lt;code&gt;docker run&lt;/code&gt; manually&lt;/li&gt;
&lt;li&gt;Repeat the same steps every time&lt;/li&gt;
&lt;li&gt;Risk outages due to small mistakes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ After GitHub Actions
&lt;/h2&gt;

&lt;p&gt;That same operational knowledge is now encoded in &lt;code&gt;deploy.yml&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The pipeline triggers automatically on push to &lt;code&gt;main&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A runner (an &lt;code&gt;ubuntu-latest&lt;/code&gt; VM) executes the steps&lt;/li&gt;
&lt;li&gt;The process is consistent and repeatable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 No manual intervention required&lt;/p&gt;




&lt;h1&gt;
  
  
  ⚙️ Infrastructure as Code (Terraform)
&lt;/h1&gt;

&lt;p&gt;The entire environment is provisioned using Infrastructure as Code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network boundary → VPC (&lt;code&gt;aws_vpc&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Network segment → Public Subnet (&lt;code&gt;aws_subnet&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;External access → Internet Gateway (&lt;code&gt;aws_internet_gateway&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Traffic routing → Route Table (&lt;code&gt;aws_route_table&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Firewall → Security Group (&lt;code&gt;aws_security_group&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Secure access → SSH Key Pair (&lt;code&gt;aws_key_pair&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Compute → EC2 Instance (&lt;code&gt;aws_instance&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
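
&lt;p&gt;As a rough sketch, the core of that Terraform looks like this. The CIDR ranges and resource names are illustrative, not the lab's actual code, and the key pair and EC2 instance are omitted for brevity:&lt;/p&gt;

```hcl
# Illustrative fragment only -- names and CIDRs are placeholders.
resource "aws_vpc" "lab" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.lab.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true
}

resource "aws_internet_gateway" "gw" {
  vpc_id = aws_vpc.lab.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.lab.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.gw.id
  }
}

resource "aws_security_group" "web" {
  vpc_id = aws_vpc.lab.id
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```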

&lt;p&gt;👉 If it’s not in version control, it doesn’t exist.&lt;/p&gt;




&lt;h1&gt;
  
  
  🔄 Deployment Flow
&lt;/h1&gt;

&lt;p&gt;In this lab setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Code is merged into main
&lt;/li&gt;
&lt;li&gt;GitHub Actions triggers automatically
&lt;/li&gt;
&lt;li&gt;Runner reads instructions from deploy.yml
&lt;/li&gt;
&lt;li&gt;Connects to EC2 via SSH using stored credentials
&lt;/li&gt;
&lt;li&gt;Stops and removes the old container
&lt;/li&gt;
&lt;li&gt;Pulls the latest Docker image
&lt;/li&gt;
&lt;li&gt;Runs the new container on port 80
&lt;/li&gt;
&lt;li&gt;Logs the result in workflow history
&lt;/li&gt;
&lt;/ol&gt;
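
&lt;p&gt;The flow above can be sketched as a workflow file. This is a minimal illustration rather than the lab's actual &lt;code&gt;deploy.yml&lt;/code&gt;: the secret names and image are placeholders, and the SSH step uses one commonly used community action (its pinned version is illustrative) as just one of several ways to run remote commands:&lt;/p&gt;

```yaml
# Illustrative deploy.yml -- placeholder names throughout.
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy over SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ubuntu
          key: ${{ secrets.EC2_SSH_KEY }}
          script: |
            docker stop web || true
            docker rm web || true
            docker pull nginx:latest
            docker run -d --name web -p 80:80 nginx:latest
```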

&lt;p&gt;👉 A manual process becomes a fully automated release pipeline&lt;/p&gt;




&lt;h1&gt;
  
  
  🔐 Security Controls
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;EC2 access restricted by security group (ports 22 and 80 only)&lt;/li&gt;
&lt;li&gt;SSH authentication uses a key pair (&lt;code&gt;aws_key_pair&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Private key and host stored securely in GitHub Secrets&lt;/li&gt;
&lt;li&gt;Deployment executed only through the workflow&lt;/li&gt;
&lt;li&gt;No direct manual access required during normal deployments&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🎯 Key Insight
&lt;/h1&gt;

&lt;p&gt;Terraform builds the system.&lt;br&gt;&lt;br&gt;
GitHub Actions runs the system.&lt;/p&gt;

&lt;p&gt;Together, they eliminate manual deployments.&lt;/p&gt;




&lt;h1&gt;
  
  
  🚀 Final Thought
&lt;/h1&gt;

&lt;p&gt;If your deployment still requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH&lt;/li&gt;
&lt;li&gt;Manual commands&lt;/li&gt;
&lt;li&gt;“That one person”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t have CI/CD.&lt;/p&gt;

&lt;p&gt;You have automation on top of manual work.&lt;/p&gt;




&lt;h1&gt;
  
  
  🎥 Related Video
&lt;/h1&gt;

&lt;p&gt;Watch the full GitHub Actions breakdown on Cloud AIOps Hub:&lt;/p&gt;

&lt;h2&gt;
  
  
  👉 &lt;a href="https://www.youtube.com/@CloudAIopsHub" rel="noopener noreferrer"&gt;https://www.youtube.com/@CloudAIopsHub&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;#devops #cicd #aws #terraform #docker #githubactions #beginners&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Automation Does Not Equal Compliance: The Gap I Noticed While Building My Platform Lab</title>
      <dc:creator>vandana.platform</dc:creator>
      <pubDate>Tue, 24 Mar 2026 01:32:12 +0000</pubDate>
      <link>https://dev.to/vandana_platform/automation-does-not-equal-compliance-the-gap-i-noticed-while-building-my-platform-lab-317g</link>
      <guid>https://dev.to/vandana_platform/automation-does-not-equal-compliance-the-gap-i-noticed-while-building-my-platform-lab-317g</guid>
      <description>&lt;h2&gt;
  
  
  The Uncomfortable Truth Most Teams Avoid
&lt;/h2&gt;

&lt;p&gt;You can have a fully automated pipeline, Terraform-managed infrastructure, Kubernetes running workloads, and GitHub Actions firing on every push, and still have no idea whether your environment is actually compliant with anything.&lt;/p&gt;

&lt;p&gt;That sentence should be uncomfortable. For a lot of teams, it is. And yet the prevailing assumption in most engineering organizations is that if it's automated, it must be under control.&lt;/p&gt;

&lt;p&gt;It is not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonkl4g01lfewu5oascxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonkl4g01lfewu5oascxn.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Lab Exposed
&lt;/h2&gt;

&lt;p&gt;I have been building a Cloud and Platform Engineering Lab designed to simulate enterprise-scale systems. Not a sandbox for tutorials. An actual attempt to reproduce the architectural complexity, operational drift, and governance pressure of a real production platform environment.&lt;/p&gt;

&lt;p&gt;What I expected to find: tooling gaps, performance edge cases, configuration quirks.&lt;/p&gt;

&lt;p&gt;What I did not expect to find: a consistent, systemic disconnect between automation maturity and compliance posture.&lt;/p&gt;

&lt;p&gt;Repo after repo in the lab had CI/CD pipelines. Most had some form of Infrastructure as Code. A few had Kubernetes manifests checked in and drift-detected. By surface-level metrics, these looked like healthy, modern engineering environments.&lt;/p&gt;

&lt;p&gt;But when I started asking harder questions, the answers were unsettling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are secrets being scanned before merge, or just after the damage is done?&lt;/li&gt;
&lt;li&gt;Does any of this IaC align with a defined security baseline, or did someone just run &lt;code&gt;terraform init&lt;/code&gt; and figure it out as they went?&lt;/li&gt;
&lt;li&gt;Is there a README that could survive an audit, or is it a three-line placeholder from two years ago?&lt;/li&gt;
&lt;li&gt;If this pipeline failed a compliance gate, would anyone know which gate, why, or what to do next?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answer, in most cases, was: not really.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Automation Creates a False Sense of Security
&lt;/h2&gt;

&lt;p&gt;Automation solves repeatability. That is its core value proposition. Run the same steps, in the same order, every time. It eliminates human error from execution.&lt;/p&gt;

&lt;p&gt;But compliance is not a problem of execution. It is a problem of posture, context, and intent.&lt;/p&gt;

&lt;p&gt;Consider the difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation answers:&lt;/strong&gt; Did the deployment succeed? Did the tests pass? Was the container built?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance asks:&lt;/strong&gt; Does this deployment introduce risk? Does the infrastructure reflect organizational policy? Is this system auditable?&lt;/p&gt;

&lt;p&gt;These are fundamentally different questions, and automation tooling is not designed to answer the second set. Yet the presence of CI/CD pipelines is often treated, implicitly, as evidence of maturity and control.&lt;/p&gt;

&lt;p&gt;That conflation is where the gap lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  What DevOps Practices Miss About Compliance Visibility
&lt;/h2&gt;

&lt;p&gt;Modern DevOps tooling is excellent at signaling operational health. Dashboards, alerts, pipeline statuses, SLOs. All of that is genuinely useful.&lt;/p&gt;

&lt;p&gt;What it rarely surfaces is compliance health. Not because the data does not exist, but because nobody has built the layer that connects engineering artifacts to compliance signals.&lt;/p&gt;

&lt;p&gt;Think about how compliance is typically handled today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A quarterly audit arrives&lt;/li&gt;
&lt;li&gt;Someone manually reviews pipelines, access controls, documentation&lt;/li&gt;
&lt;li&gt;Findings are captured in a spreadsheet&lt;/li&gt;
&lt;li&gt;Engineers scramble to address gaps&lt;/li&gt;
&lt;li&gt;Repeat in three months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a process failure. This is an architectural failure. The system was never designed to surface compliance posture continuously. It was designed to execute workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing the Concept of Compliance Signals
&lt;/h2&gt;

&lt;p&gt;A compliance signal is not a binary pass/fail check. It is an observable characteristic of an engineering environment that carries meaningful information about risk, maturity, or alignment with policy.&lt;/p&gt;

&lt;p&gt;The key word is "observable." Compliance signals are already present in the artifacts engineers produce every day. The problem is that nobody is reading them with compliance intent.&lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice across common signal categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD Presence and Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is there a pipeline at all?&lt;/li&gt;
&lt;li&gt;Does it include test stages, or just build and deploy?&lt;/li&gt;
&lt;li&gt;Are there branch protection rules that require review before merge?&lt;/li&gt;
&lt;li&gt;Is there evidence of security scanning integrated into the pipeline, or bolted on as an afterthought?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure as Code Usage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is infrastructure defined in code, or provisioned manually?&lt;/li&gt;
&lt;li&gt;Is the IaC versioned and peer-reviewed like application code?&lt;/li&gt;
&lt;li&gt;Are there policy-as-code tools like Checkov, tfsec, or OPA evaluating the templates before apply?&lt;/li&gt;
&lt;li&gt;Is there drift detection in place?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secrets Exposure Risk&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the repository have a secrets scanning integration enabled?&lt;/li&gt;
&lt;li&gt;Are there historical commits that contain credentials, tokens, or API keys, even revoked ones?&lt;/li&gt;
&lt;li&gt;Is there evidence of &lt;code&gt;.env&lt;/code&gt; files or hardcoded configuration values being checked in?&lt;/li&gt;
&lt;li&gt;Are secret references externalized to a vault or parameter store?&lt;/li&gt;
&lt;/ul&gt;
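
&lt;p&gt;Reading this signal mechanically can be as simple as a pattern scan. The sketch below is illustrative only: the patterns and file names are placeholders, and real scanners such as gitleaks or trufflehog also cover git history and entropy-based detection:&lt;/p&gt;

```python
import re

# Illustrative patterns only -- a real scanner covers far more cases.
SECRET_PATTERNS = [
    re.compile(r"aws_access_key_id\s*=", re.IGNORECASE),
    re.compile(r"password\s*="),
    re.compile(r"api[_-]?key\s*[:=]", re.IGNORECASE),
]

def scan_text(name, text):
    """Return (file, line_no, pattern) hits for one file's content."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for pat in SECRET_PATTERNS:
            if pat.search(line):
                hits.append((name, line_no, pat.pattern))
    return hits

sample = "debug = true\npassword = 'hunter2'\napi_key: abc123\n"
for finding in scan_text(".env", sample):
    print(finding)
```

&lt;p&gt;A few dozen lines like this, run across every repository, already turn "are secrets being checked in?" from a quarterly audit question into an observable signal.&lt;/p&gt;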

&lt;p&gt;&lt;strong&gt;Documentation Maturity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the README explain what this system does, who owns it, and how to run it?&lt;/li&gt;
&lt;li&gt;Is there an architecture decision record (ADR) trail?&lt;/li&gt;
&lt;li&gt;Is there runbook documentation that would survive a team rotation?&lt;/li&gt;
&lt;li&gt;Does documentation reference security controls, data classification, or dependency risk?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these signals require a new tool. They are present in existing repositories, configuration files, and pipeline definitions. What is missing is the read layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Efficient Use of LLMs: High-Signal Input, Low Noise
&lt;/h2&gt;

&lt;p&gt;When I started exploring how to analyze these signals at scale, the instinct was to throw everything at a model and ask it to reason over the full context. That is expensive, slow, and often produces verbose output that is hard to act on.&lt;/p&gt;

&lt;p&gt;The more useful approach is to be surgical about what you send.&lt;/p&gt;

&lt;p&gt;LLMs are genuinely good at a specific subset of compliance analysis tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interpreting ambiguous configuration patterns (is this a deliberate design choice or a gap?)&lt;/li&gt;
&lt;li&gt;Synthesizing partial evidence into a risk narrative&lt;/li&gt;
&lt;li&gt;Assessing documentation quality and identifying what is absent, not just what is present&lt;/li&gt;
&lt;li&gt;Detecting intent drift, where the code and the documentation no longer describe the same system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those tasks benefit from receiving focused, pre-filtered input rather than raw, unprocessed repository content.&lt;/p&gt;

&lt;p&gt;The practical approach looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract structured signals first.&lt;/strong&gt; Parse pipeline configuration files, scan for known patterns (e.g., &lt;code&gt;aws_access_key&lt;/code&gt;, &lt;code&gt;password =&lt;/code&gt;, absence of &lt;code&gt;.gitignore&lt;/code&gt; entries for &lt;code&gt;.env&lt;/code&gt;), check for file existence (README.md, CODEOWNERS, docs/).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build a structured signal summary.&lt;/strong&gt; Not the raw files. A normalized representation of what was found and what was not found.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Send the summary, not the source.&lt;/strong&gt; The model does not need to read 400 lines of Terraform. It needs to know that Terraform is present, there is no tfsec integration, and the state backend is local rather than remote.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ask specific, bounded questions.&lt;/strong&gt; "Based on these signals, what compliance risks are observable?" performs better than "analyze this repository for security issues."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
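
&lt;p&gt;Steps 2 through 4 together might look like this. The signal names and values are hypothetical; the point is that the model receives a normalized summary and a bounded question, not raw source:&lt;/p&gt;

```python
import json

# Hypothetical, pre-extracted signals for one repository -- in practice these
# come from parsing pipeline files, checking file existence, and pattern scans.
signals = {
    "ci_pipeline_present": True,
    "test_stage_present": False,
    "iac_tool": "terraform",
    "iac_policy_scan": None,        # no tfsec/Checkov/OPA integration found
    "state_backend": "local",       # not remote -- a notable risk signal
    "secrets_scanning_enabled": False,
    "readme_present": True,
    "readme_line_count": 3,
}

def build_prompt(repo_name, signals):
    """Send the normalized summary, not the raw source, with a bounded question."""
    summary = json.dumps(signals, indent=2)
    return (
        f"Repository: {repo_name}\n"
        f"Observed signals:\n{summary}\n\n"
        "Based on these signals, what compliance risks are observable? "
        "List each risk with the signal that supports it."
    )

print(build_prompt("platform-lab/service-a", signals))
```

&lt;p&gt;The prompt stays a few hundred tokens regardless of repository size, and every claim in the response can be traced back to a named signal.&lt;/p&gt;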

&lt;p&gt;This approach keeps token usage low and response quality high. More importantly, it keeps the human in the loop as an interpreter of findings, not a processor of noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Platform Engineering Angle: Governance Needs a Control Plane
&lt;/h2&gt;

&lt;p&gt;Platform engineering exists, in part, to abstract complexity away from product teams while maintaining organizational control over how systems are built and operated. The internal developer platform (IDP) is the mechanism for encoding those standards.&lt;/p&gt;

&lt;p&gt;But most IDPs today are delivery platforms. They make it easier to build and deploy. They do not make it easier to govern.&lt;/p&gt;

&lt;p&gt;This is not a criticism of the teams building those platforms. It reflects where investment has been directed. Delivery velocity has clear, measurable ROI. Compliance visibility is harder to quantify until something goes wrong.&lt;/p&gt;

&lt;p&gt;The gap I am describing here is the same gap between an IDP and a governance control plane. A governance control plane would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuously evaluate repositories and environments against defined compliance criteria&lt;/li&gt;
&lt;li&gt;Aggregate findings across an organization into a single, queryable posture view&lt;/li&gt;
&lt;li&gt;Surface risk-ranked findings to the teams responsible for addressing them&lt;/li&gt;
&lt;li&gt;Close the loop between audit findings and engineering remediation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not a new category of tool. It is a missing integration layer between existing tooling and organizational policy. The signals are there. The policy exists, in most organizations, in some form. The bridge between them is what is absent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Toward Self-Evaluating Systems
&lt;/h2&gt;

&lt;p&gt;The longer-term vision here is not a compliance dashboard. Dashboards require someone to look at them.&lt;/p&gt;

&lt;p&gt;What would be genuinely useful is an engineering environment that continuously evaluates its own compliance posture, surfaces observations in context, and makes it easier for engineers to close gaps before they become audit findings or incidents.&lt;/p&gt;

&lt;p&gt;This is not surveillance. It is structural self-awareness. The same way a well-instrumented application surfaces performance anomalies without requiring a human to check dashboards constantly, a well-instrumented platform should surface compliance anomalies without requiring a quarterly audit cycle.&lt;/p&gt;

&lt;p&gt;The signals already exist. The analysis capability exists. The missing piece is the integration architecture that connects them into a coherent posture view.&lt;/p&gt;

&lt;p&gt;That is the space I am building toward in the lab. The system I have started calling Komplora is an early exploration of exactly this problem: analyzing engineering environments at the repository level, detecting compliance signals, and producing structured, actionable posture assessments without requiring a manual audit process.&lt;/p&gt;

&lt;p&gt;It is early. But the signal detection layer is already producing useful observations.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Platform Engineers
&lt;/h2&gt;

&lt;p&gt;If you are building or operating a platform, compliance visibility should be a first-class concern, not a feature added in response to a failed audit.&lt;/p&gt;

&lt;p&gt;The practical starting point is not tooling. It is clarity on what compliance means for your organization, expressed as observable signals in engineering artifacts. Once you have that, the path to continuous visibility becomes an engineering problem rather than a process problem.&lt;/p&gt;

&lt;p&gt;And engineering problems, in this space, are ones we are genuinely equipped to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Question Worth Sitting With
&lt;/h2&gt;

&lt;p&gt;Most organizations can tell you their deployment frequency and mean time to recovery. Far fewer can tell you their compliance posture across their engineering estate on any given day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What would it take for your platform to answer that question continuously, not quarterly?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I am curious how others are thinking about this, especially those who have tried to close this gap at scale. What approaches have worked? What assumptions did you have to abandon?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this resonated, follow along. I will be sharing more observations from the platform lab as the work progresses.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;#devops #platformengineering #devsecops #cloud&lt;/p&gt;

</description>
      <category>automation</category>
      <category>cloud</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Designing a Platform Engineering Lab for Enterprise Cloud Architectures</title>
      <dc:creator>vandana.platform</dc:creator>
      <pubDate>Mon, 09 Mar 2026 01:53:53 +0000</pubDate>
      <link>https://dev.to/vandana_platform/designing-a-platform-engineering-lab-for-enterprise-cloud-architectures-4gig</link>
      <guid>https://dev.to/vandana_platform/designing-a-platform-engineering-lab-for-enterprise-cloud-architectures-4gig</guid>
      <description>&lt;p&gt;&lt;strong&gt;Most engineers spend years learning tools.&lt;br&gt;&lt;br&gt;
Fewer engineers spend time practicing how large systems are actually designed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern cloud environments are no longer just collections of infrastructure resources. They are &lt;strong&gt;complex, evolving platforms&lt;/strong&gt; that must support &lt;strong&gt;distributed systems, AI workloads, governance models, and long-term operational stability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To better understand how these systems evolve, I began designing a &lt;strong&gt;controlled platform engineering lab&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The purpose of this lab is not simply to deploy applications or test individual tools. Instead, it is designed to simulate how &lt;strong&gt;enterprise cloud architectures&lt;/strong&gt; and &lt;strong&gt;platform systems&lt;/strong&gt; evolve over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Build a Platform Engineering Lab&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In smaller environments, cloud infrastructure often grows organically.&lt;/p&gt;

&lt;p&gt;Teams deploy services, automate infrastructure, integrate monitoring tools, and gradually build &lt;strong&gt;CI/CD pipelines&lt;/strong&gt;. At small scale, this works well.&lt;/p&gt;

&lt;p&gt;However, as organizations grow, &lt;strong&gt;infrastructure complexity grows with it&lt;/strong&gt;. Without clear architectural boundaries, environments begin to suffer from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Operational coupling between teams&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inconsistent infrastructure standards&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fragmented monitoring and observability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security policies applied unevenly&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Difficulty scaling data and AI workloads&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where &lt;strong&gt;platform engineering practices&lt;/strong&gt; become essential.&lt;/p&gt;

&lt;p&gt;A platform is not just infrastructure.&lt;/p&gt;

&lt;p&gt;A platform is a &lt;strong&gt;set of systems that enable teams to build, deploy, observe, and operate workloads reliably at scale.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Platform Systems Instead of a Single Cloud Environment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many cloud environments initially operate as a &lt;strong&gt;single operational domain&lt;/strong&gt; where infrastructure, networking, delivery pipelines, monitoring systems, and security controls evolve together.&lt;/p&gt;

&lt;p&gt;This model works at small scale but becomes fragile as complexity increases.&lt;/p&gt;

&lt;p&gt;Enterprise environments tend to evolve differently.&lt;/p&gt;

&lt;p&gt;Instead of one large environment, mature architectures organize capabilities into &lt;strong&gt;independent platform systems&lt;/strong&gt;, each responsible for its own lifecycle and operational standards.&lt;/p&gt;

&lt;p&gt;Typical platform systems include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Application Platforms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Networking Platforms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Platforms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DevOps / Delivery Platforms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability Platforms&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security Platforms&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each system evolves independently while still operating under a &lt;strong&gt;shared governance model and centralized control plane&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This separation provides several long-term advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced operational coupling between teams&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clear ownership boundaries for platform capabilities&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent infrastructure standards across environments&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stronger policy enforcement and governance models&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Greater scalability for cloud and AI workloads&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture of the Platform Engineering Lab&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The engineering lab is structured to simulate how &lt;strong&gt;platform layers interact inside enterprise environments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Rather than focusing on isolated tools, the lab models &lt;strong&gt;platform architecture patterns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A simplified view of the environment looks like this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform Engineering Lab Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6hc0mn2xl5f4g7ms6k0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6hc0mn2xl5f4g7ms6k0.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local Engineering Environment&lt;/strong&gt;&lt;br&gt;
↓&lt;br&gt;
&lt;strong&gt;Infrastructure as Code Layer&lt;/strong&gt;&lt;br&gt;
↓&lt;br&gt;
&lt;strong&gt;Cloud Environments / Accounts&lt;/strong&gt;&lt;br&gt;
↓&lt;br&gt;
&lt;strong&gt;Kubernetes Platform Layer&lt;/strong&gt;&lt;br&gt;
↓&lt;br&gt;
&lt;strong&gt;Observability and Security Systems&lt;/strong&gt;&lt;br&gt;
↓&lt;br&gt;
&lt;strong&gt;AI / ML Infrastructure Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This layered structure allows experimentation with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Platform governance models&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation patterns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability engineering practices&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed system behavior&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Areas Being Explored&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The lab environment focuses on several key areas of &lt;strong&gt;modern platform design&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Multi-Cloud Operating Models&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Large organizations rarely operate a single cloud account or environment. Instead, they manage &lt;strong&gt;multiple accounts, environments, and sometimes multiple cloud providers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The lab explores how &lt;strong&gt;infrastructure governance and operational standards&lt;/strong&gt; can be maintained across these distributed environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Kubernetes-Native Platform Architectures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Container orchestration platforms have become foundational to modern application platforms.&lt;/p&gt;

&lt;p&gt;This lab explores how &lt;strong&gt;Kubernetes clusters act as platform substrates&lt;/strong&gt;, enabling &lt;strong&gt;application deployment, policy enforcement, and operational observability.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Infrastructure Standardization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Infrastructure-as-code enables organizations to &lt;strong&gt;standardize how infrastructure is provisioned and maintained.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The lab focuses on modeling &lt;strong&gt;reusable infrastructure patterns and automation pipelines&lt;/strong&gt; that maintain consistency across environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Observability and Reliability Systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As distributed systems grow, &lt;strong&gt;observability becomes critical.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The environment explores how &lt;strong&gt;monitoring, logging, tracing, and reliability engineering practices&lt;/strong&gt; can be integrated into platform systems from the beginning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AI and ML Infrastructure Workloads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Modern cloud platforms must increasingly support &lt;strong&gt;AI and machine learning workloads.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Model training pipelines, inference services, and &lt;strong&gt;GPU-intensive workloads&lt;/strong&gt; introduce new operational constraints that traditional cloud environments were not originally designed to handle.&lt;/p&gt;

&lt;p&gt;The lab explores how platform infrastructure interacts with AI workloads, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Workload isolation strategies&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU-aware scheduling patterns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed inference architectures&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Governance models for AI infrastructure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
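
&lt;p&gt;The first two items above can be sketched in a single pod spec. This assumes GPU nodes are labeled and tainted by the platform and run the NVIDIA device plugin; the image and names are placeholders, not part of the lab.&lt;/p&gt;

```yaml
# Sketch of a GPU inference pod: the resource limit drives GPU-aware
# scheduling, while the taint/toleration pair keeps general workloads
# off expensive GPU nodes (workload isolation).
apiVersion: v1
kind: Pod
metadata:
  name: inference-demo            # hypothetical name
spec:
  nodeSelector:
    accelerator: nvidia-gpu       # illustrative node label set by the platform
  tolerations:
    - key: nvidia.com/gpu         # matches a taint applied to GPU nodes
      operator: Exists
      effect: NoSchedule
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1       # requires the NVIDIA device plugin
```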

&lt;h2&gt;&lt;strong&gt;Platform Maturity Requires System Thinking&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;A common misconception in cloud engineering is that maturity comes from &lt;strong&gt;adopting more tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In reality, &lt;strong&gt;platform maturity comes from how systems are designed and governed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most resilient environments are not those with the largest number of services deployed. They are environments where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;System boundaries are clearly defined&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Platform capabilities evolve independently&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Governance models guide infrastructure behavior&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational ownership is clearly understood&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This engineering lab is an attempt to explore these &lt;strong&gt;architectural patterns&lt;/strong&gt; and better understand how &lt;strong&gt;platform systems interact at scale.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;What Comes Next&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;This platform lab will continue evolving to explore several areas of &lt;strong&gt;enterprise platform architecture&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Platform control planes and governance models&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability systems for distributed workloads&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-cloud platform operating models&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure domain modeling for reliability engineering&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Infrastructure support for AI and ML workloads&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not experimentation alone.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;practicing platform architecture intentionally — even before production systems demand it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the engineers who design scalable systems are rarely the ones who only learned tools.&lt;/p&gt;

&lt;p&gt;They are the ones who learned to &lt;strong&gt;design platforms.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Platform engineering focuses on designing systems, not just deploying infrastructure&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mature cloud environments require clear platform system boundaries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Independent platform systems improve scalability and operational ownership&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI workloads introduce new constraints that traditional cloud platforms must adapt to&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Practicing platform architecture helps develop stronger systems thinking&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;&lt;strong&gt;Discussion&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How are other engineers structuring internal platform environments or architecture labs to simulate enterprise cloud systems?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’d be interested to hear how different teams approach &lt;strong&gt;platform system boundaries and governance models.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;#PlatformEngineering #EnterpriseArchitecture #CloudArchitecture #AIInfrastructure #CloudStrategy #DistributedSystems #PrincipalEngineer #StaffEngineer #DevOps #MLOps #AIOps&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>cloud</category>
      <category>devops</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
