<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mallagari Sri Datta</title>
    <description>The latest articles on DEV Community by Mallagari Sri Datta (@mallagari_sri_datta).</description>
    <link>https://dev.to/mallagari_sri_datta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1733004%2F9f090204-e627-452f-86d4-9252e6a22b52.jpg</url>
      <title>DEV Community: Mallagari Sri Datta</title>
      <link>https://dev.to/mallagari_sri_datta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mallagari_sri_datta"/>
    <language>en</language>
    <item>
      <title>Avoid Vulnerable Build: Light onto Cryptographic Source Code Security</title>
      <dc:creator>Mallagari Sri Datta</dc:creator>
      <pubDate>Tue, 31 Mar 2026 20:18:15 +0000</pubDate>
      <link>https://dev.to/mallagari_sri_datta/avoid-vulnerable-build-light-onto-cryptographic-source-code-security-40hg</link>
      <guid>https://dev.to/mallagari_sri_datta/avoid-vulnerable-build-light-onto-cryptographic-source-code-security-40hg</guid>
      <description>&lt;p&gt;In the modern DevOps and SRE ecosystem, we have grown dangerously comfortable outsourcing our trust. We lock down our production environments, enforce Zero Trust networking, and implement rigorous CI/CD pipelines, yet we blindly trust the platform hosting our source code. We assume that if the code hosting platform (the "forge") displays a green checkmark, the commit is safe.&lt;/p&gt;

&lt;p&gt;But what happens when the forge is compromised?&lt;/p&gt;

&lt;p&gt;A new approach, spearheaded by incubating projects within the Open Source Security Foundation (OpenSSF), fundamentally rethinks supply chain security. By shifting policy enforcement away from centralized SaaS platforms and embedding it directly into the version control protocol, platform engineers are creating truly immutable, cryptographically verified supply chains. Here is a deep-dive guide to decentralized source code trust and how to architect it.&lt;/p&gt;

&lt;p&gt;The Core Vulnerability: The Forge as a Single Point of Failure&lt;br&gt;
The fundamental flaw in modern source code management is that the forge acts as the ultimate, centralized arbiter of security.&lt;/p&gt;

&lt;p&gt;While Git utilizes Merkle trees (SHA hashes) to guarantee the mathematical integrity of files, it possesses no native mechanism for distributing and verifying the public keys of the humans writing the code. We rely entirely on the forge's UI and API to enforce merge policies, manage maintainer identities, and restrict branch access.&lt;/p&gt;

&lt;p&gt;If an advanced persistent threat (APT) bypasses the forge's perimeter—or if a critical API bug is exploited—the attacker can inject malicious code directly into the repository. Because downstream CI/CD pipelines implicitly trust the forge, they will automatically build and deploy the compromised payload. Real-world incidents, such as the Juniper VPN backdoor and the Trivia tag-overwrite attack, have already demonstrated how compromising repository infrastructure allows attackers to seamlessly manipulate release tags and inject vulnerabilities.&lt;/p&gt;

&lt;p&gt;Shifting to Decentralized, Client-Side Trust&lt;br&gt;
To eliminate this single point of failure, we must decouple security policies from the hosting provider. The emerging standard for this is an architecture inspired by The Update Framework (TUF), exemplified by the gittuf project.&lt;/p&gt;

&lt;p&gt;Instead of relying on the forge's database to dictate who can merge code, the security policies, access rules, and public keys are stored inside the repository itself, specifically within Git's hidden refs/ namespace.&lt;/p&gt;

&lt;p&gt;This seemingly simple change completely inverts the security model:&lt;br&gt;
When a developer executes a git pull or git fetch, their local machine—or the CI runner—independently verifies the cryptographic signatures of the incoming commits against the security policy embedded in the repository. If the hosting platform was compromised and an unauthorized commit was injected, the client-side verification will immediately detect the policy violation and block the code.&lt;/p&gt;

&lt;p&gt;Architecture of Immutable Trust: To achieve this level of decentralized security, the architecture relies on four foundational pillars:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Root Metadata (The Anchor): The ultimate root of trust is a metadata file signed by the project's core owners. This file defines the highest-level cryptographic identities and establishes the foundational rules for the repository.&lt;/li&gt;
&lt;li&gt;The Primary Rule File: This acts as the access control list (ACL) as code. It defines granular rules for specific namespaces, dictating exactly which cryptographic identities are authorized to modify specific branches, release tags, or even individual directories.&lt;/li&gt;
&lt;li&gt;Cryptographic Identity Agnosticism: Modern architectures must support a hybrid of identities. While legacy systems rely on SSH and GPG keys, decentralized trust models seamlessly integrate with ephemeral, identity-based mechanisms like Sigstore and OIDC (OpenID Connect).&lt;/li&gt;
&lt;li&gt;The Reference State Log (RSL): This is arguably the most powerful mechanism. The RSL is an append-only hash chain that acts as an immutable ledger of all repository activity, including code merges and policy updates. Because it is a mathematically linked chain stored within the repository, any attempt by an attacker to silently rewrite history or delete forensic logs is instantly detectable by any client holding a previous state of the log.&lt;/li&gt;
&lt;/ol&gt;
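&lt;p&gt;The tamper-evidence of the RSL comes from plain hash chaining. Here is a minimal Python sketch, not a real implementation, showing why any silent rewrite of history is detectable by a client that re-walks the chain:&lt;/p&gt;

```python
import hashlib

def entry_hash(prev_hash, payload):
    """Hash a log entry together with the hash of the previous entry."""
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()

def append(log, payload):
    """Append an entry whose hash commits to the entire prior history."""
    prev = log[-1][0] if log else "0" * 64
    log.append((entry_hash(prev, payload), payload))

def verify(log):
    """Re-walk the chain; a rewritten entry breaks every later link."""
    prev = "0" * 64
    for digest, payload in log:
        if digest != entry_hash(prev, payload):
            return False
        prev = digest
    return True

log = []
append(log, "merge: feature-x into main")
append(log, "policy: rotate maintainer key")
assert verify(log)

# An attacker silently rewriting the first entry invalidates the chain:
log[0] = (log[0][0], "merge: backdoor into main")
assert not verify(log)
```

A client only needs to remember the last digest it saw: if the server later serves a log whose chain does not reproduce that digest, history was rewritten.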

&lt;p&gt;Implementing decentralized trust unlocks several advanced, high-assurance operational strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;M-of-N Threshold Signatures: To neutralize the "rogue maintainer" threat or the compromise of a single administrator's credentials, repositories can enforce thresholding. A policy can dictate that merging code to the main branch or altering an access rule mathematically requires cryptographic signatures from at least two out of five authorized maintainers.&lt;/li&gt;
&lt;li&gt;Scoped Cryptographic Delegations: Trust in large ecosystems is rarely binary. Decentralized models allow for granular delegation. A core maintainer can cryptographically delegate authority over a specific namespace (e.g., the /docs folder or a specific microservice directory) to a contributor without granting them overarching repository access.&lt;/li&gt;
&lt;li&gt;Transparent Developer Experience: By storing all metadata in the Git refs/ namespace and utilizing custom remote transport helpers, the entire verification process happens invisibly in the background. Standard developers can continue running git push and git pull without changing their daily workflows, while the security guarantees are enforced silently.&lt;/li&gt;
&lt;/ol&gt;
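&lt;p&gt;The M-of-N threshold rule above can be sketched in a few lines of Python. This is an illustrative toy that assumes hypothetical signer identifiers in place of real cryptographic signature verification:&lt;/p&gt;

```python
import operator

# Hypothetical key identifiers; a real policy carries full public keys.
AUTHORIZED = {"alice", "bob", "carol", "dave", "erin"}
THRESHOLD = 2  # a 2-of-5 policy for merges to main

def merge_allowed(signers):
    """Allow a merge only if at least THRESHOLD distinct authorized
    maintainers signed it; unknown signers are simply ignored."""
    valid = set(signers).intersection(AUTHORIZED)
    return operator.ge(len(valid), THRESHOLD)

assert merge_allowed(["alice", "bob"])          # 2 of 5: allowed
assert not merge_allowed(["alice", "alice"])    # duplicates count once
assert not merge_allowed(["mallory", "alice"])  # unknown signer ignored
```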

&lt;p&gt;True Zero Trust means acknowledging that even your most critical SaaS providers can be compromised. By moving security policy enforcement out of the centralized web UI and embedding it directly into the fabric of the version control system, engineering teams can guarantee the integrity of the software supply chain from the developer's workstation all the way to the production cluster.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>git</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Engineer's Guide to Surviving Global Cyber Compliance: Unpacking the OSPS Baseline</title>
      <dc:creator>Mallagari Sri Datta</dc:creator>
      <pubDate>Tue, 31 Mar 2026 20:06:56 +0000</pubDate>
      <link>https://dev.to/mallagari_sri_datta/engineers-guide-to-surviving-global-cyber-compliance-unpacking-the-osps-baseline-1fa</link>
      <guid>https://dev.to/mallagari_sri_datta/engineers-guide-to-surviving-global-cyber-compliance-unpacking-the-osps-baseline-1fa</guid>
      <description>&lt;p&gt;For years, open-source maintainers and platform engineers have operated under an unspoken social contract: we build the infrastructure of the internet, and you use it at your own risk.&lt;/p&gt;

&lt;p&gt;Today, that contract is being torn up by international regulators.&lt;/p&gt;

&lt;p&gt;With a 44% year-over-year increase in the exploitation of public-facing applications and the cost of cybercrime projected to hit $10.5 trillion annually, global legislation is radically shifting the landscape. We are moving from a fragmented, voluntary security culture into an era of strict, punitive frameworks like the EU’s Cyber Resilience Act (CRA), NIS2, and DORA.&lt;/p&gt;

&lt;p&gt;For senior engineers, platform architects, and open-source maintainers, this regulatory wave feels like a looming administrative nightmare. However, an architectural Rosetta Stone has emerged to solve this: the OpenSSF OSPS (Open Source Project Security) Baseline.&lt;/p&gt;

&lt;p&gt;Here is the definitive breakdown of how the OSPS Baseline abstracts away the legal chaos, providing a unified engineering framework to secure your supply chain without assuming commercial liability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Core Problem: The Legislative Wall&lt;/em&gt;&lt;br&gt;
Currently, 26% of organizations view cyber regulations negatively, primarily because they struggle to ensure third-party and open-source vendor compliance. The legislation driving this panic includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;NIS2: Impacts 18 critical sectors (from energy to healthcare), indirectly forcing enterprises to secure their entire open-source supply chain to guarantee service continuity.&lt;/li&gt;
&lt;li&gt;DORA (Digital Operational Resilience Act): Imposes strict digital resilience and third-party risk management requirements specifically on the financial sector.&lt;/li&gt;
&lt;li&gt;Cyber Resilience Act (CRA): This is the most disruptive. It mandates "Security by Design" and "Security by Default," but critically, it attempts to place strict legal and financial liability on the "manufacturer" (the entity placing the product on the market) for all components used—including open-source libraries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because CNCF and OSS projects power the world's critical infrastructure, enterprise consumers are passing these regulatory burdens upstream, burying maintainers in endless, disparate security questionnaires.&lt;/p&gt;

&lt;p&gt;The OSPS Baseline Architecture: released to bridge the gap between developers and regulators, the OSPS Baseline isn't just another arbitrary standard; it is a highly prescriptive mapping tool. It translates vague legal requirements into strict engineering realities.&lt;/p&gt;

&lt;p&gt;The baseline is structured around practical execution:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;40 Mandatory Requirements:&lt;/em&gt; The baseline entirely rejects the ambiguous word "should" in favor of strict "must" controls, ensuring that every required action has a measurable impact on the project's security posture.&lt;br&gt;
&lt;em&gt;3 Maturity Levels:&lt;/em&gt; It scales from Level 1 (Basic Hygiene), to Level 2 (Standardized), up to Level 3 (High Assurance).&lt;br&gt;
&lt;em&gt;8 Critical Areas:&lt;/em&gt; The framework maps directly to engineering workflows: Access control, build/release, documentation, governance, legal, quality, security assessment, and vulnerability management.&lt;/p&gt;

&lt;p&gt;The true power of the OSPS Baseline lies in its strategic application. Here are the elite takeaways for navigating this new era:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The "One-to-Many" Compliance Hack: You don't have the engineering cycles to map your CI/CD pipeline to 50 different international laws. The OSPS Baseline acts as a multiplexer. By satisfying a single technical OSPS requirement—such as generating a cryptographic Software Bill of Materials (SBOM)—your project simultaneously checks the compliance boxes for the EU CRA, the US NIST SSDF, the NIST CSF, and Open Chain. Write the pipeline once, and the baseline translates it into global legal compliance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Liability Shield (Maintainer vs. Manufacturer)&lt;br&gt;
There is a massive legal "moat" that OSS maintainers must understand. Under regulations like the CRA, open-source maintainers are not considered "manufacturers" or "economic operators," meaning they do not bear financial or legal liability for the software.&lt;br&gt;
However, downstream commercial users do bear that liability. The strategy is to use the OSPS Baseline to provide voluntary, machine-readable signals of your security posture. By adopting the baseline, you hand enterprise users the exact due-diligence checklist they need to pass their audits, building immense trust and adoption, all while explicitly stating via disclaimers that you assume no commercial liability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Moving from "Trust Me" to Evidence-Based Trust&lt;br&gt;
The era of putting a "security.md" file in your repo and asking users to trust you is over. The future of operations relies on machine-readable attestations. The OSPS framework is actively driving toward a future of automated evaluation, where your project's compliance with these 40 requirements is continuously verified and broadcasted to downstream consumers via automated tooling.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
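&lt;p&gt;The "one-to-many" idea is essentially a crosswalk data structure. A toy Python sketch, using an invented control ID rather than the real OSPS identifiers, makes the multiplexer concrete:&lt;/p&gt;

```python
# A toy crosswalk; real OSPS control IDs and legal clause numbers differ.
CROSSWALK = {
    "generate-signed-sbom": ["EU CRA", "NIST SSDF", "NIST CSF", "OpenChain"],
}

def frameworks_satisfied(control):
    """Satisfying one technical control checks several regulatory boxes."""
    return CROSSWALK.get(control, [])

# One pipeline task fans out to four compliance regimes:
assert len(frameworks_satisfied("generate-signed-sbom")) == 4
assert frameworks_satisfied("unknown-control") == []
```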

&lt;p&gt;Global cyber compliance is no longer just a problem for the legal department; it is a distributed systems engineering challenge.&lt;/p&gt;

&lt;p&gt;By adopting the OpenSSF OSPS Baseline, you stop treating security mandates as chaotic, disjointed chores. Instead, you integrate them into a unified, actionable framework. You protect your team from legal ambiguity, drastically reduce the toil of enterprise security audits, and guarantee that your architecture is resilient enough to power the next generation of critical infrastructure.&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>devops</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>The Next Frontier of SRE: Agentic Operations and Immutable Trust</title>
      <dc:creator>Mallagari Sri Datta</dc:creator>
      <pubDate>Tue, 31 Mar 2026 19:51:37 +0000</pubDate>
      <link>https://dev.to/mallagari_sri_datta/the-next-frontier-of-sre-agentic-operations-and-immutable-trust-53hk</link>
      <guid>https://dev.to/mallagari_sri_datta/the-next-frontier-of-sre-agentic-operations-and-immutable-trust-53hk</guid>
      <description>&lt;p&gt;As cloud-native architectures scale, the sheer complexity of microservices, service meshes, and deployment pipelines has created an unsustainable operational burden. To scale to the next order of magnitude, the industry must fundamentally rethink how infrastructure is operated and secured.&lt;/p&gt;

&lt;p&gt;We are entering an era defined by two massive paradigm shifts: Agentic Infrastructure and Decentralized, Cryptographic Trust. This is a guide to navigating that frontier, moving beyond traditional GitOps and perimeter security into a truly autonomous, zero-trust operational model.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Agentic Operations Paradigm&lt;/em&gt;&lt;br&gt;
For the past decade, the cloud-native ecosystem has experienced explosive growth, expanding to over ten million users. However, this growth has introduced a massive barrier: the "YAML wall". SREs spend an exorbitant amount of time navigating fragmented tools, reading complex documentation, and manually writing declarative configuration files. To reach the next ten million users, we must abstract this complexity using Agentic AI.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Intent-Based Infrastructure via MCP: The future of infrastructure management relies on the Model Context Protocol (MCP). MCP servers act as bridges between AI models and infrastructure tools (like deployment orchestrators or source control systems). Instead of manually writing YAML configurations, SREs define intent using natural language—such as requesting an HTTP route to a frontend service—and the AI agent translates that intent into the necessary configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Human-in-the-Loop GitOps: Handing the keys over to an AI agent sounds like an operational nightmare, but the architecture solves this by integrating directly with GitOps workflows. When an agent formulates a change, it does not apply it directly to the live cluster. Instead, it automatically generates a Pull Request (PR) for human review. This maintains reliability and governance, ensuring that autonomous changes are strictly audited before they are merged and synced into the production environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
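&lt;p&gt;The human-in-the-loop flow above can be sketched as follows. This is a minimal illustration with hypothetical function and field names, not a real MCP client API: the agent only ever emits a reviewable change set, never a live apply:&lt;/p&gt;

```python
def translate_intent(intent):
    """Pretend-translation of a natural-language intent into declarative
    config; a real agent would call an MCP server for this step."""
    return {
        "kind": "HTTPRoute",
        "metadata": {"name": intent["service"] + "-route"},
        "spec": {"backend": intent["service"], "port": intent["port"]},
    }

def propose_change(intent):
    """Wrap the generated config in a pull request for human review
    instead of applying it directly to the live cluster."""
    return {
        "action": "open_pull_request",
        "branch": "agent/" + intent["service"] + "-route",
        "files": {"routes/" + intent["service"] + ".yaml": translate_intent(intent)},
        "requires_human_approval": True,
    }

pr = propose_change({"service": "frontend", "port": 8080})
assert pr["action"] == "open_pull_request"
assert pr["requires_human_approval"]
```

The key design point is that `propose_change` has no code path that mutates the cluster; the GitOps sync happens only after a human merges the PR.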

&lt;p&gt;The goal of modern platform engineering is no longer to learn and maintain dozens of fragmented CLI tools. The strategic advantage lies in building an ecosystem of "agentic skills" and open MCP servers. By standardizing how agents interact with clusters, you shift the SRE role from writing configuration syntax to governing autonomous, policy-bound workflows.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Decentralized Trust &amp;amp; The Immutable Source&lt;/em&gt;&lt;br&gt;
As we automate operations, the security of the underlying source code becomes the ultimate bottleneck. Modern security architectures often rely entirely on the platform hosting the code (the "forge") to enforce access controls. This creates a catastrophic vulnerability.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Forge as a Single Point of Failure: Treating your code hosting platform as the ultimate arbiter of truth is a flawed security model. If the platform's UI or API is bypassed—or if an attacker compromises the infrastructure itself—malicious code can be injected seamlessly. Downstream pipelines will pull this code, trusting the platform's "green checkmark," leading to severe supply chain attacks like unauthorized tag overwrites or hidden backdoors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inverting the Security Model (Trust on First Use):&lt;br&gt;
The solution is to decouple trust from the hosting provider and embed it directly into the version control system. Advanced architectures achieve this by storing security policies, access rules, and public keys within hidden repository namespaces (such as Git's &lt;code&gt;refs/&lt;/code&gt; namespace). This shifts policy enforcement from the centralized server to the developer's local client. When an engineer executes a &lt;code&gt;pull&lt;/code&gt; or &lt;code&gt;fetch&lt;/code&gt;, their local machine independently verifies the cryptographic signatures against the embedded policy, instantly detecting if unauthorized code was merged by a compromised forge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Reference State Log (RSL):&lt;br&gt;
To guarantee absolute auditability, implement a Reference State Log (RSL). This is an append-only hash chain that records all repository activity, including policy changes and deployment approvals. Because it is an immutable chain stored within the repo, any attempt to rewrite history or delete activity logs is immediately detected by any client holding a previous state of the log.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;True Zero Privilege extends to your hosting providers. Implement M-of-N threshold signatures to ensure no single rogue maintainer or compromised admin account can unilaterally change a security policy or merge code. Utilize scoped delegations to grant contributors cryptographic authority over specific folders or branches without giving them full repository access. By treating the forge as untrusted infrastructure, you guarantee the integrity of your entire software supply chain at the mathematical level.&lt;/p&gt;

&lt;p&gt;The next evolution of SRE is defined by delegating execution to AI agents while simultaneously locking down the cryptographic integrity of the systems those agents interact with. By merging intent-driven automation with decentralized, client-verified trust, engineering organizations can scale their operations without sacrificing security or reliability.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>ai</category>
      <category>aiops</category>
      <category>security</category>
    </item>
    <item>
      <title>The Zero Privilege Paradigm: Definitive Guide to Immutable Security</title>
      <dc:creator>Mallagari Sri Datta</dc:creator>
      <pubDate>Tue, 31 Mar 2026 19:35:26 +0000</pubDate>
      <link>https://dev.to/mallagari_sri_datta/the-zero-privilege-paradigm-definitive-guide-to-immutable-security-25nc</link>
      <guid>https://dev.to/mallagari_sri_datta/the-zero-privilege-paradigm-definitive-guide-to-immutable-security-25nc</guid>
      <description>&lt;p&gt;In the world of Site Reliability Engineering (SRE) and platform architecture, we are taught the principle of "Least Privilege." We spend countless hours meticulously scoping IAM roles, configuring RBAC, and auditing permissions. But what if "Least Privilege" is fundamentally flawed because it still leaves privileges on the table?&lt;/p&gt;

&lt;p&gt;For the past three years, a quiet revolution has been taking place in the namespace-as-a-service ecosystem. By pushing access management to its absolute mathematical limit, platform engineers have pioneered a new standard: Zero Privilege Architecture. The results over a 36-month period running massive, enterprise-scale workloads speak for themselves—exactly zero security breaches and 100% platform uptime.&lt;/p&gt;

&lt;p&gt;Security and reliability are inextricably linked; a system cannot be considered truly reliable if it is vulnerable, nor can it be secure if it is constantly offline. Here is the definitive breakdown of how to architect a Zero Privilege platform, neutralize modern threats, and sleep soundly when you are on-call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Philosophy: The Production Floor is Lava&lt;/strong&gt;&lt;br&gt;
The foundational mantra of the Zero Privilege architecture requires a radical rewiring of how engineers view production: "Security is achieved not when there is nothing more to add, but when there is no credential left to take away".&lt;/p&gt;

&lt;p&gt;Think of modern IT infrastructure like a robotic automotive manufacturing plant. You would never allow a human to casually stroll across an active assembly line amidst swinging robotic arms—it compromises the quality of the car and introduces catastrophic safety risks. Zero Privilege mandates a "Zero Touch" production environment. No natural persons are allowed on the IT "production floor" during runtime.&lt;/p&gt;

&lt;p&gt;This philosophy is enforced through three uncompromising pillars:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The GitOps Iron Curtain (Desired State):&lt;/em&gt; Any change introduced to the system must originate from a strictly controlled, peer-reviewed CI/CD pipeline. Absolutely no single natural person is granted the ability to perform manual state changes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ruthless Ephemerality over Patching:&lt;/em&gt; Traditional enterprise IT relies on complex, risky patching cycles. Zero Privilege rejects live patching. If any component deviates from its declared state, it is not debugged live; it is immediately killed and redeployed. By ensuring that most running containers are between 0 and 30 days old, the window of vulnerability for any given exploit is drastically minimized.&lt;/p&gt;
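&lt;p&gt;The "kill and redeploy" rule is, at its core, a reconciliation loop over declared state. A minimal sketch, assuming components are identified by hypothetical image digests:&lt;/p&gt;

```python
# Declared (GitOps) state vs. observed live state; digests are made up.
desired = {"api": "sha256:aaa", "worker": "sha256:bbb"}
live = {"api": "sha256:aaa", "worker": "sha256:evil"}

def reconcile(desired, live):
    """List components whose live digest drifted from the declared state.
    Drifted components are not patched in place; they are replaced."""
    return sorted(name for name, digest in live.items()
                  if digest != desired.get(name))

assert reconcile(desired, live) == ["worker"]
```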

&lt;p&gt;&lt;em&gt;Policy as Code:&lt;/em&gt; Technical State Compliancy (TSM) and anomaly detection rules are managed entirely as code, continuously auditing the live environment against the single source of truth in the repository.&lt;/p&gt;

&lt;p&gt;Neutralizing Apex Threats&lt;br&gt;
When you entirely remove the ability for humans to log in, execute commands, or mutate state in production, lateral movement by adversaries becomes virtually impossible. Here is how Zero Privilege proactively neutralizes the most terrifying threats in the industry:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. Ransomware &amp;amp; State Mutation&lt;/em&gt;&lt;br&gt;
Ransomware operates on a simple premise: it requires elevated user access to encrypt or delete files. Under a Zero Privilege model, even the highest-level platform administrators have zero mutating verbs (e.g., create, update, delete) attached to their accounts. Because all mutations occur exclusively via pipeline-driven intent, a compromised admin credential is a blank cartridge. The ransomware payload simply lacks the mechanism to execute file changes.&lt;/p&gt;
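&lt;p&gt;The "zero mutating verbs" rule can itself be audited as policy-as-code. A minimal Python sketch over Kubernetes-style RBAC rules (the role definitions here are hypothetical examples):&lt;/p&gt;

```python
# Verbs that mutate cluster state; no human-bound role may carry these.
MUTATING = {"create", "update", "patch", "delete", "deletecollection"}

def violations(role_rules):
    """Return any mutating verbs found in a role's RBAC rules."""
    found = set()
    for rule in role_rules:
        found.update(set(rule["verbs"]).intersection(MUTATING))
    return found

# A compliant read-only admin role vs. a rogue role that can mutate state:
admin_role = [{"apiGroups": [""], "resources": ["pods"],
               "verbs": ["get", "list", "watch"]}]
rogue_role = [{"apiGroups": ["apps"], "resources": ["deployments"],
               "verbs": ["get", "update"]}]

assert violations(admin_role) == set()
assert violations(rogue_role) == {"update"}
```

Run as a CI gate against every role bound to a human identity, this check enforces the blank-cartridge property continuously rather than by convention.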

&lt;p&gt;&lt;em&gt;2. Third-Party Vendor Outages&lt;/em&gt; ("Overprivileged Software" Problem)&lt;br&gt;
We have seen global outages triggered by faulty updates from security vendors or third-party agents. Zero Privilege prevents this by treating vendor software with extreme paranoia. Software versions are strictly pinned, and automated upstream triggers are completely severed. No vendor update is permitted to mutate the state of the platform without being explicitly tested and pushed through the deployment pipeline.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3. Supply Chain &amp;amp; NPM Attacks&lt;/em&gt;&lt;br&gt;
Defending against compromised dependencies requires defense-in-depth. Beyond strict pipeline security scans, Zero Privilege relies on centralized frameworks and runtime anomaly detection. By deploying tools like Falco, the runtime environment actively checks for and immediately severs any connection attempts to non-reputable domains, stopping malicious packages from phoning home.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;4. Token Theft &amp;amp; Metadata Exploitation&lt;/em&gt;&lt;br&gt;
To mitigate the risk of compromised credentials, the platform enforces the use of short-lived tokens. To take it a step further into "alpha" territory, the architecture actively prevents metadata exploitation by restricting even platform administrators from viewing token duration limits.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;5. Container Orchestration Vulnerabilities&lt;/em&gt;&lt;br&gt;
In Kubernetes environments, permissions like nodes/proxy can allow an attacker to bypass boundaries and execute code in neighboring pods. Zero Privilege architects explicitly strip this permission from all users and deploy Admin Network Policies that directly block the Kubelet API port from unauthorized internal access.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Infrastructural Bedrock&lt;/em&gt;&lt;br&gt;
Beyond identity and access, the underlying network and orchestration layers must be hostile to unauthorized activity. Simplicity in code design and centralized frameworks enhance maintainability, ensuring that these protections are inherited by every application deployed on the platform.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Default Deny Networking&lt;/em&gt;: The platform employs draconian default network policies that drop all connectivity. Traffic is only permitted to flow if it is explicitly whitelisted and fundamentally required for a specific microservice to function.&lt;/p&gt;
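&lt;p&gt;In Kubernetes, this default-deny posture is a single small manifest (the namespace name here is hypothetical); explicit allow policies are then layered on top for each required flow:&lt;/p&gt;

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app        # hypothetical namespace
spec:
  podSelector: {}          # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress               # with no allow rules listed, all traffic is dropped
```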

&lt;p&gt;&lt;em&gt;Restricted Execution Contexts&lt;/em&gt;: All pods are forced to run under the most restrictive Security Context Constraints (SCC). This acts as a guarantee that no "high privilege" pods can be spun up, even if an attacker manages to bypass the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Manual On-Premise Egress&lt;/em&gt;: For platforms running on private clouds, all outbound (egress) traffic must clear manual firewall request validation. If an internal component is somehow compromised and attempts data exfiltration, it hits a brick wall at the network perimeter.&lt;/p&gt;

&lt;p&gt;The Mindset Shift&lt;br&gt;
Transitioning to a Zero Privilege architecture is as much a cultural shift as a technical one. It requires stripping away the comfortable "admin access" that operators have relied on for decades. However, by enforcing continuous validation, treating infrastructure as immutable, and funneling all mutations through code, organizations can achieve a state of operations where security and reliability are no longer competing priorities, but the exact same emergent property.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>security</category>
    </item>
    <item>
      <title>DevEx paradigm, it's called Backstage</title>
      <dc:creator>Mallagari Sri Datta</dc:creator>
      <pubDate>Sat, 28 Jun 2025 22:00:57 +0000</pubDate>
      <link>https://dev.to/mallagari_sri_datta/devex-paradigm-its-called-backstage-2dmk</link>
      <guid>https://dev.to/mallagari_sri_datta/devex-paradigm-its-called-backstage-2dmk</guid>
      <description>&lt;p&gt;The DevEx Revolution Is Here—And It’s Called Backstage&lt;/p&gt;

&lt;p&gt;If you’re a developer, you know the feeling. It’s that persistent friction that grinds you down every day. &lt;/p&gt;

&lt;p&gt;The endless search for the right repo. The ticket you filed last week for a new database that vanished into a black hole. The internal wiki with documentation so old it mentions technologies the company hasn’t used in years. This isn’t just annoying; it’s a silent tax on your organization’s productivity. Decades were spent optimizing user experience (UX), but we’ve left our own engineers to fend for themselves in a digital jungle of siloed tools. We need to talk about Developer Experience (DevEx), and that’s where Backstage comes in.&lt;/p&gt;

&lt;p&gt;Born at Spotify, Backstage is a shift in how we think about internal engineering platforms. It doesn’t replace your existing stack; it’s a single pane of glass that unifies it, a platform for building a seamless DevEx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Fog of War: “Where Is Anything?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Orgs grow to have hundreds, maybe thousands, of microservices, libraries, websites, and data pipelines. They live in Git, run on Kubernetes, and are monitored by a dozen different tools, making developers feel like they’ve been dropped into a foreign land without a map. Veteran devs hoard a fragile collection of bookmarks, one browser crash away from total amnesia.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Software Catalog&lt;/em&gt;&lt;br&gt;
This isn't a list; it's a living map of your entire tech ecosystem. The genius of the Catalog is that it doesn’t require manual entry. Instead, you place a simple catalog-info.yaml file in your Git repositories. Backstage automatically discovers these files and populates a rich, searchable catalog of all your software components.&lt;/p&gt;
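&lt;p&gt;A minimal &lt;code&gt;catalog-info.yaml&lt;/code&gt; looks roughly like this (the component and team names are invented for illustration):&lt;/p&gt;

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-service            # hypothetical component name
  description: Handles checkout payment flows
  annotations:
    github.com/project-slug: acme/payments-service
spec:
  type: service
  lifecycle: production
  owner: team-payments              # hypothetical owning team
```

Commit this file to the repo, and Backstage discovers it on its next catalog scan; no manual registration step is needed.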

&lt;p&gt;Through a powerful plugin system, the Catalog becomes a central dashboard for each component. Imagine clicking on a service and seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ownership: Who owns this? Who's on call? (via PagerDuty plugin)&lt;/li&gt;
&lt;li&gt;CI/CD: What are the latest build statuses? (via Jenkins, GitHub Actions, or Tekton plugins)&lt;/li&gt;
&lt;li&gt;Operations: How is it performing in production? (via ArgoCD, Datadog, or Grafana plugins)&lt;/li&gt;
&lt;li&gt;Project Management: What are the associated Jira tickets?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Waiting Game: The Agony of Ticket Purgatory&lt;/strong&gt;&lt;br&gt;
You need a new microservice. The process? Find the right team, file a ticket, justify your existence, and wait. And wait. And wait. By the time your request is fulfilled, the project’s momentum is gone.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Software Templates&lt;/em&gt;&lt;br&gt;
Backstage formalizes the concept of "Golden Paths." Platform teams can create standardized, best-practice templates for anything—a new React frontend, a Go microservice, a Python data pipeline.&lt;/p&gt;

&lt;p&gt;As a developer, just go to Backstage, choose a template, fill out a simple form (like the service name), and click "Create." Behind the scenes, the Software Template engine kicks into high gear, automating a sequence of tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaffolds a new repository with your company’s standard structure.&lt;/li&gt;
&lt;li&gt;Creates the project in GitHub or GitLab.&lt;/li&gt;
&lt;li&gt;Sets up the CI/CD pipeline.&lt;/li&gt;
&lt;li&gt;Provisions a new namespace and secrets in Kubernetes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and automatically registers the new component in the Software Catalog, closing the loop and turning a long, ticket-bound process into a two-minute, self-service operation.&lt;/p&gt;
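&lt;p&gt;As a sketch of what such a template looks like (the step actions are Backstage's built-in scaffolder actions; the repo owner and skeleton path are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: go-microservice              # illustrative
  title: Go Microservice
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
  steps:
    - id: fetch                      # copy the skeleton, substituting values
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    - id: publish                    # create the repository
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&amp;amp;repo=${{ parameters.name }}
    - id: register                   # close the loop: add it to the Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
&lt;/code&gt;&lt;/pre&gt;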

&lt;p&gt;&lt;strong&gt;The Crypt of Forgotten Knowledge: The Document Graveyard&lt;/strong&gt;&lt;br&gt;
Documentation is the first casualty of tight deadlines. It's written once, thrown into the wiki sea, and hardly ever updated.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TechDocs (Docs-as-Code, Done Right).&lt;/em&gt;&lt;br&gt;
Backstage understands a truth: documentation will only stay current if it lives with the code.&lt;/p&gt;

&lt;p&gt;With TechDocs, developers write their documentation in Markdown files right inside their component's repository. When they submit a pull request to change the code, they update the docs in the same PR. The code and its documentation are now reviewed and merged together.&lt;/p&gt;

&lt;p&gt;Backstage automatically discovers these docs, renders them into a beautiful, easy-to-navigate website, and links it directly from the component's page in the Catalog. No more hunting. No more outdated wikis. The documentation you’re reading is guaranteed to be as fresh as the code it describes.&lt;/p&gt;
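&lt;p&gt;Wiring this up takes two small files (the site name is illustrative; the annotation and plugin names are TechDocs' standard ones):&lt;br&gt;
&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# catalog-info.yaml: point Backstage at the docs in this repo
metadata:
  annotations:
    backstage.io/techdocs-ref: dir:.
---
# mkdocs.yml at the repo root
site_name: payment-processor         # illustrative
nav:
  - Overview: index.md
plugins:
  - techdocs-core
&lt;/code&gt;&lt;/pre&gt;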

&lt;p&gt;&lt;strong&gt;The Labyrinth: Drowning in Information&lt;/strong&gt;&lt;br&gt;
You've got a catalog, templates, and docs, but now a new problem: information overload. How do you find the one thing you need in a sea of data?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A Powerful, Unified Search&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The search bar in Backstage isn't an afterthought; it's a core pillar that indexes everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Software Catalog entities (services, libraries, websites).&lt;/li&gt;
&lt;li&gt;The full text of all your TechDocs.&lt;/li&gt;
&lt;li&gt;Software Templates.&lt;/li&gt;
&lt;li&gt;Even content from plugins (e.g., searching for a specific Confluence page or Stack Overflow for Teams question).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Need to know if your org already has a service for currency conversion? Search. Want to find the team that owns the authentication library? Search. Looking for the API docs for the "payment-processor" service? Search. It’s the universal compass that makes your entire engineering ecosystem navigable.&lt;/p&gt;

&lt;p&gt;Backstage’s greatest strength is its plugin architecture. It ships with the core components above, but its power is in its extensibility. The open-source community and vendors have built plugins for nearly everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security: See Snyk or Trivy vulnerability scans directly on a component's page.&lt;/li&gt;
&lt;li&gt;Cloud Costs: Integrate a cost-insights dashboard to show how much your service is costing on AWS, GCP, or Azure.&lt;/li&gt;
&lt;li&gt;Feature Flags: Manage your LaunchDarkly or Split.io flags from within Backstage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Backstage is built to grow with you: a framework for building the perfect developer portal and delivering a great DevEx.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backstage is an investment in your engineering culture: promoting ownership, standardizing best practices, and eliminating the cognitive friction that burns out your best minds, empowering developers to spend more time building great products.&lt;/p&gt;

&lt;p&gt;Whether you choose the DIY path with the open-source project or opt for an enterprise-ready, supported version like Red Hat Developer Hub, the era of fragmented, frustrating DevEx is over.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>devex</category>
      <category>softwaredevelopment</category>
      <category>developer</category>
    </item>
    <item>
      <title>Making AI Models Accessible Anywhere :: Scaling AI Traffic with Envoy AI Gateway</title>
      <dc:creator>Mallagari Sri Datta</dc:creator>
      <pubDate>Sat, 28 Jun 2025 20:42:47 +0000</pubDate>
      <link>https://dev.to/mallagari_sri_datta/making-ai-models-accessible-anywhere-scaling-ai-traffic-with-envoy-ai-gateway-1bid</link>
      <guid>https://dev.to/mallagari_sri_datta/making-ai-models-accessible-anywhere-scaling-ai-traffic-with-envoy-ai-gateway-1bid</guid>
      <description>&lt;p&gt;In the GenAI gold rush, Every developer, startup, and enterprise is scrambling to build the next killer AI-powered application. But beneath the shiny surface lies a messy, complex, and expensive reality: connecting to Large Language Models (LLMs) is a infrastructural nightmare.&lt;/p&gt;

&lt;p&gt;Traditional API gateways, the trusted gatekeepers of the cloud-native world, are buckling under the punishing demands of AI traffic. Here, let's explore the problem of making AI models accessible anywhere at scale with Envoy AI Gateway &amp;amp; learn about the shift in how we manage, scale, and control the flow of AI, built on the foundation of the popular cloud-native proxy, Envoy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem: Why Your Old Gateway Can't Handle New AI
&lt;/h3&gt;

&lt;p&gt;Managing GenAI traffic isn't just about routing requests; it's a whole new set of challenges:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Fractured Model Universe:&lt;/em&gt; Your app might want to use GPT-4 for complex reasoning, Claude 3 for long-context tasks, and a self-hosted Llama 3 for cost-efficiency. Each has a different API schema, different authentication, and different performance characteristics, making application code a tangled mess of SDKs and conditional logic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost is Unpredictable and Explosive:&lt;/em&gt; Unlike a typical API call, the cost of an LLM request isn't a flat rate; it's based on tokens, the number of words or word fragments in both the input and the model's output. A long, complex request can cost hundreds of times more than a short one. Traditional rate limiting (e.g., 100 requests/minute) is no longer useful for budget control.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Latency is Long and Variable:&lt;/em&gt; LLMs think rather than just fetching data. A response can take many seconds or even minutes to generate. This requires a completely different approach to timeouts, retries, and user experience, often involving streaming responses token-by-token.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Resilience Roulette:&lt;/em&gt; When a model provider has an outage or is running at full capacity, your application suffers the same fate. For fallback, you need intelligent, cross-provider routing to maintain high availability, but building this yourself is a significant engineering effort.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Security and Safety are Paramount:&lt;/em&gt; You have to manage dozens of API keys securely. More importantly, you need to filter both prompts and responses for harmful content, PII, and other sensitive data in real time.&lt;/p&gt;

&lt;p&gt;Doing this in every single application is redundant, insecure, and unscalable; the problem needs to be solved at the infrastructure layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Envoy AI Gateway? How does it solve the problem?
&lt;/h3&gt;

&lt;p&gt;Envoy AI Gateway is an open-source, AI-native gateway designed to solve the challenges of GenAI traffic. It's a sub-project of the Envoy Proxy ecosystem whose mission is to act as a universal translator and an intelligent control point for all your AI services.&lt;/p&gt;

&lt;p&gt;Envoy AI Gateway builds on the robust, high-performance, and incredibly extensible foundation of Envoy Proxy, extending its popular filter chain mechanism to handle AI-specific tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Features of Envoy AI Gateway
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Unified API: Speak One Language to All LLMs&lt;/em&gt;&lt;br&gt;
Your application speaks a single, standardized API format (e.g., the OpenAI API format) to the gateway. Envoy AI Gateway then transparently transforms each request into the specific format required by the backend, whether that's Azure OpenAI, AWS Bedrock, Google's Gemini, or a self-hosted model, completely decoupling the application from the backend model.&lt;/p&gt;

&lt;p&gt;Switching from GPT to Claude becomes a simple configuration change in the gateway, with zero changes to your application code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost-Based Rate Limiting: Finally, Control Your AI Budget&lt;/em&gt;&lt;br&gt;
Envoy AI Gateway understands the concept of "token usage", enabling you to set rate limits based on cost, not just request counts.&lt;/p&gt;

&lt;p&gt;An example &lt;code&gt;AIGatewayRoute&lt;/code&gt; CRD:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gateway.envoyproxy.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AIGatewayRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-gpt4o&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other routing config&lt;/span&gt;
  &lt;span class="na"&gt;limitRequestCosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# A CEL expression calculating cost based on tokens&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3.0'&lt;/span&gt;
      &lt;span class="c1"&gt;# Metadata key to store the calculated cost&lt;/span&gt;
      &lt;span class="na"&gt;metadataKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tokenCost&lt;/span&gt;
  &lt;span class="c1"&gt;# ... more routing config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above policy, we defined a flexible cost formula. &lt;br&gt;
The gateway can enforce rules like "Allow 500,000 cost units per minute," providing precise control over spending across different user tiers.&lt;/p&gt;
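&lt;p&gt;To enforce such a rule, a traffic policy can read the stored &lt;code&gt;tokenCost&lt;/code&gt; metadata back out. A minimal sketch of what that might look like (treat the &lt;code&gt;cost&lt;/code&gt; stanza and its field names as assumptions; check the project docs for the exact schema):&lt;br&gt;
&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: token-budget
spec:
  # ... targetRefs pointing at the gateway/route
  rateLimit:
    type: Global
    global:
      rules:
        - limit:
            requests: 500000        # "Allow 500,000 cost units per minute"
            unit: Minute
          cost:                     # assumed field names
            response:
              from: Metadata
              metadata:
                key: tokenCost      # the metadataKey from the route above
&lt;/code&gt;&lt;/pre&gt;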

&lt;p&gt;&lt;em&gt;Intelligent Load Balancing and Fallbacks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The gateway is capable of managing traffic with priority-based routing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure-backend&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# Highest priority&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai-backend&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Fallback&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the primary backend (azure-backend) becomes unavailable or runs out of capacity, the gateway automatically and seamlessly spills traffic over to the fallback backend (openai-backend), providing critical resilience without complex client-side logic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Centralized Security and Credential Management&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Stop scattering API keys across applications and environment variables. With the &lt;code&gt;BackendSecurityPolicy&lt;/code&gt; CRD, you manage all credentials in one secure, central place.&lt;/p&gt;

&lt;p&gt;It even supports advanced mechanisms like OIDC Federation, allowing the gateway to use its own identity to securely exchange temporary credentials with cloud providers like AWS and Azure. This eliminates the need for long-lived static keys and automates credential rotation, dramatically improving your security posture.&lt;/p&gt;
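&lt;p&gt;For the simple API-key case, a policy might look like this sketch (the &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;secretRef&lt;/code&gt; field names are assumptions drawn from memory of the CRD; verify against the project docs):&lt;br&gt;
&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: openai-apikey
spec:
  type: APIKey
  apiKey:
    secretRef:
      name: openai-apikey-secret   # Kubernetes Secret holding the key
&lt;/code&gt;&lt;/pre&gt;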

&lt;h5&gt;
  
  
  &lt;em&gt;How It Works: A Glimpse Under the Hood&lt;/em&gt;
&lt;/h5&gt;

&lt;p&gt;Envoy AI Gateway's architecture leverages the existing Envoy Gateway project for its control plane and introduces its own controller.&lt;/p&gt;

&lt;p&gt;Users define AI-specific needs using simple CRDs like AIGatewayRoute.&lt;br&gt;
The Envoy AI Gateway controller translates these into standard Gateway API resources.&lt;br&gt;
It then uses Envoy's powerful External Processing (ExtProc) extension to inject AI-specific logic. &lt;br&gt;
An ExtProc sidecar runs alongside the main Envoy pod, handling tasks such as token counting, request/response transformation, and content moderation. This keeps the core Envoy proxy lean and fast while allowing rich, AI-specific functionality to be developed and deployed independently.&lt;/p&gt;

&lt;p&gt;Projects like Envoy AI Gateway are laying the foundation for a more standardized, secure, and cost-effective MLOps and LLMOps landscape, solving complex problems at the infrastructure level and freeing developers to keep building amazing applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to tame your AI traffic?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitHub: github.com/envoyproxy/ai-gateway&lt;br&gt;
Web   : aigateway.envoyproxy.io&lt;/p&gt;

</description>
      <category>design</category>
      <category>microservices</category>
      <category>networking</category>
      <category>llm</category>
    </item>
    <item>
      <title>How PlayStation achieved 99.99% uptime on Kubernetes</title>
      <dc:creator>Mallagari Sri Datta</dc:creator>
      <pubDate>Sat, 28 Jun 2025 13:40:09 +0000</pubDate>
      <link>https://dev.to/mallagari_sri_datta/how-playstation-achieved-9999-uptime-on-k8s-12ac</link>
      <guid>https://dev.to/mallagari_sri_datta/how-playstation-achieved-9999-uptime-on-k8s-12ac</guid>
      <description>&lt;p&gt;When you're powering services for millions of PlayStation gamers, "downtime" isn't just an inconvenience—it's a headline. So, when Sony Interactive Entertainment (SIE) revealed achievement of 99.995% availability for Kubernetes platform last year, SIE has also shared learnings for the platform engineering teams could take cue from.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;From Silos to a Unified Kingdom&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Before 2021, SIE was similar to many large organizations: different teams in the US and Japan had their own platforms, leading to duplicated work and inconsistent standards. The solution was a massive "platform unification" program that created one global team and one platform: the Unified Kubernetes Service (UK Platform).&lt;/p&gt;

&lt;p&gt;The UK Platform was built on three simple, powerful ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unification&lt;/strong&gt;: One way to manage everything. No more ad-hoc fixes or team-specific quirks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Services from dozens of teams would run on shared clusters, maximizing resource usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardization&lt;/strong&gt;: All services would use common Helm charts provided by the platform team. This ensures consistency and makes management sane.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The foundation is built on AWS EKS for managed clusters and Karpenter for node management. The platform team handles the core infrastructure, while service teams focus on what they do best: building amazing applications.&lt;/p&gt;

&lt;p&gt;But as any SRE will tell you, a great start doesn't guarantee a smooth ride. As the platform grew, new challenges emerged. Here’s how they slayed each.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;1 : The Battle for Availability&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Keeping services online 24/7 is the ultimate goal. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Problem :&lt;/em&gt; Uneven Pod Spreading&lt;br&gt;
Developers observed uneven pod distribution, especially during traffic spikes, despite configuring &lt;code&gt;PodDisruptionBudget(PDB)&lt;/code&gt; and &lt;code&gt;PodTopologySpreadConstraints(PTSC)&lt;/code&gt; with &lt;code&gt;whenUnsatisfiable&lt;/code&gt; set to &lt;code&gt;ScheduleAnyway&lt;/code&gt;. The fix was to employ the &lt;a href="https://github.com/kubernetes-sigs/descheduler" rel="noopener noreferrer"&gt;descheduler&lt;/a&gt;, which periodically checks the cluster and evicts pods from overcrowded nodes, forcing them to reschedule onto less crowded ones. Simple, effective, and automated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Problem :&lt;/em&gt; Slow Pod Scaling&lt;br&gt;
Pods didn't scale quickly enough during peak hours (e.g., major title launches, in-game events), leading to increased latency and errors. The total scale-up time was a bottleneck, comprising node creation time and pod startup time.&lt;br&gt;
Overprovisioning: Prepared spare capacity using low-priority placeholder pods. These placeholders are evicted when application pods need to scale, allowing immediate space allocation without waiting for new nodes. This balances cost and responsiveness.&lt;br&gt;
Adopting Karpenter: Karpenter is a node autoscaler that directly provisions and consolidates EC2 instances, making node creation faster than the traditional Cluster Autoscaler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Problem :&lt;/em&gt; CoreDNS Issues&lt;br&gt;
CoreDNS is the phonebook of your cluster handling DNS resolution. As the platform grew, CoreDNS pods, which were running on a limited set of nodes, started hitting rate limits from the upstream DNS resolver. This caused a cascade of failures across applications.&lt;br&gt;
First, pod anti-affinity was used to ensure CoreDNS pods were spread across many more nodes, distributing the load.&lt;br&gt;
But the real breakthrough came from a counter-intuitive move: they removed the CPU limits. Investigation revealed that CPU limits were causing throttling, which severely impacted performance and tail latency. By removing the limits and relying on CPU requests and the kernel's scheduler, performance dramatically improved.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
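&lt;p&gt;For illustration, a descheduler policy along these lines can keep pod spreading honest (this follows the descheduler's v1alpha2 policy format; the exact profile is a sketch, not SIE's actual configuration):&lt;br&gt;
&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: rebalance
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"
    pluginConfig:
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          constraints:             # also act on soft (ScheduleAnyway) constraints
            - ScheduleAnyway
&lt;/code&gt;&lt;/pre&gt;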

&lt;h5&gt;
  
  
  &lt;strong&gt;2 : Taming Maintenance Challenges&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Upgrades are a fact of life in the Kubernetes world. But with over 50 clusters, manual maintenance was a recipe for burnout.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Problem :&lt;/em&gt; Add-on Upgrades Took Forever
Manually upgrading an add-on (like a logging agent or metrics server) across all clusters took over 300 minutes of engineering time. It was repetitive, tedious, and error-prone.
The team built a fully automated workflow. 
The process now includes:&lt;/li&gt;
&lt;li&gt;Running smoke tests on the first cluster.&lt;/li&gt;
&lt;li&gt;Automatically progressing to the next cluster on success.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatically rolling back on failure.&lt;br&gt;
Upgrade time dropped from 300 minutes to under 15 minutes, and reliability skyrocketed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Problem :&lt;/em&gt; Bad Configs Blocked Node Upgrades&lt;br&gt;
Node upgrades were constantly blocked because service teams had configured their PDBs improperly. This required manual intervention from the platform team to fix.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standardized Helm Charts: PDB settings were baked into the common Helm charts that all services use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kyverno Policies: Kyverno, a policy engine for Kubernetes, to automatically block any deployments with improper PDB settings at the API server level. No more bad configs could enter the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
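&lt;p&gt;A Kyverno rule for this could look like the following sketch (the specific check, rejecting &lt;code&gt;maxUnavailable: 0&lt;/code&gt;, is one illustrative example of an "improper PDB", not SIE's actual policy):&lt;br&gt;
&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-drainable-pdb
spec:
  validationFailureAction: Enforce   # reject at the API server
  rules:
    - name: disallow-blocking-pdb
      match:
        any:
          - resources:
              kinds:
                - PodDisruptionBudget
      validate:
        message: "maxUnavailable: 0 would block node drains and upgrades."
        pattern:
          spec:
            maxUnavailable: "!0"     # Kyverno pattern negation
&lt;/code&gt;&lt;/pre&gt;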

&lt;h5&gt;
  
  
  &lt;strong&gt;3 : People, Process, and Time Zones&lt;/strong&gt;
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Problem :&lt;/em&gt; Burnout and Rising Operational Load&lt;br&gt;
As the platform scaled, the on-call burden on the team grew, threatening work-life balance. An alert at 3 AM in California is a problem.&lt;br&gt;
&lt;em&gt;A Follow-the-Sun Global Team&lt;/em&gt; SIE built a global team with engineers in the US, Japan, and India. This "follow-the-sun" model provides 24/7 coverage, with clean handoffs between regions. When an incident occurs, the on-call engineer for that time zone handles it, preventing any single person from being awake all night.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Problem :&lt;/em&gt; Knowledge Gaps and Communication Delays&lt;br&gt;
Working across time zones created information silos and slowed down decision-making. A question asked in Tokyo might not get an answer from California for 12 hours.&lt;br&gt;
A Culture of Documentation and Shared Knowledge&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Knowledge Sharing Sessions: Regular sessions to keep everyone in sync.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation: Key decisions and architectures are formally documented.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Incident Management Process: A clear, three-phase process (Before, During, After) ensures that every incident is a learning opportunity, with action items tracked to continuously improve the platform.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Key Takeaways for Engineering Teams&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;SIE's journey offers a powerful blueprint for running platforms at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Culture is Everything :&lt;/em&gt;&lt;/strong&gt; Success started with a culture that values data, tracks metrics, and acts on them. Inclusive leadership made the global team model work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Embrace the Ecosystem :&lt;/em&gt;&lt;/strong&gt; Kubernetes and its rich open-source ecosystem (Karpenter, Kyverno, descheduler) provided the building blocks, so the team didn't have to reinvent the wheel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Automate Ruthlessly :&lt;/em&gt;&lt;/strong&gt; Automation isn't a luxury; it's a necessity for reliability and freeing up engineers to solve bigger problems.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Master the Basics:&lt;/strong&gt;&lt;/em&gt; Ultimately, 99.995% uptime comes from relentless focus on solving foundational problems with simple, robust, and well-understood solutions.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>kubecon</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
