DEV Community: Saim Safdar

How Enterprises Can Retain Local Control Through Digital Sovereignty | Ep 142 #cloudnativefm

Saim Safdar — Fri, 07 Nov 2025 09:25:15 +0000

Lessons from my conversation with Gabriel Gaura — Head of Infrastructure, Security & Risk, T-Systems International (Cloud Professional Services) on CloudNative.FM Podcast

TL;DR: Digital sovereignty has exploded from a policy conversation into an operational and strategic problem for enterprises largely because geopolitical moves (tariffs, legal reach) suddenly turned vendor choice into a business-risk calculation. The shocks landed hardest on heavy public-cloud and SaaS adopters whose workloads and operations were tightly coupled to non-local vendors. Gabriel outlines a pragmatic, three-pillar view of sovereignty (Data, Operational, Technological) and a stepwise roadmap (exposure analysis → classification → pragmatic roadmap → contingency) that balances control with innovation.

Why “digital sovereignty” re-entered the lexicon — tariffs, reaction, and risk

Over the past year, the phrase digital sovereignty stopped being an academic/policy term and became boardroom material. Gabriel traced that shift to geopolitical actions, tariffs and regulatory pressure that produced immediate business consequences and second-order reactions (for example, talk of reciprocal measures in Europe). The point isn’t just ideology: it’s plain economics. If software and platform licensing suddenly become subject to tariffs or legal restrictions, companies face material increases in cost and legal exposure.

Two key mechanics drove the jump in attention:

Direct shocks: tariffs, export restrictions, or sanctions that suddenly constrain access to certain vendors or services.

Reciprocal policy talk and legal reach: laws (and legal precedents) that create uncertainty about who can access data, where it can be processed, and under which jurisdiction.

When your cloud provider or a critical SaaS vendor becomes a geopolitical lever, sovereignty is no longer optional; it’s part of risk management.

Which sectors got surprised — and why

Gabriel highlighted that the sectors most shaken were those that had embraced public cloud and vendor-managed services without contingency plans:

Finance and banking: heavy reliance on global clouds, strict regulation, and low tolerance for service interruption made banks especially vulnerable (he referenced a widely discussed case of a bank denied access to its cloud resources after sanctions).
Large SaaS consumers & companies using embedded managed services: organizations that embedded vendor PaaS/serverless services into app code found portability and contingency extremely costly.
Organizations outsourcing critical operations: those who outsourced runbooks, ops, or platform management were exposed operationally and politically.

In short, the more you rely on third-party, cross-border platforms, especially where core logic is embedded in vendor services, the larger your sovereignty exposure.

Practical checkpoints & short wins for teams

Run the exposure analysis this quarter. Map your top 50 mission-critical services and where they live.
Classify risk appetite. Board and supervisory committees must understand the “what-if” scenarios.
Favor open source for critical control planes where it reduces lock-in risk, but remember open source alone doesn’t immunize you from geopolitical service risks.
Create a sovereignty scorecard (financial / operational / technological dimensions) for leadership reporting.
Start small: identify one or two workloads where moving to a local or alternative platform yields high risk reduction for modest cost.
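
To make the first and last checkpoints concrete, here is a minimal sketch of an exposure ranking. It is not Gabriel's method: the `Service` fields, the weights, and the sample values are all assumptions chosen for illustration, and a real analysis would draw on your own CMDB and contracts.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # All fields are hypothetical; adapt them to your own inventory.
    name: str
    vendor_jurisdiction: str       # where the vendor is legally based
    home_jurisdiction: str         # where your organization operates
    criticality: int               # 1 (low) to 5 (mission-critical)
    embedded_vendor_services: int  # vendor PaaS/serverless dependencies in app code
    has_exit_plan: bool


def exposure_score(svc: Service) -> int:
    """Higher score = larger sovereignty exposure (illustrative weights)."""
    score = svc.criticality
    if svc.vendor_jurisdiction != svc.home_jurisdiction:
        score += 3  # cross-border vendor
    score += min(svc.embedded_vendor_services, 5)  # cap the lock-in contribution
    if not svc.has_exit_plan:
        score += 2  # no contingency in place
    return score


def rank_by_exposure(services):
    """Return services sorted from most to least exposed."""
    return sorted(services, key=exposure_score, reverse=True)


if __name__ == "__main__":
    inventory = [
        Service("internal-wiki", "DE", "DE", 2, 0, True),
        Service("payments-api", "US", "DE", 5, 4, False),
    ]
    for svc in rank_by_exposure(inventory):
        print(svc.name, exposure_score(svc))
```

The workloads at the top of such a list that are also cheap to move are reasonable candidates for the "start small" pilot.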

Innovation vs. sovereignty — a false tradeoff

A key line in our discussion: innovation often happens under constraint. Moving away from a single-vendor comfort zone can spark creative alternatives such as confidential computing, interoperable container patterns, and edge-continuum initiatives. Gabriel argues that while being forced to leave a comfort zone is disruptive, it can reignite platform innovation rather than stifle it, as long as organizations plan pragmatically and avoid blanket “rip and replace” strategies.

Questions for readers (help shape episode two)

Which area should we deep-dive next? Tell us in the comments or via CloudNativeFM:

A) Data residency & compliance playbooks (GDPR, DORA, sector rules)
B) Migration patterns & portability (how to untangle PaaS/serverless lock-in)
C) Procurement & vendor governance (contract tweaks, exit clauses, audits)
D) Building a “sovereignty playbook”: templates, scorecards, tabletop exercises

Final thought

Digital sovereignty is not a binary state; it’s a pragmatic, cross-functional journey. The choice facing enterprises isn’t “sovereignty or innovation” but “how much control do we need where, and at what cost?” Gabriel’s framework (exposure first, classification second, pragmatic roadmap third) gives organizations a realistic path that reduces catastrophic risks while preserving the ability to build and innovate.

If you found this helpful, drop your pick from the reader questions above, and we’ll use it to plan episode two with Gabriel on CloudNative.FM Podcast.

HashiCorp Project Infragraph — The “Google Maps” for Cloud Infrastructure? #cloudnativewisdom12

Saim Safdar — Wed, 05 Nov 2025 10:43:05 +0000

We built the cloud to be simpler, more flexible, and infinitely more scalable than legacy data centers. Yet somewhere along the way, we lost one of the things that made operations manageable in the first place: a reliable, enterprise-grade inventory — a system of record that tells you what you own, where it runs, who owns it, and what policies apply.

In this short explainer, I talk to my guest Richard Simon about what InfraGraph is, why it matters, how it compares to older data-center inventory tools, and the implications for automation, AI agents, and third-party tooling.

HashiCorp’s Project InfraGraph (announced at HashiConf) aims to restore that capability, not as another dashboard, but as a relationship-first knowledge substrate that connects infrastructure, applications, services, ownership, and policy into a single, trusted model. If it delivers on its promise, InfraGraph could become the missing piece for safer automation, simpler Day-2 operations, and better AI-driven decision-making across hybrid and multi-cloud estates.

In this post, I want to explain why we lost inventory in the shift to cloud, what a graph-based infrastructure model brings back, and how teams should prepare for a future where context, not just telemetry, enables trustworthy automation.

What we used to have (and why it mattered)

In traditional data centers, organizations relied on inventory and lifecycle tools (IBM Tivoli, network managers, systems directors, and similar platforms) that did two critical things:

They cataloged everything. Hardware, firmware, network devices, and software were discovered and recorded.

They were actionable. Tools could push updates, apply patches, and reconfigure devices because they had a trusted view of the environment.

That “system of record” gave operators a single place to answer questions like: what’s running, where is it running, who owns it, and what versions are in use? It was messy, sure, but it worked; it created a foundation for predictable change and controlled automation.

Real use cases worth watching

If InfraGraph works as planned, several practical Day-2 scenarios immediately improve:

Faster triage: Instead of chasing traces and logs across tools, you query a single model to find affected services, owners, and related policies.
Drift detection with context: Detect configuration drift and immediately see its impact surface (apps, owners, SLAs).
Policy enforcement & auditability: Map policies to resources and owners in a structured way; decisions and remediation actions become auditable.
Agentic automation: Provide LLMs or agents with a trustworthy context so automated remediation can be precise and compliant.
Third-party integrations: SIEM, SRE, and platform tools can consume the same substrate instead of maintaining competing inventories.
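
To illustrate what "query a single model" could look like for triage and drift impact, here is a small sketch over a hand-built dependency graph. This is not InfraGraph's API or data model; the node names, the `DEPENDENTS` and `OWNERS` mappings, and both functions are hypothetical stand-ins for whatever a real graph would expose.

```python
from collections import deque

# Hypothetical graph: each resource maps to the things that depend on it.
DEPENDENTS = {
    "sg-web": ["vm-web-1", "vm-web-2"],
    "vm-web-1": ["svc-checkout"],
    "vm-web-2": ["svc-checkout", "svc-search"],
}

# Hypothetical ownership records attached to service nodes.
OWNERS = {"svc-checkout": "team-payments", "svc-search": "team-discovery"}


def impact_surface(start, dependents):
    """Breadth-first walk: everything downstream of a changed resource."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def affected_owners(start, dependents, owners):
    """Owners of every downstream node that has an ownership record."""
    return {owners[n] for n in impact_surface(start, dependents) if n in owners}


if __name__ == "__main__":
    # Drift detected on a security group: who needs to know?
    print(sorted(impact_surface("sg-web", DEPENDENTS)))
    print(sorted(affected_owners("sg-web", DEPENDENTS, OWNERS)))
```

The point of the sketch is the shape of the question: one traversal answers "what is affected and who owns it", which is the context an agent or an on-call engineer would otherwise assemble from several tools.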

Recommendations for platform teams

Map your current state: Before adopting anything, document where inventory exists in your stack and what’s missing.
Define ownership & SLOs for inventory accuracy: Who is responsible for updates? What’s a tolerable freshness window?
Prioritize APIs and integrations: Ensure any graph solution exposes clean APIs you can integrate into automation and incident workflows.
Start with high-value use cases: Triage and policy enforcement are low-risk, high-value first consumers of a graph.
Plan guardrails early: RBAC, audit logs, and approval workflows should be part of the rollout plan.
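
The "freshness window" in the second recommendation can be checked mechanically. Below is a minimal sketch, assuming each inventory record carries a last-seen timestamp; the `stale_records` function, the record names, and the one-hour window are illustrative assumptions, not part of any HashiCorp tooling.

```python
from datetime import datetime, timedelta, timezone


def stale_records(records, max_age, now=None):
    """Return names of records whose last-seen time is older than max_age.

    records: dict mapping a resource name to a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return [name for name, last_seen in records.items() if now - last_seen > max_age]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    records = {
        "vm-web-1": now - timedelta(minutes=5),
        "db-orders": now - timedelta(hours=3),
    }
    # With a one-hour freshness SLO, only the three-hour-old record is flagged.
    print(stale_records(records, timedelta(hours=1), now=now))
```

A check like this gives the inventory owner something measurable to report against, whatever system ends up holding the graph.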

Conclusion: A practical return to inventory

We don’t need another silo; we need one trustworthy view that multiple teams and tools can rely on. Project InfraGraph is an ambitious attempt to reintroduce that system-of-record in a world of multi-cloud and hybrid complexity. If it can deliver accurate ingestion, relationship-first modeling, and open integrations, and if teams treat the graph as a governed asset rather than a commodity, it could dramatically simplify Day-2 operations and unlock safer, more auditable automation.

The cloud era taught us to be modular and specialized. Now it’s time to stitch those modules together with context. InfraGraph could be the thread we’ve been missing.

Are you excited, cautious, or both? Drop a comment. I’ll collect feedback and, if there’s interest, share a follow-up demo after a private beta or invite a guest from the Infragraph team.

For more explainer videos, and for the ongoing platform engineering panel series, hit Subscribe on @cloudnativefm | @CloudTherapist. Invest in yourself and start learning new skills, because the skills we choose today determine the careers and jobs we get tomorrow.

Introduction to llm-d Open-source Kubernetes-native Framework for Distributed LLM Inference | Ep 140 #cloudnativefm

Saim Safdar — Wed, 15 Oct 2025 14:22:43 +0000

I recently had a great conversation with the Red Hat team about llm-d, a new open-source effort that’s starting to tackle a problem we’re seeing more and more in production ML stacks: inference workloads becoming monolithic, heavy, and hard to scale.

A few highlights from the discussion and why llm-d matters:

Inspiration: llm-d draws a lot of inspiration from work done in projects like vLLM, which optimized inference on everything from laptops to DGX clusters (caching, speculative decoding, distribution).

The problem: Today we often run inference as one big container (model + runtime + observability + config + pipelines). When you scale, you end up copying too much state across nodes, which is inefficient and brittle.

The idea: Treat the model and its runtime as disaggregated, first-class components inside Kubernetes. Break the container into parts (cache, prefill/decode, GPU-bound work, CPU-bound work) and let the platform place and scale each piece independently.

Why it’s promising: Cache-aware routing and componentized serving let you avoid unnecessary duplication, match workloads to the right resources (GPU vs CPU), and scale more intelligently across clusters, which can reduce cost and improve responsiveness.

The opportunity: If you’re building ML infra or platform capabilities, this opens a path to far more efficient inference at scale, especially as model sizes continue to grow.
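
To give a feel for what cache-aware routing means, here is a toy sketch: send each request to the replica that already holds the longest matching token prefix, and break ties by picking the less loaded replica. This is not llm-d's scheduler or API; the `replicas` structure, the function names, and the scoring rule are assumptions made purely for illustration.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt_tokens, replicas):
    """Choose a replica by longest cached prefix, then by lowest load.

    replicas: dict mapping a replica name to
              {"cached": [token lists already in its cache], "load": int}
    """
    def score(name):
        info = replicas[name]
        best = max(
            (shared_prefix_len(prompt_tokens, c) for c in info["cached"]),
            default=0,
        )
        # Tuples compare left to right: prefix length first, then lower load.
        return (best, -info["load"])

    return max(replicas, key=score)


if __name__ == "__main__":
    replicas = {
        "decode-a": {"cached": [[1, 2, 3]], "load": 5},
        "decode-b": {"cached": [[1, 2, 9]], "load": 0},
    }
    # decode-a shares three leading tokens with the prompt, decode-b only two.
    print(pick_replica([1, 2, 3, 4], replicas))
```

A plain round-robin balancer would ignore the cached prefix and force the work to be redone; routing on cache affinity is one way the "avoid unnecessary duplication" claim above can play out in practice.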

llm-d is still early, but it’s a practical, infrastructure-first approach to a real industry pain point. I’ll share a clip of our conversation and a short explainer diagram — would love to hear from folks building inference at scale:

Question: How are you scaling inference today: monolithic instances, autoscaling replicas, or something disaggregated? Drop a comment; I’d love to compare notes.