<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anna Silva</title>
    <description>The latest articles on DEV Community by Anna Silva (@notjustanna).</description>
    <link>https://dev.to/notjustanna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F534773%2Ff39757bc-9700-47f0-9394-4cd6ff215379.png</url>
      <title>DEV Community: Anna Silva</title>
      <link>https://dev.to/notjustanna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/notjustanna"/>
    <language>en</language>
    <item>
      <title>Astro is Great, Actually</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/notjustanna/astro-is-great-actually-5e6g</link>
      <guid>https://dev.to/notjustanna/astro-is-great-actually-5e6g</guid>
      <description>&lt;p&gt;I've been building personal websites long enough to have opinions about Bootstrap 2. Not nostalgia — opinions. It was the right tool for 2013, it held its ground on IE, and if you think that's funny you've never debugged a flexbox fallback at 1am for a browser that predates flexbox.&lt;/p&gt;

&lt;p&gt;Since then: CRA when I wanted to play with the fancy React hooks, Next.js when Next.js was the new hotness, Tailwind the moment I learned it existed (and it never left), Vite when I wanted a cleaner foundation. The through-line is Tailwind and &lt;code&gt;.tsx&lt;/code&gt;. That's where I live. Everything else is negotiable.&lt;/p&gt;




&lt;h2&gt;I Wanted to Write Things Down&lt;/h2&gt;

&lt;p&gt;Dev.to was the obvious first move. It's where dev content lives, the tooling is fine, the audience is there. So I started writing there.&lt;/p&gt;

&lt;p&gt;And then I wanted to write about other things. Keyboards. Thoughts that don't resolve into a tutorial. Life, vaguely. Dev.to technically allows it — some people are fine with that — but it never felt right. Dev.to is for dev content. Personal stuff belongs somewhere personal.&lt;/p&gt;

&lt;p&gt;Which meant I needed a blog. Which meant I needed blog plumbing.&lt;/p&gt;




&lt;h2&gt;The Obvious Problem&lt;/h2&gt;

&lt;p&gt;My portfolio site was just Vite. Static. No backend, no CMS, no server doing anything interesting. "Just add a blog" implies some infrastructure I don't have.&lt;/p&gt;

&lt;p&gt;Someone said: Astro.&lt;/p&gt;




&lt;h2&gt;Astro the Framework&lt;/h2&gt;

&lt;p&gt;Astro is a static site generator built on top of Vite. Which, first of all, COOL. Migration from my existing setup would be mostly renaming things. But the actually interesting part: Astro has a component model where you can drop in React, Svelte, Vue — it handles hydration for whichever ones need it, leaving everything else as zero-JS static HTML. The integrations list is absurd. MDX out of the box. Image optimization. RSS feeds. Sitemaps. Content collections with schema validation. It's not opinionated about what you're building; it just handles the boring parts so you don't have to.&lt;/p&gt;

&lt;p&gt;Okay. I'm interested. What's the catch?&lt;/p&gt;




&lt;h2&gt;Astro the Language&lt;/h2&gt;

&lt;p&gt;The catch is &lt;code&gt;.astro&lt;/code&gt; files.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
import { getCollection } from 'astro:content';
import PostList from '../components/blog/post-list.astro';
import BaseLayout from '../layouts/base-layout.astro';
import { SITE_DESCRIPTION, SITE_TITLE } from '../consts';

const posts = (await getCollection('posts')).sort(
  (a, b) =&amp;gt; b.data.pubDate.valueOf() - a.data.pubDate.valueOf(),
);
---

&amp;lt;BaseLayout title={SITE_TITLE} description={SITE_DESCRIPTION}&amp;gt;
  &amp;lt;section&amp;gt;
    &amp;lt;h1&amp;gt;Posts&amp;lt;/h1&amp;gt;
    &amp;lt;PostList posts={posts} /&amp;gt;
  &amp;lt;/section&amp;gt;
&amp;lt;/BaseLayout&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I looked at them and went: ugh. Frontmatter at the top, template below, logic mixed in — it gave me flashbacks to the Jekyll era of GitHub Pages, which itself gave me flashbacks to PHP files opening with a cascading reverse-indented avalanche of &lt;code&gt;&amp;lt;/div&amp;gt;&lt;/code&gt;s from the template three includes up. You know the files I'm talking about.&lt;/p&gt;

&lt;p&gt;What made React work for me was precisely that it isn't that. A component is a function that returns HTML — not &lt;em&gt;really&lt;/em&gt; HTML, it's JSX, it's movie magic — but the DOM is being treated as an object, not a text file with &lt;code&gt;&amp;lt;?php echo&lt;/code&gt; stitched into it. I did not get into React to write PHP again.&lt;/p&gt;




&lt;h2&gt;Oh, I Can Just Keep Using React&lt;/h2&gt;

&lt;p&gt;Turns out: yes.&lt;/p&gt;

&lt;p&gt;My portfolio could stay in React and &lt;code&gt;.tsx&lt;/code&gt;. The blog content is entirely Markdown with frontmatter. The blog's theming could be fully written in React and &lt;code&gt;.tsx&lt;/code&gt; as well. The &lt;code&gt;.astro&lt;/code&gt; files handle the parts that would be awkward in JSX anyway — the &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; shell, routing, layout wrappers. Everything with actual logic is still React. Still a function that returns HTML.&lt;/p&gt;
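&lt;p&gt;As a sketch of the split (file and component names here are made up for illustration, not from my actual repo), an &lt;code&gt;.astro&lt;/code&gt; page can own the shell and hand anything interactive to a React component through a client directive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
// Hypothetical page: the shell is Astro, the logic is React.
import BaseLayout from '../layouts/base-layout.astro';
import ProjectGallery from '../components/project-gallery'; // a normal .tsx component
---

&amp;lt;BaseLayout title="Projects"&amp;gt;
  &amp;lt;!-- Static by default: rendered to HTML at build time, zero JS shipped. --&amp;gt;
  &amp;lt;h1&amp;gt;Projects&amp;lt;/h1&amp;gt;

  &amp;lt;!-- An island: React hydrates this subtree in the browser. --&amp;gt;
  &amp;lt;ProjectGallery client:load /&amp;gt;
&amp;lt;/BaseLayout&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Components without a &lt;code&gt;client:*&lt;/code&gt; directive ship no JavaScript at all; &lt;code&gt;client:load&lt;/code&gt; hydrates on page load, and &lt;code&gt;client:visible&lt;/code&gt; waits until the component scrolls into view.&lt;/p&gt;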

&lt;p&gt;Which left me with one question: what actually &lt;em&gt;are&lt;/em&gt; &lt;code&gt;.astro&lt;/code&gt; files?&lt;/p&gt;




&lt;h2&gt;The Lingua Franca&lt;/h2&gt;

&lt;p&gt;They read like a PHP template. They parse more like Vue or Svelte with frontmatter. They have the structural feel of a Jekyll theme. It's as if someone looked at the entire history of web templating, said "yes, all of it," and produced something simultaneously familiar and alien depending on which corner you're looking at.&lt;/p&gt;

&lt;p&gt;The generous read — and I think it's the correct one — is that &lt;code&gt;.astro&lt;/code&gt; is the lingua franca of frontend templating. English absorbed Latin, German, French, Spanish, and somehow became a language. Not clean, not internally consistent, but widely legible. &lt;code&gt;.astro&lt;/code&gt; files are like that. Not everyone's cup of tea. Definitely not mine, initially.&lt;/p&gt;

&lt;p&gt;But once I understood what layer they operate at, the strangeness stopped mattering. The PHP comparison doesn't hold. It just looks that way from a distance.&lt;/p&gt;




&lt;p&gt;The blog works. The whole thing is fast, the build tooling is familiar, and at no point did Astro ask me to change how I write components.&lt;/p&gt;

&lt;p&gt;That last part is the actual endorsement. Frameworks have opinions. Astro had opinions too, and none of them got in my way. That's impressive.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@jayphoto" rel="noopener noreferrer"&gt;Justin Wolff&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>programming</category>
      <category>react</category>
    </item>
    <item>
      <title>Design Constraints as Art: Maximizing Your AWS Free Tier</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Fri, 03 Apr 2026 00:05:16 +0000</pubDate>
      <link>https://dev.to/notjustanna/design-constraints-as-art-maximizing-your-aws-free-tier-59m8</link>
      <guid>https://dev.to/notjustanna/design-constraints-as-art-maximizing-your-aws-free-tier-59m8</guid>
      <description>&lt;p&gt;There's a school of thought in creative fields — architecture, music, graphic design — that constraints produce better work than freedom. You don't write a sonnet because fourteen lines is the optimal poem length. You write a sonnet because the form forces decisions you wouldn't have made otherwise, and some of those decisions turn out to be the interesting ones.&lt;/p&gt;

&lt;p&gt;Software doesn't get talked about this way. Instead we talk about "optimization" and "cost reduction" like they're chores — things you do after the real design is done, or when someone notices the AWS bill needs its own line item in the board deck. We treat the budget as an obstacle to the architecture, something to apologize for, not something to design &lt;em&gt;with&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But the most elegant systems I've built weren't the ones where I had unlimited resources. They were the ones where the budget was effectively zero and the product still had to work. Start from that constraint — not as a failure mode but as a design input — and something different happens. You stop designing systems and start designing &lt;em&gt;within&lt;/em&gt; systems.&lt;/p&gt;

&lt;p&gt;This is a post about AWS's free tier. More specifically: how three years of building the cheapest possible production backends taught me that the constraints aren't obstacles to good architecture. They &lt;em&gt;are&lt;/em&gt; the architecture.&lt;/p&gt;




&lt;h2&gt;The Free Tier Is Weirder and More Generous Than You Think&lt;/h2&gt;

&lt;p&gt;Most people hear "AWS free tier" and think of the 12-month trial. Spin up an EC2 instance, forget about it, get a bill that makes you reconsider your career choices. That's not what I'm talking about.&lt;/p&gt;

&lt;p&gt;AWS has an &lt;em&gt;always-free&lt;/em&gt; tier. Not a trial. Not "free for your first year." Always free. And the serverless portion of it is genuinely, almost suspiciously generous.&lt;/p&gt;

&lt;p&gt;Here's what you get for nothing, forever:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt;: 1 million invocations/month. 400,000 GB-seconds of compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway&lt;/strong&gt;: 1 million HTTP API calls/month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB&lt;/strong&gt;: 25 GB of storage. 25 read capacity units, 25 write capacity units. (On-demand pricing has its own always-free allocation too.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt;: 5 GB storage. 20,000 GET requests. 2,000 PUT requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudFront&lt;/strong&gt;: 1 TB data transfer out. 10 million requests/month.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SES&lt;/strong&gt;: 3,000 messages/month if you're sending from Lambda.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cognito&lt;/strong&gt;: 50,000 monthly active users. Fifty thousand. For free.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read that Cognito number again. Fifty thousand monthly active users on the auth layer. For a SaaS product. For free. If you have fifty thousand MAU, you either have revenue to pay for auth or you have much bigger problems than a Cognito bill.&lt;/p&gt;

&lt;p&gt;The combined effect of these isn't "you can run a toy project." It's "you can run a real SaaS product with real users and your infrastructure bill is, plausibly, zero." Not low. Zero.&lt;/p&gt;

&lt;p&gt;I spent three years at a company that specialized in exactly this. Lambda backends, DynamoDB data stores, the whole serverless stack. We weren't doing it for fun — we were doing it because the economics are absurd and our clients liked absurd economics. And somewhere along the way, the constraints stopped being constraints and started being a design language.&lt;/p&gt;




&lt;h2&gt;The Trap Doors&lt;/h2&gt;

&lt;p&gt;Not everything in AWS is this benign. There are services that look like they belong in this stack and will absolutely ruin your zero-dollar streak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS&lt;/strong&gt; is the big one. AWS's managed relational database service. It has a 12-month free tier — not always-free. After your first year, you're paying for a &lt;code&gt;db.t3.micro&lt;/code&gt; that costs roughly $15/month doing nothing. If you started a project on RDS because "I know SQL," congratulations: your note-taking app now has a recurring bill because you didn't want to learn DynamoDB's access pattern model. The constraint was trying to tell you something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EC2&lt;/strong&gt; is the other one. If you're running Lambda and you also have an EC2 instance "for the things Lambda can't do," you've left the free tier reservation. EC2's free allocation of 750 hours/month of &lt;code&gt;t2.micro&lt;/code&gt; is a 12-month trial, not always-free. After that, it's metered. And if you're running something on EC2 that Lambda can't do, you should ask yourself whether you're building a serverless product or a server product that's embarrassed about it.&lt;/p&gt;

&lt;p&gt;The rule is simple: if the service doesn't have an always-free tier, it doesn't belong in the zero-dollar stack. Treat the 12-month trials as what they are — onboarding ramps that terminate in invoices.&lt;/p&gt;




&lt;h2&gt;Let's Build Something&lt;/h2&gt;

&lt;p&gt;Let's experiment with this. Say you want to build a note-taking SaaS. Markdown-based, collaborative enough, the kind of thing a developer might actually use. Let's call it &lt;em&gt;Insight Notes&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I have no notion as to why I'd name it that.&lt;/p&gt;

&lt;p&gt;Insight Notes needs: user authentication, a way to store and retrieve documents, a web frontend, transactional email (verification, password resets), and an API. That's it. That's most SaaS products, actually — a surprising number of them are just "auth + CRUD + a nice frontend" wearing different clothes.&lt;/p&gt;

&lt;p&gt;Here's the stack: Cognito for auth. Lambda for the API. API Gateway to route HTTP to Lambda. DynamoDB for document storage. S3 + CloudFront for the frontend. SES for email.&lt;/p&gt;

&lt;p&gt;Seven services. All of them have always-free tiers. All of them talk to each other natively without you writing glue code. And the total monthly cost for a product with, say, a few hundred active users?&lt;/p&gt;

&lt;p&gt;Somewhere between "rounding error" and "the price of a coffee." If you charge a dollar a month per user, you're keeping nearly all of those few hundred dollars. That's not a side project math trick — that's a margin most SaaS companies spend years of infrastructure work trying to approach.&lt;/p&gt;

&lt;p&gt;This is not a contrived example. This is basically every B2B SaaS with a different coat of paint — and if you can build &lt;em&gt;this&lt;/em&gt; cheaply, you can build most things cheaply.&lt;/p&gt;




&lt;h2&gt;One Lambda to Rule Them All&lt;/h2&gt;

&lt;p&gt;Here's the first thing three years of doing this professionally teaches you: use as few Lambdas as possible.&lt;/p&gt;

&lt;p&gt;The instinct, especially if you're coming from microservices, is to split things up. One Lambda for auth callbacks. One for document CRUD. One for search. One for email triggers. It's tidy. It's also wrong.&lt;/p&gt;

&lt;p&gt;Every Lambda has a cold start. The first invocation after a period of inactivity has to boot the runtime, load your code, initialize your connections. For a Node.js Lambda with reasonable dependencies, that's somewhere between 200ms and 800ms. For Java or .NET, multiply generously.&lt;/p&gt;

&lt;p&gt;One Lambda means one cold start. One user hitting your API with any kind of regularity keeps that Lambda warm. The website doesn't feel like it's booting up a server for every request — because it effectively isn't, as long as someone used it recently enough.&lt;/p&gt;

&lt;p&gt;Multiple Lambdas mean multiple independent cold starts. Your auth callback Lambda hasn't been invoked in an hour? Cold start. Your search Lambda? Also cold. Your user just waited 600ms for login and then another 500ms for their first search. The product feels broken and you haven't even done anything wrong — you just split your code the way the microservices blog told you to.&lt;/p&gt;

&lt;p&gt;One Lambda. Route internally. Use &lt;code&gt;lambda-api&lt;/code&gt; — a framework built specifically for this, with zero dependencies, designed around Lambda's execution model rather than retrofitted onto it. It handles the API Gateway proxy integration for you, parses requests, formats responses, and has a router that feels like Express without the thirty transitive dependencies Express brings to your cold start.&lt;/p&gt;

&lt;p&gt;Your single Lambda receives everything, routes it, and responds. Cold start happens once, warming benefits everything.&lt;/p&gt;
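&lt;p&gt;A sketch of the shape, with hypothetical routes (in practice &lt;code&gt;lambda-api&lt;/code&gt; replaces the hand-rolled lookup and also handles parsing, response formatting, and middleware):&lt;/p&gt;

```typescript
// One Lambda, one internal router. Dependency-free sketch; route names and
// handlers are illustrative. The event shape loosely follows API Gateway's
// HTTP API payload v2, which provides a ready-made "METHOD /path" routeKey.

type Handler = (body: string) => { statusCode: number; body: string };

// Expensive setup (SDK clients, config) also lives at module scope, so it is
// paid once per cold start and reused across warm invocations.
const routes: { [route: string]: Handler } = {
  "GET /notes": () => ({ statusCode: 200, body: JSON.stringify(["note-1"]) }),
  "POST /notes": (body) => ({ statusCode: 201, body }),
};

export function handler(event: { routeKey: string; body?: string }) {
  const route = routes[event.routeKey];
  if (route === undefined) {
    return { statusCode: 404, body: JSON.stringify({ error: "not found" }) };
  }
  return route(event.body ?? "");
}
```

&lt;p&gt;Because every route lives in the same execution environment, one request to any of them keeps all of them warm.&lt;/p&gt;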

&lt;p&gt;This is the constraint producing the design. You're not choosing a monolith because monoliths are philosophically superior. You're choosing it because the cold start penalty makes the alternative feel terrible to use. The constraint said "you get one warm execution environment" and the design fell out of it.&lt;/p&gt;




&lt;h2&gt;DynamoDB: The 25 GB Puzzle&lt;/h2&gt;

&lt;p&gt;DynamoDB is the most opinionated database you will ever use and it's free for 25 GB. Whether that's a gift or a curse depends entirely on whether you're willing to think about your data the way DynamoDB wants you to.&lt;/p&gt;

&lt;p&gt;If you're coming from Postgres or MySQL, your first instinct will be to model your data relationally. You'll want foreign keys, you'll want JOINs, you'll want to normalize everything into tidy third-normal-form tables.&lt;/p&gt;

&lt;p&gt;Your second instinct will be to search for the AWS serverless SQL solution. That's the devil speaking. Aurora Serverless exists, and it will let you write &lt;code&gt;SELECT * FROM notes WHERE user_id = ?&lt;/code&gt; like a civilized person, and it will also cold start for up to 30 seconds on the first connection, bill you per ACU-hour whether you're doing anything or not, and cheerfully generate a surprise invoice the moment you get any real traffic. It is not a free tier play. It is not even a cheap play. It is a trap with a familiar interface.&lt;/p&gt;

&lt;p&gt;So: DynamoDB. And look — the instinct to resist it is correct, because DynamoDB is genuinely strange. But strange in a way that pays off.&lt;/p&gt;

&lt;p&gt;DynamoDB will let you design a relational model. DynamoDB will then punish you for it at read time, slowly, expensively, and without remorse. What it wants instead is single-table design. It wants you to think about your access patterns &lt;em&gt;first&lt;/em&gt; and your data model &lt;em&gt;second&lt;/em&gt;. How will Insight Notes be queried? By user ID. By document ID. By user ID sorted by last modified. That's three access patterns, and if you're clever about your partition key and sort key, that's one table.&lt;/p&gt;

&lt;p&gt;For Insight Notes, that table might look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PK&lt;/th&gt;
&lt;th&gt;SK&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;USER#anna&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PROFILE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ email, name, created }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;USER#anna&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NOTE#01JADX...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ title, content, updated }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;USER#anna&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FOLDER#work&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ name, color, created }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;USER#anna&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NOTE#01JADX...#META&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{ folder: "work", tags: [...] }&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The raw key structure is what DynamoDB actually stores. What you write in code looks considerably more civilized — &lt;code&gt;dynaorm&lt;/code&gt; is a type-safe client that handles marshaling and unmarshaling, validates your items against a Zod schema before they touch the database, and gives you a fluent query builder so &lt;code&gt;.query().wherePK("USER#anna").whereSK("begins_with", "NOTE#")&lt;/code&gt; does exactly what it looks like. The constraint forced the data model. The library makes the data model livable.&lt;/p&gt;

&lt;p&gt;Everything for one user lives under one partition key. Getting all of a user's notes is a single query. Getting a specific note is a point read. Getting all notes in a folder requires a secondary index or a filter — and this is where the design constraint forces you to think about access patterns &lt;em&gt;before&lt;/em&gt; you write a single line of code. In Postgres, you'd add a &lt;code&gt;WHERE folder_id = ?&lt;/code&gt; and not think about it. In DynamoDB, that query either needs to be modeled into the table structure or it costs you an index. Which means you design around how you read, not how you write. Which — and this is the art part — often produces a better data model than the relational one, because you're forced to think about actual user experience instead of abstract data relationships.&lt;/p&gt;
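&lt;p&gt;Concretely, those access patterns compile down to small, boring query inputs. A sketch with assumed table and attribute names matching the table above; these objects are what you'd hand to &lt;code&gt;QueryCommand&lt;/code&gt; and &lt;code&gt;GetCommand&lt;/code&gt; from &lt;code&gt;@aws-sdk/lib-dynamodb&lt;/code&gt;:&lt;/p&gt;

```typescript
// "All notes for a user": one partition, one sort-key prefix. Single query,
// no joins, no scans. The table name is hypothetical.
const allNotesForUser = (userId: string) => ({
  TableName: "insight-notes",
  KeyConditionExpression: "PK = :pk AND begins_with(SK, :sk)",
  ExpressionAttributeValues: { ":pk": "USER#" + userId, ":sk": "NOTE#" },
});

// "One specific note": a point read on the exact composite key.
const oneNote = (userId: string, noteId: string) => ({
  TableName: "insight-notes",
  Key: { PK: "USER#" + userId, SK: "NOTE#" + noteId },
});

// "A user's notes, newest first": the same query read backwards, assuming the
// note IDs sort by time (the ULID-style NOTE#01JADX... keys above do).
const recentNotes = (userId: string) => ({
  ...allNotesForUser(userId),
  ScanIndexForward: false,
});
```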

&lt;p&gt;The 25 GB free tier sounds small until you do the math. A markdown document is text. Text is small. At 3 KB average per note — title, content, metadata, organizational data — 25 GB holds roughly 8 million notes. That's not a toy constraint. That's a constraint most products will never actually reach.&lt;/p&gt;

&lt;p&gt;The real constraint is throughput, not storage. 25 read capacity units and 25 write capacity units is roughly 25 strongly-consistent reads per second for items up to 4 KB, and 25 writes per second for items up to 1 KB. For Insight Notes with a few hundred users, that's fine. For thousands of concurrent users all editing simultaneously — you'll need on-demand capacity, which still has a free allocation but works differently.&lt;/p&gt;
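&lt;p&gt;The arithmetic behind both limits, with the 3 KB average note as a stated assumption:&lt;/p&gt;

```typescript
// Storage: how many 3 KB notes fit in the always-free 25 GB.
const freeStorageBytes = 25 * 1024 ** 3;
const avgNoteBytes = 3 * 1024; // assumption from the text
const maxNotes = Math.floor(freeStorageBytes / avgNoteBytes); // ~8.7 million

// Throughput: 1 RCU = 1 strongly consistent read/sec of an item up to 4 KB,
// 1 WCU = 1 write/sec of an item up to 1 KB. A 3 KB note costs 1 RCU to read
// (it fits in 4 KB) but ceil(3 KB / 1 KB) = 3 WCUs to write.
const rcuPerRead = Math.ceil(avgNoteBytes / (4 * 1024)); // 1
const wcuPerWrite = Math.ceil(avgNoteBytes / 1024);      // 3
const readsPerSec = 25 / rcuPerRead;   // 25 reads/sec
const writesPerSec = 25 / wcuPerWrite; // ~8.3 writes/sec
```

&lt;p&gt;Worth noticing: at 3 KB per item it's the write budget, not the read budget, that's the tight constraint.&lt;/p&gt;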

&lt;p&gt;The point is: the limit makes you think. Think about item size. Think about access patterns. Think about what "enough" means for your actual product, not some hypothetical scale you haven't earned yet.&lt;/p&gt;




&lt;h2&gt;The Frontend: Smaller Is Literally Cheaper&lt;/h2&gt;

&lt;p&gt;S3 + CloudFront for static hosting is standard. What's less obvious is that the free tier makes your frontend size a &lt;em&gt;financial&lt;/em&gt; concern, not just a performance one.&lt;/p&gt;

&lt;p&gt;5 GB of S3 storage is plenty for a frontend. But 20,000 GET requests per month means every asset your page loads counts against a real number. And while CloudFront's 10 million requests and 1 TB transfer are generous, the S3 origin requests behind it aren't free once you exceed the allocation.&lt;/p&gt;

&lt;p&gt;So your React bundle size isn't just a Lighthouse score. It's a line item. Fewer assets, smaller bundles, aggressive caching headers — these aren't best practices you should get around to someday. They're the difference between a zero-dollar bill and a not-zero-dollar bill.&lt;/p&gt;

&lt;p&gt;This is where the constraint does its best work. You were &lt;em&gt;supposed&lt;/em&gt; to ship a smaller frontend. You were &lt;em&gt;supposed&lt;/em&gt; to set proper cache headers. You were &lt;em&gt;supposed&lt;/em&gt; to lazy-load that charting library nobody uses on the landing page. The free tier just gave you a reason that shows up on an invoice instead of a performance audit.&lt;/p&gt;
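&lt;p&gt;One concrete version of "aggressive caching headers", as a sketch; the filename pattern assumes a Vite-style build that content-hashes asset names:&lt;/p&gt;

```typescript
// Decide Cache-Control per object at upload time. The returned string is what
// you'd set as the CacheControl field of an S3 PutObjectCommand input.

// Vite and Astro builds emit content-hashed names like assets/index-BHnDqLvL.js,
// so those files can be cached forever: any change produces a new filename.
const contentHashed = /-[A-Za-z0-9_]{8}\.(js|css|woff2)$/;

export function cacheControlFor(key: string): string {
  if (contentHashed.test(key)) {
    return "public, max-age=31536000, immutable"; // one year, never revalidate
  }
  // index.html (and anything un-hashed) must revalidate so deploys show up.
  return "public, max-age=0, must-revalidate";
}
```

&lt;p&gt;Every asset a repeat visitor doesn't re-fetch is an S3 GET that never happens.&lt;/p&gt;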




&lt;h2&gt;The Part Where Autoscaling Tries to Bankrupt You&lt;/h2&gt;

&lt;p&gt;Here's a thing about serverless that nobody warns you about until it's too late.&lt;/p&gt;

&lt;p&gt;Traditional servers crash under load. That's bad, but it's also a natural circuit breaker. Your server falls over, users get errors, you wake up and fix it. There's a ceiling.&lt;/p&gt;

&lt;p&gt;Lambda doesn't crash under load. Lambda scales. Automatically. To whatever your concurrency limit allows. A thousand concurrent invocations? Lambda will handle it. Ten thousand? Sure, if your account limit permits. DynamoDB on-demand? Scales to meet the request rate. API Gateway? Routes it all through.&lt;/p&gt;

&lt;p&gt;This is excellent until someone writes a bot that hits your API ten million times in a day. Or until a legitimate usage spike pushes you past the free tier allocation on every service simultaneously. Or until a misconfigured retry loop in your own frontend hammers your own backend at scale.&lt;/p&gt;

&lt;p&gt;The traditional server would have crashed. Your bill would have been zero because the server was down. The serverless stack stays up. The serverless stack &lt;em&gt;scales to meet the demand&lt;/em&gt;. The serverless stack sends you a bill for meeting that demand.&lt;/p&gt;

&lt;p&gt;You need rate limiting. You need it at the API Gateway level (throttling is built in, configure it). You need it at the application level (per-user, per-endpoint, per-operation). You need CloudFront caching in front of everything that can be cached, not for performance but for cost containment. You need billing alerts — AWS lets you set them, and you should set them at thresholds that make you uncomfortable, like $1 and $5 and $10, because the jump from $0 to $50 happens fast when the constraint you relied on was "nobody's using this yet."&lt;/p&gt;

&lt;p&gt;Caching, rate limiting, and billing alerts aren't operational maturity for a zero-dollar product. They're structural requirements. The system doesn't crash anymore. Which means the system doesn't stop you from spending money anymore. That's your job now.&lt;/p&gt;
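&lt;p&gt;For the application-level layer, a per-user token bucket is about twenty lines. This sketch is in-memory, so it only limits within a single warm Lambda instance; treat it as the inner wall, with API Gateway throttling as the outer one. The limits are made-up numbers:&lt;/p&gt;

```typescript
// Per-user token bucket: each user gets a burst of CAPACITY requests and a
// sustained refill of REFILL_PER_SEC. In-memory state costs nothing and still
// caps what one hot client can extract from a warm instance.
const CAPACITY = 10;       // burst size (illustrative)
const REFILL_PER_SEC = 2;  // sustained requests/sec per user (illustrative)

const buckets = new Map(); // userId -> { tokens, updatedAt }

export function allow(userId: string, now: number = Date.now()): boolean {
  const b = buckets.get(userId) ?? { tokens: CAPACITY, updatedAt: now };
  // Refill proportionally to elapsed time, clamped to capacity.
  const elapsedSec = Math.max(0, (now - b.updatedAt) / 1000);
  b.tokens = Math.min(CAPACITY, b.tokens + elapsedSec * REFILL_PER_SEC);
  b.updatedAt = now;
  if (b.tokens >= 1) {
    b.tokens -= 1;
    buckets.set(userId, b);
    return true; // request allowed
  }
  buckets.set(userId, b);
  return false;  // request rejected; respond 429
}
```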




&lt;h2&gt;The Deal With the Devil&lt;/h2&gt;

&lt;p&gt;I've spent this entire post talking about AWS services like they're building blocks in a design exercise. And they are. But I need to be honest about what you're actually doing when you build this way, because the constraint-as-art metaphor has a dark edge.&lt;/p&gt;

&lt;p&gt;This stack — Lambda, API Gateway, DynamoDB, S3, CloudFront, SES, Cognito — is not "cloud-native." It's &lt;em&gt;AWS-native&lt;/em&gt;. There's a difference, and the difference matters.&lt;/p&gt;

&lt;p&gt;DynamoDB is not Postgres. Your single-table design, your GSIs, your DynamoDB Streams triggers — none of that transfers to Azure or GCP or your own hardware. Lambda's execution model, its cold start characteristics, its integration with API Gateway — those are AWS implementation details dressed up as abstractions. Cognito's user pools, its hosted UI, its token format — AWS-specific.&lt;/p&gt;

&lt;p&gt;If you build Insight Notes on this stack and then decide to move to a different cloud provider, you are rewriting most of your backend. Not migrating. Rewriting. The data model changes because DynamoDB's model is DynamoDB's. The auth layer changes because Cognito is Cognito. The compute model changes because Lambda is Lambda.&lt;/p&gt;

&lt;p&gt;This is the deal. AWS gives you an extraordinarily generous free tier on services that are extraordinarily specific to AWS. The generosity and the lock-in are the same feature. They want you to build something real on their platform, for free, because they know that "for free" becomes "too expensive to move" once your product has users and your data model is shaped like DynamoDB.&lt;/p&gt;

&lt;p&gt;And here's the part that makes the deal complicated rather than simple: for a lot of products, this is fine. Insight Notes doesn't need multi-cloud portability. Most SaaS products don't. The exit cost is real, but the exit is hypothetical, and the operational cost of &lt;em&gt;not&lt;/em&gt; using the purpose-built services — of running your own Postgres on EC2, your own auth on a container, your own email infrastructure — is higher than the lock-in cost for any product that isn't planning to leave AWS.&lt;/p&gt;

&lt;p&gt;The constraint is: you're building on AWS's terms. The art is knowing that, choosing it deliberately, and designing within those terms rather than pretending they don't exist. Don't use DynamoDB and then complain that it's not Postgres. Don't use Lambda and then complain about cold starts. You chose these. They chose you back.&lt;/p&gt;




&lt;h2&gt;$200 and a Plan&lt;/h2&gt;

&lt;p&gt;AWS gives you $200 in credits when you sign up. Combined with the always-free tier, that's not a toy budget. That's "launch a product, get your first paying customers, and let the revenue catch up to the infrastructure" budget.&lt;/p&gt;

&lt;p&gt;The credits cover the things the free tier doesn't — the occasional Lambda burst that exceeds 1M invocations, the DynamoDB spike during a product launch, the SES costs once you're sending more than 3,000 emails/month. Think of the credits as the buffer between "this is free" and "this costs money but the money is mine now."&lt;/p&gt;

&lt;p&gt;Most products that fail don't fail because of infrastructure costs. They fail because nobody used them. The serverless free tier means infrastructure costs are the last thing that kills you, which means you get to fail for the right reasons instead.&lt;/p&gt;




&lt;p&gt;There's something deeply satisfying about building a system where every design decision has a reason and half those reasons are "because the free tier works this way." It's like writing a sonnet, except the meter is measured in Lambda invocations and the rhyme scheme is your DynamoDB access patterns.&lt;/p&gt;

&lt;p&gt;The constraints are real. The lock-in is real. The risk of accidentally autoscaling yourself a bill is real. But the product is also real, and it cost you nothing to build, and the decisions the constraints forced on you — one Lambda, single-table design, tiny frontend, aggressive caching — are decisions you should have been making anyway.&lt;/p&gt;

&lt;p&gt;Design constraints as art. AWS as the medium. Your invoice as the critic.&lt;/p&gt;

&lt;p&gt;Just set up rate limiting first.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@thisisengineering" rel="noopener noreferrer"&gt;This is Engineering&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>webdev</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Infisical is Great, Actually</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Fri, 27 Mar 2026 20:00:00 +0000</pubDate>
      <link>https://dev.to/notjustanna/infisical-is-great-actually-13ho</link>
      <guid>https://dev.to/notjustanna/infisical-is-great-actually-13ho</guid>
      <description>&lt;p&gt;I run ArgoCD. Full GitOps — if it's not in the repo, it doesn't exist. That's great for everything except secrets, where "if it's in the repo, it might not exist for long either." GitHub secret scanning will catch an API key in a private repo, helpfully disable it, and send you a polite notification that you messed up. So I needed an ESO backend. Here's what I looked at.&lt;/p&gt;

&lt;h2&gt;Shopping Around&lt;/h2&gt;

&lt;p&gt;I was already applying secrets manually via &lt;code&gt;kubectl&lt;/code&gt; — which works fine until it doesn't, and doesn't scale past "just me doing everything." The plan was always to wire up External Secrets Operator; the question was just what it would point at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOPS&lt;/strong&gt; came up first — a Claude recommendation. It encrypts secrets in-repo, which sounds elegant, but the decryption key has to live somewhere, and in practice that somewhere is the machine doing the decrypting. If that machine is compromised, the attacker gets the key, and the key opens everything. Security theater. My brain wanted something that felt like AWS Parameter Store — a place secrets live, accessed over an authenticated API, not a place they're hidden inside something else with the unhiding tool sitting right next to them.&lt;/p&gt;

&lt;p&gt;The second AI-generated recommendation was &lt;strong&gt;GitHub Actions Secrets&lt;/strong&gt;. Which, sure — except my IaC repo &lt;em&gt;has&lt;/em&gt; GitHub Actions in it. Secrets that live inside the same system they're deploying feel like a liability waiting to happen. At this point it was clear I needed an actual secrets service, so I started shopping for an ESO backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;, &lt;strong&gt;AWS Parameter Store&lt;/strong&gt;, and &lt;strong&gt;OCI Vault&lt;/strong&gt; were the clear, industry-standard options — the real products that do the job properly. But I'm running K3s on a free OCI ARM VM specifically because I want zero dependencies I can't walk away from. If OCI's free tier ever goes south, I want to pack up and leave — depending on OCI Vault would chain my secrets to the same cloud I'm trying to stay portable from. And pulling in AWS just for secrets would be adding a second cloud dependency for no reason.&lt;/p&gt;

&lt;p&gt;I also stumbled upon &lt;strong&gt;Doppler&lt;/strong&gt;, which I actually liked the look of. The DX is genuinely good, the CLI is pleasant, the UI is clean. Then I hit the pricing page: service accounts — the thing you need for any automated workflow — are a Team plan feature. $21/month per user. For just me. For secrets. No.&lt;/p&gt;

&lt;p&gt;Then I found Infisical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Infisical Is
&lt;/h2&gt;

&lt;p&gt;Infisical is an open-source secrets manager with a managed cloud offering and a self-hostable option. The cloud tier is genuinely free for reasonable usage. It has a UI, a CLI, SDKs, and first-class Kubernetes support — either via its own operator or as an External Secrets Operator backend, which is what I ended up using.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup I Actually Use: ESO + Infisical Cloud
&lt;/h2&gt;

&lt;p&gt;Rather than the Infisical operator, I went with &lt;a href="https://external-secrets.io/" rel="noopener noreferrer"&gt;External Secrets Operator&lt;/a&gt; (ESO) using Infisical as the backend. ESO is a CNCF project with a clean abstraction: you define a &lt;code&gt;SecretStore&lt;/code&gt; (or &lt;code&gt;ClusterSecretStore&lt;/code&gt;) pointing at your secrets backend, then &lt;code&gt;ExternalSecret&lt;/code&gt; resources that describe which secrets to sync and where. The output is always a standard Kubernetes &lt;code&gt;Secret&lt;/code&gt;. Swap out the backend someday and your app manifests don't change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing ESO via ArgoCD
&lt;/h3&gt;

&lt;p&gt;I manage everything with ArgoCD, so the ESO install is an Application pointing at the official release manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# applications/external-secrets/kustomization.yml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kustomize.config.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Kustomization&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;https://github.com/external-secrets/external-secrets/releases/download/v2.2.0/external-secrets.yaml&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cluster-secret-stores.yml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ArgoCD Application itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/notjustanna/iac.git&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;applications/external-secrets&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ServerSideApply=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;sync-wave: "-1"&lt;/code&gt; ensures ESO is fully deployed before anything tries to create &lt;code&gt;ExternalSecret&lt;/code&gt; resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring the ClusterSecretStore
&lt;/h3&gt;

&lt;p&gt;Create a machine identity in Infisical (Project → Access Control → Machine Identities), give it read access to your environment, then store the credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic infisical-auth &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;clientId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-client-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;clientSecret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-client-secret&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; external-secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the &lt;code&gt;ClusterSecretStore&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# applications/external-secrets/cluster-secret-stores.yml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infisical&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;infisical&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;universalAuthCredentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;clientId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infisical-auth&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets&lt;/span&gt;
            &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clientId&lt;/span&gt;
          &lt;span class="na"&gt;clientSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infisical-auth&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets&lt;/span&gt;
            &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clientSecret&lt;/span&gt;
      &lt;span class="na"&gt;secretsScope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;projectSlug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-project-slug"&lt;/span&gt;
        &lt;span class="na"&gt;environmentSlug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod"&lt;/span&gt;
        &lt;span class="na"&gt;secretsPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;
        &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One &lt;code&gt;ClusterSecretStore&lt;/code&gt; serves the whole cluster. Every namespace can reference it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consuming Secrets: A Real Example
&lt;/h3&gt;

&lt;p&gt;Here's how I pull in the Cloudflare API token for Traefik. In Infisical, the secret lives under &lt;code&gt;/traefik&lt;/code&gt;. The &lt;code&gt;ExternalSecret&lt;/code&gt; syncs everything under that path into a Kubernetes secret in the &lt;code&gt;traefik&lt;/code&gt; namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# applications/traefik/external-secret.yml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudflare-api-token&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;infisical&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudflare-api-token&lt;/span&gt;
    &lt;span class="na"&gt;creationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owner&lt;/span&gt;
    &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opaque&lt;/span&gt;
  &lt;span class="na"&gt;dataFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;find&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/traefik&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traefik just sees a Kubernetes secret. It has no idea Infisical exists. That's the whole point.&lt;/p&gt;
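
&lt;p&gt;To make that concrete, here's a hypothetical consumer (the name and image are illustrative, not from my actual manifests). Any pod spec picks up the synced secret like any other Kubernetes &lt;code&gt;Secret&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# hypothetical consumer; names are illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: traefik
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:latest
          envFrom:
            # the plain Kubernetes Secret that ESO keeps in sync
            - secretRef:
                name: cloudflare-api-token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
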

&lt;h2&gt;
  
  
  The Part That Actually Sold Me
&lt;/h2&gt;

&lt;p&gt;The ESO pattern means &lt;strong&gt;application manifests don't change&lt;/strong&gt; when you change secret backends. Secrets are just Kubernetes secrets to everything downstream — no SDK, no sidecar, no secret-fetching logic in application code. Infisical lives entirely in the infra layer, which is where secrets management should live.&lt;/p&gt;

&lt;p&gt;The path-based organization (&lt;code&gt;/traefik&lt;/code&gt;, &lt;code&gt;/monitoring&lt;/code&gt;, and so on) felt immediately familiar — it's the same mental model as AWS Parameter Store. That's not a coincidence; it's just the right way to organize secrets. The &lt;code&gt;recursive: true&lt;/code&gt; on the &lt;code&gt;ClusterSecretStore&lt;/code&gt; means &lt;code&gt;ExternalSecret&lt;/code&gt; resources can scope as narrowly or broadly as makes sense per workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Use the Cloud or Self-Host?
&lt;/h2&gt;

&lt;p&gt;The managed free tier covers up to 5 projects with unlimited secrets. For a homelab or small production workload, there's genuinely no reason to self-host unless you want to.&lt;/p&gt;

&lt;p&gt;If you do want to self-host — the Helm chart is well-documented. Just be aware of the chicken-and-egg problem: if Infisical lives on the same cluster it's serving secrets to, it becomes a bootstrap dependency, and that gets messy fast. I wrote about this failure mode &lt;a href="https://dev.to/notjustanna/self-hosting-everything-including-the-single-point-of-failure-28dm"&gt;in more detail here&lt;/a&gt;. Infisical Cloud sidesteps it entirely, which is why I'm using it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;I stopped looking for a better option. Infisical + ESO hit the threshold of "this is clearly correct" — open source, free at the scale I need, Kubernetes-native without being invasive, and not locked to any cloud. The setup I showed above is the whole thing. If you're still managing secrets via committed files, manual &lt;code&gt;kubectl create secret&lt;/code&gt; commands, or CI-only stores — this is the way out.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@sidverma" rel="noopener noreferrer"&gt;Sid Verma&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>kubernetes</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Self-Hosting Everything, Including the Single Point of Failure</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Fri, 27 Mar 2026 11:00:00 +0000</pubDate>
      <link>https://dev.to/notjustanna/self-hosting-everything-including-the-single-point-of-failure-28dm</link>
      <guid>https://dev.to/notjustanna/self-hosting-everything-including-the-single-point-of-failure-28dm</guid>
      <description>&lt;p&gt;Homelabbing is genuinely fun. I want to say that upfront, before I tell you about the time I locked myself out of my own infrastructure for an afternoon.&lt;/p&gt;

&lt;p&gt;The premise is compelling: you have a VM, you have K3s, and the open source ecosystem has basically everything you'd pay a SaaS for. Keycloak for OIDC. Forgejo for Git. Headscale for VPN. ArgoCD for GitOps. A few YAML files and you have a self-hosted stack that would make a startup founder weep.&lt;/p&gt;

&lt;p&gt;And for a while, it works beautifully. Ansible (yes, Ansible — I've since retired it, but that's &lt;a href="https://dev.to/notjustanna/containers-the-wrong-way-lessons-learnt-4obn"&gt;another post&lt;/a&gt;) kept everything converging to the right state. ArgoCD synced my apps from a Forgejo repo. Kubectl reached K3s over Headscale. Users — well, me — logged into Forgejo via Keycloak OIDC. My custom Keycloak theme was hosted on Forgejo.&lt;/p&gt;

&lt;p&gt;It was elegant. It was self-referential. It was fine, right up until it wasn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dependency Graph Nobody Warned Me About
&lt;/h2&gt;

&lt;p&gt;Let me draw the graph for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD&lt;/strong&gt; deploys everything, including Forgejo, Keycloak and Headscale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgejo&lt;/strong&gt; hosts the GitOps repo ArgoCD syncs from&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keycloak&lt;/strong&gt; handles auth for Forgejo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgejo&lt;/strong&gt; hosts my custom Keycloak theme&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headscale&lt;/strong&gt; is how my &lt;code&gt;kubectl&lt;/code&gt; reached K3s remotely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You see the problem. The dependency graph doesn't have a root. Everything depends on something else that it manages. It's turtles all the way down, except the turtles are also responsible for each other's wellbeing.&lt;/p&gt;

&lt;p&gt;This is fine when everything is running. A self-healing cycle. ArgoCD keeps things in sync, everything stays up, life is good.&lt;/p&gt;

&lt;p&gt;It is considerably less fine when Oracle decides to release a new Ubuntu Minimal image and &lt;code&gt;terraform apply&lt;/code&gt; silently replaces your VM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Do Not &lt;code&gt;terraform apply&lt;/code&gt; While Distracted
&lt;/h2&gt;

&lt;p&gt;I won't go into the specifics of how Oracle releasing a new image source caused my Terraform to replace the VM — that's a configuration lesson for another day, and also too embarrassing to fully recount. The point is: one moment I had a running K3s node, the next I had a fresh VM.&lt;/p&gt;

&lt;p&gt;And everything on it was gone.&lt;/p&gt;

&lt;p&gt;This was before the containers setup. No clean &lt;code&gt;/data&lt;/code&gt; mount, no steward, no "two and a half minutes to a working kubernetes." Just a blank VM and the knowledge that everything I needed to recover was hosted on the thing that no longer existed.&lt;/p&gt;

&lt;p&gt;And separately: Headscale was down because ArgoCD hadn't deployed it yet — and Headscale was specifically what &lt;code&gt;kubectl&lt;/code&gt; used to reach K3s remotely. Cloud-init got as far as installing K3s and standing up a blank ArgoCD — the Ansible playbook was in an OCI bucket, so at least that survived — but with no way to kubectl in, that was as far as it got.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chicken, meet egg.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fix was ugly: I had to edit the Ansible playbook to point ArgoCD at GitHub instead of Forgejo, push that, let it redeploy — and then begin a long series of commits that were mostly commenting and uncommenting service configs. Each change exposed a new chicken-and-egg problem. Forgejo needs Keycloak. Keycloak needs Forgejo for the theme. ArgoCD needs Forgejo. Comment out the OIDC config, deploy, uncomment, redeploy, something else falls over. Repeat.&lt;/p&gt;

&lt;p&gt;When it was finally stable, I left myself a note: &lt;em&gt;disassemble this before it explodes again.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Self-referential systems need an escape hatch.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pattern I had — GitOps repo on the same machine GitOps is managing — is a known antipattern. The right answer is that your bootstrap source should be independent of what you're bootstrapping.&lt;/p&gt;

&lt;p&gt;But here's the thing: I &lt;em&gt;knew&lt;/em&gt; what I was building had philosophical coherence. The whole point was to depend less on SaaS. Not just for the homelab, but as a stance — I wanted off GitHub, off Tailscale, off the assumption that someone else's free tier was load-bearing infrastructure. Running Forgejo on the same cluster wasn't carelessness; it was the logical extension of that goal. Own everything, all the way down.&lt;/p&gt;

&lt;p&gt;What I hadn't stress-tested was the dependency graph when the cluster itself was the patient.&lt;/p&gt;

&lt;p&gt;The mistake wasn't Forgejo — it was Forgejo &lt;em&gt;on the same cluster it was managing&lt;/em&gt;. Those are different decisions. And honestly, the right fix for wanting off GitHub isn't to run Forgejo on your only VM; it's to use something like Codeberg. You can move off SaaS without taking on the operational risk of a self-hosted bootstrap source. The self-hosting purist move and the pragmatic resilience move don't have to be the same decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Headscale was behind the locked door.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reason the chicken-and-egg problem was so painful wasn't just the circular deployments — it's that Headscale was my only path to &lt;code&gt;kubectl&lt;/code&gt;. When the cluster died, my remote access died with it. The tool I needed to fix my infrastructure was running on the infrastructure I needed to fix.&lt;/p&gt;

&lt;p&gt;This is the same lesson as Forgejo, applied to access instead of deployment: if something is part of your recovery path, it can't live inside the thing you're recovering.&lt;/p&gt;

&lt;p&gt;I genuinely enjoyed running Headscale — the project is impressive and there's something satisfying about owning your own VPN coordination. But after the incident, I couldn't justify self-hosting the thing that stands between me and my cluster. Tailscale's free tier covers 100 devices and 3 users. I am one person with a homelab. I migrated back and haven't thought about it since.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some self-hosting is load-bearing in ways that compound.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After all this, I dropped Forgejo and Keycloak too — at least for now. The overhead of maintaining the auth layer, the themes, the OIDC integration, stopped being worth it relative to what I was actually getting. The homelab is lighter for it.&lt;/p&gt;

&lt;p&gt;What I kept: the lesson. If something is load-bearing for your bootstrap sequence, it needs to be independent of what it's bootstrapping. That's the rule. Everything else is negotiable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Version
&lt;/h2&gt;

&lt;p&gt;I self-hosted Headscale. I learned what Tailscale actually provides, operationally, by having to do all of it myself. I made an informed decision to go back.&lt;/p&gt;

&lt;p&gt;I self-hosted Forgejo inside a circular dependency and paid for it when the cluster went down. I moved the bootstrap repo off the critical path — and eventually off the cluster entirely.&lt;/p&gt;

&lt;p&gt;The homelab taught me which abstractions were worth paying for (VPN coordination — yes, for me, right now). It also taught me that the interesting self-hosting problems aren't "can I run this" — you always can — but "where does this fit in the dependency graph, and what happens when it's down."&lt;/p&gt;

&lt;p&gt;Homelabbing is fun. Highly recommend. Just draw the dependency graph first.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@eternalseconds" rel="noopener noreferrer"&gt;Eastman Childs&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>homelab</category>
      <category>gitops</category>
      <category>postmortem</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>I Run Nomad on my Gaming PC (It's Great)</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Thu, 26 Mar 2026 15:00:00 +0000</pubDate>
      <link>https://dev.to/notjustanna/i-run-nomad-on-my-gaming-pc-its-great-e5f</link>
      <guid>https://dev.to/notjustanna/i-run-nomad-on-my-gaming-pc-its-great-e5f</guid>
      <description>&lt;p&gt;HashiCorp Nomad is a workload orchestrator. Think Kubernetes, but without the container-first dogma — it can schedule containers, sure, but also raw executables, Java applications, scripts, whatever you have. It's designed for fleets: multiple datacenters, hundreds of nodes, cross-machine scheduling. The kind of infrastructure where "where does this service run" is a question Nomad answers for you.&lt;/p&gt;

&lt;p&gt;That's the intended use case. Here's another one.&lt;/p&gt;




&lt;p&gt;My first professional encounter with Nomad was at a company where our team had no RDP access to the Windows Server we were deploying to. No remote desktop, no SSH, nothing — just Nomad and whatever we chose to run through it.&lt;/p&gt;

&lt;p&gt;So Nomad became everything. Restart a service? Nomad. Check backend logs? Nomad. Copy files onto the machine? We deployed Filebrowser through Nomad so we could do that too. The machine was, for all practical purposes, a black box we could only interact with through the web UI and job specs.&lt;/p&gt;

&lt;p&gt;This wasn't one rogue team. There were a dozen teams doing the same thing, each with their own trio of Windows Servers — DEV, STG, PRD. Each machine a proper behemoth: 32 or 64GB of RAM, terabytes of storage, quietly running 20-something JVM-based services that would eventually consume most of that memory. Nomad scheduling all of it. Single node, pointed at itself.&lt;/p&gt;

&lt;p&gt;Was this production? Yes. Was it highly available? No — it was roughly on par with a Linux box running systemd services, just with a shinier interface and HCL files instead of unit files. But it ran. Teams shipped things. Nomad, against all architectural intent, worked.&lt;/p&gt;

&lt;p&gt;None of us questioned this. I had come from SSH, Docker Compose, Portainer — tools where "the interface to the machine" and "the service manager" are different things. Nomad fit neatly into my mental model as &lt;em&gt;Portainer but for bare-metal Windows&lt;/em&gt;. Web UI, log access, restart buttons. Close enough.&lt;/p&gt;

&lt;p&gt;I only found out this was off-label when I finally read the HashiCorp docs. They were very clearly written for someone orchestrating fleets. Multi-datacenter topology. Node pools. Cross-machine scheduling. I was running one agent, pointed at itself, on a machine I couldn't even open a terminal on. The docs and I were not describing the same situation.&lt;/p&gt;

&lt;p&gt;And yet: it worked.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it followed me home
&lt;/h2&gt;

&lt;p&gt;When I decided to self-host Jellyfin and DDNS-Go on my personal machine — which, at the time, ran Windows — I reached for the tool I knew.&lt;/p&gt;

&lt;p&gt;Windows, as a self-hosting platform, has two notable properties: it has no good log story, and it has no good service management UI. Task Scheduler exists. Services exist. Neither of them will make you feel good about yourself.&lt;/p&gt;

&lt;p&gt;Nomad had a web UI I could reach from anywhere on my Tailscale network, live log tailing, and a restart button. I wrote a job spec, set up &lt;code&gt;raw_exec&lt;/code&gt;, and Jellyfin was running. Logs accessible. Restartable from work, from my phone, from wherever — no terminal required.&lt;/p&gt;
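
&lt;p&gt;For the curious, a minimal &lt;code&gt;raw_exec&lt;/code&gt; job looks roughly like this (a sketch: the path and names are illustrative, not my actual config, and &lt;code&gt;raw_exec&lt;/code&gt; has to be enabled in the client config first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# illustrative sketch of a raw_exec service job
job "jellyfin" {
  datacenters = ["dc1"]
  type        = "service"

  group "media" {
    task "jellyfin" {
      driver = "raw_exec"

      config {
        # runs the binary directly on the host, no container involved
        command = "C:/Jellyfin/jellyfin.exe"
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
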




&lt;h2&gt;
  
  
  And then I migrated to Linux
&lt;/h2&gt;

&lt;p&gt;Nomad came with me.&lt;/p&gt;

&lt;p&gt;This is, I think, the real testament to the setup. When I finally moved to Linux — where proper service management exists, where &lt;code&gt;journalctl&lt;/code&gt; is right there, where I had every excuse to do it correctly — I kept Nomad. Because clicking through a web UI beats memorizing &lt;code&gt;journalctl&lt;/code&gt; flags. Because I already had the job specs written and they just worked.&lt;/p&gt;

&lt;p&gt;My gaming PC now runs a Nomad agent pointed at itself. A cluster of one. Same off-label usage I accidentally learned at work, just with less mystery Windows Server and significantly more RGB.&lt;/p&gt;
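
&lt;p&gt;The "cluster of one" is a few stanzas of agent config: server and client enabled in the same process, plus &lt;code&gt;raw_exec&lt;/code&gt; opted into, since it's disabled by default (a sketch; the data dir is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# single-node agent: this process is both the server and the only client
data_dir = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}

# raw_exec is disabled by default; opt in explicitly
plugin "raw_exec" {
  config {
    enabled = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
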

&lt;p&gt;When I moved to Linux, I didn't just migrate Jellyfin and DDNS-Go. I added Sunshine to the roster too. And Code Tunnel. The thing I originally wanted to restart from work — the whole reason I went down this path — ended up as just another Nomad job. One more entry in the web UI. Restartable from anywhere, as long as I have a VPN connection to the PC.&lt;/p&gt;




&lt;h2&gt;
  
  
  Is this wrong?
&lt;/h2&gt;

&lt;p&gt;Technically, yes. Nomad is built for fleets. Running it on one node is like using Kubernetes to manage your dotfiles.&lt;/p&gt;

&lt;p&gt;But the off-label usage holds up — and I say this as someone who stumbled into it by accident and only realized it was off-label afterward. It survived a Windows Server I had no other way to access. It survived being someone's actual production infrastructure across an entire organization. It survived my personal Windows machine. It survived a migration to Linux where I had every incentive to switch. What you get at the end of all that is a web UI where you can see your services, read their logs, and restart them from anywhere — and a declarative spec file that travels with you across operating systems.&lt;/p&gt;

&lt;p&gt;At some point you stop calling it wrong and start calling it yours.&lt;/p&gt;

&lt;p&gt;Sometimes the right tool is the one you already know how to use. Even if you learned it wrong.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@ba1kouras" rel="noopener noreferrer"&gt;Balkouras Nicos&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>nomad</category>
      <category>devops</category>
    </item>
    <item>
      <title>Containers, The Wrong Way: Lessons Learnt</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Wed, 25 Mar 2026 23:41:50 +0000</pubDate>
      <link>https://dev.to/notjustanna/containers-the-wrong-way-lessons-learnt-4obn</link>
      <guid>https://dev.to/notjustanna/containers-the-wrong-way-lessons-learnt-4obn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a follow-up of &lt;a href="https://dev.to/notjustanna/containers-the-wrong-way-for-always-free-fun-and-profit-3ea1"&gt;"Containers, The Wrong Way, For Always-Free Fun and Profit"&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my last post, I told you all about a wild idea: stop caring about the host OS of your EC2/VM. Take the OS hostage. Make it the babysitter of a privileged container, and from that point on it's about as relevant as a bastion VM. Your environment lives in a Docker/Podman image: versioned, reproducible, and testable on your laptop/QEMU/VMware.&lt;/p&gt;

&lt;p&gt;A week later, the diff read &lt;code&gt;119 files changed, +612 -4210&lt;/code&gt; (this is what an Ansible retirement looks like), and I have one thing to say:&lt;/p&gt;

&lt;p&gt;The core idea was right. I just hadn't "thought with containers" all the way through.&lt;/p&gt;


&lt;h2&gt;
  
  
  Prelude: The host OS matters. A tiny bit.
&lt;/h2&gt;

&lt;p&gt;Here's the thing about the "host OS doesn't matter" premise: it only holds if the host OS &lt;em&gt;agrees to not matter&lt;/em&gt;. Your privileged container needs to start and be able to take the host OS hostage. That's the whole deal. The host gets you to that point, and then it gets out of the way.&lt;/p&gt;

&lt;p&gt;Oracle Linux ships with SELinux enforcing by default. And SELinux, doing exactly what SELinux is designed to do, looked at my privileged container with host networking and nested containers and said: wait just a goshdarned second. And, to be honest? &lt;em&gt;SELinux is right&lt;/em&gt;. Windows Defender would have a field day trying to defend itself against this OS-level hijacking attack with admin rights.&lt;/p&gt;

&lt;p&gt;The fix is... straightforward enough — set SELinux to permissive, reboot. But that's the problem for me. It would mean adding a whole "is SELinux enabled? try to disable it and reboot" script to my cloud-init, which is already way too big IMHO.&lt;/p&gt;
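
&lt;p&gt;For the record, here's a sketch of the check I refused to write. The commands (&lt;code&gt;getenforce&lt;/code&gt;, &lt;code&gt;setenforce&lt;/code&gt;) are real; the function framing is mine, and this is the detection half only. The full version would also rewrite &lt;code&gt;/etc/selinux/config&lt;/code&gt; and reboot:&lt;/p&gt;

```shell
# Sketch of the cloud-init step I didn't want to own. Detection only;
# the real thing would also edit /etc/selinux/config and reboot.
selinux_blocks_us() {
  command -v getenforce >/dev/null 2>&1 &&
    [ "$(getenforce 2>/dev/null)" = "Enforcing" ]
}

if selinux_blocks_us; then
  echo "SELinux is enforcing; this host will not get out of the way" >&2
fi
```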

&lt;p&gt;This isn't Oracle Linux being bad. It isn't SELinux being wrong. It's a contract violation: I needed a host that would get out of the way, and this one wouldn't.&lt;/p&gt;

&lt;p&gt;So... back to Ubuntu 24.04 Minimal. No drama. &lt;code&gt;unattended-upgrades&lt;/code&gt; at 6am UTC. The host goes back to being furniture.&lt;/p&gt;
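
&lt;p&gt;Moving the upgrade window to 6am UTC is a standard systemd drop-in on the stock timer. The unit name and drop-in path are the real Ubuntu ones; the override itself is a sketch:&lt;/p&gt;

```ini
# /etc/systemd/system/apt-daily-upgrade.timer.d/override.conf
[Timer]
OnCalendar=
OnCalendar=*-*-* 06:00:00 UTC
RandomizedDelaySec=0
```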


&lt;h2&gt;
  
  
  I Wasn't Actually Thinking With Containers
&lt;/h2&gt;

&lt;p&gt;My original idea was, conceptually: take whatever free Linux distro the cloud handed you, bolt a privileged Alpine container onto it, run everything inside. One container. One image. K3s, Tailscale, manifests, startup logic — all of it, together.&lt;/p&gt;

&lt;p&gt;I thought I was thinking with containers. I was actually thinking &lt;em&gt;"how do I run a VM without virtualizing another Linux kernel."&lt;/em&gt; The question I should have asked earlier: &lt;em&gt;why stop at one container?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rancher/k3s&lt;/code&gt; is a scratch image. No shell, no package manager, nothing. It ships that way intentionally — K3s bundles exactly what it needs and nothing else. The moment I tried to extend it, I was working against a clear signal. The image was telling me something: &lt;em&gt;don't touch me, use me.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Same with Tailscale. &lt;code&gt;tailscale/tailscale&lt;/code&gt; exists, maintained by the people who wrote Tailscale, optimized for exactly this use case. Why was I installing &lt;code&gt;tailscaled&lt;/code&gt; inside my Alpine image?&lt;/p&gt;

&lt;p&gt;Instead of fighting the host OS, I was now fighting my own container image. All the pieces existed upstream, and yet there I was, taking them apart to rebuild them as one image.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Steward Container
&lt;/h2&gt;

&lt;p&gt;I once used Portainer to manage an entire fleet of VMs and bare-metal servers. Got a lot of flak from the internet for using it, too. Portainer manages your containers, but it is itself a container: mount the Docker socket into it and it manages everything. I should have done that from the beginning.&lt;/p&gt;

&lt;p&gt;My once-massive container image quickly shrank into a &lt;strong&gt;steward container&lt;/strong&gt; — a thin Alpine image whose only job is orchestrating other containers. It doesn't run K3s. It doesn't run Tailscale. It uses &lt;code&gt;podman-compose&lt;/code&gt; to bring them up and manages their lifecycle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; alpine:3.21&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; bash podman podman-compose gettext kubectl curl

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . /image&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["sh", "-c"]&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["/image/steward.sh"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;K3s runs from &lt;code&gt;rancher/k3s:v1.35.2-k3s1&lt;/code&gt;. Tailscale runs from &lt;code&gt;tailscale/tailscale:v1.94.2&lt;/code&gt;. Both are purpose-built, upstream, and updated by bumping a version tag. I don't own their internals. I just sequence them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tailscale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker.io/tailscale/tailscale:v1.94.2&lt;/span&gt;
    &lt;span class="na"&gt;privileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;network_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;
    &lt;span class="c1"&gt;# ...&lt;/span&gt;

  &lt;span class="na"&gt;k3s&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker.io/rancher/k3s:v1.35.2-k3s1&lt;/span&gt;
    &lt;span class="na"&gt;privileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;network_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;
    &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And because the steward is just glue — Alpine, bash, Podman Compose — it's entirely mine to change. If I want to rewire how bootstrap works, add a new sequencing step, or swap out how secrets are injected, I edit the steward. K3s and Tailscale don't care — they just get started in a different order, or with different arguments. The concern separation works both ways: I don't touch their images, they don't constrain mine. And the host OS surely won't know any better.&lt;/p&gt;
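
&lt;p&gt;Roughly, that sequencing logic looks like this. Function names and arguments are illustrative, not lifted from my actual &lt;code&gt;steward.sh&lt;/code&gt;; only &lt;code&gt;podman-compose&lt;/code&gt; and &lt;code&gt;kubectl&lt;/code&gt; are real:&lt;/p&gt;

```shell
# Hypothetical steward.sh helpers (names are illustrative).
start_stack() {
  # VPN first, so the K3s API is reachable over Tailscale as soon as it's up
  podman-compose up -d tailscale
  podman-compose up -d k3s
}

# Poll the apiserver readiness endpoint; args: max attempts, delay in seconds
wait_for_k3s() {
  attempts=0
  until kubectl get --raw /readyz >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    [ "$attempts" -ge "${1:-60}" ] && return 1
    sleep "${2:-2}"
  done
}
```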




&lt;h2&gt;
  
  
  Trade-offs along the way
&lt;/h2&gt;

&lt;p&gt;I found a total of one trade-off: I lost Longhorn.&lt;/p&gt;

&lt;p&gt;Longhorn is the right persistent storage story for K3s. It's also sitting behind an iSCSI requirement, which means kernel modules — which means I'd need to build a custom &lt;code&gt;longhorned-k3s&lt;/code&gt; image that has the right binaries, reaches into the host kernel and hopes it guessed right. That image would be mine to maintain forever, against a scratch base I can't easily inspect or extend.&lt;/p&gt;

&lt;p&gt;This is, ironically, the exact trap the whole setup was designed to avoid. So I didn't do it. &lt;code&gt;/data&lt;/code&gt; on the block volume is fine for a homelab. Local-path PVCs do the job.&lt;/p&gt;
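
&lt;p&gt;For reference, local-path is zero-config on K3s: it ships as the bundled default StorageClass. A minimal claim looks something like this (the claim name is mine):&lt;/p&gt;

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: homelab-data        # illustrative name
spec:
  storageClassName: local-path   # K3s's bundled provisioner
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```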

&lt;p&gt;The headache of homelab infrastructure is supposed to be a &lt;em&gt;fun&lt;/em&gt; headache. There's a line between "productive friction you learn from" and "work you do instead of the actual thing." Longhorn crossed that line. I cut it. If this were production, I'd be on EKS and none of this would exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Ephemerality Project is a Success
&lt;/h2&gt;

&lt;p&gt;It takes 2 minutes and 30 seconds from the second Oracle Cloud finishes creating a VM to ArgoCD being fully deployed and deploying my root app.&lt;/p&gt;

&lt;p&gt;That's cloud-init, Podman starting the steward, Tailscale coming up, K3s initializing, the API ready, bootstrap done, ArgoCD CRDs registered, root app deployed.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;preserve_boot_volume = false&lt;/code&gt; in Terraform. I genuinely don't care if Oracle recycles the boot volume. Everything stateful is on the block volume. Everything ephemeral is in the image. The VM is cattle. That was the original promise.&lt;/p&gt;
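
&lt;p&gt;That flag is a single attribute on the instance resource (sketched against the OCI Terraform provider, with the rest of the resource elided):&lt;/p&gt;

```hcl
resource "oci_core_instance" "my-arm-instance" {
  # the boot volume dies with the instance; /data lives on the block volume
  preserve_boot_volume = false
  # ...
}
```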

&lt;p&gt;It delivered. I just had to actually follow the logic through.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Short Version
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The host OS matters exactly once&lt;/strong&gt;: getting your first container running. Pick one that gets out of the way immediately. Ubuntu Minimal does this. SELinux-enforcing distros don't — not because they're wrong, but because they conflict with the premise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't extend scratch images. Use them.&lt;/strong&gt; If an upstream image is minimal or scratch, that's a signal. Compose around it, don't modify it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One container per concern.&lt;/strong&gt; The steward pattern — a thin orchestrator managing purpose-built upstream images — is what container thinking actually looks like. "One container for the whole machine" is just a VM with extra steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know when to cut.&lt;/strong&gt; Not every yak needs shaving. Longhorn would be fun. Longhorn would also be a project. This is a homelab.&lt;/li&gt;
&lt;/ul&gt;




&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@pcalescu" rel="noopener noreferrer"&gt;Paul Calescu&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kubernetes</category>
      <category>containers</category>
      <category>linux</category>
    </item>
    <item>
      <title>Containers, The Wrong Way, For Always-Free Fun and Profit</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Mon, 23 Mar 2026 11:00:00 +0000</pubDate>
      <link>https://dev.to/notjustanna/containers-the-wrong-way-for-always-free-fun-and-profit-3ea1</link>
      <guid>https://dev.to/notjustanna/containers-the-wrong-way-for-always-free-fun-and-profit-3ea1</guid>
      <description>&lt;h2&gt;
  
  
  Prelude: Oracle Cloud's Always-Free Tier
&lt;/h2&gt;

&lt;p&gt;Oracle Cloud has an always-free tier. Not a trial. Not "free for 12 months." Always free.&lt;/p&gt;

&lt;p&gt;Four ARM-based cores, 24GB of RAM, 200GB of storage. For nothing. Forever.&lt;/p&gt;

&lt;p&gt;Their Ampere Altra processors are genuinely good silicon. People benchmark these against x86 and come away impressed. And the ARM64 ecosystem is in good shape — most container images you'll actually use (your databases, your ingress controllers, your monitoring stack) have ARM64 builds. The days of "does this even run on ARM" are mostly behind us.&lt;/p&gt;

&lt;p&gt;The fact that Oracle is giving this hardware away to get people onto their platform is, frankly, their problem.&lt;/p&gt;

&lt;p&gt;If you have any kind of homelab itch — self-hosted apps, a personal Kubernetes playground, a place to run things you don't want living on your laptop — you should have one of these VMs. The barrier is a credit card for verification (they won't charge it) and about twenty minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The VM
&lt;/h2&gt;

&lt;p&gt;Once you have an account, getting a VM up is a few clicks in the OCI console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute → Instances → Create Instance&lt;/li&gt;
&lt;li&gt;Change the shape to &lt;strong&gt;VM.Standard.A1.Flex&lt;/strong&gt; — that's the ARM one&lt;/li&gt;
&lt;li&gt;Set OCPUs to 4 and memory to 24GB (max free allocation)&lt;/li&gt;
&lt;li&gt;Pick your image.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Oracle Linux&lt;/strong&gt; if you're comfortable with &lt;code&gt;dnf&lt;/code&gt;. &lt;strong&gt;Ubuntu Server&lt;/strong&gt; if you're an &lt;code&gt;apt&lt;/code&gt; person.&lt;/li&gt;
&lt;li&gt;Either works — both have Minimal variants that strip a lot of packages out. And if you're about to do what I'm about to describe, you'll want the Minimal version.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Add your SSH key&lt;/li&gt;
&lt;li&gt;Create&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You'll also want a separate block volume: Storage → Block Volumes → Create, 150GB, attach it to your instance.&lt;/p&gt;

&lt;p&gt;If you're Terraform-inclined, it looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"oci_core_instance"&lt;/span&gt; &lt;span class="s2"&gt;"my-arm-instance"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;shape&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"VM.Standard.A1.Flex"&lt;/span&gt;
  &lt;span class="nx"&gt;shape_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;ocpus&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nx"&gt;memory_in_gbs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;source_details&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;source_type&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"image"&lt;/span&gt;
    &lt;span class="nx"&gt;source_id&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oracle_linux_minimal_image_id&lt;/span&gt;
    &lt;span class="nx"&gt;boot_volume_size_in_gbs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;  &lt;span class="c1"&gt;# OCI minimum&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"oci_core_volume"&lt;/span&gt; &lt;span class="s2"&gt;"data"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# this is where anything you'd like to still exist tomorrow lives&lt;/span&gt;
  &lt;span class="nx"&gt;size_in_gbs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
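
&lt;p&gt;The attachment itself is one more resource. This is sketched from memory, so double-check the argument names against the OCI provider docs:&lt;/p&gt;

```hcl
resource "oci_core_volume_attachment" "data" {
  attachment_type = "paravirtualized"
  instance_id     = oci_core_instance.my-arm-instance.id
  volume_id       = oci_core_volume.data.id
}
```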






&lt;h2&gt;
  
  
  Free Kubernetes. Kinda.
&lt;/h2&gt;

&lt;p&gt;K3s goes up in about thirty seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sfL&lt;/span&gt; https://get.k3s.io | sh
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a single-node Kubernetes cluster. One control plane, one worker, same machine. It's not highly available — if the VM goes down, your cluster goes down.&lt;/p&gt;

&lt;p&gt;If you want something closer to HA, OCI's free tier technically allows you to split your 4 OCPUs and 24GB across multiple VMs — say, three VMs at 8GB each (2 OCPUs for the control plane, 1 for each worker) and run a proper multi-node setup with an embedded etcd quorum. That's a valid path. This post is about the single-node case because I don't care about any of that and neither should you for a homelab.&lt;/p&gt;

&lt;p&gt;The point is: Kubernetes, on ARM, for free. This feels unreasonable but here we are.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part Where You Discover You Have a Long-Lived Server Problem
&lt;/h2&gt;

&lt;p&gt;Here's the thing about a cloud VM: you don't touch it daily like your laptop or your desktop. You SSH into it when something breaks, or when you need to add a new service. If you're running Kubernetes, you SSH in even less — &lt;code&gt;kubectl&lt;/code&gt; handles most of it and you rarely need to touch the node directly.&lt;/p&gt;

&lt;p&gt;But the machine keeps running. Packages drift out of date. You hotfixed something at 2am while sick and half-forgot. You're not sure which K3s version you're actually on or whether it's the one you meant to be running. The machine accumulates entropy in the background while you're not looking.&lt;/p&gt;

&lt;p&gt;This is the long-lived server problem and it's why configuration management tools exist. The standard answer: write an Ansible playbook, push it to git, have the machine pull and run it on a schedule. Define the desired state, let Ansible converge to it.&lt;/p&gt;

&lt;p&gt;So you write the playbook. It works. You commit it to git, set up a cron job on the VM to pull and run it every five minutes, and declare victory.&lt;/p&gt;
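
&lt;p&gt;Ansible even ships a command purpose-built for this loop, &lt;code&gt;ansible-pull&lt;/code&gt;, which clones or updates the repo and runs the playbook in one shot. The repo URL and playbook name below are placeholders:&lt;/p&gt;

```cron
*/5 * * * * ansible-pull -U https://github.com/you/server-config.git site.yml >> /var/log/ansible-pull.log 2>&1
```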

&lt;p&gt;And then you start noticing the friction.&lt;/p&gt;

&lt;p&gt;Ansible is... &lt;em&gt;fine&lt;/em&gt; for keeping a server configured. But what you're increasingly fighting is a different problem — you want something closer to what &lt;code&gt;apt upgrade&lt;/code&gt; does, but for your entire environment. Not "apply these tasks," but "this is the version of the world I want, please be it." Ansible can do it but it's not really what it's designed for, and you can feel the difference. The playbook describes a path to the desired state, not the desired state itself. Those are subtly different things and the difference starts to matter when you're maintaining the playbook over months.&lt;/p&gt;

&lt;p&gt;The other problem: there's no local testing story. You make a change, push to git, wait for the cron job, SSH in to see if anything broke. Your laptop is not a Linux ARM server with K3s running on it. You can't just run the playbook locally and catch problems before they hit the VM.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thought That Won't Go Away
&lt;/h2&gt;

&lt;p&gt;I work with Kubernetes every other day. Kubernetes runs containers. Containers are versioned, immutable artifacts — you build one, push it to a registry, pull it somewhere else, it behaves exactly the same. You can run it locally to test it. You update it by pushing a new version. Rolling back means pulling an old tag.&lt;/p&gt;

&lt;p&gt;Everything about this model is better than a long-lived server managed by a configuration management tool.&lt;/p&gt;

&lt;p&gt;And somewhere around the third time I was debugging why the playbook had done something unexpected, I had the thought: &lt;em&gt;why can't I just containerize this entire problem?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;K3s is not a web app. K3s needs to interact with the host kernel — manipulate iptables, create network interfaces, manage cgroups, set up container networking. You can't just &lt;code&gt;docker run k3s&lt;/code&gt; and expect it to work.&lt;/p&gt;

&lt;p&gt;Or... can you?&lt;/p&gt;




&lt;h2&gt;
  
  
  The &lt;code&gt;--privileged&lt;/code&gt; flag
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--privileged&lt;/code&gt; is Docker and Podman's "I know what I'm doing" flag. It gives the container essentially full access to the host kernel — every capability, every device, no security filtering. It's the nuclear option.&lt;/p&gt;

&lt;p&gt;And it turns out, it's also the officially documented way to run K3s in a container. From K3s's own docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; k3s-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 6443:6443 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; rancher/k3s:v1.29.3-k3s1 &lt;span class="se"&gt;\&lt;/span&gt;
  server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's in the K3s documentation. Not a workaround. Not a hack. &lt;code&gt;--privileged&lt;/code&gt; is how you do this.&lt;/p&gt;

&lt;p&gt;Adding &lt;code&gt;--network host&lt;/code&gt; gives the container the host's network stack directly, which K3s needs to set up its own networking correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/you/k3s-env:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Does this feel wrong? Yes. A privileged container with host networking is basically just a process. Security people will tell you this. They're not wrong.&lt;/p&gt;

&lt;p&gt;But it's a process defined by an OCI image. Which means it has a version tag. A Dockerfile in a git repo. A build pipeline. And — crucially — you can &lt;code&gt;docker run&lt;/code&gt; it on your laptop and test it before it touches anything real.&lt;/p&gt;

&lt;p&gt;The trade is: you give up container isolation (which K3s was going to make you give up anyway to do its job) and you get everything else that comes with the container model. That's a good trade.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Image
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; rancher/k3s:v1.29.3-k3s1&lt;/span&gt;

&lt;span class="c"&gt;# Whatever else you want running alongside K3s&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; tailscale

&lt;span class="c"&gt;# K3s auto-deploys anything placed here on startup&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; argocd-install.yaml /var/lib/rancher/k3s/server/manifests/argocd.yaml&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; root-app.yaml /var/lib/rancher/k3s/server/manifests/root-app.yaml&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; entrypoint.sh /entrypoint.sh&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/entrypoint.sh"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;rancher/k3s&lt;/code&gt; as the base means K3s is pre-installed at a specific pinned version. K3s has an auto-deploy feature: anything in &lt;code&gt;/var/lib/rancher/k3s/server/manifests/&lt;/code&gt; gets automatically applied when K3s starts. Drop ArgoCD's install manifest in there, point ArgoCD at a git repo, and everything else deploys itself from git. The image doesn't need to know about any of it.&lt;/p&gt;
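
&lt;p&gt;The root app is the standard app-of-apps pattern: an ArgoCD &lt;code&gt;Application&lt;/code&gt; pointing at a directory of further Applications in git. Roughly, with the repo URL and path as placeholders:&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/you/homelab.git   # placeholder
    targetRevision: main
    path: apps          # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```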

&lt;p&gt;The entrypoint starts things in the right order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# Secrets are in /data/env, written at startup&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; /data/env

&lt;span class="c"&gt;# Start networking, wait for it to be ready&lt;/span&gt;
tailscaled &lt;span class="nt"&gt;--state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data/tailscale.state &amp;amp;
tailscale up &lt;span class="nt"&gt;--authkey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TAILSCALE_AUTHKEY&lt;/span&gt;
&lt;span class="k"&gt;until &lt;/span&gt;tailscale status&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;sleep &lt;/span&gt;2&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# K3s in the foreground — keeps the container alive&lt;/span&gt;
&lt;span class="nb"&gt;exec &lt;/span&gt;k3s server &lt;span class="nt"&gt;--data-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/data/rancher/k3s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Oracle Linux's New Job Description
&lt;/h2&gt;

&lt;p&gt;A small shell script runs at VM startup (wired up via systemd) and does exactly this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
dnf &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; podman
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /data
mount /dev/sdb /data

&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /data/env &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
TAILSCALE_AUTHKEY=your-key-here
OTHER_SECRET=whatever
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;600 /data/env

podman run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env-file&lt;/span&gt; /data/env &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; k3s-env &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/you/k3s-env:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six commands. Run once on first boot. That is the entire configuration management story for Oracle Linux.&lt;/p&gt;
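
&lt;p&gt;The systemd wiring is a one-shot unit. This is a sketch with invented file names, not my exact unit:&lt;/p&gt;

```ini
# /etc/systemd/system/k3s-env-bootstrap.service (illustrative)
[Unit]
Description=First-boot bootstrap for the k3s-env container
Wants=network-online.target
After=network-online.target
# skip on later boots once the marker file exists
ConditionPathExists=!/data/.bootstrapped

[Service]
Type=oneshot
ExecStart=/usr/local/bin/bootstrap.sh
ExecStartPost=/usr/bin/touch /data/.bootstrapped

[Install]
WantedBy=multi-user.target
```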

&lt;p&gt;After that, Oracle Linux's responsibilities are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeping Podman alive&lt;/li&gt;
&lt;li&gt;Applying its own package updates at 3am via &lt;code&gt;dnf-automatic&lt;/code&gt; (or &lt;code&gt;unattended-upgrades&lt;/code&gt; on Ubuntu) — install it, enable the timer, forget about it&lt;/li&gt;
&lt;/ul&gt;
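
&lt;p&gt;The second bullet is a couple of commands on Oracle Linux. The config key is quoted from memory, so verify it against &lt;code&gt;/etc/dnf/automatic.conf&lt;/code&gt; on your box:&lt;/p&gt;

```shell
dnf install -y dnf-automatic
# flip the default from "download only" to "download and apply"
sed -i 's/^apply_updates.*/apply_updates = yes/' /etc/dnf/automatic.conf
systemctl enable --now dnf-automatic.timer
```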

&lt;p&gt;That's it. Oracle Linux is now a very fancy process supervisor. It doesn't know what's running inside the container and doesn't need to.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Makes It Worth It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You can test it locally.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--privileged&lt;/span&gt; &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /tmp/test-data:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env.test &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/you/k3s-env:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your entire server environment, running on your laptop. K3s comes up, deploys things, you poke at it with kubectl. You find the problem before it touches the VM. This was impossible with Ansible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updates are a push.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Change anything in the Dockerfile — update the K3s version, add a package, update a manifest — build a new image tag, push to your registry. A systemd timer on the VM checks for image updates at 3am:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman pull ghcr.io/you/k3s-env:latest &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; podman restart k3s-env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
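
&lt;p&gt;Wrapped in a timer/service pair, with the unit names invented for illustration:&lt;/p&gt;

```ini
# /etc/systemd/system/k3s-env-update.timer (illustrative)
[Unit]
Description=Nightly image update check

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/k3s-env-update.service
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'podman pull ghcr.io/you/k3s-env:latest && podman restart k3s-env'
```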



&lt;p&gt;No SSH. No Ansible playbook run. No waiting for convergence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The boot volume can die and you don't care.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything stateful — K3s cluster state, persistent volumes, all of it — lives at &lt;code&gt;/data&lt;/code&gt; on the separate block volume, mounted into the container. Oracle can update or replace the boot volume. Run the startup script, the container comes back up, finds its state on &lt;code&gt;/data&lt;/code&gt;, continues exactly where it left off.&lt;/p&gt;




&lt;h2&gt;
  
  
  Is This "Correct"?
&lt;/h2&gt;

&lt;p&gt;No. Using &lt;code&gt;--privileged&lt;/code&gt; is not what containers are designed for. Running K3s inside a container on a VM where you could just run K3s directly is adding a layer that doesn't need to be there from a pure architecture standpoint.&lt;/p&gt;

&lt;p&gt;But "architecturally pure" and "actually useful for your situation" are different questions. This approach gives you a reproducible, testable, versionable environment on a server you interact with twice a year. The feedback loop goes from "push to git, wait for cron, SSH in and hope" to "docker run on your laptop, push when it works."&lt;/p&gt;

&lt;p&gt;For a free ARM Kubernetes node that you barely touch, that's the right trade.&lt;/p&gt;

&lt;p&gt;Also, &lt;code&gt;--privileged&lt;/code&gt; is in the K3s docs. So maybe it's fine.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@pcalescu" rel="noopener noreferrer"&gt;Paul Calescu&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kubernetes</category>
      <category>containers</category>
      <category>oracle</category>
      <category>linux</category>
    </item>
    <item>
      <title>Thinking Differently About Universal Microkernels</title>
      <dc:creator>Anna Silva</dc:creator>
      <pubDate>Fri, 20 Mar 2026 16:33:11 +0000</pubDate>
      <link>https://dev.to/notjustanna/thinking-differently-about-universal-microkernels-1hd0</link>
      <guid>https://dev.to/notjustanna/thinking-differently-about-universal-microkernels-1hd0</guid>
      <description>&lt;p&gt;As a not-Macbook owner, I have to yell this into the void: I really like macOS' kernel. Not macOS — XNU. The kernel underneath.&lt;/p&gt;

&lt;p&gt;XNU's a hybrid microkernel — meaning most of the OS lives in userspace rather than baked into the kernel itself. Device drivers? Architecturally designed to be userspace programs, even if Apple doesn't always do it that way. Crash a device driver, you crash the device driver. Not the whole system. The kernel stays up. The kernel doesn't care.&lt;/p&gt;

&lt;p&gt;Compare that to Linux, where a buggy driver can take down the entire machine because it's all in the same address space, with the same privileges, one bad pointer away from a kernel panic.&lt;/p&gt;

&lt;p&gt;XNU's model is just... better in every way. Cleaner. The kernel does the minimum, as the SOLID gods decreed in their Single Responsibility Commandment. Everything else is up to the rest of the operating system. XNU is an actual kernel rather than a cob of corn — which, in Linux's case, is one fire away from exploding into popcorn.&lt;/p&gt;

&lt;p&gt;And I would love this on my CachyOS with COSMIC desktop. I should be allowed to have bad taste in desktop environment but good taste in kernel architecture. I want my drivers isolated. I want to run whatever userspace I feel like, on x86 or ARM, and not particularly care which one I'm on. I want something like Apple's Universal Binary — one thing that runs everywhere — except &lt;em&gt;actually&lt;/em&gt; universal, not "universal between the two architectures Apple currently sells."&lt;/p&gt;

&lt;p&gt;So Anyway, I Started &lt;del&gt;Blasting&lt;/del&gt; Googling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does Only Apple Get A Decent Kernel?
&lt;/h2&gt;

&lt;p&gt;Actually...&lt;/p&gt;

&lt;p&gt;Microkernels aren't a new idea. They're not even a "Think Different" idea. They've been the "obviously correct" architecture in OS research since the 1980s. The theory is sound: small trusted kernel, everything else in userspace, hardware isolation enforces the boundaries. Fewer things can go wrong. The things that do go wrong are contained.&lt;/p&gt;

&lt;p&gt;And yet the consumer OS landscape is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS/iOS&lt;/strong&gt; — XNU, hybrid kernel with a microkernel heart ✓&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows&lt;/strong&gt; — monolithic-ish NT kernel, practically speaking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux&lt;/strong&gt; — monolithic, famously so&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Android&lt;/strong&gt; — Linux underneath, with a Java runtime on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Everything else is too niche to matter for this conversation. For now. We'll get back to it.)&lt;/p&gt;

&lt;h3&gt;
  
  
  And WHY didn't microkernels win?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The honest answer is performance.&lt;/strong&gt; Which is funny, because the company that makes the only mainstream microkernel OS also makes the fastest consumer laptops on the market right now. Apple proved microkernels can be fast. Apple also makes it impossible to run anything else on their hardware. Make it make sense.&lt;/p&gt;

&lt;p&gt;But historically: early microkernel implementations (Mach, which XNU is partly based on) were slow because crossing the kernel/userspace boundary has overhead, and if your networking stack and filesystem are both in userspace, every I/O operation crosses that boundary multiple times. L4, a later microkernel family, cut the cost of those crossings by roughly an order of magnitude. But by then Linux had momentum, and "fast enough" wasn't enough to displace it.&lt;/p&gt;

&lt;p&gt;So we ended up in a world where the kernel architecture that makes more engineering sense runs on the fastest and most efficient laptop hardware money can buy... which can only be obtained from the Cupertino company.&lt;/p&gt;




&lt;h2&gt;
  
  
  "The Universal Binary"-sized Elephant in the Room
&lt;/h2&gt;

&lt;p&gt;So while I was down this rabbit hole admiring XNU's architecture, I figured I'd look at the Universal Binary thing more closely. They're neat. The idea that you can ship one file that Just Works™ on both x86 and ARM is exactly the kind of "computers should be less annoying" energy I respect.&lt;/p&gt;

&lt;p&gt;I assumed they were doing something clever under the hood. Like shipping LLVM IR and compiling to native on first run. Late-binding optimization. Your binary gets smarter when it lands on new hardware.&lt;/p&gt;

&lt;p&gt;Reader, &lt;em&gt;they are not doing that&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A Universal Binary is a fat binary. It contains the x86-64 version of your app. And the ARM64 version of your app. Bundled together. The OS picks the right one at launch and ignores the other half.&lt;/p&gt;
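You can see the trench coat directly in the file format. Here's a minimal sketch of reading the Mach-O fat header (magic `0xCAFEBABE`, big-endian, then a count of per-architecture slices, per Apple's `mach-o/fat.h`). The bytes in `main` are a hand-built toy header, not a real binary:

```rust
// Toy parser for the Mach-O "fat" (universal) header.
// Layout per <mach-o/fat.h>: all fields are big-endian u32s.
//   fat_header { magic, nfat_arch }
//   fat_arch   { cputype, cpusubtype, offset, size, align }

const FAT_MAGIC: u32 = 0xCAFE_BABE;
const CPU_TYPE_X86_64: u32 = 0x0100_0007;
const CPU_TYPE_ARM64: u32 = 0x0100_000C;

fn be_u32(bytes: &[u8], at: usize) -> u32 {
    u32::from_be_bytes(bytes[at..at + 4].try_into().unwrap())
}

/// Returns (cputype, offset, size) for each embedded slice.
fn parse_fat(bytes: &[u8]) -> Option<Vec<(u32, u32, u32)>> {
    if be_u32(bytes, 0) != FAT_MAGIC {
        return None; // not a universal binary
    }
    let nfat_arch = be_u32(bytes, 4) as usize;
    let mut slices = Vec::new();
    for i in 0..nfat_arch {
        let base = 8 + i * 20; // each fat_arch entry is 5 * 4 bytes
        slices.push((be_u32(bytes, base), be_u32(bytes, base + 8), be_u32(bytes, base + 12)));
    }
    Some(slices)
}

fn main() {
    // Hand-built toy header: two slices, x86-64 and arm64.
    let mut fat = Vec::new();
    for word in [FAT_MAGIC, 2,
                 CPU_TYPE_X86_64, 3, 0x4000, 0x1000, 14,
                 CPU_TYPE_ARM64, 0, 0x8000, 0x0800, 14] {
        fat.extend_from_slice(&word.to_be_bytes());
    }
    let slices = parse_fat(&fat).unwrap();
    // The OS picks one slice at launch; the other half of the file is dead weight.
    for (cputype, offset, size) in &slices {
        println!("cputype={cputype:#x} offset={offset:#x} size={size:#x}");
    }
    assert_eq!(slices.len(), 2);
}
```

There's no machinery for anything smarter in there: just N precompiled slices and enough metadata for the loader to pick one.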

&lt;p&gt;You're shipping two binaries in a trench coat.&lt;/p&gt;

&lt;p&gt;The optimization is frozen at compile time. When Apple releases a new chip, your binary doesn't get smarter. It just... runs the ARM slice, which was compiled for a generic ARM target, not &lt;em&gt;your specific chip&lt;/em&gt;, not &lt;em&gt;your cache hierarchy&lt;/em&gt;, not anything about the actual hardware underneath it.&lt;/p&gt;

&lt;p&gt;It's the right solution to the wrong problem. It solves "how do we run x86 apps on ARM during a transition period." It does not solve "how do we write software once and have it run optimally everywhere, forever."&lt;/p&gt;

&lt;p&gt;I wanted the deeper solution. LLVM IR could, in principle, have solved this — compile to IR, ship the IR, recompile for whatever hardware you're actually running on. Late-binding optimization. Your binary gets smarter when Apple releases a new chip. For free.&lt;br&gt;
Apple literally makes the compiler that could do this. They maintain Clang. They built their own LLVM-based toolchain. They even shipped a version of this idea as Bitcode, then deprecated it, partly because LLVM IR is less target-independent than it looks: pointer widths, ABI details, and intrinsics get baked in at the front end.&lt;br&gt;
Still. Why???&lt;/p&gt;




&lt;h2&gt;
  
  
  Someone Already Tried This. Several Someones, Actually.
&lt;/h2&gt;

&lt;p&gt;Let's set the microkernel idea on the shelf for a moment and look at the universal binary problem separately. They're going to converge, I promise. But before we get to what I think the answer is — for both — it's worth knowing that this problem has a graveyard.&lt;/p&gt;

&lt;p&gt;Microsoft Research built &lt;strong&gt;Singularity&lt;/strong&gt; in 2003 — an entire OS written in managed C#, where the type system replaced hardware memory protection. That's not a metaphor. The language verifier was doing the job the MMU normally does. Load-bearing C#, if you will.&lt;/p&gt;

&lt;p&gt;It evolved into a project called Midori, got far enough to run Microsoft's search infrastructure in production, and was quietly killed in 2015. It was too much to ask the world to abandon its existing software ecosystem for a managed-code utopia. Graveyard, plot one.&lt;/p&gt;

&lt;p&gt;Then, Google built &lt;strong&gt;Fuchsia&lt;/strong&gt; — a capability-based microkernel OS with proper driver isolation, a real component model, everything done right. It shipped on the Nest Hub smart display, quietly replacing the Linux-based Cast OS underneath. Then the layoffs hit the team and the broader ambitions evaporated. Now it exists in a state of quantum superposition between "advanced research" and "we'll get to sunsetting that eventually."&lt;/p&gt;

&lt;p&gt;The pattern is consistent: build the right thing, hit the ecosystem wall, die.&lt;/p&gt;

&lt;p&gt;But there's one entry in this space that &lt;em&gt;didn't&lt;/em&gt; die, and it's interesting because of how it survived.&lt;/p&gt;




&lt;h2&gt;
  
  
  eBPF: the idea that snuck in sideways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;eBPF&lt;/strong&gt; is nominally a "packet filtering" system in the Linux kernel. The name stands for "extended Berkeley Packet Filter" — very boring, sounds like a networking detail, easy to ignore.&lt;/p&gt;

&lt;p&gt;It isn't a networking detail.&lt;/p&gt;

&lt;p&gt;Here's what eBPF actually is: a bytecode format, a verifier, and a JIT compiler living inside the Linux kernel. You write a program, the verifier checks that it's safe (no unbounded loops, no invalid memory access, all paths terminate), and then it runs in kernel space with near-zero overhead. No ring transitions. No syscall overhead. Just verified code running at the most privileged level because the verifier already proved it can't do anything wrong.&lt;/p&gt;
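To make "the verifier already proved it can't do anything wrong" concrete, here's a deliberately tiny sketch of the two checks just described: every memory access stays inside the region the program was given, and every jump goes forward, so all paths terminate. The instruction set is invented for illustration; the real eBPF verifier is vastly more sophisticated (register state tracking, bounded loops, and so on), but the shape of the bet is this:

```rust
// Toy "verifier" in the spirit of eBPF: reject unsafe programs statically,
// before a single instruction runs. Invented instruction set, not real eBPF.

#[derive(Clone, Copy)]
enum Insn {
    Load(usize),    // read memory at absolute offset
    Store(usize),   // write memory at absolute offset
    JumpFwd(usize), // jump forward by n instructions
    Exit,
}

/// Static checks: all memory accesses in-bounds, all jumps strictly forward.
/// Forward-only jumps mean no loops, which means guaranteed termination.
fn verify(prog: &[Insn], mem_size: usize) -> Result<(), String> {
    for (pc, insn) in prog.iter().enumerate() {
        match *insn {
            Insn::Load(off) | Insn::Store(off) if off >= mem_size => {
                return Err(format!("pc {pc}: access at {off} out of bounds"));
            }
            Insn::JumpFwd(0) => return Err(format!("pc {pc}: jump of 0 would loop")),
            Insn::JumpFwd(n) if pc + n >= prog.len() => {
                return Err(format!("pc {pc}: jump target past end"));
            }
            _ => {}
        }
    }
    Ok(())
}

fn main() {
    let ok = [Insn::Load(0), Insn::JumpFwd(1), Insn::Store(7), Insn::Exit];
    let bad = [Insn::Store(4096), Insn::Exit];
    assert!(verify(&ok, 8).is_ok());
    // Rejected at load time; the bad store is never executed at all.
    assert!(verify(&bad, 8).is_err());
}
```

The payoff is the same as in real eBPF: once a program passes, you can run it at full privilege with no runtime guard rails, because the unsafe programs never got in.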

&lt;p&gt;That's the Singularity bet — type safety replacing hardware protection — except applied narrowly enough that nobody objected to merging it into mainline Linux.&lt;/p&gt;

&lt;p&gt;And the scope of what eBPF can do keeps expanding: network processing, system call filtering, security policy enforcement, TCP congestion control, and now — as of sched_ext in Linux 6.12 — &lt;strong&gt;writing CPU schedulers&lt;/strong&gt;. Userspace-authored, verifier-checked code making scheduling decisions in ring 0. Meta's Katran load balancer fronts their production traffic with eBPF. Cloudflare's DDoS mitigation runs on eBPF.&lt;/p&gt;

&lt;p&gt;eBPF is the VM-as-OS idea that actually shipped at scale. It just wore a disguise.&lt;/p&gt;




&lt;h2&gt;
  
  
  WebAssembly changes the equation
&lt;/h2&gt;

&lt;p&gt;Here's where I think things get interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebAssembly (WASM)&lt;/strong&gt; is a bytecode format originally designed for running code in browsers at near-native speed. That's its origin story. That's what it says on the tin.&lt;/p&gt;

&lt;p&gt;I'm not interested in what it says on the tin.&lt;/p&gt;

&lt;p&gt;WASM is defined to be safe by spec. No arbitrary pointer arithmetic. No unverified control flow. Memory accesses are bounds-checked. The verifier is part of the standard. This means: if you have a WASM runtime embedded in your kernel, and you load a driver as a WASM module, the verifier checks the driver before a single instruction executes. The driver cannot, by construction, corrupt memory it wasn't given access to.&lt;/p&gt;
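The load-time/run-time split is the whole trick. Here's a sketch of the run-time half: the bounds-checked linear memory every WASM module gets. This is heavily simplified (real runtimes compile most checks away or use guard pages), but the guarantee it models is the one the spec makes:

```rust
// Sketch of WASM-style sandboxed linear memory: a module can only ever touch
// bytes inside its own region, because every access is checked against the
// region's length. Simplified; real runtimes optimize with guard pages.

struct LinearMemory {
    bytes: Vec<u8>,
}

#[derive(Debug, PartialEq)]
struct Trap; // out-of-bounds access => trap, never silent corruption

impl LinearMemory {
    fn new(size: usize) -> Self {
        Self { bytes: vec![0; size] }
    }

    fn load_u8(&self, addr: usize) -> Result<u8, Trap> {
        self.bytes.get(addr).copied().ok_or(Trap)
    }

    fn store_u8(&mut self, addr: usize, val: u8) -> Result<(), Trap> {
        match self.bytes.get_mut(addr) {
            Some(slot) => { *slot = val; Ok(()) }
            None => Err(Trap),
        }
    }
}

fn main() {
    let mut mem = LinearMemory::new(64 * 1024); // one 64 KiB WASM page
    mem.store_u8(0, 42).unwrap();
    assert_eq!(mem.load_u8(0), Ok(42));
    // A "driver" poking outside its memory traps, instead of scribbling
    // over whatever else lives in the kernel's address space:
    assert_eq!(mem.store_u8(1 << 30, 0xFF), Err(Trap));
}
```

The module never sees a raw pointer into the host. There is no address it can compute that escapes the `bytes` it was given.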

&lt;p&gt;If your brain works in the particular weird way that mine does, something just clicked together that probably shouldn't have. XNU isolates drivers through hardware address space separation. WASM isolates modules through verified bytecode. These are the same guarantee. One costs a ring transition. The other costs a verifier pass at load time.&lt;/p&gt;

&lt;p&gt;You could just. Use WASM. As the driver sandbox. Instead of the MMU.&lt;/p&gt;

&lt;p&gt;And while we're at it — why stop at drivers?&lt;/p&gt;

&lt;p&gt;WASI (the WebAssembly System Interface) exists precisely to run system-level code in WASM. It's POSIX-shaped, but typed. Capability-gated. You declare what your module needs — filesystem access, network access, memory-mapped I/O — and the host grants exactly that. Nothing undeclared is accessible. Not "restricted." Not "monitored." Just. Not there.&lt;/p&gt;
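What "just not there" means mechanically: at link time the host resolves exactly the imports a module declared, so an undeclared capability has no call path at all. A toy sketch of that grant step follows; the `link` function and the host types are mine, and the interface names only echo real WASI proposals:

```rust
use std::collections::{HashMap, HashSet};

// Toy model of WASI-style capability gating. The host resolves exactly the
// imports a module declares, and nothing else. All types invented for
// illustration; the interface names loosely echo real WASI proposals.

type HostFn = fn(&str) -> String;

fn socket_send(arg: &str) -> String { format!("sent: {arg}") }
fn fs_read(path: &str) -> String { format!("read: {path}") }

/// Link step: the module receives handles for exactly its declared imports.
/// Anything undeclared simply has no handle, so no call path to it exists.
fn link(
    host_caps: &HashMap<&'static str, HostFn>,
    declared: &HashSet<&'static str>,
) -> Result<HashMap<&'static str, HostFn>, String> {
    let mut table = HashMap::new();
    for import in declared {
        match host_caps.get(import) {
            Some(f) => { table.insert(*import, *f); }
            None => return Err(format!("host refuses capability {import}")),
        }
    }
    Ok(table)
}

fn main() {
    let mut host: HashMap<&'static str, HostFn> = HashMap::new();
    host.insert("wasi-sockets.send", socket_send);
    host.insert("wasi-filesystem.read", fs_read);

    // A network driver declares only the sockets interface.
    let declared: HashSet<_> = ["wasi-sockets.send"].into_iter().collect();
    let table = link(&host, &declared).unwrap();

    // It can call what it declared...
    assert_eq!(table["wasi-sockets.send"]("ping"), "sent: ping");
    // ...and the filesystem isn't restricted or monitored. It's just not there.
    assert!(table.get("wasi-filesystem.read").is_none());
}
```

Compare that with ambient authority in a monolithic kernel, where a driver can call anything exported to it, whether or not it has any business doing so.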

&lt;p&gt;That's not a driver sandbox. That's an entire OS component model.&lt;/p&gt;

&lt;p&gt;A network stack as a WASM component that imports &lt;code&gt;wasi-sockets&lt;/code&gt;. A filesystem driver that imports &lt;code&gt;wasi-filesystem&lt;/code&gt;. A display server that imports &lt;code&gt;wasi-gpu&lt;/code&gt; or whatever we'd call it. Each one verified before it runs. Each one incapable of touching what it didn't declare. Each one replaceable without touching anything else.&lt;/p&gt;

&lt;p&gt;XNU does this with hardware isolation and a bespoke driver framework. WASM does this with a verifier and an interface definition file.&lt;/p&gt;

&lt;p&gt;The Cupertino company spent decades building the infrastructure for this. We have a W3C spec and a Rust library.&lt;/p&gt;

&lt;p&gt;These are not as different as they sound.&lt;/p&gt;




&lt;h2&gt;
  
  
  Great idea. Where do we get drivers? Where do we get apps?
&lt;/h2&gt;

&lt;p&gt;Here's where the LLVM thing becomes important: the drivers and applications are just waiting to be recompiled.&lt;/p&gt;

&lt;p&gt;WASM is a valid LLVM target. LLVM is the backend that powers Clang, Rust, Swift, Kotlin Native, and most modern compiled languages. Which means any language that compiles through LLVM can emit WASM. Not as an afterthought. As a flag you pass to the compiler.&lt;/p&gt;

&lt;p&gt;So when someone asks "who writes the drivers for your WASM microkernel" — the answer is nobody. They're already written. In C. Sitting in the Linux kernel tree. Twenty years of accumulated hardware knowledge, weird edge cases, datasheets that lied, and fixes in comments nobody has read since 2009.&lt;/p&gt;

&lt;p&gt;You don't rewrite them. You recompile them.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;clang --target=wasm32-wasi driver.c&lt;/code&gt;. The Linux driver doesn't know it isn't on Linux. It asked for memory-mapped I/O access. It got a WASI capability that provides memory-mapped I/O access. Same semantic. Different implementation. Verified safe by construction. Sandboxed by the verifier before a single instruction executes. (Yes, I'm hand-waving a shim layer here — real drivers also lean on kernel APIs, interrupts, DMA, and locking, and someone has to reimplement those as host functions. But the driver code itself is just C.)&lt;/p&gt;

&lt;p&gt;Is this emulation? No. There's no semantic gap being papered over. It's just compilation with an intermediate stop that happens to give you safety, portability, and late-binding optimization as byproducts.&lt;/p&gt;

&lt;p&gt;Late-binding as in: your kernel stores the WASM bytecode. First boot on new hardware, it recompiles everything with the LLVM backend targeting your actual CPU. AVX-512. Your specific cache hierarchy. Your branch predictor. Your five-year-old driver binary gets Zen 5 optimizations its author never knew existed.&lt;/p&gt;
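A sketch of that module store, assuming nothing beyond "cache native artifacts keyed by (module, hardware profile), recompile on a miss." Every name here is invented, and the "compilation" is a placeholder string standing in for a real backend invocation:

```rust
use std::collections::HashMap;

// Toy model of a late-binding module store: the durable artifact is the
// bytecode; native code is just a cache entry keyed by the hardware profile,
// regenerated whenever the profile changes. All names invented.

#[derive(Clone, PartialEq, Eq, Hash)]
struct CpuProfile {
    arch: &'static str,          // "x86_64", "aarch64", ...
    features: Vec<&'static str>, // "avx512f", "sve2", ...
}

struct ModuleStore {
    compiles: u32, // how many real compilations we had to do
    cache: HashMap<(u64, CpuProfile), String>, // (module hash, profile) -> native artifact
}

impl ModuleStore {
    fn new() -> Self {
        Self { compiles: 0, cache: HashMap::new() }
    }

    /// Return native code for this module on this hardware,
    /// compiling (and caching) only on a miss.
    fn native_for(&mut self, module_hash: u64, profile: &CpuProfile) -> String {
        if let Some(hit) = self.cache.get(&(module_hash, profile.clone())) {
            return hit.clone();
        }
        self.compiles += 1; // stand-in for invoking the optimizing backend
        let artifact = format!(
            "module {module_hash:#x} compiled for {} with [{}]",
            profile.arch,
            profile.features.join(", ")
        );
        self.cache.insert((module_hash, profile.clone()), artifact.clone());
        artifact
    }
}

fn main() {
    let mut store = ModuleStore::new();
    let zen = CpuProfile { arch: "x86_64", features: vec!["avx512f"] };
    let apple = CpuProfile { arch: "aarch64", features: vec!["neon"] };

    // Same driver bytecode, two machines, two differently-specialized artifacts.
    let a = store.native_for(0xD51, &zen);
    let b = store.native_for(0xD51, &apple);
    assert_ne!(a, b);

    // Second boot on the same hardware: pure cache hit, no recompilation.
    store.native_for(0xD51, &zen);
    assert_eq!(store.compiles, 2);
}
```

Swap the CPU, and the next boot takes one compile hit and then serves the specialized artifact forever. The bytecode never changes; only the cache does.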

&lt;p&gt;Apple ships two frozen binaries in a trench coat. This ships one bytecode and derives the optimal binary at runtime, on your hardware, for free.&lt;/p&gt;

&lt;p&gt;Universal Binary was the right idea. This is the actual implementation.&lt;/p&gt;




&lt;h2&gt;
  
  
  So what would this actually look like?
&lt;/h2&gt;

&lt;p&gt;A Rust microkernel — Rust because memory safety in the kernel itself matters, and because Rust has excellent embedded/bare-metal support. A small, trusted core: interrupt handling, capability-based IPC, a scheduler, physical memory management. As little as possible.&lt;/p&gt;

&lt;p&gt;A WASM runtime (Wasmtime is embeddable as a Rust library; this is its designed use case) handling module loading, verification, and JIT compilation: Cranelift for startup speed, plus an LLVM-based backend for optimizing hot paths (that last part you'd have to build yourself; Wasmtime today compiles via Cranelift only).&lt;/p&gt;

&lt;p&gt;WASI as the system interface. Drivers and kernel modules are WASM components that declare their capability imports. The kernel grants capabilities at load time. A network driver imports networking hardware access. It doesn't get filesystem access. It can't even ask for it.&lt;/p&gt;

&lt;p&gt;A persistent module store: compiled WASM cached as native artifacts per hardware profile. Recompiled when the hardware changes. Profile-guided optimization over time as the system learns which paths are hot.&lt;/p&gt;

&lt;p&gt;POSIX compatibility as a WASM component itself — a userspace layer, not baked into the kernel. You want Linux semantics? Load the POSIX compatibility module. You want something else? Load something else. The kernel doesn't care.&lt;/p&gt;

&lt;p&gt;The result: your COSMIC desktop, or KDE, or whatever you want, running on top of a clean microkernel with isolated drivers, on x86 or ARM, with code that gets better at running on your specific hardware over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The wall
&lt;/h2&gt;

&lt;p&gt;And... here's where I have to be honest.&lt;/p&gt;

&lt;p&gt;This requires WebAssembly to become a serious systems target, not just a browser and edge-compute story. That's happening — slowly. The Bytecode Alliance (Mozilla, Microsoft, Fastly, Intel, Red Hat) is doing real work on WASI and the Component Model. Wasmtime is production quality-ish. The pieces exist.&lt;/p&gt;

&lt;p&gt;But "the pieces exist" is a long way from "the ecosystem exists." Linux driver authors aren't thinking about WASM targets. Systems programmers aren't writing kernel modules in WASM-first workflows. The toolchain integration is immature. The debugging story is rough.&lt;/p&gt;

&lt;p&gt;Meanwhile the Android black hole keeps pulling everything in. Amazon just shipped Vega OS — their escape from Android's GPL gravity — and their solution to "what's the application runtime" was JavaScript. React Native on Linux. They escaped the JVM and landed in a JavaScript engine. Different VM, same fundamental bet, worse type system. The ecosystem gravity is so strong that even the companies with the resources to build something better keep reinventing 1996 with a different runtime.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I'm not defeated about this
&lt;/h2&gt;

&lt;p&gt;While we were busy eulogizing Singularity and Fuchsia, something quietly happened to Linux.&lt;/p&gt;

&lt;p&gt;Someone snuck eBPF in. It now handles networking, security policy, system call filtering, TCP congestion control, and as of recent kernel versions — CPU scheduling. Someone wrote a CPU scheduler in eBPF. It merged. Linus signed off on it.&lt;/p&gt;

&lt;p&gt;And separately, people are smuggling Rust into the kernel. Driver by driver. Not a rewrite — a slow infiltration. Memory-safe, verifiable, LLVM-native code quietly becoming acceptable in the codebase that's been C since before most of us were born.&lt;/p&gt;

&lt;p&gt;The monolithic cob of corn is being hollowed out. Slowly. Commit by commit. By people with CVEs on their conscience and patience measured in decades.&lt;/p&gt;

&lt;p&gt;Maybe Linux gets back to being kernel-sized someday. It wouldn't be the first time an idea took forty years to arrive.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/@lukasz_rawa" rel="noopener noreferrer"&gt;Łukasz Rawa&lt;/a&gt; on &lt;a href="https://unsplash.com" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>webassembly</category>
      <category>kernel</category>
      <category>operatingsystems</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
