
Landlight

I Am Building a Cloud: Lessons From Designing My Own Infrastructure From Scratch


Three months ago, I typed `mkdir my-cloud` and asked myself: how hard can it be?

Spoiler: very. But also deeply rewarding in ways I didn't expect.

This isn't a tutorial. It's a brutally honest account of what it actually takes when you decide to stop renting compute from AWS, GCP, or Azure and start building your own cloud — even a small, self-hosted one. Whether you're doing this for learning, for cost savings at scale, or because you genuinely want control over your stack, there's a lot nobody tells you upfront.

Let me fix that.


Why Build a Cloud at All?

Before we dive into architecture diagrams and YAML files, let's be honest about motivation. Here are the real reasons developers end up down this rabbit hole:

  • Cost at scale. Managed services are convenient but punishing at volume. At a certain number of VMs or data transfer gigabytes, the math tilts hard toward owning iron.
  • Control and compliance. Some industries (healthcare, finance, government) need data sovereignty that public clouds make complicated.
  • Learning. Nothing teaches you how Kubernetes actually works like building the thing that Kubernetes runs on.
  • The itch. Sometimes you just want to know if you can.

I fall into the last two camps. I have a rack of second-hand servers in a co-location facility, a dangerous amount of free time, and a strong aversion to accepting "just use managed X" as a final answer.


What "Building a Cloud" Actually Means

A cloud platform is not one thing. It's a layered stack of problems:

┌──────────────────────────────┐
│     Developer Portal / API   │  ← Users interact here
├──────────────────────────────┤
│     Orchestration Layer      │  ← Kubernetes, Nomad, etc.
├──────────────────────────────┤
│     Networking Layer         │  ← SDN, load balancers, DNS
├──────────────────────────────┤
│     Storage Layer            │  ← Block, object, file
├──────────────────────────────┤
│     Compute Layer            │  ← Hypervisors, bare metal
├──────────────────────────────┤
│     Physical / Bare Metal    │  ← The actual servers
└──────────────────────────────┘

Most tutorials pick one layer and call it a day. Building a cloud means you have to care about all of them — and more importantly, how they talk to each other.


The Stack I Chose (And Why)

Compute: Proxmox VE

I run Proxmox VE as my hypervisor layer. It's open-source, handles both KVM virtual machines and LXC containers, and has a solid REST API I can script against.

Spinning up a VM via the Proxmox API looks like this:

# TICKET and CSRF come from an earlier POST to /api2/json/access/ticket
curl -s -k -b "PVEAuthCookie=${TICKET}" \
  -H "CSRFPreventionToken: ${CSRF}" \
  -X POST \
  "https://proxmox-host:8006/api2/json/nodes/pve/qemu" \
  -d 'vmid=101&name=my-vm&memory=2048&cores=2&net0=virtio,bridge=vmbr0&ide2=local:iso/ubuntu-22.04.iso,media=cdrom&scsihw=virtio-scsi-pci&scsi0=local-lvm:20'

This is the foundation. Every VM, every container, everything runs on top of this.

Networking: Open vSwitch + VXLANs

This is where things get interesting — and painful. When you want tenant isolation (so different users or projects can't see each other's traffic), you need a software-defined network.

I use Open vSwitch (OVS) with VXLAN overlays. Each project gets its own VXLAN segment:

# Create a VXLAN tunnel between two hypervisor nodes
# (key = the VXLAN Network Identifier, one per tenant segment)
ovs-vsctl add-br br-overlay
ovs-vsctl add-port br-overlay vxlan0 -- \
  set interface vxlan0 type=vxlan \
  options:remote_ip=10.0.0.2 \
  options:key=1001

This gives you Layer 2 connectivity across physically separate hosts — the same trick that AWS and GCP use under the hood (just with a lot more engineering muscle behind it).

Storage: Ceph

For distributed block and object storage, I run a small Ceph cluster across three nodes. Ceph is the backbone of many production clouds, including parts of OpenStack deployments.

Creating a storage pool:

ceph osd pool create my-cloud-vms 128
rbd pool init my-cloud-vms

Is Ceph operationally complex? Absolutely. But it gives you replicated, fault-tolerant block storage that behaves like EBS or Google Persistent Disk under the hood.
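A note on the `128` in that pool command: it's the placement-group (PG) count. A common rule of thumb from the Ceph docs is (OSDs × 100) / replica count, rounded up to the next power of two. Here's a rough sketch of that arithmetic (the cluster sizes are illustrative, not prescriptive):

```go
package main

import "fmt"

// pgCount estimates a placement-group count for a Ceph pool using the
// common rule of thumb: (OSDs * 100) / replicas, rounded up to the
// nearest power of two.
func pgCount(osds, replicas int) int {
	target := osds * 100 / replicas
	pg := 1
	for pg < target {
		pg *= 2
	}
	return pg
}

func main() {
	// Three nodes, one OSD each, 3x replication:
	fmt.Println(pgCount(3, 3)) // 128 — the value used above
	// Three nodes with two OSDs each:
	fmt.Println(pgCount(6, 3)) // 256
}
```

Get this badly wrong and Ceph will warn you about too few (or too many) PGs per OSD, so it's worth doing the math before creating the pool.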

Orchestration: Kubernetes (K3s)

For running containerized workloads, I use K3s — a lightweight Kubernetes distribution that doesn't require a team of SREs to operate. It runs comfortably on VMs provisioned by Proxmox.

# Install K3s on a fresh VM
curl -sfL https://get.k3s.io | sh -s - \
  --cluster-init \
  --disable traefik \
  --node-name cloud-control-01

The Control Plane: The Hardest Part Nobody Talks About

Here's the dirty secret: the compute, storage, and network layers are solved problems. There is open-source software for all of it.

The genuinely hard part is the control plane — the API and logic that ties everything together and exposes it to users.

When a user says "give me a 4-core VM with 8GB RAM in region east," something needs to:

  1. Authenticate the request
  2. Check quota and billing
  3. Select the right hypervisor node (scheduling)
  4. Call the Proxmox API
  5. Configure networking for the new VM
  6. Register the VM in a state database
  7. Return an IP address and credentials to the user

That workflow is your control plane. I'm building mine as a Go service with a PostgreSQL backend:

func (s *Server) CreateVM(ctx context.Context, req *CreateVMRequest) (*VM, error) {
    // 1. Validate and authenticate
    user, err := s.auth.Validate(ctx, req.Token)
    if err != nil {
        return nil, ErrUnauthorized
    }

    // 2. Check quota
    if err := s.quota.Check(ctx, user.ID, req.Resources); err != nil {
        return nil, ErrQuotaExceeded
    }

    // 3. Schedule: pick a hypervisor node
    node, err := s.scheduler.Select(ctx, req.Resources)
    if err != nil {
        return nil, ErrNoCapacity
    }

    // 4. Provision the VM
    vmID, err := s.proxmox.CreateVM(ctx, node, req)
    if err != nil {
        return nil, fmt.Errorf("provisioning failed: %w", err)
    }

    // 5. Configure networking
    ip, err := s.network.Allocate(ctx, vmID, user.ProjectID)
    if err != nil {
        _ = s.proxmox.DeleteVM(ctx, node, vmID) // rollback
        return nil, fmt.Errorf("network allocation failed: %w", err)
    }

    // 6. Persist state
    vm := &VM{ID: vmID, NodeID: node.ID, IP: ip, OwnerID: user.ID}
    if err := s.db.SaveVM(ctx, vm); err != nil {
        return nil, err
    }

    return vm, nil
}

This is simplified, but it captures the essence. Notice the rollback on network allocation failure — distributed system failures bite hard when you don't handle partial states.
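The `s.scheduler.Select` call is where placement policy lives, and even a trivial policy is worth writing down. A minimal version is "filter out nodes that can't fit the request, pick the one with the most free memory". The `Node` and `Resources` types below are hypothetical stand-ins for the real service's types, a sketch rather than the actual scheduler:

```go
package main

import (
	"errors"
	"fmt"
)

// Node and Resources are illustrative stand-ins for the control
// plane's real types.
type Node struct {
	ID      string
	FreeMem int // MiB
	FreeCPU int // cores
}

type Resources struct {
	Mem int
	CPU int
}

var ErrNoCapacity = errors.New("no node has enough capacity")

// selectNode implements the simplest useful placement policy:
// skip nodes that can't fit the request, then greedily pick the
// one with the most free memory (a "least-loaded" heuristic).
func selectNode(nodes []Node, req Resources) (*Node, error) {
	var best *Node
	for i := range nodes {
		n := &nodes[i]
		if n.FreeMem < req.Mem || n.FreeCPU < req.CPU {
			continue
		}
		if best == nil || n.FreeMem > best.FreeMem {
			best = n
		}
	}
	if best == nil {
		return nil, ErrNoCapacity
	}
	return best, nil
}

func main() {
	nodes := []Node{
		{ID: "pve-01", FreeMem: 4096, FreeCPU: 2},
		{ID: "pve-02", FreeMem: 16384, FreeCPU: 8},
	}
	n, _ := selectNode(nodes, Resources{Mem: 8192, CPU: 4})
	fmt.Println(n.ID) // pve-02
}
```

Real schedulers also account for anti-affinity, reserved headroom, and in-flight provisions, but starting greedy and simple has served me fine so far.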


Mistakes I've Made (So You Don't Have To)

1. Skipping Idempotency Early On

Cloud APIs need to be idempotent. If a VM creation request times out and the client retries, you must not create two VMs. I didn't implement idempotency keys early enough and ended up with orphaned VMs I couldn't account for.

Fix: Accept a client_request_id on every mutating API call and deduplicate in your database.
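To make that concrete, here's a minimal sketch of request-ID deduplication. The in-memory map stands in for what should really be a `UNIQUE(client_request_id)` constraint in PostgreSQL, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// requestCache is an in-memory stand-in for a database table with a
// UNIQUE(client_request_id) constraint: the first call with a given
// key creates the VM and records the result; retries get the same
// result back instead of creating a second VM.
type requestCache struct {
	mu   sync.Mutex
	seen map[string]string // client_request_id -> vmID
}

// getOrCreate returns the previously recorded VM ID for this request
// key, or runs create() exactly once and stores its result. Holding
// the lock across create() is fine for a sketch; a real system would
// persist an "in progress" row instead.
func (c *requestCache) getOrCreate(requestID string, create func() string) (vmID string, duplicate bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.seen[requestID]; ok {
		return id, true
	}
	id := create()
	c.seen[requestID] = id
	return id, false
}

func main() {
	cache := &requestCache{seen: map[string]string{}}
	create := func() string { return "vm-101" }

	id1, dup1 := cache.getOrCreate("req-abc", create)
	id2, dup2 := cache.getOrCreate("req-abc", create) // client retry after timeout
	fmt.Println(id1, dup1) // vm-101 false
	fmt.Println(id2, dup2) // vm-101 true
}
```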

2. Underestimating Networking Complexity

I naively thought networking was "just routing." It's not. You're dealing with ARP storms, MTU mismatches across VXLAN tunnels, asymmetric routing, and firewall state tables that don't survive node reboots. Budget triple the time you think you need here.
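The MTU issue alone deserves a note: VXLAN over IPv4 adds roughly 50 bytes of encapsulation per packet (outer IPv4 + UDP + VXLAN header + inner Ethernet), so guest interfaces inside the overlay must use a smaller MTU than the physical NIC. Forget this and large packets silently vanish while pings still work. A quick sketch of the arithmetic:

```go
package main

import "fmt"

// VXLAN-over-IPv4 encapsulation overhead per packet:
// outer IPv4 (20) + UDP (8) + VXLAN header (8) + inner Ethernet (14) = 50 bytes.
const vxlanIPv4Overhead = 20 + 8 + 8 + 14

// innerMTU returns the largest MTU a guest interface can safely use
// when its traffic is carried over a VXLAN tunnel on a link with the
// given physical MTU.
func innerMTU(physMTU int) int {
	return physMTU - vxlanIPv4Overhead
}

func main() {
	fmt.Println(innerMTU(1500)) // 1450 — standard Ethernet
	fmt.Println(innerMTU(9000)) // 8950 — jumbo frames leave plenty of headroom
}
```

The alternative, if your switches support it, is jumbo frames on the underlay so guests can keep a 1500-byte MTU.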

3. No Observability From Day One

I added metrics and logging as an afterthought. Big mistake. When your control plane starts behaving weirdly at 2 AM, printf debugging across three hypervisor nodes is a nightmare.

Fix: Instrument everything from day one. I now use Prometheus and Grafana for metrics, alongside structured JSON logging shipped to a central Loki instance.

// Instrument your handlers from the start
func (s *Server) CreateVM(ctx context.Context, req *CreateVMRequest) (*VM, error) {
    timer := prometheus.NewTimer(vmCreationDuration)
    defer timer.ObserveDuration()

    vmCreationTotal.Inc()
    // ... rest of the logic
}

What This Teaches You About Public Clouds

Building even a tiny cloud fundamentally changes how you read AWS or GCP documentation. Phrases like "availability zone," "VPC peering," "instance scheduling," and "eventual consistency" stop being buzzwords and become concrete engineering decisions you've personally wrestled with.

You start to understand why EBS volumes have the latency characteristics they do, why cross-AZ traffic costs money, and why spot instances can be interrupted. These aren't arbitrary decisions — they're consequences of real physical and logical constraints.

If you want to become a genuinely better cloud engineer (not just a cloud user), infrastructure-as-code tools and books on cloud internals will only take you so far. At some point, you have to build.


Where I Am Today

After three months, my cloud can:

  • ✅ Provision and destroy VMs via API
  • ✅ Allocate isolated project networks automatically
  • ✅ Serve block storage from Ceph
  • ✅ Run Kubernetes workloads across the VM fleet
  • ✅ Track basic resource usage per user/project

Still working on:

  • 🔧 Live VM migration between hypervisor nodes
  • 🔧 A proper billing and quota system
  • 🔧 A usable developer portal (the API is functional but ugly)
  • 🔧 Automated certificate management for tenant workloads

The repo is private for now, but I'm planning to open-source the control plane once it's less embarrassing. Follow me here on DEV if you want to be notified when that drops.


Should You Do This?

If you want to deeply understand distributed systems, networking, and how modern infrastructure actually works — yes, absolutely. Building a cloud, even a toy one, is one of the most educational things I've ever done as an engineer.

If you need to ship a product next quarter — no, rent compute. That's what AWS is for.

But if you have the itch, scratch it. `mkdir my-cloud` and see where it takes you.


Resources That Actually Helped

  • Designing Data-Intensive Applications by Martin Kleppmann — essential for understanding the state management challenges
  • The Proxmox API documentation (surprisingly good)
  • The Ceph documentation (less good, but comprehensive)
  • Cloud Native Patterns and similar architecture guides — for thinking about multi-tenancy correctly
  • The OpenStack source code — not to run it, but to read how they solved problems

If you're building something similar, or you've done this before and want to tell me where I'm going wrong — drop a comment below. I read everything. And if you want updates as this project evolves, follow me here on DEV. There's a lot more to come.
