DEV Community

佐藤玲
I Am Building a Cloud: Lessons from Designing Your Own Cloud Infrastructure from Scratch

Everybody uses the cloud. Almost nobody builds one.

Six months ago, I made a decision that most engineers quietly file under "interesting but insane": I started building my own cloud platform. Not a toy. Not a weekend hack. A functional, multi-tenant infrastructure platform capable of spinning up compute, managing storage, and exposing APIs — the kind of thing you'd recognize if you squinted at AWS or GCP long enough.

This article is a living journal of that journey. Whether you're a curious developer, a startup thinking about infra costs, or someone who just wants to understand what's actually happening underneath the clouds you rent every day — this one's for you.


Why Build a Cloud?

Before we get into the architecture, let me answer the obvious question: why?

The honest answers:

  • Cost at scale. Renting compute forever is expensive. Owning hardware and orchestrating it yourself, at the right scale, can be 30–60% cheaper.
  • Learning depth. You don't truly understand distributed systems until you've been paged at 2am because your control plane split-brained.
  • Data sovereignty. Some workloads cannot leave your jurisdiction. Building your own cloud solves that.
  • Pure curiosity. Some of us just need to know how things work.

If none of those resonate, that's fine — rent your infra from Amazon like a sensible person. But if even one of them does, read on.


The Architecture: What a "Cloud" Actually Is

Strip away the marketing, and a cloud platform is composed of a few core layers:

┌─────────────────────────────────┐
│         Control Plane           │  ← API, Auth, Scheduler
├─────────────────────────────────┤
│         Compute Layer           │  ← VMs, Containers
├─────────────────────────────────┤
│         Network Layer           │  ← SDN, VPCs, Load Balancers
├─────────────────────────────────┤
│         Storage Layer           │  ← Block, Object, File
├─────────────────────────────────┤
│         Physical Hardware       │  ← Servers, Switches, Power
└─────────────────────────────────┘

When you're building a cloud, you're essentially engineering every one of those layers and making them talk to each other reliably, securely, and at scale.


Phase 1: The Control Plane

The control plane is the brain. It's the API that users and services talk to when they say "give me a VM" or "create a storage bucket."

I built mine in Go, using a RESTful API backed by etcd for consistent distributed state. Here's a simplified version of the instance creation handler:

package api

import (
    "encoding/json"
    "net/http"
    "github.com/google/uuid"
)

type CreateInstanceRequest struct {
    Name     string `json:"name"`
    Image    string `json:"image"`
    Flavor   string `json:"flavor"`
    Region   string `json:"region"`
}

type Instance struct {
    ID     string `json:"id"`
    Name   string `json:"name"`
    Status string `json:"status"`
    IP     string `json:"ip"`
}

func CreateInstanceHandler(w http.ResponseWriter, r *http.Request) {
    var req CreateInstanceRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, "Invalid request", http.StatusBadRequest)
        return
    }
    if req.Name == "" || req.Image == "" {
        http.Error(w, "name and image are required", http.StatusBadRequest)
        return
    }

    instance := Instance{
        ID:     uuid.New().String(),
        Name:   req.Name,
        Status: "provisioning",
    }

    // Persist desired state to etcd and enqueue for the scheduler;
    // the handler itself never provisions anything.
    if err := scheduleInstance(instance, req); err != nil {
        http.Error(w, "Scheduling failed", http.StatusInternalServerError)
        return
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(http.StatusAccepted)
    json.NewEncoder(w).Encode(instance)
}

The key insight here: the API doesn't do the work. It accepts the request, validates it, writes state to etcd, and hands off to the scheduler. Decoupling intent from execution is fundamental to cloud design.
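To make that handoff concrete, here's a sketch of what `scheduleInstance` might persist. The key layout (`/instances/<id>`) and helper names are my illustration, not a confirmed schema — a real implementation would `Put` this key/value pair with an etcd v3 client and have the scheduler watch the `/instances/` prefix for new entries:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Instance mirrors the API type from the handler above.
type Instance struct {
	ID     string `json:"id"`
	Name   string `json:"name"`
	Status string `json:"status"`
	IP     string `json:"ip"`
}

// instanceKey builds the etcd key for an instance's desired state.
// The scheduler watches the /instances/ prefix and reacts to new keys.
func instanceKey(id string) string {
	return "/instances/" + id
}

// serializeForEtcd returns the key/value pair the control plane would
// write to etcd before returning 202 Accepted to the caller.
func serializeForEtcd(inst Instance) (string, string, error) {
	b, err := json.Marshal(inst)
	if err != nil {
		return "", "", err
	}
	return instanceKey(inst.ID), string(b), nil
}

func main() {
	key, val, _ := serializeForEtcd(Instance{ID: "abc123", Name: "web-1", Status: "provisioning"})
	fmt.Println(key) // /instances/abc123
	fmt.Println(val)
}
```

Because the desired state lives in etcd rather than in the API process, any scheduler replica can pick up the work — which is what makes the control plane horizontally scalable.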


Phase 2: Compute — Containers vs. VMs

When building a cloud compute layer, you face an early fork in the road:

Virtual Machines (KVM/QEMU)

  • Full isolation
  • Heavier, slower to boot
  • Best for untrusted multi-tenant workloads

Containers (containerd / gVisor)

  • Lightweight, fast
  • Weaker isolation (though gVisor helps)
  • Best for trusted workloads

I chose a hybrid approach: containers for internal services, lightweight VMs (using Firecracker) for tenant workloads. Firecracker was originally built by AWS for Lambda and Fargate. It boots a minimal Linux VM in under 125ms with a tiny memory footprint — perfect for a cloud that needs to be both fast and secure.

Spinning up a Firecracker microVM looks like this in its simplest form:

# Start Firecracker
firecracker --api-sock /tmp/firecracker.socket &

# Set kernel
curl -s -X PUT \
  --unix-socket /tmp/firecracker.socket \
  'http://localhost/boot-source' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "kernel_image_path": "/opt/kernels/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
  }'

# Set rootfs
curl -s -X PUT \
  --unix-socket /tmp/firecracker.socket \
  'http://localhost/drives/rootfs' \
  -H 'Content-Type: application/json' \
  -d '{
    "drive_id": "rootfs",
    "path_on_host": "/opt/images/ubuntu-22.04.ext4",
    "is_root_device": true,
    "is_read_only": false
  }'

# Start the VM
curl -s -X PUT \
  --unix-socket /tmp/firecracker.socket \
  'http://localhost/actions' \
  -d '{"action_type": "InstanceStart"}'

This is the foundation. From here, the scheduler automates this across a pool of physical hosts.
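The core of that automation is a placement decision. My actual scheduler is more involved, but the essence can be sketched as a naive "most free memory" policy — the `Host` type and field names here are assumptions for illustration, not my real scheduler's types:

```go
package main

import (
	"errors"
	"fmt"
)

// Host is a physical machine in the compute pool.
type Host struct {
	Name    string
	FreeMiB int
}

// pickHost chooses the host with the most free memory that can still
// fit the requested flavor. Returns an error if nothing fits.
func pickHost(hosts []Host, wantMiB int) (*Host, error) {
	var best *Host
	for i := range hosts {
		h := &hosts[i]
		if h.FreeMiB < wantMiB {
			continue // host can't fit this flavor
		}
		if best == nil || h.FreeMiB > best.FreeMiB {
			best = h
		}
	}
	if best == nil {
		return nil, errors.New("no host has enough free memory")
	}
	return best, nil
}

func main() {
	pool := []Host{{"node-a", 2048}, {"node-b", 8192}, {"node-c", 4096}}
	h, err := pickHost(pool, 4096)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Name) // node-b
}
```

Once a host is picked, the scheduler drives the Firecracker API calls above against that host's agent.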


Phase 3: Software-Defined Networking

Networking is where most DIY cloud projects die. It's also where things get genuinely fascinating.

I'm using Open vSwitch (OVS) combined with VXLAN tunnels to create tenant-isolated virtual networks across physical hosts. Every tenant gets their own L2 broadcast domain, isolated from every other tenant — exactly like a VPC in AWS.

Here's the mental model:

Host A                            Host B
┌────────────────────┐            ┌────────────────────┐
│   VM1      VM2     │            │   VM3      VM4     │
│    │        │      │            │    │        │      │
│   OVS Bridge       │◄──VXLAN───►│   OVS Bridge       │
│   (tenant VNI 10)  │   tunnel   │   (tenant VNI 10)  │
└────────────────────┘            └────────────────────┘

VMs on the same tenant network can talk to each other across hosts as if they're on the same switch, while being completely invisible to other tenants. This is software-defined networking doing real work.
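The OVS wiring for that picture looks roughly like this. Bridge and port names are examples, and the commands need root plus Open vSwitch installed; the tunnel key is the VXLAN Network Identifier (VNI) that keeps tenants apart:

```shell
# Create a per-tenant bridge on each host
ovs-vsctl add-br br-tenant10

# On Host A: point a VXLAN tunnel at Host B, tagging traffic with VNI 10
ovs-vsctl add-port br-tenant10 vxlan-hostb -- \
  set interface vxlan-hostb type=vxlan \
  options:remote_ip=10.0.0.2 options:key=10

# Attach a VM's tap device to the tenant bridge
ovs-vsctl add-port br-tenant10 tap-vm1
```

The mirror-image commands run on Host B with `remote_ip` pointing back at Host A. In practice the control plane generates these per tenant rather than anyone typing them by hand.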


Phase 4: Object Storage

Every cloud needs storage. Block storage (think EBS) is complex; I started with object storage (think S3) because the API surface is smaller and the consistency model is simpler.

Rather than building from scratch, I deployed MinIO in a distributed configuration across four nodes, giving me S3-compatible object storage with erasure coding for resilience:

# docker-compose.yml for distributed MinIO
version: '3.7'
services:
  minio1:
    image: minio/minio
    command: server http://minio{1...4}/data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: supersecretpassword
    volumes:
      - /data/minio1:/data
    ports:
      - "9000:9000"
      - "9001:9001"
  minio2:
    image: minio/minio
    command: server http://minio{1...4}/data
    environment:  # every node needs the same credentials
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: supersecretpassword
    volumes:
      - /data/minio2:/data
  # ... minio3 and minio4 follow the same pattern

With this setup and MinIO's default erasure coding, I can lose two of the four nodes and still read every object (writes need a quorum of nodes back online). Good enough for v1.
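A quick smoke test against a running cluster with the MinIO client (`mc`) — the alias and bucket names are examples, and the credentials match the compose file above:

```shell
# Register the cluster and exercise the S3-compatible API
mc alias set mycloud http://localhost:9000 admin supersecretpassword
mc mb mycloud/test-bucket
echo "hello" > /tmp/hello.txt
mc cp /tmp/hello.txt mycloud/test-bucket/
mc ls mycloud/test-bucket
```

Any S3 SDK (boto3, aws-sdk-go, etc.) pointed at port 9000 works the same way, which is the whole point of choosing an S3-compatible API.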


The Hardest Problems Nobody Warns You About

Building this has taught me things no blog post prepared me for:

1. Time is your enemy

Distributed systems require synchronized clocks. NTP drift of even a few hundred milliseconds can cause etcd leader elections to thrash, logs to be unreadable, and TLS certs to fail validation. Run Chrony everywhere and monitor clock skew obsessively.

2. The control plane must be HA

I lost a Friday afternoon to a single control plane node failing. If the API is down, no VMs can be created, modified, or deleted — even if everything else is running fine. Run at least three control plane nodes behind a load balancer, always.

3. Noisy neighbors are real

One VM doing aggressive disk I/O will hurt every other VM on that host. Implement cgroups v2 resource limits from day one, not as an afterthought:

# Limit a cgroup to 50% of one CPU and 2 GiB of RAM (cgroups v2)
mkdir /sys/fs/cgroup/tenant-vm-abc123
echo "50000 100000" > /sys/fs/cgroup/tenant-vm-abc123/cpu.max   # 50ms quota per 100ms period
echo 2147483648 > /sys/fs/cgroup/tenant-vm-abc123/memory.max

4. Observability is not optional

You cannot fix what you cannot see. I run the Prometheus + Grafana + Loki stack across the entire platform. Every VM host exports node metrics, every API call is traced with OpenTelemetry, and every log line lands in Loki. Setting up a self-hosted observability stack was one of the best early investments of time I made.
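For a sense of scale, the Prometheus side is mostly a scrape config like this — hostnames are placeholders, and port 9100 is node_exporter's default:

```yaml
# prometheus.yml fragment: scrape node metrics from every VM host
scrape_configs:
  - job_name: "vm-hosts"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "node-a:9100"
          - "node-b:9100"
          - "node-c:9100"
```

Alert rules on top of this (clock skew, etcd leader changes, disk saturation) are what actually catch the failure modes described in this section.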


What It Costs (Honestly)

Here's my current hardware spend for a modest but real platform:

| Component      | Hardware                                             | Cost    |
|----------------|------------------------------------------------------|---------|
| Compute nodes  | 4× refurbished Dell R640, 32 cores / 256GB RAM each  | ~$4,800 |
| Storage nodes  | 4× machines with 12TB raw NVMe each                  | ~$3,200 |
| Networking     | 2× 10GbE switches, SFP+ cabling                      | ~$800   |
| Control plane  | 3× small VMs on separate hardware                    | ~$200   |
| **Total**      |                                                      | ~$9,000 |

That sounds like a lot. But consider: running equivalent workloads on AWS (say, 8× m6i.8xlarge reserved instances + storage) would cost roughly $4,200 per month. This hardware pays for itself in under three months at that utilization level.

If you want to explore cost optimization before committing to your own hardware, there are excellent tools for right-sizing your existing cloud spend first.


Tools I'm Using (The Stack)

Here's the full current toolchain:

  • Compute: Firecracker microVMs, orchestrated by a custom Go scheduler
  • Networking: Open vSwitch + VXLAN
  • Storage: MinIO (object), LVM thin pools (block)
  • Control Plane: Go API + etcd v3
  • Auth: Keycloak (OIDC)
  • Observability: Prometheus, Grafana, Loki, OpenTelemetry
  • IaC: Terraform for bootstrapping, Ansible for configuration
  • DNS: CoreDNS

If you're starting this journey, learning infrastructure as code early will save you enormous amounts of time when managing configuration at scale.


What's Next

The roadmap for the next three months:

  • [ ] Kubernetes integration — allow tenants to request managed K8s clusters
  • [ ] Live VM migration — move running VMs between hosts without downtime
  • [ ] Billing engine — track resource consumption per tenant and generate invoices
  • [ ] Self-service portal — a UI so not everything requires API calls
  • [ ] Multi-region — replicate the control plane across two physical locations

Building a cloud is not a project — it's a practice. The architecture evolves, failures teach you things no documentation can, and the satisfaction of watching a VM boot on hardware you own and control never really gets old.


Should You Build One Too?

Maybe. Here's the honest truth:

You should build a cloud if:

  • You have consistent, predictable workloads at meaningful scale
  • You have the engineering depth to operate distributed systems
  • Data sovereignty or compliance requirements demand it
  • You want to understand infrastructure at a deep level

You should not build a cloud if:

  • Your workloads are spiky and unpredictable
  • You have a small team and no dedicated ops capacity
  • You're early stage and should be building product, not platforms

The act of attempting to build a cloud — even partially — will make you a dramatically better infrastructure engineer. You'll understand why AWS charges what it charges, why certain architectural patterns exist, and why distributed systems are genuinely hard.

That knowledge is worth something, even if you end up back on EC2.


If you're on a similar journey — building, experimenting, or just curious — I'd love to compare notes. Follow me here on DEV for updates as this build progresses, and drop your questions or war stories in the comments below.

Next article: how I implemented live VM migration using QEMU's postcopy mode. Subscribe so you don't miss it.
