
John Mitchell

Bootstrapping an Infrastructure in 2025

Notes on "Bootstrapping an Infrastructure"

My job is Cloud Ops at a large media company. We're moving a ton of users and other resources to a new cloud tenant. I enjoyed the opportunity to revisit the classic paper Bootstrapping an Infrastructure by Steve Traugott and Joel Huddleston, published all the way back in 1998. They compare booting a cluster to booting a computer: each is composed of a large set of services, each one supporting the ones that come after it.

Steps

[bootstrap diagram]

Summary of paper

They model a cloud made of many machines as a single machine, not as a collection of "pet" computers.
By following a specific series of steps, each one supporting the others, a single cloud is constructed.

The paper has a whole section on "Infrastructure Thinking":

Providing capable, reliable infrastructures which grant easy access to applications makes users happier and tends to raise the sysadmin's quality of life.

The "virtual machine" concept simplified how we maintained individual hosts. Upon adopting this mindset, it immediately became clear that **all nodes in a "virtual machine" infrastructure needed to be generic, each providing a commodity resource to the infrastructure.** It became a relatively simple operation to add, delete, or replace any node.

Commentary

The 16 steps are in four layers, each one building atop the layers that came before. Each layer focuses on a single audience, and delivers a specific feature in the cluster to that audience.

The four layers are:

  • Infrastructure
  • Support
  • Client Hosts
  • Cluster services
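
As a reader's aide, the grouping is easy to capture as a small data structure. This is my own labeling of the paper's 16 steps, not something the authors spell out:

```python
# The paper's 16 steps, grouped into the four layers above.
# The layer names and grouping are my reading, not the paper's own structure.
BOOTSTRAP_LAYERS = {
    "Infrastructure": [
        "Version Control", "Gold Server",
        "Host Install Tools", "Ad Hoc Change Tools",
    ],
    "Support": [
        "Directory Servers", "Authentication Servers", "Time Synchronization",
        "Network File Servers", "File Replication Servers",
    ],
    "Client Hosts": [
        "Client File Access", "Client OS Update",
        "Client Configuration Management", "Client Application Management",
    ],
    "Cluster Services": [
        "Mail", "Printing", "Monitoring",
    ],
}

# Each layer depends on every layer that comes before it.
for number, (layer, steps) in enumerate(BOOTSTRAP_LAYERS.items(), start=1):
    print(f"Layer {number}: {layer} ({len(steps)} steps)")
```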

Infrastructure

  • Version Control - CVS, track who made changes, backout
  • Gold Server - only require changes in one place
  • Host Install Tools - install hosts without human intervention
  • Ad Hoc Change Tools - 'expect', to recover from early or big problems

These tools support cluster management and are designed for the cluster admins only. Like all layers, they support the layers above.
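
The "ad hoc change tool" idea still maps neatly onto Python's pexpect library. A minimal sketch, assuming key-based ssh access; the hostname and command are hypothetical:

```python
# Ad hoc change over ssh, in the spirit of the paper's 'expect' scripts.
# Assumes key-based ssh auth; hostname and command are hypothetical.
import pexpect

def run_adhoc(host: str, command: str) -> str:
    """ssh to a host, run one command, return its output."""
    child = pexpect.spawn(f"ssh admin@{host} {command}", timeout=30)
    child.expect(pexpect.EOF)      # wait for the remote command to finish
    return child.before.decode()   # everything printed before EOF

if __name__ == "__main__":
    print(run_adhoc("node01.example.com", "uptime"))
```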

Coming from a modern/cloud perspective, this is both very familiar and very different. Version control and a central "gold server" make sense. Host Install means PXE: a machine boots, asks the central server which OS distribution and customizations to install, and does that over minutes or hours. The modern equivalent would be an AWS AMI (machine image), a HashiCorp Packer build, or a Docker image.

This is great: start with nothing, install a full blob all at once. If it doesn't work as expected, iterate. Tweaking individual machines is fine for experimentation, with the understanding that local changes are ephemeral and will be wiped at the next rebuild.
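
A hedged sketch of that modern equivalent: boot a disposable host from a pre-baked golden image with boto3. The AMI ID, region, and instance type are placeholders, not a real configuration:

```python
# Launch a disposable host from a golden image (AMI), the modern analogue
# of the paper's automated host install. All identifiers are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder golden AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "generic-node"}],
    }],
)
print("launched:", instances[0].id)
```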

Support

  • Directory Servers - DNS, NIS, LDAP
  • Authentication Servers - NIS, Kerberos
  • Time Synchronization - NTP
  • Network File Servers - NFS, AFS, SMB
  • File Replication Servers - SUP

These services provide low-level data to the cluster and to users. DNS (cluster-wide network names) and LDAP (shared user, printer, and other resource info) provide trusted low-level data to the cluster. Authentication requires two-way trust: a server only grants access if the person knows a secret password. NFS (Network File System) provides raw storage to the cluster, to be used by higher-level services.

The Support services make different types of data available to the cluster.
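
A toy sanity check for two of these support services, using only the standard library; the hostname and mount point are hypothetical:

```python
# Quick client-side checks for two support services.
# The hostname and mount point are hypothetical examples.
import socket
from pathlib import Path

def dns_ok(name: str) -> bool:
    """Directory service: can we resolve a cluster-wide name?"""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def share_mounted(mount_point: str) -> bool:
    """Network file service: is the shared mount present locally?"""
    return Path(mount_point).is_dir()

print("DNS:", dns_ok("goldserver.example.com"))
print("NFS:", share_mounted("/net/shared"))
```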

Surprise: the concept of "replication", where data is centrally managed but then copied to local machines, isn't something I've seen much. It makes sense, though: in a world of physical machines, being able to provide apps to local users even when the network is down is a great idea.
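
The replication idea, centrally managed but locally cached, is easy to sketch with rsync; the server name and paths here are made up:

```python
# Pull a centrally managed application tree down to local disk,
# in the spirit of SUP-style file replication. Paths are made up.
import subprocess

def replicate(gold_server: str, remote_path: str, local_path: str) -> None:
    """Mirror a directory from the gold server onto this machine."""
    subprocess.run(
        ["rsync", "-a", "--delete",
         f"{gold_server}:{remote_path}/", local_path],
        check=True,
    )

if __name__ == "__main__":
    replicate("goldserver.example.com", "/export/apps", "/usr/local/apps")
```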

Client

  • Client File Access - automount, AMD, autolink
  • Client OS Update - rc.config, configure, make, cfengine
  • Client Configuration Management - cfengine, SUP, CVSup
  • Client Application Management - autosup, autolink

Unlike the modern cloud, the paper talks about apps running locally on each physical machine.

The Client services manage app support on each individual machine, but from the cluster level:

  • Automount: each machine makes available specific parts of the shared network file system for the local user(s) and services.

  • OS Update: machine operating systems are managed centrally.

  • Client Configuration and Application Management: at this layer, individual machine differences are managed centrally. If a user wants to make a local change, it's set up and managed centrally, so the entire machine can be replaced without concern (a toy version of this convergence idea is sketched after this list).
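
A toy version of the cfengine-style "converge to the centrally defined state" idea mentioned above; the file path and desired content are made up:

```python
# Converge a local config file toward a centrally defined desired state,
# cfengine-style. The path and content are made-up examples.
from pathlib import Path

DESIRED = {
    Path("/etc/motd"): "Managed centrally. Local edits will be overwritten.\n",
}

def converge() -> None:
    for path, content in DESIRED.items():
        current = path.read_text() if path.exists() else None
        if current != content:
            print(f"drift detected: {path}, rewriting")
            path.write_text(content)
        else:
            print(f"ok: {path}")

if __name__ == "__main__":
    converge()
```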

Cluster Services

Mail and Printing: these are user-level services that are managed and maintained centrally, but are understandable and directly usable by the end users.

Monitoring: another cluster-level service; this one's audience is the admins themselves.
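
Even cluster-level monitoring can start tiny. A hedged sketch that only checks whether each node answers on the ssh port; the hostnames are hypothetical:

```python
# Minimal cluster monitoring: can we open a TCP connection to each node?
# Hostnames are hypothetical; port 22 assumes sshd runs everywhere.
import socket

NODES = ["node01.example.com", "node02.example.com", "node03.example.com"]

def reachable(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for node in NODES:
    print(f"{node}: {'up' if reachable(node) else 'DOWN'}")
```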

More Commentary

Reading this paper from the late 1990s was enlightening and also surreal. Many details have changed (Perl! Cfengine! brrrrrr), but the overall flow of the ideas is 100% solid. In the modern world, when a new cloud provider or "tenant" is onboarded, the sequence of layers is extremely similar.

Surprise: of the dozens of tools/services mentioned, only two are still in common use: DNS and NTP. And admins still love to complain about DNS breaking things.

Surprise: the authors didn't divide things into layers, nor did they mention "audience" except for "client" as in user-facing apps.

Surprise: no security services! I guess a Web Application Firewall would be a big ask in the 1990s, but a central "these things are happening on those machines by these users" service would be valuable - think AWS CloudTrail, CloudWatch Logs, or Splunk.
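
Such a "who did what, where" service could be as simple as every tool shipping a structured event to one central collector. A sketch, with a made-up collector URL:

```python
# Ship a structured audit event to a central collector, in the spirit of
# CloudTrail / CloudWatch Logs / Splunk. The collector URL is made up.
import json
import socket
import urllib.request
from datetime import datetime, timezone

def audit(action: str, user: str) -> None:
    event = {
        "time": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
        "user": user,
        "action": action,
    }
    req = urllib.request.Request(
        "https://audit.example.com/events",   # placeholder collector
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

audit("package_install", "jmitchell")
```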

Similarly, there are no app-dev services: application log or traceback collectors, along the lines of New Relic or Datadog. Years ago, as a dev, we lived by our Sentry dashboard showing us where our app was crashing.

Surprise: the authors put "cluster monitoring" at the very end of the process. They mentioned never getting around to central logging! This was shocking to me: they spend a huge amount of time controlling each layer, without the support of cluster-level feedback mechanisms. "Cluster Admin" is an important audience. Cluster-wide services can be divided into "Infra" (for the admins) or "Common" (for end users). Central logging, networking, and security services are "Infra"; CI/CD pipelines are "Common".

I study and teach Feedback Loops. The central idea is: 1) make a change, 2) receive feedback, 3) adjust the next change based on that feedback. Presumably the authors would make a change from the central server, then ssh into each machine and watch what happened locally. This is fine: it's easy to get multiple high-quality logs and other data locally. However, some problems only show up at the cluster level, over larger time scales.
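
The loop itself is small enough to write down: apply a change, collect a signal, keep it or roll back. Everything here (the apply/check/rollback callables) is a hypothetical placeholder:

```python
# The feedback loop: 1) make a change, 2) receive feedback, 3) adjust.
# The apply/check/rollback callables are placeholders for real cluster work.
import time

def feedback_loop(apply_change, check_health, rollback, settle_seconds: float = 60):
    apply_change()                 # 1) make a change
    time.sleep(settle_seconds)     # let the cluster settle
    if check_health():             # 2) receive feedback
        print("change kept")
    else:                          # 3) adjust: roll back and rethink
        print("health check failed, rolling back")
        rollback()

feedback_loop(
    apply_change=lambda: print("pushing new config"),
    check_health=lambda: True,
    rollback=lambda: print("restoring old config"),
    settle_seconds=1,
)
```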

Developers talk about "Test Driven Development". Instead of developing a feature by just writing code, a feature is developed by 1) writing a test which fails, 2) writing "just enough" feature code to get the test to pass, and 3) refactoring the test and code. Tests are an artifact that requires investment but gives value forever. Test automation gives devs and the business the confidence that new changes don't break business-critical features.

For a cluster (or cloud tenant), this helps tremendously. Build the cluster-wide feedback services first. This gives rapid, reliable, actionable feedback to the whole bootstrap process.
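
Applied to a cluster or tenant, "tests first" might look like a handful of pytest checks that every new environment must pass before anything else is built on it; the hostnames and paths are placeholders:

```python
# test_bootstrap.py -- run with pytest. Cluster-wide feedback checks that a
# new tenant must pass; hostnames and paths are placeholder examples.
import socket
from pathlib import Path

def test_dns_resolves_gold_server():
    assert socket.gethostbyname("goldserver.example.com")

def test_shared_filesystem_present():
    assert Path("/net/shared").is_dir()

def test_first_node_reachable_over_ssh():
    with socket.create_connection(("node01.example.com", 22), timeout=2):
        pass
```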

Test Driven Development hasn't reached all of the Cloud / DevOps world for some reason. Maybe it's time for me to publish more articles and videos...
