Mustafa ERBAY

Posted on Jun 9 • Originally published at mustafaerbay.com.tr

Bootstrap Deadlock: When the DC Needs the Cluster That Needs It

#windowsserver #activedirectory #highavailability #dhcp

Note: All IPs, the domain name (corp.local) and server names in this post are illustrative, not real infrastructure values.

A day that started with a power flicker

The call came in from the site: after a brief power outage, the two-node Hyper-V failover cluster had come back, the servers had booted — but not a single virtual machine was running on them. Cluster disks "offline", a fault light on the SAN controller. The first few minutes went by on pure recovery reflex: bring the disks online, start the cluster service, fire up the roles.

Then came the sentence that changed everything:

"My domain is on one of those VMs too."

That's where the real problem revealed itself.

The chicken-and-egg: a bootstrap deadlock

Let's lay it out:

For the cluster to come up cleanly, it needs Active Directory authentication (cluster account, CSV permissions, Kerberos…).
Active Directory was running on a single Domain Controller.
That Domain Controller was a virtual machine on the very cluster it was supposed to authenticate.

So: the DC needed the cluster to start, and the cluster needed the DC to start properly. This is a bootstrap deadlock — the system depends on itself to bring itself up. In a real outage, this lock can add hours to your recovery.

On top of that, a single DC that also hosts DNS and DHCP means that if that machine goes down for any reason, domain logons fail (apart from cached credentials), DNS stops resolving, and DHCP stops handing out IPs. The classic single point of failure (SPOF).

We survived that day's emergency by restoring the VMs from backup (Acronis → NAS) and running them standalone. But the lesson was unmistakable: this architecture had to be redesigned.

The fix: an independent second Domain Controller

The plan was clear — build a second DC on separate physical hardware, completely independent of the cluster and SAN. That single move solves two problems at once:

No more SPOF: there are now two DCs; if one goes, the other carries on.
The bootstrap deadlock breaks: because the second DC boots without needing the cluster, even in a major outage you have a live AD + DNS in hand. The cluster can authenticate against it and come up.
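To make the deadlock concrete, here is a minimal cold-start simulation. It is only a sketch: the component names (cluster, dc01, dc02) and service names (ad, vm-hosting) are illustrative labels, not values from the real environment. A component is allowed to start once every service it needs is offered by something already running.

```python
# Cold-start simulation: a component starts once each of its needs is met
# by at least one component that is already running.
# Component names are illustrative; "dc02" stands for the independent physical DC.

def cold_start(needs, provides):
    """Return the set of components that can start from a full power-off.

    needs:    component -> set of services it requires before starting
    provides: component -> set of services it offers once running
    """
    running = set()
    progress = True
    while progress:
        progress = False
        # Services offered by everything that is already up.
        available = set()
        for comp in running:
            available |= provides.get(comp, set())
        for comp, required in needs.items():
            if comp not in running and required <= available:
                running.add(comp)
                progress = True
    return running


# Original design: the only DC is a VM on the cluster it authenticates.
needs_before = {"cluster": {"ad"}, "dc01": {"vm-hosting"}}
provides_before = {"cluster": {"vm-hosting"}, "dc01": {"ad"}}

# Redesign: dc02 runs on its own hardware and needs nothing from the cluster.
needs_after = dict(needs_before, dc02=set())
provides_after = dict(provides_before, dc02={"ad"})

print(sorted(cold_start(needs_before, provides_before)))  # []
print(sorted(cold_start(needs_after, provides_after)))    # ['cluster', 'dc01', 'dc02']
```

In the first case nothing can start, which is the deadlock. In the second, dc02 comes up on its own, the cluster authenticates against it, and dc01 follows as a VM.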

Key rules:

No cloning. Copying and booting an existing DC risks USN rollback. A clean install + Install-ADDSDomainController is the right path.
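Keep FSMO roles in place. The operations master roles (PDC emulator, RID master, infrastructure master, schema master, domain naming master) stay on the existing DC; the new one joins as an additional DC. AD is multi-master, so "secondary" in this post only means the DC that holds no FSMO roles.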
Make the new DC a Global Catalog + DNS + DHCP server too, so it provides genuine redundancy.

The constraint: nobody at the machine

The hard part: the server was at a different location, and nobody was physically next to it. All we had was:

iLO (HP's out-of-band management) — a web console,
And a MacBook.

No Rufus to build Windows install media, no hand to plug in a keyboard or USB. Everything had to be done remotely.

Step 1 — A bootable Windows Server USB on macOS (no Rufus)

The Windows ISO's install.wim is larger than 4 GB, and FAT32 caps single files at 4 GB. The fix: split the file into .swm parts. Entirely from the macOS command line:

# 1) Format the USB as FAT32 / MBR (for UEFI boot)
diskutil eraseDisk "MS-DOS FAT32" WINSRV MBR /dev/diskN

# 2) Copy all ISO contents EXCEPT install.wim
rsync -r --no-perms --no-times \
  --exclude='sources/install.wim' "/Volumes/CCCOMA_.../" "/Volumes/WINSRV/"

# 3) Split install.wim into <4GB parts (wimlib-imagex: brew install wimlib)
wimlib-imagex split "/Volumes/CCCOMA_.../sources/install.wim" \
  "/Volumes/WINSRV/sources/install.swm" 3800 --check

# 4) Verify (--ref is REQUIRED for split WIMs)
wimlib-imagex verify "/Volumes/WINSRV/sources/install.swm" \
  --ref="/Volumes/WINSRV/sources/install*.swm"

Gotcha: if two USB sticks are mounted and you format them with the same label, /Volumes/WINSRV can resolve to the old disk and you'll write to the wrong one. Check which device is which with diskutil list, then unmount the old disk first with diskutil unmountDisk force /dev/diskN.
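As a rough sanity check on the split size, the arithmetic can be sketched in Python. This is an estimate under stated assumptions: the 5.2 GiB image size is hypothetical, and wimlib may place part boundaries slightly differently, so treat the count as a guide rather than a guarantee.

```python
import math

MIB = 1024 * 1024
FAT32_MAX_FILE = 2**32 - 1  # largest single file FAT32 can hold, in bytes


def estimate_swm_parts(wim_bytes, part_mib=3800):
    """Estimate how many .swm parts a split at part_mib MiB produces."""
    part_bytes = part_mib * MIB
    if part_bytes > FAT32_MAX_FILE:
        raise ValueError("part size would not fit on FAT32")
    return max(1, math.ceil(wim_bytes / part_bytes))


# Hypothetical install.wim of 5.2 GiB (not a measured value).
example_size = int(5.2 * 1024 * MIB)
print(estimate_swm_parts(example_size))  # 2
```

A part size of 3800 MiB leaves some headroom below the FAT32 limit; asking for 4096 MiB would raise the error, because that is one byte more than FAT32 can store in a single file.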

Step 2 — Install via the iLO console

In the iLO HTML5 console, a Mac's F-keys behave as media keys; Fn+F10 sends a real F10 (e.g. for Intelligent Provisioning). We built RAID1 in Smart Storage Administrator (2× SSD mirrored), then a plain Windows Server 2019 install from USB. Server 2019 already ships the RAID controller driver (P408i-a / smartpqi) in-box, so no extra drivers were needed.

Step 3 — Managing Windows like Linux: OpenSSH

Here's the crux of the whole story. The moment you install OpenSSH Server + key-based auth on Windows Server, the box becomes remotely, programmatically manageable — exactly like a Linux server:

Add-WindowsCapability -Online -Name OpenSSH.Server~~~~0.0.1.0
Set-Service sshd -StartupType Automatic
# For admin keys: administrators_authorized_keys
$key = 'ssh-ed25519 AAAA...your-public-key'
New-Item -ItemType Directory -Force -Path C:\ProgramData\ssh | Out-Null
[IO.File]::WriteAllText("C:\ProgramData\ssh\administrators_authorized_keys", $key,
  (New-Object System.Text.UTF8Encoding($false)))   # NO BOM — or sshd rejects it
icacls C:\ProgramData\ssh\administrators_authorized_keys /inheritance:r `
  /grant "*S-1-5-32-544:F" /grant "*S-1-5-18:F"     # Administrators + SYSTEM only
Start-Service sshd

Two classic traps: (1) if administrators_authorized_keys is UTF-8 with a BOM, sshd silently rejects the key — write it BOM-less. (2) The file's ACL must be Administrators + SYSTEM only; otherwise StrictModes kicks in and you get Permission denied (publickey).
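If the key file is prepared on another machine first, the BOM trap can be caught before the file ever reaches the server. The helper names below are made up for this sketch, and the shape check is deliberately loose (it does not handle key lines that start with options).

```python
import codecs


def normalize_key_file(raw: bytes) -> bytes:
    """Strip a UTF-8 BOM, blank lines and CRLF endings from a key file."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    return b"\n".join(lines) + b"\n"


def looks_like_public_key(line: bytes) -> bool:
    """Very loose shape check: '<type> <base64> [comment]'."""
    fields = line.split()
    return len(fields) >= 2 and fields[0].startswith((b"ssh-", b"ecdsa-", b"sk-"))


# Placeholder key text, written the way a BOM-adding editor would save it.
with_bom = codecs.BOM_UTF8 + b"ssh-ed25519 AAAAexample admin@macbook\r\n"
cleaned = normalize_key_file(with_bom)
print(cleaned)                                  # b'ssh-ed25519 AAAAexample admin@macbook\n'
print(looks_like_public_key(with_bom.strip()))  # False: the BOM hides the key type
print(looks_like_public_key(cleaned.strip()))   # True
```

The second print shows why the failure is so quiet: with the BOM in front, the first field no longer starts with a recognizable key type, so the line is simply not treated as a key.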

The "clipboard doesn't work" problem, and a neat fix

Copy-paste from the Mac into the iLO console didn't work, and typing a long public key by hand is torture. The fix: serve the key from a tiny HTTP server on the Mac and have the server pull it with a short command. (Inside the server's own browser, the clipboard worked fine.)

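# On the Mac (from a folder containing only the public key, saved as a file named "k"):
python3 -m http.server 8000

# On the server: one line, no manual key entry (-m 8 gives up after 8 seconds).
# Apply the icacls line from Step 3 afterwards so the file ACL stays Administrators + SYSTEM only.
curl.exe -m 8 http://<mac-ip>:8000/k -o C:\ProgramData\ssh\administrators_authorized_keys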

Even better: we pre-tested whether the server could reach back to our Mac by running curl from another server on the same subnet. Because the server's gateway is the firewall terminating the VPN, the route worked.
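Because the key travels over plain HTTP, it is worth confirming that what landed on the server is the key that left the Mac. One way, sketched here, is to compare SHA256 fingerprints in the same form that ssh-keygen -lf prints. The function name and the placeholder key blob are assumptions for illustration only.

```python
import base64
import hashlib


def openssh_fingerprint(public_key_line: str) -> str:
    """Return the SHA256 fingerprint of a one-line OpenSSH public key."""
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1], validate=True)
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the base64 digest without trailing '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


# Placeholder blob for illustration; a real key line comes from the .pub file.
fake_blob = base64.b64encode(b"not-a-real-key-blob").decode("ascii")
line_on_mac = f"ssh-ed25519 {fake_blob} admin@macbook"
line_on_server = f"ssh-ed25519 {fake_blob}"

# The comment field is not part of the fingerprint, so these match.
print(openssh_fingerprint(line_on_mac) == openssh_fingerprint(line_on_server))  # True
```

Only the base64 blob is hashed, so a changed or missing comment does not matter, while any change to the key material itself produces a different fingerprint.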

Step 4 — The double-hop trap (SSH + Active Directory)

When you log into SSH with a key, your session has no network credentials (no delegation). So a dcdiag run over SSH that tries to bind to another DC returns "Access Denied" — but that's not a real failure; the session simply carries no credentials. The classic "double-hop" problem.

To see the true health, run the tool in the SYSTEM context (i.e. as the machine account, which does have network identity). A practical trick — a scheduled task:

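# C:\dccheck.cmd contains one line; the parentheses send all three outputs to one log file:
#   (dcdiag /q & repadmin /replsummary & nltest /sc_verify:corp.local) > C:\dccheck.log 2>&1
# Register it as a one-off task that runs as SYSTEM, then trigger it by hand
schtasks /create /tn dccheck /tr "C:\dccheck.cmd" /sc once /st 00:00 /ru SYSTEM /f
schtasks /run /tn dccheck
# Read C:\dccheck.log once the task has finished, then remove the task
schtasks /delete /tn dccheck /f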

Run in SYSTEM context, the RidManager and Connectivity tests came back passed, and the secure channel returned NERR_Success — meaning the DC was healthy all along; the scary lines were just a side effect of the SSH session.

Step 5 — Promotion, DNS and DHCP failover

The promotion itself (from the console, with domain-admin credentials, so passwords stay with the operator):

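# Prerequisite: install the AD DS role binaries before promoting
Install-WindowsFeature AD-Domain-Services -IncludeManagementTools
Install-ADDSDomainController -DomainName "corp.local" `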
  -Credential (Get-Credential "CORP\Administrator") `
  -InstallDns -SiteName "Default-First-Site-Name" `
  -NoGlobalCatalog:$false `
  -SafeModeAdministratorPassword (Read-Host "DSRM" -AsSecureString) -Force

Schema note: Server 2019 and 2022 share the same schema version (objectVersion 88). If your existing DC is 2022, the schema is already at 88, so adding a 2019 DC needs no adprep. Mixed (2019 + 2022) DCs are fully supported: neither release introduced a new functional level, so Windows Server 2016 remains the highest level for both (the next new level arrived with Windows Server 2025).
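The schema reasoning boils down to one comparison, sketched below. The version table reflects the objectVersion values as I understand them from Microsoft's documentation; confirm the current value in your own forest before relying on it. The function name is made up for this example.

```python
# objectVersion of the AD schema as shipped with each Windows Server release.
SCHEMA_VERSION = {
    "2012 R2": 69,
    "2016": 87,
    "2019": 88,
    "2022": 88,
    "2025": 91,
}


def needs_schema_extension(current_object_version: int, new_dc_os: str) -> bool:
    """True if promoting a DC of new_dc_os would have to extend the schema."""
    return SCHEMA_VERSION[new_dc_os] > current_object_version


print(needs_schema_extension(88, "2019"))  # False: a 2022 forest is already at 88
print(needs_schema_extension(87, "2019"))  # True: a 2016 forest would be extended
```

When an extension is needed, the promotion normally runs the adprep steps itself, provided the account used has the required schema and enterprise rights.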

For DNS redundancy, hand clients two DNS servers:

Set-DhcpServerv4OptionValue -ScopeId <scope> -OptionId 6 -Value 10.0.10.10,10.0.10.11

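And finally DHCP failover in load-balance mode: the two servers share the load, and if one dies the other keeps renewing existing leases and, once the relationship enters the partner-down state, takes over the entire pool: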

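# $secret is a plain string here (-SharedSecret takes a String, not a SecureString)
$secret = Read-Host "DHCP failover shared secret"
Add-DhcpServerv4Failover -ComputerName "DC01.corp.local" -Name "DC01-DC02" `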
  -PartnerServer "DC02.corp.local" -ScopeId 10.0.10.0,10.0.20.0 `
  -LoadBalancePercent 50 -SharedSecret $secret -Force
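To picture what a 50 percent load-balance split means, here is a simplified sketch. As I understand it, the real servers hash the client MAC address using the algorithm from RFC 3074; the SHA-256 bucket below is a stand-in chosen for brevity, not that algorithm, and the function name is hypothetical.

```python
import hashlib


def serving_partner(mac: str, primary_percent: int = 50) -> str:
    """Pick which failover partner answers a client (simplified stand-in hash)."""
    if not 0 <= primary_percent <= 100:
        raise ValueError("primary_percent must be between 0 and 100")
    normalized = mac.lower().replace("-", ":").encode("ascii")
    bucket = hashlib.sha256(normalized).digest()[0]  # 0..255, stable per MAC
    return "DC01" if bucket < 256 * primary_percent / 100 else "DC02"


# The same client always lands on the same server, whatever the notation.
print(serving_partner("AA-BB-CC-00-11-22") == serving_partner("aa:bb:cc:00:11:22"))  # True
```

The point of hashing the MAC is that both partners can decide independently, without talking to each other, which of them should answer a given client.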

The result: how it works now

Failure	Outcome
Primary DC goes down	No login/DNS/DHCP outage: the secondary serves them (FSMO-dependent operations wait until the primary returns)
Cluster / SAN goes down	AD/DNS/DHCP stay up — the secondary is independent
Secondary DC goes down	No problem — the primary can already do everything
A disk dies	RAID1 keeps the DC running

The bootstrap deadlock is broken, the SPOF is gone. The cluster no longer depends on a VM running on itself to come up — an independent DC is always there.

Takeaways

Never run a single DC. Always at least two Domain Controllers — ideally on different physical foundations.
Don't virtualize your only DC on the cluster it's meant to bootstrap. That's the deadlock.
OpenSSH + keys make Windows ops as fluid as Linux — but watch BOM and ACLs on administrators_authorized_keys.
Know the double-hop: an SSH key session has no network identity; run AD tools in the SYSTEM context.
2019 + 2022 DCs coexist cleanly — same schema (88), no adprep.

Sometimes the most resilient architecture starts with honestly answering one question: "if this fails, what's still standing?"
