Solomon Neas

Originally published at solomonneas.dev

Replacing SCCM with FOG Project

When I moved our infrastructure from Hyper-V to Proxmox, I also took the chance to rip out one of the heaviest pieces of the old stack: SCCM.

For an enterprise with thousands of endpoints and a full Microsoft licensing budget, SCCM can make sense. For four instructional labs with 72 workstations, it was too much. It wanted Windows Server, SQL Server, licensing baggage, and constant babysitting just to do the thing I actually needed: boot a machine over the network, lay down a clean image, put it back in the domain, and get out of the way.

FOG Project was the answer.

This post is the technical deep-dive for the imaging side of the migration. For the full Hyper-V to Proxmox migration story, see the migration deep-dive.

The environment I was replacing

The target was straightforward on paper:

  • 4 classrooms
  • 72 total workstations
  • 3 hardware-specific Windows 11 images
  • 47 machines with known MAC addresses at import time
  • 25 machines that would need to self-register on first PXE boot
  • Active Directory domain join after imaging
  • Room-specific OU placement

The architecture ended up looking like this:

  • FOG server: Debian 13 (Trixie) LXC on Proxmox, 10.0.1.20
  • Domain controllers / DHCP: Windows Server, 10.0.1.10 and 10.0.1.11
  • Domain: LAB.LOCAL
  • Join account: svc-domainjoin@LAB.LOCAL
  • Rooms: Room-A, Room-B, Room-C, Room-D
  • Images: Win11-Lab-G4, Win11-Lab-Ultrawide, Win11-Lab-G9

I did not build images in Proxmox VMs. I built them on reference machines that matched the real lab hardware. The rooms were not identical. One had HP G4s, one had HP G9s, and two rooms had Dell FCT2250s paired with Dell UltraSharp 49" curved monitors (U4924DW/U4919DW), which needed their own driver set.

Three golden images, three hardware profiles:

Room      Image
Room-A    Win11-Lab-G4
Room-B    Win11-Lab-Ultrawide
Room-C    Win11-Lab-Ultrawide
Room-D    Win11-Lab-G9

Active Directory prep

The Linux box gets the attention in a FOG write-up, but the boring Windows prep is what made the deployment clean.

I created a least-privilege service account called svc-domainjoin and delegated only what FOG needed on the classroom OUs: create computer objects, delete computer objects, and full control on descendant computer objects.

The OU layout was one OU per room. Then I delegated permissions with dsacls:

dsacls "OU=Room-A,DC=LAB,DC=LOCAL" /I:T /G "LAB\svc-domainjoin:CC;computer"
dsacls "OU=Room-A,DC=LAB,DC=LOCAL" /I:T /G "LAB\svc-domainjoin:DC;computer"
dsacls "OU=Room-A,DC=LAB,DC=LOCAL" /I:S /G "LAB\svc-domainjoin:GA;;computer"

Same pattern on all four classroom OUs.
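Since the three commands differ only by OU, they are easy to generate. A minimal shell sketch that prints the full set for review before anything gets pasted into an elevated prompt on a domain controller. The function name print_dsacls is made up for illustration, and the script itself changes nothing in AD:

```shell
#!/bin/sh
# Print the three dsacls delegation commands for each classroom OU.
# Output only: review it, then run the lines on a domain controller.
print_dsacls() {
    account='LAB\svc-domainjoin'
    for room in "$@"; do
        ou="OU=${room},DC=LAB,DC=LOCAL"
        # Create and delete computer objects in the OU and below.
        printf 'dsacls "%s" /I:T /G "%s:CC;computer"\n' "$ou" "$account"
        printf 'dsacls "%s" /I:T /G "%s:DC;computer"\n' "$ou" "$account"
        # Full control on descendant computer objects.
        printf 'dsacls "%s" /I:S /G "%s:GA;;computer"\n' "$ou" "$account"
    done
}

print_dsacls Room-A Room-B Room-C Room-D
```

Printing instead of executing keeps the Linux side out of AD entirely and leaves a reviewable record of what was delegated.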

Before touching imaging at all, I cleaned house in AD: moved 38 misplaced computer objects into the right OUs, deleted 60 stale ones from old naming schemes, unlinked the SCCM GPO from the room OUs, and exported a CSV of all 72 lab machines.

Then I set DHCP options 66 and 67 on the classroom scopes to point to FOG:

  • Option 66: 10.0.1.20
  • Option 67: initially ipxe.efi

Should have been enough. It was not.

The SCCM ghost in DHCP

PXE still misbehaved on some scopes even with the scope-level options pointing to FOG.

The culprit was stale SCCM and WDS policies still attached to several scopes. On Windows DHCP, option values set in a policy take priority over scope-level options for any client that matches the policy, so clients matching the PXE vendor class were quietly getting sent to the retired SCCM server at 10.0.1.14 instead of FOG at 10.0.1.20. The old policies referenced smsboot\x64\wdsmgfw.efi and smsboot\x64\wdsnbp.com with the old boot server IP.

Eleven of those policies, spread across five scopes. Once I removed them all, PXE stopped getting hijacked by dead infrastructure.

Note: If PXE settings look right but clients keep booting somewhere else, check DHCP policies before you blame FOG.
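One way to audit this offline is to list each scope's policies with their option 66/67 values and filter for anything that does not point at FOG. The sketch below assumes a hand-made export with one line per policy in the form scope,policy,boot-server,boot-file. That format and the function name are hypothetical, not something the DHCP server produces natively:

```shell
#!/bin/sh
# Flag DHCP policies whose PXE options do not point at the FOG server.
# Input (stdin): one policy per line, "scope,policy,boot-server,boot-file".
# The input format is an assumption; build it from whatever export you have.
stale_pxe_policies() {
    fog_ip="$1"
    awk -F',' -v fog="$fog_ip" '
        NF < 4 { next }   # skip blank or incomplete lines
        # Suspect: names another boot server, or still uses an smsboot path.
        $3 != fog || tolower($4) ~ /smsboot/ {
            print $1 ": " $2 " -> " $3 " " $4
        }
    '
}

# Usage: stale_pxe_policies 10.0.1.20 < policies.csv
```

An empty result means every listed policy already hands out the FOG address; any line printed is a policy worth opening in the DHCP console.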

FOG on Debian Trixie: three bugs in a trenchcoat

The FOG server was a Debian 13 LXC container on Proxmox. When I first looked at it, the web UI was running, which made it look installed.

It was not. The backend was missing entirely. No TFTP boot files. No NFS exports. No FOG services. The web UI was a shell with nothing behind it.
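A quick check along these lines separates "the web UI answers" from "the backend exists". It is a sketch: the function names are made up, and the TFTP root, boot file names, and unit names in the usage comment are assumptions to adjust to what your installer version actually creates:

```shell
#!/bin/sh
# Report whether each expected boot file exists and is non-empty.
# Returns non-zero if any file is missing.
check_boot_files() {
    tftp_root="$1"; shift
    missing=0
    for f in "$@"; do
        if [ -s "$tftp_root/$f" ]; then
            printf 'ok       %s\n' "$f"
        else
            printf 'MISSING  %s\n' "$f"
            missing=1
        fi
    done
    return "$missing"
}

# Report whether each systemd unit is active.
# Returns non-zero if any unit is down.
check_units() {
    failed=0
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            printf 'ok       %s\n' "$unit"
        else
            printf 'DOWN     %s\n' "$unit"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (paths and unit names are assumptions):
#   check_boot_files /tftpboot ipxe.efi snponly.efi
#   check_units FOGScheduler FOGImageReplicator FOGMulticastManager
```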

Re-ran the installer:

cd /tmp/fogproject/bin
bash installfog.sh -y

Trixie had other plans.

Bug 1: wrong osid

FOG had the wrong OS ID in .fogsettings: it had detected the machine as Arch Linux instead of Debian. The fix was to change this line in /opt/fog/.fogsettings from:

osid='3'

to:

osid='2'

Without that, every package operation targeted the wrong distro.
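If you would rather script that edit than open an editor, a small helper can keep a backup and rewrite only the osid line. This is a sketch with a made-up function name, and it assumes the file contains a single osid='...' line in the form shown above:

```shell
#!/bin/sh
# Rewrite the osid line in a FOG settings file, keeping a .bak copy.
# All other lines are passed through unchanged.
set_osid() {
    settings="$1"
    new_id="$2"
    cp -p "$settings" "$settings.bak" || return 1
    sed "s/^osid=.*/osid='$new_id'/" "$settings.bak" > "$settings"
}

# Usage: set_osid /opt/fog/.fogsettings 2
```

Reading from the backup and writing to the original avoids in-place sed flags, which behave differently between GNU and BSD sed.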

Bug 2: libcurl4 renamed on Debian 13

Debian 13's t64 transition renamed libcurl4 to libcurl4t64. FOG's installer still asked for the old name. I patched the package check in functions.sh to handle Debian >= 13.
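The shape of that check is roughly the following. This is not FOG's actual functions.sh code, just a sketch of the version test. The function name is made up, and it only covers Debian:

```shell
#!/bin/sh
# Pick the libcurl package name for a given Debian major version.
# Debian 13 and later use the t64 name; earlier releases use libcurl4.
debian_libcurl_pkg() {
    major="$1"
    if [ "$major" -ge 13 ]; then
        echo libcurl4t64
    else
        echo libcurl4
    fi
}

# Usage on the server itself:
#   . /etc/os-release
#   debian_libcurl_pkg "${VERSION_ID%%.*}"
```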

Bug 3: lastlog became lastlog2

FOG's installer checks lastlog to verify user creation. Trixie dropped it for lastlog2. Quick fix:

apt-get install -y lastlog2
ln -sf /usr/bin/lastlog2 /usr/local/bin/lastlog

Once I patched all three, the installer completed and the backend finally matched the web UI.

NFS inside LXC: the container wall

With the installer patched, the next wall was image storage. FOG uses NFS for capture and deployment. Inside a Proxmox LXC container, kernel NFS does not cooperate:

mount: /proc/fs/nfsd: permission denied

I tried the usual Proxmox container feature flags:

pct set <CTID> -features nesting=1,mount=nfs
pct reboot <CTID>

Better, but still not a reliable kernel NFS server. I stopped fighting it and went userspace: nfs-ganesha.

apt install nfs-ganesha nfs-ganesha-vfs

One annoying gotcha: Ganesha's pseudo paths cannot nest. Having /images and /images/dev as pseudo paths breaks child lookup. I had to flatten the namespace.

The working /etc/ganesha/ganesha.conf:

NFS_CORE_PARAM {
    Protocols = 3,4;
    mount_path_pseudo = true;
    allow_set_io_flusher_fail = true;
}

EXPORT {
    Export_Id = 1;
    Path = /images;
    Pseudo = /images;
    Protocols = 3,4;
    Access_Type = RO;
    Squash = no_root_squash;
    FSAL { Name = VFS; }
    CLIENT { Clients = *; Access_Type = RO; }
}

EXPORT {
    Export_Id = 2;
    Path = /images/dev;
    Pseudo = /imagesDev;
    Protocols = 3,4;
    Access_Type = RW;
    Squash = no_root_squash;
    FSAL { Name = VFS; }
    CLIENT { Clients = *; Access_Type = RW; }
}

The critical line for LXC: allow_set_io_flusher_fail = true;. Without it, Ganesha dies with EPERM on PR_SET_IO_FLUSHER. With it, Ganesha treats the failed prctl() call as non-fatal and keeps running.

Verify with showmount -e localhost and rpcinfo -p localhost. If showmount lists both exports and rpcinfo shows the nfs and mountd programs registered, NFS is serving.
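To make that check scriptable, the export listing can be piped through a small filter. A sketch with a made-up function name: it compares the first column of each line against the paths you pass in, so hand it the names exactly as your showmount output prints them:

```shell
#!/bin/sh
# Read a "showmount -e" listing on stdin and confirm that every path
# given as an argument appears as an export (exact match on column 1).
exports_present() {
    listing=$(cat)
    for path in "$@"; do
        if ! printf '%s\n' "$listing" |
            awk -v p="$path" '$1 == p { found = 1 } END { exit !found }'
        then
            printf 'missing export: %s\n' "$path" >&2
            return 1
        fi
    done
}

# Usage: showmount -e localhost | exports_present /images
```

The exact match matters here because one export name is a prefix of the other.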

Note: Kernel NFS in LXC was a dead end. Userspace NFS was the practical way through it.

The golden image pipeline

With the server side stable, the real work was image prep. Same process every time:

  1. clean install Windows 11 on the reference machine
  2. install drivers, software, updates, and room-specific settings
  3. keep it off the domain
  4. install the FOG client if needed for post-deploy tasks
  5. clean it aggressively
  6. sysprep and shut down
  7. PXE boot and capture in FOG

Steps 5 and 6 became their own thing: a cleanup one-liner I kept copy-pasting between machines, ending in the sysprep call. Ugly, but it works:

Stop-Service wuauserv -Force; Remove-Item C:\Windows\SoftwareDistribution\* -Recurse -Force -EA SilentlyContinue; Start-Service wuauserv; Dism /online /Cleanup-Image /StartComponentCleanup /ResetBase; Remove-Item C:\Windows\Panther\* -Recurse -Force -EA SilentlyContinue; Remove-Item C:\Windows\Temp\* -Recurse -Force -EA SilentlyContinue; Remove-Item C:\Windows\Prefetch\* -Recurse -Force -EA SilentlyContinue; Remove-Item "$env:TEMP\*" -Recurse -Force -EA SilentlyContinue; Clear-RecycleBin -Force -EA SilentlyContinue; cleanmgr /sagerun:1; C:\Windows\System32\Sysprep\sysprep.exe /oobe /generalize /shutdown

I started calling it the Frankenstein command. It's stitched together from half a dozen different guides and does way too much in one line. But for repeatable image prep, it earned its place.

Broken down:

  • Stop-Service wuauserv -Force / Remove-Item SoftwareDistribution / Start-Service wuauserv: kill Windows Update, nuke the cache, bring it back.
  • Dism /online /Cleanup-Image /StartComponentCleanup /ResetBase: collapse superseded component versions. Frees real space.
  • The four remaining Remove-Item calls: purge Panther logs, system temp, Prefetch, and user temp.
  • Clear-RecycleBin: self-explanatory.
  • cleanmgr /sagerun:1: run Disk Cleanup with the option set saved earlier under profile 1 (via cleanmgr /sageset:1).
  • sysprep.exe /oobe /generalize /shutdown: strip machine identity, stage OOBE, power off.

Note: After sysprep shuts the machine down, do not let it boot back into Windows before capture. If it does, you just spent your sysprep state and get to do it again.

Sysprep on Windows 11: Appx package hell

The worst part of this project was not FOG. It was Sysprep.

If you've fought Windows 11 Sysprep before, you know the error. If you haven't, it looks like this in the Panther logs:

Sysprep was not able to validate your Windows installation.

More specifically:

Package <app> was installed for a user, but not provisioned for all users. This package will not function properly in the sysprep image.
Failed to remove apps for the current user: 0x80073cf2.

On Windows 11 23H2 and 24H2, the Microsoft Store silently updates Appx packages per-user. That creates a version mismatch between the installed state and the provisioned state. Sysprep sees it, panics, refuses to generalize.

What I tried that did not work:

  • Remove-AppxPackage -AllUsers
  • Remove-AppxProvisionedPackage
  • the mythical SkipAppxValidation registry setting
  • editing Generalize.xml
  • renaming AppxSysprep.dll

Some fail outright. Some appear to work, then the packages come back after a reboot. Some just swap one error for a different one.

What actually worked: changing the source image entirely.

I used Chris Titus Tech's WinUtil to build a MicroWin ISO. Instead of installing stock Windows and spending hours ripping Store junk back out, I started from a clean ISO that never had the Appx baggage in the first place.

Prevention over cleanup:

  1. build MicroWin ISO
  2. install on the reference machine
  3. configure drivers and software
  4. disconnect from the network before sysprep if needed
  5. run the cleanup command
  6. sysprep, shut down, capture immediately

That was the difference between fighting Sysprep every time and having it just work.

PXE on newer hardware: snponly.efi

One more issue that looked like FOG but wasn't. A newer motherboard would download ipxe.efi, start up, and freeze at:

iPXE initializing devices...

No ok. Just stuck.

iPXE's built-in NIC driver couldn't handle the newer chipset. The fix: switch DHCP option 67 from ipxe.efi to snponly.efi.

The difference is simple. ipxe.efi ships its own NIC drivers. snponly.efi delegates to the UEFI firmware's native SNP driver. On newer boards, the firmware driver works where iPXE's doesn't.

One DHCP change, ipxe.efi to snponly.efi, and the machine booted straight to the FOG menu. It became the default for all our UEFI clients.

72 systems, deployed by room

With images captured and PXE stable, FOG could finally do its job.

Bulk imported the workstation list from CSV:

  • 47 had MAC addresses ready for direct import
  • 25 had no active lease at the time, so they were left to self-register on their first PXE boot

Each host got the right image and room group. Deploying a room was: select the group, schedule a deploy task, PXE boot the room. That's it.
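The host-to-image assignment can be sketched as a small planning script. Assumptions, all hypothetical: the input CSV has a header and the columns hostname,mac,room, the function names are made up, and the output is a planning file for review, not FOG's own import format:

```shell
#!/bin/sh
# Room-to-image mapping taken from the table earlier in the post.
image_for_room() {
    case "$1" in
        Room-A)        echo Win11-Lab-G4 ;;
        Room-B|Room-C) echo Win11-Lab-Ultrawide ;;
        Room-D)        echo Win11-Lab-G9 ;;
        *) printf 'unknown room: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Read "hostname,mac,room" rows (first line is a header) and write:
#   $2: hostname,mac,image for machines that can be imported now
#   $3: hostnames with no MAC yet, left to self-register
plan_hosts() {
    csv="$1"; ready="$2"; pending="$3"
    : > "$ready"
    : > "$pending"
    tail -n +2 "$csv" | tr -d '\r' |
    while IFS=',' read -r host mac room || [ -n "$host" ]; do
        [ -n "$host" ] || continue
        if [ -z "$mac" ]; then
            printf '%s\n' "$host" >> "$pending"
            continue
        fi
        # Rows with an unknown room are skipped; the warning goes to stderr.
        image=$(image_for_room "$room") || continue
        printf '%s,%s,%s\n' "$host" "$mac" "$image" >> "$ready"
    done
}

# Usage: plan_hosts lab-machines.csv ready.csv pending.txt
```

Keeping the MAC-less machines in their own file gives a checklist to tick off as they register themselves.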

FOG uses Partclone under the hood, so it only transfers used blocks. A 500 GB drive with 30 GB used doesn't push 500 GB over the wire. It pushes roughly 30 GB, compressed. Room-wide imaging is faster than people expect.
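The back-of-the-envelope arithmetic, as a sketch: payload size in decimal gigabytes over an assumed link speed. The 1 Gbps figure is an assumption for illustration, and the result ignores compression (which shrinks the payload) as well as disk speed and protocol overhead (which slow things down), so treat it as a rough comparison between the two cases, not a prediction:

```shell
#!/bin/sh
# Seconds needed to push a payload over a link at full line rate.
#   $1: payload in decimal gigabytes
#   $2: link speed in megabits per second (assumed, not measured)
wire_seconds() {
    gigabytes="$1"
    link_mbps="$2"
    # GB -> megabits is x 8 x 1000; divide by Mbps to get seconds.
    echo $(( gigabytes * 8 * 1000 / link_mbps ))
}

wire_seconds 30 1000    # used blocks only: 240
wire_seconds 500 1000   # whole disk: 4000
```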

After deployment, the FOG client sets the hostname, joins the domain via svc-domainjoin, and drops the computer object in the correct OU. Multicast is also available for imaging an entire room simultaneously.

What I learned

Setting up FOG is not just "install FOG." It's DHCP policy archaeology, AD delegation, Windows image hygiene, boot loader compatibility, and the reality of running services inside Linux containers.

The short list:

  • DHCP policies override scope options. Check for leftover WDS/SCCM policies before blaming FOG.
  • A running web UI does not mean FOG is installed. Verify TFTP, services, and NFS.
  • FOG on Debian Trixie needs manual patches. osid, libcurl4t64, lastlog2.
  • Kernel NFS in Proxmox LXC is a dead end. Use nfs-ganesha.
  • Build images on reference machines. An image built in a VM never sees the real hardware, so it misses the drivers and quirks that matter.
  • Debloat before install, not after. MicroWin saves hours of Appx cleanup.
  • Test Sysprep on something disposable first. Appx breakage on your finished image is a bad time.
  • snponly.efi for newer hardware. If iPXE hangs at device init, switch boot files.
  • Never boot a sysprepped machine before capture. Generalize is one-shot.
  • FOG's hostName is VARCHAR(16). Plan your naming accordingly.
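On that last point, a naming check is cheap to run against the host list before import. A sketch with a made-up function name: the 15-character cap is the NetBIOS computer-name limit, which is tighter than the 16-character column, and the letters-digits-hyphens rule is a conservative assumption about what to allow:

```shell
#!/bin/sh
# Accept a hostname only if it is 1 to 15 characters long, uses only
# letters, digits, and hyphens, and does not start with a hyphen.
valid_hostname() {
    name="$1"
    [ -n "$name" ] || return 1
    [ "${#name}" -le 15 ] || return 1
    case "$name" in
        *[!A-Za-z0-9-]*) return 1 ;;   # disallowed character
        -*)              return 1 ;;   # leading hyphen
    esac
    return 0
}

# Usage: valid_hostname "$host" || echo "rename: $host"
```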

Once the rough edges were filed down, the day-to-day got simple. Pick the room. PXE boot. Image. Domain join. Done. Way less overhead than SCCM ever was.

That's the kind of boring I want from lab infrastructure.

This post is part of a larger infrastructure migration series. For the full Hyper-V to Proxmox migration story, see the project page on solomonneas.dev.
