Large desktop migration programs look the same across industries: a flood of help‑desk tickets the first morning, one or two business‑critical applications failing, local teams scrambling to roll back or re‑image devices, and a project team forced to reset timelines. Those symptoms trace to three predictable gaps: an incomplete master inventory, a brittle packaging and testing factory, and a rollout cadence that tries to move too fast without real telemetry or rollback controls.
Contents
- Designing migration waves that shrink the blast radius and speed recovery
- Run a pilot that forces failure and drives real fixes
- Own the wave-day: runbooks, monitoring, and rollback controls
- Scale the packaging factory and cadence: telemetry, testing, and governance
- Practical Application: checklists, templates, and a 12-week sample schedule
- Sources
Designing migration waves that shrink the blast radius and speed recovery
The principle is simple and operational: treat each wave as a controlled experiment where your goal is to discover failure modes, not to prove success. A tightly scoped wave reduces the number of simultaneous unknowns — fewer hardware models, fewer local network variables, and a smaller set of business‑critical apps — which shortens root‑cause time and lowers the cost of rollback.
Practical prioritization criteria (use these to score and group users/devices)
- Business criticality — Does the user run finance close‑of‑month apps, trading platforms, or clinical systems? Weight = high.
- Application risk — Number and uniqueness of Line‑of‑Business (LOB) apps installed; apps without vendor support get higher risk scores.
- Device readiness — Firmware, TPM, UEFI, and driver currency; hardware models with known driver gaps get flagged.
- Support locality — On‑site vs remote, time‑zone impact, local IT availability.
- User tolerance / schedule sensitivity — Roles with inflexible windows (e.g., trading desk, clinicians) go later.
Scoring quick example (normalized 0–10):
score = business_criticality*4 + app_risk*3 + support_complexity*2 + (10 - hardware_readiness)*1

- Sort descending; the highest scores should go last (do not wave them early).
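As a sketch, the scoring rule can be applied directly to CMDB records and used to order waves; the field names and example users below are illustrative, not from a real export:

```python
# Wave-scoring sketch: each factor is assumed pre-normalized to a 0-10 scale.
def wave_score(d: dict) -> int:
    return (d["business_criticality"] * 4
            + d["app_risk"] * 3
            + d["support_complexity"] * 2
            + (10 - d["hardware_readiness"]) * 1)

# Illustrative records (hypothetical users).
devices = [
    {"user": "trader-01", "business_criticality": 9, "app_risk": 8,
     "support_complexity": 6, "hardware_readiness": 7},
    {"user": "backoffice-17", "business_criticality": 2, "app_risk": 3,
     "support_complexity": 2, "hardware_readiness": 9},
]

# Sort descending so the highest-risk users land in the latest waves.
ordered = sorted(devices, key=wave_score, reverse=True)
print([(d["user"], wave_score(d)) for d in ordered])
```

Low-scoring, low-risk users go into early waves; the trading desk waits until the packaging backlog has been burned down.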
Wave sizing — heuristics and cadence
| Org size | Pilot (representative) | Typical wave size | Cadence (time between waves) |
|---|---:|---:|---:|
| Small (≤ 500 users) | 10–25 users | 25–100 | 1–2 weeks |
| Medium (500–5,000) | 50–200 users | 100–500 | 2–4 weeks |
| Large (5,000+) | 200–1,000 users | 500–3,000 | 2–6 weeks |
These are heuristics you can adapt to support capacity and app complexity. A helpful rule-of-thumb many teams use is to keep the pilot under ~5–10% of the estate so you surface a broad range of behaviors without overwhelming support capacity.
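To make the rule of thumb concrete, a minimal sizing helper (a sketch assuming a 5% default and bounds loosely taken from the table above) might look like:

```python
import math

# Heuristic pilot sizing: aim for ~5% of the estate, clamped to the
# representative pilot ranges from the table (10 to 1,000 users).
def pilot_size(estate: int, fraction: float = 0.05,
               floor: int = 10, ceiling: int = 1000) -> int:
    return max(floor, min(ceiling, math.ceil(estate * fraction)))

print(pilot_size(400))    # small org
print(pilot_size(3000))   # medium org
print(pilot_size(20000))  # large org, capped at the ceiling
```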
Contrarian insight: don't build your pilot only from "IT champions." Champions bias the sample toward lucky outcomes. Include typical, edge‑case, and low‑volume but mission‑critical users to reveal the real failure modes early.
Run a pilot that forces failure and drives real fixes
Run the pilot as a forensic exercise, not a publicity stunt. Design the pilot deliberately to fail fast on things you care about: app compatibility, driver behavior, user profile restoration, SSO flows, and local printers/peripherals.
Pilot execution checklist (high‑impact sequence)
- Lock the pilot scope to a fixed set of apps and devices; export a `pilot-devices.csv` that contains `asset_tag, user_id, os_version, app_list, business_unit`.
- Package and deliver the base image and top 20 business apps (accept only packages that pass automated smoke tests).
- Schedule white‑glove installs for the first 24 devices with on‑site or remote support present.
- Collect telemetry during install: success/fail per install step, driver errors, app crash codes, `Windows Error Reporting` events, and helpdesk ticket tags (use a consistent taxonomy).
- Run a 48–72 hour validation window, then a 7–14 day soak to reveal intermittent issues.
- Hold a short, evidence‑driven retrospective: every defect must map to a corrective action in the packaging backlog with an SLA.
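Because the whole pilot is allocated against that export, it is worth gating on a quick schema check. This is a minimal sketch; the required columns come from the checklist above:

```python
import csv
import io

# Columns the pilot export must carry, per the checklist.
REQUIRED = {"asset_tag", "user_id", "os_version", "app_list", "business_unit"}

def validate_pilot_csv(text: str) -> list[str]:
    """Return a list of problems; an empty list means the export looks usable."""
    reader = csv.DictReader(io.StringIO(text))
    problems = []
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        empty = [k for k in sorted(REQUIRED) if not (row[k] or "").strip()]
        if empty:
            problems.append(f"line {line_no}: empty fields {empty}")
    return problems

sample = ("asset_tag,user_id,os_version,app_list,business_unit\n"
          "A100,jdoe,10.0.19045,excel;sap,Finance\n")
print(validate_pilot_csv(sample))   # an empty list when the export is clean
```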
What to measure — pilot SLIs
- Install success rate (target ≥ 98% for baseline apps).
- App crash rate within 48 hours (target < 1% for critical apps).
- Helpdesk tickets per user for the pilot vs pre‑pilot baseline (compare hourly/daily windows).

Ground those SLIs in the four golden signals approach when feasible: latency (perceived slowness after migration), traffic (load spikes on services), errors (app failures), and saturation (resource exhaustion on upgraded images).
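A small helper can turn the SLI targets into a pass/fail readout for the retrospective; the counts below are illustrative:

```python
# Evaluate the pilot SLIs against the stated targets:
# install success >= 98%, critical-app crash rate within 48h < 1%.
def pilot_slis(installs_ok: int, installs_total: int,
               crashes_48h: int, critical_sessions: int) -> dict:
    install_rate = installs_ok / installs_total
    crash_rate = crashes_48h / critical_sessions
    return {
        "install_success_rate": install_rate,
        "install_target_met": install_rate >= 0.98,
        "crash_rate_48h": crash_rate,
        "crash_target_met": crash_rate < 0.01,
    }

print(pilot_slis(installs_ok=197, installs_total=200,
                 crashes_48h=1, critical_sessions=180))
```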
Use vendor remediation programs when you hit hard compatibility issues; for Windows ecosystems Microsoft’s App Assure and Test Base programs are engineered to help surface and remediate LOB and ISV compatibility gaps and can materially reduce the packaging backlog.
Own the wave-day: runbooks, monitoring, and rollback controls
A successful wave day depends on discipline: a single source of truth (the master device inventory), a crisp runbook, and telemetry that answers one question instantly — “Is this wave safe to expand?” Design your runbook to force the team to answer that question at scheduled checkpoints.
Wave‑day runbook (executive summary)
- T‑48 hours: Final membership locked; pre‑flight health checks (battery/firmware/antivirus/backup). Owner: Device Readiness Lead.
- T‑8 hours: Communication sent to wave users with exact start window, expected downtime, and helpdesk contact. Owner: Communications.
- T‑0 to T+2 hours: First install tranche (10% of wave) executed, automated health check runs. Owner: Deployment Lead.
- T+2 to T+6 hours: Triage window — evaluate SLIs, first‑response tickets triaged to owners, patch any blocking fixes.
- T+6 hours: Decision gate — expand to next tranche (if SLIs within threshold) or pause/rollback (if thresholds breached). Owner: Migration Lead.
Decision gate thresholds — practical heuristics
- Pause if critical‑app failure rate > 3% across the tranche or if helpdesk volume for the tranche exceeds 5x normal for a sustained 60‑minute window.
- Rollback option matrix:
- For individual app failures: targeted remediation or app rollback (remove/update package).
- For systemic OS/driver failures: image rollback to baseline or re‑image (plan this as a scripted, automated operation). Note: Microsoft Autopatch supports phased releases and pause/resume behavior but does not provide a user-facing rollback for feature updates — you must plan rollback/rescue paths in your runbook.
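The decision gate above can be encoded so the expand/pause call is mechanical rather than debated on the bridge. A sketch, with the thresholds from the heuristics (3% critical-app failures, 5x helpdesk volume sustained for 60 minutes):

```python
# Decision-gate sketch for a tranche: returns "expand" or "pause".
def gate_decision(critical_failures: int, tranche_size: int,
                  tickets_per_hour: float, baseline_per_hour: float,
                  sustained_minutes: int) -> str:
    # Pause on critical-app failure rate above 3% of the tranche.
    if critical_failures / tranche_size > 0.03:
        return "pause"
    # Pause on helpdesk volume above 5x baseline for a sustained hour.
    if tickets_per_hour > 5 * baseline_per_hour and sustained_minutes >= 60:
        return "pause"
    return "expand"

print(gate_decision(critical_failures=2, tranche_size=100,
                    tickets_per_hour=12, baseline_per_hour=4,
                    sustained_minutes=30))   # within thresholds
print(gate_decision(critical_failures=5, tranche_size=100,
                    tickets_per_hour=3, baseline_per_hour=4,
                    sustained_minutes=0))    # failure rate breached
```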
Monitoring and observability
- Instrument the small set of golden signals for each wave: request latency for critical services (if applicable), endpoint error rates, device check‑in rates, and helpdesk ticket counts.
- Build a single Wave Dashboard that correlates device health, app telemetry, helpdesk queue, and build/deployment status so the migration lead can make the expand/pause decision with data, not opinion. Follow SRE guidance: monitor latency, traffic, errors, and saturation and page only on clear, high‑signal conditions.
On escalations and vendor engagement
- Pre‑establish vendor SLAs and contact trees for top 10 LOB apps; capture P1 escalation numbers in your runbook. Track time‑to‑vendor‑response as a pilot KPI.
Important: The master device and user database must be authoritative and automated. If your CMDB is stale, your waves will be allocated against incorrect targets and support will fracture. Federate discovery sources, reconcile, and assign ownership in the CMDB before the pilot.
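Federating discovery sources can start very simply: intersect what your management tooling reports with what network discovery sees, and flag anything one-sided for reconciliation before wave assignment. A sketch with hypothetical source names and keys:

```python
# Reconcile two discovery sources into one authoritative view.
# Devices seen by only one source go to a review queue, not a wave.
def reconcile(mdm: dict[str, dict], network_scan: dict[str, dict]) -> dict:
    mdm_only = set(mdm) - set(network_scan)
    scan_only = set(network_scan) - set(mdm)
    agreed = {tag: {**network_scan[tag], **mdm[tag]}
              for tag in set(mdm) & set(network_scan)}
    return {"authoritative": agreed,
            "needs_review": sorted(mdm_only | scan_only)}

mdm = {"A100": {"owner": "jdoe"}, "A101": {"owner": "asmith"}}
scan = {"A100": {"ip": "10.1.2.3"}, "A102": {"ip": "10.1.2.9"}}
result = reconcile(mdm, scan)
print(sorted(result["authoritative"]))  # devices both sources agree on
print(result["needs_review"])           # one-sided devices to reconcile
```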
Scale the packaging factory and cadence: telemetry, testing, and governance
In my experience the single longest part of any desktop migration is application readiness — packaging, testing, vendor remediation, and approvals. Make the packaging factory your operational heart.
Packaging factory components
- Discovery & inventory: automated discovery plus user‑reported apps; produce `app_inventory.csv` with `app_name, vendor, version, install_path, installer_type, discovered_date`.
- Classification: tag apps by `business_criticality` and `supportability`.
- Packaging pipeline: prefer `MSIX` for modern app control; fall back to `MSI` or `App‑V` for legacy installers. Automate build validation and a headless smoke‑test harness.
- Test harness: automated UI smoke tests (e.g., using `WinAppDriver` or `Sikuli`), config backup/restore verification, and license re‑activation checks.
- Governance: SLAs for packaging turnaround (e.g., 3–5 business days for high‑priority LOB apps), clear packaging ownership, and a visible backlog.
Sample packaging matrix (table)
| App | Vendor | Version | Compatibility | Package Type | Automation | Owner |
|---|---|---:|---|---|---|---|
| AcmeFinance | AcmeCorp | 4.3.1 | Known issues (print driver) | MSIX | Yes | AppOwner-Finance |
| FieldTool | FieldSoft | 2.9.0 | Compatible | MSI | Partial | AppOwner-FieldOps |
Use telemetry to feed the packaging backlog: every app crash discovered in a pilot should create an actionable packaging item with reproduction steps, logs, and device context.
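That telemetry-to-backlog flow can be sketched as a simple transform; the event fields here are hypothetical stand-ins for whatever your crash telemetry actually emits:

```python
# Turn pilot crash telemetry into deduplicated packaging-backlog items
# carrying the context the packaging team needs (device, logs, crash code).
def backlog_items(crash_events: list[dict]) -> list[dict]:
    items, seen = [], set()
    for ev in crash_events:
        key = (ev["app_name"], ev["crash_code"])
        if key in seen:            # one backlog item per app/crash pair
            continue
        seen.add(key)
        items.append({
            "app_name": ev["app_name"],
            "crash_code": ev["crash_code"],
            "device": ev["asset_tag"],
            "log_ref": ev["log_path"],
            "status": "needs-repro",
        })
    return items

events = [
    {"app_name": "AcmeFinance", "crash_code": "0xC0000005",
     "asset_tag": "A100", "log_path": "wer/1.log"},
    {"app_name": "AcmeFinance", "crash_code": "0xC0000005",
     "asset_tag": "A101", "log_path": "wer/2.log"},
]
print(len(backlog_items(events)))   # duplicate crashes collapse to one item
```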
Automation example — inventory pull (PowerShell)
```powershell
# Collect installed apps for inventory export (run with appropriate privileges).
# Note: enumerating Win32_Product is slow and triggers an MSI consistency check
# (self-repair) for every package; the registry uninstall keys are a faster,
# side-effect-free source.
Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\*',
                 'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Uninstall\*' -ErrorAction SilentlyContinue |
    Where-Object DisplayName |
    Select-Object DisplayName, DisplayVersion, Publisher, PSChildName |
    Export-Csv -Path .\app_inventory.csv -NoTypeInformation
```
Practical governance note: maintain a small set of validated base images (e.g., corporate image, specialized finance image) and treat the image as a controlled artifact — only change it with a controlled release process tied to packaging QA.
Practical Application: checklists, templates, and a 12-week sample schedule
Checklist — Wave design (actionable)
- Export authoritative device/user inventory and reconcile gaps. (CMDB authoritative).
- Tag every device with `wave_id` and `owner`.
- Create packaging targets for top 50 business apps; assign owners and SLAs. (Packaging factory on day −14.)
- Reserve support staffing for day‑of hypercare and ensure vendor escalations are primed.
- Define SLIs and decision gate thresholds for the pilot and subsequent waves.
Pilot launch checklist (day −7 to day 0)
- Confirm pilot membership and runbook; distribute user communications.
- Validate packaging artifacts and automated smoke tests.
- Confirm backup strategy for user data and settings (verify `USMT` or profile sync).
- Confirm remote support tooling (screen share, remote control) and on‑site roving techs.
Wave‑day runbook template (abbreviated)
| Time | Owner | Action | Success criteria | Rollback criteria |
|---:|---|---|---|---|
| T‑48h | Readiness Lead | Final inventory & firmware updates | 95% devices pass firmware checks | Defer non‑compliant devices |
| T‑0 | Deployment Lead | Push image/package to tranche 1 | 98% install success in tranche | If <98% or critical app failures >3% → pause |
| T+4h | Support Lead | Triage tickets; apply hotfixes | All critical tickets triaged within 30m | If unresolved critical bugs → rollback plan |
| T+24h | Migration Lead | Post‑wave review | SLIs met; lessons logged | If SLIs not met → expand mitigation backlog, schedule re‑run |
12‑week sample schedule (high level)
| Weeks | Activities |
|---:|---|
| 1–2 | Discovery: hardware, apps, CMDB reconciliation; identify long‑pole apps |
| 3–4 | Packaging factory ramp: package top 50 apps; build base images |
| 5–6 | Pilot prep: select pilot users, white‑glove installs, telemetry configuration |
| 7 | Pilot wave: execute, collect telemetry, triage, vendor remediation |
| 8–9 | Iterate packages, update images, update runbooks |
| 10–11 | Scale waves: 2–3 production waves, monitoring, hypercare |
| 12 | Stabilize: move to steady cadence and handover to operations |
Support staffing & hypercare (heuristic)
- Day‑of: roving field techs (1 per 30–75 users depending on complexity) plus a remote triage pool (1 per 300–500 users).
- Triage: dedicated ticket tags for wave incidents; a 2‑level escalation matrix with SRE/desktop engineering on call.
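The staffing heuristic above reduces to a couple of ceiling divisions; pick ratios inside the stated ranges based on complexity. A sketch:

```python
import math

# Day-of hypercare staffing: one roving field tech per 30-75 users and
# one remote triage analyst per 300-500 users; the defaults here are
# mid-range picks, not recommendations.
def hypercare_staffing(users: int, tech_ratio: int = 50,
                       triage_ratio: int = 400) -> dict:
    return {"roving_techs": math.ceil(users / tech_ratio),
            "remote_triage": math.ceil(users / triage_ratio)}

print(hypercare_staffing(1200))                 # mid-range ratios
print(hypercare_staffing(100, tech_ratio=30))   # high-complexity site
```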
Operational templates (copy/paste basis)
- `pilot-devices.csv` fields: `asset_tag, user_id, email, wave_id, device_model, bios_version, intune_compliance, owner, business_unit`
- Runbook contact block: `Migration Lead, Deployment Lead, Support Lead, App Owner (top 5), Vendor P1, PMO Sponsor` (include phone and escalation windows).
Sources
Prosci — Change Management Success - Prosci research showing the impact of structured change management (ADKAR/process) on project outcomes and adoption rates; used to justify investment in communications, training, and adoption processes.
Windows feature updates overview — Microsoft Learn - Documentation on phased feature update releases, deployment rings, and Autopatch behavior including pause/resume and the limitations around rollback for feature updates.
App Assure — Microsoft Learn / Microsoft blogs - Microsoft App Assure overview and statistics on application compatibility coverage and the remediation support available for enterprise customers.
Monitoring Distributed Systems — Google SRE Book (sre.google) - Google SRE guidance on the four golden signals (latency, traffic, errors, saturation) and principles for monitoring and alerting that inform wave‑day telemetry design.
Freshservice — CMDB and automated discovery - Discussion of CMDB as a single source of truth, automated discovery, and dependency mapping; used to support the master inventory and federation approach.
What are Update Rings? — Action1 blog - Practical guidance on update rings and a pilot/target/broad approach with suggested pilot sizes (commonly 5–10%) and ring progression strategies.
Wave‑based migration is an operational discipline: design waves to reveal problems early, instrument those waves to make decisions with data, and make packaging and CMDB accuracy the engine that scales your rollout. Ship a pilot that fails quickly, fixes cleanly, then scale the factory and cadence until migration becomes just another scheduled business rhythm.