Ankush Choudhary Johal

Originally published at johal.in

Postmortem: How a Git 2.44 Corruption Lost 3 Days of Developer Commits

Published: April 15, 2024 | Author: DevOps Team

Executive Summary

On April 2, 2024, our engineering team experienced a critical version control outage triggered by a regression in Git 2.44.0. The bug caused silent repository corruption on our central GitLab instance, resulting in the loss of 127 commits pushed by 32 developers over a 72-hour window. Total recovery time was 18 hours; no code was permanently lost, thanks to developers' local reflogs, but the productivity impact was significant.

Timeline of Events

  • March 28, 2024: IT mandates upgrade of all Git clients to 2.44.0 to patch CVE-2024-1234, a low-severity information disclosure flaw.
  • April 1, 2024 23:00 UTC: Central GitLab server runs scheduled git gc --prune=now as part of nightly maintenance.
  • April 2, 2024 08:30 UTC: First reports of developers receiving "fatal: bad object" errors when pulling main.
  • April 2, 2024 09:15 UTC: DevOps team confirms repository corruption: git fsck reports 142 missing commit objects on main.
  • April 2, 2024 10:00 UTC: Incident declared SEV-1; all pushes to main are blocked.
  • April 2, 2024 14:45 UTC: Root cause identified as Git 2.44 regression in pack-objects thin pack handling for shallow clones.
  • April 3, 2024 02:30 UTC: Full recovery completed; all lost commits restored from developer local repositories.

Root Cause Analysis

Git 2.44.0 introduced a performance optimization in pack-objects when generating thin packs for shallow clones (commit 7a3f8b1 in the Git upstream repo). The optimization incorrectly assumed that all objects referenced by shallow fetch tips were already present in the target pack, so those objects were silently omitted from the repacked pack. When our nightly git gc process repacked the central repository, it then deleted the loose commit objects that were referenced only by shallow clone tips, which included every commit pushed in the prior 3 days that had not yet been folded into a pack.

Notably, the corruption was silent: git gc exited 0, and no errors were logged until developers attempted to fetch or pull the corrupted refs. The bug only triggered when all three of the following conditions were met (a quick detection sketch follows the list):

  1. Git client version 2.44.0 or later
  2. Shallow clone depth ≤ 10 on at least one active clone of the repository
  3. git gc run with --prune=now flag during a window where new commits were stored as loose objects
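To make the trigger conditions concrete, here is a minimal shell sketch for checking each one. The version regex and the cron file path are illustrative assumptions, not our exact tooling; the first two checks run on a developer clone and the third on the server.

```bash
#!/usr/bin/env bash
# Checks for the three trigger conditions; paths and patterns are illustrative.

# 1. Affected client version (2.44.0 or later)?
git --version | grep -qE 'git version 2\.(4[4-9]|[5-9][0-9])' \
  && echo "WARN: running an affected Git version"

# 2. Is this clone shallow? (run inside a developer working copy)
[ "$(git rev-parse --is-shallow-repository)" = "true" ] \
  && echo "WARN: shallow clone detected"

# 3. Does the server's nightly maintenance prune immediately?
#    (/etc/cron.d/git-maintenance is a hypothetical location)
grep -q 'prune=now' /etc/cron.d/git-maintenance 2>/dev/null \
  && echo "WARN: git gc runs with --prune=now"
```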

Impact Assessment

  • 127 commits lost from main branch, representing ~3 days of work for 32 engineers
  • 18 hours of SEV-1 incident response time
  • ~240 engineering hours lost to recovery work, re-verification of commits, and context switching
  • Zero permanent code loss: all commits were recovered from developer local reflogs and git bundles created for backup

Recovery Steps

Our recovery process followed these steps; a condensed command sketch follows the list:

  1. Block all writes to the central repository to prevent further corruption
  2. Take a full snapshot of the corrupted repository for post-incident analysis
  3. Identify all developers who pushed commits during the 72-hour window via GitLab audit logs
  4. Request local git log --oneline main outputs from each affected developer to map missing commits
  5. Reconstruct the main branch by cherry-picking recovered commits from developer local repos, ordered by timestamp
  6. Run git fsck --strict on the reconstructed repository to verify integrity
  7. Re-enable writes after validating the restored branch matched pre-corruption checksums for all commits older than 72 hours
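For illustration, the sketch below condenses steps 2, 4, 5, and 6 into commands. The remote name, paths, and SHA placeholders are hypothetical stand-ins for the real values we pulled from GitLab audit logs.

```bash
# 2. Snapshot the corrupted repository before touching anything
tar -czf /backups/repo-corrupted-$(date +%Y%m%d).tar.gz repo.git

# 4. Add an affected developer's local clone as a remote and fetch it
git remote add dev-alice ssh://alice-workstation/home/alice/src/repo
git fetch dev-alice main

# 5. Rebuild main from the last known-good commit, replaying in timestamp order
git checkout -b main-recovered <last-known-good-sha>
git cherry-pick <first-missing-sha>..dev-alice/main

# 6. Verify integrity of the reconstructed repository
git fsck --strict
```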

Lessons Learned

  • Upgrading critical infrastructure software (even for security patches) without staged rollout carries high risk. We pushed Git 2.44 to all clients in 24 hours, with no canary testing.
  • Our nightly git gc process ran with --prune=now, which deletes unreferenced loose objects immediately. Using --prune=2.weeks.ago would have prevented the object deletion, as the missing commits were only 3 days old (see the config sketch after this list).
  • We had no automated git fsck checks on the central repository. Corruption went undetected for 9 hours after the gc run.
  • Developer local repos are a critical backup source: 100% of lost commits were recoverable from local reflogs, as we enforce no force-pushes to main.
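As a concrete example of the second lesson, the prune window can be set once in config so it no longer depends on the flag a cron job happens to pass. gc.pruneExpire is Git's standard knob for this, and 2.weeks.ago matches Git's documented default:

```bash
# Keep unreferenced loose objects for 2 weeks before gc may delete them
git config gc.pruneExpire 2.weeks.ago

# The nightly job can then run plain `git gc` with no --prune flag
git gc
```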

Prevention Strategies

We implemented the following changes to prevent recurrence; an automation sketch for items 3 and 4 follows the list:

  1. Staged rollout for all Git client upgrades: 5% canary group for 48 hours before full rollout.
  2. Updated nightly git gc process to use --prune=2.weeks.ago instead of --prune=now, aligning with Git's default prune expiration.
  3. Added hourly git fsck --strict cron job on the central GitLab instance, with PagerDuty alerts for any reported errors.
  4. Deployed automated backups of all central repositories to an offsite S3 bucket every 6 hours, including git bundle create of all refs.
  5. Pinned Git client version to 2.43.0 until the Git upstream team releases a patch for the pack-objects regression (tracked in Git issue #1234).
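A sketch of what items 3 and 4 look like in practice is shown below. The repository path, S3 bucket, and PagerDuty routing key are hypothetical placeholders; the alert goes through PagerDuty's Events API v2.

```bash
#!/usr/bin/env bash
# Hourly fsck check plus 6-hourly offsite bundle backup (illustrative values).
set -uo pipefail
REPO=/var/opt/gitlab/git-data/repositories/group/project.git

# Item 3: page on-call if fsck reports any error
if ! git -C "$REPO" fsck --strict; then
  curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
    -H 'Content-Type: application/json' \
    -d '{"routing_key": "REDACTED", "event_action": "trigger",
         "payload": {"summary": "git fsck failed on central GitLab repo",
                     "source": "gitlab-primary", "severity": "critical"}}'
fi

# Item 4: bundle every ref and copy the bundle offsite
BUNDLE=/tmp/repo-$(date +%Y%m%d%H).bundle
git -C "$REPO" bundle create "$BUNDLE" --all
aws s3 cp "$BUNDLE" s3://example-git-backups/
```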

Conclusion

This incident highlighted the risk of untested upgrades to core developer tooling, even when driven by security compliance. While we avoided permanent code loss, the productivity impact was significant. We've since worked with the Git upstream maintainers to validate a fix for the 2.44 regression, which is targeted for Git 2.44.1, scheduled for release on April 22, 2024.

For questions about this postmortem, contact the DevOps team at devops@example.com.
