When VM Data Goes Missing

#virtualmachine #backup

The Incident: A Client’s Unexpected Data Disruption

Even with the best plans, things can go wrong. Recently, one of our valued clients faced a tough situation: a key virtual machine (VM) on their office server lost important data, including all VM configuration files and critical binary files. This post will walk you through how we stepped in to help them get everything back, what we learned from the experience, and how we’re now working with them to ensure such an incident doesn’t happen again.

It all started during what should have been routine virtual machine management at our client’s site. Without anyone deleting anything on purpose, a system anomaly led to the inadvertent removal of critical files from an office server VM. This immediately caused problems with their services, disrupted daily internal operations, and raised significant concerns about their data integrity. As their trusted IT consultants, we were immediately engaged to assess the situation and lead the recovery effort.

How We Got It Back: Our Step-by-Step Recovery Protocol

Our recovery plan for the client was clear: to restore the entire VM’s functionality, including its configuration, binary files, and any application data like MySQL tables. Just copying files wasn’t an option, as crucial setup information was missing. We had to approach this with precision and care.

Getting Ready for Recovery

Initial Assessment and Containment: As soon as we were brought in, our first priority was to assess the full extent of the data loss and contain the incident, to prevent further data corruption or spread of the issue.

Set Up the Recovery Environment: We established a safe, isolated recovery environment at the client’s location. We first took a backup of what was left on the affected system, then secured all of the client’s existing backups and made sure we had the right permissions to do the work without affecting their live systems.

Attempting Direct Disk Recovery and VM Reconstruction

This phase involved our initial, fastest attempt at recovery directly from the affected disk, followed by VM reconstruction if needed.

1. Attempting Direct File Recovery with TestDisk (Our First Line of Defense): Our client’s last full VM backup was 15 days old, so a direct restore would have lost up to 15 days of changes. To minimize this gap, we first attempted to recover files directly from the affected disk. This was our fastest path to recovering the most recent data.

This approach holds promise because of how file deletion typically works on Linux filesystems. When a file is “deleted”, the data itself isn’t immediately erased from the disk sectors. Instead, the filesystem removes the file’s metadata (its name, location, and size) from its index and marks the sectors that held the data as available for reuse. Until new data overwrites those sectors, the original data may still be present.
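
The difference between removing a file's name and erasing its data can be illustrated with a small Python sketch. This is only an analogy and not the mechanism TestDisk relies on: it shows that, on a POSIX system (an assumption here), unlinking a file removes its directory entry while the content stays readable through a handle opened beforehand. Disk-level recovery rests on the same general premise that data can outlive its name, but only until the space is reused. The file name and content are made up for illustration.

```python
import os
import tempfile
from typing import Tuple


def read_after_unlink(content: bytes) -> Tuple[bool, bytes]:
    """Delete a file by name, then read it through an already-open handle.

    Returns (name_still_exists, bytes_read_after_deletion).
    Assumes a POSIX system, where unlinking an open file is allowed.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "vm-config.example")  # hypothetical name
        with open(path, "wb") as f:
            f.write(content)

        handle = open(path, "rb")  # keep a reference to the data
        try:
            os.unlink(path)  # removes the directory entry only
            name_exists = os.path.exists(path)
            data = handle.read()  # the content is still reachable
        finally:
            handle.close()
        return name_exists, data


if __name__ == "__main__":
    exists, data = read_after_unlink(b"memory=4096\n")
    print("name still exists:", exists)
    print("data read after unlink:", data)
```

Once the last handle is closed the space is released for reuse, which is why recovery attempts should start before the affected disk receives further writes.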
We used TestDisk, a free and open-source data recovery utility. It is designed primarily to recover lost partitions and make non-booting disks bootable again, but it can also recover deleted files on some filesystems.

Here’s a simplified overview of how we used TestDisk:

Installation: TestDisk was installed in a separate live Linux environment.

sudo apt-get install testdisk

Disk Selection: We launched testdisk and selected the affected disk drive where the VM data was lost.

Partition Analysis: TestDisk was instructed to analyze the partition structure, looking for lost partitions or deleted files.

File Recovery: We navigated through the detected filesystem structure, searching for the critical VM configuration files and binary files that were reported missing. TestDisk attempts to scan the unallocated sectors and rebuild the file metadata, allowing for recovery.

Outcome of TestDisk:

Unfortunately, despite our best efforts and TestDisk’s capabilities, in this specific instance, the tool did not yield the desired results for the critical VM configuration and binary files. This suggested that the sectors containing the crucial metadata or parts of the files might have already been overwritten or were too fragmented for a complete recovery in this complex VM environment. While this was a setback, it was a necessary and rapid first attempt to minimize data loss.

Identifying Post-Backup Changes and Data Location

After TestDisk didn’t provide a full recovery, we shifted focus to identifying what data might have changed since the 15-day-old backup. We determined that:

Most of the core system configuration files would likely be the same as in the backup.

The Gitea repository (for code) and its commit data would definitely have changed.

The MySQL database data would also have been updated.

Luckily, we found that the /var/lib directory on the corrupt disk still contained the data for both MySQL and Gitea. This was a crucial discovery, as these directories typically store the application’s persistent data.
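
One way to narrow down what changed since a backup is to compare file checksums between the backup copy and the recovered directory. The sketch below is a hypothetical illustration of that idea, not the procedure used in this engagement; the function names and the directory layout are assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(backup_root: Path, recovered_root: Path) -> dict:
    """Classify files under two directory trees by relative path.

    Returns sorted lists under the keys 'changed', 'only_in_backup'
    and 'only_in_recovered'.
    """
    def index(root: Path) -> dict:
        return {
            p.relative_to(root).as_posix(): p
            for p in root.rglob("*")
            if p.is_file()
        }

    backup, recovered = index(backup_root), index(recovered_root)
    common = backup.keys() & recovered.keys()
    return {
        "changed": sorted(
            k for k in common
            if file_digest(backup[k]) != file_digest(recovered[k])
        ),
        "only_in_backup": sorted(backup.keys() - recovered.keys()),
        "only_in_recovered": sorted(recovered.keys() - backup.keys()),
    }
```

Called with, for example, the backup's copy of /var/lib and the copy mounted from the damaged disk (both paths would be site-specific), the result gives a quick list of files worth a closer look. Database files will almost always show up as changed, so the listing is most useful for configuration and application directories.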

VM Reconstruction and Configuration Restoration:

Since a direct recovery wasn’t fully successful, we proceeded to recreate the VM environment.

New VM Setup: We set up a brand new Ubuntu 22 VM (matching the client’s original operating system) to serve as the recovery target.

Package Installation: All necessary software packages were installed on this new VM.

Configuration Replacement: We then replaced the default configuration files on the new VM with the configuration files recovered from the 15-day-old backup. This brought the system’s core settings back to a known good state.

Firewall Rules and Gitea Data Recovery:

Firewall Rules: The client’s firewall rules were well-documented in their change logs. We manually recreated these rules on the new VM, a process that did not take much time.

Gitea Data Recovery: For the Gitea application, we replaced its data directory on the new VM with the Gitea data directory recovered from the corrupt disk’s /var/lib location, then corrected file permissions and ownership. After that, database recovery was the only step left before Gitea and the other services could come back online with their latest data.

MySQL Data Recovery – The Core Challenge

After the configurations were restored, the most important task was recovering the MySQL data. This required a detailed approach, especially since the .ibd files were crucial and directly copying them wasn’t straightforward due to missing metadata.

1. Database Structure Re-initialization:

The foundational step was to re-establish the database’s structure. We did this by importing an SQL dump from the 15-day-old backup, which recreated all database schemas, table definitions, and initial configuration.

sudo mysql < backup.sql

2. User Privilege Configuration:

Following the schema restoration, we recreated the necessary database user accounts for the client and reinstated their respective privileges. All user credentials were securely retrieved from backed-up configuration files.

3. Understanding InnoDB and .ibd File Restoration:

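Restoring the client’s MySQL database depended on how the InnoDB storage engine stores data. With file-per-table tablespaces, each table’s rows and indexes live in their own .ibd file, while the table’s definition is tracked separately in MySQL’s data dictionary. A .ibd file copied into a new server’s data directory is therefore not usable until the server is told to attach it to a matching table definition, which is what the DISCARD TABLESPACE and IMPORT TABLESPACE steps below do.

4. Temporarily Disable Database Rules: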

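To ensure a smooth import process and prevent integrity errors, we temporarily turned off foreign key checks. This setting applies to the current session, so the statements that follow need to run in that same session.

SET FOREIGN_KEY_CHECKS = 0;

5. Discard Existing Data Links (Tablespaces):
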
To prepare for the seamless import of the new .ibd files, we first told the database to forget where its current data files were. This effectively unlinks the table definitions from their associated data files, clearing the way for the new ones.

Important Note: This operation disconnects the current data. We always ensure our clients have a complete and verified backup before performing this step!

SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` DISCARD TABLESPACE;') AS sql_command FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<DatabaseName>' AND ENGINE = 'InnoDB';

We executed each command generated by this query individually to maintain precise control.

6. Copy and Secure the Data Files:

After discarding the old links, we carefully copied the recovered .ibd files from the corrupt disk into the corresponding database folder on the new VM. It was crucial to set the right ownership on these files, because incorrect ownership or permissions would have prevented MySQL from accessing them.

# Source path is on the corrupt disk
cp /old/var/lib/mysql/<DatabaseName>/*.ibd /var/lib/mysql/<DatabaseName>/
chown mysql:mysql /var/lib/mysql/<DatabaseName>/*

7. Import the New Data (Tablespaces):

With the .ibd files in place and ownership set correctly, the next step was to instruct MySQL to use these new data files. This linked the table definitions to their actual data and brought the tables back online.

SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` IMPORT TABLESPACE;') AS sql_command FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<DatabaseName>' AND ENGINE = 'InnoDB';

We executed the resulting commands sequentially for each table.

8. Re-enable Database Rules:

Once all data had been imported, we turned foreign key checks back on. This restored the rules that maintain data consistency and integrity within the database for all future operations.

SET FOREIGN_KEY_CHECKS = 1;
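
The sequence from disabling foreign key checks to re-enabling them can be sketched as a small Python helper that builds the statements in order. This is an illustrative sketch rather than the tooling used in the recovery: the function names are made up, and the schema and table names in the example are placeholders. It only produces text, so each statement can still be reviewed and run one at a time.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def tablespace_statements(schema: str, tables: list, action: str) -> list:
    """Build one ALTER TABLE ... DISCARD/IMPORT TABLESPACE statement per table."""
    if action not in ("DISCARD", "IMPORT"):
        raise ValueError("action must be 'DISCARD' or 'IMPORT'")
    prefix = quote_identifier(schema)
    return [
        f"ALTER TABLE {prefix}.{quote_identifier(t)} {action} TABLESPACE;"
        for t in tables
    ]


def restore_plan(schema: str, tables: list) -> list:
    """Return the steps in order; the file copy is a manual step, kept as a comment."""
    return (
        ["SET FOREIGN_KEY_CHECKS = 0;"]
        + tablespace_statements(schema, tables, "DISCARD")
        + ["-- copy the recovered .ibd files into place and chown them to mysql:mysql"]
        + tablespace_statements(schema, tables, "IMPORT")
        + ["SET FOREIGN_KEY_CHECKS = 1;"]
    )


if __name__ == "__main__":
    # Placeholder schema and table names, for illustration only.
    for line in restore_plan("exampledb", ["table_a", "table_b"]):
        print(line)
```

Quoting identifiers with backticks keeps the generated statements valid for table names that collide with reserved words, which is a common reason hand-assembled statements fail.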

Addressing Advanced Recovery Challenges (Schema Mismatches):

A significant hurdle emerged when we discovered that 19 of Gitea’s 127 tables had definitions in the recovered files that differed from the schema restored from the backup. This was primarily due to recent updates to the Gitea application, and it prevented a straightforward .ibd file restoration for these specific tables.
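
Spotting which tables differ can be approached by comparing two schema snapshots column by column. The sketch below assumes each snapshot has already been collected into a dictionary of the form {table: {column: type}}; how that is gathered (for example from INFORMATION_SCHEMA.COLUMNS on one side and extracted DDL on the other) is left out and would be an assumption. The function and key names are hypothetical.

```python
def mismatched_tables(from_backup: dict, from_recovered: dict) -> dict:
    """Compare two schema snapshots of the form {table: {column: type}}.

    Returns, for every table whose columns differ, a dict with the keys
    'only_in_backup', 'only_in_recovered' and 'type_differs'. A table that
    exists on only one side is reported with all of its columns.
    Column types are compared case-insensitively.
    """
    report = {}
    for table in sorted(from_backup.keys() | from_recovered.keys()):
        old = from_backup.get(table, {})
        new = from_recovered.get(table, {})
        diff = {
            "only_in_backup": sorted(old.keys() - new.keys()),
            "only_in_recovered": sorted(new.keys() - old.keys()),
            "type_differs": sorted(
                c for c in old.keys() & new.keys()
                if old[c].lower() != new[c].lower()
            ),
        }
        if any(diff.values()):
            report[table] = diff
    return report


if __name__ == "__main__":
    # Made-up snapshots, for illustration only.
    backup = {"t1": {"id": "BIGINT", "name": "VARCHAR(255)"}}
    recovered = {"t1": {"id": "BIGINT", "name": "VARCHAR(255)", "theme": "VARCHAR(30)"}}
    print(mismatched_tables(backup, recovered))
```

Tables that do not appear in the report are candidates for the direct tablespace import, while the reported ones need their definitions reconciled first.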

Leveraging ibd2sql for Data Extraction:

For tables with schema discrepancies, or where standard recovery methods were ineffective, we used a specialized tool called ibd2sql. This utility parses InnoDB .ibd files and extracts the contained data as SQL INSERT statements, even when a complete MySQL instance or .frm files are unavailable.

cd ibd2sql
for path in /old/var/lib/mysql/<DatabaseName>/*.ibd; do python3 main.py "$path" --sql --ddl | mysql; done

Important Note: While ibd2sql is a powerful recovery tool, it may not fully support all complex data types, and we always recommend thorough post-recovery data validation.

Where ibd2sql produced table definitions that didn’t align with the current schema, manual intervention was required. We consulted Gitea’s official database schema changelogs and migration scripts. These gave us the Data Definition Language (DDL) for each table at various points in time, allowing us to reconstruct the correct table definitions.

After modifying the table definitions to match the schema gleaned from the changelogs, we tested them by creating temporary tables and performing small-scale data imports, to confirm that each definition matched the .ibd file structure and that data integrity was maintained. This manual process, combined with the historical schema information, was crucial for recovering data from files that initially seemed lost.

Validation and Final Checks

1. System Integrity Checks: After all files were restored, we performed comprehensive checks to ensure the VM and all its applications were running correctly and without errors. This included verifying file integrity and system services.

2. Data Validation: For critical applications like the database and Gitea, we worked closely with the client to validate the restored data, ensuring its accuracy and completeness.
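
As one possible aid to this kind of validation, per-table row counts from the older backup can be compared with counts from the restored database. The sketch below is a heuristic under stated assumptions, not the validation procedure used here: it assumes the counts were gathered beforehand (for example with SELECT COUNT(*) per table), and it treats a table that is missing or holds fewer rows than the older backup as worth reviewing, even though rows can also shrink for legitimate reasons. All names are hypothetical.

```python
def tables_to_review(backup_counts: dict, restored_counts: dict) -> list:
    """Flag tables that are missing after restore or hold fewer rows than the backup.

    Both arguments map table name -> row count. Returns a list of
    (table, backup_count, restored_count, reason) tuples sorted by table name.
    """
    flagged = []
    for table in sorted(backup_counts):
        before = backup_counts[table]
        after = restored_counts.get(table)
        if after is None:
            flagged.append((table, before, None, "missing after restore"))
        elif after < before:
            flagged.append((table, before, after, "fewer rows than backup"))
    return flagged


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    backup = {"issue": 100, "user": 10, "session": 5}
    restored = {"issue": 120, "user": 8}
    for row in tables_to_review(backup, restored):
        print(row)
```

A check like this only narrows the search; the flagged tables still need to be reviewed with the people who know the data.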

Key Takeaways
This incident, while challenging, served as a powerful catalyst for re-evaluating and fortifying our client’s data management strategies. Here are the refined best practices we’ve firmly integrated into our recommendations for them and other clients:

Automated and Verified Backups: We strongly advocate for, and help implement, a comprehensive automated backup strategy for all critical VMs and their components (configuration, binaries, data). This includes a mix of full, incremental, and differential backups with clearly defined retention policies. Crucially, we emphasize regular verification of these backups through test restores to confirm actual recoverability.

Regular Snapshots: We advise clients to take VM snapshots before any significant system change, update, or maintenance operation. Snapshots are quick point-in-time copies that let you roll back to a previous state if something goes wrong right after a change.

Principle of Least Privilege: We help clients set up and enforce strict access controls and permissions across all systems. Granting users and automated processes only the privileges they need significantly reduces the risk of both accidental and malicious data manipulation.

Immutable Backups: We guide clients in exploring and implementing immutable backup solutions. These backups, by design, cannot be altered or deleted, providing a strong layer of defense against threats like ransomware and against inadvertent deletions.

Comprehensive Documentation: We encourage clients to maintain up-to-date documentation of all infrastructure components, VM configurations, database schemas, backup procedures, and recovery protocols. This documentation proves invaluable during crisis management, especially with larger teams.

Regular Disaster Recovery Drills: We recommend, and assist clients in conducting, periodic disaster recovery drills. These drills test the efficacy of recovery plans, expose weaknesses in the process, and keep the team practiced for actual incidents.

Proactive Monitoring and Alerts: We help clients set up monitoring for their VMs and databases, with alerts for unusual activity, disk space anomalies, and failures in backup processes. Catching problems early can prevent them from escalating into significant data loss events.

In Conclusion

The VM data loss incident at our client’s site was undoubtedly a demanding experience, but it also proved to be a profound learning opportunity for both them and us. It underscored the critical importance of proactive prevention, meticulous planning, and the ability to maintain composure and execute methodically under pressure. Getting their data back successfully has strengthened their reliance on robust data management practices, and we are proud to have been their partner in this recovery. By implementing these reinforced best practices, we are confident that we can help our clients turn potential data catastrophes into manageable incidents, ensuring their services run smoothly and their data remains safe.
