DEV Community: Sajit Maharjan

Troubleshooting OpenStack Instance I/O Errors: A Ceph Blocklist Case

Sajit Maharjan — Mon, 18 May 2026 13:05:42 +0000

Author: Aaditya Pageni | Infrastructure Engineer

Following a planned power outage in our datacenter, we encountered an issue where all existing VMs became unresponsive while newly created VMs functioned normally. This post documents the investigation, root cause, and resolution.

## Problem Description

After the power restoration:

Ceph reported HEALTH_OK
OpenStack services appeared operational
Newly created VMs booted successfully

However, all pre-existing VMs failed to start, dropping into initramfs with I/O errors before reaching the root filesystem.

No init found. Try passing init= bootarg.

BusyBox v1.36.1 (Ubuntu 1:1.36.1-6ubuntu3.1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)

The key observation:

Newly created VMs worked without issues.

This indicated the problem was specific to the relationship between existing VMs and their storage, rather than network or storage infrastructure issues.

Root Cause Analysis

The issue stemmed from Ceph's RBD exclusive locking mechanism.

This feature prevents simultaneous writes to the same image from multiple clients, avoiding data corruption. When a compute node connects to an RBD volume, it acquires an exclusive lock; when disconnected cleanly, it releases the lock.

During the power outage, compute nodes lost power without clean disconnection. When they returned, they appeared as untrusted clients:

```bash
ceph osd blocklist ls
```

Example output:

```bash
10.88.10.91:0/3853293677 2026-05-06T08:59:47.102488+0000
10.88.10.90:0/316670229 2026-05-07T00:26:11.581329+0000
10.88.10.90:0/3783311129 2026-05-07T00:26:11.581329+0000
...
listed 14 entries
```

Ceph blocklists clients that crash without releasing locks to prevent zombie processes from corrupting data.

The old locks remained held by client IDs that no longer existed, creating a deadlock where:

VMs needed the locks to boot
the locks were held by processes that would never release them

Resolution

Verifying Lock State

We verified the theory by checking an affected volume's lock state:

```bash
rbd lock list --pool volumes --image volume-48ed0d20-f065-4536-b3f2-eac5f3abc5be
```

Output:

```bash
There is 1 exclusive lock on this image.

Locker          ID                    Address
client.3406724  auto 135766063836400  10.88.10.91:0/3853293677
```

The address matched a blocklisted entry.

The lock was held by a client that would not return to release it.
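The cross-check above can also be scripted. This is a minimal sketch rather than the exact procedure used during the incident: `stale_lock_addrs` is a hypothetical helper name, and it assumes the two text layouts shown in this post (lock rows that start with `client.` and end with the holder address, and blocklist rows that start with the address).

```bash
# Hypothetical helper: print the address of every lock holder that is also blocklisted.
#   $1 = captured output of `rbd lock list ...`
#   $2 = captured output of `ceph osd blocklist ls`
stale_lock_addrs() {
    locks=$1
    blocklist=$2
    # Lock rows start with "client."; the holder address is the last field.
    printf '%s\n' "$locks" | awk '$1 ~ /^client\./ {print $NF}' |
    while read -r addr; do
        # A blocklist row starts with the exact address, followed by the expiry time.
        if printf '%s\n' "$blocklist" | awk -v a="$addr" '$1 == a {hit = 1} END {exit !hit}'; then
            printf '%s\n' "$addr"
        fi
    done
}
```

It could be called as `stale_lock_addrs "$(rbd lock list volumes/<image>)" "$(ceph osd blocklist ls)"`. Any address it prints belongs to a lock holder that Ceph has already fenced off.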

Removing Stale Locks

The command syntax for force-removing an RBD lock requires positional arguments with quoted strings:

```bash
rbd lock remove volumes/volume-48ed0d20-f065-4536-b3f2-eac5f3abc5be \
    "auto 135766063836400" "client.3406724"
```

Verification:

```bash
rbd lock list --pool volumes --image volume-48ed0d20-f065-4536-b3f2-eac5f3abc5be
```

Output:

```bash
No locks on this image.
```

The VM rebooted successfully.

Bulk Resolution

For multiple affected volumes, we used the following script:

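```bash
# Note: this removes every lock it finds, including locks held by healthy clients.
for vol in $(rbd ls volumes); do
    # Query the lock list once and keep only the lock row (it starts with "client.").
    lock_row=$(rbd lock list "volumes/$vol" 2>/dev/null | awk '$1 ~ /^client\./ {print; exit}')

    if [ -n "$lock_row" ]; then
        echo "Removing lock on: $vol"

        locker=$(echo "$lock_row" | awk '{print $1}')         # e.g. client.3406724
        lock_id=$(echo "$lock_row" | awk '{print $2" "$3}')   # e.g. auto 135766063836400

        rbd lock remove "volumes/$vol" "$lock_id" "$locker"

        echo "Done: $vol"
    fi
done
```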
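Before removing locks in bulk, it can help to see exactly which commands would run. The sketch below is a dry-run variant under stated assumptions: `lock_remove_cmds` is a hypothetical helper, and it assumes the `rbd lock list` layout shown earlier (locker first, address last, lock ID in between). It only prints commands and never calls `rbd` itself.

```bash
# Hypothetical dry-run helper: print one `rbd lock remove` command per lock row.
#   $1 = image spec, e.g. volumes/volume-<uuid>
#   $2 = captured output of `rbd lock list <spec>`
lock_remove_cmds() {
    spec=$1
    printf '%s\n' "$2" | awk -v spec="$spec" '
        $1 ~ /^client\./ {
            # The lock ID may contain spaces ("auto <number>"): join fields 2..NF-1.
            id = $2
            for (i = 3; i < NF; i++) id = id " " $i
            printf "rbd lock remove %s \"%s\" \"%s\"\n", spec, id, $1
        }'
}
```

Reviewing the printed commands first makes it easier to skip volumes whose lock holder is still a live client.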

Then hard rebooted all affected VMs:

```bash
for vm in $(openstack server list --all-projects -f value -c ID); do
    name=$(openstack server show "$vm" -f value -c name)
    status=$(openstack server show "$vm" -f value -c status)

    echo "Rebooting: $name ($vm) - Current status: $status"

    openstack server reboot --hard "$vm"
done
```

All VMs recovered successfully.

Clearing Blocklist Entries

After confirming all locks were released and VMs were healthy, we cleared the blocklist entries:

```bash
ceph osd blocklist rm 10.88.10.90
ceph osd blocklist rm 10.88.10.91
```

Important: Only perform this step after confirming crashed nodes will not return with stale state. Reconnecting zombie processes while another client holds the lock risks data corruption.
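The blocklist stores full client addresses (IP, port, and nonce), as the `ceph osd blocklist ls` output earlier shows. One way to clear a host's entries is to generate a removal command for each listed address. The following is a sketch only: `blocklist_rm_cmds` is a hypothetical helper that prints the commands for review instead of executing them.

```bash
# Hypothetical dry-run helper: print a removal command for each blocklist entry of one host.
#   $1 = host IP, e.g. 10.88.10.90
#   $2 = captured output of `ceph osd blocklist ls`
blocklist_rm_cmds() {
    host=$1
    # Match only rows whose address starts with "<ip>:", so 10.88.10.9 never matches 10.88.10.90.
    printf '%s\n' "$2" | awk -v h="$host" 'index($1, h ":") == 1 {print "ceph osd blocklist rm " $1}'
}
```

For example, `blocklist_rm_cmds 10.88.10.90 "$(ceph osd blocklist ls)"` would list one command per entry for that compute node.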

Prevention Measures

Granting OpenStack Blocklist Capabilities

OpenStack requires specific Ceph capabilities to manage blocklist entries automatically.

Without:

```bash
allow command "osd blocklist"
```

Nova cannot clear stale entries automatically.

Step 1: Check Current Capabilities

```bash
ceph auth get client.openstack
```

Step 2: Add Blocklist Capability

```bash
# First, save existing OSD caps
ceph auth get client.openstack -o /tmp/openstack.keyring

# Then update caps (adjust pool names and OSD caps for your environment)
ceph auth caps client.openstack \
    mon 'allow r, allow command "osd blocklist"' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=images, allow rwx pool=volumes, allow rwx pool=vms, allow rwx pool=backups'
```

Note: Adjust pool names according to your environment (vms, volumes, images, etc.).

Step 3: Verify the Update

```bash
ceph auth get client.openstack
```
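The verification can be reduced to a yes/no check. This is a minimal sketch: `has_blocklist_cap` is a hypothetical helper, and it assumes the mon caps appear on a line beginning with `caps mon =` in the `ceph auth get` output, which may differ between Ceph releases.

```bash
# Hypothetical check: succeed if the mon caps line mentions the "osd blocklist" command.
#   $1 = captured output of `ceph auth get client.openstack`
has_blocklist_cap() {
    printf '%s\n' "$1" | grep -Eq '^[[:space:]]*caps mon = .*osd blocklist'
}
```

A call such as `has_blocklist_cap "$(ceph auth get client.openstack)" && echo "capability present"` could be added to a post-deployment check.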

Nova Configuration Tuning

The following settings were added to nova.conf on compute nodes:

```ini
[libvirt]
hw_disk_discard = unmap
disk_cachemodes = network=writeback
rbd_io_timeout = 30
```

The rbd_io_timeout parameter gives the RBD client additional time to recover during transient issues rather than immediately failing I/O.

Key Takeaways

Ceph's blocklist mechanism protects data from split-brain scenarios.
The issue arises from unclean shutdowns leaving orphaned locks behind.
New VMs working while existing VMs fail is a strong diagnostic indicator.
This pattern after an outage strongly suggests blocklist-related issues.
Proactively grant blocklist permissions to the OpenStack Ceph client.
The allow command "osd blocklist" capability enables automatic recovery without manual intervention.
The rbd lock remove syntax requires positional arguments with quoted strings.
The --locker flag is not available in many versions.

Correct format:

```bash
rbd lock remove <pool>/<image> "<lock_id>" "<locker>"
```

Include lock-related failure scenarios in disaster recovery testing. Standard monitoring and backup verification may not catch this failure mode.

Building Hypervisor Infrastructure with Proxmox and Ceph

Sajit Maharjan — Mon, 18 May 2026 12:14:07 +0000

Author: Rajendra Acharya

Why Proxmox and Ceph?

Proxmox VE is a powerful open-source platform for managing virtual machines and containers. When combined with Ceph, it enables a highly available, scalable, and cost-effective infrastructure.

Key Benefits

High Availability (HA): Minimized downtime through automatic failover.
Scalability: Seamlessly expand compute and storage resources as needed.
Cost-Effectiveness: Open-source, community-driven technologies with enterprise-grade capabilities.

This combination is ideal for enterprises, startups, and home lab enthusiasts looking to build a production-grade hypervisor infrastructure.


Setting Up Proxmox Infrastructure

Step 1: Install Proxmox VE

Download the ISO

Download the latest Proxmox VE ISO from the official Proxmox VE website.

Install Proxmox VE

Create a bootable USB drive or use PXE boot.
Install Proxmox VE on your server hardware.

Initial Configuration

During installation:

Configure the hostname.
Assign a static IP address.
Set the root password.
Verify management network connectivity.

Step 2: Configure Networking

For production environments, proper network segmentation is highly recommended.

Recommended Network Setup

Use bonded interfaces for redundancy and improved reliability.
Create separate VLANs for:
- Management traffic
- Ceph storage traffic
- Virtual machine traffic

This separation improves both performance and security.

Setting Up Ceph for Distributed Storage

Step 1: Prepare Proxmox Nodes

Before deploying Ceph:

Ensure you have at least three Proxmox nodes for cluster reliability.
Install additional disks on each node for Ceph OSDs (Object Storage Daemons).

Step 2: Install and Configure Ceph

Enable Ceph

Navigate to Datacenter → Ceph.

Initialize the Ceph cluster.

Deploy OSDs

Add dedicated disks as OSDs from Ceph → OSD.
These disks will provide distributed storage capacity for the cluster.

Configure MONs (Monitors)

Deploy at least three monitor nodes to maintain quorum and ensure cluster reliability.
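The reason for three monitors is majority voting: the cluster stays available only while more than half of the monitors can reach each other. The arithmetic can be sketched as follows; the function names are illustrative and not part of any Ceph or Proxmox tooling.

```bash
# Majority quorum for n monitors, and how many monitor failures that quorum tolerates.
#   $1 = number of monitors
quorum_size()        { echo $(( $1 / 2 + 1 )); }
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }
```

With three monitors the quorum is two and one failure is tolerated. A fourth monitor raises the quorum to three without tolerating any additional failure, which is why odd monitor counts are the usual choice.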

Create Ceph Pools

Create storage pools and define redundancy policies such as:

Replication
Erasure Coding

Pool configuration depends on your performance and redundancy requirements.
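To compare the two redundancy policies, a rough capacity estimate is often enough. The sketch below uses integer arithmetic and ignores metadata overhead and free-space headroom; the function names and the example figures are illustrative assumptions.

```bash
# Rough usable capacity, in the same unit as the raw capacity.
# Replicated pool: every object is stored `size` times.
#   $1 = raw capacity, $2 = replica count (size)
replicated_usable() { echo $(( $1 / $2 )); }

# Erasure-coded pool: k data chunks plus m coding chunks per object.
#   $1 = raw capacity, $2 = k, $3 = m
ec_usable() { echo $(( $1 * $2 / ($2 + $3) )); }
```

For 120 TB of raw disk, `replicated_usable 120 3` gives 40 and `ec_usable 120 4 2` gives 80. Erasure coding yields more usable space for the same disks, at the cost of extra computation on reads and writes.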

Integrating Ceph with Proxmox

Step 1: Add Ceph Storage to Proxmox

Navigate to Datacenter → Storage.
Add Ceph RBD as a storage backend.
Verify connectivity and pool access.

Step 2: Migrate VM Storage

Move VM disks to the Ceph RBD pool to achieve:

Centralized storage
Live migration capabilities
High availability support

High Availability and Maintenance

Configure HA in Proxmox

To enable high availability:

Create HA Groups and define failover priorities.
Configure quorum and fencing properly to avoid split-brain scenarios.

Regular Maintenance

Monitor Ceph Cluster Health

Use the following command regularly:
ceph status
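For unattended checks, the health summary can be mapped to an exit status that a cron job or monitoring agent understands. This is a sketch under one assumption: that the output of `ceph health` begins with `HEALTH_OK`, `HEALTH_WARN`, or `HEALTH_ERR`. The helper name `health_exit_code` is hypothetical.

```bash
# Hypothetical wrapper: map the first word of `ceph health` output to an exit status.
#   $1 = captured output of `ceph health`
health_exit_code() {
    case $1 in
        HEALTH_OK*)   return 0 ;;
        HEALTH_WARN*) return 1 ;;
        *)            return 2 ;;   # HEALTH_ERR or anything unexpected
    esac
}
```

It could be used as `health_exit_code "$(ceph health)" || echo "Ceph needs attention"`.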


Monitor Infrastructure Performance

Use the Proxmox dashboard to:

Monitor VM performance
Track storage utilization
Review cluster health

The Takeaway

Building a hypervisor infrastructure using Proxmox and Ceph provides an excellent balance of flexibility, reliability, and scalability. While the initial setup requires careful planning, the end result is a resilient and enterprise-grade infrastructure powered entirely by open-source technologies.

Whether you're building a data center platform, private cloud, or advanced home lab, Proxmox and Ceph offer a powerful foundation capable of competing with many commercial solutions.

By understanding and implementing these technologies, you gain hands-on experience with modern infrastructure design while building a platform that delivers enterprise-grade performance and reliability.

The Digital Tithe: Why Open Source is Nepal’s Only Path to Sovereignty

Sajit Maharjan — Mon, 18 May 2026 11:05:08 +0000

From Digital Vassalage to National Pride: Why a Sovereign Tech Stack is Nepal’s New Declaration of Independence.

Author: Bishal KC | Cloud Native
Nepal’s foreign exchange reserves have reached a historic high of Rs. 2,677.68 billion. On paper, this is a triumph—a fortress built against the volatile tides of the global economy. Yet, beneath the celebratory headlines of the Nepal Rastra Bank lies a structural hemorrhage.

But even as this wealth flows in through the front door, a significant and growing share of it quietly exits through the digital back door. It flows directly to the glass towers of Silicon Valley and the corporate campuses of Redmond.

In our national accounts, these outflows are sanitized with labels like “charges for the use of intellectual property” or “computer services.” Let us call it what it actually is: a digital tithe. We are paying a recurring, never-ending rent simply to operate the systems that run our own country. As we push forward with the Digital Nepal Framework 2.0, we aren’t just building a modern state; we are building a digital plantation where we are the sharecroppers.

The Anatomy of Digital Colonialism: The Trap of Vendor Lock-In
In the 20th century, colonialism was about land and resources. In the 21st, it is about code and data. The hidden cost of proprietary software is not the initial purchase price; it is the vendor lock-in—a state of captivity where the cost of switching becomes so high that the customer becomes a permanent asset of a foreign corporation.

For Nepal, this means our state institutions are being held hostage by foreign boardrooms. The financial stakes are staggering. Globally, enterprise virtualization platforms—the software that allows one server to act as many—are licensed per CPU core. For a modest government data center with 1,000 cores, the annual “rent” can exceed $550,000 (approx. NPR 7.2 crore).

This is not a hypothetical risk. Following Broadcom’s acquisition of VMware, customers globally—including Nepal’s banking and government sectors—were hit with price hikes of 150% to 300% almost overnight. As a small market, Nepal has zero negotiating power. If our National ID system, Social Security Fund, or Tax records are built on these “black boxes,” we have effectively handed the keys to our national infrastructure to a stranger who can change the locks at will.

We are building our digital future on land we do not own, with materials we cannot inspect, for a rent we cannot control.

The Sovereign Stack: Infrastructure is Not a Product
As we build our Digital Public Infrastructure (DPI)—the digital equivalent of our highways, electricity grids, and water systems—we must realize that Proprietary DPI is a contradiction in terms. DPI consists of the foundational layers of identity, payments, and data exchange. In the physical world, we would never allow a foreign corporation to own the exclusive rights to the blueprints of our national highways or the “off-switch” for our power grid. Yet, by building our “Citizen Stack” on closed-source platforms, we are doing exactly that.

The antidote is the adoption of a “Sovereign Stack” built on foundations that are already the global standard. This is not an experimental dream; it is a reality proven by battle-tested open-source giants:

For Nepal, the mandate to use these open standards in our DPI is a Declaration of Independence. It ensures Portability as Power. If we build our national infrastructure on an open-source, Kubernetes-based architecture, our digital services become platform-agnostic. We are no longer at the mercy of any single provider’s uptime or pricing. If a vendor raises prices, we can move our entire national infrastructure to local servers in Kathmandu or a different provider overnight. We own the orchestration; therefore, we own the destiny of our data.

Reversing the Brain Drain: From Resellers to Architects
Nepal’s greatest loss is not financial; it is human. In 2023 alone, over 808,000 Nepalese departed for foreign shores. We produce approximately 9,000 IT graduates annually—brilliant minds capable of competing on a global stage. However, our current procurement ecosystem provides them with a bleak choice: leave or become a clerk.

When the government and large enterprises continuously outsource their infrastructure to foreign proprietary vendors, local IT firms are reduced to acting as mere “license resellers.” This is a hollow business model. Our engineers become support staff for foreign products, trained only to “click buttons” on a dashboard designed in California.

Transitioning to an open-source-first model fundamentally changes this dynamic. Mastering technologies like Kubernetes and OpenStack requires high-level architectural design and systems integration. It transforms our domestic IT sector from a middleman economy into a value-added industry.

By adopting this model, the government’s massive IT budget—currently estimated at NPR 20–30 crore annually for licensing alone—can be rechannelled directly into the domestic economy. Instead of buying a license from a foreign giant, the government can pay a local Nepali company for support, customization, and maintenance. This creates high-value employment locally, offering our brightest engineers a compelling reason to stay and build the future of Nepal’s digital infrastructure. We must stop being a nation that consumes software and start being a nation that builds it.

Shattering the Security Myth: Transparency over Obscurity

To make this transition, we must dismantle the most dangerous misconception held by conservative policymakers: the myth that “open source is open and therefore vulnerable.” This fallacy of “security through obscurity”—the idea that hiding source code makes it safer—is rejected by every modern cybersecurity standard.

The security of open-source software is underpinned by Linus’s Law: “Given enough eyeballs, all bugs are shallow.” Because the code of the Linux kernel or Kubernetes is transparent, it is audited by a global community of thousands of independent security researchers. Vulnerabilities are detected and patched at a speed that proprietary “black boxes” cannot match.

In a closed-source system, a flaw can sit hidden and unpatched for years, known only to the vendor and perhaps a sophisticated hacker or a foreign intelligence agency. True digital sovereignty is impossible if the state cannot independently audit the code running its critical infrastructure. In an era of increasing global cyber-warfare, using software you cannot inspect is like building a fortress and letting a stranger hold the only set of keys. If the titans of global technology trust open-source architecture to secure their empires, the Government of Nepal has no reason to fear it.

The Geopolitical Necessity: Digital Non-Alignment
In a world increasingly divided by “technological cold wars,” Nepal must practice Digital Non-Alignment. Relying on a single proprietary stack from one geopolitical region makes our national infrastructure a target for sanctions, trade wars, or political leverage.

Open-source software belongs to no single nation. It is a global commons. By building on FOSS, Nepal ensures that its digital services remain operational regardless of changes in international relations or foreign export controls. Whether it is a health registry in Humla or a payment gateway in Birgunj, our software should not depend on the whims of a foreign government’s trade policy. We must protect our “Digital Sovereignty” by ensuring that no single foreign entity has the power to “turn off” Nepal.

A Roadmap for Reform: The Three Pillars
The path forward requires more than just speeches; it requires bold political will. The government must take three structural steps immediately:
1. Mandate an “Open Source First” Procurement Policy: Public procurement laws must be updated. All government agencies must evaluate battle-tested open-source solutions by default. A proprietary solution should only be procured if the agency provides a rigorous, publicly documented justification proving that no viable open alternative exists, factoring in the long-term financial risks of vendor lock-in.
2. Establish a National Open Source Program Office (OSPO): Under the Ministry of Communication and Information Technology, a Nepalese OSPO would act as the central hub for strategy, capacity building, and security compliance. This office will bridge the gap between the state and our vibrant local tech community, ensuring that our civil servants are trained to manage the Sovereign Stack.
3. Invest in “Local-First” Support Ecosystems: The government should provide incentives for local IT startups that specialize in open-source implementation. By creating a certification program for “Sovereign Stack Providers,” we can ensure that when a government office chooses open source, they have reliable, local technical support.

The Software Must Belong to the Nation
Adopting open-source technology is not a “cheaper” alternative; it is a strategic national imperative. We are at a crossroads. We can continue down the legacy path of digital vassalage, perpetually tethering our national budgets to foreign entities. Or, we can choose the path of resilience and independence.

The era of renting our digital future must end. The software that runs the nation must belong to the nation.

When VM Data Goes Missing

Sajit Maharjan — Mon, 18 May 2026 06:39:52 +0000

The Incident: A Client’s Unexpected Data Disruption

Even with the best plans, things can go wrong. Recently, one of our valued clients faced a tough situation: a key virtual machine (VM) on their office server lost important data, including all VM configuration files and critical binary files. This post will walk you through how we stepped in to help them get everything back, what we learned from the experience, and how we’re now working with them to ensure such an incident doesn’t happen again.

It all started during what should have been routine virtual machine management at our client’s site. Without anyone deleting anything on purpose, a system anomaly led to the inadvertent removal of critical files from an office server VM. This immediately caused problems with their services, disrupted daily internal operations, and raised significant concerns about their data integrity. As their trusted IT consultants, we were immediately engaged to assess the situation and lead the recovery effort.

How We Got It Back: Our Step-by-Step Recovery Protocol

Our recovery plan for the client was clear: to restore the entire VM’s functionality, including its configuration, binary files, and any application data like MySQL tables. Just copying files wasn’t an option, as crucial setup information was missing. We had to approach this with precision and care.

Getting Ready for Recovery

Initial Assessment and Containment: As soon as we were brought in, our first priority was to assess the full extent of the data loss and contain the incident. This was crucial to prevent any further data corruption or spread of the issue.
Set Up the Recovery Environment: We established a safe, isolated recovery environment at the client’s location. First, we took a backup of what was left on the affected system. We also secured all their existing backups and made sure we had the right permissions to do the work without affecting their live systems.

Attempting Direct Disk Recovery and VM Reconstruction

This phase involved our initial, fastest attempt at recovery directly from the affected disk, followed by VM reconstruction if needed.

1. Attempting Direct File Recovery with TestDisk (Our First Line of Defense): Our client’s last full VM backup was 15 days old, meaning a direct restore would leave a significant data loss gap. To minimize this, we first attempted to recover files directly from the affected disk. This was our fastest path to potentially recovering the most recent data.
This approach holds promise because of how file deletion typically works on Linux systems. When a file is “deleted” on a Linux filesystem, the data itself isn’t immediately erased from the disk sectors. Instead, the operating system primarily removes the file’s metadata (like its name, location, and size) from the filesystem’s index. The sectors where the data resides are merely marked as available for new data. Until new data overwrites those sectors, the original data might still be present.
We used TestDisk, a free and open-source data recovery utility. It is designed to recover lost partitions and make non-booting disks bootable again, but it can also recover deleted files.
Here’s a simplified overview of how we used TestDisk:

Installation: TestDisk was installed in a separate live Linux environment.

sudo apt-get install testdisk
Disk Selection: We launched testdisk and selected the affected disk drive where the VM data was lost.

Partition Analysis: TestDisk was instructed to analyze the partition structure, looking for lost partitions or deleted files.

File Recovery: We navigated through the detected filesystem structure, searching for the critical VM configuration files and binary files that were reported missing. TestDisk attempts to scan the unallocated sectors and rebuild the file metadata, allowing for recovery.

Outcome of TestDisk:

Unfortunately, in this instance TestDisk did not recover the critical VM configuration and binary files. This suggested that the sectors containing the crucial metadata or parts of the files had already been overwritten or were too fragmented for a complete recovery in this complex VM environment. While this was a setback, it was a necessary and rapid first attempt to minimize data loss.

Identifying Post-Backup Changes and Data Location
After TestDisk didn’t provide a full recovery, we shifted focus to identifying what data might have changed since the 15-day-old backup. We determined that:

Most of the core system configuration files would likely be the same as in the backup.

The Gitea repository (for code) and its commit data would definitely have changed.

The MySQL database data would also have been updated.

Luckily, we found that the /var/lib directory on the corrupt disk still contained the data for both MySQL and Gitea. This was a crucial discovery, as these directories typically store the application’s persistent data.

VM Reconstruction and Configuration Restoration:

Since a direct recovery wasn’t fully successful, we proceeded to recreate the VM environment.

New VM Setup: We set up a brand new Ubuntu 22 VM (matching the client’s original operating system) to serve as the recovery target.

Package Installation: All necessary software packages were installed on this new VM.

Configuration Replacement: We then replaced the default configuration files on the new VM with the configuration files recovered from the 15-day-old backup. This brought the system’s core settings back to a known good state.

Firewall Rules and Gitea Data Recovery:

Firewall Rules: The client’s firewall rules were well-documented in their change logs. We manually recreated these rules on the new VM, a process that did not take much time.

Gitea Data Recovery: For the Gitea application, we replaced its data directory on the new VM with the Gitea data directory we had recovered from the corrupt disk’s /var/lib location. After carefully managing file permissions and ownership, database recovery was the only thing left to bring Gitea and the other services back online with their latest data.

MySQL Data Recovery – The Core Challenge

After the configurations were restored, the most important task was recovering the MySQL data. This required a detailed approach, especially since the .ibd files were crucial and directly copying them wasn’t straightforward due to missing metadata.

1. Database Structure Re-initialization:

The foundational step was to re-establish the database’s structural integrity. We did this by importing an SQL dump from the older, 15-day-old backup. This recreated all database schemas, table definitions, and initial configurations.
sudo mysql < backup.sql

2. User Privilege Configuration:

Following the schema restoration, we recreated the necessary database user accounts for the client and reinstated their respective privileges. All user credentials were securely retrieved from backed-up configuration files.

3. Understanding InnoDB and .ibd File Restoration:

To effectively restore the client’s MySQL database, we leveraged our deep understanding of the MySQL InnoDB storage engine, particularly its transactional properties and recovery mechanisms. This was crucial for handling the .ibd files.

4. Temporarily Disable Database Rules:

To ensure a smooth import process and prevent any integrity errors, we temporarily turned off foreign key checks.
SET FOREIGN_KEY_CHECKS = 0;

5. Discard Existing Data Links (Tablespaces):
To prepare for the seamless import of the new .ibd files, we first told the database to forget where its current data files were. This effectively unlinks the table definitions from their associated data files, clearing the way for the new ones.

Important Note: This operation disconnects the current data. We always ensure our clients have a complete and verified backup before performing this step!

SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` DISCARD TABLESPACE;') AS sql_command FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<DatabaseName>' AND ENGINE = 'InnoDB';

We executed each command generated by this query individually to maintain precise control.

Copy and Secure the Data Files:

After discarding the old links, we carefully copied the recovered .ibd files from the corrupt disk into the corresponding database folder on the new VM. It was crucial to set the right ownership for these files. Incorrect permissions would have prevented MySQL from accessing them.

cp /old/var/lib/mysql/<DatabaseName>/*.ibd /var/lib/mysql/<DatabaseName>/ # Source path is on the corrupt disk
chown mysql:mysql /var/lib/mysql/<DatabaseName>/*

Import the New Data (Tablespaces): With the .ibd files correctly positioned and permissions accurately configured, the next step was to instruct MySQL to use these new data files. This action linked the table definitions with their actual data, bringing them back online.

SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` IMPORT TABLESPACE;') AS sql_command FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<DatabaseName>' AND ENGINE = 'InnoDB';

We executed the resulting commands sequentially for each table.
Re-enable Database Rules: Upon the successful import of all data, we promptly turned the foreign key checks back on. This restored the rules that maintain data consistency and integrity within the database for all future operations.
SET FOREIGN_KEY_CHECKS = 1;
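
The discard, copy, and import sequence above lends itself to generated SQL scripts that can be reviewed before anything is run. The following is a minimal sketch, not the exact commands used in this recovery: `tablespace_sql` is a hypothetical helper, and the database and table names in the usage line are placeholders.

```bash
# Hypothetical generator: read table names on stdin and print a reviewable SQL script.
#   $1 = database name, $2 = DISCARD or IMPORT
tablespace_sql() {
    db=$1
    action=$2
    # Foreign key checks are disabled and re-enabled within the same script.
    echo "SET FOREIGN_KEY_CHECKS = 0;"
    while read -r table; do
        [ -n "$table" ] || continue   # skip empty lines
        printf 'ALTER TABLE `%s`.`%s` %s TABLESPACE;\n' "$db" "$table" "$action"
    done
    echo "SET FOREIGN_KEY_CHECKS = 1;"
}
```

For example, `printf 'table_a\ntable_b\n' | tablespace_sql '<DatabaseName>' DISCARD > discard.sql` writes the discard script. The same call with `IMPORT` produces the script to run after the .ibd files are copied and their ownership is fixed.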
Addressing Advanced Recovery Challenges (Schema Mismatches):

A significant hurdle emerged when we discovered that 19 of Gitea’s 127 tables had different designs. This was primarily due to recent updates to the Gitea application, which prevented a straightforward .ibd file restoration for these specific tables.

Leveraging ibd2sql for Data Extraction:

For tables with schema discrepancies or where standard recovery methods were ineffective, we employed a specialized tool called ibd2sql. This utility is designed to parse InnoDB .ibd files and extract the contained data as SQL INSERT statements, even when a complete MySQL instance or .frm files are unavailable.

cd ibd2sql
for path in /old/var/lib/mysql/<DatabaseName>/*.ibd; do python3 main.py "$path" --sql --ddl | mysql; done
Important Note: While ibd2sql is a powerful recovery tool, it may not fully support all complex data types, and we always recommend thorough post-recovery data validation.
In instances where ibd2sql provided table designs that didn’t perfectly align with the current schema, manual intervention was required. We consulted Gitea’s official database schema changelogs and migration scripts. These resources provided the precise Data Definition Language (DDL) for each table at various points in time, allowing us to accurately reconstruct the correct table definitions.

After modifying the table designs to reflect the correct schema gleaned from the changelogs, we tested them. This involved creating temporary tables and performing small-scale data imports to ensure the design precisely matched the .ibd file structure and that data integrity was fully maintained. This manual process, combined with historical schema information, was crucial for recovering data from files that initially seemed lost.

Validation and Final Checks
1. System Integrity Checks: After all files were restored, we performed comprehensive checks to ensure the VM and all its applications were running correctly and without errors. This included verifying file integrity and system services.
2. Data Validation: For critical applications like the database and Gitea, we worked closely with the client to validate the restored data, ensuring its accuracy and completeness.

Key Takeaways
This incident, while challenging, served as a powerful catalyst for re-evaluating and fortifying our client’s data management strategies. Here are the refined best practices we’ve firmly integrated into our recommendations for them and other clients:

Automated and Verified Backups: We strongly advocate for and help implement a comprehensive, automated backup strategy for all critical VMs and their components (config, binaries, data). This includes a robust mix of full, incremental, and differential backups with clearly defined retention policies. Crucially, we emphasize regular verification of these backups through simulated test restores to ensure actual recoverability.

Regular Snapshots: We advise clients to consistently use VM snapshots before any significant system changes, updates, or routine maintenance operations. Snapshots are quick restore points that let you roll back to a previous state if something goes wrong immediately.

Principle of Least Privilege: We help clients set up and enforce strict access controls and permissions across all systems. By granting only the absolutely necessary privileges to users and automated processes, we significantly mitigate the risk of both accidental and malicious data manipulation.

Immutable Backups: We guide clients on exploring and implementing immutable backup solutions. These backups, by design, cannot be altered or deleted, providing a formidable defense layer against advanced threats like ransomware and inadvertent deletions.

Comprehensive Documentation: We encourage clients to maintain meticulous and up-to-date documentation of all infrastructure components, detailed VM configurations, database schemas, precise backup procedures, and established recovery protocols. This documentation proves invaluable during crisis management, especially with larger teams.

Regular Disaster Recovery Drills: We recommend and assist clients in conducting periodic disaster recovery drills. These drills are essential for testing the efficacy of recovery plans, identifying any potential weaknesses in their processes, and ensuring their team is not only ready but proficient during actual incidents.

Proactive Monitoring and Alerts: We help clients set up robust systems that watch over their VMs and databases. They receive alerts for any unusual activities, critical disk space anomalies, or failures in backup processes. Catching problems early can prevent them from escalating into significant data loss events.

In Conclusion

The VM data loss incident at our client’s site was undoubtedly a demanding experience, but it also proved to be a profound learning opportunity for both them and us. It underscored the critical importance of proactive prevention, meticulous planning, and the ability to maintain composure and execute methodically under pressure. Getting their data back successfully has strengthened their reliance on robust data management practices, and we are proud to have been their partner in this recovery. By implementing these reinforced best practices, we are confident that we can help our clients turn potential data catastrophes into manageable incidents, ensuring their services run smoothly and their data remains safe.