Postmortem: Fluent Bit 2.9 Bug Caused Lost Logs for 3 Hours During Security Incident
Executive Summary
On October 12, 2024, a critical bug in Fluent Bit version 2.9.4 caused intermittent log drops for 3 hours during an active security incident affecting our production payment processing environment. Approximately 12% of application logs (1.2TB of data) were lost, delaying incident response and forensic analysis by 4 hours. The bug was traced to a race condition in the tail input plugin's file rotation handler, which was triggered by high-volume log writes during the security incident.
Incident Timeline
All times in UTC:
- 14:02 – Security team detects unauthorized access attempts to payment API endpoints; incident declared.
- 14:15 – Fluent Bit 2.9.4 deployed to production as part of a routine logging stack update (approved via change management 2 days prior).
- 14:22 – First reports of missing logs from payment service pods; on-call engineering team alerted.
- 14:45 – Root cause identified as Fluent Bit tail plugin bug; rollback to version 2.8.9 initiated.
- 15:07 – Fluent Bit 2.8.9 fully rolled out; log collection resumes with no further drops.
- 17:02 – Incident resolved; approximately 3 hours of log loss confirmed between 14:15 and 17:02 (a window overlapping the rollback period).
Impact
The log loss directly impacted incident response efforts:
- 12% of payment service logs (1.2TB) were permanently lost, including 40% of authentication and access logs for the affected period.
- Forensic analysis of the security incident was delayed by 4 hours, as teams had to reconstruct partial event timelines from secondary monitoring tools.
- No customer data was exfiltrated, but the delay increased the risk of undetected lateral movement by the attacker.
- Compliance teams had to file a minor incident report with PCI-DSS auditors due to incomplete log retention for the 3-hour window.
Root Cause
The bug was introduced in Fluent Bit 2.9.0 as part of a performance optimization for the tail input plugin, which reads log files from disk. The optimization added a non-thread-safe cache for tracked file inodes to reduce stat() system call overhead. During high-volume log writes (triggered by the security incident's verbose API access logging), the following race condition occurred:
- The tail plugin's main thread updates the inode cache when a log file is rotated.
- A separate I/O thread simultaneously writes new log entries to the rotated file before the cache update completes.
- The plugin incorrectly marks the rotated file as "fully read" and stops tracking it, dropping all unwritten entries in the I/O buffer.
We confirmed the issue by reproducing the race condition in a staging environment with simulated high-volume log writes (100k events/sec) and Fluent Bit 2.9.4. The bug was assigned Fluent Bit Issue #12345 and fixed in version 2.9.5, released on October 15, 2024.
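The mechanism is easier to see as code. The sketch below is a minimal, hypothetical Go reproduction of the same pattern, not Fluent Bit's actual C implementation: two goroutines mutate a shared read offset with no synchronization, just as the rotation handler and the I/O thread raced on the inode cache. Names such as trackedFile are stand-ins invented for illustration.

```go
// rotation_race_sketch.go
//
// Minimal sketch of the race described above. NOT Fluent Bit's C code:
// trackedFile and the two goroutines are hypothetical stand-ins that
// reproduce the same unsynchronized-shared-state pattern.
package main

import (
	"fmt"
	"sync"
)

// trackedFile mimics one inode-cache entry: the read offset is shared
// between the rotation handler and the reader, with no lock protecting it.
type trackedFile struct {
	readOffset int64 // bytes already consumed from the file
}

func main() {
	f := &trackedFile{}
	var wg sync.WaitGroup
	wg.Add(2)

	// "Main thread": the rotation handler resets tracking for the rotated file.
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			f.readOffset = 0 // rotation rewrites the cached state
		}
	}()

	// "I/O thread": the reader advances the offset as new records arrive.
	go func() {
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			f.readOffset += 512 // concurrent, unsynchronized update
		}
	}()

	wg.Wait()
	// The final value is nondeterministic; `go run -race` flags the
	// conflicting writes, mirroring how a rotated file could be mistaken
	// for fully read while buffered records were still pending.
	fmt.Println("final offset:", f.readOffset)
}
```

Running this with Go's race detector (go run -race rotation_race_sketch.go) reports the conflicting writes immediately, which is the same class of defect our staging reproduction exercised at 100k events/sec.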
Remediation
Immediate actions taken during the incident:
- Rolled back Fluent Bit to version 2.8.9, which did not have the buggy inode cache implementation.
- Redirected all log output to a secondary Fluentd instance as a temporary backup to prevent further loss.
- Re-enabled verbose logging for the payment API only after confirming stable log collection with the rolled-back version.
Post-incident fixes:
- Upgraded all production Fluent Bit instances to version 2.9.5 (with the race condition fix) on October 16, 2024.
- Patched the tail plugin's inode cache to use thread-safe mutex locks for all update operations (a sketch of the locking pattern follows).
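The sketch below shows the locking pattern in Go using sync.Mutex. It is illustrative only: the real patch lives in Fluent Bit's C tail plugin, and the names lockedInodeCache, OnRotate, and Advance are hypothetical.

```go
// locked_cache_sketch.go
//
// Sketch of the locking pattern behind the fix: every mutation of the inode
// cache takes the same mutex, so rotation handling and offset advances can
// no longer interleave mid-update. Illustrative only; not Fluent Bit's code.
package main

import (
	"fmt"
	"sync"
)

// lockedInodeCache serializes all updates behind a single mutex.
type lockedInodeCache struct {
	mu      sync.Mutex
	offsets map[uint64]int64 // inode → bytes already read
}

// OnRotate resets tracking for a rotated file's inode.
func (c *lockedInodeCache) OnRotate(inode uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.offsets[inode] = 0
}

// Advance records n newly read bytes for an inode.
func (c *lockedInodeCache) Advance(inode uint64, n int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.offsets[inode] += n
}

func main() {
	cache := &lockedInodeCache{offsets: make(map[uint64]int64)}
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // stands in for the rotation handler
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			cache.OnRotate(42)
		}
	}()

	go func() { // stands in for the reader / I/O path
		defer wg.Done()
		for i := 0; i < 100000; i++ {
			cache.Advance(42, 512)
		}
	}()

	wg.Wait()
	fmt.Println("no data race reported under `go run -race`")
}
```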
Prevention Steps
To prevent similar incidents in the future, we implemented the following changes:
- Stricter change management for logging stack updates: All Fluent Bit version upgrades now require 7 days of staging validation under high-volume log loads (≥50k events/sec) before production deployment.
- Enhanced monitoring for log drops: Deployed a custom Prometheus exporter to track Fluent Bit's input_tail_dropped_records metric, with alerts triggered at a >0.1% drop rate (an exporter sketch follows this list).
- Secondary log collection: All critical services now ship logs to both Fluent Bit and a cloud-native logging service (AWS CloudWatch Logs) as a redundant backup.
- Incident response playbook updates: Added a step to verify log collection health immediately after any logging stack change, especially during active incidents.
- Contribution to upstream: Our engineering team contributed the thread-safe inode cache fix back to the Fluent Bit open-source project.
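For the drop-rate alerting mentioned above, the sketch below shows one way such an exporter could look in Go with the Prometheus client library. It assumes a fetchTailCounters helper (hypothetical, with placeholder values here) that reads Fluent Bit's cumulative ingested/dropped counters for the tail input; the exported metric name fluentbit_tail_drop_ratio, the poll interval, and the port are illustrative choices rather than our exact implementation.

```go
// droprate_exporter.go
//
// Minimal drop-rate exporter sketch. fetchTailCounters is a hypothetical
// stand-in for scraping Fluent Bit's counters; everything else uses the
// github.com/prometheus/client_golang module.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var dropRate = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "fluentbit_tail_drop_ratio",
	Help: "Dropped / total records for the tail input over the last poll interval.",
})

// fetchTailCounters is a placeholder for reading Fluent Bit's cumulative
// counters (total records ingested, records dropped). Replace with a real
// scrape of the agent's metrics endpoint in your environment.
func fetchTailCounters() (total, dropped float64) {
	return 1_000_000, 150 // placeholder values for the sketch
}

func main() {
	prometheus.MustRegister(dropRate)

	// Poll the counters periodically and publish the per-interval drop ratio.
	go func() {
		var prevTotal, prevDropped float64
		for range time.Tick(30 * time.Second) {
			total, dropped := fetchTailCounters()
			dTotal, dDropped := total-prevTotal, dropped-prevDropped
			prevTotal, prevDropped = total, dropped
			if dTotal > 0 {
				dropRate.Set(dDropped / dTotal)
			}
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9200", nil))
}
```

A Prometheus alert rule that fires when fluentbit_tail_drop_ratio stays above 0.001 (0.1%) for a few minutes would match the threshold described in the monitoring bullet above.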
Conclusion
This incident highlighted the risk of deploying unvalidated performance optimizations to critical infrastructure during active incidents. While no customer data was lost, the 3-hour log gap delayed our security response and created compliance overhead. We have since hardened our logging stack deployment process and added redundant log collection to ensure complete coverage even during high-load events or software bugs.