Ryosuke Hara for Axelspace

Saved EBS Costs by Cleaning Up 3 TiB of Duplicate Data in InfluxDB v1

Hi, I’m rhara, a software engineer at Axelspace.

In this post, I’ll share how we reduced the size of an InfluxDB OSS v1 EBS volume from 5.9 TiB to 2.9 TiB by removing duplicate data. Since I couldn’t find much information on this process, I’m writing this as both a record and a reference.

Note: This method requires creating a new InfluxDB database with a different name, meaning the original database name cannot be retained.

To restore the original name, additional steps such as using the SELECT * INTO clause again are required.

What is InfluxDB?

InfluxDB is a time-series database optimized for storing and querying time-based data, especially in use cases where high write and read throughput is important—such as log aggregation. (Docs)

InfluxDB at Axelspace

At Axelspace, we use InfluxDB to store satellite telemetry data, such as power consumption data. While InfluxDB v2 is the mainstream version today, we've continued using the OSS version of v1 in Docker on an EC2 instance since our first satellite, GRUS-1A, was launched in 2018.

Over time, we noticed that some of the data had accumulated with duplication, leading to a bloated EBS volume. To reduce storage costs, we decided to clean up the duplicate data and migrate to a smaller EBS volume.


InfluxDB Cleanup

How to Reduce EBS Costs?

Since EBS pricing depends on storage size, reducing the volume size can lower costs. However, AWS does not allow shrinking EBS volumes directly.

Following this AWS article, we opted to create a new, smaller volume and replace the existing one.


Cleaning Up Duplicate Data in InfluxDB

Our first idea was to use InfluxDB's DELETE command to remove duplicate data directly from the existing volume, then copy the cleaned data to the new volume.

However, as noted in the official documentation, DELETE only allows specifying data to delete by timestamp or tag value—making it unsuitable for fine-grained removal of duplicates.

We considered several alternatives:

  1. Use the SELECT * INTO clause provided by InfluxQL
  2. Use Flux queries
  3. Export and deduplicate data with Telegraf, then re-ingest

We ultimately chose Option 1: SELECT * INTO.

This clause allows flexible data selection, including filtering or dropping fields, and writing the result into a new database—ideal for deduplication.

We ruled out Option 2 because some of the query functionality we needed was not implemented in Flux, and Option 3 because it required a full export and re-import of the data.


Challenges with SELECT * INTO

One downside is that SELECT * INTO creates a new copy of the data, temporarily increasing total data size. To avoid enlarging the existing EBS volume, we used the new volume as the destination for the copied data.

Also, since new telemetry data was being written during the cleanup, we had to ensure that InfluxDB could remain online and that memory usage wouldn’t spike. We processed data in small time windows (e.g., one or three days) to keep memory usage manageable.


Step-by-Step Procedure

As mentioned, the core idea is to copy the data to a new, smaller EBS volume using SELECT * INTO, then swap the volumes. The copy itself removes the duplicates, because InfluxDB keeps only one point per measurement, tag set, and timestamp when data is written.

This required making both volumes accessible to a single InfluxDB process, which took some workarounds.

We followed these five steps:

  1. Set up a smaller EBS volume
  2. Symlink both old and new volumes into the InfluxDB directory structure
  3. Execute SELECT * INTO to deduplicate and copy data
  4. Copy any additional data from old to new volume
  5. Replace the old volume with the new one

1. Prepare the New EBS Volume

We created a new EBS volume to hold the cleaned data. We refer to this as the new volume and the currently used one as the old volume.

Due to EBS limitations, we couldn't shrink the old volume, so our goal was to replace it.

How to attach a volume is covered in the official AWS documentation.
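As a rough sketch, preparing the new volume looked something like the following. The device name /dev/xvdf, the xfs filesystem, and the influxdb user are assumptions for illustration; check lsblk for your actual device name.

```shell
# Format and mount the new volume (assumed device: /dev/xvdf).
sudo mkfs -t xfs /dev/xvdf
sudo mkdir -p /mnt/new_ebs
sudo mount /dev/xvdf /mnt/new_ebs

# Pre-create the InfluxDB directory layout and hand it to the influxdb user.
sudo mkdir -p /mnt/new_ebs/data /mnt/new_ebs/wal /mnt/new_ebs/meta
sudo chown -R influxdb:influxdb /mnt/new_ebs
```

Creating the data, wal, and meta subdirectories up front mirrors the layout InfluxDB expects under /var/lib/influxdb/, which makes the symlink step below straightforward.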


2. Using Both Volumes in InfluxDB

InfluxDB OSS v1 doesn't natively support splitting storage across multiple volumes.

So we had to take some steps that aren't officially documented.

2.1 Using Symbolic Links

By default, InfluxDB stores data in /var/lib/influxdb/ with three subdirectories: data, wal, and meta. (Docs)

Since each database has its own subdirectory, we could symlink only the new database's directories:

ln -s /mnt/new_ebs/wal/new_db /var/lib/influxdb/wal/new_db
ln -s /mnt/new_ebs/data/new_db /var/lib/influxdb/data/new_db

In this explanation, we use /mnt/new_ebs as the mount point for the new volume.


Note: Symlinks don’t always work reliably in InfluxDB (the link discusses v2), so data may appear missing temporarily.

To avoid this, you can configure InfluxDB to directly use the real path rather than through symlinks.

NOTE: Using bind mounts instead of symlinks may be more robust:

(https://community.influxdata.com/t/how-to-move-var-lib-influxdb-to-a-different-location/30163/2)

2.2 Create the New Database

After setting up the symlinks, run the following InfluxQL command to create the new database:

CREATE DATABASE new_db

3. Copy and Deduplicate Data via SELECT * INTO

We ran commands like this:

SELECT * INTO "new_db"."autogen".:MEASUREMENT FROM "old_db"."autogen"./.*/ WHERE time >= ... AND time < ... GROUP BY *

The :MEASUREMENT back reference together with the /.*/ regex copies every measurement into the destination under its own name, and GROUP BY * keeps tags as tags—without it, tags are written to the destination as fields.

You can also specify tags or fields instead of using *.

This was the most time-consuming step—it took months to process about 5 TiB of data spanning several years.
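The time-windowed runs can be scripted. Below is a minimal sketch that only prints one statement per one-day window; the database names, the date range, and the autogen retention policy are assumptions, and GNU date is required.

```shell
#!/bin/sh
# Sketch: emit one SELECT ... INTO statement per one-day window so each query
# stays small and memory usage stays bounded. Adjust START/END and the
# database names to your data. Requires GNU date.
START="2018-12-01"
END="2018-12-04"

cur="$START"
while [ "$(date -d "$cur" +%s)" -lt "$(date -d "$END" +%s)" ]; do
  next=$(date -d "$cur + 1 day" +%Y-%m-%d)
  printf 'SELECT * INTO "new_db"."autogen".:MEASUREMENT FROM "old_db"."autogen"./.*/ WHERE time >= '\''%sT00:00:00Z'\'' AND time < '\''%sT00:00:00Z'\'' GROUP BY *\n' "$cur" "$next"
  cur="$next"
done
```

Each emitted line can then be executed one at a time, for example with influx -execute "$line", pausing between windows if write load on the live instance is a concern.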


4. Copy Additional Data to New Volume

Some remaining steps:

  • Copy /var/lib/influxdb/meta/meta.db to /mnt/new_ebs/meta/meta.db
  • If there are other databases beyond old_db, copy them as well:
cp -avi /var/lib/influxdb/data/other_db /mnt/new_ebs/data
cp -avi /var/lib/influxdb/wal/other_db /mnt/new_ebs/wal

Using rsync is also a valid option.
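For reference, the equivalent copy with rsync might look like this, using the same paths as the steps above (other_db stands in for any additional database name):

```shell
# Copy metadata and any remaining databases to the new volume.
sudo rsync -av /var/lib/influxdb/meta/ /mnt/new_ebs/meta/
sudo rsync -av /var/lib/influxdb/data/other_db/ /mnt/new_ebs/data/other_db/
sudo rsync -av /var/lib/influxdb/wal/other_db/ /mnt/new_ebs/wal/other_db/
```

rsync has the advantage of being resumable, which helps if the copy is interrupted.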


5. Replace the Volumes

Finally, point InfluxDB to the new volume by adjusting mount points or configuration.

This can be done via the config file or environment variables (Docs).
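As a sketch, the environment-variable route could look like this. InfluxDB v1 maps config keys to variables named INFLUXDB_&lt;SECTION&gt;_&lt;KEY&gt;; the /mnt/new_ebs paths are this post's mount point, and in a Docker setup these would be passed with docker run -e.

```shell
# Point the InfluxDB v1 process at the directories on the new volume.
export INFLUXDB_META_DIR=/mnt/new_ebs/meta
export INFLUXDB_DATA_DIR=/mnt/new_ebs/data
export INFLUXDB_DATA_WAL_DIR=/mnt/new_ebs/wal
```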


Final Notes

While our main target was duplicate data, the SELECT * INTO clause offers flexibility to remove or transform data during migration.

Again, note that this approach does not preserve the original database name.

Since InfluxDB doesn't support renaming databases, if you must retain the name, you'll need to re-create the database with the same name and run SELECT * INTO back into it.
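A sketch of that rename-back, assuming the default autogen retention policy (note that it temporarily doubles the data again, so the volume needs the headroom):

```
CREATE DATABASE old_db
SELECT * INTO "old_db"."autogen".:MEASUREMENT FROM "new_db"."autogen"./.*/ GROUP BY *
```

Once the copy is verified, the intermediate new_db can be dropped to reclaim the space.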
