<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ryosuke Hara</title>
    <description>The latest articles on DEV Community by Ryosuke Hara (@ryosuke_hara_8f0127278823).</description>
    <link>https://dev.to/ryosuke_hara_8f0127278823</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3112509%2F62af01dd-c067-4306-a7f8-d12ac1b32c9d.png</url>
      <title>DEV Community: Ryosuke Hara</title>
      <link>https://dev.to/ryosuke_hara_8f0127278823</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ryosuke_hara_8f0127278823"/>
    <language>en</language>
    <item>
      <title>Saved EBS Costs by Cleaning Up 3 TiB of Duplicate Data in InfluxDB v1</title>
      <dc:creator>Ryosuke Hara</dc:creator>
      <pubDate>Wed, 28 May 2025 07:33:39 +0000</pubDate>
      <link>https://dev.to/axelspace/saved-ebs-costs-by-cleaning-up-3-tib-of-duplicate-data-in-influxdb-v1-23hp</link>
      <guid>https://dev.to/axelspace/saved-ebs-costs-by-cleaning-up-3-tib-of-duplicate-data-in-influxdb-v1-23hp</guid>
      <description>&lt;p&gt;Hi, I’m &lt;a href="https://www.axelspace.com/" rel="noopener noreferrer"&gt;rhara&lt;/a&gt;, a software engineer at Axelspace.&lt;/p&gt;

&lt;p&gt;In this post, I’ll share how we reduced the size of an &lt;a href="https://docs.influxdata.com/influxdb/v1/" rel="noopener noreferrer"&gt;InfluxDB OSS v1&lt;/a&gt; EBS volume from 5.9 TiB to 2.9 TiB by removing duplicate data. Since I couldn’t find much information on this process, I’m writing this as both a record and a reference.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: This method requires creating a new InfluxDB database with a different name, meaning the original database name cannot be retained.&lt;br&gt;&lt;br&gt;
To restore the original name, additional steps such as using the &lt;code&gt;SELECT * INTO&lt;/code&gt; clause again are required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is InfluxDB?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.influxdata.com/influxdb/v1/" rel="noopener noreferrer"&gt;InfluxDB&lt;/a&gt; is a time-series database optimized for storing and querying time-based data, especially in use cases where high write and read throughput is important—such as log aggregation. (&lt;a href="https://docs.influxdata.com/influxdb/v1/concepts/crosswalk/#influxdb-is-not-crud" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  InfluxDB at Axelspace
&lt;/h2&gt;

&lt;p&gt;At Axelspace, we use InfluxDB to store satellite telemetry data, such as power consumption data. While InfluxDB v2 is the mainstream version today, we've continued using the OSS version of v1 in Docker on an EC2 instance since our first satellite, GRUS-1A, was launched in 2018.&lt;/p&gt;

&lt;p&gt;Over time, we noticed that some of the data had been stored in duplicate, leading to a bloated EBS volume. To reduce storage costs, we decided to clean up the duplicate data and migrate to a smaller EBS volume.&lt;/p&gt;




&lt;h2&gt;
  
  
  InfluxDB Cleanup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How to Reduce EBS Costs?
&lt;/h3&gt;

&lt;p&gt;Since EBS pricing depends on storage size, reducing the volume size can lower costs. However, AWS does not allow shrinking EBS volumes directly.&lt;br&gt;&lt;br&gt;
Following &lt;a href="https://repost.aws/en/knowledge-center/ebs-increase-decrease-volume-size" rel="noopener noreferrer"&gt;this AWS article&lt;/a&gt;, we opted to create a new, smaller volume and replace the existing one.&lt;/p&gt;


&lt;h3&gt;
  
  
  Cleaning Up Duplicate Data in InfluxDB
&lt;/h3&gt;

&lt;p&gt;Our first idea was to use InfluxDB's &lt;code&gt;DELETE&lt;/code&gt; command to remove duplicate data directly from the existing volume, then copy the cleaned data to the new volume.&lt;/p&gt;

&lt;p&gt;However, &lt;a href="https://docs.influxdata.com/influxdb/v1/query_language/manage-database/#delete-series-with-delete" rel="noopener noreferrer"&gt;as noted in the official documentation&lt;/a&gt;, &lt;code&gt;DELETE&lt;/code&gt; only allows specifying data to delete by timestamp or tag value—making it unsuitable for fine-grained removal of duplicates.&lt;/p&gt;

&lt;p&gt;We considered several alternatives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the &lt;a href="https://docs.influxdata.com/influxdb/v1/query_language/explore-data/#the-into-clause" rel="noopener noreferrer"&gt;&lt;code&gt;SELECT * INTO&lt;/code&gt; clause&lt;/a&gt; provided by InfluxQL
&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://docs.influxdata.com/flux/v0/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt; queries
&lt;/li&gt;
&lt;li&gt;Export and deduplicate data with &lt;a href="https://docs.influxdata.com/telegraf/v1/" rel="noopener noreferrer"&gt;Telegraf&lt;/a&gt;, then re-ingest&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We ultimately chose Option 1: &lt;code&gt;SELECT * INTO&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This clause allows flexible data selection, including filtering or dropping fields, and writing the result into a new database—ideal for deduplication.&lt;br&gt;&lt;br&gt;
We ruled out Option 2 because some of the queries we needed were not implemented, and Option 3 because it required a full export and re-import of the data.&lt;/p&gt;


&lt;h3&gt;
  
  
  Challenges with &lt;code&gt;SELECT * INTO&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;One downside is that &lt;code&gt;SELECT * INTO&lt;/code&gt; creates a new copy of the data, temporarily increasing total data size. To avoid enlarging the existing EBS volume, we used the new volume as the destination for the copied data.&lt;/p&gt;

&lt;p&gt;Also, since new telemetry data was being written during the cleanup, we had to ensure that InfluxDB could remain online and that memory usage wouldn’t spike. We processed data in small time windows (e.g., one or three days) to keep memory usage manageable.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step-by-Step Procedure
&lt;/h2&gt;

&lt;p&gt;As mentioned, the core idea is to move the data to a new, smaller EBS volume using &lt;code&gt;SELECT * INTO&lt;/code&gt;, which removes the duplicates as part of the copy, and then swap the volumes.&lt;br&gt;&lt;br&gt;
This required making both volumes accessible to a single InfluxDB process, which took some workarounds.&lt;/p&gt;

&lt;p&gt;We followed these five steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up a smaller EBS volume
&lt;/li&gt;
&lt;li&gt;Symlink both old and new volumes into the InfluxDB directory structure
&lt;/li&gt;
&lt;li&gt;Execute &lt;code&gt;SELECT * INTO&lt;/code&gt; to deduplicate and copy data
&lt;/li&gt;
&lt;li&gt;Copy any additional data from old to new volume
&lt;/li&gt;
&lt;li&gt;Replace the old volume with the new one&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  1. Prepare the New EBS Volume
&lt;/h3&gt;

&lt;p&gt;We created a new EBS volume to hold the cleaned data. We refer to this as the &lt;em&gt;new&lt;/em&gt; volume and the currently used one as the &lt;em&gt;old&lt;/em&gt; volume.&lt;br&gt;&lt;br&gt;
Due to EBS limitations, we couldn't shrink the old volume, so our goal was to replace it.&lt;/p&gt;

&lt;p&gt;How to attach a volume is covered in the &lt;a href="https://docs.aws.amazon.com/ebs/latest/userguide/ebs-attaching-volume.html" rel="noopener noreferrer"&gt;official AWS documentation&lt;/a&gt;.&lt;/p&gt;
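
&lt;p&gt;Once the volume is attached, it still needs a filesystem and a mount point. The commands below are a minimal sketch of that preparation, not a record of what we ran: the device name &lt;code&gt;/dev/nvme1n1&lt;/code&gt; and the choice of XFS are assumptions, and &lt;code&gt;/mnt/new_ebs&lt;/code&gt; is simply the mount point used in the rest of this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Identify the newly attached device; /dev/nvme1n1 below is a placeholder.
lsblk

# Create a filesystem on the EMPTY new volume (this erases anything already on it).
sudo mkfs -t xfs /dev/nvme1n1

# Mount it and create the directories InfluxDB expects.
sudo mkdir -p /mnt/new_ebs
sudo mount /dev/nvme1n1 /mnt/new_ebs
sudo mkdir -p /mnt/new_ebs/data /mnt/new_ebs/wal /mnt/new_ebs/meta
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;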


&lt;h3&gt;
  
  
  2. Using Both Volumes in InfluxDB
&lt;/h3&gt;

&lt;p&gt;InfluxDB OSS v1 doesn't natively support splitting storage across multiple volumes.&lt;br&gt;&lt;br&gt;
So we had to take steps that are not officially documented.&lt;/p&gt;
&lt;h4&gt;
  
  
  2.1 Using Symbolic Links
&lt;/h4&gt;

&lt;p&gt;By default, InfluxDB stores data in &lt;code&gt;/var/lib/influxdb/&lt;/code&gt; with three subdirectories: &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;wal&lt;/code&gt;, and &lt;code&gt;meta&lt;/code&gt;. (&lt;a href="https://docs.influxdata.com/influxdb/v1/administration/config/#dir--varlibinfluxdbdata" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Since each database has its own subdirectory, we could symlink only the new database's directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /mnt/new_ebs/wal/new_db /var/lib/influxdb/wal/new_db
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /mnt/new_ebs/data/new_db /var/lib/influxdb/data/new_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this explanation, we use &lt;code&gt;/mnt/new_ebs&lt;/code&gt; as the mount point for the new volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mmt9h1rv3j2wkiy856b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mmt9h1rv3j2wkiy856b.png" alt="Image description" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: &lt;a href="https://community.influxdata.com/t/how-can-i-find-my-data-after-upgrade/21233/4" rel="noopener noreferrer"&gt;Symlinks don’t always work reliably in InfluxDB&lt;/a&gt; (the link discusses v2), so data may appear missing temporarily.&lt;br&gt;&lt;br&gt;
To avoid this, you can configure InfluxDB to use the real path directly instead of going through symlinks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: Using bind mounts instead of symlinks may be more robust:&lt;br&gt;&lt;br&gt;
(&lt;a href="https://community.influxdata.com/t/how-to-move-var-lib-influxdb-to-a-different-location/30163/2" rel="noopener noreferrer"&gt;https://community.influxdata.com/t/how-to-move-var-lib-influxdb-to-a-different-location/30163/2&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;
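
&lt;p&gt;For reference, the bind-mount alternative could look roughly like this. We used symlinks ourselves, so treat this as a sketch under assumptions: it reuses the hypothetical paths from above, needs root privileges, and should be done before InfluxDB starts so that it sees the mounted directories. If InfluxDB runs in Docker, the equivalent would likely be extra &lt;code&gt;-v&lt;/code&gt; flags on the container rather than host-level mounts.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Make sure the source directories exist on the new volume.
sudo mkdir -p /mnt/new_ebs/wal/new_db /mnt/new_ebs/data/new_db

# Create empty mount targets inside the InfluxDB directory tree.
sudo mkdir -p /var/lib/influxdb/wal/new_db /var/lib/influxdb/data/new_db

# Bind-mount the directories on the new volume onto those targets.
sudo mount --bind /mnt/new_ebs/wal/new_db /var/lib/influxdb/wal/new_db
sudo mount --bind /mnt/new_ebs/data/new_db /var/lib/influxdb/data/new_db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
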
&lt;h4&gt;
  
  
  2.2 Create the New Database
&lt;/h4&gt;

&lt;p&gt;After setting up the symlinks, run the following InfluxQL command to create the new database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;new_db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Copy and Deduplicate Data via &lt;code&gt;SELECT * INTO&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;We ran a command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;new_db&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;old_db&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also specify tags or fields instead of using &lt;code&gt;*&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
This was the most time-consuming step—it took months to process about 5 TiB of data spanning several years.&lt;/p&gt;
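
&lt;p&gt;Copying in small time windows can be scripted. The following is a minimal sketch, not our production script: it assumes bash, GNU &lt;code&gt;date&lt;/code&gt;, the default &lt;code&gt;autogen&lt;/code&gt; retention policy, and the v1 &lt;code&gt;influx&lt;/code&gt; CLI being reachable from where the script runs. The function names and dates are hypothetical. The fully qualified form with &lt;code&gt;:MEASUREMENT&lt;/code&gt; and &lt;code&gt;GROUP BY *&lt;/code&gt; follows the InfluxDB documentation, which notes that without &lt;code&gt;GROUP BY *&lt;/code&gt; tags in the source are written as fields in the destination.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

# Print one InfluxQL copy statement per one-day window in [start, end).
# Dates are YYYY-MM-DD; "old_db", "new_db" and "autogen" are placeholders.
gen_window_queries() {
  local cur="$1" end="$2" next
  while [[ "$cur" &amp;lt; "$end" ]]; do
    next=$(date -u -d "$cur + 1 day" +%Y-%m-%d)   # GNU date syntax
    printf 'SELECT * INTO "new_db"."autogen".:MEASUREMENT FROM "old_db"."autogen"./.*/ WHERE time &amp;gt;= %s AND time &amp;lt; %s GROUP BY *\n' \
      "'${cur}T00:00:00Z'" "'${next}T00:00:00Z'"
    cur="$next"
  done
}

# Run the statements one at a time so each copy stays small.
copy_range() {
  gen_window_queries "$1" "$2" | while IFS= read -r query; do
    echo "running: $query"
    influx -execute "$query"
  done
}

# Example (hypothetical dates): copy_range 2019-01-01 2019-01-08
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the query generation separate from the execution makes it easy to print and review the statements first, and to resume from a given date if a run is interrupted.&lt;/p&gt;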




&lt;h3&gt;
  
  
  4. Copy Additional Data to New Volume
&lt;/h3&gt;

&lt;p&gt;Some remaining steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy &lt;code&gt;/var/lib/influxdb/meta/meta.db&lt;/code&gt; to &lt;code&gt;/mnt/new_ebs/meta/meta.db&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If there are other databases beyond &lt;code&gt;old_db&lt;/code&gt;, copy them as well:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-avi&lt;/span&gt; /var/lib/influxdb/data/other_db /mnt/new_ebs/data
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-avi&lt;/span&gt; /var/lib/influxdb/wal/other_db /mnt/new_ebs/wal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;rsync&lt;/code&gt; is also a valid option.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Replace the Volumes
&lt;/h3&gt;

&lt;p&gt;Finally, point InfluxDB to the new volume by adjusting mount points or configuration.&lt;br&gt;&lt;br&gt;
This can be done via the config file or environment variables (&lt;a href="https://docs.influxdata.com/influxdb/v1/administration/config/#dir--varlibinfluxdbdata" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;).&lt;/p&gt;
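
&lt;p&gt;For a Docker-based setup like ours, the switch might look like the sketch below. The container names, the image tag, and the decision to keep the volume mounted at &lt;code&gt;/mnt/new_ebs&lt;/code&gt; inside the container are assumptions for illustration. The three environment variables override the &lt;code&gt;dir&lt;/code&gt; and &lt;code&gt;wal-dir&lt;/code&gt; settings of the v1 configuration file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Stop the running container before swapping storage (name is hypothetical).
docker stop influxdb

# Start InfluxDB v1 with its meta, data, and WAL directories on the new volume.
docker run -d --name influxdb-new \
  -p 8086:8086 \
  -v /mnt/new_ebs:/mnt/new_ebs \
  -e INFLUXDB_META_DIR=/mnt/new_ebs/meta \
  -e INFLUXDB_DATA_DIR=/mnt/new_ebs/data \
  -e INFLUXDB_DATA_WAL_DIR=/mnt/new_ebs/wal \
  influxdb:1.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Alternatively, mounting the new volume at &lt;code&gt;/var/lib/influxdb&lt;/code&gt; keeps the default paths and needs no configuration change.&lt;/p&gt;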




&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;While our main target was duplicate data, the &lt;code&gt;SELECT * INTO&lt;/code&gt; clause offers flexibility to remove or transform data during migration.&lt;/p&gt;

&lt;p&gt;Again, note that this approach does &lt;strong&gt;not&lt;/strong&gt; preserve the original database name.&lt;br&gt;&lt;br&gt;
Since InfluxDB doesn't support renaming databases, if you must retain the name, you'll need to re-create the database with the same name and run &lt;code&gt;SELECT * INTO&lt;/code&gt; back into it.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
