TiKV Component GC (Physical Space Reclamation) Principles and Common Issues

This article was written by Shirly Wu, Support Escalation Engineer at PingCAP.

In TiDB, the GC worker uploads the GC safepoint to the PD server, and all TiKV instances regularly fetch the GC safepoint from PD. If there is any change, the TiKV instances will use the latest GC safepoint to start the local GC process. In this article, we will focus on the principles of TiKV-side GC and discuss some common issues.

GC Key in TiKV

During GC, TiDB clears all locks before the GC safepoint across the entire cluster using the resolve-locks mechanism. This means that by the time the GC safepoint reaches TiKV, all transactions before the safepoint have already been resolved and no locks remain; their statuses can be determined from the primary key of each distributed transaction. At this point, we can confidently and safely delete old version data.

So, how exactly does the deletion process work?

Let’s look at the following example:

The current key has four versions, written in the following sequence:

1:00 — New write or update, data stored in the default CF
2:00 — Update, data stored in the default CF
3:00 — Delete
4:00 — New write, data stored in the default CF


If the GC safepoint is 2:30, which guarantees snapshot consistency as of 2:30, we will retain the version read at 2:30: key_02:00 => (PUT, 01:30). All older versions will be deleted, in this case key_01:00, and both its write CF record and the corresponding data in the default CF are removed.

What if the GC safepoint is 3:30?


Similarly, the version read at 3:30, key_03:00 => DELETE, will be retained. All versions before 3:00 will be deleted.

We see that the snapshot at 3:30 indicates that the transaction committed at key_03:00 deleted this key. So, is it necessary to retain the MVCC version at 03:00? No, it is not. Therefore, under normal circumstances, if the GC safepoint is 3:30, the data that needs to be garbage collected for this key is every existing version: key_01:00, key_02:00, and key_03:00, together with their corresponding data in the default CF.

This is how the GC process works for a specific key — TiKV’s GC process requires scanning all the keys on the current TiKV instance and deleting the old versions that meet the GC criteria.
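
To make this rule concrete, here is a minimal Rust sketch of the per-key decision. It is a simplification for illustration only (the types and names are not TiKV's actual code): every version newer than the safepoint is kept, the newest version at or below the safepoint is kept unless it is a DELETE, and everything older is dropped.

#[derive(Debug, PartialEq, Clone, Copy)]
enum WriteType {
    Put,
    Delete,
}

#[derive(Debug, Clone, Copy)]
struct Version {
    commit_ts: u64,
    write_type: WriteType,
}

/// Given the versions of one key sorted by commit_ts in descending order,
/// return the versions that survive GC at `safepoint`.
fn gc_one_key(versions: &[Version], safepoint: u64) -> Vec<Version> {
    let mut kept = Vec::new();
    let mut seen_newest_below_safepoint = false;
    for v in versions {
        if v.commit_ts > safepoint {
            // Versions newer than the safepoint are always kept.
            kept.push(*v);
        } else if !seen_newest_below_safepoint {
            // The newest version at or below the safepoint defines the snapshot
            // at the safepoint; keep it unless it is a DELETE, which no reader
            // at or after the safepoint can observe anyway.
            seen_newest_below_safepoint = true;
            if v.write_type != WriteType::Delete {
                kept.push(*v);
            }
        }
        // Everything older is dropped, together with its default CF data.
    }
    kept
}

fn main() {
    // The example above: commits at 1:00, 2:00, 3:00 (DELETE), and 4:00.
    let versions = [
        Version { commit_ts: 400, write_type: WriteType::Put },
        Version { commit_ts: 300, write_type: WriteType::Delete },
        Version { commit_ts: 200, write_type: WriteType::Put },
        Version { commit_ts: 100, write_type: WriteType::Put },
    ];
    // With safepoint 3:30, only the 4:00 version survives.
    println!("{:?}", gc_one_key(&versions, 330));
    // With safepoint 2:30, the 4:00, 3:00, and 2:00 versions survive.
    println!("{:?}", gc_one_key(&versions, 230));
}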

Related Monitoring

gc_keys can create read pressure on the system, since it needs to scan all versions of the current key to determine whether old versions should be deleted. Related monitoring can be found in tikv-details -> GC -> GC scan write/default details, which records the pressure on RocksDB's write/default CF during GC worker execution.

The duration of gc_keys can be seen in tikv-details -> GC -> GC tasks duration. If there is high latency in this area, it indicates significant GC pressure or that the read/write pressure on the system is affecting the GC process.

GC in TiKV


GC Worker

Each TiKV instance has a GC worker thread responsible for handling specific GC tasks. The GC worker in TiKV mainly handles the following types of requests:

  • GC_keys: This involves scanning and deleting old versions of a specific key that meet the GC criteria. The detailed process was described in the first chapter.
  • GC(range): This involves using GC_keys on a specified range of keys, performing GC on each key individually within the range.
  • unsafe-destroy-range: This is the direct physical cleanup of a continuous range of data, corresponding to operations like truncate/drop table/partition mentioned in the previous article.

Currently, the GC worker has two key configurations, which cannot be adjusted:

  • Thread count: Currently, the GC worker has only one thread. This is hardcoded in TiKV's source code, and no external configuration is provided.
  • GC_MAX_PENDING_TASKS: This is the maximum number of tasks the GC worker queue can handle, set to 4096.
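
As a rough illustration of how these two constraints interact, the sketch below models a single worker thread behind a bounded queue of 4,096 pending tasks: when the queue is full, a new task is rejected instead of queued. The types and the rejection behavior here are illustrative, not TiKV's actual scheduler code.

use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

const GC_MAX_PENDING_TASKS: usize = 4096;

// Illustrative task types, mirroring the requests described above.
enum GcTask {
    GcKeys(Vec<u8>),
    UnsafeDestroyRange { start: Vec<u8>, end: Vec<u8> },
}

fn main() {
    // A bounded queue: at most GC_MAX_PENDING_TASKS tasks may be pending.
    let (tx, rx) = sync_channel::<GcTask>(GC_MAX_PENDING_TASKS);

    // The single GC worker thread drains tasks one at a time.
    let worker = thread::spawn(move || {
        while let Ok(task) = rx.recv() {
            match task {
                GcTask::GcKeys(key) => println!("gc_keys on {:?}", key),
                GcTask::UnsafeDestroyRange { start, end } => {
                    println!("unsafe_destroy_range [{:?}, {:?})", start, end)
                }
            }
        }
    });

    // Producers must handle the queue-full case: the task is rejected instead
    // of blocking the caller (a simplification of the real scheduler).
    for task in [
        GcTask::GcKeys(b"key_a".to_vec()),
        GcTask::UnsafeDestroyRange { start: b"t_100".to_vec(), end: b"t_101".to_vec() },
    ] {
        match tx.try_send(task) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => eprintln!("GC worker queue is full, task dropped"),
            Err(TrySendError::Disconnected(_)) => eprintln!("GC worker has stopped"),
        }
    }

    drop(tx); // Close the channel so the worker thread exits.
    worker.join().unwrap();
}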

Related Monitoring

GC tasks QPS/duration can be monitored in tikv-details -> GC -> GC tasks/GC tasks duration. If the GC tasks duration is high, we need to check whether the GC worker’s CPU resources are sufficient, in combination with the QPS.

GC worker CPU usage: tikv-details -> thread CPU -> GC worker


GC Manager

The GC manager is the thread responsible for driving the GC work in TiKV. The main steps are:

  • Syncing the GC safepoint to local memory
  • Driving the execution of the actual GC tasks

1. Syncing GC Safepoint to Local
The GC manager requests the latest GC safepoint from PD every ten seconds and refreshes the safepoint held in memory. The related monitoring can be found in tikv-details -> GC.
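
A minimal sketch of this polling loop; the ten-second interval matches the description above, while the function and variable names are placeholders rather than TiKV's actual code.

use std::thread;
use std::time::Duration;

/// Stand-in for the PD client call that returns the cluster GC safepoint.
/// In reality this is an RPC to PD; here it just returns a fixed value.
fn get_gc_safepoint_from_pd() -> u64 {
    42
}

fn run_gc_manager(mut local_safepoint: u64) {
    loop {
        let new_safepoint = get_gc_safepoint_from_pd();
        if new_safepoint > local_safepoint {
            // The safepoint advanced: refresh the in-memory copy and let the
            // GC jobs (or the compaction filter) pick up the new value.
            local_safepoint = new_safepoint;
            println!("GC safepoint advanced to {}", local_safepoint);
        }
        // Poll PD again roughly every ten seconds.
        thread::sleep(Duration::from_secs(10));
    }
}

fn main() {
    // Runs forever, like the real GC manager thread.
    run_gc_manager(0);
}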

Common Issues
When monitoring shows that the TiKV auto GC safepoint is stuck for a long time and not advancing, it indicates that the GC state on the TiDB side may have a problem. At this point, you need to follow the troubleshooting steps in the previous article to investigate why GC on the TiDB side is stuck.

2. Implementing GC Jobs
If the GC manager detects that the GC safepoint has advanced, it begins implementing the specific GC tasks based on the current system configuration. This part is mainly divided into two methods based on the gc.enable-compaction-filter parameter:

a. Traditional GC, where GC(range) is called for each region.

b. GC via compaction filter (the default since v5.0): instead of actively performing GC, this method waits until RocksDB performs compaction, at which point old versions are reclaimed by the compaction filter.

TiKV GC Implementation Methods

Next, we will explain the principles and common troubleshooting steps for these two GC methods.

GC by Region (Traditional GC)

In traditional GC, when gc.enable-compaction-filter is set to false, the GC manager performs GC on each TiKV region one by one. This process is referred to as “GC a round.”

GC a Round

In traditional GC, a round of GC is completed when GC has been executed on all regions, and we mark the progress as 100%. If GC progress never reaches 100%, it indicates that GC pressure is high, and the physical space reclamation process is being affected. GC progress can be monitored in tikv-details -> GC -> tikv-auto-gc-progress. We can also observe the time taken for each round of GC in TiKV.


After introducing the concept of a ‘GC round,’ let’s now look at how TiKV defines a round during execution.

In simple terms, the GC manager starts GC work from the first region, continuing until the last region’s GC work is completed. However, if there is too much GC work, the GC safepoint may advance before the GC reaches the last region. In this case, do we continue using the old safepoint or the new one?

The answer is to use the latest GC safepoint in real time for each region that follows. This is a simple optimization of traditional GC.

Here's an example of how TiKV processes an updated GC safepoint during GC execution:

  1. When GC starts, the GC safepoint is 10, and we use safepoint=10 to GC regions 1 and 2.
  2. After finishing GC in region 2, the GC safepoint advances to 20. From this point on, we use 20 to continue GC in the remaining regions.
  3. Once all regions have been GC’d with gc safepoint=20, we start again from the first region, now using gc safepoint=20 for GC.
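
A minimal Rust sketch of this loop, where current_gc_safepoint and gc_region are hypothetical stand-ins for the real calls into PD and the GC worker:

/// One round of traditional GC: walk every region on this TiKV instance,
/// re-reading the GC safepoint before each region so that later regions use
/// the newest value if it advanced mid-round.
fn gc_a_round(
    regions: &[u64],
    current_gc_safepoint: impl Fn() -> u64,
    gc_region: impl Fn(u64, u64),
) {
    for &region_id in regions {
        let safepoint = current_gc_safepoint(); // may be newer than at round start
        gc_region(region_id, safepoint);        // GC(range) over this region
    }
    // When the last region finishes, the round is 100% complete and the next
    // round starts again from the first region.
}

fn main() {
    let regions = [1u64, 2, 3, 4];
    // Simulate the example: the safepoint is 10 while regions 1 and 2 are
    // processed, then advances to 20 for the remaining regions.
    let calls = std::cell::Cell::new(0u32);
    let safepoint = move || {
        let n = calls.get();
        calls.set(n + 1);
        if n < 2 { 10 } else { 20 }
    };
    gc_a_round(&regions, safepoint, |region, sp| {
        println!("GC region {} with safepoint {}", region, sp)
    });
}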

Common Issues

In traditional GC, since all old versions must be scanned before being deleted, and each deletion is then written to RocksDB as a new record, the system is heavily impacted. This manifests in the following ways:

1. Impact on GC Progress:

  • The GC worker becomes a bottleneck. Since all reclamation tasks need to be handled by the GC worker, which only has one thread, its CPU usage will be fully occupied. You can monitor this using: tikv-details -> thread CPU -> GC worker

2. Impact on Business Read/Write:

  • Raftstore read/write pressure increases: The GC worker needs to scan all data versions and then delete the matching ones during GC_keys tasks.
  • The increased write load on RocksDB in the short term causes rapid accumulation of L0 files, triggering RocksDB compaction.

3. Physical Space Usage Increases:

  • Since a DELETE in RocksDB ultimately means writing a new entry (a tombstone) for the current key, physical space usage might actually increase at first.
  • Only after RocksDB compaction completes will the physical space be reclaimed, and this compaction requires temporary space.

If the business cannot tolerate these impacts, the workaround is to enable the gc.enable-compaction-filter parameter.

GC with Compaction Filter

As mentioned earlier, in traditional GC, we scan TiKV’s MVCC entries one by one and use the safepoint to determine which data can be deleted, sending a DELETE key operation to Raftstore (RocksDB). Since RocksDB uses an LSM tree architecture and its internal MVCC mechanism, old versions of data are not immediately deleted when new data is written, even during a DELETE operation. They are retained alongside the new data.

Why Compaction in RocksDB?

Let’s explore the RocksDB architecture to understand the compaction mechanism.


RocksDB uses an LSM tree structure to improve write performance.

RocksDB Write Flow
When RocksDB receives a write operation (PUT(key => value)), the complete process is as follows:

1. When a new key is written, it is first written to the WAL and the memtable, and then a success response is returned.

  • RocksDB appends the data directly to the WAL file, persisting it locally.
  • The data is inserted into the active memtable, which is fast because it operates in memory, and the data in the memtable is kept ordered.

2. As more data is written, the memtable gradually fills up. When it becomes full, the active memtable is marked as immutable, and a new active memtable is created to receive new writes.

3. The data from the immutable memtable is flushed to a local file, which we call an SST file.

4. Over time, more SST files are created. Note that SST files flushed directly from memtables contain ordered data, but their key ranges can overlap and each may span the whole key space. These SST files are placed in Level 0.
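
As an illustration of this flow, here is a toy LSM write path in Rust, reduced to its bare essentials; nothing here is RocksDB's actual code.

use std::collections::BTreeMap;

/// A toy LSM write path: append to a WAL, insert into an ordered memtable,
/// and flush the memtable into an ordered "SST file" (here just a sorted Vec)
/// placed in Level 0 once it grows too large.
struct ToyLsm {
    wal: Vec<(String, String)>,         // stand-in for the append-only WAL file
    memtable: BTreeMap<String, String>, // active memtable, always ordered
    level0: Vec<Vec<(String, String)>>, // each element is one flushed "SST file"
    memtable_limit: usize,
}

impl ToyLsm {
    fn put(&mut self, key: &str, value: &str) {
        // 1. Append to the WAL first so the write survives a crash.
        self.wal.push((key.to_string(), value.to_string()));
        // 2. Insert into the in-memory, ordered memtable and acknowledge.
        self.memtable.insert(key.to_string(), value.to_string());
        // 3. When the memtable is full, flush it to a Level 0 SST file.
        if self.memtable.len() >= self.memtable_limit {
            let sst: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
            self.level0.push(sst); // L0 files may overlap in key range
        }
    }
}

fn main() {
    let mut db = ToyLsm {
        wal: Vec::new(),
        memtable: BTreeMap::new(),
        level0: Vec::new(),
        memtable_limit: 2,
    };
    db.put("b", "1");
    db.put("a", "2"); // memtable full: flushed to Level 0
    db.put("c", "3");
    println!("WAL entries: {}", db.wal.len());
    println!("L0 files: {:?}", db.level0);
    println!("memtable: {:?}", db.memtable);
}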

RocksDB Read Flow
When a read request is received, the lookup follows this sequence:

  1. Search in the memtable. If the key is found, return the value.
  2. Search the SST files through the block cache (the block cache holds data blocks read from SST files). Since the SST files in Level 0 contain the most recent data, they are searched first. Because Level 0 SSTs are flushed directly from memtables and their key ranges can overlap, in the worst case every Level 0 file must be checked one by one before moving down to lower levels.

Compaction for Improved Read Performance
From the above read flow, we can see that if we only have SST data in Level 0, as more and more files accumulate in Level 0, RocksDB’s read performance will degrade. To improve read performance, RocksDB performs merge sorting on the SST files in Level 0, a process known as compaction. The main tasks of RocksDB compaction are:

  1. Merging multiple SST files into one.
  2. Keeping only the latest MVCC version from RocksDB’s perspective (GC).
  3. Moving data downward from the upper levels into Level 1 through Level 6.
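
To make the merge step concrete, here is a toy Rust sketch: several sorted runs go in and, for each key, only the entry with the highest sequence number (standing in for RocksDB's internal versioning) survives. This is an illustration of the idea, not RocksDB code, and it ignores details such as snapshots that pin older versions.

use std::collections::BTreeMap;

/// One entry in an SST file: (user key, sequence number, value).
type Entry = (String, u64, String);

/// Merge several sorted runs (SST files), keeping only the entry with the
/// highest sequence number for each key: the core of a compaction.
fn compact(inputs: &[Vec<Entry>]) -> Vec<Entry> {
    let mut newest: BTreeMap<String, (u64, String)> = BTreeMap::new();
    for sst in inputs {
        for (key, seq, value) in sst {
            // Keep this entry only if it is newer than what we have seen.
            let should_replace = match newest.get(key) {
                Some((existing_seq, _)) => seq > existing_seq,
                None => true,
            };
            if should_replace {
                newest.insert(key.clone(), (*seq, value.clone()));
            }
        }
    }
    // The BTreeMap keeps keys ordered, so the output is again a sorted run,
    // ready to be written as one SST file in the next level down.
    newest
        .into_iter()
        .map(|(key, (seq, value))| (key, seq, value))
        .collect()
}

fn main() {
    let l0_file = vec![("a".to_string(), 9, "new".to_string())];
    let l1_file = vec![
        ("a".to_string(), 5, "old".to_string()),
        ("b".to_string(), 3, "x".to_string()),
    ];
    // Only a@9 and b@3 survive; a@5 is dropped during the merge.
    println!("{:?}", compact(&[l0_file, l1_file]));
}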

From the above process, we can see that RocksDB’s compaction work is similar to what we do during GC. So, is it possible to combine TiKV’s GC process with RocksDB’s compaction? Of course, it is.

Combining TiKV GC with Compaction

RocksDB provides an interface for compaction filters, which allows us to define rules for filtering keys during the compaction process. Based on the rules we provide in the compaction filter, we can decide whether to discard the key during this phase.
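
The shape of the rule TiKV plugs into this hook can be sketched as follows. This is a simplified stand-in that works on already-decoded timestamps and ignores the write_type::DEL special case discussed below; it is not the actual rust-rocksdb CompactionFilter trait.

/// Decision returned for each write CF key-value seen during compaction.
#[derive(Debug)]
enum Decision {
    Keep,
    Remove,
}

/// Minimal per-user-key state for a GC-style compaction filter. Compaction
/// feeds it the versions of one user key from newest to oldest commit_ts.
struct GcFilter {
    safepoint: u64,
    versions_seen_at_or_below_safepoint: u64,
}

impl GcFilter {
    fn filter(&mut self, commit_ts: u64) -> Decision {
        if commit_ts > self.safepoint {
            // Newer than the safepoint: always keep.
            return Decision::Keep;
        }
        self.versions_seen_at_or_below_safepoint += 1;
        if self.versions_seen_at_or_below_safepoint == 1 {
            // The newest version at or below the safepoint is kept so that a
            // read at the safepoint still sees a consistent snapshot.
            // (write_type::DEL needs extra care, as covered below.)
            Decision::Keep
        } else {
            // Every older version is filtered out right here, and its data in
            // default-cf is cleaned up as well.
            Decision::Remove
        }
    }
}

fn main() {
    // The example used later in this section: a_90, a_85, a_82, a_80 with
    // gc_safepoint = 89. Only a_90 and a_85 survive.
    let mut filter = GcFilter { safepoint: 89, versions_seen_at_or_below_safepoint: 0 };
    for ts in [90, 85, 82, 80] {
        println!("a_{} -> {:?}", ts, filter.filter(ts));
    }
}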

Implementation Principle

Let's take TiKV's GC as an example to understand how data is reclaimed during the compaction process when the compaction filter is enabled.

TiKV's Compaction Filter Only Affects Write-CF

First, TiKV's compaction filter only acts on write-cf. Why? Because write-cf stores the MVCC records, while default-cf stores the actual row data. As for lock-cf, as we've discussed on the TiDB side, once the GC safepoint has been uploaded to PD, there are no locks left in lock-cf before that safepoint.

Directly Filtering Unnecessary MVCC Keys in Compaction
Next, let’s look at how an MVCC key a behaves during a compaction process:


  1. A TiKV MVCC key a in write-cf carries a commit_ts suffix. Suppose we are compacting two SST files, one in Level 2 and one in Level 3; after compaction, the output will be stored in Level 3.
  2. The SST file in Level 2 contains a_90.
  3. The SST file in Level 3 contains a_85, a_82, and a_80.
  4. The current GC safepoint is 89. Based on the GC key processing rules from Chapter 1, we need to retain a_85, the newest version at or below the safepoint, which means all versions older than a_85 can be deleted.
  5. After compaction, the new SST file contains only a_90 and a_85. The other versions, along with their corresponding data in default-cf, are deleted during the compaction.
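
As an aside on step 1 above: the commit_ts suffix is encoded so that, under plain bytewise ordering, all versions of a user key sit next to each other with newer versions first. The Rust sketch below shows one such encoding; TiKV uses a memcomparable descending encoding along these lines, but the exact byte layout here is only illustrative.

/// Append a commit_ts suffix to a user key so that, under plain bytewise
/// ordering, all versions of the same key are adjacent and newer versions
/// (larger commit_ts) sort first.
fn encode_write_cf_key(user_key: &[u8], commit_ts: u64) -> Vec<u8> {
    let mut key = user_key.to_vec();
    // Append the bitwise complement in big-endian: a larger ts gives a
    // smaller suffix, so newer versions sort earlier.
    key.extend_from_slice(&(!commit_ts).to_be_bytes());
    key
}

fn main() {
    let mut keys = vec![
        encode_write_cf_key(b"a", 80),
        encode_write_cf_key(b"a", 90),
        encode_write_cf_key(b"a", 85),
        encode_write_cf_key(b"b", 82),
    ];
    keys.sort();
    // Sorted order: a_90, a_85, a_80, then b_82: exactly the order in which
    // a compaction (and hence the compaction filter) sees them.
    for key in &keys {
        println!("{:?}", key);
    }
}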

In summary, compared to traditional GC, using a compaction filter for GC has the following advantages:

  • It eliminates the need to read from RocksDB.
  • The deletion (write) process for RocksDB is simplified.

Although compaction introduces some pressure, it completely removes the impact of GC on RocksDB’s read/write operations, resulting in a significant performance optimization.

Compaction on Non-L6 SST Files with Write_type::DEL


When compaction occurs, if we encounter the latest version of data as write_type::DEL, should we delete it directly? The answer is no. Unlike the gc_keys interface, which scans all versions of the key, compaction only sees the versions contained in the SST files involved in the current compaction. Therefore, if we deleted a write_type::DEL record at the current level, older versions of the key might still exist at lower levels. For example, if a_85 => write_type::DEL were deleted during this compaction while an older version a_78 still existed at a lower level, a user reading the snapshot at gc_safepoint=89 would no longer see a_85; the latest matching version would become a_78, which breaks the correctness of the snapshot at gc_safepoint=89.

Handling Write_type::DEL in Compaction Filter
As we know from the gc_keys section, write_type::DEL is a special case. When the compaction filter is enabled, does this type of key need special handling? Yes, it does. First, we need to consider when we can delete write_type::DEL data.


When compacting the lowest-level SST files, if we find that the current key meets the following conditions, we can safely reclaim this version using gc_keys(a, 89):

  • The current key is the latest version before the gc_safepoint 89.
  • The current key is in Level 6, and it is the only remaining version. This ensures there are no earlier versions (and ensures no additional writes are generated during gc_keys).

After this compaction:

  1. The new SST file will still include a write_type::DEL version.
  2. gc_keys will write a (DELETE, a_85) operation to RocksDB. This is the only way to generate a tombstone for write-cf with the compaction filter enabled.
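
A sketch of this special case under the conditions listed above; schedule_gc_keys is a placeholder for whatever hands the key back to the GC worker, not an actual TiKV function.

#[derive(PartialEq)]
enum WriteType {
    Put,
    Delete,
}

/// For the newest version of a key at or below the GC safepoint, decide
/// whether a write_type::DEL record can be handed to gc_keys for cleanup.
/// `is_bottommost_level` means this compaction outputs to the lowest level;
/// `is_only_version_seen` means no other version of this key appeared in
/// the compaction input.
fn maybe_schedule_mvcc_deletion(
    write_type: WriteType,
    is_bottommost_level: bool,
    is_only_version_seen: bool,
    schedule_gc_keys: impl Fn(),
) {
    if write_type == WriteType::Delete && is_bottommost_level && is_only_version_seen {
        // Safe to reclaim: no older version can be hiding in a lower level,
        // so gc_keys can delete this record (the only write-cf tombstone
        // generated while the compaction filter is enabled).
        schedule_gc_keys();
    }
    // Otherwise the DEL record stays in the compaction output: a lower level
    // might still hold older versions that depend on it.
}

fn main() {
    // a_85 => DELETE in a bottommost-level compaction with no other versions
    // of `a` in the input: it is handed to gc_keys.
    maybe_schedule_mvcc_deletion(WriteType::Delete, true, true, || {
        println!("scheduled gc_keys(a, 89)")
    });
    // The same record in a non-bottommost compaction is left untouched.
    maybe_schedule_mvcc_deletion(WriteType::Delete, false, true, || {
        println!("this does not run")
    });
    // PUT records are never handled through this path.
    maybe_schedule_mvcc_deletion(WriteType::Put, true, true, || {
        println!("this does not run either")
    });
}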

Related Configuration

As we’ve discussed, when the compaction filter is enabled, most physical data reclamation is completed during the RocksDB write CF compaction. For each key:

  • For tso > gc safepoint, the key is retained and skipped.
  • For tso <= gc safepoint: The latest version is retained, and older versions are filtered.


Now, the next question is: since the GC process (physical space reclamation) relies heavily on RocksDB's compaction, how can we trigger RocksDB's compaction?

In addition to automatic triggers, TiKV also runs a thread that periodically checks the status of each region and decides whether to trigger compaction based on the presence of old versions in the region. Currently, we offer the following parameters to control the speed of region checks and whether a region should initiate compaction:

  • region-compact-check-interval: Generally, this does not need adjustment.
  • region-compact-check-step: Generally, this does not need adjustment.
  • region-compact-min-tombstones: The number of tombstones required to trigger RocksDB compaction. Default: 10,000.
  • region-compact-tombstones-percent: The percentage of tombstones required to trigger RocksDB compaction. Default: 30%.
  • region-compact-min-redundant-rows (introduced in v7.1.0): The number of redundant MVCC data rows needed to trigger RocksDB compaction. Default: 50,000.
  • region-compact-redundant-rows-percent (introduced in v7.1.0): The percentage of redundant MVCC data rows required to trigger RocksDB compaction.
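
As a rough sketch of how such thresholds are typically combined (the exact logic lives in TiKV's region compaction checker, so treat this as illustrative): a region is sent to compaction when either the tombstone counters or, from v7.1.0, the redundant-row counters exceed both their absolute and percentage thresholds.

/// The configuration knobs described above (defaults as listed, where stated).
struct RegionCompactConfig {
    region_compact_min_tombstones: u64,         // default 10,000
    region_compact_tombstones_percent: u64,     // default 30 (%)
    region_compact_min_redundant_rows: u64,     // default 50,000 (v7.1.0+)
    region_compact_redundant_rows_percent: u64, // v7.1.0+
}

/// Rough sketch of the periodic check for one region.
fn should_compact(
    cfg: &RegionCompactConfig,
    num_entries: u64,
    num_tombstones: u64,
    num_redundant_rows: u64,
) -> bool {
    if num_entries == 0 {
        return false;
    }
    // Tombstone-based trigger: both the absolute count and the percentage
    // must exceed their thresholds.
    let tombstone_trigger = num_tombstones >= cfg.region_compact_min_tombstones
        && num_tombstones * 100 >= num_entries * cfg.region_compact_tombstones_percent;
    // Redundant-MVCC-row trigger, available since v7.1.0.
    let redundant_trigger = num_redundant_rows >= cfg.region_compact_min_redundant_rows
        && num_redundant_rows * 100 >= num_entries * cfg.region_compact_redundant_rows_percent;
    tombstone_trigger || redundant_trigger
}

fn main() {
    let cfg = RegionCompactConfig {
        region_compact_min_tombstones: 10_000,
        region_compact_tombstones_percent: 30,
        region_compact_min_redundant_rows: 50_000,
        region_compact_redundant_rows_percent: 20, // example value
    };
    // 200k entries with 80k tombstones (40%): compaction is triggered.
    println!("{}", should_compact(&cfg, 200_000, 80_000, 0));
    // 200k entries with 8k tombstones and 10k redundant rows: not triggered.
    println!("{}", should_compact(&cfg, 200_000, 8_000, 10_000));
}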

Notably, for versions prior to v7.1.0, which lack redundant MVCC version detection, we must manually compact regions to trigger compaction for such data.

Related Monitoring

tikv-details -> GC -> GC in Compaction-filter

Key field definitions: during the compaction filter process, the metrics below are recorded according to which of the following conditions a key-value meets. In the example referenced below, key a has versions v1, v5, and v10, and key b has versions v5 and v12, all committed before the GC safepoint.

If it is before the GC safepoint and not the latest version (a-v1, a-v5, b-v5):

  • filtered: The number of old versions directly filtered (physically deleted, no new writes) by the compaction filter, representing the effective reclamation of old version data. If this metric is empty, it means there are no old versions to reclaim.
  • Orphan-version: After the old version is deleted from write-cf, data in default-cf needs to be cleaned up. If cleaning fails, it will be handled via GcTask::OrphanVersions. If there is data here, it indicates that there were too many versions to delete, causing RocksDB to be overloaded.

Latest version before the GC safepoint (the oldest version to retain, a-v10, b-v12):

  • rollback/lock: Write types of Rollback/Lock, which will create a tombstone in RocksDB.
  • mvcc_deletion_met: Write type = DELETE, and it’s the lowest-level SST file.
  • mvcc_deletion_handled: Data with write type = DELETE reclaimed through gc_keys().
  • mvcc_deletion_wasted: Data reclaimed through gc_keys() that was already deleted.
  • mvcc_deletion_wasted + mvcc_deletion_handled: The number of keys with type = DELETE that, at the bottommost level, only have one version.

Common Issues

Physical Space Not Released for a Long Time After Deleting Data with the Compaction Filter
Although the compaction filter GC can directly clean up old data during the compaction stage and alleviate GC pressure, we know from the above principles that relying on RocksDB’s compaction for data reclamation means that if RocksDB’s compaction is not triggered, physical space will not be released even after the GC safepoint has passed.

Therefore, triggering RocksDB compaction becomes crucial in such cases.

Workaround 1: Adjusting Compaction Frequency via Parameters
For large DELETE operations, since no new writes arrive after the data is deleted, RocksDB's automatic compaction is unlikely to be triggered on its own.

In versions 7.1.0 and later, we can adjust the parameters related to redundant MVCC versions to trigger RocksDB compaction. The key parameters are:

  • region-compact-min-redundant-rows (introduced in v7.1.0): The number of redundant MVCC data rows required to trigger RocksDB compaction. Default: 50,000.
  • region-compact-redundant-rows-percent (introduced in v7.1.0): The percentage of redundant MVCC data rows required to trigger RocksDB compaction.

In versions before 7.1.0, this feature was not available, so these versions require manual compaction. However, adjusting the above two parameters in such cases has minimal effect.

Currently, there are no parameters that trigger compaction by counting deleted MVCC versions, so in most cases this data cannot be reclaimed automatically. We can track this issue: GitHub issue #17269.

Workaround 2: Manual Compaction
If we've performed a large DELETE operation on a table and the deleted data has passed the GC lifetime, we can quickly reclaim physical space by manually compacting the affected regions:

Method 1: Perform Full Table Compaction During Off-Peak Hours

1. Query the minimum and maximum keys of the table (these keys are reported by TiKV and are already in memcomparable format):

select min(START_KEY) as START_KEY, max(END_KEY) as END_KEY from information_schema.tikv_region_status where db_name='' and table_name=''

2. Convert the start and end keys to escaped format using tikv-ctl (its --to-escaped option converts a hex key into the escaped form used in the commands below).


3. Use tikv-ctl to perform the compaction, adding a z prefix to the converted string, and compact both write-cf and default-cf (run this on every TiKV instance):

----compact write cf----
   tiup ctl:v7.5.0 tikv --host "127.0.0.1:20160" compact --bottommost force -c write --from "zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372" --to "t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370"
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 compact --bottommost force -c write --from zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372 --to t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370
store:"127.0.0.1:20160" compact db:Kv cf:write range:[[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 183, 95, 114, 128, 0, 0, 0, 0, 255, 69, 67, 29, 0, 0, 0, 0, 0, 250], [116, 128, 0, 0, 0, 0, 0, 0, 255, 186, 0, 0, 0, 0, 0, 0, 0, 248]) success!
  ---If the above doesn’t work, try compact default cf---
  tiup ctl:v7.1.1 tikv --host IP:port compact --bottommost force -c default --from 'zr\000\000\001\000\000\000\000\373' --to 'zt\200\000\000\000\000\000\000\377[\000\000\000\000\000\000\000\370'

Note: As we learned earlier about cleaning up write_type::DEL records, when DELETE is the latest version of a key, it can only be reclaimed once it is the sole remaining version and has been compacted to the lowest level of RocksDB. Therefore, we generally need to run the compaction command at least twice to fully reclaim the physical space.

Note: RocksDB compaction requires temporary space. If the TiKV instance doesn’t have sufficient temporary space, it’s recommended to use Method 2 to split the compaction pressure.

Method 2: For large tables, to reduce performance impact on the cluster, compact by region instead of the entire table:
1. Query all regions of the table:

select * from information_schema.tikv_region_status where db_name='' and table_name=''

2. For all regions in the table, run the following commands on their respective TiKV replicas:

  • Use tikv-ctl to query the MVCC properties of the current region. If mvcc.num_deletes and writecf.num_deletes are small, the region has already been processed, and you can skip to the next region.
tikv-ctl --host tikv-host:20160 region-properties -r {region-id}
--example--
 tiup ctl:v7.5.0 tikv --host "127.0.0.1:20160" region-properties -r 20026
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 region-properties -r 20026
mvcc.min_ts: 440762314407804933
mvcc.max_ts: 447448067356491781
mvcc.num_rows: 2387047
mvcc.num_puts: 2454144
mvcc.num_deletes: 9688
mvcc.num_versions: 2464879
mvcc.max_row_versions: 952
writecf.num_entries: 2464879
writecf.num_deletes: 0
writecf.num_files: 3
writecf.sst_files: 053145.sst, 061055.sst, 057591.sst
defaultcf.num_entries: 154154
defaultcf.num_files: 1
defaultcf.sst_files: 058164.sst
region.start_key: 7480000000000000ff545f720380000000ff0000000403800000ff0000000004038000ff0000000006a80000fd
region.end_key: 7480000000000000ff545f720380000000ff0000000703800000ff0000000002038000ff0000000002300000fd
region.middle_key_by_approximate_size: 7480000000000000ff545f720380000000ff0000000503800000ff0000000009038000ff0000000005220000fdf9ca5f5c3067ffc1
  • Use tikv-ctl to manually compact the current region. After completion, continue looping to check whether the region’s properties have changed.
tiup ctl:v7.5.0 tikv --pd IP:port compact --bottommost force -c write --region {region-id}
  tiup ctl:v7.5.0 tikv --pd IP:port compact --bottommost force -c default --region {region-id}
 --example--
 tiup ctl:v7.5.0 tikv --host  "127.0.0.1:20160" compact --bottommost force -c write -r 20026
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 compact --bottommost force -c write -r 20026
store:"127.0.0.1:20160" compact_region db:Kv cf:write range:[[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 84, 95, 114, 3, 128, 0, 0, 0, 255, 0, 0, 0, 4, 3, 128, 0, 0, 255, 0, 0, 0, 0, 4, 3, 128, 0, 255, 0, 0, 0, 0, 6, 168, 0, 0, 253], [122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 84, 95, 114, 3, 128, 0, 0, 0, 255, 0, 0, 0, 7, 3, 128, 0, 0, 255, 0, 0, 0, 0, 2, 3, 128, 0, 255, 0, 0, 0, 0, 2, 48, 0, 0, 253]) success!

Method 3: Before v7.1.0, you can directly disable the compaction filter and use the traditional GC method.
This approach can significantly impact the system’s read and write performance during GC, so use it with caution.
