If you worked with a Sitecore XP website that has been live for a long time, I am sure you are familiar with the xDB cleanup requirements. I usually recommend to agree xDB data retention and set up the clean up process from day 1, but it is not always possible. Some websites were developed on an older version of Sitecore that did not have the built-in functionality for data removal, or the volume of xDB data was underestimated, or maybe tracking was enabled later down the line without considering data retention.
This is a common issue and the exact reason why Sitecore Analytics Database Manager (ADM) module exists. It allows viewing xDB records statistics and removing collection data (raw analytical data collected by Sitecore XP) via Sitecore UI while keeping the data integrity.
However, when I tried to use ADM for the first time on Sitecore 9 and Sitecore 10 websites, I faced a few challenges, particularly when dealing with large volumes of old data. In this article, we’ll explore these challenges and look at optimisation strategies for efficient data cleanup.
How Sitecore ADM works
One thing that is critical to understand is that any ADM cleanup process consists of two key phases: task generation and task processing.
Task generation
During this phase, the module retrieves all contacts from the xDB regardless of the date range specified by the user. It begins processing the most recent contacts first and systematically works backward through the database. The module reads additional data associated with each contact, such as interactions and facets, to determine whether a record meets the deletion criteria. If a contact qualifies for deletion, its ID is stored in the [Tasks]
table of the [ADM.Tasks]
database for later task processing.
Task processing
In the second phase, the module processes the tasks generated in the previous step, deleting all records marked for removal.
While this approach seems robust at first, it comes with a few significant limitations, particularly around the performance and scalability of the task generation phase.
Challenges
1. Inefficient SQL queries
One of the main challenges is the way how the ADM module works with xDB Shard databases. Instead of relying on SQL indexes to efficiently filter contacts by date, the module retrieves all contacts and iterates through them, starting with the most recent. This results in slower data retrieval times, especially when working with large datasets spanning several years.
For instance, even if you specify a short date range in 2022, the module will still need to process and check all contacts, including those from 2023-2024. This significantly increases the load on the shard databases, leading to higher data input/output operations and longer processing times.
2. Threading and timeout issues
By default, the ADM module operates with 4 threads and processes 1,000 contacts per batch. While these settings may be sufficient for smaller databases, you may want to tweak these settings if you have a larger database.
The module’s batching system divides work between threads at the start of the process, meaning that if there is an SQL timeout in one of the threads, the entire thread will be aborted and the cleanup of this batch will not be completed. Therefore, the same process will have to be restarted from scratch to pick up remaining DB records. This can be frustrating, particularly when processing large datasets that take several hours to be processed.
3. Single task limit
Another challenge is that only one cleanup process can run at a time. Attempting to create a new task while another is running will terminate the existing process, which makes it critical to monitor and manage the process carefully.
4. Limited pause functionality
The task generation phase cannot be paused, it is supported only for task processing. There was an idea to run the cleanup in batches outside of business hours for one of the websites with a lot of xDB data to remove, but it was impossible to do run task generation in batches because it cannot be paused and it had to be done in one go.
Optimisation tactics
Here are some tricks that helped me to reduce processing times and achieve the best performance for the cleanup in the past:
1. Read documentation
I know this sounds obvious, but if you plan to use ADM and have not looked at the documentation that comes with it, do it now! It is quite technical and can answer many questions you already have.
2. Upscale to Premium tier
If your server or database resources are under strain, especially Shards
and ProcessingPools
databases, you can try temporarily upscaling them for the cleanup. Premium Azure SKUs are optimised for high data I/O, which is critical when working with the ADM intensive data processing.
3. Adjust performance settings
There is some flexibility in tweaking the performance of the ADM module. You can increase multiple NumberOfThreads
, RetrieveDataBatchSize
and NumberOfConnectionRetries
settings in the ADM configuration files (see section "6. Performance Tuning" in the ADM module documentation). This helps speed up task generation and processing but should be balanced against server and database available resources.
4. Measure and plan accordingly
Although tempting, splitting the cleanup process into smaller date ranges may not necessarily improve performance. The ADM module will still read all contacts every time, so this strategy may not be the most efficient for some cases. Instead, allow the process to finish at least one time and monitor its completion, note the time taken to better estimate and plan additional runs.
Also, if you plan to perform the cleanup in multiple runs, start with the recent date ranges because they will be accessed first during the task generation phase.
5. Consider direct SQL queries
For older versions of Sitecore, or if you struggle to run ADM process on your xDB Shards even after tweaking performance settings, an alternative approach could be used - running direct SQL queries to clean up data (you can find a few useful scripts here). However, this method requires understanding of SQL scripts and xDB tables structure, so do this only if you have backed up your Shard DBs and you are confident in your skills.
Sitecore xDB cleanup is not always simple and the ADM module can be useful tool for managing and cleaning up contact and interaction data. I hope this article helped you to understand how the module works, its limitations and tactics for optimising its performance.
Top comments (0)