I often encounter databases containing tables full of old historical data. Such tables become heavy: they occupy a lot of disk space, their indexes grow huge, and even simple operations take a long time. Most importantly, large tables negatively impact the performance of the entire system. I suspect many of you have encountered a similar issue.
I’m often asked how to improve this situation and significantly enhance the performance of the database, as well as of the applications and services that interact with it.
I will use a PostgreSQL database as an example; however, the approach I propose can be applied to other types of databases as well.
Let’s assume there is a table holding a large amount of long-standing historical data. To depict the situation accurately, let’s also add several indexes to this large table.
create table main_table (
    id serial primary key,
    create_date timestamp not null,
    col1 varchar(100),
    col2 varchar(100),
    col3 bigint
);

create unique index main_table_idx1 on main_table(create_date);
create index main_table_idx2 on main_table(col1, col2, col3);
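Now let’s populate the table with three million rows of random data using an anonymous PL/pgSQL block: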
do $$
declare
    n_of_recs bigint := 3000000;
    random_varchar_length smallint;
    random_varchar varchar(100);
    random_timestamp timestamp;
    random_int bigint;
    query text;
begin
    for idx_rec in 1..n_of_recs loop
        -- a random varchar length between 1 and 100
        random_varchar_length := floor(random()*(100-1+1))+1;
        -- a random lowercase string of that length
        random_varchar := array_to_string(
            array(
                select chr((ascii('a') + round(random() * 25))::integer)
                from generate_series(1, random_varchar_length)
            ), '');
        -- a random integer between 55 and 777777777
        random_int := floor(random()*(777777777-55+1))+55;
        -- a random timestamp between '1900-01-01 00:00:00' and '2024-01-01 00:00:00'
        random_timestamp := timestamp '1900-01-01 00:00:00' + random() * (timestamp '2024-01-01 00:00:00' - timestamp '1900-01-01 00:00:00');
        query := 'insert into main_table (create_date, col1, col2, col3) values ($1, $2, $3, $4)';
        execute query using random_timestamp, random_varchar, random_varchar, random_int;
        if idx_rec % 100000 = 0 then
            raise notice 'Num of recs inserted into the table main_table: %', idx_rec;
        end if;
    end loop;
end$$;
NOTICE: Num of recs inserted into the table main_table: 100000
NOTICE: Num of recs inserted into the table main_table: 200000
NOTICE: Num of recs inserted into the table main_table: 300000
...
NOTICE: Num of recs inserted into the table main_table: 3000000
DO
my_database=# select count(1) from main_table;
count
---------
3000000
(1 row)
my_database=# select count(1) from main_table where create_date<'01-Jan-2020';
count
---------
2903602
(1 row)
my_database=#
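Before moving anything out, it is worth recording how much space the table and its indexes occupy, so the improvement can be measured afterwards. A minimal check using PostgreSQL’s built-in size functions, run against our example table, might look like this:

select pg_size_pretty(pg_total_relation_size('main_table')) as total_size,
       pg_size_pretty(pg_relation_size('main_table'))       as heap_size,
       pg_size_pretty(pg_indexes_size('main_table'))        as indexes_size;

Running the same query again after the cleanup makes the space savings easy to demonstrate.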
We have populated the table main_table with a sufficiently large number of records. Now let’s assume that records with a creation date earlier than January 1, 2020, can be moved to a separate table. This will leave the original table with significantly fewer records, thereby improving system performance and making operations on main_table more efficient.
Let’s create the table main_table_historical_data and move the historical data from the original main_table into it. Since main_table_historical_data must have the same structure as main_table, we can create and populate it in a single step:
create table main_table_historical_data
as select * from main_table where create_date < '01-Jan-2020';
my_database=# \timing
Timing is on.
my_database=#
my_database=# create table main_table_historical_data
my_database-# as select * from main_table where create_date < '01-Jan-2020';
SELECT 2903602
Time: 2704.814 ms (00:02.705)
my_database=#
my_database=# \d main_table_historical_data
Table "public.main_table_historical_data"
Column | Type | Collation | Nullable | Default
-------------+-----------------------------+-----------+----------+---------
id | integer | | |
create_date | timestamp without time zone | | |
col1 | character varying(100) | | |
col2 | character varying(100) | | |
col3 | bigint | | |
my_database=# \d main_table
Table "public.main_table"
Column | Type | Collation | Nullable | Default
-------------+-----------------------------+-----------+----------+----------------------------------------
id | integer | | not null | nextval('main_table_id_seq'::regclass)
create_date | timestamp without time zone | | not null |
col1 | character varying(100) | | |
col2 | character varying(100) | | |
col3 | bigint | | |
Indexes:
"main_table_pkey" PRIMARY KEY, btree (id)
"main_table_idx1" UNIQUE, btree (create_date)
"main_table_idx2" btree (col1, col2, col3)
my_database=#
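Note in the \d output above that create table ... as select copies only the column names and types: the not null constraints, the id sequence default, and the indexes did not carry over to main_table_historical_data. If the historical table will be queried, you will probably want at least an index on create_date; a minimal example (the index name is my own choice) would be:

create index main_table_historical_data_idx1
    on main_table_historical_data (create_date);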
Let’s verify that all historical records were copied to the table main_table_historical_data.
my_database=# select count(1) from main_table_historical_data;
count
---------
2903602
(1 row)
my_database=# select count(1) from main_table where create_date < '01-Jan-2020';
count
---------
2903602
(1 row)
my_database=#
Now we can remove the old historical records from the original table main_table:
delete from main_table where create_date < '01-Jan-2020';
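One caveat: if new rows can arrive in main_table between the copy and the delete, the two snapshots may drift apart. In that case the copy and the delete can be combined into a single atomic statement using a data-modifying CTE. Here is a sketch, assuming main_table_historical_data has already been created empty (for example with create table main_table_historical_data (like main_table)):

with moved as (
    -- remove the historical rows and hand them to the insert below
    delete from main_table
    where create_date < '01-Jan-2020'
    returning *
)
insert into main_table_historical_data
select * from moved;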
When a significant number of records is deleted from a table, the table and its indexes become bloated: the deleted rows leave dead tuples behind, and a plain VACUUM makes that space reusable but does not shrink the files on disk. Therefore, I recommend rebuilding the table and its indexes, for example by copying the remaining rows aside, truncating the table, and reinserting them:
my_database=# create table main_table_bcv as select * from main_table;
SELECT 96398
Time: 201.733 ms
my_database=# truncate table main_table;
TRUNCATE TABLE
Time: 1178.911 ms (00:01.179)
my_database=# insert into main_table select * from main_table_bcv;
INSERT 0 96398
Time: 971.138 ms
my_database=#
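Once the reinserted data has been verified, the backup copy main_table_bcv can be dropped. Alternatively, PostgreSQL can rewrite the table in place; a sketch of that route looks like this (vacuum full takes an ACCESS EXCLUSIVE lock, so schedule it in a maintenance window):

-- rewrite the table into new, compact files; this also rebuilds its indexes
vacuum full main_table;
-- refresh planner statistics after the table size has changed dramatically
analyze main_table;

If only the indexes are bloated, reindex table main_table; rebuilds them without rewriting the heap.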
I hope you liked this post, in which I explained how to separate historical data into its own table. You can apply this method to improve the performance and efficiency of a database, the applications built on top of it, and the system as a whole.