We all hear about technical debt at some point in IT, and fixing it is often not easy, mainly because it involves production environments that need to be live 24/7, or simply because there aren't enough (human) resources.
The challenge I faced was a 1.3TB Production database running PostgreSQL 9.2 on Ubuntu 16.04, on EC2 instances in AWS:
```
                                                      version
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.21 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
(1 row)
```
This brings not only performance issues, but also a lack of security patches across all these years.
PostgreSQL 9.2 was released on 2012-09-10, almost 10 years ago. Many performance improvements have landed since then, especially after PG 10, and we were not taking advantage of them.
In this article, I'll talk about the challenges of migrating a live database from PostgreSQL 9.2 to 14, the improvements we've seen, and where we want to go next. :)
Database Stack
Our AWS architecture is quite simple, deployed across two or more Availability Zones. Our database cluster had six instances in total, where the Master handled all the write traffic and the slaves took all the read traffic, using Route53 Weighted records.
- pgsql-aws-05 = Master DB running PG 9.2 (100% of the write traffic)
- pgsql-aws-10 = Slave (Streaming Replication) DB running PG 9.2 (0% of the read traffic; mainly used for EBS Snapshots)
- pgsql-aws-11 = Slave (Streaming Replication) DB running PG 9.2 (25% of the read traffic)
- pgsql-aws-12 = Slave (Streaming Replication) DB running PG 9.2 (25% of the read traffic)
- pgsql-aws-13 = Slave (Streaming Replication) DB running PG 9.2 (25% of the read traffic)
- pgsql-aws-14 = Slave (Streaming Replication) DB running PG 9.2 (25% of the read traffic)
These instances were deployed on r4 instance types and ran Ubuntu 16.04 as the OS.
The issue(s)
New Relic would constantly complain about high response times and high CPU utilization, especially on the Master database. We would have loved to benefit from the parallel query feature in PG 14, but we weren't there yet.
However, our biggest problem was the replication lag. It would go up to 2 minutes in some cases, which caused several different issues for our customers and a bad user experience.
Part of the reason may be that streaming replication was only introduced in PG 9.0, so we were running a very early version of it, without all the improvements made over the years.
I posted some questions about this on the PostgreSQL General mailing list, but was unable to find a solution other than upgrading the DB version.
- https://www.postgresql.org/message-id/Z14WgPUs_T-sranXV8i1KakN-eGRgP7fFcKUR8wHJDmjR-V8tHXkIIVWPxEEUd69GBeQCCKD0BeFn4f9AKItYa22mk_DzQP0yKuhRoyXZCc=@sud0.nz
- https://www.postgresql.org/message-id/4fd63109744f5ef6eee098fc92c4fa87%40sud0.nz
Setting up Bucardo
Setting up Bucardo was a challenge, as I had no previous experience with it. Its documentation is sometimes confusing, and there isn't much information out there either.
The idea was to launch a new EC2 instance with Bucardo on it and have it replicate from PG 9.2 to PG 14, since streaming replication can't cross major versions but Bucardo's trigger-based replication can.
Basically, after installing Bucardo itself, the steps I took to set it up were:
1. Create the Bucardo Role
`CREATE ROLE bucardo WITH LOGIN SUPERUSER PASSWORD 'password';`
2. Modify the `pg_hba.conf` file to allow connection from the Bucardo role.
`host replication bucardo 10.0.0.0/16 md5`
3. Export PG roles and schema (from 9.2):
```
pg_dump --schema-only --schema=public foo > /data-bucardo/foo_schemas.sql
pg_dumpall --roles-only --database=foo > /data-bucardo/foo_roles.sql
```
4. Import the roles and the schema into the new PG 14 DB:
```
psql foo -f /data-bucardo/foo_roles.sql
psql foo -f /data-bucardo/foo_schemas.sql
```
5. Create all the extensions needed in PG 14
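A minimal sketch of that step, assuming the host names used later for the Bucardo setup and that pg_trgm and hstore happen to be among the extensions in use (the real list comes from the old server):
```
# List the extensions installed on the old 9.2 primary
psql -h pgsql-aws-05 -d foo -c "SELECT extname, extversion FROM pg_extension;"

# Recreate each one on the new PG 14 instance
# (pg_trgm and hstore are just examples; use whatever the query above returns)
psql -h pgsql-aws-pg14 -d foo -c "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
psql -h pgsql-aws-pg14 -d foo -c "CREATE EXTENSION IF NOT EXISTS hstore;"
```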
6. Disable Triggers and FKs:
```
create table if not exists foo_triggers (
seq bigserial primary key,
sql text
);
-- disable triggers
do $$ declare t record;
begin
for t in select trigger_schema || '.' || event_object_table as table_name, trigger_name from information_schema.triggers
where trigger_catalog = 'foo'
and trigger_schema in ('public')
loop
insert into foo_triggers (sql) values (
format('alter table %s disable trigger %s',
t.table_name, t.trigger_name));
execute format('alter table %s disable trigger %s', t.table_name, t.trigger_name);
end loop;
end $$;
create table if not exists dropped_foreign_keys_public (
seq bigserial primary key,
sql text
);
-- public schema
do $$ declare t record;
begin
for t in select conrelid::regclass::varchar table_name, conname constraint_name,
pg_catalog.pg_get_constraintdef(r.oid, true) constraint_definition
from pg_catalog.pg_constraint r
where r.contype = 'f'
-- current schema only:
and r.connamespace = (select n.oid from pg_namespace n where n.nspname = 'public')
loop
insert into dropped_foreign_keys_public (sql) values (
format('alter table %s add constraint %s %s',
quote_ident(t.table_name), quote_ident(t.constraint_name), t.constraint_definition));
execute format('alter table %s drop constraint %s', quote_ident(t.table_name), quote_ident(t.constraint_name));
end loop;
end $$;
```
7. Add tables and sequences to Bucardo:
```
sudo bucardo add db source_db dbname=foo host=pgsql-aws-05 user=bucardouser pass='password'
sudo bucardo add db target_db dbname=foo host=pgsql-aws-pg14 user=bucardo pass='password'
sudo bucardo add table public.* db=source_db relgroup=foo_db_group
sudo bucardo add sequence public.* db=source_db relgroup=foo_db_group
sudo bucardo add dbgroup foo_db_group source_db:source target_db:target
sudo bucardo add sync foo_sync relgroup=foo_db_group dbs=foo_db_group autokick=0
sudo bucardo validate foo_sync
```
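Before moving on, it's worth confirming what Bucardo actually registered. A quick sanity check, using the names from the commands above:
```
sudo bucardo list dbs
sudo bucardo list tables
sudo bucardo list sequences
sudo bucardo status foo_sync
```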
8. Export the data from PG 9.2:
`pg_dump -Fc --data-only --disable-triggers --verbose foo > /data/foo_database_data.sql`
9. Import the data into PG 14:
`pg_restore -d foo /data/foo_database_data.sql --disable-triggers --exit-on-error -v --jobs=15`
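Before re-enabling anything, it's worth spot-checking a few of the biggest tables on both sides. The counts won't match exactly while the source is still taking writes, but they should be in the same ballpark (orders and users are hypothetical table names):
```
psql -h pgsql-aws-05 -d foo -c "SELECT count(*) FROM orders;"
psql -h pgsql-aws-pg14 -d foo -c "SELECT count(*) FROM orders;"
psql -h pgsql-aws-05 -d foo -c "SELECT count(*) FROM users;"
psql -h pgsql-aws-pg14 -d foo -c "SELECT count(*) FROM users;"
```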
10. Re-enable Triggers and FKs:
```
-- enable FKs
do $$ declare t record;
begin
-- order by seq for easier troubleshooting when data does not satisfy FKs
for t in select * from dropped_foreign_keys_public order by seq loop
execute t.sql;
end loop;
end $$;
-- enable triggers
do $$ declare t record;
begin
for t in select trigger_schema || '.' || event_object_table as table_name, trigger_name from information_schema.triggers
where trigger_catalog = 'foo'
and trigger_schema in ('public')
loop
execute format('alter table %s enable trigger %s', t.table_name, t.trigger_name);
end loop;
end $$;
```
11. Start Bucardo Replication:
`sudo bucardo update sync foo_sync autokick=1`
`sudo bucardo start`
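Once the sync is running, Bucardo's status output is the easiest way to confirm that rows are flowing and to see how the last runs went:
```
# Overall state of every sync
sudo bucardo status

# Per-sync details: state, last good run, rows copied
sudo bucardo status foo_sync
```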
From there, with the Bucardo replication working, we could create EBS Snapshots of the PostgreSQL 14 Master database instance at any time and create slaves from those snapshots. We have been using this approach for the past few years and it works really well for us!
Ultimately, before moving forward with the upgrade, we wanted to have the same number of slaves as the old stack, warm up the EBS volumes, and only after that would they be ready to receive read-only traffic.
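The warm-up matters because EBS volumes restored from snapshots fetch their blocks from S3 lazily, so the first read of each block is slow. Reading the whole device once initializes it; a sketch, assuming the data volume shows up as /dev/nvme1n1:
```
# Read every block once so it is pulled down from S3 before the slave takes traffic
sudo fio --filename=/dev/nvme1n1 --rw=read --bs=1M --iodepth=32 \
    --ioengine=libaio --direct=1 --name=volume-initialize
```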
Failover to the new DB version
We had around 4 (four) hours of scheduled downtime to perform the failover to the new database stack.
The idea was to set up the new DB stack in a brand-new AWS LZ (Landing Zone) account managed by Control Tower. The steps we took were:
1. Launch an Application Load Balancer
2. Redirect all customers to the new ALB
3. From the ALB, redirect all customers to a maintenance page using `aws_lb_listener_rule`
4. Whitelist some of our IP addresses so they can still reach the application (this allows a few members of our team to access the application for testing purposes):
```
resource "aws_lb_listener_rule" "https_whitelist_ip_addresses_to_app" {
count = terraform.workspace == "prod" ? 1 : 0
listener_arn = aws_alb_listener.old_alb_listener_https_prod[count.index].arn
action {
target_group_arn = "arn:aws:elasticloadbalancing:ap-southeast-2:2____________0:targetgroup/bau-prod-elb/6_____e"
type = "forward"
}
condition {
host_header {
values = ["*.foo.com", "foo.com"]
}
}
condition {
source_ip {
values = [
"100.100.100.101/32", # Staff 1
"100.100.100.102/32", # Staff 2
"100.100.100.103/32", # Staff 3
"100.100.100.104/32", # Staff 4
]
}
}}
```
5. Stop Bucardo
6. Modify Route53 records and point them to the new DB cluster (our application connects to the database via a specific endpoint); a rough sketch of this is shown right after this list
7. Testing, testing and more testing!
8. Modify the ALB and remove the redirect rules to allow customers to access the application
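For Step 6, repointing the database endpoint boils down to updating a Route53 record. Roughly something like this with the AWS CLI (the hosted zone ID and record names below are made up):
```
cat > repoint-db.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "db-master.foo.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "pgsql-aws-pg14.foo.com" }]
    }
  }]
}
EOF

aws route53 change-resource-record-sets \
    --hosted-zone-id Z0000000000000 \
    --change-batch file://repoint-db.json
```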
The tests mentioned in Step 7 were the following:
- pgdiff
- [k6.io](https://www.k6.io) for stress and load tests
- Manual tests using the API, Console and mobile apps
The new AWS Infrastructure
At the beginning of this article, I mentioned that the database stack was running on Ubuntu 16.04, which is not good. It had all been deployed using CloudFormation and Bamboo, which brings another problem: having to manage Bamboo on an EC2 instance is something we want to avoid, as we would have to apply security patches, take care of backups, rotate SSH keys, and more!
For that reason, when we deployed the new DB stack, we decided to go with Terraform and Bitbucket Pipelines (mainly because the company was already using Bitbucket for source control).
The new r5 instances were all deployed using Terraform and Bitbucket Pipelines, with encrypted EBS volumes and CIS-compliant infrastructure.
The gp3 EBS volumes were deployed with 10,000 IOPS and 600 MB/s of throughput, as those were the numbers we thought we needed to support the traffic.
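The volumes themselves are managed by Terraform, but the equivalent AWS CLI call gives an idea of the configuration (the size and Availability Zone here are placeholders):
```
aws ec2 create-volume \
    --availability-zone ap-southeast-2a \
    --volume-type gp3 \
    --size 2000 \
    --iops 10000 \
    --throughput 600 \
    --encrypted
```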
In the diagram below, you can see the current architecture, which is all managed by Terraform and Bitbucket Pipelines.
Did we solve anything at all with the upgrade?
Yes! The performance gain was amazing! Our response times dropped from ~300-350 ms to ~100-120 ms, and all this just by upgrading the DB version… no code changes and no changes to the architecture.
The new response time graph in New Relic looks (so) good, and that was taken during peak time, when we were processing more than 10,000 requests per minute.
The replication lag is gone, and it's now around 1 second or less:
System load (24-hour window) goes up to 10 during peak hours, but our r5.4xlarge master DB handles it well:
Conclusion
Overall, I'm very happy with the result. Having only 4 hours of scheduled downtime for a project this big is definitely a win, and we did not encounter any data loss or any other major problem in the days following the upgrade.
Our EC2 cluster is performing really well, even though there is still a lot to work on, like improving slow queries and redesigning some tables for better performance.
I'm also confident that we can now decrease the IOPS and/or throughput we originally deployed, and potentially even shift more traffic onto fewer slaves and retire one of them to save some money (at the moment each slave gets 25% of the read traffic).
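One nice property of gp3 is that IOPS and throughput can be changed on a live volume, so scaling back down later is low-risk. A hypothetical example (volume ID and target numbers made up):
```
# Lower the provisioned IOPS and throughput on a live gp3 volume
aws ec2 modify-volume \
    --volume-id vol-0123456789abcdef0 \
    --iops 6000 \
    --throughput 300

# Track the modification until it completes
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
```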