DEV Community

Cover image for How to Migrate from MySQL to PostgreSQL RDBMS: An Enterprise Approach
Dmitry Romanoff
Dmitry Romanoff

Posted on

How to Migrate from MySQL to PostgreSQL RDBMS: An Enterprise Approach

The article was initially published in JFrog blog.

1. Introduction to Migrating MySQL to PostgreSQL.

As part of JFrog’s SaaS strategy to support only the PostgreSQL database (DB), JFrog has faced a need to migrate existing customers’ databases in our SaaS from MySQL to PostgreSQL.

It was essential to develop a fully automated, reliable, and stable migration process with zero data loss and minimal downtime, capable of migrating DBs of tens/hundreds of GBs, and guarantee that the existing applications can work transparently with the migrated DBs.

We developed a MySQL2PG process based on the pgloader open-source utility [https://github.com/dimitri/pgloader] to address this demand.

2. What are the components/topology to migrate DBs from MySQL to PostgreSQL?

How to Migrate from MySQL to PostgreSQL RDBMS: An Enterprise Approach

The basic topology of the DB migration process includes the Source database (MySQL), the Destination database (PostgreSQL), and the Migration machine. The pgloader runs from the Migration machine, reading data from the Source database (MySQL) and writing it to the Destination database (PostgreSQL).

The Migration machine should have direct, reliable, and high throughput connectivity to the source and destination databases. The typical configuration of the Migration machine includes 16 CPUs and 64 GB RAM, network bandwidth of at least 10 Gbps, and storage of at least 500GB.

To guarantee compatibility of DB objects, their structure, and correspondence to specific data types, the DB migration process runs first DDL scripts suitable to a particular type/version of a product. It then copies data into the prepared in advance target DB schema.

The pgloader isn’t exporting data into a dump, placing the dump on the Migration machine, or occupying storage space by DBs data. It creates a kind of pipeline process in the Migration machine RAM and reads data selecting it from the Source DB and copying it into the Destination DB.

3. What would be a typical OS for the Migration machine?

OS: Ubuntu 20.04

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal

4. What is recommended to install on the Migration machine?

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install postgresql-13
sudo apt-get install -y pgloader
sudo apt-get install mysql-client
Enter fullscreen mode Exit fullscreen mode

5. How to run pgloader to migrate a DB from MySQL to PostgreSQL?

Create pgloader configuration file:

cat pgloader.conf

LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name;
Enter fullscreen mode Exit fullscreen mode

Run pgloader:

pgloader pgloader.conf
Enter fullscreen mode Exit fullscreen mode

6. How to run MySQL2PG pgloader as a background process?

To avoid possible interruption of the MySQL2PG migration process it’s recommended to run pgloader as a background process. The procedure includes the following steps.

Create pgloader configuration file:

cat pgloader.conf

LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name;
Enter fullscreen mode Exit fullscreen mode

Create a shell script to run pgloader as a background process:

cat run_pgloader_background.sh

nohup pgloader pgloader.conf 2>&1 &
Enter fullscreen mode Exit fullscreen mode

Run the script:

run_pgloader_background.sh
Enter fullscreen mode Exit fullscreen mode

The run trace will be placed on file:

nohup.out
Enter fullscreen mode Exit fullscreen mode

NOTES:

Usually

src_db_user = dest_db_user,
src_db_pwd = dest_db_pwd,
src_db_name = dest_db_name
Enter fullscreen mode Exit fullscreen mode

7. What does a typical output trace of pgloader look like?

2022-01-08T20:47:55.046000+02:00 LOG report summary reset
                              table name     errors       read   imported      bytes      total time       read      write
----------------------------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
                         fetch meta data          0        278        278                     0.808s    
                          Create Schemas          0          0          0                     0.018s    
                        Create SQL Types          0          0          0                     0.010s    
                           Create tables          0        144        144                     1.137s    
                          Set Table OIDs          0         72         72                     0.011s    
----------------------------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
                 my_db.my_table_number_1          0  319387354  319387354    38.0 GB      37m40.265s  37m40.242s  28m41.397s
                 my_db.my_table_number_2          0   15748659   15748659     1.9 GB       1m50.060s  1m50.039s  1m22.053s
                 my_db.my_table_number_3          0    3989089    3989089   336.5 MB         20.059s    20.009s    14.337s

        (etc)
----------------------------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
                 COPY Threads Completion          0          4          4                 45m10.569s    
                          Create Indexes          0        165        165                1h2m48.214s    
                  Index Build Completion          0        165        165                  6m12.345s    
                         Reset Sequences          0          1          1                     0.123s    
                            Primary Keys          0         69         69                     0.789s    
                     Create Foreign Keys          0         41         41                  1m23.456s    
                         Create Triggers          0          0          0                     0.005s    
                         Set Search Path          0          1          1                     0.012s    
                        Install Comments          0          0          0                     0.000s    
----------------------------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
                       Total import time          ✓  363230932  363230932   123.4 GB    1h23m45.678s

Enter fullscreen mode Exit fullscreen mode

8. How to ensure migration has finished successfully? How do I ensure zero data loss?

Verify trace of the pgloader: it should have zero errors, and the read and imported values should be matched.

9. How do I exclude tables from the pgloader process?

In case the source DB contains tables that can be safely excluded from the migration process, this can be done via the “EXCLUDING TABLE NAMES MATCHING” configuration of pgloader.

Example:

LOAD DATABASE
FROM mysql://src_db_user:src_db_pwd@src_db_host/src_db_name?sslmode=<...>
INTO postgresql://dest_db_user:dest_db_pwd@dest_db_host/dest_db_name

EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three';
Enter fullscreen mode Exit fullscreen mode

10. The “Heap exhausted, game over” issue.

The “Heap exhausted, game over” message from the pgloader may indicate that the pgloader default dedicated 4GB RAM is not enough for a given DB migration run.

The solution in such a case would be to use a pgloader compiled with a more extensive DYNSIZE parameter. The DYNSIZE parameter allows modifying the default amount of memory the pgloader image will allow itself to use when running through data.

11. How to compile the pgloader to use more RAM?

​​The pgloader utility is not using all available RAM on the host machine. The amount of RAM the pgloader can use is defined in the compilation stage. To dedicate more RAM a customized version of pgloader should be compiled. The following step-by-step guide illustrates how to compile pgloader with customized parameter DYNSIZE. The DYNSIZE parameter allows modification of the pgloader image when running through data.

pgloader compilation
--------------------

Build pgloader from sources

(1) Check OS:

dima@dima-VirtualBox:~$ lsb_release  -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.3 LTS
Release:        20.04
Codename:       focal
dima@dima-VirtualBox:~$

(2) Install git:

sudo apt update

sudo apt install git

(3) Clone pgloader:

mkdir my_pgloader
cd my_pgloader/
git clone https://github.com/dimitri/pgloader.git

(4) install packages necessary for build:

sudo apt-get install sbcl unzip libsqlite3-dev make curl gawk freetds-dev libzip-dev

(5) make build:

 make DYNSIZE=10240 pgloader

(6) make outputs a ./build/bin/pgloader file for us to use.

dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ pwd
/home/dima/my_pgloader/pgloader/build/bin
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ls -rtogla
total 69160
-rw-rw-r-- 1       70 Jan  7 17:17 .gitignore
drwxrwxr-x 5     4096 Jan  7 17:29 ..
-rwxr-xr-x 1 41918800 Jan  7 17:29 buildapp.sbcl
-rwxr-xr-x 1 29036008 Jan  7 17:29 pgloader
drwxrwxr-x 2     4096 Jan  7 17:29 .
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$

cp pgloader pgloader_dima_dynsize_10240

dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ls -rtogla
total 97396
-rw-rw-r-- 1       70 Jan  7 17:17 .gitignore
drwxrwxr-x 5     4096 Jan  7 17:29 ..
-rwxr-xr-x 1 41918800 Jan  7 17:29 buildapp.sbcl
-rwxr-xr-x 1 29036008 Jan  7 17:29 pgloader
-rwxr-xr-x 1 29036008 Jan  7 17:33 pgloader_dima_dynsize_10240
drwxrwxr-x 2     4096 Jan  7 17:33 .
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$

dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$ ./pgloader_dima_dynsize_10240 --version
pgloader version "3.6.a94a0a3"
compiled with SBCL 2.0.1.debian
dima@dima-VirtualBox:~/my_pgloader/pgloader/build/bin$
Enter fullscreen mode Exit fullscreen mode

12. How to manage the “Connection reset by peer” pgloader issue?

The “Connection reset by peer” error message during the pgloader run usually indicates timeouts of the pgloader connecting to MySQL [ the Source DB ].

To manage this issue, we recommend adding the following parameters to the pgloader configuration script:

SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'
Enter fullscreen mode Exit fullscreen mode

13. How to improve the duration of the pgloader migration process?

To improve the duration of the pgloader migration process there are a few approaches:

– Scale up source and destination databases
– Ensure no-load/heavy activity / enough storage, RAM, connections, and resources both on the source and on the destination databases
– Use pgloader compiled with customized parameter DYNSIZE
– Scale up the Migration machine to allow more RAM
– Run pgloader with the following parameters:

SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'
Enter fullscreen mode Exit fullscreen mode

14. How to configure pgloader to run stable and reliably on big databases / on big amounts of data?

– Use the pgloader compiled with customized DYNSIZE parameter. It will allow dedicating more amount of RAM to the pgloader process
– Use the following pgloader configuration:

cat pgloader.conf

FROM LOAD DATABASE
FROM mysql://db_user:db_pwd@src_db_host/db_name?sslmode=<...>
INTO postgresql://db_user:db_pwd@dest_db_host/db_name

WITH

data only, create no indexes,

workers = 8, concurrency = 1,
multiple readers per thread, rows per range = 10000,
batch rows = 10000

SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'

SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'

EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three'

ALTER SCHEMA 'db_schema_name' RENAME TO 'public';
Enter fullscreen mode Exit fullscreen mode

15. What will a minimal pgloader configuration look like?

FROM LOAD DATABASE
FROM mysql://db_user:db_pwd@src_db_host/db_name?sslmode=<...>
INTO postgresql://db_user:db_pwd@dest_db_host/db_name

with
batch size = 2048 kB,
batch rows = 2000,
prefetch rows = 2000

SET PostgreSQL PARAMETERS
maintenance_work_mem to '512MB',
work_mem to '48MB'

SET MySQL PARAMETERS
net_read_timeout = '5000',
net_write_timeout = '5000'

EXCLUDING TABLE NAMES MATCHING 'table_to_exclude_one','table_to_exclude_two','table_to_exclude_three'

ALTER SCHEMA 'db_schema_name' RENAME TO 'public';
Enter fullscreen mode Exit fullscreen mode

16. Why is it important to monitor the source, destination databases, and Migration machine before, during, and after the process?

To guarantee the stable, reliable, and optimal work of the pgloader it’s vital to monitor the source and the destination databases before, during, and after migration. The DB’s machines should be strong enough, and they need to have enough resources to pgloader intensive activity. Both source and destination DBs machines shouldn’t be loaded by heavy activity or intensive CPU operations or high Read or(and) Writes.

Note: any load or heavy activity on the Source or the Destination DB machine directly impacts the DB migration’s performance and success.

In addition, it’s important to monitor the CPU, IOPs, memory, and network of the Migration machine during the process.

17. Importance of the split to the Source DB, the Destination DB, and the Migration machines.

It’s important to keep the Source DB Instance, the Destination DB Instance, and the Migration machine separately. Combining these components will impact DB migration performance and stability.

18. Can it be queried the destination PostgreSQL DB during the migration process? Will we see the progress of DB migration by querying the copied tables “on the fly”?

Although we can see live sessions on the Destination PostgreSQL DB during the migration, the actual data is committed only at the very end of the migration.

Querying in the middle of pgloader run any populated tables is useless: their data placed on dirty, not committed yet blocks. Only once pgloader has finished can we see the actual records on target DB querying “select * from .”

Top comments (0)