How can we sync two databases, both copying all initial data, and then finding new and updated subsequent records and delta copying these records?
The Automated (and Free) Solution
Redactics will do this running in your own infrastructure, and is free for developer usage (and we are working on open sourcing our developer-focused offerings). It is a pure SQL solution making it extremely lightweight. Delta updates approach near realtime replication (and are lightweight enough to handle aggressive table operational concurrency), and are fast enough to run every few minutes or so for many use cases.
The Homemade Solution
If you want to do this yourself, here is an approach that works well:
- Ensure that each of your tables you wish to sync have numerical primary keys, as most SQL-based databases do (typically auto-incrementing).
- Then, considering that new primary keys will always be increasing in size, for identifying new rows you can just create SQL queries to search for records with larger primary key values than your last primary key.
- For updates, ensure that your master tables have an updated date field that is updated each time the record is updated. Then, you can use SQL to search for last updated dates newer than your most recent update. It is okay if you need to search for dates a little older than the present time because there generally is no harm in re-applying the same update to existing data.
- If you want to write new data to your target database, you'll need a strategy to prevent conflicting primary keys. An approach we like is creating a new column called something like
source_primary_key
, generating CSVs (without headers) using the above two approaches (one CSV for new rows, one CSV for updates), and then piping this CSV into awk to remove the primary key column, and move its value to your source_primary_key column. For example, if your primary key column is your first column and your source_primary_key your fifth column:cat /path/to/csv.csv | gawk -vOFS=, -F "," "{print $2,$3,$4,$5,$1}" | [your SQL client]
. We found this library necessary for handling data that contains quotation marks. - For row deletions we generally recommend soft deleting records, which would be treated as an update, since setting the value of a
disabled
field to true would also update this row's last updated date.
Please let me know if you have any luck with either approach!
Top comments (0)