DEV Community

Cover image for Developing a Script for Incremental Database Backup
Artem “art-grig” Grigoryants
Artem “art-grig” Grigoryants

Posted on

2 1

Developing a Script for Incremental Database Backup

When it comes to managing databases, data integrity is crucial. However, in certain situations, such as production database corruption, it's imperative to have a robust backup strategy. This article describes a PostgreSQL script that was developed out of necessity to generate SQL statements for an incremental backup from a specific point in time. This allows applying changes to a previous backup to restore a database to its most recent, accurate state.

Overview of the Script

The script is implemented as a PostgreSQL function named generate_sql_statements. Its primary purpose is to scan specified database tables for changes (new inserts or updates) that occurred after a given timestamp (reference_timestamp). It generates corresponding SQL statements to reflect these changes, allowing the user to recreate the database's state incrementally.

CREATE OR REPLACE FUNCTION generate_sql_statements(table_names TEXT[], reference_timestamp TIMESTAMPTZ)
RETURNS TABLE (sql_statement TEXT) AS $$
DECLARE
    input_table_name TEXT;
    column_record RECORD;
    action_type TEXT;
    row_json JSONB;
    formatted_value TEXT;
    formatted_values TEXT;
    columns TEXT;
    update_set_clause TEXT;
BEGIN
    -- Loop through each table name
    FOREACH input_table_name IN ARRAY table_names
    LOOP
        -- Step 1: Get column information for the current table
        FOR column_record IN
            SELECT column_name, data_type, ordinal_position
            FROM information_schema.columns
            WHERE table_schema = 'public' AND table_name = input_table_name
            ORDER BY ordinal_position
        LOOP
            -- Step 2: Identify rows for INSERT or UPDATE and convert rows to JSON
            FOR row_json, action_type IN
                EXECUTE format('
                    SELECT to_jsonb(cf), CASE
                        WHEN cf."CreatedOn" > $1 THEN ''INSERT''
                        WHEN cf."UpdatedOn" > $1 AND cf."CreatedOn" <= $1 THEN ''UPDATE''
                        ELSE NULL
                    END
                    FROM %I cf
                    WHERE cf."CreatedOn" > $1 OR cf."UpdatedOn" > $1', input_table_name)
                USING reference_timestamp
            LOOP
                -- Step 3: Generate formatted column values for SQL statements
                formatted_values := '';
                columns := '';
                update_set_clause := '';

                FOR column_record IN
                    SELECT column_name, data_type, ordinal_position
                    FROM information_schema.columns
                    WHERE table_schema = 'public' AND table_name = input_table_name
                    ORDER BY ordinal_position
                LOOP
                    formatted_value := CASE
                        WHEN (row_json -> column_record.column_name) = 'null' THEN 'NULL'
                        WHEN column_record.data_type IN ('character varying', 'text', 'date', 'timestamp without time zone', 'timestamp with time zone') THEN
                            '''''' || REPLACE(row_json ->> column_record.column_name, '''', '''''') || ''''''
                        WHEN column_record.data_type = 'jsonb' THEN
                            '''''' || REPLACE((row_json -> column_record.column_name)::TEXT, '''', '''''') || '''''::jsonb'
                        ELSE
                            row_json ->> column_record.column_name
                    END;

                    -- Aggregate column names and values for the INSERT statements
                    IF columns = '' THEN
                        columns := quote_ident(column_record.column_name);
                        formatted_values := formatted_value;
                    ELSE
                        columns := columns || ', ' || quote_ident(column_record.column_name);
                        formatted_values := formatted_values || ', ' || formatted_value;
                    END IF;

                    -- Build the update_set_clause, excluding the "Id" column
                    IF column_record.column_name <> 'Id' THEN
                        IF update_set_clause = '' THEN
                            update_set_clause := quote_ident(column_record.column_name) || ' = ' || formatted_value;
                        ELSE
                            update_set_clause := update_set_clause || ', ' || quote_ident(column_record.column_name) || ' = ' || formatted_value;
                        END IF;
                    END IF;
                END LOOP;
                -- Step 4: Construct and return the SQL statement based on the action type
                IF action_type = 'INSERT' THEN
                    RETURN QUERY EXECUTE format('
                        SELECT ''INSERT INTO %I (%s) VALUES (%s);''',
                        input_table_name,
                        columns,
                        formatted_values
                    );
                ELSIF action_type = 'UPDATE' THEN
                    RETURN QUERY EXECUTE format('
                        SELECT ''UPDATE %I SET %s WHERE "Id" = %s;''',
                        input_table_name,
                        update_set_clause,
                        row_json ->> 'Id'
                    );
                END IF;
            END LOOP;
        END LOOP;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
Enter fullscreen mode Exit fullscreen mode

How It Works

The script performs the following steps:

  1. Iterate Over the Tables: The script accepts an array of table names (table_names) as input. It loops through each table to process rows that have changed (either inserted or updated) since the specified timestamp (reference_timestamp).

  2. Identify Changes (Inserts and Updates): For each table, it queries the rows where either the "CreatedOn" or "UpdatedOn" timestamps are more recent than the reference_timestamp. Each identified row is converted to JSON format to facilitate the creation of SQL statements.

  3. Generate SQL Statements:

    • Inserts: For rows where the "CreatedOn" timestamp is newer than the reference, an INSERT statement is generated.
    • Updates: For rows where the "UpdatedOn" timestamp is newer but the "CreatedOn" timestamp is older than the reference, an UPDATE statement is generated.
  4. Build and Format SQL Statements:

    • For each column in the table, the script formats the column values to be compatible with SQL syntax. It handles various data types, including strings, dates, and JSONB.
    • For INSERT statements, the script aggregates column names and their corresponding values.
    • For UPDATE statements, it builds a SET clause to update all columns (except the "Id" column) of the specified row.
  5. Return Generated SQL Statements: Depending on whether an action is identified as an INSERT or UPDATE, the corresponding SQL statement is returned to be executed on the backup database.

Detailed Script Breakdown

Here's a step-by-step breakdown of the script:

  1. Step 1 - Column Information: For each table, it retrieves column metadata (name, data type, and position) from the information_schema.columns. This information is necessary to format the column values correctly in the SQL statements.
FOR column_record IN
    SELECT column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_schema = 'public' AND table_name = input_table_name
    ORDER BY ordinal_position
LOOP
Enter fullscreen mode Exit fullscreen mode
  1. Step 2 - Identify Rows for Changes: Using a dynamic SQL statement, it identifies rows for INSERT or UPDATE. Rows are selected if:
    • The "CreatedOn" timestamp is newer than the reference timestamp (INSERT).
    • The "UpdatedOn" timestamp is newer, but the "CreatedOn" timestamp is older than the reference (UPDATE).
FOR row_json, action_type IN
    EXECUTE format('
        SELECT to_jsonb(cf), CASE
            WHEN cf."CreatedOn" > $1 THEN ''INSERT''
            WHEN cf."UpdatedOn" > $1 AND cf."CreatedOn" <= $1 THEN ''UPDATE''
            ELSE NULL
        END
        FROM %I cf
        WHERE cf."CreatedOn" > $1 OR cf."UpdatedOn" > $1', input_table_name)
    USING reference_timestamp
LOOP
Enter fullscreen mode Exit fullscreen mode
  1. Step 3 - Generate Column Values: For each identified row, the script formats the column values to suit their data type. For example:
    • Strings are enclosed in single quotes, and any existing single quotes are escaped.
    • JSONB data types are cast to JSONB in the output.

It then aggregates these values into a comma-separated list to be used in the INSERT statements. Similarly, it prepares the SET clause for UPDATE statements, excluding the "Id" column.

formatted_values := '';
columns := '';
update_set_clause := '';

FOR column_record IN
    SELECT column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_schema = 'public' AND table_name = input_table_name
    ORDER BY ordinal_position
LOOP
    formatted_value := CASE
        WHEN (row_json -> column_record.column_name) = 'null' THEN 'NULL'
        WHEN column_record.data_type IN ('character varying', 'text', 'date', 'timestamp without time zone', 'timestamp with time zone') THEN
            '''''' || REPLACE(row_json ->> column_record.column_name, '''', '''''') || ''''''
        WHEN column_record.data_type = 'jsonb' THEN
            '''''' || REPLACE((row_json -> column_record.column_name)::TEXT, '''', '''''') || '''''::jsonb'
        ELSE
            row_json ->> column_record.column_name
    END;

    -- Aggregate column names and values for the INSERT statements
    IF columns = '' THEN
        columns := quote_ident(column_record.column_name);
        formatted_values := formatted_value;
    ELSE
        columns := columns || ', ' || quote_ident(column_record.column_name);
        formatted_values := formatted_values || ', ' || formatted_value;
    END IF;

    -- Build the update_set_clause, excluding the "Id" column
    IF column_record.column_name <> 'Id' THEN
        IF update_set_clause = '' THEN
            update_set_clause := quote_ident(column_record.column_name) || ' = ' || formatted_value;
        ELSE
            update_set_clause := update_set_clause || ', ' || quote_ident(column_record.column_name) || ' = ' || formatted_value;
        END IF;
    END IF;
END LOOP;
Enter fullscreen mode Exit fullscreen mode
  1. Step 4 - Construct SQL Statements:
    • For an INSERT, it generates an SQL statement to insert the row into the table with the formatted column values.
    • For an UPDATE, it generates an SQL statement to update the row, identified by its "Id" column.
IF action_type = 'INSERT' THEN
    RETURN QUERY EXECUTE format('
        SELECT ''INSERT INTO %I (%s) VALUES (%s);''',
        input_table_name,
        columns,
        formatted_values
    );
ELSIF action_type = 'UPDATE' THEN
    RETURN QUERY EXECUTE format('
        SELECT ''UPDATE %I SET %s WHERE "Id" = %s;''',
        input_table_name,
        update_set_clause,
        row_json ->> 'Id'
    );
END IF;
Enter fullscreen mode Exit fullscreen mode

Image of Datadog

The Future of AI, LLMs, and Observability on Google Cloud

Datadog sat down with Google’s Director of AI to discuss the current and future states of AI, ML, and LLMs on Google Cloud. Discover 7 key insights for technical leaders, covering everything from upskilling teams to observability best practices

Learn More

Top comments (0)

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more