Krishna Tangudu
Part 4: Clone ++ Parallelization and Production Features

Previously: In Part 3, we automated permission management with dynamic RBAC provisioning.

In this post: Scale your cloning operations with parallel processing, add resume-from-failure capabilities, implement audit logging, and build production-grade orchestration.


The Performance Problem

Our repointing solution works, but doesn't scale:

-- Sequential processing (6 schemas)
CALL sp_repoint_schema('dev_db', 'prod_db', 'ADMINISTRATION');  -- 5 min
CALL sp_repoint_schema('dev_db', 'prod_db', 'INTEGRATION');     -- 8 min
CALL sp_repoint_schema('dev_db', 'prod_db', 'ANALYTICS');       -- 12 min
CALL sp_repoint_schema('dev_db', 'prod_db', 'DATA');            -- 10 min
CALL sp_repoint_schema('dev_db', 'prod_db', 'REPORTING');       -- 7 min
CALL sp_repoint_schema('dev_db', 'prod_db', 'ARCHIVE');         -- 3 min

-- Total: 45 minutes ⏰

Problem: Each schema blocks the next. We're not using Snowflake's compute parallelism.


Solution: ASYNC/AWAIT Pattern

Snowflake's ASYNC and AWAIT keywords enable parallel execution:

-- Launch all schemas in parallel
ASYNC (CALL sp_repoint_schema('dev_db', 'prod_db', 'ADMINISTRATION'));
ASYNC (CALL sp_repoint_schema('dev_db', 'prod_db', 'INTEGRATION'));
ASYNC (CALL sp_repoint_schema('dev_db', 'prod_db', 'ANALYTICS'));
ASYNC (CALL sp_repoint_schema('dev_db', 'prod_db', 'DATA'));
ASYNC (CALL sp_repoint_schema('dev_db', 'prod_db', 'REPORTING'));
ASYNC (CALL sp_repoint_schema('dev_db', 'prod_db', 'ARCHIVE'));

-- Wait for all to complete
AWAIT ALL;

-- Result: ~12 minutes (limited by slowest schema)

Speedup: 45 minutes → 12 minutes = 73% faster


Parallel Architecture

High-Level Flow

SP_REPOINT_PARALLEL (Orchestrator)
├─ Get all schemas in clone
├─ For each schema:
│  └─ ASYNC (SP_REPOINT_SCHEMA_AND_LOG)
├─ AWAIT ALL
└─ Aggregate results

SP_REPOINT_SCHEMA_AND_LOG (Logging Wrapper)
├─ Call SP_REPOINT_SCHEMA
├─ Capture result
└─ Insert into temp results table

SP_REPOINT_SCHEMA (Worker)
├─ Repoint views
├─ Repoint procedures
├─ Repoint functions
├─ Repoint tasks
└─ Return JSON result

Orchestrator Pattern

-- Simplified orchestrator logic
CREATE OR REPLACE PROCEDURE sp_clone_repoint_parallel(
    clone_db VARCHAR, 
    source_db VARCHAR
)
RETURNS VARIANT
LANGUAGE SQL
AS
BEGIN
    -- Create temp table for results
    CREATE TEMP TABLE temp_repoint_results (
        schema_name VARCHAR,
        result VARIANT,
        completed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
    );

    -- Get all schemas (simplified; in practice resolve :clone_db dynamically, e.g. via EXECUTE IMMEDIATE)
    LET schema_rs RESULTSET := (
        SELECT SCHEMA_NAME 
        FROM clone_db.INFORMATION_SCHEMA.SCHEMATA 
        WHERE SCHEMA_NAME != 'INFORMATION_SCHEMA'
    );

    -- Launch parallel workers
    FOR rec IN schema_rs DO
        ASYNC (
            CALL sp_repoint_schema_and_log(:clone_db, :source_db, rec.SCHEMA_NAME)
        );
    END FOR;

    -- Wait for all workers
    AWAIT ALL;

    -- Aggregate results
    LET final_result := (
        SELECT OBJECT_CONSTRUCT(
            'parallel_schemas', COUNT(*),
            'total_duration_seconds', 
                DATEDIFF('second', MIN(completed_at), MAX(completed_at)),
            'schema_results', ARRAY_AGG(result)
        )
        FROM temp_repoint_results
    );

    RETURN :final_result;
END;

Why temp tables? ASYNC procedures can't return values directly. We collect results in a temp table visible to the orchestrator.
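The wrapper itself can be a thin shell around the worker. A minimal sketch, assuming the `sp_repoint_schema` worker and the session-scoped `temp_repoint_results` table shown above (ASYNC child jobs run in the same session, so the temp table is visible to them):

```sql
CREATE OR REPLACE PROCEDURE sp_repoint_schema_and_log(
    clone_db VARCHAR,
    source_db VARCHAR,
    schema_name VARCHAR
)
RETURNS VARCHAR
LANGUAGE SQL
AS
DECLARE
    result VARIANT;
BEGIN
    -- Run the worker and capture its JSON result
    CALL sp_repoint_schema(:clone_db, :source_db, :schema_name) INTO :result;

    -- Persist the result where the orchestrator can see it
    INSERT INTO temp_repoint_results (schema_name, result)
        SELECT :schema_name, :result;

    RETURN 'LOGGED';
END;
```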

Usage

CALL sp_clone_repoint_parallel('DEV_PROJECT_DB', 'PRODUCTION_DB');

-- Result:
-- {
--   "parallel_schemas": 6,
--   "total_duration_seconds": 720,  -- 12 minutes
--   "schema_results": [
--     {"schema": "ADMINISTRATION", "views_fixed": 12, "procedures_fixed": 8},
--     {"schema": "ANALYTICS", "views_fixed": 98, "procedures_fixed": 42},
--     ...
--   ]
-- }

The Same Pattern for Streams

Parallel stream recreation follows the same architecture (pseudocode):

-- Orchestrator launches per-schema stream workers
CREATE OR REPLACE PROCEDURE sp_clone_recreate_streams_parallel(...)
AS
BEGIN
    CREATE TEMP TABLE temp_stream_results (...);

    FOR each schema:
        ASYNC (CALL sp_recreate_streams_schema_and_log(...));

    AWAIT ALL;

    RETURN aggregated_results;
END;

Resume-from-Failure: Step-Based Tracking

The Problem

Cloning is multi-step:

  1. Delete old RBAC mappings
  2. Clone database
  3. Revoke production grants
  4. Repoint objects
  5. Recreate streams
  6. Create new roles
  7. Apply RBAC mappings
  8. Handle Iceberg tables
  9. Suspend tasks
  10. Validate clone

What happens if Step 5 fails? You don't want to start over!

The Solution: Step Logging

Track each step in a dedicated table:

CREATE TABLE clone_step_log (
    log_id NUMBER AUTOINCREMENT,
    audit_id NUMBER,          -- Links to clone_audit_log
    clone_db VARCHAR,
    step_number NUMBER,
    step_name VARCHAR,
    status VARCHAR DEFAULT 'PENDING',  -- PENDING, IN_PROGRESS, SUCCESS, FAILED
    result VARIANT,
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
    completed_at TIMESTAMP
);

Logging Pattern

// In master clone procedure (simplified; production code should use bind
// variables rather than string concatenation to avoid quoting issues)
function executeStep(stepNum, stepName, stepFunction) {
    // Log step start
    execSQL(
        "INSERT INTO clone_step_log " +
        "(audit_id, clone_db, step_number, step_name, status) " +
        "VALUES (..., " + stepNum + ", '" + stepName + "', 'IN_PROGRESS')"
    );

    try {
        // Execute the step
        var result = stepFunction();

        // Log success
        execSQL(
            "UPDATE clone_step_log " +
            "SET status = 'SUCCESS', result = PARSE_JSON('" + JSON.stringify(result) + "'), " +  // VARIANT column: parse, don't store raw text
            "    completed_at = CURRENT_TIMESTAMP() " +
            "WHERE step_number = " + stepNum + " AND status = 'IN_PROGRESS'"
        );

        return result;
    } catch (e) {
        // Log failure
        execSQL(
            "UPDATE clone_step_log " +
            "SET status = 'FAILED', result = OBJECT_CONSTRUCT('error', '" + e.message + "'), " +
            "    completed_at = CURRENT_TIMESTAMP() " +
            "WHERE step_number = " + stepNum + " AND status = 'IN_PROGRESS'"
        );
        throw e;  // Re-throw to abort remaining steps
    }
}

Resume Logic

// Master procedure signature
CREATE PROCEDURE sp_clone_create_master(
    clone_type VARCHAR,
    name_part1 VARCHAR,
    name_part2 VARCHAR DEFAULT NULL,
    resume_from_step FLOAT DEFAULT 0  -- 👈 Resume parameter
)

// Execution logic
var steps = [
    {num: 1, name: 'DELETE_RBAC', fn: deleteRBACMappings},
    {num: 2, name: 'CLONE_DATABASE', fn: cloneDatabase},
    {num: 3, name: 'REVOKE_GRANTS', fn: revokeGrants},
    {num: 4, name: 'REPOINT', fn: repointParallel},
    {num: 5, name: 'STREAMS', fn: recreateStreamsParallel},
    {num: 6, name: 'CREATE_ROLES', fn: createRoles},
    {num: 7, name: 'APPLY_RBAC', fn: applyRBAC},
    {num: 8, name: 'ICEBERG', fn: handleIceberg},
    {num: 9, name: 'SUSPEND_TASKS', fn: suspendTasks},
    {num: 10, name: 'VALIDATE', fn: validateClone}
];

for (var i = 0; i < steps.length; i++) {
    if (steps[i].num < resume_from_step) {
        continue;  // Skip this step
    }

    executeStep(steps[i].num, steps[i].name, steps[i].fn);
}

Resuming

-- Initial attempt fails at step 5
CALL sp_clone_create_master('PROJECT', 'ANALYTICS', 'DEV');
-- ERROR at step 5: Stream recreation failed

-- Check what happened
SELECT step_number, step_name, status, result
FROM clone_step_log
WHERE clone_db = 'DEV_ANALYTICS_DB'
ORDER BY step_number;

-- Fix the issue, then resume from step 5
CALL sp_clone_create_master('PROJECT', 'ANALYTICS', 'DEV', 5);
-- ✅ Steps 1-4 skipped, execution resumes from step 5

Key benefit: No need to wait another 30 minutes to re-clone. Just fix and resume.
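You can even derive the resume point from the step log itself rather than reading it by eye:

```sql
-- Earliest non-successful step for this clone; 0 means nothing to resume
SELECT COALESCE(MIN(step_number), 0) AS resume_from_step
FROM clone_step_log
WHERE clone_db = 'DEV_ANALYTICS_DB'
  AND status IN ('FAILED', 'IN_PROGRESS');
```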


Audit Logging: Observability

The Audit Table

CREATE TABLE clone_audit_log (
    audit_id NUMBER AUTOINCREMENT,
    clone_db VARCHAR NOT NULL,
    clone_type VARCHAR NOT NULL,      -- PROJECT, RELEASE
    action VARCHAR NOT NULL,          -- CREATE, DROP, UPDATE
    project_name VARCHAR,
    env_name VARCHAR,                 -- DEV, QA, STAGING
    source_db VARCHAR DEFAULT 'PRODUCTION_DB',
    status VARCHAR NOT NULL,          -- IN_PROGRESS, SUCCESS, FAILED
    error_msg VARCHAR(4000),
    created_by VARCHAR DEFAULT CURRENT_USER(),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
    completed_at TIMESTAMP
);

Key Metrics

-- Clone success rate (last 30 days)
SELECT 
    COUNT(*) AS total_clones,
    SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successful,
    ROUND(100.0 * successful / total_clones, 2) AS success_rate
FROM clone_audit_log
WHERE created_at >= DATEADD('day', -30, CURRENT_TIMESTAMP())
AND action = 'CREATE';

-- Average duration by environment
SELECT 
    env_name,
    AVG(DATEDIFF('minute', created_at, completed_at)) AS avg_minutes
FROM clone_audit_log
WHERE status = 'SUCCESS' AND completed_at IS NOT NULL
GROUP BY env_name;

-- Most common failure points
SELECT 
    step_name,
    COUNT(*) AS failure_count
FROM clone_step_log
WHERE status = 'FAILED'
AND started_at >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY step_name
ORDER BY failure_count DESC;

Task Suspension: Cost Control

The Problem

Cloned databases inherit active tasks from production:

SHOW TASKS IN DATABASE dev_project_db;
-- Result: 23 tasks, STATE = 'started' ⚠️
-- Running hourly, daily, etc. in DEV!

Cost: $200-500/month per clone in wasted compute.

The Solution

// Auto-suspend all tasks in clone (Snowflake JavaScript;
// one SHOW covers every schema in the database)
function suspendTasks() {
    var tasksSuspended = 0;

    // SHOW output columns are lowercase: name, database_name, schema_name, state
    var rs = snowflake.execute({sqlText: "SHOW TASKS IN DATABASE " + cloneDb});
    while (rs.next()) {
        if (rs.getColumnValue('state') === 'started') {
            var fqName = '"' + rs.getColumnValue('database_name') + '"."' +
                         rs.getColumnValue('schema_name') + '"."' +
                         rs.getColumnValue('name') + '"';
            snowflake.execute({sqlText: "ALTER TASK " + fqName + " SUSPEND"});
            tasksSuspended++;
        }
    }

    return {tasks_suspended: tasksSuspended};
}

Integration: Add as Step 9 in clone pipeline.

Result: All tasks suspended by default in non-prod clones.
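To verify the step worked, you can scan the SHOW output with RESULT_SCAN (SHOW column names are lowercase, hence the quoted identifier):

```sql
SHOW TASKS IN DATABASE dev_project_db;

-- Count any tasks still running; expect 0 after step 9
SELECT COUNT(*) AS still_started
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "state" = 'started';
```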


Iceberg Table Handling

Auto-Grant Volume Access

// Step 8: Handle Iceberg tables (Snowflake JavaScript)
function handleIcebergTables() {
    var volumeGrants = 0;

    // Distinct external volumes referenced by Iceberg tables in the clone
    var rs = snowflake.execute({sqlText:
        "SELECT DISTINCT external_volume " +
        "FROM information_schema.tables " +
        "WHERE table_type IN ('ICEBERG TABLE', 'DYNAMIC ICEBERG TABLE') " +
        "AND external_volume IS NOT NULL"
    });
    while (rs.next()) {
        var volume = rs.getColumnValue('EXTERNAL_VOLUME');
        // USAGE on an external volume is granted to a role, not a database;
        // cloneAccessRole is the clone's access role from the RBAC step
        snowflake.execute({sqlText:
            "GRANT USAGE ON EXTERNAL VOLUME " + volume + " TO ROLE " + cloneAccessRole
        });
        volumeGrants++;
    }

    return {volume_grants_applied: volumeGrants};
}

Master Orchestration

Bringing it all together:

-- One command to rule them all
CALL sp_clone_create_master('PROJECT', 'ANALYTICS', 'DEV');

Behind the scenes:

Step 1: DELETE_RBAC_MAPPINGS      [3 sec]
Step 2: CLONE_DATABASE             [3 sec]  ← Snowflake native
Step 3: REVOKE_GRANTS              [45 sec]
Step 4: REPOINT_PARALLEL           [8 min]  ← ASYNC/AWAIT
Step 5: RECREATE_STREAMS_PARALLEL  [2 min]  ← ASYNC/AWAIT
Step 6: CREATE_ROLES               [1 min]
Step 7: APPLY_RBAC_MAPPINGS        [15 sec]
Step 8: HANDLE_ICEBERG             [5 sec]
Step 9: SUSPEND_TASKS              [10 sec]
Step 10: VALIDATE                  [30 sec]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total: ~12 minutes

Result:
✅ Fully functional dev environment
✅ Correct permissions
✅ All references updated
✅ Streams recreated
✅ Tasks suspended
✅ Iceberg configured
✅ Validated and ready

Performance Tuning Tips

1. Warehouse Sizing

-- Use larger warehouse for parallel operations
ALTER WAREHOUSE clone_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- After clone completes
ALTER WAREHOUSE clone_wh SET WAREHOUSE_SIZE = 'SMALL';

2. Batch Processing

For 50+ schemas:

-- Process in batches to avoid overwhelming warehouse
-- Batch 1: Schemas 1-10
-- Batch 2: Schemas 11-20
-- etc.
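A sketch of a batched variant of the orchestrator loop, assuming the same `schema_rs` resultset and logging wrapper as above:

```sql
DECLARE
    batch_size INTEGER DEFAULT 10;
    launched INTEGER DEFAULT 0;
BEGIN
    FOR rec IN schema_rs DO
        ASYNC (
            CALL sp_repoint_schema_and_log(:clone_db, :source_db, rec.SCHEMA_NAME)
        );
        launched := launched + 1;

        -- Drain the current batch before launching more workers
        IF (launched % batch_size = 0) THEN
            AWAIT ALL;
        END IF;
    END FOR;

    AWAIT ALL;  -- wait for the final partial batch
END;
```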

3. Query Optimization

-- ❌ Scans entire database
SELECT * FROM information_schema.views 
WHERE view_definition ILIKE '%prod_db%';

-- ✅ Filter by schema first
SELECT * FROM information_schema.views 
WHERE table_schema = 'ANALYTICS'
AND view_definition ILIKE '%prod_db%';

Production Metrics: The Full Picture

Metric               Manual      Semi-Auto    Fully Automated
-------------------  ----------  -----------  ---------------
Time to clone        1-2 days    4-6 hours    8-12 minutes
Human intervention   Constant    Occasional   None
Error rate           15-20%      5-8%         <1%
Concurrent clones    1           2-3          10+
Resume capability    No          Partial      Full
Cost per clone       High        Medium       Low
Audit trail          Manual      Basic        Complete

Key Takeaways

  1. Parallelization is essential - ASYNC/AWAIT delivers 73% speedup
  2. Resume-from-failure saves hours - Step tracking enables smart recovery
  3. Observability matters - Audit logs provide accountability and insights
  4. Task suspension prevents waste - Auto-suspend saves $200-500/month per clone
  5. Iceberg needs attention - External volumes and dynamic tables require special handling

What We've Built

Over this 4-part series, we created a production-grade cloning solution:

Handles permissions - Dynamic RBAC provisioning

Repoints references - All object types updated

Recreates streams - With correct offsets

Processes in parallel - 73% faster

Resumes from failure - No starting over

Logs everything - Complete observability

Suspends tasks - Cost control

Handles Iceberg - External volumes and dynamic tables

Validates results - Health checks

One command:

CALL sp_clone_create_master('PROJECT', 'myproject', 'DEV');

Result: Fully functional dev environment in ~12 minutes.


Going Further

Clone Scheduling

-- Weekly QA refresh
CREATE TASK refresh_qa_clone
  SCHEDULE = 'USING CRON 0 6 * * 1 America/Los_Angeles'
AS
  CALL sp_clone_update('PROJECT', 'myproject', 'QA');

Self-Service UI

Build a web interface for teams to request/manage clones.

Data Masking

Apply dynamic masking policies after cloning for PII protection.
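For example, a simple policy that hides email addresses from everyone except an admin role (the table, column, and role names here are illustrative):

```sql
CREATE MASKING POLICY IF NOT EXISTS mask_email
AS (val STRING) RETURNS STRING ->
    CASE
        WHEN IS_ROLE_IN_SESSION('CLONE_ADMIN') THEN val
        ELSE '***MASKED***'
    END;

ALTER TABLE dev_project_db.analytics.customers
    MODIFY COLUMN email SET MASKING POLICY mask_email;
```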

Cost Tracking

Tag clones with cost centers for chargeback.
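Snowflake object tags make this straightforward; the tag and value here are illustrative:

```sql
-- Tag the clone so its credit consumption can be attributed
CREATE TAG IF NOT EXISTS cost_center;
ALTER DATABASE dev_project_db SET TAG cost_center = 'analytics-team';
```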

Auto-Expiration

Drop clones after N days to control costs.
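A serverless task can handle this; the cleanup procedure `sp_drop_expired_clones` is hypothetical:

```sql
-- Nightly at 02:00: drop project clones older than 14 days
CREATE TASK IF NOT EXISTS drop_expired_clones
  SCHEDULE = 'USING CRON 0 2 * * * America/Los_Angeles'
AS
  CALL sp_drop_expired_clones(14);
```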


Resources

Code Repository: github.com/LALITHASWAROOPK/snowflake_cloning

Blog Series:

- Part 4: Parallelization and Production Features (this post)

Did this help? Star the repo and share with your team!

Questions? Open an issue on GitHub or comment below.


Previous: Part 3: Solving Permissions and RBAC in Cloned Databases

Series Start: Introduction and Overview

