
Byron Hsieh

Why Multi-Agent Deployment Might Slow Down Your Redshift DDL Deployments

Background

Our data warehouse team was scaling up deployment volumes. Based on prior experience (approximately 100 tables in 30 minutes), we knew larger deployments would take considerably longer.

When concerns about execution time arose, I noticed Azure DevOps offered multi-agent deployment capabilities. The idea seemed straightforward: distribute the workload across multiple agents to speed things up.

Before implementing, I investigated what would actually happen at the database level.


The Deployment Pattern

For each table deployment, we execute DDL and DML operations across a typical multi-layered data warehouse architecture:

  • Data ingestion and transformation layers — Create tables for staging, integration, and historical tracking
  • Presentation layer — Create views for data consumption and abstraction
  • ETL orchestration metadata — Update control tables that manage job dependencies and execution tracking

When scaling from ~100 tables to several hundred, this translates to:

  • Thousands of DDL statements (CREATE TABLE / CREATE VIEW)
  • Hundreds of DML operations on shared control tables
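
As a rough sketch, the deployment unit for a single table looks something like the following. Every object name here, including the control-table column list, is an illustrative placeholder rather than one of our real objects:

-- Data ingestion and transformation layers (schemas assumed to exist)
CREATE TABLE staging.customer_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP
);

CREATE TABLE history.customer_orders_hist (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    valid_from  TIMESTAMP,
    valid_to    TIMESTAMP
);

-- Presentation layer
CREATE VIEW presentation.v_customer_orders AS
SELECT order_id, customer_id, order_ts
FROM staging.customer_orders;

-- ETL orchestration metadata (column list is assumed for illustration)
DELETE FROM metadata_schema.etl_control WHERE table_name = 'customer_orders';
INSERT INTO metadata_schema.etl_control (table_name, layer, is_active)
VALUES ('customer_orders', 'staging', true);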

Azure DevOps Multi-Agent: Understanding the Mechanism

Azure DevOps supports parallel job execution through two primary interfaces, both of which produce the same underlying behavior:

Classic Editor (Release Pipelines)

In Azure DevOps Classic Editor, you can configure parallelism via:
Agent Job > Execution Plan > Parallelism > Multi-agent

This setting duplicates the entire Agent Job across multiple agents.

YAML Pipelines

The equivalent in YAML pipelines uses the parallel strategy:

jobs:
- job: DeployTables
  strategy:
    parallel: 10  # This DUPLICATES the job 10 times
  pool:
    vmImage: 'ubuntu-latest'
  steps:
  - script: |
      # WARNING: Without work distribution logic,
      # ALL 10 agents will execute THE SAME steps!
      psql -h redshift-cluster -f deploy_all_tables.sql

Both approaches have the same fundamental behavior: They create multiple identical jobs where each agent executes the same tasks unless you explicitly write logic to distribute the work.

How to Actually Distribute Work

In Classic Editor (Release Pipelines)

You need to use predefined variables or custom logic in your tasks to determine which portion of work each agent should handle.

In YAML Pipelines

Azure DevOps provides system variables for manual work distribution:

jobs:
- job: DeployTables
  strategy:
    parallel: 10
  steps:
  - bash: |
      # Use system variables to determine which tables THIS agent should handle
      POSITION=$(System.JobPositionInPhase)  # Values: 1, 2, 3, ..., 10
      TOTAL=$(System.TotalJobsInPhase)       # Value: 10

      # Calculate this agent's workload (50 tables each)
      START=$((($POSITION - 1) * 50 + 1))
      END=$(($POSITION * 50))

      echo "Agent $POSITION deploying tables $START to $END"
      for i in $(seq $START $END); do
        psql -f "table_${i}.sql"
      done

Even with correct work distribution, DDL operations on Redshift face a fundamental limitation that makes parallelization ineffective.

The Critical Understanding

What developers often assume:

"Azure DevOps will automatically split my tables across 10 agents."

What actually happens (in both Classic Editor and YAML):

Azure DevOps creates 10 identical jobs. Each job runs the exact same steps unless you explicitly write logic to distribute the work.


Redshift Architecture: The Leader Node Constraint

This is where my investigation became crucial. To understand whether multi-agent deployment would help, I needed to understand how Redshift actually handles DDL operations.

How Redshift Handles DDL

Redshift uses a Leader Node + Compute Nodes architecture:

┌─────────────────────────────┐
│     Leader Node             │  ← All DDL executes here
│  - System Catalog Tables    │
│  - Query Planning           │
│  - Metadata Management      │
└─────────────────────────────┘
         │
    ┌────┴────┬────────┬────────┐
    ▼         ▼        ▼        ▼
[Compute] [Compute] [Compute] [Compute]  ← Only for data processing

Key finding: All DDL operations execute on the Leader Node only, regardless of how many compute nodes you have.

System Catalog Table Locking

When you execute DDL statements, Redshift must update its system catalog tables. While Redshift wraps PostgreSQL system catalogs with its own views (like PG_CLASS_INFO, PG_TABLE_DEF), the underlying PostgreSQL catalog tables are where the actual locking occurs:

Core PostgreSQL Catalog Tables (where locks occur):

  • pg_class — stores table/view metadata
  • pg_attribute — stores column definitions
  • pg_namespace — stores schema information
  • pg_depend — stores object dependencies
  • pg_type — stores data type definitions

Redshift Information Views (built on top):

  • PG_CLASS_INFO, PG_ATTRIBUTE_INFO — Redshift wrappers
  • PG_TABLE_DEF — Redshift-specific comprehensive view
  • SVV_TABLE_INFO — System view with distribution information
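
To see how much catalog metadata a single object touches, you can query these catalogs directly on the leader node; the schema and table names below are the same placeholders used in the earlier sketch:

-- Catalog rows behind one table: pg_class (the relation), pg_attribute (its columns),
-- pg_type (each column's type), pg_namespace (its schema)
SELECT c.relname AS object_name,
       a.attname AS column_name,
       t.typname AS data_type
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum > 0
JOIN pg_type t ON t.oid = a.atttypid
WHERE n.nspname = 'staging'
  AND c.relname = 'customer_orders';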

Critical finding: DDL operations require ACCESS EXCLUSIVE LOCKS on the underlying pg_class and pg_attribute tables, which are global singleton resources on the Leader Node. This forces all DDL statements to execute serially, regardless of:

  • How many compute nodes you have (4, 10, or 100)
  • How many Azure DevOps agents you use
  • How well you distribute the work
-- What would happen with 10 parallel agents executing CREATE TABLE:

Agent 1/Session 1: CREATE TABLE table_001 (...);  -- Executes (holds pg_class lock)
Agent 2/Session 2: CREATE TABLE table_021 (...);  -- Waiting for pg_class lock ⏳
Agent 3/Session 3: CREATE TABLE table_041 (...);  -- Waiting for pg_class lock ⏳
Agent 4/Session 4: CREATE TABLE table_061 (...);  -- Waiting for pg_class lock ⏳
... (Agents 5-10 all waiting) ...

-- Result: Serialized execution despite parallel agents

Why This Happens:

Redshift's architecture inherits PostgreSQL's catalog design, where DDL operations must maintain ACID properties across system metadata. The Leader Node must ensure:

  1. Consistent metadata across all catalog tables
  2. No concurrent modifications to object definitions
  3. Proper dependency tracking for views and constraints

This design choice prioritizes data integrity over DDL parallelism.
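
This serialization is observable during a deployment. A query along the following lines against SVV_TRANSACTIONS (a documented Redshift system view) lists the sessions still queued behind a lock; treat it as a monitoring sketch rather than part of the deployment itself:

-- Sessions currently waiting for a lock (granted = false) during a DDL burst
SELECT pid,
       lock_mode,
       lockable_object_type,
       relation,
       txn_start
FROM svv_transactions
WHERE granted = false
ORDER BY txn_start;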

Control Table Locking

Additionally, our deployment pattern includes updates to shared control tables:

DELETE FROM metadata_schema.etl_control WHERE table_name = 'example_table';
INSERT INTO metadata_schema.etl_control VALUES (...);

These operations acquire table-level exclusive locks, forcing all agents to queue for access to the same tables.
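
STV_LOCKS (another documented Redshift system table) shows these table-level locks while a deployment is running; as a rough sketch:

-- Table-level locks currently held or pending; table_id can be mapped to the
-- control table via SVV_TABLE_INFO or the table's OID in pg_class
SELECT table_id,
       lock_owner_pid,
       lock_status,
       lock_owner_start_ts
FROM stv_locks
ORDER BY lock_owner_start_ts;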


Performance Analysis: What Would Actually Happen

Based on our baseline data (~100 tables in 30 minutes) and understanding of Redshift's architecture, I projected what would happen with different approaches:

Scenario 1: Multi-Agent Without Work Distribution

strategy:
  parallel: 10
steps:
  - script: psql -h redshift-cluster -f execute_all_tables.sql  # No work distribution!

Projected result:

  • Each of 10 agents executes all tables
  • Total DDL operations: N_tables × 10 agents
  • Database receives 10x the operations
  • Outcome: Severe database overload, significantly slower than baseline

Scenario 2: Multi-Agent With Correct Work Distribution

strategy:
  parallel: 10
steps:
  - bash: |
      # Correctly distribute: each agent handles subset of tables
      TABLES_PER_AGENT=$(( TOTAL_TABLES / 10 ))
      # ... execute only assigned tables

Projected result based on architecture analysis:

Expected (naive calculation):
  Sequential time: ~90 minutes (~3x the 100-tables-in-30-minutes baseline, for ~300 tables)
  With 10 agents: 90 ÷ 10 = 9 minutes ✓

Actual (with Redshift catalog locks):
  Agent 1: [████████████████████] ~90 min (executing)
  Agent 2: [⏳⏳⏳⏳⏳⏳⏳⏳] wait → execute
  Agent 3: [⏳⏳⏳⏳⏳⏳⏳⏳] wait → execute
  ...
  Total time: ~90 minutes (serialized) + orchestration overhead

Why the overhead?

  1. Connection Overhead: 10 simultaneous connections to Redshift (vs. 1)
  2. Lock Contention Monitoring: Redshift Leader Node processing lock queues
  3. Transaction Retry Logic: Agents detecting timeouts and retrying
  4. Network Latency: Multiple agents competing for responses
  5. Logging Overhead: 10x more connection logs and audit entries
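
As a concrete example of point 3, if each agent's session caps how long a statement may run (statement_timeout is a standard Redshift session parameter, value in milliseconds), a CREATE TABLE stuck behind the catalog lock eventually aborts and has to be retried:

-- Abort any statement, including a blocked CREATE TABLE, after 5 minutes;
-- the deploying agent then has to detect the failure and retry
SET statement_timeout TO 300000;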

Scenario 3: Single-Agent Sequential (Current Approach)

Projected result:

  • Clean serial execution
  • Predictable linear scaling from baseline
  • No orchestration overhead
  • Expected time: Scales linearly from baseline (~90 minutes for ~300 tables, i.e., ~3x the 30-minute baseline)

Performance Comparison Matrix

| Operation Type | Lock Scope | Without Distribution | With Distribution | Actual Benefit |
| --- | --- | --- | --- | --- |
| CREATE TABLE | pg_class (global) | ❌ N×agents duplication | ❌ Serialized | 0% |
| CREATE VIEW | pg_class, pg_depend | ❌ N×agents duplication | ❌ Serialized | 0% |
| Control table DML | Table-level exclusive | ❌ N×agents duplication | ❌ Serialized (same table) | 0% |
| SELECT queries | Row-level/Data blocks | ⚠️ N×agents queries | ✅ True parallel | 80-90% |

Decision: Stay with Single-Agent

Based on this analysis, the decision was clear: do not implement multi-agent deployment.

Rationale:

  1. No performance benefit: DDL operations would serialize at the database level regardless of CI/CD parallelism
  2. Added complexity: Multi-agent requires work distribution logic, monitoring, and coordination
  3. Potential risks: Connection overhead and lock contention could actually slow things down
  4. Simpler is better: Single-agent deployment is more predictable and easier to debug

Alternative optimizations considered:

  1. Batch Transactions (see the expanded sketch after this list)
   BEGIN;
   CREATE TABLE table_001 (...);
   CREATE VIEW view_001 (...);
   -- More statements...
   COMMIT;
  2. Split into Multiple Smaller Deployments

    • Deploy in batches across multiple maintenance windows
    • Reduces single-deployment duration
    • More manageable rollback if issues occur
  3. Optimize Individual DDL Statements

    • Remove unnecessary IF NOT EXISTS checks
    • Defer COMMENT statements to post-deployment
    • Minimize column count where possible
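
Expanding on option 1 above, here is a minimal sketch of a batched deployment transaction, reusing the illustrative object names from earlier. Many objects share one BEGIN/COMMIT, so connection round trips and commits are paid per batch rather than per statement, and the control-table lock is acquired once per batch:

-- One transaction per batch of objects (all names are illustrative placeholders)
BEGIN;

CREATE TABLE staging.customer_payments (
    payment_id BIGINT,
    order_id   BIGINT,
    amount     DECIMAL(18,2),
    paid_ts    TIMESTAMP
);

CREATE VIEW presentation.v_customer_payments AS
SELECT payment_id, order_id, amount, paid_ts
FROM staging.customer_payments;

-- Control-table DML for this batch rides in the same transaction
DELETE FROM metadata_schema.etl_control WHERE table_name = 'customer_payments';
INSERT INTO metadata_schema.etl_control (table_name, layer, is_active)
VALUES ('customer_payments', 'staging', true);

COMMIT;

Batch size is a trade-off: larger batches mean fewer commits and round trips, but a longer-held lock and a larger rollback if one statement fails.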

Lessons Learned

1. Architecture Understanding Matters More Than Tooling

More CI/CD resources don't help if the database architecture doesn't support parallelism. Understanding Redshift's Leader Node serialization was more valuable than adding more Azure DevOps agents.

2. Challenge "Obvious" Solutions

"More parallelism = faster" is true for data processing (SELECT queries on compute nodes), but not for DDL operations on shared catalog tables. The "obvious" solution would have wasted implementation effort with no benefit.

3. Analyze Before Implementing

By investigating the architecture before implementation, we avoided:

  • Wasted engineering time building work distribution logic
  • Added operational complexity with no performance gain
  • Potential performance degradation from connection overhead

This preventive analysis saved resources and kept our deployment process simple and predictable.

4. Simple Can Be Better

Single-agent deployment with optimized batching beat complex multi-agent orchestration. Sometimes the best optimization is recognizing when complexity doesn't add value.


Key Takeaways

For Redshift DDL Deployments:

  • Single-agent sequential deployment is optimal for DDL-heavy workloads
  • Batch transactions reduce round-trip overhead
  • Multiple smaller deployments manage risk better than one large deployment
  • Optimize individual DDL statements for the most impact
  • Multi-agent parallelism provides no benefit (serialized at the Leader Node)
  • More compute nodes don't speed up DDL (only data queries benefit)
  • Complex orchestration adds overhead without performance gain

Universal Lesson:

Understanding system constraints prevents premature optimization.

The best solution isn't always the most sophisticated—it's the one that works with your architecture, not against it.


References

Key Documentation Insights

  1. Redshift DDL Processing: All DDL operations execute on the Leader Node, not compute nodes
  2. Azure DevOps Multi-Agent: Both Classic Editor and YAML approaches duplicate jobs; require manual work distribution
  3. System Catalog Locking: PostgreSQL (and Redshift) use ACCESS EXCLUSIVE locks on metadata tables during DDL operations
