DEV Community: Gabriel Henrique

Apache Iceberg in Production: Compaction, Catalogs, and the Pitfalls Nobody Warns You About

Gabriel Henrique — Thu, 25 Jun 2026 00:19:45 +0000

Apache Iceberg looked like the answer to everything when we first adopted it. Open format, ACID transactions, time travel, schema evolution. We migrated our Hive tables, ran a few queries, and felt good about life.

Three months later, our S3 costs doubled. Queries that used to take 10 seconds were taking 4 minutes. Metadata operations were timing out. Nobody on the team could explain why.

That was the beginning of a real education in how Iceberg actually behaves in production. This post covers what I wish someone had told us before we went all-in.

The Small Files Problem Is Not Optional

Iceberg is append-friendly by design. Every micro-batch write, every streaming insert, every incremental load creates new Parquet files. Each file also gets its own metadata entry.

After a week of hourly loads, you might have 10,000 files in a single partition where you wanted 20.

The result: Iceberg has to plan every query across thousands of data file entries in its manifests. Planning takes longer than execution. Your 10-second query becomes a 4-minute query, and your users start filing tickets.

Fix: automate compaction from day one.

In Spark, compaction is called rewrite_data_files. The basic call looks like this:

-- Run this on a schedule, not on-demand
CALL iceberg_catalog.system.rewrite_data_files(
  table => 'analytics.events',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '134217728',  -- 128MB target per file
    'min-input-files', '5'                  -- only compact if 5+ small files exist
  )
)

Target file size of 128MB to 512MB is the practical sweet spot. Smaller than that, you still have too many files. Larger, and your query engines cannot parallelize reads efficiently.
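To make the bin-packing idea concrete, here is a minimal sketch of how small files could be grouped into rewrite groups close to the target size. It is not Iceberg's actual planner. The function name plan_binpack, the first-fit strategy, and the sample sizes are assumptions for illustration, and min_input_files follows the simplified meaning used in the comment above.

```python
def plan_binpack(file_sizes_mb, target_mb=128, min_input_files=5):
    """Group small files into bins of at most target_mb (first-fit decreasing).

    Illustrative only: Iceberg's real planner is more involved.
    """
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    if len(small) < min_input_files:
        return []  # not enough small files to justify a rewrite

    bins = []  # each bin is a list of input sizes that become one output file
    for size in small:
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins


# 24 hourly loads that each left one 8 MB file in the same partition
groups = plan_binpack([8] * 24)
print(len(groups), [sum(g) for g in groups])
```

With these inputs, 24 small files collapse into two output files, one at the 128 MB target and one holding the remainder.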

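If you are not using Spark, you still have options. Trino has its own command, ALTER TABLE analytics.events EXECUTE optimize, and the Iceberg Flink module ships a rewrite-data-files action. PyIceberg's maintenance support is newer, so check the release notes for your version before relying on it for compaction. If none of these fit your setup, schedule compaction as a separate Spark job. Yes, it is annoying, but it is the right call.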

Hidden Partitioning Is the Feature You Are Probably Ignoring

Old Hive partitioning was explicit. You wrote PARTITIONED BY (event_date STRING) and added that column to every query or Hive would scan the entire table.

Iceberg's hidden partitioning decouples the physical layout from what the query writer sees. You define a partition spec on the table, and the engine automatically applies it during writes and prunes during reads without the query needing to reference the partition column.

from pyiceberg.catalog import load_catalog
from pyiceberg.transforms import DayTransform

catalog = load_catalog("rest", **{
    "uri": "http://your-rest-catalog:8181",
    "warehouse": "s3://your-bucket/warehouse"
})

# Load an existing table and evolve its partition spec
table = catalog.load_table("analytics.events")

# Add a day-level partition on event_timestamp
# Iceberg derives the day from the timestamp. No ts_date column needed in your schema.
with table.update_spec() as update:
    update.add_field(
        source_column_name="event_timestamp",
        transform=DayTransform(),
        partition_field_name="event_day"
    )

Now every query that filters on event_timestamp automatically benefits from partition pruning. The column stays a timestamp in the schema. No WHERE event_date = '2026-06-24' hack required.
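Under the hood, the day transform maps each timestamp to the number of days since 1970-01-01, and a range filter on the timestamp becomes a range of partition values. The sketch below illustrates that mapping. The helpers day_transform and days_to_scan are hypothetical names, not PyIceberg functions.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Days since the Unix epoch, the value a day transform stores."""
    return (ts - EPOCH).days


def days_to_scan(start: datetime, end: datetime) -> range:
    """Partition values that can contain rows with start <= ts <= end."""
    return range(day_transform(start), day_transform(end) + 1)


start = datetime(2026, 6, 24, 0, 0, tzinfo=timezone.utc)
end = datetime(2026, 6, 24, 23, 59, 59, tzinfo=timezone.utc)
print(list(days_to_scan(start, end)))
```

A filter covering one calendar day resolves to a single partition value, so the engine can skip every file that belongs to another day.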

The bigger win: you can change the partition strategy without rewriting the table. Iceberg supports multiple partition specs across snapshots. Old data stays on the old layout. New data uses the new one. The engine handles both transparently.

The Catalog Decision Matters More Than You Think

Every Iceberg table lives in a catalog. The catalog tracks which metadata file is current. Get this wrong and you either lock yourself into one vendor or end up with metadata conflicts that corrupt tables.

The main options in 2026:

AWS Glue Catalog works well if your entire stack is AWS. Zero operational overhead. But cross-cloud access is painful, and engine compatibility outside of Spark and Athena requires extra configuration.

Nessie / REST Catalog is the open standard. Any engine that supports the Iceberg REST spec can read and write. Nessie adds git-like branching for data, which is genuinely useful for staging ETL results before promoting to prod. Slightly more infra to manage.

Unity Catalog is the right choice if you are on Databricks. Tight governance integration, fine-grained access control at the column level. But it is proprietary, and getting data out to non-Databricks engines requires extra work.

My take: if you are building multi-engine (Spark + Trino + Flink), go REST-compatible from the start. Migrating catalogs later is painful. AWS Glue to REST is doable; Unity to anything else is not fun.

Here is a rough decision guide:

Single cloud (AWS only)    → Glue Catalog
Databricks-primary stack   → Unity Catalog
Multi-engine / multi-cloud → REST Catalog (Nessie or Polaris)

Snapshot Management: The Silent Storage Leak

Every write creates a snapshot. Snapshots reference manifest lists. Manifest lists reference manifest files. Manifest files reference data files.

Without snapshot expiration, you are paying for every historical snapshot indefinitely. The metadata alone can grow into gigabytes. S3 LIST operations against large metadata trees get expensive fast.

-- Expire snapshots older than 7 days, keep at least 5 for safety
CALL iceberg_catalog.system.expire_snapshots(
  table => 'analytics.events',
  older_than => TIMESTAMP '2026-06-17 00:00:00',
  retain_last => 5
)
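How older_than and retain_last interact is easy to misread, so here is a simplified sketch of the selection rule on a linear snapshot history. It assumes a snapshot is expired only when it is older than the cutoff and is not among the retain_last most recent ones. The function name and sample history are hypothetical, and real tables with branches or tags have extra rules.

```python
from datetime import datetime


def snapshots_to_expire(snapshots, older_than, retain_last):
    """Return the IDs that would be expired.

    snapshots: list of (snapshot_id, committed_at) tuples in any order.
    """
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in newest_first[:retain_last]}
    return [
        sid for sid, ts in newest_first
        if ts < older_than and sid not in protected
    ]


# One snapshot per day from June 10 to June 24
history = [(i, datetime(2026, 6, day)) for i, day in enumerate(range(10, 25), start=1)]
cutoff = datetime(2026, 6, 17)
print(snapshots_to_expire(history, cutoff, retain_last=5))
```

Here the seven snapshots from June 10 through June 16 are expired. On a table with only three old snapshots and retain_last set to 5, nothing would be removed.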

After expiring snapshots, orphan files may still exist (files written but never committed to a snapshot):

-- Remove orphan files older than 3 days
-- The 3-day buffer ensures in-progress writes are not deleted
CALL iceberg_catalog.system.remove_orphan_files(
  table => 'analytics.events',
  older_than => TIMESTAMP '2026-06-21 00:00:00'
)

Run these on a schedule. Weekly is fine for most tables. Daily for high-volume streaming tables.
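A scheduled job should not carry hard-coded timestamps like the ones in the examples above. One way to handle that, sketched here, is to compute the cutoffs relative to the run time and render the two statements. The helper maintenance_sql and its parameter names are assumptions. The procedure names and arguments are the ones already shown.

```python
from datetime import datetime, timedelta


def maintenance_sql(table, now, snapshot_days=7, orphan_days=3, retain_last=5,
                    catalog="iceberg_catalog"):
    """Build both maintenance statements with cutoffs relative to `now`."""
    fmt = "%Y-%m-%d %H:%M:%S"
    snap_cutoff = (now - timedelta(days=snapshot_days)).strftime(fmt)
    orphan_cutoff = (now - timedelta(days=orphan_days)).strftime(fmt)
    expire = (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{snap_cutoff}', "
        f"retain_last => {retain_last})"
    )
    orphans = (
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{orphan_cutoff}')"
    )
    return [expire, orphans]


for stmt in maintenance_sql("analytics.events", datetime(2026, 6, 24)):
    print(stmt)  # in the scheduled job, pass each statement to spark.sql(...)
```

Expiration is listed first on purpose, matching the order described above: expire snapshots, then clean up orphan files.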

Time Travel Done Right

One of Iceberg's actual killer features. You can query any historical snapshot:

-- Query the table as it was yesterday at midnight
SELECT *
FROM analytics.events
FOR SYSTEM_TIME AS OF '2026-06-23 00:00:00'
WHERE event_type = 'purchase';

-- Or by snapshot ID (useful when you need a specific pipeline run)
SELECT *
FROM analytics.events
VERSION AS OF 8027658604211071520;

The catch: time travel only works while the snapshot exists. Once you expire it, it is gone. Plan your retention window around your incident response SLA. If your team takes 72 hours to notice a bad pipeline run, keep at least 7 days of snapshots.

Common Mistakes

Not running compaction at all. The default state of most Iceberg tables I have seen is "never been compacted." Set up compaction as part of table creation, not as a fix-it-later task.

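Compacting too aggressively. Running rewrite_data_files too frequently on large tables wastes compute and can conflict with concurrent writes, forcing commit retries. Readers are not blocked, because they keep using the snapshot they started with. Once per day for most tables, twice per day for high-volume ones.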

Using the wrong partition granularity. Partitioning by HOUR makes sense for 10 billion events per day. For 10 million, it creates too many small partitions and kills planning time. Match partition granularity to your data volume.

Picking Glue catalog for a multi-engine stack. You will not feel the pain on day one. You will feel it six months in when you try to add Trino and spend two weeks on catalog configuration.

Not setting write.target-file-size-bytes. The default varies by engine. Set it explicitly in your table properties so file sizes stay consistent regardless of which engine is writing.

ALTER TABLE analytics.events
SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '134217728',
  'write.delete.target-file-size-bytes' = '67108864'
);
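To put rough numbers on the partition granularity point above, the sketch below estimates how much data lands in each partition. The 100 bytes per event figure is an assumption chosen only to keep the arithmetic simple, so measure your own compressed row size before drawing conclusions.

```python
def avg_partition_mb(events_per_day, bytes_per_event, partitions_per_day):
    """Average data volume landing in each partition, in MB."""
    return events_per_day * bytes_per_event / partitions_per_day / 1_000_000


BYTES_PER_EVENT = 100  # assumed compressed size per event

for events in (10_000_000_000, 10_000_000):
    hourly = avg_partition_mb(events, BYTES_PER_EVENT, 24)
    daily = avg_partition_mb(events, BYTES_PER_EVENT, 1)
    print(f"{events:>14,} events/day: hour={hourly:,.0f} MB, day={daily:,.0f} MB")
```

Under that assumption, 10 million events per day puts about 42 MB in each hourly partition, which is less than a single 128 MB target file. Daily partitions hold about 1 GB, enough for a handful of well-sized files.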

What Iceberg Actually Is

Iceberg is a table format specification, not a storage engine. It tells engines how to find data, what schema it has, and which files are current. The engines (Spark, Trino, Flink, Athena) do the actual reading and writing.

This means Iceberg is only as good as the operational practices around it. The format solves real problems: ACID on object storage, schema evolution without rewriting, partition pruning without partition columns in queries. But you still have to run compaction. You still have to expire snapshots. You still have to pick the right catalog.

The teams I have seen succeed with Iceberg treated these maintenance tasks as first-class engineering concerns, not afterthoughts. The ones who struggled treated Iceberg like a managed service and were surprised when it needed managing.

Start with compaction and snapshot expiration automated before you write your first production table. Everything else you can figure out as you go.

Best regards,
Gabriel Henrique Cardoso Antonio 🔗 gabrielh.dev

Data Contracts in Production: Stop Trusting Your Upstream Sources

Gabriel Henrique — Sat, 20 Jun 2026 19:42:39 +0000

Your upstream data source changed a column type last night. Your pipeline ran at 2am, ingested everything without a single error, and by the time your stakeholders opened their dashboards at 9am, the revenue numbers were wrong.

No alert fired. No test failed. The pipeline was technically healthy.

This is the most common and expensive failure mode in data engineering, and it happens because we build systems that trust the data they receive. Data contracts are the fix.

What a Data Contract Actually Is

A data contract is a formal agreement between a data producer and a data consumer that defines what the data looks like, what quality guarantees it carries, and who owns it.

Not documentation. Not a README. An executable specification that can be validated automatically, versioned like code, and broken like an API contract when violated.

Think of it like an API contract, but for your data. A REST API fails loudly with a 400 when you send the wrong payload. A data pipeline fails silently with bad numbers. Contracts change that.

A contract typically covers: schema definition (fields, types, nullability), quality rules (completeness, uniqueness, valid value ranges), SLA metadata (freshness, update frequency), and ownership (who produces this, who consumes it).

Anatomy of a Real Data Contract

Here is what a minimal contract looks like using the open datacontract.yaml format:

dataContractSpecification: 0.9.3
id: orders-v2
info:
  title: Orders Contract
  version: 2.0.0
  owner: data-platform-team
  status: active

models:
  orders:
    description: One row per order placed on the platform
    fields:
      order_id:
        type: string
        required: true
        unique: true
      customer_id:
        type: string
        required: true
      total_amount:
        type: decimal
        required: true
        minimum: 0
      status:
        type: string
        enum: [pending, confirmed, shipped, delivered, cancelled]
      created_at:
        type: timestamp
        required: true

quality:
  type: SodaCL
  specification: |
    checks for orders:
      - row_count > 0
      - missing_count(order_id) = 0
      - duplicate_count(order_id) = 0
      - invalid_count(status) = 0:
          valid values: [pending, confirmed, shipped, delivered, cancelled]
      - freshness(created_at) < 6h

servicelevels:
  freshness:
    description: Data must not be older than 6 hours
    threshold: 6h

This file is checked into Git alongside the dbt models that produce the orders table. When the schema changes, the contract changes. When the contract breaks, the pipeline stops.
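Because the contract is versioned like code, a CI step can compare the proposed contract with the current one and flag breaking changes before merge. The sketch below assumes the fields sections have already been loaded into plain dictionaries, for example with a YAML parser. The function breaking_changes and its three rules are illustrative, not part of the datacontract tooling.

```python
def breaking_changes(old_fields, new_fields):
    """Compare two {field_name: {"type": ..., "required": ...}} mappings.

    Returns descriptions of changes that could break consumers.
    """
    problems = []
    for name, old in old_fields.items():
        new = new_fields.get(name)
        if new is None:
            problems.append(f"field removed: {name}")
        elif new.get("type") != old.get("type"):
            problems.append(
                f"type changed: {name} {old.get('type')} -> {new.get('type')}"
            )
        elif old.get("required") and not new.get("required"):
            problems.append(f"field became nullable: {name}")
    return problems


current = {
    "order_id": {"type": "string", "required": True},
    "total_amount": {"type": "decimal", "required": True},
}
proposed = {
    "order_id": {"type": "string", "required": True},
    "total_amount": {"type": "string", "required": True},  # upstream type change
}
print(breaking_changes(current, proposed))
```

A non-empty result is the signal to fail the build or require a major version bump on the contract.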

Three Places to Enforce Contracts

Most teams put the enforcement in one place and leave gaps everywhere else. You need all three layers.

[Producer / Source System]
        |
        v
[Ingestion Layer]  <-- enforce schema + type contracts here
        |
        v
[Transformation Layer (dbt)]  <-- enforce quality contracts here
        |
        v
[Serving Layer / Warehouse]  <-- enforce SLA and freshness here
        |
        v
[Consumer / Dashboard / LLM / API]

At ingestion you catch schema drift early, before bad data poisons your warehouse. Use Pydantic models to validate incoming records.

At transformation you use dbt tests or Soda checks to enforce business-level quality rules. A row count of zero is not a schema violation, but it is a contract violation.

At serving you monitor freshness and completeness so consumers know the data they are reading meets SLA guarantees.
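The serving-layer check can be very small. The sketch below compares the newest created_at value against the 6-hour threshold from the contract. The function name is hypothetical, and the latest timestamp is assumed to come from a query such as SELECT max(created_at) on the served table.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(latest_created_at, now, threshold=timedelta(hours=6)):
    """Return (is_fresh, age) for the newest row in the table."""
    age = now - latest_created_at
    return age <= threshold, age


now = datetime(2026, 6, 20, 9, 0, tzinfo=timezone.utc)
latest = datetime(2026, 6, 19, 19, 0, tzinfo=timezone.utc)  # newest row seen
fresh, age = check_freshness(latest, now)
print(fresh, age)
```

In this example the newest row is 14 hours old, so the check fails even though every row in the table may be perfectly valid.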

A Real Ingestion Contract with Pydantic

This runs at the top of every ingestion job, before writing a single row to the warehouse:

from pydantic import BaseModel, validator, Field
from decimal import Decimal
from datetime import datetime
from enum import Enum
from typing import Optional
import logging

logger = logging.getLogger(__name__)

class OrderStatus(str, Enum):
    pending = "pending"
    confirmed = "confirmed"
    shipped = "shipped"
    delivered = "delivered"
    cancelled = "cancelled"

class Order(BaseModel):
    order_id: str
    customer_id: str
    total_amount: Decimal = Field(ge=0)  # must be non-negative
    status: OrderStatus
    created_at: datetime
    promo_code: Optional[str] = None  # optional, but we track null rate

    @validator("order_id")
    def order_id_not_empty(cls, v):
        if not v.strip():
            raise ValueError("order_id cannot be blank")
        return v

def validate_and_load(records: list[dict]) -> tuple[list[Order], list[dict]]:
    # Returns (valid_records, failed_records).
    # Never silently drops failures. Log and route to a dead-letter topic.
    valid = []
    failed = []

    for record in records:
        try:
            valid.append(Order(**record))
        except Exception as e:
            logger.error(f"Contract violation: {e} | Record: {record}")
            failed.append({"record": record, "error": str(e)})

    # Fail the pipeline if more than 1% of records are invalid.
    # Guard against an empty batch so we never divide by zero.
    failure_rate = len(failed) / len(records) if records else 0.0
    if failure_rate > 0.01:
        raise RuntimeError(
            f"Contract breach: {failure_rate:.1%} of records failed validation "
            f"({len(failed)} / {len(records)})"
        )

    return valid, failed

Two decisions here worth explaining.

First, the 1% threshold. You do not want to fail the pipeline on a single bad record, but you also do not want to silently ingest garbage. Set a threshold that reflects your tolerance and make it explicit in the code.

Second, the dead-letter queue. Every failed record should go somewhere observable. If you drop it, it is gone forever. If you log it, you can replay it after fixing the issue.
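As a rough illustration of the dead-letter idea, the sketch below appends failed records to a JSON Lines file and reads them back for replay. A local file stands in for a real queue or topic here. The file name and both helper functions are assumptions, and the input uses the same record and error shape that validate_and_load returns.

```python
import json
import tempfile
from pathlib import Path


def write_dead_letters(failed, path):
    """Append failed records to a JSON Lines file, one record per line."""
    with open(path, "a", encoding="utf-8") as f:
        for item in failed:
            # default=str keeps non-JSON types (Decimal, datetime) readable
            f.write(json.dumps(item, default=str) + "\n")


def read_dead_letters(path):
    """Load the original records back for replay once the fix has shipped."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["record"] for line in f if line.strip()]


dlq_path = Path(tempfile.mkdtemp()) / "orders_dead_letter.jsonl"
write_dead_letters(
    [{"record": {"order_id": " ", "status": "pending"},
      "error": "order_id cannot be blank"}],
    dlq_path,
)
print(read_dead_letters(dlq_path))
```

Storing the error next to the record makes it possible to group failures by cause before deciding what to replay.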

Common Mistakes

Treating contracts as documentation. A YAML file that nobody checks is just noise. The contract has to run automatically, fail fast, and block bad data from propagating.

Putting all validation at one layer. Schema is not the same as quality. You can have perfectly typed data that is 90% null. Both need contracts.

Versioning contracts separately from the code. When a producer changes a column, the contract and the dbt model and the ingestion code all need to change together. Keep them in the same repo, reviewed in the same PR.

Using blocking contracts everywhere from day one. You will break things. Start with logging-only mode, measure your actual failure rates, then flip to hard-blocking after you understand the baseline.

Ignoring freshness SLAs. A technically correct dataset from 14 hours ago is a broken contract for a real-time dashboard. Freshness is a first-class quality dimension.
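The logging-only rollout mentioned above can be a single switch in the enforcement step. This is a minimal sketch. The mode names "warn" and "block" and the function enforce are assumptions, not a feature of any particular contract tool.

```python
import logging

logger = logging.getLogger("contracts")


def enforce(failure_rate, threshold=0.01, mode="warn"):
    """Apply a contract result in 'warn' (log only) or 'block' mode.

    Returns True when the batch may proceed.
    """
    if failure_rate <= threshold:
        return True
    message = f"Contract breach: {failure_rate:.1%} failed (threshold {threshold:.1%})"
    if mode == "block":
        raise RuntimeError(message)
    logger.warning(message)  # logging-only: record the breach, let the batch through
    return True


print(enforce(0.05, mode="warn"))
```

Once the warning logs show a stable baseline, flipping the mode to "block" turns the same check into a hard gate without touching the validation logic.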

When Contracts Are Not Worth the Investment

Not every dataset needs a formal contract. Internal scratch tables, exploratory datasets, and one-off analyses do not need this overhead.

Contracts pay off when the data crosses a team or system boundary. If another team, application, or AI system depends on your data, you need a contract. If it breaks for them, you will spend more time debugging than you saved by skipping the contract in the first place.

The ROI is clearest in two scenarios: high-value production pipelines (revenue, product metrics, ML features) and AI/LLM systems consuming structured data. An LLM receiving malformed features will not throw an exception. It will just produce worse outputs. Contracts at the feature serving layer are non-negotiable for production AI.

The Shift Happening Right Now

The industry is moving toward contracts-first development. Write the contract before you write the pipeline. Define what the output should look like, what quality guarantees it carries, and who owns it. Then build to meet that spec.

It is the same discipline that made API development more reliable. The data ecosystem is just a few years behind on this.

In 2026, with AI systems consuming data directly, a schema break is no longer just a broken dashboard. It is a broken model, a wrong recommendation, a compounding error in an automated pipeline that nobody noticed. The cost of trust without verification has gone up significantly.

If your pipelines have never failed because of an upstream schema change, consider yourself lucky. Put contracts in place before that luck runs out.

Best regards,
Gabriel Henrique Cardoso Antonio 🔗 gabrielh.dev

Agentic Data Engineering in 2026: How to Build Pipelines That AI Agents Can Actually Use

Gabriel Henrique — Wed, 17 Jun 2026 00:50:11 +0000

If you've spent the last few years building data pipelines, you know the drill: ingest, transform, load. Maybe some orchestration on top. Solid work — the kind that keeps dashboards green and analysts happy.

But something changed in 2026. Your pipeline's new consumer isn't a BI tool or a SQL query. It's an AI agent — and agents are a very different kind of hungry.

Welcome to agentic data engineering. Buckle up.

What's an "Agentic" Data System, Exactly?

Let's back up a second. An AI agent is a system that perceives its environment, reasons about it, and takes actions to reach a goal — without needing a human to hold its hand at every step.

Think of it like the difference between a GPS that tells you turn-by-turn directions (traditional AI) and one that books your hotel, reschedules your meeting, and orders food for when you arrive (agentic AI). One follows instructions. The other acts.

For agents to act, they need data. But not just any data — context-rich, semantically meaningful, machine-readable data. And that's where data engineers come in.

The cold truth: most existing data pipelines aren't built for this. They were designed for humans (or human-readable BI tools) as the end consumer. Agents need something different.

The Context Engineering Problem

Here's a concrete example. Say you have a sales table with a column called status. Values: A, B, C.

A human analyst knows that A = active, B = blocked, C = churned because they read the Confluence doc from 2022 (the one that's three Notion migrations out of date). An AI agent? It has no idea. It'll guess — and guessing at 2am during an automated pipeline run is a great way to corrupt a report.

This is the context engineering problem: your data is technically correct but semantically opaque.

Context engineering is the practice of designing data systems that embed rich, machine-readable context alongside the data itself. Gartner has predicted that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. In practice, a lot of that traces back to missing data foundations rather than bad models: bare schemas, unclear ownership, no lineage, inconsistent definitions.

Sound familiar?

What Agents Actually Need From Your Pipeline

Let's get practical. Here's what makes a data system "agent-ready":

1. Rich Metadata and Semantic Descriptions

Every table, column, and field should have a description an agent can read and reason about — not just a name.

-- Bad: An agent sees "status" and guesses
CREATE TABLE sales (
  id INT,
  status VARCHAR(1)
);

-- Good: Metadata makes intent explicit
COMMENT ON COLUMN sales.status IS 
  'Customer lifecycle status. Values: A=active (paying), B=blocked (payment issue), C=churned (cancelled)';

Modern data catalogs (like DataHub, Amundsen, or OpenMetadata) can store this metadata in a way agents can query via API. If you're not using one, now is a very good time to start.
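To show what "queryable by an agent" can mean in practice, here is a small sketch that turns stored column documentation into text an agent can place in its context. The in-memory dictionary stands in for a catalog API response. COLUMN_DOCS and describe_column are hypothetical names.

```python
COLUMN_DOCS = {
    ("sales", "status"): {
        "description": "Customer lifecycle status",
        "values": {
            "A": "active (paying)",
            "B": "blocked (payment issue)",
            "C": "churned (cancelled)",
        },
    },
}


def describe_column(table, column):
    """Render a column's documentation as plain text for an agent's context."""
    doc = COLUMN_DOCS.get((table, column))
    if doc is None:
        return f"{table}.{column}: no description available"
    values = ", ".join(f"{code}={meaning}" for code, meaning in doc["values"].items())
    return f"{table}.{column}: {doc['description']}. Values: {values}"


print(describe_column("sales", "status"))
```

The explicit "no description available" branch matters: it tells the agent that it is about to guess, instead of letting it guess silently.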

2. Data Lineage That's Actually Up-to-Date

An agent running a pipeline needs to understand: where did this data come from? What transformations touched it? If something breaks, what else is affected?

Tools like dbt generate lineage graphs automatically from your SQL models. Here's a minimal dbt model with proper documentation:

# models/schema.yml
models:
  - name: customer_lifetime_value
    description: >
      Calculates CLV per customer using the last 90 days of transactions.
      Refreshed daily at 3am UTC. Source: raw.transactions joined with dim.customers.
    columns:
      - name: customer_id
        description: Unique identifier. FK to dim.customers.customer_id
      - name: clv_usd
        description: Estimated lifetime value in USD. Null if customer has < 3 transactions.

That description block? An agent can read it, understand what the model does, and decide whether it's the right source for a given task. Without it, the agent is flying blind.
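The "what else is affected?" question is a graph traversal over the lineage. Here is a minimal sketch using a breadth-first walk. The lineage dictionary is hypothetical: only raw.transactions, dim.customers, and customer_lifetime_value come from the example above, and the downstream model names are invented for illustration.

```python
from collections import deque

# Hypothetical lineage: each key feeds the models listed in its value
LINEAGE = {
    "raw.transactions": ["customer_lifetime_value"],
    "dim.customers": ["customer_lifetime_value"],
    "customer_lifetime_value": ["churn_features", "exec_dashboard"],
    "churn_features": ["churn_model_scores"],
}


def downstream(node, lineage=LINEAGE):
    """Breadth-first walk returning everything affected if `node` breaks."""
    seen, queue = [], deque([node])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


print(downstream("raw.transactions"))
```

An agent that can run this kind of lookup knows which consumers to pause or warn when an upstream source fails.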

3. Embeddings and Vector-Ready Outputs

This one trips people up. Traditional pipelines output structured tables. Agentic pipelines often need to also output embeddings — vector representations of your data that LLMs can use for semantic search and RAG (Retrieval-Augmented Generation).

Here's a simple example using Python and the open-source sentence-transformers library (a hosted service such as OpenAI's embedding API slots into the same step):

from sentence_transformers import SentenceTransformer
import pandas as pd

model = SentenceTransformer("all-MiniLM-L6-v2")

# Your product catalog as a dataframe
df = pd.read_parquet("products.parquet")

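# Generate embeddings from a meaningful text representation
# (fillna avoids NaN when a description or category is missing)
df["text_repr"] = (
    df["name"].fillna("") + ". " + df["description"].fillna("")
    + ". Category: " + df["category"].fillna("")
)
# Encode the whole column in one batched call instead of row by row
df["embedding"] = model.encode(df["text_repr"].tolist()).tolist()

# Write a Parquet staging file; a separate job can load it into
# a vector store (e.g., pgvector, Pinecone, Weaviate)
df[["product_id", "embedding"]].to_parquet("products_embeddings.parquet")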

The key idea: you're not replacing your existing pipeline — you're extending it. The structured table feeds your dashboards. The embeddings feed your agents.

4. Schema Drift Detection

Here's a nightmare scenario: an upstream team renames a column. Your pipeline doesn't catch it. The agent downstream starts ingesting garbage. Nobody notices until a report goes out with completely wrong numbers.

Schema drift detection is one of the highest-impact agentic data engineering tasks identified in the SIGMOD 2026 Data Agents tutorial. Integrate it into your orchestration:

# Using Great Expectations (1.x API) for schema validation
import great_expectations as gx

context = gx.get_context()

# Define expectations: column "user_id" must exist and be non-null
suite = context.suites.add(gx.ExpectationSuite(name="sales_suite"))
suite.add_expectation(
    gx.expectations.ExpectColumnToExist(column="user_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id")
)

# Run validation before anything touches the data.
# Assumes a checkpoint named "sales_checkpoint" has already been configured
# with a validation definition that uses this suite.
checkpoint = context.checkpoints.get("sales_checkpoint")
result = checkpoint.run()
if not result.success:
    raise ValueError(f"Schema validation failed: {result}")

Fail fast, fail loud. An agent that ingests bad data quietly is worse than a pipeline that crashes.

A Mental Model: The Conveyor Belt vs. The Smart Warehouse

Here's an analogy that might help it click.

Traditional data pipelines are like a conveyor belt in a factory: raw materials go in one end, finished goods come out the other. Fast, reliable, predictable. But the conveyor belt doesn't know what it's carrying. It doesn't label boxes. It doesn't track where things came from. It just moves.

An agent-ready data system is more like a smart warehouse: every item has a barcode, a location, a history, and a description. Robots can navigate it because everything is labeled and organized. You can ask "where are all the items from Supplier X that arrived in Q1?" and get an instant answer.

Your job in 2026? Build the smart warehouse, not just the conveyor belt.

What to Do This Week

You don't need to rip out your stack and start over. Here's a practical starting point:

Audit your most critical tables: Do they have column descriptions? Add them in your catalog or directly in dbt.
Enable lineage tracking: If you're on dbt, it's already there. Expose it via the dbt API or push it to DataHub.
Pick one pipeline to make vector-ready: Add an embedding generation step as a separate job. Don't break what works — extend it.
Add a schema validation checkpoint: Use Great Expectations, Soda, or dbt tests. Run it before anything hits production.

None of this takes a week. The column descriptions alone can take an afternoon. But six months from now, when your team is deploying AI agents that actually work because your data is clean and semantically rich? You'll be very glad you started today.

Conclusion

The rise of agentic AI doesn't make data engineers obsolete — it makes the craft harder and more important. Anyone can wire up an LLM to a database. Making that LLM reliably useful for autonomous agents? That requires real data engineering skill.

Context engineering, lineage, schema validation, vector outputs — these aren't buzzwords. They're the new checklist. The engineers who build these foundations now are the ones who'll be building the most interesting systems in 2027.

Go make your pipelines agent-ready. Your future AI coworkers are counting on you.

Best regards,

Gabriel Henrique Cardoso Antonio
🔗 gabrielh.dev

Your ETL Pipeline Wasn't Built for AI — Here's How to Fix It in 2026

Gabriel Henrique — Tue, 09 Jun 2026 00:33:51 +0000

You've got a beautiful data pipeline. It extracts from your sources, transforms everything cleanly, loads into the warehouse on schedule. Tests pass. Stakeholders are happy. Life is good.

Then someone says: "Can we plug this into our LLM?"

And suddenly your beautiful pipeline is useless.

Not because it's broken — it works perfectly for what it was designed to do. The problem is that traditional ETL was designed for SQL queries, dashboards, and human analysts. LLMs need something fundamentally different: context, meaning, and vectors. And if your pipeline doesn't produce those, your AI is flying blind.

This is the silent crisis in data engineering right now. Companies are spending millions on LLM infrastructure while their underlying data pipelines are still shipping rows and columns to a warehouse that an AI can barely reason about.

Let's fix that.

What Does an LLM Actually Need From Your Data?

When a human analyst queries your warehouse, they write SQL. They're smart enough to know that status = 'churned' means a customer who cancelled their subscription. They bring their own context.

An LLM doesn't have that luxury — at least not without help. When you ask a model "why are enterprise customers churning?", it can't just run SELECT * FROM churn_events. It needs semantically relevant context — passages, records, or summaries that are meaning-close to the question being asked.

That's where RAG (Retrieval-Augmented Generation) comes in.

Think of RAG like this: instead of the LLM trying to remember everything (it can't — its context window is finite), you build a library. Every time the LLM needs to answer a question, it walks into that library, finds the most relevant pages, and reads them before answering.

Your job as a data engineer is to build and maintain that library.

And a library isn't a database. A library is organized by meaning, not by rows and columns.

The AI-Native Pipeline: What's Different

Here's the shift in mindset. Traditional ETL produces:

raw data → clean tables → warehouse → SQL queries

An AI-native pipeline produces:

raw data → cleaned chunks → embeddings → vector store → semantic retrieval

The new ingredients are chunks, embeddings, and a vector store. Let's break each down.

Chunks

You can't feed an entire database table into an LLM. Even if you could, it would be wasteful and noisy. Instead, you break your data into chunks — small, meaningful pieces of text that can be retrieved independently.

A chunk might be:

A paragraph from a customer support ticket
A 3-sentence description of a product
A summarized row of metadata about a sales event

The art here is in the chunking strategy. Too small, and a chunk loses its context. Too large, and you're wasting tokens and retrieval precision. In practice, 512–1024 tokens with ~10% overlap between chunks is a solid starting point.
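The overlap rule is easiest to see on a plain token list. In this minimal sketch, whitespace-separated words stand in for model tokens, and chunk_tokens is a hypothetical helper rather than a library function.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.10):
    """Split a token list into windows that share `overlap_ratio` of their length."""
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Whitespace words stand in for model tokens in this sketch
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])
```

Each chunk repeats the last 51 tokens of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.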

Embeddings

An embedding is a list of numbers — a vector — that represents the meaning of a piece of text. Two texts with similar meanings will have vectors that are close together in space, even if they use completely different words.

"Customer stopped paying" and "subscription was cancelled due to billing failure" have very different words. But in vector space, they're neighbors.

That's the magic. And it's what makes semantic search possible.
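"Close together" usually means high cosine similarity. The sketch below computes it by hand on made-up 3-dimensional vectors. Real embedding models produce hundreds or thousands of dimensions, and the numbers here are invented purely to show the calculation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" for three pieces of text
stopped_paying = [0.9, 0.1, 0.0]
billing_cancel = [0.8, 0.2, 0.1]
new_feature = [0.0, 0.1, 0.9]

print(round(cosine_similarity(stopped_paying, billing_cancel), 3))
print(round(cosine_similarity(stopped_paying, new_feature), 3))
```

The first pair scores close to 1 and the second close to 0, which is exactly the signal a retrieval step ranks on.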

Vector Store

A vector store is a database optimized for one special kind of query: "give me the N vectors most similar to this query vector." Systems like pgvector, Qdrant, Chroma, and Weaviate are built exactly for this.

Building the Pipeline: A Practical Walkthrough

Let's get concrete. Here's a complete AI-native ingestion pipeline in Python.

Step 1: Load and Chunk Your Data

# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Imagine this comes from your warehouse, S3, or an API
raw_documents = [
    {"id": "ticket_001", "text": "Customer says dashboard is not loading. Error 502. Happened after the deploy on June 3rd. They're on the Enterprise plan."},
    {"id": "ticket_002", "text": "User cannot export reports to CSV. The button is greyed out. They say it worked last week. Basic plan."},
    # ... thousands more
]

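splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # measured in characters here, because length_function=len
    chunk_overlap=50,     # roughly 10% overlap between neighboring chunks
    length_function=len,  # swap in a tokenizer-based counter to size chunks in tokens
)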

chunks = []
for doc in raw_documents:
    splits = splitter.split_text(doc["text"])
    for i, chunk in enumerate(splits):
        chunks.append({
            "id": f"{doc['id']}_chunk_{i}",
            "text": chunk,
            "source_id": doc["id"]
        })

print(f"Created {len(chunks)} chunks from {len(raw_documents)} documents")

The RecursiveCharacterTextSplitter is smart: by default it tries to split on paragraph breaks first, then line breaks, then spaces, and only falls back to cutting between characters when nothing else fits. It keeps semantic boundaries intact wherever it can.

Step 2: Generate Embeddings

from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def embed_batch(texts: list[str], model="text-embedding-3-large") -> list[list[float]]:
    """Embed a batch of texts. Returns list of vectors."""
    response = client.embeddings.create(
        input=texts,
        model=model,
        dimensions=1024  # trade-off: smaller = cheaper, slightly less precise
    )
    return [item.embedding for item in response.data]

# Process in batches to respect rate limits
BATCH_SIZE = 100
embedded_chunks = []

for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i + BATCH_SIZE]
    texts = [c["text"] for c in batch]
    vectors = embed_batch(texts)

    for chunk, vector in zip(batch, vectors):
        embedded_chunks.append({**chunk, "vector": vector})

print(f"Embedded {len(embedded_chunks)} chunks")

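Two things to notice here: we're batching (OpenAI has rate limits, and one request per batch means far fewer round trips), and we're using dimensions=1024 instead of the default 3072. The API bills per input token, so shorter vectors don't lower the embedding bill, but they cut vector storage and index size to a third. For most use cases the loss in retrieval quality is small. Worth it.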
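The storage side of that trade-off is simple arithmetic. The sketch below assumes 4-byte floats and a corpus of 10 million chunks, both illustrative numbers, and ignores index and per-row overhead.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_gb(num_vectors, dimensions):
    """Raw storage for float32 vectors, ignoring index and row overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9


n = 10_000_000  # assumed corpus size
print(raw_vector_gb(n, 3072), raw_vector_gb(n, 1024))
```

At that scale the full-size vectors need roughly 123 GB of raw storage and the 1024-dimension ones roughly 41 GB, before any index is built on top.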

Step 3: Store in a Vector Database

Here's the storage step using pgvector (PostgreSQL with vector support), a great choice if you're already running Postgres and don't want another managed service:

import psycopg2
import json

conn = psycopg2.connect("postgresql://user:password@localhost:5432/mydb")
cur = conn.cursor()

# One-time setup: enable the extension and create the table
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id TEXT PRIMARY KEY,
        source_id TEXT,
        content TEXT,
        embedding vector(1024),
        created_at TIMESTAMPTZ DEFAULT NOW()
    );
""")
cur.execute("""
    CREATE INDEX IF NOT EXISTS doc_chunks_embedding_idx 
    ON doc_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
""")
conn.commit()

# Insert the embedded chunks
for chunk in embedded_chunks:
    cur.execute("""
        INSERT INTO doc_chunks (id, source_id, content, embedding)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (id) DO UPDATE SET
            content = EXCLUDED.content,
            embedding = EXCLUDED.embedding;
    """, (
        chunk["id"],
        chunk["source_id"],
        chunk["text"],
        json.dumps(chunk["vector"])
    ))

conn.commit()
cur.close()
conn.close()
print("All chunks stored in pgvector!")

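The ivfflat index is what makes queries fast at scale. Without it, every query does a sequential scan and compares the query vector against every row. With it, Postgres clusters vectors into "lists" and searches only the most promising ones: approximate nearest neighbor search, trading a little recall for a lot of speed. One caveat: ivfflat picks its cluster centers from the rows that exist when the index is built, so on a fresh table it is better to build the index after the initial load than before it, as the script above does for brevity.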
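
The lists = 100 in the index definition is a tuning knob, not a constant. As a hedged starting point, the sketch below encodes the rule of thumb from the pgvector README as I understand it (rows / 1000 lists up to one million rows, sqrt(rows) beyond that, and probes of about sqrt(lists)); the function name is my own, and the right values for your data should come from measuring recall.

```python
import math


def ivfflat_starting_params(row_count: int) -> dict[str, int]:
    """Suggest starting values for ivfflat `lists` and query-time `probes`."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = math.isqrt(row_count)
    probes = max(1, math.isqrt(lists))
    return {"lists": lists, "probes": probes}


print(ivfflat_starting_params(100_000))    # {'lists': 100, 'probes': 10}
print(ivfflat_starting_params(4_000_000))  # {'lists': 2000, 'probes': 44}
```

The probes value is applied per session with SET ivfflat.probes = 10; before running the similarity query. Higher values improve recall at the cost of speed.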

Step 4: Retrieval at Query Time

def retrieve_relevant_chunks(query: str, top_k: int = 5) -> list[dict]:
    """Given a natural language query, find the most relevant stored chunks."""

    # Embed the query using the same model
    query_vector = embed_batch([query])[0]

    conn = psycopg2.connect("postgresql://user:password@localhost:5432/mydb")
    cur = conn.cursor()

    cur.execute("""
        SELECT id, source_id, content,
               1 - (embedding <=> %s::vector) AS similarity
        FROM doc_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s;
    """, (json.dumps(query_vector), json.dumps(query_vector), top_k))

    results = [
        {"id": row[0], "source_id": row[1], "content": row[2], "similarity": row[3]}
        for row in cur.fetchall()
    ]

    cur.close()
    conn.close()
    return results

# Try it out
results = retrieve_relevant_chunks("why are enterprise users reporting errors after deploys?")
for r in results:
    print(f"[{r['similarity']:.3f}] {r['content'][:100]}...")

The <=> operator is pgvector's cosine distance. 1 - cosine_distance = cosine_similarity. The results will be the chunks most semantically close to your query — even if they don't share a single keyword with it.
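
If the relationship between distance and similarity feels abstract, this small pure-Python sketch computes both by hand. It mirrors the math only; pgvector does this natively, and the two-dimensional vectors are toy inputs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity the <=> operator orders by: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite)
```

Distance runs from 0 to 2, so the similarity column in the query above runs from 1 down to -1, and ordering by distance ascending puts the closest chunks first.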

Practical Tips for 2026

1. Don't skip metadata. Store the source ID, timestamp, author, and any other context alongside your vectors. Metadata filtering (e.g., "only search tickets from Enterprise customers in the last 30 days") is often more important than the semantic search itself.

2. Re-embed when the model changes. If you upgrade from text-embedding-3-small to text-embedding-3-large, you need to re-embed everything. Different models produce incompatible vector spaces. Build this into your pipeline versioning from day one.

3. Evaluate retrieval quality separately from generation quality. The #1 mistake is blaming the LLM when the real problem is your retrieval. If the right chunks aren't being retrieved, the best model in the world will give you garbage. Use tools like RAGAS to measure retrieval precision/recall independently.

4. pgvector is enough for most teams. Unless you're storing hundreds of millions of vectors, you don't need a dedicated vector database. pgvector in your existing Postgres is simpler to operate, cheaper, and lets you join vectors with your regular tables. Optimize later if you need to.

5. Chunking is your most impactful lever. Changing the LLM might give you 5% better answers. Fixing your chunking strategy might give you 40%. It's unglamorous, but it's where the results are.
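
Tip 3 is the easiest of these to start on by hand. Below is a minimal sketch of precision and recall at k for a single query, assuming you have a small hand-labeled set of relevant chunk IDs. It is a plain set-overlap calculation, not how RAGAS computes its metrics, and the ticket IDs are made up.

```python
def precision_recall_at_k(retrieved_ids: list[str],
                          relevant_ids: set[str],
                          k: int) -> tuple[float, float]:
    """Precision@k and recall@k for one query."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical labels: what the retriever returned vs. what a human marked relevant
retrieved = ["t1", "t7", "t3", "t9", "t4"]
relevant = {"t1", "t3", "t8"}

precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
print(f"precision@5 = {precision:.2f}")  # precision@5 = 0.40
print(f"recall@5    = {recall:.2f}")     # recall@5    = 0.67
```

Averaging these over a few dozen labeled queries gives a baseline you can rerun every time the chunking or embedding setup changes.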

The Bigger Picture

The shift to AI-native data engineering isn't about throwing away what you've built. It's about extending it.

Your bronze/silver/gold lakehouse layers? Still valid — but add a "semantic layer" where data is chunked, embedded, and indexed for retrieval. Your Airflow DAGs? Still valid — add a daily job that re-embeds new documents and updates the vector store. Your data quality checks? Still valid — add checks for embedding freshness and retrieval coverage.
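
An embedding-freshness check can start very small. The sketch below assumes each stored chunk carries an `embedding_version` tag recording the model and dimensions used; that field and the helper names are hypothetical additions, not part of the schema shown earlier.

```python
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 1024


def embedding_version(model: str = EMBEDDING_MODEL,
                      dimensions: int = EMBEDDING_DIMENSIONS) -> str:
    """Version tag to store next to every vector."""
    return f"{model}@{dimensions}"


def find_stale_chunks(chunks: list[dict], current_version: str) -> list[str]:
    """IDs of chunks embedded with a different model or dimension setting."""
    return [c["id"] for c in chunks
            if c.get("embedding_version") != current_version]


stored = [
    {"id": "ticket_001_chunk_0", "embedding_version": "text-embedding-3-large@1024"},
    {"id": "ticket_002_chunk_0", "embedding_version": "text-embedding-3-small@1536"},
    {"id": "ticket_003_chunk_0"},  # embedded before versions were tracked
]

print(find_stale_chunks(stored, embedding_version()))
# ['ticket_002_chunk_0', 'ticket_003_chunk_0']
```

A daily job can re-embed whatever this returns, and a data quality check can fail the pipeline if the list is not empty after the job runs.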

Think of it as adding a new output format to your pipelines. You've always produced clean tables. Now you also produce vector indexes. Same discipline, new artifact.

The engineers who learn to build both will be the ones building the AI systems that actually work — the ones where the model has the context it needs to be genuinely useful, not just impressively fluent.

Your pipeline deserves to be as smart as the AI it's feeding.

Best regards,

Gabriel Henrique Cardoso Antonio
🔗 gabrielh.dev

Context Engineering: The Skill Replacing Prompt Engineering in 2026

Gabriel Henrique — Thu, 04 Jun 2026 12:52:06 +0000

If you've been calling yourself a "prompt engineer" for the past two years, it's time to update your vocabulary — and your mental model.

In 2026, the real leverage when building LLM-powered systems isn't in crafting the perfect sentence. It's in context engineering: designing everything an LLM sees before it ever generates a response. Andrej Karpathy popularized the term in mid-2025, and it has since taken over serious AI engineering discussions.

This article breaks down what context engineering actually is, why it matters more than prompt writing, and gives you concrete techniques you can apply today.

What Is Context Engineering?

Context engineering is the discipline of systematically designing the information environment that surrounds a prompt. Where prompt engineering asks "what should I tell the model to do?", context engineering asks "what does the model need to know to do it well?"

Think of it this way: a doctor doesn't just answer the question you ask on the spot. They look at your chart, your history, your vitals, and then respond. Context engineering is building that chart for your LLM.

The context window is the LLM's working memory — everything it can "see" at once. In 2026, these windows are massive:

Claude Opus 4.x: 200K tokens
GPT-4o: 128K tokens
Gemini 2.5 Flash: Up to 1M tokens

But bigger isn't automatically better. More tokens = more cost, more latency, and a real risk of what researchers call the "lost-in-the-middle" problem — where models process information at the beginning and end of the context more reliably than content buried in the middle.

Why This Matters for Data Engineers

Data engineers are increasingly building pipelines that feed LLMs: RAG systems, AI copilots for data quality, agents that write and review SQL, tools that summarize data lineage. In every one of these systems, the quality of what lands in the context window directly determines output quality.

A poorly designed context is like feeding a senior analyst a jumbled mess of raw logs and asking for an executive summary. Technically possible — but you'll get garbage.

Core Techniques

1. Strategic Positioning

LLMs don't read context uniformly. Research consistently shows they pay more attention to the beginning and end of the context window. So:

Put critical instructions and persona definitions at the start
Put the most relevant retrieved data near the end, close to the user query
Move supporting or low-priority content to the middle

# BAD: query buried in the middle
context = system_instructions + docs_and_examples + user_query + more_examples

# GOOD: query at the end, most relevant data just before it
context = system_instructions + background_context + retrieved_chunks + user_query
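
A slightly fuller sketch of the same ordering, assuming retrieved chunks arrive as (score, text) pairs with higher scores meaning more relevant. The function name and the sample strings are illustrative only.

```python
def assemble_context(system_instructions: str,
                     background: str,
                     scored_chunks: list[tuple[float, str]],
                     user_query: str) -> str:
    """Instructions first, query last, best retrieved chunk right before the query."""
    # Sort ascending so the highest-scoring chunk ends up closest to the query.
    ordered = sorted(scored_chunks, key=lambda pair: pair[0])
    retrieved = "\n\n".join(text for _, text in ordered)
    return "\n\n".join([system_instructions, background, retrieved, user_query])


context = assemble_context(
    system_instructions="You are a senior data engineer.",
    background="The warehouse runs nightly batch loads.",
    scored_chunks=[
        (0.91, "Load job failed on a schema mismatch."),
        (0.42, "Cluster was resized last month."),
    ],
    user_query="Why did last night's load fail?",
)
print(context)
```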

2. Selective Retrieval Over Full Documents

Don't dump entire documents into the context. Use semantic chunking + vector search to retrieve only relevant paragraphs.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

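def retrieve_relevant_chunks(query, chunks, top_k=5):
    # Normalize embeddings so the dot product below is cosine similarity
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    scores = chunk_embs @ query_emb  # one score per chunk, even for a single chunk
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_indices]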

3. Context Caching (Huge Cost Savings)

Both Claude and Gemini support prompt caching — storing repeated context server-side so you only pay full price once.

import anthropic
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a senior data engineer..."},
        {
            "type": "text",
            "text": open("schema_definitions.txt").read(),
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

Prompt caching reduces cost by 75–90% on cached tokens. At scale, this is the difference between a viable product and a budget disaster.
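
To see where a number in that range comes from, here is a back-of-the-envelope sketch. Every figure in it is an assumption: the price per million input tokens is hypothetical, the 0.10 read multiplier corresponds to the 90% end of the range above, and the 1.25 write multiplier models a premium on the request that populates the cache. Check your provider's pricing page for real values.

```python
def input_cost(prefix_tokens: int, suffix_tokens: int, requests: int,
               price_per_mtok: float,
               cache_write_multiplier: float = 1.25,
               cache_read_multiplier: float = 0.10) -> tuple[float, float]:
    """Return (uncached, cached) input cost for `requests` calls sharing one prefix.

    Assumes requests >= 1 and that every call lands within the cache lifetime.
    """
    per_token = price_per_mtok / 1_000_000
    uncached = requests * (prefix_tokens + suffix_tokens) * per_token
    # The first request writes the cache; the remaining ones read from it.
    prefix_cost = prefix_tokens * per_token * (
        cache_write_multiplier + (requests - 1) * cache_read_multiplier
    )
    cached = prefix_cost + requests * suffix_tokens * per_token
    return uncached, cached


uncached, cached = input_cost(
    prefix_tokens=50_000,  # e.g. schema definitions reused on every call
    suffix_tokens=500,     # the part that changes per request
    requests=1_000,
    price_per_mtok=5.0,    # hypothetical price in dollars
)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}  "
      f"saved: {1 - cached / uncached:.0%}")
```

With these inputs the saving works out to about 89%. It shrinks as the uncached suffix grows relative to the shared prefix, which is a good reason to keep stable content at the front of the prompt.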

4. Structured Context Formats

Use XML tags or clear delimiters to separate context sections — LLMs respond better to structured input:

def build_structured_context(schema, recent_errors, user_query):
    errors_str = "\n".join(recent_errors[-10:])
    return f"<schema>\n{schema}\n</schema>\n\n<recent_errors>\n{errors_str}\n</recent_errors>\n\n<question>\n{user_query}\n</question>"

5. Dynamic Context Compression

As conversations grow, implement rolling summarization instead of truncating from the start:

def compress_history(messages, max_tokens=4000):
    if estimate_tokens(messages) <= max_tokens:
        return messages
    recent = messages[-10:]
    summary = summarize_with_llm(messages[:-10])
    return [{"role": "system", "content": f"Prior summary: {summary}"}, *recent]
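
The snippet above relies on two helpers it never defines. Here is a self-contained version with stand-ins for both, so the flow can be run and tested end to end. The 4-characters-per-token estimate is a rough heuristic for English text, and `summarize_with_llm` is a placeholder where a real model call would go; both are assumptions of this sketch.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return sum(len(m["content"]) for m in messages) // 4


def summarize_with_llm(messages: list[dict]) -> str:
    """Placeholder: a real pipeline would call a model here."""
    return f"{len(messages)} earlier messages omitted"


def compress_history(messages: list[dict], max_tokens: int = 4000,
                     keep_recent: int = 10) -> list[dict]:
    """Keep the most recent turns verbatim and fold older ones into a summary."""
    if estimate_tokens(messages) <= max_tokens:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not older:
        return recent  # nothing old enough to summarize
    summary = summarize_with_llm(older)
    return [{"role": "system", "content": f"Prior summary: {summary}"}, *recent]


history = [{"role": "user", "content": "x" * 1000} for _ in range(30)]
compressed = compress_history(history)
print(len(history), "->", len(compressed))  # 30 -> 11
print(compressed[0]["content"])             # Prior summary: 20 earlier messages omitted
```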

Context Engineering Checklist

[ ] System prompt at the very beginning of context?
[ ] User query at or near the end?
[ ] Retrieving relevant chunks instead of full documents?
[ ] Repeated blocks cached (system prompts, schemas, docs)?
[ ] Context sections clearly delimited?
[ ] Compression strategy for long conversations?
[ ] Measured token usage and cost per request?

The Shift in Mindset

Prompt engineering is about what you say. Context engineering is about what you provide.

The best LLM outputs in production systems today come from engineers who think carefully about information architecture — what goes in the context window, in what order, how much of it, and how it's structured. That's an engineering discipline, not a writing exercise.

If you're building data pipelines that feed AI systems, this is now part of your stack. Treat context design with the same rigor you'd apply to schema design or query optimization.

Cheers,

Gabriel Henrique — Data Engineer | ETL/ELT | Databricks | Azure

🔗 gabrielh.dev

Unlock AI’s Hidden Power: The Ultimate Guide to Prompt Engineering

Gabriel Henrique — Sun, 06 Jul 2025 05:47:30 +0000

Prompt Engineering: The Hidden Power of AI

With the exponential advancement of artificial intelligence, we live in a unique moment in tech history. Millions of people use these powerful tools for coding, creative writing, studying, data analysis, and much more. Yet many still fail to realize they’re squandering AI’s true potential for one simple reason: they don’t know how to communicate effectively with it.

This gap between humans and AI is where the “hidden power” of prompt engineering resides—a skill that can completely transform your AI experience and outcomes.

The Dangers of Poor Prompts

Have you ever wondered how many opportunities you miss with vague or poorly structured prompts? Poor prompts are like giving confusing instructions to an extremely capable assistant. Typing “help me with marketing” or “write some code” wastes AI’s potential and creates several problems:

Generic, Irrelevant Responses: Vague prompts such as “tell me something interesting” yield superficial, low-value information. AI can’t guess your specific needs, so it produces generic content that adds little real value.
Wasted Time and Frustration: If you don’t get the desired result on the first try, you must reformulate and retry, creating an unproductive cycle.
Hallucinated Answers: Ambiguous prompts greatly increase the chance that AI will fabricate plausible-sounding but false information—especially dangerous when you need accurate data for decision-making.
Underutilization of Capabilities: Without a proper structure, you’re tapping only a fraction of AI’s power—like owning a supercomputer but using it as a basic calculator.

The ASK Framework: Transforming Your AI Interactions

To solve these problems, we introduce the ASK framework, a proven methodology that will revolutionize how you interact with AI:

ASK

Define precisely what you want the AI to do.

Example:

“Generate a social media marketing plan for a small urban retail boutique targeting young adults.”

CONTEXT

Provide relevant background information to help the AI understand your situation.

Example:

“I own a streetwear shop in a college town with a monthly budget of $400.”

CONSTRAINTS

Specify clear limits on format, length, tone, and style.

Example:

“Respond in a bulleted list of 5 items, professional tone, maximum 150 words.”

EXAMPLE

Offer concrete examples of what you expect (few-shot prompting).

Example:

Classify the sentiment of these reviews:
Example 1: “Loved the fast delivery!” → Positive
Example 2: “Product was defective, terrible service.” → Negative
Example 3: “Item is okay, nothing special.” → Neutral
Now classify: “Exceeded my expectations!”
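
When the examples live in data rather than in your head, a prompt like this is easy to assemble in code. A minimal sketch, with the function name and the plain ASCII arrow as my own choices:

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          new_input: str) -> str:
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = [instruction]
    for i, (text, label) in enumerate(examples, start=1):
        lines.append(f'Example {i}: "{text}" -> {label}')
    lines.append(f'Now classify: "{new_input}"')
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of these reviews:",
    [
        ("Loved the fast delivery!", "Positive"),
        ("Product was defective, terrible service.", "Negative"),
        ("Item is okay, nothing special.", "Neutral"),
    ],
    "Exceeded my expectations!",
)
print(prompt)
```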

STYLE

Define the tone, persona, or writing style you want the AI to adopt.

Example:

“Act as a senior marketing consultant with 10 years of experience, using clear, engaging language.”

Advanced Prompting Techniques for Maximum Efficiency

Chain-of-Thought (CoT) Prompting

Ask AI to show its reasoning step by step for complex problems. This boosts accuracy.

Solve this problem showing each reasoning step:
“A company has 150 employees. 30% work in sales, 25% in production, and the rest in other departments. If the company grows by 20% next year, how many employees will each department have?”

Prompt Chaining

Break complex tasks into smaller, sequential steps to avoid overwhelming the AI:

Analyze the problem
Identify possible solutions
Evaluate pros and cons
Recommend the best solution
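
In code, the four steps above become a loop in which each prompt receives the previous step's output. This is a sketch, not a framework: `call_llm` is a hypothetical callable you supply (any function that takes a prompt string and returns the model's text), and the step wording is my own.

```python
from typing import Callable

STEPS = [
    "Analyze the problem:\n{input}",
    "Identify possible solutions for this analysis:\n{input}",
    "Evaluate the pros and cons of these solutions:\n{input}",
    "Recommend the best solution based on this evaluation:\n{input}",
]


def run_chain(problem: str, call_llm: Callable[[str], str]) -> list[str]:
    """Run each step in order, feeding every output into the next prompt."""
    outputs = []
    current = problem
    for template in STEPS:
        current = call_llm(template.format(input=current))
        outputs.append(current)
    return outputs


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call so the chain can run offline."""
    first_line = prompt.splitlines()[0]
    return f"[model output for: {first_line}]"


for step_output in run_chain("Customer churn rose 12% this quarter.", fake_llm):
    print(step_output)
```

Keeping every intermediate output also makes it easier to see which step went wrong when the final recommendation is off.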

Self-Consistency

Have AI generate multiple answers to the same prompt and then choose the most consistent one to improve reliability.
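
Once you have the sampled answers, picking "the most consistent one" is a majority vote. A minimal sketch, assuming the final answers have already been extracted as short strings and the list is not empty (the sample answers are invented):

```python
from collections import Counter


def most_consistent_answer(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and the share of samples that agree with it."""
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Five hypothetical samples of the same prompt
samples = ["36", "36 ", "30", "36", "45"]
answer, agreement = most_consistent_answer(samples)
print(answer, agreement)  # 36 0.6
```

The agreement ratio is worth logging: a low value is a signal that the question is ambiguous or that the model is guessing.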

Optimization Strategies

Iterative Refinement

Never settle for the first result. Prompt engineering is iterative—review the response, identify improvements, and adjust your prompt accordingly.

A/B Testing Prompts

Compare different versions of the same prompt to see which yields better results, especially for critical applications.

Temperature Control

Adjust AI creativity as needed:

Low temperature (0.1–0.3): Precise, consistent responses
High temperature (0.7–1.0): Creative, varied outputs
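
Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities. The sketch below shows that one step on made-up logits for three candidate tokens; real providers layer other sampling controls on top, so treat it as an illustration of the mechanism only.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling by 1 / temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.2 almost all of the probability lands on the top token, which is why low temperatures feel consistent. At 1.0 the alternatives keep a meaningful share, which is where the variety comes from.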

Avoiding Common Pitfalls

Excessive Ambiguity: Avoid words with multiple meanings—be specific.
Information Overload: Don’t include unnecessary details that confuse AI.
Unrealistic Expectations: Understand AI’s limitations; it’s not magic.
Lack of Context: Always provide relevant background information.

Practical Use Cases

Software Development

Act as a senior Python developer specializing in REST APIs.
Context: I need to build an e-commerce API.
Constraints: Use FastAPI, include JWT authentication, document with OpenAPI.
Example: Follow a structure similar to Mercado Livre.
Style: Clean code, comments in English, adhering to PEP 8.

Content Creation

Act as a social media copywriter.
Context: Women’s fashion boutique, audience 18–35, casual style.
Constraints: Instagram post, max 150 characters, include a call-to-action.
Example: “Found the perfect weekend look! 💕 #OOTD”
Style: Casual tone, use emojis, youthful language.

Conclusion: Mastering the Hidden Power

Prompt engineering isn’t just a technical skill—it’s an essential competency for thriving in the AI era. By mastering these techniques, you not only improve your results but also communicate more effectively with the technologies shaping the future.

Remember: the quality of an AI’s response is directly proportional to the quality of your prompt. Investing time to learn these methods is an investment in your professional future.

The hidden power of prompt engineering is in your hands. It’s not a question of if you’ll use it, but when you’ll start mastering it. The sooner you begin, the greater your competitive edge in a world increasingly integrated with AI.

Start today by applying the ASK framework in your next AI interactions. Test, iterate, refine. Your productivity and result quality will never be the same.

Virtual Learning Festival and Vouchers: An Unmissable Opportunity

Gabriel Henrique — Sun, 15 Jun 2025 23:38:33 +0000

What is the Virtual Learning Festival?

The Virtual Learning Festival is an online event celebrating the Data + AI Summit 2025, running from June 11 to July 2, 2025. It is designed to help participants:

Complete training,
Expand data and AI skills,
Prepare for Databricks certifications.

How does it work?

The Virtual Learning Festival offers free online sessions, workshops, and content, allowing you to participate at your own pace during the event period. It aligns with the in-person Data + AI Summit in San Francisco (June 9–12), complementing the experience with remote training.

Main objectives

Free training on data and AI topics,
Certification preparation (with materials and practice),
Ongoing engagement before and after the main in-person event.

Discount Vouchers: How Do They Work?

During the Virtual Learning Festival, participants have access to exclusive benefits:

50% discount voucher for Databricks certification (equivalent to $100 off).
20% discount coupon for Databricks Academy Labs.

How to get them?

Simply complete any course during the virtual festival (June 11 to July 2, 2025) to automatically receive the 50% certification discount voucher and the 20% Academy Labs coupon by email.

Quick Summary

Benefit	How to get it
50% off certification	Complete any course during the festival
20% off Academy Labs	Upon receiving the certification voucher

This dynamic has been confirmed in previous festivals and remains valid for the current event. If you are participating, just complete at least one course to secure your discounts.

Access The Virtual Learning Festival

Databricks News: Highlights from Data + AI Summit 2025

Gabriel Henrique — Sun, 15 Jun 2025 23:31:47 +0000

Databricks News and Vouchers: Highlights from Data + AI Summit 2025

On June 12, 2025, the Data + AI Summit, Databricks' flagship annual event, concluded in San Francisco, gathering over 20,000 data and AI professionals from around the world. The event introduced a series of announcements and innovations set to transform the data, artificial intelligence, and cloud collaboration ecosystem. Below, I share a summary of the main news unveiled at the event, with brief descriptions for easy understanding.

1. Databricks Lakeflow: Unified Data Engineering

Databricks Lakeflow was launched as a comprehensive solution for data ingestion, transformation, and orchestration, integrating managed connectors for enterprise applications, databases, and data warehouses. A highlight is Zerobus, an API enabling real-time event data ingestion with high throughput and low latency, making large-scale data usage for analytics and AI easier.

2. Unity Catalog: Intelligent Governance and Automation

Unity Catalog received new features to unify data and AI governance across different formats, clouds, and teams. Notable updates include:

Attribute-Based Access Control (ABAC): Enables flexible access policies using tags, now in beta for AWS, Azure, and GCP.
Tag Policies: Ensure consistency and security in data classification and usage across the platform, also in beta on major clouds.

3. Data Sharing and Collaboration

Improvements were announced to facilitate secure data sharing between organizations, including “clean rooms” that allow collaboration without compromising data privacy or security.

4. Full Support for Apache Iceberg™

Databricks now offers full support for Apache Iceberg™, expanding open-format data management possibilities and making integration with various tools and platforms easier.

5. Spark Declarative Pipelines

The platform introduced Spark Declarative Pipelines, an evolution for developing data pipelines in a declarative, scalable, and open way, boosting productivity and standardization for data engineering teams.

6. Databricks SQL and Free Edition

General availability of Databricks SQL was announced, along with a new free edition of the platform, democratizing access to advanced data analytics and intelligence resources for organizations of all sizes.

7. MLflow 3.0: AI Observability and Governance

MLflow 3.0 arrives with improvements for experimentation, observability, and governance of AI models, streamlining the complete machine learning project lifecycle within the Databricks ecosystem.

8. Mosaic AI and Agent Bricks

Mosaic AI introduced new features for developing intelligent agents, including Agent Bricks, which enables the creation of self-optimizing agents using proprietary company data, accelerating the practical adoption of generative AI and autonomous agents.

9. Lakebase: Public Preview

The Lakebase concept was presented in public preview, offering an innovative approach for managing transactional and analytical data in a single environment, simplifying operations and accelerating insights.

10. Power Platform Connector

The new Azure Databricks connector for Power Platform enables real-time, governed data access for Power Apps, Power Automate, and Copilot Studio, expanding integration possibilities between data platforms and productivity tools.

These innovations reinforce Databricks' commitment to leading in data and AI, offering increasingly integrated, secure, and accessible solutions for organizations across all sectors. Stay tuned, as these updates are sure to impact the market in the coming months.

ETL vs. ELT: A Comprehensive Analysis of Modern Data Integration Strategies

Gabriel Henrique — Sun, 01 Jun 2025 16:37:41 +0000

The evolution of data architectures has sparked a critical debate between two dominant approaches: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). This article examines their historical contexts, operational advantages, implementation challenges, and optimal use cases, providing actionable insights for organizations navigating modern data management.

Historical Context and Conceptual Foundations

ETL: The Legacy Framework

Developed in the 1990s, ETL emerged as a response to technological constraints, including expensive storage and limited computational resources. Its sequential process—extracting data from heterogeneous sources, transforming it into standardized formats, and loading it into centralized repositories—prioritized storage efficiency by discarding raw data post-transformation. This approach became foundational for legacy systems and regulated industries requiring strict governance.

ELT: The Cloud-Native Paradigm

The advent of scalable cloud infrastructure and cost-effective storage catalyzed ELT's rise. By loading raw data directly into data lakes or lakehouses and deferring transformations, ELT leverages modern tools like Apache Spark and Snowflake to enable flexible reprocessing and exploratory analytics. This shift aligns with the growing demand for real-time insights and unstructured data handling in AI/ML applications.

Comparative Analysis and Practical Applications

ETL Implementation Scenarios

Regulatory Compliance: Regulated industries such as healthcare (HIPAA) and finance (PCI DSS), along with any organization handling EU personal data under GDPR, benefit from ETL's pre-load data masking and retention policies.
Legacy System Integration: Organizations with on-premise infrastructure use ETL to bridge traditional databases with modern BI tools while preserving existing investments.
Structured Reporting: ETL simplifies dimensional modeling for OLAP cubes, ensuring consistency in traditional Business Intelligence workflows.

ELT Dominant Use Cases

Big Data & IoT: ELT efficiently handles high-velocity data streams from sensors and logs, enabling real-time analytics in platforms like Databricks Delta Lake.
Machine Learning Pipelines: Data scientists leverage ELT's raw data retention to rebuild feature stores and retrain models as fraud patterns or consumer behaviors evolve.
Medallion Architecture: Adopted by 68% of cloud-first enterprises, this structure organizes data into Bronze (raw), Silver (cleaned), and Gold (enriched) layers, reducing pipeline development time by 40%.

Architectural Patterns and Cost Considerations

Optimizing ETL Workflows

Orchestration Tools: Apache Airflow and Talend provide version-controlled pipelines with granular transformation rules.
Staging Zones: Intermediate validation areas prevent data corruption, addressing the 62% of ETL failures occurring during extraction.
Monitoring Systems: Checksums and schema validation ensure data integrity, particularly in cross-database migrations.

Cloud-Native ELT Strategies

Layer	Functionality	Tools
Bronze	Immutable raw data storage	AWS S3, Azure Data Lake
Silver	Schema validation & deduplication	Delta Lake, Snowflake
Gold	Query-optimized aggregates	BigQuery, Redshift

Serverless technologies like AWS Glue reduce operational costs by 40% through auto-scaling, while columnar formats (Parquet) improve storage efficiency.

Performance and Economic Trade-offs

Metric	ETL	ELT
Latency	2-4 hours (batch processing)	Minutes (real-time ingestion)
Storage Cost	$0.023/GB (processed data)	$0.036/GB (raw + processed)
Compute Flexibility	Limited (pre-defined transforms)	High (on-demand transformations)
Compliance	Ideal for PII handling	Requires additional governance

Studies show ELT reduces total cost of ownership (TCO) by 15-20% for petabyte-scale operations but remains less efficient than ETL in structured, low-variability environments.

Strategic Recommendations and Future Trends

Hybrid Adoption Framework

ETL for Core Systems: Apply to financial transactions and medical records requiring audit trails.
ELT for Innovation: Utilize for social media sentiment analysis and IoT telemetry projects.
Unified Governance: Tools like Collibra manage both paradigms under centralized access policies.

Migration Checklist

Phase 1: Inventory existing ETL pipelines and data dependencies
Phase 2: Pilot ELT with non-critical datasets (e.g., marketing analytics)
Phase 3: Upskill teams in distributed processing (Spark) and cloud security protocols

Conclusion: Aligning Strategy with Organizational Maturity

The ETL/ELT decision matrix below synthesizes key operational factors:

Criterion	ETL	ELT
Data Volume	<1 TB/day	>1 TB/day
Transformation Complexity	High (multi-stage logic)	Low (SQL-based transformations)
Infrastructure	On-premise/ Hybrid	Cloud-native
Team Skills	ETL Developers	Data Engineers + SQL Analysts
Regulatory Scope	High (PHI, PCI DSS)	Moderate (GDPR with add-ons)

As of 2025, 67% of enterprises with >1PB data leverage ELT, while ETL maintains 89% adoption in healthcare and banking. Emerging trends favor adaptive architectures combining ETL's governance with ELT's flexibility, particularly for AI-driven organizations needing both structured reporting and experimental sandboxes. By aligning technical choices with business objectives—rather than chasing industry trends—organizations can build resilient data ecosystems capable of evolving with technological and regulatory landscapes.

A2A and MCP: Revolutionary Protocols for Communication Between AI Agents and Their Impact on the Development Ecosystem

Gabriel Henrique — Sat, 24 May 2025 23:21:14 +0000

Microsoft recently announced support for the Agent2Agent (A2A) protocol in Azure AI Foundry and Copilot Studio, while Anthropic’s Model Context Protocol (MCP) continues to gain ground as a standard for tool integration. This post dives into both protocols, compares them, and offers actionable insights for developers based on technical analyses and industry trends.

Introduction: A New Era of AI Agent Collaboration

Interoperability among AI systems is critical—43% of global enterprises already use autonomous agents to automate processes (Gartner, 2025). A2A and MCP address two distinct challenges:

A2A: Communication and coordination between heterogeneous agents
MCP: Standardized integration between agents and external tools or data sources

A recent OpenAI study shows that systems combining both protocols achieve 87% higher efficiency on complex tasks compared to standalone solutions.

Agent2Agent (A2A): A Universal Language for AI Collaboration

Technical Principles

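At the protocol level, A2A is a request/response design: agents exchange JSON-RPC 2.0 messages over HTTP(S), with Server-Sent Events for streaming, and advertise their capabilities through Agent Cards. An Azure deployment built around it might combine these components: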

Message Broker (e.g., Azure Service Bus)
Agent Registry (global capability catalog)
Task Orchestrator (e.g., Azure Logic Apps)
Security Layer (Azure AD + confidential computing)

Example A2A payload:

{
  "sender": "copilot@microsoft.com",
  "task_id": "a2a-9fhd83-2025",
  "action": "schedule_meeting",
  "parameters": {
    "participants": [
      "agent1@google.com",
      "agent2@anthropic.com"
    ],
    "time_window": "2025-05-25T09:00/17:00"
  },
  "context": {
    "priority": "high",
    "deadline": "2025-05-24T23:59"
  }
}

Source: Azure AI Foundry Technical Docs

Real-World Use Cases

LG Electronics: 40% reduction in product development time by integrating design, supply chain, and QA agents via A2A
University Hospital Zurich: Coordinated 127 medical agents for personalized cancer treatment, achieving 35% higher diagnostic accuracy

Model Context Protocol (MCP): Bridging AI and the Real World

Architectural Overview

MCP defines a dynamic plugin system with:

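MCP Host: The LLM application that starts the connection (e.g., Claude Desktop or an IDE assistant)
MCP Client: Connector inside the host that maintains a one-to-one session with a server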
MCP Server: Tool or data provider (e.g., PostgreSQL, GitHub Actions)

Typical workflow:

A[AI Agent] --> B[MCP Client]
B --> C{MCP Server}
C --> D[(Database)]
C --> E[External API]
C --> F[Legacy System]

Performance Benchmarks

Operation	Without MCP	With MCP	Improvement
SQL Query	1200 ms	450 ms	62.5%
REST API Call	800 ms	300 ms	62.5%
PDF Processing	950 ms	210 ms	78%

Data: Anthropic Technical Report Q1/2025

Technical Comparison: A2A vs MCP

Feature	A2A	MCP
Primary Focus	Agent-to-agent collaboration	Agent-to-tool integration
Communication Model	Peer-to-peer	Client-server
Average Latency	150–300 ms	50–150 ms
Security	OAuth 2.1 + Confidential ML	TLS 1.3 + Hardware Keys
Ideal Use Case	Complex orchestration	Structured data access

Practical Example:

# Using MCP to fetch market data
current_price = mcp_client.query("stock_api", symbol="MSFT")

# Using A2A to coordinate risk calculation
a2a.send_task(
    recipient="risk_agent@bank.com",
    action="calculate_risk",
    params={"portfolio": current_portfolio}
)

Trends & Recommendations for Developers

Market Data (2025)

67% of enterprises plan to adopt A2A by 2026
82% of developers consider MCP critical for AI projects

Recommended Stack:

Azure A2A Orchestrator + Anthropic MCP Gateway
Python 3.12+ with asyncio for concurrency
Prometheus + Grafana for monitoring

Implementation Checklist

[ ] Define clear use cases for each protocol
[ ] Configure Azure Service Bus for A2A messaging
[ ] Deploy MCP gateways for critical systems
[ ] Unify security policies across protocols
[ ] Develop interoperability test suites

Conclusion: The Future Is Multi-Protocol

Combining A2A and MCP enables 360° AI systems that can:

Process 5.7× more data per cycle
Reduce errors by 68% in complex operations
Dynamically adapt to evolving requirements

For developers, mastering these protocols means:

Tripling development efficiency
Cutting integration costs by 40%
Enabling new business models in Web3 and the Metaverse

“Interoperability is no longer optional—it’s the currency of the AI ecosystem.”

— Satya Nadella, Microsoft CEO (May 2025)

Don’t get left behind: try A2A and MCP today, stay up to date and become a protagonist in this new chapter of artificial intelligence. The future starts now!🚀